Hive: getting the last day of the previous quarter - lazihuman's blog


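Today's topics:
1) Bucketed-table optimizations -- understand
2) Layered modeling -- hands-on
3) Statistical analysis for the full-load pipeline -- hands-on (try implementing it yourself)
Data collection, data cleaning and transformation, dimension degeneration, statistical analysis
4) Incremental pipeline: how to process a zipper table incrementally -- understand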

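1. Intended-customer topic dashboard: requirements
Requirement 1: total number of new intended customers within the reporting period (including intended customers that staff entered themselves).
Metric: number of intended customers
Dimensions:
Time dimension:
year, month, day, hour
New/returning-customer dimension
Online/offline dimension

Tables involved:
customer_relationship (intention table)
Fields involved:
create_date_time
Field used to count intended customers: customer_id (deduplicate first)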

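Requirement 2: heat map of new intended customers by city/region within a given time range
Metric: number of intended customers
Dimensions:
Time dimension: year, month, day, hour
New/returning-customer dimension
Online/offline dimension
Region dimension
Tables involved:
customer_relationship (intention table)
customer (customer table, i.e. the student table)
Fields involved:
Intention table: create_date_time

Customer table: area

Field used to count intended customers: customer_id (deduplicate first)
Join condition between the two tables:
customer_relationship.customer_id = customer.id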

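Requirement 3: ranking of intended subjects among new intended customers within a given time range. The subject name must be looked up through a join.
Metric: number of intended customers
Dimensions:
Time dimension: year, month, day, hour
New/returning-customer dimension
Online/offline dimension
Subject dimension
Tables involved:
customer_relationship (intention table),
itcast_subject (subject table),
customer_clue (clue table)

Fields involved:
Clue table:
clue_state: helps identify new vs. returning customers
deleted: indicates whether the record has been deleted
create_date_time
Intention table:
origin_type: helps determine whether the customer is online or offline
If the value is NETSERVICE or PRESIGNUP the customer is online; any other value means offline
Field used to count intended customers: customer_id (deduplicate first)
Subject table:
name
Join conditions:
customer_clue.customer_relationship_id = customer_relationship.id
itcast_subject.id = customer_relationship.itcast_subject_id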

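Requirement 4: ranking of intended campuses among new intended customers within a given time range
Metric: number of intended customers
Dimensions:
Time dimension: year, month, day, hour
New/returning-customer dimension
Online/offline dimension
Campus dimension

Campus id: during synchronization, 0 and NULL are normalized to a single value, -1

Tables involved:
customer_relationship (intention table),
customer_clue (clue table),
itcast_school (campus table)
Fields involved:
Clue table:
clue_state: helps identify new vs. returning customers
deleted: indicates whether the record has been deleted
create_date_time
Intention table:
origin_type: helps determine whether the customer is online or offline
If the value is NETSERVICE or PRESIGNUP the customer is online; any other value means offline
Field used to count intended customers: customer_id (deduplicate first)
Campus table:
name
Join conditions:
customer_relationship.itcast_school_id = itcast_school.id
customer_clue.customer_relationship_id = customer_relationship.id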

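Requirement 5: share of new intended customers by source channel within a given time range.
Metric: number of intended customers
Dimensions:
Time dimension: year, month, day, hour
New/returning-customer dimension
Online/offline dimension
Source-channel dimension

Tables involved:
customer_relationship (intention table),
customer_clue (clue table)
Fields involved:
Clue table:
clue_state: helps identify new vs. returning customers
deleted: indicates whether the record has been deleted
Intention table:
create_date_time
origin_type: helps determine online vs. offline; this field also identifies the source channel
If the value is NETSERVICE or PRESIGNUP the customer is online; any other value means offline
Field used to count intended customers: customer_id (deduplicate first)
Join condition:
customer_clue.customer_relationship_id = customer_relationship.id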

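Requirement 6: share of new intended customers generated by each consulting center within a given time range
Metric: number of intended customers
Dimensions:
Time dimension: year, month, day, hour
New/returning-customer dimension
Online/offline dimension
Consulting-center dimension

Tables involved:
customer_relationship (intention table),
employee (employee table),
scrm_department (department table),
customer_clue (clue table)
Fields involved:
Clue table:
clue_state: helps identify new vs. returning customers
Intention table:
create_date_time
origin_type: helps determine online vs. offline; this field also identifies the source channel
If the value is NETSERVICE or PRESIGNUP the customer is online; any other value means offline
Field used to count intended customers: customer_id (deduplicate first)
Employee table:
tdepart_id: department id
Department table:
name
Join conditions:
customer_clue.customer_relationship_id = customer_relationship.id
employee.tdepart_id = scrm_department.id
customer_relationship.creator = employee.id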

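Summary:
Metric: number of intended customers
Dimensions:
Time dimension: year, month, day, hour
New/returning-customer dimension
Online/offline dimension
Product-attribute dimensions:
region, source channel, subject, campus, consulting center

Tables involved: 7 tables
customer_relationship (intention table)
Fields: create_date_time, origin_type, customer_id
employee (employee table)
Fields: tdepart_id and id
scrm_department (department table)
Fields: name and id
customer_clue (clue table)
Fields: clue_state, deleted, create_date_time, customer_relationship_id
itcast_school (campus table)
Fields: name and id
itcast_subject (subject table)
Fields: name and id
customer (customer table)
Fields: area and id
Join conditions:
customer_clue.customer_relationship_id = customer_relationship.id
employee.tdepart_id = scrm_department.id
customer_relationship.creator = employee.id
customer_relationship.itcast_school_id = itcast_school.id
itcast_subject.id = customer_relationship.itcast_subject_id
customer_relationship.customer_id = customer.id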

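Intention topic dashboard case: importing the raw business data --- this step does not exist in real-world work
create database scrm default character set utf8mb4 collate utf8mb4_unicode_ci;

From the previously distributed 知行教育 analysis platform materials --> full raw data set --> scrm --> import the 7 tables into MySQL one by one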

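Intention topic dashboard case: modeling analysis
ODS layer:
Fact table: intention table
One extra table is placed here: the clue table (note: it is the fact table of a later topic dashboard, so it is kept in the ODS layer to simplify later processing)
Table type: managed (internal) table + bucketed table + partitioned table + zipper table
DIM layer: dimension layer
Employee table, campus table, subject table, customer table, department table
Table type: external table + partitioned table
For both layers: build the tables one-to-one from the source table structures, and remember to add a start_time (extraction time) column
Data format and compression: ORC + ZLIB (SNAPPY)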

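DW layer:
DWD: cleaning and transformation of the ODS tables; if a table has too many fields, extract only the relevant ones
Cleaning:
Remove rows that have been marked as deleted
Transformation:
Convert the values of origin_type into 0 and 1 in a new field that marks online vs. offline
Split create_date_time into year, month, day and hour
Campus id: during synchronization, 0 and NULL are normalized to a single value, -1
Fields involved:
Regular fields:
id, create_date_time, deleted, customer_id, origin_type, origin_type_stat,
itcast_school_id, itcast_subject_id, creator, hourinfo
Partition fields:
year (yearinfo), month (monthinfo), day (dayinfo)

DWM: pre-aggregation by dimension (cannot be done here) and dimension degeneration
Combine the six dimension tables with the DWD fact table into a single table, which achieves dimension degeneration
Idea: decide which fields are needed from each dimension table, then put all of them into one table
Fields involved:
Regular fields:
customer_id, create_date_time, clue_state_stat, origin_type_stat, area, origin_type,
itcast_school_id, school_name, itcast_subject_id, itcast_subject_name, department_id,
department_name, hourinfo
Partition fields:
year (yearinfo), month (monthinfo), day (dayinfo)

Producing this table's data requires a seven-table join across ODS + DIM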

DWS: there is only one metric, so there is only one table
customerid_total, clue_state_stat, origin_type_stat, area, origin_type,
itcast_school_id, school_name, itcast_subject_id, itcast_subject_name,
department_id, department_name, time_type, group_type, hourinfo, time_str

Partition fields:
year (yearinfo), month (monthinfo), day (dayinfo)
time_type: 1 (year), 2 (month), 3 (day), 4 (hour)
group_type: 1 region dimension, 2 source channel, 3 subject dimension, 4 campus dimension, 5 consulting centers, 6 total intentions

Sample results:
1000 0 0 year -1 -1 -1 -1
1000 0 1 year -1 -1 -1 -1
1000 1 0 year -1 -1 -1 -1
1000 1 1 year -1 -1 -1 -1
1000 0 0 year 11 -1 -1 -1
1000 0 1 year 11 -1 -1 -1
1000 1 0 year 11 -1 -1 -1
1000 1 1 year 11 -1 -1 -1
1000 0 0 year 11 01 -1 -1
1000 0 1 year 11 01 -1 -1
1000 1 0 year 11 01 -1 -1
1000 1 1 year 11 01 -1 -1
1000 0 0 year 11 -1 Shanxi -1
1000 0 1 year 11 -1 Shanxi -1
1000 1 0 year 11 -1 Shanxi -1
1000 1 1 year 11 -1 Shanxi -1
1000 0 0 year 11 01 -1 weixin
1000 0 1 year 11 01 -1 weixin
1000 1 0 year 11 01 -1 weixin
1000 1 1 year 11 01 -1 weixin

APP layer: not needed; the DWS layer has already completed the analysis for every dimension.

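2. Bucketed-table optimizations:
Bucketed table: bucketing splits one file into several files; how many depends on the configured number of buckets
How is the file splitting implemented underneath? It relies on MapReduce partitioning: the bucketing column is hashed and the hash is taken modulo the number of buckets
This scatters the data while guaranteeing that rows with the same value are sent to the same reducer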
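
The hash-and-modulo assignment described above can be sketched in a few lines of Python. This is only an illustrative model, not Hive code: `bucket_for` and `split_into_buckets` are hypothetical helpers, and the sketch assumes an integer bucketing column whose hash is the value itself.

```python
from collections import defaultdict

MAX_INT = 0x7FFFFFFF  # mask that keeps the hash non-negative


def bucket_for(key: int, num_buckets: int) -> int:
    """Return the 0-based bucket index for an integer key (illustrative model)."""
    # Assumption: the hash of an int column is the int itself.
    return (key & MAX_INT) % num_buckets


def split_into_buckets(keys, num_buckets):
    """Group keys by bucket index, mimicking one output file per bucket."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[bucket_for(key, num_buckets)].append(key)
    return dict(buckets)


if __name__ == "__main__":
    # Keys 1, 7 and 13 share the same remainder modulo 6, so they land together.
    print(split_into_buckets([1, 7, 13, 2, 8, 6], 6))
```

Equal keys always produce the same remainder, which is why rows with the same bucketing value end up in the same file.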

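Working with bucketed tables:
Creating the table:
create table test_buck(id int, name string)
clustered by(id) sorted by (id asc) into 6 buckets -- this is the key clause
row format delimited fields terminated by '\t';

Inserting data:
-- enable bucketing (needed on Hive 1.x; from Hive 2.0 on bucketing is always enforced and this property was removed)
set hive.enforce.bucketing=true;
insert into ...

Note: data cannot be loaded into a bucketed table with load data,
set hive.strict.checks.bucketing = true; forbids load data on bucketed tables (default: true)
How to get data into a bucketed table:
Create a staging table for the bucketed table (it must NOT be bucketed), load the data into the staging table with load data, then use insert into ... select
to copy the data from the staging table into the bucketed table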

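Uses:
Data sampling: once a file has been split into several files, a few of those files can be picked for processing; this is data sampling
Benefits of sampling:
1. During testing, when the data set is too large, sample it and run the analysis on the sample, which speeds up development
2. When analysing the whole data set is impractical, a sample can be analysed instead, and the result still reflects the whole data set
How to sample:
Syntax:
tablesample(bucket x out of y on column) as a

Place the sampling clause after the table name and before the alias
Clause explained: tablesample(bucket x out of y on column)
x: the bucket to start sampling from; x must be less than or equal to y
y: the sampling ratio; it must be a multiple or a factor of the table's bucket count
column: the column to bucket-sample on

Example: a table with 10 buckets, bucketed on id

tablesample(bucket 3 out of 5 on id):
Question: how many buckets are sampled? 10/5 = 2
Which two buckets?
The 3rd bucket and the 8th bucket (3 and 3 + 5)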
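
The bucket arithmetic in this example can be checked with a small Python model. `sampled_buckets` is a hypothetical helper written for this note; it only reproduces the x / y rules stated above and does not query Hive.

```python
from typing import List


def sampled_buckets(x: int, y: int, total_buckets: int) -> List[int]:
    """Return the 1-based bucket numbers touched by `bucket x out of y`."""
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    if y % total_buckets != 0 and total_buckets % y != 0:
        raise ValueError("y must be a multiple or a factor of the bucket count")
    if y >= total_buckets:
        # y is a multiple of the bucket count: only part of one bucket is read.
        return [(x - 1) % total_buckets + 1]
    # y is a factor: total_buckets / y buckets are read, starting at x, step y.
    return list(range(x, total_buckets + 1, y))


if __name__ == "__main__":
    print(sampled_buckets(3, 5, 10))  # the example from the text
```

With 10 buckets and `bucket 3 out of 5`, the helper returns buckets 3 and 8, matching the worked answer.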

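Improving multi-table join performance: the main tool is the map join
1. Map join: suited to joining a small table with a large table
Required settings:
set hive.auto.convert.join=true; -- must be on to enable the map-join optimization (default: true)
set hive.auto.convert.join.noconditionaltask.size=512000000; -- small-table threshold; the author gives the default as 20971520 (20 MB), while stock Apache Hive ships with 10000000

2. Medium-sized table joined with a large table: to get a map join, use Bucket-MapJoin
Required conditions:
1) The join columns of both tables must be their bucketing columns
2) The medium table's bucket count must not exceed the large table's, and the large table's bucket count must be an integer multiple of it
3) Enable bucket map join: set hive.optimize.bucketmapjoin = true;
4) Both tables must be bucketed tables

When all of the above hold, Hive uses Bucket-MapJoin automatically. Otherwise Hive checks whether a plain map join applies, and if that fails too it falls back to
the ordinary reduce-side join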
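
The fallback order just described can be written down as a small decision function. This is a simplified sketch of the rules in these notes, not Hive's optimizer: `TableInfo` and `choose_join` are hypothetical names, and the 20971520-byte threshold is the default quoted above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableInfo:
    size_bytes: int
    bucket_count: Optional[int]   # None when the table is not bucketed
    bucket_column: Optional[str]  # None when the table is not bucketed


def choose_join(smaller: TableInfo, larger: TableInfo, join_column: str,
                small_table_threshold: int = 20971520) -> str:
    """Pick a join strategy following the order given in the notes."""
    bucketed_on_key = (
        smaller.bucket_count is not None
        and larger.bucket_count is not None
        and smaller.bucket_column == join_column
        and larger.bucket_column == join_column
    )
    # Bucket map join: both sides bucketed on the join key, counts compatible.
    if bucketed_on_key and larger.bucket_count % smaller.bucket_count == 0:
        return "bucket map join"
    # Plain map join: the smaller side fits under the size threshold.
    if smaller.size_bytes <= small_table_threshold:
        return "map join"
    # Otherwise fall back to the ordinary reduce-side join.
    return "reduce join"


if __name__ == "__main__":
    medium = TableInfo(500_000_000, 5, "id")
    large = TableInfo(50_000_000_000, 10, "id")
    print(choose_join(medium, large, "id"))
```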

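3. Large table joined with a large table: to get a map join, use SMB (Sort-Merge-Bucket) Join
It builds on Bucket-MapJoin, so the Bucket-MapJoin conditions must be met first
Required conditions:
1) The join columns of both tables must be their bucketing columns, and both tables must be sorted on them
2) Both tables must have the same number of buckets
3) Enable bucket map join: set hive.optimize.bucketmapjoin = true;
4) Both tables must be bucketed tables: set hive.enforce.bucketing=true;
5) Settings required to enable SMB join:
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin.sortedmerge = true; -- enable SMB join
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
set hive.enforce.sorting=true;

Creating the table:
create table test_smb_2(mid string,age_id string)
CLUSTERED BY(mid) SORTED BY(mid) INTO 500 BUCKETS;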
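--3. Intended-customer topic dashboard: building the layered model
Preparation: enable compression on write
set hive.exec.orc.compression.strategy=COMPRESSION;
--3.1: Create the ODS-layer tables: 2 tables (intention table and clue table)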
CREATE TABLE IF NOT EXISTS itcast_ods.`customer_relationship` (
`id` int COMMENT 'customer relationship id',
`create_date_time` STRING COMMENT 'creation time',
`update_date_time` STRING COMMENT 'last update time',
`deleted` int COMMENT 'whether deleted (disabled)',
`customer_id` int COMMENT 'owning customer id',
`first_id` int COMMENT 'first customer relationship id',
`belonger` int COMMENT 'owner',
`belonger_name` STRING COMMENT 'owner name',
`initial_belonger` int COMMENT 'initial owner',
`distribution_handler` int COMMENT 'assignment handler',
`business_scrm_department_id` int COMMENT 'owning department',
`last_visit_time` STRING COMMENT 'last follow-up time',
`next_visit_time` STRING COMMENT 'next follow-up time',
`origin_type` STRING COMMENT 'data source',
`itcast_school_id` int COMMENT 'campus id',
`itcast_subject_id` int COMMENT 'subject id',
`intention_study_type` STRING COMMENT 'intended study mode',
`anticipat_signup_date` STRING COMMENT 'expected sign-up date',
`level` STRING COMMENT 'customer level',
`creator` int COMMENT 'creator',
`current_creator` int COMMENT 'current creator: initially the creator, after a reclaim from the public pool the person who reclaimed it',
`creator_name` STRING COMMENT 'creator name',
`origin_channel` STRING COMMENT 'source channel',
`comment` STRING COMMENT 'remarks',
`first_customer_clue_id` int COMMENT 'first clue id',
`last_customer_clue_id` int COMMENT 'last clue id',
`process_state` STRING COMMENT 'processing state',
`process_time` STRING COMMENT 'processing state change time',
`payment_state` STRING COMMENT 'payment state',
`payment_time` STRING COMMENT 'payment state change time',
`signup_state` STRING COMMENT 'sign-up state',
`signup_time` STRING COMMENT 'sign-up time',
`notice_state` STRING COMMENT 'notification state',
`notice_time` STRING COMMENT 'notification state change time',
`lock_state` STRING COMMENT 'lock state',
`lock_time` STRING COMMENT 'lock state change time',
`itcast_clazz_id` int COMMENT 'ems class id',
`itcast_clazz_time` STRING COMMENT 'class enrollment time',
`payment_url` STRING COMMENT 'payment link',
`payment_url_time` STRING COMMENT 'payment link generation time',
`ems_student_id` int COMMENT 'ems student id',
`delete_reason` STRING COMMENT 'deletion reason',
`deleter` int COMMENT 'deleted by',
`deleter_name` STRING COMMENT 'name of the person who deleted it',
`delete_time` STRING COMMENT 'deletion time',
`course_id` int COMMENT 'course id',
`course_name` STRING COMMENT 'course name',
`delete_comment` STRING COMMENT 'deletion reason details',
`close_state` STRING COMMENT 'close state',
`close_time` STRING COMMENT 'close state change time',
`appeal_id` int COMMENT 'appeal id',
`tenant` int COMMENT 'tenant',
`total_fee` DECIMAL COMMENT 'total sign-up fee',
`belonged` int COMMENT 'short-cycle owner',
`belonged_time` STRING COMMENT 'ownership time',
`belonger_time` STRING COMMENT 'ownership time',
`transfer` int COMMENT 'transferred by',
`transfer_time` STRING COMMENT 'transfer time',
`follow_type` int COMMENT 'assignment type: 0 automatic assignment, 1 manual assignment, 2 automatic transfer, 3 manual single transfer, 4 manual batch transfer, 5 claimed from the public pool',
`transfer_bxg_oa_account` STRING COMMENT 'OA account of the owner after transfer to 博學谷',
`transfer_bxg_belonger_name` STRING COMMENT 'OA name of the owner after transfer to 博學谷',
`end_time` STRING COMMENT 'validity end time')
comment 'customer relationship table'
PARTITIONED BY(start_time STRING)
clustered by(id) sorted by(id) into 10 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');

CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue (
id int COMMENT 'customer_clue_id',
create_date_time STRING COMMENT 'creation time',
update_date_time STRING COMMENT 'last update time',
deleted STRING COMMENT 'whether deleted (disabled)',
customer_id int COMMENT 'customer id',
customer_relationship_id int COMMENT 'customer relationship id',
session_id STRING COMMENT '七陌 session id',
sid STRING COMMENT 'visitor id',
status STRING COMMENT 'status (undeal: awaiting pickup, deal: picked up, finish: closed, changePeer: transferred)',
users STRING COMMENT 'assigned agent',
create_time STRING COMMENT '七陌 creation time',
platform STRING COMMENT 'source platform (pc: website consultation | wap: wap consultation | sdk: app consultation | weixin: WeChat consultation)',
s_name STRING COMMENT 'user name',
seo_source STRING COMMENT 'search source',
seo_keywords STRING COMMENT 'keywords',
ip STRING COMMENT 'IP address',
referrer STRING COMMENT 'referring page',
from_url STRING COMMENT 'session source page',
landing_page_url STRING COMMENT 'visitor landing page',
url_title STRING COMMENT 'consultation page title',
to_peer STRING COMMENT 'assigned skill group',
manual_time STRING COMMENT 'manual service start time',
begin_time STRING COMMENT 'agent pickup time',
reply_msg_count int COMMENT 'number of agent replies',
total_msg_count int COMMENT 'total number of messages',
msg_count int COMMENT 'number of messages sent by the customer',
comment STRING COMMENT 'remarks',
finish_reason STRING COMMENT 'end type',
finish_user STRING COMMENT 'agent who ended the session',
end_time STRING COMMENT 'session end time',
platform_description STRING COMMENT 'customer platform information',
browser_name STRING COMMENT 'browser name',
os_info STRING COMMENT 'operating system name',
area STRING COMMENT 'region',
country STRING COMMENT 'country',
province STRING COMMENT 'province',
city STRING COMMENT 'city',
creator int COMMENT 'creator',
name STRING COMMENT 'customer name',
idcard STRING COMMENT 'ID card number',
phone STRING COMMENT 'mobile number',
itcast_school_id int COMMENT 'campus id',
itcast_school STRING COMMENT 'campus',
itcast_subject_id int COMMENT 'subject id',
itcast_subject STRING COMMENT 'subject',
wechat STRING COMMENT 'WeChat',
qq STRING COMMENT 'QQ number',
email STRING COMMENT 'email',
gender STRING COMMENT 'gender',
level STRING COMMENT 'customer level',
origin_type STRING COMMENT 'data source channel',
information_way STRING COMMENT 'consultation method',
working_years STRING COMMENT 'year started working',
technical_directions STRING COMMENT 'technical direction',
customer_state STRING COMMENT 'current customer state',
valid STRING COMMENT 'whether this clue is a valid online-resource clue',
anticipat_signup_date STRING COMMENT 'expected sign-up date',
clue_state STRING COMMENT 'clue state',
scrm_department_id int COMMENT 'SCRM internal department id',
superior_url STRING COMMENT 'referring page URL obtained from 諸葛',
superior_source STRING COMMENT 'referring page URL title obtained from 諸葛',
landing_url STRING COMMENT 'landing page URL obtained from 諸葛',
landing_source STRING COMMENT 'landing page URL source obtained from 諸葛',
info_url STRING COMMENT 'lead-form page URL obtained from 諸葛',
info_source STRING COMMENT 'lead-form page URL title obtained from 諸葛',
origin_channel STRING COMMENT 'advertising channel',
course_id int COMMENT 'course id',
course_name STRING COMMENT 'course name',
zhuge_session_id STRING COMMENT 'zhuge session id',
is_repeat int COMMENT 'whether a duplicate clue (by mobile number): 0 normal, 1 duplicate',
tenant int COMMENT 'tenant id',
activity_id STRING COMMENT 'activity id',
activity_name STRING COMMENT 'activity name',
follow_type int COMMENT 'assignment type: 0 automatic assignment, 1 manual assignment, 2 automatic transfer, 3 manual single transfer, 4 manual batch transfer, 5 claimed from the public pool',
shunt_mode_id int COMMENT 'matched skill group id',
shunt_employee_group_id int COMMENT 'assigned routing employee group',
ends_time STRING COMMENT 'validity end time')
comment 'customer clue table'
PARTITIONED BY(starts_time STRING)
clustered by(customer_relationship_id) sorted by(customer_relationship_id) into 10 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');

--3.2: Create the DIM-layer tables: 5 tables
CREATE DATABASE IF NOT EXISTS itcast_dimen;
CREATE TABLE IF NOT EXISTS itcast_dimen.`customer` (
`id` int COMMENT 'key id',
`customer_relationship_id` int COMMENT 'current intention id',
`create_date_time` STRING COMMENT 'creation time',
`update_date_time` STRING COMMENT 'last update time',
`deleted` int COMMENT 'whether deleted (disabled)',
`name` STRING COMMENT 'name',
`idcard` STRING COMMENT 'ID card number',
`birth_year` int COMMENT 'year of birth',
`gender` STRING COMMENT 'gender',
`phone` STRING COMMENT 'mobile number',
`wechat` STRING COMMENT 'WeChat',
`qq` STRING COMMENT 'QQ number',
`email` STRING COMMENT 'email',
`area` STRING COMMENT 'region',
`leave_school_date` date COMMENT 'date of leaving school',
`graduation_date` date COMMENT 'graduation date',
`bxg_student_id` STRING COMMENT '博學谷 student id, may be absent when not linked',
`creator` int COMMENT 'creator id',
`origin_type` STRING COMMENT 'data source',
`origin_channel` STRING COMMENT 'source channel',
`tenant` int,
`md_id` int COMMENT 'middle-platform id')
comment 'customer table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dimen.employee (
id int COMMENT 'employee id',
email STRING COMMENT 'company email, OA login account',
real_name STRING COMMENT 'employee real name',
phone STRING COMMENT 'mobile number, not used yet, the OA API does not expose it for privacy reasons',
department_id STRING COMMENT 'department number in OA, may be negative',
department_name STRING COMMENT 'department name in OA',
remote_login STRING COMMENT 'whether the employee may log in remotely',
job_number STRING COMMENT 'employee number',
cross_school STRING COMMENT 'whether the employee has cross-campus permission',
last_login_date STRING COMMENT 'last login date',
creator int COMMENT 'creator',
create_date_time STRING COMMENT 'creation time',
update_date_time STRING COMMENT 'last update time',
deleted STRING COMMENT 'whether deleted (disabled)',
scrm_department_id int COMMENT 'SCRM internal department id',
leave_office STRING COMMENT 'resignation status',
leave_office_time STRING COMMENT 'resignation time',
reinstated_time STRING COMMENT 'reinstatement time',
superior_leaders_id int COMMENT 'superior leader id',
tdepart_id int COMMENT 'direct department',
tenant int COMMENT 'tenant',
ems_user_name STRING COMMENT 'ems user name'
)
comment 'employee table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dimen.`scrm_department` (
`id` int COMMENT 'department id',
`name` STRING COMMENT 'department name',
`parent_id` int COMMENT 'parent department id',
`create_date_time` STRING COMMENT 'creation time',
`update_date_time` STRING COMMENT 'update time',
`deleted` STRING COMMENT 'deletion flag',
`id_path` STRING COMMENT 'full id path',
`tdepart_code` int COMMENT 'direct department',
`creator` STRING COMMENT 'creator',
`depart_level` int COMMENT 'department level',
`depart_sign` int COMMENT 'department flag, currently defaults to 1',
`depart_line` int COMMENT 'business line, stores the business line code',
`depart_sort` int COMMENT 'sort field',
`disable_flag` int COMMENT 'disabled flag',
`tenant` int COMMENT 'tenant')
comment 'scrm department table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dimen.`itcast_school` (
`id` int COMMENT 'auto-increment primary key',
`create_date_time` timestamp COMMENT 'creation time',
`update_date_time` timestamp COMMENT 'last update time',
`deleted` STRING COMMENT 'whether deleted (disabled)',
`name` STRING COMMENT 'campus name',
`code` STRING COMMENT 'campus code',
`tenant` int COMMENT 'tenant')
comment 'campus dictionary table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dimen.`itcast_subject` (
`id` int COMMENT 'auto-increment primary key',
`create_date_time` timestamp COMMENT 'creation time',
`update_date_time` timestamp COMMENT 'last update time',
`deleted` STRING COMMENT 'whether deleted (disabled)',
`name` STRING COMMENT 'subject name',
`code` STRING COMMENT 'subject code',
`tenant` int COMMENT 'tenant')
comment 'subject dictionary table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

--3.3 Build the DWD layer -- demonstrates the join optimization
CREATE TABLE IF NOT EXISTS itcast_dwd.`itcast_intention_dwd` (
`rid` int COMMENT 'id',
`customer_id` STRING COMMENT 'customer id',
`create_date_time` STRING COMMENT 'creation time',
`itcast_school_id` STRING COMMENT 'campus id',
`deleted` STRING COMMENT 'whether deleted',
`origin_type` STRING COMMENT 'source channel',
`itcast_subject_id` STRING COMMENT 'subject id',
`creator` int COMMENT 'creator',
`hourinfo` STRING COMMENT 'hour',
`origin_type_stat` STRING COMMENT 'data source: 0 offline, 1 online'
)
comment 'customer intention dwd table'
PARTITIONED BY(yearinfo STRING,monthinfo STRING,dayinfo STRING)
clustered by(rid) sorted by(rid) into 10 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- 3.4: Build the DWM layer
create database itcast_dwm;
CREATE TABLE IF NOT EXISTS itcast_dwm.`itcast_intention_dwm` (
`customer_id` STRING COMMENT 'customer id',
`create_date_time` STRING COMMENT 'creation time',
`area` STRING COMMENT 'region',
`itcast_school_id` STRING COMMENT 'campus id',
`itcast_school_name` STRING COMMENT 'campus name',
`deleted` STRING COMMENT 'whether deleted',
`origin_type` STRING COMMENT 'source channel',
`itcast_subject_id` STRING COMMENT 'subject id',
`itcast_subject_name` STRING COMMENT 'subject name',
`hourinfo` STRING COMMENT 'hour',
`origin_type_stat` STRING COMMENT 'data source: 0 offline, 1 online',
`clue_state_stat` STRING COMMENT 'new or returning customer: 0 returning, 1 new',
`tdepart_id` STRING COMMENT 'creator department id',
`tdepart_name` STRING COMMENT 'consulting center name'
)
comment 'customer intention dwm table'
PARTITIONED BY(yearinfo STRING,monthinfo STRING,dayinfo STRING)
clustered by(customer_id) sorted by(customer_id) into 10 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- 3.5 Build the DWS layer
CREATE TABLE IF NOT EXISTS itcast_dws.itcast_intention_dws (
`customer_total` INT COMMENT 'aggregated number of intended customers',
`area` STRING COMMENT 'region',
`itcast_school_id` STRING COMMENT 'campus id',
`itcast_school_name` STRING COMMENT 'campus name',
`origin_type` STRING COMMENT 'source channel',
`itcast_subject_id` STRING COMMENT 'subject id',
`itcast_subject_name` STRING COMMENT 'subject name',
`hourinfo` STRING COMMENT 'hour',
`origin_type_stat` STRING COMMENT 'data source: 0 offline, 1 online',
`clue_state_stat` STRING COMMENT 'customer type: 0 returning, 1 new',
`tdepart_id` STRING COMMENT 'creator department id',
`tdepart_name` STRING COMMENT 'consulting center name',
`time_str` STRING COMMENT 'time detail',
`groupType` STRING COMMENT 'product attribute category: 1 total intentions, 2 region, 3 campus and subject combined, 4 source channel, 5 consulting center',
`time_type` STRING COMMENT 'time dimension: 1 by hour, 2 by day, 3 by week, 4 by month, 5 by year'
)
comment 'customer intention dws table'
PARTITIONED BY(yearinfo STRING,monthinfo STRING,dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

4. Intention topic dashboard case: data collection
4.1: Collect the data for the DIM layer:
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id, customer_relationship_id, create_date_time, update_date_time, deleted, name, idcard, birth_year, gender, phone, wechat, qq, email, area, leave_school_date, graduation_date, bxg_student_id, creator, origin_type, origin_channel, tenant, md_id, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from customer where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table customer \
-m 1 \
--split-by id

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id,email,real_name,-1 as phone,department_id,department_name,remote_login,job_number,cross_school,last_login_date,creator,create_date_time,update_date_time,deleted,scrm_department_id,leave_office,leave_office_time,reinstated_time,superior_leaders_id,tdepart_id,tenant,ems_user_name,FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as start_time from employee where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table employee \
-m 1 \
--split-by id

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select *, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as start_time from scrm_department where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table scrm_department \
-m 1 \
--split-by id

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select *, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as start_time from itcast_school where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table itcast_school \
-m 1 \
--split-by id

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select *, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as start_time from itcast_subject where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table itcast_subject \
-m 1 \
--split-by id

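4.2: Collect the data for the ODS layer
The two ODS tables are bucketed, and sqoop cannot import into bucketed tables. The workaround: create a staging table for each bucketed table, import the data into the staging table with sqoop,
then copy it from the staging table into the bucketed table with insert into

4.2.1: Importing the intention table
Step 1: create the staging table for the intention table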
CREATE TABLE IF NOT EXISTS itcast_ods.`customer_relationship_tmp` (
`id` int COMMENT 'customer relationship id',
`create_date_time` STRING COMMENT 'creation time',
`update_date_time` STRING COMMENT 'last update time',
`deleted` int COMMENT 'whether deleted (disabled)',
`customer_id` int COMMENT 'owning customer id',
`first_id` int COMMENT 'first customer relationship id',
`belonger` int COMMENT 'owner',
`belonger_name` STRING COMMENT 'owner name',
`initial_belonger` int COMMENT 'initial owner',
`distribution_handler` int COMMENT 'assignment handler',
`business_scrm_department_id` int COMMENT 'owning department',
`last_visit_time` STRING COMMENT 'last follow-up time',
`next_visit_time` STRING COMMENT 'next follow-up time',
`origin_type` STRING COMMENT 'data source',
`itcast_school_id` int COMMENT 'campus id',
`itcast_subject_id` int COMMENT 'subject id',
`intention_study_type` STRING COMMENT 'intended study mode',
`anticipat_signup_date` STRING COMMENT 'expected sign-up date',
`level` STRING COMMENT 'customer level',
`creator` int COMMENT 'creator',
`current_creator` int COMMENT 'current creator: initially the creator, after a reclaim from the public pool the person who reclaimed it',
`creator_name` STRING COMMENT 'creator name',
`origin_channel` STRING COMMENT 'source channel',
`comment` STRING COMMENT 'remarks',
`first_customer_clue_id` int COMMENT 'first clue id',
`last_customer_clue_id` int COMMENT 'last clue id',
`process_state` STRING COMMENT 'processing state',
`process_time` STRING COMMENT 'processing state change time',
`payment_state` STRING COMMENT 'payment state',
`payment_time` STRING COMMENT 'payment state change time',
`signup_state` STRING COMMENT 'sign-up state',
`signup_time` STRING COMMENT 'sign-up time',
`notice_state` STRING COMMENT 'notification state',
`notice_time` STRING COMMENT 'notification state change time',
`lock_state` STRING COMMENT 'lock state',
`lock_time` STRING COMMENT 'lock state change time',
`itcast_clazz_id` int COMMENT 'ems class id',
`itcast_clazz_time` STRING COMMENT 'class enrollment time',
`payment_url` STRING COMMENT 'payment link',
`payment_url_time` STRING COMMENT 'payment link generation time',
`ems_student_id` int COMMENT 'ems student id',
`delete_reason` STRING COMMENT 'deletion reason',
`deleter` int COMMENT 'deleted by',
`deleter_name` STRING COMMENT 'name of the person who deleted it',
`delete_time` STRING COMMENT 'deletion time',
`course_id` int COMMENT 'course id',
`course_name` STRING COMMENT 'course name',
`delete_comment` STRING COMMENT 'deletion reason details',
`close_state` STRING COMMENT 'close state',
`close_time` STRING COMMENT 'close state change time',
`appeal_id` int COMMENT 'appeal id',
`tenant` int COMMENT 'tenant',
`total_fee` DECIMAL COMMENT 'total sign-up fee',
`belonged` int COMMENT 'short-cycle owner',
`belonged_time` STRING COMMENT 'ownership time',
`belonger_time` STRING COMMENT 'ownership time',
`transfer` int COMMENT 'transferred by',
`transfer_time` STRING COMMENT 'transfer time',
`follow_type` int COMMENT 'assignment type: 0 automatic assignment, 1 manual assignment, 2 automatic transfer, 3 manual single transfer, 4 manual batch transfer, 5 claimed from the public pool',
`transfer_bxg_oa_account` STRING COMMENT 'OA account of the owner after transfer to 博學谷',
`transfer_bxg_belonger_name` STRING COMMENT 'OA name of the owner after transfer to 博學谷',
`end_time` STRING COMMENT 'validity end time')
comment 'customer relationship table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');

Step 2: import the data into the staging table with sqoop:
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id, create_date_time, update_date_time, deleted, customer_id, first_id, belonger, belonger_name, initial_belonger, distribution_handler, business_scrm_department_id, last_visit_time, next_visit_time, origin_type, itcast_school_id, itcast_subject_id, intention_study_type, anticipat_signup_date, level, creator, current_creator, creator_name, origin_channel, comment, first_customer_clue_id, last_customer_clue_id, process_state, process_time, payment_state, payment_time, signup_state, signup_time, notice_state, notice_time, lock_state, lock_time, itcast_clazz_id, itcast_clazz_time, payment_url, payment_url_time, ems_student_id, delete_reason, deleter, deleter_name, delete_time, course_id, course_name, delete_comment, close_state, close_time, appeal_id, tenant, total_fee, belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type, transfer_bxg_oa_account, transfer_bxg_belonger_name,date_format("9999-12-31","%Y-%m-%d") as end_time, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as start_time from customer_relationship where $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_relationship_tmp \
-m 1 \
--split-by id

--Step 3: load the staging-table data into the bucketed ODS intention table:
--partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--make compression take effect on write
set hive.exec.orc.compression.strategy=COMPRESSION;
--bucketing
set hive.optimize.bucketmapjoin = true;
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

insert into table itcast_ods.customer_relationship partition(start_time)
select * from itcast_ods.customer_relationship_tmp;

4.2.2: Importing the clue table into the ODS-layer table
Step 1: create the staging table for the clue table:
CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue_tmp (
id int COMMENT 'customer_clue_id',
create_date_time STRING COMMENT 'creation time',
update_date_time STRING COMMENT 'last update time',
deleted STRING COMMENT 'whether deleted (disabled)',
customer_id int COMMENT 'customer id',
customer_relationship_id int COMMENT 'customer relationship id',
session_id STRING COMMENT '七陌 session id',
sid STRING COMMENT 'visitor id',
status STRING COMMENT 'status (undeal: awaiting pickup, deal: picked up, finish: closed, changePeer: transferred)',
users STRING COMMENT 'assigned agent',
create_time STRING COMMENT '七陌 creation time',
platform STRING COMMENT 'source platform (pc: website consultation | wap: wap consultation | sdk: app consultation | weixin: WeChat consultation)',
s_name STRING COMMENT 'user name',
seo_source STRING COMMENT 'search source',
seo_keywords STRING COMMENT 'keywords',
ip STRING COMMENT 'IP address',
referrer STRING COMMENT 'referring page',
from_url STRING COMMENT 'session source page',
landing_page_url STRING COMMENT 'visitor landing page',
url_title STRING COMMENT 'consultation page title',
to_peer STRING COMMENT 'assigned skill group',
manual_time STRING COMMENT 'manual service start time',
begin_time STRING COMMENT 'agent pickup time',
reply_msg_count int COMMENT 'number of agent replies',
total_msg_count int COMMENT 'total number of messages',
msg_count int COMMENT 'number of messages sent by the customer',
comment STRING COMMENT 'remarks',
finish_reason STRING COMMENT 'end type',
finish_user STRING COMMENT 'agent who ended the session',
end_time STRING COMMENT 'session end time',
platform_description STRING COMMENT 'customer platform information',
browser_name STRING COMMENT 'browser name',
os_info STRING COMMENT 'operating system name',
area STRING COMMENT 'region',
country STRING COMMENT 'country',
province STRING COMMENT 'province',
city STRING COMMENT 'city',
creator int COMMENT 'creator',
name STRING COMMENT 'customer name',
idcard STRING COMMENT 'ID card number',
phone STRING COMMENT 'mobile number',
itcast_school_id int COMMENT 'campus id',
itcast_school STRING COMMENT 'campus',
itcast_subject_id int COMMENT 'subject id',
itcast_subject STRING COMMENT 'subject',
wechat STRING COMMENT 'WeChat',
qq STRING COMMENT 'QQ number',
email STRING COMMENT 'email',
gender STRING COMMENT 'gender',
level STRING COMMENT 'customer level',
origin_type STRING COMMENT 'data source channel',
information_way STRING COMMENT 'consultation method',
working_years STRING COMMENT 'year started working',
technical_directions STRING COMMENT 'technical direction',
customer_state STRING COMMENT 'current customer state',
valid STRING COMMENT 'whether this clue is a valid online-resource clue',
anticipat_signup_date STRING COMMENT 'expected sign-up date',
clue_state STRING COMMENT 'clue state',
scrm_department_id int COMMENT 'SCRM internal department id',
superior_url STRING COMMENT 'referring page URL obtained from 諸葛',
superior_source STRING COMMENT 'referring page URL title obtained from 諸葛',
landing_url STRING COMMENT 'landing page URL obtained from 諸葛',
landing_source STRING COMMENT 'landing page URL source obtained from 諸葛',
info_url STRING COMMENT 'lead-form page URL obtained from 諸葛',
info_source STRING COMMENT 'lead-form page URL title obtained from 諸葛',
origin_channel STRING COMMENT 'advertising channel',
course_id int COMMENT 'course id',
course_name STRING COMMENT 'course name',
zhuge_session_id STRING COMMENT 'zhuge session id',
is_repeat int COMMENT 'whether a duplicate clue (by mobile number): 0 normal, 1 duplicate',
tenant int COMMENT 'tenant id',
activity_id STRING COMMENT 'activity id',
activity_name STRING COMMENT 'activity name',
follow_type int COMMENT 'assignment type: 0 automatic assignment, 1 manual assignment, 2 automatic transfer, 3 manual single transfer, 4 manual batch transfer, 5 claimed from the public pool',
shunt_mode_id int COMMENT 'matched skill group id',
shunt_employee_group_id int COMMENT 'assigned routing employee group',
ends_time STRING COMMENT 'validity end time')
comment 'customer clue table'
PARTITIONED BY(starts_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');

Step 2: import the data into the clue staging table with sqoop

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id,create_date_time,update_date_time,deleted,customer_id,customer_relationship_id,session_id,sid,status,user as users,create_time,platform,s_name,seo_source,seo_keywords,ip,referrer,from_url,landing_page_url,url_title,to_peer,manual_time,begin_time,reply_msg_count,total_msg_count,msg_count,comment,finish_reason,finish_user,end_time,platform_description,browser_name,os_info,area,country,province,city,creator,name,"-1" as idcard,"-1" as phone,itcast_school_id,itcast_school,itcast_subject_id,itcast_subject,"-1" as wechat,"-1" as qq,"-1" as email,gender,level,origin_type,information_way,working_years,technical_directions,customer_state,valid,anticipat_signup_date,clue_state,scrm_department_id,superior_url,superior_source,landing_url,landing_source,info_url,info_source,origin_channel,course_id,course_name,zhuge_session_id,is_repeat,tenant,activity_id,activity_name,follow_type,shunt_mode_id,shunt_employee_group_id,date_format("9999-12-31","%Y-%m-%d") as ends_time,FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as starts_time from customer_clue where $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_clue_tmp \
-m 1 \
--split-by id

Step 3: load the staging-table data into the clue table:

insert into table itcast_ods.customer_clue partition(starts_time)
select * from itcast_ods.customer_clue_tmp;

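4.3: Data cleaning and transformation: ODS intention table --> cleaned DWD intention table
Which cleaning and transformation steps are needed?
Cleaning:
Remove the rows marked deleted = 1
Transformation:
Derive year, month, day and hour from the create_date_time field
Generate a new field origin_type_stat from origin_type to distinguish online from offline
Campus id and subject id: during synchronization, 0 and NULL are normalized to a single value, -1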
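
The rules above can be modelled per row in Python before reading the HiveQL. This is an illustrative sketch only: `clean_row` and `normalize_id` are hypothetical helpers, the input is assumed to be a dict with the ODS column names, and `create_date_time` is assumed to have the form 'YYYY-MM-DD HH:MM:SS'.

```python
from typing import Optional

ONLINE_ORIGINS = {"NETSERVICE", "PRESIGNUP"}


def normalize_id(value: Optional[int]) -> str:
    """Map NULL (None) and 0 to '-1'; keep every other id as a string."""
    return "-1" if value is None or value == 0 else str(value)


def clean_row(row: dict) -> Optional[dict]:
    """Return the cleaned row, or None when the row is marked as deleted."""
    if row["deleted"] != 0:
        return None  # cleaning step: drop deleted rows
    ts = row["create_date_time"]  # assumed format: 'YYYY-MM-DD HH:MM:SS'
    return {
        "rid": row["id"],
        "customer_id": row["customer_id"],
        "create_date_time": ts,
        "itcast_school_id": normalize_id(row["itcast_school_id"]),
        "itcast_subject_id": normalize_id(row["itcast_subject_id"]),
        "origin_type": row["origin_type"],
        # '1' = online, '0' = offline
        "origin_type_stat": "1" if row["origin_type"] in ONLINE_ORIGINS else "0",
        "yearinfo": ts[0:4],
        "monthinfo": ts[5:7],
        "dayinfo": ts[8:10],
        "hourinfo": ts[11:13],
    }
```

The slices correspond to the 1-based `substr` offsets 1, 6, 9 and 12 on the same timestamp string.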

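Cleaning and transformation SQL (note: the TABLESAMPLE clause reads only bucket 1 of 10, i.e. a sample of the table; drop it to process all rows):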
INSERT INTO TABLE itcast_dwd.itcast_intention_dwd partition(yearinfo,monthinfo,dayinfo)
select
id as rid,
customer_id,
create_date_time,
if(itcast_school_id is null or itcast_school_id =0,'-1',itcast_school_id) as itcast_school_id ,
deleted,
origin_type,
if(itcast_subject_id is null or itcast_subject_id =0,'-1',itcast_subject_id) as itcast_subject_id,
creator,
substr(create_date_time,12,2) as hourinfo,
if(origin_type in('NETSERVICE','PRESIGNUP'),'1','0') as origin_type_stat,
substr(create_date_time,1,4) as yearinfo,
substr(create_date_time,6,2) as monthinfo,
substr(create_date_time,9,2) as dayinfo
from itcast_ods.customer_relationship TABLESAMPLE(BUCKET 1 OUT OF 10 on id) as cr where deleted = 0;

-- 4.4: Data transformation: DWD --> DWM
-- Partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
-- Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- Apply compression on write
set hive.exec.orc.compression.strategy=COMPRESSION;
-- Bucketing
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

insert into table itcast_dwm.itcast_intention_dwm partition(yearinfo,monthinfo,dayinfo)
select
iid.customer_id,
iid.create_date_time,
dcu.area,
iid.itcast_school_id,
dis.name,
iid.deleted,
iid.origin_type,
iid.itcast_subject_id,
disub.name,
iid.hourinfo,
iid.origin_type_stat,
if(cc.clue_state ='VALID_NEW_CLUES' , '1', if(cc.clue_state ='VALID_PUBLIC_NEW_CLUE','0','-1') ) as clue_state_stat, -- new vs. returning customer: 1 = new, 0 = returning, -1 = other
demp.tdepart_id,
dsd.name,
iid.yearinfo,
iid.monthinfo,
iid.dayinfo
from itcast_dwd.itcast_intention_dwd as iid
left join itcast_ods.customer_clue as cc on iid.rid = cc.customer_relationship_id
left join itcast_dimen.itcast_school as dis on dis.id = iid.itcast_school_id
left join itcast_dimen.itcast_subject as disub on disub.id=iid.itcast_subject_id
left join itcast_dimen.customer as dcu on dcu.id = iid.customer_id
left join itcast_dimen.employee as demp on demp.id = iid.creator
left join itcast_dimen.scrm_department as dsd on dsd.id = demp.tdepart_id;

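Testing shows that the join between itcast_intention_dwd and customer_clue is optimized into an SMB (sort-merge-bucket) map join;
the joins to the other tables are ordinary map joins.

4.5) Statistical analysis:
Metric: intention count
Dimensions:
Time dimension: year, month, day, hour
New/returning customer dimension
Online/offline dimension
Product attribute dimensions:
region, source channel, subject, campus, consulting center

-- Requirement 1: count intention customers per month, by new/returning and online/offline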
insert into table itcast_dws.itcast_intention_dws partition(yearinfo,monthinfo,dayinfo)
select
count(distinct customer_id ) as customer_total,
'-1' as area,
'-1' as itcast_school_id,
'-1' as itcast_school_name,
'-1' as origin_type,
'-1' as itcast_subject_id,
'-1' as itcast_subject_name,
'-1' as hourinfo,
origin_type_stat,
clue_state_stat,
'-1' as tdepart_id,
'-1' as tdepart_name,
concat(yearinfo,'-',monthinfo) as time_str,
'1' as grouptype ,
'4' as time_type,
yearinfo,
monthinfo,
'-1' as dayinfo
from itcast_dwm.itcast_intention_dwm group by yearinfo,monthinfo, clue_state_stat,
origin_type_stat;

-- Requirement 2: count intention customers per day, by new/returning, online/offline and region
insert into table itcast_dws.itcast_intention_dws partition(yearinfo,monthinfo,dayinfo)
select
count(distinct customer_id ) as customer_total,
area,
'-1' as itcast_school_id,
'-1' as itcast_school_name,
'-1' as origin_type,
'-1' as itcast_subject_id,
'-1' as itcast_subject_name,
'-1' as hourinfo,
origin_type_stat,
clue_state_stat,
'-1' as tdepart_id,
'-1' as tdepart_name,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'2' as grouptype ,
'2' as time_type,
yearinfo,
monthinfo,
dayinfo
from itcast_dwm.itcast_intention_dwm group by yearinfo,monthinfo,dayinfo, clue_state_stat,
origin_type_stat,area;

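Today's content: day14-------------------------------------------------
1) Visit and consultation dashboard: incremental pipeline -- hands-on
2) Intention customer dashboard: requirements analysis -- ideally work it out yourself
3) Intention customer dashboard: modeling analysis -- understand it and try the analysis yourself
4) Bucketed join optimization -- understand + take notes

1) Visit and consultation dashboard: incremental pipeline
What is an incremental pipeline? Every day, the previous day's data is processed as follows:
1. Data ingestion: business database --> ODS layer
Import the previous day's data from the business database into the ODS layer
2. Data transformation: ODS layer --> DWD layer
Cleanse and transform the previous day's ODS data and load it into the DWD layer
3. Data analysis: DWD layer --> DWS layer
Aggregate the previous day's DWD data and load the results into the DWS layer
4. Data export: DWS layer --> business database (BI)
A full export is acceptable here, because the volume of aggregated results is roughly the same every day

0. Preparation: fabricate one day of "previous day" data (this step does not exist in real production)
-- Create a table and load the data into it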
CREATE TABLE web_chat_ems_2020_11 AS
SELECT * FROM web_chat_ems_2019_07 WHERE create_time BETWEEN '2019-07-01 00:00:00' AND '2019-07-01 23:59:59' ;
-- Set the time field in the main table to the previous day's date
UPDATE web_chat_ems_2020_11 SET create_time= CONCAT('2020-11-28',' ',SUBSTR(create_time,12)) ;
-- Create the companion table; its rows already correspond to the main table's rows, so simply copy them into a new table
CREATE TABLE web_chat_text_ems_2020_11 AS SELECT * FROM web_chat_text_ems_2019_07 ;

1. Incremental data ingestion:
1.1: How do we fetch the previous day's data from MySQL?

SELECT
id,create_date_time,session_id,sid,create_time,seo_source,
seo_keywords,ip,`area`,country,province,city,origin_channel,
`user` AS user_match, manual_time,begin_time,end_time,last_customer_msg_time_stamp,
last_agent_msg_time_stamp,reply_msg_count,msg_count,browser_name,os_info, '2020-11-28' AS starts_time
FROM web_chat_ems_2020_11
WHERE create_time BETWEEN '2020-11-28 00:00:00' AND '2020-11-28 23:59:59';

SELECT
wcte.* , '2020-11-28' AS start_time
FROM
(SELECT id FROM web_chat_ems_2020_11 WHERE create_time BETWEEN '2020-11-28 00:00:00' AND '2020-11-28 23:59:59') AS tmp1
JOIN web_chat_text_ems_2020_11 wcte ON tmp1.id = wcte.id ;

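1.2: Import the previous day's data into the ODS layer: sqoop
Consider: both of the queries above have to run every day, with only the date changing.
How do we handle that? With a shell script.
Goal: write a shell script that imports the data for a date passed in as an argument, and falls back to the previous day's date when no argument is given.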

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/nev \
--username root --password 123456 \
--query 'SELECT
id,create_date_time,session_id,sid,create_time,seo_source,
seo_keywords,ip,`area`,country,province,city,origin_channel,
`user` AS user_match, manual_time,begin_time,end_time,last_customer_msg_time_stamp,
last_agent_msg_time_stamp,reply_msg_count,msg_count,browser_name,os_info, "2020-11-28" AS starts_time
FROM web_chat_ems_2020_11
WHERE create_time BETWEEN "2020-11-28 00:00:00" AND "2020-11-28 23:59:59" and $CONDITIONS' \
--fields-terminated-by '\t' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_ems \
-m 3 \
--split-by id

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/nev \
--username root --password 123456 \
--query 'SELECT
wcte.* , "2020-11-28" AS start_time
FROM
(SELECT id FROM web_chat_ems_2020_11 WHERE create_time BETWEEN "2020-11-28 00:00:00" AND "2020-11-28 23:59:59") AS tmp1
JOIN web_chat_text_ems_2020_11 wcte ON tmp1.id = wcte.id and $CONDITIONS' \
--fields-terminated-by '\t' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_text_ems \
-m 3 \
--split-by id
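
The date handling described in 1.2 could be sketched roughly as follows. This is only an illustration: resolve_import_date is a hypothetical helper name, and the fallback relies on GNU date (date -d), which is an assumption about the machine running the script.

```shell
#!/bin/bash
# Sketch only: resolve the import date, then build the time window
# that would replace the hard-coded "2020-11-28" literals in --query.
# resolve_import_date is a hypothetical helper; `date -d` assumes GNU date.

resolve_import_date() {
    if [ -n "$1" ]; then
        # A date was passed in (expected format: yyyy-MM-dd): use it as-is.
        echo "$1"
    else
        # No argument: fall back to the previous day.
        date -d '-1 day' +'%Y-%m-%d'
    fi
}

import_date=$(resolve_import_date "$1")
begin_time="${import_date} 00:00:00"
end_time="${import_date} 23:59:59"

echo "importing rows with create_time between ${begin_time} and ${end_time}"
```

If the --query string is switched to double quotes so that ${import_date} is expanded by the shell, $CONDITIONS has to be written as \$CONDITIONS so that it reaches sqoop unexpanded.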

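-- Once written, the shell script has to run every day to ingest the previous day's data; this can be scheduled with oozie

-- Cleansing and transformation: cleanse and transform only the previous day's data in ODS

-- Dynamic partition settings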
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- Apply compression on write
set hive.exec.orc.compression.strategy=COMPRESSION;

insert into table itcast_dwd.visit_consult_dwd partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
wce.session_id,
wce.sid,
unix_timestamp(wce.create_time,"yyyy-MM-dd HH:mm:ss") AS create_time,
wce.seo_source,
wce.ip,wce.area,
cast( if( wce.msg_count is null , 0 ,wce.msg_count ) as int ) as msg_count,
wce.origin_channel,
wcte.referrer, wcte.from_url,
wcte.landing_page_url, wcte.url_title,
wcte.platform_description, wcte.other_params,
wcte.history,
substr(wce.create_time,12,2) as hourinfo,
substr(wce.create_time,1,4) as yearinfo ,
quarter(wce.create_time) as quarterinfo,
substr(wce.create_time,6,2) as monthinfo,
substr(wce.create_time,9,2) as dayinfo
from (select * from itcast_ods.web_chat_ems where starts_time = '2020-11-28' ) wce
join (select * from itcast_ods.web_chat_text_ems where start_time = '2020-11-28') wcte
on wce.id = wcte.id ;
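
To make the positional arguments in the query above easier to check, here is a small sketch that slices a sample timestamp the same way. The sample value is made up for illustration; cut -c is 1-based like Hive's substr, so the positions carry over directly.

```shell
#!/bin/bash
# Sketch: the same slicing that substr() and quarter() perform in the Hive query,
# applied to a made-up sample timestamp.

create_time='2020-11-28 09:15:30'

yearinfo=$(echo "$create_time" | cut -c1-4)     # substr(create_time,1,4)
monthinfo=$(echo "$create_time" | cut -c6-7)    # substr(create_time,6,2)
dayinfo=$(echo "$create_time" | cut -c9-10)     # substr(create_time,9,2)
hourinfo=$(echo "$create_time" | cut -c12-13)   # substr(create_time,12,2)

# quarter(): months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4.
# ${monthinfo#0} strips a leading zero so "08" and "09" are not read as octal.
quarterinfo=$(( (${monthinfo#0} - 1) / 3 + 1 ))

echo "$yearinfo $quarterinfo $monthinfo $dayinfo $hourinfo"
```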

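-- Incremental statistical analysis:
With incremental statistics, adding new data can invalidate previously computed results.
For example:
Suppose statistics per year, quarter, month, day and hour have been computed for January 2020 through 27 November 2020.
After the data for 28 November is added to the data set and the statistics are computed again:
Daily results: only the result for the new day needs to be added
Hourly results: likewise, only the new rows need to be added to the existing results
Monthly results: January through October are unaffected, but the November result is affected, so the previous
November result must be deleted
Quarterly results: quarters 1, 2 and 3 are unaffected, but quarter 4 is affected, so the previous quarter 4 result
must be deleted before recomputing by quarter
Yearly results: the 2020 result is affected as well; the previous 2020 result must be deleted before recomputing

However, Hive does not support deleting an individual row (no random deletes). How do we get around this? Hive does support dropping partitions.
Syntax for dropping a partition:
alter table table_name drop partition(partition_column=value, ...)

For example, to recompute the yearly statistics with the new incremental data: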
alter table visit_dws drop partition(yearinfo='2020',quarterinfo='-1',monthinfo='-1',dayinfo='-1')
For example, to recompute the quarterly statistics with the new incremental data:
alter table visit_dws drop partition(yearinfo='2020',quarterinfo='4',monthinfo='-1',dayinfo='-1')
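
As a rough sketch, the statements to run for a given day can be generated from the date itself. drop_partition_statements is a hypothetical helper; the year and quarter partition specs follow the two examples above, while the monthly spec (quarter and month set, dayinfo='-1') is an assumption extrapolated from them.

```shell
#!/bin/bash
# Sketch: print the "alter table ... drop partition" statements whose
# aggregates are invalidated by new data for the given day (yyyy-MM-dd).
# The monthly partition spec is an assumption, not taken from the source.

drop_partition_statements() {
    day=$1
    year=$(echo "$day" | cut -c1-4)
    month=$(echo "$day" | cut -c6-7)
    # Strip a leading zero before doing arithmetic on the month.
    quarter=$(( (${month#0} - 1) / 3 + 1 ))
    table=visit_dws
    echo "alter table $table drop partition(yearinfo='$year',quarterinfo='-1',monthinfo='-1',dayinfo='-1');"
    echo "alter table $table drop partition(yearinfo='$year',quarterinfo='$quarter',monthinfo='-1',dayinfo='-1');"
    echo "alter table $table drop partition(yearinfo='$year',quarterinfo='$quarter',monthinfo='$month',dayinfo='-1');"
}

drop_partition_statements 2020-11-28
```

Daily and hourly partitions are not dropped here, because for those only new results are appended.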

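For the subsequent statistics, simply select the data to be aggregated with a where clause
For example:
Daily visit counts per region

select ..... from dwd_table where yearinfo='2020' and quarterinfo= '4' and monthinfo ='11'
and dayinfo ='28'
group by year, quarter, month, day, region ;

Yearly visit counts per region
select ..... from dwd_table where yearinfo='2020'
group by year, region ;

-- Exporting the data (simplified)
Approach:
Delete the current year's data from the existing MySQL data, because whatever the increment changes, it always affects the current year's results

Then:
Select all 2020 results from the DWS table and export them in full

Dashboard 1 homework:
Be able to explain, in your own words, the metrics and dimensions of the first dashboard, how the dimensional analysis
was carried out, and which optimizations were involved in the computation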

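3. Intention customer dashboard: requirements
Requirement 1: total number of new intention customers within the reporting period (including intention customers entered by staff themselves).
Metric: intention count
Dimensions:
Time dimension:
year, month, day, hour
New/returning customer dimension:
Online/offline dimension:

Customer table: area

The intention count is based on customer_id, deduplicated first
Join condition between the two tables:
customer_relationship.customer_id = customer.id

Requirement 4: within a given time range, rank campuses by the number of new intention customers
Metric: intention count
Dimensions:
Time dimension: year, month, day, hour
New/returning customer dimension:
Online/offline dimension
Campus dimension

Note: for the school id, 0 and NULL are normalized to a single value, -1, during synchronization

Summary:
Metric: intention count
Dimensions:
Time dimension: year, month, day, hour
New/returning customer dimension:
Online/offline dimension
Product attribute dimensions:
region, source channel, subject, campus, consulting center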

Intention dashboard case: importing the raw business data --- this step does not exist in real work
create database scrm default character set utf8mb4 collate utf8mb4_unicode_ci;

From the previously distributed 知行教育 analytics platform materials --> complete raw data set --> scrm --> import the 7 tables into MySQL one by one

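Intention dashboard case: modeling analysis:
ODS layer:
Fact table: intention table
One additional table: clue table (note: this table is the fact table of a later dashboard, so it is placed in the ODS layer to simplify later processing)
Tables: managed (internal) table + bucketed + partitioned + implemented as a zipper table (拉鍊表)
DIM layer: dimension layer
Employee, campus, subject, customer and department tables
Tables: external table + partitioned
For both layers: build each table to mirror the structure of its source table, and add a start_time (extraction time) column
Storage format and compression: ORC + ZLIB (or SNAPPY)

DW layer:
DWD: cleansing and transformation of the ODS tables; if a table has too many columns, extract only the relevant ones
Cleansing:
Remove rows that have been flagged as deleted
Transformation:
Convert origin_type into 0 / 1 in a new field that flags online vs. offline
Split create_date_time into year, month, day and hour
Fields involved:
Regular fields:
id,create_date_time,deleted ,customer_id ,origin_type ,origin_type_stat,
itcast_school_id ,itcast_subject_id,creator,hourinfo
Partitions:
year (yearinfo), month (monthinfo), day (dayinfo)

DWM: pre-aggregation by dimension is not possible here, so this layer performs dimension degeneration instead
Join the six dimension tables with the DWD fact table into a single table, which achieves dimension degeneration
Idea: decide which fields are needed from each dimension table and merge all of them into one table
Fields involved:
Regular fields:
customer_id, create_date_time,clue_state_stat ,origin_type_stat,area,origin_type,
itcast_school_id,school_name,itcast_subject_id,itcast_subject_name,department_id,
department_name ,hourinfo
Partition fields:
year (yearinfo), month (monthinfo), day (dayinfo)

Producing this table's data requires a seven-table join (the DWD fact table plus the ODS clue table and the DIM tables)

APP layer: not needed; DWS has already completed the analysis for every dimension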

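Intention Customer Dashboard

1. Learning objectives

Understand the requirements of the intention customer dashboard

Master Hive bucketing

Master Map Join

Master Bucket-Map Join

Master SMB Join

Be able to ingest the full intention customer data set

Be able to use Hive execution plans

Be able to write the DWD cleansing and transformation SQL for the intention customer metric

Be able to write the DWM middle-layer SQL for the intention customer metric

Be able to write the DWS business-layer SQL for the intention customer metric

Be able to export the analysis results to MySQL

Understand incremental ingestion for zipper tables

Master incremental cleansing of changed data

Master incremental analysis of changed data

Be able to export incremental data to MySQL with Sqoop

2. Dashboard requirements

The metrics are: 1. total intentions; 2. heat map of intention students' locations; 3. intention subject ranking; 4. intention campus ranking; 5. source channel share; 6. share by contributing consulting center.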


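1.1 Total intentions

Description: total number of new intention customers within the reporting period (including intention customers entered by staff themselves).

Presentation: line chart

Filters: year, month, online/offline

Dimensions: year, month, online/offline

Metric: total number of intention customers

Granularity: day, with drill-down to hourly data.

Data source: the customer_relationship intention table of the customer management system

SQL: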

SELECT
date_format(
cr.create_date_time,
'%Y-%m-%d'
),
count(DISTINCT cr.customer_id)
FROM
customer_relationship cr
WHERE
cr.create_date_time >= '2019-12-01'
AND cr.create_date_time <= '2019-12-31 23:59:59'
GROUP BY
date_format(
cr.create_date_time,
'%Y-%m-%d'
);

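1.2 Heat map of intention students' locations

Description: for a given time range, a heat map of the number of new intention customers per city/region.

Presentation: map heat map

Dimensions: year, month, online/offline

Metric: number of intention customer ids aggregated by region

Granularity: day, with drill-down to hourly data.

Filters: year, month, online/offline

Data source: customer (static customer information table) and customer_relationship (customer intention table) of the customer management system

SQL: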

SELECT
c.area '區域',
count(DISTINCT cr.customer_id) '總數',
DATE_FORMAT(cr.create_date_time,'%Y-%m-%d') '客户創建時間'
FROM
customer c, customer_relationship cr
WHERE cr.customer_id = c.id
AND cr.create_date_time > '2019-11-01 00:00:00'
AND cr.create_date_time < '2019-11-30 23:59:59'
GROUP BY DATE_FORMAT(cr.create_date_time,'%Y-%m-%d'), c.area
ORDER BY DATE_FORMAT(cr.create_date_time,'%Y-%m-%d') ASC, count(1) DESC

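1.3 Intention subject ranking

Description: for a given time range, rank subjects by the number of new intention customers. The subject name has to be looked up via a join.

Presentation: bar chart

Filters: year, month, online/offline

Dimensions: year, month, online/offline, subject

Metric: number of intention customers per subject

Granularity: day, with drill-down to hourly data.

Data source: customer_clue (customer clue table), customer_relationship (customer intention table) and itcast_subject (subject table) of the customer management system

SQL:

The intended subject must be taken from the subject field of the intention table, not from the clue table.

SELECT cr.itcast_subject_id,
sj.name,
count(DISTINCT cr.customer_id)
FROM customer_clue cc,
customer_relationship cr
left join itcast_subject sj on cr.itcast_subject_id = sj.id
WHERE cc.clue_state = 'VALID_NEW_CLUES' -- new customer, new clue
AND ! cc.deleted
AND cr.origin_type IN ('NETSERVICE', 'PRESIGNUP') # online (excludes mined/manually entered records)
AND cc.create_date_time > '2019-10-01 00:00:00'
AND cc.create_date_time < '2019-11-30 23:59:59'
AND cc.customer_relationship_id = cr.id
GROUP BY cr.itcast_subject_id, sj.name
ORDER BY count(DISTINCT cr.customer_id) DESC;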

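1.4 Intention campus ranking

Description: for a given time range, rank campuses by the number of new intention customers.

Presentation: bar chart

Filters: year, month, online/offline

Dimensions: year, month, online/offline, campus

Metric: number of intention customers per campus

Granularity: day, with drill-down to hourly data.

Data source: the customer management system

Note: for the school id, 0 and NULL are normalized to a single value, -1, during synchronization

SQL:

SELECT cr.itcast_school_id,
sc.name,
count(DISTINCT cr.customer_id)
FROM customer_clue cc,
customer_relationship cr
left join itcast_school sc on cr.itcast_school_id = sc.id
WHERE cc.clue_state = 'VALID_NEW_CLUES' -- new customer, new clue
AND ! cc.deleted
AND cr.origin_type IN ('NETSERVICE', 'PRESIGNUP') # online (excludes mined/manually entered records)
AND cc.create_date_time > '2019-10-01 00:00:00'
AND cc.create_date_time < '2019-11-30 23:59:59'
AND cc.customer_relationship_id = cr.id
GROUP BY cr.itcast_school_id, sc.name
ORDER BY count(DISTINCT cr.customer_id) DESC;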

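1.5 Source channel share

Description: for a given time range, the share of new intention customers per source channel.

Presentation: pie chart

Filters: year, month, online/offline

Dimensions: year, month, online/offline, source channel

Granularity: day, with drill-down to hourly data.

Metric: number of intention customers per source channel

Data source: customer_clue (customer clue table) and customer_relationship (customer intention table) of the customer management system

SQL:

SELECT
cr.origin_type '來源渠道',
count(DISTINCT cr.customer_id) '總數'
FROM
customer_relationship cr
LEFT JOIN customer_clue cc ON cc.customer_relationship_id = cr.id
WHERE
cc.clue_state = 'VALID_NEW_CLUES'
AND cr.create_date_time < '2019-11-30 23:59:59' -- end of the period; the source query does not state the start of the period
AND cr.origin_type IN ('NETSERVICE','PRESIGNUP') # online (excludes mined/manually entered records)
AND ! cc.deleted
GROUP BY
cr.origin_type;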

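1.6 Share by contributing consulting center

Description: for a given time range, the share of new intention customers produced by each consulting center.

Presentation: pie chart

Filters: year, month, online/offline

Dimensions: year, month, online/offline, consulting center

Metric: number of intention customers per consulting center

Granularity: day, with drill-down to hourly data.

Data source: customer_relationship (customer intention table), employee (employee table) and scrm_department (department table) of the customer management system

SQL:

SELECT
e.tdepart_id,
sd.`name`,
count(DISTINCT cr.customer_id) '總數'
FROM
customer_relationship cr
LEFT JOIN customer_clue cc ON cc.customer_relationship_id = cr.id
LEFT JOIN employee e ON cr.creator = e.id
LEFT JOIN scrm_department sd ON e.tdepart_id = sd.id
WHERE
cc.clue_state = 'VALID_NEW_CLUES'
AND cr.create_date_time >= '2019-10-01 00:00:00'
AND cr.create_date_time <= '2019-11-30 23:59:59'
AND cr.origin_type IN ('NETSERVICE','PRESIGNUP') # online (excludes mined/manually entered records)
GROUP BY
e.tdepart_id, sd.`name`;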

1.7 Raw data structure


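1.7.1 Creating the database

The intention customer data comes from the consultation management system's database: scrm.

create database scrm default character set utf8mb4 collate utf8mb4_unicode_ci;

Test data

The MySQL test data can be created by importing the prepared SQL file: 【Home\講義\完整原始數據\scrm.sql】. Log in with the mysql client, then run the file with source:

mysql -h 192.168.52.150 -P 3306 -uroot -p
source G:\知行教育大數據平台\講義\完整原始數據\scrm.sql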

1.7.2 customer: static customer information table

Mainly used in joins to obtain static customer information, such as the region.

CREATE TABLE `customer` ( `id` int(11) NOT NULL AUTO_INCREMENT, `customer_relationship_id` int(11) DEFAULT NULL COMMENT '當前意向id', `create_date_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '創建時間', `update_date_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '最後更新時間', `deleted` bit(1) NOT NULL DEFAULT b'0' COMMENT '是否被刪除（禁用）', `name` varchar(128) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '' COMMENT '姓名', `idcard` varchar(24) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT '' COMMENT '身份證號', `birth_year` int(5) DEFAULT NULL COMMENT '出生年份', `gender` varchar(8) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT 'MAN' COMMENT '性別', `phone` varchar(24) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '' COMMENT '手機號', `wechat` varchar(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT '' COMMENT '微信', `qq` varchar(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT '' COMMENT 'qq號', `email` varchar(56) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT '' COMMENT '郵箱', `area` varchar(128) DEFAULT '' COMMENT '所在區域', `leave_school_date` date DEFAULT NULL COMMENT '離校時間', `graduation_date` date DEFAULT NULL COMMENT '畢業時間', `bxg_student_id` varchar(64) DEFAULT NULL COMMENT '博學谷學員ID，可能未關聯到，不存在', `creator` int(11) DEFAULT NULL COMMENT '創建人ID', `origin_type` varchar(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT '數據來源', `origin_channel` varchar(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT '來源渠道', `tenant` int(11) NOT NULL DEFAULT '0', `md_id` int(11) DEFAULT '0' COMMENT '中台id', PRIMARY KEY (`id`), KEY `employee_id` (`creator`) USING BTREE, KEY `customer_relationship_id` (`customer_relationship_id`) USING BTREE, KEY `index_idcard` (`idcard`) USING BTREE, KEY `index_phone` (`phone`) USING BTREE, KEY `index_create_time` (`create_date_time`) USING BTREE, KEY `index_qq` (`qq`) USING BTREE, KEY `idx_update_time` (`update_date_time`) USING BTREE, CONSTRAINT `customer_ibfk_1` FOREIGN KEY (`creator`) REFERENCES 
`employee` (`id`) ) ENGINE=InnoDB AUTO_INCREMENT=2061222 DEFAULT CHARSET=utf8;

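1.7.3 customer_relationship: customer intention table

The main intention customer table, used to compute the fact data.

Per the requirements, intention records can be updated; updated records must be re-aggregated to get correct results, and historical snapshots of the data must remain queryable.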

CREATE TABLE `customer_relationship` ( `id` int(11) NOT NULL AUTO_INCREMENT, `create_date_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP, `update_date_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '最後更新時間', `deleted` bit(1) NOT NULL DEFAULT b'0' COMMENT '是否被刪除（禁用）', `customer_id` int(11) NOT NULL DEFAULT '0' COMMENT '所屬客户id', `first_id` int(11) DEFAULT NULL COMMENT '第一條客户關係id', `belonger` int(11) DEFAULT NULL COMMENT '歸屬人', `belonger_name` varchar(10) DEFAULT NULL COMMENT '歸屬人姓名', `initial_belonger` int(11) DEFAULT NULL COMMENT '初始歸屬人', `distribution_handler` int(11) DEFAULT NULL COMMENT '分配處理人', `business_scrm_department_id` int(11) DEFAULT '0' COMMENT '歸屬部門', `last_visit_time` datetime DEFAULT NULL COMMENT '最後回訪時間', `next_visit_time` datetime DEFAULT NULL COMMENT '下次回訪時間', `origin_type` varchar(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT '數據來源', `itcast_school_id` int(11) DEFAULT NULL COMMENT '校區Id', `itcast_subject_id` int(11) DEFAULT NULL COMMENT '學科Id', `intention_study_type` varchar(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT '意向學習方式', `anticipat_signup_date` date DEFAULT NULL COMMENT '預計報名時間', `level` varchar(8) DEFAULT NULL COMMENT '客户級別', `creator` int(11) DEFAULT NULL COMMENT '創建人', `current_creator` int(11) DEFAULT NULL COMMENT '當前創建人：初始==創建人，當在公海拉回時為拉回人', `creator_name` varchar(32) DEFAULT '' COMMENT '創建者姓名', `origin_channel` varchar(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT '來源渠道', `comment` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT '' COMMENT '備註', `first_customer_clue_id` int(11) DEFAULT '0' COMMENT '第一條線索id', `last_customer_clue_id` int(11) DEFAULT '0' COMMENT '最後一條線索id', `process_state` varchar(32) DEFAULT NULL COMMENT '處理狀態', `process_time` datetime DEFAULT NULL COMMENT '處理狀態變動時間', `payment_state` varchar(32) DEFAULT NULL COMMENT '支付狀態', `payment_time` datetime DEFAULT NULL COMMENT '支付狀態變動時間', `signup_state` varchar(32) CHARACTER SET 
utf8 COLLATE utf8_bin DEFAULT NULL COMMENT '報名狀態', `signup_time` datetime DEFAULT NULL COMMENT '報名時間', `notice_state` varchar(32) DEFAULT NULL COMMENT '通知狀態', `notice_time` datetime DEFAULT NULL COMMENT '通知狀態變動時間', `lock_state` bit(1) DEFAULT b'0' COMMENT '鎖定狀態', `lock_time` datetime DEFAULT NULL COMMENT '鎖定狀態修改時間', `itcast_clazz_id` int(11) DEFAULT NULL COMMENT '所屬ems班級id', `itcast_clazz_time` datetime DEFAULT NULL COMMENT '報班時間', `payment_url` varchar(1024) DEFAULT '' COMMENT '付款鏈接', `payment_url_time` datetime DEFAULT NULL COMMENT '支付鏈接生成時間', `ems_student_id` int(11) DEFAULT NULL COMMENT 'ems的學生id', `delete_reason` varchar(64) DEFAULT NULL COMMENT '刪除原因', `deleter` int(11) DEFAULT NULL COMMENT '刪除人', `deleter_name` varchar(32) DEFAULT NULL COMMENT '刪除人姓名', `delete_time` datetime DEFAULT NULL COMMENT '刪除時間', `course_id` int(11) DEFAULT NULL COMMENT '課程ID', `course_name` varchar(64) DEFAULT NULL COMMENT '課程名稱', `delete_comment` varchar(255) DEFAULT '' COMMENT '刪除原因説明', `close_state` varchar(32) DEFAULT NULL COMMENT '關閉裝填', `close_time` datetime DEFAULT NULL COMMENT '關閉狀態變動時間', `appeal_id` int(11) DEFAULT NULL COMMENT '申訴id', `tenant` int(11) NOT NULL DEFAULT '0' COMMENT '租户', `total_fee` decimal(19,0) DEFAULT NULL COMMENT '報名費總金額', `belonged` int(11) DEFAULT NULL COMMENT '小週期歸屬人', `belonged_time` datetime DEFAULT NULL COMMENT '歸屬時間', `belonger_time` datetime DEFAULT NULL COMMENT '歸屬時間', `transfer` int(11) DEFAULT NULL COMMENT '轉移人', `transfer_time` datetime DEFAULT NULL COMMENT '轉移時間', `follow_type` int(4) DEFAULT '0' COMMENT '分配類型，0-自動分配，1-手動分配，2-自動轉移，3-手動單個轉移，4-手動批量轉移，5-公海領取', `transfer_bxg_oa_account` varchar(64) DEFAULT NULL COMMENT '轉移到博學谷歸屬人OA賬號', `transfer_bxg_belonger_name` varchar(64) DEFAULT NULL COMMENT '轉移到博學谷歸屬人OA姓名', PRIMARY KEY (`id`), KEY `customer_id` (`customer_id`) USING BTREE, KEY `appeal_id` (`appeal_id`) USING BTREE, KEY `create_date_time` (`create_date_time`) USING BTREE, KEY `next_visit_time` (`next_visit_time`) USING BTREE, KEY 
`last_visit_time` (`last_visit_time`) USING BTREE, KEY `itcast_school_id` (`itcast_school_id`) USING BTREE, KEY `index_delete` (`delete_time`) USING BTREE, KEY `index_class_id` (`itcast_clazz_id`) USING BTREE, KEY `belonger` (`belonger`) USING BTREE, KEY `creator` (`creator`) USING BTREE, KEY `index_itcast_subject_id` (`itcast_subject_id`) USING BTREE, KEY `idex_distribution` (`distribution_handler`) USING BTREE, CONSTRAINT `customer_relationship_ibfk_1` FOREIGN KEY (`customer_id`) REFERENCES `customer` (`id`) ) ENGINE=InnoDB AUTO_INCREMENT=2060127 DEFAULT CHARSET=utf8;

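1.7.4 customer_clue: customer clue table

The customer clue table mainly stores the contact clues customers leave when they make an enquiry, such as phone number or WeChat ID. For intention customer statistics it is mainly used to tell new customers from returning ones: the clue_state value 'VALID_NEW_CLUES' marks a new customer and 'VALID_PUBLIC_NEW_CLUE' marks a returning customer.

Per the requirements, clue records can also be updated; updated records must be re-aggregated to get correct results, and historical snapshots of the data must remain queryable.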

CREATE TABLE `customer_clue` ( `id` int(11) NOT NULL AUTO_INCREMENT, `create_date_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '創建時間', `update_date_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '最後更新時間', `deleted` bit(1) NOT NULL DEFAULT b'0' COMMENT '是否被刪除（禁用）', `customer_id` int(11) DEFAULT NULL COMMENT '客户id', `customer_relationship_id` int(11) DEFAULT NULL COMMENT '客户關係id', `session_id` varchar(48) COLLATE utf8_bin DEFAULT '' COMMENT '七陌會話id', `sid` varchar(48) COLLATE utf8_bin DEFAULT '' COMMENT '訪客id', `status` varchar(16) COLLATE utf8_bin DEFAULT '' COMMENT '狀態（undeal待領取 deal 已領取 finish 已關閉 changePeer 已流轉）', `user` varchar(16) COLLATE utf8_bin DEFAULT '' COMMENT '所屬坐席', `create_time` datetime DEFAULT NULL COMMENT '七陌創建時間', `platform` varchar(16) COLLATE utf8_bin DEFAULT '' COMMENT '平台來源（pc-網站諮詢\|wap-wap諮詢\|sdk-app諮詢\|weixin-微信諮詢）', `s_name` varchar(32) COLLATE utf8_bin DEFAULT '' COMMENT '用户名稱', `seo_source` varchar(255) COLLATE utf8_bin DEFAULT '' COMMENT '搜索來源', `seo_keywords` varchar(255) COLLATE utf8_bin DEFAULT '' COMMENT '關鍵字', `ip` varchar(48) COLLATE utf8_bin DEFAULT '' COMMENT 'IP地址', `referrer` text COLLATE utf8_bin COMMENT '上級來源頁面', `from_url` text COLLATE utf8_bin COMMENT '會話來源頁面', `landing_page_url` text COLLATE utf8_bin COMMENT '訪客着陸頁面', `url_title` varchar(1024) COLLATE utf8_bin DEFAULT '' COMMENT '諮詢頁面title', `to_peer` varchar(255) COLLATE utf8_bin DEFAULT '' COMMENT '所屬技能組', `manual_time` datetime DEFAULT NULL COMMENT '人工開始時間', `begin_time` datetime DEFAULT NULL COMMENT '坐席領取時間 ', `reply_msg_count` int(11) DEFAULT '0' COMMENT '客服回覆消息數', `total_msg_count` int(11) DEFAULT '0' COMMENT '消息總數', `msg_count` int(11) DEFAULT '0' COMMENT '客户發送消息數', `comment` varchar(1024) COLLATE utf8_bin DEFAULT '' COMMENT '備註', `finish_reason` varchar(255) COLLATE utf8_bin DEFAULT '' COMMENT '結束類型', `finish_user` varchar(32) COLLATE utf8_bin DEFAULT '' COMMENT '結束坐席', `end_time` datetime DEFAULT NULL COMMENT 
'會話結束時間', `platform_description` varchar(255) COLLATE utf8_bin DEFAULT '' COMMENT '客户平台信息', `browser_name` varchar(255) COLLATE utf8_bin DEFAULT '' COMMENT '瀏覽器名稱', `os_info` varchar(255) COLLATE utf8_bin DEFAULT '' COMMENT '系統名稱', `area` varchar(255) COLLATE utf8_bin DEFAULT NULL COMMENT '區域', `country` varchar(16) COLLATE utf8_bin DEFAULT '' COMMENT '所在國家', `province` varchar(16) COLLATE utf8_bin DEFAULT '' COMMENT '省', `city` varchar(255) COLLATE utf8_bin DEFAULT '' COMMENT '城市', `creator` int(11) DEFAULT '0' COMMENT '創建人', `name` varchar(64) COLLATE utf8_bin DEFAULT '' COMMENT '客户姓名', `idcard` varchar(24) COLLATE utf8_bin DEFAULT '' COMMENT '身份證號', `phone` varchar(24) COLLATE utf8_bin DEFAULT '' COMMENT '手機號', `itcast_school_id` int(11) DEFAULT NULL COMMENT '校區Id', `itcast_school` varchar(128) COLLATE utf8_bin DEFAULT '' COMMENT '校區', `itcast_subject_id` int(11) DEFAULT NULL COMMENT '學科Id', `itcast_subject` varchar(128) COLLATE utf8_bin DEFAULT '' COMMENT '學科', `wechat` varchar(32) COLLATE utf8_bin DEFAULT '' COMMENT '微信', `qq` varchar(32) COLLATE utf8_bin DEFAULT '' COMMENT 'qq號', `email` varchar(56) COLLATE utf8_bin DEFAULT '' COMMENT '郵箱', `gender` varchar(8) COLLATE utf8_bin DEFAULT 'MAN' COMMENT '性別', `level` varchar(8) COLLATE utf8_bin DEFAULT NULL COMMENT '客户級別', `origin_type` varchar(32) COLLATE utf8_bin DEFAULT '' COMMENT '數據來源渠道', `information_way` varchar(32) COLLATE utf8_bin DEFAULT NULL COMMENT '資訊方式', `working_years` date DEFAULT NULL COMMENT '開始工作時間', `technical_directions` varchar(255) COLLATE utf8_bin DEFAULT '' COMMENT '技術方向', `customer_state` varchar(32) COLLATE utf8_bin DEFAULT '' COMMENT '當前客户狀態', `valid` bit(1) DEFAULT b'0' COMMENT '該線索是否是網資有效線索', `anticipat_signup_date` date DEFAULT NULL COMMENT '預計報名時間', `clue_state` varchar(32) COLLATE utf8_bin DEFAULT 'NOT_SUBMIT' COMMENT '線索狀態', `scrm_department_id` int(11) DEFAULT NULL COMMENT 'SCRM內部部門id', `superior_url` text COLLATE utf8_bin COMMENT '諸葛獲取上級頁面URL', `superior_source` varchar(1024) 
COLLATE utf8_bin DEFAULT NULL COMMENT '諸葛獲取上級頁面URL標題', `landing_url` text COLLATE utf8_bin COMMENT '諸葛獲取着陸頁面URL', `landing_source` varchar(1024) COLLATE utf8_bin DEFAULT NULL COMMENT '諸葛獲取着陸頁面URL來源', `info_url` text COLLATE utf8_bin COMMENT '諸葛獲取留諮頁URL', `info_source` varchar(255) COLLATE utf8_bin DEFAULT NULL COMMENT '諸葛獲取留諮頁URL標題', `origin_channel` varchar(32) COLLATE utf8_bin DEFAULT '' COMMENT '投放渠道', `course_id` int(32) DEFAULT NULL, `course_name` varchar(255) COLLATE utf8_bin DEFAULT NULL, `zhuge_session_id` varchar(500) COLLATE utf8_bin DEFAULT NULL, `is_repeat` int(4) NOT NULL DEFAULT '0' COMMENT '是否重複線索(手機號維度) 0:正常 1：重複', `tenant` int(11) NOT NULL DEFAULT '0' COMMENT '租户id', `activity_id` varchar(16) COLLATE utf8_bin DEFAULT NULL COMMENT '活動id', `activity_name` varchar(64) COLLATE utf8_bin DEFAULT NULL COMMENT '活動名稱', `follow_type` int(4) DEFAULT '0' COMMENT '分配類型，0-自動分配，1-手動分配，2-自動轉移，3-手動單個轉移，4-手動批量轉移，5-公海領取', `shunt_mode_id` int(11) DEFAULT NULL COMMENT '匹配到的技能組id', `shunt_employee_group_id` int(11) DEFAULT NULL COMMENT '所屬分流員工組', PRIMARY KEY (`id`), KEY `customer_id` (`customer_id`) USING BTREE, KEY `customer_relationship_id` (`customer_relationship_id`) USING BTREE, KEY `phone` (`phone`) USING BTREE, KEY `idcard` (`idcard`) USING BTREE, KEY `session_id` (`session_id`) USING BTREE, KEY `index_date_time` (`create_date_time`) USING BTREE, KEY `index_creator` (`creator`) USING BTREE, CONSTRAINT `customer_clue_ibfk_1` FOREIGN KEY (`customer_id`) REFERENCES `customer` (`id`), CONSTRAINT `customer_clue_ibfk_2` FOREIGN KEY (`customer_relationship_id`) REFERENCES `customer_relationship` (`id`) ) ENGINE=InnoDB AUTO_INCREMENT=2060711 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

1.7.5 employee: employee table

Mainly used in joins to obtain employee information, such as the id of the employee's department.

create table employee ( id int auto_increment primary key, email varchar(64) not null comment '公司郵箱，OA登錄賬號', real_name varchar(32) not null comment '員工的真實姓名', phone varchar(32) not null comment '手機號，目前還沒有使用；隱私問題OA接口沒有提供這個屬性，', department_id varchar(64) default '0' null comment 'OA中的部門編號，有負值', department_name varchar(64) default '' null comment 'OA中的部門名', remote_login bit not null comment '員工是否可以遠程登錄', job_number varchar(64) null comment '員工工號', cross_school bit not null comment '是否有跨校區權限', last_login_date datetime not null comment '最後登錄日期', creator int(32) null comment '創建人', create_date_time datetime default CURRENT_TIMESTAMP not null comment '創建時間', update_date_time timestamp default CURRENT_TIMESTAMP not null on update CURRENT_TIMESTAMP comment '最後更新時間', deleted bit default b'0' not null comment '是否被刪除（禁用）', scrm_department_id int(32) null comment 'SCRM內部部門id', leave_office bit null comment '離職狀態', leave_office_time datetime null comment '離職時間', reinstated_time datetime null comment '復職時間', superior_leaders_id int null comment '上級領導ID', tdepart_id int null comment '直屬部門', tenant int default 0 not null, ems_user_name varchar(32) null ) comment '員工信息表';

1.7.6 scrm_department: department table

Used to obtain department names and related information.

CREATE TABLE `scrm_department` ( `id` int(11) NOT NULL AUTO_INCREMENT COMMENT '部門id', `name` varchar(255) COLLATE utf8_bin DEFAULT NULL COMMENT '部門名稱', `parent_id` int(11) DEFAULT NULL COMMENT '父部門id', `create_date_time` datetime DEFAULT CURRENT_TIMESTAMP COMMENT '創建時間', `update_date_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '更新時間', `deleted` bit(1) DEFAULT b'0' COMMENT '刪除標誌', `id_path` varchar(1000) COLLATE utf8_bin DEFAULT NULL COMMENT '編碼全路徑', `tdepart_code` int(11) DEFAULT NULL COMMENT '直屬部門', `creator` varchar(32) COLLATE utf8_bin DEFAULT NULL COMMENT '創建者', `depart_level` int(4) DEFAULT NULL COMMENT '部門層級', `depart_sign` int(4) DEFAULT NULL COMMENT '部門標誌，暫時默認1', `depart_line` int(11) DEFAULT NULL COMMENT '業務線，存儲業務線編碼', `depart_sort` int(5) DEFAULT NULL COMMENT '排序字段', `disable_flag` int(1) DEFAULT NULL COMMENT '禁用標誌', `tenant` int(11) NOT NULL DEFAULT '0', PRIMARY KEY (`id`) ) ENGINE=InnoDB AUTO_INCREMENT=149 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

1.7.7 itcast_school: school table

Used to obtain school names and related information.

CREATE TABLE `itcast_school` ( `id` int(11) NOT NULL AUTO_INCREMENT, `create_date_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '創建時間', `update_date_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '最後更新時間', `deleted` bit(1) NOT NULL DEFAULT b'0' COMMENT '是否被刪除（禁用）', `name` varchar(32) COLLATE utf8_bin NOT NULL DEFAULT '' COMMENT '校區名稱', `code` varchar(32) COLLATE utf8_bin NOT NULL, `tenant` int(11) NOT NULL DEFAULT '0', PRIMARY KEY (`id`) ) ENGINE=InnoDB AUTO_INCREMENT=30 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

1.7.8 itcast_subject: subject table

Used to obtain subject names and related information.

CREATE TABLE `itcast_subject` ( `id` int(11) NOT NULL AUTO_INCREMENT, `create_date_time` datetime NOT NULL COMMENT '創建時間', `update_date_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '最後更新時間', `deleted` bit(1) NOT NULL DEFAULT b'0' COMMENT '是否被刪除（禁用）', `name` varchar(32) COLLATE utf8_bin DEFAULT '' COMMENT '學科名稱', `code` varchar(32) COLLATE utf8_bin DEFAULT NULL, `tenant` int(11) NOT NULL DEFAULT '0', PRIMARY KEY (`id`) ) ENGINE=InnoDB AUTO_INCREMENT=22 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

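2. Modeling analysis

2.1 Metrics and dimensions

Based on the dashboard requirements, we extract the metrics and dimensions:

Sections 1.1 to 1.6 count, respectively, total intention customers and intention customers by region, subject, campus, source channel and consulting center; all of them include the dimensions year, month and online/offline.

Every metric states that it counts new customers, so we can split the data into new and returning customers.

This gives one shared metric: the number of intention customers. Dimensions: year, month, online/offline, new/returning customer.

Because the data is displayed at day granularity and can be drilled down to the hour, day and hour must be added to the dimensions as well.

The product attributes of the individual metrics also have to be added to the dimensions:

The heat map of intention students' locations counts intention customers per region;

The intention subject ranking ultimately shows a ranking of subjects, but the ranking is based on the number of intention students counted per subject;

The intention campus ranking shows a ranking of campuses, again based on the number of intention students counted per campus;

The source channel share is the proportion of intention students per source channel, which also rests on the number of intention students;

The share by contributing consulting center is analogous to the source channel share and is based on the number of intention students per consulting center;

So the dimensions should be: year, month, day, hour, online/offline, new/returning customer, region, subject, campus, source channel, consulting center.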

2.2 Layered design


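We can work backwards from the desired result:

The dimensions that ultimately have to be reported are: year, month, day, hour, online/offline, new/returning customer, region, subject, campus, source channel, consulting center;
In the requirements, every metric is filtered by time, online/offline and new/returning customer, which means every business dimension has to be broken down by these three, so they can be kept as separate fields;
We therefore group the dimensions into four classes: time (year, month, day), data source (online/offline), customer attribute (new/returning) and product attributes (total intentions, region, subject, campus, source channel, consulting center);
First extract the data into the ODS source layer, then cleanse and transform the detail data and store it in the DWD layer;
In DWM, join the relevant dimension data and derive the required information;
The DWS layer aggregates the joined DWM data to produce the data mart;
Synchronize the data and fields needed for OLAP to MySQL;
ODS --> DWD --> DWM --> DWS.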

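3. Implementation

3.1 Modeling

3.1.1 Metrics and dimensions

Metric: the number of intention customers is the number of new intention customers per unit of time (online and offline), displayed per day.

Dimensions:

• Time dimension: year, month, day, hour

• Data source: online/offline

• Customer attribute: new customer, returning customer

• Region, subject, campus, source channel, consulting center.

3.1.2 Fact tables and dimension tables

customer_relationship, the customer intention table, contains the intention customer information; it is clearly the base fact table for the intention customer metric.

customer, the static customer information table, is mainly used in joins to obtain static customer information such as the region. It is dimension data.

customer_clue, the customer clue table, is mainly used to tell new customers from returning ones, so it is also dimension information to join in; but because it contains the fact data of later metrics, it is not placed in the DIM layer.

Similarly, employee, scrm_department, itcast_school and itcast_subject are all dimension information, so they are placed in the dimension layer as dimension tables.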

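3.1.3 Hive bucketing

Bucketing is a technique for splitting a data set into parts that are easier to manage; it divides the data at a finer granularity than partitioning.

3.1.3.1 Why bucket?

3.1.3.1.1 More efficient query processing

When the number of partitions would become so large that it could overwhelm the file system, or when the data set has no sensible partition column, bucketing is the way to solve the problem.

The data within a partition can be split further into buckets. Unlike partitioning, which splits directly on column values, bucketing usually scatters the rows by the hash of a column and distributes them across the buckets.

Note that Hive hashes the bucketing column value and takes the hash modulo the number of buckets. Every row therefore lands in exactly one bucket, but the number of rows per bucket is not necessarily equal.

If another table has been split into small files by the same rule, a join between the two tables does not have to scan the whole table; it only needs to match the data of corresponding buckets, which improves efficiency.

With sufficiently large data, bucketing can give better query efficiency than partitioning for joins and sampling on the bucketing column.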


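3.1.3.1.2 Data sampling

In real big-data analysis the data volume is large, so development and self-testing are slow, which seriously delays the project. Bucketing can be used to sample the data in this situation. Sampling works on a representative subset of the result instead of the full result; analyzing the sample makes development and self-testing fast and saves considerable development cost.

3.1.3.2 Bucketing versus partitioning

Bucketing handles data at a finer granularity than partitioning: partitioning acts on the storage path of the data, bucketing acts on the data files;
Bucketing splits by a hash function of the column, so the result is relatively even; partitioning splits by the column value, which easily leads to data skew;
Bucketing and partitioning do not interfere with each other; a partitioned table can be bucketed further.

3.1.3.3 Operations

Creating a bucketed table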

create table test_buck(id int, name string) clustered by(id) sorted by (id asc) into 6 buckets row format delimited fields terminated by '\t';

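CLUSTERED BY names the column used to assign rows to buckets;

SORTED BY sorts the rows within each bucket by one or more columns;

into 6 buckets sets the number of buckets.

Bucketing rule: Hive takes the hash of the key modulo the number of buckets, which spreads the data roughly evenly across all buckets when the keys are well distributed.

Viewing the bucketed table's metadata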


desc formatted test_buck;

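Inserting data

-- enable bucketed writes
set hive.enforce.bucketing=true;
insert into table test_buck select id, name from temp_buck;

hive.enforce.bucketing controls whether bucketing is enforced on write. The default is false; when it is set to true, data written to the table is bucketed. (In Hive 2.x and later the property was removed and bucketing is always enforced.)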

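3.1.3.4 Handling text data

Note: load data must not be used to insert data into a bucketed table, because data imported with load data has no bucket structure.

How can an accidental load data into a bucketed table be prevented?

-- forbid load operations on bucketed tables
set hive.strict.checks.bucketing = true;

This option can also be changed in the Hive configuration in CM; a load data against a bucketed table then fails with an error.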


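So how should text data be handled?

(1) First create a temporary table and import the txt file into it with load data.

-- create the temporary table
create table temp_buck(id int, name string) row format delimited fields terminated by '\t';
-- import the data
load data local inpath '/tools/test_buck.txt' into table temp_buck;

(2) Then use an insert ... select statement to move the data from the temporary table into the bucketed table.

-- enable bucketed writes
set hive.enforce.bucketing=true;
-- forbid load operations on bucketed tables
set hive.strict.checks.bucketing = true;
-- insert select
insert into table test_buck select id, name from temp_buck;
-- the data is now bucketed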

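3.1.3.5 Data sampling

Tables are usually bucketed for two purposes: faster queries and sampling. With the steps above we can already create a bucketed table and load data into it. In production, with very large data sets, users sometimes need a representative result instead of the full result, for example during development and self-testing. Hive can satisfy this by sampling the table.

Syntax

select * from table tablesample(bucket x out of y on column)

Hive uses y to determine the sampling ratio. y should be a multiple or a factor of the table's total number of buckets.

For example, if the table has 10 buckets, y=2 samples 10/2 = 5 buckets of data and y=10 samples 10/10 = 1 bucket of data.

x is the bucket to start from. When several buckets are taken, each subsequent bucket number is the previous bucket number plus y.

For example, with 6 buckets in total, tablesample(bucket 1 out of 2) samples 6/2 = 3 buckets of data, starting from bucket 1: buckets 1 (x), 3 (x+y), and 5 (x+2y).

Note: x must be less than or equal to y. Otherwise an exception is thrown: FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck.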
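The bucket arithmetic in the examples above can be checked with a short Python sketch. It only covers the case where y is a factor of the total bucket count, and `sampled_buckets` is a hypothetical helper written for this illustration, not a Hive function.

```python
from typing import List


def sampled_buckets(total_buckets: int, x: int, y: int) -> List[int]:
    """1-based bucket numbers selected by `bucket x out of y`."""
    if not 1 <= x <= y:
        # Mirrors the "numerator bigger than denominator" restriction.
        raise ValueError("x must satisfy 1 <= x <= y")
    if total_buckets % y != 0:
        raise ValueError("this sketch only covers y being a factor of total_buckets")
    # Start at bucket x and step by y: x, x+y, x+2y, ...
    return list(range(x, total_buckets + 1, y))


print(sampled_buckets(6, 1, 2))
print(sampled_buckets(10, 1, 10))
```

The first call reproduces the 6-bucket example (buckets 1, 3, and 5) and the second the 10-bucket example with y=10 (a single bucket).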

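Example

select * from test_buck tablesample(bucket 1 out of 6 on id);

Note: sqoop does not support bucketed tables. To import data into a bucketed table with sqoop, go through an intermediate temporary table. Alternatively, leave ODS unbucketed and start bucketing at the DWD detail layer.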

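3.1.3.6 Map Join

As the name suggests, a MapJoin joins the tables in the Map phase instead of waiting for the Reduce phase. This avoids the large data transfer of the Shuffle phase and so optimizes the job.

For a MapJoin to work, one condition must hold: apart from the one table whose data is spread across the Map tasks, every other joined table must be available as a complete copy in each Map task.

MapJoin is therefore not suitable for every scenario. It is typically used when, of the two tables to be joined, one is very large and the other is small enough to be held in memory without hurting performance.

The small table's file is copied to the local storage of every Map task, and each Map task reads it into memory for later use.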
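The mechanism can be sketched in Python as a build phase and a probe phase. This is a simplified single-process illustration, not how Hive executes the join; `map_join` and the sample rows are invented for the example.

```python
def map_join(big_rows, small_rows, key):
    """Hash join: load the small table into memory, then stream the big table."""
    # Build phase: the whole small table becomes an in-memory hash table.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # Probe phase: each big-table row is matched locally, with no shuffle.
    joined = []
    for big in big_rows:
        for small in lookup.get(big[key], []):
            joined.append({**small, **big})
    return joined


small_a = [{"a": 1, "label": "x"}, {"a": 2, "label": "y"}]
big_b = [{"a": 1, "b": 10}, {"a": 1, "b": 11}, {"a": 3, "b": 12}]
result = map_join(big_b, small_a, "a")
print(result)
```

Because every mapper holds the full small table, a heavily skewed key in the big table is processed wherever its rows already are instead of piling up on one reducer.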

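Before Hive v0.7, a MapJoin was only performed when the hint /*+ mapjoin(table) */ was given. Later versions apply the optimization without a hint. It can now be controlled with the following setting:

set hive.auto.convert.join=true;

Hive also provides another parameter, a table file size that acts as the threshold for turning MapJoin on or off:

-- in older versions the parameter was hive.mapjoin.smalltable.filesize
set hive.auto.convert.join.noconditionaltask.size=512000000;

Note that this parameter has no effect if hive.auto.convert.join is off. Otherwise, with the value above, if the total size of N-1 of the N tables (or partitions) taking part in the join is below 512MB, the join is converted directly into a map join. The default value is 20MB.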
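The size rule in the note can be written as a small check. This is a sketch of the rule as stated in this article, not of Hive's planner; `can_convert_to_map_join` is a hypothetical helper and the table sizes are made-up numbers.

```python
def can_convert_to_map_join(table_sizes_bytes, threshold_bytes):
    """True if all tables except the largest fit under the threshold together."""
    if len(table_sizes_bytes) < 2:
        return False
    sizes = sorted(table_sizes_bytes)
    # The largest table is streamed; the other N-1 must fit in memory.
    return sum(sizes[:-1]) < threshold_bytes


THRESHOLD = 512_000_000  # value set in the example above
print(can_convert_to_map_join([3_000_000_000, 100_000_000, 200_000_000], THRESHOLD))
print(can_convert_to_map_join([3_000_000_000, 400_000_000, 200_000_000], THRESHOLD))
```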


When to use MapJoin:

1. One of the joined tables is very small

2. Non-equi joins

3.1.3.6.1 Joining a large table with a small table

select f.a,f.b from A t join B f on ( f.a=t.a and f.ftime=20110802)

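In this statement table B has 3 billion rows and table A has only 100. Table B is also heavily skewed, with 1.5 billion rows on a single key. The query runs very slowly, and the reduce phase either takes too long or runs out of memory.

MAPJOIN reads the whole small table into memory and matches the other table's rows against it directly in the map phase. Because the join is done in the map phase, the reduce phase is skipped and the query runs much faster.

A skewed key can then no longer overload a single reducer and make it fail. The original SQL can be rewritten with a hint that requests a mapjoin:

select /*+ mapjoin(A) */ f.a,f.b from A t join B f on ( f.a=t.a and f.ftime=20110802)

In practice it is enough to tune the small-table threshold to the business; Hive then performs the mapjoin automatically and improves execution efficiency.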

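3.1.3.6.2 Non-equi joins

Another major benefit of mapjoin is that it can perform non-equi joins. If the inequality condition is written in the where clause, the MapReduce job computes a Cartesian product and runs very inefficiently. With mapjoin the non-equi join is completed in the map phase, which is much more efficient.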

select A.a ,A.b from A join B where A.a>B.a

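3.1.3.7 Bucket-MapJoin

3.1.3.7.1 Purpose

When two tables are joined and the small table does not fit into memory, but a map-side join is still wanted, use a bucket map join. Both tables are hash-bucketed on the join key, and the bucket count of the (relatively) small table that is to be copied is set to a multiple of the big table's bucket count. The join key then determines the bucket on both sides. The small table is still copied to all nodes, but during the map join each group of small-table buckets is loaded into a hash table and joined locally with the one corresponding big-table bucket, so only part of the hash table has to be loaded at a time.

3.1.3.7.2 Conditions

1) set hive.optimize.bucketmapjoin = true;
2) The bucket count of one table is an integer multiple of the other table's bucket count
3) Bucket column == join column
4) It must be used in a map join scenario

Note: if the tables are not bucketed, an ordinary join is performed.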
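The bucket matching behind a bucket map join can be sketched in Python. The sketch assumes integer join keys bucketed by key modulo bucket count; `bucket_of`, `matching_small_buckets`, and `bucket_map_join` are illustrative names, not Hive APIs.

```python
from collections import defaultdict


def bucket_of(key, n):
    # Assumption: integer key, bucket = non-negative hash modulo bucket count.
    return (key & 0x7FFFFFFF) % n


def split(rows, key, n):
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[key], n)].append(row)
    return buckets


def matching_small_buckets(big_idx, n_big, n_small):
    """Small-table buckets that can contain keys of big-table bucket big_idx."""
    if n_small % n_big == 0:
        return [j for j in range(n_small) if j % n_big == big_idx]
    if n_big % n_small == 0:
        return [big_idx % n_small]
    raise ValueError("bucket counts must be integer multiples of each other")


def bucket_map_join(big_rows, small_rows, key, n_big, n_small):
    big_b, small_b = split(big_rows, key, n_big), split(small_rows, key, n_small)
    out = []
    for i in sorted(big_b):
        # Load only the small-table buckets that correspond to big bucket i.
        lookup = defaultdict(list)
        for j in matching_small_buckets(i, n_big, n_small):
            for s in small_b.get(j, []):
                lookup[s[key]].append(s)
        for b in big_b[i]:
            for s in lookup.get(b[key], []):
                out.append({**s, **b})
    return out


big = [{"id": i, "big": True} for i in (1, 2, 3, 4, 5)]
small = [{"id": i, "tag": f"s{i}"} for i in (2, 4, 6)]
joined = bucket_map_join(big, small, "id", n_big=2, n_small=4)
print(joined)
```

The integer-multiple condition is what makes `matching_small_buckets` well defined: without it, the keys of one big-table bucket could be spread over every small-table bucket.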

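3.1.3.8 SMB Join

SMB stands for Sort Merge Bucket Join.

3.1.3.8.1 Purpose

A join between a large table and a small table should be optimized with MapJoin. A join between two large tables is a different matter: shuffling them is slow, and it easily fails with errors. SMB Join can improve performance here. It builds on the sorted buckets of a bucket mapjoin, completes the join on the map side, and effectively reduces or avoids shuffled data. Its conditions are similar to, but not the same as, those of a map join.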
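The reason sorted buckets allow a map-side join of two large tables is that two sorted inputs can be merged in one pass without building a hash table. A minimal Python sketch of that merge step for a single pair of buckets, assuming both are already sorted ascending on the join key (`sort_merge_join` and the sample rows are illustrative):

```python
def sort_merge_join(left, right, key):
    """Inner join of two lists that are both sorted ascending by `key`."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row of this key with every right row of the run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out


left = [{"mid": 1, "l": "a"}, {"mid": 2, "l": "b"}, {"mid": 2, "l": "c"}, {"mid": 4, "l": "d"}]
right = [{"mid": 2, "r": "x"}, {"mid": 3, "r": "y"}, {"mid": 4, "r": "z"}]
merged = sort_merge_join(left, right, "mid")
print(merged)
```

If either input is not actually sorted, the pointers skip matching rows silently, which is why the sort order of the bucketed tables has to be guaranteed when the data is written.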

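3.1.3.8.2 Conditions

Bucket mapjoin:
1) set hive.optimize.bucketmapjoin = true;
2) The bucket count of one table is an integer multiple of the other table's bucket count
3) Bucket column == join column
4) It must be used in a map join scenario

SMB join:
1) set hive.optimize.bucketmapjoin = true; set hive.auto.convert.sortmerge.join=true; set hive.optimize.bucketmapjoin.sortedmerge = true; set hive.auto.convert.sortmerge.join.noconditionaltask=true;
2) Bucket count of the small table = bucket count of the big table
3) Bucket column == join column == sort column
4) It must be used in a bucket mapjoin scenario

3.1.3.8.3 Making sure buckets are sorted on the bucketing column

Hive does not check whether the two joined tables are really bucketed and sorted. The user has to guarantee that the data in the joined tables is sorted; otherwise the result may be incorrect.

There are two ways to do this:

1) Set hive.enforce.sorting to true. With enforced sorting on, data inserted into the table is sorted. The default is false.

2) When inserting data, use distribute by c1 sort by c1, or cluster by c1, in the SQL.

In addition, the table must be created as both CLUSTERED and SORTED, as follows: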

create table test_smb_2(mid string,age_id string) CLUSTERED BY(mid) SORTED BY(mid) INTO 500 BUCKETS;

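To sum up, the complete set of settings for working with bucketed tables is:

-- enforce bucketing on write
set hive.enforce.bucketing=true;
-- enforce sorting on write
set hive.enforce.sorting=true;
-- enable bucket map join
set hive.optimize.bucketmapjoin = true;
-- enable SMB join
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

The MapJoin settings (hive.auto.convert.join and hive.auto.convert.join.noconditionaltask.size) and the restriction on load operations for bucketed tables (hive.strict.checks.bucketing) can be set directly in the Hive configuration and do not need to be declared in the SQL.

Automatically attempting an SMB join (hive.optimize.bucketmapjoin.sortedmerge) can also be preconfigured in the settings.

3.1.4 Layers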


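3.1.4.1 ODS

Make compression take effect on write

set hive.exec.orc.compression.strategy=COMPRESSION;

Zipper table: the prospective-customer dashboard has a new requirement for the intent data. When customer_relationship rows are updated, the affected dimensions must be recomputed from the latest values (for example, if data from July 2020 is modified, the July statistics must be recalculated), and historical snapshots must be kept as well.

This calls for a slowly changing dimension. An SCD2 zipper table is recommended, because it satisfies both the update requirement and the historical-snapshot requirement. In addition to the start_time field, a new end_time field is added to mark when a version of a record was closed.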
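How start_time and end_time work together in an SCD2 zipper table can be shown with a small in-memory Python sketch. It assumes that open versions carry end_time = '9999-12-31' (the value used by the extraction queries later in this article) and that a superseded version is closed with the load date; the closing convention, the function name `apply_zipper`, and the sample rows are assumptions for illustration only.

```python
OPEN_END = "9999-12-31"  # marks the currently valid version of a record


def apply_zipper(history, changed_rows, load_date):
    """Close the open version of each changed id and append the new version."""
    changed_ids = {row["id"] for row in changed_rows}
    result = []
    for rec in history:
        if rec["id"] in changed_ids and rec["end_time"] == OPEN_END:
            # Assumption: the old version is closed with the load date.
            result.append({**rec, "end_time": load_date})
        else:
            result.append(dict(rec))
    for row in changed_rows:
        result.append({**row, "start_time": load_date, "end_time": OPEN_END})
    return result


history = [
    {"id": 1, "origin_type": "NETSERVICE", "start_time": "2020-07-01", "end_time": OPEN_END},
    {"id": 2, "origin_type": "PHONE", "start_time": "2020-07-01", "end_time": OPEN_END},
]
changed = [{"id": 1, "origin_type": "PHONE"}]
updated = apply_zipper(history, changed, "2020-07-02")

current = [r for r in updated if r["end_time"] == OPEN_END]
print(current)
```

Filtering on the open end date gives the latest state for recomputing statistics, while the closed rows remain available as the historical snapshot.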

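Internal and external tables: the ODS layer holds raw data that should generally not be modified, so external tables are used to keep the data safe and prevent accidental deletion. The ODS tables customer_relationship (customer intent) and customer_clue (customer clue) are exceptions: the zipper table needs overwrite operations, so they are not defined as external tables.

Bucketed ingestion: sqoop does not support bucketed tables. To import data into a bucketed table with sqoop, go through an intermediate temporary table. Alternatively, leave ODS unbucketed and start bucketing at the DWD detail layer.

Bucketed joins and sampling: the ODS tables customer_relationship and customer_clue are related. customer_relationship joins to customer_clue through its id and the clue table's customer_relationship_id, which yields the new/existing customer information. We therefore use these two fields as the bucketing columns. They can be used for data sampling and MapJoin.

Partitioning: in the earlier visit-and-consultation dashboard, a start_time field was added to the ODS model on top of the original MySQL table so that yesterday's data can easily be picked up by the T+1 extraction. The start_time field can also be used as the partition column to improve query performance.

3.1.4.1.1 customer_relationship customer intent table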

DROP TABLE itcast_ods.`customer_relationship`; CREATE TABLE IF NOT EXISTS itcast_ods.`customer_relationship` ( `id` int COMMENT '客户關係id', `create_date_time` STRING COMMENT '創建時間', `update_date_time` STRING COMMENT '最後更新時間', `deleted` int COMMENT '是否被刪除（禁用）', `customer_id` int COMMENT '所屬客户id', `first_id` int COMMENT '第一條客户關係id', `belonger` int COMMENT '歸屬人', `belonger_name` STRING COMMENT '歸屬人姓名', `initial_belonger` int COMMENT '初始歸屬人', `distribution_handler` int COMMENT '分配處理人', `business_scrm_department_id` int COMMENT '歸屬部門', `last_visit_time` STRING COMMENT '最後回訪時間', `next_visit_time` STRING COMMENT '下次回訪時間', `origin_type` STRING COMMENT '數據來源', `itcast_school_id` int COMMENT '校區Id', `itcast_subject_id` int COMMENT '學科Id', `intention_study_type` STRING COMMENT '意向學習方式', `anticipat_signup_date` STRING COMMENT '預計報名時間', `level` STRING COMMENT '客户級別', `creator` int COMMENT '創建人', `current_creator` int COMMENT '當前創建人：初始==創建人，當在公海拉回時為拉回人', `creator_name` STRING COMMENT '創建者姓名', `origin_channel` STRING COMMENT '來源渠道', `comment` STRING COMMENT '備註', `first_customer_clue_id` int COMMENT '第一條線索id', `last_customer_clue_id` int COMMENT '最後一條線索id', `process_state` STRING COMMENT '處理狀態', `process_time` STRING COMMENT '處理狀態變動時間', `payment_state` STRING COMMENT '支付狀態', `payment_time` STRING COMMENT '支付狀態變動時間', `signup_state` STRING COMMENT '報名狀態', `signup_time` STRING COMMENT '報名時間', `notice_state` STRING COMMENT '通知狀態', `notice_time` STRING COMMENT '通知狀態變動時間', `lock_state` STRING COMMENT '鎖定狀態', `lock_time` STRING COMMENT '鎖定狀態修改時間', `itcast_clazz_id` int COMMENT '所屬ems班級id', `itcast_clazz_time` STRING COMMENT '報班時間', `payment_url` STRING COMMENT '付款鏈接', `payment_url_time` STRING COMMENT '支付鏈接生成時間', `ems_student_id` int COMMENT 'ems的學生id', `delete_reason` STRING COMMENT '刪除原因', `deleter` int COMMENT '刪除人', `deleter_name` STRING COMMENT '刪除人姓名', `delete_time` STRING COMMENT '刪除時間', `course_id` int COMMENT '課程ID', `course_name` STRING COMMENT '課程名稱', `delete_comment` STRING 
COMMENT '刪除原因説明', `close_state` STRING COMMENT '關閉裝填', `close_time` STRING COMMENT '關閉狀態變動時間', `appeal_id` int COMMENT '申訴id', `tenant` int COMMENT '租户', `total_fee` DECIMAL COMMENT '報名費總金額', `belonged` int COMMENT '小週期歸屬人', `belonged_time` STRING COMMENT '歸屬時間', `belonger_time` STRING COMMENT '歸屬時間', `transfer` int COMMENT '轉移人', `transfer_time` STRING COMMENT '轉移時間', `follow_type` int COMMENT '分配類型，0-自動分配，1-手動分配，2-自動轉移，3-手動單個轉移，4-手動批量轉移，5-公海領取', `transfer_bxg_oa_account` STRING COMMENT '轉移到博學谷歸屬人OA賬號', `transfer_bxg_belonger_name` STRING COMMENT '轉移到博學谷歸屬人OA姓名', `end_time` STRING COMMENT '有效截止時間') comment '客户關係表' PARTITIONED BY(start_time STRING) clustered by(id) sorted by(id) into 10 buckets ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as orc TBLPROPERTIES ('orc.compress'='ZLIB');

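3.1.4.1.2 customer_clue customer clue table

The table is partitioned by the start_time field to speed up filtered queries. customer_clue is the fact table of the later valid-clue dashboard, whose requirements also ask for the affected dimensions to be recomputed from the latest values after updates and for historical snapshots. It is therefore built as a zipper table (SCD2), with a new end_time field that marks when a version of a record was closed.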

DROP TABLE itcast_ods.customer_clue; CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue ( id int COMMENT 'customer_clue_id', create_date_time STRING COMMENT '創建時間', update_date_time STRING COMMENT '最後更新時間', deleted STRING COMMENT '是否被刪除（禁用）', customer_id int COMMENT '客户id', customer_relationship_id int COMMENT '客户關係id', session_id STRING COMMENT '七陌會話id', sid STRING COMMENT '訪客id', status STRING COMMENT '狀態（undeal待領取 deal 已領取 finish 已關閉 changePeer 已流轉）', users STRING COMMENT '所屬坐席', create_time STRING COMMENT '七陌創建時間', platform STRING COMMENT '平台來源（pc-網站諮詢\|wap-wap諮詢\|sdk-app諮詢\|weixin-微信諮詢）', s_name STRING COMMENT '用户名稱', seo_source STRING COMMENT '搜索來源', seo_keywords STRING COMMENT '關鍵字', ip STRING COMMENT 'IP地址', referrer STRING COMMENT '上級來源頁面', from_url STRING COMMENT '會話來源頁面', landing_page_url STRING COMMENT '訪客着陸頁面', url_title STRING COMMENT '諮詢頁面title', to_peer STRING COMMENT '所屬技能組', manual_time STRING COMMENT '人工開始時間', begin_time STRING COMMENT '坐席領取時間 ', reply_msg_count int COMMENT '客服回覆消息數', total_msg_count int COMMENT '消息總數', msg_count int COMMENT '客户發送消息數', comment STRING COMMENT '備註', finish_reason STRING COMMENT '結束類型', finish_user STRING COMMENT '結束坐席', end_time STRING COMMENT '會話結束時間', platform_description STRING COMMENT '客户平台信息', browser_name STRING COMMENT '瀏覽器名稱', os_info STRING COMMENT '系統名稱', area STRING COMMENT '區域', country STRING COMMENT '所在國家', province STRING COMMENT '省', city STRING COMMENT '城市', creator int COMMENT '創建人', name STRING COMMENT '客户姓名', idcard STRING COMMENT '身份證號', phone STRING COMMENT '手機號', itcast_school_id int COMMENT '校區Id', itcast_school STRING COMMENT '校區', itcast_subject_id int COMMENT '學科Id', itcast_subject STRING COMMENT '學科', wechat STRING COMMENT '微信', qq STRING COMMENT 'qq號', email STRING COMMENT '郵箱', gender STRING COMMENT '性別', level STRING COMMENT '客户級別', origin_type STRING COMMENT '數據來源渠道', information_way STRING COMMENT '資訊方式', working_years STRING COMMENT '開始工作時間', technical_directions STRING COMMENT 
'技術方向', customer_state STRING COMMENT '當前客户狀態', valid STRING COMMENT '該線索是否是網資有效線索', anticipat_signup_date STRING COMMENT '預計報名時間', clue_state STRING COMMENT '線索狀態', scrm_department_id int COMMENT 'SCRM內部部門id', superior_url STRING COMMENT '諸葛獲取上級頁面URL', superior_source STRING COMMENT '諸葛獲取上級頁面URL標題', landing_url STRING COMMENT '諸葛獲取着陸頁面URL', landing_source STRING COMMENT '諸葛獲取着陸頁面URL來源', info_url STRING COMMENT '諸葛獲取留諮頁URL', info_source STRING COMMENT '諸葛獲取留諮頁URL標題', origin_channel STRING COMMENT '投放渠道', course_id int COMMENT '課程編號', course_name STRING COMMENT '課程名稱', zhuge_session_id STRING COMMENT 'zhuge會話id', is_repeat int COMMENT '是否重複線索(手機號維度) 0:正常 1：重複', tenant int COMMENT '租户id', activity_id STRING COMMENT '活動id', activity_name STRING COMMENT '活動名稱', follow_type int COMMENT '分配類型，0-自動分配，1-手動分配，2-自動轉移，3-手動單個轉移，4-手動批量轉移，5-公海領取', shunt_mode_id int COMMENT '匹配到的技能組id', shunt_employee_group_id int COMMENT '所屬分流員工組', ends_time STRING COMMENT '有效時間') comment '客户關係表' PARTITIONED BY(starts_time STRING) clustered by(customer_relationship_id) sorted by(customer_relationship_id) into 10 buckets ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as orc TBLPROPERTIES ('orc.compress'='ZLIB');

3.1.4.2 Dimen

External tables are used to keep the data safe.

Create the database

CREATE DATABASE IF NOT EXISTS itcast_dimen;

3.1.4.2.1 customer static customer information table

CREATE TABLE IF NOT EXISTS itcast_dimen.`customer` ( `id` int COMMENT 'key id', `customer_relationship_id` int COMMENT '當前意向id', `create_date_time` STRING COMMENT '創建時間', `update_date_time` STRING COMMENT '最後更新時間', `deleted` int COMMENT '是否被刪除（禁用）', `name` STRING COMMENT '姓名', `idcard` STRING COMMENT '身份證號', `birth_year` int COMMENT '出生年份', `gender` STRING COMMENT '性別', `phone` STRING COMMENT '手機號', `wechat` STRING COMMENT '微信', `qq` STRING COMMENT 'qq號', `email` STRING COMMENT '郵箱', `area` STRING COMMENT '所在區域', `leave_school_date` date COMMENT '離校時間', `graduation_date` date COMMENT '畢業時間', `bxg_student_id` STRING COMMENT '博學谷學員ID，可能未關聯到，不存在', `creator` int COMMENT '創建人ID', `origin_type` STRING COMMENT '數據來源', `origin_channel` STRING COMMENT '來源渠道', `tenant` int, `md_id` int COMMENT '中台id') comment '客户表' PARTITIONED BY(start_time STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as orc TBLPROPERTIES ('orc.compress'='SNAPPY');

3.1.4.2.2 employee table

CREATE TABLE IF NOT EXISTS itcast_dimen.employee ( id int COMMENT '員工id', email STRING COMMENT '公司郵箱，OA登錄賬號', real_name STRING COMMENT '員工的真實姓名', phone STRING COMMENT '手機號，目前還沒有使用；隱私問題OA接口沒有提供這個屬性，', department_id STRING COMMENT 'OA中的部門編號，有負值', department_name STRING COMMENT 'OA中的部門名', remote_login STRING COMMENT '員工是否可以遠程登錄', job_number STRING COMMENT '員工工號', cross_school STRING COMMENT '是否有跨校區權限', last_login_date STRING COMMENT '最後登錄日期', creator int COMMENT '創建人', create_date_time STRING COMMENT '創建時間', update_date_time STRING COMMENT '最後更新時間', deleted STRING COMMENT '是否被刪除（禁用）', scrm_department_id int COMMENT 'SCRM內部部門id', leave_office STRING COMMENT '離職狀態', leave_office_time STRING COMMENT '離職時間', reinstated_time STRING COMMENT '復職時間', superior_leaders_id int COMMENT '上級領導ID', tdepart_id int COMMENT '直屬部門', tenant int COMMENT '租户', ems_user_name STRING COMMENT 'ems用户名稱' ) comment '員工表' PARTITIONED BY(start_time STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as orc TBLPROPERTIES ('orc.compress'='SNAPPY');

3.1.4.2.3 scrm_department department table

CREATE TABLE IF NOT EXISTS itcast_dimen.`scrm_department` ( `id` int COMMENT '部門id', `name` STRING COMMENT '部門名稱', `parent_id` int COMMENT '父部門id', `create_date_time` STRING COMMENT '創建時間', `update_date_time` STRING COMMENT '更新時間', `deleted` STRING COMMENT '刪除標誌', `id_path` STRING COMMENT '編碼全路徑', `tdepart_code` int COMMENT '直屬部門', `creator` STRING COMMENT '創建者', `depart_level` int COMMENT '部門層級', `depart_sign` int COMMENT '部門標誌，暫時默認1', `depart_line` int COMMENT '業務線，存儲業務線編碼', `depart_sort` int COMMENT '排序字段', `disable_flag` int COMMENT '禁用標誌', `tenant` int COMMENT '租户') comment 'scrm部門表' PARTITIONED BY(start_time STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as orc TBLPROPERTIES ('orc.compress'='SNAPPY');

3.1.4.2.4 itcast_school campus table

CREATE TABLE IF NOT EXISTS itcast_dimen.`itcast_school` ( `id` int COMMENT '自增主鍵', `create_date_time` timestamp COMMENT '創建時間', `update_date_time` timestamp COMMENT '最後更新時間', `deleted` STRING COMMENT '是否被刪除（禁用）', `name` STRING COMMENT '校區名稱', `code` STRING COMMENT '校區標識', `tenant` int COMMENT '租户') comment '校區字典表' PARTITIONED BY(start_time STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as orc TBLPROPERTIES ('orc.compress'='SNAPPY');

3.1.4.2.5 itcast_subject subject table

CREATE TABLE IF NOT EXISTS itcast_dimen.`itcast_subject` ( `id` int COMMENT '自增主鍵', `create_date_time` timestamp COMMENT '創建時間', `update_date_time` timestamp COMMENT '最後更新時間', `deleted` STRING COMMENT '是否被刪除（禁用）', `name` STRING COMMENT '學科名稱', `code` STRING COMMENT '學科編碼', `tenant` int COMMENT '租户') comment '學科字典表' PARTITIONED BY(start_time STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as orc TBLPROPERTIES ('orc.compress'='SNAPPY');

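3.1.4.3 DWD

The ODS fact data in customer_relationship is cleaned, transformed, and stored in the DWD detail layer.

The DW and APP layers hold derived statistics. Internal tables are recommended here: they make operations such as overwrite inserts easier and, while still meeting the business requirements, speed up development and testing.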

drop table itcast_dwd.`itcast_intention_dwd`; CREATE TABLE IF NOT EXISTS itcast_dwd.`itcast_intention_dwd` ( `rid` int COMMENT 'id', `customer_id` STRING COMMENT '客户id', `create_date_time` STRING COMMENT '創建時間', `itcast_school_id` STRING COMMENT '校區id', `deleted` STRING COMMENT '是否被刪除', `origin_type` STRING COMMENT '來源渠道', `itcast_subject_id` STRING COMMENT '學科id', `creator` int COMMENT '創建人', `hourinfo` STRING COMMENT '小時信息', `origin_type_stat` STRING COMMENT '數據來源:0.線下；1.線上' ) comment '客户意向dwd表' PARTITIONED BY(yearinfo STRING,monthinfo STRING,dayinfo STRING) clustered by(rid) sorted by(rid) into 10 buckets ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as ORC TBLPROPERTIES ('orc.compress'='SNAPPY');
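The cleaning rules behind the derived DWD columns can be sketched in Python. The online/offline rule and the campus id rule come from the requirements earlier in this article; the second origin value is assumed to be spelled PRESIGNUP (the requirement text spells it PRESIDNUP), create_date_time is assumed to look like 'YYYY-MM-DD HH:MM:SS', and all function names are illustrative.

```python
ONLINE_ORIGIN_TYPES = {"NETSERVICE", "PRESIGNUP"}  # assumed spelling, see above


def origin_type_stat(origin_type):
    """'1' for online sources, '0' for everything else (offline)."""
    return "1" if origin_type in ONLINE_ORIGIN_TYPES else "0"


def normalize_school_id(school_id):
    """Map the missing-campus markers 0 and NULL to the single value -1."""
    return -1 if school_id is None or school_id == 0 else school_id


def time_parts(create_date_time):
    """Split 'YYYY-MM-DD HH:MM:SS' into the partition and hour fields."""
    return {
        "yearinfo": create_date_time[0:4],
        "monthinfo": create_date_time[5:7],
        "dayinfo": create_date_time[8:10],
        "hourinfo": create_date_time[11:13],
    }


row = {"origin_type": "NETSERVICE", "itcast_school_id": 0,
       "create_date_time": "2020-07-01 13:45:10"}
cleaned = {
    "origin_type_stat": origin_type_stat(row["origin_type"]),
    "itcast_school_id": normalize_school_id(row["itcast_school_id"]),
    **time_parts(row["create_date_time"]),
}
print(cleaned)
```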

3.1.4.4 DWM

Join all dimension tables and transform the retrieved fields so that they can be used directly in the aggregation.

create database itcast_dwm; drop table itcast_dwm.`itcast_intention_dwm`; CREATE TABLE IF NOT EXISTS itcast_dwm.`itcast_intention_dwm` ( `customer_id` STRING COMMENT 'id信息', `create_date_time` STRING COMMENT '創建時間', `area` STRING COMMENT '區域信息', `itcast_school_id` STRING COMMENT '校區id', `itcast_school_name` STRING COMMENT '校區名稱', `deleted` STRING COMMENT '是否被刪除', `origin_type` STRING COMMENT '來源渠道', `itcast_subject_id` STRING COMMENT '學科id', `itcast_subject_name` STRING COMMENT '學科名稱', `hourinfo` STRING COMMENT '小時信息', `origin_type_stat` STRING COMMENT '數據來源:0.線下；1.線上', `clue_state_stat` STRING COMMENT '新老客户：0.老客户；1.新客户', `tdepart_id` STRING COMMENT '創建者部門id', `tdepart_name` STRING COMMENT '諮詢中心名稱' ) comment '客户意向dwm表' PARTITIONED BY(yearinfo STRING,monthinfo STRING,dayinfo STRING) clustered by(customer_id) sorted by(customer_id) into 10 buckets ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as ORC TBLPROPERTIES ('orc.compress'='SNAPPY');

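3.1.4.5 DWS

On top of the DWM layer, aggregate according to the business requirements. There are three always-present dimensions plus the product-attribute dimension, each with its own attribute flag:

- Time: 1. year, 2. month, 3. day, 4. hour

- Data source: 0. offline; 1. online

- Customer attribute: 0. existing customer; 1. new customer

- Product attribute: 1. total intent count; 2. region; 3. campus and subject combined; 4. source channel; 5. contributing center;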

drop Table itcast_dws.itcast_intention_dws; CREATE TABLE IF NOT EXISTS itcast_dws.itcast_intention_dws ( `customer_total` INT COMMENT '聚合意向客户數', `area` STRING COMMENT '區域信息', `itcast_school_id` STRING COMMENT '校區id', `itcast_school_name` STRING COMMENT '校區名稱', `origin_type` STRING COMMENT '來源渠道', `itcast_subject_id` STRING COMMENT '學科id', `itcast_subject_name` STRING COMMENT '學科名稱', `hourinfo` STRING COMMENT '小時信息', `origin_type_stat` STRING COMMENT '數據來源:0.線下；1.線上', `clue_state_stat` STRING COMMENT '客户屬性：0.老客户；1.新客户', `tdepart_id` STRING COMMENT '創建者部門id', `tdepart_name` STRING COMMENT '諮詢中心名稱', `time_str` STRING COMMENT '時間明細', `groupType` STRING COMMENT '產品屬性類別：1.總意向量；2.區域信息；3.校區、學科組合分組；4.來源渠道；5.貢獻中心;', `time_type` STRING COMMENT '時間維度：1、按小時聚合；2、按天聚合；3、按周聚合；4、按月聚合；5、按年聚合；' ) comment '客户意向dws表' PARTITIONED BY(yearinfo STRING,monthinfo STRING,dayinfo STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as orc TBLPROPERTIES ('orc.compress'='SNAPPY');

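3.1.4.6 APP

If users need specific reports, an APP-layer structure can be designed per report page and then exported to the MySQL database of the OLAP system. This system uses FineReport, which needs a wide table for flexible presentation. The APP layer is therefore not refined further; the DWS layer is exported to MySQL directly.

3.2 Full-load workflow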


3.2.1 Data collection

3.2.1.1 Dimen layer

3.2.1.1.1 customer table

3.2.1.1.1.1 SQL:

select id, customer_relationship_id, create_date_time, update_date_time, deleted, name, idcard, birth_year, gender, phone, wechat, qq, email, area, leave_school_date, graduation_date, bxg_student_id, creator, origin_type, origin_channel, tenant, md_id, FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as start_time from customer;

3.2.1.1.1.2 Sqoop:

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id, customer_relationship_id, create_date_time, update_date_time, deleted, name, idcard, birth_year, gender, phone, wechat, qq, email, area, leave_school_date, graduation_date, bxg_student_id, creator, origin_type, origin_channel, tenant, md_id, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as start_time from customer where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table customer \
-m 100 \
--split-by id

3.2.1.1.2 employee table

3.2.1.1.2.1 SQL:

select id, email, real_name, -1 as phone, department_id, department_name, remote_login, job_number, cross_school, last_login_date, creator, create_date_time, update_date_time, deleted, scrm_department_id, leave_office, leave_office_time, reinstated_time, superior_leaders_id, tdepart_id, tenant, ems_user_name, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as start_time from employee;

3.2.1.1.2.2 Sqoop:

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id,email,real_name,-1 as phone,department_id,department_name,remote_login,job_number,cross_school,last_login_date,creator,create_date_time,update_date_time,deleted,scrm_department_id,leave_office,leave_office_time,reinstated_time,superior_leaders_id,tdepart_id,tenant,ems_user_name,FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as start_time from employee where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table employee \
-m 100 \
--split-by id

3.2.1.1.3 scrm_department department table

3.2.1.1.3.1 SQL:

select *, FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as start_time from scrm_department;

3.2.1.1.3.2 Sqoop:

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select *, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as start_time from scrm_department where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table scrm_department \
-m 100 \
--split-by id

3.2.1.1.4 itcast_school campus table

3.2.1.1.4.1 SQL:

select *, FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as start_time from itcast_school;

3.2.1.1.4.2 Sqoop:

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select *, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as start_time from itcast_school where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table itcast_school \
-m 100 \
--split-by id

3.2.1.1.5 itcast_subject subject table

3.2.1.1.5.1 SQL:

select *, FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as start_time from itcast_subject;

3.2.1.1.5.2 Sqoop:

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select *, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as start_time from itcast_subject where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table itcast_subject \
-m 100 \
--split-by id

3.2.1.2 ODS layer

Sqoop does not support bucketed tables, so the import has to go through a temporary table.

3.2.1.2.1 customer_relationship intent table

SQL:

select id, create_date_time, update_date_time, deleted, customer_id, first_id, belonger, belonger_name, initial_belonger, distribution_handler, business_scrm_department_id, last_visit_time, next_visit_time, origin_type, itcast_school_id, itcast_subject_id, intention_study_type, anticipat_signup_date, level, creator, current_creator, creator_name, origin_channel, comment, first_customer_clue_id, last_customer_clue_id, process_state, process_time, payment_state, payment_time, signup_state, signup_time, notice_state, notice_time, lock_state, lock_time, itcast_clazz_id, itcast_clazz_time, payment_url, payment_url_time, ems_student_id, delete_reason, deleter, deleter_name, delete_time, course_id, course_name, delete_comment, close_state, close_time, appeal_id, tenant, total_fee, belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type, transfer_bxg_oa_account, transfer_bxg_belonger_name, FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as start_time, date_format("9999-12-31", "%Y-%m-%d") as end_time from customer_relationship;

Sqoop:

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id, create_date_time, update_date_time, deleted, customer_id, first_id, belonger, belonger_name, initial_belonger, distribution_handler, business_scrm_department_id, last_visit_time, next_visit_time, origin_type, itcast_school_id, itcast_subject_id, intention_study_type, anticipat_signup_date, level, creator, current_creator, creator_name, origin_channel, comment, first_customer_clue_id, last_customer_clue_id, process_state, process_time, payment_state, payment_time, signup_state, signup_time, notice_state, notice_time, lock_state, lock_time, itcast_clazz_id, itcast_clazz_time, payment_url, payment_url_time, ems_student_id, delete_reason, deleter, deleter_name, delete_time, course_id, course_name, delete_comment, close_state, close_time, appeal_id, tenant, total_fee, belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type, transfer_bxg_oa_account, transfer_bxg_belonger_name, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as start_time,date_format("9999-12-31","%Y-%m-%d") as end_time from customer_relationship where $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_relationship \
-m 10 \
--split-by id

Error:


common.HCatException : 2016 : Error operation not supported : Store into a partition with bucket definition from Pig/Mapreduce is not supported

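This error occurs because sqoop does not support importing data into a bucketed table. So what can be done if the ODS layer should still be bucketed?

We can extract the data into a temporary table first and then copy the temporary table's data into the bucketed ODS table.

3.2.1.2.1.1 Rebuild the ODS temporary table, making sure it is not bucketed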

DROP TABLE itcast_ods.`customer_relationship_tmp`; CREATE TABLE IF NOT EXISTS itcast_ods.`customer_relationship_tmp` ( `id` int COMMENT '客户關係id', `create_date_time` STRING COMMENT '創建時間', `update_date_time` STRING COMMENT '最後更新時間', `deleted` int COMMENT '是否被刪除（禁用）', `customer_id` int COMMENT '所屬客户id', `first_id` int COMMENT '第一條客户關係id', `belonger` int COMMENT '歸屬人', `belonger_name` STRING COMMENT '歸屬人姓名', `initial_belonger` int COMMENT '初始歸屬人', `distribution_handler` int COMMENT '分配處理人', `business_scrm_department_id` int COMMENT '歸屬部門', `last_visit_time` STRING COMMENT '最後回訪時間', `next_visit_time` STRING COMMENT '下次回訪時間', `origin_type` STRING COMMENT '數據來源', `itcast_school_id` int COMMENT '校區Id', `itcast_subject_id` int COMMENT '學科Id', `intention_study_type` STRING COMMENT '意向學習方式', `anticipat_signup_date` STRING COMMENT '預計報名時間', `level` STRING COMMENT '客户級別', `creator` int COMMENT '創建人', `current_creator` int COMMENT '當前創建人：初始==創建人，當在公海拉回時為拉回人', `creator_name` STRING COMMENT '創建者姓名', `origin_channel` STRING COMMENT '來源渠道', `comment` STRING COMMENT '備註', `first_customer_clue_id` int COMMENT '第一條線索id', `last_customer_clue_id` int COMMENT '最後一條線索id', `process_state` STRING COMMENT '處理狀態', `process_time` STRING COMMENT '處理狀態變動時間', `payment_state` STRING COMMENT '支付狀態', `payment_time` STRING COMMENT '支付狀態變動時間', `signup_state` STRING COMMENT '報名狀態', `signup_time` STRING COMMENT '報名時間', `notice_state` STRING COMMENT '通知狀態', `notice_time` STRING COMMENT '通知狀態變動時間', `lock_state` STRING COMMENT '鎖定狀態', `lock_time` STRING COMMENT '鎖定狀態修改時間', `itcast_clazz_id` int COMMENT '所屬ems班級id', `itcast_clazz_time` STRING COMMENT '報班時間', `payment_url` STRING COMMENT '付款鏈接', `payment_url_time` STRING COMMENT '支付鏈接生成時間', `ems_student_id` int COMMENT 'ems的學生id', `delete_reason` STRING COMMENT '刪除原因', `deleter` int COMMENT '刪除人', `deleter_name` STRING COMMENT '刪除人姓名', `delete_time` STRING COMMENT '刪除時間', `course_id` int COMMENT '課程ID', `course_name` STRING COMMENT '課程名稱', `delete_comment` 
STRING COMMENT '刪除原因説明', `close_state` STRING COMMENT '關閉裝填', `close_time` STRING COMMENT '關閉狀態變動時間', `appeal_id` int COMMENT '申訴id', `tenant` int COMMENT '租户', `total_fee` DECIMAL COMMENT '報名費總金額', `belonged` int COMMENT '小週期歸屬人', `belonged_time` STRING COMMENT '歸屬時間', `belonger_time` STRING COMMENT '歸屬時間', `transfer` int COMMENT '轉移人', `transfer_time` STRING COMMENT '轉移時間', `follow_type` int COMMENT '分配類型，0-自動分配，1-手動分配，2-自動轉移，3-手動單個轉移，4-手動批量轉移，5-公海領取', `transfer_bxg_oa_account` STRING COMMENT '轉移到博學谷歸屬人OA賬號', `transfer_bxg_belonger_name` STRING COMMENT '轉移到博學谷歸屬人OA姓名', `end_time` STRING COMMENT '有效截止時間') comment '客户關係表' PARTITIONED BY(start_time STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as orc TBLPROPERTIES ('orc.compress'='ZLIB');

3.2.1.2.1.2 Extract the data into the temporary table

SQL:

select id, create_date_time, update_date_time, deleted, customer_id, first_id, belonger, belonger_name, initial_belonger, distribution_handler, business_scrm_department_id, last_visit_time, next_visit_time, origin_type, itcast_school_id, itcast_subject_id, intention_study_type, anticipat_signup_date, level, creator, current_creator, creator_name, origin_channel, comment, first_customer_clue_id, last_customer_clue_id, process_state, process_time, payment_state, payment_time, signup_state, signup_time, notice_state, notice_time, lock_state, lock_time, itcast_clazz_id, itcast_clazz_time, payment_url, payment_url_time, ems_student_id, delete_reason, deleter, deleter_name, delete_time, course_id, course_name, delete_comment, close_state, close_time, appeal_id, tenant, total_fee, belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type, transfer_bxg_oa_account, transfer_bxg_belonger_name, FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as start_time, date_format("9999-12-31", "%Y-%m-%d") as end_time from customer_relationship;

Sqoop:

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id, create_date_time, update_date_time, deleted, customer_id, first_id, belonger, belonger_name, initial_belonger, distribution_handler, business_scrm_department_id, last_visit_time, next_visit_time, origin_type, itcast_school_id, itcast_subject_id, intention_study_type, anticipat_signup_date, level, creator, current_creator, creator_name, origin_channel, comment, first_customer_clue_id, last_customer_clue_id, process_state, process_time, payment_state, payment_time, signup_state, signup_time, notice_state, notice_time, lock_state, lock_time, itcast_clazz_id, itcast_clazz_time, payment_url, payment_url_time, ems_student_id, delete_reason, deleter, deleter_name, delete_time, course_id, course_name, delete_comment, close_state, close_time, appeal_id, tenant, total_fee, belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type, transfer_bxg_oa_account, transfer_bxg_belonger_name, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as start_time,date_format("9999-12-31","%Y-%m-%d") as end_time from customer_relationship where $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_relationship_tmp \
-m 10 \
--split-by id

3.2.1.2.1.3 Overwrite-insert the data into ODS

-- partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
-- Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- make compression take effect on write
set hive.exec.orc.compression.strategy=COMPRESSION;
-- bucketing
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
insert overwrite table itcast_ods.customer_relationship partition(start_time) select * from itcast_ods.customer_relationship_tmp;

3.2.1.2.2 customer_clue clue table

3.2.1.2.2.1 Rebuild the ODS temporary table, making sure it is not bucketed

DROP TABLE itcast_ods.customer_clue_tmp; CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue_tmp ( id int COMMENT 'customer_clue_id', create_date_time STRING COMMENT '創建時間', update_date_time STRING COMMENT '最後更新時間', deleted STRING COMMENT '是否被刪除（禁用）', customer_id int COMMENT '客户id', customer_relationship_id int COMMENT '客户關係id', session_id STRING COMMENT '七陌會話id', sid STRING COMMENT '訪客id', status STRING COMMENT '狀態（undeal待領取 deal 已領取 finish 已關閉 changePeer 已流轉）', users STRING COMMENT '所屬坐席', create_time STRING COMMENT '七陌創建時間', platform STRING COMMENT '平台來源（pc-網站諮詢\|wap-wap諮詢\|sdk-app諮詢\|weixin-微信諮詢）', s_name STRING COMMENT '用户名稱', seo_source STRING COMMENT '搜索來源', seo_keywords STRING COMMENT '關鍵字', ip STRING COMMENT 'IP地址', referrer STRING COMMENT '上級來源頁面', from_url STRING COMMENT '會話來源頁面', landing_page_url STRING COMMENT '訪客着陸頁面', url_title STRING COMMENT '諮詢頁面title', to_peer STRING COMMENT '所屬技能組', manual_time STRING COMMENT '人工開始時間', begin_time STRING COMMENT '坐席領取時間 ', reply_msg_count int COMMENT '客服回覆消息數', total_msg_count int COMMENT '消息總數', msg_count int COMMENT '客户發送消息數', comment STRING COMMENT '備註', finish_reason STRING COMMENT '結束類型', finish_user STRING COMMENT '結束坐席', end_time STRING COMMENT '會話結束時間', platform_description STRING COMMENT '客户平台信息', browser_name STRING COMMENT '瀏覽器名稱', os_info STRING COMMENT '系統名稱', area STRING COMMENT '區域', country STRING COMMENT '所在國家', province STRING COMMENT '省', city STRING COMMENT '城市', creator int COMMENT '創建人', name STRING COMMENT '客户姓名', idcard STRING COMMENT '身份證號', phone STRING COMMENT '手機號', itcast_school_id int COMMENT '校區Id', itcast_school STRING COMMENT '校區', itcast_subject_id int COMMENT '學科Id', itcast_subject STRING COMMENT '學科', wechat STRING COMMENT '微信', qq STRING COMMENT 'qq號', email STRING COMMENT '郵箱', gender STRING COMMENT '性別', level STRING COMMENT '客户級別', origin_type STRING COMMENT '數據來源渠道', information_way STRING COMMENT '資訊方式', working_years STRING COMMENT '開始工作時間', technical_directions STRING 
COMMENT '技術方向', customer_state STRING COMMENT '當前客户狀態', valid STRING COMMENT '該線索是否是網資有效線索', anticipat_signup_date STRING COMMENT '預計報名時間', clue_state STRING COMMENT '線索狀態', scrm_department_id int COMMENT 'SCRM內部部門id', superior_url STRING COMMENT '諸葛獲取上級頁面URL', superior_source STRING COMMENT '諸葛獲取上級頁面URL標題', landing_url STRING COMMENT '諸葛獲取着陸頁面URL', landing_source STRING COMMENT '諸葛獲取着陸頁面URL來源', info_url STRING COMMENT '諸葛獲取留諮頁URL', info_source STRING COMMENT '諸葛獲取留諮頁URL標題', origin_channel STRING COMMENT '投放渠道', course_id int COMMENT '課程編號', course_name STRING COMMENT '課程名稱', zhuge_session_id STRING COMMENT 'zhuge會話id', is_repeat int COMMENT '是否重複線索(手機號維度) 0:正常 1：重複', tenant int COMMENT '租户id', activity_id STRING COMMENT '活動id', activity_name STRING COMMENT '活動名稱', follow_type int COMMENT '分配類型，0-自動分配，1-手動分配，2-自動轉移，3-手動單個轉移，4-手動批量轉移，5-公海領取', shunt_mode_id int COMMENT '匹配到的技能組id', shunt_employee_group_id int COMMENT '所屬分流員工組', ends_time STRING COMMENT '有效時間') comment '客户關係表' PARTITIONED BY(starts_time STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as orc TBLPROPERTIES ('orc.compress'='ZLIB');

2.1.1.2.2.2 Extract the data into the temporary table

SQL:

select id, create_date_time, update_date_time, deleted, customer_id, customer_relationship_id, session_id, sid, status, user, create_time, platform, s_name, seo_source, seo_keywords, ip, referrer, from_url, landing_page_url, url_title, to_peer, manual_time, begin_time, reply_msg_count, total_msg_count, msg_count, comment, finish_reason, finish_user, end_time, platform_description, browser_name, os_info, area, country, province, city, creator, name, idcard, phone, itcast_school_id, itcast_school, itcast_subject_id, itcast_subject, wechat, qq, email, gender, level, origin_type, information_way, working_years, technical_directions, customer_state, valid, anticipat_signup_date, clue_state, scrm_department_id, superior_url, superior_source, landing_url, landing_source, info_url, info_source, origin_channel, course_id, course_name, zhuge_session_id, is_repeat, tenant, activity_id, activity_name, follow_type, shunt_mode_id, shunt_employee_group_id, FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as starts_time, date_format("9999-12-31", "%Y-%m-%d") as ends_time from customer_clue;

Sqoop:

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id,create_date_time,update_date_time,deleted,customer_id,customer_relationship_id,session_id,sid,status,user as users,create_time,platform,s_name,seo_source,seo_keywords,ip,referrer,from_url,landing_page_url,url_title,to_peer,manual_time,begin_time,reply_msg_count,total_msg_count,msg_count,comment,finish_reason,finish_user,end_time,platform_description,browser_name,os_info,area,country,province,city,creator,name,"-1" as idcard,"-1" as phone,itcast_school_id,itcast_school,itcast_subject_id,itcast_subject,"-1" as wechat,"-1" as qq,"-1" as email,gender,level,origin_type,information_way,working_years,technical_directions,customer_state,valid,anticipat_signup_date,clue_state,scrm_department_id,superior_url,superior_source,landing_url,landing_source,info_url,info_source,origin_channel,course_id,course_name,zhuge_session_id,is_repeat,tenant,activity_id,activity_name,follow_type,shunt_mode_id,shunt_employee_group_id,FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as starts_time,date_format("9999-12-31","%Y-%m-%d") as ends_time from customer_clue where $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_clue_tmp \
-m 10 \
--split-by id

2.1.1.2.2.3 Overwrite the ODS table with the extracted data

insert overwrite table itcast_ods.customer_clue partition(starts_time) select * from itcast_ods.customer_clue_tmp;

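3.2.2 Data cleansing and transformation

3.2.2.1 Hive execution plans

2.1.1.2.3 Purpose

After a user submits a HiveQL query, Hive translates the statement into MapReduce jobs and runs the whole process automatically. In most cases we do not need to know how it works internally.

The execution plan exposes the key details of how a query will run, which lets us check whether an optimization has actually taken effect.

3.2.2.1.1 Basic syntax

EXPLAIN is simple to use: put EXPLAIN in front of a normal HiveQL statement. The statement is not actually executed as a job; Hive only prints the execution path chosen by its optimizer:

EXPLAIN [EXTENDED] query

EXTENDED prints more detailed information.

3.2.2.1.2 The two parts of an execution plan

Stage dependencies (STAGE DEPENDENCIES)

(1) This part shows that the query is split into two stages: Stage-1 and Stage-0.

(2) Stage-0 is usually the stage that returns the data to the user; a LIMIT operation, for example, appears in this part.

(3) Stage-1 is the stage in which the MapReduce program runs.

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1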

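Detailed stage plans (STAGE PLANS)

(1) This part contains most of the processing steps of every stage in the query.

(2) It is the main place to check whether a particular optimization has taken effect.

Glossary

TableScan (the table scan operator)

alias: emp (the table being read)

Statistics: Num rows: 2 Data size: 820 Basic stats: COMPLETE Column stats: NONE (basic statistics for the table: row count, data size, and so on)

expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: double), comm (type: double), deptno (type: int) (the columns to output and their types)

outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7 (the internal names assigned to the output columns)

compressed: true (whether the output is compressed)

input format: org.apache.hadoop.mapred.SequenceFileInputFormat (the Java class used to read the files; this class reads Hadoop SequenceFiles, not plain text)

output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat (the Java class used to write the files; this class writes SequenceFiles, not plain text)

3.2.2.1.3 Example

Execution plan for the DWD stage: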

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: rs
            Statistics: Num rows: 1109147 Data size: 236547154 Basic stats: COMPLETE Column stats: COMPLETE
            Filter Operator
              predicate: (((hash(id) & 2147483647) % 10) = 0) (type: boolean)
              Statistics: Num rows: 554573 Data size: 118273474 Basic stats: COMPLETE Column stats: COMPLETE
              Select Operator
                expressions: id (type: int), customer_id (type: int), create_date_time (type: string), if((itcast_school_id is null or (itcast_school_id = 0)), -1, itcast_school_id) (type: int), deleted (type: int), origin_type (type: string), if((itcast_subject_id is null or (itcast_subject_id = 0)), -1, itcast_subject_id) (type: int), substr(create_date_time, 12, 2) (type: string), if((origin_type = 'NETSERVICE'), '1', if((origin_type = 'PRESIGNUP'), '1', '0')) (type: string), substr(create_date_time, 1, 4) (type: string), substr(create_date_time, 6, 2) (type: string), substr(create_date_time, 9, 2) (type: string)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
                Statistics: Num rows: 554573 Data size: 631104074 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 554573 Data size: 631104074 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

3.2.2.2 DWD

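3.2.2.2.1 Analysis

In the DWD layer, the customer_relationship intent-customer fact table is cleansed and transformed:

Filter out records that have been deleted;

Check the school id and the subject id, converting both null and 0 to -1;

Convert the origin_type source-channel field into online/offline: if origin_type is NETSERVICE or PRESIGNUP the row is 1 (online), otherwise 0 (offline).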
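The three rules above, together with the substr-based time fields used in the code that follows, can be sketched for a single row in Python. This is only an illustration of the logic, not the pipeline's code; `clean_row`, `normalize_id` and the dict-shaped row are hypothetical names chosen for the example.

```python
from typing import Optional

ONLINE_TYPES = {"NETSERVICE", "PRESIGNUP"}


def normalize_id(value: Optional[int]) -> int:
    """Map a missing (None) or zero id to -1; leave other ids unchanged."""
    return -1 if value is None or value == 0 else value


def clean_row(row: dict) -> Optional[dict]:
    """Apply the DWD rules to one source row; return None for deleted rows."""
    if row["deleted"] != 0:
        return None  # rule 1: drop logically deleted records
    ts = row["create_date_time"]  # expected shape: 'YYYY-MM-DD HH:MM:SS'
    return {
        "rid": row["id"],
        "customer_id": row["customer_id"],
        # rule 2: null and 0 both become -1
        "itcast_school_id": normalize_id(row["itcast_school_id"]),
        "itcast_subject_id": normalize_id(row["itcast_subject_id"]),
        "origin_type": row["origin_type"],
        # rule 3: '1' = online, '0' = offline
        "origin_type_stat": "1" if row["origin_type"] in ONLINE_TYPES else "0",
        # Hive substr is 1-based, Python slices are 0-based
        "yearinfo": ts[0:4],
        "monthinfo": ts[5:7],
        "dayinfo": ts[8:10],
        "hourinfo": ts[11:13],
    }
```

Returning None for deleted rows plays the role of the where filter; everything else corresponds to one expression in the select list.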

3.2.2.2.2 Code

--Partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--Apply compression on write
set hive.exec.orc.compression.strategy=COMPRESSION;
--Bucketing
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

insert into table itcast_dwd.itcast_intention_dwd partition (yearinfo,monthinfo,dayinfo)
select
    rs.id as rid,
    rs.customer_id,
    rs.create_date_time,
    if((rs.itcast_school_id is null) or (rs.itcast_school_id = 0), -1, rs.itcast_school_id) as itcast_school_id,
    rs.deleted,
    rs.origin_type,
    if((rs.itcast_subject_id is null) or (rs.itcast_subject_id = 0), -1, rs.itcast_subject_id) as itcast_subject_id,
    substr(rs.create_date_time, 12, 2) hourinfo,
    if(rs.origin_type='NETSERVICE', '1', if(rs.origin_type='PRESIGNUP', '1', '0')) as origin_type_stat,
    substr(rs.create_date_time, 1, 4) yearinfo,
    substr(rs.create_date_time, 6, 2) monthinfo,
    substr(rs.create_date_time, 9, 2) dayinfo
from itcast_ods.customer_relationship rs
where rs.deleted = 0;

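3.2.2.2.3 Testing

For testing, the data can be sampled either by partition or by bucket.

A partition targets a fixed date, whereas bucket sampling spot-checks the whole data set and is therefore more representative. Because the first load is a full extract, the data under the date partitions is very large, so sampling by bucket makes development and testing faster.

Note where the tablesample keyword goes: after the table name and before the alias.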
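The DWD execution plan shown earlier turns `tablesample(bucket 1 out of 10 on id)` into the filter `((hash(id) & 2147483647) % 10) = 0`. A minimal Python sketch of that predicate follows. It assumes that Hive's hash of an int column is the integer itself, which is an assumption of this sketch; other column types hash differently. The function name `in_bucket` is hypothetical.

```python
def in_bucket(row_id: int, bucket: int = 1, total: int = 10) -> bool:
    """Mimic the sampling filter for `bucket <bucket> out of <total> on id`.

    Assumes hash(id) == id for int columns; buckets are numbered from 1,
    so bucket 1 corresponds to remainder 0.
    """
    return ((row_id & 2147483647) % total) == bucket - 1


ids = range(1, 101)
sample = [i for i in ids if in_bucket(i)]
print(len(sample), sample[:3])  # 10 [10, 20, 30]
```

With evenly spread ids, one bucket out of ten keeps roughly a tenth of the rows, which is why the sampled run is so much cheaper than the full one.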

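2.1.1.2.4 Verifying with the execution plan

Put EXPLAIN in front of the statement and look at the execution plan first. The plan shows that bucket sampling has taken effect, which speeds up execution during development and testing.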

insert into table itcast_dwd.itcast_intention_dwd partition (yearinfo,monthinfo,dayinfo) select rs.id as rid, rs.customer_id, rs.create_date_time, if((rs.itcast_school_id is null) or (rs.itcast_school_id = 0), -1, rs.itcast_school_id) as itcast_school_id, rs.deleted, rs.origin_type, if((rs.itcast_subject_id is null) or (rs.itcast_subject_id = 0), -1, rs.itcast_subject_id) as itcast_subject_id, substr(rs.create_date_time, 12, 2) hourinfo, if(rs.origin_type='NETSERVICE', '1', if(rs.origin_type='PRESIGNUP', '1', '0')) as origin_type_stat, substr(rs.create_date_time, 1, 4) yearinfo, substr(rs.create_date_time, 6, 2) monthinfo, substr(rs.create_date_time, 9, 2) dayinfo from itcast_ods.customer_relationship tablesample(bucket 1 out of 10 on id) rs where rs.deleted = 0;

2.1.1.2.5 Dynamic partition errors

Raise the limits on the number of dynamic partitions and created files by adding the following before the SQL:

set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;

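2.1.1.2.6 Out-of-memory errors

If you run into out-of-memory errors caused by the hardware configuration, there are several ways to deal with them:

2.1.1.2.6.1 Sufficient physical memory

Configure the cluster the same way as the memory increase described for the visit and consultation dashboard:

Increase the memory of the YARN NodeManager

Modify the parameter yarn.nodemanager.resource.memory-mb.

Increase the MapReduce memory settings

Modify the parameters mapreduce.map.java.opts, mapreduce.reduce.java.opts, mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.

2.1.1.2.6.2 Insufficient physical memory

Enable sorted dynamic partitioning (hive.optimize.sort.dynamic.partition) and disable Map Join (hive.auto.convert.join); the job will run more slowly, though.

Alternatively, use a where condition to cleanse and transform the data in batches by date.

Check how the data is distributed across years: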

select count(1), substr(create_date_time, 1, 4) from itcast_ods.customer_relationship group by substr(create_date_time, 1, 4);

The result shows that the data is spread fairly evenly across years, so the computation can be run in batches, one year at a time.

insert into table itcast_dwd.itcast_intention_dwd partition (yearinfo,monthinfo,dayinfo)
select
    rs.id as rid,
    rs.customer_id,
    rs.create_date_time,
    if((rs.itcast_school_id is null) or (rs.itcast_school_id = 0), -1, rs.itcast_school_id) as itcast_school_id,
    rs.deleted,
    rs.origin_type,
    if((rs.itcast_subject_id is null) or (rs.itcast_subject_id = 0), -1, rs.itcast_subject_id) as itcast_subject_id,
    substr(rs.create_date_time, 12, 2) hourinfo,
    if(rs.origin_type='NETSERVICE', '1', if(rs.origin_type='PRESIGNUP', '1', '0')) as origin_type_stat,
    substr(rs.create_date_time, 1, 4) yearinfo,
    substr(rs.create_date_time, 6, 2) monthinfo,
    substr(rs.create_date_time, 9, 2) dayinfo
from itcast_ods.customer_relationship tablesample(bucket 1 out of 10 on id) rs
where rs.deleted = 0
  and rs.create_date_time >= '2011-01-01 00:00:00'
  and rs.create_date_time < '2012-01-01 00:00:00';
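When the load is split into one batch per year, the date conditions can be generated rather than typed by hand. The sketch below builds half-open windows (start inclusive, end exclusive) so that a row stamped exactly at midnight on 1 January lands in one batch only. The helper names and the 2011 to 2013 range are illustrative assumptions, not values taken from the data.

```python
def year_windows(first_year: int, last_year: int):
    """Yield (start, end) timestamp strings, one half-open window per year."""
    for year in range(first_year, last_year + 1):
        yield (f"{year}-01-01 00:00:00", f"{year + 1}-01-01 00:00:00")


def where_clause(start: str, end: str) -> str:
    # start is inclusive and end is exclusive, so batches never overlap
    return (f"rs.create_date_time >= '{start}' "
            f"and rs.create_date_time < '{end}'")


for start, end in year_windows(2011, 2013):
    print(where_clause(start, end))
```

Each printed condition can be appended to the where clause of the insert statement for one batch.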

2.1.1.2.6.3 Local mode (virtual machine environment)

set hive.exec.mode.local.auto=true;

3.2.2.3 DWM

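3.2.2.3.1 Analysis

The intent-customer metric ultimately counts deduplicated customers, so it cannot be computed by counting first and summing afterwards. In the DWM middle layer we therefore do no aggregation; we only join the relevant dimension data and derive the fields we need.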
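A tiny Python example shows why a distinct count cannot be rolled up by summing. The three records are made up for illustration: customer 1 has an intent record on two different days of the same month.

```python
from collections import defaultdict

# (customer_id, day) pairs; customer 1 appears on two days
records = [(1, "2019-05-01"), (1, "2019-05-02"), (2, "2019-05-02")]

per_day = defaultdict(set)
for customer_id, day in records:
    per_day[day].add(customer_id)

# Count per day first, then sum: customer 1 is counted twice
sum_of_daily_counts = sum(len(ids) for ids in per_day.values())
# Deduplicate over the whole month: the figure the metric asks for
monthly_distinct = len({customer_id for customer_id, _ in records})

print(sum_of_daily_counts)  # 3
print(monthly_distinct)     # 2
```

Every time granularity therefore has to run its own count(distinct ...) over the detail rows, which is why the detail is kept unaggregated at this layer.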


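Join id to customer_relationship_id in the customer_clue table and convert clue_state into a new/old customer flag: VALID_NEW_CLUES means a new customer, VALID_PUBLIC_NEW_CLUE means an old customer, and anything else is invalid data.

Join customer_id to id in the customer table to obtain the region, area.

Join creator to the employee table to obtain tdepart_id, the id of the consulting center; then join department_id in employee to id in scrm_department to obtain the unit name, name.

Join the subject id, itcast_subject_id, to id in the itcast_subject table to obtain the subject name, name.

Join the school id, itcast_school_id, to id in the itcast_school table to obtain the school name, name.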
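The clue_state conversion and the lookup-style joins described above can be sketched in Python as follows. The sketch assumes at most one clue per customer relationship and uses plain dicts as stand-ins for the dimension tables; `clue_state_stat` mirrors the column name used later, while `widen` and its parameters are hypothetical.

```python
def clue_state_stat(clue_state):
    """'1' = new customer, '0' = old customer, '-1' = invalid or missing."""
    if clue_state == "VALID_NEW_CLUES":
        return "1"
    if clue_state == "VALID_PUBLIC_NEW_CLUE":
        return "0"
    return "-1"


def widen(row, clue_state_by_rid, school_name_by_id):
    """Left-join style lookup: a missing match yields None and keeps the row."""
    return {
        **row,
        "clue_state_stat": clue_state_stat(clue_state_by_rid.get(row["rid"])),
        "itcast_school_name": school_name_by_id.get(row["itcast_school_id"]),
    }
```

As with a left join, a fact row without a matching dimension row is kept; its looked-up fields are simply empty, and a missing clue state falls into the invalid category.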

3.2.2.3.2 Code

--Partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--Apply compression on write
set hive.exec.orc.compression.strategy=COMPRESSION;
--Bucketing
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

insert into table itcast_dwm.itcast_intention_dwm partition (yearinfo,monthinfo,dayinfo)
select
    dwd.customer_id,
    dwd.create_date_time,
    cus.area,
    dwd.itcast_school_id,
    sch.name as itcast_school_name,
    dwd.deleted,
    dwd.origin_type,
    dwd.itcast_subject_id,
    sub.name as itcast_subject_name,
    dwd.hourinfo,
    dwd.origin_type_stat,
    if(clue.clue_state='VALID_NEW_CLUES', '1', if(clue.clue_state='VALID_PUBLIC_NEW_CLUE', '0', '-1')) as clue_state_stat,
    e.department_id as tdepart_id,
    dept.name as tdepart_name,
    dwd.yearinfo,
    dwd.monthinfo,
    dwd.dayinfo
from itcast_dwd.itcast_intention_dwd dwd
left join itcast_ods.customer_clue clue on clue.customer_relationship_id=dwd.rid
left join itcast_dimen.customer cus on dwd.customer_id = cus.id
left join itcast_dimen.employee e on dwd.creator = e.id
left join itcast_dimen.scrm_department dept on e.department_id = dept.id
left join itcast_dimen.itcast_subject sub on dwd.itcast_subject_id = sub.id and sub.name is not null
left join itcast_dimen.itcast_school sch on dwd.itcast_school_id = sch.id;

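3.2.2.3.3 Testing

Bucket sampling can be used for testing here as well. Because bucket sampling in the DWD layer has already cut the data by 9/10, sampling again is optional.

3.2.2.3.3.1 Verifying with the execution plan

The plan shows that both bucket sampling and the SMB join took effect. Removing the Reduce phase avoids data skew and improves execution efficiency.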

explain select dwd.customer_id, dwd.create_date_time, cus.area, dwd.itcast_school_id, sch.name as itcast_school_name, dwd.deleted, dwd.origin_type, dwd.itcast_subject_id, sub.name as itcast_subject_name, dwd.hourinfo, dwd.origin_type_stat, if(clue.clue_state='VALID_NEW_CLUES', '1', if(clue.clue_state='VALID_PUBLIC_NEW_CLUE', '0', '-1')) as clue_state_stat, e.department_id as tdepart_id, dept.name as tdepart_name, dwd.yearinfo, dwd.monthinfo, dwd.dayinfo from itcast_dwd.itcast_intention_dwd tablesample(bucket 1 out of 10 on rid) dwd left join itcast_ods.customer_clue clue on clue.customer_relationship_id=dwd.rid left join itcast_dimen.customer cus on dwd.customer_id = cus.id left join itcast_dimen.employee e on dwd.creator = e.id left join itcast_dimen.scrm_department dept on e.department_id = dept.id left join itcast_dimen.itcast_subject sub on dwd.itcast_subject_id = sub.id left join itcast_dimen.itcast_school sch on dwd.itcast_school_id = sch.id;


3.2.2.3.3.2 Run the insert

insert into table itcast_dwm.itcast_intention_dwm partition (yearinfo,monthinfo,dayinfo) select dwd.customer_id, dwd.create_date_time, cus.area, dwd.itcast_school_id, sch.name as itcast_school_name, dwd.deleted, dwd.origin_type, dwd.itcast_subject_id, sub.name as itcast_subject_name, dwd.hourinfo, dwd.origin_type_stat, if(clue.clue_state='VALID_NEW_CLUES', '1', if(clue.clue_state='VALID_PUBLIC_NEW_CLUE', '0', '-1')) as clue_state_stat, e.department_id as tdepart_id, dept.name as tdepart_name, dwd.yearinfo, dwd.monthinfo, dwd.dayinfo from itcast_dwd.itcast_intention_dwd dwd left join itcast_ods.customer_clue clue on clue.customer_relationship_id=dwd.rid left join itcast_dimen.customer cus on dwd.customer_id = cus.id left join itcast_dimen.employee e on dwd.creator = e.id left join itcast_dimen.scrm_department dept on e.department_id = dept.id left join itcast_dimen.itcast_subject sub on dwd.itcast_subject_id = sub.id left join itcast_dimen.itcast_school sch on dwd.itcast_school_id = sch.id;

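3.2.3 Statistical analysis

3.2.3.1 DWS

3.2.3.1.1 Analysis

The DWS layer works on the data that DWM has cleansed, transformed and joined, and uses count + distinct to compute the metric.

During the modeling analysis we already identified the dimensions related to the metric. They fall into four groups:

- Time dimension: 1. year; 2. month; 3. day; 4. hour

- Product attribute dimension: 1. total intent count; 2. region; 3. school and subject combined; 4. source channel; 5. consulting center

- Data source: 0. offline; 1. online

- Customer attribute: 0. old customer; 1. new customer

The code computes each product attribute separately. The time attributes, online/offline and the customer attribute are standing fields that every grouping must include.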
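The grouping scheme can be sketched in Python: each aggregation groups by its own dimension columns plus the standing fields, fills the columns it does not use with the '-1' placeholder, and counts distinct customers. The function `aggregate`, the sample rows and the reduced column set are assumptions made for this illustration.

```python
from collections import defaultdict

STANDING = ("origin_type_stat", "clue_state_stat")  # present in every grouping


def aggregate(rows, group_cols, placeholder_cols):
    """Distinct customers per group; group_cols/placeholder_cols are tuples."""
    groups = defaultdict(set)
    for r in rows:
        key = tuple(r[c] for c in group_cols + STANDING)
        groups[key].add(r["customer_id"])
    result = []
    for key, ids in sorted(groups.items()):
        rec = dict(zip(group_cols + STANDING, key))
        rec.update({c: "-1" for c in placeholder_cols})  # unused dimensions
        rec["customer_total"] = len(ids)
        result.append(rec)
    return result


rows = [
    {"customer_id": 1, "area": "Beijing", "yearinfo": "2019", "monthinfo": "05",
     "origin_type_stat": "1", "clue_state_stat": "1"},
    {"customer_id": 1, "area": "Beijing", "yearinfo": "2019", "monthinfo": "06",
     "origin_type_stat": "1", "clue_state_stat": "1"},
    {"customer_id": 2, "area": "Shanghai", "yearinfo": "2019", "monthinfo": "05",
     "origin_type_stat": "1", "clue_state_stat": "1"},
]

yearly_total = aggregate(rows, ("yearinfo",), ("area", "monthinfo"))
yearly_by_area = aggregate(rows, ("area", "yearinfo"), ("monthinfo",))
```

Here `yearly_total` holds a single row with customer_total 2, because customer 1 is deduplicated across the two months, while `yearly_by_area` holds one row per area.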

3.2.3.1.2 Code

3.2.3.1.2.1 New total intent count

--Partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--Apply compression on write
set hive.exec.orc.compression.strategy=COMPRESSION;
--Bucketing
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

--Total intent count grouping (by time and the standing fields)
--time_type: 1 = hour, 2 = day, 4 = month, 5 = year
--Hour
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
    '1' as grouptype, '1' as time_type,
    yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by yearinfo, monthinfo, dayinfo, hourinfo, origin_type_stat, clue_state_stat;

--Day
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    '1' as grouptype, '2' as time_type,
    yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by yearinfo, monthinfo, dayinfo, origin_type_stat, clue_state_stat;

--Month
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo,'-',monthinfo) as time_str,
    '1' as grouptype, '4' as time_type,
    yearinfo, monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by yearinfo, monthinfo, origin_type_stat, clue_state_stat;

--Year
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo) as time_str,
    '1' as grouptype, '5' as time_type,
    yearinfo, '-1' as monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by yearinfo, origin_type_stat, clue_state_stat;

3.2.3.1.2.2 Heat map of intent-student locations

--Region grouping (by region, time and the standing fields)
--time_type: 1 = hour, 2 = day, 4 = month, 5 = year
--Hour
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
    '2' as grouptype, '1' as time_type,
    yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by area, yearinfo, monthinfo, dayinfo, hourinfo, origin_type_stat, clue_state_stat;

--Day
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    '2' as grouptype, '2' as time_type,
    yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by area, yearinfo, monthinfo, dayinfo, origin_type_stat, clue_state_stat;

--Month
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo,'-',monthinfo) as time_str,
    '2' as grouptype, '4' as time_type,
    yearinfo, monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by area, yearinfo, monthinfo, origin_type_stat, clue_state_stat;

--Year
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo) as time_str,
    '2' as grouptype, '5' as time_type,
    yearinfo, '-1' as monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by area, yearinfo, origin_type_stat, clue_state_stat;

3.2.3.1.2.3 Subject and school rankings

--Subject and school grouping (by subject, school, time and the standing fields)
--time_type: 1 = hour, 2 = day, 4 = month, 5 = year
--Hour
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    '-1' as area, itcast_school_id, itcast_school_name, '-1' as origin_type,
    itcast_subject_id, itcast_subject_name,
    hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
    '3' as grouptype, '1' as time_type,
    yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by itcast_school_id, itcast_school_name, itcast_subject_id, itcast_subject_name,
    yearinfo, monthinfo, dayinfo, hourinfo, origin_type_stat, clue_state_stat;

--Day
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    '-1' as area, itcast_school_id, itcast_school_name, '-1' as origin_type,
    itcast_subject_id, itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    '3' as grouptype, '2' as time_type,
    yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by itcast_school_id, itcast_school_name, itcast_subject_id, itcast_subject_name,
    yearinfo, monthinfo, dayinfo, origin_type_stat, clue_state_stat;

--Month
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    '-1' as area, itcast_school_id, itcast_school_name, '-1' as origin_type,
    itcast_subject_id, itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo,'-',monthinfo) as time_str,
    '3' as grouptype, '4' as time_type,
    yearinfo, monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by itcast_school_id, itcast_school_name, itcast_subject_id, itcast_subject_name,
    yearinfo, monthinfo, origin_type_stat, clue_state_stat;

--Year
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    '-1' as area, itcast_school_id, itcast_school_name, '-1' as origin_type,
    itcast_subject_id, itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo) as time_str,
    '3' as grouptype, '5' as time_type,
    yearinfo, '-1' as monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by itcast_school_id, itcast_school_name, itcast_subject_id, itcast_subject_name,
    yearinfo, origin_type_stat, clue_state_stat;

3.2.3.1.2.4 Share by source channel

--Source channel grouping (by source channel, time and the standing fields)
--time_type: 1 = hour, 2 = day, 4 = month, 5 = year
--Hour
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
    '4' as grouptype, '1' as time_type,
    yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by origin_type, yearinfo, monthinfo, dayinfo, hourinfo, origin_type_stat, clue_state_stat;

--Day
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    '4' as grouptype, '2' as time_type,
    yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by origin_type, yearinfo, monthinfo, dayinfo, origin_type_stat, clue_state_stat;

--Month
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo,'-',monthinfo) as time_str,
    '4' as grouptype, '4' as time_type,
    yearinfo, monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by origin_type, yearinfo, monthinfo, origin_type_stat, clue_state_stat;

--Year
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo) as time_str,
    '4' as grouptype, '5' as time_type,
    yearinfo, '-1' as monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by origin_type, yearinfo, origin_type_stat, clue_state_stat;

3.2.3.1.2.5 Share by consulting center

--Consulting center grouping (by consulting center, time and the standing fields)
--time_type: 1 = hour, 2 = day, 4 = month, 5 = year
--Hour
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    hourinfo, origin_type_stat, clue_state_stat,
    tdepart_id, tdepart_name,
    concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
    '5' as grouptype, '1' as time_type,
    yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by tdepart_id, tdepart_name, yearinfo, monthinfo, dayinfo, hourinfo, origin_type_stat, clue_state_stat;

--Day
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    tdepart_id, tdepart_name,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    '5' as grouptype, '2' as time_type,
    yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by tdepart_id, tdepart_name, yearinfo, monthinfo, dayinfo, origin_type_stat, clue_state_stat;

--Month
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    tdepart_id, tdepart_name,
    concat(yearinfo,'-',monthinfo) as time_str,
    '5' as grouptype, '4' as time_type,
    yearinfo, monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by tdepart_id, tdepart_name, yearinfo, monthinfo, origin_type_stat, clue_state_stat;

--Year
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
    '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    tdepart_id, tdepart_name,
    concat(yearinfo) as time_str,
    '5' as grouptype, '5' as time_type,
    yearinfo, '-1' as monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by tdepart_id, tdepart_name, yearinfo, origin_type_stat, clue_state_stat;

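3.2.3.2 Testing

The data has already been reduced by bucket sampling on the way from ODS to DWD to DWM, so there is no need to sample again in the DWS layer.

3.2.4 Exporting the data

3.2.4.1 Create the MySQL table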

CREATE TABLE itcast_intention_app ( `customer_total` int(11) COMMENT '聚合意向客户數', `area` varchar(32) COMMENT '區域信息', `itcast_school_id` varchar(32) COMMENT '校區id', `itcast_school_name` varchar(32) COMMENT '校區名稱', `origin_type` varchar(32) COMMENT '來源渠道', `itcast_subject_id` varchar(32) COMMENT '學科id', `itcast_subject_name` varchar(32) COMMENT '學科名稱', `hourinfo` varchar(32) COMMENT '小時信息', `origin_type_stat` varchar(32) COMMENT '數據來源:0.線下；1.線上', `clue_state_stat` varchar(32) COMMENT '客户屬性：0.老客户；1.新客户', `tdepart_id` varchar(32) COMMENT '創建者', `tdepart_name` varchar(32) COMMENT '諮詢中心名稱', `time_str` varchar(32) COMMENT '時間明細', `groupType` varchar(32) COMMENT '產品屬性類別：1.總意向量；2.區域信息；3.校區、學科組合分組；4.來源渠道；5.貢獻中心;', `time_type` varchar(32) COMMENT '聚合時間類型：1、按小時聚合；2、按天聚合；3、按周聚合；4、按月聚合；5、按年聚合；', `dayinfo` varchar(32) COMMENT '日信息', `monthinfo` varchar(32) COMMENT '月信息', `yearinfo` varchar(32) COMMENT '年信息' );

3.2.4.2 Sqoop export script

sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password '123456' \
--table itcast_intention_app \
--hcatalog-database itcast_dws \
--hcatalog-table itcast_intention_dws \
-m 100

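3.3 Incremental flow

3.3.1 Data collection

2.1.1.3 Dimen layer

3.3.1.1 Customer table

Dimension tables are small, so they can simply be overwritten in full, exactly as in the full-load flow.

3.3.1.2 employee table

Same as the full-load flow.

3.3.1.3 scrm_department table

Same as the full-load flow.

3.3.1.4 itcast_school table

Same as the full-load flow.

3.3.1.5 itcast_subject table

Same as the full-load flow.

3.3.1.6 ODS layer

3.3.1.7 Zipper table collection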

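2.1.1.3.0.1 Zipper table recap

A zipper table is the SCD2 approach covered earlier. Its advantage is that it both preserves the historical states of the data and keeps storage to a minimum.

Implementing a zipper table requires two new fields on top of the original ones:

- start_time (the start of the record's lifetime, i.e. its state when the periodic snapshot was taken)

- end_time (the end of the record's lifetime)

2.1.1.3.0.2 Collection steps

1. Create the temporary incremental table, update;
2. Extract yesterday's incremental data (inserts and updates) into the update table;
3. Create the temporary merge table, tmp;
4. Merge yesterday's incremental data (the update table) with the historical data (the zipper table):

(1) Set end_time of the new data to '9999-12-31', meaning it is currently valid;

(2) If the incremental data has older records with the same id, set end_time of the old records to the day before yesterday (yesterday - 1), meaning they are no longer valid from yesterday on;

(3) Write the merged data into the tmp table;

5. Overwrite the zipper table with the data from the temporary table;
6. Rebuild the update and tmp tables before the next extraction.

When querying the zipper table, snapshot data can be retrieved by filtering on start_time and end_time.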
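The merge in step 4 and the snapshot query can be sketched in Python on in-memory rows. This is an illustration of the SCD2 logic only, not the Hive implementation. The function names are hypothetical, and the new version's start_time is set to yesterday, which is an assumption that follows the "no longer valid from yesterday on" rule.

```python
from datetime import date, timedelta

OPEN_END = "9999-12-31"  # marks the currently valid version


def merge_zipper(history, updates, yesterday: date):
    """Merge yesterday's changed rows into a zipper (SCD2) table.

    history: dicts with id, start_time, end_time plus payload columns
    updates: dicts with id plus payload columns, extracted for `yesterday`
    Returns the full merged content (what would be written to tmp).
    """
    changed_ids = {u["id"] for u in updates}
    close_day = (yesterday - timedelta(days=1)).isoformat()
    merged = []
    for row in history:
        if row["id"] in changed_ids and row["end_time"] == OPEN_END:
            # close the open version: valid through the day before yesterday
            row = {**row, "end_time": close_day}
        merged.append(row)
    for u in updates:
        merged.append({**u, "start_time": yesterday.isoformat(),
                       "end_time": OPEN_END})
    return merged


def snapshot(table, day: str):
    """Rows valid on `day`; ISO date strings compare correctly as text."""
    return [r for r in table if r["start_time"] <= day <= r["end_time"]]
```

Only versions that are still open are closed, so history rows that were already closed in an earlier run keep their original end_time.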

3.3.1.8 Customer_relationship

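The requirement calls for re-aggregating the dimensions affected by updated customer_relationship data while also keeping historical snapshots, so a zipper table (SCD2) is the recommended form. In addition to the start_time field, a new end_time field is added to mark when a version of the record is closed.

3.3.1.8.1 Rebuild the customer_relationship_update incremental table

The update table must be rebuilt before each use to avoid problems caused by duplicate data.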

DROP TABLE IF EXISTS itcast_ods.customer_relationship_update; CREATE TABLE IF NOT EXISTS itcast_ods.customer_relationship_update ( id int COMMENT '客户關係id', create_date_time STRING COMMENT '創建時間', update_date_time STRING COMMENT '最後更新時間', deleted int COMMENT '是否被刪除（禁用）', customer_id int COMMENT '所屬客户id', first_id int COMMENT '第一條客户關係id', belonger int COMMENT '歸屬人', belonger_name STRING COMMENT '歸屬人姓名', initial_belonger int COMMENT '初始歸屬人', distribution_handler int COMMENT '分配處理人', business_scrm_department_id int COMMENT '歸屬部門', last_visit_time STRING COMMENT '最後回訪時間', next_visit_time STRING COMMENT '下次回訪時間', origin_type STRING COMMENT '數據來源', itcast_school_id int COMMENT '校區Id', itcast_subject_id int COMMENT '學科Id', intention_study_type STRING COMMENT '意向學習方式', anticipat_signup_date STRING COMMENT '預計報名時間', level STRING COMMENT '客户級別', creator int COMMENT '創建人', current_creator int COMMENT '當前創建人：初始==創建人，當在公海拉回時為拉回人', creator_name STRING COMMENT '創建者姓名', origin_channel STRING COMMENT '來源渠道', comment STRING COMMENT '備註', first_customer_clue_id int COMMENT '第一條線索id', last_customer_clue_id int COMMENT '最後一條線索id', process_state STRING COMMENT '處理狀態', process_time STRING COMMENT '處理狀態變動時間', payment_state STRING COMMENT '支付狀態', payment_time STRING COMMENT '支付狀態變動時間', signup_state STRING COMMENT '報名狀態', signup_time STRING COMMENT '報名時間', notice_state STRING COMMENT '通知狀態', notice_time STRING COMMENT '通知狀態變動時間', lock_state STRING COMMENT '鎖定狀態', lock_time STRING COMMENT '鎖定狀態修改時間', itcast_clazz_id int COMMENT '所屬ems班級id', itcast_clazz_time STRING COMMENT '報班時間', payment_url STRING COMMENT '付款鏈接', payment_url_time STRING COMMENT '支付鏈接生成時間', ems_student_id int COMMENT 'ems的學生id', delete_reason STRING COMMENT '刪除原因', deleter int COMMENT '刪除人', deleter_name STRING COMMENT '刪除人姓名', delete_time STRING COMMENT '刪除時間', course_id int COMMENT '課程ID', course_name STRING COMMENT '課程名稱', delete_comment STRING COMMENT '刪除原因説明', close_state STRING COMMENT '關閉裝填', close_time STRING COMMENT 
'關閉狀態變動時間', appeal_id int COMMENT '申訴id', tenant int COMMENT '租户', total_fee DECIMAL COMMENT '報名費總金額', belonged int COMMENT '小週期歸屬人', belonged_time STRING COMMENT '歸屬時間', belonger_time STRING COMMENT '歸屬時間', transfer int COMMENT '轉移人', transfer_time STRING COMMENT '轉移時間', follow_type int COMMENT '分配類型，0-自動分配，1-手動分配，2-自動轉移，3-手動單個轉移，4-手動批量轉移，5-公海領取', transfer_bxg_oa_account STRING COMMENT '轉移到博學谷歸屬人OA賬號', transfer_bxg_belonger_name STRING COMMENT '轉移到博學谷歸屬人OA姓名', end_time STRING COMMENT '有效時間') comment '客户關係表' PARTITIONED BY(start_time STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as orc TBLPROPERTIES ('orc.compress'='ZLIB');

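3.3.1.8.2 Extract yesterday's new and updated data (a logical delete also counts as an update)

Because the incremental extraction runs on a T+1 schedule, the SQL needs a where condition that selects only yesterday's rows (new and updated) instead of the whole table.

New rows have create_date_time = yesterday; updated rows have update_date_time = yesterday.

Note that an updated row may have been created earlier, so its creation date is not necessarily yesterday. The business side limits the update window to 30 days: a row changed yesterday has create_date_time >= the date 30 days ago, and its update_date_time is yesterday's date.

The query condition has to cover both the creation date and the update date, because the rows added yesterday and the rows modified yesterday must both be extracted into the data warehouse.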
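The date window described above can be sketched in Python. This is only an illustration: the helper names `extraction_window` and `incremental_where` are hypothetical, and the sketch assumes the window is the half-open interval from yesterday 00:00:00 up to (but excluding) today 00:00:00.

```python
from datetime import date, datetime, timedelta
from typing import Tuple


def extraction_window(run_date: date) -> Tuple[str, str]:
    """Return the half-open [start, end) window covering the day before run_date."""
    start = datetime.combine(run_date - timedelta(days=1), datetime.min.time())
    end = start + timedelta(days=1)
    fmt = "%Y-%m-%d %H:%M:%S"
    return start.strftime(fmt), end.strftime(fmt)


def incremental_where(run_date: date) -> str:
    """Build a where clause matching rows created OR updated yesterday."""
    start, end = extraction_window(run_date)
    return (
        f"(create_date_time >= '{start}' AND create_date_time < '{end}') "
        f"OR (update_date_time >= '{start}' AND update_date_time < '{end}')"
    )


# A job running on 2011-12-05 extracts the rows of 2011-12-04.
print(incremental_where(date(2011, 12, 5)))
```

A half-open interval avoids losing rows stamped in the last second of the day, which an upper bound of 23:59:59 combined with `<` would drop.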

2.1.1.3.0.2.1 SQL:

select id, create_date_time, update_date_time, deleted, customer_id, first_id, belonger, belonger_name, initial_belonger, distribution_handler, business_scrm_department_id, last_visit_time, next_visit_time, origin_type, itcast_school_id, itcast_subject_id, intention_study_type, anticipat_signup_date, level, creator, current_creator, creator_name, origin_channel, comment, first_customer_clue_id, last_customer_clue_id, process_state, process_time, payment_state, payment_time, signup_state, signup_time, notice_state, notice_time, lock_state, lock_time, itcast_clazz_id, itcast_clazz_time, payment_url, payment_url_time, ems_student_id, delete_reason, deleter, deleter_name, delete_time, course_id, course_name, delete_comment, close_state, close_time, appeal_id, tenant, total_fee, belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type, transfer_bxg_oa_account, transfer_bxg_belonger_name, FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as start_time, "9999-12-31" as end_time from customer_relationship where ( create_date_time >= "2011-12-04 00:00:00" and create_date_time < "2011-12-05 00:00:00" ) or ( update_date_time >= "2011-12-04 00:00:00" and update_date_time < "2011-12-05 00:00:00" );

2.1.1.3.0.2.2 Sqoop script:

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id, create_date_time, update_date_time, deleted, customer_id, first_id, belonger, belonger_name, initial_belonger, distribution_handler, business_scrm_department_id, last_visit_time, next_visit_time, origin_type, itcast_school_id, itcast_subject_id, intention_study_type, anticipat_signup_date, level, creator, current_creator, creator_name, origin_channel, comment, first_customer_clue_id, last_customer_clue_id, process_state, process_time, payment_state, payment_time, signup_state, signup_time, notice_state, notice_time, lock_state, lock_time, itcast_clazz_id, itcast_clazz_time, payment_url, payment_url_time, ems_student_id, delete_reason, deleter, deleter_name, delete_time, course_id, course_name, delete_comment, close_state, close_time, appeal_id, tenant, total_fee, belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type, transfer_bxg_oa_account, transfer_bxg_belonger_name,
    FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as start_time,
    date_format("9999-12-31", "%Y-%m-%d") as end_time
  from customer_relationship
  where ( ( create_date_time >= "2011-12-04 00:00:00" and create_date_time < "2011-12-05 00:00:00" )
       or ( update_date_time >= "2011-12-04 00:00:00" and update_date_time < "2011-12-05 00:00:00" ) )
    and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_relationship_update \
--hive-partition-key start_time \
--hive-partition-value 2020-07-15 \
-m 100 \
--split-by id
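In SQL, AND binds more tightly than OR, so where `$CONDITIONS` sits relative to the two date ranges matters. The sketch below uses Python's built-in sqlite3 module to show the difference. The table `t` and the predicate `id >= 2` are made up for the illustration: the predicate stands in for the split range that Sqoop substitutes for `$CONDITIONS` in one mapper, which is an assumption about how the job is split rather than output from a real run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, created TEXT, updated TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "2011-12-04", "2011-12-04"),  # created yesterday
        (2, "2011-11-20", "2011-12-04"),  # created earlier, updated yesterday
        (3, "2011-11-20", "2011-11-21"),  # untouched yesterday
    ],
)

# Hypothetical stand-in for the predicate substituted for $CONDITIONS.
split = "id >= 2"

# Without outer parentheses the split only restricts the "updated" branch.
unwrapped = (
    "SELECT id FROM t WHERE created = '2011-12-04' "
    f"OR updated = '2011-12-04' AND {split} ORDER BY id"
)
# With outer parentheses the split restricts the whole date filter.
wrapped = (
    "SELECT id FROM t WHERE (created = '2011-12-04' "
    f"OR updated = '2011-12-04') AND {split} ORDER BY id"
)

unwrapped_ids = [row[0] for row in conn.execute(unwrapped)]
wrapped_ids = [row[0] for row in conn.execute(wrapped)]
print(unwrapped_ids, wrapped_ids)
```

In the unwrapped form, id 1 is returned even though it lies outside the split range, so every mapper would import the rows created yesterday and the target table would contain duplicates.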

3.3.1.8.3 Rebuild the customer_relationship_tmp temporary table

The tmp table must be rebuilt before every use to avoid problems caused by duplicate data.

DROP TABLE itcast_ods.`customer_relationship_tmp`; CREATE TABLE IF NOT EXISTS itcast_ods.`customer_relationship_tmp` ( `id` int COMMENT '客户關係id', `create_date_time` STRING COMMENT '創建時間', `update_date_time` STRING COMMENT '最後更新時間', `deleted` int COMMENT '是否被刪除（禁用）', `customer_id` int COMMENT '所屬客户id', `first_id` int COMMENT '第一條客户關係id', `belonger` int COMMENT '歸屬人', `belonger_name` STRING COMMENT '歸屬人姓名', `initial_belonger` int COMMENT '初始歸屬人', `distribution_handler` int COMMENT '分配處理人', `business_scrm_department_id` int COMMENT '歸屬部門', `last_visit_time` STRING COMMENT '最後回訪時間', `next_visit_time` STRING COMMENT '下次回訪時間', `origin_type` STRING COMMENT '數據來源', `itcast_school_id` int COMMENT '校區Id', `itcast_subject_id` int COMMENT '學科Id', `intention_study_type` STRING COMMENT '意向學習方式', `anticipat_signup_date` STRING COMMENT '預計報名時間', `level` STRING COMMENT '客户級別', `creator` int COMMENT '創建人', `current_creator` int COMMENT '當前創建人：初始==創建人，當在公海拉回時為拉回人', `creator_name` STRING COMMENT '創建者姓名', `origin_channel` STRING COMMENT '來源渠道', `comment` STRING COMMENT '備註', `first_customer_clue_id` int COMMENT '第一條線索id', `last_customer_clue_id` int COMMENT '最後一條線索id', `process_state` STRING COMMENT '處理狀態', `process_time` STRING COMMENT '處理狀態變動時間', `payment_state` STRING COMMENT '支付狀態', `payment_time` STRING COMMENT '支付狀態變動時間', `signup_state` STRING COMMENT '報名狀態', `signup_time` STRING COMMENT '報名時間', `notice_state` STRING COMMENT '通知狀態', `notice_time` STRING COMMENT '通知狀態變動時間', `lock_state` STRING COMMENT '鎖定狀態', `lock_time` STRING COMMENT '鎖定狀態修改時間', `itcast_clazz_id` int COMMENT '所屬ems班級id', `itcast_clazz_time` STRING COMMENT '報班時間', `payment_url` STRING COMMENT '付款鏈接', `payment_url_time` STRING COMMENT '支付鏈接生成時間', `ems_student_id` int COMMENT 'ems的學生id', `delete_reason` STRING COMMENT '刪除原因', `deleter` int COMMENT '刪除人', `deleter_name` STRING COMMENT '刪除人姓名', `delete_time` STRING COMMENT '刪除時間', `course_id` int COMMENT '課程ID', `course_name` STRING COMMENT '課程名稱', `delete_comment` 
STRING COMMENT '刪除原因説明', `close_state` STRING COMMENT '關閉裝填', `close_time` STRING COMMENT '關閉狀態變動時間', `appeal_id` int COMMENT '申訴id', `tenant` int COMMENT '租户', `total_fee` DECIMAL COMMENT '報名費總金額', `belonged` int COMMENT '小週期歸屬人', `belonged_time` STRING COMMENT '歸屬時間', `belonger_time` STRING COMMENT '歸屬時間', `transfer` int COMMENT '轉移人', `transfer_time` STRING COMMENT '轉移時間', `follow_type` int COMMENT '分配類型，0-自動分配，1-手動分配，2-自動轉移，3-手動單個轉移，4-手動批量轉移，5-公海領取', `transfer_bxg_oa_account` STRING COMMENT '轉移到博學谷歸屬人OA賬號', `transfer_bxg_belonger_name` STRING COMMENT '轉移到博學谷歸屬人OA姓名', `end_time` STRING COMMENT '有效截止時間') comment '客户關係表' PARTITIONED BY(start_time STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as orc TBLPROPERTIES ('orc.compress'='ZLIB');

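3.3.1.8.4 Merge the incremental data with the historical data (per the requirement, only data from the last 30 days is updated)

Take the updated rows from the update table; each new row gets end_time '9999-12-31' and start_time set to yesterday's date.
Then take the historical rows from the zipper table:

(1) Update end_time on the old rows

① Join the history table customer_relationship (the zipper table) with the new/updated table customer_relationship_update on id. If an id in the update table also appears in the history table, that record has a newer version.

② end_time stays unchanged when:

1) the row has no update, so it keeps its original end_time;

2) the history row is already expired, so it keeps its original end_time.

③ Otherwise (the row has an update and the old version is currently valid), set end_time to the day before yesterday, so the old version stops being valid before yesterday.

(2) Because the business side limits the update window to 30 days (only rows whose create_date_time falls within the last 30 days are modified), it is enough to query the history rows from the last 30 days and update their end_time.

Union the two sets, (1) the update rows and (2) the zipper-table rows, and write the result into the temporary table with insert overwrite.

Implementation: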

insert overwrite table itcast_ods.customer_relationship_tmp partition (start_time)
select * from (
    -- 1. Rows from the update table (new and changed records)
    select id, create_date_time, update_date_time, deleted, customer_id, first_id, belonger, belonger_name, initial_belonger, distribution_handler, business_scrm_department_id, last_visit_time, next_visit_time, origin_type, itcast_school_id, itcast_subject_id, intention_study_type, anticipat_signup_date, level, creator, current_creator, creator_name, origin_channel, comment, first_customer_clue_id, last_customer_clue_id, process_state, process_time, payment_state, payment_time, signup_state, signup_time, notice_state, notice_time, lock_state, lock_time, itcast_clazz_id, itcast_clazz_time, payment_url, payment_url_time, ems_student_id, delete_reason, deleter, deleter_name, delete_time, course_id, course_name, delete_comment, close_state, close_time, appeal_id, tenant, total_fee, belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type, transfer_bxg_oa_account, transfer_bxg_belonger_name,
        '9999-12-31' end_time,
        '2020-07-15' as start_time
    from itcast_ods.customer_relationship_update
    where start_time='2020-07-15'
    union all
    -- 2. Historical zipper-table rows, with end_time adjusted according to the update table
    select rs.id, rs.create_date_time, rs.update_date_time, rs.deleted, rs.customer_id, rs.first_id, rs.belonger, rs.belonger_name, rs.initial_belonger, rs.distribution_handler, rs.business_scrm_department_id, rs.last_visit_time, rs.next_visit_time, rs.origin_type, rs.itcast_school_id, rs.itcast_subject_id, rs.intention_study_type, rs.anticipat_signup_date, rs.level, rs.creator, rs.current_creator, rs.creator_name, rs.origin_channel, rs.comment, rs.first_customer_clue_id, rs.last_customer_clue_id, rs.process_state, rs.process_time, rs.payment_state, rs.payment_time, rs.signup_state, rs.signup_time, rs.notice_state, rs.notice_time, rs.lock_state, rs.lock_time, rs.itcast_clazz_id, rs.itcast_clazz_time, rs.payment_url, rs.payment_url_time, rs.ems_student_id, rs.delete_reason, rs.deleter, rs.deleter_name, rs.delete_time, rs.course_id, rs.course_name, rs.delete_comment, rs.close_state, rs.close_time, rs.appeal_id, rs.tenant, rs.total_fee, rs.belonged, rs.belonged_time, rs.belonger_time, rs.transfer, rs.transfer_time, rs.follow_type, rs.transfer_bxg_oa_account, rs.transfer_bxg_belonger_name,
        -- Update end_time: keep the original end_time when no changed row matches or the
        -- history row is already expired; otherwise set it to the day before yesterday
        if(up.id is null or rs.end_time<'9999-12-31', rs.end_time, date_add(up.start_time,-1)) end_time,
        rs.start_time
    from itcast_ods.customer_relationship rs
    left join (
        select * from itcast_ods.customer_relationship_update where start_time='2020-07-15'
    ) up on rs.id=up.id
    -- Time limit: only history rows from the last 30 days can change; the result overwrites
    -- the partitions it belongs to. The bound uses the load date rather than up.start_time,
    -- which is NULL for unmatched rows and would filter them out of the left join.
    where rs.start_time >= date_add('2020-07-15',-30)
) his
order by his.id, start_time;
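The merge rules above can be modelled in a few lines of Python. This is a simplified sketch rather than the Hive job itself: rows are plain dictionaries, the single `value` field stands in for all business columns, and `merge_zipper` is a hypothetical helper name.

```python
from datetime import date, timedelta

OPEN_END = "9999-12-31"


def merge_zipper(history, updates, load_date):
    """Merge one day's updates into zipper-table rows.

    history:   dicts with id, value, start_time, end_time
    updates:   dicts with id, value (rows created or changed on load_date)
    load_date: ISO date string used as start_time of the new versions
    """
    day_before = (date.fromisoformat(load_date) - timedelta(days=1)).isoformat()
    changed_ids = {u["id"] for u in updates}
    merged = []
    for row in history:
        row = dict(row)
        # Only a currently valid row whose id has a new version gets closed;
        # unchanged rows and already expired rows keep their end_time.
        if row["id"] in changed_ids and row["end_time"] == OPEN_END:
            row["end_time"] = day_before
        merged.append(row)
    for u in updates:
        merged.append(
            {"id": u["id"], "value": u["value"],
             "start_time": load_date, "end_time": OPEN_END}
        )
    return sorted(merged, key=lambda r: (r["id"], r["start_time"]))


history = [
    {"id": 1, "value": "A", "start_time": "2020-07-01", "end_time": OPEN_END},
    {"id": 2, "value": "B", "start_time": "2020-07-01", "end_time": "2020-07-09"},
    {"id": 2, "value": "B2", "start_time": "2020-07-10", "end_time": OPEN_END},
]
updates = [{"id": 1, "value": "A2"}, {"id": 3, "value": "C"}]
merged = merge_zipper(history, updates, "2020-07-15")
for row in merged:
    print(row)
```

Here id 1 has a new version, so its old row is closed on 2020-07-14 and a new open row starts on 2020-07-15; id 2 is untouched; id 3 is a brand-new record.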

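3.3.1.8.5 Overwrite the zipper table from the temporary table

Note that on a partitioned table only the partitions being written are overwritten. The previous step therefore does not need to read the full history: querying the last 30 days is enough, and data older than 30 days is not overwritten.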

INSERT OVERWRITE TABLE itcast_ods.customer_relationship partition (start_time) SELECT * from itcast_ods.customer_relationship_tmp;
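The partition behaviour this step relies on can be pictured with a small Python model. It is an illustration under the assumption that a dynamic-partition INSERT OVERWRITE replaces exactly the partitions that receive rows; the function `insert_overwrite_dynamic` and the dictionary layout are invented for the sketch.

```python
def insert_overwrite_dynamic(table, rows, partition_key="start_time"):
    """Model INSERT OVERWRITE ... PARTITION (start_time) with dynamic partitions.

    table: dict mapping partition value -> list of rows
    rows:  rows being written; every partition that appears in rows is replaced
           wholesale, partitions that do not appear are left untouched.
    """
    incoming = {}
    for row in rows:
        incoming.setdefault(row[partition_key], []).append(row)
    result = dict(table)
    result.update(incoming)
    return result


zipper = {
    "2020-06-01": [{"id": 7, "start_time": "2020-06-01"}],
    "2020-07-10": [{"id": 1, "start_time": "2020-07-10"},
                   {"id": 2, "start_time": "2020-07-10"}],
}
tmp_rows = [
    {"id": 1, "start_time": "2020-07-10"},
    {"id": 9, "start_time": "2020-07-15"},
]
after = insert_overwrite_dynamic(zipper, tmp_rows)
print(sorted(after))
```

The 2020-06-01 partition survives because the temporary table holds no rows for it. The 2020-07-10 partition is replaced entirely, so id 2 disappears in this example: any partition that is written has to carry all of its rows in the temporary table, including rows that did not change.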

3.3.1.8.6 Test

After running the whole flow, check whether the rows matching the condition have changed in the zipper table:

SELECT * from itcast_ods.customer_relationship WHERE create_date_time BETWEEN "2011-12-04 00:00:00" and "2011-12-05 00:00:00";
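Beyond eyeballing the query result, a couple of invariants can be checked on the exported rows. The sketch below assumes that every id should have exactly one open version (end_time '9999-12-31') and that versions of the same id must not overlap; `check_zipper` is a hypothetical helper operating on dictionaries, not part of the original flow.

```python
from collections import defaultdict

OPEN_END = "9999-12-31"


def check_zipper(rows):
    """Return a list of human-readable problems found in zipper-table rows."""
    problems = []
    by_id = defaultdict(list)
    for row in rows:
        by_id[row["id"]].append(row)
    for key, versions in sorted(by_id.items()):
        versions.sort(key=lambda r: r["start_time"])
        open_count = sum(1 for v in versions if v["end_time"] == OPEN_END)
        if open_count != 1:
            problems.append(f"id {key}: {open_count} open versions")
        # ISO date strings compare correctly as plain strings.
        for prev, nxt in zip(versions, versions[1:]):
            if prev["end_time"] >= nxt["start_time"]:
                problems.append(
                    f"id {key}: version starting {prev['start_time']} "
                    f"overlaps {nxt['start_time']}"
                )
    return problems


good = [
    {"id": 1, "start_time": "2020-07-01", "end_time": "2020-07-14"},
    {"id": 1, "start_time": "2020-07-15", "end_time": OPEN_END},
]
bad = [
    {"id": 2, "start_time": "2020-07-01", "end_time": OPEN_END},
    {"id": 2, "start_time": "2020-07-15", "end_time": OPEN_END},
]
print(check_zipper(good))
print(check_zipper(bad))
```

A row that was updated but whose old version was never closed shows up as two open versions for the same id, which is the typical symptom of a failed merge.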

3.3.1.8.7 Oozie script

Put the complete zipper-table procedure into a shell script.

#! /bin/bash HIVE_HOME=/usr/bin/hive if [[ $1 == "" ]]; then TD_DATE=`date -d ''1 days ago'' "+%Y-%m-%d"` else TD_DATE=$1 fi output=$(${HIVE_HOME} -S -e " SET hive.exec.dynamic.partition=true; SET hive.exec.dynamic.partition.mode=nonstrict; DROP TABLE IF EXISTS itcast_ods.customer_relationship_update; CREATE TABLE IF NOT EXISTS itcast_ods.customer_relationship_update ( id int COMMENT '客户關係id', create_date_time STRING COMMENT '創建時間', update_date_time STRING COMMENT '最後更新時間', deleted int COMMENT '是否被刪除（禁用）', customer_id int COMMENT '所屬客户id', first_id int COMMENT '第一條客户關係id', belonger int COMMENT '歸屬人', belonger_name STRING COMMENT '歸屬人姓名', initial_belonger int COMMENT '初始歸屬人', distribution_handler int COMMENT '分配處理人', business_scrm_department_id int COMMENT '歸屬部門', last_visit_time STRING COMMENT '最後回訪時間', next_visit_time STRING COMMENT '下次回訪時間', origin_type STRING COMMENT '數據來源', itcast_school_id int COMMENT '校區Id', itcast_subject_id int COMMENT '學科Id', intention_study_type STRING COMMENT '意向學習方式', anticipat_signup_date STRING COMMENT '預計報名時間', level STRING COMMENT '客户級別', creator int COMMENT '創建人', current_creator int COMMENT '當前創建人：初始==創建人，當在公海拉回時為拉回人', creator_name STRING COMMENT '創建者姓名', origin_channel STRING COMMENT '來源渠道', comment STRING COMMENT '備註', first_customer_clue_id int COMMENT '第一條線索id', last_customer_clue_id int COMMENT '最後一條線索id', process_state STRING COMMENT '處理狀態', process_time STRING COMMENT '處理狀態變動時間', payment_state STRING COMMENT '支付狀態', payment_time STRING COMMENT '支付狀態變動時間', signup_state STRING COMMENT '報名狀態', signup_time STRING COMMENT '報名時間', notice_state STRING COMMENT '通知狀態', notice_time STRING COMMENT '通知狀態變動時間', lock_state STRING COMMENT '鎖定狀態', lock_time STRING COMMENT '鎖定狀態修改時間', itcast_clazz_id int COMMENT '所屬ems班級id', itcast_clazz_time STRING COMMENT '報班時間', payment_url STRING COMMENT '付款鏈接', payment_url_time STRING COMMENT '支付鏈接生成時間', ems_student_id int COMMENT 'ems的學生id', delete_reason STRING COMMENT '刪除原因', deleter int COMMENT 
'刪除人', deleter_name STRING COMMENT '刪除人姓名', delete_time STRING COMMENT '刪除時間', course_id int COMMENT '課程ID', course_name STRING COMMENT '課程名稱', delete_comment STRING COMMENT '刪除原因説明', close_state STRING COMMENT '關閉裝填', close_time STRING COMMENT '關閉狀態變動時間', appeal_id int COMMENT '申訴id', tenant int COMMENT '租户', total_fee DECIMAL COMMENT '報名費總金額', belonged int COMMENT '小週期歸屬人', belonged_time STRING COMMENT '歸屬時間', belonger_time STRING COMMENT '歸屬時間', transfer int COMMENT '轉移人', transfer_time STRING COMMENT '轉移時間', follow_type int COMMENT '分配類型，0-自動分配，1-手動分配，2-自動轉移，3-手動單個轉移，4-手動批量轉移，5-公海領取', transfer_bxg_oa_account STRING COMMENT '轉移到博學谷歸屬人OA賬號', transfer_bxg_belonger_name STRING COMMENT '轉移到博學谷歸屬人OA姓名', end_time STRING COMMENT '有效時間') comment '客户關係表' PARTITIONED BY(start_time STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as orc TBLPROPERTIES ('orc.compress'='ZLIB'); ") SQOOP_HOME=/usr/bin/sqoop output=$(${SQOOP_HOME} import \ --connect jdbc:mysql://172.17.0.202:3306/scrm \ --username root \ --password 123456 \ --query 'select id, create_date_time, update_date_time, deleted, customer_id, first_id, belonger, belonger_name, initial_belonger, distribution_handler, business_scrm_department_id, last_visit_time, next_visit_time, origin_type, itcast_school_id, itcast_subject_id, intention_study_type, anticipat_signup_date, level, creator, current_creator, creator_name, origin_channel, comment, first_customer_clue_id, last_customer_clue_id, process_state, process_time, payment_state, payment_time, signup_state, signup_time, notice_state, notice_time, lock_state, lock_time, itcast_clazz_id, itcast_clazz_time, payment_url, payment_url_time, ems_student_id, delete_reason, deleter, deleter_name, delete_time, course_id, course_name, delete_comment, close_state, close_time, appeal_id, tenant, total_fee, belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type, transfer_bxg_oa_account, transfer_bxg_belonger_name, FROM_UNIXTIME(unix_timestamp(), 
"%Y-%m-%d") as start_time, date_format("9999-12-31", "%Y-%m-%d") as end_time from customer_relationship where ( create_date_time >= FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(SYSDATE()AS DATE) - INTERVAL 1 DAY),"%Y-%m-%d %H:%i:%s") and create_date_time < FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(SYSDATE()AS DATE)),"%Y-%m-%d %H:%i:%s") ) or ( update_date_time >= FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(SYSDATE()AS DATE) - INTERVAL 1 DAY),"%Y-%m-%d %H:%i:%s") and update_date_time < FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(SYSDATE()AS DATE)),"%Y-%m-%d %H:%i:%s") ) and $CONDITIONS' \ --hcatalog-database itcast_ods \ --hcatalog-table customer_relationship_update \ --hive-partition-key start_time \ --hive-partition-value ${TD_DATE} \ -m 100 \ --split-by id) output=$(${HIVE_HOME} -S -e " SET hive.exec.dynamic.partition=true; SET hive.exec.dynamic.partition.mode=nonstrict; DROP TABLE itcast_ods.customer_clue_tmp; CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue_tmp ( id int COMMENT 'customer_clue_id', create_date_time STRING COMMENT '創建時間', update_date_time STRING COMMENT '最後更新時間', deleted STRING COMMENT '是否被刪除（禁用）', customer_id int COMMENT '客户id', customer_relationship_id int COMMENT '客户關係id', session_id STRING COMMENT '七陌會話id', sid STRING COMMENT '訪客id', status STRING COMMENT '狀態（undeal待領取 deal 已領取 finish 已關閉 changePeer 已流轉）', users STRING COMMENT '所屬坐席', create_time STRING COMMENT '七陌創建時間', platform STRING COMMENT '平台來源（pc-網站諮詢\|wap-wap諮詢\|sdk-app諮詢\|weixin-微信諮詢）', s_name STRING COMMENT '用户名稱', seo_source STRING COMMENT '搜索來源', seo_keywords STRING COMMENT '關鍵字', ip STRING COMMENT 'IP地址', referrer STRING COMMENT '上級來源頁面', from_url STRING COMMENT '會話來源頁面', landing_page_url STRING COMMENT '訪客着陸頁面', url_title STRING COMMENT '諮詢頁面title', to_peer STRING COMMENT '所屬技能組', manual_time STRING COMMENT '人工開始時間', begin_time STRING COMMENT '坐席領取時間 ', reply_msg_count int COMMENT '客服回覆消息數', total_msg_count int COMMENT '消息總數', msg_count int COMMENT '客户發送消息數', comment STRING COMMENT '備註', finish_reason STRING 
COMMENT '結束類型', finish_user STRING COMMENT '結束坐席', end_time STRING COMMENT '會話結束時間', platform_description STRING COMMENT '客户平台信息', browser_name STRING COMMENT '瀏覽器名稱', os_info STRING COMMENT '系統名稱', area STRING COMMENT '區域', country STRING COMMENT '所在國家', province STRING COMMENT '省', city STRING COMMENT '城市', creator int COMMENT '創建人', name STRING COMMENT '客户姓名', idcard STRING COMMENT '身份證號', phone STRING COMMENT '手機號', itcast_school_id int COMMENT '校區Id', itcast_school STRING COMMENT '校區', itcast_subject_id int COMMENT '學科Id', itcast_subject STRING COMMENT '學科', wechat STRING COMMENT '微信', qq STRING COMMENT 'qq號', email STRING COMMENT '郵箱', gender STRING COMMENT '性別', level STRING COMMENT '客户級別', origin_type STRING COMMENT '數據來源渠道', information_way STRING COMMENT '資訊方式', working_years STRING COMMENT '開始工作時間', technical_directions STRING COMMENT '技術方向', customer_state STRING COMMENT '當前客户狀態', valid STRING COMMENT '該線索是否是網資有效線索', anticipat_signup_date STRING COMMENT '預計報名時間', clue_state STRING COMMENT '線索狀態', scrm_department_id int COMMENT 'SCRM內部部門id', superior_url STRING COMMENT '諸葛獲取上級頁面URL', superior_source STRING COMMENT '諸葛獲取上級頁面URL標題', landing_url STRING COMMENT '諸葛獲取着陸頁面URL', landing_source STRING COMMENT '諸葛獲取着陸頁面URL來源', info_url STRING COMMENT '諸葛獲取留諮頁URL', info_source STRING COMMENT '諸葛獲取留諮頁URL標題', origin_channel STRING COMMENT '投放渠道', course_id int COMMENT '課程編號', course_name STRING COMMENT '課程名稱', zhuge_session_id STRING COMMENT 'zhuge會話id', is_repeat int COMMENT '是否重複線索(手機號維度) 0:正常 1：重複', tenant int COMMENT '租户id', activity_id STRING COMMENT '活動id', activity_name STRING COMMENT '活動名稱', follow_type int COMMENT '分配類型，0-自動分配，1-手動分配，2-自動轉移，3-手動單個轉移，4-手動批量轉移，5-公海領取', shunt_mode_id int COMMENT '匹配到的技能組id', shunt_employee_group_id int COMMENT '所屬分流員工組', ends_time STRING COMMENT '有效時間') comment '客户關係表' PARTITIONED BY(starts_time STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as orc TBLPROPERTIES ('orc.compress'='ZLIB'); insert overwrite table 
itcast_ods.`customer_relationship_tmp` partition (start_time) select * from ( select id, create_date_time, update_date_time, deleted, customer_id, first_id, belonger, belonger_name, initial_belonger, distribution_handler, business_scrm_department_id, last_visit_time, next_visit_time, origin_type, itcast_school_id, itcast_subject_id, intention_study_type, anticipat_signup_date, level, creator, current_creator, creator_name, origin_channel, comment, first_customer_clue_id, last_customer_clue_id, process_state, process_time, payment_state, payment_time, signup_state, signup_time, notice_state, notice_time, lock_state, lock_time, itcast_clazz_id, itcast_clazz_time, payment_url, payment_url_time, ems_student_id, delete_reason, deleter, deleter_name, delete_time, course_id, course_name, delete_comment, close_state, close_time, appeal_id, tenant, total_fee, belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type, transfer_bxg_oa_account, transfer_bxg_belonger_name, '9999-12-31' end_time, FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as start_time from itcast_ods.customer_relationship_update where start_time=FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") union all select rs.id, rs.create_date_time, rs.update_date_time, rs.deleted, rs.customer_id, rs.first_id, rs.belonger, rs.belonger_name, rs.initial_belonger, rs.distribution_handler, rs.business_scrm_department_id, rs.last_visit_time, rs.next_visit_time, rs.origin_type, rs.itcast_school_id, rs.itcast_subject_id, rs.intention_study_type, rs.anticipat_signup_date, rs.level, rs.creator, rs.current_creator, rs.creator_name, rs.origin_channel, rs.comment, rs.first_customer_clue_id, rs.last_customer_clue_id, rs.process_state, rs.process_time, rs.payment_state, rs.payment_time, rs.signup_state, rs.signup_time, rs.notice_state, rs.notice_time, rs.lock_state, rs.lock_time, rs.itcast_clazz_id, rs.itcast_clazz_time, rs.payment_url, rs.payment_url_time, rs.ems_student_id, rs.delete_reason, rs.deleter, rs.deleter_name, 
rs.delete_time, rs.course_id, rs.course_name, rs.delete_comment, rs.close_state, rs.close_time, rs.appeal_id, rs.tenant, rs.total_fee, rs.belonged, rs.belonged_time, rs.belonger_time, rs.transfer, rs.transfer_time, rs.follow_type, rs.transfer_bxg_oa_account, rs.transfer_bxg_belonger_name, if(up.id is null, rs.end_time, date_add(up.start_time,-1)) end_time, rs.start_time from itcast_ods.customer_relationship rs left join ( select * from itcast_ods.customer_relationship_update where start_time=FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") ) up on rs.id=up.id where rs.start_time >= date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP()),30) and rs.end_time='9999-12-31' )his order by his.id, start_time; INSERT OVERWRITE TABLE itcast_ods.customer_relationship partition (start_time) SELECT * from itcast_ods.customer_relationship_tmp; ")
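The script takes the load date from its first argument and falls back to yesterday when none is given. A minimal Python sketch of that rule follows; `resolve_load_date` is a hypothetical helper, and the `today` parameter exists only to make the example reproducible.

```python
from datetime import date, timedelta


def resolve_load_date(argv, today=None):
    """Return argv[1] if it is given and non-empty, else yesterday as YYYY-MM-DD."""
    if len(argv) > 1 and argv[1]:
        # Parse the value so a malformed date fails before any table is rebuilt.
        return date.fromisoformat(argv[1]).isoformat()
    today = today or date.today()
    return (today - timedelta(days=1)).isoformat()


# No argument: fall back to the day before the run date.
print(resolve_load_date(["job.sh"], today=date(2020, 7, 16)))
# Explicit argument: used as given, for example when re-running an old day.
print(resolve_load_date(["job.sh", "2019-12-04"]))
```

Resolving the date once and passing the same value to the Sqoop partition value and to the Hive where conditions keeps the steps aligned; the script above takes the Sqoop partition value from TD_DATE but derives the date in the Hive statements from unix_timestamp(), so the two can differ.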

3.3.1.9 customer_clue (clue table)

3.3.1.9.1 Rebuild the customer_clue_update update table

DROP TABLE itcast_ods.customer_clue_update; CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue_update ( id int COMMENT 'customer_clue_id', create_date_time STRING COMMENT '創建時間', update_date_time STRING COMMENT '最後更新時間', deleted STRING COMMENT '是否被刪除（禁用）', customer_id int COMMENT '客户id', customer_relationship_id int COMMENT '客户關係id', session_id STRING COMMENT '七陌會話id', sid STRING COMMENT '訪客id', status STRING COMMENT '狀態（undeal待領取 deal 已領取 finish 已關閉 changePeer 已流轉）', users STRING COMMENT '所屬坐席', create_time STRING COMMENT '七陌創建時間', platform STRING COMMENT '平台來源（pc-網站諮詢\|wap-wap諮詢\|sdk-app諮詢\|weixin-微信諮詢）', s_name STRING COMMENT '用户名稱', seo_source STRING COMMENT '搜索來源', seo_keywords STRING COMMENT '關鍵字', ip STRING COMMENT 'IP地址', referrer STRING COMMENT '上級來源頁面', from_url STRING COMMENT '會話來源頁面', landing_page_url STRING COMMENT '訪客着陸頁面', url_title STRING COMMENT '諮詢頁面title', to_peer STRING COMMENT '所屬技能組', manual_time STRING COMMENT '人工開始時間', begin_time STRING COMMENT '坐席領取時間 ', reply_msg_count int COMMENT '客服回覆消息數', total_msg_count int COMMENT '消息總數', msg_count int COMMENT '客户發送消息數', comment STRING COMMENT '備註', finish_reason STRING COMMENT '結束類型', finish_user STRING COMMENT '結束坐席', end_time STRING COMMENT '會話結束時間', platform_description STRING COMMENT '客户平台信息', browser_name STRING COMMENT '瀏覽器名稱', os_info STRING COMMENT '系統名稱', area STRING COMMENT '區域', country STRING COMMENT '所在國家', province STRING COMMENT '省', city STRING COMMENT '城市', creator int COMMENT '創建人', name STRING COMMENT '客户姓名', idcard STRING COMMENT '身份證號', phone STRING COMMENT '手機號', itcast_school_id int COMMENT '校區Id', itcast_school STRING COMMENT '校區', itcast_subject_id int COMMENT '學科Id', itcast_subject STRING COMMENT '學科', wechat STRING COMMENT '微信', qq STRING COMMENT 'qq號', email STRING COMMENT '郵箱', gender STRING COMMENT '性別', level STRING COMMENT '客户級別', origin_type STRING COMMENT '數據來源渠道', information_way STRING COMMENT '資訊方式', working_years STRING COMMENT '開始工作時間', technical_directions 
STRING COMMENT '技術方向', customer_state STRING COMMENT '當前客户狀態', valid STRING COMMENT '該線索是否是網資有效線索', anticipat_signup_date STRING COMMENT '預計報名時間', clue_state STRING COMMENT '線索狀態', scrm_department_id int COMMENT 'SCRM內部部門id', superior_url STRING COMMENT '諸葛獲取上級頁面URL', superior_source STRING COMMENT '諸葛獲取上級頁面URL標題', landing_url STRING COMMENT '諸葛獲取着陸頁面URL', landing_source STRING COMMENT '諸葛獲取着陸頁面URL來源', info_url STRING COMMENT '諸葛獲取留諮頁URL', info_source STRING COMMENT '諸葛獲取留諮頁URL標題', origin_channel STRING COMMENT '投放渠道', course_id int COMMENT '課程編號', course_name STRING COMMENT '課程名稱', zhuge_session_id STRING COMMENT 'zhuge會話id', is_repeat int COMMENT '是否重複線索(手機號維度) 0:正常 1：重複', tenant int COMMENT '租户id', activity_id STRING COMMENT '活動id', activity_name STRING COMMENT '活動名稱', follow_type int COMMENT '分配類型，0-自動分配，1-手動分配，2-自動轉移，3-手動單個轉移，4-手動批量轉移，5-公海領取', shunt_mode_id int COMMENT '匹配到的技能組id', shunt_employee_group_id int COMMENT '所屬分流員工組', ends_time STRING COMMENT '有效時間') comment '客户關係表' PARTITIONED BY(starts_time STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as orc TBLPROPERTIES ('orc.compress'='ZLIB');

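3.3.1.9.2 Extract yesterday's new and updated data (a logical delete also counts as an update)

Because the incremental extraction runs on a T+1 schedule, the SQL needs a where condition that selects only yesterday's rows instead of the whole table.

The query condition has to cover both the creation date and the update date, because the rows added yesterday and the rows modified yesterday must both be extracted into the data warehouse.

SQL: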

select id, create_date_time, update_date_time, deleted, customer_id, customer_relationship_id, session_id, sid, status, user as users, create_time, platform, s_name, seo_source, seo_keywords, ip, referrer, from_url, landing_page_url, url_title, to_peer, manual_time, begin_time, reply_msg_count, total_msg_count, msg_count, comment, finish_reason, finish_user, end_time, platform_description, browser_name, os_info, area, country, province, city, creator, name, idcard, phone, itcast_school_id, itcast_school, itcast_subject_id, itcast_subject, wechat, qq, email, gender, level, origin_type, information_way, working_years, technical_directions, customer_state, valid, anticipat_signup_date, clue_state, scrm_department_id, superior_url, superior_source, landing_url, landing_source, info_url, info_source, origin_channel, course_id, course_name, zhuge_session_id, is_repeat, tenant, activity_id, activity_name, follow_type, shunt_mode_id, shunt_employee_group_id,
    FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as starts_time,
    date_format("9999-12-31", "%Y-%m-%d") as ends_time
from customer_clue
where ( create_date_time >= "2019-12-04 00:00:00" and create_date_time < "2019-12-05 00:00:00" )
   or ( update_date_time >= "2019-12-04 00:00:00" and update_date_time < "2019-12-05 00:00:00" );

Sqoop script:

sqoop import \
--connect jdbc:mysql://172.17.0.202:3306/scrm \
--username root \
--password 123456 \
--query 'select id, create_date_time, update_date_time, deleted, customer_id, customer_relationship_id, session_id, sid, status, user as users, create_time, platform, s_name, seo_source, seo_keywords, ip, referrer, from_url, landing_page_url, url_title, to_peer, manual_time, begin_time, reply_msg_count, total_msg_count, msg_count, comment, finish_reason, finish_user, end_time, platform_description, browser_name, os_info, area, country, province, city, creator, name, idcard, phone, itcast_school_id, itcast_school, itcast_subject_id, itcast_subject, wechat, qq, email, gender, level, origin_type, information_way, working_years, technical_directions, customer_state, valid, anticipat_signup_date, clue_state, scrm_department_id, superior_url, superior_source, landing_url, landing_source, info_url, info_source, origin_channel, course_id, course_name, zhuge_session_id, is_repeat, tenant, activity_id, activity_name, follow_type, shunt_mode_id, shunt_employee_group_id,
    FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as starts_time,
    date_format("9999-12-31", "%Y-%m-%d") as ends_time
  from customer_clue
  where ( ( create_date_time >= "2019-12-04 00:00:00" and create_date_time < "2019-12-05 00:00:00" )
       or ( update_date_time >= "2019-12-04 00:00:00" and update_date_time < "2019-12-05 00:00:00" ) )
    and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_clue_update \
--hive-partition-key starts_time \
--hive-partition-value 2019-12-04 \
-m 100 \
--split-by id

3.3.1.9.3 Rebuild the customer_clue_tmp temporary table

DROP TABLE itcast_ods.customer_clue_tmp; CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue_tmp ( id int COMMENT 'customer_clue_id', create_date_time STRING COMMENT '創建時間', update_date_time STRING COMMENT '最後更新時間', deleted STRING COMMENT '是否被刪除（禁用）', customer_id int COMMENT '客户id', customer_relationship_id int COMMENT '客户關係id', session_id STRING COMMENT '七陌會話id', sid STRING COMMENT '訪客id', status STRING COMMENT '狀態（undeal待領取 deal 已領取 finish 已關閉 changePeer 已流轉）', users STRING COMMENT '所屬坐席', create_time STRING COMMENT '七陌創建時間', platform STRING COMMENT '平台來源（pc-網站諮詢\|wap-wap諮詢\|sdk-app諮詢\|weixin-微信諮詢）', s_name STRING COMMENT '用户名稱', seo_source STRING COMMENT '搜索來源', seo_keywords STRING COMMENT '關鍵字', ip STRING COMMENT 'IP地址', referrer STRING COMMENT '上級來源頁面', from_url STRING COMMENT '會話來源頁面', landing_page_url STRING COMMENT '訪客着陸頁面', url_title STRING COMMENT '諮詢頁面title', to_peer STRING COMMENT '所屬技能組', manual_time STRING COMMENT '人工開始時間', begin_time STRING COMMENT '坐席領取時間 ', reply_msg_count int COMMENT '客服回覆消息數', total_msg_count int COMMENT '消息總數', msg_count int COMMENT '客户發送消息數', comment STRING COMMENT '備註', finish_reason STRING COMMENT '結束類型', finish_user STRING COMMENT '結束坐席', end_time STRING COMMENT '會話結束時間', platform_description STRING COMMENT '客户平台信息', browser_name STRING COMMENT '瀏覽器名稱', os_info STRING COMMENT '系統名稱', area STRING COMMENT '區域', country STRING COMMENT '所在國家', province STRING COMMENT '省', city STRING COMMENT '城市', creator int COMMENT '創建人', name STRING COMMENT '客户姓名', idcard STRING COMMENT '身份證號', phone STRING COMMENT '手機號', itcast_school_id int COMMENT '校區Id', itcast_school STRING COMMENT '校區', itcast_subject_id int COMMENT '學科Id', itcast_subject STRING COMMENT '學科', wechat STRING COMMENT '微信', qq STRING COMMENT 'qq號', email STRING COMMENT '郵箱', gender STRING COMMENT '性別', level STRING COMMENT '客户級別', origin_type STRING COMMENT '數據來源渠道', information_way STRING COMMENT '資訊方式', working_years STRING COMMENT '開始工作時間', technical_directions STRING 
COMMENT '技術方向', customer_state STRING COMMENT '當前客户狀態', valid STRING COMMENT '該線索是否是網資有效線索', anticipat_signup_date STRING COMMENT '預計報名時間', clue_state STRING COMMENT '線索狀態', scrm_department_id int COMMENT 'SCRM內部部門id', superior_url STRING COMMENT '諸葛獲取上級頁面URL', superior_source STRING COMMENT '諸葛獲取上級頁面URL標題', landing_url STRING COMMENT '諸葛獲取着陸頁面URL', landing_source STRING COMMENT '諸葛獲取着陸頁面URL來源', info_url STRING COMMENT '諸葛獲取留諮頁URL', info_source STRING COMMENT '諸葛獲取留諮頁URL標題', origin_channel STRING COMMENT '投放渠道', course_id int COMMENT '課程編號', course_name STRING COMMENT '課程名稱', zhuge_session_id STRING COMMENT 'zhuge會話id', is_repeat int COMMENT '是否重複線索(手機號維度) 0:正常 1：重複', tenant int COMMENT '租户id', activity_id STRING COMMENT '活動id', activity_name STRING COMMENT '活動名稱', follow_type int COMMENT '分配類型，0-自動分配，1-手動分配，2-自動轉移，3-手動單個轉移，4-手動批量轉移，5-公海領取', shunt_mode_id int COMMENT '匹配到的技能組id', shunt_employee_group_id int COMMENT '所屬分流員工組', ends_time STRING COMMENT '有效時間') comment '客户關係表' PARTITIONED BY(starts_time STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as orc TBLPROPERTIES ('orc.compress'='ZLIB');


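3.3.1.9.4 Merge the incremental data with the historical data (per the requirement, only data from the last 30 days is updated)

Take the updated rows from the update table; each new row gets ends_time '9999-12-31' and starts_time set to yesterday's date.
Then take the historical rows from the zipper table:

(1) Update ends_time

① Join the history table customer_clue (the zipper table) with the new/updated table customer_clue_update on id. If an id in the update table also appears in the history table, that record has a newer version.

② A row without an update keeps its original ends_time.

③ A history row that is already expired keeps its original ends_time.

④ A row that has an update and whose old version is currently valid gets ends_time set to the day before yesterday, so the old version stops being valid before yesterday.

(2) Because the business side limits the update window to 30 days, it is enough to query and update the rows from the last 30 days.

Union the two sets, (1) the update rows and (2) the zipper-table rows, and write the result into the temporary table with insert overwrite.

Implementation: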

insert overwrite table itcast_ods.customer_clue_tmp partition (starts_time)
select *
from (
    -- Rows added or updated today open a new version that is valid until 9999-12-31
    select
        id, create_date_time, update_date_time, deleted, customer_id, customer_relationship_id,
        session_id, sid, status, users, create_time, platform, s_name, seo_source, seo_keywords,
        ip, referrer, from_url, landing_page_url, url_title, to_peer, manual_time, begin_time,
        reply_msg_count, total_msg_count, msg_count, comment, finish_reason, finish_user, end_time,
        platform_description, browser_name, os_info, area, country, province, city, creator,
        name, idcard, phone, itcast_school_id, itcast_school, itcast_subject_id, itcast_subject,
        wechat, qq, email, gender, level, origin_type, information_way, working_years,
        technical_directions, customer_state, valid, anticipat_signup_date, clue_state,
        scrm_department_id, superior_url, superior_source, landing_url, landing_source,
        info_url, info_source, origin_channel, course_id, course_name, zhuge_session_id,
        is_repeat, tenant, activity_id, activity_name, follow_type, shunt_mode_id,
        shunt_employee_group_id,
        '9999-12-31' as ends_time,
        from_unixtime(unix_timestamp(), 'yyyy-MM-dd') as starts_time
    from itcast_ods.customer_clue_update
    where starts_time = from_unixtime(unix_timestamp(), 'yyyy-MM-dd')
    union all
    -- Existing rows: the open version of every updated id is closed the day before the update;
    -- rows that were not updated, or that are already closed, keep their ends_time
    select
        rs.id, rs.create_date_time, rs.update_date_time, rs.deleted, rs.customer_id,
        rs.customer_relationship_id, rs.session_id, rs.sid, rs.status, rs.users, rs.create_time,
        rs.platform, rs.s_name, rs.seo_source, rs.seo_keywords, rs.ip, rs.referrer, rs.from_url,
        rs.landing_page_url, rs.url_title, rs.to_peer, rs.manual_time, rs.begin_time,
        rs.reply_msg_count, rs.total_msg_count, rs.msg_count, rs.comment, rs.finish_reason,
        rs.finish_user, rs.end_time, rs.platform_description, rs.browser_name, rs.os_info,
        rs.area, rs.country, rs.province, rs.city, rs.creator, rs.name, rs.idcard, rs.phone,
        rs.itcast_school_id, rs.itcast_school, rs.itcast_subject_id, rs.itcast_subject,
        rs.wechat, rs.qq, rs.email, rs.gender, rs.level, rs.origin_type, rs.information_way,
        rs.working_years, rs.technical_directions, rs.customer_state, rs.valid,
        rs.anticipat_signup_date, rs.clue_state, rs.scrm_department_id, rs.superior_url,
        rs.superior_source, rs.landing_url, rs.landing_source, rs.info_url, rs.info_source,
        rs.origin_channel, rs.course_id, rs.course_name, rs.zhuge_session_id, rs.is_repeat,
        rs.tenant, rs.activity_id, rs.activity_name, rs.follow_type, rs.shunt_mode_id,
        rs.shunt_employee_group_id,
        if(up.id is null or rs.ends_time < '9999-12-31', rs.ends_time, date_add(up.starts_time, -1)) as ends_time,
        rs.starts_time
    from itcast_ods.customer_clue rs
    left join (
        select * from itcast_ods.customer_clue_update
        where starts_time = from_unixtime(unix_timestamp(), 'yyyy-MM-dd')
    ) up on rs.id = up.id
    where rs.starts_time >= date_add(from_unixtime(unix_timestamp()), -30)
) his
order by his.id, starts_time;
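The merge rule is easier to follow on a toy data set. The sketch below is not part of the original pipeline: it replays the same rule on two small CSV files with bash and awk. The function name `zipper_merge`, the three-column layout (`id,starts_time,ends_time`) and the use of GNU `date` are assumptions made purely for illustration.

```shell
#!/bin/bash
# Toy zipper-table merge (hypothetical helper, not from the original pipeline).
# $1: history CSV -> id,starts_time,ends_time
# $2: update CSV  -> id,starts_time   (rows added or changed in this run)
zipper_merge() {
    local history=$1 updates=$2
    awk -F, -v OFS=, '
        # First file (updates): remember the new start date and emit an open row.
        FILENAME == ARGV[1] { upd[$1] = $2; print $1, $2, "9999-12-31"; next }
        # History row that is still open and has an update: close it the day before.
        ($1 in upd) && $3 == "9999-12-31" {
            cmd = "date -d \"" upd[$1] " 1 day ago\" +%Y-%m-%d"
            cmd | getline closed
            close(cmd)
            $3 = closed
        }
        # Every history row is kept, whether it was closed or not.
        { print }
    ' "$updates" "$history" | sort -t, -k1,1n -k2,2
}
```

Feeding it a history with two open rows (ids 1 and 2) and an update for id 2 leaves id 1 untouched, closes the old version of id 2 one day before the update, and adds a new open version of id 2, which is what the `union all` query does at table scale.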

3.3.1.9.5 Overwrite the zipper table with the temporary table

INSERT OVERWRITE TABLE itcast_ods.customer_clue partition (starts_time) SELECT * from itcast_ods.customer_clue_tmp;

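3.3.1.9.6 Testing

Delete the test data from MySQL and HDFS (external table) to avoid duplicates and make the test result easy to verify
Insert new data into MySQL
Verify that the SQL used by Sqoop returns the test data when run directly in MySQL
Rebuild the update table
Run the Sqoop script manually to extract the data
Rebuild the tmp table
Merge the current day's new and updated data
Overwrite the zipper table with the temporary table

3.3.1.9.7 Oozie script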

#!/bin/bash
HIVE_HOME=/usr/bin/hive
SQOOP_HOME=/usr/bin/sqoop

# Date to process: first argument, or yesterday by default
if [[ -z "$1" ]]; then
    TD_DATE=$(date -d '1 day ago' +%Y-%m-%d)
else
    TD_DATE=$1
fi

# Step 1: rebuild the update table
output=$(${HIVE_HOME} -S -e "
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
DROP TABLE IF EXISTS itcast_ods.customer_clue_update;
CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue_update (
    id int COMMENT 'customer_clue_id',
    create_date_time STRING COMMENT 'creation time',
    update_date_time STRING COMMENT 'last update time',
    deleted STRING COMMENT 'whether deleted (disabled)',
    customer_id int COMMENT 'customer id',
    customer_relationship_id int COMMENT 'customer relationship id',
    session_id STRING COMMENT 'Qimo session id',
    sid STRING COMMENT 'visitor id',
    status STRING COMMENT 'status (undeal: unclaimed, deal: claimed, finish: closed, changePeer: transferred)',
    users STRING COMMENT 'owning agent',
    create_time STRING COMMENT 'Qimo creation time',
    platform STRING COMMENT 'source platform (pc: website, wap: WAP, sdk: app, weixin: WeChat)',
    s_name STRING COMMENT 'user name',
    seo_source STRING COMMENT 'search source',
    seo_keywords STRING COMMENT 'keywords',
    ip STRING COMMENT 'IP address',
    referrer STRING COMMENT 'referring page',
    from_url STRING COMMENT 'session source page',
    landing_page_url STRING COMMENT 'visitor landing page',
    url_title STRING COMMENT 'title of the consultation page',
    to_peer STRING COMMENT 'owning skill group',
    manual_time STRING COMMENT 'time human service started',
    begin_time STRING COMMENT 'time the agent claimed the session',
    reply_msg_count int COMMENT 'number of agent replies',
    total_msg_count int COMMENT 'total number of messages',
    msg_count int COMMENT 'number of messages sent by the customer',
    comment STRING COMMENT 'remarks',
    finish_reason STRING COMMENT 'end type',
    finish_user STRING COMMENT 'agent who ended the session',
    end_time STRING COMMENT 'session end time',
    platform_description STRING COMMENT 'customer platform information',
    browser_name STRING COMMENT 'browser name',
    os_info STRING COMMENT 'operating system name',
    area STRING COMMENT 'area',
    country STRING COMMENT 'country',
    province STRING COMMENT 'province',
    city STRING COMMENT 'city',
    creator int COMMENT 'creator',
    name STRING COMMENT 'customer name',
    idcard STRING COMMENT 'ID card number',
    phone STRING COMMENT 'mobile number',
    itcast_school_id int COMMENT 'campus id',
    itcast_school STRING COMMENT 'campus',
    itcast_subject_id int COMMENT 'subject id',
    itcast_subject STRING COMMENT 'subject',
    wechat STRING COMMENT 'WeChat',
    qq STRING COMMENT 'QQ number',
    email STRING COMMENT 'email',
    gender STRING COMMENT 'gender',
    level STRING COMMENT 'customer level',
    origin_type STRING COMMENT 'data source channel',
    information_way STRING COMMENT 'consultation method',
    working_years STRING COMMENT 'time the customer started working',
    technical_directions STRING COMMENT 'technical direction',
    customer_state STRING COMMENT 'current customer state',
    valid STRING COMMENT 'whether the clue is a valid online-sourced clue',
    anticipat_signup_date STRING COMMENT 'expected sign-up date',
    clue_state STRING COMMENT 'clue state',
    scrm_department_id int COMMENT 'internal SCRM department id',
    superior_url STRING COMMENT 'referring page URL (from Zhuge)',
    superior_source STRING COMMENT 'referring page title (from Zhuge)',
    landing_url STRING COMMENT 'landing page URL (from Zhuge)',
    landing_source STRING COMMENT 'landing page source (from Zhuge)',
    info_url STRING COMMENT 'lead-form page URL (from Zhuge)',
    info_source STRING COMMENT 'lead-form page title (from Zhuge)',
    origin_channel STRING COMMENT 'advertising channel',
    course_id int COMMENT 'course id',
    course_name STRING COMMENT 'course name',
    zhuge_session_id STRING COMMENT 'Zhuge session id',
    is_repeat int COMMENT 'duplicate clue by phone number (0: normal, 1: duplicate)',
    tenant int COMMENT 'tenant id',
    activity_id STRING COMMENT 'activity id',
    activity_name STRING COMMENT 'activity name',
    follow_type int COMMENT 'assignment type (0: automatic, 1: manual, 2: automatic transfer, 3: manual single transfer, 4: manual batch transfer, 5: claimed from the public pool)',
    shunt_mode_id int COMMENT 'matched skill group id',
    shunt_employee_group_id int COMMENT 'owning routing employee group',
    ends_time STRING COMMENT 'end of validity')
comment 'customer clue table'
PARTITIONED BY (starts_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');
")

# Step 2: extract yesterday's new and updated rows from MySQL.
# The two date ranges are wrapped in one pair of parentheses so that \$CONDITIONS
# applies to both; otherwise every mapper would import all newly created rows.
# The MySQL column is written as user in the source, while the Hive table calls it users;
# alias it in the query if the two names really differ.
output=$(${SQOOP_HOME} import \
--connect jdbc:mysql://172.17.0.202:3306/scrm \
--username root \
--password 123456 \
--query '
select id, create_date_time, update_date_time, deleted, customer_id, customer_relationship_id,
    session_id, sid, status, user, create_time, platform, s_name, seo_source, seo_keywords,
    ip, referrer, from_url, landing_page_url, url_title, to_peer, manual_time, begin_time,
    reply_msg_count, total_msg_count, msg_count, comment, finish_reason, finish_user, end_time,
    platform_description, browser_name, os_info, area, country, province, city, creator,
    name, idcard, phone, itcast_school_id, itcast_school, itcast_subject_id, itcast_subject,
    wechat, qq, email, gender, level, origin_type, information_way, working_years,
    technical_directions, customer_state, valid, anticipat_signup_date, clue_state,
    scrm_department_id, superior_url, superior_source, landing_url, landing_source,
    info_url, info_source, origin_channel, course_id, course_name, zhuge_session_id,
    is_repeat, tenant, activity_id, activity_name, follow_type, shunt_mode_id,
    shunt_employee_group_id,
    FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as starts_time,
    date_format("9999-12-31", "%Y-%m-%d") as ends_time
from customer_clue
where ((create_date_time >= FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(SYSDATE() AS DATE) - INTERVAL 1 DAY), "%Y-%m-%d %H:%i:%s")
        and create_date_time < FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(SYSDATE() AS DATE)), "%Y-%m-%d %H:%i:%s"))
    or (update_date_time >= FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(SYSDATE() AS DATE) - INTERVAL 1 DAY), "%Y-%m-%d %H:%i:%s")
        and update_date_time < FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(SYSDATE() AS DATE)), "%Y-%m-%d %H:%i:%s")))
  and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_clue_update \
--hive-partition-key starts_time \
--hive-partition-value ${TD_DATE} \
-m 100 \
--split-by id)

# Step 3: rebuild the tmp table, merge, and overwrite the zipper table
output=$(${HIVE_HOME} -S -e "
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
DROP TABLE IF EXISTS itcast_ods.customer_clue_tmp;
-- The tmp table has exactly the same definition as the update table
CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue_tmp LIKE itcast_ods.customer_clue_update;

insert overwrite table itcast_ods.customer_clue_tmp partition (starts_time)
select *
from (
    select
        id, create_date_time, update_date_time, deleted, customer_id, customer_relationship_id,
        session_id, sid, status, users, create_time, platform, s_name, seo_source, seo_keywords,
        ip, referrer, from_url, landing_page_url, url_title, to_peer, manual_time, begin_time,
        reply_msg_count, total_msg_count, msg_count, comment, finish_reason, finish_user, end_time,
        platform_description, browser_name, os_info, area, country, province, city, creator,
        name, idcard, phone, itcast_school_id, itcast_school, itcast_subject_id, itcast_subject,
        wechat, qq, email, gender, level, origin_type, information_way, working_years,
        technical_directions, customer_state, valid, anticipat_signup_date, clue_state,
        scrm_department_id, superior_url, superior_source, landing_url, landing_source,
        info_url, info_source, origin_channel, course_id, course_name, zhuge_session_id,
        is_repeat, tenant, activity_id, activity_name, follow_type, shunt_mode_id,
        shunt_employee_group_id,
        '9999-12-31' as ends_time,
        from_unixtime(unix_timestamp(), 'yyyy-MM-dd') as starts_time
    from itcast_ods.customer_clue_update
    where starts_time = from_unixtime(unix_timestamp(), 'yyyy-MM-dd')
    union all
    select
        rs.id, rs.create_date_time, rs.update_date_time, rs.deleted, rs.customer_id,
        rs.customer_relationship_id, rs.session_id, rs.sid, rs.status, rs.users, rs.create_time,
        rs.platform, rs.s_name, rs.seo_source, rs.seo_keywords, rs.ip, rs.referrer, rs.from_url,
        rs.landing_page_url, rs.url_title, rs.to_peer, rs.manual_time, rs.begin_time,
        rs.reply_msg_count, rs.total_msg_count, rs.msg_count, rs.comment, rs.finish_reason,
        rs.finish_user, rs.end_time, rs.platform_description, rs.browser_name, rs.os_info,
        rs.area, rs.country, rs.province, rs.city, rs.creator, rs.name, rs.idcard, rs.phone,
        rs.itcast_school_id, rs.itcast_school, rs.itcast_subject_id, rs.itcast_subject,
        rs.wechat, rs.qq, rs.email, rs.gender, rs.level, rs.origin_type, rs.information_way,
        rs.working_years, rs.technical_directions, rs.customer_state, rs.valid,
        rs.anticipat_signup_date, rs.clue_state, rs.scrm_department_id, rs.superior_url,
        rs.superior_source, rs.landing_url, rs.landing_source, rs.info_url, rs.info_source,
        rs.origin_channel, rs.course_id, rs.course_name, rs.zhuge_session_id, rs.is_repeat,
        rs.tenant, rs.activity_id, rs.activity_name, rs.follow_type, rs.shunt_mode_id,
        rs.shunt_employee_group_id,
        if(up.id is null or rs.ends_time < '9999-12-31', rs.ends_time, date_add(up.starts_time, -1)) as ends_time,
        rs.starts_time
    from itcast_ods.customer_clue rs
    left join (
        select * from itcast_ods.customer_clue_update
        where starts_time = from_unixtime(unix_timestamp(), 'yyyy-MM-dd')
    ) up on rs.id = up.id
    where rs.starts_time >= date_sub(from_unixtime(unix_timestamp()), 30)
) his
order by his.id, starts_time;

INSERT OVERWRITE TABLE itcast_ods.customer_clue partition (starts_time)
SELECT * from itcast_ods.customer_clue_tmp;
")

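3.3.2 Data cleaning and transformation

3.3.2.1 DWD

3.3.2.1.1 Analysis

The business side limits updates to a 30-day window, and the detail layer performs no aggregation, only cleaning and transformation. An incremental run therefore never has to look further back than the 1st of last month, and in this layer it only processes the rows that were added or updated yesterday.

start_time selects the time range to clean (yesterday: new and updated rows);

end_time selects the rows that are currently valid;

rows flagged as deleted are removed;

school id and subject id are checked, and empty values (NULL or 0) are uniformly converted to -1;

the origin_type source channel is converted to online/offline: NETSERVICE and PRESIGNUP map to 1 (online), everything else to 0 (offline).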
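The three row-level rules above are simple enough to try out without a cluster. The following is a minimal sketch in bash that mirrors them; the function names `normalize_id`, `origin_type_stat` and `hour_of` are hypothetical and exist only for this illustration.

```shell
#!/bin/bash
# Hypothetical helpers that mirror the DWD cleaning rules on single values.

# School/subject id: empty, NULL and 0 all become -1; any other id is kept.
normalize_id() {
    local id=$1
    if [[ -z "$id" || "$id" == "NULL" || "$id" == "0" ]]; then
        echo -1
    else
        echo "$id"
    fi
}

# origin_type: NETSERVICE and PRESIGNUP are online (1), everything else is offline (0).
origin_type_stat() {
    case "$1" in
        NETSERVICE|PRESIGNUP) echo 1 ;;
        *) echo 0 ;;
    esac
}

# Hour part of a 'yyyy-MM-dd HH:mm:ss' string, like substr(create_date_time, 12, 2) in Hive.
hour_of() {
    echo "${1:11:2}"
}
```

For example, `normalize_id 0` and `normalize_id ""` both print `-1`, and `hour_of "2019-11-01 08:15:30"` prints `08`. Note that Hive's `substr` is 1-based while bash substring offsets are 0-based, hence 12 versus 11.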

3.3.2.1.2 Code

3.3.2.1.2.1 SQL:

-- Partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
-- Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- Apply compression when writing
set hive.exec.orc.compression.strategy=COMPRESSION;
-- Bucketing
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

insert into table itcast_dwd.itcast_intention_dwd partition (yearinfo, monthinfo, dayinfo)
select
    rs.id as rid,
    rs.customer_id,
    rs.create_date_time,
    if((rs.itcast_school_id is null) or (rs.itcast_school_id = 0), -1, rs.itcast_school_id) as itcast_school_id,
    rs.deleted,
    rs.origin_type,
    if((rs.itcast_subject_id is null) or (rs.itcast_subject_id = 0), -1, rs.itcast_subject_id) as itcast_subject_id,
    substr(rs.create_date_time, 12, 2) as hourinfo,
    if(rs.origin_type in ('NETSERVICE', 'PRESIGNUP'), '1', '0') as origin_type_stat,
    substr(rs.create_date_time, 1, 4) as yearinfo,
    substr(rs.create_date_time, 6, 2) as monthinfo,
    substr(rs.create_date_time, 9, 2) as dayinfo
from itcast_ods.customer_relationship rs
where rs.deleted = 0
  and rs.start_time = '${Last_DATE}'   -- e.g. 2019-11-01
  and rs.end_time = '9999-12-31';

3.3.2.1.2.2 Shell script:

The shell script computes yesterday's date and substitutes it into the query condition of the SQL.

#!/bin/bash
HIVE_HOME=/usr/bin/hive

# Yesterday
Last_DATE=$(date -d "-1 day" +%Y-%m-%d)

${HIVE_HOME} -S -e "
-- Partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
-- Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- Apply compression when writing
set hive.exec.orc.compression.strategy=COMPRESSION;
-- Bucketing
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

insert into table itcast_dwd.itcast_intention_dwd partition (yearinfo, monthinfo, dayinfo)
select
    rs.id as rid,
    rs.customer_id,
    rs.create_date_time,
    if((rs.itcast_school_id is null) or (rs.itcast_school_id = 0), -1, rs.itcast_school_id) as itcast_school_id,
    rs.deleted,
    rs.origin_type,
    if((rs.itcast_subject_id is null) or (rs.itcast_subject_id = 0), -1, rs.itcast_subject_id) as itcast_subject_id,
    substr(rs.create_date_time, 12, 2) as hourinfo,
    if(rs.origin_type in ('NETSERVICE', 'PRESIGNUP'), '1', '0') as origin_type_stat,
    substr(rs.create_date_time, 1, 4) as yearinfo,
    substr(rs.create_date_time, 6, 2) as monthinfo,
    substr(rs.create_date_time, 9, 2) as dayinfo
from itcast_ods.customer_relationship rs
where rs.deleted = 0
  and substr(rs.start_time, 1, 10) = '${Last_DATE}'
  and rs.end_time = '9999-12-31';
"

3.3.2.2 DWM

The year/month/day partition columns restrict the join to data from the 1st of last month onward.

3.3.2.2.1 SQL:

insert overwrite table itcast_dwm.itcast_intention_dwm partition (yearinfo, monthinfo, dayinfo)
select
    dwd.customer_id,
    dwd.create_date_time,
    cus.area,
    dwd.itcast_school_id,
    sch.name as itcast_school_name,
    dwd.deleted,
    dwd.origin_type,
    dwd.itcast_subject_id,
    sub.name as itcast_subject_name,
    dwd.hourinfo,
    dwd.origin_type_stat,
    if(clue.clue_state = 'VALID_NEW_CLUES', '1', if(clue.clue_state = 'VALID_PUBLIC_NEW_CLUE', '0', '-1')) as clue_state_stat,
    e.department_id as tdepart_id,
    dept.name as tdepart_name,
    dwd.yearinfo,
    dwd.monthinfo,
    dwd.dayinfo
from itcast_dwd.itcast_intention_dwd dwd
left join itcast_ods.customer_clue clue on clue.customer_relationship_id = dwd.rid
left join itcast_dimen.customer cus on dwd.customer_id = cus.id
left join itcast_dimen.employee e on dwd.creator = e.id
left join itcast_dimen.scrm_department dept on e.department_id = dept.id
left join itcast_dimen.itcast_subject sub on dwd.itcast_subject_id = sub.id
left join itcast_dimen.itcast_school sch on dwd.itcast_school_id = sch.id
where concat_ws('-', dwd.yearinfo, dwd.monthinfo, dwd.dayinfo) >= '${Last_Month_DATE}';   -- e.g. 2019-11-01

3.3.2.2.2 Shell:

#!/bin/bash
HIVE_HOME=/usr/bin/hive

# 1st of last month, anchored on the 1st of the current month so that the result is also correct on the 29th to 31st
Last_Month_DATE=$(date -d "$(date +%Y%m)01 last month" +%Y-%m-01)

${HIVE_HOME} -S -e "
-- Partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
-- Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- Apply compression when writing
set hive.exec.orc.compression.strategy=COMPRESSION;
-- Bucketing
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

-- overwrite, as in the SQL listing, so that re-running the window does not duplicate rows
insert overwrite table itcast_dwm.itcast_intention_dwm partition (yearinfo, monthinfo, dayinfo)
select
    dwd.customer_id,
    dwd.create_date_time,
    cus.area,
    dwd.itcast_school_id,
    sch.name as itcast_school_name,
    dwd.deleted,
    dwd.origin_type,
    dwd.itcast_subject_id,
    sub.name as itcast_subject_name,
    dwd.hourinfo,
    dwd.origin_type_stat,
    if(clue.clue_state = 'VALID_NEW_CLUES', '1', if(clue.clue_state = 'VALID_PUBLIC_NEW_CLUE', '0', '-1')) as clue_state_stat,
    e.department_id as tdepart_id,
    dept.name as tdepart_name,
    dwd.yearinfo,
    dwd.monthinfo,
    dwd.dayinfo
from itcast_dwd.itcast_intention_dwd dwd
left join itcast_ods.customer_clue clue on clue.customer_relationship_id = dwd.rid
left join itcast_dimen.customer cus on dwd.customer_id = cus.id
left join itcast_dimen.employee e on dwd.creator = e.id
left join itcast_dimen.scrm_department dept on e.department_id = dept.id
left join itcast_dimen.itcast_subject sub on dwd.itcast_subject_id = sub.id
left join itcast_dimen.itcast_school sch on dwd.itcast_school_id = sch.id
where concat_ws('-', dwd.yearinfo, dwd.monthinfo, dwd.dayinfo) >= '${Last_Month_DATE}';
"
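Computing "the 1st of last month" has a well-known trap with GNU `date`: relative month arithmetic on the 29th to 31st can overflow into the following month, so `date -d "-1 month" +%Y-%m-01` may return the 1st of the current month instead. A minimal sketch of a safer helper follows; the function name `first_day_of_last_month` and its optional reference-date argument are assumptions added so the behavior can be checked for any date, and GNU `date` is assumed.

```shell
#!/bin/bash
# Hypothetical helper: 1st of the month before the reference date (default: today).
first_day_of_last_month() {
    local ref=${1:-$(date +%Y-%m-%d)}
    local month_start
    # Snap to the 1st of the reference month first, then step back one month,
    # so the subtraction never starts from a day that the previous month lacks.
    month_start=$(date -d "$ref" +%Y-%m-01)
    date -d "$month_start 1 month ago" +%Y-%m-%d
}
```

Called as `first_day_of_last_month 2021-03-31` it prints `2021-02-01`, and across a year boundary `first_day_of_last_month 2021-01-15` prints `2020-12-01`.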

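3.3.3 Statistical analysis

3.3.3.1 Total new intentions

Data from before 2016-10-12 can be queried for testing.

Hour and day data are recomputed from the 1st of last month onward; month data from last month onward; year data from the year that contains the 1st of last month onward.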
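The month and year queries in this section filter on `${V_Month}` and `${V_Year}`, which have to be derived from the 1st of last month. One possible way to do that in the driving shell script is sketched below; the function name `recalc_bounds` is hypothetical, and the input is assumed to be a `yyyy-MM-dd` string.

```shell
#!/bin/bash
# Hypothetical helper: prints the lower bounds for the hour/day, month and year queries.
recalc_bounds() {
    local last_month_date=$1                   # e.g. 2011-08-01
    echo "Last_Month_DATE=${last_month_date}"  # hour and day queries
    echo "V_Month=${last_month_date:0:7}"      # month queries, e.g. 2011-08
    echo "V_Year=${last_month_date:0:4}"       # year queries, e.g. 2011
}
```

Because the partition values are zero-padded strings, comparing `concat_ws('-', yearinfo, monthinfo)` against `V_Month` as plain text orders the months correctly.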

-- Total intentions (grouped by time and by the always-present dimensions: new/old, online/offline)
-- time_type: the source uses '1' for hour and '2' for day; it reused '1' for month and year as well,
-- which would make those rows indistinguishable, so '4' (month) and '5' (year) are assumed here
-- Hour
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select
    count(distinct customer_id) as customer_total,
    '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo, '-', monthinfo, '-', dayinfo, ' ', hourinfo) as time_str,
    '1' as grouptype, '1' as time_type,
    yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
where concat_ws('-', dwm.yearinfo, dwm.monthinfo, dwm.dayinfo) >= '${Last_Month_DATE}'   -- e.g. 2011-08-01
group by yearinfo, monthinfo, dayinfo, hourinfo, origin_type_stat, clue_state_stat;

-- Day
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select
    count(distinct customer_id) as customer_total,
    '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo, '-', monthinfo, '-', dayinfo) as time_str,
    '1' as grouptype, '2' as time_type,
    yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
where concat_ws('-', dwm.yearinfo, dwm.monthinfo, dwm.dayinfo) >= '${Last_Month_DATE}'   -- e.g. 2011-08-01
group by yearinfo, monthinfo, dayinfo, origin_type_stat, clue_state_stat;

-- Month
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select
    count(distinct customer_id) as customer_total,
    '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo, '-', monthinfo) as time_str,
    '1' as grouptype, '4' as time_type,
    yearinfo, monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
where concat_ws('-', dwm.yearinfo, dwm.monthinfo) >= '${V_Month}'   -- e.g. 2011-08
group by yearinfo, monthinfo, origin_type_stat, clue_state_stat;

-- Year
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select
    count(distinct customer_id) as customer_total,
    '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    yearinfo as time_str,
    '1' as grouptype, '5' as time_type,
    yearinfo, '-1' as monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
where dwm.yearinfo >= '${V_Year}'   -- e.g. 2011
group by yearinfo, origin_type_stat, clue_state_stat;

3.3.3.2 Heat map of intended students' locations

-- By area (grouped by area, time and the always-present dimensions: new/old, online/offline)
-- grouptype is '2' throughout this block (the month query in the source had '1', a copy-paste slip)
-- time_type: '1' hour and '2' day as in the source; '4' month and '5' year are assumed values,
-- because the source reused '1' for both
-- Hour
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select
    count(distinct customer_id) as customer_total,
    area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo, '-', monthinfo, '-', dayinfo, ' ', hourinfo) as time_str,
    '2' as grouptype, '1' as time_type,
    yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
where concat_ws('-', dwm.yearinfo, dwm.monthinfo, dwm.dayinfo) >= '${Last_Month_DATE}'   -- e.g. 2011-08-01
group by area, yearinfo, monthinfo, dayinfo, hourinfo, origin_type_stat, clue_state_stat;

-- Day
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select
    count(distinct customer_id) as customer_total,
    area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo, '-', monthinfo, '-', dayinfo) as time_str,
    '2' as grouptype, '2' as time_type,
    yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
where concat_ws('-', dwm.yearinfo, dwm.monthinfo, dwm.dayinfo) >= '${Last_Month_DATE}'   -- e.g. 2011-08-01
group by area, yearinfo, monthinfo, dayinfo, origin_type_stat, clue_state_stat;

-- Month
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select
    count(distinct customer_id) as customer_total,
    area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo, '-', monthinfo) as time_str,
    '2' as grouptype, '4' as time_type,
    yearinfo, monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
where concat_ws('-', dwm.yearinfo, dwm.monthinfo) >= '${V_Month}'   -- e.g. 2011-08
group by area, yearinfo, monthinfo, origin_type_stat, clue_state_stat;

-- Year
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select
    count(distinct customer_id) as customer_total,
    area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    '-1' as hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    yearinfo as time_str,
    '2' as grouptype, '5' as time_type,
    yearinfo, '-1' as monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
where dwm.yearinfo >= '${V_Year}'   -- e.g. 2011
group by area, yearinfo, origin_type_stat, clue_state_stat;

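3.3.3.3 Ranking of intended subjects and campuses

Omitted.

3.3.3.4 Share by source channel

Omitted.

3.3.3.5 Share by consultation center

Omitted.

3.3.3.6 Oozie shell example

Starting from the 1st of last month, the script derives the corresponding year and month strings, which replace the variables in the SQL.

Hourly data (the script runs the hourly total-intention query from 3.3.3.1):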

#!/bin/bash
HIVE_HOME=/usr/bin/hive

# 1st of last month
Last_Month_DATE=$(date -d "$(date +%Y%m)01 last month" +%Y-%m-01)
# Derive year, month and day from Last_Month_DATE
V_PARYEAR=$(date --date="$Last_Month_DATE" +%Y)
V_PARMONTH=$(date --date="$Last_Month_DATE" +%m)
V_PARDAY=$(date --date="$Last_Month_DATE" +%d)
# Quarter: %-m prints the month without zero padding (7 rather than 07)
V_month_for_quarter=$(date --date="$Last_Month_DATE" +%-m)
V_PARQUARTER=$(( (V_month_for_quarter - 1) / 3 + 1 ))

${HIVE_HOME} -S -e "
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select
    count(distinct customer_id) as customer_total,
    '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
    '-1' as itcast_subject_id, '-1' as itcast_subject_name,
    hourinfo, origin_type_stat, clue_state_stat,
    '-1' as tdepart_id, '-1' as tdepart_name,
    concat(yearinfo, '-', monthinfo, '-', dayinfo, ' ', hourinfo) as time_str,
    '1' as grouptype, '1' as time_type,
    yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
where concat_ws('-', dwm.yearinfo, dwm.monthinfo, dwm.dayinfo) >= '${Last_Month_DATE}'
group by yearinfo, monthinfo, dayinfo, hourinfo, origin_type_stat, clue_state_stat;
"
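The quarter arithmetic in the script can be isolated and extended to the case named in this post's title, the last day of the previous quarter. The sketch below does this on the shell side rather than in Hive; the function names `quarter_of` and `last_day_of_previous_quarter` are hypothetical, and GNU `date` is assumed.

```shell
#!/bin/bash
# Hypothetical helpers built on the same (month - 1) / 3 + 1 arithmetic as the script.

# Quarter (1-4) of a yyyy-MM-dd date.
quarter_of() {
    local month
    month=$(date -d "$1" +%-m)          # %-m: month without zero padding
    echo $(( (month - 1) / 3 + 1 ))
}

# Last day of the quarter before the one that contains the given date.
last_day_of_previous_quarter() {
    local ref=$1 year month first_month
    year=$(date -d "$ref" +%Y)
    month=$(date -d "$ref" +%-m)
    # First month of the current quarter: 1, 4, 7 or 10.
    first_month=$(( (month - 1) / 3 * 3 + 1 ))
    # One day before the first day of the current quarter.
    date -d "$(printf '%04d-%02d-01' "$year" "$first_month") 1 day ago" +%Y-%m-%d
}
```

For a date in the third quarter such as `2021-07-15`, `quarter_of` prints `3` and `last_day_of_previous_quarter` prints `2021-06-30`; for `2021-02-10` the previous quarter ends in the prior year, on `2020-12-31`.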

3.3.4 Exporting the data

Working year by year, first delete that year's rows from the target table, then export.

#!/bin/bash
SQOOP_HOME=/usr/bin/sqoop
HOST=172.17.0.202
USERNAME="root"
PASSWORD="123456"
PORT=3306
DBNAME="scrm_bi"
MYSQL=/usr/local/mysql_5723/bin/mysql

# 1st of last month: first argument, or computed from today
if [[ -z "$1" ]]; then
    Last_Month_DATE=$(date -d "$(date +%Y%m)01 last month" +%Y-%m-01)
else
    Last_Month_DATE=$1
fi
TD_YEAR=$(date -d "$Last_Month_DATE" +%Y)

# Delete the rows of that year from the target table first ...
${MYSQL} -h${HOST} -P${PORT} -u${USERNAME} -p${PASSWORD} -D${DBNAME} \
    -e "delete from itcast_intention_app where yearinfo = '${TD_YEAR}'"

# ... then export the same year from Hive
${SQOOP_HOME} export \
--connect "jdbc:mysql://${HOST}:${PORT}/${DBNAME}?useUnicode=true&characterEncoding=utf-8" \
--username ${USERNAME} \
--password ${PASSWORD} \
--table itcast_intention_app \
--hcatalog-database itcast_dws \
--hcatalog-table itcast_intention_dws \
--hcatalog-partition-keys yearinfo \
--hcatalog-partition-values ${TD_YEAR} \
-m 100

Requirement 6: for a given time period, the share of new intended customers produced by each consultation center
Metric: number of intentions
Dimensions:
Time: year, month, day, hour
New/old customer
Online/offline
Consultation center
Tables involved:
customer_relationship (intention table), employee (employee table), scrm_department (department table), customer_clue (clue table)
Fields involved:
Clue table: clue_state, which identifies new and old customers
Intention table: create_date_time; origin_type, which tells online from offline and also gives the source channel (NETSERVICE or PRESIGNUP means online, anything else offline); customer_id, deduplicated before counting
Employee table: tdepart_id (department id)
Department table: name
Join conditions:
clue.customer_relationship_id = intention.id
employee.tdepart_id = department.id
intention.creator = employee.id
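Summary:
Metric: number of intentions
Dimensions:
Time: year, month, day, hour
New/old customer
Online/offline
Product attributes: area, source channel, subject, campus, consultation center

Tables involved: 7
customer_relationship (intention table): create_date_time, origin_type, customer_id
employee (employee table): tdepart_id, id
scrm_department (department table): name, id
customer_clue (clue table): clue_state, deleted, create_date_time, customer_relationship_id
itcast_school (campus table): name, id
itcast_subject (subject table): name, id
customer (customer table): area, id
Joins:
clue.customer_relationship_id = intention.id
employee.tdepart_id = department.id
intention.creator = employee.id
intention.itcast_school_id = campus.id
subject.id = intention.itcast_subject_id
intention.customer_id = customer.id

Intention dashboard case: importing the raw business data (this layer does not exist in real projects)
create database scrm default character set utf8mb4 collate utf8mb4_unicode_ci;
From the Zhixing Education analytics platform materials distributed earlier, open the original complete data set, then scrm, and import the 7 tables into MySQL one by one.

Intention dashboard case: modeling analysis
ODS layer:
Fact table: the intention table
One extra table is placed here: the clue table (it is the fact table of a later dashboard topic, so it is kept in the ODS layer to simplify later processing)
Table type: managed (internal) table + bucketed + partitioned + zipper table
DIM layer (dimension layer):
Employee, campus, subject, customer and department tables
Table type: external table + partitioned
For both layers, each table simply mirrors the structure of its source table; when building it, add a start_time (extraction time) column.
Storage format and compression: ORC + ZLIB (SNAPPY)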

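DW layer:
DWD: cleans and transforms the ODS tables; if a table has too many columns, only the relevant ones are kept
Cleaning: remove rows that are already flagged as deleted
Transformation:
convert origin_type into 0/1 as a new column that marks online/offline
split create_date_time into year, month, day and hour
school id: convert 0 and NULL to a single value, -1
Columns:
Regular: id, create_date_time, deleted, customer_id, origin_type, origin_type_stat, itcast_school_id, itcast_subject_id, creator, hourinfo
Partition: year (yearinfo), month (monthinfo), day (dayinfo)
DWM: pre-aggregation by dimension (not possible here) and dimension degeneration
Dimension degeneration combines the six dimension tables with the DWD fact table into a single table. The idea is to decide which columns are needed from each dimension table and put them all into one wide table.
Columns:
Regular: customer_id, create_date_time, clue_state_stat, origin_type_stat, area, origin_type, itcast_school_id, school_name, itcast_subject_id, itcast_subject_name, department_id, department_name, hourinfo
Partition: year (yearinfo), month (monthinfo), day (dayinfo)
Producing this table requires a seven-table join across ODS and DIM.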

DWS: there is only one metric, so there is only one table
customerid_total, clue_state_stat, origin_type_stat, area, origin_type, itcast_school_id, school_name, itcast_subject_id, itcast_subject_name, department_id, department_name, time_type, group_type, hourinfo, time_str
Partition: year (yearinfo), month (monthinfo), day (dayinfo)
time_type: 1 (year), 2 (month), 3 (day), 4 (hour)
group_type: 1 area, 2 source channel, 3 subject, 4 campus, 5 consultation center, 6 total intentions

Sample result:
1000 0 0 year -1 -1 -1 -1
1000 0 1 year -1 -1 -1 -1
1000 1 0 year -1 -1 -1 -1
1000 1 1 year -1 -1 -1 -1
1000 0 0 year 11 -1 -1 -1
1000 0 1 year 11 -1 -1 -1
1000 1 0 year 11 -1 -1 -1
1000 1 1 year 11 -1 -1 -1
1000 0 0 year 11 01 -1 -1
1000 0 1 year 11 01 -1 -1
1000 1 0 year 11 01 -1 -1
1000 1 1 year 11 01 -1 -1
1000 0 0 year 11 -1 Shanxi -1
1000 0 1 year 11 -1 Shanxi -1
1000 1 0 year 11 -1 Shanxi -1
1000 1 1 year 11 -1 Shanxi -1
1000 0 0 year 11 01 -1 weixin
1000 0 1 year 11 01 -1 weixin
1000 1 0 year 11 01 -1 weixin
1000 1 1 year 11 01 -1 weixin

APP layer: not needed, because DWS already completes the analysis for every dimension.

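2. Bucketed table optimizations:
  Bucketed table: splits one file into several files. How many files is determined by the configured number of buckets.
  How is the split implemented underneath? It relies on MapReduce partitioning: the bucketing field is hashed and taken modulo the bucket count. This scatters the data while guaranteeing that rows with the same key are sent to the same reducer.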
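The hash-and-modulo step can be modeled in a few lines. This is a minimal sketch under one simplifying assumption: an integer key hashes to its own value. Hive's real hash function depends on the column type and version, and the function names here are illustrative only.

```python
def bucket_index(hash_value: int, num_buckets: int) -> int:
    """Map a hash value to a 0-based bucket index by modulo."""
    # Mask to a non-negative 31-bit value so negative hashes still land in a valid bucket.
    return (hash_value & 0x7FFFFFFF) % num_buckets


def assign_buckets(ids, num_buckets):
    """Distribute integer keys over buckets; equal keys always share a bucket."""
    buckets = {i: [] for i in range(num_buckets)}
    for value in ids:
        # Assumption for this sketch: an integer key hashes to itself.
        buckets[bucket_index(value, num_buckets)].append(value)
    return buckets


buckets = assign_buckets([1, 2, 6, 7, 12, 13], 6)
print(buckets)
```

Keys that are congruent modulo the bucket count (1, 7 and 13 here) end up in the same bucket, which is what later makes bucket-wise sampling and bucket-wise joins possible.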
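Working with bucketed tables:
  Create the table:
    create table test_buck(id int, name string)
    clustered by(id) sorted by (id asc) into 6 buckets   -- this line is the key part
    row format delimited fields terminated by '\t';
  Insert data:
    -- enable bucketing
    set hive.enforce.bucketing=true;
    insert into ...
  Note: load data cannot be used to put data into a bucketed table.
    set hive.strict.checks.bucketing = true;   -- forbids load data on bucketed tables, default true
  How to get data into a bucketed table:
    Create a staging (temporary) table for the bucketed table (it must not be bucketed itself), load the data into the staging table with load data, then use insert into ... select to move the data from the staging table into the bucketed table.
  What bucketing is used for, part 1: data sampling.
    After one file has been split into several files, a few of those files can be picked and processed on their own. This is sampling.
    Why sample:
      1. During testing the data may be too large. Sampling it and running the analysis on the sample speeds up development.
      2. When analyzing the whole dataset is impractical, a sample can be analyzed instead, and the result still reflects the whole dataset.
    How to sample:
      Format: select * from table tablesample(bucket x out of y on column) as a
      Position: directly after the table name. If the table has an alias, the sampling clause goes after the table and before the alias.
      Parameters of tablesample(bucket x out of y on column):
        x: the bucket to start from. x must be less than or equal to y.
        y: the sampling ratio. It must be a multiple or a factor of the bucket count.
        column: the field used for bucket sampling.
    Example: the table has 10 buckets and the bucketing field is id.
      tablesample(bucket 3 out of 5 on id):
        How many buckets are read? 10 / 5 = 2.
        Which two? The third bucket and the eighth bucket.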
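The arithmetic in the example can be checked with a short helper. It is a sketch that covers only the case where y is a factor of the bucket count (the case used above); the function name is an illustrative assumption.

```python
def sampled_buckets(num_buckets: int, x: int, y: int):
    """Return the 1-based bucket numbers read by tablesample(bucket x out of y)
    when y is a factor of num_buckets."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if num_buckets % y != 0:
        raise ValueError("this sketch only covers y being a factor of the bucket count")
    # Start at bucket x and step by y: num_buckets / y buckets in total.
    return list(range(x, num_buckets + 1, y))


print(sampled_buckets(10, 3, 5))
```

With 10 buckets, x = 3 and y = 5, the helper returns buckets 3 and 8, matching the answer worked out by hand.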

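What bucketing is used for, part 2: faster multi-table joins. The main tool is the map join.
  1. Map join: suited to joining a small table with a large table.
     Required settings:
       set hive.auto.convert.join=true;   -- must be on to enable the map join optimization, default true
       set hive.auto.convert.join.noconditionaltask.size=512000000;   -- small-table threshold; the default given here is 20971520 (20 MB), and the actual default depends on the Hive distribution and version
  2. Medium-sized table joined with a large table: to still get a map-side join, use Bucket-MapJoin.
     Required conditions:
       1) The join fields of both tables must be their bucketing fields.
       2) The bucket count of one table must be an integer multiple of the other's (typically the large table has as many buckets as the medium table, or a multiple of that number).
       3) Enable bucket map join: set hive.optimize.bucketmapjoin = true;
       4) Both tables must be bucketed tables, written with: set hive.enforce.bucketing=true;
     When all conditions are met, Hive uses Bucket-MapJoin automatically. If not, Hive checks whether an ordinary map join applies, and if that fails too it falls back to the plain reduce join.
  3. Large table joined with a large table: to still get a map-side join, use SMB Join (Sort-Merge-Bucket join).
     It builds on Bucket-MapJoin, so the Bucket-MapJoin conditions must hold first.
     Required conditions:
       1) The join fields of both tables must be their bucketing fields, and the tables must be sorted by the bucketing field.
       2) Both tables must have the same number of buckets.
       3) Enable bucket map join: set hive.optimize.bucketmapjoin = true;
       4) Both tables must be bucketed tables, written with: set hive.enforce.bucketing=true;
       5) Enable the SMB join settings:
          set hive.auto.convert.sortmerge.join=true;
          set hive.optimize.bucketmapjoin.sortedmerge = true;   -- turns on SMB join
          set hive.auto.convert.sortmerge.join.noconditionaltask=true;
          set hive.enforce.sorting=true;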
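The three cases above can be summarized as a decision function. This is only a model of the conditions listed in these notes, not Hive's planner: the TableInfo type, the field names and the order of the checks are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableInfo:
    size_bytes: int
    bucket_column: Optional[str] = None   # None means the table is not bucketed
    num_buckets: int = 0
    sorted_by_bucket_column: bool = False


def choose_join(a: TableInfo, b: TableInfo, join_column: str,
                small_table_limit: int = 20971520) -> str:
    """Pick a join strategy from the conditions described in the notes."""
    # Case 1: one side is small enough to be loaded into memory.
    if min(a.size_bytes, b.size_bytes) <= small_table_limit:
        return "map join"

    bucketed_on_key = (
        a.bucket_column == join_column and b.bucket_column == join_column
        and a.num_buckets > 0 and b.num_buckets > 0
    )
    if bucketed_on_key:
        # Case 3: equal bucket counts and both sides sorted on the join key.
        if (a.num_buckets == b.num_buckets
                and a.sorted_by_bucket_column and b.sorted_by_bucket_column):
            return "SMB join"
        # Case 2: one bucket count is an integer multiple of the other.
        big = max(a.num_buckets, b.num_buckets)
        little = min(a.num_buckets, b.num_buckets)
        if big % little == 0:
            return "bucket map join"

    # Nothing applies: fall back to the shuffle-based join.
    return "reduce join"


fact = TableInfo(10**12, "id", 10, True)
clue = TableInfo(10**10, "id", 10, True)
print(choose_join(fact, clue, "id"))
```

In the model, two large tables bucketed and sorted on the join key with equal bucket counts resolve to an SMB join, which is the situation the ODS intention and clue tables are designed for.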
Table creation for an SMB join:
create table test_smb_2(mid string, age_id string)
CLUSTERED BY(mid) SORTED BY(mid) INTO 500 BUCKETS;

-- 3. Intention dashboard: building the layered model
-- Preparation: enable compression on write
set hive.exec.orc.compression.strategy=COMPRESSION;

-- 3.1: Create the ODS layer tables: 2 tables (intention table and clue table)
CREATE DATABASE IF NOT EXISTS itcast_ods;
CREATE TABLE IF NOT EXISTS itcast_ods.`customer_relationship` (
  `id` int COMMENT 'customer relationship id',
  `create_date_time` STRING COMMENT 'creation time',
  `update_date_time` STRING COMMENT 'last update time',
  `deleted` int COMMENT 'deleted (disabled) flag',
  `customer_id` int COMMENT 'owning customer id',
  `first_id` int COMMENT 'first customer relationship id',
  `belonger` int COMMENT 'owner',
  `belonger_name` STRING COMMENT 'owner name',
  `initial_belonger` int COMMENT 'initial owner',
  `distribution_handler` int COMMENT 'assignment handler',
  `business_scrm_department_id` int COMMENT 'owning department',
  `last_visit_time` STRING COMMENT 'last follow-up visit time',
  `next_visit_time` STRING COMMENT 'next follow-up visit time',
  `origin_type` STRING COMMENT 'data source',
  `itcast_school_id` int COMMENT 'school id',
  `itcast_subject_id` int COMMENT 'subject id',
  `intention_study_type` STRING COMMENT 'intended study mode',
  `anticipat_signup_date` STRING COMMENT 'expected sign-up date',
  `level` STRING COMMENT 'customer level',
  `creator` int COMMENT 'creator',
  `current_creator` int COMMENT 'current creator: initially the creator, after a pull-back from the public pool the person who pulled it back',
  `creator_name` STRING COMMENT 'creator name',
  `origin_channel` STRING COMMENT 'source channel',
  `comment` STRING COMMENT 'remarks',
  `first_customer_clue_id` int COMMENT 'first clue id',
  `last_customer_clue_id` int COMMENT 'last clue id',
  `process_state` STRING COMMENT 'processing state',
  `process_time` STRING COMMENT 'processing state change time',
  `payment_state` STRING COMMENT 'payment state',
  `payment_time` STRING COMMENT 'payment state change time',
  `signup_state` STRING COMMENT 'sign-up state',
  `signup_time` STRING COMMENT 'sign-up time',
  `notice_state` STRING COMMENT 'notification state',
  `notice_time` STRING COMMENT 'notification state change time',
  `lock_state` STRING COMMENT 'lock state',
  `lock_time` STRING COMMENT 'lock state change time',
  `itcast_clazz_id` int COMMENT 'ems class id',
  `itcast_clazz_time` STRING COMMENT 'class enrollment time',
  `payment_url` STRING COMMENT 'payment link',
  `payment_url_time` STRING COMMENT 'payment link creation time',
  `ems_student_id` int COMMENT 'ems student id',
  `delete_reason` STRING COMMENT 'deletion reason',
  `deleter` int COMMENT 'deleted by',
  `deleter_name` STRING COMMENT 'name of the deleter',
  `delete_time` STRING COMMENT 'deletion time',
  `course_id` int COMMENT 'course ID',
  `course_name` STRING COMMENT 'course name',
  `delete_comment` STRING COMMENT 'deletion reason notes',
  `close_state` STRING COMMENT 'close state',
  `close_time` STRING COMMENT 'close state change time',
  `appeal_id` int COMMENT 'appeal id',
  `tenant` int COMMENT 'tenant',
  `total_fee` DECIMAL COMMENT 'total sign-up fee',
  `belonged` int COMMENT 'short-cycle owner',
  `belonged_time` STRING COMMENT 'ownership time',
  `belonger_time` STRING COMMENT 'ownership time',
  `transfer` int COMMENT 'transferred by',
  `transfer_time` STRING COMMENT 'transfer time',
  `follow_type` int COMMENT 'assignment type: 0 auto assign, 1 manual assign, 2 auto transfer, 3 manual single transfer, 4 manual batch transfer, 5 claimed from public pool',
  `transfer_bxg_oa_account` STRING COMMENT 'OA account of the Boxuegu owner transferred to',
  `transfer_bxg_belonger_name` STRING COMMENT 'OA name of the Boxuegu owner transferred to',
  `end_time` STRING COMMENT 'valid-until time')
comment 'customer relationship table'
PARTITIONED BY(start_time STRING)
clustered by(id) sorted by(id) into 10 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');
CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue (
  id int COMMENT 'customer_clue_id',
  create_date_time STRING COMMENT 'creation time',
  update_date_time STRING COMMENT 'last update time',
  deleted STRING COMMENT 'deleted (disabled) flag',
  customer_id int COMMENT 'customer id',
  customer_relationship_id int COMMENT 'customer relationship id',
  session_id STRING COMMENT 'Qimo session id',
  sid STRING COMMENT 'visitor id',
  status STRING COMMENT 'status (undeal: unclaimed, deal: claimed, finish: closed, changePeer: transferred)',
  users STRING COMMENT 'assigned agent',
  create_time STRING COMMENT 'Qimo creation time',
  platform STRING COMMENT 'platform source (pc: website consultation, wap: wap consultation, sdk: app consultation, weixin: WeChat consultation)',
  s_name STRING COMMENT 'user name',
  seo_source STRING COMMENT 'search source',
  seo_keywords STRING COMMENT 'keywords',
  ip STRING COMMENT 'IP address',
  referrer STRING COMMENT 'referring page',
  from_url STRING COMMENT 'session source page',
  landing_page_url STRING COMMENT 'visitor landing page',
  url_title STRING COMMENT 'consultation page title',
  to_peer STRING COMMENT 'skill group',
  manual_time STRING COMMENT 'manual service start time',
  begin_time STRING COMMENT 'time the agent claimed the session',
  reply_msg_count int COMMENT 'agent reply message count',
  total_msg_count int COMMENT 'total message count',
  msg_count int COMMENT 'customer message count',
  comment STRING COMMENT 'remarks',
  finish_reason STRING COMMENT 'end type',
  finish_user STRING COMMENT 'agent who ended the session',
  end_time STRING COMMENT 'session end time',
  platform_description STRING COMMENT 'customer platform info',
  browser_name STRING COMMENT 'browser name',
  os_info STRING COMMENT 'operating system name',
  area STRING COMMENT 'area',
  country STRING COMMENT 'country',
  province STRING COMMENT 'province',
  city STRING COMMENT 'city',
  creator int COMMENT 'creator',
  name STRING COMMENT 'customer name',
  idcard STRING COMMENT 'ID card number',
  phone STRING COMMENT 'mobile number',
  itcast_school_id int COMMENT 'school id',
  itcast_school STRING COMMENT 'school',
  itcast_subject_id int COMMENT 'subject id',
  itcast_subject STRING COMMENT 'subject',
  wechat STRING COMMENT 'WeChat',
  qq STRING COMMENT 'QQ number',
  email STRING COMMENT 'email',
  gender STRING COMMENT 'gender',
  level STRING COMMENT 'customer level',
  origin_type STRING COMMENT 'data source channel',
  information_way STRING COMMENT 'consultation method',
  working_years STRING COMMENT 'work start time',
  technical_directions STRING COMMENT 'technical direction',
  customer_state STRING COMMENT 'current customer state',
  valid STRING COMMENT 'whether the clue is a valid online-consultation clue',
  anticipat_signup_date STRING COMMENT 'expected sign-up date',
  clue_state STRING COMMENT 'clue state',
  scrm_department_id int COMMENT 'SCRM internal department id',
  superior_url STRING COMMENT 'referring page URL from Zhuge',
  superior_source STRING COMMENT 'referring page URL title from Zhuge',
  landing_url STRING COMMENT 'landing page URL from Zhuge',
  landing_source STRING COMMENT 'landing page URL source from Zhuge',
  info_url STRING COMMENT 'lead-form page URL from Zhuge',
  info_source STRING COMMENT 'lead-form page URL title from Zhuge',
  origin_channel STRING COMMENT 'advertising channel',
  course_id int COMMENT 'course number',
  course_name STRING COMMENT 'course name',
  zhuge_session_id STRING COMMENT 'zhuge session id',
  is_repeat int COMMENT 'duplicate clue by phone number: 0 normal, 1 duplicate',
  tenant int COMMENT 'tenant id',
  activity_id STRING COMMENT 'activity id',
  activity_name STRING COMMENT 'activity name',
  follow_type int COMMENT 'assignment type: 0 auto assign, 1 manual assign, 2 auto transfer, 3 manual single transfer, 4 manual batch transfer, 5 claimed from public pool',
  shunt_mode_id int COMMENT 'matched skill group id',
  shunt_employee_group_id int COMMENT 'employee routing group',
  ends_time STRING COMMENT 'valid-until time')
comment 'customer clue table'
PARTITIONED BY(starts_time STRING)
clustered by(customer_relationship_id) sorted by(customer_relationship_id) into 10 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');
-- 3.2: Create the DIM layer tables: 5 tables
CREATE DATABASE IF NOT EXISTS itcast_dimen;

CREATE TABLE IF NOT EXISTS itcast_dimen.`customer` (
  `id` int COMMENT 'key id',
  `customer_relationship_id` int COMMENT 'current intention id',
  `create_date_time` STRING COMMENT 'creation time',
  `update_date_time` STRING COMMENT 'last update time',
  `deleted` int COMMENT 'deleted (disabled) flag',
  `name` STRING COMMENT 'name',
  `idcard` STRING COMMENT 'ID card number',
  `birth_year` int COMMENT 'birth year',
  `gender` STRING COMMENT 'gender',
  `phone` STRING COMMENT 'mobile number',
  `wechat` STRING COMMENT 'WeChat',
  `qq` STRING COMMENT 'QQ number',
  `email` STRING COMMENT 'email',
  `area` STRING COMMENT 'area',
  `leave_school_date` date COMMENT 'school leaving date',
  `graduation_date` date COMMENT 'graduation date',
  `bxg_student_id` STRING COMMENT 'Boxuegu student ID, may be missing when not linked',
  `creator` int COMMENT 'creator ID',
  `origin_type` STRING COMMENT 'data source',
  `origin_channel` STRING COMMENT 'source channel',
  `tenant` int,
  `md_id` int COMMENT 'middle-platform id')
comment 'customer table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dimen.employee (
  id int COMMENT 'employee id',
  email STRING COMMENT 'company email, OA login account',
  real_name STRING COMMENT 'real name of the employee',
  phone STRING COMMENT 'mobile number, not used yet, the OA API does not provide it for privacy reasons',
  department_id STRING COMMENT 'OA department number, can be negative',
  department_name STRING COMMENT 'OA department name',
  remote_login STRING COMMENT 'whether the employee may log in remotely',
  job_number STRING COMMENT 'employee number',
  cross_school STRING COMMENT 'whether the employee has cross-campus permission',
  last_login_date STRING COMMENT 'last login date',
  creator int COMMENT 'creator',
  create_date_time STRING COMMENT 'creation time',
  update_date_time STRING COMMENT 'last update time',
  deleted STRING COMMENT 'deleted (disabled) flag',
  scrm_department_id int COMMENT 'SCRM internal department id',
  leave_office STRING COMMENT 'resignation status',
  leave_office_time STRING COMMENT 'resignation time',
  reinstated_time STRING COMMENT 'reinstatement time',
  superior_leaders_id int COMMENT 'superior leader ID',
  tdepart_id int COMMENT 'direct department',
  tenant int COMMENT 'tenant',
  ems_user_name STRING COMMENT 'ems user name')
comment 'employee table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dimen.`scrm_department` (
  `id` int COMMENT 'department id',
  `name` STRING COMMENT 'department name',
  `parent_id` int COMMENT 'parent department id',
  `create_date_time` STRING COMMENT 'creation time',
  `update_date_time` STRING COMMENT 'update time',
  `deleted` STRING COMMENT 'deletion flag',
  `id_path` STRING COMMENT 'full id path',
  `tdepart_code` int COMMENT 'direct department',
  `creator` STRING COMMENT 'creator',
  `depart_level` int COMMENT 'department level',
  `depart_sign` int COMMENT 'department flag, currently defaults to 1',
  `depart_line` int COMMENT 'business line, stores the business line code',
  `depart_sort` int COMMENT 'sort field',
  `disable_flag` int COMMENT 'disabled flag',
  `tenant` int COMMENT 'tenant')
comment 'scrm department table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dimen.`itcast_school` (
  `id` int COMMENT 'auto-increment primary key',
  `create_date_time` timestamp COMMENT 'creation time',
  `update_date_time` timestamp COMMENT 'last update time',
  `deleted` STRING COMMENT 'deleted (disabled) flag',
  `name` STRING COMMENT 'school name',
  `code` STRING COMMENT 'school code',
  `tenant` int COMMENT 'tenant')
comment 'school dictionary table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dimen.`itcast_subject` (
  `id` int COMMENT 'auto-increment primary key',
  `create_date_time` timestamp COMMENT 'creation time',
  `update_date_time` timestamp COMMENT 'last update time',
  `deleted` STRING COMMENT 'deleted (disabled) flag',
  `name` STRING COMMENT 'subject name',
  `code` STRING COMMENT 'subject code',
  `tenant` int COMMENT 'tenant')
comment 'subject dictionary table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- 3.3: Build the DWD layer (bucketed, to demonstrate the join optimization)
CREATE DATABASE IF NOT EXISTS itcast_dwd;
CREATE TABLE IF NOT EXISTS itcast_dwd.`itcast_intention_dwd` (
  `rid` int COMMENT 'id',
  `customer_id` STRING COMMENT 'customer id',
  `create_date_time` STRING COMMENT 'creation time',
  `itcast_school_id` STRING COMMENT 'school id',
  `deleted` STRING COMMENT 'deleted flag',
  `origin_type` STRING COMMENT 'source channel',
  `itcast_subject_id` STRING COMMENT 'subject id',
  `creator` int COMMENT 'creator',
  `hourinfo` STRING COMMENT 'hour',
  `origin_type_stat` STRING COMMENT 'data source: 0 offline, 1 online')
comment 'customer intention dwd table'
PARTITIONED BY(yearinfo STRING, monthinfo STRING, dayinfo STRING)
clustered by(rid) sorted by(rid) into 10 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- 3.4: Build the DWM layer
CREATE DATABASE IF NOT EXISTS itcast_dwm;
CREATE TABLE IF NOT EXISTS itcast_dwm.`itcast_intention_dwm` (
  `customer_id` STRING COMMENT 'customer id',
  `create_date_time` STRING COMMENT 'creation time',
  `area` STRING COMMENT 'area',
  `itcast_school_id` STRING COMMENT 'school id',
  `itcast_school_name` STRING COMMENT 'school name',
  `deleted` STRING COMMENT 'deleted flag',
  `origin_type` STRING COMMENT 'source channel',
  `itcast_subject_id` STRING COMMENT 'subject id',
  `itcast_subject_name` STRING COMMENT 'subject name',
  `hourinfo` STRING COMMENT 'hour',
  `origin_type_stat` STRING COMMENT 'data source: 0 offline, 1 online',
  `clue_state_stat` STRING COMMENT 'new or returning customer: 0 returning, 1 new',
  `tdepart_id` STRING COMMENT 'creator department id',
  `tdepart_name` STRING COMMENT 'consulting center name')
comment 'customer intention dwm table'
PARTITIONED BY(yearinfo STRING, monthinfo STRING, dayinfo STRING)
clustered by(customer_id) sorted by(customer_id) into 10 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- 3.5: Build the DWS layer
CREATE DATABASE IF NOT EXISTS itcast_dws;
CREATE TABLE IF NOT EXISTS itcast_dws.itcast_intention_dws (
  `customer_total` INT COMMENT 'aggregated intention customer count',
  `area` STRING COMMENT 'area',
  `itcast_school_id` STRING COMMENT 'school id',
  `itcast_school_name` STRING COMMENT 'school name',
  `origin_type` STRING COMMENT 'source channel',
  `itcast_subject_id` STRING COMMENT 'subject id',
  `itcast_subject_name` STRING COMMENT 'subject name',
  `hourinfo` STRING COMMENT 'hour',
  `origin_type_stat` STRING COMMENT 'data source: 0 offline, 1 online',
  `clue_state_stat` STRING COMMENT 'customer attribute: 0 returning, 1 new',
  `tdepart_id` STRING COMMENT 'creator department id',
  `tdepart_name` STRING COMMENT 'consulting center name',
  `time_str` STRING COMMENT 'time detail',
  `groupType` STRING COMMENT 'grouping type: 1 total intention count, 2 area, 3 school and subject combination, 4 source channel, 5 consulting center',
  `time_type` STRING COMMENT 'time dimension: 1 by hour, 2 by day, 3 by week, 4 by month, 5 by year')
comment 'customer intention dws table'
PARTITIONED BY(yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

4. Intention dashboard case: data ingestion

4.1: Ingest the DIM layer data:

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id, customer_relationship_id, create_date_time, update_date_time, deleted, name, idcard, birth_year, gender, phone, wechat, qq, email, area, leave_school_date, graduation_date, bxg_student_id, creator, origin_type, origin_channel, tenant, md_id, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from customer where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table customer \
-m 1 \
--split-by id

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id,email,real_name,-1 as phone,department_id,department_name,remote_login,job_number,cross_school,last_login_date,creator,create_date_time,update_date_time,deleted,scrm_department_id,leave_office,leave_office_time,reinstated_time,superior_leaders_id,tdepart_id,tenant,ems_user_name,FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from employee where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table employee \
-m 1 \
--split-by id

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select *, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from scrm_department where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table scrm_department \
-m 1 \
--split-by id

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select *, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from itcast_school where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table itcast_school \
-m 1 \
--split-by id

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select *, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from itcast_subject where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table itcast_subject \
-m 1 \
--split-by id
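4.2: Ingest the ODS layer data
The two ODS tables are bucketed tables, and sqoop cannot import into bucketed tables. The workaround: build a staging (temporary) table for each bucketed table, import the data into the staging table with sqoop, then load the bucketed table from the staging table with insert into.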
4.2.1: Importing the intention table
Step 1: create the staging table for the intention table
CREATE TABLE IF NOT EXISTS itcast_ods.`customer_relationship_tmp` (
  `id` int COMMENT 'customer relationship id',
  `create_date_time` STRING COMMENT 'creation time',
  `update_date_time` STRING COMMENT 'last update time',
  `deleted` int COMMENT 'deleted (disabled) flag',
  `customer_id` int COMMENT 'owning customer id',
  `first_id` int COMMENT 'first customer relationship id',
  `belonger` int COMMENT 'owner',
  `belonger_name` STRING COMMENT 'owner name',
  `initial_belonger` int COMMENT 'initial owner',
  `distribution_handler` int COMMENT 'assignment handler',
  `business_scrm_department_id` int COMMENT 'owning department',
  `last_visit_time` STRING COMMENT 'last follow-up visit time',
  `next_visit_time` STRING COMMENT 'next follow-up visit time',
  `origin_type` STRING COMMENT 'data source',
  `itcast_school_id` int COMMENT 'school id',
  `itcast_subject_id` int COMMENT 'subject id',
  `intention_study_type` STRING COMMENT 'intended study mode',
  `anticipat_signup_date` STRING COMMENT 'expected sign-up date',
  `level` STRING COMMENT 'customer level',
  `creator` int COMMENT 'creator',
  `current_creator` int COMMENT 'current creator: initially the creator, after a pull-back from the public pool the person who pulled it back',
  `creator_name` STRING COMMENT 'creator name',
  `origin_channel` STRING COMMENT 'source channel',
  `comment` STRING COMMENT 'remarks',
  `first_customer_clue_id` int COMMENT 'first clue id',
  `last_customer_clue_id` int COMMENT 'last clue id',
  `process_state` STRING COMMENT 'processing state',
  `process_time` STRING COMMENT 'processing state change time',
  `payment_state` STRING COMMENT 'payment state',
  `payment_time` STRING COMMENT 'payment state change time',
  `signup_state` STRING COMMENT 'sign-up state',
  `signup_time` STRING COMMENT 'sign-up time',
  `notice_state` STRING COMMENT 'notification state',
  `notice_time` STRING COMMENT 'notification state change time',
  `lock_state` STRING COMMENT 'lock state',
  `lock_time` STRING COMMENT 'lock state change time',
  `itcast_clazz_id` int COMMENT 'ems class id',
  `itcast_clazz_time` STRING COMMENT 'class enrollment time',
  `payment_url` STRING COMMENT 'payment link',
  `payment_url_time` STRING COMMENT 'payment link creation time',
  `ems_student_id` int COMMENT 'ems student id',
  `delete_reason` STRING COMMENT 'deletion reason',
  `deleter` int COMMENT 'deleted by',
  `deleter_name` STRING COMMENT 'name of the deleter',
  `delete_time` STRING COMMENT 'deletion time',
  `course_id` int COMMENT 'course ID',
  `course_name` STRING COMMENT 'course name',
  `delete_comment` STRING COMMENT 'deletion reason notes',
  `close_state` STRING COMMENT 'close state',
  `close_time` STRING COMMENT 'close state change time',
  `appeal_id` int COMMENT 'appeal id',
  `tenant` int COMMENT 'tenant',
  `total_fee` DECIMAL COMMENT 'total sign-up fee',
  `belonged` int COMMENT 'short-cycle owner',
  `belonged_time` STRING COMMENT 'ownership time',
  `belonger_time` STRING COMMENT 'ownership time',
  `transfer` int COMMENT 'transferred by',
  `transfer_time` STRING COMMENT 'transfer time',
  `follow_type` int COMMENT 'assignment type: 0 auto assign, 1 manual assign, 2 auto transfer, 3 manual single transfer, 4 manual batch transfer, 5 claimed from public pool',
  `transfer_bxg_oa_account` STRING COMMENT 'OA account of the Boxuegu owner transferred to',
  `transfer_bxg_belonger_name` STRING COMMENT 'OA name of the Boxuegu owner transferred to',
  `end_time` STRING COMMENT 'valid-until time')
comment 'customer relationship table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');
Step 2: import the data into the staging table with sqoop:
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id, create_date_time, update_date_time, deleted, customer_id, first_id, belonger, belonger_name, initial_belonger, distribution_handler, business_scrm_department_id, last_visit_time, next_visit_time, origin_type, itcast_school_id, itcast_subject_id, intention_study_type, anticipat_signup_date, level, creator, current_creator, creator_name, origin_channel, comment, first_customer_clue_id, last_customer_clue_id, process_state, process_time, payment_state, payment_time, signup_state, signup_time, notice_state, notice_time, lock_state, lock_time, itcast_clazz_id, itcast_clazz_time, payment_url, payment_url_time, ems_student_id, delete_reason, deleter, deleter_name, delete_time, course_id, course_name, delete_comment, close_state, close_time, appeal_id, tenant, total_fee, belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type, transfer_bxg_oa_account, transfer_bxg_belonger_name, date_format("9999-12-31","%Y-%m-%d") as end_time, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from customer_relationship where $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_relationship_tmp \
-m 1 \
--split-by id
-- Step 3: load the staging table data into the bucketed ODS intention table:
-- partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
-- hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- make compression take effect on write
set hive.exec.orc.compression.strategy=COMPRESSION;
-- bucketing
set hive.optimize.bucketmapjoin = true;
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

insert into table itcast_ods.customer_relationship partition(start_time)
select * from itcast_ods.customer_relationship_tmp;

4.2.2: Importing the clue table into the ODS layer
Step 1: create the staging table for the clue table:
CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue_tmp (
  id int COMMENT 'customer_clue_id',
  create_date_time STRING COMMENT 'creation time',
  update_date_time STRING COMMENT 'last update time',
  deleted STRING COMMENT 'deleted (disabled) flag',
  customer_id int COMMENT 'customer id',
  customer_relationship_id int COMMENT 'customer relationship id',
  session_id STRING COMMENT 'Qimo session id',
  sid STRING COMMENT 'visitor id',
  status STRING COMMENT 'status (undeal: unclaimed, deal: claimed, finish: closed, changePeer: transferred)',
  users STRING COMMENT 'assigned agent',
  create_time STRING COMMENT 'Qimo creation time',
  platform STRING COMMENT 'platform source (pc: website consultation, wap: wap consultation, sdk: app consultation, weixin: WeChat consultation)',
  s_name STRING COMMENT 'user name',
  seo_source STRING COMMENT 'search source',
  seo_keywords STRING COMMENT 'keywords',
  ip STRING COMMENT 'IP address',
  referrer STRING COMMENT 'referring page',
  from_url STRING COMMENT 'session source page',
  landing_page_url STRING COMMENT 'visitor landing page',
  url_title STRING COMMENT 'consultation page title',
  to_peer STRING COMMENT 'skill group',
  manual_time STRING COMMENT 'manual service start time',
  begin_time STRING COMMENT 'time the agent claimed the session',
  reply_msg_count int COMMENT 'agent reply message count',
  total_msg_count int COMMENT 'total message count',
  msg_count int COMMENT 'customer message count',
  comment STRING COMMENT 'remarks',
  finish_reason STRING COMMENT 'end type',
  finish_user STRING COMMENT 'agent who ended the session',
  end_time STRING COMMENT 'session end time',
  platform_description STRING COMMENT 'customer platform info',
  browser_name STRING COMMENT 'browser name',
  os_info STRING COMMENT 'operating system name',
  area STRING COMMENT 'area',
  country STRING COMMENT 'country',
  province STRING COMMENT 'province',
  city STRING COMMENT 'city',
  creator int COMMENT 'creator',
  name STRING COMMENT 'customer name',
  idcard STRING COMMENT 'ID card number',
  phone STRING COMMENT 'mobile number',
  itcast_school_id int COMMENT 'school id',
  itcast_school STRING COMMENT 'school',
  itcast_subject_id int COMMENT 'subject id',
  itcast_subject STRING COMMENT 'subject',
  wechat STRING COMMENT 'WeChat',
  qq STRING COMMENT 'QQ number',
  email STRING COMMENT 'email',
  gender STRING COMMENT 'gender',
  level STRING COMMENT 'customer level',
  origin_type STRING COMMENT 'data source channel',
  information_way STRING COMMENT 'consultation method',
  working_years STRING COMMENT 'work start time',
  technical_directions STRING COMMENT 'technical direction',
  customer_state STRING COMMENT 'current customer state',
  valid STRING COMMENT 'whether the clue is a valid online-consultation clue',
  anticipat_signup_date STRING COMMENT 'expected sign-up date',
  clue_state STRING COMMENT 'clue state',
  scrm_department_id int COMMENT 'SCRM internal department id',
  superior_url STRING COMMENT 'referring page URL from Zhuge',
  superior_source STRING COMMENT 'referring page URL title from Zhuge',
  landing_url STRING COMMENT 'landing page URL from Zhuge',
  landing_source STRING COMMENT 'landing page URL source from Zhuge',
  info_url STRING COMMENT 'lead-form page URL from Zhuge',
  info_source STRING COMMENT 'lead-form page URL title from Zhuge',
  origin_channel STRING COMMENT 'advertising channel',
  course_id int COMMENT 'course number',
  course_name STRING COMMENT 'course name',
  zhuge_session_id STRING COMMENT 'zhuge session id',
  is_repeat int COMMENT 'duplicate clue by phone number: 0 normal, 1 duplicate',
  tenant int COMMENT 'tenant id',
  activity_id STRING COMMENT 'activity id',
  activity_name STRING COMMENT 'activity name',
  follow_type int COMMENT 'assignment type: 0 auto assign, 1 manual assign, 2 auto transfer, 3 manual single transfer, 4 manual batch transfer, 5 claimed from public pool',
  shunt_mode_id int COMMENT 'matched skill group id',
  shunt_employee_group_id int COMMENT 'employee routing group',
  ends_time STRING COMMENT 'valid-until time')
comment 'customer clue table'
PARTITIONED BY(starts_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');
Step 2: import the data into the clue staging table with sqoop
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id,create_date_time,update_date_time,deleted,customer_id,customer_relationship_id,session_id,sid,status,user as users,create_time,platform,s_name,seo_source,seo_keywords,ip,referrer,from_url,landing_page_url,url_title,to_peer,manual_time,begin_time,reply_msg_count,total_msg_count,msg_count,comment,finish_reason,finish_user,end_time,platform_description,browser_name,os_info,area,country,province,city,creator,name,"-1" as idcard,"-1" as phone,itcast_school_id,itcast_school,itcast_subject_id,itcast_subject,"-1" as wechat,"-1" as qq,"-1" as email,gender,level,origin_type,information_way,working_years,technical_directions,customer_state,valid,anticipat_signup_date,clue_state,scrm_department_id,superior_url,superior_source,landing_url,landing_source,info_url,info_source,origin_channel,course_id,course_name,zhuge_session_id,is_repeat,tenant,activity_id,activity_name,follow_type,shunt_mode_id,shunt_employee_group_id,date_format("9999-12-31","%Y-%m-%d") as ends_time,FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as starts_time from customer_clue where $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_clue_tmp \
-m 1 \
--split-by id
Step 3: load the staging table data into the clue table:
insert into table itcast_ods.customer_clue partition(starts_time)
select * from itcast_ods.customer_clue_tmp;

4.3: Cleaning and conversion: ODS intention table --> cleaned intention table in the DWD layer
Which cleaning and conversion steps are needed?
  Cleaning:
    Remove the rows marked deleted = 1.
  Conversion:
    create_date_time: derive year, month, day and hour.
    origin_type: generate a new field origin_type_stat that separates online from offline.
    School id and subject id: during synchronization, 0 and null are unified to -1.
SQL for cleaning and conversion:
  INSERT INTO TABLE itcast_dwd.itcast_intention_dwd partition(yearinfo,monthinfo,dayinfo)
  select
    id as rid,
    customer_id,
    create_date_time,
    if(itcast_school_id is null or itcast_school_id = 0, '-1', itcast_school_id) as itcast_school_id,
    deleted,
    origin_type,
    if(itcast_subject_id is null or itcast_subject_id = 0, '-1', itcast_subject_id) as itcast_subject_id,
    creator,
    substr(create_date_time,12,2) as hourinfo,
    if(origin_type in('NETSERVICE','PRESIGNUP'),'1','0') as origin_type_stat,
    substr(create_date_time,1,4) as yearinfo,
    substr(create_date_time,6,2) as monthinfo,
    substr(create_date_time,9,2) as dayinfo
  -- TABLESAMPLE reads only bucket 1 of the 10 buckets, i.e. a sample for this demo;
  -- remove the TABLESAMPLE clause to process every row.
  from itcast_ods.customer_relationship TABLESAMPLE(BUCKET 1 OUT OF 10 on id) as cr
  where deleted = 0;
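The per-row logic of this statement can be mirrored in plain Python to check the substring offsets and the -1 normalization. This is a sketch for one row held in a dict; the function names and the sample row are illustrative assumptions, and the timestamp is assumed to have the form 'YYYY-MM-DD HH:MM:SS'.

```python
ONLINE_ORIGIN_TYPES = {"NETSERVICE", "PRESIGNUP"}


def normalize_id(value):
    """Map None and 0 to '-1', mirroring if(x is null or x = 0, '-1', x)."""
    return "-1" if value is None or value == 0 else str(value)


def clean_row(row):
    """Return the DWD form of one ODS row, or None when the row is marked deleted."""
    if row["deleted"] != 0:
        return None
    ts = row["create_date_time"]  # assumed form: 'YYYY-MM-DD HH:MM:SS'
    return {
        "rid": row["id"],
        "customer_id": row["customer_id"],
        "create_date_time": ts,
        "itcast_school_id": normalize_id(row["itcast_school_id"]),
        "deleted": row["deleted"],
        "origin_type": row["origin_type"],
        "itcast_subject_id": normalize_id(row["itcast_subject_id"]),
        "creator": row["creator"],
        "hourinfo": ts[11:13],   # substr(ts, 12, 2)
        "origin_type_stat": "1" if row["origin_type"] in ONLINE_ORIGIN_TYPES else "0",
        "yearinfo": ts[0:4],     # substr(ts, 1, 4)
        "monthinfo": ts[5:7],    # substr(ts, 6, 2)
        "dayinfo": ts[8:10],     # substr(ts, 9, 2)
    }


sample = {
    "id": 1, "customer_id": 42, "create_date_time": "2021-09-27 14:05:33",
    "itcast_school_id": 0, "itcast_subject_id": None, "deleted": 0,
    "origin_type": "NETSERVICE", "creator": 9,
}
cleaned = clean_row(sample)
print(cleaned)
```

Note that Hive's substr is 1-based while Python slices are 0-based, which is why substr(create_date_time,12,2) corresponds to ts[11:13].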
-- 4.4: Data conversion: DWD --> DWM
-- partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
-- hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- make compression take effect on write
set hive.exec.orc.compression.strategy=COMPRESSION;
-- bucketing
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

insert into table itcast_dwm.itcast_intention_dwm partition(yearinfo,monthinfo,dayinfo)
select
  iid.customer_id,
  iid.create_date_time,
  dcu.area,
  iid.itcast_school_id,
  dis.name,
  iid.deleted,
  iid.origin_type,
  iid.itcast_subject_id,
  disub.name,
  iid.hourinfo,
  iid.origin_type_stat,
  -- new versus returning customer
  if(cc.clue_state = 'VALID_NEW_CLUES', '1', if(cc.clue_state = 'VALID_PUBLIC_NEW_CLUE', '0', '-1')) as clue_state_stat,
  demp.tdepart_id,
  dsd.name,
  iid.yearinfo,
  iid.monthinfo,
  iid.dayinfo
from itcast_dwd.itcast_intention_dwd as iid
  left join itcast_ods.customer_clue as cc on iid.rid = cc.customer_relationship_id
  left join itcast_dimen.itcast_school as dis on dis.id = iid.itcast_school_id
  left join itcast_dimen.itcast_subject as disub on disub.id = iid.itcast_subject_id
  left join itcast_dimen.customer as dcu on dcu.id = iid.customer_id
  left join itcast_dimen.employee as demp on demp.id = iid.creator
  left join itcast_dimen.scrm_department as dsd on dsd.id = demp.tdepart_id;

Testing shows that the join between itcast_intention_dwd and customer_clue is executed as an SMB map join, while the joins to the other tables are ordinary map joins.
4.5) Statistical analysis:
  Metric: intention count
  Dimensions:
    Time: year, month, day, hour
    New versus returning customers
    Online versus offline
    Product attribute dimensions: area, source channel, subject, school, consulting center

-- Requirement 1: per month, count intention customers by new/returning and online/offline
insert into table itcast_dws.itcast_intention_dws partition(yearinfo,monthinfo,dayinfo)
select
  count(distinct customer_id) as customer_total,
  '-1' as area,
  '-1' as itcast_school_id,
  '-1' as itcast_school_name,
  '-1' as origin_type,
  '-1' as itcast_subject_id,
  '-1' as itcast_subject_name,
  '-1' as hourinfo,
  origin_type_stat,
  clue_state_stat,
  '-1' as tdepart_id,
  '-1' as tdepart_name,
  concat(yearinfo,'-',monthinfo) as time_str,
  '1' as grouptype,
  '4' as time_type,
  yearinfo,
  monthinfo,
  '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm
group by yearinfo, monthinfo, clue_state_stat, origin_type_stat;

-- Requirement 2: per day, count intention customers by new/returning, online/offline and area
insert into table itcast_dws.itcast_intention_dws partition(yearinfo,monthinfo,dayinfo)
select
  count(distinct customer_id) as customer_total,
  area,
  '-1' as itcast_school_id,
  '-1' as itcast_school_name,
  '-1' as origin_type,
  '-1' as itcast_subject_id,
  '-1' as itcast_subject_name,
  '-1' as hourinfo,
  origin_type_stat,
  clue_state_stat,
  '-1' as tdepart_id,
  '-1' as tdepart_name,
  concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
  '2' as grouptype,
  '2' as time_type,
  yearinfo,
  monthinfo,
  dayinfo
from itcast_dwm.itcast_intention_dwm
group by yearinfo, monthinfo, dayinfo, clue_state_stat, origin_type_stat, area;
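The shape of a DWS row, with '-1' filling every dimension that a given grouping does not use, may be easier to see in a small model of requirement 1. The function below is a sketch over rows held as dicts; its name and the abbreviated set of output keys are assumptions for illustration, not part of the course code.

```python
from collections import defaultdict


def monthly_totals(dwm_rows):
    """Group by (yearinfo, monthinfo, clue_state_stat, origin_type_stat)
    and count distinct customer_id values per group."""
    groups = defaultdict(set)
    for r in dwm_rows:
        key = (r["yearinfo"], r["monthinfo"], r["clue_state_stat"], r["origin_type_stat"])
        groups[key].add(r["customer_id"])

    result = []
    for (year, month, clue, origin), customers in sorted(groups.items()):
        result.append({
            "customer_total": len(customers),
            "area": "-1",              # dimension not used by this grouping
            "origin_type_stat": origin,
            "clue_state_stat": clue,
            "time_str": f"{year}-{month}",
            "grouptype": "1",          # total intention count
            "time_type": "4",          # by month
            "yearinfo": year,
            "monthinfo": month,
            "dayinfo": "-1",           # no day level in a monthly row
        })
    return result


base = {"yearinfo": "2021", "monthinfo": "09", "origin_type_stat": "1"}
dwm_rows = [
    dict(base, customer_id="1", clue_state_stat="1"),
    dict(base, customer_id="1", clue_state_stat="1"),  # same customer again
    dict(base, customer_id="2", clue_state_stat="1"),
    dict(base, customer_id="3", clue_state_stat="0"),
]
totals = monthly_totals(dwm_rows)
print(totals)
```

The other requirements follow the same pattern: add the relevant dimension (area, school, subject, channel or consulting center) to the grouping key, keep its real value in the output row, and change grouptype and time_type accordingly.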
