# 漏斗引擎

简单的说就是支持客户在自己的行为数据上,让用户创建自定义行为漏斗,提供漏斗转化数据。我们就核心逻辑来详细解说。表结构和数据样例见漏斗数据表 本文起抛砖引玉的作用,希望给大家一定的参考。

# 数据准备

# 前端漏斗配置相关

1.漏斗配置数据(Json 格式,数据表funnel_json_info),这部分由用户通过前端界面配置并落表,我们通过解析(dim_funnel_info.py)分别输出三个表:

2.漏斗配置结构维表(funnel_new_config_structure_syscustom,dim_config_item_funnel),各种条件的总维表,根据业务定义。

3.漏斗具体步骤展开表(dim_funnel_branchs_condition_funnel),这里把dim_funnel_condition_funnel中有分支步骤漏斗规则进行复制,即每个分支都有一整套单独的漏斗步骤。

4.维度展示表,包含默认值、展示顺序等(funnel_new_display_dimdim_dim_option_funnel

5.全量维度表,定义漏斗的维护,可以从订单流量等表中定义(dim_dim,sqoop_dim_dim.py)

# 事实数据准备

分别准备订单、注册和流量数据合并在一起bdl_input_funnel,字段log_json配置了各个数据源的信息内容。

  • bdl_input_order_funnel -- 下单 type:order
  • bdl_input_order_ship_funnel -- 付款 type:order_ship
  • bdl_input_reg_suc_funnel -- 注册 type:reg_suc
  • bdl_input_web_event_funnel -- 流量事件 type:web_event
  • bdl_input_web_pageview_funnel -- 流量页面 type:web_pageview
to_json(
    'appd',appd,
    if(visit_id!='' and site_session_id!='','web_session',null),
    if(visit_id!='' and site_session_id!='',site_session_id,null ),
    'order_income_isgt0',order_income_isgt0
) as log_json,

接着与dim_funnel_condition_funnel关联,筛选出需要运算的最终事实表 bdl_input_sys_funnelbdl_input_sys_dw_track_funnel), 这步根据业务情况处理。实际就是筛选bdl_input_funnel表 type 字段有几种值参与计算。

# 客户配置漏斗初筛

这个步骤进行漏斗匹配的初步筛选,只要符合一个漏斗步骤就参与后续的计算。做完这一步我们的漏斗其实已经完成,匹配到符合条件的记录会被保留下来,然后我们就可以计算各步骤的漏斗数据了。

1.首先将dim_funnel_condition_funnel中的branch_step_type展开,因为同一个步骤可能适用不同类型事实数据。例如funnel_id为33里面有一个branch_step_type是web_event,order

2.将dim_funnel_setting_funnel失效的漏斗规则过滤。

select * from dim_funnel_setting_funnel where funnel_status=1 and syscustom_code=? and  end_date>=?

3.最后将bdl_input_funneldim_funnel_condition_funneldim_funnel_setting_funnel三表进行关联, 并通过自定义函数con_check_condition进行条件对比过滤生成bdl_output_check_conditions_funnelbdl_output_check_conditions_dw_track_funnel

dim_funnel_condition_funnel.funnel_id = dim_funnel_setting_funnel.funnel_id  and dim_funnel_condition_funnel.branch_step_type = dim_funnel_setting_funnel.type

最关键的部分是过滤函数con_check_condition,他将bdl_input_funnel表中的log_jsondim_funnel_condition_funnel表中的conditions进行逐一对比,不符合条件的过滤掉。我们来看实际例子:

// log_json
{
  "web_is_launch_page": "0",
  "web_terminal": "PC",
  "web_source_appd": "direct",
  "web_quit_page": "class.bihell.com/lesson/p3_11510908/try",
  "web_is_quit_page": "0",
  "appd": "yyy_bihell",
  "web_source": "direct|direct|direct",
  "web_launch_page": "bihell.com",
  "web_source_cate": "direct",
  "web_is_new_session": "1",
  "web_original_session": "c4a78f1e-88c6-89f8-4308-49b0dabf7f7d",
  "web_source_bu": "direct",
  "web_session": "c4a78f1e-88c6-89f8-4308-49b0dabf7f7d",
  "web_pre_page_type": "75",
  "web_page_type": "3"
}
// conditions
{
    "type": "web_pageview",
    "name": "页面浏览-(类型)APP内嵌intro页",
    "uid": "web_page_type—14—22",
    "items": [
      {
        "key": "web_page_type",
        "name": "类型",
        "op": "co",
        "display": "",
        "value": [
          "22"
        ]
      }
    ]
  },
  {
    "type": "web_pageview",
    "name": "页面浏览-(类型)intro页",
    "uid": "web_page_type—14—3",
    "items": [
      {
        "key": "web_page_type",
        "name": "类型",
        "op": "co",
        "display": "",
        "value": [
          "3"
        ]
      }
    ]
  },
  {
    "type": "web_pageview",
    "name": "页面浏览-(类型)触屏intro页测试版",
    "uid": "web_page_type—14—28",
    "items": [
      {
        "key": "web_page_type",
        "name": "类型",
        "op": "co",
        "display": "",
        "value": [
          "28"
        ]
      }
    ]
  },
  {
    "type": "web_pageview",
    "name": "页面浏览-(类型)新版App内嵌intro",
    "uid": "web_page_type—14—115",
    "items": [
      {
        "key": "web_page_type",
        "name": "类型",
        "op": "co",
        "display": "",
        "value": [
          "115"
        ]
      }
    ]
  }

log_json只要满足conditions里面的items任何一个条件就算「通过」。通过 Json 我们可以看到 conditions里面"key": "web_page_type"值分别为 22、3、28、115,其中"op": "co"为对比方式。co表示「相等」。 log_json里面web_page_type值为3,等于conditions里面的第二个条件,因此该事实记录被保留下来。

# 客户配置漏斗细筛

这个步骤根据业务需要处理,这里做的是把第一步的属性带到所有步骤上。 进一步汇总聚合可以出各类漏斗报表。

1.在dim_config_item_funnel 表中,根据预先设置获取需要参与计算的维度。

select distinct syscustom_code,item_code as funnel_item_code from dim_config_item_funnel where funnel_structure_code='funnel_constract_dim'
union all
select distinct syscustom_code,'-1' as funnel_item_code from dim_config_item_funnel where funnel_structure_code='funnel_constract_dim'
syscustom_code funnel_item_code
dw_track -1
dw_track web_source_bu
dw_track abtest_test
dw_track web_is_new_session
dw_track web_source_cate
dw_track web_terminal
dw_track web_source_appd

2.根据上步拿到的对比维度将bdl_output_check_conditions_funnellog_json字段对应的维度取出。并通过group_concat进行funnel_idstat_dim等进行拼接并合并分组。最终生成中间表tmp_explod_funnel_steps1

group_concat(t.funnel_step_id,"=",server_date,"@",ifnull(client_mts,0),"#",step) as steps
--这里的funnel_step_id=funnel_step_id *10+branch_id

--step是各维度对应的值
case
    when t3.funnel_item_code="abtest_test" then regexp_replace(get_json_object(log_json, "$.abtest_test"),",",",")
    when t3.funnel_item_code="web_is_new_session" then regexp_replace(get_json_object(log_json, "$.web_is_new_session"),",",",")
    when t3.funnel_item_code="web_source_appd" then regexp_replace(get_json_object(log_json, "$.web_source_appd"),",",",")
    when t3.funnel_item_code="web_source_bu" then regexp_replace(get_json_object(log_json, "$.web_source_bu"),",",",")
    when t3.funnel_item_code="web_source_cate" then regexp_replace(get_json_object(log_json, "$.web_source_cate"),",",",")
    when t3.funnel_item_code="web_terminal" then regexp_replace(get_json_object(log_json, "$.web_terminal"),",",",")
    when t3.funnel_item_code="-1" then -1
end as step 

-- 输出值
11=2019-12-10 21:24:11@1575984250977#-1
11=2019-12-10 21:24:11@1575984250977#null
11=2019-12-10 21:24:11@1575984250977#direct
11=2019-12-10 21:24:11@1575984250977#direct
11=2019-12-10 21:24:11@1575984250977#direct
11=2019-12-10 16:55:07@1575968107551#2
11=2019-12-10 16:55:07@1575968107551#本应用域APP内嵌
11=2019-12-10 00:12:53@1575907972994#-1,11=2019-12-10 00:11:56@1575907916195#-1
Column Type Extra Comment
funnel_start_date varchar
funnel_id integer
id varchar
id_type varchar
syscustom_code varchar
stat_dim varchar
steps varchar
content_auth varchar
content_auth_value varchar
session_id varchar
funnel_period varchar
  1. dim_funnel_branchs_condition_funnel一样,这里要将tmp_explod_funnel_steps1有着分支步骤的数据进行复制(explode_funnel_dim),获得表tmp_explod_funnel_steps2
lateral view explode_funnel_dim(steps,"11") f_dual as funnel_steps,dims;
-- 该函数作用是按照维度分开数据,维度为第一个#后面的值
select t1.str,step,dim from (
select '11=2019-12-10 00:12:53@1575907972994#yyy_bihell,11=2019-12-10 00:11:56@1575907916195#yyy_bbb' as str from dual
) t1
lateral view explode_funnel_dim(str,'11') dual as step,dim; 
-- 结果
11=2019-12-10 00:12:53@1575907972994#yyy_bihell,11=2019-12-10 00:11:56@1575907916195#yyy_bbb	11=2019-12-10 00:11:56@1575907916195#yyy_bbb	yyy_bbb
11=2019-12-10 00:12:53@1575907972994#yyy_bihell,11=2019-12-10 00:11:56@1575907916195#yyy_bbb	11=2019-12-10 00:12:53@1575907972994#yyy_bihell	yyy_bihell
Column Type Extra Comment
content_auth varchar
content_auth_value varchar
syscustom_code varchar
stat_dim varchar
dims varchar
id varchar
id_type varchar
funnel_id integer
funnel_start_date varchar
funnel_steps varchar
session_id varchar
funnel_period varchar

4.关联dim_funnel_condition_funneltmp_explod_funnel_steps2表关联(funnel_id=funnel_id),并通过函数con_funnel_step_cnt进行过滤

这里只要判断个步骤的进入次数。funnel_steps为之前group_concat后explode_funnel_dim展开结果。step_name=funnel_step_id*10+branch_id。

con_funnel_step_cnt(funnel_steps,step_name,funnel_step_id) as total_cnt,
con_funnel_step_cnt("11=2019-12-10 16:38:52@1575967135105#2,11=2019-12-10 16:39:45@1575967187233#2,11=2019-12-10 16:43:04@1575967386969#2,22=2019-12-10 16:43:19@0#null,31=2019-12-10 16:43:20@0#null",22,2) as total_cnt
-- 结果
1

上面代码中22对应的是funnel_step_id+branch_id,funnel_steps 里面只有一条记录满足,因此返回结果1.

本步骤中只拿取total_cnt>0的数据。最终获得 idl_output_funnel(idl_output_dw_track_funnel)表

# 汇总数据

前面已经将关键部分全部完成。现在只需进行聚合统计就能完成我们所要的漏斗结果。

sum(if(1>t.max_funnel_step_id,null,if(t.funnel_step_id=1 ,id_cnt,0))) as step1_id_cnt,
sum(if(2>t.max_funnel_step_id,null,if(t.funnel_step_id=2 ,id_cnt,0))) as step2_id_cnt,
sum(if(3>t.max_funnel_step_id,null,if(t.funnel_step_id=3 ,id_cnt,0))) as step3_id_cnt,
sum(if(4>t.max_funnel_step_id,null,if(t.funnel_step_id=4 ,id_cnt,0))) as step4_id_cnt,
sum(if(5>t.max_funnel_step_id,null,if(t.funnel_step_id=5 ,id_cnt,0))) as step5_id_cnt,