使用窗口函数进行增长数据分析
以下文章来源于大数据技术与数仓 ,作者西贝
本文会从一个商务分析案例入手,说明SQL窗口函数的使用方式。通过本文的5个需求分析,可以看出SQL窗口函数的功能十分强大,不仅能够使我们编写的SQL逻辑更加清晰,而且在某种程度上可以简化需求开发。
数据准备
本文主要分析只涉及一张订单表orders,操作过程在Hive中完成,具体数据如下:
-- 建表
CREATE TABLE orders(
order_id int,
customer_id string,
city string,
add_time string,
amount decimal(10,2));
-- 准备数据
INSERT INTO orders VALUES
(1,"A","上海","2020-01-01 00:00:00.000000",200),
(2,"B","上海","2020-01-05 00:00:00.000000",250),
(3,"C","北京","2020-01-12 00:00:00.000000",200),
(4,"A","上海","2020-02-04 00:00:00.000000",400),
(5,"D","上海","2020-02-05 00:00:00.000000",250),
(5,"D","上海","2020-02-05 12:00:00.000000",300),
(6,"C","北京","2020-02-19 00:00:00.000000",300),
(7,"A","上海","2020-03-01 00:00:00.000000",150),
(8,"E","北京","2020-03-05 00:00:00.000000",500),
(9,"F","上海","2020-03-09 00:00:00.000000",250),
(10,"B","上海","2020-03-21 00:00:00.000000",600);
需求1:收入增长
在业务方面,第m1个月的收入增长计算如下:* 100 (m1-m0)/ m0
其中, m1 是给定月份的收入, m0 是上个月的收入。因此,从技术上讲,我们需要找到每个月的收入,然后以某种方式将每个月的收入与上一个收入相关联,以便进行上述计算。计算当时如下:
WITH
monthly_revenue as (
SELECT
trunc(add_time,'MM') as month,
sum(amount) as revenue
FROM orders
GROUP BY 1
)
,prev_month_revenue as (
SELECT
month,
revenue,
lag(revenue) over (order by month) as prev_month_revenue -- 上一月收入
FROM monthly_revenue
)
SELECT
month,
revenue,
prev_month_revenue,
round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
FROM prev_month_revenue
ORDER BY 1
结果输出
monthrevenueprev_month_revenuerevenue_growth2020-01-01650NULLNULL2020-02-01125065092.32020-03-011500125020
我们还可以按照按城市分组进行统计,查看某个城市某个月份的收入增长情况
WITH
monthly_revenue as (
SELECT
trunc(add_time,'MM') as month,
city,
sum(amount) as revenue
FROM orders
GROUP BY 1,2
)
,prev_month_revenue as (
SELECT
month,
city,
revenue,
lag(revenue) over (partition by city order by month) as prev_month_revenue
FROM monthly_revenue
)
SELECT
month,
city,
revenue,
round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
FROM prev_month_revenue
ORDER BY 2,1
结果输出
monthcityrevenuerevenue_growth2020-01-01上海450NULL2020-02-01上海950111.12020-03-01上海10005.32020-01-01北京200NULL2020-02-01北京300502020-03-01北京50066.7
需求2:累计求和
累计汇总,即当前元素和所有先前元素的总和,如下面的SQL:
WITH
monthly_revenue as (
SELECT
trunc(add_time,'MM') as month,
sum(amount) as revenue
FROM orders
GROUP BY 1
)
SELECT
month,
revenue,
sum(revenue) over (order by month rows between unbounded preceding and current roW) as running_total
FROM monthly_revenue
ORDER BY 1
结果输出
monthrevenuerunning_total2020-01-016506502020-02-01125019002020-03-0115003400
我们还可以使用下面的组合方式进行分析,SQL如下:
SELECT
order_id,
customer_id,
city,
add_time,
amount,
sum(amount) over () as amount_total, -- 所有数据求和
sum(amount) over (order by order_id rows between unbounded preceding and current roW) as running_sum, -- 累计求和
sum(amount) over (partition by customer_id order by add_time rows between unbounded preceding and current roW) as running_sum_by_customer,
avg(amount) over (order by add_time rows between 5 preceding and current roW) as trailing_avg -- 滚动求平均
FROM orders
ORDER BY 1
结果输出:
order_idcustomer_idcityadd_timeamountamount_totalrunning_sumrunning_sum_by_customertrailing_avg1A上海2020-01-01 00:00:00.00000020034002002002002B上海2020-01-05 00:00:00.00000025034004502502253C北京2020-01-12 00:00:00.0000002003400650200216.6666674A上海2020-02-04 00:00:00.00000040034001050600262.55D上海2020-02-05 00:00:00.000000250340013002502605D上海2020-02-05 12:00:00.00000030034001600550266.6666676C北京2020-02-19 00:00:00.00000030034001900500283.3333337A上海2020-03-01 00:00:00.00000015034002050750266.6666678E北京2020-03-05 00:00:00.00000050034002550500316.6666679F上海2020-03-09 00:00:00.00000025034002800250291.66666710B上海2020-03-21 00:00:00.00000060034003400850
需求3:处理重复数据
从上面的数据可以看出,存在两条重复的数据**(5,“D”,“上海”,“2020-02-05 00:00:00.000000”,250), (5,“D”,“上海”,“2020-02-05 12:00:00.000000”,300),**显然需要对其进行清洗去重,保留最新的一条数据,SQL如下:
我们先进行分组排名,然后保留最新的那条数据即可:
SELECT *
FROM (
SELECT *,
row_number() over (partition by order_id order by add_time desc) as rank
FROM orders
) t
WHERE rank=1
结果输出:
t.order_idt.customer_idt.cityt.add_timet.amountt.rank1A上海2020-01-01 00:00:00.00000020012B上海2020-01-05 00:00:00.00000025013C北京2020-01-12 00:00:00.00000020014A上海2020-02-04 00:00:00.00000040015D上海2020-02-05 12:00:00.00000030016C北京2020-02-19 00:00:00.00000030017A上海2020-03-01 00:00:00.00000015018E北京2020-03-05 00:00:00.00000050019F上海2020-03-09 00:00:00.000000250110B上海2020-03-21 00:00:00.0000006001
经过上面的清洗过程,对数据进行了去重。重新计算上面的需求1,正确SQL脚本为:
WITH
orders_cleaned as (
SELECT *
FROM (
SELECT *,
row_number() over (partition by order_id order by add_time desc) as rank
FROM orders
)t
WHERE rank=1
)
,monthly_revenue as (
SELECT
trunc(add_time,'MM') as month,
sum(amount) as revenue
FROM orders_cleaned
GROUP BY 1
)
,prev_month_revenue as (
SELECT
month,
revenue,
lag(revenue) over (order by month) as prev_month_revenue
FROM monthly_revenue
)
SELECT
month,
revenue,
round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
FROM prev_month_revenue
ORDER BY 1
结果输出:
monthrevenuerevenue_growth2020-01-01650NULL2020-02-01100053.82020-03-01150050
将清洗后的数据创建成视图,方便以后使用
CREATE VIEW orders_cleaned AS
SELECT
order_id,
customer_id,
city,
add_time,
amount
FROM (
SELECT *,
row_number() over (partition by order_id order by add_time desc) as rank
FROM orders
)t
WHERE rank=1
需求4:分组取TopN
分组取topN是最长见的SQL窗口函数使用场景,下面的SQL是计算每个月份的top2订单金额,如下:
WITH orders_ranked as (
SELECT
trunc(add_time,'MM') as month,
*,
row_number() over (partition by trunc(add_time,'MM') order by amount desc, add_time) as rank
FROM orders_cleaned
)
SELECT
month,
order_id,
customer_id,
city,
add_time,
amount
FROM orders_ranked
WHERE rank <=2
ORDER BY 1
需求5:重复购买行为
下面的SQL计算重复购买率:重复购买的人数/总人数*100%以及第一笔订单金额与第二笔订单金额之间的典型差额:avg(第二笔订单金额/第一笔订单金额)
WITH customer_orders as (
SELECT *,
row_number() over (partition by customer_id order by add_time) as customer_order_n,
lag(amount) over (partition by customer_id order by add_time) as prev_order_amount
FROM orders_cleaned
)
SELECT
roun
- 原文作者:知识铺
- 原文链接:https://geek.zshipu.com/post/%E4%BA%92%E8%81%94%E7%BD%91/%E4%BD%BF%E7%94%A8%E7%AA%97%E5%8F%A3%E5%87%BD%E6%95%B0%E8%BF%9B%E8%A1%8C%E5%A2%9E%E9%95%BF%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90/
- 版权声明:本作品采用知识共享署名-非商业性使用-禁止演绎 4.0 国际许可协议进行许可,非商业转载请注明出处(作者,原文链接),商业转载请联系作者获得授权。
- 免责声明:本页面内容均来源于站内编辑发布,部分信息来源互联网,并不意味着本站赞同其观点或者证实其内容的真实性,如涉及版权等问题,请立即联系客服进行更改或删除,保证您的合法权益。转载请注明来源,欢迎对文章中的引用来源进行考证,欢迎指出任何有错误或不够清晰的表达。也可以邮件至 sblig@126.com