This article walks through a practical use of Hive's LEAD and LAG analytic functions.
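Overview
Hive's analytic functions are also called window functions. Oracle offers the same kind of analytic functions, and they are mainly used for statistical analysis of data.
Within a single query, the LAG and LEAD analytic functions return the value of a column from the row N rows before the current row (LAG) or N rows after it (LEAD), exposed as a separate column.
This can replace a self-join on the table, and LAG and LEAD are usually more efficient than the join. The OVER() clause defines the window, that is, the set of rows the function operates on; the clauses inside the parentheses (such as PARTITION BY and ORDER BY) specify how that result set is partitioned and ordered.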
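Function reference
LAG
LAG(col, n, DEFAULT) returns the value of col from the row n rows before the current row within the window.
Argument 1 is the column name. Argument 2 is the offset, counted backwards from the current row (optional, defaults to 1). Argument 3 is the default value, returned when there is no row n rows back within the window, for example on the first row of a partition; if it is omitted, NULL is returned.
LEAD
The counterpart of LAG, looking in the opposite direction.
LEAD(col, n, DEFAULT) returns the value of col from the row n rows after the current row within the window.
Argument 1 is the column name. Argument 2 is the offset, counted forwards from the current row (optional, defaults to 1). Argument 3 is the default value, returned when there is no row n rows ahead within the window, for example on the last row of a partition; if it is omitted, NULL is returned.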
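To make the offset and default arguments concrete, here is a minimal sketch that needs no table. The three inline rows are built with Hive's stack() function; the column names id and val and the placeholder 'none' are made up for illustration only.

```sql
-- Hypothetical inline data: three rows (1,'a'), (2,'b'), (3,'c').
WITH t AS (
  SELECT stack(3, 1, 'a', 2, 'b', 3, 'c') AS (id, val)
)
SELECT id,
       val,
       LAG(val)             OVER (ORDER BY id) AS prev_val,   -- offset defaults to 1, default is NULL
       LAG(val, 2, 'none')  OVER (ORDER BY id) AS prev2_val,  -- two rows back, 'none' if no such row
       LEAD(val, 1, 'none') OVER (ORDER BY id) AS next_val    -- one row ahead, 'none' if no such row
FROM t;
```

On this data the row with id 1 should get prev_val NULL, prev2_val 'none' and next_val 'b', while the row with id 3 should get prev_val 'b', prev2_val 'a' and next_val 'none'.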
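Scenario
Problem
A user, Peter, is browsing a website. At some moment he opens a page; some time later he opens another page, and so on. How can we compute how long Peter stayed on a particular page, or the total time all users spent on a given page?
Data preparation
Assume the users' behavior has already been collected, processed and loaded into a Hive table with the following structure: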
create table test.user_log(
  userid string,
  time string,
  url string
) row format delimited fields terminated by ',';
Sample records:
+------------------+----------------------+---------------+--+
| user_log.userid | user_log.time | user_log.url |
+------------------+----------------------+---------------+--+
| Peter | 2015-10-12 01:10:00 | url1 |
| Peter | 2015-10-12 01:15:10 | url2 |
| Peter | 2015-10-12 01:16:40 | url3 |
| Peter | 2015-10-12 02:13:00 | url4 |
| Peter | 2015-10-12 03:14:30 | url5 |
| Marry | 2015-11-12 01:10:00 | url1 |
| Marry | 2015-11-12 01:15:10 | url2 |
| Marry | 2015-11-12 01:16:40 | url3 |
| Marry | 2015-11-12 02:13:00 | url4 |
| Marry | 2015-11-12 03:14:30 | url5 |
+------------------+----------------------+---------------+--+
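For readers who want to reproduce the examples, the records above could be loaded roughly as follows. This is only a sketch: it assumes the test.user_log table has already been created and that the Hive version supports INSERT ... VALUES (Hive 0.14 or later); the original article does not say how the data was loaded.

```sql
-- Assumption: test.user_log exists and INSERT ... VALUES is available.
INSERT INTO TABLE test.user_log VALUES
  ('Peter', '2015-10-12 01:10:00', 'url1'),
  ('Peter', '2015-10-12 01:15:10', 'url2'),
  ('Peter', '2015-10-12 01:16:40', 'url3'),
  ('Peter', '2015-10-12 02:13:00', 'url4'),
  ('Peter', '2015-10-12 03:14:30', 'url5'),
  ('Marry', '2015-11-12 01:10:00', 'url1'),
  ('Marry', '2015-11-12 01:15:10', 'url2'),
  ('Marry', '2015-11-12 01:16:40', 'url3'),
  ('Marry', '2015-11-12 02:13:00', 'url4'),
  ('Marry', '2015-11-12 03:14:30', 'url5');
```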
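Analysis steps
Get the start and end time of each user's stay on a page. The time of the user's next record serves as the end time of the current page.
select userid,
       time stime,
       lead(time) over(partition by userid order by time) etime,
       url
from test.user_log;
Result: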
+---------+----------------------+----------------------+-------+--+
| userid | stime | etime | url |
+---------+----------------------+----------------------+-------+--+
| Marry | 2015-11-12 01:10:00 | 2015-11-12 01:15:10 | url1 |
| Marry | 2015-11-12 01:15:10 | 2015-11-12 01:16:40 | url2 |
| Marry | 2015-11-12 01:16:40 | 2015-11-12 02:13:00 | url3 |
| Marry | 2015-11-12 02:13:00 | 2015-11-12 03:14:30 | url4 |
| Marry | 2015-11-12 03:14:30 | NULL | url5 |
| Peter | 2015-10-12 01:10:00 | 2015-10-12 01:15:10 | url1 |
| Peter | 2015-10-12 01:15:10 | 2015-10-12 01:16:40 | url2 |
| Peter | 2015-10-12 01:16:40 | 2015-10-12 02:13:00 | url3 |
| Peter | 2015-10-12 02:13:00 | 2015-10-12 03:14:30 | url4 |
| Peter | 2015-10-12 03:14:30 | NULL | url5 |
+---------+----------------------+----------------------+-------+--+
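LAG works the same way in the other direction. As a sketch on the same table, the query below attaches the previously visited page to each record. The placeholder 'entry' and the alias prev_url are made up for illustration, and time is quoted with backticks as a precaution because some Hive versions may treat it as a reserved word.

```sql
-- Sketch: previous page for each visit, assuming the test.user_log table above.
SELECT userid,
       `time` AS stime,
       url,
       -- 'entry' is a hypothetical placeholder for the first page of each user
       LAG(url, 1, 'entry') OVER (PARTITION BY userid ORDER BY `time`) AS prev_url
FROM test.user_log;
```

With the sample data, each user's url1 row should show 'entry' and each url2 row should show url1 as the previous page.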
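Compute how long the user stayed on each page, in seconds. (In a real analysis, data cleansing belongs here: a record showing a user staying on one page for 4 or 5 hours is almost certainly unusable and should be discarded.)
select userid,
       time stime,
       lead(time) over(partition by userid order by time) etime,
       UNIX_TIMESTAMP(lead(time) over(partition by userid order by time),'yyyy-MM-dd HH:mm:ss') - UNIX_TIMESTAMP(time,'yyyy-MM-dd HH:mm:ss') period,
       url
from test.user_log;
Result: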
+---------+----------------------+----------------------+---------+-------+--+
| userid | stime | etime | period | url |
+---------+----------------------+----------------------+---------+-------+--+
| Marry | 2015-11-12 01:10:00 | 2015-11-12 01:15:10 | 310 | url1 |
| Marry | 2015-11-12 01:15:10 | 2015-11-12 01:16:40 | 90 | url2 |
| Marry | 2015-11-12 01:16:40 | 2015-11-12 02:13:00 | 3380 | url3 |
| Marry | 2015-11-12 02:13:00 | 2015-11-12 03:14:30 | 3690 | url4 |
| Marry | 2015-11-12 03:14:30 | NULL | NULL | url5 |
| Peter | 2015-10-12 01:10:00 | 2015-10-12 01:15:10 | 310 | url1 |
| Peter | 2015-10-12 01:15:10 | 2015-10-12 01:16:40 | 90 | url2 |
| Peter | 2015-10-12 01:16:40 | 2015-10-12 02:13:00 | 3380 | url3 |
| Peter | 2015-10-12 02:13:00 | 2015-10-12 03:14:30 | 3690 | url4 |
| Peter | 2015-10-12 03:14:30 | NULL | NULL | url5 |
+---------+----------------------+----------------------+---------+-------+--+
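The cleansing step mentioned above could look roughly like this. Because a window function cannot be used directly in a WHERE clause, the period is computed in a subquery and filtered outside it. The 4-hour cutoff is an assumed threshold chosen only for illustration, and the backticks around time are a precaution for Hive versions that may reserve that word.

```sql
-- Sketch: drop the last page of each user (no end time) and implausibly long stays.
SELECT userid, stime, etime, period, url
FROM (
  SELECT userid,
         `time` AS stime,
         LEAD(`time`) OVER (PARTITION BY userid ORDER BY `time`) AS etime,
         UNIX_TIMESTAMP(LEAD(`time`) OVER (PARTITION BY userid ORDER BY `time`), 'yyyy-MM-dd HH:mm:ss')
           - UNIX_TIMESTAMP(`time`, 'yyyy-MM-dd HH:mm:ss') AS period,
         url
  FROM test.user_log
) a
WHERE period IS NOT NULL     -- last record per user has no following record
  AND period <= 4 * 3600;    -- assumed cutoff: 4 hours, expressed in seconds
```

On the sample data every stay is shorter than the cutoff, so only the two url5 rows with a NULL period are removed.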
Compute the total time spent on each page, and the total time each user spent on each page. WITH ROLLUP adds subtotal rows per url and a grand total; NVL replaces the NULLs in those rows with '-1'.
select nvl(url,'-1') url,
       nvl(userid,'-1') userid,
       sum(period) total_period from (
    select userid,
           time stime,
           lead(time) over(partition by userid order by time) etime,
           UNIX_TIMESTAMP(lead(time) over(partition by userid order by time),'yyyy-MM-dd HH:mm:ss') - UNIX_TIMESTAMP(time,'yyyy-MM-dd HH:mm:ss') period,
           url
    from test.user_log
) a group by url, userid with rollup;
Result:
+-------+---------+---------------+--+
| url | userid | total_period |
+-------+---------+---------------+--+
| -1 | -1 | 14940 |
| url1 | -1 | 620 |
| url1 | Marry | 310 |
| url1 | Peter | 310 |
| url2 | -1 | 180 |
| url2 | Marry | 90 |
| url2 | Peter | 90 |
| url3 | -1 | 6760 |
| url3 | Marry | 3380 |
| url3 | Peter | 3380 |
| url4 | -1 | 7380 |
| url4 | Marry | 3690 |
| url4 | Peter | 3690 |
| url5 | -1 | NULL |
| url5 | Marry | NULL |
| url5 | Peter | NULL |
+-------+---------+---------------+--+
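For readers unfamiliar with rollup, the same aggregation can be written with explicit GROUPING SETS, which spells out the three grouping levels seen in the result: per url and user, per url, and the grand total. The sketch below also computes LEAD only once in an inner query; the aliases v and a are arbitrary, and the backticks around time are a precaution for Hive versions that may reserve that word.

```sql
-- Sketch: explicit grouping sets instead of WITH ROLLUP, on the same test.user_log table.
SELECT NVL(url, '-1')    AS url,
       NVL(userid, '-1') AS userid,
       SUM(period)       AS total_period
FROM (
  SELECT userid,
         url,
         UNIX_TIMESTAMP(etime, 'yyyy-MM-dd HH:mm:ss')
           - UNIX_TIMESTAMP(stime, 'yyyy-MM-dd HH:mm:ss') AS period
  FROM (
    SELECT userid,
           url,
           `time` AS stime,
           LEAD(`time`) OVER (PARTITION BY userid ORDER BY `time`) AS etime
    FROM test.user_log
  ) v
) a
GROUP BY url, userid
GROUPING SETS ((url, userid), (url), ());  -- detail rows, per-url subtotals, grand total
```

Note that NVL cannot tell a subtotal row apart from a row whose url or userid is genuinely NULL in the data, so this approach assumes those columns are never NULL.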