本文主要是介绍Hive(12):Hive分析函数-窗口函数(Windowing),希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!
一、实现功能
对于分组之后的数据进行处理。
官网:https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
二、加载数据
1.emp.txt和dept.txt数据
(1)emp.txt
7369 SMITH CLERK 7902 1980-12-17 800.00 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.00 300.00 30
7521 WARD SALESMAN 7698 1981-2-22 1250.00 500.00 30
7566 JONES MANAGER 7839 1981-4-2 2975.00 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.00 1400.00 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.00 30
7782 CLARK MANAGER 7839 1981-6-9 2450.00 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.00 20
7839 KING PRESIDENT 1981-11-17 5000.00 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.00 0.00 30
7876 ADAMS CLERK 7788 1987-5-23 1100.00 20
7900 JAMES CLERK 7698 1981-12-3 950.00 30
7902 FORD ANALYST 7566 1981-12-3 3000.00 20
7934 MILLER CLERK 7782 1982-1-23 1300.00 10
(2)dept.txt
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON
2.创建分析表并且加载数据
(1)emp人员表
create EXTERNAL table emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
)
row format delimited fields terminated by '\t' ;
load data local inpath '/opt/datas/emp.txt' into table emp;
(2)dept部门表
create EXTERNAL table dept(
deptno int,
dname string,
loc string
)
row format delimited fields terminated by '\t' ;
load data local inpath '/opt/datas/dept.txt' into table dept;
三、分析实例
1.测试表
hive (hadoop)> select * from emp;
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
Time taken: 0.247 seconds, Fetched: 14 row(s)
2.实例
(1)查询部门编号10的所有员工的信息,按照薪资进行降序排列
select * from emp where deptno='10' order by sal desc;
结果:
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
(2)窗口:查询所有部门的员工的信息,按照薪资进行降序排列,多加一个字段:显示该部门的最高薪资, 或者显示该部门的最低薪资。备注:求最大值,要desc;求最小值,要asc。否则会出错!
求最大值
hive (hadoop)> select empno,ename,deptno,sal,max(sal) over (partition by deptno order by sal desc) as max_sal from emp;
结果:
empno ename deptno sal max_sal
7839 KING 10 5000.0 5000.0
7782 CLARK 10 2450.0 5000.0
7934 MILLER 10 1300.0 5000.0 7788 SCOTT 20 3000.0 3000.0
7902 FORD 20 3000.0 3000.0
7566 JONES 20 2975.0 3000.0
7876 ADAMS 20 1100.0 3000.0
7369 SMITH 20 800.0 3000.0 7698 BLAKE 30 2850.0 2850.0
7499 ALLEN 30 1600.0 2850.0
7844 TURNER 30 1500.0 2850.0
7654 MARTIN 30 1250.0 2850.0
7521 WARD 30 1250.0 2850.0
7900 JAMES 30 950.0 2850.0
求最小值
hive (hadoop)> select empno,ename,deptno,sal,min(sal) over (partition by deptno order by sal asc) as min_sal from emp;
结果:
empno ename deptno sal min_sal
7934 MILLER 10 1300.0 1300.0
7782 CLARK 10 2450.0 1300.0
7839 KING 10 5000.0 1300.0
7369 SMITH 20 800.0 800.0
7876 ADAMS 20 1100.0 800.0
7566 JONES 20 2975.0 800.0
7788 SCOTT 20 3000.0 800.0
7902 FORD 20 3000.0 800.0
7900 JAMES 30 950.0 950.0
7654 MARTIN 30 1250.0 950.0
7521 WARD 30 1250.0 950.0
7844 TURNER 30 1500.0 950.0
7499 ALLEN 30 1600.0 950.0
7698 BLAKE 30 2850.0 950.0
(3)分析函数row_number:查询所有部门的员工的信息,按照薪资进行降序排列,最后一列显示编号
hive (hadoop)> select empno,ename,deptno,sal,row_number() over (partition by deptno order by sal desc) as rn from emp;
结果:
empno ename deptno sal rn
7839 KING 10 5000.0 1
7782 CLARK 10 2450.0 2
7934 MILLER 10 1300.0 37788 SCOTT 20 3000.0 1
7902 FORD 20 3000.0 2
7566 JONES 20 2975.0 3
7876 ADAMS 20 1100.0 4
7369 SMITH 20 800.0 57698 BLAKE 30 2850.0 1
7499 ALLEN 30 1600.0 2
7844 TURNER 30 1500.0 3
7654 MARTIN 30 1250.0 4
7521 WARD 30 1250.0 5
7900 JAMES 30 950.0 6
(4)去重薪水一样的(总共有14个人,有4个人两两薪水是一样的,hql结果应该是12人)
hive (hadoop)> select deptno,count(DISTINCT sal) over (partition by deptno ) as countNum from emp group by deptno;
结果:
deptno countnum
10 3
20 4
30 5
(5)统计每个部门的人数
hive (hadoop)> select deptno,count(*) as count from emp group by deptno;
结果:
deptno count
10 3
20 5
30 6
或者使用窗口函数
hive (hadoop)> select deptno,count(empno) over (partition by deptno) as count from emp group by deptno,empno;
结果:
deptno count
10 3
10 3
10 3
20 5
20 5
20 5
20 5
20 5
30 6
30 6
30 6
30 6
30 6
30 6
(6)windows范围
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:当前行
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点
//1.分组内所有行排序
hive (hadoop)> select empno,ename,deptno,sal,row_number() over (partition by deptno order by sal desc) as rn from emp;//2.分组内当前行+往前1行:加和
hive (hadoop)> select empno,ename,deptno,sal,row_number() over (partition by deptno order by sal desc) as rn,
sum(sal) over (partition by deptno order by sal desc ) s1,
sum(sal) over (partition by deptno order by sal desc rows between 1 PRECEDING and CURRENT ROW) s2
from emp;结果:s2是本行和上一行的加和
empno ename deptno sal rn s1 s2
7839 KING 10 5000.0 1 5000.0 5000.0
7782 CLARK 10 2450.0 2 7450.0 7450.0
7934 MILLER 10 1300.0 3 8750.0 3750.0
7788 SCOTT 20 3000.0 1 6000.0 3000.0
7902 FORD 20 3000.0 2 6000.0 6000.0
7566 JONES 20 2975.0 3 8975.0 5975.0
7876 ADAMS 20 1100.0 4 10075.0 4075.0
7369 SMITH 20 800.0 5 10875.0 1900.0
7698 BLAKE 30 2850.0 1 2850.0 2850.0
7499 ALLEN 30 1600.0 2 4450.0 4450.0
7844 TURNER 30 1500.0 3 5950.0 3100.0
7654 MARTIN 30 1250.0 4 8450.0 2750.0
7521 WARD 30 1250.0 5 8450.0 2500.0
7900 JAMES 30 950.0 6 9400.0 2200.0
Time taken: 23.458 seconds, Fetched: 14 row(s)//3.分组内当前行+往后所有行:加和
hive (hadoop)> select empno,ename,deptno,sal,row_number() over (partition by deptno order by sal desc) as rn,
sum(sal) over (partition by deptno order by sal desc ) s1,
sum(sal) over (partition by deptno order by sal desc rows between CURRENT ROW and unbounded following ) s2
from emp;
结果:s2即为本行到最后一行累加结果(当然,是在窗口内)empno ename deptno sal rn s1 s2
7839 KING 10 5000.0 1 5000.0 8750.0
7782 CLARK 10 2450.0 2 7450.0 3750.0
7934 MILLER 10 1300.0 3 8750.0 1300.0
7788 SCOTT 20 3000.0 1 6000.0 10875.0
7902 FORD 20 3000.0 2 6000.0 7875.0
7566 JONES 20 2975.0 3 8975.0 4875.0
7876 ADAMS 20 1100.0 4 10075.0 1900.0
7369 SMITH 20 800.0 5 10875.0 800.0
7698 BLAKE 30 2850.0 1 2850.0 9400.0
7499 ALLEN 30 1600.0 2 4450.0 6550.0
7844 TURNER 30 1500.0 3 5950.0 4950.0
7654 MARTIN 30 1250.0 4 8450.0 3450.0
7521 WARD 30 1250.0 5 8450.0 2200.0
7900 JAMES 30 950.0 6 9400.0 950.0
Time taken: 22.953 seconds, Fetched: 14 row(s)
3.总结
窗口函数核心特征:不减少原始数据个数,只是开辟一个新的维度去观察当前数据。
四、参考
1.hive窗口函数必备宝典
https://blog.csdn.net/dingchangxiu11/article/details/83145151
这篇关于Hive(12):Hive分析函数-窗口函数(Windowing)的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!