0302 - Hive Case Study 1
- 1. Requirements
- 1.1 Data structure
- 1.2 Business requirements
- 2. Data cleaning (ETL)
- 2.1 ETL: ETLUtil
- 2.2 ETL: Mapper
- 2.3 ETL: Driver
- 3. Uploading the data
- 3.1 Upload the raw data to HDFS
- 3.2 Run the ETL job
- 4. Importing the data
- 4.1 Create the tables
- 5. Business analysis and implementation
- 5.1 Top 10 videos by view count
- 5.2 Top 10 categories by popularity
- 5.3 Categories of the 20 most-viewed videos, and the number of Top 20 videos in each
- 5.4 Category ranking for the videos related to the Top 50 most-viewed videos
- 5.5 Top 10 most popular videos per category, using Music as an example
- 5.6 Top 10 videos by traffic per category, using Music as an example
- 5.7 Top 10 uploaders by video count, and the Top 20 most-viewed videos they uploaded
- 5.8 Top 10 most popular videos in every category (5.5-PLUS)
- 5.9 Top 10 uploaders by video count, and each one's Top 20 most-viewed videos (5.7-PLUS)
1. Requirements
1.1 Data structure
- Video table (fields, per the table DDL below: videoId, uploader, age, category, length, views, rate, ratings, comments, relatedId)
fQShwYqGqsw lonelygirl15 736 People & Blogs 133 151763 3.01 666 765 fQShwYqGqsw LfAaY1p_2Is 5LELNIVyMqo
mWzdp7Cg41w toshiaki1973 735 Entertainment 582 142699 3.45 148 146 hIbPgEyOGs4 VWCKN5Agp34
- User table (fields: uploader, videos, friends)
barelypolitical 151 5106
bonk65 89 144
camelcars 26 674
cubskickass34 13 126
boydism08 32 50
1.2 Business requirements
- Top 10 videos by view count
- Top 10 categories by popularity
- Categories of the Top 20 most-viewed videos
- Category ranking for the videos related to the Top 50 most-viewed videos
- Top 10 most popular videos per category
- Top 10 videos by traffic per category
- Top 10 uploaders by video count, and the videos they uploaded
- Top 10 videos by view count per category
2. Data cleaning (ETL)
Looking at the raw data, a video can belong to several categories, separated by "&" with a space on each side, and it can also have several related videos, separated by "\t". To make these multi-valued fields easy to work with during analysis, we first clean and restructure the data. That is: keep the categories separated by "&" but strip the surrounding spaces, and join the multiple related-video ids with "&" as well.
2.1 ETL: ETLUtil
package com.lz.etl;

/**
 * @ClassName ETLUtil
 * @Author MAlone
 * @Date 2019/12/8
 * @Version V1.0
 */
public class ETLUtil {
    public static String oriString2ETLString(String ori) {
        // in:  "fQShwYqGqsw\tlonelygirl15\t736\tPeople & Blogs\t133\t151763\t3.01\t666\t765\tfQShwYqGqsw\tLfAaY1p_2Is\t5LELNIVyMqo"
        StringBuilder etlString = new StringBuilder();
        String[] fields = ori.split("\t");
        // Drop malformed records with fewer than 9 fields
        if (fields.length < 9) {
            return null;
        }
        // Remove the spaces around "&" in the category field
        fields[3] = fields[3].replace(" ", "");
        for (int i = 0; i < fields.length; i++) {
            if (i < 9) {
                // The first 9 fields stay tab-separated
                etlString.append(fields[i]).append("\t");
            } else if (i == fields.length - 1) {
                etlString.append(fields[i]);
            } else {
                // The related-video ids are joined with "&"
                etlString.append(fields[i]).append("&");
            }
        }
        return etlString.toString();
        // out: "fQShwYqGqsw\tlonelygirl15\t736\tPeople&Blogs\t133\t151763\t3.01\t666\t765\tfQShwYqGqsw&LfAaY1p_2Is&5LELNIVyMqo"
    }
}
2.2 ETL: Mapper
package com.lz.etl;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @ClassName VideoETLMapper
 * @Author MAlone
 * @Date 2019/12/8
 * @Version V1.0
 */
public class VideoETLMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    Text text = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String ori = value.toString();
        String etlString = ETLUtil.oriString2ETLString(ori);
        if (etlString == null) {
            // Count malformed records and skip them instead of writing a null value
            context.getCounter("ETLString", "false").increment(1);
            return;
        }
        context.getCounter("ETLString", "true").increment(1);
        text.set(etlString);
        context.write(NullWritable.get(), text);
    }
}
2.3 ETL: Driver
package com.lz.etl;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @ClassName VideoETLDriver
 * @Author MAlone
 * @Date 2019/12/8
 * @Version V1.0
 */
public class VideoETLDriver {
    public static void main(String[] args) throws Exception {
        // 1. Get the job instance
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // 2. Set the jar
        job.setJarByClass(VideoETLDriver.class);
        // 3. Wire up the mapper
        job.setMapperClass(VideoETLMapper.class);
        // 4. Set the final output types
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        // Map-only job: set the number of reduce tasks to 0
        job.setNumReduceTasks(0);
        // 5. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 6. Submit
        job.waitForCompletion(true);
    }
}
3. Uploading the data
3.1 Upload the raw data to HDFS
[yanlzh@node11 data]$ hadoop fs -mkdir /guli
[yanlzh@node11 data]$ hadoop fs -put video/ /guli
[yanlzh@node11 data]$ hadoop fs -put user/ /guli
3.2 Run the ETL job
[yanlzh@node11 software]$ hadoop jar 0302hive_etl-1.0-SNAPSHOT.jar com.lz.etl.VideoETLDriver /guli/video/ /guli/video_etl
4. Importing the data
4.1 Create the tables
Create the raw TEXTFILE tables gulivideo_ori and gulivideo_user_ori,
then the ORC tables gulivideo_orc and gulivideo_user_orc (created under the names video and video_user in the DDL below).
-- Create table gulivideo_ori
CREATE EXTERNAL TABLE gulivideo_ori(
    videoId STRING, uploader STRING, age INT, category ARRAY<STRING>,
    length INT, views INT, rate FLOAT, ratings INT, comments INT,
    relatedId ARRAY<STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '&'
STORED AS TEXTFILE;

-- Create table gulivideo_user_ori
CREATE EXTERNAL TABLE gulivideo_user_ori(
    uploader STRING, videos INT, friends INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Load the raw data
LOAD DATA INPATH '/guli/video_etl/' INTO TABLE gulivideo_ori;
LOAD DATA INPATH '/guli/user/' INTO TABLE gulivideo_user_ori;

-- Create the ORC table video (gulivideo_orc)
CREATE TABLE video(
    videoId STRING, uploader STRING, age INT, category ARRAY<STRING>,
    length INT, views INT, rate FLOAT, ratings INT, comments INT,
    relatedId ARRAY<STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '&'
STORED AS ORC;

-- Create the ORC table video_user (gulivideo_user_orc)
CREATE TABLE video_user(
    uploader STRING, videos INT, friends INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS ORC;

-- Load the ORC tables from the raw tables
INSERT INTO TABLE video SELECT * FROM gulivideo_ori;
INSERT INTO TABLE video_user SELECT * FROM gulivideo_user_ori;
5. Business analysis and implementation
5.1 Top 10 videos by view count
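The original leaves this section without an implementation. Since each video is a single row in the video table, ordering by views and keeping the first ten rows is enough; a sketch against the video table defined above:

```sql
-- Top 10 videos by view count
SELECT videoId, uploader, views
FROM video
ORDER BY views DESC
LIMIT 10;
```

This follows the same ORDER BY ... LIMIT pattern the later sections use for their Top-N queries.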
5.2 Top 10 categories by popularity
Analysis:
- That is, count how many videos each category contains, and show the 10 categories containing the most videos.
- Group by category, then count the videoIds within each group.
- In the current schema one video maps to one or more categories, so before grouping by category, the category array has to be exploded into one row per category; then the count can be taken.
- Finally, order by the count descending and show the first 10 rows.
Implementation:
SELECT t1.category_name, COUNT(views) AS sum_views
FROM (
    SELECT category_name, views
    FROM video
    LATERAL VIEW explode(category) tmp AS category_name
) t1
GROUP BY t1.category_name
ORDER BY sum_views DESC
LIMIT 10;
5.3 Categories of the 20 most-viewed videos, and the number of Top 20 videos in each
Analysis:
category + count
- First find the 20 most-viewed videos with all their fields, in descending order of views (ORDER BY).
- Explode the category arrays of those 20 rows into one row per category (EXPLODE).
- Finally, query the category name and how many of the Top 20 videos fall into each category (GROUP BY).
Implementation:
SELECT category_name, COUNT(t1.videoId) AS hot_with_views
FROM (
    SELECT videoId, views, category
    FROM video
    ORDER BY views DESC
    LIMIT 20
) t1
LATERAL VIEW explode(t1.category) tmp AS category_name
GROUP BY category_name;
5.4 Category ranking for the videos related to the Top 50 most-viewed videos
Analysis:
- Find the 50 most-viewed videos with all their fields (including each one's related videos); call the result t1:

SELECT *
FROM video
ORDER BY views DESC
LIMIT 50;

- Explode the relatedId arrays of those 50 rows into one related-video id per row; call it t2:

SELECT explode(relatedId) AS videoId
FROM t1;

- Inner join the related-video ids with the video table to pick up each related video's category array (deduplicated), call it t4, then explode the categories:

SELECT t4.videoId, category_name
FROM (
    SELECT DISTINCT t2.videoId, t3.category
    FROM t2
    INNER JOIN video t3 ON t2.videoId = t3.videoId
) t4
LATERAL VIEW explode(category) t_category AS category_name;

- Group by category name, count the videos in each group, and order by the count:
SELECT category_name AS category, COUNT(t5.videoId) AS hot
FROM (
    SELECT videoId, category_name
    FROM (
        SELECT DISTINCT t2.videoId, t3.category
        FROM (
            SELECT explode(relatedId) AS videoId
            FROM (
                SELECT *
                FROM video
                ORDER BY views DESC
                LIMIT 50
            ) t1
        ) t2
        INNER JOIN video t3 ON t2.videoId = t3.videoId
    ) t4
    LATERAL VIEW explode(category) t_category AS category_name
) t5
GROUP BY category_name
ORDER BY hot DESC;
5.5 Top 10 most popular videos per category, using Music as an example
Analysis:
- To rank videos inside the Music category, we first need rows keyed by a single category, so create a table that stores the data with the category array exploded into a categoryId column.
- Insert the exploded data into that table.
- Query the popularity Top 10 for the target category (Music).
Implementation:
Create the category table:
CREATE TABLE video_category(
    videoId STRING, uploader STRING, age INT, categoryId STRING,
    length INT, views INT, rate FLOAT, ratings INT, comments INT,
    relatedId ARRAY<STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '&'
STORED AS ORC;
Insert data into the category table:
INSERT INTO TABLE video_category
SELECT videoId, uploader, age, categoryId, length, views, rate, ratings, comments, relatedId
FROM video LATERAL VIEW explode(category) tbl AS categoryId;
Query the Music category's Top 10:
SELECT videoId, views
FROM video_category
WHERE categoryId = 'Music'
ORDER BY views DESC
LIMIT 10;
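The helper table above materializes the exploded data once so it can be reused. As an aside, the same Music Top 10 can also be computed directly from the video table without creating video_category; a sketch using an inline LATERAL VIEW:

```sql
-- Equivalent query without the helper table: explode inline, then filter
SELECT videoId, views
FROM video
LATERAL VIEW explode(category) tbl AS categoryId
WHERE categoryId = 'Music'
ORDER BY views DESC
LIMIT 10;
```

The helper-table approach pays off when many per-category queries run against the same exploded data.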
5.6 Top 10 videos by traffic per category, using Music as an example
SELECT videoId, ratings
FROM video_category
WHERE categoryId = 'Music'
ORDER BY ratings DESC
LIMIT 10;
5.7 Top 10 uploaders by video count, and the Top 20 most-viewed videos they uploaded
- First find the 10 users who uploaded the most videos:
SELECT uploader,videos
FROM video_user
ORDER BY videos DESC
LIMIT 10;
- Join that result with the video table on the uploader field, then order the joined rows by views:
SELECT v.uploader, videoId, views
FROM (
    SELECT uploader, videos
    FROM video_user
    ORDER BY videos DESC
    LIMIT 10
) t1
JOIN video v ON v.uploader = t1.uploader
ORDER BY views DESC
LIMIT 20;
5.8 Top 10 most popular videos in every category (5.5-PLUS)
- Explode all categories:

SELECT videoId, views, category_name
FROM video LATERAL VIEW explode(category) tbl AS category_name;

Result:
ihhEp3uTZck 533936 Blogs
6B26asyGKDo 5147533 Film
6B26asyGKDo 5147533 Animation
sdUUx5FdySs 5840839 Film
sdUUx5FdySs 5840839 Animation
3gg5LOd_Zus 4200257 Entertainment
CQO3K8BcyGM 3083875 Comedy
bNF_P281Uu4 5231539 Travel
bNF_P281Uu4 5231539 Places
seGhTWE98DU 3296342 Music
N0TR0Irx4Y0 3836122 Comedy
P1OXAQHv09E 3068566 Comedy
o4x-VW_rCSE 3534116 Entertainment
- Use the window function rank() OVER (PARTITION BY category_name ORDER BY views DESC) to rank the videos within each category by popularity:

SELECT t1.videoId, t1.category_name,
       rank() OVER (PARTITION BY category_name ORDER BY views DESC) AS hot
FROM ( ... ) t1;   -- ( ... ) is the exploded query from the previous step
- Keep each category's Top 10, i.e. rank <= 10:

SELECT t2.videoId, t2.category_name, t2.hot
FROM ( ... ) t2
WHERE hot <= 10;
- Final query:
SELECT t2.videoId, t2.category_name, t2.hot
FROM (
    SELECT t1.videoId, t1.category_name,
           rank() OVER (PARTITION BY category_name ORDER BY views DESC) AS hot
    FROM (
        SELECT videoId, views, category_name
        FROM video
        LATERAL VIEW explode(category) tbl AS category_name
    ) t1
) t2
WHERE hot <= 10;
Result:
1dmVU08zVpA Entertainment 1
RB-wUgnyGv0 Entertainment 2
vr3x_RRJdd4 Entertainment 3
lsO6D1rwrKc Entertainment 4
ixsZy2425eY Entertainment 5
RUCZJVJ_M8o Entertainment 6
tFXLbXyXy6M Entertainment 7
7uwCEnDgd5o Entertainment 8
2KrdBUFeFtY Entertainment 9
vD4OnHCRd_4 Entertainment 10
bNF_P281Uu4 Places 1
s5ipz_0uC_U Places 2
6jJW7aSNCzU Places 3
dVRUBIyRAYk Places 4
lqbt6X4ZgEI Places 5
RIH1I1doUI4 Places 6
AlPqL7IUT6M Places 7
_5QUdvUhCZc Places 8
m9A_vxIOB-I Places 9
CL6f3Cyh85w Places 10
...
5.9 Top 10 uploaders by video count, and each one's Top 20 most-viewed videos (5.7-PLUS)
- Take the Top 10 uploaders:
SELECT uploader,videos
FROM video_user
ORDER BY videos DESC
LIMIT 10;
- Join with the video table and rank each uploader's videos by views with a window function:
SELECT t1.uploader, v.videoId, v.views,
       rank() OVER (PARTITION BY v.uploader ORDER BY v.views DESC) AS hot
FROM video v
JOIN (
    SELECT uploader, videos
    FROM video_user
    ORDER BY videos DESC
    LIMIT 10
) t1 ON t1.uploader = v.uploader;
- Keep the first 20 rows of each group (rank <= 20):
SELECT t2.uploader, t2.videoId, t2.hot
FROM (
    SELECT t1.uploader, v.videoId, v.views,
           rank() OVER (PARTITION BY v.uploader ORDER BY v.views DESC) AS hot
    FROM video v
    JOIN (
        SELECT uploader, videos
        FROM video_user
        ORDER BY videos DESC
        LIMIT 10
    ) t1 ON t1.uploader = v.uploader
) t2
WHERE t2.hot <= 20;