本文主要是介绍json导入Hive,并整理成大宽表,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!
文章目录
- 1 创建表
- 2 导入Hive
- 3 json_tuple查询数据
- 4 整理成大宽表
首先我们有如下图的json数据,我们需要把这份数据先导入到Hive,然后在整理成结构化的数据,这样我们就可以根据需求查询对应的数据了
1 创建表
首先先要创建一个表
create table rating(json string);
2 导入Hive
然后把数据导入到hive中
load data local inpath '/home/hadoopadmin/rating.json' into table rating;
查看数据,已经导入到hive中
3 json_tuple查询数据
但是上面的数据格式不是我们想要的,我们想要的数据格式为下面这种结构:
movie | rate | time | userid |
---|---|---|---|
1193 | 5 | 978300760 | 1 |
hive中有个json_tuple函数,官方语法:
json_tuple(string jsonStr,string k1,...,string kn)
#jsonStr:一个json字符串
#k1...kn:json字符串中的key
我们举个例子:
select json_tuple(
'{"movie":"1193","rate":"5","time":"978300760","userid":"1"}',
'movie','rate','time','userid');
OK
#结果
c0 c1 c2 c3
1193 5 978300760 1
上面结果中别名我们需要改下
select json_tuple(
'{"movie":"1193","rate":"5","time":"978300760","userid":"1"}',
'movie','rate','time','userid') as (movie, rate, time, user_id);
OK
#结果
movie rate time user_id
1193 5 978300760 1
那么下面,我们只要把上面的json字符串改成表的字段json,然后从rating表中查询即可
select
json_tuple(json,'movie','rate','time','userid') as (movie, rate, time, user_id)
from rating limit 10 ;
如下图,我们通过json_tuple函数,把json数据结构,改成了结构化数据格式
4 整理成大宽表
上面的数据正常不会满足我们的需求,假如我们需要查询某个时间的信息,还需要其他的一些信心,例如下面这个格式,也就是常说的大宽表:
movie | rate | time | userid | year | month | day | hour | minute | ts |
---|---|---|---|---|---|---|---|---|---|
1193 | 5 | 978300760 | 1 | 2011 | 1 | 1 | 6 | 12 | 2001-01-01 06:12:40 |
目前我们有的参数是一个字符串的time,我们先要把time字符串转换成整数,然后再把整数转换成时间格式,这两个转换用到下面两个函数:
函数官网:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
#把表达式转成想要的类型
cast(expr as <type>)
#把bigint的时间类型,转成想要的格式
from_unixtime(bigint unixtime[, string format])
针对上面2个函数,我们举个例子:
select cast('978300760' as bigint);
select from_unixtime(cast('978300760' as bigint));
后面我们只需要,把第三步的查询结果作为自查询,然后用上面2个函数去解析time字段就ok了
select movie,rate,time,user_id,
from_unixtime(cast(time as bigint)) as ts
from
(
select
json_tuple(json,'movie','rate','time','userid') as (movie, rate, time, user_id)
from rating
) t
limit 10;
再通过下面这些函数,获取ts对应的年、月、日、时、分,就完成了大宽表
select movie,rate,time,user_id,
from_unixtime(cast(time as bigint)) as ts,
year(from_unixtime(cast(time as bigint))) as year,
month(from_unixtime(cast(time as bigint))) as month,
day(from_unixtime(cast(time as bigint))) as day,
hour(from_unixtime(cast(time as bigint))) as hour,
minute(from_unixtime(cast(time as bigint))) as minute
from
(
select
json_tuple(json,'movie','rate','time','userid') as (movie, rate, time, user_id)
from rating
) t
limit 10;
如下图,查询出我们想要的大宽表了
但是只是查询出来还不行,我们不能每次要查一个需求的时候,我就用一下上面那么一大坨sql,我们可以把查询出来的大宽表再生成一张表,然后针对这个表做一些业务的处理:
create table rating_width
as
select movie,rate,time,user_id,
from_unixtime(cast(time as bigint)) as ts,
year(from_unixtime(cast(time as bigint))) as year,
month(from_unixtime(cast(time as bigint))) as month,
day(from_unixtime(cast(time as bigint))) as day,
hour(from_unixtime(cast(time as bigint))) as hour,
minute(from_unixtime(cast(time as bigint))) as minute
from
(
select
json_tuple(json,'movie','rate','time','userid') as (movie, rate, time, user_id)
from rating
) t;
生成大宽表之后,我们查询看一下:
select * from rating_width limit 10;
如下图,已经生成我们需要的大宽表了
这篇关于json导入Hive,并整理成大宽表的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!