Alibaba Music Prediction: A First Look at ODPS SQL

I. Main platforms

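Processing the data and creating and reading tables can all be done in either Data Development (数据开发) or the Machine Learning Platform (机器学习平台).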

II. Reading and aggregating the competition data

-- Read the user-action table and the song table:

create table if not exists users as select * from odps_tc_257100_f673506e024.mars_tianchi_user_actions; 

create table if not exists songs as select * from odps_tc_257100_f673506e024.mars_tianchi_songs; 

-- List all artists:

create table if not exists artists as select distinct artist_id from songs; 

-- List every artist together with their songs:

create table if not exists artists_songs as select distinct artist_id, song_id from songs; 

-- Count each song's plays per day:

create table if not exists songs_plays as 

select song_id, ds, count(*) as plays 

from users 

where action_type = '1' 

group by song_id, ds;

(The statement is split across lines to make it easier to read.)

-- Join table artists_songs with table songs_plays:

create table if not exists artists_songs_plays as 

select b.artist_id, a.ds, a.plays 

from songs_plays a join artists_songs b

on a.song_id = b.song_id; 

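(1. Some songs in table artists_songs were never played, and some played songs in table songs_plays belong to artists who do not appear in table artists, so an inner join is required here; the keyword inner can be omitted. 2. In the Machine Learning Platform the JOIN component can be used directly; there, a SQL script component refers to its input tables as ${t1}, ${t2}, and so on, rather than by name.)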
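The first point can be checked directly. The query below is a minimal sketch that counts the songs in artists_songs with no row in songs_plays, by using a left outer join and keeping only the unmatched rows. The alias songs_never_played is made up for illustration.

```sql
-- Sketch: count songs that appear in artists_songs but were never played.
-- songs_never_played is a hypothetical column alias.
select count(*) as songs_never_played
from artists_songs b
left outer join (select distinct song_id from songs_plays) a
on b.song_id = a.song_id
where a.song_id is null;
```

A non-zero count means a left outer join would carry rows with empty ds and plays into the result, which is what the inner join avoids.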

-- Total each artist's plays per day:

create table if not exists artists_plays as 

select artist_id, ds, sum(plays) as plays 

from artists_songs_plays 

group by artist_id, ds; 

-- Compute each artist's average daily plays over the 30 days 20150801~20150830:

create table if not exists artists_plays_avg30 as 

select artist_id, avg(plays) as plays_avg30 

from artists_plays 

where ds > '20150731' and ds < '20150831' 

group by artist_id; 
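One thing to keep in mind: artists_plays only has a row for a day on which the artist was played at least once, so avg(plays) averages over the days present rather than over all 30 calendar days. If a per-calendar-day average is wanted, a sketch along these lines divides the total by 30 instead. The table name artists_plays_avg30_calendar is hypothetical.

```sql
-- Sketch: average over all 30 calendar days, counting missing days as zero plays.
-- artists_plays_avg30_calendar is a hypothetical table name.
create table if not exists artists_plays_avg30_calendar as
select artist_id, sum(plays) / 30.0 as plays_avg30
from artists_plays
where ds > '20150731' and ds < '20150831'
group by artist_id;
```

For an artist who is played every day the two results are identical; they differ only for artists with gaps in the window.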

III. Building the prediction (test) date table:

-- First extract the dates after 20150701 up to and including 20150831 (61 days):

create table if not exists test_dates_0 as 

select distinct ds 

from users 

where ds > 20150701 and ds <= 20150831; 

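(Note that the date literals here are written without quotes. This most likely works because ODPS implicitly converts the string column and the numeric literal before comparing them; writing '20150701' and '20150831' in quotes, as in the earlier query, is the safer form.)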

-- Convert from string to datetime:

create table if not exists test_dates_1 as 

select to_date(ds, "yyyymmdd") as ds 

from test_dates_0; 

(Example: 2015-08-18 00:00:00)

-- Add 61 days:

create table if not exists test_dates_2 as 

select dateadd(ds, 61, "dd") as ds 

from test_dates_1; 

-- Convert ds back to a string:

create table if not exists test_dates_3 as 

select cast(ds as string) as ds 

from test_dates_2; 

-- Convert back to the original format:

create table if not exists test_dates as 

select concat(substr(ds, 1, 4), substr(ds, 6, 2), substr(ds, 9, 2)) as ds

from test_dates_3; 

(Example: 2015-08-18 00:00:00 becomes 20150818)
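The four intermediate tables make each step easy to inspect, but the same conversion can probably be written as one statement. The following is a sketch that relies on the ODPS built-in to_char to format the datetime directly, which makes the cast and the substr calls unnecessary. The table name test_dates_direct is hypothetical and is not part of the workflow above.

```sql
-- Sketch: parse the string, add 61 days, and format it back in one pass.
-- test_dates_direct is a hypothetical table name.
create table if not exists test_dates_direct as
select to_char(dateadd(to_date(ds, 'yyyymmdd'), 61, 'dd'), 'yyyymmdd') as ds
from test_dates_0;
```

Under this approach 20150702 maps to 20150901 and 20150831 maps to 20151031, the same values the step-by-step version produces.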

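-- Combine table artists with the date table test_dates:

(To combine table artists with the date table test_dates, add a constant column to each of the two tables with select 'a' as join_flag, then join the two tables on join_flag. Because every row matches every other row, the result pairs each artist with each date. Adding a plays column on top of that gives the table format the organizers require for submission.)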
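The join_flag approach described above might look like the following sketch. The result table names artists_dates and artists_dates_plays are hypothetical. The second statement fills plays with the 30-day average computed earlier, which is only one plausible choice; the text does not say which value was actually submitted.

```sql
-- Sketch: pair every artist with every prediction date through a constant key.
-- artists_dates is a hypothetical table name.
create table if not exists artists_dates as
select a.artist_id, b.ds
from (select artist_id, 'a' as join_flag from artists) a
join (select ds, 'a' as join_flag from test_dates) b
on a.join_flag = b.join_flag;

-- Assumption: use each artist's 30-day average as the predicted plays.
-- artists_dates_plays is a hypothetical table name.
create table if not exists artists_dates_plays as
select d.artist_id, p.plays_avg30 as plays, d.ds
from artists_dates d
join artists_plays_avg30 p
on d.artist_id = p.artist_id;
```

The constant key turns an ordinary equi-join into a Cartesian product, so the first table should hold (number of artists) × (number of dates) rows.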

IV. Platform and language tips:

-- Clause order in a SQL query:

The full syntax of the SELECT statement is fairly complex, but its main clauses can be summarized as follows:

SELECT select_list 

[ INTO new_table ] 

FROM table_source 

[ WHERE search_condition ] 

[ GROUP BY group_by_expression ] 

[ HAVING search_condition ] 

[ ORDER BY order_expression [ ASC | DESC ] ] 

The UNION operator can be used between queries to combine their results into a single result set.
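As an illustration of that clause order on the tables built earlier, here is a sketch that lists the most-played artists in August. The threshold of 1000 and the alias total_plays are arbitrary values chosen for the example. ORDER BY is paired with LIMIT because ODPS has traditionally required that by default.

```sql
-- Sketch: WHERE filters rows, GROUP BY aggregates, HAVING filters groups,
-- ORDER BY sorts the final result. The threshold 1000 is arbitrary.
select artist_id, sum(plays) as total_plays
from artists_plays
where ds > '20150731' and ds < '20150831'
group by artist_id
having sum(plays) > 1000
order by total_plays desc
limit 20;
```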

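-- In Data Development, running a script prints the result table, which makes it easy to inspect the output and check your code for mistakes:

For example, to view the first 20 rows of table test_dates_1: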

select * from test_dates_1 limit 20; 
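A similar check can summarize a whole table instead of sampling it. The sketch below assumes the final date table test_dates from section III exists; the column aliases are made up for illustration.

```sql
-- Sketch: sanity-check the range and size of the prediction date table.
-- first_ds, last_ds and n_dates are hypothetical aliases.
select min(ds) as first_ds, max(ds) as last_ds, count(*) as n_dates
from test_dates;
```

If every date in the source range occurs at least once in users, this should report 61 dates running from 20150901 to 20151031.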

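-- Data Development compared with the Machine Learning Platform:

In Data Development you write the script directly, which is more concise and easy to take in at a glance;

in the Machine Learning Platform you can run one step at a time and view the result of each step, which makes the logic clearer.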
