This article shows how to use Spark SQL to group rows on two columns, sort within each group, and keep the top K rows of every group.
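Requirement
Use Spark SQL to find the top 3 students in each subject (language, math, english) for every class in every department.
Sample data
Data format: id,studentId,language,math,english,classId,departmentId, that is: record id, student number, Chinese-language score, math score, foreign-language score, class, department. In the rows below, 1班 and 2班 are Class 1 and Class 2, 经济系 is the Economics Department, and 外语系 is the Foreign Languages Department.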
1,111,68,69,90,1班,经济系
2,112,73,80,96,1班,经济系
3,113,90,74,75,1班,经济系
4,114,89,94,93,1班,经济系
5,115,99,93,89,1班,经济系
6,121,96,74,79,2班,经济系
7,122,89,86,85,2班,经济系
8,123,70,78,61,2班,经济系
9,124,76,70,76,2班,经济系
10,211,89,93,60,1班,外语系
11,212,76,83,75,1班,外语系
12,213,71,94,90,1班,外语系
13,214,94,94,66,1班,外语系
14,215,84,82,73,1班,外语系
15,216,85,74,93,1班,外语系
16,221,77,99,61,2班,外语系
17,222,80,78,96,2班,外语系
18,223,79,74,96,2班,外语系
19,224,75,80,78,2班,外语系
20,225,82,85,63,2班,外语系
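Each sample line has seven comma-separated fields, and the job further down reads fields 1 to 6 by position. As a minimal sketch of that layout, the following plain Scala (no Spark needed) parses one line into a typed record. The names `Score` and `ScoreParser` are hypothetical and only serve this illustration; they do not appear in the original job, which does not skip malformed lines.

```scala
// Hypothetical record type for illustration; field names mirror the column names above.
final case class Score(
    studentId: String,
    language: Int,
    math: Int,
    english: Int,
    classId: String,
    departmentId: String
)

object ScoreParser {
  // Returns None when the line does not have exactly 7 comma-separated fields
  // or when one of the three score columns is not an integer.
  def parse(line: String): Option[Score] =
    line.split(",").map(_.trim) match {
      // The leading record id is ignored, as in the Spark job, which starts at item(1).
      case Array(_, studentId, language, math, english, classId, departmentId) =>
        scala.util.Try(
          Score(studentId, language.toInt, math.toInt, english.toInt, classId, departmentId)
        ).toOption
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    println(parse("1,111,68,69,90,1班,经济系")) // Some(Score(111,68,69,90,1班,经济系))
    println(parse("bad,line"))                  // None
  }
}
```

Returning `Option` keeps a bad line from throwing inside a `map`; whether to drop or report such lines is a design choice the original code leaves open.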
Implementation with Spark SQL
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object TestSqlGroupByOrder {

  def main(args: Array[String]): Unit = {

    /** Set the log level */
    Logger.getLogger("org").setLevel(Level.WARN)

    /** SparkSession was introduced in Spark 2.0. SparkSession = SQLContext + HiveContext */
    val sparkSession = SparkSession.builder().appName("SparkSqlGroup").master("local[6]").getOrCreate()

    /** DataFrame */
    import sparkSession.implicits._
    val scoreInfo = sparkSession.read.textFile("/Users/wangpei/Desktop/scores2.txt")
      .map(_.split(","))
      // item(0) is the record id and is not needed
      .map(item => (item(1), item(2).toInt, item(3).toInt, item(4).toInt, item(5), item(6)))
      .toDF("studentId", "language", "math", "english", "classId", "departmentId")

    /** Register the DataFrame as a temporary view */
    scoreInfo.createOrReplaceTempView("scoresTable")

    /**
     * Use a window function:
     *   row_number() OVER (PARTITION BY COL1 ORDER BY COL2) rank
     * Rows are grouped by COL1 and sorted by COL2 inside each group;
     * rank is the column holding each row's number within its sorted group.
     * Each query below is made of two parts:
     * 1) (SELECT *, row_number() OVER (PARTITION BY departmentId,classId ORDER BY math DESC) rank FROM scoresTable ) tmp
     *    The window function groups by departmentId,classId, sorts each group by math descending,
     *    and numbers the rows of each group starting from 1; the subquery is aliased as tmp.
     * 2) SELECT * FROM tmp WHERE rank <= 3
     *    Keeps only the rows with rank <= 3.
     */

    // Top 3 in language
    println("############# Top 3 in language ##############")
    sparkSession.sql("SELECT departmentId,classId,language,studentId FROM (SELECT *, row_number() OVER (PARTITION BY departmentId,classId ORDER BY language DESC) rank FROM scoresTable ) tmp WHERE rank <= 3").show()

    // Top 3 in math
    println("############# Top 3 in math ##############")
    sparkSession.sql("SELECT departmentId,classId,math,studentId FROM (SELECT *, row_number() OVER (PARTITION BY departmentId,classId ORDER BY math DESC) rank FROM scoresTable ) tmp WHERE rank <= 3").show()

    // Top 3 in english
    println("############# Top 3 in english ##############")
    sparkSession.sql("SELECT departmentId,classId,english,studentId FROM (SELECT *, row_number() OVER (PARTITION BY departmentId,classId ORDER BY english DESC) rank FROM scoresTable ) tmp WHERE rank <= 3").show()
  }
}
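To make the semantics of `row_number() OVER (PARTITION BY departmentId,classId ORDER BY math DESC)` followed by `WHERE rank <= 3` concrete, here is a rough sketch that emulates the same steps with plain Scala collections on the Economics Department rows from the sample data. It is not how Spark executes the query. The names `RowNumberSketch`, `Row`, `sampleRows`, and `topKByMath` are hypothetical and exist only for this illustration.

```scala
object RowNumberSketch {
  // Hypothetical row type for illustration; only the columns used by the math query.
  final case class Row(studentId: String, math: Int, classId: String, departmentId: String)

  // Math scores of the Economics Department rows from the sample data.
  val sampleRows: Seq[Row] = Seq(
    Row("111", 69, "1班", "经济系"),
    Row("112", 80, "1班", "经济系"),
    Row("113", 74, "1班", "经济系"),
    Row("114", 94, "1班", "经济系"),
    Row("115", 93, "1班", "经济系"),
    Row("121", 74, "2班", "经济系"),
    Row("122", 86, "2班", "经济系"),
    Row("123", 78, "2班", "经济系"),
    Row("124", 70, "2班", "经济系")
  )

  // Emulates the subquery plus the outer filter:
  //   row_number() OVER (PARTITION BY departmentId, classId ORDER BY math DESC) <= k
  def topKByMath(rows: Seq[Row], k: Int): Map[(String, String), Seq[Row]] =
    rows
      .groupBy(r => (r.departmentId, r.classId)) // PARTITION BY departmentId, classId
      .map { case (key, group) =>
        // ORDER BY math DESC, then attach a 0-based index to every row of the group
        val numbered = group.sortBy(r => -r.math).zipWithIndex
        // The row number is index + 1; keep rows whose row number is <= k
        key -> numbered.collect { case (r, idx) if idx + 1 <= k => r }
      }

  def main(args: Array[String]): Unit =
    topKByMath(sampleRows, 3).foreach { case ((dept, cls), top) =>
      println(s"$dept $cls: " + top.map(r => s"${r.studentId}=${r.math}").mkString(", "))
    }
}
```

One difference is worth keeping in mind. Scala's `sortBy` is stable, so tied scores keep their input order here, whereas Spark's `row_number()` gives tied rows distinct numbers in no guaranteed order unless the `ORDER BY` includes a tiebreaker such as `studentId`. If all students tied at the cutoff should be kept, Spark SQL's `rank()` or `dense_rank()` can be used in place of `row_number()`.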