This post is a deep dive into Top N in Spark ("Top N demystified"); hopefully it is a useful reference for developers working on this kind of problem.
This post covers:
1. Basic Top N in practice
2. Grouped Top N in practice
3. RangePartitioner internals (the machinery behind sortByKey)
1. Basic Top N in practice
Top N involves sorting; take simply grabs the first few elements it reaches, with no sorting at all.
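As a quick illustration of the difference, here is a minimal sketch (my own, assuming an existing SparkContext named sc) using the same numbers as the test file below:

// take returns whatever elements it reaches first; sortByKey + take gives a real Top N.
val nums = sc.parallelize(Seq(1, 4, 2, 5, 7, 3, 2, 7, 9, 1, 4, 5))
val firstFive = nums.take(5)                                           // e.g. 1, 4, 2, 5, 7 -- no ordering guarantee
val topFive = nums.map(n => (n, n)).sortByKey(false).map(_._2).take(5) // 9, 7, 7, 5, 5
// nums.top(5) would also return the largest five, since RDD.top sorts by the implicit ordering.
println(firstFive.mkString(", "))
println(topFive.mkString(", "))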
Create a new file basicTopN.txt (the code below reads it from D://SoftWare//spark-1.5.2-bin-hadoop2.6//basicTopN.txt), with one number per line:
1
4
2
5
7
3
2
7
9
1
4
5
Going by the source code, take returns an Array, not an RDD; collect likewise returns an Array containing every element of the RDD, whereas take only scans as many partitions as it needs.
/**
 * Return an array that contains all of the elements in this RDD.
 */
def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
/**
 * Take the first num elements of the RDD. It works by first scanning one partition, and use the
 * results from that partition to estimate the number of additional partitions needed to satisfy
 * the limit.
 *
 * @note due to complications in the internal implementation, this method will raise
 * an exception if called on an RDD of `Nothing` or `Null`.
 */
def take(num: Int): Array[T] = withScope {
  if (num == 0) {
    new Array[T](0)
  } else {
    val buf = new ArrayBuffer[T]
    val totalParts = this.partitions.length
    var partsScanned = 0
    while (buf.size < num && partsScanned < totalParts) {
      // The number of partitions to try in this iteration. It is ok for this number to be
      // greater than totalParts because we actually cap it at totalParts in runJob.
      var numPartsToTry = 1
      if (partsScanned > 0) {
        // If we didn't find any rows after the previous iteration, quadruple and retry.
        // Otherwise, interpolate the number of partitions we need to try, but overestimate
        // it by 50%. We also cap the estimation in the end.
        if (buf.size == 0) {
          numPartsToTry = partsScanned * 4
        } else {
          // the left side of max is >=1 whenever partsScanned >= 2
          numPartsToTry = Math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
          numPartsToTry = Math.min(numPartsToTry, partsScanned * 4)
        }
      }
      val left = num - buf.size
      val p = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
      val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)
      res.foreach(buf ++= _.take(num - buf.size))
      partsScanned += numPartsToTry
    }
    buf.toArray
  }
}
So the code is as follows:
package com.zhouls.spark.cores

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Basic Top N in practice
 * Created by Administrator on 2016/10/9.
 */
object TopNBasic {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    conf.setAppName("Top N Basically!").setMaster("local")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("D://SoftWare//spark-1.5.2-bin-hadoop2.6//basicTopN.txt")
    val pairs = lines.map(line => (line.toInt, line)) // build key-value pairs so sortByKey can sort them
    val sortedPairs = pairs.sortByKey(false)          // sort in descending order
    val sortedData = sortedPairs.map(pair => pair._2) // map is the usual way to reshape each record; keep only the sorted values themselves
    val top5 = sortedData.take(5)                     // grab the top 5 elements
    top5.foreach(println)
  }
}
Along the way, here is a new point worth learning: setLogLevel.
Its source:
/** Control our logLevel. This overrides any user-defined log settings.
 * @param logLevel The desired log level as a string.
 * Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
 */
def setLogLevel(logLevel: String) {
  val validLevels = Seq("ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN")
  if (!validLevels.contains(logLevel)) {
    throw new IllegalArgumentException(
      s"Supplied level $logLevel did not match one of: ${validLevels.mkString(",")}")
  }
  Utils.setLogLevel(org.apache.log4j.Level.toLevel(logLevel))
}
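A minimal usage sketch (the placement is my assumption, not shown in the original post): call it on the SparkContext right after it is created, before any job runs, using one of the valid levels listed above.

val conf = new SparkConf().setAppName("Top N Basically!").setMaster("local")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN") // try ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE or WARN and compare the output below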
setLogLevel("ALL")
The corresponding console output (abridged):
"C:\Program Files\Java\jdk1.8.0_66\bin\java" -Didea.launcher.port=7533 "-Didea.launcher.bin.path=D:\SoftWare\IntelliJ IDEA\IntelliJ IDEA Community Edition 2016.1.4\bin" -Dfile.encoding=UTF-8 -classpath "C:\Program Files\Java\jdk1.8.0_66\jre\lib\charsets.jar;C:\Program
artitions
d size 1814.0 B, free 976.2 MB)
16/10/09 09:15:38 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message AkkaMessage(UpdateBlockInfo(BlockManagerId(driver, localhost, 52833),broadcast_2_piece0,StorageLevel(false, true, false, false, 1),1814,0,0),true) from Actor[akka://sparkDriver/temp/$g]
16/10/09 09:15:38 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: AkkaMessage(UpdateBlockInfo(BlockManagerId(driver, localhost, 52833),broadcast_2_piece0,StorageLevel(false, true, false, false, 1),1814,0,0),true)
16/10/09 09:15:38 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:52833 (size: 1814.0 B, free: 976.3 MB)
16/10/09 09:15:38 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message (3.09051 ms) AkkaMessage(UpdateBlockInfo(BlockManagerId(driver, localhost, 52833),broadcast_2_piece0,StorageLevel(false, true, false, false, 1),1814,0,0),true) from Actor[akka://sparkDriver/temp/$g]
16/10/09 09:15:38 DEBUG BlockManagerMaster: Updated info of block broadcast_2_piece0
16/10/09 09:15:38 DEBUG BlockManager: Told master about block broadcast_2_piece0
16/10/09 09:15:38 DEBUG BlockManager: Put block broadcast_2_piece0 locally took 8 ms
16/10/09 09:15:38 DEBUG BlockManager: Putting block broadcast_2_piece0 without replication took 9 ms
16/10/09 09:15:38 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:861
bytes)
16/10/09 09:15:39 TRACE DAGScheduler: failed: Set()
16/10/09 09:15:39 INFO DAGScheduler: Job 0 finished: take at TopNBasic.scala:20, took 1.022280 s
9
7
7
5
5
16/10/09 09:15:39 INFO SparkContext: Invoking stop() from shutdown hook
age (5.094032 ms) AkkaMessage(StopCoordinator,false) from Actor[akka://sparkDriver/deadLetters]
16/10/09 09:15:39 INFO ShutdownHookManager: Deleting directory C:\Users\Administrator\AppData\Local\Temp\spark-3656d24c-bfdb-4def-b751-8d7fc84150cb
Process finished with exit code 0
setLogLevel("DEBUG")
The corresponding console output (abridged):
"C:\Program Files\Java\jdk1.8.0_66\bin\java" -Didea.launcher.port=7534 "-Didea.launcher.bin.path=D:\SoftWare\IntelliJ IDEA\IntelliJ IDEA Community Edition 2016.1.4\bin" -Dfile.encoding=UTF-8 -classpath "C:\Program Files\Java\jdk1.8.0_66\jre\lib\charsets.jar;C:\Program Files\Java\jdk1.8.0_66\jre\lib\deploy.jar;C:\Program Files\Java\jdk1.8.0_66\jre\lib\ext\access-bridge-64.jar;C:\Program Files\Java\jdk1.8.0_66\jre\lib\ext\cldrdata.jar;C:\Program Files\Java\jdk1.8.0_66\jre\lib\ext\dnsns.jar;C:\Program Files\Java\jdk1.8.0_66\jre\lib\ext\jaccess.jar;C:\Program Files\Java\jdk1.8.0_66\jre\lib\ext\jfxrt.jar;C:\Program Files\Java\jdk1.8.0_66\jre\lib\ext\localedata.jar;C:\Program Files\Java\jdk1.8.0_66\jre\lib\ext\nashorn.jar;C:\Program fun$28
16/10/09 09:18:05 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message (2.022709 ms) AkkaMessage(StatusUpdate(1,FINISHED,java.nio.HeapByteBuffer[pos=0 lim=1185 cap=1185]),false) from Actor[akka://sparkDriver/deadLetters]
16/10/09 09:18:05 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 153 ms on localhost (1/1)
16/10/09 09:18:05 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/10/09 09:18:05 INFO DAGScheduler: ResultStage 1 (take at TopNBasic.scala:20) finished in 0.163 s
16/10/09 09:18:05 DEBUG DAGScheduler: After removal of stage 1, remaining stages = 1
16/10/09 09:18:05 DEBUG DAGScheduler: After removal of stage 0, remaining stages = 0
16/10/09 09:18:05 INFO DAGScheduler: Job 0 finished: take at TopNBasic.scala:20, took 0.985550 s
9
7
7
5
5
16/10/09 09:18:05 INFO SparkContext: Invoking stop() from shutdown hook
16/10/09 09:18:05 INFO SparkUI: Stopped Spark web UI at http://192.168.56.1:4040
16/10/09 09:18:05 INFO ShutdownHookManager: Deleting directory C:\Users\Administrator\AppData\Local\Temp\spark-c9f238f3-9210-4f3a-a248-11f6f610163e
Process finished with exit code 0
setLogLevel("ERROR")
The corresponding console output (abridged):
"C:\Program Files\Java\jdk1.8.0_66\bin\java" -Didea.launcher.port=7535 "-Didea.launcher.bin.path=D:\SoftWare\IntelliJ IDEA\IntelliJ IDEA Community Edition 2016.1.4\bin" -Dfile.encoding=UTF-8 -classpath "C:\Program Files\Java\jdk1.8.0_66\jre\lib\charsets.jar;C:\Program Files\Java\jdk1.8.0_66\jre\lib\deploy.jar;C:\Program Files\Java\jdk1.8.0_66\jre\lib\ext\access-bridge-64.jar;C:\Program
16/10/09 09:18:43 INFO BlockManagerMasterEndpoint: Registering block manager localhost:52966 with 976.3 MB RAM, BlockManagerId(driver, localhost, 52966)
16/10/09 09:18:43 INFO BlockManagerMaster: Registered BlockManager
9
7
7
5
5
16/10/09 09:18:50 WARN QueuedThreadPool: 3 threads could not be stopped
Process finished with exit code 0
setLogLevel("FATAL")
The corresponding console output (abridged):
"C:\Program Files\Java\jdk1.8.0_66\bin\java" -Didea.launcher.port=7536 "-Didea.launcher.bin.path=D:\SoftWare\IntelliJ IDEA\IntelliJ IDEA Community Edition 2016.1.4\bin" -Dfile.encoding=UTF-8 -classpath "C:\Program Files\Java\jdk1.8.0_66\jre\lib\charsets.jar;C:\Program Files\Java\jdk1.8.0_66\jre\lib\deploy.jar;C:\Program Files\Java\jdk1.8.0_66\jre\lib\ext\access-bridge-64.jar;C:\Program
16/10/09 09:20:17 INFO BlockManagerMasterEndpoint: Registering block manager localhost:53014 with 976.3 MB RAM, BlockManagerId(driver, localhost, 53014)
16/10/09 09:20:17 INFO BlockManagerMaster: Registered BlockManager
9
7
7
5
5
Process finished with exit code 0
setLogLevel("INFO")
The corresponding console output (abridged):
"C:\Program Files\Java\jdk1.8.0_66\bin\java" -Didea.launcher.port=7537 "-Didea.launcher.bin.path=D:\SoftWare\IntelliJ IDEA\IntelliJ IDEA Community Edition 2016.1.4\bin" -Dfile.encoding=UTF-8 -classpath "C:\Program Files\Java\jdk1.8.0_66\jre\lib\charsets.jar;C:\Program Files\Java\jdk1.8.0_66\jre\lib\deploy.jar;C:\Program Files\Java\jdk1.8.0_66\jre\lib\ext\access-bridge-64.jar;C:\Program
16/10/09 09:21:17 INFO DAGScheduler: Job 0 finished: take at TopNBasic.scala:20, took 1.085930 s
9
7
7
5
5
16/10/09 09:21:17 INFO SparkContext: Invoking stop() from shutdown hook
16/10/09 09:21:17 INFO SparkUI: Stopped Spark web UI at http://192.168.56.1:4040
16/10/09 09:21:17 INFO ShutdownHookManager: Deleting directory C:\Users\Administrator\AppData\Local\Temp\spark-de03b369-fec4-4785-abec-563c502d0bd7
Process finished with exit code 0
setLogLevel("OFF")
The corresponding console output (abridged):
"C:\Program Files\Java\jdk1.8.0_66\bin\java" -Didea.launcher.port=7538 "-Didea.launcher.bin.path=D:\SoftWare\IntelliJ IDEA\IntelliJ IDEA Community Edition 2016.1.4\bin" -Dfile.encoding=UTF-8 -classpath "C:\Program Files\Java\jdk1.8.0_66\jre\lib\charsets.jar;C:\Program
16/10/09 09:22:10 INFO BlockManagerMasterEndpoint: Registering block manager localhost:53098 with 976.3 MB RAM, BlockManagerId(driver, localhost, 53098)
16/10/09 09:22:10 INFO BlockManagerMaster: Registered BlockManager
9
7
7
5
5
Process finished with exit code 0
setLogLevel("TRACE")
The corresponding console output (abridged):
"C:\Program Files\Java\jdk1.8.0_66\bin\java" -Didea.launcher.port=7539 "-Didea.launcher.bin.path=D:\SoftWare\IntelliJ IDEA\IntelliJ IDEA Community Edition 2016.1.4\bin" -Dfile.encoding=UTF-8 -classpath "C:\Program Files\Java\jdk1.8.0_66\jre\lib\charsets.jar;C:\Program Files\Java\jdk1.8.0_66\jre\lib\deploy.jar;C:\Program Files\Java\jdk1.8.0_66\jre\lib\ext\access-bridge-64.jar;C:\Program
16/10/09 09:23:15 TRACE DAGScheduler: running: Set()
16/10/09 09:23:15 TRACE DAGScheduler: waiting: Set()
16/10/09 09:23:15 TRACE DAGScheduler: failed: Set()
16/10/09 09:23:15 INFO DAGScheduler: Job 0 finished: take at TopNBasic.scala:20, took 0.985096 s
9
7
7
5
5
16/10/09 09:23:15 INFO SparkContext: Invoking stop() from shutdown hook
16/10/09 09:23:15 INFO SparkUI: Stopped Spark web UI at http://192.168.56.1:4040
16/10/09 09:23:15 INFO ShutdownHookManager: Deleting directory C:\Users\Administrator\AppData\Local\Temp\spark-d3604805-b6e2-4873-a8aa-10cabda4f329
Process finished with exit code 0
setLogLevel("WARN")
The corresponding console output (abridged):
"C:\Program Files\Java\jdk1.8.0_66\bin\java" -Didea.launcher.port=7532 "-Didea.launcher.bin.path=D:\SoftWare\IntelliJ IDEA\IntelliJ IDEA Community Edition 2016.1.4\bin" -Dfile.encoding=UTF-8 -classpath "C:\Program Files\Java\jdk1.8.0_66\jre\lib\charsets.jar;C:\Program fe80:0:0:0:0:5efe:c0a8:bf02%net11, but we couldn't find any external IP address!
9
7
7
5
5
Process finished with exit code 0
Summary: that wraps up the basic Top N example.
2. Grouped Top N in practice
Let's start with a Java implementation. The code below reads D://SoftWare//spark-1.5.2-bin-hadoop2.6//groupTopN.txt, which contains one "name score" pair per line:
Spark 100
Hadoop 65
Spark 99
Hadoop 61
Spark 195
Hadoop 60
Spark 98
Hadoop 69
Spark 91
Hadoop 64
Spark 89
Hadoop 98
Spark 88
Hadoop 99
Spark 68
Hadoop 60
Spark 79
Hadoop 97
Spark 69
Hadoop 96
package com.zhouls.spark.SparkApps.cores;

import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class TopNGroup {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("TopNGroup").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf); // under the hood this is just Scala's SparkContext
        JavaRDD<String> lines = sc.textFile("D://SoftWare//spark-1.5.2-bin-hadoop2.6//groupTopN.txt");

        JavaPairRDD<String, Integer> pairs = lines.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, Integer> call(String line) throws Exception {
                String[] splitedLine = line.split(" ");
                System.out.println(splitedLine[0]);
                return new Tuple2<String, Integer>(splitedLine[0], Integer.valueOf(splitedLine[1]));
            }
        });

        JavaPairRDD<String, Iterable<Integer>> groupedPairs = pairs.groupByKey();

        JavaPairRDD<String, Iterable<Integer>> top5 = groupedPairs.mapToPair(
                new PairFunction<Tuple2<String, Iterable<Integer>>, String, Iterable<Integer>>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, Iterable<Integer>> call(Tuple2<String, Iterable<Integer>> groupedData) throws Exception {
                Integer[] top5 = new Integer[5];
                String groupedKey = groupedData._1;
                Iterator<Integer> groupedValue = groupedData._2.iterator();
                while (groupedValue.hasNext()) {
                    Integer value = groupedValue.next();
                    // Insert the value into the fixed-size top-5 array, keeping it in descending order.
                    for (int i = 0; i < 5; i++) {
                        if (top5[i] == null) {
                            top5[i] = value;
                            break;
                        } else if (value > top5[i]) {
                            for (int j = 4; j > i; j--) {
                                top5[j] = top5[j - 1];
                            }
                            top5[i] = value;
                            break;
                        }
                    }
                }
                return new Tuple2<String, Iterable<Integer>>(groupedKey, Arrays.asList(top5));
            }
        });

        // Print the grouped Top N
        top5.foreach(new VoidFunction<Tuple2<String, Iterable<Integer>>>() {
            @Override
            public void call(Tuple2<String, Iterable<Integer>> topped) throws Exception {
                System.out.println("Group key :" + topped._1);        // the group key
                Iterator<Integer> toppedValue = topped._2.iterator(); // the group's values
                while (toppedValue.hasNext()) {                       // print each group's Top N
                    Integer value = toppedValue.next();
                    System.out.println(value);
                }
                System.out.println("******************************************************");
            }
        });
    }
}
Thanks to this blogger: http://www.it610.com/article/5193051.htm
If the content of groupTopN.txt is instead:
Spark 100
Hadoop 62
Flink 77
Kafka 91
Hadoop 93
Spark 78
Hadoop 69
Spark 98
Hadoop 62
Spark 99
Hadoop 61
Spark 70
Hadoop 75
Spark 88
Hadoop 68
Spark 90
Hadoop 61
then the same program prints a Top 5 for each of the keys Spark, Hadoop, Flink, and Kafka.
Summary of the grouped Top N example (grouped Top N sorting):
1. Read each line of input: JavaRDD<String> lines.
2. Build (K, V) pairs, JavaPairRDD<String, Integer> pairs: the input is one line of text; the output key is the name and the value is the score.
3. Group by name with groupByKey: JavaPairRDD<String, Iterable<Integer>> groupedPairs = pairs.groupByKey();
4. Sort within each group: the input is the grouped data, where the key is the group name and the value is the collection of scores; the output key is the group name and the value is the sorted collection of scores, trimmed to 5 values:
JavaPairRDD<String, Iterable<Integer>> top5 = groupedPairs.mapToPair(
    new PairFunction<Tuple2<String, Iterable<Integer>>, String, Iterable<Integer>>() { ... });
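For comparison, here is a minimal Scala sketch of the same grouped Top N (my own version, not from the original post; it reuses the groupTopN.txt path from the Java code and simply sorts each group's values instead of maintaining the Java version's fixed 5-element array):

package com.zhouls.spark.cores

import org.apache.spark.{SparkConf, SparkContext}

object TopNGroupScala {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TopNGroupScala").setMaster("local")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("D://SoftWare//spark-1.5.2-bin-hadoop2.6//groupTopN.txt")

    // (name, score) pairs -> grouped by name -> each group's scores sorted descending and trimmed to 5
    val top5 = lines
      .map { line => val Array(name, score) = line.split(" "); (name, score.toInt) }
      .groupByKey()
      .mapValues(scores => scores.toList.sortBy(x => -x).take(5))

    top5.collect().foreach { case (name, scores) =>
      println(s"Group key: $name -> ${scores.mkString(", ")}")
    }

    sc.stop()
  }
}

Sorting the whole group is fine for small groups; for very large groups, a bounded structure like the Java version's 5-element array (or an aggregation such as aggregateByKey) avoids materializing and sorting all of a group's values at once.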
3. RangePartitioner internals (the machinery behind sortByKey)
/**
 * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
 * `collect` or `save` on the resulting RDD will return or output an ordered list of records
 * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
 * order of the keys).
 */
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = self.withScope {
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
RangePartitioner splits the data of the RDD it depends on into distinct ranges, and the crucial point is that the ranges themselves are ordered: every key in one partition comes before every key in the next. Besides being the foundation of the globally ordered result, its most important job is to keep the amount of data in each partition as even as possible.
A classic Google-style interview question: how do you sort a data set whose size you cannot know in advance?
The sorting machinery boils down to two pieces:
1. A binary search that places each key into its corresponding partition (a standalone sketch appears after the getPartition source further below).
Before learning binary search, the most obvious approach is to walk the array and compare against every element, which takes O(n) time. Binary search does better, with O(log n) lookups. For example, to find the element 6 in the array {1, 2, 3, 4, 5, 6, 7, 8, 9}:
1. Look at the middle element, 5. Since 5 < 6, the value 6 must lie to the right of 5, so continue the search in {6, 7, 8, 9}.
2. Look at the middle of {6, 7, 8, 9}, which is 7. Since 7 > 6, the value 6 must lie to the left of 7, leaving only {6}, and the element is found.
2. Reservoir sampling (the original post calls it "bucket sampling"), which handles the case where the data is far too large to fit in memory. The reason for the multiply-by-3 below: the RDD's partitions may be skewed. sampleSize is the desired total sample size, but some partitions may hold fewer than sampleSize / numPartitions records, so multiplying by 3 lets the other partitions contribute extra samples and the overall sample still reaches or exceeds sampleSize. (A small worked example follows the snippet below.)
// This is the sample size we need to have roughly balanced output partitions, capped at 1M.
val sampleSize = math.min(20.0 * partitions, 1e6)
// Assume the input partitions are roughly balanced and over-sample a little bit.
val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.size).toInt
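To make the numbers concrete, here is a small worked example (the partition counts are hypothetical, chosen only for illustration):

// Hypothetical job: 8 output partitions requested, and the input RDD also has 8 partitions.
val partitions = 8
val inputPartitions = 8                                                          // rdd.partitions.size
val sampleSize = math.min(20.0 * partitions, 1e6)                                // 160.0 keys wanted in total
val sampleSizePerPartition = math.ceil(3.0 * sampleSize / inputPartitions).toInt // 60 = 3x the fair share of 20 per input partition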
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark

import java.io.{IOException, ObjectInputStream, ObjectOutputStream}

import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer
import scala.reflect.{ClassTag, classTag}
import scala.util.hashing.byteswap32

import org.apache.spark.rdd.{PartitionPruningRDD, RDD}
import org.apache.spark.serializer.JavaSerializer
import org.apache.spark.util.{CollectionsUtils, Utils}
import org.apache.spark.util.random.{XORShiftRandom, SamplingUtils}

/**
 * An object that defines how the elements in a key-value pair RDD are partitioned by key.
 * Maps each key to a partition ID, from 0 to `numPartitions - 1`.
 */
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

object Partitioner {
  /**
   * Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
   *
   * If any of the RDDs already has a partitioner, choose that one.
   *
   * Otherwise, we use a default HashPartitioner. For the number of partitions, if
   * spark.default.parallelism is set, then we'll use the value from SparkContext
   * defaultParallelism, otherwise we'll use the max number of upstream partitions.
   *
   * Unless spark.default.parallelism is set, the number of partitions will be the
   * same as the number of partitions in the largest upstream RDD, as this should
   * be least likely to cause out-of-memory errors.
   *
   * We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
   */
  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
    for (r <- bySize if r.partitioner.isDefined && r.partitioner.get.numPartitions > 0) {
      return r.partitioner.get
    }
    if (rdd.context.conf.contains("spark.default.parallelism")) {
      new HashPartitioner(rdd.context.defaultParallelism)
    } else {
      new HashPartitioner(bySize.head.partitions.size)
    }
  }
}

/**
 * A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using
 * Java's `Object.hashCode`.
 *
 * Java arrays have hashCodes that are based on the arrays' identities rather than their contents,
 * so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will
 * produce an unexpected or incorrect result.
 */
class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }

  override def hashCode: Int = numPartitions
}

/**
 * A [[org.apache.spark.Partitioner]] that partitions sortable records by range into roughly
 * equal ranges. The ranges are determined by sampling the content of the RDD passed in.
 *
 * Note that the actual number of partitions created by the RangePartitioner might not be the same
 * as the `partitions` parameter, in the case where the number of sampled records is less than
 * the value of `partitions`.
 */
class RangePartitioner[K : Ordering : ClassTag, V](
    @transient partitions: Int,
    @transient rdd: RDD[_ <: Product2[K, V]],
    private var ascending: Boolean = true)
  extends Partitioner {

  // We allow partitions = 0, which happens when sorting an empty RDD under the default settings.
  require(partitions >= 0, s"Number of partitions cannot be negative but found $partitions.")

  private var ordering = implicitly[Ordering[K]]

  // An array of upper bounds for the first (partitions - 1) partitions
  private var rangeBounds: Array[K] = {
    if (partitions <= 1) {
      Array.empty
    } else {
      // This is the sample size we need to have roughly balanced output partitions, capped at 1M.
      val sampleSize = math.min(20.0 * partitions, 1e6)
      // Assume the input partitions are roughly balanced and over-sample a little bit.
      val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.size).toInt
      val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
      if (numItems == 0L) {
        Array.empty
      } else {
        // If a partition contains much more than the average number of items, we re-sample from it
        // to ensure that enough items are collected from that partition.
        val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
        val candidates = ArrayBuffer.empty[(K, Float)]
        val imbalancedPartitions = mutable.Set.empty[Int]
        sketched.foreach { case (idx, n, sample) =>
          if (fraction * n > sampleSizePerPartition) {
            imbalancedPartitions += idx
          } else {
            // The weight is 1 over the sampling probability.
            val weight = (n.toDouble / sample.size).toFloat
            for (key <- sample) {
              candidates += ((key, weight))
            }
          }
        }
        if (imbalancedPartitions.nonEmpty) {
          // Re-sample imbalanced partitions with the desired sampling probability.
          val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
          val seed = byteswap32(-rdd.id - 1)
          val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
          val weight = (1.0 / fraction).toFloat
          candidates ++= reSampled.map(x => (x, weight))
        }
        RangePartitioner.determineBounds(candidates, partitions)
      }
    }
  }

  def numPartitions: Int = rangeBounds.length + 1

  private var binarySearch: ((Array[K], K) => Int) = CollectionsUtils.makeBinarySearch[K]

  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K]
    var partition = 0
    if (rangeBounds.length <= 128) {
      // If we have less than 128 partitions naive search
      while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
        partition += 1
      }
    } else {
      // Determine which binary search method to use only once.
      partition = binarySearch(rangeBounds, k)
      // binarySearch either returns the match location or -[insertion point]-1
      if (partition < 0) {
        partition = -partition-1
      }
      if (partition > rangeBounds.length) {
        partition = rangeBounds.length
      }
    }
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }

  override def equals(other: Any): Boolean = other match {
    case r: RangePartitioner[_, _] =>
      r.rangeBounds.sameElements(rangeBounds) && r.ascending == ascending
    case _ =>
      false
  }

  override def hashCode(): Int = {
    val prime = 31
    var result = 1
    var i = 0
    while (i < rangeBounds.length) {
      result = prime * result + rangeBounds(i).hashCode
      i += 1
    }
    result = prime * result + ascending.hashCode
    result
  }

  @throws(classOf[IOException])
  private def writeObject(out: ObjectOutputStream): Unit = Utils.tryOrIOException {
    val sfactory = SparkEnv.get.serializer
    sfactory match {
      case js: JavaSerializer => out.defaultWriteObject()
      case _ =>
        out.writeBoolean(ascending)
        out.writeObject(ordering)
        out.writeObject(binarySearch)

        val ser = sfactory.newInstance()
        Utils.serializeViaNestedStream(out, ser) { stream =>
          stream.writeObject(scala.reflect.classTag[Array[K]])
          stream.writeObject(rangeBounds)
        }
    }
  }

  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = Utils.tryOrIOException {
    val sfactory = SparkEnv.get.serializer
    sfactory match {
      case js: JavaSerializer => in.defaultReadObject()
      case _ =>
        ascending = in.readBoolean()
        ordering = in.readObject().asInstanceOf[Ordering[K]]
        binarySearch = in.readObject().asInstanceOf[(Array[K], K) => Int]

        val ser = sfactory.newInstance()
        Utils.deserializeViaNestedStream(in, ser) { ds =>
          implicit val classTag = ds.readObject[ClassTag[Array[K]]]()
          rangeBounds = ds.readObject[Array[K]]()
        }
    }
  }
}

private[spark] object RangePartitioner {

  /**
   * Sketches the input RDD via reservoir sampling on each partition.
   *
   * @param rdd the input RDD to sketch
   * @param sampleSizePerPartition max sample size per partition
   * @return (total number of items, an array of (partitionId, number of items, sample))
   */
  def sketch[K : ClassTag](
      rdd: RDD[K],
      sampleSizePerPartition: Int): (Long, Array[(Int, Int, Array[K])]) = {
    val shift = rdd.id
    // val classTagK = classTag[K] // to avoid serializing the entire partitioner object
    val sketched = rdd.mapPartitionsWithIndex { (idx, iter) =>
      val seed = byteswap32(idx ^ (shift << 16))
      val (sample, n) = SamplingUtils.reservoirSampleAndCount(iter, sampleSizePerPartition, seed)
      Iterator((idx, n, sample))
    }.collect()
    val numItems = sketched.map(_._2.toLong).sum
    (numItems, sketched)
  }

  /**
   * Determines the bounds for range partitioning from candidates with weights indicating how many
   * items each represents. Usually this is 1 over the probability used to sample this candidate.
   *
   * @param candidates unordered candidates with weights
   * @param partitions number of partitions
   * @return selected bounds
   */
  def determineBounds[K : Ordering : ClassTag](
      candidates: ArrayBuffer[(K, Float)],
      partitions: Int): Array[K] = {
    val ordering = implicitly[Ordering[K]]
    val ordered = candidates.sortBy(_._1)
    val numCandidates = ordered.size
    val sumWeights = ordered.map(_._2.toDouble).sum
    val step = sumWeights / partitions
    var cumWeight = 0.0
    var target = step
    val bounds = ArrayBuffer.empty[K]
    var i = 0
    var j = 0
    var previousBound = Option.empty[K]
    while ((i < numCandidates) && (j < partitions - 1)) {
      val (key, weight) = ordered(i)
      cumWeight += weight
      if (cumWeight > target) {
        // Skip duplicate values.
        if (previousBound.isEmpty || ordering.gt(key, previousBound.get)) {
          bounds += key
          target += step
          j += 1
          previousBound = Some(key)
        }
      }
      i += 1
    }
    bounds.toArray
  }
}
The sketch source:
/**
 * Sketches the input RDD via reservoir sampling on each partition.
 *
 * @param rdd the input RDD to sketch
 * @param sampleSizePerPartition max sample size per partition
 * @return (total number of items, an array of (partitionId, number of items, sample))
 */
def sketch[K : ClassTag](
    rdd: RDD[K],
    sampleSizePerPartition: Int): (Long, Array[(Int, Int, Array[K])]) = {
  val shift = rdd.id
  // val classTagK = classTag[K] // to avoid serializing the entire partitioner object
  val sketched = rdd.mapPartitionsWithIndex { (idx, iter) =>
    val seed = byteswap32(idx ^ (shift << 16))
    val (sample, n) = SamplingUtils.reservoirSampleAndCount(iter, sampleSizePerPartition, seed)
    Iterator((idx, n, sample))
  }.collect()
  val numItems = sketched.map(_._2.toLong).sum
  (numItems, sketched)
}
The reservoirSampleAndCount source:
/**
 * Reservoir sampling implementation that also returns the input size.
 *
 * @param input input size
 * @param k reservoir size
 * @param seed random seed
 * @return (samples, input size)
 */
def reservoirSampleAndCount[T: ClassTag](
    input: Iterator[T],
    k: Int,
    seed: Long = Random.nextLong()): (Array[T], Int) = {
  val reservoir = new Array[T](k)
  // Put the first k elements in the reservoir.
  var i = 0
  while (i < k && input.hasNext) {
    val item = input.next()
    reservoir(i) = item
    i += 1
  }

  // If we have consumed all the elements, return them. Otherwise do the replacement.
  if (i < k) {
    // If input size < k, trim the array to return only an array of input size.
    val trimReservoir = new Array[T](i)
    System.arraycopy(reservoir, 0, trimReservoir, 0, i)
    (trimReservoir, i)
  } else {
    // If input size > k, continue the sampling process.
    val rand = new XORShiftRandom(seed)
    while (input.hasNext) {
      val item = input.next()
      val replacementIndex = rand.nextInt(i)
      if (replacementIndex < k) {
        reservoir(replacementIndex) = item
      }
      i += 1
    }
    (reservoir, i)
  }
}
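To see reservoir sampling on its own, here is a minimal standalone sketch (my own simplified version: it uses scala.util.Random instead of Spark's XORShiftRandom, and replaces the (i+1)-th element with the textbook probability k/(i+1)):

import scala.util.Random
import scala.reflect.ClassTag

// Keep a uniform random sample of size k from an iterator of unknown length,
// and also report how many elements were seen in total.
def sampleAndCount[T: ClassTag](input: Iterator[T], k: Int, seed: Long = Random.nextLong()): (Array[T], Int) = {
  val reservoir = new Array[T](k)
  var i = 0
  while (i < k && input.hasNext) { // fill the reservoir with the first k elements
    reservoir(i) = input.next()
    i += 1
  }
  if (i < k) {
    (reservoir.take(i), i)         // fewer than k elements: return exactly what was seen
  } else {
    val rand = new Random(seed)
    while (input.hasNext) {
      val item = input.next()
      val j = rand.nextInt(i + 1)  // element number i+1 survives with probability k/(i+1)
      if (j < k) reservoir(j) = item
      i += 1
    }
    (reservoir, i)
  }
}

// Example: sample 5 numbers out of 1..1000 without knowing the total size up front.
println(sampleAndCount((1 to 1000).iterator, 5)._1.mkString(", "))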
The getPartition source:
def getPartition(key: Any): Int = {
  val k = key.asInstanceOf[K]
  var partition = 0
  if (rangeBounds.length <= 128) {
    // If we have less than 128 partitions naive search
    while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
      partition += 1
    }
  } else {
    // Determine which binary search method to use only once.
    partition = binarySearch(rangeBounds, k)
    // binarySearch either returns the match location or -[insertion point]-1
    if (partition < 0) {
      partition = -partition-1
    }
    if (partition > rangeBounds.length) {
      partition = rangeBounds.length
    }
  }
  if (ascending) {
    partition
  } else {
    rangeBounds.length - partition
  }
}
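To make the binary-search step concrete, here is a minimal standalone sketch (my own code, not Spark's CollectionsUtils.makeBinarySearch) that mirrors how getPartition maps a key to a partition via the rangeBounds array:

// Return the partition index for `key` given ascending upper bounds, one per partition boundary.
// Mirrors the "insertion point" convention: when the key is not found, the answer is the index
// where it would be inserted, i.e. the number of bounds strictly smaller than the key.
def partitionFor(rangeBounds: Array[Int], key: Int): Int = {
  var lo = 0
  var hi = rangeBounds.length - 1
  while (lo <= hi) {
    val mid = (lo + hi) >>> 1
    if (rangeBounds(mid) == key) return mid
    else if (rangeBounds(mid) < key) lo = mid + 1
    else hi = mid - 1
  }
  lo // insertion point
}

// Bounds 3 and 6 split the keys into three ranges: (-inf, 3], (3, 6], (6, +inf).
val bounds = Array(3, 6)
println(Seq(1, 3, 4, 6, 9).map(k => k -> partitionFor(bounds, k)))
// List((1,0), (3,0), (4,1), (6,1), (9,2))

Note that Spark only switches to binary search when there are more than 128 range bounds; for small bound arrays the naive linear scan in getPartition is cheap enough.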
Binary search determines which partition a given key belongs to; once the range bounds are in place, the RangePartitioner can be used to shuffle the data into globally ordered partitions.
For more, see http://www.it610.com/article/5193051.htm.
That concludes this article on demystifying Top N; hopefully it is of some help to fellow programmers.