(Repost) The difference between groupByKey and reduceByKey

2024-05-24 10:32

This article introduces the difference between groupByKey and reduceByKey (repost). We hope it offers some useful reference for developers working through this problem.

[Reposted from: https://blog.csdn.net/ZMC921/article/details/75098903]

 

Copyright notice: this is an original article by the blogger, licensed under the CC 4.0 BY-SA agreement. Please include the original source link and this notice when reposting.

Original article link: https://blog.csdn.net/ZMC921/article/details/75098903

1. First of all, both operations go through a shuffle. groupByKey does not combine records before the shuffle; the data is shuffled as-is. reduceByKey combines records before the shuffle, which reduces the shuffle's I/O, so it is somewhat more efficient.

Example:

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.{SparkConf, SparkContext}

    object GroupyKeyAndReduceByKeyDemo {
      def main(args: Array[String]): Unit = {
        Logger.getLogger("org").setLevel(Level.WARN)
        val config = new SparkConf().setAppName("GroupyKeyAndReduceByKeyDemo").setMaster("local")
        val sc = new SparkContext(config)
        val arr = Array("val config", "val arr")
        val socketDS = sc.parallelize(arr).flatMap(_.split(" ")).map((_, 1))
        // Difference between groupByKey and reduceByKey:
        // Both go through a shuffle. groupByKey does not combine before the shuffle and shuffles the data as-is;
        // reduceByKey combines before the shuffle, which reduces shuffle I/O, so it is a bit faster.
        socketDS.groupByKey().map(tuple => (tuple._1, tuple._2.sum)).foreach(x => {
          println(x._1 + " " + x._2)
        })
        println("----------------------")
        socketDS.reduceByKey(_ + _).foreach(x => {
          println(x._1 + " " + x._2)
        })
        sc.stop()
      }
    }
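
For reference, with the two-element input array above both blocks should print the same per-word counts, roughly as shown below (the line ordering is not guaranteed and may vary between runs):

    val 2
    config 1
    arr 1
    ----------------------
    val 2
    config 1
    arr 1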

2. First, groupByKey has three variants

Looking at the source, groupByKey() is implemented as groupByKey(defaultPartitioner(self)):

    /**
     * Group the values for each key in the RDD into a single sequence. Hash-partitions the
     * resulting RDD with the existing partitioner/parallelism level. The ordering of elements
     * within each group is not guaranteed, and may even differ each time the resulting RDD is
     * evaluated.
     *
     * @note This operation may be very expensive. If you are grouping in order to perform an
     * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
     * or `PairRDDFunctions.reduceByKey` will provide much better performance.
     */
    def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
      groupByKey(defaultPartitioner(self))
    }

Looking at the source, groupByKey(numPartitions: Int) is implemented as groupByKey(new HashPartitioner(numPartitions)):

    /**
     * Group the values for each key in the RDD into a single sequence. Hash-partitions the
     * resulting RDD with into `numPartitions` partitions. The ordering of elements within
     * each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
     *
     * @note This operation may be very expensive. If you are grouping in order to perform an
     * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
     * or `PairRDDFunctions.reduceByKey` will provide much better performance.
     *
     * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
     * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
     */
    def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
      groupByKey(new HashPartitioner(numPartitions))
    }

In fact, both of the above are implemented in terms of groupByKey(partitioner: Partitioner):

    /**
     * Group the values for each key in the RDD into a single sequence. Allows controlling the
     * partitioning of the resulting key-value pair RDD by passing a Partitioner.
     * The ordering of elements within each group is not guaranteed, and may even differ
     * each time the resulting RDD is evaluated.
     *
     * @note This operation may be very expensive. If you are grouping in order to perform an
     * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
     * or `PairRDDFunctions.reduceByKey` will provide much better performance.
     *
     * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
     * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
     */
    def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
      // groupByKey shouldn't use map side combine because map side combine does not
      // reduce the amount of data shuffled and requires all map side data be inserted
      // into a hash table, leading to more objects in the old gen.
      val createCombiner = (v: V) => CompactBuffer(v)
      val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
      val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
      val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
        createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
      bufs.asInstanceOf[RDD[(K, Iterable[V])]]
    }

And groupByKey(partitioner: Partitioner) is itself implemented with combineByKeyWithClassTag, so all groupByKey variants ultimately come down to combineByKeyWithClassTag; they differ only in how the partitioner is chosen.
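
To make the relationship concrete, here is a minimal sketch (not from the original post) that reproduces groupByKey-style grouping through the public combineByKey overload that exposes mapSideCombine, using a plain ArrayBuffer in place of Spark's internal CompactBuffer. The data and variable names are illustrative, and `sc` is assumed to be an existing SparkContext:

    import org.apache.spark.HashPartitioner
    import scala.collection.mutable.ArrayBuffer

    // Hypothetical input: word pairs like those in the earlier demo.
    val pairs = sc.parallelize(Seq(("val", 1), ("config", 1), ("val", 1), ("arr", 1)))

    // Group values per key without map-side combining, mirroring what
    // groupByKey(partitioner) does internally via combineByKeyWithClassTag.
    val grouped = pairs.combineByKey(
      (v: Int) => ArrayBuffer(v),                                 // createCombiner: start a buffer per key
      (buf: ArrayBuffer[Int], v: Int) => buf += v,                // mergeValue: append within a partition
      (b1: ArrayBuffer[Int], b2: ArrayBuffer[Int]) => b1 ++= b2,  // mergeCombiners: merge across partitions
      new HashPartitioner(2),
      mapSideCombine = false)                                     // the same choice groupByKey makes

    grouped.collect().foreach { case (k, vs) => println(s"$k -> ${vs.mkString(",")}") }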

 

3. Looking further, reduceByKey also comes in three variants

 
    /**
     * Merge the values for each key using an associative and commutative reduce function. This will
     * also perform the merging locally on each mapper before sending results to a reducer, similarly
     * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
     * parallelism level.
     */
    def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
      reduceByKey(defaultPartitioner(self), func)
    }

    /**
     * Merge the values for each key using an associative and commutative reduce function. This will
     * also perform the merging locally on each mapper before sending results to a reducer, similarly
     * to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
     */
    def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
      reduceByKey(new HashPartitioner(numPartitions), func)
    }
 
    /**
     * Merge the values for each key using an associative and commutative reduce function. This will
     * also perform the merging locally on each mapper before sending results to a reducer, similarly
     * to a "combiner" in MapReduce.
     */
    def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
      combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
    }

Looking at these three reduceByKey variants, it is easy to see that the first two are implemented in terms of the last one, and the last one is in turn implemented with combineByKeyWithClassTag.
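
As a rough illustration (not from the original post), the same reduction can also be written directly against the public combineByKey API, which makes it visible how reduceByKey's single reduce function plays all three roles; `pairs` is assumed to be an RDD[(String, Int)] such as the one defined in the sketch above:

    // reduceByKey(_ + _) is essentially this call, with the identity createCombiner
    // and the reduce function reused for both merge steps (map-side combine stays on by default).
    val summed = pairs.combineByKey(
      (v: Int) => v,                 // createCombiner: the first value becomes the combiner
      (acc: Int, v: Int) => acc + v, // mergeValue: combine within a partition (map side)
      (a: Int, b: Int) => a + b)     // mergeCombiners: combine across partitions (reduce side)

    // Equivalent to: pairs.reduceByKey(_ + _)
    summed.collect().foreach(println)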

 

### How groupByKey is implemented

    combineByKeyWithClassTag[CompactBuffer[V]](createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)

### How reduceByKey is implemented

    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)

Comparing the two, groupByKey passes mapSideCombine = false, meaning no combining happens on the map side, i.e., nothing is merged before the shuffle. reduceByKey, on the other hand, does not set this parameter at all.

Does that mean reduceByKey combines by default?

 

4. Next, let's take a careful look at combineByKeyWithClassTag

    /**
     * :: Experimental ::
     * Generic function to combine the elements for each key using a custom set of aggregation
     * functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
     *
     * Users provide three functions:
     *
     *  - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
     *  - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
     *  - `mergeCombiners`, to combine two C's into a single one.
     *
     * In addition, users can control the partitioning of the output RDD, and whether to perform
     * map-side aggregation (if a mapper can produce multiple items with the same key).
     *
     * @note V and C can be different -- for example, one might group an RDD of type
     * (Int, Int) into an RDD of type (Int, Seq[Int]).
     */
    @Experimental
    def combineByKeyWithClassTag[C](
        createCombiner: V => C,
        mergeValue: (C, V) => C,
        mergeCombiners: (C, C) => C,
        partitioner: Partitioner,
        mapSideCombine: Boolean = true,
        serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
      require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
      if (keyClass.isArray) {
        if (mapSideCombine) {
          throw new SparkException("Cannot use map-side combining with array keys.")
        }
        if (partitioner.isInstanceOf[HashPartitioner]) {
          throw new SparkException("HashPartitioner cannot partition array keys.")
        }
      }
      val aggregator = new Aggregator[K, V, C](
        self.context.clean(createCombiner),
        self.context.clean(mergeValue),
        self.context.clean(mergeCombiners))
      if (self.partitioner == Some(partitioner)) {
        self.mapPartitions(iter => {
          val context = TaskContext.get()
          new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
        }, preservesPartitioning = true)
      } else {
        new ShuffledRDD[K, V, C](self, partitioner)
          .setSerializer(serializer)
          .setAggregator(aggregator)
          .setMapSideCombine(mapSideCombine)
      }
    }
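
The docstring above notes that V and C may differ. As a hedged sketch (not from the original post), the public combineByKey wrapper can, for example, collect the distinct values per key into a Set; because mapSideCombine defaults to true, the partial sets are already merged on the map side before the shuffle. The data and names below are illustrative, and `sc` is an assumed SparkContext:

    // Turn an RDD[(String, Int)] into an RDD[(String, Set[Int])]: here V = Int but C = Set[Int].
    val events = sc.parallelize(Seq(("user1", 3), ("user1", 3), ("user2", 5), ("user1", 7)))

    val distinctPerKey = events.combineByKey(
      (v: Int) => Set(v),                        // createCombiner: single-element set
      (s: Set[Int], v: Int) => s + v,            // mergeValue: add a value on the map side
      (s1: Set[Int], s2: Set[Int]) => s1 ++ s2)  // mergeCombiners: union partial sets after the shuffle

    distinctPerKey.collect().foreach(println)    // e.g. (user1,Set(3, 7)), (user2,Set(5))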

Looking at combineByKeyWithClassTag, we find that mapSideCombine defaults to true, so reduceByKey does combine on the map side, i.e., before the shuffle. Since part of the data has already been merged, less has to be spilled and transferred during the shuffle, which reduces disk and network I/O; that is why reduceByKey is faster.
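
In the same spirit as the @note in the docstrings above (which recommends aggregateByKey or reduceByKey over groupByKey for aggregations), here is a hedged sketch of computing a per-key average both ways; the combining variant ships only a (sum, count) pair per key per map task instead of every value. The data is illustrative and `sc` is again an assumed SparkContext:

    val scores = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0), ("a", 5.0)))

    // groupByKey: every value crosses the shuffle, then the average is computed on the reduce side.
    val avgViaGroup = scores.groupByKey().mapValues(vs => vs.sum / vs.size)

    // aggregateByKey: (sum, count) is accumulated on the map side, so far less data is shuffled.
    val avgViaAggregate = scores
      .aggregateByKey((0.0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),       // seqOp: fold a value into (sum, count)
        (a, b) => (a._1 + b._1, a._2 + b._2))       // combOp: merge partial (sum, count) pairs
      .mapValues { case (sum, count) => sum / count }

    avgViaGroup.collect().foreach(println)      // e.g. (a,3.0), (b,2.0)
    avgViaAggregate.collect().foreach(println)  // same results, cheaper shuffle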

This concludes the article on the difference between groupByKey and reduceByKey (repost); we hope the articles we recommend are helpful to fellow programmers!



http://www.chinasem.cn/article/998154
