Spark-push-based shuffle

2024-09-05 05:36
Tags: push based spark shuffle

This article introduces Spark's push-based shuffle. Hopefully it provides a useful reference for developers who need it.

一、Context

In 《Spark-Task启动流程》 (the Spark Task launch flow) we saw that when a Task is a ShuffleMapTask, after the ShuffleWriter finishes writing to disk it also checks whether the push-based shuffle mechanism can be enabled. Below we continue and look at what push-based shuffle does behind the scenes.

二、Conditions for enabling push-based shuffle

1、spark.shuffle.push.enabled is set to true (default: false). Setting it to true enables push-based shuffle on the client side. It works together with the server-side flag spark.shuffle.push.server.mergedShuffleFileManagerImpl, which must be set to the org.apache.spark.network.shuffle MergedShuffleFileManager implementation that enables push-based shuffle (a configuration sketch putting these settings together follows this list)

2、The application is submitted to run in YARN mode

3、The external shuffle service is enabled

4、IO encryption is disabled

5、The serializer (e.g. KryoSerializer) supports relocation of serialized objects

6、The RDD must not be a barrier RDD

Note: calling barrier() returns an RDDBarrier and also marks the stage containing that RDD as a barrier stage. In a barrier stage Spark must launch all tasks at the same time; if a task fails, Spark aborts the whole stage and relaunches all of its tasks instead of retrying only the failed task.
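Putting the conditions above together, here is a minimal client-side configuration sketch. It assumes a YARN cluster whose NodeManagers already run the Spark external shuffle service with merging enabled on the server side (per the Spark docs, spark.shuffle.push.server.mergedShuffleFileManagerImpl set to org.apache.spark.network.shuffle.RemoteBlockPushResolver); the app name and values are illustrative, not tuning advice.

import org.apache.spark.{SparkConf, SparkContext}

// Client-side settings mirroring conditions 1-5 above (a sketch, not a definitive setup)
object PushShuffleConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("push-based-shuffle-demo")
      .set("spark.master", "yarn")                     // condition 2: run on YARN
      .set("spark.shuffle.service.enabled", "true")    // condition 3: external shuffle service
      .set("spark.shuffle.push.enabled", "true")       // condition 1: client-side switch
      .set("spark.io.encryption.enabled", "false")     // condition 4: IO encryption disabled
      .set("spark.serializer",
        "org.apache.spark.serializer.KryoSerializer")  // condition 5: relocatable serialization
    val sc = new SparkContext(conf)
    // condition 6: do not call rdd.barrier() on the shuffle's parent RDD
    sc.stop()
  }
}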

class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](...) {

  private def canShuffleMergeBeEnabled(): Boolean = {
    val isPushShuffleEnabled = Utils.isPushBasedShuffleEnabled(
      rdd.sparkContext.getConf,
      // invoked at driver
      isDriver = true)
    if (isPushShuffleEnabled && rdd.isBarrier()) {
      logWarning("Push-based shuffle is currently not supported for barrier stages")
    }
    isPushShuffleEnabled &&
      // TODO: SPARK-35547: Push based shuffle is currently unsupported for Barrier stages
      !rdd.isBarrier()
  }
}

三、Pushing the ShuffleWriter output

1、ShuffleWriteProcessor

  def write(
      rdd: RDD[_],
      dep: ShuffleDependency[_, _, _],
      mapId: Long,
      context: TaskContext,
      partition: Partition): MapStatus = {
    var writer: ShuffleWriter[Any, Any] = null
    try {
      // Get the ShuffleManager from SparkEnv
      val manager = SparkEnv.get.shuffleManager
      // Get a ShuffleWriter from the ShuffleManager
      writer = manager.getWriter[Any, Any](
        dep.shuffleHandle, mapId, context, createMetricsReporter(context))
      // Write the output of this stage to disk with the ShuffleWriter
      writer.write(
        rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
      val mapStatus = writer.stop(success = true)
      if (mapStatus.isDefined) {
        if (dep.shuffleMergeEnabled && dep.getMergerLocs.nonEmpty && !dep.shuffleMergeFinalized) {
          manager.shuffleBlockResolver match {
            case resolver: IndexShuffleBlockResolver =>
              // Locate the shuffle data file produced by this stage
              val dataFile = resolver.getDataFile(dep.shuffleId, mapId)
              // Push it
              new ShuffleBlockPusher(SparkEnv.get.conf)
                .initiateBlockPush(dataFile, writer.getPartitionLengths(), dep, partition.index)
            case _ =>
          }
        }
      }
      mapStatus.get
    } catch {
      ......
    }
  }

2、ShuffleBlockPusher

Used to push shuffle blocks to the remote shuffle service when push-based shuffle is enabled. It is created after the ShuffleWriter has finished writing its result file.

private[spark] class ShuffleBlockPusher(conf: SparkConf) extends Logging {

  // spark.shuffle.push.maxBlockSizeToPush, default 1m
  // Maximum size of an individual block to push to the remote external shuffle service. Blocks
  // larger than this threshold are not pushed for remote merging; they will be fetched by the
  // executors in the original (pull-based) way.
  private[this] val maxBlockSizeToPush = conf.get(SHUFFLE_MAX_BLOCK_SIZE_TO_PUSH)

  // spark.shuffle.push.maxBlockBatchSize, default 3m
  // Maximum size of a batch of shuffle blocks grouped into a single push request.
  // The default is 3m because it is larger than 2m (the default of TransportConf#memoryMapBytes).
  // If this also defaulted to 2m, every batch would very likely be memory-mapped, which has
  // higher overhead for small, MB-sized blocks.
  private[this] val maxBlockBatchSize = conf.get(SHUFFLE_MAX_BLOCK_BATCH_SIZE_FOR_PUSH)

  // Maximum size of map outputs fetched simultaneously by each reduce task. Since every output
  // requires a buffer to receive it, this represents a fixed memory overhead per reduce task,
  // so keep it small unless you have plenty of memory.
  private[this] val maxBytesInFlight =
    conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024

  // Limits the number of remote requests fetching blocks at any given point. As the number of
  // hosts in the cluster grows, a large number of inbound connections to one or more nodes can
  // cause the workers to fail under load; limiting the number of fetch requests mitigates this.
  // Default is Int.MaxValue, i.e. 2147483647
  private[this] val maxReqsInFlight = conf.getInt("spark.reducer.maxReqsInFlight", Int.MaxValue)

  // spark.reducer.maxBlocksInFlightPerAddress, default Int.MaxValue, i.e. 2147483647
  // Limits the number of remote blocks fetched per reduce task from a given host:port. Requesting
  // a huge number of blocks from one address at once (or simultaneously) can crash the executor
  // or the Node Manager. This is especially useful for reducing Node Manager load when the
  // external shuffle service is enabled; lowering the value mitigates the problem.
  private[this] val maxBlocksInFlightPerAddress = conf.get(REDUCER_MAX_BLOCKS_IN_FLIGHT_PER_ADDRESS)

  private[this] var bytesInFlight = 0L
  private[this] var reqsInFlight = 0
  private[this] val numBlocksInFlightPerAddress = new HashMap[BlockManagerId, Int]()
  private[this] val deferredPushRequests = new HashMap[BlockManagerId, Queue[PushRequest]]()
  // Queue of push requests
  private[this] val pushRequests = new Queue[PushRequest]
  private[this] val errorHandler = createErrorHandler()
  // VisibleForTesting
  private[shuffle] val unreachableBlockMgrs = new HashSet[BlockManagerId]()

  //......

  // Initiate the block push
  private[shuffle] def initiateBlockPush(
      dataFile: File,                   // shuffle data file generated on the map side
      partitionLengths: Array[Long],    // array of block sizes, so individual blocks can be located
      dep: ShuffleDependency[_, _, _],  // used to get the shuffle ID and the remote shuffle service locations to push local blocks to
      mapIndex: Int): Unit = {          // index of the shuffle map task
    val numPartitions = dep.partitioner.numPartitions
    val transportConf = SparkTransportConf.fromSparkConf(conf, "shuffle")
    val requests = prepareBlockPushRequests(numPartitions, mapIndex, dep.shuffleId,
      dep.shuffleMergeId, dataFile, partitionLengths, dep.getMergerLocs, transportConf)
    // Randomize the order of the PushRequests.
    // If the map side is sorted, the rows for a given key land in roughly the same range of every
    // map output, so at the same moment all maps would push to the same downstream partition or
    // node. Randomizing the order spreads the pushes out and parallelizes them better.
    pushRequests ++= Utils.randomize(requests)
    submitTask(() => {
      tryPushUpToMax()
    })
  }

  private[shuffle] def tryPushUpToMax(): Unit = {
    try {
      pushUpToMax()
    } catch {
      ......
    }
  }

  // Since multiple block-push threads may call pushUpToMax for the same mapper, access to this
  // method is synchronized so that only one thread pushes blocks for a given mapper at a time.
  // This simplifies access to the shared state. The downside is that if all threads are occupied
  // by pushes from the same mapper, pushes for other mappers may be blocked unnecessarily.
  private def pushUpToMax(): Unit = synchronized {
    // Process any outstanding deferred push requests if possible
    if (deferredPushRequests.nonEmpty) {
      for ((remoteAddress, defReqQueue) <- deferredPushRequests) {
        while (isRemoteBlockPushable(defReqQueue) &&
            !isRemoteAddressMaxedOut(remoteAddress, defReqQueue.front)) {
          val request = defReqQueue.dequeue()
          logDebug(s"Processing deferred push request for $remoteAddress with "
            + s"${request.blocks.length} blocks")
          sendRequest(request)
          if (defReqQueue.isEmpty) {
            deferredPushRequests -= remoteAddress
          }
        }
      }
    }
    // Process any regular push requests if possible.
    while (isRemoteBlockPushable(pushRequests)) {
      // Take one request from the queue
      val request = pushRequests.dequeue()
      val remoteAddress = request.address
      // The reduce side also limits how many blocks it accepts at once from one address; if the
      // limit is reached, defer the request. Default is Int.MaxValue, i.e. 2147483647
      if (isRemoteAddressMaxedOut(remoteAddress, request)) {
        logDebug(s"Deferring push request for $remoteAddress with ${request.blocks.size} blocks")
        deferredPushRequests.getOrElseUpdate(remoteAddress, new Queue[PushRequest]()).enqueue(request)
      } else {
        sendRequest(request)
      }
    }

    def isRemoteBlockPushable(pushReqQueue: Queue[PushRequest]): Boolean = {
      pushReqQueue.nonEmpty &&
        (bytesInFlight == 0 ||
          (reqsInFlight + 1 <= maxReqsInFlight &&
            bytesInFlight + pushReqQueue.front.size <= maxBytesInFlight))
    }

    // Check whether sending a new push request would exceed the maximum number of blocks being
    // pushed to the given remote address.
    def isRemoteAddressMaxedOut(remoteAddress: BlockManagerId, request: PushRequest): Boolean = {
      (numBlocksInFlightPerAddress.getOrElse(remoteAddress, 0)
        + request.blocks.size) > maxBlocksInFlightPerAddress
    }
  }

  // Push blocks to the remote shuffle service. Once a block in the current batch finishes
  // transferring, the callback listener calls pushUpToMax again to trigger the next batch. This
  // decouples the map task from the block-push process: most of the pushing is driven by netty
  // client threads rather than by the task execution thread.
  private def sendRequest(request: PushRequest): Unit = {
    bytesInFlight += request.size
    reqsInFlight += 1
    numBlocksInFlightPerAddress(request.address) =
      numBlocksInFlightPerAddress.getOrElseUpdate(request.address, 0) + request.blocks.length

    val sizeMap = request.blocks.map { case (blockId, size) => (blockId.toString, size) }.toMap
    val address = request.address
    val blockIds = request.blocks.map(_._1.toString)
    val remainingBlocks = new HashSet[String]() ++= blockIds

    // Block push listener
    val blockPushListener = new BlockPushingListener {
      // Initiating a connection and pushing blocks to the remote shuffle service is always handled
      // by the block-push threads. We should not initiate connection creation inside the
      // blockPushListener callbacks invoked by the netty event loop, because:
      //   1. TransportClient.createConnection(...) blocks while establishing the connection, and
      //      blocking operations should be avoided inside an event loop;
      //   2. the actual connection creation is a task added to another event loop's task queue, and
      //      the two event loops could end up blocking each other.
      // Once blockPushListener is notified of a push success or failure, it simply delegates to the
      // block-push threads.
      def handleResult(result: PushResult): Unit = {
        submitTask(() => {
          if (updateStateAndCheckIfPushMore(
              sizeMap(result.blockId), address, remainingBlocks, result)) {
            // Push again
            tryPushUpToMax()
          }
        })
      }

      override def onBlockPushSuccess(blockId: String, data: ManagedBuffer): Unit = {
        logTrace(s"Push for block $blockId to $address successful.")
        handleResult(PushResult(blockId, null))
      }

      override def onBlockPushFailure(blockId: String, exception: Throwable): Unit = {
        // check the message or its cause to see whether it needs to be logged.
        if (!errorHandler.shouldLogError(exception)) {
          logTrace(s"Pushing block $blockId to $address failed.", exception)
        } else {
          logWarning(s"Pushing block $blockId to $address failed.", exception)
        }
        handleResult(PushResult(blockId, exception))
      }
    }

    // In addition to randomizing the order of the push requests, further randomize the order of the
    // blocks within a request to reduce the chance of push collisions on the server side. This does
    // not add cost to reading the unmerged shuffle file on the executor side, since we still read
    // MB-sized chunks and only randomize the in-memory sliced buffers after reading.
    // A request contains multiple blocks: the requests are randomized, the blocks are randomized,
    // then the request is sent.
    val (blockPushIds, blockPushBuffers) = Utils.randomize(blockIds.zip(
      sliceReqBufferIntoBlockBuffers(request.reqBuffer, request.blocks.map(_._2)))).unzip

    // Get the blockManager from SparkEnv; blockManager uses blockStoreClient for the transfer.
    // Below we look in detail at how it pushes the blocks to the reduce side.
    SparkEnv.get.blockManager.blockStoreClient.pushBlocks(
      address.host, address.port, blockPushIds.toArray,
      blockPushBuffers.toArray, blockPushListener)
  }

  // Trigger the push
  protected def submitTask(task: Runnable): Unit = {
    if (BLOCK_PUSHER_POOL != null) {
      BLOCK_PUSHER_POOL.execute(task)
    }
  }

  private val BLOCK_PUSHER_POOL: ExecutorService = {
    val conf = SparkEnv.get.conf
    if (Utils.isPushBasedShuffleEnabled(conf,
        isDriver = SparkContext.DRIVER_IDENTIFIER == SparkEnv.get.executorId)) {
      // spark.shuffle.push.numPushThreads, default: the number of cores given to the executor in spark-submit
      // Specifies the number of threads in the block-pusher pool. These threads help create
      // connections and push blocks to the remote external shuffle services. By default the pool
      // size equals the number of Spark executor cores.
      val numThreads = conf.get(SHUFFLE_NUM_PUSH_THREADS)
        .getOrElse(conf.getInt(SparkLauncher.EXECUTOR_CORES, 1))
      ThreadUtils.newDaemonFixedThreadPool(numThreads, "shuffle-block-push-thread")
    } else {
      null
    }
  }

  // Convert the shuffle data file of the current map task into a list of PushRequests.
  // Basically, contiguous blocks in the shuffle file are grouped into a single request to allow
  // more efficient reading of the block data. Every map task of a given shuffle receives the same
  // list of BlockManagerIds as the target locations to push blocks to. All map tasks of the same
  // shuffle map the shuffle partition ranges to the individual target locations in a consistent
  // way, so that each target location receives shuffle blocks belonging to the same set of
  // partition ranges. Zero-length blocks and blocks that are too large are skipped.
  private[shuffle] def prepareBlockPushRequests(
      numPartitions: Int,
      partitionId: Int,
      shuffleId: Int,
      shuffleMergeId: Int,
      dataFile: File,
      partitionLengths: Array[Long],
      mergerLocs: Seq[BlockManagerId],
      transportConf: TransportConf): Seq[PushRequest] = {
    var offset = 0L
    var currentReqSize = 0
    var currentReqOffset = 0L
    var currentMergerId = 0
    val numMergers = mergerLocs.length
    // Array of push requests
    val requests = new ArrayBuffer[PushRequest]
    var blocks = new ArrayBuffer[(BlockId, Int)]
    for (reduceId <- 0 until numPartitions) {
      val blockSize = partitionLengths(reduceId)
      logDebug(s"Block ${ShufflePushBlockId(shuffleId, shuffleMergeId, partitionId,
        reduceId)} is of size $blockSize")
      // Skip zero-length blocks and blocks that are too large
      if (blockSize > 0) {
        val mergerId = math.min(math.floor(reduceId * 1.0 / numPartitions * numMergers),
          numMergers - 1).asInstanceOf[Int]
        // Start a new PushRequest if the current request exceeds the maximum batch size,
        //   or the number of blocks in the current request exceeds the per-destination limit,
        //   or the next block goes to a different shuffle service,
        //   or the next block exceeds the maximum pushable block size.
        // This guarantees that every PushRequest represents contiguous blocks in the shuffle file
        // destined for the same shuffle service, without exceeding any of the existing limits.
        if (currentReqSize + blockSize <= maxBlockBatchSize
            && blocks.size < maxBlocksInFlightPerAddress
            && mergerId == currentMergerId && blockSize <= maxBlockSizeToPush) {
          // Add the current block to the current batch
          currentReqSize += blockSize.toInt
        } else {
          if (blocks.nonEmpty) {
            // Convert the previous batch into a PushRequest
            requests += PushRequest(mergerLocs(currentMergerId), blocks.toSeq,
              createRequestBuffer(transportConf, dataFile, currentReqOffset, currentReqSize))
            blocks = new ArrayBuffer[(BlockId, Int)]
          }
          // Start a new batch
          currentReqSize = 0
          // Set currentReqOffset to -1 so we can distinguish its initial value from the start of a new batch
          currentReqOffset = -1
          currentMergerId = mergerId
        }
        // Only push blocks whose size is within the limit.
        // If a partition's block exceeds 1M it cannot be pushed and must be fetched via the
        // traditional shuffle path instead. In practice blockSize should not be that large unless
        // the data is skewed, since it is the amount of data one map partition sends to one
        // downstream partition.
        if (blockSize <= maxBlockSizeToPush) {
          val blockSizeInt = blockSize.toInt
          blocks += ((ShufflePushBlockId(shuffleId, shuffleMergeId, partitionId,
            reduceId), blockSizeInt))
          // Only update currentReqOffset if the current block is the first block of the request
          if (currentReqOffset == -1) {
            currentReqOffset = offset
          }
          if (currentReqSize == 0) {
            currentReqSize += blockSizeInt
          }
        }
      }
      offset += blockSize
    }
    // Add the final request
    if (blocks.nonEmpty) {
      requests += PushRequest(mergerLocs(currentMergerId), blocks.toSeq,
        createRequestBuffer(transportConf, dataFile, currentReqOffset, currentReqSize))
    }
    requests.toSeq
  }
}
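To make the grouping logic in prepareBlockPushRequests more concrete, here is a small, self-contained sketch of the same arithmetic (not Spark code): it maps reduce partitions to merger locations with the same formula and groups contiguous blocks into batches. numPartitions, numMergers and the per-partition sizes are made-up example values; the two limits use the defaults quoted above (1m and 3m). It omits file offsets, buffers and the per-address block limit.

// Standalone illustration of the batching described above; compile and run with plain Scala.
object PreparePushRequestsSketch {
  def main(args: Array[String]): Unit = {
    val numPartitions = 8                       // reduce partitions produced by this map task
    val numMergers = 3                          // remote merger locations (dep.getMergerLocs)
    val maxBlockSizeToPush = 1L * 1024 * 1024   // spark.shuffle.push.maxBlockSizeToPush (default 1m)
    val maxBlockBatchSize = 3L * 1024 * 1024    // spark.shuffle.push.maxBlockBatchSize (default 3m)

    // Per-reduce-partition block sizes of this map output; partition 5 is "skewed" (> 1m)
    val partitionLengths = Array[Long](200000, 0, 800000, 900000, 150000, 2000000, 700000, 300000)

    // Same formula as prepareBlockPushRequests: reduce partitions are split into contiguous
    // ranges, one range per merger, so every map task agrees on the mapping.
    def mergerId(reduceId: Int): Int =
      math.min(math.floor(reduceId * 1.0 / numPartitions * numMergers), numMergers - 1).toInt

    var batch = List.empty[Int]   // reduceIds collected into the current batch
    var batchSize = 0L
    var currentMerger = 0

    def flush(): Unit = {
      if (batch.nonEmpty) {
        println(s"PushRequest -> merger $currentMerger, blocks ${batch.reverse}, bytes $batchSize")
      }
      batch = Nil
      batchSize = 0L
    }

    for (reduceId <- 0 until numPartitions) {
      val size = partitionLengths(reduceId)
      if (size > 0) {                           // zero-length blocks are skipped
        val m = mergerId(reduceId)
        // Start a new request when the batch would overflow, the target merger changes,
        // or the block is too large to push, mirroring the real code.
        if (batchSize + size > maxBlockBatchSize || m != currentMerger || size > maxBlockSizeToPush) {
          flush()
          currentMerger = m
        }
        if (size <= maxBlockSizeToPush) {       // oversized blocks are left for pull-based fetch
          batch ::= reduceId
          batchSize += size
        }
      }
    }
    flush()
    // Prints three PushRequests; the 2 MB block for reduce partition 5 is never pushed.
  }
}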

3、ExternalBlockStoreClient

A client for reading RDD blocks and shuffle blocks served by external servers (outside the executors).

It asynchronously pushes a sequence of shuffle blocks to a remote node in a best-effort manner. These shuffle blocks, together with blocks pushed by other clients, are merged into per-shuffle-partition merged shuffle files on the target node.

public class ExternalBlockStoreClient extends BlockStoreClient {

  public void pushBlocks(
      String host,
      int port,
      String[] blockIds,
      ManagedBuffer[] buffers,
      BlockPushingListener listener) {
    checkInit();
    // Abort if the number of block ids does not match the number of buffers
    assert blockIds.length == buffers.length : "Number of block ids and buffers do not match.";
    // Pair each block with its buffer and put them into a map
    Map<String, ManagedBuffer> buffersWithId = new HashMap<>();
    for (int i = 0; i < blockIds.length; i++) {
      buffersWithId.put(blockIds[i], buffers[i]);
    }
    // Log how many shuffle blocks are pushed to which node
    logger.debug("Push {} shuffle blocks to {}:{}", blockIds.length, host, port);
    try {
      RetryingBlockTransferor.BlockTransferStarter blockPushStarter =
          (inputBlockId, inputListener) -> {
            if (clientFactory != null) {
              assert inputListener instanceof BlockPushingListener :
                  "Expecting a BlockPushingListener, but got " + inputListener.getClass();
              // Create a TransportClient connected to the given remote host/port.
              // A pool of clients is maintained (size determined by
              // spark.shuffle.io.numConnectionsPerPeer) and one is picked at random; if no client
              // exists at the chosen slot, a new one is created and cached there. Since the push
              // sends many requests to multiple nodes, caching the TransportClient avoids
              // re-creating it the next time the same remote host is targeted.
              // fastFail defaults to false: when true, fail immediately if the last attempt to the
              // same address failed within the fast-fail window (95% of the IO retry wait timeout),
              // assuming the caller handles retries.
              // Before creating a new TransportClient, all TransportClientBootstraps registered
              // with this factory are executed; this blocks until the connection is established
              // and fully bootstrapped. TransportClient is essentially a netty client, with a
              // TransportChannelHandler installed in its pipeline.
              TransportClient client = clientFactory.createClient(host, port);
              // Build a OneForOneBlockPusher to do the push
              new OneForOneBlockPusher(client, appId, comparableAppAttemptId, inputBlockId,
                  (BlockPushingListener) inputListener, buffersWithId).start();
            } else {
              logger.info("This clientFactory was closed. Skipping further block push retries.");
            }
          };
      int maxRetries = transportConf.maxIORetries();
      if (maxRetries > 0) {
        new RetryingBlockTransferor(
            transportConf, blockPushStarter, blockIds, listener, PUSH_ERROR_HANDLER).start();
      } else {
        blockPushStarter.createAndStart(blockIds, listener);
      }
    } catch (Exception e) {
      logger.error("Exception while beginning pushBlocks", e);
      for (String blockId : blockIds) {
        listener.onBlockPushFailure(blockId, e);
      }
    }
  }
}
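Whether the push goes through RetryingBlockTransferor is decided by transportConf.maxIORetries(), which for the "shuffle" transport module maps to spark.shuffle.io.maxRetries. A minimal sketch of the IO knobs consulted by the push path above (the values are illustrative only, not recommendations):

import org.apache.spark.SparkConf

// Sketch: the keys referenced by the push path; requires spark-core on the classpath.
object PushIoKnobsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .set("spark.shuffle.io.maxRetries", "3")             // > 0 wraps the push in RetryingBlockTransferor
      .set("spark.shuffle.io.retryWait", "5s")             // wait between retries
      .set("spark.shuffle.io.numConnectionsPerPeer", "1")  // size of the cached TransportClient pool per peer
    println(conf.getAll.mkString("\n"))
  }
}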

4、OneForOneBlockPusher

Used to push blocks to a remote shuffle service to be merged. Its counterpart is OneForOneBlockFetcher, which fetches blocks from a remote shuffle service.

public class OneForOneBlockPusher {

  // Start the block-push process; the listener is invoked for every pushed block
  public void start() {
    logger.debug("Start pushing {} blocks", blockIds.length);
    for (int i = 0; i < blockIds.length; i++) {
      assert buffers.containsKey(blockIds[i]) : "Could not find the block buffer for block "
          + blockIds[i];
      String[] blockIdParts = blockIds[i].split("_");
      if (blockIdParts.length != 5 || !blockIdParts[0].equals(SHUFFLE_PUSH_BLOCK_PREFIX)) {
        throw new IllegalArgumentException(
            "Unexpected shuffle push block id format: " + blockIds[i]);
      }
      // Build the message header: appId, attempt id, block information, etc.
      ByteBuffer header =
          new PushBlockStream(appId, appAttemptId, Integer.parseInt(blockIdParts[1]),
              Integer.parseInt(blockIdParts[2]), Integer.parseInt(blockIdParts[3]),
              Integer.parseInt(blockIdParts[4]), i).toByteBuffer();
      // Use the netty client to start transferring the data
      client.uploadStream(new NioManagedBuffer(header), buffers.get(blockIds[i]),
          new BlockPushCallback(i, blockIds[i]));
    }
  }
}
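The parsing in start() shows that a push block id is a prefix plus four integers. Based on ShufflePushBlockId(shuffleId, shuffleMergeId, mapIndex, reduceId) used in prepareBlockPushRequests, the layout is presumably shufflePush_<shuffleId>_<shuffleMergeId>_<mapIndex>_<reduceId>. The sketch below just re-implements that parsing for illustration; PushBlockMeta and the sample id are made up, not Spark types.

// Illustrative only: a tiny re-implementation of the block-id check in OneForOneBlockPusher.start().
object PushBlockIdSketch {
  final case class PushBlockMeta(shuffleId: Int, shuffleMergeId: Int, mapIndex: Int, reduceId: Int)

  def parse(blockId: String): PushBlockMeta = {
    val parts = blockId.split("_")
    require(parts.length == 5 && parts(0) == "shufflePush",
      s"Unexpected shuffle push block id format: $blockId")
    PushBlockMeta(parts(1).toInt, parts(2).toInt, parts(3).toInt, parts(4).toInt)
  }

  def main(args: Array[String]): Unit = {
    // e.g. shuffle 3, merge id 0, map task 12, reduce partition 7
    println(parse("shufflePush_3_0_12_7"))   // prints PushBlockMeta(3,0,12,7)
  }
}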

5、TransportClient

A client for fetching consecutive chunks of a pre-negotiated stream. This API is intended for efficient transfer of large amounts of data, broken up into chunks ranging from hundreds of KB to a few MB.

Note that while this client handles fetching chunks from a stream (i.e. the data plane), the actual setup of the stream is done outside the scope of the transport layer. The convenience method "sendRPC" is provided for control-plane communication between client and server to perform this setup.

For example, a typical workflow might be:

client.sendRPC(new OpenFile("/foo"))  -->  returns StreamId = 100

client.fetchChunk(streamId = 100, chunkIndex = 0, callback)

client.fetchChunk(streamId = 100, chunkIndex = 1, callback)

......

client.sendRPC(new CloseStream(100))

Construct instances of TransportClient with TransportClientFactory. A single TransportClient may be used for multiple streams, but any given stream must be restricted to a single client to avoid out-of-order responses.

Note: this class is used to make requests to the server, while TransportResponseHandler is responsible for handling responses from the server.

Concurrency: thread safe; can be called from multiple threads.

public class TransportClient implements Closeable {

  // Send data to the remote end as a stream
  public long uploadStream(
      ManagedBuffer meta,
      ManagedBuffer data,
      RpcResponseCallback callback) {
    if (logger.isTraceEnabled()) {
      logger.trace("Sending RPC to {}", getRemoteAddress(channel));
    }
    long requestId = requestId();
    handler.addRpcRequest(requestId, callback);
    RpcChannelListener listener = new RpcChannelListener(requestId, callback);
    // UploadStream is an RPC whose data is sent outside of the frame, so it can be read as a stream.
    // Use netty to transfer the data
    channel.writeAndFlush(new UploadStream(requestId, meta, data)).addListener(listener);
    return requestId;
  }
}

四、Summary

1、The ShuffleWriter in the ShuffleMapTask finishes writing its output to disk

2、Check whether the current environment supports push-based shuffle (assume it does)

3、Locate the shuffle result file of this Task

4、Build and initialize a ShuffleBlockPusher (maximum single-block size, maximum size of a single push request, how much data the remote end can accept at once, and so on)

5、Group the data blocks by partition into PushRequests, enqueue them, and randomize their order (blocks from an oversized partition are not pushed; the next Stage fetches them the traditional way during its computation)

6、Prepare to push

7、Before pushing, check whether the remote end has reached its receiving limit, and randomize the blocks inside this PushRequest

8、Get the BlockManager from SparkEnv; the BlockManager uses the BlockStoreClient for the transfer

9、A connection map (ConcurrentHashMap<SocketAddress, ClientPool> connectionPool) is maintained for the Task's output data: if there is no Netty client for the remote socket yet, a new one is created; otherwise the existing one is reused

10、Build a OneForOneBlockPusher and start pushing the data stream

11、Finally the Netty client's channel.writeAndFlush() pushes the data stream to the target host

12、When the listener receives a push-success notification, pushUpToMax is called again to trigger pushing the next batch of blocks

That concludes this article on Spark push-based shuffle. Hopefully it is helpful to developers!



http://www.chinasem.cn/article/1138095
