36_Spark Streaming II: Programming

2024-04-28 07:58

This article covers Spark Streaming programming (part II). Hopefully it offers some useful reference for developers working on these problems; follow along and learn together!

Spark Streaming Programming


1 Transformation: advanced operators
1.1 updateStateByKey
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream

/**
 * Word count that keeps running state across batches:
 * the driver keeps the result of the previous run (the state)
 * and merges it with the newly arrived data.
 */
object UpdateStateBykeyWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetWordCount")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    // updateStateByKey needs a checkpoint directory to persist the state
    ssc.checkpoint("hdfs://hadoop1:9000/streamingcheckpoint")

    /**
     * Input
     */
    val dstream: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop1", 9999)

    /**
     * Processing.
     *
     * updateFunc: (Seq[V], Option[S]) => Option[S]
     *   Input records such as (you,1) (you,1) (jump,1) are grouped by key:
     *     you  -> {1, 1}
     *     jump -> {1}
     *   values: Seq[Int]    the counts of this key in the current batch, e.g. List(1, 1)
     *   state:  Option[Int] how many times this word was seen before (None, or Some(2))
     */
    // The update function can also be written as a named value and passed in:
    // val f = (values: Seq[Int], state: Option[Int]) => {
    //   val currentCount = values.sum
    //   val lastCount = state.getOrElse(0)
    //   Some(currentCount + lastCount)
    // }
    // ... .updateStateByKey(f)
    val wordCountDStream = dstream.flatMap(_.split(","))
      .map((_, 1))
      .updateStateByKey((values: Seq[Int], state: Option[Int]) => {
        val currentCount = values.sum
        val lastCount = state.getOrElse(0)
        Some(currentCount + lastCount)
      })

    /* Output */
    wordCountDStream.print()

    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
  }
}
1.2 mapWithState
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext, Time}

/**
 * mapWithState: same idea as updateStateByKey, but with better performance
 * (only the keys that actually receive data in a batch are touched).
 */
object MapWithStateAPITest {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val sc = new SparkContext(sparkConf)
    // Create the context with a 5 second batch size
    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.checkpoint("hdfs://master:9999/streaming/checkpoint")

    val lines = ssc.socketTextStream("master", 9998, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordsDStream = words.map(x => (x, 1))

    // Initial state: these key/value pairs are loaded before the first batch
    val initialRDD = sc.parallelize(List(("dummy", 100L), ("source", 32L)))

    // currentBatchTime: the time of the current batch
    // key:              the key whose state is being updated
    // value:            the value for this key in the current batch
    // currentState:     the current state of this key
    val stateSpec = StateSpec.function((currentBatchTime: Time, key: String, value: Option[Int], currentState: State[Long]) => {
      val sum = value.getOrElse(0).toLong + currentState.getOption.getOrElse(0L)
      val output = (key, sum)
      if (!currentState.isTimingOut()) {
        currentState.update(sum)
      }
      Some(output)
    }).initialState(initialRDD)
      .numPartitions(2)
      // timeout: if a key receives no data for this long, the key and its state are removed
      .timeout(Seconds(30))

    val result = wordsDStream.mapWithState(stateSpec)
    result.print()
    result.stateSnapshots().print()

    // Start the streaming job
    ssc.start()
    // Wait for the streaming job to terminate
    ssc.awaitTermination()
  }
}
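StateSpec.function also has a simpler overload that omits the batch time and returns the mapped value directly instead of an Option. A minimal sketch of the same running word count written with that form (assuming the wordsDStream, checkpoint directory and imports from the example above):

// Simpler mapping function: (key, value, state) => mapped value
val simpleSpec = StateSpec.function(
  (word: String, one: Option[Int], state: State[Long]) => {
    val sum = one.getOrElse(0).toLong + state.getOption.getOrElse(0L)
    state.update(sum)   // persist the new running total for this key
    (word, sum)         // emitted into the resulting DStream
  }
)

val simpleResult = wordsDStream.mapWithState(simpleSpec)
simpleResult.print()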
1.3 Transform
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream

/**
 * Blacklist filtering with transform
 */
object WordBlack {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetWordCount")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))

    /**
     * Input.
     *
     * Here we fake a blacklist ourselves. Note that in practice the blacklist
     * is usually not hard-coded like this; it would be read from MySQL,
     * Redis, HBase, or a similar store.
     */
    val wordBlackList = ssc.sparkContext.parallelize(List("?", "!", "*"))
      .map(param => (param, true))
    val blackList = wordBlackList.collect()
    val blackListBroadcast = ssc.sparkContext.broadcast(blackList)

    val dstream: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop1", 9999)

    /**
     * Processing
     */
    val wordOneDStream = dstream.flatMap(_.split(",")).map((_, 1))

    // transform must return an RDD
    val wordCountDStream = wordOneDStream.transform(rdd => {
      // Inside transform we work with plain RDDs, exactly as in Spark Core;
      // we could just as well switch to DataFrames / SQL here.
      val filterRDD: RDD[(String, Boolean)] = rdd.sparkContext.parallelize(blackListBroadcast.value)
      val resultRDD: RDD[(String, (Int, Option[Boolean]))] = rdd.leftOuterJoin(filterRDD)
      /**
       * (String, (Int, Option[Boolean]))
       *   String: the word
       *   Int: 1
       *   Option: Some if the word joined against the blacklist, None if it did not.
       *
       * We want the words that did NOT join, i.e. Option[Boolean] == None.
       * filter keeps the elements for which the predicate is true.
       */
      resultRDD.filter(tuple => tuple._2._2.isEmpty).map(_._1)
    }).map((_, 1)).reduceByKey(_ + _)

    /**
     * Output
     */
    wordCountDStream.print()

    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
  }
}
1.4 Window operations
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream

/**
 * Requirement: every 4 seconds, compute the word counts over the last 6 seconds.
 *
 * Operator: reduceByKeyAndWindow
 */
object WindowOperatorTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetWordCount")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))

    /**
     * Input (still the socket source; not wired to a production source yet).
     */
    val dstream: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop1", 9999)

    /**
     * Processing: this is the operator you would actually use in production.
     *
     * reduceByKeyAndWindow(
     *   reduceFunc: (V, V) => V,
     *   windowDuration: Duration,   // 6s: the window size
     *   slideDuration: Duration,    // 4s: the slide interval
     *   numPartitions: Int          // optional: number of partitions
     * )
     */
    val resultWordCountDStream = dstream.flatMap(_.split(","))
      .map((_, 1))
      .reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(6), Seconds(4))

    /**
     * Output (print is only for testing).
     */
    resultWordCountDStream.print()

    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
  }
}
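For larger windows, reduceByKeyAndWindow also has an overload that takes an inverse reduce function, so each slide only adds the newly arrived data and subtracts the data that left the window instead of recomputing the whole window; it requires checkpointing. A minimal sketch, assuming the same ssc and dstream as above (the checkpoint path is just an example):

// Intermediate window state is kept, so a checkpoint directory is required.
ssc.checkpoint("hdfs://hadoop1:9000/streamingcheckpoint-window")

val incrementalWindowCounts = dstream.flatMap(_.split(","))
  .map((_, 1))
  .reduceByKeyAndWindow(
    (x: Int, y: Int) => x + y,   // add the counts of the slide that entered the window
    (x: Int, y: Int) => x - y,   // subtract the counts of the slide that left the window
    Seconds(6),                  // window size
    Seconds(4)                   // slide interval
  )
incrementalWindowCounts.print()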
2 Output
2.1 foreachRDD

foreachRDD is the core output operator: it hands each batch's RDD to user code, which typically writes the results to an external store such as MySQL. The example below walks through seven ways of doing that write, from naive to production-ready.

import java.sql.DriverManager

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * WordCount example: Spark Streaming consumes real-time data sent by a TCP server.
 *
 * 1. Start a Netcat server on the master node:
 *    `$ nc -lk 9998` (if nc is not available, install it with `yum install -y nc`)
 *
 * 2. Create the target table in MySQL:
 *    create table wordcount(ts bigint, word varchar(50), count int);
 *
 * 3. Run in spark-shell:
 *    spark-shell --total-executor-cores 4 --executor-cores 2 --master spark://master:7077 \
 *      --jars mysql-connector-java-5.1.44-bin.jar,c3p0-0.9.1.2.jar,spark-streaming-basic-1.0-SNAPSHOT.jar
 */
object NetworkWordCountForeachRDD {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("NetworkWordCountForeachRDD").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    // Create the context with a 5 second batch size
    val ssc = new StreamingContext(sc, Seconds(5))

    // Create a receiver (ReceiverInputDStream) that reads data sent to a socket port
    val lines = ssc.socketTextStream("master", 9998, StorageLevel.MEMORY_AND_DISK_SER)

    // Processing logic: a simple word count
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

    // The seven foreachRDD variants below are alternatives; keep only one in a real job.

    // Save the results to MySQL (1): connection and statement created on the driver.
    // This version fails at runtime: they cannot be serialized and shipped to the
    // executors that run rdd.foreach.
    wordCounts.foreachRDD { (rdd, time) =>
      Class.forName("com.mysql.jdbc.Driver")
      val conn = DriverManager.getConnection("jdbc:mysql://master:3306/test", "root", "root")
      val statement = conn.prepareStatement("insert into wordcount(ts, word, count) values (?, ?, ?)")
      rdd.foreach { record =>
        statement.setLong(1, time.milliseconds)
        statement.setString(2, record._1)
        statement.setInt(3, record._2)
        statement.execute()
      }
      statement.close()
      conn.close()
    }

    // Save the results to MySQL (2): one connection per record (works, but very expensive).
    wordCounts.foreachRDD { (rdd, time) =>
      rdd.foreach { record =>
        Class.forName("com.mysql.jdbc.Driver")
        val conn = DriverManager.getConnection("jdbc:mysql://master:3306/test", "root", "root")
        val statement = conn.prepareStatement("insert into wordcount(ts, word, count) values (?, ?, ?)")
        statement.setLong(1, time.milliseconds)
        statement.setString(2, record._1)
        statement.setInt(3, record._2)
        statement.execute()
        statement.close()
        conn.close()
      }
    }

    // Save the results to MySQL (3): one connection per partition.
    wordCounts.foreachRDD { (rdd, time) =>
      rdd.foreachPartition { partitionRecords =>
        Class.forName("com.mysql.jdbc.Driver")
        val conn = DriverManager.getConnection("jdbc:mysql://master:3306/test", "root", "root")
        val statement = conn.prepareStatement("insert into wordcount(ts, word, count) values (?, ?, ?)")
        partitionRecords.foreach { case (word, count) =>
          statement.setLong(1, time.milliseconds)
          statement.setString(2, word)
          statement.setInt(3, count)
          statement.execute()
        }
        statement.close()
        conn.close()
      }
    }

    // Save the results to MySQL (4): one pooled connection per partition.
    wordCounts.foreachRDD { (rdd, time) =>
      rdd.foreachPartition { partitionRecords =>
        val conn = ConnectionPool.getConnection
        val statement = conn.prepareStatement("insert into wordcount(ts, word, count) values (?, ?, ?)")
        partitionRecords.foreach { case (word, count) =>
          statement.setLong(1, time.milliseconds)
          statement.setString(2, word)
          statement.setInt(3, count)
          statement.execute()
        }
        statement.close()
        ConnectionPool.returnConnection(conn)
      }
    }

    // Save the results to MySQL (5): pooled connection plus batched inserts.
    wordCounts.foreachRDD { (rdd, time) =>
      rdd.foreachPartition { partitionRecords =>
        val conn = ConnectionPool.getConnection
        val statement = conn.prepareStatement("insert into wordcount(ts, word, count) values (?, ?, ?)")
        partitionRecords.foreach { case (word, count) =>
          statement.setLong(1, time.milliseconds)
          statement.setString(2, word)
          statement.setInt(3, count)
          statement.addBatch()
        }
        statement.executeBatch()
        statement.close()
        ConnectionPool.returnConnection(conn)
      }
    }

    // Save the results to MySQL (6): batched inserts inside a transaction.
    wordCounts.foreachRDD { (rdd, time) =>
      rdd.foreachPartition { partitionRecords =>
        val conn = ConnectionPool.getConnection
        conn.setAutoCommit(false)
        val statement = conn.prepareStatement("insert into wordcount(ts, word, count) values (?, ?, ?)")
        partitionRecords.foreach { case (word, count) =>
          statement.setLong(1, time.milliseconds)
          statement.setString(2, word)
          statement.setInt(3, count)
          statement.addBatch()
        }
        statement.executeBatch()
        statement.close()
        conn.commit()
        conn.setAutoCommit(true)
        ConnectionPool.returnConnection(conn)
      }
    }

    // Save the results to MySQL (7): batched inserts, committing every 500 records.
    wordCounts.foreachRDD { (rdd, time) =>
      rdd.foreachPartition { partitionRecords =>
        val conn = ConnectionPool.getConnection
        conn.setAutoCommit(false)
        val statement = conn.prepareStatement("insert into wordcount(ts, word, count) values (?, ?, ?)")
        partitionRecords.zipWithIndex.foreach { case ((word, count), index) =>
          statement.setLong(1, time.milliseconds)
          statement.setString(2, word)
          statement.setInt(3, count)
          statement.addBatch()
          if (index != 0 && index % 500 == 0) {
            statement.executeBatch()
            conn.commit()
          }
        }
        statement.executeBatch()
        statement.close()
        conn.commit()
        conn.setAutoCommit(true)
        ConnectionPool.returnConnection(conn)
      }
    }

    // Start the streaming job
    ssc.start()
    // Wait for the streaming job to terminate
    ssc.awaitTermination()
  }
}

Writing the connection pool:

Add the following to pom.xml:

<dependency>
    <groupId>c3p0</groupId>
    <artifactId>c3p0</artifactId>
    <version>0.9.1.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.39</version>
</dependency>
import java.sql.Connection;
import java.sql.SQLException;

import com.mchange.v2.c3p0.ComboPooledDataSource;

/**
 * Connection pool backed by c3p0.
 */
public class ConnectionPool {
    private static ComboPooledDataSource dataSource = new ComboPooledDataSource();

    static {
        dataSource.setJdbcUrl("jdbc:mysql://master:3306/test"); // JDBC URL of the database
        dataSource.setUser("root");                             // database user
        dataSource.setPassword("root");                         // database password
        dataSource.setMaxPoolSize(40);                          // maximum number of connections in the pool
        dataSource.setMinPoolSize(2);                           // minimum number of connections in the pool
        dataSource.setInitialPoolSize(10);                      // initial number of connections in the pool
        dataSource.setMaxStatements(100);                       // maximum number of cached Statements
    }

    public static Connection getConnection() {
        try {
            return dataSource.getConnection();
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return null;
    }

    public static void returnConnection(Connection connection) {
        if (connection != null) {
            try {
                connection.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }
}
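One caveat: the foreachPartition variants above return the connection to the pool only on the success path, so a failed insert leaks the connection. A minimal sketch of a safer borrow/return pattern with try/finally, assuming the same wordCounts DStream and ConnectionPool as above:

wordCounts.foreachRDD { (rdd, time) =>
  rdd.foreachPartition { partitionRecords =>
    val conn = ConnectionPool.getConnection
    try {
      val statement = conn.prepareStatement("insert into wordcount(ts, word, count) values (?, ?, ?)")
      try {
        partitionRecords.foreach { case (word, count) =>
          statement.setLong(1, time.milliseconds)
          statement.setString(2, word)
          statement.setInt(3, count)
          statement.addBatch()
        }
        statement.executeBatch()
      } finally {
        statement.close()
      }
    } finally {
      // Always hand the connection back, even if the batch failed.
      ConnectionPool.returnConnection(conn)
    }
  }
}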
3 Checkpoint
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream

/**
 * Driver HA: recreate the StreamingContext from the checkpoint when the driver restarts.
 */
object DriverHAWordCount {
  def main(args: Array[String]): Unit = {
    val checkpointDirectory: String = "hdfs://hadoop1:9000/streamingcheckpoint2"

    // The whole DStream graph must be defined inside this function so that it
    // can be rebuilt from the checkpoint after a driver restart.
    def functionToCreateContext(): StreamingContext = {
      val conf = new SparkConf().setMaster("local[2]").setAppName("NetWordCount")
      val sc = new SparkContext(conf)
      val ssc = new StreamingContext(sc, Seconds(2))
      ssc.checkpoint(checkpointDirectory)

      val dstream: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop1", 9999)
      val wordCountDStream = dstream.flatMap(_.split(","))
        .map((_, 1))
        .updateStateByKey((values: Seq[Int], state: Option[Int]) => {
          val currentCount = values.sum
          val lastCount = state.getOrElse(0)
          Some(currentCount + lastCount)
        })
      wordCountDStream.print()
      ssc
    }

    // Recover from the checkpoint if it exists, otherwise create a new context.
    val ssc = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
  }
}
4 Integrating Spark Streaming with Spark SQL

Add the following to pom.xml:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.2.1</version>
</dependency>
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * WordCount example: Spark Streaming consumes real-time data sent by a TCP server.
 *
 * 1. Start a Netcat server on the master node:
 *    `$ nc -lk 9998` (if nc is not available, install it with `yum install -y nc`)
 */
object NetworkWordCountForeachRDDDataFrame {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("NetworkWordCountForeachRDD")
    val sc = new SparkContext(sparkConf)
    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(sc, Seconds(1))

    // Create a receiver (ReceiverInputDStream) that reads data sent to a socket port
    val lines = ssc.socketTextStream("master", 9998, StorageLevel.MEMORY_AND_DISK_SER)

    // Processing logic: a simple word count
    val words = lines.flatMap(_.split(" "))

    // The two foreachRDD blocks below are alternatives; keep only one in a real job.

    // Convert each RDD to a DataFrame and run SQL on it
    words.foreachRDD { rdd =>
      // Get the singleton instance of SparkSession
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._

      // Convert RDD[String] to DataFrame
      val wordsDataFrame = rdd.toDF("word")

      // Create a temporary view
      wordsDataFrame.createOrReplaceTempView("words")

      // Do word count on DataFrame using SQL and print it
      val wordCountsDataFrame =
        spark.sql("select word, count(*) as total from words group by word")
      wordCountsDataFrame.show()
    }

    // Convert each RDD to a DataFrame, add the batch time, and write to Parquet
    words.foreachRDD { (rdd, time) =>
      // Get the singleton instance of SparkSession
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._

      // Convert RDD[String] to DataFrame
      val wordsDataFrame = rdd.toDF("word")

      // Do word count using the DataFrame API
      val wordCountsDataFrame = wordsDataFrame.groupBy("word").count()

      // Attach the batch timestamp to every row
      val resultDFWithTs = wordCountsDataFrame.rdd
        .map(row => (row.getString(0), row.getLong(1), time.milliseconds))
        .toDF("word", "count", "ts")

      resultDFWithTs.write.mode(SaveMode.Append)
        .parquet("hdfs://master:9999/user/spark-course/streaming/parquet")
    }

    // Start the streaming job
    ssc.start()
    // Wait for the streaming job to terminate
    ssc.awaitTermination()
  }
}
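When a batch carries no data, the second variant above still launches a write job and appends an empty Parquet output. A small guard with rdd.isEmpty() avoids that; this is a sketch assuming the same words DStream and imports as above, plus org.apache.spark.sql.functions.lit:

import org.apache.spark.sql.functions.lit

words.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {   // skip batches that carried no data
    val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._

    val wordCountsWithTs = rdd.toDF("word")
      .groupBy("word").count()
      .withColumn("ts", lit(time.milliseconds))   // tag each row with the batch time

    wordCountsWithTs.write.mode(SaveMode.Append)
      .parquet("hdfs://master:9999/user/spark-course/streaming/parquet")
  }
}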

That concludes this article on Spark Streaming programming (part II); hopefully the material is helpful to fellow developers!



http://www.chinasem.cn/article/942709
