Spark in Practice (5): Spark Streaming + Flume (Python Version)


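I. Installing Flume

(1) Overview

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting large volumes of log data. It can collect source data in many forms, such as files and socket packets, and deliver the collected data to external storage systems such as HDFS, HBase, Hive, and Kafka. Ordinary collection requirements can be met with simple Flume configuration, and Flume also supports custom extensions for special scenarios, so it fits most everyday data collection use cases.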

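(2) How it works

1. The core role in a distributed Flume system is the agent; a Flume collection system is formed by connecting agents to one another.

2. Each agent acts as a data courier and contains three components:

a)	Source: the collection end, which connects to the data source to obtain data
b)	Sink: the destination end, which passes data on to the next-level agent or to the final storage system
c)	Channel: the data transfer channel inside the agent, which carries data from the source to the sink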
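The relationship between the three components can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Flume code: the names MemoryChannel, run_source, and run_sink are made up for illustration, and real Flume channels add transactions and other guarantees that are left out here.

```python
import queue


class MemoryChannel:
    """Toy stand-in for a bounded in-memory channel (hypothetical, not Flume's API)."""

    def __init__(self, capacity=1000):
        self._queue = queue.Queue(maxsize=capacity)

    def put(self, event):
        # Raises queue.Full when the channel is at capacity.
        self._queue.put_nowait(event)

    def take(self):
        # Returns the next event, or None when the channel is empty.
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None


def run_source(lines, channel):
    """Source: wrap each input line in an event and hand it to the channel."""
    for line in lines:
        channel.put({"headers": {}, "body": line})


def run_sink(channel, destination):
    """Sink: drain the channel and deliver each event body to the destination."""
    delivered = 0
    while True:
        event = channel.take()
        if event is None:
            return delivered
        destination.append(event["body"])
        delivered += 1


channel = MemoryChannel(capacity=10)
stored = []
run_source(["hello spark", "hello flume"], channel)
count = run_sink(channel, stored)
print(count, stored)
```

The source and the sink never talk to each other directly; the channel is the only thing they share, which is what lets an agent buffer data when the sink is slower than the source.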


(3) Flume collection system structure

1. Simple structure

A single agent collects the data.

2. Complex structure

Multiple levels of agents are chained in series.

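(4) Installing and deploying Flume

1. Download the installation package from the Apache website and extract it (tar -zxvf apache-flume-1.8.0-bin.tar.gz), then edit flume-env.sh in the conf directory and set JAVA_HOME in it.

2. Configure a collection plan according to your data collection requirements and describe it in a configuration file (the file name can be anything).

3. Start the Flume agent on the relevant node, pointing it at the collection plan's configuration file.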

II. Flume push approach

1. Spark Streaming program

In the first approach, Flume pushes the collected data to the Spark program. This approach is rarely used. The Spark code is as follows:

import pyspark
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("PythonWordCount") \
        .master("local[2]") \
        .getOrCreate()
    sc = spark.sparkContext
    # Process the stream in 5-second batches
    ssc = StreamingContext(sc, 5)
    # hostname = sys.argv[1]
    # port = int(sys.argv[2])
    # Start a receiver that listens for events pushed by Flume's avro sink
    flumeStream = FlumeUtils.createStream(ssc, "localhost", 8888, pyspark.StorageLevel.MEMORY_AND_DISK_SER_2)
    # Each record is a (headers, body) tuple; keep only the body
    line = flumeStream.map(lambda x: x[1])
    words = line.flatMap(lambda x: x.split(" "))
    datas = words.map(lambda x: (x, 1))
    result = datas.reduceByKey(lambda agg, obj: agg + obj)
    result.pprint()
    ssc.start()
    ssc.awaitTermination()
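To see what the chain of transformations in the program computes for one batch, here is a plain-Python sketch that needs neither Spark nor Flume. The function name word_count_batch and the sample events are made up for illustration; it assumes each record is a (headers, body) pair, which is why the Spark program selects x[1].

```python
from collections import Counter


def word_count_batch(events):
    """Mimic map -> flatMap -> map -> reduceByKey for one batch of (headers, body) events."""
    # map(lambda x: x[1]): keep only the event body
    lines = [body for _headers, body in events]
    # flatMap(lambda x: x.split(" ")): one entry per word
    words = [word for line in lines for word in line.split(" ")]
    # map(lambda x: (x, 1)) followed by reduceByKey(add): sum the ones per word
    counts = Counter()
    for word in words:
        counts[word] += 1
    return dict(counts)


batch = [({}, "hello spark"), ({}, "hello flume")]
result = word_count_batch(batch)
print(result)
```

In the streaming job the same computation runs once per 5-second batch, and counts are not carried over from one batch to the next.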

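Note: the degree of parallelism must be specified. When running locally, set the master to "local[2]", which starts two threads: one for the receiver and one for processing the data. Otherwise the receiver takes the only thread, so batches keep being added but are never processed, as in the following log: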

2019-01-09 19:36:16 INFO  ReceiverSupervisorImpl:54 - Called receiver 0 onStart
2019-01-09 19:36:16 INFO  ReceiverSupervisorImpl:54 - Waiting for receiver to be stopped
2019-01-09 19:36:20 INFO  JobScheduler:54 - Added jobs for time 1547033780000 ms
2019-01-09 19:36:25 INFO  JobScheduler:54 - Added jobs for time 1547033785000 ms
2019-01-09 19:36:30 INFO  JobScheduler:54 - Added jobs for time 1547033790000 ms
2019-01-09 19:36:35 INFO  JobScheduler:54 - Added jobs for time 1547033795000 ms
2019-01-09 19:36:40 INFO  JobScheduler:54 - Added jobs for time 1547033800000 ms

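When running on a cluster, the number of cores available to the application must be greater than 1 (more precisely, greater than the number of receivers).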
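A small guard can catch this mistake before the job is started. The sketch below is an assumption-laden helper, not part of Spark: the function name enough_threads_for_receiver is made up, and it only understands the "local", "local[N]", and "local[*]" forms of the master URL.

```python
import os
import re


def enough_threads_for_receiver(master, receivers=1):
    """Return True/False for local masters, or None when the master is not a local one."""
    if master == "local":
        # Plain "local" runs with a single thread.
        return 1 > receivers
    match = re.fullmatch(r"local\[(\d+|\*)\]", master)
    if match is None:
        # Cluster masters (for example spark://...) cannot be judged from the URL alone.
        return None
    if match.group(1) == "*":
        threads = os.cpu_count() or 1
    else:
        threads = int(match.group(1))
    # Each receiver occupies one thread; at least one more is needed for processing.
    return threads > receivers


for master in ["local", "local[1]", "local[2]", "spark://host:7077"]:
    print(master, enough_threads_for_receiver(master))
```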

2. Flume conf file

Create flume-push.conf in Flume's conf directory with the following content:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/hadoop/log/flume
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = avro
# This is the receiving side
a1.sinks.k1.hostname = 192.168.62.131
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
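In push mode the avro sink's hostname and port must point at the address where the Spark receiver listens, so it is worth checking that the two agree. The following is a minimal sketch of such a check; parse_flume_conf and avro_sink_address are hypothetical helper names, and the parser only handles simple key = value lines and # comments.

```python
def parse_flume_conf(text):
    """Parse simple 'key = value' lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def avro_sink_address(props, agent="a1", sink="k1"):
    """Return the (hostname, port) that the named sink pushes events to."""
    prefix = "{}.sinks.{}.".format(agent, sink)
    return props[prefix + "hostname"], int(props[prefix + "port"])


# Excerpt of the sink section from flume-push.conf
CONF = """
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.62.131
a1.sinks.k1.port = 8888
"""

sink_address = avro_sink_address(parse_flume_conf(CONF))
print(sink_address)
```

The host and port printed here are the ones that should be passed to FlumeUtils.createStream in the Spark program.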

The Spark program must be started first, using the following command:

spark/bin/spark-submit  --driver-class-path /home/hadoop/spark/jars/*:/home/hadoop/jar/flume/* /tmp/pycharm_project_563/day5/FlumePushWordCount.py

The following problem may occur:

  Spark Streaming's Flume libraries not found in class path. Try one of the following.

  1. Include the Flume library and its dependencies with in the
     spark-submit command as

     $ bin/spark-submit --packages org.apache.spark:spark-streaming-flume:2.4.0 ...

  2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
     Group Id = org.apache.spark, Artifact Id = spark-streaming-flume-assembly, Version = 2.4.0.
     Then, include the jar in the spark-submit command as

     $ bin/spark-submit --jars <spark-streaming-flume-assembly.jar> ...

Traceback (most recent call last):
  File "/tmp/pycharm_project_563/day5/FlumePushWordCount.py", line 12, in <module>
    flumeStream = FlumeUtils.createStream(ssc, "192.168.62.131", "8888")
  File "/home/hadoop/spark/python/pyspark/streaming/flume.py", line 67, in createStream
    helper = FlumeUtils._get_helper(ssc._sc)
  File "/home/hadoop/spark/python/pyspark/streaming/flume.py", line 130, in _get_helper
    return sc._jvm.org.apache.spark.streaming.flume.FlumeUtilsPythonHelper()
TypeError: 'JavaPackage' object is not callable

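To fix this, download spark-streaming-flume-assembly.jar from the Maven repository and put it in the jar directory specified in the command above (/home/hadoop/jar/flume).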
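The error message also offers the --packages route, which lets spark-submit fetch the library itself. The sketch below only assembles such a command as a list of arguments. The helper name flume_submit_command is made up, and the Scala suffix 2.11 in the artifact name is an assumption about how the Spark 2.4.0 installation was built; adjust both versions to match yours.

```python
def flume_submit_command(script, spark_version="2.4.0", scala_version="2.11"):
    """Build a spark-submit argument list that pulls in the Flume integration package."""
    # Assumed Maven coordinates: the artifact name carries the Scala version suffix.
    package = "org.apache.spark:spark-streaming-flume_{}:{}".format(
        scala_version, spark_version
    )
    return ["spark/bin/spark-submit", "--packages", package, script]


command = flume_submit_command("/tmp/pycharm_project_563/day5/FlumePushWordCount.py")
print(" ".join(command))
```

The resulting list can be passed to subprocess.run, or the printed line can be pasted into a shell. This route needs network access to a Maven repository, whereas the downloaded assembly jar works offline.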

Then run Flume:

bin/flume-ng agent -n a1 -c conf/ -f conf/flume-push.conf -Dflume.root.logger=WARN,console

Then create a log file in the /home/hadoop/log/flume directory. The word counts for its contents then appear in the log of the running Spark program.
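The spooling directory source expects a file to be complete and unchanged once it appears in the watched directory, so it is safer to write the file elsewhere and move it in with a single rename. The sketch below shows one way to do that; drop_into_spool_dir and the staging directory are made up for illustration, and the demo uses temporary directories rather than /home/hadoop/log/flume. It assumes the staging and spool directories are on the same filesystem, since os.replace cannot move across filesystems.

```python
import os
import tempfile


def drop_into_spool_dir(staging_dir, spool_dir, file_name, lines):
    """Write the file in a staging directory, then move it into the spool directory."""
    staged = os.path.join(staging_dir, file_name)
    with open(staged, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
    target = os.path.join(spool_dir, file_name)
    # A rename within one filesystem is atomic: the source never sees a half-written file.
    os.replace(staged, target)
    return target


# Demo with temporary directories standing in for the real paths
base = tempfile.mkdtemp()
staging = os.path.join(base, "staging")
spool = os.path.join(base, "flume")
os.makedirs(staging)
os.makedirs(spool)

target = drop_into_spool_dir(staging, spool, "aaa.txt", ["hello spark", "hello flume"])
print(os.listdir(spool))
```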

III. Poll approach

1. Spark Streaming program

In this approach, Spark actively pulls data from Flume. The code is as follows:

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("PythonWordCount") \
        .master("local[2]") \
        .getOrCreate()
    sc = spark.sparkContext
    # Process the stream in 5-second batches
    ssc = StreamingContext(sc, 5)
    # (host, port) pairs of the Flume agents' SparkSink to pull from
    addresses = [("localhost", 8888)]
    flumeStream = FlumeUtils.createPollingStream(ssc, addresses)
    # Each record is a (headers, body) tuple; keep only the body
    line = flumeStream.map(lambda x: x[1])
    words = line.flatMap(lambda x: x.split(" "))
    datas = words.map(lambda x: (x, 1))
    result = datas.reduceByKey(lambda agg, obj: agg + obj)
    result.pprint()
    ssc.start()
    ssc.awaitTermination()
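Because createPollingStream takes a list of (host, port) pairs, the addresses do not have to be hard-coded. As a sketch, the hypothetical helper parse_sink_addresses below turns a comma-separated "host:port" string, for example one read from the command line, into such a list. The second address in the demo is purely illustrative.

```python
def parse_sink_addresses(spec):
    """Turn 'host1:port1,host2:port2' into a list of (host, port) tuples."""
    addresses = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.rpartition(":")
        if not host or not port.isdigit():
            raise ValueError("expected host:port, got {!r}".format(item))
        addresses.append((host, int(port)))
    return addresses


addresses = parse_sink_addresses("localhost:8888, 192.168.62.131:8888")
print(addresses)
```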

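In local mode the degree of parallelism must be specified here as well. When running on a cluster, the number of cores available to the application must be greater than 1.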

2. Flume conf file

Create flume-poll.conf in Flume's conf directory with the following content:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/hadoop/log/flume
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = localhost
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Because this is the poll approach, Flume needs to be started first:

bin/flume-ng agent -n a1 -c conf/ -f conf/flume-poll.conf -Dflume.root.logger=WARN,console

Then start the Spark program:

spark/bin/spark-submit  --driver-class-path /home/hadoop/spark/jars/*:/home/hadoop/jar/flume/* /tmp/pycharm_project_563/day5/FlumePollWordCount.py

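As before, create a log file in the /home/hadoop/log/flume directory, and delete the COMPLETED file generated by the earlier run (rm flume/aaa.txt.COMPLETED) so that the file name can be used again. The word counts then appear in the log of the running Spark program.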
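Cleaning up the markers left by earlier runs can be scripted. The sketch below removes every file ending in .COMPLETED from a directory; remove_completed_files is a made-up helper name, and the demo runs against a temporary directory instead of /home/hadoop/log/flume.

```python
import os
import tempfile


def remove_completed_files(spool_dir, suffix=".COMPLETED"):
    """Delete files that the spooling source has already marked as processed."""
    removed = []
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        if name.endswith(suffix) and os.path.isfile(path):
            os.remove(path)
            removed.append(name)
    return removed


# Demo with a temporary directory standing in for the spool directory
demo_dir = tempfile.mkdtemp()
for name in ["aaa.txt.COMPLETED", "bbb.txt"]:
    with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as handle:
        handle.write("hello flume\n")

removed = remove_completed_files(demo_dir)
print(removed, os.listdir(demo_dir))
```

Files that have not been processed yet (bbb.txt in the demo) are left untouched.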


