[Hadoop] Flume Official Documentation Translation: Flume 1.7.0 User Guide (unreleased version) (Part 1)

2024-06-17 18:32


Flume 1.7.0 User Guide

  • Introduction
    • Overview
    • System Requirements
    • Architecture
      • Data flow model
      • Complex flows
      • Reliability
      • Recoverability
  • Setup
    • Setting up an agent
      • Configuring individual components
      • Wiring the pieces together
      • Starting an agent
      • A simple example
      • Logging raw data
      • Zookeeper based Configuration
      • Installing third-party plugins
        • The plugins.d directory
        • Directory layout for plugins
    • Data ingestion
      • RPC
      • Executing commands
      • Network streams
    • Setting multi-agent flow
    • Consolidation
    • Multiplexing the flow
    • Defining the flow
  • Configuration
    • Defining the flow
    • Configuring individual components
    • Adding multiple flows in an agent
    • Configuring a multi agent flow
    • Fan out flow
    • Flume Sources
      • Avro Source
      • Thrift Source
      • Exec Source
      • JMS Source
        • Converter
      • Spooling Directory Source
        • Event Deserializers
          • LINE
          • AVRO
          • BlobDeserializer
      • Taildir Source
      • Twitter 1% firehose Source (experimental)
      • Kafka Source
      • NetCat Source
      • Sequence Generator Source
      • Syslog Sources
        • Syslog TCP Source
        • Multiport Syslog TCP Source
        • Syslog UDP Source
      • HTTP Source
        • JSONHandler
        • BlobHandler
      • Stress Source
      • Legacy Sources
        • Avro Legacy Source
        • Thrift Legacy Source
      • Custom Source
      • Scribe Source
    • Flume Sinks
      • HDFS Sink
      • Hive Sink
      • Logger Sink
      • Avro Sink
      • Thrift Sink
      • IRC Sink
      • File Roll Sink
      • Null Sink
      • HBaseSinks
        • HBaseSink
        • AsyncHBaseSink
      • MorphlineSolrSink
      • ElasticSearchSink
      • Kite Dataset Sink
      • Kafka Sink
      • Custom Sink
    • Flume Channels
      • Memory Channel
      • JDBC Channel
      • Kafka Channel
      • File Channel
      • Spillable Memory Channel
      • Pseudo Transaction Channel
      • Custom Channel
    • Flume Channel Selectors
      • Replicating Channel Selector (default)
      • Multiplexing Channel Selector
      • Custom Channel Selector
    • Flume Sink Processors
      • Default Sink Processor
      • Failover Sink Processor
      • Load balancing Sink Processor
      • Custom Sink Processor
    • Event Serializers
      • Body Text Serializer
      • “Flume Event” Avro Event Serializer
      • Avro Event Serializer
    • Flume Interceptors
      • Timestamp Interceptor
      • Host Interceptor
      • Static Interceptor
      • UUID Interceptor
      • Morphline Interceptor
      • Search and Replace Interceptor
      • Regex Filtering Interceptor
      • Regex Extractor Interceptor
      • Example 1:
      • Example 2:
    • Flume Properties
      • Property: flume.called.from.service
  • Log4J Appender
  • Load Balancing Log4J Appender
  • Security
  • Monitoring
    • JMX Reporting
    • Ganglia Reporting
    • JSON Reporting
    • Custom Reporting
    • Reporting metrics from custom components
  • Tools
    • File Channel Integrity Tool
    • Event Validator Tool
  • Topology Design Considerations
    • Is Flume a good fit for your problem?
    • Flow reliability in Flume
    • Flume topology design
    • Sizing a Flume deployment
  • Troubleshooting
    • Handling agent failures
    • Compatibility
      • HDFS
      • AVRO
      • Additional version requirements
    • Tracing
    • More Sample Configs
  • Component Summary
  • Alias Conventions

Introduction

Overview

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.

Apache Flume is a top level project at the Apache Software Foundation.

There are currently two release code lines available, versions 0.9.x and 1.x.

Documentation for the 0.9.x track is available at the Flume 0.9.x User Guide.

This documentation applies to the 1.7.x track.

New and existing users are encouraged to use the 1.x releases so as to leverage the performance improvements and configuration flexibilities available in the latest architecture.


System Requirements

    1. Java Runtime Environment - Java 1.7 or later
    2. Memory - Sufficient memory for configurations used by sources, channels or sinks
    3. Disk Space - Sufficient disk space for configurations used by channels or sinks
    4. Directory Permissions - Read/Write permissions for directories used by agent

Architecture

Data flow model

A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).


A Flume source consumes events delivered to it by an external source like a web server. The external source sends events to Flume in a format that is recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro clients or other Flume agents in the flow that send events from an Avro sink. A similar flow can be defined using a Thrift Flume Source to receive events from a Thrift Sink or a Flume Thrift Rpc Client or Thrift clients written in any language generated from the Flume thrift protocol.

When a Flume source receives an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it’s consumed by a Flume sink. The file channel is one example – it is backed by the local filesystem. The sink removes the event from the channel and puts it into an external repository like HDFS (via Flume HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow. The source and sink within the given agent run asynchronously with the events staged in the channel.

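To make this flow concrete, here is a minimal configuration sketch, assuming an Avro source feeding a file channel that an HDFS sink drains; the agent name, port, and HDFS path are placeholders, not values from the guide:

# Hypothetical agent "a1": Avro source -> file channel -> HDFS sink
a1.sources = avro-in
a1.channels = fc1
a1.sinks = hdfs-out

# Avro source: receives events from Avro clients or an upstream Avro sink
a1.sources.avro-in.type = avro
a1.sources.avro-in.bind = 0.0.0.0
a1.sources.avro-in.port = 4141
a1.sources.avro-in.channels = fc1

# File channel: a passive store backed by the local filesystem
a1.channels.fc1.type = file

# HDFS sink: removes events from the channel and writes them to HDFS
a1.sinks.hdfs-out.type = hdfs
a1.sinks.hdfs-out.hdfs.path = hdfs://namenode:8020/flume/events
a1.sinks.hdfs-out.channel = fc1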

Complex flows

Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination. It also allows fan-in and fan-out flows, contextual routing and backup routes (fail-over) for failed hops.

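A hedged sketch of one multi-hop pattern, where the Avro sink of one agent forwards events to the Avro source of the next agent on another host (the agent names, hostname, and port are invented for illustration):

# First hop: Avro sink pointing at the next agent
hop1.sinks = avro-forward
hop1.sinks.avro-forward.type = avro
hop1.sinks.avro-forward.hostname = collector.example.com
hop1.sinks.avro-forward.port = 4141
hop1.sinks.avro-forward.channel = ch1

# Second hop (running on collector.example.com): Avro source receiving from hop1
hop2.sources = avro-in
hop2.sources.avro-in.type = avro
hop2.sources.avro-in.bind = 0.0.0.0
hop2.sources.avro-in.port = 4141
hop2.sources.avro-in.channels = ch2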

Reliability

The events are staged in a channel on each agent. The events are then delivered to the next agent or terminal repository (like HDFS) in the flow. The events are removed from a channel only after they are stored in the channel of next agent or in the terminal repository. This is how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow.

Flume uses a transactional approach to guarantee the reliable delivery of the events. The sources and sinks encapsulate in a transaction the storage/retrieval, respectively, of the events placed in or provided by a transaction provided by the channel. This ensures that the set of events are reliably passed from point to point in the flow. In the case of a multi-hop flow, the sink from the previous hop and the source from the next hop both have their transactions running to ensure that the data is safely stored in the channel of the next hop.


Recoverability

The events are staged in the channel, which manages recovery from failure. Flume supports a durable file channel which is backed by the local file system. There’s also a memory channel which simply stores the events in an in-memory queue, which is faster but any events still left in the memory channel when an agent process dies can’t be recovered.

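As a sketch under assumptions (the directory paths below are placeholders), this is how the two channel types just mentioned are typically declared; checkpointDir and dataDirs are the file channel's standard storage properties:

# Durable file channel: staged events survive an agent restart
a1.channels.fc1.type = file
a1.channels.fc1.checkpointDir = /var/flume/checkpoint
a1.channels.fc1.dataDirs = /var/flume/data

# Memory channel: faster, but events still in the queue are lost if the agent dies
a1.channels.mc1.type = memory
a1.channels.mc1.capacity = 10000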

Setup

Setting up an agent

Flume agent configuration is stored in a local configuration file. This is a text file that follows the Java properties file format. Configurations for one or more agents can be specified in the same configuration file. The configuration file includes properties of each source, sink and channel in an agent and how they are wired together to form data flows.


Configuring individual components

Each component (source, sink or channel) in the flow has a name, type, and set of properties that are specific to the type and instantiation. For example, an Avro source needs a hostname (or IP address) and a port number to receive data from. A memory channel can have max queue size (“capacity”), and an HDFS sink needs to know the file system URI, path to create files, frequency of file rotation (“hdfs.rollInterval”) etc. All such attributes of a component need to be set in the properties file of the hosting Flume agent.

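A short sketch of the three examples just mentioned, with assumed names and values (r1, c1, k1, the address, and the paths are illustrative only):

# Avro source: needs a hostname (or IP) and port to receive data on
a1.sources.r1.type = avro
a1.sources.r1.bind = 10.0.0.5
a1.sources.r1.port = 4141

# Memory channel: "capacity" bounds the in-memory queue
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# HDFS sink: filesystem URI/path plus file rotation frequency in seconds
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
a1.sinks.k1.hdfs.rollInterval = 30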

Wiring the pieces together

The agent needs to know what individual components to load and how they are connected in order to constitute the flow. This is done by listing the names of each of the sources, sinks and channels in the agent, and then specifying the connecting channel for each sink and source. For example, an agent flows events from an Avro source called avroWeb to HDFS sink hdfs-cluster1 via a file channel called file-channel. The configuration file will contain names of these components and file-channel as a shared channel for both avroWeb source and hdfs-cluster1 sink.

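Using the component names from this example, the wiring might look like the following sketch (the agent name "agent1" is assumed; only the names and connections are shown):

# List the components, then connect them through the shared channel
agent1.sources = avroWeb
agent1.channels = file-channel
agent1.sinks = hdfs-cluster1

# The source writes into file-channel; the sink drains the same channel
agent1.sources.avroWeb.channels = file-channel
agent1.sinks.hdfs-cluster1.channel = file-channel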

Starting an agent

An agent is started using a shell script called flume-ng which is located in the bin directory of the Flume distribution. You need to specify the agent name, the config directory, and the config file on the command line:


$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template

Now the agent will start running the sources and sinks configured in the given properties file.


A simple example

Here, we give an example configuration file, describing a single-node Flume deployment. This configuration lets a user generate events and subsequently logs them to the console.


# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

This configuration defines a single agent named a1. a1 has a source that listens for data on port 44444, a channel that buffers event data in memory, and a sink that logs event data to the console. The configuration file names the various components, then describes their types and configuration parameters. A given configuration file might define several named agents; when a given Flume process is launched a flag is passed telling it which named agent to manifest.

Given this configuration file, we can start Flume as follows:


$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

Note that in a full deployment we would typically include one more option: --conf=<conf-dir>. The <conf-dir> directory would include a shell script flume-env.sh and potentially a log4j properties file. In this example, we pass a Java option to force Flume to log to the console and we go without a custom environment script.

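For reference, a minimal sketch of what a flume-env.sh in <conf-dir> might contain; the JVM settings below are assumptions, not values from the guide:

# conf/flume-env.sh - sourced by bin/flume-ng when --conf points at this directory
export JAVA_OPTS="-Xms512m -Xmx1g"      # heap sizes are placeholders
# export JAVA_HOME=/usr/lib/jvm/java-8  # uncomment and adjust if needed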

From a separate terminal, we can then telnet port 44444 and send Flume an event:


$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Hello world! <ENTER>
OK

The original Flume terminal will output the event in a log message.


12/06/19 15:32:19 INFO source.NetcatSource: Source starting
12/06/19 15:32:19 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
12/06/19 15:32:34 INFO sink.LoggerSink: Event: { headers:{} body: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0D          Hello world!. }

Congratulations - you’ve successfully configured and deployed a Flume agent! Subsequent sections cover agent configuration in much more detail.


PROVING TEST SET CONTAMINATION IN BLACK BOX LANGUAGE MODELS https://openreview.net/forum?id=KS8mIvetg2 验证测试集污染在黑盒语言模型中 文章目录 验证测试集污染在黑盒语言模型中摘要1 引言 摘要 大型语言模型是在大量互联网数据上训练的,这引发了人们的担忧和猜测,即它们可能已