How To Configure Elasticsearch on Hadoop with HDP

2024-03-19 13:48

Original article: http://www.tuicool.com/articles/Jryyme


Elasticsearch’s engine integrates with Hortonworks Data Platform 2.0 and YARN to provide real-time search and access to information in Hadoop.

See it in action: register for the Hortonworks and Elasticsearch webinar on March 5th, 2014 at 10 am PST / 1 pm EST to see the demo and an outline of best practices for integrating Elasticsearch and HDP 2.0 to extract maximum insight from your data. Click here to register for this exciting and informative webinar!

Try it yourself: get started with this tutorial using Elasticsearch and the Hortonworks Data Platform (or the Hortonworks Sandbox) to access server logs in Kibana, using Apache Flume for ingestion.

Architecture

The following diagram depicts the proposed architecture: logs are indexed into Elasticsearch in near real-time and also saved to Hadoop for long-term batch analytics.

[Figure: proposed architecture, with Flume ingesting logs into Elasticsearch for near real-time search and into Hadoop for long-term batch analytics]

Components

Elasticsearch

Elasticsearch is a search engine that can index new documents in near real-time and make them immediately available for querying. Elasticsearch is based on Apache Lucene and allows for setting up clusters of nodes that store any number of indices in a distributed, fault-tolerant way. If a node disappears, the cluster rebalances the (shards of) indices over the remaining nodes. You can configure how many shards make up each index and how many replicas of these shards there should be. If a primary shard goes offline, one of the replicas is promoted to primary and used to repopulate another node.
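
For example, shard and replica counts can be set per index at creation time through the REST API. A minimal sketch, assuming a node is listening on the default HTTP port 9200 on the local host; the index name weblogs is arbitrary:

curl -XPUT 'http://localhost:9200/weblogs/' -d '{
  "settings": { "number_of_shards": 5, "number_of_replicas": 1 }
}'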

Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into different storage destinations such as the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows, and it is robust and fault tolerant with tunable reliability mechanisms for failover and recovery.

Kibana

Kibana is an open source (Apache licensed), browser-based analytics and search interface for Logstash and other timestamped data sets stored in Elasticsearch. Kibana strives to be easy to get started with, while also being flexible and powerful.

System Requirements

  • Hadoop: Hortonworks Data Platform 2.0 (HDP 2.0) or the Hortonworks Sandbox for HDP 2.0
  • OS: 64-bit RHEL (Red Hat Enterprise Linux) 6, CentOS, or Oracle Linux 6
  • Software: yum, rpm, unzip, tar, wget, java
  • JDK: Oracle JDK 1.7 (64-bit), Oracle JDK 1.6 update 31, or OpenJDK 7

Java Installation

Note: Define the JAVA_HOME environment variable and add the Java Virtual Machine and the Java binaries to your PATH environment variable.

Execute the following commands to verify that Java is on the PATH:

export JAVA_HOME=/usr/java/default 
export PATH=$JAVA_HOME/bin:$PATH 
java -version

Flume Installation

Execute the following command to install the Flume binaries and agent scripts:
yum install flume-agent flume
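
To confirm the installation, the flume-ng launcher that ships with the package can print its version; the exact output depends on the HDP build:

flume-ng version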

Elasticsearch Installation

The latest Elasticsearch release can be downloaded from http://www.elasticsearch.org/download/

The RPM used in this tutorial can be downloaded from https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.noarch.rpm

To install Elasticsearch on data nodes: 
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.noarch.rpm

rpm -ivh elasticsearch-0.90.7.noarch.rpm

Set up and configure Elasticsearch

Update the following properties in /etc/elasticsearch/elasticsearch.yml (a consolidated example follows the list):

  • Set the cluster name: cluster.name: "logsearch"
  • Set the node name: node.name: "node1"
  • By default, every node is eligible to be master and stores data. This can be adjusted with:
    • node.master: true
    • node.data: true
  • The number of shards per index can be adjusted with index.number_of_shards: 5
  • The number of replicas (additional copies) can be set with index.number_of_replicas: 1
  • Adjust the data paths with path.data: /data1,/data2,/data3,/data4
  • Set the minimum number of master-eligible nodes a node must see before forming a cluster; this should be set based on the number of nodes: discovery.zen.minimum_master_nodes: 1
  • Set the time to wait for ping responses from other nodes during discovery; the value needs to be higher for slow or congested networks: discovery.zen.ping.timeout: 3s
  • Disable multicast discovery only if multicast is not supported on the network: discovery.zen.ping.multicast.enabled: false

Note: If multicast is disabled, configure an initial list of master nodes in the cluster with discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]
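
Putting these settings together, a minimal single-node elasticsearch.yml for this tutorial might look like the following; all values are taken from the list above, so adjust paths and host names for your environment:

cluster.name: "logsearch"
node.name: "node1"
node.master: true
node.data: true
index.number_of_shards: 5
index.number_of_replicas: 1
path.data: /data1,/data2,/data3,/data4
discovery.zen.minimum_master_nodes: 1
discovery.zen.ping.timeout: 3s
# Only when multicast is not supported on the network:
# discovery.zen.ping.multicast.enabled: false
# discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]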

Logging properties can be adjusted in /etc/elasticsearch/logging.yml. The default log location is /var/log/elasticsearch.

Starting and Stopping Elasticsearch

  • To start Elasticsearch: /etc/init.d/elasticsearch start (a quick health check is shown below)
  • To stop Elasticsearch: /etc/init.d/elasticsearch stop
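
Once started, a quick way to verify the node is up is the cluster health API; this assumes Elasticsearch is listening on the default HTTP port 9200 on the local host:

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'

A status of green (or yellow on a single node with replicas configured) indicates the node is serving requests.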

Kibana Installation

Download the Kibana binaries from https://download.elasticsearch.org/kibana/kibana/kibana-3.0.0milestone4.tar.gz

wget https://download.elasticsearch.org/kibana/kibana/kibana-3.0.0milestone4.tar.gz

Extract the archive with tar -zxvf kibana-3.0.0milestone4.tar.gz

Set up and configure Kibana

  • Open the config.js file under the extracted directory
  • Set the elasticsearch parameter to the fully qualified hostname or IP of your Elasticsearch server, for example:
  • elasticsearch: "http://<elasticsearch_host>:9200"
  • Open index.html in your browser to access the Kibana UI
  • Update the Logstash index pattern to the index pattern used by the Flume sink:
  • Edit app/dashboards/logstash.json and replace all occurrences of [logstash-]YYYY.MM.DD with [logstash-]YYYY-MM-DD (a scripted example follows)
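
If you prefer to script the dashboard change, a sed one-liner such as the following (run from the extracted Kibana directory) should work; verify the file afterwards, since the exact pattern may differ between Kibana milestone releases:

sed -i 's/\[logstash-\]YYYY\.MM\.DD/[logstash-]YYYY-MM-DD/g' app/dashboards/logstash.json
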
Set up and configure Flume

For demonstration purposes, let's set up and configure a Flume agent, using the following Flume configuration, on a host where a log file needs to be consumed.

Create the plugins.d directory and copy the Elasticsearch dependencies:

mkdir /usr/lib/flume/plugins.d
cp $elasticsearch_home/lib/elasticsearch-0.90*.jar /usr/lib/flume/plugins.d
cp $elasticsearch_home/lib/lucene-core-*.jar /usr/lib/flume/plugins.d

Update the Flume configuration to consume a local file and index it into Elasticsearch in Logstash format. Note: in real-world use cases, the Flume Log4j Appender, Syslog TCP Source, Flume Client SDK, or Spooling Directory Source are preferred over tailing logs.

agent.sources = tail
agent.channels = memoryChannel
agent.channels.memoryChannel.type = memory

agent.sources.tail.channels = memoryChannel
agent.sources.tail.type = exec
agent.sources.tail.command = tail -F /tmp/es_log.log
agent.sources.tail.interceptors = i1 i2 i3
agent.sources.tail.interceptors.i1.type = regex_extractor
agent.sources.tail.interceptors.i1.regex = (\\w.*):(\\w.*):(\\w.*)\\s
agent.sources.tail.interceptors.i1.serializers = s1 s2 s3
agent.sources.tail.interceptors.i1.serializers.s1.name = source
agent.sources.tail.interceptors.i1.serializers.s2.name = type
agent.sources.tail.interceptors.i1.serializers.s3.name = src_path
agent.sources.tail.interceptors.i2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
agent.sources.tail.interceptors.i3.type = org.apache.flume.interceptor.HostInterceptor$Builder
agent.sources.tail.interceptors.i3.hostHeader = host

agent.sinks = elasticsearch
agent.sinks.elasticsearch.channel = memoryChannel
agent.sinks.elasticsearch.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
agent.sinks.elasticsearch.batchSize = 100
agent.sinks.elasticsearch.hostNames = 172.16.55.129:9300
agent.sinks.elasticsearch.indexName = logstash
agent.sinks.elasticsearch.clusterName = logsearch
agent.sinks.elasticsearch.serializer = org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer
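
Save this configuration where your Flume agent expects it; on HDP the flume-agent service typically reads /etc/flume/conf/flume.conf, but the exact path and agent name (agent above) depend on your packaging. The agent can also be started directly, for example:

flume-ng agent --conf /etc/flume/conf --conf-file /etc/flume/conf/flume.conf --name agent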

Prepare sample data for a simple test

Create a file /tmp/es_log.log with the following data:

website:weblog:login_page weblog data1
website:weblog:profile_page weblog data2
website:weblog:transaction_page weblog data3
website:weblog:docs_page weblog data4
syslog:syslog:sysloggroup syslog data1
syslog:syslog:sysloggroup syslog data2
syslog:syslog:sysloggroup syslog data3
syslog:syslog:sysloggroup syslog data4

Restart Flume

/etc/init.d/flume-agent restart
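
After the agent has been running for a few moments, you can check that events are being indexed by searching the daily logstash-* indices over the HTTP API; replace localhost with your Elasticsearch host if it runs elsewhere:

curl -XGET 'http://localhost:9200/logstash-*/_search?q=*:*&pretty=true'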

Searching and Dashboarding with Kibana

Open $KIBANA_HOME/index.html in a browser. By default, the welcome page is shown.

[Screenshot: Kibana welcome page]
Click on “Logstash Dashboard” and select the appropriate time range to look at the charts based on the time-stamped fields.

[Screenshot: Logstash Dashboard with time-range selection]

These screenshots show the various charts available on search fields, e.g. pie, bar, and table charts.

[Screenshots: pie, bar, and table charts]

Content can be searched with custom filters and graphs can be plotted based on the search results as shown below.

[Screenshot: search with custom filters and plotted results]

Batch Indexing using MapReduce/Hive/Pig

Elasticsearch’s real-time search and analytics are natively integrated with Hadoop and support MapReduce, Cascading, Hive, and Pig.

Component, implementation, and notes:

  • MR2/YARN: ESInputFormat / ESOutputFormat (MapReduce input and output formats are provided by the library)
  • Hive: org.elasticsearch.hadoop.hive.ESStorageHandler (Hive SerDe implementation)
  • Pig: org.elasticsearch.hadoop.pig.ESStorage (Pig storage handler)

Detailed documentation and examples for the Elasticsearch Hadoop integration can be found at https://github.com/elasticsearch/elasticsearch-hadoop
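
As an illustration of the Hive integration, the following DDL is a sketch (not taken from the original article) that registers an external table backed by an Elasticsearch index using the storage handler listed above. The table and column names are made up for the example, 'logstash/logs' is an arbitrary index/type, and connection settings default to localhost:9200 (configured with es.host/es.port in es-hadoop releases of this era, later renamed es.nodes), so check the documentation linked above for the exact property names in your version:

CREATE EXTERNAL TABLE es_logs (
    event_source STRING,
    event_type   STRING,
    message      STRING)
STORED BY 'org.elasticsearch.hadoop.hive.ESStorageHandler'
TBLPROPERTIES('es.resource' = 'logstash/logs');

Writing to the table with INSERT OVERWRITE TABLE es_logs SELECT ... indexes the rows into Elasticsearch; reading back through Hive is also supported, although the es.resource syntax for queries differs between releases.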

Thoughts on Best Practices

  1. Set discovery.zen.minimum_master_nodes to N/2 + 1, where N is the number of master-eligible nodes, to avoid split brain.
  2. Set action.disable_delete_all_indices to prevent accidental deletion of all indices.
  3. Set gateway.recover_after_nodes to the number of nodes that should be up before the recovery process starts replicating data around the cluster.
  4. Relax the near real-time refresh from 1 second to something a bit higher (index.engine.robin.refresh_interval).
  5. Increase the memory allocated to the Elasticsearch node; by default it is 1g.
  6. Use Java 7 if possible for better performance with Elasticsearch.
  7. Set index.fielddata.cache: soft to avoid OutOfMemory errors.
  8. Use higher batch sizes in the Flume sink, e.g. 1000, for higher throughput.
  9. Increase the open file limits for Elasticsearch (illustrative settings for several of these items follow the list).
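
As a consolidated, illustrative example of some of these settings (the values are placeholders for a small three-node cluster and should be tuned for your own hardware and data volumes):

# elasticsearch.yml
discovery.zen.minimum_master_nodes: 2    # 3 master-eligible nodes / 2 + 1
action.disable_delete_all_indices: true
gateway.recover_after_nodes: 2
index.engine.robin.refresh_interval: 5s
index.fielddata.cache: soft

# Environment (e.g. /etc/sysconfig/elasticsearch or the init script)
ES_HEAP_SIZE=4g

# Raise the open file limit for the elasticsearch user, e.g. in /etc/security/limits.conf
elasticsearch - nofile 65535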
