filebeat采集中断与变慢问题分析

本文主要是介绍filebeat采集中断与变慢问题分析，希望对大家解决编程问题提供一定的参考价值，需要的开发者们随着小编来一起学习吧！

4、未采集的那段时间内无以下日志，这段时间内数据源正常，应能被正常采集到。

5、相关进程资源，服务器磁盘、cpu、内存无明显异常。

6、日志中断前有如下报错。

2022-02-15T15:22:22.223+0800 INFO log/harvester.go:254 Harvester started for file: /opt/smbdata/VIBE4/V0403/SPI/Kohyong/20220215151502.TXT

2022-02-15T15:22:22.224+0800 ERROR log/harvester.go:281 Read line error: No more bytes; File: /opt/smbdata/VIBE4/V0403/SPI/Kohyong/20220215151502.TXT

2022-02-15T15:22:45.950+0800 ERROR log/input.go:222 input state for /opt/smbdata/VIBE7/V0703/SPI/Kohyong/20220215072255.TXT was not removed: stat /opt/smbdata/VIBE7/V0703/SPI/Kohyong/20220215072255.TXT: host is down

2022-02-15T15:22:50.183+0800 ERROR instance/metrics.go:91 Error while getting memory usage: error retrieving process stats: cannot find matching process for pid=17295

2022-02-15T15:22:50.184+0800 ERROR instance/metrics.go:135 Error retrieving CPU percentages: error retrieving process stats: cannot find matching process for pid=17295

2022-02-15T15:22:50.184+0800 ERROR instance/metrics_file_descriptors.go:39 Error while retrieving FD information: error retrieving process stats: cannot find matching process for pid=17295

2022-02-15T15:22:50.184+0800 INFO [monitoring] log/log.go:144 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"info":{"ephemeral_id":"31404935-8b04-455d-bb4b-400a9f71b573","uptime":{"ms":3455370021}},"memstats":{"gc_next":356984704,"memory_alloc":225038448,"memory_total":17130440800504}},"filebeat":{"events":{"added":214,"done":214},"harvester":{"closed":68,"open_files":0,"running":0,"started":68}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":68,"batches":1,"total":68},"write":{"bytes":29220}},"pipeline":{"clients":1,"events":{"active":0,"filtered":146,"published":68,"total":214},"queue":{"acked":68}}},"registrar":{"states":{"cleanup":81,"current":11912,"update":214},"writes":{"success":3,"total":3}},"system":{"load":{"1":1.71,"15":1.96,"5":2.02,"norm":{"1":0.2138,"15":0.245,"5":0.2525}}}}}}

2022-02-15T15:23:05.952+0800 ERROR log/input.go:222 input state for /opt/smbdata/VIBE7/V0703/SPI/Kohyong/20220215072416.TXT was not removed: stat /opt/smbdata/VIBE7/V0703/SPI/Kohyong/20220215072416.TXT: host is down

2022-02-15T15:23:20.183+0800 ERROR instance/metrics.go:135 Error retrieving CPU percentages: error retrieving process stats: cannot find matching process for pid=17295

2022-02-15T15:23:20.184+0800 ERROR instance/metrics_file_descriptors.go:39 Error while retrieving FD information: error retrieving process stats: cannot find matching process for pid=17295

2022-02-15T15:23:20.184+0800 ERROR instance/metrics.go:91 Error while getting memory usage: error retrieving process stats: cannot find matching process for pid=17295

2022-02-15T15:23:20.184+0800 INFO [monitoring] log/log.go:144 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"info":{"ephemeral_id":"31404935-8b04-455d-bb4b-400a9f71b573","uptime":{"ms":3455400022}},"memstats":{"gc_next":356984704,"memory_alloc":225191840,"memory_total":17130440953896}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":1,"events":{"active":0}}},"registrar":{"states":{"current":11912}},"system":{"load":{"1":1.51,"15":1.93,"5":1.94,"norm":{"1":0.1888,"15":0.2413,"5":0.2425}}}}}}

2022-02-15T15:23:25.959+0800 ERROR log/input.go:222 input state for /opt/smbdata/VIBE7/V0703/SPI/Kohyong/20220215072224.TXT was not removed: stat /opt/smbdata/VIBE7/V0703/SPI/Kohyong/20220215072224.TXT: host is down

2022-02-15T15:23:46.185+0800 ERROR log/input.go:222 input state for /opt/smbdata/VIBE7/V0703/SPI/Kohyong/20220215072410.TXT was not removed: stat /opt/smbdata/VIBE7/V0703/SPI/Kohyong/20220215072410.TXT: host is down

2022-02-15T15:23:50.183+0800 ERROR instance/metrics.go:91 Error while getting memory usage: error retrieving process stats: cannot find matching process for pid=17295

2022-02-15T15:23:50.184+0800 ERROR instance/metrics.go:135 Error retrieving CPU percentages: error retrieving process stats: cannot find matching process for pid=17295

2022-02-15T15:23:50.184+0800 ERROR instance/metrics_file_descriptors.go:39 Error while retrieving FD information: error retrieving process stats: cannot find matching process for pid=17295

2022-02-15T15:23:50.184+0800 INFO [monitoring] log/log.go:144 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"info":{"ephemeral_id":"31404935-8b04-455d-bb4b-400a9f71b573","uptime":{"ms":3455430020}},"memstats":{"gc_next":356984704,"memory_alloc":225324160,"memory_total":17130441086216}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":1,"events":{"active":0}}},"registrar":{"states":{"current":11912}},"system":{"load":{"1":1.45,"15":1.91,"5":1.88,"norm":{"1":0.1813,"15":0.2388,"5":0.235}}}}}}

2022-02-15T15:24:06.228+0800 ERROR log/input.go:222 input state for /opt/smbdata/VIBE7/V0703/SPI/Kohyong/20220215072336.TXT was not removed: stat /opt/smbdata/VIBE7/V0703/SPI/Kohyong/20220215072336.TXT: host is down

2022-02-15T15:24:20.183+0800 ERROR instance/metrics_file_descriptors.go:39 Error while retrieving FD information: error retrieving process stats: cannot find matching process for pid=17295

2022-02-15T15:24:20.184+0800 ERROR instance/metrics.go:91 Error while getting memory usage: error retrieving process stats: cannot find matching process for pid=17295

2022-02-15T15:24:20.184+0800 ERROR instance/metrics.go:135 Error retrieving CPU percentages: error retrieving process stats: cannot find matching process for pid=17295

2022-02-15T15:24:20.184+0800 INFO [monitoring] log/log.go:144 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"info":{"ephemeral_id":"31404935-8b04-455d-bb4b-400a9f71b573","uptime":{"ms":3455460021}},"memstats":{"gc_next":356984704,"memory_alloc":225444480,"memory_total":17130441206536}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":1,"events":{"active":0}}},"registrar":{"states":{"current":11912}},"system":{"load":{"1":1.46,"15":1.9,"5":1.84,"norm":{"1":0.1825,"15":0.2375,"5":0.23}}}}}}

2022-02-15T15:24:26.241+0800 ERROR log/input.go:222 input state for /opt/smbdata/VIBE7/V0703/SPI/Kohyong/20220215072238.TXT was not removed: stat /opt/smbdata/VIBE7/V0703/SPI/Kohyong/20220215072238.TXT: host is down

2022-02-15T15:24:49.373+0800 ERROR log/input.go:222 input state for /opt/smbdata/VIBE7/V0703/SPI/Kohyong/20220215072350.TXT was not removed: stat /opt/smbdata/VIBE7/V0703/SPI/Kohyong/20220215072350.TXT: host is down

据现场反馈车间设备有时候会停机，有的时候有host down 等开机的时候就自动恢复了，不会有问题。有的时候就不行了，中断接近1天。日志举例如下：

7、2月20日早8点40上述问题复现，也是数据自动恢复（恢复的非常慢）。

8、现场环境。

一、使用Linux开发环境搭建三使用mount -t cifs 挂载windows共享目录方法与问题解决-CSDN博客类似这种搭建的日志挂载共享；

二、未发现任何规律性，比如固定某通道、某时间发生；

三、现场曾使用原生filebeat7.0版本跟logstash配套使用，某1个设备挂载丢了（这个丢了的设备有可能机器重启后不采了，其他没丢的设备一直采集正常）不影响其它设备，目前使用edge是影响整个通道下所有设备；

四、未使用更高版本filebeat验证是否能解决以上问题；

五、设备重启后至少设备已经恢复一段时间了，且目录下写入了能在Linux下看到，但此时filebeat仍未恢复；

六、现场挂载命令为（一次性执行，在设备重启挂载丢失前后没有任何人工或其它程序或脚本干预）：mount -t cifs //10.124.156.31/spimes /opt/smbdata/MOTO3/M0303/SPI/Kohyong -o vers=2.1。

七、据现场反馈“基本每次重启都会恢复”。

9、测试环境复现

使用现场几乎相同的配置与日志数据，复现了host is down但未复现现场采集中断问题（无论是修改版7.0.2或官方7.0.0）；
发现filebeat在日志表现上唯一与现场区别之处为一直在不断打印host is down。而现场日志day15-16filebeat.1只打印了几条短时的host is down，长达几个小时后直到采集恢复。

10、因现场设备IP变动，使用以下2条命令重新进行了挂载，采集中断，查日志与以上情况相同

umount -l /opt/smbdata/MOTO13/M1304/AOI/JUTZE -o vers=2.1

mount -t cifs //10.124.156.204/F20PRE /opt/smbdata/MOTO13/M1304/AOI/JUTZE -o vers=2.1

umount -l /opt/smbdata/MOTO13/M1306/AOI/JUTZE -o vers=2.1

mount -t cifs //10.124.156.206/F20POST /opt/smbdata/MOTO13/M1306/AOI/JUTZE -o vers=2.1

11、现场反馈Filebeat的Registry文件越来越大，重启时恢复慢，一般删data目录后再启动，具体见Filebeat重启后不收集的问题分析 - NYC's Blog (niyanchun.com)

12、在配置上加logging.level: debug，待问题复现时获取filebeat库执行日志

14、现场配置，现场registry目录下data.json大小为2-4M，可能是这一区别造成未复现。

15、根据Elastic Stack最佳实践系列：filebeat CPU使用率过高分析 - 云+社区 - 腾讯云 (tencent.com)文档描述，修改filebeat代码，在启动fb进程时加启动参数，如下：

16、现场日志格式举例。

17、2月28日问题复现，根据以下现象初步判断，有至少2个目录host is down（此时挂载设备的ip ping不通），里面的大量文件txt异常。在代码的for循环里执行的很慢（手动命令ll一个txt也是相当耗时，约10s）。n个文件乘以10s就是在这里死循环一样出不来，耽误了正常文件的采集。

解决方案1：在上述for循环代码中加判断，当某一个文件出现host is down时取其上一级路径进行循环时匹配，如属于同一目录则整个上一级目录跳过不再执行os.Stat(state.Source)

18、删除data目录后重启通道则正常（http://10.114.113.105:8080/ AOI通道）。现场配置如下（单个/opt/smbdata/MOTO10/M1004/AOI/JUTZE/*.txt目录下文件数约1.6万个，registery文件约4M）：

output.console: {enabled: true}

processors:

- drop_fields:

fields: [input, agent, host, ecs]

logging.level: debug

logging.files.rotateeverybytes: 104857600

logging.files.keepfiles: 20

filebeat.inputs:

- type: log

paths:[/opt/smbdata/MOTO1/M0104/AOI/JUTZE/*.txt, /opt/smbdata/MOTO2/M0204/AOI/JUTZE/*.txt,

/opt/smbdata/MOTO3/M0304/AOI/JUTZE/*.txt, /opt/smbdata/MOTO4/M0404/AOI/JUTZE/*.txt,

/opt/smbdata/MOTO5/M0504/AOI/JUTZE/*.txt, /opt/smbdata/MOTO6/M0604/AOI/JUTZE/*.txt,

/opt/smbdata/MOTO7/M0704/AOI/JUTZE/*.txt, /opt/smbdata/MOTO8/M0804/AOI/JUTZE/*.txt,

/opt/smbdata/MOTO9/M0904/AOI/JUTZE/*.txt, /opt/smbdata/MOTO10/M1004/AOI/JUTZE/*.txt,

/opt/smbdata/MOTO11/M1104/AOI/JUTZE/*.txt, /opt/smbdata/MOTO12/M1204/AOI/JUTZE/*.txt,

/opt/smbdata/MOTO13/M1304/AOI/JUTZE/*.txt, /opt/smbdata/MOTO1/M0106/AOI/JUTZE/*.txt,

/opt/smbdata/MOTO2/M0206/AOI/JUTZE/*.txt, /opt/smbdata/MOTO3/M0306/AOI/JUTZE/*.txt,

/opt/smbdata/MOTO4/M0406/AOI/JUTZE/*.txt, /opt/smbdata/MOTO5/M0506/AOI/JUTZE/*.txt,

/opt/smbdata/MOTO6/M0606/AOI/JUTZE/*.txt, /opt/smbdata/MOTO7/M0706/AOI/JUTZE/*.txt,

/opt/smbdata/MOTO8/M0806/AOI/JUTZE/*.txt, /opt/smbdata/MOTO9/M0906/AOI/JUTZE/*.txt,

/opt/smbdata/MOTO10/M1006/AOI/JUTZE/*.txt, /opt/smbdata/MOTO11/M1106/AOI/JUTZE/*.txt,

/opt/smbdata/MOTO12/M1206/AOI/JUTZE/*.txt, /opt/smbdata/MOTO13/M1306/AOI/JUTZE/*.txt,

/opt/smbdata/MOTO14/M1404/AOI/JUTZE/*.txt, /opt/smbdata/MOTO14/M1406/AOI/JUTZE/*.txt,

/opt/smbdata/MOTO15/M1504/AOI/JUTZE/*.txt, /opt/smbdata/MOTO15/M1506/AOI/JUTZE/*.txt,

/opt/smbdata/MOTO20/M2004/AOI/JUTZE/*.txt, /opt/smbdata/MOTO20/M2006/AOI/JUTZE/*.txt,

/opt/smbdata/MOTO21/M2104/AOI/JUTZE/*.txt, /opt/smbdata/MOTO21/M2106/AOI/JUTZE/*.txt]

encoding: gbk

scan_frequency: 2s

harvester_buffer_size: 16384

max_bytes: 10485760

is_eof: true

is_read_all: false

close_renamed: true

close_removed: true

close_inactive: 3s

close_eof: true

clean_removed: true

delay_time: 0

spool_size: 2048

batch_size: 50

ignore_older: 4h

clean_inactive: 5h

multiline.pattern: ^\w

multiline.negate: false

multiline.match: after

multiline.max_lines: 500

multiline.timeout: 5s

groups:

fields: &id001 {id: '902623264956878848'}

group: '33'

fields: *id001

这篇关于filebeat采集中断与变慢问题分析的文章就介绍到这儿，希望我们推荐的文章对编程师们有所帮助！

filebeat采集中断与变慢问题分析

相关文章

Spring事务中@Transactional注解不生效的原因分析与解决

SpringBoot启动报错的11个高频问题排查与解决终极指南

找不到Anaconda prompt终端的原因分析及解决方案

Spring定时任务只执行一次的原因分析与解决方案

MySQL新增字段后Java实体未更新的潜在问题与解决方案

如何解决mysql出现Incorrect string value for column ‘表项‘ at row 1错误问题

如何解决Spring MVC中响应乱码问题

pip无法安装osgeo失败的问题解决

解决Java中基于GeoTools的Shapefile读取乱码的问题

C++ 各种map特点对比分析