Kafka Monitoring and Security Considerations for Remote Monitoring Using JMX

2024-04-18 10:28

This article introduces Kafka monitoring and the security considerations for remote monitoring using JMX. We hope it serves as a useful reference for developers working on related problems.

Contents

1. Introduction

2. Kafka Monitoring

2.1. Overview

2.2. Security Considerations for Remote Monitoring Using JMX


1. Introduction

    As is well known, Kafka is designed for strong durability and fault tolerance. Because Kafka is a distributed system, topics are partitioned and replicated across multiple nodes. With meaningful performance monitoring and timely alerting on problems, Kafka can therefore be a very attractive option for data integration. Essentially, when troubleshooting Kafka issues, an application manager gathers all of the performance metrics and alerts for the people who need to take corrective action.

2. Kafka Monitoring

2.1. Overview

From the Kafka documentation: Kafka uses Yammer Metrics for metrics reporting in the server. The Java clients use Kafka Metrics, a built-in metrics registry that minimizes transitive dependencies pulled into client applications. Both expose metrics via JMX and can be configured to report stats using pluggable stats reporters to hook up to your monitoring system.
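    As an illustration of the pluggable reporter mechanism, the sketch below configures a Java consumer through the metric.reporters setting. It lists the built-in org.apache.kafka.common.metrics.JmxReporter purely as a stand-in for where a custom MetricsReporter implementation would be plugged in; the bootstrap address and group id are assumptions, not values from the documentation.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerMetricsConfigDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");    // assumed broker address
        props.put("group.id", "metrics-demo");                // arbitrary example group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        // metric.reporters takes a comma-separated list of MetricsReporter implementations.
        // The built-in JmxReporter is listed here only to show the mechanism; a real
        // deployment would plug in its own reporter that pushes stats to a monitoring system.
        props.put("metric.reporters", "org.apache.kafka.common.metrics.JmxReporter");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Client metrics (records-consumed-rate, records-consumed-total, ...) are now
            // exposed via JMX and handed to any configured reporters.
            System.out.println("Registered client metrics: " + consumer.metrics().size());
        }
    }
}
```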

From the Kafka documentation: All Kafka rate metrics have a corresponding cumulative count metric with suffix -total. For example, records-consumed-rate has a corresponding metric named records-consumed-total.

From the Kafka documentation: The easiest way to see the available metrics is to fire up jconsole and point it at a running kafka client or server; this will allow browsing all metrics with JMX.
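    The same metrics that jconsole displays can also be read programmatically through the standard javax.management.remote API. The minimal sketch below assumes a broker started locally with JMX_PORT=9999 and no JMX authentication (a test-only setup), and reads one of the broker metrics listed later in this article.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxMetricProbe {
    public static void main(String[] args) throws Exception {
        // Assumed address of a broker started with JMX_PORT=9999 (test setup, no auth).
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // All-topic incoming message rate; the Yammer meter also exposes Count,
            // FiveMinuteRate, FifteenMinuteRate and MeanRate attributes.
            ObjectName messagesIn =
                    new ObjectName("kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec");
            System.out.println("MessagesInPerSec (1-minute rate): "
                    + mbsc.getAttribute(messagesIn, "OneMinuteRate"));
        }
    }
}
```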

2.2. Security Considerations for Remote Monitoring Using JMX

From the Kafka documentation: Apache Kafka disables remote JMX by default. You can enable remote monitoring using JMX by setting the environment variable JMX_PORT for processes started using the CLI or standard Java system properties to enable remote JMX programmatically. You must enable security when enabling remote JMX in production scenarios to ensure that unauthorized users cannot monitor or control your broker or application as well as the platform on which these are running. Note that authentication is disabled for JMX by default in Kafka and security configs must be overridden for production deployments by setting the environment variable KAFKA_JMX_OPTS for processes started using the CLI or by setting appropriate Java system properties. See Monitoring and Management Using JMX Technology for details on securing JMX.
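    On the client side of such a secured setup, a monitoring tool has to present credentials when it connects. The sketch below is illustrative only: the broker host, port, role name and password are placeholders, and they must match whatever jmxremote.password/jmxremote.access (or SSL) configuration was supplied to the broker via KAFKA_JMX_OPTS.

```java
import java.util.HashMap;
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class AuthenticatedJmxClient {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint of a broker whose remote JMX was secured via KAFKA_JMX_OPTS
        // (com.sun.management.jmxremote.authenticate=true, ideally with SSL as well).
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker1.example.com:9999/jmxrmi");

        // Credentials must match an entry in the broker's jmxremote.password and
        // jmxremote.access files; "monitorRole"/"changeit" are placeholders only.
        Map<String, Object> env = new HashMap<>();
        env.put(JMXConnector.CREDENTIALS, new String[] {"monitorRole", "changeit"});

        try (JMXConnector connector = JMXConnectorFactory.connect(url, env)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // Simple sanity check: count the broker-side MBeans in the kafka.server domain.
            int count = mbsc.queryNames(new ObjectName("kafka.server:*"), null).size();
            System.out.println("kafka.server MBeans visible: " + count);
        }
    }
}
```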

    We do graphing and alerting on the following metrics:

Each entry below lists the metric description, its MBean name, and the normal value where the Kafka documentation specifies one.

Message in rate
    MBean name: kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=([-.\w]+)
    Normal value: Incoming message rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.

Byte in rate from clients
    MBean name: kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=([-.\w]+)
    Normal value: Byte in (from the clients) rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.

Byte in rate from other brokers
    MBean name: kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesInPerSec
    Normal value: Byte in (from the other brokers) rate across all topics.

Controller Request rate from Broker
    MBean name: kafka.controller:type=ControllerChannelManager,name=RequestRateAndQueueTimeMs,brokerId=([0-9]+)
    Normal value: The rate (requests per second) at which the ControllerChannelManager takes requests from the queue of the given broker. And the time it takes for a request to stay in this queue before it is taken from the queue.

Controller Event queue size
    MBean name: kafka.controller:type=ControllerEventManager,name=EventQueueSize
    Normal value: Size of the ControllerEventManager's queue.

Controller Event queue time
    MBean name: kafka.controller:type=ControllerEventManager,name=EventQueueTimeMs
    Normal value: Time that takes for any event (except the Idle event) to wait in the ControllerEventManager's queue before being processed

Request rate
    MBean name: kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower},version=([0-9]+)

Error rate
    MBean name: kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=([-.\w]+),error=([-.\w]+)
    Normal value: Number of errors in responses counted per-request-type, per-error-code. If a response contains multiple errors, all are counted. error=NONE indicates successful responses.

Produce request rate
    MBean name: kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec,topic=([-.\w]+)
    Normal value: Produce request rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.

Fetch request rate
    MBean name: kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec,topic=([-.\w]+)
    Normal value: Fetch request (from clients or followers) rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.

Failed produce request rate
    MBean name: kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec,topic=([-.\w]+)
    Normal value: Failed Produce request rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.

Failed fetch request rate
    MBean name: kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec,topic=([-.\w]+)
    Normal value: Failed Fetch request (from clients or followers) rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.

Request size in bytes
    MBean name: kafka.network:type=RequestMetrics,name=RequestBytes,request=([-.\w]+)
    Normal value: Size of requests for each request type.

Temporary memory size in bytes
    MBean name: kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request={Produce|Fetch}
    Normal value: Temporary memory used for message format conversions and decompression.

Message conversion time
    MBean name: kafka.network:type=RequestMetrics,name=MessageConversionsTimeMs,request={Produce|Fetch}
    Normal value: Time in milliseconds spent on message format conversions.

Message conversion rate
    MBean name: kafka.server:type=BrokerTopicMetrics,name={Produce|Fetch}MessageConversionsPerSec,topic=([-.\w]+)
    Normal value: Message format conversion rate, for Produce or Fetch requests, per topic. Omitting 'topic=(...)' will yield the all-topic rate.

Request Queue Size
    MBean name: kafka.network:type=RequestChannel,name=RequestQueueSize
    Normal value: Size of the request queue.

Byte out rate to clients
    MBean name: kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=([-.\w]+)
    Normal value: Byte out (to the clients) rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.

Byte out rate to other brokers
    MBean name: kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesOutPerSec
    Normal value: Byte out (to the other brokers) rate across all topics.

Rejected byte rate
    MBean name: kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec,topic=([-.\w]+)
    Normal value: Rejected byte rate per topic, due to the record batch size being greater than max.message.bytes configuration. Omitting 'topic=(...)' will yield the all-topic rate.

Message validation failure rate due to no key specified for compacted topic
    MBean name: kafka.server:type=BrokerTopicMetrics,name=NoKeyCompactedTopicRecordsPerSec
    Normal value: 0

Message validation failure rate due to invalid magic number
    MBean name: kafka.server:type=BrokerTopicMetrics,name=InvalidMagicNumberRecordsPerSec
    Normal value: 0

Message validation failure rate due to incorrect crc checksum
    MBean name: kafka.server:type=BrokerTopicMetrics,name=InvalidMessageCrcRecordsPerSec
    Normal value: 0

Message validation failure rate due to non-continuous offset or sequence number in batch
    MBean name: kafka.server:type=BrokerTopicMetrics,name=InvalidOffsetOrSequenceRecordsPerSec
    Normal value: 0

Log flush rate and time
    MBean name: kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs

# of offline log directories
    MBean name: kafka.log:type=LogManager,name=OfflineLogDirectoryCount
    Normal value: 0

Leader election rate
    MBean name: kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs
    Normal value: non-zero when there are broker failures

Unclean leader election rate
    MBean name: kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec
    Normal value: 0

Is controller active on broker
    MBean name: kafka.controller:type=KafkaController,name=ActiveControllerCount
    Normal value: only one broker in the cluster should have 1

Pending topic deletes
    MBean name: kafka.controller:type=KafkaController,name=TopicsToDeleteCount

Pending replica deletes
    MBean name: kafka.controller:type=KafkaController,name=ReplicasToDeleteCount

Ineligible pending topic deletes
    MBean name: kafka.controller:type=KafkaController,name=TopicsIneligibleToDeleteCount

Ineligible pending replica deletes
    MBean name: kafka.controller:type=KafkaController,name=ReplicasIneligibleToDeleteCount

# of under replicated partitions (|ISR| < |all replicas|)
    MBean name: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
    Normal value: 0

# of under minIsr partitions (|ISR| < min.insync.replicas)
    MBean name: kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount
    Normal value: 0

# of at minIsr partitions (|ISR| = min.insync.replicas)
    MBean name: kafka.server:type=ReplicaManager,name=AtMinIsrPartitionCount
    Normal value: 0

Producer Id counts
    MBean name: kafka.server:type=ReplicaManager,name=ProducerIdCount
    Normal value: Count of all producer ids created by transactional and idempotent producers in each replica on the broker

Partition counts
    MBean name: kafka.server:type=ReplicaManager,name=PartitionCount
    Normal value: mostly even across brokers

Offline Replica counts
    MBean name: kafka.server:type=ReplicaManager,name=OfflineReplicaCount
    Normal value: 0

Leader replica counts
    MBean name: kafka.server:type=ReplicaManager,name=LeaderCount
    Normal value: mostly even across brokers

ISR shrink rate
    MBean name: kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
    Normal value: If a broker goes down, ISR for some of the partitions will shrink. When that broker is up again, ISR will be expanded once the replicas are fully caught up. Other than that, the expected value for both ISR shrink rate and expansion rate is 0.

ISR expansion rate
    MBean name: kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
    Normal value: See above

Failed ISR update rate
    MBean name: kafka.server:type=ReplicaManager,name=FailedIsrUpdatesPerSec
    Normal value: 0

Max lag in messages btw follower and leader replicas
    MBean name: kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica
    Normal value: lag should be proportional to the maximum batch size of a produce request.

Lag in messages per follower replica
    MBean name: kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)
    Normal value: lag should be proportional to the maximum batch size of a produce request.

Requests waiting in the producer purgatory
    MBean name: kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce
    Normal value: non-zero if ack=-1 is used

Requests waiting in the fetch purgatory
    MBean name: kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Fetch
    Normal value: size depends on fetch.wait.max.ms in the consumer

Request total time
    MBean name: kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer|FetchFollower}
    Normal value: broken into queue, local, remote and response send time

Time the request waits in the request queue
    MBean name: kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request={Produce|FetchConsumer|FetchFollower}

Time the request is processed at the leader
    MBean name: kafka.network:type=RequestMetrics,name=LocalTimeMs,request={Produce|FetchConsumer|FetchFollower}

Time the request waits for the follower
    MBean name: kafka.network:type=RequestMetrics,name=RemoteTimeMs,request={Produce|FetchConsumer|FetchFollower}
    Normal value: non-zero for produce requests when ack=-1

Time the request waits in the response queue
    MBean name: kafka.network:type=RequestMetrics,name=ResponseQueueTimeMs,request={Produce|FetchConsumer|FetchFollower}

Time to send the response
    MBean name: kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request={Produce|FetchConsumer|FetchFollower}

Number of messages the consumer lags behind the producer by. Published by the consumer, not broker.
    MBean name: kafka.consumer:type=consumer-fetch-manager-metrics,client-id={client-id} Attribute: records-lag-max

The average fraction of time the network processors are idle
    MBean name: kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent
    Normal value: between 0 and 1, ideally > 0.3

The number of connections disconnected on a processor due to a client not re-authenticating and then using the connection beyond its expiration time for anything other than re-authentication
    MBean name: kafka.server:type=socket-server-metrics,listener=[SASL_PLAINTEXT|SASL_SSL],networkProcessor=<#>,name=expired-connections-killed-count
    Normal value: ideally 0 when re-authentication is enabled, implying there are no longer any older, pre-2.2.0 clients connecting to this (listener, processor) combination

The total number of connections disconnected, across all processors, due to a client not re-authenticating and then using the connection beyond its expiration time for anything other than re-authentication
    MBean name: kafka.network:type=SocketServer,name=ExpiredConnectionsKilledCount
    Normal value: ideally 0 when re-authentication is enabled, implying there are no longer any older, pre-2.2.0 clients connecting to this broker

The average fraction of time the request handler threads are idle
    MBean name: kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent
    Normal value: between 0 and 1, ideally > 0.3

Bandwidth quota metrics per (user, client-id), user or client-id
    MBean name: kafka.server:type={Produce|Fetch},user=([-.\w]+),client-id=([-.\w]+)
    Normal value: Two attributes. throttle-time indicates the amount of time in ms the client was throttled. Ideally = 0. byte-rate indicates the data produce/consume rate of the client in bytes/sec. For (user, client-id) quotas, both user and client-id are specified. If per-client-id quota is applied to the client, user is not specified. If per-user quota is applied, client-id is not specified.

Request quota metrics per (user, client-id), user or client-id
    MBean name: kafka.server:type=Request,user=([-.\w]+),client-id=([-.\w]+)
    Normal value: Two attributes. throttle-time indicates the amount of time in ms the client was throttled. Ideally = 0. request-time indicates the percentage of time spent in broker network and I/O threads to process requests from client group. For (user, client-id) quotas, both user and client-id are specified. If per-client-id quota is applied to the client, user is not specified. If per-user quota is applied, client-id is not specified.

Requests exempt from throttling
    MBean name: kafka.server:type=Request
    Normal value: exempt-throttle-time indicates the percentage of time spent in broker network and I/O threads to process requests that are exempt from throttling.

ZooKeeper client request latency
    MBean name: kafka.server:type=ZooKeeperClientMetrics,name=ZooKeeperRequestLatencyMs
    Normal value: Latency in milliseconds for ZooKeeper requests from broker.

ZooKeeper connection status
    MBean name: kafka.server:type=SessionExpireListener,name=SessionState
    Normal value: Connection status of broker's ZooKeeper session which may be one of Disconnected|SyncConnected|AuthFailed|ConnectedReadOnly|SaslAuthenticated|Expired.

Max time to load group metadata
    MBean name: kafka.server:type=group-coordinator-metrics,name=partition-load-time-max
    Normal value: maximum time, in milliseconds, it took to load offsets and group metadata from the consumer offset partitions loaded in the last 30 seconds (including time spent waiting for the loading task to be scheduled)

Avg time to load group metadata
    MBean name: kafka.server:type=group-coordinator-metrics,name=partition-load-time-avg
    Normal value: average time, in milliseconds, it took to load offsets and group metadata from the consumer offset partitions loaded in the last 30 seconds (including time spent waiting for the loading task to be scheduled)

Max time to load transaction metadata
    MBean name: kafka.server:type=transaction-coordinator-metrics,name=partition-load-time-max
    Normal value: maximum time, in milliseconds, it took to load transaction metadata from the consumer offset partitions loaded in the last 30 seconds (including time spent waiting for the loading task to be scheduled)

Avg time to load transaction metadata
    MBean name: kafka.server:type=transaction-coordinator-metrics,name=partition-load-time-avg
    Normal value: average time, in milliseconds, it took to load transaction metadata from the consumer offset partitions loaded in the last 30 seconds (including time spent waiting for the loading task to be scheduled)

Rate of transactional verification errors
    MBean name: kafka.server:type=AddPartitionsToTxnManager,name=VerificationFailureRate
    Normal value: Rate of verifications that returned in failure either from the AddPartitionsToTxn API response or through errors in the AddPartitionsToTxnManager. In steady state 0, but transient errors are expected during rolls and reassignments of the transactional state partition.

Time to verify a transactional request
    MBean name: kafka.server:type=AddPartitionsToTxnManager,name=VerificationTimeMs
    Normal value: The amount of time queueing while a possible previous request is in-flight plus the round trip to the transaction coordinator to verify (or not verify)

Consumer Group Offset Count
    MBean name: kafka.server:type=GroupMetadataManager,name=NumOffsets
    Normal value: Total number of committed offsets for Consumer Groups

Consumer Group Count
    MBean name: kafka.server:type=GroupMetadataManager,name=NumGroups
    Normal value: Total number of Consumer Groups

Consumer Group Count, per State
    MBean name: kafka.server:type=GroupMetadataManager,name=NumGroups[PreparingRebalance,CompletingRebalance,Empty,Stable,Dead]
    Normal value: The number of Consumer Groups in each state: PreparingRebalance, CompletingRebalance, Empty, Stable, Dead

Number of reassigning partitions
    MBean name: kafka.server:type=ReplicaManager,name=ReassigningPartitions
    Normal value: The number of reassigning leader partitions on a broker.

Outgoing byte rate of reassignment traffic
    MBean name: kafka.server:type=BrokerTopicMetrics,name=ReassignmentBytesOutPerSec
    Normal value: 0; non-zero when a partition reassignment is in progress.

Incoming byte rate of reassignment traffic
    MBean name: kafka.server:type=BrokerTopicMetrics,name=ReassignmentBytesInPerSec
    Normal value: 0; non-zero when a partition reassignment is in progress.

Size of a partition on disk (in bytes)
    MBean name: kafka.log:type=Log,name=Size,topic=([-.\w]+),partition=([0-9]+)
    Normal value: The size of a partition on disk, measured in bytes.

Number of log segments in a partition
    MBean name: kafka.log:type=Log,name=NumLogSegments,topic=([-.\w]+),partition=([0-9]+)
    Normal value: The number of log segments in a partition.

First offset in a partition
    MBean name: kafka.log:type=Log,name=LogStartOffset,topic=([-.\w]+),partition=([0-9]+)
    Normal value: The first offset in a partition.

Last offset in a partition
    MBean name: kafka.log:type=Log,name=LogEndOffset,topic=([-.\w]+),partition=([0-9]+)
    Normal value: The last offset in a partition.
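    Many of the MBean names above are per-topic patterns ('topic=([-.\w]+)'), with the all-topic aggregate registered under the same name minus the topic key. As a hedged sketch (again assuming an unauthenticated test broker on localhost:9999), the per-topic incoming byte rates could be enumerated like this:

```java
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PerTopicByteInRates {
    public static void main(String[] args) throws Exception {
        // Assumed test broker started with JMX_PORT=9999 and no JMX authentication.
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // "topic=*" matches the per-topic meters; the MBean without a topic key
            // (not matched here) holds the all-topic aggregate.
            ObjectName pattern = new ObjectName(
                    "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=*");
            Set<ObjectName> names = mbsc.queryNames(pattern, null);
            for (ObjectName name : names) {
                Object rate = mbsc.getAttribute(name, "OneMinuteRate");
                System.out.printf("%s -> %s bytes/sec (1-minute rate)%n",
                        name.getKeyProperty("topic"), rate);
            }
        }
    }
}
```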

That concludes this article on Kafka monitoring and the security considerations for remote monitoring using JMX. We hope it is helpful to fellow developers.



http://www.chinasem.cn/article/914518
