Kafka in a Nutshell -- A Kafka Overview

2023-11-01 15:08
Tags: overview, kafka, nutshell


Kafka is a messaging system. That’s it. So why all the hype? In reality, messaging is a hugely important piece of infrastructure for moving data between systems. To see why, let’s look at a data pipeline without a messaging system.

This system starts with Hadoop for storage and data processing. Hadoop isn’t very useful without data, so the first stage in using Hadoop is getting data in.

Bringing Data into Hadoop

So far, not a big deal. Unfortunately, in the real world data exists on many systems in parallel, all of which need to interact with Hadoop and with each other. The situation quickly becomes more complex, ending with a system where multiple data systems are talking to one another over many channels. Each of these channels requires its own custom protocols and communication methods, and moving data between these systems becomes a full-time job for a team of developers.

Moving Data Between Systems

Let’s look at this picture again, using Kafka as a central messaging bus. All incoming data is first placed in Kafka and all outgoing data is read from Kafka. Kafka centralizes communication between producers of data and consumers of that data.

Moving Data Between Systems

What is Kafka?

Kafka is publish-subscribe messaging rethought as a distributed commit log.

Kafka Documentation http://kafka.apache.org/

Kafka is a distributed messaging system providing fast, highly scalable, and redundant messaging through a pub-sub model. Kafka’s distributed design gives it several advantages. First, Kafka allows a large number of permanent or ad-hoc consumers. Second, Kafka is highly available and resilient to node failures and supports automatic recovery. In real-world data systems, these characteristics make Kafka an ideal fit for communication and integration between components of large-scale data systems.

Kafka Terminology

The basic architecture of Kafka is organized around a few key terms: topics, producers, consumers, and brokers.

All Kafka messages are organized into topics. If you wish to send a message you send it to a specific topic, and if you wish to read a message you read it from a specific topic. A consumer of topics pulls messages off of a Kafka topic while producers push messages into a Kafka topic. Lastly, Kafka, as a distributed system, runs in a cluster. Each node in the cluster is called a Kafka broker.
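To make the terms concrete, here is a minimal toy model of a topic with produce and consume operations. All names are illustrative, not the real client API:

```python
# Toy model of Kafka's core terms: a topic holds messages, producers
# push into it, and consumers pull from it by position.

class Topic:
    def __init__(self, name):
        self.name = name
        self.messages = []          # the topic's ordered message log

    def produce(self, message):
        """A producer pushes a message onto the end of the topic."""
        self.messages.append(message)

    def consume(self, offset):
        """A consumer pulls the message stored at a given position."""
        return self.messages[offset]

clicks = Topic("page-clicks")
clicks.produce("user-1 viewed /home")
clicks.produce("user-2 viewed /about")
print(clicks.consume(0))  # → user-1 viewed /home
```

In a real cluster the topic's data lives on brokers, not in the client; the push/pull shape of the interaction is the point here.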

Anatomy of a Kafka Topic

Kafka topics are divided into a number of partitions. Partitions allow you to parallelize a topic by splitting the data in a particular topic across multiple brokers: each partition can be placed on a separate machine to allow for multiple consumers to read from a topic in parallel. Consumers can also be parallelized so that multiple consumers can read from multiple partitions in a topic, allowing for very high message processing throughput.
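When a message has a key, clients pick its partition deterministically so that all messages for one key land in one partition. The sketch below uses CRC32 as a stand-in hash (the Java client uses murmur2); only the modulo idea matters:

```python
# Sketch of keyed partitioning: hash the key, take it modulo the
# partition count. crc32 is a stand-in for the client's real hash.
import zlib

NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

# The same key always maps to the same partition, which is what
# preserves per-key message ordering.
assert partition_for("user-42") == partition_for("user-42")
print(partition_for("user-42"))
```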

Each message within a partition has an identifier called its offset. The offset gives the ordering of messages as an immutable sequence. Kafka maintains this message ordering for you. Consumers can read messages starting from a specific offset and are allowed to read from any offset point they choose, allowing consumers to join the cluster at any point in time they see fit. Given these constraints, each specific message in a Kafka cluster can be uniquely identified by a tuple consisting of the message’s topic, partition, and offset within the partition.

Log Anatomy

Another way to view a partition is as a log. A data source writes messages to the log and one or more consumers read from the log at the point in time they choose. In the diagram below a data source is writing to the log and consumers A and B are reading from the log at different offsets.
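The situation in the diagram can be sketched as a list plus a per-consumer position. Consumer names and offsets below are illustrative:

```python
# Sketch: a partition as a log, with consumers A and B reading at
# independent offsets, as in the diagram.
log = ["m0", "m1", "m2", "m3", "m4"]   # messages appended by a producer

offsets = {"A": 1, "B": 3}             # each consumer tracks its own position

def poll(consumer):
    """Read the next message for this consumer and advance its offset."""
    msg = log[offsets[consumer]]
    offsets[consumer] += 1
    return msg

print(poll("A"))  # → m1
print(poll("B"))  # → m3
```

Note that polling by one consumer does not disturb the other: the log itself is immutable and only the per-consumer offsets move.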

Data Log

Kafka retains messages for a configurable period of time and it is up to the consumers to adjust their behaviour accordingly. For instance, if Kafka is configured to keep messages for a day and a consumer is down for a period of longer than a day, the consumer will lose messages. However, if the consumer is down for an hour it can begin to read messages again starting from its last known offset. From the point of view of Kafka, it keeps no state on what the consumers are reading from a topic.
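The one-day example above can be sketched like this; timestamps and structure are illustrative, not the broker's actual log format:

```python
# Sketch of time-based retention: the broker prunes messages older than
# the retention window, so a consumer whose committed offset falls
# before the earliest retained message has permanently lost data.
RETENTION_SECONDS = 86_400  # one day, as in the example above

log = [  # (offset, written_at_seconds, payload)
    (0, 0, "old"),
    (1, 90_000, "fresh"),   # written 25 hours after message 0
]

now = 100_000  # message 0 is now older than the retention window

retained = [m for m in log if now - m[1] <= RETENTION_SECONDS]
earliest = retained[0][0]

consumer_offset = 0          # consumer was down for over a day
lost = earliest - consumer_offset
print(f"messages lost: {lost}")   # offset 0 was pruned before it was read
```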

Partitions and Brokers

Each broker holds a number of partitions and each of these partitions can be either a leader or a replica for a topic. All writes and reads to a topic go through the leader, and the leader coordinates updating replicas with new data. If a leader fails, a replica takes over as the new leader.

Partitions and Brokers

Producers

Producers write to a single leader; this provides a means of load balancing production so that each write can be serviced by a separate broker and machine. In the first image, the producer is writing to partition 0 of the topic and partition 0 replicates that write to the available replicas.

Producer writing to partition.

In the second image, the producer is writing to partition 1 of the topic and partition 1 replicates that write to the available replicas.

Producer writing to second partition.

Since a different machine is responsible for each write, the throughput of the system as a whole is increased.

Consumers and Consumer Groups

Consumers read from any single partition, allowing you to scale throughput of message consumption in a similar fashion to message production. Consumers can also be organized into consumer groups for a given topic: each consumer within the group reads from a unique partition and the group as a whole consumes all messages from the entire topic. Typically, you structure your Kafka cluster to have the same number of consumers as the number of partitions in your topics. If you have more consumers than partitions then some consumers will be idle because they have no partitions to read from. If you have more partitions than consumers then consumers will receive messages from multiple partitions. If you have equal numbers of consumers and partitions, you maximize efficiency.
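The three cases above fall out of one rule: each partition is assigned to exactly one consumer in the group. A minimal round-robin-style assignment sketch (real brokers use configurable assignors, e.g. range or round-robin):

```python
# Sketch of spreading a topic's partitions across a consumer group.
# Each partition goes to exactly one consumer; extra consumers sit idle.

def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Four partitions, two consumers: each consumer reads two partitions.
print(assign([0, 1, 2, 3], ["a1", "a2"]))
# Four partitions, six consumers: two consumers are left idle.
print(assign([0, 1, 2, 3], ["b1", "b2", "b3", "b4", "b5", "b6"]))
```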

The following picture from the Kafka documentation describes the situation with multiple partitions of a single topic. Server 1 holds partitions 0 and 3 and server 2 holds partitions 1 and 2. We have two consumer groups, A and B. A is made up of two consumers and B is made up of four consumers. Consumer Group A has two consumers of four partitions: each consumer reads from two partitions. Consumer Group B, on the other hand, has the same number of consumers as partitions and each consumer reads from exactly one partition.

Consumers and Consumer Groups

Consistency and Availability

Before beginning the discussion on consistency and availability, keep in mind that these guarantees hold as long as you are producing to one partition and consuming from one partition. All guarantees are off if you are reading from the same partition using two consumers or writing to the same partition using two producers.

Kafka makes the following guarantees about data consistency and availability: (1) messages sent to a topic partition will be appended to the commit log in the order they are sent, (2) a single consumer instance will see messages in the order they appear in the log, (3) a message is ‘committed’ when all in sync replicas have applied it to their log, and (4) any committed message will not be lost, as long as at least one in sync replica is alive.

The first and second guarantees ensure that message ordering is preserved for each partition. Note that message ordering for the entire topic is not guaranteed. The third and fourth guarantees ensure that committed messages can be retrieved. In Kafka, the partition that is elected the leader is responsible for syncing any messages received to replicas. Once a replica has acknowledged the message, that replica is considered to be in sync. To understand this further, let’s take a closer look at what happens during a write.

Handling Writes

When communicating with a Kafka cluster, all messages are sent to the partition’s leader. The leader is responsible for writing the message to its own in sync replica and, once that message has been committed, is responsible for propagating the message to additional replicas on different brokers. Each replica acknowledges that it has received the message and can now be called in sync.

Leader Writes to Replicas

When every broker in the cluster is available, consumers and producers can happily read and write from the leading partition of a topic without issue. Unfortunately, either leaders or replicas may fail and we need to handle each of these situations.

Handling Failure

What happens when a replica fails? Writes will no longer reach the failed replica and it will no longer receive messages, falling further and further out of sync with the leader. In the image below, Replica 3 is no longer receiving messages from the leader.

First Replica Fails

What happens when a second replica fails? The second replica will also no longer receive messages and it too becomes out of sync with the leader.

Second Replica Fails

At this point in time, only the leader is in sync. In Kafka terminology we still have one in sync replica even though that replica happens to be the leader for this partition.

What happens if the leader dies? We are left with three dead replicas.

Third Replica Fails

Replica one is actually still in sync: it cannot receive any new data but it is in sync with everything that was possible to receive. Replica two is missing some data, and replica three (the first to go down) is missing even more data. Given this state, there are two possible solutions. The first, and simplest, scenario is to wait until the leader is back up before continuing. Once the leader is back up it will begin receiving and writing messages and as the replicas are brought back online they will be made in sync with the leader. The second scenario is to elect the first broker to come back up as the new leader. This broker will be out of sync with the existing leader and all data written between the time when this broker went down and when it was elected the new leader will be lost. As additional brokers come back up, they will see that they have committed messages that do not exist on the new leader and drop those messages. By electing a new leader as soon as possible, messages may be dropped, but we minimize downtime, as any new machine can be leader.
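The "drop those messages" step in the second scenario can be sketched as log truncation; the broker and message names here are illustrative:

```python
# Sketch of the second recovery scenario: a returning broker finds
# messages on its log that the newly elected leader never saw, and
# truncates (drops) them to match the leader.

new_leader_log = ["m0", "m1"]                     # first broker back up
returning_replica_log = ["m0", "m1", "m2", "m3"]  # old leader, has extra data

def truncate_to_leader(replica, leader):
    """Drop everything past the point where the two logs diverge."""
    common = 0
    while (common < len(replica) and common < len(leader)
           and replica[common] == leader[common]):
        common += 1
    return replica[:common]

print(truncate_to_leader(returning_replica_log, new_leader_log))
# m2 and m3 are dropped: they were written only to the old leader.
```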

Taking a step back, we can view a scenario where the leader goes down while in sync replicas still exist.

Leader Fails

In this case, the Kafka controller will detect the loss of the leader and elect a new leader from the pool of in sync replicas. This may take a few seconds and result in LeaderNotAvailable errors from the client. However, no data loss will occur as long as producers and consumers handle this possibility and retry appropriately.

Consistency as a Kafka Client

Kafka clients come in two flavours: producer and consumer. Each of these can beconfigured to different levels of consistency.

For a producer we have three choices. On each message we can (1) wait for all in sync replicas to acknowledge the message, (2) wait for only the leader to acknowledge the message, or (3) not wait for acknowledgement. Each of these methods has its merits and drawbacks, and it is up to the system implementer to decide on the appropriate strategy for their system based on factors like consistency and throughput.
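These three choices correspond to the `acks` producer setting in the real clients (`all`, `1`, and `0` respectively). The pure-Python sketch below models only the trade-off, how many acknowledgements the producer waits for, with a hypothetical cluster of in-memory logs:

```python
# Sketch of the three producer acknowledgement choices. The message is
# always replicated; what differs is how many acknowledgements the
# producer waits for before considering the send complete.

def send(message, cluster, acks):
    for log in cluster:          # leader writes, then replicates to ISRs
        log.append(message)
    if acks == "none":
        return None              # (3) fire and forget: waited for nothing
    if acks == "leader":
        return 1                 # (2) waited for the leader's write only
    return len(cluster)          # (1) waited for every in sync replica

cluster = [[], [], []]           # one leader plus two followers
print(send("m0", cluster, acks="all"))   # → 3 acknowledgements awaited
```

Waiting for fewer acknowledgements raises throughput but risks losing a message that only the leader (or nobody) had durably written when a failure hits.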

On the consumer side, we can only ever read committed messages (i.e., those that have been written to all in sync replicas). Given that, we have three methods of providing consistency as a consumer: (1) receive each message at most once, (2) receive each message at least once, or (3) receive each message exactly once. Each of these scenarios deserves a discussion of its own.

For at most once message delivery, the consumer reads data from a partition, commits the offset that it has read, and then processes the message. If the consumer crashes between committing the offset and processing the message, it will restart from the next offset without ever having processed the message. This would lead to potentially undesirable message loss.

A better alternative is at least once message delivery. For at least once delivery, the consumer reads data from a partition, processes the message, and then commits the offset of the message it has processed. In this case, the consumer could crash between processing the message and committing the offset, and when the consumer restarts it will process the message again. This leads to duplicate messages in downstream systems but no data loss.
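The difference between the two schemes is purely the order of the commit and process steps, which a short simulation makes visible. The crash flag and function shape below are illustrative:

```python
# Sketch contrasting at-most-once and at-least-once consumption. We
# simulate a crash between the two steps; the order of "commit" versus
# "process" decides whether the message is lost or later duplicated.

def consume(log, committed, order, crash_between=False):
    processed = []
    offset = committed
    msg = log[offset]
    if order == "commit-first":              # at most once
        committed = offset + 1
        if crash_between:
            return processed, committed      # crash: message never processed
        processed.append(msg)
    else:                                    # process-first: at least once
        processed.append(msg)
        if crash_between:
            return processed, committed      # crash: offset not committed
        committed = offset + 1
    return processed, committed

log = ["m0"]
print(consume(log, 0, "commit-first", crash_between=True))
# → ([], 1): offset advanced but m0 never processed -- message lost
print(consume(log, 0, "process-first", crash_between=True))
# → (['m0'], 0): m0 processed but offset unchanged -- m0 will repeat
```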

Exactly once delivery is guaranteed by having the consumer process a message and commit the output of the message along with the offset to a transactional system. If the consumer crashes it can re-read the last transaction committed and resume processing from there. This leads to no data loss and no data duplication. In practice, however, exactly once delivery implies significantly decreasing the throughput of the system, as each message and offset is committed as a transaction.

In practice most Kafka consumer applications choose at least once delivery because it offers the best trade-off between throughput and correctness. It would be up to downstream systems to handle duplicate messages in their own way.

Conclusion

Kafka is quickly becoming the backbone of many organizations’ data pipelines, and with good reason. By using Kafka as a message bus we achieve a high level of parallelism and decoupling between data producers and data consumers, making our architecture more flexible and adaptable to change. This article provides a bird’s-eye view of Kafka architecture. From here, consult the Kafka documentation. Enjoy learning Kafka and putting this tool to more use!

from post: http://sookocheff.com/post/kafka/kafka-in-a-nutshell/

