In our last blog we described how we used Kafka to ingest data from SAP. In this blog we will explain what Kafka is, how it works, and why it is an important piece of infrastructure in modern large-scale data landscapes.
Kafka is a distributed, reliable messaging system for real-time processing of streaming messages. A reliable messaging system plays an important role in moving data between systems.
A data pipeline without a reliable messaging system would require direct communication between multiple data systems over many channels, where each channel needs its own protocol and its own way of moving data. In short, this leads to a mess of point-to-point connections once there are many different data systems; such a setup is hard to build, hard to maintain, and hard to keep operational. Below is a simple visualisation of this scenario when we want to load data from different sources into Hadoop.
To structure this scenario we need a central messaging system like Kafka to act as the intermediary between these data systems. The data systems then no longer communicate with each other over different protocols, but only with the messaging system, in our case Kafka. This reduces the complexity dramatically, since we only deal with the connection of each data system to the messaging system instead of a web of dependencies. It also gives us more flexibility: we can easily connect or disconnect data systems, because there are no other dependencies.
Kafka is a distributed messaging system providing fast, highly scalable and redundant messaging through a publish/subscribe model. This can be compared with taking out a subscription to a magazine: it is delivered to your mailbox every week or month. In the case of Kafka, messages of data are delivered, which is why these systems are called messaging systems.
Kafka allows for a large number of permanent and ad-hoc consumers or subscribers, is highly available and resilient to node failures when set up as a cluster, and supports automatic recovery.
In the real world these characteristics make Kafka an ideal fit for communication and integration between components of large-scale data systems. Kafka runs in production at thousands of companies.
When talking about Kafka, the following terminology is often used: topics, producers, consumers, subscribers, message brokers, partitions and offsets.
Messages in Kafka are organized in topics. Producers push (write) messages from a certain data system to a topic. Consumers can pull (read) messages from a topic after they subscribe to it, and there can be multiple subscribers or consumers on one topic. Kafka, as a distributed messaging system, runs in a cluster; each node in the cluster is called a message broker.
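This terminology can be illustrated with a minimal in-memory sketch (a hypothetical `Broker` class in Python, not a real Kafka client):

```python
from collections import defaultdict

# Minimal sketch of the publish/subscribe model: producers push to a
# topic, any number of consumers pull from it independently.
class Broker:
    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> list of messages

    def push(self, topic, message):
        """A producer pushes (writes) a message to a topic."""
        self.topics[topic].append(message)

    def pull(self, topic, position=0):
        """A consumer pulls (reads) messages from a topic, starting at `position`."""
        return self.topics[topic][position:]

broker = Broker()
broker.push("sap-orders", {"order_id": 1})
broker.push("sap-orders", {"order_id": 2})

# Two independent consumers subscribe to the same topic and both
# receive every message.
consumer_a = broker.pull("sap-orders")
consumer_b = broker.pull("sap-orders")
print(consumer_a)  # [{'order_id': 1}, {'order_id': 2}]
```

The topic name `sap-orders` is of course only an example; the point is that each consumer reads the topic independently of the others.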
Kafka topics can be divided into partitions. Partitions allow us to parallelize a topic by splitting its data across multiple brokers: each partition can be placed on a different machine, so multiple consumers can read from a topic in parallel.
Consumers can also be parallelized, so that multiple consumers read from multiple partitions in a topic for very high message throughput.
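How messages end up on a particular partition can be sketched with a hypothetical key-to-partition mapper (the real Kafka default partitioner hashes the key with murmur2, but the idea is the same):

```python
import hashlib

# Hypothetical partitioner sketch: hashing the message key means the
# same key always lands on the same partition, so per-key ordering is
# preserved while different partitions are consumed in parallel.
def partition_for(key: str, num_partitions: int) -> int:
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions

assignments = {key: partition_for(key, 4) for key in ["user-1", "user-2", "user-3"]}
# Re-hashing yields the same partition every time:
assert partition_for("user-1", 4) == assignments["user-1"]
```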
Kafka assigns every arriving message a unique, immutable sequence number called the offset. The offset is essentially the order in which messages arrive, and it is maintained by Kafka.
Consumers can read messages starting from a specific offset and are allowed to read from any offset point they choose, which lets consumers join the cluster at any point in time. Given these constraints, each message in a Kafka cluster can be uniquely identified by a tuple consisting of the topic, the partition, and the offset within the partition.
Reading from and writing to a Kafka partition can be viewed as working with a log file: new data is appended at the end of the log, and consumers can read from any point in the log, identified by an offset.
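This log-file view can be sketched as a small append-only structure (illustrative only):

```python
# A partition modeled as an append-only log: each append returns the
# message's offset, and a consumer can start reading from any offset.
class PartitionLog:
    def __init__(self):
        self._log = []

    def append(self, message) -> int:
        self._log.append(message)
        return len(self._log) - 1   # offset of the appended message

    def read_from(self, offset: int):
        return self._log[offset:]

log = PartitionLog()
for msg in ["m0", "m1", "m2", "m3"]:
    log.append(msg)

print(log.read_from(2))  # ['m2', 'm3'] -- a consumer joining at offset 2
```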
Kafka retains messages for a configurable period of time, specified with the parameter log.retention.hours in your broker configuration. After this period the messages are marked as deleted. A consumer loses access to these messages: if it tries to start reading at the offset of a message marked for deletion, it will instead continue reading from the offset of the first message that has not been marked for deletion.
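This retention behaviour can be sketched with a hypothetical `prune` helper (retention in seconds here instead of hours, for brevity):

```python
import time

# Sketch of time-based retention: messages older than the retention
# period are dropped; a consumer holding a stale offset resumes at the
# first retained message.
def prune(log, retention_seconds, now=None):
    now = now if now is not None else time.time()
    return [(ts, offset, msg) for ts, offset, msg in log
            if now - ts <= retention_seconds]

now = 1_000_000.0
log = [
    (now - 7200, 0, "old"),       # written two hours ago
    (now - 7100, 1, "old too"),
    (now - 60, 2, "fresh"),       # written one minute ago
]
retained = prune(log, retention_seconds=3600, now=now)
next_readable = min(offset for _, offset, _ in retained)
print(next_readable)  # 2 -- a consumer stuck at offset 0 resumes here
```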
Each broker can hold a number of partitions, and each of these partitions can be either a leader or a replica for a given topic. All writes and reads for a given topic go through the leader, and the leader coordinates updating the replicas with new data. If the leader fails, a replica takes over and becomes the new leader.
Producers write to a single leader; this provides load balancing, since each write can be served by a separate broker machine in the cluster. Because writes are spread over the machines in this way, the throughput of the system as a whole increases.
Consumers read from a single partition, which lets us scale the throughput of message consumption in a similar way to message production. Consumers can also be organized into consumer groups for a given topic: each consumer within the group reads from a unique partition, and the group as a whole consumes all messages of the entire topic.
There are three scenarios that can occur with consumer groups:

1. The number of consumers in the group equals the number of partitions: each consumer reads from exactly one partition.
2. There are more consumers than partitions: some consumers stay idle, since each partition is read by at most one consumer of the group.
3. There are fewer consumers than partitions: some consumers read from more than one partition.
Scenarios 1 and 3 are drawn in the picture above. Consumer group B has the same number of consumers as partitions, so each consumer reads from exactly one partition. Consumer group A has two consumers for four partitions, so each consumer reads from two partitions.
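The three scenarios can be sketched with a simple round-robin assignment function (a simplification; real Kafka uses pluggable assignor strategies such as range or round-robin):

```python
# Sketch of partition assignment within a consumer group: partitions
# are spread round-robin over the group's consumers; with more
# consumers than partitions, the extra consumers sit idle.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2, 3]

# Equal numbers -> exactly one partition per consumer.
print(assign(partitions, ["b1", "b2", "b3", "b4"]))
# More consumers than partitions -> some consumers are idle.
print(assign(partitions, ["c1", "c2", "c3", "c4", "c5"]))  # 'c5' gets []
# Fewer consumers than partitions -> each reads several partitions.
print(assign(partitions, ["a1", "a2"]))  # {'a1': [0, 2], 'a2': [1, 3]}
```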
Kafka makes the following guarantees about consistency and availability:

1. Messages sent to a topic partition are appended to the log in the order they are sent.
2. A single consumer instance sees messages in the order in which they appear in the log.
3. A message is "committed" when all in-sync replicas have applied it to their log.
4. A committed message will not be lost, as long as at least one in-sync replica is alive.
The first and second guarantee ensure that message ordering is preserved within each partition; note that message ordering for the entire topic is not guaranteed. The third and fourth guarantee ensure that committed messages can be retrieved.
All guarantees Kafka makes about consistency and availability apply only when producing to one partition and consuming from one partition. If you read from the same partition using two consumers, or write to the same partition using two producers, all guarantees are off.
The partition leader writes a message to its own log first and then propagates it to the replicas on other brokers. Each replica acknowledges that it has received the message and is then considered in sync; the message is committed once all in-sync replicas have applied it. When every broker in the cluster is available, consumers and producers can happily read from and write to the leading partition of a topic. Unfortunately this is only the sunny-day scenario: in practice either leaders or replicas may fail, and we need to handle these situations.
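This commit rule can be sketched as follows: since a message only counts as committed once every in-sync replica has applied it, the committed position is the minimum of the leader's and the in-sync replicas' positions (roughly what Kafka calls the high watermark; an illustrative helper, not broker code):

```python
# Sketch of the commit rule: a message is committed only once every
# in-sync replica has applied it, so the committed position is the
# minimum over the leader's and the replicas' log positions.
def committed_offset(leader_offset, isr_offsets):
    return min([leader_offset] + list(isr_offsets))

# Leader has written up to offset 10; the replicas have caught up to
# offsets 10 and 8 respectively.
print(committed_offset(10, [10, 8]))  # 8 -- offsets 9 and 10 not yet committed
```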
What happens when a replica fails? Writes no longer reach the failed replica, so it stops receiving messages and falls further behind the leader; in other words, it goes out of sync.
In the scenario above, Replica 3 fails first and no longer receives messages from the leader; after that Replica 2 fails and also stops receiving messages. Both are now out of sync with the leader; only the leader is still in sync. In the final scenario the leader itself dies as well, and we are left with three dead replicas.
Replica 1 is actually still in sync: it cannot receive newly arriving messages, but it is in sync up to the last message it was able to receive. Replica 2 is missing some messages, and Replica 3 is missing even more.
Given this state there are two possible solutions to handle this failure scenario:
Solution 1: Wait until the broker of the original leader is back up. It will then resume receiving and writing messages, and as the other replicas are brought back online they will sync with the leader until they are in sync again.
Solution 2: Elect the first broker that comes back up as the new leader, even if it is not the broker of the original leader. This broker will be out of sync, since it missed the messages written after the moment it went down. As the other brokers come back up, they will see that they hold committed messages that do not exist on the new leader, and they drop those messages. By electing a new leader as soon as possible some messages may be dropped, but downtime is minimized, because any machine that comes back can become the leader.
If the leader fails while in-sync replicas are still available, the Kafka controller detects the loss of the leader and elects a new one from the pool of in-sync replicas. This may take a few seconds and result in LeaderNotAvailable errors on the client. However, no data loss will occur as long as producers and consumers handle this possibility and retry appropriately.
Kafka clients come in two flavors: producers and consumers.
Each of the clients can be configured with different levels of consistency.
For a producer there are three settings that trade off consistency of messages against throughput:

- Send a message without waiting for any acknowledgement (acks=0): the highest throughput, but messages can be lost.
- Wait until the leader has written the message to its log (acks=1).
- Wait until all in-sync replicas have applied the message (acks=all): the slowest option, but the most consistent.
Depending on your requirements for consistency and throughput, you can decide what is best for your use case.
For a consumer there are also three scenarios that provide different consistency of messages: at most once, at least once, and exactly once delivery.
We will explain these scenarios below in more detail.
With at most once delivery, the consumer reads data from a partition, commits the offsets it has read, and then processes the message. If the consumer crashes between committing the offset and processing the message, it will restart from the next offset without ever having processed the message. This leads to undesirable message loss.
With at least once delivery, the consumer reads data from a partition, processes the message, and then commits the offset of the message it processed. If the consumer crashes between processing the message and committing the offset, it will process the same message again after restarting. This leads to duplicate messages in the downstream system, but not to data loss.
Exactly once delivery is achieved by having the consumer process a message and commit the output of the message together with the offset to a transactional system. If the consumer crashes, it can re-read the last committed transaction and resume processing from there. This leads to no data loss and no duplication, but it significantly decreases the throughput of the system, as each message and offset is committed as a transaction.
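The difference between the first two delivery semantics can be simulated with plain control flow (hypothetical functions, not real consumer code): the same crash, after reading the message at offset 1, loses a message under at most once and duplicates it under at least once.

```python
# At most once: commit the offset first, then process. A crash between
# the two steps loses the message.
def at_most_once(log, start, crash_before_processing_offset):
    processed, offset = [], start
    while offset < len(log):
        committed = offset + 1            # commit first...
        if offset == crash_before_processing_offset:
            return processed, committed   # crash before processing
        processed.append(log[offset])     # ...then process
        offset = committed
    return processed, offset

# At least once: process first, then commit. A crash between the two
# steps reprocesses the message after restart.
def at_least_once(log, start, crash_after_processing_offset):
    processed, offset = [], start
    while offset < len(log):
        processed.append(log[offset])     # process first...
        if offset == crash_after_processing_offset:
            return processed, offset      # crash before committing
        offset += 1                       # ...then commit
    return processed, offset

log = ["m0", "m1", "m2"]

first, resume = at_most_once(log, 0, crash_before_processing_offset=1)
after_restart, _ = at_most_once(log, resume, crash_before_processing_offset=None)
amo_total = first + after_restart
print(amo_total)  # ['m0', 'm2'] -- 'm1' was lost

first, resume = at_least_once(log, 0, crash_after_processing_offset=1)
after_restart, _ = at_least_once(log, resume, crash_after_processing_offset=None)
alo_total = first + after_restart
print(alo_total)  # ['m0', 'm1', 'm1', 'm2'] -- 'm1' was duplicated
```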
In practice most Kafka consumer applications choose at least once delivery, because it offers the best trade-off between throughput and correctness. With at least once delivery, the downstream application has to handle the duplicate messages.
Kafka is not a traditional message broker with a lot of bells and whistles such as, for example, ActiveMQ: messages have no individual IDs but are simply addressed by their offset, the broker does not track which consumer has consumed which message, and messages are kept for a retention period regardless of whether they have been consumed.
Because of these differences Kafka can make optimizations, such as writing and reading the log sequentially on disk.
These performance characteristics, together with its scalability and fault tolerance, make Kafka an excellent candidate in the big data space as a reliable way to ingest and move large amounts of data very quickly. Many large companies use Kafka for their data ingestion; Netflix, for example, ingests about 500 billion events, or roughly 1.3 petabytes of data, per day.
Kafka can also provide the backing store for domain-driven design concepts like Command Query Responsibility Segregation (CQRS) and event sourcing, which are powerful mechanisms for implementing scalable microservices. Event sourcing applications that generate a lot of events, such as Internet of Things applications, can be difficult to implement with traditional databases; a Kafka feature called log compaction can preserve events for the lifetime of the application. This keeps the application very loosely coupled, because it can lose or discard logs and simply restore the domain state from a log of preserved events.
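Log compaction can be sketched as keeping at least the latest message per key (a simplification of what the broker does in the background), so the full domain state can be restored by replaying the compacted log:

```python
# Sketch of log compaction: for each key, only the latest value is
# retained; replaying the compacted log reconstructs the current state.
def compact(log):
    latest = {}
    for key, value in log:   # later messages overwrite earlier ones
        latest[key] = value
    return list(latest.items())

events = [("device-1", "on"), ("device-2", "off"), ("device-1", "off")]
print(compact(events))   # [('device-1', 'off'), ('device-2', 'off')]
```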
Should you use Kafka? That depends on your use case. Kafka is not a one-size-fits-all solution; it is a fit for a certain class of problems that many web-scale and enterprise companies have. If you are looking to build a set of resilient data services and applications, Kafka can serve as the source of truth by collecting and keeping all the facts or events of a system.
If you would like to know more about implementing Kafka within your organisation and benefit from the experience Scalar Data has gained implementing this exciting technology for customer solutions, please contact us. We would be glad to help you implement Kafka within your organisation.
e-mail: [email protected]