Layman’s guide to Kafka

Jenishjain
Sep 6, 2020

Firstly why Kafka?

Kafka’s growth is exploding: more than one-third of all Fortune 500 companies use Kafka.
These companies include the top ten travel companies,
7 of the top ten banks,
8 of the top ten insurance companies,
9 of the top ten telecom companies,
and many more. LinkedIn, Microsoft, and Netflix each process four-comma messages a day with Kafka (1,000,000,000,000).

Kafka is used for real-time streams of data: to collect big data, to do real-time analysis, or both.

Kafka is used with in-memory microservices to provide durability, and it can be used to feed events to CEP (complex event processing) systems and IoT/IFTTT-style automation systems.
If you want to know which companies use Kafka, take a look here: https://kafka.apache.org/powered-by

But what the Heck is Kafka?

Well Yeah... this is why I am reading the layman's blog :|

So Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol. It can be deployed on bare-metal hardware, virtual machines, and containers in on-premise as well as cloud environments.

Let’s simplify it a bit

Think of a Kafka topic as your house mailbox. Other people can send you mail at this mailbox address; those senders are Kafka producers. The people who read that mail are Kafka consumers, and your family living in the same house belongs to one consumer group, identified by a group id.

Now let’s look at some key features of the things we have just discovered

Events

An event is the most basic entity in Kafka; it corresponds to a single message published to or consumed from a topic. When you read or write data to Kafka, you do this in the form of events.
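As a rough sketch (not the real Kafka client API), an event is essentially a key, a value, a timestamp, and optional headers. The `Event` class and the `payment` example below are hypothetical, just to make the shape concrete:

```python
import time
from dataclasses import dataclass, field

# Hypothetical sketch of what a Kafka event carries:
# a key, a value, a timestamp, and optional headers.
@dataclass
class Event:
    key: str
    value: str
    timestamp: float = field(default_factory=time.time)
    headers: dict = field(default_factory=dict)

# An example event a payment service might publish.
payment = Event(key="order-42", value='{"amount": 19.99}')
```

The key matters later: Kafka uses it to decide which partition an event lands in.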

Topic

All Kafka Events are organized into topics.

it is so simple …
  • A topic in Kafka can have zero to many producers writing events to it as well as zero to many consumers that are subscribed to these events.
  • Unlike in a traditional messaging system, events are not deleted after consumption. You can define how long an event should be retained through a per-topic configuration setting and read it as often as needed.
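The retention behaviour can be sketched in a few lines of Python. This is a toy model, not Kafka's actual log cleaner: events survive in the topic's log until they are older than the retention period, regardless of whether anyone has consumed them.

```python
import time

# Toy model: drop events older than the topic's retention period.
# Consumption plays no role in when an event is deleted.
def expire_old_events(log, retention_seconds, now=None):
    now = now if now is not None else time.time()
    return [(ts, msg) for ts, msg in log if now - ts <= retention_seconds]

log = [(0, "old"), (90, "recent")]
# With 60-second retention evaluated at t=100, only "recent" survives,
# even if "old" was never consumed and "recent" was read many times.
kept = expire_old_events(log, retention_seconds=60, now=100)
```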

Brokers

Kafka, as a distributed system, runs in a cluster. Each node in the cluster is called a Kafka broker.

Now let us dive a little more into these topics

Topics in Kafka are partitioned, which is just a fancy way of saying that your events are distributed into several buckets (partitions) located on different Kafka brokers. Every partition has exactly one leader, which handles all the read/write requests for that partition. If the replication factor is greater than 1, the additional partition replicas act as followers.

A partition, in theory, can be described as an immutable, ordered sequence of messages.

This distributed placement of your data is very important for scalability because it allows client applications to both read and write the data from/to many brokers at the same time.

Each event in a partition has an identifier called an offset. The offset maintains the order of events within a partition for you.

Hence, to summarise: each event in Kafka can be uniquely identified by the tuple (topic, partition, offset within the partition).
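The partition-and-offset idea can be sketched as a toy in-memory model. This is not how the Kafka client works internally; the key-hashing scheme (`crc32` modulo partition count) is an assumption chosen for illustration, though real producers do hash the key so that equal keys always land in the same partition:

```python
import zlib

NUM_PARTITIONS = 3

# Toy model: equal keys hash to the same partition;
# within a partition, each event gets the next offset.
def choose_partition(key: str) -> int:
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

partitions = {p: [] for p in range(NUM_PARTITIONS)}

def append(key, value):
    p = choose_partition(key)
    offset = len(partitions[p])           # offsets grow monotonically
    partitions[p].append((offset, key, value))
    return p, offset                      # (partition, offset) identifies the event

first = append("user-1", "login")
second = append("user-1", "click")  # same key -> same partition, next offset
```

Because both events share the key `"user-1"`, they land in the same partition with offsets 0 and 1, so a consumer sees them in the order they were produced.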

A visual representation of what we discussed so far

Producers

So far you have seen that a producer is responsible for writing/committing your events to a Kafka topic, so let’s be a little more specific now.

A producer writes to the partition leader of the target partition in a Kafka topic. This provides a means of load balancing, so that writes to different partitions can be serviced by separate brokers and machines.

Consumers

Within a consumer group, each partition is read by exactly one consumer. Consumers can be organized into consumer groups for a given topic, and the group as a whole consumes all messages from the entire topic.

If the number of consumers in a consumer group is more than the number of partitions, then some of these consumers will be idle as they have no partition to read from. Similarly, if the number of partitions is more than the number of consumers, then some consumers will receive messages from multiple partitions.

If you have equal numbers of consumers and partitions, each consumer reads messages in order from exactly one partition.
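The assignment rules above can be sketched with a simple round-robin scheme. This is an illustration only; Kafka's real assignors (range, round-robin, sticky, and so on) are configurable and more sophisticated:

```python
# Toy round-robin assignment of partitions to the consumers in a group.
# Extra consumers end up idle; extra partitions mean some consumers
# read from more than one partition.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 4 consumers: consumer c4 has no partition and sits idle.
a = assign([0, 1, 2], ["c1", "c2", "c3", "c4"])
# 3 partitions, 2 consumers: c1 reads from two partitions.
b = assign([0, 1, 2], ["c1", "c2"])
# 3 partitions, 3 consumers: exactly one partition each.
c = assign([0, 1, 2], ["c1", "c2", "c3"])
```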

Kafka Guarantees

There are some claims which Kafka makes but like always there are terms and conditions applied so let’s discuss those first.

So Kafka’s guarantees hold as long as we are producing to one partition and consuming from one partition; they no longer hold as soon as we either read from the same partition using multiple consumers or write to the same partition using multiple producers.

Q. But what will I get if I pay this cost?
A. Data consistency and availability 🚀

How?

  1. Messages sent to a topic partition will be appended to the commit log in the order they are sent.
  2. A single consumer instance will see messages in the order they appear in the log.
  3. A message is ‘committed’ when all in-sync replicas have applied it to their log.
  4. Any committed message will not be lost, as long as at least one in-sync replica is alive.
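Point 3 can be sketched in a few lines. This is a toy model of the "committed" rule, with replicas represented as plain lists of messages: a message at a given offset counts as committed only once every in-sync replica has it in its log.

```python
# Toy model: a message at `offset` is committed once every
# in-sync replica has appended it to its local log.
def is_committed(offset, in_sync_replica_logs):
    return all(offset < len(log) for log in in_sync_replica_logs)

leader = ["m0", "m1", "m2"]
follower = ["m0", "m1"]  # still catching up: m2 not replicated yet

# Offset 1 exists on both replicas, so it is committed;
# offset 2 exists only on the leader, so it is not committed yet.
```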

What can we achieve using Kafka?

Using Kafka in our architecture provides a high level of parallelism and decoupling between data producers and their consumers. In a Microservice pattern, this comes very handy when we want to trigger some other flow with an event happening somewhere else.

Wait! But this can also be achieved using some REST APIs and async calls 😕. So why should I bother to learn something new? I had the same question, so I thought of researching this a little more, and these are my findings.

Thank you Java Technical for an amazing explanation

Also, committing to a Kafka topic is much faster than executing an API call that writes to a database. Hence, if we have a pipeline to process the data being produced, choosing Kafka over REST is better; but if we have a user waiting for the response after the data is consumed, REST is the better fit.

Hope this article helped you develop a better understanding of Kafka and its use cases. I will soon write the next blog about how to set up a Kafka system in some of the frequently used programming languages 😀. Feel free to share your thoughts or corrections about this article; it will surely help me write better next time.



An electrical engineer and a passionate coder, currently Senior Product Engineer at Rapido, likes to learn and share new things.