Introduction

Coming into this class, the key thing to keep in mind when referencing Kafka is that we aren’t going to deal with object-based data. Apache Kafka deals with information represented as events in real-time, rather than working on existing sources of data.

Event Organization within Kafka

As mentioned before, Kafka is not like a traditional data store. That means when new events are ingested, they are not pre-existing values, but are rather appended to a log in sequence.

An event can be broken down into the following three parts:

  • Key - denotes information about the source of the event (usually some id)
  • Value
  • Timestamp

Some key details about Kafka events are as follows:

  • Messages are immutable. Transforming data will mean creating a new topic with that event type present.
  • Topics are not like traditional queues; data can be replayed and retained at configurable thresholds.
    • Log compaction allows for configuring storage policies for each key in a topic.
    • Retention policies allow for configuring how long data should be stored in topics.

Kafka Partitioning

Apache Kafka is primarily designed as a distributed system which enables high concurrency and redundancy through horizontal scaling. To support this, each Kafka topic log is usually split up into smaller sections of data called partitions.

Partitions have key rules that govern how they function and what gets allocated to them:

  • Events with the same key are always written to the same partition (hashing distribution)
  • Events that are written to the same partition will always maintain sequential ordering
  • Events with no key are distributed to partitions using a round-robin approach

Kafka Brokers

Brokers are the servers that store data and handle all of the Kafka streaming requests. These brokers can be distributed across various physical servers, each of which stores partitions of topics.

A group of brokers forms a Kafka cluster, which handles all read and write operations associated with client requests. When clients read and write information, they interact with the cluster, which directly reads/writes from specific partitions within the cluster.

Kafka Broker Metadata Management & Synchronization

Historically, Kafka used ZooKeeper for managing metadata and coordinating Kafka brokers, but this has recently shifted to KRaft which is built into Kafka itself. This means that brokers are able to handle their own metadata synchronization, simplifying the architecture and improving efficiency.

Replication

Kafka follows a pretty standard replication method. Multiple copies of partitions exist across many brokers, determined by the replication factor. The replication pattern resembles the Master-Slave Pattern of replication.

There is always a lead partition to which all reads and writes are directed, which synchronizes with its children. When the lead partition fails, a child partition is elected as the new leader.

See how similar this pattern is to the database replication patterns in System Design Interview: Chapter 1

Kafka Producers: Writing Data Into Your Event Pipeline

Producers are the client applications that connect to clusters and send data to them. As a client, you have to configure a producer with a set of properties that denote the cluster you will be communicating with and other relevant metadata.

  • bootstrap.servers # list of brokers that the producer can connect to (clusters)
  • acks # the level of acknowledgment the server needs to maintain to have an event succeed

To actually use this in code, you will have the following two objects setup to send a message to a cluster/topics within the cluster.

  • KafkaProducer – Manages the connection to the cluster and handles message sending.
  • ProducerRecord – Represents the message (key, value) and the topic it will be sent to. It also allows you to set optional fields like timestamppartition, and headers.

Example Code

KafkaProducer
ProducerRecord
 
import { Kafka, Producer, ProducerRecord } from 'kafkajs';
 
// 1. Initialize the Kafka client
const kafka = new Kafka({
  clientId: 'my-ts-app',
  brokers: ['localhost:9092'], // Replace with your Kafka broker address
});
 
const producer: Producer = kafka.producer();
 
// 2. Define the payload types
interface OrderMessage {
  orderId: string;
  amount: number;
}
 
// 3. Create and send the ProducerRecord
async function runProducer() {
  await producer.connect();
 
  const record: ProducerRecord = {
    topic: 'order-events',
    messages: [
      {
        key: 'order-123', // Keys ensure messages with the same key go to the same partition
        value: JSON.stringify({ orderId: 'order-123', amount: 250.50 } as OrderMessage),
        headers: {
          'correlation-id': 'corr-abc-123',
        },
      },
    ],
  };
 
  try {
    const result = await producer.send(record);
    console.log('Message sent successfully:', result);
  } catch (error) {
    console.error('Error sending message:', error);
  } finally {
    await producer.disconnect();
  }
}
 
runProducer();
 

In this example, the correlation ID is a unique identifier that travels with the event to track its entire journey from the producer, through Kafka, and into any downstream consumers.

Correlation ID vs. Kafka Message Keys

FeatureCorrelation ID (in Headers)Kafka Message Key
PurposeMonitoring, debugging, and tracing across services.Partitioning data and ensuring strict message ordering.
ScopeTravels through the entire network/ecosystem.Stays strictly inside that specific Kafka topic.
UniquenessGlobally Unique. Generated per request (e.g., uuid-v4).Reused. Shared by the same entity (e.g., customer-456).
Topic LocationWill appear across many different topics as work flows.Always lands on the same partition within the topic.

It’s important to note that the producer determines the partition the data will be sent to with the Kafka message key.

Kafka Consumers: Reading and Reacting to Event Streams

Consumers are the client applications actually responsible from reading the Kafka data and processing it. The consumer needs to provide the consumer group.id and the bootstrap.servers information for cluster discovery.

Consumers can subscribe to one or more topics dynamically and will infinitely poll the clusters for new events. The consumers fetch ConsumerRecords which contain the following information:

  • Key: The unique identifier of the message.
  • Value: The data payload.
  • Partition: The partition the message came from.
  • Timestamp: When the event was recorded.
  • Headers: Optional metadata.

Consumer Recovery

Offsets of where the consumer left off reading are stored within the Kafka topics to ensure that if a consumer goes offline, it will start reading from the same place.

Consumer Groups and Parallelism

To ensure we can scale processing of Kafka information, we can also create consumer groups that can process data from partitions concurrently. Kafka assigns each partition to one consumer in the group—no two consumers in the same group read from the same partition.

Since messages are log-based, multiple consumers can process the same event-stream without any conflicts.

Schema Registry

When writing or consuming data from a topic, both the producers and the consumers need to have an understanding of the structure of the data they are sending or consuming.

Schema Registries support storing the shape of the data that is being sent by the producers and are located within each topic.

  • Consumers can verify that the data they are consuming matches the expected schema by comparing the schemaId from the message to the schema registry

Serialization Formats

The schema registry supports three major formats: Avro, JSON


Linked Map of Contexts