Apache Kafka is one of those technologies that quietly powers a huge chunk of the modern internet — real-time analytics pipelines, event-driven microservices, fraud detection, log aggregation — yet it can feel intimidating at first. This post is the guide I wish I had when I started. Let’s go from zero to production, step by step.


What Is Apache Kafka?

Apache Kafka is an open-source distributed event streaming platform originally developed at LinkedIn and donated to the Apache Software Foundation in 2011. At its core, Kafka lets you:

  • Publish streams of records (events/messages)
  • Subscribe to those streams and process them in real time or batch
  • Store streams durably and fault-tolerantly for as long as you want
  • Replay past events, which is something traditional message queues can’t easily do

The key insight that makes Kafka different: it treats every event as an immutable log entry. Events are never deleted on consume — they live on disk until a configured retention period expires.


Core Concepts

Topics

A topic is a named category for a stream of records — think of it like a database table or a folder in a filesystem, but for events.

Topic: "user-signups"
  Event: { "userId": "u123", "email": "vi@example.com", "ts": 1713456000 }
  Event: { "userId": "u124", "email": "bob@example.com", "ts": 1713456010 }

Topics are append-only — you can only add new records, never edit old ones.

Partitions

Every topic is split into one or more partitions. This is Kafka’s secret to horizontal scalability.

Topic "orders" with 3 partitions:

Partition 0: [order-1] [order-4] [order-7] →
Partition 1: [order-2] [order-5] [order-8] →
Partition 2: [order-3] [order-6] [order-9] →

Each record within a partition gets a monotonically increasing integer called an offset. Kafka guarantees ordering within a partition, not across partitions.

Why does this matter? You can run one consumer per partition in parallel, so more partitions = more throughput.

Producers

A producer is any application that publishes records to a Kafka topic.

// Go example using the confluent-kafka-go client
// (import "github.com/confluentinc/confluent-kafka-go/v2/kafka")
topic := "user-signups"

producer, err := kafka.NewProducer(&kafka.ConfigMap{
    "bootstrap.servers": "localhost:9092",
})
if err != nil {
    log.Fatalf("failed to create producer: %v", err)
}
defer producer.Close()

err = producer.Produce(&kafka.Message{
    TopicPartition: kafka.TopicPartition{
        Topic:     &topic,
        Partition: kafka.PartitionAny,
    },
    Key:   []byte("user-123"),
    Value: []byte(`{"event": "signup", "plan": "pro"}`),
}, nil)
if err != nil {
    log.Fatalf("failed to enqueue message: %v", err)
}

// Produce is asynchronous; flush so the message actually reaches the broker.
producer.Flush(5000)

The message key determines which partition the record goes to. Records with the same key always land in the same partition — which preserves order for a given entity (e.g., all events for user-123).
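
Conceptually, the default partitioner hashes the key and takes the result modulo the partition count. Here is a minimal sketch of the idea; note that the real clients use different hash functions (murmur2 in the Java client, crc32 in librdkafka), so this FNV version is illustrative only:

package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor sketches keyed partitioning: hash the key, then take it
// modulo the partition count. Illustrative only; real clients use murmur2
// (Java) or crc32 (librdkafka), not FNV.
func partitionFor(key []byte, numPartitions uint32) uint32 {
	h := fnv.New32a()
	h.Write(key)
	return h.Sum32() % numPartitions
}

func main() {
	// The same key always maps to the same partition
	// (while the partition count stays fixed).
	fmt.Println(partitionFor([]byte("user-123"), 3))
	fmt.Println(partitionFor([]byte("user-123"), 3)) // same result
}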

Consumers & Consumer Groups

A consumer reads records from a topic. Consumers are grouped into consumer groups. Kafka distributes partitions across consumers in the same group, so each partition is read by exactly one consumer in the group at a time.

Topic "payments" (4 partitions)
Consumer Group "payment-processor" (2 consumers):

  Consumer A → Partition 0, Partition 1
  Consumer B → Partition 2, Partition 3

Adding more consumers to a group scales out your processing — up to the number of partitions. A second consumer group reading the same topic gets its own independent copy of all records.
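
Here is a minimal consumer sketch using the same confluent-kafka-go client as the producer example above; the topic and group names are illustrative:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {
	consumer, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092",
		"group.id":          "payment-processor", // consumers sharing this id split the partitions
		"auto.offset.reset": "earliest",          // on first run, start from the oldest record
	})
	if err != nil {
		log.Fatalf("failed to create consumer: %v", err)
	}
	defer consumer.Close()

	if err := consumer.SubscribeTopics([]string{"payments"}, nil); err != nil {
		log.Fatalf("subscribe failed: %v", err)
	}

	for {
		msg, err := consumer.ReadMessage(time.Second)
		if err != nil {
			// A timeout just means no message arrived yet; keep polling.
			if kerr, ok := err.(kafka.Error); ok && kerr.Code() == kafka.ErrTimedOut {
				continue
			}
			log.Printf("consumer error: %v", err)
			continue
		}
		fmt.Printf("partition %d offset %d: %s\n",
			msg.TopicPartition.Partition, msg.TopicPartition.Offset, string(msg.Value))
	}
}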

Brokers

A broker is a single Kafka server. A Kafka cluster is a group of brokers that cooperate to store and serve data.

Partitions are distributed and replicated across brokers. Each partition has one leader (handles reads and writes) and zero or more followers (replicas for fault tolerance).

ZooKeeper vs KRaft

Historically, Kafka used Apache ZooKeeper for cluster coordination — leader election, topic metadata, ACLs. KRaft (Kafka Raft) mode replaces it with a built-in Raft-based controller quorum; it was declared production-ready in Kafka 3.3 and eliminates the ZooKeeper dependency entirely, simplifying deployment significantly.


How Kafka Stores Data

Kafka writes records to disk in log segments — binary files on the broker’s filesystem. This is why Kafka is so fast: it uses the OS page cache aggressively and relies on sequential disk I/O, which is much faster than random I/O (even SSDs benefit from this).

/kafka-logs/
  orders-0/         ← Partition 0 data directory
    00000000000000000000.log     ← segment file
    00000000000000000000.index   ← offset index
    00000000000000000000.timeindex
  orders-1/
    ...

Retention Policy

Records aren’t deleted when consumed. Kafka retains them based on:

  • Time-based: retention.ms — delete segments older than a configured age (default: 7 days)
  • Size-based: retention.bytes — delete the oldest segments once a partition exceeds N bytes
  • Log compaction: cleanup.policy=compact — keep only the latest record per key, useful for change data capture and maintaining current state (see the sketch after this list)
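
As a concrete example, here is a sketch of creating a compacted topic with the confluent-kafka-go AdminClient; the topic name and settings are illustrative:

package main

import (
	"context"
	"log"
	"time"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {
	admin, err := kafka.NewAdminClient(&kafka.ConfigMap{"bootstrap.servers": "localhost:9092"})
	if err != nil {
		log.Fatalf("failed to create admin client: %v", err)
	}
	defer admin.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// cleanup.policy=compact keeps only the latest record per key.
	results, err := admin.CreateTopics(ctx, []kafka.TopicSpecification{{
		Topic:             "user-profiles",
		NumPartitions:     3,
		ReplicationFactor: 1,
		Config:            map[string]string{"cleanup.policy": "compact"},
	}})
	if err != nil {
		log.Fatalf("CreateTopics failed: %v", err)
	}
	for _, r := range results {
		log.Printf("%s: %s", r.Topic, r.Error)
	}
}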

Delivery Guarantees

Kafka offers three levels of delivery semantics:

| Guarantee | Producer setting | Consumer behaviour | Risk |
| --- | --- | --- | --- |
| At most once | acks=0 | commit before processing | data loss |
| At least once | acks=all | commit after processing | duplicates |
| Exactly once | acks=all + idempotence + transactions | transactional consumer | highest complexity |

For most systems, at-least-once with idempotent consumers is the sweet spot. Exactly-once is available but adds operational complexity.
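
With confluent-kafka-go, at-least-once usually means an idempotent producer plus a consumer created with "enable.auto.commit": false that commits only after processing. A hedged sketch (process is a placeholder for your own handler, not a library function):

// Producer: acks=all plus idempotence lets the broker de-duplicate retries.
producer, err := kafka.NewProducer(&kafka.ConfigMap{
    "bootstrap.servers":  "localhost:9092",
    "acks":               "all",
    "enable.idempotence": true,
})

// Consumer (created with "enable.auto.commit": false): commit only after the
// record is fully processed. A crash before the commit means redelivery, so
// process must be safe to run twice.
msg, err := consumer.ReadMessage(time.Second)
if err == nil {
    process(msg)
    consumer.CommitMessage(msg)
}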


Key Use Cases

1. Event-Driven Microservices

Instead of direct API calls between services, services emit events to Kafka and other services consume them asynchronously. This decouples services and makes the system resilient to individual service downtime.

User Service → [user-events topic] → Email Service
                                  → Analytics Service
                                  → Notification Service

2. Real-Time Data Pipelines

Kafka Connect ships data between Kafka and external systems (databases, S3, Elasticsearch) without writing code. Source connectors read from systems; sink connectors write to them.

3. Log Aggregation

Collect logs from hundreds of servers into Kafka, then stream them to Elasticsearch or a data warehouse. Kafka acts as a high-throughput buffer that smooths out spikes.

4. Stream Processing

With Kafka Streams (or Apache Flink on top of Kafka), you can do stateful computations — windowing, joins, aggregations — directly on the event stream without a separate processing cluster.
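
Kafka Streams itself is a Java library, but the consume, update state, emit loop at its heart can be sketched in plain Go, reusing the consumer from the earlier example. This toy in-memory count is only illustrative; real stream processors add fault-tolerant state stores, windowing, and repartitioning on top:

// Toy stateful aggregation: count events per key as they stream in.
// Kafka Streams/Flink would persist this state and restore it after a
// failure; this in-memory map loses everything on restart.
counts := make(map[string]int64)
for {
    msg, err := consumer.ReadMessage(time.Second)
    if err != nil {
        continue // skipping timeout handling for brevity
    }
    key := string(msg.Key)
    counts[key]++
    fmt.Printf("%s -> %d events\n", key, counts[key])
}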

5. Change Data Capture (CDC)

Tools like Debezium read the database write-ahead log and emit every row change as a Kafka event. Downstream systems get a real-time feed of all database mutations.


Running Kafka Locally (KRaft mode)

# Using Docker Compose (KRaft — no ZooKeeper needed)
cat > docker-compose.yml <<EOF
version: "3"
services:
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    hostname: kafka
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      CLUSTER_ID: "MkU3OEVBNTcwNTJENDM2Qk"
EOF

docker compose up -d

Create a topic and send your first event:

# Create topic
docker exec kafka kafka-topics --create \
  --topic hello-kafka \
  --bootstrap-server localhost:9092 \
  --partitions 3 \
  --replication-factor 1

# Produce a message
echo '{"hello": "kafka"}' | docker exec -i kafka \
  kafka-console-producer --topic hello-kafka \
  --bootstrap-server localhost:9092

# Consume from the beginning
docker exec kafka kafka-console-consumer \
  --topic hello-kafka \
  --bootstrap-server localhost:9092 \
  --from-beginning

Performance Tuning Tips

| Area | Setting | What it does |
| --- | --- | --- |
| Producer throughput | batch.size, linger.ms | Batch more records per request |
| Producer durability | acks=all, min.insync.replicas=2 | Wait for follower acknowledgement |
| Consumer throughput | fetch.min.bytes, fetch.max.wait.ms | Fetch larger batches less often |
| Compression | compression.type=lz4 | Reduce network and disk I/O |
| Partitions | Match to consumer parallelism | More partitions = more scale |
| Retention | log.retention.hours | Balance disk cost vs replay window |
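
Several of these settings map directly onto producer configuration. A sketch of a throughput-oriented setup with confluent-kafka-go (the values are reasonable starting points, not universal recommendations):

producer, err := kafka.NewProducer(&kafka.ConfigMap{
    "bootstrap.servers": "localhost:9092",
    "acks":              "all",   // durability: wait for in-sync replicas
    "linger.ms":         20,      // wait up to 20 ms to build larger batches
    "batch.size":        65536,   // allow batches up to 64 KB
    "compression.type":  "lz4",   // compress batches on the wire and on disk
})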

Common Pitfalls

1. Too few partitions early on

You can add partitions later, but it breaks key-based ordering for existing keys. Plan for growth upfront.

2. Large messages

Kafka is optimised for many small messages (< 1 MB). For large payloads, store the data in S3/object storage and put the reference URL in the Kafka event (the "claim check" pattern; a sketch follows).
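
A sketch of that claim-check pattern. uploadToS3, reportID, and bigPayload are hypothetical placeholders standing in for your storage client and data:

// Claim-check pattern: the event carries a pointer to the payload rather
// than the payload itself. uploadToS3 is a hypothetical helper, not a real
// client API.
url, err := uploadToS3("reports-bucket", reportID, bigPayload)
if err != nil {
    log.Fatalf("upload failed: %v", err)
}
producer.Produce(&kafka.Message{
    TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
    Key:            []byte(reportID),
    Value:          []byte(fmt.Sprintf(`{"report_id": %q, "payload_url": %q}`, reportID, url)),
}, nil)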

3. Ignoring consumer lag

Monitor kafka-consumer-groups --describe or use Burrow/Cruise Control. Growing lag means your consumers can’t keep up with producers.

4. Not handling rebalances

When consumers join or leave a group, Kafka triggers a rebalance — all consumption pauses briefly. Use cooperative rebalancing (partition.assignment.strategy=CooperativeStickyAssignor in the Java client) to minimise disruption; the librdkafka equivalent is shown below.
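
With confluent-kafka-go (librdkafka), the same strategy is selected with the value cooperative-sticky rather than the Java class name:

consumer, err := kafka.NewConsumer(&kafka.ConfigMap{
    "bootstrap.servers": "localhost:9092",
    "group.id":          "payment-processor",
    // Incremental rebalancing: only the partitions that actually move are
    // paused, instead of the whole group stopping.
    "partition.assignment.strategy": "cooperative-sticky",
})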

5. Skipping schema management

Without a schema registry (e.g., Confluent Schema Registry + Avro/Protobuf), producer and consumer schema drift will break your pipeline silently.


Kafka vs. Alternatives

|  | Kafka | RabbitMQ | AWS SQS | Redis Streams |
| --- | --- | --- | --- | --- |
| Model | Log-based | Queue | Queue | Log-based |
| Retention | Days–forever | Until consumed | 14 days max | Memory-bounded |
| Throughput | Millions/sec | ~50k/sec | ~3k/sec (FIFO) | Hundreds of k/sec |
| Replay | ✅ Yes | ❌ No | ❌ No | ✅ Yes |
| Ordering | Per-partition | Per-queue | Best-effort (standard) | Per stream |
| Best for | Event sourcing, analytics, high-throughput pipelines | Task queues, RPC-style messaging | Simple serverless queuing | Small-scale streams, caching |

Final Thoughts

Kafka changed how I think about data flow. Once you see systems as streams of immutable events, the architecture simplifies dramatically — services become stateless processors, debugging becomes replaying events, and scaling becomes adding partitions.

The learning curve is real, especially around consumer groups, partition assignment, and exactly-once semantics. But the payoff is a backbone that can handle millions of events per second, survive broker failures, and let you replay the entire history of your system whenever you need to.

Start small — one topic, one producer, one consumer. Then scale up from there.


Have questions about Kafka? Feel free to reach out through the contact page or drop a comment. I’d love to hear what you’re building.