[{"content":"Experimentation platforms are one of those invisible infrastructure layers that quietly determine which features ship, which UI changes stick, and which product bets get doubled down on. Every time you open Spotify, LinkedIn, or any modern tech product, you\u0026rsquo;re almost certainly enrolled in at least one running experiment. This post is everything I wish I had when I started building and running experiments — from the fundamentals to the hairy edge cases.\nWhat Is an Experimentation Platform? An experimentation platform is the infrastructure that allows teams to run controlled experiments — most commonly A/B tests — at scale. At its core, it answers one question: did this change cause a measurable improvement?\nThe key word is causal. Unlike analytics dashboards that show correlations, a well-run experiment gives you causal evidence because it randomly splits users into groups, changing only one variable between them.\nA complete platform handles:\nAssignment — deterministically placing users into control/treatment groups Configuration delivery — serving the right experience to each user Exposure logging — recording who actually saw what Metric computation — aggregating business and product metrics per group Statistical analysis — deciding whether observed differences are real or noise Experiment management — the UI and workflow for creating, launching, and reviewing experiments Without this infrastructure, teams either skip experiments entirely (dangerous) or run them in ad-hoc, error-prone ways (often worse than skipping).\nCore Concepts Hypothesis and Variants Every experiment starts with a hypothesis: \u0026ldquo;If we change X, we expect to see Y change in metric Z.\u0026rdquo; A hypothesis has:\nA control group — the existing experience, your baseline One or more treatment groups — the modified experience(s) Hypothesis: \u0026#34;Showing a progress bar during checkout will increase completion rate.\u0026#34; Control (50%): Checkout without progress bar Treatment (50%): Checkout with progress bar Primary metric: checkout_completion_rate Guardrail metrics: page_load_time_p99, error_rate Randomisation Unit The randomisation unit is the entity you split — most commonly a user, but it can be a session, a device, a request, or even a geographic region. Choosing the wrong unit is one of the most common early mistakes.\nUser-level: Same user always sees the same variant → stable UX, good for most features Session-level: Variant can change between sessions → useful for anonymous traffic Request-level: Different variant per request → only safe for stateless changes (e.g. ranking) The golden rule: your randomisation unit must be the same unit you measure your metrics on. If you split by user, measure by user. Mixing these causes dilution bias.\nTraffic Allocation Traffic allocation defines what percentage of eligible users enter the experiment at all, and how that traffic is split across variants. A 10% / 90% holdout is common when you want to ship quickly but still measure impact:\nTotal traffic └── 10% enters experiment ├── 50% → Control └── 50% → Treatment └── 90% → Holdout (sees neither variant, gets default) Metrics Metrics fall into a few categories:\nPrimary metric — the one you\u0026rsquo;re trying to move. Your hypothesis is directly about this. You get one. Secondary metrics — supporting evidence. Moving in the expected direction builds confidence. Guardrail metrics — things you must not make worse. Latency, error rates, revenue per user. 
### Metrics

Metrics fall into a few categories:

- **Primary metric** — the one you're trying to move. Your hypothesis is directly about this. You get one.
- **Secondary metrics** — supporting evidence. Moving in the expected direction builds confidence.
- **Guardrail metrics** — things you must not make worse: latency, error rates, revenue per user. A treatment that improves click-through but tanks page speed fails the guardrail check and shouldn't ship.
- **Debug metrics** — operational signals (assignment counts, event volumes) used to detect platform bugs.

## Platform Architecture

A production experimentation platform has three major subsystems.

### 1. Assignment Service

The assignment service takes a user (or other unit) and a list of active experiments, and deterministically returns which variant that user belongs to.

The canonical implementation uses consistent hashing: hash the combination of user_id + experiment_id (or a random salt, to avoid correlation across experiments) into a bucket, then map buckets to variants.

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, variants: list[dict]) -> str:
    """
    Deterministic assignment via consistent hashing.
    Same user_id + experiment_id always returns the same variant.
    """
    key = f"{experiment_id}:{user_id}"
    hash_val = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    bucket = hash_val % 10000  # 0–9999 gives 0.01% granularity

    cumulative = 0
    for variant in variants:
        cumulative += variant["allocation"] * 10000  # allocation is 0.0–1.0
        if bucket < cumulative:
            return variant["name"]
    return variants[-1]["name"]  # fallback to last variant

# Example
variants = [
    {"name": "control", "allocation": 0.50},
    {"name": "treatment", "allocation": 0.50},
]
assign_variant("user-abc123", "exp-checkout-progress-bar", variants)
# → "control" (always, for this user+experiment combination)
```

This approach is stateless and fast — no database lookup is needed at request time. The assignment can be computed in under 1 ms on any server or client that has the experiment configuration.

Two alternative approaches exist but have fatal flaws in practice:

- **Pre-assigned tables:** Store every user's variant in a database. Assignment becomes a lookup, which adds latency and causes overexposure — you log assignments for users who never actually encountered the changed surface.
- **Server-side session state:** Breaks when users switch devices, and creates consistency nightmares.
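A quick way to build trust in the hashing above is to simulate it: across many synthetic users, the realised split should land very close to the configured allocation. A minimal check, reusing `assign_variant` and `variants` from the block above:

```python
from collections import Counter

counts = Counter(
    assign_variant(f"user-{i}", "exp-checkout-progress-bar", variants)
    for i in range(100_000)
)
print(counts)  # expect roughly 50,000 per variant; a large skew points at a hashing bug
```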
### 2. Configuration / Feature Flag Service

The assignment service tells you which variant a user is in. The configuration service tells the application what to do with that information — which values to serve, which code paths to take.

This is where feature flags live. A feature flag is a variable your code reads at runtime:

```typescript
// TypeScript example — server SDK pattern
const config = await experimentClient.getConfig(user, "checkout-experiment");

if (config.get("show_progress_bar", false)) {
  return <CheckoutWithProgressBar />;
} else {
  return <CheckoutDefault />;
}
```

The SDK fetches a ruleset (a compact representation of all active experiments and their allocations) from the config service at startup, caches it in memory, and evaluates assignments locally on each call. This avoids a network round-trip on the hot path.

Ruleset (simplified JSON):

```json
{
  "experiments": [
    {
      "id": "exp-checkout-progress-bar",
      "salt": "x7f2k",
      "allocation": 1.0,
      "variants": [
        { "name": "control", "allocation": 0.5, "config": { "show_progress_bar": false } },
        { "name": "treatment", "allocation": 0.5, "config": { "show_progress_bar": true } }
      ],
      "targeting": { "user_country": ["SG", "MY", "TH"] }
    }
  ]
}
```

The SDK polls for ruleset updates every 30–60 seconds. Changes propagate within a minute without requiring a deployment.

### 3. Exposure Logging & Metrics Pipeline

An exposure event is logged the moment a user is actually served a variant — not when they're assigned, but when the feature flag is evaluated in a user-visible context. This distinction matters: a user assigned to an experiment who never reaches the checkout page should not be included in checkout metrics.

```python
exposure_event = {
    "user_id": "user-abc123",
    "experiment_id": "exp-checkout-progress-bar",
    "variant": "treatment",
    "timestamp": "2026-04-19T10:22:00Z",
    "client": "web",
    "app_version": "3.12.1",
}
```

Exposure events flow into a streaming pipeline (Kafka is a common choice here), where they're joined with downstream metric events and aggregated per variant:

```
Raw events (Kafka)
├── exposure_events
└── metric_events (purchases, clicks, errors, ...)
      ↓  join on user_id, within experiment window
Per-variant metric aggregations (data warehouse)
→ control:   { n: 48203, checkout_rate: 0.612, p99_latency: 320ms }
→ treatment: { n: 47891, checkout_rate: 0.638, p99_latency: 318ms }
```
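To illustrate the join in the diagram, here is a small batch-style Python sketch that turns raw exposure and metric events into per-variant aggregates. Real pipelines do this in a stream processor or in the warehouse; the event shapes are simplified versions of the ones above.

```python
from collections import defaultdict

exposures = [
    {"user_id": "u1", "variant": "control"},
    {"user_id": "u2", "variant": "treatment"},
    {"user_id": "u3", "variant": "treatment"},
]
metric_events = [  # e.g. checkout completions during the experiment window
    {"user_id": "u2", "event": "checkout_completed"},
]

variant_by_user = {e["user_id"]: e["variant"] for e in exposures}
totals, conversions = defaultdict(int), defaultdict(int)

for user, variant in variant_by_user.items():
    totals[variant] += 1
for m in metric_events:
    variant = variant_by_user.get(m["user_id"])  # join on user_id
    if variant and m["event"] == "checkout_completed":
        conversions[variant] += 1

for variant in totals:
    print(variant, conversions[variant] / totals[variant])
```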
## Statistical Foundations

This is where most engineering guides stop, but it's arguably the most important part.

### Hypothesis Testing

The standard framework is null hypothesis significance testing (NHST):

- **H₀ (null hypothesis):** The treatment has no effect. Any difference is due to random chance.
- **H₁ (alternative hypothesis):** The treatment has a real effect.

You set a significance threshold α (typically 0.05) before running the experiment. If the p-value from a two-sample t-test falls below α, you reject H₀ and call the result statistically significant.

```
p-value = P(observing this difference, or larger, if H₀ is true)

p < 0.05 → statistically significant (5% false positive rate)
p ≥ 0.05 → no significant result
```

### Statistical Power and Sample Size

Power (1 − β) is the probability of detecting a real effect if one exists. Common target: 80%.

Before launching, always compute the required sample size:

```python
from scipy import stats
import math

def required_sample_size(
    baseline_rate: float,
    minimum_detectable_effect: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Compute required n per variant for a two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    p_pooled = (p1 + p2) / 2

    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-tailed
    z_beta = stats.norm.ppf(power)

    n = (
        (z_alpha * math.sqrt(2 * p_pooled * (1 - p_pooled))
         + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
        / (p2 - p1) ** 2
    )
    return math.ceil(n)

# Example: baseline checkout rate 60%, want to detect a 2pp lift
n = required_sample_size(baseline_rate=0.60, minimum_detectable_effect=0.02)
print(f"Required per variant: {n:,}")  # ~9,300 per variant
```

### CUPED — Variance Reduction

CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique developed at Microsoft that can improve experiment sensitivity by 30–50%. The idea: use each user's pre-experiment metric value as a covariate to subtract out noise that has nothing to do with the treatment.

```
Y_cuped = Y − θ · X_pre

Where:
  Y     = metric value during the experiment
  X_pre = same metric, measured before experiment start
  θ     = covariate coefficient (estimated from data)
```

The adjusted metric has the same expected value but lower variance, meaning you need fewer users to reach significance — or equivalently, you can detect smaller effects with the same traffic.
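As a minimal sketch of the adjustment on synthetic data, with the standard choice θ = cov(Y, X_pre) / var(X_pre):

```python
import numpy as np

rng = np.random.default_rng(42)
x_pre = rng.normal(100, 20, size=10_000)    # pre-experiment metric per user
y = x_pre + rng.normal(5, 10, size=10_000)  # in-experiment metric, correlated with x_pre

theta = np.cov(y, x_pre)[0, 1] / np.var(x_pre)
y_cuped = y - theta * (x_pre - x_pre.mean())  # centring x_pre preserves the mean of y

print(y.var(), y_cuped.var())  # variance drops sharply; the means stay (almost) identical
```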
### Sequential Testing

The classic fixed-horizon design requires you to decide your sample size upfront and not peek at results until you've hit it. In practice, teams peek constantly — and this inflates your false positive rate.

Sequential testing (or always-valid inference) provides statistical guarantees even when you check results continuously. It's what powers the "you can stop early" buttons in modern platforms. The trade-off: slightly wider confidence intervals when you do reach the end.

## Experiment Types

Beyond the standard two-variant A/B test:

**A/A Test** — Run two identical control groups. Results should show no significant difference. Use this to validate your platform's randomisation and statistical setup before trusting real experiment results.

**Multivariate Test (MVT)** — Test combinations of changes simultaneously. Useful when changes might interact (e.g. button colour and button text). Requires much more traffic because you need adequate sample size for each cell.

**Multi-Armed Bandit (MAB)** — Dynamically shifts traffic toward the winning variant as results accumulate, rather than waiting for a fixed-horizon test to conclude. Trades statistical rigour for faster convergence. Best for low-stakes decisions (e.g. marketing copy) where you care more about maximising near-term conversions than about a clean causal estimate.

**Holdout Experiments** — A long-running control group held out from a set of features to measure their combined long-term impact. Useful because short experiments often can't detect slow-burning effects (e.g. does the new onboarding flow improve 90-day retention?).

**Switchback / Time-Based Experiments** — Used when user-level randomisation is impossible because of network interference (e.g. a ride-sharing surge pricing algorithm affects all riders in an area). Traffic alternates between control and treatment across time windows rather than across users.
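For intuition, switchback assignment can be as simple as hashing the time window instead of the user. A hypothetical sketch (the salt, per-region keying, and hourly window are assumptions):

```python
import hashlib
from datetime import datetime, timezone

def switchback_variant(region: str, ts: datetime, window_hours: int = 1) -> str:
    """Assign a whole region-hour to control or treatment."""
    window_index = int(ts.timestamp()) // (window_hours * 3600)
    key = f"switchback-salt:{region}:{window_index}"
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return "treatment" if h % 2 else "control"

print(switchback_variant("singapore", datetime.now(timezone.utc)))
```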
## Running Experiments: End-to-End

### 1. Define the Experiment

```yaml
name: "checkout-progress-bar"
hypothesis: "A progress bar reduces checkout abandonment by giving users a sense of completion."
owner: "payments-team"

variants:
  - name: control
    allocation: 50%
    config:
      show_progress_bar: false
  - name: treatment
    allocation: 50%
    config:
      show_progress_bar: true

targeting:
  eligible_users: "all_logged_in"
  excluded_experiments: []   # ensure no conflicts

metrics:
  primary: checkout_completion_rate
  secondary:
    - time_to_complete_checkout
    - add_payment_method_rate
  guardrails:
    - checkout_error_rate
    - p99_page_load_ms

duration:
  minimum_runtime_days: 7    # avoid day-of-week bias
  target_sample_size_per_variant: 5000
```

### 2. Instrument the Code

```typescript
// Server-side Node.js example
import { ExperimentClient } from "@internal/experiment-sdk";

const client = new ExperimentClient({ apiKey: process.env.EXP_KEY });
await client.start(); // fetches and caches ruleset

async function renderCheckout(userId: string) {
  const variant = client.getVariant(userId, "checkout-progress-bar");

  // Log exposure — only if user actually reaches checkout
  client.logExposure(userId, "checkout-progress-bar");

  return {
    showProgressBar: variant.config.show_progress_bar ?? false,
  };
}
```

### 3. Run and Monitor

Once launched, monitor daily:

- **Sample ratio mismatch check** — is traffic split close to 50/50?
- **Guardrail metrics** — any regressions in error rate or latency?
- **Assignment counts** — are they growing at the expected rate?

### 4. Analyse and Decide

After hitting the target sample size and minimum runtime, read the results:

```
Experiment: checkout-progress-bar
Duration: 14 days
Sample: 12,841 control / 12,807 treatment

Primary metric (checkout_completion_rate):
  Control:   0.612 (61.2%)
  Treatment: 0.638 (63.8%)
  Lift: +2.6pp (+4.2%)
  p-value: 0.003 ✅ significant (α = 0.05)
  95% CI: [+0.9pp, +4.3pp]

Guardrails:
  checkout_error_rate: no significant change ✅
  p99_page_load_ms:    no significant change ✅

Decision: SHIP ✅
```
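The sample ratio mismatch check from step 3 is worth automating. A minimal version using scipy's chi-squared test and the sample counts above:

```python
from scipy.stats import chisquare

observed = [12_841, 12_807]        # control, treatment
total = sum(observed)
expected = [total / 2, total / 2]  # designed 50/50 split

stat, p = chisquare(observed, expected)
print(f"SRM p-value: {p:.3f}")     # a very small p (e.g. < 0.001) means: investigate
```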
## Common Pitfalls

### 1. Peeking and early stopping

Checking results daily and stopping when you see p < 0.05 inflates your false positive rate dramatically. If you need early stopping, use sequential testing with proper statistical guarantees — don't just eyeball it.

### 2. Sample Ratio Mismatch (SRM)

If you designed a 50/50 experiment but see a 52/48 split, something is wrong with your assignment or logging pipeline — a bot filter, a caching layer, a redirect, or a bug in exposure logging. Experiments with SRM must be discarded. The check is a simple chi-squared test against the expected ratio; any p-value below 0.001 is a red flag.

### 3. Novelty effect

Users behave differently when they encounter something new. A redesigned checkout flow might see a temporary engagement spike simply because it's different, not because it's better. Always run experiments long enough to let the novelty wear off — at least one full week, often two.

### 4. Network interference

When users can affect each other (social features, marketplaces, two-sided platforms), standard A/B tests are invalid. Treating one user changes the experience of their connections in the control group, contaminating the control. Mitigation strategies include cluster-based randomisation or switchback experiments.

### 5. Multiple testing without correction

If you test 20 metrics and call anything with p < 0.05 a win, you'll get one false positive on average by chance alone. Apply a correction (Benjamini-Hochberg for FDR control is standard; a sketch follows this list) or pre-register your single primary metric and treat everything else as exploratory.

### 6. Overexposure

Logging an exposure when a user is assigned rather than when they encounter the feature dilutes your metric estimates. A user assigned to the checkout experiment who never visits checkout adds noise to your checkout rate measurement. Log exposures as late as possible — at the point of the actual user-visible change.

### 7. Carry-over effects from overlapping experiments

Two experiments that both modify the checkout flow will interact. A well-designed platform handles this with mutual exclusion layers (also called namespaces): experiments in the same namespace are guaranteed non-overlapping, so a user in a given namespace is in at most one of its experiments.
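For reference, the Benjamini-Hochberg step-up procedure mentioned in pitfall 5 is only a few lines. A minimal sketch (not a platform API) that flags which p-values survive at FDR level q = 0.05:

```python
def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Return a reject/keep flag per p-value, controlling the FDR at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * q ...
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    # ... and reject the k_max smallest p-values.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

print(benjamini_hochberg([0.001, 0.009, 0.04, 0.041, 0.20]))
# → [True, True, False, False, False] for these example p-values
```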
## Advanced Techniques

### Metric Sensitivity: Why Your Tests Feel Underpowered

Variance is the enemy of fast experiments. The higher the variance in your metric, the more users you need to detect a given effect. Beyond CUPED, other variance reduction techniques include:

- **Stratified sampling** — ensure variants are balanced on high-variance covariates (e.g., country, device type) at assignment time
- **Triggered analysis** — restrict analysis to users who actually triggered the change surface, excluding ineligible users who just dilute the signal
- **Winsorisation** — cap extreme metric values (e.g., users who spent $10,000 in a session) to prevent outliers from dominating the variance

### Long-Term Holdouts

Short-term experiments miss long-term effects. Habituation, learning curves, and ecosystem effects all take weeks or months to materialise. Large platforms maintain long-running holdout groups — small populations (1–5%) permanently withheld from a set of features — to measure true cumulative impact over quarters.

### Warehouse-Native Analysis

Modern platforms like Eppo and GrowthBook support warehouse-native experiment analysis: metrics are defined as SQL queries that run directly against your data warehouse (Snowflake, BigQuery, Databricks). This keeps sensitive data in your own infrastructure, makes metrics auditable and version-controlled, and lets data scientists use the full expressiveness of SQL rather than being constrained to platform-defined metric types.

```sql
-- Metric definition: checkout_completion_rate
SELECT
  user_id,
  COUNT_IF(event_type = 'checkout_completed')::FLOAT
    / NULLIF(COUNT_IF(event_type = 'checkout_started'), 0) AS checkout_rate
FROM events
WHERE event_date BETWEEN :start_date AND :end_date
GROUP BY user_id
```

## Build vs. Buy

| Factor | Build in-house | Use a vendor (Statsig, Eppo, GrowthBook, LaunchDarkly) |
|---|---|---|
| Control | Full ownership of data and logic | Dependent on vendor roadmap |
| Time to first experiment | Months to years | Days to weeks |
| Customisation | Unlimited | Constrained to the platform model |
| Statistical quality | Only as good as your team | Battle-tested by default |
| Cost (small scale) | Engineer salaries | Usually affordable |
| Cost (large scale) | Cheaper per event | Can get expensive |
| Data residency | Full control | Varies; warehouse-native options help |

The honest answer: unless you're operating at the scale of Spotify, LinkedIn, or Airbnb — or have regulatory constraints that force data residency — buy before you build. The real cost of an in-house platform isn't the engineering; it's the ongoing statistical methodology, the debugging of subtle platform bugs (SRM, exposure logging errors), and the product work to make it usable by non-engineers.

## Experimentation Culture: The Harder Problem

The hardest part of experimentation isn't the technology — it's the culture.

**Ship with experiments by default.** The highest-leverage change is making every feature flag an experiment from the start, rather than treating experimentation as an optional add-on. Engineers ship code; experiments are baked in.

**Celebrate learning, not winning.** Most experiments don't produce significant lifts. A clean null result that disproves a hypothesis is valuable — it means you didn't invest in building something that doesn't work. Teams that only celebrate wins will start peeking for significance and torturing the data until it confesses.

**Build a metrics catalogue.** Metrics should be defined once, centrally, in SQL, and reused across experiments. A company where every team defines "conversion" slightly differently will produce untrustworthy experiments.

**Document experiment results.** An experiment that runs, concludes, and is never written up is a wasted learning. A simple internal wiki page per experiment — hypothesis, results, decision, what we learned — compounds into institutional knowledge over years.

## Ecosystem and Tooling

| Category | Popular options |
|---|---|
| Managed platforms | Statsig, Optimizely, LaunchDarkly, Amplitude Experiment |
| Open source | GrowthBook, Flagsmith, Unleash |
| Warehouse-native | Eppo, GrowthBook (warehouse mode) |
| Statistics libraries | scipy.stats, pingouin (Python); Evan Miller's calculators |
| Sample size calculators | Statsig, Evan Miller, AB Testguide |
| Causal inference (advanced) | DoWhy, EconML (Microsoft), CausalML (Uber) |

## Final Thoughts

Experimentation platforms are, at their core, epistemology infrastructure — they determine how your organisation knows what's true about your product. Get them right, and you ship faster with more confidence. Get them wrong, and you ship decisions based on flawed data while believing you're being rigorous.

The technical pieces are achievable: consistent hashing for assignment, local SDK evaluation for performance, exposure-triggered logging to avoid dilution, warehouse-joined aggregations for metrics, and t-tests with proper power calculations. The statistical subtleties — SRM checks, CUPED, sequential testing, network interference — take longer to get right, but each one has a documented solution.

Start with one experiment. Instrument it carefully. Check for SRM. Wait for the full sample. Then ship or don't, and write it up. Do that fifty times and you'll have an experimentation platform that works.

*Permalink: https://lcv-back.github.io/vilecongblog.github.io/posts/experimentation-platform/*

---

# Everything About Apache Kafka

*A comprehensive deep-dive into Apache Kafka: how it works, core concepts, use cases, and everything you need to go from zero to production.*

Apache Kafka is one of those technologies that quietly powers a huge chunk of the modern internet — real-time analytics pipelines, event-driven microservices, fraud detection, log aggregation — yet it can feel intimidating at first. This post is the guide I wish I had when I started. Let's go from zero to production, step by step.

## What Is Apache Kafka?

Apache Kafka is an open-source distributed event streaming platform originally developed at LinkedIn and donated to the Apache Software Foundation in 2011. At its core, Kafka lets you:

- **Publish** streams of records (events/messages)
- **Subscribe** to those streams and process them in real time or in batch
- **Store** streams durably and fault-tolerantly for as long as you want
- **Replay** past events, which is something traditional message queues can't easily do

The key insight that makes Kafka different: it treats every event as an immutable log entry. Events are never deleted on consume — they live on disk until a configured retention period expires.

## Core Concepts

### Topics

A topic is a named category for a stream of records — think of it like a database table or a folder in a filesystem, but for events.

```
Topic: "user-signups"
  Event: { "userId": "u123", "email": "vi@example.com",  "ts": 1713456000 }
  Event: { "userId": "u124", "email": "bob@example.com", "ts": 1713456010 }
```

Topics are append-only — you can only add new records, never edit old ones.

### Partitions

Every topic is split into one or more partitions. This is Kafka's secret to horizontal scalability.

```
Topic "orders" with 3 partitions:

Partition 0: [order-1] [order-4] [order-7] →
Partition 1: [order-2] [order-5] [order-8] →
Partition 2: [order-3] [order-6] [order-9] →
```

Each record within a partition gets a monotonically increasing integer called an offset. Kafka guarantees ordering within a partition, not across partitions.

Why does this matter? You can run one consumer per partition in parallel, so more partitions = more throughput.

### Producers

A producer is any application that publishes records to a Kafka topic.

```go
// Go example using the confluent-kafka-go client
producer, _ := kafka.NewProducer(&kafka.ConfigMap{
    "bootstrap.servers": "localhost:9092",
})

producer.Produce(&kafka.Message{
    TopicPartition: kafka.TopicPartition{
        Topic:     &topic,
        Partition: kafka.PartitionAny,
    },
    Key:   []byte("user-123"),
    Value: []byte(`{"event": "signup", "plan": "pro"}`),
}, nil)
```

The message key determines which partition the record goes to. Records with the same key always land in the same partition — which preserves order for a given entity (e.g., all events for user-123).
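The default partitioner in Kafka's Java client hashes the key with murmur2 and takes the result modulo the partition count. As a rough illustration of the idea (with Python's hashlib standing in for murmur2, so the actual bucket numbers will differ from real clients):

```python
import hashlib

def partition_for_key(key: str, num_partitions: int) -> int:
    """Same key → same partition, so per-key ordering is preserved."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % num_partitions

for user in ["user-123", "user-123", "user-456"]:
    print(user, "→ partition", partition_for_key(user, 3))
# user-123 always maps to the same partition; user-456 may land elsewhere
```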
### Consumers & Consumer Groups

A consumer reads records from a topic. Consumers are grouped into consumer groups. Kafka distributes partitions across the consumers in a group, so each partition is read by exactly one consumer in the group at a time.

```
Topic "payments" (4 partitions)

Consumer Group "payment-processor" (2 consumers):
  Consumer A → Partition 0, Partition 1
  Consumer B → Partition 2, Partition 3
```

Adding more consumers to a group scales out your processing — up to the number of partitions. A second consumer group reading the same topic gets its own independent copy of all records.

### Brokers

A broker is a single Kafka server. A Kafka cluster is a group of brokers that cooperate to store and serve data.

Partitions are distributed and replicated across brokers. Each partition has one leader (handles reads and writes) and zero or more followers (replicas for fault tolerance).

### ZooKeeper vs KRaft

Historically, Kafka used Apache ZooKeeper for cluster coordination — leader election, topic metadata, ACLs. As of Kafka 3.x, KRaft (Kafka Raft) mode is stable and eliminates the ZooKeeper dependency entirely, simplifying deployment significantly.

## How Kafka Stores Data

Kafka writes records to disk in log segments — binary files on the broker's filesystem. This is why Kafka is so fast: it uses the OS page cache aggressively and relies on sequential disk I/O, which is much faster than random I/O (even SSDs benefit from this).

```
/kafka-logs/
  orders-0/                        ← Partition 0 data directory
    00000000000000000000.log       ← segment file
    00000000000000000000.index     ← offset index
    00000000000000000000.timeindex
  orders-1/
    ...
```

### Retention Policy

Records aren't deleted when consumed. Kafka retains them based on:

- **Time-based:** `retention.ms` — delete segments older than N days (default: 7 days)
- **Size-based:** `retention.bytes` — delete when a partition exceeds N bytes
- **Log compaction:** keep only the latest record per key — useful for change data capture and maintaining current state

## Delivery Guarantees

Kafka offers three levels of delivery semantics:

| Guarantee | Producer setting | Consumer behaviour | Risk |
|---|---|---|---|
| At most once | `acks=0` | commit before processing | data loss |
| At least once | `acks=all` | commit after processing | duplicates |
| Exactly once | `acks=all` + idempotent + transactions | transactional consumer | highest complexity |

For most systems, at-least-once with idempotent consumers is the sweet spot. Exactly-once is available but adds operational complexity.
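An idempotent consumer under at-least-once delivery just needs to remember what it has already processed. A minimal sketch; the event shape and the in-memory set are illustrative, and production systems would keep processed IDs in a database or rely on idempotent writes:

```python
processed_ids: set[str] = set()  # in production: a DB table or unique-key constraint

def handle(event: dict) -> None:
    event_id = event["event_id"]
    if event_id in processed_ids:
        return  # duplicate delivery — safe to skip
    # ... apply the side effect exactly once (e.g. charge a card, insert a row) ...
    processed_ids.add(event_id)

handle({"event_id": "evt-1", "type": "payment"})
handle({"event_id": "evt-1", "type": "payment"})  # redelivered duplicate, ignored
```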
## Key Use Cases

### 1. Event-Driven Microservices

Instead of direct API calls between services, services emit events to Kafka and other services consume them asynchronously. This decouples services and makes the system resilient to individual service downtime.

```
User Service → [user-events topic] → Email Service
                                   → Analytics Service
                                   → Notification Service
```

### 2. Real-Time Data Pipelines

Kafka Connect ships data between Kafka and external systems (databases, S3, Elasticsearch) without writing code. Source connectors read from external systems; sink connectors write to them.

### 3. Log Aggregation

Collect logs from hundreds of servers into Kafka, then stream them to Elasticsearch or a data warehouse. Kafka acts as a high-throughput buffer that smooths out spikes.

### 4. Stream Processing

With Kafka Streams (or Apache Flink on top of Kafka), you can do stateful computations — windowing, joins, aggregations — directly on the event stream without a separate processing cluster.

### 5. Change Data Capture (CDC)

Tools like Debezium read the database write-ahead log and emit every row change as a Kafka event. Downstream systems get a real-time feed of all database mutations.

## Running Kafka Locally (KRaft mode)

```bash
# Using Docker Compose (KRaft — no ZooKeeper needed)
cat > docker-compose.yml <<EOF
version: "3"
services:
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    hostname: kafka
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      CLUSTER_ID: "MkU3OEVBNTcwNTJENDM2Qk"
EOF

docker compose up -d
```

Create a topic and send your first event:

```bash
# Create topic
docker exec kafka kafka-topics --create \
  --topic hello-kafka \
  --bootstrap-server localhost:9092 \
  --partitions 3 \
  --replication-factor 1

# Produce a message
echo '{"hello": "kafka"}' | docker exec -i kafka \
  kafka-console-producer --topic hello-kafka \
  --bootstrap-server localhost:9092

# Consume from the beginning
docker exec kafka kafka-console-consumer \
  --topic hello-kafka \
  --bootstrap-server localhost:9092 \
  --from-beginning
```
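The same round-trip from application code, using the confluent-kafka Python client (`pip install confluent-kafka`). A minimal sketch that assumes the broker above is running and the hello-kafka topic exists:

```python
from confluent_kafka import Consumer, Producer

# Produce one message
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("hello-kafka", key="greeting", value='{"hello": "kafka"}')
producer.flush()  # block until the broker acknowledges delivery

# Consume it back from the beginning
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "hello-group",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["hello-kafka"])

msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())  # b'greeting' b'{"hello": "kafka"}'
consumer.close()
```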
## Performance Tuning Tips

| Area | Setting | What it does |
|---|---|---|
| Producer throughput | `batch.size`, `linger.ms` | Batch more records per request |
| Producer durability | `acks=all`, `min.insync.replicas=2` | Wait for follower acknowledgement |
| Consumer throughput | `fetch.min.bytes`, `fetch.max.wait.ms` | Fetch larger batches less often |
| Compression | `compression.type=lz4` | Reduce network and disk I/O |
| Partitions | Match to consumer parallelism | More partitions = more scale |
| Retention | `log.retention.hours` | Balance disk cost vs replay window |

## Common Pitfalls

### 1. Too few partitions early on

You can add partitions later, but it breaks key-based ordering for existing keys. Plan for growth upfront.

### 2. Large messages

Kafka is optimised for many small messages (< 1 MB). For large payloads, store the data in S3/object storage and put the reference URL in the Kafka event.

### 3. Ignoring consumer lag

Monitor `kafka-consumer-groups --describe` or use Burrow/Cruise Control. Growing lag means your consumers can't keep up with producers.

### 4. Not handling rebalances

When consumers join or leave a group, Kafka triggers a rebalance — all consumption pauses briefly. Use cooperative rebalancing (`partition.assignment.strategy=CooperativeStickyAssignor`) to minimise disruption.

### 5. Skipping schema management

Without a schema registry (e.g., Confluent Schema Registry + Avro/Protobuf), producer and consumer schema drift will break your pipeline silently.

## Kafka vs. Alternatives

| | Kafka | RabbitMQ | AWS SQS | Redis Streams |
|---|---|---|---|---|
| Model | Log-based | Queue | Queue | Log-based |
| Retention | Days–forever | Until consumed | 14 days max | Memory-bounded |
| Throughput | Millions/sec | ~50k/sec | ~3k/sec | Hundreds of k/sec |
| Replay | ✅ Yes | ❌ No | ❌ No | ✅ Yes |
| Ordering | Per-partition | Per-queue | Best-effort | Per stream |
| Best for | Event sourcing, analytics, high-throughput pipelines | Task queues, RPC-style messaging | Simple serverless queuing | Small-scale streams, caching |

## Final Thoughts

Kafka changed how I think about data flow. Once you see systems as streams of immutable events, the architecture simplifies dramatically — services become stateless processors, debugging becomes replaying events, and scaling becomes adding partitions.

The learning curve is real, especially around consumer groups, partition assignment, and exactly-once semantics. But the payoff is a backbone that can handle millions of events per second, survive broker failures, and let you replay the entire history of your system whenever you need to.

Start small — one topic, one producer, one consumer. Then scale up from there.

Have questions about Kafka? Feel free to reach out through the contact page or drop a comment. I'd love to hear what you're building.

*Permalink: https://lcv-back.github.io/vilecongblog.github.io/posts/first-post/*