Design a Change Data Capture (CDC) Pipeline

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Show domain depth beyond the baseline: async pipelines, consistency semantics, and operational SLOs.
Arch 75	Staff angles: partition behavior, cost drivers, and MVP → production evolution with clear triggers.

Interview Prompt

Design Change Data Capture (CDC) Pipeline.

Clarifying Questions (ask before designing)

Question	Why it matters
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

End-to-end Cdc Pipeline flows and API contracts
High-level architecture with major components
Data model and storage choices with rationale
Capacity estimation with shown math
Primary failure modes and mitigations

Out of scope (state explicitly)

Full analytics warehouse modeling (dbt/Star schema)
Exactly-once end-to-end across all downstream consumers
Building Kafka from scratch

Assumptions

Clarify scale (DAU, QPS, data volume) for cdc pipeline in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Capture every INSERT, UPDATE, DELETE from source databases in real-time
Propagate changes to downstream consumers with < 5 second latency
Maintain strict ordering of changes per row/primary key
Initial snapshot + ongoing incremental streaming
Schema evolution: handle column adds, renames, type changes
Multi-source: PostgreSQL, MySQL, MongoDB, SQL Server, Oracle
Multi-sink: Kafka, Elasticsearch, S3/Parquet, Redis, ClickHouse
Effectively-once delivery via at-least-once capture + idempotent consumers
Filtering and transformations: rename fields, mask PII, enrich

Metric	Calculation	Value
Source databases	Given (assumption documented in value)	100
Tables per database	Given (assumption documented in value)	50
Total tables	Given (assumption documented in value)	5,000
Changes / sec (all sources)	From Changes / day ÷ 86400 (+ peak factor in value)	500K
Avg change event size	Given (typical workload assumption)	500 bytes
Throughput	Given (assumption documented in value)	250 MB/sec
Kafka retention (7 days)	Given	~150 TB
Kafka topics (one per table)	Given (assumption documented in value)	5,000

Loading...

How Log-Based CDC Works

PostgreSQL Write-Ahead Log (WAL):
  Every transaction → WAL entry before data pages modified
  WAL is the source of truth for replication

Logical Replication (PostgreSQL 10+):
  1. Create PUBLICATION: CREATE PUBLICATION my_pub FOR TABLE orders, users;
  2. Debezium creates a REPLICATION SLOT
  3. Debezium reads decoded WAL changes via replication protocol
  4. Each change decoded into: {table, op, before, after, source_metadata}

Replication slot guarantees:
  PostgreSQL retains WAL segments until Debezium confirms consumption
  ? No changes are ever lost
  → BUT: if Debezium is down too long → WAL accumulates → disk fills up
  → Monitor: pg_replication_slots → confirmed_flush_lsn lag

Zero impact on production:
  Debezium reads the WAL (written anyway for crash recovery)
  No extra queries, no triggers, no polling. CPU impact: < 1%.

Initial Snapshot + Streaming

Problem: When a CDC connector starts, table already has 100M rows.

Debezium Snapshot Flow:
  1. LOCK table (briefly) for consistent snapshot position
  2. Read entire table with SELECT * (chunked by PK for large tables)
  3. UNLOCK table (total lock time: < 1 second)
  4. Switch to streaming mode: read WAL from snapshot position

For very large tables (billions of rows):
  Incremental snapshot (Debezium 1.6+):
  - No global table lock
  - Reads table in chunks (WHERE id BETWEEN 1 AND 10000)
  - Interleaves snapshot reads with streaming reads
  - Can take hours but never locks the table

Change Event Format (Debezium Envelope)

JSON

{
  "schema": {...},
  "payload": {
    "before": { "id": 42, "status": "pending", "amount": 99.99 },
    "after": { "id": 42, "status": "completed", "amount": 99.99 },
    "source": {
      "connector": "postgresql",
      "db": "ecommerce", "table": "orders",
      "lsn": 987654321, "ts_ms": 1710403200000
    },
    "op": "u",
    "ts_ms": 1710403200050
  }
}

Schema Evolution Without Downtime

Scenario: ALTER TABLE orders ADD COLUMN discount DECIMAL(10,2);

What happens:
  1. Debezium reads DDL from WAL
  2. Generates new schema (v2) with discount field
  3. Schema Registry: compatibility check → v2 compatible → registered
  4. New events have discount field; old events do not
  5. No pipeline restart needed

Problematic DDL (breaks compatibility):
  ALTER TABLE orders DROP COLUMN status → BACKWARD incompatible
  Schema Registry REJECTS → Debezium pauses with error

Safe schema changes:
  Add optional field with default ?
  Rename field (with Avro alias) ?
  Remove required field ?
  Change field type ?

Exactly-Once End-to-End

Source → Kafka: Debezium + Kafka acks=all → at-least-once (idempotent via PK key)

Kafka → Sink:
  Elasticsearch: upsert by doc_id = PK → naturally idempotent
  PostgreSQL: INSERT ON CONFLICT (pk) DO UPDATE → idempotent
  S3: Flink exactly-once file sink
  Redis: SET key value → idempotent

Net effect: exactly-once semantics without distributed transactions
The pattern: at-least-once delivery + idempotent consumers = exactly-once

Outbox Pattern

Problem: Application needs to emit a DOMAIN EVENT (not just a row change)
  CDC captures: "order.status changed from 'pending' to 'shipped'"
  But you need: "OrderShipped event with tracking_number, carrier"

Solution: Outbox Pattern
  1. In SAME database transaction:
     UPDATE orders SET status='shipped' WHERE id=42;
     INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload)
       VALUES ('order', '42', 'OrderShipped', '{"tracking":"1Z999"}');
  2. Debezium CDC on outbox table → rich domain events to Kafka
  3. Background cleaner: DELETE FROM outbox WHERE created_at < NOW() - INTERVAL '1 hour';

Transactional guarantee: event is produced IF AND ONLY IF business action committed

HTTP

# Create PostgreSQL CDC connector
POST /connectors
{
  "name": "pg-ecommerce-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "pg-primary.internal",
    "database.dbname": "ecommerce",
    "database.server.name": "pg-server1",
    "table.include.list": "public.orders,public.users",
    "column.exclude.list": "public.users.ssn",
    "slot.name": "debezium_ecommerce",
    "snapshot.mode": "initial",
    "transforms.route.topic.replacement": "cdc.$2"
  }
}

GET    /connectors                    → List connectors
GET    /connectors/{name}/status      → Connector health
POST   /connectors/{name}/restart     → Restart connector
PUT    /connectors/{name}/pause       → Stop reading WAL
PUT    /connectors/{name}/resume      → Resume from last position

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
202 Accepted: job queued; poll GET /jobs/{id} for status
408 Request Timeout: job still processing; continue polling

Slot WAL Accumulation: The #1 Operational Risk

Problem: Debezium is down for 6 hours → PostgreSQL retains 6 hours of WAL → disk full
  ? PRODUCTION DATABASE DOWN

Mitigations:
  1. Monitor: pg_stat_replication_slots → confirmed_flush_lsn lag
     Alert if lag > 1 GB or 1 hour
  2. max_slot_wal_keep_size = 100GB (PG 13+) → auto-invalidate slot
     Debezium detects invalidation → triggers new snapshot automatically
  3. Heartbeat table: Debezium writes every 30s ? advances WAL position
  4. PagerDuty alert if connector status != RUNNING for > 5 minutes

Concern	Solution
Debezium crash	Restart → reads last offset → resumes from WAL position
Source DB failure	Debezium reconnects; WAL retained until consumed
Kafka broker failure	RF=3, min.ISR=2; events survive any single broker failure
Schema change (DDL)	Detects DDL → new schema version → backward compatible
Duplicate events	Kafka key = PK → log compaction; consumers use upsert
Ordering violation	Partition by PK → all changes for same row → strict order

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.

Interview Prompt

Clarifying Questions (ask before designing)

Scope

In scope

Out of scope (state explicitly)

Assumptions

How Log-Based CDC Works

Initial Snapshot + Streaming

Change Event Format (Debezium Envelope)

Schema Evolution Without Downtime

Exactly-Once End-to-End

Outbox Pattern

Common Error Responses

Kafka Topic Layout

Connect Offset Storage

Slot WAL Accumulation: The #1 Operational Risk

CDC vs Dual Writes vs Polling

Multi-Region CDC

Interview Walkthrough

Log Mining vs Query Polling vs Dual Write

Exactly-Once Delivery: When It Matters

Schema Evolution in a Live Pipeline

Phase 1: MVP (0 to 100K users)

Phase 2: Growth (100K to 10M users)

Phase 3: Scale (10M+ users)

SLOs & Error Budgets

Incident Scenarios (2am reality)

Cost Drivers (Staff lens)

Multi-Region & DR