This problem appears in multiple sheets. Depth expectations increase as you progress:
Interview Prompt
Design Change Data Capture (CDC) Pipeline.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- End-to-end Cdc Pipeline flows and API contracts
- High-level architecture with major components
- Data model and storage choices with rationale
- Capacity estimation with shown math
- Primary failure modes and mitigations
Out of scope (state explicitly)
- Full analytics warehouse modeling (dbt/Star schema)
- Exactly-once end-to-end across all downstream consumers
- Building Kafka from scratch
Assumptions
- Clarify scale (DAU, QPS, data volume) for cdc pipeline in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Capture every INSERT, UPDATE, DELETE from source databases in real-time
- Propagate changes to downstream consumers with < 5 second latency
- Maintain strict ordering of changes per row/primary key
- Initial snapshot + ongoing incremental streaming
- Schema evolution: handle column adds, renames, type changes
- Multi-source: PostgreSQL, MySQL, MongoDB, SQL Server, Oracle
- Multi-sink: Kafka, Elasticsearch, S3/Parquet, Redis, ClickHouse
- Effectively-once delivery via at-least-once capture + idempotent consumers
- Filtering and transformations: rename fields, mask PII, enrich
- Low Latency: DB change → consumer in < 5 seconds (p99)
- High Throughput: 100K+ changes/sec from a single source
- Zero Impact on Source: No performance degradation of production DB
- Effectively-once delivery: At-least-once CDC with idempotent sinks and offset deduplication: no duplicate downstream effects
- Durability: No change event lost: every committed transaction captured
- Schema Resilient: Handle DDL changes without manual intervention
| Metric | Calculation | Value |
|---|---|---|
| Source databases | Given (assumption documented in value) | 100 |
| Tables per database | Given (assumption documented in value) | 50 |
| Total tables | Given (assumption documented in value) | 5,000 |
| Changes / sec (all sources) | From Changes / day ÷ 86400 (+ peak factor in value) | 500K |
| Avg change event size | Given (typical workload assumption) | 500 bytes |
| Throughput | Given (assumption documented in value) | 250 MB/sec |
| Kafka retention (7 days) | Given | ~150 TB |
| Kafka topics (one per table) | Given (assumption documented in value) | 5,000 |
How Log-Based CDC Works
PostgreSQL Write-Ahead Log (WAL):
Every transaction → WAL entry before data pages modified
WAL is the source of truth for replication
Logical Replication (PostgreSQL 10+):
1. Create PUBLICATION: CREATE PUBLICATION my_pub FOR TABLE orders, users;
2. Debezium creates a REPLICATION SLOT
3. Debezium reads decoded WAL changes via replication protocol
4. Each change decoded into: {table, op, before, after, source_metadata}
Replication slot guarantees:
PostgreSQL retains WAL segments until Debezium confirms consumption
? No changes are ever lost
→ BUT: if Debezium is down too long → WAL accumulates → disk fills up
→ Monitor: pg_replication_slots → confirmed_flush_lsn lag
Zero impact on production:
Debezium reads the WAL (written anyway for crash recovery)
No extra queries, no triggers, no polling. CPU impact: < 1%.Initial Snapshot + Streaming
Problem: When a CDC connector starts, table already has 100M rows. Debezium Snapshot Flow: 1. LOCK table (briefly) for consistent snapshot position 2. Read entire table with SELECT * (chunked by PK for large tables) 3. UNLOCK table (total lock time: < 1 second) 4. Switch to streaming mode: read WAL from snapshot position For very large tables (billions of rows): Incremental snapshot (Debezium 1.6+): - No global table lock - Reads table in chunks (WHERE id BETWEEN 1 AND 10000) - Interleaves snapshot reads with streaming reads - Can take hours but never locks the table
Change Event Format (Debezium Envelope)
{
"schema": {...},
"payload": {
"before": { "id": 42, "status": "pending", "amount": 99.99 },
"after": { "id": 42, "status": "completed", "amount": 99.99 },
"source": {
"connector": "postgresql",
"db": "ecommerce", "table": "orders",
"lsn": 987654321, "ts_ms": 1710403200000
},
"op": "u",
"ts_ms": 1710403200050
}
}Schema Evolution Without Downtime
Scenario: ALTER TABLE orders ADD COLUMN discount DECIMAL(10,2); What happens: 1. Debezium reads DDL from WAL 2. Generates new schema (v2) with discount field 3. Schema Registry: compatibility check → v2 compatible → registered 4. New events have discount field; old events do not 5. No pipeline restart needed Problematic DDL (breaks compatibility): ALTER TABLE orders DROP COLUMN status → BACKWARD incompatible Schema Registry REJECTS → Debezium pauses with error Safe schema changes: Add optional field with default ? Rename field (with Avro alias) ? Remove required field ? Change field type ?
Exactly-Once End-to-End
Source → Kafka: Debezium + Kafka acks=all → at-least-once (idempotent via PK key) Kafka → Sink: Elasticsearch: upsert by doc_id = PK → naturally idempotent PostgreSQL: INSERT ON CONFLICT (pk) DO UPDATE → idempotent S3: Flink exactly-once file sink Redis: SET key value → idempotent Net effect: exactly-once semantics without distributed transactions The pattern: at-least-once delivery + idempotent consumers = exactly-once
Outbox Pattern
Problem: Application needs to emit a DOMAIN EVENT (not just a row change)
CDC captures: "order.status changed from 'pending' to 'shipped'"
But you need: "OrderShipped event with tracking_number, carrier"
Solution: Outbox Pattern
1. In SAME database transaction:
UPDATE orders SET status='shipped' WHERE id=42;
INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload)
VALUES ('order', '42', 'OrderShipped', '{"tracking":"1Z999"}');
2. Debezium CDC on outbox table → rich domain events to Kafka
3. Background cleaner: DELETE FROM outbox WHERE created_at < NOW() - INTERVAL '1 hour';
Transactional guarantee: event is produced IF AND ONLY IF business action committed# Create PostgreSQL CDC connector
POST /connectors
{
"name": "pg-ecommerce-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "pg-primary.internal",
"database.dbname": "ecommerce",
"database.server.name": "pg-server1",
"table.include.list": "public.orders,public.users",
"column.exclude.list": "public.users.ssn",
"slot.name": "debezium_ecommerce",
"snapshot.mode": "initial",
"transforms.route.topic.replacement": "cdc.$2"
}
}
GET /connectors → List connectors
GET /connectors/{name}/status → Connector health
POST /connectors/{name}/restart → Restart connector
PUT /connectors/{name}/pause → Stop reading WAL
PUT /connectors/{name}/resume → Resume from last positionCommon Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
202 Accepted: job queued; poll GET /jobs/{id} for status
408 Request Timeout: job still processing; continue pollingKafka Topic Layout
Topic per table (recommended): pg-server1.public.orders → partitioned by PK hash pg-server1.public.users → partitioned by PK hash Key: serialized primary key (same row → same partition → ordering) Value: Debezium change event envelope (before, after, source, op)
Connect Offset Storage
Topic: __connect_offsets (25 partitions, compacted)
Key: ["pg-ecommerce-connector", {"server": "pg-server1"}]
Value: {"lsn": 987654321, "txId": 12345678}
On restart: reads last offset → resumes WAL from that LSNSlot WAL Accumulation: The #1 Operational Risk
Problem: Debezium is down for 6 hours → PostgreSQL retains 6 hours of WAL → disk full
? PRODUCTION DATABASE DOWN
Mitigations:
1. Monitor: pg_stat_replication_slots → confirmed_flush_lsn lag
Alert if lag > 1 GB or 1 hour
2. max_slot_wal_keep_size = 100GB (PG 13+) → auto-invalidate slot
Debezium detects invalidation → triggers new snapshot automatically
3. Heartbeat table: Debezium writes every 30s ? advances WAL position
4. PagerDuty alert if connector status != RUNNING for > 5 minutes| Concern | Solution |
|---|---|
| Debezium crash | Restart → reads last offset → resumes from WAL position |
| Source DB failure | Debezium reconnects; WAL retained until consumed |
| Kafka broker failure | RF=3, min.ISR=2; events survive any single broker failure |
| Schema change (DDL) | Detects DDL → new schema version → backward compatible |
| Duplicate events | Kafka key = PK → log compaction; consumers use upsert |
| Ordering violation | Partition by PK → all changes for same row → strict order |
CDC vs Dual Writes vs Polling
| Approach | How | Pros | Cons | |---|---|---|---|---| | CDC (log-based) ? | Read database WAL | Zero impact, complete, ordered, real-time | Operational complexity | | Dual writes | App writes to DB + Kafka | Simple | Race condition: DB succeeds, Kafka fails | | Outbox pattern | App writes to outbox → CDC on outbox | Transactional consistency | Extra table, cleanup needed | | Polling (query) | SELECT WHERE updated_at > last_poll | Simple | Misses deletes, not real-time, adds DB load | CDC wins for: real-time sync, search index, cache invalidation, analytics pipeline, event-driven microservices, CQRS read model updates
Multi-Region CDC
Primary DB (us-east) → Debezium → Kafka → MirrorMaker → Kafka (eu-west) Latency: DB change → EU consumer = ~200ms
Interview Walkthrough
- Explain CDC as reading database transaction log (WAL/binlog) — not polling or dual writes.
- Cover exactly-once delivery to downstream sinks via idempotent consumers and offset tracking.
- Discuss schema evolution: additive changes safe, column renames/deletes need careful migration.
- Mention ordering guarantees per partition key when fanning out to search index or cache invalidation.
- Cover lag monitoring and backfill strategy when adding a new downstream consumer.
- Common pitfall: application-level dual write to DB and search index — inconsistency on partial failure is guaranteed.
Log Mining vs Query Polling vs Dual Write
Approach Latency DB Impact Completeness
-------------------------------------------------------------------------
Log Mining Milliseconds Minimal (read-only) 100% — DELETEs,
(Debezium/WAL) Replication slot lag UPDATEs, INSERTs
risk if consumer stalls captured in order
Query Polling Seconds–min Adds SELECT load Misses DELETEs,
(WHERE updated_at (poll interval) on hot tables misses rows updated
> last_seen) twice within gap
Dual Write Milliseconds App-side complexity Inconsistent if app
(App → DB + Kafka) Two network calls crashes after DB
write but before KafkaExactly-Once Delivery: When It Matters
Critical: financial ledger, inventory decrement, email sends. Optional: analytics counters, search index updates, cache warming.
Techniques: Natural idempotency (Elasticsearch upsert, Redis SET), Dedup table, Kafka Transactions.
Schema Evolution in a Live Pipeline
Safe multi-phase: (1) Add column nullable, no default. (2) Deploy consumer code handling both schemas. (3) Deploy writers. (4) After backfill, add NOT NULL. Dead Letter Queue for schema failures: never silently drop.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core cdc pipeline flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.