This problem appears in multiple sheets. Depth expectations increase as you progress:
Interview Prompt
Design Event Sourcing System.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Event log as source of truth, CQRS pattern, Projections/materialized views? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Event log as source of truth
- CQRS pattern
- Projections/materialized views
- Snapshot optimization
- Saga pattern for distributed transactions
- Temporal queries
Out of scope (state explicitly)
- Full analytics warehouse modeling (dbt/Star schema)
- Exactly-once end-to-end across all downstream consumers
- Building Kafka from scratch
Assumptions
- Clarify scale (DAU, QPS, data volume) for event sourcing system in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Append events: Persist all application state changes as an immutable, ordered stream of events rather than direct updates.
- Rebuild state: Reconstruct the current state of any domain aggregate at runtime by replaying its complete sequence of events.
- Event replay: Replay events dynamically to repair corrupt projections, bootstrap new read-optimized views, or fix application bugs.
- Materialized snapshots: Periodically cache aggregate states to optimize read paths and avoid replaying millions of events.
- Dynamic projections: Build and update multiple independent, read-optimized projection stores from the single unified write stream.
- Temporal queries: Enable seamless time-travel capability to query the exact state of any aggregate at any past timestamp.
- Schema evolution: Support robust schema versioning strategies to upcast historical event structures smoothly.
- Pub/Sub streaming: Allow downstream microservices to subscribe to aggregate streams reactively.
- Durability: Events are the immutable source of truth and must guarantee 11 nines of durability (zero data loss).
- Immutability: Enforce append-only storage. Events must be structurally protected from updates, deletion, or tampering.
- Strict Ordering: Events matching a specific aggregate stream must be ordered sequentially without gaps.
- High Write Throughput: Design for 100K+ event writes per second.
- Low Read Latency: Serve read models and denormalized projections in < 10 ms.
- Horizontal Scalability: Scale to billions of global events across hundreds of millions of distinct aggregate entities.
- Backward Compatibility: Ensure older versions of serialized events can be deserialized safely by newer software layers.
| Metric | Calculation | Value |
|---|---|---|
| Events / day | Given (typical workload assumption) | 5 Billion |
| Events / sec | From Events / day ÷ 86400 (+ peak factor in value) | ~58K avg (peak 200K) |
| Avg event size | Given (typical workload assumption) | 500 bytes |
| Write throughput | Derived | 29 MB/s avg, 100 MB/s peak |
| Storage / day | Derived from upstream throughput × size | 2.5 TB |
| Storage / year | Daily storage × 365 | ~900 TB |
| Aggregates (entities) | Derived | 500 Million |
| Avg events per aggregate | Given (typical workload assumption) | 50 |
I/O and Bandwidth Derivations: 1. Daily Telemetry/Data Ingestion: - 5 Billion events/day × 500 bytes/event = 2.5 TB per day. 2. Peak Write Throughput: - 200,000 events/sec peak × 500 bytes = 100 MB/s network ingestion rate. 3. Decadal Aggregate Storage Footprint: - 2.5 TB/day × 365 days = ~912 TB/year. - Fits comfortably in distributed object store repositories or sharded DB storage blocks.
The write path handles incoming commands by validating them against the current domain aggregate state (assembled using snapshots and events). Validated commands write immutable events to the Event Store (PostgreSQL). Change Data Capture (CDC via Debezium) forwards events to Kafka to push them down to projection engines, maintaining eventual consistency in read databases.
1. Event Store: Aggregate Streams & CAS Locking ⭐
Each domain entity is modeled as an independent stream of sequential events. The database enforces Optimistic Concurrency Controlusing an expected revision version check, preventing parallel writers from overwriting concurrent state shifts.
Stream ID: order-12345
Version 1: {type: "OrderCreated", data: {items: [...], total: 100}, timestamp: ...}
Version 2: {type: "PaymentReceived", data: {amount: 100, method: "card"}, timestamp: ...}
Version 3: {type: "OrderShipped", data: {tracking: "UPS123"}, timestamp: ...}Compare-And-Swap (CAS) Concurrency Protocol:
When writing event to stream "order-12345": Expected version: 3 (client knows current version) New event version: 4 IF current stream version ≠ 3 → CONFLICT (another writer modified it) → Client must re-read, re-validate business rules, retry - Compare-And-Swap (CAS) at the stream level. - Eliminates row locks, preventing deadlock hazards completely.
Event Sourcing vs Traditional CRUD:
Traditional CRUD:
UPDATE orders SET status='shipped' WHERE id=12345;
→ Previous state LOST. Audit trail requires separate manual logs.
Event Sourcing:
APPEND {OrderCancelled, reason: "customer_request", cancelled_by: "user-789"}
→ Complete immutable history preserved. Every state transition recorded.2. Snapshots: Performance Optimization ⭐
If an aggregate accumulates thousands of changes, replaying them from version 1 on every command is extremely slow. The system periodically snapshots the materialized aggregate state:
1. Load latest snapshot (version 9,900, state = {...})
2. Replay only events 9,901 to 10,000 (100 events instead of 10,000)
3. Reconstruct current state in milliseconds.
Snapshot Trigger Strategies:
- Every N events (e.g., every 100).
- When event size or quantity exceeds specified threshold.
- On-demand when aggregate experiences high read traffic.3. Projections: CQRS Denormalized Read Models
Command Query Responsibility Segregation (CQRS) decouples writing from reading. The Write store is optimized exclusively for fast, sequential appends. Read models are generated asynchronously by projection builders consuming from the Kafka bus:
Event Stream → Projection Builder → Read-Optimized Database
Projection 1: Order Details (PostgreSQL Read Model)
OrderCreated → INSERT INTO orders (id, items, total, status='created')
PaymentReceived → UPDATE orders SET status='paid', payment_method=...
OrderShipped → UPDATE orders SET status='shipped', tracking=...
Projection 2: Daily Revenue Dashboard (ClickHouse OLAP)
PaymentReceived → INSERT INTO revenue (date, amount, currency)
Projection 3: Real-Time Counter (Redis cache)
OrderCreated → INCR user:{user_id}:order_count1. Command Execution (Write Path)
POST /api/v1/orders
{
"items": [{"product_id": "prod-91", "qty": 2}],
"customer_id": "cust-8821"
}
→ Response: 201 Created (Generates: OrderCreated event)
POST /api/v1/orders/{order_id}/ship
{
"tracking_number": "UPS981242"
}
→ Response: 202 Accepted (Generates: OrderShipped event if state allows)2. Event Queries (Read Path)
GET /api/v1/orders/{order_id}
→ Returns current state from denormalized PostgreSQL projection
GET /api/v1/orders/{order_id}/history
→ Returns all raw historical events for this aggregate from Event Store
GET /api/v1/orders/{order_id}/state-at?timestamp=2026-01-15T00:00:00Z
→ Replays events up to designated timestamp, returning temporal stateCommon Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff
PostgreSQL Event Store & Snapshot Tables
-- Primary event store schema
CREATE TABLE events (
global_position BIGSERIAL, -- global sequential order
stream_id VARCHAR(256) NOT NULL, -- e.g., "order-12345"
stream_version INT NOT NULL, -- per-stream sequence number
event_type VARCHAR(128) NOT NULL, -- e.g., "OrderCreated"
data JSONB NOT NULL, -- immutable event payload
metadata JSONB, -- correlation_id, causation_id, actor
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
PRIMARY KEY (stream_id, stream_version),
UNIQUE (global_position)
);
CREATE INDEX idx_events_stream ON events(stream_id);
-- Snapshot cache table
CREATE TABLE snapshots (
stream_id VARCHAR(256) PRIMARY KEY,
stream_version INT NOT NULL,
state JSONB NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW()
);Kafka Event Bus Topic Topology
Topic: domain-events
- Partition Key: stream_id (Guarantees strict per-aggregate message ordering)
- Payload: { stream_id, event_type, data, metadata, version, timestamp }
- Config: cleanup.policy=delete, retention.ms=-1 (Retain full event history indefinitely; do NOT compact, which would drop superseded events per stream_id key)| Concern | Solution |
|---|---|
| Event Data Loss | Synchronous PostgreSQL primary-standby replication. Kafka partition replica factor = 3. |
| Projection Builder Failure | Each builder persists its last successfully processed Kafka partition offset. Resumes seamlessly. |
| Projection DB Corruption | Wipe the local read model completely and replay all raw events from the Event Store WAL. |
| Event Schema Drift | Implement dynamic Upcaster layers to translate legacy event formats into new structures during read time. |
1. Resolving the Dual-Write delivery gap ⭐
Writing to the Event Store and publishing to Kafka must be highly atomic. If the DB write succeeds but Kafka fails, projections become permanently broken.
- Outbox Pattern: Write both the event and an outbox record in a single database transaction. A separate background runner polls and safely publishes to Kafka with at-least-once guarantees.
- Change Data Capture (CDC): Utilize Debezium to tail the PostgreSQL Write-Ahead Log (WAL). Any insert on the
eventstable instantly generates a Kafka stream record, ensuring zero overhead.
2. Mitigation of Projection Eventual Consistency Lag
Because read models update asynchronously, a user might write an event but see old data when navigating to their page immediately.
- Read-Your-Own-Writes: Return the calculated mutation directly from the command handler to update the client's local store.
- Version-Assertion Queries: The write API returns the new stream version number. Read gateways poll until the projection version matches or exceeds it.
1. Dynamic Schema Evolution with Upcasters ⭐
Over years of development, event payloads inevitably shift (fields are added, deprecated, or restructured). Since events are completely immutable, they cannot be updated directly in storage.
V1 Schema: OrderCreated { items: [...], total: 100 }
V2 Schema: OrderCreated { items: [...], subtotal: 100, tax: 8, total: 108 }
Upcaster Transformation (V1 → V2 on-the-fly read path):
if event.version == 1:
event.data.subtotal = event.data.total
event.data.tax = 0
event.version = 2
return event;2. When NOT to Adopt Event Sourcing
Event sourcing introduces substantial complexity (handling eventual consistency, creating projections, maintaining schemas). Do not use it if: the system is simple CRUD without historical needs, schemas change too rapidly to manage upcasters, or absolute synchronous read consistency is a firm bottleneck.
Interview Walkthrough
- Start by contrasting CRUD (state overwritten) with append-only events (full audit trail) — justify when history and replay matter.
- Draw the CQRS split: PostgreSQL event store for writes, Kafka-driven projection builders for denormalized read models.
- Explain optimistic concurrency with expected stream version — reject conflicting writes instead of row locking.
- Introduce snapshots for long-lived aggregates so command handlers replay only recent events, not thousands.
- Solve the dual-write problem with Outbox pattern or Debezium CDC — never publish to Kafka outside the DB transaction.
- Mention upcasters for schema evolution since stored events are immutable and cannot be rewritten in place.
- Quantify write load: 5B events/day × 500 bytes ≈ 2.5 TB/day — partition streams by aggregate ID for horizontal scale.
- Common pitfall: serving UI reads by replaying the event store on every request — projections exist precisely to avoid this.
1. Event Store Storage Selection
| Feature | PostgreSQL | EventStoreDB | Kafka |
|---|---|---|---|
| Optimistic Concurrency | ✓ (UNIQUE constraints) | ✓ (Built-in revision check) | ✗ (No partition-key version checks) |
| Stream Subscriptions | Polling or LISTEN/NOTIFY | ✓ (Native catch-up client subs) | ✓ (Consumer group offsets) |
| Ordering Guarantee | Per-stream + Global sequential position | Per-stream + Global sequential position | Strictly partition-level ordering |
| Query Flexibility | ✓ (Full relational SQL queries) | Highly limited indexing | ✗ (Offset-based offset scan only) |
| Scalability & Ingestion | Requires manual active sharding | Clustered replicas | Excellent (Highly partitioned streams) |
| Operational Maturity | Extremely mature | Niche specialized storage | Extremely mature |
Recommendation: Use PostgreSQL for write rates < 100K events/second (extremely simple, reliable SQL). Adopt EventStoreDB for specialized native event sourcing, and leverage Kafka primarily as a message delivery bus rather than a transactional source of truth, due to Kafka's lack of key-level optimistic concurrency controls.
2. Snapshot Consistency Verification
If a snapshot is created containing a logic bug, all subsequent states reconstructed from it will be corrupted. Mitigate this by executing continuous offline verification: run background tasks that periodically replay raw historical events from scratch and compare the calculated outcome with current snapshot structures. Discard and rebuild snapshots if any discrepancies are found.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core event sourcing system flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.