Design an Event Sourcing System

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Show domain depth beyond the baseline: async pipelines, consistency semantics, and operational SLOs.
Arch 75	Staff angles: partition behavior, cost drivers, and MVP → production evolution with clear triggers.

Interview Prompt

Design Event Sourcing System.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Event log as source of truth, CQRS pattern, Projections/materialized views?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Event log as source of truth
CQRS pattern
Projections/materialized views
Snapshot optimization
Saga pattern for distributed transactions
Temporal queries

Out of scope (state explicitly)

Full analytics warehouse modeling (dbt/Star schema)
Exactly-once end-to-end across all downstream consumers
Building Kafka from scratch

Assumptions

Clarify scale (DAU, QPS, data volume) for event sourcing system in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Append events: Persist all application state changes as an immutable, ordered stream of events rather than direct updates.
Rebuild state: Reconstruct the current state of any domain aggregate at runtime by replaying its complete sequence of events.
Event replay: Replay events dynamically to repair corrupt projections, bootstrap new read-optimized views, or fix application bugs.
Materialized snapshots: Periodically cache aggregate states to optimize read paths and avoid replaying millions of events.
Dynamic projections: Build and update multiple independent, read-optimized projection stores from the single unified write stream.
Temporal queries: Enable seamless time-travel capability to query the exact state of any aggregate at any past timestamp.
Schema evolution: Support robust schema versioning strategies to upcast historical event structures smoothly.
Pub/Sub streaming: Allow downstream microservices to subscribe to aggregate streams reactively.

Metric	Calculation	Value
Events / day	Given (typical workload assumption)	5 Billion
Events / sec	From Events / day ÷ 86400 (+ peak factor in value)	~58K avg (peak 200K)
Avg event size	Given (typical workload assumption)	500 bytes
Write throughput	Derived	29 MB/s avg, 100 MB/s peak
Storage / day	Derived from upstream throughput × size	2.5 TB
Storage / year	Daily storage × 365	~900 TB
Aggregates (entities)	Derived	500 Million
Avg events per aggregate	Given (typical workload assumption)	50

I/O and Bandwidth Derivations:
1. Daily Telemetry/Data Ingestion:
   - 5 Billion events/day × 500 bytes/event = 2.5 TB per day.
2. Peak Write Throughput:
   - 200,000 events/sec peak × 500 bytes = 100 MB/s network ingestion rate.
3. Decadal Aggregate Storage Footprint:
   - 2.5 TB/day × 365 days = ~912 TB/year.
   - Fits comfortably in distributed object store repositories or sharded DB storage blocks.

The write path handles incoming commands by validating them against the current domain aggregate state (assembled using snapshots and events). Validated commands write immutable events to the Event Store (PostgreSQL). Change Data Capture (CDC via Debezium) forwards events to Kafka to push them down to projection engines, maintaining eventual consistency in read databases.

Loading...

1. Event Store: Aggregate Streams & CAS Locking ⭐

Each domain entity is modeled as an independent stream of sequential events. The database enforces Optimistic Concurrency Controlusing an expected revision version check, preventing parallel writers from overwriting concurrent state shifts.

Stream ID: order-12345
  Version 1: {type: "OrderCreated", data: {items: [...], total: 100}, timestamp: ...}
  Version 2: {type: "PaymentReceived", data: {amount: 100, method: "card"}, timestamp: ...}
  Version 3: {type: "OrderShipped", data: {tracking: "UPS123"}, timestamp: ...}

Compare-And-Swap (CAS) Concurrency Protocol:

When writing event to stream "order-12345":
  Expected version: 3 (client knows current version)
  New event version: 4
  
  IF current stream version ≠ 3 → CONFLICT (another writer modified it)
  → Client must re-read, re-validate business rules, retry

- Compare-And-Swap (CAS) at the stream level.
- Eliminates row locks, preventing deadlock hazards completely.

Event Sourcing vs Traditional CRUD:

Traditional CRUD:
  UPDATE orders SET status='shipped' WHERE id=12345;
  → Previous state LOST. Audit trail requires separate manual logs.

Event Sourcing:
  APPEND {OrderCancelled, reason: "customer_request", cancelled_by: "user-789"}
  → Complete immutable history preserved. Every state transition recorded.

2. Snapshots: Performance Optimization ⭐

If an aggregate accumulates thousands of changes, replaying them from version 1 on every command is extremely slow. The system periodically snapshots the materialized aggregate state:

1. Load latest snapshot (version 9,900, state = {...})
2. Replay only events 9,901 to 10,000 (100 events instead of 10,000)
3. Reconstruct current state in milliseconds.

Snapshot Trigger Strategies:
- Every N events (e.g., every 100).
- When event size or quantity exceeds specified threshold.
- On-demand when aggregate experiences high read traffic.

3. Projections: CQRS Denormalized Read Models

Command Query Responsibility Segregation (CQRS) decouples writing from reading. The Write store is optimized exclusively for fast, sequential appends. Read models are generated asynchronously by projection builders consuming from the Kafka bus:

Event Stream → Projection Builder → Read-Optimized Database

Projection 1: Order Details (PostgreSQL Read Model)
  OrderCreated    → INSERT INTO orders (id, items, total, status='created')
  PaymentReceived → UPDATE orders SET status='paid', payment_method=...
  OrderShipped    → UPDATE orders SET status='shipped', tracking=...

Projection 2: Daily Revenue Dashboard (ClickHouse OLAP)
  PaymentReceived → INSERT INTO revenue (date, amount, currency)

Projection 3: Real-Time Counter (Redis cache)
  OrderCreated    → INCR user:{user_id}:order_count

1. Command Execution (Write Path)

HTTP

POST /api/v1/orders
{
  "items": [{"product_id": "prod-91", "qty": 2}],
  "customer_id": "cust-8821"
}
→ Response: 201 Created (Generates: OrderCreated event)

POST /api/v1/orders/{order_id}/ship
{
  "tracking_number": "UPS981242"
}
→ Response: 202 Accepted (Generates: OrderShipped event if state allows)

2. Event Queries (Read Path)

HTTP

GET /api/v1/orders/{order_id}
→ Returns current state from denormalized PostgreSQL projection

GET /api/v1/orders/{order_id}/history
→ Returns all raw historical events for this aggregate from Event Store

GET /api/v1/orders/{order_id}/state-at?timestamp=2026-01-15T00:00:00Z
→ Replays events up to designated timestamp, returning temporal state

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff

PostgreSQL Event Store & Snapshot Tables

SQL

-- Primary event store schema
CREATE TABLE events (
    global_position  BIGSERIAL,           -- global sequential order
    stream_id        VARCHAR(256) NOT NULL, -- e.g., "order-12345"
    stream_version   INT NOT NULL,         -- per-stream sequence number
    event_type       VARCHAR(128) NOT NULL, -- e.g., "OrderCreated"
    data             JSONB NOT NULL,       -- immutable event payload
    metadata         JSONB,                -- correlation_id, causation_id, actor
    created_at       TIMESTAMP NOT NULL DEFAULT NOW(),
    PRIMARY KEY (stream_id, stream_version),
    UNIQUE (global_position)
);
CREATE INDEX idx_events_stream ON events(stream_id);

-- Snapshot cache table
CREATE TABLE snapshots (
    stream_id        VARCHAR(256) PRIMARY KEY,
    stream_version   INT NOT NULL,
    state            JSONB NOT NULL,
    created_at       TIMESTAMP NOT NULL DEFAULT NOW()
);

Kafka Event Bus Topic Topology

Topic: domain-events
- Partition Key: stream_id (Guarantees strict per-aggregate message ordering)
- Payload: { stream_id, event_type, data, metadata, version, timestamp }
- Config: cleanup.policy=delete, retention.ms=-1 (Retain full event history indefinitely; do NOT compact, which would drop superseded events per stream_id key)

Concern	Solution
Event Data Loss	Synchronous PostgreSQL primary-standby replication. Kafka partition replica factor = 3.
Projection Builder Failure	Each builder persists its last successfully processed Kafka partition offset. Resumes seamlessly.
Projection DB Corruption	Wipe the local read model completely and replay all raw events from the Event Store WAL.
Event Schema Drift	Implement dynamic Upcaster layers to translate legacy event formats into new structures during read time.

1. Resolving the Dual-Write delivery gap ⭐

Writing to the Event Store and publishing to Kafka must be highly atomic. If the DB write succeeds but Kafka fails, projections become permanently broken.

Outbox Pattern: Write both the event and an outbox record in a single database transaction. A separate background runner polls and safely publishes to Kafka with at-least-once guarantees.
Change Data Capture (CDC): Utilize Debezium to tail the PostgreSQL Write-Ahead Log (WAL). Any insert on the events table instantly generates a Kafka stream record, ensuring zero overhead.

2. Mitigation of Projection Eventual Consistency Lag

Because read models update asynchronously, a user might write an event but see old data when navigating to their page immediately.

Read-Your-Own-Writes: Return the calculated mutation directly from the command handler to update the client's local store.
Version-Assertion Queries: The write API returns the new stream version number. Read gateways poll until the projection version matches or exceeds it.

1. Dynamic Schema Evolution with Upcasters ⭐

Over years of development, event payloads inevitably shift (fields are added, deprecated, or restructured). Since events are completely immutable, they cannot be updated directly in storage.

PYTHON

V1 Schema: OrderCreated { items: [...], total: 100 }
V2 Schema: OrderCreated { items: [...], subtotal: 100, tax: 8, total: 108 }

Upcaster Transformation (V1 → V2 on-the-fly read path):
if event.version == 1:
    event.data.subtotal = event.data.total
    event.data.tax = 0
    event.version = 2
return event;

2. When NOT to Adopt Event Sourcing

Event sourcing introduces substantial complexity (handling eventual consistency, creating projections, maintaining schemas). Do not use it if: the system is simple CRUD without historical needs, schemas change too rapidly to manage upcasters, or absolute synchronous read consistency is a firm bottleneck.

Interview Walkthrough

Start by contrasting CRUD (state overwritten) with append-only events (full audit trail) — justify when history and replay matter.
Draw the CQRS split: PostgreSQL event store for writes, Kafka-driven projection builders for denormalized read models.
Explain optimistic concurrency with expected stream version — reject conflicting writes instead of row locking.
Introduce snapshots for long-lived aggregates so command handlers replay only recent events, not thousands.
Solve the dual-write problem with Outbox pattern or Debezium CDC — never publish to Kafka outside the DB transaction.
Mention upcasters for schema evolution since stored events are immutable and cannot be rewritten in place.
Quantify write load: 5B events/day × 500 bytes ≈ 2.5 TB/day — partition streams by aggregate ID for horizontal scale.
Common pitfall: serving UI reads by replaying the event store on every request — projections exist precisely to avoid this.

1. Event Store Storage Selection

Feature	PostgreSQL	EventStoreDB	Kafka
Optimistic Concurrency	✓ (UNIQUE constraints)	✓ (Built-in revision check)	✗ (No partition-key version checks)
Stream Subscriptions	Polling or LISTEN/NOTIFY	✓ (Native catch-up client subs)	✓ (Consumer group offsets)
Ordering Guarantee	Per-stream + Global sequential position	Per-stream + Global sequential position	Strictly partition-level ordering
Query Flexibility	✓ (Full relational SQL queries)	Highly limited indexing	✗ (Offset-based offset scan only)
Scalability & Ingestion	Requires manual active sharding	Clustered replicas	Excellent (Highly partitioned streams)
Operational Maturity	Extremely mature	Niche specialized storage	Extremely mature

Recommendation: Use PostgreSQL for write rates < 100K events/second (extremely simple, reliable SQL). Adopt EventStoreDB for specialized native event sourcing, and leverage Kafka primarily as a message delivery bus rather than a transactional source of truth, due to Kafka's lack of key-level optimistic concurrency controls.

2. Snapshot Consistency Verification

If a snapshot is created containing a logic bug, all subsequent states reconstructed from it will be corrupted. Mitigate this by executing continuous offline verification: run background tasks that periodically replay raw historical events from scratch and compare the calculated outcome with current snapshot structures. Discard and rebuild snapshots if any discrepancies are found.

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.