Design an Inventory Management System

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 75	Staff level: multi-region, cost at scale, migration path, and production metrics.

Interview Prompt

Design an Inventory Management System.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Reserved vs available stock, Warehouse distribution, Eventual consistency?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Reserved vs available stock
Warehouse distribution
Eventual consistency
Oversell protection
Capacity estimation with shown math

Out of scope (state explicitly)

Recommendation engine (#48)
Review/rating system (#70)
Warehouse management (WMS) internals

Assumptions

Clarify scale (DAU, QPS, data volume) for the inventory management system in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Track stock levels: Real-time quantity tracking per SKU across warehouses, stores, and channels
Stock reservation: Temporarily hold inventory for pending orders (soft lock) before payment confirmation
Stock decrement: Atomically reduce stock on confirmed purchase; prevent overselling
Multi-warehouse: Track inventory per warehouse/fulfillment center with inter-warehouse transfers
Replenishment alerts: Notify when stock falls below reorder point (low-stock threshold)
Batch updates: Bulk ingest from suppliers, warehouse scans, POS systems
Stock adjustments: Manual adjustments for damage, theft, counting discrepancies
Inventory holds: Reserve stock for flash sales, bundles, pre-orders before they go live
Multi-channel sync: Sync available stock across website, mobile app, marketplace (Amazon, eBay), and physical stores
Audit trail: Immutable log of every stock change with reason, actor, and timestamp

Metric	Calculation	Value
Total SKUs	Assumption	100M
Warehouses / fulfillment centers	Assumption	1,000
SKU-warehouse combinations	Assumption (not every SKU is stocked in every warehouse)	~500M
Stock check queries / sec	Stock checks per day ÷ 86,400, times peak factor (each product page view triggers a check)	200K
Reservation requests / sec	Reservations per day ÷ 86,400, times peak factor (add to cart / checkout)	10K
Confirmed decrements / sec	Confirmed orders per day ÷ 86,400, times peak factor	5K
Flash sale peak	Assumption (peak load on hot SKUs)	500K stock ops/sec
Stock record size	Assumption	~200 bytes
Total data	500M × 200 B	100 GB
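
As a rough check of the arithmetic in the table, the sketch below recomputes the total data size and the number of Redis shards the flash-sale peak would need. The 100K ops/sec per-shard figure is the one quoted later in the flash-sale section, the helper names are hypothetical, and the shard count only holds if traffic spreads evenly across shards.

```python
# Back-of-envelope capacity math using the figures from the table above.
SKU_WAREHOUSE_ROWS = 500_000_000   # ~500M SKU-warehouse combinations
RECORD_BYTES = 200                 # ~200 bytes per stock record
FLASH_SALE_OPS_PER_SEC = 500_000   # flash sale peak on hot SKUs
OPS_PER_REDIS_SHARD = 100_000      # assumed per-shard throughput


def total_data_gb(rows: int, record_bytes: int) -> float:
    """Raw data size in decimal gigabytes, before indexes and replicas."""
    return rows * record_bytes / 1e9


def shards_needed(peak_ops: int, ops_per_shard: int) -> int:
    """Ceiling division: the minimum number of shards that absorbs the peak."""
    return -(-peak_ops // ops_per_shard)


data_gb = total_data_gb(SKU_WAREHOUSE_ROWS, RECORD_BYTES)
flash_shards = shards_needed(FLASH_SALE_OPS_PER_SEC, OPS_PER_REDIS_SHARD)
print(f"total data: {data_gb:.0f} GB, flash-sale shards: {flash_shards}")
```

At 100 GB the raw dataset fits comfortably on a single large database node, so sharding here is driven by write throughput and hot keys rather than by storage.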

Stock Reservation Flow — Two-Phase Pattern
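
The two phases are a soft hold (reserve) followed by a commit (confirm) or a release on expiry. The sketch below is a minimal in-memory model of that flow, assuming `available` means on-hand units and sellable stock is `available - reserved`, as in the schema and Redis sections. The class and method names are hypothetical, and a real system would back this with the PostgreSQL tables rather than Python objects.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Reservation:
    reservation_id: str
    quantity: int
    expires_at: float
    status: str = "active"


@dataclass
class StockRecord:
    available: int                      # on-hand units
    reserved: int = 0                   # units held by active reservations
    reservations: dict = field(default_factory=dict)

    @property
    def sellable(self) -> int:
        return self.available - self.reserved

    def reserve(self, quantity: int, now: float, ttl_seconds: float = 600.0):
        """Phase 1: soft-lock stock. Returns None when not enough is sellable."""
        if quantity <= 0 or quantity > self.sellable:
            return None
        res = Reservation(str(uuid.uuid4()), quantity, now + ttl_seconds)
        self.reserved += quantity
        self.reservations[res.reservation_id] = res
        return res

    def confirm(self, reservation_id: str, now: float) -> bool:
        """Phase 2: payment succeeded, so the hold becomes a real decrement."""
        res = self.reservations.get(reservation_id)
        if res is None or res.status != "active" or now >= res.expires_at:
            return False
        res.status = "confirmed"
        self.reserved -= res.quantity
        self.available -= res.quantity
        return True

    def release_expired(self, now: float) -> int:
        """Cleanup job: return units from expired holds to the sellable pool."""
        released = 0
        for res in self.reservations.values():
            if res.status == "active" and now >= res.expires_at:
                res.status = "expired"
                self.reserved -= res.quantity
                released += res.quantity
        return released


stock = StockRecord(available=10)
first = stock.reserve(4, now=0.0)                   # holds 4, leaving 6 sellable
rejected = stock.reserve(7, now=1.0)                # None: only 6 sellable
confirmed = stock.confirm(first.reservation_id, now=5.0)
second = stock.reserve(6, now=10.0)                 # never confirmed
released_units = stock.release_expired(now=700.0)   # past the 10-minute TTL
```

Passing `now` explicitly keeps the example deterministic; the 600-second default mirrors the 10-minute reservation TTL used elsewhere in this design.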

Event Bus Design (Kafka)

Topic: inventory_management_system-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (for inventory events, sku_id; preserves per-SKU ordering of stock changes)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "inventory_management_system-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: inventory_management_system-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
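
At-least-once delivery means a consumer can see the same event twice, so the handler has to be idempotent. The sketch below shows one way to deduplicate by `event_id` and route poison messages to a dead-letter list. It is an in-memory stand-in, not Kafka client code: the class name is hypothetical, the seen-ID set would need to be durable in production, and "3 retries" is interpreted here as three attempts in total.

```python
MAX_ATTEMPTS = 3  # assumption: three attempts in total before dead-lettering


class IdempotentConsumer:
    def __init__(self, handler):
        self.handler = handler
        self.seen_event_ids = set()   # production: a durable store, not memory
        self.dead_letters = []        # stand-in for the DLQ topic

    def process(self, event: dict) -> str:
        event_id = event["event_id"]
        if event_id in self.seen_event_ids:
            return "duplicate"        # redelivery: skip the side effects
        for _attempt in range(MAX_ATTEMPTS):
            try:
                self.handler(event)
            except Exception:
                continue              # retry the same event
            else:
                self.seen_event_ids.add(event_id)
                return "processed"
        self.dead_letters.append(event)
        return "dead_lettered"


applied = []


def apply_stock_change(event: dict) -> None:
    """Hypothetical handler: fails on events flagged as poison."""
    if event.get("poison"):
        raise ValueError("cannot parse event")
    applied.append(event["event_id"])


consumer = IdempotentConsumer(apply_stock_change)
outcomes = [
    consumer.process({"event_id": "e1"}),
    consumer.process({"event_id": "e1"}),                  # redelivered
    consumer.process({"event_id": "e2", "poison": True}),
]
```

Marking the event as seen only after the handler succeeds keeps a crashed attempt retryable, at the cost of requiring the handler itself to tolerate a partial earlier run.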

Flash Sale Concurrency — Redis Counter

Problem: Flash sale of 1,000 units. 500K users hit "Buy" simultaneously.
  PostgreSQL: SELECT ... FOR UPDATE → row-level lock → 500K waiting → timeout/crash

Solution: Use Redis as the fast path for stock decrements.

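Pre-sale setup:
  SET flash_stock:{sale_id}:{sku_id} 1000

On purchase attempt (Lua script run via EVAL; KEYS[1] = flash_stock:{sale_id}:{sku_id}):
  -- The whole script runs atomically; no other command interleaves with it
  local remaining = redis.call('DECR', KEYS[1])
  if remaining >= 0 then
    -- SUCCESS: user got one
    -- Async (outside this script): write to PostgreSQL, create order
    return 'reserved'
  else
    -- SOLD OUT
    redis.call('INCR', KEYS[1])  -- undo the DECR so the counter does not stay negative
    return 'sold_out'
  end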

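Why this works:
  Redis DECR is atomic (commands execute one at a time on a single thread) → no race conditions
  One key lives on one shard, so a single hot SKU is capped at roughly 100K ops/sec; to absorb 500K ops/sec, split its stock across several sub-counter keys on different shards and reject requests at the edge once the SKU is sold out
  After Redis confirms → async write to PostgreSQL (no lock contention)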
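
The decrement-then-undo logic can be exercised without Redis. The sketch below is a simulation under stated assumptions: a lock-protected Python counter stands in for the Redis key, worker threads stand in for concurrent buyers, and all names are hypothetical. It checks the property that matters, namely that exactly the stocked number of buyers succeed however the threads interleave.

```python
import threading


class AtomicCounter:
    """Stand-in for a Redis integer key; the lock mimics one-at-a-time command execution."""

    def __init__(self, value: int):
        self._value = value
        self._lock = threading.Lock()

    def decr(self) -> int:
        with self._lock:
            self._value -= 1
            return self._value

    def incr(self) -> int:
        with self._lock:
            self._value += 1
            return self._value

    def get(self) -> int:
        with self._lock:
            return self._value


def try_purchase(counter: AtomicCounter) -> str:
    remaining = counter.decr()
    if remaining >= 0:
        return "reserved"
    counter.incr()            # undo the decrement, as in the script above
    return "sold_out"


def run_sale(units: int, buyers: int, workers: int = 8):
    counter = AtomicCounter(units)
    results = []
    results_lock = threading.Lock()

    def worker(attempts: int) -> None:
        local = [try_purchase(counter) for _ in range(attempts)]
        with results_lock:
            results.extend(local)

    threads = [threading.Thread(target=worker, args=(buyers // workers,))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results.count("reserved"), results.count("sold_out"), counter.get()


reserved_count, sold_out_count, final_value = run_sale(units=1000, buyers=8000)
```

No failure can occur before the 1,000th success, because a decrement only returns a negative value once the counter has already reached zero. That is why the count of winners is exact rather than approximate.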

Multi-Warehouse Stock Allocation

User orders SKU-123. It's available in 3 warehouses:
  WH-NYC: 50 units (200 miles from user)
  WH-CHI: 30 units (700 miles)  
  WH-LAX: 100 units (2500 miles)

Allocation strategies:

1. Nearest warehouse (minimize shipping cost + time):
   Sort warehouses by distance to delivery address
   Pick closest with stock → WH-NYC ✓

2. Load-balanced (prevent one warehouse from depleting):
   Pick warehouse with highest stock level → WH-LAX

3. Cost-optimized (minimize total fulfillment cost):
   cost = shipping_cost + handling_cost + last_mile_cost
   Consider: shipping zone, carrier rates, warehouse labor cost
   Pick minimum cost

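4. Hybrid ⭐ (weighted score; often described as Amazon's approach):
   score = w1 × (1 / distance) + w2 × stock_level + w3 × (1 / cost)
           + w4 × delivery_speed_guarantee
   Normalize each term to a common scale first so the weights are comparable
   Pick highest score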
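
The first, second, and fourth strategies can be compared on the three-warehouse example above. This is a minimal sketch with assumptions labeled: the weights are hypothetical, the hybrid score normalizes distance and stock to a 0..1 range instead of using raw `1 / distance`, and the cost and delivery-speed terms are omitted because the example gives no data for them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Warehouse:
    name: str
    stock: int
    distance_miles: float


WAREHOUSES = [
    Warehouse("WH-NYC", 50, 200),
    Warehouse("WH-CHI", 30, 700),
    Warehouse("WH-LAX", 100, 2500),
]


def _candidates(warehouses, quantity):
    """Only warehouses that can fill the whole quantity are eligible."""
    return [w for w in warehouses if w.stock >= quantity]


def nearest(warehouses, quantity):
    return min(_candidates(warehouses, quantity),
               key=lambda w: w.distance_miles, default=None)


def highest_stock(warehouses, quantity):
    return max(_candidates(warehouses, quantity),
               key=lambda w: w.stock, default=None)


def hybrid(warehouses, quantity, w_distance=0.7, w_stock=0.3):
    """Weighted score over normalized closeness and stock (hypothetical weights)."""
    candidates = _candidates(warehouses, quantity)
    if not candidates:
        return None
    max_dist = max(w.distance_miles for w in candidates) or 1.0
    max_stock = max(w.stock for w in candidates) or 1

    def score(w):
        closeness = 1 - w.distance_miles / max_dist   # 1.0 = closest possible
        return w_distance * closeness + w_stock * (w.stock / max_stock)

    return max(candidates, key=score)


pick_nearest = nearest(WAREHOUSES, 1)
pick_stock = highest_stock(WAREHOUSES, 1)
pick_hybrid = hybrid(WAREHOUSES, 1)
pick_large_order = nearest(WAREHOUSES, 60)   # only WH-LAX holds 60 units
```

With these weights the hybrid agrees with the nearest-warehouse choice; shifting weight toward stock level eventually flips it to WH-LAX, which is the tuning knob an interviewer is likely to probe.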

Split shipment:
  Order has 3 items: SKU-A (only in WH-NYC), SKU-B (only in WH-LAX), SKU-C (both)
  → Split into 2 shipments: {SKU-A, SKU-C} from WH-NYC, {SKU-B} from WH-LAX
  → Minimize number of shipments while respecting stock constraints
  → NP-hard problem at scale → greedy heuristic: maximize items per shipment
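
The greedy heuristic amounts to greedy set cover: repeatedly pick the warehouse that covers the most SKUs still unassigned. The sketch below is a simplification that treats every order line as quantity 1 and breaks ties in favor of the warehouse listed first (for example, a list pre-sorted by distance). The function name and the stock map are hypothetical.

```python
def split_shipments(order_skus, stock_by_warehouse):
    """Greedy set cover: each round takes the warehouse covering the most remaining SKUs."""
    remaining = set(order_skus)
    shipments = {}
    while remaining:
        # max() returns the first maximal element, so ties go to the
        # warehouse that appears earliest in the mapping.
        best = max(stock_by_warehouse,
                   key=lambda wh: len(remaining & stock_by_warehouse[wh]),
                   default=None)
        covered = remaining & stock_by_warehouse[best] if best is not None else set()
        if not covered:
            raise ValueError(f"no warehouse stocks: {sorted(remaining)}")
        shipments[best] = covered
        remaining -= covered
    return shipments


STOCK = {
    "WH-NYC": {"SKU-A", "SKU-C"},
    "WH-LAX": {"SKU-B", "SKU-C"},
}
plan = split_shipments(["SKU-A", "SKU-B", "SKU-C"], STOCK)
```

Greedy set cover is not guaranteed to find the minimum number of shipments, but it is fast and its result is within a logarithmic factor of optimal, which is usually an acceptable trade at checkout latency.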

PostgreSQL: Source of Truth (Sharded by sku_id)

CREATE TABLE inventory (
    sku_id          VARCHAR(50) NOT NULL,
    warehouse_id    VARCHAR(50) NOT NULL,
    available       INT NOT NULL DEFAULT 0 CHECK (available >= 0),
    reserved        INT NOT NULL DEFAULT 0 CHECK (reserved >= 0),
    reorder_point   INT DEFAULT 10,
    max_stock       INT DEFAULT 1000,
    last_replenished TIMESTAMPTZ,
    updated_at      TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (sku_id, warehouse_id),
    CHECK (reserved <= available)
);

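-- PostgreSQL has no inline ENUM(...) column type; define the type first
CREATE TYPE reservation_status AS ENUM ('active', 'confirmed', 'released', 'expired');

CREATE TABLE reservations (
    reservation_id  UUID PRIMARY KEY,
    sku_id          VARCHAR(50) NOT NULL,
    warehouse_id    VARCHAR(50) NOT NULL,
    user_id         UUID NOT NULL,
    order_id        UUID,
    quantity        INT NOT NULL,
    status          reservation_status NOT NULL DEFAULT 'active',
    expires_at      TIMESTAMPTZ NOT NULL,
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    updated_at      TIMESTAMPTZ DEFAULT NOW()
);

-- PostgreSQL creates indexes with separate statements, not inside CREATE TABLE
-- Partial index: the expiry sweeper only scans active reservations
CREATE INDEX idx_status_expires ON reservations (status, expires_at) WHERE status = 'active';
CREATE INDEX idx_sku ON reservations (sku_id, warehouse_id);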

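-- PostgreSQL has no inline ENUM(...) column type; define the type first
CREATE TYPE stock_change_type AS ENUM ('reserve', 'confirm', 'release', 'adjust', 'replenish', 'return');

CREATE TABLE stock_audit_log (
    log_id          BIGSERIAL PRIMARY KEY,
    sku_id          VARCHAR(50) NOT NULL,
    warehouse_id    VARCHAR(50) NOT NULL,
    change_type     stock_change_type,
    quantity_change INT NOT NULL,
    previous_available INT,
    new_available   INT,
    reason          TEXT,
    actor_id        VARCHAR(50),
    idempotency_key VARCHAR(64),
    created_at      TIMESTAMPTZ DEFAULT NOW()
);

-- Index created separately; supports "recent changes for this SKU" lookups
CREATE INDEX idx_sku_time ON stock_audit_log (sku_id, created_at DESC);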
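
With this schema, oversell protection comes down to a single conditional UPDATE instead of a read followed by a write. The sketch below demonstrates the pattern with Python's built-in `sqlite3` standing in for PostgreSQL, so it runs without a server; the table is trimmed to the columns that matter and the `reserve` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # SQLite stands in for PostgreSQL here
conn.execute("""
    CREATE TABLE inventory (
        sku_id       TEXT NOT NULL,
        warehouse_id TEXT NOT NULL,
        available    INTEGER NOT NULL DEFAULT 0 CHECK (available >= 0),
        reserved     INTEGER NOT NULL DEFAULT 0 CHECK (reserved >= 0),
        PRIMARY KEY (sku_id, warehouse_id),
        CHECK (reserved <= available)
    )
""")
conn.execute("INSERT INTO inventory VALUES ('SKU-123', 'WH-NYC', 50, 0)")
conn.commit()


def reserve(conn, sku_id: str, warehouse_id: str, quantity: int) -> bool:
    """Reserve stock in one statement; the WHERE clause is the oversell guard."""
    cur = conn.execute(
        "UPDATE inventory SET reserved = reserved + ? "
        "WHERE sku_id = ? AND warehouse_id = ? AND available - reserved >= ?",
        (quantity, sku_id, warehouse_id, quantity),
    )
    conn.commit()
    return cur.rowcount == 1         # 0 rows updated means not enough stock


first_ok = reserve(conn, "SKU-123", "WH-NYC", 30)    # 50 sellable, succeeds
second_ok = reserve(conn, "SKU-123", "WH-NYC", 30)   # only 20 sellable, fails
reserved_now = conn.execute(
    "SELECT reserved FROM inventory WHERE sku_id = 'SKU-123'"
).fetchone()[0]
```

The CHECK constraint remains as a backstop: even if some code path skips the guard in the WHERE clause, the database rejects a row where `reserved` exceeds `available`.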

Redis: Hot Path Cache + Flash Sale

# Sellable stock cache (read path)
stock:{sku_id}:{warehouse_id}  → INT (sellable = available - reserved)
TTL: 60 seconds (refreshed from PostgreSQL)

# Flash sale atomic counters
flash_stock:{sale_id}:{sku_id}  → INT
No TTL (managed explicitly)

# Reservation TTL keys
reservation:{reservation_id}   → JSON { sku_id, warehouse_id, quantity, user_id }
TTL: 600 (10 minutes)

# Idempotency (prevent duplicate reserve/confirm)
idempotency:{key}  → response JSON
TTL: 86400
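
The idempotency entries work by storing the first response under the request's key and replaying it for any retry within the TTL. The sketch below models that with an in-memory dictionary and an injectable clock; the class name is hypothetical, and unlike Redis it does not guard against two identical requests arriving at the same instant (in Redis that is usually handled by setting the key with the NX option).

```python
import time


class IdempotencyCache:
    """In-memory stand-in for the idempotency:{key} entries and their TTL."""

    def __init__(self, ttl_seconds: float = 86_400, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}            # key -> (expires_at, response)

    def get_or_run(self, key: str, operation):
        """Replay the stored response for a repeated key, else run and store."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        response = operation()
        self._entries[key] = (now + self._ttl, response)
        return response


fake_now = [0.0]                      # mutable clock so the example is deterministic
cache = IdempotencyCache(clock=lambda: fake_now[0])
calls = []


def reserve_once():
    """Hypothetical side-effecting operation; counts how often it really runs."""
    calls.append(1)
    return {"status": "reserved", "attempt": len(calls)}


first_response = cache.get_or_run("req-1", reserve_once)
retry_response = cache.get_or_run("req-1", reserve_once)   # replayed, not re-run
fake_now[0] = 86_401.0                                     # past the 24-hour TTL
late_response = cache.get_or_run("req-1", reserve_once)    # runs again
```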

Kafka Topics

Topic: stock-changes         (every mutation → consumed by channel sync, analytics, alerts)
Topic: reservation-events    (reserved, confirmed, released → consumed by order service)
Topic: low-stock-alerts      (sku dropped below reorder point → notify procurement)

Concern	Solution
Overselling	Reservation pattern + PostgreSQL CHECK constraint + atomic Redis DECR
Reservation leak (never confirmed/released)	TTL-based expiry + cron cleanup job every 1 min
Redis ↔ PostgreSQL divergence	Periodic reconciliation (every 5 min); Redis is cache, PostgreSQL is truth
Double decrement (retry)	Idempotency key on every reserve/confirm request
Warehouse system offline	Queue updates in Kafka; apply when system recovers; use last-known stock
Flash sale thundering herd	Redis absorbs all traffic; PostgreSQL writes batched asynchronously
Database failover	PostgreSQL synchronous replication; promote standby on primary failure
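
The periodic reconciliation in the table can be sketched as a pass that recomputes sellable stock from the source of truth and overwrites any cached value that disagrees. This is an illustrative in-memory version: plain dictionaries stand in for Redis and PostgreSQL, the key format follows the `stock:{sku_id}:{warehouse_id}` pattern above, and the function name is hypothetical.

```python
def reconcile(cache: dict, truth: dict) -> dict:
    """Make the cache match the truth; returns {key: (old_value, new_value)}."""
    corrections = {}
    for key, (available, reserved) in truth.items():
        sellable = available - reserved
        if cache.get(key) != sellable:
            corrections[key] = (cache.get(key), sellable)
            cache[key] = sellable
    # Drop cached keys the source of truth no longer knows about.
    for key in set(cache) - set(truth):
        corrections[key] = (cache.pop(key), None)
    return corrections


truth_rows = {
    "stock:SKU-1:WH-NYC": (50, 10),   # (available, reserved) -> sellable 40
    "stock:SKU-2:WH-NYC": (5, 5),     # sellable 0
}
cached = {
    "stock:SKU-1:WH-NYC": 40,         # already correct
    "stock:SKU-2:WH-NYC": 3,          # stale: would allow an oversell
    "stock:SKU-9:WH-LAX": 7,          # orphan: SKU removed upstream
}
corrections = reconcile(cached, truth_rows)
```

Logging the returned corrections gives a drift metric for free: a rising count between runs suggests the async write path is dropping or delaying updates.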

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
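
An availability target turns into a concrete error budget with one line of arithmetic. The sketch below converts the 99.95% target, along with the 99.9% and 99.99% bounds from the assumptions, into allowed downtime per 30-day window. The 30-day window and the helper name are assumptions.

```python
def error_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


budgets = {
    target: round(error_budget_minutes(target), 2)
    for target in (99.9, 99.95, 99.99)
}
for target, minutes in budgets.items():
    print(f"{target}% -> {minutes} min per 30 days")
```

At 99.95% the budget is about 21.6 minutes a month, roughly half of what 99.9% allows, which is why a five-minute automated rollback still consumes a meaningful share of it.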

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
