Design a Shopping Cart System

Interview Prompt

Design a shopping cart system.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Guest vs logged-in cart merge, Cart ↔ inventory sync, Pricing at checkout?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Guest vs logged-in cart merge
Cart ↔ inventory sync
Pricing at checkout
Session management
Capacity estimation with shown math

Out of scope (state explicitly)

Recommendation engine (#48)
Review/rating system (#70)
Warehouse management (WMS) internals

Assumptions

Clarify scale (DAU, QPS, data volume) for the shopping cart system in the first 5 minutes
Standard reliability target of 99.9%–99.99% unless the problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Add / remove items: Add products with quantity; update quantity; remove items
Persist cart across sessions: Logged-in user's cart survives app close, device switch
Guest cart: Visitors can add items without login; merge cart on sign-in
Price & availability validation: Re-validate prices and stock at checkout (cart may sit for days)
Saved for later: Move items from cart to "Save for Later" wishlist
Cart expiry: Auto-remove items after configurable TTL (e.g., 30 days)
Promotions: Apply coupons, show discounted prices, bundle discounts
Multi-seller: Items from different sellers in one cart; show per-seller subtotals
Shipping estimation: Show estimated delivery date and shipping cost per item
Cart sharing: Share cart via link (gift registries, wishlists)

Metric	Calculation	Value
DAU with active carts	Given (assumption documented in value)	50M
Cart operations / day	50M users × ~10 ops/day	500M
Cart operations / sec	500M ÷ 86400	~6K (peak 50K during sales)
Avg items per cart	Given (typical workload assumption)	5
Cart data per user	Given	~2 KB
Total cart data	50M × 2 KB	100 GB
Guest carts / day	Given (assumption documented in value)	20M (many abandoned)
Cart-to-order conversion	Given	~10%
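
The arithmetic in the table can be reproduced in a few lines. This is a back-of-the-envelope sketch that uses only the table's own inputs; the variable names are illustrative, and decimal units (1 GB = 1,000,000 KB) are assumed.

```python
# Back-of-the-envelope capacity math using the inputs from the table above.
DAU_WITH_CARTS = 50_000_000      # given
OPS_PER_USER_PER_DAY = 10        # given (approximate)
CART_SIZE_KB = 2                 # given
PEAK_OPS_PER_SEC = 50_000        # given (sales peak)
SECONDS_PER_DAY = 86_400

ops_per_day = DAU_WITH_CARTS * OPS_PER_USER_PER_DAY
avg_ops_per_sec = ops_per_day / SECONDS_PER_DAY
total_cart_gb = DAU_WITH_CARTS * CART_SIZE_KB / 1_000_000  # KB -> GB (decimal)

# How much headroom the peak demands relative to the daily average.
peak_to_avg = PEAK_OPS_PER_SEC / avg_ops_per_sec

print(f"ops/day:         {ops_per_day:,}")
print(f"avg ops/sec:     {avg_ops_per_sec:,.0f}")
print(f"total cart data: {total_cart_gb:.0f} GB")
print(f"peak / average:  {peak_to_avg:.1f}x")
```

The peak-to-average ratio (between 8x and 9x here) is the number that drives provisioning: the cart store and API tier are sized for the sale-day peak, not the daily mean.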


Cart Storage: Why Redis

Event Bus Design (Kafka)

Topic: shopping_cart_system-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (user_id / order_id — preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2
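
To illustrate why keying by entity_id preserves per-entity ordering: every event with the same key lands on the same partition, and a partition is consumed in order. The sketch below uses CRC32 as a stand-in hash, which is an assumption (Kafka clients apply their own hash function), and `partition_for` is a hypothetical helper, not a Kafka API.

```python
import zlib

NUM_PARTITIONS = 64  # from the topic config above

def partition_for(entity_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition. CRC32 stands in for the client's real hash."""
    return zlib.crc32(entity_id.encode("utf-8")) % num_partitions

events = [
    ("user-123", "item_added"),
    ("user-456", "item_added"),
    ("user-123", "item_removed"),
]
for entity_id, event_type in events:
    # Both user-123 events get the same partition number, so they stay ordered.
    print(entity_id, event_type, "-> partition", partition_for(entity_id))
```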

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "shopping_cart_system-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: shopping_cart_system-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
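
The sync/async split can be sketched with in-process stand-ins. Everything here is an assumption for illustration: `cart_store` is a dict in place of the source of truth, `event_bus` is a `queue.Queue` in place of the Kafka topic, and `add_to_cart` / `drain_events` are hypothetical names.

```python
import queue
import uuid

cart_store = {}            # stand-in for the source of truth
event_bus = queue.Queue()  # stand-in for the Kafka topic

def add_to_cart(user_id: str, sku_id: str, qty: int):
    """Sync path: validate -> persist -> publish -> return 201."""
    if qty <= 0:
        return 400, {"error": "qty must be positive"}          # validate
    cart = cart_store.setdefault(user_id, {})
    cart[sku_id] = qty                                          # persist
    event_bus.put({"event_id": str(uuid.uuid4()), "type": "item_added",
                   "user_id": user_id, "sku_id": sku_id, "qty": qty})  # publish
    return 201, {"cart": dict(cart)}   # respond without waiting on consumers

def drain_events(handlers) -> int:
    """Async path: run by a worker later, never by the request thread."""
    handled = 0
    while not event_bus.empty():
        event = event_bus.get()
        for handle in handlers:
            handle(event)
        handled += 1
    return handled

status, body = add_to_cart("user-123", "SKU-1", 2)
notifications = []
handled = drain_events([lambda e: notifications.append(e["sku_id"])])
print(status, body, handled, notifications)
```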

Guest Cart → Login Merge

When a guest with 3 items in their cart logs in to an account whose cart already holds 2 items, merge the two carts by keeping the max quantity for overlapping SKUs. Run the merge atomically in a Redis Lua script to prevent race conditions on simultaneous logins.

Scenario:
  1. Guest adds 3 items to cart (stored by session_id)
     cart:guest-session-abc → { SKU-1: qty 1, SKU-2: qty 2, SKU-3: qty 1 }
  
  2. Guest logs in as user-123
     Existing cart: cart:user-123 → { SKU-2: qty 1, SKU-4: qty 3 }
  
  3. Merge strategy:
     For items in BOTH carts (SKU-2):
       Option A: Keep higher quantity → qty = max(2, 1) = 2
       Option B: Sum quantities → qty = 2 + 1 = 3
       Option C: Keep guest cart quantity (most recent intent) → qty = 2
       
       Amazon uses Option A (max). Most user-friendly.
     
     For items only in guest cart: add to user cart
     For items only in user cart: keep
     
     Result: cart:user-123 → { SKU-1: qty 1, SKU-2: qty 2, SKU-3: qty 1, SKU-4: qty 3 }

  4. Delete guest cart: DEL cart:guest-session-abc
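
The merge rules in step 3 can be sketched as a pure function. This is an illustration of the logic only, assuming carts are plain sku_id -> quantity mappings; `merge_carts` and its `resolve` parameter are hypothetical names, and the atomic Redis execution and the final DEL are left out.

```python
from typing import Callable, Dict

Cart = Dict[str, int]  # sku_id -> quantity

def merge_carts(guest: Cart, user: Cart,
                resolve: Callable[[int, int], int] = max) -> Cart:
    """Merge a guest cart into a user cart without mutating either input.

    resolve(guest_qty, user_qty) decides overlapping SKUs:
      max                 -> Option A (keep higher quantity)
      lambda g, u: g + u  -> Option B (sum quantities)
      lambda g, u: g      -> Option C (keep guest quantity)
    """
    merged = dict(user)                      # items only in the user cart are kept
    for sku, guest_qty in guest.items():
        if sku in merged:
            merged[sku] = resolve(guest_qty, merged[sku])
        else:
            merged[sku] = guest_qty          # items only in the guest cart are added
    return merged

guest_cart = {"SKU-1": 1, "SKU-2": 2, "SKU-3": 1}
user_cart = {"SKU-2": 1, "SKU-4": 3}

option_a = merge_carts(guest_cart, user_cart)
option_b = merge_carts(guest_cart, user_cart, lambda g, u: g + u)
print(option_a)
print("Option B quantity for SKU-2:", option_b["SKU-2"])
```

Keeping the conflict rule as a parameter makes it cheap to switch between Options A, B, and C if product requirements change.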

Price Validation at Checkout

Hybrid approach: store price at add time, compare with current price at checkout. Use current price if lower (customer wins), show warning if higher.

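Implementation:
  On cart page load:
    # Assumed layout: hash field = sku_id, value = JSON with qty and price_at_add
    raw = redis.hgetall("cart:user-123")
    cart_items = [{"sku_id": sku, **json.loads(v)} for sku, v in raw.items()]
    sku_ids = [item["sku_id"] for item in cart_items]
    current_prices = catalog_service.batch_get_prices(sku_ids)

    for item in cart_items:
      item["current_price"] = current_prices[item["sku_id"]]
      item["price_changed"] = item["current_price"] != item["price_at_add"]
      item["in_stock"] = inventory_service.check_stock(item["sku_id"])

    return cart_items  # cart with price and stock annotations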
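
The "customer wins" rule can be sketched as a small function. The source only says to show a warning when the price has risen; charging the current price after that warning is an assumption made here, and `PricedItem` and `validate_price` are hypothetical names. `Decimal` is used to avoid float rounding on money.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

@dataclass
class PricedItem:
    sku_id: str
    charge_price: Decimal
    warning: Optional[str] = None

def validate_price(sku_id: str, price_at_add: Decimal,
                   current_price: Decimal) -> PricedItem:
    """Compare the stored add-time price with the current catalog price."""
    if current_price <= price_at_add:
        # Price dropped or is unchanged: the customer gets the current price.
        return PricedItem(sku_id, current_price)
    # Price rose: assumption is that the current price applies, with a warning.
    return PricedItem(
        sku_id,
        current_price,
        warning=f"Price rose from {price_at_add} to {current_price} since this item was added",
    )

lower = validate_price("SKU-1", Decimal("19.99"), Decimal("17.49"))
higher = validate_price("SKU-2", Decimal("5.00"), Decimal("5.50"))
print(lower)
print(higher)
```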

Concern	Solution
Redis node failure	Redis Cluster (6+ nodes); AOF persistence; each shard replicated to at least one replica for failover
Redis total failure	Fall back to PostgreSQL backup (slightly stale); rebuild Redis from PG
Cart data loss	Async backup to PostgreSQL every 5 min; client-side localStorage as last resort
Duplicate add-to-cart	HSET with an absolute quantity is idempotent (same field → overwrite, not duplicate); increment-style adds need a request ID for dedup
Price discrepancy	Always validate at checkout; show warnings on cart page
Race condition (merge)	Redis Lua script for atomic merge; no concurrent merge possible
Abandoned cart recovery	Kafka event on cart inactivity > 1 hour → trigger email reminder
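
The duplicate add-to-cart row relies on overwrite semantics. The sketch below uses a plain dict as a stand-in for a Redis hash to show why a retried set is harmless while a retried increment needs dedup; `set_quantity`, `increment_quantity`, and the request-ID set are hypothetical and not part of the source design.

```python
cart = {}                 # stand-in for a Redis hash: field = sku_id, value = qty
seen_request_ids = set()  # assumption: would be a short-lived store in production

def set_quantity(sku_id: str, qty: int) -> int:
    """HSET-style write: repeating it leaves the same state."""
    cart[sku_id] = qty
    return cart[sku_id]

def increment_quantity(sku_id: str, delta: int, request_id: str) -> int:
    """HINCRBY-style write: a request ID keeps retries from double-counting."""
    if request_id in seen_request_ids:
        return cart.get(sku_id, 0)   # retry: return current state, change nothing
    seen_request_ids.add(request_id)
    cart[sku_id] = cart.get(sku_id, 0) + delta
    return cart[sku_id]

set_quantity("SKU-1", 2)
set_quantity("SKU-1", 2)                  # client retry, same end state
increment_quantity("SKU-2", 1, "req-42")
increment_quantity("SKU-2", 1, "req-42")  # client retry, ignored
print(cart)
```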

Redis vs DynamoDB vs Session Storage for Cart

Redis ⭐ (recommended):
  ✓ Sub-millisecond latency
  ✓ Hash data structure is perfect for cart operations
  ✓ TTL for automatic expiry
  ✓ Lua scripting for atomic operations (merge, validate)
  ✗ In-memory store (needs AOF/RDB for persistence)
  ✗ Cost: RAM is expensive at 100 GB scale ($$$)

DynamoDB:
  ✓ Durable (replicated, persistent)
  ✓ Pay-per-request pricing (good for variable load)
  ✓ Auto-scaling
  ✗ Higher latency (5-10 ms vs < 1 ms Redis)
  ✗ TTL is per item only (no per-attribute expiry), and expired items are not deleted at the exact expiry time
  ✗ Item size limit 400 KB (fine for carts, but constraint)

Server-side session (e.g., PostgreSQL):
  ✓ Durable, ACID
  ✗ Much higher latency (10-50 ms)
  ✗ Cart reads are very frequent → DB bottleneck
  ✗ Not designed for key-value access patterns

Client-side (localStorage):
  ✓ Zero server cost for guest carts
  ✓ Works offline
  ✗ Not synced across devices
  ✗ Lost on browser clear
  ✗ No server-side analytics

Best practice:
  Guest carts: localStorage (client) + async backup to Redis on significant changes
  Logged-in carts: Redis (primary) + async backup to PostgreSQL
  This reduces Redis memory usage (short-lived guest carts mostly stay on the client) while keeping fast access

Cart Abandonment

Industry average: roughly 70% of carts are abandoned (never checked out)

Detection:
  Cart has items + no checkout activity for 1 hour → "abandoned"
  
  Kafka event: { user_id, cart_items, total_value, abandoned_at }
  → Triggers:
    1. Email reminder (1 hour): "You left items in your cart"
    2. Push notification (4 hours): "Your cart is waiting"
    3. Email with discount (24 hours): "10% off your cart — use code COMEBACK10"
    4. Final reminder (72 hours): "Items may sell out soon"
  
  Suppress if:
    - User opted out of marketing
    - Cart value < $10 (not worth the email cost)
    - User has completed a purchase since abandonment
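
The reminder schedule and suppression rules can be sketched as one function. Two assumptions are made here that the source leaves open: delays are measured from the last cart activity, and the caller tracks which reminders were already sent. `due_reminders` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta

REMINDER_SCHEDULE = [
    (timedelta(hours=1), "email_reminder"),
    (timedelta(hours=4), "push_notification"),
    (timedelta(hours=24), "email_with_discount"),
    (timedelta(hours=72), "final_reminder"),
]
MIN_CART_VALUE = 10.0  # dollars, from the suppression rules above

def due_reminders(last_activity_at, now, cart_value, opted_out,
                  last_purchase_at=None):
    """Return reminder names whose delay has elapsed, or [] if suppressed."""
    if opted_out or cart_value < MIN_CART_VALUE:
        return []
    if last_purchase_at is not None and last_purchase_at >= last_activity_at:
        return []  # user bought something after leaving the cart
    elapsed = now - last_activity_at
    return [name for delay, name in REMINDER_SCHEDULE if elapsed >= delay]

t0 = datetime(2024, 1, 1, 12, 0)
after_5h = due_reminders(t0, t0 + timedelta(hours=5), 45.0, opted_out=False)
print(after_5h)
```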

Analytics:
  Track: abandonment rate per step (cart → shipping → payment → confirm)
  Identify: at which step users drop off most
  A/B test: different cart UX → measure conversion rate
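
Per-step abandonment can be computed from funnel counts. The numbers below are made up purely for illustration, and `step_dropoff` is a hypothetical helper; it assumes every step has a positive user count.

```python
def step_dropoff(funnel):
    """funnel: ordered list of (step_name, users_reaching_step).

    Returns (transition, drop_off_rate) for each adjacent pair of steps.
    """
    rates = []
    for (step, count), (next_step, next_count) in zip(funnel, funnel[1:]):
        rates.append((f"{step} -> {next_step}", 1 - next_count / count))
    return rates

# Hypothetical counts, not real data.
funnel = [("cart", 1000), ("shipping", 400), ("payment", 300), ("confirm", 270)]

rates = step_dropoff(funnel)
worst = max(rates, key=lambda r: r[1])
for transition, rate in rates:
    print(f"{transition}: {rate:.0%} drop-off")
print("largest drop-off:", worst[0])
```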

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
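
An availability target translates directly into a downtime budget. The sketch below assumes a 30-day window, which the source does not specify; `downtime_budget_minutes` is a hypothetical helper.

```python
def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)

for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} min per 30 days")
```

At the 99.95% target in the table this works out to about 21.6 minutes per 30 days, which is the budget shared by planned maintenance and unplanned failures.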

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in a secondary region for DR. Move to active-active only when a latency SLO or data residency requires it, and accept the conflict-resolution complexity explicitly.
