Interview Prompt
Design a Shopping Cart System.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is the highest priority: guest vs logged-in cart merge, cart ↔ inventory sync, or pricing at checkout? | Forces scope negotiation; senior candidates trim scope before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Guest vs logged-in cart merge
- Cart ↔ inventory sync
- Pricing at checkout
- Session management
- Capacity estimation with shown math
Out of scope (state explicitly)
- Recommendation engine (#48)
- Review/rating system (#70)
- Warehouse management (WMS) internals
Assumptions
- Clarify scale (DAU, QPS, data volume) for the shopping cart system in the first 5 minutes
- Standard reliability target of 99.9%–99.99% unless the problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
Functional Requirements
- Add / remove items: Add products with quantity; update quantity; remove items
- Persist cart across sessions: Logged-in user's cart survives app close, device switch
- Guest cart: Visitors can add items without login; merge cart on sign-in
- Price & availability validation: Re-validate prices and stock at checkout (cart may sit for days)
- Saved for later: Move items from cart to "Save for Later" wishlist
- Cart expiry: Auto-remove items after configurable TTL (e.g., 30 days)
- Promotions: Apply coupons, show discounted prices, bundle discounts
- Multi-seller: Items from different sellers in one cart; show per-seller subtotals
- Shipping estimation: Show estimated delivery date and shipping cost per item
- Cart sharing: Share cart via link (gift registries, wishlists)
Non-Functional Requirements
- Low Latency: Cart operations (add/remove/update) in < 50 ms
- High Availability: 99.99%, because a cart failure means lost revenue
- Consistency: Cart state must be consistent across devices in real-time
- Scale: 100M+ DAU, 500M+ cart operations/day
- Durability: Cart data survives Redis restart (AOF persistence)
- Session affinity not required: Any API server can serve any cart (stateless)
- Idempotent: Duplicate "add to cart" requests don't double-add
| Metric | Calculation | Value |
|---|---|---|
| DAU with active carts | Assumption | 50M |
| Cart operations / day | 50M users × ~10 ops/day | 500M |
| Cart operations / sec | 500M ÷ 86,400 s | ~5.8K average (peak 50K during sales) |
| Avg items per cart | Assumption (typical workload) | 5 |
| Cart data per user | Assumption | ~2 KB |
| Total cart data | 50M × 2 KB | 100 GB |
| Guest carts / day | Assumption | 20M (many abandoned) |
| Cart-to-order conversion | Assumption | ~10% |
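
The arithmetic in the table can be reproduced with a few lines, which is a convenient way to re-run the estimate when the interviewer changes an input. This is a minimal sketch: the variable names are illustrative, and "2 KB" is treated as 2,000 bytes so that the total matches the 100 GB shown above.

```python
SECONDS_PER_DAY = 86_400

# Inputs taken from the estimation table (all are stated assumptions).
active_cart_users = 50_000_000   # DAU with active carts
ops_per_user_per_day = 10        # ~10 cart operations per user per day
cart_size_bytes = 2_000          # ~2 KB per cart, decimal units assumed
peak_ops_per_sec = 50_000        # stated peak during sales

ops_per_day = active_cart_users * ops_per_user_per_day
avg_ops_per_sec = ops_per_day / SECONDS_PER_DAY
total_cart_bytes = active_cart_users * cart_size_bytes
peak_to_avg = peak_ops_per_sec / avg_ops_per_sec

print(f"cart ops/day:     {ops_per_day:,}")
print(f"avg cart ops/sec: {avg_ops_per_sec:,.0f}")
print(f"total cart data:  {total_cart_bytes / 1e9:.0f} GB")
print(f"peak/avg ratio:   {peak_to_avg:.1f}x")
```

The peak-to-average ratio (roughly 8.6x with these inputs) is the number that sizes headroom for sale events.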
Cart Storage: Why Redis
Event Bus Design (Kafka)
Topic: shopping_cart_system-events
Partitions: 64 (scale consumers horizontally)
Partition key: entity_id (user_id / order_id; preserves per-entity ordering)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "shopping_cart_system-processors"
- At-least-once delivery + idempotent handlers (dedup by event_id)
- DLQ topic: shopping_cart_system-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
Design a Shopping Cart System: async side effects MUST NOT block the synchronous API response.
Sync path: validate → persist source of truth → publish event → return 201
Async path: consumers update caches, indexes, notifications, aggregates
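
At-least-once delivery means a consumer can see the same event twice, so the handler side has to deduplicate by event_id and park poison messages. The sketch below illustrates that pattern only. It uses no Kafka client at all: `CartEventProcessor`, the in-memory `seen_event_ids` set, and the `dead_letters` list are hypothetical stand-ins for a durable dedup store and the DLQ topic, and "3 retries" is interpreted here as three attempts in total.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_ATTEMPTS = 3  # assumption: "3 retries" read as three attempts in total


@dataclass
class CartEventProcessor:
    """Applies each event at most once, even if it is delivered repeatedly."""

    handler: Callable[[dict], None]
    seen_event_ids: set = field(default_factory=set)   # stand-in for a durable store
    dead_letters: list = field(default_factory=list)   # stand-in for the DLQ topic

    def process(self, event: dict) -> str:
        event_id = event["event_id"]
        if event_id in self.seen_event_ids:
            return "duplicate"  # redelivery: side effect already applied
        for _ in range(MAX_ATTEMPTS):
            try:
                self.handler(event)
            except Exception:
                continue  # retry; a real worker would back off between attempts
            self.seen_event_ids.add(event_id)
            return "processed"
        self.dead_letters.append(event)  # poison message: park it for inspection
        return "dead-lettered"
```

Marking the event as seen only after the handler succeeds keeps a failed attempt retryable, at the cost of requiring the handler itself to tolerate a partial earlier run.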
Guest Cart → Login Merge
When a guest with 3 items logs in to an existing user cart with 2 items, merge by keeping max quantity for overlapping SKUs. Implemented atomically with Redis Lua script to prevent race conditions on simultaneous logins.
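
The max-quantity rule itself is small enough to sketch in plain Python. This shows only the merge logic on dictionaries of SKU to quantity; the function name `merge_carts` is illustrative, and in the design described here the same logic would run inside a Redis Lua script so that it executes atomically.

```python
def merge_carts(user_cart: dict, guest_cart: dict) -> tuple:
    """Merge a guest cart into a user cart, keeping the higher quantity
    for SKUs present in both. Returns (merged_cart, conflicts_resolved)."""
    merged = dict(user_cart)  # copy so the caller's cart is not mutated
    conflicts = 0
    for sku, guest_qty in guest_cart.items():
        if sku in merged:
            conflicts += 1
            merged[sku] = max(merged[sku], guest_qty)
        else:
            merged[sku] = guest_qty
    return merged, conflicts


guest = {"SKU-1": 1, "SKU-2": 2, "SKU-3": 1}
user = {"SKU-2": 1, "SKU-4": 3}
merged, conflicts = merge_carts(user, guest)
print(merged, conflicts)
```

Swapping `max` for addition gives the sum-quantities variant, which makes it easy to discuss both options with the same code in front of the interviewer.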
Scenario:
1. Guest adds 3 items to cart (stored by session_id)
cart:guest-session-abc → { SKU-1: qty 1, SKU-2: qty 2, SKU-3: qty 1 }
2. Guest logs in as user-123
Existing cart: cart:user-123 → { SKU-2: qty 1, SKU-4: qty 3 }
3. Merge strategy:
For items in BOTH carts (SKU-2):
Option A: Keep higher quantity → qty = max(2, 1) = 2
Option B: Sum quantities → qty = 2 + 1 = 3
Option C: Keep guest cart quantity (most recent intent) → qty = 2
Recommended: Option A (max), which Amazon reportedly uses. It never lowers a quantity the user chose and avoids the inflated totals of Option B.
For items only in guest cart: add to user cart
For items only in user cart: keep
Result: cart:user-123 → { SKU-1: qty 1, SKU-2: qty 2, SKU-3: qty 1, SKU-4: qty 3 }
4. Delete guest cart: DEL cart:guest-session-abc
Price Validation at Checkout
Hybrid approach: store price at add time, compare with current price at checkout. Use current price if lower (customer wins), show warning if higher.
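
The "customer wins" rule can be expressed as a small pure function. This is a sketch under one assumption that the text leaves open: when the price has gone up, the current price is what gets charged, but only after the shopper acknowledges the warning. `PricedLine` and `reconcile_price` are illustrative names, and prices are `Decimal` to avoid float rounding on money.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class PricedLine:
    sku_id: str
    charge_price: Decimal      # price to use at checkout
    price_changed: bool        # differs from the price stored at add time
    needs_confirmation: bool   # True when the shopper must accept an increase


def reconcile_price(sku_id: str, price_at_add: Decimal, current_price: Decimal) -> PricedLine:
    if current_price <= price_at_add:
        # Price dropped or is unchanged: silently apply the current price.
        return PricedLine(sku_id, current_price, current_price != price_at_add, False)
    # Price rose: flag it so the UI can warn before the order is placed (assumption).
    return PricedLine(sku_id, current_price, True, True)


print(reconcile_price("SKU-456", Decimal("29.99"), Decimal("24.99")))
print(reconcile_price("SKU-456", Decimal("29.99"), Decimal("34.99")))
```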
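Implementation:
On cart page load:
    raw = redis.hgetall("cart:user-123")  # {sku_id: JSON string}
    cart_items = {sku_id: json.loads(value) for sku_id, value in raw.items()}
    # One batched catalog call instead of one call per item
    current_prices = catalog_service.batch_get_prices(list(cart_items))
    for sku_id, item in cart_items.items():
        item["current_price"] = current_prices[sku_id]
        item["price_changed"] = item["current_price"] != item["price_at_add"]
        item["in_stock"] = inventory_service.check_stock(sku_id)
    return cart_items  # cart with annotations
Add to Cart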
POST /api/v1/cart/items
{
  "sku_id": "SKU-456",
  "quantity": 2,
  "seller_id": "seller-1"
}
Response: 200 OK
{
  "cart_id": "cart:user-123",
  "item_count": 5,
  "item": {"sku_id": "SKU-456", "quantity": 2, "price": 29.99, "seller_id": "seller-1"}
}
Get Cart
GET /api/v1/cart
Response: 200 OK
{
  "items": [
    {"sku_id": "SKU-456", "name": "Wireless Mouse", "quantity": 2,
     "price_at_add": 29.99, "current_price": 29.99, "price_changed": false,
     "in_stock": true, "seller": "TechStore", "image": "https://cdn..."}
  ],
  "subtotal": 59.98,
  "item_count": 5,
  "promotions_applied": [{"code": "SAVE10", "discount": -5.99}],
  "estimated_total": 53.99
}
Update Item Quantity
PUT /api/v1/cart/items/{sku_id}
{ "quantity": 3 }
Response: 200 OK
Remove Item
DELETE /api/v1/cart/items/{sku_id}
Response: 200 OK
Merge Guest Cart
POST /api/v1/cart/merge
{ "guest_session_id": "session-abc" }
Response: 200 OK
{ "merged_items": 3, "conflicts_resolved": 1 }
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 402 Payment Required: insufficient funds
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 502 Bad Gateway: payment provider timeout; poll status endpoint
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
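
On the client side, the retry guidance in this list (honor Retry-After on 429, back off exponentially on 503) might look like the sketch below. The base delay, the cap, the use of full jitter, and the exact set of retryable codes are assumptions chosen for illustration, not values from the API contract.

```python
import random
from typing import Callable, Optional

# Assumption: only these codes from the list above are retried automatically.
RETRYABLE_STATUS = {429, 500, 503}


def should_retry(status: int) -> bool:
    return status in RETRYABLE_STATUS


def retry_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 0.2,
    cap: float = 10.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        return retry_after  # the server's Retry-After value wins
    ceiling = min(cap, base * 2 ** attempt)  # exponential growth, capped
    return rng() * ceiling  # full jitter spreads retries from many clients


for attempt in range(4):
    print(attempt, round(retry_delay(attempt), 3))
```

Retries of writes should reuse the same idempotency key on every attempt so that a request that actually succeeded is not applied twice.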
Redis: Primary Cart Store
# User cart (Hash)
cart:{user_id} → Hash {
SKU-456: '{"qty":2,"price_at_add":29.99,"seller_id":"s-1","added_at":"2026-03-14T10:00:00Z"}',
SKU-789: '{"qty":1,"price_at_add":14.99,"seller_id":"s-2","added_at":"2026-03-14T10:05:00Z"}'
}
TTL: 2592000 (30 days, refreshed on activity)
# Guest cart
cart:guest-{session_id} → Hash (same structure)
TTL: 86400 (1 day — shorter for guests)
# Saved for later
saved:{user_id} → Hash { SKU-123: '{"saved_at":"...","price":29.99}' }
TTL: 7776000 (90 days)
# Cart item count (for badge display; avoids reading the cart hash on every page)
cart_count:{user_id} → INT
TTL: 2592000
PostgreSQL: Backup + Analytics
CREATE TABLE cart_snapshots (
    user_id      UUID NOT NULL,
    sku_id       VARCHAR(50) NOT NULL,
    quantity     INT NOT NULL,
    price_at_add DECIMAL(10,2),
    seller_id    VARCHAR(50),
    added_at     TIMESTAMPTZ,
    updated_at   TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (user_id, sku_id)
);
-- Async sync from Redis → PostgreSQL every 5 minutes
-- Used for: analytics (abandonment analysis), Redis disaster recovery
Kafka Topics
Topic: cart-events (add, remove, update, checkout, abandon)
→ consumed by: analytics, recommendation engine ("frequently added together")
| Concern | Solution |
|---|---|
| Redis node failure | Redis Cluster (6+ nodes); AOF persistence; each shard has a replica that is promoted if its primary fails |
| Redis total failure | Fall back to PostgreSQL backup (slightly stale); rebuild Redis from PG |
| Cart data loss | Async backup to PostgreSQL every 5 min; client-side localStorage as last resort |
| Duplicate add-to-cart | HSET with an absolute quantity is idempotent (same field → overwrite, not duplicate); an increment-style add needs a request ID to deduplicate retries |
| Price discrepancy | Always validate at checkout; show warnings on cart page |
| Race condition (merge) | Redis Lua script runs the merge atomically, so two merges of the same carts cannot interleave; in Redis Cluster both keys must hash to the same slot |
| Abandoned cart recovery | Kafka event on cart inactivity > 1 hour → trigger email reminder |
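
The duplicate add-to-cart row depends on what "add" means. Overwriting a field with an absolute quantity is safe to repeat, but an "add N more" request is not. The sketch below contrasts the two, using an in-memory `CartStore` class as a hypothetical stand-in for Redis and a client-supplied `request_id` as the idempotency key (both are illustrative, not part of the design above).

```python
class CartStore:
    """In-memory stand-in for the cart store, for illustration only."""

    def __init__(self):
        self.carts = {}             # cart_key -> {sku_id: quantity}
        self.seen_requests = set()  # idempotency keys already applied

    def set_quantity(self, cart_key: str, sku_id: str, qty: int) -> int:
        # Overwrite semantics (like HSET): repeating the call changes nothing.
        self.carts.setdefault(cart_key, {})[sku_id] = qty
        return qty

    def add_quantity(self, cart_key: str, sku_id: str, delta: int, request_id: str) -> int:
        # Increment semantics: apply each request_id at most once.
        cart = self.carts.setdefault(cart_key, {})
        if request_id not in self.seen_requests:
            self.seen_requests.add(request_id)
            cart[sku_id] = cart.get(sku_id, 0) + delta
        return cart.get(sku_id, 0)


store = CartStore()
store.add_quantity("cart:user-123", "SKU-456", 2, request_id="req-1")
store.add_quantity("cart:user-123", "SKU-456", 2, request_id="req-1")  # client retry
print(store.carts)
```

In a real deployment the seen-request set would need its own TTL so it does not grow without bound.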
Interview Walkthrough
- Frame the cart as a session-scoped, high-churn key-value store — Redis Hash per user with 30-day TTL refreshed on activity.
- Explain price-at-add semantics: store the price when the item enters the cart, then flag price_changed on read without silently updating totals.
- Cover guest carts with a shorter TTL (1 day) and the merge flow on login; resolve quantity conflicts by summing or keeping the higher quantity.
- Walk through add/update/remove as O(1) Hash operations, with a separate cart_count key so the badge can be rendered without reading the cart hash.
- Mention async PostgreSQL snapshots via Kafka for analytics and disaster recovery; Redis is primary, and PostgreSQL is a backup that stays off the hot path.
- Discuss multi-seller carts: each line item carries seller_id so checkout can split into separate fulfillment orders.
- Common pitfall: storing the cart in the session cookie or JWT — size limits, no server-side merge, and no cross-device sync.
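
The multi-seller point in the walkthrough comes down to grouping line items by seller_id. A minimal sketch, assuming each line is a dict with `sku_id`, `seller_id`, `qty`, and a string `price` (the field names and the `split_by_seller` helper are illustrative):

```python
from collections import defaultdict
from decimal import Decimal


def split_by_seller(lines: list) -> dict:
    """Group cart lines into one prospective order per seller,
    with a per-seller subtotal."""
    orders = defaultdict(lambda: {"items": [], "subtotal": Decimal("0")})
    for line in lines:
        order = orders[line["seller_id"]]
        order["items"].append(line["sku_id"])
        order["subtotal"] += Decimal(line["price"]) * line["qty"]
    return dict(orders)


cart_lines = [
    {"sku_id": "SKU-456", "seller_id": "s-1", "qty": 2, "price": "29.99"},
    {"sku_id": "SKU-789", "seller_id": "s-2", "qty": 1, "price": "14.99"},
]
orders = split_by_seller(cart_lines)
for seller_id, order in orders.items():
    print(seller_id, order["items"], order["subtotal"])
```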
Redis vs DynamoDB vs Session Storage for Cart
Redis ⭐ (recommended):
✓ Sub-millisecond latency
✓ Hash data structure is perfect for cart operations
✓ TTL for automatic expiry
✓ Lua scripting for atomic operations (merge, validate)
✗ Memory-only (need AOF/RDB for persistence)
✗ Cost: RAM is expensive at 100 GB scale ($$$)
DynamoDB:
✓ Durable (replicated, persistent)
✓ Pay-per-request pricing (good for variable load)
✓ Auto-scaling
✗ Higher latency (5-10 ms vs < 1 ms Redis)
✗ No TTL granularity per hash field (only per item)
✗ Item size limit 400 KB (fine for carts, but constraint)
Server-side session (e.g., PostgreSQL):
✓ Durable, ACID
✗ Much higher latency (10-50 ms)
✗ Cart reads are very frequent → DB bottleneck
✗ Not designed for key-value access patterns
Client-side (localStorage):
✓ Zero server cost for guest carts
✓ Works offline
✗ Not synced across devices
✗ Lost on browser clear
✗ No server-side analytics
Best practice:
Guest carts: localStorage (client) + async backup to Redis on significant changes
Logged-in carts: Redis (primary) + async backup to PostgreSQL
This minimizes Redis memory usage (no guest cart storage) while keeping fast access
Cart Abandonment
Industry average: roughly 70% of carts are abandoned (never checked out)
Detection:
Cart has items + no checkout activity for 1 hour → "abandoned"
Kafka event: { user_id, cart_items, total_value, abandoned_at }
→ Triggers:
1. Email reminder (1 hour): "You left items in your cart"
2. Push notification (4 hours): "Your cart is waiting"
3. Email with discount (24 hours): "10% off your cart — use code COMEBACK10"
4. Final reminder (72 hours): "Items may sell out soon"
Suppress if:
- User opted out of marketing
- Cart value < $10 (not worth the email cost)
- User has completed a purchase since abandonment
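
The trigger schedule and suppression rules above can be combined into one decision function. This is a sketch only: the reminder names, the `due_reminders` signature, and the idea of tracking which reminders were already sent are assumptions made for illustration.

```python
from datetime import timedelta
from decimal import Decimal

# Delays and reminder types taken from the trigger list above.
REMINDER_SCHEDULE = [
    (timedelta(hours=1), "email_reminder"),
    (timedelta(hours=4), "push_notification"),
    (timedelta(hours=24), "email_with_discount"),
    (timedelta(hours=72), "final_reminder"),
]
MIN_CART_VALUE = Decimal("10")  # below this, reminders are suppressed


def due_reminders(idle_for, cart_value, opted_out, purchased_since, already_sent=frozenset()):
    """Return the reminders that should be sent now, oldest first."""
    if opted_out or purchased_since or cart_value < MIN_CART_VALUE:
        return []
    return [
        name
        for delay, name in REMINDER_SCHEDULE
        if idle_for >= delay and name not in already_sent
    ]


print(due_reminders(timedelta(hours=5), Decimal("50"), False, False, {"email_reminder"}))
```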
Analytics:
Track: abandonment rate per step (cart → shipping → payment → confirm)
Identify: at which step users drop off most
A/B test: different cart UX → measure conversion rate
Staff interviews expect you to articulate how the system evolves under real growth rather than jumping straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services that prove the core shopping cart flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Revisit the architecture when: regional failure-domain risk, data-residency compliance, or unsustainable linear cost growth appears
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | < 50 ms for cart add/remove/update, per the latency requirement; state it early and tie it to the capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
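
An availability target is easier to reason about once it is converted into an error budget. The sketch below does that conversion for a 30-day window; the function name and the choice of window are illustrative.

```python
def error_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {error_budget_minutes(target):.2f} min per 30 days")
```

With a 99.95% target the budget is about 21.6 minutes per 30 days, which is the figure that decides whether a rollback can wait for a human or has to be automated.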
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.