Design a Coupon and Discount Engine

Interview Prompt

Design a Coupon and Discount Engine.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Rule engine, Stacking policies, Redemption limits (atomic counter)?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Rule engine
Stacking policies
Redemption limits (atomic counter)
Coupon code generation
Capacity estimation with shown math

Out of scope (state explicitly)

Full catalog/search infrastructure (#12)
Payment checkout flow (#24)
Fraud and abuse ML pipelines

Assumptions

Scale (DAU, QPS, data volume) for the coupon and discount engine is confirmed with the interviewer in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Create coupons: Percentage off, fixed amount, BOGO, free shipping, tiered discounts
Apply coupons: Validate and apply coupon code at checkout
Auto-apply promotions: Automatic discounts (site-wide sales, category discounts) without code
Stacking rules: Define which coupons can combine (stackable vs exclusive)
Eligibility rules: Target by user segment, purchase history, cart value, product category, first-time buyer
Usage limits: Per-coupon limit (10,000 total), per-user limit (1 per customer), time-bound
Coupon generation: Bulk generate unique codes for campaigns (e.g., 100K codes)
Analytics: Track redemption rates, revenue impact, popular coupons

Metric	Calculation	Value
Active coupons	Given	100K
Coupon validations / sec	Daily validation volume ÷ 86,400 s, times a peak factor	5K (peak)
Redemptions / day	Given (assumption)	10M (≈ 116/sec average)
Bulk code generation	Given	Up to 1M codes per campaign
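
The 5K validations/sec figure does not follow from 10M redemptions/day alone, so it helps to show the intermediate steps. The sketch below is one way to get there; the validations-per-redemption ratio and the peak factor are assumptions chosen for illustration, not values from the table.

```python
SECONDS_PER_DAY = 86_400

redemptions_per_day = 10_000_000   # from the table above
validations_per_redemption = 10    # ASSUMPTION: cart edits, retries, rejected codes
peak_factor = 4                    # ASSUMPTION: peak-to-average traffic ratio

avg_redemptions_per_sec = redemptions_per_day / SECONDS_PER_DAY
avg_validations_per_sec = avg_redemptions_per_sec * validations_per_redemption
peak_validations_per_sec = avg_validations_per_sec * peak_factor

print(round(avg_redemptions_per_sec))    # 116
print(round(avg_validations_per_sec))    # 1157
print(round(peak_validations_per_sec))   # 4630, i.e. roughly 5K
```

Under different assumptions the peak number moves, but the structure of the estimate (daily volume, per-second average, peak multiplier) stays the same.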


Coupon Rule Engine

Coupons carry complex eligibility rules, and validation sits on the checkout path, so it must be fast: look up the coupon, check the validity period, check usage limits via Redis atomic INCR, and evaluate the conditions against the cart.

Coupon definition:
{
  "coupon_id": "SUMMER25",
  "type": "percentage",
  "value": 25,
  "conditions": {
    "min_cart_value": 100.00,
    "eligible_categories": ["electronics", "clothing"],
    "eligible_user_segments": ["premium_members"],
    "first_purchase_only": false,
    "max_uses_total": 10000,
    "max_uses_per_user": 1,
    "valid_from": "2026-03-01",
    "valid_to": "2026-03-31",
    "stackable": false,
    "excluded_skus": ["SKU-GIFT-CARD"]
  },
  "discount_cap": 50.00
}

Validation flow:
  1. Lookup coupon: Redis cache, falling back to PostgreSQL on a miss
  2. Check validity period: valid_from <= NOW <= valid_to
  3. Evaluate conditions against cart
  4. Check per-user usage: Redis GET user_coupon:{user_id}:{coupon_id}
  5. Reserve total usage: Redis INCR coupon_usage:{coupon_id}, compared against max_uses_total (done last, so a request that fails an earlier check never consumes a slot)
  6. If all pass -> calculate discount and return
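
The read-only part of this flow (validity period and cart conditions) can be sketched as a pure function. This is a minimal illustration, not a definitive implementation: the cart item shape, the `check_coupon` name, and the choice to measure `min_cart_value` on the whole cart subtotal are assumptions. The Redis usage counters are left out here.

```python
from datetime import date

def check_coupon(coupon, cart_items, user_segments, is_first_purchase, today):
    """Run the read-only checks; return (ok, reason).

    cart_items: list of dicts with "sku", "category", "price" (assumed shape).
    """
    cond = coupon["conditions"]

    # Validity period: valid_from <= today <= valid_to
    valid_from = date.fromisoformat(cond["valid_from"])
    valid_to = date.fromisoformat(cond["valid_to"])
    if not (valid_from <= today <= valid_to):
        return False, "outside validity period"

    # ASSUMPTION: minimum cart value is measured on the whole cart subtotal
    subtotal = sum(item["price"] for item in cart_items)
    if subtotal < cond.get("min_cart_value", 0):
        return False, "cart below minimum value"

    segments = cond.get("eligible_user_segments")
    if segments and not set(segments) & set(user_segments):
        return False, "user segment not eligible"

    if cond.get("first_purchase_only") and not is_first_purchase:
        return False, "first purchase only"

    # At least one item must be in an eligible category and not excluded
    excluded = set(cond.get("excluded_skus", []))
    categories = cond.get("eligible_categories")
    qualifying = [
        item for item in cart_items
        if item["sku"] not in excluded
        and (not categories or item["category"] in categories)
    ]
    if not qualifying:
        return False, "no qualifying items"
    return True, "ok"


summer25 = {
    "coupon_id": "SUMMER25",
    "conditions": {
        "min_cart_value": 100.00,
        "eligible_categories": ["electronics", "clothing"],
        "eligible_user_segments": ["premium_members"],
        "first_purchase_only": False,
        "valid_from": "2026-03-01",
        "valid_to": "2026-03-31",
        "excluded_skus": ["SKU-GIFT-CARD"],
    },
}
cart = [{"sku": "SKU-TV", "category": "electronics", "price": 120.0}]
print(check_coupon(summer25, cart, {"premium_members"}, False, date(2026, 3, 15)))
# (True, 'ok')
```

Returning a reason alongside the boolean makes it straightforward to map failures to user-facing error messages.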

Discount calculation:
  percentage: discount = min(cart_subtotal * value/100, discount_cap)
  fixed: discount = min(value, cart_subtotal)
  bogo: discount = price of cheapest qualifying item
  tiered: discount = amount for the highest tier the cart reaches (e.g., spend $100 get $10 off, spend $200 get $30 off)
  free_shipping: discount = shipping_cost
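
The formulas above translate fairly directly into code. The following is a sketch under stated assumptions: function names are illustrative, money is modeled with `Decimal` to avoid float rounding, and BOGO is assumed to need at least two qualifying items before anything is free.

```python
from decimal import Decimal

def percentage_discount(subtotal, value, cap=None):
    """discount = min(subtotal * value / 100, cap); no cap means uncapped."""
    discount = subtotal * value / Decimal(100)
    return min(discount, cap) if cap is not None else discount

def fixed_discount(subtotal, value):
    """A fixed amount can never exceed the cart subtotal."""
    return min(value, subtotal)

def bogo_discount(qualifying_prices):
    """Cheapest qualifying item is free (ASSUMPTION: needs 2+ qualifying items)."""
    if len(qualifying_prices) < 2:
        return Decimal(0)
    return min(qualifying_prices)

def tiered_discount(subtotal, tiers):
    """tiers: list of (threshold, amount_off); the highest threshold reached wins."""
    reached = [(threshold, amount) for threshold, amount in tiers if subtotal >= threshold]
    return max(reached)[1] if reached else Decimal(0)

def free_shipping_discount(shipping_cost):
    return shipping_cost


tiers = [(Decimal(100), Decimal(10)), (Decimal(200), Decimal(30))]
print(percentage_discount(Decimal("300"), Decimal(25), Decimal("50.00")))  # 50.00 (capped)
print(fixed_discount(Decimal("8"), Decimal("10")))                         # 8
print(tiered_discount(Decimal("150"), tiers))                              # 10
print(tiered_discount(Decimal("250"), tiers))                              # 30
```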

Stacking Rules

User applies two coupons: "SUMMER25" (25% off) + "FREESHIP" (free shipping)

Stacking policies:
  1. No stacking: only one coupon per order (simplest, Amazon's approach)
  2. Group stacking: one coupon per stack group (percentage + shipping OK)
  3. Full stacking: any coupons can combine (rare, complex)

Implementation:
  Each coupon has: stackable = true/false, stack_group = "percentage" | "shipping" | "fixed"
  
  Rule: max one coupon per stack_group.
  "SUMMER25" (stack_group: "percentage") + "FREESHIP" (stack_group: "shipping") -> OK
  "SUMMER25" (percentage) + "FALL20" (percentage) -> REJECTED (same group)
  
  Application order matters:
    Option A: Apply percentage THEN fixed: $100 - 25% = $75 - $10 = $65
    Option B: Apply fixed THEN percentage: $100 - $10 = $90 - 25% = $67.50
    Standard: apply the order that benefits the customer most (here Option A, which yields $65)
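
A minimal sketch of the one-coupon-per-stack_group rule and the application-order comparison follows. The function names are illustrative, and the sketch only enforces the stack_group rule; how the separate `stackable` flag interacts with groups is a policy decision left out here.

```python
from decimal import Decimal

def check_stack_groups(coupons):
    """Return (ok, reason): at most one coupon per stack_group."""
    seen = set()
    for coupon in coupons:
        group = coupon["stack_group"]
        if group in seen:
            return False, f"two coupons in stack_group '{group}'"
        seen.add(group)
    return True, "ok"

def percentage_then_fixed(subtotal, pct, fixed):
    return subtotal * (1 - pct / Decimal(100)) - fixed

def fixed_then_percentage(subtotal, pct, fixed):
    return (subtotal - fixed) * (1 - pct / Decimal(100))

def best_total_for_customer(subtotal, pct, fixed):
    """Pick whichever order leaves the customer paying less."""
    return min(percentage_then_fixed(subtotal, pct, fixed),
               fixed_then_percentage(subtotal, pct, fixed))


summer25 = {"code": "SUMMER25", "stack_group": "percentage"}
freeship = {"code": "FREESHIP", "stack_group": "shipping"}
fall20 = {"code": "FALL20", "stack_group": "percentage"}

print(check_stack_groups([summer25, freeship]))  # (True, 'ok')
print(check_stack_groups([summer25, fall20]))    # rejected: same group
print(best_total_for_customer(Decimal(100), Decimal(25), Decimal(10)))  # 65.00
```

Because the percentage is taken on the larger base when it is applied first, percentage-then-fixed is never worse for the customer than the reverse order.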

Race Condition: Coupon Usage Limit

Coupon "FLASH50" has max_uses = 1000. 1,500 users try to use it simultaneously.

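Without atomicity (read, check, then write as separate steps): many of the 1,500 requests read count=999 at the same moment, each passes the limit check, and all of them succeed -> over-redeemed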

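Solution: Redis INCR inside a Lua script, which Redis executes atomically (same pattern as a flash sale):

  -- KEYS[1] = coupon_usage:{coupon_id}, ARGV[1] = max uses (e.g. 1000)
  local current = redis.call('INCR', KEYS[1])
  if current > tonumber(ARGV[1]) then
    redis.call('DECR', KEYS[1])  -- undo the increment; limit already reached
    return 0  -- exceeded
  end
  return 1  -- success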

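If payment later fails -> DECR to release the usage slot, and DEL the per-user key so the customer can retry.
Per-user: SET user_coupon:{user_id}:{coupon_id} 1 NX EX <ttl> (plain SETNX cannot attach a TTL in the same atomic step).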
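
To see the reserve-then-roll-back pattern without a Redis server, the sketch below simulates it in memory. It is a stand-in, not Redis: the `UsageCounter` class is hypothetical, and its lock plays the role of Redis running a script atomically.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class UsageCounter:
    """In-memory stand-in for the coupon_usage counter (NOT a Redis client)."""

    def __init__(self, max_uses):
        self.max_uses = max_uses
        self.count = 0
        self._lock = threading.Lock()  # mimics atomic script execution

    def try_reserve(self):
        """Increment, then roll back if the limit was exceeded."""
        with self._lock:
            self.count += 1
            if self.count > self.max_uses:
                self.count -= 1
                return False
            return True

    def release(self):
        """Give a slot back, e.g. after a failed payment."""
        with self._lock:
            if self.count > 0:
                self.count -= 1

def simulate(max_uses=1000, attempts=1500):
    counter = UsageCounter(max_uses)
    with ThreadPoolExecutor(max_workers=32) as pool:
        results = list(pool.map(lambda _: counter.try_reserve(), range(attempts)))
    return sum(results), counter.count

granted, final_count = simulate()
print(granted, final_count)  # 1000 1000
```

Exactly 1,000 of the 1,500 attempts succeed, and a later `release()` frees one slot for the next request.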

Auto-Apply Promotions

Site-wide: "10% off everything this weekend"
Category: "Free shipping on electronics"
Cart-value: "$15 off orders over $100"

These are applied automatically — user doesn't enter a code.

Implementation:
  1. On cart page load / checkout initiation:
     Fetch all active auto-apply promotions from Redis cache:
       auto_promos:{current_date} --> List of auto-apply coupon definitions
       TTL: 300 seconds (refreshed from PostgreSQL)
  
  2. Evaluate each promotion against current cart:
     For each promo in auto_promos:
       if evaluate_conditions(promo.conditions, cart, user):
         applicable_promos.append(promo)
  
  3. Apply the BEST applicable promotion (or stack if policy allows):
     Sort by discount_amount DESC -> pick highest
     Or: apply all stackable auto-promos

  4. Show on cart page: "Promotion applied: $15 off orders over $100 ✓"
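
Steps 2 and 3 can be sketched as follows for the pick-the-best policy. The promotion and cart shapes, and the simplified `evaluate_conditions`, are assumptions for illustration; a real rule set would cover more condition types.

```python
from decimal import Decimal

def evaluate_conditions(conditions, cart):
    """Simplified: minimum cart value and category overlap only."""
    if cart["subtotal"] < conditions.get("min_cart_value", 0):
        return False
    categories = conditions.get("eligible_categories")
    if categories and not set(categories) & set(cart["categories"]):
        return False
    return True

def discount_amount(promo, cart):
    if promo["type"] == "percentage":
        return cart["subtotal"] * promo["value"] / Decimal(100)
    if promo["type"] == "fixed":
        return min(promo["value"], cart["subtotal"])
    if promo["type"] == "free_shipping":
        return cart["shipping_cost"]
    raise ValueError(f"unknown promo type: {promo['type']}")

def best_auto_promo(promos, cart):
    """Return the applicable promo with the largest discount, or None."""
    applicable = [p for p in promos if evaluate_conditions(p["conditions"], cart)]
    if not applicable:
        return None
    return max(applicable, key=lambda p: discount_amount(p, cart))


promos = [
    {"name": "10% off everything this weekend", "type": "percentage",
     "value": Decimal(10), "conditions": {}},
    {"name": "Free shipping on electronics", "type": "free_shipping",
     "conditions": {"eligible_categories": ["electronics"]}},
    {"name": "$15 off orders over $100", "type": "fixed",
     "value": Decimal(15), "conditions": {"min_cart_value": Decimal(100)}},
]
cart = {"subtotal": Decimal("120"), "categories": {"clothing"},
        "shipping_cost": Decimal("8")}

print(best_auto_promo(promos, cart)["name"])  # $15 off orders over $100
```

For this cart the 10% promotion is worth $12 and the fixed promotion $15, so the fixed one wins; the free-shipping promotion does not apply because the cart has no electronics.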

Edge case: auto-promo + manual coupon
  Policy options:
  a. Manual coupon replaces auto-promo
  b. Both stack (more customer-friendly)
  c. Apply whichever gives bigger discount (best of both)
  Standard: option (c) — show user the better deal.

Concern	Solution
Over-redemption	Redis atomic INCR; rollback on payment failure
Coupon cache stale	TTL + invalidate on coupon update
Bulk generation failure	Batch insert; resume from last generated; idempotent
Abuse (shared codes)	Per-user limits; device fingerprinting; account age checks
Redis/PG divergence	Periodic reconciliation; PG is source of truth
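
For the bulk generation row, one way to make a generation job resumable and idempotent is to express it as "top up to a target total" rather than "generate N more". The sketch below assumes the already persisted codes can be read back; the alphabet, code length, and function name are illustrative choices.

```python
import secrets

# ASSUMPTION: alphabet drops look-alike characters (0/O, 1/I/L) for readability
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZ23456789"

def generate_codes(target_total, already_generated=(), length=10, prefix=""):
    """Return only the NEW codes needed to reach target_total unique codes.

    Re-running with the codes that were already persisted resumes the job
    without producing duplicates or overshooting the target.
    """
    seen = set(already_generated)
    new_codes = []
    while len(seen) < target_total:
        code = prefix + "".join(secrets.choice(ALPHABET) for _ in range(length))
        if code not in seen:  # retry on the (rare) collision
            seen.add(code)
            new_codes.append(code)
    return new_codes


first_batch = generate_codes(1000, prefix="SUM-")
# Suppose the job is re-run with a larger target after a partial failure:
resumed = generate_codes(1500, already_generated=first_batch, prefix="SUM-")
print(len(first_batch), len(resumed))  # 1000 500
```

In a database-backed version, a unique constraint on the code column would serve as the final guard against duplicates; the in-memory set here only stands in for that check.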

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
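
An availability target becomes easier to reason about once it is converted into an error budget. The sketch below does that arithmetic; the 30-day window is an assumption, and the function name is illustrative.

```python
def downtime_budget_minutes(availability_pct, window_days=30):
    """Minutes of allowed unavailability in the window for a given target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100

for target in (99.9, 99.95, 99.99):
    print(target, round(downtime_budget_minutes(target), 2))
# 99.9 43.2
# 99.95 21.6
# 99.99 4.32
```

At 99.95%, the budget is about 21.6 minutes per 30 days, which is the number to weigh against rollback time and failover time in the incident scenarios.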

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
