Interview Prompt
Design a Coupon and Discount Engine.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Rule engine, Stacking policies, Redemption limits (atomic counter)? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Rule engine
- Stacking policies
- Redemption limits (atomic counter)
- Coupon code generation
- Capacity estimation with shown math
Out of scope (state explicitly)
- Full catalog/search infrastructure (#12)
- Payment checkout flow (#24)
- Fraud and abuse ML pipelines
Assumptions
- Clarify scale (DAU, QPS, data volume) for the coupon and discount engine in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
Functional Requirements
- Create coupons: Percentage off, fixed amount, BOGO, free shipping, tiered discounts
- Apply coupons: Validate and apply coupon code at checkout
- Auto-apply promotions: Automatic discounts (site-wide sales, category discounts) without code
- Stacking rules: Define which coupons can combine (stackable vs exclusive)
- Eligibility rules: Target by user segment, purchase history, cart value, product category, first-time buyer
- Usage limits: Per-coupon limit (10,000 total), per-user limit (1 per customer), time-bound
- Coupon generation: Bulk generate unique codes for campaigns (100K codes)
- Analytics: Track redemption rates, revenue impact, popular coupons
Non-Functional Requirements
- Low Latency: Coupon validation in < 50 ms (it sits on the checkout critical path)
- Strong Consistency: Usage counts must be accurate, with no over-redemption
- Scale: 100K+ active coupons, 10M+ redemptions/day
- Availability: 99.99%, because a coupon-service failure blocks checkout
- Fraud Resistant: Prevent coupon abuse (sharing, scripted redemption)
Capacity Estimation
| Metric | Calculation | Value |
|---|---|---|
| Active coupons | Given | 100K |
| Coupon validation / sec | 10M redemptions/day ÷ 86,400 s ≈ 116/s average, scaled up for several validations per redemption and a peak factor | ~5K (peak) |
| Redemptions / day | Given | 10M |
| Bulk code generation | Given | Up to 1M codes per campaign |
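The validation figure in the table can be sanity-checked with quick arithmetic. The sketch below is illustrative only: the validations-per-redemption ratio and the peak factor are assumptions chosen for the example, not numbers given in the problem.

```python
# Back-of-envelope check for the "Coupon validation / sec" row.
# ASSUMPTIONS (not from the problem statement): each redemption is preceded
# by about 4 validate calls, and peak traffic is about 10x the daily average.
SECONDS_PER_DAY = 86_400
redemptions_per_day = 10_000_000      # given
validations_per_redemption = 4        # assumed
peak_factor = 10                      # assumed

avg_redemptions_per_sec = redemptions_per_day / SECONDS_PER_DAY
avg_validations_per_sec = avg_redemptions_per_sec * validations_per_redemption
peak_validations_per_sec = avg_validations_per_sec * peak_factor

print(f"avg redemptions/s:  {avg_redemptions_per_sec:.0f}")
print(f"avg validations/s:  {avg_validations_per_sec:.0f}")
print(f"peak validations/s: {peak_validations_per_sec:.0f}")
```

Under these assumptions the peak lands at roughly 4.6K validations per second, which is in line with the 5K design target. Different ratios would move the number, so state them out loud in the interview.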
Coupon Rule Engine
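Coupons have complex eligibility rules, and validation sits in the checkout path, so it must be fast: look up the coupon, check the validity period, check usage limits against the Redis counters, then evaluate the conditions against the cart. Validation only reads the counters; the atomic INCR that consumes a usage slot belongs to the apply step.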
Coupon definition:
{
"coupon_id": "SUMMER25",
"type": "percentage",
"value": 25,
"conditions": {
"min_cart_value": 100.00,
"eligible_categories": ["electronics", "clothing"],
"eligible_user_segments": ["premium_members"],
"first_purchase_only": false,
"max_uses_total": 10000,
"max_uses_per_user": 1,
"valid_from": "2026-03-01",
"valid_to": "2026-03-31",
"stackable": false,
"excluded_skus": ["SKU-GIFT-CARD"]
},
"discount_cap": 50.00
}
Validation flow:
1. Lookup coupon: Redis cache or PostgreSQL
2. Check validity period: valid_from <= NOW <= valid_to
3. Check total usage: Redis GET coupon_usage:{coupon_id} < max_uses_total (read-only here; the atomic INCR runs at apply time)
4. Check per-user usage: Redis GET user_coupon:{user_id}:{coupon_id}
5. Evaluate conditions against cart
6. If all pass -> calculate discount and return
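The validation flow above can be sketched as a single read-only function. This is a minimal sketch, not a definitive implementation: the `Coupon` dataclass, the `validate` function, and the two plain dicts standing in for the Redis counters are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Coupon:
    coupon_id: str
    valid_from: datetime
    valid_to: datetime
    max_uses_total: int
    max_uses_per_user: int
    min_cart_value: float = 0.0
    eligible_categories: set = field(default_factory=set)


def validate(coupon, cart, user_id, total_uses, user_uses, now=None):
    """Read-only validation: returns (ok, reason) and never mutates counters.

    total_uses and user_uses are plain dicts standing in for the Redis keys
    coupon_usage:{coupon_id} and user_coupon:{user_id}:{coupon_id}.
    """
    now = now or datetime.now(timezone.utc)
    # Step 2: validity period
    if not (coupon.valid_from <= now <= coupon.valid_to):
        return False, "Coupon expired or not yet active"
    # Step 3: total usage (read only; the slot is consumed at apply time)
    if total_uses.get(coupon.coupon_id, 0) >= coupon.max_uses_total:
        return False, "Coupon fully redeemed"
    # Step 4: per-user usage
    if user_uses.get((user_id, coupon.coupon_id), 0) >= coupon.max_uses_per_user:
        return False, "Per-user limit reached"
    # Step 5: conditions against the cart
    if cart["subtotal"] < coupon.min_cart_value:
        return False, "Cart below minimum value"
    categories = {item["category"] for item in cart["items"]}
    if coupon.eligible_categories and not (categories & coupon.eligible_categories):
        return False, "No eligible items in cart"
    return True, "OK"
```

Checks run from cheapest to most expensive, so an expired or exhausted coupon is rejected before any cart evaluation happens.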
Discount calculation:
percentage: discount = min(cart_subtotal * value/100, discount_cap)
fixed: discount = min(value, cart_subtotal)
bogo: discount = price of cheapest qualifying item
tiered: spend $100 get $10 off, spend $200 get $30 off
free_shipping: discount = shipping_cost
Stacking Rules
User applies two coupons: "SUMMER25" (25% off) + "FREESHIP" (free shipping)
Stacking policies:
1. No stacking: only one coupon per order (simplest, Amazon's approach)
2. Category stacking: one coupon per category (percentage + shipping OK)
3. Full stacking: any coupons can combine (rare, complex)
Implementation:
Each coupon has: stackable = true/false, stack_group = "percentage" | "shipping" | "fixed"
Rule: max one coupon per stack_group.
"SUMMER25" (stack_group: "percentage") + "FREESHIP" (stack_group: "shipping") -> OK
"SUMMER25" (percentage) + "FALL20" (percentage) -> REJECTED (same group)
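The one-coupon-per-stack_group rule can be sketched as a small check. This is an illustrative sketch: `check_stacking` and the `code` field are hypothetical names, and the example assumes the coupons involved are marked stackable.

```python
def check_stacking(coupons):
    """Return (ok, reason) for a list of coupon dicts under the stack_group rule."""
    # A non-stackable coupon may only be used on its own.
    if len(coupons) > 1 and not all(c["stackable"] for c in coupons):
        return False, "non-stackable coupon cannot be combined"
    seen = {}  # stack_group -> code of the coupon already occupying it
    for c in coupons:
        group = c["stack_group"]
        if group in seen:
            return False, f"{c['code']} conflicts with {seen[group]} (group '{group}')"
        seen[group] = c["code"]
    return True, "OK"


summer = {"code": "SUMMER25", "stackable": True, "stack_group": "percentage"}
freeship = {"code": "FREESHIP", "stackable": True, "stack_group": "shipping"}
fall = {"code": "FALL20", "stackable": True, "stack_group": "percentage"}

print(check_stacking([summer, freeship]))
print(check_stacking([summer, fall]))
```

The first call is accepted because the groups differ; the second is rejected because both coupons occupy the "percentage" group.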
Application order matters:
Option A: Apply percentage THEN fixed: $100 - 25% = $75, then $75 - $10 = $65
Option B: Apply fixed THEN percentage: $100 - $10 = $90, then $90 - 25% = $67.50
Standard: apply the order most beneficial to the customer (Option A in this example)
Race Condition: Coupon Usage Limit
Coupon "FLASH50" has max_uses = 1000. 1,500 users try to use it simultaneously.
Without atomicity: concurrent requests all read count=999, all pass the limit check, all increment -> over-redeemed
Solution: Redis INCR with Lua script (same pattern as flash sale):
local current = redis.call('INCR', 'coupon_usage:FLASH50')
if current > 1000 then
redis.call('DECR', 'coupon_usage:FLASH50')
return 0 -- exceeded
end
return 1 -- success
If payment later fails -> DECR to release usage slot.
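The effect of the atomic check can be demonstrated without a Redis server. The sketch below is an in-memory simulation only: the `UsageCounter` class and `simulate` helper are hypothetical, and a `threading.Lock` plays the role that Redis plays when it runs the Lua script as one indivisible step.

```python
import threading


class UsageCounter:
    """In-memory stand-in for coupon_usage:{coupon_id}."""

    def __init__(self, max_uses):
        self.max_uses = max_uses
        self.count = 0
        self._lock = threading.Lock()

    def try_reserve(self):
        # Mirrors the Lua script: INCR, then DECR and reject if over the limit.
        with self._lock:
            self.count += 1
            if self.count > self.max_uses:
                self.count -= 1
                return False
            return True

    def release(self):
        # Payment failed: hand the usage slot back.
        with self._lock:
            if self.count > 0:
                self.count -= 1


def simulate(max_uses=1000, attempts=1500):
    """Fire `attempts` concurrent reservations; return (successes, final_count)."""
    counter = UsageCounter(max_uses)
    results = []

    def worker():
        results.append(counter.try_reserve())

    threads = [threading.Thread(target=worker) for _ in range(attempts)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results), counter.count


if __name__ == "__main__":
    print(simulate())
```

However the threads interleave, the number of successful reservations never exceeds `max_uses`, which is the property the Lua script provides in production.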
Per-user: SET user_coupon:{user_id}:{coupon_id} 1 NX EX {ttl} (plain SETNX cannot set the TTL in the same atomic command).
Auto-Apply Promotions
Site-wide: "10% off everything this weekend"
Category: "Free shipping on electronics"
Cart-value: "$15 off orders over $100"
These are applied automatically — user doesn't enter a code.
Implementation:
1. On cart page load / checkout initiation:
Fetch all active auto-apply promotions from Redis cache:
auto_promos:{current_date} --> List of auto-apply coupon definitions
TTL: 300 (refreshed from PostgreSQL)
2. Evaluate each promotion against current cart:
For each promo in auto_promos:
if evaluate_conditions(promo.conditions, cart, user):
applicable_promos.append(promo)
3. Apply the BEST applicable promotion (or stack if policy allows):
Sort by discount_amount DESC -> pick highest
Or: apply all stackable auto-promos
4. Show on cart page: "Promotion applied: $15 off orders over $100 ✓"
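Steps 2 and 3 above can be sketched as follows. This is a simplified illustration: `evaluate_conditions` handles only two condition keys, and `discount_amount` and `best_auto_promo` are hypothetical helper names, not part of any stated API.

```python
def evaluate_conditions(conditions, cart, user):
    """Tiny evaluator; supports only min_cart_value and eligible_user_segments."""
    if cart["subtotal"] < conditions.get("min_cart_value", 0):
        return False
    segments = conditions.get("eligible_user_segments")
    if segments and user.get("segment") not in segments:
        return False
    return True


def discount_amount(promo, cart):
    """Discount a promo would give on this cart (percentage and fixed only)."""
    if promo["type"] == "percentage":
        return cart["subtotal"] * promo["value"] / 100
    if promo["type"] == "fixed":
        return min(promo["value"], cart["subtotal"])
    return 0.0


def best_auto_promo(auto_promos, cart, user):
    """Pick the applicable promotion with the highest discount, or None."""
    applicable = [p for p in auto_promos
                  if evaluate_conditions(p["conditions"], cart, user)]
    if not applicable:
        return None
    return max(applicable, key=lambda p: discount_amount(p, cart))


promos = [
    {"name": "10% off everything", "type": "percentage", "value": 10, "conditions": {}},
    {"name": "$15 off orders over $100", "type": "fixed", "value": 15,
     "conditions": {"min_cart_value": 100}},
]
print(best_auto_promo(promos, {"subtotal": 120.0}, {})["name"])
```

On a $120 cart the fixed promotion wins ($15 versus $12); on a $200 cart the percentage promotion would win instead ($20 versus $15).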
Edge case: auto-promo + manual coupon
Policy options:
a. Manual coupon replaces auto-promo
b. Both stack (more customer-friendly)
c. Apply whichever gives bigger discount (best of both)
Standard: option (c), show the user the better deal.
POST /api/v1/coupons/validate
{ "coupon_code": "SUMMER25", "cart": { "items": [...], "subtotal": 150.00 }, "user_id": "..." }
--> 200 { "valid": true, "discount": 37.50, "type": "percentage", "message": "25% off (max $50)" }
OR 200 { "valid": false, "reason": "Coupon expired" }
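The 37.50 in the response above follows from the percentage formula with its cap. The sketch below reproduces it, along with the fixed and tiered formulas. Using `Decimal` for money and the helper names shown here are assumptions made for illustration, not something the design prescribes.

```python
from decimal import Decimal, ROUND_HALF_UP


def percentage_discount(subtotal, percent, cap=None):
    """discount = min(subtotal * percent / 100, cap), rounded to cents."""
    raw = subtotal * percent / Decimal(100)
    if cap is not None:
        raw = min(raw, cap)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def fixed_discount(subtotal, value):
    """A fixed discount can never exceed the cart subtotal."""
    return min(value, subtotal)


def tiered_discount(subtotal, tiers):
    """tiers: list of (threshold, amount_off); the highest reached threshold wins."""
    reached = [t for t in tiers if subtotal >= t[0]]
    if not reached:
        return Decimal("0")
    return max(reached)[1]


# SUMMER25 on a $150 cart: 25% of 150 is 37.50, below the $50 cap.
print(percentage_discount(Decimal("150.00"), Decimal(25), Decimal("50.00")))
```

On a $400 cart the same coupon would be limited by the cap and return 50.00.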
POST /api/v1/coupons/apply (at checkout confirmation)
{ "coupon_code": "SUMMER25", "order_id": "order-uuid" }
--> 200 { "applied": true, "discount": 37.50 }
POST /api/v1/admin/coupons (create coupon)
{ "code": "SUMMER25", "type": "percentage", "value": 25, "conditions": {...} }
POST /api/v1/admin/coupons/bulk-generate
{ "campaign_id": "camp-uuid", "prefix": "SUMMER", "count": 100000, "template": {...} }
--> 202 { "batch_id": "batch-uuid", "status": "generating" }
PostgreSQL
-- PostgreSQL needs a named enum type; inline ENUM(...) is MySQL syntax.
CREATE TYPE coupon_type AS ENUM ('percentage','fixed','bogo','free_shipping','tiered');
CREATE TABLE coupons (
coupon_id VARCHAR(50) PRIMARY KEY, campaign_id UUID,
type coupon_type,
value DECIMAL(10,2), discount_cap DECIMAL(10,2),
conditions JSONB, stackable BOOLEAN DEFAULT FALSE, stack_group VARCHAR(20),
max_uses_total INT, max_uses_per_user INT DEFAULT 1,
current_uses INT DEFAULT 0,
valid_from TIMESTAMPTZ, valid_to TIMESTAMPTZ,
active BOOLEAN DEFAULT TRUE, created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE coupon_redemptions (
redemption_id UUID PRIMARY KEY, coupon_id VARCHAR(50),
user_id UUID, order_id UUID, discount_amount DECIMAL(10,2),
redeemed_at TIMESTAMPTZ DEFAULT NOW()
);
-- Indexes are created separately; PostgreSQL has no inline INDEX clause.
CREATE INDEX idx_coupon ON coupon_redemptions (coupon_id);
CREATE INDEX idx_user_coupon ON coupon_redemptions (user_id, coupon_id);
Redis
coupon:{code} --> JSON (coupon definition), TTL 3600
coupon_usage:{coupon_id} --> INT (atomic INCR/DECR)
user_coupon:{user_id}:{coupon_id} --> "1", TTL 86400 (a TTL shorter than the coupon's validity window lets the per-user limit reset early)
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
- 402 Payment Required: insufficient funds
- 502 Bad Gateway: payment provider timeout; poll status endpoint
| Concern | Solution |
|---|---|
| Over-redemption | Redis atomic INCR; rollback on payment failure |
| Coupon cache stale | TTL + invalidate on coupon update |
| Bulk generation failure | Batch insert; resume from last generated; idempotent |
| Abuse (shared codes) | Per-user limits; device fingerprinting; account age checks |
| Redis/PG divergence | Periodic reconciliation; PG is source of truth |
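The periodic reconciliation in the last row can be sketched as a comparison of two count maps. This is a minimal sketch under stated assumptions: `reconcile` is a hypothetical helper, the dicts stand in for the Redis counters and for a per-coupon count of rows in coupon_redemptions, and in-flight reservations (reserved in Redis, not yet committed in PostgreSQL) are ignored here and would need a grace window in practice.

```python
def reconcile(redis_counts, pg_counts):
    """Return {coupon_id: pg_count} for every coupon where Redis disagrees.

    PostgreSQL is treated as the source of truth, so the returned values are
    what the Redis counters should be reset to.
    """
    corrections = {}
    for coupon_id in redis_counts.keys() | pg_counts.keys():
        truth = pg_counts.get(coupon_id, 0)
        if redis_counts.get(coupon_id, 0) != truth:
            corrections[coupon_id] = truth
    return corrections


redis_counts = {"FLASH50": 1000, "SUMMER25": 412}
pg_counts = {"FLASH50": 997, "SUMMER25": 412}
print(reconcile(redis_counts, pg_counts))
```

Here three FLASH50 slots were reserved in Redis but never committed, so the job would reset that counter to 997 and free the slots.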
Interview Walkthrough
- Separate validate (read-only, fast) from redeem (write, atomic) — users expect instant feedback before committing checkout.
- Walk through the rule engine: eligibility checks (min cart, category, user segment, expiry, usage limits) evaluated in deterministic order.
- Explain atomic redemption in Redis via a Lua script or in a PostgreSQL transaction: increment current_uses only if it is below max_uses_total.
- Cover stacking rules: which coupons combine, which are mutually exclusive, and how to apply best-discount-first automatically.
- Mention per-user unique codes for targeted campaigns — prevents viral sharing on coupon aggregator sites.
- Discuss abuse detection: velocity limits on invalid attempts, device fingerprint for multi-account farming, auto-disable on Reddit leaks.
- Common pitfall: validating coupon at cart-add but not re-checking at payment capture — expired or exhausted codes slip through after a delay.
Personalized Coupons vs Universal Codes
Universal: "SAVE20". Anyone can use it. Easy to share virally but hard to control.
Unique per-user: "USR-A3F7K2". Generated per user; prevents sharing.
Generate: prefix + random chars.
Store: coupon_id -> user_id mapping.
Validate: check that the coupon belongs to the requesting user.
For campaigns: generate 100K unique codes -> distribute via email/SMS. Each code is single-use and tied to its recipient, which makes it useless on coupon aggregator sites.
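The generate, store, and validate steps for per-user codes can be sketched as below. This is illustrative only: the alphabet, the code length, and the helper names are assumptions, and a real bulk job would persist codes in batches rather than hold them in memory.

```python
import secrets

# Alphabet without look-alike characters (0/O, 1/I); an assumption for readability.
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"


def generate_codes(prefix, count, length=6):
    """Generate `count` unique codes of the form PREFIX-XXXXXX."""
    codes = set()
    while len(codes) < count:          # the set silently drops rare collisions
        suffix = "".join(secrets.choice(ALPHABET) for _ in range(length))
        codes.add(f"{prefix}-{suffix}")
    return codes


def assign_codes(codes, user_ids):
    """Build the coupon_id -> user_id mapping, one code per recipient."""
    return dict(zip(sorted(codes), user_ids))


def belongs_to(mapping, code, user_id):
    """Validation step: the code must be mapped to the requesting user."""
    return mapping.get(code) == user_id


codes = generate_codes("USR", 1000)
mapping = assign_codes(codes, [f"user-{i}" for i in range(1000)])
```

`secrets` is used instead of `random` so that codes are not guessable from earlier ones; with 32 symbols and 6 positions there are about a billion possible suffixes, so 100K codes leave the space sparse.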
Coupon Abuse Detection
Common abuse patterns:
1. Code sharing on Reddit/coupon sites
Detection: > 1000 unique users redeem the same code within 1 hour
Action: auto-disable the coupon; alert the marketing team
2. Multi-account abuse: the same person creates 5 accounts to use a "first-purchase" coupon 5x
Detection: same device fingerprint / IP / payment method across accounts
Action: flag accounts; revoke coupon benefit
3. Automated redemption bots: a script tries thousands of coupon code variations
Detection: > 10 invalid coupon attempts per minute from the same IP/session
Action: CAPTCHA + rate limit (max 5 coupon attempts per checkout session)
4. Returning after coupon: buy with a 50% off coupon -> return the item -> rebuy at full price with credit
Detection: track return rate per coupon; if > 30% return rate, flag the coupon
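The bot-detection threshold (more than 10 invalid attempts per minute from one IP or session) can be sketched as a sliding-window counter. This is an in-memory illustration with a hypothetical `InvalidAttemptLimiter` class; a production version would likely keep the window in a shared store so that all API nodes see the same counts.

```python
from collections import defaultdict, deque


class InvalidAttemptLimiter:
    """Sliding-window count of invalid coupon attempts per key (IP or session)."""

    def __init__(self, max_attempts=10, window_seconds=60):
        self.max_attempts = max_attempts
        self.window = window_seconds
        self._attempts = defaultdict(deque)   # key -> timestamps of invalid attempts

    def record_invalid(self, key, now):
        """Record one invalid attempt at time `now` (seconds).

        Returns True when the key has exceeded the threshold and should be
        challenged (for example with a CAPTCHA).
        """
        q = self._attempts[key]
        q.append(now)
        # Drop attempts that have fallen out of the window.
        while q and q[0] <= now - self.window:
            q.popleft()
        return len(q) > self.max_attempts


limiter = InvalidAttemptLimiter()
flags = [limiter.record_invalid("203.0.113.7", t) for t in range(11)]
print(flags[-1])
```

The first ten attempts pass and the eleventh within the same minute trips the limiter; once the window slides past the burst, the key is clean again.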
Coupon Analytics
Key metrics per coupon (ClickHouse queries):
1. Redemption rate = redemptions / unique_views_of_coupon_field
2. Revenue impact = total_order_value_with_coupon - total_discount_given
3. Incremental revenue = orders_with_coupon - estimated_orders_without_coupon
4. Average order value with coupon vs without
5. Customer acquisition cost (for first-purchase coupons): CAC = total_discount / new_customers_acquired
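Several of these metrics are simple ratios and can be checked on a toy data set. The sketch below is a plain-Python illustration rather than a ClickHouse query; `coupon_metrics` and its field names are hypothetical, and `order_value` is assumed to be the pre-discount order total.

```python
def coupon_metrics(redemptions, coupon_field_views, new_customers):
    """Compute per-coupon metrics from a list of redemption records.

    redemptions: list of dicts with 'order_value' (pre-discount) and 'discount'.
    """
    total_order_value = sum(r["order_value"] for r in redemptions)
    total_discount = sum(r["discount"] for r in redemptions)
    n = len(redemptions)
    return {
        "redemption_rate": n / coupon_field_views if coupon_field_views else 0.0,
        "revenue_impact": total_order_value - total_discount,
        "avg_order_value": total_order_value / n if n else 0.0,
        "cac": total_discount / new_customers if new_customers else None,
    }


sample = [
    {"order_value": 100.0, "discount": 20.0},
    {"order_value": 200.0, "discount": 40.0},
]
print(coupon_metrics(sample, coupon_field_views=10, new_customers=2))
```

Incremental revenue is left out on purpose: it depends on an estimate of what customers would have ordered without the coupon, which needs a holdout group or a model rather than arithmetic.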
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services that prove the core coupon and discount engine flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
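An availability target translates directly into an error budget. The short sketch below shows the arithmetic, assuming a 30-day window; the `error_budget_minutes` helper is a hypothetical name introduced for illustration.

```python
def error_budget_minutes(slo_percent, days=30):
    """Minutes of allowed unavailability for a given SLO over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


for slo in (99.9, 99.95, 99.99):
    print(f"{slo}% -> {error_budget_minutes(slo):.2f} min per 30 days")
```

At 99.95% the budget is about 21.6 minutes per 30 days; tightening to the 99.99% availability requirement stated for the coupon path shrinks it to about 4.3 minutes, which is why that target drives the failover design.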
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.