Design Tinder (Matching System)

Interview Prompt

Design Tinder (Matching System).

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Recommendation + geo filtering, Swipe queue pre-computation, Mutual match detection?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Recommendation + geo filtering
Swipe queue pre-computation
Mutual match detection
Elo/Glicko scoring
Capacity estimation with shown math

Out of scope (state explicitly)

Full ads auction and monetization stack
Content moderation at scale (#81)
Direct messaging (#07)

Assumptions

Clarify scale (DAU, QPS, data volume) for Tinder in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Profile creation: Photos, bio, age, gender, preferences (age range, distance, gender)
Discovery (Swiping): Show nearby profiles one at a time; user swipes right (like) or left (pass)
Matching: When both users swipe right on each other → MATCH → enable chat
Geo-based discovery: Only show users within configured radius (e.g., 50 km)
Chat: Matched users can text message each other
Super Like: Special like that notifies the other user immediately
Undo: Undo last swipe (premium feature)
Boost: Temporarily increase visibility (premium)
Block/Report: Safety features

Metric	Calculation	Value
DAU	Given	25M
Swipes / day	Given	2B
Swipes / sec	2B ÷ 86,400 s ≈ 23K average; peak is roughly 4× average	~23K (peak 100K)
Matches / day	Given	30M
Avg profiles in deck	Given	200/session
Profile size	Given	5 KB (metadata) + 5 MB (photos)
Geo-query fan-out	Given	~10K profiles per 50 km radius in dense city
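
The derived rows can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures given in the table; the variable names and the extra ratios (swipes per user, swipes per match) are illustrative additions, not numbers from the source.

```python
SECONDS_PER_DAY = 86_400

dau = 25_000_000                 # given
swipes_per_day = 2_000_000_000   # given
matches_per_day = 30_000_000     # given
peak_swipes_per_sec = 100_000    # given peak figure

# Average write rate on the swipe path.
avg_swipes_per_sec = swipes_per_day / SECONDS_PER_DAY

# How much headroom the stated peak implies over the average.
peak_factor = peak_swipes_per_sec / avg_swipes_per_sec

# Illustrative ratios that help sanity-check the inputs.
swipes_per_user = swipes_per_day / dau
swipes_per_match = swipes_per_day / matches_per_day

print(f"avg swipes/sec:   {avg_swipes_per_sec:,.0f}")
print(f"peak factor:      {peak_factor:.1f}x")
print(f"swipes per user:  {swipes_per_user:.0f}")
print(f"swipes per match: {swipes_per_match:.0f}")
```

The average works out to roughly 23K swipes per second, so the stated 100K peak corresponds to a peak factor a little above 4×.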


Mutual Match — Atomic Lua Script

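Problem: Simultaneous mutual swipe → double match (or missed match)
A swipes right on B at T=0.000, B swipes right on A at T=0.001
Without protection, the outcome depends on how the two requests interleave:
  record-then-check: both record → both see the other's like → both detect match → TWO match records.
  check-then-record: both check → both see no prior right-swipe → both record → NO match is ever detected.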

Solution: Redis Lua script (atomic; Redis runs one script at a time):
  -- KEYS[1] = 'swiped_right:'..A (swiper), KEYS[2] = 'swiped_right:'..B (target)
  -- ARGV[1] = A, ARGV[2] = B
  local already_liked = redis.call('SISMEMBER', KEYS[2], ARGV[1])
  redis.call('SADD', KEYS[1], ARGV[2])
  if already_liked == 1 then
    return 1  -- MATCH
  end
  return 0  -- no match yet

Match record written to MySQL idempotently:
  INSERT IGNORE INTO matches (user_id_1, user_id_2) VALUES (LEAST(A,B), GREATEST(A,B))
  Always store (smaller_id, larger_id) under a UNIQUE key on (user_id_1, user_id_2) → prevents duplicate match records.
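
The check-and-set behaviour can be illustrated without a Redis server. The sketch below is an in-memory model only: the `SwipeStore` class and its method names are hypothetical, a `threading.Lock` stands in for the atomicity Redis gives a Lua script, and a Python set of canonical pairs stands in for the `matches` table.

```python
import threading
from collections import defaultdict


class SwipeStore:
    """In-memory stand-in for the swiped_right:{user} sets and matches table."""

    def __init__(self):
        self._lock = threading.Lock()
        self._swiped_right = defaultdict(set)  # swiper id -> ids they liked
        self.matches = set()                   # canonical (smaller, larger) pairs

    def swipe_right(self, swiper, target):
        """Record a like; return True if this swipe completed a mutual match."""
        with self._lock:  # check + add happen as one indivisible step
            already_liked = swiper in self._swiped_right[target]
            self._swiped_right[swiper].add(target)
            if already_liked:
                # Canonical ordering makes the write idempotent,
                # mirroring INSERT IGNORE on (smaller_id, larger_id).
                self.matches.add((min(swiper, target), max(swiper, target)))
        return already_liked


store = SwipeStore()
results = []

# Two users swipe right on each other at (nearly) the same time.
threads = [
    threading.Thread(target=lambda: results.append(store.swipe_right(1, 2))),
    threading.Thread(target=lambda: results.append(store.swipe_right(2, 1))),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results), store.matches)
```

Whichever request runs second sees the other's like, so exactly one of the two calls reports the match and only one match record exists.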

Already-Swiped Tracking — Bloom Filter

Problem: User has swiped on 50K profiles over 6 months.
  swiped:{user_id} SET in Redis = 50K × 16 bytes = 800 KB per user
  200M users × 800 KB = 160 TB just for swipe dedup → TOO EXPENSIVE

Bloom filter approach ⭐:
  BF per user: 50K entries, 0.1% false positive rate → ~14.4 bits per entry → ~90 KB per user
  200M users × 90 KB = 18 TB → ~9× reduction

  False positive impact: BF says "already swiped" but actually not →
    User never sees that profile → missed opportunity, but harmless
    At 0.1% rate → 1 in 1000 profiles incorrectly filtered → acceptable

  Implementation:
    BF.ADD swiped_bf:{user_id} {target_user_id}
    BF.EXISTS swiped_bf:{user_id} {candidate_id}
    → If exists → filter out from deck
    → If not exists → show in deck (guaranteed correct)

  Cassandra stores exact swipe history for auditing / undo feature.
  Bloom filter is a READ optimization, not the source of truth.
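
The per-user size follows from the standard Bloom filter sizing formulas, m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The sketch below computes those values and shows a minimal filter built on them. It is an illustration under assumptions, not RedisBloom's implementation: the `BloomFilter` class and the SHA-256 double-hashing scheme are hypothetical choices, and a real RedisBloom filter may use more memory than this theoretical minimum.

```python
import hashlib
import math


def bloom_parameters(n, p):
    """Return (bits, hash_count) for n entries at false-positive rate p."""
    bits = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    hashes = max(1, round(bits / n * math.log(2)))
    return bits, hashes


class BloomFilter:
    def __init__(self, n, p):
        self.bits, self.hashes = bloom_parameters(n, p)
        self.array = bytearray((self.bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the stride non-zero
        return [(h1 + i * h2) % self.bits for i in range(self.hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.array[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.array[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bits, hashes = bloom_parameters(50_000, 0.001)
print(f"{bits / 8 / 1000:.0f} KB per user, {hashes} hash functions")

swiped = BloomFilter(50_000, 0.001)
swiped.add(42)            # like BF.ADD swiped_bf:{user_id} 42
print(42 in swiped)       # like BF.EXISTS; an added id is always found
```

Because a Bloom filter has no false negatives, a "not present" answer can be trusted when building the deck; only "present" answers carry the small false-positive risk.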

Profile Boost — Temporary Visibility Increase

Premium feature: "Boost" places your profile at the top of nearby users' decks

Implementation:
  On boost activation:
    SET boost:{user_id} {expiry_timestamp} EX 1800  (30-minute boost)
    ZADD boosted_users:{geohash_prefix} {score=999} {user_id}

  During deck generation for nearby users:
    1. First: pull boosted users in this geo area (ZREVRANGE boosted_users:...)
    2. Then: normal ranked candidates
    3. Mix: 1 boosted profile per 5 normal profiles
    
  Revenue model: ~$5 per boost → at 10M boosts/month = $50M/month
  
  Anti-abuse: Max 1 boost per 12 hours, no stacking.
  Fairness: If too many boosts in one area → dilute effect (each boost gets
  fewer guaranteed views → prevents boost-only decks)
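
The mixing step during deck generation can be sketched as follows. This assumes the caller knows each boost's expiry timestamp (for example by storing it as the sorted-set score instead of a constant, which is an assumption and not the source's design); the function name `mix_deck` and its parameters are hypothetical.

```python
def mix_deck(boosted, normal, now, normal_per_boost=5):
    """Interleave active boosted profiles with normally ranked candidates.

    boosted: list of (user_id, boost_expiry_ts) in priority order
    normal:  list of user ids, already ranked
    Returns one boosted profile followed by up to `normal_per_boost`
    normal profiles, repeated until the normal candidates run out.
    """
    active = [uid for uid, expiry in boosted if expiry > now]  # drop expired boosts
    active_ids = set(active)
    ranked = [uid for uid in normal if uid not in active_ids]  # avoid duplicates

    deck = []
    for start in range(0, max(len(ranked), 1), normal_per_boost):
        slot = start // normal_per_boost
        if slot < len(active):
            deck.append(active[slot])
        deck.extend(ranked[start:start + normal_per_boost])
    return deck


boosted = [("b1", 200), ("b2", 50), ("b3", 300)]   # b2 expired at t=50
normal = [f"n{i}" for i in range(1, 13)]
print(mix_deck(boosted, normal, now=100))
```

Boosts beyond the number of available slots are simply not placed in this deck, which is one way to get the dilution effect described above: the more boosts in an area, the fewer guaranteed views each one receives.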

Concern	Solution
Match consistency	Redis Lua script ensures atomic check-and-set for mutual match
Location staleness	Update location only when app is in foreground; TTL of 24 hours
Swipe history too large	Bloom filter for 'already swiped' check (false positive = an unseen profile is occasionally hidden, not harmful)
Redis GeoSet loss	Rebuild from MySQL user locations on startup
Unfair ranking	Elo score decay for inactive users; reset periodically

Stale Location: User Moves to New City

Alice was in New York → Tinder shows NY profiles
Alice flies to London → still showing NY profiles!

Solution: Update location on app open + every 30 minutes while active
  GEOADD users:geo {new_lng} {new_lat} {user_id}
  Invalidate cached deck: DEL deck:{user_id}
  → Next swipe request regenerates deck with London profiles
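
A minimal in-memory sketch of the update-and-invalidate flow is shown below. The `LocationTracker` class is a hypothetical stand-in for `GEOADD users:geo` and the `deck:{user_id}` cache. The source invalidates the deck on every location update; the distance threshold here is an added assumption that avoids rebuilding decks for small movements.

```python
import math

EARTH_RADIUS_KM = 6371.0
MOVE_THRESHOLD_KM = 5.0  # assumption: smaller moves keep the cached deck


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


class LocationTracker:
    """In-memory stand-in for the geo index and the cached decks."""

    def __init__(self):
        self.geo = {}    # user_id -> (lat, lng)
        self.decks = {}  # user_id -> cached deck

    def report_location(self, user_id, lat, lng):
        previous = self.geo.get(user_id)
        self.geo[user_id] = (lat, lng)            # GEOADD equivalent
        if previous is None:
            return
        if haversine_km(*previous, lat, lng) > MOVE_THRESHOLD_KM:
            self.decks.pop(user_id, None)         # DEL deck:{user_id}


tracker = LocationTracker()
tracker.report_location("alice", 40.7128, -74.0060)   # New York
tracker.decks["alice"] = ["ny_profile_1", "ny_profile_2"]
tracker.report_location("alice", 51.5074, -0.1278)    # London
print("alice" in tracker.decks)                       # deck was invalidated
```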

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
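
An availability target translates directly into an error budget. The short sketch below converts a target into allowed downtime per 30-day window; the function name and the 30-day window are illustrative assumptions.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    return (1 - slo) * window_days * 24 * 60


for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.2%} -> {error_budget_minutes(slo):.1f} min per 30 days")
```

At 99.95% the budget is about 21.6 minutes per 30 days, which is the figure to weigh against failover time and rollback time in the incident scenarios.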

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
