Design Top K Rankings System (App Store / Amazon Bestsellers)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Show domain depth beyond the baseline: async pipelines, consistency semantics, and operational SLOs.
Arch 75	Staff angles: partition behavior, cost drivers, and MVP → production evolution with clear triggers.

Interview Prompt

Design Design Top K Rankings System (App Store / Amazon Bestsellers).

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Min-heap per partition, Map-reduce merge, Approximate algorithms?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Min-heap per partition
Map-reduce merge
Approximate algorithms
Time-decay scoring
Multi-level aggregation
Capacity estimation with shown math

Out of scope (state explicitly)

Detailed frontend/UI pixel implementation
Org structure, staffing, and hiring plan

Assumptions

Index staleness of minutes is acceptable unless real-time is stated
Clarify query QPS vs index update rate early
Managed search/stream stack (Elasticsearch, Kafka) is fine to propose

Compute and display top K items (K = 10, 50, 100) ranked by a metric (sales, downloads, ratings, revenue)
Support multiple ranking dimensions: overall, by category, by time period (daily, weekly, monthly, all-time)
Near real-time updates: rankings reflect recent activity within minutes
Support historical rankings: "What was #1 last Tuesday?"
Handle large scale: millions of items, billions of events (purchases, downloads)
Expose ranked lists via API for clients

Metric	Calculation	Value
Total items	Given (assumption documented in value)	5M
Events (purchases/downloads) / day	Given (assumption documented in value)	1B
Events / sec	1B ÷ 86400	~12K (peak 60K)
Event size	Given (assumption documented in value)	100 bytes
Top K lists to maintain	1000 (categories × time periods)	1000 (categories × time periods)
List storage	1000 × 100 entries × 200B	20 MB
Event storage / day	1B × 100B	100 GB

Loading...

Stream Processing (Apache Flink): The Core Engine

Approach 1: Exact Count (for small-medium scale)

Flink Pipeline:
  Source(Kafka: ranking-events)
  → KeyBy(item_id)
  → Tumbling/Sliding Window(5 minutes)
  → Aggregate(count per item_id)
  → Top K (min-heap of size K)
  → Sink(Redis sorted set)

Approach 2: Approximate Count (for massive scale): Count-Min Sketch + Heap

When you have millions of unique items, maintaining exact counts for all is expensive.

Count-Min Sketch:

Probabilistic data structure: d hash functions × w counters (matrix)
On event for item X: hash with each of d functions, increment d counters
To query count of X: take minimum of d counter values
Space: O(w × d): typically 2 MB for < 0.1% error rate
Error: Always overcounts (never undercounts); error bounded by ε = e/w

Min-Heap for Top K:

JAVA

// Maintain a min-heap of size K
// For each incoming event:
if heap.size < K:
    heap.add(item)
elif item.count > heap.peek().count:
    heap.poll()
    heap.add(item)

Combined approach:

Count-Min Sketch tracks approximate counts for ALL items (memory efficient)
Min-Heap of size K maintains the current top K items
When a new event arrives:
- Increment item count in Count-Min Sketch
- Check if item's new count qualifies it for top K heap
- If yes → add/update in heap
Periodically flush heap to Redis

Time-Windowed Rankings

Rankings needed:
  - Hourly top K  → Sliding window, 1 hour, slide every 5 min
  - Daily top K   → Tumbling window, 24 hours
  - Weekly top K  → Tumbling window, 7 days
  - All-time top K → Cumulative counter (no window)

Exponential Decay for "Trending":

Instead of fixed windows, use exponential time decay:

PYTHON

score = Σ (event_value × e^(-λ × age_in_hours))
# λ = decay rate (e.g., 0.1 → half-life ≈ 7 hours)
# Recent events contribute more than old events

Redis (Ranking Cache)

Sorted Set per ranking list:

Key:    ranking:{category}:{period}
Type:   Sorted Set
Members: item_id
Scores:  count / score

ZREVRANGE ranking:games:daily 0 9 → Top 10 games today
ZREVRANK ranking:overall:weekly item123 → What's item123's rank this week?
Update atomically: ZADD ranking:games:daily 15432 item123

Batch Pipeline (Spark): Reconciliation

Hourly/daily Spark job reads raw events from Cassandra/S3
Computes exact rankings (no approximation)
Reconciles with real-time rankings in Redis (corrects Count-Min Sketch drift)
Writes historical rankings to Cassandra for "ranking at time T" queries

Event Bus Design (Kafka)

Topic: top_k_rankings-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (user_id / order_id — preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "top_k_rankings-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: top_k_rankings-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Design Top K Rankings System (App Store / Amazon Bestsellers): async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates

Get Top K

HTTP

GET /api/v1/rankings?category=games&period=daily&limit=50
Response: 200 OK
{
  "category": "games",
  "period": "daily",
  "as_of": "2026-03-13T10:00:00Z",
  "rankings": [
    {"rank": 1, "item_id": "app123", "name": "Puzzle Master", "score": 152000, "change": "+2"},
    {"rank": 2, "item_id": "app456", "name": "Word Rush", "score": 148500, "change": "-1"},
    ...
  ]
}

Get Item Rank

HTTP

GET /api/v1/rankings/item/{item_id}?category=games&period=weekly
Response: 200 OK
{
  "item_id": "app123",
  "ranks": {
    "overall": 15,
    "category_games": 3,
    "daily": 1,
    "weekly": 5
  }
}

Get Historical Rankings

HTTP

GET /api/v1/rankings/history?category=games&date=2026-03-01&limit=10

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
504 Gateway Timeout: index shard slow; narrow query or retry

Kafka Topic: `ranking-events`

JSON

{
  "event_id": "uuid",
  "event_type": "purchase",
  "item_id": "app123",
  "category": "games",
  "timestamp": "2026-03-13T10:00:00Z",
  "value": 1,
  "metadata": {"country": "US", "price": 2.99}
}

Redis: Ranked Lists

Key:    ranking:overall:daily
Type:   Sorted Set
Members: item_id
Scores:  event_count (or revenue, or weighted score)

Key:    ranking:games:weekly
Type:   Sorted Set
...

Cassandra: Historical Rankings

SQL

CREATE TABLE historical_rankings (
    category    TEXT,
    period      TEXT,          -- 'daily', 'weekly'
    snapshot_date DATE,
    rank        INT,
    item_id     TEXT,
    score       BIGINT,
    PRIMARY KEY ((category, period), snapshot_date, rank)
) WITH CLUSTERING ORDER BY (snapshot_date DESC, rank ASC);

Cassandra: Raw Events (for batch reconciliation)

SQL

CREATE TABLE ranking_events (
    event_date  DATE,
    event_hour  INT,
    event_id    UUID,
    item_id     TEXT,
    event_type  TEXT,
    category    TEXT,
    value       INT,
    timestamp   TIMESTAMP,
    PRIMARY KEY ((event_date, event_hour), event_id)
);

Concern	Solution
Flink failure	Checkpointing to S3 every 30s; restart from last checkpoint
Redis data loss	Batch pipeline can reconstruct rankings from raw events
Count-Min Sketch drift	Hourly reconciliation via Spark batch job
Event loss	Kafka with RF=3 ensures no events lost
Stale rankings	Serve from Redis (stale by at most window size); acceptable

Specific: Handling "Rank Manipulation"

Fake purchases/downloads to boost rankings
Defenses:
- Only count verified purchases (not free downloads with refunds)
- Device fingerprinting to detect bot farms
- Velocity checks: sudden spike in downloads → flag for review
- Exclude events from known fraud accounts

Exact vs Approximate Counting: When Approximate Is Good Enough

Exact counting (for small-medium scale):
  Each item has a counter in a hash map
  Space: O(N) where N = unique items
  At 10M unique items × 8 bytes = 80 MB (acceptable)
  At 1B unique items × 8 bytes = 8 GB → too large for in-memory aggregation
  
  Use when: N < 10M unique items per window → exact count is feasible
  Use case: App Store top 100 apps (only ~5M apps total)

Approximate counting (Count-Min Sketch for massive scale):
  Space: O(w × d) — fixed regardless of N
  Typical: w=10K, d=5 → 50K counters × 8 bytes = 400 KB
  Error rate: ε = e/w = 2.71/10K = 0.027% overcount
  
  Use when: N > 10M unique items → exact is memory-prohibitive
  Use case: Twitter trending hashtags (billions of unique tweets)

Key insight: Top-K is MORE tolerant of approximation than exact counting.
  If true rank: Item A=1000, Item B=999, Item C=998
  Count-Min may show: A=1001, B=1000, C=999 (1% overcount uniformly)
  → Relative ordering PRESERVED → Top K still correct!
  
  Only fails if two items have very close counts (e.g., A=1000, B=999):
    Count-Min might swap their ranks → acceptable for "trending" systems
    Not acceptable for exact ranking competitions or financial systems

Flink Tumbling vs Sliding vs Session Windows

For rankings, three window types matter:

Tumbling Windows (non-overlapping):
  [0-5min] [5-10min] [10-15min] ...
  ✓ Simple, efficient (each event processed once)
  ✗ "Top 5 in last hour" jumps at window boundaries
    At minute 59: window covers 55-60min → only 5 minutes of data in the "1 hour" ranking!
    At minute 60: window resets → sudden ranking changes
  Use for: Hourly batch snapshots (Top K this hour = events in [T-60min, T])

Sliding Windows (overlapping):
  Window size=5min, slide=1min
  [0-5] [1-6] [2-7] [3-8] ...
  ✓ Smooth, continuous updates
  ✗ Each event processed window_size/slide_interval times (5× overhead)
  ✗ Memory-intensive: keep state for all active windows
  Use for: "Trending now" that updates every minute (the right choice for rankings)

Session Windows (activity-based):
  Group events with < 30 min gap into a session
  ✓ Natural user behavior grouping
  ✗ Unpredictable size, complex for Top K
  Use for: User session analytics, not rankings

Recommendation for rankings:
  Real-time display: Sliding window (5min window, 1min slide) → always shows "last 5 minutes"
  Hourly rankings: Tumbling window → clear, non-overlapping periods
  Trending detection: Sliding window with exponential decay

Reconciling Real-Time vs Batch Rankings

Lambda Architecture for Top-K:

Speed Layer (Flink, ~5s latency):
  Advantages: Real-time, reflects last 5 minutes of activity
  Disadvantages: 
    - Count-Min Sketch has bounded error (may miss items near the K boundary)
    - Flink state is in-memory → lost on restart (need recovery from checkpoint)
    - Late events (clock skew > watermark) dropped → undercount

Batch Layer (Spark, hourly):
  Advantages: Exact counts, processes ALL events including late arrivals
  Disadvantages: Up to 1 hour stale

Serving Layer (Redis):
  Merges speed layer output with last batch output
  "Corrected ranking" = take batch as ground truth, apply speed layer deltas

Why not batch-only?
  Users expect near-real-time rankings ("trending now" should reflect last few minutes)
  Hourly updates feel stale for viral events

Why not speed-only?
  Flink can miss late events and has drift; batch provides correction
  Without batch reconciliation, rankings drift over days

Practical hybrid:
  Redis always serves from speed layer (< 5s latency)
  Every hour, batch job overwrites Redis with exact values
  Maximum inaccuracy: < 1% for a few hours, corrected hourly

The "Count Inflation" Problem in Ranking Systems

Problem: Bad actors inflate download/purchase counts to boost rankings.

Detection approaches:

1. Velocity anomaly detection:
   Normal: app X gets 10K downloads/day (avg last 30d)
   Suspicious: 500K downloads in 1 hour (50× spike)
   Action: flag for review, apply "fraud discount" to counting

2. Device/account deduplication:
   Same device re-downloading → count only once per device per 24h
   Redis: SETEX dedup:{device_id}:{item_id} 86400 "1"
   Only count if key didn't exist before (SETNX)

3. Velocity cap per source:
   Max 1000 downloads per IP per hour
   Max 10 downloads per account per day (legitimate users rarely need more)
   Redis: INCR downloads:{ip}:{hour} with TTL

4. Post-hoc adjustment:
   ML fraud score per event → filtered out before ranking computation
   High-velocity events from new accounts discounted by 70%
   
   Effect: fraud inflates raw count, but fraud-adjusted count stays accurate
   Implementation: Flink computes fraud_score per event → weighted INCR
     actual_count += (1 - fraud_score) × 1

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.

Interview Prompt

Clarifying Questions (ask before designing)

Scope

In scope

Out of scope (state explicitly)

Assumptions

Stream Processing (Apache Flink): The Core Engine

Time-Windowed Rankings

Redis (Ranking Cache)

Batch Pipeline (Spark): Reconciliation

Event Bus Design (Kafka)

Get Top K

Get Item Rank

Get Historical Rankings

Common Error Responses

Kafka Topic: `ranking-events`

Redis: Ranked Lists

Cassandra: Historical Rankings

Cassandra: Raw Events (for batch reconciliation)

Specific: Handling "Rank Manipulation"

Lambda Architecture (Real-time + Batch)

Multi-Dimensional Rankings

Heavy Hitters Problem

Interview Walkthrough

Exact vs Approximate Counting: When Approximate Is Good Enough

Flink Tumbling vs Sliding vs Session Windows

Reconciling Real-Time vs Batch Rankings

The "Count Inflation" Problem in Ranking Systems

Phase 1: MVP (0 to 100K users)

Phase 2: Growth (100K to 10M users)

Phase 3: Scale (10M+ users)

SLOs & Error Budgets

Incident Scenarios (2am reality)

Cost Drivers (Staff lens)

Multi-Region & DR