This problem appears in multiple sheets. Depth expectations increase as you progress:
Interview Prompt
Design Design Top K Rankings System (App Store / Amazon Bestsellers).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Min-heap per partition, Map-reduce merge, Approximate algorithms? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Min-heap per partition
- Map-reduce merge
- Approximate algorithms
- Time-decay scoring
- Multi-level aggregation
- Capacity estimation with shown math
Out of scope (state explicitly)
- Detailed frontend/UI pixel implementation
- Org structure, staffing, and hiring plan
Assumptions
- Index staleness of minutes is acceptable unless real-time is stated
- Clarify query QPS vs index update rate early
- Managed search/stream stack (Elasticsearch, Kafka) is fine to propose
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Compute and display top K items (K = 10, 50, 100) ranked by a metric (sales, downloads, ratings, revenue)
- Support multiple ranking dimensions: overall, by category, by time period (daily, weekly, monthly, all-time)
- Near real-time updates: rankings reflect recent activity within minutes
- Support historical rankings: "What was #1 last Tuesday?"
- Handle large scale: millions of items, billions of events (purchases, downloads)
- Expose ranked lists via API for clients
- Low Latency: Return top K list in < 50 ms
- High Availability: 99.99%
- Scalability: Process billions of ranking events per day
- Accuracy: Precise top K for small K (top 100); approximate OK for larger K
- Freshness: Rankings updated every 1-5 minutes
- Read-Heavy: Millions of users viewing rankings, few writing events
| Metric | Calculation | Value |
|---|---|---|
| Total items | Given (assumption documented in value) | 5M |
| Events (purchases/downloads) / day | Given (assumption documented in value) | 1B |
| Events / sec | 1B ÷ 86400 | ~12K (peak 60K) |
| Event size | Given (assumption documented in value) | 100 bytes |
| Top K lists to maintain | 1000 (categories × time periods) | 1000 (categories × time periods) |
| List storage | 1000 × 100 entries × 200B | 20 MB |
| Event storage / day | 1B × 100B | 100 GB |
Stream Processing (Apache Flink): The Core Engine
Approach 1: Exact Count (for small-medium scale)
Flink Pipeline: Source(Kafka: ranking-events) → KeyBy(item_id) → Tumbling/Sliding Window(5 minutes) → Aggregate(count per item_id) → Top K (min-heap of size K) → Sink(Redis sorted set)
Approach 2: Approximate Count (for massive scale): Count-Min Sketch + Heap
When you have millions of unique items, maintaining exact counts for all is expensive.
Count-Min Sketch:
- Probabilistic data structure: d hash functions × w counters (matrix)
- On event for item X: hash with each of d functions, increment d counters
- To query count of X: take minimum of d counter values
- Space: O(w × d): typically 2 MB for < 0.1% error rate
- Error: Always overcounts (never undercounts); error bounded by ε = e/w
Min-Heap for Top K:
// Maintain a min-heap of size K
// For each incoming event:
if heap.size < K:
heap.add(item)
elif item.count > heap.peek().count:
heap.poll()
heap.add(item)Combined approach:
- Count-Min Sketch tracks approximate counts for ALL items (memory efficient)
- Min-Heap of size K maintains the current top K items
- When a new event arrives:
- Increment item count in Count-Min Sketch
- Check if item's new count qualifies it for top K heap
- If yes → add/update in heap
- Periodically flush heap to Redis
Time-Windowed Rankings
Rankings needed: - Hourly top K → Sliding window, 1 hour, slide every 5 min - Daily top K → Tumbling window, 24 hours - Weekly top K → Tumbling window, 7 days - All-time top K → Cumulative counter (no window)
Exponential Decay for "Trending":
Instead of fixed windows, use exponential time decay:
score = Σ (event_value × e^(-λ × age_in_hours))
# λ = decay rate (e.g., 0.1 → half-life ≈ 7 hours)
# Recent events contribute more than old eventsRedis (Ranking Cache)
- Sorted Set per ranking list:
Key: ranking:{category}:{period}
Type: Sorted Set
Members: item_id
Scores: count / scoreZREVRANGE ranking:games:daily 0 9→ Top 10 games todayZREVRANK ranking:overall:weekly item123→ What's item123's rank this week?- Update atomically:
ZADD ranking:games:daily 15432 item123
Batch Pipeline (Spark): Reconciliation
- Hourly/daily Spark job reads raw events from Cassandra/S3
- Computes exact rankings (no approximation)
- Reconciles with real-time rankings in Redis (corrects Count-Min Sketch drift)
- Writes historical rankings to Cassandra for "ranking at time T" queries
Event Bus Design (Kafka)
Topic: top_k_rankings-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "top_k_rankings-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: top_k_rankings-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design Top K Rankings System (App Store / Amazon Bestsellers): async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
Get Top K
GET /api/v1/rankings?category=games&period=daily&limit=50
Response: 200 OK
{
"category": "games",
"period": "daily",
"as_of": "2026-03-13T10:00:00Z",
"rankings": [
{"rank": 1, "item_id": "app123", "name": "Puzzle Master", "score": 152000, "change": "+2"},
{"rank": 2, "item_id": "app456", "name": "Word Rush", "score": 148500, "change": "-1"},
...
]
}Get Item Rank
GET /api/v1/rankings/item/{item_id}?category=games&period=weekly
Response: 200 OK
{
"item_id": "app123",
"ranks": {
"overall": 15,
"category_games": 3,
"daily": 1,
"weekly": 5
}
}Get Historical Rankings
GET /api/v1/rankings/history?category=games&date=2026-03-01&limit=10Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff 504 Gateway Timeout: index shard slow; narrow query or retry
Kafka Topic: ranking-events
{
"event_id": "uuid",
"event_type": "purchase",
"item_id": "app123",
"category": "games",
"timestamp": "2026-03-13T10:00:00Z",
"value": 1,
"metadata": {"country": "US", "price": 2.99}
}Redis: Ranked Lists
Key: ranking:overall:daily Type: Sorted Set Members: item_id Scores: event_count (or revenue, or weighted score) Key: ranking:games:weekly Type: Sorted Set ...
Cassandra: Historical Rankings
CREATE TABLE historical_rankings (
category TEXT,
period TEXT, -- 'daily', 'weekly'
snapshot_date DATE,
rank INT,
item_id TEXT,
score BIGINT,
PRIMARY KEY ((category, period), snapshot_date, rank)
) WITH CLUSTERING ORDER BY (snapshot_date DESC, rank ASC);Cassandra: Raw Events (for batch reconciliation)
CREATE TABLE ranking_events (
event_date DATE,
event_hour INT,
event_id UUID,
item_id TEXT,
event_type TEXT,
category TEXT,
value INT,
timestamp TIMESTAMP,
PRIMARY KEY ((event_date, event_hour), event_id)
);| Concern | Solution |
|---|---|
| Flink failure | Checkpointing to S3 every 30s; restart from last checkpoint |
| Redis data loss | Batch pipeline can reconstruct rankings from raw events |
| Count-Min Sketch drift | Hourly reconciliation via Spark batch job |
| Event loss | Kafka with RF=3 ensures no events lost |
| Stale rankings | Serve from Redis (stale by at most window size); acceptable |
Specific: Handling "Rank Manipulation"
- Fake purchases/downloads to boost rankings
- Defenses:
- Only count verified purchases (not free downloads with refunds)
- Device fingerprinting to detect bot farms
- Velocity checks: sudden spike in downloads → flag for review
- Exclude events from known fraud accounts
Lambda Architecture (Real-time + Batch)
Speed Layer: Flink (real-time, approximate) → Redis Batch Layer: Spark (exact, hourly) → Cassandra → Redis Serving Layer: Redis (merged view)
Multi-Dimensional Rankings
Dimensions: - Category (games, productivity, education, ...) - Geography (US, UK, India, ...) - Time period (hourly, daily, weekly, monthly, all-time) - Metric (downloads, revenue, ratings, active users) Total ranking lists = categories × geographies × periods × metrics e.g., 50 × 20 × 5 × 4 = 20,000 ranking lists Each maintained as a Redis Sorted Set
Heavy Hitters Problem
The Top-K problem is related to the "Heavy Hitters" problem in streaming algorithms:
- Misra-Gries Algorithm: Maintains at most K-1 candidates using O(K) space
- Space-Saving Algorithm: Keeps top K counters, replaces the minimum when a new item arrives
- Lossy Counting: Maintains approximate counts with guaranteed error bound
For system design interviews, Count-Min Sketch + Min-Heap is the most common and practical answer.
Interview Walkthrough
- Frame as a streaming problem: you cannot store all item counts — bounded memory requires approximate algorithms.
- Present Count-Min Sketch for frequency estimation in O(1) space per update, then maintain a min-heap of size K for the current top items.
- Compare exact solutions (hash map + heap) vs approximate — exact works for small cardinality, sketch scales to billions of events.
- Discuss update path: events arrive on Kafka → stream processor increments sketch → periodic heap refresh for leaderboard API.
- Specify error tolerance: Count-Min Sketch overcounts but never undercounts — acceptable for trending lists, not for billing.
- Use Back-of-the-Envelope Estimation: 1M events/sec × 8 bytes × hash functions = sketch memory budget in megabytes, not terabytes.
- Common pitfall: sorting the entire dataset periodically in batch — at billion-event scale the sort job cannot finish before the next window closes.
Exact vs Approximate Counting: When Approximate Is Good Enough
Exact counting (for small-medium scale):
Each item has a counter in a hash map
Space: O(N) where N = unique items
At 10M unique items × 8 bytes = 80 MB (acceptable)
At 1B unique items × 8 bytes = 8 GB → too large for in-memory aggregation
Use when: N < 10M unique items per window → exact count is feasible
Use case: App Store top 100 apps (only ~5M apps total)
Approximate counting (Count-Min Sketch for massive scale):
Space: O(w × d) — fixed regardless of N
Typical: w=10K, d=5 → 50K counters × 8 bytes = 400 KB
Error rate: ε = e/w = 2.71/10K = 0.027% overcount
Use when: N > 10M unique items → exact is memory-prohibitive
Use case: Twitter trending hashtags (billions of unique tweets)
Key insight: Top-K is MORE tolerant of approximation than exact counting.
If true rank: Item A=1000, Item B=999, Item C=998
Count-Min may show: A=1001, B=1000, C=999 (1% overcount uniformly)
→ Relative ordering PRESERVED → Top K still correct!
Only fails if two items have very close counts (e.g., A=1000, B=999):
Count-Min might swap their ranks → acceptable for "trending" systems
Not acceptable for exact ranking competitions or financial systemsFlink Tumbling vs Sliding vs Session Windows
For rankings, three window types matter:
Tumbling Windows (non-overlapping):
[0-5min] [5-10min] [10-15min] ...
✓ Simple, efficient (each event processed once)
✗ "Top 5 in last hour" jumps at window boundaries
At minute 59: window covers 55-60min → only 5 minutes of data in the "1 hour" ranking!
At minute 60: window resets → sudden ranking changes
Use for: Hourly batch snapshots (Top K this hour = events in [T-60min, T])
Sliding Windows (overlapping):
Window size=5min, slide=1min
[0-5] [1-6] [2-7] [3-8] ...
✓ Smooth, continuous updates
✗ Each event processed window_size/slide_interval times (5× overhead)
✗ Memory-intensive: keep state for all active windows
Use for: "Trending now" that updates every minute (the right choice for rankings)
Session Windows (activity-based):
Group events with < 30 min gap into a session
✓ Natural user behavior grouping
✗ Unpredictable size, complex for Top K
Use for: User session analytics, not rankings
Recommendation for rankings:
Real-time display: Sliding window (5min window, 1min slide) → always shows "last 5 minutes"
Hourly rankings: Tumbling window → clear, non-overlapping periods
Trending detection: Sliding window with exponential decayReconciling Real-Time vs Batch Rankings
Lambda Architecture for Top-K:
Speed Layer (Flink, ~5s latency):
Advantages: Real-time, reflects last 5 minutes of activity
Disadvantages:
- Count-Min Sketch has bounded error (may miss items near the K boundary)
- Flink state is in-memory → lost on restart (need recovery from checkpoint)
- Late events (clock skew > watermark) dropped → undercount
Batch Layer (Spark, hourly):
Advantages: Exact counts, processes ALL events including late arrivals
Disadvantages: Up to 1 hour stale
Serving Layer (Redis):
Merges speed layer output with last batch output
"Corrected ranking" = take batch as ground truth, apply speed layer deltas
Why not batch-only?
Users expect near-real-time rankings ("trending now" should reflect last few minutes)
Hourly updates feel stale for viral events
Why not speed-only?
Flink can miss late events and has drift; batch provides correction
Without batch reconciliation, rankings drift over days
Practical hybrid:
Redis always serves from speed layer (< 5s latency)
Every hour, batch job overwrites Redis with exact values
Maximum inaccuracy: < 1% for a few hours, corrected hourlyThe "Count Inflation" Problem in Ranking Systems
Problem: Bad actors inflate download/purchase counts to boost rankings.
Detection approaches:
1. Velocity anomaly detection:
Normal: app X gets 10K downloads/day (avg last 30d)
Suspicious: 500K downloads in 1 hour (50× spike)
Action: flag for review, apply "fraud discount" to counting
2. Device/account deduplication:
Same device re-downloading → count only once per device per 24h
Redis: SETEX dedup:{device_id}:{item_id} 86400 "1"
Only count if key didn't exist before (SETNX)
3. Velocity cap per source:
Max 1000 downloads per IP per hour
Max 10 downloads per account per day (legitimate users rarely need more)
Redis: INCR downloads:{ip}:{hour} with TTL
4. Post-hoc adjustment:
ML fraud score per event → filtered out before ranking computation
High-velocity events from new accounts discounted by 70%
Effect: fraud inflates raw count, but fraud-adjusted count stays accurate
Implementation: Flink computes fraud_score per event → weighted INCR
actual_count += (1 - fraud_score) × 1Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core top k rankings flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.