Interview Prompt
Design Design Top K Most Shared Articles.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Lossy counting, Time-window leaderboards, Map-reduce aggregation? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Lossy counting
- Time-window leaderboards
- Map-reduce aggregation
- Approximation algorithms
- Capacity estimation with shown math
Out of scope (state explicitly)
- Detailed frontend/UI pixel implementation
- Org structure, staffing, and hiring plan
Assumptions
- Clarify scale (DAU, QPS, data volume) for top k shared articles in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Track shares: Record every time an article/URL is shared
- Top-K ranking: Return the K most shared articles in a given time window (1h, 24h, 7d, all-time)
- Real-time: Rankings update within 1-5 minutes of new shares
- Category filtering: Top K per category (tech, sports, politics)
- Regional: Top K per country/region
- Trending detection: Highlight articles with fastest share velocity
- High Throughput: 100K+ share events/sec
- Low Latency: Top-K query responds in < 50 ms
- Approximate OK: Exact ranking not required; approximate within 5% is acceptable
- Scalability: Handle billions of articles, millions of shares/day
- Availability: 99.99%
| Metric | Calculation | Value |
|---|---|---|
| Share events / day | Given | 1B |
| Share events / sec | 1B ÷ 86400 | ~12K (peak 100K) |
| Unique articles shared / day | Given | 10M |
| Top-K returned | K | 100 typically |
| Share event size | Given (assumption documented in value) | 100 bytes |
| Time windows | Given (assumption documented in value) | 1h, 24h, 7d, 30d |
Multi-Window Counting with Panes
Pane-based approach ⭐:
Divide time into 5-minute "panes"
Each pane stores: count per article in that 5-min period
1-hour count = sum of last 12 panes
24-hour count = sum of last 288 panes
7-day count = sum of last 2016 panes
Optimization: Only track articles with > 10 shares in any pane
→ ~100K active articles × 2016 panes = 800 MB → manageableApproximate Top-K: Space-Saving Algorithm
Maintain a fixed-size map of K=1000 entries: {article_id → count}
On each share event:
If article already in map → increment count
If article not in map AND map has < K entries → add with count=1
If article not in map AND map is full:
Find entry with MINIMUM count (= min_count)
Replace it: article_id = new_article, count = min_count + 1
Guarantee: The true top-K is always contained in the result
Memory: O(K) = O(1000) regardless of how many articles existViral Article: Single Key Hot Spot
One article goes viral → 1M shares in 10 minutes → all INCR operations
hit ONE Flink key/partition → backpressure
Solutions:
a. Pre-aggregate at service layer:
Each API server buffers share counts in-memory for 1 second
Flush batch count to Kafka: {article_id: "art-123", count: 342}
b. Use sub-keys → partition by hash(article_id + random(0..7))
Flink: aggregate across sub-keys in a second stageEvent Bus Design (Kafka)
Topic: top_k_shared_articles-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "top_k_shared_articles-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: top_k_shared_articles-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design Top K Most Shared Articles: async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
GET /api/v1/top-articles?window=24h®ion=US&category=tech&limit=50
→ 200 OK
{
"articles": [
{"article_id": "art-123", "url": "https://...", "title": "...",
"share_count": 152340, "rank": 1}
],
"window": "24h",
"computed_at": "2026-03-14T10:05:00Z"
}Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff 504 Gateway Timeout: index shard slow; narrow query or retry
Redis
Key: topk:{window}:{region}:{category}
Type: Sorted Set
Score: share_count
Member: article_id
Ops: ZREVRANGE for top-K, ZADD for updates
Key: article_meta:{article_id}
Type: Hash
Fields: url, title, image_url, publisher, categoryClickHouse
CREATE TABLE share_events (
article_id String,
user_id String,
platform LowCardinality(String),
region LowCardinality(String),
category LowCardinality(String),
shared_at DateTime
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(shared_at)
ORDER BY (region, category, shared_at);
CREATE MATERIALIZED VIEW share_counts_hourly
ENGINE = SummingMergeTree()
ORDER BY (article_id, region, category, hour)
AS SELECT
article_id, region, category,
toStartOfHour(shared_at) AS hour,
count() AS share_count
FROM share_events
GROUP BY article_id, region, category, hour;| Concern | Solution |
|---|---|
| Flink failure | Checkpoint to S3 every 60s; resume from checkpoint |
| Redis loss | Flink recomputes full top-K within 60 seconds from pane state |
| Late events | Flink watermark allows 5-min late arrivals |
| Article metadata missing | Async enrichment; display URL if title unavailable |
| Spam shares | Rate limit per user (max 100 shares/hour); bot detection |
URL Deduplication for Articles
URL normalization pipeline: 1. Expand shortened URLs (follow redirects) 2. Strip tracking params (utm_source, utm_medium, fbclid) 3. Normalize: lowercase, remove trailing slash, remove www prefix 4. Compute canonical_url_hash = SHA256(normalized_url) 5. Use canonical_url_hash as the article_id for counting
Click-Bait and Spam Filtering
Detection signals: 1. Share source diversity: if >80% of shares from <50 unique users → spam 2. Article domain reputation: known spam domains get score penalty 3. Share velocity anomaly: organic articles grow gradually; spam jumps 0→100K 4. User account age: if >50% of sharers have accounts <7 days old → bot Action: Apply penalty multiplier (0.1×) or suppress entirely
Virality Detection
Track share velocity per article (shares per minute, sliding window) Stage 1: "Emerging" — velocity > 50 shares/min Stage 2: "Viral" — velocity > 500 shares/min Stage 3: "Mega-viral" — velocity > 5000 shares/min On stage transition → publish Kafka event for: Editorial dashboard, push notification, CDN pre-warming
Interview Walkthrough
- Normalize URLs before counting: strip tracking params, expand short links, lowercase — otherwise the same article splits across multiple IDs.
- Use a sliding time window (last 24h) with decay, not lifetime totals — stale evergreen articles must not dominate the leaderboard.
- Apply Space-Saving or Count-Min Sketch for approximate per-article counts — O(1) per share event instead of O(N log N) global sort.
- Run spam detection on share velocity anomalies: organic growth is gradual, bot campaigns jump from 0 to 100K shares in minutes.
- Penalize articles where >80% of shares come from <50 unique accounts or from accounts younger than 7 days.
- Publish virality stage transitions (Emerging → Viral → Mega-viral) via Kafka to trigger CDN pre-warming and editorial alerts.
- Quantify: 100K shares/sec × 200 bytes ≈ 20 MB/s ingest — partition Kafka by canonical URL hash for ordered per-article counting.
- Common pitfall: running an exact global sort on every share event — memory and CPU explode as the article catalog grows past 10M entries.
Exact vs Approximate Top-K
| Approach | Memory | Accuracy | Speed |
|---|---|---|---|
| Exact (Full Count) | O(N) = 10M+ entries | Perfect | Expensive sort |
| Space-Saving ⭐ | O(K) = 1000 entries | Approximate | O(1) per event |
| Hybrid (Count-Min + HashMap + Sort) | O(10K) | Exact top-100 | Fast |
MapReduce vs Stream Processing
Batch (Spark): scan all events → aggregate → rank. Latency: hours. Exact. Stream (Flink) ⭐: continuous processing. Latency: seconds. Near real-time. Best: Flink for real-time + hourly Spark job to reconcile exact counts
Fixed Tumbling Window vs Exponential Decay
Fixed tumbling window: cliff effect — article drops off suddenly at T+1h Exponential decay: score = shares × e^(-λ × age), no hard cliff Recommended hybrid: Real-time: Redis INCR in 1-minute buckets for last 24 hours Batch: Spark/Flink hourly computes decay-weighted score Serve: Redis sorted set ZREVRANGE for top-K query
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core top k shared articles flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.