This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 75 | Staff level: multi-region, cost at scale, migration path, and production metrics. |
Interview Prompt
Design Threaded Discussion Forum (Reddit).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Nested comment trees (materialized path vs adjacency list vs closure table), Hot/best/controversial ranking, Subreddit isolation? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Nested comment trees (materialized path vs adjacency list vs closure table)
- Hot/best/controversial ranking
- Subreddit isolation
- Vote aggregation
- Capacity estimation with shown math
Out of scope (state explicitly)
- Full ML ranking model training pipeline
- Direct messaging / chat (#07)
- Ad insertion and monetization
Assumptions
- Millions of DAU with heavy fan-out — clarify celebrity/hot-key cases early
- Eventual consistency acceptable for non-critical side effects (counts, notifications)
- WebSocket or push infrastructure available at the edge
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Subreddits: Create and join topic-based communities
- Posts: Submit text posts, links, images, videos to a subreddit
- Comments: Threaded/nested comments on posts (tree structure)
- Voting: Upvote/downvote on posts and comments
- Ranking: Posts ranked by Hot, New, Top, Rising, Controversial
- User Karma: Aggregated score from upvotes/downvotes received
- Home feed: Aggregated posts from subscribed subreddits
- Search: Search posts, comments, subreddits
- Moderation: Subreddit moderators can remove posts/comments, ban users
- Awards / Gilding: Give awards to posts/comments (premium feature)
- High Availability: 99.99%
- Low Latency: Front page loads in < 200 ms; comment thread loads in < 500 ms
- Scalability: 50M+ DAU, millions of posts/day
- Read-Heavy: 100:1 read to write ratio
- Eventual Consistency: Votes and karma can be slightly stale
- Deep Comment Threads: Support threads with 10,000+ comments, nested 20+ levels deep
| Metric | Calculation | Value |
|---|---|---|
| DAU | Given (product assumption) | 50M |
| Posts / day | 50M DAU × 0.1 | 5M |
| Comments / day | 50M DAU × 1 | 50M |
| Votes / day | 50M DAU × 10 | 500M |
| Votes / sec | 500M ÷ 86400 | ~6,000 (peak 30K) |
| Avg post size | Given (typical workload assumption) | 2 KB |
| Avg comment size | Given (typical workload assumption) | 500 bytes |
| Storage / day | (5M × 2KB) + (50M × 500B) | 35 GB |
| Hot post cache | 100K posts × 10 KB | 1 GB |
Post Service
- Create post: Validate content (size limits, flair required for some subs), upload media to S3 if present, write to PostgreSQL, publish
post-createdevent to Kafka - Post types: Text, link, image, video, poll: each stored with type-specific metadata
- Flair: Per-subreddit flair system (tags like "Discussion", "Meme", "Question"): enforced by Post Service
- Soft delete: Deleted posts keep the entry (for comment tree integrity) but content is replaced with "[deleted]"
Subreddit Service
- Manages: Subreddit creation, membership (join/leave), rules, moderator assignments, banned users
- Membership:
user_subredditstable (user_id, subreddit_id): used by Feed Service to build home feed - Access control: Public, restricted (anyone can view, only approved can post), private (invite-only)
- Moderator actions: Remove posts/comments, ban users, pin posts, configure automod rules
Comment Tree Service: The Key Challenge
Reddit's nested comment tree is the most complex data model:
Approach 1: Adjacency List (simple but slow for deep nesting)
comments (comment_id, post_id, parent_comment_id, ...)
-- To get full tree: recursive query or multiple DB round tripsApproach 2: Materialized Path ? (recommended)
comments (comment_id, post_id, path, ...)
-- path = "001/003/007/015" (ancestors from root)
-- To get all descendants: WHERE path LIKE '001/003/%'
-- Sorting by path gives pre-order traversal (natural comment display order)Approach 3: Nested Set Model (fast reads, slow writes)
comments (comment_id, post_id, lft, rgt, ...)
-- All descendants: WHERE lft > parent.lft AND rgt < parent.rgt
-- But inserting a comment requires updating lft/rgt of many rowsApproach 4: Closure Table (flexible but storage-heavy)
comment_tree (ancestor_id, descendant_id, depth)
-- One row per ancestor-descendant pair
-- Fast queries but O(depth) insertsRecommendation: Materialized Path for Reddit-like system. Efficient reads, reasonable writes, and supports sorting.
Vote Service
- Challenge: 500M votes/day = high write throughput on potentially hot rows
- Architecture:
- Vote request → check Redis if user already voted on this post/comment
- If new vote or changed vote → publish to Kafka
vote-events - Kafka consumer updates PostgreSQL (source of truth) and Redis (cached counts)
- Redis stores:
vote_count:{entity_type}:{entity_id}as atomic counter - Redis stores:
user_votes:{user_id}:{post_id}→ direction (up/down/none)
- Why not direct DB write: Hot posts get thousands of concurrent votes → DB contention. Kafka buffers and Redis absorbs reads
Ranking Algorithms: Deep Dive
Hot Ranking (Reddit's actual algorithm)
import math
from datetime import datetime
def hot_score(ups, downs, created_at):
score = ups - downs
order = math.log10(max(abs(score), 1))
sign = 1 if score > 0 else -1 if score < 0 else 0
# Epoch: Dec 8, 2005 (Reddit's birthday)
epoch = datetime(2005, 12, 8, 7, 46, 43)
seconds = (created_at - epoch).total_seconds()
return round(sign * order + seconds / 45000, 7)- Key insight: Time dominates. A post from 12.5 hours ago needs 10× the votes to rank equally with a new post
- The log scale means going from 10 → 100 votes has same effect as 100 → 1000
Wilson Score (for "Best" comment ranking)
import math
def wilson_score(ups, downs, confidence=0.95):
n = ups + downs
if n == 0:
return 0
z = 1.96 # 95% confidence
p = ups / n
return (p + z*z/(2*n) - z * math.sqrt((p*(1-p) + z*z/(4*n)) / n)) / (1 + z*z/n)- Handles the "1 upvote, 0 downvote" vs "100 upvote, 1 downvote" problem
- Lower bound of confidence interval → more data = more confident
Controversial
def controversial_score(ups, downs):
if ups <= 0 or downs <= 0:
return 0
magnitude = ups + downs
balance = min(ups, downs) / max(ups, downs)
return magnitude * balanceFeed Service
- Home feed = aggregation from subscribed subreddits
- Pre-computation: Background job periodically computes top posts per subreddit → stores in Redis
- On request: Merge top posts from subscribed subreddits, interleave by score
- Popular/All: Global feed → single pre-computed ranked list in Redis
Search Service (Elasticsearch)
- Indexed: Post title, body text, subreddit name, author username
- Features: Full-text search with BM25 ranking, filter by subreddit/time range/post type, sort by relevance or recency
- Sync: Post creates/updates → Kafka → Elasticsearch consumer indexes in near real-time
Moderation Pipeline
- Consumes post and comment events from Kafka for automated moderation:
- Spam ML model: Detect spam posts (link farms, repetitive content, new account patterns)
- Toxicity classifier: Flag toxic/hateful content for review or auto-removal
- AutoMod rules: Per-subreddit regex rules (e.g., "remove posts with banned keywords", "require minimum karma to post")
- Report queue: User reports → prioritized queue for human moderators
- Auto-removal: Content matching high-confidence spam/toxicity → removed immediately, mod notified
Event Bus Design (Kafka)
Topic: reddit_discussion_forum-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "reddit_discussion_forum-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: reddit_discussion_forum-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design a Threaded Discussion Forum (Reddit): async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
Create Post
POST /api/v1/subreddits/{subreddit_name}/posts
{
"title": "System Design Tips",
"content": "Here are my top tips...",
"type": "text",
"flair": "Discussion"
}Get Post with Comments
GET /api/v1/posts/{post_id}?sort=best&depth=5&limit=200
Response:
{
"post": { ... },
"comments": [
{
"comment_id": "...",
"content": "Great post!",
"score": 150,
"depth": 0,
"replies": [
{
"comment_id": "...",
"content": "Agreed!",
"score": 42,
"depth": 1,
"replies": [...]
}
]
}
],
"has_more_comments": true
}Vote
POST /api/v1/vote
{
"entity_type": "post",
"entity_id": "post-uuid",
"direction": 1 // 1 = upvote, -1 = downvote, 0 = remove vote
}Get Subreddit Feed
GET /api/v1/subreddits/{name}/posts?sort=hot&cursor=...&limit=25Add Comment
POST /api/v1/posts/{post_id}/comments
{
"parent_comment_id": "comment-uuid", // null for top-level
"content": "This is a great point!"
}Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff 504 Gateway Timeout: index shard slow; narrow query or retry
PostgreSQL: Posts (Sharded by post_id)
CREATE TABLE posts (
post_id BIGINT PRIMARY KEY,
subreddit_id INT NOT NULL,
user_id BIGINT NOT NULL,
title VARCHAR(300),
content TEXT,
url TEXT,
media_url TEXT,
post_type ENUM('text', 'link', 'image', 'video'),
score INT DEFAULT 0,
upvotes INT DEFAULT 0,
downvotes INT DEFAULT 0,
comment_count INT DEFAULT 0,
hot_score FLOAT,
is_locked BOOLEAN DEFAULT FALSE,
is_removed BOOLEAN DEFAULT FALSE,
created_at TIMESTAMP,
INDEX idx_subreddit_hot (subreddit_id, hot_score DESC),
INDEX idx_subreddit_new (subreddit_id, created_at DESC),
INDEX idx_subreddit_top (subreddit_id, score DESC)
);PostgreSQL: Comments (Materialized Path)
CREATE TABLE comments (
comment_id BIGINT PRIMARY KEY,
post_id BIGINT NOT NULL,
user_id BIGINT NOT NULL,
parent_id BIGINT, -- null for top-level
path VARCHAR(1024), -- "001.003.007.015"
depth SMALLINT,
content TEXT,
score INT DEFAULT 0,
upvotes INT DEFAULT 0,
downvotes INT DEFAULT 0,
is_removed BOOLEAN DEFAULT FALSE,
created_at TIMESTAMP,
INDEX idx_post_path (post_id, path),
INDEX idx_post_score (post_id, depth, score DESC)
);Querying comment tree
-- Get all top-level comments sorted by "best"
SELECT * FROM comments
WHERE post_id = ? AND depth = 0
ORDER BY wilson_score DESC LIMIT 25;
-- Get all replies to a comment
SELECT * FROM comments
WHERE post_id = ? AND path LIKE '001.003.%'
ORDER BY path;Redis: Vote Counts & User Votes
# Score cache
Key: score:post:{post_id}
Value: Hash { upvotes: 150, downvotes: 3, hot_score: 123456.789 }
# User's vote on a post (to show UI state and prevent double-voting)
Key: uv:{user_id}:{post_id}
Value: 1 | -1 | (deleted if removed)
TTL: 30 daysRedis: Ranked Feeds
# Subreddit hot feed (pre-computed)
Key: feed:sub:{subreddit_id}:hot
Type: Sorted Set
Members: post_id
Scores: hot_score
# Home feed for user
Key: feed:home:{user_id}
Type: Sorted Set
Members: post_id
Scores: hot_score
TTL: 300 (5 min, rebuilt by background job)Kafka Topics
Topic: post-events (new post, edit, delete) Topic: comment-events (new comment, edit, delete) Topic: vote-events (upvote, downvote, remove)
General
| Concern | Solution |
|---|---|
| Vote count accuracy | Kafka ensures no vote is lost; Redis is periodically reconciled with DB |
| Comment tree corruption | Materialized path is append-only; no destructive updates |
| Hot score staleness | Background job recomputes hot scores every 5 minutes; Kafka consumers update in near real-time for popular posts |
| DB hot spot (popular post) | Read replicas for post reads; Redis caches for vote counts |
| Ranking computation failure | Stale rankings served from Redis (still functional, just not perfectly fresh) |
Specific: Handling Vote Manipulation
- Challenge: Bots mass-upvoting/downvoting
- Solutions:
- One vote per user per entity (enforced at DB level with unique constraint)
- Rate limiting votes per user (max 100 votes/minute)
- "Vote fuzzing": Display slightly randomized scores to confuse bots
- Shadow banning: Bot's votes are accepted but not counted
- IP-based detection: Multiple accounts voting from same IP
"More Comments" / Lazy Loading
- For posts with 10K+ comments, don't load all at once
- Load top 200 top-level comments sorted by "best"
- Each comment shows reply count; clicking "load replies" fetches children
- Use cursor-based pagination for deep threads
Subreddit Moderation
- Moderators can: remove posts/comments, ban users, set rules, configure automod
- AutoModerator: Rule-based system (regex patterns, account age requirements, karma thresholds)
- Moderation queue: Reported content queued for moderator review
Cross-posting
- A post can appear in multiple subreddits (cross-post)
- Stored as a separate post with
crosspost_parent_idreference - Votes are independent per crosspost
Caching Strategy
L1: Client-side cache (browser/app) — 60s TTL L2: CDN cache (for popular subreddits) — 30s TTL L3: Redis (ranked feeds, vote counts) — 5 min TTL L4: PostgreSQL (source of truth)
Interview Walkthrough
- Clarify ranking model upfront — hot score (time-decayed engagement) vs new vs top — because each implies different precomputation and cache keys.
- Model posts as a tree: store adjacency list or materialized path in PostgreSQL; lazy-load comment branches on expand.
- Use Redis sorted sets per subreddit+sort for ranked feeds; recompute hot scores asynchronously on vote events.
- Handle vote writes with optimistic concurrency or a counter service — avoid locking the post row on every upvote.
- Cache aggressively at four layers (client, CDN, Redis, DB) with TTLs tuned to staleness tolerance per sort mode.
- Apply Back-of-the-Envelope Estimation on vote QPS: 50M DAU × 10 votes/day = 500M writes — batch or aggregate before hitting PostgreSQL.
- Common pitfall: computing hot ranking synchronously on every read instead of precomputing on write/vote, causing p99 latency spikes on popular threads.
Vote Counting: Redis INCR vs DB Counters vs Log-Based Aggregation
Reddit has 2B+ votes/day across billions of posts and comments.
Option 1: Direct DB update (naive)
UPDATE posts SET score = score + 1 WHERE post_id = ?
→ At 2B votes/day → write bottleneck on hot posts
✗ Single-row hot spot for viral posts (millions of concurrent UPDATEs)
✗ DB row locking under high concurrency
Option 2: Redis INCR (real-time approximate) ?
INCR vote_score:{post_id}
INCR vote_up:{post_id} or INCR vote_down:{post_id}
✓ O(1) atomic operation, handles millions/sec
→ Single-threaded Redis → no contention
✓ Reads are instant (no aggregation needed)
✗ Volatile (lost on Redis restart without persistence)
✗ Periodically needs to sync to DB
Sync strategy: every 30 seconds, Kafka consumer writes batch to PostgreSQL
INSERT INTO post_votes (post_id, score, ups, downs, updated_at)
VALUES (...) ON CONFLICT DO UPDATE SET score=EXCLUDED.score
Option 3: Kafka event log + Flink aggregation
Each vote → Kafka event → Flink aggregates → Redis + DB write
✓ Durable (Kafka replays if Flink fails)
✓ Can recompute historical vote counts from log
✗ Higher latency (5-30s until vote reflected)
✗ More infrastructure complexity
Reddit's actual approach: Redis INCR for real-time display + async sync to DB
When post ages out of Redis (TTL: 24h for hot, 7d for new) → score frozen in DBComment Tree: Materialized Path vs Adjacency List vs Nested Sets
Reddit threads can have 10K+ comments, 50+ levels deep.
Option 1: Adjacency List (parent_id)
Schema: comments(comment_id, post_id, parent_id, body, score)
"Get entire thread": recursive CTE
WITH RECURSIVE tree AS (
SELECT * FROM comments WHERE post_id = ? AND parent_id IS NULL
UNION ALL
SELECT c.* FROM comments c JOIN tree t ON c.parent_id = t.comment_id
)
✓ Simple schema, easy inserts
✗ Recursive CTE expensive for deep/wide trees
✗ N+1 queries in application code without recursive query
Option 2: Materialized Path ⭐ (Reddit-style)
Schema: comments(comment_id, path TEXT)
path = "1.4.23.87" (IDs of all ancestors, dot-separated)
"All descendants of comment 4": WHERE path LIKE '1.4.%'
"Sort by depth then score": ORDER BY path, score DESC
"Insert reply to comment 87": path = parent.path + '.' + new_id
✓ Single query for entire subtree
✓ Natural sort order (hierarchical)
✓ Path encodes depth (split('.').length)
✓ Prefix index makes queries fast
✗ Path must be updated if post hierarchy changes (rare for comments)
Option 3: Nested Sets (left/right values)
✓ Super fast subtree queries
✗ Inserts require re-numbering all nodes → catastrophically slow for comments
✗ Not suitable for write-heavy trees
Reddit uses materialized path. Fetch top 200 comments sorted by path + score.
Client renders tree UI. "Load more replies" fetches children lazily via path prefix.Hot Score Algorithm: Wilson Score vs Hacker News vs Reddit
Different ranking algorithms for different goals:
Simple score: upvotes - downvotes
✗ Biased toward old posts (more time = more votes)
✗ A post from 2010 with 100K upvotes beats today's post with 5K
Hacker News algorithm:
score = points / (age + 2)^1.8
Simple, but ignores downvotes
Reddit's Wilson Score (for comments) ⭐:
Lower bound of Wilson confidence interval for a Bernoulli parameter
wilson = (p + z²/(2n) - zv((p(1-p) + z²/(4n))/n)) / (1 + z²/n)
p = upvotes/(upvotes + downvotes), n = total votes, z = 1.96
✓ Balances vote ratio with vote count confidence
✓ A 100% upvoted comment with 5 votes ranks below an 80% upvoted comment with 100 votes
✓ Handles controversial content (50/50 split = low confidence = lower rank)
Reddit's "Hot" algorithm (for posts):
score = log10(ups - downs) + sign(ups - downs) × seconds_since_epoch(post) / 45000
Key properties:
- Logarithmic: 10 votes → same boost as 100 votes? No. log10(10)=1, log10(100)=2
Each 10× more votes = 1 rank unit. Prevents mega-viral posts dominating forever
- Recency: +1 rank unit per 12.5 hours → recent posts naturally rise
- Post from today with 100 upvotes ranks above post from 2 weeks ago with 200 upvotesStaff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core reddit discussion forum flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.