Design a Threaded Discussion Forum (Reddit) – System Design Walkthrough

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 75	Staff level: multi-region, cost at scale, migration path, and production metrics.

Interview Prompt

Design Threaded Discussion Forum (Reddit).

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Nested comment trees (materialized path vs adjacency list vs closure table), Hot/best/controversial ranking, Subreddit isolation?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Nested comment trees (materialized path vs adjacency list vs closure table)
Hot/best/controversial ranking
Subreddit isolation
Vote aggregation
Capacity estimation with shown math

Out of scope (state explicitly)

Full ML ranking model training pipeline
Direct messaging / chat (#07)
Ad insertion and monetization

Assumptions

Millions of DAU with heavy fan-out — clarify celebrity/hot-key cases early
Eventual consistency acceptable for non-critical side effects (counts, notifications)
WebSocket or push infrastructure available at the edge

Subreddits: Create and join topic-based communities
Posts: Submit text posts, links, images, videos to a subreddit
Comments: Threaded/nested comments on posts (tree structure)
Voting: Upvote/downvote on posts and comments
Ranking: Posts ranked by Hot, New, Top, Rising, Controversial
User Karma: Aggregated score from upvotes/downvotes received
Home feed: Aggregated posts from subscribed subreddits
Search: Search posts, comments, subreddits
Moderation: Subreddit moderators can remove posts/comments, ban users
Awards / Gilding: Give awards to posts/comments (premium feature)

Metric	Calculation	Value
DAU	Given (product assumption)	50M
Posts / day	50M DAU × 0.1	5M
Comments / day	50M DAU × 1	50M
Votes / day	50M DAU × 10	500M
Votes / sec	500M ÷ 86400	~6,000 (peak 30K)
Avg post size	Given (typical workload assumption)	2 KB
Avg comment size	Given (typical workload assumption)	500 bytes
Storage / day	(5M × 2KB) + (50M × 500B)	35 GB
Hot post cache	100K posts × 10 KB	1 GB

Loading...

Post Service

Create post: Validate content (size limits, flair required for some subs), upload media to S3 if present, write to PostgreSQL, publish post-created event to Kafka
Post types: Text, link, image, video, poll: each stored with type-specific metadata
Flair: Per-subreddit flair system (tags like "Discussion", "Meme", "Question"): enforced by Post Service
Soft delete: Deleted posts keep the entry (for comment tree integrity) but content is replaced with "[deleted]"

Subreddit Service

Manages: Subreddit creation, membership (join/leave), rules, moderator assignments, banned users
Membership: user_subreddits table (user_id, subreddit_id): used by Feed Service to build home feed
Access control: Public, restricted (anyone can view, only approved can post), private (invite-only)
Moderator actions: Remove posts/comments, ban users, pin posts, configure automod rules

Comment Tree Service: The Key Challenge

Reddit's nested comment tree is the most complex data model:

Approach 1: Adjacency List (simple but slow for deep nesting)

SQL

comments (comment_id, post_id, parent_comment_id, ...)
-- To get full tree: recursive query or multiple DB round trips

Approach 2: Materialized Path ? (recommended)

SQL

comments (comment_id, post_id, path, ...)
-- path = "001/003/007/015" (ancestors from root)
-- To get all descendants: WHERE path LIKE '001/003/%'
-- Sorting by path gives pre-order traversal (natural comment display order)

Approach 3: Nested Set Model (fast reads, slow writes)

SQL

comments (comment_id, post_id, lft, rgt, ...)
-- All descendants: WHERE lft > parent.lft AND rgt < parent.rgt
-- But inserting a comment requires updating lft/rgt of many rows

Approach 4: Closure Table (flexible but storage-heavy)

SQL

comment_tree (ancestor_id, descendant_id, depth)
-- One row per ancestor-descendant pair
-- Fast queries but O(depth) inserts

Recommendation: Materialized Path for Reddit-like system. Efficient reads, reasonable writes, and supports sorting.

Vote Service

Challenge: 500M votes/day = high write throughput on potentially hot rows
Architecture:
1. Vote request → check Redis if user already voted on this post/comment
2. If new vote or changed vote → publish to Kafka vote-events
3. Kafka consumer updates PostgreSQL (source of truth) and Redis (cached counts)
4. Redis stores: vote_count:{entity_type}:{entity_id} as atomic counter
5. Redis stores: user_votes:{user_id}:{post_id} → direction (up/down/none)
Why not direct DB write: Hot posts get thousands of concurrent votes → DB contention. Kafka buffers and Redis absorbs reads

Ranking Algorithms: Deep Dive

Hot Ranking (Reddit's actual algorithm)

PYTHON

import math
from datetime import datetime

def hot_score(ups, downs, created_at):
    score = ups - downs
    order = math.log10(max(abs(score), 1))
    sign = 1 if score > 0 else -1 if score < 0 else 0

    # Epoch: Dec 8, 2005 (Reddit's birthday)
    epoch = datetime(2005, 12, 8, 7, 46, 43)
    seconds = (created_at - epoch).total_seconds()

    return round(sign * order + seconds / 45000, 7)

Key insight: Time dominates. A post from 12.5 hours ago needs 10× the votes to rank equally with a new post
The log scale means going from 10 → 100 votes has same effect as 100 → 1000

Wilson Score (for "Best" comment ranking)

PYTHON

import math

def wilson_score(ups, downs, confidence=0.95):
    n = ups + downs
    if n == 0:
        return 0
    z = 1.96  # 95% confidence
    p = ups / n

    return (p + z*z/(2*n) - z * math.sqrt((p*(1-p) + z*z/(4*n)) / n)) / (1 + z*z/n)

Handles the "1 upvote, 0 downvote" vs "100 upvote, 1 downvote" problem
Lower bound of confidence interval → more data = more confident

Controversial

PYTHON

def controversial_score(ups, downs):
    if ups <= 0 or downs <= 0:
        return 0
    magnitude = ups + downs
    balance = min(ups, downs) / max(ups, downs)
    return magnitude * balance

Feed Service

Home feed = aggregation from subscribed subreddits
Pre-computation: Background job periodically computes top posts per subreddit → stores in Redis
On request: Merge top posts from subscribed subreddits, interleave by score
Popular/All: Global feed → single pre-computed ranked list in Redis

Search Service (Elasticsearch)

Indexed: Post title, body text, subreddit name, author username
Features: Full-text search with BM25 ranking, filter by subreddit/time range/post type, sort by relevance or recency
Sync: Post creates/updates → Kafka → Elasticsearch consumer indexes in near real-time

Moderation Pipeline

Consumes post and comment events from Kafka for automated moderation:
- Spam ML model: Detect spam posts (link farms, repetitive content, new account patterns)
- Toxicity classifier: Flag toxic/hateful content for review or auto-removal
- AutoMod rules: Per-subreddit regex rules (e.g., "remove posts with banned keywords", "require minimum karma to post")
- Report queue: User reports → prioritized queue for human moderators
- Auto-removal: Content matching high-confidence spam/toxicity → removed immediately, mod notified

Event Bus Design (Kafka)

Topic: reddit_discussion_forum-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (user_id / order_id — preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "reddit_discussion_forum-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: reddit_discussion_forum-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Design a Threaded Discussion Forum (Reddit): async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates

Create Post

HTTP

POST /api/v1/subreddits/{subreddit_name}/posts
{
  "title": "System Design Tips",
  "content": "Here are my top tips...",
  "type": "text",
  "flair": "Discussion"
}

Get Post with Comments

HTTP

GET /api/v1/posts/{post_id}?sort=best&depth=5&limit=200
Response:
{
  "post": { ... },
  "comments": [
    {
      "comment_id": "...",
      "content": "Great post!",
      "score": 150,
      "depth": 0,
      "replies": [
        {
          "comment_id": "...",
          "content": "Agreed!",
          "score": 42,
          "depth": 1,
          "replies": [...]
        }
      ]
    }
  ],
  "has_more_comments": true
}

Vote

HTTP

POST /api/v1/vote
{
  "entity_type": "post",
  "entity_id": "post-uuid",
  "direction": 1    // 1 = upvote, -1 = downvote, 0 = remove vote
}

Get Subreddit Feed

HTTP

GET /api/v1/subreddits/{name}/posts?sort=hot&cursor=...&limit=25

Add Comment

HTTP

POST /api/v1/posts/{post_id}/comments
{
  "parent_comment_id": "comment-uuid",  // null for top-level
  "content": "This is a great point!"
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
504 Gateway Timeout: index shard slow; narrow query or retry

PostgreSQL: Posts (Sharded by post_id)

SQL

CREATE TABLE posts (
    post_id         BIGINT PRIMARY KEY,
    subreddit_id    INT NOT NULL,
    user_id         BIGINT NOT NULL,
    title           VARCHAR(300),
    content         TEXT,
    url             TEXT,
    media_url       TEXT,
    post_type       ENUM('text', 'link', 'image', 'video'),
    score           INT DEFAULT 0,
    upvotes         INT DEFAULT 0,
    downvotes       INT DEFAULT 0,
    comment_count   INT DEFAULT 0,
    hot_score       FLOAT,
    is_locked       BOOLEAN DEFAULT FALSE,
    is_removed      BOOLEAN DEFAULT FALSE,
    created_at      TIMESTAMP,
    INDEX idx_subreddit_hot (subreddit_id, hot_score DESC),
    INDEX idx_subreddit_new (subreddit_id, created_at DESC),
    INDEX idx_subreddit_top (subreddit_id, score DESC)
);

PostgreSQL: Comments (Materialized Path)

SQL

CREATE TABLE comments (
    comment_id      BIGINT PRIMARY KEY,
    post_id         BIGINT NOT NULL,
    user_id         BIGINT NOT NULL,
    parent_id       BIGINT,                    -- null for top-level
    path            VARCHAR(1024),             -- "001.003.007.015"
    depth           SMALLINT,
    content         TEXT,
    score           INT DEFAULT 0,
    upvotes         INT DEFAULT 0,
    downvotes       INT DEFAULT 0,
    is_removed      BOOLEAN DEFAULT FALSE,
    created_at      TIMESTAMP,
    INDEX idx_post_path (post_id, path),
    INDEX idx_post_score (post_id, depth, score DESC)
);

Querying comment tree

SQL

-- Get all top-level comments sorted by "best"
SELECT * FROM comments
WHERE post_id = ? AND depth = 0
ORDER BY wilson_score DESC LIMIT 25;

-- Get all replies to a comment
SELECT * FROM comments
WHERE post_id = ? AND path LIKE '001.003.%'
ORDER BY path;

Redis: Vote Counts & User Votes

# Score cache
Key:    score:post:{post_id}
Value:  Hash { upvotes: 150, downvotes: 3, hot_score: 123456.789 }

# User's vote on a post (to show UI state and prevent double-voting)
Key:    uv:{user_id}:{post_id}
Value:  1 | -1 | (deleted if removed)
TTL:    30 days

Redis: Ranked Feeds

# Subreddit hot feed (pre-computed)
Key:    feed:sub:{subreddit_id}:hot
Type:   Sorted Set
Members: post_id
Scores:  hot_score

# Home feed for user
Key:    feed:home:{user_id}
Type:   Sorted Set
Members: post_id
Scores:  hot_score
TTL:    300 (5 min, rebuilt by background job)

Kafka Topics

Topic: post-events       (new post, edit, delete)
Topic: comment-events    (new comment, edit, delete)
Topic: vote-events       (upvote, downvote, remove)

General

Concern	Solution
Vote count accuracy	Kafka ensures no vote is lost; Redis is periodically reconciled with DB
Comment tree corruption	Materialized path is append-only; no destructive updates
Hot score staleness	Background job recomputes hot scores every 5 minutes; Kafka consumers update in near real-time for popular posts
DB hot spot (popular post)	Read replicas for post reads; Redis caches for vote counts
Ranking computation failure	Stale rankings served from Redis (still functional, just not perfectly fresh)

Specific: Handling Vote Manipulation

Challenge: Bots mass-upvoting/downvoting
Solutions:
- One vote per user per entity (enforced at DB level with unique constraint)
- Rate limiting votes per user (max 100 votes/minute)
- "Vote fuzzing": Display slightly randomized scores to confuse bots
- Shadow banning: Bot's votes are accepted but not counted
- IP-based detection: Multiple accounts voting from same IP

Vote Counting: Redis INCR vs DB Counters vs Log-Based Aggregation

Reddit has 2B+ votes/day across billions of posts and comments.

Option 1: Direct DB update (naive)
  UPDATE posts SET score = score + 1 WHERE post_id = ?
  → At 2B votes/day → write bottleneck on hot posts
  ✗ Single-row hot spot for viral posts (millions of concurrent UPDATEs)
  ✗ DB row locking under high concurrency

Option 2: Redis INCR (real-time approximate) ?
  INCR vote_score:{post_id}
  INCR vote_up:{post_id}   or   INCR vote_down:{post_id}

  ✓ O(1) atomic operation, handles millions/sec
  → Single-threaded Redis → no contention
  ✓ Reads are instant (no aggregation needed)
  ✗ Volatile (lost on Redis restart without persistence)
  ✗ Periodically needs to sync to DB

  Sync strategy: every 30 seconds, Kafka consumer writes batch to PostgreSQL
  INSERT INTO post_votes (post_id, score, ups, downs, updated_at)
  VALUES (...) ON CONFLICT DO UPDATE SET score=EXCLUDED.score

Option 3: Kafka event log + Flink aggregation
  Each vote → Kafka event → Flink aggregates → Redis + DB write
  ✓ Durable (Kafka replays if Flink fails)
  ✓ Can recompute historical vote counts from log
  ✗ Higher latency (5-30s until vote reflected)
  ✗ More infrastructure complexity

Reddit's actual approach: Redis INCR for real-time display + async sync to DB
  When post ages out of Redis (TTL: 24h for hot, 7d for new) → score frozen in DB

Comment Tree: Materialized Path vs Adjacency List vs Nested Sets

Reddit threads can have 10K+ comments, 50+ levels deep.

Option 1: Adjacency List (parent_id)
  Schema: comments(comment_id, post_id, parent_id, body, score)

  "Get entire thread": recursive CTE
  WITH RECURSIVE tree AS (
    SELECT * FROM comments WHERE post_id = ? AND parent_id IS NULL
    UNION ALL
    SELECT c.* FROM comments c JOIN tree t ON c.parent_id = t.comment_id
  )
  ✓ Simple schema, easy inserts
  ✗ Recursive CTE expensive for deep/wide trees
  ✗ N+1 queries in application code without recursive query

Option 2: Materialized Path ⭐ (Reddit-style)
  Schema: comments(comment_id, path TEXT)
  path = "1.4.23.87" (IDs of all ancestors, dot-separated)

  "All descendants of comment 4": WHERE path LIKE '1.4.%'
  "Sort by depth then score": ORDER BY path, score DESC
  "Insert reply to comment 87": path = parent.path + '.' + new_id

  ✓ Single query for entire subtree
  ✓ Natural sort order (hierarchical)
  ✓ Path encodes depth (split('.').length)
  ✓ Prefix index makes queries fast
  ✗ Path must be updated if post hierarchy changes (rare for comments)

Option 3: Nested Sets (left/right values)
  ✓ Super fast subtree queries
  ✗ Inserts require re-numbering all nodes → catastrophically slow for comments
  ✗ Not suitable for write-heavy trees

Reddit uses materialized path. Fetch top 200 comments sorted by path + score.
Client renders tree UI. "Load more replies" fetches children lazily via path prefix.

Hot Score Algorithm: Wilson Score vs Hacker News vs Reddit

Different ranking algorithms for different goals:

Simple score: upvotes - downvotes
  ✗ Biased toward old posts (more time = more votes)
  ✗ A post from 2010 with 100K upvotes beats today's post with 5K

Hacker News algorithm:
  score = points / (age + 2)^1.8
  Simple, but ignores downvotes

Reddit's Wilson Score (for comments) ⭐:
  Lower bound of Wilson confidence interval for a Bernoulli parameter
  wilson = (p + z²/(2n) - zv((p(1-p) + z²/(4n))/n)) / (1 + z²/n)
  p = upvotes/(upvotes + downvotes), n = total votes, z = 1.96

  ✓ Balances vote ratio with vote count confidence
  ✓ A 100% upvoted comment with 5 votes ranks below an 80% upvoted comment with 100 votes
  ✓ Handles controversial content (50/50 split = low confidence = lower rank)

Reddit's "Hot" algorithm (for posts):
  score = log10(ups - downs) + sign(ups - downs) × seconds_since_epoch(post) / 45000

  Key properties:
    - Logarithmic: 10 votes → same boost as 100 votes? No. log10(10)=1, log10(100)=2
      Each 10× more votes = 1 rank unit. Prevents mega-viral posts dominating forever
    - Recency: +1 rank unit per 12.5 hours → recent posts naturally rise
    - Post from today with 100 upvotes ranks above post from 2 weeks ago with 200 upvotes

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.