Design Quora (Q&A Platform)

Interview Prompt

Design Quora (Q&A Platform).

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Question deduplication, Answer ranking, Topic graph?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Question deduplication
Answer ranking
Topic graph
Expertise scoring
Knowledge base search
Capacity estimation with shown math

Out of scope (state explicitly)

Full ads auction and monetization stack
Content moderation at scale (#81)
Direct messaging (#07)

Assumptions

Index staleness of minutes is acceptable unless real-time is stated
Clarify query QPS vs index update rate early
Managed search/stream stack (Elasticsearch, Kafka) is fine to propose

Functional Requirements

Ask questions: Users post questions (with topics/tags)
Answer questions: Users write answers; multiple answers per question
Upvote/Downvote: Vote on answers (and questions) to surface quality
Follow topics/questions: Get notified when new answers are posted
Feed: Personalized home feed of questions and answers from followed topics/users
Search: Full-text search across questions and answers
Spaces (communities): Topic-based groups with moderation
Request answers: Request a specific user to answer a question
Editing: Collaborative editing with revision history
Content moderation: Detect spam, hate speech, low-quality content

Capacity Estimation

Metric	Calculation	Value
DAU	Given	100M
Questions asked / day	Given	500K
Answers posted / day	Given	2M
Votes / day	Given	50M
Feed views / sec	Derived from daily volume ÷ 86400 (+ peak factor)	~30K
Search queries / sec	Derived from daily volume ÷ 86400 (+ peak factor)	~10K
Avg answer size	Given	2 KB
Total content storage	Given	1 TB questions + 4 TB answers
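
The derived rows in the table can be reproduced with a few lines of arithmetic. This is a rough sketch: the DAU, answer, and vote figures come from the table, while `feed_views_per_user_per_day` and `peak_factor` are assumptions chosen only to show how a figure near 30K feed views/sec could arise. State your own values in the interview.

```python
SECONDS_PER_DAY = 86_400

# Figures taken from the table.
dau = 100_000_000
answers_per_day = 2_000_000
votes_per_day = 50_000_000
avg_answer_bytes = 2_000  # 2 KB, using decimal units

# Assumptions (not given in the table).
feed_views_per_user_per_day = 10
peak_factor = 2.5

avg_feed_qps = dau * feed_views_per_user_per_day / SECONDS_PER_DAY
peak_feed_qps = avg_feed_qps * peak_factor

avg_vote_writes_per_sec = votes_per_day / SECONDS_PER_DAY

answer_bytes_per_day = answers_per_day * avg_answer_bytes
days_to_reach_4tb = 4e12 / answer_bytes_per_day

print(f"avg feed QPS:   {avg_feed_qps:,.0f}")
print(f"peak feed QPS:  {peak_feed_qps:,.0f}")
print(f"avg votes/sec:  {avg_vote_writes_per_sec:,.0f}")
print(f"answer GB/day:  {answer_bytes_per_day / 1e9:.1f}")
print(f"days to 4 TB:   {days_to_reach_4tb:,.0f}")
```

Under these assumptions the 4 TB answer corpus corresponds to roughly 1,000 days of writes at 4 GB/day, and votes average under 1K writes/sec, so the read path dominates the design.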


Answer Ranking — Wilson Score

Simple upvote - downvote:
  Answer A: 100 up, 2 down → score = 98
  Answer B: 5 up, 0 down → score = 5
  A ranks higher. Seems correct.

But what about:
  Answer C: 1 up, 0 down → score = 1
  Answer D: 500 up, 400 down → score = 100
  D ranks higher, but it's controversial (44% downvote rate!)
  C has 100% approval but only 1 vote → low confidence

Wilson Score Interval ⭐ (Reddit's approach):
  Considers BOTH the ratio of upvotes AND the sample size.
  Low confidence (few votes) → lower bound of score is low → ranks lower
  High confidence (many votes, high upvote ratio) → ranks higher
  
  Formula (lower bound of 95% confidence interval):
  score = (p + z²/(2n) - z·√(p(1-p)/n + z²/(4n²))) / (1 + z²/n)
  where p = upvotes/total, n = total votes, z = 1.96 (95% CI)
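
A minimal sketch of the formula above, applied to the four example answers. The function name `wilson_lower_bound` and the choice to return 0.0 for an answer with no votes are assumptions for illustration.

```python
import math


def wilson_lower_bound(upvotes: int, downvotes: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for the upvote ratio."""
    n = upvotes + downvotes
    if n == 0:
        return 0.0  # assumption: unvoted answers sort last
    p = upvotes / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)


# (upvotes, downvotes) for the example answers A to D.
answers = {"A": (100, 2), "B": (5, 0), "C": (1, 0), "D": (500, 400)}

ranked = sorted(
    answers,
    key=lambda name: wilson_lower_bound(*answers[name]),
    reverse=True,
)
print(ranked)  # ['A', 'B', 'D', 'C']
```

Compared with net score, the main change is that B (5 up, 0 down) now ranks above the controversial D. C still ranks last because a single vote gives very little evidence, which matches the low-confidence point above.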

Feed Generation — Hybrid Push/Pull

Sources for a user's feed:
  1. New answers to questions they follow
  2. New questions in topics they follow
  3. Activity from users they follow
  4. Trending/recommended content
  
For normal users (pull):
  On feed load → query for recent activity from followed entities
  Merge, rank, return top 50
  Cache in Redis: feed:{user_id} (TTL: 5 min)

For prolific authors (skip fan-out):
  When a popular author writes an answer:
  → Don't fan out to 10M followers (too expensive)
  → Instead: add to trending/recommended pool
  → Followers discover it via periodic feed refresh
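
The pull path can be sketched as a merge over per-source activity lists with a short-lived cache in front. This is an illustration under stated assumptions: `FeedService` is a hypothetical name, an in-process dict stands in for the Redis key `feed:{user_id}`, each source is assumed to be already sorted newest first, and "rank" is reduced to recency.

```python
import heapq
import itertools
import time
from typing import Callable, Dict, Iterable, List, Tuple

Item = Tuple[float, str]  # (timestamp, content_id)


class FeedService:
    def __init__(self, ttl_seconds: float = 300,
                 clock: Callable[[], float] = time.time) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        # Stand-in for Redis: key -> (expiry time, cached feed).
        self._cache: Dict[str, Tuple[float, List[Item]]] = {}

    def get_feed(self, user_id: str,
                 sources: Iterable[Iterable[Item]],
                 limit: int = 50) -> List[Item]:
        key = f"feed:{user_id}"
        now = self.clock()
        cached = self._cache.get(key)
        if cached is not None and cached[0] > now:
            return cached[1]  # cache hit: skip the merge entirely

        # Each source must be sorted newest first for a reverse merge.
        merged = heapq.merge(*sources, key=lambda item: item[0], reverse=True)
        feed = list(itertools.islice(merged, limit))
        self._cache[key] = (now + self.ttl, feed)
        return feed
```

Content from prolific authors would enter as one more source (the trending/recommended pool) rather than being written to each follower's feed.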

Question Deduplication

User types: "What is the best programming language for beginners?"
Similar existing questions:
  "What programming language should beginners learn?"
  "Best first programming language to learn?"
  
Detection pipeline:
  1. On question submit: compute sentence embedding (BERT/sentence-transformers)
  2. ANN search against index of existing question embeddings (Faiss/Milvus)
  3. If top match has similarity > 0.85 → suggest: "Similar question already exists"
  4. User can: merge into existing question OR confirm theirs is distinct
  
If merged: new question redirects to existing → concentrates answers
  → Better than 50 identical questions with 1 answer each

Storage: question_embeddings collection in Milvus (768-dim vectors, 500M entries)
  Search time: ~10ms for top-5 similar questions
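
A toy sketch of steps 2 and 3 of the detection pipeline. A brute-force cosine scan stands in for the ANN index (Faiss/Milvus), and the 3-dimensional vectors are made-up placeholders for real 768-dim sentence embeddings. `find_duplicate` and the question IDs are hypothetical names.

```python
import math
from typing import Dict, List, Optional, Tuple

SIMILARITY_THRESHOLD = 0.85  # threshold from the pipeline above


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_duplicate(new_vec: List[float],
                   index: Dict[str, List[float]],
                   threshold: float = SIMILARITY_THRESHOLD
                   ) -> Optional[Tuple[str, float]]:
    """Return (question_id, similarity) for the best match above threshold."""
    best: Optional[Tuple[str, float]] = None
    for question_id, vec in index.items():
        score = cosine_similarity(new_vec, vec)
        if best is None or score > best[1]:
            best = (question_id, score)
    if best is not None and best[1] > threshold:
        return best
    return None


# Placeholder embeddings; a real system would get these from the encoder.
index = {"q1": [0.9, 0.1, 0.0], "q2": [0.0, 1.0, 0.0]}
match = find_duplicate([1.0, 0.1, 0.0], index)
no_match = find_duplicate([0.0, 0.0, 1.0], index)

# Raw vector storage, assuming float32 (4 bytes per dimension).
raw_vector_bytes = 500_000_000 * 768 * 4
```

If the vectors are stored as float32, 500M entries at 768 dimensions come to about 1.5 TB before any index overhead or compression, which is worth mentioning when justifying a sharded or quantized ANN index.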

Concern	Solution
Vote manipulation	Rate limit + only count votes from accounts > 7 days old
Duplicate votes	PRIMARY KEY (user_id, target_type, target_id) prevents duplicates
Search index lag	Elasticsearch updated via CDC (Debezium); lag < 30 seconds
SEO staleness	CDN cache TTL = 5 min; instant purge on content update
Answer quality	ML spam classifier + community flagging + moderator review queue
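
The duplicate-vote row can be illustrated with a composite primary key. This sketch uses an in-memory SQLite database as a stand-in for the production store; the `votes` column names beyond the key and the `cast_vote` helper are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE votes (
        user_id     INTEGER NOT NULL,
        target_type TEXT    NOT NULL,   -- 'answer' or 'question'
        target_id   INTEGER NOT NULL,
        value       INTEGER NOT NULL CHECK (value IN (-1, 1)),
        PRIMARY KEY (user_id, target_type, target_id)
    )
    """
)


def cast_vote(user_id: int, target_type: str, target_id: int, value: int) -> bool:
    """Insert a vote; return False if this user already voted on the target."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO votes VALUES (?, ?, ?, ?)",
                (user_id, target_type, target_id, value),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = cast_vote(1, "answer", 42, 1)       # accepted
repeat = cast_vote(1, "answer", 42, 1)      # rejected by the primary key
other = cast_vote(1, "question", 42, 1)     # different target_type, accepted
```

Letting the database enforce uniqueness means a retried or double-clicked request cannot be counted twice, regardless of which application server handles it.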

Vote Count vs Wilson Score Drift

Multiple users vote simultaneously:
  T=0: upvotes=100, downvotes=4
  Thread A: reads (100,4) → votes up → writes upvotes=101
  Thread B: reads (100,4) → votes up → writes upvotes=101 (LOST UPDATE!)

Solution: Atomic increment
  UPDATE answers SET upvotes = upvotes + 1 WHERE answer_id = ?;
  Then: async job recomputes wilson_score from (upvotes, downvotes)
  Wilson score recomputation batched every 30 seconds (not per vote)
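
The lost update and the atomic-increment fix can be reproduced in a few lines. This is a sequential simulation in SQLite, not a real concurrency test: the two stale reads play the role of Thread A and Thread B, and the schema is an assumption kept to the columns mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE answers (answer_id INTEGER PRIMARY KEY, "
    "upvotes INTEGER NOT NULL, downvotes INTEGER NOT NULL)"
)
conn.execute("INSERT INTO answers VALUES (1, 100, 4)")


def read_upvotes(answer_id: int) -> int:
    row = conn.execute(
        "SELECT upvotes FROM answers WHERE answer_id = ?", (answer_id,)
    ).fetchone()
    return row[0]


# Read-modify-write: both clients read 100, then each writes 101.
seen_by_a = read_upvotes(1)
seen_by_b = read_upvotes(1)
conn.execute("UPDATE answers SET upvotes = ? WHERE answer_id = 1", (seen_by_a + 1,))
conn.execute("UPDATE answers SET upvotes = ? WHERE answer_id = 1", (seen_by_b + 1,))
after_racy = read_upvotes(1)  # 101: one vote was lost

# Reset, then let the database do the increment.
conn.execute("UPDATE answers SET upvotes = 100 WHERE answer_id = 1")
for _ in range(2):
    conn.execute(
        "UPDATE answers SET upvotes = upvotes + 1 WHERE answer_id = ?", (1,)
    )
after_atomic = read_upvotes(1)  # 102: both votes counted

print(after_racy, after_atomic)
```

Because the increment happens inside a single statement, the stored counts stay correct even though the cached Wilson score may lag by up to one batch interval.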

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
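
To make the availability target concrete, it helps to convert it into a downtime budget. A small sketch, assuming a 30-day window; `downtime_budget_minutes` is a hypothetical helper name.

```python
def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Minutes of allowed unavailability in a window of `days` days."""
    return (1 - availability) * days * 24 * 60


budget_9995 = downtime_budget_minutes(0.9995)
budget_999 = downtime_budget_minutes(0.999)

print(f"99.95% over 30 days: {budget_9995:.1f} min")
print(f"99.9%  over 30 days: {budget_999:.1f} min")
```

At 99.95% the budget is about 21.6 minutes per 30 days, which is why the bad-deploy scenario needs an automated rollback measured in minutes rather than a manual process.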

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
