Interview Prompt
Design Design Quora (Q&A Platform).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Question deduplication, Answer ranking, Topic graph? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Question deduplication
- Answer ranking
- Topic graph
- Expertise scoring
- Knowledge base search
- Capacity estimation with shown math
Out of scope (state explicitly)
- Full ads auction and monetization stack
- Content moderation at scale (#81)
- Direct messaging (#07)
Assumptions
- Index staleness of minutes is acceptable unless real-time is stated
- Clarify query QPS vs index update rate early
- Managed search/stream stack (Elasticsearch, Kafka) is fine to propose
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Ask questions: Users post questions (with topics/tags)
- Answer questions: Users write answers; multiple answers per question
- Upvote/Downvote: Vote on answers (and questions) to surface quality
- Follow topics/questions: Get notified when new answers are posted
- Feed: Personalized home feed of questions and answers from followed topics/users
- Search: Full-text search across questions and answers
- Spaces (communities): Topic-based groups with moderation
- Request answers: Request a specific user to answer a question
- Editing: Collaborative editing with revision history
- Content moderation: Detect spam, hate speech, low-quality content
- Low Latency: Feed and search in < 500 ms
- SEO Friendly: Questions/answers are the core value: must be crawlable and indexable
- Scalability: 300M+ MAU, 500M+ questions
- Quality Ranking: Best answers surfaced at top (not just newest)
- Availability: 99.99%
- Write Volume: Moderate (10K answers/min) but read-heavy (1M reads/min)
| Metric | Calculation | Value |
|---|---|---|
| DAU | Given | 100M |
| Questions asked / day | Given | 500K |
| Answers posted / day | Given | 2M |
| Votes / day | Given | 50M |
| Feed views / sec | Derived from daily volume ÷ 86400 (+ peak factor) | ~30K |
| Search queries / sec | Derived from daily volume ÷ 86400 (+ peak factor) | ~10K |
| Avg answer size | Given | 2 KB |
| Total content storage | Given | 1 TB questions + 4 TB answers |
Answer Ranking — Wilson Score
Simple upvote - downvote: Answer A: 100 up, 2 down → score = 98 Answer B: 5 up, 0 down → score = 5 A ranks higher. Seems correct. But what about: Answer C: 1 up, 0 down → score = 1 Answer D: 500 up, 400 down → score = 100 D ranks higher, but it's controversial (44% downvote rate!) C has 100% approval but only 1 vote → low confidence Wilson Score Interval ⭐ (Reddit's approach): Considers BOTH the ratio of upvotes AND the sample size. Low confidence (few votes) → lower bound of score is low → ranks lower High confidence (many votes, high upvote ratio) → ranks higher Formula (lower bound of 95% confidence interval): score = (p + z²/2n - z√(p(1-p)/n + z²/4n²)) / (1 + z²/n) where p = upvotes/total, n = total votes, z = 1.96 (95% CI)
Feed Generation — Hybrid Push/Pull
Sources for a user's feed:
1. New answers to questions they follow
2. New questions in topics they follow
3. Activity from users they follow
4. Trending/recommended content
For normal users (pull):
On feed load → query for recent activity from followed entities
Merge, rank, return top 50
Cache in Redis: feed:{user_id} (TTL: 5 min)
For prolific authors (fan-out):
When a popular author writes an answer:
→ Don't fan out to 10M followers (too expensive)
→ Instead: add to trending/recommended pool
→ Followers discover it via periodic feed refreshQuestion Deduplication
User types: "What is the best programming language for beginners?" Similar existing questions: "What programming language should beginners learn?" "Best first programming language to learn?" Detection pipeline: 1. On question submit: compute sentence embedding (BERT/sentence-transformers) 2. ANN search against index of existing question embeddings (Faiss/Milvus) 3. If top match has similarity > 0.85 → suggest: "Similar question already exists" 4. User can: merge into existing question OR confirm theirs is distinct If merged: new question redirects to existing → concentrates answers → Better than 50 identical questions with 1 answer each Storage: question_embeddings table in Milvus (768-dim vectors, 500M entries) Search time: ~10ms for top-5 similar questions
Ask Question
POST /api/v1/questions
{ "title": "What is the best database for analytics?",
"details": "I'm comparing ClickHouse, BigQuery, and Redshift...",
"topics": ["databases", "analytics", "data-engineering"] }
→ 201 Created { "question_id": "q-uuid", "url": "/q/what-is-the-best-database-for-analytics" }Post Answer
POST /api/v1/questions/{question_id}/answers
{ "body": "I'd recommend ClickHouse for self-hosted and BigQuery for managed..." }
→ 201 Created { "answer_id": "a-uuid" }Vote
POST /api/v1/votes
{ "target_type": "answer", "target_id": "a-uuid", "vote": "up" }
→ 200 OK { "upvotes": 153, "downvotes": 4 }Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff 504 Gateway Timeout: index shard slow; narrow query or retry
MySQL (Vitess): Core Data
CREATE TABLE questions (
question_id BIGINT PRIMARY KEY AUTO_INCREMENT,
author_id BIGINT NOT NULL,
title VARCHAR(500) NOT NULL,
body TEXT,
slug VARCHAR(500) UNIQUE,
view_count INT DEFAULT 0,
answer_count INT DEFAULT 0,
follow_count INT DEFAULT 0,
status ENUM('open','closed','merged') DEFAULT 'open',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP,
FULLTEXT INDEX idx_search (title, body),
INDEX idx_author (author_id)
);
CREATE TABLE answers (
answer_id BIGINT PRIMARY KEY AUTO_INCREMENT,
question_id BIGINT NOT NULL,
author_id BIGINT NOT NULL,
body TEXT NOT NULL,
upvotes INT DEFAULT 0,
downvotes INT DEFAULT 0,
wilson_score FLOAT DEFAULT 0,
is_accepted BOOLEAN DEFAULT FALSE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_question_score (question_id, wilson_score DESC),
INDEX idx_author (author_id)
);
CREATE TABLE votes (
user_id BIGINT,
target_type ENUM('question','answer'),
target_id BIGINT,
vote_type ENUM('up','down'),
created_at TIMESTAMP,
PRIMARY KEY (user_id, target_type, target_id)
);Redis
vote_count:{type}:{id} → Hash {up: 153, down: 4}
answer_order:{question_id} → Sorted Set (score = wilson_score, member = answer_id)
feed:{user_id} → List of content_ids (TTL: 5 min)| Concern | Solution |
|---|---|
| Vote manipulation | Rate limit + only count votes from accounts > 7 days old |
| Duplicate votes | PRIMARY KEY (user_id, target_type, target_id) prevents duplicates |
| Search index lag | Elasticsearch updated via CDC (Debezium); lag < 30 seconds |
| SEO staleness | CDN cache TTL = 5 min; instant purge on content update |
| Answer quality | ML spam classifier + community flagging + moderator review queue |
Vote Count vs Wilson Score Drift
Multiple users vote simultaneously: T=0: upvotes=100, downvotes=4 Thread A: reads (100,4) → votes up → writes upvotes=101 Thread B: reads (100,4) → votes up → writes upvotes=101 (LOST UPDATE!) Solution: Atomic increment UPDATE answers SET upvotes = upvotes + 1 WHERE answer_id = ?; Then: async job recomputes wilson_score from (upvotes, downvotes) Wilson score recomputation batched every 30 seconds (not per vote)
Interview Walkthrough
- Anchor on Q&A core entities: questions, answers, topics, votes — clarify read-heavy (browse/search) vs write-light (post/vote) ratio.
- Rank answers with Wilson score, not raw upvote count — a 1-upvote/0-downvote answer must not beat a 500-upvote/50-downvote expert answer.
- Mandate server-side rendering (Next.js) for SEO — Google crawlers need full HTML; CSR SPAs lose 50%+ organic traffic.
- Shard MySQL with Vitess by question_id; cache hot question pages at CDN edge with 5-minute TTL.
- Build feeds with hybrid push/pull: pull recent activity from followed topics on load; fan-out for prolific authors with <10K followers.
- Use Elasticsearch for full-text search across questions and answers with topic/tag filters.
- Handle votes with atomic
UPDATE upvotes = upvotes + 1and async Wilson score recomputation every 30 seconds. - Common pitfall: ranking by simple upvote minus downvote — controversial high-traffic answers dominate over high-confidence expert responses.
SEO: Server-Side Rendering vs Client-Side
CSR (Single Page App): Google crawler loads page → sees empty div → JavaScript runs → content appears ✗ Google may not execute JS properly → pages not indexed → zero SEO traffic SSR ⭐ (Quora's approach): Server renders full HTML with question/answer content Crawler sees complete content → indexes properly → SEO traffic Implementation: Next.js / Nuxt.js → render on server → send HTML Cache: Full HTML cached at CDN edge (CloudFront) → TTL 5 min Dynamic content (vote counts): client-side hydration after page load Result: Quora gets 50%+ of traffic from Google → SSR is non-negotiable
MySQL vs MongoDB for Q&A Data
MySQL ⭐ (Quora's actual choice): ✓ Relational: question → answers → votes (natural joins) ✓ Strong consistency for vote counts ✓ FULLTEXT index for basic search ✓ Vitess for horizontal sharding ✗ Schema changes require migrations MongoDB: ✓ Flexible schema (answers can have varying metadata) ✓ Embed answers inside question document (one read) ✗ Embedded answers grow unbounded (16 MB document limit!) ✗ No joins → vote aggregation is application-level
Search Ranking
User searches: "best database for analytics"
Pipeline:
1. Elasticsearch query: match title + body with BM25 relevance
2. Re-rank results using engagement signals:
search_score = w1 × text_relevance (BM25)
+ w2 × answer_quality (avg Wilson score of answers)
+ w3 × view_count (popularity)
+ w4 × recency (newer questions slightly boosted)
+ w5 × follow_count (more followed = more relevant)
3. Filter: exclude closed/merged/spam questions
4. Return top 20 results with best-answer snippetStaff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core quora flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.