Design a Timeline and Tweet Service (Twitter) – System Design Walkthrough

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 25	The canonical fan-out problem. Draw the fan-out-on-write vs fan-out-on-read decision tree with a concrete follower threshold (e.g., 10K). Timeline merge and ranking basics are table stakes.
Arch 50	Add list feeds, retweet/quote mechanics, and home vs latest vs algorithmic ranking. Discuss cache hierarchy for celebrity timelines.
Arch 75	Staff: hybrid fan-out migration strategy, retweet dedup at scale, and what happens when a user with 100M followers tweets.

Interview Prompt

Design Twitter's home timeline. Users follow other users and see a reverse-chronological feed of their tweets. Support retweets, list-based feeds, and handle users with millions of followers.

Clarifying Questions (ask before designing)

Question	Why it matters
Home timeline only, or also user profile timeline and search?	Home timeline is fan-out heavy; profile timeline is fan-in (one user's tweets). Different optimization paths.
Strict reverse-chronological or algorithmic ranking?	Algorithmic adds ML scoring stage — changes read path from simple merge to rank + filter pipeline.
What's the fan-out threshold for celebrities?	The core architectural fork — push vs pull for high-follower accounts.
Retweets: native retweet, quote tweet, or both?	Retweet creates a new tweet entity pointing to original — affects dedup, timeline merge, and storage.

Scope

In scope

Home timeline generation (follow-based feed)
Fan-out on write vs read hybrid strategy
Tweet creation and fan-out pipeline
Retweet and quote-tweet mechanics
List-based custom feeds

Out of scope (state explicitly)

Full search/indexing system (separate problem)
Ad insertion and monetization
Direct messages
Trending topics computation

Assumptions

400M DAU, 500M tweets/day (~6K tweets/sec avg, ~30K peak)
Average 200 followers, average 500 following per user
Fan-out threshold: 10K followers (hybrid boundary)
Timeline page size: 20 tweets, cache last 800 per user

Functional Requirements

Post a tweet: Text (280 chars), images, videos, links
User timeline: View a user's own tweets (profile page)
Home timeline (feed): Aggregated tweets from followed users, ranked
Follow / Unfollow users
Like, Retweet, Reply, Quote Tweet
Search tweets by keyword, hashtag, user
Trending topics: Real-time trending hashtags and topics
Notifications: Mentions, likes, retweets, new followers

Back-of-the-Envelope Estimation

Metric	Calculation	Value
DAU	Given (product assumption)	400M
Tweets / day	Given (~1.25 tweets/user)	500M
Tweets / sec	500M ÷ 86400	~6,000 (peak ~30K)
Home timeline reads / day	400M DAU × 25 views	10B (25 views per user/day)
Reads / sec	10B ÷ 86400	~115K (peak ~500K)
Avg tweet size (metadata)	Given (typical workload assumption)	1 KB
Storage / day (tweets only)	500M × 1 KB	500 GB
Media storage / day	50M media tweets × 2 MB	100 TB
Fan-out: avg followers = 200	500M × 200	100B fan-out writes/day
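
The arithmetic in the table can be reproduced in a few lines of Python. This is only a sanity check on the table's stated inputs (400M DAU, 25 views per user, 500M tweets/day, 200 average followers); the variable names are illustrative and sizes use decimal units.

```python
SECONDS_PER_DAY = 86_400

# Inputs taken from the estimation table.
dau = 400_000_000
views_per_user_per_day = 25
tweets_per_day = 500_000_000
avg_followers = 200
tweet_bytes = 1_000               # 1 KB of metadata per tweet
media_tweets_per_day = 50_000_000
media_bytes = 2_000_000           # 2 MB per media tweet

tweets_per_sec = tweets_per_day / SECONDS_PER_DAY
reads_per_day = dau * views_per_user_per_day
reads_per_sec = reads_per_day / SECONDS_PER_DAY
tweet_storage_per_day = tweets_per_day * tweet_bytes
media_storage_per_day = media_tweets_per_day * media_bytes
fanout_writes_per_day = tweets_per_day * avg_followers
fanout_writes_per_sec = fanout_writes_per_day / SECONDS_PER_DAY

print(f"tweets/sec:        {tweets_per_sec:,.0f}")
print(f"timeline reads/s:  {reads_per_sec:,.0f}")
print(f"tweet storage/day: {tweet_storage_per_day / 1e9:,.0f} GB")
print(f"media storage/day: {media_storage_per_day / 1e12:,.0f} TB")
print(f"fan-out writes/s:  {fanout_writes_per_sec:,.0f}")
```

The last figure is the one to say out loud in an interview: the average fan-out write rate is roughly 200 times the tweet rate, so fan-out dominates the write load.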

Hybrid Fan-Out Strategy (Same as News Feed: Critical for Twitter)

Twitter's core architectural challenge:

User Type	Followers	Strategy	Reason
Normal (99.9%)	< 10K	Fan-out on Write	Pre-compute timeline at write time; reads are instant
Celebrity (0.1%)	≥ 10K	Fan-out on Read	Writing to 50M timelines is too slow

Implementation

User A tweets → Tweet Service publishes to Kafka
Fan-out Service consumes event, checks follower count
If < 10K followers → fetch follower list → write tweet_id to each follower's Redis timeline
If ≥ 10K followers → skip fan-out; store tweet in author's timeline only
When User B reads home timeline:
- Fetch pre-computed timeline from Redis (fan-out-on-write results)
- Separately fetch latest tweets from any celebrities User B follows (fan-out-on-read)
- Merge and rank
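
The write and read paths above can be sketched in memory. This is a minimal illustration, not a production design: plain dictionaries stand in for Redis and the Social Graph, and the class and method names (HybridFanout, on_tweet, read_home) are hypothetical.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 10_000  # followers; at or above this, skip push fan-out


class HybridFanout:
    def __init__(self, followers):
        # followers: author_id -> set of follower ids (stand-in for the Social Graph)
        self.followers = followers
        self.home = defaultdict(list)      # follower_id -> [(ts, tweet_id)], stand-in for Redis
        self.authored = defaultdict(list)  # author_id -> [(ts, tweet_id)]

    def is_celebrity(self, author_id):
        return len(self.followers.get(author_id, ())) >= CELEBRITY_THRESHOLD

    def on_tweet(self, author_id, tweet_id, ts):
        """Handle one tweet event; return the number of timeline writes."""
        self.authored[author_id].append((ts, tweet_id))
        if self.is_celebrity(author_id):
            return 0  # fan-out on read: followers pull this tweet later
        targets = self.followers.get(author_id, ())
        for follower_id in targets:
            self.home[follower_id].append((ts, tweet_id))
        return len(targets)

    def read_home(self, user_id, following, limit=20):
        """Merge the pre-computed timeline with celebrity tweets, newest first."""
        candidates = list(self.home.get(user_id, ()))
        for author_id in following:
            if self.is_celebrity(author_id):
                candidates.extend(self.authored.get(author_id, ()))
        candidates.sort(reverse=True)
        return [tweet_id for _, tweet_id in candidates[:limit]]
```

A tweet from an account with one follower costs one write; a tweet from an account with 10,001 followers costs none and is picked up by read_home instead.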

Tweet Service

Validates tweet (280 char limit, content moderation)
Stores tweet in MySQL/Cassandra
Uploads media to S3 + CDN
Publishes to Kafka tweet-events topic

Timeline Service

Serves home timeline and user timeline requests
Home timeline: Reads from Redis Sorted Set (pre-computed) + celebrity merge
User timeline: Direct Cassandra query WHERE user_id = X ORDER BY tweet_id DESC (tweet_id is the clustering key; Snowflake IDs sort by creation time)
Enriches tweet IDs with full tweet data (from Tweet Cache or DB)

Fan-Out Service

Kafka consumer group processing tweet-events
For each tweet: queries Social Graph for follower list → writes tweet_id to each follower's Redis timeline
Performance: With Kafka parallelism, can process 100B fan-out writes/day (~1.2M Redis writes/sec on average)
Handles: New tweet, delete tweet (remove from timelines), retweet

Search Service + Elasticsearch

Indexing: Consumes from Kafka → indexes tweets in Elasticsearch in near real-time
Schema: {tweet_id, user_id, content, hashtags, mentions, created_at, engagement_score}
Query: Full-text search + filters (hashtag, user, date range)
Relevance: BM25 text relevance + recency boost + engagement boost

Trending Service + Apache Flink

Purpose: Identify trending hashtags and topics in real-time
How Flink processes trends:
1. Consume tweet events from Kafka
2. Extract hashtags and keywords
3. Sliding window aggregation (count per 5-minute window, sliding by 1 minute)
4. Calculate velocity: trend_score = (current_window_count - previous_window_count) / previous_window_count
5. Top-K: Maintain a min-heap of top 50 trending topics
6. Output to Redis sorted set for serving
Heavy Hitters / Count-Min Sketch: For approximate counting of very high-cardinality hashtags with bounded memory
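
Step 5 (Top-K with a min-heap) can be illustrated outside Flink with Python's heapq. This is a sketch of the technique only; the function name top_k and the input shape (a dict of topic to score) are assumptions for illustration.

```python
import heapq


def top_k(scores, k):
    """Return the k highest-scoring (topic, score) pairs, best first."""
    heap = []  # min-heap of (score, topic); heap[0] is the weakest entry kept
    for topic, score in scores.items():
        if len(heap) < k:
            heapq.heappush(heap, (score, topic))
        elif (score, topic) > heap[0]:
            # New entry beats the weakest kept entry: swap it in.
            heapq.heapreplace(heap, (score, topic))
    return [(topic, score) for score, topic in sorted(heap, reverse=True)]


trending = top_k({"#a": 1.0, "#b": 5.0, "#c": 3.0, "#d": 0.5}, k=2)
```

The heap never holds more than k entries, so memory stays bounded at k no matter how many distinct hashtags flow through.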

Social Graph Service

Stores follow/following relationships
Redis: For fast lookups during fan-out (followers:{user_id} → SET)
MySQL: Durable storage for relationships

Notification Service

Consumes events from Kafka (tweet-events, like-events, retweet-events, follow-events)
Triggers:
- @mention in a tweet → push notification to mentioned user
- Like/retweet → notification to tweet author (batched: "Alice and 15 others liked your tweet")
- New follower → "X started following you"
- Reply → notification to parent tweet author
Channels: APNs (iOS), FCM (Android), in-app notification tab, email digest (daily/weekly)
Batching: Collapse multiple similar events to avoid notification spam (e.g., 50 likes → 1 push)
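
The batching rule can be sketched as a grouping step over a window of like events. This is illustrative only: the event tuple shape and the function name batch_likes are assumptions, and a real service would also bound the window by time.

```python
from collections import defaultdict


def batch_likes(events):
    """Collapse like events into one push per (author, tweet).

    events: iterable of (tweet_author, tweet_id, liker_name) in arrival order.
    """
    grouped = defaultdict(list)
    for author, tweet_id, liker in events:
        grouped[(author, tweet_id)].append(liker)

    pushes = []
    for (author, tweet_id), likers in grouped.items():
        first, others = likers[0], len(likers) - 1
        if others == 0:
            text = f"{first} liked your tweet"
        elif others == 1:
            text = f"{first} and 1 other liked your tweet"
        else:
            text = f"{first} and {others} others liked your tweet"
        pushes.append((author, tweet_id, text))
    return pushes


events = [("bob", 1, "Alice")] + [("bob", 1, f"user{i}") for i in range(15)]
events.append(("carol", 2, "Dan"))
pushes = batch_likes(events)
```

Seventeen events become two pushes, one per tweet.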

ML Ranking Service

Timeline is NOT purely reverse-chronological: it's ranked by relevance
Features: tweet recency, engagement velocity (likes/retweets per minute), user affinity (how often you interact with author), content type (image/video/text), author verification status
Model: Lightweight gradient boosted tree for online scoring (< 10ms per candidate set)
Fallback: If ranking service is unavailable → return reverse-chronological feed

Event Bus Design (Kafka)

Topic: twitter_timeline-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: user_id of the tweet author (preserves per-author ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "twitter_timeline-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: twitter_timeline-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
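
The consumer-side guarantees (at-least-once delivery, dedup by event_id, DLQ after 3 retries) can be sketched without a Kafka client. The class below is a hypothetical in-memory stand-in: a set plays the dedup store and a list plays the DLQ topic.

```python
MAX_RETRIES = 3  # retries after the first attempt, then dead-letter


class IdempotentConsumer:
    def __init__(self, handler):
        self.handler = handler
        self.seen = set()  # processed event_ids; a real system would use a store with TTL
        self.dlq = []      # stand-in for the DLQ topic

    def consume(self, event):
        event_id = event["event_id"]
        if event_id in self.seen:
            return "duplicate"  # redelivery under at-least-once: safe to drop
        for _attempt in range(1 + MAX_RETRIES):
            try:
                self.handler(event)
            except Exception:
                continue  # retry; real code would back off between attempts
            self.seen.add(event_id)
            return "processed"
        self.dlq.append(event)  # poison message: park it and keep the partition moving
        return "dead-lettered"


applied = []


def handler(event):
    if event.get("poison"):
        raise ValueError("cannot process")
    applied.append(event["event_id"])


consumer = IdempotentConsumer(handler)
results = [
    consumer.consume({"event_id": "e1"}),
    consumer.consume({"event_id": "e1"}),  # redelivered
    consumer.consume({"event_id": "e2", "poison": True}),
]
```

Marking the event as seen only after the handler succeeds is what keeps a crash mid-handler from silently dropping the event.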

Post Tweet

HTTP

POST /api/v1/tweets
{
  "content": "Hello Twitter! #systemdesign",
  "media_ids": ["img-uuid"],
  "reply_to": null
}
Response: 201 Created
{
  "tweet_id": "snowflake-id",
  "created_at": "2026-03-13T10:00:00Z"
}

Get Home Timeline

HTTP

GET /api/v1/timeline/home?cursor={last_tweet_id}&limit=20
Response: 200 OK
{
  "tweets": [...],
  "next_cursor": "1234567890"
}

Get User Timeline

HTTP

GET /api/v1/users/{user_id}/tweets?cursor={last_tweet_id}&limit=20

Search

HTTP

GET /api/v1/search?q=%23systemdesign&type=recent&cursor=...

Like / Retweet

HTTP

POST /api/v1/tweets/{tweet_id}/like
POST /api/v1/tweets/{tweet_id}/retweet

Get Trending

HTTP

GET /api/v1/trends?location=US
Response: 200 OK
{
  "trends": [
    {"hashtag": "#SystemDesign", "tweet_count": 125000, "rank": 1},
    ...
  ]
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff

MySQL (Sharded by tweet_id): Tweets

SQL

CREATE TABLE tweets (
    tweet_id    BIGINT PRIMARY KEY,  -- Snowflake ID
    user_id     BIGINT NOT NULL,
    content     VARCHAR(280),
    media_urls  JSON,
    reply_to    BIGINT,              -- null if not a reply
    retweet_of  BIGINT,              -- null if not a retweet
    like_count  INT DEFAULT 0,
    retweet_count INT DEFAULT 0,
    reply_count INT DEFAULT 0,
    created_at  TIMESTAMP,
    is_deleted  BOOLEAN DEFAULT FALSE,
    INDEX idx_user (user_id, created_at DESC)
);

Cassandra: User Timeline

CQL

CREATE TABLE user_timeline (
    user_id     BIGINT,
    tweet_id    BIGINT,
    created_at  TIMESTAMP,
    PRIMARY KEY (user_id, tweet_id)
) WITH CLUSTERING ORDER BY (tweet_id DESC);

Redis: Home Timeline Cache

Key:    timeline:home:{user_id}
Type:   Sorted Set
Members: tweet_id
Scores:  timestamp
Max:     800 entries
TTL:     48 hours
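
The 800-entry cap can be illustrated with an in-memory stand-in for the sorted set. In Redis the same effect is usually achieved with ZADD followed by ZREMRANGEBYRANK key 0 -801, which keeps the 800 highest-scored members; the class below is a hypothetical sketch of that behavior, not a Redis client.

```python
import bisect

MAX_ENTRIES = 800


class HomeTimeline:
    def __init__(self, max_entries=MAX_ENTRIES):
        self.max_entries = max_entries
        self.entries = []  # ascending list of (timestamp, tweet_id)

    def add(self, tweet_id, timestamp):
        item = (timestamp, tweet_id)
        i = bisect.bisect_left(self.entries, item)
        if i < len(self.entries) and self.entries[i] == item:
            return  # same entry replayed by an at-least-once consumer
        self.entries.insert(i, item)
        overflow = len(self.entries) - self.max_entries
        if overflow > 0:
            del self.entries[:overflow]  # drop the oldest entries

    def newest(self, limit=20):
        return [tweet_id for _, tweet_id in reversed(self.entries[-limit:])]


timeline = HomeTimeline(max_entries=3)
for i in range(1, 6):
    timeline.add(tweet_id=i, timestamp=i * 10)
```

After five inserts with a cap of three, only the three newest tweet_ids remain.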

Redis: Trending Topics

Key:    trending:{country_code}
Type:   Sorted Set
Members: hashtag
Scores:  trend_score

Elasticsearch: Tweet Search Index

JSON

{
  "tweet_id": "snowflake-id",
  "user_id": "user-uuid",
  "username": "johndoe",
  "content": "Hello Twitter! #systemdesign",
  "hashtags": ["systemdesign"],
  "mentions": [],
  "created_at": "2026-03-13T10:00:00Z",
  "engagement_score": 150,
  "language": "en"
}

MySQL: Social Graph

SQL

CREATE TABLE follows (
    follower_id  BIGINT,
    followee_id  BIGINT,
    created_at   TIMESTAMP,
    PRIMARY KEY (follower_id, followee_id),
    INDEX idx_followee (followee_id, follower_id)
);

General

Concern	Solution
Tweet never lost	Written to MySQL/Cassandra with replication before ack
Fan-out lag	Kafka buffers events; fan-out workers catch up after recovery
Redis timeline loss	Reconstructable from DB (fan-out service can rebuild)
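Celebrity fan-out storm	Hybrid model keeps accounts with ≥ 10K followers off the push path; a stale follower count can still mis-route a tweet, so keep a per-user fan-out kill switch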
Search index lag	Consumers process Kafka with at-least-once semantics; Elasticsearch is eventually consistent
Trending accuracy	Flink checkpointing ensures exactly-once processing; Count-Min Sketch allows bounded error
Delete propagation	Tweet deletion published to Kafka → fan-out service removes from timelines + search index marked as deleted
Hot tweet (viral)	Cache full tweet object in distributed Redis; use read replicas

Specific: Handling Tweet Deletion

Mark tweet as is_deleted = true in DB (soft delete)
Publish tweet-deleted event to Kafka
Fan-out service removes tweet_id from all followers' Redis timelines (async, eventual)
Search service removes from Elasticsearch index
Client-side: on receiving deleted tweet_id, removes from UI

Trending Algorithm Deep Dive

Not just "most tweets": that would always show popular permanent hashtags
Velocity-based: score = (count_in_current_window - count_in_previous_window) / count_in_previous_window
Topics with sudden spikes rank higher than consistently popular ones
Filter out spam/bot-generated trends
Geo-specific trends (per country/city)
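
A small worked example of the velocity score shows why a spike outranks a steadily popular hashtag. The min_previous floor is an assumption added here to avoid dividing by zero for brand-new hashtags; it is not part of the formula above.

```python
def velocity_scores(current, previous, min_previous=10):
    """score = (current - previous) / previous, per hashtag.

    min_previous is a hypothetical floor on the denominator so that a
    hashtag with no history does not divide by zero or explode from 1 to 2.
    """
    scores = {}
    for tag, cur in current.items():
        prev = max(previous.get(tag, 0), min_previous)
        scores[tag] = (cur - prev) / prev
    return scores


current = {"#news": 50_000, "#breaking": 3_000, "#new": 5}
previous = {"#news": 48_000, "#breaking": 100}
scores = velocity_scores(current, previous)
```

"#news" has far more tweets but barely moved, so its score is about 0.04; "#breaking" grew 30-fold and scores 29.0.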

Count-Min Sketch for Trending

Memory-efficient probabilistic data structure
Uses d hash functions mapping to w counters (d × w matrix)
On increment: hash hashtag with each of d functions, increment corresponding counters
On query: take minimum of d counter values (overcounts, never undercounts)
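Space: d × w counters of 4 bytes each; e.g. 4 × 65,536 ≈ 1 MB tracks millions of unique hashtags. Overcount is at most ε × (total events) with ε ≈ e/w, holding with probability 1 - e^(-d)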
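
The increment and query rules above fit in a short class. This is a minimal sketch: the default width and depth are arbitrary, and deriving the d hash functions by prefixing the row number to a SHA-256 input is one convenient choice, not the only one.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        self.width = width   # w counters per row
        self.depth = depth   # d rows, one hash function each
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One (row, column) pair per hash function.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.rows[row][col] += count

    def estimate(self, item):
        # Collisions only ever add to a counter, so the minimum is the tightest bound.
        return min(self.rows[row][col] for row, col in self._cells(item))


cms = CountMinSketch()
for _ in range(3):
    cms.add("#systemdesign")
cms.add("#python", count=5)
```

The estimate for "#systemdesign" is at least 3 and can never exceed the total number of events added (8 here).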

Real-Time Feed Updates

Use Server-Sent Events (SSE) for "new tweets" indicator
Don't push every tweet in real-time (bandwidth waste)
Instead: send "12 new tweets" notification, user pulls to refresh
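
On the wire, an SSE message is a few text fields followed by a blank line, sent over a text/event-stream response. The helper below formats the "12 new tweets" indicator; the event name new_tweets and the JSON payload shape are assumptions for illustration.

```python
import json


def new_tweets_event(count):
    """Format one Server-Sent Events frame carrying only a counter."""
    payload = json.dumps({"count": count})
    # Each field is "name: value"; the empty line terminates the event.
    return f"event: new_tweets\ndata: {payload}\n\n"


frame = new_tweets_event(12)
```

The frame carries a counter rather than tweet bodies, which is what keeps the push channel cheap.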

Tweet Thread / Conversation View

Recursive data model: reply_to forms a tree
To render a thread: fetch root tweet, then all replies recursively
Optimize with a conversation_id field → fetch all tweets in a conversation in one query
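
Once all tweets sharing a conversation_id have been fetched in one query, the reply tree can be rebuilt in memory. A minimal sketch, assuming each row carries tweet_id and reply_to; the function name build_thread is hypothetical.

```python
from collections import defaultdict


def build_thread(tweets, root_id):
    """Return [(tweet_id, depth)] in depth-first order, oldest reply first.

    tweets: rows from one conversation, each with tweet_id and reply_to.
    """
    children = defaultdict(list)
    for tweet in sorted(tweets, key=lambda t: t["tweet_id"]):
        children[tweet["reply_to"]].append(tweet["tweet_id"])

    ordered = []
    stack = [(root_id, 0)]  # explicit stack avoids recursion limits on deep threads
    while stack:
        tweet_id, depth = stack.pop()
        ordered.append((tweet_id, depth))
        for child in reversed(children[tweet_id]):
            stack.append((child, depth + 1))
    return ordered


conversation = [
    {"tweet_id": 1, "reply_to": None},
    {"tweet_id": 2, "reply_to": 1},
    {"tweet_id": 3, "reply_to": 1},
    {"tweet_id": 4, "reply_to": 2},
]
thread = build_thread(conversation, root_id=1)
```

Sorting by tweet_id works as a time order here only because Snowflake IDs are time-sortable.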

Content Moderation

Pre-publish: ML classifier checks for hate speech, spam, misinformation
Post-publish: User reports → moderation queue → human review
Automated actions: Shadow ban (reduce visibility), warning labels, account suspension

Interview Walkthrough

Open with fan-out on write vs read — this single decision shapes storage, latency, and celebrity handling for the entire design.
Propose the hybrid model: precompute timelines for tweets from normal accounts (<10K followers), and merge tweets from celebrity accounts (≥10K followers) at read time.
Store home timelines in Redis sorted sets (tweet_id as member, timestamp as score; 64-bit Snowflake IDs do not fit losslessly in a double-precision score) for O(log N + M) range reads with cursor-based pagination.
Walk through the home timeline merge: fetch precomputed feed + query recent tweets from followed celebrities + k-way merge by timestamp.
Apply Sharding and Partitioning: user_id for timeline caches and the Cassandra user_timeline, tweet_id for the MySQL tweet store, so a hot account's tweets spread across shards.
Use Back-of-the-Envelope Estimation on write amplification: 500M tweets/day × 200 avg followers = 100B fan-out writes — justify the hybrid threshold.
Common pitfall: fan-out on write for all users including celebrities, stalling the post API for minutes during high-follower events.

Home Timeline Assembly: The Merge Walkthrough

User B opens home feed. B follows 300 normal users + 5 celebrities.

Step 1: Fetch pre-computed timeline from Redis (1 ms)
  ZREVRANGEBYSCORE timeline:home:{B} +inf -inf LIMIT 0 200
  → Returns ~200 tweet_ids from fan-out-on-write results
  These are tweets from B's 300 normal followees, already placed here
  at write time by the Fan-out Service.

Step 2: Fetch celebrity tweets (5 parallel Redis calls, 2 ms)
  For each celebrity B follows:
    ZREVRANGEBYSCORE timeline:user:{celeb_id} {now} {now - 24h} LIMIT 0 20
  → Returns ~100 recent tweet_ids from 5 celebrities
  These were NOT pre-computed into B's timeline (fan-out-on-read)

Step 3: Merge and Rank (5 ms in-process)
  Combine 200 + 100 = 300 candidate tweet_ids

  For each candidate, compute ranking score:
    score = α × recency + β × engagement + γ × affinity + δ × content_type

    recency:      decay function → tweets from 5 min ago > 5 hours ago
    engagement:   like_count + 2 × retweet_count + 3 × reply_count
    affinity:     how often does B interact with this author?
                  (likes, replies, profile views in last 30 days)
    content_type: photos > text > links (engagement data shows this)

  Sort by score → top 20 for the first page

  Reverse-chronological as fallback:
    If user has algorithmic timeline disabled → sort by timestamp only
    (Twitter's "Latest Tweets" toggle)

Step 4: Hydrate tweet objects (3 ms)
  For each of top 20 tweet_ids:
    Fetch full tweet from Tweet Cache (Redis hash):
      HGETALL tweet:{tweet_id}
    If cache miss → fetch from Cassandra, populate cache

  Attach: author profile (from User Cache), engagement counts,
          whether B has liked/retweeted it, conversation thread context

Step 5: Return to client (total: ~12 ms)
  {
    "tweets": [ fully hydrated tweet objects ],
    "next_cursor": "tweet_id_of_20th_item"
  }

Why the merge doesn't bottleneck:
  Pre-computed (step 1): one Redis round trip, O(log N + M) on the sorted set, for the common case
  Celebrity fetch (step 2): only 5 calls, not 5 million
  Ranking (step 3): 300 candidates × simple arithmetic = microseconds
  The genius: 99.9% of the work (fan-out from B's 300 normal followees) was done at WRITE time
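
Step 3's scoring formula can be sketched as follows. Everything numeric here is an assumption made for illustration: the weights standing in for α, β, γ, δ, the one-hour half-life, the engagement cap used for normalization, and the per-content-type scores.

```python
# Hypothetical weights for alpha, beta, gamma, delta.
WEIGHTS = {"recency": 0.5, "engagement": 0.3, "affinity": 0.15, "content_type": 0.05}
CONTENT_TYPE_SCORE = {"photo": 1.0, "text": 0.6, "link": 0.3}  # photos > text > links
HALF_LIFE_SECONDS = 3600   # hypothetical decay half-life
ENGAGEMENT_CAP = 10_000    # hypothetical normalization cap


def rank_score(tweet, now, affinity):
    age = max(now - tweet["created_at"], 0)
    recency = 0.5 ** (age / HALF_LIFE_SECONDS)  # exponential decay in (0, 1]
    raw = tweet["likes"] + 2 * tweet["retweets"] + 3 * tweet["replies"]
    engagement = min(raw / ENGAGEMENT_CAP, 1.0)
    content = CONTENT_TYPE_SCORE.get(tweet["content_type"], 0.0)
    return (WEIGHTS["recency"] * recency
            + WEIGHTS["engagement"] * engagement
            + WEIGHTS["affinity"] * affinity
            + WEIGHTS["content_type"] * content)


def rank(candidates, now, affinities, limit=20):
    """Algorithmic timeline: highest score first."""
    scored = sorted(
        candidates,
        key=lambda t: rank_score(t, now, affinities.get(t["author_id"], 0.0)),
        reverse=True,
    )
    return [t["tweet_id"] for t in scored[:limit]]


def latest(candidates, limit=20):
    """Reverse-chronological fallback: timestamp only."""
    newest = sorted(candidates, key=lambda t: t["created_at"], reverse=True)
    return [t["tweet_id"] for t in newest[:limit]]


now = 100_000
fresh = {"tweet_id": 1, "author_id": "x", "created_at": now - 300,
         "likes": 10, "retweets": 0, "replies": 0, "content_type": "text"}
old = {"tweet_id": 2, "author_id": "y", "created_at": now - 5 * 3600,
       "likes": 10, "retweets": 0, "replies": 0, "content_type": "text"}
```

With engagement, affinity, and content type equal, the 5-minute-old tweet outranks the 5-hour-old one purely on the recency term.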

The Celebrity Threshold: 10K Followers, Why?

Fan-out-on-write cost per tweet:
  1 tweet × N followers = N Redis ZADD operations

  N = 100 followers:   100 ZADDs → < 1 ms → trivially fast
  N = 10,000:          10K ZADDs → ~10 ms → acceptable
  N = 100,000:         100K ZADDs → ~100 ms → starting to hurt
  N = 50,000,000:      50M ZADDs → ~50 seconds → IMPOSSIBLE per tweet

  Celebrity tweets: 10-50 per day × 50M ZADDs = 500M-2.5B writes/day from ONE user
  That's up to 2.5% of the entire 100B/day fan-out budget, spent on ONE account.

Threshold choice:
  At 10K: fan-out cost is ~10 ms → acceptable with batching
  Below 10K: 99.9% of users, and their fan-out completes in < 10 ms
  Above 10K: ~0.1% of users, but they generate 80%+ of fan-out writes

  Dynamic threshold: can be tuned per system load
    During low traffic (2 AM): raise to 50K (pre-compute more)
    During peak (Super Bowl): lower to 5K (reduce fan-out load)

Fallback if fan-out is slow:
  Fan-out workers lag behind (spike in tweets)
  User opens feed → pre-computed timeline is stale (missing recent tweets)
  Solution: Timeline Service detects staleness (last_updated > 30s ago)
    → falls back to fan-out-on-read for ALL followees (not just celebrities)
    → slower but always fresh
    → cached result used for subsequent reads until fan-out catches up
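
The cost table and the dynamic threshold can be expressed as two small functions. The per-ZADD cost of 1 microsecond is implied by the table above (10K ZADDs in ~10 ms); the load signal and its 0.3 and 0.8 cutoffs are hypothetical and would come from real capacity monitoring.

```python
ZADD_SECONDS = 1e-6  # implied by the table: 10K ZADDs in ~10 ms


def fanout_seconds(follower_count):
    """Estimated wall time to push one tweet to every follower."""
    return follower_count * ZADD_SECONDS


def pick_threshold(load):
    """Pick the push/pull boundary from a load signal.

    load is a hypothetical value from 0.0 (idle) to 1.0 (peak).
    """
    if load < 0.3:
        return 50_000   # quiet period: pre-compute more
    if load > 0.8:
        return 5_000    # peak: shed fan-out work
    return 10_000


def should_push(follower_count, load):
    return follower_count < pick_threshold(load)
```

An account with 20,000 followers is pushed during quiet hours and pulled at normal or peak load, so any classification cached per author has to be re-evaluated when the threshold moves.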

Delete Propagation: The Consistency Challenge

User tweets → fan-out to 5000 followers → user deletes tweet

Deletion must undo the fan-out:
  1. Mark tweet as deleted in DB (soft delete, instant)
  2. Publish tweet-deleted event to Kafka
  3. Fan-out Service: for each of 5000 followers:
       ZREM timeline:home:{follower} {tweet_id}
     This takes ~5 ms (same speed as fan-out-on-write)
  4. Remove from Elasticsearch search index

Problem: between steps 1 and 3, some followers may still SEE the tweet
  They fetch their timeline → tweet_id is there → hydrate from cache →
  cache still has the tweet? Maybe.

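  Solution: Tweet Cache check
    During hydration (Step 4 of timeline assembly):
    Fetch tweet from cache → check is_deleted flag → if deleted, skip
    Even if timeline still has the tweet_id, it's filtered out at read time
    This requires step 1 to also update or evict tweet:{tweet_id} in the
    Tweet Cache; otherwise hydration reads a stale, undeleted copy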

    Belt and suspenders: deletion in timeline (eventual) + filtering at read (immediate)

Celebrity tweet deletion:
  Celebrity deletes tweet → no fan-out to undo (was never written to followers)
  Just mark as deleted in DB → when followers do fan-out-on-read,
  the celebrity's timeline no longer includes it → naturally disappears
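
The two layers (eventual ZREM plus immediate read-time filtering) can be sketched with dictionaries standing in for Redis. The function names are hypothetical, and the sketch assumes the delete path updates the cached tweet synchronously.

```python
def soft_delete(tweet_cache, tweet_id):
    """Step 1 (synchronous): flag the tweet, including the cached copy."""
    tweet_cache[tweet_id]["is_deleted"] = True


def propagate_delete(home_timelines, follower_ids, tweet_id):
    """Step 3 (asynchronous): the ZREM loop run by the Fan-out Service."""
    for follower_id in follower_ids:
        home_timelines[follower_id].discard(tweet_id)


def visible_tweets(timeline, tweet_cache):
    """Hydration filter: drop ids whose tweet is missing or flagged deleted."""
    result = []
    for tweet_id in sorted(timeline, reverse=True):
        tweet = tweet_cache.get(tweet_id)
        if tweet is not None and not tweet["is_deleted"]:
            result.append(tweet_id)
    return result


home_timelines = {"f1": {101, 102}, "f2": {102}}
tweet_cache = {101: {"is_deleted": False}, 102: {"is_deleted": False}}

soft_delete(tweet_cache, 102)
before = visible_tweets(home_timelines["f1"], tweet_cache)  # ZREM has not run yet
propagate_delete(home_timelines, ["f1", "f2"], 102)
after = visible_tweets(home_timelines["f1"], tweet_cache)
```

The deleted tweet is already hidden in before, even though its id is still in the follower's timeline at that point.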

SLOs & Error Budgets

Metric	Target	Rationale
Timeline read p99 latency	< 200ms	Feed is the core loop — slow feed = churn
Tweet create p99 latency	< 100ms	Write path must be O(1) regardless of follower count
Fan-out lag (normal users)	p99 < 5s	Followers see tweet within 5s of post
Timeline freshness	≤ 10s stale	Merge cache TTL bounds staleness

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Celebrity misclassified as a normal user backs up fan-out workers	Fan-out queue lag >60s for normal users; mis-tagged celebrity processed with push fan-out	Kill switch on fan-out for affected user_id; drain queue; refresh the follower count cache (a stale count caused the mis-route); replay tweets through correct pull path
Redis timeline cluster memory exhaustion	Eviction rate spikes; timeline cache hit ratio drops from 95% to 60%; p99 read latency 3×	Trim timeline ZSET max size from 800 to 400; enable compression for tweet_id entries; emergency scale Redis cluster; shift cold users to pull-only model temporarily
Retweet dedup bug shows duplicate tweets in feed	User reports spike; monitoring shows dedup filter returning empty set on merge	Disable dedup filter (show duplicates temporarily); hotfix merge logic; backfill dedup index from tweet graph; root cause usually stale bloom filter on retweet_of index

Cost Drivers (Staff lens)

Redis memory: 300M cached users × 800 tweet_ids × 8 bytes ≈ 1.9 TB of raw tweet_ids (only active users cached; sorted-set per-entry overhead comes on top)
Fan-out write QPS: 6K tweets/sec × 200 avg followers = 1.2M Redis writes/sec — largest cost center
Cassandra storage: 500M tweets/day × 500 bytes × 365 days ≈ 90 TB/year
Kafka: fan-out event volume mirrors write QPS; 3× replication

Multi-Region & DR

User timeline cache lives in home region (determined by user_id hash). Cross-region follows are common — celebrity tweet cache replicated globally (small: 10K celebrities × 100 tweets). Normal user fan-out stays regional. Timeline read never crosses regions except for celebrity merge — pre-replicate celebrity data to all regions via Kafka mirroring.