This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 25 | Fan-out on write vs read vs hybrid with celebrity threshold math. Redis sorted-set timelines, feed ranking as separate read path, and pull model for low-activity users. |
| Arch 50 | Cache invalidation on delete, ranking feature store lag, cold-start feed, and pagination cursors without deep offsets. |
| Arch 75 | Staff: eventual consistency of fan-out (repair jobs), ads injection without breaking cursor, and operating hybrid mode with automatic celebrity promotion/demotion. |
Interview Prompt
Design a news feed system like Facebook or Twitter home timeline. Users follow others; when someone posts, followers see it in their feed. Support pagination, rough chronological order with optional ranking, and scale to millions of followers for celebrities.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Chronological feed or ranked (ML) feed? | Chronological = merge sorted timelines; ranked = fetch candidate pool then score — different latency budget. |
| What's the follower/following distribution — many celebrities? | Celebrity with 30M followers breaks fan-out on write — drives hybrid threshold. |
| Read:write ratio and acceptable feed staleness? | 100:1 reads typical; 30s staleness allows async fan-out and cache. |
| Post types — text only or media-heavy? | Media changes CDN, fan-out payload size, and pre-compute vs pointer in timeline. |
Scope
In scope
- Post creation and fan-out to followers
- Home timeline read API with pagination
- Celebrity / high-fanout handling
- Basic ranking layer hook
- Follow graph integration
Out of scope (state explicitly)
- Full ML ranking model training
- Stories/ephemeral content
- Comment thread ranking
Assumptions
- 300M DAU, 500M posts/day, avg 200 followers, 0.1% users have >1M followers
- Feed read 100× post write QPS
- Timeline page size 20 posts, cursor pagination
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- User can create a post (text, image, video, link)
- User can follow/unfollow other users (or friend them)
- News feed displays posts from followed users/friends, ranked by relevance
- Feed supports infinite scroll / pagination
- Real-time or near-real-time feed updates when a followed user posts
- Support reactions (like, love, etc.) and comments on posts
- Support re-sharing / reposting
- Low Latency: Feed loads in < 200 ms (p99)
- High Availability: 99.99% uptime
- Scalability: 500M+ DAU, each with a personalized feed
- Eventual Consistency: It's OK if a new post appears in followers' feeds within a few seconds
- Read-Heavy: Read:Write ratio ~1000:1 (users read feed far more than they post)
- Ranking: Feed should be ranked (not purely chronological) for engagement
| Metric | Calculation | Value |
|---|---|---|
| DAU | Given (500M DAU) | 500M |
| Posts / day | 500M × 0.2 posts/user/day | 100M (avg 0.2 posts per user/day) |
| Feed reads / day | 500M × 10 views/user/day | 5B (avg 10 feed views per user/day) |
| Avg followers per user | Given (200 followers avg) | 200 |
| Avg post size (metadata) | Given (1 KB per post record) | 1 KB |
| Feed size per user | 500 posts × 1 KB | 500 KB |
| Fan-out writes/day (push model, avg) | 100M × 200 | 20B writes |
| Fan-out writes/sec | Given | ~230K |
The Two Approaches: Fan-Out on Write vs. Fan-Out on Read
Fan-Out on Write (Push Model)
- How: When a user creates a post, immediately write the post ID to every follower's feed cache
- Pros: Feed reads are fast (pre-computed), O(1) read
- Cons: Expensive for users with millions of followers (celebrity problem). Wastes writes for inactive users
- Best for: Normal users with < 10K followers
Fan-Out on Read (Pull Model)
- How: When a user requests their feed, fetch latest posts from all followed users at query time
- Pros: No wasted writes, handles celebrities naturally
- Cons: Slow feed reads (must query many users' post timelines and merge)
- Best for: Celebrity accounts
Hybrid Approach ⭐ (Recommended)
- Normal users (< 10K followers): Fan-out on write → push post ID to followers' feed caches
- Celebrities (> 10K followers): Fan-out on read → when a user opens their feed, fetch celebrity posts on-the-fly and merge with pre-computed feed
- Classification: A background job periodically classifies users as "normal" or "celebrity" based on follower count
Post Service
- Handles post creation (text, media references, privacy settings)
- Stores posts in Posts DB (MySQL sharded by user_id)
- Uploads media to Object Store (S3) via CDN
- Publishes
new-postevent to Kafka
Fan-Out Service
- Consumes
new-postevents from Kafka - For each new post:
- Fetch the poster's follower list from Social Graph Service
- If poster is "normal" user → push post_id to each follower's feed in Redis
- If poster is "celebrity" → skip fan-out (will be fetched on read)
- Scaling: Kafka consumer group with many workers; each worker handles fan-out for assigned partitions
- Rate: For a post by a user with 200 followers, fan-out is 200 Redis writes: trivial. For 10K followers, still manageable
Social Graph Service
- Stores follower/following relationships
- Storage: Adjacency list in Redis (for fast lookups) backed by MySQL/Cassandra
following:{user_id}→ SET of user_idsfollowers:{user_id}→ SET of user_ids
- For large followers lists (celebrity): paginated retrieval from Cassandra
Feed Service
- Handles feed read requests
- Flow:
- Fetch pre-computed feed from Redis (list of post_ids)
- Fetch celebrity posts separately (pull model) and merge
- Pass merged list to Ranking Service
- Fetch full post objects for top N ranked posts from Post DB/Cache
- Return enriched feed to client
Ranking Service
- Input: List of candidate post_ids with features
- ML Model Features: Affinity with poster, post age, post type (photo vs. text), engagement signals (likes, comments), user's past interactions
- Output: Ranked list of post_ids
- Implementation: Lightweight inference service (TensorFlow Serving / ONNX Runtime)
- Fallback: If ranking service is down, return reverse-chronological feed
Feed Cache (Redis)
- Data Structure: Sorted Set per user →
feed:{user_id}with score = timestamp - Max size: Keep latest 500 post_ids per user (~500 × 16 bytes = 8 KB per user)
- Eviction: When list exceeds 500, remove oldest entries
- Read:
ZREVRANGEBYSCORE feed:{user_id} +inf -inf LIMIT 0 20→ latest 20 posts
Event Bus Design (Kafka)
Topic: news_feed_system-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "news_feed_system-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: news_feed_system-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design a News Feed System (Facebook / Instagram): async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
Create Post
POST /api/v1/posts
Authorization: Bearer <token>
{
"content": "Hello world!",
"media_ids": ["img-uuid-1"],
"privacy": "public"
}
Response: 201 Created
{ "post_id": "post-uuid", "created_at": "..." }Get News Feed
GET /api/v1/feed?page_size=20&cursor=<last_post_timestamp>
Authorization: Bearer <token>
Response: 200 OK
{
"posts": [
{
"post_id": "...",
"author": {"user_id": "...", "name": "...", "avatar_url": "..."},
"content": "Hello world!",
"media": [{"url": "...", "type": "image"}],
"reactions_count": 42,
"comments_count": 7,
"created_at": "2026-03-13T10:00:00Z"
}
],
"next_cursor": "1710320400000"
}Follow/Unfollow
POST /api/v1/users/{user_id}/follow
DELETE /api/v1/users/{user_id}/followCommon Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
202 Accepted: job queued; poll GET /jobs/{id} for status
408 Request Timeout: job still processing; continue pollingMySQL (Sharded by user_id): Posts
CREATE TABLE posts (
post_id BIGINT PRIMARY KEY, -- Snowflake ID
user_id BIGINT NOT NULL,
content TEXT,
media_urls JSON,
privacy ENUM('public', 'friends', 'private'),
created_at TIMESTAMP,
INDEX idx_user_created (user_id, created_at DESC)
);Cassandra: User Timeline (author's own posts)
CREATE TABLE user_timeline (
user_id UUID,
post_id BIGINT,
created_at TIMESTAMP,
PRIMARY KEY (user_id, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);Redis: Pre-computed Feed
Key: feed:{user_id}
Type: Sorted Set
Members: post_id (as string)
Scores: timestamp (epoch ms)
Max: 500 entries (ZREMRANGEBYRANK to trim)Redis / Cassandra: Social Graph
Redis:
Key: following:{user_id} → SET of followed user_ids
Key: followers:{user_id} → SET of follower user_ids
Cassandra (backup/overflow):CREATE TABLE followers (
user_id UUID,
follower_id UUID,
followed_at TIMESTAMP,
PRIMARY KEY (user_id, follower_id)
);Kafka Topic: new-posts
{
"post_id": "post-snowflake-id",
"user_id": "author-uuid",
"content_preview": "Hello world!",
"created_at": "2026-03-13T10:00:00Z",
"is_celebrity": false,
"follower_count": 200
}General
| Technique | Application |
|---|---|
| Kafka durability | Post events never lost (RF=3) |
| Redis persistence | RDB snapshots + AOF; Redis Cluster for HA |
| DB replication | MySQL read replicas, Cassandra RF=3 |
| Cache-aside fallback | If Redis feed is empty, reconstruct from Cassandra |
Problem-Specific
1. Celebrity Fan-Out Storm
- A celebrity with 50M followers posts → 50M writes would take minutes
- Solution: Hybrid model: celebrities use fan-out on read, not write. No storm
2. Feed Cache Miss (New User or Cold Start)
- A user who hasn't opened the app in months has an empty feed cache
- Solution: Fall back to fan-out on read: fetch last 500 posts from all followed users' timelines, merge, rank, cache the result
3. Stale Feed after Unfollow
- User unfollows someone but their posts are still in the feed cache
- Solution: Lazy deletion: when feed is fetched, filter out posts from unfollowed users. Periodically rebuild feed cache in background
4. Post Deletion
- Author deletes a post, but it's in millions of feed caches
- Solution: Don't remove from feed caches (too expensive). When feed is served, check post validity. If deleted, skip it. Eventually, old posts age out of the feed cache
5. Ranking Service Failure
- Solution: Graceful degradation: serve reverse-chronological feed (no ranking). User still gets a feed, just less personalized
Cursor-Based Pagination
- Don't use
OFFSET(performance degrades with depth) - Use cursor = last post's timestamp:
WHERE created_at < cursor ORDER BY created_at DESC LIMIT 20 - For ranked feeds: cursor =
{score}_{post_id}to handle ties
Content Moderation
- Before publishing a post, run through a content moderation pipeline (text: ML classifier, images: NSFW detection)
- Async: post is published immediately, moderation runs in background. If flagged → hide post
Push Notifications for Feed
- When a close friend posts, send a push notification
- Don't notify for every post: use a "close friends" heuristic based on interaction frequency
Feed Freshness
- Use a WebSocket or SSE (Server-Sent Events) to push "new posts available" indicator to the client
- Client can then pull the latest feed
Interview Walkthrough
- Lead with the fan-out decision — push (write on post) vs pull (read on request) — because it drives every downstream choice in this problem.
- Introduce the hybrid model for celebrity accounts: fan-out on write for normal users, fan-out on read for users with millions of followers.
- Apply Caching Patterns to store precomputed feed timelines in Redis sorted sets keyed by
user_id. - Insist on cursor-based pagination over
OFFSET— interviewers expect you to know why deep pagination degrades. - Clarify ranking requirements upfront: pure chronological vs engagement-weighted; ranking adds a scoring pipeline on top of fan-out.
- Partition feeds and social graph edges using Sharding and Partitioning by
user_idto keep hot users isolated. - Use Back-of-the-Envelope Estimation on write amplification: a celebrity post × 50M followers = 50M cache writes — quantify before committing to push-only.
- Common pitfall: fan-out on write for every user without a celebrity threshold, causing write storms that stall the posting path.
The Fan-Out Decision: The Most Important Architectural Choice
Fan-Out on Write (Push):
Post Event → for each follower → write post_id to their feed cache
Cost per post = O(followers_count) WRITES
Cost per feed read = O(1) READ (pre-computed)
✓ Read is instant (pre-computed feed)
✓ Simple read path
✗ Write amplification (celebrity with 50M followers → 50M writes)
✗ Wasteful for inactive users (compute feeds they never read)
✗ Latency for the post to appear in ALL followers' feeds
Fan-Out on Read (Pull):
Feed Request → fetch latest posts from each followed user → merge & rank
Cost per post = O(1) WRITE (just store the post)
Cost per feed read = O(following_count) READS + MERGE
✓ No write amplification
✓ No wasted work for inactive users
✗ Slow reads (must query N users' timelines and merge)
✗ Latency spikes if user follows many accounts
Hybrid (Facebook/Twitter actual approach) →:
- For 99.9% of users (< 10K followers): Fan-out on Write
- For 0.1% celebrities (> 10K followers): Fan-out on Read
Why 10K threshold?
10K × 500M posts/day × 0.2 posts/user = 1B fan-out writes/day (manageable)
50M × 500M posts/day × 0.001 posts/celebrity = impossible fan-out
On feed read:
1. Read pre-computed feed from Redis (fan-out-on-write part)
2. Fetch latest posts from followed celebrities (fan-out-on-read part)
3. Merge + rank → returnWhy Redis Sorted Set for Feed Cache (Not a List)?
Option 1: Redis LIST (LPUSH + LRANGE) ✓ Simple append ✗ No dedup — if fan-out retries, duplicates appear ✗ Cannot efficiently remove a specific post (deletion/moderation) ✓ No scoring for ranked feeds Option 2: Redis SORTED SET (ZADD + ZREVRANGE): ✓ Natural ordering by score (timestamp or ranking score) ✓ Dedup built-in (same post_id is a single member) ✓ Efficient removal: ZREM post_id ✓ Efficient trimming: ZREMRANGEBYRANK to keep top 500 ✓ Range queries: ZREVRANGEBYSCORE for cursor-based pagination ✗ Slightly more memory than LIST (~2× per entry due to skip list structure) Decision: SORTED SET. The memory overhead is negligible compared to the operational benefits (dedup, deletion, ranking). At 500 entries × 8 bytes per member ≈ 4 KB per user feed — trivial.
Why MySQL for Posts, Cassandra for Timeline, Redis for Feed?
| Data | Access Pattern | Best DB | Why |
|---|---|---|---|
| Posts (content) | Write once, read by ID, update | MySQL (sharded) | Relational (post → user, post → media), ACID for edits/deletes, familiar query patterns |
| User Timeline (author's posts) | Append, read by user_id sorted by time | Cassandra | Time-series write pattern, partition by user_id, high write throughput, no joins needed |
| Home Feed (pre-computed) | Write many (fan-out), read by user_id | Redis | In-memory for instant reads, sorted set for ranking, TTL for auto-cleanup |
| Social Graph | Read heavy, set operations | Redis + Cassandra | Redis for fast lookups during fan-out, Cassandra for durability |
Why not just use one DB for everything?
- MySQL can't handle 100B fan-out writes/day (write bottleneck)
- Cassandra can't serve ranked feeds in < 200ms (no in-memory sorted access)
- Redis can't store 10B posts durably (volatile, expensive at that scale)
- Polyglot persistence: each DB for what it does best
Ranking Model: Chronological vs ML-Ranked
Chronological (Twitter's original approach): ✓ Simple, deterministic, transparent to users ✓ No cold-start problem ✗ Noisy — high-volume posters dominate ✗ Important posts from close friends get buried ✗ Lower engagement (users miss relevant content) ML-Ranked (Facebook/Instagram approach) ⭐: ✓ Higher engagement (shows content you care about) ✓ Handles information overload ✗ "Filter bubble" — may limit exposure to diverse content ✗ Complex to build and maintain ✗ Users sometimes complain about non-chronological order Ranking Features: - Affinity: How often you interact with this user (likes, comments, messages) - Post age: Newer posts scored higher (time decay) - Content type: Does this user prefer photos over text? - Engagement velocity: How fast is this post getting likes/comments? - Author engagement: Does the author post high-quality content? - User feedback signals: Did user hide similar posts before? Model Architecture: - Candidate generation: ~1000 posts from followed users - Scoring: Lightweight model (logistic regression or small neural net) scores each - Re-ranking: Apply diversity rules (don't show 5 posts from same user in a row) - Returns top ~50 scored posts for initial feed load
The Delete Propagation Problem (Why "Lazy Delete" is Necessary)
Scenario: User deletes a post. That post_id exists in 200K followers' feed caches. Option 1: Eager Delete (remove from all caches immediately) → 200K Redis ZREM operations → Takes minutes to propagate → If Kafka is backed up, caches are inconsistent for hours → Cost: O(followers) per delete Option 2: Lazy Delete (check validity at read time) ⭐ → Post marked as deleted in Posts DB (instant, O(1)) → When feed is served, batch-lookup all post_ids → filter out deleted ones → Deleted post_id eventually ages out of feed caches (ZREMRANGEBYRANK keeps only 500) → Cost: O(1) per delete + small overhead per feed read → Trade-off: Deleted post might appear briefly for some users (< 1 second if using cache-aside) Production approach: Lazy delete + async background cleanup → Publish "post-deleted" event to Kafka → Background consumer eventually removes from affected feeds (best-effort) → Feed serving always double-checks post validity as safety net
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1 — Fan-out on read (SQL)
SELECT posts FROM follows JOIN posts ORDER BY time. Works for prototype; read DB melts at scale.
Key components: PostgreSQL · Normalized schema
Move to next phase when: p99 read > 500ms with 300+ followings average
Phase 2 — Fan-out on write + Redis
Kafka post.created → fan-out workers → ZADD to follower feeds. Cassandra for posts. Hybrid threshold for celebrities.
Key components: Kafka · Fan-out workers · Redis ZSET · Cassandra · Hybrid threshold
Move to next phase when: Ranking and ads need richer candidate pool than chronological ZSET
Phase 3 — Ranked + multi-region
Ranking service, feature store (Redis/Velox), geo-replicated timelines with user home region. Automatic celebrity flag from follower count stream.
Key components: Ranker · Feature store · Regional Redis · Celebrity cache
Move to next phase when: Cross-region followers require timeline replication or read-time merge latency
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Feed read p99 latency | < 200ms | Includes merge + optional rank; chronological < 100ms |
| Feed freshness | p95 < 5s | Async fan-out lag acceptable for social, not for breaking news tier |
| Fan-out job success | 99.9% within 30s | Backlog indicates Kafka/worker issues |
| Timeline availability | 99.95% | Degrade to following list + pull on Redis failure |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Fan-out Kafka lag 2 hours after viral post storm | Consumer lag metric, users report stale feeds | Scale fan-out consumers; temporarily extend hybrid threshold (more pull); prioritize active users in queue; shed inactive user fan-out |
| Redis memory exhaustion on timeline cluster | evicted_keys > 0, OOM, elevated feed miss rate | Aggressive ZSET trim (500 entries); move cold users to pull model; shard timelines by user_id hash; add nodes |
| Ranking service returns empty — blank feed | Error rate spike, fallback not triggered | Circuit breaker to chronological ZSET only; feature flag disable ranker; postmortem on candidate size zero bug |
Cost Drivers (Staff lens)
- Redis RAM: 1000 post_ids × 8 bytes × 300M active users with materialized feeds (hybrid reduces this)
- Fan-out Kafka write amplification: posts × average fan-out — largest cost center
- Ranking inference CPU if real-time features — often bigger than storage
Multi-Region & DR
User timeline homed in signup region; celebrity posts replicated globally (small set). Cross-region follower: merge local timeline + fetch celebrity cache from global layer. Accept 1–2s extra latency for cross-region celebrity merge vs full timeline replication everywhere.