Design a Follower/Following System

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 25	Draw the dual-table Cassandra model (following + followers). Explain is-following hot path, celebrity bucket sharding, and fan-out threshold (100K). Show follow write sequence with idempotency.
Arch 50	Count reconciliation, block list interaction, Kafka fan-out lag SLOs, and Redis failure fallback to Cassandra. Quantify celebrity fan-out cost (100M × 64B).
Arch 75	Staff: TAO pattern vs graph DB for multi-hop queries, follow spam bot detection, and cross-region follow consistency (celebrity in US, follower in EU).

Interview Prompt

Design a follower/following system like Twitter. Users can follow/unfollow, view follower and following lists, see counts on profiles, and get follow suggestions. The system must handle celebrities with 100M+ followers.

Clarifying Questions (ask before designing)

Question	Why it matters
Is the graph directed or bidirectional?	Follow is directed (A follows B ≠ B follows A). Affects storage model and suggestion algorithms.
What's the fan-out model for feed delivery?	Fan-out-on-write vs read determines whether follow graph is on critical path for posts.
How fresh must follower counts be?	Eventual consistency (seconds lag) is acceptable for counts; follow action must be immediately visible to actor.
Do we need multi-hop queries (friends-of-friends)?	1-hop is O(1) with lookup table; 2-hop requires graph traversal or pre-computation.

Scope

In scope

Follow/unfollow with block list
Follower/following lists with pagination
Follower/following counts
Is-following check at 500K QPS
Celebrity scaling (100M followers)
Follow suggestions (friends-of-friends)

Out of scope (state explicitly)

Full news feed design (reference fan-out integration only)
Direct messaging
Account verification / blue check

Assumptions

1B users, 200 avg followers, 200M follows/day
Follow/unfollow p99 < 100ms
Celebrity threshold: 100K followers → fan-out-on-read

Follow/Unfollow: Directed edge between users
Follower list: Paginated, sorted by recency
Following list: Paginated list
Follower/following count: Display on profile
Follow suggestions: "People you may know"
Follow status check: Fast "does A follow B?" lookup
Fan-out on write: Deliver posts to followers' feeds
Notifications: "Alice started following you"

Metric	Calculation	Value
Total users	Given	1B
Avg followers per user	Given (power-law: most users < 500, celebrities skew avg)	200
Total follow edges	1B users × 200 avg followers	200B
Follow events / day	Given (~0.2 follows/user/day)	200M
Follow events / sec	200M/day ÷ 86400 ≈ 2.3K (peak 50K = 20×)	~2.3K (peak 50K)
Follow check queries / sec	Given (is-following on every profile/timeline render)	500K
Edge record size	follower_id (8B) + followee_id (8B) + timestamp (8B)	24 bytes
Edge storage	200B edges × 24 bytes	~4.8 TB
Celebrity fan-out writes/post	100M followers × 64B feed entry	~6.4 GB per celebrity post (fan-out-on-read above 100K threshold)

Loading...

Write Path: Follow Action

Alice follows Bob: validate (not blocked, not self, not already following) → Cassandra BATCH INSERT into following + followers tables → Redis ZADD following/followers sorted sets + INCR counts → Kafka follow-events for feed fan-out and notifications. Target: < 100ms p99.

Follow write sequence:
1. Check block list: GET block:{alice}:{bob} — 403 if blocked
2. Idempotency: SET follow:idem:{key} NX EX 86400 — return cached if exists
3. Cassandra BATCH:
   INSERT INTO following (alice, now, bob)
   INSERT INTO followers (bob, bucket(alice), alice, now)
   INSERT INTO is_following (alice, bob)
4. Redis pipeline: ZADD following:{alice}, ZADD followers:{bob} (if < 100K)
   INCR follower_count:{bob}, INCR following_count:{alice}
5. Kafka publish: { event: "follow", actor, target, ts }
6. Return 201

Is-Following Lookup: 500K QPS Hot Path

Every profile view, timeline render, and DM permission check calls this. Three-tier lookup:

Redis: is_following:{a}:{b}: 1ms, 95% hit rate
Cassandra is_following table: O(1) partition key lookup: 5ms
Negative cache: is_following:{a}:{b} = "0" with 5min TTL to avoid repeated misses

Pipelined batch lookups: check 50 friends in one Redis MGET round trip when rendering mutual friends.

Celebrity Hot Partition: 100M Followers

Bucketed partitioning: PRIMARY KEY ((user_id, bucket), follower_id) with bucket = hash(follower_id) % 100. 100 sub-partitions × 1M rows each = manageable compaction. Count stored separately via Redis INCR/DECR (never SELECT COUNT(*)). Do NOT cache celebrity follower lists in Redis: list is 100M entries; paginate from Cassandra only.

Fan-out threshold: Users with > 100K followers switch to fan-out-on-read. A celebrity post would require 100M Redis LPUSH ops (~6.4 GB) if fan-out-on-write: unacceptable. Hybrid model matches Twitter/Instagram production architecture.

Unfollow, Block, and Consistency

Unfollow: DELETE from both tables + Redis ZREM + DECR counts. Eventual consistency on counts (± few seconds) is acceptable per NFR.

Block: INSERT into blocks table + remove existing follow edge if present + purge from both follower lists. Block check runs before every follow attempt.

Count reconciliation: Hourly Spark job counts edges per user, compares to Redis counters, alerts on drift > 1%. Fixes via admin tool, not automatic overwrite (prevents masking bugs).

Follow Suggestions: Friends-of-Friends

Mutual friends algorithm: for each user Alice follows, get their following lists → aggregate → rank by frequency → filter already-followed and blocked. Offline Spark job daily → store top 50 in Redis suggestions:{uid}. Online refresh on new follow (incremental update for 1-hop only).

For celebrities: intersect viewer's following list (~500) with celebrity's followers using pipelined is_following checks: never materialize full celebrity follower set.

HTTP

POST /api/v1/users/{target_uid}/follow
Authorization: Bearer <token>
Idempotency-Key: follow-{actor}-{target}-{timestamp}

Response: 201 Created
{ "following": true, "followee_id": "uuid", "created_at": "2026-03-14T10:00:00Z" }

Errors:
  400 Bad Request     — cannot follow yourself
  401 Unauthorized    — invalid/missing token
  403 Forbidden       — blocked by target user or rate limited
  404 Not Found       — target user does not exist
  409 Conflict        — already following (idempotent: returns 200 with same body)
  429 Too Many Requests — follow spam (>100 follows/hour)

DELETE /api/v1/users/{target_uid}/follow
Response: 200 OK { "following": false }

GET /api/v1/users/{uid}/followers?cursor={last_id}&limit=50
Response: 200 OK
{ "followers": [{ "id", "name", "avatar" }], "total": 15234, "next_cursor": "..." }

GET /api/v1/users/{uid}/is-following/{target_uid}
Response: 200 OK { "following": true }  // p99 < 2ms via Redis/lookup table

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff

Cassandra: Source of Truth

SQL

CREATE TABLE following (
    user_id    UUID,
    follows    UUID,
    created_at TIMESTAMP,
    PRIMARY KEY (user_id, created_at, follows)
) WITH CLUSTERING ORDER BY (created_at DESC);

CREATE TABLE followers (
    user_id    UUID,
    bucket     INT,
    follower   UUID,
    created_at TIMESTAMP,
    PRIMARY KEY ((user_id, bucket), created_at, follower)
) WITH CLUSTERING ORDER BY (created_at DESC);

CREATE TABLE is_following (
    follower_id UUID,
    followee_id UUID,
    PRIMARY KEY (follower_id, followee_id)
);

Why Cassandra? Write-heavy workload (200M writes/day), bucketed partitioning for celebrities, Last-Writer-Wins conflict resolution, tunable consistency.

Redis: Cache Layer

following:{uid}       → Sorted Set (score=timestamp)
followers:{uid}       → Sorted Set (non-celebrity, < 100K)
is_following:{a}:{b}  → "1" (TTL: 24h)
follower_count:{uid}  → Integer
following_count:{uid} → Integer
suggestions:{uid}     → List (pre-computed, TTL: 24h)

Concern	Solution
Dual table write failure	Cassandra BATCH is atomic per partition; retry on timeout; idempotency key prevents duplicate edges
Count drift	Hourly Spark reconciliation: count edges → compare Redis → alert + manual fix if drift > 1%
Celebrity hot partition	100-way bucket sharding; fan-out-on-read above 100K followers; never cache full list
Follow spam / bot farms	Rate limit 100 follows/hour; CAPTCHA after 50; graph analysis for bot clusters
Redis failure	Fallback to Cassandra is_following table; rebuild cache async; counts may lag seconds
Kafka lag on follow events	Feed fan-out is async: acceptable 5s delay; notifications use separate priority topic
Split-brain on follow/unfollow race	Last-writer-wins on timestamp; unfollow always wins over concurrent follow retry

SLOs & Error Budgets

Metric	Target	Rationale
Follow/unfollow p99 latency	< 100ms	User-facing write — must feel instant to actor
Is-following p99	< 2ms	Called on every profile/timeline render at 500K QPS
Count drift after reconciliation	< 1%	Eventual consistency acceptable but bounded

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Celebrity follow causes hot partition timeout	Cassandra write timeout rate spikes for single user_id partition	Verify bucket distribution; rate-limit follows to celebrity; enable fan-out-on-read for that user
Follower count shows 0 for active user	User reports vs reconciliation job drift > 1%	Check Redis INCR/DECR logs; run reconciliation; find root cause before blind overwrite

Cost Drivers (Staff lens)

Cassandra storage: 200B edges × 24B = 4.8 TB
Redis: is-following cache ~50GB at 95% hit rate
Kafka follow-events: 200M/day × 200B = 40 GB/day

Multi-Region & DR

Follow edges stored in home region of followee. Cross-region follow adds ~50ms. Celebrity reads use local read replica. Counts are eventually consistent across regions (CRDT or periodic sync).

Interview Prompt

Clarifying Questions (ask before designing)

Scope

In scope

Out of scope (state explicitly)

Assumptions

Write Path: Follow Action

Is-Following Lookup: 500K QPS Hot Path

Celebrity Hot Partition: 100M Followers

Unfollow, Block, and Consistency

Follow Suggestions: Friends-of-Friends

Common Error Responses

Cassandra: Source of Truth

Redis: Cache Layer

Mutual Friends Computation

Feed Fan-Out Integration

Interview Walkthrough

Graph DB (Neo4j) vs Cassandra + Redis (TAO Pattern)

Feed Fan-Out: Write vs Read

Phase 1 — MVP (single Postgres)

Phase 2 — Cassandra + hybrid fan-out

Phase 3 — TAO at billion scale

SLOs & Error Budgets

Incident Scenarios (2am reality)

Cost Drivers (Staff lens)

Multi-Region & DR