This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 75 | Staff level: multi-region, cost at scale, migration path, and production metrics. |
Interview Prompt
Design Social Graph Store.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Graph DB vs adjacency list in relational, Fan-out queries (friends-of-friends), Graph partitioning (edge-cut vs vertex-cut)? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Graph DB vs adjacency list in relational
- Fan-out queries (friends-of-friends)
- Graph partitioning (edge-cut vs vertex-cut)
- Bidirectional edges
- Capacity estimation with shown math
Out of scope (state explicitly)
- Detailed frontend/UI pixel implementation
- Org structure, staffing, and hiring plan
Assumptions
- Millions of DAU with heavy fan-out — clarify celebrity/hot-key cases early
- Eventual consistency acceptable for non-critical side effects (counts, notifications)
- WebSocket or push infrastructure available at the edge
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Follow/Unfollow: User A follows/unfollows User B (directed edge)
- Friend request: Mutual follow / bidirectional friendship (undirected edge)
- Get followers: List of users following User A
- Get following: List of users User A follows
- Mutual friends: "You and Alice have 12 mutual friends"
- Friend-of-friend: 2nd degree connections for recommendations
- Graph queries: Shortest path, connected components, influence scoring
- Blocking: Exclude blocked users from all graph queries
- Low Latency: Follower/following list in < 50 ms
- High Write Throughput: Millions of follow/unfollow per day
- Scale: 2B+ users, 500B+ edges
- Consistency: Follow action must be immediately visible to the actor
- Availability: 99.99%
- Traversal Performance: 2-hop queries in < 200 ms
| Metric | Calculation | Value |
|---|---|---|
| Users (nodes) | Given | 2B |
| Edges (follow relationships) | Given | 500B |
| Follow/unfollow ops / day | Given | 500M |
| Follower list queries / sec | Derived from daily volume ÷ 86400 (+ peak factor) | 100K |
| Edge record size | Given | 32 bytes |
| Total edge storage | Given | 16 TB |
Storage: Adjacency List vs Edge Table
Approach 1: Edge Table (MySQL) CREATE TABLE follows (follower_id BIGINT, followee_id BIGINT, PRIMARY KEY (follower_id, followee_id)); ✓ Simple, strong consistency ✗ 2-hop traversals require JOIN -> expensive at scale Approach 2: Adjacency List in Cassandra (Facebook TAO-inspired) Two tables: following(user_id, follows) + followers(user_id, follower) ✓ Scales horizontally, fast single-hop queries ✗ Dual writes needed Approach 3: Graph DB (Neo4j, Neptune) ✓ Native graph traversals in < 50ms ✗ Harder to scale beyond 100B edges
Facebook TAO: The Industry Standard
TAO (The Associations and Objects) — Facebook's custom graph store Core abstraction: Objects: users, pages, groups, posts (nodes) Associations: follows, likes, friendships (edges) Stored in MySQL, cached aggressively in distributed cache layer: Cache hit rate: > 99.8% Read from cache: < 1ms Write path: 1. Write to MySQL (source of truth) 2. Invalidate cache for both id1 and id2 3. Async: update denormalized count tables Key insight: 99%+ of queries are single-hop (get followers/get following). Only friend suggestions need multi-hop -> done offline in Spark.
Mutual Friends: The Interview Favorite
"Alice and Bob have 12 mutual friends" — how to compute? Optimization: Redis sorted set intersection ZINTERSTORE mutual_temp followers:alice followers:bob ZCARD mutual_temp -> count Redis does this in-memory in < 5ms for sets up to 10K members. For celebrities (100M followers): Don't compute real-time. Pre-compute mutual count in batch. Or show: "You follow [3 friends who follow this celebrity]"
POST /api/v1/follow
{ "target_user_id": "bob" } -> 200 OK
DELETE /api/v1/follow
{ "target_user_id": "bob" } -> 200 OK
GET /api/v1/users/{uid}/followers?cursor=...&limit=50
-> { "followers": [{id, name, avatar}, ...], "cursor": "..." }
GET /api/v1/users/{uid}/following?cursor=...&limit=50
-> { "following": [...], "cursor": "..." }
GET /api/v1/users/{uid}/mutual-friends?with=bob
-> { "mutual": [{id, name}, ...], "count": 12 }Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff
MySQL/Cassandra: Source of Truth
CREATE TABLE follows (
follower_id BIGINT, followee_id BIGINT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (follower_id, followee_id)
);
CREATE INDEX idx_followee ON follows (followee_id, follower_id);Redis: Cache Layer
followers:{uid} -> Sorted Set (score=timestamp, member=user_id)
following:{uid} -> Sorted Set (score=timestamp, member=user_id)
follow_count:{uid} -> Hash {followers: 1523, following: 342}| Concern | Solution |
|---|---|
| Dual write | Write both tables in same Cassandra batch or MySQL transaction |
| Count drift | Async count reconciliation job (hourly) from edge table |
| Cache invalidation | On follow/unfollow -> invalidate both users' cache entries |
| Celebrity hot partition | Shard followers by (followee_id, follower_id_prefix) |
Race: Follow + Unfollow in Quick Succession
T=0: Follow Bob -> INSERT follows (alice, bob) T=50ms: Unfollow Bob -> DELETE follows (alice, bob) If out of order at DB: DELETE arrives first (no-op, row doesn't exist) INSERT arrives second -> alice follows bob (WRONG!) Solution: Include timestamp, use LWW (Last-Writer-Wins). Or use Cassandra (naturally LWW with cell-level timestamps).
Interview Walkthrough
- Model the social graph as directed edges (follower_id → followee_id) stored in Cassandra with indexes on both directions.
- Apply the TAO pattern: Cassandra/MySQL as source of truth, Redis as a cache-aside layer for hot 1-hop queries (<1ms).
- Cache follower/following lists in Redis sorted sets paginated by cursor — invalidate the cache entry on follow/unfollow writes.
- Maintain denormalized edge counts in Redis hashes; never run COUNT(*) across billions of rows on the read path.
- For mutual-friend queries, use bidirectional BFS or pre-computed 2nd-degree indexes — unidirectional BFS at depth 3 visits millions of nodes.
- Handle concurrent follow/unfollow with Last-Writer-Wins timestamps to prevent out-of-order writes from leaving stale edges.
- Explain why graph databases (Neo4j) fail at 500B+ edges — Facebook built TAO specifically because no graph DB scaled to their needs.
- Common pitfall: proposing Neo4j or a graph DB for a billion-user social network — interviewers expect the TAO relational+cache pattern instead.
Graph DB vs Relational + Cache
For social networks at scale: Relational + cache (TAO pattern). For knowledge graphs / small-scale: Graph DB. At Facebook scale: 2B users x 250 avg connections = 500B edges. No graph DB scales to this level. TAO: MySQL + massive cache layer -> custom-built for their scale.
Degree of Separation: Bidirectional BFS
Why bidirectional?
Unidirectional BFS at depth 3: visits ~200^3 = 8M nodes
Bidirectional BFS at depth 3: visits ~2 x 200^1.5 = ~5.6K nodes
-> 1400x fewer nodes explored!
LinkedIn's approach:
Pre-compute 1st and 2nd degree connections offline (Spark)
Store in graph index: 2nd_degree:{uid} = [user_ids]Graph Partitioning: Sharding Edges
Option 1: Hash partition by user_id ✓ Even distribution ✗ Cross-shard queries for friend-of-friend Option 2: Social-aware partitioning (METIS) ✓ Most traversals stay within one shard ✗ Complex to compute, must re-partition periodically Option 3: Hash + replicated edge list + aggressive caching Facebook TAO: hash partition by user_id with 99.8% cache hit rate
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core social graph store flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.