Design a Social Graph Store

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 75	Staff level: multi-region, cost at scale, migration path, and production metrics.

Interview Prompt

Design a Social Graph Store.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Graph DB vs adjacency list in relational, Fan-out queries (friends-of-friends), Graph partitioning (edge-cut vs vertex-cut)?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Graph DB vs adjacency list in relational
Fan-out queries (friends-of-friends)
Graph partitioning (edge-cut vs vertex-cut)
Bidirectional edges
Capacity estimation with shown math

Out of scope (state explicitly)

Detailed frontend/UI pixel implementation
Org structure, staffing, and hiring plan

Assumptions

Millions of DAU with heavy fan-out — clarify celebrity/hot-key cases early
Eventual consistency acceptable for non-critical side effects (counts, notifications)
WebSocket or push infrastructure available at the edge

Follow/Unfollow: User A follows/unfollows User B (directed edge)
Friend request: Mutual follow / bidirectional friendship (undirected edge)
Get followers: List of users following User A
Get following: List of users User A follows
Mutual friends: "You and Alice have 12 mutual friends"
Friend-of-friend: 2nd degree connections for recommendations
Graph queries: Shortest path, connected components, influence scoring
Blocking: Exclude blocked users from all graph queries
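
The friend-of-friend and blocking requirements above can be sketched in memory as follows. This is a minimal illustration, not a production design: `friends_of_friends`, the `graph` and `blocked` dictionaries, and the ranking by mutual-connection count are all assumptions made for the example, and blocking is applied in one direction only.

```python
from collections import Counter

def friends_of_friends(graph, blocked, user, limit=10):
    """Rank 2nd-degree connections of `user` by number of shared connections.

    graph:   dict of user -> set of users they are connected to (hypothetical shape)
    blocked: dict of user -> set of users they have blocked (hypothetical shape)
    """
    direct = graph.get(user, set())
    hidden = blocked.get(user, set())
    counts = Counter()
    for friend in direct:
        if friend in hidden:
            continue  # do not traverse through a blocked user
        for candidate in graph.get(friend, set()):
            # Skip the user, existing connections, and blocked users.
            if candidate == user or candidate in direct or candidate in hidden:
                continue
            counts[candidate] += 1
    # Most shared connections first; user id breaks ties deterministically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
```

Each direct connection costs one adjacency-list read, so a user with 250 connections triggers about 250 lookups per query. That read amplification is why fan-out queries are usually served from cache or precomputed offline.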

Metric	Calculation	Value
Users (nodes)	Given	2B
Edges (follow relationships)	Given	500B
Follow/unfollow ops / day	Given	500M
Follower list queries / sec	Daily read volume ÷ 86,400 s × peak factor (daily read volume not stated; treat as given)	100K
Edge record size	Given	32 bytes
Total edge storage	500B edges × 32 bytes	16 TB
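
The arithmetic behind the table can be checked in a few lines. The inputs are the given values above; the derived names (`edge_storage_tb`, `avg_edges_per_user`, `avg_write_qps`) are introduced here for illustration only.

```python
USERS = 2_000_000_000            # nodes (given)
EDGES = 500_000_000_000          # follow relationships (given)
EDGE_BYTES = 32                  # bytes per edge record (given)
FOLLOW_OPS_PER_DAY = 500_000_000 # follow/unfollow writes per day (given)
SECONDS_PER_DAY = 86_400

# Raw edge storage, before secondary indexes and replication.
edge_storage_tb = EDGES * EDGE_BYTES / 1e12

# Average out-degree; real graphs are heavily skewed toward celebrities.
avg_edges_per_user = EDGES / USERS

# Average write rate; peak is typically a multiple of this.
avg_write_qps = FOLLOW_OPS_PER_DAY / SECONDS_PER_DAY

print(f"{edge_storage_tb:.0f} TB, {avg_edges_per_user:.0f} edges/user, "
      f"{avg_write_qps:.0f} writes/sec")
```

The 16 TB figure covers a single copy of the edge table. A reverse index on followee and a replication factor of 3 would each multiply it, so the deployed footprint is several times larger.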


HTTP

POST /api/v1/follow
{ "target_user_id": "bob" }  -> 200 OK

DELETE /api/v1/follow
{ "target_user_id": "bob" }  -> 200 OK

GET /api/v1/users/{uid}/followers?cursor=...&limit=50
-> { "followers": [{id, name, avatar}, ...], "cursor": "..." }

GET /api/v1/users/{uid}/following?cursor=...&limit=50
-> { "following": [...], "cursor": "..." }

GET /api/v1/users/{uid}/mutual-friends?with=bob
-> { "mutual": [{id, name}, ...], "count": 12 }
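
The `cursor` and `limit` parameters on the list endpoints suggest keyset pagination. One way this might work is sketched below, assuming followers are ordered newest first by `(created_at, follower_id)` and the cursor is an opaque base64 encoding of the last row returned. The helper names and the cursor format are assumptions, not part of the API above.

```python
import base64
import json

def encode_cursor(created_at, follower_id):
    """Pack the last-seen row into an opaque, URL-safe token."""
    raw = json.dumps([created_at, follower_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor):
    created_at, follower_id = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return (created_at, follower_id)

def page_followers(rows, cursor=None, limit=50):
    """Return one page of (created_at, follower_id) rows, newest first.

    `rows` stands in for the storage layer; a real store would seek
    directly to the cursor position instead of filtering a list.
    """
    ordered = sorted(rows, reverse=True)
    if cursor is not None:
        last_seen = decode_cursor(cursor)
        ordered = [row for row in ordered if row < last_seen]
    page = ordered[:limit]
    # Only hand out a cursor when more rows remain.
    next_cursor = encode_cursor(*page[-1]) if len(ordered) > limit else None
    return page, next_cursor
```

Unlike offset pagination, a cursor keyed on the last row stays stable when new followers arrive between requests, which matters for accounts gaining followers continuously.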

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
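
A client-side retry policy consistent with the list above might look like the following sketch. It assumes a hypothetical `send` callable that takes an idempotency key and returns `(status_code, retry_after_seconds_or_None)`; the retryable set, delays, and attempt count are illustrative defaults.

```python
import random
import time
import uuid

RETRYABLE = {429, 500, 503}

def call_with_retry(send, max_attempts=5, base_delay=0.1, max_delay=5.0,
                    sleep=time.sleep):
    """Retry transient failures, reusing one idempotency key for every attempt.

    `send` is a hypothetical callable: send(idempotency_key) ->
    (status_code, retry_after_seconds or None).
    """
    idempotency_key = str(uuid.uuid4())  # same key on every attempt
    status = None
    for attempt in range(max_attempts):
        status, retry_after = send(idempotency_key)
        if status not in RETRYABLE or attempt == max_attempts - 1:
            return status
        if retry_after is not None:
            delay = retry_after  # honor the server's Retry-After hint
        else:
            # Exponential backoff with full jitter.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
        sleep(delay)
    return status
```

Reusing the key across attempts lets the server deduplicate a follow that succeeded but whose response was lost, so a retry cannot create a second edge or double-count a follower.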

MySQL/Cassandra: Source of Truth

SQL

CREATE TABLE follows (
    follower_id  BIGINT, followee_id  BIGINT,
    created_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (follower_id, followee_id)
);
CREATE INDEX idx_followee ON follows (followee_id, follower_id);
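
To see how the schema serves both directions, the sketch below loads it into an in-memory SQLite database, which stands in for MySQL here. The helper functions are hypothetical, and `INSERT OR IGNORE` is SQLite syntax (MySQL spells it `INSERT IGNORE`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE follows (
    follower_id  BIGINT, followee_id  BIGINT,
    created_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (follower_id, followee_id)
);
CREATE INDEX idx_followee ON follows (followee_id, follower_id);
""")

def follow(follower_id, followee_id):
    # The primary key makes a repeated follow a no-op instead of an error.
    conn.execute(
        "INSERT OR IGNORE INTO follows (follower_id, followee_id) VALUES (?, ?)",
        (follower_id, followee_id),
    )

def unfollow(follower_id, followee_id):
    conn.execute(
        "DELETE FROM follows WHERE follower_id = ? AND followee_id = ?",
        (follower_id, followee_id),
    )

def following(user_id):
    # Served by the primary key (follower_id, followee_id).
    cur = conn.execute(
        "SELECT followee_id FROM follows WHERE follower_id = ? ORDER BY followee_id",
        (user_id,),
    )
    return [row[0] for row in cur]

def followers(user_id):
    # Served by idx_followee (followee_id, follower_id).
    cur = conn.execute(
        "SELECT follower_id FROM follows WHERE followee_id = ? ORDER BY follower_id",
        (user_id,),
    )
    return [row[0] for row in cur]
```

The primary key answers "who does A follow" and the secondary index answers "who follows A", so neither query scans the table. The cost is that every write touches two index structures.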

Redis: Cache Layer

followers:{uid}  -> Sorted Set (score=timestamp, member=user_id)
following:{uid}  -> Sorted Set (score=timestamp, member=user_id)
follow_count:{uid} -> Hash {followers: 1523, following: 342}

Concern	Solution
Dual write	If followers and following are kept as two denormalized tables, write both in the same Cassandra logged batch or MySQL transaction
Count drift	Async count reconciliation job (hourly) from edge table
Cache invalidation	On follow/unfollow -> invalidate both users' cache entries
Celebrity hot partition	Shard followers by (followee_id, follower_id_prefix)
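
The celebrity mitigation can be sketched as a composite partition key. Note the substitution: this example uses a hash bucket of the follower ID in place of a literal ID prefix, since hashing spreads sequential IDs more evenly. `NUM_BUCKETS` and the function names are assumptions.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical; size to the largest expected follower list

def follower_bucket(follower_id, num_buckets=NUM_BUCKETS):
    """Map a follower to a stable bucket (hash used instead of a raw ID prefix)."""
    digest = hashlib.sha256(str(follower_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets

def partition_key(followee_id, follower_id):
    """Composite key: one celebrity's followers spread over NUM_BUCKETS partitions."""
    return (followee_id, follower_bucket(follower_id))

def all_partitions(followee_id, num_buckets=NUM_BUCKETS):
    """Partitions a reader must scan to list every follower of `followee_id`."""
    return [(followee_id, bucket) for bucket in range(num_buckets)]
```

Writes and point lookups ("does A follow B?") hit exactly one partition. A full follower listing becomes a scatter-gather across all buckets, which is the price paid for removing the hot partition.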

Race: Follow + Unfollow in Quick Succession

T=0: Follow Bob -> INSERT follows (alice, bob)
T=50ms: Unfollow Bob -> DELETE follows (alice, bob)

If out of order at DB:
  DELETE arrives first (no-op, row doesn't exist)
  INSERT arrives second -> alice follows bob (WRONG!)

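Solution: Attach a timestamp to every operation and apply LWW (Last-Writer-Wins).
  Record an unfollow as a timestamped tombstone rather than a bare DELETE,
  so a late INSERT carrying an older timestamp is discarded.
  Or use Cassandra (naturally LWW with cell-level timestamps).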
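
The LWW rule can be sketched in memory as below. `LwwEdgeStore` is a hypothetical class; timestamps are assumed to be assigned where the user action is accepted (not at the database), and on an exact timestamp tie this sketch lets the unfollow win.

```python
class LwwEdgeStore:
    """Last-Writer-Wins edge state keyed by (follower, followee)."""

    def __init__(self):
        # (follower, followee) -> (timestamp_ms, following)
        # An unfollow is kept as a tombstone (following=False), not removed.
        self._edges = {}

    def apply(self, follower, followee, timestamp_ms, following):
        key = (follower, followee)
        current = self._edges.get(key)
        # Compare (timestamp, is_unfollow) so the newer write wins and,
        # on a tie, the unfollow takes precedence.
        incoming_rank = (timestamp_ms, not following)
        if current is None or incoming_rank > (current[0], not current[1]):
            self._edges[key] = (timestamp_ms, following)

    def is_following(self, follower, followee):
        state = self._edges.get((follower, followee))
        return state is not None and state[1]
```

Replaying the race from above with the operations arriving out of order:

```python
store = LwwEdgeStore()
store.apply("alice", "bob", timestamp_ms=50, following=False)  # unfollow arrives first
store.apply("alice", "bob", timestamp_ms=0, following=True)    # stale follow is ignored
print(store.is_following("alice", "bob"))
```

Tombstones must eventually be garbage-collected, but only after any delayed write that could precede them is guaranteed to have arrived.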

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
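
To connect the availability target to an error budget, the arithmetic can be written out as below. The 30-day window and the function names are assumptions; the 99.95% target comes from the table above.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full outage the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(observed_error_ratio, slo):
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return observed_error_ratio / (1 - slo)

print(f"{error_budget_minutes(0.9995):.1f} minutes per 30 days")
print(f"burn rate at 0.1% errors: {burn_rate(0.001, 0.9995):.1f}x")
```

If availability is measured as the request success ratio, a sustained 0.1% 5xx rate spends the 99.95% budget at about twice the sustainable pace. The two targets in the table should therefore be reconciled, or measured over different windows.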

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
