Design a User Presence System

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 75	Staff level: multi-region, cost at scale, migration path, and production metrics.

Interview Prompt

Design a user presence system.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Heartbeat mechanism, WebSocket state, Presence fan-out to friends?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Heartbeat mechanism
WebSocket state
Presence fan-out to friends
Coalescing updates
Capacity estimation with shown math

Out of scope (state explicitly)

Voice/video calling (WebRTC)
Full Signal E2E protocol implementation
Content moderation ML pipeline

Assumptions

Millions of DAU with heavy fan-out — clarify celebrity/hot-key cases early
Eventual consistency acceptable for non-critical side effects (counts, notifications)
WebSocket or push infrastructure available at the edge

Show online status: Green dot for active users
Show last seen: "Last seen 5 minutes ago"
Real-time updates: Status propagates to friends within 5 seconds
Typing indicator: "Alice is typing..." in chat
Privacy controls: Hide status from specific users
Multi-device: Online on any device → online
Idle detection: Mark "Away" after 5 min inactivity
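The multi-device and idle-detection rules above can be sketched as one aggregation function. This is a minimal illustration, not a prescribed implementation: `DeviceState`, `aggregate_status`, and the 60-second liveness window are assumptions introduced here, while the 5-minute idle threshold comes from the requirement above.

```python
from dataclasses import dataclass

IDLE_AFTER_S = 5 * 60    # "Away" after 5 min inactivity (from the requirements)
HEARTBEAT_TTL_S = 60     # assumption: no heartbeat for 60 s means the device is gone


@dataclass
class DeviceState:
    last_heartbeat: float  # epoch seconds of the last heartbeat received
    last_activity: float   # epoch seconds of the last user input on that device


def aggregate_status(devices: list[DeviceState], now: float) -> str:
    """Collapse per-device state into one user-level status."""
    # Only devices that are still sending heartbeats count at all.
    live = [d for d in devices if now - d.last_heartbeat <= HEARTBEAT_TTL_S]
    if not live:
        return "offline"
    # Online on any device means online.
    if any(now - d.last_activity < IDLE_AFTER_S for d in live):
        return "online"
    # Connected somewhere, but idle everywhere.
    return "away"
```

Passing `now` explicitly keeps the function deterministic, which makes the rules easy to unit test.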

Metric	Calculation	Value
Concurrent online users	Given	200M
Heartbeat interval	Given	30 seconds
Heartbeats / sec	200M ÷ 30s heartbeat interval	~6.7M
Status changes / sec	Derived from daily volume ÷ 86400 (+ peak factor)	~2M
Avg friends per user	Given	500
Fan-out (optimized)	Given	~15 per change
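The arithmetic behind the table can be checked in a few lines. The inputs are the table's values; the pushes-per-second figures are derived here for illustration and are not stated in the table.

```python
# Inputs taken from the capacity table above.
concurrent_users = 200_000_000
heartbeat_interval_s = 30
status_changes_per_s = 2_000_000
avg_friends = 500
fanout_per_change = 15  # optimized fan-out per status change

# Every connected client sends one heartbeat per interval.
heartbeats_per_s = concurrent_users / heartbeat_interval_s

# Derived (not in the table): pushes per second with and without optimization.
naive_pushes_per_s = status_changes_per_s * avg_friends
optimized_pushes_per_s = status_changes_per_s * fanout_per_change

print(f"heartbeats/s:       {heartbeats_per_s / 1e6:.1f}M")
print(f"naive pushes/s:     {naive_pushes_per_s / 1e6:.0f}M")
print(f"optimized pushes/s: {optimized_pushes_per_s / 1e6:.0f}M")
```

Under these numbers, pushing every change to all 500 friends would mean roughly a billion pushes per second, which is why the optimized fan-out of about 15 per change matters.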

Batch Presence Query

HTTP

POST /api/v1/presence/batch
{ "user_ids": ["u1", "u2", "u3"] }
→ { "u1": {"status": "online"}, "u2": {"status": "offline", "last_seen": "..."} }
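A rough sketch of how a handler might assemble that response shape. The function name and the two in-memory dictionaries are hypothetical stand-ins for the live presence store and the last-seen table; the sketch assumes that a user with no live entry is reported as offline.

```python
def batch_presence(user_ids, live_status, last_seen):
    """Build the batch response.

    live_status: user_id -> "online" | "away" for users with a live entry
    last_seen:   user_id -> timestamp string for users seen before
    """
    result = {}
    for uid in user_ids:
        status = live_status.get(uid)
        if status is not None:
            result[uid] = {"status": status}
            continue
        # No live entry: report offline, plus last_seen when it is known.
        entry = {"status": "offline"}
        if uid in last_seen:
            entry["last_seen"] = last_seen[uid]
        result[uid] = entry
    return result
```

In a real service the two lookups would likely be batched (one multi-get per store) rather than done per user.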

Heartbeat

JSON

// WebSocket
{"type": "heartbeat", "device": "mobile"}
→ {"type": "heartbeat_ack"}

Subscribe to Updates

JSON

// WebSocket
{"type": "subscribe_presence", "user_ids": ["u1", "u2"]}
// Server pushes:
{"type": "presence_update", "user_id": "u1", "status": "offline"}
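Coalescing is listed in scope, and one way to picture it is a small buffer in front of the push path: within a flush window only the latest status per user survives, and nothing is pushed if the net status did not change. The class below is a hypothetical sketch under those assumptions. The flush interval is left to the caller and would need to stay under the 5-second propagation target.

```python
class PresenceCoalescer:
    """Buffer status changes and emit at most one update per user per flush."""

    def __init__(self):
        self._last_sent = {}  # user_id -> status last pushed to subscribers
        self._pending = {}    # user_id -> latest status seen in this window

    def record(self, user_id, status):
        # Later changes overwrite earlier ones inside the same window.
        self._pending[user_id] = status

    def flush(self):
        updates = []
        for user_id, status in self._pending.items():
            # Skip flapping that ends where it started (e.g. a brief reconnect).
            if self._last_sent.get(user_id) != status:
                updates.append(
                    {"type": "presence_update", "user_id": user_id, "status": status}
                )
                self._last_sent[user_id] = status
        self._pending.clear()
        return updates
```

A user who drops and reconnects within one window produces no push at all, which is where most of the fan-out savings would come from on flaky mobile networks.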

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
440 Login Timeout (non-standard; a Microsoft IIS extension): WebSocket session expired; reconnect required
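The retry guidance in this list can be sketched as a small client-side helper. This is an illustration only: the function name, the base and cap values, and the choice of full jitter are assumptions, and `Retry-After` is assumed to arrive as a number of seconds (it can also be an HTTP date).

```python
import random

# Codes the list above marks as worth retrying.
RETRYABLE = {429, 500, 503}


def retry_delay(status, attempt, retry_after=None, base=0.5, cap=30.0, rng=random.random):
    """Return seconds to wait before retrying, or None if the call should not be retried."""
    if status not in RETRYABLE:
        return None
    # Honor the server's hint when rate limited.
    if status == 429 and retry_after is not None:
        return float(retry_after)
    # Exponential backoff with full jitter: a random delay in [0, min(cap, base * 2^attempt)).
    ceiling = min(cap, base * 2 ** attempt)
    return ceiling * rng()
```

Jitter keeps a fleet of reconnecting clients from retrying in lockstep after an outage. Retried writes would still need the idempotency key mentioned above.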

Redis

presence:{uid}:{device}  → "online"|"away" (TTL: 60s; mobile: 120s)
presence:{uid}           → Hash {status, last_active} (TTL: 60s)
presence_subs:{uid}      → SET of subscriber user_ids
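To make the TTL semantics concrete, here is an in-memory stand-in for the per-device keys. It is not Redis client code: `PresenceStore` is a hypothetical class that imitates key expiry with an explicit clock, so the behavior can be tested without a server. The TTL values come from the key layout above.

```python
DEFAULT_TTL_S = 60   # from the key layout above
MOBILE_TTL_S = 120   # mobile devices get a longer TTL


class PresenceStore:
    """In-memory imitation of expiring presence:{uid}:{device} keys."""

    def __init__(self):
        self._data = {}  # key -> (status, expires_at)

    def heartbeat(self, uid, device, now, status="online"):
        # Each heartbeat rewrites the key and resets its expiry.
        ttl = MOBILE_TTL_S if device == "mobile" else DEFAULT_TTL_S
        self._data[f"presence:{uid}:{device}"] = (status, now + ttl)

    def get(self, key, now):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            del self._data[key]  # lazily expire, as a TTL store would
            return None
        return value

    def is_online(self, uid, now):
        # A key scan is fine for a sketch; the aggregate presence:{uid} hash
        # in the layout above exists so production reads avoid scanning.
        prefix = f"presence:{uid}:"
        keys = [k for k in self._data if k.startswith(prefix)]
        return any(self.get(k, now) is not None for k in keys)
```

With a 30-second heartbeat interval, a 60-second TTL tolerates one missed heartbeat before the user flips to offline. No explicit disconnect message is needed, since a crashed client simply stops refreshing its key.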

Cassandra: Last Seen

CQL

CREATE TABLE last_seen (
    user_id UUID PRIMARY KEY,
    last_active TIMESTAMP,
    device TEXT
);
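The `last_active` column feeds the "Last seen 5 minutes ago" string from the requirements. A possible client-side formatter is sketched below. The function name, the unit thresholds, and the "just now" wording are assumptions.

```python
from datetime import datetime, timedelta, timezone


def format_last_seen(last_active: datetime, now: datetime) -> str:
    """Render a last_active timestamp as a relative 'Last seen ...' string."""
    # Clamp to zero so small clock skew never yields a negative age.
    seconds = max(0, int((now - last_active).total_seconds()))
    for unit, size in (("day", 86400), ("hour", 3600), ("minute", 60)):
        if seconds >= size:
            n = seconds // size
            return f"Last seen {n} {unit}{'' if n == 1 else 's'} ago"
    return "Last seen just now"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(format_last_seen(now - timedelta(minutes=5), now))
```

Formatting on the client means the server only ships a timestamp, and the string stays correct as time passes without another request.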

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
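An availability target is easier to reason about as minutes of allowed downtime. The helper below is a small sketch of that conversion, assuming a 30-day window (the table does not state one).

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage a given availability SLO allows per window."""
    return (1.0 - slo) * window_days * 24 * 60


# 99.95% over 30 days leaves about 21.6 minutes of downtime.
print(f"{error_budget_minutes(0.9995):.1f} min")
```

At roughly 21.6 minutes per month, a single slow manual failover could consume most of the budget, which is a useful argument for the automated rollback and failover paths discussed next.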

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in a secondary region for DR. Move to active-active only when the latency SLO or data residency requires it, and accept the conflict-resolution complexity explicitly.
