This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 75 | Staff level: multi-region, cost at scale, migration path, and production metrics. |
Interview Prompt
Design a user presence system.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Heartbeat mechanism, WebSocket state, Presence fan-out to friends? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Heartbeat mechanism
- WebSocket state
- Presence fan-out to friends
- Coalescing updates
- Capacity estimation with shown math
Out of scope (state explicitly)
- Voice/video calling (WebRTC)
- Full Signal E2E protocol implementation
- Content moderation ML pipeline
Assumptions
- Millions of DAU with heavy fan-out — clarify celebrity/hot-key cases early
- Eventual consistency acceptable for non-critical side effects (counts, notifications)
- WebSocket or push infrastructure available at the edge
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
Functional Requirements
- Show online status: Green dot for active users
- Show last seen: "Last seen 5 minutes ago"
- Real-time updates: Status propagates to friends within 5 seconds
- Typing indicator: "Alice is typing..." in chat
- Privacy controls: Hide status from specific users
- Multi-device: Online on any device → online
- Idle detection: Mark "Away" after 5 min inactivity
Non-Functional Requirements
- Low Latency: Status propagated in < 5 seconds
- Scale: 200M+ concurrent users; ~500 friends each
- Bandwidth Efficient: Don't flood network
- Eventual Consistency: 5-second lag acceptable
- Availability: 99.99%
| Metric | Calculation | Value |
|---|---|---|
| Concurrent online users | Given | 200M |
| Heartbeat interval | Given | 30 seconds |
| Heartbeats / sec | 200M ÷ 30s heartbeat interval | ~6.7M |
| Status changes / sec | Derived from daily volume ÷ 86400 (+ peak factor) | ~2M |
| Avg friends per user | Given | 500 |
| Fan-out (optimized) | Given | ~15 per change |
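The arithmetic behind the table can be checked with a few lines of Python. This is a minimal sketch: the function name `capacity` and its parameter names are illustrative, and the inputs are the "Given" values from the table.

```python
def capacity(concurrent_users: int, heartbeat_interval_s: int,
             status_changes_per_s: int, friends_per_user: int,
             optimized_fanout: int) -> dict:
    """Return back-of-envelope rates derived from the table's inputs."""
    return {
        # Every connected client sends one heartbeat per interval.
        "heartbeats_per_s": concurrent_users / heartbeat_interval_s,
        # Naive fan-out: every change is pushed to every friend.
        "naive_pushes_per_s": status_changes_per_s * friends_per_user,
        # Optimized fan-out: only friends who currently need the update.
        "optimized_pushes_per_s": status_changes_per_s * optimized_fanout,
    }


est = capacity(200_000_000, 30, 2_000_000, 500, 15)
for name, value in est.items():
    print(f"{name}: {value:,.0f}")
```

Keeping the inputs as parameters makes it easy to redo the estimate when the interviewer changes an assumption such as the heartbeat interval.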
Heartbeat + TTL
Client sends a heartbeat every 30s. Server runs SETEX presence:{uid}:{device} 60 "online" (Redis argument order is key, seconds, value). No heartbeat for 60s → key expires → that device appears offline. This tolerates brief disconnects. Multi-device: one presence key per device; the user is online if ANY device key exists.
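The heartbeat and TTL behaviour can be sketched without a Redis server. The class below is an in-memory stand-in, not a Redis client: `PresenceStore` and its methods are hypothetical names, and time is passed in explicitly so the expiry logic is easy to follow.

```python
class PresenceStore:
    """In-memory stand-in for Redis keys with a TTL (illustrative only)."""

    def __init__(self, ttl_s: float = 60.0):
        self.ttl_s = ttl_s
        # uid -> {device: expiry timestamp}; mirrors presence:{uid}:{device}
        self._expiry = {}

    def heartbeat(self, uid: str, device: str, now: float) -> None:
        """Refresh the device key, like SETEX with a fresh TTL."""
        self._expiry.setdefault(uid, {})[device] = now + self.ttl_s

    def is_online(self, uid: str, now: float) -> bool:
        """Online if ANY device key has not expired yet."""
        return any(exp > now for exp in self._expiry.get(uid, {}).values())


store = PresenceStore(ttl_s=60.0)
store.heartbeat("u1", "mobile", now=0.0)   # mobile key expires at t=60
store.heartbeat("u1", "web", now=50.0)     # web key expires at t=110
print(store.is_online("u1", now=100.0))    # web key still live
```

In a real deployment Redis expires the keys itself; the explicit `now` argument here only replaces the server clock.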
Fan-Out Optimization
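Naive: 2M changes/sec × 500 friends = 1B pushes/sec. Optimized: notify only ONLINE friends (~100), then only friends with the chat OPEN (~15), which brings the load down to 2M × 15 = 30M pushes/sec before batching. Batch updates every 5s, pull on open, and push while open. This reportedly resembles how large messengers such as WhatsApp and Instagram handle presence.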
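Batching can be sketched as a coalescer that keeps only the latest status per user for each subscriber and is flushed once per window. This is a minimal sketch under stated assumptions: `PresenceCoalescer` is a hypothetical name, and the caller is assumed to invoke `flush()` on a 5-second timer and to pass only subscribers whose chat is open.

```python
from collections import defaultdict


class PresenceCoalescer:
    """Collapse repeated status changes into one update per flush window."""

    def __init__(self):
        # subscriber id -> {user id: latest status seen in this window}
        self._pending = defaultdict(dict)

    def record(self, user_id: str, status: str, open_chat_subscribers) -> None:
        """Queue a change for each subscriber; later changes overwrite earlier ones."""
        for subscriber in open_chat_subscribers:
            self._pending[subscriber][user_id] = status

    def flush(self) -> dict:
        """Return one batch per subscriber and reset the window."""
        batches = {sub: dict(updates) for sub, updates in self._pending.items()}
        self._pending.clear()
        return batches


coalescer = PresenceCoalescer()
coalescer.record("u1", "online", ["a", "b"])
coalescer.record("u1", "away", ["a"])      # overwrites u1's status for "a"
coalescer.record("u2", "offline", ["a"])
first_batch = coalescer.flush()
second_batch = coalescer.flush()           # nothing new since the last flush
```

A user who flaps between states within one window costs a single push per subscriber instead of one push per change.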
Last Seen Persistence
On going offline → persist to Cassandra: last_seen (user_id, last_active, device). Read Path: Query Redis. If present → "Online". If absent → Cassandra lookup → "Last seen at {time}".
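The read path can be sketched as a two-step lookup. In this sketch a set stands in for the Redis presence keys and a dict stands in for the Cassandra `last_seen` table; `resolve_presence` and its parameters are illustrative names, not part of any real client library.

```python
from datetime import datetime, timezone


def resolve_presence(uid: str, online_uids: set, last_seen: dict) -> dict:
    """Check the hot store first, then fall back to the durable last-seen record."""
    if uid in online_uids:                 # stands in for the Redis key check
        return {"status": "online"}
    ts = last_seen.get(uid)                # stands in for the Cassandra lookup
    if ts is None:
        return {"status": "offline"}       # never seen: no timestamp to show
    return {"status": "offline", "last_seen": ts.isoformat()}


online = {"u1"}
seen = {"u2": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)}
print(resolve_presence("u1", online, seen))
print(resolve_presence("u2", online, seen))
```

Returning the raw ISO timestamp keeps formatting decisions on the client.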
WebSocket Connection Routing
Connection registry in Redis: ws_location:{uid} → {server, device}. To push to user: look up server, route via Redis Pub/Sub. Multi-device: push to all registered servers.
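The registry lookup can be sketched as a map from user to the gateways holding that user's sockets. This is an in-memory illustration: `ConnectionRegistry` and the gateway names are hypothetical, and a production version would keep this state in Redis as described above.

```python
class ConnectionRegistry:
    """Tracks which gateway server holds each device's WebSocket."""

    def __init__(self):
        # uid -> {device: server}; mirrors ws_location:{uid}
        self._locations = {}

    def register(self, uid: str, device: str, server: str) -> None:
        self._locations.setdefault(uid, {})[device] = server

    def unregister(self, uid: str, device: str) -> None:
        devices = self._locations.get(uid, {})
        devices.pop(device, None)
        if not devices:
            self._locations.pop(uid, None)   # drop empty entries

    def servers_for(self, uid: str) -> set:
        """Every server that must receive a push for this user."""
        return set(self._locations.get(uid, {}).values())


registry = ConnectionRegistry()
registry.register("u1", "mobile", "gw-1")
registry.register("u1", "web", "gw-2")
targets_before = registry.servers_for("u1")
registry.unregister("u1", "mobile")
targets_after = registry.servers_for("u1")
```

Returning a set means two devices on the same gateway produce one routed message, and the gateway fans out locally.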
Batch Presence Query
POST /api/v1/presence/batch
{ "user_ids": ["u1", "u2", "u3"] }
→ { "u1": {"status": "online"}, "u2": {"status": "offline", "last_seen": "..."} }
Heartbeat
// WebSocket
{"type": "heartbeat", "device": "mobile"}
→ {"type": "heartbeat_ack"}
Subscribe to Updates
// WebSocket
{"type": "subscribe_presence", "user_ids": ["u1", "u2"]}
// Server pushes:
{"type": "presence_update", "user_id": "u1", "status": "offline"}
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
- 440 Login Timeout (non-standard code): WebSocket session expired; reconnect required
Redis
presence:{uid}:{device} → "online"|"away" (TTL: 60s, mobile:120s)
presence:{uid} → Hash {status, last_active} (TTL: 60s)
presence_subs:{uid} → SET of subscriber user_ids
Cassandra: Last Seen
CREATE TABLE last_seen (
    user_id UUID PRIMARY KEY,
    last_active TIMESTAMP,
    device TEXT
);
WS crash: Client reconnects + heartbeat → status recovers in < 60s.
Redis down: Users show "unknown" status; degrade gracefully.
Force-kill app: No OFFLINE sent → TTL expires in 60s → correct.
Multi-device offline detection: Lua script checks ALL device keys atomically before declaring offline.
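Heartbeat delayed after TTL expiry: Keep the TTL at 2-2.5× the heartbeat interval (60-75s for a 30s heartbeat) so a single late heartbeat does not flip the user offline. Grace period: wait 10s after expiry before broadcasting offline.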
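The multi-device check and the grace period can be combined in one decision function. This is a sketch under stated assumptions: `should_broadcast_offline` is a hypothetical name, the input is a snapshot of each device key's expiry time, and the atomicity that the Lua script provides in Redis is implicit here because the function runs in a single process.

```python
def should_broadcast_offline(device_expiries: dict, now: float,
                             grace_s: float = 10.0) -> bool:
    """Broadcast offline only after ALL device keys expired and the grace period passed.

    device_expiries maps device name -> expiry timestamp of its presence key.
    """
    if not device_expiries:
        return False                      # user was never online; nothing to announce
    latest_expiry = max(device_expiries.values())
    return now >= latest_expiry + grace_s


expiries = {"mobile": 60.0, "web": 90.0}
print(should_broadcast_offline(expiries, now=95.0))    # web expired, still in grace
print(should_broadcast_offline(expiries, now=100.0))   # grace period over
```

The grace period trades a slightly later offline notification for fewer offline/online flaps when a heartbeat arrives just after expiry.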
Typing Indicator
On each keystroke (throttled to at most one event every 3s): WebSocket → Redis SETEX with a 5s TTL → route to the conversation partner. Direct WS-to-WS routing via Redis Pub/Sub. No persistence needed. For group chats, send only to members with the chat window OPEN.
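The throttle described above can be sketched as a per-conversation rate limiter. `TypingThrottle` is a hypothetical name, and the clock is passed in explicitly; a client would normally call it with a monotonic timestamp on each keystroke.

```python
class TypingThrottle:
    """Allow at most one typing event per (user, conversation) every min_interval_s."""

    def __init__(self, min_interval_s: float = 3.0):
        self.min_interval_s = min_interval_s
        self._last_sent = {}   # (uid, conversation_id) -> time of last sent event

    def should_send(self, uid: str, conversation_id: str, now: float) -> bool:
        key = (uid, conversation_id)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.min_interval_s:
            return False       # too soon; drop this keystroke's event
        self._last_sent[key] = now
        return True


throttle = TypingThrottle(min_interval_s=3.0)
results = [throttle.should_send("alice", "c1", t) for t in (0.0, 1.0, 2.9, 3.0)]
other_chat = throttle.should_send("alice", "c2", 1.0)
```

With a 3s throttle and a 5s TTL on the indicator, continuous typing keeps the indicator alive while a pause of more than 5s lets it expire on its own.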
Privacy Model
Settings stored in Redis: privacy:{uid} → {presence, last_seen, typing}. Values: everyone, friends, nobody, custom. Blocklist: SADD presence_hidden:{uid} {blocked}. Check at both query time AND push time.
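The privacy check can be sketched as one function called on both the query path and the push path. This is illustrative only: `can_see_presence` is a hypothetical name, and treating "custom" as an explicit allow-list is an assumption, since the setting's semantics are not spelled out above.

```python
def can_see_presence(viewer: str, setting: str, friends: set,
                     hidden_from: set, custom_allow: frozenset = frozenset()) -> bool:
    """Decide whether viewer may see the owner's presence."""
    if viewer in hidden_from:          # blocklist always wins
        return False
    if setting == "everyone":
        return True
    if setting == "friends":
        return viewer in friends
    if setting == "custom":            # assumption: custom means an allow-list
        return viewer in custom_allow
    return False                       # "nobody" or unknown value: fail closed


friends = {"bob", "carol"}
hidden = {"carol"}
print(can_see_presence("bob", "friends", friends, hidden))
print(can_see_presence("carol", "everyone", friends, hidden))
```

Sharing one function between the API filter and the gateway keeps the two enforcement points from drifting apart.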
Interview Walkthrough
- Model presence as heartbeat + TTL: client sends ping every 30s, Redis SETEX marks user online with 60s expiry — no DB writes needed.
- Route status changes through WebSocket gateways subscribed to Redis Pub/Sub channels keyed by user ID or conversation.
- On disconnect, let the TTL expire naturally rather than doing synchronous cleanup; with a 30s heartbeat and a 60s TTL the user can appear stale-online for 30-60s, which is acceptable UX.
- Enforce privacy at both query time (API filters blocked users) and push time (gateway checks before forwarding status updates).
- Return raw last-seen timestamps from the server; let the client format "2 minutes ago" based on locale and timezone.
- Handle typing indicators as ephemeral Redis SETEX keys (5s TTL) with 3-second client throttle — no persistence required.
- Scale to 200M concurrent users with Redis Cluster sharded by user ID; keys per node equal total presence keys ÷ shard count (for example, 200M keys over 100 shards is ~2M keys per node).
- Common pitfall: polling a database every 5 seconds for every friend's online status — query load scales as O(friends × users), not O(users).
Redis + TTL vs Custom Gossip vs XMPP vs Firebase
- Redis + TTL ⭐: <1ms latency, 200M+ with cluster, low complexity. Best for most apps.
- Custom gossip protocol: needed only at WhatsApp (2B users) scale.
- XMPP: suited to small-to-medium IM.
- Firebase Presence: suited to prototypes.
Client-Side Status Formatting
Server returns raw timestamp → client formats based on locale/timezone. Rules: < 1 min → "just now", < 60 min → "N minutes ago", today → "today at 2:30 PM", yesterday → "yesterday at 10:15 PM", < 7 days → "Monday at 3:00 PM", ≥ 7 days → "Mar 7".
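The formatting rules can be sketched as a small client-side function. Assumptions in this sketch: both datetimes are already converted to the client's local time, day and month names come from the default English locale, and `format_last_seen` is a hypothetical name.

```python
from datetime import datetime, timedelta


def _clock(ts: datetime) -> str:
    """Render a 12-hour clock time such as '2:30 PM' without platform-specific flags."""
    hour12 = ts.hour % 12 or 12
    suffix = "AM" if ts.hour < 12 else "PM"
    return f"{hour12}:{ts.minute:02d} {suffix}"


def format_last_seen(last_seen: datetime, now: datetime) -> str:
    delta = now - last_seen
    if delta < timedelta(minutes=1):
        return "just now"
    if delta < timedelta(hours=1):
        minutes = int(delta.total_seconds() // 60)
        return f"{minutes} minute{'s' if minutes != 1 else ''} ago"
    days_apart = (now.date() - last_seen.date()).days
    if days_apart == 0:
        return f"today at {_clock(last_seen)}"
    if days_apart == 1:
        return f"yesterday at {_clock(last_seen)}"
    if delta < timedelta(days=7):
        return f"{last_seen.strftime('%A')} at {_clock(last_seen)}"
    return f"{last_seen.strftime('%b')} {last_seen.day}"


now = datetime(2024, 3, 14, 15, 0)
print(format_last_seen(datetime(2024, 3, 13, 22, 15), now))
```

Comparing calendar dates rather than elapsed hours is what separates "today" from "yesterday" around midnight.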
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core user presence system flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
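An availability target translates directly into an error budget. The sketch below converts a percentage into allowed downtime; the 30-day window and the function name `error_budget_minutes` are assumptions made for illustration.

```python
def error_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given availability target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {error_budget_minutes(target):.2f} min per 30 days")
```

Quoting the budget in minutes makes it easier to argue whether a 5-minute automated rollback fits inside the target.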
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.