This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 75 | Staff level: multi-region, cost at scale, migration path, and production metrics. |
Interview Prompt
Design a Live Comments System (like Facebook Live / YouTube Live).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Ordered message delivery, Comment fan-out, Profanity filtering? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Ordered message delivery
- Comment fan-out
- Profanity filtering
- Rate limiting per user
- Capacity estimation with shown math
Out of scope (state explicitly)
- Full ML ranking model training pipeline
- Direct messaging / chat (#07)
- Ad insertion and monetization
Assumptions
- Clarify scale (DAU, QPS, data volume) for the live comments system in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
Functional Requirements
- Post comments: Allow users to submit real-time comments on active live videos or streaming events.
- Real-time delivery: Broadcast posted comments to all active viewers within 1 second.
- Pinned comments: Enable hosts/moderators to pin selected comments to the top of the chat panel.
- Reply threads: Support structured, short reply chains attached to individual comments.
- Automated moderation: Filter profanity, detect spam bots, and enforce block lists dynamically.
- Comment rate throttling: Implement adaptive sampling to handle viral streams without degrading UI performance.
- Historical comments: Persist comments long-term, keeping them scrollable after live events conclude.
Non-Functional Requirements
- Low Latency: Deliver comments to millions of concurrent viewers globally in under 1 second.
- High Throughput: Design for 100K+ incoming comments per second during peak viral streams.
- Viewer Scalability: Support up to 10 Million concurrent active connections on a single massive stream.
- High Availability: Deliver 99.99% availability on the comment submission and viewing pipelines.
- Approximate Ordering: Ensure comments scroll in roughly chronological order.
- Rapid Filtering: Execute profanity checks and moderation lookups in < 50 ms.
| Metric | Calculation | Value |
|---|---|---|
| Concurrent Live Streams | Given (peak load assumption) | 50,000 |
| Peak Viewers on One Stream | Given (peak load assumption) | 10,000,000 |
| Comments / sec (One Hot Stream) | Given (peak load assumption) | 100,000 |
| Comments / sec (Global) | Given (peak load assumption) | 500,000 |
| Avg Comment Size | Given (typical workload assumption) | 200 bytes |
| Aggregate Network Throughput | 500,000 comments/sec × 200 bytes | 100 MB/s |
I/O and Bandwidth Calculations:
- Ingestion Data Rate: 500,000 comments/sec (global) × 200 bytes = 100 MB/s write load.
- Storage Footprint (Raw): 100 MB/s × 3,600 seconds = 360 GB per hour.
- Storage Footprint (Compressed): Cassandra compression reduces this by ~3×, requiring ~120 GB per hour of persistent disk.
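The arithmetic above can be checked with a short script. This is a minimal sketch: the constant and function names are illustrative, and the ~3× compression ratio is the assumption stated above rather than a measured value.

```python
# Back-of-the-envelope check of the ingestion and storage figures.
COMMENT_BYTES = 200                 # average comment size (given)
GLOBAL_COMMENTS_PER_SEC = 500_000   # global peak (given)
COMPRESSION_RATIO = 3               # assumed ~3x, as stated in the text


def ingest_rate_mb_per_sec(comments_per_sec: int, comment_bytes: int) -> float:
    """Write load in MB/s (1 MB = 10^6 bytes)."""
    return comments_per_sec * comment_bytes / 1_000_000


def storage_gb_per_hour(mb_per_sec: float, compression_ratio: float = 1.0) -> float:
    """Disk consumed per hour in GB, after optional compression."""
    return mb_per_sec * 3_600 / 1_000 / compression_ratio


rate = ingest_rate_mb_per_sec(GLOBAL_COMMENTS_PER_SEC, COMMENT_BYTES)
raw_gb = storage_gb_per_hour(rate)
compressed_gb = storage_gb_per_hour(rate, COMPRESSION_RATIO)
print(f"{rate} MB/s, {raw_gb} GB/h raw, {compressed_gb} GB/h compressed")
```

These are peak-rate figures; sustained averages would be lower, so treating them as an upper bound for disk planning seems reasonable.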
Clients submit comments over WebSocket Gateways to the Comment Service, which assigns Snowflake IDs and runs pre-filters. Approved comments are published to Redis Pub/Sub, which broadcasts them back to the WebSocket Gateways using hierarchical fan-out. Asynchronously, comments flow through Kafka and are persisted in a Cassandra cluster.
1. End-to-End Comment Lifecycle
Processing comments from the keystroke to global broadcast requires coordination across multiple microservices:
1. Client publishes Comment → sends via WebSocket connection to Gateway.
2. Comment Service receives the message → assigns a globally unique, time-ordered Snowflake ID.
3. Moderation Service performs async checks (< 50ms):
- Fast dictionary profanity lookup.
- Machine Learning spam classifier (flags repetitive text, suspicious links, excessive rates).
- Redis SET lookup to check if user is on the stream's banned list.
4. If approved → publishes immediately to Redis Pub/Sub: stream:{stream_id}:comments
5. WebSocket Gateways fan out the comment to all local client connections subscribed to that channel.
6. Async task writes event to Kafka → consumed and stored in Cassandra for long-term persistence.
2. Hierarchical Fan-Out (Scaling to 10M Viewers) ⭐
Fanning out 100 comments/sec directly to 10M viewers requires broadcasting 1 Billion messages every second, which saturates network cards and CPU threads. We resolve this by decoupling delivery layers:
Problem: 10M concurrent viewers × 100 comments/sec = 1 Billion messages/sec. Direct, per-message fan-out from a single server is a physical impossibility.
Solution: Hierarchical Fan-Out Architecture
- Tier 1: Comment Service writes directly to a Redis Pub/Sub channel per stream.
- Tier 2: WebSocket cluster nodes (100 servers) subscribe to their respective Redis channels.
- Tier 3: Each individual WebSocket server handles a capped pool of local connections (e.g., 100K clients).
Calculation:
- 10M viewers / 100K clients per server = 100 WebSocket servers.
- Each WebSocket node consumes only ~100 messages/sec from Redis.
- Local processes fan out these 100 comments/sec to the 100K connected viewers.
- Redis cluster bandwidth: 100 nodes × 100 messages/sec = 10K events/sec globally (extremely lightweight).
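The fan-out calculation can be expressed as a small planning helper. This is a sketch only: `FanOutPlan` and `plan_fan_out` are hypothetical names, and the model ignores batching, headroom, and uneven connection distribution.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FanOutPlan:
    gateway_servers: int              # Tier 2/3 WebSocket nodes needed
    redis_deliveries_per_sec: int     # messages Redis delivers to all gateways
    pushes_per_server_per_sec: int    # socket writes each gateway performs
    total_client_pushes_per_sec: int  # end-to-end deliveries to viewers


def plan_fan_out(viewers: int, comments_per_sec: int, clients_per_server: int) -> FanOutPlan:
    """Size a two-level fan-out: Redis -> gateways -> local connections."""
    servers = math.ceil(viewers / clients_per_server)
    local_clients = min(viewers, clients_per_server)
    return FanOutPlan(
        gateway_servers=servers,
        redis_deliveries_per_sec=servers * comments_per_sec,
        pushes_per_server_per_sec=local_clients * comments_per_sec,
        total_client_pushes_per_sec=viewers * comments_per_sec,
    )


plan = plan_fan_out(viewers=10_000_000, comments_per_sec=100, clients_per_server=100_000)
print(plan)
```

In this model each gateway still performs 100K × 100 = 10M socket writes per second, which is one reason to batch several comments into a single push frame.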
3. Dynamic Server-Side Sampling
Viral chat streams scroll too quickly for human reading. We maintain stream usability while preserving long-term records:
Problem: At 100K comments/sec, displaying every comment is unreadable to humans and saturates client bandwidth (100K × 200 bytes = 20 MB/s per viewer).
Solution: Dynamic Server-Side Sampling
- Take a uniform random sample of ~50 comments/sec to push to viewers.
- High-priority comments ALWAYS bypass sampling:
  - Host/Moderator comments.
  - Verified user comments.
  - Pinned comments.
  - High-engagement comments (comments with many likes).
- Stored records: ALL 100K comments/sec continue to stream to Kafka and Cassandra, enabling complete chronological scrolling in historical replays after the stream ends.
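One way to sketch the sampling step is to process comments in one-second windows: priority comments pass through untouched and the rest are sampled uniformly up to a budget. The `Comment` type, the `is_priority` flag, and the windowing approach are assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Comment:
    comment_id: int
    text: str
    # True for host, moderator, verified, pinned, or high-engagement comments.
    is_priority: bool = False


def sample_window(comments: List[Comment], budget: int = 50,
                  rng: Optional[random.Random] = None) -> List[Comment]:
    """Pick what to broadcast for one window of comments.

    Priority comments always bypass sampling, so the result may exceed
    `budget` by the number of priority comments in the window.
    """
    rng = rng or random.Random()
    priority = [c for c in comments if c.is_priority]
    regular = [c for c in comments if not c.is_priority]
    sampled = rng.sample(regular, min(budget, len(regular)))
    # Restore approximate chronological order by ID before broadcasting.
    return sorted(priority + sampled, key=lambda c: c.comment_id)


window = [Comment(i, f"comment {i}", is_priority=(i % 400 == 0)) for i in range(1, 1001)]
broadcast = sample_window(window, budget=50, rng=random.Random(7))
print(len(broadcast), "of", len(window), "comments broadcast")
```

The unsampled window would still be written to the durable log in full; only the broadcast path is thinned.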
Event Bus Design (Kafka)
Topic: live_comments_system-events
Partitions: 64 (scale consumers horizontally)
Partition key: entity_id (e.g. user_id or stream_id; preserves per-entity ordering)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "live_comments_system-processors"
- At-least-once delivery + idempotent handlers (dedup by event_id)
- DLQ topic: live_comments_system-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
For the live comments system, async side effects MUST NOT block the synchronous API response.
Sync path: validate → persist source of truth → publish event → return 201
Async path: consumers update caches, indexes, notifications, aggregates
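The consumer-side rules (at-least-once delivery, dedup by event_id, dead-lettering after 3 retries) can be sketched without a Kafka client. Everything below is an in-memory illustration: `IdempotentConsumer` is a hypothetical class, the seen-ID set would need to live in a shared store in practice, and "3 retries" is interpreted here as one initial attempt plus three retries.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set

MAX_RETRIES = 3  # retries after the first attempt, per the DLQ rule above


@dataclass
class IdempotentConsumer:
    handler: Callable[[dict], None]
    seen_event_ids: Set[str] = field(default_factory=set)
    dead_letters: List[dict] = field(default_factory=list)

    def process(self, event: dict) -> str:
        """Handle one event delivered at-least-once."""
        event_id = event["event_id"]
        if event_id in self.seen_event_ids:
            return "duplicate"  # redelivery: skip side effects
        for _attempt in range(1 + MAX_RETRIES):
            try:
                self.handler(event)
            except Exception:
                continue  # a real worker would back off between attempts
            self.seen_event_ids.add(event_id)
            return "processed"
        # Poison message: park it so the partition keeps moving.
        self.dead_letters.append(event)
        return "dead_lettered"


persisted: List[str] = []


def persist_comment(event: dict) -> None:
    if event.get("malformed"):
        raise ValueError("cannot parse comment payload")
    persisted.append(event["event_id"])


consumer = IdempotentConsumer(persist_comment)
print(consumer.process({"event_id": "evt-1"}))
print(consumer.process({"event_id": "evt-1"}))
print(consumer.process({"event_id": "evt-2", "malformed": True}))
```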
1. Post Comment (WebSocket)
// Client posts comment via WebSocket:
{
  "type": "comment",
  "stream_id": "live-123",
  "text": "Amazing performance! 🔥"
}
// Server responds with immediate acknowledgement:
{
  "type": "comment_ack",
  "comment_id": "881239589218311",
  "status": "published"
}
2. Receive Comments (Server Push)
// WebSocket server pushes batched comments to client:
{
  "type": "new_comments",
  "stream_id": "live-123",
  "comments": [
    { "id": "cmt-101", "user": { "name": "Alice", "avatar": "https://..." }, "text": "Wow!", "ts": 1710320000 },
    { "id": "cmt-102", "user": { "name": "Bob" }, "text": "Let's go!", "ts": 1710320001 }
  ]
}
3. Get Historical Comments (REST API)
GET /api/v1/streams/{stream_id}/comments?before={cursor}&limit=50
Response: 200 OK
{
  "comments": [...],
  "next_cursor": "88123958921500"
}
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
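A client-side retry policy for the error codes above might look like the following sketch. The function name, the base and cap values, and the decision to treat only 429, 500, and 503 as automatically retryable are assumptions; 409 is left to the caller because a safe retry depends on the idempotency key.

```python
import random
from typing import Optional

RETRYABLE_STATUSES = {429, 500, 503}  # assumption: 4xx validation errors are never retried


def retry_delay(status: int, attempt: int, retry_after: Optional[float] = None,
                base: float = 0.5, cap: float = 30.0,
                rng: Optional[random.Random] = None) -> Optional[float]:
    """Return seconds to wait before retrying, or None if the request should not be retried.

    `attempt` is 0 for the first retry. `retry_after` is the parsed value of the
    Retry-After header, if the server sent one.
    """
    if status not in RETRYABLE_STATUSES:
        return None
    if status == 429 and retry_after is not None:
        return float(retry_after)  # honor the server's hint
    rng = rng or random.Random()
    # Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)].
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


print(retry_delay(400, 0))
print(retry_delay(429, 0, retry_after=7))
print(retry_delay(503, 3))
```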
Cassandra: Long-Term Comments Database
-- Persisted comments database
CREATE TABLE comments (
  stream_id UUID,
  comment_id BIGINT, -- Snowflake ID (roughly time-ordered; millisecond ties across workers sort by worker ID)
  user_id UUID,
  text TEXT,
  is_pinned BOOLEAN,
  is_deleted BOOLEAN,
  created_at TIMESTAMP,
  PRIMARY KEY (stream_id, comment_id)
) WITH CLUSTERING ORDER BY (comment_id DESC)
  AND default_time_to_live = 7776000; -- Auto-expire after 90 days
Redis Cache & Pub/Sub Structure
# Pub/Sub channel for live broadcasts
Channel: stream:{stream_id}:comments
# Banned users list per stream (for O(1) checks)
Key: banned:{stream_id}
Type: SET
Members: [ "user-921", "user-331" ]
# Recent comments list (buffer for late joiners)
Key: recent_comments:{stream_id}
Type: LIST (Capped at 100 items via LPUSH + LTRIM)
| Concern Scenario | System Solution Design |
|---|---|
| WebSocket Gateway Crash | Clients reconnect automatically, requesting backfilled messages starting from their last successfully rendered Snowflake ID. |
| Moderation Pool Latency | Fall back gracefully to pre-filters, publishing comments instantly with warning flags for subsequent async review. |
| Redis Pub/Sub Connection Drop | Pub/Sub is ephemeral. Surviving nodes serve recent history buffers; missing comments are fetched from Cassandra if needed. |
1. The Late Joiner Catch-Up Problem ⭐
A viewer who joins a stream that has already been live for 30 minutes or an hour should not see a blank comment timeline. A small buffer of recent comments solves this bootstrap problem:
Solution: Redis Recent Buffer Cache
- Maintain a Redis LIST capped at 100 items (recent_comments:{stream_id}).
- When WebSocket opens:
1. Fetch last 100 comments from Redis LIST → send immediately.
2. Subscribe to Redis Pub/Sub for live comment updates.
3. Client merges: Client-side deduping by comment_id handles any comments delivered in both paths during the subscription handoff.
2. Moderation Pipeline Latency & Shadow-Banning ⭐
Blocking offensive text cannot add significant latency to active, clean conversations. We leverage asynchronous pipelines:
Approach 1: Synchronous Moderation (Block until reviewed)
- Comment → Moderation Service check → Publish if clean.
- ✓ No inappropriate content is ever broadcast.
- ✗ Adds 50–100ms lag. If moderation pools experience a spike, all comments freeze.
Approach 2: Asynchronous Moderation (Publish First, Moderate Async) ⭐
- Comment → Publish immediately → Evaluate in background thread.
- If flagged, broadcast a "delete_comment" command within 2 seconds.
- ✓ Zero latency for the overwhelming majority of valid users.
- ✗ Offensive text is visible for up to 2 seconds.
Approach 3: Modern Hybrid Moderation (Fast filter + Deep Async ML) ⭐
- Pre-screen comments locally using a fast dictionary hash (< 1ms).
- If it fails the hash, block instantly. If clean, publish immediately and trigger async ML models for advanced context/spam classification.
- Shadow-banning: Allow banned/spam users to see their own comments, but hide them from all other users. Prevents bots from realizing they have been blocked.
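The hybrid approach can be sketched as a synchronous set lookup followed by a hand-off to an async queue. `HybridModerator`, its decision strings, and the whitespace tokenizer are illustrative assumptions; a production filter would need proper normalization, and the ML stage is represented only by a list standing in for a queue.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set, Tuple


@dataclass
class HybridModerator:
    blocked_terms: FrozenSet[str]
    shadow_banned: Set[str] = field(default_factory=set)
    # Stand-in for the queue feeding the async ML classifier.
    async_queue: List[Tuple[str, str]] = field(default_factory=list)

    def submit(self, user_id: str, text: str) -> str:
        """Fast synchronous path: returns 'blocked', 'shadow', or 'published'."""
        tokens = {t.strip(".,!?").lower() for t in text.split()}
        if tokens & self.blocked_terms:
            return "blocked"  # failed the dictionary lookup: never broadcast
        if user_id in self.shadow_banned:
            return "shadow"   # echoed only to the author
        self.async_queue.append((user_id, text))  # deep checks happen later
        return "published"

    @staticmethod
    def visible_to(decision: str, author_id: str, viewer_id: str) -> bool:
        """Whether a given viewer should see the comment."""
        if decision == "published":
            return True
        if decision == "shadow":
            return author_id == viewer_id
        return False


moderator = HybridModerator(blocked_terms=frozenset({"badword"}), shadow_banned={"bot-7"})
print(moderator.submit("alice", "Great stream!"))
print(moderator.submit("bob", "This is a BADWORD!"))
print(moderator.submit("bot-7", "Buy followers now"))
```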
3. Ephemerality vs Strict Comment Ordering
Network latency means packet arrival order naturally drifts between distant servers. We address ordering:
At 100K comments/sec, packets sent from different regions reach servers out of order due to network propagation delays.
Why Chronological Ephemerality makes this acceptable:
- Under sampled viewing (~50 comments/sec), users do not notice a ±100ms ordering mismatch.
- Chat streams scroll off-screen rapidly, meaning strict physical timing is not a functional constraint.
- For host Q&As where order is critical:
  - Generate time-ordered Snowflake IDs at ingestion.
  - Client-side arrays buffer and sort comments by ID before rendering.
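A Snowflake-style generator illustrates why sorting by ID gives approximate chronological order. The sketch below uses the commonly cited 41/10/12 bit split (timestamp, worker, sequence); the custom epoch, the class name, and the choice to advance a logical clock instead of sleeping are assumptions, and the class is not thread-safe.

```python
import time
from typing import Callable, Optional

EPOCH_MS = 1_700_000_000_000  # hypothetical custom epoch, in ms since the Unix epoch
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    """Generates 64-bit IDs: timestamp | worker_id | per-millisecond sequence."""

    def __init__(self, worker_id: int, clock: Optional[Callable[[], int]] = None):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        now = self._clock()
        if now < self._last_ms:
            now = self._last_ms  # clock went backwards: keep IDs monotonic
        if now == self._last_ms:
            self._sequence = (self._sequence + 1) & MAX_SEQUENCE
            if self._sequence == 0:
                now = self._last_ms + 1  # sequence exhausted: borrow the next millisecond
        else:
            self._sequence = 0
        self._last_ms = now
        return (((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence)


generator = SnowflakeGenerator(worker_id=1)
arrived = [generator.next_id() for _ in range(5)]
arrived.reverse()            # simulate out-of-order network arrival
render_order = sorted(arrived)  # client buffers, then sorts by ID before rendering
print(render_order)
```

IDs from a single worker are strictly increasing. Across workers, ordering is only as good as clock synchronization, which matches the approximate-ordering requirement.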
4. WebSocket Server Failure (Thundering Herd) ⭐
If a WebSocket server hosting 100K clients crashes, all of its connections sever at once, and the reconnect wave must not crush the surviving nodes.
Thundering Herd Mitigation & Recovery:
1. Client-Side Exponential Backoff & Jitter:
- Reconnect after a random delay within 0–500ms, widening the window on each failed attempt, to avoid slamming load balancers.
2. Resume Position Handoff:
- Client sends its last_seen_comment_id to the new WebSocket node.
3. Node Replays Gaps:
- Server reads recent comments from the Redis capped list where id > last_seen_comment_id.
- If the gap reaches further back than the 100-item buffer, the remainder is fetched from Cassandra.
- Pushes missed comments, then subscribes client to live Pub/Sub updates, so the viewer sees little or no interruption.
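The reconnect and replay steps can be sketched with a jittered delay function and an in-memory stand-in for the capped recent-comments list. `reconnect_delay` and `RecentBuffer` are hypothetical names, the 30-second cap is an assumption, and `deque(maxlen=...)` only imitates the LPUSH + LTRIM behaviour rather than calling Redis.

```python
import random
from collections import deque
from typing import Deque, List, Optional, Tuple


def reconnect_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                    rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: attempt 0 waits a random 0-500 ms, later attempts widen the window."""
    rng = rng or random.Random()
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


class RecentBuffer:
    """Capped buffer of (comment_id, text); the oldest entries are evicted first."""

    def __init__(self, capacity: int = 100):
        self._capacity = capacity
        self._items: Deque[Tuple[int, str]] = deque(maxlen=capacity)

    def append(self, comment_id: int, text: str) -> None:
        self._items.append((comment_id, text))

    def replay_after(self, last_seen_id: int) -> Tuple[List[Tuple[int, str]], bool]:
        """Return (missed comments, complete).

        `complete` is False when the buffer may have evicted comments newer than
        last_seen_id, in which case the caller should backfill from durable storage.
        This check is conservative: IDs are not contiguous, so it cannot prove
        that nothing was lost once the last seen ID has left the buffer.
        """
        missed = [c for c in self._items if c[0] > last_seen_id]
        never_evicted = len(self._items) < self._capacity
        complete = never_evicted or self._items[0][0] <= last_seen_id
        return missed, complete


buffer = RecentBuffer(capacity=3)
for comment_id in range(1, 6):
    buffer.append(comment_id, f"comment {comment_id}")
print(buffer.replay_after(last_seen_id=4))
print(buffer.replay_after(last_seen_id=1))
```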
Verified User and Host Prioritization
To maintain high-quality streaming discussions, comments originating from verified accounts, moderators, or the stream hosts themselves are tagged with priority flags. The sampling engine routes these comments into a dedicated high-priority queue, guaranteeing they bypass random sampling drops entirely.
Interview Walkthrough
- Frame the problem as throughput mismatch: 100K comments/sec in, but viewers can only render ~10/sec — sampling is mandatory, not optional.
- Propose a tiered delivery pipeline: ingest → filter/sample → broadcast, with a high-priority lane for hosts and verified accounts.
- Compare Redis Pub/Sub (low-latency, fire-and-forget) vs Kafka (durable, replayable) for the broadcast layer and justify your pick.
- Use WebSocket fan-out through edge servers; do not push every comment from a single origin to millions of clients.
- Apply per-user rate limiting to prevent bot floods from starving legitimate viewers in the sampling pool.
- Persist all comments via Event Sourcing and CQRS even if only a sample is broadcast — replay and moderation require the full log.
- Quantify with Back-of-the-Envelope Estimation: 1M viewers × 10 comments/sec displayed = 10M WebSocket messages/sec at peak.
- Common pitfall: attempting to deliver every comment to every viewer on a viral stream — the system will collapse at the fan-out layer.
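Per-user rate limiting from the walkthrough is often implemented as a token bucket. The sketch below keeps buckets in process memory with hypothetical names and limits (1 comment/sec, burst of 5); a real deployment would likely hold this state in a shared store such as Redis so that every gateway enforces the same limit.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows `burst` immediate actions, then refills at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self._clock = clock
        self._tokens = float(burst)
        self._updated = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._updated) * self.rate)
        self._updated = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


class PerUserLimiter:
    """One bucket per user_id, created lazily on first comment."""

    def __init__(self, rate_per_sec: float = 1.0, burst: int = 5,
                 clock: Callable[[], float] = time.monotonic):
        self._rate = rate_per_sec
        self._burst = burst
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, user_id: str) -> bool:
        if user_id not in self._buckets:
            self._buckets[user_id] = TokenBucket(self._rate, self._burst, self._clock)
        return self._buckets[user_id].allow()


limiter = PerUserLimiter(rate_per_sec=1.0, burst=5)
decisions = [limiter.allow("user-921") for _ in range(7)]
print(decisions)
```

Rejected comments would map to the 429 response with a Retry-After hint.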
1. Redis Pub/Sub vs Kafka for Real-Time Broadcasting
Redis Pub/Sub:
- ✓ Ultra-low broadcast latency (< 1ms).
- ✓ Simple subscribe-unsubscribe mechanics.
- ✗ Fire-and-forget; if a client goes offline, the message is permanently lost.
Best For: ephemerally fanning out live chats and reaction animations.
Kafka Event Bus:
- ✓ Highly durable, replayable commit logs.
- ✓ Partitioned consumer groups support parallel streaming.
- ✗ 5–10ms write/read propagation latency overhead.
Best For: persistent processing, background ML moderation, and writing to Cassandra.
Hybrid System Integration: Use Redis Pub/Sub for immediate, ultra-fast client-side delivery, and Kafka for persistent database storage and analytics pipelines.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core live comments system flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
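An availability target translates directly into an error budget. The helper below is a small sketch; the 30-day window and the function name are assumptions.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given availability SLO."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_percent / 100) * window_minutes


for slo in (99.9, 99.95, 99.99):
    print(f"{slo}% over 30 days -> {error_budget_minutes(slo):.2f} minutes")
```

At 99.95% the budget is roughly 21.6 minutes per 30 days, so a single slow rollback can consume most of it.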
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.