Design a Live Comments System (like Facebook Live / YouTube Live)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 75	Staff level: multi-region, cost at scale, migration path, and production metrics.

Interview Prompt

Design a Live Comments System (like Facebook Live / YouTube Live).

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Ordered message delivery, Comment fan-out, Profanity filtering?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Ordered message delivery
Comment fan-out
Profanity filtering
Rate limiting per user
Capacity estimation with shown math

Out of scope (state explicitly)

Full ML ranking model training pipeline
Direct messaging / chat (#07)
Ad insertion and monetization

Assumptions

Clarify scale (DAU, QPS, data volume) for the live comments system in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Post comments: Allow users to submit real-time comments on active live videos or streaming events.
Real-time delivery: Broadcast posted comments to all active viewers within 1 second.
Pinned comments: Enable hosts/moderators to pin selected comments to the top of the chat panel.
Reply threads: Support structured, short reply chains attached to individual comments.
Automated moderation: Filter profanity, detect spam bots, and enforce block lists dynamically.
Comment rate throttling: Implement adaptive sampling to handle viral streams without degrading UI performance.
Historical comments: Persist comments long-term, keeping them scrollable after live events conclude.

Metric	Calculation	Value
Concurrent Live Streams	Given (peak load assumption)	50,000
Peak Viewers on One Stream	Given (peak load assumption)	10,000,000
Comments / sec (One Hot Stream)	Given (peak load assumption, peak factor included)	100,000
Comments / sec (Global)	Given (peak load assumption, peak factor included)	500,000
Avg Comment Size	Given (typical workload assumption)	200 bytes
Aggregate Ingest Throughput	500,000 comments/sec × 200 bytes	100 MB/s

I/O and Bandwidth Calculations:
- Ingestion Data Rate: 500,000 comments/sec (global) × 200 bytes = 100 MB/s write load.
- Storage Footprint (Raw):
  - 100 MB/s × 3,600 seconds = 360 GB per hour.
  - Cassandra compression reduces this by ~3×, requiring ~120 GB per hour of persistent disk.
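
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate: the replication factor of 3 for Cassandra is an assumption added here (the figures above do not state one), and it shows how the per-hour disk figure grows once replicas are counted.

```python
# Back-of-the-envelope capacity math using the figures from the table above.
GLOBAL_COMMENTS_PER_SEC = 500_000
AVG_COMMENT_BYTES = 200
COMPRESSION_RATIO = 3      # ~3x, as stated above
REPLICATION_FACTOR = 3     # assumption: not stated for Cassandra in the text

write_mb_per_sec = GLOBAL_COMMENTS_PER_SEC * AVG_COMMENT_BYTES / 1_000_000
raw_gb_per_hour = write_mb_per_sec * 3_600 / 1_000
compressed_gb_per_hour = raw_gb_per_hour / COMPRESSION_RATIO
replicated_gb_per_hour = compressed_gb_per_hour * REPLICATION_FACTOR

print(f"write load:              {write_mb_per_sec:.0f} MB/s")
print(f"raw storage:             {raw_gb_per_hour:.0f} GB/hour")
print(f"after compression:       {compressed_gb_per_hour:.0f} GB/hour")
print(f"with assumed replicas:   {replicated_gb_per_hour:.0f} GB/hour")
```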

Clients submit comments over WebSocket Gateways to the Comment Service, which assigns Snowflake IDs and runs pre-filters. Approved comments are published to Redis Pub/Sub, which broadcasts them back to the WebSocket Gateways using hierarchical fan-out. Asynchronously, comments flow through Kafka and are persisted in a Cassandra cluster.

1. End-to-End Comment Lifecycle

Processing comments from the keystroke to global broadcast requires coordination across multiple microservices:

1. Client publishes Comment → sends via WebSocket connection to Gateway.
2. Comment Service receives the message → assigns a globally unique, time-ordered Snowflake ID.
3. Moderation Service runs pre-publish checks (< 50 ms budget):
   - Fast dictionary profanity lookup.
   - Machine Learning spam classifier (flags repetitive text, suspicious links, excessive rates).
   - Redis SET lookup to check if user is on the stream's banned list.
4. If approved → publishes immediately to Redis Pub/Sub: stream:{stream_id}:comments
5. WebSocket Gateways fan out the comment to all local client connections subscribed to that channel.
6. Async task writes event to Kafka → consumed and stored in Cassandra for long-term persistence.
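
Step 2 depends on a time-ordered ID generator. The sketch below follows the commonly described Snowflake layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence). The epoch value, the class name `SnowflakeGenerator`, and the injectable `clock` parameter are illustrative assumptions, not details taken from the design above.

```python
import threading
import time

EPOCH_MS = 1_700_000_000_000   # arbitrary custom epoch (assumption)
WORKER_BITS = 10
SEQ_BITS = 12
MAX_SEQ = (1 << SEQ_BITS) - 1


class SnowflakeGenerator:
    """Produces 64-bit IDs that sort by creation time within one worker."""

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._clock = clock
        self._lock = threading.Lock()
        self._last_ms = -1
        self._seq = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Clock went backwards: reuse the last timestamp so IDs stay monotonic.
                now = self._last_ms
            if now == self._last_ms:
                self._seq = (self._seq + 1) & MAX_SEQ
                if self._seq == 0:
                    # 4096 IDs already issued this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._seq = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQ_BITS))
                | (self.worker_id << SEQ_BITS)
                | self._seq
            )
```

IDs from different workers created in the same millisecond are ordered by worker ID rather than by true arrival time, so cross-worker ordering is approximate.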

2. Hierarchical Fan-Out (Scaling to 10M Viewers) ⭐

Delivering 100 comments/sec to each of 10M viewers means 1 billion message deliveries per second, far more than a single server's NIC or CPU can sustain. We resolve this by splitting delivery into tiers:

Problem: 10M concurrent viewers × 100 comments/sec = 1 Billion messages/sec.
Direct, per-message fan-out from a single server is a physical impossibility.

Solution: Hierarchical Fan-Out Architecture
- Tier 1: Comment Service writes directly to a Redis Pub/Sub channel per stream.
- Tier 2: WebSocket cluster nodes (100 servers) subscribe to their respective Redis channels.
- Tier 3: Each individual WebSocket server handles a capped pool of local connections (e.g., 100K clients).

Calculation:
- 10M viewers / 100K clients per server = 100 WebSocket servers.
- Each WebSocket node consumes only ~100 messages/sec from Redis.
- Local processes fan out these 100 comments/sec to the 100K connected viewers.
- Redis Pub/Sub load for this stream: 100 nodes × 100 messages/sec = 10K deliveries/sec (lightweight).
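
The same calculation as a runnable sketch. It also shows the per-server socket write rate, which stays high even after tiering. The batching rate of 2 frames per second per client is an assumption added for illustration; the design only says that comments are pushed to clients in batches.

```python
import math

VIEWERS = 10_000_000
COMMENTS_PER_SEC = 100           # delivered rate used in the calculation above
CLIENTS_PER_SERVER = 100_000
FRAMES_PER_CLIENT_PER_SEC = 2    # assumption: comments are batched into 2 frames/sec

servers = math.ceil(VIEWERS / CLIENTS_PER_SERVER)
redis_deliveries_per_sec = servers * COMMENTS_PER_SEC
total_socket_writes_per_sec = VIEWERS * COMMENTS_PER_SEC

# Without batching, each gateway writes one frame per comment per client.
unbatched_writes_per_server = CLIENTS_PER_SERVER * COMMENTS_PER_SEC
# With batching, the frame count no longer depends on the comment rate.
frames_per_server_per_sec = CLIENTS_PER_SERVER * FRAMES_PER_CLIENT_PER_SEC

print(f"WebSocket servers needed:        {servers}")
print(f"Redis deliveries/sec (stream):   {redis_deliveries_per_sec:,}")
print(f"Total client deliveries/sec:     {total_socket_writes_per_sec:,}")
print(f"Unbatched writes/sec per server: {unbatched_writes_per_server:,}")
print(f"Batched frames/sec per server:   {frames_per_server_per_sec:,}")
```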

3. Dynamic Server-Side Sampling

Viral chat streams scroll too quickly for human reading. We maintain stream usability while preserving long-term records:

Problem: At 100K comments/sec, no human can read every comment, and pushing them all (about 20 MB/s at 200 bytes each) would overwhelm client bandwidth and rendering.

Solution: Dynamic Server-Side Sampling
- Take a uniform random sample of ~50 comments/sec to push to viewers.
- High-priority comments ALWAYS bypass sampling:
  - Host/Moderator comments.
  - Verified user comments.
  - Pinned comments.
  - High-engagement comments (comments with many likes).
- Stored records: ALL 100K comments/sec continue to stream to Kafka and Cassandra, enabling perfect chronological scrolling in historical replays after the stream ends.
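
A minimal sketch of the sampling rule, applied to one-second windows. The `Comment` fields, the `LIKES_THRESHOLD` cutoff for "high engagement", and the function names are hypothetical. The design above fixes only the ~50/sec budget and the four bypass categories.

```python
import random
from dataclasses import dataclass

LIKES_THRESHOLD = 100   # hypothetical cutoff for "high engagement"


@dataclass
class Comment:
    comment_id: int
    text: str
    is_host_or_mod: bool = False
    is_verified: bool = False
    is_pinned: bool = False
    likes: int = 0


def is_priority(c):
    """Priority comments bypass sampling entirely."""
    return (c.is_host_or_mod or c.is_verified or c.is_pinned
            or c.likes >= LIKES_THRESHOLD)


def sample_window(comments, budget=50, rng=random):
    """Return the comments to push to viewers for one window, in ID order."""
    priority = [c for c in comments if is_priority(c)]
    regular = [c for c in comments if not is_priority(c)]
    sampled = rng.sample(regular, min(budget, len(regular)))  # uniform, no repeats
    return sorted(priority + sampled, key=lambda c: c.comment_id)
```

Every comment in the window would still be written to Kafka and Cassandra; only the viewer-facing push is reduced.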

Event Bus Design (Kafka)

Topic: live_comments_system-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (here stream_id or user_id; preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "live_comments_system-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: live_comments_system-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Design rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
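
The consumer-side guarantees listed above (at-least-once delivery, deduplication by event_id, dead-lettering after 3 retries) can be sketched without a Kafka client. Here a plain list stands in for the DLQ topic and an in-memory set stands in for the deduplication store; both are simplifications, and the class name `IdempotentProcessor` is hypothetical.

```python
MAX_RETRIES = 3   # poison messages go to the DLQ after 3 retries


class IdempotentProcessor:
    def __init__(self, handler, dlq):
        self.handler = handler   # callable that applies the side effect
        self.dlq = dlq           # list standing in for the DLQ topic
        self.seen = set()        # in production: a shared store with a TTL

    def process(self, event):
        event_id = event["event_id"]
        if event_id in self.seen:
            # Redelivery under at-least-once semantics: skip the side effect.
            return "duplicate"
        for _ in range(1 + MAX_RETRIES):   # first attempt plus retries
            try:
                self.handler(event)
            except Exception:
                continue
            self.seen.add(event_id)
            return "processed"
        self.dlq.append(event)
        return "dead_lettered"
```

A real consumer would commit the Kafka offset only after `process` returns, so a crash mid-handler leads to redelivery rather than loss.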

1. Post Comment (WebSocket)

JSON

// Client posts comment via WebSocket:
{
  "type": "comment",
  "stream_id": "live-123",
  "text": "Amazing performance! 🔥"
}

// Server responds with immediate acknowledgement:
{
  "type": "comment_ack",
  "comment_id": "881239589218311",
  "status": "published"
}

2. Receive Comments (Server Push)

JSON

// WebSocket server pushes batched comments to client:
{
  "type": "new_comments",
  "stream_id": "live-123",
  "comments": [
    { "id": "cmt-101", "user": { "name": "Alice", "avatar": "https://..." }, "text": "Wow!", "ts": 1710320000 },
    { "id": "cmt-102", "user": { "name": "Bob" }, "text": "Let's go!", "ts": 1710320001 }
  ]
}

3. Get Historical Comments (REST API)

HTTP

GET /api/v1/streams/{stream_id}/comments?before={cursor}&limit=50
Response: 200 OK
{
  "comments": [...],
  "next_cursor": "88123958921500"
}
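
The `before` cursor can be read as "return up to `limit` comments with IDs strictly below this one, newest first". The sketch below applies that rule to an in-memory ascending list of IDs for one stream. The function name `page_before` is hypothetical, and a real handler would issue the equivalent range query against the comments store instead.

```python
from bisect import bisect_left


def page_before(comment_ids_asc, before=None, limit=50):
    """Return (page, next_cursor) for keyset pagination over sorted IDs.

    comment_ids_asc: ascending list of comment IDs for one stream.
    before: exclusive upper bound (the cursor), or None for the newest page.
    """
    end = len(comment_ids_asc) if before is None else bisect_left(comment_ids_asc, before)
    start = max(0, end - limit)
    page = comment_ids_asc[start:end][::-1]        # newest first
    next_cursor = page[-1] if start > 0 else None  # None means no older comments
    return page, next_cursor
```

Because the cursor is an ID and not an offset, new comments arriving during scrolling do not shift or duplicate earlier pages.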

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Server Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff

Cassandra: Long-Term Comments Database

SQL

-- Persisted comments table
CREATE TABLE comments (
    stream_id   UUID,
    comment_id  BIGINT,    -- Snowflake ID (unique and roughly time-ordered)
    user_id     UUID,
    text        TEXT,
    is_pinned   BOOLEAN,
    is_deleted  BOOLEAN,
    created_at  TIMESTAMP,
    PRIMARY KEY (stream_id, comment_id)
) WITH CLUSTERING ORDER BY (comment_id DESC)
  AND default_time_to_live = 7776000;  -- Auto-expire in 90 days

Redis Cache & Pub/Sub Structure

# Pub/Sub channel for live broadcasts
Channel: stream:{stream_id}:comments

# Banned users list per stream (for O(1) checks)
Key:    banned:{stream_id}
Type:   SET
Members: [ "user-921", "user-331" ]

# Recent comments list (buffer for late joiners)
Key:    recent_comments:{stream_id}
Type:   LIST (Capped at 100 items via LPUSH + LTRIM)

Failure Scenario	Mitigation
WebSocket Gateway Crash	Clients reconnect automatically, requesting backfilled messages starting from their last successfully rendered Snowflake ID.
Moderation Pool Latency	Fall back gracefully to pre-filters, publishing comments instantly with warning flags for subsequent async review.
Redis Pub/Sub Connection Drop	Pub/Sub is ephemeral. Surviving nodes serve recent history buffers; missing comments are fetched from Cassandra if needed.

1. The Late Joiner Catch-Up Problem ⭐

When users join a stream that is already in progress, their comment panel should not appear empty. We solve this bootstrap problem:

Problem: Without a backfill, a user who joins a stream 30 minutes in sees a completely blank comment timeline.

Solution: Redis Recent Buffer Cache
- Maintain a Redis LIST capped at 100 items (recent_comments:{stream_id}).
- When WebSocket opens:
  1. Subscribe to Redis Pub/Sub for live comment updates, buffering anything that arrives.
  2. Fetch last 100 comments from Redis LIST → send immediately.
  3. Client merges: client-side deduping by comment_id handles any comments delivered in both paths during the subscription handoff. Subscribing before fetching means the two paths overlap and leave no gap.
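
The merge step can be sketched as a dedupe keyed on comment_id followed by a sort. The dictionary shape of a comment (`{"id": ..., "text": ...}`) and the function name are assumptions for illustration.

```python
def merge_backfill_and_live(snapshot, live_buffer):
    """Combine the recent-comments snapshot with live messages.

    snapshot: comments read from the recent buffer (any order).
    live_buffer: comments received from Pub/Sub around the same time.
    Returns one list, oldest first, with each comment_id appearing once.
    """
    merged = {}
    for comment in list(snapshot) + list(live_buffer):
        merged.setdefault(comment["id"], comment)   # first copy wins
    return [merged[comment_id] for comment_id in sorted(merged)]
```

Sorting by ID works here because Snowflake IDs increase with creation time, so the merged list renders in chronological order.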

2. Moderation Pipeline Latency & Shadow-Banning ⭐

Blocking offensive text cannot add significant latency to active, clean conversations. We leverage asynchronous pipelines:

Approach 1: Synchronous Moderation (Block until reviewed)
- Comment → Moderation Service check → Publish if clean.
- ✓ No inappropriate content is ever broadcast.
- ✗ Adds 50–100ms lag. If moderation pools experience a spike, all comments freeze.

Approach 2: Asynchronous Moderation (Publish First, Moderate Async) ⭐
- Comment → Publish immediately → Evaluate in background thread.
- If flagged, broadcast a "delete_comment" command within 2 seconds.
- ✓ Zero latency for the overwhelming majority of valid users.
- ✗ Offensive text is visible for up to 2 seconds.

Approach 3: Modern Hybrid Moderation (Fast filter + Deep Async ML) ⭐
- Pre-screen comments locally using a fast dictionary hash (< 1ms).
- If a term matches the blocklist, block instantly. If clean, publish immediately and trigger async ML models for advanced context/spam classification.
- Shadow-banning: Allow banned/spam users to see their own comments, but hide them from all other users. Prevents bots from realizing they have been blocked.
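
Approach 3 and the shadow-ban rule might look like the sketch below. The blocklist terms, the `SHADOW_BANNED` set, and all function names are placeholders; the deep ML check is represented only by a queue that a background worker would drain.

```python
import re

BLOCKLIST = frozenset({"badword", "slur"})   # placeholder terms
SHADOW_BANNED = {"user-921"}                 # placeholder user IDs

_WORD = re.compile(r"[a-z0-9']+")


def fast_prefilter(text):
    """Return True when no blocklisted word appears (hash-set lookups only)."""
    return BLOCKLIST.isdisjoint(_WORD.findall(text.lower()))


def handle_comment(comment, async_queue):
    """Block on a dictionary hit; otherwise publish now and review later."""
    if not fast_prefilter(comment["text"]):
        return "blocked"
    async_queue.append(comment)   # deep ML classification happens off the hot path
    return "published"


def visible_to(comment, viewer_id):
    """Shadow-banned authors still see their own comments; nobody else does."""
    if comment["user_id"] in SHADOW_BANNED:
        return viewer_id == comment["user_id"]
    return True
```

Exact word matching is easy to evade (spacing, look-alike characters), which is one reason the asynchronous ML pass is still needed behind it.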

3. Ephemerality vs. Strict Comment Ordering

Network latency means packet arrival order naturally drifts between distant servers. We address ordering:

At 100K comments/sec, packets sent from different regions reach servers out-of-order due to network propagation delays.

Why the ephemeral nature of live chat makes this acceptable:
- Under sampled viewing (~50 comments/sec), users do not notice a ±100ms ordering mismatch.
- Chat streams scroll off-screen rapidly, meaning strict physical timing is not a functional constraint.
- For host Q&As where order is critical:
  - Generate time-ordered Snowflake IDs at ingestion.
  - Client-side arrays buffer and sort comments by ID before rendering.
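
For the Q&A case, the client-side buffer could be a small min-heap that holds comments briefly and releases them in ID order. The 200 ms hold window and the `ReorderBuffer` name are assumptions; the text above only requires buffering and sorting by ID before rendering.

```python
import heapq


class ReorderBuffer:
    """Holds comments for a short window, then releases them in ID order."""

    def __init__(self, hold_ms=200):
        self.hold_ms = hold_ms
        self._heap = []   # entries are (comment_id, arrival_ms, comment)

    def push(self, comment_id, comment, now_ms):
        heapq.heappush(self._heap, (comment_id, now_ms, comment))

    def drain(self, now_ms):
        """Pop comments, lowest ID first, while the head has waited hold_ms."""
        released = []
        while self._heap and now_ms - self._heap[0][1] >= self.hold_ms:
            comment_id, _, comment = heapq.heappop(self._heap)
            released.append((comment_id, comment))
        return released
```

The hold window trades display latency for ordering: a comment that arrives later than the window after its successors will still render out of order.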

4. WebSocket Server Failure (Thundering Herd) ⭐

If a WebSocket node holding 100K clients drops, we must prevent those clients from crushing other nodes:

If a WebSocket server hosting 100K clients crashes, all connections instantly sever.

Thundering Herd Mitigation & Recovery:
1. Client-Side Exponential Backoff & Jitter:
   - Reconnect after a random 0–500ms delay, doubling the window on each failed attempt, to avoid slamming load balancers.
2. Resume Position Handoff:
   - Client sends its last_seen_comment_id to the new WebSocket node.
3. Node Replays Gaps:
   - Server reads recent comments from the Redis capped list where id > last_seen_comment_id, falling back to Cassandra if the gap is larger than the 100-item buffer.
   - Pushes missed comments, then subscribes client to live Pub/Sub updates.
   - The viewer sees no gap in the timeline beyond the brief reconnect delay.
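
A sketch of the two pieces of recovery logic: the client-side delay calculation and the server-side gap replay. The 500 ms base matches the window above. The 30-second cap, the function names, and the `covered` flag (which tells the caller whether the buffer reached back far enough) are assumptions.

```python
import random


def reconnect_delay_ms(attempt, base_ms=500, cap_ms=30_000, rng=random):
    """Full jitter: pick a uniform delay in [0, min(cap, base * 2**attempt)]."""
    window = min(cap_ms, base_ms * (2 ** attempt))
    return rng.uniform(0, window)


def replay_gap(recent_buffer, last_seen_comment_id):
    """Return (missed, covered) for a reconnecting client.

    recent_buffer: comments from the capped recent list, any order.
    missed: comments newer than last_seen_comment_id, oldest first.
    covered: True when the buffer still holds a comment at or before the
             client's position, so nothing older was trimmed away.
    """
    missed = sorted(
        (c for c in recent_buffer if c["id"] > last_seen_comment_id),
        key=lambda c: c["id"],
    )
    covered = any(c["id"] <= last_seen_comment_id for c in recent_buffer)
    return missed, covered
```

When `covered` is False, the server would need to fetch the older part of the gap from the durable store before switching the client to live updates.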

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.