Design Google Docs (Real-Time Collaborative Editing)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Show domain depth beyond the baseline: async pipelines, consistency semantics, and operational SLOs.
Arch 75	Staff angles: partition behavior, cost drivers, and MVP → production evolution with clear triggers.

Interview Prompt

Design Design Google Docs (Real-Time Collaborative Editing).

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: OT (Operational Transformation) vs CRDT, Cursor position sync, Version vectors?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

OT (Operational Transformation) vs CRDT
Cursor position sync
Version vectors
WebSocket session management
Conflict-free merges
Undo/redo in collaborative context

Out of scope (state explicitly)

Client desktop/mobile app implementation
End-user file preview rendering for every format
Building raw block storage hardware

Assumptions

Clarify scale (DAU, QPS, data volume) for google docs collaborative editing in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Multiple users can simultaneously edit the same document in real time
Each user sees a live cursor and selections of other collaborators
Changes appear on all clients within ~100ms of being made
Full version history with ability to view/restore any past version
Commenting and suggestion mode (track changes)
Offline editing with automatic conflict resolution on reconnect
Rich text formatting: bold, italic, headings, lists, tables, images
Permissions: owner, editor, commenter, viewer
Document sharing via link with configurable access levels
Export to PDF, DOCX, plain text

Metric	Calculation	Value
Total documents	Given (assumption documented in value)	500M
DAU	Given (product assumption)	50M users
Concurrently active docs	Given (peak load assumption)	10M
Avg editors per active doc	Given (typical workload assumption)	2-5 (up to 200 max)
Operations/sec (global)	From Operations/day ÷ 86400 (+ peak factor in value)	500K
Avg document size	Given (typical workload assumption)	50 KB (text) + 500 KB (images/embeds)
Storage (docs)	500M × 550 KB	~275 TB
Version history (30 days)	Given	~500 TB (with delta compression)
WebSocket connections	Given (assumption documented in value)	20M concurrent

Loading...

Core Design Decision: OT vs CRDT

Aspect	OT (Operational Transformation)	CRDT (Conflict-free Replicated Data Type)
How it works	Transform concurrent ops relative to shared server sequence	Each op carries unique IDs; merge is commutative, associative, idempotent
Server required?	Yes — central server assigns canonical order	No — can merge peer-to-peer
Offline support	Harder (ops must be rebased)	Natural (merge on reconnect)
Used by	Google Docs (original), SharePoint	Figma, Yjs, Automerge, Apple Notes
Recommendation	Legacy	Use CRDT for new systems — better offline, simpler reasoning

Collaboration Server: Why One Per Document

Each active document is assigned to exactly ONE collaboration server
Avoids distributed coordination for op ordering
Server holds document state in memory for fast op processing
If server dies → another server loads latest snapshot + replays op log
Consistent hashing or ZooKeeper assigns doc → server

Document Session Lifecycle

1. User opens doc → WebSocket Gateway routes to assigned Collaboration Server
2. If doc not in memory → Load latest snapshot from PostgreSQL
                        → Replay ops from Cassandra since snapshot
                        → Build in-memory state
3. User types → Client generates op → Sends via WebSocket
4. Server transforms → Assigns seq_num → Persists to op log → Broadcasts
5. Periodically (every 100 ops or 30 sec) → Write snapshot to PostgreSQL
6. Last user leaves → After 5 min idle, evict doc from memory

WebSocket Protocol (bidirectional)

JSON

// Client → Server: operation
{
  "type": "operation",
  "doc_id": "d_abc123",
  "client_seq": 42,
  "server_seq": 100,
  "ops": [
    { "type": "insert", "pos": 15, "content": "Hello", "attrs": {"bold": true} },
    { "type": "delete", "pos": 10, "len": 3 },
    { "type": "format", "pos": 5, "len": 10, "attrs": {"italic": true} }
  ]
}

// Server → Client (acknowledgement)
{ "type": "ack", "client_seq": 42, "server_seq": 105 }

// Presence updates
{ "type": "cursor", "user_id": "u_xyz", "cursor": { "pos": 42, "selection": { "start": 42, "end": 50 } } }

REST APIs

POST   /api/docs                         → Create new document
GET    /api/docs/{doc_id}                → Get document
DELETE /api/docs/{doc_id}                → Delete document (soft delete)
GET    /api/docs/{doc_id}/history        → Get version history
GET    /api/docs/{doc_id}/version/{v}    → Get doc at specific version
POST   /api/docs/{doc_id}/restore/{v}   → Restore to version v
POST   /api/docs/{doc_id}/share         → Update sharing permissions
POST   /api/docs/{doc_id}/export?fmt=pdf → Export document

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff

Operation Log (Cassandra / ScyllaDB)

Table: operation_log
  Partition Key: doc_id
  Clustering Key: seq_num (ASC)

  doc_id       TEXT
  seq_num      BIGINT
  user_id      TEXT
  op_type      TEXT  -- 'insert' | 'delete' | 'format'
  op_data      BLOB  -- Serialized operation (protobuf)
  timestamp    TIMESTAMP
  client_seq   INT

Document Snapshots (PostgreSQL)

SQL

CREATE TABLE documents (
    doc_id UUID PRIMARY KEY,
    title TEXT,
    owner_id UUID,
    content_snapshot JSONB,
    snapshot_seq BIGINT,
    created_at TIMESTAMPTZ,
    updated_at TIMESTAMPTZ,
    is_deleted BOOLEAN DEFAULT FALSE
);

Redis: Active Sessions

# Active document sessions
HSET doc:session:{doc_id} user:{user_id} '{"cursor":42,"color":"#FF5733","connected_at":"..."}'

# Permission cache (TTL 5 min)
HSET doc:perms:{doc_id} user:{user_id} "editor"

# Document-to-server mapping
SET doc:server:{doc_id} "collab-server-17" EX 300

Concern	Solution
Collaboration server failure	Failover via ZooKeeper; new server loads latest snapshot + replays op log; clients reconnect with exponential backoff
Conflict resolution	OT/CRDT ensures all clients converge; no lost updates
Offline edits	Buffered in IndexedDB; merged via CRDT on reconnect
Data durability	Cassandra RF=3 for op log; periodic snapshots to PostgreSQL
Hot document overload	Batched presence updates; op batching; view-only mode above 200 editors
Exactly-once application	Client seq + server seq dedup prevents double-apply

Collaboration Server Failure

1. WebSocket Gateway detects broken connection (heartbeat)
2. Gateway triggers failover: assigns doc to another collab server via ZooKeeper
3. New server loads latest snapshot from PostgreSQL
4. Replays all ops from Cassandra where seq_num > snapshot_seq
5. Clients reconnect (WebSocket auto-reconnect with exponential backoff)
6. Clients send their pending ops (buffered locally)
7. New server transforms and applies pending client ops

Recovery time: < 5 seconds

Conflict Resolution (OT Deep Dive)

Scenario: User A inserts "X" at position 5, User B deletes at position 3.

With OT:
  Server receives A's op first → applies insert("X", 5) → seq 101
  Server receives B's op (based on seq 100) → must transform:
    B's delete(3) is transformed against A's insert(5)
    Since 3 < 5: delete(3) stays at 3
    BUT A's insert(5) vs B's delete(3): since 5 > 3, adjust to insert(4)
  
  Both clients converge to same state ✓

CRDT Deep Dive (Recommended Approach)

Using RGA (Replicated Growable Array):
  Each character = (uniqueID, value, isDeleted, parentID)
  UniqueID = (siteID, lamportClock) — globally unique, totally ordered

Two inserts same parent: order by uniqueID (siteID tiebreak)
Result: deterministic on ALL replicas — no server needed

Tombstone GC:
  Deleted chars kept as tombstones
  Periodic compaction: if all replicas have seen delete, remove tombstone

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.

Interview Prompt

Clarifying Questions (ask before designing)

Scope

In scope

Out of scope (state explicitly)

Assumptions

Core Design Decision: OT vs CRDT

Collaboration Server: Why One Per Document

Document Session Lifecycle

OT vs CRDT: Merge Semantics

Collaboration Server Memory Model

Cursor & Presence Sync

Offline → Reconnect Merge

WebSocket Protocol (bidirectional)

REST APIs

Common Error Responses

Operation Log (Cassandra / ScyllaDB)

Document Snapshots (PostgreSQL)

Redis: Active Sessions

Collaboration Server Failure

Conflict Resolution (OT Deep Dive)

CRDT Deep Dive (Recommended Approach)

Offline Editing & Sync

Version History & Snapshotting

Real-Time Presence at Scale

Interview Walkthrough

OT vs CRDT: The Core Collaborative Editing Decision

Document Snapshot Strategy: Frequency vs Storage Cost

Phase 1: MVP (0 to 100K users)

Phase 2: Growth (100K to 10M users)

Phase 3: Scale (10M+ users)

SLOs & Error Budgets

Incident Scenarios (2am reality)

Cost Drivers (Staff lens)

Multi-Region & DR