This problem appears in multiple sheets. Depth expectations increase as you progress:
Interview Prompt
Design Design Google Docs (Real-Time Collaborative Editing).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: OT (Operational Transformation) vs CRDT, Cursor position sync, Version vectors? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- OT (Operational Transformation) vs CRDT
- Cursor position sync
- Version vectors
- WebSocket session management
- Conflict-free merges
- Undo/redo in collaborative context
Out of scope (state explicitly)
- Client desktop/mobile app implementation
- End-user file preview rendering for every format
- Building raw block storage hardware
Assumptions
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Multiple users can simultaneously edit the same document in real time
- Each user sees a live cursor and selections of other collaborators
- Changes appear on all clients within ~100ms of being made
- Full version history with ability to view/restore any past version
- Commenting and suggestion mode (track changes)
- Offline editing with automatic conflict resolution on reconnect
- Rich text formatting: bold, italic, headings, lists, tables, images
- Permissions: owner, editor, commenter, viewer
- Document sharing via link with configurable access levels
- Export to PDF, DOCX, plain text
- Low Latency: Local keystroke reflected instantly; remote changes visible within 100–200ms
- Consistency: All clients must eventually converge to the same document state (strong eventual consistency)
- Availability: 99.99%: editing must work even if some servers are down
- Scalability: Support 100M+ documents, up to 200 concurrent editors per document
- Durability: Zero data loss: every accepted edit persisted
- Conflict Resolution: Concurrent edits must merge automatically without user intervention
- Offline Support: Edits queued locally and synced on reconnect
- Security: Encryption in transit (TLS) and at rest, RBAC, audit logs
| Metric | Calculation | Value |
|---|---|---|
| Total documents | Given (assumption documented in value) | 500M |
| DAU | Given (product assumption) | 50M users |
| Concurrently active docs | Given (peak load assumption) | 10M |
| Avg editors per active doc | Given (typical workload assumption) | 2-5 (up to 200 max) |
| Operations/sec (global) | From Operations/day ÷ 86400 (+ peak factor in value) | 500K |
| Avg document size | Given (typical workload assumption) | 50 KB (text) + 500 KB (images/embeds) |
| Storage (docs) | 500M × 550 KB | ~275 TB |
| Version history (30 days) | Given | ~500 TB (with delta compression) |
| WebSocket connections | Given (assumption documented in value) | 20M concurrent |
Core Design Decision: OT vs CRDT
| Aspect | OT (Operational Transformation) | CRDT (Conflict-free Replicated Data Type) |
|---|---|---|
| How it works | Transform concurrent ops relative to shared server sequence | Each op carries unique IDs; merge is commutative, associative, idempotent |
| Server required? | Yes — central server assigns canonical order | No — can merge peer-to-peer |
| Offline support | Harder (ops must be rebased) | Natural (merge on reconnect) |
| Used by | Google Docs (original), SharePoint | Figma, Yjs, Automerge, Apple Notes |
| Recommendation | Legacy | Use CRDT for new systems — better offline, simpler reasoning |
Collaboration Server: Why One Per Document
- Each active document is assigned to exactly ONE collaboration server
- Avoids distributed coordination for op ordering
- Server holds document state in memory for fast op processing
- If server dies → another server loads latest snapshot + replays op log
- Consistent hashing or ZooKeeper assigns doc → server
Document Session Lifecycle
1. User opens doc → WebSocket Gateway routes to assigned Collaboration Server
2. If doc not in memory → Load latest snapshot from PostgreSQL
→ Replay ops from Cassandra since snapshot
→ Build in-memory state
3. User types → Client generates op → Sends via WebSocket
4. Server transforms → Assigns seq_num → Persists to op log → Broadcasts
5. Periodically (every 100 ops or 30 sec) → Write snapshot to PostgreSQL
6. Last user leaves → After 5 min idle, evict doc from memoryOT vs CRDT: Merge Semantics
OT transforms concurrent ops against a canonical server order: requires central server, harder offline. CRDT ops carry unique IDs (lamport/vector clocks); merge is commutative: natural offline support. New systems (Figma, Yjs) prefer CRDT; Google Docs originally used OT.
Collaboration Server Memory Model
In-memory: document tree + op log tail (last N ops for fast catch-up) On cold start: PostgreSQL snapshot + replay Cassandra ops since snapshot_index Hot doc (200 editors): batch ops every 50ms; presence updates every 500ms Eviction: LRU after 5 min idle — snapshot before evict
Cursor & Presence Sync
Cursor positions sent as ephemeral ops (not persisted to op log). Throttle to 10 updates/sec per user. Use CRDT-style position identifiers (not raw offsets) so cursors stay valid after concurrent inserts.
Offline → Reconnect Merge
Client buffers ops in IndexedDB while offline. On reconnect: send buffered ops with client_seq; server rebases against current doc version, returns transformed ops. Conflicts resolved automatically: user never sees a merge dialog for text edits.
WebSocket Protocol (bidirectional)
// Client → Server: operation
{
"type": "operation",
"doc_id": "d_abc123",
"client_seq": 42,
"server_seq": 100,
"ops": [
{ "type": "insert", "pos": 15, "content": "Hello", "attrs": {"bold": true} },
{ "type": "delete", "pos": 10, "len": 3 },
{ "type": "format", "pos": 5, "len": 10, "attrs": {"italic": true} }
]
}
// Server → Client (acknowledgement)
{ "type": "ack", "client_seq": 42, "server_seq": 105 }
// Presence updates
{ "type": "cursor", "user_id": "u_xyz", "cursor": { "pos": 42, "selection": { "start": 42, "end": 50 } } }REST APIs
POST /api/docs → Create new document
GET /api/docs/{doc_id} → Get document
DELETE /api/docs/{doc_id} → Delete document (soft delete)
GET /api/docs/{doc_id}/history → Get version history
GET /api/docs/{doc_id}/version/{v} → Get doc at specific version
POST /api/docs/{doc_id}/restore/{v} → Restore to version v
POST /api/docs/{doc_id}/share → Update sharing permissions
POST /api/docs/{doc_id}/export?fmt=pdf → Export documentCommon Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff
Operation Log (Cassandra / ScyllaDB)
Table: operation_log Partition Key: doc_id Clustering Key: seq_num (ASC) doc_id TEXT seq_num BIGINT user_id TEXT op_type TEXT -- 'insert' | 'delete' | 'format' op_data BLOB -- Serialized operation (protobuf) timestamp TIMESTAMP client_seq INT
Document Snapshots (PostgreSQL)
CREATE TABLE documents (
doc_id UUID PRIMARY KEY,
title TEXT,
owner_id UUID,
content_snapshot JSONB,
snapshot_seq BIGINT,
created_at TIMESTAMPTZ,
updated_at TIMESTAMPTZ,
is_deleted BOOLEAN DEFAULT FALSE
);Redis: Active Sessions
# Active document sessions
HSET doc:session:{doc_id} user:{user_id} '{"cursor":42,"color":"#FF5733","connected_at":"..."}'
# Permission cache (TTL 5 min)
HSET doc:perms:{doc_id} user:{user_id} "editor"
# Document-to-server mapping
SET doc:server:{doc_id} "collab-server-17" EX 300| Concern | Solution |
|---|---|
| Collaboration server failure | Failover via ZooKeeper; new server loads latest snapshot + replays op log; clients reconnect with exponential backoff |
| Conflict resolution | OT/CRDT ensures all clients converge; no lost updates |
| Offline edits | Buffered in IndexedDB; merged via CRDT on reconnect |
| Data durability | Cassandra RF=3 for op log; periodic snapshots to PostgreSQL |
| Hot document overload | Batched presence updates; op batching; view-only mode above 200 editors |
| Exactly-once application | Client seq + server seq dedup prevents double-apply |
Collaboration Server Failure
1. WebSocket Gateway detects broken connection (heartbeat) 2. Gateway triggers failover: assigns doc to another collab server via ZooKeeper 3. New server loads latest snapshot from PostgreSQL 4. Replays all ops from Cassandra where seq_num > snapshot_seq 5. Clients reconnect (WebSocket auto-reconnect with exponential backoff) 6. Clients send their pending ops (buffered locally) 7. New server transforms and applies pending client ops Recovery time: < 5 seconds
Conflict Resolution (OT Deep Dive)
Scenario: User A inserts "X" at position 5, User B deletes at position 3.
With OT:
Server receives A's op first → applies insert("X", 5) → seq 101
Server receives B's op (based on seq 100) → must transform:
B's delete(3) is transformed against A's insert(5)
Since 3 < 5: delete(3) stays at 3
BUT A's insert(5) vs B's delete(3): since 5 > 3, adjust to insert(4)
Both clients converge to same state ✓CRDT Deep Dive (Recommended Approach)
Using RGA (Replicated Growable Array): Each character = (uniqueID, value, isDeleted, parentID) UniqueID = (siteID, lamportClock) — globally unique, totally ordered Two inserts same parent: order by uniqueID (siteID tiebreak) Result: deterministic on ALL replicas — no server needed Tombstone GC: Deleted chars kept as tombstones Periodic compaction: if all replicas have seen delete, remove tombstone
Offline Editing & Sync
1. User goes offline → all edits stored in IndexedDB/SQLite 2. Local CRDT state diverges from server 3. User comes back online: a. Client sends all buffered ops to server b. Server sends all ops since client's last_seen_seq c. CRDT merge: both sides apply all ops — convergence guaranteed d. No manual conflict resolution needed (unlike Git)
Version History & Snapshotting
- Snapshot every 100 ops or every 30 seconds
- Op log kept for 90 days (compliance)
- To view version at time T: find latest snapshot before T, replay ops to T
- Delta compression between adjacent snapshots (~90% reduction)
Real-Time Presence at Scale
Problem: 200 users in a doc, each moving cursor → 40K msgs/sec Solution: - Cursor debouncing (client-side): throttle to 100ms intervals - Cursor batching (server-side): aggregate all 200 positions, broadcast every 100ms - 200 individual msgs/100ms → 1 batch msg/100ms → 200× reduction
Interview Walkthrough
- Lead with the OT vs CRDT decision — it is the defining architectural fork; state your recommendation and why (CRDT for new systems).
- Explain how concurrent edits converge: OT transforms ops against each other on a central server, CRDT merges ops commutatively without coordination.
- Use WebSocket for real-time op delivery; batch cursor/presence updates to avoid 40K msgs/sec from 200 users moving cursors.
- Persist an append-only op log plus periodic snapshots — replay ops from the latest snapshot for version history and crash recovery.
- Address offline editing: buffer ops locally, sync on reconnect, and rely on CRDT convergence (or OT replay) for conflict-free merge.
- Frame consistency using CAP Theorem and Consistency Models — favor availability and partition tolerance; brief edit conflicts are acceptable.
- Mention tombstone compaction for CRDT storage growth (~30% overhead after heavy editing).
- Common pitfall: last-write-wins timestamps — two users typing simultaneously will silently lose one user's edits.
OT vs CRDT: The Core Collaborative Editing Decision
Operational Transformation (OT) — Google Docs: ✓ Compact representation (small operations) ✓ Works well for text (insert/delete are natural) ✗ Server is required as central coordinator ✗ Correctness is notoriously difficult to prove ✗ Complex to implement correctly for rich text CRDT — Figma, Notion: ✓ Provably correct (mathematical guarantee of convergence) ✓ Works P2P (no central server required) ✓ Easier to reason about correctness ✗ Higher storage overhead (each character has metadata) ✗ Tombstones accumulate: ~30% of storage after heavy editing Industry trend: CRDTs are winning for new systems. For interviews: Mention both, recommend CRDT for new systems.
Document Snapshot Strategy: Frequency vs Storage Cost
Snapshot every 1000 ops: Recovery: at most 999 ops to replay → acceptable (< 1 sec) Storage: manageable Google Docs approach: - Every 1000 ops (automated) - On explicit "Version History" save (user-triggered) - On document sharing (ensure shareable state is snapshotted) Delta compression between snapshots: 60-90% storage reduction for documents with small incremental changes
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core google docs collaborative editing flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.