System Design Problem

Design Google Docs (Real-Time Collaborative Editing)

Commonly Asked By: Google, Microsoft, Atlassian, Figma

  • Multiple users can simultaneously edit the same document in real time
  • Each user sees a live cursor and selections of other collaborators
  • Changes appear on all clients within ~100ms of being made
  • Full version history with ability to view/restore any past version
  • Commenting and suggestion mode (track changes)
  • Offline editing with automatic conflict resolution on reconnect
  • Rich text formatting: bold, italic, headings, lists, tables, images
  • Permissions: owner, editor, commenter, viewer
  • Document sharing via link with configurable access levels
  • Export to PDF, DOCX, plain text

Core Design Decision: OT vs CRDT

Aspect | OT (Operational Transformation) | CRDT (Conflict-free Replicated Data Type)
How it works | Transform concurrent ops relative to shared server sequence | Each op carries unique IDs; merge is commutative, associative, idempotent
Server required? | Yes: a central server assigns the canonical order | No: replicas can merge peer-to-peer
Offline support | Harder (ops must be rebased) | Natural (merge on reconnect)
Used by | Google Docs (original), SharePoint | Figma, Yjs, Automerge, Apple Notes
Recommendation | Legacy | Use CRDT for new systems: better offline support, simpler reasoning
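
To make the CRDT column concrete, here is a minimal sketch of a sequence CRDT in Python. It is an illustration only, not how Yjs, Automerge, or any other listed product works internally. The class name SeqCRDT, the fractional-position scheme, and the (position, site, counter) ID layout are all assumptions chosen for brevity. Because merge is a plain set union, it is commutative, associative, and idempotent, which is the property the table refers to.

```python
from dataclasses import dataclass, field


@dataclass
class SeqCRDT:
    """Toy text CRDT: every character gets a unique, sortable ID (hypothetical design)."""
    site: str                                     # unique replica name
    chars: dict = field(default_factory=dict)     # (pos, site, counter) -> character
    tombstones: set = field(default_factory=set)  # IDs of deleted characters
    counter: int = 0                              # per-site counter keeps IDs unique

    def _visible(self):
        # Sorting by ID gives every replica the same character order.
        return sorted(i for i in self.chars if i not in self.tombstones)

    def insert(self, index, ch):
        ids = self._visible()
        left = ids[index - 1][0] if index > 0 else 0.0
        right = ids[index][0] if index < len(ids) else 1.0
        self.counter += 1
        # Assumption: a float midpoint is precise enough for a demo. Real
        # designs use identifiers that cannot run out of precision.
        self.chars[((left + right) / 2, self.site, self.counter)] = ch

    def delete(self, index):
        # Deletes are recorded as tombstones so they survive a merge.
        self.tombstones.add(self._visible()[index])

    def merge(self, other):
        # Set union: order of merges and repeated merges do not matter.
        self.chars.update(other.chars)
        self.tombstones |= other.tombstones

    def text(self):
        return "".join(self.chars[i] for i in self._visible())


# Two replicas edit concurrently (for example, while one is offline).
a = SeqCRDT("a")
a.insert(0, "H")
a.insert(1, "i")
b = SeqCRDT("b")
b.merge(a)            # b starts from a's state
a.insert(2, "!")      # a appends "!"
b.insert(0, ">")      # b prepends ">"
b.delete(1)           # b deletes "H"
a.merge(b)
b.merge(a)
print(a.text(), b.text())  # both replicas print ">i!"
```

Note that no server decided an order here: each replica merged the other's state and arrived at the same text. The sketch ignores known weak spots such as interleaving of concurrent inserts at the same position and unbounded tombstone growth.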

Collaboration Server: Why One Per Document

  • Each active document is assigned to exactly ONE collaboration server
  • Avoids distributed coordination for op ordering
  • Server holds document state in memory for fast op processing
  • If server dies → another server loads latest snapshot + replays op log
  • Consistent hashing or a coordination service such as ZooKeeper maps each doc to its server
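
As a rough illustration of the consistent-hashing option, the sketch below maps document IDs to collaboration servers with a hash ring. The class name HashRing, the server names, and the choice of 64 virtual nodes per server are assumptions for the example, not part of the original design.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash (Python's built-in hash() is randomized per process).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical doc -> server router based on a consistent-hash ring."""

    def __init__(self, servers, vnodes=64):
        self.vnodes = vnodes
        self._points = []  # sorted list of (hash, server) pairs
        for server in servers:
            self.add(server)

    def add(self, server):
        # Several virtual nodes per server spread documents more evenly.
        for i in range(self.vnodes):
            bisect.insort(self._points, (_hash(f"{server}#{i}"), server))

    def remove(self, server):
        self._points = [p for p in self._points if p[1] != server]

    def server_for(self, doc_id):
        if not self._points:
            raise LookupError("no collaboration servers registered")
        # First ring point clockwise from the document's hash, wrapping around.
        idx = bisect.bisect_right(self._points, (_hash(doc_id), ""))
        return self._points[idx % len(self._points)][1]


ring = HashRing(["collab-1", "collab-2", "collab-3"])
owner = ring.server_for("doc-42")   # the same doc always routes to the same server
ring.remove("collab-2")             # only docs owned by collab-2 are reassigned
```

The useful property for this design is that losing one server moves only that server's documents. Every other document keeps its in-memory session where it already lives.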

Document Session Lifecycle

1. User opens doc → WebSocket Gateway routes to assigned Collaboration Server
2. If doc not in memory → Load latest snapshot from PostgreSQL
                        → Replay ops from Cassandra since snapshot
                        → Build in-memory state
3. User types → Client generates op → Sends via WebSocket
4. Server transforms → Assigns seq_num → Persists to op log → Broadcasts
5. Periodically (every 100 ops or 30 sec) → Write snapshot to PostgreSQL
6. Last user leaves → After 5 min idle, evict doc from memory
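
The lifecycle above can be sketched in a few dozen lines of Python. This is a simplified model under stated assumptions: plain dicts stand in for PostgreSQL (snapshots) and Cassandra (op log), only insert ops are supported, the 30-second snapshot timer and the 5-minute idle eviction are left out, and the names DocumentSession, Insert, transform, and submit are hypothetical.

```python
from dataclasses import dataclass

SNAPSHOT_EVERY_OPS = 100  # step 5; the 30-second timer is omitted in this sketch


@dataclass
class Insert:
    pos: int
    text: str
    client: str
    seq: int = 0  # assigned by the server in step 4


def transform(op, applied):
    """Shift `op` so it targets the same spot after `applied` has run first."""
    if applied.pos < op.pos or (applied.pos == op.pos and applied.client < op.client):
        return Insert(op.pos + len(applied.text), op.text, op.client)
    return op


class DocumentSession:
    def __init__(self, doc_id, snapshots, op_log):
        self.doc_id = doc_id
        self.snapshots = snapshots  # stand-in for PostgreSQL: doc_id -> (seq, text)
        self.op_log = op_log        # stand-in for Cassandra: doc_id -> [Insert, ...]
        # Step 2: load the latest snapshot, then replay newer ops.
        self.seq, self.text = snapshots.get(doc_id, (0, ""))
        for op in op_log.setdefault(doc_id, []):
            if op.seq > self.seq:
                self._apply(op)
                self.seq = op.seq

    def _apply(self, op):
        self.text = self.text[:op.pos] + op.text + self.text[op.pos:]

    def submit(self, op, base_seq):
        # Step 4: transform against every op the client had not yet seen.
        for applied in self.op_log[self.doc_id]:
            if applied.seq > base_seq:
                op = transform(op, applied)
        self.seq += 1
        op.seq = self.seq
        self.op_log[self.doc_id].append(op)  # persist before broadcasting
        self._apply(op)
        if self.seq % SNAPSHOT_EVERY_OPS == 0:
            self.snapshots[self.doc_id] = (self.seq, self.text)  # step 5
        return op  # the caller would broadcast this to all connected clients


snapshots, op_log = {}, {}
session = DocumentSession("doc1", snapshots, op_log)
session.submit(Insert(0, "Hello", "alice"), base_seq=0)
# Alice and Bob both edit against seq 1; Bob's op reaches the server first.
session.submit(Insert(0, "Oh, ", "bob"), base_seq=1)
session.submit(Insert(5, " world", "alice"), base_seq=1)  # shifted to position 9
print(session.text)  # Oh, Hello world

# Failover (server dies): a new server rebuilds state from snapshot + op log.
recovered = DocumentSession("doc1", snapshots, op_log)
```

Because one session object owns the document, assigning seq_num is a local counter increment rather than a distributed agreement, which is the main reason for pinning each document to a single server.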