This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 75 | Staff level: multi-region, cost at scale, migration path, and production metrics. |
Interview Prompt
Design P2P File Transfer System (like BitTorrent).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Piece selection (rarest first), Peer discovery (DHT/tracker), Tit-for-tat incentive? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Piece selection (rarest first)
- Peer discovery (DHT/tracker)
- Tit-for-tat incentive
- NAT traversal (STUN/TURN)
- Swarm management
- Capacity estimation with shown math
Out of scope (state explicitly)
- Detailed frontend/UI pixel implementation
- Org structure, staffing, and hiring plan
Assumptions
- Clarify scale (DAU, QPS, data volume) for p2p file transfer bittorrent in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Distribute large files (100MB-100GB) across thousands of peers without central server bottleneck
- File split into fixed-size pieces (256KB-4MB); each piece independently downloadable
- Peer discovery: find peers who have pieces of the desired file
- Piece selection: download rarest pieces first (rarest-first strategy)
- Tit-for-tat: preferentially upload to peers who upload to you
- Torrent file / magnet link: metadata describing file, piece hashes, tracker URL
- Distributed Hash Table (DHT): trackerless peer discovery (Kademlia)
- Integrity verification: SHA-1 hash per piece
- Resume interrupted downloads; partial file sharing while downloading
- Scalability: More downloaders = more bandwidth (opposite of client-server!)
- Fault Tolerance: Any peer can leave at any time without affecting others
- Decentralization: No single point of failure (DHT for trackerless mode)
- Integrity: Corrupted pieces detected and re-downloaded from another peer
- Fairness: Tit-for-tat prevents free-riding
| Metric | Calculation | Value |
|---|---|---|
| File size | Given (assumption documented in value) | 4 GB (typical movie) |
| Piece size | Given (assumption documented in value) | 2 MB → 2,000 pieces |
| Peers in swarm | Given (assumption documented in value) | 10,000 |
| Seeders (have complete file) | Given (assumption documented in value) | 2,000 |
| Leechers (downloading) | Given (assumption documented in value) | 8,000 |
| Per-peer upload capacity | Given (assumption documented in value) | 1 Mbps avg |
| Total swarm upload capacity | 10,000 × 1 Mbps | 10 Gbps |
| Download time (single peer) | 4 GB / 10 Mbps combined | ~53 min |
Piece Selection: Rarest-First Algorithm
Problem: If everyone downloads piece 1 first, piece 1 becomes common
and rare pieces might become unavailable if the only seeder leaves
Rarest-first:
1. Track which pieces each connected peer has (via bitfield + have messages)
2. Count availability of each piece across all connected peers
3. Download the piece with LOWEST availability first
4. Exception: first few pieces → random (get something to upload ASAP)
Example:
Piece 1: available from 50 peers → low priority
Piece 42: available from 2 peers → HIGH priority (download NOW)
Piece 99: available from 1 peer → CRITICAL priority
If the 1 peer with piece 99 leaves → piece 99 lost forever → swarm can't completeTit-for-Tat (Choking Algorithm)
Problem: Free-riders (download but never upload) degrade the swarm
Solution: Preferentially upload to peers who reciprocate
1. Every 10 seconds, unchoke top 4 peers by upload rate to you
2. Every 30 seconds, optimistic unchoke: randomly unchoke 1 peer
(gives new peers a chance to prove themselves)
3. All other peers are "choked" (you won't upload to them)
Result:
- Fast uploaders get fast downloads (reciprocity)
- Slow/no uploaders get slow downloads
- New peers get bootstrapped via optimistic unchoke
- Nash equilibrium: everyone uploads → everyone benefitsKademlia DHT Lookup
Setup: NodeA = 01101001, info_hash = 01100011
XOR distance = 00001010 (= 10 decimal)
Iteration 1: Pick a=3 closest known nodes to target
Send find_node(01100011) to all 3 in parallel (UDP)
Iteration 2: Closest node responds with ITS closest nodes
[01100001, 01100010, 01100100]
01100010 ? 01100011 = 00000001 (dist=1) ? EVEN CLOSER!
Send find_node to these new closer nodes
Iteration 3: Node 01100010 responds: [01100011] ? This IS the target!
Send get_peers(info_hash) → receives peer list
Total hops: O(log2 10M) ˜ 23 hops max. In practice: 4-8 hops.BitTorrent Wire Protocol
# Peer handshake (TCP) <pstrlen=19><pstr="BitTorrent protocol"><reserved=8 bytes> <info_hash=20 bytes><peer_id=20 bytes> # Messages: choke → "I won't upload to you" unchoke → "I will upload to you" interested → "I want pieces you have" not interested → "I don't need anything from you" have → "I just completed piece #42" bitfield → "Here's a bitmap of all pieces I have" request → "Send me piece #42, offset 0, length 16384" piece → "Here's the data for piece #42"
Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff
.torrent File (Bencoded)
{
"announce": "http://tracker.example.com/announce",
"info": {
"name": "movie.mkv",
"piece length": 2097152,
"pieces": "<concatenated SHA-1 hashes of all pieces>",
"length": 4294967296
}
}
info_hash = SHA-1(bencoded info dict) → 20-byte identifierKademlia DHT Routing Table
Node ID: 160-bit random number. K-buckets: 160 buckets, each stores up to 8 nodes. Bucket i: nodes with distance 2^i to 2^(i+1) from self.
Endgame Mode
Problem: Last few pieces take forever. Solution: When < 5% remaining, request ALL missing pieces from ALL peers. First response wins, cancel duplicates.
Peer Churn
Peers constantly join and leave. Tracker: re-announce every 30 min. DHT: announce every 15 min, timeout after 15 min. Client: maintain 20-50 connections, replace disconnected peers.
Piece Corruption
Every piece verified with SHA-1 hash. If hash mismatch ? discard, re-download from different peer. Ban peers that repeatedly send corrupt data.
Why P2P > Client-Server for Large Files
Client-Server: server bandwidth = bottleneck, cost ? number of downloaders. P2P: more downloaders = more aggregate bandwidth ? faster for everyone. 10,000 peers × 1 Mbps each = 10 Gbps total upload.
Magnet Links
magnet:?xt=urn:btih:<info_hash>: No .torrent file needed. Piece hashes fetched from peers via extension protocol. DHT used for initial peer discovery. Fully decentralized.
WebTorrent
Browser-based P2P using WebRTC data channels. WebRTC signaling via WebSocket tracker. Same algorithms. Use case: streaming video P2P to reduce CDN costs.
Interview Walkthrough
- Frame as content-addressed chunking:
info_hash(SHA-1 of piece hashes) identifies the swarm — integrity and dedup are built into the protocol. - Tracker and DHT provide peer discovery; clients maintain 20–50 connections and replace churned peers continuously.
- Rarest-first piece selection maximizes swarm efficiency — downloading pieces fewest peers have accelerates overall completion.
- Tit-for-tat choking: upload bandwidth goes to peers who upload to you; optimistic unchoke periodically probes new peers.
- End-game mode when < 5% remains: request all missing pieces from all peers in parallel — first response wins.
- Every piece verified with SHA-1 on receipt; corrupt data discarded and peer banned after repeated failures.
- Common pitfall: sequential piece download in order — ignores rarest-first strategy and starves the swarm when popular pieces are already saturated.
Choking/Unchoking Algorithm: State Machine
Each peer connection has 4 boolean states:
am_choking: We are choking them (not uploading)
am_interested: We are interested in their pieces
peer_choking: They are choking us
peer_interested: They are interested in our pieces
State transitions:
CONNECTION ESTABLISHED ? BITFIELD EXCHANGE ? EVERY 10 SECONDS
Rank interested peers by upload speed TO US ? Unchoke top 4
? EVERY 30 SECONDS: Optimistic Unchoke (random 1 peer)
Seeder behavior (special case):
Seeders have all pieces → nobody uploads TO them
Instead: unchoke peers with fastest DOWNLOAD rateRouting Table Maintenance
On receiving message from node X: compute distance, find bucket. If bucket full: ping least recently seen node; if it responds ? keep it (prefer long-lived nodes), discard X. Empirical: nodes online for 1 hour likely stay online. Bucket refresh every 15 minutes.
Why Piece Size Matters
Small pieces (256 KB): → Fine-grained piece selection → better rarest-first distribution ✓ Faster initial upload capability ✓ Less data wasted if piece fails hash check ? More SHA-1 hashes in .torrent file ✗ More protocol overhead, more seeks on HDD Large pieces (4 MB): ✓ Smaller .torrent file, less protocol overhead ✓ Better sequential disk I/O ? Slower to complete first piece ✗ More data wasted on hash failure Industry standard: 256 KB – 2 MB (adaptive based on total file size)
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core p2p file transfer bittorrent flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.