Design a P2P File Transfer System (like BitTorrent)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 75	Staff level: multi-region, cost at scale, migration path, and production metrics.

Interview Prompt

Design a P2P file transfer system (like BitTorrent).

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Piece selection (rarest first), Peer discovery (DHT/tracker), Tit-for-tat incentive?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Piece selection (rarest first)
Peer discovery (DHT/tracker)
Tit-for-tat incentive
NAT traversal (STUN/TURN)
Swarm management
Capacity estimation with shown math

Out of scope (state explicitly)

Detailed frontend/UI pixel implementation
Org structure, staffing, and hiring plan

Assumptions

Clarify scale (DAU, QPS, data volume) for the P2P file transfer system in the first 5 minutes
Standard reliability target of 99.9%–99.99% unless the problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Distribute large files (100MB-100GB) across thousands of peers without central server bottleneck
File split into fixed-size pieces (256KB-4MB); each piece independently downloadable
Peer discovery: find peers who have pieces of the desired file
Piece selection: download rarest pieces first (rarest-first strategy)
Tit-for-tat: preferentially upload to peers who upload to you
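Torrent file / magnet link: the .torrent file carries file metadata, piece hashes, and the tracker URL; a magnet link carries the info_hash (plus optional name and trackers), and the metadata is then fetched from peers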
Distributed Hash Table (DHT): trackerless peer discovery (Kademlia)
Integrity verification: SHA-1 hash per piece
Resume interrupted downloads; partial file sharing while downloading
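
The rarest-first strategy listed above can be sketched as follows. This is a minimal illustration, not a client implementation: the function name `pick_rarest_piece`, the dict-of-sets representation of peer bitfields, and the random tie-break are all assumptions made for the example.

```python
import random
from collections import Counter


def pick_rarest_piece(have, peer_bitfields, rng=random):
    """Return the index of the rarest piece we still need, or None.

    have: set of piece indices we already hold.
    peer_bitfields: dict mapping a peer id to the set of piece indices
                    that peer advertises (hypothetical representation).
    """
    # Count how many connected peers advertise each piece.
    availability = Counter()
    for pieces in peer_bitfields.values():
        availability.update(pieces)

    # Only pieces we lack and that at least one peer can serve qualify.
    candidates = [p for p in availability if p not in have]
    if not candidates:
        return None

    rarest = min(availability[p] for p in candidates)
    # Break ties randomly so peers do not all chase the same piece.
    return rng.choice([p for p in candidates if availability[p] == rarest])


peers = {
    "peer_a": {0, 1, 2},
    "peer_b": {0, 1},
    "peer_c": {0, 3},
}
print(pick_rarest_piece(have={3}, peer_bitfields=peers))  # 2 (only peer_a has it)
```

Requesting the least-replicated piece first spreads scarce pieces through the swarm early, which reduces the chance that a piece disappears when its only holder leaves.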

Metric	Calculation	Value
File size	Given (assumption documented in value)	4 GB (typical movie)
Piece size	Given (assumption documented in value)	2 MB → ~2,000 pieces (exactly 2,048 with the binary sizes used in the .torrent example)
Peers in swarm	Given (assumption documented in value)	10,000
Seeders (have complete file)	Given (assumption documented in value)	2,000
Leechers (downloading)	Given (assumption documented in value)	8,000
Per-peer upload capacity	Given (assumption documented in value)	1 Mbps avg
Total swarm upload capacity	10,000 × 1 Mbps	10 Gbps
Download time (single peer)	4 GB × 8 bits/byte ÷ 10 Mbps combined download rate = 3,200 s	~53 min
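
The arithmetic in the table can be checked with a few lines. This sketch assumes decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits/s) and treats the 10 Mbps figure as the combined rate a single downloader receives from several uploaders; the constant names are illustrative.

```python
FILE_SIZE_BYTES = 4 * 10**9      # 4 GB file
PIECE_SIZE_BYTES = 2 * 10**6     # 2 MB pieces
PEERS = 10_000                   # peers in the swarm
UPLOAD_MBPS_PER_PEER = 1         # average upload capacity per peer
DOWNLOAD_MBPS = 10               # assumed combined download rate for one peer

# Ceiling division: a trailing partial piece still counts as a piece.
pieces = -(-FILE_SIZE_BYTES // PIECE_SIZE_BYTES)

# Aggregate upload capacity of the swarm, in Gbps.
swarm_gbps = PEERS * UPLOAD_MBPS_PER_PEER / 1000

# Bytes -> bits, then divide by the download rate in bits per second.
download_seconds = FILE_SIZE_BYTES * 8 / (DOWNLOAD_MBPS * 10**6)
download_minutes = download_seconds / 60

print(f"{pieces} pieces, {swarm_gbps:g} Gbps, {download_minutes:.0f} min")
# 2000 pieces, 10 Gbps, 53 min
```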


.torrent File (Bencoded)

Shown as JSON for readability; the actual file is bencoded.

{
  "announce": "http://tracker.example.com/announce",
  "info": {
    "name": "movie.mkv",
    "piece length": 2097152,
    "pieces": "<concatenated SHA-1 hashes of all pieces>",
    "length": 4294967296
  }
}

info_hash = SHA-1(bencoded info dict) → 20-byte identifier
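
As a rough sketch of how the info_hash is derived, the snippet below bencodes an info dictionary and hashes it with SHA-1. The `bencode` helper is a minimal encoder written for this example (not a library API), and the `pieces` value is a zero-filled placeholder rather than real piece hashes.

```python
import hashlib


def bencode(value):
    """Encode ints, strings/bytes, lists, and dicts using bencoding."""
    if isinstance(value, int):
        return b"i" + str(value).encode() + b"e"
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return str(len(value)).encode() + b":" + value
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda kv: kv[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


info = {
    "name": "movie.mkv",
    "piece length": 2097152,
    "length": 4294967296,
    # Placeholder: 2,048 pieces x 20-byte SHA-1 digests, all zeros here.
    "pieces": b"\x00" * 20 * 2048,
}

info_hash = hashlib.sha1(bencode(info)).digest()
print(len(info_hash), info_hash.hex())
```

Because the hash covers the bencoded bytes, any change to the piece hashes, piece length, or file name produces a different info_hash and therefore a different swarm.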

Kademlia DHT Routing Table

Node ID: 160-bit random number. K-buckets: 160 buckets, each storing up to 8 nodes (K = 8). Distance between two nodes is the XOR of their IDs. Bucket i holds nodes whose XOR distance from self lies in [2^i, 2^(i+1)).
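
The bucket rule can be illustrated with a short sketch. The class and function names here are hypothetical, and the eviction policy is deliberately simplified: a full bucket just rejects the newcomer, whereas Kademlia would first ping the least recently seen node before deciding.

```python
import os

ID_BITS = 160  # node IDs are 160-bit integers
K = 8          # maximum nodes per bucket


def random_node_id():
    """Generate a random 160-bit node ID."""
    return int.from_bytes(os.urandom(ID_BITS // 8), "big")


def bucket_index(self_id, other_id):
    """Return i such that 2^i <= (self_id XOR other_id) < 2^(i+1)."""
    distance = self_id ^ other_id
    if distance == 0:
        raise ValueError("a node does not store itself in its routing table")
    return distance.bit_length() - 1


class RoutingTable:
    """Simplified routing table: one list of node IDs per bucket."""

    def __init__(self, self_id):
        self.self_id = self_id
        self.buckets = [[] for _ in range(ID_BITS)]

    def add(self, node_id):
        """Insert node_id; return False if its bucket is already full."""
        bucket = self.buckets[bucket_index(self.self_id, node_id)]
        if node_id in bucket:
            return True
        if len(bucket) >= K:
            return False  # simplified: no ping-and-evict step
        bucket.append(node_id)
        return True


table = RoutingTable(self_id=0)
print(bucket_index(0, 1))         # 0   (distance 1 is in [2^0, 2^1))
print(bucket_index(0, 1 << 159))  # 159 (the farthest bucket)
print(table.add(random_node_id()))
```

Half of the ID space falls into bucket 159, a quarter into bucket 158, and so on, so a node knows many nearby peers in detail and only a handful of distant ones.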

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
