Design a Blob Storage System (like S3)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Show domain depth beyond the baseline: async pipelines, consistency semantics, and operational SLOs.
Arch 75	Staff angles: partition behavior, cost drivers, and MVP → production evolution with clear triggers.

Interview Prompt

Design a blob storage system (like S3).

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Object metadata store, Erasure coding vs replication, Multipart upload?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Object metadata store
Erasure coding vs replication
Multipart upload
Garbage collection
Pre-signed URLs
Consistency model (strong read-after-write)

Out of scope (state explicitly)

Client desktop/mobile app implementation
End-user file preview rendering for every format
Building raw block storage hardware

Assumptions

Clarify scale (DAU, QPS, data volume) for the blob storage system in the first 5 minutes
Standard reliability target of 99.9%–99.99% unless the problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Upload objects (blobs) of any size (1 byte to 5 TB) via PUT
Download objects via GET with support for range requests (partial downloads)
Delete objects (immediate metadata removal, lazy storage reclamation)
List objects in a bucket with prefix filtering and pagination
Multipart upload: upload large objects in parts, assemble on completion
Versioning: maintain multiple versions of the same object key
Object metadata: custom key-value headers, content-type, content-encoding
Access control: per-bucket and per-object ACLs, IAM policies
Pre-signed URLs: generate time-limited URLs for upload/download without credentials
Lifecycle policies: auto-transition to cheaper storage tiers, auto-delete after N days

Metric	Calculation	Value
Total stored objects	Given (assumption documented in value)	100+ trillion (S3-scale)
Total storage	Given (assumption documented in value)	Exabytes
Requests / sec (global)	Requests / day ÷ 86,400, scaled by a peak factor	100M+
Avg object size	Given (typical workload assumption)	100 KB (highly variable)
Metadata per object	Given	~1 KB
Metadata storage	100T × 1KB	100 PB
Write throughput	Given (assumption documented in value)	10M objects/sec
Replication factor	Given (assumption documented in value)	3 (cross-AZ) + erasure coding for cold
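
The arithmetic behind the table can be checked with a few lines. This is a back-of-the-envelope sketch using the table's own inputs with decimal units (1 KB = 1,000 bytes); the derived ingest bandwidth figures are not in the table and follow only from multiplying the stated assumptions.

```python
KB, TB, PB, EB = 10**3, 10**12, 10**15, 10**18

# Inputs taken from the estimation table.
total_objects = 100 * 10**12          # 100 trillion objects
metadata_per_object = 1 * KB          # ~1 KB
avg_object_size = 100 * KB            # 100 KB average
writes_per_sec = 10 * 10**6           # 10M objects/sec
replication_factor = 3

# Derived values.
metadata_bytes = total_objects * metadata_per_object
logical_data_bytes = total_objects * avg_object_size
ingest_bytes_per_sec = writes_per_sec * avg_object_size
replicated_ingest = ingest_bytes_per_sec * replication_factor

print(f"metadata storage : {metadata_bytes / PB:.0f} PB")
print(f"logical data     : {logical_data_bytes / EB:.0f} EB")
print(f"ingest bandwidth : {ingest_bytes_per_sec / TB:.0f} TB/s logical, "
      f"{replicated_ingest / TB:.0f} TB/s after replication")
```

The 10 EB of logical data is consistent with the "Exabytes" row, and the metadata figure matches the 100 PB in the table.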

S3-style blob storage with an Object Service, Placement Service, Metadata Store (KV DB), and replicated Data Nodes. The write path streams data to the primary node, which replicates it; metadata is written last for strong consistency.


Write Path (PUT Object)

Client → API Gateway → Object Service (authenticates + authorizes)
If object > 5GB → require multipart upload
Object Service asks Placement Service: "Where to store 3 replicas?"
Object Service streams data to primary data node
Primary replicates to secondary nodes (chain replication)
After all 3 replicas written + checksums verified: write metadata
Metadata write is LAST step → ensures strong read-after-write consistency
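
The ordering of these steps can be sketched with in-memory stand-ins. `DataNode`, `put_object`, and the dict used as a metadata store are hypothetical; the loop writes to replicas one after another as a simplification of chain replication, where each node would forward to the next.

```python
import hashlib


class DataNode:
    """Hypothetical in-memory data node."""

    def __init__(self, node_id, healthy=True):
        self.node_id = node_id
        self.healthy = healthy
        self.blobs = {}

    def write(self, object_id, data):
        if not self.healthy:
            raise IOError(f"{self.node_id} unavailable")
        self.blobs[object_id] = data
        # Return the checksum of what was actually stored.
        return hashlib.sha256(self.blobs[object_id]).hexdigest()


def put_object(metadata_store, replicas, bucket, key, data):
    expected = hashlib.sha256(data).hexdigest()
    object_id = f"{bucket}/{key}"

    # 1. Write every replica and verify its checksum (primary first).
    for node in replicas:
        if node.write(object_id, data) != expected:
            raise IOError(f"checksum mismatch on {node.node_id}")

    # 2. Only now make the object visible by committing metadata.
    metadata_store[(bucket, key)] = {
        "size": len(data),
        "checksum_sha256": expected,
        "data_locations": [node.node_id for node in replicas],
    }
```

If any replica write raises, the function exits before the metadata commit, so readers never see a key whose data is incomplete. Orphaned bytes left on the nodes that did succeed are the kind of garbage a background collector would reclaim.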

Read Path (GET Object)

Client → API Gateway → Object Service
Object Service queries Metadata Store with (bucket, key)
Gets data locations: [{dn-17, vol-3, offset-4096}, ...]
Selects closest/healthiest data node
Streams data directly from data node to client
Verifies checksum on the fly
If checksum mismatch → try next replica + trigger repair
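
The checksum-and-fallback part of the read path might look like the following sketch. The function name `read_with_fallback`, the simplified `data_locations` list of node IDs, and the `repair_queue` list are assumptions for illustration; replica selection here is plain list order rather than closest/healthiest.

```python
import hashlib


def read_with_fallback(metadata, node_data, repair_queue):
    """Return the first replica whose bytes match the recorded checksum.

    node_data maps node_id -> bytes held by that node (None if unreachable).
    Replicas that fail verification are appended to repair_queue.
    """
    expected = metadata["checksum_sha256"]
    for node_id in metadata["data_locations"]:
        data = node_data.get(node_id)
        if data is not None and hashlib.sha256(data).hexdigest() == expected:
            return data
        # Missing or corrupt copy: schedule repair and try the next replica.
        repair_queue.append(node_id)
    raise IOError("no healthy replica found")
```

A real implementation would verify the checksum incrementally while streaming instead of buffering the whole object first.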

Optimization for small objects (< 256 KB): cache in Redis/in-memory. For large objects: range requests for parallel download.
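
For the large-object case, a client can split the object into byte ranges and fetch them in parallel. The helper names `split_ranges` and `range_header` are hypothetical; the header value follows the standard HTTP `Range: bytes=start-end` form, where both ends are inclusive.

```python
def split_ranges(size, part_size):
    """Return inclusive (start, end) byte ranges that cover an object of `size` bytes."""
    if size < 0 or part_size <= 0:
        raise ValueError("size must be >= 0 and part_size must be > 0")
    return [
        (start, min(start + part_size, size) - 1)
        for start in range(0, size, part_size)
    ]


def range_header(start, end):
    """Format one range as an HTTP Range header value."""
    return f"bytes={start}-{end}"


# A 10-byte object fetched in 4-byte parts.
for start, end in split_ranges(10, 4):
    print(range_header(start, end))
```

Each range can be requested on its own connection and the parts written to their offsets in the destination file.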

Data Node: Block Format

Storage is organized as an append-only (log-structured) volume file. Each volume file consists of fixed-size 64 MB blocks appended sequentially. Objects larger than 64 MB span multiple blocks, while multiple small objects are packed into a single block to avoid wasted space. An in-memory block index maps object_id → (volume_id, block_offset, length) for O(1) random access. Deletion is a soft delete (the index entry is removed); compaction later reclaims the space.
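
A minimal sketch of the append-only volume with an in-memory index follows. The class name `Volume` is hypothetical, the log is a `bytearray` rather than a file, and the 64 MB block boundaries are ignored so that the index, soft delete, and compaction stay easy to follow.

```python
class Volume:
    """Hypothetical single-volume store: append-only log plus in-memory index."""

    def __init__(self):
        self.log = bytearray()
        self.index = {}  # object_id -> (offset, length)

    def append(self, object_id, data):
        offset = len(self.log)
        self.log += data                      # writes only ever go to the tail
        self.index[object_id] = (offset, len(data))
        return offset

    def read(self, object_id):
        offset, length = self.index[object_id]  # O(1) lookup, then one slice
        return bytes(self.log[offset:offset + length])

    def delete(self, object_id):
        # Soft delete: the bytes stay in the log until compaction.
        self.index.pop(object_id, None)

    def compact(self):
        """Rewrite only live objects; return the number of bytes reclaimed."""
        new_log, new_index = bytearray(), {}
        for object_id, (offset, length) in self.index.items():
            new_index[object_id] = (len(new_log), length)
            new_log += self.log[offset:offset + length]
        reclaimed = len(self.log) - len(new_log)
        self.log, self.index = new_log, new_index
        return reclaimed
```

Deleting an object is cheap because it only touches the index; the cost is deferred to compaction, which can be scheduled when the volume's dead-byte ratio is high.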

Event Bus Design (Kafka)

Topic: blob_storage_s3-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (e.g. bucket + object key; preserves per-object ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "blob_storage_s3-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: blob_storage_s3-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
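
The consumer-side contract above (at-least-once delivery, dedup by event_id, DLQ after 3 retries) can be sketched without a Kafka client. `IdempotentHandler` is a hypothetical class; the in-memory `seen` set and `dlq` list stand in for a durable dedup store and the DLQ topic.

```python
class IdempotentHandler:
    """Hypothetical consumer wrapper: dedup by event_id, dead-letter after retries."""

    def __init__(self, apply, max_retries=3):
        self.apply = apply            # side effect, e.g. update a cache or index
        self.max_retries = max_retries
        self.seen = set()             # stand-in for a durable dedup store
        self.dlq = []                 # stand-in for the DLQ topic

    def handle(self, event):
        if event["event_id"] in self.seen:
            return "duplicate"        # redelivery is safe to ignore
        for _ in range(self.max_retries):
            try:
                self.apply(event)
            except Exception:
                continue              # retry the same event
            self.seen.add(event["event_id"])
            return "processed"
        self.dlq.append(event)        # poison message: park it and move on
        return "dead-lettered"
```

Recording the event_id only after `apply` succeeds means a crash between the two steps causes a re-run, so `apply` itself should also tolerate being repeated.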

Metadata Store (FoundationDB / DynamoDB)

JSON

Primary key: (bucket_name, object_key, version_id)
{
  "bucket": "my-photos",
  "key": "2026/march/sunset.jpg",
  "version_id": "v_abc123",
  "is_latest": true,
  "size": 2457600,
  "etag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
  "content_type": "image/jpeg",
  "storage_class": "STANDARD",
  "checksum_sha256": "abc123...",
  "data_locations": [
    {"node_id": "dn-17", "volume_id": "vol-3", "offset": 4096, "length": 2457600},
    {"node_id": "dn-42", "volume_id": "vol-1", "offset": 8192, "length": 2457600},
    {"node_id": "dn-63", "volume_id": "vol-7", "offset": 2048, "length": 2457600}
  ],
  "custom_metadata": { "x-amz-meta-photographer": "Alice" },
  "delete_marker": false,
  "created_at": "2026-03-14T10:00:00Z"
}
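
As a hedged illustration of how the `is_latest` and `delete_marker` fields might drive a GET without an explicit version, the sketch below resolves which record to serve. The function name `resolve_latest` and the list-of-dicts input are assumptions; a real store would answer this with a key-range query instead of a scan.

```python
def resolve_latest(versions):
    """Pick the record a plain GET should serve for one (bucket, key).

    versions: metadata records shaped like the JSON document above.
    Returns the latest record, or None when the key should read as not found.
    """
    latest = next((v for v in versions if v["is_latest"]), None)
    if latest is None or latest["delete_marker"]:
        # No versions at all, or the newest version is a delete marker.
        return None
    return latest


history = [
    {"version_id": "v_abc123", "is_latest": False, "delete_marker": False},
    {"version_id": "v_def456", "is_latest": True, "delete_marker": False},
]
print(resolve_latest(history)["version_id"])
```

With versioning enabled, a delete only adds a marker record, so older versions stay retrievable by explicit version_id.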

Bucket Metadata

SQL

CREATE TABLE buckets (
    bucket_name     TEXT PRIMARY KEY,
    owner_id        UUID NOT NULL,
    region          TEXT NOT NULL,
    versioning      TEXT DEFAULT 'disabled',
    storage_class   TEXT DEFAULT 'STANDARD',
    lifecycle_rules JSONB,
    cors_config     JSONB,
    policy          JSONB,
    created_at      TIMESTAMPTZ DEFAULT NOW()
);

Storage Tiers

Tier	Use Case	Durability	Availability	Access Time
Standard	Frequently accessed	11 nines	99.99%	< 100ms
Infrequent Access	Monthly access	11 nines	99.9%	< 100ms
Glacier	Quarterly access	11 nines	99.99%	1-5 min
Deep Archive	Yearly access	11 nines	99.99%	12 hours

Lifecycle automation: "After 30 days → IA, after 90 days → Glacier, after 365 days → Deep Archive", applied daily by the Lifecycle Manager.
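
The example rule can be expressed as a small lookup that a daily lifecycle job might run per object. The tier identifiers, the `RULES` list, and `target_tier` are hypothetical, and treating "after N days" as "age of at least N days" is an assumption.

```python
# (minimum age in days, target tier), checked from oldest threshold to newest.
RULES = [
    (365, "DEEP_ARCHIVE"),
    (90, "GLACIER"),
    (30, "STANDARD_IA"),
]


def target_tier(age_days):
    """Return the tier an object of the given age should live in."""
    for min_age, tier in RULES:
        if age_days >= min_age:
            return tier
    return "STANDARD"


for age in (10, 45, 120, 400):
    print(age, target_tier(age))
```

The job would compare this target with the object's current `storage_class` and enqueue a transition only when they differ.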

Event Notifications

The storage system emits events for ObjectCreated, ObjectRemoved, and lifecycle transitions, and delivers them to Kafka/SQS or Lambda. Use cases: image uploaded → trigger thumbnail generation, log uploaded → trigger ETL, object deleted → update search index.

Interview Walkthrough

Split metadata store (bucket, key, version, replica locations) from blob data (append-only volumes on data nodes) — the defining architectural boundary.
Write path order matters: stream data to 3 replicas and verify checksums FIRST, write metadata LAST — this is how strong read-after-write consistency is achieved.
Multipart upload for objects > 5 GB: upload parts independently; the complete call assembles metadata only, with no data copy.
Erasure coding (RS 10+4) for cold/archive tiers — 1.4× storage overhead vs 3× for hot replication.
Background scrubbing compares replica checksums weekly; bit-rot detected on read triggers re-replication from healthy copy.
Lifecycle policies automate tier transitions (Standard → IA → Glacier) to optimize cost without manual intervention.
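Common pitfall: writing metadata before all replicas confirm. A GET issued right after the PUT can then resolve to a replica that does not yet hold the data (or holds only part of it), so it fails or returns stale data. This is the classic eventual-consistency trap.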
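
The storage-overhead comparison in the walkthrough (1.4× for RS 10+4 versus 3× for replication) follows from simple ratios. The helper names below are hypothetical; the loss-tolerance figures assume an ideal Reed-Solomon code, where any `data_shards` of the total shards are enough to rebuild the object.

```python
def replication_profile(copies):
    """Raw bytes stored per logical byte, and how many copies can be lost."""
    return {"overhead": float(copies), "tolerated_losses": copies - 1}


def erasure_profile(data_shards, parity_shards):
    """Same figures for an erasure code with the given shard counts."""
    total = data_shards + parity_shards
    return {"overhead": total / data_shards, "tolerated_losses": parity_shards}


print("3x replication:", replication_profile(3))
print("RS 10+4       :", erasure_profile(10, 4))
```

Under these assumptions erasure coding stores less than half the raw bytes while surviving more simultaneous losses; the cost is that reads and repairs touch 10 shards instead of one copy, which is why the walkthrough reserves it for cold and archive tiers.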

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
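
To make the availability target concrete, the error budget can be converted into allowed downtime. This is a sketch; the function name `downtime_budget_minutes` and the 30-day window are assumptions.

```python
def downtime_budget_minutes(availability, days=30):
    """Minutes of full outage allowed in the window at the given availability."""
    return (1 - availability) * days * 24 * 60


for target in (0.999, 0.9995, 0.9999):
    print(f"{target:.2%} -> {downtime_budget_minutes(target):.1f} min per 30 days")
```

At the 99.95% target in the table this works out to roughly 21.6 minutes per 30 days, which is the budget that planned maintenance and incidents share.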

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
