Design a Music Streaming Service (Spotify) – System Design Walkthrough

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 75	Staff level: multi-region, cost at scale, migration path, and production metrics.

Interview Prompt

Design Music Streaming Service (Spotify).

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Audio streaming protocols, Playlist service, Offline sync?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Audio streaming protocols
Playlist service
Offline sync
Codec selection
CDN delivery
Capacity estimation with shown math

Out of scope (state explicitly)

Recommendation / home feed ranking (#48, #65)
Live chat and comments (#36)
DRM license server internals

Assumptions

Clarify scale (DAU, QPS, data volume) for music streaming in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Functional Requirements

Stream music: Play songs on-demand with continuous playback
Search: Search by song, artist, album, genre, lyrics
Playlists: Create, edit, share, follow playlists (personal + editorial + algorithmic)
Library: Save songs, albums, artists to personal library
Recommendations: Discover Weekly, Daily Mix, Release Radar (personalized)
Social: Follow friends, see what they're listening to, collaborative playlists
Offline mode: Download songs for offline playback
Podcasts: Stream and download podcast episodes
Queue: Manage playback queue, shuffle, repeat
Cross-device: Seamlessly switch playback between devices (Spotify Connect)

Capacity Estimation

Metric	Calculation	Value
Total users	Given (product assumption)	500M
DAU	Given (product assumption)	200M
Total songs	Given (assumption documented in value)	100M
Avg song size	Given (typical workload assumption)	5 MB (compressed @ 256 kbps)
Total music storage	100M × 5 MB	500 TB
Concurrent listeners	Given (peak load assumption)	50M
Bandwidth (256 kbps per listener)	50M × 256 kbps	12.8 Tbps
Songs played / day	200M DAU × 10	2B (10 per user)
Streams / sec	2B ÷ 86400	~23K
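
The arithmetic in the table can be re-checked with a few lines of Python. This is only a sketch: every input is one of the assumptions stated in the table, and the constant names are illustrative.

```python
# Back-of-envelope capacity math. All inputs are the table's assumptions.
TOTAL_SONGS = 100_000_000
AVG_SONG_BYTES = 5 * 10**6          # 5 MB, decimal units
CONCURRENT_LISTENERS = 50_000_000
BITRATE_BPS = 256_000               # 256 kbps per listener
DAU = 200_000_000
PLAYS_PER_USER_PER_DAY = 10
SECONDS_PER_DAY = 86_400

storage_tb = TOTAL_SONGS * AVG_SONG_BYTES / 10**12
peak_egress_tbps = CONCURRENT_LISTENERS * BITRATE_BPS / 10**12
plays_per_day = DAU * PLAYS_PER_USER_PER_DAY
avg_streams_per_sec = plays_per_day / SECONDS_PER_DAY

print(f"storage:      {storage_tb:.0f} TB (one encoding per track)")
print(f"peak egress:  {peak_egress_tbps:.1f} Tbps")
print(f"plays/day:    {plays_per_day:,}")
print(f"streams/sec:  {avg_streams_per_sec:,.0f} (daily average, not peak)")
```

Note that the storage figure covers a single encoding per track; storing several bitrates per track multiplies it accordingly.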

Audio Streaming Architecture

Client requests song → API returns song metadata + CDN URL
Client downloads audio from CDN in chunks (HTTP range requests)
Gapless playback: Client prefetches next song's first 5 seconds while current song plays
Crossfade: Overlap end of current song with start of next song
Audio format: OGG Vorbis (free) at multiple bitrates (24, 96, 160, 320 kbps)
Normalization: ReplayGain / loudness normalization so songs play at consistent volume

Music Catalog Service

Master database of all songs, albums, artists, metadata
Data source: Record labels provide metadata via ingestion pipelines
Relationships: Song → Album → Artist, Song → Genre, Song ↔ Song (features/remixes)
MySQL sharded by artist_id or song_id

Search Service (Elasticsearch)

Full-text search across songs, artists, albums, playlists, podcasts
Features: Fuzzy matching ("bettles" → "Beatles"), autocomplete, did-you-mean
Index fields: title, artist_name, album_name, genre, lyrics (if available), popularity_score
Ranking: BM25 text relevance × popularity × recency

Playlist Service

Personal playlists: User-created, stored in Cassandra
Collaborative playlists: Multiple users can add/remove songs
Algorithmic playlists: Generated by recommendation engine (Discover Weekly, Daily Mix, Release Radar)

Recommendation Engine: Deep Dive

Collaborative Filtering:

User-user: "Users with similar listening habits liked song X"
Item-item: "Users who listened to song A also listened to song B"
Matrix factorization (ALS: Alternating Least Squares) on user-song interaction matrix
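
A minimal sketch of the item-item idea above, assuming binary "user listened to track" data and cosine similarity as the measure. The function names and the toy data are hypothetical, and a production system would use matrix factorization on far larger data rather than this pairwise count.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def item_similarities(listens: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Cosine similarity between tracks over binary user-listen vectors."""
    listeners = defaultdict(int)   # track -> number of users who played it
    together = defaultdict(int)    # (a, b) with a < b -> users who played both
    for tracks in listens.values():
        for t in tracks:
            listeners[t] += 1
        for a, b in combinations(sorted(tracks), 2):
            together[(a, b)] += 1
    return {
        (a, b): n / sqrt(listeners[a] * listeners[b])
        for (a, b), n in together.items()
    }

def similar_to(track: str, sims: dict[tuple[str, str], float], k: int = 3) -> list[str]:
    """Top-k tracks most similar to `track`, ties broken by track id."""
    scored = []
    for (a, b), score in sims.items():
        if track == a:
            scored.append((score, b))
        elif track == b:
            scored.append((score, a))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:k]]

# Toy data: which tracks each user has played.
listens = {"u1": {"A", "B"}, "u2": {"A", "B", "C"}, "u3": {"A", "C"}, "u4": {"B"}}
sims = item_similarities(listens)
print(similar_to("A", sims))
```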

Content-Based Filtering:

Audio features: tempo, key, energy, danceability, acousticness (extracted via audio analysis ML models)
Genre, artist similarity

Spotify's Actual Approach (simplified):

Audio embeddings: CNN analyzes raw audio → 128-dim embedding vector
NLP on playlists: Word2Vec-like model trained on playlist track sequences (playlist = "sentence", song = "word")
Graph-based: Artist/song knowledge graph
Bandit exploration: Deliberately recommend new/unfamiliar music to explore user preferences

Serving: Pre-compute top 100 recommendations per user (Spark batch job → Redis cache) served in < 50 ms

Spotify Connect (Cross-Device)

Each device registers with the Device Session Service
User can see all active devices and transfer playback
How: Current device sends "transfer" command to server → server notifies target device → target device starts streaming from the same position
Uses MQTT or WebSocket for real-time device communication

Device Session Service

Tracks which devices a user has and which one is currently active
Redis storage:
- session:{user_id}:active → {device_id} (currently playing device)
- session:{user_id}:devices → HASH of device_id → {type, name, last_seen} (a Redis SET holds only plain members, not key/value pairs)
On playback transfer: update active key → push notification to target device via MQTT/WebSocket
On device disconnect (no heartbeat for 60s): remove from devices set
Ensures only ONE device plays at a time (free tier restriction): check active device before allowing playback start
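
The bookkeeping described above can be sketched in memory as follows. This is an illustration, not the service itself: plain dicts stand in for the Redis keys, the class and method names are hypothetical, and the push notification to the target device is left out.

```python
import time

HEARTBEAT_TIMEOUT_S = 60  # "no heartbeat for 60s" from the text

class DeviceSessions:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._devices = {}   # user_id -> {device_id: last_seen}
        self._active = {}    # user_id -> device_id currently playing

    def heartbeat(self, user_id, device_id):
        self._devices.setdefault(user_id, {})[device_id] = self._clock()

    def _expire(self, user_id):
        """Drop devices whose last heartbeat is older than the timeout."""
        now = self._clock()
        devices = self._devices.get(user_id, {})
        stale = [d for d, seen in devices.items() if now - seen > HEARTBEAT_TIMEOUT_S]
        for device_id in stale:
            del devices[device_id]
            if self._active.get(user_id) == device_id:
                del self._active[user_id]

    def transfer(self, user_id, target_device_id):
        """Make the target the single active device; return the previous one (or None)."""
        self._expire(user_id)
        if target_device_id not in self._devices.get(user_id, {}):
            raise LookupError("target device is not online")
        previous = self._active.get(user_id)
        self._active[user_id] = target_device_id
        return previous

    def active(self, user_id):
        self._expire(user_id)
        return self._active.get(user_id)
```

Because there is exactly one active entry per user, starting playback on a second device is always a transfer, which is what enforces the one-device rule.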

Playback & Royalty Pipeline

Tracks every stream for analytics, billing, and royalty payments to rights holders:

Client reports playback: POST /api/v1/playback/report with track_id, duration_listened_ms, completed, quality, device_id
API Server publishes to Kafka topic playback-events (partitioned by user_id)
Flink streaming job processes events:
- Deduplication: Same user + same track + same timestamp → drop duplicate (client retry)
- Validation: Duration ≥ 30 seconds to count as a "stream" (industry standard for royalty)
- Fraud filtering: bot detection on inhuman patterns (1000 plays/hour, zero skip rate, same track on loop from fresh accounts)
- Enrichment: Join with track metadata (artist_id, label_id, territory)
Outputs: PostgreSQL (stream counts), ClickHouse (listening analytics), Royalty Ledger DB
Monthly royalty calculation (batch Spark job): Total revenue pool for the month split per contractual agreement: label gets X%, artist gets Y%, songwriter gets Z%

Event Bus Design (Kafka)

Topic: music_streaming-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: user_id (preserves per-user ordering of events)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "music_streaming-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: music_streaming-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
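
The consumer-side guarantees listed above (at-least-once delivery, idempotent handlers, DLQ) can be sketched without a broker. In this sketch an in-memory set stands in for a durable dedup store, a list stands in for the DLQ topic, and "3 retries" is read as three attempts in total. All of these, plus the names Consumer and bump, are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_ATTEMPTS = 3  # assumption: "after 3 retries" read as 3 attempts in total

@dataclass
class Consumer:
    handler: Callable[[dict], None]
    seen_event_ids: set = field(default_factory=set)  # stand-in for a durable dedup store
    dlq: list = field(default_factory=list)           # stand-in for the DLQ topic

    def process(self, event: dict) -> str:
        event_id = event["event_id"]
        if event_id in self.seen_event_ids:
            return "duplicate"            # redelivery under at-least-once: skip
        for _attempt in range(MAX_ATTEMPTS):
            try:
                self.handler(event)
            except Exception:
                continue                  # a real worker would back off before retrying
            self.seen_event_ids.add(event_id)
            return "processed"
        self.dlq.append(event)            # poison message: park it and move on
        return "dead-lettered"

# Hypothetical handler: count plays per track, reject malformed events.
counts = {}
def bump(event):
    if "track_id" not in event:
        raise ValueError("malformed event")
    counts[event["track_id"]] = counts.get(event["track_id"], 0) + 1

consumer = Consumer(bump)
results = [
    consumer.process({"event_id": "e1", "track_id": "t1"}),
    consumer.process({"event_id": "e1", "track_id": "t1"}),  # redelivered
    consumer.process({"event_id": "e2"}),                    # always fails
]
print(results, counts, len(consumer.dlq))
```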

Get Track / Stream

HTTP

GET /api/v1/tracks/{track_id}
Response: 200 OK
{
  "track_id": "track-uuid",
  "title": "Bohemian Rhapsody",
  "artist": {"id": "...", "name": "Queen"},
  "album": {"id": "...", "name": "A Night at the Opera", "cover_url": "..."},
  "duration_ms": 354000,
  "stream_url": "https://cdn.spotify.com/audio/{track_id}/320.ogg",
  "preview_url": "https://cdn.spotify.com/preview/{track_id}.mp3"
}

Search

HTTP

GET /api/v1/search?q=bohemian+rhapsody&type=track,artist,album&limit=10

Playlist Operations

HTTP

POST /api/v1/playlists
{ "name": "My Playlist", "description": "...", "public": true }

POST /api/v1/playlists/{playlist_id}/tracks
{ "track_ids": ["track-1", "track-2"], "position": 0 }

GET /api/v1/playlists/{playlist_id}/tracks?offset=0&limit=50

Get Recommendations

HTTP

GET /api/v1/recommendations?seed_tracks=track-1,track-2&seed_artists=artist-1&limit=30

Report Playback

HTTP

POST /api/v1/playback/report
{
  "track_id": "track-uuid",
  "duration_listened_ms": 200000,
  "completed": false,
  "device_id": "device-uuid",
  "quality": "320kbps"
}

Get Historical Rankings

HTTP

GET /api/v1/rankings/history?category=pop&date=2026-03-01&limit=10

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
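202 Accepted (not an error): job queued; poll GET /jobs/{id} for status
408 Request Timeout: server timed out waiting for the client to finish sending the request; resend it. A job that is still processing is reported by the polling endpoint's status, not by 408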
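
One way a client might turn the retry guidance above into code is sketched here. The set of retryable statuses, the base and cap values, and the "full jitter" strategy are assumptions, and Retry-After is assumed to arrive as a number of seconds rather than an HTTP date.

```python
import random

RETRYABLE = {429, 500, 503}  # assumption: the statuses the list above marks as retryable

def retry_delay(status, attempt, retry_after=None, base=0.5, cap=30.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (1-based), or None if not retryable."""
    if status not in RETRYABLE:
        return None
    if retry_after is not None:
        return float(retry_after)                    # the server's hint wins
    ceiling = min(cap, base * 2 ** (attempt - 1))    # exponential growth, capped
    return rng() * ceiling                           # full jitter: uniform in [0, ceiling)

print(retry_delay(404, 1))
print(retry_delay(429, 1, retry_after=7))
print(retry_delay(503, 3, rng=lambda: 1.0))   # rng pinned to show the ceiling
```

Retries of writes should reuse the same idempotency key on every attempt so that a request which succeeded but timed out is not applied twice.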

MySQL: Song Metadata

SQL

CREATE TABLE tracks (
    track_id        BIGINT PRIMARY KEY,
    title           VARCHAR(256),
    artist_id       BIGINT,
    album_id        BIGINT,
    duration_ms     INT,
    genre           VARCHAR(64),
    release_date    DATE,
    popularity      INT,          -- 0-100
    explicit        BOOLEAN,
    audio_features  JSON,         -- tempo, key, energy, etc.
    cdn_path        VARCHAR(512),
    created_at      TIMESTAMP,
    INDEX idx_artist (artist_id),
    INDEX idx_album (album_id),
    FULLTEXT idx_title (title)
);

CREATE TABLE artists (
    artist_id       BIGINT PRIMARY KEY,
    name            VARCHAR(256),
    bio             TEXT,
    image_url       TEXT,
    follower_count  BIGINT,
    genre           VARCHAR(64)
);

CREATE TABLE albums (
    album_id        BIGINT PRIMARY KEY,
    title           VARCHAR(256),
    artist_id       BIGINT,
    cover_url       TEXT,
    release_date    DATE,
    track_count     INT
);

Cassandra: User Library & Listen History

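CQL

CREATE TABLE user_library (
    user_id     UUID,
    item_type   TEXT,        -- 'track', 'album', 'artist', 'playlist'
    item_id     BIGINT,
    added_at    TIMESTAMP,
    -- item_id is part of the key so two items saved at the same timestamp do not overwrite each other
    PRIMARY KEY (user_id, item_type, added_at, item_id)
) WITH CLUSTERING ORDER BY (item_type ASC, added_at DESC, item_id ASC);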

CREATE TABLE listen_history (
    user_id         UUID,
    listened_at     TIMESTAMP,
    track_id        BIGINT,
    duration_ms     INT,
    completed       BOOLEAN,
    context_type    TEXT,     -- 'playlist', 'album', 'radio', 'search'
    context_id      TEXT,
    PRIMARY KEY (user_id, listened_at)
) WITH CLUSTERING ORDER BY (listened_at DESC)
  AND default_time_to_live = 7776000;  -- 90 days

Cassandra: Playlists

CQL

CREATE TABLE playlists (
    playlist_id     UUID PRIMARY KEY,
    owner_id        UUID,
    name            TEXT,
    description     TEXT,
    is_public       BOOLEAN,
    follower_count  INT,
    track_count     INT,
    cover_url       TEXT,
    created_at      TIMESTAMP
);

CREATE TABLE playlist_tracks (
    playlist_id     UUID,
    position        INT,
    track_id        BIGINT,
    added_by        UUID,
    added_at        TIMESTAMP,
    PRIMARY KEY (playlist_id, position)
);

S3: Audio Files

Bucket: spotify-audio
Path:   /{track_id}/
Files:  24.ogg, 96.ogg, 160.ogg, 320.ogg
        preview.mp3 (30-second preview)

Redis: Recommendation Cache

Key:    reco:{user_id}:discover_weekly
Value:  List of track_ids
TTL:    604800 (7 days, until next Monday)
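
A fixed TTL of 604800 seconds expires seven days after the write, which only coincides with "next Monday" if the cache is written exactly at the Monday refresh. If expiry should track the weekly refresh instead, the TTL can be computed at write time. The sketch below assumes the refresh happens at Monday 00:00 UTC; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def seconds_until_next_monday(now: datetime) -> int:
    """TTL that expires at the next Monday 00:00 (in now's timezone), strictly after `now`."""
    days_ahead = (7 - now.weekday()) % 7 or 7     # Monday is weekday() == 0
    next_monday = (now + timedelta(days=days_ahead)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return int((next_monday - now).total_seconds())

# Written exactly at a Monday refresh: a full week.
print(seconds_until_next_monday(datetime(2026, 3, 2, tzinfo=timezone.utc)))
# Written mid-week (Wednesday noon): expires 4.5 days later.
print(seconds_until_next_monday(datetime(2026, 3, 4, 12, tzinfo=timezone.utc)))
```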

Concern	Solution
CDN failure	Multi-CDN setup; fallback to alternate CDN or direct-from-origin
Playback interruption	Client buffers 30+ seconds ahead; can survive short outages
Metadata DB failure	MySQL read replicas; client caches metadata locally
Recommendation service down	Serve pre-cached recommendations from Redis; fallback to popularity-based
Audio file corruption	Checksum verification; original master files always preserved

Specific: Seamless Playback During Issues

Client downloads 30-60 seconds of audio ahead of playback position
If CDN is slow, quality auto-downgrades (320k → 160k → 96k)
If CDN is unreachable, play from local cache/offline downloads
"Offline mode" allows downloaded content to play without any network

Audio Streaming: How Chunks Are Delivered

Client requests: GET /audio/{track_id}/320.ogg
But NOT as a single giant download — uses HTTP Range Requests:

Why range requests (not full download)?
  1. Instant playback: start playing after the first 256 KB (about 8 seconds of audio at 256 kbps;
     the chunk itself downloads in well under a second on a multi-Mbps connection)
  2. Seek support: user scrubs to 3:00 → request bytes at offset for 3:00
     No need to download 0:00-3:00 first
  3. Bandwidth savings: user skips after 30 seconds → only downloaded 30s of audio
  4. Resumable: if connection drops, resume from last byte received

Prefetch strategy:
  While playing chunk N, request chunk N+1 in background
  Buffer target: 30-60 seconds ahead of playback position
  If buffer drops below 10 seconds → reduce quality (320 → 160 kbps)
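
To make the seek case concrete, here is a sketch of mapping a seek position to a Range header. It assumes a constant bitrate and a 256 KB chunk size, both simplifications: Ogg Vorbis files are usually variable bitrate, so a real client would locate the offset through the container rather than by multiplication. The function names are hypothetical.

```python
def byte_range_for_seek(seek_seconds: float, bitrate_kbps: int,
                        chunk_bytes: int = 256 * 1024) -> tuple[int, int]:
    """Chunk-aligned byte range containing the seek position, assuming constant bitrate."""
    bytes_per_second = bitrate_kbps * 1000 // 8
    start = int(seek_seconds * bytes_per_second)
    start -= start % chunk_bytes              # align down to a chunk boundary
    return start, start + chunk_bytes - 1     # HTTP Range end offsets are inclusive

def range_header(start: int, end: int) -> str:
    return f"bytes={start}-{end}"

# User scrubs to 3:00 in the 320 kbps file.
start, end = byte_range_for_seek(180, 320)
print(range_header(start, end))
```

The client sends this as the Range request header and expects 206 Partial Content in response.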

Adaptive Bitrate for Audio

Unlike video (which typically uses HLS/DASH manifests), audio streaming in this design
uses a simpler adaptive bitrate scheme based on network probing:

  Client measures: download_time for each chunk
  Estimate bandwidth: chunk_size / download_time

  bandwidth > 500 kbps → stream 320 kbps (very high)
  bandwidth 200-500 kbps → stream 160 kbps (high)
  bandwidth 100-200 kbps → stream 96 kbps (normal)
  bandwidth < 100 kbps → stream 24 kbps (low, or pause to buffer)

Quality switch happens at chunk boundaries (not mid-chunk):
  Playing 320k chunk → bandwidth drops → next chunk requested at 160k
  Decoder handles the bitrate switch (OGG Vorbis at every bitrate → same codec)
  User hears brief quality change but no interruption
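
The threshold table above translates directly into a selection function. This is a sketch: the handling of the exact boundary values (500, 200, 100 kbps) is an assumption, since the ranges in the text leave them ambiguous, and the function names are illustrative.

```python
def estimate_bandwidth_kbps(chunk_bytes: int, download_seconds: float) -> float:
    """Bandwidth estimate from one chunk: chunk_size / download_time, in kbps."""
    return chunk_bytes * 8 / 1000 / download_seconds

def pick_bitrate_kbps(bandwidth_kbps: float) -> int:
    """Bitrate to request for the NEXT chunk, per the thresholds in the text."""
    if bandwidth_kbps > 500:
        return 320
    if bandwidth_kbps >= 200:
        return 160
    if bandwidth_kbps >= 100:
        return 96
    return 24

# A 256 KB chunk that took 1 second to download is about 2097 kbps.
bandwidth = estimate_bandwidth_kbps(256 * 1024, 1.0)
print(round(bandwidth), pick_bitrate_kbps(bandwidth))
```

A real client would likely smooth the estimate over several chunks so a single slow download does not cause the quality to flap.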

Pre-encoded files:
  Each track stored as 4 separate files on S3:
    {track_id}/24.ogg, 96.ogg, 160.ogg, 320.ogg
  CDN caches all quality levels (most traffic hits 160 and 320)
  
  Why not dynamic transcoding?
    Audio transcoding is cheap (~0.1s per track) but:
    - Adds latency to first chunk
    - CDN can't cache dynamically generated content efficiently
    - Pre-encoded: S3 storage is cheaper than real-time compute

Shuffle Algorithm: Why Naive Random Is Wrong

Naive random shuffle (Fisher-Yates):
  Randomly permute the queue → play in that order
  
  Problem: pure random can produce "clumpy" sequences
    12 songs, 4 by Artist A, 4 by Artist B, 4 by Artist C
    Random shuffle might produce: A, A, A, B, C, B, A, C, C, B, B, C
    → Three Artist A songs in a row → feels "not shuffled" to humans
  
  Humans expect "random" to mean "evenly spread" (not truly random)

Spotify's dithered shuffle algorithm:
  1. Group songs by artist
  2. Place each artist's songs evenly spaced across the queue:
     Artist A (4 songs): positions 0, 3, 6, 9
     Artist B (4 songs): positions 1, 4, 7, 10
     Artist C (4 songs): positions 2, 5, 8, 11
  3. Add small random jitter to each position (±1 position)
  4. Sort by jittered position → final shuffle order
  
  Result: roughly A, B, C, A, B, C, A, B, C, A, B, C (with slight variation from the jitter)
  → Long same-artist runs become unlikely → feels "more random" to users
  
  The same spreading idea can be extended to: genre, tempo, mood
    Avoid: two slow ballads back-to-back
    Avoid: three hip-hop tracks then three classical
    
  This is why a spread shuffle "sounds right" while true random doesn't.
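
A minimal sketch of the spreading idea, not Spotify's actual implementation. It uses fractional positions instead of integer slots: an artist with n songs gets one song in each 1/n slice of the queue, jittered randomly inside its slice. The function name and the (track_id, artist) input shape are assumptions.

```python
import random
from collections import defaultdict

def spread_shuffle(tracks, rng=random):
    """tracks: list of (track_id, artist). Returns track_ids with each artist spread out."""
    by_artist = defaultdict(list)
    for track_id, artist in tracks:
        by_artist[artist].append(track_id)

    placed = []
    for songs in by_artist.values():
        rng.shuffle(songs)                  # random order within one artist
        n = len(songs)
        for i, track_id in enumerate(songs):
            # Song i lands somewhere inside its own 1/n slice of the queue.
            placed.append(((i + rng.random()) / n, track_id))

    placed.sort(key=lambda item: item[0])   # order by jittered position
    return [track_id for _, track_id in placed]

# 12 songs: 4 each by artists A, B, C (track ids like "A0", "B3").
tracks = [(f"{artist}{i}", artist) for artist in "ABC" for i in range(4)]
print(spread_shuffle(tracks, random.Random(42)))
```

With equal-sized artist groups, as in this example, three songs by the same artist can never be adjacent, although two in a row can still occur.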

Collaborative Playlist: Concurrency Challenges

Two users edit the same playlist simultaneously:

  User A: adds "Song X" at position 3
  User B: removes song at position 2

  Without coordination (positions are 1-indexed here):
    Original: [S1, S2, S3, S4]
    User A sees: [S1, S2, S3, S4] → inserts at 3 → [S1, S2, Song X, S3, S4]
    User B sees: [S1, S2, S3, S4] → removes at 2 → [S1, S3, S4]
    
    Server receives both → the result depends on arrival order.
    If A first: [S1, S2, Song X, S3, S4] → B removes pos 2 → [S1, Song X, S3, S4] (as intended)
    If B first: [S1, S3, S4] → A inserts at pos 3 → [S1, S3, Song X, S4] (WRONG!)
    A intended Song X to follow S2 and sit before S3, not to land after S3.

Solution: Operate on track IDs, not positions

  User A: INSERT track_id="song-x" AFTER track_id="s2"
  User B: DELETE track_id="s2"
  
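  Outcome by arrival order:
    A first: [S1, S2, Song X, S3, S4] → delete S2 → [S1, Song X, S3, S4]
    B first: anchor S2 is already gone, so the server needs a policy:
      - keep a tombstone for S2 and insert after it → [S1, Song X, S3, S4] (same result in both orders)
      - or append when the anchor is missing → [S1, S3, S4, Song X]
  
  Implementation in Cassandra:
    playlist_tracks keyed by (playlist_id, position)
    Each modification: read current state → compute new state → batch write
    Optimistic concurrency: IF version = expected_version (lightweight transaction)
      The schema shown earlier has no version column; one option is a static column
      on playlist_tracks, so the conditional batch stays inside a single partition.
    On conflict: re-read, re-apply, retry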

  For real-time sync (multiple users viewing playlist):
    WebSocket: server pushes playlist diff to all connected viewers
    Clients apply diff to local state → instant UI update
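
The ID-anchored operations can be sketched as below. This is an illustration under two assumptions: deletes leave a tombstone so a later insert can still find its anchor, and each track appears at most once per playlist (a real design would anchor on a per-entry ID). The operation shapes and function names are hypothetical.

```python
def apply_op(entries, op):
    """entries: list of {"id", "deleted"} dicts in playlist order (tombstones kept)."""
    ids = [e["id"] for e in entries]
    if op["type"] == "delete":
        for e in entries:
            if e["id"] == op["track_id"]:
                e["deleted"] = True     # tombstone: hidden, but still usable as an anchor
    elif op["type"] == "insert_after":
        new_entry = {"id": op["track_id"], "deleted": False}
        if op["anchor"] in ids:
            entries.insert(ids.index(op["anchor"]) + 1, new_entry)
        else:
            entries.append(new_entry)   # anchor never existed: fall back to append
    return entries

def visible(entries):
    return [e["id"] for e in entries if not e["deleted"]]

def fresh_playlist():
    return [{"id": t, "deleted": False} for t in ["s1", "s2", "s3", "s4"]]

insert_x = {"type": "insert_after", "track_id": "song-x", "anchor": "s2"}
delete_s2 = {"type": "delete", "track_id": "s2"}

a_first = visible(apply_op(apply_op(fresh_playlist(), insert_x), delete_s2))
b_first = visible(apply_op(apply_op(fresh_playlist(), delete_s2), insert_x))
print(a_first, b_first)
```

Both arrival orders produce the same visible playlist, which is the property the position-based version lacks.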

Stream Counting for Royalty Payments

A "stream" counts for royalties ONLY if:
  1. Song played for ≥ 30 seconds
  2. User is authenticated (not anonymous/preview)
  3. Not detected as bot/fraud playback

This is critical financial data — accuracy directly affects artist payments.

Data flow:
  Client sends playback report every 30 seconds:
    POST /playback/report
    { track_id, duration_listened_ms, quality, device_id, session_id }
  
  API Gateway → Kafka topic: playback-events (partitioned by user_id)
  
  Flink streaming job:
    1. Deduplicate: same (user_id, track_id, session_id) within 5 min → count once
    2. Validate: duration ≥ 30,000 ms
    3. Fraud detection:
       - Same user playing 500+ tracks/hour → bot
       - Same track on repeat 100+ times → fraud farm
       - Device fingerprint → known bot device → reject
    4. Aggregate: per-track daily stream count → write to PostgreSQL
    5. Publish: validated stream events → Kafka: royalty-events
  
  Nightly batch job:
    1. Read daily aggregates from PostgreSQL
    2. Compute per-artist, per-label stream share
    3. Pro-rata calculation:
       artist_payment = (artist_streams / total_streams) × total_revenue_pool
    4. Write to royalty ledger → finance team processes payouts
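
The pro-rata formula above can be sketched with exact arithmetic so that the payouts always sum to the pool. Working in whole cents and assigning leftover cents by largest remainder are assumptions made for the example, as are the function name and the sample figures.

```python
from fractions import Fraction

def pro_rata_payouts(streams_by_artist: dict[str, int], revenue_pool_cents: int) -> dict[str, int]:
    """artist_payment = (artist_streams / total_streams) * pool, rounded to whole cents."""
    total = sum(streams_by_artist.values())
    if total == 0:
        return {artist: 0 for artist in streams_by_artist}
    exact = {a: Fraction(n * revenue_pool_cents, total) for a, n in streams_by_artist.items()}
    payouts = {a: int(share) for a, share in exact.items()}   # round down to whole cents
    leftover = revenue_pool_cents - sum(payouts.values())
    # Hand the leftover cents to the largest fractional remainders (ties by name).
    by_remainder = sorted(exact, key=lambda a: (payouts[a] - exact[a], a))
    for artist in by_remainder[:leftover]:
        payouts[artist] += 1
    return payouts

print(pro_rata_payouts({"queen": 600, "abba": 300, "indie": 100}, 1_000_000))
print(pro_rata_payouts({"a": 1, "b": 1, "c": 1}, 100))
```

Floats are avoided on purpose: in a ledger, rounding drift between the sum of payouts and the pool is an accounting error.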

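Edge case in deduplication (device switch):
  User plays song on phone → playback reports sent
  User switches to laptop (Spotify Connect) → same song continues
  Laptop also sends playback reports for the same (user, track)
  
  Flink deduplication: 5-minute session window keyed by (user, track, session_id)
  Within one session_id: first qualifying report counted, later ones suppressed
  Key point: session_id changes on device switch → separate sessions → both can count,
  provided each session independently reaches 30 seconds
  (This design treats that as correct: the user deliberately chose to listen on two devices.
   If a transfer should count as one stream instead, carry the session_id across the transfer.)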
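
The counting rules can be sketched as a small batch function. It is a simplification of the streaming job: the 5-minute window and fraud checks are omitted, duration_listened_ms is assumed to be cumulative within a session, and the authenticated field is a hypothetical stand-in for the auth check.

```python
MIN_STREAM_MS = 30_000  # a play counts only after 30 seconds

def count_streams(reports):
    """Count royalty-eligible streams per track, at most one per (user, track, session)."""
    longest = {}  # (user_id, track_id, session_id) -> longest duration reported
    for r in reports:
        if not r.get("authenticated", False):
            continue                              # anonymous/preview plays never count
        key = (r["user_id"], r["track_id"], r["session_id"])
        longest[key] = max(longest.get(key, 0), r["duration_listened_ms"])

    counts = {}
    for (_user, track_id, _session), ms in longest.items():
        if ms >= MIN_STREAM_MS:
            counts[track_id] = counts.get(track_id, 0) + 1
    return counts

reports = [
    # Phone session: two periodic reports, one stream.
    {"user_id": "u1", "track_id": "t1", "session_id": "phone-1", "duration_listened_ms": 30_000, "authenticated": True},
    {"user_id": "u1", "track_id": "t1", "session_id": "phone-1", "duration_listened_ms": 60_000, "authenticated": True},
    # After a device switch: new session_id, counted separately.
    {"user_id": "u1", "track_id": "t1", "session_id": "laptop-1", "duration_listened_ms": 45_000, "authenticated": True},
    # Too short, and anonymous: neither counts.
    {"user_id": "u1", "track_id": "t2", "session_id": "phone-1", "duration_listened_ms": 10_000, "authenticated": True},
    {"user_id": "anon", "track_id": "t1", "session_id": "web-1", "duration_listened_ms": 90_000, "authenticated": False},
]
print(count_streams(reports))
```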

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
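
An availability target is easier to reason about as minutes of allowed downtime. The sketch below converts the targets mentioned in this walkthrough into an error budget; the 30-day window is an assumption, and the function name is illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60

for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.2%} over 30 days: {error_budget_minutes(slo):.1f} min")
```

At 99.95% the budget is about 21.6 minutes per 30 days, which is why an automated rollback within 5 minutes matters: a single slow manual rollback can consume most of the month's budget.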

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.