Design a Podcast Delivery Platform

Interview Prompt

Design a podcast delivery platform.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: RSS ingestion, Audio processing, CDN distribution?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

RSS ingestion
Audio processing
CDN distribution
Subscription management
Download tracking
Capacity estimation with shown math

Out of scope (state explicitly)

Recommendation / home feed ranking (#48, #65)
Live chat and comments (#36)
DRM license server internals

Assumptions

Clarify scale (DAU, QPS, data volume) for the podcast delivery platform in the first 5 minutes
Standard reliability target 99.9%–99.99% unless the problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Upload podcasts: Creators upload audio episodes with metadata (title, description, show notes, chapters)
Streaming playback: Stream episodes with seeking, speed control (0.5×–3×), skip silence
Downloading: Offline download for listening without internet
RSS feed: Generate and serve RSS/Atom feeds for distribution to Apple Podcasts, Spotify, etc.
Show management: Create shows (series), manage episodes, schedule future releases
Discovery: Browse by category, charts (top podcasts), search, recommendations
Subscriptions: Users subscribe to shows; new episodes appear in their feed
Playback state: Sync playback position across devices (resume where you left off)
Analytics: Download counts, listener demographics, retention graphs per episode
Monetization: Dynamic ad insertion (pre-roll, mid-roll, post-roll), premium subscriptions

Metric	Calculation	Value
Total shows	Given (assumption documented in value)	5M
Total episodes	Given (assumption documented in value)	100M
New episodes / day	Given (assumption documented in value)	100K
Avg episode duration	Given (typical workload assumption)	45 minutes
Avg episode size	Given (typical workload assumption)	50 MB (128 kbps MP3)
Upload storage / day	100K × 50 MB	5 TB
Total storage	Given (assumption documented in value)	5 PB
Daily active listeners	Given (assumption documented in value)	30M
Concurrent streams	Given (peak load assumption)	5M
Stream bandwidth	5M × 128 kbps	640 Gbps
Downloads / day	Given (assumption documented in value)	500M (including RSS aggregators)
Download bandwidth	500M × 50 MB	25 PB / day
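
The derived rows in the table can be reproduced with a few lines of arithmetic. The sketch below assumes decimal units (1 MB = 10^6 bytes) and adds two figures that are not in the table: the raw size of a 45-minute 128 kbps file, as a sanity check on the 50 MB average, and the average egress rate implied by 25 PB / day.

```python
# Capacity arithmetic for the table above (decimal units assumed).
MB, TB, PB = 10**6, 10**12, 10**15

new_episodes_per_day = 100_000
avg_episode_bytes = 50 * MB
concurrent_streams = 5_000_000
stream_bitrate_bps = 128_000
downloads_per_day = 500_000_000

upload_tb_per_day = new_episodes_per_day * avg_episode_bytes / TB
stream_gbps = concurrent_streams * stream_bitrate_bps / 10**9
download_pb_per_day = downloads_per_day * avg_episode_bytes / PB

# Extra checks (not in the table): raw audio size and average egress rate.
raw_episode_mb = 45 * 60 * stream_bitrate_bps / 8 / MB
avg_download_tbps = download_pb_per_day * PB * 8 / 86_400 / 10**12

print(f"upload storage/day : {upload_tb_per_day:.1f} TB")
print(f"stream bandwidth   : {stream_gbps:.0f} Gbps")
print(f"download volume/day: {download_pb_per_day:.1f} PB")
print(f"45 min @ 128 kbps  : {raw_episode_mb:.1f} MB")
print(f"avg download egress: {avg_download_tbps:.2f} Tbps")
```

The raw audio comes out slightly under 50 MB, which leaves room for ID3 tags and embedded artwork. The average download egress is several times the peak streaming figure, which is why CDN cost dominates this design.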

The system uses the CloudFront CDN for highly available RSS feed delivery and globally cached audio, PostgreSQL for the primary transactional tables, Redis for volatile playback tracking and caching, and S3 for original and processed audio files and artwork.

1. Audio Processing Pipeline

Creators upload raw audio (WAV, FLAC, MP3). The pipeline normalizes, trims, and transcodes the audio, then generates waveforms, chapter markers, and optional transcripts.

Step 1: Validate + Probe (FFprobe: extract duration, format, codec; reject if >12 hrs or >2GB)
Step 2: Normalize Audio
  - Normalize loudness: target -16 LUFS (loudness standard for podcasts)
    ffmpeg -i input.wav -af "loudnorm=I=-16:TP=-1.5:LRA=11" normalized.wav
  - Silence trimming: trim leading and trailing silence longer than 3 seconds
Step 3: Transcode to Multiple Formats/Bitrates
  - MP3 128 kbps: Universal compatibility (RSS feed reference)
  - AAC 128 kbps: iOS/Android native high quality
  - Opus 48 kbps: Strong compression for modern clients (roughly 60% smaller than the 128 kbps MP3)
Step 4: Chapter Markers (Embed in ID3/M4A tags: { title, start_time, end_time })
Step 5: Generate Waveform (RMS amplitude per 100ms for custom seekbars, ~50KB JSON)
Step 6: Speech-to-Text Transcription (Whisper API, cost-optimized: only run for shows with >100 subs)
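
Steps 2 and 3 can be sketched as a small command builder. This is an illustration under assumptions, not the pipeline's actual code: the function names and output file names are hypothetical, the commands are only constructed (not executed), and it assumes an ffmpeg build that includes the libmp3lame and libopus encoders.

```python
# Builds ffmpeg argument lists for loudness normalization and transcoding.
# Rendition names mirror the formats listed in Step 3; nothing is executed here.
RENDITIONS = {
    "mp3_128k": ("libmp3lame", "128k", "mp3"),
    "aac_128k": ("aac", "128k", "m4a"),
    "opus_48k": ("libopus", "48k", "ogg"),
}

LOUDNORM = "loudnorm=I=-16:TP=-1.5:LRA=11"  # same filter string as Step 2


def normalize_cmd(src, dst):
    """Command that normalizes loudness to -16 LUFS."""
    return ["ffmpeg", "-y", "-i", src, "-af", LOUDNORM, dst]


def transcode_cmds(normalized, out_dir):
    """One command per rendition, keyed by rendition name."""
    cmds = {}
    for name, (codec, bitrate, ext) in RENDITIONS.items():
        out = f"{out_dir}/{name}.{ext}"
        cmds[name] = ["ffmpeg", "-y", "-i", normalized,
                      "-c:a", codec, "-b:a", bitrate, out]
    return cmds


if __name__ == "__main__":
    print(" ".join(normalize_cmd("input.wav", "normalized.wav")))
    for cmd in transcode_cmds("normalized.wav", "out").values():
        print(" ".join(cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` by a worker. Keeping one command per rendition lets a failed transcode be retried without redoing the others.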

2. RSS Feed Service: The Core Distribution Mechanism

Podcasts are distributed via RSS. Every external aggregator (Spotify, Apple, Overcast) polls each show's feed on a recurring schedule.

Feed serving at scale:
  5M shows × 10+ aggregators × 2–4 polls/hour (every 15–30 minutes) = ~100M–200M feed requests/hour (~28K–55K req/sec)
  
Strategy:
  1. Pre-generate RSS XML for each show → store in S3
  2. Serve via CDN with 15-minute TTL
  3. On new episode publish: regenerate feed XML → invalidate CDN cache
  4. Conditional requests: ETag/If-None-Match (or Last-Modified/If-Modified-Since) → 304 Not Modified (saves 90% bandwidth)
  
Stable RSS URL format: https://feeds.example.com/shows/{show_id}/rss
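
The conditional-request step can be sketched as a pure function that decides between 200 and 304. The function names, the truncated SHA-256 ETag, and the 900-second `max-age` (matching the 15-minute TTL) are assumptions for illustration; in practice the CDN or S3 may generate and compare ETags itself.

```python
import hashlib


def feed_etag(feed_xml: bytes) -> str:
    """Strong ETag derived from the feed content (quoted, per HTTP syntax)."""
    return '"' + hashlib.sha256(feed_xml).hexdigest()[:16] + '"'


def serve_feed(feed_xml: bytes, if_none_match=None):
    """Return (status, headers, body) for a feed request."""
    etag = feed_etag(feed_xml)
    headers = {"ETag": etag, "Cache-Control": "public, max-age=900"}
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        # If-None-Match uses weak comparison, so a W/ prefix is ignored.
        bare = [c[2:] if c.startswith("W/") else c for c in candidates]
        if "*" in bare or etag in bare:
            return 304, headers, b""
    headers["Content-Type"] = "application/rss+xml; charset=utf-8"
    return 200, headers, feed_xml


if __name__ == "__main__":
    status, headers, _ = serve_feed(b"<rss/>")
    print(status, headers["ETag"])
    print(serve_feed(b"<rss/>", headers["ETag"])[0])
```

An aggregator that stores the ETag from its last fetch gets an empty 304 until the feed is regenerated, at which point the hash changes and the full body is served once.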

3. Playback Sync: Resume Across Devices

The service syncs playback position across devices, so a listener can pause on a phone during a commute and resume at the same point in a desktop browser.

Sync mechanism:
  Client reports position every 30 seconds:
    POST /api/v1/playback/progress  { episode_id, position_seconds, speed }
  On opening:
    GET /api/v1/playback/progress/{episode_id}  → resumes from stored point

Storage: Redis
  Key: playback:{user_id}:{episode_id}
  Value: Hash { position, speed, duration, updated_at }
  TTL: 90 days (auto-cleanup old progress)

Scale: only active listeners report progress, so 5M concurrent streams ÷ 30 sec per update = ~167K writes/sec at peak (comfortably handled by 10 Redis Cluster shards, ~17K writes/sec each)
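
The key layout and TTL behaviour can be sketched with an in-memory stand-in for Redis. The class and method names are hypothetical, and a real deployment would use a Redis hash write plus an expiry on the same key; the dictionary here only mimics that behaviour so the example runs without a server.

```python
import time

TTL_SECONDS = 90 * 24 * 3600  # 90 days, as in the storage spec above


class ProgressStore:
    """In-memory stand-in for the Redis hash playback:{user_id}:{episode_id}."""

    def __init__(self):
        self._data = {}  # key -> (expires_at, hash fields)

    @staticmethod
    def key(user_id, episode_id):
        return f"playback:{user_id}:{episode_id}"

    def save(self, user_id, episode_id, position, speed, duration, now=None):
        now = time.time() if now is None else now
        fields = {"position": position, "speed": speed,
                  "duration": duration, "updated_at": now}
        # Every write refreshes the TTL, so active listeners never expire.
        self._data[self.key(user_id, episode_id)] = (now + TTL_SECONDS, fields)

    def load(self, user_id, episode_id, now=None):
        now = time.time() if now is None else now
        entry = self._data.get(self.key(user_id, episode_id))
        if entry is None or entry[0] <= now:
            return None  # never saved, or expired after 90 idle days
        return entry[1]


if __name__ == "__main__":
    store = ProgressStore()
    store.save("u1", "ep-uuid", position=1847, speed=1.5, duration=2700)
    print(store.load("u1", "ep-uuid"))
```

A second device calling `load` with the same user and episode gets the last reported position, which is the whole resume mechanism.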

4. Dynamic Ad Insertion (DAI): The Revenue Engine

Rather than "baked-in" static ads, Server-Side Ad Insertion (SSAI) stitches targeted ads into streams at request time.

SSAI Splicing Flow:
  1. Creator marks ad breaks: { "breaks": [{"position": 0, "type": "pre_roll"}, {"position": 1200, "type": "mid_roll"}] }
  2. On request, the Ad Decision Service selects ads based on listener demographics, frequency caps, and campaign targeting.
  3. Audio Stitching Service:
     - Splices HLS segments dynamically at the edge.
     - Segmented playlist: [ad_pre, content_part_1, ad_mid, content_part_2]
     - Content and ad segments are pre-encoded and cached on the CDN separately, so no CPU-heavy real-time re-encoding is needed.

Ad Impression Tracking:
  Client fires event when passing ad bounds: { ad_id, event: "impression|start|50%|complete" } → Kafka → ClickHouse
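
The stitching step amounts to generating a playlist that interleaves cached ad segments with cached content segments. The sketch below is one possible shape, with assumptions stated up front: the function name and segment URIs are hypothetical, break positions snap to the first content segment boundary at or after the marked position, and `#EXT-X-DISCONTINUITY` is emitted wherever the encoding may change between ad and content.

```python
import math


def build_playlist(content, ads_by_position):
    """content: [(uri, seconds)]; ads_by_position: {break_seconds: [(uri, seconds)]}."""
    body = []
    longest = 0.0
    elapsed = 0.0  # content seconds emitted so far
    breaks = sorted(ads_by_position.items())

    def emit(uri, seconds):
        nonlocal longest
        longest = max(longest, seconds)
        body.append(f"#EXTINF:{seconds:.3f},")
        body.append(uri)

    def emit_break(ads, resume_content):
        for ad_uri, ad_seconds in ads:
            if body:  # no marker needed before the very first segment
                body.append("#EXT-X-DISCONTINUITY")
            emit(ad_uri, ad_seconds)
        if resume_content:
            body.append("#EXT-X-DISCONTINUITY")

    for uri, seconds in content:
        # Insert every break whose position has been reached.
        while breaks and breaks[0][0] <= elapsed:
            emit_break(breaks.pop(0)[1], resume_content=True)
        emit(uri, seconds)
        elapsed += seconds
    for _, ads in breaks:  # anything left over plays as post-roll
        emit_break(ads, resume_content=False)

    header = ["#EXTM3U", "#EXT-X-VERSION:3",
              f"#EXT-X-TARGETDURATION:{math.ceil(longest)}",
              "#EXT-X-MEDIA-SEQUENCE:0", "#EXT-X-PLAYLIST-TYPE:VOD"]
    return "\n".join(header + body + ["#EXT-X-ENDLIST"]) + "\n"


if __name__ == "__main__":
    content = [("c0.ts", 600), ("c1.ts", 600), ("c2.ts", 600)]
    ads = {0: [("ad_pre.ts", 30)], 1200: [("ad_mid.ts", 60)]}
    print(build_playlist(content, ads))
```

Because only the playlist is per-listener, the audio segments themselves stay cacheable, and falling back to an ad-free playlist on ad-decision failure is a matter of passing an empty break map.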

Event Bus Design (Kafka)

Topic: podcast_delivery_platform-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (user_id or episode_id; preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "podcast_delivery_platform-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: podcast_delivery_platform-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
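
The consumer-side guarantees (at-least-once delivery, dedup by event_id, DLQ after 3 retries) can be sketched without a broker. Everything here is an in-memory illustration: the class name is hypothetical, the `seen` set stands in for a persistent dedup store, and "3 retries" is read as one initial attempt plus three more.

```python
MAX_RETRIES = 3  # retries after the first attempt, then dead-letter


class IdempotentConsumer:
    """Processes each event_id at most once; poison messages go to a DLQ list."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()  # stand-in for a persistent dedup store with a TTL
        self.dlq = []      # stand-in for the -dlq topic

    def process(self, event):
        event_id = event["event_id"]
        if event_id in self.seen:
            return "duplicate"  # redelivery after a crash or rebalance
        for _ in range(1 + MAX_RETRIES):
            try:
                self.handler(event)
            except Exception:
                continue  # a real worker would back off before retrying
            self.seen.add(event_id)
            return "processed"
        self.dlq.append(event)
        return "dead-lettered"


if __name__ == "__main__":
    consumer = IdempotentConsumer(lambda e: print("handled", e["event_id"]))
    print(consumer.process({"event_id": "e1", "type": "episode-published"}))
    print(consumer.process({"event_id": "e1", "type": "episode-published"}))
```

Marking the event as seen only after the handler succeeds keeps the at-least-once property: a crash mid-handler leads to a retry rather than a silently dropped event.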

Upload Episode

HTTP

POST /api/v1/shows/{show_id}/episodes
{
  "title": "Episode 42: System Design",
  "description": "In this episode...",
  "audio_file_key": "uploads/ep-42-raw.wav",
  "publish_at": "2025-03-14T08:00:00Z",
  "season": 3,
  "episode_number": 42,
  "explicit": false,
  "chapters": [
    {"title": "Introduction", "start": 0},
    {"title": "Main Topic", "start": 180},
    {"title": "Interview", "start": 1200}
  ],
  "ad_breaks": [
    {"position": 0, "type": "pre_roll", "max_duration": 30},
    {"position": 1200, "type": "mid_roll", "max_duration": 60}
  ]
}

Stream Episode

HTTP

GET /api/v1/episodes/{episode_id}/stream?format=aac&quality=128k
Response: 302 Redirect
Location: https://cdn.example.com/audio/ep-uuid/aac_128k.m4a

Or with ad insertion:
Location: https://cdn.example.com/dai/ep-uuid/playlist.m3u8

Get User's Subscription Feed

HTTP

GET /api/v1/feed?limit=20&cursor={last}
Response: 200 OK
{
  "episodes": [
    {
      "episode_id": "ep-uuid",
      "show": {"id": "show-uuid", "title": "Tech Talk", "art": "..."},
      "title": "Episode 42: System Design",
      "duration_seconds": 2700,
      "published_at": "2025-03-14T08:00:00Z",
      "progress": {"position": 1847, "percent": 68},
      "stream_url": "https://cdn.example.com/audio/ep-uuid/aac_128k.m4a",
      "download_url": "https://cdn.example.com/audio/ep-uuid/mp3_128k.mp3",
      "file_size_bytes": 43200000
    }
  ]
}

Update Playback Progress

HTTP

POST /api/v1/playback/progress
{
  "episode_id": "ep-uuid",
  "position_seconds": 1847,
  "speed": 1.5,
  "duration_seconds": 2700
}

Get RSS XML (Aggregator facing)

XML

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:podcast="https://podcastindex.org/namespace/1.0">
  <channel>
    <title>My Podcast Show</title>
    <link>https://example.com/shows/my-podcast</link>
    <description>A weekly show about technology.</description>
    <itunes:author>John Doe</itunes:author>
    <itunes:category text="Technology"/>
    <itunes:image href="https://cdn.example.com/art/show-123.jpg"/>
    
    <item>
      <title>Episode 42: System Design</title>
      <enclosure url="https://cdn.example.com/audio/ep-42.mp3" length="57000000" type="audio/mpeg"/>
      <pubDate>Fri, 14 Mar 2025 08:00:00 GMT</pubDate>
      <itunes:duration>3600</itunes:duration>
      <description>In this episode we discuss...</description>
      <podcast:chapters url="https://cdn.example.com/chapters/ep-42.json"/>
    </item>
  </channel>
</rss>

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
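202 Accepted (success, not an error): job queued; poll GET /jobs/{id} for status
408 Request Timeout: the server timed out waiting for the complete request; resend it. Long-running jobs report progress through GET /jobs/{id}, not through 408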

PostgreSQL: Core Relational Data

SQL

CREATE TABLE shows (
    show_id         UUID PRIMARY KEY,
    creator_id      UUID NOT NULL,
    title           VARCHAR(255) NOT NULL,
    description     TEXT,
    category        VARCHAR(50),
    subcategory     VARCHAR(50),
    language        CHAR(5),
    artwork_url     TEXT,
    website_url     TEXT,
    rss_feed_url    TEXT NOT NULL,          -- public feed URL (stable, permanent)
    explicit        BOOLEAN DEFAULT FALSE,
    subscriber_count INT DEFAULT 0,
    total_episodes  INT DEFAULT 0,
    -- PostgreSQL has no inline ENUM(...); use a CHECK constraint (or CREATE TYPE)
    status          TEXT DEFAULT 'active'
                    CHECK (status IN ('active', 'paused', 'archived')),
    created_at      TIMESTAMPTZ,
    updated_at      TIMESTAMPTZ
);
-- PostgreSQL declares indexes outside CREATE TABLE
CREATE INDEX idx_shows_category ON shows (category, subscriber_count DESC);
CREATE INDEX idx_shows_creator ON shows (creator_id);

CREATE TABLE episodes (
    episode_id      UUID PRIMARY KEY,
    show_id         UUID NOT NULL,
    title           VARCHAR(255) NOT NULL,
    description     TEXT,
    show_notes      TEXT,
    season          SMALLINT,
    episode_number  SMALLINT,
    duration_seconds INT,
    audio_url_mp3   TEXT,                  -- CDN URL for MP3
    audio_url_aac   TEXT,                  -- CDN URL for AAC
    audio_url_opus  TEXT,                  -- CDN URL for Opus
    original_s3_key TEXT,
    file_size_bytes BIGINT,                -- the 2 GB upload cap sits at the 32-bit INT limit
    chapters        JSONB,
    ad_breaks       JSONB,
    transcript_url  TEXT,
    waveform_url    TEXT,
    explicit        BOOLEAN DEFAULT FALSE,
    -- PostgreSQL has no inline ENUM(...); use a CHECK constraint (or CREATE TYPE)
    status          TEXT
                    CHECK (status IN ('draft','processing','scheduled','published','archived')),
    published_at    TIMESTAMPTZ,
    created_at      TIMESTAMPTZ
);
-- PostgreSQL declares indexes outside CREATE TABLE; index names are unique per schema
CREATE INDEX idx_episodes_show ON episodes (show_id, published_at DESC);
CREATE INDEX idx_episodes_published ON episodes (status, published_at DESC);

CREATE TABLE subscriptions (
    user_id         UUID NOT NULL,
    show_id         UUID NOT NULL,
    subscribed_at   TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    notifications   BOOLEAN DEFAULT TRUE,
    PRIMARY KEY (user_id, show_id)
);
-- "who subscribes to this show"; declared separately because PostgreSQL has no inline INDEX clause
CREATE INDEX idx_subscriptions_show ON subscriptions (show_id);

Redis Key Schemas

# Playback progress
playback:{user_id}:{episode_id}  → Hash { position, speed, duration, updated_at } (TTL: 90 days)

# User's episode queue
queue:{user_id}  → List of episode_ids (ordered)

# RSS feed cache
rss:{show_id}  → String (RSS XML blob) (TTL: 15 minutes)

# Podcast charts (Sorted Sets)
charts:top:{category}    → Sorted Set { show_id: score }
charts:trending          → Sorted Set { show_id: growth_score }

# Episode download counter (incremented in Redis, flushed hourly to ClickHouse)
downloads:{episode_id}:{date}  → INT (INCR) (TTL: 2 days)

S3 Storage Layout

Bucket: podcast-originals (cross-region replicated, permanent)
  /{show_id}/{episode_id}/original.wav

Bucket: podcast-processed (CDN-served)
  /{show_id}/{episode_id}/mp3_128k.mp3
  /{show_id}/{episode_id}/aac_128k.m4a
  /{show_id}/{episode_id}/opus_48k.ogg
  /{show_id}/{episode_id}/waveform.json
  /{show_id}/{episode_id}/transcript.json

Bucket: podcast-artwork
  /{show_id}/artwork_3000x3000.jpg
  /{show_id}/artwork_600x600.jpg

Kafka Message Bus Topics

Topic: episode-published     (triggers RSS regeneration + push notifications)
Topic: playback-events       (play, seek, complete — feeds ClickHouse analytics)
Topic: download-events       (download started — used for IAB compliance filters)
Topic: ad-events             (ad impressions — monetization statistics)

ClickHouse: Analytics DB

SQL

CREATE TABLE episode_plays (
    episode_id      UUID,
    show_id         UUID,
    user_id         UUID,
    event_type      Enum8('play'=0,'pause'=1,'seek'=2,'complete'=3,'download'=4),
    position_seconds UInt32,
    duration_seconds UInt32,
    speed           Float32,
    platform        Enum8('ios'=0,'android'=1,'web'=2,'rss'=3),
    country         FixedString(2),
    city            String,
    event_date      Date MATERIALIZED toDate(timestamp),
    timestamp       DateTime
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (show_id, episode_id, timestamp);

Concern	Solution
Audio file corruption	Checksum verification after upload; re-upload from creator if corrupt
CDN failure	Multi-CDN (CloudFront + Akamai); DNS failover in < 30 seconds
RSS feed stale	Max TTL 15 min; explicit cache purge on publish; ETag for conditional requests
Playback sync loss	Client buffers progress locally → retry sync when online
Ad insertion failure	Serve episode WITHOUT ads (degrade gracefully; better than no audio)
Processing pipeline failure	Retry 3× from Kafka; DLQ for persistent failures; alert creator
Download counter loss	Redis AOF + batch flush to ClickHouse every hour; ClickHouse is source of truth

Specific: RSS Polling Storm (Thundering Herd)

Aggregators tend to sync feeds on the hour, producing a thundering-herd spike of roughly 55K req/sec.

CDN Edge Caching: Feeds are cached on CDN edges worldwide with a 15-minute TTL, so only about 5% of requests reach the origin.
Conditional Requests: Aggregators send If-None-Match with the feed's last ETag. About 90% of requests return 304 Not Modified with no body, which saves most of the bandwidth.
WebSub (formerly PubSubHubbub) Push: The platform notifies subscribed aggregators through webhook callbacks when a new episode is published, so aggregators that support it no longer need to poll. This can eliminate up to 99% of their requests.
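
One way to arrive at a figure of about 55K req/sec, and to see what the mitigations leave for the origin, is the arithmetic below. The inputs are assumptions chosen for illustration: 5M shows, 10 aggregators, and one poll every 15 minutes, with the 5% origin miss ratio and 90% 304 ratio applied independently.

```python
SHOWS = 5_000_000
AGGREGATORS = 10
POLLS_PER_HOUR = 4         # one poll every 15 minutes (assumed)
ORIGIN_MISS_RATIO = 0.05   # share of requests that pass through the CDN
NOT_MODIFIED_RATIO = 0.90  # share of requests answered with 304


def polling_load():
    """Return (edge_rps, origin_rps, full_body_rps) for the assumed inputs."""
    edge_rps = SHOWS * AGGREGATORS * POLLS_PER_HOUR / 3600
    origin_rps = edge_rps * ORIGIN_MISS_RATIO
    full_body_rps = edge_rps * (1 - NOT_MODIFIED_RATIO)
    return edge_rps, origin_rps, full_body_rps


if __name__ == "__main__":
    edge, origin, full = polling_load()
    print(f"edge: {edge:,.0f} req/s")
    print(f"origin after CDN: {origin:,.0f} req/s")
    print(f"responses carrying a full feed body: {full:,.0f} req/s")
```

Under these assumptions the origin sees a few thousand requests per second rather than tens of thousands, which is within reach of a modest fleet serving static XML from S3.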

Specific: Download Counting Accuracy (IAB Standard)

Advertisers pay per 1000 downloads (CPM), making overcounting (fraud) or undercounting (lost revenue) highly sensitive issues.

IAB Podcast Measurement Guidelines:
  1. Deduplication: In Flink stream, generate key = SHA256(ip + user_agent + episode_id).
     Window of 24 hours. Ignore matches within this window.
  2. Bot filtering: Filter out automated crawlers matching the IAB bot list.
  3. Byte-range filtering:
     - Ignore byte 0-1000 requests (metadata fetching only).
     - Ignore downloads where total bytes served < 50% of episode size.
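
The three rules can be sketched as a single filter applied to each download request. This is an in-memory illustration, not a Flink job: the class and field names are hypothetical, the bot markers are a placeholder for the real IAB bot list, and a `|` separator is added inside the hash input to avoid ambiguous concatenations.

```python
import hashlib
from dataclasses import dataclass

WINDOW_SECONDS = 24 * 3600
BOT_MARKERS = ("bot", "crawler", "spider")  # placeholder for the IAB bot list


@dataclass
class DownloadRequest:
    ip: str
    user_agent: str
    episode_id: str
    range_start: int
    range_end: int       # inclusive last byte requested
    bytes_served: int
    episode_bytes: int
    timestamp: float     # seconds since epoch


class DownloadFilter:
    def __init__(self):
        self.last_counted = {}  # dedup key -> timestamp of last counted download

    def is_countable(self, req: DownloadRequest) -> bool:
        ua = req.user_agent.lower()
        if any(marker in ua for marker in BOT_MARKERS):
            return False  # rule 2: bot filtering
        if req.range_start == 0 and req.range_end <= 1000:
            return False  # rule 3a: metadata-only probe
        if req.bytes_served < 0.5 * req.episode_bytes:
            return False  # rule 3b: less than half the file served
        raw = f"{req.ip}|{req.user_agent}|{req.episode_id}".encode()
        key = hashlib.sha256(raw).hexdigest()
        last = self.last_counted.get(key)
        if last is not None and req.timestamp - last < WINDOW_SECONDS:
            return False  # rule 1: duplicate within 24 hours
        self.last_counted[key] = req.timestamp
        return True


if __name__ == "__main__":
    f = DownloadFilter()
    req = DownloadRequest("203.0.113.7", "Overcast/3.0", "ep-uuid",
                          0, 49_999_999, 50_000_000, 50_000_000, 0.0)
    print(f.is_countable(req), f.is_countable(req))
```

Running the cheap stateless checks before the dedup lookup keeps the keyed state small, since bots and probes never create an entry.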

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
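
The availability target translates directly into an error budget. The helper names below are hypothetical, and the 30-day window is an assumption; the burn-rate function shows how an observed error ratio compares with the ratio the SLO allows.

```python
def error_budget_minutes(availability_pct, window_days=30):
    """Minutes of full outage the SLO tolerates within the window."""
    return (1 - availability_pct / 100) * window_days * 24 * 60


def burn_rate(observed_error_ratio, availability_pct):
    """How many times faster than allowed the budget is being spent."""
    return observed_error_ratio / (1 - availability_pct / 100)


if __name__ == "__main__":
    print(f"99.95% over 30 days: {error_budget_minutes(99.95):.1f} minutes")
    print(f"0.5% errors against 99.95%: {burn_rate(0.005, 99.95):.1f}x burn")
```

At 99.95% the monthly budget is a little over 20 minutes, so a sustained 10x burn would exhaust it in about three days, which is the kind of signal worth paging on.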

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
