Design a Thumbnail Generation Service

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 75	Staff level: multi-region, cost at scale, migration path, and production metrics.

Interview Prompt

Design a Thumbnail Generation Service.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Async job queue, Image resizing, Format conversion?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Async job queue
Image resizing
Format conversion
CDN caching
Idempotent generation
Capacity estimation with shown math

Out of scope (state explicitly)

Detailed frontend/UI pixel implementation
Org structure, staffing, and hiring plan

Assumptions

Clarify scale (DAU, QPS, data volume) for the thumbnail generation service in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Functional Requirements

Auto-generate thumbnails: From videos, images, PDFs, documents
Multiple sizes: Small 150×150, medium 300×200, large 640×360
Video thumbnails: Extract the "best" frame using ML scoring
Sprite sheets: Grid of thumbnails for video seek preview
Custom thumbnails: Upload or select from candidates
A/B test thumbnails: Serve variants, measure CTR
Animated thumbnails: Short WebP preview on hover
Format optimization: Serve WebP/AVIF with JPEG fallback

Capacity Estimation

Metric	Calculation	Value
Videos uploaded / hour	Given	50K
Images uploaded / hour	Given	500K
Thumbnails per video	Given	6 (5 candidates + sprite)
Total thumbnails / hour	50K videos × 6 + 500K images × 3 (one per size)	1.8M
Thumbnails / sec	1.8M ÷ 3600 (hourly average; add a peak factor when provisioning)	500
Avg thumbnail size	Given	20 KB
Thumbnail storage / day	1.8M/hr × 24 × 20 KB	864 GB
CDN bandwidth	Given	~100 Gbps
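
The arithmetic behind the table can be checked with a few lines of Python. This is a minimal sketch, not part of the original sheet: the constant names are illustrative, and the figure of three thumbnails per image (one per size) is an assumption chosen because it reproduces the 1.8M/hour total.

```python
# Inputs taken from the capacity table.
VIDEOS_PER_HOUR = 50_000
IMAGES_PER_HOUR = 500_000
THUMBS_PER_VIDEO = 6   # 5 candidates + 1 sprite sheet
THUMBS_PER_IMAGE = 3   # assumption: one per size (small, medium, large)
AVG_THUMB_KB = 20

thumbs_per_hour = (VIDEOS_PER_HOUR * THUMBS_PER_VIDEO
                   + IMAGES_PER_HOUR * THUMBS_PER_IMAGE)
thumbs_per_sec = thumbs_per_hour / 3600            # hourly average, no peak factor
storage_gb_per_day = thumbs_per_hour * 24 * AVG_THUMB_KB / 1_000_000  # KB -> GB (decimal)

print(f"{thumbs_per_hour:,} thumbnails/hour")      # 1,800,000 thumbnails/hour
print(f"{thumbs_per_sec:.0f} thumbnails/sec")      # 500 thumbnails/sec
print(f"{storage_gb_per_day:.0f} GB/day")          # 864 GB/day
```

Stating the per-image multiplier out loud is the useful part in an interview: it is the one input the table does not list explicitly.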


Worker Queue and Scaling Pipeline

An upload triggers an S3 event, which enqueues an SQS/Kafka message carrying content_id and processing options. The worker pool auto-scales on queue depth (target: p99 processing time within 30 s).

Queue sizing:
  500 thumbs/sec × 2 sec avg FFmpeg job = ~1,000 jobs in flight (Little's Law), so ~1,000 concurrent workers
  GPU workers for ML scoring: 50 nodes × 20 parallel = 1,000 inferences/sec (assumes ~1 sec per inference)
  Priority queue: user-facing uploads > batch backfill

Worker lifecycle:
  1. Pull job from queue (visibility timeout = 2× expected duration)
  2. Download source from S3 to local /tmp (streaming for large videos)
  3. FFmpeg extract frames → ML score → select best → resize → upload to S3
  4. Write metadata to MySQL, invalidate Redis cache
  5. ACK message; on failure: retry 3× → DLQ with alert

Poison messages: corrupt video → skip segment, use adjacent frame;
  unrecoverable → mark failed, serve placeholder, notify uploader
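
The retry and DLQ step of the worker lifecycle can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: `InMemoryQueue`, `Job`, and `run_worker` are hypothetical names, the in-memory list stands in for SQS/Kafka (no visibility timeout is modeled), and "retry 3×" is read here as three attempts in total.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_ATTEMPTS = 3  # assumption: "retry 3x" interpreted as three attempts in total


@dataclass
class Job:
    content_id: str
    attempts: int = 0


@dataclass
class InMemoryQueue:
    """Stand-in for SQS/Kafka; a real queue would also enforce the visibility timeout."""
    pending: list = field(default_factory=list)
    dead_letter: list = field(default_factory=list)


def run_worker(queue: InMemoryQueue, process: Callable[[Job], None]) -> list:
    """Drain the queue, re-enqueueing failed jobs and parking exhausted ones in the DLQ."""
    completed = []
    while queue.pending:
        job = queue.pending.pop(0)
        job.attempts += 1
        try:
            process(job)                       # download -> extract -> score -> resize -> upload
            completed.append(job.content_id)   # equivalent of ACKing the message
        except Exception:
            if job.attempts >= MAX_ATTEMPTS:
                queue.dead_letter.append(job)  # an alert would be raised here
            else:
                queue.pending.append(job)      # redelivered on a later pass
    return completed


def flaky_process(job: Job) -> None:
    """Toy processor: 'flaky' fails once, 'corrupt' always fails."""
    if job.content_id == "corrupt" or (job.content_id == "flaky" and job.attempts < 2):
        raise RuntimeError("ffmpeg failed")


q = InMemoryQueue(pending=[Job("ok"), Job("flaky"), Job("corrupt")])
completed = run_worker(q, flaky_process)
print(completed, [j.content_id for j in q.dead_letter])
```

The transient failure succeeds on its second attempt, while the permanently broken job lands in the dead-letter list after its third, which is where the "mark failed, serve placeholder, notify uploader" path would start.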

Video Thumbnail Selection: Finding the Best Frame

Multi-criteria scoring: extract N candidate frames (uniform sampling + scene changes), score each on sharpness (25%), brightness (15%), contrast (10%), face presence (30%), aesthetic quality (20%). Select top 3-5 candidates.
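
The weighted scoring can be sketched in Python as below. The weights come from the text; everything else is an assumption for illustration: the function names are hypothetical, and each per-criterion score is assumed to be already normalized to the range 0 to 1 by upstream detectors.

```python
# Weights from the multi-criteria description; they sum to 1.0.
WEIGHTS = {
    "sharpness": 0.25,
    "brightness": 0.15,
    "contrast": 0.10,
    "face_presence": 0.30,
    "aesthetic": 0.20,
}


def frame_score(features: dict) -> float:
    """Weighted sum of per-criterion scores, each assumed normalized to [0, 1]."""
    return sum(weight * features[name] for name, weight in WEIGHTS.items())


def top_candidates(frames: dict, k: int = 5) -> list:
    """Return the ids of the k highest-scoring frames, best first."""
    return sorted(frames, key=lambda fid: frame_score(frames[fid]), reverse=True)[:k]


frames = {
    "t=12s": dict(sharpness=0.5, brightness=0.5, contrast=0.5, face_presence=0.5, aesthetic=0.5),
    "t=47s": dict(sharpness=0.0, brightness=0.0, contrast=0.0, face_presence=1.0, aesthetic=0.0),
    "t=90s": dict(sharpness=0.9, brightness=0.9, contrast=0.9, face_presence=0.9, aesthetic=0.9),
}
print(top_candidates(frames, k=2))
```

A frame with a face but nothing else scores only 0.30, so it loses to a uniformly decent frame; that is the intended effect of spreading weight across criteria.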

Sprite Sheet Generation

Extract 1 frame per 5 seconds, resize to 160×90, and arrange in a grid (10 cols × 12 rows). At this interval, the 120 tiles cover a 10-minute video. Generate a VTT metadata file for client-side seek preview. The client fetches one sprite image instead of 120 individual thumbnails.
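
A sketch of the VTT metadata generation follows. It assumes the common player convention of pointing each cue at a region of the sprite with a `#xywh=x,y,w,h` media fragment; the function name `sprite_vtt` and its parameters are hypothetical, and the source does not specify the exact cue format.

```python
def sprite_vtt(duration_s: int, sprite_url: str, interval_s: int = 5,
               tile_w: int = 160, tile_h: int = 90, cols: int = 10) -> str:
    """Build a WebVTT file mapping each time range to one tile of the sprite sheet."""

    def timestamp(seconds: int) -> str:
        hours, rem = divmod(seconds, 3600)
        minutes, secs = divmod(rem, 60)
        return f"{hours:02d}:{minutes:02d}:{secs:02d}.000"

    lines = ["WEBVTT", ""]
    for index, start in enumerate(range(0, duration_s, interval_s)):
        end = min(start + interval_s, duration_s)
        x = (index % cols) * tile_w    # column offset in pixels
        y = (index // cols) * tile_h   # row offset in pixels
        lines.append(f"{timestamp(start)} --> {timestamp(end)}")
        lines.append(f"{sprite_url}#xywh={x},{y},{tile_w},{tile_h}")
        lines.append("")
    return "\n".join(lines)


vtt = sprite_vtt(600, "sprite.jpg")
print(vtt.splitlines()[2:4])
```

For a 600-second video this yields 120 cues, matching the 10 × 12 grid; the eleventh cue (50 to 55 seconds) wraps to the second row at y=90.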

Animated Thumbnail (Hover Preview)

Select 3 interesting segments (2 seconds each), extract at 10fps, combine into animated WebP (~200-400 KB). Only generate for top 10% most-viewed videos.

A/B Testing Thumbnails

Generate 3 candidates. Consistent variant assignment via user hash. Measure CTR + watch time (avoid clickbait). Auto-promote winner when statistically significant (Chi-squared test, >10K impressions per variant).
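
The assignment and significance check can be sketched as below. This is a minimal illustration with hypothetical function names: it hashes the user and content ids to pick a stable variant, and computes a Pearson chi-squared statistic on click counts only. Watch time, which the text also calls for, would need a separate test and is not covered here.

```python
import hashlib

CRITICAL_DF2_P05 = 5.991  # chi-squared critical value for 2 degrees of freedom, alpha = 0.05
MIN_IMPRESSIONS = 10_000  # per-variant floor from the text


def assign_variant(user_id: str, content_id: str, num_variants: int = 3) -> int:
    """Stable variant index: the same user always sees the same thumbnail for a video."""
    digest = hashlib.sha256(f"{content_id}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_variants


def chi_squared(clicks: list, impressions: list) -> float:
    """Pearson statistic for a (variants x {click, no click}) contingency table."""
    pooled_ctr = sum(clicks) / sum(impressions)
    stat = 0.0
    for c, n in zip(clicks, impressions):
        expected_click = n * pooled_ctr
        expected_no_click = n * (1 - pooled_ctr)
        stat += (c - expected_click) ** 2 / expected_click
        stat += ((n - c) - expected_no_click) ** 2 / expected_no_click
    return stat


def has_winner(clicks: list, impressions: list) -> bool:
    """True when every variant has enough traffic and CTRs differ significantly."""
    enough_traffic = all(n > MIN_IMPRESSIONS for n in impressions)
    return enough_traffic and chi_squared(clicks, impressions) > CRITICAL_DF2_P05


stat = chi_squared([500, 520, 600], [10_000, 10_000, 10_000])
print(round(stat, 2))
```

The hard-coded critical value only applies to three variants (two degrees of freedom); a different variant count needs a different threshold.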

Event Bus Design (Kafka)

Topic: thumbnail_generation_service-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: content_id (preserves per-content ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "thumbnail_generation_service-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: thumbnail_generation_service-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 202 Accepted
  Async path: consumers update caches, indexes, notifications, aggregates
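
At-least-once delivery means a consumer may see the same event twice, so the handler has to deduplicate. A minimal sketch of dedup by event_id follows; the class name is hypothetical, and the in-memory set is an assumption standing in for a shared store such as Redis or a database unique key.

```python
class IdempotentHandler:
    """Wraps a handler so that redelivered events are processed at most once."""

    def __init__(self, handle):
        self._handle = handle
        self._seen = set()  # assumption: a shared store (Redis, DB unique key) in production

    def on_event(self, event: dict) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        event_id = event["event_id"]
        if event_id in self._seen:
            return False
        self._handle(event)
        # Recorded only after success, so a failed attempt is retried on redelivery.
        self._seen.add(event_id)
        return True


processed = []
handler = IdempotentHandler(processed.append)
event = {"event_id": "evt-1", "content_id": "vid-uuid"}
results = [handler.on_event(event), handler.on_event(event)]
print(results, len(processed))
```

Recording the id after the handler runs favors retries over lost work; the handler itself must therefore tolerate a repeat if the process dies between the two steps.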

Generate Thumbnails for Video

HTTP

POST /api/v1/thumbnails/video
{
  "video_id": "vid-uuid",
  "s3_key": "originals/vid-uuid/video.mp4",
  "options": {
    "sizes": ["150x150", "300x200", "640x360"],
    "candidates": 5,
    "sprite_sheet": true,
    "animated_preview": true,
    "format": "webp"
  }
}
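
Idempotent generation is in scope, and one way to get it is to derive a deterministic job key from the request so that a repeated POST maps to the same job. The sketch below is an assumption about how that could be done, not something the source specifies; `job_key` is a hypothetical helper.

```python
import hashlib
import json


def job_key(video_id: str, options: dict) -> str:
    """Deterministic key: identical requests hash to the same value regardless of key order."""
    canonical = json.dumps(
        {"video_id": video_id, "options": options},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


key_a = job_key("vid-uuid", {"sizes": ["150x150", "300x200"], "format": "webp"})
key_b = job_key("vid-uuid", {"format": "webp", "sizes": ["150x150", "300x200"]})
key_c = job_key("vid-uuid", {"format": "jpeg", "sizes": ["150x150", "300x200"]})
print(key_a == key_b, key_a == key_c)
```

List order is preserved by JSON serialization, so `sizes` given in a different order produces a different key; sorting the list before hashing would be a reasonable refinement if order carries no meaning.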

Get Thumbnails

HTTP

GET /api/v1/thumbnails/{content_id}
Response: 200 OK
{
  "content_id": "vid-uuid",
  "thumbnails": {
    "default": "https://cdn.example.com/thumbs/vid-uuid/default_640x360.webp",
    "small": "https://cdn.example.com/thumbs/vid-uuid/small_150x150.webp"
  },
  "candidates": [
    {"index": 0, "url": "https://cdn.example.com/thumbs/vid-uuid/candidate_0.webp", "score": 0.92}
  ],
  "sprite_sheet": {
    "url": "https://cdn.example.com/thumbs/vid-uuid/sprite.jpg",
    "vtt_url": "https://cdn.example.com/thumbs/vid-uuid/sprite.vtt",
    "columns": 10, "rows": 12
  }
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
202 Accepted (success, not an error): job queued; poll GET /jobs/{id} for status
408 Request Timeout: job still processing; continue polling
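
A client-side retry delay that honors Retry-After and otherwise applies exponential backoff could look like the sketch below. The function name, the base and cap values, and the use of full jitter are assumptions for illustration; the source only asks for exponential backoff and for honoring the header.

```python
import random


def retry_delay(attempt: int, retry_after=None, base_s: float = 1.0,
                cap_s: float = 60.0, rng=random.random) -> float:
    """Seconds to wait before the next retry (attempt is 0-based)."""
    if retry_after is not None:
        return float(retry_after)          # a server-provided Retry-After wins
    ceiling = min(cap_s, base_s * 2 ** attempt)
    return rng() * ceiling                 # full jitter: uniform in [0, ceiling)


print(retry_delay(3, retry_after=7))       # 7.0
print(retry_delay(2, rng=lambda: 1.0))     # 4.0 (upper bound for attempt 2)
```

Jitter spreads retries from many clients over the backoff window so they do not arrive in synchronized waves after an outage.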

MySQL: Thumbnail Metadata

SQL

CREATE TABLE thumbnails (
    thumbnail_id    BIGINT PRIMARY KEY AUTO_INCREMENT,
    content_id      VARCHAR(36) NOT NULL,
    content_type    ENUM('video', 'image', 'document') NOT NULL,
    variant_type    ENUM('default', 'candidate', 'custom', 'sprite', 'animated') NOT NULL,
    s3_key          TEXT NOT NULL,
    cdn_url         TEXT NOT NULL,
    format          VARCHAR(10) DEFAULT 'webp',
    width           INT,
    height          INT,
    file_size_bytes INT,
    quality_score   DECIMAL(4,3),
    is_default      BOOLEAN DEFAULT FALSE,
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_content (content_id, variant_type)
);

S3 + Redis

S3 Bucket: thumbnails
  /{content_id}/default_640x360.webp
  /{content_id}/sprite.jpg
  /{content_id}/sprite.vtt
  /{content_id}/animated_preview.webp

Redis: thumb:{content_id} → CDN URL (TTL: 86400)
       thumb_ab:{content_id}:{user_hash} → variant_index (TTL: 7d)
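
The `thumb:{content_id}` key suggests a cache-aside read path: check Redis, fall back to the metadata store on a miss, then populate the cache with the 86400-second TTL. The sketch below illustrates that flow with an in-memory stand-in; `TTLCache` and `thumbnail_url` are hypothetical names, and a real deployment would use a Redis client instead.

```python
import time


class TTLCache:
    """In-memory stand-in for Redis GET / SETEX semantics."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            self._data.pop(key, None)   # expired or missing
            return None
        return entry[0]

    def setex(self, key, ttl_s, value):
        self._data[key] = (value, self._clock() + ttl_s)


THUMB_TTL_S = 86_400  # TTL from the key layout


def thumbnail_url(cache: TTLCache, content_id: str, load_from_db) -> str:
    """Cache-aside lookup: serve from cache, otherwise load and populate."""
    key = f"thumb:{content_id}"
    url = cache.get(key)
    if url is None:
        url = load_from_db(content_id)
        cache.setex(key, THUMB_TTL_S, url)
    return url
```

The worker's "invalidate Redis cache" step pairs with this: deleting the key after regeneration forces the next read to reload the fresh URL.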

Failure Handling

Concern	Solution
FFmpeg crash	Retry 3× with exponential backoff; DLQ
Corrupt video frame	Skip corrupt segment; use adjacent frame
ML model failure	Fall back to rule-based scoring
Missing thumbnail	CDN serves placeholder; queue regeneration
Worker pool exhaustion	Auto-scale based on Kafka consumer lag

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
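
An availability target becomes concrete once it is converted to an error budget. The sketch below does that conversion; the 30-day window and the function name are assumptions, since the table does not state a measurement window.

```python
def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given availability target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} min / 30 days")
```

At 99.95% the budget is about 21.6 minutes per 30 days, which is why the incident table aims for automated rollback within 5 minutes: a single slow rollback could otherwise consume most of the month's budget.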

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.