This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 75 | Staff level: multi-region, cost at scale, migration path, and production metrics. |
Interview Prompt
Design a Thumbnail Generation Service.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Async job queue, Image resizing, Format conversion? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Async job queue
- Image resizing
- Format conversion
- CDN caching
- Idempotent generation
- Capacity estimation with shown math
Out of scope (state explicitly)
- Detailed frontend/UI pixel implementation
- Org structure, staffing, and hiring plan
Assumptions
- Clarify scale (DAU, QPS, data volume) for the thumbnail generation service in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
Functional Requirements
- Auto-generate thumbnails: From videos, images, PDFs, documents
- Multiple sizes: Small 150×150, medium 300×200, large 640×360
- Video thumbnails: Extract the "best" frame using ML scoring
- Sprite sheets: Grid of thumbnails for video seek preview
- Custom thumbnails: Upload or select from candidates
- A/B test thumbnails: Serve variants, measure CTR
- Animated thumbnails: Short WebP preview on hover
- Format optimization: Serve WebP/AVIF with JPEG fallback
Non-Functional Requirements
- Speed: Thumbnails ready within 30 seconds of upload
- Quality: Visually appealing, representative frames
- Scale: Process 50K+ videos/hour and 500K+ images/hour
- Cacheability: Highly cacheable on CDN (immutable URLs)
- Cost efficiency: Avoid unnecessary regeneration
- Availability: 99.9%
Capacity Estimation
| Metric | Calculation | Value |
|---|---|---|
| Videos uploaded / hour | Given | 50K |
| Images uploaded / hour | Given | 500K |
| Thumbnails per video | Given | 6 (5 candidates + sprite) |
| Total thumbnails / hour | 50K videos × 6 + 500K images × 3 sizes | 1.8M |
| Thumbnails / sec | 1.8M/hr ÷ 3,600 (average; apply a peak factor on top) | 500 |
| Avg thumbnail size | Given | 20 KB |
| Thumbnail storage / day | 1.8M/hr × 24 × 20 KB | 864 GB |
| CDN bandwidth | Given | ~100 Gbps |
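The arithmetic behind the table can be checked with a few lines of Python. This is a minimal sketch: the constant names are illustrative, and the figure of 3 thumbnails per image (one per size) is an assumption chosen because it reproduces the 1.8M/hour total.

```python
# Inputs taken from the capacity table.
VIDEOS_PER_HOUR = 50_000
IMAGES_PER_HOUR = 500_000
THUMBS_PER_VIDEO = 6      # 5 candidates + 1 sprite sheet
THUMBS_PER_IMAGE = 3      # ASSUMPTION: one thumbnail per size (small, medium, large)
AVG_THUMB_KB = 20


def capacity():
    """Return (thumbnails/hour, average thumbnails/sec, storage GB/day)."""
    thumbs_per_hour = (VIDEOS_PER_HOUR * THUMBS_PER_VIDEO
                       + IMAGES_PER_HOUR * THUMBS_PER_IMAGE)
    thumbs_per_sec = thumbs_per_hour / 3600
    # Decimal units: 1 GB = 1,000,000 KB.
    storage_gb_per_day = thumbs_per_hour * 24 * AVG_THUMB_KB / 1_000_000
    return thumbs_per_hour, thumbs_per_sec, storage_gb_per_day


if __name__ == "__main__":
    per_hour, per_sec, gb_day = capacity()
    print(f"{per_hour:,} thumbs/hour, {per_sec:.0f} thumbs/sec, {gb_day:.0f} GB/day")
```

The per-second figure is an average; peak sizing would multiply it by whatever peak factor the interviewer agrees to.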
Worker Queue and Scaling Pipeline
Upload triggers S3 event → SQS/Kafka message with content_id and processing options. Worker pool auto-scales on queue depth (target: process within 30s p99).
Queue sizing: 500 thumbs/sec × 2 sec avg FFmpeg job = ~1,000 concurrent workers
GPU workers for ML scoring: 50 nodes × 20 parallel = 1,000 inferences/sec
Priority queue: user-facing uploads > batch backfill
Worker lifecycle:
1. Pull job from queue (visibility timeout = 2× expected duration)
2. Download source from S3 to local /tmp (streaming for large videos)
3. FFmpeg extract frames → ML score → select best → resize → upload to S3
4. Write metadata to MySQL, invalidate Redis cache
5. ACK message; on failure: retry 3× → DLQ with alert
Poison messages: corrupt video → skip segment, use adjacent frame; unrecoverable → mark failed, serve placeholder, notify uploader
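The worker sizing and the retry-then-DLQ step can be sketched with an in-memory queue. This is an illustration only: `Job`, `run_worker`, and `required_workers` are hypothetical names, a Python list stands in for SQS/Kafka, and "retry 3×" is interpreted here as one initial attempt plus three retries.

```python
import math
from dataclasses import dataclass

MAX_RETRIES = 3  # ASSUMPTION: "retry 3x" means 1 initial attempt + 3 retries


def required_workers(jobs_per_sec: float, avg_job_sec: float) -> int:
    """Little's Law: concurrency = arrival rate x average service time."""
    return math.ceil(jobs_per_sec * avg_job_sec)


@dataclass
class Job:
    content_id: str
    attempts: int = 0


def run_worker(queue: list, handler, dlq: list) -> list:
    """Drain `queue`; requeue failed jobs until retries run out, then dead-letter."""
    acked = []
    while queue:
        job = queue.pop(0)
        job.attempts += 1
        try:
            handler(job)                 # extract, score, resize, upload
            acked.append(job.content_id)  # ACK only after success
        except Exception:
            if job.attempts > MAX_RETRIES:
                dlq.append(job)          # would also raise an alert in production
            else:
                queue.append(job)        # real queues would add backoff here
    return acked
```

With the numbers above, `required_workers(500, 2)` gives the ~1,000 concurrent workers quoted in the queue sizing line.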
Video Thumbnail Selection: Finding the Best Frame
Multi-criteria scoring: extract N candidate frames (uniform sampling + scene changes), score each on sharpness (25%), brightness (15%), contrast (10%), face presence (30%), aesthetic quality (20%). Select top 3-5 candidates.
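The weighted scoring can be expressed as a simple weighted sum. The sketch below assumes each feature has already been normalized to the range 0 to 1 by an upstream step; the feature keys and function names are illustrative.

```python
# Weights from the scoring description; they sum to 1.0.
WEIGHTS = {
    "sharpness": 0.25,
    "brightness": 0.15,
    "contrast": 0.10,
    "face_presence": 0.30,
    "aesthetic": 0.20,
}


def frame_score(features: dict) -> float:
    """Weighted sum of per-frame features, each assumed to be in [0, 1]."""
    return sum(weight * features[name] for name, weight in WEIGHTS.items())


def top_candidates(frames: list, k: int = 5) -> list:
    """Return the k highest-scoring frames, best first."""
    return sorted(frames, key=lambda f: frame_score(f["features"]), reverse=True)[:k]
```

Each frame here is a dict such as `{"id": 12, "features": {...}}`; a rule-based fallback could reuse the same function with the ML-derived features set to a neutral value.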
Sprite Sheet Generation
Extract 1 frame per 5 seconds, resize to 160×90, arrange in grid (10 cols × 12 rows). Generate VTT metadata file for client-side seek preview. Single HTTP request vs 120 individual requests.
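A sketch of the VTT metadata step follows. It assumes the widely used convention of pointing each cue at a region of the sprite with a `#xywh=x,y,w,h` media fragment; the function names and the relative sprite URL are illustrative, and players differ in what they accept.

```python
import math


def fmt_ts(seconds: int) -> str:
    """Format whole seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.000"


def sprite_vtt(duration_sec: int, sprite_url: str, interval: int = 5,
               tile_w: int = 160, tile_h: int = 90,
               cols: int = 10, rows: int = 12) -> str:
    """Build a WebVTT file mapping each time range to one tile in the sprite grid."""
    lines = ["WEBVTT", ""]
    n_tiles = min(math.ceil(duration_sec / interval), cols * rows)
    for i in range(n_tiles):
        start = i * interval
        end = min(start + interval, duration_sec)
        x = (i % cols) * tile_w    # column offset in pixels
        y = (i // cols) * tile_h   # row offset in pixels
        lines += [
            f"{fmt_ts(start)} --> {fmt_ts(end)}",
            f"{sprite_url}#xywh={x},{y},{tile_w},{tile_h}",
            "",
        ]
    return "\n".join(lines)
```

A 10 × 12 grid at one frame per 5 seconds covers 600 seconds of video; longer videos would need a larger interval or multiple sprite sheets.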
Animated Thumbnail (Hover Preview)
Select 3 interesting segments (2 seconds each), extract at 10fps, combine into animated WebP (~200-400 KB). Only generate for top 10% most-viewed videos.
A/B Testing Thumbnails
Generate 3 candidates. Consistent variant assignment via user hash. Measure CTR + watch time (avoid clickbait). Auto-promote winner when statistically significant (Chi-squared test, >10K impressions per variant).
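Consistent assignment and the significance check can be sketched as follows. The hashing scheme (SHA-256 over `content_id:user_id`) and the function names are assumptions; the statistic is the standard Pearson chi-squared over a variants × {click, no click} table, which for 3 variants has 2 degrees of freedom.

```python
import hashlib

# Chi-squared critical value for 2 degrees of freedom at alpha = 0.05.
CRITICAL_DF2_P05 = 5.991


def assign_variant(user_id: str, content_id: str, n_variants: int = 3) -> int:
    """Deterministically map a (user, content) pair to a variant index."""
    digest = hashlib.sha256(f"{content_id}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_variants


def chi_squared(clicks: list, impressions: list) -> float:
    """Pearson chi-squared statistic for click-through counts per variant."""
    pooled_rate = sum(clicks) / sum(impressions)
    stat = 0.0
    for c, n in zip(clicks, impressions):
        expected_clicks = n * pooled_rate
        expected_misses = n * (1 - pooled_rate)
        stat += (c - expected_clicks) ** 2 / expected_clicks
        stat += ((n - c) - expected_misses) ** 2 / expected_misses
    return stat


def has_significant_difference(clicks: list, impressions: list,
                               min_impressions: int = 10_000) -> bool:
    """True only when every variant has enough traffic and CTRs differ significantly."""
    if min(impressions) < min_impressions:
        return False
    return chi_squared(clicks, impressions) > CRITICAL_DF2_P05
```

This only detects that the variants differ; picking the winner would still combine CTR with watch time, as noted above, to avoid rewarding clickbait.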
Event Bus Design (Kafka)
Topic: thumbnail_generation_service-events
Partitions: 64 (scale consumers horizontally)
Partition key: entity_id (content_id for this service; preserves per-entity ordering)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "thumbnail_generation_service-processors"
- At-least-once delivery + idempotent handlers (dedup by event_id)
- DLQ topic: thumbnail_generation_service-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
Design rule: async side effects MUST NOT block the synchronous API response.
Sync path: validate → persist source of truth → publish event → return 201
Async path: consumers update caches, indexes, notifications, aggregates
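At-least-once delivery means a consumer can see the same event twice, so the handler has to deduplicate by event_id. The sketch below uses an in-memory set as a stand-in for a durable dedup store (for example a Redis set or a unique-key table); the class name is hypothetical.

```python
class IdempotentConsumer:
    """Wrap a handler so that redelivered events are processed at most once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # ASSUMPTION: stand-in for a durable dedup store

    def handle(self, event: dict) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        event_id = event["event_id"]
        if event_id in self._seen:
            return False
        self._handler(event)
        # Record only after success so a failed event is retried on redelivery.
        self._seen.add(event_id)
        return True
```

Recording the ID after the handler succeeds keeps failures retryable, at the cost of requiring the handler's own side effects to tolerate a repeat if the process crashes between the two steps.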
Generate Thumbnails for Video
POST /api/v1/thumbnails/video
{
"video_id": "vid-uuid",
"s3_key": "originals/vid-uuid/video.mp4",
"options": {
"sizes": ["150x150", "300x200", "640x360"],
"candidates": 5,
"sprite_sheet": true,
"animated_preview": true,
"format": "webp"
}
}
Get Thumbnails
GET /api/v1/thumbnails/{content_id}
Response: 200 OK
{
"content_id": "vid-uuid",
"thumbnails": {
"default": "https://cdn.example.com/thumbs/vid-uuid/default_640x360.webp",
"small": "https://cdn.example.com/thumbs/vid-uuid/small_150x150.webp"
},
"candidates": [
{"index": 0, "url": "https://cdn.example.com/thumbs/vid-uuid/candidate_0.webp", "score": 0.92}
],
"sprite_sheet": {
"url": "https://cdn.example.com/thumbs/vid-uuid/sprite.jpg",
"vtt_url": "https://cdn.example.com/thumbs/vid-uuid/sprite.vtt",
"columns": 10, "rows": 12
}
}
Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
202 Accepted: job queued; poll GET /jobs/{id} for status
408 Request Timeout: job still processing; continue polling
MySQL: Thumbnail Metadata
CREATE TABLE thumbnails (
thumbnail_id BIGINT PRIMARY KEY AUTO_INCREMENT,
content_id VARCHAR(36) NOT NULL,
content_type ENUM('video', 'image', 'document') NOT NULL,
variant_type ENUM('default', 'candidate', 'custom', 'sprite', 'animated') NOT NULL,
s3_key TEXT NOT NULL,
cdn_url TEXT NOT NULL,
format VARCHAR(10) DEFAULT 'webp',
width INT,
height INT,
file_size_bytes INT,
quality_score DECIMAL(4,3),
is_default BOOLEAN DEFAULT FALSE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_content (content_id, variant_type)
);
S3 + Redis
S3 Bucket: thumbnails
/{content_id}/default_640x360.webp
/{content_id}/sprite_sheet.jpg
/{content_id}/animated_preview.webp
Redis: thumb:{content_id} → CDN URL (TTL: 86400)
thumb_ab:{content_id}:{user_hash} → variant_index (TTL: 7d)
| Concern | Solution |
|---|---|
| FFmpeg crash | Retry 3× with exponential backoff; DLQ |
| Corrupt video frame | Skip corrupt segment; use adjacent frame |
| ML model failure | Fall back to rule-based scoring |
| Missing thumbnail | CDN serves placeholder; queue regeneration |
| Worker pool exhaustion | Auto-scale based on Kafka consumer lag |
Interview Walkthrough
- Position thumbnails as a latency-sensitive async job triggered on video upload — the player needs a poster frame before transcoding finishes.
- Walk through candidate extraction: FFmpeg input seeking at strategic timestamps (intro skip, midpoint, action peaks).
- Explain ML scoring to pick the best frame — brightness, face detection, motion blur — with rule-based fallback if the model is down.
- Cover perceptual hashing to deduplicate near-identical candidates so the picker returns visually diverse options.
- Mention content-addressable storage: hash the bytes → immutable CDN URL with max-age=31536000 for near-100% hit rate.
- Discuss sprite sheet generation as a batch follow-up: a single decode pass beats seeking repeatedly per timestamp.
- Common pitfall: FFmpeg output seeking (frame-exact, decodes from start) for every candidate — a 2-hour video takes minutes instead of seconds.
Content-Addressable Thumbnails
Hash thumbnail content: sha256(thumbnail_bytes) → S3 key. Duplicate uploads → same hash → no extra storage. Immutable URL: Cache-Control: max-age=31536000, immutable. Near 100% CDN hit rate.
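Deriving the storage key from the content hash can be sketched as below. The key layout (a `thumbs/` prefix and a two-character shard directory) is an assumption for illustration; only the SHA-256 hashing and the Cache-Control value come from the description above.

```python
import hashlib

# Header value from the description above; safe because the URL changes with the bytes.
CACHE_HEADERS = {"Cache-Control": "max-age=31536000, immutable"}


def content_key(thumbnail_bytes: bytes, ext: str = "webp", prefix: str = "thumbs") -> str:
    """Derive an S3 key from the SHA-256 of the thumbnail bytes.

    ASSUMPTION: the prefix and two-character shard directory are illustrative.
    """
    digest = hashlib.sha256(thumbnail_bytes).hexdigest()
    return f"{prefix}/{digest[:2]}/{digest}.{ext}"
```

Because identical bytes always map to the same key, a writer can check for the key's existence before uploading and skip the write entirely for duplicates.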
Perceptual Hashing for Dedup
Compute pHash for each candidate. Skip a candidate if its Hamming distance to any already-selected candidate is < 5 bits. This ensures diverse, non-redundant thumbnail candidates.
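The selection loop can be sketched independently of how the hash is computed. The example assumes each candidate already carries a 64-bit perceptual hash as an integer (produced by some image-hashing library upstream) and that candidates arrive sorted best-first; the function names are illustrative.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def select_diverse(candidates: list, min_distance: int = 5, limit: int = 5) -> list:
    """Greedily keep candidates at least `min_distance` bits from every kept one.

    `candidates` is a list of (frame_id, phash) tuples, sorted best-first.
    """
    selected = []
    for frame_id, phash in candidates:
        if all(hamming(phash, kept) >= min_distance for _, kept in selected):
            selected.append((frame_id, phash))
        if len(selected) == limit:
            break
    return [frame_id for frame_id, _ in selected]
```

Since the loop is greedy and order-dependent, sorting by quality score first means a near-duplicate never displaces a higher-scoring frame.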
FFmpeg Seek: Input vs Output Seeking
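Input seeking (-ss before -i): very fast (~50 ms) because FFmpeg jumps to the keyframe before the target. With stream copy or -noaccurate_seek the frame may be off by <2 sec; when re-encoding, FFmpeg 2.1+ then decodes forward to the requested timestamp. Good for thumbnails.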
Output seeking (-ss after -i): frame-exact but slow (decodes from start). Use input seeking for thumbnails, single decode pass for sprite sheets.
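The difference between the two modes comes down to where `-ss` sits relative to `-i`. The helpers below only build argument lists (suitable for `subprocess.run`) and do not invoke FFmpeg; the function names and file paths are hypothetical, while the flags and the `fps`, `scale`, and `tile` filters are standard FFmpeg options.

```python
def input_seek_cmd(src: str, timestamp: float, out: str) -> list:
    """-ss before -i: seek first, then decode. Fast; preferred for thumbnails."""
    return ["ffmpeg", "-ss", str(timestamp), "-i", src, "-frames:v", "1", "-y", out]


def output_seek_cmd(src: str, timestamp: float, out: str) -> list:
    """-ss after -i: decode from the start and discard frames until the timestamp."""
    return ["ffmpeg", "-i", src, "-ss", str(timestamp), "-frames:v", "1", "-y", out]


def sprite_sheet_cmd(src: str, out: str, interval: int = 5,
                     tile_w: int = 160, tile_h: int = 90,
                     cols: int = 10, rows: int = 12) -> list:
    """Single decode pass: sample one frame per interval, scale, tile into a grid."""
    vf = f"fps=1/{interval},scale={tile_w}:{tile_h},tile={cols}x{rows}"
    return ["ffmpeg", "-i", src, "-vf", vf, "-frames:v", "1", "-y", out]
```

For example, `input_seek_cmd("video.mp4", 42.0, "thumb.jpg")` yields a command that seeks before opening the decoder, which is why its cost stays roughly constant regardless of how deep into the video the timestamp is.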
CDN Format Negotiation
URL-based format selection is recommended over Accept-header negotiation. The client requests the URL for the format it supports, and the CDN caches per URL, giving a near-100% hit rate without the cache fragmentation that Vary: Accept introduces.
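Client-side URL selection might look like the sketch below. The preference order (AVIF, then WebP, then JPEG) and the URL layout are assumptions for illustration; how the client detects supported formats is out of scope here.

```python
# ASSUMPTION: preference order, smallest expected file size first, JPEG as fallback.
FORMAT_PREFERENCE = ("avif", "webp", "jpg")


def thumbnail_url(base: str, content_id: str, variant: str, supported: set) -> str:
    """Pick the best supported format and encode it in the URL path."""
    fmt = next((f for f in FORMAT_PREFERENCE if f in supported), "jpg")
    return f"{base}/thumbs/{content_id}/{variant}.{fmt}"
```

Each format gets its own URL and therefore its own cache entry, so the CDN never has to key on request headers.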
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core thumbnail generation service flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
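An availability target translates directly into an allowed-downtime budget. The helper below is a small sketch; the 30-day window is an assumption, since the table does not state the measurement window.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window.

    ASSUMPTION: 30-day rolling window; adjust to match the agreed SLO period.
    """
    return (1 - slo) * window_days * 24 * 60


if __name__ == "__main__":
    for target in (0.999, 0.9995, 0.9999):
        print(f"{target:.2%}: {error_budget_minutes(target):.1f} min / 30 days")
```

At 99.95% this works out to roughly 21.6 minutes per 30 days, which is the figure to weigh against deploy frequency and failover time.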
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.