This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 75 | Staff level: multi-region, cost at scale, migration path, and production metrics. |
Interview Prompt
Design an Image Processing Pipeline.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Multi-step pipeline, Worker pools, Retry logic? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Multi-step pipeline
- Worker pools
- Retry logic
- Output storage
- Webhook callbacks
- Capacity estimation with shown math
Out of scope (state explicitly)
- Detailed frontend/UI pixel implementation
- Org structure, staffing, and hiring plan
Assumptions
- Clarify scale (DAU, QPS, data volume) for the image processing pipeline in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
Functional requirements
- Image upload: Accept JPEG, PNG, WebP, HEIC, RAW, TIFF, GIF
- Format conversion: Convert to WebP, AVIF with JPEG fallback
- Resizing: Multiple sizes for responsive serving
- Cropping: Smart crop (face-aware, content-aware) and manual
- Filters & transforms: Rotate, flip, brightness, blur, watermark
- Content moderation: NSFW, violence detection via ML
- Metadata extraction: EXIF, dominant colors, quality score
- CDN delivery: Serve via CDN with on-the-fly transformations
Non-functional requirements
- Throughput: Process 50K+ images/sec
- Latency: Pre-processed variants ready in < 30s; on-demand in < 200ms
- Durability: Original images NEVER lost (11 nines)
- Quality: SSIM > 0.95
- Security: Strip sensitive EXIF (GPS) before serving public images
- Idempotent: Re-processing same image produces identical output
| Metric | Calculation | Value |
|---|---|---|
| Images uploaded / day | Given | 100M |
| Images / sec (upload) | 100M ÷ 86,400 ≈ 1,157, rounded up | ~1,200 |
| Avg original image size | Given | 3 MB |
| Upload storage / day | 100M × 3 MB | 300 TB |
| Variants per image | Given | 5 |
| Variant storage / day | 100M × 5 variants × ~0.9 MB avg (size implied by the total) | 450 TB |
| On-demand transforms / sec | Given (serving-side request rate, not derived from upload volume) | 50K |
| CDN bandwidth | Given | ~500 Gbps |
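
The derived rows in the table can be checked with a few lines of arithmetic. This is a minimal sketch using only the given figures; the average variant size is not stated in the sheet and is back-computed here from the 450 TB total, so treat it as an implied value rather than a measured one.

```python
SECONDS_PER_DAY = 86_400

images_per_day = 100_000_000       # given
avg_original_mb = 3                # given
variants_per_image = 5             # given
variant_storage_tb_per_day = 450   # taken from the table

# Average upload rate; the table rounds this up to ~1,200 images/sec.
avg_upload_rate = images_per_day / SECONDS_PER_DAY

# Decimal units: 1 TB = 1,000,000 MB.
upload_tb_per_day = images_per_day * avg_original_mb / 1e6

# Back-computed (implied) average size of one processed variant.
implied_variant_mb = variant_storage_tb_per_day * 1e6 / (
    images_per_day * variants_per_image
)

print(f"avg upload rate:      {avg_upload_rate:,.0f} images/sec")
print(f"upload storage / day: {upload_tb_per_day:,.0f} TB")
print(f"implied variant size: {implied_variant_mb:.1f} MB")
```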
Async Processing Queue: Kafka + Worker Pool
S3 upload completion event → Kafka topic image-processing (partition by image_id for ordering). Workers pull jobs, process, ACK on success.
Scale: 1,200 images/sec upload × 5 variants × 200ms avg = 1,200 worker-seconds of work per second, so ~1,200 concurrent workers (I/O bound).
Moderation ML adds a GPU pool: 200K inferences/sec ÷ batch size 4 = 50K batches/sec, or ~50 GPUs assuming each GPU sustains ~1,000 batches/sec.
Backpressure:
- Consumer lag > 60s → scale workers (K8s HPA on lag metric)
- Lag > 5 min → shed low-priority reprocessing, alert on-call
Partial failure:
- Variant 3/5 fails → mark partial, retry failed variants only
- Idempotency key = image_id + variant_spec → safe to retry
DLQ: after 3 retries → human review queue for corrupt uploads
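The partial-failure rule above (retry only the variants that failed, keyed by image_id + variant_spec) can be sketched as follows. This is an illustrative in-memory model, not a production worker: `process_image`, the `render` callback, and the `done` / `attempts` stores are hypothetical names, and a variant is dead-lettered here once it has failed `MAX_RETRIES` times.

```python
import hashlib

MAX_RETRIES = 3  # from the text: DLQ after 3 retries


def idempotency_key(image_id: str, variant_spec: str) -> str:
    """Stable key so a retried job can be recognised as already done."""
    return hashlib.sha256(f"{image_id}:{variant_spec}".encode()).hexdigest()


def process_image(image_id, variant_specs, render, done, attempts):
    """Render only variants that are not done yet.

    Returns (status, dead_lettered_specs), where status is
    'completed' or 'partial'.
    """
    dead_lettered = []
    for spec in variant_specs:
        key = idempotency_key(image_id, spec)
        if key in done:
            continue  # finished in an earlier attempt; skip safely
        try:
            render(image_id, spec)
            done.add(key)
        except Exception:
            attempts[key] = attempts.get(key, 0) + 1
            if attempts[key] >= MAX_RETRIES:
                dead_lettered.append(spec)  # hand off to human review
    all_done = all(idempotency_key(image_id, s) in done for s in variant_specs)
    return ("completed" if all_done else "partial"), dead_lettered
```

Calling `process_image` again with the same `done` set re-renders only the variants that failed, which is what makes redelivery from the queue safe.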
Pre-Processing vs On-Demand: Hybrid Architecture
Pre-process core variants on upload: thumbnail 150×150, feed 1080×1080, profile 640×640 (cover 90% of requests). On-demand for the long tail (10%): unusual sizes, formats, crops processed on first request and cached. 70% less compute than pre-processing all variants.
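A minimal sketch of the hybrid split, assuming a single key-value store stands in for both the processed-image bucket and the on-demand cache. `CORE_SPECS`, `on_upload`, `on_request`, and the `render` callback are hypothetical names; the three core sizes are the ones listed above, and WebP as their format is an assumption.

```python
# (width, height, format) for the variants pre-processed on upload.
CORE_SPECS = {
    (150, 150, "webp"),    # thumbnail
    (1080, 1080, "webp"),  # feed
    (640, 640, "webp"),    # profile
}


def on_upload(image_id, store, render):
    """Eagerly render the core variants that cover most requests."""
    for spec in CORE_SPECS:
        store[(image_id, spec)] = render(image_id, spec)


def on_request(image_id, spec, store, render):
    """Serve a stored variant, or render the long tail on first request."""
    key = (image_id, spec)
    if key in store:
        return store[key], "stored"
    data = render(image_id, spec)  # first request pays the compute cost
    store[key] = data              # later requests are served from the store
    return data, "rendered"
```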
Image Format Pipeline
WebP (30% smaller than JPEG, 96% browser support), AVIF (50% smaller, growing support), JPEG (universal fallback). Adaptive quality: complex images get quality 85, simple images get quality 70. Content negotiation at CDN via Accept header or URL-based format selection.
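Content negotiation on the Accept header can be sketched as below. This is a simplified parser, not a full HTTP implementation: `pick_format` is a hypothetical helper, it ignores wildcards such as `image/*` on purpose (a wildcard does not prove AVIF or WebP support), and it falls back to JPEG when nothing better is explicitly accepted.

```python
# Server-side preference order: smallest format first, universal fallback last.
PREFERENCE = ["image/avif", "image/webp", "image/jpeg"]


def pick_format(accept_header: str) -> str:
    """Return the best explicitly accepted image type, else JPEG."""
    accepted = set()
    for part in accept_header.split(","):
        media, _, params = part.strip().partition(";")
        media = media.strip().lower()
        q = 1.0
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # malformed weight: treat as not accepted
        if q > 0:
            accepted.add(media)
    for fmt in PREFERENCE:
        if fmt in accepted:
            return fmt
    return "image/jpeg"
```

Because the response now depends on the request's Accept header, the CDN cache key has to vary on it as well; the URL-based format selection mentioned above avoids that by putting the format in the path.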
Smart Cropping
Face detection (OpenCV/MTCNN, ~50ms) → saliency/attention crop fallback → rule of thirds. Pre-compute crop coordinates per variant on upload, apply at serving time. Lightweight face detection at 50K images/sec requires GPU acceleration.
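The fallback chain and the pre-computed crop coordinates can be sketched with plain geometry. The detectors themselves are out of scope here: `face_center` and `saliency_center` are assumed to come from upstream models, `choose_focus` and `crop_box` are hypothetical helpers, and using the upper-left thirds intersection as the last-resort focus point is one possible reading of "rule of thirds".

```python
from typing import Optional, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]  # left, top, width, height


def choose_focus(img_w: int, img_h: int,
                 face_center: Optional[Point],
                 saliency_center: Optional[Point]) -> Point:
    """Face first, then saliency, then a rule-of-thirds point."""
    if face_center is not None:
        return face_center
    if saliency_center is not None:
        return saliency_center
    return (img_w / 3, img_h / 3)  # assumption: upper-left thirds intersection


def crop_box(img_w: int, img_h: int,
             target_w: int, target_h: int, focus: Point) -> Box:
    """Largest box with the target aspect ratio, centred on the focus
    point as far as possible, clamped to stay inside the image."""
    scale = min(img_w / target_w, img_h / target_h)
    w, h = int(target_w * scale), int(target_h * scale)
    left = int(min(max(focus[0] - w / 2, 0), img_w - w))
    top = int(min(max(focus[1] - h / 2, 0), img_h - h))
    return left, top, w, h
```

Storing the resulting box per variant (for example in `smart_crop_data`) keeps the expensive detection on the upload path and leaves only a cheap crop at serving time.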
Content Moderation ML Pipeline
Every image passes through: NSFW detection (ResNet/EfficientNet), violence detection, OCR + spam filter. At 50K images/sec: ~200K inferences/sec. Optimized with batch inference, model distillation (MobileNet for initial screening), and TensorRT optimization. With these optimizations, ~25 GPUs are needed at scale.
Event Bus Design (Kafka)
Topic: image_processing_pipeline-events
Partitions: 64 (scale consumers horizontally)
Partition key: entity_id (here image_id, or user_id; preserves per-entity ordering)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "image_processing_pipeline-processors"
- At-least-once delivery + idempotent handlers (dedup by event_id)
- DLQ topic: image_processing_pipeline-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
Design rule for the image processing pipeline: async side effects MUST NOT block the synchronous API response.
Sync path: validate → persist source of truth → publish event → return 201
Async path: consumers update caches, indexes, notifications, aggregates
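At-least-once delivery means the same event can arrive twice, so the handler side needs dedup by event_id plus a dead-letter path for poison messages. The sketch below models that with plain Python collections rather than a real Kafka client; `consume`, `seen`, and `dlq` are hypothetical names, and in production the `seen` set would be a durable store.

```python
MAX_RETRIES = 3  # from the text: poison messages go to the DLQ after 3 retries


def consume(events, handler, seen, dlq):
    """Process events at-least-once with idempotent handling.

    seen: set of event_ids already handled successfully.
    dlq:  list collecting events that kept failing.
    """
    for event in events:
        if event["event_id"] in seen:
            continue  # duplicate delivery: already handled, skip
        for _attempt in range(1 + MAX_RETRIES):  # first try + retries
            try:
                handler(event)
                seen.add(event["event_id"])
                break
            except Exception:
                continue  # retry the same event
        else:
            dlq.append(event)  # retries exhausted: dead-letter it
```

The event is marked as seen only after the handler succeeds, so a crash between the two steps leads to a repeat, never a loss; that is why the handler itself must also be idempotent.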
Upload Image
POST /api/v1/images/upload
{
"filename": "vacation.jpg",
"content_type": "image/jpeg"
}
Response: 200 OK
{
"image_id": "img-uuid",
"upload_url": "https://s3.amazonaws.com/originals/img-uuid?X-Amz-..."
}
Get Processed Image
GET /api/v1/images/{image_id}?w=720&h=480&format=webp&crop=smart&quality=80
Response: 302 Redirect → CDN URL
Location: https://cdn.example.com/images/img-uuid/720x480_smart_q80.webp
Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
202 Accepted: job queued; poll GET /jobs/{id} for status
408 Request Timeout: job still processing; continue polling
MySQL: Image Metadata
CREATE TABLE images (
image_id VARCHAR(36) PRIMARY KEY,
user_id VARCHAR(36) NOT NULL,
original_s3_key TEXT NOT NULL,
original_format VARCHAR(10),
original_width INT,
original_height INT,
file_size_bytes INT,
exif_data JSON,
dominant_colors JSON,
faces_detected SMALLINT DEFAULT 0,
nsfw_score DECIMAL(4,3),
moderation_status ENUM('pending','approved','rejected','review') DEFAULT 'pending',
processing_status ENUM('uploaded','processing','completed','failed') DEFAULT 'uploaded',
smart_crop_data JSON,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_user (user_id, created_at DESC)
);
S3 Storage Layout
Bucket: image-originals (cross-region, never deleted)
/{image_id}/original.jpg
Bucket: image-processed (CDN-served)
/{image_id}/thumbnail_150x150.webp
/{image_id}/medium_640x640.webp
/{image_id}/large_1080x1080.webp
| Concern | Solution |
|---|---|
| Original image lost | S3 cross-region replication; 11 nines durability |
| Processing worker crash | Retry from Kafka; idempotent processing |
| Corrupt image upload | Validate header + decode test before processing |
| ML moderation error | Human review queue for borderline scores |
| Decompression bomb | Check dimensions before decode; use libvips streaming |
| EXIF GPS leak | Strip GPS in all processed variants; keep in originals |
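
The "human review queue for borderline scores" row can be sketched as a simple threshold router that maps `nsfw_score` to the `moderation_status` values used in the schema. The two thresholds are hypothetical placeholders; real values would be tuned against labelled data and the model's calibration.

```python
# Hypothetical thresholds, for illustration only.
APPROVE_BELOW = 0.20
REJECT_ABOVE = 0.90


def moderation_status(nsfw_score: float) -> str:
    """Map a model score in [0, 1] to a moderation_status value."""
    if nsfw_score < APPROVE_BELOW:
        return "approved"
    if nsfw_score > REJECT_ABOVE:
        return "rejected"
    return "review"  # borderline: send to the human review queue
```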
Interview Walkthrough
- Frame upload as fire-and-forget: store original to S3, publish a Kafka event, return immediately — processing is async.
- Walk through the worker pipeline: validate header dimensions → decode with libvips (streaming, low memory) → generate WebP/JPEG variants → upload to CDN origin.
- Explain responsive serving via URL convention (/images/{id}/w_{width}.webp) with long-lived CDN cache headers.
- Cover ML moderation as a parallel step: scores above threshold go to human review, borderline cases never block the resize path.
- Mention security hardening: strip EXIF GPS from all public variants, reject decompression bombs before full decode.
- Discuss when to self-host (imgproxy + CDN) vs Cloudinary based on daily image volume and cost crossover.
- Common pitfall: using ImageMagick with full in-memory decode — a 50 MB PNG ballooning to 2 GB RAM takes down the worker pool.
Image Bomb Detection
Read header → get dimensions before full decode. Reject if width × height × 4 > 1 GB. Set resource limits. Use libvips (streaming, doesn't load full image). Timeout after 30 seconds.
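For PNG specifically, the dimensions sit in the IHDR chunk within the first 24 bytes of the file, so the check can run before any pixel data is decoded. The sketch below covers PNG only (other formats need their own header parsers); `png_dimensions` and `is_decompression_bomb` are hypothetical helper names, and 1 GB is interpreted here as 2^30 bytes.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MAX_DECODED_BYTES = 1 << 30  # 1 GB limit from the text, taken as 2**30 bytes


def png_dimensions(header: bytes):
    """Read (width, height) from the first 24 bytes of a PNG file."""
    if (len(header) < 24
            or header[:8] != PNG_SIGNATURE
            or header[12:16] != b"IHDR"):
        raise ValueError("not a valid PNG header")
    # IHDR data starts at byte 16: width and height as big-endian uint32.
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def is_decompression_bomb(header: bytes) -> bool:
    """Apply the width x height x 4 bytes-per-pixel (RGBA) estimate."""
    width, height = png_dimensions(header)
    return width * height * 4 > MAX_DECODED_BYTES
```

A worker would read only the first 24 bytes of the upload, run this check, and reject the file before handing it to the decoder.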
Responsive Image Serving
HTML srcset with multiple widths (300w, 600w, 1200w). Client Hints for automatic selection. URL convention: /images/{id}/w_{width}.{format}.
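A small sketch of generating the srcset attribute from the URL convention above. `image_url` and `srcset` are hypothetical helpers, and the default widths are the three listed in the text.

```python
def image_url(image_id: str, width: int, fmt: str = "webp") -> str:
    """Build a variant URL following /images/{id}/w_{width}.{format}."""
    return f"/images/{image_id}/w_{width}.{fmt}"


def srcset(image_id: str, widths=(300, 600, 1200), fmt: str = "webp") -> str:
    """Return the value for an <img srcset="..."> attribute."""
    return ", ".join(f"{image_url(image_id, w, fmt)} {w}w" for w in widths)


print(srcset("img-uuid"))
```

Keeping the width set small and fixed matters for cache hit rate: every distinct width in the URL is a separate CDN object.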
ImageMagick vs libvips vs Pillow
libvips: 10× faster than ImageMagick, 10× less memory (20 MB RAM for resize). Streaming architecture. Recommended for production.
Sharp (Node.js): good for web backends.
Pillow: simple but single-threaded (GIL).
Self-Hosted vs SaaS (Cloudinary)
< 1M images/day → Cloudinary (simpler). 1M-100M → imgproxy + CDN (cost-effective). > 100M → custom pipeline + imgproxy for long tail. imgproxy signed URLs prevent abuse (HMAC).
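Signed URLs stop clients from requesting arbitrary transform parameters, which would otherwise let anyone burn compute and cache space. The sketch below shows the general HMAC-SHA256 approach in the style of imgproxy's URL signing (signature over salt + path, URL-safe Base64 without padding, prefixed to the path). The key, salt, and processing path are made-up examples; check the exact path syntax and configuration against the imgproxy documentation before relying on it.

```python
import base64
import hashlib
import hmac


def sign_path(path: str, key: bytes, salt: bytes) -> str:
    """HMAC-SHA256 over salt + path, URL-safe Base64 without padding."""
    digest = hmac.new(key, salt + path.encode(), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode()


def signed_url(path: str, key: bytes, salt: bytes) -> str:
    """Prefix the path with its signature: /{signature}{path}."""
    return f"/{sign_path(path, key, salt)}{path}"


def verify(url: str, key: bytes, salt: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    signature, sep, rest = url.lstrip("/").partition("/")
    if not sep:
        return False
    return hmac.compare_digest(signature, sign_path("/" + rest, key, salt))
```

Only the backend holds the key and salt, so a client can use the URLs it is given but cannot mint new width, height, or format combinations.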
Storage Cost Optimization
Format conversion (JPEG → WebP saves 30%), perceptual quality targeting (quality 85 vs 95 saves 40%), storage tiering (S3 Standard → IA → Glacier), deduplication via perceptual hash, progressive deletion of regenerable variants.
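Deduplication via perceptual hash can be illustrated with an average hash (aHash), one of the simplest perceptual hashes; the source does not say which algorithm is intended, so this choice is an assumption. The input is assumed to be a grayscale thumbnail already downscaled to a small grid (for example 8×8), and `max_distance=5` is an illustrative threshold.

```python
def average_hash(pixels) -> int:
    """pixels: 2D list of grayscale values from a small downscaled image.

    Each bit is 1 if the pixel is brighter than the mean, else 0.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a, b, max_distance: int = 5) -> bool:
    """Treat images as duplicates if their hashes differ in few bits."""
    return hamming(average_hash(a), average_hash(b)) <= max_distance
```

Unlike a cryptographic hash, this tolerates re-encoding and small brightness changes, so a re-uploaded copy of the same photo maps to a nearby hash and can share stored variants.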
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core image processing pipeline flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
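
An availability target translates directly into an error budget, which is easier to reason about in an incident than a percentage. A minimal sketch, assuming a 30-day window (the window length is not specified in the table):

```python
def error_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a given availability target."""
    window_minutes = window_days * 24 * 60
    return (1 - availability_pct / 100) * window_minutes


for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {error_budget_minutes(target):.2f} min per 30 days")
```

At 99.95% the budget is about 21.6 minutes per 30 days, so a single slow rollback can consume most of a month's budget.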
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.