Design an Image Processing Pipeline

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 75	Staff level: multi-region, cost at scale, migration path, and production metrics.

Interview Prompt

Design an image processing pipeline.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Multi-step pipeline, Worker pools, Retry logic?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Multi-step pipeline
Worker pools
Retry logic
Output storage
Webhook callbacks
Capacity estimation with shown math

Out of scope (state explicitly)

Detailed frontend/UI pixel implementation
Org structure, staffing, and hiring plan

Assumptions

Clarify the pipeline's scale (DAU, QPS, data volume) in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Functional Requirements

Image upload: Accept JPEG, PNG, WebP, HEIC, RAW, TIFF, GIF
Format conversion: Convert to WebP, AVIF with JPEG fallback
Resizing: Multiple sizes for responsive serving
Cropping: Smart crop (face-aware, content-aware) and manual
Filters & transforms: Rotate, flip, brightness, blur, watermark
Content moderation: NSFW, violence detection via ML
Metadata extraction: EXIF, dominant colors, quality score
CDN delivery: Serve via CDN with on-the-fly transformations

Capacity Estimation

Metric	Calculation	Value
Images uploaded / day	Given	100M
Images / sec	100M ÷ 86,400 ≈ 1,157 average (before peak factor)	~1,200
Avg original image size	Given	3 MB
Upload storage / day	100M × 3 MB	300 TB
Variants per image	Given	5
Variant storage / day	100M × 5 variants × ~0.9 MB avg variant (implied by the total)	450 TB
On-demand transforms / sec	Assumed read-side peak (not derivable from upload volume)	50K
CDN bandwidth	Given	~500 Gbps
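
The table's values can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming decimal units (1 TB = 1,000,000 MB) and an average variant size of about 0.9 MB. That size is not stated in the table; it is the value implied by 450 TB/day across 5 variants per image.

```python
SECONDS_PER_DAY = 86_400

images_per_day = 100_000_000
avg_original_mb = 3
variants_per_image = 5
avg_variant_mb = 0.9  # assumption: implied by 450 TB/day, not given in the table


def mb_to_tb(megabytes: float) -> float:
    """Convert MB to TB using decimal units."""
    return megabytes / 1_000_000


images_per_sec = images_per_day / SECONDS_PER_DAY
upload_tb_per_day = mb_to_tb(images_per_day * avg_original_mb)
variant_tb_per_day = mb_to_tb(images_per_day * variants_per_image * avg_variant_mb)

print(f"average upload rate: {images_per_sec:,.0f} images/sec")
print(f"upload storage:      {upload_tb_per_day:,.0f} TB/day")
print(f"variant storage:     {variant_tb_per_day:,.0f} TB/day")
```

The average rate comes out just under 1,200 images/sec, so the table's ~1,200 is the rounded average; a peak factor would be applied on top of it.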


Async Processing Queue: Kafka + Worker Pool

S3 upload completion event → Kafka topic image-processing (partition by image_id for ordering). Workers pull jobs, process them, and commit offsets on success.

Scale:
  1,200 images/sec upload × 5 variants × 0.2 s avg = ~1,200 concurrent jobs → ~1,200 workers at one job each (I/O bound)
  Moderation ML adds GPU pool: 200K inferences/sec ÷ ~4K inferences/sec per GPU (batched) = 50 GPUs
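
The worker count above is an application of Little's law (concurrency = arrival rate × time per job). A small sketch of that sizing follows; the function name is hypothetical, and the 4,000 inferences/sec per GPU figure is an assumption back-derived from 200K inferences/sec and 50 GPUs, not a measured number.

```python
import math


def concurrent_jobs(jobs_per_sec: int, avg_job_ms: int) -> int:
    """Little's law: average jobs in flight = arrival rate x time per job."""
    return math.ceil(jobs_per_sec * avg_job_ms / 1000)


uploads_per_sec = 1_200
variants_per_image = 5
avg_variant_ms = 200

# One job per worker at a time, so workers ~= jobs in flight.
cpu_workers = concurrent_jobs(uploads_per_sec * variants_per_image, avg_variant_ms)

inferences_per_sec = 200_000
per_gpu_throughput = 4_000  # assumption: batched inferences/sec per GPU
gpus = math.ceil(inferences_per_sec / per_gpu_throughput)

print(cpu_workers, gpus)
```

Both numbers are steady-state averages; peak headroom would be added on top.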

Backpressure:
  Consumer lag > 60s → scale workers (K8s HPA on lag metric)
  Lag > 5 min → shed low-priority reprocessing, alert on-call
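
The two lag thresholds above amount to a small decision function. The sketch below uses hypothetical names (`Action`, `backpressure_action`); in practice the scaling step would be driven by the autoscaler reading the lag metric rather than by application code.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    SCALE_WORKERS = "scale_workers"
    SHED_AND_ALERT = "shed_low_priority_and_alert"


SCALE_LAG_SECONDS = 60
SHED_LAG_SECONDS = 5 * 60


def backpressure_action(lag_seconds: float) -> Action:
    """Map consumer lag to the mitigation tier described above."""
    if lag_seconds > SHED_LAG_SECONDS:
        return Action.SHED_AND_ALERT
    if lag_seconds > SCALE_LAG_SECONDS:
        return Action.SCALE_WORKERS
    return Action.NONE
```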

Partial failure:
  Variant 3/5 fails → mark partial, retry failed variants only
  Idempotency key = image_id + variant_spec → safe to retry

DLQ: after 3 retries → human review queue for corrupt uploads
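
The partial-failure and DLQ rules can be sketched together. This is an illustrative in-memory version, not a definitive implementation: `render`, `completed`, and `dead_letters` are hypothetical stand-ins for the worker's transform function, a durable record of finished variants, and the DLQ topic. It reads "3 retries" as one initial attempt plus three retries.

```python
from typing import Callable

MAX_RETRIES = 3


def idempotency_key(image_id: str, variant_spec: str) -> str:
    return f"{image_id}:{variant_spec}"


def process_image(
    image_id: str,
    variant_specs: list[str],
    render: Callable[[str, str], None],
    completed: set[str],
    dead_letters: list[str],
) -> str:
    """Render each variant once; retry only failures; dead-letter after retries."""
    any_failed = False
    for spec in variant_specs:
        key = idempotency_key(image_id, spec)
        if key in completed:
            continue  # already done on an earlier delivery: safe to skip
        for _attempt in range(1 + MAX_RETRIES):
            try:
                render(image_id, spec)
            except Exception:
                continue  # transient failure: try this variant again
            completed.add(key)
            break
        else:
            dead_letters.append(key)  # retries exhausted: send to review
            any_failed = True
    return "partial" if any_failed else "completed"
```

Because completed keys are skipped, redelivering the same job only redoes the variants that never succeeded.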

Pre-Processing vs On-Demand: Hybrid Architecture

Pre-process the core variants on upload: thumbnail 150×150, feed 1080×1080, profile 640×640. These cover about 90% of requests. Serve the long tail (the remaining 10%: unusual sizes, formats, and crops) on demand: process on first request, then cache. This uses roughly 70% less compute than pre-processing every variant.
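
A minimal sketch of the hybrid split, assuming a dict stands in for the processed-image store and `render` is a hypothetical transform function. Core variants are written eagerly at upload; anything else is rendered on the first miss and reused afterwards.

```python
from typing import Callable

CORE_VARIANTS = ("thumbnail_150x150", "feed_1080x1080", "profile_640x640")
Render = Callable[[str, str], bytes]


def on_upload(image_id: str, store: dict[str, bytes], render: Render) -> None:
    """Eager path: pre-process the core variants when the original arrives."""
    for spec in CORE_VARIANTS:
        store[f"{image_id}/{spec}"] = render(image_id, spec)


def get_variant(image_id: str, spec: str, store: dict[str, bytes], render: Render) -> bytes:
    """Lazy path: render a long-tail variant on first request, then cache it."""
    key = f"{image_id}/{spec}"
    if key not in store:
        store[key] = render(image_id, spec)
    return store[key]
```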

Image Format Pipeline

WebP (30% smaller than JPEG, 96% browser support), AVIF (50% smaller, growing support), JPEG (universal fallback). Adaptive quality: complex images get quality 85, simple images get quality 70. Content negotiation at CDN via Accept header or URL-based format selection.
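
Content negotiation and adaptive quality can be sketched as two small functions. This is a simplified illustration: it ignores `q` weights in the Accept header, and the `is_complex` flag is assumed to come from some upstream image analysis that is not specified here.

```python
# Preference order from the text: AVIF, then WebP, then JPEG as the fallback.
PREFERRED = (("image/avif", "avif"), ("image/webp", "webp"))


def pick_format(accept_header: str) -> str:
    """Choose the smallest format the client explicitly accepts."""
    accepted = {part.split(";")[0].strip().lower() for part in accept_header.split(",")}
    for mime, fmt in PREFERRED:
        if mime in accepted:
            return fmt
    return "jpeg"


def pick_quality(is_complex: bool) -> int:
    """Adaptive quality: complex images get 85, simple images get 70."""
    return 85 if is_complex else 70
```

When negotiating on the Accept header, the CDN cache key needs to vary on the chosen format so a WebP response is never served to a JPEG-only client.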

Smart Cropping

Face detection (OpenCV/MTCNN, ~50ms) → saliency/attention crop fallback → rule of thirds. Pre-compute crop coordinates per variant on upload, apply at serving time. Lightweight face detection at 50K images/sec requires GPU acceleration.
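
Once a focal point is known, computing the crop coordinates is plain geometry. The sketch below assumes the detectors return a single `(x, y)` focal point or `None`; the function names are hypothetical, and using the upper-left thirds intersection as the last fallback is an arbitrary illustrative choice.

```python
from typing import Optional

Point = tuple[int, int]


def pick_focus(img_w: int, img_h: int, face: Optional[Point], saliency: Optional[Point]) -> Point:
    """Fallback chain: face center, then saliency peak, then a rule-of-thirds point."""
    if face is not None:
        return face
    if saliency is not None:
        return saliency
    return (img_w // 3, img_h // 3)


def crop_box(img_w: int, img_h: int, target_w: int, target_h: int, focus: Point) -> tuple[int, int, int, int]:
    """Largest box with the target aspect ratio, centered on focus within bounds."""
    ratio = target_w / target_h
    if img_w / img_h > ratio:
        box_w, box_h = round(img_h * ratio), img_h  # image too wide: trim the sides
    else:
        box_w, box_h = img_w, round(img_w / ratio)  # image too tall: trim top/bottom
    left = min(max(focus[0] - box_w // 2, 0), img_w - box_w)
    top = min(max(focus[1] - box_h // 2, 0), img_h - box_h)
    return (left, top, left + box_w, top + box_h)
```

The returned `(left, top, right, bottom)` tuple is the kind of value that could be stored per variant at upload and applied at serving time.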

Content Moderation ML Pipeline

Every image passes through four models: NSFW detection (ResNet/EfficientNet), violence detection, OCR, and a spam filter. At 50K images/sec that is ~200K inferences/sec. Optimizations: batch inference, model distillation (MobileNet for initial screening), and TensorRT. ~25 GPUs needed at scale once these are applied.
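
The "initial screening" idea is a two-stage cascade: a cheap distilled model clears obviously safe images, and only the rest reach the expensive model. A hedged sketch follows; `screen` and `full` are hypothetical callables returning a score in [0, 1], and all three thresholds are illustrative assumptions rather than tuned values.

```python
from typing import Any, Callable

Scorer = Callable[[Any], float]


def moderate(
    image: Any,
    screen: Scorer,
    full: Scorer,
    clear_below: float = 0.05,   # assumption: screening score treated as clearly safe
    review_above: float = 0.50,  # assumption: borderline scores go to human review
    reject_above: float = 0.90,  # assumption: confident violations are auto-rejected
) -> str:
    """Return a moderation status: 'approved', 'review', or 'rejected'."""
    if screen(image) < clear_below:
        return "approved"  # cheap model is confident: skip the heavy model
    score = full(image)
    if score >= reject_above:
        return "rejected"
    if score >= review_above:
        return "review"
    return "approved"
```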

Event Bus Design (Kafka)

Topic: image_processing_pipeline-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: image_id (preserves per-image ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "image_processing_pipeline-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: image_processing_pipeline-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
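
The sync/async split and the dedup-by-event_id rule can be illustrated with in-memory stand-ins. In this sketch `db` is a dict, `publish` is any callable that enqueues an event, and `ALLOWED_TYPES` is an illustrative subset of MIME types; all of these names are assumptions, not part of the design above.

```python
import uuid
from typing import Callable

ALLOWED_TYPES = {"image/jpeg", "image/png", "image/webp", "image/gif", "image/tiff", "image/heic"}


def create_upload(body: dict, db: dict, publish: Callable[[dict], None]) -> tuple[int, dict]:
    """Sync path: validate, persist, publish, return. No slow work happens here."""
    if not body.get("filename") or body.get("content_type") not in ALLOWED_TYPES:
        return 400, {"error": "invalid filename or content_type"}
    image_id = str(uuid.uuid4())
    db[image_id] = {"filename": body["filename"], "processing_status": "uploaded"}
    publish({"event_id": str(uuid.uuid4()), "image_id": image_id})
    return 201, {"image_id": image_id}


def consume(event: dict, seen_event_ids: set[str], side_effect: Callable[[dict], None]) -> bool:
    """Async path: at-least-once delivery made safe by deduplicating on event_id."""
    if event["event_id"] in seen_event_ids:
        return False  # redelivery: already handled
    side_effect(event)
    seen_event_ids.add(event["event_id"])
    return True
```

A real consumer would keep the seen-ID set in a durable store, since an in-process set is lost on restart.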

Upload Image

HTTP

POST /api/v1/images/upload
{
  "filename": "vacation.jpg",
  "content_type": "image/jpeg"
}
Response: 201 Created
{
  "image_id": "img-uuid",
  "upload_url": "https://s3.amazonaws.com/originals/img-uuid?X-Amz-..."
}

Get Processed Image

HTTP

GET /api/v1/images/{image_id}?w=720&h=480&format=webp&crop=smart&quality=80
Response: 302 Redirect → CDN URL
Location: https://cdn.example.com/images/img-uuid/720x480_smart_q80.webp
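
The redirect target is a deterministic function of the query parameters, which is what makes the CDN path cacheable. A small sketch of that mapping follows; the defaults for `crop`, `format`, and `quality` and the 4096-pixel bound are assumptions added for illustration.

```python
from urllib.parse import parse_qs, urlsplit

CDN_BASE = "https://cdn.example.com"
MAX_DIMENSION = 4096  # assumption: bound sizes to limit cache-key explosion


def cdn_url(request_url: str) -> str:
    """Map an API request URL to its canonical CDN variant URL."""
    parts = urlsplit(request_url)
    image_id = parts.path.rstrip("/").rsplit("/", 1)[-1]
    query = {name: values[0] for name, values in parse_qs(parts.query).items()}
    width, height = int(query["w"]), int(query["h"])
    if not (1 <= width <= MAX_DIMENSION and 1 <= height <= MAX_DIMENSION):
        raise ValueError("requested dimensions out of range")
    crop = query.get("crop", "fit")           # assumed default
    fmt = query.get("format", "jpeg")         # assumed default
    quality = int(query.get("quality", "80")) # assumed default
    return f"{CDN_BASE}/images/{image_id}/{width}x{height}_{crop}_q{quality}.{fmt}"


print(cdn_url("/api/v1/images/img-uuid?w=720&h=480&format=webp&crop=smart&quality=80"))
```

Because parameter order in the query string does not affect the result, equivalent requests resolve to the same cached object.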

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
202 Accepted (not an error): job queued; poll GET /jobs/{id} for status
408 Request Timeout: job still processing; continue polling
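
On the client side, the retry guidance above (honor Retry-After, otherwise back off exponentially) might look like the sketch below. `send` is a hypothetical callable returning `(status, headers, body)`, the retryable set is taken from the list above, and only the delta-seconds form of Retry-After is handled.

```python
import random
import time
from typing import Any, Callable

RETRYABLE = {429, 500, 503}
Response = tuple[int, dict, Any]


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, Any]:
    """Retry retryable statuses with Retry-After or full-jitter backoff."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status not in RETRYABLE or attempt == max_attempts - 1:
            return status, body
        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            delay = float(retry_after)  # server-specified wait in seconds
        else:
            delay = random.uniform(0, base_delay * 2 ** attempt)  # full jitter
        sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

Retried writes should carry the same idempotency key on every attempt so a request that succeeded server-side is not applied twice.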

MySQL: Image Metadata

SQL

CREATE TABLE images (
    image_id        VARCHAR(36) PRIMARY KEY,           -- UUID string
    user_id         VARCHAR(36) NOT NULL,
    original_s3_key TEXT NOT NULL,
    original_format VARCHAR(10),
    original_width  INT,
    original_height INT,
    file_size_bytes BIGINT,                            -- signed INT caps at ~2.1 GB; large TIFF/RAW files can exceed it
    exif_data       JSON,
    dominant_colors JSON,
    faces_detected  SMALLINT DEFAULT 0,
    nsfw_score      DECIMAL(4,3),                      -- model score, three decimal places
    moderation_status ENUM('pending','approved','rejected','review') DEFAULT 'pending',
    processing_status ENUM('uploaded','processing','completed','failed') DEFAULT 'uploaded',
    smart_crop_data JSON,                              -- pre-computed crop coordinates per variant
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_user (user_id, created_at DESC)          -- newest-first listing per user
);

S3 Storage Layout

Bucket: image-originals (cross-region, never deleted)
  /{image_id}/original.jpg
Bucket: image-processed (CDN-served)
  /{image_id}/thumbnail_150x150.webp
  /{image_id}/medium_640x640.webp
  /{image_id}/large_1080x1080.webp

Concern	Solution
Original image lost	S3 cross-region replication; 11 nines durability
Processing worker crash	Retry from Kafka; idempotent processing
Corrupt image upload	Validate header + decode test before processing
ML moderation error	Human review queue for borderline scores
Decompression bomb	Check dimensions before decode; use libvips streaming
EXIF GPS leak	Strip GPS in all processed variants; keep in originals
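
The "check dimensions before decode" defense can be shown for PNG, whose header stores width and height at a fixed offset. This sketch reads only the first 24 bytes; the `MAX_PIXELS` limit is an illustrative assumption, and other formats would need their own header parsers.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MAX_PIXELS = 100_000_000  # assumption: reject anything above ~100 megapixels


def png_dimensions(header: bytes) -> tuple[int, int]:
    """Read width and height from the IHDR chunk without decoding pixel data."""
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG or truncated header")
    width, height = struct.unpack(">II", header[16:24])  # big-endian uint32 pair
    return width, height


def is_safe_to_decode(header: bytes) -> bool:
    """Reject images whose declared pixel count exceeds the configured limit."""
    width, height = png_dimensions(header)
    return width > 0 and height > 0 and width * height <= MAX_PIXELS
```

A small compressed file can declare enormous dimensions, so this check has to run before any decoder allocates a pixel buffer.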

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
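
An availability target becomes concrete once it is converted into allowed downtime. The helper below is a minimal sketch with a hypothetical name, assuming a 30-day window; for the 99.95% target above it gives about 21.6 minutes per window.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


for target in (0.999, 0.9995, 0.9999):
    print(f"{target:.2%} -> {error_budget_minutes(target):.1f} min / 30 days")
```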

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.