Design a Distributed Job Scheduler (Quartz / Airflow) – System Design Walkthrough

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 25	Orchestration problem — expect exactly-once semantics discussion, dedup strategies, cron parsing edge cases, and how distributed locks prevent double execution.
Arch 50	Add priority queues, job DAG dependencies, lease-based execution, and worker pool autoscaling.
Arch 75	Staff: compare with Temporal/Airflow, discuss clock skew in cron, split-brain during scheduler failover, and backfill at scale.

Interview Prompt

Design a distributed job scheduler that runs recurring (cron) and one-time jobs across a cluster of workers. Support job priorities, dependencies (DAGs), at-least-once or exactly-once execution guarantees, and deduplication of duplicate submissions.

Clarifying Questions (ask before designing)

Question	Why it matters
Exactly-once, at-least-once, or at-most-once?	Exactly-once is hardest — requires idempotent workers + dedup. At-least-once is default with idempotent handlers.
How many jobs per second and max concurrent executions?	10K jobs/sec drives sharded scheduler metadata store and worker pool sizing.
Do jobs have dependencies (DAG) or are they independent?	DAGs require topological sort, dependency tracking, and failure propagation.
What's the max job duration and retry policy?	Long-running jobs (hours) need lease renewal; short jobs need fast retry with backoff.

Scope

In scope

Cron scheduling with timezone support
Exactly-once / dedup via idempotency keys
Priority-based job queuing
Distributed locks for single execution
Job DAGs with dependency resolution
Worker assignment and lease management

Out of scope (state explicitly)

Full workflow engine (sagas, compensation) — mention Temporal comparison
Building a container orchestrator (K8s handles that layer)
Real-time streaming job processing

Assumptions

1M scheduled jobs, 10K executions/sec peak
Jobs are idempotent or dedup-able via idempotency key
Max job duration: 4 hours; typical: 30 seconds
At-least-once delivery with exactly-once effect via dedup

Submit jobs: Submit jobs to be executed at a specific time or on a recurring schedule (cron-like)
One-time jobs: Execute once at a specific time (e.g., "send email at 3pm tomorrow")
Recurring jobs: Execute on a schedule (e.g., "every 5 minutes", "daily at midnight", cron expressions)
Job priorities: Support priority levels (critical, high, normal, low)
Job dependencies: Job B runs only after Job A completes (DAG execution)
Retry on failure: Configurable retry policy (max retries, backoff strategy)
Job status: Track status (pending, scheduled, running, completed, failed, cancelled)
Job cancellation: Cancel a pending or scheduled job
Effectively-once per schedule: At-least-once dispatch with idempotent workers + dispatch locks: duplicate runs produce the same side effect once

Metric	Value
Total scheduled jobs	100M
Jobs executing / minute	100K
Jobs executing / sec	~1,700
Avg job duration	30 seconds
Concurrent running jobs	~50K
Job metadata size	1 KB
Storage	100M × 1 KB = 100 GB

Scale insight: Memory management is key. 
100M jobs in Redis = 20GB+. 
We should only keep the next 24 hours of jobs in Redis RAM, keeping the rest in cold PostgreSQL storage.

Loading...

The system splits jobs between PostgreSQL (cold/durable) and Redis (hot/fast). A leader-elected Schedule Manager scans Redis buckets every second and dispatches due jobs to Kafka priority topics.

1. Job Store (CRUD API)

Entry point for all job submissions and status queries. On job submission, the service validates payload, checks idempotency using an idempotency_key, writes metadata to PostgreSQL, computes the target timer bucket, and adds the job ID to the Redis ZSET timer bucket if scheduled within the next 24 hours.

2. Time-Bucket Strategy (Redis ZSETs) ⭐

To avoid scanning one massive 100M-entry set, we partition jobs into 1-minute buckets. Each bucket is a Redis Sorted Set: timer:bucket:{minute_timestamp} containing job IDs scored by their exact execution second.

Loading...

3. Schedule Manager (Scanner Loop) ⭐

Runs on a leader node elected via etcd/ZooKeeper. Every second, it polls the current and previous minute buckets using ZRANGEBYSCORE to handle boundary stragglers. It atomically claims a job using ZREM, sets a belt-and-suspenders dispatch lock, updates the database status, and publishes the job to Kafka.

4. Job Dispatcher (Priority-Based Routing)

Separate Kafka topics are used for each priority level (e.g., jobs.critical, jobs.high, jobs.normal, jobs.low). This isolates workloads, prevents head-of-line blocking, and allows independent scaling of worker pools.

5. Worker Pool & Executor Lifecycle

Workers are organized in consumer groups per priority topic. The execution loop includes:

Consume job from Kafka topic.
Acquire distributed execution lock: SET job:exec:{id} {worker_id} NX EX {timeout*2} to prevent duplicate execution during rebalances.
Check DB status (idempotency check): skip if already completed.
Update DB status to running and record start time.
Execute the actual job payload (HTTP call, script, etc.).
On success, mark completed, release lock, trigger callbacks, and trigger dependent jobs.
On failure, increment retry count and schedule retry using exponential backoff with jitter. If max retries exhausted, push to Dead Letter Queue (DLQ).

6. Recurring Job Scheduler

Triggered upon a recurring job's completion. The scheduler parses the cron expression, handles timezone shifts/DST transitions, computes the next execution time, and inserts a new job record in the database. Future instances are queued in Redis only when they fall within the 24-hour hot window.

7. DAG Dependency Resolver ⭐

Handles Job C → [Job A, Job B] dependencies using atomic Redis counters. When Job C is submitted with dependencies, an edge database record is created and a Redis counter is set: SET deps:job_C 2. As parent jobs complete, they decrement the counter; once it hits 0, Job C is dispatched immediately.

Loading...

8. Dead Letter Queue (DLQ)

Failed jobs that exhaust all retries land in jobs.dead-letter. The DLQ consumer logs the full traceback, increments metrics, and triggers alerts (PagerDuty for critical failures). Admin consoles allow manual inspection, payload modification, and re-enqueueing.

9. Callback Service

Delivers job results to target services asynchronously. It retries 3x with exponential backoff. If callback fails completely, the job's completed state remains intact; callback delivery is decoupled and best-effort.

10. Effectively-Once Execution

True exactly-once is impossible across distributed systems without transactional infrastructure. We achieve effectively-once via three layers: 1) Atomic ZREM claim on Redis timer bucket, 2) Distributed dispatch lock (SET NX with fencing token), and 3) Worker execution lock + DB idempotency check (skip if status = completed). Duplicate dispatches may occur but produce the same side effect once.

Event Bus Design (Kafka)

Topic: distributed_job_scheduler-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (user_id / order_id — preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "distributed_job_scheduler-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: distributed_job_scheduler-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Design a Distributed Job Scheduler (Quartz / Airflow): async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates

Submit One-Time Job

HTTP

POST /api/v1/jobs
Content-Type: application/json

{
  "job_type": "send_email",
  "execute_at": "2026-03-14T15:00:00Z",
  "priority": "normal",
  "payload": {
    "to": "user@example.com",
    "template": "welcome_email",
    "vars": {"name": "Alice"}
  },
  "retry_policy": {
    "max_retries": 3,
    "backoff": "exponential",
    "initial_delay_sec": 60
  },
  "callback_url": "https://myservice.com/job-callback"
}

Response: 201 Created
{
  "job_id": "job-uuid-12345",
  "status": "scheduled",
  "execute_at": "2026-03-14T15:00:00Z"
}

Submit Recurring Job

HTTP

POST /api/v1/jobs/recurring
Content-Type: application/json

{
  "job_type": "generate_report",
  "cron_expression": "0 0 * * *",
  "timezone": "America/New_York",
  "payload": {"report_type": "daily_sales"},
  "priority": "high"
}

Response: 201 Created
{
  "recurring_id": "rec-uuid-67890",
  "next_execute_at": "2026-03-15T00:00:00-04:00"
}

Get Job Status

HTTP

GET /api/v1/jobs/job-uuid-12345

Response: 200 OK
{
  "job_id": "job-uuid-12345",
  "status": "completed",
  "execute_at": "2026-03-14T15:00:00Z",
  "started_at": "2026-03-14T15:00:01Z",
  "completed_at": "2026-03-14T15:00:05Z",
  "retries": 0,
  "result": {"email_sent": true}
}

Cancel Job

HTTP

DELETE /api/v1/jobs/job-uuid-12345

Response: 200 OK
{
  "status": "cancelled"
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
202 Accepted: job queued; poll GET /jobs/{id} for status
408 Request Timeout: job still processing; continue polling

PostgreSQL Schema

SQL

CREATE TABLE jobs (
    job_id          UUID PRIMARY KEY,
    job_type        VARCHAR(64) NOT NULL,
    status          VARCHAR(20) NOT NULL,    -- scheduled, dispatched, running, completed, failed, cancelled
    priority        SMALLINT DEFAULT 3,      -- 1=critical, 2=high, 3=normal, 4=low
    payload         JSONB,
    execute_at      TIMESTAMP NOT NULL,
    started_at      TIMESTAMP,
    completed_at    TIMESTAMP,
    retry_count     INT DEFAULT 0,
    max_retries     INT DEFAULT 3,
    backoff_strategy VARCHAR(20),
    next_retry_at   TIMESTAMP,
    result          JSONB,
    error_message   TEXT,
    callback_url    TEXT,
    created_by      VARCHAR(128),
    created_at      TIMESTAMP,
    updated_at      TIMESTAMP,
    INDEX idx_execute (status, execute_at),
    INDEX idx_type (job_type, status)
);

CREATE TABLE recurring_jobs (
    recurring_id    UUID PRIMARY KEY,
    job_type        VARCHAR(64),
    cron_expression VARCHAR(64),
    timezone        VARCHAR(64),
    payload         JSONB,
    priority        SMALLINT,
    is_active       BOOLEAN DEFAULT TRUE,
    last_executed   TIMESTAMP,
    next_execute_at TIMESTAMP,
    created_at      TIMESTAMP
);

CREATE TABLE job_deps (
    parent_job_id   UUID NOT NULL,
    child_job_id    UUID NOT NULL,
    PRIMARY KEY (parent_job_id, child_job_id),
    INDEX idx_child (child_job_id)
);

Redis Key Schemas

# Time bucket ZSET for due jobs in a given minute
Key:    timer:bucket:{minute_epoch}
Type:   ZSET (member: job_id, score: exact_timestamp_seconds)

# Lock to ensure single dispatch of a job from scheduler
Key:    job:dispatch:{job_id}
Type:   String (value: 1, TTL: 60s)

# Lock held by worker during job execution
Key:    job:exec:{job_id}
Type:   String (value: worker_id, TTL: job_timeout * 2)

# DAG dependency counter tracking unmet parent completions
Key:    deps:{child_job_id}
Type:   String (value: remaining_parent_count)

Kafka Topics Catalog

jobs.critical: High-priority, dedicated worker pool.
jobs.high: Processed with high priority.
jobs.normal: Default task pipeline.
jobs.low: Low priority/batch processing.
jobs.dead-letter: Failed tasks that exhausted all retries.

Concern	Solution
Scheduler leader crash	Standby scheduler promoted via etcd leader election in < 5s.
Worker crash mid-execution	Execution lock expires → scheduler re-dispatches job after visibility timeout.
Scheduler down during job times	On boot, scanner sweeps past-due scheduled jobs from PG and loads them into Redis immediately.
Duplicate executions	Atomic ZREM, dispatch locks, execution locks, and idempotent handlers.
Kafka brokers offline	Buffer dispatches in scheduler local memory/queue, retrying with backoff.
Clock skew on nodes	Strict NTP synchronization. 1-minute time buckets are highly tolerant of millisecond skews.

Retry Backoff Jitter Policy

Retries follow an exponential backoff formula: delay = initial_delay * 2^attempt + random_jitter. Adding random jitter prevents a thundering herd of retrying jobs from hitting downstream servers simultaneously.

Comparison with Existing Systems

System	Type	Primary Use Case	Trade-off
Cron	Single Node	Simple server scripts	Single point of failure, no scaling.
Celery	Task Queue	Asynchronous task execution	No robust cron/recurring scheduler built-in.
Airflow	Workflow Engine	Heavy ETL data pipelines	High latency, not meant for sub-second jobs.
Quartz	Distributed Java Lib	JVM-based application schedules	Coupled to Java ecosystem.
Temporal	Stateful Workflows	Complex, reliable state machines	Heavy operational overhead.

System Metrics to Monitor

Lag Time: Difference between scheduled and actual start execution time (target < 1s).
Success Rate: Completed vs failed jobs (target > 99.9%).
DLQ Depth: Backlog of permanently failed tasks (alert if growing).
Resource Utilization: Redis CPU/RAM, worker CPU utilization.

Interview Walkthrough

Frame as a priority queue with at-least-once execution — distinguish scheduling (when) from execution (who).
Explain leader-elected scheduler for cron triggers vs worker pool pulling from a Distributed Message Broker.
Cover job deduplication via idempotency keys and lease-based claiming so crashed workers don't lose jobs permanently.
Discuss retry with exponential backoff, dead-letter queue, and max-attempt limits for poison jobs.
Mention DAG dependencies (job B after job A) using a workflow graph, not flat queue ordering.
Common pitfall: polling the DB every second for due jobs — use time-bucketed scheduling or delay queues instead.

SLOs & Error Budgets

Metric	Target	Rationale
Job schedule accuracy	± 5 sec of cron fire time	Billing/report jobs must fire on time
Dedup correctness	100%	Zero duplicate side effects for financial jobs
Scheduler failover time	< 60s	Lease TTL bounds recovery
P0 job queue wait	< 30s p99	Critical jobs can't wait behind batch

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Cron storm after 4-hour scheduler outage	Job queue depth > 100K; 500 cron jobs fire simultaneously on recovery	Coalesce missed fires to single execution; rate-limit worker dispatch; prioritize P0; extend catch-up window to 24h
Duplicate job execution detected (double charge)	Reconciliation finds two ledger entries for same dedup_key	Halt scheduler; audit lock TTL config; verify fencing tokens on workers; add DB-level UNIQUE on side-effect records
DAG job stuck — dependency never resolves	Job C in PENDING for > 24h; upstream job A in RUNNING with expired lease	Lease expiry marks A as FAILED; cascade skip or retry per policy; alert on DAG stall > 1h; manual override API for ops

Cost Drivers (Staff lens)

Worker compute: dominant cost — autoscale to zero for batch queues
PostgreSQL metadata: small (< 100 GB) even at 1M jobs
etcd/Redis for locks: low cost but critical path — HA cluster required

Multi-Region & DR

Single active scheduler region initially. Standby region for failover (cold/warm). Cron jobs run in one region only — avoid dual-fire. Global jobs: designate primary region per job. Cross-region worker dispatch adds latency — keep workers regional.