This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 25 | Orchestration problem — expect exactly-once semantics discussion, dedup strategies, cron parsing edge cases, and how distributed locks prevent double execution. |
| Arch 50 | Add priority queues, job DAG dependencies, lease-based execution, and worker pool autoscaling. |
| Arch 75 | Staff: compare with Temporal/Airflow, discuss clock skew in cron, split-brain during scheduler failover, and backfill at scale. |
Interview Prompt
Design a distributed job scheduler that runs recurring (cron) and one-time jobs across a cluster of workers. Support job priorities, dependencies (DAGs), at-least-once or exactly-once execution guarantees, and deduplication of duplicate submissions.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Exactly-once, at-least-once, or at-most-once? | Exactly-once is hardest — requires idempotent workers + dedup. At-least-once is default with idempotent handlers. |
| How many jobs per second and max concurrent executions? | 10K jobs/sec drives sharded scheduler metadata store and worker pool sizing. |
| Do jobs have dependencies (DAG) or are they independent? | DAGs require topological sort, dependency tracking, and failure propagation. |
| What's the max job duration and retry policy? | Long-running jobs (hours) need lease renewal; short jobs need fast retry with backoff. |
Scope
In scope
- Cron scheduling with timezone support
- Exactly-once / dedup via idempotency keys
- Priority-based job queuing
- Distributed locks for single execution
- Job DAGs with dependency resolution
- Worker assignment and lease management
Out of scope (state explicitly)
- Full workflow engine (sagas, compensation) — mention Temporal comparison
- Building a container orchestrator (K8s handles that layer)
- Real-time streaming job processing
Assumptions
- 1M scheduled jobs, 10K executions/sec peak
- Jobs are idempotent or dedup-able via idempotency key
- Max job duration: 4 hours; typical: 30 seconds
- At-least-once delivery with exactly-once effect via dedup
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Submit jobs: Submit jobs to be executed at a specific time or on a recurring schedule (cron-like)
- One-time jobs: Execute once at a specific time (e.g., "send email at 3pm tomorrow")
- Recurring jobs: Execute on a schedule (e.g., "every 5 minutes", "daily at midnight", cron expressions)
- Job priorities: Support priority levels (critical, high, normal, low)
- Job dependencies: Job B runs only after Job A completes (DAG execution)
- Retry on failure: Configurable retry policy (max retries, backoff strategy)
- Job status: Track status (pending, scheduled, running, completed, failed, cancelled)
- Job cancellation: Cancel a pending or scheduled job
- Effectively-once per schedule: At-least-once dispatch with idempotent workers + dispatch locks: duplicate runs produce the same side effect once
- Reliability: Jobs must never be lost; scheduled jobs must execute even if nodes fail
- Scalability: Support millions of scheduled jobs, thousands executing concurrently
- Low Latency: Jobs execute within 1 second of their scheduled time
- High Availability: 99.99%: scheduler failure means jobs don't run
- Fault Tolerant: Survive node failures, network partitions
- Idempotent Execution: Same job running twice should produce the same result (or be deduplicated)
- Ordered: Respect job dependencies and priorities
| Metric | Value |
|---|---|
| Total scheduled jobs | 100M |
| Jobs executing / minute | 100K |
| Jobs executing / sec | ~1,700 |
| Avg job duration | 30 seconds |
| Concurrent running jobs | ~50K |
| Job metadata size | 1 KB |
| Storage | 100M × 1 KB = 100 GB |
Scale insight: Memory management is key. 100M jobs in Redis = 20GB+. We should only keep the next 24 hours of jobs in Redis RAM, keeping the rest in cold PostgreSQL storage.
The system splits jobs between PostgreSQL (cold/durable) and Redis (hot/fast). A leader-elected Schedule Manager scans Redis buckets every second and dispatches due jobs to Kafka priority topics.
1. Job Store (CRUD API)
Entry point for all job submissions and status queries. On job submission, the service validates payload, checks idempotency using an idempotency_key, writes metadata to PostgreSQL, computes the target timer bucket, and adds the job ID to the Redis ZSET timer bucket if scheduled within the next 24 hours.
2. Time-Bucket Strategy (Redis ZSETs) ⭐
To avoid scanning one massive 100M-entry set, we partition jobs into 1-minute buckets. Each bucket is a Redis Sorted Set: timer:bucket:{minute_timestamp} containing job IDs scored by their exact execution second.
3. Schedule Manager (Scanner Loop) ⭐
Runs on a leader node elected via etcd/ZooKeeper. Every second, it polls the current and previous minute buckets using ZRANGEBYSCORE to handle boundary stragglers. It atomically claims a job using ZREM, sets a belt-and-suspenders dispatch lock, updates the database status, and publishes the job to Kafka.
4. Job Dispatcher (Priority-Based Routing)
Separate Kafka topics are used for each priority level (e.g., jobs.critical, jobs.high, jobs.normal, jobs.low). This isolates workloads, prevents head-of-line blocking, and allows independent scaling of worker pools.
5. Worker Pool & Executor Lifecycle
Workers are organized in consumer groups per priority topic. The execution loop includes:
- Consume job from Kafka topic.
- Acquire distributed execution lock:
SET job:exec:{id} {worker_id} NX EX {timeout*2}to prevent duplicate execution during rebalances. - Check DB status (idempotency check): skip if already
completed. - Update DB status to
runningand record start time. - Execute the actual job payload (HTTP call, script, etc.).
- On success, mark
completed, release lock, trigger callbacks, and trigger dependent jobs. - On failure, increment retry count and schedule retry using exponential backoff with jitter. If max retries exhausted, push to Dead Letter Queue (DLQ).
6. Recurring Job Scheduler
Triggered upon a recurring job's completion. The scheduler parses the cron expression, handles timezone shifts/DST transitions, computes the next execution time, and inserts a new job record in the database. Future instances are queued in Redis only when they fall within the 24-hour hot window.
7. DAG Dependency Resolver ⭐
Handles Job C → [Job A, Job B] dependencies using atomic Redis counters. When Job C is submitted with dependencies, an edge database record is created and a Redis counter is set: SET deps:job_C 2. As parent jobs complete, they decrement the counter; once it hits 0, Job C is dispatched immediately.
8. Dead Letter Queue (DLQ)
Failed jobs that exhaust all retries land in jobs.dead-letter. The DLQ consumer logs the full traceback, increments metrics, and triggers alerts (PagerDuty for critical failures). Admin consoles allow manual inspection, payload modification, and re-enqueueing.
9. Callback Service
Delivers job results to target services asynchronously. It retries 3x with exponential backoff. If callback fails completely, the job's completed state remains intact; callback delivery is decoupled and best-effort.
10. Effectively-Once Execution
True exactly-once is impossible across distributed systems without transactional infrastructure. We achieve effectively-once via three layers: 1) Atomic ZREM claim on Redis timer bucket, 2) Distributed dispatch lock (SET NX with fencing token), and 3) Worker execution lock + DB idempotency check (skip if status = completed). Duplicate dispatches may occur but produce the same side effect once.
Event Bus Design (Kafka)
Topic: distributed_job_scheduler-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "distributed_job_scheduler-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: distributed_job_scheduler-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design a Distributed Job Scheduler (Quartz / Airflow): async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
Submit One-Time Job
POST /api/v1/jobs
Content-Type: application/json
{
"job_type": "send_email",
"execute_at": "2026-03-14T15:00:00Z",
"priority": "normal",
"payload": {
"to": "user@example.com",
"template": "welcome_email",
"vars": {"name": "Alice"}
},
"retry_policy": {
"max_retries": 3,
"backoff": "exponential",
"initial_delay_sec": 60
},
"callback_url": "https://myservice.com/job-callback"
}
Response: 201 Created
{
"job_id": "job-uuid-12345",
"status": "scheduled",
"execute_at": "2026-03-14T15:00:00Z"
}Submit Recurring Job
POST /api/v1/jobs/recurring
Content-Type: application/json
{
"job_type": "generate_report",
"cron_expression": "0 0 * * *",
"timezone": "America/New_York",
"payload": {"report_type": "daily_sales"},
"priority": "high"
}
Response: 201 Created
{
"recurring_id": "rec-uuid-67890",
"next_execute_at": "2026-03-15T00:00:00-04:00"
}Get Job Status
GET /api/v1/jobs/job-uuid-12345
Response: 200 OK
{
"job_id": "job-uuid-12345",
"status": "completed",
"execute_at": "2026-03-14T15:00:00Z",
"started_at": "2026-03-14T15:00:01Z",
"completed_at": "2026-03-14T15:00:05Z",
"retries": 0,
"result": {"email_sent": true}
}Cancel Job
DELETE /api/v1/jobs/job-uuid-12345
Response: 200 OK
{
"status": "cancelled"
}Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
202 Accepted: job queued; poll GET /jobs/{id} for status
408 Request Timeout: job still processing; continue pollingPostgreSQL Schema
CREATE TABLE jobs (
job_id UUID PRIMARY KEY,
job_type VARCHAR(64) NOT NULL,
status VARCHAR(20) NOT NULL, -- scheduled, dispatched, running, completed, failed, cancelled
priority SMALLINT DEFAULT 3, -- 1=critical, 2=high, 3=normal, 4=low
payload JSONB,
execute_at TIMESTAMP NOT NULL,
started_at TIMESTAMP,
completed_at TIMESTAMP,
retry_count INT DEFAULT 0,
max_retries INT DEFAULT 3,
backoff_strategy VARCHAR(20),
next_retry_at TIMESTAMP,
result JSONB,
error_message TEXT,
callback_url TEXT,
created_by VARCHAR(128),
created_at TIMESTAMP,
updated_at TIMESTAMP,
INDEX idx_execute (status, execute_at),
INDEX idx_type (job_type, status)
);
CREATE TABLE recurring_jobs (
recurring_id UUID PRIMARY KEY,
job_type VARCHAR(64),
cron_expression VARCHAR(64),
timezone VARCHAR(64),
payload JSONB,
priority SMALLINT,
is_active BOOLEAN DEFAULT TRUE,
last_executed TIMESTAMP,
next_execute_at TIMESTAMP,
created_at TIMESTAMP
);
CREATE TABLE job_deps (
parent_job_id UUID NOT NULL,
child_job_id UUID NOT NULL,
PRIMARY KEY (parent_job_id, child_job_id),
INDEX idx_child (child_job_id)
);Redis Key Schemas
# Time bucket ZSET for due jobs in a given minute
Key: timer:bucket:{minute_epoch}
Type: ZSET (member: job_id, score: exact_timestamp_seconds)
# Lock to ensure single dispatch of a job from scheduler
Key: job:dispatch:{job_id}
Type: String (value: 1, TTL: 60s)
# Lock held by worker during job execution
Key: job:exec:{job_id}
Type: String (value: worker_id, TTL: job_timeout * 2)
# DAG dependency counter tracking unmet parent completions
Key: deps:{child_job_id}
Type: String (value: remaining_parent_count)Kafka Topics Catalog
jobs.critical: High-priority, dedicated worker pool.jobs.high: Processed with high priority.jobs.normal: Default task pipeline.jobs.low: Low priority/batch processing.jobs.dead-letter: Failed tasks that exhausted all retries.
| Concern | Solution |
|---|---|
| Scheduler leader crash | Standby scheduler promoted via etcd leader election in < 5s. |
| Worker crash mid-execution | Execution lock expires → scheduler re-dispatches job after visibility timeout. |
| Scheduler down during job times | On boot, scanner sweeps past-due scheduled jobs from PG and loads them into Redis immediately. |
| Duplicate executions | Atomic ZREM, dispatch locks, execution locks, and idempotent handlers. |
| Kafka brokers offline | Buffer dispatches in scheduler local memory/queue, retrying with backoff. |
| Clock skew on nodes | Strict NTP synchronization. 1-minute time buckets are highly tolerant of millisecond skews. |
Retry Backoff Jitter Policy
Retries follow an exponential backoff formula: delay = initial_delay * 2^attempt + random_jitter. Adding random jitter prevents a thundering herd of retrying jobs from hitting downstream servers simultaneously.
Comparison with Existing Systems
| System | Type | Primary Use Case | Trade-off |
|---|---|---|---|
| Cron | Single Node | Simple server scripts | Single point of failure, no scaling. |
| Celery | Task Queue | Asynchronous task execution | No robust cron/recurring scheduler built-in. |
| Airflow | Workflow Engine | Heavy ETL data pipelines | High latency, not meant for sub-second jobs. |
| Quartz | Distributed Java Lib | JVM-based application schedules | Coupled to Java ecosystem. |
| Temporal | Stateful Workflows | Complex, reliable state machines | Heavy operational overhead. |
System Metrics to Monitor
- Lag Time: Difference between scheduled and actual start execution time (target < 1s).
- Success Rate: Completed vs failed jobs (target > 99.9%).
- DLQ Depth: Backlog of permanently failed tasks (alert if growing).
- Resource Utilization: Redis CPU/RAM, worker CPU utilization.
Interview Walkthrough
- Frame as a priority queue with at-least-once execution — distinguish scheduling (when) from execution (who).
- Explain leader-elected scheduler for cron triggers vs worker pool pulling from a Distributed Message Broker.
- Cover job deduplication via idempotency keys and lease-based claiming so crashed workers don't lose jobs permanently.
- Discuss retry with exponential backoff, dead-letter queue, and max-attempt limits for poison jobs.
- Mention DAG dependencies (job B after job A) using a workflow graph, not flat queue ordering.
- Common pitfall: polling the DB every second for due jobs — use time-bucketed scheduling or delay queues instead.
1. End-to-End Execution Flow
For a job scheduled at 15:00:00:
- T - 60m (Submission): Client submits job. PG records it. Added to Redis ZSET timer bucket for 15:00.
- T - 0 (Execution): Scheduler leader scans ZSET, claims job via ZREM, updates PG to
dispatched, publishes to Kafka. - T + 100ms (Execution): Worker consumes from Kafka, acquires execution lock, runs task.
- T + 5s (Completion): Task finishes. Worker updates status to
completedin PG and releases lock.
2. Race Conditions in Distributed Scheduling
Split-Brain Schedulers: If leader election splits and two instances scan the same bucket, Redis atomic ZREM ensures only one instance receives 1 (claimed) while the other receives 0 (skipped).
Bucket Boundary Misses: A job scheduled at 15:00:59.9 might be missed if the scanner runs at 15:00:59.5 and never looks back. By scanning both the current and previous buckets, stragglers are always processed.
3. Scale Out Timer Buckets
If 100M jobs exist in a single Redis instance, memory consumption and single-threaded ZSET operations bottleneck the CPU. We resolve this by sharding buckets across Redis nodes using consistent hashing: timer:bucket:{minute}:{hash(job_id) % 16}, distributing the scanning workload among scheduler shards.
4. Timer Approaches Comparison
- Redis Sorted Set: Perfect balance of speed, memory, and persistence. Shards easily.
- Hierarchical Timing Wheel: O(1) performance but difficult to persist and distribute cleanly.
- Database Polling (SKIP LOCKED): Extremely durable but constant polling stresses relational indexes. Highly useful as a backup fallback.
5. Delivery Guarantees
We choose At-Least-Once Delivery combined with Idempotent Handlers. If a worker crashes mid-task, the visibility lock expires and the job is re-dispatched. Workers must check for duplicate keys in their application handlers to avoid double-processing side effects.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1 — Single scheduler + worker pool
PostgreSQL for job metadata. Single scheduler process polls cron table every minute. Workers pull from Redis queue. At-least-once with idempotent handlers.
Key components: PostgreSQL job store · Cron poller · Redis queue · Worker pool · Basic retry
Move to next phase when: Scheduler becomes SPOF; need HA and exactly-once for billing jobs
Phase 2 — HA scheduler with distributed locks
Leader-elected scheduler (etcd). Dedup via UNIQUE constraints. Priority queues. Lease-based execution with fencing tokens. DAG support for ETL pipelines.
Key components: etcd leader election · Dedup store · Priority queues · Lease + fencing · DAG engine
Move to next phase when: 10K jobs/sec exceeds single scheduler; need sharded scheduling
Phase 3 — Sharded scheduler at scale
Consistent hash on job_id for scheduler shards. Each shard owns subset of cron jobs. Global priority coordinator. Autoscaling worker pools per job type. Integration with Temporal for long-running sagas.
Key components: Sharded schedulers · Worker autoscaling · Temporal integration · Catch-up coalescing · Multi-region standby
Move to next phase when: Complex workflows need compensation logic — migrate long DAGs to Temporal
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Job schedule accuracy | ± 5 sec of cron fire time | Billing/report jobs must fire on time |
| Dedup correctness | 100% | Zero duplicate side effects for financial jobs |
| Scheduler failover time | < 60s | Lease TTL bounds recovery |
| P0 job queue wait | < 30s p99 | Critical jobs can't wait behind batch |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Cron storm after 4-hour scheduler outage | Job queue depth > 100K; 500 cron jobs fire simultaneously on recovery | Coalesce missed fires to single execution; rate-limit worker dispatch; prioritize P0; extend catch-up window to 24h |
| Duplicate job execution detected (double charge) | Reconciliation finds two ledger entries for same dedup_key | Halt scheduler; audit lock TTL config; verify fencing tokens on workers; add DB-level UNIQUE on side-effect records |
| DAG job stuck — dependency never resolves | Job C in PENDING for > 24h; upstream job A in RUNNING with expired lease | Lease expiry marks A as FAILED; cascade skip or retry per policy; alert on DAG stall > 1h; manual override API for ops |
Cost Drivers (Staff lens)
- Worker compute: dominant cost — autoscale to zero for batch queues
- PostgreSQL metadata: small (< 100 GB) even at 1M jobs
- etcd/Redis for locks: low cost but critical path — HA cluster required
Multi-Region & DR
Single active scheduler region initially. Standby region for failover (cold/warm). Cron jobs run in one region only — avoid dual-fire. Global jobs: designate primary region per job. Cross-region worker dispatch adds latency — keep workers regional.