Interview Prompt
Design a workflow orchestration engine that coordinates multi-step, long-running business processes across distributed services. Workflows must survive crashes, restarts, and deployments.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| How long do workflows run — minutes, days, or months? | Timers and human-in-the-loop signals require durable state, not just a job queue. |
| Do we need compensation (saga rollback) or is retry enough? | Sagas for multi-service transactions; simple retry for idempotent single-step failures. |
| Build from scratch or use Temporal/Cadence? | Staff candidates discuss both; most production teams adopt an existing engine unless they are a platform company. |
| What are the activity failure semantics: at-least-once or exactly-once effects? | Activities are at-least-once; side effects need idempotency keys. Workflow state is exactly-once via replay. |
Scope
In scope
- Durable execution via event-sourced replay
- Activity dispatch with retry policies
- Saga compensation pattern
- Workflow versioning for safe deploys
Out of scope (state explicitly)
- Full Temporal server implementation details (discuss architecture)
- UI/workflow designer (assume code-defined workflows)
- Cron scheduling (use job scheduler #28 for that)
Assumptions
- 100K concurrent workflows, 10K starts/sec
- Workflows span seconds to months (timers, human approval)
- Activities call external services (payments, shipping, email)
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Long-running, multi-step workflows spanning seconds to months
- Durable execution: workflow state survives crashes, restarts, deployments
- Activity tasks: individual units of work executed by workers
- Timers and sleep: workflow.sleep(Duration.ofDays(30))
- Retry policies: automatic retry with configurable backoff
- Saga pattern: compensating transactions for distributed rollback
- Child workflows: compose workflows hierarchically
- Signals and queries: send events TO or read state FROM running workflows
- Versioning: update workflow code without breaking in-flight executions
- Visibility: search, filter, monitor workflows by custom attributes
- Durability: zero state loss across infrastructure failures
- Scalability: 100K+ concurrent workflows, 10K+ starts/sec
- Low Latency: Activity dispatch < 50ms; workflow decision < 10ms
- Effectively-once side effects: Workflow state transitions are exactly-once via replay; activities run at-least-once with idempotent handlers
- High Availability: 99.99%, since unavailability blocks all business processes
- Observability: Full audit trail of every state transition
| Metric | Calculation | Value |
|---|---|---|
| Concurrent workflows | Given (peak load assumption) | 100K |
| Workflow starts / sec | Workflow starts per day ÷ 86,400 s, with the peak factor included | 10K |
| Activity dispatches / sec | Activity dispatches per day ÷ 86,400 s, with the peak factor included | 50K |
| Avg activities per workflow | Given (typical workload assumption) | 10 |
| Avg workflow duration | Given (typical workload assumption) | 5 minutes (some: months) |
| Event history per workflow | avg 50 events × 2 KB | 100 KB |
| Total history storage | 10K starts/sec × 100 KB × 86,400 s | ~86 TB/day |
| History retention (30 days) | ~86 TB/day × 30 | ~2.6 PB |
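The storage rows follow from the start rate and the per-workflow history size. The sketch below re-derives them, assuming decimal units (1 KB = 1,000 bytes) and that the peak start rate is sustained for the whole day; both are simplifying assumptions, and the figures exclude replication and compression.

```python
# Back-of-envelope check of the history storage estimates.
# Assumptions: decimal units (1 KB = 1,000 bytes), peak start rate sustained all day.
STARTS_PER_SEC = 10_000
EVENTS_PER_WORKFLOW = 50
BYTES_PER_EVENT = 2_000
SECONDS_PER_DAY = 86_400
RETENTION_DAYS = 30

history_bytes_per_workflow = EVENTS_PER_WORKFLOW * BYTES_PER_EVENT
daily_bytes = STARTS_PER_SEC * history_bytes_per_workflow * SECONDS_PER_DAY
retained_bytes = daily_bytes * RETENTION_DAYS

print(f"history per workflow: {history_bytes_per_workflow / 1e3:.0f} KB")
print(f"history written per day: {daily_bytes / 1e12:.1f} TB")
print(f"retained over {RETENTION_DAYS} days: {retained_bytes / 1e15:.2f} PB")
```

Under these assumptions the daily volume is 86.4 TB and the 30-day total is about 2.59 PB.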
How Durable Execution Works: The Core Innovation
Traditional approach (stateless service):
function processOrder(order) {
  payment = chargeCard(order)        // ← process crashes here
  shipment = createShipment(order)   // never reached
}
// On crash: partial state, manual recovery needed
Temporal approach (durable execution):
function processOrder(order) {
  payment = workflow.executeActivity(ChargeCard, order)       // Step 1
  shipment = workflow.executeActivity(CreateShipment, order)  // Step 2
}
// On crash: Temporal replays history, skips ChargeCard (already done),
// resumes at step 2
HOW THIS WORKS:
1. Worker replays event history:
   Event 1: WorkflowExecutionStarted
   Event 2: ActivityTaskScheduled (ChargeCard)
   Event 3: ActivityTaskCompleted (result: {payment_id: "pay_123"})
   ← History ends here
2. Worker re-executes code: chargeCard → cached result; createShipment → NEW command
3. Worker returns command: ScheduleActivityTask(CreateShipment)
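The three steps above can be illustrated with a toy replayer. This is a minimal sketch, not Temporal's implementation: `ReplayContext`, `Suspend`, and `run_workflow_task` are hypothetical names, completed events carry an `activity` field for simplicity (an assumption), and activities that are scheduled but not yet completed are ignored.

```python
class Suspend(Exception):
    """Raised when the workflow needs a result that is not in history yet."""


class ReplayContext:
    """Hypothetical replay context: serves recorded results, then emits new commands."""

    def __init__(self, history):
        # Completed activity results, in the order they were recorded.
        self._completed = [e for e in history if e["type"] == "ActivityTaskCompleted"]
        self._next = 0
        self.commands = []

    def execute_activity(self, name, arg):
        if self._next < len(self._completed):
            event = self._completed[self._next]
            # Determinism check: the code must ask for the same activity as history recorded.
            if event["activity"] != name:
                raise RuntimeError(f"non-deterministic: history has {event['activity']}, code asked for {name}")
            self._next += 1
            return event["result"]  # cached result, the activity is NOT re-run
        # History is exhausted: emit a new command and stop executing the workflow code.
        self.commands.append({"type": "ScheduleActivityTask", "activity": name, "input": arg})
        raise Suspend()


def process_order(ctx, order):
    payment = ctx.execute_activity("ChargeCard", order)
    shipment = ctx.execute_activity("CreateShipment", order)
    return {"payment": payment, "shipment": shipment}


def run_workflow_task(workflow_fn, history, workflow_input):
    """Replay the workflow function against history and return the next commands."""
    ctx = ReplayContext(history)
    try:
        result = workflow_fn(ctx, workflow_input)
    except Suspend:
        return ctx.commands
    return [{"type": "CompleteWorkflowExecution", "result": result}]


history = [
    {"type": "WorkflowExecutionStarted"},
    {"type": "ActivityTaskScheduled", "activity": "ChargeCard"},
    {"type": "ActivityTaskCompleted", "activity": "ChargeCard", "result": {"payment_id": "pay_123"}},
]
commands = run_workflow_task(process_order, history, {"order_id": "42"})
print(commands)
```

With the three-event history, the replay skips ChargeCard and returns a single command to schedule CreateShipment. The determinism check shows why workflow code cannot branch on random values or wall-clock time: a different code path would no longer line up with the recorded events.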
CRITICAL RULE: Workflow code must be DETERMINISTIC
No random(), no System.currentTimeMillis(), no direct I/O
All side effects must go through activities
Event History (Event-Sourced State)
Every workflow execution is an append-only event log:
Event 1: WorkflowExecutionStarted {input: {orderId: "42"}}
Event 2: WorkflowTaskScheduled
Event 3: WorkflowTaskCompleted {commands: [ScheduleActivity(ChargeCard)]}
Event 4: ActivityTaskScheduled {activityType: "ChargeCard"}
Event 5: ActivityTaskStarted {attempt: 1}
Event 6: ActivityTaskCompleted {result: {paymentId: "pay_123"}}
Event 7: WorkflowTaskScheduled
Event 8: WorkflowTaskCompleted {commands: [ScheduleActivity(CreateShipment)]}
Event 9: TimerStarted {duration: 30 days}
...
Event 42: TimerFired
Event 43: WorkflowTaskCompleted {commands: [CompleteWorkflow]}
Event 44: WorkflowExecutionCompleted {result: {status: "success"}}
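Because the log is append-only, any view of a workflow can be rebuilt by folding over its events. The sketch below derives a small status summary that way; `fold_history` and the `scheduledEventId` attribute used to pair a completion with its scheduling event are illustrative assumptions, and event types the sketch does not know are simply ignored.

```python
def fold_history(events):
    """Derive a status summary from (event_id, event_type, attrs) tuples."""
    state = {"status": "NotStarted", "pending_activities": {}, "open_timers": 0, "result": None}
    for event_id, event_type, attrs in events:
        if event_type == "WorkflowExecutionStarted":
            state["status"] = "Running"
        elif event_type == "ActivityTaskScheduled":
            state["pending_activities"][event_id] = attrs["activityType"]
        elif event_type == "ActivityTaskCompleted":
            # Assumed attribute: the ID of the event that scheduled this activity.
            state["pending_activities"].pop(attrs["scheduledEventId"], None)
        elif event_type == "TimerStarted":
            state["open_timers"] += 1
        elif event_type == "TimerFired":
            state["open_timers"] -= 1
        elif event_type == "WorkflowExecutionCompleted":
            state["status"] = "Completed"
            state["result"] = attrs["result"]
    return state


log = [
    (1, "WorkflowExecutionStarted", {"input": {"orderId": "42"}}),
    (4, "ActivityTaskScheduled", {"activityType": "ChargeCard"}),
    (6, "ActivityTaskCompleted", {"scheduledEventId": 4, "result": {"paymentId": "pay_123"}}),
    (9, "TimerStarted", {"duration_days": 30}),
]
mid_flight = fold_history(log)
finished = fold_history(log + [
    (42, "TimerFired", {}),
    (44, "WorkflowExecutionCompleted", {"result": {"status": "success"}}),
])
print(mid_flight)
print(finished)
```

Folding a prefix of the log gives the state at that point in time, which is what makes the history usable as an audit trail.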
This log IS the workflow state. No separate database.
Saga Pattern (Distributed Compensation)
Problem: Book a trip = Flight + Hotel + Car Rental (3 services)
Temporal Saga:
function bookTrip(trip) {
  flight = workflow.executeActivity(BookFlight, trip)
  try {
    hotel = workflow.executeActivity(BookHotel, trip)
  } catch (e) {
    workflow.executeActivity(CancelFlight, flight)  // compensate
    throw e
  }
  try {
    car = workflow.executeActivity(RentCar, trip)
  } catch (e) {
    workflow.executeActivity(CancelHotel, hotel)    // compensate
    workflow.executeActivity(CancelFlight, flight)  // compensate
    throw e
  }
  return {flight, hotel, car}
}
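The nested try/catch blocks above generalize to a list of completed steps that is unwound in reverse order on failure. The following is a minimal in-memory sketch of that idea; `run_saga`, `step`, and `undo` are hypothetical helpers, and in a real engine every action and compensation would be an idempotent activity.

```python
calls = []  # records the order in which actions and compensations run


def step(name, fail=False):
    """Build a fake action that records its call and optionally fails."""
    def action(trip):
        calls.append(name)
        if fail:
            raise RuntimeError(f"{name} failed")
        return f"{name}-confirmation"
    return action


def undo(name):
    """Build a fake compensation that records its call."""
    def compensation(confirmation):
        calls.append(name)
    return compensation


def run_saga(trip, steps):
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed = []  # (compensation, result) for each step that succeeded
    results = {}
    try:
        for name, action, compensation in steps:
            result = action(trip)
            completed.append((compensation, result))
            results[name] = result
    except Exception:
        for compensation, result in reversed(completed):
            compensation(result)
        raise
    return results


steps = [
    ("flight", step("BookFlight"), undo("CancelFlight")),
    ("hotel", step("BookHotel"), undo("CancelHotel")),
    ("car", step("RentCar", fail=True), undo("CancelCar")),
]
try:
    run_saga({"trip_id": "t-1"}, steps)
except RuntimeError as err:
    print("saga failed:", err)
print(calls)
```

When RentCar fails, the hotel is cancelled before the flight, mirroring the order in the pseudocode. The sketch stops unwinding if a compensation itself raises; a production saga would retry compensations rather than abandon them.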
Why better than choreography (event-driven):
- Compensation logic in ONE place, sequential code not event chains
- Temporal guarantees compensation runs even if worker crashes
Activity Idempotency (At-Least-Once Delivery)
Problem: Temporal guarantees at-least-once activity execution.
If worker crashes mid-activity, Temporal retries → side effects run twice.
Idempotency Key Pattern:
1. Generate key before calling activity (MUST be stable across retries —
do NOT include attempt_number, or every retry would re-charge):
key = sha256(workflow_id + run_id + activity_id)
2. Pass key to downstream: POST /charge { idempotency_key: "abc123" }
3. Downstream stores (key → result) for TTL=24h
4. Duplicate request → return cached result, no re-charge
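The four steps above can be sketched with an in-memory stand-in for the downstream service. `PaymentService` is a hypothetical class, the 24h TTL is omitted for brevity, and the `|` delimiter is an added assumption (IDs must not contain it) that avoids ambiguous concatenations such as ("ab", "c") versus ("a", "bc").

```python
import hashlib


def idempotency_key(workflow_id, run_id, activity_id):
    """Stable across retries: the attempt number is deliberately not part of the key."""
    raw = "|".join([workflow_id, run_id, activity_id])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class PaymentService:
    """Hypothetical downstream service that stores key -> result (TTL omitted)."""

    def __init__(self):
        self._results = {}
        self.charges_made = 0

    def charge(self, key, amount_cents):
        if key in self._results:
            return self._results[key]  # duplicate request: return cached result
        self.charges_made += 1
        result = {"payment_id": f"pay_{self.charges_made}", "amount_cents": amount_cents}
        self._results[key] = result
        return result


service = PaymentService()
key = idempotency_key("order-42", "run-1", "charge-card")
first = service.charge(key, 29999)
retry = service.charge(key, 29999)  # simulated retry after a worker crash
print(first == retry, service.charges_made)
```

The retry returns the stored result and the card is charged once. A real service would also need the lookup and the charge to be atomic, for example via a unique constraint on the key.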
Non-retryable errors:
HTTP 400 → non_retryable (retrying won't help)
HTTP 503 → retryable with backoff (transient failure)
Event Bus Design (Kafka)
Topic: workflow_orchestration_temporal-events
Partitions: 64 (scale consumers horizontally)
Partition key: entity_id (user_id / order_id; preserves per-entity ordering)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "workflow_orchestration_temporal-processors"
- At-least-once delivery + idempotent handlers (dedup by event_id)
- DLQ topic: workflow_orchestration_temporal-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
Async side effects MUST NOT block the synchronous API response.
Sync path: validate → persist source of truth → publish event → return 201
Async path: consumers update caches, indexes, notifications, aggregates
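The consumer-side rules (dedup by event_id, dead-letter after 3 retries) can be sketched without a Kafka client. This is an in-memory illustration only: `consume`, the `seen_ids` set, and the `dlq` list are hypothetical stand-ins for a consumer loop, a dedup store, and the DLQ topic, and "3 retries" is read here as one initial attempt plus three retries.

```python
MAX_RETRIES = 3  # retries after the first attempt, before dead-lettering


def consume(events, handler, seen_ids, dlq):
    """Process events at-least-once, skipping duplicates and dead-lettering poison messages."""
    for event in events:
        if event["event_id"] in seen_ids:
            continue  # duplicate delivery: already handled
        for attempt in range(1 + MAX_RETRIES):
            try:
                handler(event)
                seen_ids.add(event["event_id"])
                break
            except Exception as exc:
                if attempt == MAX_RETRIES:
                    # Retries exhausted: park the message for manual review.
                    dlq.append({"event": event, "error": str(exc)})


handled = []
attempts = {}


def handler(event):
    attempts[event["event_id"]] = attempts.get(event["event_id"], 0) + 1
    if event.get("poison"):
        raise ValueError("cannot parse payload")
    handled.append(event["event_id"])


events = [{"event_id": "e1"}, {"event_id": "e1"}, {"event_id": "e2", "poison": True}]
seen, dlq = set(), []
consume(events, handler, seen, dlq)
print(handled, attempts, [d["event"]["event_id"] for d in dlq])
```

The duplicate `e1` is skipped and `e2` lands in the dead-letter list after four attempts. In a real deployment the dedup store would have to outlive the consumer process, since an in-memory set is lost on restart.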
Temporal Client SDK (Go example)
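// c is a client from go.temporal.io/sdk/client, e.g. c, err := client.Dial(client.Options{}).
// Error checks are elided for brevity; check err after every call.

// Start a workflow
we, err := c.ExecuteWorkflow(ctx, client.StartWorkflowOptions{
    ID:        "order-42",
    TaskQueue: "order-processing",
}, ProcessOrder, orderInput)

// Query workflow state (synchronous); the result comes back as an encoded value
var status OrderStatus
resp, err := c.QueryWorkflow(ctx, "order-42", "", "getStatus")
err = resp.Get(&status)

// Signal a running workflow (an empty run ID targets the current run)
err = c.SignalWorkflow(ctx, "order-42", "", "cancelOrder", reason)

// Wait for result
var result OrderResult
err = we.Get(ctx, &result)
Temporal Server gRPC API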
StartWorkflowExecution → Start new workflow
SignalWorkflowExecution → Send signal
QueryWorkflow → Query state
TerminateWorkflowExecution → Force-terminate
RequestCancelWorkflowExecution → Request graceful cancellation
ListWorkflowExecutions → Search via visibility
GetWorkflowExecutionHistory → Full event history
PollWorkflowTaskQueue → Worker long-poll for workflow tasks
PollActivityTaskQueue → Worker long-poll for activities
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
Cassandra (History Store)
-- Execution state
CREATE TABLE executions (
    namespace_id UUID, workflow_id TEXT, run_id UUID,
    state INT, next_event_id BIGINT,
    PRIMARY KEY ((namespace_id, workflow_id), run_id)
);
-- Event history (append-only, event-sourced)
CREATE TABLE history_events (
    namespace_id UUID, workflow_id TEXT, run_id UUID,
    event_id BIGINT, event_type INT, data BLOB,
    PRIMARY KEY ((namespace_id, workflow_id, run_id), event_id)
) WITH CLUSTERING ORDER BY (event_id ASC);
-- Pending tasks
CREATE TABLE tasks (
    namespace_id UUID, task_queue TEXT, task_type INT,
    task_id BIGINT, scheduled_time TIMESTAMP,
    PRIMARY KEY ((namespace_id, task_queue, task_type), task_id)
);
-- Timers
CREATE TABLE timers (
    namespace_id UUID, fire_time TIMESTAMP,
    workflow_id TEXT, timer_id TEXT,
    PRIMARY KEY ((namespace_id), fire_time, workflow_id, timer_id)
) WITH CLUSTERING ORDER BY (fire_time ASC);
Elasticsearch (Visibility)
{
  "namespace": "production",
  "workflow_id": "order-42",
  "workflow_type": "ProcessOrder",
  "status": "Running",
  "start_time": "2026-03-14T10:00:00Z",
  "custom_attributes": {
    "customer_id": "cust-789",
    "order_total": 299.99
  }
}
| Concern | Solution |
|---|---|
| Workflow worker crash | Task times out → server re-dispatches → new worker replays and resumes |
| Activity worker crash | Task times out → server retries per retry policy |
| Server node crash | History shards rebalanced; all state in Cassandra (durable) |
| DB unavailable | Writes fail and are retried by clients and server-side task processors; acknowledged history stays durable, so no state is lost |
| Long activity timeout | Heartbeat: worker sends a heartbeat every N seconds; a missed heartbeat timeout triggers a retry without waiting for the full start-to-close timeout |
| Poison pill workflow | Retry with backoff → max retries → fail → dead-letter |
| Versioning during deploy | getVersion() → old code for in-flight, new code for new |
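The retry rows in the table rely on a retry policy with exponential backoff and a class of errors that is never retried. A minimal sketch follows; `backoff_delays`, `run_with_retries`, and `NonRetryableError` are hypothetical names, and the sleep function is injected so the example runs without real waiting.

```python
class NonRetryableError(Exception):
    """Failure that retrying cannot fix, e.g. an HTTP 400 from a downstream service."""


def backoff_delays(initial_s, coefficient, max_interval_s, max_attempts):
    """Delay in seconds before each retry (attempts 2..max_attempts), capped at max_interval_s."""
    return [min(initial_s * coefficient ** n, max_interval_s) for n in range(max_attempts - 1)]


def run_with_retries(activity, delays, sleep):
    """Call activity(attempt) until it succeeds, hits a non-retryable error, or runs out of retries."""
    attempt = 1
    while True:
        try:
            return activity(attempt)
        except NonRetryableError:
            raise  # retrying will not help
        except Exception:
            if attempt > len(delays):
                raise  # retries exhausted: surface the failure (e.g. to a dead-letter path)
            sleep(delays[attempt - 1])
            attempt += 1


delays = backoff_delays(initial_s=1.0, coefficient=2.0, max_interval_s=10.0, max_attempts=6)

slept = []


def flaky_charge(attempt):
    # Simulates two transient failures (e.g. HTTP 503) before succeeding.
    if attempt < 3:
        raise ConnectionError("HTTP 503")
    return f"charged on attempt {attempt}"


outcome = run_with_retries(flaky_charge, delays, sleep=slept.append)
print(delays)
print(outcome, slept)
```

The delay list grows geometrically until it reaches the cap, and the flaky activity succeeds on its third attempt after waiting 1 s and 2 s. Production policies usually add jitter so that many failing activities do not retry in lockstep.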
Production Patterns
- Order Processing: validate → reserve → charge → ship → notify. Saga on failure.
- User Onboarding: create account → send email → wait 3 days → send tutorial → wait 7 days → send offer.
- Subscription Billing: charge monthly → retry 3x over 7 days → if still failed → cancel.
- Human-in-the-Loop: submit expense → wait for approval (signal) → reimburse or reject.
Temporal vs Saga with Events (Choreography vs Orchestration)
- Choreography: logic scattered across services, hard to see full flow, compensation is complex.
- Orchestration (Temporal): flow visible in ONE place, compensation = try/catch, full history, ordering guaranteed, unit-testable.
- Recommendation: Use orchestration for critical business flows, choreography for decoupled notifications/analytics.
Interview Walkthrough
- Define the problem as durable execution — workflows that survive process crashes, network failures, and deploys without losing state.
- Contrast orchestration (Temporal, centralized flow) vs choreography (Kafka events, decentralized) using Distributed Transactions 2PC vs Saga framing.
- Model each step as an idempotent Activity with configurable retry policies — Temporal replays history, so side effects must be safe to repeat.
- Show compensation explicitly: try { charge } catch { refund }. Interviewers want to see rollback paths, not just happy paths.
- Cover human-in-the-loop via Signals: a workflow waits days for an approval event without holding threads or polling a database.
- Address workflow versioning for zero-downtime deploys — pin running instances to old code paths while new executions use updated logic.
- Quantify timeout budgets: activity start-to-close timeout, schedule-to-start timeout, and heartbeat timeout for long-running tasks.
- Common pitfall: hand-rolling saga logic with Kafka consumers and cron reconciliation — it becomes unmaintainable without durable state and replay.
Temporal vs Airflow vs AWS Step Functions
| Dimension | Temporal | Airflow | Step Functions |
|---|---|---|---|
| Execution model | Durable functions | DAG scheduler | State machine (JSON) |
| Replay | Full event-sourced replay | Re-run whole task | Re-drive from failed state |
| Long-running | Timers up to years | Task timeout limits | Heartbeat timers |
| Saga/compensation | Built-in via code | Manual compensation | Catch + rollback |
| Developer model | Code (Go/Java/TS) | Python DAGs | JSON/YAML |
Temporal shines when:
1. Flow has > 3 steps across services
2. Steps need retry/compensation
3. Flow spans minutes to months (timers, human approval)
4. You need visibility into every step
5. You want workflow logic in CODE, not YAML
Workflow Versioning for Zero-Downtime Deploys
Problem: 10,000 workflows running processOrder v1.
Deploy v2 (added new step). How to handle in-flight v1 workflows?
Solution: Temporal's getVersion()
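function processOrder(order) {
  payment = executeActivity(ChargeCard, order)
  // minSupported = DEFAULT_VERSION keeps executions recorded before this change replayable
  version = workflow.getVersion("add-fraud-check", DEFAULT_VERSION, 1)
  if (version >= 1) {
    executeActivity(FraudCheck, order)  // NEW step in v2
  }
  shipment = executeActivity(CreateShipment, order)
}
In-flight v1 workflows already past this point: no version marker in history, so getVersion returns DEFAULT_VERSION → skip FraudCheck
Executions reaching this point for the first time: getVersion records and returns 1 → run FraudCheck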
Eventually all v1 workflows complete → remove version check in v3
When to Use Temporal vs Other Approaches
| Approach | Use When | Don't Use When |
|---|---|---|
| Temporal | Multi-step, long-running, retry/compensation | Simple request-response |
| Kafka + Consumers | Event-driven, fan-out, decoupled | Request-response or saga |
| Step Functions | AWS-native, < 25K/sec, simple DAGs | High throughput, complex logic |
| Choreography (events) | Loose coupling, simple flows | Complex compensation, visibility |
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1 — Database + cron (manual recovery)
State in DB rows. Cron polls for next step. On crash, manual intervention or duplicate runs.
Key components: PostgreSQL state table · Cron worker · Manual runbooks
Move to next phase when: Partial failures require engineering intervention; can't scale past 1K concurrent workflows
Phase 2 — Queue-based saga (choreography)
Each step publishes event to next consumer. Compensation via compensating events. Hard to debug.
Key components: Kafka · Compensating events · Distributed tracing
Move to next phase when: Compensation logic scattered; need visibility into workflow state
Phase 3 — Workflow engine (Temporal/Cadence)
Code-defined workflows with durable replay. Centralized saga logic. Full history for debugging.
Key components: Temporal server · History shards · Matching service · Worker fleet · Visibility (Elasticsearch)
Move to next phase when: Business requires audit trail, human-in-the-loop, and safe deploys with in-flight workflows
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Workflow start latency | < 100ms p99 | User-facing flows start synchronously |
| Activity dispatch latency | < 50ms p99 | Worker poll + schedule overhead |
| Zero workflow state loss | 100% | Event log durability via Cassandra quorum writes |
| Visibility query latency | < 2s | Ops/debugging via Elasticsearch index |
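An error budget turns a target into an amount of allowed failure. As a rough illustration, the sketch below converts the 99.99% availability requirement stated earlier into minutes of downtime over a 30-day window; the function name and the 3-minute incident are hypothetical.

```python
def downtime_budget_minutes(availability_target, window_days=30):
    """Minutes of full unavailability the target allows within the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - availability_target) * window_minutes


budget = downtime_budget_minutes(0.9999)
used = 3.0  # hypothetical minutes consumed by one incident
remaining = budget - used

print(f"30-day budget at 99.99%: {budget:.2f} min")
print(f"remaining after a {used:.0f}-minute incident: {remaining:.2f} min")
```

At 99.99% the budget is roughly 4.3 minutes per 30 days, so a single multi-minute incident consumes most of it. That is one argument for treating the engine as a tier-0 dependency.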
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| History shard overloaded (hot workflow type) | Workflow task schedule latency p99 > 500ms; shard CPU pegged | Increase shard count (requires migration); rate-limit workflow starts per namespace; scale history service nodes |
| Poison pill workflow infinite retry | Single workflow_id accumulating events; worker CPU on replay | Terminate workflow after max retries; dead-letter queue for manual review; fix activity code and redeploy with getVersion |
| Cassandra write latency spike | Workflow start errors increase; event append timeout | Failed appends are retried by clients and the server; acknowledged events remain durable, so no state is lost. Scale Cassandra cluster; check compaction backlog |
Cost Drivers (Staff lens)
- Event history storage: ~86 TB/day at 10K starts/sec (plan compaction + retention)
- Worker fleet: workflow workers CPU-bound (replay); activity workers I/O-bound
- Elasticsearch visibility index size
Multi-Region & DR
Namespace-per-region initially. Global workflows: accept higher latency for cross-region history writes or use region-local workflows with cross-region signals. DR: Cassandra multi-DC replication; replay workers in standby region.