Design a Workflow Orchestration Engine (like Temporal/Cadence)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Understand orchestration vs choreography and when sagas beat event chains.
Arch 75	Platform/infra question — staff interviewers expect durable execution via event replay, saga compensation, workflow versioning, and build-vs-buy (Temporal vs Step Functions).

Interview Prompt

Design a workflow orchestration engine that coordinates multi-step, long-running business processes across distributed services. Workflows must survive crashes, restarts, and deployments.

Clarifying Questions (ask before designing)

Question	Why it matters
How long do workflows run — minutes, days, or months?	Timers and human-in-the-loop signals require durable state, not just a job queue.
Do we need compensation (saga rollback) or is retry enough?	Sagas for multi-service transactions; simple retry for idempotent single-step failures.
Build from scratch or use Temporal/Cadence?	Staff candidates discuss both; most production teams buy unless you're a platform company.
What are the activity failure semantics: at-least-once or exactly-once effects?	Activities are at-least-once; side effects need idempotency keys. Workflow state is exactly-once via replay.

Scope

In scope

Durable execution via event-sourced replay
Activity dispatch with retry policies
Saga compensation pattern
Workflow versioning for safe deploys

Out of scope (state explicitly)

Full Temporal server implementation details (discuss architecture)
UI/workflow designer (assume code-defined workflows)
Cron scheduling (use job scheduler #28 for that)

Assumptions

100K concurrent workflows, 10K starts/sec
Workflows span seconds to months (timers, human approval)
Activities call external services (payments, shipping, email)

Functional Requirements

Long-running, multi-step workflows spanning seconds to months
Durable execution: workflow state survives crashes, restarts, deployments
Activity tasks: individual units of work executed by workers
Timers and sleep: workflow.sleep(Duration.ofDays(30))
Retry policies: automatic retry with configurable backoff
Saga pattern: compensating transactions for distributed rollback
Child workflows: compose workflows hierarchically
Signals and queries: send events TO or read state FROM running workflows
Versioning: update workflow code without breaking in-flight executions
Visibility: search, filter, monitor workflows by custom attributes

Capacity Estimates

Metric	Calculation	Value
Concurrent workflows	Given (peak load assumption)	100K
Workflow starts / sec	Given (peak load assumption)	10K
Activity dispatches / sec	Given (peak load assumption)	50K
Avg activities per workflow	Given (typical workload assumption)	10
Avg workflow duration	Given (typical workload assumption)	5 minutes (some: months)
Event history per workflow	avg 50 events × 2 KB	100 KB
Total history storage	10K starts/sec × 100 KB × 86,400 s/day	~86 TB/day
History retention (30 days)	~86 TB/day × 30 days	~2.6 PB
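
The storage rows follow directly from the start rate and the per-workflow history size. A minimal sketch of that arithmetic, assuming decimal units (1 KB = 1,000 bytes) and the table's inputs; the constant and function names are illustrative only:

```go
package main

import "fmt"

// Inputs taken from the estimate table; names are illustrative.
const (
	startsPerSec  = 10_000 // workflow starts per second (peak)
	eventsPerWF   = 50     // average events per workflow
	bytesPerEvent = 2_000  // 2 KB per event, decimal units
	secondsPerDay = 86_400
	retentionDays = 30
)

// historyBytesPerWorkflow is the average history size of one workflow.
func historyBytesPerWorkflow() int64 { return eventsPerWF * bytesPerEvent }

// historyBytesPerDay is the history written per day at the peak start rate.
func historyBytesPerDay() int64 {
	return startsPerSec * historyBytesPerWorkflow() * secondsPerDay
}

// retainedBytes is the steady-state footprint for the retention window.
func retainedBytes() int64 { return historyBytesPerDay() * retentionDays }

func main() {
	fmt.Printf("per workflow: %d KB\n", historyBytesPerWorkflow()/1_000)
	fmt.Printf("per day:      %.1f TB\n", float64(historyBytesPerDay())/1e12)
	fmt.Printf("retained:     %.2f PB\n", float64(retainedBytes())/1e15)
}
```

This prints 100 KB per workflow, 86.4 TB per day, and 2.59 PB retained, before replication and compression are taken into account.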


How Durable Execution Works: The Core Innovation

Traditional approach (stateless service):
  function processOrder(order) {
    payment = chargeCard(order)         // process crashes here
    shipment = createShipment(order)    // never reached
  }
  // On crash: partial state, manual recovery needed

Temporal approach (durable execution):
  function processOrder(order) {
    payment = workflow.executeActivity(ChargeCard, order)     // Step 1
    shipment = workflow.executeActivity(CreateShipment, order) // Step 2
  }
  // On crash: Temporal replays history, skips ChargeCard (already done),
  // resumes at step 2

HOW THIS WORKS:
  1. Worker replays event history:
     Event 1: WorkflowExecutionStarted
     Event 2: ActivityTaskScheduled (ChargeCard)
     Event 3: ActivityTaskCompleted (result: {payment_id: "pay_123"})
     (history ends here)
  2. Worker re-executes code: chargeCard → cached result; createShipment → NEW command
  3. Worker returns command: ScheduleActivityTask(CreateShipment)

CRITICAL RULE: Workflow code must be DETERMINISTIC
  No random(), no System.currentTimeMillis(), no direct I/O
  All side effects must go through activities
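
The replay steps above can be sketched with a toy in-memory model. This is not Temporal's SDK: `Replayer`, `Event`, and the string commands are hypothetical stand-ins, and only completed activities are modeled. The point is that workflow code re-runs from the top, recorded results are returned without re-executing anything, and the first call with no recorded result becomes a new command.

```go
package main

import "fmt"

// Event is one completed activity in a workflow's history (simplified).
type Event struct {
	Activity string
	Result   string
}

// Replayer is a hypothetical stand-in for the workflow context.
type Replayer struct {
	history  []Event  // events recorded by earlier executions
	cursor   int      // next history event to consume
	Commands []string // new work discovered once history is exhausted
}

// ExecuteActivity returns the recorded result while replaying. Once history
// is exhausted it emits a command instead of running the activity.
func (r *Replayer) ExecuteActivity(name string) (result string, done bool) {
	if r.cursor < len(r.history) {
		ev := r.history[r.cursor]
		if ev.Activity != name {
			// Code and history disagree: the workflow was not deterministic.
			panic(fmt.Sprintf("non-deterministic: history has %q, code asked for %q", ev.Activity, name))
		}
		r.cursor++
		return ev.Result, true
	}
	r.Commands = append(r.Commands, "ScheduleActivityTask("+name+")")
	return "", false
}

// processOrder is the workflow function. It stops at the first activity
// whose result is not yet in history.
func processOrder(r *Replayer) string {
	payment, ok := r.ExecuteActivity("ChargeCard")
	if !ok {
		return "blocked"
	}
	shipment, ok := r.ExecuteActivity("CreateShipment")
	if !ok {
		return "blocked after " + payment
	}
	return "done: " + payment + ", " + shipment
}

func main() {
	r := &Replayer{history: []Event{{Activity: "ChargeCard", Result: "pay_123"}}}
	fmt.Println(processOrder(r)) // blocked after pay_123
	fmt.Println(r.Commands)      // [ScheduleActivityTask(CreateShipment)]
}
```

The panic on a mismatch shows why the determinism rule matters: if the code asks for a different activity than the one history recorded, replay cannot continue.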

Event History (Event-Sourced State)

Every workflow execution is an append-only event log:

  Event 1:  WorkflowExecutionStarted     {input: {orderId: "42"}}
  Event 2:  WorkflowTaskScheduled
  Event 3:  WorkflowTaskCompleted        {commands: [ScheduleActivity(ChargeCard)]}
  Event 4:  ActivityTaskScheduled        {activityType: "ChargeCard"}
  Event 5:  ActivityTaskStarted          {attempt: 1}
  Event 6:  ActivityTaskCompleted        {result: {paymentId: "pay_123"}}
  Event 7:  WorkflowTaskScheduled
  Event 8:  WorkflowTaskCompleted        {commands: [ScheduleActivity(CreateShipment)]}
  Event 9:  TimerStarted                 {duration: 30 days}
  ...
  Event 42: TimerFired
  Event 43: WorkflowTaskCompleted        {commands: [CompleteWorkflow]}
  Event 44: WorkflowExecutionCompleted   {result: {status: "success"}}

This log IS the workflow state: local variables and progress are rebuilt by replaying it, so workflow code needs no separate application database.
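
As an illustration of deriving state from the log, the sketch below folds a history into a small summary. The `State` fields and the `Fold` function are hypothetical; a real engine tracks far more, but the principle of computing state from events rather than storing it separately is the same.

```go
package main

import "fmt"

// Event is a simplified history entry: an ID and a type name.
type Event struct {
	ID   int
	Type string
}

// State is a hypothetical summary derived purely from the event log.
type State struct {
	Status            string
	PendingActivities int
	PendingTimers     int
	NextEventID       int
}

// Fold replays the log from the beginning and accumulates the state.
func Fold(history []Event) State {
	s := State{Status: "NotStarted", NextEventID: 1}
	for _, e := range history {
		switch e.Type {
		case "WorkflowExecutionStarted":
			s.Status = "Running"
		case "ActivityTaskScheduled":
			s.PendingActivities++
		case "ActivityTaskCompleted":
			s.PendingActivities--
		case "TimerStarted":
			s.PendingTimers++
		case "TimerFired":
			s.PendingTimers--
		case "WorkflowExecutionCompleted":
			s.Status = "Completed"
		}
		s.NextEventID = e.ID + 1
	}
	return s
}

func main() {
	history := []Event{
		{1, "WorkflowExecutionStarted"},
		{2, "WorkflowTaskScheduled"},
		{3, "WorkflowTaskCompleted"},
		{4, "ActivityTaskScheduled"},
		{5, "ActivityTaskStarted"},
		{6, "ActivityTaskCompleted"},
		{7, "WorkflowTaskScheduled"},
		{8, "WorkflowTaskCompleted"},
		{9, "TimerStarted"},
	}
	fmt.Printf("%+v\n", Fold(history))
}
```

For the first nine events of the example log, the fold reports a running workflow with no pending activities, one pending timer, and a next event ID of 10.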

Saga Pattern (Distributed Compensation)

Problem: Book a trip = Flight + Hotel + Car Rental (3 services)

Temporal Saga:
  function bookTrip(trip) {
    flight = workflow.executeActivity(BookFlight, trip)
    try {
      hotel = workflow.executeActivity(BookHotel, trip)
    } catch (e) {
      workflow.executeActivity(CancelFlight, flight)  // compensate
      throw e
    }
    try {
      car = workflow.executeActivity(RentCar, trip)
    } catch (e) {
      workflow.executeActivity(CancelHotel, hotel)    // compensate
      workflow.executeActivity(CancelFlight, flight)  // compensate
      throw e
    }
    return {flight, hotel, car}
  }

Why better than choreography (event-driven):
  - Compensation logic in ONE place, sequential code not event chains
  - Temporal guarantees compensation runs even if worker crashes

Activity Idempotency (At-Least-Once Delivery)

Problem: Temporal guarantees at-least-once activity execution.
  If a worker crashes mid-activity, Temporal retries the activity → side effects may run twice.

Idempotency Key Pattern:
  1. Generate key before calling activity (MUST be stable across retries —
     do NOT include attempt_number, or every retry would re-charge):
     key = sha256(workflow_id + run_id + activity_id)
  2. Pass key to downstream: POST /charge { idempotency_key: "abc123" }
  3. Downstream stores (key → result) for TTL=24h
  4. Duplicate request → return cached result, no re-charge
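
A minimal sketch of the four steps, under stated assumptions: `PaymentService` is a hypothetical in-memory downstream, TTL expiry is omitted, and the key joins its fields with a NUL separator (a small variation on plain concatenation, so that ("ab", "c") and ("a", "bc") cannot collide).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// IdempotencyKey derives a key that is identical on every retry of the
// same activity. The attempt number is deliberately not part of it.
func IdempotencyKey(workflowID, runID, activityID string) string {
	sum := sha256.Sum256([]byte(workflowID + "\x00" + runID + "\x00" + activityID))
	return hex.EncodeToString(sum[:])
}

// PaymentService is a hypothetical downstream that deduplicates by key.
type PaymentService struct {
	mu      sync.Mutex
	results map[string]string // key -> result (TTL expiry omitted)
	Charges int               // number of real charges performed
}

func NewPaymentService() *PaymentService {
	return &PaymentService{results: map[string]string{}}
}

// Charge performs the side effect at most once per key and returns the
// stored result for duplicates.
func (p *PaymentService) Charge(key string, amountCents int) string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if res, ok := p.results[key]; ok {
		return res // duplicate request: no re-charge
	}
	p.Charges++
	res := fmt.Sprintf("pay_%d", p.Charges)
	p.results[key] = res
	return res
}

func main() {
	svc := NewPaymentService()
	key := IdempotencyKey("order-42", "run-1", "charge-card")
	first := svc.Charge(key, 29999)
	retry := svc.Charge(key, 29999)    // simulated retry after a worker crash
	fmt.Println(first, retry, svc.Charges) // pay_1 pay_1 1
}
```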

Non-retryable errors:
  HTTP 400 → non_retryable (retrying won't help)
  HTTP 503 → retryable with backoff (transient failure)

Event Bus Design (Kafka)

Topic: workflow_orchestration_temporal-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (user_id / order_id — preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "workflow_orchestration_temporal-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: workflow_orchestration_temporal-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates

Temporal Client SDK (Go example)

GO

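// Assumes imports "log" and "go.temporal.io/sdk/client", plus application-defined
// ctx, ProcessOrder, orderInput, reason, OrderStatus, and OrderResult.
// Check err after every call; most checks are elided here for brevity.
c, err := client.Dial(client.Options{}) // defaults to localhost:7233
if err != nil {
    log.Fatalln("unable to create Temporal client:", err)
}
defer c.Close()

// Start a workflow
we, err := c.ExecuteWorkflow(ctx, client.StartWorkflowOptions{
    ID:        "order-42",
    TaskQueue: "order-processing",
}, ProcessOrder, orderInput)

// Query workflow state (synchronous); an empty run ID targets the latest run
resp, err := c.QueryWorkflow(ctx, "order-42", "", "getStatus")
var status OrderStatus
err = resp.Get(&status)

// Signal a running workflow
err = c.SignalWorkflow(ctx, "order-42", "", "cancelOrder", reason)

// Wait for result
var result OrderResult
err = we.Get(ctx, &result)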

Temporal Server gRPC API

StartWorkflowExecution          → Start new workflow
SignalWorkflowExecution         → Send signal
QueryWorkflow                   → Query state
TerminateWorkflowExecution      → Force-terminate
RequestCancelWorkflowExecution  → Request graceful cancellation
ListWorkflowExecutions          → Search via visibility
GetWorkflowExecutionHistory     → Full event history
PollWorkflowTaskQueue           → Worker long-poll for workflow tasks
PollActivityTaskQueue           → Worker long-poll for activity tasks

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff

Cassandra (History Store)

SQL

-- Execution state
CREATE TABLE executions (
    namespace_id   UUID, workflow_id TEXT, run_id UUID,
    state          INT, next_event_id BIGINT,
    PRIMARY KEY ((namespace_id, workflow_id), run_id)
);

-- Event history (append-only, event-sourced)
CREATE TABLE history_events (
    namespace_id   UUID, workflow_id TEXT, run_id UUID,
    event_id       BIGINT, event_type INT, data BLOB,
    PRIMARY KEY ((namespace_id, workflow_id, run_id), event_id)
) WITH CLUSTERING ORDER BY (event_id ASC);

-- Pending tasks
CREATE TABLE tasks (
    namespace_id   UUID, task_queue TEXT, task_type INT,
    task_id        BIGINT, scheduled_time TIMESTAMP,
    PRIMARY KEY ((namespace_id, task_queue, task_type), task_id)
);

-- Timers
CREATE TABLE timers (
    namespace_id   UUID, fire_time TIMESTAMP,
    workflow_id TEXT, timer_id TEXT,
    PRIMARY KEY ((namespace_id), fire_time, workflow_id, timer_id)
) WITH CLUSTERING ORDER BY (fire_time ASC);

Elasticsearch (Visibility)

JSON

{
  "namespace": "production",
  "workflow_id": "order-42",
  "workflow_type": "ProcessOrder",
  "status": "Running",
  "start_time": "2026-03-14T10:00:00Z",
  "custom_attributes": {
    "customer_id": "cust-789",
    "order_total": 299.99
  }
}

Failure Handling

Concern	Solution
Workflow worker crash	Task times out → server re-dispatches → new worker replays and resumes
Activity worker crash	Task times out → server retries per retry policy
Server node crash	History shards rebalanced; all state in Cassandra (durable)
DB unavailable	Server rejects writes it cannot persist; clients and workers retry with backoff; no acknowledged state is lost
Long activity timeout	Heartbeat: worker sends heartbeat every N seconds
Poison pill workflow	Retry with backoff → max retries → fail → dead-letter
Versioning during deploy	getVersion() → old code for in-flight, new code for new

Temporal vs Airflow vs AWS Step Functions

| Dimension | Temporal | Airflow | Step Functions |
|---|---|---|---|
| Execution model | Durable functions | DAG scheduler | State machine (JSON) |
| Replay | Full event-sourced replay | Re-run whole task | Re-drive from failed state |
| Long-running | Timers up to years | Task timeout limits | Heartbeat timers |
| Saga/compensation | Built-in via code | Manual compensation | Catch + rollback |
| Developer model | Code (Go/Java/TS) | Python DAGs | JSON/YAML |

Temporal shines when:
  1. Flow has > 3 steps across services
  2. Steps need retry/compensation
  3. Flow spans minutes to months (timers, human approval)
  4. You need visibility into every step
  5. You want workflow logic in CODE, not YAML

Workflow Versioning for Zero-Downtime Deploys

Problem: 10,000 workflows running processOrder v1.
  Deploy v2 (added new step). How to handle in-flight v1 workflows?

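Solution: Temporal's getVersion(changeId, minSupported, maxSupported)
  function processOrder(order) {
    payment = executeActivity(ChargeCard, order)
    
    // DEFAULT_VERSION (-1) means "this history was recorded before the change existed"
    version = workflow.getVersion("add-fraud-check", DEFAULT_VERSION, 1)
    if (version >= 1) {
      executeActivity(FraudCheck, order)  // NEW step in v2
    }
    
    shipment = executeActivity(CreateShipment, order)
  }

  In-flight v1 workflows already past this point: getVersion returns DEFAULT_VERSION → skip FraudCheck
  New workflows (and in-flight ones that have not reached this point yet): getVersion records and returns 1 → run FraudCheck
  
  Once all pre-change workflows complete → raise minSupported to 1, then remove the version check in v3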

When to Use Temporal vs Other Approaches

Approach	Use When	Don't Use When
Temporal	Multi-step, long-running, retry/compensation	Simple request-response
Kafka + Consumers	Event-driven, fan-out, decoupled	Request-response or saga
Step Functions	AWS-native, < 25K/sec, simple DAGs	High throughput, complex logic
Choreography (events)	Loose coupling, simple flows	Complex compensation, visibility

SLOs & Error Budgets

Metric	Target	Rationale
Workflow start latency	< 100ms p99	User-facing flows start synchronously
Activity dispatch latency	< 50ms p99	Worker poll + schedule overhead
Workflow state durability	100% (zero loss)	Event log durability via Cassandra quorum writes
Visibility query latency	< 2s	Ops/debugging via Elasticsearch index

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
History shard overloaded (hot workflow type)	Workflow task schedule latency p99 > 500ms; shard CPU pegged	Increase shard count (requires migration); rate-limit workflow starts per namespace; scale history service nodes
Poison pill workflow infinite retry	Single workflow_id accumulating events; worker CPU on replay	Terminate workflow after max retries; dead-letter queue for manual review; fix activity code and redeploy with getVersion
Cassandra write latency spike	Workflow start errors increase; event append timeout	Server rejects writes it cannot persist; clients retry with backoff, so no acknowledged state is lost. Scale Cassandra cluster; check compaction backlog

Cost Drivers (Staff lens)

Event history storage: ~86 TB/day at 10K starts/sec (plan compaction + retention)
Worker fleet: workflow workers CPU-bound (replay); activity workers I/O-bound
Elasticsearch visibility index size

Multi-Region & DR

Namespace-per-region initially. Global workflows: accept higher latency for cross-region history writes or use region-local workflows with cross-region signals. DR: Cassandra multi-DC replication; replay workers in standby region.