This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 25 | Compare Snowflake (64-bit, time-ordered) vs UUID v4 (random) vs DB auto-increment. State k-sortable benefit for indexes, clock skew failure mode, and throughput per machine. |
| Arch 50 | Ticket server / DB segment allocation, ZooKeeper worker ID lease, and multi-DC ID schemes without central bottleneck. |
| Arch 75 | Staff: leap seconds and NTP step, epoch exhaustion, privacy (ID enumerability), and migration from auto-increment without dual-write hell. |
Interview Prompt
Design a unique ID generator that produces 64-bit numeric IDs at high throughput (thousands per second per core). IDs should be unique across distributed servers, roughly time-sortable for database indexing, and available with minimal latency.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Must IDs be numeric only, or are UUID strings acceptable? | Snowflake is 64-bit long; UUID is 128-bit string — affects storage and index locality. |
| Roughly sortable by creation time (k-sortable) or purely random? | K-sortable reduces B-tree page splits and enables time-range scans; random spreads write load on DB. |
| Global uniqueness across all DCs or per-region uniqueness? | Snowflake needs unique machine ID bits globally; regional prefixes simplify. |
| Can IDs be predictable (enumerable) or must they be opaque? | Security-sensitive systems avoid sequential public IDs — affects UUID vs Snowflake exposure. |
Scope
In scope
- High-throughput ID generation API
- Uniqueness guarantees under concurrency
- K-sortable / time-ordered property
- Multi-machine deployment
- Clock skew handling strategy
Out of scope (state explicitly)
- Full distributed transaction IDs across services (2PC)
- Human-readable IDs (Base62 short codes)
- Blockchain-style global consensus per ID
Assumptions
- Target 10K IDs/sec per process, 1M IDs/sec cluster-wide
- 64-bit integer returned to clients
- IDs live 50+ years — epoch bit width matters
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Generate globally unique IDs across a distributed system
- IDs must be 64-bit integers (fit in a long)
- IDs must be roughly sortable by time (IDs generated later are larger)
- Generate IDs with extremely high throughput (100K+ IDs/sec per node)
- No coordination between nodes required for ID generation
- IDs should not expose sensitive information (e.g., exact count of objects)
- Uniqueness: No two IDs are ever the same across all nodes and all time
- High Availability: ID generation must never block (no single point of failure)
- Low Latency: < 1 ms per ID generation (ideally sub-microsecond)
- Scalability: Support 1000+ nodes generating IDs concurrently
- Time-Ordered: IDs generated at time T1 < IDs generated at time T2 (within same node, guaranteed; across nodes, approximately)
- Compact: 64-bit (not UUIDs which are 128-bit)
| Metric | Calculation | Value |
|---|---|---|
| Total ID generation rate | Given | 10M IDs/sec (system-wide) |
| Number of nodes | Given | 1024 (10 bits) |
| IDs per node per sec | 10M / 1024 nodes | ~10K |
| Sequence bits (12) | Given | 4096 per millisecond per node |
| Epoch lifespan (41 bits) | Given | 2^41 ms ≈ 69.7 years |
| Storage per ID | Given | 8 bytes |
Approaches Compared
| Approach | Pros | Cons |
|---|---|---|
| UUID (v4) | No coordination, simple | 128 bits, not sortable, poor index performance |
| DB Auto-Increment | Simple, ordered | Single point of failure, not scalable |
| DB with Ranges | Scalable | Requires coordination to assign ranges |
| Twitter Snowflake ⭐ | 64-bit, sortable, no coordination, fast | Requires clock synchronization |
| MongoDB ObjectId | 96-bit, sortable | Larger than 64-bit |
Snowflake ID Structure (64 bits)
Architecture
Snowflake Engine (In-Process Library)
- Why in-process: Zero network latency. Each app server has the Snowflake library embedded
- How it works:
- Get current timestamp in milliseconds since custom epoch (e.g., 2020-01-01)
- If same millisecond as last ID → increment sequence counter
- If sequence overflows (> 4095) → wait until next millisecond
- If timestamp moved backward (clock skew) → reject or wait
- Combine:
(timestamp << 22) | (datacenter_id << 17) | (worker_id << 12) | sequence
ZooKeeper / etcd (Worker ID Assignment)
- Why: Need to assign unique
(datacenter_id, worker_id)pairs to each node - How:
- On startup, each node creates an ephemeral sequential znode in ZooKeeper
- The znode's sequence number becomes the worker ID
- If the node dies, the ephemeral znode is deleted → worker ID can be reused
- Alternative: Store worker ID mappings in a simple DB or config file (less dynamic but simpler)
NTP Synchronization
- Why: Snowflake depends on clock timestamps. Clocks must be reasonably synchronized
- How: All servers run NTP daemon synced to the same NTP server pool
- Tolerance: Clocks within 10ms of each other is acceptable (IDs are time-sorted within a node, approximately sorted across nodes)
Core API
// Core API — typically an in-process library call
long nextId()
// Decompose an ID back to its components (for debugging)
IdComponents parse(long id)
// Returns: {timestamp, datacenterId, workerId, sequence}
// Batch generation for efficiency
long[] nextIds(int count)Snowflake Implementation (Java-like pseudocode)
class SnowflakeIdGenerator {
private final long epoch = 1577836800000L; // 2020-01-01 UTC
private final long datacenterIdBits = 5L;
private final long workerIdBits = 5L;
private final long sequenceBits = 12L;
private final long maxDatacenterId = ~(-1L << datacenterIdBits); // 31
private final long maxWorkerId = ~(-1L << workerIdBits); // 31
private final long maxSequence = ~(-1L << sequenceBits); // 4095
private final long workerIdShift = sequenceBits; // 12
private final long datacenterIdShift = sequenceBits + workerIdBits; // 17
private final long timestampShift = sequenceBits + workerIdBits + datacenterIdBits; // 22
private final long datacenterId;
private final long workerId;
private long sequence = 0L;
private long lastTimestamp = -1L;
public synchronized long nextId() {
long timestamp = currentTimeMillis();
// Clock moved backwards — refuse to generate
if (timestamp < lastTimestamp) {
throw new RuntimeException(
"Clock moved backwards by " + (lastTimestamp - timestamp) + "ms");
}
if (timestamp == lastTimestamp) {
// Same millisecond — increment sequence
sequence = (sequence + 1) & maxSequence;
if (sequence == 0) {
// Sequence exhausted — wait for next millisecond
timestamp = waitNextMillis(lastTimestamp);
}
} else {
// New millisecond — reset sequence
sequence = 0L;
}
lastTimestamp = timestamp;
return ((timestamp - epoch) << timestampShift)
| (datacenterId << datacenterIdShift)
| (workerId << workerIdShift)
| sequence;
}
private long waitNextMillis(long lastTs) {
long ts = currentTimeMillis();
while (ts <= lastTs) {
ts = currentTimeMillis();
}
return ts;
}
}Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff
ID Bit Layout
| Field | Bits | Range | Purpose |
|---|---|---|---|
| Sign | 1 | 0 | Always positive |
| Timestamp | 41 | 0 – 2,199,023,255,551 ms (~69 years) | Time ordering |
| Datacenter ID | 5 | 0 – 31 | 32 data centers |
| Worker ID | 5 | 0 – 31 | 32 workers per DC |
| Sequence | 12 | 0 – 4095 | 4096 IDs per ms per worker |
Custom Epoch Choice
Standard Epoch: 1970-01-01 (Unix) Custom Epoch: 2020-01-01 00:00:00 UTC = 1577836800000 ms Why custom? Extends the 41-bit lifespan. Starting from 2020 gives us until ~2089. Starting from 1970 would only last until ~2039.
Worker ID Assignment Table (ZooKeeper / DB)
CREATE TABLE worker_assignments (
datacenter_id SMALLINT NOT NULL,
worker_id SMALLINT NOT NULL,
hostname VARCHAR(256),
assigned_at TIMESTAMP,
heartbeat_at TIMESTAMP,
PRIMARY KEY (datacenter_id, worker_id)
);General
| Issue | Solution |
|---|---|
| Single node failure | No impact on other nodes: each generates IDs independently |
| No SPOF | ID generation is decentralized; no central service needed at runtime |
| Worker ID reuse | ZooKeeper ephemeral nodes auto-cleanup; new instance gets a fresh ID |
Problem-Specific Fault Tolerance
1. Clock Backward Jump (NTP Adjustment)
- NTP can adjust clock backward (e.g., -5ms correction)
- If small (< 5ms): Wait until clock catches up to
lastTimestamp - If large (> 5ms): Throw exception, alert ops team, refuse to generate IDs until clock is fixed
- Prevention: Configure NTP for slew mode (gradually adjust, never jump)
2. Sequence Exhaustion (> 4096 IDs in 1ms)
- Solution: Spin-wait until the next millisecond (waitNextMillis)
- In practice, 4096/ms = 4M IDs/sec per node: rarely a bottleneck
- If needed: Use 13 sequence bits (8192/ms) by reducing worker bits
3. Worker ID Exhaustion
- 10 bits = 1024 unique workers (5 DC bits × 5 worker bits)
- If not enough: Reduce timestamp bits by 1 (still ~34 years) → use 11 bits for worker (2048)
- Or: Dynamic worker ID assignment with lease renewal
4. ZooKeeper Failure
- Worker IDs are assigned once on startup and cached locally
- ZooKeeper being down doesn't affect running generators
- Only affects new node registration
5. Duplicate IDs After Crash/Restart
- Node crashes, restarts within the same millisecond with sequence=0
- Solution: On startup, wait 1ms or persist lastTimestamp to disk and read on startup
Alternative Bit Allocations
| Variant | Timestamp | DC | Worker | Sequence | Trade-off |
|---|---|---|---|---|---|
| Standard Snowflake | 41 | 5 | 5 | 12 | Balanced |
| High Throughput | 39 | 4 | 5 | 15 | 32K IDs/ms, 17 year lifespan |
| Many Workers | 41 | 0 | 12 | 10 | 4096 workers, 1K IDs/ms |
| Long Lifespan | 43 | 5 | 5 | 10 | 278 years, 1K IDs/ms |
Comparison with Other Approaches
UUID v4
- 128 bits, truly random, no ordering
- B-tree index performance is poor (random inserts)
- Good when ordering doesn't matter and 128 bits is acceptable
ULID (Universally Unique Lexicographically Sortable Identifier)
- 128 bits: 48-bit timestamp + 80-bit randomness
- Lexicographically sortable (string comparison works)
- No worker coordination needed
- Collision probability: negligible for < 2^80 IDs per millisecond
Database Ticket Server (Flickr approach)
- Two MySQL servers, one generates odd IDs, one generates even
- Simple but limited scalability, requires network round-trip
- Use
auto_increment_increment=2andauto_increment_offset=1/2
ID as Database Primary Key
- Snowflake IDs are monotonically increasing within a node → excellent B-tree insert performance
- Across nodes, IDs are roughly time-ordered → much better than random UUIDs
- 64-bit fits in a
BIGINTcolumn → half the storage of UUID
Embedding Metadata in IDs
| Timestamp | Service Type (4 bits) | Shard ID (8 bits) | Sequence |
This allows routing a request to the correct shard just by looking at the ID: no separate lookup needed.
Monitoring
- Track IDs generated per second per node
- Alert if clock skew detected (lastTimestamp > currentTimestamp)
- Monitor sequence utilization (if consistently hitting 4095, need more capacity)
- Dashboard showing worker ID assignments across datacenters
Interview Walkthrough
- Start by comparing ID approaches — UUID v4, DB auto-increment, range allocation — and justify Snowflake for 64-bit, time-ordered, coordination-free IDs.
- Draw the 64-bit layout (timestamp, datacenter ID, worker ID, sequence) and show Back-of-the-Envelope Estimation: 12 sequence bits = 4096 IDs/ms per worker.
- Explain worker ID assignment via ZooKeeper/etcd coordination service — each machine gets a unique (datacenter, worker) pair at startup.
- Address clock synchronization: NTP drift, wait-if-clock-moves-backward logic, and why leap seconds matter for monotonicity.
- Discuss sortability benefits: B-tree index locality in databases, time-range queries without a separate created_at column.
- Cover failure modes: worker crash (ID reuse risk if not deregistered), datacenter failover (worker ID pool exhaustion).
- Common pitfall: using system clock without backward-clock handling — duplicate IDs after NTP correction can corrupt primary keys silently.
Why Snowflake Over UUID? The Critical Trade-off
UUID v4 (128-bit random):
ID: 550e8400-e29b-41d4-a716-446655440000
✓ Zero coordination (generate anywhere, anytime)
✓ Collision probability: ~10^-18 even at 1B IDs (effectively zero)
✗ 128 bits → 16 bytes per ID (vs 8 bytes for Snowflake)
✗ NOT sortable by time → terrible as DB primary key:
- Random UUIDs cause B-tree page splits on INSERT
- Index fragmentation → 5-10× worse write performance vs sequential IDs
- No natural ordering for "get recent records"
✗ 36-character string representation (ugly in URLs, logs)
Snowflake (64-bit, time-ordered):
ID: 1541815603606036480
✓ 64 bits → 8 bytes → fits in BIGINT column (half UUID's storage)
✓ Time-sortable → EXCELLENT B-tree performance (sequential inserts)
✓ Embeds timestamp → "when was this created?" from ID alone
✓ Compact: 19-digit decimal, 11-char Base62
✗ Requires worker ID coordination (ZooKeeper or similar)
✗ Clock-dependent (must handle clock skew)
✗ 4096 IDs/ms/worker limit (enough for most, but a ceiling)
Benchmark (PostgreSQL, 1M INSERTs):
UUID v4 as PK: ~45 seconds (random inserts, page splits)
Snowflake as PK: ~12 seconds (sequential inserts, no splits)
Auto-increment: ~10 seconds (perfectly sequential, but centralized)
Use Snowflake when: Performance matters, time-ordering is useful,
and you can coordinate worker IDs
Use UUID when: Simplicity > performance, or in serverless/stateless
environments where worker ID coordination is impracticalClock Skew: The Hardest Problem in Snowflake
Problem:
Snowflake depends on wall clock for the timestamp component.
If the clock jumps BACKWARD (NTP correction, VM migration, leap second):
T=1000: Generate ID with timestamp 1000
T=1001: Clock jumps back to 999
T=999: Generate ID with timestamp 999 → DUPLICATE possible!
Also: IDs are now OUT OF ORDER (999 < 1000)
Solutions:
1. Wait/Block ⭐ (Twitter's original approach):
if currentTimestamp < lastTimestamp:
sleep(lastTimestamp - currentTimestamp) // wait for clock to catch up
Pro: Simple, guaranteed correctness
Con: Blocks ID generation during skew (could be seconds)
2. Throw Exception:
if currentTimestamp < lastTimestamp:
throw ClockMovedBackwardException()
Pro: Caller decides how to handle
Con: Caller must implement retry logic
3. Logical Clock Fallback:
if currentTimestamp < lastTimestamp:
use lastTimestamp, increment sequence
Pro: No blocking, no exceptions
Con: Timestamp in ID is slightly wrong (acceptable for ordering purposes)
4. Hybrid Logical Clock (HLC):
Combine wall clock with a logical counter
When clock moves backward, increment the logical counter instead
HLC provides causal ordering even with clock skew
Used by CockroachDB, TiDB
Prevention:
- Use hardware clock sources (GPS, PTP) instead of NTP
- Monitor clock drift with alerting (alert if drift > 100ms)
- AWS: use ClockBound service for bounded clock uncertaintyWorker ID Assignment: Coordination Strategies
Option 1: ZooKeeper / etcd (Persistent Coordination) Each worker acquires a unique sequential ID from ZK ✓ Guaranteed unique across restarts ✗ Dependency on ZK (SPOF if ZK cluster fails) ✗ Operational complexity of maintaining ZK Used by: Twitter Snowflake (original) Option 2: Database Sequence Each worker claims an ID from a DB table at startup ✓ Simple, uses existing infrastructure ✗ DB is a bottleneck at startup (many workers starting at once) Option 3: MAC Address / IP Hash worker_id = hash(mac_address) % max_workers ✓ No external coordination ✗ MAC collisions possible (VMs, containers) ✗ Worker ID changes if machine moves Option 4: Kubernetes Pod Ordinal In StatefulSet: pod-0, pod-1, pod-2 → worker_id = ordinal ✓ Natural for K8s environments ✓ No external service needed ✗ Only works with StatefulSets Used by: Many modern microservice deployments Option 5: Pre-assigned Range (Instagram approach) Each shard has a fixed worker_id embedded in its configuration ✓ No runtime coordination ✗ Static, doesn't scale dynamically
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1 — DB auto-increment
Single MySQL SEQUENCE or BIGSERIAL. Simple, strictly increasing, single point of failure and max ~few K/sec.
Key components: MySQL AUTO_INCREMENT · Single primary
Move to next phase when: Write QPS exceeds DB allocator or need horizontal generators
Phase 2 — Snowflake per service
Embedded generator library, machine ID from ZooKeeper lease, NTP discipline on hosts. HTTP/gRPC sidecar optional for non-JVM languages.
Key components: Snowflake lib · ZooKeeper leases · NTP monitoring
Move to next phase when: Multi-region DCs need IDs without cross-DC coordination latency
Phase 3 — Federated epochs (global scale)
Region prefix in high bits (e.g., 3-bit DC + 38-bit time + 10-bit worker + 12-bit seq). Separate epoch rollover runbook. UUID v7 export for external APIs while internal stays numeric.
Key components: Region bits · Epoch migration · Dual-format API
Move to next phase when: 41-bit ms epoch nears end or privacy requires non-enumerable public IDs
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| ID generation availability | 99.99% | Blocks all writes if generator down |
| Generation p99 latency | < 1ms | In-process should be microseconds; budget includes lease check |
| Duplicate ID rate | 0 | Any duplicate is Sev-1 data corruption |
| Clock skew alarm | < 100ms offset | Prevent backward-step duplicates |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| ZooKeeper partition — workers can't renew machine ID lease | Generator refuses new IDs; lease expired logs | Continue with cached machine ID until TTL safety window ends; extend lease TTL; fallback to pre-assigned static IDs for emergency read-only mode |
| Leap second / NTP slew causes duplicate IDs | Unique constraint violations spike in downstream DB | Halt generators cluster-wide; advance logical clock offset; backfill affected rows; enable DB unique constraint as last line of defense |
| Sequence exhaustion in one ms (traffic spike > 4096/ms) | Generator throws 'sequence overflow'; latency errors | Wait 1ms boundary; add generator instances (more machine IDs); widen sequence bits in v2 format migration |
Cost Drivers (Staff lens)
- ZooKeeper/etcd cluster for leases — small but always-on
- Downstream DB index size: 64-bit vs 128-bit UUID × row count
- Operational cost of NTP/chrony discipline on every generator host
Multi-Region & DR
Embed region ID in high bits so global uniqueness needs no cross-DC RPC per ID. Accept total ordering only within region; cross-region sort by (region, time, seq). Don't run single global Snowflake coordinator — latency and SPoF.