Interview Prompt
Design a Feature Flag System.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Flag evaluation engine, Percentage rollouts, User segmentation? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Flag evaluation engine
- Percentage rollouts
- User segmentation
- Kill switches
- Flag dependency graph
- Audit trail
Out of scope (state explicitly)
- Detailed frontend/UI pixel implementation
- Org structure, staffing, and hiring plan
Assumptions
- Clarify scale (DAU, QPS, data volume) for the feature flag system in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
Functional Requirements
- Create, update, and delete feature flags (boolean, string, number, JSON variants)
- Target flags by user ID, user attributes (country, plan, device), percentage rollout
- Gradual rollout: 1% → 5% → 25% → 50% → 100% (with rollback at any point)
- A/B testing integration: assign users to experiment variants deterministically
- Kill switch: instantly disable a feature globally in < 5 seconds
- Flag dependencies: flag B requires flag A to be enabled
- Audit log: who changed what flag, when, and why
- SDK support: server-side (Java, Go, Python) and client-side (JS, iOS, Android)
- Environment separation: dev, staging, prod with independent flag states
- Scheduled flags: auto-enable at a specific time (launch events)
Non-Functional Requirements
- Ultra-Low Latency: Flag evaluation in < 1µs (local in-memory, no network call)
- High Availability: 99.999%. Flag evaluation must never fail; on any error it returns the default value
- Consistency: Flag change propagated to all servers within 10 seconds
- Scalability: 10K+ flags, 1B+ evaluations/day
- Resilience: SDK works offline / when flag service is down (cached state)
- Zero Performance Impact: No measurable overhead in the hot path
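The targeting and flag-dependency requirements listed above (flag B requires flag A) can be sketched as a purely in-memory evaluator. This is a minimal illustration, not a reference implementation: the flag layout (`enabled`, `requires`, `rules`, `default`) and the flag names are hypothetical, and the cycle guard is one possible way to keep a misconfigured dependency graph from recursing forever.

```python
from typing import Any, Dict, FrozenSet, Optional

# Hypothetical flag definitions; field names are assumptions for this sketch.
FLAGS: Dict[str, Dict[str, Any]] = {
    "new_cart": {
        "enabled": True, "requires": None, "default": False,
        "rules": [{"attribute": "country", "in": ["DE", "FR"], "value": True}],
    },
    "one_click_pay": {
        "enabled": True, "requires": "new_cart", "default": False,
        "rules": [{"attribute": "plan", "in": ["pro"], "value": True}],
    },
}


def evaluate(flag_key: str, user: Dict[str, Any],
             flags: Optional[Dict[str, Dict[str, Any]]] = None,
             _seen: FrozenSet[str] = frozenset()) -> Any:
    """Evaluate a flag locally: kill switch, then prerequisite, then rules."""
    flags = FLAGS if flags is None else flags
    flag = flags.get(flag_key)
    if flag is None or flag_key in _seen:
        return False  # unknown flag or dependency cycle: fail safe
    if not flag["enabled"]:
        return flag["default"]  # kill switch is off
    prereq = flag["requires"]
    if prereq and not evaluate(prereq, user, flags, _seen | {flag_key}):
        return flag["default"]  # prerequisite flag is not on for this user
    for rule in flag["rules"]:
        if user.get(rule["attribute"]) in rule["in"]:
            return rule["value"]  # first matching rule wins
    return flag["default"]
```

Because everything is a dictionary lookup, the evaluation stays off the network, which is what the latency requirement depends on.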
| Metric | Calculation | Value |
|---|---|---|
| Feature flags (total) | Given | 10,000 |
| Active flags | Given | 2,000 |
| Flag evaluations / sec | 1B evaluations/day ÷ 86,400 s (before peak factor) | ~12K/sec aggregate average |
| Flag changes / day | Given | ~50 |
| SDK instances | Given | 100,000 |
| Flag definition payload | Given | ~50 KB |
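| Streaming update bandwidth | 100K SDKs × 50 KB × 50 updates/day ÷ 86,400 s | ~2.9 MB/sec average uncompressed (~0.9 MB/sec with gzip at ~15 KB per snapshot) |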
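As a quick check of the estimates, the arithmetic can be reproduced from the stated inputs (1B evaluations/day, 100,000 SDK instances, ~50 KB payload, ~50 changes/day, ~10K SSE connections per relay). The variable names are illustrative, the bandwidth figure assumes every change pushes an uncompressed full snapshot to every SDK, and decimal units (1 GB = 1,000,000 KB) are assumed.

```python
import math

SECONDS_PER_DAY = 86_400

evals_per_day = 1_000_000_000       # from the scalability requirement
sdk_instances = 100_000             # given
payload_kb = 50                     # full flag-definition snapshot
changes_per_day = 50                # given
sse_connections_per_relay = 10_000  # relay capacity assumed in this design

# Average aggregate evaluation rate, before any peak factor.
avg_evals_per_sec = evals_per_day / SECONDS_PER_DAY

# Full-snapshot fan-out: every change is pushed to every SDK instance.
snapshot_gb_per_day = sdk_instances * payload_kb * changes_per_day / 1_000_000
avg_push_mb_per_sec = snapshot_gb_per_day * 1_000 / SECONDS_PER_DAY

# Minimum relay count needed to hold all SSE connections (no redundancy).
relays_needed = math.ceil(sdk_instances / sse_connections_per_relay)

print(f"{avg_evals_per_sec:,.0f} evals/sec average")
print(f"{snapshot_gb_per_day:.0f} GB/day, {avg_push_mb_per_sec:.2f} MB/sec average")
print(f"{relays_needed} relays minimum")
```

These are averages; pushes are bursty (one burst per flag change), so the instantaneous egress right after a change is far higher than the daily mean.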
Flag Relay Service
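Why a dedicated relay? If 100K SDK instances each polled the database directly, even once per second, that is 100K queries/sec and the DB dies. The relay multiplexes instead: each relay consumes the Kafka change stream once and fans every update out over its SSE connections, so 100K pushes originate from a handful of consumers.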
Scaling: 1 relay handles ~10K SSE connections. 10 relays per region.
Regional deployment: us-east, eu-west, ap-south → low latency push.
Deterministic Percentage Rollout (Key Algorithm)
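bucket = murmurhash3(flag_key + ":" + user_id) % 100
rollout_percent = 25 → bucket < 25 → ON
Monotonic increase: 25% → 50%: buckets 0-24 STILL ON, buckets 25-49 NOW ON
Nobody loses access; users only gain it.
Including flag_key in the hash input gives each flag its own bucketing, so the same users are not always first in every rollout.
Why MurmurHash3: fast (~5ns), uniform distribution, deterministic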
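A minimal sketch of the bucketing rule above. Python's standard library has no MurmurHash3, so the sketch includes a small pure-Python version of the 32-bit x86 variant; production SDKs would normally use a vetted library instead. The function names `bucket` and `is_enabled` and the flag key in the usage example are illustrative assumptions.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python MurmurHash3 (x86, 32-bit variant)."""
    c1, c2, mask = 0xCC9E2D51, 0x1B873593, 0xFFFFFFFF
    h = seed & mask
    n = len(data)
    body_len = n - (n % 4)

    def mix(k: int) -> int:
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask  # rotate left 15
        return (k * c2) & mask

    for i in range(0, body_len, 4):
        h ^= mix(int.from_bytes(data[i:i + 4], "little"))
        h = ((h << 13) | (h >> 19)) & mask  # rotate left 13
        h = (h * 5 + 0xE6546B64) & mask

    tail = data[body_len:]
    if tail:
        h ^= mix(int.from_bytes(tail, "little"))

    # Finalization (avalanche) step.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def bucket(flag_key: str, user_id: str) -> int:
    """Stable bucket in [0, 100) for a (flag, user) pair."""
    return murmur3_32(f"{flag_key}:{user_id}".encode("utf-8")) % 100


def is_enabled(flag_key: str, user_id: str, rollout_percent: int) -> bool:
    """A user is ON when their bucket falls below the rollout percentage."""
    return bucket(flag_key, user_id) < rollout_percent
```

Because a user's bucket never changes, raising `rollout_percent` from 25 to 50 can only add users: everyone who was on at 25% has a bucket below 25 and therefore also below 50.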
Multi-Variant Experiments
Flag: "checkout_layout"
Variant A: "single_page" (33%), Variant B: "multi_step" (33%), Variant C: "wizard" (34%)
Mutual exclusion via experiment layers:
- Layer 1 (checkout): hash1(user_id) % 100 < 50
- Layer 2 (pricing): hash2(user_id) % 100 >= 50
Different hash seed per layer ensures non-correlation
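The variant split and the per-layer seeds can be sketched as follows. To keep the example short, SHA-256 from the standard library stands in for a seeded MurmurHash3; that substitution, the layer seed strings, and the helper names are assumptions made for illustration only.

```python
import hashlib
from typing import List, Tuple

# Variant weights from the example above; they must sum to 100.
VARIANTS: List[Tuple[str, int]] = [
    ("single_page", 33), ("multi_step", 33), ("wizard", 34),
]


def layer_bucket(layer_seed: str, user_id: str) -> int:
    """Bucket in [0, 100); a different seed yields an independent bucketing."""
    digest = hashlib.sha256(f"{layer_seed}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def assign_variant(flag_key: str, user_id: str,
                   variants: List[Tuple[str, int]] = VARIANTS) -> str:
    """Walk the cumulative weights and return the variant owning the bucket."""
    b = layer_bucket(flag_key, user_id)
    upper = 0
    for name, weight in variants:
        upper += weight
        if b < upper:
            return name
    raise ValueError("variant weights must sum to 100")


def in_layer_slice(layer_seed: str, user_id: str, low: int, high: int) -> bool:
    """True when the user's bucket for this layer lies in [low, high)."""
    return low <= layer_bucket(layer_seed, user_id) < high


# Hypothetical layers: each uses its own seed, so membership is uncorrelated.
user = "user-42"
in_checkout = in_layer_slice("layer:checkout", user, 0, 50)
in_pricing = in_layer_slice("layer:pricing", user, 50, 100)
```

Experiments that must never overlap would share one layer and take disjoint bucket ranges within it; experiments in different layers overlap freely but independently.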
Event Bus Design (Kafka)
Topic: feature_flag_system-events
Partitions: 64 (scale consumers horizontally)
Partition key: entity_id (for this system, the flag key; preserves per-flag ordering of changes)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "feature_flag_system-processors"
- At-least-once delivery + idempotent handlers (dedup by event_id)
- DLQ topic: feature_flag_system-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
Async side effects MUST NOT block the synchronous API response.
Sync path: validate → persist source of truth → publish event → return 201
Async path: consumers update caches, indexes, notifications, aggregates
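The consumer-side rules above (at-least-once delivery, dedup by event_id, dead-lettering poison messages) can be sketched without any Kafka client. This is a simplified, in-memory illustration: the class name is hypothetical, the seen-set would need to be a durable store in practice, and "3 retries" is interpreted here as three failed attempts in total.

```python
from typing import Any, Callable, Dict, List, Set

MAX_ATTEMPTS = 3  # assumption: three failed attempts send the event to the DLQ


class IdempotentProcessor:
    """Wraps a handler so redelivered events are applied at most once."""

    def __init__(self, handler: Callable[[Dict[str, Any]], None]) -> None:
        self.handler = handler
        self.seen: Set[str] = set()          # in production: durable, with a TTL
        self.dlq: List[Dict[str, Any]] = []  # stand-in for the DLQ topic

    def process(self, event: Dict[str, Any]) -> str:
        event_id = event["event_id"]
        if event_id in self.seen:
            return "duplicate"  # at-least-once redelivery, safely ignored
        for _ in range(MAX_ATTEMPTS):
            try:
                self.handler(event)
            except Exception:
                continue  # retry the same event
            self.seen.add(event_id)
            return "processed"
        self.dlq.append(event)  # poison message: park it and move on
        return "dead_lettered"
```

Recording the event_id only after the handler succeeds means a crash mid-handler leads to a retry rather than a lost update, which is why the handler itself should also be safe to repeat.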
POST /api/flags → Create flag with targeting rules
GET /api/flags → List all flags (paginated)
GET /api/flags/{key} → Flag details with all environments
PUT /api/flags/{key} → Update flag (creates audit entry)
DELETE /api/flags/{key} → Archive flag (soft delete)
POST /api/flags/{key}/toggle → Kill switch enable/disable
GET /api/flags/{key}/audit → Audit log with diffs
POST /api/flags/{key}/schedule → Schedule future enable/disable
# SDK endpoints
GET /api/sdk/flags?env=prod → Full flag definitions
GET /api/sdk/stream?env=prod → SSE stream of changes
POST /api/sdk/evaluate → Server-side evaluation for client SDKs
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
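A client-side sketch of the retry guidance for 429, 500, and 503. The `send` callable is a hypothetical stand-in for one HTTP request that returns `(status, headers, body)` and is assumed to reuse the same idempotency key on every attempt; `Retry-After` is assumed to carry a number of seconds rather than an HTTP date.

```python
import random
import time
from typing import Callable, Mapping, Tuple

RETRYABLE = {429, 500, 503}
Response = Tuple[int, Mapping[str, str], str]


def call_with_retry(send: Callable[[], Response],
                    max_attempts: int = 5,
                    base: float = 0.5,
                    cap: float = 30.0,
                    sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Retry transient failures with jittered exponential backoff."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status not in RETRYABLE or attempt == max_attempts - 1:
            return status, body
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)  # the server told us how long to wait
        else:
            # Exponential backoff capped at `cap`, with jitter to avoid
            # synchronized retries from many clients.
            delay = min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)
        sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

Non-retryable statuses such as 400 or 409 are returned immediately so the caller can fix the request or re-read the current version.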
PostgreSQL: Source of Truth
Why PostgreSQL? ACID for metadata, JSONB for flexible rules, reliable for low-write workload (~50 writes/day).
CREATE TABLE feature_flags (
    flag_id     UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    flag_key    TEXT UNIQUE NOT NULL,
    name        TEXT NOT NULL,
    description TEXT,
    flag_type   TEXT NOT NULL CHECK (flag_type IN ('boolean','string','number','json')),
    created_by  UUID,
    created_at  TIMESTAMPTZ DEFAULT NOW(),
    archived    BOOLEAN DEFAULT FALSE          -- soft delete
);
CREATE TABLE flag_environments (
    flag_id         UUID REFERENCES feature_flags(flag_id),
    environment     TEXT NOT NULL,
    enabled         BOOLEAN DEFAULT FALSE,
    default_variant JSONB,
    fallthrough     JSONB,
    rules           JSONB,
    version         BIGINT DEFAULT 1,          -- optimistic-locking counter
    updated_at      TIMESTAMPTZ DEFAULT NOW(),
    updated_by      UUID,
    PRIMARY KEY (flag_id, environment)
);
CREATE TABLE flag_audit_log (
    audit_id    UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    flag_id     UUID NOT NULL,
    flag_key    TEXT NOT NULL,
    environment TEXT,
    action      TEXT NOT NULL,
    old_value   JSONB,
    new_value   JSONB,
    changed_by  UUID NOT NULL,
    reason      TEXT,
    changed_at  TIMESTAMPTZ DEFAULT NOW()
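);
Redis Cache
SET flags:production '{...all flags...}' EX 300
SET flag_version:production 42
SDK In-Memory (Atomic Pointer Swap)
import java.util.Map;

class FlagStore {
    // Immutable snapshot: readers see the old map or the new one, never a partial update.
    volatile Map<String, FlagDef> flags = Map.of();  // empty until the first snapshot arrives
    FlagDef get(String key) { return flags.get(key); }  // lock-free read on the hot path
    void update(Map<String, FlagDef> n) { this.flags = Map.copyOf(n); }  // atomic reference swap
}

| Technique | Application |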
|---|---|
| SDK disk cache | Survives restart; fallback when relay unreachable |
| Default values | Flag not found → hardcoded default |
| Streaming + polling | SSE primary, 30s poll backup, disk cache last resort |
| PostgreSQL replicas | Read replicas for API reads |
| Kafka RF=3 | Change events survive broker failure |
| Relay redundancy | Multiple instances per region; SDK reconnects on failure |
SDK Resilience
Startup: disk cache → relay fetch → defaults. Runtime: streaming → polling → cache. Evaluation NEVER makes a network call.
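The startup chain (disk cache → relay fetch → defaults) might look like the sketch below. The class name, the JSON cache file, and the `fetch_from_relay` callable are hypothetical; a real SDK would also start the streaming and polling loops, which are omitted here.

```python
import json
import os
from typing import Any, Callable, Dict


class FlagClient:
    """Loads flags from disk, then tries the relay; evaluation is memory-only."""

    def __init__(self, cache_path: str,
                 fetch_from_relay: Callable[[], Dict[str, Any]],
                 defaults: Dict[str, Any]) -> None:
        self._cache_path = cache_path
        self._defaults = defaults
        self._flags = self._load_disk_cache()  # step 1: last known state
        try:
            self.apply_snapshot(fetch_from_relay())  # step 2: fresh state
        except Exception:
            pass  # relay unreachable: keep the cached (or empty) state

    def _load_disk_cache(self) -> Dict[str, Any]:
        try:
            with open(self._cache_path, encoding="utf-8") as f:
                return json.load(f)
        except (OSError, ValueError):
            return {}  # missing or corrupt cache: fall through to defaults

    def apply_snapshot(self, snapshot: Dict[str, Any]) -> None:
        self._flags = dict(snapshot)  # swap the in-memory reference
        tmp = self._cache_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(snapshot, f)
        os.replace(tmp, self._cache_path)  # atomic rename, no torn cache file

    def variation(self, key: str) -> Any:
        # Step 3: a hardcoded default when the flag is unknown. No network call.
        return self._flags.get(key, self._defaults.get(key))
```

Writing the cache through a temporary file and `os.replace` means a crash during the write leaves the previous cache intact.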
Concurrent Flag Updates
Optimistic locking: UPDATE ... WHERE version = $expected. Conflict → 409, UI shows diff.
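A runnable sketch of the version check, using SQLite from the standard library in place of PostgreSQL. The table is deliberately simplified (keyed by `flag_key` only) and the helper names are hypothetical; the point is the `WHERE version = ?` guard and the row count.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """In-memory stand-in for the flag_environments table (simplified)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE flag_environments ("
        " flag_key TEXT PRIMARY KEY, enabled INTEGER, version INTEGER)"
    )
    conn.execute("INSERT INTO flag_environments VALUES ('new_cart', 0, 1)")
    conn.commit()
    return conn


def update_flag(conn: sqlite3.Connection, flag_key: str,
                enabled: bool, expected_version: int) -> bool:
    """Apply the update only if nobody else changed the row since we read it."""
    cur = conn.execute(
        "UPDATE flag_environments "
        "SET enabled = ?, version = version + 1 "
        "WHERE flag_key = ? AND version = ?",
        (int(enabled), flag_key, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows touched means a conflict (HTTP 409)
```

Two admins who both read version 1 illustrate the behavior: the first `update_flag(conn, "new_cart", True, 1)` succeeds and bumps the version to 2, while the second call with the stale version touches no rows and returns False.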
Relay Failure
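SDKs detect the disconnect → reconnect to another relay → send last_seen_version → receive whatever changed since then (with full-snapshot push, simply the latest snapshot if the version has advanced). If ALL relays are down → poll the API. If the API is down → keep serving cached flags.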
Interview Walkthrough
- Separate the control plane (flag CRUD API, PostgreSQL, audit log) from the evaluation path (relay fan-out plus in-process SDK); interviewers expect this split on day one.
- Deterministic percentage rollout: murmurhash3(flag_key + ":" + user_id) % 100. Monotonic increase means users never lose access when rollout expands from 25% to 50%.
- SDK evaluates flags locally with zero network calls on the hot path. Relay/SSE pushes updates; disk cache → relay → defaults is the fallback chain.
- Flag relay multiplexes 100K SDK connections through Kafka consumers — direct DB polling from every SDK instance kills the database.
- Experiment layers with independent hash seeds prevent correlated flag assignments across overlapping experiments.
- Optimistic locking on flag updates (WHERE version = $expected) prevents silent overwrites during concurrent admin edits.
- Common pitfall: random per-request rollout. Users flip between variants on every page load, breaking experiments and eroding trust in gradual releases.
Full Snapshot Push vs Delta Updates
Full Snapshot (this design, LaunchDarkly approach):
- Every update: relay sends complete set of all flag definitions (~50 KB)
- SDK replaces entire in-memory map atomically
- ✓ Always consistent
- ✓ Recovery is trivial
- ✗ Bandwidth: 100K SDKs × 50 KB × 50 updates/day = 250 GB/day
Delta Updates:
- Only send the changed flag definition (~1 KB)
- ✓ 50× less bandwidth
- ✗ Ordering matters: missed delta → state diverges
- ✗ Need sequence numbers and gap detection
Recommendation: Full snapshot (simplicity + consistency > bandwidth). 50 KB compressed to ~15 KB with gzip → 75 GB/day → trivial at scale.
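To make the cost of the delta option concrete, here is a sketch of the sequence-number and gap-detection logic it would require. The class name, the delta shape (`version`, `key`, `value`), and the `fetch_snapshot` callable are assumptions for illustration; the full-snapshot design avoids all of this by always replacing the whole map.

```python
from typing import Any, Callable, Dict, Tuple

Snapshot = Tuple[int, Dict[str, Any]]  # (version, flags)


class DeltaApplier:
    """Applies in-order deltas and falls back to a snapshot when one is missed."""

    def __init__(self, fetch_snapshot: Callable[[], Snapshot]) -> None:
        self.fetch_snapshot = fetch_snapshot
        self.version, self.flags = fetch_snapshot()

    def on_delta(self, delta: Dict[str, Any]) -> str:
        if delta["version"] <= self.version:
            return "ignored"  # duplicate or out-of-date delivery
        if delta["version"] != self.version + 1:
            # Gap detected: at least one delta was lost, so local state
            # cannot be trusted. Resynchronize from a full snapshot.
            self.version, self.flags = self.fetch_snapshot()
            return "resynced"
        # Build a new map rather than mutating, so readers never see a
        # half-applied change.
        self.flags = {**self.flags, delta["key"]: delta["value"]}
        self.version = delta["version"]
        return "applied"
```

Note that even the delta design needs a snapshot endpoint for recovery, which is part of why sending snapshots all the time is the simpler choice at this payload size.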
In-Process SDK vs Remote Evaluation API
In-Process SDK (recommended):
- Latency: ~100 ns, Availability: 100%, works offline
- ✗ Flag definitions exposed, SDK must be updated
Remote Evaluation API:
- Latency: 5-20 ms, Availability: depends on service
- ✓ Secure, no SDK per language
- ✗ Network failure → flags stop working
Production pattern:
- Server-side: in-process SDK → ~0 ms latency
- Client-side: server evaluates on behalf of client
Deterministic Hash vs Server-Assigned Cohort
Deterministic Hash (MurmurHash3):
- ✓ Stateless, no storage, monotonic rollout
- ✗ Can't manually override specific users
Server-Assigned Cohort:
- ✓ Full control, stable assignment
- ✗ Requires DB lookup, storage expensive
Recommendation: Deterministic hash for most flags. Server-assigned only for formal A/B experiments.
Consistency Model: Bounded Eventual Consistency
Timeline of a flag change:
- T+0s: Admin clicks "Enable flag"
- T+0.1s: API writes to PostgreSQL
- T+0.2s: API publishes to Kafka
- T+0.5s: Relay Service consumes from Kafka
- T+1s: Relay pushes update via SSE
- T+5s: All healthy SDKs have received update
- T+10s: Reconnecting SDKs have received update (poll fallback)
CAP analysis: AP system by design. Network partition → SDKs serve stale cached flags (availability over consistency). Flags are NOT a source of truth for critical business logic.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
A monolith or a minimal set of services that proves the core feature flag flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
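An availability target becomes easier to reason about once it is converted into an error budget. The helper below is a small worked calculation, assuming a 30-day window; the function name is illustrative.

```python
def error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    total_minutes = days * 24 * 60
    return (1 - slo_percent / 100) * total_minutes


# 99.95% over 30 days leaves roughly 21.6 minutes of downtime.
control_plane_budget = error_budget_minutes(99.95)

# 99.999% over 30 days leaves roughly 0.43 minutes (about 26 seconds).
evaluation_budget = error_budget_minutes(99.999)
```

The gap between the two budgets is one way to argue for local, in-memory evaluation: a budget of seconds per month is not realistic for anything that depends on a network call.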
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.