Design a Feature Flag System

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Show domain depth beyond the baseline: async pipelines, consistency semantics, and operational SLOs.
Arch 75	Staff angles: partition behavior, cost drivers, and MVP → production evolution with clear triggers.

Interview Prompt

Design a Feature Flag System.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Flag evaluation engine, Percentage rollouts, User segmentation?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Flag evaluation engine
Percentage rollouts
User segmentation
Kill switches
Flag dependency graph
Audit trail

Out of scope (state explicitly)

Detailed frontend/UI pixel implementation
Org structure, staffing, and hiring plan

Assumptions

Clarify scale (DAU, QPS, data volume) for the feature flag system in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Functional Requirements

Create, update, and delete feature flags (boolean, string, number, JSON variants)
Target flags by user ID, user attributes (country, plan, device), percentage rollout
Gradual rollout: 1% → 5% → 25% → 50% → 100% (with rollback at any point)
A/B testing integration: assign users to experiment variants deterministically
Kill switch: instantly disable a feature globally in < 5 seconds
Flag dependencies: flag B requires flag A to be enabled
Audit log: who changed what flag, when, and why
SDK support: server-side (Java, Go, Python) and client-side (JS, iOS, Android)
Environment separation: dev, staging, prod with independent flag states
Scheduled flags: auto-enable at a specific time (launch events)
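
The dependency requirement above (flag B requires flag A to be enabled) can be sketched as a recursive prerequisite check. This is a minimal illustration, not the design's actual evaluator: the class name `PrerequisiteChecker`, its two maps, and the choice to fail closed on a cycle are all assumptions made for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: names and structure are illustrative assumptions.
class PrerequisiteChecker {
    // flag key -> keys of flags that must be enabled first (e.g. B -> [A])
    private final Map<String, List<String>> prerequisites;
    // flag key -> whether the flag itself is switched on
    private final Map<String, Boolean> enabled;

    PrerequisiteChecker(Map<String, List<String>> prerequisites, Map<String, Boolean> enabled) {
        this.prerequisites = prerequisites;
        this.enabled = enabled;
    }

    // A flag is effectively on only if it and all transitive prerequisites are on.
    boolean isEffectivelyEnabled(String flagKey) {
        return check(flagKey, new HashSet<>());
    }

    private boolean check(String flagKey, Set<String> inProgress) {
        if (!inProgress.add(flagKey)) {
            // Cycle (A requires B requires A): fail closed instead of recursing forever.
            return false;
        }
        try {
            if (!enabled.getOrDefault(flagKey, false)) {
                return false;
            }
            for (String dep : prerequisites.getOrDefault(flagKey, List.of())) {
                if (!check(dep, inProgress)) {
                    return false;
                }
            }
            return true;
        } finally {
            // Leave the path set clean so sibling branches are checked independently.
            inProgress.remove(flagKey);
        }
    }
}
```

In practice, cycles are better rejected when a flag is saved than discovered at evaluation time; the guard here only keeps evaluation safe if one slips through.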

Capacity Estimates

Metric	Calculation	Value
Feature flags (total)	Given	10,000
Active flags	Given	2,000
Flag evaluations / sec	Derived from daily volume ÷ 86400 (+ peak factor)	~12K/sec per instance × 100 instances ≈ 1.2M/sec total
Flag changes / day	Given	~50
SDK instances	Given	100,000
Flag definition payload	Given	~50 KB
Streaming update bandwidth	Given	~1.7 MB/sec

PostgreSQL: Source of Truth

Why PostgreSQL? ACID for metadata, JSONB for flexible rules, reliable for low-write workload (~50 writes/day).

SQL

CREATE TABLE feature_flags (
    flag_id     UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    flag_key    TEXT UNIQUE NOT NULL,
    name        TEXT NOT NULL,
    description TEXT,
    flag_type   TEXT NOT NULL CHECK (flag_type IN ('boolean','string','number','json')),
    created_by  UUID,
    created_at  TIMESTAMPTZ DEFAULT NOW(),
    archived    BOOLEAN DEFAULT FALSE
);

CREATE TABLE flag_environments (
    flag_id         UUID REFERENCES feature_flags(flag_id),
    environment     TEXT NOT NULL,
    enabled         BOOLEAN DEFAULT FALSE,
    default_variant JSONB,
    fallthrough     JSONB,
    rules           JSONB,
    version         BIGINT DEFAULT 1,
    updated_at      TIMESTAMPTZ DEFAULT NOW(),
    updated_by      UUID,
    PRIMARY KEY (flag_id, environment)
);

CREATE TABLE flag_audit_log (
    audit_id   UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    flag_id    UUID NOT NULL,
    flag_key   TEXT NOT NULL,
    environment TEXT,
    action     TEXT NOT NULL,
    old_value  JSONB,
    new_value  JSONB,
    changed_by UUID NOT NULL,
    reason     TEXT,
    changed_at TIMESTAMPTZ DEFAULT NOW()
);

Redis Cache

SET flags:production '{...all flags...}' EX 300
SET flag_version:production 42

SDK In-Memory (Atomic Pointer Swap)

JAVA

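import java.util.Map;

class FlagStore {
    // volatile publishes each new immutable snapshot to all reader threads.
    private volatile Map<String, FlagDef> flags = Map.of();
    // Lock-free read; an unknown key returns null, so callers fall back to defaults.
    FlagDef get(String key) { return flags.get(key); }
    // Swap the whole map in one reference write; readers never see a partial update.
    void update(Map<String, FlagDef> n) { this.flags = Map.copyOf(n); }
}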

Technique	Application
SDK disk cache	Survives restart; fallback when relay unreachable
Default values	Flag not found → hardcoded default
Streaming + polling	SSE primary, 30s poll backup, disk cache last resort
PostgreSQL replicas	Read replicas for API reads
Kafka RF=3	Change events survive broker failure
Relay redundancy	Multiple instances per region; SDK reconnects on failure

SDK Resilience

Startup: disk cache → relay fetch → defaults. Runtime: streaming → polling → cache. Evaluation NEVER makes a network call.
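
The fallback chains above can be sketched as an ordered list of sources, where the first source that yields data wins and the hardcoded defaults are the last resort. This reads the arrows as a priority order, which is an interpretation of the text. `FlagBootstrap`, `load`, and the string-valued flag map are hypothetical names chosen for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch: the caller decides the order of sources.
class FlagBootstrap {
    // Try each source in priority order; fall back to hardcoded defaults if all fail.
    static Map<String, String> load(List<Supplier<Optional<Map<String, String>>>> sources,
                                    Map<String, String> hardcodedDefaults) {
        for (Supplier<Optional<Map<String, String>>> source : sources) {
            try {
                Optional<Map<String, String>> result = source.get();
                if (result.isPresent()) {
                    return Map.copyOf(result.get());
                }
            } catch (RuntimeException e) {
                // A failing source (relay unreachable, corrupt cache file) must not
                // abort startup; move on to the next source.
            }
        }
        return Map.copyOf(hardcodedDefaults);
    }
}
```

Because the result is loaded into memory before any evaluation happens, the evaluation path itself stays free of network calls, as the section requires.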

Concurrent Flag Updates

Optimistic locking: UPDATE ... WHERE version = $expected. Conflict → 409, UI shows diff.
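
As a rough in-memory analogue of `UPDATE ... WHERE version = $expected`, the sketch below applies a write only if the caller's expected version still matches and bumps the version on success. `VersionedFlagStore`, `Row`, and the integer status codes are assumptions for illustration; a real implementation would run the conditional UPDATE against PostgreSQL and check the affected row count.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: an in-memory stand-in for the flag_environments row.
class VersionedFlagStore {
    // Immutable snapshot of one row: the rules payload plus its version counter.
    static final class Row {
        final String rulesJson;
        final long version;
        Row(String rulesJson, long version) {
            this.rulesJson = rulesJson;
            this.version = version;
        }
    }

    private final ConcurrentMap<String, Row> rows = new ConcurrentHashMap<>();

    void create(String flagKey, String rulesJson) {
        rows.putIfAbsent(flagKey, new Row(rulesJson, 1));
    }

    Row read(String flagKey) {
        return rows.get(flagKey);
    }

    // Returns 200 when the write applies, 409 when another writer got there first,
    // and 404 when the flag does not exist.
    int update(String flagKey, String newRulesJson, long expectedVersion) {
        Row current = rows.get(flagKey);
        if (current == null) {
            return 404;
        }
        if (current.version != expectedVersion) {
            return 409;
        }
        Row next = new Row(newRulesJson, expectedVersion + 1);
        // replace(key, old, new) is atomic: it fails if the row changed since we read it.
        return rows.replace(flagKey, current, next) ? 200 : 409;
    }
}
```

Two admins who both loaded version 1 illustrate the conflict: the first save succeeds and moves the row to version 2, and the second save gets 409 and must reload before retrying.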

Relay Failure

SDKs detect disconnect → reconnect to another relay → send last_seen_version → receive delta. If ALL relays down → poll API. If API down → use cached flags.

Full Snapshot Push vs Delta Updates

Full Snapshot (this design, LaunchDarkly approach):
  Every update: relay sends complete set of all flag definitions (~50 KB)
  SDK replaces entire in-memory map atomically
  ✓ Always consistent
  ✓ Recovery is trivial
  ✗ Bandwidth: 100K SDKs × 50 KB × 50 updates/day = 250 GB/day

Delta Updates:
  Only send the changed flag definition (~1 KB)
  ✓ 50× less bandwidth
  ✗ Ordering matters: missed delta → state diverges
  ✗ Need sequence numbers and gap detection

Recommendation: Full snapshot (simplicity + consistency > bandwidth)
  50 KB compressed to ~15 KB with gzip → 75 GB/day → trivial at scale
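
To make the delta-side costs concrete, here is a minimal sketch of the sequence-number and gap-detection logic that delta updates would require. It is not part of the recommended design; `DeltaApplier`, `Result`, and the one-flag-per-delta shape are assumptions for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of what an SDK would need if it consumed deltas.
class DeltaApplier {
    enum Result { APPLIED, DUPLICATE, GAP }

    private final Map<String, String> flags = new HashMap<>();
    private long lastSeenVersion;

    DeltaApplier(Map<String, String> snapshot, long snapshotVersion) {
        resetToSnapshot(snapshot, snapshotVersion);
    }

    // Apply a delta only if it is exactly the next version in sequence.
    Result applyDelta(long version, String flagKey, String definition) {
        if (version <= lastSeenVersion) {
            return Result.DUPLICATE; // replayed or re-delivered message, safe to ignore
        }
        if (version != lastSeenVersion + 1) {
            return Result.GAP; // at least one delta was missed; state may have diverged
        }
        flags.put(flagKey, definition);
        lastSeenVersion = version;
        return Result.APPLIED;
    }

    // Recovery path after a GAP: discard local state and load a full snapshot.
    void resetToSnapshot(Map<String, String> snapshot, long snapshotVersion) {
        flags.clear();
        flags.putAll(snapshot);
        lastSeenVersion = snapshotVersion;
    }

    String get(String flagKey) { return flags.get(flagKey); }

    long lastSeenVersion() { return lastSeenVersion; }
}
```

Note that the recovery path is itself a full snapshot, which is one reason the snapshot-only approach is simpler: it is the code path a delta design needs anyway.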

In-Process SDK vs Remote Evaluation API

In-Process SDK (recommended):
  Latency: ~100 ns, Availability: 100%, works offline
  ✗ Flag definitions exposed, SDK must be updated

Remote Evaluation API:
  Latency: 5-20 ms, Availability: depends on service
  ✓ Secure, no SDK per language
  ✗ Network failure → flags stop working

Production pattern:
  Server-side: in-process SDK → ~100 ns per evaluation, no network hop
  Client-side: server evaluates on behalf of client

Deterministic Hash vs Server-Assigned Cohort

Deterministic Hash (MurmurHash3):
  ✓ Stateless, no storage, monotonic rollout
  ✗ Can't manually override specific users

Server-Assigned Cohort:
  ✓ Full control, stable assignment
  ✗ Requires DB lookup, storage expensive

Recommendation: Deterministic hash for most flags.
  Server-assigned only for formal A/B experiments.
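
A deterministic hash rollout can be sketched as follows: hash the flag key and user ID together, map the result to a bucket, and enable the flag when the bucket falls below the rollout threshold. The text names MurmurHash3, which is not in the JDK; this sketch substitutes SHA-256 from `java.security.MessageDigest` purely to stay self-contained. `PercentageRollout`, the 10,000-bucket granularity, and the `flagKey:userId` input format are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: SHA-256 stands in for MurmurHash3.
class PercentageRollout {
    static final int BUCKETS = 10_000; // 0.01% granularity

    // Maps (flagKey, userId) to a stable bucket in [0, BUCKETS).
    static int bucket(String flagKey, String userId) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] h = md.digest((flagKey + ":" + userId).getBytes(StandardCharsets.UTF_8));
            // Interpret the first 4 bytes as an unsigned 32-bit integer.
            long value = ((h[0] & 0xFFL) << 24) | ((h[1] & 0xFFL) << 16)
                       | ((h[2] & 0xFFL) << 8) | (h[3] & 0xFFL);
            return (int) (value % BUCKETS);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // A user is in the rollout when their bucket is below the threshold.
    static boolean isEnabled(String flagKey, String userId, double rolloutPercent) {
        return bucket(flagKey, userId) < rolloutPercent * BUCKETS / 100.0;
    }
}
```

Because a user's bucket never changes, raising the percentage only adds users: anyone enabled at 5% stays enabled at 25%. Including the flag key in the hash input keeps the same users from always landing in the early cohort of every flag.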

Consistency Model: Bounded Eventual Consistency

Timeline of a flag change:
  T+0s:     Admin clicks "Enable flag"
  T+0.1s:   API writes to PostgreSQL
  T+0.2s:   API publishes to Kafka
  T+0.5s:   Relay Service consumes from Kafka
  T+1s:     Relay pushes update via SSE
  T+5s:     All healthy SDKs have received update
  T+10s:    Reconnecting SDKs have received update (poll fallback)

CAP analysis: AP system by design.
  Network partition → SDKs serve stale cached flags (availability over consistency)
  Flags are NOT a source of truth for critical business logic.

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
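
To connect an availability target to an error budget, the arithmetic can be sketched as below. The 30-day window is an assumption, and `ErrorBudget` is a hypothetical helper. For the 99.95% target in the table, the budget works out to about 21.6 minutes of downtime per 30 days.

```java
// Hypothetical helper: converts an availability target into allowed downtime.
class ErrorBudget {
    // Allowed downtime in minutes for a given availability target over a window of days.
    static double allowedDowntimeMinutes(double availabilityPercent, int windowDays) {
        double windowMinutes = windowDays * 24.0 * 60.0;
        return windowMinutes * (100.0 - availabilityPercent) / 100.0;
    }
}
```

By the same formula, 99.9% allows about 43.2 minutes and 99.99% about 4.3 minutes over 30 days, which is why each extra nine needs an explicit justification.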

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
