Design an ML Feature Store

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 75	Staff level: multi-region, cost at scale, migration path, and production metrics.

Interview Prompt

Design ML Feature Store.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Online vs offline store, Feature freshness, Point-in-time correctness?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Online vs offline store
Feature freshness
Point-in-time correctness
Feature serving latency
Capacity estimation with shown math

Out of scope (state explicitly)

GPU cluster training and hyperparameter tuning
Content moderation of recommended items
Ad auction / sponsored placement ranking

Assumptions

Clarify scale (DAU, QPS, data volume) for ml feature store in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Register feature definitions (name, type, entity, description, owner, SLA)
Ingest features from batch (Spark/Hive) and streaming (Flink/Kafka) sources
Serve features at low latency for online inference (< 5ms p99)
Serve features in batch for model training (point-in-time correct joins)
Feature versioning: track schema changes, backward compatibility
Feature sharing: discover and reuse features across teams
Point-in-time correctness: no data leakage
Feature monitoring: drift detection, freshness alerts
Feature lineage: trace from raw data source → transformation → feature

Metric	Calculation	Value
Feature definitions	Given	10,000
Entities (users, items, etc.)	Given	1B
Features per entity	Given	50-200
Online feature retrievals / sec	Derived from daily volume ÷ 86400 (+ peak factor)	1M
Online store size	Given	2 TB (fits in Redis cluster)
Offline store size	Given	~100 TB (S3/Parquet)
Batch ingestion (daily)	Given	2 TB
Streaming ingestion	Given	100K events/sec → feature updates

Loading...

Training-Serving Skew: The #1 Problem

What it is: Features computed differently in training vs serving ? model performs worse

Example:
  Training: avg_purchase computed with pandas (Python float64)
  Serving: avg_purchase computed in Java (Java double, different rounding)
  Result: 0.1% difference → model accuracy degrades

Prevention:
  1. Single transformation definition: same code for batch AND streaming
  2. Validation job: compute features both ways, compare → alert on divergence
  3. Feature logging: log online features at serving time → use THESE for training

Point-in-Time Correctness

Wrong approach:
  Join training data (March 1 prediction) with current features (March 14)
  → Model sees "future" information → fails in production

Correct approach:
  For training example at time T:
    feature_value = latest(feature WHERE timestamp <= T)

Implementation:
  1. Store features as (entity_id, timestamp, value) triples
  2. Training join: AS OF JOIN on timestamp
  3. Delta Lake / Apache Iceberg support time-travel queries natively

Feature Freshness: Batch vs Streaming vs On-Demand

Batch (Spark, daily/hourly):
  user_age_bucket, user_avg_purchase_value_30d: recomputed daily
  ? Cheap
  ✗ Stale: feature value may be 24 hours old

Streaming (Flink, near real-time):
  user_clicks_last_5min, user_cart_total: updated every event
  ✓ Fresh (seconds latency)
  ? More infrastructure, higher cost

On-demand (at inference time):
  "Does this user follow the seller of this product?"
  ? Always fresh
  ✗ Adds latency to inference pipeline

Prioritization:
  Historical purchase categories ? weekly batch
  Average order value last 30 days ? daily batch
  Recent search queries ? 5-minute streaming
  Current cart contents ? on-demand (< 100ms)

Event Bus Design (Kafka)

Topic: ml_feature_store-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (user_id / order_id — preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "ml_feature_store-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: ml_feature_store-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Design an ML Feature Store: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates

HTTP

# Feature definition
POST /api/features
{
  "name": "user_avg_purchase_30d",
  "entity": "user", "value_type": "FLOAT",
  "description": "Average purchase amount in last 30 days",
  "source": "transactions_table",
  "freshness_sla": "1h"
}

# Online serving
GET /api/features/online?entity=user&entity_id=123,456&features=avg_purchase_30d,click_rate_7d

# Training data generation (batch)
POST /api/features/historical
{
  "entity": "user",
  "entity_ids_with_timestamps": [
    {"entity_id": "123", "timestamp": "2026-03-01T10:00:00Z"}
  ],
  "features": ["avg_purchase_30d", "click_rate_7d"]
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff

Online Store (Redis)

HSET user:123
  avg_purchase_30d "45.20"
  click_rate_7d "0.032"
  last_login_hours "2.5"
  _updated_at "2026-03-14T10:05:00Z"
EXPIRE user:123 86400

Offline Store (S3 / Delta Lake / Parquet)

Path: s3://feature-store/user/avg_purchase_30d/
Partitioned by date: date=2026-03-14/part-00000.parquet
Schema: entity_id, timestamp, value, created_at

Feature Registry (PostgreSQL)

SQL

CREATE TABLE features (
    feature_id    UUID PRIMARY KEY,
    name          TEXT UNIQUE NOT NULL,
    entity_type   TEXT NOT NULL,
    value_type    TEXT NOT NULL,
    description   TEXT,
    transformation_code TEXT,
    freshness_sla INTERVAL,
    owner_team    TEXT,
    deprecated    BOOLEAN DEFAULT FALSE
);

Popular Feature Store Systems

System	Type	Key Trait
Feast	Open source	Most popular OSS, Redis/DynamoDB + S3/BigQuery
Tecton	SaaS	Founded by Uber Michelangelo team
Hopsworks	Open source	Feature pipeline as first-class citizen
Vertex AI	GCP managed	Integrated with Vertex AI
SageMaker	AWS managed	Integrated with SageMaker

When You Need a Feature Store

DON'T need: < 10 features, 1 model, 1 team → inline feature computation is fine. Batch-only ML → just use SQL views.

DO need: Real-time serving (< 10ms), multiple teams sharing features, training-serving skew issues, point-in-time correctness matters, streaming features.

Interview Walkthrough

Explain offline store (batch training on historical snapshots) vs online store (low-latency serving) duality.
Cover point-in-time correct joins — features must reflect state at prediction time, not latest value.
Discuss feature versioning and lineage for reproducible training runs.
Mention materialization pipeline: batch compute → push to online KV store for serving.
Cover training-serving skew detection as a first-class monitoring concern.
Common pitfall: online serving reads latest feature value while training used historical snapshot — silent model degradation.

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.