This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 75 | Staff level: multi-region, cost at scale, migration path, and production metrics. |
Interview Prompt
Design ML Feature Store.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Online vs offline store, Feature freshness, Point-in-time correctness? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Online vs offline store
- Feature freshness
- Point-in-time correctness
- Feature serving latency
- Capacity estimation with shown math
Out of scope (state explicitly)
- GPU cluster training and hyperparameter tuning
- Content moderation of recommended items
- Ad auction / sponsored placement ranking
Assumptions
- Clarify scale (DAU, QPS, data volume) for ml feature store in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Register feature definitions (name, type, entity, description, owner, SLA)
- Ingest features from batch (Spark/Hive) and streaming (Flink/Kafka) sources
- Serve features at low latency for online inference (< 5ms p99)
- Serve features in batch for model training (point-in-time correct joins)
- Feature versioning: track schema changes, backward compatibility
- Feature sharing: discover and reuse features across teams
- Point-in-time correctness: no data leakage
- Feature monitoring: drift detection, freshness alerts
- Feature lineage: trace from raw data source → transformation → feature
- Ultra-Low Latency (Online): < 5ms p99 for feature retrieval during inference
- High Throughput (Online): 1M+ feature lookups/sec
- Training Correctness: Point-in-time joins must be exact (no future data leakage)
- Freshness: Online features reflect latest state (< 1 minute for streaming)
- Consistency: Online and offline features must compute identically (no training-serving skew)
- Scalability: 10K+ feature definitions, 1B+ entity rows
| Metric | Calculation | Value |
|---|---|---|
| Feature definitions | Given | 10,000 |
| Entities (users, items, etc.) | Given | 1B |
| Features per entity | Given | 50-200 |
| Online feature retrievals / sec | Derived from daily volume ÷ 86400 (+ peak factor) | 1M |
| Online store size | Given | 2 TB (fits in Redis cluster) |
| Offline store size | Given | ~100 TB (S3/Parquet) |
| Batch ingestion (daily) | Given | 2 TB |
| Streaming ingestion | Given | 100K events/sec → feature updates |
Training-Serving Skew: The #1 Problem
What it is: Features computed differently in training vs serving ? model performs worse Example: Training: avg_purchase computed with pandas (Python float64) Serving: avg_purchase computed in Java (Java double, different rounding) Result: 0.1% difference → model accuracy degrades Prevention: 1. Single transformation definition: same code for batch AND streaming 2. Validation job: compute features both ways, compare → alert on divergence 3. Feature logging: log online features at serving time → use THESE for training
Point-in-Time Correctness
Wrong approach:
Join training data (March 1 prediction) with current features (March 14)
→ Model sees "future" information → fails in production
Correct approach:
For training example at time T:
feature_value = latest(feature WHERE timestamp <= T)
Implementation:
1. Store features as (entity_id, timestamp, value) triples
2. Training join: AS OF JOIN on timestamp
3. Delta Lake / Apache Iceberg support time-travel queries nativelyFeature Freshness: Batch vs Streaming vs On-Demand
Batch (Spark, daily/hourly): user_age_bucket, user_avg_purchase_value_30d: recomputed daily ? Cheap ✗ Stale: feature value may be 24 hours old Streaming (Flink, near real-time): user_clicks_last_5min, user_cart_total: updated every event ✓ Fresh (seconds latency) ? More infrastructure, higher cost On-demand (at inference time): "Does this user follow the seller of this product?" ? Always fresh ✗ Adds latency to inference pipeline Prioritization: Historical purchase categories ? weekly batch Average order value last 30 days ? daily batch Recent search queries ? 5-minute streaming Current cart contents ? on-demand (< 100ms)
Event Bus Design (Kafka)
Topic: ml_feature_store-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "ml_feature_store-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: ml_feature_store-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design an ML Feature Store: async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
# Feature definition
POST /api/features
{
"name": "user_avg_purchase_30d",
"entity": "user", "value_type": "FLOAT",
"description": "Average purchase amount in last 30 days",
"source": "transactions_table",
"freshness_sla": "1h"
}
# Online serving
GET /api/features/online?entity=user&entity_id=123,456&features=avg_purchase_30d,click_rate_7d
# Training data generation (batch)
POST /api/features/historical
{
"entity": "user",
"entity_ids_with_timestamps": [
{"entity_id": "123", "timestamp": "2026-03-01T10:00:00Z"}
],
"features": ["avg_purchase_30d", "click_rate_7d"]
}Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff
Online Store (Redis)
HSET user:123 avg_purchase_30d "45.20" click_rate_7d "0.032" last_login_hours "2.5" _updated_at "2026-03-14T10:05:00Z" EXPIRE user:123 86400
Offline Store (S3 / Delta Lake / Parquet)
Path: s3://feature-store/user/avg_purchase_30d/ Partitioned by date: date=2026-03-14/part-00000.parquet Schema: entity_id, timestamp, value, created_at
Feature Registry (PostgreSQL)
CREATE TABLE features (
feature_id UUID PRIMARY KEY,
name TEXT UNIQUE NOT NULL,
entity_type TEXT NOT NULL,
value_type TEXT NOT NULL,
description TEXT,
transformation_code TEXT,
freshness_sla INTERVAL,
owner_team TEXT,
deprecated BOOLEAN DEFAULT FALSE
);Online Store Failure
Redis cluster failure → feature retrieval fails → inference fails Mitigations: 1. Redis Cluster with replicas (6-node: 3 masters + 3 replicas) 2. Local feature cache on inference servers (LRU, 5-min TTL) 3. Default values: if Redis unavailable, use population median for each feature (degrades accuracy, preserves availability) 4. Feature importance: top 10 features provide 80% of model accuracy ? cache only critical features locally
Popular Feature Store Systems
When You Need a Feature Store
DON'T need: < 10 features, 1 model, 1 team → inline feature computation is fine. Batch-only ML → just use SQL views.
DO need: Real-time serving (< 10ms), multiple teams sharing features, training-serving skew issues, point-in-time correctness matters, streaming features.
Interview Walkthrough
- Explain offline store (batch training on historical snapshots) vs online store (low-latency serving) duality.
- Cover point-in-time correct joins — features must reflect state at prediction time, not latest value.
- Discuss feature versioning and lineage for reproducible training runs.
- Mention materialization pipeline: batch compute → push to online KV store for serving.
- Cover training-serving skew detection as a first-class monitoring concern.
- Common pitfall: online serving reads latest feature value while training used historical snapshot — silent model degradation.
Online Store: Redis vs DynamoDB vs Aerospike
Redis (most common choice): ✓ Sub-millisecond latency (memory-only, < 0.5ms p99) ✓ Flexible data types (Hash, String, List) ✓ TTL natively (features expire automatically) ✗ Memory-only: 10M users × 50 features × 50B = 25 GB RAM Best for: < 100M active users DynamoDB (AWS managed): ? Auto-scaling, zero ops overhead ? Latency: 5-10ms without DAX Best for: AWS ecosystem, medium scale Aerospike (Uber, Criteo, Twitter): ? NVMe SSD-optimized, < 1ms latency ✓ 10-100× cheaper than Redis for same data volume ✗ More complex to operate Best for: 1B+ entities where RAM is too expensive
Training-Serving Skew: The Silent Killer
Common causes of training-serving skew: 1. Different code paths: Training: SQL in Spark Serving: Python application code with different time window Fix: Feature Store shares ONE feature definition 2. Timestamp handling: Training: uses order creation time Serving: uses current time Fix: point-in-time joins (as-of join on timestamp) 3. Missing value handling: Training: NaN → fill with 0 Serving: missing key → return None (not 0) Fix: feature store enforces consistent default values 4. Type mismatches: Training: integer feature [1, 2, 3] Serving: string feature ["1", "2", "3"] Fix: feature store enforces schema with type validation Monitoring: Log feature values at serving time → compare distribution to training KS test between training and serving distributions ? alert on significant shift
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core ml feature store flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.