Interview Prompt
Design a Service Discovery System.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Client-side vs server-side discovery, Health check mechanisms, DNS-based vs registry (Consul/etcd)? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Client-side vs server-side discovery
- Health check mechanisms
- DNS-based vs registry (Consul/etcd)
- Self-registration
- Stale entry eviction
- Capacity estimation with shown math
Out of scope (state explicitly)
- Full service mesh control plane
- Application business logic in downstream services
- Building a custom service registry from scratch when managed options exist
Assumptions
- Clarify scale (DAU, QPS, data volume) for service discovery in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
- Services register themselves on startup (name, address, port, health endpoint, metadata)
- Services deregister on shutdown (graceful) or get auto-deregistered on failure
- Clients look up healthy instances of a service by name
- Health checking: periodic checks (HTTP, TCP, gRPC) to detect unhealthy instances
- Support for multiple environments/namespaces (prod, staging, per-tenant)
- Service metadata: version, region, weight, canary flags
- Watch/subscribe: clients get notified of changes (new instances, removals)
- DNS-based and API-based discovery
- Load balancing integration: return instances in weighted/round-robin order
- High Availability: 99.999% target, because if discovery is down, no service can call another
- Low Latency: Lookup in < 1ms (client-side caching with server push)
- Consistency: Eventually consistent with < 5s propagation of changes
- Scalability: 100K+ service instances, 1M+ lookups/sec
- Fault Tolerance: Continue operating during network partitions
- Zero Downtime: Rolling updates, cluster resizing without disruption
| Metric | Calculation | Value |
|---|---|---|
| Services | Given | 5,000 distinct names |
| Instances (total) | Given | 100,000 |
| Registrations / sec | Assumed steady-state churn | 50 (deploys, autoscaling) |
| Health checks / sec | 100,000 instances ÷ 10s check interval | 10K/sec |
| Lookups / sec | Assumed peak across all clients | 1M (with client caching, most served locally) |
| Watch subscribers | Given | 50K (one per client instance) |
| Metadata per instance | Given | ~500 bytes |
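The derived figures in the table can be sanity-checked with quick arithmetic. The sketch below assumes the 10s check interval from the registration example and a 5-node Raft cluster (the size used in the failure-handling section); both are illustrative inputs, not measurements.

```python
# Back-of-envelope capacity check for the estimation table.
INSTANCES = 100_000
METADATA_BYTES = 500        # per instance, from the table
CHECK_INTERVAL_SEC = 10     # assumption: interval from the registration example
RAFT_NODES = 5              # assumption: cluster size from the failure section

# Every instance is probed once per interval.
health_checks_per_sec = INSTANCES // CHECK_INTERVAL_SEC

# Each Raft node holds a full copy of the registry.
registry_mb = INSTANCES * METADATA_BYTES / 1_000_000
replicated_mb = registry_mb * RAFT_NODES

print(f"health checks/sec: {health_checks_per_sec:,}")
print(f"registry size per node: {registry_mb:.0f} MB")
print(f"total across {RAFT_NODES} replicas: {replicated_mb:.0f} MB")
```

At roughly 50 MB per node, the full registry fits comfortably in memory, which is consistent with an in-memory, Raft-replicated store.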
A Consul-like service discovery system with a Raft-consistent registry, health checking by local agents, gossip-based anti-entropy, and client-side caching with watch support for instant updates.
Server-Side vs Client-Side Discovery
| Approach | How | Pros | Cons |
|---|---|---|---|
| Server-Side (ELB/K8s Service) | Client → LB → Backend | Client is simple (just knows LB address) | Extra hop, LB is bottleneck |
| Client-Side (Consul/Eureka) ⭐ | Client queries registry → picks instance directly | No extra hop, locality-aware decisions | Per-language library needed |
| Service Mesh (Envoy/Istio) | Client → local Envoy sidecar → Backend | Language-agnostic, rich features | Sidecar resource overhead |
| Kubernetes DNS | svc.namespace.svc.cluster.local → ClusterIP | Built-in, zero setup | Limited health checking, DNS caching |
Health Check Strategies
Level 0: TCP Check: Is port open? Catches process crash. < 1ms.
Level 1: HTTP Shallow ⭐: GET /healthz → 200. Catches HTTP server failure. < 5ms.
Level 2: HTTP Deep: Checks DB connection, cache, external APIs. Catches dependency failures but risks cascading. 10-100ms.
Level 3: Liveness + Readiness (K8s): /healthz/live (is process stuck?) vs /healthz/ready (can it handle traffic?). Service starting: live=true, ready=false. DB lost: live=true, ready=false. Deadlocked: live=false → K8s KILLS pod.
Recommendation: SD health check = Level 1 (shallow, every 5s). Application readiness = Level 2 (deep, every 30s). K8s liveness = Level 0 or 1. NEVER use deep health checks at high frequency with SD.
Anti-Entropy & Convergence
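Consul uses the Serf gossip protocol for cluster membership and node failure detection: agents share membership updates peer-to-peer, and the cluster typically converges within seconds (< 5s). Anti-entropy runs periodically (every 30s in this design): each agent syncs its full local service and check state with the servers' catalog. On conflict, the agent's local view is authoritative and overwrites the catalog entry. Health checks are NOT centralized on servers: each local agent health-checks the services on its own node and reports status changes to the servers.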
Graceful Deployment with Service Discovery
Blue-Green: Deploy v2, register with tag "v2", gradually shift weight v1→v2, deregister v1.
Canary: Register 1 canary with tag "canary", route 5% to canary via metadata, monitor, promote or deregister.
Connection Draining: Before deregistration, mark as "draining": SD stops sending new requests, waits 30s for in-flight requests to complete, then fully deregisters.
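The canary and draining rules above can be expressed as a small client-side selection function. This is a minimal sketch under assumptions: the `status` field, the `pick_instance` name, and the instance dictionaries are hypothetical, and weights are assumed to be positive integers stored as strings in `meta`, as in the registration example.

```python
import random

def pick_instance(instances, canary_fraction=0.05, rng=random):
    """Pick one instance: skip draining ones, send ~canary_fraction to canaries."""
    routable = [i for i in instances if i.get("status") != "draining"]
    canaries = [i for i in routable if "canary" in i["tags"]]
    stable = [i for i in routable if "canary" not in i["tags"]]

    # Use the canary pool for a small share of requests, or when nothing else exists.
    if canaries and (not stable or rng.random() < canary_fraction):
        pool = canaries
    else:
        pool = stable
    if not pool:
        raise LookupError("no routable instances")

    # Weighted choice inside the selected pool (weights assumed > 0).
    weights = [int(i["meta"].get("weight", "100")) for i in pool]
    return rng.choices(pool, weights=weights, k=1)[0]

instances = [
    {"id": "v1-a", "tags": ["v1"], "meta": {"weight": "100"}},
    {"id": "v2-canary", "tags": ["v2", "canary"], "meta": {"weight": "100"}},
    {"id": "v1-old", "tags": ["v1"], "meta": {"weight": "100"}, "status": "draining"},
]
rng = random.Random(42)  # seeded only to make the demo repeatable
picks = [pick_instance(instances, rng=rng)["id"] for _ in range(10_000)]
canary_share = picks.count("v2-canary") / len(picks)
print(f"canary share: {canary_share:.3f}")
```

Filtering draining instances on the client is what lets in-flight requests finish while no new ones arrive.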
Event Bus Design (Kafka)
Topic: service_discovery-events
Partitions: 64 (scale consumers horizontally)
Partition key: entity_id (user_id / order_id; preserves per-entity ordering)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "service_discovery-processors"
- At-least-once delivery + idempotent handlers (dedup by event_id)
- DLQ topic: service_discovery-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
For this design, async side effects MUST NOT block the synchronous API response.
Sync path: validate → persist source of truth → publish event → return 201
Async path: consumers update caches, indexes, notifications, aggregates
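The consumer rules above (at-least-once delivery, dedup by event_id, dead-lettering poison messages) can be sketched without a Kafka client. The class below is an in-memory illustration only: `IdempotentProcessor` and its fields are hypothetical names, the seen-set would need to be bounded or persisted in practice, and it treats "3 retries" as three total attempts.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumption: "3 retries" modeled as three total attempts

class IdempotentProcessor:
    def __init__(self, handler):
        self.handler = handler
        self.seen = set()    # event_ids already applied
        self.dlq = deque()   # stand-in for the DLQ topic

    def process(self, event):
        # At-least-once delivery means redeliveries happen; skip known events.
        if event["event_id"] in self.seen:
            return "duplicate"
        for _ in range(MAX_ATTEMPTS):
            try:
                self.handler(event)
            except Exception:
                continue  # retry until attempts are exhausted
            self.seen.add(event["event_id"])
            return "processed"
        self.dlq.append(event)  # poison message: park it and move on
        return "dead-lettered"

applied = []

def handler(event):
    if event.get("poison"):
        raise ValueError("cannot parse event")
    applied.append(event["event_id"])

processor = IdempotentProcessor(handler)
results = [
    processor.process({"event_id": "e1"}),
    processor.process({"event_id": "e1"}),                   # redelivery
    processor.process({"event_id": "e2", "poison": True}),
]
print(results)
```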
# Registration
PUT /v1/agent/service/register
{
"id": "payment-svc-i-abc123",
"name": "payment-svc",
"address": "10.0.1.42",
"port": 8080,
"tags": ["v2.3.1", "canary"],
"meta": {"region": "us-east", "weight": "100"},
"check": {
"http": "http://10.0.1.42:8080/healthz",
"interval": "10s",
"timeout": "3s",
"deregister_critical_service_after": "60s"
}
}
# Deregistration
PUT /v1/agent/service/deregister/{service_id}
# Lookup healthy instances
GET /v1/health/service/{service_name}?passing=true&near=_agent&tag=v2.3.1
# Watch for changes (long-poll / blocking query)
GET /v1/health/service/{service_name}?passing=true&index=42&wait=30s
→ Returns immediately if index > 42 (changes occurred)
→ Blocks up to 30s if no changes
# DNS interface
dig payment-svc.service.consul SRV
→ 1 0 8080 i-abc123.node.dc1.consul.
→ 1 0 8080 i-def456.node.dc1.consul.
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
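The retry guidance for 429, 500, and 503 can be sketched as a delay calculator. This is one possible policy, not a prescribed one: the function names, the 0.5s base, and the 30s cap are assumptions, and full jitter is a design choice added here to avoid synchronized retries.

```python
import random

RETRYABLE = {429, 500, 503}  # statuses the list above marks as retryable

def should_retry(status):
    return status in RETRYABLE

def backoff_delay(attempt, retry_after=None, base=0.5, cap=30.0, rng=random):
    """Seconds to wait before retry number `attempt` (1-based)."""
    if retry_after is not None:
        return float(retry_after)  # a server-provided Retry-After takes precedence
    ceiling = min(cap, base * 2 ** (attempt - 1))  # exponential growth, capped
    return rng.uniform(0, ceiling)  # full jitter spreads clients out

for attempt in range(1, 6):
    print(attempt, round(backoff_delay(attempt), 2))
```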
Service Registry (In-Memory + Raft-Replicated)
{
"services": {
"payment-svc": {
"instances": {
"payment-svc-i-abc123": {
"id": "payment-svc-i-abc123",
"address": "10.0.1.42",
"port": 8080,
"tags": ["v2.3.1", "canary"],
"meta": {"region": "us-east", "weight": "100"},
"health": "passing",
"last_heartbeat": "2026-03-14T10:05:00Z",
"registered_at": "2026-03-14T08:00:00Z",
"check": {
"type": "http",
"endpoint": "http://10.0.1.42:8080/healthz",
"interval_sec": 10,
"consecutive_failures": 0
}
}
}
}
}
}
Consul vs etcd vs ZooKeeper Comparison
SD Cluster Failure
Defense layers:
1) Raft cluster with 5 nodes (tolerates 2 failures).
2) If majority lost → cluster read-only.
3) Clients use local cache → continue routing to last-known instances.
4) Client-side health checking as backup.
5) On recovery: services re-register.
Key principle: Service-to-service calls must NOT depend on SD availability. Client caches must be robust enough to last through SD outages. Graceful degradation: stale routing > no routing.
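The "5 nodes tolerates 2 failures" figure follows from majority quorum arithmetic, shown here as a short worked example. The function names are illustrative.

```python
def quorum(n):
    """Smallest majority of an n-node Raft cluster."""
    return n // 2 + 1

def tolerated_failures(n):
    """Nodes that can fail while a majority remains."""
    return n - quorum(n)

for n in (3, 4, 5, 7):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates={tolerated_failures(n)}")
```

A 4-node cluster tolerates only one failure, the same as 3 nodes, which is why odd cluster sizes are the usual choice.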
Race Conditions in Service Discovery
Race 1: Stale Cache During Rolling Deployment: Service B1 deregisters, but A's cache still has B1 (TTL: 30s). A sends request to B1 → connection refused. Mitigation: client-side retry with next instance, circuit breaker, watch-based (not poll-based) cache refresh, and connection draining (B1 sends "draining" status before deregistering).
Race 2: Split-Brain During Network Partition: Minority partition (2/5) becomes read-only. Majority partition (3/5) elects new leader. Clients in minority serve stale but valid data. When partition heals: minority catches up via Raft log replay. Key: clients MUST work with stale data during partition.
Race 3: Zombie Instance: Process hangs but TCP port is open. Health check times out. After 3 consecutive timeouts → deregistered. Window: 30 seconds. Solution: passive health check (if actual request fails, immediately mark unhealthy) + shallow every 5s + deep every 30s.
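The zombie-instance mitigation combines a consecutive-failure threshold for active checks with an immediate passive trip when a real request fails. The tracker below is a minimal sketch; the class name, method names, and the "passing"/"critical" status strings are assumptions for illustration.

```python
class HealthTracker:
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.status = "passing"

    def record_check(self, ok):
        """Active check result from the periodic prober."""
        if ok:
            self.consecutive_failures = 0
            self.status = "passing"
        else:
            self.consecutive_failures += 1
            # Only trip after N misses in a row, so one blip is tolerated.
            if self.consecutive_failures >= self.failure_threshold:
                self.status = "critical"
        return self.status

    def record_request_failure(self):
        """Passive check: a real request failed, so stop routing immediately."""
        self.consecutive_failures = self.failure_threshold
        self.status = "critical"
        return self.status

tracker = HealthTracker()
timeline = [tracker.record_check(False) for _ in range(3)]
print(timeline)
```

With a 10s interval and a threshold of 3, active checks alone leave the 30-second window described above; the passive path closes it on the first failed request.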
Client-Side Caching: How It Actually Works
Every service maintains a LOCAL cache of SD data. Update strategy (layered):
1) WATCH: blocking query to Consul/etcd (instant notification on change, < 100ms).
2) POLL: every 30s, full refresh as backup.
3) ON-FAILURE: if request to instance fails, immediately refresh cache for that service.
4) DISK: persist cache to disk on exit → load on restart (survives SD outage).
Recommended: watch for real-time + TTL=120s as safety net.
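The layered strategy can be sketched as a small cache class. Everything here is hypothetical scaffolding: `fetch` stands in for a registry lookup, `on_watch_event` would be called by a blocking-query loop, and a caller would invoke `refresh` directly after a failed request to get the on-failure behavior.

```python
import json
import time

class DiscoveryCache:
    def __init__(self, fetch, path, ttl_sec=120, clock=time.monotonic):
        self.fetch = fetch        # hypothetical callable: service name -> instances
        self.path = path          # where the cache is persisted
        self.ttl_sec = ttl_sec
        self.clock = clock
        self.entries = {}         # name -> (instances, fetched_at)

    def get(self, name):
        cached = self.entries.get(name)
        if cached and self.clock() - cached[1] < self.ttl_sec:
            return cached[0]
        return self.refresh(name)  # TTL safety net

    def refresh(self, name):
        try:
            instances = self.fetch(name)
        except Exception:
            if name in self.entries:
                return self.entries[name][0]  # stale routing beats no routing
            raise
        self.entries[name] = (instances, self.clock())
        return instances

    def on_watch_event(self, name, instances):
        """Called by the watch loop when the registry reports a change."""
        self.entries[name] = (instances, self.clock())

    def save(self):
        with open(self.path, "w") as f:
            json.dump({k: v[0] for k, v in self.entries.items()}, f)

    def load(self):
        try:
            with open(self.path) as f:
                data = json.load(f)
        except FileNotFoundError:
            return
        # Mark loaded entries as expired so the first get() attempts a refresh.
        for name, instances in data.items():
            self.entries[name] = (instances, float("-inf"))
```

Loading entries as already expired means a restarted client prefers fresh data but still falls back to the disk copy if the registry is unreachable.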
Self-Registration vs Third-Party Registration
Self-Registration (Eureka, Consul agent): Service registers itself on startup, sends heartbeats. ✅ Service knows its own state best. ❌ Every service needs registration logic.
Third-Party Registration (Kubernetes, Registrator): External component watches for new instances and registers/deregisters on behalf of services. ✅ Services don't need discovery-aware code. ❌ Registrator is another component to manage.
Interview Walkthrough
- Contrast discovery patterns: server-side (LB/K8s Service), client-side (Consul/Eureka), and service mesh (Envoy sidecar) — pick based on latency vs simplicity trade-off.
- Health check tiers: shallow HTTP /healthz every 5s for SD; deep dependency checks only at low frequency; never deep-check at SD cadence.
- Client-side cache with layered refresh: watch (blocking query) + poll backup + on-failure immediate refresh + disk persistence for SD outage survival.
- Connection draining before deregistration: mark instance as draining → stop new requests → wait 30s for in-flight → fully deregister.
- Raft-backed registry (Consul/etcd) tolerates node failures; clients must degrade gracefully to stale cache rather than fail entirely.
- DNS SRV records as a fallback interface — useful for legacy clients that cannot embed a discovery SDK.
- Common pitfall: deep health checks that verify database connectivity at 5s intervals — one DB blip deregisters all healthy instances and causes a cascading outage.
DNS-Based vs API-Based Service Discovery
| Aspect | DNS-Based | API-Based ⭐ |
|---|---|---|
| Universality | ✓ Every language supports DNS natively | ✗ Requires client library per language |
| Data richness | ✗ IP + port only (SRV records help) | ✓ Full metadata: tags, health, weight |
| Freshness | ✗ DNS caching (TTL-dependent) | ✓ Instant via long-poll/watch |
| Smart routing | ✗ Round-robin only | ✓ Weighted, canary, version-based |
Recommendation: Use DNS for simple cases (Kubernetes internal). Use API for microservices with advanced routing needs. Many systems use both: DNS for initial discovery, API for watch/health.
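To make the "data richness" row concrete, the sketch below parses the SRV answer lines from the DNS interface example. An SRV record carries only priority, weight, port, and target, so tags and health detail are not available on this path. The `SrvRecord` and `parse_srv` names are illustrative.

```python
from typing import NamedTuple

class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str

def parse_srv(rdata):
    """Parse SRV record data of the form 'priority weight port target.'"""
    priority, weight, port, target = rdata.split()
    return SrvRecord(int(priority), int(weight), int(port), target.rstrip("."))

answers = [
    "1 0 8080 i-abc123.node.dc1.consul.",
    "1 0 8080 i-def456.node.dc1.consul.",
]
records = [parse_srv(a) for a in answers]
for r in records:
    print(r.target, r.port)
```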
Health Check Depth Spectrum
Level 0 (TCP): < 1ms, catches process crash.
Level 1 ⭐ (HTTP Shallow): < 5ms, catches HTTP failure. Safe for SD at 5s intervals.
Level 2 (HTTP Deep): 10-100ms, catches dependency failures but risks cascading (DB slow → all services marked unhealthy).
Level 3 (Liveness + Readiness): K8s pattern. SD health check = Level 1. Application readiness = Level 2 (every 30s). K8s liveness = Level 0 or 1. NEVER use deep checks at high frequency.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core service discovery flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
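An availability target translates directly into an error budget. The worked example below converts a percentage into allowed downtime over a 30-day window; the window length is an assumption, and the function name is illustrative.

```python
def downtime_budget_minutes(availability_pct, window_days=30):
    """Allowed downtime in minutes for a given availability target."""
    return (1 - availability_pct / 100) * window_days * 24 * 60

for target in (99.9, 99.95, 99.99, 99.999):
    minutes = downtime_budget_minutes(target)
    print(f"{target}% -> {minutes:.2f} min per 30 days")
```

99.95% allows about 21.6 minutes per 30 days, while 99.999% allows under half a minute, which shows how much stricter a five-nines discovery target is than the general availability SLO.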
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.