Design a Web Crawler (Googlebot) – System Design Walkthrough

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 25	The politeness + scale problem. Nail BFS URL frontier, per-domain rate limiting, robots.txt caching, SimHash near-duplicate detection, and DNS caching. Show why naive parallel crawling gets you blocked.
Arch 50	Add priority crawling, freshness scheduling, and distributed coordination without single-point bottleneck.
Arch 75	Staff: crawl budget allocation across 100M domains, adversarial sites (crawler traps), and index pipeline backpressure.

Interview Prompt

Design a web crawler that discovers and downloads billions of web pages for a search engine index. Respect robots.txt, avoid overloading sites, and deduplicate near-identical content.

Clarifying Questions (ask before designing)

Question	Why it matters
What's the crawl rate target and how many domains?	1B pages/month across 100M domains = strict per-domain politeness. 1B pages from 10K domains = different bottleneck.
Freshness priority or breadth-first discovery?	BFS discovers new sites; freshness scheduler re-crawls known pages — often separate queues.
Exact dedup or near-duplicate detection?	Exact URL dedup is a hash set; near-dedup (mirrors, templates) needs SimHash/MinHash.
JavaScript-rendered pages in scope?	Headless browser crawling is 100× slower — usually a separate rendering pipeline.

Scope

In scope

URL frontier with BFS and priority scheduling
Politeness: per-domain rate limiting
robots.txt fetch, parse, and cache
Content deduplication (exact URL + SimHash near-dedup)
DNS caching and resolution
Distributed crawler worker architecture

Out of scope (state explicitly)

HTML parsing and index extraction (downstream indexer)
JavaScript rendering (headless Chrome farm)
Login/authenticated crawling
Image/video binary crawling

Assumptions

Crawl 1B pages/month (~400 pages/sec sustained)
100M unique domains, avg 10 pages/domain
Default politeness: 1 request/sec/domain, 10 concurrent per domain max
Average page size 50 KB, 90% HTML text

Functional Requirements

Crawl the web starting from a set of seed URLs
Discover new URLs by extracting links from crawled pages
Download and store web page content for indexing
Respect robots.txt directives (politeness)
Handle URL deduplication (don't crawl the same page twice)
Support recrawling to detect updated content
Prioritize important/popular pages for crawling first
Handle different content types (HTML, PDF, images, etc.)

Back-of-the-Envelope Estimation

These estimates size a larger, web-scale target (1B pages/day at 100 KB/page) than the 1B pages/month baseline listed under Assumptions.

Metric	Calculation	Value
Pages to crawl	Assumption	15B (roughly the entire web)
Target crawl rate	Assumption	1B pages/day
Pages / sec	1B ÷ 86,400 s	~11,500
Avg page size	Assumption	100 KB
Download / day	1B × 100 KB	100 TB
Download bandwidth	100 TB × 8 bits ÷ 86,400 s	~9.3 Gbps (provision ~10 Gbps)
URL frontier size	10B URLs × 100 bytes	1 TB
Content storage / month	100 TB/day × 30 days	3 PB
Crawl workers	11,500 pages/sec ÷ 12 pages/sec per machine	~1,000 machines
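
The arithmetic in the table can be re-derived in a few lines of Python. This is only a sanity-check sketch: the constants are the table's own assumptions, and the variable names are made up for illustration.

```python
# Inputs copied from the estimation table (all of them are assumptions).
PAGES_PER_DAY = 1_000_000_000
AVG_PAGE_BYTES = 100_000            # 100 KB
FRONTIER_URLS = 10_000_000_000
BYTES_PER_URL = 100
PAGES_PER_WORKER_PER_SEC = 12
SECONDS_PER_DAY = 86_400

pages_per_sec = PAGES_PER_DAY / SECONDS_PER_DAY
download_bytes_per_day = PAGES_PER_DAY * AVG_PAGE_BYTES
bandwidth_gbps = download_bytes_per_day * 8 / SECONDS_PER_DAY / 1e9
frontier_tb = FRONTIER_URLS * BYTES_PER_URL / 1e12
storage_pb_per_month = download_bytes_per_day * 30 / 1e15
workers = pages_per_sec / PAGES_PER_WORKER_PER_SEC

print(f"pages/sec:        {pages_per_sec:,.0f}")
print(f"bandwidth (Gbps): {bandwidth_gbps:.1f}")
print(f"frontier (TB):    {frontier_tb:.1f}")
print(f"storage (PB/mo):  {storage_pb_per_month:.1f}")
print(f"workers:          {workers:,.0f}")
```

The results land slightly under the rounded figures in the table (about 9.3 Gbps and about 965 machines), which is why the table rounds up to 10 Gbps and 1,000 machines.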

URL Frontier: The Most Complex Component

The URL frontier is NOT a simple queue. It has two sub-systems:

1. Priority Queue (Front Queue): Determines WHICH URLs to crawl first. Priority based on PageRank, change frequency, domain authority, and freshness requirement. Implementation: Multiple priority queues (P0, P1, P2, ..., PN).

2. Politeness Queue (Back Queue): Ensures we don't overwhelm any single domain. One FIFO queue per domain. A rate limiter enforces max 1 request per domain per second (or per Crawl-delay in robots.txt).

Fetcher (HTTP Downloader)

Connection pooling: Reuse HTTP connections to the same host
Timeouts: Connect timeout = 5s, read timeout = 30s
Redirect handling: Follow up to 5 redirects (301, 302)
robots.txt compliance: Before crawling any page on a domain, fetch and cache robots.txt. Parse directives: Disallow, Allow, Crawl-delay, Sitemap. Cache robots.txt per domain with TTL = 24 hours
Content types: Accept HTML, PDF, DOC (reject binary, video, etc.)
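
The robots.txt fetch, parse, and cache step might look like the sketch below. It uses Python's standard urllib.robotparser for parsing; the RobotsCache class, the fetch_body callback, and the sample rules are hypothetical names introduced here, and the 24-hour TTL is the value stated above.

```python
import time
from urllib.robotparser import RobotFileParser

ROBOTS_TTL_SECONDS = 24 * 60 * 60  # TTL = 24 hours, as stated above


class RobotsCache:
    """Per-domain robots.txt cache.

    fetch_body is a caller-supplied (hypothetical) function that downloads
    the robots.txt body for a domain and returns it as a string.
    """

    def __init__(self, fetch_body, clock=time.time):
        self._fetch_body = fetch_body
        self._clock = clock
        self._entries = {}  # domain -> (expires_at, parser)

    def rules_for(self, domain):
        entry = self._entries.get(domain)
        if entry is None or entry[0] <= self._clock():
            parser = RobotFileParser()
            parser.parse(self._fetch_body(domain).splitlines())
            entry = (self._clock() + ROBOTS_TTL_SECONDS, parser)
            self._entries[domain] = entry
        return entry[1]


# Usage with an in-memory stand-in for the network fetch.
SAMPLE = "User-agent: *\nDisallow: /private/\nAllow: /\nCrawl-delay: 2\n"
fetch_calls = []

def fake_fetch(domain):
    fetch_calls.append(domain)
    return SAMPLE

cache = RobotsCache(fake_fetch)
rules = cache.rules_for("example.com")
allowed = rules.can_fetch("MyCrawler", "https://example.com/products/shoes")
blocked = rules.can_fetch("MyCrawler", "https://example.com/private/report")
delay = rules.crawl_delay("MyCrawler")
cache.rules_for("example.com")  # second call is served from the cache
```

A real deployment would likely keep the parsed rules in a shared store rather than process memory, and would decide separately how to treat a missing or unreachable robots.txt.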

Content Deduplication

Problem: Many pages have identical or near-identical content (mirrors, syndication, plagiarism).

Exact dedup: MD5/SHA-256 hash of page content → check against seen hashes
Near-dedup: SimHash or MinHash algorithm. Two pages are near-duplicates if Hamming distance of their SimHashes ≤ 3
Storage: Bloom filter for exact hashes, SimHash table for near-dupes
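
A minimal SimHash sketch, assuming unweighted word tokens and a 64-bit fingerprint. Production systems usually hash shingles or weighted features instead; the function names here are illustrative, and the threshold of 3 is the one given above.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash fingerprint of the text (bits must be <= 64 here)."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, threshold: int = 3) -> bool:
    return hamming_distance(a, b) <= threshold
```

Because every token contributes a vote to every bit, small edits to a long page flip only a few bits, which is what makes a Hamming-distance threshold meaningful. Comparing a new fingerprint against billions of stored ones needs an index (for example, lookup tables keyed on fingerprint segments) rather than the pairwise check shown here.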

URL Deduplication (Seen URLs)

Solution: Bloom filter with 100 billion entries (~120 GB for 1% false positive rate)
Before adding a URL to the frontier, check the Bloom filter
URL normalization: lowercase scheme/host, remove fragment, remove default ports, sort query parameters, resolve relative paths, remove tracking parameters
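
The ~120 GB figure follows from the standard Bloom filter sizing formula, m = -n ln(p) / (ln 2)^2 bits. The sketch below computes it and shows a toy in-memory filter; the class and function names are illustrative, and the double-hashing scheme is one common choice, not necessarily what a production crawler would use.

```python
import hashlib
import math


def bloom_parameters(n_items: int, fp_rate: float):
    """Return (number of bits, number of hash functions) for the target rate."""
    bits = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes


class BloomFilter:
    def __init__(self, n_items: int, fp_rate: float):
        self.num_bits, self.num_hashes = bloom_parameters(n_items, fp_rate)
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k positions from two 64-bit halves of SHA-256.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bits, hashes = bloom_parameters(100_000_000_000, 0.01)
print(f"{bits / 8 / 1e9:.0f} GB, {hashes} hash functions")  # roughly 120 GB, 7 hashes
```

A "not present" answer is always correct; a "present" answer is wrong about 1% of the time at this sizing, which means a small fraction of genuinely new URLs is skipped.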

DNS Resolver

DNS lookups are slow (50-200ms) and become a bottleneck at scale
Solution: Local DNS cache + custom DNS resolver
Pre-resolve DNS for URLs in the frontier (batch DNS lookups)
Cache DNS results with TTL
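
A local DNS cache can be sketched as a dictionary with expiry times. Note the assumption: socket.gethostbyname does not expose the record's real TTL, so this sketch applies a fixed 300-second TTL instead. The DnsCache class and its parameters are hypothetical.

```python
import socket
import time


class DnsCache:
    """Caches host -> IP lookups for a fixed TTL (assumed, not the record's TTL)."""

    def __init__(self, resolver=socket.gethostbyname, ttl_seconds=300,
                 clock=time.monotonic):
        self._resolver = resolver
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # host -> (expires_at, ip)

    def resolve(self, host: str) -> str:
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and entry[0] > now:
            return entry[1]                      # cache hit
        ip = self._resolver(host)                # cache miss: real lookup
        self._entries[host] = (now + self._ttl, ip)
        return ip
```

Batch pre-resolution then amounts to calling resolve for upcoming frontier hosts from a background task, so the fetch path sees a warm cache.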

Parser / Link Extractor

Parse HTML using robust parser (handles malformed HTML)
Extract: links, text content, metadata, base tag
Handle JavaScript-rendered pages via headless browser (expensive) or only crawl server-rendered HTML

Event Bus Design (Kafka)

Topic: web_crawler-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: domain (preserves per-domain ordering and keeps each domain on one consumer)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "web_crawler-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: web_crawler-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Rule for the crawler's control API: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates

Add Seed URLs

HTTP

POST /api/v1/crawler/seeds
{
  "urls": ["https://example.com", "https://news.ycombinator.com"],
  "priority": "high"
}

Get Crawl Status

HTTP

GET /api/v1/crawler/status
Response: 200 OK
{
  "pages_crawled_today": 892345678,
  "pages_in_frontier": 5234567890,
  "crawl_rate_per_sec": 11500,
  "active_workers": 980,
  "errors_today": 12345
}

Block Domain

HTTP

POST /api/v1/crawler/block
{
  "domain": "spam-site.com",
  "reason": "spam"
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
202 Accepted: job queued; poll GET /jobs/{id} for status
408 Request Timeout: server timed out waiting for the client to finish sending the request; resend on a fresh connection

Concern	Solution
Worker crash	URL remains in frontier (not removed until crawl confirmed). Reassigned to another worker
DNS failure	Retry with exponential backoff; fallback to secondary DNS
HTTP timeout	Retry up to 3 times with backoff; then mark URL as failed and deprioritize
Content store failure	S3 with 11 nines durability + cross-region replication
Frontier data loss	Checkpoint frontier to disk periodically; rebuild from content store's crawled URLs
Bloom filter loss	Rebuild from frontier + content store URLs (takes hours but possible)
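
The "retry up to 3 times with backoff" rows can be sketched as below. The fetch argument is a hypothetical callable that raises TimeoutError or ConnectionError on failure; the 1-second base delay and the added jitter are assumptions, not values from the table.

```python
import random
import time


def fetch_with_retries(fetch, url, max_retries=3, base_delay_s=1.0,
                       sleep=time.sleep):
    """Call fetch(url); on a transient error, retry with exponential backoff.

    Re-raises the last error once max_retries retries are exhausted, so the
    caller can mark the URL as failed and deprioritize it.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except (TimeoutError, ConnectionError):
            if attempt == max_retries:
                raise
            # 1s, 2s, 4s ... plus random jitter to avoid synchronized retries.
            delay = base_delay_s * 2 ** attempt + random.uniform(0, base_delay_s)
            sleep(delay)
```

Passing sleep as a parameter keeps the function testable without real waiting.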

Spider Traps and Infinite Loops

Defenses:
Max URL depth: stop after depth 15
Max URLs per domain: cap at 1M per crawl cycle
URL pattern detection: if more than 1,000 URLs match the same pattern, stop following it
Maximum page size: skip responses larger than 10 MB
Content hash dedup: if the same content appears at different URLs, stop following those links
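
Two of these defenses, the depth cap and URL pattern detection, can be sketched as follows. The way a "pattern" is derived here (replacing digit runs in the path with a placeholder) is an assumption chosen for illustration; real trap detectors use richer heuristics. TrapGuard and url_pattern are hypothetical names.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

MAX_DEPTH = 15                 # from the defenses listed above
MAX_URLS_PER_PATTERN = 1000    # from the defenses listed above


def url_pattern(url: str) -> str:
    """Collapse digit runs in the path so /calendar/2024/05 and
    /calendar/2024/06 map to the same pattern."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "{n}", parts.path)
    return f"{parts.hostname}{path}"


class TrapGuard:
    def __init__(self):
        self.pattern_counts = Counter()

    def should_follow(self, url: str, depth: int) -> bool:
        if depth > MAX_DEPTH:
            return False
        pattern = url_pattern(url)
        self.pattern_counts[pattern] += 1
        return self.pattern_counts[pattern] <= MAX_URLS_PER_PATTERN
```

Calendar pages and paginated listings that generate endless numbered URLs are the typical case this catches.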

Recrawl Strategy

News sites: Every 15 minutes
E-commerce (prices): Every few hours
Wikipedia/docs: Every few days
Static pages: Weekly or monthly
Adaptive recrawl: Track how often a page changes → adjust recrawl interval dynamically. If page changed: halve interval (capped at MIN). If unchanged: double interval (capped at MAX).
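
The halve/double rule is small enough to show directly. The MIN and MAX bounds below (15 minutes and 30 days) are assumptions picked to match the fastest and slowest cadences listed above.

```python
MIN_INTERVAL_S = 15 * 60            # assumed floor: 15 minutes
MAX_INTERVAL_S = 30 * 24 * 3600     # assumed ceiling: 30 days


def next_interval(current_s: int, changed: bool) -> int:
    """Halve the recrawl interval if the page changed, double it otherwise."""
    if changed:
        return max(MIN_INTERVAL_S, current_s // 2)
    return min(MAX_INTERVAL_S, current_s * 2)


# A page that keeps changing converges toward the floor:
interval = 24 * 3600
for _ in range(10):
    interval = next_interval(interval, changed=True)
print(interval)  # 900 seconds, i.e. the 15-minute floor
```

"Changed" is typically decided by comparing the new content hash with the one stored from the previous crawl.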

Distributed Architecture

Master-Worker: Master assigns URL batches to workers; workers fetch and report results
Alternative (Masterless): Each worker manages its own set of domains via consistent hashing on domain name. No single point of failure. Workers coordinate via message queue (Kafka).

Legal and Ethical Considerations

robots.txt MUST be respected
Crawl-delay: Honor the specified delay between requests
noindex meta tag: Don't index pages with noindex
Terms of service: Some sites prohibit crawling beyond robots.txt
Personal data: Don't store PII discovered during crawling (GDPR)

Sitemap Processing

Parse sitemap.xml for a curated list of a site's pages
Priority and changefreq hints from sitemap help with scheduling
Sitemaps can reference other sitemaps (sitemap index files)

Interview Walkthrough

Start with the URL frontier as the central scheduler — it decides what to crawl next while enforcing politeness constraints.
Explain the two-tier frontier: priority queues (importance) feeding per-domain back queues (rate limiting) — this is the pattern interviewers probe.
Use a Bloom filter for visited-URL deduplication to keep the frontier memory-efficient at billions of URLs.
Describe worker distribution: master-worker for simplicity or consistent-hashing workers by domain for fault tolerance.
Cover recrawl scheduling — adaptive intervals based on page change frequency, not a fixed global timer.
Emphasize robots.txt and crawl-delay compliance; ignoring politeness gets your crawler blocked and is an instant red flag.
Quantify throughput with Back-of-the-Envelope Estimation: pages/sec × average page size = bandwidth and storage needs.
Common pitfall: a global BFS queue with no per-domain throttle — you hammer one host and trigger IP bans.

URL Frontier: Priority + Politeness Architecture

The URL Frontier acts as the central scheduler and is built as a two-tier system: **Front Queues (Priority)** and **Back Queues (Politeness)**. Together they balance finding the most important pages early while respecting domain rate limits.

1. Front Queues (Priority Allocation):

URLs are classified into multiple priority levels from critical to low:
- Queue P1 (Critical): e.g., [google.com/news, bbc.com/latest]
- Queue P2 (High): e.g., [wikipedia.org/..., nytimes.com/]
- Queue P3 (Normal): e.g., [example.com/about, blog.io/post]
- Queue P4 (Low): e.g., [random-site.xyz/page42]
Priority Assignment Signals:
- PageRank: Pages on domains with higher PageRank get prioritised first.
- Freshness Need: Frequently updated directories (e.g. news sites) are scheduled on higher-priority tiers.
- Sitemap Hints: Explicit <priority> hints in sitemaps (e.g. 0.8 high vs 0.2 low).
- Historical Change Frequency: If historical scans show a page mutates daily, its priority is increased.
Selection Process: The prioritizer selects URLs using a weighted random probability distribution: P1: 40% chance, P2: 30% chance, P3: 20% chance, and P4: 10% chance.
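
The weighted selection can be sketched with random.choices. The queue names and weights are the ones given above; skipping empty queues and renormalizing over the rest is an assumption about how the prioritizer should behave when a tier runs dry.

```python
import random
from collections import deque

QUEUE_WEIGHTS = {"P1": 0.4, "P2": 0.3, "P3": 0.2, "P4": 0.1}


def pick_next_url(front_queues, rng=random):
    """Pick a non-empty front queue by weight and pop its oldest URL.

    front_queues maps queue name -> deque of URLs. Returns None if all
    queues are empty.
    """
    candidates = [name for name in QUEUE_WEIGHTS if front_queues.get(name)]
    if not candidates:
        return None
    weights = [QUEUE_WEIGHTS[name] for name in candidates]
    chosen = rng.choices(candidates, weights=weights, k=1)[0]
    return front_queues[chosen].popleft()


queues = {
    "P1": deque(["https://bbc.com/latest"]),
    "P4": deque(["https://random-site.xyz/page42"]),
}
url = pick_next_url(queues)
```

Random weighting, rather than always draining P1 first, keeps low-priority queues from starving.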

2. Back Queues (Politeness Rate Limiting):

To avoid crashing web servers (Denial-of-Service), the crawler maintains exactly **one FIFO queue per target domain**.
Each domain queue tracks performance states:
- Queue [google.com]: Last fetch timestamp, and active crawl_delay (e.g. 2s) → Next crawl window is calculated dynamically.
- Queue [wikipedia.org]: Long crawl delay (e.g. 5s) → next fetch scheduled 5s after the last successful download.
Worker Loop: Worker threads scan the back queues to find domains where next_fetch_time ≤ now(), dequeue one URL, issue the HTTP fetch, and set the next available window.
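
Scanning every back queue on each iteration gets expensive with many domains, so a common refinement is a min-heap keyed on next_fetch_time. The sketch below assumes that design. BackQueues is a hypothetical class, and for simplicity it opens the next window at dequeue time rather than after the download completes.

```python
import heapq
from collections import deque


class BackQueues:
    """One FIFO queue per domain plus a heap of (next_fetch_time, domain)."""

    def __init__(self, default_delay_s=1.0):
        self.default_delay_s = default_delay_s
        self.queues = {}        # domain -> deque of URLs (non-empty only)
        self.delays = {}        # domain -> crawl delay in seconds
        self.next_allowed = {}  # domain -> earliest permitted fetch time
        self.ready_heap = []    # (next_fetch_time, domain)

    def add(self, domain, url, now, crawl_delay_s=None):
        if crawl_delay_s is not None:
            self.delays[domain] = crawl_delay_s
        if domain not in self.queues:
            self.queues[domain] = deque()
            start = max(now, self.next_allowed.get(domain, now))
            heapq.heappush(self.ready_heap, (start, domain))
        self.queues[domain].append(url)

    def next_url(self, now):
        """Return a URL whose domain is eligible at time now, else None."""
        if not self.ready_heap or self.ready_heap[0][0] > now:
            return None
        _, domain = heapq.heappop(self.ready_heap)
        url = self.queues[domain].popleft()
        next_time = now + self.delays.get(domain, self.default_delay_s)
        self.next_allowed[domain] = next_time
        if self.queues[domain]:
            heapq.heappush(self.ready_heap, (next_time, domain))
        else:
            del self.queues[domain]
        return url
```

A domain appears in the heap only while its queue is non-empty, and next_allowed remembers the cooling window so that a domain re-added later cannot be fetched early.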

Crawl Execution Flow: One URL End-to-End

Processing a single URL (e.g., https://example.com/products/shoes) runs through the following steps in order. The per-step latencies are rough, illustrative figures:

Step 1: Dequeue from Frontier (0.1 ms)
Checks eligibility of the back queue (e.g., [example.com]). If last_fetch was 3s ago and crawl_delay is 2s, the queue is eligible. Pops the URL from the queue.
Step 2: DNS Resolution (1 ms: Cached)
Checks the local DNS cache for the host's IP address (e.g., example.com → 93.184.216.34). On a cache miss, queries recursive DNS resolvers and caches the answer for the record's TTL (e.g., 300s).
Optimization: Custom batch DNS prefetching executes in the background for upcoming back-queue items.
Step 3: Robots.txt Constraint Validation (0.1 ms: Cached)
Fetches cached robots rules from Redis (e.g., key robots:example.com). If the URL is matched against a block pattern (e.g. Disallow: /products/shoes) for our user agent, the crawler drops the URL instantly, logs a rejection, and moves to the next candidate.
Step 4: HTTP Fetch (200 ms: Network)
Leases a connection from the HTTP client pool for the domain. Transmits a `GET` request declaring a distinct User-Agent:
User-Agent: MyCrawler/1.0 (+https://mycrawler.com/about)
Follows HTTP redirects (301, 302, 307) up to 5 levels max. Imposes a 30-second absolute timeout and caps response sizes at 10 MB to prevent crawler bloat.
Step 5: Content Deduplication (1 ms)
Computes the SimHash of the downloaded body and looks it up in the SimHash table of previously seen pages (a Bloom filter only answers exact-membership queries, so it cannot do distance lookups). If the Hamming distance to any stored fingerprint is ≤ 3 (near-duplicate), the page is skipped to avoid indexing duplicate content. Otherwise, the fingerprint is added to the table and processing continues.
Step 6: Parsing and Link Extraction (5 ms)
Runs robust HTML parsers to pull structured headers, meta tags, and all child anchor links (<a href="...">) along with JSON-LD micro-data and language parameters.
Step 7: Distributing Storage (10 ms)
Persists the raw HTML, headers, URL, and time metadata into GFS/S3 under a unique key generated by the page content hash.
Write to S3: { url, content, headers, crawled_at, content_hash } (Deduplication at storage level using content_hash key).
Step 8: Discovered URLs Ingestion (1 ms)
Each extracted child URL is normalized, checked against the global seen-URLs Bloom filter, and added to the frontier with a calculated priority score if it is brand new.
Trap Protection: Increments crawled:{domain} in Redis. If a domain spans more than 1 million pages, crawling for that domain is halted to avoid infinite folder traps.
Step 9: Release & Back Queue Cooling
Sets `last_fetch_time = now()` for example.com, locking the back queue from worker dequeues for the duration of the crawl delay.

Timeline Summary: Total time per URL:

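~220 ms (dominated by network I/O), so a single thread completes ~4.5 URLs/sec. Reaching 1 billion pages per day (~11,500 pages/sec) therefore takes roughly 11,500 ÷ 4.5 ≈ 2,500 concurrent fetches (threads or async connections) spread across the worker fleet.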

Distributed Coordination: Domain-Sharded Architecture

In a large-scale system with 1,000+ crawler workers, coordinating which thread crawls which domain is critical to prevent politeness rate violations and high locking overheads.

The Shared Frontier Concurrency Problem:

If multiple workers pull from a single centralized queue of URLs, several of them can fetch from the same domain at the same moment.
The resulting load spikes violate politeness limits, and preventing them requires distributed locks or a shared rate limiter, which becomes a contention hotspot across workers.

The Consistent Hashing Solution:

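We partition domains deterministically across workers by hashing the domain name. The simplest form is modulo sharding:
domain_shard = hash(domain) % num_workers
Plain modulo remaps most domains whenever num_workers changes. True consistent hashing places workers on a hash ring, so adding or removing a worker moves only about 1/num_workers of the domains.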
For instance:
- Worker 0: assigned to [google.com, stackoverflow.com, ...]
- Worker 1: assigned to [wikipedia.org, amazon.com, ...]
- ...
- Worker 999: assigned to domains hashing to 999
Each worker maintains its own local URL frontier and Back Queues for its assigned subset of domains.
Zero-Lock Coordination: Because a domain is only ever processed by a single writer node, no distributed locks or inter-process rate limiters are required to guarantee politeness!
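
A hash ring with virtual nodes is one way to get the consistent-hashing behavior described here. The sketch below is illustrative: HashRing, the vnode count of 100, and the worker naming are all assumptions.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() varies between runs)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, workers, vnodes=100):
        # Each worker owns many points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def worker_for(self, domain: str) -> str:
        # The first ring point clockwise from the domain's hash owns it.
        idx = bisect.bisect_right(self._points, _hash(domain)) % len(self._ring)
        return self._ring[idx][1]


workers = [f"worker-{i}" for i in range(10)]
domains = [f"site{i}.example" for i in range(1000)]
before = HashRing(workers)
after = HashRing(workers + ["worker-10"])
moved = [d for d in domains if before.worker_for(d) != after.worker_for(d)]
print(f"{len(moved)} of {len(domains)} domains moved")
```

When a worker is added, the only domains that move are the ones the new worker takes over; every other domain keeps its owner, so its back queue and politeness state stay where they are.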

Data Ingestion and Failover via Kafka:

Ingestion Pipeline: When any worker discovers new URLs, it publishes them to the Kafka topic discovered-urls with key = domain. Kafka's key-based partitioning sends every URL for a given domain to the same partition, and the worker that owns that partition consumes it.
The worker consumes URLs for its assigned domains, checks the shared seen-URLs Bloom filter (Redis-backed or local with periodic sync), and adds to the local frontier if it is brand new.
Failover and Rebalancing: If a worker (e.g. Worker 42) crashes, the Kafka consumer group triggers a rebalance. Another worker inherits its partition, retrieves the last crawl state and timestamps from Redis, and resumes crawling that domain's queue within 30 seconds.
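Mitigating Hot Domains: If a massive site (like wikipedia.org) overloads a single worker, we split the domain into paths (e.g. wikipedia.org/wiki/A-M vs wikipedia.org/wiki/N-Z) and hash the path prefixes as separate domains. Each path shard gets a fraction of the domain's request budget so the combined rate still respects the site's crawl delay.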

Advantages over Centralized Frontier:

✓ No Single Point of Failure: Masterless architecture ensures there is no centralized coordinator/master node to fail.
✓ Linear Scalability: Adding more workers/machines dynamically scales the number of domains handled.
✓ Politeness by Design: Guaranteed by a single writer per domain.
✗ Load Imbalances (Hot Domains): Hot domains (e.g., google.com) are assigned to one worker creating uneven loads. Mitigated by path-based sub-crawler hashing.

BFS vs DFS vs Priority-Based Crawling

Selecting the path traversal model changes which pages are crawled first and impacts memory consumption:

Breadth-First Search (BFS):
- Crawls all root seeds, then all depth-1 pages, followed by depth-2 pages.
- ✓ Pros: Discovers crucial landing pages early (homepage → navigation sections).
- ✗ Cons: Tends to waste valuable network bandwidth on low-value pages at the same depth level, and cannot prioritize important content.
Depth-First Search (DFS):
- Follows links straight down a site structure before backtracking.
- ✓ Pros: Low memory consumption since only the active page traversal path is stored.
- ✗ Cons: Easily trapped in deep recursive page paths, and takes a long time to discover major sections of other domains.
Priority-Based Crawling (⭐ Recommended):
- Scores each URL based on relative importance and crawls the highest-scored URLs first.
- Scoring Equation:
  Score = α × PageRank(domain) + β × depth_penalty + γ × freshness_need
- ✓ Pros: Guarantees the highest-value internet pages are discovered and cached first.
- ✓ Adaptive Recrawling: Can reprioritize dynamic schedulers based on real-time change discovery.
- ✗ Cons: Requires managing complex distributed priority queues.
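
The scoring equation and a priority-ordered frontier can be sketched with heapq. The weights, the normalization of each signal to the range 0 to 1, and the negative β (so that deeper pages score lower) are all assumptions made for this example. PriorityFrontier is a hypothetical name.

```python
import heapq

# Illustrative weights only. BETA is negative so depth_penalty lowers the score.
ALPHA, BETA, GAMMA = 0.6, -0.2, 0.2


def score(pagerank: float, depth: int, freshness_need: float,
          max_depth: int = 15) -> float:
    """Score = ALPHA * PageRank + BETA * depth_penalty + GAMMA * freshness_need.

    pagerank and freshness_need are assumed to be normalized to [0, 1].
    """
    depth_penalty = min(depth, max_depth) / max_depth
    return ALPHA * pagerank + BETA * depth_penalty + GAMMA * freshness_need


class PriorityFrontier:
    """Pops the highest-scored URL first (heapq is a min-heap, so negate)."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker keeps insertion order for equal scores

    def push(self, url: str, url_score: float) -> None:
        heapq.heappush(self._heap, (-url_score, self._counter, url))
        self._counter += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]


frontier = PriorityFrontier()
frontier.push("https://random-site.xyz/page42", score(0.1, 6, 0.1))
frontier.push("https://bbc.com/latest", score(0.9, 1, 1.0))
first = frontier.pop()  # the high-PageRank, fresh, shallow page comes out first
```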

Practical Hybrid Implementation: Start with BFS from seeds to quickly map a website's core architecture, switch to a priority-based scheduler for daily operations (focusing on important pages), and use DFS locally to walk sitemap hierarchies efficiently.

URL Normalization: Why It Matters

Websites often serve the exact same page content across multiple distinct URL formats. Normalization prevents wasting network and storage bandwidth on redundant crawls.

The Redundancy Problem:

Without normalization, the following 7 URLs would be treated as completely different entities, causing the crawler to fetch the same page 7 separate times (wasted bandwidth):

https://example.com/page (Canonical template)
https://Example.COM/page → lowercase host
https://example.com/page/ → trailing slash removal
https://example.com/page?a=1&b=2 → query parameter sorting
https://example.com/page?b=2&a=1 → same params, different order
http://example.com/page → scheme normalization
https://example.com/./page/../page → relative path segment resolution

Normalization Rules (Applied before Seen Bloom Filter check):

Lowercase the scheme and host components.
Remove default protocol ports (e.g. :80 for HTTP and :443 for HTTPS).
Strip trailing slashes (except for root domains).
Sort all query parameters alphabetically.
Remove tracking parameters (e.g., utm_source, fbclid, gclid).
Resolve dot segments in the path (e.g., collapse /./ and /../).
Decode percent-encoded unreserved characters (e.g. %41 → A).
Extract and honor canonical tags (<link rel="canonical" href="...">) where available.
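
Most of these rules can be expressed with Python's urllib.parse, as in the sketch below. It is deliberately partial: it does not rewrite http to https, decode percent-encoding, or read canonical tags, and it drops any userinfo in the URL. The tracking-parameter list is an illustrative subset.

```python
import posixpath
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Keep the port only when it is not the scheme's default.
    port = parts.port
    netloc = host if port is None or DEFAULT_PORTS.get(scheme) == port else f"{host}:{port}"

    # normpath resolves "." and ".." segments and strips a trailing slash
    # (the root path "/" is kept).
    path = posixpath.normpath(parts.path) if parts.path else "/"

    # Drop tracking parameters, then sort the rest alphabetically.
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k not in TRACKING_PARAMS]
    query = urlencode(sorted(params))

    # The fragment is discarded: it never reaches the server.
    return urlunsplit((scheme, netloc, path, query, ""))


print(normalize_url("https://Example.COM:443/./page/../page/?b=2&a=1&utm_source=x#top"))
# https://example.com/page?a=1&b=2
```

Treating http and https as the same page is left out on purpose, since it is only safe when the site is known to serve identical content on both.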

SLOs & Error Budgets

Metric	Target	Rationale
Crawl throughput	400 pages/sec sustained	1B pages/month budget
Politeness violation rate	0%	One violation can get entire IP range blocked
robots.txt compliance	100%	Legal and ethical requirement
Near-dup detection recall	> 95%	Duplicates waste index storage and degrade search quality

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
IP range blocked by major domain (Amazon, Google)	403 rate for domain >50%; zero successful fetches for 30 min	Immediately stop crawling that domain; rotate to backup IP pool; review recent rate changes; contact the site operator (robots.txt itself carries no contact field); resume at 0.1 req/sec after 24h
Crawler trap domain consumes 50% of crawl budget	Single domain >500K pages queued; pages/sec for other domains drops	Emergency domain cap (1000 pages); add URL pattern to trap detector; purge frontier entries for domain; backfill budget to other domains
DNS resolver outage causes cascade of fetch failures	DNS cache miss rate 100%; fetch error rate >90%; all workers idle	Failover to secondary resolver pool; extend DNS cache TTL to 24h temporarily; serve from stale cache; scale resolver fleet

Cost Drivers (Staff lens)

Download (ingress) bandwidth: 400 pages/sec × 50 KB = 20 MB/sec ≈ 52 TB/month
Worker fleet: 100 nodes × 4 CPU (mostly I/O wait) — not CPU-bound
Frontier storage: 10B URLs × 200 bytes = 2 TB Redis/Cassandra
SimHash + seen-URL store: 1B entries × 100 bytes ≈ 100 GB Cassandra

Multi-Region & DR

Crawl workers in multiple regions for geo-distributed content and IP diversity. URL frontier is global (single source of truth) — domain partition ensures no duplicate fetch across regions. robots.txt cache replicated globally. DNS cache regional (TTL respected). Output (page content + metadata) streamed to indexer via Kafka with region tag.