Indexing and Query Optimization – System Design Core Concept

What:

An index is a secondary database access path built on top of primary tables.

Primary purpose:

Accelerating read query speeds by replacing slow full table disk scans with binary tree traversals.

Usually used for:

Filtering rows (WHERE), sorting datasets (ORDER BY), joining tables, and resolving point lookups.

How should I think about this inside system architectures?

🔍 Secondary Access Path

Instead of reading millions of database rows sequentially from disk, follow a tree branch directly to the row pointer in O(log N) steps.

⚖️ Write & Memory Tax

Indexes are not free. Every added index slows database mutations because the engine must update both the raw table and the index tree.

⛓️ Left-to-Right Sort Order

Composite (multi-column) indexes are sorted hierarchically. They only support queries matching from left to right (leftmost prefix rule).

The default indexing structure for standard SQL and NoSQL databases is the B+Tree. It places navigational keys in internal branches, and links leaf nodes sequentially to support quick range scans:

Loading...

Index Classification Matrix

Type	Lookup	Range	Use Case
B+Tree (General purpose)	O(log N)	Excellent (linked leaves)	Standard SQL/NoSQL primary/secondary indexes
Hash (Point lookup)	O(1)	None (unordered keys)	Redis, exact match filters in SQL databases
Covering (Index-only)	O(log N)	Excellent	Queries reading ONLY the indexed columns (zero disk row seek)
Inverted (Full-text)	Token-based	N/A	Search engines like Elasticsearch, PostgreSQL GIN/GiST
Geospatial (R-Tree / S2)	O(log N)	Proximity	Location queries near coords (PostGIS, Uber, MongoDB)

Benefit	Cost
Drastically Faster Reads (swaps slow full table scans for O(log N) lookups)	Slower Writes (every INSERT/UPDATE/DELETE must update all secondary indexes)
Efficient Range Queries & ORDER BY (linked leaf nodes keep data physically sorted)	Memory & Disk Overhead (large indexes consume RAM buffer space and storage)
Covering Index Optimization (serves select queries entirely from index metadata)	Maintenance Overhead (indexes require periodic defragmentation/rebuilding)

Problem	Usage
WhatsApp Message Timeline	Composite index on `(chat_thread_id, timestamp)` for rapid chat histories
Uber Driver Dispatch Proximity	Geospatial index (S2 / Geohash cells) to find nearby drivers in milliseconds
E-commerce Faceted Search	Inverted index (Elasticsearch) to filter product catalogs by color, size, price
Instagram Feed Timeline Generation	Index on `(user_id, created_at DESC)` for cursor-based user home timelines
API Idempotency Key Validation	Unique B+Tree or Hash index on `(idempotency_key)` to intercept duplicate requests

Clustered vs Non-Clustered Storage Mechanics

In clustered indexes (e.g., MySQL's InnoDB primary key), leaf nodes store the actual physical data row on disk. In secondary non-clustered indexes, the leaf nodes store the secondary search key and the primary key as a row pointer. To retrieve non-indexed columns, the database must traverse the secondary index tree, find the primary key, and then traverse the primary clustered index tree (double lookup or key lookup).

Composite Index Rule Mathematics

When planning multi-column composite indexes, you must adhere to the Equality-first, Sort-next, Range-last optimization math:

SQL

-- For Query: 
WHERE status = 'ACTIVE' AND user_id = 100 AND created_at > '2024-01-01' 
ORDER BY created_at DESC;

-- Best Composite Index columns:
(user_id, status, created_at)

Columns filtered with exact match (=) must come first.
Columns utilized in sorting (ORDER BY) come second.
Columns queried with range comparisons (>, <, BETWEEN) must come last. Range columns break B+Tree index traversal chain for subsequent columns.

Covering Indexes & Index-Only Scans

A covering index contains all columns requested in the query. For example, if you index (user_id, status) and query: SELECT status FROM users WHERE user_id = 10, the database reads the index leaf node directly and returns the response. It completely skips accessing raw table data on disk, reducing I/O operations to near zero.

Query Execution Analysis (EXPLAIN)

To verify database optimizer decisions, prefix queries with EXPLAIN or EXPLAIN ANALYZE. High-performing queries will list:

type: ref or eq_ref (using index lookup).
Extra: Using index (confirming covering index).
Low rows scanned ratio (ideally close to returned count).
Avoid Extra: Using filesort or Extra: Using temporary (indicates slow non-indexed in-memory sorting).

LSM-Tree Write Path (Cassandra, RocksDB, LevelDB)

B+Tree indexes optimize for read latency — every INSERT updates random pages on disk, which hurts write-heavy workloads. Log-Structured Merge (LSM) trees flip the model: writes append sequentially to an in-memory memtable and a write-ahead log; when the memtable fills, it flushes to immutable SST files on disk. Reads check memtable → bloom filter → SST layers (newest to oldest).

Write amplification: Background compaction merges SST files — disk I/O continues after writes "complete."
Read amplification: A key may exist in multiple SST layers until compaction runs; bloom filters skip absent keys cheaply.
Interview fit: Choose LSM-backed stores for append-heavy telemetry, messaging metadata, and write-heavy KV — choose B+Tree OLTP when complex indexed reads dominate.