Database Indexes

The Problem: Full Table Scans

Suppose you have a table with three columns: id, name, and salary.

SELECT * FROM employees WHERE name = 'Felix';

To understand why this can be slow, you need to understand how data is physically stored.

How Data Lives on Disk

Databases store records in contiguous blocks on disk. In PostgreSQL, each block (also called a page) is 8 KB by default. When you run a query, the DB must:

  1. Load blocks from disk into memory
  2. Scan every record in every block
  3. Return matches

This is a Full Table Scan — and its time complexity is O(n).

flowchart LR
    Q["Query: name = 'Felix'"]
    Q --> B1["Block 1\nAlex, Maria, Tom"]
    Q --> B2["Block 2\nJane, Felix, Sam"]
    Q --> B3["Block 3\nRaj, Lee, Anna"]
    B2 -->|"✓ Found!"| R["Return Felix"]
    B1 -->|"✗ No match"| skip1[" "]
    B3 -->|"✗ No match"| skip2[" "]

    style skip1 fill:none,stroke:none
    style skip2 fill:none,stroke:none
    style R fill:#4ade80,color:#000
    style Q fill:#818cf8,color:#fff
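The scan in the diagram can be sketched as a linear search over blocks. The block contents and names below are hypothetical, mirroring the diagram:

```python
# Hypothetical block layout: each disk block holds a handful of records.
blocks = [
    ["Alex", "Maria", "Tom"],   # Block 1
    ["Jane", "Felix", "Sam"],   # Block 2
    ["Raj", "Lee", "Anna"],     # Block 3
]

def full_table_scan(blocks, target):
    """Load every block and check every record: O(n) in table size."""
    blocks_read = 0
    matches = []
    for block in blocks:
        blocks_read += 1              # one disk read per block
        for record in block:
            if record == target:
                matches.append(record)
    return matches, blocks_read

matches, reads = full_table_scan(blocks, "Felix")
print(matches, reads)  # ['Felix'] 3 — all three blocks are read for a single match
```

Note that the scan cannot stop early in general: without an index, the DB has no way to know Block 3 contains no second `Felix`.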

The Scale Problem

With 1,000 records, a full table scan is fast. With 100 million records, it’s catastrophic. Every query reads the entire table, regardless of how many rows match.
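Back-of-envelope numbers make the gap concrete. Assuming a hypothetical ~1 KB row, so about 8 rows per 8 KB block:

```python
import math

ROWS_PER_BLOCK = 8   # assumption: ~1 KB rows in an 8 KB block

for n in (1_000, 1_000_000, 100_000_000):
    scan_reads = math.ceil(n / ROWS_PER_BLOCK)   # a full scan touches every block
    index_reads = math.ceil(math.log2(n))        # loose upper bound on indexed lookups
    print(f"{n:>11,} rows: scan ~{scan_reads:>10,} reads, index ~{index_reads} reads")
```

In practice real B-Trees pack hundreds of keys per page, so actual depth is closer to 3–4 even at 100 million rows; log₂(n) is a deliberately pessimistic bound.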


The Solution: Database Indexes

An index is a separate data structure the database maintains alongside your table. You define an index on a specific column (or columns), and the DB keeps it updated automatically.

Key Insight

An index doesn’t store your full row data — it stores column values alongside pointers to where the full record lives (block number + offset within that block).
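A minimal sketch of that idea, with hypothetical names and pointers: entries are (value, (block, offset)) pairs kept sorted by value, so a lookup can binary-search them instead of scanning the table.

```python
import bisect

# Sorted (value, pointer) pairs; the pointer is (block_number, offset).
name_index = sorted([
    ("Alex",  (1, 0)), ("Maria", (1, 1)), ("Tom",  (1, 2)),
    ("Jane",  (2, 0)), ("Felix", (2, 1)), ("Sam",  (2, 2)),
    ("Raj",   (3, 0)), ("Lee",   (3, 1)), ("Anna", (3, 2)),
])

def index_lookup(index, target):
    """Binary-search the sorted entries, then follow the pointer: O(log n)."""
    keys = [k for k, _ in index]
    i = bisect.bisect_left(keys, target)
    if i < len(index) and index[i][0] == target:
        return index[i][1]    # (block, offset) of the full row
    return None

print(index_lookup(name_index, "Felix"))  # (2, 1) — fetch only Block 2
```

The real structure is a B-Tree rather than a flat sorted list, but the contract is the same: a small, sorted structure that answers "which block holds this value?" without touching the table.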

How an Index Query Works

flowchart TD
    Q["Query: WHERE name = 'Felix'"]
    Q --> I["🌲 B-Tree Index\non 'name' column"]
    I -->|"O(log n) traversal"| P["Pointer: Block 2, Row 5"]
    P --> D["Fetch ONLY Block 2 from disk"]
    D --> R["✓ Return Felix's record"]

    style Q fill:#818cf8,color:#fff
    style I fill:#fb923c,color:#fff
    style R fill:#4ade80,color:#000

Performance Comparison

| Approach | Time Complexity | Disk Reads | At 1M rows |
|---|---|---|---|
| Full Table Scan | O(n) | All blocks | ~125,000 block reads |
| B-Tree Index | O(log n) | ~3–4 blocks | ~20 block reads |

The B-Tree: How Indexes Are Stored

Indexes in Postgres (and most relational DBs) are stored in a B-Tree — a self-balancing tree structure that keeps data sorted and allows fast lookups.

graph TD
    Root["[Maria]"]
    Root --> L["[Felix | Jane]"]
    Root --> R["[Sam | Tom]"]
    L --> LL["[Alex | Anna]"]
    L --> LR["[Felix → Block 2, Offset 5]"]
    L --> LM["[Jane → Block 2, Offset 6]"]
    R --> RL["[Raj → Block 3, Offset 1]"]
    R --> RR["[Tom → Block 1, Offset 3]"]

    style Root fill:#fb923c,color:#fff
    style LR fill:#4ade80,color:#000

B-Tree Properties

  • Balanced: All leaf nodes are at the same depth
  • Sorted: Keys are kept in order, enabling range queries
  • Self-balancing: Automatically reorganizes on insert/delete
  • Search complexity: O(log n)
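The sorted property is what makes range queries cheap: find the lower bound in O(log n), then walk forward in key order. A sketch using a sorted list as a stand-in for B-Tree leaf pages, with hypothetical salaries:

```python
import bisect

salaries = sorted([42000, 55000, 61000, 78000, 80000, 95000, 120000])

def range_query(sorted_keys, low, high):
    """Locate the first key >= low in O(log n), then slice up to high."""
    start = bisect.bisect_left(sorted_keys, low)
    end = bisect.bisect_right(sorted_keys, high)
    return sorted_keys[start:end]

print(range_query(salaries, 50000, 80000))  # [55000, 61000, 78000, 80000]
```

This is exactly why `WHERE salary BETWEEN 50000 AND 80000` is fast on an indexed column: one descent to find the start, then a sequential walk over already-sorted keys.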

Creating an Index

-- Create an index on the name column
CREATE INDEX idx_employees_name ON employees(name);
 
-- Composite index (multiple columns)
CREATE INDEX idx_employees_name_salary ON employees(name, salary);
 
-- Unique index (enforces uniqueness + speeds up lookups)
CREATE UNIQUE INDEX idx_employees_id ON employees(id);

Postgres automatically creates an index for PRIMARY KEY and UNIQUE constraints — you don't need to do this manually.
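You can watch the query planner change its mind using Python's built-in sqlite3 module (SQLite rather than Postgres, but the behavior is analogous; in Postgres you'd inspect the plan with EXPLAIN):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees (name, salary) VALUES (?, ?)",
                 [(f"user{i}", 30000 + i) for i in range(10_000)])

query = "SELECT * FROM employees WHERE name = 'user42'"

# Before the index: the planner has no choice but a full scan.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_before[0][3])   # e.g. "SCAN employees"

conn.execute("CREATE INDEX idx_employees_name ON employees(name)")

# After: the planner switches to an index search.
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_after[0][3])    # e.g. "SEARCH employees USING INDEX idx_employees_name (name=?)"
```

The exact plan wording varies by SQLite version, but the SCAN → SEARCH transition is the point: the same query goes from touching every row to touching a handful of index pages.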


✅ Pros of Indexing

Benefits

| Benefit | Detail |
|---|---|
| Faster reads | Dramatically reduces lookup time: O(n) → O(log n) |
| Faster range queries | B-Trees are sorted, so WHERE salary BETWEEN 50000 AND 80000 is efficient |
| Faster ORDER BY | If the index matches the sort column, sorting is free |
| Faster JOIN operations | Joins on indexed foreign keys are significantly faster |
| Supports uniqueness constraints | Unique indexes enforce data integrity at the DB level |

❌ Cons of Indexing

Tradeoffs

| Drawback | Detail |
|---|---|
| Write overhead | Every INSERT, UPDATE, and DELETE must also update the index |
| Storage cost | Each index is a full separate data structure stored on disk |
| Index bloat | Over time, B-Trees can fragment and require REINDEX or VACUUM |
| Query planner confusion | Too many indexes can confuse the optimizer into picking a bad plan |
| Composite index ordering matters | A composite index on (name, salary) won’t help a query filtering only on salary |
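The last row — the leftmost-prefix rule — is easy to verify from a query plan. Sketch below uses sqlite3 for portability; Postgres B-Tree indexes follow the same rule:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, dept TEXT)")
conn.execute("CREATE INDEX idx_name_salary ON employees(name, salary)")

def plan(sql):
    """Return the planner's summary line(s) for a query."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

leading  = plan("SELECT * FROM employees WHERE name = 'Felix'")
both     = plan("SELECT * FROM employees WHERE name = 'Felix' AND salary > 80000")
trailing = plan("SELECT * FROM employees WHERE salary > 80000")

print(leading)   # SEARCH using idx_name_salary — leading column works
print(both)      # SEARCH using idx_name_salary — both columns work
print(trailing)  # SCAN — the trailing column alone can't use the index
```

Because the index is sorted by name first and salary second, salaries are only in order *within* each name; filtering on salary alone has no sorted prefix to descend, so the planner falls back to a scan.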

When to Use an Index

Good candidates for indexing

  • High-cardinality columns — columns with many distinct values (e.g., email, user_id, name)
  • Columns used in WHERE clauses frequently
  • Foreign key columns used in JOIN operations
  • Columns used in ORDER BY or GROUP BY
  • Columns with range queries (BETWEEN, >, <)

-- These queries benefit greatly from indexes:
SELECT * FROM employees WHERE name = 'Felix';           -- equality
SELECT * FROM employees WHERE salary > 80000;           -- range
SELECT * FROM orders WHERE user_id = 42;                -- FK join
SELECT * FROM employees ORDER BY name LIMIT 100;        -- sort

When NOT to Use an Index

Poor candidates for indexing

  • Low-cardinality columns — columns with very few distinct values (e.g., is_active BOOLEAN, status ENUM('active','inactive'))
    • An index on a boolean column with 50% true/false is often slower than a full scan
  • Small tables — if a table has fewer than ~1,000 rows, the query planner will likely ignore the index anyway
  • Write-heavy tables — if a table has far more writes than reads, indexes hurt more than they help (e.g., log/event tables)
  • Columns rarely used in queries — unused indexes waste disk space and slow down writes with no benefit
  • Columns with frequent bulk updates — indexes must be rebuilt on every bulk operation

flowchart TD
    A["Should I add an index?"]
    A --> B{"Is the table\nlarge? (10k+ rows)"}
    B -->|No| N1["❌ Skip it — full scan is fine"]
    B -->|Yes| C{"Is this column used\nin WHERE / JOIN / ORDER BY?"}
    C -->|No| N2["❌ Skip it — unused index"]
    C -->|Yes| D{"Is the column\nhigh-cardinality?"}
    D -->|No| N3["⚠️ Probably skip — low selectivity"]
    D -->|Yes| E{"Read-heavy\nor write-heavy?"}
    E -->|Write-heavy| N4["⚠️ Weigh tradeoffs carefully"]
    E -->|Read-heavy| Y["✅ Add the index"]

    style Y fill:#4ade80,color:#000
    style N1 fill:#f87171,color:#fff
    style N2 fill:#f87171,color:#fff
    style N3 fill:#fb923c,color:#fff
    style N4 fill:#fb923c,color:#fff

Summary

flowchart LR
    subgraph Without Index
        W1["Full Table Scan"] --> W2["O(n) — reads ALL blocks"]
    end
    subgraph With Index
        I1["B-Tree Lookup"] --> I2["O(log n) — reads ~3 blocks"]
    end
    W2 -->|"Slow at scale"| X["😬"]
    I2 -->|"Fast at scale"| Y["⚡"]

TL;DR

  • Indexes trade write speed + disk space for read speed
  • Use them on large tables, high-cardinality columns, frequently queried columns
  • Avoid on small tables, low-cardinality columns, and write-heavy workloads
  • Postgres auto-indexes PRIMARY KEY and UNIQUE constraints
