Database Indexes

The Problem: Full Table Scans

Suppose you have a table with three columns: id, name, and salary.

SELECT * FROM employees WHERE name = 'Felix';

To understand why this can be slow, you need to understand how data is physically stored.

How Data Lives on Disk

Databases store records in contiguous blocks on disk. In PostgreSQL, each block (also called a page) is 8 KB by default. When you run a query, the DB must:

  1. Load blocks from disk into memory
  2. Scan every record in every block
  3. Return matches

This is a Full Table Scan — and its time complexity is O(n).

flowchart LR
    Q["Query: name = 'Felix'"]
    Q --> B1["Block 1\nAlex, Maria, Tom"]
    Q --> B2["Block 2\nJane, Felix, Sam"]
    Q --> B3["Block 3\nRaj, Lee, Anna"]
    B2 -->|"✓ Found!"| R["Return Felix"]
    B1 -->|"✗ No match"| skip1[" "]
    B3 -->|"✗ No match"| skip2[" "]

    style skip1 fill:none,stroke:none
    style skip2 fill:none,stroke:none
    style R fill:#4ade80,color:#000
    style Q fill:#818cf8,color:#fff
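The scan in the diagram can be sketched as a linear search over blocks. The block contents and names below are hypothetical, mirroring the diagram:

```python
# Hypothetical block layout: each disk block holds a handful of records.
blocks = [
    ["Alex", "Maria", "Tom"],   # Block 1
    ["Jane", "Felix", "Sam"],   # Block 2
    ["Raj", "Lee", "Anna"],     # Block 3
]

def full_table_scan(blocks, target):
    """Load every block and check every record: O(n) in table size."""
    blocks_read = 0
    matches = []
    for block in blocks:
        blocks_read += 1              # one disk read per block
        for record in block:
            if record == target:
                matches.append(record)
    return matches, blocks_read

matches, reads = full_table_scan(blocks, "Felix")
print(matches, reads)  # ['Felix'] 3 — all three blocks are read for a single match
```

Note that the scan cannot stop early in general: without an index, the DB has no way to know Block 3 contains no second `Felix`.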

The Scale Problem

With 1,000 records, a full table scan is fast. With 100 million records, it’s catastrophic. Every query reads the entire table, regardless of how many rows match.
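Back-of-envelope numbers make the gap concrete. Assuming a hypothetical ~1 KB row, so about 8 rows per 8 KB block:

```python
import math

ROWS_PER_BLOCK = 8   # assumption: ~1 KB rows in an 8 KB block

for n in (1_000, 1_000_000, 100_000_000):
    scan_reads = math.ceil(n / ROWS_PER_BLOCK)   # a full scan touches every block
    index_reads = math.ceil(math.log2(n))        # loose upper bound on indexed lookups
    print(f"{n:>11,} rows: scan ~{scan_reads:>10,} reads, index ~{index_reads} reads")
```

In practice real B-Trees pack hundreds of keys per page, so actual depth is closer to 3–4 even at 100 million rows; log₂(n) is a deliberately pessimistic bound.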


The Solution: Database Indexes

An index is a separate data structure the database maintains alongside your table. You define an index on a specific column (or columns), and the DB keeps it updated automatically.

Key Insight

An index doesn’t store your full row data — it stores column values alongside pointers to where the full record lives (block number + offset within that block).
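A minimal sketch of that idea, with hypothetical names and pointers: entries are (value, (block, offset)) pairs kept sorted by value, so a lookup can binary-search them instead of scanning the table.

```python
import bisect

# Sorted (value, pointer) pairs; the pointer is (block_number, offset).
name_index = sorted([
    ("Alex",  (1, 0)), ("Maria", (1, 1)), ("Tom",  (1, 2)),
    ("Jane",  (2, 0)), ("Felix", (2, 1)), ("Sam",  (2, 2)),
    ("Raj",   (3, 0)), ("Lee",   (3, 1)), ("Anna", (3, 2)),
])

def index_lookup(index, target):
    """Binary-search the sorted entries, then follow the pointer: O(log n)."""
    keys = [k for k, _ in index]
    i = bisect.bisect_left(keys, target)
    if i < len(index) and index[i][0] == target:
        return index[i][1]    # (block, offset) of the full row
    return None

print(index_lookup(name_index, "Felix"))  # (2, 1) — fetch only Block 2
```

The real structure is a B-Tree rather than a flat sorted list, but the contract is the same: a small, sorted structure that answers "which block holds this value?" without touching the table.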

How an Index Query Works

flowchart TD
    Q["Query: WHERE name = 'Felix'"]
    Q --> I["🌲 B-Tree Index\non 'name' column"]
    I -->|"O(log n) traversal"| P["Pointer: Block 2, Row 5"]
    P --> D["Fetch ONLY Block 2 from disk"]
    D --> R["✓ Return Felix's record"]

    style Q fill:#818cf8,color:#fff
    style I fill:#fb923c,color:#fff
    style R fill:#4ade80,color:#000

Performance Comparison

| Approach | Time Complexity | Disk Reads | At 1M rows |
|---|---|---|---|
| Full Table Scan | O(n) | All blocks | ~125,000 block reads |
| B-Tree Index | O(log n) | ~3–4 blocks | ~20 block reads |

The B-Tree: How Indexes Are Stored

Indexes in Postgres (and most relational DBs) are stored in a B-Tree — a self-balancing tree structure that keeps data sorted and allows fast lookups.

graph TD
    Root["[Maria]"]
    Root --> L["[Felix | Jane]"]
    Root --> R["[Sam | Tom]"]
    L --> LL["[Alex | Anna]"]
    L --> LR["[Felix → Block 2, Offset 5]"]
    L --> LM["[Jane → Block 2, Offset 6]"]
    R --> RL["[Raj → Block 3, Offset 1]"]
    R --> RR["[Tom → Block 1, Offset 3]"]

    style Root fill:#fb923c,color:#fff
    style LR fill:#4ade80,color:#000

B-Tree Properties

  • Balanced: All leaf nodes are at the same depth
  • Sorted: Keys are kept in order, enabling range queries
  • Self-balancing: Automatically reorganizes on insert/delete
  • Search complexity: O(log n)
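The sorted property is what makes range queries cheap: find the lower bound in O(log n), then walk forward in key order. A sketch using a sorted list as a stand-in for B-Tree leaf pages, with hypothetical salaries:

```python
import bisect

salaries = sorted([42000, 55000, 61000, 78000, 80000, 95000, 120000])

def range_query(sorted_keys, low, high):
    """Locate the first key >= low in O(log n), then slice up to high."""
    start = bisect.bisect_left(sorted_keys, low)
    end = bisect.bisect_right(sorted_keys, high)
    return sorted_keys[start:end]

print(range_query(salaries, 50000, 80000))  # [55000, 61000, 78000, 80000]
```

This is exactly why `WHERE salary BETWEEN 50000 AND 80000` is fast on an indexed column: one descent to find the start, then a sequential walk over already-sorted keys.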

Creating an Index

-- Create an index on the name column
CREATE INDEX idx_employees_name ON employees(name);
 
-- Composite index (multiple columns)
CREATE INDEX idx_employees_name_salary ON employees(name, salary);
 
-- Unique index (enforces uniqueness + speeds up lookups)
CREATE UNIQUE INDEX idx_employees_id ON employees(id);

Postgres automatically creates an index for PRIMARY KEY and UNIQUE constraints — you don't need to do this manually.
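You can watch the query planner change its mind using Python's built-in sqlite3 module (SQLite rather than Postgres, but the behavior is analogous; in Postgres you'd inspect the plan with EXPLAIN):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees (name, salary) VALUES (?, ?)",
                 [(f"user{i}", 30000 + i) for i in range(10_000)])

query = "SELECT * FROM employees WHERE name = 'user42'"

# Before the index: the planner has no choice but a full scan.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_before[0][3])   # e.g. "SCAN employees"

conn.execute("CREATE INDEX idx_employees_name ON employees(name)")

# After: the planner switches to an index search.
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_after[0][3])    # e.g. "SEARCH employees USING INDEX idx_employees_name (name=?)"
```

The exact plan wording varies by SQLite version, but the SCAN → SEARCH transition is the point: the same query goes from touching every row to touching a handful of index pages.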


✅ Pros of Indexing

Benefits

| Benefit | Detail |
|---|---|
| Faster reads | Dramatically reduces lookup time: O(n) → O(log n) |
| Faster range queries | B-Trees are sorted, so WHERE salary BETWEEN 50000 AND 80000 is efficient |
| Faster ORDER BY | If the index matches the sort column, sorting is free |
| Faster JOIN operations | Joins on indexed foreign keys are significantly faster |
| Supports uniqueness constraints | Unique indexes enforce data integrity at the DB level |

❌ Cons of Indexing

Tradeoffs

| Drawback | Detail |
|---|---|
| Write overhead | Every INSERT, UPDATE, and DELETE must also update the index |
| Storage cost | Each index is a full separate data structure stored on disk |
| Index bloat | Over time, B-Trees can fragment and require REINDEX or VACUUM |
| Query planner confusion | Too many indexes can confuse the optimizer into picking a bad plan |
| Composite index ordering matters | A composite index on (name, salary) won’t help a query filtering only on salary |
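The last row — the leftmost-prefix rule — is easy to verify from a query plan. Sketch below uses sqlite3 for portability; Postgres B-Tree indexes follow the same rule:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, dept TEXT)")
conn.execute("CREATE INDEX idx_name_salary ON employees(name, salary)")

def plan(sql):
    """Return the planner's summary line(s) for a query."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

leading  = plan("SELECT * FROM employees WHERE name = 'Felix'")
both     = plan("SELECT * FROM employees WHERE name = 'Felix' AND salary > 80000")
trailing = plan("SELECT * FROM employees WHERE salary > 80000")

print(leading)   # SEARCH using idx_name_salary — leading column works
print(both)      # SEARCH using idx_name_salary — both columns work
print(trailing)  # SCAN — the trailing column alone can't use the index
```

Because the index is sorted by name first and salary second, salaries are only in order *within* each name; filtering on salary alone has no sorted prefix to descend, so the planner falls back to a scan.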

When to Use an Index

Good candidates for indexing

  • High-cardinality columns — columns with many distinct values (e.g., email, user_id, name)
  • Columns used in WHERE clauses frequently
  • Foreign key columns used in JOIN operations
  • Columns used in ORDER BY or GROUP BY
  • Columns with range queries (BETWEEN, >, <)

-- These queries benefit greatly from indexes:
SELECT * FROM employees WHERE name = 'Felix';           -- equality
SELECT * FROM employees WHERE salary > 80000;           -- range
SELECT * FROM orders WHERE user_id = 42;                -- FK join
SELECT * FROM employees ORDER BY name LIMIT 100;        -- sort

When NOT to Use an Index

Poor candidates for indexing

  • Low-cardinality columns — columns with very few distinct values (e.g., is_active BOOLEAN, status ENUM('active','inactive'))
    • An index on a boolean column with 50% true/false is often slower than a full scan
  • Small tables — if a table has fewer than ~1,000 rows, the query planner will likely ignore the index anyway
  • Write-heavy tables — if a table has far more writes than reads, indexes hurt more than they help (e.g., log/event tables)
  • Columns rarely used in queries — unused indexes waste disk space and slow down writes with no benefit
  • Columns with frequent bulk updates — indexes must be rebuilt on every bulk operation

flowchart TD
    A["Should I add an index?"]
    A --> B{"Is the table\nlarge? (10k+ rows)"}
    B -->|No| N1["❌ Skip it — full scan is fine"]
    B -->|Yes| C{"Is this column used\nin WHERE / JOIN / ORDER BY?"}
    C -->|No| N2["❌ Skip it — unused index"]
    C -->|Yes| D{"Is the column\nhigh-cardinality?"}
    D -->|No| N3["⚠️ Probably skip — low selectivity"]
    D -->|Yes| E{"Read-heavy\nor write-heavy?"}
    E -->|Write-heavy| N4["⚠️ Weigh tradeoffs carefully"]
    E -->|Read-heavy| Y["✅ Add the index"]

    style Y fill:#4ade80,color:#000
    style N1 fill:#f87171,color:#fff
    style N2 fill:#f87171,color:#fff
    style N3 fill:#fb923c,color:#fff
    style N4 fill:#fb923c,color:#fff

Summary

flowchart LR
    subgraph Without Index
        W1["Full Table Scan"] --> W2["O(n) — reads ALL blocks"]
    end
    subgraph With Index
        I1["B-Tree Lookup"] --> I2["O(log n) — reads ~3 blocks"]
    end
    W2 -->|"Slow at scale"| X["😬"]
    I2 -->|"Fast at scale"| Y["⚡"]

TL;DR

  • Indexes trade write speed + disk space for read speed
  • Use them on large tables, high-cardinality columns, frequently queried columns
  • Avoid on small tables, low-cardinality columns, and write-heavy workloads
  • Postgres auto-indexes PRIMARY KEY and UNIQUE constraints
