Database Indexes
The Problem: Full Table Scans
Suppose you have a table with three columns: id, name, and salary.
```sql
SELECT * FROM employees WHERE name = 'Felix';
```

To understand why this can be slow, you need to understand how data is physically stored.
How Data Lives on Disk
Databases store records in contiguous blocks on disk. In PostgreSQL, each block (called a page) is 8KB by default. When you run a query without an index, the DB must:
- Load blocks from disk into memory
- Scan every record in every block
- Return matches
This is a Full Table Scan — and its time complexity is O(n).
```mermaid
flowchart LR
    Q["Query: name = 'Felix'"]
    Q --> B1["Block 1\nAlex, Maria, Tom"]
    Q --> B2["Block 2\nJane, Felix, Sam"]
    Q --> B3["Block 3\nRaj, Lee, Anna"]
    B2 -->|"✓ Found!"| R["Return Felix"]
    B1 -->|"✗ No match"| skip1[" "]
    B3 -->|"✗ No match"| skip2[" "]
    style skip1 fill:none,stroke:none
    style skip2 fill:none,stroke:none
    style R fill:#4ade80,color:#000
    style Q fill:#818cf8,color:#fff
```
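The scan above can be sketched as a linear search in Python. This is a toy model (blocks as plain lists, records as names), not how a real storage engine lays out pages:

```python
# Toy model of a full table scan: every block is read and every
# record inspected, regardless of where the match lives.
blocks = [
    ["Alex", "Maria", "Tom"],   # Block 1
    ["Jane", "Felix", "Sam"],   # Block 2
    ["Raj", "Lee", "Anna"],     # Block 3
]

def full_table_scan(blocks, target):
    reads = 0
    matches = []
    for block in blocks:        # load EVERY block from "disk"
        reads += 1
        for record in block:    # scan EVERY record in the block
            if record == target:
                matches.append(record)
    return matches, reads

matches, reads = full_table_scan(blocks, "Felix")
print(matches, reads)  # all 3 blocks are read even though only 1 matches
```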
The Scale Problem
With 1,000 records, a full table scan is fast. With 100 million records, it’s catastrophic. Every query reads the entire table, regardless of how many rows match.
The Solution: Database Indexes
An index is a separate data structure the database maintains alongside your table. You define an index on a specific column (or columns), and the DB keeps it updated automatically.
Key Insight
An index doesn’t store your full row data — it stores column values alongside pointers to where the full record lives (block number + offset within that block).
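As a toy model, you can picture the index as a mapping from column values to (block, offset) pointers. The data here is illustrative:

```python
# Toy model: the index maps a column value to the physical
# location of the full row, not to the row data itself.
table_blocks = {
    1: [("Alex", 90000), ("Maria", 75000), ("Tom", 60000)],
    2: [("Jane", 82000), ("Felix", 70000), ("Sam", 55000)],
}

# Index on "name": value -> (block number, offset within block)
name_index = {
    "Alex": (1, 0), "Felix": (2, 1), "Jane": (2, 0),
    "Maria": (1, 1), "Sam": (2, 2), "Tom": (1, 2),
}

def lookup(name):
    block_no, offset = name_index[name]    # cheap index probe
    return table_blocks[block_no][offset]  # fetch ONE block, one row

print(lookup("Felix"))  # ('Felix', 70000)
```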
How an Index Query Works
```mermaid
flowchart TD
    Q["Query: WHERE name = 'Felix'"]
    Q --> I["🌲 B-Tree Index\non 'name' column"]
    I -->|"O(log n) traversal"| P["Pointer: Block 2, Row 5"]
    P --> D["Fetch ONLY Block 2 from disk"]
    D --> R["✓ Return Felix's record"]
    style Q fill:#818cf8,color:#fff
    style I fill:#fb923c,color:#fff
    style R fill:#4ade80,color:#000
```
Performance Comparison
| Approach | Time Complexity | Disk Reads | 1M rows |
|---|---|---|---|
| Full Table Scan | O(n) | All blocks | ~125,000 block reads |
| B-Tree Index | O(log n) | Tree depth only | ~20 key comparisons, ~3–4 block reads |
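The numbers in the table can be sanity-checked with back-of-envelope arithmetic. The rows-per-block and fanout figures below are assumptions (wide rows at ~8 per 8KB page, a few hundred keys per internal node), not measured values:

```python
import math

rows = 1_000_000
rows_per_block = 8            # assumption: wide rows, ~8 per 8KB page

# Full table scan: read every block
full_scan_reads = rows / rows_per_block
print(full_scan_reads)        # 125000.0

# Binary-search-style comparisons over sorted keys
comparisons = math.ceil(math.log2(rows))
print(comparisons)            # 20

# B-tree: block reads equal tree depth; high fanout keeps it shallow
fanout = 300                  # assumption: keys per internal node
depth = math.ceil(math.log(rows, fanout))
print(depth)                  # 3
```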
The B-Tree: How Indexes Are Stored
Indexes in Postgres (and most relational DBs) are stored in a B-Tree — a self-balancing tree structure that keeps data sorted and allows fast lookups.
```mermaid
graph TD
    Root["[Maria]"]
    Root --> L["[Felix | Jane]"]
    Root --> R["[Sam | Tom]"]
    L --> LL["[Alex | Anna]"]
    L --> LR["[Felix → Block 2, Offset 5]"]
    L --> LM["[Jane → Block 2, Offset 6]"]
    R --> RL["[Raj → Block 3, Offset 1]"]
    R --> RR["[Tom → Block 1, Offset 3]"]
    style Root fill:#fb923c,color:#fff
    style LR fill:#4ade80,color:#000
```
B-Tree Properties
- Balanced: All leaf nodes are at the same depth
- Sorted: Keys are kept in order, enabling range queries
- Self-balancing: Automatically reorganizes on insert/delete
- Search complexity: O(log n)
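The O(log n) search over sorted keys can be sketched with Python's `bisect`. A flat sorted array stands in for the tree's leaf level here; the pointer values are illustrative:

```python
import bisect

# Sorted (key, pointer) pairs, standing in for the B-tree leaf level.
keys     = ["Alex", "Anna", "Felix", "Jane", "Raj", "Tom"]
pointers = [(1, 0), (3, 2), (2, 1), (2, 0), (3, 0), (1, 2)]

def index_lookup(name):
    i = bisect.bisect_left(keys, name)   # O(log n) binary search
    if i < len(keys) and keys[i] == name:
        return pointers[i]               # (block, offset)
    return None

print(index_lookup("Felix"))  # (2, 1)

# Sorted order also makes range scans cheap: find the endpoints,
# then slice the contiguous run between them.
lo = bisect.bisect_left(keys, "Felix")
hi = bisect.bisect_right(keys, "Raj")
print(keys[lo:hi])  # ['Felix', 'Jane', 'Raj']
```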
Creating an Index
```sql
-- Create an index on the name column
CREATE INDEX idx_employees_name ON employees(name);

-- Composite index (multiple columns)
CREATE INDEX idx_employees_name_salary ON employees(name, salary);

-- Unique index (enforces uniqueness + speeds up lookups)
CREATE UNIQUE INDEX idx_employees_id ON employees(id);
```

Postgres automatically creates an index on `PRIMARY KEY` and `UNIQUE` columns — you don't need to do this manually.
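To confirm a query actually uses an index, ask the planner for its plan. A minimal sketch using SQLite from Python (Postgres would use `EXPLAIN` similarly; the table and index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, salary INTEGER)")
conn.execute("CREATE INDEX idx_employees_name ON employees(name)")

# Ask the planner how it would run the query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM employees WHERE name = 'Felix'"
).fetchall()
print(plan)  # the plan text mentions idx_employees_name
```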
✅ Pros of Indexing
Benefits
| Benefit | Detail |
|---|---|
| Faster reads | Dramatically reduces lookup time: O(n) → O(log n) |
| Faster range queries | B-Trees are sorted, so WHERE salary BETWEEN 50000 AND 80000 is efficient |
| Faster ORDER BY | If the index matches the sort column, sorting is free |
| Faster JOIN operations | Joins on indexed foreign keys are significantly faster |
| Supports uniqueness constraints | Unique indexes enforce data integrity at the DB level |
❌ Cons of Indexing
Tradeoffs
| Drawback | Detail |
|---|---|
| Write overhead | Every INSERT, UPDATE, and DELETE must also update the index |
| Storage cost | Each index is a full separate data structure stored on disk |
| Index bloat | Over time, B-Trees can fragment and require REINDEX or VACUUM |
| Query planner confusion | Too many indexes can confuse the optimizer into picking a bad plan |
| Composite index ordering matters | A composite index on (name, salary) won’t help a query filtering only on salary |
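The last tradeoff can be seen directly: with an index on (name, salary), a filter on the leading column can seek into the index, but a filter on salary alone cannot. A sketch with SQLite from Python (Postgres B-tree indexes behave the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.execute("CREATE INDEX idx_name_salary ON employees(name, salary)")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the detail text.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

# Leading column: the composite index supports a SEARCH (seek).
print(plan("SELECT * FROM employees WHERE name = 'Felix'"))

# Trailing column only: no seek is possible, so the planner scans.
print(plan("SELECT * FROM employees WHERE salary > 80000"))
```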
When to Use an Index
Good candidates for indexing
- High-cardinality columns — columns with many distinct values (e.g., `email`, `user_id`, `name`)
- Columns used frequently in `WHERE` clauses
- Foreign key columns used in `JOIN` operations
- Columns used in `ORDER BY` or `GROUP BY`
- Columns with range queries (`BETWEEN`, `>`, `<`)
```sql
-- These queries benefit greatly from indexes:
SELECT * FROM employees WHERE name = 'Felix';    -- equality
SELECT * FROM employees WHERE salary > 80000;    -- range
SELECT * FROM orders WHERE user_id = 42;         -- FK join
SELECT * FROM employees ORDER BY name LIMIT 100; -- sort
```

When NOT to Use an Index
Poor candidates for indexing
- Low-cardinality columns — columns with very few distinct values (e.g., `is_active BOOLEAN`, `status ENUM('active','inactive')`). An index on a boolean column with a 50% true/false split is often slower than a full scan.
- Small tables — if a table has fewer than ~1,000 rows, the query planner will likely ignore the index anyway
- Write-heavy tables — if a table has far more writes than reads, indexes hurt more than they help (e.g., log/event tables)
- Columns rarely used in queries — unused indexes waste disk space and slow down writes with no benefit
- Columns with frequent bulk updates — indexes must be rebuilt on every bulk operation
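The low-cardinality point comes down to selectivity: if a predicate matches half the table, the matching rows are spread across nearly every block, so the index saves no reads. A back-of-envelope model (rows-per-block figure is the same assumption as earlier):

```python
rows = 1_000_000
rows_per_block = 8                  # assumption: ~8 rows per 8KB page

def expected_block_reads(selectivity):
    # Rough model: matching rows are spread evenly across blocks,
    # so a non-selective predicate touches nearly every block anyway.
    matching_rows = rows * selectivity
    total_blocks = rows / rows_per_block
    return min(matching_rows, total_blocks)  # capped at blocks that exist

print(expected_block_reads(0.5))      # boolean column: ~every block
print(expected_block_reads(0.0001))   # high-cardinality: ~100 blocks
```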
```mermaid
flowchart TD
    A["Should I add an index?"]
    A --> B{"Is the table\nlarge? (10k+ rows)"}
    B -->|No| N1["❌ Skip it — full scan is fine"]
    B -->|Yes| C{"Is this column used\nin WHERE / JOIN / ORDER BY?"}
    C -->|No| N2["❌ Skip it — unused index"]
    C -->|Yes| D{"Is the column\nhigh-cardinality?"}
    D -->|No| N3["⚠️ Probably skip — low selectivity"]
    D -->|Yes| E{"Read-heavy\nor write-heavy?"}
    E -->|Write-heavy| N4["⚠️ Weigh tradeoffs carefully"]
    E -->|Read-heavy| Y["✅ Add the index"]
    style Y fill:#4ade80,color:#000
    style N1 fill:#f87171,color:#fff
    style N2 fill:#f87171,color:#fff
    style N3 fill:#fb923c,color:#fff
    style N4 fill:#fb923c,color:#fff
```
Summary
```mermaid
flowchart LR
    subgraph Without Index
        W1["Full Table Scan"] --> W2["O(n) — reads ALL blocks"]
    end
    subgraph With Index
        I1["B-Tree Lookup"] --> I2["O(log n) — reads ~3 blocks"]
    end
    W2 -->|"Slow at scale"| X["😬"]
    I2 -->|"Fast at scale"| Y["⚡"]
```
TL;DR
- Indexes trade write speed + disk space for read speed
- Use them on large tables, high-cardinality columns, frequently queried columns
- Avoid on small tables, low-cardinality columns, and write-heavy workloads
- Postgres auto-indexes `PRIMARY KEY` and `UNIQUE` constraints