Indexing
Why Index?
Without an index, finding a record requires scanning potentially the entire file — O(b) block accesses. An index is an auxiliary data structure that maps key values to the blocks containing those records, reducing search to O(log b) or better.
Trade-off: Indexes speed up search and range queries but slow down insertions and deletions (must update the index). Choose indexes based on query patterns.
Index Types
Primary vs Secondary Index
| | Primary Index | Secondary Index |
| --- | --- | --- |
| Built on | Ordering key field of an ordered file | Any non-ordering field |
| Entry type | Sparse or dense | Usually dense |
| One index entry per | Data block (sparse) or record (dense) | Every record |
| Ordering | File must be sorted on the index key | File need not be sorted |
Dense vs Sparse Index
| | Dense Index | Sparse Index |
| --- | --- | --- |
| Entries | One per record | One per block |
| Points to | Individual record | First record in the block |
| Space | More | Less |
| Useful for | Non-ordered files, secondary indexes | Ordered files (primary) |
| Requires sorted file? | No | Yes |
Primary Index (Sparse — one entry per data block)
A primary index is built on the ordering key field of a sorted data file. One index entry per data block — points to the first record (anchor record) in each block.
Index entry = (key value of anchor record, pointer to block)
Index entry size = K + P (key size + pointer size)
Building the index:
Index entries = number of data blocks (b)
bfr_i = ⌊B / (K + P)⌋ (index blocking factor)
Index blocks = ⌈b / bfr_i⌉
Search algorithm:
- Binary search on the index to find the block containing the target key
- Fetch that one data block → 1 more access
- Total accesses = ⌈log₂(index blocks)⌉ + 1
Worked example:
- File: 50,000 records, R = 300 bytes, B = 4096 → bfr = 13 → data blocks = 3847
- Index: K = 9 bytes, P = 6 bytes → entry = 15 bytes → bfr_i = 273 → index blocks = 15
- Search: ⌈log₂(15)⌉ + 1 = 4 + 1 = 5 block accesses (vs 12 for binary search on data file)
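The worked example can be verified with a short calculation (variable names are illustrative):

```python
import math

B = 4096      # block size (bytes)
R = 300       # record size (bytes)
r = 50_000    # number of records
K, P = 9, 6   # key size and block-pointer size (bytes)

bfr = B // R                                  # data blocking factor -> 13
data_blocks = math.ceil(r / bfr)              # -> 3847
bfr_i = B // (K + P)                          # index blocking factor -> 273
index_blocks = math.ceil(data_blocks / bfr_i) # one sparse entry per data block -> 15

# Binary search on the index blocks, plus one data-block access.
accesses = math.ceil(math.log2(index_blocks)) + 1  # -> 5
```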
Clustering Index (on non-key ordering field)
When a file is sorted on a non-key field (multiple records can have the same value), a clustering index has one entry per distinct value of the field, pointing to the first block with that value.
Secondary Index (Dense — one entry per record)
Built on a non-ordering field. Must be dense because records with the same secondary key value are scattered throughout the file.
Index entries = number of records (r)
Index blocks = ⌈r / bfr_i⌉
Problem: For a non-key secondary index, multiple records share the same index key value. Two approaches:
- Duplicate index entries: One entry per record with that key value (wastes index space)
- Bucket pointer: Index entry points to a bucket of record pointers — space efficient
Multi-Level Index
An index on an index. When the first-level index is too large to binary search efficiently, build another (smaller) index on top.
Level 1: ⌈r / bfr_i⌉ blocks (or ⌈b / bfr_i⌉ for sparse)
Level 2: ⌈Level 1 blocks / bfr_i⌉ blocks
...
Stop when level k has exactly 1 block
Total accesses = number of levels + 1 (for data block)
Number of levels: t = ⌈log_{bfr_i}(r)⌉ for dense, ⌈log_{bfr_i}(b)⌉ for sparse
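The level count can also be found by repeated division, which matches the logarithm formula; a minimal sketch using the bfr_i from the earlier worked example:

```python
import math

def index_levels(first_level_entries, bfr_i):
    """Count index levels until the top level fits in a single block."""
    levels = 1
    blocks = math.ceil(first_level_entries / bfr_i)
    while blocks > 1:          # build another level on top of the current one
        blocks = math.ceil(blocks / bfr_i)
        levels += 1
    return levels

# Dense first level over r = 50,000 records: 184 blocks, then 1 block -> 2 levels.
# Sparse first level over b = 3,847 blocks: 15 blocks, then 1 block -> 2 levels.
```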
A multi-level index with a branching factor of bfr_i is essentially a static B-tree.
B+ Trees
The standard dynamic index structure used in almost all production databases.
Structure
- Internal (non-leaf) nodes: Contain keys only — used for navigation/search
- Leaf nodes: Contain keys + pointers to actual data records; linked in a chain for range queries
- Root: Special case — can have as few as 1 key (2 children) if non-leaf, or 0 keys if it is the only node (a leaf)
Order (Degree) n
Non-leaf node: can hold up to n-1 keys and n child pointers
Constraint: n × P + (n-1) × K ≤ B
Solve for n: n = ⌊(B + K) / (K + P)⌋
Leaf node: holds up to n-1 (key, record-pointer) pairs + 1 sibling pointer
Constraint: (n-1) × (K + P_data) + P_sibling ≤ B
Solve for n-1: n-1 = ⌊(B - P_sibling) / (K + P_data)⌋
Where K = key size, P = node pointer size, P_data = data record pointer size.
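A quick sanity check of these formulas, plugging in the illustrative sizes from the earlier worked example (B = 4096, K = 9, all pointers 6 bytes):

```python
B, K, P = 4096, 9, 6   # block, key, and node-pointer sizes (bytes)
P_data, P_sibling = 6, 6

# Non-leaf: n*P + (n-1)*K <= B  =>  n = floor((B + K) / (K + P))
n_internal = (B + K) // (K + P)                 # -> 273

# Leaf: (n-1)*(K + P_data) + P_sibling <= B
leaf_entries = (B - P_sibling) // (K + P_data)  # n-1 -> 272
```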
Capacity Constraints (for exam — critical)
| Node Type | Minimum | Maximum |
| --- | --- | --- |
| Non-leaf keys | ⌈n/2⌉ - 1 | n - 1 |
| Non-leaf children | ⌈n/2⌉ | n |
| Leaf keys | ⌈(n-1)/2⌉ | n - 1 |
| Root (non-leaf) | 1 key, 2 children | n - 1 keys, n children |
| Root (leaf, i.e., only node) | 0 keys | n - 1 keys |
Height
For N records, the height h is bounded by the minimum-occupancy rules:
- Minimum occupancy: the root has ≥ 2 children, every other internal node has ≥ ⌈n/2⌉ children, and every leaf holds ≥ ⌈(n-1)/2⌉ keys
- Hence the maximum height: h ≤ ⌊log_{⌈n/2⌉}(N)⌋ + 1
Typical search: h block accesses (one per level) + 1 for data block = h+1 total
Insertion
- Search to find the correct leaf node
- If leaf has room, insert key in sorted order
- If leaf is full (overflow): split the leaf
- Move ⌈(n-1)/2⌉ entries to a new leaf (right half)
- Copy the smallest key of the new leaf up to the parent
- Update leaf chain pointers
- If parent is also full: split the internal node
- Move ⌈n/2⌉ - 1 keys to new node
- Push the middle key up to the parent (the key is removed from the splitting internal node — “pushed”, not “copied”)
- Splits propagate upward; root split increases tree height
Key distinction:
- Leaf splits: key is copied up (stays in leaf + also in parent)
- Internal node splits: key is pushed up (removed from current level, goes to parent only)
Deletion
- Find and remove the key from the leaf
- If leaf has ≥ ⌈(n-1)/2⌉ keys → done
- If underflow:
- Redistribute: Try to borrow a key from a sibling leaf (via the parent) — update parent key
- Merge: If no sibling can lend, merge the leaf with a sibling; the corresponding parent key is deleted
- Internal node underflow handled similarly (redistribute or merge)
- Merges can propagate upward; root can lose its last key if both children merge (tree height decreases)
Bulk Loading
Inserting N records one at a time into a B+ tree is expensive due to repeated splits and cache misses. Bulk loading:
- Sort all records by the index key
- Fill leaf nodes sequentially to a desired fill factor (e.g., 70%)
- Build internal nodes bottom-up as leaves are created
Issue with standard bulk loading: If data is loaded nearly full and subsequent insertions happen with sequential keys, the right-most leaf node is always split (right-biased splits). This creates poorly utilized leaves on the left side. Solution: leave some free space during loading or use a fill factor < 100%.
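The leaf-packing step can be sketched as follows, with an assumed fill factor of 70% (a simplification that ignores the bottom-up construction of internal nodes):

```python
import math

def bulk_load_leaves(sorted_keys, leaf_capacity, fill_factor=0.7):
    """Pack already-sorted keys into leaves at a chosen fill factor."""
    per_leaf = max(1, math.floor(leaf_capacity * fill_factor))
    return [sorted_keys[i:i + per_leaf]
            for i in range(0, len(sorted_keys), per_leaf)]

# 20 sorted keys, leaf capacity 10, fill factor 0.7 -> 7 keys per leaf, 3 leaves,
# leaving head-room so later insertions do not immediately split every leaf.
leaves = bulk_load_leaves(list(range(1, 21)), leaf_capacity=10)
```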
Extendible Hashing
A dynamic hashing scheme that grows without overflow chains.
Structure
- Directory: An array of 2^d pointers (d = global depth), stored in memory
- Buckets: Disk pages, each with a local depth ld ≤ d
- Multiple directory entries can point to the same bucket
Lookup
hash(K) → h (use high-order OR low-order bits, per convention)
Use the first d bits of h to index into the directory
Follow the pointer to the bucket
Overflow Handling
When a bucket with local depth ld overflows:
- If ld < d (global depth): Split only the bucket (the directory is unchanged); the directory entries that pointed to it are divided between the two resulting buckets; redistribute records using one more hash bit; increment ld for both halves
- If ld = d: First double the directory (d becomes d+1; each old entry spawns two new entries pointing to the same bucket); then split the overflowing bucket
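A minimal in-memory sketch of these two rules, assuming the low-order-d-bits convention and dict-based buckets (not production code):

```python
class ExtendibleHash:
    """Toy extendible hash: directory of 2^d entries, buckets with local depth ld."""

    def __init__(self, bucket_capacity=2):
        self.cap = bucket_capacity
        self.d = 0                           # global depth
        self.dir = [{"ld": 0, "keys": []}]   # 2^d directory entries

    def _idx(self, key):
        return hash(key) & ((1 << self.d) - 1)   # low-order d bits

    def lookup(self, key):
        return key in self.dir[self._idx(key)]["keys"]

    def insert(self, key):
        b = self.dir[self._idx(key)]
        if len(b["keys"]) < self.cap:
            b["keys"].append(key)
            return
        if b["ld"] == self.d:                # ld = d: double the directory first
            self.dir = self.dir + self.dir
            self.d += 1
        # now ld < d: split only the bucket; bit ld separates the two halves
        halves = ({"ld": b["ld"] + 1, "keys": []},
                  {"ld": b["ld"] + 1, "keys": []})
        for i, e in enumerate(self.dir):
            if e is b:
                self.dir[i] = halves[(i >> b["ld"]) & 1]
        for k in b["keys"]:                  # redistribute the old records
            self.dir[self._idx(k)]["keys"].append(k)
        self.insert(key)                     # retry; may trigger further splits
```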
Space
- Directory size: 2^d entries
- Worst case: starting from a single entry, n successive directory doublings grow it to 2^n entries
- Buckets: proportional to actual number of records
Linear Hashing
Another dynamic hashing scheme, but avoids doubling the directory.
Key Idea
- Maintain a split pointer s (initially 0) and level l (initially 0)
- Total buckets at any time: n × 2^l + s (where n = initial number of buckets)
Hash Functions
h_l(K) = K mod (n × 2^l) — current-level hash
h_{l+1}(K) = K mod (n × 2^{l+1}) — next-level hash (for "already split" buckets)
Lookup
bucket = h_l(K)
if bucket < s (split pointer):
use h_{l+1}(K) instead
Splitting
- Overflow anywhere → split the bucket at the split pointer (NOT the overflowing bucket)
- Create a new bucket; redistribute records in the split-pointer bucket using h_{l+1}
- Increment s
- If s = n × 2^l: set l = l + 1, reset s = 0
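These rules can be sketched as a small in-memory class; overflow chains are modeled implicitly by letting a bucket list exceed its capacity (an assumption for brevity):

```python
class LinearHash:
    """Toy linear hash: n initial buckets, split pointer s, level l."""

    def __init__(self, n=2, bucket_capacity=2):
        self.n, self.cap = n, bucket_capacity
        self.l, self.s = 0, 0
        self.buckets = [[] for _ in range(n)]

    def _addr(self, key):
        b = key % (self.n * 2 ** self.l)          # h_l
        if b < self.s:                            # already split: use h_{l+1}
            b = key % (self.n * 2 ** (self.l + 1))
        return b

    def insert(self, key):
        b = self._addr(key)
        self.buckets[b].append(key)
        if len(self.buckets[b]) > self.cap:       # overflow anywhere triggers a split
            self._split()

    def _split(self):
        self.buckets.append([])                   # new bucket at index n*2^l + s
        old, self.buckets[self.s] = self.buckets[self.s], []
        for k in old:                             # rehash the split-pointer bucket
            self.buckets[k % (self.n * 2 ** (self.l + 1))].append(k)
        self.s += 1
        if self.s == self.n * 2 ** self.l:        # round complete
            self.l, self.s = self.l + 1, 0
```

Note that the bucket that overflowed keeps its extra record until the split pointer reaches it — that is the overflow-chain behavior described above.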
Properties
- No directory doubling — size grows linearly with number of records
- Split order is predetermined, not triggered by where overflow occurs
- May have overflow chains (handled by chaining buckets)
Multidimensional Indexing
Problem
Traditional B+ trees and hash indexes work on a single attribute. For spatial/multidimensional queries (range queries on lat/lon, k-NN search), we need multi-attribute indexes.
Grid File
- Partition each dimension independently using linear scale arrays
- Each cell in the grid → a bucket of records
- Point query: Determine which grid cell contains the point; access that bucket — O(1)
- Range query: Find all overlapping cells; access each bucket
- Nearest neighbor: Find closest candidate; then search all buckets whose closest point to the query is ≤ current best distance
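A toy illustration of point-query cell lookup via the linear scales (the boundary values and points are made up):

```python
import bisect

# Linear scales partition each axis independently; a cell maps to one bucket.
x_scale = [10, 20, 30]   # x boundaries -> 4 column intervals
y_scale = [50, 100]      # y boundaries -> 3 row intervals
grid = {}                # (col, row) -> list of points (bucket contents)

def cell(p):
    """Locate the grid cell of a point via binary search on each scale."""
    return (bisect.bisect_right(x_scale, p[0]),
            bisect.bisect_right(y_scale, p[1]))

for pt in [(5, 40), (25, 60), (25, 70), (35, 120)]:
    grid.setdefault(cell(pt), []).append(pt)

# Point query: one cell lookup (in-memory scales), then one bucket access.
```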
R-Tree
A tree structure for spatial objects (points, rectangles, polygons). Each node contains MBRs (Minimum Bounding Rectangles) of its children.
Search: Traverse the tree, following subtrees whose MBR overlaps the query region.
Nearest Neighbor Search using MINDIST:
For query point Q = (qx, qy) and MBR [xmin, xmax] × [ymin, ymax]:
dx = xmin − qx if qx < xmin; 0 if xmin ≤ qx ≤ xmax; qx − qx − (xmax − qx) = qx − xmax if qx > xmax
dy = (same logic in the y-dimension)
MINDIST(Q, MBR) = √(dx² + dy²)
Pruning rule: If MINDIST(Q, node’s MBR) > current best distance, prune the entire subtree.
Algorithm:
- Start at root; maintain a priority queue of (MINDIST, node)
- Dequeue the node with smallest MINDIST
- If leaf: check actual records; update best if closer
- If internal: enqueue all children with their MINDIST values
- Prune any queued node with MINDIST > current best
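The MINDIST formula and the best-first algorithm above can be sketched together; the tuple-based node format (`('leaf', points)` / `('node', [(mbr, child), ...])`) is an assumption for illustration, not a real R-tree layout:

```python
import heapq
import math

def mindist(q, mbr):
    """MINDIST from point q to an MBR given as ((xmin, ymin), (xmax, ymax))."""
    (xmin, ymin), (xmax, ymax) = mbr
    dx = max(xmin - q[0], 0, q[0] - xmax)   # equivalent to the piecewise form
    dy = max(ymin - q[1], 0, q[1] - ymax)
    return math.hypot(dx, dy)

def nearest(q, root):
    """Best-first NN search over a toy tree of ('leaf', ...) / ('node', ...) tuples."""
    best, best_d = None, math.inf
    pq = [(0.0, 0, root)]                   # (MINDIST, tiebreak, node)
    tick = 1
    while pq:
        d, _, node = heapq.heappop(pq)
        if d > best_d:                      # pruning rule: everything left is farther
            break
        kind, contents = node
        if kind == 'leaf':
            for p in contents:              # check actual records
                dp = math.hypot(p[0] - q[0], p[1] - q[1])
                if dp < best_d:
                    best, best_d = p, dp
        else:
            for mbr, child in contents:     # enqueue children keyed by MINDIST
                heapq.heappush(pq, (mindist(q, mbr), tick, child))
                tick += 1
    return best
```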
Common Exam Patterns
Factual Questions
- How many levels does a multi-level index have? (⌈log_{bfr_i}(b)⌉ for sparse)
- What is the order n formula for a B+ tree non-leaf node? (n = ⌊(B+K)/(K+P)⌋)
- What is MINDIST in R-tree NN search? (minimum possible distance from query point to any point in the MBR)
Conceptual Questions
- When would you choose a sparse primary index over a dense index? (When file is sorted on the index key; sparse uses less space and still enables binary search)
- Compare extendible hashing and linear hashing on: directory growth, overflow handling, and lookup cost.
- Why does leaf splitting in B+ trees copy a key up while internal node splitting pushes?
Solving Questions
Index size and search cost calculation (very common in exams):
- Compute data blocking factor
- Compute number of data blocks
- Compute index entry size
- Compute index blocking factor
- Compute number of index blocks
- Compute search accesses = ⌈log₂(index blocks)⌉ + 1
B+ tree insertion/deletion:
- Draw the tree before the operation
- Follow split/merge rules precisely
- Show the final tree state
Extendible hashing:
- Track global depth, local depths, directory size
- Perform insertions; show directory doubling when needed
- Show final directory and bucket states
Linear hashing:
- Track level l, split pointer s, number of buckets
- Compute correct hash function based on whether bucket < s
- Show splits triggered by overflow