Storage & File Organization


Storage Hierarchy

Database data lives in a hierarchy of storage media, each with different speed, capacity, and cost trade-offs.

Level Type Access Time Capacity Volatility
L1 Cache SRAM ~1 ns KBs Volatile
Main Memory DRAM ~100 ns GBs Volatile
Flash (SSD) NAND Flash ~0.1 ms TBs Non-volatile
Magnetic Disk (HDD) Spinning disk ~10 ms TBs Non-volatile
Magnetic Tape Sequential tape Seconds PBs Non-volatile

Primary storage (cache + main memory): Fast, volatile — data lost on power failure. Secondary storage (SSD + HDD): Persistent, slower — the main focus for DBMS. Tertiary storage (tape): Archival, very slow.

Key insight for DBMS: The bottleneck is disk I/O. Query optimization and indexing exist primarily to reduce the number of block transfers.


Magnetic Disk Structure

A hard disk consists of:

Disk Access Time Components

Total access time = Seek time + Rotational delay + Transfer time
Component Description Typical Value
Seek time Time to move arm to correct cylinder 3–15 ms avg
Rotational delay Time to wait for sector to rotate under head 0–T ms (avg = T/2 where T = rotation period)
Transfer time Time to actually read/write the data Proportional to data size
Controller overhead Processing time in disk controller Small, often ignored

Rotational speed: 7,200 RPM → one rotation = 60/7200 = 8.33 ms → average rotational delay = 4.17 ms

Worked example:

Disk Scheduling — Elevator Algorithm (SCAN)

When multiple block requests are pending, service them in a way that minimizes total seek time.

SCAN (Elevator) Algorithm:

SSTF (Shortest Seek Time First): Service the request nearest to the current head position. Minimizes seek time per request but can cause starvation of distant requests.

C-SCAN (Circular SCAN): Only services requests in one direction; resets to the beginning after reaching the end. More uniform wait times.


Flash Memory (SSD)

NAND Flash properties:

Why SSDs change DBMS design:


Buffer Management

The DBMS buffer pool is a region of main memory used to cache disk blocks.

Buffer replacement policies:

Dirty pages: A buffer block that has been modified but not yet written to disk. Must be flushed to disk before eviction (or before checkpoint completion).


Records and Blocks

Record Types

Type Description
Fixed-length All fields have fixed sizes; record size is constant
Variable-length One or more fields are variable-length (VARCHAR, TEXT, BLOB)

Block Size and Blocking Factor

Record size (R) = sum of all field sizes (in bytes)
Blocking factor (bfr) = ⌊ Block size (B) / Record size (R) ⌋
Number of blocks = ⌈ Number of records (r) / bfr ⌉

Example: Block = 4096 bytes, record = 200 bytes

Variable-Length Record Representation

Approach 1 — Slotted page structure (used in PostgreSQL, Oracle):

Approach 2 — Fixed-width fields + separator:

Spanning vs Non-spanning


File Organization

Heap (Unordered) Files

Records inserted in any order; new records go to the last block (or first block with free space).

Sorted (Sequential) Files

Records stored in sorted order on a key field (the ordering field).

Binary search on sorted file:

Number of block accesses = ⌈ log₂(b) ⌉

Hash Files

Records stored based on the hash value of a key field; hash function maps key → bucket number → block address.

File Organization Summary

Operation Heap Sorted Hash
Insert Best Worst Good
Exact match Worst Good Best
Range search Worst Best Worst
Sequential scan OK Best Worst
Delete OK Bad Good

Record Addressing

Method Description
Physical address Disk address: (device, cylinder, track, sector, offset)
Logical (relative) address Record number relative to file start; OS translates to physical
Variable-length (pointer) Pointer (block address + offset within block); used in slotted pages

Indirection via RID (Record ID = page_number + slot_number): Allows the physical location of a record to change (due to updates or reorganization) without invalidating pointers from indexes, as long as the slot array is updated.


RAID

RAID (Redundant Array of Inexpensive Disks) improves reliability and/or performance.

Level Description Benefit
RAID 0 Striping only High throughput; NO redundancy
RAID 1 Mirroring Full redundancy; doubles storage cost
RAID 5 Striping + distributed parity Balance of cost and fault tolerance (tolerates 1 disk failure)
RAID 6 Striping + 2 parity blocks Tolerates 2 simultaneous disk failures

Common Exam Patterns

Factual Questions

Conceptual Questions

Solving Questions

Block calculation: Given: 50,000 records, record size = 300 bytes, block size = 4096 bytes

Access time calculation: Given: 5400 RPM disk, average seek = 9 ms, transfer rate = 80 MB/s, block = 8 KB

Binary search on sorted file: Given: 3847 blocks