Storage & File Organization
Storage Hierarchy
Database data lives in a hierarchy of storage media, each with different speed, capacity, and cost trade-offs.
| Level |
Type |
Access Time |
Capacity |
Volatility |
| L1 Cache |
SRAM |
~1 ns |
KBs |
Volatile |
| Main Memory |
DRAM |
~100 ns |
GBs |
Volatile |
| Flash (SSD) |
NAND Flash |
~0.1 ms |
TBs |
Non-volatile |
| Magnetic Disk (HDD) |
Spinning disk |
~10 ms |
TBs |
Non-volatile |
| Magnetic Tape |
Sequential tape |
Seconds |
PBs |
Non-volatile |
Primary storage (cache + main memory): Fast, volatile — data lost on power failure.
Secondary storage (SSD + HDD): Persistent, slower — the main focus for DBMS.
Tertiary storage (tape): Archival, very slow.
Key insight for DBMS: The bottleneck is disk I/O. Query optimization and indexing exist primarily to reduce the number of block transfers.
Magnetic Disk Structure
A hard disk consists of:
- Platters: Circular disks coated with magnetic material
- Tracks: Concentric circles on each platter surface
- Cylinders: All tracks at the same radius across all platters
- Sectors: Smallest unit that can be read/written by the hardware (typically 512 bytes or 4KB)
- Blocks (pages): OS/DBMS unit of I/O — multiple contiguous sectors (typically 4KB–64KB)
- Read/write head: One per platter surface; all heads move together on the same arm
Disk Access Time Components
Total access time = Seek time + Rotational delay + Transfer time
| Component |
Description |
Typical Value |
| Seek time |
Time to move arm to correct cylinder |
3–15 ms avg |
| Rotational delay |
Time to wait for sector to rotate under head |
0–T ms (avg = T/2 where T = rotation period) |
| Transfer time |
Time to actually read/write the data |
Proportional to data size |
| Controller overhead |
Processing time in disk controller |
Small, often ignored |
Rotational speed: 7,200 RPM → one rotation = 60/7200 = 8.33 ms → average rotational delay = 4.17 ms
Worked example:
- Disk: 7,200 RPM, 8 ms seek time, block size = 4 KB, transfer rate = 160 MB/s
- Rotational delay (avg) = 4.17 ms
- Transfer time for 4 KB block = 4096 / (160 × 10^6) ≈ 0.025 ms
- Total ≈ 8 + 4.17 + 0.025 ≈ 12.2 ms per random access
Disk Scheduling — Elevator Algorithm (SCAN)
When multiple block requests are pending, service them in a way that minimizes total seek time.
SCAN (Elevator) Algorithm:
- The arm moves in one direction, servicing all requests along the way
- When it reaches the end (or runs out of requests), it reverses direction
- Name comes from elevator analogy
SSTF (Shortest Seek Time First): Service the request nearest to the current head position. Minimizes seek time per request but can cause starvation of distant requests.
C-SCAN (Circular SCAN): Only services requests in one direction; resets to the beginning after reaching the end. More uniform wait times.
Flash Memory (SSD)
NAND Flash properties:
- Reads: Very fast, similar for random and sequential (~100 µs)
- Writes: Slower (~200 µs), and must erase before writing
- Erase unit: Flash block (128–256 pages) — you cannot overwrite a page directly; must erase the entire block first
- Wear leveling: Flash cells wear out after ~10K–100K erase cycles; wear-leveling firmware distributes writes evenly
Why SSDs change DBMS design:
- No seek time or rotational delay — random reads are nearly as fast as sequential
- Write amplification: a small write may require erasing and rewriting a large block
- Traditional disk-oriented B+ trees and file layouts still work well (no penalty for sequential access)
Buffer Management
The DBMS buffer pool is a region of main memory used to cache disk blocks.
Buffer replacement policies:
- LRU (Least Recently Used): Evict the block that was last accessed the longest time ago — good general-purpose policy
- MRU (Most Recently Used): Evict the most recently used block — better for sequential scans (a block just used won’t be needed again soon)
- Clock algorithm: Approximate LRU using a reference bit; efficient implementation
- Pinned blocks: Marked as unpinnable while a transaction is using them
Dirty pages: A buffer block that has been modified but not yet written to disk. Must be flushed to disk before eviction (or before checkpoint completion).
Records and Blocks
Record Types
| Type |
Description |
| Fixed-length |
All fields have fixed sizes; record size is constant |
| Variable-length |
One or more fields are variable-length (VARCHAR, TEXT, BLOB) |
Block Size and Blocking Factor
Record size (R) = sum of all field sizes (in bytes)
Blocking factor (bfr) = ⌊ Block size (B) / Record size (R) ⌋
Number of blocks = ⌈ Number of records (r) / bfr ⌉
Example: Block = 4096 bytes, record = 200 bytes
- bfr = ⌊4096/200⌋ = 20 records per block
- 1000 records → ⌈1000/20⌉ = 50 blocks
Variable-Length Record Representation
Approach 1 — Slotted page structure (used in PostgreSQL, Oracle):
- Page header contains: record count, free space pointer, array of (offset, length) pairs
- Records packed from end; slot array grows from front
- Supports efficient free space management and record deletion
Approach 2 — Fixed-width fields + separator:
- Each variable-length field prefixed with its length
- Or special separator characters (e.g., NUL byte) mark field boundaries
Spanning vs Non-spanning
- Non-spanning: Records never cross block boundaries; simpler but wastes space (last partial block of each file)
- Spanning: A record can begin in one block and continue in the next; uses space better but requires pointer to next block
File Organization
Heap (Unordered) Files
Records inserted in any order; new records go to the last block (or first block with free space).
- Insert: O(1) — very fast
- Search (equality): O(b) — must scan all blocks in worst case (b = number of blocks)
- Delete: Mark record as deleted (reorganize periodically)
- Best for: Bulk inserts, when searches use an index, when order doesn’t matter
Sorted (Sequential) Files
Records stored in sorted order on a key field (the ordering field).
- Insert: O(b) worst case — may require shifting records; usually done with overflow area + periodic reorganization
- Search (equality on ordering field): O(log₂ b) — binary search on blocks
- Range search: Excellent — ordered traversal from first matching block
- Best for: Reports that need sorting, range queries on the ordering key
Binary search on sorted file:
Number of block accesses = ⌈ log₂(b) ⌉
Hash Files
Records stored based on the hash value of a key field; hash function maps key → bucket number → block address.
- Search (equality on hash key): O(1) average — compute hash, go directly to bucket
- Insert: O(1) average
- Range search: O(b) — hash destroys order; must scan all buckets
- Delete: Find bucket, mark as deleted
- Best for: Exact-match queries on the hash key
File Organization Summary
| Operation |
Heap |
Sorted |
Hash |
| Insert |
Best |
Worst |
Good |
| Exact match |
Worst |
Good |
Best |
| Range search |
Worst |
Best |
Worst |
| Sequential scan |
OK |
Best |
Worst |
| Delete |
OK |
Bad |
Good |
Record Addressing
| Method |
Description |
| Physical address |
Disk address: (device, cylinder, track, sector, offset) |
| Logical (relative) address |
Record number relative to file start; OS translates to physical |
| Variable-length (pointer) |
Pointer (block address + offset within block); used in slotted pages |
Indirection via RID (Record ID = page_number + slot_number): Allows the physical location of a record to change (due to updates or reorganization) without invalidating pointers from indexes, as long as the slot array is updated.
RAID
RAID (Redundant Array of Inexpensive Disks) improves reliability and/or performance.
| Level |
Description |
Benefit |
| RAID 0 |
Striping only |
High throughput; NO redundancy |
| RAID 1 |
Mirroring |
Full redundancy; doubles storage cost |
| RAID 5 |
Striping + distributed parity |
Balance of cost and fault tolerance (tolerates 1 disk failure) |
| RAID 6 |
Striping + 2 parity blocks |
Tolerates 2 simultaneous disk failures |
Common Exam Patterns
Factual Questions
- What is the blocking factor formula? (bfr = ⌊B/R⌋)
- What storage level is volatile? (Primary storage: cache and main memory)
- What is the elevator algorithm? (SCAN: service requests in one direction, reverse at end)
Conceptual Questions
- Why is rotational delay a concern for HDDs but not SSDs?
- Explain the trade-off between spanning and non-spanning records.
- Why does hash file organization perform poorly for range queries?
- What is a slotted page structure and why is it useful?
Solving Questions
Block calculation:
Given: 50,000 records, record size = 300 bytes, block size = 4096 bytes
- bfr = ⌊4096/300⌋ = 13
- Blocks = ⌈50000/13⌉ = 3847 blocks
Access time calculation:
Given: 5400 RPM disk, average seek = 9 ms, transfer rate = 80 MB/s, block = 8 KB
- Rotation period = 60/5400 ≈ 11.11 ms; avg rotational delay = 5.56 ms
- Transfer time = 8192/(80×10⁶) ≈ 0.10 ms
- Total random access ≈ 9 + 5.56 + 0.10 = 14.66 ms
Binary search on sorted file:
Given: 3847 blocks
- Accesses = ⌈log₂(3847)⌉ = ⌈11.91⌉ = 12 block accesses