Reliable Storage

Can disk drives be made reliable enough to identify and recover from errors that cause data corruption? Disk drives exhibit a variety of storage errors, caused by media imperfections or buggy firmware. Standalone machines do not employ any sophisticated data protection measures to detect or recover from these errors. A protection strategy is needed that keeps data safe in the face of partial disk failures.

Participants

Project Goals

Ensure data reliability even in the event of:

  1. Latent sector errors
  2. Torn writes
  3. Lost writes
  4. Misdirected writes
  5. Silent data corruptions

Design

read_block(X) {
  read block X and associated checksum;
  verify checksum;
  if (success) done;
  else {
    read stripe (remaining data blocks and parity);
    rebuild data for X;
    write X back to its static location;
  }
}
write_block(X) {
  write block X and checksum to log; /* (an optimization) */
  during_idle_time {
     /* do the following in batches for optimization */
     for each entry in log {
         read prev version of block from static location;
         read parity;
         compute new parity;
         write new version of block and checksum in static location;
         write parity;
         delete log entry;
     }
  }
}
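
The pseudocode above can be made concrete with a small in-memory sketch. It is only an illustration under several assumptions that the design does not state: the checksum is CRC32, parity is a single XOR block per stripe, checksums are kept apart from the data they cover, and plain Python lists stand in for the disk. The names Stripe, flush_log, xor_blocks and BLOCK_SIZE are hypothetical.

```python
import zlib
from functools import reduce
from operator import xor

BLOCK_SIZE = 16  # assumed tiny block size, only to keep the example readable


def checksum(data: bytes) -> int:
    """CRC32 stands in for whatever checksum the real design would use."""
    return zlib.crc32(data)


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


class Stripe:
    """Hypothetical stripe: n data blocks, one XOR parity block, a write log."""

    def __init__(self, n_data: int):
        self.data = [bytes(BLOCK_SIZE) for _ in range(n_data)]
        self.sums = [checksum(b) for b in self.data]  # stored apart from data
        self.parity = bytes(BLOCK_SIZE)
        self.log = []  # pending (index, data, checksum) entries

    def _read_static(self, x: int) -> bytes:
        """Read block x from its static location, repairing it if needed."""
        block = self.data[x]
        if checksum(block) == self.sums[x]:
            return block
        # Checksum mismatch: rebuild x from the other data blocks and parity.
        others = [b for i, b in enumerate(self.data) if i != x]
        rebuilt = xor_blocks(others + [self.parity])
        if checksum(rebuilt) != self.sums[x]:
            raise IOError(f"block {x} is unrecoverable")
        self.data[x] = rebuilt  # write the repaired block back
        return rebuilt

    def read_block(self, x: int) -> bytes:
        # The newest version of a block may still be sitting in the log.
        for idx, data, _ in reversed(self.log):
            if idx == x:
                return data
        return self._read_static(x)

    def write_block(self, x: int, data: bytes) -> None:
        """Fast path: append the block and its checksum to the log."""
        self.log.append((x, data, checksum(data)))

    def flush_log(self) -> None:
        """Idle-time work: apply log entries and update parity."""
        for x, new, new_sum in self.log:
            old = self._read_static(x)  # verified previous version
            # new parity = old parity XOR old data XOR new data
            self.parity = xor_blocks([self.parity, old, new])
            self.data[x] = new
            self.sums[x] = new_sum
        self.log.clear()
```

A short usage example: after two logged writes are flushed, overwriting one data block behind the stripe's back (a stand-in for silent corruption) is caught by the checksum and repaired from parity on the next read.

```python
s = Stripe(3)
s.write_block(0, b"A" * BLOCK_SIZE)
s.write_block(1, b"B" * BLOCK_SIZE)
s.flush_log()
s.data[1] = b"X" * BLOCK_SIZE          # simulate silent corruption
assert s.read_block(1) == b"B" * BLOCK_SIZE
```

This sketch does not address crash consistency of the log or a corrupted parity block; those are left out to keep it close to the pseudocode.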

Some optimizations

Raju's early notes

File systems are already complex. The EIO work shows that complex file systems neglect error conditions. This is a natural consequence of that complexity, and it also shows how heavily file system developers rely on the underlying block device.

Possible solution: Intra-disk redundancy for single disk reliability (Improving the reliability of commodity disk drives)