
I/O Deduplication: Utilizing Content Similarity to Improve I/O Performance



Duplication of data in storage systems is becoming increasingly common. We introduce I/O Deduplication, a storage optimization that utilizes content similarity to improve I/O performance by eliminating I/O operations and reducing the mechanical delays during I/O operations. I/O Deduplication consists of three main techniques: content-based caching, dynamic replica retrieval, and selective duplication. Each of these techniques is motivated by our observations of I/O workload traces obtained from actively used production storage systems, all of which revealed surprisingly high levels of content similarity for both stored and accessed data. Evaluation of a prototype implementation on these workloads showed an overall improvement in disk I/O performance of 28-47%. A further breakdown showed that each of the three techniques contributed significantly to the overall performance improvement.


  • I/O Deduplication: Utilizing Content Similarity to Improve I/O Performance (pdf)
    Ricardo Koller, Raju Rangaswami
    Proceedings of the USENIX Conference on File and Storage Technologies (FAST), February 2010.


Three weeks of traces for the following workloads:

  • CS department webmail proxy and online course management.
  • CS department mail server traces, covering all mail inboxes in the CS department.
  • Research group activities: developing, testing, experiments, technical writing, plotting.

The format is as follows (the digest of the zero block is `3df1244f6143869f52abf2a1d73d0c0f`):

`[ts] [pid] [process] [lba] [size in 512-byte blocks] [Write or Read] [major device number] [minor device number] [MD5 per 4096 bytes]`
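A record in this format could be parsed roughly as sketched below. This is only an illustration, not the tooling used to produce or analyze the traces. It assumes that fields are separated by whitespace, that process names contain no spaces, and that the operation field begins with `W` or `R`; none of this is stated above. The names `TraceRecord`, `parse_line`, and `is_zero_block` are hypothetical. The timestamp is kept as a string because its unit is not specified here.

```python
from typing import NamedTuple

# Digest of the zero block, as quoted in the format description above.
ZERO_BLOCK_MD5 = "3df1244f6143869f52abf2a1d73d0c0f"

SECTOR_BYTES = 512  # the size field counts 512-byte blocks


class TraceRecord(NamedTuple):
    """One trace line (hypothetical container, field order as documented)."""
    ts: str            # kept as text: the timestamp unit is not documented
    pid: int
    process: str
    lba: int
    size_sectors: int  # request size in 512-byte blocks
    is_write: bool
    major: int
    minor: int
    md5: str           # lowercase hex digest


def parse_line(line: str) -> TraceRecord:
    """Parse one record, assuming nine whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    ts, pid, process, lba, size, op, major, minor, md5 = fields
    op = op.upper()
    if op[0] not in ("W", "R"):
        # Assumption: writes and reads are marked by a leading W or R.
        raise ValueError(f"unrecognized operation field: {op!r}")
    return TraceRecord(
        ts=ts,
        pid=int(pid),
        process=process,
        lba=int(lba),
        size_sectors=int(size),
        is_write=(op[0] == "W"),
        major=int(major),
        minor=int(minor),
        md5=md5.lower(),
    )


def is_zero_block(record: TraceRecord) -> bool:
    """True if the record's digest matches the documented zero-block digest."""
    return record.md5 == ZERO_BLOCK_MD5
```

Multiplying `size_sectors` by `SECTOR_BYTES` gives the request size in bytes, so a size of 8 corresponds to 4096 bytes. Grouping records by `md5` is then one simple way to look for repeated content in a trace.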

In the case of the home traces, the format is different for the digests:
