I/O Deduplication: Utilizing Content Similarity to Improve I/O Performance

Authors

Abstract

Duplication of data in storage systems is becoming increasingly common. We introduce I/O Deduplication, a storage optimization that utilizes content similarity to improve I/O performance by eliminating I/O operations and reducing the mechanical delays incurred by the remaining ones. I/O Deduplication consists of three main techniques: content-based caching, dynamic replica retrieval, and selective duplication. Each of these techniques is motivated by our analysis of I/O workload traces obtained from actively used production storage systems, all of which revealed surprisingly high levels of content similarity for both stored and accessed data. Evaluating a prototype implementation with these workloads showed an overall improvement in disk I/O performance of 28-47%. A further breakdown showed that each of the three techniques contributed significantly to the overall improvement.

Publications

Traces

Three weeks of traces collected during the period 11/01/08-11/21/08. All systems were running Linux with the ext3 file system.

File           Description
mail.tar.xz    (LZMA compression) CS department mail server traces, covering all mail inboxes in the CS department.
homes.tar.gz   Research group activities: development, testing, experiments, technical writing, and plotting.
web-vm.tar.gz  CS department webmail proxy and online course management.

The trace files (one per day) are in ASCII, and each record has the following format:

[ts in ns] [pid] [process] [lba] [size in 512-byte blocks] [Write or Read] [major device number] [minor device number] [MD5 per 4096 bytes]

For the homes traces, the digest field uses a finer granularity:

[ts in ns] [pid] [process] [lba] [size in 512-byte blocks] [Write or Read] [major device number] [minor device number] [MD5 per 512 bytes]
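
Below is a minimal sketch of a parser for these records. It assumes fields are separated by whitespace, the process name contains no spaces, and every field after the minor device number is an MD5 digest (one per 4096-byte chunk for the mail and web-vm traces, one per 512-byte block for homes); the record type, field names, and the read/write counting example are illustrative and not part of the trace distribution.

#!/usr/bin/env python3
# Sketch parser for the trace record format described above.
# Assumption: whitespace-separated fields, no spaces in the process name,
# and all trailing fields are per-chunk MD5 digests.

import sys
from dataclasses import dataclass
from typing import List

@dataclass
class TraceRecord:
    ts_ns: int          # timestamp in nanoseconds
    pid: int            # process id
    process: str        # process name
    lba: int            # logical block address (512-byte sectors)
    size_blocks: int    # request size in 512-byte blocks
    op: str             # "W" (write) or "R" (read); exact token assumed
    major: int          # major device number
    minor: int          # minor device number
    md5s: List[str]     # content digests covering the request payload

def parse_record(line: str) -> TraceRecord:
    fields = line.split()
    return TraceRecord(
        ts_ns=int(fields[0]),
        pid=int(fields[1]),
        process=fields[2],
        lba=int(fields[3]),
        size_blocks=int(fields[4]),
        op=fields[5],
        major=int(fields[6]),
        minor=int(fields[7]),
        md5s=fields[8:],   # remaining fields are per-chunk MD5 digests
    )

if __name__ == "__main__":
    # Example use: count 512-byte blocks written vs. read from a trace on stdin.
    written = read = 0
    for line in sys.stdin:
        if not line.strip():
            continue
        rec = parse_record(line)
        if rec.op == "W":
            written += rec.size_blocks
        else:
            read += rec.size_blocks
    print(f"blocks written: {written}, blocks read: {read}")

For the homes traces, len(rec.md5s) should equal rec.size_blocks (one digest per 512-byte block); for mail and web-vm it should equal rec.size_blocks / 8 (one digest per 4096-byte chunk). This relationship can serve as a sanity check when processing the files.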