I/O Deduplication: Utilizing Content Similarity to Improve I/O Performance

Authors

Abstract

Duplication of data in storage systems is becoming increasingly common. We introduce I/O Deduplication, a storage optimization that utilizes content similarity to improve I/O performance by eliminating I/O operations and reducing the mechanical delays during I/O operations. I/O Deduplication consists of three main techniques: content-based caching, dynamic replica retrieval, and selective duplication. Each of these techniques is motivated by our observations of I/O workload traces obtained from actively used production storage systems, all of which revealed surprisingly high levels of content similarity for both stored and accessed data. Evaluation of a prototype implementation using these workloads revealed an overall improvement in disk I/O performance of 28-47% across these workloads. Further breakdown also showed that each of the three techniques contributed significantly to the overall performance improvement.

Publications

I/O Deduplication: Utilizing Content Similarity to Improve I/O Performance
Ricardo Koller, Raju Rangaswami
Proceedings of the USENIX Conference on File and Storage Technologies (FAST), February 2010.

I/O Deduplication: Utilizing Content Similarity to Improve I/O Performance
Ricardo Koller, Raju Rangaswami
ACM Transactions on Storage, 6(3), September 2010.

Traces

Three weeks of traces, collected from November 1 to November 21, 2008. All systems were running Linux with the ext3 file system.

Files	Description
mail.tar.xz (LZMA compression)	CS department mail server traces, covering all mail inboxes in the CS department.
homes.tar.gz	Research group activities: development, testing, experiments, technical writing, and plotting.
web-vm.tar.gz	CS department webmail proxy and online course management.

The trace files (one per day) are in ASCII, and each record has the following format:

[ts in ns] [pid] [process] [lba] [size in 512-byte blocks] [Write or Read] [major device number] [minor device number] [MD5 per 4096 bytes]

For the homes traces, the digest granularity differs (one MD5 per 512 bytes rather than per 4096 bytes):

[ts in ns] [pid] [process] [lba] [size in 512-byte blocks] [Write or Read] [major device number] [minor device number] [MD5 per 512 bytes]
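
The record layout above can be read with a few lines of Python. The following is a minimal sketch, not a tool distributed with the traces. It assumes fields are separated by whitespace, that the digest is a single token at the end of the line, and that the read/write flag is kept exactly as written in the file. The names TraceRecord, parse_record, size_bytes, and repeated_digest_fraction are hypothetical, and the sample line in the usage note is made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TraceRecord:
    ts_ns: int        # timestamp in nanoseconds
    pid: int
    process: str
    lba: int
    size_blocks: int  # request size in 512-byte blocks
    op: str           # write/read flag, kept as written in the trace
    major: int        # major device number
    minor: int        # minor device number
    digest: str       # MD5 field, treated as an opaque token (assumption)


def parse_record(line: str) -> TraceRecord:
    """Parse one ASCII trace record (assumes whitespace-separated fields)."""
    parts = line.split()
    if len(parts) < 9:
        raise ValueError(f"expected at least 9 fields, got {len(parts)}")
    # Take the last six fields from the right so that a process name
    # containing spaces does not shift the numeric columns.
    lba, size, op, major, minor, digest = parts[-6:]
    return TraceRecord(
        ts_ns=int(parts[0]),
        pid=int(parts[1]),
        process=" ".join(parts[2:-6]),
        lba=int(lba),
        size_blocks=int(size),
        op=op,
        major=int(major),
        minor=int(minor),
        digest=digest,
    )


def size_bytes(rec: TraceRecord) -> int:
    """Request size in bytes (the size field counts 512-byte blocks)."""
    return rec.size_blocks * 512


def repeated_digest_fraction(records: Iterable[TraceRecord]) -> float:
    """Fraction of records whose digest already appeared earlier.

    A rough illustration of content similarity only; it is not the
    metric used in the publications.
    """
    seen = set()
    total = repeated = 0
    for rec in records:
        total += 1
        if rec.digest in seen:
            repeated += 1
        seen.add(rec.digest)
    return repeated / total if total else 0.0
```

For example, parse_record("1000 42 nfsd 2048 8 W 8 0 0123456789abcdef0123456789abcdef") yields a record whose size_bytes is 4096. The same parser applies to the homes traces, since only the digest granularity differs.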

Systems Research Laboratory
