====== I/O Deduplication: Utilizing Content Similarity to Improve I/O Performance ======

===== Authors =====

  * [[http://www.cs.fiu.edu/~rkoll001/|Ricardo Koller]]
  * [[http://www.cs.fiu.edu/~raju/|Raju Rangaswami]]

===== Abstract =====

Duplication of data in storage systems is becoming increasingly common. We introduce I/O Deduplication, a storage optimization that utilizes content similarity for improving I/O performance by eliminating I/O operations and reducing the mechanical delays during I/O operations. I/O Deduplication consists of three main techniques: content-based caching, dynamic replica retrieval, and selective duplication. Each of these techniques is motivated by our observations with I/O workload traces obtained from actively-used production storage systems, all of which revealed surprisingly high levels of content similarity for both stored and accessed data. Evaluation of a prototype implementation using these workloads revealed an overall improvement in disk I/O performance of 28-47% across these workloads. Further breakdown also showed that each of the three techniques contributed significantly to the overall performance improvement.

===== Publications =====

  * **I/O Deduplication: Utilizing Content Similarity to Improve I/O Performance** {{iodedup-fast10.pdf|pdf}}\\ Ricardo Koller, Raju Rangaswami\\ Proceedings of the USENIX Conference on File and Storage Technologies (FAST), February 2010.
  * **I/O Deduplication: Utilizing Content Similarity to Improve I/O Performance** {{iodedup-acmtos10.pdf|pdf}}\\ Ricardo Koller, Raju Rangaswami\\ ACM Transactions on Storage, 6(3), September 2010.

===== Traces =====

Three weeks of traces, collected during the period 11/01/08-11/21/08. All the systems were running Linux with the ext3 file system.

^ Files ^ Description ^
| {{:projects:iodedup:mail.tar.xz|}} (LZMA compression) | CS department's mail server traces. Covers all the mail inboxes in the CS department. |
| {{:projects:iodedup:homes.tar.gz|}} | Research group activities: developing, testing, experiments, technical writing, plotting. |
| {{:projects:iodedup:web-vm.tar.gz|}} | CS department webmail proxy and online course management. |

The trace files (one per day) are in ASCII, and each record has the following format:

> ''[ts in ns] [pid] [process] [lba] [size in 512-byte blocks] [Write or Read] [major device number] [minor device number] [MD5 per 4096 bytes]''

For the homes traces, the digest granularity differs (one MD5 per 512 bytes instead of one per 4096 bytes):

> ''[ts in ns] [pid] [process] [lba] [size in 512-byte blocks] [Write or Read] [major device number] [minor device number] [MD5 per 512 bytes]''
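
The record layout above can be read with a few lines of Python. The sketch below is only an illustration and is not part of the trace distribution: the names ''TraceRecord'', ''parse_record'', ''parse_trace'' and ''repeated_digest_fraction'' are hypothetical. It assumes that fields are separated by whitespace, that the process name contains no whitespace, that the Write/Read field starts with ''W'' or ''R'', and that the digest is a single hexadecimal token. Check these assumptions against the actual files before relying on it.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

SECTOR_BYTES = 512  # the size field counts 512-byte blocks


@dataclass(frozen=True)
class TraceRecord:
    ts_ns: int        # timestamp in nanoseconds
    pid: int
    process: str
    lba: int          # logical block address
    size_blocks: int  # request size in 512-byte blocks
    is_write: bool
    major: int        # major device number
    minor: int        # minor device number
    md5: str          # content digest, kept as the raw hex token

    @property
    def size_bytes(self) -> int:
        return self.size_blocks * SECTOR_BYTES


def parse_record(line: str) -> TraceRecord:
    """Parse one whitespace-separated trace record (9 fields assumed)."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}: {line!r}")
    ts, pid, process, lba, size, op, major, minor, md5 = fields
    return TraceRecord(
        ts_ns=int(ts),
        pid=int(pid),
        process=process,
        lba=int(lba),
        size_blocks=int(size),
        # Assumption: the flag begins with "W" for writes, "R" for reads.
        is_write=op.upper().startswith("W"),
        major=int(major),
        minor=int(minor),
        md5=md5.lower(),
    )


def parse_trace(lines: Iterable[str]) -> Iterator[TraceRecord]:
    """Yield records from an iterable of lines, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield parse_record(line)


def repeated_digest_fraction(records: Iterable[TraceRecord]) -> float:
    """Fraction of records whose digest was already seen earlier."""
    seen = set()
    total = repeated = 0
    for rec in records:
        total += 1
        if rec.md5 in seen:
            repeated += 1
        else:
            seen.add(rec.md5)
    return repeated / total if total else 0.0


if __name__ == "__main__":
    # Invented sample lines for illustration only, not taken from the traces.
    sample = [
        "100 42 procA 1000 8 W 8 0 d41d8cd98f00b204e9800998ecf8427e",
        "200 42 procA 2000 8 R 8 0 d41d8cd98f00b204e9800998ecf8427e",
    ]
    for rec in parse_trace(sample):
        print(rec.lba, rec.size_bytes, "W" if rec.is_write else "R", rec.md5)
```

Applied to one day's file (for example ''parse_trace(open(path))'', with ''path'' pointing at an extracted trace file), ''repeated_digest_fraction'' gives a rough indication of how often the same content recurs among the recorded requests. It is a simple count over digests and does not reproduce the analysis in the publications.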