internal:fast09-reviewing:review45

==+== FAST '09 Paper Review Form

==-== Set the paper number and fill out lettered sections A through G.

==-== DO NOT CHANGE LINES THAT START WITH ”==+==”!

==+== RAJU MUST REPLACE THIS LINE BEFORE UPLOADING

==+== Begin Review

==+== Paper #000000

==-== Replace '45' with the actual paper number.

==+== Review Readiness

==-== Enter “Ready” here if the review is ready for others to see:

Ready

==+== A. Overall merit

==-== Enter a number from 1 to 5.

==-== Choices: 1. Reject

==-== 2. Weak reject

==-== 3. Weak accept

==-== 4. Accept

==-== 5. Strong accept

==+== B. Novelty

==-== Enter a number from 1 to 5.

==-== Choices: 1. Published before

==-== 2. Done before (not necessarily published)

==-== 3. Incremental improvement

==-== 4. New contribution

==-== 5. Surprisingly new contribution

==+== C. Longevity

==-== How important will this work be over time?

==-== Enter a number from 1 to 5.

==-== Choices: 1. Not important now or later

==-== 2. Low importance

==-== 3. Average importance

==-== 4. Important

==-== 5. Exciting

==+== D. Reviewer expertise

==-== Enter a number from 1 to 4.

==-== Choices: 1. No familiarity

==-== 2. Some familiarity

==-== 3. Knowledgeable

==-== 4. Expert

==+== E. Paper summary

The paper presents the design of a commercial storage system which consists of a content addressed back-end and a set of front-end module based interfaces such as file systems. The back-end provides deduplication, failure tolerance and its read/write performance can scale linearly with the number of nodes. Failure tolerance is tunable at a “per-block” level with the use of erasure codes which are parametrized with the minimum number of blocks needed for reconstruction.

==+== F. Comments for author

Conceptual comments:

This paper attempts to bring techniques that are traditionally used in offline storage (such as content-addressed storage and deduplication) to online storage, and in essence to merge these two concepts within one storage system. That since online storage does not have the down-time availability that can be assumed with offline storage makes this an extremely challenging proposition. That is not to say that this paradigm is not worth-exploring, because it certainly has the potential to simplify management. However, after reading the paper, the authors did not convince me that they had addressed the key challenges.

I provide a couple of examples. First, using a content-addressed distributed store for enabling deduplication can easily result in loss of data locality (extremely important for online performance) turning sequential accesses into random based on the choice of data block size (more on this below). Second, I felt it was odd that the experiments report the storage system forced into read-only mode for a duration of 30 minutes (unacceptable for online storage) while doing a regular house-keeping operation (space reclamation).

Another high-level comment is that the authors overwhelm the reader with too many ideas that did not allow sufficient exploration of individual ideas. Further, a research paper must clearly distinguish original ideas from pre-existing ideas at the time of introduction of the idea so the reader is not misguided. I was also overwhelmed with the number of new terms that were introduced in this paper, but were not clearly defined. I believe most of these new terms could have been avoided if the authors had resorted to more well-established terminology for storage systems (more on this below). Unfortunately, the choice of the level of detail and the various new concepts had me continuously stretching my imagination as I was reading through the text.

Detailed Comments:

* I found the current paper organization very challenging to navigate. As I am reading Section 3, I am looking for a clear answer for how the logical block address space is mapped to individual storage nodes and what addressing mechanisms are in place. This appears only much later in Section 4 and even that description is unclear with synchruns etc. (more on this below). I think the paper will be much more readable if you first described clearly how the block address space gets mapped and then go into the details of how various concepts such as proxies, drivers, network overlay fit in and how read/write handling, load-balancing etc. fit in into this picture.

* A key decision point in the design is the size of data block used as a unit for content-addressing. And, variable block size deduplication is a very interesting and useful idea, but seems challenging to realize. The authors propose they support variable block sizes, but is not described in the paper. With a fixed block size, if the block size is too large, deduplication opportunities are significantly reduced. If it is too small, when blocks are written, fragmentation can be introduced affecting read performance as data sequentiality is lost. Given that file systems (clients) manage the logical address space and they do so with the locality assumption (that blocks which are colocated in the logical block address space are also more efficiently accessed in succession), such locality should be preserved in the backend. Finally, if the number of fragments is higher than the number of nodes, paralellism may not work that well compared to the complete stream read sequentially. More discussion on this aspect is needed.

* Related to the above, you claim “In general, reading is very efficient for streamed access, as all fragments are sequentially pre-fetched from disk to a local cache.” What is the local-cache (not introduced?) What do you mean by streamed accesses? Why would these be sequentially prefetched? More clarity using definitions and elaboration is needed.

* Specify what type of erasure coding you use. This deserves more detail than has been provided. Relatedly, after reads, blocks reconstructed are verified. How does this happen?

* Also (may be related to above), how is supernode cardinality chosen? why is it between 4 and 32? The question that a reader has at this point is what system metric does this choice affect at a high-level and why?

* Why are supernodes invincible that they may not be permanently lost? Also, how exactly is temporary loss of a supernode handled?

* How are data fragments distributed? What algorithm is used? What are the definitions of the following conccepts – “streamed access”, “block metadata”, “local node index”, “local cache” (Section 3.3) ?

* How are peer chain holes so easily detected?

* Load balancing (Section 3.4) seems based on some magical function. These details are the most important for readers who would like to understand what you are doing and future implementers who would like to implement your ideas.

* I don't understand how you can achieve per-block selection of data redundancy class. It seems more of a choice for the whole system.

* “The network survives node failures as long as each supernode remains alive, and for that at least half of each supernode peers plus one should remain alive to reach a consensus.” ⇒ an explanation of this is required. This seems to depend on the erasure code being used, which should be described.

* Understanding the idea of synchruns, chains of synchruns, synchrun components, synchrun component containers, chains of synchrun component containers is very hard given section 4.1 and 4.2 alone. Figure 3 should contain more annotations. For example a little description specifying that in this case one of the big rectangles is a set of fragments inside a synchrun component which in this case is a SCC. All of these concepts (Synchruns, chains of synchruns, merged SCC) probably could be explained in terms of most common concepts such as IO requests, IO scheduler queues, etc.

* The design of the deletion and space reclamation (section 5.3) has to be carefully justified. Why do you think the first (read-only) phase cannot be avoided? if not, what is its impact to foreground workload? Is a down-time (read-only) of several minutes or hours acceptable? why?

Minor comments:

You could make the readers life a bit easier by addressing some points: improve the prose and introduce all the terms you use before using them. Some examples were provided above, but you really need to:

make a exhaustive list of all terms you use
identify the ones that can be avoided (and not use them)
clearly define all remaining new terms

Grammar mistakes includes (not an extensive list):

+ “as transfer is much cheaper operation.”

+ “The number of storage servers run on a storage node depends on its resources.”

+ “Software components of the back-end include storage server, proxy server, and protocol drivers.”

* “For retention roots, we need to ensure that two blocks with the same search prefix point to the same blocks (otherwise retention roots will not be useful to identify snapshots).” ⇒ I think you mean: “two blocks with different search prefixes don't point to the same block.”

* I believe Figure 4. caption should read: “Percentage of duplicated blocks” instead of “duplicate eliminated.”

* “For example, data is available (i.e. all blocks are reconstructible), if sufficient number of peer chains (equal to the number of fragments needed to reconstruct each block) do not have any holes.”

⇒ do you mean “if sufficient number of peer chains do not have more holes than (cardinality - sufficient number of fragments needed to reconstruct each block)”. I guess you do mean this given the next phrase: “In the worst case, a given block may be lost completely if not enough fragments survive.”

* “synchrun, containing a limited number of consecutive block” ⇒ consecutive in which dimension ? space, time? something else?

* What does “only one synchrun is open” mean?

Missing related work:

See the “Foundation” work from MIT on content-addressed storage - Rhea et al. USENIX Techincal Conference 2008. The authors address the shortcomings of Venti that you point out as “do not take advantage of sequential nature of incoming data streams”.

==+== G. Comments for PC (hidden from authors)

==+== End Review