==+== FAST '09 Paper Review Form

==-== Set the paper number and fill out lettered sections A through G.

==-== DO NOT CHANGE LINES THAT START WITH “==+==”!

==+== RAJU MUST REPLACE THIS LINE BEFORE UPLOADING

==+== Begin Review

==+== Paper #000000

==-== Replace '26' with the actual paper number.

==+== Review Readiness

==-== Enter “Ready” here if the review is ready for others to see:

Ready

==+== A. Overall merit

==-== Enter a number from 1 to 5.

==-== Choices: 1. Reject

==-== 2. Weak reject

==-== 3. Weak accept

==-== 4. Accept

==-== 5. Strong accept

2

==+== B. Novelty

==-== Enter a number from 1 to 5.

==-== Choices: 1. Published before

==-== 2. Done before (not necessarily published)

==-== 3. Incremental improvement

==-== 4. New contribution

==-== 5. Surprisingly new contribution

3

==+== C. Longevity

==-== How important will this work be over time?

==-== Enter a number from 1 to 5.

==-== Choices: 1. Not important now or later

==-== 2. Low importance

==-== 3. Average importance

==-== 4. Important

==-== 5. Exciting

3

==+== D. Reviewer expertise

==-== Enter a number from 1 to 4.

==-== Choices: 1. No familiarity

==-== 2. Some familiarity

==-== 3. Knowledgeable

==-== 4. Expert

4

==+== E. Paper summary

This paper proposes to extend the block layer abstraction provided to file systems with a “consistency interval” (CI) primitive implemented over a distributed block store. This primitive simplifies file system design, since file systems can delegate atomicity to the block storage layer, and it allows system performance to be optimized because synchronization mechanisms move closer to the storage. The distributed block store also incorporates redundancy and versioning to provide reliability.

While the ideas of consistency intervals, versioning storage, and distributed data redundancy are not new, the contribution of this paper lies in combining all of these ideas into a single system with some novel design decisions.
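
For concreteness, here is a minimal sketch of the kind of consistency-interval interface the summary implies; the names and signatures (ci_begin, ci_write, ci_commit, ci_abort) are my own illustration and are not the paper's actual API.

/* Illustrative consistency-interval (CI) interface; names and signatures
 * are my own, not RIBD's actual API. */
#include <stdint.h>
#include <stddef.h>

typedef struct ci ci_t;    /* opaque handle for one consistency interval */

ci_t *ci_begin(void);      /* open a CI against the distributed block store */
int   ci_write(ci_t *ci, uint64_t block_addr, const void *buf, size_t len);
int   ci_commit(ci_t *ci); /* all writes in the CI persist atomically, or none do */
void  ci_abort(ci_t *ci);  /* discard all writes buffered in the CI */

Under such an interface, a file system would wrap each multi-block metadata update in one CI and rely on the block store, rather than its own journal, for atomicity.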

==+== F. Comments for author

High-level comments:

* The authors' attempt to combine the existing techniques of consistency intervals, versioning storage, and distributed data redundancy into a single mechanism for distributed block storage is certainly interesting. This approach, if successful, could substantially simplify the development of distributed file systems and is thus worth exploring. However, the paper in its current form is not yet in a shape that allows an accurate assessment of the value of this approach.

I encourage the authors to continue their efforts to improve the paper. Below, I summarize the key areas for improvement that would help better evaluate the value of the proposed approach, and then expand on them.

* The paper is scant on some key details, leaving them to the reader's imagination. These details need to be spelled out to clarify both the motivation and the approach, so that an accurate evaluation of the merits of this work can be made.

* The evaluation section seems disconnected from the rest of the paper and lacks prioritization. The authors do not evaluate the primary dimensions of interest, but instead focus on evaluating performance and overheads, which are a secondary concern.

* The authors leave out some relevant related work (listed below).

* Paper writing and organization need improvement.

Detailed comments:

* Key missing details:

- There needs to be a more elaborate discussion of the complexity of doing distributed atomic transactions at the block layer, to contrast it with existing proposals for CIs on a local block store. It is easy to state that it would be more complex, but an intuitive understanding of the key issues up front would help the reader better appreciate the authors' contributions. Further discussion also seems necessary on why the authors think a CI abstraction for distributed storage will be met with success when local block store CI proposals have arguably not been adopted after more than a decade.

- What is the impact of block size on parallelism and locking granularity? How should block size or locking granularity be chosen? How are locks managed? Who is responsible for avoiding deadlocks, enforcing lock-acquire sequences, etc.: RIBD or the file system? (A short sketch after this list illustrates the lock-ordering question.) These discussions are critical for the reader. The description in Section 4.4, for instance, needs more clarity and elaboration.

- How are block versions indexed and managed? How are past versions recreated? Is versioning used only to recover to the most recent version, or are older versions also deemed valuable?

- Garbage collection (pg 6) seems magical. What are the criteria for reclaiming blocks, and how is this supported in the design? Which blocks would you deem “too old”? How many passes over the data are required for garbage collection, and are any of these read-only?

- Are there any fundamental reasons why you believe simple redundancy is better than erasure coding, other than that erasure coding is more complex?

- Client-side caching is an important question to address; it should not be beyond the scope of this work (as suggested on pg 8). What are the implications if client-side caching is or is not used? As the experiments suggest, performance degrades in the absence of a cache. What are the consistency implications of using one?

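To make the lock-acquire-sequence question above concrete, here is a minimal sketch that assumes (the paper does not say) that per-block locks are exposed to the file system through a hypothetical blk_lock() call. Acquiring locks in a canonical sorted order is the standard way to rule out cyclic waits, and the paper should state whether RIBD or the file system is expected to enforce such an order.

/* Sketch only: blk_lock() is a hypothetical per-block locking primitive. */
#include <stdint.h>
#include <stdlib.h>

int blk_lock(uint64_t block_addr);  /* hypothetical; assumed to block until granted */

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Acquire a set of block locks in ascending address order to avoid deadlock. */
int lock_blocks_in_order(uint64_t *blocks, size_t n)
{
    qsort(blocks, n, sizeof(uint64_t), cmp_u64);
    for (size_t i = 0; i < n; i++)
        if (blk_lock(blocks[i]) != 0)
            return -1;              /* caller must release locks already held */
    return 0;
}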

* Evaluation: The authors focus their evaluation on performance comparison (which is good), but do not address two other evaluation dimensions that are even more important than performance, given the stated goal of RIBD and the fact that most of the architectural and design discussion revolves around them: (a) reduction of file system development complexity, and (b) effectiveness of the reliability mechanisms. These evaluations should be included, even at the expense of one of the benchmark sets currently used to evaluate performance.

- (a) I am interested in at least seeing concrete comparative examples, and better still a quantitative evaluation, of simple file system operations (e.g., file/directory creation, journaling commit/checkpointing) and of how development complexity is reduced by the RIBD primitives compared to existing file systems (one local and one distributed). A sketch of the kind of comparison I have in mind follows item (b) below.

- (b) I am even more interested in seeing experiments that evaluate the reliability and recovery mechanisms of RIBD. Failures (node failure and network partitioning) should be simulated and the protocol operations illustrated (e.g., reversing of locks, expiration of leases, etc.).
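
To illustrate the kind of comparison requested in (a), here is a minimal sketch; both code paths and all names are invented for illustration and are not taken from RIBD or from any existing file system.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical primitives, declared only so the sketch is self-contained. */
typedef struct ci ci_t;
ci_t *ci_begin(void);
int   ci_write(ci_t *ci, uint64_t block_addr, const void *buf, size_t len);
int   ci_commit(ci_t *ci);

void journal_begin(void);
void journal_log(uint64_t block_addr, const void *buf);
void journal_commit(void);
void write_block(uint64_t block_addr, const void *buf);

enum { BLOCK_SIZE = 4096, INODE_BLOCK = 100, DIR_BLOCK = 200 };

/* (1) Conventional design: the file system carries its own write-ahead journal. */
void create_with_journal(const void *new_inode, const void *new_dirent)
{
    journal_begin();
    journal_log(INODE_BLOCK, new_inode);   /* log both blocks before updating them */
    journal_log(DIR_BLOCK, new_dirent);
    journal_commit();                      /* force the log, then checkpoint in place */
    write_block(INODE_BLOCK, new_inode);
    write_block(DIR_BLOCK, new_dirent);
}

/* (2) With a consistency interval: atomicity is delegated to the block store. */
void create_with_ci(const void *new_inode, const void *new_dirent)
{
    ci_t *ci = ci_begin();
    ci_write(ci, INODE_BLOCK, new_inode, BLOCK_SIZE);
    ci_write(ci, DIR_BLOCK, new_dirent, BLOCK_SIZE);
    ci_commit(ci);                         /* both blocks become durable atomically */
}

Counting what the file system must still implement itself (log management, checkpointing, crash recovery) versus what it can delegate would make the complexity-reduction claim concrete.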

* What do you mean when you say the bottleneck in all the experiments was the networking layer? How does this affect the interpretation of the results presented earlier?

* How are the block sizes chosen? What are the trade-offs?

* Paper organization and writing. Several aspects of the writing need improvement, as listed below:

- The authors tend to describe typical approaches before presenting their own (e.g., page 6 / liveness). This is confusing for the reader, who must glean the exact approach used by RIBD. It is much better for the reader (and also easier for you) to simply describe the specific approach proposed.

- Some parts of the prose are unclear in terms of subject and object. The (ii) and (iii) bullets in Sec 5.1 are a prominent example. It is unclear who makes the acknowledgement, how it gets propagated through the RIBD hierarchy, and who should reject future requests. What are the proper measures for state invalidation? These points need to be crystallized for the reader to appreciate your contribution.

- What do the lease timeouts look like?

- How are VC elections conducted?

- The operation of the various modules (CRM, SRM, and especially SXM) needs to be explained much better, and earlier in the paper rather than on pg 8.

- The related work section should be moved to a later point in the paper, since you refer to concepts of RIBD not introduced until Sections 3/4/5.

- Table 1 is not explained. The acronyms are referred to much later (e.g., ) but not described concretely. For example, how do the SXM, CLM, and SLM operate?

- Provide a reference for the statement “many parallel applications are not fs-metadata-intensive”.

- What are GNBD and CLVM?

- What were the actual parameters used for GFS and PVFS2 after tuning them for performance?

- Pg 9: it is misleading to state “all data and metadata are versioned” and then state “we do not capture versions during the experiments”.

- What are the take-away points from Figure 6, and what is the importance of this experiment? What are unlock operations (and why not show lock operations)?

- What is “record size” in Fig 7 and 10?

- Why are there more acknowledgments with larger request size?

- Why does OFS perform better than both GFS and PVFS2 (Fig 7c and 7d)? The justification only implies that GFS and PVFS2 should be no better, but why are they worse? A similar situation arises in the description of Figure 12-b.

- In fig 8, why does the client-cache not help PVFS2 (when you suggest it helps GFS)?

- You state “OFS crosses kernel-user boundary lesser as block size increases.” Why is this the case? Shouldn't a boundary crossing occur on each FS operation, regardless of the underlying block size?

- A general statement is that all graphs need much richer interpretation to answer “why” numbers look the way they do.

- Fig 10 should be redrawn with # threads on the X-axis rather than Record size (which can be averaged out). Figures should directly address the primary parameter you want to examine.

* RIBD is not the first to propose versioning for recovery at the block level. See [1], [2], [3], and [4] (refs below).

* Also see earlier work on the Logical Disk [5], which introduces ARUs.

Minor comments:

* Pg 3, please explain why “it is more appropriate to use single, transactional mechanism for primary storage applications”.

* Pg 3, lock(directory_addr)?

* Pg 3, you say “we use low-overhead versioning for atomicity”, which is incorrect.

* Pg 6, the leader is not elected but rather seems to be “selected”.

* Pg 7, What are “distributed RAID techniques”?

* Pg 7, ABC → RIBD

* Pg 7, CTM and STM should be CCM and SCM respectively

Related work:

[1] M. Flouris and A. Bilas. Clotho: Transparent data versioning at the block I/O level. In IEEE Symposium on Mass Storage Systems, 2004.

[2] C. B. Morrey III and D. Grunwald. Peabody: The time travelling disk. In IEEE Symposium on Mass Storage Systems, pages 241-253, 2003.

[3] Q. Yang, W. Xiao, and J. Ren. TRAP-Array: A disk array architecture providing timely recovery to any point-in-time. In Proceedings of International Symposium on Computer Architecture, 2006.

[4] G. Laden, P. Ta-Shma, E. Yaffe, M. Factor, and S. Fienblit. Architectures for controller based CDP. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST), 2007.

[5] W. de Jonge, M. F. Kaashoek, and W. C. Hsieh. The Logical Disk: A new approach to improving file systems. In Proceedings of the 14th ACM Symposium on Operating Systems Principles (SOSP), 1993.

==+== G. Comments for PC (hidden from authors)

==+== End Review