==+== FAST '09 Paper Review Form
==-== Set the paper number and fill out lettered sections A through G.
==-== DO NOT CHANGE LINES THAT START WITH "==+=="!
==+== RAJU MUST REPLACE THIS LINE BEFORE UPLOADING
==+== Begin Review
==+== Paper #000000
==-== Replace '000000' with the actual paper number.
==+== Review Readiness
==-== Enter “Ready” here if the review is ready for others to see:
[Enter your choice here]
==+== A. Overall merit
==-== Enter a number from 1 to 5.
==-== Choices: 1. Reject
==-== 2. Weak reject
==-== 3. Weak accept
==-== 4. Accept
==-== 5. Strong accept
2
==+== B. Novelty
==-== Enter a number from 1 to 5.
==-== Choices: 1. Published before
==-== 2. Done before (not necessarily published)
==-== 3. Incremental improvement
==-== 4. New contribution
==-== 5. Surprisingly new contribution
3
==+== C. Longevity
==-== How important will this work be over time?
==-== Enter a number from 1 to 5.
==-== Choices: 1. Not important now or later
==-== 2. Low importance
==-== 3. Average importance
==-== 4. Important
==-== 5. Exciting
3
==+== D. Reviewer expertise
==-== Enter a number from 1 to 4.
==-== Choices: 1. No familiarity
==-== 2. Some familiarity
==-== 3. Knowledgeable
==-== 4. Expert
3
==+== E. Paper summary
In this paper, the authors present a new approach, called Minuet, to improve the safety and liveness of SAN applications. They propose minimally augmenting SAN devices to enable “session isolation” and propose associated mechanisms (perhaps previously developed) for enforcing “atomic transactions” over shared resources. Together, these provide availability in spite of failures; more importantly, the proposed approach is tunable, trading off performance against aggressive vs. conservative consistency alternatives. The authors implement their protocols and their strategy, and the evaluation shows improvements in performance and availability.
==+== F. Comments for author
While I am no expert in this area, the authors address what seems to me a fairly important problem in distributed SAN environments. A motivation using example scenarios illustrates the problem and its importance. The key problem being addressed is the lack of coordination between SAN clients (e.g., file systems) and the SAN devices (so-called “dumb” devices) when multiple clients access shared content. A DLM-based solution is insufficient in this case because the DLM cannot control client I/Os issued during (real or perceived) client failures or network partitions.
Overall, I liked the paper quite a bit. While there is substantial influence from traditional database consistency work (the authors do mention this), the authors' basic approach seems novel, and some very interesting new design choices have been explored. However, I must point out that (i) the paper was not an easy read from an organization and presentation standpoint, (ii) some questionable assumptions are not adequately addressed, and (iii) the evaluation seems preliminary. These points are elaborated below, after a high-level critique/appreciation of the work.
High-level critique:
The authors propose a solution that makes the storage nodes “intelligent”: the enforcement of mutually exclusive client sessions is offloaded from the clients themselves to a “guard logic” implemented at each storage node. To access a shared resource, a client first establishes a shared session and then, if necessary, subsequently an exclusive session. This would seem like 2-phase locking. The guard logic enforces isolation between sessions (where necessary) that manipulate the same storage resource.
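To make my reading of the guard logic concrete, here is a minimal sketch of how a storage node might validate per-resource session timestamps and fence off stale clients. All names (`GuardedResource`, `check_io`) and the exact comparison rules are my own reconstruction, not the authors' code.

```python
class GuardedResource:
    """Storage-side guard state for one shared resource (my sketch)."""

    def __init__(self):
        self.max_shared = 0      # highest shared-session ID observed
        self.max_exclusive = 0   # highest exclusive-session ID observed

    def check_io(self, shared_ts, exclusive_ts):
        # Each I/O carries the issuing client's session timestamps; the
        # guard rejects it if a newer conflicting session has already
        # touched this resource, fencing off delayed/partitioned clients.
        if exclusive_ts > 0:  # client claims an exclusive session
            if shared_ts < self.max_shared or exclusive_ts < self.max_exclusive:
                return "REJECT"
            self.max_shared = max(self.max_shared, shared_ts)
            self.max_exclusive = max(self.max_exclusive, exclusive_ts)
            return "ACCEPT"
        # Shared-session I/O must not be older than the newest exclusive one.
        if shared_ts < self.max_exclusive:
            return "REJECT"
        self.max_shared = max(self.max_shared, shared_ts)
        return "ACCEPT"
```

Under this reading, the storage node needs only a pair of counters per resource, which is consistent with the paper's claim of a minimal device augmentation.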
Two extreme configurations in the solution space for session ID allocation are considered: (a) all session IDs are managed by a single DLM service, and (b) optimistic self-assignment of session IDs by the clients themselves (no coordination). In the former configuration, session I/Os initiated by a client get rejected only in failure-like scenarios, e.g., delayed I/Os and network partitions. In the latter configuration, session I/Os can be rejected even during normal operation, when contention does occur. The performance tradeoffs made by these configurations are fairly straightforward: the latter configuration has no session ID establishment overhead, which is possible because clients can rely on the guard logic to enforce isolation instead of being tied to a strongly consistent, centralized DLM solution.
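The contrast between the two configurations can be sketched as follows. This is purely illustrative (the class names and `acquire` signatures are mine); the point is where the cost is paid: a coordination round-trip per session in (a) vs. possible guard-logic rejections and retries in (b).

```python
import itertools


class DLMAllocator:
    """Configuration (a): a central service hands out totally ordered
    session IDs, so clients never self-assign conflicting ones."""

    def __init__(self):
        self._next = itertools.count(1)

    def acquire(self, client):
        # Costs one coordination round-trip per session establishment.
        return next(self._next)


class OptimisticAllocator:
    """Configuration (b): each client picks IDs locally, with no
    coordination; conflicts surface later as guard-logic rejections,
    handled by retrying with a fresh, larger ID."""

    def __init__(self):
        self._local = {}

    def acquire(self, client, at_least=0):
        # Zero round-trips; `at_least` lets a rejected client jump past
        # the session ID that fenced it out.
        nxt = max(self._local.get(client, 0), at_least) + 1
        self._local[client] = nxt
        return nxt
```
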
The final link that connects all of the above, and in addition provides “atomic multi-resource updates,” is the transaction protocol employed by clients. Each client first performs updates to an exclusive log and subsequently flushes these updates to the shared store. This is a 5-stage protocol: a client starts a transaction by registering it in the log, obtains read locks on all the resources, updates them locally, and then writes these updates to the client log. Next, write locks are obtained for each resource to be updated and, upon success, the transaction is committed.
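My understanding of the 5-stage client-side protocol can be summarized in the following sketch. All names (`run_transaction`, `lock_shared`, `upgrade_exclusive`) are illustrative stand-ins of my own, not the authors' API; the lock-upgrade step in stage 4 is where I understand the “verify” phase to sit.

```python
def run_transaction(xact_id, resources, compute_updates, log, locks, storage):
    # Stage 1: register the transaction in this client's private log.
    log.append(("BEGIN", xact_id))

    # Stage 2: acquire shared (read) locks and read the current values.
    snapshot = {r: storage.read(r) for r in resources if locks.lock_shared(r)}
    if len(snapshot) != len(resources):
        return False                        # a lock was denied: abort

    # Stage 3: compute updates locally and force them to the client log.
    updates = compute_updates(snapshot)
    log.append(("UPDATES", xact_id, updates))

    # Stage 4: upgrade to exclusive (write) locks on every updated
    # resource; this verify step may fail if another client intervened.
    if not all(locks.upgrade_exclusive(r) for r in updates):
        log.append(("ABORT", xact_id))
        return False

    # Stage 5: commit by flushing the logged updates to the shared store.
    for r, value in updates.items():
        storage.write(r, value)
    log.append(("COMMIT", xact_id))
    return True
```

Read this way, the protocol is close to two-version locking, which reinforces my comment below that the relationship to classic database techniques deserves an explicit discussion.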
Organization / Presentation issues:
- There needs to be a section at the beginning that ties together all the aspects of the paper. For instance, I was left wondering about atomic multi-resource transactions all the way until Page 7. Separating out the discussion of sessions was a good choice, but you have to give the reader overall context for how the parts form the whole (at a high level) before diving into a specific part of your design.
- The release semantics deserve further elaboration. How do clients release their shared or exclusive locks? And how does this affect sessions as managed by the guard logic when no centralized DLM is used?
- Why do the per-client logs need to be “exclusively locked” before setting up a transaction? There would seem to be no contention here, except of course from other processes on the same client, which could be resolved locally within the Minuet transaction service at client C. Perhaps this is for handling delayed writes to the log? Some readers would certainly appreciate some help here…
- More discussion is needed on how the verify phase works. Specifically, it seems that the exclusive locks required for the verify phase may be initiated by more than one client; while I could work this out after a while, it would be better if the readers were offered some guidance.
- The authors need to be more forthcoming upfront about how the proposed approach is influenced by existing distributed systems consistency strategies such as 2-phase locking and 2-version (readlock→writelock→commitlock) locking.
- While the authors point to a technical report, I did not have the opportunity to read it. The paper could probably have been much easier to understand with figures describing the locking and transaction protocols.
Unclear assumptions:
- What would be a typical application, as referred to throughout the paper? I would expect it to be a file system in most cases. In such a situation, what you are providing is a journaling layer for a distributed block store. Journaling file systems were not designed to roll back journal state during normal operation, which would happen if the verify phase fails to allow committing a transaction. Even if the clients were not file systems, application state must still be rolled back to redo the transaction operations. While you do mention the benefits that would be available with a change of programming model, it would be good to see a rough sketch of how such a modified programming model could become generic enough to apply to an arbitrary application. Is this possible or not? I think these sorts of questions are really important to answer.
- The execution history <R1.1, R2.1, R1.2, W1.1, W2.1> is claimed not to obey session isolation. While I could understand what the authors were trying to convey, no session violations are obvious here. It would not be surprising if reads are not guaranteed to occur between the acquisition of shared and exclusive locks on a resource.
- Applications that have committed persistent state are not typically designed to handle revocation of that state. Since locks can be revoked while the application is already inside a critical section, the application may be forced to abort and roll back work it believed was safely committed; the paper should address whether this is a realistic expectation.
Evaluation concerns:
Section 3.3.1: Before committing each transaction, Minuet issues multiple extra disk requests to verify the validity of sessions. How does this affect performance under a heavy workload? The evaluation section could be improved with such a discussion, since this seems like an Achilles' heel of the design.
The partitioned network scenario is only mentioned in passing. Considering that availability is a key design goal, the current evaluation is incomplete.
I would also much rather have liked to see an evaluation with an actual distributed database or file system, which would also help address some of the concerns about the assumptions' scope and validity.
Other minor comments:
Page 3: In Scenario 2, you mention that data corruption can occur if the said assumption is violated. When can this condition arise?
What is the significance of the incarnation number (T.incNum) in the timestamp format? How is this used?
Section 3.3.1: What is xactID in the commit session identifier?
Section 5.1: Description of uniform and skewed workload can be improved.
Figure 4 is referenced before 3 in Section 5.2.
Typo: Section 5.2, partitioined → partitioned
==+== G. Comments for PC (hidden from authors)
[Enter your comments here]
==+== End Review