
==+== FAST '09 Paper Review Form

==-== Set the paper number and fill out lettered sections A through G.

==-== DO NOT CHANGE LINES THAT START WITH “==+==”!

==+== RAJU MUST REPLACE THIS LINE BEFORE UPLOADING

==+== Begin Review

==+== Paper #86

==-== Replace '000000' with the actual paper number.

==+== Review Readiness

==-== Enter “Ready” here if the review is ready for others to see:

Ready

==+== A. Overall merit

==-== Enter a number from 1 to 5.

==-== Choices: 1. Reject

==-== 2. Weak reject

==-== 3. Weak accept

==-== 4. Accept

==-== 5. Strong accept

1

==+== B. Novelty

==-== Enter a number from 1 to 5.

==-== Choices: 1. Published before

==-== 2. Done before (not necessarily published)

==-== 3. Incremental improvement

==-== 4. New contribution

==-== 5. Surprisingly new contribution

3

==+== C. Longevity

==-== How important will this work be over time?

==-== Enter a number from 1 to 5.

==-== Choices: 1. Not important now or later

==-== 2. Low importance

==-== 3. Average importance

==-== 4. Important

==-== 5. Exciting

1

==+== D. Reviewer expertise

==-== Enter a number from 1 to 4.

==-== Choices: 1. No familiarity

==-== 2. Some familiarity

==-== 3. Knowledgeable

==-== 4. Expert

4

==+== E. Paper summary

The paper introduces a modification to journaling termed “Active Journaling” (AJ). The goals are to improve application-perceived I/O performance and system reliability using techniques such as journaling anywhere, using free replicas, and reconstructing the disk layout by monitoring the temporal locality of user requests.

==+== F. Comments for author

High-Level Comments:

* The paper presents several ideas that are thought-provoking at a high level. Techniques such as reducing journaling latency with group journaling, attempting to eliminate multiple writes by using the journal copy itself as the metadata, and matching temporal locality with spatial locality by reordering block arrangements all appear rather interesting at first glance.

* However, an in-depth analysis of most of these techniques reveals why they are not used in practice, even after the several years in which journaling file systems have matured. Below, I list the key issues:

- To put it simply, the authors propose building a log-structured file system for metadata only. This decision has three problems, the latter two of which are shared with traditional log-structured file systems that log both data and metadata: (1) the locality of data and metadata is lost, so lookup operations (optimized in VFS using block groups) and tandem metadata/data reads and writes are degraded; (2) the file system's hierarchical structure is lost; and (3) cleaning must be performed to maintain contiguous free space at the end of the multiple logging points, requiring additional I/O operations. Thus multiple writes are not eliminated; in fact, additional reads are required, since copies in the cache will typically have been evicted over the longer durations.

- Since journaling now competes for space with the file allocation mechanisms (unlike before), the two must communicate, which adds complexity.

- Since the journal is fragmented, file data blocks will also become fragmented as the journal fragments grow in size and number.

- Temporal locality is not clearly defined. The fact that blocks are accessed within a small time window does not mean they are related; they may be accesses from two unrelated processes. You need more information to determine temporal locality (see the first sketch after this list).

- Cleaning is unavoidable for any logging-based approach. During cleaning, metadata locations must change to create contiguous free space for committing new transactions sequentially. Therefore, multiple writes of metadata are unavoidable. If you must copy metadata elsewhere anyway, you might as well copy it back to its original, static file system location.

- The writing is sloppy in multiple places. For instance, in the description of the indirection mechanism, block consistency must be clearly addressed. What is the exact algorithm used to access blocks that are indexed in the vmap, for both reads and writes (see the second sketch after this list)? Further, the vmap has to be kept up to date on persistent storage at all times. You mention that changes to the vmap are tracked along with transaction commits (p. 7) and that a backup of it is made. Where is this backup located? What overheads are we looking at? How did you simulate the persistence of the vmap in your experiments?

- There are several new techniques that must be incorporated into the kernel. You need to evaluate each of these design decisions separately and as a whole, and you should analyze sensitivity to the configuration parameters: for example, the choice of time gap in the sequence window, as well as the space threshold. The evaluation of “self-healing” is not mentioned in the experiments (perhaps inject I/O failures in simulation?).
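To make the temporal-locality concern concrete, here is a minimal sketch (the trace data, the names, and the purely time-based grouping rule are my assumptions, not the paper's algorithm) showing how a 0.1-second sequence window can merge accesses from two unrelated processes into one group:

    WINDOW = 0.1  # seconds; the gap the paper reportedly uses

    # Invented trace entries: (timestamp, pid, block number).
    trace = [
        (0.00, 1, 100), (0.02, 2, 9000), (0.05, 1, 101),
        (0.08, 2, 9001), (0.30, 1, 102),
    ]

    # Group consecutive accesses whose inter-arrival gap fits in the window.
    groups, current = [], [trace[0]]
    for prev, cur in zip(trace, trace[1:]):
        if cur[0] - prev[0] <= WINDOW:
            current.append(cur)
        else:
            groups.append(current)
            current = [cur]
    groups.append(current)

    for g in groups:
        print("blocks:", [blk for _, _, blk in g],
              "pids:", sorted({pid for _, pid, _ in g}))
    # The first group mixes pids [1, 2]: temporal proximity alone does not
    # establish that blocks 100/101 and 9000/9001 are related.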
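And for the vmap indirection question, a hypothetical sketch of the read/write paths as I understand them (the names vmap, write_block, and read_block are mine, not the paper's; this is exactly what the paper should spell out):

    # vmap maps a block's home (static) location to its current journal copy.
    vmap = {}   # home block -> journal location
    disk = {}   # physical location -> data (stand-in for the device)

    def write_block(home, data, journal_loc):
        disk[journal_loc] = data
        vmap[home] = journal_loc  # redirect future reads to the journal copy
        # Open question: when is this vmap update made durable, and what
        # happens on a crash between writing the block and updating the map?

    def read_block(home):
        # Reads must consult the vmap first, falling back to the home location.
        return disk[vmap.get(home, home)]

    write_block(home=42, data=b"inode v2", journal_loc=7001)
    assert read_block(42) == b"inode v2"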

To summarize, I am not convinced that any of these techniques, viewed from a high level, would work well in a general-purpose system. The only way to convince readers of your work is a working prototype of AJ, so that you can compare application results by running real workloads and tease out the better parts of your design.

Additional comments:

- While I do not agree with the principle of a fragmented, unbounded metadata journal, your design might be much simpler if you used a block map (similar to the vmap) for the entire journal, so that the journal can be managed as if it were fully contiguous while the mapping layer handles redirection with much greater flexibility (see the first sketch following these additional comments).

- The figures are not well explained; several concepts (e.g., modified block, unmodified block, durable temporal locality in Fig. 5) appear in the figures but are not elaborated upon.

- Some terms are not introduced formally, and some concepts are poorly explained:

– How exactly does crash recovery work with active journaling?

– What is the structure of the vmap?

– How does cleaning work?

– What is the “block preference location table”?

– Why do you choose a window of 0.1 seconds? On what basis?

– Sequence groups are not clearly defined, hence the next few comments:

– Why is a small-sized group a mismatch?

– Why does colocating small- and large-sized groups speed up access?

– How are replica blocks chosen?

- You state “Nearly all updated blocks have at least two copies” (p. 7). Why is that, and what does this imply?

- The suggestion that “you could recover from a previous version of the block” is dangerous for block consistency. You need to carefully account for how the file system used that block in the past.

- Experiments need to reflect the steady state of the file system (e.g., an aged file system in which cleaning is active).

- It is not clear how the traces were obtained.

- Using “average response time” as the metric for system performance does not generalize across different runs. It is better to normalize this value by the number of blocks accessed (e.g., average time per block, or disk throughput); the second sketch following these comments gives a toy example.
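First sketch, for the block-map suggestion above: a hypothetical mapping layer (the names and the extent layout are my invention) that exposes one logically contiguous journal while its extents live anywhere on disk:

    # (journal offset, disk start, length) extents for a virtualized journal.
    extents = [(0, 5000, 64), (64, 9000, 64)]

    def journal_to_disk(off):
        # Translate a logical journal offset to a physical disk address.
        for jstart, dstart, length in extents:
            if jstart <= off < jstart + length:
                return dstart + (off - jstart)
        raise ValueError("offset beyond mapped journal")

    # The commit path appends at a single logical tail (0, 1, 2, ...) while
    # the mapping layer absorbs fragmentation and relocation.
    print(journal_to_disk(10))  # -> 5010
    print(journal_to_disk(70))  # -> 9006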
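Second sketch, for the metric comment above (all numbers invented): raw average response time is not comparable across runs that access different numbers of blocks, whereas time per block or throughput is:

    runs = [  # (total response time in seconds, blocks accessed)
        (12.0, 30000),
        (20.0, 60000),
    ]
    for total_s, blocks in runs:
        print(f"{total_s / blocks * 1e6:.1f} us/block,"
              f" {blocks / total_s:.0f} blocks/s")
    # The second run has the larger raw time but the better per-block figure.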

* Minor Comments:

- On page 2, “interchangeable” → “interchangeably”.

- On page 2, check the sentence construction “…the chance such an inconsistency occurs.”

- In Figure 2, change the caption from “logic” to “logical”. This change needs to be repeated in several places in the paper.

- On page 6, typos: “If merge journaling”, “how we merges”.

- On page 6, “sequential access pattern complying with logic locality are common…” does not make sense; it should presumably be “not complying”.

- On page 7, change “largely speed up” → “largely sped up”?

- In the conclusion, change “Read world” → “Real world”.

- On page 7, change “With the increasing of disk capacity” → “With increasing disk capacity”.

==+== G. Comments for PC (hidden from authors)

==+== End Review
