
==+== FAST '09 Paper Review Form

==-== Set the paper number and fill out lettered sections A through G.

==-== DO NOT CHANGE LINES THAT START WITH “==+==”!

==+== RAJU MUST REPLACE THIS LINE BEFORE UPLOADING

==+== Begin Review

==+== Paper #000000

==-== Replace '000000' with the actual paper number.

==+== Review Readiness

==-== Enter “Ready” here if the review is ready for others to see:

Ready

==+== A. Overall merit

==-== Choices: 1. Reject

==-== 2. Weak reject

==-== 3. Weak accept

==-== 4. Accept

==-== 5. Strong accept

2

==+== B. Novelty

==-== Enter a number from 1 to 5.

==-== Choices: 1. Published before

==-== 2. Done before (not necessarily published)

==-== 3. Incremental improvement

==-== 4. New contribution

==-== 5. Surprisingly new contribution

2

==+== C. Longevity

==-== How important will this work be over time?

==-== Enter a number from 1 to 5.

==-== Choices: 1. Not important now or later

==-== 2. Low importance

==-== 3. Average importance

==-== 4. Important

==-== 5. Exciting

2

==+== D. Reviewer expertise

==-== Enter a number from 1 to 4.

==-== Choices: 1. No familiarity

==-== 2. Some familiarity

==-== 3. Knowledgeable

==-== 4. Expert

3

==+== E. Paper summary

This paper presents a driver-level write cache (between the device and bus drivers) designed for NAND flash memories, which helps reduce the number of writes. This is done by merging temporally subsequent requests to the same sectors, by merging spatially consecutive requests, and by merging “almost” spatially consecutive requests after padding the gap between them with zeroes.

==+== F. Comments for author

Writing to NAND flash memories may require live-page copying and block erasing, both of which are expensive operations. For the same amount of data written to the same number of files, the FAT file system writes a far greater number of sectors: around 10 times more than NTFS. This paper presents a mechanism placed below the file system that makes it possible to reduce the number of writes.

The basic idea is to use a write cache that holds the recently used sectors (in LRU fashion). Every write request is cached and put on hold (not sent to the device) until it is evicted later under cache pressure or when the “Allow medium removal” command is issued. This reduces the number of requests to the same sectors. Additionally, whenever a write request arrives at the cache, the cache is searched for sectors within the same cluster that can be merged with it. If an adjacent entry is found, the two are merged; if there is empty space between them, the gap is padded with zeroes and the requests are merged (see the sketch below).
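
To check my understanding, here is a minimal sketch of the merge-and-pad policy as I reconstruct it from the paper; the names (cache_entry, try_merge), the fixed run size, and the exact conditions are my own assumptions, not the authors' code:

```c
/* Reviewer's sketch of the described merge policy; all names and
 * sizes are assumptions, not taken from the paper. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512
#define MAX_RUN     128   /* assumed: max sectors per cache entry */

struct cache_entry {
    uint64_t start;                       /* first sector of the run */
    uint32_t count;                       /* sectors currently held  */
    uint8_t  buf[MAX_RUN * SECTOR_SIZE];  /* cached contents         */
};

/* Try to fold a new write of n sectors at `sector` into entry e. */
static bool try_merge(struct cache_entry *e, uint64_t sector,
                      uint32_t n, const uint8_t *data)
{
    /* Case 1: overlapping or directly adjacent -> plain merge
     * (covers both repeated writes to the same sectors and
     * spatially consecutive requests). */
    if (sector >= e->start && sector <= e->start + e->count) {
        uint64_t off = sector - e->start;
        if (off + n > MAX_RUN)
            return false;                 /* would overflow the run  */
        memcpy(e->buf + off * SECTOR_SIZE, data,
               (size_t)n * SECTOR_SIZE);
        if (off + n > e->count)
            e->count = (uint32_t)(off + n);
        return true;
    }
    /* Case 2: "almost" consecutive -> zero-fill the gap, then
     * append. This is the step I object to below: the zeroes were
     * never requested by the file system. */
    if (sector > e->start + e->count &&
        sector + n - e->start <= MAX_RUN) {
        uint64_t gap = sector - (e->start + e->count);
        memset(e->buf + (size_t)e->count * SECTOR_SIZE, 0,
               (size_t)gap * SECTOR_SIZE);
        memcpy(e->buf + (size_t)(e->count + gap) * SECTOR_SIZE,
               data, (size_t)n * SECTOR_SIZE);
        e->count = (uint32_t)(e->count + gap + n);
        return true;
    }
    return false;
}
```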

The above idea is fairly simple; however, I have some fundamental concerns about the basic approach and key design decisions.

The first concern is that this work changes the persistence semantics of file I/O. By caching I/O operations in volatile memory and reporting them as complete, the file system is misled into believing that it has stored data persistently. This, I believe, is a fundamental flaw in the design. Yes, FAT does far more I/O, and most of it is metadata I/O (as you also point out), but you cannot change the persistence semantics by inserting a volatile cache. These semantics matter for the crash consistency of the file system (journaling file systems address this, and so do soft updates). Further, the file system may also rely on an explicit ordering of metadata updates on the persistent store, which is possibly disrupted by caching and flushing at a later point. The authors need to convince the reader why such modified persistence semantics are acceptable.
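
A toy example of the hazard (entirely my own construction, not taken from the paper): the file system orders a journal record before its commit record, but the volatile cache acknowledges both immediately and may flush them in a different order:

```c
/* Toy model of the ordering hazard: the file system believes write #1
 * is durable before it issues write #2, but the volatile cache may
 * flush #2 first and lose #1 on power cut. */
#include <stdio.h>

enum { JOURNAL = 0, COMMIT = 1, NSECT = 2 };

int flash[NSECT];                 /* persistent state             */
int cache[NSECT], dirty[NSECT];   /* volatile filter-driver cache */

void cached_write(int sector, int value)
{
    cache[sector] = value;
    dirty[sector] = 1;            /* completed to the FS here!    */
}

void flush_one(int sector)        /* eviction order is chosen by  */
{                                 /* the cache, not the FS        */
    if (dirty[sector]) {
        flash[sector] = cache[sector];
        dirty[sector] = 0;
    }
}

int main(void)
{
    cached_write(JOURNAL, 1);     /* FS: journal record "durable" */
    cached_write(COMMIT, 1);      /* FS: commit record "durable"  */
    flush_one(COMMIT);            /* cache evicts COMMIT first    */
    /* --- power loss here: JOURNAL never reached the flash --- */
    printf("journal=%d commit=%d\n", flash[JOURNAL], flash[COMMIT]);
    /* prints "journal=0 commit=1": a commit without its journal
     * record, exactly the state the FS ordering was meant to
     * rule out. */
    return 0;
}
```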

The second odd design choice is the padding with zeroes, which I found very confusing. How can you write arbitrary zeroes to the disk when the file system did not request them? A more appropriate choice would be to do something like the BPLRU work (see the FAST 2008 paper on this): read the missing content from the disk, fill in the correct content, and then flush to the medium.
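
To be concrete, here is a minimal sketch of the read-fill-flush alternative I have in mind; the device helpers and all names are placeholders of mine, not the paper's API:

```c
/* Sketch of the read-fill-flush alternative (in the spirit of BPLRU,
 * FAST 2008). The device helpers below are stand-ins backed by a
 * memory array; they are not the paper's API. */
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512
#define DEV_SECTORS 1024

static uint8_t device[DEV_SECTORS * SECTOR_SIZE]; /* fake medium */

static void read_sectors(uint64_t start, uint32_t n, uint8_t *buf)
{
    memcpy(buf, device + start * SECTOR_SIZE, (size_t)n * SECTOR_SIZE);
}

static void write_sectors(uint64_t start, uint32_t n,
                          const uint8_t *buf)
{
    memcpy(device + start * SECTOR_SIZE, buf, (size_t)n * SECTOR_SIZE);
}

/* Flush a cached run that has a hole: instead of padding the hole
 * with zeroes (which clobbers whatever is on the medium there), read
 * the missing sectors, splice them in, and issue one sequential
 * write of the whole run. */
static void flush_with_hole(uint64_t start, uint32_t total,
                            uint32_t hole_off, uint32_t hole_len,
                            uint8_t *buf /* total*SECTOR_SIZE bytes */)
{
    read_sectors(start + hole_off, hole_len,
                 buf + (size_t)hole_off * SECTOR_SIZE);
    write_sectors(start, total, buf);
}
```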

In general, the writing of the paper needs to be improved. The writing is mostly implementation-oriented, which makes it difficult to understand how the individual components work (e.g., it is not clear what the roles of the various components of the filter driver are) and how they interact. A general comment is that the authors should motivate each design decision before describing the design and implementation.

I don't understand the cluster and non-cluster descriptions. From later sections of the paper, it seems that the non-cluster sections hold the file-system metadata and the cluster sections hold the data itself, but I am only guessing here.

At the end of the second column of page 6 there is an explanation of the phases of an I/O request to a NAND flash memory. Contrary to what you state, these phases are not really the reason for the low performance of write requests: the same steps apply to an IDE request as well. The real problem with writes is that they may involve erasing and live-page copying.

Figure 8 shows three options: no cache, LRU cache, and cohesive cache. Is the LRU cache the proposed cache but without padding and without merging of spatially consecutive requests? Does this apply to the experimental section as well?

Some terms are used without being defined; e.g., what is an IoPacket?

Some important design details are not presented; for example, how long are I/O operations delayed due to caching?

The experiments section should also be revised to include more experiments that illustrate the performance under different workloads (i.e., not file copying alone). For instance, well-established I/O benchmarks such as IOzone and Postmark would serve the evaluation better. The evaluation should also address the CPU and memory overhead introduced by the filter driver.

In Sec. 5.2, the experiments are not very well defined: are the writes done sequentially? If so, this is the best case for your improvement, but it does not say much about other workloads.

In Sec. 5.3, how is the time measured: is it application, processing, or device time? Further, what are the overheads of your driver?

In Sec. 5.4, what exactly do the benchmarks do? Explaining this would clarify both their goal and exactly what they evaluate.

In Sec. 5.5, the experiment shows that 64 KB works for the device used, but this does not necessarily imply that the same cache size will work best for all UFDs, since their architectures typically differ substantially across models and manufacturers.

Minor comments:

* At the end of page 2, right column: “On the other hand,” is misplaced; it would only be correct if you wanted to say that the reads per file DON'T remain different.

* The use of “e.g.” on page 3, at the bottom of the first column, is incorrect. What follows is not an example; it is the only case.

* The last paragraph of the first column of page 6 is very hard to read. The third sentence reads “There are five steps,” but not all five steps are explained, and “directory entry” appears twice in the list.

* Figure 8 is of very poor quality. What do the X and Y axes refer to? The text refers to “grey cells,” but in the figure I could not distinguish them from the white ones.

* At the bottom of the first column of page 8: “I[NR]” should be “[812,813]”, and the use of “… (/…)” should be avoided.

==+== G. Comments for PC (hidden from authors)

[Enter your comments here]

==+== End Review