====== SSD Caching vs. Tiering ======
===== Participants =====
* Wendy Belluomini
* Joseph Glider
* [[http://www.cs.fiu.edu/~jguerra/|Jorge Guerra]]
* Ajay Gulati
* Bindu Pucha
* [[http://www.cs.fiu.edu/~raju/|Raju Rangaswami]]
===== Project Goals =====
Find the best way to use SSDs as a front-end tier by designing a solution that considers device-specific characteristics and uses a combination of caching and tiering techniques. There are quite a few unknowns in designing such a solution, which we are trying to explore.
===== Meetings =====
We meet every Thursday at 3pm PST (6pm EST), via 888-426-6840 (Participant Code: 45536712).
* 09/06/12: More tree updates and plans for workload simulation {{internal:projects:ct:ct090612.mp3|mp3}}
* 09/04/12: Tree update and simulating workloads at tree decision points {{internal:projects:ct:ct090412.mp3|mp3}}
* 08/28/12: More refining of decision tree {{internal:projects:ct:ct082812.mp3|mp3}}
* 08/24/12: Refining decision tree {{internal:projects:ct:ct082412.mp3|mp3}}
* 08/08/12: Paper message possibility: Cache-Tier Decision Tree {{internal:projects:ct:ct080812.mp3|mp3}}
* 08/02/12: List of observations and conclusions {{internal:projects:ct:ct080212.mp3|mp3}}
* 07/24/12: FAST paper message {{internal:projects:ct:ct072412.mp3|mp3}}
* 06/21/12: Thesis chapter feedback {{internal:projects:ct:ct062112.mp3|mp3}}
* 03/09/11: Reviewing writing, plans for new experiments, and load-balance/sequentiality discussion {{internal:projects:ct:ct030911.mp3|mp3}}
* 02/25/11: Data analysis, talking points, and writing plans {{internal:projects:ct:ct022511.mp3|mp3}}
* 02/21/11: HotStorage submission plans {{internal:projects:ct:ct022111.mp3|mp3}}
===== Latest Updates =====
==== MSR Workloads 2 Hour Runs ====
In this set of experiments I examine two points:
* First, for a fairer comparison, I feed both algorithms extent statistics. To gather these statistics, I simulate a run of the complete week-long trace. For EDT, I use the gathered statistics to derive an initial extent placement. For MQ, the collected statistics populate the LRU queues and ghost buffer data, which determine the cache lines initially resident in the cache.
* Second, the new hardware (in particular the SSDs) performs much better than the hardware previously used in the EDT paper. More concretely, I run the same workload on two different hardware configurations:
== IBM ==
| Type | Model | Avail | Cost (per device) |
| SSD | Intel x25 MLC | 2 | $430 |
| SAS | ST3450857SS | 4 | $325 |
| SATA | ST31000340NS | 4 | $170 |
* Costs as reported in the EDT (FAST) paper.
== FIU ==
| Type | Model | Avail | Cost (per device) |
| SSD | Intel 320 Series MLC | 2 | $220 |
| SAS | ST3300657SS | 4 | $220 |
| SATA | ST31000524NS | 4 | $110 |
* Costs from a Google search.
=== Server peak (hours 2-4 of the trace) ===
For this experiment I use an SSD+SATA configuration.
== Response time ==
| FIU | IBM |
| {{internal:projects:ct:rt-time-server-peak-fiu.png|}}| {{internal:projects:ct:rt-time-server-peak-ibm.png|}}|
| **Avg RT: MQ=0.805ms EDT=3.608ms** | **Avg RT: MQ=9.445ms EDT=9.927ms**|
| {{internal:projects:ct:rt-clock-cdf-server-peak-fiu.png|}}| {{internal:projects:ct:rt-clock-cdf-server-peak-ibm.png|}} |
== I/O distribution ==
| | FIU | IBM |
| MQ | {{internal:projects:ct:server-peak-mq-1mb-iodist-fiu.png|}}| {{internal:projects:ct:server-peak-mq-1mb-iodist-ibm.png|}}|
| EDT | {{internal:projects:ct:server-peak-edt-1mb-iodist-fiu.png|}}| {{internal:projects:ct:server-peak-edt-1mb-iodist-ibm.png|}}|
=== Source Control peak (hours 2-4 of the trace) ===
For this experiment I use an SSD+SATA configuration.
== Response time ==
| FIU | IBM |
| {{internal:projects:ct:rt-time-srccntl-peak-fiu.png|}}| {{internal:projects:ct:rt-time-srccntl-peak-ibm.png|}}|
| **Avg RT: MQ=1.353ms EDT=1.856ms** | **Avg RT: MQ=7.363ms EDT=3.190ms**|
| {{internal:projects:ct:rt-clock-cdf-srccntl-peak-fiu.png|}}| {{internal:projects:ct:rt-clock-cdf-srccntl-peak-ibm.png|}}|
== I/O distribution ==
| | FIU | IBM |
| MQ | {{internal:projects:ct:srccntl-peak-mq-1mb-iodist-fiu.png|}}| {{internal:projects:ct:srccntl-peak-mq-1mb-iodist-ibm.png|}}|
| EDT | {{internal:projects:ct:srccntl-peak-edt-1mb-iodist-fiu.png|}}| {{internal:projects:ct:srccntl-peak-edt-1mb-iodist-ibm.png|}}|
=== Complete MSR peak (hours 8-10 of the trace) ===
For this experiment I use an SSD+SAS+SATA configuration.
== Response time ==
| FIU | IBM |
| {{internal:projects:ct:rt-time-msr-peak-fiu.png|}}| {{internal:projects:ct:rt-time-msr-peak-ibm.png|}}|
| **Avg RT: MQ=2.270ms MQ-UB=2.684ms** | **Avg RT: MQ=23.481ms MQ-UB=18.226ms**|
| {{internal:projects:ct:rt-clock-cdf-msr-peak-fiu.png|}}| {{internal:projects:ct:rt-clock-cdf-msr-peak-ibm.png|}}|
== I/O distribution ==
| | FIU | IBM |
| MQ | {{internal:projects:ct:msr-peak-mq-1mb-iodist-fiu.png|}}| {{internal:projects:ct:mq-1mb-iodist-msr-peak-ibm.png|}}|
| MQ-UB | {{internal:projects:ct:msr-peak-mq-1mb-ub-iodist-fiu.png|}} | {{internal:projects:ct:mq-1mb-ub-iodist-msr-peak-ibm.png|}}|
==== Workload Analysis ====
All data and plots shown below assume 1 MB extents (cache lines). A short sketch of how the summary statistics are computed from the traces follows the table below.
== Workload Summary ==
| Workload | Length (days) | Total I/Os | Active Exts | Total Exts | % Exts Accessed |
| server | 7 | 219828231 | 522919 | 1690624 | 30.930 |
| data | 7 | 125587968 | 2368184 | 3809280 | 62.168 |
| srccntl | 7 | 88442081 | 311331 | 925696 | 33.632 |
| fiu-nas | 12 | 1316154000 | 9849142 | 20507840 | 48.026 |
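As a reference for how the table above is derived, here is a minimal sketch. It assumes a simplified trace format of one I/O per line, ''timestamp offset_bytes length_bytes''; the real MSR/FIU traces have more fields, so the parsing would need to be adapted.
<code python>
# Minimal sketch of the extent-level summary in the table above.
# Assumes a simplified trace format ("timestamp offset_bytes length_bytes"
# per line); the real MSR/FIU traces have more fields.
EXTENT_SIZE = 1 << 20  # 1 MB extents (cache lines)

def summarize(trace_path, total_extents):
    active = set()
    total_ios = 0
    with open(trace_path) as trace:
        for line in trace:
            _, offset, length = line.split()[:3]
            offset, length = int(offset), int(length)
            total_ios += 1
            # Mark every 1 MB extent touched by this I/O as active.
            first = offset // EXTENT_SIZE
            last = (offset + length - 1) // EXTENT_SIZE
            active.update(range(first, last + 1))
    pct_accessed = 100.0 * len(active) / total_extents
    return total_ios, len(active), pct_accessed

# e.g. summarize("server.trace", total_extents=1690624)
</code>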
== IOPS ==
**__SERVER__**
{{internal:projects:ct:server-iops.png|IOPS}}
**__DATA__**
{{internal:projects:ct:data-iops.png|IOPS}}
**__SRCCNTL__**
{{internal:projects:ct:srccntl-iops.png|IOPS}}
**__FIU NAS__**
{{internal:projects:ct:fiu-nas01-iops.png|IOPS}}
== I/O Distribution across extents ==
**__SERVER__**
{{internal:projects:ct:server01-act-iod.png|Only active extents}} {{internal:projects:ct:server01-iod.png|Complete system}}
**__DATA__**
{{internal:projects:ct:data01-act-iod.png|Only active extents}} {{internal:projects:ct:data01-iod.png|Complete system}}
**__SRCCNTL__**
{{internal:projects:ct:srccntl01-act-iod.png|Only active extents}} {{internal:projects:ct:srccntl01-iod.png|Complete system}}
**__FIU NAS__**
{{internal:projects:ct:fiu-nas1-act-iod.png|Only active extents}} {{internal:projects:ct:fiu-nas01-iod.png|Complete system}}
== Extent reuse distance ==
**__SERVER__**
{{internal:projects:ct:server-reuse.png|}} {{internal:projects:ct:server1-reuse-cdf.png|}}
**__DATA__**
{{internal:projects:ct:data-reuse.png|}} {{internal:projects:ct:data1-reuse-cdf.png|}}
**__SRCCNTL__**
{{internal:projects:ct:srccntl-reuse.png|}} {{internal:projects:ct:srccntl1-reuse-cdf.png|}}
**__FIU NAS__**
{{internal:projects:ct:fiu-nas01-reuse.png|}} {{internal:projects:ct:fiu-nas1-reuse-cdf.png|}}
== Expected Hit Ratio ==
**__SERVER__**
{{internal:projects:ct:server-es6-c10000000-reuse-cdf.png|}}
**__DATA__**
{{internal:projects:ct:data-es6-c10000000-reuse-cdf.png|}}
**__SRCCNTL__**
{{internal:projects:ct:srccntl-es6-c10000000-reuse-cdf.png|}}
**__FIU NAS__**
{{internal:projects:ct:fiu-nas-reuse-lba-cdf.png|}}
== Hit Ratio ==
Hit rate assuming the SSD cache is provisioned at 5% and 1% of the HDD space.
**__SERVER__**
{{internal:projects:ct:server-hits.png|}} {{internal:projects:ct:server01-hits.png|}}
**__DATA__**
{{internal:projects:ct:data-hits.png|}} {{internal:projects:ct:data01-hits.png|}}
**__SRCCNTL__**
{{internal:projects:ct:srccntl-hits.png|}} {{internal:projects:ct:srccntl01-hits.png|}}
**__FIU NAS__**
{{internal:projects:ct:fiu-nas-1-hits.png|}} {{internal:projects:ct:fiu-nas01-hits.png|}}
== Cache Lifetime ==
**__SERVER__**
{{internal:projects:ct:server-life.png|}}
**__DATA__**
{{internal:projects:ct:data-life.png|}}
**__SRCCNTL__**
{{internal:projects:ct:srccntl-life.png|}}
**__FIU NAS__**
{{internal:projects:ct:fiu-nas01-life.png|}}
===== Project TODOs =====
==== Tiering and Caching Scenarios ====
For all experiments, a few different configurations are tried out:
* edt: EDT using 1 MB extents and reconfiguring every 30 minutes; note that throttling detection/correction runs in the background.
* mq: the MQ caching algorithm (configured as in the paper) with cache lines of varying size. The cache is populated synchronously on a miss. For a read miss, a read of the whole line is issued to the disk; once it completes, the I/O that triggered the miss is serviced from memory and the line is then inserted into the cache. For a write miss there are two cases: if the I/O is smaller than the cache line, the remaining piece of the line is read from disk, merged with the I/O, and the merged cache line is written asynchronously; if the I/O is equal to or larger than the cache line, it is simply written to the cache as is (see the sketch after this list).
* hybrid: same as mq, but with the "load balancing" algorithm proposed by Ajay added.
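Below is a minimal, self-contained sketch of the synchronous miss-handling path described in the mq item above. The MQ queue/ghost-buffer bookkeeping is replaced by a plain dict and the devices by in-memory buffers, so it illustrates only the fill/merge logic, not our actual prototype.
<code python>
# Sketch of the synchronous cache-population path described in the mq item.
# MQ bookkeeping is replaced by a plain dict; "disk" is an in-memory stand-in
# for the backing HDD. Assumes I/Os do not span cache-line boundaries.
LINE = 16 * 1024                      # cache line size (16 KB in this example)
disk = bytearray(1 << 24)             # stand-in for the backing HDD (16 MB)
cache = {}                            # line offset -> line contents on the "SSD"

def handle_read(offset, length):
    line = (offset // LINE) * LINE
    if line not in cache:                             # read miss:
        cache[line] = bytes(disk[line:line + LINE])   # fetch the whole line
    buf = cache[line]
    return buf[offset - line:offset - line + length]  # serve from memory

def handle_write(offset, data):
    line = (offset // LINE) * LINE
    if len(data) >= LINE:                     # line-sized or larger: as is
        cache[line] = bytes(data[:LINE])      # (spill-over past the line is
                                              #  ignored in this sketch)
    else:                                     # partial write: read-merge-write
        merged = bytearray(cache.get(line) or disk[line:line + LINE])
        merged[offset - line:offset - line + len(data)] = data
        cache[line] = bytes(merged)           # merged line goes to the cache
    # Destaging to disk is deferred; the current prototype is write-back
    # (see "Issue of write-through vs. write-back" below).
</code>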
=== RAND READ ===
{{internal:projects:ct:rt-time-micro-rand.png|}}
{{internal:projects:ct:rt-clock-cdf-micro-rand.png|}}
{{internal:projects:ct:iops-micro-rand.png|}}
I/O and extent distribution over time (top plot shows I/Os, bottom shows extents)
EDT
{{internal:projects:ct:micro-rand-edt-es10-iodist.png|}}
{{internal:projects:ct:micro-rand-edt-es10-migrs.png|}}
MQ
The first column uses 16 KB cache lines and the second 4 KB.
{{internal:projects:ct:micro-rand-mq-es4-iodist.png|}} {{internal:projects:ct:micro-rand-mq-es2-iodist.png|}}
{{internal:projects:ct:micro-rand-mq-es4-migrs.png|}} {{internal:projects:ct:micro-rand-mq-es2-migrs.png|}}
=== RAND WRITE ===
{{internal:projects:ct:rt-time-micro-randw.png|}}
{{internal:projects:ct:rt-clock-cdf-micro-randw.png|}}
{{internal:projects:ct:iops-micro-randw.png|}}
I/O and extent distribution over time (top plot shows I/Os, bottom shows extents)
EDT
{{internal:projects:ct:micro-randw-edt-es10-iodist.png|}}
{{internal:projects:ct:micro-randw-edt-es10-migrs.png|}}
MQ
The first column uses 16 KB cache lines and the second 4 KB.
{{internal:projects:ct:micro-randw-mq-es4-iodist.png|}} {{internal:projects:ct:micro-randw-mq-es2-iodist.png|}}
{{internal:projects:ct:micro-randw-mq-es4-migrs.png|}} {{internal:projects:ct:micro-randw-mq-es2-migrs.png|}}
=== SEQ READ ===
Sequentially read 15 GB from a 60 GB file, in a loop, for 1 hour.
{{internal:projects:ct:rt-time-micro-seq.png|}}
{{internal:projects:ct:rt-clock-cdf-micro-seq.png|}}
{{internal:projects:ct:iops-micro-seq.png|}}
I/O and extent distribution over time (top plot shows I/Os, bottom shows extents)
EDT
{{internal:projects:ct:micro-seq-edt-es10-iodist.png|}}
{{internal:projects:ct:micro-seq-edt-es10-migrs.png|}}
=== SEQ WRITE ===
Sequentially write 15 GB to a 60 GB file, in a loop, for 1 hour.
{{internal:projects:ct:rt-time-micro-seqw.png|}}
{{internal:projects:ct:rt-clock-cdf-micro-seqw.png|}}
{{internal:projects:ct:iops-micro-seqw.png|}}
I/O and extent distribution over time (top plot shows I/Os, bottom shows extents)
EDT
{{internal:projects:ct:micro-seqw-edt-es10-iodist.png|}}
{{internal:projects:ct:micro-seqw-edt-es10-migrs.png|}}
==== Study related work on SSD caching ====
In particular, the [[http://www.ee.tamu.edu/~reddy/papers/MASCOTS-2010.pdf|paper from Reddy's group]]. We should also look at Ismail Ari's work from HP Labs; IIRC, he raised the issue of explicitly modeling the cost of writes many years ago.
- [[http://faculty.chicagobooth.edu/robert.gramacy/papers/gra2003-02.pdf|Adaptive Caching By Refetching]]
- [[http://users.soe.ucsc.edu/~elm/Papers/ddas02.pdf|Adaptive Caching using Multiple Experts]]
- [[http://ssrc.cse.ucsc.edu/Papers/ari-wdas02.pdf|Who is more adaptive? ACME: Adaptive Caching using Multiple Experts (Abstract)]]
==== Explore different caching policies (algorithms) ====
Candidates include LRU, [[http://www.usenix.org/events/fast03/tech/full_papers/megiddo/megiddo.pdf|ARC]], and {{paperclub:10.1.1.121.1470.pdf|MQ}}; we should measure how each performs.
Here we should first evaluate these algorithms as they are. Then we can modify them to write a block to the SSDs only when we think it should go in, rather than on every miss. This will require modeling the cost of writes explicitly and not incurring them for every miss.
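A rough sketch of what such a write-admission check might look like; the cost numbers and the way expected future reads would be estimated (e.g., from ghost-buffer statistics) are illustrative assumptions, not measured values.
<code python>
# Hypothetical cost-aware admission check: insert a missed block into the
# SSD cache only if its expected read benefit outweighs the SSD write cost.
# All cost numbers below are illustrative assumptions, not measurements.
SSD_WRITE_COST_MS = 0.2   # assumed cost of writing one cache line to SSD
SSD_READ_COST_MS = 0.1    # assumed cost of one SSD read
HDD_READ_COST_MS = 8.0    # assumed cost of one random HDD read

def should_cache(expected_future_reads):
    """Admit a block on a miss only if caching it pays for its SSD write."""
    benefit = expected_future_reads * (HDD_READ_COST_MS - SSD_READ_COST_MS)
    return benefit > SSD_WRITE_COST_MS

# expected_future_reads could come from access counts kept in the ghost
# buffer; with the numbers above even a single expected re-read justifies
# admission, so the check matters most when SSD writes are comparatively
# expensive or when wear is folded into SSD_WRITE_COST_MS.
</code>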
=== Current Status ===
Currently I plan to evaluate two caching solutions. First, LRU, which represents a basic solution in which every page accessed is brought into the cache if not already present. Second, a more complex approach, MQ, which was designed for storage and accounts for both frequency and recency.
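For the LRU baseline, a minimal trace-driven sketch (hit/miss counting only, every missed extent always admitted); this is a simulation aid, not the prototype code.
<code python>
from collections import OrderedDict

# Minimal LRU extent cache for trace-driven hit/miss counting: every missed
# extent is always admitted, matching the "basic solution" described above.
class LRUCache:
    def __init__(self, capacity_extents):
        self.capacity = capacity_extents      # e.g. 5% of the HDD extent count
        self.entries = OrderedDict()
        self.hits = self.misses = 0

    def access(self, extent):
        if extent in self.entries:
            self.entries.move_to_end(extent)  # refresh recency on a hit
            self.hits += 1
        else:
            self.misses += 1
            self.entries[extent] = True       # always admit on a miss
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the LRU extent
</code>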
==== Try out different caching unit ====
Current experiments are done using 1 MB extents; try different sizes and measure their impact on performance.
Here we want to try a basic caching style with no prefetching and compare it with prefetching a larger block on a miss and writing that to the SSD. I agree with Jody that 1 MB may be overkill. We should also consider the SSD erase block size here.
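A small sketch of the comparison just described: on a miss, either cache only the missed 4 KB block, or prefetch the whole (larger) line around it and write that to the SSD. The device accessors are passed in as callables since this is only illustrative.
<code python>
# Illustrative miss fill: no prefetching (cache just the missed 4 KB block)
# vs. prefetching the whole surrounding line and writing it to the SSD.
# read_disk(offset, size) and write_ssd(offset, data) are supplied by the
# caller; line_size is the parameter we want to vary in experiments.
BLOCK = 4 * 1024

def fill_on_miss(offset, line_size, prefetch, read_disk, write_ssd):
    if prefetch:
        start = (offset // line_size) * line_size
        write_ssd(start, read_disk(start, line_size))   # whole line to SSD
    else:
        start = (offset // BLOCK) * BLOCK
        write_ssd(start, read_disk(start, BLOCK))       # only the missed block
</code>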
=== Current Status ===
== 2 Tiers (SSD+SATA) ==
First 12 hours of the MSR server workload.
{{internal:projects:ct:power-sum-server-ssdsata.png|}}
{{internal:projects:ct:rt-clock-cdf-server-ssdsata.png|}}
cache-X represents a cache line size of 2^X
==== Look into provisioning cache systems. ====
One possibility is to use a trace to provision, as in the EDT work, but now use the requirements of the optimal placement as the outcome of the provisioning.
To my knowledge, including a few Google searches, there is not much work on how to provision cache systems. The general intuition is that the bigger the cache, the more benefit one should get, at increased cost.
=== First approach ===
Compute the working set size for some interval with a given "adequate" cache line size, then take the maximum working set size across time as the size of the cache (see the sketch after these questions). The key questions here are:
* //How long does my interval need to be?// In general, the larger the interval, the bigger the working set, as more data is expected to be accessed within a longer period of time. On the other hand, one wants the system to perform well during peak I/O load periods, since cache misses there will likely be more costly due to the increased load. Therefore, we may need to use the length of the peaks in I/O load to determine the "adequate" interval length.
* //What is the adequate cache line size?// Common sense points to smaller cache lines as more effective. However, the experiments might provide some clues about which size is better.
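A minimal sketch of this first approach, using the same simplified trace format as in the workload summary sketch above; the interval length and cache line size are exactly the two open parameters from the questions.
<code python>
# Sketch of the max-working-set provisioning approach described above.
# Assumes the simplified "timestamp offset_bytes length_bytes" trace format;
# interval_s and line_size are the two open parameters discussed above.
def provision(trace_path, interval_s, line_size):
    max_ws, current, interval_end = 0, set(), None
    with open(trace_path) as trace:
        for rec in trace:
            ts, offset, length = (float(x) for x in rec.split()[:3])
            if interval_end is None:
                interval_end = ts + interval_s
            if ts >= interval_end:                 # close the current interval
                max_ws = max(max_ws, len(current))
                current.clear()
                interval_end = ts + interval_s
            first = int(offset) // line_size
            last = (int(offset) + int(length) - 1) // line_size
            current.update(range(first, last + 1)) # lines touched this interval
    max_ws = max(max_ws, len(current))
    return max_ws * line_size                      # suggested cache size, bytes

# e.g. provision("server.trace", interval_s=3600, line_size=1 << 20)
</code>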
==== Investigate load balancing with caching ====
The work from [[http://www.ee.tamu.edu/~reddy/papers/MASCOTS-2010.pdf|Reddy's group]] describes a method to load balance the I/O load in a two-tier system by continuously migrating data among devices in an effort to improve performance based on I/O parallelism. In particular, they migrate data in three scenarios (a schematic of these triggers follows the list):
- Cold data is migrated, in the background, from the faster devices to the larger devices to make room on the smaller, faster devices.
- Hot data is migrated to lighter-loaded devices, in the background, to even out performance of the devices.
- Data is migrated on writes, when writes are targeted to the heavier-loaded device and the memory cache has all the blocks of the extent to which the write blocks belong.
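A schematic of the three triggers as summarized above; this is only a paraphrase of the paper's description into a decision function, not its implementation, and all inputs are stand-in values supplied by the caller.
<code python>
# Schematic decision logic for the three migration triggers summarized above.
# This paraphrases the paper's description; it is not their implementation.
def migration_decision(is_hot, on_fast_dev, fast_dev_full,
                       this_dev_load, other_dev_load,
                       is_write=False, extent_fully_in_memcache=False):
    if on_fast_dev and not is_hot and fast_dev_full:
        return "demote cold extent to larger device"          # trigger 1
    if is_hot and this_dev_load > other_dev_load:
        return "migrate hot extent to lighter-loaded device"  # trigger 2
    if is_write and this_dev_load > other_dev_load and extent_fully_in_memcache:
        return "migrate extent as part of the write"          # trigger 3
    return "no migration"
</code>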
==== Get more insights into the workload characteristics ====
This is very useful and needed. It would be good if we could do some of the above-mentioned analysis in a simulator using different workloads and then implement the near-final design in a real system. This is just my suggestion; feel free to go the implementation route if possible.
=== MSR Workloads ===
Data plots from individual MSR volumes {{internal:projects:ct:msr-volumes.zip|}}.
==== Issue of write-through vs. write-back ====
Here I think write-through is pretty much necessary when the SSDs are local to the server and a host failure could lead to data loss. If the SSDs are on the array and/or are mirrored in some way, we can use write-back. I think for the first cut we can stick to write-through.
=== Current Status ===
The current system is designed to be write-back. The write-through mechanism still needs to be coded; it will require some work, probably a couple of days of coding.
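For reference, a minimal illustration of the difference between the two write paths, using in-memory stand-ins for the devices; this is not the prototype's code.
<code python>
# Minimal illustration of the two write policies discussed above, with
# in-memory stand-ins for the SSD cache and the HDD (not the prototype code).
ssd_cache = {}   # line offset -> (data, dirty)
hdd = {}         # line offset -> data

def write_back(line, data):
    ssd_cache[line] = (data, True)    # mark dirty; the HDD copy is updated
                                      # later (on destage/eviction), so losing
                                      # the SSD or host before destage can
                                      # lose acknowledged writes

def write_through(line, data):
    hdd[line] = data                  # persist to the HDD first...
    ssd_cache[line] = (data, False)   # ...then keep a clean copy on the SSD,
                                      # so the cache never holds dirty data
</code>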
==== What is the adequate number of tiers? ====
Are two tiers (SSD+SATA) enough, or do we need three tiers (SSD+SAS+SATA)? If three tiers are needed, how do we design a caching solution?
==== Minor Issues ====
* Correct units on plots.
* Get plots with extent placement, and random vs sequential IOPS.
* Get heatmap of I/O activity through time.