SSD Caching vs. Tiering

Participants

Project Goals

Find out the best way to use SSDs as a front-end tier by designing a solution that considers device-specific characteristics and uses a combination of caching and tiering techniques. There are quite a few unknowns in designing such a solution which we are trying to explore.

Meetings

We are meeting every Thursday at 3pm PST (6pm EST), via 888-426-6840 (Participant Code: 45536712).

Latest Updates

MSR Workloads: 2-Hour Runs

In this set of experiments I look at the two claims:

IBM
Type   Model            Avail   Cost*
SSD    Intel x25 MLC    2       $430
SAS    ST3450857SS      4       $325
SATA   ST31000340NS     4       $170

* Cost per device, according to the FAST paper

FIU
Type   Model                   Avail   Cost*
SSD    Intel 320 Series MLC    2       $220
SAS    ST3300657SS             4       $220
SATA   ST31000524NS            4       $110

* Cost per device, according to Google

Server peak (hours 2-4 of the trace)

For this experiment I use an SSD+SATA configuration.

Response time (average)

        FIU        IBM
MQ      0.805 ms   9.445 ms
EDT     3.608 ms   9.927 ms

I/O distribution plots: MQ and EDT, on both the FIU and IBM setups.

Source Control peak (hours 2-4 of the trace)

For this experiment I use an SSD+SATA configuration.

Response time (average)

        FIU        IBM
MQ      1.353 ms   7.363 ms
EDT     1.856 ms   3.190 ms

I/O distribution plots: MQ and EDT, on both the FIU and IBM setups.

Complete MSR peak (hours 8-10 of the trace)

For this experiment I use an SSD+SAS+SATA configuration.

Response time (average)

         FIU        IBM
MQ       2.270 ms   23.481 ms
MQ-UB    2.684 ms   18.226 ms

I/O distribution plots: MQ and MQ-UB, on both the FIU and IBM setups.

Workload Analysis

All data and plots shown below assume 1 MB extents (cache lines).

Workload Summary
Workload   Length (days)   Total I/Os     Active Exts   Total Exts   % Exts Accessed
server     7                219828231        522919       1690624    30.930
data       7                125587968       2368184       3809280    62.168
srccntl    7                 88442081        311331        925696    33.632
fiu-nas    12              1316154000       9849142      20507840    48.026
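For reference, the summary numbers above can be recomputed from a block trace roughly as in the sketch below. The CSV input with (timestamp, offset_bytes, length_bytes) fields is an assumption made for illustration, not the format of the actual MSR/FIU traces.

    import csv

    EXTENT_SIZE = 1 << 20  # 1 MB extents (cache lines), as in the plots above

    def summarize(trace_path, volume_size_bytes):
        """Count total I/Os and the distinct 1 MB extents they touch."""
        active = set()
        total_ios = 0
        with open(trace_path) as f:
            for timestamp, offset, length in csv.reader(f):
                total_ios += 1
                first = int(offset) // EXTENT_SIZE
                last = (int(offset) + int(length) - 1) // EXTENT_SIZE
                active.update(range(first, last + 1))  # extents spanned by this I/O
        total_exts = volume_size_bytes // EXTENT_SIZE
        pct_accessed = 100.0 * len(active) / total_exts
        return total_ios, len(active), total_exts, pct_accessed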
IOPS

IOPS-over-time plots for each workload: server, data, srccntl, fiu-nas.

I/O Distribution across extents

Plots for each workload (server, data, srccntl, fiu-nas), shown two ways: only active extents, and the complete system.

Extent reuse distance

Extent reuse distance plots for each workload: server, data, srccntl, fiu-nas.

Expected Hit Ratio

Expected hit ratio plots for each workload: server, data, srccntl, fiu-nas.

Hit Ratio

Hit rate assuming the SSD cache is provisioned at 5% and 1% of the HDD space.

Hit ratio plots for each workload: server, data, srccntl, fiu-nas.

Cache Lifetime

Cache lifetime plots for each workload: server, data, srccntl, fiu-nas.

Project TODOs

Tiering and Caching Scenarios

For all experiments, a couple of different configurations are being tried out:

RAND READ

Plots (EDT and MQ): I/O and extent distribution through time; the top plot is I/O and the bottom is extents. The first column uses 16 KB cache lines and the second 4 KB.

RAND WRITE

Plots (EDT and MQ): I/O and extent distribution through time; the top plot is I/O and the bottom is extents. The first column uses 16 KB cache lines and the second 4 KB.

SEQ READ

Sequentially read 15 GB from a 60 GB file, in a loop for 1 hour.

Plots (EDT): I/O and extent distribution through time; the top plot is I/O and the bottom is extents.

SEQ WRITE

Sequentially write 15 GB of a 60 GB file, in a loop for 1 hour.

Plots (EDT): I/O and extent distribution through time; the top plot is I/O and the bottom is extents.

In particular, the paper from Reddy's group. We should also look at Ismail Ari's work from HP Labs; IIRC, he brought up the issue of modeling the cost of writes explicitly many years ago.

Explore different caching policies (algorithms)

Such as LRU, ARC, and MQ, and measure how they perform. Here we should first evaluate these algorithms as-is. Then we can modify them so that not every block is written to the SSD on a miss, but only the ones we think should go in. This will include modeling the cost of writes and not paying it on every miss.
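As a rough illustration of the last point (not a committed design), the sketch below wraps a plain LRU cache with a second-access admission filter, so a cold extent is served from the HDD on its first miss and only earns an SSD write if it is touched again soon; the history size is a made-up knob.

    from collections import OrderedDict

    class AdmissionLRUCache:
        """LRU cache that only admits an extent on its second recent miss,
        so cold, one-touch extents never cost an SSD write."""

        def __init__(self, capacity_extents, history_size=8192):
            self.capacity = capacity_extents
            self.cache = OrderedDict()      # resident extents, in LRU order
            self.seen_once = OrderedDict()  # recent misses not yet admitted
            self.history_size = history_size
            self.hits = self.misses = self.ssd_writes = 0

        def access(self, ext):
            if ext in self.cache:
                self.cache.move_to_end(ext)
                self.hits += 1
                return
            self.misses += 1
            if ext in self.seen_once:
                # Second miss within the history window: admit it (one SSD write).
                del self.seen_once[ext]
                if len(self.cache) >= self.capacity:
                    self.cache.popitem(last=False)   # evict the LRU extent
                self.cache[ext] = True
                self.ssd_writes += 1
            else:
                # First miss: remember it, but serve the I/O from the HDD only.
                self.seen_once[ext] = True
                if len(self.seen_once) > self.history_size:
                    self.seen_once.popitem(last=False)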

Current Status

Currently I plan to evaluate two caching solutions. First, LRU, which represents a basic solution in which every page accessed is brought into the cache if not already present. Second, a more complex approach, MQ, which was designed for storage and takes both frequency and recency into account.
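For intuition only, a heavily simplified MQ-style structure is sketched below: several LRU queues indexed by the log of each block's access frequency, with eviction starting from the lowest-frequency queue. The real MQ algorithm additionally uses per-block expiration times and a ghost history queue (Qout), both omitted here.

    from collections import OrderedDict
    import math

    class SimplifiedMQ:
        """Heavily simplified MQ flavor: blocks live in one of num_queues LRU
        queues chosen by log2 of their access frequency, so frequently used
        blocks are evicted last."""

        def __init__(self, capacity, num_queues=8):
            self.capacity = capacity
            self.queues = [OrderedDict() for _ in range(num_queues)]
            self.freq = {}

        def _queue_index(self, freq):
            return min(int(math.log2(freq)), len(self.queues) - 1)

        def access(self, block):
            hit = block in self.freq
            if hit:
                del self.queues[self._queue_index(self.freq[block])][block]
            elif len(self.freq) >= self.capacity:
                self._evict()
            self.freq[block] = self.freq.get(block, 0) + 1
            self.queues[self._queue_index(self.freq[block])][block] = True
            return hit

        def _evict(self):
            for q in self.queues:                      # lowest-frequency queue first
                if q:
                    victim, _ = q.popitem(last=False)  # LRU within that queue
                    del self.freq[victim]
                    return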

Try out different caching units

Current experiments are done using a 1 MB extent; try different sizes and measure their impact on performance. Here we want to try basic caching with no prefetching, and compare that with prefetching a larger block on a miss and writing it to the SSD. I agree with Jody that 1 MB may be overkill. We should also consider the SSD erase block size here.
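One way to quantify this is to replay the same trace at several cache-line sizes and compare the hit ratio against the bytes written to the SSD (the cost of prefetching the whole line on a miss). A minimal sketch, assuming a list of (offset, length) accesses and a plain LRU cache; the 4 KB / 16 KB / 1 MB sizes mirror the configurations above:

    from collections import OrderedDict

    def sweep_extent_sizes(accesses, cache_bytes, sizes=(4096, 16384, 1 << 20)):
        """Replay (offset, length) accesses through an LRU cache of fixed byte
        capacity at each candidate cache-line size; report hit ratio and the
        bytes written to the SSD (one full line per miss)."""
        results = {}
        for line_size in sizes:
            capacity = max(1, cache_bytes // line_size)
            cache, hits, misses = OrderedDict(), 0, 0
            for offset, length in accesses:
                first = offset // line_size
                last = (offset + length - 1) // line_size
                for ext in range(first, last + 1):
                    if ext in cache:
                        cache.move_to_end(ext)
                        hits += 1
                    else:
                        misses += 1                     # whole line fetched and written
                        if len(cache) >= capacity:
                            cache.popitem(last=False)   # evict LRU line
                        cache[ext] = True
            results[line_size] = {
                "hit_ratio": hits / max(1, hits + misses),
                "ssd_bytes_written": misses * line_size,
            }
        return results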

Current Status

2 Tiers (SSD+SATA)

First 12 hours of the MSR server workload.

cache-X represents a cache line size of 2^X

Look into provisioning cache systems.

One possibility is to use a trace to provision, just like in the EDT work, but now take the requirements of the optimal placement as the result of the provisioning.

To my knowledge, including a few Google searches, there is not much work on how to provision cache systems. The general feeling is that the bigger the cache, the more benefit one should get, at an increased cost.

First approach

Compute the working set size for some interval with a given “adequate” cache line size. Then take the maximum working set size across time as the size of the cache. The key questions here are:
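For reference, the max-working-set computation itself could be as small as the sketch below; the one-hour interval and the (timestamp, extent_id) input format are assumptions made for illustration:

    def provision_from_working_set(accesses, interval_secs=3600,
                                   extent_bytes=1 << 20):
        """accesses: iterable of (timestamp_secs, extent_id) pairs.
        Returns a suggested cache size in bytes: the peak per-interval
        working set (distinct extents) times the extent size."""
        working_sets = {}
        for ts, ext in accesses:
            working_sets.setdefault(int(ts // interval_secs), set()).add(ext)
        if not working_sets:
            return 0
        peak_extents = max(len(s) for s in working_sets.values())
        return peak_extents * extent_bytes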

Investigate load balancing with caching

The work from Reddy's group describes a method to load-balance the I/O load in a two-tier system by continuously migrating data among devices, in an effort to improve performance based on I/O parallelism. In particular, they migrate data in three scenarios (a rough sketch of these triggers follows the list):

  1. Cold data is migrated, in the background, from faster devices to larger devices to make room on the smaller, faster devices.
  2. Hot data is migrated to lighter-loaded devices, in the background, to even out the performance of the devices.
  3. Data is migrated on writes, when a write targets the heavier-loaded device and the memory cache holds all the blocks of the extent to which the written blocks belong.
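Very roughly, and without claiming to reproduce their actual policy or terminology, the three triggers could be expressed as:

    from dataclasses import dataclass

    @dataclass
    class Device:
        name: str
        is_fast: bool        # SSD-class vs. HDD-class
        load: float          # current I/O load estimate
        free_extents: int

    @dataclass
    class Extent:
        id: int
        temperature: str     # "hot" or "cold"

    def pick_migration(extent, src, devices, is_write, extent_fully_in_ram):
        """Return (trigger, destination) or None for the three scenarios above.
        All field names here are illustrative, not the paper's terminology."""
        lightest = min(devices, key=lambda d: d.load)
        heaviest = max(devices, key=lambda d: d.load)

        # 1. Cold data leaves a full fast device, in the background, to make room.
        if extent.temperature == "cold" and src.is_fast and src.free_extents == 0:
            slower = [d for d in devices if not d.is_fast]
            if slower:
                return ("background", max(slower, key=lambda d: d.free_extents))

        # 2. Hot data moves to a lighter-loaded device to even out performance.
        if extent.temperature == "hot" and src.load > lightest.load:
            return ("background", lightest)

        # 3. A write aimed at the heaviest-loaded device is redirected when the
        #    memory cache already holds every block of the target extent.
        if is_write and src is heaviest and extent_fully_in_ram:
            return ("on_write", lightest)

        return None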

Get more insights into the workload characteristics

This is very useful and needed. It would be good if we can do some of the above-mentioned analysis in a simulator using different workloads and then implement the near-final design in a real system. This is just my suggestion; feel free to go with the implementation route if possible.

MSR Workloads

Data plots from the individual MSR volumes: msr-volumes.zip.

Issue of write-through vs. write-back

Here I think write-through is pretty much necessary for cases when SSDs are local to the server and a host failure can lead to potential data loss. If the SSDs are on the array and/or are mirrored in some way, we can use write-back. I think for the first cut, we can stick to write-through.
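To make the difference concrete, here is a toy schematic of the two write paths (not the current simulator code):

    class CacheDev:
        """Toy stand-in for the SSD cache device."""
        def __init__(self):
            self.data, self.dirty = {}, set()
        def write(self, ext, buf):
            self.data[ext] = buf

    class BackingDev:
        """Toy stand-in for the HDD backing store."""
        def __init__(self):
            self.data = {}
        def write(self, ext, buf):
            self.data[ext] = buf

    def handle_write(ext, buf, ssd, hdd, mode="write-through"):
        """Schematic of the two write paths."""
        ssd.write(ext, buf)                  # the cache gets the new data either way
        if mode == "write-through":
            hdd.write(ext, buf)              # persist to the HDD before acking,
            return "ack"                     # so a host/SSD failure loses nothing
        ssd.dirty.add(ext)                   # write-back: remember to destage later
        return "ack"                         # ack as soon as the SSD has the data

    def destage(ssd, hdd):
        """Background flush of dirty extents; only needed in write-back mode."""
        for ext in list(ssd.dirty):
            hdd.write(ext, ssd.data[ext])
            ssd.dirty.discard(ext)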

Current Status

The current system is designed to be write-back. The write-through mechanism still needs to be coded; it will require some work, probably a couple of days of coding.

What is the adequate number of tiers

Are two tiers (SSD+SATA) enough, or do we need three tiers (SSD+SAS+SATA)? If three tiers are needed, how do we design a caching solution?

Minor Issues