In 2008, disk seeks are still on the same order of magnitude as they were 10 years ago (nearly 10ms), but applications have become a lot more demanding than they used to be. Solid state memory is not (yet?) cheap enough to fully replace disks for most people. However, fast flash memory devices are becoming very affordable (around $50 for a 4GB flash memory device with 30MB/s throughput). It should be possible to use flash memory to speed up "disk accesses" by applications, without requiring the system administrator to carefully place some data on flash and some on disk.
- Speed up filesystem access by using a flash memory device to cache IO.
- Speed up application data accesses, not just metadata accesses.
- Allow system administrators to add such a cache to their system, without having to re-make their filesystems.
- Have the caching mechanism implemented in a filesystem-independent, self-contained module.
How can flash help improve speed?
Flash memory and disks are both fast and slow, depending on your point of view. Affordable flash memory can read and write 30MB/s, while current hard disks can read up to 50MB/s. However, hard disks can only keep up that pace if IO is contiguous, which it almost never is. Typical workloads include filesystem metadata commits, small file reads and writes, small file fsyncs (mail servers) and small synchronous updates to large files (databases). Because the seek time of a hard disk is on the order of 10ms (counting rotational latency), a disk can really only do around 100 IO operations per second. If we optimistically assume that the average size of an IO operation is 16kB, that means a hard disk will do only 1.6MB of small IOs per second.
- Flash memory has no seek time and will be able to keep up its 30MB/s rate regardless of IO size.
- Some applications (mail servers, databases) need to wait for data to hit stable storage before they can close a transaction.
- The problem with hard disks is latency (seek time), throughput is ok with very large IO operations.
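The arithmetic above can be made explicit. The following sketch uses the numbers from the text (10ms seek, 50MB/s streaming disk rate) to compute effective disk throughput when every IO pays a seek; the function name and defaults are illustrative:

```python
def disk_throughput_mb_s(io_size_kb, seek_ms=10.0, streaming_mb_s=50.0):
    """Effective hard disk throughput when every IO pays a full seek.

    Time per IO = seek time + transfer time, using the 10ms seek and
    50MB/s streaming rate from the text."""
    seek_s = seek_ms / 1000.0
    transfer_s = (io_size_kb / 1024.0) / streaming_mb_s
    ios_per_second = 1.0 / (seek_s + transfer_s)
    return ios_per_second * io_size_kb / 1024.0

# 16kB IOs: roughly 100 IO/s, so about 1.5MB/s of the 50MB/s the
# disk could stream. Only very large IOs approach the streaming rate.
```

This reproduces the ~1.6MB/s figure above for 16kB IOs, while a 30MB/s flash device with no seek penalty sustains its full rate at any IO size.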
The logical way to implement such a caching scheme would be as a device mapper plugin. Not only does that make the caching self-contained and filesystem-independent, it also allows for integration of the caching configuration with the device-mapper tools that many system administrators are already familiar with.
Logically, the device mapper layer sits between the filesystem and the block device drivers. Filesystems turn (file, offset) requests from userland applications into (block device, block number) IO requests, which the device mapper layer translates into (physical disk, physical block number) IO requests. This is an easy place in the stack to divert some requests to flash memory, instead of to disk.
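The diversion step can be sketched in a few lines. This is an illustrative model of the decision a caching device-mapper target would make, not real kernel code; the device names and index structure are hypothetical:

```python
# Hypothetical sketch of the remapping decision: a (block device,
# block number) request from the filesystem is diverted either to the
# flash cache or to the backing disk.
FLASH_DEV = "flash"
DISK_DEV = "disk"

def remap(block, cache_index):
    """Map a logical block number to a (device, physical block) pair.

    cache_index maps logical block -> location on the flash device;
    anything not in the index goes straight through to the disk."""
    if block in cache_index:
        return (FLASH_DEV, cache_index[block])
    return (DISK_DEV, block)
```

A block that the cache knows about (say logical block 7 stored at flash block 3) is remapped to the flash device; every other block passes through unchanged.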
What to cache
- Linux uses spare RAM to cache very frequently accessed disk blocks, in the page cache. This means that the data which is most frequently read by applications will stay in RAM and is not actually fetched from disk very often; there is no need to cache that data on flash.
- Disk writes are often synchronous, meaning that the application needs to wait for several disk seeks to complete. Having disk writes go to flash (which is persistent across power failures) could drastically speed up transaction oriented applications.
- Because the copy of the data in flash is always the latest and considered authoritative, data can be written back to disk asynchronously and in any order.
- Some disk blocks are frequently written. Very frequently written blocks can end up being written to flash many times before the block is (asynchronously) flushed to disk. In this scenario, the flash memory cache has actually reduced the amount of disk IO.
- The contents of flash memory are persistent across power loss. This means that the index of the contents of the flash memory cache needs to be persistent too.
- The content index of the flash memory cache needs to be manipulated in a transaction safe way. If a program is told that data made it to stable storage, that data needs to still be there after a reboot. There cannot be a timing window where the data is lost if the system loses power.
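The ordering requirement above can be sketched as follows. The structures and names are illustrative, not an actual on-flash format; the point is that the data block must be durable before the index record is committed, and both must be durable before the write is acknowledged:

```python
# Illustrative sketch of crash-safe write ordering. A crash between
# steps 1 and 2 leaves an orphan data block that recovery ignores,
# which is safe because the write was never acknowledged. A crash
# after step 2 loses nothing: the committed index finds the data.

flash_data = {}    # flash block -> payload       (step 1)
flash_index = {}   # logical block -> flash block (step 2, the commit)

def stable_write(logical_block, payload, flash_block):
    flash_data[flash_block] = payload          # 1. write + flush data
    flash_index[logical_block] = flash_block   # 2. write + flush index
    return "acked"                             # 3. only now ack

def recover():
    """Post-crash view: only data reachable through the committed
    index is considered valid."""
    return {lb: flash_data[fb] for lb, fb in flash_index.items()
            if fb in flash_data}
```

With this ordering there is no window where acknowledged data can be lost: anything the application was told is stable is reachable through the index after a reboot.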
Ideas for enhancements:
- Large IO bypass.
- Read caching.
- Wear leveling.
Large IO bypass
Hard disk performance is adequate for large IOs. Maybe large IO operations (>128kB?) should bypass the flash device and go straight to the hard disk? This would reduce wear on the flash device and could even improve system performance if one 30MB/s flash device is used as a cache for several 50MB/s (sequential) hard disks.
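A minimal version of that routing decision might look like this; the threshold is the >128kB guess from the text, not a measured value:

```python
BYPASS_THRESHOLD = 128 * 1024  # bytes; the >128kB guess above

def route_io(size_bytes):
    """Send large IOs straight to disk, where streaming throughput
    is adequate; send small IOs to flash, which pays no seek."""
    return "disk" if size_bytes > BYPASS_THRESHOLD else "flash"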
Read caching

The flash device may contain a newer version of a data block than what is on the hard disk. This happens when the data has not been flushed from flash to disk yet. Because of this, any read request will first need to go to the flash memory device and only go to the hard disk when it cannot be satisfied from flash memory. This part of read caching is necessary in any implementation.
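The stale-disk scenario is easy to model. In this sketch (toy dictionaries standing in for the two devices), a cached write leaves the disk copy stale until the asynchronous writeback runs, so a correct read path must consult flash first:

```python
# Toy model: flash holds the newest copy until writeback runs.
disk = {42: "old"}
flash = {}

def cached_write(block, value):
    flash[block] = value             # goes to flash, acked immediately

def writeback(block):
    disk[block] = flash.pop(block)   # asynchronous, any order

def read(block):
    # Flash first, then disk: the flash copy, when present, is newer.
    return flash[block] if block in flash else disk[block]
```

After `cached_write(42, "new")`, the disk still says "old", and only the flash-first read path returns the right answer.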
Further enhancements may be worthwhile though. Some workloads have data or metadata blocks that are somewhat frequently accessed by the application. Not frequently enough that Linux will cache the data in the page cache, yet frequently enough that the application would go faster if that data was cached in flash memory. Filesystem metadata (file manager stating every file in a large directory) and database indexes could benefit from being cached on flash memory.
The bulk of read operations should not be cached in flash memory, however:
- Very frequently accessed data will be cached in RAM by Linux, in the page cache. This data will not be fetched from disk often.
- Rarely accessed data should not be placed in the flash cache, since it will have been evicted before it is used again.
- Flash devices can only be written to a finite number of times. Writing useless-to-cache data to flash will reduce its life span.
- Putting useless data in the cache can slow down other IO operations, not just by keeping the flash device busy, but also by evicting useful data from the cache.
These considerations mean that a read caching scheme needs to identify what blocks are frequently accessed and only cache those blocks. To do that, the caching scheme also needs to keep track of pages that are not in the flash memory cache. At the same time, disk writes still need to go directly into flash memory, preferably without pushing the frequently accessed blocks out of the cache.
The 2Q cache replacement algorithm looks like a suitable candidate. Using the 2Q default of 70% of cache memory for the hot (frequently accessed) pages and 30% for the cold (rarely accessed) pages could work.
Writes to infrequently accessed pages will always go into the cold set. Reads of infrequently accessed pages will bypass the flash memory cache; the page will go from disk to RAM and not pollute either the hot set or the cold set. Reads from and writes to blocks in the hot set will go to flash memory and not touch disk. By keeping track of pages not in the cache, the system can identify a read to a block that should be in the hot set and place it there, evicting an older block from the cache.
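A simplified 2Q sketch might look like this. It keeps a cold FIFO for blocks seen once, a hot LRU for blocks seen again, and a "ghost" list of recently evicted block numbers (metadata only, no data) so that a re-access of an evicted block can be promoted straight to the hot set. The 70/30 split follows the 2Q defaults mentioned above; everything else is a loose illustration, not the exact published algorithm:

```python
from collections import OrderedDict

class TwoQCache:
    """Simplified 2Q sketch: cold FIFO + hot LRU + ghost list."""

    def __init__(self, capacity):
        self.hot_cap = max(1, int(capacity * 0.7))   # 70% hot
        self.cold_cap = max(1, capacity - self.hot_cap)  # 30% cold
        self.hot = OrderedDict()    # frequently accessed blocks (LRU)
        self.cold = OrderedDict()   # seen once, FIFO order
        self.ghost = OrderedDict()  # block numbers only, no data

    def _evict(self, queue, cap):
        while len(queue) > cap:
            block, _ = queue.popitem(last=False)   # drop oldest
            self.ghost[block] = None               # remember it left
            while len(self.ghost) > self.hot_cap + self.cold_cap:
                self.ghost.popitem(last=False)

    def access(self, block, value):
        if block in self.hot:        # hot hit: refresh LRU position
            self.hot.move_to_end(block)
            self.hot[block] = value
        elif block in self.cold:     # second access: promote to hot
            self.cold.pop(block)
            self.hot[block] = value
            self._evict(self.hot, self.hot_cap)
        elif block in self.ghost:    # recently evicted: belongs in hot
            self.ghost.pop(block)
            self.hot[block] = value
            self._evict(self.hot, self.hot_cap)
        else:                        # first sighting: cold FIFO
            self.cold[block] = value
            self._evict(self.cold, self.cold_cap)
```

A block written once sits in the cold set and ages out without disturbing the hot set; a block accessed again, even after eviction, is recognized via the ghost list and promoted, which is exactly the "keep track of pages not in the cache" behavior described above.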
Wear leveling

It may be a good idea to tune the cache replacement scheme and the way the cache index is maintained to be friendly to flash devices by helping with wear leveling.
Another related issue is that some flash devices are fast at reading and writing data, but slow at erasing data. Avoiding read+erase+rewrite cycles by doing smarter space reuse on the flash memory cache device may be worth it for performance and longevity reasons.