In 2008, disk seeks are still on the same order of magnitude as they were 10 years ago (nearly 10ms), but applications have become a lot more demanding than they used to be. Solid state memory is not (yet?) cheap enough to fully replace disks for most people. However, fast flash memory devices are becoming very affordable (around $50 for a 4GB size, 30MB/second throughput flash memory device). It should be possible to use flash memory to speed up "disk accesses" by applications, without requiring the system administrator to carefully place some data on flash and some on disk.

Project goals:

How can flash help improve speed?

Flash memory and disks are both fast and slow, depending on your point of view. Affordable flash memory can read and write 30MB/s, while current hard disks can read up to 50MB/s. However, hard disks can only keep up that pace if IO is contiguous, which it almost never is. Typical workloads include filesystem metadata commits, small file reads and writes, small file fsyncs (mail servers) and small synchronous updates to large files (databases). Because the seek time of a hard disk is on the order of 10ms (counting rotational latency), a disk can really only do around 100 IO operations per second. If we optimistically assume that the average size of an IO operation is 16kB, that means a hard disk will do only 1.6MB of small IOs per second.

Some observations:


The logical way to implement such a caching scheme would be as a device mapper plugin. Not only does that make the caching self contained and filesystem independent, it also allows for integration of the caching configuration with the device-mapper tools that many system administrators are already familiar with.

Logically, the device mapper layer sits between the filesystem and the block device drivers. Filesystems turn (file, offset) requests from userland applications into (block device, block number) IO requests, which the device mapper layer translates into (physical disk, physical block number) IO requests. This is an easy place in the stack to divert some requests to flash memory, instead of to disk.

What to cache

Basic considerations:

Ideas for enhancements:

Large IO bypass

Hard disk performance is adequate for large IOs. Maybe large IO operations (>128kB?) should bypass the flash device and go straight to the hard disk? This would reduce wear on the flash device and could even improve system performance if one 30MB/s flash device is used as a cache for several 50MB/s (sequential) hard disks.

Read caching

The flash device may contain a newer version of a data block than what is on the hard disk. This happens when the data has not been flushed from flash to disk yet. Because of this, any read request will first need to go to the flash memory device and only go to the hard disk when it cannot be satisfied from flash memory. This part of read caching is necessary in any implementation.

Further enhancements may be worthwhile though. Some workloads have data or metadata blocks that are somewhat frequently accessed by the application. Not frequently enough that Linux will cache the data in the page cache, yet frequently enough that the application would go faster if that data was cached in flash memory. Filesystem metadata (file manager stating every file in a large directory) and database indexes could benefit from being cached on flash memory.

The bulk of read operations should not be cached in flash memory, however:

These considerations mean that a read caching scheme needs to identify what blocks are frequently accessed and only cache those blocks. In short, the caching scheme needs to keep track of pages that are not in the flash memory cache. At the same time, disk writes still need to go directly into flash memory, preferably without pushing the frequently accessed blocks out of the cache.

The 2Q cache replacement algorithm looks like a suitable candidate. Using the 2Q default of 70% of cache memory for the hot (frequently accessed) pages and 30% for the cold (rarely accessed) pages could work.

Writes to infrequently accessed pages will always go into the cold set. Reads of infrequently accessed pages will bypass the flash memory cache; the page will go from disk to RAM and not pollute either the hot set or the cold set. Reads from and writes to memory in the hot set will go to flash memory and not touch disk. By keeping track of pages not in the cache, the system can identify a read to a block that should be in the hot set and place it there, evicting an older block from the cache.

Wear leveling

It may be a good idea to tune the cache replacement scheme and the way the cache index is maintained to be friendly to flash devices by helping with wear leveling.

Another related issue is that some flash devices are fast at reading and writing data, but slow at erasing data. Avoiding read+erase+rewrite cycles by doing smarter space reuse on the flash memory cache device may be worth it for performance and longevity reasons.


KernelNewbies: KernelProjects/DmCache (last edited 2017-12-30 01:30:28 by localhost)