In 2008, disk seek times are still of the same order of magnitude as they were 10 years ago (nearly 10ms), but applications have become a lot more demanding than they used to be. Solid state memory is not (yet?) cheap enough to fully replace disks for most people. However, fast flash memory devices are becoming very affordable (around $50 for a 4GB device with 30MB/second throughput). It should be possible to use flash memory to speed up "disk accesses" by applications, without requiring the system administrator to carefully place some data on flash and some on disk.
How can flash help improve speed?
Flash memory and disks are both fast and slow, depending on your point of view. Affordable flash memory can read and write at 30MB/s, while current hard disks can read at up to 50MB/s. However, hard disks can only keep up that pace if IO is contiguous, which it almost never is. Typical workloads include filesystem metadata commits, small file reads and writes, small file fsyncs (mail servers) and small synchronous updates to large files (databases). Because the seek time of a hard disk is on the order of 10ms (counting rotational latency), a disk can really only do around 100 IO operations per second. If we optimistically assume that the average size of an IO operation is 16kB, that means a hard disk will do only 1.6MB of small IOs per second (100 × 16kB). Flash memory has no mechanical seek, so small scattered IOs do not cost it 10ms each.
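The arithmetic above can be reproduced in a few lines. This is only a back-of-the-envelope sketch: the constants are the figures quoted in the text, the function name is made up for illustration, and transfer time is ignored. The flash-to-disk ratio it prints is an upper bound that assumes the flash device sustains its rated 30MB/s on small IOs, which real devices may not.

```python
# Figures quoted in the text; adjust them for other hardware.
SEEK_TIME_S = 0.010      # ~10ms per seek, counting rotational latency
AVG_IO_SIZE_KB = 16      # optimistic average size of a small IO
FLASH_MB_S = 30          # rated throughput of an affordable flash device


def random_io_throughput_mb_s(seek_time_s, io_size_kb):
    """Throughput of a disk that pays one seek per IO (transfer time ignored)."""
    iops = 1.0 / seek_time_s              # about 100 operations per second
    return iops * io_size_kb / 1000.0     # decimal units: 1MB = 1000kB


disk_random = random_io_throughput_mb_s(SEEK_TIME_S, AVG_IO_SIZE_KB)
print(f"disk, small scattered IO: {disk_random:.1f} MB/s")
print(f"flash at rated speed is at most {FLASH_MB_S / disk_random:.2f} times faster")
```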
The logical way to implement a flash memory cache in front of the disk would be as a device mapper target (plugin). Not only does that make the caching self-contained and filesystem-independent, it also allows for integration of the caching configuration with the device-mapper tools that many system administrators are already familiar with.
Logically, the device mapper layer sits between the filesystem and the block device drivers. Filesystems turn (file, offset) requests from userland applications into (block device, block number) IO requests, which the device mapper layer translates into (physical disk, physical block number) IO requests. This is an easy place in the stack to divert some requests to flash memory, instead of to disk.
What to cache
Linux uses spare RAM to cache very frequently accessed disk blocks, in the page cache. This means that the data which is most frequently read by applications will stay in RAM and is not actually fetched from disk very often; there is no need to cache that data on flash.
Disk writes are often synchronous, meaning that the application needs to wait for several disk seeks to complete. Having disk writes go to flash (which is persistent across power failures) could drastically speed up transaction-oriented applications.
Some disk blocks are frequently written. Very frequently written blocks can end up being written to flash many times before the block is (asynchronously) flushed to disk. In this scenario, the flash memory cache has actually reduced the amount of disk IO.
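The effect described above can be illustrated with a toy write-back cache. This is a sketch rather than a proposed implementation: the class and its counters are hypothetical, and the "devices" are plain dictionaries.

```python
class WriteBackCache:
    """Toy model: writes land on flash, a later flush copies them to disk."""

    def __init__(self):
        self.dirty = {}        # block number -> newest data held on flash
        self.disk = {}         # block number -> data on the hard disk
        self.flash_writes = 0
        self.disk_writes = 0

    def write(self, block, data):
        self.dirty[block] = data          # replaces any older version on flash
        self.flash_writes += 1

    def flush(self):
        # Only the newest version of each dirty block reaches the disk.
        for block, data in sorted(self.dirty.items()):
            self.disk[block] = data
            self.disk_writes += 1
        self.dirty.clear()


cache = WriteBackCache()
for i in range(100):
    cache.write(7, f"version {i}".encode())   # the same block, rewritten 100 times
cache.flush()
print(cache.flash_writes, cache.disk_writes)  # prints "100 1"
```

The flip side is visible in the same counters: every one of those rewrites costs a flash write, which matters for flash wear.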
The content index of the flash memory cache needs to be manipulated in a transaction safe way. If a program is told that data made it to stable storage, that data needs to still be there after a reboot. There cannot be a timing window where the data is lost if the system loses power.
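One way to meet this requirement is to keep the index as an append-only log of checksummed records. The sketch below shows that one technique and nothing more; it is not a description of any existing implementation. The record layout (disk block, flash slot, CRC32) and the function names are assumptions made for illustration, and an ordinary temporary file stands in for the index area on flash. The data block itself is assumed to be safely on flash before its index record is appended.

```python
import os
import struct
import tempfile
import zlib

# Hypothetical on-flash record: disk block, flash slot, CRC32 of those two fields.
RECORD = struct.Struct("<QQI")


def append_index_record(log_fd, disk_block, flash_slot):
    """Append one record; the write may be acknowledged only after fsync returns."""
    body = struct.pack("<QQ", disk_block, flash_slot)
    os.write(log_fd, body + struct.pack("<I", zlib.crc32(body)))
    os.fsync(log_fd)


def recover_index(path):
    """Rebuild the index after a reboot, ignoring a torn or corrupt tail."""
    index = {}
    with open(path, "rb") as log:
        while True:
            raw = log.read(RECORD.size)
            if len(raw) < RECORD.size:
                break                      # partial record: it was never acknowledged
            disk_block, flash_slot, crc = RECORD.unpack(raw)
            if zlib.crc32(raw[:16]) != crc:
                break                      # corrupt record: stop replaying here
            index[disk_block] = flash_slot
    return index


fd, path = tempfile.mkstemp()
append_index_record(fd, 1234, 0)
append_index_record(fd, 99, 1)
os.write(fd, b"\x01\x02\x03")              # simulate power loss in the middle of a record
os.close(fd)
recovered = recover_index(path)
os.unlink(path)
print(recovered)                           # prints "{1234: 0, 99: 1}"
```

Because a record only counts once it is complete and its checksum matches, there is no window in which an acknowledged write is missing from the recovered index.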
Large IO bypass
Hard disk performance is adequate for large IOs. Maybe large IO operations (>128kB?) should bypass the flash device and go straight to the hard disk? This would reduce wear on the flash device and could even improve system performance if one 30MB/s flash device is used as a cache for several 50MB/s (sequential) hard disks.
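A minimal sketch of such a bypass rule follows. The 128kB threshold is the tentative figure from the text, while the 4kB block size, the function name and the dictionary used as the cache index are assumptions for illustration. One detail the sketch makes explicit: a write that bypasses flash has to invalidate any older copies of the same blocks in the cache, otherwise later reads would find stale data on flash.

```python
BLOCK_SIZE = 4096                 # assumed cache block size
LARGE_IO_BYTES = 128 * 1024       # tentative threshold from the text, a tunable


def route_write(first_block, size_bytes, flash_index, threshold=LARGE_IO_BYTES):
    """Pick a destination for a write; return (device, invalidated blocks)."""
    nblocks = -(-size_bytes // BLOCK_SIZE)            # ceiling division
    blocks = range(first_block, first_block + nblocks)
    if size_bytes > threshold:
        stale = [b for b in blocks if b in flash_index]
        for b in stale:
            del flash_index[b]                        # cached copy is now out of date
        return "disk", stale
    return "flash", []                                # caller stores the data on flash


flash_index = {10: 0, 500: 1}                         # disk block -> flash slot
print(route_write(0, 256 * 1024, flash_index))        # prints "('disk', [10])"
print(route_write(500, 16 * 1024, flash_index))       # prints "('flash', [])"
```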
The flash device may contain a newer version of a data block than what is on the hard disk. This happens when the data has not been flushed from flash to disk yet. Because of this, any read request will first need to go to the flash memory device and only go to the hard disk when it cannot be satisfied from flash memory. This part of read caching is necessary in any implementation.
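The lookup order described above is simple enough to show directly. In this sketch the two dictionaries are stand-ins for the flash cache and the hard disk, and the function name is hypothetical.

```python
def read_block(block, flash, disk):
    """Return the newest copy of a block: check flash first, then fall back to disk."""
    if block in flash:
        return flash[block]
    return disk[block]


disk = {1: b"old contents", 2: b"only on disk"}
flash = {1: b"new contents"}          # written by an application, not yet flushed

print(read_block(1, flash, disk))     # prints "b'new contents'"
print(read_block(2, flash, disk))     # prints "b'only on disk'"
```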
Further enhancements may be worthwhile though. Some workloads have data or metadata blocks that are accessed somewhat frequently by the application: not frequently enough that Linux will keep the data in the page cache, yet frequently enough that the application would go faster if that data were cached in flash memory. Filesystem metadata (a file manager calling stat() on every file in a large directory) and database indexes could benefit from being cached on flash memory.
These considerations mean that a read caching scheme needs to identify which blocks are frequently accessed and only cache those blocks. In short, the caching scheme needs to keep track of recently accessed pages that are not in the flash memory cache, so that it can tell when one of them is accessed again. At the same time, disk writes still need to go directly into flash memory, preferably without pushing the frequently accessed blocks out of the cache.
The 2Q cache replacement algorithm looks like a suitable candidate. Using the 2Q default of 70% of cache memory for the hot (frequently accessed) pages and 30% for the cold (rarely accessed) pages could work.
Writes to infrequently accessed pages will always go into the cold set. Reads of infrequently accessed pages will bypass the flash memory cache; the page will go from disk to RAM and not pollute either the hot set or the cold set. Reads from and writes to pages in the hot set will go to flash memory and not touch disk. By keeping track of pages not in the cache, the system can identify a read to a block that should be in the hot set and place it there, evicting an older block from the cache.
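The policy described above can be sketched as a simplified 2Q-style cache. This is not the full 2Q algorithm and not a definitive design: the class name, the size of the list of uncached blocks, and the decision to leave a block in the cold set when it is read there are all assumptions. Only block numbers are tracked, and flushing dirty victims to disk is left out.

```python
from collections import OrderedDict


class TwoQFlashCache:
    """Simplified 2Q-style placement policy (block numbers only, no data)."""

    def __init__(self, slots, hot_fraction=0.7, ghost_slots=None):
        self.hot_slots = max(1, round(slots * hot_fraction))
        self.cold_slots = max(1, slots - self.hot_slots)
        self.ghost_slots = slots if ghost_slots is None else ghost_slots
        self.hot = OrderedDict()     # frequently accessed blocks, oldest first
        self.cold = OrderedDict()    # recently written blocks, oldest first
        self.ghost = OrderedDict()   # recently seen blocks that are NOT cached

    def _remember(self, block):
        self.ghost[block] = True
        self.ghost.move_to_end(block)
        if len(self.ghost) > self.ghost_slots:
            self.ghost.popitem(last=False)

    def _insert(self, queue, limit, block):
        queue[block] = True
        queue.move_to_end(block)
        if len(queue) > limit:
            victim, _ = queue.popitem(last=False)
            self._remember(victim)   # a real cache would flush a dirty victim first

    def write(self, block):
        if block in self.hot:
            self.hot.move_to_end(block)
        else:
            self.ghost.pop(block, None)
            self._insert(self.cold, self.cold_slots, block)
        return "flash"               # writes always land on flash

    def read(self, block):
        if block in self.hot:
            self.hot.move_to_end(block)
            return "flash"
        if block in self.cold:
            return "flash"           # flash may hold data newer than the disk
        if block in self.ghost:      # second recent read: the block belongs in the hot set
            del self.ghost[block]
            self._insert(self.hot, self.hot_slots, block)
            return "disk"            # this read still comes from disk
        self._remember(block)        # first read: bypass the cache entirely
        return "disk"


cache = TwoQFlashCache(slots=10)
results = [cache.read(42) for _ in range(3)]
print(results)                       # prints "['disk', 'disk', 'flash']"
for scanned in range(1000, 1100):    # a one-time scan should not evict block 42
    cache.read(scanned)
print(42 in cache.hot)               # prints "True"
```

The scan at the end illustrates the point of the design: blocks that are read once only pass through the list of uncached blocks and never displace anything in the hot or cold set.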
Another related issue is that some flash devices are fast at reading and writing data, but slow at erasing data. Avoiding read+erase+rewrite cycles by doing smarter space reuse on the flash memory cache device may be worth it for performance and longevity reasons.