iomap

iomap allows filesystems to sequentially iterate over ranges in an inode and apply operations to them.

iomap grew out of the need to provide a modern block mapping abstraction for filesystems, covering the different IO access methods they support and assisting the VFS with manipulating files in the page cache. iomap helpers are provided for each of these mechanisms. However, block mapping is just one of the features of iomap: it also supports DAX devices and the lseek/llseek SEEK_DATA/SEEK_HOLE interfaces.

Block mapping provides a mapping between data cached in memory and the location on persistent storage where that data lives. LWN has an excellent review of the old buffer-heads block mapping, in use since the inception of Linux, and why it is inefficient. Since buffer-heads work on a 512-byte block-based paradigm, they create overhead for modern storage media, which no longer necessarily work only on 512-byte blocks. iomap is flexible, providing block ranges in bytes. iomap, with the support of folios, provides a modern replacement for buffer-heads.

This document strives to provide a template for LSFMM for what will hopefully eventually become upstream Linux kernel documentation for iomap and guidance for developers on converting a filesystem over from buffer-heads to iomap.

A modern block abstraction

iomap allows filesystems to query storage media for data using byte ranges. Since block mappings are provided as byte ranges for data cached in memory, in the page cache, operations on block ranges naturally also involve multipage operations in the page cache. Folios are used to help provide multipage operations in memory for the byte ranges being worked on.
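
To make the byte-range-to-pages relationship concrete, here is a minimal userspace sketch. It is not kernel code: MODEL_PAGE_SIZE, struct page_span and byte_range_to_pages() are hypothetical names invented for this illustration, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096ULL /* assumed page size for this model */

/* Inclusive span of page cache indices touched by a byte range. */
struct page_span {
	uint64_t first;
	uint64_t last;
};

/* Map the byte range [pos, pos + len) to the pages it touches. */
static struct page_span byte_range_to_pages(uint64_t pos, uint64_t len)
{
	struct page_span span;

	assert(len > 0); /* an empty range touches no page */
	span.first = pos / MODEL_PAGE_SIZE;
	span.last = (pos + len - 1) / MODEL_PAGE_SIZE;
	return span;
}
```

With these assumptions a 200-byte range starting at offset 4000 crosses a page boundary and touches pages 0 and 1, which is the kind of multipage operation folios are meant to help with.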

struct iomap_ops

A filesystem must provide a struct iomap_ops to deal with beginning an IO operation, iomap_begin(), and ending an IO operation on a block range, iomap_end(). You would call iomap with a specialized iomap operation depending on what the filesystem or the VFS needs.

For example, iomap_dio_rw() would be used by a filesystem when doing a block range read or write operation with direct IO. In this case your filesystem's respective struct file_operations.write_iter() would eventually call iomap_dio_rw().

For buffered IO a filesystem would use iomap_file_buffered_write() on the same struct file_operations.write_iter(). But that is not the only situation in which a filesystem deals with buffered writes: you could also use buffered writes when a filesystem has to deal with struct file_operations.fallocate(). However, fallocate() can be used for zeroing or for truncation purposes. A dedicated iomap_zero_range() would be used for zeroing, and iomap_truncate_page() would be used for truncation.
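
The begin/end pairing that a filesystem provides for these helpers can be modeled in userspace as follows. This is only a sketch of the idea, not the kernel API: struct model_map, struct model_ops, model_apply() and the two callbacks are hypothetical names, and the fixed 8 KiB extent layout is an assumption made purely for the demo.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for a mapping: one contiguous extent of the file. */
struct model_map {
	uint64_t offset; /* file offset where the extent starts, in bytes */
	uint64_t length; /* extent length in bytes */
};

/* Hypothetical ops table mirroring the begin/end pairing. */
struct model_ops {
	int (*begin)(uint64_t pos, uint64_t len, struct model_map *map);
	int (*end)(uint64_t pos, uint64_t len, uint64_t written);
};

/* Example begin(): pretend the file is laid out in fixed 8 KiB extents. */
static int fixed_extent_begin(uint64_t pos, uint64_t len, struct model_map *map)
{
	(void)len;
	map->offset = pos - (pos % 8192);
	map->length = 8192;
	return 0;
}

static unsigned int end_calls; /* counts end() invocations for the demo */

/* Example end(): only records that it was called. */
static int counting_end(uint64_t pos, uint64_t len, uint64_t written)
{
	(void)pos;
	(void)len;
	(void)written;
	end_calls++;
	return 0;
}

/*
 * Walk [pos, pos + len) one mapping at a time: begin() maps the extent at
 * the current position, the part of the range inside that extent is
 * "processed", and end() is told how much was handled.
 * Returns the number of bytes processed, or -1 on error.
 */
static int64_t model_apply(const struct model_ops *ops, uint64_t pos, uint64_t len)
{
	uint64_t done = 0;

	assert(ops && ops->begin && ops->end);
	while (done < len) {
		struct model_map map;
		uint64_t cur = pos + done;
		uint64_t avail, chunk;

		if (ops->begin(cur, len - done, &map) != 0)
			return -1;
		/* reject a mapping that does not cover the current position */
		if (map.length == 0 || cur < map.offset || cur >= map.offset + map.length)
			return -1;
		avail = map.offset + map.length - cur;
		chunk = avail < len - done ? avail : len - done;
		/* a real operation would read, write or zero "chunk" bytes here */
		if (ops->end(cur, chunk, chunk) != 0)
			return -1;
		done += chunk;
	}
	return (int64_t)done;
}
```

In this model a 16 KiB operation starting at offset 4096 is split across three extents, so begin() and end() each run three times. The point of the design is that the per-operation logic only ever sees one contiguous mapping at a time.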

XFS was the first filesystem to adopt iomap, and experience with it has shown that the filesystem implementation of these operations can be simplified considerably if one struct iomap_ops is provided per major filesystem IO operation.

For example, XFS has:

struct iomap_dio_ops

Used for direct IO. These are passed to iomap_dio_rw().

struct iomap_writeback_ops

The struct iomap_writeback_ops is used when dealing with a filesystem's struct address_space_operations.writepages(), for writeback.

Calling iomap

You call iomap depending on the type of filesystem operation you are working on. We detail some of these interactions below.

Calling iomap for buffered IO writes

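You call iomap for buffered IO writes with iomap_file_buffered_write().

You may also use buffered writes to deal with fallocate(): iomap_zero_range() for zeroing and iomap_truncate_page() for truncation.

Typically you would also use these on paths that update an inode's size.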

Calling iomap for direct IO

You call iomap for direct IO with iomap_dio_rw().

You may also use direct IO writes to deal with fallocate():

Typically you would also use these on paths that update an inode's size.

Calling iomap for reads

You can call into iomap for reading, i.e., when dealing with the filesystem's struct file_operations:

Calling iomap for userspace file extent mapping

The fiemap ioctl allows userspace to get a file extent mapping. It replaces the older bmap(), which allows the VM to map a logical block offset to a physical block number. bmap() is a legacy block mapping operation supported only for the ioctl and two areas in the kernel which are likely broken (the default swapfile implementation and odd md bitmap code). The fiemap ioctl is supported through an inode's struct inode_operations.fiemap() callback.

You would use iomap_fiemap() to provide the mapping. You could use two separate struct iomap_ops: one for when the request also asks to map extended attributes (FIEMAP_FLAG_XATTR), and another, your regular read struct iomap_ops, for when there is no need for extended attributes. In the future iomap may provide its own dedicated ops structure for fiemap.

Calling iomap for assisting the VFS

A filesystem also needs to call iomap when assisting the VFS with manipulating a file in the page cache.

Calling iomap for VFS reading

A filesystem can call iomap to deal with the VFS reading a file into folios with:

Calling iomap for VFS writepages

A filesystem can call iomap to deal with the VFS writing pages back to the backing store, that is, to help deal with a filesystem's struct address_space_operations.writepages(). The dedicated iomap_writepages() is used for this case, together with the filesystem's own struct iomap_writeback_ops.

Calling iomap for VFS llseek

A filesystem's struct file_operations.llseek() is used by the VFS when it needs to move the current file offset, which is kept in struct file.f_pos. iomap has special support for the llseek SEEK_HOLE and SEEK_DATA interfaces:

Using your own struct iomap_ops for this is encouraged.
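
From userspace, the SEEK_DATA and SEEK_HOLE interfaces are reached through lseek(). The sketch below shows only that caller side and does not show how a filesystem wires these up in iomap. probe_file() is a hypothetical helper name, and the fallback definitions assume the Linux values for the two whence constants.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes SEEK_DATA and SEEK_HOLE in glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Fall back to the Linux values if the headers did not expose them. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#endif
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

/*
 * Report the offset of the first data byte and of the first hole in a file.
 * The end of the file counts as a hole, so a fully written file reports its
 * size as the first hole. Returns 0 on success, -1 on error (for example an
 * empty file, which has no data to seek to).
 */
static int probe_file(const char *path, off_t *data, off_t *hole)
{
	int fd;

	assert(path && data && hole);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	*data = lseek(fd, 0, SEEK_DATA);
	*hole = lseek(fd, 0, SEEK_HOLE);
	close(fd);
	return (*data < 0 || *hole < 0) ? -1 : 0;
}
```

On a filesystem without real hole tracking the kernel falls back to treating the whole file as data, so the results are still valid, just less precise.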

Calling iomap for DAX

You can use dax_iomap_rw() when calling iomap from a DAX context; this is typically done from the filesystem's struct file_operations.write_iter() callback.
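
Since DAX, direct IO and buffered writes all arrive through the same write_iter() callback, a filesystem has to pick one of them per call. A minimal userspace model of that choice follows. The flag names, the enum and model_pick_write_path() are hypothetical, and checking DAX before direct IO is an assumption of this sketch rather than a rule stated above.

```c
#include <assert.h>

/* Hypothetical flags standing in for "inode is DAX" and "IO is direct". */
#define MODEL_DAX    0x1u
#define MODEL_DIRECT 0x2u

enum model_path { PATH_DAX, PATH_DIRECT, PATH_BUFFERED };

/* Choose which iomap entry point a write_iter()-style handler would use. */
static enum model_path model_pick_write_path(unsigned int flags)
{
	/* only the two model flags are understood */
	assert((flags & ~(MODEL_DAX | MODEL_DIRECT)) == 0);

	if (flags & MODEL_DAX)
		return PATH_DAX;      /* would call dax_iomap_rw() */
	if (flags & MODEL_DIRECT)
		return PATH_DIRECT;   /* would call iomap_dio_rw() */
	return PATH_BUFFERED;     /* would call iomap_file_buffered_write() */
}
```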

Converting filesystems from buffer-head to iomap guide

These are generic guidelines on converting a filesystem over to iomap from buffer-heads.

One op at a time

You may try to convert a filesystem one clustered set of operations at a time. Below is a generic order you may strive to target:

Defining a simple filesystem

A simple filesystem is perhaps the easiest to convert over to iomap. A simple filesystem is one which:

Converting a simple filesystem to iomap

Simple filesystems should convert to iomap directly and avoid buffer heads entirely, i.e., don't use IOMAP_F_BUFFER_HEAD.

Converting shared filesystem features

fscrypt, fsverity, and compression need to be converted to iomap first if a filesystem uses them, as iomap supports no permutations (XXX: clarify on this)

Converting complex filesystems

If your filesystem does not fit the simple description above, the general recommendation is to port to iomap with IOMAP_F_BUFFER_HEAD in one kernel release, to verify you have no bugs with locking, writeback, and general use of your new struct iomap_ops.

When to set iomap on srcmap or dstmap

The struct iomap is required to be set in iomap_begin(). If it is a CoW path, also set srcmap in iomap_begin().

This perhaps should be redesigned in the future depending on read / write requirements and it may take time to get this right.
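
The srcmap rule above can be illustrated with a small userspace model. Everything here is hypothetical (struct model_extent, model_begin(), model_read_source() and the addresses); in particular, treating a source mapping left as a hole as "not set" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

enum model_type { MODEL_HOLE = 0, MODEL_MAPPED };

struct model_extent {
	enum model_type type; /* MODEL_HOLE means "not set" in this model */
	uint64_t addr;        /* hypothetical on-disk address in bytes */
};

/* Hypothetical begin(): always fills dst, and fills src only for CoW. */
static void model_begin(int is_cow, struct model_extent *dst,
			struct model_extent *src)
{
	assert(dst && src);
	dst->type = MODEL_MAPPED;
	dst->addr = 1048576;      /* stands in for freshly allocated blocks */
	if (is_cow) {
		src->type = MODEL_MAPPED;
		src->addr = 65536;    /* stands in for shared blocks with old data */
	}
}

/* Pick where existing data is read from: src when it was set, else dst. */
static const struct model_extent *
model_read_source(const struct model_extent *dst, const struct model_extent *src)
{
	return src->type != MODEL_HOLE ? src : dst;
}
```

For a plain write the model reads from and writes to the same mapping. For a CoW write, old data comes from the source mapping while new data lands in the destination.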

Removal of IOMAP_F_BUFFER_HEAD

IOMAP_F_BUFFER_HEAD won't be removed until all filesystems are fully converted away from buffer-heads, and this could be never.

Testing Direct IO

Other than fstests you can use LTP's dio; however, this test is limited as it does not test for stale data.

./runltp -f dio -d /mnt1/scratch/tmp/

Known issues and future improvements

Other than lack of documentation there are some known issues and limitations with iomap at this time. We try to itemize them here:

Q&A

References


KernelNewbies: KernelProjects/iomap (last edited 2023-05-08 23:49:37 by mcgrof)