iomap
iomap allows filesystems to sequentially iterate over ranges in an inode and apply operations to them.
iomap grew out of the need to provide a modern block mapping abstraction for filesystems covering the different IO access methods they support, and to assist the VFS with manipulating file data in the page cache. iomap helpers are provided for each of these mechanisms. However, block mapping is just one of the features of iomap, given that iomap also supports DAX IO for filesystems as well as the lseek/llseek SEEK_DATA/SEEK_HOLE interfaces.
Block mapping provides a mapping between data cached in memory and the location on persistent storage where that data lives. LWN has an excellent review of the old buffer-head block mapping, used since the inception of Linux, and why it is inefficient (https://lwn.net/Articles/930173/). Because buffer-heads work on a 512-byte block paradigm, they create overhead for modern storage media, which no longer necessarily works in 512-byte blocks. iomap is more flexible, expressing block ranges in bytes. iomap, with the support of folios, provides a modern replacement for buffer-heads.
This document strives to provide a template for LSFMM for what will hopefully eventually become upstream Linux kernel documentation for iomap and guidance for developers on converting a filesystem over from buffer-heads to iomap.
Contents
- iomap
- A modern block abstraction
- struct iomap_ops
- struct iomap_dio_ops
- struct iomap_writeback_ops
- Calling iomap
- Converting filesystems from buffer-head to iomap guide
A modern block abstraction
iomap allows filesystems to query storage media for data using byte ranges. Since block mappings are provided for byte ranges of data cached in memory, in the page cache, operations on those ranges naturally become multipage operations in the page cache. Folios are used to provide multipage operations in memory for the byte ranges being worked on.
struct iomap_ops
A filesystem must provide a struct iomap_ops to deal with the beginning of an IO operation, iomap_begin(), and the ending of an IO operation on a block range, iomap_end(). You call iomap with a specialized iomap operation depending on the filesystem's or the VFS's needs.
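Below is a minimal, hypothetical sketch of such a pair of callbacks for an imaginary "simplefs" (the simplefs_* names are not a real filesystem). It assumes file data is laid out contiguously on disk, and the callback signatures follow include/linux/iomap.h at the time of writing; these do change between kernel releases.
{{{
#include <linux/iomap.h>

/*
 * Hypothetical sketch: describe the byte range starting at 'pos'. This toy
 * layout assumes file data is stored contiguously from disk byte 0; a real
 * filesystem would walk its extent metadata here.
 */
static int simplefs_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
				unsigned int flags, struct iomap *iomap,
				struct iomap *srcmap)
{
	iomap->bdev = inode->i_sb->s_bdev;
	iomap->offset = pos;
	iomap->length = length;

	if (pos >= i_size_read(inode)) {
		/* Beyond EOF: report a hole. */
		iomap->type = IOMAP_HOLE;
		iomap->addr = IOMAP_NULL_ADDR;
	} else {
		iomap->type = IOMAP_MAPPED;
		iomap->addr = pos;	/* toy contiguous layout */
	}
	return 0;
}

static int simplefs_iomap_end(struct inode *inode, loff_t pos, loff_t length,
			      ssize_t written, unsigned int flags,
			      struct iomap *iomap)
{
	/* Nothing to commit or undo in this sketch. */
	return 0;
}

static const struct iomap_ops simplefs_iomap_ops = {
	.iomap_begin	= simplefs_iomap_begin,
	.iomap_end	= simplefs_iomap_end,
};
}}}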
For example, iomap_dio_rw() is used when a filesystem does a block range read or write with direct IO; the filesystem's struct file_operations.write_iter() would eventually call iomap_dio_rw().
For buffered IO a filesystem would instead use iomap_file_buffered_write() in the same struct file_operations.write_iter(). But that is not the only situation in which a filesystem deals with buffered writes: buffered writes are also used when a filesystem has to deal with struct file_operations.fallocate(). fallocate() can be used for zeroing or for truncation purposes; iomap_zero_range() is used for zeroing, and iomap_truncate_page() is used for truncation.
XFS was the first filesystem to adopt iomap and experience with it has shown that the filesystem implementation of these operations can be simplified considerably if one struct iomap_ops is provided per major filesystem IO operation:
- buffered io
- direct io
- DAX io
- fiemap with extended attributes (FIEMAP_FLAG_XATTR)
- lseek
For example, XFS has:
- struct iomap_ops xfs_read_iomap_ops
- struct iomap_ops xfs_direct_write_iomap_ops
- struct iomap_ops xfs_dax_write_iomap_ops
- struct iomap_ops xfs_buffered_write_iomap_ops
- struct iomap_ops xfs_xattr_iomap_ops
- struct iomap_ops xfs_seek_iomap_ops
struct iomap_dio_ops
Used for direct IO. These callbacks are passed to and invoked by iomap_dio_rw():
- struct iomap_dio_ops.end_io()
- struct iomap_dio_ops.submit_io()
struct iomap_writeback_ops
The struct iomap_writeback_ops is used when dealing with a filesystem's struct address_space_operations.writepages() callback, for writeback.
Calling iomap
You call iomap depending on the type of filesystem operation you are working on. We detail some of these interactions below.
Calling iomap for buffered IO writes
You call iomap for buffered IO with:
- iomap_file_buffered_write() - for buffered writes
- iomap_page_mkwrite() - when dealing with callbacks for struct vm_operations_struct:
  - struct vm_operations_struct.page_mkwrite()
  - struct vm_operations_struct.fault()
  - struct vm_operations_struct.huge_fault()
  - struct vm_operations_struct.pfn_mkwrite()
You may use buffered writes to also deal with fallocate():
- iomap_zero_range() on fallocate for zeroing
- iomap_truncate_page() on fallocate for truncation
Typically you would also use these on paths that update an inode's size.
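Continuing the hypothetical simplefs sketch from above, a buffered write path wired through iomap_file_buffered_write() might look roughly like this; locking and the exact helper signatures vary between filesystems and kernel releases.
{{{
static ssize_t simplefs_buffered_write_iter(struct kiocb *iocb,
					    struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	inode_lock(inode);
	ret = generic_write_checks(iocb, from);
	if (ret > 0)
		ret = iomap_file_buffered_write(iocb, from,
						&simplefs_iomap_ops);
	inode_unlock(inode);

	if (ret > 0)
		ret = generic_write_sync(iocb, ret);
	return ret;
}
}}}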
Calling iomap for direct IO
You call iomap for direct IO with:
- iomap_dio_rw()
You may use direct IO writes to also deal with fallocate():
- iomap_zero_range() on fallocate for zeroing
- iomap_truncate_page() on fallocate for truncation
Typically you would also use these on paths that update an inode's size.
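Again as a rough sketch for the hypothetical simplefs, a direct IO write path through iomap_dio_rw() might look like the following. Note that the argument list of iomap_dio_rw() has changed several times across kernel releases, so treat this only as an illustration.
{{{
static ssize_t simplefs_dio_write_iter(struct kiocb *iocb,
				       struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	inode_lock(inode);
	ret = generic_write_checks(iocb, from);
	if (ret > 0)
		ret = iomap_dio_rw(iocb, from, &simplefs_iomap_ops,
				   NULL /* no struct iomap_dio_ops */,
				   0, NULL, 0);
	inode_unlock(inode);
	return ret;
}
}}}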
Calling iomap for reads
You can call into iomap for reading, i.e., when dealing with the filesystem's struct file_operations:
- struct file_operations.read_iter(): note that depending on the type of read your filesystem might use iomap_dio_rw() for direct IO, generic_file_read_iter() for buffered IO, and dax_iomap_rw() for DAX.
- struct file_operations.remap_file_range() - currently the special dax_remap_file_range_prep() helper is provided for DAX mode reads.
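A hedged sketch of a read_iter() dispatching between the helpers listed above could look as follows for the hypothetical simplefs; which helper a real filesystem picks, and with which flags, is filesystem specific, and the iomap_dio_rw() arguments again track a recent kernel only.
{{{
static ssize_t simplefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	struct inode *inode = file_inode(iocb->ki_filp);

	if (IS_DAX(inode))
		return dax_iomap_rw(iocb, to, &simplefs_iomap_ops);
	if (iocb->ki_flags & IOCB_DIRECT)
		return iomap_dio_rw(iocb, to, &simplefs_iomap_ops, NULL,
				    0, NULL, 0);
	return generic_file_read_iter(iocb, to);
}
}}}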
Calling iomap for userspace file extent mapping
The fiemap ioctl can be used to allow userspace to get a file extent mapping. The older bmap() (aka FIBMAP) allows the VM to map a logical block offset to a physical block number. bmap() is a legacy block mapping operation supported only for the ioctl and two areas in the kernel which are likely broken (the default swapfile implementation and odd md bitmap code). bmap() was only useful in the days of ext2 when there was no support for delalloc or unwritten extents. Consequently, the interface reports nothing for those types of mappings. Because of this we don't want filesystems to start exporting this interface if they don't already do so.
The fiemap ioctl is supported through an inode struct inode_operations.fiemap() callback.
You would use iomap_fiemap() to provide the mapping. You could use two separate struct iomap_ops: one for when requested to also map extended attributes (FIEMAP_FLAG_XATTR), and another for regular reads when there is no need for extended attributes. In the future iomap may provide its own dedicated ops structure for fiemap.
iomap_bmap() exists and should only be used by filesystems that already supported FIBMAP. FIBMAP should not be used with the address_space -- we have iomap readpages and writepages for that.
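For the hypothetical simplefs, the fiemap and legacy bmap hooks described above might be wired up roughly as follows.
{{{
/* A filesystem that can map extended attribute blocks would dispatch to a
 * separate iomap_ops here when FIEMAP_FLAG_XATTR is set, as XFS does. */
static int simplefs_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
			   u64 start, u64 len)
{
	return iomap_fiemap(inode, fi, start, len, &simplefs_iomap_ops);
}

/* Only for filesystems that already exposed FIBMAP, as noted above. */
static sector_t simplefs_bmap(struct address_space *mapping, sector_t block)
{
	return iomap_bmap(mapping, block, &simplefs_iomap_ops);
}
}}}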
Calling iomap for assisting the VFS
A filesystem also needs to call iomap when assisting the VFS with manipulating file data in the page cache.
Calling iomap for VFS reading
A filesystem can call iomap to deal with the VFS reading a file into folios with:
- iomap_bmap() - called to assist the VFS when manipulating the page cache with struct address_space_operations.bmap(), to help the VFS map a logical block offset to a physical block number.
- iomap_read_folio() - called to assist the page cache with struct address_space_operations.read_folio()
- iomap_readahead() - called to assist the page cache with struct address_space_operations.readahead()
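A rough sketch of wiring up these read-side address_space operations for the hypothetical simplefs:
{{{
static int simplefs_read_folio(struct file *file, struct folio *folio)
{
	return iomap_read_folio(folio, &simplefs_iomap_ops);
}

static void simplefs_readahead(struct readahead_control *rac)
{
	iomap_readahead(rac, &simplefs_iomap_ops);
}

static const struct address_space_operations simplefs_aops = {
	.read_folio	= simplefs_read_folio,
	.readahead	= simplefs_readahead,
	/* a real filesystem also wires up .writepages, .bmap, etc. */
};
}}}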
Calling iomap for VFS writepages
A filesystem can call iomap to deal with the VFS writing pages back to the backing store, that is, to help deal with a filesystem's struct address_space_operations.writepages(). The special iomap_writepages() is used for this case together with the filesystem's own struct iomap_writeback_ops.
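A sketch of what that could look like for the hypothetical simplefs follows. The struct iomap_writeback_ops callbacks and the iomap_writepages() signature shown here follow kernels around the time of writing and have changed since, so treat this as illustrative only.
{{{
/*
 * map_blocks() fills wpc->iomap for the range being written back; this toy
 * version reuses the contiguous layout assumed in the earlier sketch.
 */
static int simplefs_map_blocks(struct iomap_writepage_ctx *wpc,
			       struct inode *inode, loff_t offset)
{
	wpc->iomap.bdev = inode->i_sb->s_bdev;
	wpc->iomap.offset = offset;
	wpc->iomap.length = i_size_read(inode) - offset;
	wpc->iomap.type = IOMAP_MAPPED;
	wpc->iomap.addr = offset;	/* toy contiguous layout */
	return 0;
}

static const struct iomap_writeback_ops simplefs_writeback_ops = {
	.map_blocks	= simplefs_map_blocks,
};

static int simplefs_writepages(struct address_space *mapping,
			       struct writeback_control *wbc)
{
	struct iomap_writepage_ctx wpc = { };

	return iomap_writepages(mapping, wbc, &wpc, &simplefs_writeback_ops);
}
}}}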
Calling iomap for VFS llseek
A filesystem's struct file_operations.llseek() is used by the VFS when it needs to move the current file offset; the file offset is kept in struct file.f_pos. iomap has special support for the llseek SEEK_HOLE and SEEK_DATA interfaces:
- iomap_seek_hole(): for when the struct file_operations.llseek() whence argument is SEEK_HOLE, when looking for the file's next hole.
- iomap_seek_data(): for when the struct file_operations.llseek() whence argument is SEEK_DATA, when looking for the file's next data area.
Providing your own struct iomap_ops for this is encouraged.
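For the hypothetical simplefs, an llseek() implementation using these helpers might look roughly like this; a dedicated read-only seek iomap_ops (like XFS's xfs_seek_iomap_ops) would normally be used instead of reusing simplefs_iomap_ops.
{{{
static loff_t simplefs_llseek(struct file *file, loff_t offset, int whence)
{
	struct inode *inode = file_inode(file);

	switch (whence) {
	case SEEK_HOLE:
		offset = iomap_seek_hole(inode, offset, &simplefs_iomap_ops);
		break;
	case SEEK_DATA:
		offset = iomap_seek_data(inode, offset, &simplefs_iomap_ops);
		break;
	default:
		return generic_file_llseek(file, offset, whence);
	}

	if (offset < 0)
		return offset;
	return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
}
}}}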
Calling iomap for DAX
You can use dax_iomap_rw() when calling iomap from a DAX context; this is typically done from the filesystem's struct file_operations.write_iter() callback.
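A short, hedged sketch of a DAX write path for the hypothetical simplefs; a real filesystem such as XFS provides a dedicated DAX write iomap_ops (xfs_dax_write_iomap_ops) rather than reusing its generic ops as done here.
{{{
static ssize_t simplefs_dax_write_iter(struct kiocb *iocb,
				       struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	inode_lock(inode);
	ret = generic_write_checks(iocb, from);
	if (ret > 0)
		ret = dax_iomap_rw(iocb, from, &simplefs_iomap_ops);
	inode_unlock(inode);
	return ret;
}
}}}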
Converting filesystems from buffer-head to iomap guide
These are generic guidelines on converting a filesystem over to iomap from buffer-heads.
One op at a time
You may try to convert a filesystem one clustered set of operations at a time; below is a generic order you may strive to target:
- direct io
- miscellaneous helpers (seek/fiemap/bmap)
- buffered io
Defining a simple filesystem
A simple filesystem is perhaps the easiest to convert over to iomap. A simple filesystem is one which:
- does not use fsverity, fscrypt, or compression
- has no Copy on Write support (reflinks)
Converting a simple filesystem to iomap
Simple filesystems should convert to iomap piecemeal: first converting over direct IO, then the miscellaneous helpers (seek/fiemap/bmap), and last buffered IO.
Dynamic mappings considerations
Filesystems that have dynamic mappings (e.g. anything other than zonefs) should fill out the validity cookie when doing page cache operations so that those ops can re-query the filesystem for mapping data if the mappings change out from under the operation. Writeback doesn't take the vfs locks, so this can happen.
Converting shared filesystem features
Shared filesystem features such as fscrypt, compression, erasure coding, and any other data transformations need to be ported to iomap first, as none of the current iomap users require any of this functionality.
Converting complex filesystems
If your filesystem relies on any of the shared filesystem features mentioned above, those would need to be converted piecemeal as well. If reflinks are supported you need to first ensure proper locking sanity so that byte ranges can be handled properly through iomap operations. An example filesystem where this work is taking place is btrfs.
When to set iomap on srcmap or dstmap
The struct iomap is required to be set in iomap_begin(); if it is a CoW path, also set the srcmap in iomap_begin().
This perhaps should be redesigned in the future depending on read / write requirements and it may take time to get this right.
Removal of IOMAP_F_BUFFER_HEAD
IOMAP_F_BUFFER_HEAD won't be removed until all filesystems are fully converted away from buffer-heads, and this could be never.
IOMAP_F_BUFFER_HEAD should be avoided as a stepping stone to port filesystems over to iomap, as its support for buffer-heads only applies to the buffered write path and nothing else, not even the read_folio/readahead and writepages aops.
Testing Direct IO
Other than fstests you can use LTP's dio tests, however these are limited as they do not test for stale data.
./runltp -f dio -d /mnt1/scratch/tmp/
Known issues and future improvements
Other than the lack of documentation there are some known issues and limitations with iomap at this time. We try to itemize them here:
- write amplification with iomap when bs < ps
- iomap needs improvements to dirty bitmap tracking for large folios
Q&A
- Why does btrfs only have a few iomap calls?
  - the current iomap use case is only for direct I/O
  - converting the buffered I/O code is a lot more work
  - btrfs does a lot of really odd things in its buffered I/O path that can't work with iomap and should be fixed (Goldwyn has been working on this)