iomap
iomap grew out of the need to provide a modern block mapping abstraction for filesystems, covering the different IO access methods they support and assisting the VFS with manipulating files in the page cache. iomap helpers are provided for each of these mechanisms.
Block mapping provides a mapping between data cached in memory, in the page cache, and the location on persistent storage where that data lives. LWN has an incredible review of the old buffer-heads block mapping, in use since the inception of Linux, and of why buffer-heads are inefficient: https://lwn.net/Articles/930173/. Since buffer-heads work on a 512-byte block paradigm, they create overhead for modern storage media, which no longer necessarily works only on 512-byte blocks. This document strives to provide a template for LSFMM for what will hopefully eventually become upstream Linux kernel documentation for iomap, along with guidance for developers on converting a filesystem over from buffer-heads to iomap.
A modern block abstraction
Instead of assuming that storage media is accessed one 512-byte block at a time, iomap lets filesystems express block mappings in terms of block ranges. Since block mappings are provided for ranges of the data cached in memory, in the page cache, operations on block ranges naturally imply multipage operations in the page cache. Folios are used to provide multipage operations in memory.
struct iomap_ops
A filesystem is encouraged to provide a struct iomap_ops for beginning an IO operation and ending an IO operation on a block range, and so the struct iomap_ops data structure has iomap_begin() and iomap_end() callbacks. You call iomap with a specialized helper depending on the filesystem or page cache interaction at hand. For example, iomap_dio_rw() is used for Direct IO: in your filesystem's respective struct file_operations.write_iter() you would eventually call iomap_dio_rw(..., &filesystem_direct_write_iomap_ops, &your_filesystem_dio_write_ops, ...) when handling Direct IO. For buffered IO you would use iomap_file_buffered_write(..., &your_filesystem_buffered_write_iomap_ops) in the same struct file_operations.write_iter(). That is not the only situation in which a filesystem deals with buffered writes, though: a filesystem may also use buffered writes in its struct file_operations.fallocate(), and for that case the special iomap_zero_range(..., &your_filesystem_buffered_write_iomap_ops) helper exists. struct file_operations.fallocate() also supports truncation, for which you would use iomap_truncate_page(..., &your_filesystem_buffered_write_iomap_ops). We elaborate on these below.
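To give a feel for the shape of these callbacks, here is a minimal, hypothetical sketch. The myfs_* names and the myfs_map_blocks() helper are made up for illustration, and exact prototypes vary between kernel releases, so check include/linux/iomap.h for your kernel:

#include <linux/iomap.h>

/* Hypothetical example for a filesystem "myfs"; not taken from any
 * real driver. myfs_map_blocks() stands in for your extent lookup.
 */
static int myfs_read_iomap_begin(struct inode *inode, loff_t pos,
        loff_t length, unsigned flags, struct iomap *iomap,
        struct iomap *srcmap)
{
    /* Fill in iomap->type, iomap->addr, iomap->offset and
     * iomap->length for the extent covering [pos, pos + length).
     */
    return myfs_map_blocks(inode, pos, length, iomap);
}

static int myfs_read_iomap_end(struct inode *inode, loff_t pos,
        loff_t length, ssize_t written, unsigned flags,
        struct iomap *iomap)
{
    /* Undo reservations taken in ->iomap_begin(), if any. */
    return 0;
}

static const struct iomap_ops myfs_read_iomap_ops = {
    .iomap_begin    = myfs_read_iomap_begin,
    .iomap_end      = myfs_read_iomap_end,
};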
Experience in adopting iomap in XFS has shown that the filesystem implementation of these operations can be simplified considerably if one struct iomap_ops is provided per major filesystem IO operation:
- read
- direct writes
- DAX writes
- buffered writes
- xattr - FIEMAP_FLAG_XATTR
- seek
For example:
- struct iomap_ops xfs_read_iomap_ops - iomap: lift the xfs writeback code to iomap
- struct iomap_ops xfs_direct_write_iomap_ops
- struct iomap_ops xfs_dax_write_iomap_ops
- struct iomap_ops xfs_buffered_write_iomap_ops - xfs: split out a new set of read-only iomap ops
- struct iomap_ops xfs_xattr_iomap_ops - xfs: fix SEEK_DATA for speculative COW fork preallocation
- struct iomap_ops xfs_seek_iomap_ops - iomap: move the iomap_dio_rw ->end_io callback into a structure
struct iomap_dio_ops
Used for Direct IO. These are passed to iomap_dio_rw(); a sketch follows the list below.
- struct iomap_dio_ops xfs_dio_write_ops->end_io() - iomap: add a filesystem hook for direct I/O bio submission
- struct iomap_dio_ops xfs_dio_write_ops->submit_io() - xfs: split the iomap ops for buffered vs direct writes
struct iomap_writeback_ops
struct iomap_writeback_ops is used when dealing with a filesystem's struct address_space_operations.writepages() callback, for writeback. A sketch follows the example below.
- struct iomap_writeback_ops xfs_writeback_ops - xfs: support CoW in fsdax mode
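A hedged sketch of the writeback hook, with hypothetical myfs_* names (myfs_lookup_extent() stands in for your extent lookup; on some kernel versions map_blocks() also takes a length argument):

static int myfs_writeback_map_blocks(struct iomap_writepage_ctx *wpc,
        struct inode *inode, loff_t offset)
{
    /* Fill wpc->iomap with the extent covering offset so the generic
     * iomap writeback code can build and submit the bios.
     */
    return myfs_lookup_extent(inode, offset, &wpc->iomap);
}

static const struct iomap_writeback_ops myfs_writeback_ops = {
    .map_blocks = myfs_writeback_map_blocks,
};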
Calling iomap
You call iomap depending on the type of filesystem operation you are working on. We detail some of these interactions below.
Calling iomap for buffered IO writes
You call iomap for buffered IO with:
- iomap_file_buffered_write() - for buffered writes
- iomap_page_mkwrite() - when dealing with callbacks for struct vm_operations_struct:
  - struct vm_operations_struct.page_mkwrite()
  - struct vm_operations_struct.fault()
  - struct vm_operations_struct.huge_fault()
  - struct vm_operations_struct.pfn_mkwrite()
You may use buffered writes to also deal with fallocate():
- iomap_zero_range() on fallocate for zeroing
- iomap_truncate_page() on fallocate for truncation
Typically you would also use these on paths that update an inode's size.
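A hedged sketch of wiring buffered writes into write_iter(), with hypothetical myfs_* names and error handling trimmed down (recent kernels may also take a private pointer in iomap_file_buffered_write()):

static ssize_t myfs_buffered_write_iter(struct kiocb *iocb,
        struct iov_iter *from)
{
    struct inode *inode = file_inode(iocb->ki_filp);
    ssize_t ret;

    inode_lock(inode);
    ret = generic_write_checks(iocb, from);
    if (ret > 0)
        ret = iomap_file_buffered_write(iocb, from,
                &myfs_buffered_write_iomap_ops);
    inode_unlock(inode);

    /* Honor O_(D)SYNC by flushing what we just wrote. */
    if (ret > 0)
        ret = generic_write_sync(iocb, ret);
    return ret;
}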
Calling iomap for direct IO
You call iomap for direct IO with:
- iomap_dio_rw()
You may use direct IO writes to also deal with fallocate():
- iomap_zero_range() on fallocate for zeroing
- iomap_truncate_page() on fallocate for truncation
Typically you would also use these on paths that update an inode's size.
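A minimal hypothetical sketch of a direct write path (the myfs_* ops are the made-up names from the earlier sketches; the trailing arguments of iomap_dio_rw() vary across kernel releases):

static ssize_t myfs_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
    /* 0 for dio_flags keeps the defaults; recent kernels also take
     * a private pointer and a done_before count, passed here as
     * NULL and 0.
     */
    return iomap_dio_rw(iocb, from, &myfs_direct_write_iomap_ops,
            &myfs_dio_write_ops, 0, NULL, 0);
}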
Calling iomap for reads
You can call into iomap for reading, i.e., when dealing with the filesystem's struct file_operations:
- struct file_operations.read_iter(): note that depending on the type of read your filesystem might use iomap_dio_rw() for direct IO, generic_file_read_iter() for buffered IO, and dax_iomap_rw() for DAX.
- struct file_operations.remap_file_range() - currently the special dax_remap_file_range_prep() helper is provided for DAX mode reads.
Calling iomap for userspace file extent mapping
The fiemap ioctl can be used to allow userspace to get a file's extent mapping, rather than the logical block offset to physical block number mapping gathered by the VFS. The fiemap ioctl is supported through an inode's struct inode_operations.fiemap() callback.
You would use iomap_fiemap() for this. You could use two separate struct iomap_ops: one for when you are asked to also map extended attributes (FIEMAP_FLAG_XATTR), and your regular read struct iomap_ops for when you are not.
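A hedged sketch, with hypothetical myfs_* ops, of picking between the two:

static int myfs_fiemap(struct inode *inode,
        struct fiemap_extent_info *fieinfo, u64 start, u64 len)
{
    /* Extended attribute mappings get their own iomap_ops. */
    if (fieinfo->fi_flags & FIEMAP_FLAG_XATTR)
        return iomap_fiemap(inode, fieinfo, start, len,
                &myfs_xattr_iomap_ops);
    return iomap_fiemap(inode, fieinfo, start, len,
            &myfs_read_iomap_ops);
}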
Calling iomap for assisting the VFS
A filesystem also needs to call iomap when assisting the VFS in manipulating a file in the page cache.
Calling iomap for VFS reading
A filesystem can call iomap to deal with the VFS reading a file into folios with:
- iomap_bmap() - called to assist the VFS when manipulating the page cache with struct address_space_operations.bmap(), to help the VFS map a logical block offset to a physical block number.
- iomap_read_folio() - called to assist the page cache with struct address_space_operations.read_folio()
- iomap_readahead() - called to assist the page cache with struct address_space_operations.readahead()
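A hedged sketch of wiring these into the address space operations (hypothetical myfs_* names; callback prototypes match recent folio-based kernels):

static sector_t myfs_bmap(struct address_space *mapping, sector_t block)
{
    return iomap_bmap(mapping, block, &myfs_read_iomap_ops);
}

static int myfs_read_folio(struct file *file, struct folio *folio)
{
    return iomap_read_folio(folio, &myfs_read_iomap_ops);
}

static void myfs_readahead(struct readahead_control *rac)
{
    iomap_readahead(rac, &myfs_read_iomap_ops);
}

static const struct address_space_operations myfs_aops = {
    .bmap       = myfs_bmap,
    .read_folio = myfs_read_folio,
    .readahead  = myfs_readahead,
    /* ... write side, e.g. writepages, shown below ... */
};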
Calling iomap for VFS writepages
A filesystem can call iomap to deal with the VFS writing pages out to the backing store, that is, to help deal with a filesystem's struct address_space_operations.writepages(). The special iomap_writepages() helper is used for this case, together with the filesystem's own struct iomap_writeback_ops.
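A hedged sketch (hypothetical myfs_* names; some filesystems embed struct iomap_writepage_ctx in a larger context structure, and recent kernels have reworked how the ops are passed):

static int myfs_writepages(struct address_space *mapping,
        struct writeback_control *wbc)
{
    struct iomap_writepage_ctx wpc = { };

    /* The generic code calls back into myfs_writeback_ops to map
     * each range being written back.
     */
    return iomap_writepages(mapping, wbc, &wpc, &myfs_writeback_ops);
}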
Calling iomap for VFS llseek
A filesystem's struct file_operations.llseek() is used by the VFS when it needs to move the current file offset, which is kept in struct file.f_pos. Although generic_file_llseek() is typically used for most cases, two helpers exist to call iomap if a filesystem has to handle SEEK_HOLE or SEEK_DATA specially:
- iomap_seek_hole(): for when the struct file_operations.llseek() whence argument is SEEK_HOLE, when looking for the file's next hole.
- iomap_seek_data(): for when the struct file_operations.llseek() whence argument is SEEK_DATA, when looking for the file's next data area.
Providing your own struct iomap_ops for this is encouraged.
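A hedged sketch of an llseek() implementation using these helpers (hypothetical myfs_* names):

static loff_t myfs_llseek(struct file *file, loff_t offset, int whence)
{
    struct inode *inode = file_inode(file);

    switch (whence) {
    case SEEK_HOLE:
        offset = iomap_seek_hole(inode, offset, &myfs_seek_iomap_ops);
        break;
    case SEEK_DATA:
        offset = iomap_seek_data(inode, offset, &myfs_seek_iomap_ops);
        break;
    default:
        /* Everything else is handled generically. */
        return generic_file_llseek(file, offset, whence);
    }

    if (offset < 0)
        return offset;
    return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
}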
Calling iomap for DAX
You can use dax_iomap_rw() when calling iomap from a DAX context; this is typically done from the filesystem's struct file_operations.write_iter() callback.
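A minimal hypothetical sketch, with locking and write checks elided:

static ssize_t myfs_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
    /* DAX does IO directly against the persistent memory mapping,
     * bypassing the page cache.
     */
    return dax_iomap_rw(iocb, from, &myfs_dax_write_iomap_ops);
}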
Converting filesystems from buffer-head to iomap guide
These are generic guidelines on converting a filesystem over to iomap from buffer-heads.
One op at a time
You may try to convert one filesystem IO operation at a time; for instance, this order reflects the order in which XFS converted over to iomap:
- xattr
- seek
- direct writes
- buffered writes
- read
- DAX writes
Defining a simple filesystem
A simple filesystem is perhaps the easiest to convert over to iomap. A simple filesystem is one which:
- does not use fsverity, fscrypt, or compression
- has no direct overwrites
- has no Copy on Write support (reflinks)
Converting a simple filesystem to iomap
Simple filesystems should convert to iomap directly and avoid buffer-heads entirely, i.e., don't use IOMAP_F_BUFFER_HEAD.
Converting shared filesystem features
fscrypt, fsverity, and compression need to be converted to iomap first if a filesystem uses them, as iomap supports no permutations (XXX: clarify this)
Converting complex filesystems
If your filesystem does not fit the simple description above, the general recommendation is to port to iomap with IOMAP_F_BUFFER_HEAD for one kernel release, to verify that you have no bugs with locking, writeback, and general use of your new struct iomap_ops.
When to set iomap on srcmap or dstmap
The struct iomap is required to be set in iomap_begin(); if it is a CoW path, also set srcmap in iomap_begin().
This perhaps should be redesigned in the future depending on read / write requirements and it may take time to get this right.
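As a hedged illustration of the intent (all myfs_* helpers here are hypothetical stand-ins for your own extent lookup and CoW logic), an iomap_begin() on a CoW path might look like:

static int myfs_buffered_write_iomap_begin(struct inode *inode,
        loff_t pos, loff_t length, unsigned flags,
        struct iomap *iomap, struct iomap *srcmap)
{
    int ret;

    /* The destination mapping: where new data will be written. */
    ret = myfs_map_write_destination(inode, pos, length, iomap);
    if (ret)
        return ret;

    /* On a CoW path, also describe where the existing data lives,
     * so the generic code can read the old contents around an
     * unaligned write before the copy-on-write.
     */
    if (myfs_range_is_cow(inode, pos, length))
        ret = myfs_map_existing_data(inode, pos, length, srcmap);
    return ret;
}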
Removal of IOMAP_F_BUFFER_HEAD
IOMAP_F_BUFFER_HEAD won't be removed until we have all filesystems fully converted away from buffer-heads, and this could be never.
Testing Direct IO
Other than fstests, you can use LTP's dio tests; however, these are limited as they do not test for stale data.
./runltp -f dio -d /mnt1/scratch/tmp/
Known issues and future improvements
Other than the lack of documentation, there are some known issues and limitations with iomap at this time. We try to itemize them here:
- write amplification with iomap when the block size is smaller than the page size (bs < ps)
- iomap needs improvements to dirty bitmap tracking for large folios
Q&A
- Why does btrfs only have a few iomap calls?
  - btrfs manages page cache folios for buffered IO itself?

References
- Presentation on iomap evolution: https://docs.google.com/presentation/d/e/2PACX-1vSN4TmhiTu1c6HNv6_gJZFqbFZpbF7GkABllSwJw5iLnSYKkkO-etQJ3AySYEbgJA/pub?start=true&loop=false&delayms=3000&slide=id.g189cfd05063_0_185
- LWN review of deprecating buffer-heads: https://lwn.net/Articles/930173/