= iomap =
'''iomap''' grew out of the need to provide a '''modern''' block mapping abstraction for filesystems across three different IO access methods:
 * Direct IO
 * Buffered IO
 * DAX

Block mapping provides a mapping between data cached in memory (in the page cache) and the location on persistent storage where that data lives. [[https://lwn.net/Articles/930173/|LWN has an incredible review of the old buffer-heads block mapping]], which has been used since the inception of Linux, and of why it is inefficient. Since buffer-heads follow a 512-byte block based paradigm, describing one small block at a time, they create overhead for modern storage media, which no longer necessarily work only in 512-byte blocks.

This document strives to provide a template, for LSFMM, of what will hopefully become upstream Linux kernel documentation for '''iomap''', along with guidance for developers on converting a filesystem from buffer-heads to '''iomap'''.

== A modern block abstraction ==
Instead of assuming a storage granularity of one 512-byte block at a time, '''iomap''' works on block ranges: a filesystem can map a whole contiguous range of a file (an extent) to its location on storage with a single call. Since block mappings are provided for ranges of data cached in memory (in the page cache), operations on block ranges naturally also involve multipage operations in the page cache. Folios are used to help provide '''multipage''' operations in memory.

== struct iomap_ops ==
A filesystem provides a `struct iomap_ops` to begin and to end an IO operation on a block range, and so the `struct iomap_ops` data structure has `iomap_begin()` and `iomap_end()` callbacks. You call '''iomap''' through a specialized '''iomap''' operation depending on the filesystem or page cache interaction involved. For example, `iomap_dio_rw()` is used for Direct IO.
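The begin/end pair can be illustrated with a small user-space model. Everything here (`struct toy_iomap`, `struct toy_iomap_ops`, `toy_iomap_walk()`, the 512 KiB extent layout) is hypothetical and far simpler than the kernel's `struct iomap` and its iteration code; it is only a sketch of why handing out block ranges, instead of one 512-byte block at a time, cuts down the number of mapping calls. As in the kernel, the `iomap_end` callback is optional and left NULL here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's struct iomap. */
struct toy_iomap {
	int64_t offset; /* file offset this mapping starts at, in bytes */
	int64_t addr;   /* toy "disk" address of that offset, in bytes */
	int64_t length; /* bytes covered by this mapping */
};

/* Hypothetical stand-in for struct iomap_ops. */
struct toy_iomap_ops {
	/* Map as much of [pos, pos + length) as possible, starting at pos. */
	int (*iomap_begin)(int64_t pos, int64_t length, struct toy_iomap *iomap);
	/* Told how much of the mapping was processed; optional (may be NULL). */
	int (*iomap_end)(int64_t pos, int64_t length, int64_t done,
			 struct toy_iomap *iomap);
};

#define TOY_EXTENT_SIZE (512 * 1024) /* toy file layout: 512 KiB extents */
#define TOY_BLOCK_SIZE 512           /* buffer-head style granularity */

static int64_t toy_min(int64_t a, int64_t b) { return a < b ? a : b; }

/* Toy layout: extents are placed on disk with a gap between them. */
static int64_t toy_disk_addr(int64_t pos)
{
	return (pos / TOY_EXTENT_SIZE) * 2 * TOY_EXTENT_SIZE + pos % TOY_EXTENT_SIZE;
}

/* Extent based: one call covers everything up to the end of pos's extent. */
static int toy_extent_begin(int64_t pos, int64_t length, struct toy_iomap *iomap)
{
	int64_t ext_end = (pos / TOY_EXTENT_SIZE + 1) * TOY_EXTENT_SIZE;

	iomap->offset = pos;
	iomap->addr = toy_disk_addr(pos);
	iomap->length = toy_min(ext_end - pos, length);
	return 0;
}

/* Buffer-head style: one call per 512-byte block. */
static int toy_block_begin(int64_t pos, int64_t length, struct toy_iomap *iomap)
{
	int64_t blk_end = (pos / TOY_BLOCK_SIZE + 1) * TOY_BLOCK_SIZE;

	iomap->offset = pos;
	iomap->addr = toy_disk_addr(pos);
	iomap->length = toy_min(blk_end - pos, length);
	return 0;
}

/* Map, do the IO for the mapped piece, finish, advance. Returns the number
 * of iomap_begin() calls made, or -1 on error. */
static long toy_iomap_walk(const struct toy_iomap_ops *ops, int64_t pos, int64_t length)
{
	long calls = 0;

	while (length > 0) {
		struct toy_iomap map;

		if (ops->iomap_begin(pos, length, &map) != 0 || map.length <= 0)
			return -1;
		assert(map.offset == pos && map.length <= length);
		calls++;
		/* Real code would read or write [map.offset, map.offset + map.length) here. */
		if (ops->iomap_end &&
		    ops->iomap_end(pos, length, map.length, &map) != 0)
			return -1;
		pos += map.length;
		length -= map.length;
	}
	return calls;
}

static const struct toy_iomap_ops toy_extent_ops = { .iomap_begin = toy_extent_begin };
static const struct toy_iomap_ops toy_block_ops = { .iomap_begin = toy_block_begin };
```

Walking 1 MiB from offset 0 takes two `iomap_begin()` calls with the extent based callback and 2048 with the per-block one.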
For example, in your filesystem's `struct file_operations.write_iter()` you'd eventually call `iomap_dio_rw`(..., &`your_filesystem`_'''direct_write'''_iomap_ops, &`your_filesystem`_'''dio_write'''_ops…) when dealing with Direct IO. For buffered IO you'd use `iomap_file_buffered_write`(..., &`your_filesystem`_'''buffered_write'''_iomap_ops) in the same `struct file_operations.write_iter()`. But that is not the only situation in which a filesystem deals with buffered writes: a filesystem's `struct file_operations.fallocate()` may also need them, and for this case there is `iomap_zero_range`(..., &`your_filesystem`_'''buffered_write'''_iomap_ops). Truncation similarly has to zero the partial block beyond the new end of file, and for that you'd use `iomap_truncate_page`(..., &`your_filesystem`_'''buffered_write'''_iomap_ops). We'll elaborate on these more below.

Experience in adopting '''iomap''' on XFS has shown that the filesystem implementation of these operations can be simplified considerably if one `struct iomap_ops` is provided per major filesystem IO operation:
 * read
 * direct writes
 * DAX writes
 * buffered writes
 * xattr - FIEMAP_FLAG_XATTR
 * seek

For example:
 * `struct iomap_ops` xfs_'''read'''_iomap_ops - iomap: lift the xfs writeback code to iomap
 * `struct iomap_ops` xfs_'''direct_write'''_iomap_ops
 * `struct iomap_ops` xfs_'''dax_write'''_iomap_ops
 * `struct iomap_ops` xfs_'''buffered_write'''_iomap_ops - xfs: split out a new set of read-only iomap ops
 * `struct iomap_ops` xfs_'''xattr'''_iomap_ops - xfs: fix SEEK_DATA for speculative COW fork preallocation
 * `struct iomap_ops` xfs_'''seek'''_iomap_ops - iomap: move the iomap_dio_rw ->end_io callback into a structure

== struct iomap_dio_ops ==
Used for Direct IO. A `struct iomap_dio_ops` is passed to `iomap_dio_rw()` alongside the Direct IO `struct iomap_ops`.
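The `write_iter()` dispatch described above, with one ops table per major operation and a completion hook in the spirit of `struct iomap_dio_ops`, can be modeled in user space. All `toy*` names are hypothetical: `toy_dio_rw()` and `toy_buffered_write()` only stand in for `iomap_dio_rw()` and `iomap_file_buffered_write()`, whose real signatures take a `struct kiocb` and a `struct iov_iter`. A real `->write_iter()` would typically test `iocb->ki_flags & IOCB_DIRECT` instead of taking a `bool`, and the kernel's `end_io()` receives the `kiocb` and flags rather than a position.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in: the real struct iomap_ops holds callbacks. */
struct toy_iomap_ops {
	const char *name; /* which major IO operation this table serves */
};

struct toy_dio_ops {
	/* Loosely modeled on iomap_dio_ops->end_io(): called once the direct
	 * write has completed, with its size and error status. */
	int (*end_io)(long long pos, long long size, int error);
};

/* One ops table per major IO operation, following the XFS experience. */
static const struct toy_iomap_ops toyfs_direct_write_iomap_ops = { "direct_write" };
static const struct toy_iomap_ops toyfs_buffered_write_iomap_ops = { "buffered_write" };

/* Toy state: file size committed after completed direct writes. */
static long long toyfs_committed_size;
/* Records which ops table the last write was handed, for inspection. */
static const struct toy_iomap_ops *toy_last_ops;

static int toyfs_dio_write_end_io(long long pos, long long size, int error)
{
	if (error)
		return error;
	/* Extend the committed size only once the data has been written. */
	if (pos + size > toyfs_committed_size)
		toyfs_committed_size = pos + size;
	return 0;
}

static const struct toy_dio_ops toyfs_dio_write_ops = {
	.end_io = toyfs_dio_write_end_io,
};

/* Hypothetical stand-in for iomap_dio_rw(). */
static long long toy_dio_rw(long long pos, long long len,
			    const struct toy_iomap_ops *ops,
			    const struct toy_dio_ops *dops)
{
	toy_last_ops = ops;
	/* ...map the range with ops, then build and submit bios... */
	if (dops && dops->end_io) {
		int ret = dops->end_io(pos, len, 0);

		if (ret)
			return ret;
	}
	return len;
}

/* Hypothetical stand-in for iomap_file_buffered_write(). */
static long long toy_buffered_write(long long pos, long long len,
				    const struct toy_iomap_ops *ops)
{
	(void)pos;
	toy_last_ops = ops;
	/* ...map the range with ops and copy data into page cache folios... */
	return len;
}

/* Modeled on a filesystem's ->write_iter(): pick the ops by IO kind. */
static long long toyfs_write_iter(bool direct, long long pos, long long len)
{
	if (direct)
		return toy_dio_rw(pos, len, &toyfs_direct_write_iomap_ops,
				  &toyfs_dio_write_ops);
	return toy_buffered_write(pos, len, &toyfs_buffered_write_iomap_ops);
}
```

Keeping the direct and buffered tables separate means each `iomap_begin()` only has to handle one kind of IO, which is the simplification the XFS experience points to.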
 * `struct iomap_dio_ops` xfs_'''dio_write'''_ops->end_io() - iomap: add a filesystem hook for direct I/O bio submission
 * `struct iomap_dio_ops` xfs_'''dio_write'''_ops->submit_io() - xfs: split the iomap ops for buffered vs direct writes

== struct iomap_writeback_ops ==
The `struct iomap_writeback_ops` is used when implementing a filesystem's `struct address_space_operations.writepages()`, for writeback.
 * `struct iomap_writeback_ops` xfs_'''writeback'''_ops - xfs: support CoW in fsdax mode

== Guide to converting filesystems from buffer-heads to iomap ==
These are generic guidelines on converting a filesystem over to '''iomap''' from '''buffer-heads'''.

=== One op at a time ===
You may convert one filesystem IO operation at a time. For instance, XFS started converting to '''iomap''' in this order:
 * xattr
 * seek
 * direct writes
 * buffered writes
 * read
 * DAX writes

=== Defining a simple filesystem ===
A simple filesystem is perhaps the easiest to convert to '''iomap'''. A simple filesystem is one which:
 * does not use fsverity, fscrypt or compression
 * has no direct overwrites
 * has no Copy on Write support (reflinks)

==== Converting a simple filesystem to iomap ====
Simple filesystems should convert to '''iomap''' directly and avoid buffer-heads entirely, i.e., don't use `IOMAP_F_BUFFER_HEAD`.

=== Converting shared filesystem features ===
If a filesystem uses fscrypt, fsverity or compression, those features need to be converted to '''iomap''' first, as '''iomap''' supports no permutations (XXX: clarify on this).

=== Converting complex filesystems ===
If your filesystem does not fit the simple description above, the general recommendation is to port to '''iomap''' with `IOMAP_F_BUFFER_HEAD` in one kernel release, to verify you have no bugs with locking, writeback and general use of your new `struct iomap_ops`.
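The conversion guidance in the sections above (shared features first, then simple versus complex) can be condensed into a small decision function. The feature bits and plan names are hypothetical labels invented for this sketch, not kernel flags; `IOMAP_F_BUFFER_HEAD` only appears in comments. It encodes the guidance exactly as written on this page and nothing more.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical feature bits for this sketch; these are not kernel flags. */
enum toyfs_feature {
	TOYFS_FSVERITY          = 1 << 0,
	TOYFS_FSCRYPT           = 1 << 1,
	TOYFS_COMPRESSION       = 1 << 2,
	TOYFS_DIRECT_OVERWRITES = 1 << 3,
	TOYFS_REFLINK           = 1 << 4, /* Copy on Write support */
};

/* Features that must move to iomap before the filesystem itself does. */
#define TOYFS_SHARED_FEATURES \
	(TOYFS_FSVERITY | TOYFS_FSCRYPT | TOYFS_COMPRESSION)
/* Any of these makes a filesystem "not simple" per the definition above. */
#define TOYFS_NOT_SIMPLE \
	(TOYFS_SHARED_FEATURES | TOYFS_DIRECT_OVERWRITES | TOYFS_REFLINK)

enum toyfs_plan {
	/* Convert fscrypt/fsverity/compression to iomap before anything else. */
	PLAN_CONVERT_SHARED_FEATURES_FIRST,
	/* Simple filesystem: go straight to iomap, no IOMAP_F_BUFFER_HEAD. */
	PLAN_IOMAP_DIRECTLY,
	/* Complex filesystem: one release using IOMAP_F_BUFFER_HEAD first. */
	PLAN_IOMAP_WITH_BUFFER_HEAD,
};

static bool toyfs_is_simple(unsigned int features)
{
	return (features & TOYFS_NOT_SIMPLE) == 0;
}

static enum toyfs_plan toyfs_pick_plan(unsigned int features,
				       bool shared_features_on_iomap)
{
	if ((features & TOYFS_SHARED_FEATURES) && !shared_features_on_iomap)
		return PLAN_CONVERT_SHARED_FEATURES_FIRST;
	if (toyfs_is_simple(features))
		return PLAN_IOMAP_DIRECTLY;
	return PLAN_IOMAP_WITH_BUFFER_HEAD;
}
```

A filesystem using fscrypt stays on the transitional `IOMAP_F_BUFFER_HEAD` path even after fscrypt itself has moved to '''iomap''', since it still does not match the simple definition.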
=== When to set iomap on srcmap or dstmap ===
`iomap_begin()` is always required to fill in the `struct iomap` describing the destination mapping. On a '''CoW''' path, `iomap_begin()` must also set `srcmap`, describing where the existing data should be read from. This perhaps should be redesigned in the future depending on read/write requirements, and it may take time to get right.

=== Removal of IOMAP_F_BUFFER_HEAD ===
`IOMAP_F_BUFFER_HEAD` won't be removed until all filesystems are fully converted away from '''buffer-heads''', and this could be never.

=== Testing Direct IO ===
Other than fstests, you can use LTP's dio tests; however, they are limited as they do not test for stale data.
{{{
./runltp -f dio -d /mnt1/scratch/tmp/
}}}

=== Known issues and future improvements ===
Other than the lack of documentation, there are some known issues and limitations with '''iomap''' at this time. We try to itemize them here:
 * write amplification with '''iomap''' when the block size is smaller than the page size (bs < ps)
 * '''iomap''' needs improvements in dirty bitmap tracking for large folios

=== Q&A ===
 * Why does btrfs only have a few '''iomap''' calls?
  * Possibly because btrfs manages page cache folios itself for buffered IO.

=== References ===
 * [[https://docs.google.com/presentation/d/e/2PACX-1vSN4TmhiTu1c6HNv6_gJZFqbFZpbF7GkABllSwJw5iLnSYKkkO-etQJ3AySYEbgJA/pub?start=true&loop=false&delayms=3000&slide=id.g189cfd05063_0_185|Presentation on iomap evolution]]
 * [[https://lwn.net/Articles/930173/|LWN review of deprecating buffer-heads]]