= Large block sizes =

Today the typical default is a 4 KiB block size for storage IO and for filesystems. To leverage storage hardware more efficiently it would be ideal to increase the block size for both storage IO and filesystems. This document tries to itemize the efforts required to address this properly in the Linux kernel. These goals are long term; while some tasks / problems may be rather easy to address, addressing this completely is a large effort which will often require coordination between different subsystems, in particular the memory management, IO, and filesystem folks.

== Filesystem support for 64 KiB block sizes ==

Filesystems can support 64 KiB block sizes by allowing the user to specify the block size when creating the filesystem. For instance this can be accomplished as follows with XFS:

{{{
mkfs.xfs -f -b size=65536
}}}

In this example a 64 KiB block size is used. This corresponds to the respective kdevops XFS section used to test xfs_bigblock. In order to use and test this, however, the underlying architecture must also support 64 KiB page sizes. Examples of these are:

 * ARM64: through CONFIG_ARM64_64K_PAGES
 * PowerPC: through CONFIG_PPC_64K_PAGES

The x86_64 architecture still needs work to support this.

What a filesystem needs to do to support this is map the correct offset in the file to the corresponding block on the disk; this is abstracted away behind the get_block_t API. When CoW is enabled this gets a bit more complicated under memory pressure on write, as the kernel would need all of the pages backing a 64 KiB block in memory. Low-memory conditions create problems here.

== Storage IO supporting 64 KiB block sizes ==

A world where the storage block size matches the 4 KiB PAGE_SIZE is rather simple with respect to writeback under memory pressure. If your storage device has a block size larger than PAGE_SIZE, the kernel can only send a write once it has all of the required data in memory. In this situation reading data also means waiting for the whole block to be read from the drive. You could use something like a bit bucket for the parts you do not yet need, however that data would somehow have to be invalidated should a write come through during, say, a second PAGE_SIZE read of data on a storage block of twice the PAGE_SIZE.

=== Using folios on the page cache to help ===

To address the complexity of writeback when the storage uses IO block sizes larger than PAGE_SIZE, folios should be used as the long term solution for the page cache, opportunistically caching files in larger chunks. However, there are some problems that need to be considered and measured to prove or disprove their value for storage.

Folios can address the writeback problem by ensuring that the storage block size is treated as a single unit in the page cache. A problem with this is the assumption that the kernel can keep providing allocations of the target IO block size over time in light of possible memory fragmentation. The assumption here is that if a filesystem is regularly creating allocations in the block size required by the storage device, then the kernel will also be able to reclaim memory in those block sizes. A concern is that some workloads may end up spiraling down with no allocations available for the target IO block size. Testing this hypothesis is something the whole community could work on.
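As a rough illustration of what "treating the storage block size as a single unit in the page cache" could look like, here is a minimal sketch assuming a hypothetical filesystem read path, a 64 KiB block size and a 4 KiB PAGE_SIZE. The function name lbs_cache_block() and the constant LBS_BLOCK_ORDER are made up for this example and are not an existing kernel API; the folio helpers used (filemap_alloc_folio(), filemap_add_folio(), folio_mark_uptodate(), folio_unlock(), folio_put()) are the ones the page cache already provides.

{{{
/*
 * Illustrative sketch only: cache one 64 KiB storage block as a single
 * large folio so that dirtying and writeback can treat the whole block
 * as one unit. lbs_cache_block() is a made-up name, not a kernel API.
 */
#include <linux/pagemap.h>
#include <linux/gfp.h>

#define LBS_BLOCK_ORDER 4	/* 2^4 pages * 4 KiB = 64 KiB */

static int lbs_cache_block(struct address_space *mapping, pgoff_t index)
{
	struct folio *folio;
	int err;

	/* May fail under fragmentation; this is the risk discussed above. */
	folio = filemap_alloc_folio(GFP_KERNEL, LBS_BLOCK_ORDER);
	if (!folio)
		return -ENOMEM;

	/* Large folios must be naturally aligned within the file. */
	err = filemap_add_folio(mapping, folio,
				round_down(index, 1UL << LBS_BLOCK_ORDER),
				GFP_KERNEL);
	if (err) {
		folio_put(folio);
		return err;
	}

	/* ... read the full 64 KiB block from the device into the folio ... */

	folio_mark_uptodate(folio);
	folio_unlock(folio);	/* filemap_add_folio() returned it locked */
	folio_put(folio);
	return 0;
}
}}}

In a real filesystem this would be hooked into the readahead / read paths and would fall back to smaller orders when the large allocation fails, which is exactly the fragmentation and reclaim question raised above.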
The more memory gets cached using folios, the easier it becomes to address problems and contention with memory when using an IO block size larger than PAGE_SIZE.

=== Using huge pages ===

Using huge pages should help alleviate pressure when doing large allocations for large block sizes, and should also help with TLB pressure.

== OKRs ==

Here is a set of [[https://en.wikipedia.org/wiki/OKR|OKRs]] to help with the folio work. OKRs are used to break tasks down into concrete, tangible components which we can measure for success.

||||||||||||'''OKRs for large block size work''' ||
||No ||Objective ||No ||Key result ||Details ||Volunteers to do this work ||
||<|3>O1 ||<|3>Conversion from struct page to struct folio ||K1 ||ext4 ||Convert ext4 to use struct folio ||TBD ||
||K2 ||btrfs ||Convert btrfs to use struct folio ||TBD ||
||K3 ||f2fs ||Convert f2fs to use struct folio ||TBD ||
||<|2>O2 ||<|2>Page order work ||K1 ||memory compaction ||Memory compaction should be extended to support arbitrary page order ||TBD ||
||K2 ||tmpfs ||tmpfs should be modified to support page orders 0-9 ||TBD ||
||<|4>O3 ||<|4>Make certain structs independent of struct page, as was done with the new struct slab, see [[https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d122019bf061cccc4583eb9ad40bf58c2fe517be|commit d122019bf0]]. For more details see the [[https://kernelnewbies.org/MemoryTypes|MemoryTypes]] page ||K1 ||Page table ||Make a new struct page_table ||TBD ||
||K2 ||Net pool ||Make a new struct net_pool ||TBD ||
||K3 ||[[https://www.kernel.org/doc/html/latest/vm/zsmalloc.html|zsmalloc]] ||zsmalloc is a special 0-order allocator used by zswap. It binds a series of 0-order struct pages together; the linked 0-order pages are referred to as a zspage. This should be converted to folios ||TBD ||
||K4 ||Enhance the semantics of special ZONE_DEVICE pages ||Modify the current special casing which acts when the reference count drops to 1 so that a refcount of 0 is used instead. Review Fujitsu's patches which help with this ||TBD ||
||<|2>O4 ||<|2>Virtualization work ||K1 ||Remove the KVM struct page limitation on IO ||KVM IO relies on struct page; evaluate the work needed to remove this constraint ||TBD ||
||K2 ||qemu 64 KiB block IO ||Revise qemu 64 KiB block size support ||TBD ||
||<|6>O5 ||<|6>Enhance the test plan for folio work ||K1 ||Use 0-day before linux-next ||Get 0-day to test folio development git tree branches before they get merged onto linux-next ||mcgrof, willy ||
||K2 ||Augment test runners to support variable block sizes for the drives ||Add support for 512 byte, 1 KiB, and 64 KiB block sizes for qemu. A 1 KiB block size stresses a special aspect of folios. 512 bytes is the default willy uses today, so we should support it to keep parity with current testing. 64 KiB would be good to have; does qemu support it yet? If not, add support for it ||TBD ||
||K3 ||9p fs support on test runners ||Support using 9p fs to install and test local kernel development changes on guests ||TBD ||
||K4 ||Support a git tracker on test runners ||Add support for test runners so that a git push can trigger running a test plan and email the results ||TBD ||
||K5 ||Augment test suite coverage on test runners ||Extend test runners to support LTP and stress-ng ||TBD ||
||K6 ||Revise the folio test plan ||Document a test plan for the folio work ||TBD ||