Large block sizes
Today the default is to typically use 4KiB block sizes for storage IO and filesystems. In order to be able to leverage storage hardware more efficiently though it would be ideal to increase the block size both for storage IO and filesystems. This documents tries to itemize the efforts required to address this properly in the Linux kernel. These goals are long term, and while some tasks / problems may be rather easy to address, addressing this completely is a large effort which often times will require coordination with different subsystems, in particular memory, IO, or the filesystem folks.
Filesystem support for 64KiB block sizes
Filesystems can support 64KiB by allowing the user to specify the block size when creating the filesystem. For instance this can be accomplished as follows with XFS:
mkfs.xfs -f -b size=65536
In this example 64 KiB block size is used. This would correspond to the kdevops respective XFS section to test xfs_bigblock. But in order to be able to use use and test this the architecture underneath must also support 64KiB page sizes. Example of these are:
- ARM64: Through CONFIG_ARM64_64K_PAGES
- PowerPC: Through CONFIG_PPC_64K_PAGE
The x86_64 architecture still needs work to support this. What a filesystem needs to do to support this is to compute the correct offset in the file to the corresponding block on the disk. This is obfuscated by the get_block_t API. When CoW is enabled though this gets a bit more complicated due to memory pressure on write, as the kernel would need all the corresponding 64 KiB pages in memory. Low memory pressure create problems.
Storage IO supporting 64 KiB block sizes
A world with a 4 KiB PAGE_SIZE is rather simple due to writeback considerations with memory pressure. If your storage device has a larger block size than PAGE_SIZE, the kernel con only send a write once it has in memory all the data required. In this situation reading data means you also would have to wait for all the data to be read from the drive, you could use something like a bit bucket, however that would mean that data would somehow have to be invalidated should a write come through during say a second PAGE_SIZE read on data on a storage block of twice the PAGE_SIZE.
Using folios on the page cache to help
To address the complexity of writeback when the storage supports larger IO block sizes than the PAGE_SIZE folios should be used as a long term solution for the page cache, to opportunistically cache files in larger chunks, however there are some problems that need to be considered and measured to prove / disprove its value for storage. Folios can be used to address the writeback problem by ensuring that the block size for the storage is treated as a single unit in the page cache. A problem with this is the assumption that the kernel can provide the target block IO size for new allocations over time in light of possible memory fragmentation.
An assumption here is that if a filesystem is regularly creating allocations in the block size required for the IO storage device, that the kernel will also then be able to reclaim memory in these block sizes. A consideration here is that perhaps some workloads may end up spiraling down with no allocations being available for the target IO block size. Testing this hypothesis is something which we all in the community could work on. The more memory gets cached using folios the easier it becomes to address problems and contentions with memory using a larger IO block device than PAGE_SIZE.
Using huge pages
Using huge pages should help alleviate pressure when doing large allocations for large block sizes and also help with TBL pressure.