Large block sizes

Today the default is typically to use 4 KiB block sizes for storage IO and filesystems. In order to leverage storage hardware more efficiently, though, it would be ideal to increase the block size both for storage IO and for filesystems. This document tries to itemize the efforts required to address this properly in the Linux kernel. These goals are long term, and while some tasks / problems may be rather easy to address, addressing this completely is a large effort which will often require coordination between different subsystems, in particular the memory, IO, and filesystem folks.

Filesystem support for 64 KiB block sizes

Filesystems can support 64 KiB block sizes by allowing the user to specify the block size when creating the filesystem. For instance, this can be accomplished as follows with XFS:

mkfs.xfs -f -b size=65536 /dev/sdX

In this example a 64 KiB block size is used, with /dev/sdX standing in for the target device. This corresponds to the respective kdevops XFS section used to test xfs_bigblock. But in order to be able to use and test this, the underlying architecture must also support 64 KiB page sizes; ppc64le and aarch64 are examples of architectures that can be built with a 64 KiB page size.

The x86_64 architecture still needs work to support this. What a filesystem needs to do to support this is compute the correct offset in the file for the corresponding block on disk; this is hidden behind the get_block_t API. When CoW is enabled, though, things get a bit more complicated because of memory pressure on write, as the kernel needs all of the corresponding 64 KiB worth of pages in memory. Low-memory situations therefore create problems.
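For reference, a filesystem implements this through a get_block_t callback (declared in include/linux/fs.h), which is handed the inode, a logical block number, a buffer_head to fill in, and a create flag. The core of the offset computation is just a shift by the block size bits. The snippet below is a minimal userspace sketch of that arithmetic only, not kernel code; the name pos_to_iblock is illustrative.

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Minimal sketch of the offset math behind get_block_t: the logical
     * block number handed to the filesystem is the byte offset shifted
     * down by the block size bits (inode->i_blkbits in the kernel).
     */
    static uint64_t pos_to_iblock(uint64_t pos, unsigned int blkbits)
    {
            return pos >> blkbits;
    }

    int main(void)
    {
            /* 64 KiB blocks: blkbits = 16, so bytes 0..65535 are block 0. */
            printf("offset 4096  -> block %llu\n",
                   (unsigned long long)pos_to_iblock(4096, 16));
            printf("offset 70000 -> block %llu\n",
                   (unsigned long long)pos_to_iblock(70000, 16));
            return 0;
    }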

Storage IO supporting 64 KiB block sizes

A world where the storage block size matches the 4 KiB PAGE_SIZE is rather simple as far as writeback under memory pressure is concerned. If your storage device has a larger block size than PAGE_SIZE, the kernel can only send a write once it has all the required data in memory. In this situation reading data also means waiting for the whole block to be read from the drive. You could use something like a bit bucket for the parts you do not need, however that data would then somehow have to be invalidated should a write come through during, say, a second PAGE_SIZE read on a storage block twice the PAGE_SIZE.
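To make the constraint concrete: with a hypothetical 16 KiB device block and a 4 KiB PAGE_SIZE, a write can only be issued once all four backing pages are present and up to date. The sketch below is illustrative userspace C with made-up types (page_stub, can_write_device_block), not kernel code.

    #include <stdbool.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096UL

    /* Stand-in for a cached page; only tracks whether its contents are valid. */
    struct page_stub {
            bool uptodate;
    };

    /*
     * Illustrative only: a write to a device block larger than PAGE_SIZE
     * can only be issued once every page covering that block is in memory
     * and up to date; otherwise the missing parts must be read in first.
     */
    static bool can_write_device_block(const struct page_stub *pages,
                                       size_t dev_block_size)
    {
            size_t npages = dev_block_size / PAGE_SIZE;

            for (size_t i = 0; i < npages; i++) {
                    if (!pages[i].uptodate)
                            return false;   /* must read or fill this page first */
            }
            return true;
    }

    int main(void)
    {
            struct page_stub pages[4] = { { true }, { true }, { false }, { true } };

            /* Not ready: the third page still has to be read in or filled fully. */
            return can_write_device_block(pages, 4 * PAGE_SIZE) ? 0 : 1;
    }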

Using folios in the page cache to help

To address the complexity of writeback when the storage supports IO block sizes larger than PAGE_SIZE, folios should be used as a long-term solution for the page cache, to opportunistically cache files in larger chunks. However, there are some problems that need to be considered and measured to prove or disprove their value for storage. Folios can address the writeback problem by ensuring that the block size of the storage is treated as a single unit in the page cache. A problem with this is the assumption that the kernel can keep providing allocations of the target IO block size over time in light of possible memory fragmentation.
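To make this concrete: treating a storage block as a single unit in the page cache means using folios of at least the matching order; on a system with a 4 KiB PAGE_SIZE a 64 KiB block corresponds to an order-4 folio, and filesystems opt in to large folios for an inode's page cache with mapping_set_large_folios(). The arithmetic is sketched below in plain userspace C; block_size_to_order is an illustrative name, not a kernel API.

    #include <stdio.h>

    #define PAGE_SIZE 4096UL

    /*
     * Illustrative sketch: the folio order needed so that one folio covers
     * an entire storage block, i.e. log2(block_size / PAGE_SIZE).
     */
    static unsigned int block_size_to_order(unsigned long block_size)
    {
            unsigned int order = 0;

            while ((PAGE_SIZE << order) < block_size)
                    order++;
            return order;
    }

    int main(void)
    {
            /* 64 KiB blocks on 4 KiB pages need order-4 (16 page) folios. */
            printf("order for 64 KiB blocks: %u\n", block_size_to_order(65536));
            return 0;
    }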

An assumption here is that if a filesystem is regularly creating allocations in the block size required by the IO storage device, the kernel will also be able to reclaim memory in these block sizes. A concern is that some workloads may end up spiraling down with no allocations available at the target IO block size. Testing this hypothesis is something the whole community could work on. The more memory gets cached using folios, the easier it becomes to address problems and contention with memory when using an IO block size larger than PAGE_SIZE.
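One simple way to observe whether allocations of the target order remain available while testing this hypothesis is to watch /proc/buddyinfo, which lists per-zone free block counts by order. The sketch below reports the order-4 counts (64 KiB with 4 KiB pages); it is only an observation aid, and its output format is illustrative.

    #include <stdio.h>

    /*
     * Minimal sketch: print how many free order-4 blocks (64 KiB with
     * 4 KiB pages) each zone currently has, by parsing /proc/buddyinfo.
     * Each buddyinfo line reads: "Node N, zone NAME c0 c1 c2 ...".
     */
    int main(void)
    {
            FILE *f = fopen("/proc/buddyinfo", "r");
            char node[32], zone[32];
            unsigned long counts[16];

            if (!f) {
                    perror("/proc/buddyinfo");
                    return 1;
            }

            while (fscanf(f, "Node %31[^,], zone %31s", node, zone) == 2) {
                    int order = 0;

                    while (order < 16 && fscanf(f, "%lu", &counts[order]) == 1)
                            order++;
                    if (order > 4)
                            printf("node %s zone %-8s free order-4 blocks: %lu\n",
                                   node, zone, counts[4]);
            }
            fclose(f);
            return 0;
    }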

Using huge pages

Using huge pages should help alleviate pressure when doing large allocations for large block sizes and should also help with TLB pressure.
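One way to experiment with this from userspace is to hint that a large anonymous mapping should be backed by transparent huge pages via madvise(MADV_HUGEPAGE). The sketch below only illustrates that hint; the 64 MiB size is an arbitrary example, and this is not a statement of how the kernel-side work will be done.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
            /* Arbitrary example size: a 64 MiB anonymous mapping. */
            size_t len = 64UL << 20;
            void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (buf == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            /* Ask the kernel to back this region with transparent huge pages. */
            if (madvise(buf, len, MADV_HUGEPAGE))
                    perror("madvise(MADV_HUGEPAGE)");

            memset(buf, 0, len);    /* touch the memory so it gets populated */
            munmap(buf, len);
            return 0;
    }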

OKRs

Here is a set of OKRs to help with the folio work. OKRs are used to break down tasks into concrete, tangible components whose success we can measure.

OKRs for large block size work

O1: Conversion from struct page to struct folio
    K1: ext4 - convert ext4 to use struct folio. Volunteers: TBD
    K2: btrfs - convert btrfs to use struct folio. Volunteers: TBD
    K3: f2fs - convert f2fs to use struct folio. Volunteers: TBD

O2: Page order work
    K1: memory compaction - memory compaction should be extended to support arbitrary page orders. Volunteers: TBD
    K2: tmpfs - tmpfs should be modified to support page orders 0-9. Volunteers: TBD

O3: Make certain structs independent of struct page, as was done with the new struct slab (see commit d122019bf0). For more details see the MemoryTypes page.
    K1: Page table - make a new struct page_table. Volunteers: TBD
    K2: Net pool - make a new struct net_pool. Volunteers: TBD
    K3: zsmalloc. Volunteers: TBD
    K4: Enhance the semantics of special ZONE_DEVICE pages - modify the current special casing of acting on the reference count going to 1 and instead use a refcount of 0. Review Fujitsu's patches which help with this. Volunteers: TBD

O4: Virtualization work
    K1: Remove the KVM struct page limitation on IO - KVM IO work relies on struct page; evaluate work to remove this constraint. Volunteers: TBD
    K2: qemu 64 KiB block IO - revise qemu 64 KiB block support. Volunteers: TBD

O5: Enhance the test plan for folio work
    K1: Use 0-day before linux-next - get 0-day to test folio development git tree branches before they get merged into linux-next. Volunteers: mcgrof, willy
    K2: Augment test runners to support variable block sizes for the drives - add support for 512-byte, 1 KiB, and 64 KiB block sizes with qemu. Using a 1 KiB block size stresses a special aspect of folios. 512 bytes is the default willy uses today, so we should support it to keep parity with current testing. 64 KiB would be good to have; does qemu support it yet? If not, add support for it. Volunteers: TBD
    K3: 9p fs support on test runners - support using 9p fs to install and test local kernel development changes on the guest. Volunteers: TBD
    K4: Support a git tracker on test runners - add support so that a git push can trigger running a test plan and emailing the results. Volunteers: TBD
    K5: Augment test suite coverage on test runners - extend test runners to support LTP and stress-ng. Volunteers: TBD
    K6: Revise the folio test plan - document a test plan for the folio work. Volunteers: TBD
