Large block sizes

Today the default is typically to use 4 KiB block sizes for storage IO and filesystems. In order to leverage storage hardware more efficiently, though, it would be ideal to increase the block size both for storage IO and for filesystems. This document tries to itemize the efforts required to address this properly in the Linux kernel. These goals are long term, and while some tasks / problems may be rather easy to address, addressing this completely is a large effort which will often require coordination between different subsystems, in particular the memory, IO, and filesystem folks.

Filesystem support for 64 KiB block sizes

Filesystems can support 64 KiB block sizes by allowing the user to specify the block size when creating the filesystem. For instance, this can be accomplished as follows with XFS:

mkfs.xfs -f -b size=65536 /dev/sdX

In this example a 64 KiB block size is used, with /dev/sdX standing in for the target device. This corresponds to the respective kdevops XFS section used to test xfs_bigblock. But in order to be able to use and test this, the underlying architecture must also support 64 KiB page sizes; ppc64le and aarch64 are examples of architectures that can be built with a 64 KiB page size.

The x86_64 architecture still needs work to support this. What a filesystem needs to do to support this is compute the correct offset in the file for the corresponding block on disk; this is hidden behind the get_block_t API. When CoW is enabled, though, things get a bit more complicated because of memory pressure on write, as the kernel needs all of the corresponding 64 KiB worth of pages in memory. Low-memory situations therefore create problems.
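For reference, a filesystem implements this through a get_block_t callback (declared in include/linux/fs.h), which is handed the inode, a logical block number, a buffer_head to fill in, and a create flag. The core of the offset computation is just a shift by the block size bits. The snippet below is a minimal userspace sketch of that arithmetic only, not kernel code; the name pos_to_iblock is illustrative.

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Minimal sketch of the offset math behind get_block_t: the logical
     * block number handed to the filesystem is the byte offset shifted
     * down by the block size bits (inode->i_blkbits in the kernel).
     */
    static uint64_t pos_to_iblock(uint64_t pos, unsigned int blkbits)
    {
            return pos >> blkbits;
    }

    int main(void)
    {
            /* 64 KiB blocks: blkbits = 16, so bytes 0..65535 are block 0. */
            printf("offset 4096  -> block %llu\n",
                   (unsigned long long)pos_to_iblock(4096, 16));
            printf("offset 70000 -> block %llu\n",
                   (unsigned long long)pos_to_iblock(70000, 16));
            return 0;
    }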

Storage IO supporting 64 KiB block sizes

A world where the storage block size matches the 4 KiB PAGE_SIZE is rather simple as far as writeback under memory pressure is concerned. If your storage device has a larger block size than PAGE_SIZE, the kernel can only send a write once it has all the required data in memory. In this situation reading data also means waiting for the whole block to be read from the drive. You could use something like a bit bucket for the parts you do not need, however that data would then somehow have to be invalidated should a write come through during, say, a second PAGE_SIZE read on a storage block twice the PAGE_SIZE.
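To make the constraint concrete: with a hypothetical 16 KiB device block and a 4 KiB PAGE_SIZE, a write can only be issued once all four backing pages are present and up to date. The sketch below is illustrative userspace C with made-up types (page_stub, can_write_device_block), not kernel code.

    #include <stdbool.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096UL

    /* Stand-in for a cached page; only tracks whether its contents are valid. */
    struct page_stub {
            bool uptodate;
    };

    /*
     * Illustrative only: a write to a device block larger than PAGE_SIZE
     * can only be issued once every page covering that block is in memory
     * and up to date; otherwise the missing parts must be read in first.
     */
    static bool can_write_device_block(const struct page_stub *pages,
                                       size_t dev_block_size)
    {
            size_t npages = dev_block_size / PAGE_SIZE;

            for (size_t i = 0; i < npages; i++) {
                    if (!pages[i].uptodate)
                            return false;   /* must read or fill this page first */
            }
            return true;
    }

    int main(void)
    {
            struct page_stub pages[4] = { { true }, { true }, { false }, { true } };

            /* Not ready: the third page still has to be read in or filled fully. */
            return can_write_device_block(pages, 4 * PAGE_SIZE) ? 0 : 1;
    }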

Using folios in the page cache to help

To address the complexity of writeback when the storage supports IO block sizes larger than PAGE_SIZE, folios should be used as a long-term solution for the page cache, to opportunistically cache files in larger chunks. However, there are some problems that need to be considered and measured to prove or disprove their value for storage. Folios can address the writeback problem by ensuring that the block size of the storage is treated as a single unit in the page cache. A problem with this is the assumption that the kernel can keep providing allocations of the target IO block size over time in light of possible memory fragmentation.
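To make this concrete: treating a storage block as a single unit in the page cache means using folios of at least the matching order; on a system with a 4 KiB PAGE_SIZE a 64 KiB block corresponds to an order-4 folio, and filesystems opt in to large folios for an inode's page cache with mapping_set_large_folios(). The arithmetic is sketched below in plain userspace C; block_size_to_order is an illustrative name, not a kernel API.

    #include <stdio.h>

    #define PAGE_SIZE 4096UL

    /*
     * Illustrative sketch: the folio order needed so that one folio covers
     * an entire storage block, i.e. log2(block_size / PAGE_SIZE).
     */
    static unsigned int block_size_to_order(unsigned long block_size)
    {
            unsigned int order = 0;

            while ((PAGE_SIZE << order) < block_size)
                    order++;
            return order;
    }

    int main(void)
    {
            /* 64 KiB blocks on 4 KiB pages need order-4 (16 page) folios. */
            printf("order for 64 KiB blocks: %u\n", block_size_to_order(65536));
            return 0;
    }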

An assumption here is that if a filesystem is regularly creating allocations in the block size required by the IO storage device, the kernel will also be able to reclaim memory in these block sizes. A concern is that some workloads may end up spiraling down with no allocations available at the target IO block size. Testing this hypothesis is something the whole community could work on. The more memory gets cached using folios, the easier it becomes to address problems and contention with memory when using an IO block size larger than PAGE_SIZE.
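One simple way to observe whether allocations of the target order remain available while testing this hypothesis is to watch /proc/buddyinfo, which lists per-zone free block counts by order. The sketch below reports the order-4 counts (64 KiB with 4 KiB pages); it is only an observation aid, and its output format is illustrative.

    #include <stdio.h>

    /*
     * Minimal sketch: print how many free order-4 blocks (64 KiB with
     * 4 KiB pages) each zone currently has, by parsing /proc/buddyinfo.
     * Each buddyinfo line reads: "Node N, zone NAME c0 c1 c2 ...".
     */
    int main(void)
    {
            FILE *f = fopen("/proc/buddyinfo", "r");
            char node[32], zone[32];
            unsigned long counts[16];

            if (!f) {
                    perror("/proc/buddyinfo");
                    return 1;
            }

            while (fscanf(f, "Node %31[^,], zone %31s", node, zone) == 2) {
                    int order = 0;

                    while (order < 16 && fscanf(f, "%lu", &counts[order]) == 1)
                            order++;
                    if (order > 4)
                            printf("node %s zone %-8s free order-4 blocks: %lu\n",
                                   node, zone, counts[4]);
            }
            fclose(f);
            return 0;
    }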

Using huge pages

Using huge pages should help alleviate pressure when doing large allocations for large block sizes and should also help with TLB pressure.
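One way to experiment with this from userspace is to hint that a large anonymous mapping should be backed by transparent huge pages via madvise(MADV_HUGEPAGE). The sketch below only illustrates that hint; the 64 MiB size is an arbitrary example, and this is not a statement of how the kernel-side work will be done.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
            /* Arbitrary example size: a 64 MiB anonymous mapping. */
            size_t len = 64UL << 20;
            void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (buf == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            /* Ask the kernel to back this region with transparent huge pages. */
            if (madvise(buf, len, MADV_HUGEPAGE))
                    perror("madvise(MADV_HUGEPAGE)");

            memset(buf, 0, len);    /* touch the memory so it gets populated */
            munmap(buf, len);
            return 0;
    }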

OKRs

Here is a set of OKRs to help with the folio work. OKRs are used to break down tasks into concrete, tangible components whose success we can measure.

OKRs for large block size work

O1: Conversion from struct page to struct folio
    K1: ext4 - convert ext4 to use struct folio. Volunteers: TBD
    K2: btrfs - convert btrfs to use struct folio. Volunteers: TBD
    K3: f2fs - convert f2fs to use struct folio. Volunteers: TBD

O2: Page order work
    K1: memory compaction - memory compaction should be extended to support arbitrary page orders. Volunteers: TBD
    K2: tmpfs - tmpfs should be modified to support page orders 0-9. Volunteers: TBD

O3: Make certain structs independent of struct page, as was done with the new struct slab (see commit d122019bf0). For more details see the MemoryTypes page.
    K1: Page table - make a new struct page_table. Volunteers: TBD
    K2: Net pool - make a new struct net_pool. Volunteers: TBD
    K3: zsmalloc. Volunteers: TBD
    K4: Enhance the semantics of special ZONE_DEVICE pages - modify the current special casing of acting on the reference count going to 1 and instead use a refcount of 0. Review Fujitsu's patches which help with this. Volunteers: TBD

O4: Virtualization work
    K1: Remove the KVM struct page limitation on IO - KVM IO work relies on struct page; evaluate work to remove this constraint. Volunteers: TBD
    K2: qemu 64 KiB block IO - revise qemu 64 KiB block support. Volunteers: TBD

O5: Enhance the test plan for folio work
    K1: Use 0-day before linux-next - get 0-day to test folio development git tree branches before they get merged into linux-next. Volunteers: mcgrof, willy
    K2: Augment test runners to support variable block sizes for the drives - add support for 512-byte, 1 KiB, and 64 KiB block sizes with qemu. Using a 1 KiB block size stresses a special aspect of folios. 512 bytes is the default willy uses today, so we should support it to keep parity with current testing. 64 KiB would be good to have; does qemu support it yet? If not, add support for it. Volunteers: TBD
    K3: 9p fs support on test runners - support using 9p fs to install and test local kernel development changes on the guest. Volunteers: TBD
    K4: Support a git tracker on test runners - add support so that a git push can trigger running a test plan and emailing the results. Volunteers: TBD
    K5: Augment test suite coverage on test runners - extend test runners to support LTP and stress-ng. Volunteers: TBD
    K6: Revise the folio test plan - document a test plan for the folio work. Volunteers: TBD
