Large block sizes (LBS)
Old storage devices used a physical block size of 512 bytes. To help push past theoretical boundaries on areal density with 512 byte sectors, storage devices advanced to "long data sectors": 1k sectors became a reality and soon after that 4k physical drives followed.
Storage technology has continued to advance and we now have support for non-volatile memory storage devices with specialized interfaces such as NVMe. There is R&D which shows how increasing the physical block size further can help, just as the 512 to 4096 jump did. For example, it has been shown how QLC NAND using a 16KiB page size benefits from large sequential IOs. As with the 512 byte to 4096 byte shift, storage drive physical block sizes will likely keep increasing over time due to different technological advancements.
Linux has for years supported larger block sizes on both filesystems and block devices when the PAGE_SIZE is also large. Supporting a blocksize > PAGE_SIZE, or bs > ps for short, has required much more work and is the main focus of this page.
This page tries to itemize and keep track of progress, outstanding issues, and the ongoing efforts to support bs > ps (LBS) on Linux.
Addressing this is a large endeavor requiring coordination with different subsystems: filesystems, memory & IO.
Filesystem and storage LBS support
Linux filesystems have historically had their maximum supported data blocksize limited by the CPU and operating system's PAGE_SIZE. The page size for x86 has been 4k since the i386 days, but Linux has also supported architectures with page sizes greater than 4k for years. Examples are PowerPC with CONFIG_PPC_64K_PAGES and ARM64 with CONFIG_ARM64_64K_PAGES. In fact, some architectures such as ARM64 can be built with different page sizes.
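For reference, the page size a running kernel was built with can be queried from userspace at runtime. A minimal sketch, using only standard POSIX interfaces:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* sysconf(_SC_PAGESIZE) reports the kernel's page size in bytes,
         * e.g. 4096 on most x86 builds, 65536 on 64k page arm64 builds */
        printf("PAGE_SIZE: %ld bytes\n", sysconf(_SC_PAGESIZE));
        return 0;
}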
The XFS filesystem, for example, has supported up to a 64k data blocksize for years on ppc64 systems with CONFIG_PPC_64K_PAGES. Some RAID systems using such configurations are known to have existed and been sold. To create a 64k blocksize filesystem you can use:
mkfs.xfs -f -b size=65536
The above will work today even on x86 with a 4k PAGE_SIZE; however, mounting the resulting filesystem will only work on a system with at least a 64k PAGE_SIZE.
Linux XFS was ported from IRIX's version of XFS, and IRIX had support for larger block sizes than what the CPU supported for its PAGE_SIZE. We refer to this for short as bs > ps. Linux has therefore lacked bs > ps support and adding support for bs > ps should bring XFS up to parity with the features that IRIX XFS used to support.
Supporting bs > ps requires proper operating system support and today we strive to make this easier on Linux through the adoption of folios and IOMAP. The rest of this document focuses on the bs > ps world when considering LBS for filesystems.
Using a large block size on a filesystem does not necessarily mean you need to use a storage device with a large or matching physical block size. Most storage devices today work with a max physical block size of 4k. With only two out-of-tree XFS patches you can help test bs > ps support today. Testing to ensure this works properly is ongoing.
Support for LBS on filesystems and block devices requires different types of efforts. Supporting LBS for filesystems is slightly easier than dealing with storage devices with LBS if you have proper operating system support (folios and IOMAP). For a simple filesystem, generally all the filesystem needs to do is provide the correct offset-in-file to block-on-disk mapping. Filesystems provide this today through either the old buffer-heads or the newer IOMAP.
Storage devices which support larger block sizes require that IO be issued in a physical-blocksize-aligned manner, taking the device's minimum supported IO size into consideration. It also means userspace tools must take these alignment requirements into consideration for optimal IO and correctness. The page cache assists filesystems by dealing with changes in memory, enabling writeback at a lower granularity than what a physical block device requires. This also implies that when bypassing the page cache with direct IO, userspace has to take care of the proper alignment considerations itself when dealing with LBS devices.
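As a rough illustration of those alignment considerations, the sketch below queries a block device's logical and physical block sizes with the BLKSSZGET and BLKPBSZGET ioctls and then issues a direct IO read aligned to the physical block size. The device path is only an example and error handling is minimal.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>           /* BLKSSZGET, BLKPBSZGET */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *dev = argc > 1 ? argv[1] : "/dev/nvme0n1"; /* example device */
        unsigned int pbs = 0;
        int lbs = 0;
        void *buf;

        int fd = open(dev, O_RDONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Logical block size: the minimum alignment direct IO must honor */
        ioctl(fd, BLKSSZGET, &lbs);
        /* Physical block size: align to this to avoid read-modify-write */
        ioctl(fd, BLKPBSZGET, &pbs);
        printf("logical: %d physical: %u\n", lbs, pbs);

        /* Both the buffer address and the IO size are multiples of the
         * physical block size, which also satisfies the logical block
         * size requirement since the logical size divides the physical one. */
        if (posix_memalign(&buf, pbs, pbs)) {
                close(fd);
                return 1;
        }
        if (pread(fd, buf, pbs, 0) < 0)
                perror("pread");

        free(buf);
        close(fd);
        return 0;
}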
Filesystem block device page cache LBS support
Block based storage devices also have an implied block device page cache, the bdev cache. The Linux bdev cache is used to query the storage device's capacity and to provide the block layer access to information about a drive's partitions. A device which has a capacity, or size, greater than 0 by default gets scanned for partitions on device initialization through the respective device_add_disk(). The bdev cache today uses the old buffer-heads to talk to the block device for block data, given the block device page cache is implemented as a simple filesystem where each block device gets its own super_block with just one inode representing the entire disk.
In the future we should be able to deprecate using buffer-heads in favor of IOMAP for the block device page cache as a build option, when no filesystems which require buffer-heads are enabled or when an LBS device is used.
The bdev cache also optionally enables filesystems mounted on block devices to query for metadata by using the backing block device super_block. This takes advantage of the fact that the page cache could already have basic information about its backing device in memory through the initial block partition scanning. Filesystems can query a backing block device super_block through buffer-head calls such as the following (a minimal sketch follows the list):
sb_bread()
sb_bread_unmovable()
sb_breadahead()
sb_getblk()
sb_getblk_gfp()
sb_find_get_block()
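As a minimal, hypothetical sketch of how a simple filesystem might use the bdev cache for its metadata, the function below reads an on-disk superblock through sb_bread(). The structure layout, block number, and names are made up for illustration only.

#include <linux/buffer_head.h>
#include <linux/fs.h>

/* Hypothetical on-disk superblock layout, for illustration only */
struct examplefs_super {
        __le32 magic;
        __le32 blocksize;
};

static int examplefs_read_super(struct super_block *sb)
{
        struct buffer_head *bh;
        struct examplefs_super *es;

        /* Read block 0 of the backing block device through the bdev cache */
        bh = sb_bread(sb, 0);
        if (!bh)
                return -EIO;

        es = (struct examplefs_super *)bh->b_data;
        pr_info("examplefs: magic 0x%x blocksize %u\n",
                le32_to_cpu(es->magic), le32_to_cpu(es->blocksize));

        /* Drop the buffer_head reference once we are done with it */
        brelse(bh);
        return 0;
}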
Not all filesystems use the bdev cache for metadata though. XFS for example has its own metadata address space for caching. The xfs_buf buffer cache does not use the page cache either. It does its own thing, has its own indexing, locking, shrinkers, etc. It also does not use IOMAP at all - IOMAP is used by XFS exclusively for data IO. This means XFS relies on an uncached buffer for its backing device super_block. XFS does this mostly for historic reasons.
Changes have been needed to enhance the bdev cache in order to enable storage block devices with bs > ps. Given the bdev cache today relies on buffer-heads, buffer-heads have been extended to support high order folios. High order folios would be used today when doing partition scanning on LBS storage devices. This is only done if a block driver allows LBS; NVMe today, for example, disables LBS devices.
In the future we expect the bdev cache will use buffer-heads for non-LBS devices, unless the kernel is built without buffer-heads. The bdev cache is expected to only use IOMAP for LBS devices even if buffer-head-dependent filesystems are enabled.
Requirements for LBS
A world where the filesystem blocksize equals a 4 KiB PAGE_SIZE, ie bs = ps = 4k, is rather simple in terms of writeback under memory pressure. If your storage device has a bs > PAGE_SIZE, the kernel can only send a write once it has all the data for a full block in memory. In this situation, reading data also means you would have to wait for all of the block's data to be read from the drive. You could use something like a bit bucket for the parts you do not need, however that would mean that data would somehow have to be invalidated should a write come through during, say, a second PAGE_SIZE read of data on a storage block twice the PAGE_SIZE.
How folios help LBS
To address the complexity of writeback when the storage supports bs > PAGE_SIZE, folios should be used as a long term solution for the page cache to opportunistically cache files in larger chunks. However, there are some problems that need to be considered and measured to prove or disprove its value for storage. Folios can be used to address the writeback problem by ensuring that the block size for the storage is treated as a single unit in the page cache. A problem with this is the assumption that the kernel can keep providing allocations of the target block IO size over time in light of possible memory fragmentation.
A hypothesis to consider here is that if a filesystem is regularly creating allocations in the block size required by the storage device, the kernel will also then be able to reclaim memory in these block sizes. In the worst case, some workloads may end up spiraling down with no allocations being available for the target IO block size. Testing this hypothesis is something which we all in the community could work on. The more memory gets cached using folios, the easier it becomes to address problems and contention with memory when using bs > ps.
Filesystem LBS support
In order to provide support for a larger block size a filesystem needs to provide the correct offset-in-file to block-on-disk mapping. Supporting LBS with reflinks / CoW can be more complicated. For example, under memory pressure, to write back a 64 KiB block the kernel would need all of the corresponding pages in memory. Low memory situations could easily create synchronization issues.
Filesystems provide the offset-in-file to block-on-disk mapping through either the old buffer-heads or the newer IOMAP. Synchronization and dealing with LBS atomically are enabled by using folios.
LBS with buffer-heads
Linux buffer-heads, implemented in fs/buffer.c, provide filesystems with a 512 byte buffer array based block mapping abstraction for dealing with block devices and the page cache. Filesystems supply a get_block_t callback when issuing calls to buffer-heads to request IO to a block device; the get_block_t callback is in charge of providing the filesystem block which buffer-heads will submit_bio() for. Filesystems use the get_block_t callback in buffer-head calls such as the following (a minimal sketch of such a callback follows the list):
block_write_full_page(..., get_block_t *get_block, ...)
__block_write_full_page(..., get_block_t *get_block, ...)
block_read_full_folio(struct folio *, get_block_t *)
block_write_begin(..., get_block_t *get_block, ...)
__block_write_begin(..., get_block_t *get_block, ...)
cont_write_begin(..., get_block_t *, ...)
block_page_mkwrite(..., get_block_t *, ...)
generic_block_bmap(..., get_block_t *)
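Below is a minimal, hypothetical get_block_t callback for an imaginary filesystem that stores file data contiguously starting at a fixed block; a real filesystem would look up its own mapping metadata (extents, indirect blocks, etc.) here. The names and layout are illustrative only.

#include <linux/buffer_head.h>
#include <linux/fs.h>

/* Hypothetical: file data starts at this filesystem block and is laid
 * out contiguously, so the mapping is a simple offset calculation. */
#define EXAMPLEFS_DATA_START    64

static int examplefs_get_block(struct inode *inode, sector_t iblock,
                               struct buffer_head *bh_result, int create)
{
        sector_t disk_block = EXAMPLEFS_DATA_START + iblock;

        /* Tell buffer-heads which on-disk block backs this file offset;
         * the buffer-head helpers then build and submit_bio() for us. */
        map_bh(bh_result, inode->i_sb, disk_block);
        return 0;
}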
Filesystems or the bdev cache can request higher order folios through buffer-heads when the bs > ps.
LBS with IOMAP
The newer IOMAP allows filesystems to provide callbacks in a similar way, but it was instead designed to let filesystems supply simpler block-mapping callbacks dedicated to each type of operation within the filesystem. This allows the filesystem callbacks to be easier to read and dedicated towards each specific need.
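For comparison, a minimal, hypothetical ->iomap_begin() callback for the same imaginary contiguous layout could look roughly like the sketch below. Note that the exact struct iomap_ops callback signature has changed across kernel versions, so treat this as an approximation rather than a copy of any filesystem's code.

#include <linux/iomap.h>

#define EXAMPLEFS_DATA_START    64      /* hypothetical fixed data area */

static int examplefs_iomap_begin(struct inode *inode, loff_t pos,
                                 loff_t length, unsigned int flags,
                                 struct iomap *iomap, struct iomap *srcmap)
{
        unsigned int blkbits = inode->i_blkbits;
        sector_t block = pos >> blkbits;

        /* Describe the whole extent backing [pos, pos + length) in one
         * call; iomap takes care of page cache and bio construction. */
        iomap->addr = (u64)(EXAMPLEFS_DATA_START + block) << blkbits;
        iomap->offset = (loff_t)block << blkbits;
        iomap->length = length;
        iomap->type = IOMAP_MAPPED;
        iomap->bdev = inode->i_sb->s_bdev;
        return 0;
}

static const struct iomap_ops examplefs_iomap_ops = {
        .iomap_begin = examplefs_iomap_begin,
};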
LBS with storage devices
Storage devices announce their physical block sizes to Linux through the driver setting the physical block size with a call to blk_queue_physical_block_size(). To help with stacking IO devices and to ensure backward compatibility with older userspace, a driver can announce a smaller logical block size with blk_queue_logical_block_size().
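A minimal sketch of what a hypothetical block driver could do while setting up its request queue is shown below; the sizes and names are illustrative only.

#include <linux/blkdev.h>

/* Hypothetical device limits, for illustration only */
#define EXAMPLE_PHYS_BS         16384   /* drive's physical block size */
#define EXAMPLE_LOGICAL_BS      4096    /* smaller addressable unit exposed */

static void example_set_queue_limits(struct request_queue *q)
{
        /* Smallest IO the device can do without a read-modify-write */
        blk_queue_physical_block_size(q, EXAMPLE_PHYS_BS);
        /* Smallest addressable unit; kept smaller to ease backward
         * compatibility and stacking drivers (RAID) on top */
        blk_queue_logical_block_size(q, EXAMPLE_LOGICAL_BS);
}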
LBS with NVMe
NVMe sets the logical and physical block size in nvme_update_disk_info(). The logical block size is set to the NVMe blocksize. The physical block size is set to the minimum of the NVMe blocksize and the drive's announced atomic write size (nawupf or awupf).
Today NVMe drives with an LBA format with a blocksize greater than the PAGE_SIZE are effectively disabled by assigning a disk capacity of 0 using set_capacity_and_notify(disk, 0).
Enabling support for NVMe drives where the storage bs > ps requires some effort which is currently being worked on.
NVMe LBA format
NVMe drives support different data block sizes through the different LBA formats they support. To query which LBA formats your drive supports you can use something like the following with nvme-cli.
nvme id-ns -H /dev/nvme4n1 | grep ^LBA
LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format 1 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)
This example NVMe drive supports two different LBA formats, one where the data block size is 512 bytes and another where it is 4096 bytes. To format a drive to a given LBA format one can use nvme format. For example, to use the above 4k LBA format we'd use:
nvme format --lbaf=1 --force /dev/nvme4n1
LBS Storage alignment considerations
Reads and writes to storage drives need to be aligned to the storage medium's physical block size for performance and correctness. To help with this, Linux started exporting block device and partition IO topologies to userspace with Linux commit c72758f33784 ("block: Export I/O topology for block devices and partitions") in May 2009. This enables Linux tools (parted, lvm, mkfs.*, etc) to optimize placement of and access to data. Proper alignment is especially important for direct IO since the page cache is not there to assist with lower granularity reads, writes, or writeback; all direct IO must be aligned to at least the logical block size, otherwise the IO will fail.
The logical block size then is a Linux software construct to help ensure backward compatibility with older userspace and to also support stacking IO devices together (RAID).
Checking physical and logical block sizes
Below is an example of how to check for the physical and logical block sizes of an NVMe drive.
cat /sys/block/nvme0n1/queue/logical_block_size
512
cat /sys/block/nvme0n1/queue/physical_block_size
4096
4kn alignment example
Drives which support only 4k block sizes are referred to as using the Advanced Format 4kn; these are devices where the physical block size and logical block size are both 4k. For these devices it is important that applications perform direct IO in multiples of 4k, which ensures it is aligned to 4k. Applications that perform 512 byte aligned IO on 4kn drives will break.
Alignment considerations beyond 4kn
In light of prior experience with increasing physical block sizes on storage devices, and since modern userspace *should* be reading the physical and logical block sizes prior to doing IO, *in theory* increasing storage device physical block sizes should not be an issue and should not require many, if any, changes.
Relevant block layer historic Linux commits
e1defc4ff0cf ("block: Do away with the notion of hardsect_size") - so the sysfs hw_sector_size is just the old name
ae03bf639a50 ("block: Use accessor functions for queue limits")
cd43e26f0715 ("block: Expose stacked device queues in sysfs")
025146e13b63 ("block: Move queue limits to an embedded struct")
c72758f33784 ("block: Export I/O topology for block devices and partitions")
It is worth quoting in full the commit which added the logical block size to userspace:
block: Export I/O topology for block devices and partitions

To support devices with physical block sizes bigger than 512 bytes we need to ensure proper alignment. This patch adds support for exposing I/O topology characteristics as devices are stacked.

logical_block_size is the smallest unit the device can address.

physical_block_size indicates the smallest I/O the device can write without incurring a read-modify-write penalty.

The io_min parameter is the smallest preferred I/O size reported by the device. In many cases this is the same as the physical block size. However, the io_min parameter can be scaled up when stacking (RAID5 chunk size > physical block size).

The io_opt characteristic indicates the optimal I/O size reported by the device. This is usually the stripe width for arrays.

The alignment_offset parameter indicates the number of bytes the start of the device/partition is offset from the device's natural alignment. Partition tools and MD/DM utilities can use this to pad their offsets so filesystems start on proper boundaries.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Experimenting with LBS
As of April 28, 2023, experimenting with LBS support requires out-of-tree patches to Linux. A few block drivers and filesystems have been modified to help experiment and enable the community to work with LBS.
large-block-next
Since a lot of this is still work in progress, much of this code is not yet merged upstream into Linux; some of it is not yet even on linux-next. And so a git tree has been put together to help developers wishing to test the latest and greatest changes to support and test LBS.
This tree will rebase sporadically:
These branches will not be changed and are intended to remain static:
LBS on qemu
Qemu has been used with the NVMe driver to experiment with LBS. You can experiment with LBS by just using a larger physical and logical block size than your PAGE_SIZE. Qemu does not allow different physical and logical block sizes.
LBS on brd
To create a 64k LBS RAM block device use:
modprobe brd rd_nr=1 rd_size=1024000 rd_blksize=65536 rd_logical_blksize=65536
LBS on shmem
To create a 64k LBS tmpfs filesystem:
mkdir /data-tmpfs/
mount -t tmpfs -o size=10M,bsize=$((4096*16)) -o noswap tmpfs /data-tmpfs/
LBS on kdevops
See the kdevops LBS R&D page for how to use kdevops to quickly get started on experimenting with any of the mentioned components above.
Testing with LBS
kdevops has been extended to support LBS filesystem profiles on XFS. We need to run tests for XFS with LBS and also run tests with blktests.
LBS resources
This is a summary, mostly by Dave Chinner, of why Nick Piggin's and Christoph Lameter's earlier strategies had failed:
- it used high order compound pages in the page cache so that nothing needed to change on the filesystem level in order to support this
- Chinner supported revisiting this strategy as we use and support compound pages all over now and also support compaction, and compaction helps with ensuring we can get higher order pages
- There is still concern for fragmentation if you start doing a lot of only high order page allocations
- Chinner still wondered, if Lameter's approach of modifying only the page cache is used and a page fault means tracking a page to its own pte, why do we need contiguous pages? Can't we use discontiguous pages?
- Nick Piggin's fsblock strategy rewrites buffer-head logic to support filesystem blocks larger than a page size, while leaving the page cache untouched.
- A downside to Nick's strategy is that all filesystems would need to be rewritten to use fsblock instead of buffer-heads
Dave Chinner's 2018 effort to support block size > PAGE_SIZE
2021 description of iomap as a page cache abstraction, fs/iomap
- already provides filesystems with a complete, efficient page cache abstraction that only requires filesystems to provide block mapping services. Filesystems using iomap do not interact with the page cache at all. And David Howells is working with Willy and all the network fs devs to build an equivalent generic netfs page cache abstraction based on folios that is supported by the major netfs client implementations in the kernel.
2021 description of fs/xfs/xfs_buf.c, clarifying the folio design vs an opaque object strategy; fs/xfs/xfs_buf.c is
- an example of a high performance handle based, variable object size cache that abstracts away the details of the data store being allocated from slab, discontiguous pages, contiguous pages or vmapped memory. It is basically a two decade old re-implementation of the Irix low layer global disk-addressed buffer cache, modernised and tailored directly to the needs of XFS metadata caching.
2021 Folio-enabling the page cache - enables filesystems to be converted to folios
2022 LSFMM coverage on A memory folio update
buffer-heads --> iomap, why? Because iomap was designed with larger block size support in mind
- only readahead uses folios today
- all current filesystems' write paths use base pages, and it has not been clear when large folios should be used
- some filesystems want features like range-locking but this gets complex as you need to take a page's lock before a filesystem lock
- page reclaim can be problematic for filesystems
- Wilcox suggested most filesystems already do a good job with writeback and so the kernel's reclaim mechanism may not be needed anymore
- afs already removed the writepage() callback
- XFS hasn't had a writepage() callback since last summer
- hnaz suggests that the writepage() call is still there on paper, but it's been neutered by conditionals that rarely trigger in practice. It's also only there for the global case, never called for cgroup reclaim. Cgroup-aware flushers are conceivable, but in practice the global flushers and per-cgroup dirty throttling have been working well.
- folios could increase write amplification as when a page is dirty the entire folio needs to be written
April 2023 Christoph's patches on enabling Linux without buffer-heads
Divide & Conquer work - OKRs for LBS
Here is a set of OKRs to help with the folio & LBS work. OKRs are used to help divide & conquer tasks into concrete, tangible components we need to complete proper LBS support upstream. Since a lot of this is editing tables, this is maintained on a Google sheet.
Using huge pages for LBS
Using huge pages could help alleviate pressure when doing large allocations for bs > ps and also help with TLB pressure. This is an area of R&D being evaluated.