Revision 7 as of 2025-05-02 22:51:14
 * [[https://web.git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/commit/?h=20250428-pwrite-debug&id=6060ec8de0b0cbba99813d1ae97d8d5f201850e7|nvme: add awun / nawun sanity check]]
 * [[https://web.git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/commit/?h=20250428-pwrite-debug&id=42bbcb07bfd0871e316974b3be276b7198be4b2e|nvme: add nvme_core.debug_large_atomics to force high awun as phys_bs]]
NFS atomics
Hardware atomic writes are supported on SCSI and NVMe. SCSI requires a special atomic write command, depending on the atomic write size. NVMe has no special atomic write command. Atomicity is implicit: as long as the atomic write size and alignment requirements are met, the write will be atomic. The Linux kernel RWF_ATOMIC API lets the kernel vet the alignment requirements and tells the block layer not to merge IOs that are deemed atomic. If userspace uses RWF_ATOMIC, the kernel will fail any write that does not meet the hardware atomic requirements.
While the block layer, SCSI and NVMe can profit from large atomics, we are also interested in evaluating support for large atomics through NFS. This page is dedicated to evaluating what would be needed to support large atomics on the NFS protocol side, both to help extend standards and to enable PoCs.
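As a minimal sketch of what the kernel vets for an RWF_ATOMIC write: the total length must be a power of two within the advertised atomic unit limits, and the offset must be naturally aligned to that length. The helper names and the `unit_min` / `unit_max` parameters below are hypothetical; in practice the limits would come from statx. The `0x40` fallback is the RWF_ATOMIC value from the Linux uapi headers, used in case the running Python does not export the constant.

```python
import os

# RWF_ATOMIC as defined in the Linux uapi <linux/fs.h>. Fall back to the raw
# value when this Python build does not export os.RWF_ATOMIC.
RWF_ATOMIC = getattr(os, "RWF_ATOMIC", 0x40)


def is_valid_atomic_write(offset, length, unit_min, unit_max):
    """Return True if (offset, length) meets the RWF_ATOMIC constraints."""
    if length <= 0 or length & (length - 1):
        return False  # length must be a power of two
    if not unit_min <= length <= unit_max:
        return False  # length must be within the advertised limits
    return offset % length == 0  # offset naturally aligned to the length


def atomic_pwrite(fd, data, offset, unit_min, unit_max):
    """Hypothetical helper: check the constraints, then write with RWF_ATOMIC.

    The kernel performs its own checks and fails the write if they are not
    met; the userspace check only gives an earlier, clearer error.
    """
    if not is_valid_atomic_write(offset, len(data), unit_min, unit_max):
        raise ValueError("write does not meet atomic size/alignment rules")
    return os.pwritev(fd, [data], offset, RWF_ATOMIC)
```

For example, with limits of 4 KiB to 16 KiB, a 16 KiB write at offset 16384 passes the check, while the same write at offset 4096 does not.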
Standards involved
NVMe 1.2b Section 6.4 covers Atomic Operations:
- AWUN / NAWUN - control atomicity in relation to other commands
- AWUPF / NAWUPF - atomic power fail
- NABSN / NABSPF - atomic boundary
- SCSI sbcr22 Section 6.6.4
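To relate these fields to byte sizes, here is a small sketch. It assumes AWUN and AWUPF are reported as 0's based counts of logical blocks, as in the NVMe Identify data, and that the kernel exposes its resulting limits through the block queue sysfs attributes named below. The function names and the `sysfs_root` parameter are hypothetical and only for illustration.

```python
from pathlib import Path


def zeros_based_blocks_to_bytes(field_value, lba_size):
    """Convert a 0's based logical block count (e.g. AWUN) to bytes."""
    return (field_value + 1) * lba_size


def read_atomic_limits(dev, sysfs_root="/sys/block"):
    """Read the atomic write limits the kernel exposes for a block device.

    Attributes that are missing (older kernels) are simply skipped.
    """
    queue = Path(sysfs_root) / dev / "queue"
    names = (
        "atomic_write_unit_min_bytes",
        "atomic_write_unit_max_bytes",
        "atomic_write_max_bytes",
        "atomic_write_boundary_bytes",
    )
    limits = {}
    for name in names:
        path = queue / name
        if path.exists():
            limits[name] = int(path.read_text())
    return limits
```

With a 512 byte LBA format an AWUN of 31 corresponds to 16 KiB, and with a 4 KiB LBA format an AWUN of 3 gives the same 16 KiB.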
Empirical findings
Evaluation was done by graphing the TPS variability of MySQL with direct IO and PostgreSQL with buffered IO while leveraging atomics, setting innodb_doublewrite=0 for MySQL or full_page_writes=off for PostgreSQL, across different filesystem configurations on AWS i4i.4xlarge instances.
Findings:
- For Direct IO we see 3x-4x TPS variability reduction with MySQL
- For Buffered IO we observe 14x-18x TPS variability reduction with PostgreSQL
The Linux kernel supports RWF_ATOMIC only for direct IO today. Discussions are ongoing since LSFMM on a roadmap for buffered IO support, which requires coordinating with the PostgreSQL community.
Enablement of NFS large atomic PoCs
Userspace software does require changes to leverage the RWF_ATOMIC Linux API. However, RWF_ATOMIC is not needed to leverage atomics support on NVMe, as NVMe deals with atomics implicitly, so long as atomic alignment and write size requirements are met.
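A minimal sketch of the implicit route, without RWF_ATOMIC: issue direct IO writes of exactly one 16 KiB block at a 16 KiB aligned offset. The helper names and the fixed `BLOCK` size are assumptions for illustration. Whether such a write is actually atomic still depends on the device limits and on the filesystem not splitting the IO.

```python
import mmap
import os

BLOCK = 16 * 1024  # assumed atomic write size for this sketch


def aligned_buffer(payload, block=BLOCK):
    """Return a page-aligned, zero-padded buffer of exactly one block."""
    if len(payload) > block:
        raise ValueError("payload larger than one block")
    buf = mmap.mmap(-1, block)  # anonymous mapping: page aligned, zero filled
    buf[:len(payload)] = payload
    return buf


def write_block_direct(path, block_index, payload, block=BLOCK):
    """Write one full block with O_DIRECT at a block-aligned offset."""
    buf = aligned_buffer(payload, block)
    fd = os.open(path, os.O_WRONLY | os.O_DIRECT)
    try:
        # The offset is a multiple of the block size and the length is
        # exactly one block, which is what the implicit NVMe rules require.
        return os.pwrite(fd, buf, block_index * block)
    finally:
        os.close(fd)
        buf.close()
```

An anonymous mmap is used because O_DIRECT needs an aligned memory buffer, and a page-aligned mapping satisfies that on typical setups.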
Picking an NFS backend filesystem and verification
IO introspection can be done with blkalgn in userspace. PoCs can be done without RWF_ATOMIC if you can vet that all writes are 16 KiB aligned and only 16 KiB writes are issued. However, there are already filesystems which support RWF_ATOMIC, or for which work is underway. Below we provide a lay of the land of support for large atomics on Linux filesystems.
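The vetting step can be sketched as a check over a list of observed writes. The `(offset, length)` tuple format is an assumption for illustration, not the output format of any tracing tool.

```python
ATOMIC_SIZE = 16 * 1024


def summarize_writes(writes, size=ATOMIC_SIZE):
    """Split observed (offset, length) writes into conforming and violating.

    A write conforms only if it is exactly `size` bytes long and its offset
    is a multiple of `size`.
    """
    conforming, violating = [], []
    for offset, length in writes:
        if length == size and offset % size == 0:
            conforming.append((offset, length))
        else:
            violating.append((offset, length))
    return conforming, violating
```

A PoC without RWF_ATOMIC is only meaningful if the violating list stays empty for the whole workload.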
XFS atomics support
XFS supports RWF_ATOMIC as of v6.13 through LBS even on x86_64.
Furthermore, as of v6.15 XFS provides IO alignment determinism for all writes by supporting a large sector size. With this feature, the XFS sector size can also be set to 16 KiB, which forces metadata writes to be aligned to 16 KiB as well. Aligning metadata writes enables the filesystem to leverage future filesystem enhancements for writing 16 KiB atomically. It can also give userspace peace of mind that at least all filesystem writes will be 16 KiB aligned.
It is known that with XFS LBS support on 16 KiB block sizes, if you use fewer than 10 threads you will see a performance regression versus XFS with 4 KiB block sizes. This has been root caused to the redo log issuing 512-byte writes followed by a sync, and was also covered at LSFMM 2025. You can fix this by placing the redo log in a different directory on another partition, but that is not a suitable solution if you want to take snapshots from the same filesystem. Note that MySQL would work with direct IO, while the redo log writes happen with buffered IO. Could NFS help coalesce these small buffered IO writes?
ext4 with bigalloc 16 KiB cluster sizes
While ext4 did get support for RWF_ATOMIC in v6.13, you need a 16 KiB PAGE_SIZE system. This is likely useful for ARM64 and PowerPC systems.
ext4 with bigalloc and 16 KiB clusters can be used on x86_64. However, ext4 does not yet support 16 KiB metadata writes; to enable that, ext4 will need support for LBS. If aligning metadata writes to the atomic boundary is not a requirement, ext4 with 16 KiB cluster sizes can be leveraged. RWF_ATOMIC is not yet supported upstream on ext4 on x86_64 for 16 KiB atomics. As discussed at LSFMM 2025, work is underway for this by Ojaswin Mujoo, and a patchset was recently posted by Ritesh Harjani as RFCs.
Extending automation tools for an NFS PoC
The sysbench support in kdevops can easily be extended to support NFS once we find what we need to change on the kernel side.
Faking HW atomics
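If you have NVMe hardware with a large AWUN, do not care about power failure, and want to PoC large atomics, you can do so with an nvme_core module parameter maintained outside of the Linux kernel tree. You need two patches, based on v6.15: an awun / nawun sanity check, and the nvme_core.debug_large_atomics parameter, which forces a high awun as phys_bs. Both are on the 20250428-pwrite-debug branch of mcgrof/linux.git.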
NFS atomics PoC
How do we do a quick NFS PoC for large atomics? Do we evaluate both direct IO and buffered IO? What standards should be extended through IETF?
A few notes:
- NFSv3 cannot be extended anymore, so any standards update would have to go through NFSv4.
- NFSv2 ensures all WRITE operations are effectively FILE_SYNC.
- If we wanted to evaluate being strict, a sync mount flag would require all WRITEs to be either FILE_SYNC, or UNSTABLE followed immediately by a COMMIT. The latter can help ensure the write happens within the same boot cycle of the server.
- If we needed to constrain the max write size on the client we could leverage the mount flag wsize=16384.
- Today's Linux max wsize is 1 MiB
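To illustrate the wsize arithmetic, here is a deliberately simplified model of how a client-side write maps onto WRITE requests capped at wsize. It is a sketch of the size constraint only, not of the Linux NFS client's actual request coalescing, and the function name is hypothetical.

```python
WSIZE_16K = 16 * 1024  # 16384 bytes, i.e. the mount option wsize=16384


def split_into_write_rpcs(offset, length, wsize):
    """Model a write as a list of (offset, length) requests of at most wsize."""
    chunks = []
    end = offset + length
    while offset < end:
        n = min(wsize, end - offset)
        chunks.append((offset, n))
        offset += n
    return chunks
```

In this model a 16 KiB write with wsize=16384 maps to a single request, while a smaller wsize such as 4096 would split it into four.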
NFSv3 FSINFO (RFC 1813) and FSFEAT (RFC 1813) carry filesystem block size information, which we could extend in the future to report back the statx atomic fields. For details, refer to this example statx query program for atomic-related information.
- NFS allows Direct IO writes aligned to any boundary, this may require changing