Summary: This release adds: the CAKE network queue management algorithm to fight bufferbloat, designed to squeeze the most bandwidth and the lowest latency out of even the slowest ISP links and routers; support for guaranteeing minimum I/O latency targets for cgroups; experimental support for the future Wi-Fi 6 (802.11ax drafts); improved memory usage for overlayfs users; an experimental EROFS file system optimized for read-only use; a new asynchronous I/O polling interface; support for avoiding unintentional writes to attacker-controlled FIFOs or regular files in world-writable sticky directories; support for an Intel feature that locks part of the CPU cache for an application; and many new drivers and other improvements.
- Better networking experience with the CAKE queue management algorithm
- Block I/O latency controller
- Preliminary Wi-Fi 6 (802.11ax) support
- New asynchronous I/O polling interface
- Overlayfs memory usage improvements
- New experimental EROFS file system
- Better protection in sticky directories (eg. /tmp)
- Intel Cache Pseudo-locking
- Even more fixes for CPU security bugs
- Core (various)
- File systems
1. Coolest features
1.1. Better networking experience with the CAKE queue management algorithm
This release includes a new queuing discipline for the network packet scheduler: Common Applications Kept Enhanced (CAKE). It is designed to replace and improve upon the complex hierarchy of simple queuing disciplines presently required to effectively tackle the bufferbloat problem at the network edge.
CAKE targets the home router use case and is intended to squeeze the most bandwidth and the lowest latency out of even the slowest ISP links and routers, while presenting an API simple enough that even an ISP can configure it.
Recommended LWN article: Let them run CAKE
Project page: https://www.bufferbloat.net/projects/codel/wiki/Cake/
Technical information: https://www.bufferbloat.net/projects/codel/wiki/CakeTechnical/
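To give a feel for how small the configuration surface is, here is a minimal sketch that builds (but does not run) a tc command installing CAKE as the root qdisc with a bandwidth limit. The interface name eth0 and the 20 Mbit/s figure are placeholders, the helper name cake_tc_command is hypothetical, and the command assumes an iproute2 build with CAKE support.

```python
def cake_tc_command(device, bandwidth_mbit):
    """Build the argument list for a tc command that makes CAKE the root qdisc.

    The command is only constructed here, never executed; running it
    requires root privileges and an iproute2 that knows about CAKE.
    """
    if bandwidth_mbit <= 0:
        raise ValueError("bandwidth must be positive")
    return [
        "tc", "qdisc", "replace", "dev", device, "root",
        "cake", "bandwidth", f"{bandwidth_mbit}mbit",
    ]


if __name__ == "__main__":
    # Placeholder interface and rate; adjust to the real uplink.
    print(" ".join(cake_tc_command("eth0", 20)))
```

The bandwidth is usually set slightly below the real link rate so that queuing happens in CAKE rather than in the modem.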
1.2. Block I/O latency controller
This release adds a new controller that attempts to guarantee minimum I/O latency targets for cgroups. As long as every group is meeting its latency target the controller doesn't do anything, but once a group starts missing its target, the controller will attempt to keep average I/O latencies below the configured target by throttling any group with a higher latency target than the victimized group. Latency targets need to be set in the new io.latency cgroup file, but experimentation is needed to determine suitable targets for a given hardware configuration. For more details see the documentation.
Recommended LWN article: The block I/O latency controller
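As a rough illustration of configuring a target, the sketch below writes one line to a cgroup's io.latency file. It assumes the "MAJOR:MINOR target=<microseconds>" format described in the cgroup v2 documentation; the helper names and the cgroup path in the usage line are hypothetical, and the io controller is assumed to be already enabled for the cgroup.

```python
import os


def io_latency_line(major, minor, target_us):
    """Format one io.latency entry: device numbers plus a target in microseconds."""
    if target_us <= 0:
        raise ValueError("target must be a positive number of microseconds")
    return f"{major}:{minor} target={target_us}"


def set_io_latency(cgroup_dir, major, minor, target_us):
    """Write the latency target for one block device into <cgroup_dir>/io.latency."""
    path = os.path.join(cgroup_dir, "io.latency")
    with open(path, "w") as f:
        f.write(io_latency_line(major, minor, target_us) + "\n")
    return path


if __name__ == "__main__":
    # Hypothetical values: device 8:0 and a 10 ms target.
    print(io_latency_line(8, 0, 10000))
```

On a real system the call would look like set_io_latency("/sys/fs/cgroup/mygroup", 8, 0, 10000), with the group name and device numbers replaced by actual ones.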
1.3. Preliminary Wi-Fi 6 (802.11ax) support
1.4. New asynchronous I/O polling interface
After being merged and reverted in 4.18, this feature adds a simple one-shot poll through the io_submit(2) interface to poll for the readiness of file descriptors using the aio subsystem. It allows aio poll to work without any additional context switches, unlike epoll. To poll for a file descriptor, the application should submit an iocb of type IOCB_CMD_POLL. This release also adds an io_pgetevents(2) system call, which is the io_getevents(2) equivalent of ppoll(2)/pselect(2) and makes it possible to properly mix signals and aio completions (especially with IOCB_CMD_POLL). The API is based on patches that existed in RHAS2.1 and RHEL3, which means it is already supported by libaio.
Recommended LWN article: A new kernel polling interface
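To make the "iocb of type IOCB_CMD_POLL" step more concrete, the following sketch mirrors struct iocb from <linux/aio_abi.h> with ctypes and fills in a poll request. It only builds the request and does not call io_submit(2). The field order assumes a little-endian machine, the requested poll events are assumed to go in aio_buf with the size and offset fields left at zero, and the names Iocb and make_poll_iocb are hypothetical.

```python
import ctypes
import select

IOCB_CMD_POLL = 5  # opcode value as defined in <linux/aio_abi.h>


class Iocb(ctypes.Structure):
    """ctypes mirror of struct iocb (little-endian field order assumed)."""
    _fields_ = [
        ("aio_data", ctypes.c_uint64),        # returned untouched in the completion event
        ("aio_key", ctypes.c_uint32),
        ("aio_rw_flags", ctypes.c_int32),
        ("aio_lio_opcode", ctypes.c_uint16),  # request type
        ("aio_reqprio", ctypes.c_int16),
        ("aio_fildes", ctypes.c_uint32),      # file descriptor to operate on
        ("aio_buf", ctypes.c_uint64),         # for poll: requested event mask
        ("aio_nbytes", ctypes.c_uint64),
        ("aio_offset", ctypes.c_int64),
        ("aio_reserved2", ctypes.c_uint64),
        ("aio_flags", ctypes.c_uint32),
        ("aio_resfd", ctypes.c_uint32),
    ]


def make_poll_iocb(fd, events=select.POLLIN, user_data=0):
    """Build a one-shot poll request for fd; unused fields stay zero."""
    iocb = Iocb()  # ctypes zero-initialises every field
    iocb.aio_data = user_data
    iocb.aio_lio_opcode = IOCB_CMD_POLL
    iocb.aio_fildes = fd
    iocb.aio_buf = events
    return iocb


if __name__ == "__main__":
    req = make_poll_iocb(0)
    print(ctypes.sizeof(req), req.aio_lio_opcode)
```

A real program would pass a pointer to this structure to io_submit(2) and then collect the completion with io_getevents(2) or io_pgetevents(2); in C, libaio wraps those steps.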
1.5. Overlayfs memory usage improvements
When users of overlayfs (eg. containers) change metadata on a file, overlayfs makes a copy of the entire file's cache for the upper layer. This means that some actions, eg. doing chown() on a whole image directory tree, will increase memory usage considerably. This release makes it possible to delay the copy-up of data: when a file is on the lower layer and only metadata is modified (except size), the kernel will only copy up the metadata and continue to use the data from the lower file until the file is opened for writing. Following the previous example, doing chown() on a whole image directory tree won't trigger a copy of the files' data, and containers will continue sharing the page cache. For instructions on how to turn on this feature, see the documentation.
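As a hedged sketch of what turning the feature on can look like, the helper below builds (without running) a mount command for an overlay with metadata-only copy-up requested. It assumes the metacopy=on mount option described in the overlayfs documentation; all directory paths and the helper name are hypothetical.

```python
def overlay_mount_command(lower, upper, work, merged, metacopy=True):
    """Build the argument list for mounting an overlay; nothing is executed."""
    options = [f"lowerdir={lower}", f"upperdir={upper}", f"workdir={work}"]
    if metacopy:
        # Request metadata-only copy-up (assumed option name: metacopy=on).
        options.append("metacopy=on")
    return ["mount", "-t", "overlay", "overlay", "-o", ",".join(options), merged]


if __name__ == "__main__":
    # Placeholder paths for illustration only.
    print(" ".join(overlay_mount_command("/lower", "/upper", "/work", "/merged")))
```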
This release also properly implements regular file operations for overlayfs, removing several hacks and allowing proper interaction of read-only open files with copy-up, the possibility of properly implementing fs-modifying ioctls, and more. Overlayfs can now act as a POSIX-compliant filesystem with some features turned on; for more details see the documentation.
1.6. New experimental EROFS file system
The new EROFS file system has been added in this release. It is an experimental project, under the staging directory, and changes to the on-disk layout are still expected. EROFS stands for Enhanced Read-Only File System, and it is a lightweight read-only file system with a modern design (eg. page-sized blocks, inline xattrs/data, etc.) for scenarios that need high-performance read-only access, eg. firmware in mobile phones or live CDs. It also provides VLE compression support, focusing on random read improvements while keeping relatively lower compression ratios, which is useful for high-performance devices with limited memory and ROM space.
1.7. Better protection in sticky directories (eg. /tmp)
This release tries to avoid unintentional writes to attacker-controlled FIFOs or regular files by disallowing the opening of FIFOs or regular files not owned by the user in world-writable sticky directories, unless the owner is the same as that of the directory or the file is opened without the O_CREAT flag. The purpose is to make data spoofing attacks harder. This protection can be turned on and off separately for FIFOs (protected_fifos) and regular files (protected_regular) via sysctl, just like the already existing symlinks/hardlinks protection.
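The rule described above can be restated as a small decision function. This is a simplified model of the prose, not the kernel's code: the function and parameter names are hypothetical, and a single boolean stands in for the protected_fifos / protected_regular sysctls.

```python
def open_is_refused(*, file_is_fifo_or_regular, dir_is_world_writable_sticky,
                    file_uid, dir_uid, opener_uid, o_creat,
                    protection_enabled=True):
    """Return True if the modelled protection would refuse the open."""
    if not protection_enabled:
        return False            # sysctl switched off
    if not file_is_fifo_or_regular:
        return False            # only FIFOs and regular files are covered
    if not dir_is_world_writable_sticky:
        return False            # only directories like /tmp are covered
    if not o_creat:
        return False            # opens without O_CREAT are unaffected
    if file_uid == opener_uid:
        return False            # the user owns the file
    if file_uid == dir_uid:
        return False            # the file belongs to the directory owner
    return True


if __name__ == "__main__":
    # A file planted by uid 1001 in a root-owned sticky dir, opened by uid 1000.
    print(open_is_refused(file_is_fifo_or_regular=True,
                          dir_is_world_writable_sticky=True,
                          file_uid=1001, dir_uid=0, opener_uid=1000,
                          o_creat=True))
```

The case in the usage example is the one the protection targets: a program running as uid 1000 opens a predictable name with O_CREAT and would otherwise silently write into a file another user prepared.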
1.8. Intel Cache Pseudo-locking
This release adds support for an Intel-specific CPU feature. It allows a user to specify the amount of CPU cache space that an application can fill, then isolates that region of the CPU cache and 'locks' it. From that point on, the region will only serve cache hits. The cache pseudo-locked memory is made accessible to user space, where an application can map it into its virtual address space and thus have a region of memory with reduced average read latency. The locking is not perfect and gets totally screwed by WBINVD and similar mechanisms, but it provides a reasonable enhancement for certain types of latency-sensitive applications.
1.9. Even more fixes for CPU security bugs
This release includes the usual round of patches to deal with the new and exciting CPU security bugs:
- x86: Add protection against L1TF, aka L1 Terminal Fault, yet another speculative hardware engineering trainwreck. It's a hardware vulnerability which allows unprivileged speculative access to data which is available in the Level 1 Data Cache. For more details, read this LWN article: Meltdown strikes back: the L1 terminal fault vulnerability; and/or see the documentation
- x86: Add protection against userspace-userspace spectreRSB
- x86: Support Enhanced IBRS on future Intel CPUs as the default Spectre V2 mitigation technique instead of retpoline for improved performance
- Support page-table isolation protection (KPTI) for x86-32
- KVM Shadow Paging performance improvements to improve the performance of shadow paging when the guest kernel is using KPTI
- Support for flushing the count cache on context switch on some POWERPC IBM CPUs (controlled by firmware), as a Spectre v2 mitigation
- Many small patches across the entire kernel to fix the possible Spectre exploitations warned by GCC
2. Core (various)
Raise the minimum required gcc version to 4.6 commit
Provide a command line (nosmt) and a sysfs knob (/sys/devices/system/cpu/smt/*) to control Simultaneous Multi Threading commit
Add an option for uncompressed kernel commit
locking: Implement an algorithm choice for Wound-Wait mutexes commit
task scheduler: Remove unused sched_time_avg_ms sysctl commit
usercopy: Enabling HARDENED_USERCOPY may cause measurable regressions in networking performance: up to 8% under UDP flood. It can now be disabled at runtime with the hardened_usercopy=off option commit
Enable early printing of hashed pointers. Aid debugging early in the boot sequence for machines that do not have a hw RNG by adding the command line option debug_boot_weak_hash commit, commit, commit, commit
virtual terminal: The vt code translates UTF-8 strings into glyph index values and stores those glyph values in the screen buffer. Because there can only be at most 512 glyphs, it is impossible to represent most unicode characters. This release introduces unicode screen support to the core console code with /dev/vcs* as a first user commit, commit, commit
proc: show fd locks taken by processes from another pidns commit
Power management: Add a new framework for idle injection, to be used by all of the idle injection code in the kernel in the future commit
driver core: Add a debugfs entry to show deferred devices commit
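For the SMT knob listed above, a minimal sketch of reading the current state from sysfs follows. The text only names the /sys/devices/system/cpu/smt/ directory; the file names control and active are taken from the kernel's sysfs ABI documentation as far as I know, and the helper name is hypothetical. Missing files are reported as None, so the function also works on kernels without the knob.

```python
import os


def read_smt_state(base="/sys/devices/system/cpu/smt"):
    """Return the contents of the SMT 'control' and 'active' files, or None if absent."""
    state = {}
    for name in ("control", "active"):
        try:
            with open(os.path.join(base, name)) as f:
                state[name] = f.read().strip()
        except OSError:
            state[name] = None  # knob not present or not readable
    return state


if __name__ == "__main__":
    print(read_smt_state())
```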
3. File systems
(FEATURED) Add EROFS (Enhanced Read-Only File System), a lightweight read-only file system with a modern design for scenarios that need high-performance read-only access, eg. firmware in mobile phones or live CDs source
Remove deprecated barrier/nobarrier mount commit
Add fsync_mode=nobarrier mount option, which lets the user skip issuing cache flush commands to the underlying flash storage commit
Enable -o discard by default commit
Introduces a new mount option fault_type to assist old fault_injection commit
Support discard submission error injection commit
Add FITRIM ioctl for FAT file system commit
Add support for asynchronous server-side COPY operations commit, commit, commit, commit, [[https://git.kernel.org/linus/bc0c9079b48ddcf1f8a6e1aaa277288b263c7