KernelNewbies:

Linux 4.14 was released on 12 Nov 2017.

Summary: This release includes support for bigger memory limits in x86 hardware (128 PiB of virtual address space, 4 PiB of physical address space); support for AMD Secure Memory Encryption; a new unwinder that provides better kernel traces and a smaller kernel size; a cgroups "thread mode" that allows resource distribution across the threads of a group of processes; support for the zstd compression algorithm in Btrfs and Squashfs; support for zero-copy of data from user memory to sockets; support for Heterogeneous Memory Management that will be needed in future GPUs; better cpufreq behaviour in some corner cases; longer-lived TLB entries by using the PCID CPU feature; asynchronous non-blocking buffered reads; and many new drivers and other improvements.

1. Prominent features

1.1. Bigger memory limits

The original x86-64 was limited by 4-level paging to 256 TiB of virtual address space and 64 TiB of physical address space. People are already bumping into this limit: some vendors offer servers with 64 TiB of memory today. To overcome the limitation, upcoming hardware will introduce support for 5-level paging. It is a straightforward extension of the current page table structures that adds one more layer of translation. It bumps the limits to 128 PiB of virtual address space and 4 PiB of physical address space. This "ought to be enough for anybody" ©.
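The limits quoted above follow from simple arithmetic: on x86-64 each page table level translates 9 bits of the virtual address and the offset inside a 4 KiB page takes 12 bits. The sketch below only reproduces that arithmetic; the helper names are made up for illustration, and the physical address widths (46 and 52 bits) are taken as given rather than derived.

```python
# Illustrative arithmetic only; helper names are hypothetical.
PAGE_OFFSET_BITS = 12   # 4 KiB pages
BITS_PER_LEVEL = 9      # 512 entries per page table

UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]


def virtual_address_bits(levels):
    """Width of a virtual address for a given number of paging levels."""
    return PAGE_OFFSET_BITS + BITS_PER_LEVEL * levels


def human_size(nbytes):
    """Format a power-of-two byte count with binary units."""
    idx = 0
    while nbytes >= 1024 and idx < len(UNITS) - 1:
        nbytes //= 1024
        idx += 1
    return f"{nbytes} {UNITS[idx]}"


# Physical address widths are assumed here, not computed.
PHYSICAL_BITS = {4: 46, 5: 52}

for levels in (4, 5):
    vbits = virtual_address_bits(levels)
    pbits = PHYSICAL_BITS[levels]
    print(f"{levels}-level paging: "
          f"{vbits}-bit virtual = {human_size(1 << vbits)}, "
          f"{pbits}-bit physical = {human_size(1 << pbits)}")
```

With these inputs the loop reports 256 TiB / 64 TiB for 4-level paging and 128 PiB / 4 PiB for 5-level paging, matching the figures in the text.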

On x86, 5-level paging enables a 56-bit userspace virtual address space. Not all user space is ready to handle such wide addresses. At least some JIT compilers are known to store their own data in the higher bits of pointers; with 5-level paging those bits collide with valid pointer bits, which leads to crashes. To mitigate this, the Linux kernel will not allocate virtual address space above 47 bits by default. Userspace can ask for allocations from the full address space by specifying a hint address above 47 bits.
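To see why wide addresses break such programs, consider a toy pointer-tagging scheme of the kind a JIT might use. This is a hypothetical layout invented for illustration (pointer in bits 0..46, tag in the bits above), not the scheme of any particular JIT.

```python
# Hypothetical tagging scheme: pointer in bits 0..46, tag stored above.
TAG_SHIFT = 47
PTR_MASK = (1 << TAG_SHIFT) - 1


def box(ptr, tag):
    """Pack a tag into the upper bits of a pointer-sized word."""
    return (tag << TAG_SHIFT) | ptr


def unbox(word):
    """Recover (pointer, tag) from a boxed word."""
    return word & PTR_MASK, word >> TAG_SHIFT


# An address that fits in 47 bits survives the round trip.
low_ptr = 0x7F12_3456_7000
assert unbox(box(low_ptr, 5)) == (low_ptr, 5)

# An address that needs more than 47 bits (possible with 5-level paging)
# overlaps the tag bits: both the pointer and the tag come back corrupted.
high_ptr = 0xAB_CDEF_0123_4000
ptr, tag = unbox(box(high_ptr, 5))
assert ptr != high_ptr and tag != 5
```

Keeping default allocations below 47 bits means a program like this keeps working unless it explicitly opts in to the larger address space with a high hint address.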

Recommended LWN article: Five-level page tables

Code: commit, commit, commit, commit, merge

1.2. Add support for AMD Secure Memory Encryption

Secure Memory Encryption can be used to mark individual pages of memory as encrypted through the page tables. A page of memory that is marked encrypted will be automatically decrypted when read from DRAM and will be automatically encrypted when written to DRAM. Secure Memory Encryption can therefore be used to protect the contents of DRAM from physical attacks on the system.

Recommended LWN article: Two approaches to x86 memory encryption

AMD Memory encryption whitepaper: link

Code: commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit

1.3. Better kernel traces with the ORC unwinder

This release includes a new "unwinder". An "unwinder" is what prints the list of functions (aka stack trace, callgraph, call stack...) that have been executed before reaching a given point of the code. It is used, for example, to print the list of functions that led to a crash when the kernel oopses. The new unwinder is called ORC, an acronym for "Oops Rewind Capability", and has been developed as a simpler alternative to the DWARF debuginfo format.

Linux already has an unwinder, and while it usually works well, it isn't reliable in all situations, which causes trouble for modern functionality like live patching that requires completely reliable stack traces. It also requires a feature called "frame pointers" (CONFIG_FRAME_POINTER) to print complete call stacks. Frame pointers make GCC add instrumentation code to every function in the kernel, which increases the size of the kernel executable code by about 3.2%, resulting in a broad kernel-wide slowdown, and more for some workloads. This option is enabled by default in some Linux distros.

In contrast, the ORC unwinder does not need to insert code anywhere, so it has no effect on text size or runtime performance; its debuginfo (about 2-4 MiB) is placed out of band. The ORC unwinder therefore provides a nice performance improvement across the board compared with frame pointers, while at the same time providing reliable stack traces.
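The idea of unwinding from out-of-band data can be sketched with a toy model. The table below is a heavily simplified stand-in for ORC data: one entry per function with a single stack offset, whereas the real format describes many address ranges per function and tracks more state. All addresses, function names and offsets are invented for illustration.

```python
import bisect

# Toy out-of-band table: (start address, function name, sp_offset).
# sp_offset is the distance from the current stack pointer to the
# caller's stack pointer; the return address sits 8 bytes below that.
ORC_TABLE = [
    (0x1000, "main", 8),
    (0x2000, "foo", 40),
    (0x3000, "bar", 24),
]
STARTS = [entry[0] for entry in ORC_TABLE]


def orc_lookup(ip):
    """Find the table entry covering an instruction address."""
    idx = bisect.bisect_right(STARTS, ip) - 1
    if idx < 0:
        raise LookupError(f"no unwind data for {ip:#x}")
    return ORC_TABLE[idx]


def unwind(ip, sp, stack):
    """Walk the call chain using only the table and stack memory."""
    trace = []
    while ip:
        _, name, sp_offset = orc_lookup(ip)
        trace.append(name)
        caller_sp = sp + sp_offset
        ip = stack.get(caller_sp - 8, 0)  # saved return address
        sp = caller_sp
    return trace


# Simulated stack memory: address -> 8-byte value.
stack = {0x7010: 0x2010, 0x7038: 0x1020, 0x7040: 0}
print(unwind(ip=0x3004, sp=0x7000, stack=stack))
```

Note that the traced functions themselves contain no extra instructions: everything the unwinder needs lives in the side table, which is the property that lets ORC avoid the cost of frame pointers.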

Recommended LWN article: The ORCs are coming

Recommended article: The Linux x86 ORC Stack Unwinder

Code: commit, commit

1.4. zstd compression in Btrfs and Squashfs

zstd offers a wide variety of compression speed and quality trade-offs. It can compress at speeds approaching lz4, and with quality approaching lzma. zstd decompresses at speeds more than twice as fast as zlib, and decompression speed remains roughly the same across all compression levels. Because it is a big win in speed over zlib and in compression ratio over lzo, Facebook has been using it in production with great results. Support has also been added to Squashfs. For benchmark numbers see the links.

Project page: https://github.com/facebook/zstd

Code: commit, commit, commit, commit

1.5. Zero-copy from user memory to sockets

Copying large buffers between a user process and the kernel can be expensive. Linux supports various interfaces that eschew copying, such as sendfile(2) and splice(2). The MSG_ZEROCOPY socket flag extends the underlying copy avoidance mechanism to common socket send calls. Copy avoidance is not a free lunch. As implemented, with page pinning, it replaces the per-byte copy cost with page accounting and completion notification overhead. As a result, MSG_ZEROCOPY is generally only effective for writes larger than around 10 KB.
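A minimal sketch of how an application might opt in follows. It assumes the Linux UAPI values SO_ZEROCOPY = 60 and MSG_ZEROCOPY = 0x4000000 when Python's socket module does not export those names; the helper names and the exact 10 KiB cutoff are illustrative choices, not part of any API. Completion handling is only described in comments.

```python
import socket

# Fall back to the numeric Linux UAPI values if the socket module
# does not export these names (an assumption of this sketch).
SO_ZEROCOPY = getattr(socket, "SO_ZEROCOPY", 60)
MSG_ZEROCOPY = getattr(socket, "MSG_ZEROCOPY", 0x4000000)

# Illustrative cutoff: below this size a plain copy is usually cheaper.
ZEROCOPY_THRESHOLD = 10 * 1024


def enable_zerocopy(sock):
    """Try to opt the socket in; return False where unsupported.

    Best done before connecting the socket.
    """
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_ZEROCOPY, 1)
        return True
    except OSError:
        return False


def send_flags(nbytes, zerocopy_enabled):
    """Request zero-copy only for sends large enough to benefit."""
    if zerocopy_enabled and nbytes >= ZEROCOPY_THRESHOLD:
        return MSG_ZEROCOPY
    return 0


def send_maybe_zerocopy(sock, payload, zerocopy_enabled):
    # With MSG_ZEROCOPY the kernel pins the pages instead of copying,
    # so the buffer must stay alive and unmodified until the completion
    # notification has been read from the socket error queue
    # (recvmsg with MSG_ERRQUEUE). That step is omitted here.
    return sock.send(payload, send_flags(len(payload), zerocopy_enabled))
```

The size check reflects the trade-off described above: for small writes, the page accounting and notification overhead outweighs the saved copy.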

Recommended LWN article: Zero-copy networking

Documentation: MSG_ZEROCOPY

Netdev talk: sendmsg copy avoidance with MSG_ZEROCOPY

1.6. Heterogeneous Memory Management for future GPUs

Today, device drivers expose a dedicated memory allocation API through their device file, often relying on a combination of ioctl and mmap calls. The device can only access and use memory allocated through this API. This effectively splits the program address space into objects allocated for and usable by the device, and other regular memory (malloc, mmap of a file, shared memory, ...) that is only accessible by the CPU (or by a device in a very limited way, by pinning memory). Allowing different isolated components of a program to use a device thus requires duplicating the input data structures using the device memory allocator. This is reasonable for simple data structures (arrays, grids, images, ...) but it gets extremely complex with advanced data structures. This is becoming a serious limitation on the kind of workload that can be offloaded to devices like GPUs.

New industry standards like C++, OpenCL or CUDA are pushing to remove this barrier. This requires a shared address space between the GPU device and the CPU so that the GPU can access any memory of a process (while still obeying memory protections like read-only). This kind of feature is also appearing in various other operating systems. Heterogeneous Memory Management is a set of helpers to facilitate several aspects of address space sharing and device memory management.

Recommended LWN article: Heterogeneous memory management

Documentation: Documentation/vm/hmm.txt

Code: commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit

1.7. Asynchronous buffered I/O support

The buffered I/O path in Linux can block in some situations. Using a thread pool to emulate non-blocking operations on regular buffered files is a common pattern today (samba, libuv, etc.). Applications split the work between network-bound threads (epoll) and an I/O thread pool. Not every application can use the sendfile syscall (for example, because of TLS or post-processing). This common pattern leads to increased request latency, caused either by additional synchronization between the threads or by fast requests (cached data) getting stuck behind slow requests (large or uncached data).

In this release, the preadv2(2) syscall with the RWF_NOWAIT flag (called RWF_NONBLOCK in earlier versions of the patch set) lets userspace applications skip enqueuing a read in the thread pool if the data is already available in the page cache.
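The pattern can be sketched as follows: try a non-blocking read from the page cache first, and hand the request to the thread pool only if that fails or comes up short. The sketch uses Python's os.preadv wrapper and os.RWF_NOWAIT, the name under which the flag is exposed to userspace. The function names are hypothetical, and a real event loop would await the pool future rather than block on it.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# 0 if this platform or Python build does not provide the flag.
RWF_NOWAIT = getattr(os, "RWF_NOWAIT", 0)


def try_cached_read(fd, length, offset):
    """Return bytes already in the page cache, or None if unavailable."""
    if not RWF_NOWAIT or not hasattr(os, "preadv"):
        return None
    buf = bytearray(length)
    try:
        # Fails with EAGAIN instead of blocking on uncached data;
        # some filesystems may reject the flag with another error.
        n = os.preadv(fd, [buf], offset, RWF_NOWAIT)
    except OSError:
        return None
    return bytes(buf[:n])


def read_cached_or_offload(fd, length, offset, pool):
    """Fast path from the page cache, slow path via the thread pool."""
    data = try_cached_read(fd, length, offset) or b""
    if len(data) < length:
        # Short or failed fast path: finish with a blocking pread
        # in a worker thread (returns b"" at end of file).
        future = pool.submit(os.pread, fd, length - len(data),
                             offset + len(data))
        data += future.result()
    return data
```

A caller would create one ThreadPoolExecutor up front and pass it in, so that cached requests never wait behind uncached ones in the pool queue.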

Code: commit, commit, commit, commit

1.8. Better cpufreq coordination with SMP

In Linux, notifications of task scheduler events are sent to the cpufreq subsystem, so that it can increase the frequency if needed and achieve good interactivity. However, the cpufreq drivers are not called when the events happen on a different CPU, for example, when a new process is created on another CPU. This release makes the task scheduler update the cpufreq policies for remote CPUs as well. The schedutil, ondemand and conservative governors are updated to process cpufreq updates for remote CPUs (the intel_pstate driver is updated to always reject them).

Recommended LWN article: CPU frequency governors and remote callbacks

Code: commit, commit

1.9. Control Groups thread mode

In this release, cgroup v2 supports thread granularity, to support use cases requiring hierarchical resource distribution across the threads of a group of processes. By default, all threads of a process belong to the same cgroup, which also serves as the resource domain that hosts resource consumption not specific to a process or thread. The thread mode allows threads to be spread across a subtree while still maintaining the common resource domain for them.
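From userspace, thread mode is driven through the cgroup v2 filesystem: a cgroup is switched by writing "threaded" to its cgroup.type file, and individual threads are moved by writing their thread ID to cgroup.threads. The sketch below wraps those two writes; the helper names and any directory paths passed to them are hypothetical, and on a real system the calls need sufficient privileges and an already-created cgroup directory.

```python
from pathlib import Path


def make_threaded(cgroup_dir):
    """Switch an existing cgroup v2 directory to thread mode."""
    (Path(cgroup_dir) / "cgroup.type").write_text("threaded")


def move_thread(cgroup_dir, tid):
    """Move a single thread (by kernel thread ID) into the cgroup."""
    (Path(cgroup_dir) / "cgroup.threads").write_text(str(tid))
```

For example, a worker thread could place itself in a threaded child cgroup by calling move_thread with the value of threading.get_native_id(), while the other threads of the process stay in a sibling cgroup under the same resource domain.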

Recommended LWN article: A milestone for control groups

Code: commit, commit, commit, commit, commit, commit

1.10. Longer-lived TLB Entries with PCID

PCID is a hardware feature, available on Intel CPUs, that attaches an address space tag to TLB entries and thus allows the hardware to skip TLB flushes on context switches. x86's PCID is far too short to uniquely identify a process, and it can't even uniquely identify a running process because there are monster systems with over 4096 CPUs. To make matters worse, past attempts to use all 12 PCID bits have resulted in slowdowns instead of speedups.

This release uses PCID differently. It uses a PCID to identify a recently-used mm on a per-cpu basis. An mm has no fixed PCID binding at all; instead, it is given a fresh PCID each time it's loaded except in cases where the kernel wants to preserve the TLB, in which case it reuses a recent value.
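The allocation policy described above can be modelled as a small per-CPU cache of recently used address spaces. The following is a simplified toy model, not the kernel's code: the class and method names are invented, the slot count is an arbitrary parameter, and a generation counter stands in for "the TLB contents for this mm are still valid".

```python
class CpuTlbState:
    """Toy model of per-CPU PCID slots for recently used mms."""

    def __init__(self, nr_slots=6):
        # slot index -> (mm_id, tlb_gen seen when last loaded)
        self.slots = [None] * nr_slots
        self.next_slot = 0

    def switch_to(self, mm_id, mm_tlb_gen):
        """Return (slot, need_flush) for loading mm_id on this CPU."""
        for slot, entry in enumerate(self.slots):
            if entry is not None and entry[0] == mm_id:
                # Recently used here: reuse its slot and keep the TLB
                # entries unless the mm changed since we last ran it.
                need_flush = entry[1] < mm_tlb_gen
                self.slots[slot] = (mm_id, mm_tlb_gen)
                return slot, need_flush
        # Not cached: take the next slot round-robin and flush it.
        slot = self.next_slot
        self.next_slot = (slot + 1) % len(self.slots)
        self.slots[slot] = (mm_id, mm_tlb_gen)
        return slot, True


cpu = CpuTlbState(nr_slots=2)
print(cpu.switch_to("A", 1))  # first load: flush
print(cpu.switch_to("B", 1))  # first load: flush
print(cpu.switch_to("A", 1))  # cached and unchanged: no flush
```

Because an mm has no fixed binding, the same address space may end up in different slots on different CPUs, and a small number of slots per CPU is enough regardless of how many processes or CPUs the system has.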

Code: commit, commit, commit, commit, commit, commit, commit, commit

2. Core (various)

3. File systems

4. Memory management

5. Block layer

6. Tracing, perf and BPF

7. Virtualization

8. Security

9. Networking

10. Architectures

11. Drivers

11.1. Graphics

11.2. Storage

11.3. Drivers in the Staging area

11.4. Networking

11.5. Audio

11.6. Tablets, touch screens, keyboards, mouses

11.7. TV tuners, webcams, video capturers

11.8. Universal Serial Bus

11.9. Serial Peripheral Interface (SPI)

11.10. Watchdog

11.11. ACPI, EFI, cpufreq, thermal, Power Management

11.12. Real Time Clock (RTC)

11.13. Voltage, current regulators, power capping, power supply

11.14. Pin Controllers (pinctrl)

11.15. Multi Media Card (MMC)

11.16. Memory Technology Devices (MTD)

11.17. Industrial I/O (iio)

11.18. Multi Function Devices (MFD)

11.19. Pulse-Width Modulation (PWM)

11.20. Inter-Integrated Circuit (I2C)

11.21. Hardware monitoring (hwmon)

11.22. General Purpose I/O (gpio)

11.23. Leds

11.24. DMA engines

11.25. Cryptography hardware acceleration

11.26. PCI

11.27. Clock

11.28. Various

12. List of merges

13. Other news sites

KernelNewbies: Linux_4.14 (last edited 2017-12-30 01:30:01 by localhost)