KernelNewbies:

Linux 4.5 was released on Sunday, 13 March 2016.

Summary: This release adds a new copy_file_range(2) system call that allows copying files without transferring data through userspace; experimental PowerPlay power management for modern Radeon GPUs; scalability improvements in the Btrfs free space handling; support for GCC's Undefined Behavior Sanitizer (-fsanitize=undefined); Forward Error Correction support in the device-mapper's verity target; support for the MADV_FREE flag in madvise(); a cgroup unified hierarchy that is now considered stable; scalability improvements for SO_REUSEPORT UDP sockets; scalability improvements for epoll; and better memory accounting of sockets in the memory controller. There are also new drivers and many other small improvements.

1. Prominent features

1.1. Copy offloading with new copy_file_range(2) system call

Copying a file consists of reading the data from the source file into user space memory, then writing that memory to the destination file. There is nothing wrong with this approach, but it requires extra copies of the data to and from the process memory. In this release Linux adds a system call, copy_file_range(2), which copies a range of data from one file to another while avoiding the cost of transferring the data from the kernel to user space and then back into the kernel.

This system call is only very slightly faster than cp, because the cost of these memory copies is barely noticeable compared with the time it takes to do the actual I/O, but there are some cases where it can help a lot more. In network filesystems such as NFS, copying data involves sending the copied data from the server to the client through the network, then sending it again from the client to the new file on the server. With copy_file_range(2), the NFS client can tell the NFS server to copy from the origin file to the destination file without transferring the data over the network (for NFS, this also requires the server-side copy feature present in the upcoming NFS v4.2, which is supported experimentally in this Linux release). In future releases, local filesystems such as Btrfs, and specialized storage devices that provide copy offloading facilities, could also use this system call to optimize the copy of data or to remove some of the present limitations (currently, copy offloading is limited to files on the same mount and superblock, and the source and destination cannot be the same file).
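
As a rough illustration of how an application might use the new call, here is a minimal Python sketch. It relies on os.copy_file_range(), the Python 3.8+ wrapper around copy_file_range(2); the helper name copy_file, the chunk size, and the fallback to a plain read/write loop when the call fails are assumptions made for this example, not part of the kernel interface.

```python
import os
import tempfile


def copy_file(src_path, dst_path, chunk=1 << 20):
    """Copy src_path to dst_path; return the number of bytes copied."""
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    # os.copy_file_range() needs Python 3.8+ and a kernel with the syscall.
    offload = hasattr(os, "copy_file_range")
    total = 0
    try:
        while True:
            if offload:
                try:
                    # The kernel moves the bytes; both file offsets advance.
                    n = os.copy_file_range(src, dst, chunk)
                except OSError:
                    # Assumed fallback (e.g. ENOSYS, EXDEV): user-space copy.
                    offload = False
                    continue
            else:
                data = os.read(src, chunk)
                n = len(data)
                view = memoryview(data)
                while view:
                    view = view[os.write(dst, view):]
            if n == 0:  # end of the source file
                return total
            total += n
    finally:
        os.close(src)
        os.close(dst)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.bin")
        dst = os.path.join(tmp, "dst.bin")
        with open(src, "wb") as f:
            f.write(b"kernel" * 1000)
        print(copy_file(src, dst), "bytes copied")
```

Because the file offsets advance with every successful call, switching to the read/write loop halfway through continues from where the in-kernel copy stopped.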

Recommended LWN articles: 1:copy_file_range(); 2:Copy offload

Raw man page: copy_file_range.2

Code: commit, commit, commit, commit, commit, commit, commit; NFS code: commit

1.2. Experimental PowerPlay support brings high performance to the amdgpu driver

Modern GPUs start running in low-power, low-performance modes. To get the best performance, they need to change their frequency dynamically, and doing that requires good power management. This release adds support for PowerPlay in the amdgpu driver for the discrete GPUs Tonga and Fiji, and the integrated APUs Carrizo and Stoney. PowerPlay is the brand name for a set of power management technologies implemented in several AMD GPUs and APUs; it has been available in the proprietary Catalyst driver, and it aims to eventually replace the existing dynamic power management in the amdgpu driver. On the supported GPUs, performance will be much higher thanks to the ability to handle frequency changes.

PowerPlay support is not enabled by default on all the hardware supported in this release because of stability concerns; in those cases PowerPlay can be forced on with the "amdgpu.powerplay=1" kernel option.

Code: see link

1.3. Btrfs free space handling scalability improvements

Filesystems need to keep track of which blocks are being used and which ones are free. They also need to store information about the free space somewhere, because it's too costly to generate it from scratch. Btrfs has been able to store a cache of the available free space since 2.6.37, but the implementation is a scalability bottleneck on large (over 30 TB), busy filesystems.

This release includes a new, experimental way of representing the free space cache that takes less work overall to update on each commit and fixes the scalability issues. It is not the default yet; it can be enabled with the -o space_cache=v2 mount option. On the first mount with this option set, the new free space tree will be created and a read-only compatibility flag will be enabled (older kernels will be able to read the filesystem, but not write to it). It is possible to revert to the old free space cache (and remove the compatibility flag) by mounting the filesystem with the options -o clear_cache,space_cache=v1.

Code: commit, commit, commit, commit, commit, commit, commit, commit, commit

1.4. Support for GCC's Undefined Behavior Sanitizer (-fsanitize=undefined)

UBSAN (Undefined Behavior SANitizer) is a debugging tool available since GCC 4.9 (see the -fsanitize=undefined documentation). It inserts instrumentation code during compilation that performs checks at runtime before operations that could cause undefined behavior. Undefined behavior means that the semantics of certain operations are undefined; the compiler presumes that such operations never happen because the programmer will take care to avoid them, but if they do happen the application can produce wrong results, crash or even allow security breaches. Examples of undefined behavior are using a non-static variable before it has been initialized, integer division by zero, signed integer overflow, dereferencing NULL pointers, etc.

In this release, Linux supports compiling the kernel with the Undefined Behavior Sanitizer enabled with the -fsanitize options shift, integer-divide-by-zero, unreachable, vla-bound, null, signed-integer-overflow, bounds, object-size, returns-nonnull-attribute, bool, enum and, optionally, alignment. Most of the work is done by the compiler; all the kernel does is handle the printing of errors.

Code: commit

1.5. Forward Error Correction support in the device-mapper's verity target

The device-mapper's "verity" target, used by popular platforms such as Android or Netflix, was merged in Linux 3.4. It makes it possible to verify that a file system hasn't been modified, by checking every filesystem read against a list of cryptographic hashes.

This release adds Forward Error Correction support to the verity target. This feature makes it possible to recover from several consecutive corrupted data blocks by using pregenerated error correction blocks, which have a relatively small space overhead and can be used to reconstruct the damaged blocks. This technique, found in DVDs, hard drives or satellite transmissions, will make it possible to recover from errors in a verity-backed filesystem placed on slightly damaged media.

Code: commit

1.6. Add MADV_FREE flag to madvise(2)

madvise(2) is a system call used by processes to tell the kernel how they are going to use their memory, allowing the kernel to optimize the memory management according to these hints to achieve better overall performance.

When an application wants to signal to the kernel that it isn't going to use a range of memory in the near future, it can use the MADV_DONTNEED flag, so the kernel can free the resources associated with it. Subsequent accesses in the range will succeed, but will result either in reloading the memory contents from the underlying mapped file or in zero-fill-on-demand pages for mappings without an underlying file. But some kinds of applications (notably, memory allocators) may reuse that memory range after a short time, and MADV_DONTNEED forces them to pay for a page fault, page allocation, page zeroing, etc. To avoid that overhead, other operating systems such as the BSDs have supported MADV_FREE, which just marks pages as available to be freed if needed but doesn't free them immediately, making it possible to reuse the memory range without incurring the cost of faulting the pages in again. This release adds Linux support for this flag.
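
The allocator-style pattern described above can be sketched in Python through mmap.madvise(), available since Python 3.8. The function name release_and_reuse and the decision to silently skip the hint when the constant or the kernel support is missing are assumptions for this example. MADV_FREE applies to private anonymous memory, so the mapping is created that way.

```python
import mmap


def release_and_reuse(size=4 * mmap.PAGESIZE):
    """Hint that a buffer is unused, then reuse it; return (hinted, data)."""
    # Private anonymous mapping: the kind of memory MADV_FREE applies to.
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        buf[:] = b"x" * size  # touch every page so it is really allocated

        hinted = False
        advice = getattr(mmap, "MADV_FREE", None)  # absent on old platforms
        if advice is not None:
            try:
                # The kernel may now reclaim these pages lazily if it needs
                # memory; until then they stay mapped.
                buf.madvise(advice)
                hinted = True
            except OSError:
                pass  # assumed policy: carry on without the hint

        # Reuse the range. Writing to a page makes it ordinary memory again;
        # the old contents must be treated as lost after the hint.
        buf[:5] = b"hello"
        return hinted, bytes(buf[:5])
    finally:
        buf.close()


if __name__ == "__main__":
    print(release_and_reuse())
```

Note that the sketch never reads the old contents after the hint: they may still be there or may have been replaced by zero-filled pages, depending on whether the kernel reclaimed them.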

Recommended LWN article: Volatile ranges and MADV_FREE

Code: commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit

1.7. Better epoll multithread scalability

When multiple epoll file descriptors or epfds (the file descriptors returned from epoll_create(2)) are added to a shared wakeup source, they are always added in a non-exclusive manner. This means that an event will wake up all epfds, creating a scalability problem when many epfds are being used.

This release introduces a new EPOLLEXCLUSIVE flag that can be passed as part of the event argument during an epoll_ctl(2) EPOLL_CTL_ADD operation. This new flag allows for exclusive wakeups when there are multiple epfds attached to a shared fd event source. In a modified version of Enduro/X, the use of the EPOLLEXCLUSIVE flag reduced the run time of a particular workload from 860s down to 24s.
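
The following Python sketch shows one way the flag might be used: several epoll instances, each with its own waiting thread, watch the read end of one pipe. select.EPOLLEXCLUSIVE exists since Python 3.6; the function name count_woken_waiters, the pipe as the shared event source, and the timing values are assumptions for illustration only.

```python
import os
import select
import threading
import time


def count_woken_waiters(n_waiters=4, exclusive=True, timeout=0.5):
    """Return how many epoll waiters saw one event on a shared pipe."""
    r, w = os.pipe()
    # Fall back to no flag where the constant is missing (assumed policy).
    flag = getattr(select, "EPOLLEXCLUSIVE", 0) if exclusive else 0
    epfds = [select.epoll() for _ in range(n_waiters)]
    woken = []
    lock = threading.Lock()

    def wait(ep):
        # poll() returns an empty list if the timeout expires first.
        if ep.poll(timeout):
            with lock:
                woken.append(1)

    try:
        for ep in epfds:
            # The flag is given when the fd is added (EPOLL_CTL_ADD).
            ep.register(r, select.EPOLLIN | flag)
        threads = [threading.Thread(target=wait, args=(ep,)) for ep in epfds]
        for t in threads:
            t.start()
        time.sleep(0.1)      # give the waiters time to block
        os.write(w, b"x")    # a single event on the shared source
        for t in threads:
            t.join()
        return len(woken)
    finally:
        for ep in epfds:
            ep.close()
        os.close(r)
        os.close(w)


if __name__ == "__main__":
    print("exclusive:", count_woken_waiters(exclusive=True))
    print("non-exclusive:", count_woken_waiters(exclusive=False))
```

Without the flag every waiter is woken by the single write. With it the kernel wakes one or more of them, so the exact count in the exclusive case depends on timing and is not guaranteed to be one.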

Recommended LWN article: Epoll evolving: Better multi-threaded behavior

Code: commit, commit

1.8. cgroup unified hierarchy is considered stable

cgroups, or control groups, are a feature introduced in Linux 2.6.24 that allows allocating resources (such as CPU time, system memory, network bandwidth) among user-defined groups of processes running on a system. In the first implementation, cgroups allowed an arbitrary number of process hierarchies, and each hierarchy could host any number of controllers. While this seemed to provide a high level of flexibility, in practice it had a number of problems, so in Linux 3.16 a new, unified hierarchy was merged. But it was experimental, and only available with the -o __DEVEL__sane_behavior mount option.

In this release, the unified hierarchy is considered stable, and it's no longer hidden behind that developer flag. It can be mounted using the cgroup2 filesystem type (unfortunately, the cpu controller for cgroup2 hasn't made it into this release; only the memory and io controllers are available at the moment). For more details, including a detailed rationale for the migration to the unified hierarchy, see the cgroup2 documentation: Documentation/cgroup-v2.txt
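
Before trying to mount it, a script can check whether the running kernel knows the cgroup2 filesystem type by looking it up in /proc/filesystems. The Python sketch below does that; both function names are made up for this example, and the parsing assumes the usual layout of that file (an optional "nodev" column followed by the filesystem name).

```python
def lists_cgroup2(filesystems_text):
    """Return True if 'cgroup2' appears in /proc/filesystems-style text."""
    for line in filesystems_text.splitlines():
        fields = line.split()
        # The filesystem name is the last column; "nodev" may precede it.
        if fields and fields[-1] == "cgroup2":
            return True
    return False


def kernel_supports_cgroup2(path="/proc/filesystems"):
    """Check the running kernel; False if the file cannot be read."""
    try:
        with open(path) as f:
            return lists_cgroup2(f.read())
    except OSError:
        return False


if __name__ == "__main__":
    print("cgroup2 available:", kernel_supports_cgroup2())
```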

Code: (merge)

1.9. Performance improvements for SO_REUSEPORT UDP sockets

SO_REUSEPORT is a socket option available since Linux 3.9 that allows multiple listener sockets to bind to the same port. A use case for SO_REUSEPORT would be something like a web server bound to port 80 and running multiple threads, where each thread might have its own listener socket.
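
As an illustration of the option itself (not of the optimizations in this release), this Python sketch binds two UDP sockets to the same loopback port and sends one datagram to it. The function names, the use of 127.0.0.1 with a kernel-chosen port, and the one-second timeout are assumptions for the example.

```python
import select
import socket


def reuseport_udp_pair(host="127.0.0.1"):
    """Bind two UDP sockets to the same port; return (sockets, port)."""
    socks = []
    for _ in range(2):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # The option must be set on every socket before bind().
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        socks.append(s)
    socks[0].bind((host, 0))             # let the kernel pick a free port
    port = socks[0].getsockname()[1]
    socks[1].bind((host, port))          # second listener on the same port
    return socks, port


def send_and_collect(payload=b"ping"):
    """Send one datagram to the shared port; return what the pair received."""
    socks, port = reuseport_udp_pair()
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sender.sendto(payload, ("127.0.0.1", port))
        # The kernel delivers the unicast datagram to one of the sockets.
        readable, _, _ = select.select(socks, [], [], 1.0)
        return [s.recvfrom(64)[0] for s in readable]
    finally:
        sender.close()
        for s in socks:
            s.close()


if __name__ == "__main__":
    print(send_and_collect())
```

In a real server each socket would typically belong to a different thread or process, so that incoming packets are spread across them.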

This release includes two optimizations for SO_REUSEPORT sockets (for now, they apply only to UDP sockets).

Code: commit, commit, commit

1.10. Proper control of socket memory usage in the memory controller

In past releases, socket buffers were accounted separately in the cgroup's memory controller, without any pressure equalization between anonymous memory, page cache, and the socket buffers. When the socket buffer pool was exhausted, buffer allocations would fail and cause network performance to tank, regardless of whether there was still memory available to the group or not. Likewise, struggling anonymous or cache working sets could not dip into an idle socket memory pool. Because of this, the feature was not usable for many real-life applications.

In this release, the new unified memory controller will account all types of memory pages it is tracking on behalf of a cgroup in a single pool. Under pressure, the VM reclaims, shrinks and puts pressure on whatever memory consumer in that pool is within its reach. When the VM has trouble freeing memory, the network code is instructed to stop growing the cgroup's transmit windows. Overhead is only incurred when a non-root control group is created and the memory controller is instructed to track and account the memory footprint of that group. cgroup.memory=nosocket can be specified on the boot command line to override any runtime configuration and forcibly exclude socket memory from active memory resource control.

Code: commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit

2. Drivers and architectures

3. Core (various)

4. File systems

5. Memory management

6. Block layer

7. Cryptography

8. Security

9. Tracing and perf tool

10. Virtualization

11. Networking

12. List of merges

13. Other news sites

KernelNewbies: Linux_4.5 (last edited 2017-12-30 01:30:23 by localhost)