KernelNewbies:

Linux 4.5 has not been released

Summary: This release adds a new copy_file_range(2) system call that makes it possible to copy files without transferring data through userspace; experimental Powerplay power management for modern Radeon GPUs; scalability improvements in the Btrfs free space handling; support for GCC's Undefined Behavior Sanitizer (-fsanitize=undefined); Forward Error Correction support in the device-mapper's verity target; support for the MADV_FREE flag in madvise(); a cgroup unified hierarchy that is now considered stable; scalability improvements for SO_REUSEPORT UDP sockets; scalability improvements for epoll; and better memory accounting of sockets in the memory controller. There are also new drivers and many other small improvements.

1. Prominent features

1.1. Copy offloading with new copy_file_range(2) system call

Copying a file consists of reading the data from the file into user space memory, then copying that memory to the destination file. There is nothing wrong with this way of doing things, but it requires extra copies of the data to and from the process memory. This release adds a system call, copy_file_range(2), which copies a range of data from one file to another, avoiding the mentioned cost of transferring data from the kernel to user space and then back into the kernel.

This system call is only very [https://lwn.net/Articles/658718/ slightly faster] than cp, because the cost of these memory copies is barely noticeable compared with the time it takes to do the actual I/O, but there are some cases where it can help a lot more. In network filesystems such as NFS, copying data involves sending the copied data from the server to the client through the network, then sending it again from the client to the new file on the server. With copy_file_range(2), the NFS client can tell the NFS server to copy from the origin file to the destination file without transferring the data over the network (for NFS, this also requires the server-side copy feature [https://tools.ietf.org/html/draft-ietf-nfsv4-minorversion2-41#section-4 present in the upcoming NFS v4.2], which is also supported experimentally in this Linux release). In future releases, local filesystems such as Btrfs, and specialized storage devices that provide copy offloading facilities, could also use this system call to optimize the copy of data, or remove some of the present limitations (currently, copy offloading is limited to files on the same mount and superblock, and the source and destination cannot be the same file).
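As an illustration of the difference between the two copy paths described above, here is a minimal sketch in Python. It assumes Python 3.8 or later, whose os.copy_file_range() wraps the system call; the helper names copy_range and _copy_fd, the chunk size, and the policy of falling back to a read/write loop on any OSError are choices made for this example only.

```python
import os


def _copy_fd(src, dst, remaining, chunk):
    """Copy `remaining` bytes from fd `src` to fd `dst`; return bytes copied."""
    copied = 0
    in_kernel = hasattr(os, "copy_file_range")
    while remaining > 0:
        want = min(chunk, remaining)
        if in_kernel:
            try:
                # No offsets are passed, so the kernel advances both file
                # positions itself; the data never enters this process.
                n = os.copy_file_range(src, dst, want)
            except OSError:
                # Example policy (an assumption): old kernel, unsupported
                # filesystem, or files on different mounts. Use the classic
                # path for the rest of the file.
                in_kernel = False
                continue
        else:
            # Classic path: read into user space, then write back out.
            data = os.read(src, want)
            n = len(data)
            view = memoryview(data)
            while view:
                view = view[os.write(dst, view):]
        if n == 0:  # unexpected end of file
            break
        copied += n
        remaining -= n
    return copied


def copy_range(src_path, dst_path, chunk=1 << 20):
    """Copy a whole file, preferring copy_file_range(2) when it works."""
    src = os.open(src_path, os.O_RDONLY)
    try:
        dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            return _copy_fd(src, dst, os.fstat(src).st_size, chunk)
        finally:
            os.close(dst)
    finally:
        os.close(src)
```

Calling copy_range("a.bin", "b.bin") returns the number of bytes copied. The call is made in a loop because copy_file_range(2), like read(2) and write(2), may copy fewer bytes than requested.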

Recommended LWN articles: 1:[https://lwn.net/Articles/659523/ copy_file_range()]; 2:[https://lwn.net/Articles/637436/ Copy offload]

Raw man page: [http://git.kernel.org/cgit/docs/man-pages/man-pages.git/tree/man2/copy_file_range.2 copy_file_range.2]

Code: [https://git.kernel.org/torvalds/c/29732938a6289a15e907da234d6692a2ead71855 commit], [https://git.kernel.org/torvalds/c/cb4c4e8091e86e08cb2d48e7ae6bf454245c36cb commit], [https://git.kernel.org/torvalds/c/3db11b2eecc02dc0eee943e71822c6d929281aa7 commit], [https://git.kernel.org/torvalds/c/eac70053a141998c40907747d6cea1d53a9414be commit], [https://git.kernel.org/torvalds/c/54dbc15172375641ef03399e8f911d7165eb90fb commit], [https://git.kernel.org/torvalds/c/04b38d601239b4d9be641b412cf4b7456a041c67 commit], [https://git.kernel.org/torvalds/c/d79bdd52d8be70d0e7024ac6715eee860a19834a commit]; NFS code: [https://git.kernel.org/torvalds/c/ffa0160a103917defd5d9c097ae0455a59166e03 commit]

1.2. Experimental Powerplay support brings high performance to the amdgpu driver

Modern GPUs start running in low power, low performance modes. To get the best performance, they need to change their frequency dynamically, and doing that requires good power management. This release adds support for [https://en.wikipedia.org/wiki/AMD_PowerPlay Powerplay] in the amdgpu driver for the discrete GPUs Tonga and Fiji, and the integrated APUs Carrizo and Stoney. Powerplay is the brand name for a set of power management technologies implemented in several of AMD's GPUs and APUs; it has been available in the proprietary Catalyst driver, and it aims to eventually replace the existing dynamic power management in the amdgpu driver. On the supported GPUs, performance will be much higher due to the ability to handle frequency changes.

Powerplay support is not enabled by default for all kinds of hardware supported in this release due to stability concerns; in these cases the use of Powerplay can be forced with the "amdgpu.powerplay=1" kernel option.

Code: see [https://lists.freedesktop.org/archives/dri-devel/2015-November/094230.html link]

1.3. Btrfs free space handling scalability improvements

Filesystems need to keep track of which blocks are being used and which ones are free. They also need to store information about the free space somewhere, because it's too costly to generate it from scratch. Btrfs has been able to store a cache of the available free space [http://kernelnewbies.org/Linux_2_6_37#head-73fc3db571309a002aad2f56e930923422cff5d2 since 2.6.37], but the implementation is a scalability bottleneck on large (over 30 TB), busy filesystems.

This release includes a new, experimental way of representing the free space cache that takes less work overall to update on each commit and fixes the scalability issues. It is not the default yet; it can be enabled with the -o space_cache=v2 mount option. On the first mount with this option set, the new free space tree will be created and a read-only compatibility flag will be enabled (older kernels will be able to read, but not write to, the filesystem). It is possible to revert to the old free space cache (and remove the compatibility flag) by mounting the filesystem with the options -o clear_cache,space_cache=v1.

Code: [https://git.kernel.org/torvalds/c/3e1e8bb770dba29645b302c5499ffcb8e3906712 commit], [https://git.kernel.org/torvalds/c/0f3312295d3ce1d82392244236a52b3b663480ef commit], [https://git.kernel.org/torvalds/c/1abfbcdf56d9485f050149bc4968c1609f9a0773 commit], [https://git.kernel.org/torvalds/c/73fa48b674e819098c3bafc47618d0e2868191e5 commit], [https://git.kernel.org/torvalds/c/208acb8c72d7ace6b672b105502dca0bcb050162 commit], [https://git.kernel.org/torvalds/c/a5ed91828518ab076209266c2bc510adabd078df commit], [https://git.kernel.org/torvalds/c/7c55ee0c4afba4434d973117234577ae6ff77a1c commit], [https://git.kernel.org/torvalds/c/1e144fb8f4a4d6d6d88c58f87e4366e3cd02ab72 commit], [https://git.kernel.org/torvalds/c/70f6d82ec73c3ae2d3adc6853c5bebcd73610097 commit]

1.4. Support for GCC's Undefined Behavior Sanitizer (-fsanitize=undefined)

UBSAN (Undefined Behaviour SANitizer) is a debugging tool available since GCC 4.9 (see [https://gcc.gnu.org/onlinedocs/gcc-4.9.0/gcc/Debugging-Options.html -fsanitize=undefined documentation]). It inserts instrumentation code during compilation that performs checks at runtime before operations that could cause undefined behaviour. [https://en.wikipedia.org/wiki/Undefined_behavior Undefined behavior] means that the semantics of certain operations are undefined, and the compiler presumes that such operations never happen because the programmer will take care to avoid them. If they do happen, the application can produce wrong results, crash, or even allow security breaches. Examples of undefined behaviour are using a non-static variable before it has been initialized, integer division by zero, signed integer overflows, dereferencing NULL pointers, etc.

In this release, Linux supports compiling the kernel with the Undefined Behavior Sanitizer enabled with the -fsanitize options shift, integer-divide-by-zero, unreachable, vla-bound, null, signed-integer-overflow, bounds, object-size, returns-nonnull-attribute, bool, enum and, optionally, alignment. Most of the work is done by the compiler; all the kernel does is handle the printing of errors.

Code: [https://git.kernel.org/torvalds/c/c6d308534aef6c99904bf5862066360ae067abc4 commit]

1.5. Forward Error Correction support in the device-mapper's verity target

The device-mapper's "verity" target, used by popular platforms such as [https://source.android.com/security/verifiedboot/ Android] or Chromebooks, was merged [http://kernelnewbies.org/Linux_3.4#head-011d0bd1a20451b0e374283b36a71d8e8f5b7ae1 in Linux 3.4]. It makes it possible to verify that a file system hasn't been modified, by checking every filesystem read attempt against a list of cryptographic hashes.

This release adds [https://en.wikipedia.org/wiki/Forward_error_correction Forward Error Correction] support to the verity target. This feature makes it possible to recover from several consecutive corrupted data blocks by using pregenerated error correction blocks, which have a relatively small space overhead and can be used to reconstruct the damaged blocks. As a result, a filesystem placed on slightly damaged media can remain usable.

Code: [https://git.kernel.org/torvalds/c/a739ff3f543afbb4a041c16cd0182c8e8d366e70 commit]

1.6. Add MADV_FREE flag to madvise(2)

[http://man7.org/linux/man-pages/man2/madvise.2.html madvise(2)] is a system call used by processes to tell the kernel how they are going to use their memory, allowing the kernel to optimize the memory management according to these hints to achieve better overall performance.

When an application wants to signal to the kernel that it isn't going to use a range of memory in the near future, it can use the MADV_DONTNEED flag, so the kernel can free the resources associated with it. Subsequent accesses in the range will succeed, but will result either in reloading the memory contents from the underlying mapped file, or in zero-fill-on-demand pages for mappings without an underlying file. But some kinds of applications (notably, memory allocators) are likely to reuse that memory range soon, and MADV_DONTNEED forces them to incur the cost of a page fault, page allocation, page zeroing, etc. To avoid that overhead, other operating systems such as the BSDs [https://www.freebsd.org/cgi/man.cgi?query=madvise&sektion=2 have supported MADV_FREE], which just marks pages as available to be freed if needed, but doesn't free them immediately. This release adds support for this flag.
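The allocator use case can be sketched from user space as follows. This is only an illustration, assuming Python 3.8 or later (which exposes mmap.madvise() and, where the system headers define it, mmap.MADV_FREE); the helper names make_buffer and release_buffer and the fallback to MADV_DONTNEED are assumptions made for this example.

```python
import mmap


def make_buffer(npages=16):
    """Create a private anonymous mapping, the kind MADV_FREE applies to."""
    return mmap.mmap(-1, npages * mmap.PAGESIZE,
                     flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)


def release_buffer(buf):
    """Mark the whole buffer as disposable; return the name of the advice used."""
    if hasattr(mmap, "MADV_FREE"):
        try:
            # Lazy: the kernel may reclaim the pages later, under memory
            # pressure, instead of dropping them right now.
            buf.madvise(mmap.MADV_FREE)
            return "MADV_FREE"
        except OSError:
            # Example policy (an assumption): the running kernel rejects
            # MADV_FREE, so use the older, eager advice instead.
            pass
    buf.madvise(mmap.MADV_DONTNEED)
    return "MADV_DONTNEED"


def reuse_buffer(buf, data):
    """Reuse the range after releasing it; its old contents must not be relied on."""
    buf[:len(data)] = data
    return bytes(buf[:len(data)])
```

After release_buffer(), reading the range may return either the old data or zeroes, so an allocator has to treat the contents as undefined until it writes to the range again.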

Recommended LWN article: [https://lwn.net/Articles/590991/ Volatile ranges and MADV_FREE]

Code: [https://git.kernel.org/torvalds/c/854e9ed09dedf0c19ac8640e91bcc74bc3f9e5c9 commit], [https://git.kernel.org/torvalds/c/ef58978f1eaab140081ec1808d96ee06e933e760 commit], [https://git.kernel.org/torvalds/c/21f55b018ba57897f4d3590ecbe11516bdc540af commit], [https://git.kernel.org/torvalds/c/64b42bc1cfdf6e2c3ab7315f2ff56c31cd257370 commit], [https://git.kernel.org/torvalds/c/10853a039208c4afaa322a7d802456c8dca222f4 commit], [https://git.kernel.org/torvalds/c/337ed7eb5fada305c7d5bf168cf5032f825faddf commit], [https://git.kernel.org/torvalds/c/590a471ce92355bc6c93a48769e8616b80071991 commit], [https://git.kernel.org/torvalds/c/79cedb8f62f116e72079c4d424edbc3d90302333 commit], [https://git.kernel.org/torvalds/c/d5d6a443b24304711fe83b312d29ff26cfa03f0c commit], [https://git.kernel.org/torvalds/c/44842045e4baaf406db2954dd2e07152fa61528d commit], [https://git.kernel.org/torvalds/c/05ee26d9e7e29ab026995eab79be3c6e8351908c commit], [https://git.kernel.org/torvalds/c/b8d3c4c3009d42869dc03a1da0efc2aa687d0ab4 commit]

1.7. Better epoll multithreaded scalability

When multiple [http://man7.org/linux/man-pages/man7/epoll.7.html epoll] file descriptors or epfds (the fd returned from [http://man7.org/linux/man-pages/man2/epoll_create.2.html epoll_create()]) are added to a shared wakeup source, they are always added in a non-exclusive manner. This means that an event will wake up all epfds, creating a scalability problem when many epfds are being used.

This release introduces a new EPOLLEXCLUSIVE flag that can be passed as part of the event argument during an [http://man7.org/linux/man-pages/man2/epoll_ctl.2.html epoll_ctl(2)] EPOLL_CTL_ADD operation. This new flag allows for exclusive wakeups when there are multiple epfds attached to a shared fd event source. In a modified version of [https://en.wikipedia.org/wiki/Enduro/X Enduro/X], the use of the EPOLLEXCLUSIVE flag reduced the run time of one particular workload from 860s down to 24s.
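A minimal sketch of how the flag is passed, assuming Python 3.6 or later (where select.EPOLLEXCLUSIVE is defined on Linux). The function names add_exclusive and demo, the use of a pipe as the shared event source, and the fallback to a plain EPOLLIN registration are assumptions made for this example.

```python
import os
import select


def add_exclusive(ep, fd):
    """Register fd for reading, asking for exclusive wakeups when supported.

    Returns the event mask that was actually registered.
    """
    mask = select.EPOLLIN | getattr(select, "EPOLLEXCLUSIVE", 0)
    try:
        ep.register(fd, mask)  # register() performs EPOLL_CTL_ADD
    except OSError:
        # Example policy (an assumption): kernels older than 4.5 reject the
        # flag, so fall back to the classic non-exclusive registration.
        mask = select.EPOLLIN
        ep.register(fd, mask)
    return mask


def demo(n_epfds=4):
    """Attach several epfds to one pipe, make it readable, and poll them."""
    r, w = os.pipe()
    epfds = [select.epoll() for _ in range(n_epfds)]
    try:
        masks = [add_exclusive(ep, r) for ep in epfds]
        os.write(w, b"x")
        # Nobody is blocked in epoll_wait() here, so this only shows that
        # the registrations work; the benefit of the flag appears when many
        # threads are sleeping in epoll_wait() on their own epfds.
        ready = [len(ep.poll(0)) for ep in epfds]
        return masks, ready
    finally:
        for ep in epfds:
            ep.close()
        os.close(r)
        os.close(w)
```

In a real server, each worker thread would own one epfd, register the shared listening descriptor with add_exclusive(), and then block in poll().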

Recommended LWN article: [https://lwn.net/Articles/633422/#excl Epoll evolving: Better multi-threaded behavior]

Code: [https://git.kernel.org/torvalds/c/df0108c5da561c66c333bb46bfe3c1fc65905898 commit], [https://git.kernel.org/torvalds/c/b6a515c8a0f6c2010a52793b43a79520bc95f994 commit]

1.8. Cgroup unified hierarchy is considered stable

cgroups, or control groups, are a feature [http://kernelnewbies.org/Linux_2_6_24#head-5b7511c1e918963d347abc8ed4b75215877d3aa3 introduced in Linux 2.6.24] that makes it possible to allocate resources (such as CPU time, system memory, network bandwidth) among user-defined groups of processes running on a system. In the first implementation, cgroups allowed an arbitrary number of process hierarchies, and each hierarchy could host any number of controllers. While this seemed to provide a high level of flexibility, in practice it had a number of problems, so [http://kernelnewbies.org/Linux_3.16#head-c8889cafd94fac58408ccc55a02589eda7608eb9 Linux 3.16] introduced a new, [http://lwn.net/Articles/601840/ experimental unified hierarchy], available with the -o __DEVEL__sane_behavior mount option.

In this release, the unified hierarchy is considered stable, and it's no longer hidden behind a developer flag. It can be mounted using the cgroup2 filesystem type. Unfortunately, the cpu controller for cgroup2 hasn't made it into this release; only the memory and io controllers are available at the moment.
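Since the unified hierarchy shows up as its own filesystem type, a program can find out where (or whether) it is mounted by looking for cgroup2 entries in /proc/mounts. The following sketch assumes the usual whitespace-separated /proc/mounts layout; the function name cgroup2_mount_points is made up for this example.

```python
def cgroup2_mount_points(mounts_text):
    """Return the mount points of all cgroup2 filesystems listed in the text."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Columns: device, mount point, filesystem type, options, dump, pass.
        # Note: special characters in mount points are octal-escaped
        # (for example a space appears as \040) and are not decoded here.
        if len(fields) >= 3 and fields[2] == "cgroup2":
            points.append(fields[1])
    return points
```

On a running system it could be called as cgroup2_mount_points(open("/proc/mounts").read()); an empty list means no unified hierarchy is mounted.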

Documentation about cgroup2 can be found at [https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroup-v2.txt Documentation/cgroup-v2.txt]

Code: [https://git.kernel.org/torvalds/c/34a9304a96d6351c2d35dcdc9293258378fc0bd8 (merge)]

1.9. Performance improvements for SO_REUSEPORT UDP sockets

[https://lwn.net/Articles/542629/ SO_REUSEPORT] is a socket option available [http://kernelnewbies.org/Linux_3.9#head-7f858c19da75698842d3571a2424c9b62d3f5b0a since Linux 3.9] that allows multiple listener sockets to bind to the same port. A typical motivation for SO_REUSEPORT is a web server bound to port 80 and running with multiple threads, where each thread might have its own listener socket.
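For the UDP case, the basic usage can be sketched as follows. This is an illustration only: the helper name reuseport_udp_socket and the use of the loopback address are assumptions made for this example.

```python
import socket


def reuseport_udp_socket(port=0, host="127.0.0.1"):
    """Create a UDP socket that may share its port with other sockets.

    Every socket that wants to share the port has to set SO_REUSEPORT
    before calling bind(). Port 0 asks the kernel to pick a free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    return sock
```

A first worker can call reuseport_udp_socket() and read the chosen port with getsockname()[1]; further workers then pass that port to bind their own sockets to it. Each incoming unicast datagram is delivered to only one of the sockets in the group.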

In this release, Linux includes two optimizations for SO_REUSEPORT (only in UDP):

Code: [https://git.kernel.org/torvalds/c/ef456144da8ef507c8cf504284b6042e9201a05c commit], [https://git.kernel.org/torvalds/c/e32ea7e747271a0abcd37e265005e97cc81d9df5 commit], [https://git.kernel.org/torvalds/c/538950a1b7527a0a52ccd9337e3fcd304f027f13 commit]

1.10. Proper control of socket memory usage in the memory controller

In past releases, socket buffers were accounted separately in the cgroup memory controller, without any pressure equalization between anonymous memory, page cache, and the socket buffers. When the socket buffer pool was exhausted, buffer allocations would fail hard and cause network performance to tank, regardless of whether there was still memory available to the group or not. Likewise, struggling anonymous or cache working sets could not dip into an idle socket memory pool. Because of this, the feature was not usable for many real-life applications.

In this release, the new unified memory controller will account all types of memory pages it is tracking on behalf of a cgroup in a single pool. Upon pressure, the VM reclaims and shrinks and puts pressure on whatever memory consumer in that pool is within its reach. When the VM has trouble freeing memory, the network code is instructed to stop growing the cgroup's transmit windows.

Overhead is only incurred when a non-root control group is created and the memory controller is instructed to track and account the memory footprint of that group. cgroup.memory=nosocket can be specified on the boot command line to override any runtime configuration and forcibly exclude socket memory from active memory resource control.

Code: [https://git.kernel.org/torvalds/c/7d828602e5ef3297a69392a2d31264e4ab9c8bb7 commit], [https://git.kernel.org/torvalds/c/8c2c2358b236530bc2c79b4c2a447cbdbc3d96d7 commit], [https://git.kernel.org/torvalds/c/931f3f4beb031cd483c1c8ab159ef1f8bdbe8888 commit], [https://git.kernel.org/torvalds/c/3d596f7b907b0281b997cf30c92994a71ad0a1a9 commit], [https://git.kernel.org/torvalds/c/af95d7df4059cfeab7e7c244f3564214aada7dad commit], [https://git.kernel.org/torvalds/c/80f23124f57c77915a7b4201d8dcba38a38b23f0 commit], [https://git.kernel.org/torvalds/c/e805605c721021879a1469bdae45c6f80bc985f4 commit], [https://git.kernel.org/torvalds/c/baac50bbc3cdfd184ebf586b1704edbfcee866df commit], [https://git.kernel.org/torvalds/c/80e95fe0fdcde2812c341ad4209d62dc1a7af53b commit], [https://git.kernel.org/torvalds/c/7941d2145abc4def5583f9d8d0b2e02647b6d1de commit], [https://git.kernel.org/torvalds/c/1109208766d9fa7059a9b66ad488e66d99ce49af commit], [https://git.kernel.org/torvalds/c/f7e1cb6ec51b041335b5ad4dd7aefb37a56d79a6 commit], [https://git.kernel.org/torvalds/c/8e8ae645249b85c8ed6c178557f8db8613a6bcc7 commit]

2. Drivers and architectures

3. Core (various)

4. File systems

5. Memory management

6. Block layer

7. Cryptography

8. Security

9. Tracing and perf tool

10. Virtualization

11. Networking

12. List of merges

13. Other news sites

KernelNewbies: Linux_4.5 (last edited 2016-03-13 13:57:34 by diegocalleja)