Linux 4.5 was released on Sunday, 13 March.
Summary: This release adds a new copy_file_range(2) system call that copies files without transferring data through userspace; experimental PowerPlay power management for modern Radeon GPUs; scalability improvements in Btrfs free space handling; support for GCC's Undefined Behavior Sanitizer (-fsanitize=undefined); Forward Error Correction support in the device-mapper's verity target; support for the MADV_FREE flag in madvise(); a cgroup unified hierarchy that is now considered stable; scalability improvements for SO_REUSEPORT UDP sockets; scalability improvements for epoll; and better memory accounting of sockets in the memory controller. There are also new drivers and many other small improvements.
- Prominent features
- Copy offloading with new copy_file_range(2) system call
- Experimental PowerPlay support brings high performance to the amdgpu driver
- Btrfs free space handling scalability improvements
- Support for GCC's Undefined Behavior Sanitizer (-fsanitize=undefined)
- Forward Error Correction support in the device-mapper's verity target
- Add MADV_FREE flag to madvise(2)
- Better epoll multithread scalability
- cgroup unified hierarchy is considered stable
- Performance improvements for SO_REUSEPORT UDP sockets
- Proper control of socket memory usage in the memory controller
- Drivers and architectures
- Core (various)
- File systems
- Memory management
- Block layer
- Tracing and perf tool
- List of merges
- Other news sites
1. Prominent features
1.1. Copy offloading with new copy_file_range(2) system call
Copying a file consists of reading the data from the file into user space memory, then writing that memory to the destination file. There is nothing wrong with this approach, but it requires extra copies of the data to and from the process memory. This release adds a system call, copy_file_range(2), which copies a range of data from one file to another, avoiding the cost of transferring data from the kernel to user space and then back into the kernel.
This system call is only very slightly faster than cp, because the cost of these memory copies is barely noticeable compared with the time it takes to do the actual I/O, but there are some cases where it can help a lot more. In network filesystems such as NFS, copying data involves sending the copied data from the server to the client over the network, then sending it again from the client to the new file on the server. With copy_file_range(2), the NFS client can tell the NFS server to copy from the origin to the destination file without transferring the data over the network (for NFS, this also requires the server-side copy feature present in the upcoming NFS v4.2, which is also supported experimentally in this Linux release). In future releases, local filesystems such as Btrfs, and specialized storage devices that provide copy offloading facilities, could also use this system call to optimize the copying of data, or remove some of the present limitations (currently, copy offloading is limited to files on the same mount and superblock, and not within the same file).
Raw man page: copy_file_range.2
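As a rough illustration of the copy loop described above, here is a minimal Python sketch. It relies on os.copy_file_range(), the wrapper that Python 3.8 and later expose on Linux for this system call. The helper name copy_file, the chunk size and the read/write fallback are assumptions made for this example and do not come from the kernel sources.

```python
import os
import tempfile


def copy_file(src_path, dst_path, chunk=1 << 20):
    """Copy src_path to dst_path, preferring in-kernel copying.

    Returns the number of bytes copied.
    """
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    copied = 0
    try:
        while True:
            try:
                # The kernel copies the bytes and advances both file offsets;
                # the data never passes through this process.
                n = os.copy_file_range(src, dst, chunk)
            except (AttributeError, OSError):
                # Assumed fallback for old Pythons, old kernels or filesystems
                # that reject the call: the classic read-then-write loop.
                data = os.read(src, chunk)
                n = len(data)
                while data:
                    data = data[os.write(dst, data):]
            if n == 0:  # end of the source file
                return copied
            copied += n
    finally:
        os.close(src)
        os.close(dst)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        a, b = os.path.join(tmp, "a"), os.path.join(tmp, "b")
        with open(a, "wb") as f:
            f.write(b"x" * 100000)
        print(copy_file(a, b), "bytes copied")
```

Both files live in the same directory here, which matches the restriction mentioned above that source and destination must be on the same mount.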
1.2. Experimental PowerPlay support brings high performance to the amdgpu driver
Modern GPUs start running in low power, low performance modes. To get the best performance, they need to change their frequency dynamically, which requires good power management. This release adds PowerPlay support to the amdgpu driver for the discrete GPUs Tonga and Fiji and the integrated APUs Carrizo and Stoney. PowerPlay is the brand name for a set of power management technologies implemented in several AMD GPUs and APUs; it has been available in the proprietary Catalyst driver, and it aims to eventually replace the existing dynamic power management in the amdgpu driver. On the supported GPUs, performance will be much higher thanks to the ability to handle frequency changes.
PowerPlay support is not enabled by default for all kinds of hardware supported in this release due to stability concerns; in these cases the use of PowerPlay can be forced with the "amdgpu.powerplay=1" kernel option.
Code: see link
1.3. Btrfs free space handling scalability improvements
Filesystems need to keep track of which blocks are being used and which ones are free. They also need to store information about the free space somewhere, because it's too costly to generate it from scratch. Btrfs has been able to store a cache of the available free space since 2.6.37, but the implementation is a scalability bottleneck on large (over 30 TB), busy filesystems.
This release includes a new, experimental way of representing the free space cache that takes less work overall to update on each commit and fixes the scalability issues. It is not the default yet; it can be enabled with the -o space_cache=v2 mount option. On the first mount with this option set, the new free space tree will be created and a read-only compatibility flag will be enabled (older kernels will be able to read from, but not write to, the filesystem). It is possible to revert to the old free space cache (and remove the compatibility flag) by mounting the filesystem with the options -o clear_cache,space_cache=v1.
1.4. Support for GCC's Undefined Behavior Sanitizer (-fsanitize=undefined)
UBSAN (Undefined Behavior SANitizer) is a debugging tool available since GCC 4.9 (see -fsanitize=undefined documentation). It inserts instrumentation code during compilation that performs checks at runtime before operations that could cause undefined behavior. Undefined behavior means that the semantics of certain operations are undefined, and the compiler presumes that such operations never happen because the programmer will take care to avoid them; if they do happen, the application can produce wrong results, crash or even allow security breaches. Examples of undefined behavior are using a non-static variable before it has been initialized, integer division by zero, signed integer overflow, dereferencing NULL pointers, etc.
In this release, Linux supports compiling the kernel with the Undefined Behavior Sanitizer enabled, using the -fsanitize options shift, integer-divide-by-zero, unreachable, vla-bound, null, signed-integer-overflow, bounds, object-size, returns-nonnull-attribute, bool, enum and, optionally, alignment. Most of the work is done by the compiler; all the kernel does is handle the printing of errors.
1.5. Forward Error Correction support in the device-mapper's verity target
The device-mapper's "verity" target, used by popular platforms such as Android or Chromium OS, was merged in Linux 3.4. It makes it possible to verify that a file system hasn't been modified by checking every filesystem read attempt against a list of cryptographic hashes.
This release adds Forward Error Correction support to the verity target. This feature makes it possible to recover from several consecutive corrupted data blocks by using pregenerated error correction blocks, which have a relatively small space overhead and can be used to reconstruct the damaged blocks. This technique, found in DVDs, hard drives and satellite transmissions, will make it possible to recover from errors in a verity-backed filesystem placed on slightly damaged media.
1.6. Add MADV_FREE flag to madvise(2)
madvise(2) is a system call used by processes to tell the kernel how they are going to use their memory, allowing the kernel to optimize the memory management according to these hints to achieve better overall performance.
When an application wants to signal to the kernel that it isn't going to use a range of memory in the near future, it can use the MADV_DONTNEED flag, so the kernel can free the resources associated with it. Subsequent accesses to the range will succeed, but will result either in reloading the memory contents from the underlying mapped file or in zero-fill-on-demand pages for mappings without an underlying file. But some kinds of applications (notably, memory allocators) may reuse that memory range after a short time, and MADV_DONTNEED forces them to incur page faults, page allocation, page zeroing, etc. To avoid that overhead, other operating systems such as the BSDs have supported MADV_FREE, which just marks pages as available to be freed if needed but doesn't free them immediately, making it possible to reuse the memory range without incurring the cost of faulting the pages in again. This release adds Linux support for this flag.
Recommended LWN article: Volatile ranges and MADV_FREE
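The behaviour described above can be tried from Python 3.8 or later, whose mmap objects have a madvise() method. The sketch below is only an illustration: the helper names alloc_private and mark_free are made up for this example, and the code assumes that mmap.MADV_FREE may be missing or rejected on some systems, in which case it simply reports that the hint was not applied.

```python
import mmap


def alloc_private(pages):
    """Create an anonymous private mapping, the kind MADV_FREE applies to."""
    return mmap.mmap(-1, pages * mmap.PAGESIZE,
                     flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)


def mark_free(buf):
    """Tell the kernel it may reclaim buf's pages lazily.

    Returns True if MADV_FREE was accepted, False if it is unavailable.
    """
    advice = getattr(mmap, "MADV_FREE", None)
    if advice is None or not hasattr(buf, "madvise"):
        return False
    try:
        buf.madvise(advice)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    buf = alloc_private(16)
    buf[:5] = b"hello"
    print("MADV_FREE applied:", mark_free(buf))
    # Until the kernel reclaims the page, a read may still see the old bytes;
    # after reclaim it sees zeros. Either result is valid.
    print(buf[:5])
    # Writing to the page again cancels the pending free for that page.
    buf[:5] = b"again"
    print(buf[:5])
```

An allocator would typically call the equivalent of mark_free() on memory returned by free() and then write to it again on the next allocation, without paying for a fresh page fault if the kernel never needed the pages.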
1.7. Better epoll multithread scalability
When multiple epoll file descriptors or epfds (the file descriptors returned from epoll_create(2)) are added to a shared wakeup source, they are always added in a non-exclusive manner. This means that an event will wake up all epfds, creating a scalability problem when many epfds are being used.
This release introduces a new EPOLLEXCLUSIVE flag that can be passed as part of the event argument during an epoll_ctl(2) EPOLL_CTL_ADD operation. This new flag allows for exclusive wakeups when there are multiple epfds attached to a shared fd event source. In a modified version of Enduro/X, the use of the EPOLLEXCLUSIVE flag reduced the run time of one particular workload from 860s to 24s.
Recommended LWN article: Epoll evolving: Better multi-threaded behavior
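A minimal sketch of the registration step described above, using Python's select module, which exposes EPOLLEXCLUSIVE since Python 3.6. The helper names make_pollers and ready_pollers are hypothetical, and the fallback to a plain EPOLLIN registration is an assumption added so the example also runs on kernels that reject the flag.

```python
import os
import select


def make_pollers(fd, count):
    """Create `count` epoll instances that all watch `fd` for readability."""
    # EPOLLEXCLUSIVE needs Linux 4.5+; getattr keeps older Pythons working.
    exclusive = getattr(select, "EPOLLEXCLUSIVE", 0)
    pollers = []
    for _ in range(count):
        ep = select.epoll()
        try:
            # register() performs the EPOLL_CTL_ADD operation.
            ep.register(fd, select.EPOLLIN | exclusive)
        except OSError:
            # Assumed fallback: the kernel rejected the flag.
            ep.register(fd, select.EPOLLIN)
        pollers.append(ep)
    return pollers


def ready_pollers(pollers, timeout=0.5):
    """Return the indexes of the pollers that currently report an event."""
    return [i for i, ep in enumerate(pollers) if ep.poll(timeout)]


if __name__ == "__main__":
    r, w = os.pipe()
    pollers = make_pollers(r, 3)
    os.write(w, b"x")
    print("pollers reporting the event:", ready_pollers(pollers))
```

The flag only changes which blocked waiters are woken up, so this single-threaded demo cannot show the reduction in wakeups. In real use each epfd would belong to a separate thread or process blocked in epoll_wait(2) on the shared source.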
1.8. cgroup unified hierarchy is considered stable
cgroups, or control groups, are a feature introduced in Linux 2.6.24 that allows allocating resources (such as CPU time, system memory, network bandwidth) among user-defined groups of processes running on a system. In the first implementation, cgroups allowed an arbitrary number of process hierarchies, and each hierarchy could host any number of controllers. While this seemed to provide a high level of flexibility, in practice it had a number of problems, so in Linux 3.16 a new, unified hierarchy was merged. But it was experimental, and only available with the -o __DEVEL__sane_behavior mount option.
In this release, the unified hierarchy is considered stable, and it's no longer hidden behind that developer flag. It can be mounted using the cgroup2 filesystem type (unfortunately, the cpu controller for cgroup2 hasn't made it into this release; only the memory and io controllers are available at the moment). For more details, including the detailed reasoning behind the migration to the unified hierarchy, see the cgroup2 documentation: Documentation/cgroup-v2.txt
1.9. Performance improvements for SO_REUSEPORT UDP sockets
SO_REUSEPORT is a socket option available since Linux 3.9 that allows multiple listener sockets to bind to the same port. A use case for SO_REUSEPORT would be something like a web server binding to port 80 and running with multiple threads, where each thread might have its own listener socket.
Two new socket options make it possible to attach a classic or extended BPF program (SO_ATTACH_REUSEPORT_CBPF and SO_ATTACH_REUSEPORT_EBPF). These BPF programs can define how packets are assigned to the sockets in the SO_REUSEPORT group of sockets that are bound to the same port.
This release also makes the lookup faster when selecting a SO_REUSEPORT socket for an incoming packet. Previously, the lookup process needed to consider all sockets; in this release an appropriate socket can be found much faster (see the commit link for benchmarks).
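To make the SO_REUSEPORT group concrete, the following Python sketch binds several UDP sockets to one port and lets the kernel pick which of them receives a datagram. The helper names bind_reuseport_group and receive_once are invented for this example, and the loopback address is an assumption. Attaching a BPF program with the new socket options is not shown.

```python
import select
import socket


def bind_reuseport_group(count, port=0, host="127.0.0.1"):
    """Bind `count` UDP sockets to one port using SO_REUSEPORT."""
    group = []
    for _ in range(count):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # Every socket in the group must set the option before bind().
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind((host, port))
        port = s.getsockname()[1]  # later sockets reuse the port just chosen
        group.append(s)
    return group


def receive_once(group, timeout=1.0):
    """Return (index, payload) for the first socket holding a datagram."""
    readable, _, _ = select.select(group, [], [], timeout)
    for i, s in enumerate(group):
        if s in readable:
            return i, s.recv(65535)
    return None


if __name__ == "__main__":
    group = bind_reuseport_group(4)
    port = group[0].getsockname()[1]
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.sendto(b"ping", ("127.0.0.1", port))
    print("delivered to (socket index, payload):", receive_once(group))
```

Without a BPF program attached, the kernel chooses the receiving socket itself. In a real server each socket would be owned by its own thread or process.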
1.10. Proper control of socket memory usage in the memory controller
In past releases, socket buffers were accounted separately in the cgroup's memory controller, without any pressure equalization between anonymous memory, page cache, and the socket buffers. When the socket buffer pool was exhausted, buffer allocations would fail and cause network performance to tank, regardless of whether there was still memory available to the group or not. Likewise, struggling anonymous or cache working sets could not dip into an idle socket memory pool. Because of this, the feature was not usable for many real-life applications.
In this release, the new unified memory controller will account all types of memory pages it is tracking on behalf of a cgroup in a single pool. Upon pressure, the VM reclaims and shrinks and puts pressure on whatever memory consumer in that pool is within its reach. When the VM has trouble freeing memory, the network code is instructed to stop growing the cgroup's transmit windows. Overhead is only incurred when a non-root control group is created and the memory controller is instructed to track and account the memory footprint of that group. cgroup.memory=nosocket can be specified on the boot command line to override any runtime configuration and forcibly exclude socket memory from active memory resource control.
2. Drivers and architectures
All the driver and architecture-specific changes can be found in the Linux_4.5-DriversArch page
3. Core (various)
Allow preconfiguring and tuning the ASLR randomness. Two sysctls, /proc/sys/vm/mmap_rnd_bits and /proc/sys/vm/mmap_rnd_compat_bits (for 32bit processes and 32bit-on-64bit-kernel), can be used to tune it. Recommended LWN article: Increasing the range of address-space layout randomization. Code: commit, commit, commit, commit
fcntl: allow setting the O_DIRECT flag on pipes (will be used by CRIU to migrate packetized pipes) commit
futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op commit
RCU: Add rcupdate.rcu_normal kernel parameter to suppress expedited grace periods, that is, to treat requests for expedited grace periods as if they were requests for normal grace periods. Useful for extreme real-time workloads commit
RCU: Add kernel parameter rcupdate.rcu_normal_after_boot that disables expedited grace periods just before init is spawned commit
Allow disabling mandatory file locking at compile time (it appears to be almost unused and buggy, and there appears to be no real interest in doing anything with it). Recommended LWN article: Optional mandatory locking. commit
vfio: No-IOMMU mode. Only with an IOMMU can userspace access to DMA capable devices be considered secure, but some people still want to do it commit
workqueues: implement a workqueue lockup detector commit
sysctl: enable strict writes by default. File position will be respected when doing multiple writes to a sysctl, instead of rewriting the content with each write commit
scripts: add prune-kernel script to clean up old kernel images commit
configfs: Add support for binary attributes commit
workqueues: Add debug facility, enabled with the workqueue.debug_force_rr_cpu kernel parameter, which forces workqueue items to be run on foreign CPUs commit
4. File systems
Introduce per-inode DAX enablement: rather than just being able to turn DAX on and off via a mount option, some applications may only want to enable DAX for certain performance-critical files in a filesystem. When this flag is set on a directory, it acts as an "inherit flag". That is, inodes created in the directory will automatically inherit the on-disk inode DAX flag, enabling administrators to set up directory hierarchies that automatically use DAX. Setting this flag on an empty root directory will make the entire filesystem use DAX by default commit
Add a mechanism to inject CRC errors into log records to facilitate testing torn write detection during log recovery commit
5. Memory management
pipes: limit the per-user amount of pages allocated in pipes. It is possible for a single process to cause an OOM condition by filling large pipes with data that are never read. This release makes it possible to enforce a per-user soft limit above which new pipes will be limited to a single page (4KB), as well as a hard limit above which no new pipes may be created for this user. The limits are controlled by two new sysctls: pipe-user-pages-soft and pipe-user-pages-hard. Both may be disabled by setting them to zero. The default soft limit allows the default number of FDs per process (1024) to create pipes of the default size (64kB). The hard limit is disabled by default to avoid breaking existing applications that make intensive use of pipes (eg: for splice(2)) commit
proc: Currently, /proc/pid/smaps will always show Swap: 0 kB for shmem-backed mappings, even if the mapped portion does contain shmem-backed pages that were swapped out. This release accounts for shmem swap commit
proc: There are several shortcomings with the accounting of shared memory (SysV shm, shared anonymous mapping, mapping of a tmpfs file). The values in /proc/<pid>/status and statm don't make it possible to distinguish between shmem memory and a shared mapping to a regular file, even though their implications for memory usage are quite different. This release adds a breakdown of VmRSS in /proc/<pid>/status via new fields RssAnon, RssFile and RssShmem. These fields tell the user the memory occupied by private anonymous pages, mapped regular files and shmem, respectively commit, commit
vmstats: replace THP_SPLIT with three events: THP_SPLIT_PAGE, THP_SPLIT_PAGE_FAILED and THP_SPLIT_PMD commit
Revert /proc/<pid>/maps [stack:TID] annotation commit
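The default soft limit for pipe pages mentioned in the list above can be worked out from the numbers given there (1024 FDs, 64kB pipes, 4KB pages). The Python sketch below does that arithmetic and also tries to read the current values. The function names are made up for this example, and the /proc/sys/fs/ location of the two sysctls is stated here as an assumption about where they are exposed.

```python
import os

PAGE_SIZE = 4096               # 4 KB pages, as in the text
DEFAULT_PIPE_SIZE = 64 * 1024  # default pipe capacity: 64 kB
DEFAULT_FD_LIMIT = 1024        # default number of FDs per process


def default_soft_limit_pages(fds=DEFAULT_FD_LIMIT,
                             pipe_size=DEFAULT_PIPE_SIZE,
                             page_size=PAGE_SIZE):
    """Pages needed for `fds` pipes of `pipe_size` bytes each."""
    return fds * (pipe_size // page_size)


def read_pipe_limit(name):
    """Return the integer value of /proc/sys/fs/<name>, or None if unavailable."""
    try:
        with open(os.path.join("/proc/sys/fs", name)) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    # 1024 pipes * 16 pages each = 16384 pages, i.e. 64 MB of pipe buffers.
    print("expected default soft limit:", default_soft_limit_pages(), "pages")
    for name in ("pipe-user-pages-soft", "pipe-user-pages-hard"):
        print(name, "=", read_pipe_limit(name))
```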
6. Block layer
Enable DAX (page cache bypass) for raw block devices. This capability is targeted primarily at hypervisors wanting to provision persistent memory for guests. It can be disabled / enabled dynamically via the new BLKDAXSET ioctl. This feature is experimental and can cause data loss, so it needs to be enabled explicitly commit, commit
raid5-cache: add journal hot add/remove support commit
lightnvm: support factory reset commit
dm verity: add ignore_zero_blocks feature, which makes dm-verity not verify blocks that are expected to contain zeroes and always return zeroes instead. This may be useful if the partition contains unused blocks that are not guaranteed to contain zeroes commit
drbd: make drbd known to lsblk commit
7. Cryptography
talitos: add algorithms: ecb(aes), ctr(aes), ecb(des), cbc(des), ecb(des3_ede) commit
rsa: adds PKCS#1 v1.5 standard RSA padding commit
qat: Support for Intel C62x with Intel Quick Assist Technology for accelerating crypto and compression workloads commit
8. Security
Allow updating the IMA policy multiple times. The new rules get appended to the original policy. Users must keep in mind that the rules are scanned in FIFO order, so it's necessary to be careful when designing and adding new ones commit
Allow the root user to read the current IMA policy rules. It is often useful to be able to read back the IMA policy, and even more important after introducing the ability to update the IMA policy commit
keys: forbid removing certain keys. A new key flag named KEY_FLAG_KEEP is added to prevent userspace from being able to unlink, revoke, invalidate or time out a key on a keyring. When this flag is set on the keyring, all keys subsequently added are flagged. In addition, when this flag is set, the keyring itself cannot be cleared commit
keys: enable the use of TPM2 authorization policies to seal trusted keys commit
keys: Allow selecting the hash algorithm for TPM2 chips. For TPM 1.x the only allowed value is sha1. For TPM 2.x the allowed values are sha1, sha256, sha384, sha512 and sm3-256 commit
selinux: Make validatetrans decisions available through selinuxfs. "/validatetrans" is added to selinuxfs for this purpose. This functionality is needed by file system servers implemented in userspace or kernelspace without the VFS layer commit
9. Tracing and perf tool
Allow using trace events fields as sort order keys, making perf evlist --trace_fields show those, and then the user can select a subset and use like: perf top -e sched:sched_switch -s prev_comm,next_comm. That works as well in perf report when handling files containing tracepoints. Support for things like perf report -s 'switch.*' --stdio is also possible commit, commit, commit, commit, commit, commit, commit, commit, commit
BPF programs can now specify perf probe tunables via their section name, separating key=val values using semicolons. An exec key is used to specify a user executable, which allows attaching BPF programs at uprobe events commit; a module key allows users to attach BPF programs to symbols in modules commit; an inline key specifies whether to probe at inline symbols or not; and a force key forcibly adds events with an existing name commit
Allow BPF scriptlets to specify arguments to be fetched using DWARF info, using a prologue generated at compile/build time. Various perf probe options can be used to list functions, or to see what variables can be collected at any given point commit, commit, commit
Introduce a new callchain mode: folded (perf report -g folded) to print callchains in a line, facilitating perf report output processing by other tools, such as Brendan Gregg's flamegraph tools commit, commit, commit
perf script: If no script is specified for stat data, display stat events in raw form commit
perf record: Add --buildid-all option to record build-id of all DSOs regardless whether it's actually hit or not commit
perf record: Add record.build-id config option, which can be set to three different options, see commit for more details commit
perf record: Support custom vmlinux path, when vmlinux is needed as the source of DWARF info to generate prologue for BPF programs commit
perf report/top: Add --raw-trace option commit
perf report: the --call-graph option adds support for choosing how to display callchain values. Possible values are percent, period and count. percent is the same as before and is the default behavior. period displays the raw period value. count displays the number of occurrences commit
perf report: Change default to use event group view. If users want to keep the original behavior, they can set the report.group config variable to false and/or use --no-group option commit
Add file_only config option to strlist commit
bpf: add show_fdinfo handler for maps commit
10. Virtualization
paravirtualized queued spinlock:
Convert the backend driver into a multiqueue driver, exposing more than one queue to the frontend (merge)
user-mode-linux: Add seccomp support commit
11. Networking
Add the ability to destroy a TCP socket using the netlink socket diag interface. It causes all blocking calls on the socket to fail fast with ECONNABORTED and causes a protocol close of the socket. It informs the other end of the connection by sending a RST, i.e., initiating a TCP ABORT. Recommended LWN article: SOCK_DESTROY: an old Android patch aims upstream. commit, commit, commit, commit
Add a new address generator mode, using the stable address generator with an automatically generated secret. This is intended as a default address generator mode for device types with no EUI64 implementation. The new generator is used for ARPHRD_NONE interfaces initially, adding default IPv6 autoconf support to e.g. tun interfaces commit
Add support for setting an expiry value on routes commit
ILA: Add generic ILA translation facility to avoid a big performance hit in the receive path. This table can be configured with identifier to locator mappings, and can be queried to resolve a mapping. Queries can be parameterized based on interface, direction (incoming or outgoing), and matching locator. The table is implemented using rhashtable and is configured via netlink (through ip ila .. in iproute) commit
Multi Protocol Label Switching: support for dead routes (RTNH_F_DEAD and RTNH_F_LINKDOWN flags on mpls routes). Also adds code to ignore dead routes during route selection commit
Enable child sockets to inherit the L3 master device index. Enabling this option allows a "global" listen socket to work across L3 master domains (e.g., VRFs) with connected sockets derived from the listen socket to be bound to the L3 domain in which the packets originated. A sysctl setting (tcp_l3mdev_accept) is added to control the behavior which is similar to sk_mark and sysctl_tcp_fwmark_accept commit
SCTP: dynamically enable or disable "potentially failed" state via a sysctl commit
nftables: add netdev packet forwarding support. You can use this to forward packets from ingress to the egress path of the specified interface. This provides a fast path to bounce packets from one interface to another specific destination interface commit
nftables: add netdev packet duplication support. You can use this to duplicate packets and inject them at the egress path of the specified interface. This duplication allows you to inspect traffic from the dummy or any other interface dedicated to this purpose commit
nftables: add byte/packet counter matching support commit
nftables: Allow inverting the limit expression in nf_tables, so overlimit traffic can be throttled commit
nftables: Add support for mangling packet payload. Checksum for the specified base header is updated automatically if requested, however no updates for any kind of pseudo headers are supported, meaning no stateless NAT is supported commit
Add cgroup2 support to iptables commit
meta: Allow redirecting bridged packets to the local machine commit
Add new NFTA_SET_USERDATA attribute to store user data in sets commit
Add configfs for RDMA communication manager (CM) commit
Add cross-channel support, allowing the execution of WQEs that involve synchronization of I/O operations on different QPs. This capability makes it possible to program complex flows with a single function call, thereby significantly reducing the overhead associated with I/O processing commit
Add sysfs files to show attributes of net device and gid type to each GID in the GID table commit
Add RoCE V2 support commit
Add clsact qdisc, a generalization of the ingress qdisc as a qdisc holding only classifiers commit
Add UDP port offload for Ethernet devices commit
Add support for sending to monitor channel system notes as text strings for debugging purposes commit
Add support for controller specific logging to allow userspace to log per controller commit
Add support for the Get Advertising Size Information command, which retrieves size information for the advertising data and scan response data fields depending on the selected flags commit
Add support for the Bluetooth v4.1 Start Limited Discovery command commit
batman-adv: export single hop neighbor list via debugfs commit
nfc: netlink: Add support for missing HCI event EVT_CONNECTIVITY and forward it to userspace commit
SCTP: allow setting SCTP_SACK_IMMEDIATELY by the application commit
Near Field Communication: support ISO14443 Type4A tags commit
ethtool: Add support for phy statistics commit
Add sysctl max_skb_frags to configure the maximum number of fragments per skb commit
12. List of merges
13. Other news sites
linuxfr.org Sortie du noyau Linux 4.5