Linux 4.4 was released on Sun, 10 Jan 2016.

Summary: This release adds 3D support to the virtual GPU driver, which allows 3D hardware-accelerated graphics in virtualization guests; loop device support for Direct I/O and Asynchronous I/O, which saves memory and increases performance; support for Open-channel SSDs, which are devices that share the responsibility of the Flash Translation Layer with the operating system; completely lockless TCP listener handling, which allows for faster and more scalable TCP servers; journalled RAID5 in the MD layer, which fixes the RAID write hole; eBPF programs that can now be run by unprivileged users and made persistent, plus perf support for eBPF programs; a new mlock2() syscall that allows users to request memory to be locked on page fault; and block polling support for improved performance in high-end storage devices. There are also new drivers and many other small improvements.

Contents

Prominent features
Drivers and architectures
Core (various)
File systems
Memory management
Block layer
Cryptography
Security
Tracing and perf tool
Virtualization
Networking
List of merges
Other news sites

1. Prominent features

1.1. Faster and leaner loop device with Direct I/O and Asynchronous I/O support

This release introduces Direct I/O and asynchronous I/O support for the loop block device. Using direct I/O and AIO to read and write the loop device's backing file has several advantages: double caching is avoided thanks to direct I/O, which reduces memory usage considerably; unlike user space direct I/O, there is no cost of pinning pages; and context switches are avoided in some cases because concurrent submissions can be avoided. See the commits for benchmarks.
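
As a rough sketch of how user space might switch an already configured loop device to direct I/O: the ioctl name and number below (LOOP_SET_DIRECT_IO, 0x4C08) and the sysfs attribute path are taken from the kernel's loop driver headers rather than from this page, and the helper names are made up for illustration.

```python
import fcntl
import os

# ioctl request number from <linux/loop.h> (not stated on this page).
LOOP_SET_DIRECT_IO = 0x4C08


def dio_sysfs_path(loop_name):
    """Return the sysfs attribute that reports whether direct I/O is active."""
    return f"/sys/block/{loop_name}/loop/dio"


def set_loop_direct_io(loop_dev, enable=True):
    """Ask the loop driver to use (or stop using) direct I/O on its backing file.

    The loop device must already be bound to a backing file, and the caller
    needs permission to open it. Raises OSError if the kernel refuses.
    """
    fd = os.open(loop_dev, os.O_RDWR)
    try:
        # The argument is passed by value: non-zero enables, zero disables.
        fcntl.ioctl(fd, LOOP_SET_DIRECT_IO, 1 if enable else 0)
    finally:
        os.close(fd)


def loop_uses_direct_io(loop_name):
    """Read the 'dio' attribute, for example with loop_name='loop0'."""
    with open(dio_sysfs_path(loop_name)) as f:
        return f.read().strip() == "1"
```

For instance, set_loop_direct_io("/dev/loop0") followed by loop_uses_direct_io("loop0") would be expected to report the new mode, provided the backing file system and block sizes allow direct I/O.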

Code: commit, commit, commit, commit, commit

1.2. 3D support in virtual GPU driver

virtio-gpu is a driver for virtualization guests that allows them to use the host graphics card efficiently. In this release, it allows the virtualization guest to use the capabilities of the host GPU to accelerate 3D rendering. In practice, this means that a virtualized Linux guest can run an OpenGL game while using the GPU acceleration capabilities of the host, as shown in this or this video. This also requires running QEMU 2.5.

project page

44m linux.conf talk about the project

Code: commit

1.3. LightNVM adds support for Open-Channel SSDs

Open-channel SSDs are devices that share responsibilities with the operating system in order to implement and maintain features that typical SSDs keep strictly in firmware. These include the Flash Translation Layer (FTL), bad block management, and hardware units such as the flash controller, the interface controller, and a large number of flash chips. In this way, Open-channel SSDs expose direct access to their physical flash storage, while keeping a subset of the internal features of SSDs.

LightNVM is a specification that supports Open-channel SSDs. LightNVM allows the host to manage data placement, garbage collection, and parallelism. Device-specific responsibilities such as bad block management, FTL extensions to support atomic IOs, or metadata persistence are still handled by the device. This Linux release adds support for LightNVM (and adds LightNVM support to NVMe as well).

Recommended LWN article: Taking control of SSDs with LightNVM

Code: commit, commit, commit, commit, commit

1.4. TCP listener handling completely lockless, making TCP servers faster and more scalable

In this release, as the result of an effort that started two years ago, the TCP implementation has been refactored to make the TCP listener fast path completely lockless. During tests, a server was able to process 3,500,000 SYN packets per second on one listener and still have available CPU cycles, about 2 to 3 orders of magnitude more than what was possible before. SO_REUSEPORT has also been extended (see Networking section) to add proper CPU/NUMA affinities, so that heavy duty TCP servers can get proper siloing thanks to multi-queue NICs.
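
The per-CPU siloing described above could be set up roughly as follows: one listener per CPU, all sharing a port through SO_REUSEPORT, each tagged with SO_INCOMING_CPU. This is only a sketch. The numeric fallback 49 is the asm-generic value of SO_INCOMING_CPU (some architectures differ), and the function names are hypothetical.

```python
import socket

# Newer Python versions export the constant; 49 is the asm-generic Linux value.
SO_INCOMING_CPU = getattr(socket, "SO_INCOMING_CPU", 49)


def make_cpu_listener(port, cpu, host="0.0.0.0"):
    """Create one TCP listener that shares `port` and prefers flows handled on `cpu`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Let several listeners bind the same address and port.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        # Ask the kernel to deliver packets processed on `cpu` to this socket.
        sock.setsockopt(socket.SOL_SOCKET, SO_INCOMING_CPU, cpu)
        sock.bind((host, port))
        sock.listen(128)
    except OSError:
        sock.close()
        raise
    return sock


def make_listeners(port, cpus):
    """Return {cpu: listener}; each listener would be served by a thread pinned to that CPU."""
    return {cpu: make_cpu_listener(port, cpu) for cpu in cpus}
```

A server would typically call make_listeners(8080, range(os.cpu_count())) and pin one accepting thread per CPU, so that a connection is accepted on the same CPU that handled its packets in softirq.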

Code: commit, commit, commit

1.5. Preliminary journalled RAID5 MD support

This release adds journalled RAID 5 support to the MD (RAID/LVM) layer. With a journal device configured (typically NVRAM or SSD), data and parity are first written to the log and then to the RAID array disks. If a crash happens, data can be recovered from the log. This can speed up RAID resync and fixes the RAID5 write hole issue: a crash during degraded operation cannot result in data corruption. In future releases the journal will also be used to improve performance and latency.
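
The log-then-write ordering can be illustrated with a toy model. This is not the MD on-disk format or code, only a minimal in-memory sketch (two data blocks, one XOR parity block, hypothetical class and method names) showing why replaying the journal closes the write hole.

```python
from functools import reduce


def xor_parity(blocks):
    """RAID5 parity is the byte-wise XOR of the data blocks in a stripe."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class ToyJournalledStripe:
    """One stripe with two data disks, one parity disk, and a journal device."""

    def __init__(self, block_size=4):
        self.data = [bytes(block_size), bytes(block_size)]
        self.parity = xor_parity(self.data)
        self.journal = []  # pending (data_blocks, parity) records

    def write(self, new_data, crash_before_parity=False):
        new_data = list(new_data)
        new_parity = xor_parity(new_data)
        # Step 1: log data and parity before touching the array.
        self.journal.append((new_data, new_parity))
        # Step 2: write to the array disks; a crash here leaves stale parity.
        self.data = new_data
        if crash_before_parity:
            return
        self.parity = new_parity
        # Step 3: the stripe is consistent again, so the record can be dropped.
        self.journal.pop()

    def is_consistent(self):
        return self.parity == xor_parity(self.data)

    def recover(self):
        """Replay whatever is still in the journal after a crash."""
        for logged_data, logged_parity in self.journal:
            self.data, self.parity = list(logged_data), logged_parity
        self.journal.clear()
```

Calling write(..., crash_before_parity=True) leaves data and parity out of sync, which is the write hole; recover() rewrites both from the logged record and restores consistency.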

Code: merge

1.6. Unprivileged eBPF + persistent eBPF programs

Unprivileged eBPF

eBPF programs got their own syscall in Linux 3.18, but until now its use had been restricted to root, because these programs were considered dangerous for security. eBPF programs are, however, validated by the kernel, and in this release the eBPF verifier has been improved and unprivileged users can use it (although unprivileged eBPF is only meaningful for 'socket filter'-like programs; eBPF programs for tracing and TC classifiers/actions will stay root only). This feature can be switched off with the sysctl kernel.unprivileged_bpf_disabled (once set to true, bpf programs and maps cannot be accessed from unprivileged processes, and the toggle cannot be set back to false).
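
A small sketch of inspecting and setting that toggle from a script, assuming the sysctl is exposed at the usual /proc/sys location. The function names are illustrative, and any non-zero value is treated as "disabled".

```python
from pathlib import Path

# Path of the sysctl named in the text, as exposed under /proc/sys.
SYSCTL_PATH = Path("/proc/sys/kernel/unprivileged_bpf_disabled")


def unprivileged_bpf_disabled(path=SYSCTL_PATH):
    """Return True/False for the toggle, or None if the kernel lacks the sysctl."""
    try:
        return int(Path(path).read_text().strip()) != 0
    except FileNotFoundError:
        return None


def disable_unprivileged_bpf(path=SYSCTL_PATH):
    """Set the toggle (needs root). Per the text above, it cannot be cleared again."""
    Path(path).write_text("1\n")
```

Because the switch is one-way until reboot, a provisioning script would normally check unprivileged_bpf_disabled() first and only call disable_unprivileged_bpf() deliberately.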

Recommended LWN article: Unprivileged bpf()

Code: commit, commit

Persistent eBPF maps/progs

This release also adds support for "persistent" eBPF maps/programs. The term "persistent" means that maps/programs have a facility that lets them survive process termination. This is desired by various eBPF subsystem users, for example the tc classifier/action. Whenever tc parses the ELF object and extracts and loads maps/progs into the kernel, these file descriptors are out of reach after the tc instance exits, so a subsequent tc invocation won't be able to access/relocate on this resource, and therefore maps cannot easily be shared, for example between the ingress and egress networking data paths.

To fix issues like these, a new minimal file system has been created that can hold map/prog objects at /sys/fs/bpf/. Any subsequent mounts within a given namespace will point to the same instance. The file system allows for creating a user-defined directory structure. The objects for maps/progs are created/fetched through bpf(2) along with a pathname, using two new commands (BPF_OBJ_PIN/BPF_OBJ_GET) that in turn create the file system nodes. The user can use that to access maps and progs later on, through bpf(2).
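
The two commands might be driven from user space roughly as shown here. This is a sketch under several assumptions: the command values (6 and 7) and the attribute layout come from the kernel's bpf header, the syscall number 321 applies to x86_64 only, and the helper names are invented for illustration.

```python
import ctypes
import os

# Values from <linux/bpf.h>. The syscall number is for x86_64 only (assumption).
BPF_OBJ_PIN = 6
BPF_OBJ_GET = 7
SYS_BPF_X86_64 = 321


class BpfObjAttr(ctypes.Structure):
    """The part of union bpf_attr used by BPF_OBJ_PIN and BPF_OBJ_GET."""
    _fields_ = [
        ("pathname", ctypes.c_uint64),  # user-space pointer to a C string
        ("bpf_fd", ctypes.c_uint32),    # fd to pin; unused for BPF_OBJ_GET
    ]


def _bpf_obj(cmd, path, fd=0):
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    c_path = ctypes.create_string_buffer(os.fsencode(path))
    attr = BpfObjAttr(pathname=ctypes.addressof(c_path), bpf_fd=fd)
    ret = libc.syscall(ctypes.c_long(SYS_BPF_X86_64), ctypes.c_long(cmd),
                       ctypes.byref(attr), ctypes.c_long(ctypes.sizeof(attr)))
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)
    return ret


def pin_object(fd, path):
    """Pin an existing map/program fd at `path` inside the bpf file system."""
    _bpf_obj(BPF_OBJ_PIN, path, fd)


def get_object(path):
    """Return a new fd for the map/program pinned at `path`."""
    return _bpf_obj(BPF_OBJ_GET, path)
```

With the file system mounted at /sys/fs/bpf/, one process could call pin_object(map_fd, "/sys/fs/bpf/my_map") and a later process could call get_object("/sys/fs/bpf/my_map") to obtain its own descriptor; the name my_map is only an example.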

Code: commit, commit

1.7. perf + eBPF integration

In this release, eBPF programs have been integrated with perf. When perf is given an eBPF .c source file (or a .o file built for the 'bpf' target with clang), the program is automatically built, validated and loaded into the kernel, and can then be used and seen using perf trace and other tools.

Users can pass a BPF object as an event, as in # perf record --event ./hello_world.o ls, and the eBPF program is attached to a newly created perf event that works with all tools.

Code: commit, commit, commit, commit, commit, commit, commit, commit, commit, commit

1.8. Block polling support

This release adds basic support for polling for specific IO to complete, which can improve latency and throughput in very fast devices. Currently O_DIRECT sync read/write are supported. This support is only intended for testing; in future releases, stats tracking will be used to auto-tune it. For now, for benchmark and testing purposes, a sysfs file (io_poll) controls whether polling is enabled or not.
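
For benchmarking scripts, the switch can be flipped with a few lines of code. The sketch below assumes the io_poll file lives in the device's queue directory under /sys/block, a location taken from the block layer's sysfs layout and not stated above; the sysfs_root parameter exists only so the helpers can be exercised against a scratch directory.

```python
from pathlib import Path


def io_poll_path(disk, sysfs_root="/sys"):
    """Location of the polling switch for a block device such as 'nvme0n1'."""
    return Path(sysfs_root) / "block" / disk / "queue" / "io_poll"


def get_io_poll(disk, sysfs_root="/sys"):
    """Return True if polling is enabled for `disk`."""
    return io_poll_path(disk, sysfs_root).read_text().strip() == "1"


def set_io_poll(disk, enabled, sysfs_root="/sys"):
    """Enable or disable polling; writing to sysfs normally needs root."""
    io_poll_path(disk, sysfs_root).write_text("1" if enabled else "0")
```

A benchmark run would typically call set_io_poll("nvme0n1", True), run the O_DIRECT workload, and then restore the previous value; the device name here is only an example.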

Recommended LWN article: Block-layer I/O polling

Code: commit, commit, commit

1.9. mlock2() syscall allows users to request memory to be locked on page fault

mlock() allows a user to control the paging out of program memory, but this comes at the cost of faulting in the entire mapping when it is allocated. For large mappings this is not ideal. For example, security applications that need mlock() are forced to lock an entire buffer, no matter how big it is. Likewise, applications that work with large graphical models, where the path through the graph is not known until run time, are forced to lock the entire graph or lock page by page as pages are faulted in.

The new mlock2() syscall creates a middle ground. Pages are marked to be placed on the unevictable LRU (locked) when they are first used, but they are not faulted in by the mlock call. The new system call takes a flags argument along with the start address and size. This flags argument gives the caller the ability to request memory be locked in the traditional way, or to be locked after the page is faulted in. New calls are added for munlock() and munlockall() which give the caller a way to specify which flags are supposed to be cleared. A new MCL flag is added to mirror the lock on fault behavior from mlock() in mlockall(). Finally, a flag for mmap() is added that allows a user to specify that the covered area should not be paged out, but only after the memory has been used the first time.
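
A minimal sketch of requesting lock-on-fault for an anonymous mapping, covering only the mlock2() call itself. It invokes the raw syscall because older C libraries have no wrapper; the flag value (MLOCK_ONFAULT = 1) and the syscall numbers for x86_64 and aarch64 come from kernel headers rather than from this page, and the helper names are hypothetical.

```python
import ctypes
import os
import platform

MLOCK_ONFAULT = 0x01  # from the kernel's mman headers
# mlock2 syscall numbers; only these two architectures are covered here.
_SYS_MLOCK2 = {"x86_64": 325, "aarch64": 284}


def mlock2_number(machine=None):
    """Return the mlock2 syscall number for `machine`, or None if unknown here."""
    return _SYS_MLOCK2.get(machine or platform.machine())


def lock_on_fault(mapping):
    """Mark a writable mmap object so its pages are locked as they are first touched."""
    nr = mlock2_number()
    if nr is None:
        raise NotImplementedError("syscall number not listed for this machine")
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # Address of the first byte of the mapping.
    addr = ctypes.addressof(ctypes.c_char.from_buffer(mapping))
    ret = libc.syscall(ctypes.c_long(nr), ctypes.c_void_p(addr),
                       ctypes.c_size_t(len(mapping)), ctypes.c_int(MLOCK_ONFAULT))
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

Typical use would be buf = mmap.mmap(-1, size) followed by lock_on_fault(buf): the call returns without faulting anything in, and each page becomes unevictable only once it is first touched. The request still counts against RLIMIT_MEMLOCK, so it can fail with ENOMEM for large mappings.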

Recommended LWN article: Deferred memory locking

Code: commit, commit, commit, commit

2. Drivers and architectures

All the driver and architecture-specific changes can be found in the Linux_4.4-DriversArch page

3. Core (various)

process scheduler: Apply a frequency scaling correction factor to per-entity load tracking to make it invariant with respect to CPU frequency. Currently, load appears bigger when the CPU is running at slower frequencies, which affects load-balancing decisions commit, commit
seccomp: add support for dumping a process' (classic BPF) seccomp filters via ptrace + PTRACE_SECCOMP_GET_FILTER commit
watchdog: Mimic the softlockup_panic kernel knob and create a /proc/sys/kernel/hardlockup_panic. It enables a hardlockup to panic the machine commit
watchdog: optionally perform all-CPU backtrace in case of hard lockup. Can be enabled with sysctl /proc/sys/kernel/hardlockup_all_cpu_backtrace commit
coredump: Add two new flags to the existing coredump mechanism for ELF and FDPIC ELF files to allow us to explicitly filter DAX mappings. This is desirable because DAX mappings, like hugetlb mappings, have the potential to be very large commit, commit
test_printf: test printf family at runtime commit
Make sync_file_range(2) use WB_SYNC_NONE writeback. It helps PostgreSQL avoid large latency spikes when flushing data in the background commit

4. File systems

XFS
- Add per-filesystem stats in /sys/fs/xfs/<block>/stats/stats, and a stats_clear file to clear them. Also, the global stats that are currently present in /proc are duplicated in /sys/fs/xfs/stats/stats (along with a stats_clear file) commit, commit, commit
Btrfs
- Add fragment debug mount option. It can be used to cause extreme fragmentation in data, metadata or both commit
- Add balance filter for stripes. This is useful to selectively rebalance only chunks that do not span enough devices, applies to RAID0/10/5/6. commit
CIFS
- Allow duplicate extents (cp --reflink) in SMB3.0 not just SMB3.1.1 commit
- Add resilienthandles mount parameter. Since many servers (Windows clients, and non-clustered servers) do not support persistent handles but do support resilient handles, allow the user to specify a mount option "resilienthandles" in order to get more reliable connections and less chance of data loss (at least with SMB2.1 or later). The default resilient handle timeout (120 seconds to recent Windows server) is used commit
- Add support for persistent handles, which are like durable file handles with strong guarantees commit, commit, commit
- Allow copy offload (copychunk) across shares commit
NFS
- Support for NFSv4.2 file CLONE using the btrfs ioctl commit, commit, commit, commit, commit
ext4
- Store checksum seed in superblock commit
OCFS2
- Improve performance for localalloc commit
UBIFS
- atime support commit

5. Memory management

Get rid of vmalloc_info from /proc/meminfo. It is too expensive to calculate and shows up in real workloads; people who actually want to know the state of the vmalloc area should look at the much more complete /proc/vmallocinfo instead commit
Add HugetlbPages field to /proc/PID/status. Currently there's no easy way to get per-process usage of hugetlb pages, which is inconvenient because userspace applications which use hugetlb may need it commit
Add hugetlb-related fields to /proc/PID/smaps to know per-task or per-vma base hugetlb usage: AnonHugePages shows the amount of memory backed by transparent hugepages; Shared_Hugetlb and Private_Hugetlb show the amounts of memory backed by hugetlbfs pages, which are not counted in the RSS or PSS fields for historical reasons. They are not included in the {Shared,Private}_{Clean,Dirty} fields either commit
memcontrol: eliminate memory.current on the root level, because it doesn't add anything that wouldn't be more accurate and detailed using system statistics commit
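
As an illustration of the new HugetlbPages field in /proc/PID/status listed in this section, a monitoring script could read it as follows. This is a sketch; the function names are made up, and the parser assumes the usual "Name: value kB" layout of the status file.

```python
def parse_hugetlb_kb(status_text):
    """Extract the HugetlbPages value (in kB) from /proc/PID/status contents."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "HugetlbPages":
            # The value looks like "    2048 kB".
            return int(value.split()[0])
    return None  # kernels without the field do not print it


def hugetlb_kb(pid="self"):
    """Read the current hugetlb usage of a process, or None if unavailable."""
    with open(f"/proc/{pid}/status") as f:
        return parse_hugetlb_kb(f.read())
```

Calling hugetlb_kb() with no argument reports the usage of the calling process itself.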

6. Block layer

Block polling support commit, commit, commit
loop: direct and asynchronous I/O commit, commit, commit, commit, commit
Add Persistent Reservations support. It includes a user space interface for simplified Persistent Reservations which map to block devices that support these (only SCSI for now). Persistent Reservations allow restricting access to block devices to specific initiators in a shared storage setup commit, commit, commit
Export integrity data interval size in /sys/block/<disk>/integrity/protection_interval_bytes, so that apps can tell whether the interval is different from the device's logical block size commit
cdrom: Random writing support for BD-RE media commit

7. Cryptography

crypto: caam - add support for acipher xts(aes) commit
crypto: keywrap - add key wrapping block chaining mode commit
crypto: qat - add support for ctr(aes) and xts(aes) commit

8. Security

TPM: Support TPM 2.0 chips commit, commit

9. Tracing and perf tool

Integration of perf with eBPF that, given an eBPF .c source file (or .o file built for the 'bpf' target with clang), will get it automatically built, validated and loaded into the kernel via the sys_bpf syscall, which can then be used and seen using 'perf trace' and other tools. Users can run commands like perf record --event bpf-file.c ls to try it commit, commit, commit, commit, commit, commit, commit, commit, commit, commit
Add a new branch type sampling filter to perf record, named 'call' (perf record -j call -e cycles .....), that samples only call branches (function calls), unlike 'any_call' that included direct, indirect calls and far jumps. Only x86 and PowerPC are supported in this release commit, commit
Add Intel cstate (aka idle states) Performance Monitoring Unit support. This allows perf to support cstate related free running (read-only and system-wide) counters. For example, to calculate the fraction of time when the core is running in C6 state: perf stat -x, -e"cstate_core/c6-residency/,msr/tsc/" -C0 -- taskset -c 0 sleep 5 commit
CPU socket filtering: perf tools introduce a new sort type "socket" for the processor socket, e.g. perf report --stdio --sort socket,comm,dso,symbol commit. Also, perf report introduces a --socket-filter option to only show entries for a processor socket that matches this filter commit. The perf hists browser can zoom in/out for a processor socket commit
perf tools: Introduce 'P' modifier, it will cause the event to get maximum possible detected precise level. For example, perf record -e cycles:P ... will detect maximum precise level for 'cycles' event and use it commit
perf tools: Add support for sorting on the iaddr. New sort option is: symbol_iaddr, header label is 'Code Symbol', eg perf mem report --stdio -F +symbol_iaddr commit
perf tools: enables config terms for tracepoint perf events. Valid terms for tracepoint events are 'call-graph' and 'stack-size', so different callgraph settings can be used for each event and eliminate unnecessary overhead. An example for using different call-graph config for each tracepoint: perf record -e syscalls:sys_enter_write/call-graph=fp -e syscalls:sys_exit_write/call-graph=no dd if=/dev/zero of=test bs=4k count=10 commit
perf script: Enable printing of branch stack via the 'brstack' and 'brstacksym' arguments to the field selection option -F. The option is off by default and operates only if the perf.data file has branch stack content commit
perf auxtrace: Add AUX area tracing option 'l' to synthesize branch stacks on samples just like sample type PERF_SAMPLE_BRANCH_STACK commit
perf hists browser: Add 'm' key for context menu display commit
perf inject: Add --strip option which is used with --itrace to strip out non-synthesized events commit
perf script: Allow time to be displayed in nanoseconds commit
Intel PT hardware tracer: Accept a zero --itrace period, meaning "as often as possible". In the case of Intel PT that is the same as a period of 1 and a unit of 'instructions' (i.e. --itrace=i1i) commit
Intel PT: Add support for generating branch stack context for PT samples. This is useful for reporting accurate basic block edge frequencies through the perf report branch view, or for use with --branch-history to get the wider context of samples. For example, to record with Intel PT: perf record -e intel_pt//u ls
ftrace: add module globbing commit

10. Virtualization

Support for VT-d posted interrupts (i.e. PCI devices can inject interrupts directly into vCPUs). Used by KVM and VFIO commit
KVM: Nested virtualization now supports VPID (same as PCID but for vCPUs) which makes it quite a bit faster commit, commit, commit
KVM: Support for "split irqchip", i.e. LAPIC in kernel and IOAPIC/PIC/PIT in userspace, which reduces the attack surface of the hypervisor commit, commit, commit
KVM: add capability for any-length ioeventfds. With KVM_CAP_IOEVENTFD_ANY_LENGTH, a zero length ioeventfd is allowed, and the kernel will ignore the length of guest write and may get a faster vmexit commit
VMware balloon: Get notified immediately via VMCI when a balloon target is set, instead of waiting for up to one second commit
VMware balloon: Support ballooning with 2 MB sized pages. It significantly reduces the hypervisor side (and guest side) overhead of ballooning and unballooning commit
VMware vmxnet3: Extend register dump support commit

11. Networking

Lockless TCP listener commit, commit, commit
Add setsockopt() support for SO_INCOMING_CPU and extend the SO_REUSEPORT selection logic: if a TCP listener or UDP socket has this option set, a packet is delivered to this socket only if the CPU handling the packet matches the specified one. This makes it possible to build very efficient TCP servers, using one listener per RX queue, as the associated TCP listener should only accept flows handled in softirq by the same CPU. This provides optimal NUMA behavior and keeps CPU caches hot commit, commit
TCP: Recent ACK (RACK) loss recovery. RACK loss recovery uses the notion of time instead of packet sequence (FACK) or counts (dupthresh) (see commit for details). RACK is being developed to replace or consolidate FACK/dupthresh, early retransmit, and thin-dupack. Since it is still experimental, in the current patch set it is only used as a supplemental loss detection on top of the existing algorithms and does not trigger fast recovery. It can be disabled with sysctl net.ipv4.tcp_recovery commit
IP Virtual Server: Support scheduling of ICMP packets to IPVS instances. A new sysctl net.ipv4.vs.schedule_icmp has been introduced that enables this feature if set to 1 (by default it is set to 0 to retain the old behaviour) merge commit
IP Virtual Server: Allow ignoring tunnelled packets with the new sysctl net.ipv4.vs.ignore_tunneled. If set, ipvs will set the ipvs_property on all packets which are of unrecognised protocols. This prevents the kernel from routing tunnelled protocols like ipip, which is useful to prevent rescheduling packets that have been tunnelled to the ipvs host (i.e. to prevent ipvs routing loops when ipvs is also acting as a real server) commit
Provide FIB table ID in ipv4 route dumps just as ipv6 does commit
IPv4: Hash-based multipath routing. When the routing cache was removed in 3.6, the IPv4 multipath algorithm changed from more or less being destination-based into being quasi-random per-packet scheduling. This increased the risk of out-of-order packets and made it impossible to use multipath together with anycast services. In this release, the multipath routing implementation is replaced with a flow-based load balancing based on a hash over the source and destination addresses merge commit
IPv6 support to the Virtual Routing and Forwarding (VRF) devices commit, commit, commit
IPv4: Currently, adding a new ipv4 address always causes the creation of the related network route, with the default metric. Add support for IFA_F_NOPREFIXROUTE for ipv4 addresses. When an address is added with this flag set, no associated network route is created, no network route is deleted when said IP is gone, and it's up to user space to manage such routes commit
IPv6: gro: support sit protocol commit
Allow the user to ask for the statistics to be filtered out of ipv4/ipv6 address netlink dumps, because many commonly used functions like getifaddrs() invoke RTM_GETLINK to dump the interface information, and do not need the AF_INET6 statistics, which are expensive to calculate commit
bridge: Allow setting the bridge attribute ageing_time in rocker and switchdev commit, commit, commit
vxlan: support both IPv4 and IPv6 sockets in a single vxlan device commit
bridge: complete the bridge device's netlink support, making it possible to view and configure everything that can be configured via sysfs commit
bridge: Enable adding fdb entries pointing to the bridge device. This can be used to propagate mac address of vlan interfaces configured on top of the vlan filtering bridge commit
Multi Protocol Label Switching (MPLS): Add support for multipath routes commit, commit
bonding: support encapsulated ipv6 TSO commit
Add support for filtering neighbor dumps by master device by adding the NDA_MASTER attribute to the dump request. A new netlink flag, NLM_F_DUMP_FILTERED, is added to indicate the kernel supports the request and output is filtered as requested commit
Add support for filtering neighbor dumps by device by adding the NDA_IFINDEX attribute to the dump request commit
Support for disabling certain features on devices which, when disabled on an upper device, such as a bonding master or a bridge, must be disabled and cannot be re-enabled on underlying devices commit
Introduce L3 Master device abstraction support. It provides glue between core networking code and device drivers to support L3 master devices like VRF commit
dummy: add more features commit
tso: add support for IPv6 commit
netfilter: nfnetlink_log: allow including the conntrack information together with the packet that is sent to user space via NFLOG, so that a user-space program can acquire NATed information through the NFULA_CT attribute commit
Wireless
- Allow changing station capabilities for unassociated stations commit
- Implement Very High Throughput support for mesh networks commit
- Make CRDA support optional commit
- Advertise support for full station state in AP mode commit
- Put current TX power in interface info replies commit
- Enable wiphy device to suspend/resume asynchronously commit
ieee802154: experimental netlink support commit
ieee802154: 6lowpan: add tx/rx stats commit
ipconfig: Allow sending a Client-identifier in DHCP requests with something like ip=dhcp,client_id_type, client_id_value as a kernel parameter, so that the kernel can identify itself to the server commit
Add netlink directives and an ndo entry to trust a VF user. This controls the special permissions of the VF user. The administrator can explicitly trust a VF user to use some features which impact security and/or performance commit
IB: Add support of checksum capability reporting for RC and RAW commit
IB: Add support for network namespaces commit, commit, commit
openvswitch: Add netlink attributes for IPv6 tunnel addresses. This enables IPv6 support for tunnels commit
switchdev: Add support for flood control commit, commit
TIPC: introduce jumbo frame support for broadcast commit
xprtrdma: Enable swap-on-NFS/RDMA commit
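
The hash-based IPv4 multipath entry in this section replaces per-packet next-hop selection with per-flow selection. The idea can be illustrated with a toy selector: it uses CRC32 as a stand-in hash (not the kernel's hash function) over the source and destination addresses, and the next-hop names and weights are made up.

```python
import ipaddress
import zlib


def pick_nexthop(src, dst, nexthops):
    """Choose a next hop from (name, weight) pairs using only the flow's addresses."""
    key = ipaddress.ip_address(src).packed + ipaddress.ip_address(dst).packed
    total = sum(weight for _, weight in nexthops)
    slot = zlib.crc32(key) % total  # stand-in hash, not the kernel's
    # Walk the weighted list until the slot falls inside a next hop's share.
    for name, weight in nexthops:
        if slot < weight:
            return name
        slot -= weight
```

Because the choice depends only on the address pair, every packet of a given flow takes the same path, which avoids the reordering that per-packet scheduling can cause and keeps anycast services usable with multipath.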

12. List of merges

13. Other news sites

LWN merge window part 1 and part 2
Phoronix A Look At The New Features Of The Linux 4.4 Kernel