KernelNewbies:

Linux 4.4 [https://lkml.org/lkml/2016/1/10/305 has been released] on Sun, 10 Jan 2016.

Summary: This release adds 3D support to the virtual GPU driver, which allows 3D hardware-accelerated graphics in virtualization guests; loop device support for Direct I/O and Asynchronous I/O, which saves memory and increases performance; support for Open-channel SSDs, which are devices that share the responsibility of the Flash Translation Layer with the operating system; completely lockless TCP listener handling, which allows for faster and more scalable TCP servers; journalled RAID5 in the MD layer, which fixes the RAID write hole; eBPF programs that can be run by unprivileged users and made persistent, plus eBPF support in perf; a new mlock2() syscall that allows users to request memory to be locked on page fault; and block polling support for improved performance in high-end storage devices. There are also new drivers and many other small improvements.


1. Prominent features

1.1. Faster and leaner loop device with Direct I/O and Asynchronous I/O support

This release introduces support for Direct I/O and asynchronous I/O in the loop block device. There are several advantages to using direct I/O and AIO to read and write the loop device's backing file: double caching is avoided, which greatly reduces memory usage; unlike user-space direct I/O, there is no cost of pinning pages; and context switches are avoided in some cases, because concurrent submissions can be avoided. See the commits for benchmarks.
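A program can switch an already-configured loop device to direct I/O with the LOOP_SET_DIRECT_IO ioctl introduced by this series. The following is a minimal C sketch, assuming a 4.4 or later kernel and UAPI headers that define LOOP_SET_DIRECT_IO; the helper name loop_set_direct_io() is hypothetical. The kernel may refuse the switch (returning an error) if the backing file's file system or the loop offset does not allow direct I/O.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/loop.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: enable (enable != 0) or disable direct I/O on a loop
 * device that is already bound to a backing file.
 * Returns 0 on success or a negative errno value. */
int loop_set_direct_io(const char *loop_dev, int enable)
{
    assert(loop_dev != NULL);

    int fd = open(loop_dev, O_RDWR | O_CLOEXEC);
    if (fd < 0)
        return -errno;

    /* The ioctl argument is passed by value: nonzero requests DIO/AIO
     * against the backing file, zero goes back to buffered I/O. */
    int ret = 0;
    if (ioctl(fd, LOOP_SET_DIRECT_IO, (unsigned long)(enable != 0)) < 0)
        ret = -errno;   /* captured before close() can overwrite errno */

    close(fd);
    return ret;
}
```

For example, loop_set_direct_io("/dev/loop0", 1) after /dev/loop0 has been bound to a backing file with losetup.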

Code: [https://git.kernel.org/torvalds/c/ab1cb278bc7027663adbfb0b81404f8398437e11 commit], [https://git.kernel.org/torvalds/c/2e5ab5f379f96a6207c45be40c357ebb1beb8ef3 commit], [https://git.kernel.org/torvalds/c/5b5e20f421c0b6d437b3dec13e53674161998d56 commit], [https://git.kernel.org/torvalds/c/bc07c10a3603a5ab3ef01ba42b3d41f9ac63d1b6 commit], [https://git.kernel.org/torvalds/c/e03a3d7a94e2485b6e2fa3fb630b9b3a30b65718 commit]

1.2. 3D support in virtual GPU driver

virtio-gpu is a driver for virtualization guests that allows them to use the host graphics card efficiently. In this release, it allows the virtualization guest to use the capabilities of the host GPU to accelerate 3D rendering. In practice, this means that a virtualized Linux guest can run an OpenGL game while using the GPU acceleration capabilities of the host, as shown in [https://www.youtube.com/watch?v=ONFGnUaln-4 this] or [https://www.youtube.com/watch?v=ZuuF092RDDc this] video. This also requires running [http://wiki.qemu.org/ChangeLog/2.5#virtio QEMU 2.5].

[https://virgil3d.github.io/ project page]

[https://www.youtube.com/watch?v=rPeMrmeLTig 44-minute linux.conf talk about the project]

Code: [https://git.kernel.org/torvalds/c/3187567222178d4b3742e88242f7abb3c3b7a215 commit]

1.3. LightNVM adds support for Open-Channel SSDs

Open-channel SSDs are devices that share responsibilities with the operating system in order to implement and maintain features that typical SSDs keep strictly in firmware. These include the Flash Translation Layer (FTL), bad block management, and hardware units such as the flash controller, the interface controller, and a large number of flash chips. In this way, Open-channel SSDs expose direct access to their physical flash storage, while keeping a subset of the internal features of SSDs.

LightNVM is a specification that provides support for Open-channel SSDs. LightNVM allows the host to manage data placement, garbage collection, and parallelism. Device-specific responsibilities such as bad block management, FTL extensions to support atomic I/Os, or metadata persistence are still handled by the device. This release adds LightNVM support to Linux, and adds LightNVM support to the NVMe driver as well.
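To make the division of labor concrete, here is a toy host-side, page-mapped FTL in C. It is not LightNVM's kernel interface: the structure, sizes and function names are hypothetical. It only shows the kind of decisions the host takes over on an Open-channel SSD: where each logical page is physically placed, and which stale copies garbage collection may later reclaim.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_LOGICAL  8    /* hypothetical sizes, for illustration only */
#define NUM_PHYSICAL 16
#define UNMAPPED     UINT32_MAX

enum page_state { PAGE_FREE, PAGE_VALID, PAGE_INVALID };

struct toy_ftl {
    uint32_t l2p[NUM_LOGICAL];           /* logical -> physical map, kept by the host */
    enum page_state state[NUM_PHYSICAL]; /* per physical flash page */
    uint32_t next_free;                  /* pages are programmed append-only */
};

void ftl_init(struct toy_ftl *f)
{
    for (int i = 0; i < NUM_LOGICAL; i++)
        f->l2p[i] = UNMAPPED;
    for (int i = 0; i < NUM_PHYSICAL; i++)
        f->state[i] = PAGE_FREE;
    f->next_free = 0;
}

/* Out-of-place write: flash pages cannot be overwritten in place, so every
 * write goes to a fresh physical page and the previous copy becomes garbage.
 * Returns the physical page used, or UNMAPPED when no free page is left
 * (a real FTL would run garbage collection at that point). */
uint32_t ftl_write(struct toy_ftl *f, uint32_t lpn)
{
    assert(lpn < NUM_LOGICAL);
    if (f->next_free >= NUM_PHYSICAL)
        return UNMAPPED;

    uint32_t old = f->l2p[lpn];
    if (old != UNMAPPED)
        f->state[old] = PAGE_INVALID;    /* reclaimable by garbage collection */

    uint32_t ppn = f->next_free++;
    f->state[ppn] = PAGE_VALID;
    f->l2p[lpn] = ppn;
    return ppn;
}

/* Translate a logical page to its current physical page (or UNMAPPED). */
uint32_t ftl_lookup(const struct toy_ftl *f, uint32_t lpn)
{
    assert(lpn < NUM_LOGICAL);
    return f->l2p[lpn];
}
```

A real host FTL would also stripe writes across channels and dies for parallelism and erase whole blocks during garbage collection; both are left out of this sketch.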

Recommended LWN article: [https://lwn.net/Articles/641247/ Taking control of SSDs with LightNVM]

Code: [https://git.kernel.org/torvalds/c/48add0f5a6f46919dd307575aad6ea3de7c9cb2a commit], [https://git.kernel.org/torvalds/c/cd9e9808d18fe7107c306f6e71c8be7230ee42b4 commit], [https://git.kernel.org/torvalds/c/ca0640850e43f5f80c6029e2895b119b705f23bd commit], [https://git.kernel.org/torvalds/c/b2b7e00148a203e9934bbd17aebffae3f447ade7 commit], [https://git.kernel.org/torvalds/c/ae1519ec448bc31a7fe7369b66e7c78872f91e84 commit]

1.4. TCP listener handling completely lockless, making TCP servers faster and more scalable

In this release, as the result of an effort that started two years ago, the TCP implementation has been refactored to make the TCP listener fast path completely lockless. During tests, a server was able to process 3,500,000 SYN packets per second on one listener and still have available CPU cycles: about two to three orders of magnitude more than was possible before. SO_REUSEPORT has also been extended (see Networking section) to add proper CPU/NUMA affinities, so that heavy-duty TCP servers can get proper siloing thanks to multi-queue NICs.
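Linux 4.4 also allows SO_INCOMING_CPU, previously a getsockopt()-only option, to be set with setsockopt(); when several SO_REUSEPORT listeners share a port, the kernel then prefers the listener whose CPU matches the CPU handling the incoming packet. Below is a minimal C sketch of one such listener, assuming this is the CPU-affinity extension referred to above; the helper names reuseport_listener() and local_port() are hypothetical.

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Close fd and return the errno of the call that failed before it. */
static int fail_close(int fd)
{
    int err = errno;
    close(fd);
    return -err;
}

/* Hypothetical helper: open a TCP listener on 127.0.0.1:port with
 * SO_REUSEPORT, so several sockets (e.g. one per worker) can share the port.
 * port 0 picks an ephemeral port.  If cpu >= 0, also set SO_INCOMING_CPU so
 * that SO_REUSEPORT selection prefers this socket for that CPU.
 * Returns the listening fd or a negative errno value. */
int reuseport_listener(uint16_t port, int cpu)
{
    int fd = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, 0);
    if (fd < 0)
        return -errno;

    int one = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0)
        return fail_close(fd);

    /* Best effort: kernels older than 4.4 reject setting this option. */
    if (cpu >= 0)
        (void)setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, sizeof(cpu));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return fail_close(fd);
    if (listen(fd, SOMAXCONN) < 0)
        return fail_close(fd);
    return fd;
}

/* Local port of a bound socket in host byte order, or 0 on error. */
uint16_t local_port(int fd)
{
    assert(fd >= 0);
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
        return 0;
    return ntohs(addr.sin_port);
}
```

A server would typically create one listener per CPU, passing that CPU's index, and run the accept() loop for it in a thread pinned to the same CPU, so that each connection is set up and served on the CPU whose NIC queue received it.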

Code: [https://git.kernel.org/torvalds/c/4d54d86546f62c7c4a0fe3b36a64c5e3b98ce1a9 commit], [https://git.kernel.org/torvalds/c/e6934f3ec00b04234acb24a1a2c28af59763d3b5 commit], [https://git.kernel.org/torvalds/c/c3fc7ac9a0b978ee8538058743d21feef25f7b33 commit]

1.5. Journalled RAID5 MD support

This release adds journalled RAID5 support to the MD (software RAID) layer. With a journal device configured (typically NVRAM or an SSD), data and parity writes to the RAID array are first written to the log, and only then to the RAID array disks. If a crash happens, the data can be recovered from the log. This can speed up RAID resync and fixes the RAID5 write hole issue: a crash during degraded operation cannot result in data corruption. In future releases the journal will also be used to improve performance and latency.
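The ordering that closes the write hole can be illustrated with a toy model in C: one stripe of three one-byte data chunks plus an XOR parity chunk, and a one-entry journal. This is not MD's code; the types and function names are hypothetical, and the journal here is a flag rather than a log on a separate device. Without the journal, a crash between the data write and the parity write leaves the parity stale, so a later disk failure would reconstruct wrong data; with it, recovery simply replays the complete stripe from the log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DATA_DISKS 3   /* toy array: 3 data disks + 1 parity disk, 1 byte each */

struct stripe {
    uint8_t data[DATA_DISKS];
    uint8_t parity;
};

struct journal {
    bool valid;            /* set only once the whole entry is durable */
    struct stripe entry;   /* full new stripe: data and parity */
};

static uint8_t compute_parity(const uint8_t *data)
{
    uint8_t p = 0;
    for (int i = 0; i < DATA_DISKS; i++)
        p ^= data[i];
    return p;
}

bool stripe_consistent(const struct stripe *s)
{
    return compute_parity(s->data) == s->parity;
}

/* Update one data chunk.  crash_before_parity simulates power loss after the
 * data chunk reached its disk but before the parity chunk did. */
void journalled_write(struct stripe *array, struct journal *log,
                      int disk, uint8_t value, bool crash_before_parity)
{
    assert(disk >= 0 && disk < DATA_DISKS);

    /* 1. Build the new stripe and persist it in the journal first. */
    struct stripe next = *array;
    next.data[disk] = value;
    next.parity = compute_parity(next.data);
    log->entry = next;
    log->valid = true;

    /* 2. Write data, then parity, to the member disks (not atomic together). */
    array->data[disk] = next.data[disk];
    if (crash_before_parity)
        return;                 /* power loss: on-disk parity is now stale */
    array->parity = next.parity;

    /* 3. Both writes landed, so the journal entry can be retired. */
    log->valid = false;
}

/* Recovery after a crash: replay any complete journal entry onto the array. */
void journal_replay(struct stripe *array, struct journal *log)
{
    if (log->valid) {
        *array = log->entry;
        log->valid = false;
    }
}
```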

Code: [https://git.kernel.org/torvalds/c/ac322de6bf5416cb145b58599297b8be73cd86ac merge]

1.6. Unprivileged eBPF + persistent eBPF programs

Unprivileged eBPF

eBPF got its own bpf() syscall in [http://kernelnewbies.org/Linux_3.18#head-ead251efb6bbdbe2922e7c6bd0c7b46342e03dad Linux 3.18], but until now its use had been restricted to root, because these programs could be dangerous for security. eBPF programs are, however, validated by the kernel, and in this release the eBPF verifier has been improved so that unprivileged users can use it. Unprivileged eBPF is only meaningful for 'socket filter'-like programs; eBPF programs for tracing and TC classifiers/actions stay root-only. This feature can be switched off with the sysctl kernel.unprivileged_bpf_disabled: once it is set to true, bpf programs and maps cannot be accessed from unprivileged processes, and the toggle cannot be set back to false.
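The unprivileged path can be exercised with a minimal C sketch that hand-encodes a two-instruction eBPF socket filter, loads it through the raw bpf(2) syscall, and attaches it to a socket with SO_ATTACH_BPF. It assumes kernel.unprivileged_bpf_disabled is 0 (on many current systems unprivileged eBPF is disabled) and UAPI headers providing linux/bpf.h; the helper names are hypothetical, and real programs would normally use a loader library rather than raw instructions.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Program "r0 = 0; exit".  A socket filter returns how many bytes of the
 * packet to keep, so returning 0 drops every packet. */
void build_drop_all(struct bpf_insn prog[2])
{
    assert(prog != NULL);
    memset(prog, 0, 2 * sizeof(*prog));
    prog[0].code = BPF_ALU64 | BPF_MOV | BPF_K;  /* r0 = imm */
    prog[0].dst_reg = BPF_REG_0;
    prog[0].imm = 0;
    prog[1].code = BPF_JMP | BPF_EXIT;           /* return r0 */
}

/* Load the program as a socket filter; the kernel verifier checks it first.
 * Returns a program fd or a negative errno value. */
int load_socket_filter(const struct bpf_insn *prog, unsigned int count)
{
    union bpf_attr attr;
    memset(&attr, 0, sizeof(attr));   /* unused attr bytes must be zero */
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns = (uint64_t)(uintptr_t)prog;
    attr.insn_cnt = count;
    attr.license = (uint64_t)(uintptr_t)"GPL";

    long fd = syscall(SYS_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
    return fd < 0 ? -errno : (int)fd;
}

/* Attach a loaded program to a socket.  Returns 0 or a negative errno. */
int attach_socket_filter(int sock, int prog_fd)
{
    if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd)) < 0)
        return -errno;
    return 0;
}
```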

Recommended LWN article: [http://lwn.net/Articles/660331/ Unprivileged bpf()]

Code: [https://git.kernel.org/torvalds/c/1be7f75d1668d6296b80bf35dcf6762393530afc commit], [https://git.kernel.org/torvalds/c/aaac3ba95e4c8b496d22f68bd1bc01cfbf525eca commit]

Persistent eBPF maps/progs

This release also adds support for "persistent" eBPF maps/programs. Here, "persistent" means that maps/programs have a facility that lets them survive the termination of the process that created them. Various eBPF subsystem users want this, for example the tc classifier/action. Whenever tc parses an ELF object, extracts maps/progs and loads them into the kernel, the resulting file descriptors are out of reach once the tc instance exits, so a subsequent tc invocation cannot access or relocate these resources, and therefore maps cannot easily be shared, e.g. between the ingress and egress networking data paths.

To fix issues such as these, a new minimal file system has been created that can hold map/prog objects at /sys/fs/bpf/. Any subsequent mounts within a given namespace will point to the same instance. The file system allows for creating a user-defined directory structure. Map/prog objects are pinned and fetched through bpf(2) with a pathname, using two new commands (BPF_OBJ_PIN/BPF_OBJ_GET); pinning creates the corresponding file system node. The user can then use these nodes to access maps and progs later on, through bpf(2).
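The pin/get flow can be sketched in C with the raw bpf(2) syscall: create a small array map, pin it under the bpf file system with BPF_OBJ_PIN, and reopen it by path with BPF_OBJ_GET, for example from a later tc invocation. This is a minimal sketch, assuming the bpf file system is mounted (for instance with mount -t bpf bpf /sys/fs/bpf) and that the caller is allowed to use bpf(2); the helper names and the map layout are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static long bpf_call(int cmd, union bpf_attr *attr)
{
    return syscall(SYS_bpf, cmd, attr, sizeof(*attr));
}

/* Create a small array map (4 entries, u32 key -> u64 value).
 * Returns an fd or a negative errno value. */
int create_array_map(void)
{
    union bpf_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.map_type = BPF_MAP_TYPE_ARRAY;
    attr.key_size = sizeof(uint32_t);
    attr.value_size = sizeof(uint64_t);
    attr.max_entries = 4;
    long fd = bpf_call(BPF_MAP_CREATE, &attr);
    return fd < 0 ? -errno : (int)fd;
}

/* BPF_OBJ_PIN: give a map/prog fd a name on the bpf file system so the
 * object outlives this process.  Returns 0 or a negative errno value. */
int bpf_pin(int fd, const char *path)
{
    assert(path != NULL);
    union bpf_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.pathname = (uint64_t)(uintptr_t)path;
    attr.bpf_fd = (uint32_t)fd;
    return bpf_call(BPF_OBJ_PIN, &attr) < 0 ? -errno : 0;
}

/* BPF_OBJ_GET: reopen a pinned object by path.
 * Returns a new fd referring to the same object, or a negative errno. */
int bpf_get_pinned(const char *path)
{
    assert(path != NULL);
    union bpf_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.pathname = (uint64_t)(uintptr_t)path;
    long fd = bpf_call(BPF_OBJ_GET, &attr);
    return fd < 0 ? -errno : (int)fd;
}
```

For example, after fd = create_array_map(), a call to bpf_pin(fd, "/sys/fs/bpf/shared_map") (a path chosen for illustration) lets a separate process call bpf_get_pinned() on the same path and get its own fd for the same map, even after the creating process has exited.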

Code: [https://git.kernel.org/torvalds/c/b2197755b2633e164a439682fb05a9b5ea48f706 commit], [https://git.kernel.org/torvalds/c/42984d7c1e563bf92e6ca7a0fd89f8e933f2162e commit]

1.7. perf + eBPF integration

In this release, eBPF programs have been integrated with perf. When perf is given an eBPF .c source file (or a .o file built for the 'bpf' target with clang), it automatically builds, validates and loads the program into the kernel, where it can then be used and observed with perf trace and other tools.

For example, a BPF filter can be used like this: # perf record --event ./hello_world.o ls. The eBPF program is attached to a newly created perf event, which works with all tools.

Code: [https://git.kernel.org/torvalds/c/69d262a93a25cf475012ea2e00aeb29f4932c028 commit], [https://git.kernel.org/torvalds/c/84c86ca12b2189df751eed7b2d67cb63bc8feda5 commit], [https://git.kernel.org/torvalds/c/ed63f34c026e9a60d17fa750ecdfe3f600d49393 commit], [https://git.kernel.org/torvalds/c/1f45b1d49073541947193bd7dac9e904142576aa commit], [https://git.kernel.org/torvalds/c/4edf30e39e6cff32390eaff6a1508969b3cd967b commit], [https://git.kernel.org/torvalds/c/71dc2326252ff1bcdddc05db03c0f831d16c9447 commit], [https://git.kernel.org/torvalds/c/d509db0473e40134286271b1d1adadccf42ac467 commit], [https://git.kernel.org/torvalds/c/aa3abf30bb28addcf593578d37447d42e3f65fc3 commit], [https://git.kernel.org/torvalds/c/1e5e3ee8ff3877db6943032b54a6ac21c095affd commit], [https://git.kernel.org/torvalds/c/ba1fae431e74bb427a699187434142fd3fe98390 commit]

1.8. Block polling support

This release adds basic support for polling for specific I/O to complete, which can improve latency and throughput on very fast devices. Currently, only synchronous O_DIRECT reads and writes are supported. This support is only intended for testing; in future releases, statistics tracking will be used to auto-tune it. For now, for benchmarking and testing purposes, a sysfs file (io_poll) controls whether polling is enabled or not.
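In C, the I/O pattern that 4.4 can poll for is an ordinary synchronous O_DIRECT read or write, and the io_poll switch lives in the device's queue directory, /sys/block/<dev>/queue/io_poll. The minimal sketch below reads that switch and issues one aligned O_DIRECT read. The helper names and the 4096-byte alignment are assumptions (the real requirement is the device's logical block size), and whether the kernel actually polls depends on the driver supporting it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define DIO_ALIGN 4096   /* assumed logical block size for alignment */

/* Read /sys/block/<dev>/queue/io_poll.  Returns 1 (polling on), 0 (off),
 * or a negative errno if the attribute is missing or unreadable. */
int io_poll_enabled(const char *dev)
{
    assert(dev != NULL);
    char path[256];
    snprintf(path, sizeof(path), "/sys/block/%s/queue/io_poll", dev);

    FILE *f = fopen(path, "r");
    if (!f)
        return -errno;
    int value;
    int ret = (fscanf(f, "%d", &value) == 1) ? (value != 0) : -EINVAL;
    fclose(f);
    return ret;
}

/* One synchronous O_DIRECT read of the first block.  On a queue with io_poll
 * enabled, the kernel may poll for this completion instead of sleeping.
 * Returns the number of bytes read or a negative errno value. */
ssize_t direct_read_first_block(const char *path)
{
    int fd = open(path, O_RDONLY | O_DIRECT | O_CLOEXEC);
    if (fd < 0)
        return -errno;

    void *buf;
    int err = posix_memalign(&buf, DIO_ALIGN, DIO_ALIGN);  /* O_DIRECT needs aligned buffers */
    if (err) {
        close(fd);
        return -err;
    }

    ssize_t n = pread(fd, buf, DIO_ALIGN, 0);
    ssize_t ret = n < 0 ? -errno : n;   /* capture errno before cleanup */
    free(buf);
    close(fd);
    return ret;
}
```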

Recommended LWN article: [http://lwn.net/Articles/663879/ Block-layer I/O polling]

Code: [https://git.kernel.org/torvalds/c/15c4f638f3d41bae52105ca4c0c8760afbcbeaab commit], [https://git.kernel.org/torvalds/c/05229beeddf7e75e2e616ddaad4b70e7fca9528d commit], [https://git.kernel.org/torvalds/c/a0fa9647a54e81883abd57c5c865d1747f68a577 commit]

1.9. mlock2() syscall allows users to request memory to be locked on page fault

mlock() allows a user to prevent program memory from being paged out, but this comes at the cost of faulting in the entire mapping when mlock() is called. For large mappings this is not ideal. For example, security applications that need mlock() are forced to lock an entire buffer, no matter how big it is. Similarly, applications with large graphical models, where the path through the graph is not known until run time, are forced to lock the entire graph or to lock it page by page as pages are faulted in.

The new mlock2() syscall creates a middle ground. Pages are marked to be placed on the unevictable LRU (locked) when they are first used, but they are not faulted in by the mlock2() call itself. The new system call takes a flags argument along with the start address and size. This flags argument gives the caller the ability to request that memory be locked in the traditional way (flags set to 0), or only after each page is faulted in (MLOCK_ONFAULT). A new MCL_ONFAULT flag mirrors this lock-on-fault behavior for mlockall(). The existing munlock() and munlockall() calls clear lock-on-fault state as well as traditional locks.
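A minimal C sketch of lock-on-fault: map an anonymous region and call mlock2() with MLOCK_ONFAULT, so that pages are locked only as they are touched. It goes through syscall(2) so it builds even where the C library has no mlock2() wrapper (glibc added one in version 2.27); the helper names are hypothetical, and the call can fail with EPERM or ENOMEM depending on RLIMIT_MEMLOCK and privileges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MLOCK_ONFAULT
#define MLOCK_ONFAULT 0x01   /* value from the Linux UAPI headers */
#endif

/* Invoke mlock2() directly, independent of the C library's wrapper. */
static int sys_mlock2(const void *addr, size_t len, unsigned int flags)
{
    return (int)syscall(SYS_mlock2, addr, len, flags);
}

/* Hypothetical demo: map len bytes and request lock-on-fault.  The call
 * itself populates nothing; each page becomes locked (unevictable) the first
 * time it is touched.  Returns 0 or a negative errno value. */
int lock_on_fault_demo(size_t len)
{
    assert(len > 0);
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -errno;

    int ret = 0;
    if (sys_mlock2(buf, len, MLOCK_ONFAULT) < 0) {
        ret = -errno;
    } else {
        buf[0] = 1;            /* faults in, and therefore locks, the first page only */
        munlock(buf, len);     /* plain munlock() also clears lock-on-fault */
    }
    munmap(buf, len);
    return ret;
}
```

The whole-process equivalent is mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT).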

Recommended LWN article: [http://lwn.net/Articles/650538/ Deferred memory locking]

Code: [https://git.kernel.org/torvalds/c/de60f5f10c58d4f34b68622442c0e04180367f3f commit], [https://git.kernel.org/torvalds/c/b0f205c2a3082dd9081f9a94e50658c5fa906ff1 commit], [https://git.kernel.org/torvalds/c/a8ca5d0ecbdde5cc3d7accacbd69968b0c98764e commit], [https://git.kernel.org/torvalds/c/1aab92ec3de552362397b718744872ea2d17add2 commit]

2. Drivers and architectures

3. Core (various)

4. File systems

5. Memory management

6. Block layer

7. Cryptography

crypto: caam - add support for acipher xts(aes) [https://git.kernel.org/torvalds/c/c6415a6016bff0b547c13cadb1d5e50e9ace2be3 commit]
crypto: keywrap - add key wrapping block chaining mode [https://git.kernel.org/torvalds/c/e28facde3c39005071cc5323d56539bb44efa446 commit]
crypto: qat - add support for ctr(aes) and xts(aes) [https://git.kernel.org/torvalds/c/def14bfaf30d5d5a4a8fe5bf600ce09232e688c0 commit]

8. Security

9. Tracing and perf tool

10. Virtualization

11. Networking

12. List of merges

13. Other news sites

KernelNewbies: Linux_4.4 (last edited 2016-02-03 19:49:38 by diegocalleja)