1. Prominent features
1.1. Bigger memory limits
Original x86-64 was limited by 4-level paging to 256 TiB of virtual address space and 64 TiB of physical address space. People are already bumping into this limit: some vendors offer servers with 64 TiB of memory today. To overcome the limitation, upcoming hardware will introduce support for 5-level paging. It is a straightforward extension of the current page table structures, adding one more layer of translation. It bumps the limits to 128 PiB of virtual address space and 4 PiB of physical address space. This "ought to be enough for anybody" ©.
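The limits quoted above follow from simple arithmetic on the page table layout. The sketch below assumes the usual x86-64 parameters (4 KiB pages and 512 entries per table, so 9 address bits per level). The physical widths of 46 and 52 bits are inferred from the 64 TiB and 4 PiB figures in the text, and all function names are illustrative only.

```python
PAGE_OFFSET_BITS = 12  # assumption: 4 KiB pages
BITS_PER_LEVEL = 9     # assumption: 512 entries per page table


def virtual_address_bits(levels):
    """Virtual address width covered by a page table tree with `levels` levels."""
    return PAGE_OFFSET_BITS + BITS_PER_LEVEL * levels


def human(nbytes):
    """Format a power-of-two byte count with binary units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    index = 0
    while nbytes >= 1024 and index < len(units) - 1:
        nbytes //= 1024
        index += 1
    return f"{nbytes} {units[index]}"


for levels, physical_bits in ((4, 46), (5, 52)):
    vbits = virtual_address_bits(levels)
    print(f"{levels}-level paging: {vbits}-bit virtual = {human(1 << vbits)}, "
          f"{physical_bits}-bit physical = {human(1 << physical_bits)}")
```

Each extra level adds 9 bits, which is why a single additional layer of translation multiplies the virtual address space by 512.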
On x86, 5-level paging enables a 56-bit userspace virtual address space. Not all userspace is ready to handle wide addresses. At least some JIT compilers are known to use the higher bits in pointers to encode their own information. With 5-level paging those bits can be part of a valid pointer, so the encoding collides with real addresses and leads to crashes. To mitigate this, the Linux kernel will not allocate virtual address space above 47 bits by default. Userspace can ask for allocation from the full address space by specifying a hint address above 47 bits.
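To make the collision concrete, here is a small sketch of a pointer-tagging scheme of the kind described above. The scheme itself (a tag stored in bits 48 and up) and the example addresses are hypothetical and not taken from any particular JIT; only the 47-bit and 56-bit limits come from the text.

```python
DEFAULT_USER_LIMIT = 1 << 47  # by default the kernel allocates below this
WIDE_USER_LIMIT = 1 << 56     # userspace limit with 5-level paging
TAG_SHIFT = 48                # hypothetical: the JIT keeps a tag in bits 48..63
ADDRESS_MASK = (1 << TAG_SHIFT) - 1


def tag_pointer(address, tag):
    """Pack a small tag into the upper bits, overwriting whatever was there."""
    return (tag << TAG_SHIFT) | (address & ADDRESS_MASK)


def untag_pointer(word):
    """Recover the address, assuming it never used bits 48 and up."""
    return word & ADDRESS_MASK


low = 0x00007F1234567000   # fits below the default 47-bit limit
high = 0x00A0000012345000  # only reachable with a hint above 47 bits

assert low < DEFAULT_USER_LIMIT <= high < WIDE_USER_LIMIT

print(hex(untag_pointer(tag_pointer(low, 0x7))))   # round-trips intact
print(hex(untag_pointer(tag_pointer(high, 0x7))))  # upper address bits are lost
```

Because the default allocation limit stays at 47 bits, programs like this keep working unless they explicitly opt in to wide addresses with a high hint.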
Recommended LWN article: [https://lwn.net/Articles/717293/ Five-level page tables]
Code: [https://git.kernel.org/linus/ee00f4a32a76ef631394f31d5b6028d50462b357 commit], [https://git.kernel.org/linus/b569bab78d8df157a6f91070af827753e4d1787c commit], [https://git.kernel.org/linus/44b04912fa72489d403738f39e1c782614b7ae7c commit], [https://git.kernel.org/linus/77ef56e4f0fbb350d93289aa025c7d605af012d4 commit], [https://git.kernel.org/torvalds/c/b1b6f83ac938d176742c85757960dec2cf10e468 merge]
1.2. Add support for AMD Secure Memory Encryption
Secure Memory Encryption can be used to mark individual pages of memory as encrypted through the page tables. A page of memory that is marked encrypted will be automatically decrypted when read from DRAM and will be automatically encrypted when written to DRAM. Secure Memory Encryption can therefore be used to protect the contents of DRAM from physical attacks on the system.
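As a rough illustration of what "marking a page as encrypted through the page tables" means, the sketch below models a page table entry as an integer with one bit reserved as the encryption flag. The bit position used here (47) is only an example value, since real hardware reports the position itself, and the helper names are hypothetical rather than kernel APIs.

```python
ENCRYPTION_BIT = 47  # assumption: example position only, hardware-dependent
ENCRYPTION_MASK = 1 << ENCRYPTION_BIT


def set_encrypted(pte):
    """Return the page table entry with the encryption flag set."""
    return pte | ENCRYPTION_MASK


def set_decrypted(pte):
    """Return the page table entry with the encryption flag cleared."""
    return pte & ~ENCRYPTION_MASK


def is_encrypted(pte):
    """True if accesses through this entry are transparently encrypted."""
    return bool(pte & ENCRYPTION_MASK)


pte = 0x1234000 | 0x3  # made-up frame address plus present/writable bits
encrypted_pte = set_encrypted(pte)
print(hex(encrypted_pte), is_encrypted(encrypted_pte))
```

The point of the model is that encryption is a per-page property: two entries can map different pages with different settings, and flipping the bit does not disturb the rest of the entry.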
Recommended LWN article: [https://lwn.net/Articles/686808/#sme Two approaches to x86 memory encryption]
AMD Memory encryption whitepaper: [http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf link]
Code: [https://git.kernel.org/linus/c262f3b9a3246da87c66ce398cd7e30d8f1529ea commit], [https://git.kernel.org/linus/aac7b79eea6118dee3da9b99dcd564471672806d commit], [https://git.kernel.org/linus/f7750a79568788473c5e8092ee58a52248f34329 commit], [https://git.kernel.org/linus/872cbefd2d9c52bd0b1e2c7942c4369e98a5a5ae commit], [https://git.kernel.org/linus/9af9b94068fb1ea3206a700fc222075966fbef14 commit], [https://git.kernel.org/linus/7744ccdbc16f0ac4adae21b3678af93775b3a386 commit], [https://git.kernel.org/linus/33c2b803edd13487518a2c7d5002d84d7e9c878f commit], [https://git.kernel.org/linus/5868f3651fa0dff96a57f94d49247d3ef320ebe2 commit], [https://git.kernel.org/linus/fd7e315988b784509ba3f1b42f539bd0b1fca9bb commit], [https://git.kernel.org/linus/21729f81ce8ae76a6995681d40e16f7ce8075db4 commit], [https://git.kernel.org/linus/eef9c4abe77f55b1600f59d8ac5f1d953e2f5384 commit], [https://git.kernel.org/linus/f88a68facd9a15b94f8c195d9d2c0b30c76c595a commit], [https://git.kernel.org/linus/7f8b7e7f4ccbbd1fb8badddfabd28c955aea87b4 commit], [https://git.kernel.org/linus/b9d05200bc12444c7778a49c9694d8382ed06aa8 commit], [https://git.kernel.org/linus/d68baa3fa6e4d703fd0c7954ee5c739789e7242f commit], [https://git.kernel.org/linus/a19d66c56af1c52b8b463bf94d21116ae8c1aa5a commit], [https://git.kernel.org/linus/f99afd08a45fbbd9ce35a7624ffd1d850a1906c0 commit], [https://git.kernel.org/linus/38eecccdf488e38ee93690cfe9ec1914b73f512f commit], [https://git.kernel.org/linus/8f716c9b5febf6ed0f5fedb7c9407cd0c25b2796 commit], [https://git.kernel.org/linus/5997efb967565e858259401af394e8449629c1f0 commit], [https://git.kernel.org/linus/1de328628cd06b5efff9195b57bdc1a64680814d commit], [https://git.kernel.org/linus/77bd2342d4304bda7896c953d424d15deb314ca3 commit], [https://git.kernel.org/linus/163ea3c83aeeb3908a51162c79cb3a7c374d92b4 commit], [https://git.kernel.org/linus/c7753208a94c73d5beb1e4bd843081d6dc7d4678 commit], [https://git.kernel.org/linus/648babb7078c6310d2af5b8aa01f086030916968 commit], [https://git.kernel.org/linus/f655e6e6b992a2fb0d0334db2620607b98df39e7 commit], [https://git.kernel.org/linus/2543a786aa25258451f3418b87a038c7ddaa2e85 commit], [https://git.kernel.org/linus/46d010e04a637ca5bbdd0ff72554d9c06f2961c9 commit], [https://git.kernel.org/linus/95cf9264d5f36c291c1c50c00349f83348e6f9c7 commit], [https://git.kernel.org/linus/d0ec49d4de90806755e17289bd48464a1a515823 commit], [https://git.kernel.org/linus/bba4ed011a52d494aa7ef5e08cf226709bbf3f60 commit], [https://git.kernel.org/linus/f2f931c6819467af5260a21c59fb787ce2863f92 commit], [https://git.kernel.org/linus/8458bf94b0399cd1bca6c437366bcafb29c230c5 commit], [https://git.kernel.org/linus/db516997a985b461f021d594e78155bbc7fc3e7e commit], [https://git.kernel.org/linus/6ebcb060713f614c92216482eed501b31cee74ec commit], [https://git.kernel.org/linus/e505371dd83963caae1a37ead9524e8d997341be commit], [https://git.kernel.org/linus/7375ae3a0b79ea072f4c672039f08f5db633b9e1 commit], [https://git.kernel.org/linus/aca20d5462149333ba8b24a4a352be5b7a00dfd2 commit]
1.3. Better kernel traces with the ORC unwinder
This release includes a new "unwinder". An unwinder is what prints the list of functions (also known as a stack trace, callgraph, or call stack) that were executed before reaching a given point in the code. It is used, for example, to print the list of functions that led to a crash when the kernel oopses. The new unwinder is called ORC, an alias for "Oops Rewind Capability", and has been developed as a simpler alternative to the DWARF debuginfo format.
Linux already has an unwinder, and while it usually works well, it isn't reliable in all situations, which causes trouble for modern functionality like live patching that requires completely reliable stack traces. It also requires a feature called "frame pointers" (CONFIG_FRAME_POINTER) to print complete call stacks. Frame pointers make GCC add instrumentation code to every function in the kernel, which increases the size of the kernel executable code by about 3.2%, resulting in a broad kernel-wide slowdown, and more for some workloads. This option is enabled by default in some Linux distros.
In contrast, the ORC unwinder does not need to insert code anywhere so it has no effect on text size or runtime performance, because the debuginfo (about 2-4MiB) is placed out of band. So the ORC unwinder provides a nice performance improvement across the board compared with frame pointers, while at the same time having reliable stack traces.
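The idea of unwinding from out-of-band data can be sketched with a toy model: a sorted table, kept separate from the code, records for each range of instruction addresses how far the caller's stack pointer is from the current one. This is only an illustration in Python and not the kernel's actual ORC data layout; the table contents, addresses, and function names are all made up.

```python
import bisect

WORD = 8  # bytes per stack slot on a 64-bit machine

# Hypothetical out-of-band table, sorted by start address:
# (start address, function name, offset from current sp to the caller's sp).
ORC_TABLE = [
    (0x1000, "main", 16),
    (0x2000, "read_file", 32),
    (0x3000, "do_crash", 8),
]
STARTS = [start for start, _, _ in ORC_TABLE]


def lookup(ip):
    """Find the table entry covering instruction pointer `ip`, if any."""
    index = bisect.bisect_right(STARTS, ip) - 1
    return ORC_TABLE[index] if index >= 0 else None


def unwind(ip, sp, stack):
    """Walk the call chain using only the table and saved return addresses."""
    trace = []
    while True:
        entry = lookup(ip)
        if entry is None:
            break
        _, name, sp_offset = entry
        trace.append(name)
        caller_sp = sp + sp_offset
        # The return address sits in the slot just below the caller's sp.
        return_address = stack.get(caller_sp - WORD)
        if return_address is None:
            break  # ran off the recorded stack: outermost frame reached
        ip, sp = return_address, caller_sp
    return trace


# Fake stack memory: address -> saved return address.
STACK = {0x7000: 0x2040, 0x7020: 0x1020}
TRACE = unwind(ip=0x3010, sp=0x7000, stack=STACK)
print(" <- ".join(TRACE))
```

Nothing in the "functions" themselves had to change to make this work, which mirrors why the table-based approach costs no text size or runtime overhead, only the memory for the table.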
Recommended LWN article: [https://lwn.net/Articles/728339/ The ORCs are coming]
Recommended article: [http://www.codeblueprint.co.uk/2017/07/31/the-orc-unwinder.html The Linux x86 ORC Stack Unwinder]
1.4. zstd compression in Btrfs and Squashfs
zstd offers a wide variety of compression speed and quality trade-offs. It can compress at speeds approaching lz4, and with quality approaching lzma. zstd decompresses at speeds more than twice as fast as zlib, and decompression speed remains roughly the same across all compression levels. Because it is a big win in speed over zlib and in compression ratio over lzo, Facebook has been using it in production with great results. Support has also been added for squashfs. For benchmark numbers see the links.
Code: [https://git.kernel.org/linus/73f3d1b48f5069d46ba48aa28c2898dc93185560 commit], [https://git.kernel.org/linus/5d2405227a9eaea48e8cc95756a06d407b11f141 commit], [https://git.kernel.org/linus/5c1aab1dd5445ed8bdcdbb575abc1b0d7ee5b2e7 commit], [https://git.kernel.org/linus/87bf54bb43ddd385d2538b777324bf737f243042 commit]
1.5. Zero-copy from user memory to sockets
Copying large buffers between a user process and the kernel can be expensive. Linux supports various interfaces that eschew copying, such as sendfile(2) and splice(2). The MSG_ZEROCOPY socket flag extends the underlying copy avoidance mechanism to common socket send calls. Copy avoidance is not a free lunch. As implemented, with page pinning, it replaces the per-byte copy cost with page accounting and completion notification overhead. As a result, MSG_ZEROCOPY is generally only effective for writes larger than around 10 KB.
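A minimal sketch of how an application might apply the size rule above: opt the socket in, then request zero-copy only for large writes. The numeric constants are the Linux values for common architectures, written out by hand as an assumption because not every Python build exposes them. The helper names and the exact 10 KiB threshold are illustrative, and the sketch deliberately leaves out reading completion notifications.

```python
import socket

# Assumption: Linux values on common architectures (e.g. x86, arm64).
SO_ZEROCOPY = 60
MSG_ZEROCOPY = 0x4000000

ZEROCOPY_THRESHOLD = 10 * 1024  # "around 10 KB", per the text above


def enable_zerocopy(sock):
    """Try to opt the socket in to MSG_ZEROCOPY; return True on success."""
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_ZEROCOPY, 1)
        return True
    except OSError:
        return False  # older kernel or unsupported socket type


def send_flags(payload_len, zerocopy_enabled, threshold=ZEROCOPY_THRESHOLD):
    """Pick send() flags: request zero-copy only for large writes."""
    if zerocopy_enabled and payload_len >= threshold:
        return MSG_ZEROCOPY
    return 0


def send_payload(sock, payload, zerocopy_enabled):
    """Send once; return (bytes sent, whether zero-copy was requested).

    When zero-copy was requested, the caller must keep `payload` alive and
    unmodified until the kernel reports completion on the socket error queue.
    That step is not shown here.
    """
    flags = send_flags(len(payload), zerocopy_enabled)
    return sock.send(payload, flags), bool(flags)
```

Small writes fall back to an ordinary copying send, which is the cheaper path below the threshold. A real program would also read the completion notifications described in the MSG_ZEROCOPY documentation before reusing its buffers.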
Recommended LWN article: [https://lwn.net/Articles/726917/ Zero-copy networking]
Documentation: [https://www.kernel.org/doc/html/latest/networking/msg_zerocopy.html MSG_ZEROCOPY]
Netdev talk: [https://netdevconf.org/2.1/session.html?debruijn sendmsg copy avoidance with MSG_ZEROCOPY]
1.6. Heterogeneous Memory Management for future GPUs
Today, device drivers expose a dedicated memory allocation API through their device file, often relying on a combination of ioctl and mmap calls. The device can only access and use memory allocated through this API. This effectively splits the program address space into objects allocated for the device and usable by the device, and other regular memory (malloc, mmap of a file, shared memory, ...) accessible only by the CPU (or in a very limited way by a device, by pinning memory). Allowing different isolated components of a program to use a device thus requires duplicating the input data structures using the device memory allocator. This is reasonable for simple data structures (array, grid, image, ...) but it gets extremely complex with advanced data structures. This is becoming a serious limitation on the kind of workload that can be offloaded to a device like a GPU.
New industry standards such as C++, OpenCL, and CUDA are pushing to remove this barrier. This requires a shared address space between the GPU device and the CPU so that the GPU can access any memory of a process (while still obeying memory protections such as read-only). This kind of feature is also appearing in various other operating systems. Heterogeneous Memory Management is a set of helpers to facilitate several aspects of address space sharing and device memory management.
Recommended LWN article: [https://lwn.net/Articles/684916/ Heterogeneous memory management]
Documentation: [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/hmm.txt Documentation/vm/hmm.txt]
Code: [https://git.kernel.org/linus/bffc33ec539699f045a9254144de3d4eace05f07 commit], [https://git.kernel.org/linus/133ff0eac95b7dc6edf89dc51bd139a0630bbae7 commit], [https://git.kernel.org/linus/c0b124054f9e42eb6da545a10fe9122a7d7c3f72 commit], [https://git.kernel.org/linus/da4c3c735ea4dcc2a0b0ff0bd4803c336361b6f5 commit], [https://git.kernel.org/linus/74eee180b935fcb9b83a56dd7648fb75caf38f0e commit], [https://git.kernel.org/linus/3072e413e305e353cd4654f8a57d953b66e85bf3 commit], [https://git.kernel.org/linus/5042db43cc26f51eed51c56192e2c2317e44315f commit], [https://git.kernel.org/linus/7b2d55d2c8961ae9d456d3133f4ae2f0fbd3e14f commit], [https://git.kernel.org/linus/c733a82874a79261866a4178edbb608847df4879 commit], [https://git.kernel.org/linus/a9d5adeeb4b2c73c8972180b28d0e05e7d718d06 commit], [https://git.kernel.org/linus/4ef589dc9b10cdcae75a2b2b0e9b2c5e8a92c378 commit], [https://git.kernel.org/linus/858b54dabf4363daa3a97b9a722130a8e7cea8c9 commit], [https://git.kernel.org/linus/2916ecc0f9d435d849c98f4da50e453124c87531 commit], [https://git.kernel.org/linus/8763cb45ab967a92a5ee49e9c544c0f0ea90e2d6 commit], [https://git.kernel.org/linus/8c3328f1f36a5efe817ad4e06497af601936a460 commit], [https://git.kernel.org/linus/a5430dda8a3a1cdd532e37270e6f36436241b6e7 commit], [https://git.kernel.org/linus/8315ada7f095bfa2cae0cd1e915b95bf6226897d commit], [https://git.kernel.org/linus/df6ad69838fc9dcdbee0dcf2fc2c6f1113f8d609 commit], [https://git.kernel.org/linus/d3df0a423397c9a1ae05c3857e8c04240dd85e68 commit]
1.7. Better cpufreq coordination with SMP
In Linux, notifications of task scheduler events are sent to the cpufreq subsystem so that it can increase the frequency if needed and achieve good interactivity. However, the cpufreq drivers are not called when the events happen on a different CPU, for example, when a new process is created on another CPU. This release makes the task scheduler update the cpufreq policies for remote CPUs as well. The schedutil, ondemand and conservative governors are updated to process cpufreq updates for remote CPUs (the intel_pstate driver is updated to always reject them).
Recommended LWN article: [https://lwn.net/Articles/732740/ CPU frequency governors and remote callbacks]
1.8. Faster TLB flushing with PCID
PCID is a hardware feature available on Intel CPUs that attaches an address space tag to TLB entries and thus allows skipping TLB flushes in many cases. x86's PCID is far too short to uniquely identify a process, and it can't even really uniquely identify a running process because there are monster systems with over 4096 CPUs. To make matters worse, past attempts to use all 12 PCID bits have resulted in slowdowns instead of speedups.
This release uses PCID differently. It uses a PCID to identify a recently-used mm on a per-cpu basis. An mm has no fixed PCID binding at all; instead, it is given a fresh PCID each time it's loaded except in cases where the kernel wants to preserve the TLB, in which case it reuses a recent value.
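The per-CPU policy described above can be modeled as a tiny cache of recently used address spaces. The sketch below is a toy model under stated assumptions: the slot count and the round-robin replacement are illustrative choices, the class and method names are hypothetical, and the extra bookkeeping a real kernel needs to detect stale TLB entries is omitted.

```python
class CpuPcidCache:
    """Toy per-CPU table mapping a few PCID slots to recently used mms."""

    def __init__(self, nr_slots=6):  # assumption: slot count is illustrative
        self.slots = [None] * nr_slots  # slot index (PCID) -> mm identifier
        self.next_slot = 0              # round-robin replacement cursor

    def switch_to(self, mm):
        """Return (pcid, need_flush) for loading `mm` on this CPU."""
        if mm in self.slots:
            # Recently used here: reuse its PCID and keep its TLB entries.
            return self.slots.index(mm), False
        # Not cached: hand out a fresh slot and flush whatever it tagged.
        pcid = self.next_slot
        self.next_slot = (self.next_slot + 1) % len(self.slots)
        self.slots[pcid] = mm
        return pcid, True


cpu = CpuPcidCache(nr_slots=2)
for mm in ("A", "B", "A", "C", "A"):
    print(mm, cpu.switch_to(mm))
```

Note that "A" gets a different PCID the second time it has to be reloaded from scratch, which illustrates that an mm has no fixed PCID binding and that the tag is only meaningful per CPU.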
Code: [https://git.kernel.org/linus/f39681ed0f48498b80455095376f11535feea332 commit], [https://git.kernel.org/linus/b0579ade7cd82391360e959cc844e50a160e8a96 commit], [https://git.kernel.org/linus/94b1b03b519b81c494900cb112aa00ed205cc2d9 commit], [https://git.kernel.org/linus/43858b4f25cf0adc5c2ca9cf5ce5fdf2532941e5 commit], [https://git.kernel.org/linus/cba4671af7550e008f7a7835f06df0763825bf3e commit], [https://git.kernel.org/linus/0790c9aad84901ca1bdc14746175549c8b5da215 commit], [https://git.kernel.org/linus/660da7c9228f685b2ebe664f9fd69aaddcc420b5 commit], [https://git.kernel.org/linus/10af6235e0d327d42e1bad974385197817923dc1 commit]
2. Core (various)
Asynchronous I/O: non-blocking buffered reads. Using a threadpool to emulate non-blocking operations on regular buffered files is a common pattern today (samba, libuv, etc.), but it leads to increased request latency due to additional synchronization between the threads, or to fast requests (for cached data) getting stuck behind slow requests. In this release, the preadv2(2) syscall with RWF_NOWAIT lets userspace applications skip enqueuing an operation in the threadpool if the data is already available in the pagecache [https://git.kernel.org/linus/47c27bc46946dea543196a92061da14c6da9889e commit], [https://git.kernel.org/linus/3239d834847627b6634a4139cf1dc58f6f137a46 commit], [https://git.kernel.org/linus/91f9943e1c7b6638f27312d03fe71fcc67b23571 commit], [https://git.kernel.org/li