#pragma section-numbers on
#pragma keywords Linux, kernel, operating system, changes, changelog, file system, Linus Torvalds, open source, device drivers
#pragma description Summary of the changes and new features merged in the Linux kernel during the 4.10 development cycle

Linux 4.10 [[https://lkml.org/lkml/2017/2/19/224|was released]] on 19 Feb 2017.

Summary: This release adds support for virtualized GPUs, a new 'perf c2c' tool for cacheline contention analysis in NUMA systems, a new 'perf sched timehist' command for a detailed history of task scheduling, improved writeback management that should make the system more responsive under heavy writing load, a new hybrid block polling method that uses less CPU than pure polling, support for ARM devices such as the Nexus 5X & 6P or Allwinner A64, a feature that allows attaching eBPF programs to cgroups, an experimental MD RAID5 writeback cache, support for Intel Cache Allocation Technology, and many other improvements and new drivers.

= Prominent features =
== Virtual GPU support ==

This release adds support for Intel GVT-g for KVM (a.k.a. KVMGT), a full GPU virtualization solution with mediated pass-through, starting from 4th generation Intel Core (Haswell) processors with Intel Graphics. This feature is based on a new VFIO Mediated Device framework. Unlike direct pass-through alternatives, the mediated device framework allows KVMGT to offer a complete virtualized GPU with full GPU features to each of the virtualized guests, with some performance-critical resources directly assigned, while still having performance close to native. The capability of running a native graphics driver inside a VM, without hypervisor intervention in performance-critical paths, achieves a good balance among performance, features, and sharing capability.
For more details, see these papers:

 * [[https://www.usenix.org/conference/atc14/technical-sessions/presentation/tian|A Full GPU Virtualization Solution with Mediated Pass-Through]]
 * [[http://www.linux-kvm.org/images/f/f3/01x08b-KVMGT-a.pdf|KVMGT: a Full GPU Virtualization Solution]]
 * [[http://www.linux-kvm.org/images/5/59/02x03-Neo_Jia_and_Kirti_Wankhede-vGPU_on_KVM-A_VFIO_based_Framework.pdf|vGPU on KVM]] ([[https://www.youtube.com/watch?v=Xs0TJU_sIPc|video]])
 * [[https://01.org/igvt-g/|Intel GVT main site]]

Code: VFIO Mediated device [[https://git.kernel.org/torvalds/c/7b96953bc640b6b25665fe17ffca4b668b371f14|commit]], [[https://git.kernel.org/torvalds/c/fa3da00cb8c0d403030f4805ae615b444f0d2f3c|commit]], [[https://git.kernel.org/torvalds/c/7ed3ea8a71187a4569eb65f647ea4af0cdb9a856|commit]], [[https://git.kernel.org/torvalds/c/32f55d835b23830bf9295d038a1693ce9fd41b56|commit]], [[https://git.kernel.org/torvalds/c/2169037dc322d8baa84d9bd4468995f818f25d82|commit]], [[https://git.kernel.org/torvalds/c/3624a2486c8ca10a2c730c704441fdd034a0d9b7|commit]], [[https://git.kernel.org/torvalds/c/ea85cf353e4fed4adcf8c960f4add2a286bc2c91|commit]], [[https://git.kernel.org/torvalds/c/7896c998f0e7160df97bd7aaae9807120535bf14|commit]], [[https://git.kernel.org/torvalds/c/8f0d5bb95f763cacad7654304050ec1b636bb04a|commit]], [[https://git.kernel.org/torvalds/c/a54eb55045ae9b3032c71f1134e30d02de527038|commit]], [[https://git.kernel.org/torvalds/c/c086de818dd81c3c2f7cecff23de6585b74340c0|commit]], [[https://git.kernel.org/torvalds/c/b3c0a866f1692da2d1059dadd9c429ff5b364fc9|commit]], [[https://git.kernel.org/torvalds/c/c535d34569bbc61ebf25a5505ab9eafba057345f|commit]], [[https://git.kernel.org/torvalds/c/c747f08aea847c8c0704acf9375ca83c4800f6c1|commit]], [[https://git.kernel.org/torvalds/c/ef198aaa169c61ab357a5cea5a4ce1ee6aafa824|commit]], [[https://git.kernel.org/torvalds/c/a1e03e9bccd1402971213c4953ea59aab8142644|commit]], [[https://git.kernel.org/torvalds/c/2818c6e91980d966d015a9f763ab24b41e6a7c3d|commit]], [[https://git.kernel.org/torvalds/c/8e1c5a4048b89d04d8d1ee655ce1f685e6fddde4|commit]], [[https://git.kernel.org/torvalds/c/3771bd96976dbd01ce4995760ed1d0932f30a366|commit]], [[https://git.kernel.org/torvalds/c/9d1a546c53b4c1c378b0f34de84ddee2c7d4c90c|commit]], [[https://git.kernel.org/torvalds/c/5188287a860b6ec5950d5156d63056156f59ee3b|commit]]; KVMGT [[https://git.kernel.org/torvalds/c/8be8f4a9a9ce48d545512ef7299da607401f3879|(merge)]]; Intel GVT-g [[https://git.kernel.org/torvalds/c/06a75ace46e2fdd1d93b06228df0e2dfe526cc27|(merge)]]

== New 'perf c2c' tool, for cacheline contention analysis ==
In modern systems with multiple processors, different memory modules are physically connected to different CPUs. In these [[https://en.wikipedia.org/wiki/Non-uniform_memory_access|NUMA]] systems, accesses to local memory are faster than accesses to memory connected to other processors. When a task is multi-threaded, different threads can run on different CPUs at the same time; if these threads try to access and modify the same memory, they can have performance issues due to the cost of synchronizing the CPU caches.

perf c2c (for "cache to cache") is a new tool designed to analyse and track down performance problems caused by false sharing on NUMA systems. The tool is based on x86's load latency and precise store facility events provided by Intel CPUs. At a high level, perf c2c will show you:

 * The cachelines where false sharing was detected.
 * The readers and writers to those cachelines, and the offsets where those accesses occurred.
 * The pid, tid, instruction address, function name and binary object name for those readers and writers.
 * The source file and line number for each reader and writer.
 * The average load latency for the loads to those cachelines.
 * Which NUMA nodes the samples for a cacheline came from, and which CPUs were involved.

and more.
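To make the idea of cacheline contention more concrete, the following Python sketch groups a list of memory accesses by cacheline and flags the lines that several threads touch at different offsets with at least one write. It only illustrates the kind of report listed above and is not how perf c2c works internally: the 64-byte line size, the {{{Access}}} record and the {{{find_false_sharing}}} helper are assumptions made for this example.

```python
from collections import defaultdict, namedtuple

CACHELINE_SIZE = 64  # assumed line size; common on x86, not universal

# Hypothetical record of one sampled memory access.
Access = namedtuple("Access", "tid addr is_write")


def find_false_sharing(accesses):
    """Return {cacheline_base: sorted offsets} for lines that look falsely shared.

    A line is reported when at least two threads touch it, at least one
    access is a write, and the accesses span more than one offset.
    """
    by_line = defaultdict(list)
    for acc in accesses:
        base = acc.addr - (acc.addr % CACHELINE_SIZE)
        by_line[base].append(acc)

    report = {}
    for base, accs in by_line.items():
        tids = {a.tid for a in accs}
        offsets = {a.addr - base for a in accs}
        has_write = any(a.is_write for a in accs)
        if len(tids) > 1 and has_write and len(offsets) > 1:
            report[base] = sorted(offsets)
    return report


samples = [
    Access(tid=1, addr=0x1000, is_write=True),   # thread 1 writes offset 0
    Access(tid=2, addr=0x1008, is_write=False),  # thread 2 reads offset 8, same line
    Access(tid=3, addr=0x2000, is_write=True),   # line used by one thread only
]
for base, offsets in find_false_sharing(samples).items():
    print(hex(base), offsets)
```

In this made-up sample only the line at 0x1000 is reported, because two threads use different offsets of it and one of them writes; the line at 0x2000 is private to a single thread.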
For more details on perf c2c and how to use it, see https://joemario.github.io/blog/2016/09/01/c2c-blog/

Code: [[https://git.kernel.org/torvalds/c/e9c848928abf4cb60601e9ae7d336f0333c98bca|(merge)]]

== Detailed history of scheduling events with perf sched timehist ==
'perf sched timehist' provides an analysis of scheduling events. Example usage: {{{$ perf sched record -- sleep 1; perf sched timehist}}}. By default it shows the individual schedule events, including the wait time (time between sched-out and next sched-in events for the task), the task scheduling delay (time between wakeup and actually running) and run time for the task:

{{{
    time    cpu  task name         wait time  sch delay  run time
                 [tid/pid]            (msec)     (msec)    (msec)
-------- ------  ----------------  ---------  ---------  --------
1.874569 [0011]  gcc[31949]            0.014      0.000     1.148
1.874591 [0010]  gcc[31951]            0.000      0.000     0.024
1.874603 [0010]  migration/10[59]      3.350      0.004     0.011
1.874604 [0011]                        1.148      0.000     0.035
1.874723 [0005]                        0.016      0.000     1.383
1.874746 [0005]  gcc[31949]            0.153      0.078     0.022
}}}

For more details, see this article from Brendan Gregg: [[http://www.brendangregg.com/blog/2017-03-16/perf-sched.html|perf sched for Linux CPU scheduler analysis]]

Code: [[https://git.kernel.org/torvalds/c/47414424c53a70eceb0fc6e0a35a31a2b763d5b2|(merge)]]

== Improved writeback management ==
Since the dawn of time, the way Linux synchronizes to disk the data written to memory by processes (aka background writeback) has sucked. When Linux writes all that data in the background, it should have little impact on foreground activity. That's the definition of background activity... But for as long as anyone can remember, heavy buffered writers have not behaved like that.
For instance, if you do something like {{{$ dd if=/dev/zero of=foo bs=1M count=10k}}}, or try to copy files to USB storage, and then try to start a browser or any other large app, it basically won't start before the buffered writeback is done, and your desktop, or command shell, feels unresponsive. These problems happen because heavy writes (the kind of write activity caused by background writeback) fill up the block layer, and other IO requests have to wait a long time to be serviced (for more details, see the [[https://lwn.net/Articles/682582/|LWN article]]).

This release adds a mechanism that throttles back buffered writeback, which makes it more difficult for heavy writers to monopolize the IO request queue, and thus provides a smoother experience in Linux desktops and shells than what people were used to. The algorithm for when to throttle monitors the latencies of requests, and shrinks or grows the request queue depth accordingly, which means that it's auto-tunable and, generally, a user would not have to touch the settings.
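The feedback idea described above, watching request latencies and shrinking or growing the allowed queue depth, can be sketched as a toy controller. This is not the kernel's algorithm: the target latency, the depth limits and the halve/increment policy are all assumptions chosen for illustration.

```python
def adjust_depth(depth, observed_latency_ms, target_latency_ms=10.0,
                 min_depth=1, max_depth=64):
    """Return the new writeback queue depth for a toy latency-driven throttle.

    Hypothetical policy: halve the depth when latency exceeds the target,
    otherwise grow it by one, clamped to [min_depth, max_depth].
    """
    if observed_latency_ms > target_latency_ms:
        new_depth = depth // 2
    else:
        new_depth = depth + 1
    return max(min_depth, min(max_depth, new_depth))


depth = 32
for latency in (4.0, 25.0, 40.0, 6.0):  # made-up latency samples in ms
    depth = adjust_depth(depth, latency)
    print(f"latency={latency:5.1f} ms -> depth={depth}")
```

The point of the sketch is only that no manual tuning is involved: the depth backs off quickly while latencies are high and recovers slowly once they drop.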
This feature needs to be enabled explicitly in the configuration (and, as should be expected, there can be regressions).

Recommended LWN article: [[https://lwn.net/Articles/682582/|Toward less-annoying background writeback]]

Code: [[https://git.kernel.org/torvalds/c/1d796d6a9641fbfcd90fcfaf6fb4894a13d0304f|commit]], [[https://git.kernel.org/torvalds/c/7637241e651ec36e409412869f986dd5f097735f|commit]], [[https://git.kernel.org/torvalds/c/13edd5e7315a26b448c5f7f33fc7721b1e0c17ef|commit]], [[https://git.kernel.org/torvalds/c/b57d74aff9ab92fbfb7c197c384d1adfa2827b2e|commit]], [[https://git.kernel.org/torvalds/c/d278d4a8892f13b6a9eb6102b356402f0e062324|commit]], [[https://git.kernel.org/torvalds/c/cf43e6be865a582ba66ee4747ae27a0513f6bba1|commit]], [[https://git.kernel.org/torvalds/c/e34cbd307477ae07c5d8a8d0bd15e65a9ddaba5c|commit]], [[https://git.kernel.org/torvalds/c/87760e5eef359788047d6fd54fc12eec74ce0d27|commit]], [[https://git.kernel.org/torvalds/c/80e091d10e8bf7b801d634ea8870b9e907314424|commit]], [[https://git.kernel.org/torvalds/c/d62118b6dd99b8f64350206a6ea6996083b28c9a|commit]]

== Hybrid block polling ==
Linux 4.4 [[https://kernelnewbies.org/Linux_4.4#head-cd57c6abf8822152b3a175dd68c9610562b220d5|added]] support for polling requests in the block layer, an approach similar to what NAPI does for networking, which can improve performance for high-throughput devices (e.g. NVM). Continuously polling a device, however, can cause excessive CPU consumption and sometimes even worse throughput. This release includes a new hybrid, adaptive type of polling. Instead of polling right after IO submission, the kernel introduces an artificial delay, and polls after that. For example, if the IO is presumed to complete in 8 μsecs from now, the kernel sleeps for 4 μsecs, wakes up, and then polls. This still puts a sleep/wakeup cycle in the IO path, but instead of the wakeup happening after the IO has completed, it'll happen before.
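The arithmetic in the example above can be written down as a small sketch. It assumes, as in the 8 μsecs example, that the sleep is half of the expected completion time, and it estimates that expected time as the mean of a few made-up samples. The function and variable names are invented for illustration and do not correspond to kernel code.

```python
from statistics import mean


def hybrid_poll_plan(expected_completion_us):
    """Split an expected IO completion time into a sleep phase and a poll phase.

    Assumes the sleep lasts half of the expected completion time and that
    polling covers the remainder, as in the 8 usec example in the text.
    """
    sleep_us = expected_completion_us / 2
    poll_us = expected_completion_us - sleep_us
    return sleep_us, poll_us


# Made-up completion times (in microseconds) of earlier, similar requests.
recent_completions_us = [7.0, 8.0, 9.0]
expected_us = mean(recent_completions_us)

sleep_us, poll_us = hybrid_poll_plan(expected_us)
print(f"expected {expected_us} us: sleep {sleep_us} us, then poll for about {poll_us} us")
```

The CPU saving comes from the first phase: during the sleep the CPU is free for other work, and only the second half of the wait is spent polling.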
With this hybrid scheme, Linux can achieve big latency reductions while using the same amount of CPU, or less. Thanks to improved statistics gathering included in this release, the kernel can measure the completion time of requests and calculate how long it should sleep.

Hybrid block polling is disabled by default. A new sysfs file, {{{/sys/block//queue/io_poll_delay}}}, has been added, which makes the polling behave as follows:

 * {{{-1}}}: never enter hybrid sleep, always poll (default)
 * {{{0}}}: use half of the mean completion time for this request type as the sleep delay (i.e. hybrid polling)
 * {{{>0}}}: disregard the mean value calculated by the kernel, and always use this specific value as the sleep delay

Code: [[https://git.kernel.org/torvalds/c/189ce2b9dcc3494410a576fbecbedbb6b21e51e0|commit]], [[https://git.kernel.org/torvalds/c/06426adf072bca62ac31ea396ff2159a34f276c2|commit]], [[https://git.kernel.org/torvalds/c/64f1c21e86f7fe63337b5c23c129de3ec506431d|commit]]

== Better support for ARM devices such as Nexus 5X & 6P or Allwinner A64 ==
As evidence of the work being done to bring Android and mainline kernels together, this release includes support for ARM devices such as:

 * Huawei Nexus 6P (Angler)
 * LG Nexus 5X (Bullhead)
 * Nexbox A1 and A95X Android TV boxes
 * Pine64 development board based on Allwinner A64
 * Globalscale Marvell ESPRESSOBin community board based on Armada 3700
 * Renesas "R-Car Starter Kit Pro" (M3ULCB) low-cost automotive board

Code: [[https://git.kernel.org/torvalds/c/482c3e8835e9e9b325aad295c21bd9e965a11006|(merge)]]

== Allow attaching eBPF programs to cgroups ==
This release adds eBPF hooks for cgroups, to allow eBPF programs for network filtering and accounting to be attached to cgroups, so that they apply to all sockets of all tasks placed in that cgroup. A new BPF program type is added, {{{BPF_PROG_TYPE_CGROUP_SKB}}}.
The [[http://man7.org/linux/man-pages/man2/bpf.2.html|bpf(2)]] syscall is extended with two new commands, {{{BPF_PROG_ATTACH}}} and {{{BPF_PROG_DETACH}}}, which allow attaching eBPF programs to a target and detaching them from it. This feature is configurable ({{{CONFIG_CGROUP_BPF}}}).

Recommended LWN article: [[https://lwn.net/Articles/698073/|Network filtering for control groups]]

Code: [[https://git.kernel.org/torvalds/c/0e33661de493db325435d565a4a722120ae4cbf3|commit]], [[https://git.kernel.org/torvalds/c/3007098494bec614fb55dee7bc0410bb7db5ad18|commit]], [[https://git.kernel.org/torvalds/c/f4324551489e8781d838f941b7aee4208e52e8bf|commit]], [[https://git.kernel.org/torvalds/c/c11cd3a6ec3a817c6b71b00c559e25d855f7e5b4|commit]], [[https://git.kernel.org/torvalds/c/33b486793cb31311f3a91ae4fe4be5926e7677b0|commit]], [[https://git.kernel.org/torvalds/c/d8c5b17f2bc0de09fbbfa14d90e8168163a579e7|commit]]

This release also adds a new cgroup-based program type, {{{BPF_PROG_TYPE_CGROUP_SOCK}}}. Similar to {{{BPF_PROG_TYPE_CGROUP_SKB}}}, programs of this type can be attached to a cgroup, and they run any time a process in the cgroup opens an {{{AF_INET}}} or {{{AF_INET6}}} socket. Currently only {{{sk_bound_dev_if}}} is exported to userspace for modification by a bpf program.
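As a rough illustration of what the new commands take as input, the Python sketch below packs the attribute block that would be handed to bpf(2) for {{{BPF_PROG_ATTACH}}}. The numeric command and attach-type values and the four-field layout are assumptions, recalled from the kernel's {{{include/uapi/linux/bpf.h}}}, and should be checked against the headers of the kernel in use. The helper name and the file descriptor values are made up. The sketch only builds the bytes and does not issue the syscall.

```python
import struct

# Assumed values from include/uapi/linux/bpf.h; verify against your kernel headers.
BPF_PROG_ATTACH = 8
BPF_PROG_DETACH = 9
BPF_CGROUP_INET_INGRESS = 0
BPF_CGROUP_INET_EGRESS = 1
BPF_CGROUP_INET_SOCK_CREATE = 2


def pack_attach_attr(cgroup_fd, prog_fd, attach_type, attach_flags=0):
    """Pack (target_fd, attach_bpf_fd, attach_type, attach_flags) as four u32s.

    Hypothetical helper: the layout is an assumption about the bpf_attr
    fields used by BPF_PROG_ATTACH, in native byte order.
    """
    return struct.pack("=IIII", cgroup_fd, prog_fd, attach_type, attach_flags)


# Made-up descriptors: 3 = an open cgroup directory, 4 = a loaded eBPF program.
attr = pack_attach_attr(3, 4, BPF_CGROUP_INET_EGRESS)
print(len(attr), struct.unpack("=IIII", attr))
```

The target is the cgroup itself, identified by a file descriptor for its directory, which is why the program then applies to every task placed in that cgroup.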
Code: [[https://git.kernel.org/torvalds/c/b2cd12574aa3e1625f471ff57cde7f628a18a46b|commit]], [[https://git.kernel.org/torvalds/c/61023658760032e97869b07d54be9681d2529e77|commit]], [[https://git.kernel.org/torvalds/c/ad2805dc79e647ec2aee931a51924fda9d03b2fc|commit]], [[https://git.kernel.org/torvalds/c/aa4c1037a30f4e88f444e83d42c2befbe0d5caf5|commit]], [[https://git.kernel.org/torvalds/c/4f2e7ae56e04cfe670cf39152a8d015984c90351|commit]], [[https://git.kernel.org/torvalds/c/554ae6e792ef38020b80b4d5127c51d510c0918f|commit]], [[https://git.kernel.org/torvalds/c/7f677633379b4abb3281cdbe7e7006f049305c03|commit]]

== Experimental MD raid5 writeback cache and FAILFAST support ==
This release implements a raid5 writeback cache in the MD (Multiple Devices) subsystem. Its goal is to aggregate writes into full-stripe writes and reduce read-modify-write cycles. It helps, for example, workloads that do sequential writes followed by fsync. This feature is experimental and off by default.

Code: [[https://git.kernel.org/torvalds/c/c757ec95c22036b1cb85c56ede368bf8f6c08658|commit]], [[https://git.kernel.org/torvalds/c/937621c36e0ea1af2aceeaea412ba3bd80247199|commit]], [[https://git.kernel.org/torvalds/c/2ded370373a400c20cf0c6e941e724e61582a867|commit]], [[https://git.kernel.org/torvalds/c/1e6d690b9334b7e1b31d25fd8d93e980e449a5f9|commit]], [[https://git.kernel.org/torvalds/c/a39f7afde358ca89e9fc09a5525d3f8631a98a3a|commit]], [[https://git.kernel.org/torvalds/c/2c7da14b90a01e48b17a028de6050a796cfd6d8d|commit]], [[https://git.kernel.org/torvalds/c/9ed988f5dc673f009d78f7ac55c5da88e1cf58a0|commit]], [[https://git.kernel.org/torvalds/c/b4c625c67362b3940f619c1a836b4e8329106658|commit]], [[https://git.kernel.org/torvalds/c/5aabf7c49d9ebe54a318976276b187637177a03e|commit]], [[https://git.kernel.org/torvalds/c/3bddb7f8f264ec58dc86e11ca97341c24f9d38f6|commit]]

This release also adds "failfast" support.
RAID disks with failed IOs are marked as broken quickly and avoided in the future, which can improve latency.

Code: [[https://git.kernel.org/torvalds/c/688834e6ae6b21e3d98b5cf2586aa4a9b515c3a0|commit]], [[https://git.kernel.org/torvalds/c/46533ff7fefb7e9e3539494f5873b00091caa8eb|commit]], [[https://git.kernel.org/torvalds/c/8d3ca83dcf9ca3d58822eddd279918d46f41e9ff|commit]], [[https://git.kernel.org/torvalds/c/1919cbb23bf1b3e0fdb7b6edfb7369f920744087|commit]], [[https://git.kernel.org/torvalds/c/2e52d449bcec31cb66d80aa8c798b15f76f1f5e0|commit]], [[https://git.kernel.org/torvalds/c/212e7eb7a3403464a796c05c2fc46cae3b62d803|commit]]

== Support for Intel Cache Allocation Technology ==
An Intel feature that allows setting policies on the L2/L3 CPU caches; e.g. real-time tasks could be assigned dedicated cache space. For more details, read the recommended LWN article: [[https://lwn.net/Articles/694800/|Controlling access to the memory cache]].

Code: [[https://git.kernel.org/torvalds/c/78e99b4a2b9afb1c304259fcd4a1c71ca97e3acd|commit]], [[https://git.kernel.org/torvalds/c/4e978d06dedb8207b298a5a8a49fce4b2ab80d12|commit]], [[https://git.kernel.org/torvalds/c/113c60970cf41723891e3a1b303517eaf8510bb5|commit]], [[https://git.kernel.org/torvalds/c/12e0110c11a460b890ed7e1071198ced732152c9|commit]], [[https://git.kernel.org/torvalds/c/458b0d6e751b04216873a5ee9c899be2cd2f80f3|commit]], [[https://git.kernel.org/torvalds/c/60cf5e101fd4441ab112a81e88726efb6fd7542c|commit]], [[https://git.kernel.org/torvalds/c/4f341a5e48443fcc2e2d935ca990e462c02bb1a6|commit]], [[https://git.kernel.org/torvalds/c/60ec2440c63dea88a5ef13e2b2549730a0d75a37|commit]], [[https://git.kernel.org/torvalds/c/e02737d5b82640497637d18428e2793bb7f02881|commit]]

= Core (various) =
 * Kernel configuration system: Introduce the "imply" keyword. The "imply" keyword is a weak version of "select" where the target config symbol can still be turned off, avoiding the pitfalls that come with the "select" keyword. This is useful e.g. with multiple drivers that want to indicate their ability to hook into a secondary subsystem while allowing the user to configure that subsystem out without also having to unset these drivers [[https://git.kernel.org/torvalds/c/237e3ad0f195d8fd34f1299e45f04793832a16fc|commit]]
 * To cover the needs of some systems where suspend-to-idle is the preferred suspend method, rework the system sleep state selection interface (but preserve backwards compatibility). A new sysfs file, {{{/sys/power/mem_sleep}}}, is added; it controls the system suspend mode triggered when writing {{{mem}}} to {{{/sys/power/state}}} (in analogy with what {{{/sys/power/disk}}} does for hibernation). It selects suspend-to-RAM ({{{deep}}} sleep) by default (if supported) and falls back to suspend-to-idle ({{{s2idle}}}) otherwise. A new command line argument, {{{mem_sleep_default}}}, allows that default to be overridden if need be [[https://git.kernel.org/torvalds/c/406e79385f3223d82272cf2be86bc95cd000a258|commit]]
 * Task scheduler: Add support for tasks that inject idle time, used by some idle injection drivers such as the Intel powerclamp and ACPI PAD drivers [[https://git.kernel.org/torvalds/c/c1de45ca831acee9b72c9320dde447edafadb43f|commit]]
 * initramfs: allow choosing the embedded initramfs compression algorithm again [[https://git.kernel.org/torvalds/c/db2aa7fd15e857891cefbada8348c8d938c7a2bc|commit]]
 * posix-timers: Make them configurable, removing about 25 KB from the kernel binary size when configured out. Corresponding syscalls are routed to a stub logging the attempt to use them [[https://git.kernel.org/torvalds/c/baa73d9e478ff32d62f3f9422822b59dd9a95a21|commit]]
 * printk: add Kconfig option to set default console loglevel [[https://git.kernel.org/torvalds/c/a8cfdc68f6cfc0c7ffc6d664406fe7f06f17eef4|commit]]
 * Documentation: create a user's manual book [[https://git.kernel.org/torvalds/c/9d85025b0418163fae079c9ba8f8445212de8568|commit]]
 * driver core: Functional dependencies tracking support [[https://static.lwn.net/kerneldoc/driver-api/device_link.html|documentation]], [[https://git.kernel.org/torvalds/c/9ed9895370aedd6032af2a9181c62c394d08223b|commit]]
 * driver-core: add test module for asynchronous probing [[https://git.kernel.org/torvalds/c/79543cf2b18ea4a35f8864849d7ad8882ea8a23d|commit]]
 * iomap: implement direct I/O path [[https://git.kernel.org/torvalds/c/ff6a9292e6f633d596826be5ba70d3ef90cc3300|commit]]
 * Extend {{{rodata=off}}} boot cmdline parameter to module mappings [[https://git.kernel.org/torvalds/c/39290b389ea2654f9190e3b48c57d27b24def83e|commit]]
 * swiotlb: Add {{{swiotlb=noforce}}} debug option to aid debugging and catch devices not supporting DMA to memory outside the 32-bit address space [[https://git.kernel.org/torvalds/c/fff5d99225107f5f13fe4a9805adc2a1c4b5fb00|commit]]

= File systems =
 * OverlayFS
  * When copying up within the same fs, try to use clone copies [[https://git.kernel.org/torvalds/c/2ea98466491b7609ace297647b07c28d99ef3722|commit]]
  * Allow renaming a directory with the backwards incompatible feature "redirect_dir" [[https://git.kernel.org/torvalds/c/a6c6065511411c57167a6cdae0c33263fb662b51|commit]], [[https://git.kernel.org/torvalds/c/688ea0e5a0e2278e2fcd0014324ab1ba68e70ad7|commit]], [[https://git.kernel.org/torvalds/c/3ea22a71b65b6743a53e286ff4991a06b9d2597c|commit]], [[https://git.kernel.org/torvalds/c/c5bef3a72b9d8a2040d5e9f4bde03db7c86bbfce|commit]]
 * ext4
  * Forbid data journaling when data is encrypted [[https://git.kernel.org/torvalds/c/73b92a2a5e97d17cc4d5c4fe9d724d3273fb6fd2|commit]]
  * DAX iomap support [[https://git.kernel.org/torvalds/c/776722e85d3b0936253ecc3d14db4fba37f191ba|commit]]
 * F2FS
  * Support multiple devices [[https://git.kernel.org/torvalds/c/3c62be17d4f562f43fe1d03b48194399caa35aa5|commit]]
 * NFS
  * Add support for a new NFSv4.2 mode_umask attribute that makes ACL inheritance a little more useful in environments that default to restrictive umasks [[https://git.kernel.org/torvalds/c/dff25ddb48086afcb434770caa3d6849a4489b85|commit]], [[https://git.kernel.org/torvalds/c/47057abde515155a4fee53038e7772d6b387e0aa|commit]]
 * UBIFS
  * Add support for file encryption using the fscrypt framework [[https://git.kernel.org/torvalds/c/39d2c3b96e072c8756f3b980588fa516b7988cb1|(merge)]]
 * XFS
  * Faster buffer cache lookups [[https://git.kernel.org/torvalds/c/6031e73a5b3f85ec45cac08ef90995b2d3f941c7|commit]]
  * Use iomap for Direct I/O (much simpler, faster, and has lower IO latency than the existing direct IO infrastructure) [[https://git.kernel.org/torvalds/c/acdda3aae146d9b69d30e9d8a32a8d8937055523|commit]]
  * Deprecate barrier/nobarrier mount option [[https://git.kernel.org/torvalds/c/4cf4573d899cd80d8578c050061dc342f99f3a32|commit]]
 * CIFS
  * New mount option {{{snapshot=