#pragma section-numbers on
#pragma keywords Linux, kernel, operating system, changes, changelog, file system, Linus Torvalds, open source, device drivers
#pragma description Summary of the changes and new features merged in the Linux kernel during the 4.10 development cycle

Linux 4.10 changelog.

Summary: This release adds a new 'perf c2c' tool for cacheline contention analysis in NUMA systems, a new 'perf sched timehist' command for a detailed history of task scheduling, improved writeback management, a new hybrid block polling method that uses less CPU than pure polling, a feature that allows attaching eBPF programs to cgroups, an experimental MD RAID5 writeback cache, support for Intel Cache Allocation Technology, and many other improvements and new drivers.

= Prominent features =
== New 'perf c2c' tool, for cacheline contention analysis ==

In modern systems with multiple processors, different memory modules are physically connected to different CPUs. In these [https://en.wikipedia.org/wiki/Non-uniform_memory_access NUMA] systems, accesses to local memory are faster than accesses to memory connected to other processors. When a task is multi-threaded, different threads can run on different CPUs at the same time; if these threads try to access and modify the same memory, they can have performance issues.

perf c2c (for "cache to cache") is a new tool designed to analyse and track down performance problems caused by false sharing on NUMA systems. The tool is based on x86's load latency and precise store facility events provided by Intel CPUs. At a high level, perf c2c will show you:

 * The cachelines where false sharing was detected.
 * The readers and writers to those cachelines, and the offsets where those accesses occurred.
 * The pid, tid, instruction address, function name and binary object name for those readers and writers.
 * The source file and line number for each reader and writer.
 * The average load latency for the loads to those cachelines.
 * Which NUMA nodes the samples for a cacheline came from and which CPUs were involved.

And more. For more details on perf c2c and how to use it, see https://joemario.github.io/blog/2016/09/01/c2c-blog/

Code: [https://git.kernel.org/torvalds/c/e9c848928abf4cb60601e9ae7d336f0333c98bca (merge)]

== Detailed history of scheduling events with perf sched timehist ==

'perf sched timehist' provides an analysis of scheduling events. Example usage: {{{# perf sched record -- sleep 1; perf sched timehist}}}. By default it shows the individual schedule events, including the wait time (time between sched-out and next sched-in events for the task), the task scheduling delay (time between wakeup and actually running) and the run time for the task:

{{{
    time    cpu task name        wait time sch delay run time
                [tid/pid]           (msec)    (msec)   (msec)
-------- ------ ---------------- --------- --------- --------
1.874569 [0011] gcc[31949]           0.014     0.000    1.148
1.874591 [0010] gcc[31951]           0.000     0.000    0.024
1.874603 [0010] migration/10[59]     3.350     0.004    0.011
1.874604 [0011] <idle>               1.148     0.000    0.035
1.874723 [0005] <idle>               0.016     0.000    1.383
1.874746 [0005] gcc[31949]           0.153     0.078    0.022
}}}

Code: [https://git.kernel.org/torvalds/c/47414424c53a70eceb0fc6e0a35a31a2b763d5b2 (merge)]

== Improved writeback management ==

Since the dawn of time, the way Linux synchronizes to disk the data written to memory by processes (aka. background writeback) has sucked. When Linux writes all that data in the background, it should have little impact on foreground activity. That's the definition of background activity... But for as long as anyone can remember, heavy buffered writers have not behaved like that. For instance, if you do something like {{{$ dd if=/dev/zero of=foo bs=1M count=10k}}}, or try to copy files to USB storage, and then try to start a browser or any other large app, it basically won't start before the buffered writeback is done, and your desktop, or command shell, feels unresponsive.
These problems happen because heavy writes, the kind of write activity caused by background writeback, fill up the block layer, and other IO requests have to wait a long time to be serviced (for more details, see the [https://lwn.net/Articles/682582/ LWN article]). This release adds a mechanism that throttles back buffered writeback, which makes it more difficult for heavy writers to monopolize the IO request queue, and thus provides a smoother experience in Linux desktops and shells than people were used to. The throttling algorithm monitors the latencies of requests, and shrinks or grows the request queue depth accordingly, which means that it is auto-tunable and, generally, a user would not have to touch the settings.

Recommended LWN article: [https://lwn.net/Articles/682582/ Toward less-annoying background writeback]

Code: [https://git.kernel.org/torvalds/c/1d796d6a9641fbfcd90fcfaf6fb4894a13d0304f commit], [https://git.kernel.org/torvalds/c/7637241e651ec36e409412869f986dd5f097735f commit], [https://git.kernel.org/torvalds/c/13edd5e7315a26b448c5f7f33fc7721b1e0c17ef commit], [https://git.kernel.org/torvalds/c/b57d74aff9ab92fbfb7c197c384d1adfa2827b2e commit], [https://git.kernel.org/torvalds/c/d278d4a8892f13b6a9eb6102b356402f0e062324 commit], [https://git.kernel.org/torvalds/c/cf43e6be865a582ba66ee4747ae27a0513f6bba1 commit], [https://git.kernel.org/torvalds/c/e34cbd307477ae07c5d8a8d0bd15e65a9ddaba5c commit], [https://git.kernel.org/torvalds/c/87760e5eef359788047d6fd54fc12eec74ce0d27 commit], [https://git.kernel.org/torvalds/c/80e091d10e8bf7b801d634ea8870b9e907314424 commit], [https://git.kernel.org/torvalds/c/d62118b6dd99b8f64350206a6ea6996083b28c9a commit]

== Hybrid block polling ==

Linux 4.4 [https://kernelnewbies.org/Linux_4.4#head-cd57c6abf8822152b3a175dd68c9610562b220d5 added] support for polling requests in the block layer, a similar approach to what NAPI does for networking, which can improve performance for high-throughput devices (eg: NVM).
Continuously polling a device, however, can cause excessive CPU consumption and sometimes even worse throughput. This release includes a new hybrid, adaptive type of polling. Instead of polling immediately after IO submission, the kernel introduces an artificial delay, and then polls after that. For example, if the IO is presumed to complete in 8 usecs from now, the kernel sleeps for 4 usecs, wakes up, and then does the polling. This still puts a sleep/wakeup cycle in the IO path, but instead of the wakeup happening after the IO has completed, it happens before. With this hybrid scheme, Linux can achieve big latency reductions while still using the same (or less) amount of CPU. Thanks to improved statistics gathering included in this release, the kernel can measure the completion time of requests and calculate how long it should sleep.

Hybrid block polling is disabled by default. A new sysfs file, {{{/sys/block/<dev>/queue/io_poll_delay}}}, has been added, which controls how polling behaves:

 * {{{-1}}}: never enter hybrid sleep, always poll (default).
 * {{{0}}}: use half of the mean completion time for this request type as the sleep delay (aka: hybrid poll).
 * {{{>0}}}: disregard the mean value calculated by the kernel, and always use this specific value as the sleep delay.

Code: [https://git.kernel.org/torvalds/c/189ce2b9dcc3494410a576fbecbedbb6b21e51e0 commit], [https://git.kernel.org/torvalds/c/06426adf072bca62ac31ea396ff2159a34f276c2 commit], [https://git.kernel.org/torvalds/c/64f1c21e86f7fe63337b5c23c129de3ec506431d commit]

== Allow attaching eBPF programs to cgroups ==

This release adds eBPF hooks for cgroups, to allow eBPF programs for network filtering and accounting to be attached to cgroups, so that they apply to all sockets of all tasks placed in that cgroup. A new BPF program type is added, {{{BPF_PROG_TYPE_CGROUP_SKB}}}.
The [http://man7.org/linux/man-pages/man2/bpf.2.html bpf(2)] syscall is extended with two new commands, {{{BPF_PROG_ATTACH}}} and {{{BPF_PROG_DETACH}}}, which allow attaching eBPF programs to a target and detaching them from it. This feature is configurable ({{{CONFIG_CGROUP_BPF}}}).

Recommended LWN article: [https://lwn.net/Articles/698073/ Network filtering for control groups]

Code: [https://git.kernel.org/torvalds/c/0e33661de493db325435d565a4a722120ae4cbf3 commit], [https://git.kernel.org/torvalds/c/3007098494bec614fb55dee7bc0410bb7db5ad18 commit], [https://git.kernel.org/torvalds/c/f4324551489e8781d838f941b7aee4208e52e8bf commit], [https://git.kernel.org/torvalds/c/c11cd3a6ec3a817c6b71b00c559e25d855f7e5b4 commit], [https://git.kernel.org/torvalds/c/33b486793cb31311f3a91ae4fe4be5926e7677b0 commit], [https://git.kernel.org/torvalds/c/d8c5b17f2bc0de09fbbfa14d90e8168163a579e7 commit]

This release also adds a new cgroup-based program type, {{{BPF_PROG_TYPE_CGROUP_SOCK}}}. As with {{{BPF_PROG_TYPE_CGROUP_SKB}}}, programs can be attached to a cgroup, and they run any time a process in the cgroup opens an {{{AF_INET}}} or {{{AF_INET6}}} socket. Currently only {{{sk_bound_dev_if}}} is exported to userspace for modification by a bpf program.

Code: [https://git.kernel.org/torvalds/c/b2cd12574aa3e1625f471ff57cde7f628a18a46b commit], [https://git.kernel.org/torvalds/c/61023658760032e97869b07d54be9681d2529e77 commit], [https://git.kernel.org/torvalds/c/ad2805dc79e647ec2aee931a51924fda9d03b2fc commit], [https://git.kernel.org/torvalds/c/aa4c1037a30f4e88f444e83d42c2befbe0d5caf5 commit], [https://git.kernel.org/torvalds/c/4f2e7ae56e04cfe670cf39152a8d015984c90351 commit], [https://git.kernel.org/torvalds/c/554ae6e792ef38020b80b4d5127c51d510c0918f commit]

== Experimental MD raid5 writeback cache and FAILFAST support ==

This release implements a raid5 writeback cache in the MD subsystem (Multiple Devices). Its goal is to aggregate writes so that they become full-stripe writes, reducing read-modify-write operations.
It is helpful, for example, for workloads that do sequential writes followed by fsync. This feature is experimental and off by default.

Code: [https://git.kernel.org/torvalds/c/c757ec95c22036b1cb85c56ede368bf8f6c08658 commit], [https://git.kernel.org/torvalds/c/937621c36e0ea1af2aceeaea412ba3bd80247199 commit], [https://git.kernel.org/torvalds/c/2ded370373a400c20cf0c6e941e724e61582a867 commit], [https://git.kernel.org/torvalds/c/1e6d690b9334b7e1b31d25fd8d93e980e449a5f9 commit], [https://git.kernel.org/torvalds/c/a39f7afde358ca89e9fc09a5525d3f8631a98a3a commit], [https://git.kernel.org/torvalds/c/2c7da14b90a01e48b17a028de6050a796cfd6d8d commit], [https://git.kernel.org/torvalds/c/9ed988f5dc673f009d78f7ac55c5da88e1cf58a0 commit], [https://git.kernel.org/torvalds/c/b4c625c67362b3940f619c1a836b4e8329106658 commit], [https://git.kernel.org/torvalds/c/5aabf7c49d9ebe54a318976276b187637177a03e commit], [https://git.kernel.org/torvalds/c/3bddb7f8f264ec58dc86e11ca97341c24f9d38f6 commit]

This release also adds failfast support. This feature quickly marks RAID disks with failed IOs as broken, so that they are avoided in the future, which can improve latency.

Code: [https://git.kernel.org/torvalds/c/688834e6ae6b21e3d98b5cf2586aa4a9b515c3a0 commit], [https://git.kernel.org/torvalds/c/46533ff7fefb7e9e3539494f5873b00091caa8eb commit], [https://git.kernel.org/torvalds/c/8d3ca83dcf9ca3d58822eddd279918d46f41e9ff commit], [https://git.kernel.org/torvalds/c/1919cbb23bf1b3e0fdb7b6edfb7369f920744087 commit], [https://git.kernel.org/torvalds/c/2e52d449bcec31cb66d80aa8c798b15f76f1f5e0 commit], [https://git.kernel.org/torvalds/c/212e7eb7a3403464a796c05c2fc46cae3b62d803 commit]

== Support for Intel Cache Allocation Technology ==

An Intel feature that allows setting policies on the L2/L3 CPU caches; e.g. real time tasks could be assigned dedicated cache space. For more details, read the recommended LWN article: [https://lwn.net/Articles/694800/ Controlling access to the memory cache].
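To make the "dedicated cache space" idea a bit more concrete, here is a minimal Python sketch. It is not the kernel interface: it only illustrates that a cache allocation is expressed as a capacity bitmask of contiguous bits, so dedicating cache to one group means giving it bits that no other group uses. The cache size of 20 ways, the 4-way real-time share, and the helper name {{{contiguous_mask}}} are all assumptions made up for this illustration.

```python
def contiguous_mask(first_way: int, n_ways: int) -> int:
    """Return a bitmask with n_ways consecutive bits set, starting at bit first_way."""
    if first_way < 0 or n_ways <= 0:
        raise ValueError("first_way must be >= 0 and n_ways must be > 0")
    return ((1 << n_ways) - 1) << first_way


# Hypothetical numbers, chosen only for the example.
TOTAL_WAYS = 20  # assumed number of allocatable cache portions
RT_WAYS = 4      # assumed share reserved for real-time tasks

# The real-time group gets the top 4 portions, everything else gets the rest.
rt_mask = contiguous_mask(TOTAL_WAYS - RT_WAYS, RT_WAYS)
default_mask = contiguous_mask(0, TOTAL_WAYS - RT_WAYS)

print(f"real-time group mask: {rt_mask:05x}")
print(f"default group mask:   {default_mask:05x}")
```

Because the two masks share no bits, tasks in the default group cannot evict cache lines from the portion reserved for the real-time group, which is the effect described above.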
Code: [https://git.kernel.org/torvalds/c/78e99b4a2b9afb1c304259fcd4a1c71ca97e3acd commit], [https://git.kernel.org/torvalds/c/4e978d06dedb8207b298a5a8a49fce4b2ab80d12 commit], [https://git.kernel.org/torvalds/c/113c60970cf41723891e3a1b303517eaf8510bb5 commit], [https://git.kernel.org/torvalds/c/12e0110c11a460b890ed7e1071198ced732152c9 commit], [https://git.kernel.org/torvalds/c/458b0d6e751b04216873a5ee9c899be2cd2f80f3 commit], [https://git.kernel.org/torvalds/c/60cf5e101fd4441ab112a81e88726efb6fd7542c commit], [https://git.kernel.org/torvalds/c/4f341a5e48443fcc2e2d935ca990e462c02bb1a6 commit], [https://git.kernel.org/torvalds/c/60ec2440c63dea88a5ef13e2b2549730a0d75a37 commit], [https://git.kernel.org/torvalds/c/e02737d5b82640497637d18428e2793bb7f02881 commit]

= Core (various) =
 * Kernel configuration system: Introduce the "imply" keyword. The "imply" keyword is a weak version of "select" where the target config symbol can still be turned off, avoiding the pitfalls that come with the "select" keyword. This is useful e.g. with multiple drivers that want to indicate their ability to hook into a secondary subsystem while allowing the user to configure that subsystem out without also having to unset these drivers [https://git.kernel.org/torvalds/c/237e3ad0f195d8fd34f1299e45f04793832a16fc commit]
 * To cover the needs of some systems where suspend-to-idle is the preferred suspend method, rework the system sleep state selection interface (but preserve backwards compatibility). A new sysfs file, {{{/sys/power/mem_sleep}}}, is added; it controls the system suspend mode triggered when writing {{{mem}}} to {{{/sys/power/state}}} (in analogy with what {{{/sys/power/disk}}} does for hibernation).
   It selects suspend-to-RAM ({{{deep}}} sleep) by default (if supported) and falls back to suspend-to-idle ({{{s2idle}}}) otherwise. A new command line argument, {{{mem_sleep_default}}}, allows that default to be overridden if need be [https://git.kernel.org/torvalds/c/406e79385f3223d82272cf2be86bc95cd000a258 commit]
 * Task scheduler: Add support for tasks that inject idle, used by some idle injection drivers such as the Intel powerclamp and ACPI PAD drivers [https://git.kernel.org/torvalds/c/c1de45ca831acee9b72c9320dde447edafadb43f commit]
 * initramfs: allow choosing the embedded initramfs compression algorithm again [https://git.kernel.org/torvalds/c/db2aa7fd15e857891cefbada8348c8d938c7a2bc commit]
 * posix-timers: Make them configurable, removing about 25KB from the kernel binary size when configured out. The corresponding syscalls are routed to a stub that logs the attempt to use them [https://git.kernel.org/torvalds/c/baa73d9e478ff32d62f3f9422822b59dd9a95a21 commit]
 * printk: add Kconfig option to set default console loglevel [https://git.kernel.org/torvalds/c/a8cfdc68f6cfc0c7ffc6d664406fe7f06f17eef4 commit]
 * Documentation: create a user's manual book [https://git.kernel.org/torvalds/c/9d85025b0418163fae079c9ba8f8445212de8568 commit]
 * driver core: Functional dependencies tracking support [https://static.lwn.net/kerneldoc/driver-api/device_link.html documentation], [https://git.kernel.org/torvalds/c/9ed9895370aedd6032af2a9181c62c394d08223b commit]
 * driver-core: add test module for asynchronous probing [https://git.kernel.org/torvalds/c/79543cf2b18ea4a35f8864849d7ad8882ea8a23d commit]
 * iomap: implement direct I/O path [https://git.kernel.org/torvalds/c/ff6a9292e6f633d596826be5ba70d3ef90cc3300 commit]
 * Extend {{{rodata=off}}} boot cmdline parameter to module mappings [https://git.kernel.org/torvalds/c/39290b389ea2654f9190e3b48c57d27b24def83e commit]
 * swiotlb: Add {{{swiotlb=noforce}}} debug option to aid debugging and catch devices not supporting DMA to memory
   outside the 32-bit address space [https://git.kernel.org/torvalds/c/fff5d99225107f5f13fe4a9805adc2a1c4b5fb00 commit]

= File systems =
 * OVERLAYFS
  * When copying up within the same fs, try to use clone copies [https://git.kernel.org/torvalds/c/2ea98466491b7609ace297647b07c28d99ef3722 commit]
  * Allow renaming a directory with backwards incompatible feature "redirect_dir" [https://git.kernel.org/torvalds/c/a6c6065511411c57167a6cdae0c33263fb662b51 commit], [https://git.kernel.org/torvalds/c/688ea0e5a0e2278e2fcd0014324ab1ba68e70ad7 commit], [https://git.kernel.org/torvalds/c/3ea22a71b65b6743a53e286ff4991a06b9d2597c commit], [https://git.kernel.org/torvalds/c/c5bef3a72b9d8a2040d5e9f4bde03db7c86bbfce commit]
 * EXT4
  * Forbid data journaling when data is encrypted [https://git.kernel.org/torvalds/c/73b92a2a5e97d17cc4d5c4fe9d724d3273fb6fd2 commit]
 * F2FS
  * Support multiple devices [https://git.kernel.org/torvalds/c/3c62be17d4f562f43fe1d03b48194399caa35aa5 commit]
 * NFS
  * Add support for a new NFSv4.2 mode_umask attribute that makes ACL inheritance a little more useful in environments that default to restrictive umasks [https://git.kernel.org/torvalds/c/dff25ddb48086afcb434770caa3d6849a4489b85 commit], [https://git.kernel.org/torvalds/c/47057abde515155a4fee53038e7772d6b387e0aa commit]
 * UBIFS
  * Add support for file encryption using the fscrypt framework [https://git.kernel.org/torvalds/c/39d2c3b96e072c8756f3b980588fa516b7988cb1 (merge)]
 * XFS
  * Faster buffer cache lookups [https://git.kernel.org/torvalds/c/6031e73a5b3f85ec45cac08ef90995b2d3f941c7 commit]
  * Use iomap for Direct I/O (much simpler, faster, and has lower IO latency than the existing direct IO infrastructure) [https://git.kernel.org/torvalds/c/acdda3aae146d9b69d30e9d8a32a8d8937055523 commit]
  * Deprecate barrier/nobarrier mount option [https://git.kernel.org/torvalds/c/4cf4573d899cd80d8578c050061dc342f99f3a32 commit]
 * CIFS
  * New mount option {{{snapshot=