KernelNewbies:

Linux 4.10 changelog.

Summary: This release adds support for virtualized GPUs, a new 'perf c2c' tool for cacheline contention analysis in NUMA systems, a new 'perf sched timehist' command for a detailed history of task scheduling, improved writeback management, a new hybrid block polling method that uses less CPU than pure polling, the ability to attach eBPF programs to cgroups, an experimental MD RAID5 writeback cache, support for Intel Cache Allocation Technology, and many other improvements and new drivers.


1. Prominent features

1.1. Virtual GPU support

This release adds support for Intel GVT-g for KVM (a.k.a. KVMGT), a full GPU virtualization solution with mediated pass-through, starting from 4th generation Intel Core processors with Intel Graphics. This feature is based on a new VFIO Mediated Device framework. Unlike direct pass-through alternatives, the mediated device framework allows KVMGT to offer a complete virtualized GPU with full GPU features to each of the virtualized guests, with part of the performance-critical resources directly assigned, while still keeping performance close to native. The capability of running a native graphics driver inside a VM, without hypervisor intervention in performance-critical paths, achieves a good balance among performance, features, and sharing capability.

For more details, see these papers:

[https://www.usenix.org/conference/atc14/technical-sessions/presentation/tian A Full GPU Virtualization Solution with Mediated Pass-Through]

[http://www.linux-kvm.org/images/f/f3/01x08b-KVMGT-a.pdf KVMGT: a Full GPU Virtualization Solution]

[http://www.linux-kvm.org/images/5/59/02x03-Neo_Jia_and_Kirti_Wankhede-vGPU_on_KVM-A_VFIO_based_Framework.pdf vGPU on KVM] ([https://www.youtube.com/watch?v=Xs0TJU_sIPc video])

[https://01.org/igvt-g/ Intel GVT main site]

Code: VFIO Mediated device [https://git.kernel.org/torvalds/c/7b96953bc640b6b25665fe17ffca4b668b371f14 commit], [https://git.kernel.org/torvalds/c/fa3da00cb8c0d403030f4805ae615b444f0d2f3c commit], [https://git.kernel.org/torvalds/c/7ed3ea8a71187a4569eb65f647ea4af0cdb9a856 commit], [https://git.kernel.org/torvalds/c/32f55d835b23830bf9295d038a1693ce9fd41b56 commit], [https://git.kernel.org/torvalds/c/2169037dc322d8baa84d9bd4468995f818f25d82 commit], [https://git.kernel.org/torvalds/c/3624a2486c8ca10a2c730c704441fdd034a0d9b7 commit], [https://git.kernel.org/torvalds/c/ea85cf353e4fed4adcf8c960f4add2a286bc2c91 commit], [https://git.kernel.org/torvalds/c/7896c998f0e7160df97bd7aaae9807120535bf14 commit], [https://git.kernel.org/torvalds/c/8f0d5bb95f763cacad7654304050ec1b636bb04a commit], [https://git.kernel.org/torvalds/c/a54eb55045ae9b3032c71f1134e30d02de527038 commit], [https://git.kernel.org/torvalds/c/c086de818dd81c3c2f7cecff23de6585b74340c0 commit], [https://git.kernel.org/torvalds/c/b3c0a866f1692da2d1059dadd9c429ff5b364fc9 commit], [https://git.kernel.org/torvalds/c/c535d34569bbc61ebf25a5505ab9eafba057345f commit], [https://git.kernel.org/torvalds/c/c747f08aea847c8c0704acf9375ca83c4800f6c1 commit], [https://git.kernel.org/torvalds/c/ef198aaa169c61ab357a5cea5a4ce1ee6aafa824 commit], [https://git.kernel.org/torvalds/c/a1e03e9bccd1402971213c4953ea59aab8142644 commit], [https://git.kernel.org/torvalds/c/2818c6e91980d966d015a9f763ab24b41e6a7c3d commit], [https://git.kernel.org/torvalds/c/8e1c5a4048b89d04d8d1ee655ce1f685e6fddde4 commit], [https://git.kernel.org/torvalds/c/3771bd96976dbd01ce4995760ed1d0932f30a366 commit], [https://git.kernel.org/torvalds/c/9d1a546c53b4c1c378b0f34de84ddee2c7d4c90c commit], [https://git.kernel.org/torvalds/c/5188287a860b6ec5950d5156d63056156f59ee3b commit]; KVMGT [https://git.kernel.org/torvalds/c/8be8f4a9a9ce48d545512ef7299da607401f3879 (merge)]; Intel GVT-g [https://git.kernel.org/torvalds/c/06a75ace46e2fdd1d93b06228df0e2dfe526cc27 (merge)]

1.2. New 'perf c2c' tool, for cacheline contention analysis

In modern systems with multiple processors, different memory modules are physically connected to different CPUs. In these [https://en.wikipedia.org/wiki/Non-uniform_memory_access NUMA] systems, accesses to local memory are faster than accesses to memory connected to other processors. When a task is multi-threaded, different threads can run on different CPUs at the same time; if these threads access and modify the same memory, performance can suffer, because the cacheline holding that memory has to be transferred between CPUs.

perf c2c (for "cache to cache") is a new tool designed to analyse and track down performance problems caused by false sharing on NUMA systems. The tool is based on x86's load latency and precise store facility events provided by Intel CPUs. At a high level, perf c2c shows the cachelines where false sharing was detected, the readers and writers of those cachelines, and more.

For more details on perf c2c and how to use it, see https://joemario.github.io/blog/2016/09/01/c2c-blog/
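To make the false sharing scenario concrete, here is a minimal sketch that maps field offsets of a data structure to cachelines. It assumes a 64-byte cacheline (typical for x86, but an assumption here), and the field names and offsets are hypothetical. perf c2c itself works from hardware samples, not from a static layout like this one.

```python
from collections import defaultdict

CACHELINE_SIZE = 64  # bytes; assumed value, typical for x86 CPUs


def cacheline_index(offset, line_size=CACHELINE_SIZE):
    """Return the index of the cacheline that contains a byte offset."""
    return offset // line_size


def fields_sharing_lines(field_offsets, line_size=CACHELINE_SIZE):
    """Group field names by cacheline and keep only lines holding 2+ fields."""
    lines = defaultdict(list)
    for name, offset in field_offsets.items():
        lines[cacheline_index(offset, line_size)].append(name)
    return {line: names for line, names in lines.items() if len(names) > 1}


# Hypothetical layout: two counters updated by different threads sit
# 8 bytes apart, a third field starts on the next cacheline.
layout = {"reader_hits": 0, "writer_count": 8, "padded_stat": 64}
shared = fields_sharing_lines(layout)
print(shared)  # {0: ['reader_hits', 'writer_count']}
```

Fields that land on the same line and are written by different threads are candidates for false sharing; moving one of them to its own cacheline (as "padded_stat" is here) is the usual remedy.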

Code: [https://git.kernel.org/torvalds/c/e9c848928abf4cb60601e9ae7d336f0333c98bca (merge)]

1.3. Detailed history of scheduling events with perf sched timehist

'perf sched timehist' provides an analysis of scheduling events. Example usage: # perf sched record -- sleep 1; perf sched timehist

By default it shows the individual schedule events, including the wait time (time between sched-out and next sched-in events for the task), the task scheduling delay (time between wakeup and actually running) and run time for the task:

{{{
time    cpu    task name    wait time    sch delay    run time
}}}
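The three time columns can be sketched from the definitions above. This is only an illustration of the arithmetic, not perf's implementation; the function name and the timestamps (in microseconds) are made up for the example.

```python
def timehist_metrics(prev_sched_out, wakeup, sched_in, sched_out):
    """Compute (wait time, sch delay, run time) from four timestamps.

    prev_sched_out: when the task was last switched out
    wakeup:         when the task was woken up
    sched_in:       when the task actually started running again
    sched_out:      when the task was switched out again
    """
    wait_time = sched_in - prev_sched_out  # sched-out to next sched-in
    sch_delay = sched_in - wakeup          # wakeup to actually running
    run_time = sched_out - sched_in        # time spent on the CPU
    return wait_time, sch_delay, run_time


# Hypothetical timestamps in microseconds.
wait_time, sch_delay, run_time = timehist_metrics(
    prev_sched_out=100, wakeup=130, sched_in=135, sched_out=150
)
print(wait_time, sch_delay, run_time)  # 35 5 15
```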

Code: [https://git.kernel.org/torvalds/c/47414424c53a70eceb0fc6e0a35a31a2b763d5b2 (merge)]

1.4. Improved writeback management

Since the dawn of time, the way Linux synchronizes to disk the data written to memory by processes (a.k.a. background writeback) has sucked. When Linux writes all that data in the background, it should have little impact on foreground activity. That's the definition of background activity... But for as long as can be remembered, heavy buffered writers have not behaved like that. For instance, if you do something like $ dd if=/dev/zero of=foo bs=1M count=10k, or try to copy files to USB storage, and then try to start a browser or any other large app, it basically won't start before the buffered writeback is done, and your desktop or command shell feels unresponsive. These problems happen because heavy writes (the kind of write activity caused by background writeback) fill up the block layer, and other IO requests have to wait a long time to be served (for more details, see the [https://lwn.net/Articles/682582/ LWN article]).

This release adds a mechanism that throttles back buffered writeback, which makes it more difficult for heavy writers to monopolize the IO request queue, and thus provides a smoother experience on Linux desktops and shells than what people were used to. The throttling algorithm monitors the latencies of requests and shrinks or grows the request queue depth accordingly, which means that it is self-tuning and, in general, a user should not have to touch the settings.
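The idea of shrinking or growing the queue depth based on observed latencies can be sketched as a simple feedback loop. This is not the kernel's algorithm: the halving/increment policy, the latency target, and the depth limits below are all assumptions chosen for illustration.

```python
def adjust_queue_depth(depth, observed_latency_us, target_latency_us,
                       min_depth=1, max_depth=64):
    """Return a new writeback queue depth for one monitoring window.

    Assumed policy: halve the depth when latency exceeds the target,
    otherwise grow it by one, staying within [min_depth, max_depth].
    """
    if observed_latency_us > target_latency_us:
        return max(min_depth, depth // 2)
    return min(max_depth, depth + 1)


# Hypothetical run: latency is too high for two windows, then recovers.
depth = 16
history = []
for latency_us in [5000, 4000, 1500, 1000]:
    depth = adjust_queue_depth(depth, latency_us, target_latency_us=2000)
    history.append(depth)
print(history)  # [8, 4, 5, 6]
```

Backing off quickly and recovering slowly is a common choice for this kind of controller, since it favours foreground latency over background throughput.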

Recommended LWN article: [https://lwn.net/Articles/682582/ Toward less-annoying background writeback]

Code: [https://git.kernel.org/torvalds/c/1d796d6a9641fbfcd90fcfaf6fb4894a13d0304f commit], [https://git.kernel.org/torvalds/c/7637241e651ec36e409412869f986dd5f097735f commit], [https://git.kernel.org/torvalds/c/13edd5e7315a26b448c5f7f33fc7721b1e0c17ef commit], [https://git.kernel.org/torvalds/c/b57d74aff9ab92fbfb7c197c384d1adfa2827b2e commit], [https://git.kernel.org/torvalds/c/d278d4a8892f13b6a9eb6102b356402f0e062324 commit], [https://git.kernel.org/torvalds/c/cf43e6be865a582ba66ee4747ae27a0513f6bba1 commit], [https://git.kernel.org/torvalds/c/e34cbd307477ae07c5d8a8d0bd15e65a9ddaba5c commit], [https://git.kernel.org/torvalds/c/87760e5eef359788047d6fd54fc12eec74ce0d27 commit], [https://git.kernel.org/torvalds/c/80e091d10e8bf7b801d634ea8870b9e907314424 commit], [https://git.kernel.org/torvalds/c/d62118b6dd99b8f64350206a6ea6996083b28c9a commit]

1.5. Hybrid block polling

Linux 4.4 [https://kernelnewbies.org/Linux_4.4#head-cd57c6abf8822152b3a175dd68c9610562b220d5 added] support for polling requests in the block layer, a similar approach to what NAPI does for networking, which can improve performance for high-throughput devices (e.g. NVM). Continuously polling a device, however, can cause excessive CPU consumption and sometimes even worse throughput. This release includes a new hybrid, adaptive type of polling. Instead of polling right after IO submission, the kernel introduces an artificial delay, and then polls after that. For example, if the IO is presumed to complete in 8 usecs from now, the kernel sleeps for 4 usecs, wakes up, and then does the polling. This still puts a sleep/wakeup cycle in the IO path, but instead of the wakeup happening after the IO has completed, it happens before. With this hybrid scheme, Linux can achieve big latency reductions while using the same amount of CPU or less. Thanks to improved statistics gathering included in this release, the kernel can measure the completion time of requests and calculate how long it should sleep.

The hybrid block polling is disabled by default. A new sysfs file, /sys/block/<dev>/queue/io_poll_delay, has been added, which makes the polling behave as follows: -1: never enter hybrid sleep, always poll (default); 0: use half of the mean completion time for this request type as the sleep delay (a.k.a. hybrid poll); >0: disregard the mean value calculated by the kernel, and always use this specific value as the sleep delay.
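The three io_poll_delay modes can be summarized in a short sketch. It only models the decision described above; the function name is hypothetical, and treating the values as microseconds follows the 8 usecs example in the text.

```python
def hybrid_poll_sleep_us(io_poll_delay, mean_completion_us):
    """Return how long to sleep before polling, or None for pure polling.

    io_poll_delay:      value written to the sysfs file
    mean_completion_us: mean completion time measured for this request type
    """
    if io_poll_delay == -1:
        return None                    # never sleep, always poll (default)
    if io_poll_delay == 0:
        return mean_completion_us / 2  # hybrid poll: sleep half the mean
    if io_poll_delay > 0:
        return io_poll_delay           # fixed, user-provided sleep delay
    raise ValueError("io_poll_delay must be -1, 0 or a positive value")


print(hybrid_poll_sleep_us(0, 8))   # 4.0
print(hybrid_poll_sleep_us(-1, 8))  # None
print(hybrid_poll_sleep_us(3, 8))   # 3
```

With the 8 usecs example from the text, mode 0 yields a 4 usecs sleep before polling starts.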

Code: [https://git.kernel.org/torvalds/c/189ce2b9dcc3494410a576fbecbedbb6b21e51e0 commit], [https://git.kernel.org/torvalds/c/06426adf072bca62ac31ea396ff2159a34f276c2 commit], [https://git.kernel.org/torvalds/c/64f1c21e86f7fe63337b5c23c129de3ec506431d commit]

1.6. Allow attaching eBPF programs to cgroups

This release adds eBPF hooks for cgroups, to allow eBPF programs for network filtering and accounting to be attached to cgroups, so that they apply to all sockets of all tasks placed in that cgroup. A new BPF program type is added, BPF_PROG_TYPE_CGROUP_SKB. The [http://man7.org/linux/man-pages/man2/bpf.2.html bpf(2)] syscall is extended with two new commands, BPF_PROG_ATTACH and BPF_PROG_DETACH, which allow attaching eBPF programs to a target and detaching them from it. This feature is configurable (CONFIG_CGROUP_BPF).

Recommended LWN article: [https://lwn.net/Articles/698073/ Network filtering for control groups]

Code: [https://git.kernel.org/torvalds/c/0e33661de493db325435d565a4a722120ae4cbf3 commit], [https://git.kernel.org/torvalds/c/3007098494bec614fb55dee7bc0410bb7db5ad18 commit], [https://git.kernel.org/torvalds/c/f4324551489e8781d838f941b7aee4208e52e8bf commit], [https://git.kernel.org/torvalds/c/c11cd3a6ec3a817c6b71b00c559e25d855f7e5b4 commit], [https://git.kernel.org/torvalds/c/33b486793cb31311f3a91ae4fe4be5926e7677b0 commit], [https://git.kernel.org/torvalds/c/d8c5b17f2bc0de09fbbfa14d90e8168163a579e7 commit]

This release also adds a new cgroup-based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to BPF_PROG_TYPE_CGROUP_SKB, these programs can be attached to a cgroup, and they run any time a process in the cgroup opens an AF_INET or AF_INET6 socket. Currently only sk_bound_dev_if is exported to userspace for modification by a bpf program.

Code: [https://git.kernel.org/torvalds/c/b2cd12574aa3e1625f471ff57cde7f628a18a46b commit], [https://git.kernel.org/torvalds/c/61023658760032e97869b07d54be9681d2529e77 commit], [https://git.kernel.org/torvalds/c/ad2805dc79e647ec2aee931a51924fda9d03b2fc commit], [https://git.kernel.org/torvalds/c/aa4c1037a30f4e88f444e83d42c2befbe0d5caf5 commit], [https://git.kernel.org/torvalds/c/4f2e7ae56e04cfe670cf39152a8d015984c90351 commit], [https://git.kernel.org/torvalds/c/554ae6e792ef38020b80b4d5127c51d510c0918f commit], [https://git.kernel.org/torvalds/c/7f677633379b4abb3281cdbe7e7006f049305c03 commit]

1.7. Experimental MD raid5 writeback cache and FAILFAST support

This release implements a raid5 writeback cache in the MD (Multiple Devices) subsystem. Its goal is to aggregate writes in order to make full-stripe writes and reduce read-modify-write operations. It is helpful, for example, for workloads that do sequential writes followed by fsync.

This feature is experimental and off by default.
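To see why aggregating writes into full stripes helps, the sketch below counts disk IOs in a simplified RAID5 model. It assumes that each small write updates one chunk with a read-modify-write cycle (read old data, read old parity, write new data, write new parity) and that a full-stripe write needs no reads because parity is computed from the new data alone. The function names are hypothetical, and real MD behaviour is more nuanced than this.

```python
def rmw_ios(chunks_written):
    """IOs needed to update chunks one at a time with read-modify-write."""
    return {
        "reads": 2 * chunks_written,   # old data + old parity per chunk
        "writes": 2 * chunks_written,  # new data + new parity per chunk
    }


def full_stripe_ios(data_disks):
    """IOs needed to write one whole stripe at once."""
    return {
        "reads": 0,                # parity comes from the new data only
        "writes": data_disks + 1,  # every data chunk plus the parity chunk
    }


# Hypothetical array with 4 data disks + 1 parity disk: writing the 4 data
# chunks of one stripe separately versus as a single full-stripe write.
print(rmw_ios(4))          # {'reads': 8, 'writes': 8}
print(full_stripe_ios(4))  # {'reads': 0, 'writes': 5}
```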

Code: [https://git.kernel.org/torvalds/c/c757ec95c22036b1cb85c56ede368bf8f6c08658 commit], [https://git.kernel.org/torvalds/c/937621c36e0ea1af2aceeaea412ba3bd80247199 commit], [https://git.kernel.org/torvalds/c/2ded370373a400c20cf0c6e941e724e61582a867 commit], [https://git.kernel.org/torvalds/c/1e6d690b9334b7e1b31d25fd8d93e980e449a5f9 commit], [https://git.kernel.org/torvalds/c/a39f7afde358ca89e9fc09a5525d3f8631a98a3a commit], [https://git.kernel.org/torvalds/c/2c7da14b90a01e48b17a028de6050a796cfd6d8d commit], [https://git.kernel.org/torvalds/c/9ed988f5dc673f009d78f7ac55c5da88e1cf58a0 commit], [https://git.kernel.org/torvalds/c/b4c625c67362b3940f619c1a836b4e8329106658 commit], [https://git.kernel.org/torvalds/c/5aabf7c49d9ebe54a318976276b187637177a03e commit], [https://git.kernel.org/torvalds/c/3bddb7f8f264ec58dc86e11ca97341c24f9d38f6 commit]

This release also adds "failfast" support. RAID disks with failed IOs are marked as broken quickly and avoided in the future, which can improve latency.

Code: [https://git.kernel.org/torvalds/c/688834e6ae6b21e3d98b5cf2586aa4a9b515c3a0 commit], [https://git.kernel.org/torvalds/c/46533ff7fefb7e9e3539494f5873b00091caa8eb commit], [https://git.kernel.org/torvalds/c/8d3ca83dcf9ca3d58822eddd279918d46f41e9ff commit], [https://git.kernel.org/torvalds/c/1919cbb23bf1b3e0fdb7b6edfb7369f920744087 commit], [https://git.kernel.org/torvalds/c/2e52d449bcec31cb66d80aa8c798b15f76f1f5e0 commit], [https://git.kernel.org/torvalds/c/212e7eb7a3403464a796c05c2fc46cae3b62d803 commit]

1.8. Support for Intel Cache Allocation Technology

Cache Allocation Technology is an Intel feature that allows setting policies on the L2/L3 CPU caches; e.g. real time tasks could be assigned dedicated cache space. For more details, read the recommended LWN article: [https://lwn.net/Articles/694800/ Controlling access to the memory cache].
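As background that is not spelled out in the text above: Cache Allocation Technology describes each allocation as a capacity bitmask whose set bits must be contiguous, with each bit standing for a portion of the cache. The sketch below checks that rule and tests whether a real time task's mask overlaps with others. The helper names and the 8-bit masks are hypothetical.

```python
def is_contiguous(mask):
    """True if mask is non-zero and its set bits form one contiguous run."""
    if mask <= 0:
        return False
    lowest = mask & -mask   # isolate the lowest set bit
    filled = mask + lowest  # the carry clears a contiguous run of ones
    return (filled & mask) == 0


def is_dedicated(rt_mask, other_masks):
    """True if rt_mask shares no cache portion with any other mask."""
    return all((rt_mask & other) == 0 for other in other_masks)


# Hypothetical 8-bit masks: the real time task gets the upper half.
rt_mask = 0b11110000
print(is_contiguous(rt_mask))               # True
print(is_contiguous(0b01010000))            # False
print(is_dedicated(rt_mask, [0b00001111]))  # True
print(is_dedicated(rt_mask, [0b00011111]))  # False
```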

Code: [https://git.kernel.org/torvalds/c/78e99b4a2b9afb1c304259fcd4a1c71ca97e3acd commit], [https://git.kernel.org/torvalds/c/4e978d06dedb8207b298a5a8a49fce4b2ab80d12 commit], [https://git.kernel.org/torvalds/c/113c60970cf41723891e3a1b303517eaf8510bb5 commit], [https://git.kernel.org/torvalds/c/12e0110c11a460b890ed7e1071198ced732152c9 commit], [https://git.kernel.org/torvalds/c/458b0d6e751b04216873a5ee9c899be2cd2f80f3 commit], [https://git.kernel.org/torvalds/c/60cf5e101fd4441ab112a81e88726efb6fd7542c commit], [https://git.kernel.org/torvalds/c/4f341a5e48443fcc2e2d935ca990e462c02bb1a6 commit], [https://git.kernel.org/torvalds/c/60ec2440c63dea88a5ef13e2b2549730a0d75a37 commit], [https://git.kernel.org/torvalds/c/e02737d5b82640497637d18428e2793bb7f02881 commit]

2. Core (various)

3. File systems

4. Memory management

5. Block layer

6. Tracing and perf tool

7. Virtualization

8. Security

9. Graphics

10. Networking

11. Architectures

12. Drivers

12.1. Graphics

12.2. Storage

12.3. Drivers in the Staging area

12.4. Networking

12.5. Audio

12.6. Tablets, touch screens, keyboards, mouses

12.7. TV tuners, webcams, video capturers

12.8. Universal Serial Bus

12.9. Serial Peripheral Interface (SPI)

12.10. Watchdog

12.11. Serial

12.12. ACPI, EFI, cpufreq, thermal, Power Management

12.13. Real Time Clock (RTC)

12.14. Voltage, current regulators, power capping, power supply

12.15. Pin Controllers (pinctrl)

12.16. Multi Media Card (MMC)

12.17. Industrial I/O (iio)

12.18. Multi Function Devices (MFD)

12.19. Pulse-Width Modulation (PWM)

12.20. Inter-Integrated Circuit (I2C)

12.21. Hardware monitoring (hwmon)

12.22. General Purpose I/O (gpio)

12.23. Leds

12.24. DMA engines

12.25. Clocks

== Hardware Random Number Generator (hwrng) ==

== Cryptography hardware acceleration ==

12.26. PCI

12.27. Various

13. List of merges

14. Other news sites

KernelNewbies: Linux_4.10 (last edited 2017-02-19 19:39:08 by diegocalleja)