
Linux 4.10 was released on 19 Feb 2017.

Summary: This release adds support for virtualized GPUs, a new 'perf c2c' tool for cacheline contention analysis in NUMA systems, a new 'perf sched timehist' command for a detailed history of task scheduling, improved writeback management that should make the system more responsive under heavy write load, a new hybrid block polling method that uses less CPU than pure polling, support for ARM devices such as the Nexus 5 & 6 or Allwinner A64, the ability to attach eBPF programs to cgroups, an experimental MD RAID5 writeback cache, support for Intel Cache Allocation Technology, and many other improvements and new drivers.

1. Prominent features

1.1. Virtual GPU support

This release adds support for Intel GVT-g for KVM (a.k.a. KVMGT), a full GPU virtualization solution with mediated pass-through, starting from 4th generation Intel Core (Haswell) processors with Intel Graphics. This feature is based on a new VFIO Mediated Device framework. Unlike direct pass-through alternatives, the mediated device framework allows KVMGT to offer a complete virtualized GPU with full GPU features to each of the virtualized guests, with part of the performance-critical resources directly assigned, while still achieving performance close to native. The ability to run a native graphics driver inside a VM, without hypervisor intervention in performance-critical paths, achieves a good balance among performance, features, and sharing capability.

For more details, see these papers:

A Full GPU Virtualization Solution with Mediated Pass-Through

KVMGT: a Full GPU Virtualization Solution

vGPU on KVM (video)

Intel GVT main site

Code: VFIO Mediated device commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit; KVMGT (merge); Intel GVT-g (merge)

1.2. New 'perf c2c' tool, for cacheline contention analysis

In modern systems with multiple processors, different memory modules are physically connected to different CPUs. In these NUMA systems, accesses to local memory are faster than accesses to memory connected to other processors. When a task is multi-threaded, different threads can run on different CPUs at the same time; if these threads access and modify the same memory, they can suffer performance problems due to the cost of keeping the CPU caches synchronized.

perf c2c (for "cache to cache") is a new tool designed to analyse and track down performance problems caused by false sharing on NUMA systems. The tool is based on x86's load latency and precise store facility events provided by Intel CPUs. At a high level, perf c2c will show you:

and more. For more details on perf c2c and how to use it, see https://joemario.github.io/blog/2016/09/01/c2c-blog/
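To make the notion of false sharing more concrete, here is a minimal sketch in Python that uses ctypes only to compute field offsets. It assumes a 64-byte cacheline (common on x86, but an assumption here), and the structure and helper names are hypothetical. Two counters that are meant to be updated by different threads end up on the same cacheline in the packed layout, while padding moves them onto separate lines.

```python
import ctypes

CACHELINE_SIZE = 64  # assumption: typical x86 cacheline size, in bytes


class PackedCounters(ctypes.Structure):
    # Two independent counters, each meant to be updated by a different thread.
    _fields_ = [
        ("counter_a", ctypes.c_uint64),
        ("counter_b", ctypes.c_uint64),
    ]


class PaddedCounters(ctypes.Structure):
    # Same counters, padded so that counter_b starts on the next cacheline.
    _fields_ = [
        ("counter_a", ctypes.c_uint64),
        ("_pad", ctypes.c_uint8 * (CACHELINE_SIZE - ctypes.sizeof(ctypes.c_uint64))),
        ("counter_b", ctypes.c_uint64),
    ]


def cacheline_index(struct_type, field_name):
    """Return which cacheline (relative to the struct start) holds the field."""
    offset = getattr(struct_type, field_name).offset
    return offset // CACHELINE_SIZE


def shares_cacheline(struct_type, field_1, field_2):
    """True if both fields fall in the same cacheline (struct assumed aligned)."""
    return cacheline_index(struct_type, field_1) == cacheline_index(struct_type, field_2)


if __name__ == "__main__":
    print(shares_cacheline(PackedCounters, "counter_a", "counter_b"))  # True
    print(shares_cacheline(PaddedCounters, "counter_a", "counter_b"))  # False
```

In the packed layout, every write to one counter invalidates the line that also holds the other, even though the threads never touch the same variable. That is the kind of contention perf c2c is meant to locate.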

Code: (merge)

1.3. Detailed history of scheduling events with perf sched timehist

'perf sched timehist' provides an analysis of scheduling events. Example usage: $ perf sched record -- sleep 1; perf sched timehist. By default it shows the individual schedule events, including the wait time (time between sched-out and next sched-in events for the task), the task scheduling delay (time between wakeup and actually running) and run time for the task:

{{{
time    cpu    task name    wait time    sch delay    run time
}}}
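The three columns can be illustrated with a small Python sketch that derives them from a simplified event stream for a single task. The event record and the function name are hypothetical and this is not how perf itself is implemented; it only applies the definitions given above (wait time, scheduling delay, run time).

```python
from collections import namedtuple

# Hypothetical, simplified event record for one task; times in milliseconds.
# kind is one of "sched_in", "sched_out", "wakeup".
Event = namedtuple("Event", ["time", "kind"])


def timehist_rows(events):
    """Return one row per completed run of the task.

    wait time: time between the previous sched-out and this sched-in
    sch delay: time between the wakeup and this sched-in
    run time:  time between this sched-in and the following sched-out
    """
    rows = []
    last_out = None
    last_wakeup = None
    current = None  # (sched-in time, wait time, sch delay) of the run in progress
    for ev in sorted(events, key=lambda e: e.time):
        if ev.kind == "wakeup":
            last_wakeup = ev.time
        elif ev.kind == "sched_in":
            wait = ev.time - last_out if last_out is not None else 0.0
            delay = ev.time - last_wakeup if last_wakeup is not None else 0.0
            current = (ev.time, wait, delay)
            last_wakeup = None
        elif ev.kind == "sched_out" and current is not None:
            start, wait, delay = current
            rows.append({
                "time": ev.time,
                "wait time": wait,
                "sch delay": delay,
                "run time": ev.time - start,
            })
            last_out = ev.time
            current = None
    return rows


if __name__ == "__main__":
    trace = [
        Event(0.0, "sched_in"),
        Event(2.0, "sched_out"),
        Event(5.0, "wakeup"),
        Event(5.5, "sched_in"),
        Event(6.5, "sched_out"),
    ]
    for row in timehist_rows(trace):
        print(row)
```

In the sample trace, the second run waited 3.5 ms off-CPU, of which 0.5 ms was scheduling delay after the wakeup, and then ran for 1.0 ms.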

For more details, see this article from Brendan Gregg: perf sched for Linux CPU scheduler analysis

Code: (merge)

1.4. Improved writeback management

Since the dawn of time, the way Linux synchronizes to disk the data written to memory by processes (a.k.a. background writeback) has sucked. When Linux writes all that data in the background, it should have little impact on foreground activity. That's the definition of background activity. But for as long as anyone can remember, heavy buffered writers have not behaved like that. For instance, if you do something like $ dd if=/dev/zero of=foo bs=1M count=10k, or try to copy files to USB storage, and then try to start a browser or any other large app, it basically won't start before the buffered writeback is done, and your desktop, or command shell, feels unresponsive. These problems happen because heavy writes (the kind of write activity caused by background writeback) fill up the block layer, and other IO requests have to wait a long time to be served (for more details, see the LWN article).

This release adds a mechanism that throttles back buffered writeback, which makes it more difficult for heavy writers to monopolize the IO request queue, and thus provides a smoother experience on Linux desktops and shells than what people were used to. The throttling algorithm monitors the latencies of requests and shrinks or grows the request queue depth accordingly, which means that it is auto-tuning and, generally, a user should not have to touch the settings. This feature needs to be enabled explicitly in the configuration (and, as should be expected, there can be regressions).
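The idea of a latency-driven feedback loop can be sketched in a few lines of Python. This is an illustration only, not the kernel's actual algorithm: the function name, the halving/increment policy, the target latency and the depth limits are all assumptions made for the example.

```python
def adjust_queue_depth(depth, observed_latency_us, target_latency_us,
                       min_depth=1, max_depth=64):
    """Return the queue depth allowed for background writeback in the next window.

    Hypothetical policy: if requests took longer than the target, halve the
    depth granted to writeback; otherwise grow it back by one step.
    """
    if observed_latency_us > target_latency_us:
        return max(min_depth, depth // 2)
    return min(max_depth, depth + 1)


def simulate(latencies_us, target_latency_us, start_depth=32):
    """Apply the policy over a series of monitoring windows."""
    depth = start_depth
    history = []
    for latency in latencies_us:
        depth = adjust_queue_depth(depth, latency, target_latency_us)
        history.append(depth)
    return history


if __name__ == "__main__":
    # Two slow windows followed by two fast ones, with a 2000 us target.
    print(simulate([5000, 4000, 500, 300], target_latency_us=2000))
    # [16, 8, 9, 10]
```

The point of the sketch is the shape of the loop: high latencies shrink the share of the queue that writeback may use, and low latencies let it grow again, without any manual tuning.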

Recommended LWN article: Toward less-annoying background writeback

Code: commit, commit, commit, commit, commit, commit, commit, commit, commit, commit

1.5. Hybrid block polling

Linux 4.4 added support for polling requests in the block layer, a similar approach to what NAPI does for networking, which can improve performance for high-throughput devices (e.g. NVM). Continuously polling a device, however, can cause excessive CPU consumption and sometimes even worse throughput. This release includes a new hybrid, adaptive type of polling. Instead of polling immediately after IO submission, the kernel introduces an artificial delay, and then polls after that. For example, if the IO is presumed to complete in 8 μsecs from now, the kernel sleeps for 4 μsecs, wakes up, and then does the polling. This still puts a sleep/wakeup cycle in the IO path, but instead of the wakeup happening after the IO has completed, it'll happen before. With this hybrid scheme, Linux can achieve big latency reductions while still using the same (or less) amount of CPU. Thanks to improved statistics gathering included in this release, the kernel can measure the completion time of requests and calculate how long it should sleep.

Hybrid block polling is disabled by default. A new sysfs file, /sys/block/<dev>/queue/io_poll_delay, has been added, which makes the polling behave as follows: -1: never enter hybrid sleep, always poll (default); 0: use half of the mean completion time for this request type as the sleep delay (a.k.a. hybrid poll); >0: disregard the mean value calculated by the kernel, and always use this specific value as the sleep delay.
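The three io_poll_delay modes can be summarized with a short Python sketch. The function name is hypothetical, and treating all values as microseconds is an assumption made to match the 8 μsecs example; the logic only restates the rules listed above.

```python
def hybrid_poll_sleep(io_poll_delay, mean_completion_usec):
    """Return how long to sleep before starting to poll, in microseconds.

    io_poll_delay == -1: classic polling, no sleep at all
    io_poll_delay ==  0: hybrid polling, sleep half of the mean completion time
    io_poll_delay  >  0: sleep for exactly the configured value
    """
    if io_poll_delay < -1:
        raise ValueError("io_poll_delay must be -1, 0 or a positive value")
    if io_poll_delay == -1:
        return 0
    if io_poll_delay == 0:
        return mean_completion_usec / 2
    return io_poll_delay


if __name__ == "__main__":
    print(hybrid_poll_sleep(-1, 8))  # 0: always poll
    print(hybrid_poll_sleep(0, 8))   # 4.0: half of the 8 usec mean
    print(hybrid_poll_sleep(3, 8))   # 3: fixed delay, mean ignored
```

With the default of -1 the behaviour is the same as the pure polling introduced in Linux 4.4; writing 0 to the sysfs file lets the kernel pick the delay from its own statistics.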

Code: commit, commit, commit

1.6. Better support for ARM devices such as Nexus 5 & 6 or Allwinner A64

As evidence of the work being done to bring Android and mainline kernels together, this release includes support for ARM SoCs such as:

Code: (merge)

1.7. Allow attaching eBPF programs to cgroups

This release adds eBPF hooks for cgroups, to allow eBPF programs for network filtering and accounting to be attached to cgroups, so that they apply to all sockets of all tasks placed in that cgroup. A new BPF program type is added, BPF_PROG_TYPE_CGROUP_SKB. The bpf(2) syscall is extended with two new commands, BPF_PROG_ATTACH and BPF_PROG_DETACH, which allow attaching and detaching eBPF programs to a target. This feature is configurable (CONFIG_CGROUP_BPF).
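As a rough illustration of how the new command is issued, the following Python sketch fills in the attach arguments and calls bpf(2) through libc. It is a sketch under stated assumptions: the numeric constants are taken from the Linux UAPI header linux/bpf.h, the syscall number is the x86-64 one, and the helper names are hypothetical. It assumes the caller already holds a file descriptor for the cgroup directory and one for an eBPF program loaded with BPF_PROG_LOAD.

```python
import ctypes
import os

# Constants as defined in <linux/bpf.h>; the syscall number is x86-64 specific
# and is an assumption of this sketch.
NR_BPF_X86_64 = 321
BPF_PROG_ATTACH = 8
BPF_PROG_DETACH = 9
BPF_CGROUP_INET_INGRESS = 0
BPF_CGROUP_INET_EGRESS = 1


class BpfAttachAttr(ctypes.Structure):
    # Mirrors the part of union bpf_attr used by BPF_PROG_ATTACH/BPF_PROG_DETACH.
    _fields_ = [
        ("target_fd", ctypes.c_uint32),      # fd of the cgroup directory
        ("attach_bpf_fd", ctypes.c_uint32),  # fd of the loaded eBPF program
        ("attach_type", ctypes.c_uint32),    # e.g. BPF_CGROUP_INET_EGRESS
    ]


def build_attach_attr(cgroup_fd, prog_fd, attach_type):
    """Fill in the attribute block passed to the bpf(2) syscall."""
    return BpfAttachAttr(cgroup_fd, prog_fd, attach_type)


def prog_attach(cgroup_fd, prog_fd, attach_type):
    """Attach an already loaded program to a cgroup; raises OSError on failure."""
    attr = build_attach_attr(cgroup_fd, prog_fd, attach_type)
    libc = ctypes.CDLL(None, use_errno=True)
    ret = libc.syscall(NR_BPF_X86_64, BPF_PROG_ATTACH,
                       ctypes.byref(attr), ctypes.sizeof(attr))
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

A caller would typically open the cgroup directory with os.open(path, os.O_RDONLY) and pass the resulting descriptor as cgroup_fd. Loading the program itself is outside the scope of this sketch.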

Recommended LWN article: Network filtering for control groups

Code: commit, commit, commit, commit, commit, commit

This release also adds a new cgroup-based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to BPF_PROG_TYPE_CGROUP_SKB, these programs can be attached to a cgroup, and they run any time a process in the cgroup opens an AF_INET or AF_INET6 socket. Currently only sk_bound_dev_if is exported to userspace for modification by a bpf program.

Code: commit, commit, commit, commit, commit, commit, commit

1.8. Experimental MD raid5 writeback cache and FAILFAST support

This release implements a raid5 writeback cache in the MD (Multiple Devices) subsystem. Its goal is to aggregate writes into full-stripe writes and thereby reduce read-modify-write cycles. It helps workloads that, for example, do sequential writes followed by fsync.

This feature is experimental and off by default.
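Why full-stripe writes are cheaper can be seen from how RAID5 parity is computed. The following Python sketch uses the standard XOR parity scheme; the function names are hypothetical and the code is unrelated to the MD implementation. A full-stripe write derives parity from the new data alone, while a partial write must first read back the old block and the old parity.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def full_stripe_parity(data_blocks):
    """Full-stripe write: parity comes from the new data only, no reads needed."""
    return xor_blocks(*data_blocks)


def rmw_parity(old_parity, old_block, new_block):
    """Partial write: old data and old parity have to be read before updating."""
    return xor_blocks(old_parity, old_block, new_block)


if __name__ == "__main__":
    stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
    parity = full_stripe_parity(stripe)
    print(parity.hex())  # ee22

    # Updating only the second block requires two extra reads (old block, old parity).
    new_block = b"\xaa\x55"
    updated = rmw_parity(parity, stripe[1], new_block)
    print(updated == full_stripe_parity([stripe[0], new_block, stripe[2]]))  # True
```

Both paths produce the same parity, but the read-modify-write path pays for extra reads on every small write. Aggregating writes in a cache until a whole stripe is available avoids those reads.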

Code: commit, commit, commit, commit, commit, commit, commit, commit, commit, commit

This release also adds "failfast" support. RAID disks with failed IOs are quickly marked as broken and avoided in the future, which can improve latency.

Code: commit, commit, commit, commit, commit, commit

1.9. Support for Intel Cache Allocation Technology

An Intel feature that allows setting policies on the L2/L3 CPU caches; e.g. real-time tasks could be assigned dedicated cache space. For more details, read the recommended LWN article: Controlling access to the memory cache.

Code: commit, commit, commit, commit, commit, commit, commit, commit, commit

2. Core (various)

3. File systems

4. Memory management

5. Block layer

6. Tracing and perf tool

7. Virtualization

8. Security

9. Graphics

10. Networking

11. Architectures

12. Drivers

12.1. Graphics

12.2. Storage

12.3. Drivers in the Staging area

12.4. Networking

12.5. Audio

12.6. Tablets, touch screens, keyboards, mice

12.7. TV tuners, webcams, video capturers

12.8. Universal Serial Bus

12.9. Serial Peripheral Interface (SPI)

12.10. Watchdog

12.11. Serial

12.12. ACPI, EFI, cpufreq, thermal, Power Management

12.13. Real Time Clock (RTC)

12.14. Voltage, current regulators, power capping, power supply

12.15. Pin Controllers (pinctrl)

12.16. Multi Media Card (MMC)

12.17. Industrial I/O (iio)

12.18. Multi Function Devices (MFD)

12.19. Pulse-Width Modulation (PWM)

12.20. Inter-Integrated Circuit (I2C)

12.21. Hardware monitoring (hwmon)

12.22. General Purpose I/O (gpio)

12.23. LEDs

12.24. DMA engines

12.25. Clocks

Hardware Random Number Generator (hwrng)

Cryptography hardware acceleration

12.26. PCI

12.27. Various

13. List of merges

14. Other news sites

KernelNewbies: Linux_4.10 (last edited 2017-12-30 01:29:52 by localhost)