
Diff for "LinuxChanges"

Differences between revisions 355 and 356

Line 7: Line 7:
[[Include(Linux_4.1)]] [[Include(Linux_4.4)]]

Changes done in each Linux kernel release. Other places to get news about the Linux kernel are LWN kernel status, H-Online, or the Linux Kernel mailing list (there is a web interface at www.lkml.org). Lists of changes in older releases can be found at LinuxVersions. If you're going to add something here, look first at LinuxChangesRules!

You can discuss the latest Linux kernel changes on the New Linux Kernel Features Forum.

Linux 4.4 was released on Sun, 10 Jan 2016.

Summary: This release adds 3D support in the virtual GPU driver, which allows 3D hardware-accelerated graphics in virtualization guests; loop device support for Direct I/O and Asynchronous I/O, which saves memory and increases performance; support for Open-channel SSDs, which are devices that share the responsibility of the Flash Translation Layer with the operating system; completely lockless TCP listener handling, which allows for faster and more scalable TCP servers; journalled RAID5 in the MD layer, which fixes the RAID write hole; eBPF programs that can now be run by unprivileged users and made persistent, plus perf support for eBPF programs; a new mlock2() syscall that allows users to request memory to be locked on page fault; and block polling support for improved performance in high-end storage devices. There are also new drivers and many other small improvements.

  1. Prominent features
    1. Faster and leaner loop device with Direct I/O and Asynchronous I/O support
    2. 3D support in virtual GPU driver
    3. LightNVM adds support for Open-Channel SSDs
    4. TCP listener handling completely lockless, making TCP servers faster and more scalable
    5. Journalled RAID5 MD support
    6. Unprivileged eBPF + persistent eBPF programs
    7. perf + eBPF integration
    8. Block polling support
    9. mlock2() syscall allows users to request memory to be locked on page fault
  2. Drivers and architectures
  3. Core (various)
  4. File systems
  5. Memory management
  6. Block layer
  7. Cryptography
  8. Security
  9. Tracing and perf tool
  10. Virtualization

1. Prominent features

1.1. Faster and leaner loop device with Direct I/O and Asynchronous I/O support

This release introduces support for Direct I/O and asynchronous I/O in the loop block device. Using direct I/O and AIO to read and write the loop device's backing file has several advantages: Direct I/O avoids double caching, which reduces memory usage a lot; unlike user space direct I/O, there is no cost of pinning pages; and context switches are avoided in some cases because concurrent submissions can be avoided. See commits for benchmarks.
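As a rough illustration of how direct I/O on a loop device might be inspected and toggled from user space, the sketch below assumes two things the text above does not name: the LOOP_SET_DIRECT_IO ioctl (0x4C08 in <linux/loop.h>) and a per-device sysfs attribute at /sys/block/loopN/loop/dio. The helper names are hypothetical. Changing the setting needs root and an already attached loop device.

```python
import fcntl
import os

# Assumed value of LOOP_SET_DIRECT_IO from <linux/loop.h>.
LOOP_SET_DIRECT_IO = 0x4C08


def parse_dio_flag(text):
    """Interpret the contents of the (assumed) loop/dio sysfs attribute."""
    return text.strip() == "1"


def loop_dio_state(loop_name, sysfs_root="/sys/block"):
    """Return True/False for the direct I/O state, or None if it cannot be read."""
    path = os.path.join(sysfs_root, loop_name, "loop", "dio")
    try:
        with open(path) as attr:
            return parse_dio_flag(attr.read())
    except OSError:
        return None


def set_loop_direct_io(loop_path, enable):
    """Ask the kernel to switch direct I/O on or off for an attached loop device."""
    fd = os.open(loop_path, os.O_RDWR)
    try:
        fcntl.ioctl(fd, LOOP_SET_DIRECT_IO, 1 if enable else 0)
    finally:
        os.close(fd)
```

A caller would typically check loop_dio_state("loop0") after set_loop_direct_io("/dev/loop0", True), since the kernel may refuse direct I/O when the backing file cannot support it.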

Code: commit, commit, commit, commit, commit

1.2. 3D support in virtual GPU driver

virtio-gpu is a driver for virtualization guests that allows them to use the host graphics card efficiently. In this release, it allows the virtualization guest to use the capabilities of the host GPU to accelerate 3D rendering. In practice, this means that a virtualized Linux guest can run an OpenGL game while using the GPU acceleration capabilities of the host, as shown in this or this video. This also requires running QEMU 2.5.

Virgil project page

44m linux.conf talk about the project

Code: commit

1.3. LightNVM adds support for Open-Channel SSDs

Open-channel SSDs are devices that share responsibilities with the operating system in order to implement and maintain features that typical SSDs keep strictly in firmware. These include the Flash Translation Layer (FTL), bad block management, and hardware units such as the flash controller, the interface controller, and large amounts of flash chips. In this way, Open-channel SSDs expose direct access to their physical flash storage, while keeping a subset of the internal features of SSDs.

LightNVM is a specification that gives support to Open-channel SSDs. LightNVM allows the host to manage data placement, garbage collection, and parallelism. Device-specific responsibilities such as bad block management, FTL extensions to support atomic IOs, or metadata persistence are still handled by the device. This Linux release adds support for LightNVM (and adds LightNVM support to NVMe as well).

Recommended LWN article: Taking control of SSDs with LightNVM

Code: commit, commit, commit, commit, commit

1.4. TCP listener handling completely lockless, making TCP servers faster and more scalable

In this release, and as a result of an effort that started two years ago, the TCP implementation has been refactored to make the TCP listener fast path completely lockless. During tests, a server was able to process 3,500,000 SYN packets per second on one listener and still have available CPU cycles - about two to three orders of magnitude more than was possible before. SO_REUSEPORT has also been extended (see Networking section) to add proper CPU/NUMA affinities, so that heavy duty TCP servers can get proper siloing thanks to multi-queue NICs.
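For readers unfamiliar with SO_REUSEPORT, here is a minimal sketch of the basic pattern it enables: several listening sockets bound to the same TCP port, typically one per worker. It only shows the plain socket option, not the CPU/NUMA affinity extension mentioned above, and the function names, loopback address and backlog value are illustrative choices.

```python
import socket


def reuseport_listener(port, host="127.0.0.1", backlog=128):
    """Return a listening TCP socket with SO_REUSEPORT set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind((host, port))
        sock.listen(backlog)
    except OSError:
        sock.close()
        raise
    return sock


def listener_group(count, port=0):
    """Open `count` listeners that all share one port (port 0 lets the kernel pick)."""
    first = reuseport_listener(port)
    bound_port = first.getsockname()[1]
    return [first] + [reuseport_listener(bound_port) for _ in range(count - 1)]
```

Each socket in the group has its own accept queue, so a server can hand one listener to each worker process or thread and let the kernel spread incoming connections across them.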

Code: commit, commit, commit

1.5. Journalled RAID5 MD support

This release adds journalled RAID 5 support to the MD (RAID/LVM) layer. With a journal device configured (typically NVRAM or SSD), data and parity written to the RAID array are first written to the log and then to the array disks. If a crash happens, the data can be recovered from the log. This can speed up RAID resync and fixes the RAID5 write hole issue - a crash during degraded operations cannot result in data corruption. In future releases the journal will also be used to improve performance and latency.
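The log-then-array ordering described above can be sketched with a toy in-memory model. This is purely illustrative and is not MD's implementation or on-disk format: the class name, the use of small integers as chunks, and the simulated crash flag are all assumptions made for the example.

```python
from functools import reduce


class ToyJournalledStripeStore:
    """Toy model: every stripe write is logged before it reaches the array."""

    def __init__(self):
        self.journal = []  # stands in for the journal device
        self.array = {}    # stripe number -> (data chunks, parity)

    @staticmethod
    def parity(chunks):
        """XOR parity over the data chunks of one stripe."""
        return reduce(lambda a, b: a ^ b, chunks, 0)

    def _apply(self, record):
        stripe, chunks, parity = record
        self.array[stripe] = (chunks, parity)

    def write(self, stripe, chunks, crash_after_journal=False):
        record = (stripe, tuple(chunks), self.parity(chunks))
        self.journal.append(record)   # 1. data and parity go to the log first
        if crash_after_journal:
            return                    # simulated crash before the array write
        self._apply(record)           # 2. then to the array disks
        self.journal.remove(record)   # 3. the log entry can be retired

    def recover(self):
        """Replay whatever is still in the log; return the number of records."""
        replayed = len(self.journal)
        for record in self.journal:
            self._apply(record)
        self.journal.clear()
        return replayed
```

Because data and parity for a stripe are logged together, replaying the log after a crash leaves the stripe consistent instead of half-updated, which is the essence of closing the write hole.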

Code: merge

1.6. Unprivileged eBPF + persistent eBPF programs

Unprivileged eBPF

eBPF programs got their own syscall in Linux 3.18, but until now its use had been restricted to root, because these programs were dangerous for security. eBPF programs are, however, validated by the kernel, and in this release the eBPF verifier has been improved so that unprivileged users can use it (although unprivileged eBPF is only meaningful for 'socket filter'-like programs; eBPF programs for tracing and TC classifiers/actions will stay root only). This feature can be switched off with the sysctl kernel.unprivileged_bpf_disabled (once true, bpf programs and maps cannot be accessed from unprivileged processes, and the toggle cannot be set back to false).
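A small sketch of how a program might check this toggle before attempting unprivileged bpf() calls. It assumes the sysctl is exposed at /proc/sys/kernel/unprivileged_bpf_disabled (the usual mapping of sysctl names to /proc/sys paths); the function names are hypothetical.

```python
def parse_unprivileged_bpf_disabled(text):
    """Return True if the sysctl value means unprivileged bpf() is switched off."""
    return int(text.strip()) != 0


def unprivileged_bpf_disabled(path="/proc/sys/kernel/unprivileged_bpf_disabled"):
    """Read the toggle; return None when the kernel does not expose it."""
    try:
        with open(path) as sysctl:
            return parse_unprivileged_bpf_disabled(sysctl.read())
    except OSError:
        return None
```

A None result would be expected on kernels older than this release, where the file does not exist and bpf() is root-only anyway.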

Recommended LWN article: Unprivileged bpf()

Code: commit, commit

Persistent eBPF maps/progs

This release also adds support for "persistent" eBPF maps/programs. The term "persistent" means that maps/programs have a facility that lets them survive process termination. This is desired by various eBPF subsystem users, for example the tc classifier/action. Whenever tc parses the ELF object and extracts and loads maps/progs into the kernel, these file descriptors will be out of reach after the tc instance exits, so a subsequent tc invocation won't be able to access/relocate on this resource, and therefore maps cannot easily be shared, for example between the ingress and egress networking data path.

To fix issues such as these, a new minimal file system has been created that can hold map/prog objects at /sys/fs/bpf/. Any subsequent mounts within a given namespace will point to the same instance. The file system allows for creating a user-defined directory structure. The objects for maps/progs are created/fetched through bpf(2) along with a pathname, using two new commands (BPF_OBJ_PIN/BPF_OBJ_GET) which in turn create the file system nodes. The user can use that to access maps and progs later on, through bpf(2).
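The two commands can be sketched from user space with ctypes. Several details here are assumptions not stated in the text: the command numbers (BPF_OBJ_PIN = 6, BPF_OBJ_GET = 7), the attribute layout (a 64-bit pathname pointer followed by a 32-bit file descriptor), and the x86_64 syscall number 321 for bpf(2). The helper names and the example path are hypothetical, and other architectures use a different syscall number.

```python
import ctypes
import os

BPF_OBJ_PIN = 6        # assumed command number from <linux/bpf.h>
BPF_OBJ_GET = 7        # assumed command number from <linux/bpf.h>
SYS_BPF_X86_64 = 321   # assumed syscall number, x86_64 only


class BpfObjAttr(ctypes.Structure):
    """Assumed layout of the bpf_attr member used by BPF_OBJ_PIN/BPF_OBJ_GET."""
    _fields_ = [("pathname", ctypes.c_uint64), ("bpf_fd", ctypes.c_uint32)]


def make_obj_attr(path, fd=0):
    """Build the attribute; the returned buffer must stay alive during the call."""
    buf = ctypes.create_string_buffer(os.fsencode(path))
    return BpfObjAttr(ctypes.addressof(buf), fd), buf


def bpf_obj(cmd, path, fd=0):
    """Issue BPF_OBJ_PIN (pin `fd` at `path`) or BPF_OBJ_GET (returns a new fd)."""
    libc = ctypes.CDLL(None, use_errno=True)
    attr, buf = make_obj_attr(path, fd)
    ret = libc.syscall(ctypes.c_long(SYS_BPF_X86_64), ctypes.c_int(cmd),
                       ctypes.byref(attr), ctypes.c_uint(ctypes.sizeof(attr)))
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret
```

Under these assumptions, one process would call bpf_obj(BPF_OBJ_PIN, "/sys/fs/bpf/demo_map", map_fd) and a later process would call bpf_obj(BPF_OBJ_GET, "/sys/fs/bpf/demo_map") to obtain its own descriptor for the same map.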

Code: commit, commit

1.7. perf + eBPF integration

In this release, eBPF programs have been integrated with perf. When perf is given an eBPF .c source file (or a .o file built for the 'bpf' target with clang), it is automatically built, validated and loaded into the kernel, and can then be used and seen using perf trace and other tools.

Users can attach a BPF filter with a command such as # perf record --event ./hello_world.o ls; the eBPF program is attached to a newly created perf event, which works with all tools.

Code: commit, commit, commit, commit, commit, commit, commit, commit, commit, commit

1.8. Block polling support

This release adds basic support for polling for specific IO to complete, which can improve latency and throughput in very fast devices. Currently, O_DIRECT sync reads/writes are supported. This support is only intended for testing; in future releases, stats tracking will be used to auto-tune this. For now, for benchmark and testing purposes, a sysfs file (io_poll) controls whether polling is enabled or not.
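A brief sketch of reading and flipping the io_poll knob for a benchmark run. The full path /sys/block/<dev>/queue/io_poll is an assumption (the text above only names the file), the helper names are hypothetical, and writing the file needs root.

```python
import os


def io_poll_path(device, sysfs_root="/sys/block"):
    """Assumed location of the io_poll attribute for a block device such as 'nvme0n1'."""
    return os.path.join(sysfs_root, device, "queue", "io_poll")


def parse_io_poll(text):
    """Return True if the attribute contents mean polling is enabled."""
    return text.strip() == "1"


def io_poll_enabled(device, sysfs_root="/sys/block"):
    """Return True/False, or None if the device or attribute is missing."""
    try:
        with open(io_poll_path(device, sysfs_root)) as attr:
            return parse_io_poll(attr.read())
    except OSError:
        return None


def set_io_poll(device, enable, sysfs_root="/sys/block"):
    """Enable or disable polling; raises OSError without sufficient privileges."""
    with open(io_poll_path(device, sysfs_root), "w") as attr:
        attr.write("1" if enable else "0")
```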

Recommended LWN article: Block-layer I/O polling

Code: commit, commit, commit

1.9. mlock2() syscall allows users to request memory to be locked on page fault

mlock() allows a user to control page out of program memory, but this comes at the cost of faulting in the entire mapping when it is allocated. For large mappings this is not ideal. For example, security applications that need mlock() are forced to lock an entire buffer, no matter how big it is. Likewise, applications that work on large graphical models, where the path through the graph is not known until run time, are forced to lock the entire graph or lock page by page as the pages are faulted in.

This new mlock2() system call creates a middle ground. Pages are marked to be placed on the unevictable LRU (locked) when they are first used, but they are not faulted in by the mlock call. The new system call takes a flags argument along with the start address and size. This flags argument gives the caller the ability to request that memory be locked in the traditional way, or be locked after the page is faulted in. New calls are added for munlock() and munlockall() which give the caller a way to specify which flags are supposed to be cleared. A new MCL flag is added to mirror the lock on fault behavior from mlock() in mlockall(). Finally, a flag for mmap() is added that allows a user to specify that the covered area should not be paged out, but only after the memory has been used the first time.
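To make the lock-on-fault request concrete, here is a hedged ctypes sketch. It assumes the flag is named MLOCK_ONFAULT with value 0x01, that a C library wrapper called mlock2 may or may not be available, and that the raw syscall number is 325 (x86_64 only); none of these appear in the text above. The helper names are hypothetical, and the call can fail with an error such as ENOMEM or EPERM depending on RLIMIT_MEMLOCK and privileges.

```python
import ctypes
import mmap
import os

MLOCK_ONFAULT = 0x01       # assumed flag value: lock pages as they are faulted in
SYS_MLOCK2_X86_64 = 325    # assumed raw syscall number, x86_64 only

_libc = ctypes.CDLL(None, use_errno=True)


def mlock2(addr, length, flags):
    """Call mlock2(addr, length, flags); raise OSError on failure."""
    wrapper = getattr(_libc, "mlock2", None)  # present only in newer C libraries
    if wrapper is not None:
        wrapper.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_uint]
        ret = wrapper(addr, length, flags)
    else:
        ret = _libc.syscall(ctypes.c_long(SYS_MLOCK2_X86_64),
                            ctypes.c_void_p(addr),
                            ctypes.c_size_t(length),
                            ctypes.c_int(flags))
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def lock_on_fault(region):
    """Mark a writable mmap region so its pages become locked when first touched."""
    addr = ctypes.addressof(ctypes.c_char.from_buffer(region))
    mlock2(addr, len(region), MLOCK_ONFAULT)
```

With an anonymous mapping such as mmap.mmap(-1, 4 * mmap.PAGESIZE), a successful lock_on_fault() call returns without touching any page; only pages that the program later reads or writes are faulted in and kept resident.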

Recommended LWN article: Deferred memory locking

Code: commit, commit, commit, commit

2. Drivers and architectures

3. Core (various)

  • process scheduler: Apply a frequency scaling correction factor to per-entity load tracking to make it invariant with respect to CPU frequency. Currently, load appears bigger when the CPU is running at slower frequencies, which affects load-balancing decisions commit, commit

  • seccomp: add support for dumping a process' (classic BPF) seccomp filters via ptrace + PTRACE_SECCOMP_GET_FILTER commit

  • watchdog: Mimic the softlockup_panic kernel knob and create a /proc/sys/kernel/hardlockup_panic. It enables a hardlockup to panic the machine commit

  • watchdog: optionally perform all-CPU backtrace in case of hard lockup. Can be enabled with sysctl /proc/sys/kernel/hardlockup_all_cpu_backtrace commit

  • coredump: Add two new flags to the existing coredump mechanism for ELF and FDPIC ELF files to allow us to explicitly filter DAX mappings. This is desirable because DAX mappings, like hugetlb mappings, have the potential to be very large commit, commit

  • test_printf: test printf family at runtime commit

  • Make sync_file_range(2) use WB_SYNC_NONE writeback. It helps PostgreSQL avoid large latency spikes when flushing data in the background commit
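For the coredump item in the list above, the following sketch decodes a coredump filter mask. It assumes the flags are the per-process bitmask exposed (in hexadecimal) at /proc/PID/coredump_filter and that the two new DAX flags are bits 7 and 8; both the bit positions and the short labels are assumptions rather than something stated in this page.

```python
# Assumed bit assignments of /proc/PID/coredump_filter, including the new DAX bits.
COREDUMP_FILTER_BITS = {
    0: "anonymous private",
    1: "anonymous shared",
    2: "file-backed private",
    3: "file-backed shared",
    4: "ELF headers",
    5: "private hugetlb",
    6: "shared hugetlb",
    7: "private DAX",
    8: "shared DAX",
}


def decode_coredump_filter(mask):
    """List the mapping types that a mask would include in a core dump."""
    return [name for bit, name in sorted(COREDUMP_FILTER_BITS.items())
            if mask & (1 << bit)]


def read_coredump_filter(pid="self"):
    """Read the current mask for a process; the file holds a hexadecimal number."""
    with open("/proc/%s/coredump_filter" % pid) as f:
        return int(f.read().strip(), 16)


def exclude_dax(mask):
    """Return the mask with both assumed DAX bits cleared."""
    return mask & ~((1 << 7) | (1 << 8))
```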

4. File systems

  • XFS

    • Add per-filesystem stats in /sys/fs/xfs/<block>/stats/stats, and a stats_clear file to clear them. Also, the global stats that are currently present in /proc are duplicated in /sys/fs/xfs/stats/stats (along with a stats_clear file) commit, commit, commit

  • BTRFS

    • Add fragment debug mount option. It can be used to cause extreme fragmentation in data, metadata or both commit

    • Add balance filter for stripes. This is useful to selectively rebalance only chunks that do not span enough devices, applies to RAID0/10/5/6. commit

  • CIFS

    • Allow duplicate extents (cp --reflink) in SMB3.0 not just SMB3.1.1 commit

    • Add resilienthandles mount parameter. Since many servers (Windows clients and non-clustered servers) do not support persistent handles but do support resilient handles, allow the user to specify a mount option "resilienthandles" in order to get more reliable connections and less chance of data loss (at least with SMB2.1 or later). The default resilient handle timeout (120 seconds on recent Windows servers) is used commit

    • Add support for persistent handles, which are like durable file handles with strong guarantees commit, commit, commit

    • Allow copy offload (copychunk) across shares commit

  • NFS

  • EXT4

    • Store checksum seed in superblock commit

  • OCFS2

    • Improve performance for localalloc commit

  • UBIFS

5. Memory management

  • Get rid of vmalloc_info from /proc/meminfo. It is too expensive to calculate and shows up in real workloads. People who actually want to know the state of the vmalloc area should just look at the much more complete /proc/vmallocinfo instead commit

  • Add HugetlbPages field to /proc/PID/status. Currently there's no easy way to get per-process usage of hugetlb pages, which is inconvenient because userspace applications that use hugetlb may need it commit

  • Add hugetlb-related fields to /proc/PID/smaps to know per-task or per-vma base hugetlb usage: AnonHugePages shows the amount of memory backed by transparent hugepages; Shared_Hugetlb and Private_Hugetlb show the amounts of memory backed by hugetlbfs pages, which are not counted in the RSS or PSS fields for historical reasons. These are also not included in the {Shared,Private}_{Clean,Dirty} fields commit

  • memcontrol: eliminate memory.current on the root level, because it doesn't add anything that wouldn't be more accurate and detailed using system statistics commit
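The new hugetlb fields in the list above can be read with a few lines of parsing. This sketch assumes the fields are formatted like other /proc entries, as "Name:   <number> kB"; the function names are hypothetical.

```python
import re


def sum_kb_field(text, field):
    """Sum every 'field:  N kB' line in /proc text (smaps repeats fields per mapping)."""
    pattern = re.compile(r"^%s:\s+(\d+)\s+kB\s*$" % re.escape(field), re.MULTILINE)
    return sum(int(match.group(1)) for match in pattern.finditer(text))


def hugetlb_usage_kb(pid="self"):
    """Return hugetlb usage in kB as reported by the status and smaps files."""
    with open("/proc/%s/status" % pid) as f:
        status = f.read()
    with open("/proc/%s/smaps" % pid) as f:
        smaps = f.read()
    return {
        "HugetlbPages": sum_kb_field(status, "HugetlbPages"),
        "Shared_Hugetlb": sum_kb_field(smaps, "Shared_Hugetlb"),
        "Private_Hugetlb": sum_kb_field(smaps, "Private_Hugetlb"),
    }
```

On kernels that lack these fields, sum_kb_field() simply returns 0 because no line matches.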

6. Block layer

  • Block polling support commit, commit, commit

  • loop: direct and asynchronous I/O commit, commit, commit, commit, commit

  • Add Persistent Reservations support. It includes a user space interface for simplified Persistent Reservations which map to block devices that support these (only SCSI for now). Persistent Reservations allow restricting access to block devices to specific initiators in a shared storage setup commit, commit, commit

  • Export integrity data interval size in /sys/block/<disk>/integrity/protection_interval_bytes, so that apps can tell whether the interval is different from the device's logical block size commit

  • cdrom: Random writing support for BD-RE media commit

7. Cryptography

  • crypto: caam - add support for acipher xts(aes) commit

  • crypto: keywrap - add key wrapping block chaining mode commit

  • crypto: qat - add support for ctr(aes) and xts(aes) commit

8. Security

9. Tracing and perf tool

  • Integration of perf with eBPF that, given an eBPF .c source file (or .o file built for the 'bpf' target with clang), will get it automatically built, validated and loaded into the kernel via the sys_bpf syscall, which can then be used and seen using 'perf trace' and other tools. Users can run commands like perf record --event bpf-file.c ls to try it commit, commit, commit, commit, commit, commit, commit, commit, commit, commit

  • Add a new branch type sampling filter to perf record, named 'call' (perf record -j call -e cycles .....), that samples only call branches (function calls), unlike 'any_call' that included direct, indirect calls and far jumps. Only x86 and PowerPC are supported in this release commit, commit

  • Add Intel cstate (aka idle states) Performance Monitoring Unit support. This allows perf to support cstate related free running (read-only and system-wide) counters. For example, to calculate the fraction of time when the core is running in C6 state: perf stat -x, -e"cstate_core/c6-residency/,msr/tsc/" -C0 -- taskset -c 0 sleep 5 commit

  • CPU socket filtering: perf tools introduce a new sort type "socket" for the processor socket, e.g. perf report --stdio --sort socket,comm,dso,symbol commit. Also, perf report introduces a --socket-filter option to only show entries for a processor socket that matches this filter commit. The perf hists browser can zoom in/out for a processor socket commit

  • perf tools: Introduce 'P' modifier, it will cause the event to get maximum possible detected precise level. For example, perf record -e cycles:P ... will detect maximum precise level for 'cycles' event and use it commit

  • perf tools: Add support for sorting on the iaddr. New sort option is: symbol_iaddr, header label is 'Code Symbol', eg perf mem report --stdio -F +symbol_iaddr commit

  • perf tools: enables config terms for tracepoint perf events. Valid terms for tracepoint events are 'call-graph' and 'stack-size', so different callgraph settings can be used for each event and eliminate unnecessary overhead. An example for using different call-graph config for each tracepoint: perf record -e syscalls:sys_enter_write/call-graph=fp -e syscalls:sys_exit_write/call-graph=no dd if=/dev/zero of=test bs=4k count=10 commit

  • perf script: Enable printing of branch stack via the 'brstack' and 'brstacksym' arguments to the field selection option -F. The option is off by default and operates only if the perf.data file has branch stack content commit

  • perf auxtrace: Add AUX area tracing option 'l' to synthesize branch stacks on samples just like sample type PERF_SAMPLE_BRANCH_STACK commit

  • perf hists browser: Add 'm' key for context menu display commit

  • perf inject: Add --strip option which is used with --itrace to strip out non-synthesized events commit

  • perf script: Allow time to be displayed in nanoseconds commit

  • Intel PT hardware tracer: Accept a zero --itrace period, meaning "as often as possible". In the case of Intel PT that is the same as a period of 1 and a unit of 'instructions' (i.e. --itrace=i1i) commit

  • Intel PT: Add support for generating branch stack context for PT samples. This is useful for: reporting accurate basic block edge frequencies through the perf report branch view or using with --branch-history to get the wider context of samples. Examples, record with Intel PT: perf record -e intel_pt//u ls

  • ftrace: add module globbing commit

10. Virtualization


last edited 2016-01-10 21:24:32 by diegocalleja