KernelNewbies:

Linux 4.3 was released on 1 Nov 2015.

Summary: This release removes the Ext3 filesystem and leaves Ext4, which can also mount Ext3 filesystems, as the main Ext filesystem. It also adds userfaultfd(), a system call for handling page faults in user space; membarrier(), a system call for issuing memory barriers on a set of threads; a PID controller for limiting the number of PIDs in cgroups; "ambient" capabilities, which make capabilities easier to use; idle page tracking, a more precise way to track the memory being used by applications; support for IPv6 Identifier Locator Addressing; network light weight tunnels; Virtual Routing and Forwarding (Lite) support; and many other improvements and new drivers.

1. Prominent features

1.1. The Ext3 filesystem has been removed

The Ext3 filesystem has been removed from the Linux core repository. The reason behind this removal is that Ext3 filesystems are fully supported by the Ext4 filesystem, and major distros have already been using Ext4 to mount Ext3 filesystems for a long time. With the stabilization of Ext4, maintainers think that the Ext3 codebase is useless duplicated code and should disappear.

Recommended LWN article: rm -r fs/ext3

Code: commit

1.2. userfaultfd(), a system call for handling page-faults in user space

A page fault happens when a process has something mapped in its virtual address space (e.g. a file) but the memory has not been loaded into RAM, and the process tries to access that memory. The kernel usually handles that page fault (e.g. by loading the corresponding part of the file into memory).

This release adds support for handling page faults in userspace through a new system call, userfaultfd(). Aside from registering and unregistering virtual memory ranges, userfaultfd() provides two primary functionalities: 1) a read/POLLIN protocol to notify a userland thread of the faults happening, and 2) various ioctls that manage the virtual memory regions registered in the userfaultfd, allowing userland to efficiently resolve the userfaults it receives via 1) or to manage the virtual memory in the background. The real advantage of userfaults compared to regular virtual memory management with mremap/mprotect is that userfaults never involve heavyweight structures in any of their operations.

The main user of this syscall is QEMU, which can use userfaultfd() to implement postcopy live migration: VMs are migrated to another host without first transferring the memory, which makes the migration much faster, and QEMU then uses userfaultfd() to transfer the pages as page faults happen.

Recommended LWN article: User-space page fault handling

Documentation: Documentation/vm/userfaultfd.txt

Code: commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit

1.3. membarrier(), a system call for issuing memory barriers on a set of threads

This release adds a new system call, membarrier(2), which helps distribute the cost of user-space memory barriers required to order memory accesses on multi-core systems, by transforming pairs of memory barriers into pairs consisting of membarrier(2) and a compiler barrier. For synchronization primitives that distinguish between read-side and write-side (e.g. userspace RCU, rwlocks), the read-side can be accelerated significantly with this syscall by moving the bulk of the memory barrier overhead to the write-side. The idea is to make CPUs execute memory barriers only when synchronization is required by the updater thread, as opposed to executing them each time before and after accessing shared data from a reader thread.
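As a rough illustration of how the write-side might issue the call, the following sketch invokes membarrier(2) through ctypes. The syscall numbers (324 on x86_64, 283 on aarch64) and the command values are not given in this page; they are assumptions taken from the kernel headers, and the helper name `membarrier` is just a convenience wrapper. On any other platform, or if the call fails, the helper simply returns None.

```python
import ctypes
import platform

# Assumption: syscall numbers for membarrier(2); they differ per architecture.
# Only these two are listed; other machines are reported as unsupported.
_MEMBARRIER_NR = {"x86_64": 324, "aarch64": 283}

MEMBARRIER_CMD_QUERY = 0   # ask which commands the kernel supports
MEMBARRIER_CMD_SHARED = 1  # issue a barrier on all running threads


def membarrier(cmd, flags=0):
    """Invoke membarrier(2); return its result, or None if unavailable."""
    nr = _MEMBARRIER_NR.get(platform.machine())
    if platform.system() != "Linux" or nr is None:
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    ret = libc.syscall(ctypes.c_long(nr), ctypes.c_int(cmd), ctypes.c_int(flags))
    if ret < 0:
        # For example ENOSYS on kernels older than 4.3.
        return None
    return ret


# Write-side of a hypothetical RCU-like primitive: query first, then issue
# the expensive barrier only if the kernel reports support for it.
supported = membarrier(MEMBARRIER_CMD_QUERY)
if supported is not None and supported & MEMBARRIER_CMD_SHARED:
    membarrier(MEMBARRIER_CMD_SHARED)
```

The read-side would then need only a compiler barrier, which Python cannot express; a real user of this call would be written in C or a similar language.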

Recommended LWN article: sys_membarrier()

Code: commit

1.4. New PID controller for limiting the number of PIDs in cgroups

This release adds a new PIDs controller to limit the number of tasks that can be forked inside a cgroup. PIDs are fundamentally a global resource because it is fairly trivial to reach PID exhaustion before you reach other resource limits. As a result, it is possible to grind a system to a halt without being limited by other cgroup policies. The PIDs cgroup controller is designed to stop this from happening.

Essentially, this is an implementation of RLIMIT_NPROC that applies to a cgroup rather than a process tree. However, it should be noted that organisational operations (adding and removing tasks from a PIDs hierarchy) will not be prevented. Rather, the number of tasks in the hierarchy cannot exceed the limit through forking.

In order to use the pids controller, set the maximum number of tasks in pids.max (this is not available in the root cgroup for obvious reasons). The number of processes currently in the cgroup is given by pids.current. To set a cgroup to have no limit, set pids.max to "max". This is the default for all new cgroups.
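The two files described above can be read with a few lines of code. This is a minimal sketch: the helper name `pids_headroom` is hypothetical, and the demonstration uses a temporary directory standing in for a real cgroup directory, whose mount location is not specified here.

```python
import os
import tempfile


def pids_headroom(cgroup_dir):
    """Return how many more tasks may be forked, or None when unlimited."""
    with open(os.path.join(cgroup_dir, "pids.max")) as f:
        limit = f.read().strip()
    with open(os.path.join(cgroup_dir, "pids.current")) as f:
        current = int(f.read().strip())
    if limit == "max":  # "max" means no limit, the default for new cgroups
        return None
    # Attaching tasks is not prevented, so current may exceed the limit.
    return max(int(limit) - current, 0)


# Demonstration against a throwaway directory standing in for a cgroup.
with tempfile.TemporaryDirectory() as fake_cgroup:
    with open(os.path.join(fake_cgroup, "pids.max"), "w") as f:
        f.write("10\n")
    with open(os.path.join(fake_cgroup, "pids.current"), "w") as f:
        f.write("7\n")
    print(pids_headroom(fake_cgroup))  # prints 3
```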

Documentation: Documentation/cgroup-v1/pids.txt

Code: commit, commit, commit, commit

1.5. Ambient capabilities

On Linux, there are a number of capabilities defined by the kernel. To perform various privileged tasks, processes can wield capabilities that they hold. Each task has four capability masks: "effective", "permitted", "inheritable", and a "bounding set". When the kernel checks for a capability, it checks the effective mask. The other capability masks serve to modify what capabilities can be in "effective".

Due to the shortcomings of the Linux capabilities implementation, capability inheritance is not very useful. To solve these problems, this release adds a fifth capability mask called the "ambient" mask. The ambient mask does what most people expect "inheritable" to do. No capability bit can ever be set in "ambient" if it is not set in both "inheritable" and "permitted". Dropping a bit from "permitted" or "inheritable" drops that bit from "ambient". This ensures that existing programs that try to drop capabilities still do so, with one complication. Because capability inheritance is so broken, setting prctl(PR_SET_KEEPCAPS,...), using setresuid() to switch to nonroot uids, and then calling execve() effectively drops capabilities. Therefore, setresuid() from root to nonroot conditionally clears "ambient" unless SECBIT_NO_SETUID_FIXUP is set.

If you are nonroot but you have a capability, you can add it to "ambient". If you do so, your children get that capability in "ambient", "permitted", and "effective". For example, you can set CAP_NET_BIND_SERVICE in "ambient", and your children can automatically bind low-numbered ports. Unprivileged users can create user namespaces, map themselves to a nonzero uid, and create both privileged (relative to their namespace) and unprivileged process trees.
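The rules for the ambient mask can be pictured with a toy bitmask model. This is only a sketch of the invariants described above, not kernel code: the class `CapState` and its methods are hypothetical, and `exec_plain_binary` assumes the executed file has no file capabilities and is not setuid.

```python
from dataclasses import dataclass, replace

CAP_NET_BIND_SERVICE = 1 << 10  # capability number 10
CAP_NET_RAW = 1 << 13           # capability number 13


@dataclass
class CapState:
    """Toy model of a task's capability masks as integer bitmasks."""
    permitted: int = 0
    inheritable: int = 0
    effective: int = 0
    ambient: int = 0

    def raise_ambient(self, cap):
        # A bit may enter "ambient" only if it is in both other masks.
        if (self.permitted & self.inheritable & cap) != cap:
            raise PermissionError("cap must be permitted and inheritable")
        self.ambient |= cap

    def drop_permitted(self, cap):
        self.permitted &= ~cap
        self.ambient &= ~cap  # dropping from "permitted" drops from "ambient"

    def drop_inheritable(self, cap):
        self.inheritable &= ~cap
        self.ambient &= ~cap  # likewise for "inheritable"

    def exec_plain_binary(self):
        # Assumption: no file capabilities and no setuid bit, so the child's
        # permitted and effective masks are taken from the ambient mask.
        return replace(self, permitted=self.ambient, effective=self.ambient)


task = CapState(permitted=CAP_NET_BIND_SERVICE,
                inheritable=CAP_NET_BIND_SERVICE)
task.raise_ambient(CAP_NET_BIND_SERVICE)
child = task.exec_plain_binary()
print(bool(child.effective & CAP_NET_BIND_SERVICE))  # prints True
```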

Recommended LWN article: Inheriting capabilities

Code: commit, commit

1.6. Introduce idle page tracking, a more precise way to track the memory being used by applications

Knowing which memory pages are being accessed by a workload and which are idle can be useful for estimating the workload's working set size, which, in turn, can be taken into account when configuring the workload parameters, setting memory cgroup limits, or deciding where to place the workload within a compute cluster. Currently, the only means to estimate the amount of idle memory provided by the kernel is /proc/PID/clear_refs and /proc/PID/smaps: the user can clear the access bit for all pages mapped to a particular process by writing 1 to clear_refs, wait for some time, and then count smaps:Referenced. However, this method has two serious shortcomings: 1) it does not count unmapped file pages, 2) it affects the reclaimer logic.
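The older clear_refs/smaps procedure described above could be scripted roughly as follows. The function names are hypothetical, and the second helper needs permission to write to the target process's clear_refs file, so only the parsing step is exercised on sample text here.

```python
import time


def referenced_kb(smaps_text):
    """Sum the 'Referenced:' fields (in kB) of a /proc/PID/smaps dump."""
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("Referenced:"):
            total += int(line.split()[1])
    return total


def clear_refs_and_measure(pid, wait_seconds):
    """Clear access bits, wait, then report the referenced memory in kB."""
    with open(f"/proc/{pid}/clear_refs", "w") as f:
        f.write("1")
    time.sleep(wait_seconds)
    with open(f"/proc/{pid}/smaps") as f:
        return referenced_kb(f.read())


# Illustrative smaps fragment (made-up values, two mappings).
sample = (
    "Rss:                   8 kB\n"
    "Referenced:            4 kB\n"
    "Rss:                  16 kB\n"
    "Referenced:           12 kB\n"
)
print(referenced_kb(sample))  # prints 16
```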

To overcome these drawbacks, this release introduces the idle page tracking feature.

The steps for estimating the number of pages that are not used by a workload are described in the documentation listed below.

Recommended LWN article: Tracking actual memory utilization

Documentation: Documentation/vm/idle_page_tracking.txt

Code: commit, commit, commit, commit, commit, commit, commit, commit

1.7. Support for IPv6 Identifier Locator Addressing

This release adds support for Identifier Locator Addressing (ILA), a mechanism meant to implement tunnels or network virtualization without encapsulation. The basic concept of ILA is that an IPv6 address is split into a 64-bit locator and a 64-bit identifier. The identifier is the identity of an entity in communication ("who") and the locator expresses the location of the entity ("where"). Applications use an externally visible address that contains the identifier. When a packet is actually sent, a translation is done that overwrites the first 64 bits of the address with a locator. The packet can then be forwarded over the network to the host where the addressed entity is located. At the receiver, the reverse translation is done so that the application sees the original, untranslated address.

This feature is configured with the "ip -6 route" command using the "encap ila <locator>" option, where <locator> is the value to set in the destination locator of the packet. For example, ip -6 route add 3333:0:0:1:5555:0:1:0/128 encap ila 2001:0:0:1 via 2401:db00:20:911a:face:0:25:0 sets a route where 3333:0:0:1 is overwritten by 2001:0:0:1 on output.
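The address rewrite itself is simple bit manipulation. The sketch below reproduces only that step for the example route above; the function name `ila_translate` is hypothetical, and anything else the kernel does during translation is left out.

```python
import ipaddress

LOW_64 = (1 << 64) - 1  # mask selecting the identifier (lower 64 bits)


def ila_translate(address, locator):
    """Overwrite the upper 64 bits of an IPv6 address with a 64-bit locator."""
    addr = int(ipaddress.IPv6Address(address))
    # A locator such as "2001:0:0:1" is four 16-bit groups; appending "::"
    # pads it to a full address whose lower 64 bits are zero.
    loc = int(ipaddress.IPv6Address(locator + "::"))
    identifier = addr & LOW_64             # "who": left untouched
    return ipaddress.IPv6Address((loc & ~LOW_64) | identifier)


sent = ila_translate("3333:0:0:1:5555:0:1:0", "2001:0:0:1")
# The receiver applies the reverse translation with the original prefix.
received = ila_translate(str(sent), "3333:0:0:1")
print(sent.exploded)
print(received.exploded)
```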

Recommended LWN article: Identifier locator addressing

RFC draft: draft-herbert-nvo3-ila-00

Slides from netconf: netconf2015Herbert-ILA.pdf

Slides from presentation at IETF: slides-92-nvo3-1.pdf

Code: commit

1.8. Network light weight tunnels

This release provides infrastructure to support light weight tunnels such as MPLS IP tunnels. It allows for scalable flow-based encapsulation without bearing the overhead of a full-blown netdevice.

iproute2 is extended with new use cases:

{{{
VXLAN: ip route add 40.1.1.1/32 encap vxlan id 10 dst 50.1.1.2 dev vxlan0

MPLS: ip route add 10.1.1.0/30 encap mpls 200 via inet 10.1.1.1 dev swp1
}}}

Code: commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit

1.9. Virtual Routing and Forwarding (Lite) support

This release adds a Virtual Routing and Forwarding (VRF) device that, combined with ip rules, provides the ability to create virtual routing and forwarding domains (aka VRFs, VRF-lite to be specific) in the Linux network stack. One use case is the multi-tenancy problem, where each tenant has its own unique routing tables and, at the very least, needs a different default gateway. It is a cross between the functionality that the IPVLAN and Team drivers provide, where a device is created and packets into/out of the routing domain are shuttled through this VRF device. The device is then used as a handle to identify the applicable rules. The VRF device is thus the layer 3 equivalent of a VLAN device.

Processes can be "VRF aware" by binding a socket to the VRF device. Packets through the socket then use the routing table associated with the VRF device. An important feature of the VRF device implementation is that it impacts only Layer 3 and above, so L2 tools (e.g., LLDP) are not affected (i.e., they do not need to be run in each VRF). The design also allows the use of higher priority ip rules (Policy Based Routing) to take precedence over the VRF device rules, directing specific traffic as desired. In addition, VRF devices allow VRFs to be nested within namespaces. For example, network namespaces provide separation of network interfaces at L1 (Layer 1 separation), VLANs on the interfaces within a namespace provide L2 separation, and then VRF devices provide L3 separation.
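Binding a socket to a device is done on Linux with the SO_BINDTODEVICE socket option. The following is a minimal sketch assuming a Linux host: the helper name `bind_to_vrf` and the device name are hypothetical, and the call fails (returning False here) when no such device exists or the process lacks the required privilege.

```python
import socket


def bind_to_vrf(sock, vrf_name):
    """Bind sock to the named VRF device; return True on success."""
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE,
                        vrf_name.encode() + b"\0")
    except OSError:
        # Device missing or insufficient privilege.
        return False
    return True


# "vrf-missing0" is a made-up name that should not exist on the host,
# so this demonstrates the failure path.
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
    print(bind_to_vrf(s, "vrf-missing0"))
```

With a real VRF device in place, packets sent through the bound socket would then use the routing table associated with that device.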

Documentation: Documentation/networking/vrf.txt

Code: commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit

2. Core (various)

3. File systems

4. Memory management

5. Block layer

6. Cryptography

7. Security

8. Tracing and perf tool

9. Virtualization

10. Networking

11. Architectures

12. Drivers

12.1. Graphics


12.2. Storage

12.3. Staging

12.4. Networking

12.5. Audio

12.6. Tablets, touch screens, keyboards, mouses

12.7. TV tuners, webcams, video capturers

12.8. USB

12.9. Serial Peripheral Interface (SPI)

12.10. Watchdog

12.11. Serial

12.12. SOC (System On Chip) specific Drivers

12.13. ACPI, EFI, cpufreq, thermal, Power Management

12.14. Real Time Clock (RTC)

12.15. Voltage, current regulators, power capping, power supply

12.16. Pin Controllers (pinctrl)

12.17. Memory Technology Devices (MTD)

12.18. Multi Media Card

12.19. Industrial I/O (iio)

12.20. Multi Function Devices (MFD)

12.21. Inter-Integrated Circuit (I2C)

12.22. Hardware monitoring (hwmon)

12.23. General Purpose I/O (gpio)

12.24. Clocks

12.25. PCI

12.26. DMA Engines

12.27. Various

13. List of merges

14. Other news sites

KernelNewbies: Linux_4.3 (last edited 2017-12-30 01:30:22 by localhost)