Linux kernel version 2.6.24, released 24 January 2008 (full SCM git log)

1. Short overview (for news sites, etc)

2.6.24 includes CPU "group scheduling", memory fragmentation avoidance, tickless support for x86-64/ppc and other architectures, many new wireless drivers and a new wireless configuration interface, SPI/SDIO MMC support, USB authorization, per-device dirty memory thresholds, support for PID and network namespaces, support for static probe markers, SELinux performance improvements, SATA link power management and port multiplier support, Large Receive Offload in network devices, memory hot-remove support, a new framework for controlling the idle processor power management, CIFS ACLs support, many new drivers and many other features and fixes.

2. Important things (AKA: ''the cool stuff'')

2.1. CFS improvements

Performance/size improvements

The CFS task scheduler merged in Linux 2.6.23 gets some micro-optimization work in 2.6.24. CFS context switching in 2.6.23 is more than 10% slower than in the old task scheduler. With the optimizations done in 2.6.24, CFS is now even a bit faster than the old task scheduler (which was already quite fast). The compiled size of the scheduler has also improved: it is now a bit smaller on UP and a lot smaller on SMP.

Fair Group Scheduling

You can read this recommended article about the Fair Group Scheduling feature.

Another feature in the scheduler is Fair Group Scheduling. Normally the scheduler operates on individual tasks and strives to provide fair CPU time to each task. Sometimes, it may be desirable to group tasks and provide fair CPU time to each such task group. For example, it may be desirable to first provide fair CPU time to each user on the system and then to each task belonging to a user. In other words, given two users, one running one CPU-bound process and the other running two CPU-bound processes, you may want to give 50% of CPU time to the first user and his task, and 50% to the other user, which will be shared between his two processes - 25% of CPU time for each. Group Scheduling provides the ability to choose partitions that support this scenario.
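The arithmetic of the two-user example can be sketched in a few lines of Python. This is only an illustration of the two-level split (first between users, then between each user's tasks), assuming every task is CPU-bound; the function name, the task names and the optional weights are hypothetical and have nothing to do with the kernel's actual implementation.

```python
from collections import defaultdict


def group_fair_shares(task_owners, weights=None):
    """Return the CPU fraction each task gets under two-level fairness.

    task_owners maps a task name to the user (group) that owns it.
    weights optionally maps a user to a relative weight (default 1).
    Assumes every task is CPU-bound and always runnable.
    """
    weights = weights or {}
    tasks_by_user = defaultdict(list)
    for task, user in task_owners.items():
        tasks_by_user[user].append(task)

    total_weight = sum(weights.get(user, 1) for user in tasks_by_user)
    shares = {}
    for user, tasks in tasks_by_user.items():
        # First level: split the CPU between users according to weight.
        user_share = weights.get(user, 1) / total_weight
        # Second level: split the user's share evenly between its tasks.
        for task in tasks:
            shares[task] = user_share / len(tasks)
    return shares


# One user with one task, another user with two tasks.
example = group_fair_shares({"a1": "alice", "b1": "bob", "b2": "bob"})
```

With equal weights, the first user's single task ends up with half of the CPU and each of the second user's tasks with a quarter, matching the 50%/25%/25% split described above.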

At present, there are two (mutually exclusive) mechanisms to group tasks for CPU bandwidth control purposes: 1) Group scheduling based on user ID, which is the case used in the example above. This mechanism is configurable, so you are not limited to a 50%/50% rule - for example, you can give user root double the priority of other users. 2) Group scheduling based on "task control groups" (see section 2.10). This mechanism lets the administrator create arbitrary groups of tasks (e.g. "multimedia", "compiling"), set how much CPU time 'priority' each group gets by writing a value to its cpu_share file, and then attach a PID to whichever task group you want. Documentation on how to use these two features can be found in Documentation/sched-design-CFS.txt.
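The second mechanism boils down to three file operations: create a directory for the group, write a share value, and write a PID. The Python sketch below shows that sequence under stated assumptions: the mount point is passed in by the caller, and the file names `cpu.shares` and `tasks` are assumptions made for this example (the text above calls the share file cpu_share), so check Documentation/sched-design-CFS.txt for the names your kernel uses.

```python
import os

# Assumed file names inside a task group directory; verify them against
# the kernel documentation before relying on them.
SHARES_FILE = "cpu.shares"
TASKS_FILE = "tasks"


def create_task_group(cgroup_root, name, cpu_shares):
    """Create a task group under cgroup_root and set its CPU share value."""
    group_dir = os.path.join(cgroup_root, name)
    os.makedirs(group_dir, exist_ok=True)
    with open(os.path.join(group_dir, SHARES_FILE), "w") as f:
        f.write(f"{cpu_shares}\n")
    return group_dir


def attach_task(group_dir, pid):
    """Move the task with the given PID into the group."""
    with open(os.path.join(group_dir, TASKS_FILE), "w") as f:
        f.write(f"{pid}\n")
```

A call such as `create_task_group(mount_point, "multimedia", 2048)` followed by `attach_task(group_dir, pid)` would mirror the steps described above; on a real system this needs root and a mounted cgroup hierarchy with the CPU controller enabled.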

guest time reporting

Additionally, the task scheduler in 2.6.24 adds a new "guest" field after "system" and "user" in /proc/<PID>/stat, which tracks how much CPU time a task spends running a 'virtual' CPU.
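A small parser shows how a monitoring tool might read this value. The field position used here is an assumption taken from the proc(5) manual page, which lists guest_time as field 43 of /proc/<PID>/stat, measured in clock ticks; verify it against your kernel before relying on it. The function name is hypothetical.

```python
GUEST_TIME_FIELD = 43  # assumed position, per the proc(5) manual page


def parse_guest_time(stat_line):
    """Return guest time in clock ticks from a /proc/<PID>/stat line.

    Returns None when the line is too short to contain the field,
    as would be the case on kernels older than 2.6.24.
    """
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces or parentheses, so split after the last ')'.
    rest = stat_line[stat_line.rindex(")") + 2:].split()
    # rest[0] is field 3 (state), so field N lives at rest[N - 3].
    index = GUEST_TIME_FIELD - 3
    if len(rest) <= index:
        return None
    return int(rest[index])
```

Typical use would be `parse_guest_time(open("/proc/self/stat").read())`, dividing the result by the system's clock tick rate to get seconds.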

2.2. Tickless support for x86-64, PPC, UML, ARM, MIPS

The Tickless feature was added in Linux 2.6.21. This feature allows the kernel to disable timer interrupts for longer, variable periods, saving some power and improving performance, especially in virtual guests. 2.6.24 adds tickless support to the widespread 64-bit x86 architecture, and also to PPC, the virtualized architecture UML, and some variants of ARM and MIPS. They join x86-32, SPARC-64 and SH, which already supported tickless operation.

2.3. New wireless drivers and configuration interface

New wireless configuration interface

In Linux 2.6.22, Linux got a new wireless stack. This new stack is backwards compatible with the old ioctl-based configuration of the old stack. However, the new stack was designed to have a much better configuration interface, based on netlink. While the backwards compatibility isn't going away, all wireless configuration tools are recommended to have long-term plans to switch to the new interface.

Drivers

In Linux 2.6.22, the mac80211 (formerly d80211) wireless stack was merged, but so far only one driver that uses this new stack has been merged. Linux 2.6.24 adds many new wireless drivers using the new stack, 2.3 MB of source files in total.

Many non-wireless network drivers are also being merged; see section 2.13, "New drivers".

2.4. Anti-fragmentation patches

You can read this recommended article about the "Anti-fragmentation" feature.

A known weakness of the Linux kernel is the memory fragmentation that the system faces after days without rebooting or after intense operations. This makes it difficult to perform "high-order" memory allocations (allocations larger than the native page size - 4 KB on x86). It's relatively easy to trigger those cases. For example, a network driver may try to allocate 4 pages to store data received from the network. This allocation may not succeed despite there being plenty of free memory available, as there is no single uninterrupted block of memory big enough (fragmentation). For almost three years, patient developers have been continually developing and improving the anti-fragmentation patches to improve the memory allocator and reduce the tendency to fragment. These efforts have finally been merged in 2.6.24.

The purpose of this feature is to reduce external fragmentation by grouping pages of related types together. When pages are migrated (or reclaimed under memory pressure), large contiguous blocks will be freed. Allocations are categorized by their ability to migrate. Tests show that about 60-70% of physical memory can be allocated as large pages on a desktop after a few days of uptime. In benchmarks and stress tests, it has been found that 80% of memory is available as contiguous blocks at the end of the test. To compare, a standard kernel was getting < 1% of memory as large pages on a desktop and about 8-12% of memory as large pages at the end of stress tests.
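The network-driver example can be reproduced with a toy model in Python. It is a deliberate simplification: memory is a flat list of page flags and an allocation only needs a long enough run of free pages, whereas the real allocator works with aligned power-of-two blocks. All names are hypothetical.

```python
def largest_free_run(free_map):
    """Length of the longest run of consecutive free pages."""
    best = run = 0
    for is_free in free_map:
        run = run + 1 if is_free else 0
        best = max(best, run)
    return best


def can_allocate(free_map, pages_needed):
    """True if a contiguous block of pages_needed free pages exists."""
    return largest_free_run(free_map) >= pages_needed


# 16 pages with every other page in use: 8 pages are free,
# but no two free pages are adjacent.
fragmented = [i % 2 == 0 for i in range(16)]

# The same 8 in-use pages grouped together at the start of memory.
grouped = [False] * 8 + [True] * 8
```

Both layouts have half of memory free, yet `can_allocate(fragmented, 4)` fails while `can_allocate(grouped, 4)` succeeds, which is the effect that grouping pages by type aims for.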

2.5. SPI/SDIO support in the MMC layer

The MMC layer, the code that implements support for MMC/SD memory cards, is undergoing one of the biggest transformations in its life: it has been heavily modified to support SDIO and SPI.

SDIO stands for "Secure Digital I/O". It allows the SD card slot (in devices that support SDIO, e.g. PDAs, cell phones or laptops) to be used with "small devices designed for the SD form factor, like GPS receivers, Wi-Fi or Bluetooth adapters, modems, Ethernet adapters, barcode readers, IrDA adapters, FM radio tuners, TV tuners, RFID readers, digital cameras, or other mass storage media such as hard drives" (quote from the Wikipedia entry). There are currently three working drivers for this new stack: sdio_uart, a driver for the standardised GPS interfaces; libertas_sdio, a driver for Marvell's 8686 Libertas wifi chip; and hci_sdio, a driver for the standardised bluetooth interface.

SPI is required by SDIO, and it's a "bus" (like IDE, SATA, USB...) which is used to access a wide range of devices. More importantly, some systems need to access MMC/SD cards through an SPI controller instead of a "native" MMC/SD controller. This has the disadvantage of relatively high overhead, but the compensating advantage of working on many systems without dedicated MMC/SD controllers. 2.6.24 includes support for SPI and an experimental "MMC/SD over SPI" driver. (commit)

2.6. USB authorization

As part of the efforts to make the USB layer ready for wireless USB, Linux 2.6.24 is getting support for USB device authorization, which lets you control whether a USB device (wireless or not) can be used in a system. As of now, when a USB device is connected it is configured and its interfaces are immediately made available to users. With this modification, a device can be used only after root authorizes it to be configured.

Besides providing an infrastructure to allow secure usage of wireless USB devices, this feature also makes it possible to implement kiosk-style lockdown of USB devices, fully controlled by user space. Every USB device has a corresponding /sys/bus/usb/devices/<DEVICE>/authorized file. Writing 1 to that file authorizes a device to connect; writing 0 deauthorizes it. USB hosts can also make newly connected devices deauthorized by default by writing 0 (or 1 to authorize them) to /sys/bus/usb/devices/usb<X>/authorized_default. By default, wired USB devices are authorized to connect, and wireless USB hosts deauthorize all newly connected devices (because they need to go through an authentication phase before being authorized).
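A user-space policy tool only needs to write 0 or 1 to these files. The Python sketch below wraps the two writes described above; the function names and the `sysfs_root` parameter (there so the code can be tried against a scratch directory) are hypothetical, and the real files can only be written by root.

```python
import os

SYSFS_USB = "/sys/bus/usb/devices"


def set_authorized(device, allowed, sysfs_root=SYSFS_USB):
    """Authorize (True) or deauthorize (False) a single USB device."""
    path = os.path.join(sysfs_root, device, "authorized")
    with open(path, "w") as f:
        f.write("1" if allowed else "0")


def set_host_default(host, allowed, sysfs_root=SYSFS_USB):
    """Set whether devices newly connected to a host start out authorized.

    host is the usb<X> directory name of the host controller.
    """
    path = os.path.join(sysfs_root, host, "authorized_default")
    with open(path, "w") as f:
        f.write("1" if allowed else "0")
```

A kiosk-style lockdown could call `set_host_default("usb1", False)` at boot and then call `set_authorized(device, True)` only for devices that pass a local policy check; the host and device names here are placeholders.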

2.7. Per-device dirty memory thresholds

You can read this recommended article about the "per-device dirty thresholds" feature.

When a process writes data to the disk, the data is stored temporarily in 'dirty' memory until the kernel decides to write the data to the disk ('cleaning' the memory used to store the data). A process can 'dirty' the memory faster than the data is written to the disk, so the kernel throttles processes when there's too much dirty memory around. The problem with this mechanism is that the dirty memory thresholds are global: the mechanism doesn't care if there are several storage devices in the system, much less if some of them are faster than others. There are a lot of scenarios where this design harms performance. For example, if there's a very slow storage device in the system (e.g. a USB 1.0 disk, or an NFS mount over dialup), the thresholds are hit very quickly, blocking other processes that may be working on a much faster local disk. Stacked block devices (e.g. LVM/DM) are much worse and even deadlock-prone (check the LWN article).

In 2.6.24, the dirty thresholds are per-device, not global. The limits are variable, depending on the writeout speed of each device. This improves the performance greatly in many situations.
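The idea of limits that follow writeout speed can be illustrated with a toy calculation in Python. The proportional split used here is an assumption made for the example, not the kernel's actual algorithm; the function name, device names and numbers are hypothetical.

```python
def per_device_dirty_limits(global_limit_pages, writeout_rates):
    """Split a global dirty-page limit between devices.

    writeout_rates maps a device name to its recent writeout speed
    (any consistent unit). Each device gets a share of the global limit
    proportional to its speed, so a slow device can no longer fill the
    whole dirty-memory budget on its own.
    """
    if not writeout_rates:
        return {}
    total = sum(writeout_rates.values())
    if total == 0:
        # No writeout observed yet: fall back to an even split.
        even = global_limit_pages // len(writeout_rates)
        return {dev: even for dev in writeout_rates}
    return {
        dev: global_limit_pages * rate // total
        for dev, rate in writeout_rates.items()
    }


limits = per_device_dirty_limits(1000, {"fast_disk": 90, "usb_stick": 10})
```

In this example the slow device is throttled once it holds 100 dirty pages, leaving 900 pages of headroom for processes writing to the fast disk.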

2.8. PID and network namespaces

You can read this recommended article, and this one, about the "PID and network namespaces" feature.

Usually, there's a global PID namespace for a whole Linux system: the list of processes contains all the processes running in the system. There's also a global view of the networking stack (routing tables, firewall rules, etc). However, operating-system virtualization solutions like OpenVZ or Vserver need to have different views of the PID namespace and the networking stack. Linux 2.6.24 adds PID namespaces and basic support for network namespaces. They're used through the CLONE_NEWPID and CLONE_NEWNET clone() flags.

2.9. Large Receive Offload (LRO) support for TCP traffic

You can read this recommended article about the "Large Receive Offload" feature.

LRO combines several received TCP packets into a single larger TCP packet and then passes it to the network stack, in order to increase performance (throughput). After many out-of-tree iterations, mainline Linux is getting support for this feature (commit), (commit), (commit)
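The aggregation step can be sketched in Python as follows. This is a simplified model, not the kernel code: a segment is merged only when it belongs to the same flow as an open aggregate and its sequence number continues exactly where that aggregate ends. Real implementations also check TCP flags, options and size limits. The `Segment` type and `aggregate` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    flow: tuple      # e.g. (src_ip, src_port, dst_ip, dst_port)
    seq: int         # sequence number of the first payload byte
    payload: bytes


def aggregate(segments):
    """Merge in-order segments of the same flow into larger segments."""
    merged = []
    open_by_flow = {}  # flow -> index in merged of the aggregate being built
    for seg in segments:
        idx = open_by_flow.get(seg.flow)
        if idx is not None:
            last = merged[idx]
            # Merge only if this segment starts where the aggregate ends.
            if last.seq + len(last.payload) == seg.seq:
                merged[idx] = Segment(last.flow, last.seq,
                                      last.payload + seg.payload)
                continue
        # Otherwise start a new aggregate for this flow.
        open_by_flow[seg.flow] = len(merged)
        merged.append(Segment(seg.flow, seg.seq, seg.payload))
    return merged
```

The network stack then processes one large segment instead of several small ones, which is where the throughput gain comes from.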

2.10. Task Control Groups

There have been various proposals in the Linux arena for resource management/accounting and other task grouping subsystems in the kernel (Resgroups, User Beancounters, NSProxy cgroups, and others). Task Control Groups is the framework that is getting merged in 2.6.24 to provide the functionality that led to the creation of such proposals. TCG can track and group processes into arbitrary "cgroups" and assign arbitrary state to those groups, in order to control their behaviour. The intention is that other subsystems hook into the generic cgroup support to provide new attributes for cgroups, such as accounting/limiting the resources which processes in a cgroup can access.

For example, cpusets (see Documentation/cpusets.txt) allow you to associate a set of CPUs and a set of memory nodes with the tasks in each cgroup. The CFS group scheduling feature uses cgroups to control the CPU time that every cgroup can get. Various other resource management and virtualization/cgroup efforts can become task cgroup clients. The configuration interface is described in Documentation/cgroups.txt.

2.11. Linux Kernel Markers

You can read this recommended article about the "Linux Kernel Markers" feature.

The Linux Kernel Markers implement static probing points for the Linux kernel. Dynamic probing systems like kprobes/dtrace can put probes pretty much anywhere. However, the scripts that use dynamic probing points can quickly become outdated, because a small change in the kernel may force a rewrite of the script, which needs to be maintained and updated separately and will not work for all kernel versions. That's why static probing points are useful: they can be put directly into the kernel source code, so they are always in sync with kernel development. Static probing points can apparently also have some performance advantages, and they have no performance cost when they're not being used.

The kernel markers are a sort of "derivative" of the long-time external patchset "Linux Trace Toolkit" (LTT), which has been around since 1999. The Kernel Markers are a feature needed for the SystemTap project. In this release, no probing points are included, but many will certainly be included in the future, and some tracing tools like blktrace will probably be ported to this kind of infrastructure.

2.12. x86-32/64 arch reunification

You can read this recommended article.

When support for the x86-64 AMD architecture was developed, it was decided to develop it as a "fork" of the traditional x86 architecture for convenience. As a result, many patches had to modify a file in the i386 architecture directory and then apply a similar change to the duplicated file in the x86_64 directory. It has now been decided to unify both architectures in the same directory again.

This reunification has not been done in a radical way. In this release, both architectures have been unified in arch/x86, but only in appearance. All the source files in the i386 and x86-64 directories have been moved to arch/x86 and renamed with "_32" and "_64" suffixes. For example, arch/i386/kernel/reboot.c has been moved to arch/x86/kernel/reboot_32.c, and arch/x86_64/kernel/reboot.c has been moved to arch/x86/kernel/reboot_64.c. Makefiles have been modified accordingly. So for now the reunification has been pretty much just a relocation of all the files and an adaptation of the build machinery to make it compile just as it would have in the old separate directories, done mostly with scripts.
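The renaming rule is mechanical enough to express as a short Python function, which can help when mapping paths in old patches to the new layout. It is a sketch of the rule as described above, checked only against the reboot.c example; the actual move had exceptions (some files were unified outright), and the function name is hypothetical.

```python
import posixpath

# Old architecture directory -> suffix added to the file name.
SUFFIX_BY_PREFIX = {
    "arch/i386/": "_32",
    "arch/x86_64/": "_64",
}


def unified_path(old_path):
    """Map a pre-2.6.24 arch path to its location under arch/x86."""
    for prefix, suffix in SUFFIX_BY_PREFIX.items():
        if old_path.startswith(prefix):
            rest = old_path[len(prefix):]
            stem, ext = posixpath.splitext(rest)
            return f"arch/x86/{stem}{suffix}{ext}"
    # Paths outside the two old directories are left untouched.
    return old_path
```

For instance, `unified_path("arch/i386/kernel/reboot.c")` gives `arch/x86/kernel/reboot_32.c`, the mapping used as the example above.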

In the future, lots of those files will be unified and shared by both architectures (e.g. reboot_32.c and reboot_64.c into reboot.c), and many files have already been unified in this release. Others will remain separate forever, due to the differences between the two architectures.

2.13. New drivers

Graphics

SATA/IDE

Network(wireless)

Network

Sound

MTD

USB

V4L/DVB

Hwmon

I2C

Bluetooth

3. Subsystems

3.1. Memory management

3.2. Various

3.3. Networking

3.4. Filesystems

3.5. CRYPTO

3.6. SELinux

3.7. KVM

3.8. DM

3.9. Audit

3.10. Architecture-specific changes

4. Drivers

4.1. Buses

4.2. Graphics

4.3. SATA/IDE

4.4. Networking

4.5. Sound

4.6. ACPI

4.7. MTD

4.8. Input

4.9. SCSI

4.10. USB

4.11. HID

4.12. V4L/DVB

4.13. HWMON

4.14. Cpufreq

4.15. I2C

4.16. Bluetooth

4.17. Watchdog

4.18. FireWire

4.19. Various

5. Crashing soon a kernel near you

This is a list of some of the patches being developed right now in the kernel community that will be part of future Linux releases. Those features may take many months to get into Linus' git tree, or may be dropped completely. You can test the features in the -mm tree, but be warned: it can crash your machine, eat your data (unlikely but not impossible) or kidnap your family (just because it has never happened doesn't mean it can't happen).

Reading the Linux Weather Forecast page is recommended.

KernelNewbies: Linux_2_6_24 (last edited 2017-12-30 01:29:55 by localhost)