
Spam: Ulrich Drepper, the libc maintainer, has published a [http://people.redhat.com/drepper/cpumemory.pdf must-read paper] about "What every programmer should know about memory"

Linux kernel version 2.6.24 Released ([http://kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.24 full SCM git log])


1. Short overview (for news sites, etc)

2.6.24 includes a "group scheduling" feature (built on top of a new resource management framework) to assign different CPU resources to groups of users, memory fragmentation avoidance, tickless support for x86-64, PPC and other architectures, many new wireless drivers and a new wireless configuration interface, SPI/SDIO MMC support, USB authorization, per-device dirty memory thresholds, support for PID and network namespaces, support for static probe markers, read-only bind mounts, SELinux performance improvements, SATA link power management and port multiplier support, Large Receive Offload in network devices, memory hot-remove support, a new framework for controlling the idle processor power management, CIFS ACLs support, many new drivers and many other features and fixes.

2. Important things (AKA: ''the cool stuff'')

2.1. CFS improvements

Performance/size improvements

The CFS task scheduler [http://kernelnewbies.org/Linux_2_6_23#head-f3a847a5aace97932f838027c93121321a6499e7 merged in Linux 2.6.23] is getting [http://lkml.org/lkml/2007/9/11/395 some microoptimization work] in 2.6.24. 2.6.23's CFS context switching is more than 10% slower than the old task scheduler. With the optimization done in 2.6.24, CFS is now even a bit faster than the old task scheduler (which is quite fast already). The compiled size of the scheduler has also improved: it is now a bit smaller on UP and a lot smaller on SMP.

Fair Group Scheduling

You can read [http://lwn.net/Articles/240474/ this recommended article] about the Fair Group Scheduling feature.

Another new scheduler feature is Fair Group Scheduling. Normally the scheduler operates on individual tasks and strives to provide fair CPU time to each task. Sometimes it may be desirable to group tasks and provide fair CPU time to each such task group. For example, it may be desirable to first provide fair CPU time to each user on the system and then to each task belonging to a user. In other words, given two users, one running one CPU-bound process and the other running two CPU-bound processes, you may want to give 50% of CPU time to the first user and his task, and 50% to the other user, which will be shared between his two processes: 25% of CPU time for each.
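The arithmetic of the example above can be written down as a small sketch. This only models the resulting split of CPU time, not how the scheduler implements it; the function name and the user and task names are made up for illustration, and every user is assumed to have the same weight.

```python
from fractions import Fraction


def fair_shares(tasks_by_user):
    """Split CPU time evenly between users, then evenly between each user's tasks."""
    # Users without runnable tasks do not take part in the split.
    busy_users = {user: tasks for user, tasks in tasks_by_user.items() if tasks}
    if not busy_users:
        return {}
    per_user = Fraction(1, len(busy_users))
    shares = {}
    for user, tasks in busy_users.items():
        for task in tasks:
            shares[(user, task)] = per_user / len(tasks)
    return shares


# One user with one CPU-bound task, another user with two.
shares = fair_shares({"alice": ["encode"], "bob": ["make", "gzip"]})
for (user, task), share in shares.items():
    print(f"{user}/{task}: {float(share):.0%}")
```

With plain per-task fairness, each of the three tasks would instead get a third of the CPU, so the user running two tasks would get twice as much CPU time as the other.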

That's the kind of thing that the Group Scheduling feature does. At present, there are two (mutually exclusive) mechanisms to group tasks for CPU bandwidth control purposes: 1) Group scheduling based on user id, which is the case used as an example above. This mechanism is configurable, which means you are not limited to a 50%/50% rule: you can, for instance, give user root twice the priority of other users. 2) Group scheduling based on "task control groups" (see section 2.10). This mechanism lets the administrator create arbitrary groups of tasks (e.g. "multimedia", "compiling"), set how much CPU time 'priority' to give to a group by writing the value to its cpu_share file, and then attach a PID to whatever task group you want. Documentation on how to use those two features can be found at Documentation/sched-design-CFS.txt.

Guest time reporting

Additionally, the task scheduler in 2.6.24 adds a new "guest" field after "system" and "user" in /proc/<PID>/stat, which tracks how much CPU time a task spends running a 'virtual' CPU.
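A minimal sketch of reading that value from user space follows. The field positions are an assumption taken from the proc(5) manual page rather than from these notes: utime is field 14, stime is field 15 and guest_time is field 43, all counted in clock ticks. The sample line is synthetic.

```python
def cpu_times_from_stat(stat_line):
    """Return (utime, stime, guest_time) in clock ticks from a /proc/<PID>/stat line."""
    # Field 2 (the command name) is wrapped in parentheses and may contain
    # spaces, so split on the last ')' instead of on whitespace alone.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field N lives at rest[N - 3].
    utime = int(rest[14 - 3])
    stime = int(rest[15 - 3])
    # Kernels older than 2.6.24 do not report guest_time at all.
    guest = int(rest[43 - 3]) if len(rest) > 43 - 3 else 0
    return utime, stime, guest


# Synthetic 44-field line: pid, (comm), state, then numeric fields.
fields = ["1234", "(my prog)", "S"] + ["0"] * 41
fields[14 - 1] = "700"   # utime
fields[15 - 1] = "300"   # stime
fields[43 - 1] = "250"   # guest_time
sample = " ".join(fields)

print(cpu_times_from_stat(sample))
```

On a Linux system the same function can be fed the contents of /proc/self/stat; dividing by the clock tick rate converts the values to seconds.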

2.2. Tickless support for x86-64, PPC, UML, ARM, MIPS

The Tickless feature was [http://kernelnewbies.org/Linux_2_6_21#head-8547911895fda9cdff32a94771c8f5706d66bba0 added in Linux 2.6.21]. This feature allows the kernel to disable timer interrupts for longer, variable periods, saving some power and improving performance, especially in virtual guests. 2.6.24 adds tickless support to the widespread 64-bit x86 architecture, and also to PPC, the virtualized architecture UML, and some variants of ARM and MIPS. They join the already supported x86-32, SPARC-64 and SH.

2.3. New wireless drivers and configuration interface

New wireless configuration interface

In [http://kernelnewbies.org/Linux_2_6_22 Linux 2.6.22], Linux got a new and shiny wireless stack. This new stack is backwards compatible with the ioctl-based configuration of the old stack. However, the new stack was designed to have a much better configuration interface, based on netlink. While the backwards compatibility isn't going away, authors of wireless configuration tools are encouraged to make long-term plans to switch to this interface.

Drivers

In Linux 2.6.22, the mac80211 (formerly d80211) wireless stack was [http://kernelnewbies.org/Linux_2_6_22#head-1498b990e997cc0e95dbfa9047e7ebe8d84847cc merged], but only one driver using the new stack was merged with it. Linux 2.6.24 has a lot of new wireless drivers using the new stack, 2.3 MB of source files in total.

A lot of network (non-wireless) drivers have also been merged; see section 2.14, "New drivers".

2.4. Anti-fragmentation patches

You can read [http://lwn.net/Articles/224829/ this recommended article] about the "Anti-fragmentation" feature.

A known weakness in the Linux kernel is the memory fragmentation that the system faces after days without rebooting or after intense operations, which makes it difficult to make "high-order" memory allocations (allocations larger than the native page size, 4 KB on x86). It's relatively easy to trigger those cases; for example, a network driver may try to allocate 4 contiguous pages to hold data received from the network, and the allocation may fail despite lots of free memory being available, due to fragmentation. For almost three years, patient developers have been continually developing and improving the anti-fragmentation patches to improve the memory allocator and reduce fragmentation, and the patches have finally been merged in 2.6.24.
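The failure mode described above can be illustrated with a toy model. This is only a sketch: memory is a list of booleans (True means the page frame is free), and an allocation of order N needs an aligned run of 2^N free frames, which is a simplification of how the kernel's buddy allocator hands out blocks. The function name is made up for illustration.

```python
def has_free_block(free_pages, order):
    """Return True if an aligned run of 2**order free page frames exists."""
    size = 1 << order
    # Only look at aligned starting positions: 0, size, 2*size, ...
    for start in range(0, len(free_pages) - size + 1, size):
        if all(free_pages[start:start + size]):
            return True
    return False


# 16 page frames with every other one free: half of memory is free...
fragmented = [i % 2 == 0 for i in range(16)]
# ...and the same amount of free memory, but grouped together.
grouped = [i < 8 for i in range(16)]

# A 4-page allocation is an order-2 allocation.
print(sum(fragmented), has_free_block(fragmented, 2))  # 8 False
print(sum(grouped), has_free_block(grouped, 2))        # 8 True
```

Both layouts have 8 free pages, but only the grouped one can satisfy the 4-page request.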

The purpose of this feature is to reduce external fragmentation by grouping pages of related types together. When pages are migrated (or reclaimed under memory pressure), large contiguous blocks of pages will be freed. Allocations are categorised by their ability to migrate. Tests show that about 60-70% of physical memory can be allocated as large pages on a desktop after a few days of uptime. In benchmarks and stress tests, it has been found that 80% of memory is available as contiguous blocks at the end of the test. To compare, a standard kernel was getting < 1% of memory as large pages on a desktop and about 8-12% of memory as large pages at the end of stress tests.

2.5. SPI/SDIO support in the MMC layer

The MMC layer, the code which implements support for MMC/SD memory cards, is undergoing one of the biggest transformations in its life: it has been [http://lkml.org/lkml/2007/9/24/37 heavily modified] to get support for [http://en.wikipedia.org/wiki/Secure_Digital_card#SDIO SDIO] and [http://en.wikipedia.org/wiki/Serial_Peripheral_Interface_Bus SPI].

SDIO stands for "Secure Digital I/O". It allows the SD card slot (in devices that support SDIO, e.g. PDAs, cell phones or laptops) to be used with "small devices designed for the SD form factor, like GPS receivers, Wi-Fi or Bluetooth adapters, modems, Ethernet adapters, barcode readers, IrDA adapters, FM radio tuners, TV tuners, RFID readers, digital cameras, or other mass storage media such as hard drives" (quote from the [http://en.wikipedia.org/wiki/Secure_Digital_card#SDIO Wikipedia entry]). There are currently three working drivers for this new stack: sdio_uart, a driver for the standardised GPS interfaces; libertas_sdio, a driver for Marvell's 8686 Libertas wifi chip; and hci_sdio, a driver for the standardised bluetooth interface.

SPI is required by SDIO, and it's a "bus" (like IDE, SATA, USB...) which is used to access a wide range of devices. More importantly, some systems need to access MMC/SD cards through an SPI controller instead of a "native" MMC/SD controller. This has the disadvantage of relatively high overhead, but the compensating advantage of working on many systems without dedicated MMC/SD controllers. 2.6.24 includes support for SPI and an experimental "MMC/SD over SPI" driver.

2.6. USB authorization

As part of the efforts to make the USB layer ready for [http://en.wikipedia.org/wiki/Wireless_USB wireless USB], Linux 2.6.24 is getting support for USB device authorization, which allows you to control whether a USB device (wireless or not) can be used in a system. Until now, when a USB device was connected it was configured and its interfaces were immediately made available to the users. With this modification, a device can only be used if root authorizes it to be configured.

Besides providing an infrastructure to allow secure usage of wireless USB devices, this feature also makes it possible to implement kiosk-style lockdown of USB devices, fully controlled by user space. Every USB device has a corresponding /sys/bus/usb/devices/<DEVICE>/authorized file. Writing 1 to that file authorizes a device to connect, 0 deauthorizes it. USB hosts can also set newly connected devices to be deauthorized by writing 0 (or 1 to authorize) to /sys/bus/usb/devices/usb<X>/authorized_default. By default, wired USB devices are authorized to connect, and wireless USB hosts deauthorize all newly connected devices (this is because they need to go through an authentication phase before being authorized).
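A minimal sketch of driving those sysfs files from a user space tool follows. The two file names come from the text above; the helper names, the `root` parameter (useful for trying the code against a scratch directory) and the device name in the usage note are made up for illustration. On a real system the writes require root.

```python
from pathlib import Path

SYSFS_USB = Path("/sys/bus/usb/devices")


def set_authorized(device, allow, root=SYSFS_USB):
    """Write 1 (authorize) or 0 (deauthorize) to <root>/<device>/authorized."""
    path = Path(root) / device / "authorized"
    path.write_text("1" if allow else "0")
    return path


def set_authorized_default(host, allow, root=SYSFS_USB):
    """Set the policy applied to devices newly connected to a host such as 'usb1'."""
    path = Path(root) / host / "authorized_default"
    path.write_text("1" if allow else "0")
    return path
```

A kiosk-style lockdown could then call `set_authorized_default("usb1", False)` once at boot and `set_authorized("1-2", True)` only for devices that pass whatever policy the tool implements ("1-2" is a hypothetical device name).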

2.7. Per-device dirty memory thresholds

You can read [http://lwn.net/Articles/245600/ this recommended article] about the "per-device dirty thresholds" feature.

When a process writes data to the disk, the data is stored temporarily in 'dirty' memory until the kernel decides to write the data to the disk ('cleaning' the memory used to store the data). A process can 'dirty' the memory faster than the data is written to the disk, so the kernel throttles processes when there's too much dirty memory around. The problem with this mechanism is that the dirty memory thresholds are global: the mechanism doesn't care whether there are several storage devices in the system, much less whether some of them are faster than others. There are a lot of scenarios where this design harms performance. For example, if there's a very slow storage device in the system (e.g. a USB 1.0 disk, or an NFS mount over dialup), the thresholds are hit very quickly, which prevents other processes that may be working on a much faster local disk from making progress. Stacked block devices (e.g. LVM/DM) are much worse and even deadlock-prone (check the LWN article).

In 2.6.24, the dirty thresholds are per-device, not global. The limits are variable, depending on the writeout speed of each device. This greatly improves performance in many situations.
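As a rough illustration of limits that depend on writeout speed, the sketch below splits one global dirty-page budget between devices in proportion to how fast each one has recently been writing data out. The proportional rule, the function name, the device names and the numbers are all assumptions made for the example; this is not the kernel's actual algorithm.

```python
def per_device_limits(global_limit_pages, writeout_rates):
    """Split a global dirty-page limit in proportion to each device's writeout rate."""
    total = sum(writeout_rates.values())
    if total == 0:
        # No writeout observed yet: fall back to an even split.
        even = global_limit_pages // max(len(writeout_rates), 1)
        return {dev: even for dev in writeout_rates}
    return {dev: global_limit_pages * rate // total
            for dev, rate in writeout_rates.items()}


# A fast local disk and a slow USB stick sharing a 10000-page budget.
limits = per_device_limits(10_000, {"sda": 90, "usb-stick": 10})
print(limits)  # {'sda': 9000, 'usb-stick': 1000}
```

In this model a process writing to the slow device is throttled once that device's small share is dirty, while writers to the fast disk keep most of the budget.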

2.8. PID and network namespaces

You can read [http://lwn.net/Articles/256389/ this recommended article], and [http://lwn.net/Articles/259217/ this one], about the "PID and network namespaces" feature.

Usually, there's a global PID namespace for a whole Linux system: the list of processes contains all the processes running in the system. There's also a global view of the networking stack (routing tables, firewall rules, etc). However, [http://en.wikipedia.org/wiki/Operating_system-level_virtualization operating-system virtualization] solutions like [http://openvz.org OpenVZ] or [http://en.wikipedia.org/wiki/Linux-VServer Vserver] need to have different views of the PID namespace and the networking stack. Linux 2.6.24 adds PID namespaces and basic support for network namespaces. They're used through the CLONE_NEWPID and CLONE_NEWNET clone() flags.
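For reference, the two flags are plain bits that get OR-ed into the flags argument of clone(). The sketch below only builds that bitmask; the numeric values are taken from the kernel's linux/sched.h header rather than from these notes, and the helper name is made up for illustration.

```python
# Values as defined in the kernel's <linux/sched.h>.
CLONE_NEWPID = 0x20000000  # child starts in a new PID namespace
CLONE_NEWNET = 0x40000000  # child starts in a new network namespace


def namespace_flags(new_pid=False, new_net=False):
    """Build the namespace part of a clone() flags argument."""
    flags = 0
    if new_pid:
        flags |= CLONE_NEWPID
    if new_net:
        flags |= CLONE_NEWNET
    return flags


print(hex(namespace_flags(new_pid=True, new_net=True)))  # 0x60000000
```

Actually creating a namespace with these flags requires privileges, so the example stops at computing the value.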

2.9. Large Receive Offload (LRO) support for TCP traffic

You can read [http://lwn.net/Articles/243949/ this recommended article] about the "Large Receive Offload" feature.

LRO combines several received TCP packets into a single larger TCP packet and then passes it to the network stack, in order to increase performance (throughput). After many out-of-tree iterations, mainline Linux is getting support for this feature [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=71c87e0cedca843162206c698cfa02e5fea9e2e3 (commit)], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=1e6e9342d41ff80ced0ad5dfcf084926700cdfc5 (commit)], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d4dc4ec9d84e0578b9bfbe56a11fafdb7cbac771 (commit)]
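The idea can be sketched with a toy aggregator that merges in-order segments of the same flow before handing them on. This is a simplified model, not the kernel's LRO code: the `Segment` type, the flow tuples and the rule of starting a new segment on any sequence gap are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    flow: tuple      # (src_ip, src_port, dst_ip, dst_port)
    seq: int         # sequence number of the first payload byte
    payload: bytes


def aggregate(segments):
    """Merge in-order segments of the same flow into larger segments."""
    merged = []
    open_by_flow = {}  # flow -> index in merged of the segment still being built
    for seg in segments:
        idx = open_by_flow.get(seg.flow)
        if idx is not None:
            last = merged[idx]
            # Only append if this segment continues exactly where the last ended.
            if last.seq + len(last.payload) == seg.seq:
                merged[idx] = Segment(last.flow, last.seq, last.payload + seg.payload)
                continue
        # First segment of the flow, or a gap: start a new aggregated segment.
        open_by_flow[seg.flow] = len(merged)
        merged.append(Segment(seg.flow, seg.seq, seg.payload))
    return merged


flow_a = ("10.0.0.1", 5000, "10.0.0.2", 80)
flow_b = ("10.0.0.3", 6000, "10.0.0.2", 80)
received = [
    Segment(flow_a, 1000, b"aaaa"),
    Segment(flow_a, 1004, b"bbbb"),
    Segment(flow_b, 50, b"x"),
    Segment(flow_a, 1008, b"cc"),
]
for seg in aggregate(received):
    print(seg.flow[0], seg.seq, len(seg.payload))
```

Four received segments become two, so the rest of the stack does its per-packet work twice instead of four times.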

2.10. Task Control Groups

There have been various proposals in the Linux arena for resource management/accounting and other task grouping subsystems in the kernel (Resgroups, User Beancounters, NSProxy cgroups, and others). Task Control Groups is the framework that is getting merged in 2.6.24 to provide the functionality that led to the creation of such proposals. TCG can track and group processes into arbitrary "cgroups" and assign arbitrary state to those groups, in order to control their behaviour. The intention is that other subsystems hook into the generic cgroup support to provide new attributes for cgroups, such as accounting/limiting the resources which processes in a cgroup can access.

For example, cpusets (see Documentation/cpusets.txt) allow you to associate a set of CPUs and a set of memory nodes with the tasks in each cgroup. The CFS group scheduling feature uses cgroups to control the CPU time that every cgroup can get. Various other resource management and virtualization/cgroup efforts can become task cgroup clients. The configuration interface is described in Documentation/cgroups.txt.
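The configuration interface is a filesystem: a group is a directory, and a task is moved into a group by writing its PID to that group's "tasks" file. A minimal sketch follows, assuming the cgroup filesystem is already mounted; the helper names, the mount point and the group name in the usage note are made up for illustration.

```python
from pathlib import Path


def create_group(cgroup_root, name):
    """Create a task group as a directory under the mounted cgroup filesystem."""
    group = Path(cgroup_root) / name
    group.mkdir(exist_ok=True)
    return group


def attach_task(group, pid):
    """Move a task into a group by writing its PID to the group's tasks file."""
    (Path(group) / "tasks").write_text(str(pid))
```

With the filesystem mounted at a hypothetical /dev/cpuctl, `attach_task(create_group("/dev/cpuctl", "multimedia"), 1234)` would move PID 1234 into a "multimedia" group. On a real mount the kernel populates the new directory with its control files; both calls need root.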

2.11. Linux Kernel Markers

You can read [http://lwn.net/Articles/245671/ this recommended article] about the "Linux Kernel Markers" feature.

The Linux Kernel Markers implement static probing points for the Linux kernel. Dynamic probing systems like kprobes/dtrace can put probes pretty much anywhere. However, the scripts that dynamic probing points use can quickly become outdated, because a small change in the kernel may require a rewrite of the script, which needs to be maintained and updated separately and will not work for all kernel versions. That's why static probing points are useful: they can be put directly into the kernel source code, so they are always in sync with kernel development. Static probing points can apparently also have some performance advantages, and they have no performance cost when they're not being used.

The kernel markers are a sort of "derivative" of the long-standing external patchset "Linux Trace Toolkit" (LTT), which has been around since [http://www.opersys.com/LTT/news.html#18-11-1999 1999]. The Kernel Markers are a feature needed for the [http://lwn.net/Articles/245671/ SystemTap] project. In this release no probing points are included, but many will certainly be included in the future, and some tracing tools like blktrace will probably be ported to this kind of infrastructure.

2.12. Read-only bind mounts

Read-only bind mounts (mount --bind) allow a read-only view into a read-write filesystem. In the process of doing that, they also provide infrastructure for keeping track of the number of writers to any given mount. This has a number of uses. It allows chroots to have parts of filesystems writable. It will be useful for containers in the future because users may have root inside a container, but should not be allowed to write to some filesystems.

It also enhances security by making sure that parts of your filesystem are read-only (such as when you don't trust your FTP server), and it helps when you don't want to have entire new filesystems mounted, or when you want atime selectively updated.

2.13. x86-32/64 arch reunification

You can read [http://lwn.net/Articles/243704/ this recommended article].

When support for the x86-64 AMD architecture was developed, it was decided to develop it as a "fork" of the traditional x86 architecture for convenience. As a result, many changes needed one patch for a file in the i386 architecture directory and another similar patch for the duplicated file in the x86_64 directory. It has been decided to unify both architectures in the same directory again.

This reunification has not been done in a radical way. In this release, both architectures have been unified in arch/x86, but only in appearance. All the source files in the i386 and x86_64 directories have been moved to arch/x86 and renamed with "_32" and "_64" suffixes. For example, arch/i386/kernel/reboot.c has been moved to arch/x86/kernel/reboot_32.c, and arch/x86_64/kernel/reboot.c has been moved to arch/x86/kernel/reboot_64.c. Makefiles have been modified accordingly. So for now the reunification has been pretty much just a relocation of all the files, done mostly with scripts, plus an adaptation of the build machinery so that everything compiles just as it would have in the old separate directories.
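The renaming rule described above is mechanical enough to sketch. The function below only models that rule for the two examples given; it is not the script the kernel developers used, its name is made up, and it ignores files that were handled differently during the move.

```python
import posixpath

# Suffix given to files coming from each of the old architecture directories.
SUFFIX_BY_ARCH = {"i386": "_32", "x86_64": "_64"}


def relocated_path(old_path):
    """Map arch/i386/... or arch/x86_64/... to its renamed location in arch/x86/."""
    parts = old_path.split("/")
    if len(parts) < 3 or parts[0] != "arch" or parts[1] not in SUFFIX_BY_ARCH:
        raise ValueError(f"not an i386/x86_64 arch path: {old_path}")
    suffix = SUFFIX_BY_ARCH[parts[1]]
    stem, ext = posixpath.splitext(parts[-1])
    # Keep the subdirectories, swap the arch directory, add the suffix.
    return "/".join(["arch", "x86", *parts[2:-1], stem + suffix + ext])


print(relocated_path("arch/i386/kernel/reboot.c"))    # arch/x86/kernel/reboot_32.c
print(relocated_path("arch/x86_64/kernel/reboot.c"))  # arch/x86/kernel/reboot_64.c
```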

In the future many of those files will be unified and shared by both architectures (e.g. reboot_32.c and reboot_64.c merged into reboot.c), and many files have already been unified in this release. Others will stay separate forever, due to the differences between the two architectures.

2.14. New drivers

Graphics

SATA/IDE

Network(wireless)

Network

Sound

MTD

USB

V4L/DVB

Hwmon

I2C

Bluetooth

3. Subsystems

3.1. Memory management

3.2. Various

3.3. Networking

3.4. Filesystems

3.5. CRYPTO

3.6. SELinux

3.7. KVM

3.8. DM

3.9. Audit

3.10. Architecture-specific changes

4. Drivers

4.1. Buses

4.2. Graphics

4.3. SATA/IDE

4.4. Networking

4.5. Sound

4.6. ACPI

4.7. MTD

4.8. Input

4.9. SCSI

4.10. USB

4.11. HID

4.12. V4L/DVB

4.13. HWMON

4.14. Cpufreq

4.15. I2C

4.16. Bluetooth

4.17. Watchdog

4.18. Various

5. Crashing soon a kernel near you

This is a list of some of the patches being developed right now in the kernel community that will be part of future Linux releases. Those features may take many months to get into Linus' git tree, or may be completely dropped. You can test the features in the -mm tree, but be warned, it can crash your machine, eat your data (unlikely but not impossible) or kidnap your family (just because it has never happened doesn't mean it can't happen).

Reading the [http://www.linux-foundation.org/en/Linux_Weather_Forecast Linux Weather Forecast page] is recommended.

KernelNewbies: Linux_2_6_24 (last edited 2007-12-05 01:14:06 by diegocalleja)