
Linux kernel version 2.6.24 Released ([http://kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.24 full SCM git log])

TableOfContents()

1. Short overview (for news sites, etc)

Linux 2.6.24 includes

2. Important things (AKA: ''the cool stuff'')

2.1. CFS improvements

Performance/size improvements

The CFS task scheduler [http://kernelnewbies.org/Linux_2_6_23#head-f3a847a5aace97932f838027c93121321a6499e7 merged in Linux 2.6.23] received [http://lkml.org/lkml/2007/9/11/395 some microoptimization work] for 2.6.24. In 2.6.23, CFS context switching was more than 10% slower than with the old task scheduler. With the optimizations done in 2.6.24, CFS is now even a bit faster than the old task scheduler (which was already quite fast). The compiled size of the scheduler has also improved: it is now a bit smaller on UP and a lot smaller on SMP.

Fair Group Scheduling

You can read [http://lwn.net/Articles/240474/ this recommended article] about the Fair Group Scheduling feature.

Another new feature in the scheduler is Fair Group Scheduling. Normally the scheduler operates on individual tasks and strives to provide fair CPU time to each task. Sometimes it may be desirable to group tasks and provide fair CPU time to each such task group instead. For example, it may be desirable to first provide fair CPU time to each user on the system and then to each task belonging to a user. In other words, given two users, one running a single CPU-bound process and the other running two CPU-bound processes, you may want to give 50% of the CPU time to the first user and his task, and 50% to the other user, shared between his two processes: 25% of the CPU time for each.

That's the kind of thing the Fair Group Scheduling feature does. At present there are two (mutually exclusive) mechanisms for grouping tasks for CPU bandwidth control: 1) group scheduling based on user ID, which is the case mentioned in the example above. This mechanism is configurable, which means you are not limited to a 50%/50% split - you can, for example, give the root user double the priority of other users. 2) group scheduling based on task control groups. This mechanism lets the administrator create arbitrary groups of tasks (e.g. "multimedia", "compiling"), set how much CPU time 'priority' to give each group by writing a value to its cpu_share file, and then attach any PID to whichever task group is wanted. Documentation on how to use these two features can be found in Documentation/sched-design-CFS.txt.
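As a rough userspace sketch (not taken from the kernel documentation), the program below shows what the control-group-based mechanism looks like: it assumes a cgroup hierarchy with the CPU controller is already mounted at /dev/cgroup (see the Task Control Groups section below for how such a hierarchy can be set up), creates a "multimedia" group, raises its share and attaches the current process to it. The mount point, group name and exact file names are assumptions and should be checked against Documentation/sched-design-CFS.txt on the running kernel.

{{{
/*
 * Sketch only: raise a task group's CPU share and attach a process to it.
 * Assumes a cgroup hierarchy with the CPU controller mounted at
 * /dev/cgroup; the group name and file names ("cpu_share", "tasks") are
 * taken from the description above and should be verified on your kernel.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	char pid[32];

	/* Create the task group (a directory in the cgroup filesystem). */
	mkdir("/dev/cgroup/multimedia", 0755);

	/* Give the group double the default share of CPU time. */
	write_str("/dev/cgroup/multimedia/cpu_share", "2048");

	/* Attach the current process to the group. */
	snprintf(pid, sizeof(pid), "%d", (int)getpid());
	write_str("/dev/cgroup/multimedia/tasks", pid);

	return 0;
}
}}}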

Guest time reporting

Additionally, the task scheduler in 2.6.24 adds a new "guest" field, after "system" and "user", in /proc/<PID>/stat, which tracks how much CPU time a task spends running a 'virtual' CPU.

2.2. New wireless drivers

In Linux 2.6.22, the new mac80211 wifi stack was [http://kernelnewbies.org/Linux_2_6_22#head-1498b990e997cc0e95dbfa9047e7ebe8d84847cc merged], but few drivers using the new stack were merged with it (only one). Linux 2.6.24 has a lot of new wireless drivers using the new stack, 2.3 MB of source files in total:

2.3. Per-device dirty thresholds

You can read [http://lwn.net/Articles/245600/ this recommended article] about the "per-device dirty thresholds" feature.

When a process writes data to disk, the data is stored temporarily in 'dirty' memory until the kernel decides to write it to the disk ('cleaning' the memory used to store the data). A process can 'dirty' memory faster than the data can be written to disk, so the kernel throttles processes when there's too much dirty memory around. The problem with this mechanism is that the dirty memory thresholds are global: the mechanism doesn't care whether there are several storage devices in the system, much less whether some of them are faster than others. There are lots of scenarios where this design harms performance. For example, if there's a very slow storage device in the system (e.g. a USB 1.0 disk, or an NFS mount over dialup), the thresholds are hit very quickly, preventing other processes that may be working on a much faster local disk from making progress. Stacked block devices (e.g. LVM/DM) are much worse and even deadlock-prone (see the LWN article).

In 2.6.24, the dirty thresholds are per-device, not global, and the limits are variable, depending on the writeout speed of each device. This greatly improves performance in many situations.
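As a rough illustration of the throttling described above (not part of the 2.6.24 changes themselves), the sketch below dirties page cache faster than a slow device is likely to write it back, while watching the global "Dirty:" counter in /proc/meminfo. The output path and sizes are arbitrary assumptions; point it at a slow device to see the writer being throttled.

{{{
/*
 * Rough illustration only: dirty page cache quickly and watch the
 * global "Dirty:" counter in /proc/meminfo grow. Path and sizes are
 * arbitrary; use a file on a slow device (e.g. a USB 1.0 disk) to see
 * the effect of writeback throttling.
 */
#include <stdio.h>
#include <string.h>

static long dirty_kb(void)
{
	char line[256];
	long kb = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "Dirty: %ld kB", &kb) == 1)
			break;
	fclose(f);
	return kb;
}

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/tmp/dirty-test";
	static char buf[1 << 20];	/* 1 MB of data per write */
	FILE *out = fopen(path, "w");
	int i;

	if (!out) {
		perror(path);
		return 1;
	}
	memset(buf, 'x', sizeof(buf));
	for (i = 0; i < 256; i++) {	/* write 256 MB in total */
		if (i % 32 == 0)
			printf("written %3d MB, Dirty: %ld kB\n", i, dirty_kb());
		fwrite(buf, 1, sizeof(buf), out);
	}
	fclose(out);
	return 0;
}
}}}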

2.4. PID and network namespaces

You can read [http://lwn.net/Articles/256389/ this recommended article] about the PID namespaces feature.

Usually, there's a single, global PID namespace for a whole Linux system: the list of processes contains all the processes running in the system. There's also a global view of the networking stack (routing tables, firewall rules, etc). However, [http://en.wikipedia.org/wiki/Operating_system-level_virtualization operating-system-level virtualization] solutions like [http://openvz.org OpenVZ] or [http://en.wikipedia.org/wiki/Linux-VServer VServer] need different views of the PID namespace and the networking stack. Linux 2.6.24 adds PID namespaces and basic support for network namespaces. They're used through the CLONE_NEWPID and CLONE_NEWNET clone() flags.
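A minimal sketch of how the new CLONE_NEWPID flag might be used from userspace follows (assuming a kernel built with PID namespace support and root privileges; error handling is kept to a minimum):

{{{
/*
 * Minimal sketch: create a child in a new PID namespace with clone().
 * Needs root and a kernel with PID namespace support enabled.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[64 * 1024];

static int child(void *arg)
{
	/* Inside the new namespace this process sees itself as PID 1. */
	printf("child sees itself as pid %d\n", (int)getpid());
	return 0;
}

int main(void)
{
	pid_t pid = clone(child, child_stack + sizeof(child_stack),
			  CLONE_NEWPID | SIGCHLD, NULL);

	if (pid == -1) {
		perror("clone");
		exit(1);
	}
	/* In the parent's namespace the child has an ordinary PID. */
	printf("parent sees the child as pid %d\n", (int)pid);
	waitpid(pid, NULL, 0);
	return 0;
}
}}}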

2.5. Task Control Groups

There have been various proposals in the Linux arena for resource management/accounting and other task-grouping subsystems in the kernel (Resgroups, User Beancounters, NSProxy cgroups, and others). Task Control Groups is the framework that was merged in 2.6.24 to provide the functionality that led to the creation of those proposals. TCG can track and group processes into arbitrary "cgroups" and assign arbitrary state to those groups, in order to control their behaviour. The intention is that other subsystems hook into the generic cgroup support to provide new attributes for cgroups, such as accounting for or limiting the resources which processes in a cgroup can access.

For example, cpusets (see Documentation/cpusets.txt) allow you to associate a set of CPUs and a set of memory nodes with the tasks in each cgroup, and the CFS group scheduling feature uses cgroups to control the CPU time that every cgroup can get. Various other resource management and virtualization/cgroup efforts can become task cgroup clients. The configuration interface is described in Documentation/cgroups.txt.
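A hedged sketch of that configuration interface is shown below: it mounts a cgroup hierarchy with the "cpu" controller attached (the kind of hierarchy the group scheduling example above assumed was already in place), creates a group and moves the current task into it. The mount point and group name are assumptions; Documentation/cgroups.txt is the authoritative description.

{{{
/*
 * Sketch of the generic cgroup configuration interface: mount a
 * hierarchy with the "cpu" controller, create a cgroup and attach the
 * current task to it. The mount point (/dev/cgroup) and group name
 * ("compiling") are assumptions made for this example.
 */
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	FILE *f;

	mkdir("/dev/cgroup", 0755);
	if (mount("cgroup", "/dev/cgroup", "cgroup", 0, "cpu") == -1)
		perror("mount");	/* may already be mounted */

	/* Create a new cgroup (a directory in the hierarchy). */
	mkdir("/dev/cgroup/compiling", 0755);

	/*
	 * Each cgroup directory has a 'tasks' file listing its members;
	 * writing a PID to it moves that task into the group.
	 */
	f = fopen("/dev/cgroup/compiling/tasks", "w");
	if (!f) {
		perror("tasks");
		return 1;
	}
	fprintf(f, "%d\n", (int)getpid());
	fclose(f);
	return 0;
}
}}}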

2.6. Linux Kernel Markers

You can read [http://lwn.net/Articles/245671/ this recommended article] about the "Linux Kernel Markers" feature.

The Linux Kernel Markers implement static probing points for the Linux kernel. Dynamic probing systems like kprobes/DTrace can put probes pretty much anywhere, but the scripts that use dynamic probing points can quickly become outdated: a small change in the kernel may force a rewrite of a script, which must be maintained and updated separately and will not work for all kernel versions. That's why static probing points are useful: they are put directly into the kernel source code and hence are always in sync with kernel development. Static probing points can also have some performance advantages - they have no performance cost when they're not being used.
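As a sketch of what such a static probing point might look like in kernel code (based on the markers API in include/linux/marker.h; the marker name, format string and surrounding function are made up for illustration):

{{{
/*
 * Illustrative sketch of placing a static marker in kernel code with
 * the 2.6.24 markers API. The marker name ("subsys_event"), format
 * string and surrounding function are hypothetical; a tracing tool
 * would register a probe on this marker to receive the arguments.
 */
#include <linux/marker.h>

static int subsys_do_work(int id, const char *name)
{
	/* Compiles to (nearly) nothing unless a probe is connected. */
	trace_mark(subsys_event, "id %d name %s", id, name);

	/* ... normal work of the function ... */
	return 0;
}
}}}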

The kernel markers are a sort of "derivative" of the long-standing external "Linux Trace Toolkit" (LTT) patchset, a feature that has been around since [http://www.opersys.com/LTT/news.html#18-11-1999 1999]. The kernel markers are also needed by the [http://lwn.net/Articles/245671/ SystemTap] project. In this release no probing points are included, but many will certainly be added in future releases, and some tracing tools like blktrace will probably be ported to this kind of infrastructure in the future.

2.7. x86-32/64 arch reunification

You can read [http://lwn.net/Articles/243704/ this recommended article].

When support for the x86-64 AMD architecture was developed, it was decided, for convenience, to develop it as a "fork" of the traditional x86 architecture. As a result, many changes needed one patch for a file in the i386 architecture directory and another, similar patch for the duplicated file in the x86_64 directory. It has now been decided to unify both architectures in the same directory again.

This reunification has not been done in a radical way. In this release, both architectures have been unified in arch/x86, but only in appearance: all the source files from the i386 and x86-64 directories have been moved to arch/x86 and renamed with "_32" and "_64" suffixes. For example, arch/i386/kernel/reboot.c has been moved to arch/x86/kernel/reboot_32.c, and arch/x86_64/kernel/reboot.c to arch/x86/kernel/reboot_64.c. Makefiles have been modified accordingly. So for now the reunification is pretty much just a relocation of all the files plus an adaptation of the build machinery, so that everything compiles just as it would have in the old, separate directories - done mostly with scripts.

In the future many of those files will be unified and shared by both architectures - e.g. reboot_32.c and reboot_64.c into reboot.c - and many files have already been unified in this release. Others will remain separate forever, due to the differences between the two architectures.
