
Linux kernel version 2.6.23 Released 9 October 2007 (full SCM git log)

During the development of 2.6.23, the 2007 edition of the Linux Kernel Developers' Summit was held on September 5 and 6 in Cambridge, UK. We can only recommend reading the excellent coverage done by LWN.

1. Short overview (for news sites, etc)

2.6.23 includes the new, better, fairer CFS process scheduler, a simpler read-ahead mechanism, the lguest 'Linux-on-Linux' paravirtualization hypervisor, Xen guest support, KVM SMP guest support, variable process argument length, SLUB as the default slab allocator, SELinux protection against exploits of null dereferences using mmap, XFS and ext4 improvements, PPP over L2TP support, the 'lumpy' reclaim algorithm, a userspace driver framework, the O_CLOEXEC file descriptor flag, splice improvements, the new fallocate() syscall, lock statistics, support for multiqueue network devices, various new drivers and many other minor features and fixes.

2. Important things (AKA: ''the cool stuff'')

2.1. The CFS process scheduler

The new process scheduler, a.k.a. CFS (Completely Fair Scheduler), has generated a lot of noise in some circles due to the way it was chosen over its competitor, RSDL. A bit of history is needed to clarify what happened and what CFS does compared to the old scheduler.

During the development of Linux 2.5, the 'O(1)' process scheduler (PS) from Ingo Molnar was merged to replace the one inherited from 2.4. The O(1) PS was designed to fix the scalability issues of the 2.4 PS - the performance improvements were so big that the O(1) PS was one of the most frequently backported features to 2.4 in commercial Linux distributions. However, the algorithms in charge of deciding which process to run were not changed that much, as they were considered 'good enough', or at least not perceived as a critical issue. But those algorithms can make a huge difference in what users perceive as 'interactivity'. For example, if one or more processes start an endless loop and, because of those CPU-bound loopers, the PS doesn't assign as much CPU as necessary to the already running non-looping processes in charge of implementing the user interface (X.org, kicker, firefox, openoffice.org, etc.), the user will perceive that the programs don't react smoothly to his actions. Or worse: in the case of music players, the music could skip.

The O(1) PS, just like the previous PSs, tried to improve those cases, and generally it did a good job most of the time. However, many users reported corner cases and not-so-corner cases where the new PS didn't work as expected. One of those users was Con Kolivas, and despite his inexperience in the kernel hacking world, he tried to fine-tune the scheduling algorithms without replacing them. His work was a success, his patches found their way into the mainline kernel, and other people (Mike Galbraith, Davide Libenzi, Nick Piggin) also helped to tweak the scheduler. But not all the corner cases disappeared, and some new ones appeared while trying to fix others. Con found that the 'interactivity estimator' - a piece of code used by the PS to try to decide which processes were more 'interactive' and hence needed more attention, so that the user would perceive his desktop as 'more interactive' - caused more problems than it solved. Contrary to its original purpose, the interactivity estimator couldn't fix all the 'interactivity' problems present in the PS, and trying to fix one would open another issue. It was the typical case of an algorithm using statistics and heuristics to try to predict the future, and failing at it. So Con designed a new PS, called RSDL, that killed the interactivity estimation code. Instead, his PS was based on the concept of 'fairness': processes are treated equally and given the same timeslices (see this LWN article for more details on this PS), and the PS doesn't care or even try to guess whether a process is CPU-bound or I/O-bound (interactive). This PS improved the perceived 'interactivity' in those corner cases as well.

This PS was the one that was going to get merged, but Ingo Molnar (the O(1) creator) wrote another new PS, called CFS (short for 'Completely Fair Scheduler'), taking as one of its basic design elements the 'fair scheduling' idea that Con's PS had proven to be superior. It was well received by some hackers, who helped Ingo (along with Mike Galbraith, Peter Zijlstra, Thomas Gleixner, Suresh Siddha, and many others) to make CFS a good PS alternative for mainline. 'Fairness' is the only idea shared between RSDL and CFS, and that's where the similarities stop; even the definition of 'fairness' is very different: RSDL uses a 'strict' definition of fairness, but CFS includes the sleep time in the task's fairness metric. This means that in CFS, sleeping tasks (the kind of tasks that usually run the code the user perceives as 'interactive', like X.org, mp3 players, etc.) do get more CPU time than running tasks (unlike RSDL, where they are treated with strict fairness), but it's all kept under control by the fairness engine. This design gets the best of both worlds - fairness and interactivity - without resorting to an interactivity estimator.

CFS has other differences compared to the old mainline scheduler and RSDL: instead of runqueues, it uses a time-ordered rbtree to build a 'timeline' of future task execution, to try to avoid the 'array switch' artifacts that both the vanilla and the RSDL PS can suffer from. It also uses nanosecond-granularity accounting and does not rely on jiffies or other HZ details; in fact, it does not have the notion of traditional 'timeslices': the slicing is decided dynamically, not statically, and there is no persistency to timeslices (i.e. timeslices are not 'given' to a task and 'used up' by a task in the traditional sense, because CFS is able to accurately track the full history of the task's execution via the nanosecond accounting). Plus, it has extensive instrumentation with CONFIG_SCHED_DEBUG=y. Because of all those changes, CFS is quite a radical rewrite of the Linux PS (~70% of its code is touched), and hence bigger than RSDL (in terms of patch size, not memory footprint: the RSDL patchset weighed 88 KB, whereas the CFS patchset weighs 290 KB). Read this LWN article for more details on the CFS design.
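
To make the 'timeline' idea a bit more concrete, here is a heavily simplified sketch - not the actual CFS code, and the struct and field names are invented for illustration - of what "pick the leftmost task of a time-ordered rbtree" means:

    /* Hypothetical sketch: tasks are keyed in a red-black tree by how much
     * CPU time they are "owed"; the scheduler always runs the leftmost one. */
    #include <linux/rbtree.h>
    #include <linux/types.h>

    struct sched_entity_sketch {
        u64 runtime_key;        /* nanoseconds of weighted runtime so far */
        struct rb_node node;    /* position in the time-ordered tree */
    };

    static struct sched_entity_sketch *pick_next(struct rb_root *timeline)
    {
        /* The leftmost node is the task that has received the least
         * (weighted) CPU time, i.e. the one treated least fairly so far. */
        struct rb_node *left = rb_first(timeline);

        if (!left)
            return NULL;
        return rb_entry(left, struct sched_entity_sketch, node);
    }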

So CFS was finally chosen over RSDL as the replacement for the current 'O(1)' PS. Surprisingly, this choice generated much noise, partly because of Con's announcement that he was quitting kernel development - although Con has publicly said that this was not the reason. The debate seems to have calmed down now, and there is no reason to think that CFS was chosen for anything but technical reasons. It must be noted that both RSDL and CFS are better schedulers than the one in mainline, and that it was Con who pioneered the idea of using the concept of 'fairness' instead of 'interactivity estimations', but that doesn't mean that CFS didn't deserve to get merged as the definitive replacement for the mainline PS; nor does it mean that RSDL isn't also a great replacement.

NOTE!: Applications that depend heavily on sched_yield()'s behaviour (e.g. many benchmarks) can see huge performance gains or losses due to the very subtle semantics of what sched_yield() should do and how CFS changes them. In those cases you should try the sysctl at /proc/sys/kernel/sched_compat_yield, which can be set to "1" (e.g. echo 1 > /proc/sys/kernel/sched_compat_yield) to switch sched_yield() to a behaviour closer to the old scheduler's. It should also be noted that CFS is available as a backport for 2.6.22, 2.6.21 and 2.6.20.

CFS code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9)

2.2. On-demand read-ahead

Click to read a recommended LWN article about on-demand read-ahead

On-demand read-ahead is an attempt to simplify the adaptive read-ahead patches. It reimplements the Linux readahead functionality, removing a lot of complexity from the current system and making it more flexible. The new system maintains the same performance for trivial sequential/random reads, improves the sysbench/OLTP MySQL benchmark by up to 8%, and improves performance under readahead thrashing by up to 3 times. There are more read-ahead patches based on this infrastructure pending, and further work could be done in this area, so expect more improvements in the future. A detailed design document and benchmarks can be found here.

Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)

2.3. fallocate()

Click to read a recommended LWN article about fallocate()

fallocate() is a new system call which allows applications to preallocate space for any file(s) in a filesystem. Applications get a guarantee of space for particular file(s) - even if the system later becomes full. Applications can also use this feature to avoid fragmentation to a certain level in many filesystems (e.g. it avoids the fragmentation that can happen in files that grow frequently) and thus get faster access speeds.

Currently, glibc provides the POSIX interface posix_fallocate(), which can be used for a similar purpose. Though it has the advantage of working today, it is quite slow (since it writes zeroes to every block that has to be preallocated). Filesystems can, without a doubt, do this much more efficiently within the kernel by implementing the proposed fallocate() system call, and this is what 2.6.23 does. It is expected that posix_fallocate() will be modified to call this new system call first and, in case the kernel/filesystem does not implement it, fall back to the current implementation of writing zeroes to the new blocks.

In 2.6.23, only ext4 and ocfs2 add support for the fallocate() interface.
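
A minimal userspace sketch of how an application can take advantage of this (the file name and size are made up; since a glibc wrapper for the new fallocate() syscall may not be available yet, the standard posix_fallocate() interface - which is expected to use the new syscall when possible - is shown):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("prealloc.dat", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Ask for 16 MiB of guaranteed space starting at offset 0. */
        int err = posix_fallocate(fd, 0, 16 * 1024 * 1024);
        if (err != 0)
            fprintf(stderr, "posix_fallocate: %s\n", strerror(err));

        close(fd);
        return err ? 1 : 0;
    }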

Code: (commit)

2.4. Virtualization: lguest and Xen

Linux already has good virtualization support thanks to paravirtualization and KVM. 2.6.23 improves the support of this trend-of-the-decade by adding lguest and Xen guest support - both of them based on the paravirt_ops infrastructure.

2.4.1. lguest

Click to read a recommended article about lguest

lguest is a simple hypervisor for running Linux on Linux (in other words, it allows running Linux - only Linux - guests) based on the paravirt_ops infrastructure. Unlike KVM, it doesn't need VT/SVM hardware. Unlike Xen, it's simply "modprobe and go". Unlike both, it's 5,000 lines and self-contained.

The goal of its author, Rusty Russell, was not to create the greatest hypervisor ever, but rather to create a simple, small (5,000 lines of code) example hypervisor to show the world how powerful the paravirt_ops infrastructure is. Performance is OK, but not great (-30% on a kernel compile), precisely because it was written to be simple; but given its hackability, it may improve soon. The author encourages people to fork it and try to create a much better hypervisor: 'Too much of the kernel is a big ball of hair. lguest is simple enough to dive into and hack, plus has some warts which scream "fork me!"'. A 64-bit version is also being worked on.

Lguest host support (CONFIG_LGUEST) can be compiled as a module (lg.ko). This is the host support: once you load it, your kernel will be able to run virtualized lguest guests. But guest kernels need lguest guest support compiled in to be able to run under an lguest host. The configuration variable that enables the guest support is CONFIG_LGUEST_GUEST, but that option is enabled automatically once you set CONFIG_LGUEST to 'y' or 'm'. This means that a kernel compiled with lguest host support also gets lguest guest support; in other words, the same kernel you use as a host can be used as a guest kernel. In order to load and run new guests, you need a userspace loader program. The instructions and the program can be found in Documentation/lguest/lguest.txt.

Code: drivers/lguest, Documentation/lguest

2.4.2. Xen

Part of the Xen support has been merged. The code included in 2.6.23 allows the kernel to boot in a paravirtualized environment under the Xen hypervisor. Support for the hypervisor itself is not included - this is only guest support: no dom0, no suspend/resume, no ballooning. It's based on the paravirt_ops infrastructure.

Code: (part 1, drivers/xen, part 2, arch/i386/xen)

2.5. Variable argument length

From a Slashdot interview with Rob Pike: "I didn't use Unix at all, really, from about 1990 until 2002, when I joined Google. (I worked entirely on Plan 9, which I still believe does a pretty good job of solving those fundamental problems.) I was surprised when I came back to Unix how many of even the little things that were annoying in 1990 continue to annoy today. In 1975, when the argument vector had to live in a 512-byte-block, the 6th Edition system would often complain, 'arg list too long'. But today, when machines have gigabytes of memory, I still see that silly message far too often. The argument list is now limited somewhere north of 100K on the Linux machines I use at work, but come on people, dynamic memory allocation is a done deal!"

While Linux is not Plan 9, in 2.6.23 Linux adds variable argument length. In theory you shouldn't frequently hit "argument list too long" errors anymore, although the patch does limit the maximum argument length to 25% of the maximum stack limit (ulimit -s).
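
A tiny userspace illustration of where the new bound roughly comes from (this just reproduces the "25% of the stack limit" arithmetic described above; it is not an exact recreation of the kernel's check):

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        /* The argument+environment space is now tied to the stack rlimit
         * rather than a small fixed buffer: roughly a quarter of it. */
        if (getrlimit(RLIMIT_STACK, &rl) == 0 && rl.rlim_cur != RLIM_INFINITY)
            printf("approx. argument space: %llu bytes\n",
                   (unsigned long long)rl.rlim_cur / 4);
        return 0;
    }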

Code: (commit)

2.6. PPP over L2TP

Linux 2.6.23 adds support for the PPP-over-L2TP socket family. L2TP (RFC 2661) is a protocol used by ISPs and enterprises to tunnel PPP traffic over UDP tunnels, and it is replacing PPTP for VPN uses. The kernel component included in 2.6.23 handles only L2TP data packets; a userland daemon handles the L2TP control protocol (tunnel and session setup). One such daemon is OpenL2TP.
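
A minimal, hedged illustration of the userspace side of the new socket family (the helper name is made up, and the tunnel/session setup normally done by a control daemon such as OpenL2TP is omitted):

    #include <sys/socket.h>
    #include <linux/if_pppox.h>

    /* Create a PPPoL2TP session socket. It would then be connect()ed with a
     * PPPoL2TP-specific sockaddr carrying the fd of the UDP socket used for
     * the tunnel plus the tunnel and session ids negotiated by the control
     * daemon. */
    int open_pppol2tp_socket(void)
    {
        return socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP);
    }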

Code: (commit 1, 2, 3) Documentation: (commit)

2.7. Autoloading of ACPI kernel modules

With Linux 2.6.23, the ACPI drivers export their device table symbols so that udev can load the modules automatically through the usual mechanisms.

Code: (commit 1, 2, 3)

2.6.23 also adds DMI/SMBIOS based module autoloading to the Linux kernel. The idea is to load laptop drivers automatically (and other drivers which cannot be autoloaded otherwise), based on the DMI system identification information of the BIOS. Right now most distros manually try to load all available laptop drivers on bootup in the hope that at least one of them loads successfully. This patch does away with all that, and uses udev to automatically load matching drivers on the right machines.
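
A hedged sketch of what the kernel side of this looks like for a hypothetical laptop driver (the vendor/product strings and table name are made up):

    #include <linux/dmi.h>
    #include <linux/module.h>

    /* Machines this driver applies to, matched against the BIOS DMI strings.
     * Exporting the table lets udev load the module only on those machines. */
    static struct dmi_system_id example_dmi_table[] = {
        {
            .ident = "Example Laptop 1000",
            .matches = {
                DMI_MATCH(DMI_SYS_VENDOR, "EXAMPLE Inc."),
                DMI_MATCH(DMI_PRODUCT_NAME, "Laptop 1000"),
            },
        },
        { }     /* terminating entry */
    };
    MODULE_DEVICE_TABLE(dmi, example_dmi_table);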

Code: (commit)

2.8. async_tx API

The async_tx API provides methods for describing a chain of asynchronous bulk memory transfers/transforms, with support for inter-transactional dependencies. It is implemented as a dmaengine client that smooths over the details of different hardware offload engine implementations. The raid5 MD engine has been converted to use the async_tx API, getting performance improvements (in the tiobench benchmark with an iop342, it shows 20-30% higher throughput for sequential writes and 40-55% gains in sequential reads to a degraded array). API documentation.

Code: (commit)

2.9. 'Lumpy' reclaim

Click to read a recommended LWN article which touches the 'lumpy' reclaim feature

High-order allocations (IOW, requests for blocks of free memory that are bigger than one memory page and must be physically contiguous) can easily fail due to memory fragmentation when there's very little free memory left: when the memory management subsystem tries to free some memory to make room for the request, it frees pages in LRU (Least Recently Used) order, and pages freed in LRU order are not necessarily contiguous - they are freed according to how recently they were used - so the allocation may still fail.

'Lumpy' reclaim modifies the reclaim algorithm to improve this situation: when it needs to free some pages, it also tries to free the pages contiguous to the first page chosen from the LRU, ignoring recency, which improves the chances of finding a contiguous block of free memory.

Code: (commit)

2.10. Movable Memory Zone

It is often known at allocation time whether a page can be migrated or not. This feature adds a flag called __GFP_MOVABLE to the memory allocator and a new mask called GFP_HIGH_MOVABLE. Allocations using __GFP_MOVABLE can either be migrated using the page migration mechanism or reclaimed by syncing with backing storage and discarding. This feature also creates a memory zone called ZONE_MOVABLE that is only usable by allocations that specify both __GFP_HIGHMEM and __GFP_MOVABLE. This has the effect of keeping all non-movable pages within a single memory partition while allowing movable allocations to be satisfied from either partition. More details in the commit links.
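
A minimal, hypothetical example of what using the new flag looks like from kernel code (the helper is made up; the real users are paths like page cache and user-memory allocations):

    #include <linux/gfp.h>

    /* Allocate a highmem page that the caller knows can later be migrated
     * or reclaimed, so the MM is allowed to place it in ZONE_MOVABLE. */
    static struct page *grab_movable_page(void)
    {
        return alloc_page(GFP_HIGHUSER | __GFP_MOVABLE);
    }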

Code: (commit 1, 2, 3, 4, 5)

2.11. UIO

Click to read a recommended LWN article about UIO

UIO is a framework that allows drivers to be implemented in userspace. This kind of thing always causes much noise due to the "monolithic vs microkernel" debate. To the surprise of many, the Linux ecosystem has actually supported userspace drivers for a long time, for the cases where it makes sense. libusb allows accessing the USB bus from userspace and implementing drivers there; this is why you don't have specific kernel drivers for, e.g., your scanner or USB digital camera: programs like sane, gphoto, gnokii, gtkam, hplip, and even some music players like rhythmbox or amarok, use libusb to access the USB bus and talk to USB devices directly. The 2D X.org drivers that you configure in your xorg.conf file are another popular example of drivers that not only run in userspace, but are also portable to other Unix operating systems (they're also an example of why userspace drivers can't avoid hanging your machine when a bug in the driver triggers a hardware hang). CUPS and programs accessing the serial port, like pppd, are yet another example of userspace programs accessing devices directly - the kernel doesn't implement any specific LPT printer or serial modem driver; those userspace programs implement the driver that knows how to talk to the printer or modem.

In other words, userspace drivers are not new. UIO is not an attempt to migrate all the Linux kernel drivers to userspace. In fact, a tiny kernel-side driver (150 lines in the sample driver, including comments, etc.) that handles the basic interrupt routine is needed as part of every UIO driver. UIO is just a simple way to create very simple, non-performance-critical drivers, which has probably been merged with a "merge it and see if something interesting happens" attitude more than anything else. For now UIO only allows creating very, very simple drivers: no DMA, no network or block drivers...
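
To give an idea of how small the kernel-side half of a UIO driver is, here is a hedged, stripped-down sketch (driver names, the IRQ number and the platform-device context are made up; the real in-tree example is the Hilscher CIF driver linked below):

    #include <linux/interrupt.h>
    #include <linux/platform_device.h>
    #include <linux/uio_driver.h>

    /* Acknowledge/mask the interrupt in the hardware here; UIO then wakes
     * up the userspace driver, which does the real work via /dev/uioX. */
    static irqreturn_t example_handler(int irq, struct uio_info *info)
    {
        return IRQ_HANDLED;
    }

    static struct uio_info example_uio_info = {
        .name    = "example_uio",
        .version = "0.1",
        .irq     = 10,              /* made-up IRQ number */
        .handler = example_handler,
    };

    static int example_probe(struct platform_device *pdev)
    {
        /* Register the device with UIO; userspace can now mmap() its
         * memory regions and block on reads to wait for interrupts. */
        return uio_register_device(&pdev->dev, &example_uio_info);
    }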

UIO Code: (commit) UIO Documentation: (commit) Sample kernel-side UIO Hilscher CIF card driver (commit)

2.12. O_CLOEXEC file descriptor flag

Click to read a recommended LWN article about the O_CLOEXEC open() flag

In multi-threaded code (or, more correctly, in all code using clone() with CLONE_FILES) there's a race condition when exec'ing (see the commit link for details). In some applications this can happen frequently. Take a web browser: one thread opens a file and another thread starts, say, an external PDF viewer. The result can even be a security issue if that open file descriptor refers to a sensitive file and the external program can somehow be tricked into using that descriptor. 2.6.23 includes the O_CLOEXEC ("close-on-exec") fd flag for open() and recvmsg() to avoid this problem.
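
A minimal example of the new flag in action (the file name is just an example):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* The descriptor is atomically marked close-on-exec at open time,
         * instead of the racy open() + fcntl(F_SETFD, FD_CLOEXEC) sequence. */
        int fd = open("/etc/passwd", O_RDONLY | O_CLOEXEC);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* If another thread fork()s and exec()s an external program right
         * now, this descriptor will not leak into it. */
        close(fd);
        return 0;
    }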

Code: (commit 1, 2)

2.13. Use splice in the sendfile() implementation

Splice is an innovative I/O method that was added in Linux 2.6.17. It is based on an in-kernel buffer that the user has control over: splice() moves data between the buffer and an arbitrary file descriptor, tee() copies the data in one buffer to another (i.e. it "duplicates" it), and vmsplice() splices data from/to user memory. Because the in-kernel buffer isn't really copied between address spaces, splice allows moving data to/from a fd without an extra copy ("zero-copy").

For the particular case of sending data from a file descriptor to a socket, there has always been the sendfile() syscall. splice(), however, is a generic mechanism, not limited to what sendfile() does; in other words, sendfile() is just a small subset of what splice() can do, and splice() obsoletes it. In Linux 2.6.23, the old sendfile() implementation is killed, but the API and its functionality are not removed: sendfile() is now implemented internally on top of the splice() machinery.

Because sendfile() is critical for many programs, especially static web servers and FTP servers, performance regressions could happen (and performance improvements too!), and the kernel hackers would really like to hear about both in linux-kernel@vger.kernel.org and/or the other usual communication channels.

In other news, 2.6.23 adds vmsplice-to-user support. It should be noted again that splice() obsoletes sendfile() in Linux, and that its mechanisms allow building further performance improvements into your software.
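
A simplified userspace sketch of the splice() mechanism itself, moving data from a file to a socket through a pipe buffer (error handling and short transfers are glossed over; file_fd and sock_fd are assumed to be already-open descriptors, and this is only an illustration of the userspace API, not of the kernel's internal sendfile() implementation):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    ssize_t send_file_chunk(int file_fd, int sock_fd, size_t len)
    {
        int pipefd[2];
        if (pipe(pipefd) < 0)
            return -1;

        /* File -> in-kernel pipe buffer (no copy through userspace). */
        ssize_t n = splice(file_fd, NULL, pipefd[1], NULL, len, SPLICE_F_MOVE);
        if (n > 0)
            /* In-kernel pipe buffer -> socket. */
            n = splice(pipefd[0], NULL, sock_fd, NULL, n, SPLICE_F_MORE);

        close(pipefd[0]);
        close(pipefd[1]);
        return n;
    }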

Code: (commit 1, 2, 3, 4, 5, 6)

2.14. XFS and EXT4 improvements

2.15. Coredump filter mask

The purpose of this feature is to control which VMAs should be dumped, based on their memory types and per-process flags, in order to avoid long system slowdowns when a number of processes that share a huge shared memory segment are dumped at the same time, or simply to avoid dumping information you don't need. Users can access the per-process flags via the /proc/<pid>/coredump_filter interface. coredump_filter represents a bitmask of memory types; if a bit is set, VMAs of the corresponding memory type are written into the core file when the process is dumped. The bitmask is inherited from the parent process when a process is created.
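
A minimal sketch of how a process might tweak its own filter (the mask value is only illustrative; the meaning of each bit is documented with the /proc documentation in the kernel tree):

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/self/coredump_filter", "w");
        if (!f) {
            perror("fopen");
            return 1;
        }

        /* Write the bitmask of memory types to include in core dumps,
         * in hex. The example value here is purely illustrative. */
        fprintf(f, "0x1\n");
        fclose(f);
        return 0;
    }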

Code: (commit 1, 2, 3, 4)

2.16. Rewrite the x86 asm setup in C

In 2.6.23 the x86 setup code, which was previously all written in assembly, is replaced with a version written in C, using the ".code16gcc" feature of binutils. The new code is vastly easier to read and debug. It should be noted that a fair number of minor bugs were found while going through this code, but new ones could also have been introduced, given how fragile this part of the kernel is. During testing, however, it has proven to be very stable.

Code:

(commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21)

2.17. New drivers

3. Subsystems

3.1. Filesystems

3.2. Networking

3.3. SELinux

3.4. Audit

3.5. KVM

3.6. Architecture-specific changes

4. Drivers

4.1. Graphics drivers

4.2. SATA/libata/IDE drivers

4.3. Network drivers

4.4. Sound drivers

4.5. SCSI drivers

4.6. V4L/DVB drivers

4.7. USB

4.8. IB/ipath drivers

4.9. Input drivers

4.10. Hwmon drivers

4.11. HID

4.12. Cpufreq

4.13. I2C

4.14. FireWire drivers

4.15. OMAP

4.16. ACPI

4.17. Watchdog

4.18. Various

5. Crashing soon a kernel near you

This is a list of some of the ongoing patches being developed by the kernel community that will be part of future Linux releases. These features may take many months to get into Linus' git tree, or may be dropped. The features are tested in the -mm tree, but be warned: it can crash your machine, eat your data (unlikely but not impossible) or kidnap your family (just because it has never happened doesn't mean you're safe):

Reading the Linux Weather Forecast page is recommended.

6. In the news
