Linux 2.6.23; not yet released
1. Short overview (for news sites, etc)
2.6.23 includes the fallocate() syscall
2. In the news
3. Important things (AKA: ''the cool stuff'')
3.1. The CFS process scheduler
The new process scheduler, a.k.a. CFS, has generated much noise in some circles because of the way it was chosen over its 'competitor', RSDL. A bit of history is needed to clarify what happened and what CFS does compared to the old scheduler.
A long time ago, during the development of Linux 2.5, the 'O(1)' process scheduler from Ingo Molnar was merged to replace the process scheduler inherited from 2.4. The O(1) scheduler was mainly designed to fix the scalability issues in the 2.4 process scheduler - the improvements were so big that the O(1) scheduler was one of the most frequently backported features to 2.4 in commercial Linux distributions. However, the algorithms in charge of deciding which process runs next did not receive as much attention - the main goal of the new scheduler was to solve the scalability issues from the ground up, whereas the scheduling policy itself was considered good enough, or at least wasn't perceived as a critical issue. Those algorithms can make a huge difference in what users perceive as 'interactivity'. For example, if one or more processes enter an endless loop and, because of those CPU-bound loopers, the process scheduler doesn't assign enough CPU time to the non-looping processes that implement the user interface (X.org, kicker, firefox, openoffice.org, etc), the user will notice that programs don't react smoothly to his actions. Worse, in the case of music players, the music could skip.
The O(1) scheduler, just like the previous scheduler, tried to handle those situations as well as possible, and in most cases both did a good job. However, many users reported corner cases and not-so-corner cases where the new scheduler didn't work as expected. One of those users was Con Kolivas, and despite his inexperience in the kernel hacking world, he tried to fine-tune the scheduling algorithms without replacing them. His work was a big success, and his patches found their way into the main kernel.
He didn't stop there. Con found that the 'interactivity estimator' - a piece of code used by the process scheduler to try to decide which processes were more 'interactive' and hence needed more attention so that the user would perceive a smoother behaviour on their desktops - caused more problems than it solved. Contrary to its original purpose, the interactivity estimator couldn't fix all the 'interactivity' problems present in the process scheduler, and trying to fix one would open another issue. It was the typical case of an algorithm using statistics to try to predict the future with heuristics, and failing at it.
Con designed a new scheduler, RSDL, that removed all the failed interactivity estimation code. Instead, his scheduler was based on the concept of fairness while conserving the 'O(1)-ness' of the mainline scheduler: processes are treated equally and are given the same timeslices (see [http://lwn.net/Articles/224865/ this LWN article for more details on this scheduler]), and the scheduler doesn't care or even try to guess whether a process is CPU-bound or I/O-bound (interactive). This scheduler improved the smoothness perceived by users to unprecedented levels.
This scheduler was the one that was going to get merged, but Ingo Molnar (the creator of the O(1) scheduler) wrote his own new scheduler, called CFS (short for 'Completely Fair Scheduler'), taking as its basic design element the 'fairness' idea that Con's scheduler had proved to be superior. CFS has some differences compared to Con's RSDL: instead of runqueues (which are used in both RSDL and the mainline O(1) scheduler), it uses a time-ordered rbtree to build a 'timeline' of future task execution, to try to avoid the 'array switch' artifacts that both the vanilla and the RSDL scheduler can suffer. It also uses nanosecond-granularity accounting and does not rely on jiffies or any other HZ detail; in fact, it does not have the notion of 'timeslices' and has no heuristics whatsoever (read [http://lwn.net/Articles/230574/ this LWN article for more details on CFS design]). CFS has been chosen over RSDL as the replacement for the current 'O(1)' scheduler - surprisingly, this choice has generated much noise. It should be noted that both RSDL and CFS are great schedulers, much better than the one in mainline, and that it was Con who pioneered the idea of using 'fairness' instead of 'interactivity estimations'. That doesn't mean that CFS didn't deserve to get merged instead of RSDL (nor the reverse, had that been the case).
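The 'timeline' idea can be illustrated with a toy model. This is only a sketch and not the kernel's code: a binary heap stands in for the time-ordered rbtree, the task names and burst lengths are hypothetical, and the real scheduler also accounts for priorities, sleeping tasks and preemption. Each step runs the task that has received the least CPU time so far and charges it, in nanoseconds, for the time it used.

```python
import heapq

def run_fair(bursts, steps):
    """Toy fair scheduler: always run the task with the least CPU time so far.

    bursts: dict mapping a (hypothetical) task name to the number of
            nanoseconds it runs each time it is picked.
    steps:  how many scheduling decisions to simulate.
    Returns the list of task names in the order they were picked.
    """
    # Entries are (runtime_ns, sequence, name). The heap is a stand-in for
    # the time-ordered rbtree; sequence only breaks ties deterministically.
    timeline = [(0, seq, name) for seq, name in enumerate(bursts)]
    heapq.heapify(timeline)
    order = []
    for _ in range(steps):
        runtime, seq, name = heapq.heappop(timeline)  # task furthest behind
        order.append(name)
        # Charge the task for the CPU time it just consumed and requeue it.
        heapq.heappush(timeline, (runtime + bursts[name], seq, name))
    return order
```

With a CPU-bound task that runs 4 ms at a time and a task that only needs 1 ms bursts, the short-burst task is picked several times for every pick of the looper. No heuristic tries to guess which of the two is 'interactive'.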
CFS code: [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5e7eaade55d53da856f0e07dc9c188f78f780192 (commit 1], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=20b8a59f2461e1be911dce2cfafefab9d22e4eee 2], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6aa645ea5f7a246702e07f29edc7075d487ae4a3 3], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=bb44e5d1c6b3b748e0facf8f516b3162009feb27 4], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=bf0f6f24a1ece8988b243aefe84ee613099a9245 5], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fa72e9e484c16f0c9aee23981917d8c8c03f0482 6], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=dd41f596cda0d7d6e4a8b139ffdfabcefdd46528 7], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=43ae34cb4cd650d1eb4460a8253a8e747ba052ac 8)],
3.2. On-demand read-ahead
Click to read a [http://lwn.net/Articles/235164/ recommended LWN article about on-demand read-ahead]
On-demand read-ahead is an attempt to simplify the [http://lwn.net/Articles/155510/ adaptive read-ahead patches]. It reimplements the Linux read-ahead functionality, removing a lot of complexity from the current system and making it more flexible. The new system maintains the same performance for trivial sequential/random reads, improves the sysbench/OLTP MySQL benchmark by up to 8%, and performs up to 3 times better under read-ahead thrashing. More read-ahead patches based on this infrastructure are waiting to be merged in future releases, and further work could be done in this area as well, so expect more improvements in the future. A detailed design document and benchmarks can be found [http://lwn.net/Articles/235181/ here].
Code: [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c743d96b6d2ff55a94df7b5ac7c74987bb9c343b (commit)]
[http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=cf914a7d656e62b9dd3e0dffe4f62b953ae6048d (commit 1], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=dc7868fcb9a73990e6f30371c1be465c436a7a7f 2], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3ea89ee86a82e9fbde37018d9b9e92a552e5fd13 3], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=122a21d11cbfda6d1e33cbc8ae9e4c4ee2f1886e 4], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5ce1110b92b31d079aa443e967f43a2294e01194 5], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f9acc8c7b35a100f3a9e0e6977f7807b0169f9a5 6], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=46fc3e7b4e7233a0ac981ac9084b55217318d04d 7], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fe3cba17c49471e99d3421e675fc8b3deaaf0b70 8], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=431a4820bfcdf7ff530e745230bafb06c9bf2d6d 9], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d77c2d7cc5126639a47d73300b40d461f2811a0f 10], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=a08a166fe77d9f9ad88ed6d06b97e73453661f89 11], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d8983910a4045fa21022cfccf76ed13eb40fd7f5 12], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f615bfca468c9b80ed2d09be5fdbaf470a32c045 13)]
3.3. fallocate()
Click to read a [http://lwn.net/Articles/240571/ recommended LWN article about fallocate()]
fallocate() is a new system call that allows applications to preallocate space for any file in a file system. Applications can use this feature to avoid fragmentation to a certain degree (for example, the fragmentation that can happen in files that grow frequently) and thus get faster access. With preallocation, applications also get a guarantee of space for particular files, even if the file system later becomes full.
Currently, glibc provides an interface called posix_fallocate() that can be used for a similar purpose. Though it has the advantage of working, it is quite slow, since it writes zeroes to each block that has to be preallocated. File systems can do this more efficiently within the kernel by implementing the fallocate() system call, and this is what 2.6.23 does. It is expected that posix_fallocate() will be modified to call the new system call first and, in case the kernel or file system does not implement it, fall back to the current implementation of writing zeroes to the new blocks.
In 2.6.23, only ext4 and ocfs2 support the fallocate() interface.
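From the application's side, preallocation might look like the sketch below. It uses Python's os.posix_fallocate(), which wraps the C library's posix_fallocate(); whether that ends up in the fallocate() system call or in the zero-writing fallback depends on the libc, kernel and file system in use. The helper name and the file name are hypothetical.

```python
import os
import tempfile

def preallocate(path, size):
    """Reserve `size` bytes for `path` and return the resulting file size."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Ask for blocks covering bytes [0, size). On success, later writes
        # to that range should not fail for lack of disk space.
        os.posix_fallocate(fd, 0, size)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)

def demo():
    """Preallocate 1 MiB in a scratch directory and report the file size."""
    with tempfile.TemporaryDirectory() as scratch:
        return preallocate(os.path.join(scratch, "data.bin"), 1 << 20)
```

The file is extended to the requested size straight away, so the application learns about a full file system at preallocation time instead of in the middle of a later write.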
Code: [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=97ac73506c0ba93f30239bb57b4cfc5d73e68a62 (commit)], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=a2df2a63407803a833f82e1fa6693826c8c9d584 (commit)], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=385820a38d5e7c70b20af4d68767b1920b1e4133 (commit)]
3.4. Two additional virtualisation solutions, Xen and lguest, merged
3.4.1. Xen merged
Xen, a virtual machine monitor (VMM) for x86-compatible computers, has been merged into 2.6.23 in a series of patches from Jeremy Fitzhardinge (http://kerneltrap.org/node/13917) [http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5ead97c84fa7d63a6a7a2f4e9f18f452bd109045 (commit)].
From a Kerneltrap comment: "just limited (no dom0, no suspend/resume, no ballooning) xen client support for i386 only".
3.4.2. lguest merged
Rusty Russell's lguest has been merged into 2.6.23. The merge comment describes the project: "lguest is a simple hypervisor for Linux on Linux. Unlike kvm it doesn't need VT/SVM hardware. Unlike Xen it's simply 'modprobe and go'." (http://kerneltrap.org/node/13916) [http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=07ad157f6e5d228be78acd5cea0291e5d0360398 (commit)].
4. Miscellaneous kernel-userland changes
4.1. open() O_CLOEXEC flag
2.6.23 adds a new O_CLOEXEC flag for open(2) (http://lwn.net/Articles/236843/) [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f23513e8d96cf5e6cf8d2ff0cb5dd6bbc33995e4 (commit)] [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4a19542e5f694cd408a32c3d9dc593ba9366e2d7 (commit)]. This flag makes it possible to avoid race conditions in multithreaded applications that do the following:
- Thread A: fd=open()
- Thread B: fork + exec
- Thread A: fcntl(fd,F_SETFD,FD_CLOEXEC)
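If Thread B's fork and exec happen between Thread A's open() and fcntl() calls, the descriptor leaks into the newly executed program. With O_CLOEXEC, Thread A drops the fcntl() call and opens the file with the flag already set, so that window no longer exists.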
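A minimal sketch of the single-step variant, written in Python for brevity (the helper names are hypothetical). Note that Python 3.4 and later already mark the descriptors they create as close-on-exec, so passing the flag explicitly is redundant there; it is spelled out only to mirror what a C program would pass to open(2).

```python
import fcntl
import os

def open_cloexec(path):
    """Open `path` read-only with close-on-exec set by open(2) itself."""
    # One system call: there is no moment at which the descriptor exists
    # without the flag, so a concurrent fork + exec cannot inherit it.
    return os.open(path, os.O_RDONLY | os.O_CLOEXEC)

def has_cloexec(fd):
    """Report whether the close-on-exec flag is set on `fd`."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
```

has_cloexec() reads the flag back with F_GETFD, which is a convenient way to check that a descriptor will not survive an exec.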