Linux 2.6.23; not yet released

[[TableOfContents()]]

== Short overview (for news sites, etc) ==
2.6.23 includes the fallocate() syscall

== In the news ==
 * http://lwn.net/Articles/236843/
 * http://lwn.net/Articles/224829/
 * http://lwn.net/Articles/211505/
 * http://lwn.net/Articles/232575/

== Important things (AKA: ''the cool stuff'') ==
=== The CFS process scheduler ===
The new process scheduler, a.k.a. CFS, has generated much noise in some circles due to the way this scheduler has been chosen over its 'competitor' RSDL. A bit of history is needed to clarify what happened and what CFS does compared to the old scheduler.

A long time ago, during the development of Linux 2.5, the 'O(1)' process scheduler from Ingo Molnar was merged to replace the process scheduler inherited from 2.4. The O(1) scheduler was mainly designed to fix the scalability issues in the 2.4 process scheduler - the improvements were so big that the O(1) scheduler was one of the most frequently backported features to 2.4 in commercial Linux distributions. However, the algorithms in charge of scheduling the processes did not receive as much attention - the main goal of the new scheduler was to solve the scalability issues from the ground up, whereas the process scheduling itself was considered good enough, or at least it wasn't perceived as a critical issue.

Those algorithms can make a huge difference in what users perceive as 'interactivity'. For example, if one or more processes start an endless loop and, because of those CPU-bound loopers, the process scheduler doesn't assign as much CPU as necessary to the already running non-looping processes that implement the user interface (X.org, kicker, firefox, openoffice.org, etc), the user will perceive that the programs don't react to his actions very smoothly. Worse, in the case of music players your music could skip.
The O(1) scheduler, just like the previous scheduler, tried to handle those situations as well as possible, and both generally did a good job in most cases. However, many users reported corner cases and not-so-corner cases where the new scheduler didn't work as expected. One of those people was Con Kolivas, and despite his inexperience in the kernel hacking world, he tried to fine-tune the scheduling algorithms without replacing them. His work was a big success, and his patches found their way into the main kernel.

He didn't stop there. Con found that the 'interactivity estimator' - a piece of code used by the process scheduler to try to decide which processes were more 'interactive' and hence needed more attention so that the user would perceive smoother behaviour on the desktop - caused more problems than it solved. Contrary to its original purpose, the interactivity estimator couldn't fix all the 'interactivity' problems present in the process scheduler, and fixing one issue would open another. It was the typical case of an algorithm using statistics and heuristics to try to predict the future, and failing at it.

Con designed a new scheduler that removed all the failed interactivity estimations. Instead, his scheduler was based on the concept of fairness while conserving the 'O(1)-ness' of the mainline scheduler: processes are treated equally and are given the same timeslices (see [http://lwn.net/Articles/224865/ this LWN article for more details on this scheduler]), and the scheduler doesn't care or even try to guess whether a process is CPU-bound or IO-bound (interactive). This scheduler improved the user's perceived smoothness to unprecedented levels.

This scheduler was the one that was going to get merged, but Ingo Molnar (the O(1) creator) created his own new scheduler, called CFS (short for 'Completely Fair Scheduler'), taking as its basic design element the 'fairness' idea that Con's scheduler had proved to be superior.
The CFS scheduler has some differences compared to Con's RSDL: instead of runqueues (which are used in both RSDL and the mainline O(1) scheduler), it uses a time-ordered rbtree to build a 'timeline' of future task execution, to try to avoid the 'array switch' artifacts that both the vanilla and the RSDL scheduler can suffer. It also uses nanosecond granularity accounting and does not rely on any jiffies or other HZ detail; in fact it does not have the notion of 'timeslices' and has no heuristics whatsoever (read [http://lwn.net/Articles/230574/ this LWN article for more details on CFS design]).

CFS has been chosen as the replacement for the current 'O(1)' scheduler over RSDL - surprisingly, this choice has generated much noise. It should be noted that both RSDL and CFS are great schedulers, much better than the one in mainline, and that it was Con who pioneered the idea of using the concept of 'fairness' over the 'interactivity estimations', but that doesn't mean that CFS didn't deserve to get merged instead of RSDL (nor the contrary, had that been the case).
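As a rough illustration of the 'timeline' idea described above, here is a minimal sketch in C. It is not the kernel's code: the names toy_task, pick_next() and account() are hypothetical, and a linear scan over an array stands in for the time-ordered rbtree that CFS actually uses. It only shows the principle that the task which has received the least CPU time runs next, with the time tracked in nanoseconds rather than handed out as fixed timeslices.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy task: a name plus the CPU time it has received so far. */
struct toy_task {
    const char *name;
    uint64_t runtime_ns;   /* accumulated CPU time, in nanoseconds */
};

/*
 * Return the index of the task that has received the least CPU time.
 * In a time-ordered tree this would be the leftmost node; a linear
 * scan is used here only to keep the sketch short.
 * Requires n >= 1. Ties are resolved in favour of the lowest index.
 */
size_t pick_next(const struct toy_task *tasks, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (tasks[i].runtime_ns < tasks[best].runtime_ns)
            best = i;
    }
    return best;
}

/* Charge the time a task actually ran; there is no fixed timeslice. */
void account(struct toy_task *t, uint64_t ran_ns)
{
    t->runtime_ns += ran_ns;
}
```

Calling pick_next() and then account() in a loop makes the tasks take turns in proportion to how little CPU each has had so far, which is the 'fairness' notion in its simplest form.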
CFS code: [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5e7eaade55d53da856f0e07dc9c188f78f780192 (commit 1], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=20b8a59f2461e1be911dce2cfafefab9d22e4eee 2], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6aa645ea5f7a246702e07f29edc7075d487ae4a3 3], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=bb44e5d1c6b3b748e0facf8f516b3162009feb27 4], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=bf0f6f24a1ece8988b243aefe84ee613099a9245 5], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fa72e9e484c16f0c9aee23981917d8c8c03f0482 6], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=dd41f596cda0d7d6e4a8b139ffdfabcefdd46528 7], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=43ae34cb4cd650d1eb4460a8253a8e747ba052ac 8)]

=== On-demand read-ahead ===
Click to read a [http://lwn.net/Articles/235164/ recommended LWN article about on-demand read-ahead]

On-demand read-ahead is an attempt to simplify the [http://lwn.net/Articles/155510/ Adaptive read-ahead patches]. On-demand readahead reimplements the Linux readahead functionality, removing a lot of complexity from the current system and making it more flexible. This new system maintains the same performance for trivial sequential/random reads, improves the sysbench/OLTP MySQL benchmark by up to 8%, and improves performance under readahead thrashing by up to 3 times. There are more read-ahead patches based on this infrastructure waiting to be merged in future releases, and further work could be done in this area as well, so expect more improvements in the future. A detailed design document and benchmarks can be found [http://lwn.net/Articles/235181/ here].
Code: [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c743d96b6d2ff55a94df7b5ac7c74987bb9c343b (commit)] [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=cf914a7d656e62b9dd3e0dffe4f62b953ae6048d (commit 1], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=dc7868fcb9a73990e6f30371c1be465c436a7a7f 2], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3ea89ee86a82e9fbde37018d9b9e92a552e5fd13 3], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=122a21d11cbfda6d1e33cbc8ae9e4c4ee2f1886e 4], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5ce1110b92b31d079aa443e967f43a2294e01194 5], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f9acc8c7b35a100f3a9e0e6977f7807b0169f9a5 6], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=46fc3e7b4e7233a0ac981ac9084b55217318d04d 7], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fe3cba17c49471e99d3421e675fc8b3deaaf0b70 8], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=431a4820bfcdf7ff530e745230bafb06c9bf2d6d 9], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d77c2d7cc5126639a47d73300b40d461f2811a0f 10], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=a08a166fe77d9f9ad88ed6d06b97e73453661f89 11], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d8983910a4045fa21022cfccf76ed13eb40fd7f5 12], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f615bfca468c9b80ed2d09be5fdbaf470a32c045 13)]

=== fallocate() ===
Click to read a [http://lwn.net/Articles/240571/ recommended LWN article about fallocate()]

fallocate() is a new system call which allows applications to preallocate
space for any file(s) in a file system. Applications can use this feature to avoid fragmentation to a certain extent (e.g. it avoids the fragmentation that can happen in files that frequently grow in size) and thus get faster access. With preallocation, applications also get a guarantee of space for particular file(s) - even if the file system later becomes full. Currently, glibc provides an interface called posix_fallocate() which can be used for a similar purpose. Though this has the advantage of working, it is quite slow (since it writes zeroes to each block that has to be preallocated). File systems can do this more efficiently within the kernel by implementing the fallocate() system call, and this is what 2.6.23 does. It is expected that posix_fallocate() will be modified to call this new system call first and, in case the kernel/filesystem does not implement it, to fall back to the current implementation of writing zeroes to the new blocks. In 2.6.23, only ext4 and ocfs2 are adding support for the fallocate() interface.

Code: [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=97ac73506c0ba93f30239bb57b4cfc5d73e68a62 (commit)], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=a2df2a63407803a833f82e1fa6693826c8c9d584 (commit)], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=385820a38d5e7c70b20af4d68767b1920b1e4133 (commit)]

=== Xen and lguest ===
Linux has good virtualization support thanks to paravirtualization and KVM. 2.6.23 improves support for the trend-of-the-decade by adding Xen and lguest support.

==== Xen ====
The Xen virtual machine monitor was recently merged into the upcoming 2.6.23 Linux kernel in a series of patches from Jeremy Fitzhardinge.
Xen is a virtual machine monitor (VMM) for x86-compatible computers (http://kerneltrap.org/node/13917) [http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5ead97c84fa7d63a6a7a2f4e9f18f452bd109045 (commit)]. From a Kerneltrap comment: "just limited (no dom0, no suspend/resume, no ballooning) xen client support for i386 only".

 * xen: virtual mmu [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3b827c1b3aadf3adb4c602d19863f2d24e7cbc18 (commit)]
 * xen: Core Xen implementation [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5ead97c84fa7d63a6a7a2f4e9f18f452bd109045 (commit)]
 * xen: Add Xen interface header files [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=a42089dd358a7673a0a23126589a9029e57c2049 (commit)]
 * xen: event channels [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e46cdb66c8fc1c8d61cfae0f219ff47ac4b9d531 (commit)]
 * xen: time implementation [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=15c84731d647c34d1491793fa6be96f5de3432eb (commit)]
 * xen: ignore RW mapping of RO pages in pagetable_init [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=9a4029fd3409eb224eb62c32d9792071382694ec (commit)]
 * xen: Implement sched_clock [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ab55028886dd1dd54585f22bf19a00eb23869340 (commit)]
 * xen: add pinned page flag [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c85b04c3749507546f6d5868976e4793e35c2ec0 (commit)]
 * xen: configuration [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e738fca8d7dffec30eeee231c38f128eed56c8c8 (commit)]
 * xen: Complete pagetable pinning [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f4f97b3ea90130520afb478cbc2918be2b6587b8 (commit)]
 * xen:
Account for stolen time [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f91a8b447b9af64f589f6e13fec7f09b5927563d (commit)]
 * xen: hack to prevent bad segment register reload [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8b84ad942b534f8faeb34b68f0f7277ea375fed0 (commit)]
 * xen: Add grant table support [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ad9a86121f5a374b48ce2924f8a9d7e94a04db27 (commit)]
 * xen: use the hvc console infrastructure for Xen console [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b536b4b9623084d86f2b1f19cb44a2d6d74f00bf (commit)]
 * xen: lazy-mmu operations [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d66bf8fcf3fce058a1cd164a7c8ee6093fdf039c (commit)]
 * xen: Add support for preemption [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f120f13ea0dbb0b0d6675683d5f6faea71277e65 (commit)]
 * xen: SMP guest support [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f87e4cac4f4e940b328d3deb5b53e642e3881f43 (commit)]
 * xen: add virtual network device driver [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0d160211965b79de989cf2d170985abeb8da5ec6 (commit)]
 * xen: handle external requests for shutdown, reboot and sysrq [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3e2b8fbeec8f005672f2a2e862fb9c26a0bafedc (commit)]
 * xen: add the Xenbus sysfs and virtual device hotplug driver [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4bac07c993d03434ea902d3d4290d9e45944b66c (commit)]
 * xen: suppress abs symbol warnings for unused reloc pointers [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=600b2fc242992e552e0b4e24c8c1f084b341f39b (commit)]
 * xen: Place vcpu_info structure into per-cpu memory
[http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=60223a326fc8fa6e90e2c3fd28ae6de4a311d731 (commit)]
 * xen: Attempt to patch inline versions of common operations [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6487673b8a858f99a5348e1078b3f5aec700f9e0 (commit)]
 * xen: add virtual block device driver. [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=9f27ee595038653ddf8bca871200d39247d6f4fc (commit)]
 * xen: machine operations [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fefa629abebe328cf6d07f99fe5796dbfc3e4981 (commit)]
 * xen: use iret directly when possible [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=9ec2b804e099e8a326369e6cccab10dee1d172ee (commit)]
 * xen: disable all non-virtual drivers [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=dfdcdd42fdf63452ddd1bed6f49ae2a35dfb5d6c (commit)]

==== lguest ====
Click to read a [http://lwn.net/Articles/218766/ recommended article about lguest]

lguest is a simple hypervisor for Linux on Linux (in other words, it runs Linux guests, and only Linux guests) based on the paravirt_ops infrastructure. Unlike kvm it doesn't need VT/SVM hardware. Unlike Xen it's simply "modprobe and go". Unlike both, it's 5000 lines and self-contained. The goal of its author, Rusty Russell, was not to create the single greatest hypervisor ever, but rather to create a simple, small (5000 lines of code) hypervisor example to show the world how powerful the paravirt_ops infrastructure is. Performance is OK, but not great (-30% on kernel compile), precisely because it was written to be simple. But given its hackability, it may improve soon. The author encourages people to fork it and try to create much better hypervisors: ''Too much of the kernel is a big ball of hair.
lguest is simple enough to dive into and hack, plus has some warts which scream "fork me!"''. A 64-bit version is also being worked on.

Lguest host support (CONFIG_LGUEST) can be compiled as a module (lg.ko). This is the host support - once you load it, your kernel will be able to run virtualized lguest guests. However, guest kernels need to be compiled with lguest guest support in order to run under the lguest host. The configuration variable that enables guest support is CONFIG_LGUEST_GUEST - but that option is enabled automatically once you set CONFIG_LGUEST to 'y' or 'm'. This means that a kernel compiled with lguest host support also gets lguest guest support. In other words, you can use the same kernel as both host and guest. In order to load and run new guests, you need a userspace loader program. The instructions and the program can be found at Documentation/lguest/lguest.txt

Code: [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3f8c4d3f82c564e5e27c6375fe17544f694359dc (commit 1], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d503e2fa5aecef99675c5a81b61321a5407bf61f 2], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5992b6dac0d23a2b51a1ccbaf8f1a2e62097b12b 3], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=07ad157f6e5d228be78acd5cea0291e5d0360398 4], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d7e28ffe6c74416b54345d6004fd0964c115b12c 5], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=709e89266b60eff444fc512400321eb02d2474eb 6], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8ca47e00690914a9e5e6c734baa37c829a2f2fa1 7], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b754416bfe9adac6468e45fba244d77f52048aeb 8],
[http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f938d2c892db0d80d144253d4a7b7083efdbedeb 9], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b2b47c214f4e85ce3968120d42e8b18eccb4f4e3 10], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e2c9784325490c878b7f69aeec1bed98b288bd97 11], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=dde797899ac17ebb812b7566044124d785e98dc7 12], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=bff672e630a015d5b54c8bfb16160b7edc39a57c 13], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f8f0fdcd40449d318f8dc30c1b361b0b7f54134a 14], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f56a384e98aa81065038c4e16f39ed989ccae687 15)]

== Miscellaneous kernel-userland changes ==
=== open() O_CLOEXEC flag ===
2.6.23 adds a new O_CLOEXEC flag for open(2) (http://lwn.net/Articles/236843/) [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f23513e8d96cf5e6cf8d2ff0cb5dd6bbc33995e4 (commit)] [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4a19542e5f694cd408a32c3d9dc593ba9366e2d7 (commit)]. This flag makes it possible to avoid race conditions in multithreaded applications that do the following:

 1. Thread A: fd=open()
 1. Thread B: fork + exec
 1. Thread A: fcntl(fd,F_SETFD,FD_CLOEXEC)

In this sequence Thread B's fork + exec happens before Thread A has marked the descriptor close-on-exec, so the descriptor leaks into the exec'ed program. With the new flag, Thread A drops the fcntl() call and just opens the file with O_CLOEXEC.