Linux 2.6.27 kernel released 9 October 2008.

Note: The 2008 Linux Kernel Summit was held on September 15 and 16 in Portland, Oregon, immediately prior to the Linux Plumbers Conference. LWN, as always, has [ excellent coverage of the event]. You can [ download all the papers] of the conferences in two PDF files. LWN also has [ coverage of the Linux Plumbers Conference].

Summary: 2.6.27 adds a new filesystem (UBIFS) optimized for "pure" flash-based storage devices, a lockless page cache, much improved Direct I/O scalability and performance, delayed allocation for ext4, multiqueue networking, an alternative hibernation implementation based on kexec/kdump, data integrity support in the block layer for devices that support it, a simple tracer called ftrace, an mmio tracer, sysprof support, extraction of all the in-kernel firmware to /lib/firmware, Xen support for saving/restoring VMs, improved video camera support, support for the Intel wireless 5000 series and RTL8187B network cards, a new ath9k driver for the Atheros AR5008 and AR9001 family of chipsets, more new drivers, improved support for others and many other improvements and fixes.


1. Prominent features (the cool stuff)

1.1. Lockless page cache and get_user_pages()

Recommended LWN article: [ "Toward better direct I/O scalability"], [ "The lockless page cache"]

The page cache is where the kernel keeps copies of file data in RAM, improving performance by avoiding disk I/O when the data to be read is already in memory. Each "mapping", the data structure that tracks the correspondence between a file and its pages in the page cache, is made SMP-safe by its own lock. When processes on different CPUs access different files there is no lock contention, but when they access the same file (shared libraries or shared data files, for example) they can contend on that lock. In 2.6.27, thanks to some rules on how the page cache can be used and to the use of RCU, the page cache can do lookups (i.e., "read" the page cache) without taking the mapping lock, which improves scalability. The difference is only noticeable on systems with many CPUs (a page fault speedup of 250x has been measured on a 64-way system).

Code: [;a=commit;h=47feff2c8eefe85099f87c43d3096855f0085ca0 (commit 1], [;a=commit;h=e286781d5f2e9c846e012a39653a166e9d31777d 2], [;a=commit;h=a60637c85893e7191faaafa6a72e197c24386727 3)]

Lockless get_user_pages(): get_user_pages() is a function used in direct I/O operations to pin the userspace memory that is going to be transferred. It is a complex function that requires holding the mmap_sem semaphore in the process's mm_struct, as well as the page table lock. This is a scalability problem when several processes use get_user_pages() in the same address space (for example, databases that do Direct I/O), because there will be lock contention. In 2.6.27, a new get_user_pages_fast() function has been introduced. It does the same work as get_user_pages(), but it is simplified to speed up the most common workloads that exercise those paths within the same address space; in those cases it can avoid taking the mmap_sem semaphore and the page table locks. Benchmarks showed a 10% speedup running an OLTP workload with an IBM DB2 database on a quad-core system.

Code: [;a=commit;h=21cc199baa815d7b3f1ace4be20b9558cbddc00f (commit 1], [;a=commit;h=8174c430e445a93016ef18f717fe570214fa38bf 2], [;a=commit;h=f5dd33c494a427b1d1a3b574de5c9e511c888864 3], [;a=commit;h=bc40d73c950146725e9e768e856a416ec8949065 4], [;a=commit;h=652ea695364142b2464744746beac206d050ef19 5], [;a=commit;h=30002ed2e41830ec03ec3e577ad83ac6b188f96e 6)]

1.2. Ext4: Delayed Allocation

In this release, Ext4 gains one of its most important planned features: delayed allocation, also called [ "Allocate-on-flush"]. It doesn't change the disk format in any way, but it improves performance in a wide range of workloads. This is how it works: when an application write()s data to the disk, the data is usually not written to the disk immediately; it is cached in RAM for a while. Even so, the filesystem normally allocates the necessary disk structures for it immediately. Delayed allocation consists of not allocating space for that cached data at write() time; instead, only the free space counter is updated. The on-disk blocks and structures are allocated only when the cached data is finally written to the disk, not when a process writes something (hence "delayed allocation"). This approach, used by filesystems such as XFS, btrfs, ZFS, or Reiser 4, noticeably improves performance on many workloads. It also results in better block allocation decisions: when allocation is decided at write() time, the block allocator cannot know whether more write()s will follow.
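The effect on block layout can be illustrated with a toy model. This is only a sketch and not ext4 code: the ToyDisk allocator, the function names and the one-block write pattern are all made up for illustration. It compares allocating at write() time with allocating once at flush time, when the total size of each file is known.

```python
"""Toy model: allocate-at-write versus allocate-on-flush (hypothetical names)."""


class ToyDisk:
    """Hands out block numbers sequentially, like a very naive block allocator."""

    def __init__(self):
        self.next_free = 0

    def allocate(self, count):
        start = self.next_free
        self.next_free += count
        return list(range(start, start + count))


def extents(blocks):
    """Count runs of consecutive block numbers (fewer runs = less fragmentation)."""
    runs = 0
    previous = None
    for block in blocks:
        if previous is None or block != previous + 1:
            runs += 1
        previous = block
    return runs


def immediate_allocation(writes):
    """Allocate blocks at write() time, one request at a time."""
    disk = ToyDisk()
    layout = {}
    for name, nblocks in writes:
        layout.setdefault(name, []).extend(disk.allocate(nblocks))
    return layout


def delayed_allocation(writes):
    """At write() time only count pending blocks; allocate when flushing."""
    disk = ToyDisk()
    pending = {}
    for name, nblocks in writes:
        pending[name] = pending.get(name, 0) + nblocks  # bookkeeping only
    # Flush: each file's total size is known now, so allocate it in one go.
    return {name: disk.allocate(total) for name, total in pending.items()}


# Two files written in interleaved one-block chunks.
writes = [("a", 1), ("b", 1)] * 4
eager = immediate_allocation(writes)
lazy = delayed_allocation(writes)
print({name: extents(blocks) for name, blocks in eager.items()})  # {'a': 4, 'b': 4}
print({name: extents(blocks) for name, blocks in lazy.items()})   # {'a': 1, 'b': 1}
```

In this model the interleaved writers end up with four single-block extents each when blocks are allocated eagerly, and with one contiguous extent each when allocation is deferred to the flush.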

Code: [;a=commit;h=29a814d2ee0e43c2980f33f91c1311ec06c0aa35 (commit 1], [;a=commit;h=64769240bd07f446f83660bb143bb609d8ab4910 2], [;a=commit;h=d2a1763791a634e315ec926b62829c1e88842c86 3], [;a=commit;h=cd1aac32923a9c8adcc0ae85e33c1ca0c5855838 4], [;a=commit;h=dd919b9822c5fd9fd72f95a602440130297c3857 5)]

There's also a new implementation of the default data=ordered journaling mode based on inodes, not on jbd buffer heads. Code: [;a=commit;h=c851ed540173736e60d48b53b91a16ea5c903896 (commit 1], [;a=commit;h=678aaf481496b01473b778685eca231d6784098b 2], [;a=commit;h=87c89c232c8f7b3820c33c3b9bc803e9358027da 3], [;a=commit;h=772cb7c83ba256a11c7bf99a11bef3858d23767c 4)]

1.3. Kexec jump: kexec/kdump based hibernation

Recommended LWN article: [ "Yet another approach to software suspend"]

Kexec is a Linux feature that allows loading a kernel into memory and executing it, making it possible to switch to a new kernel without going through a full reboot. This infrastructure was used to implement kdump, a kernel crash dump system: a "safe kernel" is loaded into memory as soon as the system starts, and if the running kernel crashes, the oops code kexecs into the "safe kernel", which is able to dump the memory that it is not using to the disk or somewhere else.

This infrastructure has been enhanced in 2.6.27 so that it can be used as a hibernation implementation: instead of kexec'ing a safe kernel to dump the system memory after a crash, a system can kexec to a kernel that dumps all the memory to the disk and then shuts down the system. When the system boots, the initrd can load the dumped system and restore it.

This hibernation implementation does not replace the existing hibernation implementations; it is just an alternative. It has some advantages, like not depending on ACPI. For now it only works on x86-32.

Code: [;a=commit;h=3ab83521378268044a448113c6aa9a9e245f4d2f (commit)]. [;a=commit;h=89081d17f7bb81d89fa1aa9b70f821c5cf4d39e9 (commit)]

1.4. UBIFS and OMFS

Recommended LWN article: [ "UBIFS"] [ "OMFS"]

UBIFS is a new filesystem designed to work with flash devices, developed by Nokia with the help of the University of Szeged. It's important to understand that UBIFS is very different from any traditional filesystem: UBIFS does not work with block-based devices, but with pure flash-based devices, handled by the MTD subsystem in Linux. Hence, UBIFS does not work with what many people consider flash devices, such as flash-based hard drives, SD cards, USB sticks, etc., because those devices use a block device emulation layer called FTL (Flash Translation Layer) that makes them look like traditional block-based storage devices to the outside world. UBIFS is instead designed to work with flash devices that do not have a block device emulation layer, are handled by the MTD subsystem, and present themselves to userspace as MTD devices.

UBIFS works on top of UBI volumes. UBI is an LVM-like layer that was included in [ Linux 2.6.22] and that itself works on top of MTD devices. UBIFS offers various advantages over JFFS2: faster and scalable mount times (unlike JFFS2, UBIFS does not have to scan the whole media when mounting), tolerance to unclean reboots (UBIFS is a journaling filesystem), write-back (which dramatically improves performance), and support for on-the-fly compression.

Documentation: UBIFS [ FAQ], more [ documentation]

Code: [;a=commit;h=1e51764a3c2ac05a23a22b2a95ddee4d9bffb16d (commit)], [;a=commit;h=0d7eff873caaeac84de01a1acdca983d2c7ba3fe (commit)], [;a=commit;h=e56a99d5a42dcb91e622ae7a0289d8fb2ddabffb (commit)]

OMFS stands for "Sonicblue Optimized MPEG File System support". It is the proprietary file system used by the Rio Karma music player and ReplayTV DVR. Despite the name, this filesystem is not more efficient than a standard FS for MPEG files, in fact likely the opposite is true. Code: [;a=commit;h=1b002d7b173ae7cc15ed90d3c07f6d106babc510 (commit 1], [;a=commit;h=36cc410a6799a205bfc6ccc38abd9d52f2afba64 2], [;a=commit;h=555e3775ced1d05203934fc6529bbf0560dd8733 3], [;a=commit;h=63ca8ce2a2641f9cb5f0add33ced4591681d1cd7 5], [;a=commit;h=8f09e98768c17287df076580c4cc72ac358312c6 6], [;a=commit;h=a14e4b572b0ee5c6dbe4aceb83d00b2c969324e9 7], [;a=commit;h=a3ab7155ea21aadc8a4d5687e91b3d876973185e 8)]

1.5. Block layer data integrity support

Recommended LWN article: [ "Block layer: integrity checking and lots of partitions"]

Modern filesystems feature checksumming of data and metadata to protect against data corruption. However, the corruption is only detected at read time, which could potentially be months after the data was written. At that point the original data that the application tried to write is most likely lost (if there is no data redundancy). The solution is to ensure that the disk is actually storing what the application meant it to store. Recent additions to both the SCSI family of protocols (SBC Data Integrity Field, SCC protection proposal) and SATA/T13 (External Path Protection) try to remedy this by adding support for appending integrity metadata to an I/O. The integrity metadata includes a checksum for each sector as well as an incrementing counter that ensures the individual sectors are written in the right order and, for some protection schemes, that the I/O is written to the right place on disk.
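As a rough illustration of what such per-sector metadata can look like, the sketch below builds and checks an 8-byte tuple modelled loosely on the T10 Data Integrity Field layout: a 16-bit guard tag (a CRC of the sector), a 16-bit application tag and a 32-bit reference tag. The function names are hypothetical, and using the low 32 bits of the sector number as the reference tag is an assumption made for this example; this is not the kernel's implementation.

```python
import struct

SECTOR_SIZE = 512
T10_DIF_POLY = 0x8BB7  # CRC-16 polynomial used for the T10 DIF guard tag


def crc16_t10dif(data):
    """Bitwise CRC-16 with polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def make_tuple(sector, lba, app_tag=0):
    """Pack guard tag, application tag and reference tag (big-endian, 8 bytes)."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("expected a %d-byte sector" % SECTOR_SIZE)
    return struct.pack(">HHI", crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)


def verify(sector, lba, tuple_bytes):
    """Return 'ok', or say which part of the integrity metadata disagrees."""
    guard, _app_tag, ref_tag = struct.unpack(">HHI", tuple_bytes)
    if guard != crc16_t10dif(sector):
        return "guard mismatch"      # the data changed after the tuple was made
    if ref_tag != (lba & 0xFFFFFFFF):
        return "reference mismatch"  # the data landed at the wrong sector
    return "ok"
```

With this model, a sector that is modified in flight fails the guard check, and a sector that is written to the wrong location fails the reference check, which mirrors the two kinds of protection described above.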

Code: [;a=commit;h=7ba1ba12eeef0aa7113beb16410ef8b7c748e18b (commit 1], [;a=commit;h=c1c72b59941e2f5aad4b02609d7ee7b121734b8d 2], [;a=commit;h=4469f9878059f1707f021512e6b34252c4096ee7 3], [;a=commit;h=db007fc5e20c00b356e9ffe2d0e007398c65c837 4], [;a=commit;h=511e44f42e3239a4df77b8e0e46d294d98a768ad 5], [;a=commit;h=7027ad72a689797475973c6feb5f0b673382f779 6], [;a=commit;h=e0597d70012c82e16ee152270a55d89d8bf66693 7], [;a=commit;h=af55ff675a8461da6a632320710b050af4366e0c 8], [;a=commit;h=f11f594edba7f689af9792a5673ed59d660ad371 9)]

1.6. Multiqueue networking

Recommended LWN article: [ "Multiqueue networking"]

From that article: One of the fundamental data structures in the networking subsystem is the transmit queue associated with each device [...] This is a scheme which has worked well for years, but it has run into a fundamental limitation: it does not map well to devices which have multiple transmit queues. Such devices are becoming increasingly common, especially in the wireless networking area. Devices which implement the Wireless Multimedia Extensions, for example, can have four different classes of service: video, voice, best-effort, and background. Video and voice traffic may receive higher priority within the device - it is transmitted first - and the device can also take more of the available air time for such packets. Linux 2.6.27 adds support for those devices.
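To make the idea of per-class transmit queues concrete, here is a deliberately simplified toy model. It is not kernel code: the class name is hypothetical, and the strict "drain the highest class first" order is an assumption chosen for brevity, while real devices arbitrate between classes in more subtle ways.

```python
from collections import deque

# Service classes, in the order this toy device drains them.
CLASSES = ("voice", "video", "best-effort", "background")


class ToyMultiqueueDevice:
    """A device with one transmit queue per class of service."""

    def __init__(self):
        self.queues = {name: deque() for name in CLASSES}

    def enqueue(self, packet, service_class="best-effort"):
        """Put a packet on the queue that belongs to its service class."""
        self.queues[service_class].append(packet)

    def dequeue(self):
        """Transmit from the highest-priority non-empty queue, or return None."""
        for name in CLASSES:
            if self.queues[name]:
                return self.queues[name].popleft()
        return None
```

A single shared transmit queue would send packets strictly in arrival order; with one queue per class, a voice packet queued late can still leave before bulk traffic that was queued earlier.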

Code: [;a=commit;h=e8a0464cc950972824e2e128028ae3db666ec1ed (commit)]

1.7. ftrace, sysprof support

Ftrace is a very simple function tracer, unrelated to kprobes/SystemTap, which was born in the -rt patches. It uses a compiler feature to insert a small, 5-byte no-operation (NOP) sequence at the beginning of every kernel function; that NOP sequence is then dynamically patched into a tracer call when tracing is enabled by the administrator. While tracing is disabled, the overhead of the instructions is very small and not measurable even in micro-benchmarks. Although ftrace is the function tracer, it also includes a plugin infrastructure that allows for other types of tracing. The tracers currently in ftrace include ones that trace context switches, the time it takes for a high-priority task to run after it was woken up, how long interrupts are disabled, and the time spent in preemption-off critical sections.

The interface to ftrace can be found in /debugfs/tracing and is documented in Documentation/ftrace.txt. There's also a sysprof plugin that can be used with a development version of sysprof - "svn checkout sysprof"
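Because the interface consists of plain files, it can be driven from any language. The sketch below assumes the control file names available_tracers, current_tracer and tracing_enabled as described in Documentation/ftrace.txt of this era (later kernels may differ), and it takes the tracing directory as a parameter so that no mount point is hard-coded. The helper names are hypothetical, and writing to the real files requires root.

```python
import os


def read_control(tracing_dir, name):
    """Return the stripped contents of one ftrace control file."""
    with open(os.path.join(tracing_dir, name)) as f:
        return f.read().strip()


def write_control(tracing_dir, name, value):
    """Write a value to one ftrace control file."""
    with open(os.path.join(tracing_dir, name), "w") as f:
        f.write(value + "\n")


def select_tracer(tracing_dir, tracer):
    """Activate a tracer plugin after checking that the kernel offers it."""
    available = read_control(tracing_dir, "available_tracers").split()
    if tracer not in available:
        raise ValueError("tracer %r not in %r" % (tracer, available))
    write_control(tracing_dir, "current_tracer", tracer)
    write_control(tracing_dir, "tracing_enabled", "1")
    return available
```

On a system with debugfs mounted at /debugfs, a call such as select_tracer("/debugfs/tracing", "sched_switch") would select the context switch tracer, provided that tracer is listed in available_tracers.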

Code: [;a=commit;h=7c731e0a495e25e79dc1e9e68772a67a55721a65 (commit 1], [;a=commit;h=502825282e6f79c975a644afc124432ec1744de4 2], [;a=commit;h=6e766410c4babd37bc7cd5e25009c179781742c8 3], [;a=commit;h=16444a8a40d4c7b4f6de34af0cae1f76a4f6c901 4], [;a=commit;h=bc0c38d139ec7fcd5c030aea16b008f3732e42ac 5], [;a=commit;h=1b29b01887e6032dcaf818c14999c7a39593b4e7 6], [;a=commit;h=35e8e302e5d6e32675df2fc1dd3a53dfa6630dc1 7], [;a=commit;h=352ad25aa4a189c667cb2af333948d34692a2d27 8], [;a=commit;h=81d68a96a39844853b37f20cc8282d9b65b78ef3 9], [;a=commit;h=6cd8a4bb2f97527a9ceb30bc77ea4e959c6a95e3 10], [;a=commit;h=3d0833953e1b98b79ddf491dd49229eef9baeac1 11], [;a=commit;h=b0fc494fae96a7089f3651cb451f461c7291244c 12], [;a=commit;h=4e491d14f2506b218d678935c25a7027b79178b1 13] [;a=commit;h=f06c38103ea9dbca27c3f4d77f444ddefb5477cd 14], [;a=commit;h=f984b51e0779a6dd30feedc41404013ca54e5d05 15], [;a=commit;h=014c257cce65e9d1cd2d28ec1c89a37c536b151d 16], [;a=commit;h=bd3bff9e20f454b242d979ec2f9a4dca0d5fa06f 17)]

1.8. Mmiotrace

Recommended LWN article: [ "Tracing memory-mapped I/O operations"]

Mmiotrace is a tool for trapping [ memory mapped IO] (MMIO) accesses within the kernel. Since MMIO is used by drivers, this tool can be used for debugging and especially for reverse engineering binary drivers.

Code: [;a=commit;h=8b7d89d02ef3c6a7c73d6596f28cea7632850af4 (commit)], Documentation: [;a=commit;h=c6c67c1afcce71335b18ed8769b1165c468bfb03 (commit)]

1.9. External firmware

Recommended LWN article: [ "Moving the firmware out"]

Firmware is usually compiled into each driver. For various reasons (mainly licensing), some companies do not allow their firmware to be redistributed, so some drivers have supported loading external firmware for a long time. But even when the firmware compiled and shipped with a driver is redistributable, it is not libre software, and some people think that this breaks the GPL. It also has some disadvantages for distros.

In 2.6.27, the firmware blobs have been moved from the drivers' source code to a new directory: firmware/. By default, the firmware won't be compiled in the kernel binary, or in the modules. It's installed in /lib/firmware when the user types "make modules_install", and drivers have been modified to call request_firmware() and load the firmware when they need it. There's also a configuration option that will compile the firmware files in the kernel binary image, like it was done previously.

Code: [;a=commit;h=5658c769443d543728b6c5c673dffc2df8676317 (commit 1], [;a=commit;h=4d2acfbfdf68257e846aaa355edd10fc35ba0feb 2], [;a=commit;h=d172e7f5c67f2d41f453c7aa83d3bdb405ef8ba5 3], [;a=commit;h=88ecf814c47f577248751ddbe9626d98aeef5783 4)]

1.10. Improved video camera support with the gspca driver

[ Linux 2.6.26] was a big improvement for Linux webcam support thanks to a driver that supports devices implementing the [ USB video class] specification, of which there are quite a lot. 2.6.27 includes the gspca driver, which implements support for another [ large] set of devices. With this driver, most video camera devices on the market are supported by Linux.

Code: [;a=commit;h=63eb9546dcb5e9dc39ab88a603dede8fdd18e717 (commit)], [;a=commit;h=6a7eba24e4f0ff725d33159f6265e3a79d53a833 (commit)]

1.11. Extended file descriptor system calls

Recommended LWN article: [ "Extending system calls"]

When Unix was designed, some of its interfaces didn't envision functionality that would be needed in the future. Many interfaces that create a file descriptor don't take a flags parameter, for example. That makes it impossible to create file descriptors with new properties such as close-on-exec, non-blocking, or non-sequential descriptors. Being able to do such things today is necessary, and not just for fun: it also closes a security bug that can be exploited in multithreaded apps.

To solve this issue, Linux 2.6.27 is adding a new set of interfaces and syscalls that will be used by glibc.
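One of these calls is pipe2(), which the pipe2(2) man page lists as added in Linux 2.6.27. The sketch below uses Python's os.pipe2() wrapper (Linux only) to create a pipe whose descriptors are close-on-exec and non-blocking from the moment they exist, then reads the flags back with fcntl. The helper names are hypothetical. Recent Python versions already create non-inheritable descriptors by default, so the flags are passed explicitly here only to show the interface.

```python
import fcntl
import os


def pipe_with_flags():
    """Create a pipe that is close-on-exec and non-blocking in a single call."""
    return os.pipe2(os.O_CLOEXEC | os.O_NONBLOCK)


def describe(fd):
    """Report whether a descriptor is close-on-exec and non-blocking."""
    fd_flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    status_flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    return {
        "cloexec": bool(fd_flags & fcntl.FD_CLOEXEC),
        "nonblock": bool(status_flags & os.O_NONBLOCK),
    }


read_end, write_end = pipe_with_flags()
print(describe(read_end), describe(write_end))
os.close(read_end)
os.close(write_end)
```

With the older interface, a program has to call pipe() and then fcntl() to set close-on-exec. Between those two calls, another thread can fork and exec, leaking the descriptor into the new program; setting the flag at creation time removes that window.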

Code: [;a=commit;h=a677a039be7243357d93502bff2b40850c942e2d (commit 1], [;a=commit;h=aaca0bdca573f3f51ea03139f9c7289541e7bca3 2], [;a=commit;h=c019bbc612f6633ede7ed67725cbf68de45ae8a4 3], [;a=commit;h=7d9dbca34240ebb6ff88d8a29c6c7bffd098f0c1 4], [;a=commit;h=9deb27baedb79759c3ab9435a7d8b841842d56e9 5], [;a=commit;h=b087498eb5605673b0f260a7620d91818cd72304 6], [;a=commit;h=11fcb6c14676023d0bd437841f5dcd670e7990a0 7], [;a=commit;h=a0998b50c3f0b8fdd265c63e0032f86ebe377dbf 8], [;a=commit;h=336dd1f70ff62d7dd8655228caed4c5bfc818c56 9], [;a=commit;h=ed8cae8ba01348bfd83333f4648dd807b04d7f08 10], [;a=commit;h=4006553b06306b34054529477b06b68a1c66249b 11], [;a=commit;h=99829b832997d907c30669bfd17da32151e18f04 12], [;a=commit;h=77d2720059618b9b6e827a8b73831eb6c6fad63c 13], [;a=commit;h=5fb5e04926a54bc1c22bba7ca166840f4476196f 14], [;a=commit;h=e7d476dfdf0bcfed478a207aecfdc84f81efecaf 15], [;a=commit;h=6b1ef0e60d42f2fdaec26baee8327eb156347b4f 16], [;a=commit;h=be61a86d7237dd80510615f38ae21d6e1e98660c 17], [;a=commit;h=510df2dd482496083e1c3b1a8c9b6afd5fa4c7d7 18)]

1.12. Voltage and Current Regulator

This framework is designed to provide a generic interface to voltage and current regulators. The intention is to allow systems to dynamically control regulator output in order to save power and prolong battery life. This applies to both voltage regulators (where voltage output is controllable) and current sinks (where current output is controllable). This framework is designed around SoC based devices and has also been designed against two Power Management ICs (PMICs) currently on the market, namely the Freescale MC13783 and the Wolfson WM8350; however, it is quite generic and should apply to all PMICs.

Code: [;a=commit;h=571a354b1542a274d88617e1f6703f3fe7a517f1 (commit 1], [;a=commit;h=e2ce4eaa76214f65a3f328ec5b45c30248115768 2], [;a=commit;h=414c70cb91c445ec813b61e16fe4882807e40240 3], [;a=commit;h=48d335ba3164ce99cb8847513d0e3b6ee604eb20 4], [;a=commit;h=4b74ff6512492dedea353f89d9b56cb715df0d7f 5], [;a=commit;h=4c1184e85cb381121a5273ea20ad31ca3faa0a4f 6], [;a=commit;h=c080909eef2b3e7fba70f57cde3264fba95bdf09 7], [;a=commit;h=6392776d262fcd290616ff5e4246ee95b22c13f0 8], [;a=commit;h=8e6f0848be83c5c406ed73a6d7b4bfbf87880eec 9], [;a=commit;h=ba7e4763437561763b6cca14a41f1d2a7def23e2 10], [;a=commit;h=e7d0fe340557b202dc00135ab3cc877db794a01f 11], [;a=commit;h=e8695ebe5568921c41c269f4434e17590735865c 12], [;a=commit;h=e941d0ce532daf8d8610b2495c06f787fd587b85 13], [;a=commit;h=0eb5d5ab3ec99bfd22ff16797d95835369ffb25b 14)]

2. Architecture-specific changes

3. Core

4. Crypto

5. Security

6. Networking

7. Filesystems

8. Drivers

8.1. Graphics


8.3. Network

8.4. SCSI

8.5. Sound

8.6. V4L/DVB

8.7. Input

8.8. USB

8.9. FireWire

8.10. MTD

8.11. RTC


8.13. Bluetooth

8.14. I2C

8.15. Infiniband/RDMA

8.16. MMC

8.17. HWMON

8.18. ACPI

8.19. Various

9. The Linux Kernel in the news


KernelNewbies: Linux_2_6_27 (last edited 2008-10-10 02:44:17 by ZiyadAlBATLY)