The Linux 2.6.27 kernel was released on 9 October 2008.

Note: The 2008 Linux Kernel Summit was held on September 15 and 16 in Portland, Oregon, immediately prior to the Linux Plumbers Conference. LWN, as always, has excellent coverage of the event. All the papers of the Linux Symposium 2008 are also available for download in two PDF files. LWN also has coverage of the Linux Plumbers Conference.

Summary: 2.6.27 adds a new filesystem (UBIFS) optimized for "pure" flash-based storage devices, a lockless page cache, much improved Direct I/O scalability and performance, delayed allocation for ext4, multiqueue networking, an alternative hibernation implementation based on kexec/kdump, data integrity support in the block layer for devices that support it, a simple tracer called ftrace, an MMIO tracer, sysprof support, extraction of all the in-kernel firmware to /lib/firmware, Xen support for saving/restoring VMs, improved video camera support, support for the Intel wireless 5000 series and RTL8187B network cards, a new ath9k driver for the Atheros AR5008 and AR9001 family of chipsets, more new drivers, improved support for others and many other improvements and fixes.

1. Prominent features (the cool stuff)

1.1. Lockless page cache and get_user_pages()

Recommended LWN articles: "Toward better direct I/O scalability", "The lockless page cache"

The page cache is where the kernel keeps copies of file data in RAM, improving performance by avoiding disk I/O when the data to be read is already in RAM. Each "mapping", the data structure that keeps track of the correspondence between a file and the page cache, is SMP-safe thanks to its own lock. So when processes on different CPUs access different files there is no lock contention, but if they access the same file (shared libraries or shared data files, for example) they can contend on that lock. In 2.6.27, thanks to some rules on how the page cache can be used and the use of RCU, the page cache can do lookups (i.e., "read" the page cache) without taking the mapping lock, which improves scalability. The difference is only noticeable on systems with many CPUs (a page fault speedup of 250x has been measured on a 64-way system).

Code: (commit 1, 2, 3)

Lockless get_user_pages(): get_user_pages() is a function used in direct I/O operations to pin the userspace memory that is going to be transferred. It is a complex function that requires holding the mmap_sem semaphore in the process's mm_struct, as well as the page table lock. This is a scalability problem when several processes use get_user_pages() in the same address space (for example, databases that do Direct I/O), because there will be lock contention. In 2.6.27, a new get_user_pages_fast() function has been introduced. It does the same work as get_user_pages(), but it is simplified to speed up the most common workloads that exercise those paths within the same address space, and in those cases it can avoid taking the mmap_sem semaphore and the page table locks. Benchmarks showed a 10% speedup running an OLTP workload with an IBM DB2 database on a quad-core system.

Code: (commit 1, 2, 3, 4, 5, 6)

1.2. Ext4: Delayed Allocation

In this release, Ext4 is adding one of its most important planned features: Delayed allocation (also called "Allocate-on-flush"). It doesn't change the disk format in any way, but it improves the performance in a wide range of workloads.

When an application write()s data to the disk, the data is usually not written immediately but is instead cached in RAM for a while. Without delayed allocation, the filesystem allocates the necessary disk structures immediately, even though the data itself is not yet written. Delayed allocation consists of not allocating space for that cached data: only the free space counter is updated when write() is called. On-disk blocks and structures are allocated only when the cached data is finally written to the disk, not when a process writes something. This approach (used by filesystems such as XFS, btrfs, ZFS and Reiser4) noticeably improves the performance of many workloads. It also results in better block allocation decisions, because when allocation decisions are made at write() time, the block allocator cannot know whether any other write()s are going to follow.
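The effect on block placement can be illustrated with a toy model. The following is only a sketch, not ext4's allocator: the ToyDisk class, the one-block-per-write() assumption and all function names are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


class ToyDisk:
    """A toy block device: blocks are handed out in increasing order."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.next_free = 0   # next unallocated block number
        self.reserved = 0    # blocks promised to cached data, not yet placed

    def allocate(self, count):
        """Allocate `count` contiguous blocks and return their numbers."""
        if self.next_free + count > self.total_blocks:
            raise OSError("no space left on toy disk")
        start = self.next_free
        self.next_free += count
        return list(range(start, start + count))


def write_immediate(disk, writes):
    """Allocate one block at write() time, in the order the writes arrive."""
    layout = defaultdict(list)
    for name in writes:
        layout[name] += disk.allocate(1)
    return dict(layout)


def write_delayed(disk, writes):
    """Only reserve space at write() time; place blocks when the cache is flushed."""
    pending = defaultdict(int)
    for name in writes:
        if disk.next_free + disk.reserved + 1 > disk.total_blocks:
            raise OSError("no space left on toy disk")
        disk.reserved += 1   # just update the free space counter
        pending[name] += 1   # the data stays in the (simulated) page cache
    layout = {}
    for name, count in pending.items():  # flush: the full size is now known
        disk.reserved -= count
        layout[name] = disk.allocate(count)
    return layout


def extents(blocks):
    """Count runs of consecutive block numbers (1 means fully contiguous)."""
    return 1 + sum(1 for a, b in zip(blocks, blocks[1:]) if b != a + 1)


# Two processes appending to different files, interleaved in time.
writes = ["a", "b"] * 4
immediate = write_immediate(ToyDisk(64), writes)
delayed = write_delayed(ToyDisk(64), writes)
print(extents(immediate["a"]), extents(delayed["a"]))  # 4 1
```

In this model, allocating at write() time interleaves the blocks of the two files, while allocating at flush time gives each file a single contiguous run, because the allocator then knows how much each file needs.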

Code: (commit 1, 2, 3, 4, 5)

There is also a new implementation of the default data=ordered journaling mode based on inodes, not on jbd buffer heads. Code: (commit 1, 2, 3, 4)

1.3. Kexec jump: kexec/kdump based hibernation

Recommended LWN article: "Yet another approach to software suspend"

Kexec is a Linux feature that allows loading a kernel into memory and executing it, making it possible to switch to a new kernel without going through a full reboot. This infrastructure was used to implement kdump, a kernel crash dump system: a "safe kernel" is loaded into memory as soon as the system starts, and if the running kernel crashes, the oops code kexecs into the "safe kernel", which is able to dump the memory that it is not using to the disk or somewhere else.

This infrastructure has been enhanced in 2.6.27 so that it can be used as a hibernation implementation: instead of kexec'ing a safe kernel to dump the system memory, a system can kexec to a kernel that will dump all the memory to the disk and then shut down the system. When the system boots, the initrd can load the dumped system and restore it.

This hibernation implementation does not replace the existing hibernation implementations; it is just an alternative. It has some advantages, like not depending on ACPI. For now it only works on x86-32.

Code: (commit), (commit)

1.4. UBIFS and OMFS

Recommended LWN articles: "UBIFS", "OMFS"

UBIFS is a new filesystem designed to work with flash devices, developed by Nokia with the help of the University of Szeged. It's important to understand that UBIFS is very different from any traditional filesystem: UBIFS does not work with block-based devices, but with pure flash-based devices, handled by the MTD subsystem in Linux. Hence, UBIFS does not work with what many people consider flash devices, such as flash-based hard drives, SD cards, USB sticks, etc., because those devices use a block device emulation layer called FTL (Flash Translation Layer) that makes them look like traditional block-based storage devices to the outside world. UBIFS is instead designed to work with flash devices that do not have a block device emulation layer, are handled by the MTD subsystem, and present themselves to userspace as MTD devices.

UBIFS works on top of UBI volumes. UBI is an LVM-like layer, included in Linux 2.6.22, which itself works on top of MTD devices. UBIFS offers various advantages over JFFS2: faster and scalable mount times (unlike JFFS2, UBIFS does not have to scan the whole media when mounting), tolerance to unclean reboots (UBIFS is a journaling filesystem), write-back (which dramatically improves performance), and support for on-the-fly compression.

Documentation: UBIFS FAQ, more documentation

Code: (commit), (commit), (commit)

OMFS stands for "Sonicblue Optimized MPEG File System support". It is the proprietary file system used by the Rio Karma music player and ReplayTV DVR. Despite the name, this filesystem is not more efficient than a standard filesystem for MPEG files; in fact, the opposite is likely true. Code: (commit 1, 2, 3, 5, 6, 7, 8)

1.5. Block layer data integrity support

Recommended LWN article: "Block layer: integrity checking and lots of partitions"

Modern filesystems feature checksumming of data and metadata to protect against data corruption. However, the corruption is detected at read time, which could potentially be months after the data was written. At that point the original data that the application tried to write is most likely lost (if there is no data redundancy). The solution is to ensure that the disk is actually storing what the application meant it to. Recent additions to both the SCSI family protocols (SBC Data Integrity Field, SCC protection proposal) as well as SATA/T13 (External Path Protection) try to remedy this by adding support for appending integrity metadata to an I/O. The integrity metadata includes a checksum for each sector as well as an incrementing counter that ensures the individual sectors are written in the right order and, for some protection schemes, that the I/O is written to the right place on disk.
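As a rough illustration of per-sector integrity metadata, the sketch below attaches an 8-byte tuple to each 512-byte sector. The layout (a 16-bit guard checksum, a 16-bit application tag and a 32-bit reference tag derived from the sector number) and the CRC polynomial 0x8BB7 are assumptions modeled on the SCSI Data Integrity Field, not a description of the kernel's block layer code; the function names are hypothetical.

```python
import struct

SECTOR_SIZE = 512


def crc16_guard(data):
    """Bitwise CRC-16 with polynomial 0x8BB7 (assumed here as the guard checksum)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7 if crc & 0x8000 else crc << 1) & 0xFFFF
    return crc


def protect(data, first_lba, app_tag=0):
    """Return one 8-byte integrity tuple per sector: guard, app tag, reference tag."""
    tuples = []
    for offset in range(0, len(data), SECTOR_SIZE):
        sector = data[offset:offset + SECTOR_SIZE]
        ref_tag = (first_lba + offset // SECTOR_SIZE) & 0xFFFFFFFF
        tuples.append(struct.pack(">HHI", crc16_guard(sector), app_tag, ref_tag))
    return tuples


def verify(data, tuples, first_lba):
    """Return a list of (sector index, reason) for every failed check."""
    errors = []
    for n, meta in enumerate(tuples):
        guard, _app_tag, ref_tag = struct.unpack(">HHI", meta)
        sector = data[n * SECTOR_SIZE:(n + 1) * SECTOR_SIZE]
        if guard != crc16_guard(sector):
            errors.append((n, "guard"))       # the data changed after it was protected
        if ref_tag != (first_lba + n) & 0xFFFFFFFF:
            errors.append((n, "reference"))   # wrong order or wrong place on disk
    return errors


payload = bytes(range(256)) * 4              # two 512-byte sectors
meta = protect(payload, first_lba=1000)
corrupted = b"\xff" + payload[1:]            # flip the first byte of sector 0

print(verify(payload, meta, first_lba=1000))    # []
print(verify(corrupted, meta, first_lba=1000))  # [(0, 'guard')]
print(verify(payload, meta, first_lba=2000))    # [(0, 'reference'), (1, 'reference')]
```

The point of the exercise is that the check can be repeated at every hop between the application and the disk, so a corrupted or misdirected write is rejected when it happens rather than discovered at read time.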

Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9)

1.6. Multiqueue networking

Recommended LWN article: "Multiqueue networking"

From that article: One of the fundamental data structures in the networking subsystem is the transmit queue associated with each device [...] This is a scheme which has worked well for years, but it has run into a fundamental limitation: it does not map well to devices which have multiple transmit queues. Such devices are becoming increasingly common, especially in the wireless networking area. Devices which implement the Wireless Multimedia Extensions, for example, can have four different classes of service: video, voice, best-effort, and background. Video and voice traffic may receive higher priority within the device - it is transmitted first - and the device can also take more of the available air time for such packets. Linux 2.6.27 adds support for those devices.
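The idea of one transmit queue per class of service can be sketched with a toy device. Everything here is hypothetical: the class and method names, the strict priority ordering and the fallback to best-effort are assumptions made for illustration, not the kernel's data structures or scheduling policy.

```python
from collections import deque

# Service classes named in the text, highest priority first (ordering is an assumption).
CLASSES = ("voice", "video", "best-effort", "background")


class ToyMultiqueueDevice:
    """A toy network device with one transmit queue per service class."""

    def __init__(self):
        self.tx_queues = {name: deque() for name in CLASSES}

    def select_queue(self, packet):
        """Pick a queue from the packet's class; unknown classes go to best-effort."""
        cls = packet.get("class")
        return cls if cls in self.tx_queues else "best-effort"

    def enqueue(self, packet):
        self.tx_queues[self.select_queue(packet)].append(packet)

    def transmit_next(self):
        """Send from the highest-priority non-empty queue, or return None."""
        for name in CLASSES:
            if self.tx_queues[name]:
                return self.tx_queues[name].popleft()
        return None


dev = ToyMultiqueueDevice()
for pkt in ({"id": 1, "class": "background"},
            {"id": 2, "class": "voice"},
            {"id": 3},
            {"id": 4, "class": "video"}):
    dev.enqueue(pkt)

order = []
pkt = dev.transmit_next()
while pkt is not None:
    order.append(pkt["id"])
    pkt = dev.transmit_next()
print(order)  # [2, 4, 3, 1]
```

With a single transmit queue the packets would leave in arrival order (1, 2, 3, 4); with per-class queues the voice and video packets overtake the others.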

Code: (commit)

1.7. ftrace, sysprof support

Ftrace is a very simple function tracer (unrelated to kprobes/SystemTap) which was born in the -rt patches. It uses a compiler feature to insert a small, 5-byte No-Operation instruction at the beginning of every kernel function; this NOP sequence is then dynamically patched into a tracer call when tracing is enabled by the administrator. If it is disabled, the overhead of the instructions is very small and not measurable even in micro-benchmarks. Although ftrace is the function tracer, it also includes a plugin infrastructure that allows for other types of tracing. The tracers currently in ftrace include one that traces context switches, and others that measure the time it takes for a high priority task to run after it was woken up, how long interrupts are disabled, and the time spent in preemption-off critical sections.
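The patching idea can be modeled loosely in userspace. The sketch below is an analogy only: ToyTracer and its methods are hypothetical, and a Python function call stands in for the NOP, so unlike the kernel mechanism the disabled case is not free here.

```python
import functools


def _nop(name):
    """Stand-in for the NOP placed at function entry: does nothing."""


class ToyTracer:
    """Toy model of patching function-entry NOPs into tracer calls."""

    def __init__(self):
        self.entry_hook = {}  # function name -> callable run at function entry
        self.log = []         # names of traced function entries, in call order

    def instrument(self, func):
        """Give `func` an entry slot that initially holds the no-op."""
        name = func.__name__
        self.entry_hook[name] = _nop

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            self.entry_hook[name](name)  # no-op unless tracing is enabled
            return func(*args, **kwargs)

        return wrapper

    def enable(self):
        """'Patch' every entry slot so that it calls the tracer."""
        for name in self.entry_hook:
            self.entry_hook[name] = self.log.append

    def disable(self):
        """Restore the no-op in every entry slot."""
        for name in self.entry_hook:
            self.entry_hook[name] = _nop


tracer = ToyTracer()


@tracer.instrument
def schedule():
    return "picked next task"


@tracer.instrument
def do_fork():
    return schedule()


do_fork()          # tracing off: nothing is recorded
tracer.enable()
do_fork()          # tracing on: both function entries are recorded
tracer.disable()
do_fork()          # tracing off again
print(tracer.log)  # ['do_fork', 'schedule']
```

Only the call made while tracing was enabled shows up in the log, which mirrors how the kernel functions run untraced until the administrator switches the tracer on.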

The interface to access ftrace can be found in /debugfs/tracing and is documented in Documentation/ftrace.txt. There's also a sysprof plugin that can be used with a development version of sysprof ("svn checkout sysprof").

Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)

1.8. Mmiotrace

Recommended LWN article: "Tracing memory-mapped I/O operations"

Mmiotrace is a tool for trapping memory mapped IO (MMIO) accesses within the kernel. Since MMIO is used by drivers, this tool can be used for debugging and especially for reverse engineering binary drivers.

Code: (commit), Documentation: (commit)

1.9. External firmware

Recommended LWN article: "Moving the firmware out"

Firmware is usually compiled with each driver. For some reasons (mainly licensing reasons), some companies do not allow their firmware to be distributed, and some drivers have supported loading external firmware for a long time. But even when the firmware compiled and shipped with a driver is redistributable, it is not libre software, and some people think that this breaks the GPL. It also has some disadvantages for distros.

In 2.6.27, the firmware blobs have been moved from the drivers' source code to a new directory: firmware/. By default, the firmware won't be compiled in the kernel binary, or in the modules. It's installed in /lib/firmware when the user types "make modules_install", and drivers have been modified to call request_firmware() and load the firmware when they need it. There's also a configuration option that will compile the firmware files in the kernel binary image, like it was done previously.

Code: (commit 1, 2, 3, 4)

1.10. Improved video camera support with the gspca driver

Linux 2.6.26 brought a big improvement to Linux webcam support thanks to a driver that supports devices implementing the USB video class specification, of which there are many. 2.6.27 includes the gspca driver, which implements support for another large set of devices. With this driver, most video camera devices on the market are supported by Linux.

Code: (commit), (commit)

1.11. Extended file descriptor system calls

Recommended LWN article: "Extending system calls"

When Unix was designed, some of its interfaces didn't envision functionality that would be needed in the future. Many interfaces that create a file descriptor don't take a flags parameter, for example. That makes it impossible to create file descriptors with new properties such as close-on-exec, non-blocking, or non-sequential descriptors. Being able to do such things today is necessary, and not just for fun: it also closes a security bug that can be exploited in multithreaded apps.

To solve this issue, Linux 2.6.27 is adding a new set of interfaces and syscalls that will be used by glibc.
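The difference is easiest to see with pipes. The sketch below assumes a Linux system and uses Python's os.pipe2() wrapper around the pipe2() system call, which the Linux man pages list as available since 2.6.27; the helper function names are made up for this example.

```python
import fcntl
import os


def make_pipe_old_way():
    """Create a pipe, then set close-on-exec in a second, separate step."""
    r, w = os.pipe2(0)  # flags == 0 behaves like the classic pipe()
    for fd in (r, w):
        flags = fcntl.fcntl(fd, fcntl.F_GETFD)
        # Race window: another thread may fork() and exec() before this call,
        # leaking the descriptor into the new program.
        fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
    return r, w


def make_pipe_new_way():
    """Create a pipe that is close-on-exec and non-blocking from the start."""
    return os.pipe2(os.O_CLOEXEC | os.O_NONBLOCK)


def is_cloexec(fd):
    """True if the descriptor will be closed automatically on exec()."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def is_nonblocking(fd):
    """True if reads and writes on the descriptor will not block."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFL) & os.O_NONBLOCK)


r, w = make_pipe_new_way()
print(is_cloexec(r), is_cloexec(w), is_nonblocking(r))  # True True True
os.close(r)
os.close(w)
```

Both helpers end up with close-on-exec descriptors, but only the second one does so atomically, which is what closes the window that a concurrent fork() and exec() in another thread could exploit.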

Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)

1.12. Voltage and Current Regulator

This framework is designed to provide a generic interface to voltage and current regulators. The intention is to allow systems to dynamically control regulator output in order to save power and prolong battery life. This applies to both voltage regulators (where voltage output is controllable) and current sinks (where current output is controllable). This framework is designed around SoC based devices and has also been designed against two Power Management ICs (PMICs) currently on the market, namely the Freescale MC13783 and the Wolfson WM8350. However, it is quite generic and should apply to all PMICs.

Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)

2. Architecture-specific changes

3. Core

4. Crypto

5. Security

6. Networking

7. Filesystems

8. Drivers

8.1. Graphics


8.3. Network

8.4. SCSI

8.5. Sound

8.6. V4L/DVB

8.7. Input

8.8. USB

8.9. FireWire

8.10. MTD

8.11. RTC


8.13. Bluetooth

8.14. I2C

8.15. Infiniband/RDMA

8.16. MMC

8.17. HWMON

8.18. ACPI

8.19. Various

9. The Linux Kernel in the news


KernelNewbies: Linux_2_6_27 (last edited 2017-12-30 01:29:54 by localhost)