KernelNewbies:

Linux 2.6.32 was released on December 3rd, 2009.

Summary: This version adds virtualization memory de-duplication, a rewrite of the writeback code which provides noticeable performance speedups, many important Btrfs improvements and speedups, ATI R600/R700 3D and KMS support and other graphic improvements, a CFQ low latency mode, tracing improvements including a "perf timechart" tool that tries to be a better bootchart, soft limits in the memory controller, support for the S+Core architecture, support for Intel Moorestown and its new firmware interface, run time power management support, and many other improvements and new drivers.

1. Prominent features (the cool stuff)

1.1. Per-backing-device based writeback

Slides from Jens Axboe: 'Per backing device writeback'

Recommended LWN articles: 'Flushing out pdflush', 'In defense of per-BDI writeback'

"Writeback" in the context of the Linux kernel can be defined as the process of writing "dirty" memory from the page cache to the disk. The amount of data that needs to be written can be huge - hundreds of MB, or even GB - and the work is done by the well-known "pdflush" kernel threads when the amount of dirty memory surpasses the limits set in /proc/sys/vm. The current pdflush design has deficiencies (described in the links above) that cause poor performance and seekiness in some situations, especially on systems with multiple storage devices that need to write large chunks of data to disk. A new flushing system has been designed by Jens Axboe (Oracle), which centers on the idea of having a dedicated kernel thread for flushing the dirty memory of each storage device. The "pdflush" threads are gone and have been replaced with others named after "flush-MAJOR" (the threads are created when there's flushing work that needs to be done and will disappear after a while if there's nothing to do).

The new system has much better performance in several workloads. In a benchmark with two processes doing streaming writes to a 32 GB file on 5 SATA drives combined into an LVM stripe set, XFS was 40% faster and Btrfs 26% faster. A sample ffsb workload that does random writes to files was found to be about 8% faster on a simple SATA drive during the benchmark phase. File layout is much smoother on the vmstat stats. An SSD-based writeback test on XFS performs over 20% better as well, with the throughput being very stable around 1GB/sec, where pdflush only manages 750MB/sec and fluctuates wildly while doing so. Random buffered writes to many files behave a lot better as well, as do random mmap'ed writes. A streaming vs random writer benchmark went from a few MB/s to ~120 MB/s. In short, performance improves in many important workloads.

Code: (commit 1, 2, 3, 4, 5, 6, 7)

1.2. Btrfs improvements

Recommended LWN article: 'A Btrfs update'

1.3. Kernel Samepage Merging (memory deduplication)

Recommended LWN articles: '/dev/ksm: dynamic memory sharing', 'KSM tries again'

Kernel Samepage Merging, aka KSM (also known as Kernel Shared Memory in the past), is a memory de-duplication implementation.

Modern operating systems already use memory sharing extensively: for example, a forked process initially shares all its memory with its parent, shared libraries are mapped once for many processes, etc. Virtualization, however, can't benefit easily from memory sharing. Even when all the VMs are running the same OS with the same kernel and libraries, the host kernel can't know that a lot of those pages are identical and can be shared. KSM allows those pages to be shared. The KSM kernel daemon, ksmd, periodically scans areas of user memory, looking for pages of identical content which can be replaced by a single write-protected page (which is automatically COW'ed if a process wants to update it). Not all the memory is scanned: the areas to look for candidates for merging are specified by userspace apps using madvise(2): madvise(addr, length, MADV_MERGEABLE).
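The madvise(2) opt-in described above can be sketched from userspace. This is a minimal illustration, not taken from the release notes: it assumes Python 3.8 or later on Linux, and the helper name mark_mergeable is hypothetical. The mapping is private and anonymous, since that is the kind of memory KSM considers for merging.

```python
import mmap


def mark_mergeable(length=16 * mmap.PAGESIZE):
    """Map private anonymous memory and ask KSM to consider it for merging.

    Returns (region, ok); ok is False when the hint is unavailable or rejected.
    """
    # fileno -1 requests an anonymous mapping; MAP_PRIVATE makes it private.
    region = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE)
    # Fill every page with identical content so there is something to merge.
    region[:] = b"A" * length
    advice = getattr(mmap, "MADV_MERGEABLE", None)  # Linux-only constant
    if advice is None or not hasattr(region, "madvise"):
        return region, False
    try:
        region.madvise(advice)  # same as madvise(addr, length, MADV_MERGEABLE)
        return region, True
    except OSError:
        # For example EINVAL when the kernel was built without KSM support.
        return region, False
```

Marking a region only makes it a candidate: nothing is merged until ksmd is running and has scanned the pages.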

The result is a dramatic decrease in memory usage in virtualization environments. On a virtualization server, Red Hat found that thanks to KSM, KVM can run as many as 52 Windows XP VMs with 1 GB of RAM each on a server with just 16 GB of RAM. Because KSM works transparently to userspace apps, it can be adopted very easily, and provides huge memory savings for free to current production systems. It was originally developed for use with KVM, but it can also be used with any other virtualization system - or even in non-virtualization workloads, for example applications that for some reason have several processes using lots of memory that could be shared.

The KSM daemon is controlled by sysfs files in /sys/kernel/mm/ksm/; documentation can be found in Documentation/vm/ksm.txt.
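As a rough sketch of how those sysfs files might be inspected, the helper below reads every integer-valued file in the KSM directory into a dictionary. The function name read_ksm_stats is hypothetical, and the directory is a parameter so the code can be tried against a scratch directory on machines without KSM.

```python
from pathlib import Path


def read_ksm_stats(ksm_dir="/sys/kernel/mm/ksm"):
    """Return {file name: integer value} for each numeric file in ksm_dir."""
    stats = {}
    base = Path(ksm_dir)
    if not base.is_dir():
        return stats  # KSM is not available on this kernel
    for entry in sorted(base.iterdir()):
        try:
            stats[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            continue  # skip unreadable or non-numeric entries
    return stats
```

On a system with KSM enabled, comparing the pages_sharing and pages_shared values gives a rough idea of how much memory is being saved.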

Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)

1.4. Improvements in the graphic stack

The landing of GEM and KMS in past releases is driving a much needed renovation in the Linux graphic stack. This release adds several improvements to the graphic drivers that show the steady progress of this kernel subsystem:

1.5. CFQ low latency mode

Recommended LWN commentary from Jens Axboe

In this release, the CFQ IO scheduler (the one used by default) gets a new feature that greatly helps to reduce the impact that a writer can have on system interactivity. The end result is that the desktop experience should be less impacted by background IO activity, but it can cause noticeable performance issues, so people who only care about throughput (i.e., servers) can try to turn it off by echoing 0 to /sys/class/block/<device name>/queue/iosched/low_latency. It's worth mentioning that the 'low_latency' setting defaults to on.
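The toggle described above is an ordinary sysfs write. A minimal sketch follows, with hypothetical helper names and a sysfs_root parameter added only so it can be exercised against a scratch directory; on a real system it needs root privileges and a device that uses CFQ.

```python
from pathlib import Path


def low_latency_path(device, sysfs_root="/sys"):
    """Build the path of the CFQ low_latency tunable for a block device."""
    return (Path(sysfs_root) / "class" / "block" / device
            / "queue" / "iosched" / "low_latency")


def set_low_latency(device, enabled, sysfs_root="/sys"):
    """Write 1 or 0 to low_latency and return the value read back."""
    path = low_latency_path(device, sysfs_root)
    if not path.exists():
        # The file is only present when the device uses the CFQ scheduler.
        raise FileNotFoundError(f"{path} not found")
    path.write_text("1\n" if enabled else "0\n")
    return path.read_text().strip() == "1"
```

For example, set_low_latency("sda", False) would be the equivalent of echoing 0 into the file for a device named sda.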

Code: (commit), (commit)

1.6. Tracing improvements: perf tracepoints, perf timechart and perf sched

The perf tool is getting a lot of attention and patches. In the past few months the perfcounters subsystem has outgrown its initial role of counting hardware events, and has become (and is still becoming) a much broader generic facility for event enumeration, reporting, logging, monitoring and analysis, so it has been renamed from "Performance Counters" to "Performance Events".

Code: rename (commit), perf trace (commit), syscalls (commit 1, 2, 3, 4), module (commit), skb (commit), memory allocator (commit 1, 2, 3, 4, 5, 6), perf sched (commit 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10), perf timechart (commit 1, 2, 3)

1.7. Soft limits in the memory controller

Control groups are a sort of virtual "containers" that are created as directories inside a special virtual filesystem (usually with tools). An arbitrary set of processes can be added to a control group, and the group can be configured with a set of CPU scheduling or memory limits for the processes inside it.

This release adds soft memory limits - the processes can surpass the soft limit as long as there is no memory contention (and they do not exceed their hard limit), but if the system needs to free memory, it will reclaim it from the groups that exceed their soft limit.
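The policy can be illustrated with a toy model. This is not the kernel's reclaim code: the Group class and the reclaim function are hypothetical, and visiting the groups with the largest excess first is a simplifying assumption made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    usage: int       # bytes currently charged to the group
    soft_limit: int  # reclaim target when memory is contended


def reclaim(groups, needed):
    """Free `needed` bytes, taking only from groups above their soft limit.

    Returns {group name: bytes reclaimed}; groups under their soft limit
    are left untouched.
    """
    reclaimed = {}
    # Visit the groups with the largest excess over their soft limit first.
    ordered = sorted(groups, key=lambda g: g.usage - g.soft_limit, reverse=True)
    for group in ordered:
        excess = group.usage - group.soft_limit
        if needed <= 0 or excess <= 0:
            break
        take = min(excess, needed)
        group.usage -= take
        needed -= take
        reclaimed[group.name] = take
    return reclaimed
```

With groups using 300, 150 and 50 bytes against a soft limit of 100 each, a request to free 220 bytes takes 200 from the first group and 20 from the second, and leaves the third alone.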

Code: (commit), (commit), (commit), (commit), (commit)

1.8. Easy local kernel configuration

Most people use the kernel shipped by distros - and that's good. But some people like to compile their own kernels from kernel.org, or maybe they like following the Linux development and want to try it. Configuring your own kernel, however, has become a very difficult and tedious task - there are too many options, and sometimes userspace software will stop working if you don't enable some key option. You can use a standard distro .config file, but it takes too much time to compile all the options it enables.

To make the process of configuration easier, a new build target has been added: make localmodconfig. It runs "lsmod" to find all the modules loaded on the currently running system. It reads all the Makefiles to map which CONFIG option enables each module. It reads the Kconfig files to find the dependencies and selects that may be needed to support a CONFIG option. Finally, it reads the .config file and removes any module ("=m") that is not needed to enable the currently loaded modules. With this tool, you can strip a distro .config of all the drivers that are not needed on your machine, and it will take much less time to build the kernel. There's an additional "make localyesconfig" target, in case you don't want to use modules and/or initrds.
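The steps above can be sketched in a few functions. This is a simplified illustration rather than the actual kernel script: all function names are hypothetical, only plain "obj-$(CONFIG_X) += name.o" Makefile lines are recognised, and the Kconfig dependency step is left out entirely.

```python
import re


def loaded_modules(lsmod_output):
    """Module names from `lsmod` output (first column, header skipped)."""
    lines = lsmod_output.strip().splitlines()[1:]
    return {line.split()[0] for line in lines if line.strip()}


def config_for_modules(makefile_text):
    """Map module name -> CONFIG symbol from `obj-$(CONFIG_X) += name.o`."""
    mapping = {}
    pattern = r"obj-\$\((CONFIG_\w+)\)\s*\+=\s*(.+)"
    for match in re.finditer(pattern, makefile_text):
        symbol, objects = match.groups()
        for obj in objects.split():
            if obj.endswith(".o"):
                # lsmod reports names with underscores, not hyphens.
                mapping[obj[:-2].replace("-", "_")] = symbol
    return mapping


def strip_config(config_text, needed_symbols):
    """Disable `CONFIG_X=m` lines whose symbol is not needed; keep the rest."""
    kept = []
    for line in config_text.splitlines():
        match = re.fullmatch(r"(CONFIG_\w+)=m", line.strip())
        if match and match.group(1) not in needed_symbols:
            kept.append(f"# {match.group(1)} is not set")
        else:
            kept.append(line)
    return "\n".join(kept)
```

Built-in options ("=y") are never touched in this sketch; only modules that are not currently loaded are disabled.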

1.9. Virtualization improvements

This version adds a few notable improvements to the Linux virtualization subsystem, KVM:

1.10. Run-time Power Management

Recommended LWN article: 'Runtime power management'

This feature allows I/O devices to be put into energy-saving (low power) states at run time (autosuspended) after a specified period of inactivity, and woken up in response to a hardware-generated wake-up event or a driver's request. Hardware support is generally required for this functionality to work, and the bus type drivers of the buses the devices are on are responsible for the actual handling of the autosuspend requests and wake-up events.
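The autosuspend behaviour can be pictured with a small state model. It is purely illustrative and does not use the kernel's run-time PM interface: the Device class, its method names and the explicit clock values are all assumptions made for this sketch.

```python
class Device:
    """Toy model of a device that autosuspends after a period of inactivity."""

    def __init__(self, autosuspend_delay):
        self.autosuspend_delay = autosuspend_delay  # seconds of idleness allowed
        self.last_busy = 0.0
        self.suspended = False
        self.resumes = 0

    def use(self, now):
        """A driver request or wake-up event: resume if needed, mark busy."""
        if self.suspended:
            self.suspended = False
            self.resumes += 1
        self.last_busy = now

    def tick(self, now):
        """Periodic check: suspend once the device has been idle long enough."""
        idle = now - self.last_busy
        if not self.suspended and idle >= self.autosuspend_delay:
            self.suspended = True
        return self.suspended
```

A device created with a delay of 2 seconds and last used at time 0 is still active at time 1, suspended at time 2, and resumed again by the next call to use().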

Code: Introduce core framework for run-time PM of I/O devices (rev. 17) (commit)

1.11. S+core architecture support

This release adds support for a new architecture, S+core. The S+core instruction set supports 16-bit, 32-bit and 64-bit instructions, and S+core SoCs have been used in game machines and LCD TVs.

Code: (commit)

1.12. Intel Moorestown and SFI (Simple Firmware Interface) and ACPI 4.0 support

The Simple Firmware Interface (SFI) is a method for platform firmware to export static tables to the operating system (OS) - something analogous to ACPI, used in the MID devices based on the 2nd generation Intel Atom processor platform, code-named Moorestown.

SFI is used instead of ACPI on those platforms because it's simpler and more lightweight. It's not intended to replace ACPI. For more information, see the web site.

At the same time, this release adds support for Moorestown, Intel's Low Power Intel Architecture (LPIA) based Moblin Internet Device (MID) platform. Moorestown consists of two chips: Lincroft (CPU core, graphics, and memory controller) and Langwell IOH. Unlike standard x86 PCs, Moorestown has neither many legacy devices nor standard legacy replacement devices/features. For example, Moorestown does not contain an i8259, i8254, HPET, legacy BIOS, or most of the I/O ports.

There are also several patches that implement ACPI 4.0 support - Linux is in fact the first platform to support it.

Code: SFI (commit 1, 2, 3, 4, 5, 6), Moorestown (commit), (commit)

1.13. NAPI-like approach for block devices

Recommended LWN article: 'Interrupt mitigation in the block layer'

blk-iopoll is a NAPI-like approach for block devices that reduces interrupt overhead. In benchmarks, blk-iopoll cut sys time by 40% in some cases.
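The idea behind a NAPI-like scheme can be shown with a toy interrupt count. This is a simplified model, not the blk-iopoll implementation: the function names, the burst lists and the budget of 32 completions per poll are all assumptions chosen for illustration.

```python
def interrupts_per_completion(bursts):
    """Classic model: one interrupt for every completed request."""
    return sum(bursts)


def interrupts_with_polling(bursts, budget=32):
    """NAPI-like model: one interrupt per burst, then poll until drained.

    `bursts` lists how many completions arrive together; returns the
    tuple (interrupts taken, poll passes run).
    """
    interrupts = 0
    polls = 0
    for pending in bursts:
        if pending == 0:
            continue
        interrupts += 1  # the first completion raises the interrupt
        while pending > 0:  # further interrupts stay masked while polling
            pending -= min(pending, budget)
            polls += 1
        # Queue drained: interrupts would be re-enabled at this point.
    return interrupts, polls
```

In this model the saving grows with the burst size: the more completions that arrive together, the more of them are handled by polling instead of by separate interrupts.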

Code: (commit)

2. Various core changes

3. Block

4. Virtualization

5. PCI

6. MD/DM

7. Filesystems

8. Networking

9. Security

10. Tracing/Profiling

11. Crypto

12. Architecture-specific changes

13. Drivers

13.1. Graphics

13.2. Storage

13.3. Networking devices

13.4. USB

13.5. FireWire

13.6. Input

13.7. Sound

13.8. Staging Drivers

13.9. V4L/DVB

13.10. Bluetooth

13.11. MTD

13.12. HWMON

13.13. ACPI

13.14. Various

13.15. Other news sources tracking the kernel changes

KernelNewbies: Linux_2_6_32 (last edited 2017-12-30 01:30:30 by localhost)