
2.6.25 includes RCU preemption support, fairer spinlocks on x86, new interfaces for more accurate measurement of process memory usage, and the other changes listed below.


1. Important features (AKA: the cool stuff)

1.1. Memory Resource Controller

Recommended LWN article (somewhat outdated, but still interesting): [http://lwn.net/Articles/243795/ "Controlling memory use in containers"]

The memory resource controller is a cgroups-based feature. Cgroups, aka "Control Groups", is a feature that was merged in [http://kernelnewbies.org/Linux_2_6_24 2.6.24]; its purpose is to be a generic framework where several "resource controllers" can plug in and manage different system resources, such as process scheduling or memory allocation. It also offers a unified user interface, based on a virtual filesystem, where administrators can assign arbitrary resource constraints to a group of chosen tasks. For example, two resource controllers were merged in [http://kernelnewbies.org/Linux_2_6_24 2.6.24]: Cpusets and Group Scheduling. The first binds CPUs and memory nodes to an arbitrarily chosen group of tasks (a cgroup), and the second binds a CPU bandwidth policy to the cgroup.

The memory resource controller isolates the memory behavior of a group of tasks (a cgroup) from the rest of the system.

The configuration interface, as with all cgroups, consists of mounting the cgroup filesystem with the "-o memory" option, creating an arbitrarily named directory (the cgroup), adding tasks to the cgroup by echoing their PIDs into the 'tasks' file inside the cgroup directory, and reading or writing the following files: 'memory.limit_in_bytes' (the memory limit for the cgroup), 'memory.usage_in_bytes' (memory statistic for the cgroup), 'memory.stat' (more statistics: RSS, caches, inactive/active pages), 'memory.failcnt' (number of times that the cgroup exceeded the limit), and 'mem_control_type'. OOM conditions are also handled per cgroup: when the tasks in a cgroup exceed the limit, the OOM killer is invoked to kill one of the tasks in that specific cgroup.
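The steps above can be sketched in a few lines of Python. This is only an illustration, and it assumes that the cgroup filesystem has already been mounted with the memory controller, that the per-cgroup task list file is called 'tasks', and that the caller has permission to write there. The function names and the mount point used in the usage note are made up for the example.

```python
import os


def configure_memory_cgroup(cgroup_root, name, pid, limit_bytes):
    """Create the cgroup `name` under `cgroup_root`, set its limit, add `pid`.

    `cgroup_root` is assumed to be a directory where the cgroup filesystem
    was mounted with the memory controller enabled.
    """
    group_dir = os.path.join(cgroup_root, name)
    # On a real cgroup mount, creating the directory creates the cgroup.
    os.makedirs(group_dir, exist_ok=True)

    # Cap the memory that the tasks in this cgroup may use.
    with open(os.path.join(group_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))

    # Move the task into the cgroup by writing its PID to the task list.
    with open(os.path.join(group_dir, "tasks"), "w") as f:
        f.write(str(pid))

    return group_dir


def read_usage(group_dir):
    """Return the current memory usage of the cgroup, in bytes."""
    with open(os.path.join(group_dir, "memory.usage_in_bytes")) as f:
        return int(f.read().strip())
```

For instance, configure_memory_cgroup("/mnt/cgroup", "demo", os.getpid(), 64 * 1024 * 1024) would place the calling process in a 64 MB cgroup, with "/mnt/cgroup" being a hypothetical mount point.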

Code: (commit [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=1b6df3aa457690100f9827548943101447766572 1], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8cdea7c05454260c0d4d83503949c358eb131d17 2], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e552b6617067ab785256dcec5ca29eeea981aacb 3], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=78fb74669e80883323391090e4d26d17fe29488f 4], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8a9f3ccd24741b50200c3f33d62534c7271f3dfc 5], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=66e1707bc34609f626e2e7b4fe7e454c9748bad5 6], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=67e465a77ba658635309ee00b367bec6555ea544 7], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0eea10301708c64a6b793894c156e21ddd15eb64 8], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c7ba5c9e8176704bfac0729875fa62798037584d 9], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8697d33194faae6fdd6b2e799f6308aa00cfdf67 10], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=bed7161a519a2faef53e1bce1b47595e297c1d14 11], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e1a1cd590e3fcb0d2e230128daf2337ea55387dc 12])

1.2. Real Time Group scheduling

Group scheduling is a feature introduced in [http://kernelnewbies.org/Linux_2_6_24 2.6.24]. It lets you assign process scheduling priorities based on criteria other than nice levels. For example, given two users on a system, you may want to assign 50% of the CPU time to each one, regardless of how many processes each one is running (traditionally, if one user is running e.g. 10 CPU-bound processes and the other user only 1, the latter would be starved of CPU time). This is what the "group tasks by user id" configuration option of Group Scheduling does. You may also want to create arbitrary groups of tasks and give them CPU time privileges; this is what the "group tasks by Control Groups" option does, basing its configuration interface on cgroups (a feature introduced in 2.6.24 and described in the "Memory Resource Controller" section).
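The arithmetic behind the two-user example can be worked out with a small Python model. It is a simplification, not the scheduler's algorithm: it assumes a single CPU, CPU-bound processes with equal nice levels, and equal weight for every user. The function names are invented for the illustration.

```python
from collections import Counter
from fractions import Fraction


def cpu_share_per_process(owners, group_by_user):
    """Return the CPU fraction of each process; owners[i] is the user of process i."""
    if not group_by_user:
        # Without grouping, every runnable process gets an equal slice.
        return [Fraction(1, len(owners))] * len(owners)
    per_user = Counter(owners)
    user_share = Fraction(1, len(per_user))  # each user gets an equal slice
    # A user's slice is then split evenly among that user's processes.
    return [user_share / per_user[owner] for owner in owners]


def cpu_share_per_user(owners, group_by_user):
    """Sum the per-process shares for each user."""
    totals = {}
    shares = cpu_share_per_process(owners, group_by_user)
    for owner, share in zip(owners, shares):
        totals[owner] = totals.get(owner, Fraction(0)) + share
    return totals
```

With ten processes owned by one user and one process owned by another, the second user gets 1/11 of the CPU without grouping and 1/2 with per-user grouping.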

Those are the two working modes of Group Scheduling. Additionally, there are several types of tasks. What 2.6.25 adds to Group Scheduling is the ability to also handle real-time (aka SCHED_RT) processes. This makes it much easier to handle RT tasks and give them scheduling guarantees.

Documentation: [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/sched-rt-group.txt;hb=HEAD sched-rt-group.txt]

Code: [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fa85ae2418e6843953107cd6a06f645752829bc0 (commit 1], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6f505b16425a51270058e4a93441fe64de3dd435 2], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=9f0c1e560c43327b70998e6c702b2f01321130d9 3], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=052f1dc7eb02300b05170ae341ccd03b76207778 4)]

There's serious interest in running RT tasks on enterprise-class hardware, so a large number of enhancements to the RT scheduling class and load-balancer have been merged to provide optimum behaviour for RT tasks. Code: [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e8fa136262e1121288bb93befe2295928ffd240d (commit 1], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4fd29176b7cd24909f8ceba2105cb3ae2857b90c 2], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f65eda4f789168ba5ff3fa75546c29efeed19f58 3], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4642dafdf93dc7d66ee33437b93a5e6b8cea20d2 4], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c7a1e46aa9782a947cf2ed506245d43396dbf991 5], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=73fe6aae84400e2b475e2a1dc4e8592cd3ed6e69 6], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e7693a362ec84bb5b6fd441d8a8b4b9d568a7a0c 7], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=318e0893ce3f524ca045f9fd9dfd567c0a6f9446 8], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6e1254d2c41215da27025add8900ed187bca121d 9)]

1.3. RCU Preempt support

Recommended LWN article: [http://lwn.net/Articles/253651/ "The design of preemptible read-copy-update"]

[http://en.wikipedia.org/wiki/Read-copy-update RCU] is a very powerful locking scheme used in Linux to scale to a [http://lkml.org/lkml/2007/5/4/314 very large] number of CPUs on a single system. However, it wasn't well suited for the real-time patchsets that have been developed to make Linux an RT OS, because some parts weren't preemptible, causing latencies too large for RT workloads. In 2.6.25, RCU can be preempted, eliminating that source of latency and making Linux a bit more RT-ish.

Code: [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e260be673a15b6125068270e0216a3bfbfc12f87 (commit 1],[http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2232c2d8e0a6a31061dec311f3d1cf7624bc14f1 2)]

1.4. FIFO ticket spinlocks in x86

Recommended LWN article: [http://lwn.net/Articles/267968/ "Ticket spinlocks"]

In certain workloads, spinlocks can be unfair, i.e. a process spinning on a spinlock can be starved up to 1,000,000 times. Usually starvation in spinlocks is not a problem, and it was thought not to be too important because such a spinlock would become a performance problem before any starvation was noticed, but testing has shown the contrary. It's always possible to find an obscure corner case that generates a lot of contention on some lock, and which processor grabs the lock next is effectively random.

With the new spinlocks, the processes grab the spinlock in FIFO order, ensuring fairness (and more importantly, guaranteeing to some point the

Spinlocks configured to run on machines with more than 255 CPUs will use a 32-bit value, and a 16-bit value when the number of CPUs is smaller. As a bonus, the theoretical maximum number of CPUs that Linux can support is raised to 65536.
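The FIFO behaviour can be illustrated with a small single-threaded Python model of a ticket lock. It is not the kernel's x86 implementation: it assumes the lock value is split into two equal-width counters (8 bits each for the 16-bit variant), and all class and function names are invented for the example.

```python
class TicketLock:
    """Single-threaded model of a ticket lock, for illustration only."""

    def __init__(self, ticket_bits=8):
        self.mask = (1 << ticket_bits) - 1
        self.next_ticket = 0   # ticket handed to the next arriving CPU
        self.now_serving = 0   # ticket currently allowed to hold the lock

    def take_ticket(self):
        """Model of the atomic fetch-and-increment done when a CPU arrives."""
        ticket = self.next_ticket
        self.next_ticket = (self.next_ticket + 1) & self.mask
        return ticket

    def may_enter(self, ticket):
        """A waiter spins until its ticket is the one being served."""
        return ticket == self.now_serving

    def unlock(self):
        """Releasing the lock passes it to the next ticket in line."""
        self.now_serving = (self.now_serving + 1) & self.mask


def acquisition_order(arrivals, ticket_bits=8):
    """Return the order in which the CPUs in `arrivals` get the lock."""
    lock = TicketLock(ticket_bits)
    tickets = {cpu: lock.take_ticket() for cpu in arrivals}
    order = []
    while len(order) < len(arrivals):
        # Poll the waiters in sorted order, on purpose not in arrival order.
        for cpu in sorted(tickets):
            if cpu not in order and lock.may_enter(tickets[cpu]):
                order.append(cpu)
                lock.unlock()
    return order
```

Whatever order the waiters are polled in, acquisition_order() returns them in arrival order, which is the fairness property described above. With 8-bit counters the ticket wraps around after 256 values, which is why the wider lock value is needed on larger machines.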

Code: [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=314cdbefd1fd0a7acf3780e9628465b77ea6a836 (commit)], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3a556b26a2718e48aa2b6ce06ea4875ddcd0778e (commit)]

1.5. Better process memory usage measurement

Recommended LWN article: [http://lwn.net/Articles/230975/ "How much memory are applications really using?"]

Measuring how much memory processes are using is more difficult than it looks, especially when processes share the memory they use. Features like /proc/$PID/smaps (added in [http://kernelnewbies.org/Linux_2_6_14 2.6.14]) help, but they have not been enough. 2.6.25 adds new statistics to make this task easier. A new /proc/$PID/pagemap file is added for each process. In this file the kernel exports (in binary format) the physical page location of each page used by the process. Comparing this file with the files of other processes shows which pages they are sharing. Another file, /proc/kpagemaps, exposes another kind of statistics about the pages of the system. The author of the patch, Matt Mackall, proposes two new metrics: "proportional set size" (PSS), where each shared page is divided by the number of processes sharing it; and "unique set size" (USS), which counts the pages that are not shared. The first statistic, PSS, has also been added to each mapping in /proc/$PID/smaps. In [http://selenic.com/repo/pagemap/ this HG repository] you can find some sample command line and graphical tools that exploit all those statistics.
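The two metrics are easy to state in code. The Python sketch below does not read any /proc file; it assumes the set of physical page frames mapped by each process has already been obtained somehow, and the function name and input layout are chosen only for this illustration.

```python
from fractions import Fraction


def memory_metrics(pages_by_process):
    """Return {process: (PSS, USS)} measured in pages.

    `pages_by_process` maps a process name to the set of physical page
    frame numbers that the process has mapped.
    """
    # Count how many processes map each physical page.
    sharers = {}
    for pages in pages_by_process.values():
        for page in pages:
            sharers[page] = sharers.get(page, 0) + 1

    metrics = {}
    for name, pages in pages_by_process.items():
        # PSS: each page counts as 1/N, where N processes share it.
        pss = sum((Fraction(1, sharers[page]) for page in pages), Fraction(0))
        # USS: only pages that no other process maps.
        uss = sum(1 for page in pages if sharers[page] == 1)
        metrics[name] = (pss, uss)
    return metrics
```

A useful property of PSS is that the values of all processes add up to the number of distinct pages in use, so nothing is counted twice.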

Code: (commit [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=1e88328111aae3ea408f346763ba9f9bad71f876 1], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=304daa8132a95e998b6716d4b7bd8bd76aa152b2 2], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=161f47bf41c5ece90ac53cbb6a4cb9bf74ce0ef6 3], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=85863e475e59afb027b0113290e3796ee6020b7d 4])

1.6. timerfd() syscall

timerfd() is a feature that was merged in 2.6.22 but was disabled due to late complaints about the syscall interface. Its purpose is to extend timer event notifications to something other than signals, because doing such things with signals is hard. poll()/epoll() only cover file descriptors, so the options were a BSD-like kevent subsystem or delivering timer notifications via a file descriptor, so that poll/epoll could handle them.

There were implementations of both approaches, but the cleaner and more "unixy" design of the file descriptor approach won. In 2.6.25, a revised API has finally been introduced. The API can be found [http://lwn.net/Articles/260172/ in this LWN article].
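The file descriptor approach can be sketched from userspace with Python and ctypes. This is a minimal sketch under several assumptions: a Linux system whose C library exports timerfd_create() and timerfd_settime(), and a 64-bit platform where the fields of struct timespec are C longs. The helper name wait_for_timer and the structure classes are invented for the example.

```python
import ctypes
import os
import select
import struct

CLOCK_MONOTONIC = 1  # value of the constant on Linux


class Timespec(ctypes.Structure):
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_nsec", ctypes.c_long)]


class Itimerspec(ctypes.Structure):
    _fields_ = [("it_interval", Timespec), ("it_value", Timespec)]


# Symbols of the C library already loaded into the interpreter.
libc = ctypes.CDLL(None, use_errno=True)


def wait_for_timer(delay_ns):
    """Arm a one-shot timer, wait for it with poll(), return the expiration count."""
    fd = libc.timerfd_create(CLOCK_MONOTONIC, 0)
    if fd < 0:
        raise OSError(ctypes.get_errno(), "timerfd_create failed")
    try:
        # it_interval of zero means the timer fires once and is not re-armed.
        spec = Itimerspec(Timespec(0, 0),
                          Timespec(delay_ns // 10**9, delay_ns % 10**9))
        if libc.timerfd_settime(fd, 0, ctypes.byref(spec), None) < 0:
            raise OSError(ctypes.get_errno(), "timerfd_settime failed")

        # The timer is an ordinary file descriptor, so poll() can wait on it.
        poller = select.poll()
        poller.register(fd, select.POLLIN)
        poller.poll()

        # Reading yields an 8-byte counter of expirations since the last read.
        (expirations,) = struct.unpack("=Q", os.read(fd, 8))
        return expirations
    finally:
        os.close(fd)
```

Because the timer is just another descriptor, it can sit in the same poll/epoll set as sockets and pipes, which is the point of the design.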

Code: [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4d672e7ac79b5ec5cdc90e450823441e20464691 (commit)]

1.7. SMACK, Simplified Mandatory Access Control

Recommended LWN article: [http://lwn.net/Articles/244531/ "Smack for simplified access control"]

The most used MAC solution in Linux is SELinux, a very powerful security framework. SMACK is an alternative MAC framework, not as powerful as SELinux but simpler to use and configure. Linux is all about flexibility, and in the same way that it has several filesystems, this alternative security framework doesn't intend to replace SELinux; it's just an alternative for those who find it better suited to their needs.

From the LWN article: Like SELinux, Smack implements Mandatory Access Control (MAC), but it purposely leaves out the role based access control and type enforcement that are major parts of SELinux. Smack is geared towards solving smaller security problems than SELinux, requiring much less configuration and very little application support.

Code: [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e114e473771c848c3cfec05f0123e70f1cdbdc99 (commit)]

1.8. Latencytop

Recommended LWN article: [http://lwn.net/Articles/266153/ "Finding system latency with LatencyTOP"]

Slow servers, skipping audio, jerky video - everyone knows the symptoms of latency. But to know what's really going on in the system, what's causing the latency, and how to fix it... those are difficult questions without good answers right now. LatencyTOP is a Linux tool for software developers (both kernel and userspace), aimed at identifying where system latency occurs, and what kind of operation/action is causing the latency to happen. By identifying this, developers can then change the code to avoid the worst latency hiccups.

There are many types and causes of latency, and LatencyTOP focuses on the type that causes audio skipping and desktop stutters. Specifically, LatencyTOP focuses on the cases where applications want to run and execute useful code, but some resource is not currently available (and the kernel then blocks the process). This is done both at the system level and at the per-process level, so that you can see what's happening to the system, and which process is suffering and/or causing the delays.

You can find the latencytop userspace tool, including screenshots, at [http://www.latencytop.org latencytop.org]. Code: [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=9745512ce79de686df354dc70a8d1a74d801892d (commit)]

1.9. BRK and PIE executable randomization

[http://en.wikipedia.org/wiki/Exec_Shield Exec-shield] is a project that was started in 2003 by Red Hat to implement several security protections and is mainly used in Red Hat and Fedora. Many of its features were merged long ago, but not all of them. In 2.6.25 two more are being merged: brk() randomization and PIE executable randomization. Those two features should make the address space randomization on i386 and x86_64 complete.

Code [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c1d171a002942ea2d93b4fbd0c9583c56fce0772 (commit)],[http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=cc503c1b43e002e3f1fed70f46d947e2bf349bb6 (commit)], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=32a932332c8bad842804842eaf9651ad6268e637 (commit)]

1.10. Controller area network (CAN) protocol support

Recommended LWN article: [http://lwn.net/Articles/253425/ "PF_CAN"]

From the [http://en.wikipedia.org/wiki/Controller_Area_Network "Controller Area Network" Wikipedia article]: Controller Area Network (CAN or CAN-bus) is a computer network protocol and bus standard designed to allow microcontrollers and devices to communicate with each other without a host computer. This implementation has been contributed by Volkswagen.
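To give an idea of what travels over a CAN socket, the Python sketch below packs and unpacks a classic CAN frame. It touches no socket and no hardware; it assumes the 16-byte layout of the kernel's struct can_frame (a 32-bit identifier, a length byte, three padding bytes and eight data bytes), and the helper names are invented for the example.

```python
import struct

# can_id (u32), data length (u8), 3 padding bytes, 8 data bytes
CAN_FRAME_FORMAT = "=IB3x8s"


def build_can_frame(can_id, data):
    """Pack an identifier and up to 8 data bytes into a 16-byte frame."""
    if len(data) > 8:
        raise ValueError("a classic CAN frame carries at most 8 data bytes")
    # The data field is always 8 bytes long; unused bytes are zero-filled.
    return struct.pack(CAN_FRAME_FORMAT, can_id, len(data), data.ljust(8, b"\x00"))


def parse_can_frame(frame):
    """Return (can_id, data) from a 16-byte frame, dropping the zero fill."""
    can_id, length, data = struct.unpack(CAN_FRAME_FORMAT, frame)
    return can_id, data[:length]
```

A frame built this way is what a program would hand to send() on a raw CAN socket, and parse_can_frame() is the matching step after recv().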

Code: [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=cd05acfe65ed2cf2db683fa9a6adb8d35635263b (commit 1], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0d66548a10cbbe0ef256852d63d30603f0f73f9b 2], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c18ce101f2e47d97ace125033e2896895a6db3dd 3], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ffd980f976e7fd666c2e61bf8ab35107efd11828 4], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ccb29637991fa6b8321a80c2320a71e379aea962 5], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f7ab97f78a5c573e49474edbd260ea6898ddccda 6])

1.11. ACPI thermal regulation/WMI

In 2.6.25 ACPI adds thermal regulation support (Code: [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3f655ef8c439e0775ffb7d1ead5d1d4f060e1f8b (commit 1], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=203d3d4aa482339b4816f131f713e1b8ee37f6dd 2], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=05a83d972293f39a66bc2aa409a5e7996bba585d 3], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d9460fd227ed2ce52941b6a12ad4de05c195f6aa 4)]) and a WMI ([http://www.microsoft.com/whdc/archive/wmi-acpi.mspx Windows Management Instrumentation], a proprietary extension to ACPI) mapper (Code: [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=bff431e49ff531a343fbb2b4426e313000844f32 (commit 1], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=745a5d2126926808295742932d0e36d485efa485 2], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=dd8cd7793781c87be47bbfee65efa3fb5110f898 3)]).

1.12. EXT4 update

Recommended article: [http://lwn.net/Articles/266274/ "A better ext4"]

The EXT4 mainline snapshot gets an update with a bunch of features: multi-block allocation, large blocksize up to PAGE_SIZE, journal checksumming, large file support, large filesystem support, inode versioning, and support for in-inode extended attributes on the root inode. These features should be the last ones that require on-disk format changes. Other features that don't affect the disk format, like delayed allocation, have yet to be merged.

Code: (commit [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c9de560ded61faa5b754137b7753da252391c55a 1], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0040d9875dcccfcb2131417b10fbd9841bc5f05b 2], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0fc1b451471dfc3cabd6e99ef441df9804616e63 3], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c14c6fd5c56a0d0495d8a7c0f2bc330be658663e 4], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=25ec56b518257a56d2ff41a941d288e4b5ff9488 5], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=725d26d3f09ccb5bac4b4293096b985a312a0d67 6], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7a224228ed79d587ece2304869000aad1b8e97dd 7], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8180a5627d126362c2f64e4fa886d6f608d9632a 8], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=818d276ceb83aa9fdebb5e0a53188290312de987 9], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8e85fb3f305b24b79c6d9cb7a56d22b062335ad3 10], [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=afc7cbca5bfd556c3e12d3acefbee5ab0cbd4670 11])

1.13. MN10300/AM33 architecture support

The MN10300/AM33 architecture is now supported under the "mn10300" subdirectory. 2.6.25 adds support for MN10300/AM33 CPUs produced by MEI. It also adds board support for the ASB2303 with the ASB2308 daughter board, and the ASB2305. The only processor supported is the MN103E010, which is an AM33v2 core plus on-chip devices. Code: [http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b920de1b77b72ca9432ac3f97edb26541e65e5dd (commit)]

2. Subsystems

2.1. Various

2.2. Filesystems

2.3. Networking

2.4. Crypto

2.5. Security

2.6. Architecture-specific changes

3. Drivers

3.1. Graphics

3.2. SATA/IDE

3.3. Sound

3.4. SCSI

3.5. Network

3.6. V4L/DVB

3.7. I2C

3.8. HID

3.9. Input

3.10. USB

3.11. RDMA

3.12. Hwmon

3.13. MTD

3.14. ACPI

3.15. RTC/W1

3.16. LEDs

3.17. Various

KernelNewbies: Linux_2_6_25 (last edited 2008-04-03 21:09:51 by diegocalleja)