Important things (AKA: ''the cool stuff'')

Lightweight userspace priority inheritance (PI) support

Priority inheritance (PI) is a critical feature for realtime-ish applications. Without PI, if a high-priority and a low-priority task share a lock, then even if all critical sections are carefully coded to be deterministic (i.e. they are short in duration and execute only a limited number of instructions), the kernel cannot guarantee deterministic execution of the high-priority task: any medium-priority task could preempt the low-priority task while it holds the shared lock and executes the critical section, and could delay it indefinitely. User-space PI helps to achieve or improve determinism for user-space applications in those cases. Details in this [ LWN article]; the glibc patch can be found [ here]; justification for this feature and design documentation: [;a=commit;h=a6537be9324c67b41f6d98f5a60a1bd5a8e02861 (commit)]; code: [;a=commit;h=e2970f2fb6950183a34e8545faa093eb49d186e1 (commit)], [;a=commit;h=b29739f902ee76a05493fb7d2303490fc75364f4 (commit)], [;a=commit;h=23f78d4a03c53cbd75d87a795378ea540aa08c86 (commit)]
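The priority inversion scenario described above can be sketched with a toy single-CPU simulation. This is only an illustration under stated assumptions, not the kernel's implementation: the task names, tick counts, arrival times and the function `high_task_finish_time` are all hypothetical, and a higher number means a higher priority.

```python
def high_task_finish_time(use_pi, medium_work=5):
    """Return the tick at which the high-priority task finishes (toy model)."""
    prio = {"low": 1, "medium": 2, "high": 3}          # hypothetical priorities
    work = {"low": 2, "medium": medium_work, "high": 1}  # remaining ticks of work
    arrival = {"low": 0, "medium": 1, "high": 1}
    lock_owner = "low"  # "low" takes the shared lock at t=0 and holds it for all its work
    t = 0
    while work["high"] > 0:
        ready = [n for n in work if work[n] > 0 and arrival[n] <= t]
        # "high" needs the lock, so it is blocked while "low" still owns it
        blocked = lock_owner == "low" and "high" in ready
        runnable = [n for n in ready if not (n == "high" and blocked)]

        def effective(n):
            # With PI, the lock owner temporarily inherits the waiter's priority
            if use_pi and blocked and n == lock_owner:
                return prio["high"]
            return prio[n]

        current = max(runnable, key=effective)
        work[current] -= 1
        if current == "low" and work["low"] == 0:
            lock_owner = None  # critical section done, lock released
        t += 1
    return t


print(high_task_finish_time(use_pi=False))
print(high_task_finish_time(use_pi=True))
```

In this model the finish time of the high-priority task grows with `medium_work` when PI is off, while with PI it depends only on the length of the low-priority task's critical section.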

lockdep, a kernel lock validator

Linux's locking style is known for being simple compared with other SMP-friendly Unix derivatives. Still, locking is a necessary evil that is hard to get right for most normal programmers (most of us), and locking bugs can be very difficult to find, especially in drivers, which don't get the solid review that the core kernel gets. The kernel lock validator is a debugging tool that tries to make such things easier: it adds [ (LWN article)] a complex infrastructure to the kernel which can then be used to prove that none of the locking patterns observed in a running system could ever deadlock the kernel. Design documentation: [;a=commit;h=f3e97da38e1d69d24195d76f96b912323f5ee30c (commit)], code: [;a=commit;h=fbb9ce9530fd9b66096d5187fa6a115d16d9746c (commit)]
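The core idea of validating observed locking patterns can be illustrated with a small lock-order graph. This is a much simplified sketch and not lockdep's actual algorithm or API: the class `LockOrderValidator` and its methods are hypothetical names, and the real validator tracks far more state (interrupt contexts, lock classes, and so on).

```python
from collections import defaultdict


class LockOrderValidator:
    """Toy validator: remembers 'B was taken while holding A' edges."""

    def __init__(self):
        # lock name -> set of locks that were acquired while holding it
        self.after = defaultdict(set)

    def _reachable(self, src, dst):
        # Depth-first search over the recorded ordering edges
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.after[node])
        return False

    def acquire(self, held, new):
        """Record taking `new` while holding `held`.

        Returns False if this ordering, combined with orderings seen
        earlier, forms a cycle (a possible deadlock).
        """
        for h in held:
            if self._reachable(new, h):
                return False
        for h in held:
            self.after[h].add(new)
        return True


v = LockOrderValidator()
v.acquire([], "A")
v.acquire(["A"], "B")         # observed order: A then B
print(v.acquire(["B"], "A"))  # opposite order: flagged even though no deadlock happened
```

The point of the sketch is that the inverted ordering is reported as soon as it is observed once, without the two code paths ever having to race against each other.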

Power saving policy for the process scheduler

In machines with several multicore/SMT "packages" (which will become increasingly common in the future), power consumption can be improved by letting some packages idle while others do all the work, instead of spreading the tasks over all CPUs, so an optional power saving policy has been developed to make this possible. The policy is enabled by writing 1 to the sysfs entry 'sched_mc_power_savings' or 'sched_smt_power_savings' placed under /sys/devices/system/cpu/cpuX/ (available when CONFIG_SCHED_MC / CONFIG_SCHED_SMT is enabled). When the policy is enabled and the load is light, the scheduler minimizes the number of physical packages/CPU cores carrying the load, thus conserving power at a performance cost that depends on the workload characteristics. When there is a lot of work to do, all CPUs will be used; to completely disable individual CPUs, use the already available CPU hotplug feature by writing 0 to the "online" file in that sysfs directory. For more details on the effect of this policy read the "Chip Multi Processing(CMP) aware Linux Kernel Scheduler" talk from [ the OLS 2005] (page 201 and onwards) [;a=commit;h=5c45bf279d378d436ce45825c0f136696c7b6109 (commit)]
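The difference between spreading and packing can be shown with a very small model. This is only a sketch: the function `busy_packages` is a hypothetical helper, it assumes one task per core, and it ignores everything the real scheduler considers beyond the task count.

```python
import math


def busy_packages(n_tasks, n_packages, cores_per_package, power_savings):
    """Return how many physical packages end up with at least one task."""
    # Toy assumption: never more runnable tasks than cores
    n_tasks = min(n_tasks, n_packages * cores_per_package)
    if power_savings:
        # Packing: fill every core of a package before waking the next one
        return math.ceil(n_tasks / cores_per_package)
    # Spreading: give each task its own package while idle packages remain
    return min(n_tasks, n_packages)


# Two dual-core packages, two runnable tasks
print(busy_packages(2, 2, 2, power_savings=False))
print(busy_packages(2, 2, 2, power_savings=True))
```

Under light load the packing policy leaves a whole package idle, while under full load both policies use every package, which matches the behaviour described above.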

'SMPnice': take priority into account when balancing processes between CPUs

One of the design principles of the new 2.6 scheduler (aka "Ingo's O(1) scheduler") was the idea of having a separate run queue of processes for each CPU present on the system, instead of a single run queue for all CPUs, for scalability reasons. Periodically, the scheduler balances the per-CPU run queues to distribute all the jobs and keep all the CPUs busy. However, priority levels were not taken into account when doing this balancing, and it was possible to create scenarios where the kernel was being unfair, especially with unprivileged processes. "SMPnice" is an implementation of a solution for this problem [ (LWN article)], [;a=commit;h=2dd73a4f09beacadde827a032cf15fd8b1fa3d48 (commit)]
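A toy model shows why counting tasks is not enough. This is a sketch under assumptions, not the SMPnice code: the linear `weight` function is hypothetical (it is not the kernel's weighting), and `rebalance` is a simple greedy placement chosen only for illustration.

```python
def weight(nice):
    """Hypothetical load weight: nice -20..19 maps to 40..1."""
    return 20 - nice


def imbalance(queues, weighted):
    """Difference between the busiest and the idlest run queue."""
    loads = [sum(weight(n) for n in q) if weighted else len(q) for q in queues]
    return max(loads) - min(loads)


def rebalance(queues):
    """Greedy placement: heaviest task first, onto the least loaded queue."""
    tasks = sorted(n for q in queues for n in q)  # lowest nice = heaviest
    new = [[] for _ in queues]
    for n in tasks:
        target = min(new, key=lambda q: sum(weight(x) for x in q))
        target.append(n)
    return new


# Each list is one CPU's run queue, holding the nice value of every task
queues = [[0, 0], [19, 19]]
print(imbalance(queues, weighted=False))  # task counts look balanced
print(imbalance(queues, weighted=True))   # weighted load is not
print(rebalance(queues))
```

With count-based balancing the two nice 0 tasks keep competing for one CPU while the two nice 19 tasks share the other; taking weights into account pairs one heavy and one light task per CPU.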

Swapless page migration

The ability to migrate physical pages between nodes in NUMA-like systems - to improve the [ locality of reference] - was introduced in [ Linux 2.6.16], but it didn't use a very clean method: pages were swapped out on purpose, and the next time those pages were faulted in, they were swapped in to the node where you wanted to move them instead of the old one. That trick has now been replaced with "direct page migration": pages are moved directly from one node to another, without using swap. This feature includes a new system call which allows moving individual pages of a process from one node to another: long move_pages(pid, number_of_pages_to_move, addresses_of_pages[], nodes[] or NULL, status[], flags) (the swap-based migration had already added a migrate_pages() syscall and a MPOL_MF_MOVE option to the mbind() syscall). For full details, read this [ (LWN article)]. Code: [;a=commit;h=0697212a411c1dae03c27845f2de2f3adb32c331 (commit)], [;a=commit;h=6c5240ae7f48c83fcaa8e24fa63e7eb09aba5651 (commit)], [;a=commit;h=d75a0fcda2cfc71b50e16dc89e0c32c57d427e85 (commit)], [;a=commit;h=04e62a29bf157ce1edd168f2b71b533c80d13628 (commit)], [;a=commit;h=8d3c138b77f195ca0eee6fb639ae73f5ea9edb6b (commit)], [;a=commit;h=742755a1d8ce2b548428f7aacf1758b4bba50080 (commit)]

Per-zone VM counters

Zone-based VM statistics are necessary to determine the state of memory in a zone. The existing VM counters are split per processor, but the processor has little to do with the zone the pages belong to: we cannot tell, for example, how many pages on a particular node are dirty. If we knew, we could put measures into the VM to balance the use of memory between different zones and different nodes in a NUMA system. It would allow the development of new NUMA balancing algorithms that may improve the scheduler's decisions about when to move a process to another node, and hopefully it will also enable automatic page migration through a user-space program that can analyse the memory load distribution and then rebalance memory use in order to increase performance. This feature makes that information available. The zone_reclaim_interval sysctl vanishes (since VM stats can now determine when it is worth doing local reclaim), and there are accurate counters in /sys/devices/system/node/node*/meminfo (the previous counters were not very accurate).
Other detailed VM counters are available in more /proc and /sys status files [;a=commit;h=f6ac2354d791195ca40822b84d73d48a4e8b7f2b (commit)], [;a=commit;h=2244b95a7bcf8d24196f8a3a44187ba5dfff754c (commit)], [;a=commit;h=f3dbd34460ff54962d3e3244b6bcb7f5295356e6 (commit)], [;a=commit;h=65ba55f500a37272985d071c9bbb35256a2f7c14 (commit)], [;a=commit;h=b1e7a8fd854d2f895730e82137400012b509650e (commit)], [;a=commit;h=ce866b34ae1b7f1ce60234cf65855886ac7e7d30 (commit)], [;a=commit;h=df849a1529c106f7460e51479ca78fe07b07dc8c (commit)], [;a=commit;h=34aa1330f9b3c5783d269851d467326525207422 (commit)], [;a=commit;h=9a865ffa34b6117a5e0b67640a084d8c2e198c93 (commit)], [;a=commit;h=ca889e6c45e0b112cb2ca9d35afc66297519b5d5 (commit)], [;a=commit;h=fd39fc8561be33065306bdac0e30414e1e8ac8e1 (commit)], [;a=commit;h=d2c5e30c9a1420902262aa923794d2ae4e0bc391 (commit)], [;a=commit;h=9614634fe6a138fd8ae044950700d2af8d203f97 (commit)], [;a=commit;h=f8891e5e1f93a128c3900f82035e8541357896a7 (commit)]

Big libata (SATA) update

[ (LWN article)] Mainstream libata has been missing some features like NCQ and hotplug. The code had been written a while ago (more than a year ago in the case of NCQ) but only now has it been considered stable. The features included in this update are: revamped error handling across all the libata code, which makes libata more robust to errors and failures and makes problems easier to debug [;a=commit;h=022bdb075b9e1f224088a0b268de56268d7bc5b6 (commit)]; NCQ ([ Native Command Queuing]), which greatly improves performance for many workloads [;a=commit;h=3dc1d88193b9c65b01b64fb2dc730e486306649f (commit)]; hotplug [;a=commit;h=084fe639b81c4d418a2cf714acb0475e3713cb73 (commit)], warmplug [;a=commit;h=83c47bcb3c533180a6dda78152334de50065358a (commit)], and bootplug (boot probing via the hotplug path) support [;a=commit;h=3e706399b03bd237d087d731d4b1b029e546b33d (commit)]; interrupt-driven PIO mode (instead of the inefficient polling method) [;a=commit;h=312f7da2824c82800ee78d6190f12854456957af (commit)]; and MCP61 support [;a=commit;h=4c5c81613b0eb0dba97a8f312a2f1162f39fd47b (commit)]

Change the default IO scheduler to 'CFQ'

2.6 features modular I/O schedulers: there are several I/O schedulers with different performance properties, and you can switch between them at runtime through /sys/block/hda/queue/scheduler. The [ Anticipatory Scheduler] (AS) has been the default one so far, but the CFQ (Completely Fair Queuing) scheduler has been gaining adoption, to the point that it's the default I/O scheduler for RHEL 4, SUSE, and other distros. One of the coolest things about CFQ is that it features (since 2.6.13) "I/O priorities": you can set the I/O priority of a process, either to keep a process that does too much I/O (such as the daily updatedb) from starving the rest of the system, or to give extra priority to a process that shouldn't be starved by other processes, by using the "ionice" tool included in schedutils (1.5.0 and onwards). Now CFQ is the default scheduler [;a=commit;h=b17fd9bceb99610f6dc7998c9a4ed6b71520be2b (commit)] (after some performance tweaks that should improve the performance in many workloads) [;a=commit;h=caaa5f9f0a75d1dc5e812e69afdbb8720e077fd3 (commit)]. If you want to continue using the AS scheduler, you can change it at runtime in /sys/block/hda/queue/scheduler, or use the "elevator=as" boot option.
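To check which scheduler a device is using, the sysfs file can be read and parsed. The sketch below assumes the usual format of that file, a space-separated list with the active scheduler in square brackets; the helper names `parse_scheduler` and `read_scheduler` are hypothetical, and the device name "hda" is just the one used in the text above.

```python
def parse_scheduler(text):
    """Parse the contents of /sys/block/<dev>/queue/scheduler.

    Returns (active, available), e.g. for "noop anticipatory deadline [cfq]".
    """
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]  # the bracketed entry is the active scheduler
            active = name
        available.append(name)
    return active, available


def read_scheduler(device="hda"):
    """Read and parse the scheduler file of a block device (needs sysfs)."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return parse_scheduler(f.read())


print(parse_scheduler("noop anticipatory deadline [cfq]\n"))
```

Switching schedulers at runtime is done by writing one of the available names back to the same file as root.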

Secmark: Add security markings to packets via iptables

SELinux already has methods to "mark" network packets, but they're not as expressive or powerful as the controls provided by Netfilter/iptables. So Netfilter/iptables has been leveraged for packet selection and labeling, so that now SELinux can have more powerful and expressive network controls for adding security markings to packets. This also allows for increased security, as the policy is more effective, allowing access to the full range of iptables selectors and support mechanisms. The feature includes a SECMARK target allowing the admin to apply security marks to packets via both iptables and ip6tables, a CONNSECMARK target used to specify rules for copying security marks from packets to connections and for copying security marks back from connections to packets, and secmark support to conntrack. Examples of policies and rulesets, and patches for libselinux can be found [ here]. [;a=commit;h=29a395eac4c320c570e73f0a90d8953d80da8359 (commit)], [;a=commit;h=4e5ab4cb85683cf77b507ba0c4d48871e1562305 (commit)], [;a=commit;h=984bc16cc92ea3c247bf34ad667cfb95331b9d3c (commit)], [;a=commit;h=4e5ab4cb85683cf77b507ba0c4d48871e1562305 (commit)], [;a=commit;h=5e6874cdb8de94cd3c15d853a8ef9c6f4c305055 (commit)], [;a=commit;h=100468e9c05c10fb6872751c1af523b996d6afa9 (commit)], [;a=commit;h=7c9728c393dceb724d66d696cfabce82151a78e5 (commit)]

New drivers

Here are some important drivers that have been added to the Linux tree. Note that only new, important drivers are listed here; other small drivers are listed below. Existing drivers also gain support for new devices, and some of those additions are listed below, but support for new devices is added so fast that it's impossible to keep track of all of them.

Generic IRQ layer

This is yet more generalization of the IRQ layer. Not all architectures were using the current IRQ layer (notably ARM), and the current one had some shortcomings. From this [ LWN article]: "These patches attempt to take lessons learned about optimal interrupt handling on all architectures, mix in the quirks found in the fifty (yes, fifty) ARM subarchitectures, and create a new IRQ subsystem which is truly generic, and more powerful as well." Design documentation: [;a=commit;h=11c869eaf1a9c97ef273f824a697fac017d68286 (commit)]; code: [;a=commit;h=6a6de9ef5850d063c3d3fb50784bfe3a6d0712c6 (commit)], [;a=commit;h=94d39e1f6e8132ea982a1d61acbe0423d3d14365 (commit)], [;a=commit;h=6550c775cb5ee94c132d93d84de3bb23f0abf37b (commit)], [;a=commit;h=a4633adcdbc15ac51afcd0e1395de58cee27cf92 (commit)], [;a=commit;h=dd87eb3a24c4527741122713e223d74b85d43c85 (commit)], [;a=commit;h=e76de9f8eb67b7acc1cc6f28c4be8583adf0a90c (commit)], [;a=commit;h=3418d72404e35eb19e7995cbf3e7a76ba8fefbce (commit)], [;a=commit;h=ba9a2331bae5da8f65be3722b9e2d210f1987857 (commit)]

Generic core time subsystem

Timekeeping is currently done in an architecture-dependent way. This work tries to provide a core time subsystem that can be used by all architectures, avoiding lots of code duplication. Detailed analysis in this [ LWN article]; [;a=commit;h=734efb467b31e56c2f9430590a9aa867ecf3eea1 (commit)], [;a=commit;h=ad596171ed635c51a9eef829187af100cbf8dcf7 (commit)], [;a=commit;h=260a42309b31cbc54eb4b6b85649e412bcad053f (commit)], [;a=commit;h=5eb6d20533d14a432df714520939a6181e28f099 (commit)], [;a=commit;h=cf3c769b4b0dd1146da84d5cf045dcfe53bd0f13 (commit)], [;a=commit;h=8d016ef1380a2a9a5ca5742ede04334199868f82 (commit)], [;a=commit;h=539eb11e6e904f2cd4f62908cc5e44d724879721 (commit)], [;a=commit;h=539eb11e6e904f2cd4f62908cc5e44d724879721 (commit)], [;a=commit;h=6f84fa2f3edc8902cfed02cd510c7c58334bb9bd (commit)], [;a=commit;h=61743fe445213b87fb55a389c8d073785323ca3e (commit)], [;a=commit;h=5d0cf410e94b1f1ff852c3f210d22cc6c5a27ffa (commit)]

Randomize the i386 vDSO

This change moves the i386 vDSO down into a vma and thus randomizes it. Besides the security implications (attackers can no longer use the predictable high-mapped vDSO page as a syscall trampoline), this feature also helps debuggers, which can COW a vma-backed vDSO just like a normal DSO and can thus do single-stepping and other debugging features. It's good for hypervisors (Xen, VMware) too: they typically live in the same high-mapped address space as the vDSO, so whenever the vDSO is used they get lots of guest page faults and have to fix up such guest accesses, which slows things down instead of speeding things up (the primary purpose of the vDSO). There's a new CONFIG_COMPAT_VDSO option, which provides support for older glibcs that still rely on a prelinked high-mapped vDSO. Newer distributions (using glibc 2.3.3 or later) can turn this backwards-compatibility option off (recommended for security reasons, as the feature makes certain types of attacks harder). There is a new vdso=[0|1] boot option as well, and a runtime /proc/sys/vm/vdso_enabled sysctl switch that allows the vDSO to be turned on or off [;a=commit;h=e6e5494cb23d1933735ee47cc674ffe1c4afed6f (commit)]
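One way to observe the randomization from user space is to look up the vDSO mapping in /proc/<pid>/maps. The sketch below assumes the kernel labels that mapping `[vdso]` in the maps file; the helper names and the sample addresses are hypothetical.

```python
def vdso_range(maps_text):
    """Return (start, end) of the [vdso] mapping in a /proc/<pid>/maps dump.

    Returns None if no line is labelled [vdso].
    """
    for line in maps_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "[vdso]":
            start, end = fields[0].split("-")
            return int(start, 16), int(end, 16)
    return None


def own_vdso_range():
    """Look up the vDSO of the current process (needs a mounted /proc)."""
    with open("/proc/self/maps") as f:
        return vdso_range(f.read())


# Hypothetical sample line, in the format used by /proc/<pid>/maps
sample = "b7f2c000-b7f2d000 r-xp 00000000 00:00 0          [vdso]\n"
start, end = vdso_range(sample)
print(hex(start), hex(end))
```

Running `own_vdso_range()` in several freshly started processes and comparing the results gives a quick indication of whether the mapping address changes between runs.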

Various core stuff

Other stuff

Architecture-specific changes


[;a=commit;h=fe610671d7a88e363e8cebcb7e2f32078b0151ce (commit)]

[;a=commit;h=c067a7899790ed4c03b00ed186c6e3b6a3964379 (commit)], L5D [;a=commit;h=ebccb84810729f0e86a83a65681ba2de45ff84d8 (commit)], A3G [;a=commit;h=ed2cb07b2bb04f14793cdeecb0b384374e979525 (commit)], A4G [;a=commit;h=f78c589d108f4b06a012817536c9ced37f473eae (commit)], LED display support [;a=commit;h=42cb891295795ed9b3048c8922d93f7a71f63968 (commit)]

KernelNewbies: Linux_2_6_18 (last edited 2006-07-13 23:40:29 by diegocalleja)