Linux 2.6.38

Linux 2.6.38 was released on 14 March 2011.

Summary: This release adds support for automatic process grouping (called "the wonder patch" in the news), significant scalability improvements in the VFS, Btrfs LZO compression and read-only snapshots, support for the B.A.T.M.A.N. mesh protocol (which helps to provide network connectivity in the presence of natural disasters, military conflicts or Internet censorship), transparent huge page support (without using hugetlbfs), automatic spreading of outgoing network traffic across multiple CPUs, support for the AMD Fusion APUs, many drivers and other changes.

  1. Prominent features (the cool stuff)
    1. Automatic process grouping (a.k.a. "the patch that does wonders")
    2. VFS scalability: scaling the directory cache
    3. Btrfs LZO compression, read-only snapshots
    4. Transparent huge pages
    5. Transparent spreading of outgoing network traffic across CPUs on multiqueue devices
    6. B.A.T.M.A.N. mesh protocol
    7. Support for AMD Fusion graphics
  2. Drivers and architectures
  3. Core
  4. CPU scheduler
  5. Memory management
  6. Block
  7. File systems
  8. Networking
  9. Crypto
  10. Virtualization
  11. Security
  12. Tracing/perf

1. Prominent features (the cool stuff)

1.1. Automatic process grouping (a.k.a. "the patch that does wonders")

Recommended LWN article: Group scheduling and alternatives

The most significant feature in this release is the so-called "patch that does wonders", a patch that substantially changes how the process scheduler assigns shares of CPU time to each process. With this feature the system groups all processes with the same session ID into a single scheduling entity. Example: imagine a system with six CPU-hungry processes, where the first four share one session ID and the other two each run in a session of their own.

Without automatic process grouping:  [proc. 1 | proc. 2 | proc. 3 | proc. 4 | proc. 5 | proc. 6] 

With automatic process grouping:    [proc. 1, 2, 3, 4  |     proc. 5       |     proc. 6      ] 

The session ID is a property of processes in Unix systems (you can see it with commands like ps -eo session,pid,cmd). It is inherited by forked child processes, which can start a new session using setsid(2). The bash shell uses setsid(2) every time it is started, which means you can run a "make -j 20" inside a shell on your desktop and not notice it while you browse the web. This feature is implemented on top of group scheduling (merged in 2.6.24). You can disable it by writing 0 to /proc/sys/kernel/sched_autogroup_enabled.
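The six-process example above can be worked through with a little arithmetic. The following is only an idealized sketch, not the scheduler's actual algorithm: it assumes a single CPU, equal weights, and processes that are always runnable. The function name and the session IDs (100, 200, 300) are made up for illustration.

```python
from collections import defaultdict
from fractions import Fraction


def cpu_shares(session_of, autogroup=True):
    """Return each process's idealized share of one CPU.

    session_of maps a process name to its session ID (hypothetical values).
    Without autogrouping every process is its own scheduling entity.
    With autogrouping every session is one entity, and the entity's share
    is split evenly among the processes inside it.
    """
    if not autogroup:
        return {proc: Fraction(1, len(session_of)) for proc in session_of}

    groups = defaultdict(list)
    for proc, sid in session_of.items():
        groups[sid].append(proc)

    shares = {}
    group_share = Fraction(1, len(groups))
    for members in groups.values():
        for proc in members:
            shares[proc] = group_share / len(members)
    return shares


# Four processes in one session, two more in sessions of their own.
procs = {"proc1": 100, "proc2": 100, "proc3": 100, "proc4": 100,
         "proc5": 200, "proc6": 300}

without = cpu_shares(procs, autogroup=False)
with_ag = cpu_shares(procs, autogroup=True)

print(without["proc1"], without["proc5"])  # 1/6 1/6
print(with_ag["proc1"], with_ag["proc5"])  # 1/12 1/3
```

Under these assumptions the four processes that share a session drop from 1/6 to 1/12 of the CPU each, while the two processes in their own sessions rise from 1/6 to 1/3 each.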

Code: (commit)

1.2. VFS scalability: scaling the directory cache

Recommended LWN article: Dcache scalability and RCU-walk

There are ongoing efforts to make the Linux VFS layer ("Virtual File System", the code that glues the syscalls to the filesystems) more scalable. Some changes were already merged in the previous release as part of this work. In this release, the dcache (short for "directory cache", which caches directory entries) and the whole path lookup mechanism have been reworked to be more scalable (you can find details in the LWN article).

These changes make the VFS more scalable in multithreaded workloads, but more interestingly (and this is what excites Linus Torvalds) they also make some single-threaded workloads considerably faster (due to the removal of atomic CPU operations in the code paths): a hot-cache "find . -size" on his home directory seems to be 35% faster. A single-threaded git diff on a cached kernel tree runs 20% faster (and 64 parallel git diffs achieve 26 times the throughput). Everything that calls stat() a lot is faster.

Changes: Far too many to track here, see the patches by Nick Piggin in this list (reverse chronological order)

1.3. Btrfs LZO compression, read-only snapshots

Btrfs adds support for transparent compression using the LZO algorithm, as an alternative to zlib. You can find a small performance comparison here.

There is also support for marking snapshots as read-only. Finally, filesystems which find errors will be "force mounted" as read-only, which is a step forward to make the codebase more tolerant to failures.

Code: LZO (commit 1, 2, 3); read-only snapshots (commit 1, 2); forced read-only mounts (commit)

1.4. Transparent huge pages

Recommended LWN article: Transparent huge pages in 2.6.38

Processors manage memory in small units called "pages" (which are 4 KB in size on x86). Each process has a virtual memory address space, and there is a "page table" that keeps the mapping between each virtual memory page and its corresponding real RAM page. The work of walking the page table to find out which RAM page corresponds to a given virtual address is expensive, so the CPU has a small cache (the TLB, or translation lookaside buffer) to store the result of that work for frequently accessed virtual addresses. However, this cache is not very big and it only supports 4KB pages, so many data-intensive workloads (databases, KVM) have performance problems because all their frequently accessed virtual addresses can't be cached.

To solve this problem, modern processors add cache entries that support pages bigger than 4KB (like 2MB/4MB). Until now, the only way userspace could use those pages in Linux was hugetlbfs, a filesystem-based API. This release adds support for transparent hugepages: hugepages are used automatically where possible. Transparent Huge Pages can be configured to be used always or only as requested with madvise(MADV_HUGEPAGE), and this behaviour can be changed online in /sys/kernel/mm/transparent_hugepage/enabled. For more details, check Documentation/vm/transhuge.txt.
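As a rough illustration of why bigger pages help, the sketch below computes how much memory a translation cache can cover for a given page size, and shows one way to read the current setting from the sysfs file mentioned above. The figure of 64 cache entries is an assumption picked for the arithmetic, not a property of any particular CPU, and the helper names are hypothetical. The sysfs file marks the active choice with square brackets, for example "always [madvise] never".

```python
import re

THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"

KB = 1024
MB = 1024 * KB


def tlb_reach(entries, page_size):
    """Bytes of memory covered when every cache entry maps one page."""
    return entries * page_size


def active_thp_mode(text):
    """Return the bracketed word from text such as 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_thp_mode(path=THP_ENABLED_PATH):
    """Return the active mode, or None if the file is missing or unreadable."""
    try:
        with open(path) as f:
            return active_thp_mode(f.read())
    except OSError:
        return None


# Assumed example: 64 entries, with 4 KB pages versus 2 MB pages.
print(tlb_reach(64, 4 * KB) // KB, "KB")          # 256 KB
print(tlb_reach(64, 2 * MB) // MB, "MB")          # 128 MB
print(active_thp_mode("always [madvise] never"))  # madvise
print(read_thp_mode())  # depends on the running kernel; None if unsupported
```

With the assumed 64 entries, the same cache covers 512 times more memory when each entry maps a 2 MB page instead of a 4 KB page.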

Code: Far too many to track here, see the patches from Andrea Arcangeli in this list (reverse chronological order)

1.5. Transparent spreading of outgoing network traffic across CPUs on multiqueue devices

This patch implements transmit packet steering (XPS) for multiqueue devices. XPS selects a transmit queue during packet transmission based on configuration. This is done by mapping the CPU transmitting the packet to a queue. This is the transmit side analogue to RPS -- where RPS is selecting a CPU based on receive queue, XPS selects a queue based on the CPU.

Each transmit queue can be associated with a number of CPUs which will use the queue to send packets. This is configured as a CPU mask on a per queue basis in /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus
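A minimal sketch of how such a per-queue CPU mask could be built, assuming the usual kernel bitmap text format (hexadecimal, with comma-separated 32-bit words once the mask no longer fits in one word). The helper names and the device name "eth0" are hypothetical. The example only prints the values and does not write to sysfs.

```python
def xps_cpu_mask(cpus):
    """Format CPU numbers as a hex bitmap in comma-separated 32-bit words."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu

    # Split into 32-bit words, least significant word first.
    words = []
    while True:
        words.append(mask & 0xFFFFFFFF)
        mask >>= 32
        if mask == 0:
            break

    # Print most significant word first; pad the lower words to 8 hex digits.
    words.reverse()
    head = format(words[0], "x")
    tail = [format(word, "08x") for word in words[1:]]
    return ",".join([head] + tail)


def xps_path(dev, queue):
    """Path of the xps_cpus file for one transmit queue of a device."""
    return f"/sys/class/net/{dev}/queues/tx-{queue}/xps_cpus"


# One TX queue per CPU on a 16-CPU machine, as in a 1:1 configuration.
for cpu in range(16):
    print(xps_path("eth0", cpu), xps_cpu_mask([cpu]))
```

Applying a configuration would mean writing each mask string to the matching file as root. A queue shared by CPUs 0 to 3 would get the mask "f".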

A netperf benchmark with 500 instances of the netperf TCP_RR test, with 1-byte requests and responses, on a 16-core AMD system:

  • XPS (16 queues, 1 TX queue per CPU): 1234K at 100% CPU

  • No XPS (16 queues): 996K at 100% CPU

Code: (commit)

1.6. B.A.T.M.A.N. mesh protocol

B.A.T.M.A.N. stands for "Better Approach To Mobile Adhoc Networking". An ad hoc network is a decentralized network that does not rely on a preexisting infrastructure, such as routers in wired networks or access points in managed (infrastructure) wireless networks. Instead, each node participates in routing by forwarding data for other nodes, so the determination of which nodes forward data is made dynamically based on the network connectivity. B.A.T.M.A.N. is a routing protocol implementation for these networks. B.A.T.M.A.N. is useful in emergency situations like natural disasters, military conflicts or Internet censorship. More information about this project can be found at http://www.open-mesh.org/

Code: (commit)

1.7. Support for AMD Fusion graphics

This release adds support for the graphics part of the AMD Fusion APUs (combined CPU+GPU chips).

2. Drivers and architectures

All the driver and architecture-specific changes can be found in the Linux_2_6_38-DriversArch page

3. Core

  • Add /proc/consoles: To see which character device lines are currently used for the system console /dev/console, you may simply look into this file (commit)

  • Add hole punching support to fallocate() (commit)

  • Script for automatic kernel testing: ktest.pl (commit)

  • Add boot-time XZ compression support (commit), (commit)

  • rcu: priority boosting for TINY_PREEMPT_RCU (commit), add tracing for TINY_RCU and TINY_PREEMPT_RCU (commit), demote SRCU_SYNCHRONIZE_DELAY from kernel-parameter status (commit)

  • oom: allow a non-CAP_SYS_RESOURCE process to oom_score_adj down (commit)

  • A new jhash implementation (commit)

  • ntp: add hardpps implementation (commit)
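The hole punching support listed above is requested through the mode argument of fallocate(2). The sketch below calls it from Python through ctypes, since the standard library does not wrap the punch-hole mode. It assumes a 64-bit Linux system where the C library exports fallocate(), and it returns False rather than raising when the call is unavailable or the filesystem does not support the operation. The function name punch_hole is made up for this example.

```python
import ctypes
import os
import tempfile

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02


def punch_hole(fd, offset, length):
    """Try to deallocate [offset, offset + length); return True on success."""
    libc = ctypes.CDLL(None, use_errno=True)
    fallocate = getattr(libc, "fallocate", None)
    if fallocate is None:
        return False  # C library without fallocate()
    # Assumes off_t is 64 bits wide, as on 64-bit Linux.
    fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                          ctypes.c_int64, ctypes.c_int64]
    fallocate.restype = ctypes.c_int
    # PUNCH_HOLE must be combined with KEEP_SIZE: the file size is unchanged.
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    return fallocate(fd, mode, offset, length) == 0


with tempfile.TemporaryFile() as f:
    f.write(b"x" * 8192)
    f.flush()
    punched = punch_hole(f.fileno(), 0, 4096)
    size_after = os.fstat(f.fileno()).st_size
    f.seek(0)
    head = f.read(4096)

print("punched:", punched)
print("size after:", size_after)
print("first block is zeros:", head == b"\0" * 4096)
```

When the call succeeds, the punched range reads back as zeros and the file keeps its length. Whether it succeeds depends on the kernel and on the filesystem that holds the temporary file.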

4. CPU scheduler

  • Improve cpu-cgroup performance for smp systems significantly by rewriting tg_shares_up (commit)

  • Remove long deprecated CLONE_STOPPED flag (commit)

  • Add sysctl_sched_shares_window for the shares window (commit)

5. Memory management

  • mlock(): do not hold the mmap_sem lock for extended periods of time while loading data into the page cache (commit), (commit)

  • Use compaction instead of lumpy reclaim (commit)

  • migration: allow migration to operate asynchronously and avoid synchronous compaction in the faster path (commit)

  • kswapd tweaking (commit)

  • smaps: export mlock information (commit)

  • Batch activate_page() to reduce zone->lru_lock contention (commit)

  • Trace events for memory compaction activity (commit)

6. Block

  • Implement media polling for removable devices in the kernel (commit)

  • Allow creation of hierarchical cgroups in the blk cgroup controller (commit)

  • Export a read-only sysfs attribute for partitions (commit)

Device Mapper (DM)

  • Significantly improve write throughput when writing to the origin with a snapshot on the same device (commit)

  • Improve sequential write throughput (commit)

  • dm-crypt: scale to multiple cpus (commit)

  • dm-crypt: add loop AES IV generator (commit)

  • RAID1: support discard (commit)

  • Skeleton for the DM target that will be the bridge from DM to MD (initially RAID456 and later RAID1). It provides a way to use device-mapper interfaces to the MD RAID456 drivers (commit)

7. File systems


  • Add manual SSD discard support via the FITRIM ioctl (commit)

  • Convert inode cache lookups to use RCU locking (commit)

  • Dynamic speculative EOF preallocation (commit)


  • Add strictcache mount option. In this mode the client reads from the cache whenever possible. For writes, the client stores data in the cache when possible (commit)

  • Add cruid= mount option (commit)


  • Speed up file creates by microoptimizing some functions (commit), (commit)

  • Add batched discard support for ext3 (commit)



  • Support the fiemap ioctl, used to get extent information for an inode (commit)

8. Networking

  • Increase default initial receive window. (commit)

  • Expose the per device configuration settings via netlink (commit), (commit)

  • IPv4: ECN-aware IP defragmentation (as per RFC3168) (commit)

  • Add 32/64 bit compatibility in the ipv4 multicast ioctl SIOCGETSGCNT (commit)

  • Enhance AF_PACKET implementation to not require high order contiguous memory allocation (v4) (commit)


  • Throughput based LED blink trigger (commit)

  • Let userspace enable and configure vendor specific path selection, in accordance with the version 7.0 of the 802.11s draft (commit)

  • Support hardware TX fragmentation offload (commit)

  • Report signal average (commit)

  • Notify for dropped Deauth/Disassoc (commit)

  • Add mesh join/leave configuration commands (commit)

  • dcbnl: add support for ieee8021Qaz attributes (commit)

9. Crypto

10. Virtualization

  • Asynchronous page faults, which allow a guest to continue processing interrupts even when its memory is being paged in; in the case of a Linux 2.6.38+ guest, it will receive a notification that the host is servicing a page fault, and may switch into another guest process (commit 1, 2, 3, 4, 5, 6, 7)

  • AMD Bulldozer virtualization extensions: instruction decode assist, clean bits, xsave/avx, flush-by-asid (commit), (commit)

  • lguest: --username and --chroot options, to drop privileges and chroot to a directory (commit)

11. Security


  • Address a number of long standing issues with the way Smack treats UNIX domain sockets (commit)

  • Introduce a new attribute SMACK64TRANSMUTE that instructs Smack to create the file with the label of the directory under certain circumstances. A new access mode, "t" for transmute, is made available to Smack access rules, which are expanded from "rwxa" to "rwxat". If a file is created in a directory marked as transmutable and if access was granted to perform the operation by a rule that included the transmute mode, then the file gets the Smack label of the directory instead of the Smack label of the creating process (commit)

  • Add a new security attribute to Smack called SMACK64EXEC. It defines the label that is used while the task is running (commit)

  • Add two new key types: trusted, which are random number symmetric keys, generated and RSA-sealed by the TPM (commit), and encrypted, which are kernel-generated random numbers encrypted/decrypted with a 'trusted' symmetric key (commit)

12. Tracing/perf

  • perf bench: Add feature that measures the performance of the arch/x86/lib/memcpy_64.S memcpy routines via 'perf bench mem' (commit)

  • perf: new, more generic power events (commit)

  • perf record: Add "nodelay" mode, disabled by default (commit)

  • perf record: Add option to disable collecting build-ids (commit)

  • perf stat: Add no-aggregation mode to -a (commit)

  • perf symbols: Add symfs option for off-box analysis using specified tree (commit)

  • tracing: Allow raw syscall trace events for non privileged users (commit), (commit), (commit)

  • oprofile: Add support for 6 counters (AMD family 15h) (commit)


last edited 2011-04-27 16:13:52 by Morot