Linux 2.6.37 released 4 January, 2011.
Summary: Linux 2.6.37 includes several SMP scalability improvements for Ext4 and XFS, an option to compile the kernel with the Big Kernel Lock disabled, support for per-cgroup IO throttling, a network block device based on the Ceph cluster filesystem, several Btrfs improvements, more efficient static probes, perf probe support for modules and for listing accessible local and global variables, hibernation image compression using LZO, PPP over IPv4 support, several networking microoptimizations and many other small changes, improvements and new drivers.
Contents
- Prominent features (the cool stuff)
- Ext4: better SMP scalability, faster mkfs
- XFS scalability improvements
- No BKL (Big Kernel Lock)
- A Ceph-based network block device
- I/O throttling support
- "Jump label": disabled tracepoints don't impact performance
- Btrfs Updates
- Perf probe improvements
- Power management improvements: LZO hibernation compression, delayed autosuspends
- Support for PPP over IPv4
- Enable Fanotify API
- Drivers and architectures
- Core
- VFS scalability work
- CPU scheduler
- Memory management
- File systems
- Networking
- Block
- Crypto
- Virtualization
- Security
- Tracing/perf
1. Prominent features (the cool stuff)
1.1. Ext4: better SMP scalability, faster mkfs
Better SMP scalability: In this release Ext4 uses the "bio" layer directly instead of the intermediate "buffer" layer. The "bio" layer (short for Block I/O: it's the part of the kernel that sends requests to the I/O scheduler) was one of the first features merged in the Linux 2.5.1 kernel. The buffer layer has a lot of performance and SMP scalability issues that are solved by this port. An FFSB benchmark on a 48-core AMD box using a 24 SAS-disk hardware RAID array with 192 simultaneous ffsb threads speeds up by 300% (400% with journaling disabled), while reducing CPU usage by a factor of 3-4. Code: (commit)
Faster mkfs: One of the slowest parts of creating a new Ext4 filesystem is initializing the inode tables. mkfs can now skip this step and leave the inode tables uninitialized. When the filesystem is mounted for the first time, the kernel starts a kernel thread, ext4lazyinit, which initializes the tables. Code: (commit)
Add batched discard support for ext4 (commit), (commit), (commit)
1.2. XFS scalability improvements
Scalability of metadata-intensive workloads has been improved. On an 8-way machine, an fs_mark run creating 50 million files was improved by over 15%, and removal of those files by over 100%. More scalability improvements are expected in 2.6.38.
Code: (list of commits)
1.3. No BKL (Big Kernel Lock)
The Big Kernel Lock is a giant lock that was introduced in Linux 2.0, when Alan Cox first introduced SMP support. It was only a first step towards SMP scalability: in Linux 2.0 only one process can run kernel code at a time, and in the long term the BKL had to be replaced by fine-grained locking so that multiple processes can run kernel code in parallel. In this version, it is possible to compile a kernel completely free of BKL support. Note that this has no performance impact: all the critical Linux codepaths have been BKL-free for a long time. The BKL was still used in many places that are not performance critical (ioctls, drivers, non-mainstream filesystems, etc.), and those are the ones being cleaned up in this version. In these places the BKL is being replaced with mutexes, which doesn't improve parallelism (these places are not performance critical anyway).
Code: (commit)
1.4. A Ceph-based network block device
Ceph is a distributed network filesystem that was merged in Linux 2.6.34. In the Ceph design there are "object storage devices" and "metadata servers" which store metadata about the storage objects. Ceph uses these to implement its filesystem; however, these objects can also be used to implement a network block device (or even Amazon S3-compatible object storage).
This release introduces the Rados block device (RBD). RBD lets you create a block device that is striped over objects stored in a Ceph distributed object store. In contrast to alternatives like iSCSI or AoE, RBD images are striped and replicated across the Ceph object storage cluster, providing reliable (if one node fails it still works), scalable, and thinly provisioned access to block storage. RBD also supports read-only snapshots with rollback, and there are also Qemu patches to create a VM block device stored in a Ceph cluster.
Code: (commit)
1.5. I/O throttling support
I/O throttling support has been added. It makes it possible to set upper read/write limits for a group of processes, which can be useful in many setups. Example:
{{{
Mount the cgroup blkio controller
# mount -t cgroup -o blkio none /cgroup/blkio

Specify a bandwidth rate on particular device for root group. The format for policy is "<major>:<minor> <bytes_per_second>"
# echo "8:16 1048576" > /cgroup/blkio/blkio.throttle.read_bps_device

Above will put a limit of 1MB/second on reads happening for root group on device having major/minor number 8:16.
}}}
The limits can also be set in IO operations per second (blkio.throttle.read_iops_device). There are also write equivalents: blkio.throttle.write_bps_device and blkio.throttle.write_iops_device. This feature does not replace the IO weight controller merged in 2.6.33.
Code: (commit 1, 2, 3, 4, 5, 6)
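The policy format used in the example above can be built programmatically. The following is a minimal sketch, not part of the kernel interface: the helper names `throttle_rule` and `device_numbers` are hypothetical, and the script only builds the string; writing it to the cgroup file still requires root and a mounted blkio controller.

```python
import os

MIB = 1024 * 1024  # bytes in one mebibyte


def throttle_rule(major: int, minor: int, bytes_per_second: int) -> str:
    """Build a "<major>:<minor> <bytes_per_second>" policy line (hypothetical helper)."""
    if bytes_per_second < 0:
        raise ValueError("rate must be non-negative")
    return f"{major}:{minor} {bytes_per_second}"


def device_numbers(rdev: int) -> tuple:
    """Split a raw device number (e.g. st_rdev from os.stat) into (major, minor)."""
    return os.major(rdev), os.minor(rdev)


# 8:16 is the device from the example above; the limit is 1 MiB/s.
# os.makedev() stands in for os.stat("/dev/...").st_rdev so the sketch runs anywhere.
rule = throttle_rule(*device_numbers(os.makedev(8, 16)), 1 * MIB)
print(rule)
```

On a real system the device number would come from `os.stat()` on the block device node, and the resulting line would be written to the relevant throttle file of the cgroup.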
1.6. "Jump label": disabled tracepoints don't impact performance
A tracepoint can be described as a special printf() call that is used inside the kernel, together with tools like perf, LTT or systemtap, to analyze the system behaviour. There are two types of tracepoints: dynamic and static. Dynamic tracepoints modify the kernel code at runtime, inserting CPU instructions where necessary to obtain the data. Dynamic tracepoints are called 'kprobes' in the Linux kernel, and their performance overhead was optimized in Linux 2.6.34.
Static tracepoints, on the other hand, are inserted by the kernel developers by hand in strategic points of the code. For example, Ext4 has 50 static tracepoints. These tracepoints are compiled with the rest of the kernel code, and by default they are "disabled" - until someone activates them, they are not called. Basically, an 'if' condition tests a variable. The performance impact is nearly negligible, but it can be improved, and that's what the "jump label" feature does: A "no operation" CPU instruction is inserted in place of the conditional test, so a disabled static tracepoint has zero overhead. (Tip: You can use the "sudo perf list" command to see the full list of static tracepoints available in your system)
Recommended LWN article: Jump label
Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
1.7. Btrfs Updates
Btrfs now stores free space data on disk to make caching a block group much quicker. Previously, when Btrfs had to allocate from a block group that had not been cached yet, it had to scan the entire extent tree. Now the free space cache is written to disk for every dirtied block group each time a transaction is committed, and the scan is not necessary. This is a disk format change; however, it is safe to boot into old kernels, which will just generate the cache the old-fashioned way. For now the feature is disabled by default and needs to be turned on with the -o space_cache mount option. There is also a new -o clear_cache debug option that clears all the caches on mount. Code: (commit 1, 2, 3, 4)
Support for asynchronous snapshot creation. This makes it possible to avoid waiting for a new snapshot to be committed to disk. It has been developed with the Ceph storage daemon in mind, but it's also available to users by adding "async" to the "btrfs subvolume snapshot" command. Code: (commit 1, 2)
Allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed (commit)
Switch the extent buffer rbtree to a radix tree and use the RCU lock instead of the spin lock: this reduces the CPU time spent in the extent buffer search and improves performance for some operations. Code: (commit)
Chunk allocation tuning: Mixed data+metadata block groups are supported (useful for small storage devices) (commit), don't allocate chunks as aggressively (avoids early -ENOSPC cases due to overallocation of space for metadata) (commit), (commit)
1.8. Perf probe improvements
Show accessible local and global variables: A "-V" ("--vars") option has been added for listing the local variables accessible at a given probe point. This helps in finding which local variables are available as event arguments. For example: "# perf probe -V call_timer_fn:23" will show all the local variables at that point of the function. In addition, global variables can also be shown by adding the "--externs" argument (commit), (commit), (commit)
Module support: It's possible to set a probe inside modules, using the "--module" option. For example, "# ./perf probe --module drm drm_vblank_info:3 node m" (commit)
1.9. Power management improvements: LZO hibernation compression, delayed autosuspends
Several power-management related features have been added.
Delayed device autosuspends: This improves the runtime power management feature added in Linux 2.6.32. Some drivers do not want their device to suspend as soon as it becomes idle at run time; they want the device to remain inactive for a certain minimum period of time first. This is what this feature does (commit)
Compress hibernation image with LZO (commit)
1.10. Support for PPP over IPv4
This version introduces PPP over IPv4 support (PPTP). It dramatically speeds up PPTP VPN connections and decreases CPU usage compared to the existing user-space implementations (poptop/pptpclient). The accel-pptp project makes use of this module: it contains a plugin for pppd to use PPTP in client mode and a modified pptpd (poptop) to build a high-performance PPTP NAS.
Code: (commit)
1.11. Enable Fanotify API
Fanotify was included in the previous version, but it was disabled before the release due to concerns about the API. The concerns have been solved and Fanotify has been enabled.
Code: (commit)
2. Drivers and architectures
All the driver and architecture-specific changes can be found in the Linux_2_6_37-DriversArch page
3. Core
init: add support for root devices specified by partition UUID (commit)
sysvipc: add RSS and swap size information to /proc/sysvipc/shm (commit)
cgroups: make swap accounting configurable (commit)
Remove CONFIG_SYSFS_DEPRECATED_V2 but keep it for block devices (commit)
Allow boot time switching between deprecated and modern sysfs layout (commit)
CPUfreq: Add sampling_down_factor tunable to improve ondemand performance (commit)
rcu: Add a TINY_PREEMPT_RCU, a small-memory-footprint uniprocessor-only implementation of preemptible RCU (commit)
- fs
Add FITRIM ioctl (commit)
Allow for more than 2^31 files (commit), (commit)
4. VFS scalability work
Convert nr_inodes and nr_unused to per-cpu counters (commit)
Implement lazy LRU updates for inodes (commit)
Introduce a per-cpu last_ino allocator (commit)
Use percpu counter for nr_dentry and nr_dentry_unused (commit)
Inode split IO and LRU lists (commit)
5. CPU scheduler
Do not account IRQ time to current task: Scheduler accounts both softirq and interrupt processing times to the currently running task. Change sched task accounting to account only actual task time from currently running task (commit). Also, remove IRQ time from available CPU power (commit)
Try not to migrate higher priority RT tasks to other CPUs (commit)
Add book scheduling domain: On top of the SMT and MC scheduling domains this adds the BOOK scheduling domain. This is useful for NUMA like machines which do not have an interface which tells which piece of memory is attached to which node or where the hardware performs striping (commit)
6. Memory management
Retry page fault when blocking on disk transfer. This change reduces mmap_sem hold times caused by waiting for disk transfers when accessing file-mapped VMAs. Benchmark: in a microbenchmark, thread A mmaps a large file and does random read accesses to the mmaped area (about 55 iterations/s), while thread B does mmap/munmap in a loop at a separate location. Thread B achieves 55 iterations/s before this patch and 15000 iterations/s with it (commit)
Stack based kmap_atomic() (commit)
Extend page migration code to support hugepage migration (commit)
Add two counters to /proc/vmstat: nr_dirtied (page dirtyings since bootup) and nr_written (page writings since bootup). These entries allow user apps to understand writeback behaviour over time and learn how it is impacting their performance (commit), (commit)
Report dirty thresholds in /proc/vmstat (nr_dirty_threshold and nr_dirty_background_threshold) (commit)
Add pernode vmstat file (with nr_dirtied and nr_written) in /sys/devices/system/node/<node>/vmstat (commit)
Add trace events for LRU list shrinking (commit)
/proc/pid/smaps: export amount of anonymous memory (commit)
/proc/stat: Make reading /proc/stat scalable (commit), fix scalability of irq sum of all cpu (commit)
/proc/swaps: support polling (commit)
Use percpu allocator on UP too (commit)
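As a rough illustration of how a user application might use the nr_dirtied and nr_written counters mentioned above, here is a minimal sketch that parses /proc/vmstat-style text and computes the change between two samples. The function names are hypothetical and the sample values are made up; on a real system the text would be read from /proc/vmstat at two points in time.

```python
def parse_vmstat(text):
    """Parse "name value" lines (the /proc/vmstat layout) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def writeback_delta(before, after, keys=("nr_dirtied", "nr_written")):
    """Return how much each counter grew between two parsed samples."""
    return {key: after[key] - before[key] for key in keys}


# Made-up sample data standing in for two reads of /proc/vmstat.
sample_before = "nr_dirtied 1000\nnr_written 900\n"
sample_after = "nr_dirtied 1500\nnr_written 1300\n"

delta = writeback_delta(parse_vmstat(sample_before), parse_vmstat(sample_after))
print(delta)
```

If pages are dirtied faster than they are written over a long interval, dirty pages are accumulating.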
7. File systems
XFS
Remove experimental tag from the delaylog option (commit)
Extend project quotas to support 32bit project ids (commit)
Introduce XFS_IOC_ZERO_RANGE (commit)
Lockless per-ag lookups (commit)
convert buffer cache hash to rbtree (commit)
OCFS2
Allow huge (> 16 TiB) volumes to mount (commit)
Add a mount option "coherency=*" to handle cluster coherency for O_DIRECT writes. (commit)
Add new OCFS2_IOC_INFO ioctl: offers non-privileged end users a way to gather filesystem information (commit)
Add support for heartbeat=global mount option (commit)
EXT4
Add interface to advertise ext4 features in sysfs (commit)
Use dedicated slab caches for group_info structures (commit)
CIFS
Add "mfsymlinks" mount option (commit)
Add "multiuser" mount option (commit)
Allow binding to local IP address. (commit)
NFS
Readdir plus in NFSv4 (commit)
New idmapper (commit)
Introduce mount option '-olocal_lock' to make locks local (commit), (commit)
Remove spkm3 (commit)
Allow deprecated syscall interface to be compiled out (commit)
GFS2
fallocate() support (commit)
NILFS2
Add bdev freeze/thaw support (commit)
8. Networking
TCP: Update the use of larger initial windows, as originally specified in RFC 3390, to use the newer IW values specified in RFC 5681, section 3.1 (commit)
TCP: Provide "user timeout" support as described in RFC 793 with a new TCP_USER_TIMEOUT socket option. TCP_USER_TIMEOUT takes an unsigned int specifying the maximum amount of time in ms that transmitted data may remain unacknowledged before TCP will forcefully close the corresponding connection and return ETIMEDOUT to the application (commit)
TCP: Allow effective reduction of TCP's rcv-buffer via setsockopt (commit)
Implement Any-IP support for IPv6. AnyIP is the capability to receive packets and establish incoming connections on IPs we have not explicitly configured on the machine (commit)
Added IPv6 support to the TPROXY target (commit)
IPVS: IPv6 tunnel mode (commit)
IPv4: Allow configuring subnets as local addresses. For instance, to configure a host to respond to any address in 10.1/16 received on eth0 as a local address we can do: "ip rule add from all iif eth0 lookup 200; ip route add local 10.1/16 dev lo proto kernel scope host src 127.0.0.1 table 200" (commit)
AF_UNIX: Implement SO_TIMESTAMP and SO_TIMESTAMPNS on Unix sockets (commit)
ctnetlink: add support for user-space expectation helpers (commit), add expectation deletion events (commit)
Enable Generic Receive Offload by default for vlan devices (commit)
ethtool: Add support for vlan acceleration (commit)
sctp: implement SIOCINQ ioctl() (commit)
tipc: add SO_RCVLOWAT support to stream socket receive path (commit)
Infiniband: Add 802.1q VLAN support to Infiniband over Ethernet (commit 1, 2, 3, 4)
- Allocate skbs on local node: With multiqueue NICs, or using RPS to spread the load it has not sense
Phonet: Implement Pipe Controller to support Nokia Slim Modems (commit), (commit)
bonding: enable generic receive offload by default (commit), allow sysadmins to configure the number of multicast membership reports sent on a link failure event (commit)
vlan: Enable software emulation for vlan acceleration (commit)
sched: update packets checksums after some direct packet alterations (configurable) (commit)
9P: Add Direct IO support for non-cached operations (commit), implement TGETLOCK (commit), implement TLOCK (commit), implement TREADLINK operation for 9p2000.L (commit), introduce client side TFSYNC/RFSYNC for dotl (commit), implement TLERROR/RLERROR on the 9P client (commit), implement POSIX ACL client checks (commit 1, 2, 3, 4, 5, 6)
Many routing, neighbour, and device handling optimizations on SMP (commit), (commit), (commit), (commit), (commit), (commit), (commit), (commit), (commit)
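The TCP_USER_TIMEOUT socket option listed above can be set from user space like any other TCP-level option. The following is a minimal sketch, assuming a Linux system with kernel 2.6.37 or newer. The helper name `set_user_timeout` is hypothetical, and the fallback value 18 is the option number from linux/tcp.h, used in case the Python build does not export the constant.

```python
import socket

# Option number 18 comes from linux/tcp.h; used only if Python lacks the constant.
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def set_user_timeout(sock, timeout_ms):
    """Set the maximum time (ms) sent data may stay unacknowledged, then read it back."""
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms)
    return sock.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT)


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    # 30 seconds: after this long without an ACK, TCP closes the connection
    # and the application sees ETIMEDOUT.
    effective_ms = set_user_timeout(s, 30_000)
    print(effective_ms)
```

The option is set on an unconnected socket here only to keep the sketch self-contained; in practice it would be applied to a socket before or after connecting to a peer.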
9. Block
cfq: improve fsync performance for small files (commit)
Kill block barriers and replace them with a REQ_FLUSH/FUA based interface. See this LWN article for more details (commit), (commit), (commit)
10. Crypto
Adding the AEAD interface type support to cryptd (commit)
OMAP2/3 AES hw accelerator driver (commit)
11. Virtualization
vmware: Remove deprecated VMI kernel support (commit)
KVM
MMU: support enabling/disabling MMU audit dynamically (commit)
PPC: Magic Page Book3s support (commit)
S390: Add virtio hotplug add support (commit)
12. Security
SELinux
Fast status update interface (/selinux/status) (commit)
Implement mmap on /selinux/policy (commit)
Allow userspace to read policy back out of the kernel (commit)
13. Tracing/perf
tracing: Graph support for wakeup tracer (commit)
perf: Add a script to show packets processing (commit)