Summary: This release includes support for Ext4 block sizes bigger than 4KB and up to 1MB, which improves performance with big files; btrfs has been updated with faster scrubbing, automatic backup of critical filesystem metadata and tools for manual inspection of filesystems; the process scheduler has added support for setting upper limits on CPU time; desktop responsiveness in the presence of heavy writes has been improved; TCP has been updated to include an algorithm which speeds up the recovery of a connection after lost packets; the profiling tool "perf top" has added support for live inspection of tasks and libraries and for viewing annotated assembly code; the Device Mapper has added support for 'thin provisioning' of storage; and a new architecture has been added: the Hexagon DSP processor from Qualcomm. Other drivers and small improvements and fixes are also available in this release.


1. Prominent features in Linux 3.2

1.1. Ext4: Support for bigger block sizes

Recommended LWN article: [ Improving ext4: bigalloc, inline data, and metadata checksums]

The maximum size of a filesystem block in Ext4 has always been 4 KB on x86 systems. But the storage capacity of modern hard disks is growing fast, and as disks grow, so does the overhead of using such a small block size. Small block sizes benefit users who have many small files, because the space is used more efficiently, but people who use large files would benefit from larger block sizes.

Ext4 now supports block sizes of up to 1MB, which considerably decreases [ the time spent doing block allocations] and reduces fragmentation. These new block sizes must be set at creation time, using the mkfs -C option (requires e2fsprogs 1.42). This feature is not backwards compatible with older kernels. Code: [ (commit 1], [ 2], [ 3], [ 4], [ 5], [ 6], [ 7], [ 8], [ 9], [ 10], [ 11], [ 12], [ 13], [ 14], [ 15], [ 16], [ 17)]
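
The trade-off between small and large allocation units can be illustrated with some arithmetic. The sketch below is only an illustration: allocation_cost is a hypothetical helper, not part of the kernel or e2fsprogs. It counts how many units a file needs and how much space is lost to rounding up to a whole unit.

```python
def allocation_cost(file_size, unit_size):
    """Return (units_needed, wasted_bytes) for a file stored in fixed-size units."""
    units = -(-file_size // unit_size)      # ceiling division
    wasted = units * unit_size - file_size  # space lost in the last unit
    return units, wasted

KB = 1024
MB = 1024 * KB

# A 1 GB file needs 262144 units of 4 KB, but only 1024 units of 1 MB.
big = 1024 * MB
print(allocation_cost(big, 4 * KB))    # (262144, 0)
print(allocation_cost(big, 1 * MB))    # (1024, 0)

# A 1 KB file wastes 3 KB with 4 KB units, and almost 1 MB with 1 MB units.
small = 1 * KB
print(allocation_cost(small, 4 * KB))  # (1, 3072)
print(allocation_cost(small, 1 * MB))  # (1, 1047552)
```

Fewer units per file means fewer allocation decisions for large files, at the price of more wasted space for small ones.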

1.2. Btrfs: Faster scrubbing, automatic backup of tree roots, detailed corruption messages, manual inspection of metadata

Recommended LWN article: [ A btrfs update at LinuxCon Europe]

Scrub read-ahead

Scrubbing, the process of checking all the checksums of the filesystem, now uses read-ahead to improve performance. The average disk bandwidth utilisation on a test volume was raised from 70% to 90%. On another volume, the time for a test run went down from 89 seconds to 43 seconds. Code: [ (commit 1], [ 2], [ 3], [ 4)]
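
The idea of scrubbing as checksum verification can be sketched in a few lines. This is a toy model, not btrfs code: it uses zlib.crc32 as a stand-in checksum (btrfs uses crc32c), assumes a 4096-byte block size, and does not model read-ahead at all. The names checksum_blocks and scrub are hypothetical.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size for this illustration

def checksum_blocks(data):
    """Split data into blocks and return one checksum per block."""
    return [zlib.crc32(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)]

def scrub(data, stored_checksums):
    """Return the indexes of blocks whose contents no longer match."""
    current = checksum_blocks(data)
    return [i for i, (now, stored) in enumerate(zip(current, stored_checksums))
            if now != stored]

disk = bytearray(b"A" * BLOCK_SIZE * 4)
sums = checksum_blocks(bytes(disk))   # checksums stored when data was written
disk[BLOCK_SIZE * 2 + 10] ^= 0xFF     # corrupt one byte inside block 2
print(scrub(bytes(disk), sums))       # [2]
```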

Log of past tree roots

Btrfs will store in the filesystem superblock information about most of the tree roots in the last four commits. A "-o recovery" mount option has been added to allow a user to use the root history log when the filesystem is not able to read the tree of tree roots, the extent tree root, the device tree root or the csum root. Code: [ (commit)]
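
A root history log of this kind behaves like a small ring buffer. The sketch below is a conceptual model only; the Superblock class, its methods and the "root@N" labels are hypothetical and do not reflect the on-disk format. It keeps the roots of the last four commits and, on recovery, falls back to the newest one that can still be read.

```python
from collections import deque

BACKUP_SLOTS = 4  # the last four commits are kept

class Superblock:
    def __init__(self):
        # Old entries are dropped automatically once four are stored.
        self.backups = deque(maxlen=BACKUP_SLOTS)

    def commit(self, generation, tree_root):
        self.backups.append((generation, tree_root))

    def recover(self, is_readable):
        """Return the newest (generation, root) whose root can still be read."""
        for generation, root in reversed(self.backups):
            if is_readable(root):
                return generation, root
        return None

sb = Superblock()
for gen in range(1, 7):
    sb.commit(gen, "root@%d" % gen)

damaged = {"root@6", "root@5"}  # pretend the two newest roots are unreadable
print(sb.recover(lambda root: root not in damaged))  # (4, 'root@4')
```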

Detailed corruption messages

Btrfs has always had [ "back references"] that make it possible to find which files or btrees actually reference a given block, but until now walking those references has been a manual process. Code to follow these backrefs has been added, with improved messages as a result. Code: [ (commit 1], [ 2)]

For example, after scribbling over the blocks of one file on the disk and starting a scrub, instead of just reporting that block xxyyzz is bad, the kernel will now print this:

btrfs: checksum error at logical 5085110272 on dev /dev/sde, sector 2474832, root 5, inode 32583, offset 0, length 4096, links 1 (path: default/kernel-0/Makefile)

Manual inspection of the filesystem

As part of the previous feature, some code has also been added to allow manual inspection of the filesystem from userspace utilities. Code: [ (commit)]

To find the file that belongs to extent 5085110272, you can run:

btrfs inspect logical 5085110272 /mnt

Or to find the filename for inode number 32583:

btrfs inspect inode 32583 /mnt

Performance improvements

Performance improvements have been made in several areas, especially for random write workloads.

1.3. Process bandwidth controller

Recommended LWN article: [ CFS bandwidth control]

The process scheduler divides the available CPU bandwidth between all processes that need to run. There are no limits on how much CPU bandwidth each process gets if there is free bandwidth available, because all processes are supposed to want as much as possible. But some companies, like Google, have scenarios where this unbounded allocation of CPU bandwidth may lead to unacceptable utilization or latency variation.

CPU bandwidth control solves this problem by allowing an explicit maximum limit to be set for allowable CPU bandwidth. The bandwidth allowed for a group of processes is specified using a quota and a period. Within each given "period" (microseconds), a group is allowed to consume only up to "quota" microseconds of CPU time. When the CPU bandwidth consumption of a group exceeds this limit (for that period), the tasks belonging to its hierarchy will be throttled and are not allowed to run again until the next period. Documentation: [ Documentation/scheduler/sched-bwc.txt]. Code: [ (commit 1], [ 2], [ 3], [ 4], [ 5], [ 6], [ 7], [ 8], [ 9], [ 10], [ 11], [ 12], [ 13], [ 14], [ 15], [ 16)]
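
The quota/period mechanism can be sketched as a small simulation. This is a simplified model, not scheduler code: it assumes a single CPU, and the simulate function and the carrying over of unmet demand to the next period are assumptions made for the illustration. In the kernel, the two values are set per group through the cpu.cfs_quota_us and cpu.cfs_period_us files described in the documentation referenced above.

```python
def simulate(quota_us, demand_us_per_period):
    """Return a (ran_us, throttled) pair for each period.

    demand_us_per_period lists how much CPU time the group would like to
    use in each period. Work that could not run because the group was
    throttled is carried over to the next period.
    """
    results = []
    backlog = 0
    for demand in demand_us_per_period:
        wanted = backlog + demand
        ran = min(wanted, quota_us)    # never more than the quota
        throttled = wanted > quota_us  # the group hit its limit
        backlog = wanted - ran
        results.append((ran, throttled))
    return results

# quota = 250 ms in each 500 ms period: at most 50% of one CPU.
print(simulate(250000, [100000, 400000, 300000, 0]))
# [(100000, False), (250000, True), (250000, True), (200000, False)]
```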

1.4. New architecture: Hexagon

Recommended LWN article: [ Upcoming DSP architectures]

[ Qualcomm's Hexagon home page]

The Hexagon processor is a general-purpose digital signal processor designed for high performance and low power across a wide variety of applications. It merges the numeric support, parallelism, and wide computation engine of a DSP, with the advanced system architecture of a modern microprocessor.

Code: [ arch/hexagon]

1.5. Thin provisioning and recursive snapshots in the Device Mapper

Typically, provisioning storage capacity to multiple users can be inefficient. For example, if 10 users need 10 GB each, you will need 100 GB of storage capacity. These users, however, very probably won't use all of that storage space. Let's suppose that, on average, they only use 50% of their allocated space: only 50 GB will be used, and the other 50 GB will be underutilized.

Thin provisioning makes it possible to assign to all users combined more storage capacity than the total storage capacity of the system. In the previous case, you could buy only 50 GB of storage, let each user have 10 GB of theoretical storage space (100 GB in total), and have no problems, because the 50 GB you bought are enough to satisfy the real demand for storage. And if users increase their demand, you can add more storage capacity. Thanks to thin provisioning, you can optimize your storage investment and avoid over-provisioning.
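
The example above can be modelled with a pool that hands out physical blocks only when a virtual block is written for the first time. This is a conceptual sketch, not the Device Mapper implementation; the ThinPool class and its methods are hypothetical, and one "block" stands for 1 GB.

```python
class ThinPool:
    """A pool of physical blocks shared by thinly provisioned volumes."""

    def __init__(self, physical_blocks):
        self.free = physical_blocks
        self.volumes = {}  # name -> (virtual size, {virtual block: data})

    def create_volume(self, name, virtual_blocks):
        # Creating a volume consumes no physical space.
        self.volumes[name] = (virtual_blocks, {})

    def write(self, name, block, data):
        size, mapping = self.volumes[name]
        if not 0 <= block < size:
            raise IndexError("block outside the volume")
        if block not in mapping:  # first write: allocate a physical block
            if self.free == 0:
                raise RuntimeError("pool exhausted: add more storage")
            self.free -= 1
        mapping[block] = data

    def provisioned(self):
        return sum(size for size, _ in self.volumes.values())

pool = ThinPool(physical_blocks=50)
for user in range(10):
    pool.create_volume("user%d" % user, virtual_blocks=10)
    for block in range(5):  # each user fills half of the volume
        pool.write("user%d" % user, block, b"x")

print(pool.provisioned())  # 100 blocks promised to users
print(pool.free)           # 0 of the 50 physical blocks left
```

Once the pool is full, overwriting an already allocated block still works, but a write to a new block fails until storage is added.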

Linux 3.2 adds experimental support for thin provisioning in the DM layer. Users will be able to create multiple thinly provisioned volumes out of a storage pool. Another significant feature included in the thin-provision DM target is support for an arbitrary depth of recursive snapshots (snapshots of snapshots of snapshots...), which avoids degradation with depth. Code: [ (commit 1], [ 2], [ 3)]

1.6. I/O-less dirty throttling, reduce filesystem writeback from page reclaim

Recommended LWN article: [ No-I/O dirty throttling]

"Writeback" is the process of writing buffered data from RAM to the disk, and in this context throttling means temporarily blocking processes so that they stop creating new data that needs to be written, until the current data has been written to the disk.

A critical part of the writeback code is deciding how much data pending to be written can be held in RAM. In this kernel, the algorithms that make that decision have been rewritten (check the LWN article for more details). As a result, IO seeks and CPU contention should be greatly reduced. Users will notice a more responsive system during heavy writeback; "killall dd" will take effect instantly. Users may also notice much smoother pause times in workloads that have the write() syscall inside their loop, and also with NFS, JBOD and concurrent dd's. Lock contention and cache bouncing in concurrent IO workloads have been much improved. Code: [ (commit 1], [ 2], [ 3], [ 4], [ 5], [ 6], [ 7], [ 8], [ 9], [ 10], [ 11], [ 12], [ 13], [ 14)]
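
The general idea of throttling a writer by making it sleep, instead of making it do the I/O itself, can be sketched as follows. This is a deliberately simplified model and not the kernel's algorithm: it assumes that every writer gets an equal share of the disk write bandwidth and ignores the feedback that depends on how many dirty pages are pending. The function pause_seconds and its parameters are hypothetical.

```python
def pause_seconds(pages_dirtied, write_bandwidth_pages_per_s, writers):
    """Sleep time that keeps one writer at its fair share of disk bandwidth."""
    # Assumption: each writer may dirty pages at an equal share of the
    # rate at which the disk can write them back.
    task_ratelimit = write_bandwidth_pages_per_s / writers
    return pages_dirtied / task_ratelimit

# A disk that writes 25600 pages/s (100 MB/s with 4 KB pages), 4 writers,
# each of them dirtying 32 pages between two checks:
print(pause_seconds(32, 25600, 4))  # 0.005
```

Short, regular pauses like this are what make the pause times of a write() loop smoother than blocking on large bursts of I/O.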

There has also been work to reduce filesystem writeback from page reclaim, which also improves performance in many cases. Code: [ (commit 1], [ 2], [ 3], [ 4], [ 5], [ 6], [ 7)]

1.7. TCP Proportional Rate Reduction

Recommended LWN article: [ LPC: Making the net go faster]

TCP tries to achieve the maximum bandwidth of a network link by increasing the send rate until the link starts losing packets. When a packet is lost, TCP slows down and then tries to slowly increase the speed again.

This system works well, but in some cases where packets are lost, it takes too much time to recover the maximum speed. Google has developed an alternative recovery algorithm, called "Proportional Rate Reduction", which improves latency and the time to recover. For more information, you can check [ an IETF draft], two slides ([ 1], [ 2]), or the [ LWN article]. Code: [ (commit)]
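
The core of Proportional Rate Reduction can be sketched from its IETF description (later published as RFC 6937). The code below is a minimal illustration and not the kernel implementation: the PRRState class is hypothetical, all quantities are counted in segments, and the caller is assumed to supply "pipe", the estimate of data currently in flight.

```python
class PRRState:
    def __init__(self, ssthresh, recover_fs):
        self.ssthresh = ssthresh      # target window after the reduction
        self.recover_fs = recover_fs  # data in flight when recovery started
        self.prr_delivered = 0        # data delivered since recovery started
        self.prr_out = 0              # data sent since recovery started

    def on_ack(self, delivered, pipe, mss=1):
        """Return how many segments may be sent in response to this ACK."""
        self.prr_delivered += delivered
        if pipe > self.ssthresh:
            # Proportional part: send ssthresh/recover_fs segments for
            # each segment delivered (integer ceiling division).
            target = -(-(self.prr_delivered * self.ssthresh) // self.recover_fs)
            sndcnt = target - self.prr_out
        else:
            # Slow-start reduction bound: catch up towards ssthresh.
            limit = max(self.prr_delivered - self.prr_out, delivered) + mss
            sndcnt = min(self.ssthresh - pipe, limit)
        return max(sndcnt, 0)

    def on_send(self, sent):
        self.prr_out += sent

# 20 segments were in flight when one was lost; the window is halved to 10.
state = PRRState(ssthresh=10, recover_fs=20)
pipe = 19
sent_per_ack = []
for _ in range(6):
    pipe -= 1                    # each ACK reports one delivered segment
    n = state.on_ack(delivered=1, pipe=pipe)
    state.on_send(n)
    pipe += n
    sent_per_ack.append(n)
print(sent_per_ack)              # [1, 0, 1, 0, 1, 0]
```

With the window being halved, one segment is sent for every two ACKs, so the send rate is reduced gradually instead of stalling until half of the window has drained.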

1.8. Improved live profiling tool "perf top"

The live profiling tool "perf top" has been rewritten and improved. Beyond the prettier output, it has the ability to navigate while data capture is going on, and the new ability to zoom into tasks and libraries. Users can even see annotated assembly code, hit enter on a CALLQ instruction and get moved to the called function's annotated assembly code. This works recursively, so users can explore the assembly code arbitrarily deep. Code: many different commits

1.9. Cross memory attach

Cross memory attach adds two syscalls, process_vm_readv and process_vm_writev, which allow reading from and writing to another process's address space. The basic idea behind cross memory attach is to allow MPI programs doing intra-node communication to do a single copy of the message rather than a double copy of the message via shared memory. Code: [ (commit)]
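
A minimal sketch of calling process_vm_readv from userspace follows. It assumes a Linux system whose C library exports a wrapper for the syscall, and it uses ctypes to call it. For simplicity the example reads from its own address space; reading from another process requires the appropriate permissions. The helper name read_remote is hypothetical.

```python
import ctypes
import os

class IOVec(ctypes.Structure):
    """Mirror of the C 'struct iovec'."""
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]

libc = ctypes.CDLL(None, use_errno=True)
libc.process_vm_readv.argtypes = [
    ctypes.c_int,                          # pid
    ctypes.POINTER(IOVec), ctypes.c_ulong, # local iovec array and count
    ctypes.POINTER(IOVec), ctypes.c_ulong, # remote iovec array and count
    ctypes.c_ulong,                        # flags (must be 0)
]
libc.process_vm_readv.restype = ctypes.c_ssize_t

def read_remote(pid, address, length):
    """Copy 'length' bytes from 'address' in process 'pid' in a single copy."""
    buf = ctypes.create_string_buffer(length)
    local = IOVec(ctypes.addressof(buf), length)
    remote = IOVec(address, length)
    n = libc.process_vm_readv(pid, ctypes.byref(local), 1,
                              ctypes.byref(remote), 1, 0)
    if n < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return buf.raw[:n]

# Demonstration: read a buffer that lives in this same process.
message = ctypes.create_string_buffer(b"hello from another address space")
print(read_remote(os.getpid(), ctypes.addressof(message), 5))  # b'hello'
```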

2. Driver and architecture-specific changes

All the driver and architecture-specific changes can be found in the [ Linux_3.2_DriverArch page]

3. File systems

4. Memory management

5. Networking

6. Device Mapper

7. Power management

8. Virtualization

9. Crypto

10. Security

11. Tracing/profiling

12. Various core changes


KernelNewbies: Linux_3.2 (last edited 2012-01-05 01:09:29 by diegocalleja)