Linux 3.2 released on 4 Jan, 2012
Summary: This release includes support for ext4 block sizes bigger than 4 KB and up to 1 MB, which improves performance with big files; btrfs has been updated with faster scrubbing, automatic backup of critical filesystem metadata and tools for manual inspection of the filesystem; the process scheduler has added support for setting upper limits on CPU time; desktop responsiveness in the presence of heavy writes has been improved; TCP has been updated to include an algorithm which speeds up the recovery of the connection after lost packets; the profiling tool "perf top" has added support for live inspection of tasks and libraries and for viewing the annotated assembly code; the Device Mapper has added support for 'thin provisioning' of storage; and a new architecture has been added: the Hexagon DSP processor from Qualcomm. Other drivers and small improvements and fixes are also available in this release.
Contents
Prominent features in Linux 3.2
- ext4: Support for bigger block sizes
- Btrfs: Faster scrubbing, automatic backup of tree roots, detailed corruption messages, manual inspection of metadata
- Process bandwidth controller
- New architecture: Hexagon
- Thin provisioning and recursive snapshots in the Device Mapper
- I/O-less dirty throttling, reduce filesystem writeback from page reclaim
- TCP Proportional Rate Reduction
- Improved live profiling tool "perf top"
- Cross memory attach
- Driver and architecture-specific changes
- File systems
- Memory management
- Networking
- Device Mapper
- Power management
- Virtualization
- Crypto
- Security
- Tracing/profiling
- Various core changes
1. Prominent features in Linux 3.2
1.1. ext4: Support for bigger block sizes
Recommended LWN article: Improving ext4: bigalloc, inline data, and metadata checksums
The maximum size of a filesystem block in ext4 has always been 4 KB on x86 systems. But the storage capacity of modern hard disks is growing fast, and as disks grow, so does the overhead of using such a small block size. Small block sizes benefit users who have many small files, because the space is used more efficiently, but users who work with large files would benefit from larger block sizes.
ext4 now supports block sizes of up to 1 MB, which considerably decreases the time spent doing block allocations and reduces fragmentation. These new block sizes must be set at creation time, using the mkfs -C option (requires e2fsprogs 1.42). This feature is not backwards compatible with older kernels. Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
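The trade-off described above can be illustrated with a little arithmetic. The following Python sketch counts how many allocation units a file needs with a 4 KB unit versus a 1 MB unit, and how much space a small file wastes in each case. The helper names and the file sizes are hypothetical and chosen only for illustration; they do not come from the ext4 code.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def units_needed(file_size: int, unit_size: int) -> int:
    """Number of allocation units needed to hold file_size bytes (ceiling division)."""
    return -(-file_size // unit_size)


def wasted_bytes(file_size: int, unit_size: int) -> int:
    """Space that is allocated to the file but not used by its data."""
    return units_needed(file_size, unit_size) * unit_size - file_size


big_file = 4 * GB      # hypothetical large file
small_file = 10 * KB   # hypothetical small file

for unit in (4 * KB, 1 * MB):
    print(
        f"unit = {unit // KB} KB: "
        f"big file needs {units_needed(big_file, unit)} allocations, "
        f"small file wastes {wasted_bytes(small_file, unit) // KB} KB"
    )
```

With these numbers the large file needs 256 times fewer allocation units when the unit is 1 MB, while the small file wastes most of its single 1 MB unit, which is why the larger sizes are mainly attractive for workloads dominated by big files.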
1.2. Btrfs: Faster scrubbing, automatic backup of tree roots, detailed corruption messages, manual inspection of metadata
Recommended LWN article: A btrfs update at LinuxCon Europe
- Scrub read-ahead
Scrubbing (the process of checking all the checksums of the filesystem) now uses read-ahead to improve performance. The average disk bandwidth utilisation on a test volume was raised from 70% to 90%. On another volume, the time for a test run went down from 89 seconds to 43 seconds. Code: (commit 1, 2, 3, 4)
- Log of past tree roots
Btrfs will store in the filesystem superblock information about most of the tree roots in the last four commits. A "-o recovery" mount option has been added to allow a user to use the root history log when the filesystem is not able to read the tree of tree roots, the extent tree root, the device tree root or the csum root. Code: (commit)
- Detailed corruption messages
Btrfs has always had "back references" that make it possible to find which files or b-trees actually reference a given block, but until now walking those references has been a manual process. Code to follow these backrefs has been added, with improved messages as a result. For example, after scribbling over the blocks of one file on the disk and starting a scrub, instead of just reporting that block xxyyzz is bad, the kernel now prints a message like the one below. Code: (commit 1, 2)
btrfs: checksum error at logical 5085110272 on dev /dev/sde, sector 2474832, root 5, inode 32583, offset 0, length 4096, links 1 (path: default/kernel-0/Makefile)
- Manual inspection of the filesystem
As part of the previous feature, some code has also been added to allow manual inspection of the filesystem from userspace utilities. Code: (commit)
To find the file that belongs to extent 5085110272, you can run:
btrfs inspect logical 5085110272 /mnt
Or, to find the filename for inode number 32583:
btrfs inspect inode 32583 /mnt
- Performance improvements
Performance improvements have been made in several areas, especially for random write workloads.
1.3. Process bandwidth controller
Recommended LWN article: CFS bandwidth control
The process scheduler divides the available CPU bandwidth between all processes that need to run. There are no limits on how much CPU bandwidth each process gets if there is free bandwidth available, because all processes are supposed to want as much as possible. But some companies, such as Google, have scenarios where this unbounded allocation of CPU bandwidth may lead to unacceptable utilization or latency variation.
CPU bandwidth control solves this problem by allowing an explicit maximum limit to be set on the allowable CPU bandwidth. The bandwidth allowed for a group of processes is specified using a quota and a period. Within each given "period" (in microseconds), a group is allowed to consume only up to "quota" microseconds of CPU time. When the CPU bandwidth consumption of a group exceeds this limit (for that period), the tasks belonging to its hierarchy are throttled and are not allowed to run again until the next period. Documentation: Documentation/scheduler/sched-bwc.txt. Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
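The quota/period relationship can be sketched with a small Python model. This is a toy simulation and not the scheduler's algorithm: the function names are hypothetical, and the assumption that demand left over after throttling is simply carried into the next period is a simplification made for this example.

```python
from typing import List, Tuple


def cpu_limit(quota_us: int, period_us: int) -> float:
    """Maximum CPU share of a group, in CPUs (1.0 means one full CPU)."""
    return quota_us / period_us


def simulate(demand_us: List[int], quota_us: int) -> List[Tuple[int, bool]]:
    """Return (runtime granted, throttled?) for each period.

    demand_us[i] is the CPU time the group would like to use in period i.
    Assumption of this toy model: work that did not fit into a period is
    carried over and competes for the quota of the next period.
    """
    results = []
    backlog = 0
    for want in demand_us:
        want += backlog
        granted = min(want, quota_us)   # never more than the quota per period
        backlog = want - granted        # what remains once the group is throttled
        results.append((granted, backlog > 0))
    return results


# Hypothetical settings: 250 ms of CPU time allowed in every 500 ms period.
print(cpu_limit(quota_us=250_000, period_us=500_000))
print(simulate([100_000, 400_000, 100_000], quota_us=250_000))
```

In this example the group is limited to half a CPU. It is throttled only in the second period, where it asks for more than its quota, and it finishes the leftover work in the third.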
1.4. New architecture: Hexagon
Recommended LWN article: Upcoming DSP architectures
The Hexagon processor is a general-purpose digital signal processor designed for high performance and low power across a wide variety of applications. It merges the numeric support, parallelism, and wide computation engine of a DSP, with the advanced system architecture of a modern microprocessor.
Code: arch/hexagon
1.5. Thin provisioning and recursive snapshots in the Device Mapper
Typically, provisioning storage capacity to multiple users can be inefficient. For example, if 10 users need 10 GB each, you will need 100 GB of storage capacity. These users, however, very probably won't use most of that storage space. Let's suppose that, on average, they only use 50% of their allocated space: only 50 GB will be used, and the other 50 GB will sit unused.
Thin provisioning makes it possible to assign to all users combined more storage capacity than the total storage capacity of the system. In the previous case, you could buy only 50 GB of storage, let each user have 10 GB of theoretical storage space (100 GB in total), and have no problems, because the 50 GB you bought are enough to satisfy the real demand for storage. And if users increase their demand, you can add more storage capacity. Thanks to thin provisioning, you can optimize your storage investment and avoid over-provisioning.
Linux 3.2 adds experimental support for thin provisioning in the DM layer. Users will be able to create multiple thinly provisioned volumes out of a storage pool. Another significant feature included in the thin-provision DM target is support for an arbitrary depth of recursive snapshots (snapshots of snapshots of snapshots...), which avoids degradation with depth. Code: (commit 1, 2, 3)
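The idea of thin provisioning can be sketched with a toy model in Python. The class below is purely illustrative and is not how the Device Mapper target is implemented: the class name, its methods and the "one block equals 1 GB" convention are all assumptions made for this example. The key point it shows is that physical space is consumed only when a virtual block is written for the first time.

```python
class ThinPool:
    """Toy thin-provisioned pool: physical blocks are allocated on first write."""

    def __init__(self, physical_blocks: int):
        self.physical_blocks = physical_blocks
        self.volumes = {}   # volume name -> virtual size in blocks
        self.mapping = {}   # (volume name, virtual block) -> physical block

    def create_volume(self, name: str, virtual_blocks: int) -> None:
        # Nothing is reserved here, so the virtual sizes may add up to
        # more than the physical size of the pool.
        self.volumes[name] = virtual_blocks

    def write(self, name: str, virtual_block: int) -> int:
        """Write to a virtual block and return the physical block backing it."""
        if not 0 <= virtual_block < self.volumes[name]:
            raise IndexError("write beyond the end of the volume")
        key = (name, virtual_block)
        if key not in self.mapping:
            if len(self.mapping) >= self.physical_blocks:
                raise RuntimeError("pool exhausted: more physical storage is needed")
            self.mapping[key] = len(self.mapping)   # next free physical block
        return self.mapping[key]

    def provisioned(self) -> int:
        return sum(self.volumes.values())

    def used(self) -> int:
        return len(self.mapping)


# The scenario from the text, with one block standing for 1 GB.
pool = ThinPool(physical_blocks=50)
for user in range(10):
    pool.create_volume(f"user{user}", virtual_blocks=10)
    for block in range(5):              # each user fills half of the volume
        pool.write(f"user{user}", block)

print(pool.provisioned(), pool.used())
```

Here 100 blocks are promised to users while only 50 exist. Rewriting an already mapped block costs nothing extra, but the first write to a new block after this point raises an error, which corresponds to the moment when the administrator has to add more storage to the pool.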
1.6. I/O-less dirty throttling, reduce filesystem writeback from page reclaim
Recommended LWN article: No-I/O dirty throttling
"Writeback" is the process of writing buffered data from RAM to the disk, and in this context throttling means temporarily blocking processes to stop them from creating new data that needs to be written, until the current data has been written to the disk.
A critical part of the writeback code is deciding how much data waiting to be written can be held in RAM. In this kernel, the algorithms that make that decision have been rewritten (check the LWN article for more details). As a result, IO seeks and CPU contention should be greatly reduced. Users will notice a more responsive system during heavy writeback; for example, "killall dd" will take effect instantly. Users may also notice much smoother pause times in workloads that have the write() syscall inside their loops, and also in NFS, JBOD and concurrent dd's. Lock contention and cache bouncing in concurrent IO workloads have been much improved. Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
There has also been work to reduce filesystem writeback from page reclaim, which also improves performance in many cases. Code: (commit 1, 2, 3, 4, 5, 6, 7)
1.7. TCP Proportional Rate Reduction
Recommended LWN article: LPC: Making the net go faster
TCP tries to achieve the maximum bandwidth of a network link by increasing the send rate until the network link starts losing packets. When a packet is lost, TCP slows down and then slowly tries to increase the speed again.
This system works well, but in some cases where packets are lost, it takes too much time to recover the maximum speed. Google has developed an alternative recovery algorithm, called "Proportional Rate Reduction", which improves latency and the time to recover. For more information, you can check the IETF draft (http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01), two slides (1, 2), or the LWN article. Code: (commit)
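The core idea of Proportional Rate Reduction can be sketched in a few lines of Python. This is a simplified illustration of the proportional part of the algorithm only, counted in whole segments; it leaves out the handling of the case where the amount of data in flight falls below the target window, and it is not the kernel's implementation. The variable names loosely follow the published description of the algorithm, and the numbers in the example are hypothetical.

```python
def prr_sndcnt(prr_delivered: int, prr_out: int, ssthresh: int, recover_fs: int) -> int:
    """Segments that may be sent in response to the current ACK.

    prr_delivered: segments reported as delivered since recovery started
    prr_out:       segments sent since recovery started
    ssthresh:      target window at the end of recovery
    recover_fs:    segments in flight when recovery started
    """
    # Total allowed so far, proportional to what was delivered (ceiling division).
    allowed_so_far = -(-prr_delivered * ssthresh // recover_fs)
    return max(0, allowed_so_far - prr_out)


def simulate_recovery(acks: int, ssthresh: int, recover_fs: int) -> list:
    """Feed `acks` ACKs, each delivering one segment, and record what is sent per ACK."""
    prr_delivered = 0
    prr_out = 0
    sent_per_ack = []
    for _ in range(acks):
        prr_delivered += 1
        sndcnt = prr_sndcnt(prr_delivered, prr_out, ssthresh, recover_fs)
        prr_out += sndcnt
        sent_per_ack.append(sndcnt)
    return sent_per_ack


# Hypothetical case: 20 segments in flight when the loss is detected, target window of 10.
print(simulate_recovery(acks=6, ssthresh=10, recover_fs=20))
```

With a target of half the original window, the sender transmits one segment for every two that are delivered. The window therefore shrinks smoothly towards the target during recovery, instead of the sender stopping until half of the window has drained and then sending a burst.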
1.8. Improved live profiling tool "perf top"
The live profiling tool "perf top" has been rewritten and improved. Beyond the prettier output, it can now be navigated while data capture is going on, and it has gained the ability to zoom into tasks and libraries. Users can even see annotated assembly code, press Enter on a CALLQ instruction and be taken to the called function's annotated assembly code. This works recursively, so users can explore the assembly code arbitrarily deep. Code: many different commits
1.9. Cross memory attach
Cross memory attach adds two syscalls, process_vm_readv and process_vm_writev, which allow a process to read from and write to the address space of another process. The basic idea behind cross memory attach is to allow MPI programs doing intra-node communication to do a single copy of the message rather than a double copy of the message via shared memory. Code: (commit)
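As a rough illustration of the new read syscall, the Python sketch below calls process_vm_readv through ctypes. It assumes a Linux system whose C library exports a process_vm_readv wrapper (glibc does so in recent versions); the IoVec class and the read_process_memory helper are hypothetical names introduced for this example, and error handling is kept to a minimum.

```python
import ctypes
import os


class IoVec(ctypes.Structure):
    """Mirror of the C 'struct iovec': a base address and a length."""
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]


def read_process_memory(pid: int, address: int, length: int) -> bytes:
    """Copy `length` bytes starting at `address` in process `pid` in a single call."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.process_vm_readv.restype = ctypes.c_ssize_t
    libc.process_vm_readv.argtypes = [
        ctypes.c_int,                             # pid of the remote process
        ctypes.POINTER(IoVec), ctypes.c_ulong,    # local buffers and their count
        ctypes.POINTER(IoVec), ctypes.c_ulong,    # remote ranges and their count
        ctypes.c_ulong,                           # flags, currently must be 0
    ]
    buf = ctypes.create_string_buffer(length)
    local = IoVec(ctypes.addressof(buf), length)
    remote = IoVec(address, length)
    copied = libc.process_vm_readv(pid, ctypes.byref(local), 1,
                                   ctypes.byref(remote), 1, 0)
    if copied < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return buf.raw[:copied]
```

A simple way to try it is to read from the calling process itself, for example by passing os.getpid() and the address of a ctypes buffer obtained with ctypes.addressof(). Reading from other processes is subject to the same permission checks as attaching to them with ptrace.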
2. Driver and architecture-specific changes
All the driver and architecture-specific changes can be found in the Linux_3.2_DriverArch page
3. File systems
- ext4
Optimize ext4_ext_convert_to_initialized(). Programs that use direct I/O to append to files pre-allocated via fallocate (FALLOC_FL_KEEP_SIZE), on systems with a suboptimal implementation of memmove(), will see a considerable reduction in kernel CPU consumption (commit)
Optimize memmove lengths in extent/index insertions: reduces the system CPU consumption by over 25% on a 4 KB synchronous append DIO write workload (commit)
Remove deprecated oldalloc (commit)
- ext3
Remove deprecated oldalloc (commit)
- CIFS
- Btrfs
Allow mounting with -o subvol=path/to/subvol/you/want, relative to the normal fs_tree root (commit)
Allow to overcommit ENOSPC reservations (speeds up a test from 45 minutes to 10 seconds) (commit)
Be smarter about committing the transaction: xfstests 83 goes from taking 445 seconds to taking 28 seconds (commit)
- JFFS2
- EXOFS
- NFS
Support for RAID5 read-4-write interface. (commit)
- GFS2
Speed up delete/unlink performance for large files (commit)
- SquashFS
Add an option to set dev block size to 4K (commit)
4. Memory management
vmscan: add block plug for page reclaim to reduce lock contention (commit)
thp: mremap support and TLB optimization (commit)
Restrict access to slab files under procfs and sysfs (commit)
5. Networking
Support for transmission of IPv6 packets, as well as the formation of IPv6 link-local addresses and statelessly autoconfigured addresses, on top of IEEE 802.15.4 networks (6LoWPAN). For more information, see RFC 4944, "Transmission of IPv6 Packets over IEEE 802.15.4 Networks" (commit)
NCI support. The NFC Controller Interface (NCI) is a standard communication protocol between an NFC Controller (NFCC) and a Device Host (DH), defined by the NFC Forum (commit), (commit)
Add netlink-based CAN routing (commit)
Add ethtool -g support to virtio_net (commit)
B.A.T.M.A.N. ad hoc networking: implement AP-isolation on the receiver side (commit), implement AP-isolation on the sender side (commit)
af-iucv: The current transport mechanism for af_iucv is the z/VM offered communications facility IUCV. To provide equivalent support when running Linux in an LPAR, HiperSockets transport is added to the AF_IUCV address family (commit)
ipv4: gc_interval sysctl removed (commit)
mac80211: implement uAPSD (commit), mesh gate implementation (commit)
af-packet: Added TPACKET_V3 support (commit), TPACKET_V3 flexible buffer implementation. (commit)
bridge: allow forwarding some link local frames, adding a new sysfs attribute /sys/class/net/brX/bridge/group_fwd_mask that controls forwarding of frames (commit)
6. Device Mapper
dm table: add always writeable feature (commit), add immutable feature (commit), add singleton feature (commit)
dm log userspace: add log device dependency (commit)
7. Power management
devfreq: devfreq is a generic DVFS framework that can be registered for a device with OPP support in order to let the governor provided to DEVFREQ choose an operating frequency based on the OPP's list and the policy given with DEVFREQ (commit), (commit), (commit)
Improve performance of LZO/plain hibernation, checksum image (commit)
Include storage keys in hibernation image on s390 (commit)
Implement per-device PM QoS constraints (commit)
8. Virtualization
xen: Implement discard requests ('feature-discard') (commit), support 'feature-barrier' aka old-style BARRIER (commit)
lguest: Allow running under paravirt-enabled KVM. (commit)
Move Hyper-V code out of staging directory (commit)
9. Crypto
Add userspace configuration API (commit)
blowfish: add x86_64 assembly implementation (commit)
sha1: SSSE3-based SHA-1 implementation for x86-64 (commit)
twofish: add 3-way parallel x86_64 assembler implementation (commit)
10. Security
EVM: EVM protects a file's security extended attributes (xattrs) against integrity attacks (commit)
- Smack
Domain transition protections (commit)
Rule list lookup performance (commit)
Allow access to /smack/access as a normal user (commit)
- TOMOYO
Add environment variable name restriction support. (commit)
Add socket operation restriction support. (commit)
Allow controlling generation of access granted logs for per (commit)
Allow domain transition without execve(). (commit)
11. Tracing/profiling
perf annotate: Add --symfs option (commit)
perf script: Add drop monitor script (commit)
perf stat: Add -o and --append options (commit)
perf: Support setting the disassembler style (commit)
perf tools: Make --no-asm-raw the default (commit)
perf tools: Make perf.data more self-descriptive (commit)
x86: Implement IBS initialization (commit)
powerpc: Add POWER7 stalled-cycles-frontend/backend (commit)
12. Various core changes
The use of the i_mutex lock in generic_file_llseek hurts performance; make generic_file_llseek (nearly) lockless (commit)
init: add root=PARTUUID=UUID/PARTNROFF=%d support (commit)
iommu: Add fault reporting mechanism (commit)
loop: always allow userspace partitions and optionally support (commit), add discard support for loop devices (commit)
aio: allocate kiocbs in batches, to improve performance (commit)
sysfs: Implement support for tagged files (commit), (commit)
process connector: add comm change event (commit)
debug-pagealloc: add support for highmem pages (commit)
sysctl: add support for poll() (commit)