
Diff for "Linux 3.2"

Differences between revisions 20 and 21

Linux 3.2 was released on 4 Jan 2012 (announcement: https://lkml.org/lkml/2012/1/4/395)

Summary: This release includes support for ext4 block sizes bigger than 4 KB and up to 1 MB, which improves performance with big files; btrfs has been updated with faster scrubbing, automatic backup of critical filesystem metadata and tools for manual inspection of filesystems; the process scheduler has added support for setting upper limits on CPU time; desktop responsiveness in the presence of heavy writes has been improved; TCP has been updated to include an algorithm which speeds up the recovery of a connection after lost packets; the profiling tool "perf top" has added support for live inspection of tasks and libraries and for viewing the annotated assembly code; the Device Mapper has added support for 'thin provisioning' of storage; and a new architecture has been added: the Hexagon DSP processor from Qualcomm. Other drivers, small improvements and fixes are also available in this release.

  1. Prominent features in Linux 3.2
    1. ext4: Support for bigger block sizes
    2. Btrfs: Faster scrubbing, automatic backup of tree roots, detailed corruption messages, manual inspection of metadata
    3. Process bandwidth controller
    4. New architecture: Hexagon
    5. Thin provisioning and recursive snapshots in the Device Mapper
    6. I/O-less dirty throttling, reduce filesystem writeback from page reclaim
    7. TCP Proportional Rate Reduction
    8. Improved live profiling tool "perf top"
    9. Cross memory attach
  2. Driver and architecture-specific changes
  3. File systems
  4. Memory management
  5. Networking
  6. Device Mapper
  7. Power management
  8. Virtualization
  9. Crypto
  10. Security
  11. Tracing/profiling
  12. Various core changes

1. Prominent features in Linux 3.2

1.1. ext4: Support for bigger block sizes

Recommended LWN article: Improving ext4: bigalloc, inline data, and metadata checksums

The maximum size of a filesystem block in ext4 has always been 4 KB on x86 systems. But the storage capacity of modern hard disks is growing fast, and as disks grow, so does the overhead of using such a small block size. Small block sizes benefit users who have many small files, because the space is used more efficiently, but people who use large files would benefit from larger block sizes.

ext4 now supports block sizes of up to 1 MB, which considerably decreases the time spent doing block allocations and reduces fragmentation. These new block sizes must be set at creation time, using the mkfs -C option (requires e2fsprogs 1.42). This feature is not backwards compatible with older kernels. Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
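
The trade-off described above can be illustrated with a little arithmetic. The following is only a sketch: the helper names are hypothetical, and it models nothing but the count of allocation units and the slack space left in the last unit, not how ext4 actually lays out data.

```python
KIB = 1024
MIB = 1024 * KIB


def allocation_units(file_size: int, unit_size: int) -> int:
    """Number of allocation units needed to hold file_size bytes (rounded up)."""
    return -(-file_size // unit_size)  # integer ceiling division


def wasted_bytes(file_size: int, unit_size: int) -> int:
    """Slack space left unused in the last allocation unit."""
    return allocation_units(file_size, unit_size) * unit_size - file_size


big_file = 1024 * MIB   # a 1 GiB file
small_file = 1 * KIB    # a 1 KiB file

for unit in (4 * KIB, 1 * MIB):
    print(
        f"unit={unit:>8} bytes: "
        f"{allocation_units(big_file, unit):>7} units for the big file, "
        f"{wasted_bytes(small_file, unit):>8} bytes wasted by the small file"
    )
```

With 4 KB units the large file needs 262144 allocations, against 1024 with 1 MB units, while the small file wastes far more space in the larger unit. This is why the larger sizes suit workloads dominated by big files.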

1.2. Btrfs: Faster scrubbing, automatic backup of tree roots, detailed corruption messages, manual inspection of metadata

Recommended LWN article: A btrfs update at LinuxCon Europe

Scrub read-ahead

Scrubbing (the process of checking all the checksums of the filesystem) now uses read-ahead to improve performance. The average disk bandwidth utilisation on a test volume was raised from 70% to 90%. On another volume, the time for a test run went down from 89 seconds to 43 seconds. Code: (commit 1, 2, 3, 4)

Log of past tree roots

Btrfs will store in the filesystem superblock information about most of the tree roots in the last four commits. A "-o recovery" mount option has been added to let a user fall back on the root history log when the filesystem is not able to read the tree of tree roots, the extent tree root, the device tree root or the csum root. Code: (commit)

Detailed corruption messages

Btrfs has always had "back references" that make it possible to find which files or b-trees actually reference a given block, but until now walking those references has been a manual process. Code to follow these backrefs has been added, with improved messages as a result. Code: (commit 1, 2). For example, after scribbling over the blocks of one file on the disk and starting a scrub, instead of just reporting that block xxyyzz is bad, the kernel will now print this:

  • btrfs: checksum error at logical 5085110272 on dev /dev/sde, sector 2474832, root 5, inode 32583, offset 0, length 4096, links 1 (path: default/kernel-0/Makefile)

Manual inspection of the filesystem

As part of the previous feature, some code has also been added to allow manual inspection of the filesystem from userspace utilities. Code: (commit). To find the file that belongs to extent 5085110272, you can run:

  • btrfs inspect logical 5085110272 /mnt

    Or to find the filename for inode number 32583:

    btrfs inspect inode 32583 /mnt

Performance improvements

Performance improvements have been made in several areas, especially for random write workloads.

1.3. Process bandwidth controller

Recommended LWN article: CFS bandwidth control

The process scheduler divides the available CPU bandwidth between all processes that need to run. There is no limit on how much CPU bandwidth each process gets if there is free bandwidth available, because all processes are supposed to want as much as possible. But some companies, such as Google, have scenarios where this unbounded allocation of CPU bandwidth may lead to unacceptable utilization or latency variation.

CPU bandwidth control solves this problem by allowing an explicit maximum limit to be set on allowable CPU bandwidth. The bandwidth allowed for a group of processes is specified using a quota and a period. Within each given "period" (in microseconds), a group is allowed to consume only up to "quota" microseconds of CPU time. When the CPU bandwidth consumption of a group exceeds this limit (for that period), the tasks belonging to its hierarchy will be throttled and are not allowed to run again until the next period. Documentation: Documentation/scheduler/sched-bwc.txt. Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
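
The quota/period mechanism can be sketched as a toy simulation. This is not the scheduler's code: the function names and the "backlog" model (work that did not fit in one period is retried in the next) are assumptions made for illustration. In the kernel, the two values are exposed through the cpu cgroup files cpu.cfs_quota_us and cpu.cfs_period_us.

```python
def cpu_share(quota_us: int, period_us: int) -> float:
    """Fraction of one CPU the group may use on average."""
    return quota_us / period_us


def simulate(demand_us, quota_us):
    """Toy model: demand_us[i] is the CPU time the group wants in period i.

    Returns one (ran_us, throttled) pair per period. Work that does not
    fit within the quota is carried over to the next period (an assumption
    of this sketch, standing in for tasks that stay runnable).
    """
    backlog = 0
    result = []
    for want in demand_us:
        backlog += want
        ran = min(backlog, quota_us)      # never run more than the quota
        backlog -= ran
        result.append((ran, backlog > 0))  # throttled if work is left over
    return result


# A group limited to 50 ms of CPU time in every 100 ms period.
print(cpu_share(50_000, 100_000))
print(simulate([30_000, 80_000, 20_000], quota_us=50_000))
```

In the second period the group asks for 80 ms, runs for its 50 ms quota and is throttled; the remaining 30 ms of work only gets to run once the next period begins.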

1.4. New architecture: Hexagon

Recommended LWN article: Upcoming DSP architectures

Qualcomm's Hexagon home page

The Hexagon processor is a general-purpose digital signal processor designed for high performance and low power across a wide variety of applications. It merges the numeric support, parallelism, and wide computation engine of a DSP, with the advanced system architecture of a modern microprocessor.

Code: arch/hexagon

1.5. Thin provisioning and recursive snapshots in the Device Mapper

Typically, provisioning storage capacity to multiple users can be inefficient. For example, if 10 users need 10 GB each, you will need 100 GB of storage capacity. These users, however, very probably won't use most of that storage space. Suppose that, on average, they only use 50% of their allocated space: only 50 GB will be used, and the other 50 GB will be underutilized.

Thin provisioning makes it possible to assign to all users combined more storage capacity than the total storage capacity of the system. In the previous case, you could buy only 50 GB of storage, let each user have 10 GB of theoretical storage space (100 GB in total), and have no problems, because the 50 GB you bought are enough to satisfy the real demand for storage. And if users increase the demand, you can add more storage capacity. Thanks to thin provisioning, you can optimize your storage investment and avoid over-provisioning.

Linux 3.2 adds experimental support for thin provisioning in the DM layer. Users will be able to create multiple thinly provisioned volumes out of a storage pool. Another significant feature included in the thin-provision DM target is support for an arbitrary depth of recursive snapshots (snapshots of snapshots of snapshots...), which avoids degradation with depth. Code: (commit 1, 2, 3)
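
The allocate-on-first-write idea behind thin provisioning can be sketched with a toy pool. The ThinPool class below is hypothetical and bears no relation to the dm-thin metadata format or interface; it only shows that physical space is consumed when a virtual block is first written, not when a volume is created.

```python
class ThinPool:
    """Toy thin-provisioning pool: physical blocks are handed out lazily."""

    def __init__(self, physical_blocks: int):
        self.physical_blocks = physical_blocks
        self.used = 0
        # volume name -> (virtual size in blocks, {virtual block: physical block})
        self.volumes = {}

    def create_volume(self, name: str, virtual_blocks: int) -> None:
        # Creating a volume consumes no physical space at all.
        self.volumes[name] = (virtual_blocks, {})

    def write(self, name: str, vblock: int) -> int:
        """Write to a virtual block, allocating a physical block on first use."""
        virtual_blocks, mapping = self.volumes[name]
        if not 0 <= vblock < virtual_blocks:
            raise IndexError("write beyond the end of the volume")
        if vblock not in mapping:
            if self.used >= self.physical_blocks:
                raise RuntimeError("pool exhausted: add more physical storage")
            mapping[vblock] = self.used
            self.used += 1
        return mapping[vblock]


# 10 users with 10 "GB" each (100 virtual) on top of only 50 physical "GB".
pool = ThinPool(physical_blocks=50)
for user in range(10):
    pool.create_volume(f"user{user}", virtual_blocks=10)
    for block in range(5):  # each user really uses only 50%
        pool.write(f"user{user}", block)

print(pool.used, "of", pool.physical_blocks, "physical blocks in use")
```

Rewriting an already mapped block costs nothing extra, while a first write to a new block once the pool is full fails. That is the point at which an administrator would need to add capacity.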

1.6. I/O-less dirty throttling, reduce filesystem writeback from page reclaim

Recommended LWN article: No-I/O dirty throttling

"Writeback" is the process of writing buffered data from RAM to the disk, and in this context throttling means temporarily blocking processes to stop them from creating new data that needs to be written, until the current data has been written to the disk.

A critical part of the writeback code is deciding how much data pending writeback can be held in RAM. In this kernel, the algorithms that make that decision have been rewritten (check the LWN article for more details). As a result, IO seeks and CPU contention should be greatly reduced. Users will notice a more responsive system during heavy writeback; for example, "killall dd" will take effect instantly. Users may also notice much smoother pause times in workloads that have the write() syscall inside their loop, and also in NFS, JBOD and concurrent dd's. Lock contention and cache bouncing in concurrent IO workloads have been much improved. Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)

There has also been work to reduce filesystem writeback from page reclaim, which also improves performance in many cases. Code: (commit 1, 2, 3, 4, 5, 6, 7)

1.7. TCP Proportional Rate Reduction

Recommended LWN article: LPC: Making the net go faster

TCP tries to achieve the maximum bandwidth of a network link by increasing the send rate until the network link starts losing packets. When a packet is lost, TCP slows down, then tries to slowly increase the speed again.

This system works well, but in some cases where packets are lost, it takes too much time to recover the maximum speed. Google has developed an alternative recovery algorithm, called "Proportional Rate Reduction", which improves latency and the time to recover. For more information, you can check an IETF draft (http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01), two slides (1, 2), or the LWN article. Code: (commit)
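
As a rough illustration of the idea, the per-ACK computation can be sketched as follows. This follows the pseudocode of the algorithm as later published in RFC 6937 (the slow-start reduction bound variant), which may differ in detail from the draft linked above and from the kernel implementation. The function name and the final clamp to zero are assumptions of this sketch, and all quantities are counted in segments for simplicity.

```python
def prr_sndcnt(prr_delivered, prr_out, pipe, ssthresh,
               recover_fs, delivered_data, mss=1):
    """How much new data may be sent in response to one ACK during recovery.

    prr_delivered:  data delivered to the receiver since recovery started
    prr_out:        data sent since recovery started
    pipe:           estimate of data currently in flight
    ssthresh:       target window after the reduction
    recover_fs:     data in flight when recovery started (RecoverFS)
    delivered_data: data newly delivered by this ACK
    """
    if pipe > ssthresh:
        # Proportional part: send ssthresh/recover_fs units per unit delivered.
        sndcnt = -(-(prr_delivered * ssthresh) // recover_fs) - prr_out
    else:
        # Too little in flight: catch up towards ssthresh, within a bound.
        limit = max(prr_delivered - prr_out, delivered_data) + mss
        sndcnt = min(ssthresh - pipe, limit)
    return max(sndcnt, 0)  # clamp added for this sketch


# Recovery starts with 20 segments in flight and a target window of 10.
prr_delivered = prr_out = 0
for ack in range(1, 5):
    prr_delivered += 1  # each ACK reports one delivered segment
    n = prr_sndcnt(prr_delivered, prr_out, pipe=15, ssthresh=10,
                   recover_fs=20, delivered_data=1)
    prr_out += n
    print(f"ACK {ack}: send {n} segment(s)")
```

With the window being halved, the sender transmits roughly one segment for every two ACKs, so the reduction is spread smoothly across the recovery period instead of stalling and then bursting. The in-flight estimate is held constant here only to keep the example short.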

1.8. Improved live profiling tool "perf top"

The live profiling tool "perf top" has been rewritten and improved. Beyond the prettier output, it has the ability to navigate while data capture is going on, and the new ability to zoom into tasks and libraries. Users can even see annotated assembly code, hit enter on a CALLQ instruction and get moved to the called function's annotated assembly code. This works recursively, so users can explore the assembly code arbitrarily deep. Code: many different commits

1.9. Cross memory attach

Cross memory attach adds two syscalls, process_vm_readv and process_vm_writev, which allow reading from and writing to another process's address space. The basic idea behind cross memory attach is to allow MPI programs doing intra-node communication to do a single copy of the message rather than a double copy of the message via shared memory. Code: (commit)

2. Driver and architecture-specific changes

All the driver and architecture-specific changes can be found in the Linux_3.2_DriverArch page

3. File systems

  • ext4

    • Optimize ext4_ext_convert_to_initialized(). Programs that append via direct I/O to files pre-allocated with fallocate (FALLOC_FL_KEEP_SIZE), on systems using a suboptimal implementation of memmove(), will see a considerable reduction of kernel CPU consumption (commit)

    • Optimize memmove lengths in extent/index insertions: Reduce the system CPU consumption by over 25% on a 4kB synchronous append DIO write workload (commit)

    • Remove deprecated oldalloc (commit)

  • ext3

  • CIFS

    • uid/gid to SID mapping (commit)

    • Add mount options for backup intent (commit)

    • Allow for larger rsize= options and change defaults (commit)

  • Btrfs

    • Introduce mount option nospace_cache (commit), (commit)

    • Allow to mount -o subvol=path/to/subvol/you/want relative from the normal fs_tree root (commit)

    • Allow to overcommit ENOSPC reservations (speeds up a test from 45 minutes to 10 seconds) (commit)

    • Be smarter about committing the transaction: xfstests 83 goes from taking 445 seconds to taking 28 seconds (commit)

  • JFFS2

    • Add compr=lzo and compr=zlib options (commit)

    • Implement mount option parsing and compression overriding (commit)


  • NFS

    • Support for RAID5 read-4-write interface. (commit)

  • GFS2

    • Speed up delete/unlink performance for large files (commit)

  • SquashFS

    • Add an option to set dev block size to 4K (commit)

4. Memory management

  • vmscan: add block plug for page reclaim to reduce lock contention (commit)

  • thp: mremap support and TLB optimization (commit)

  • slub: per CPU cache for partial pages (commit), (commit)

  • Restrict access to slab files under procfs and sysfs (commit)

5. Networking

  • Support for transmission of IPv6 packets as well as the formation of IPv6 link-local addresses and statelessly autoconfigured addresses on top of IEEE 802.15.4 networks. For more information, see RFC4944 "Compression Format for IPv6 Datagrams in Low Power and Lossy Networks (6LoWPAN)" (commit)

  • NCI support. The NFC Controller Interface (NCI) is a standard communication protocol between an NFC Controller (NFCC) and a Device Host (DH), defined by the NFC Forum (commit), (commit)

  • Add netlink-based CAN routing (commit)

  • Add ethtool -g support to virtio_net (commit)

  • B.A.T.M.A.N. ad hoc networking: implement AP-isolation on the receiver side (commit), implement AP-isolation on the sender side (commit)

  • af-iucv: The current transport mechanism for af_iucv is the z/VM offered communications facility IUCV. To provide equivalent support when running Linux in an LPAR, HiperSockets transport is added to the AF_IUCV address family (commit)

  • ipv4: gc_interval sysctl removed (commit)

  • mac80211: implement uAPSD (commit), mesh gate implementation (commit)

  • af-packet: Added TPACKET_V3 support (commit), TPACKET_V3 flexible buffer implementation. (commit)

  • bridge: allow forwarding some link local frames, adding a new sysfs attribute /sys/class/net/brX/bridge/group_fwd_mask that controls forwarding of frames (commit)

6. Device Mapper

  • dm table: add always writeable feature (commit), add immutable feature (commit), add singleton feature (commit)

  • dm log userspace: add log device dependency (commit)

7. Power management

  • devfreq: devfreq is a generic DVFS framework that can be registered for a device with OPP support in order to let the governor provided to DEVFREQ choose an operating frequency based on the OPP's list and the policy given with DEVFREQ (commit), (commit), (commit)

  • Improve performance of LZO/plain hibernation, checksum image (commit)

  • Include storage keys in hibernation image on s390 (commit)

  • Implement per-device PM QoS constraints (commit)

8. Virtualization

  • xen: Implement discard requests ('feature-discard') (commit), support 'feature-barrier' aka old-style BARRIER (commit)

  • lguest: Allow running under paravirt-enabled KVM. (commit)

  • Move Hyper-V code out of staging directory (commit)

9. Crypto

  • Add userspace configuration API (commit)

  • blowfish: add x86_64 assembly implementation (commit)

  • sha1: SSSE3-based SHA-1 implementation for x86-64 (commit)

  • twofish: add 3-way parallel x86_64 assembler implementation (commit)

10. Security

  • EVM: EVM protects a file's security extended attributes (xattrs) against integrity attacks (commit)

  • Domain transition protections (commit)

  • Rule list lookup performance (commit)

  • Allow to access /smack/access as normal user (commit)

  • Add environment variable name restriction support. (commit)

  • Add socket operation restriction support. (commit)

  • Allow controlling generation of access granted logs for per (commit)

  • Allow domain transition without execve(). (commit)

11. Tracing/profiling

  • perf annotate: Add --symfs option (commit)

  • perf script: Add drop monitor script (commit)

  • perf stat: Add -o and --append options (commit)

  • perf: Support setting the disassembler style (commit)

  • perf tools: Make --no-asm-raw the default (commit)

  • perf tools: Make perf.data more self-descriptive (commit)

  • x86: Implement IBS initialization (commit)

  • powerpc: Add POWER7 stalled-cycles-frontend/backend (commit)

12. Various core changes



last edited 2012-07-22 14:05:26 by diegocalleja