
Linux 3.10 was released on Sun, 30 Jun 2013.

Summary: This release adds support for bcache, which allows SSD devices to cache data from other block devices; a Btrfs format improvement that makes the tree dedicated to storing extent information 30-35% smaller; support for XFS metadata checksums and self-describing metadata; timerless multitasking; SysV IPC, rwsem and mutex scalability improvements; a TCP Tail loss probe algorithm that reduces tail latency of short transactions; KVM virtualization support in the MIPS architecture; support for the ARM big.LITTLE architecture that mixes CPUs of different types; tracing snapshots; new drivers and many small improvements.

  1. Prominent features (the cool stuff)
    1. Timerless multitasking
    2. Bcache, a block layer cache for SSD caching
    3. Btrfs: smaller, more space-efficient extent tree
    4. XFS metadata checksums
    5. SysV IPC scalability improvements
    6. rwsem locking scalability improvements
    7. mutex locking scalability improvements
    8. TCP optimization: Tail loss probe
    9. ARM big.LITTLE support
    10. MIPS KVM support
    11. tracing: tracing snapshots, stack tracing
  2. Drivers and architectures
  3. Core
  4. Memory management
  5. Block layer
  6. File systems
  7. Networking
  8. Crypto
  9. Virtualization
  10. Security
  11. Tracing/perf
  12. Other news sites that track the changes of this release

1. Prominent features (the cool stuff)

1.1. Timerless multitasking

In the prehistory of computing, computers could only run one task at a time. But people wanted to start other tasks without waiting for the first one to end, and even switch between tasks, and thus multitasking was born. At first, multitasking was "cooperative": a process would run until its own code voluntarily decided to pause and allow other tasks to run. But it was possible to do multitasking better: the hardware could have a timer that fires at regular intervals (called "ticks"); this timer could forcefully pause any program and run an OS routine that decides which task should run next. This is called preemptive multitasking, and it's what modern OSs do.

But preemptive multitasking has some side effects on modern hardware. CPUs of laptops and mobile devices require inactivity to enter low power modes. Preemptive multitasking fires the timer often, 1000 times per second in a typical Linux kernel, even when the system is not doing anything, so the CPUs could not save as much power as possible. Virtualization created more problems, since each Linux VM runs its own timer. In 2.6.21, released in April 2007, Linux partially solved this: the timer would fire 1000 times per second as always when the system is running tasks, but it would be stopped completely when the system is idle. But this is not enough. There are single-task workloads, like scientific number crunching or users of the real-time patchset, whose performance or latency is hurt because they are temporarily paused 1000 times per second for no reason.

This Linux release adds support for not firing the timer (tickless) even when tasks are running, with some caveats: in this release it is not actually fully tickless, since it still needs the timer, but the timer fires only once per second; the full tickless mode is disabled when a CPU runs more than one process; and at least one CPU must keep running with full ticks to allow the other CPUs to go into tickless mode.
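The caveats above can be summarized in a toy model. This is only an illustrative sketch: the function name, its parameters and the `housekeeping` flag are hypothetical and do not correspond to any kernel interface.

```python
def timer_interrupts_per_second(hz, runnable_tasks, full_tickless,
                                housekeeping=False):
    """Rough model of how often the tick fires on one CPU.

    hz             -- configured tick rate (1000 in a typical kernel)
    runnable_tasks -- number of runnable tasks on this CPU
    full_tickless  -- whether the 3.10 tickless-while-running mode is enabled
    housekeeping   -- True for the CPU that must keep full ticks
                      (hypothetical flag, for illustration only)
    """
    if runnable_tasks == 0:
        # Idle tickless behaviour available since 2.6.21: timer stopped.
        return 0
    if full_tickless and not housekeeping and runnable_tasks == 1:
        # In 3.10 a residual tick still fires once per second.
        return 1
    # More than one task, or mode unavailable: regular periodic tick.
    return hz


if __name__ == "__main__":
    for tasks in (0, 1, 2):
        print(tasks, timer_interrupts_per_second(1000, tasks, True))
```

Under these assumptions, a CPU dedicated to a single number-crunching task drops from 1000 interruptions per second to one.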

For more details and future plans, it's strongly recommended to read this LWN article: '(Nearly) full tickless operation in 3.10' and the Documentation.

Code: (merge commit)

1.2. Bcache, a block layer cache for SSD caching

Since SSD storage devices became popular, many people have used them to speed up their storage stack. Bcache is an implementation of this functionality: it allows SSDs to cache other block devices. It's analogous to L2ARC for ZFS, but Bcache also does writeback caching (besides just writethrough caching), and it's filesystem agnostic. It's designed to be switched on with a minimum of effort, and to work well without configuration on any setup. By default it won't cache sequential IO, just the random reads and writes that SSDs excel at. It's meant to be suitable for desktops, servers, high-end storage arrays, and perhaps even embedded systems.
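The idea of skipping sequential IO can be sketched as follows. This is not Bcache's code: the class name, the single-stream tracking and the 4 MiB cutoff are assumptions chosen for illustration.

```python
class SequentialBypass:
    """Decide whether an IO request should bypass the cache.

    A request that starts exactly where the previous one ended extends
    the current sequential run; once the run reaches the cutoff, further
    requests in that run are sent straight to the backing device.
    """

    def __init__(self, cutoff_bytes=4 * 1024 * 1024):  # assumed cutoff
        self.cutoff = cutoff_bytes
        self.next_offset = None   # where a sequential follow-up would start
        self.run_bytes = 0        # length of the current sequential run

    def should_bypass(self, offset, length):
        if offset == self.next_offset:
            self.run_bytes += length      # continues the current run
        else:
            self.run_bytes = length       # random access: start a new run
        self.next_offset = offset + length
        return self.run_bytes >= self.cutoff


if __name__ == "__main__":
    policy = SequentialBypass(cutoff_bytes=8192)
    print(policy.should_bypass(0, 4096))          # short run: cache it
    print(policy.should_bypass(4096, 4096))       # run reached the cutoff
    print(policy.should_bypass(1_000_000, 4096))  # random IO: cache it
```

A real implementation would track several concurrent streams; a single run counter keeps the sketch short.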

For more details, read the documentation or visit the wiki.

Recommended LWN article: A bcache update

Code: (commit)

1.3. Btrfs: smaller, more space-efficient extent tree

Btrfs has incorporated a new key type for metadata extent references which uses disk space more efficiently, reducing the size from 51 bytes to 33 bytes per extent reference for each tree block. In practice, this results in a 30-35% decrease in the size of the extent tree. That means fewer copy-on-write operations and larger parts of the extent tree held in memory, which makes heavy metadata operations go much faster.
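As a quick check of those figures, the per-reference saving can be computed directly. The constant and function names below are made up for the example, and the one million tree blocks is an arbitrary illustrative number.

```python
OLD_REF_BYTES = 51  # per extent reference before this change
NEW_REF_BYTES = 33  # per extent reference with the new key type

# Fractional saving per reference: (51 - 33) / 51 = 18 / 51
saving = (OLD_REF_BYTES - NEW_REF_BYTES) / OLD_REF_BYTES


def reference_bytes(tree_blocks, bytes_per_ref):
    """Space used by one reference per tree block (illustrative only)."""
    return tree_blocks * bytes_per_ref


if __name__ == "__main__":
    print(f"saving per reference: {saving * 100:.1f}%")
    blocks = 1_000_000  # arbitrary example size
    print(reference_bytes(blocks, OLD_REF_BYTES), "->",
          reference_bytes(blocks, NEW_REF_BYTES), "bytes")
```

The per-reference saving is about 35%, which is the upper end of the 30-35% range quoted for the whole extent tree.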

This is not an automatic format change; it must be enabled at mkfs time or with btrfstune -x.

Code: (commit)

1.4. XFS metadata checksums

In this release, XFS has an experimental implementation of metadata CRC32c checksums. These metadata checksums are part of a bigger project that aims to implement what the XFS developers have called "self-describing metadata". This project aims to solve the problem of verification scalability (fsck will need too much time to verify petabyte scale filesystems with billions of inodes). It requires a filesystem format change that adds to every XFS metadata object some information that makes it possible to quickly determine whether the metadata is intact and can be ignored for the purpose of forensic analysis: metadata type, filesystem identifier and block placement, metadata owner, log sequence identifier and, of course, a CRC checksum.
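To illustrate how a checksum embedded in a metadata block lets a reader detect corruption, here is a minimal sketch. The CRC32c routine is the standard bitwise Castagnoli CRC; the block layout (a 4-byte magic followed by a 4-byte checksum field at `CRC_OFFSET`) and the names `seal` and `verify` are hypothetical and are not the XFS on-disk format.

```python
import struct

CRC_OFFSET = 4  # hypothetical position of the checksum field in a block


def crc32c(data, crc=0):
    """Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def seal(block):
    """Store the checksum of the block, computed with the field zeroed."""
    block[CRC_OFFSET:CRC_OFFSET + 4] = b"\x00" * 4
    struct.pack_into("<I", block, CRC_OFFSET, crc32c(bytes(block)))


def verify(block):
    """Recompute the checksum and compare it with the stored value."""
    (stored,) = struct.unpack_from("<I", block, CRC_OFFSET)
    scratch = bytearray(block)
    scratch[CRC_OFFSET:CRC_OFFSET + 4] = b"\x00" * 4
    return stored == crc32c(bytes(scratch))


if __name__ == "__main__":
    block = bytearray(b"META" + bytes(4) + b"owner=42;lsn=7;" * 4)
    seal(block)
    print("intact:", verify(block))
    block[20] ^= 0x01  # simulate a single flipped bit on disk
    print("after corruption:", verify(block))
```

The other self-describing fields (type, owner, placement) would be checked in a similar per-block way, without walking the whole filesystem.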

This feature is experimental and requires using experimental xfsprogs. For more information, you can read the self-describing metadata Documentation.

Code: (merge commit)

1.5. SysV IPC scalability improvements

Linux IPC semaphore scalability was pitiful. Linux used to lock ranges that were much too big, and it used to have a single IPC lock per IPC semaphore array. Most workloads never cared, but some do. This release splits out locking and adds per-semaphore locks for greater scalability of the IPC semaphore code. Micro benchmarks show improvements of more than 10x in some cases (see commit links for details).
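The general idea of per-semaphore locking can be sketched in user space. This is only an illustration of fine-grained locking: the class and method names are invented, blocking semantics are left out, and the kernel's actual handling of operations that touch several semaphores differs and is not modelled here.

```python
import threading


class SemaphoreArray:
    """Array of counters where each element has its own lock."""

    def __init__(self, nsems):
        self.values = [0] * nsems
        self.locks = [threading.Lock() for _ in range(nsems)]

    def add(self, index, delta):
        """Single-semaphore operation: only that semaphore is locked."""
        with self.locks[index]:
            self.values[index] += delta

    def add_many(self, ops):
        """Multi-semaphore operation given as {index: delta}.

        Locks are taken in index order so that two concurrent
        multi-semaphore operations cannot deadlock each other.
        """
        indices = sorted(ops)
        for i in indices:
            self.locks[i].acquire()
        try:
            for i in indices:
                self.values[i] += ops[i]
        finally:
            for i in reversed(indices):
                self.locks[i].release()


if __name__ == "__main__":
    sems = SemaphoreArray(4)
    workers = [
        threading.Thread(
            target=lambda i=i: [sems.add(i, 1) for _ in range(1000)])
        for i in range(4)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(sems.values)
```

With one lock per element, threads that operate on different semaphores of the same array no longer contend with each other.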

Code: (merge commit), (commit 1, 2, 3, 4, 5, 6, 7)

1.6. rwsem locking scalability improvements

The rwsem ("reader-writer semaphore") locking scheme, used in many places in the Linux kernel, had performance problems because of strict, serialized, FIFO write-ownership of the semaphore. In Linux 3.9, an "opportunistic lock stealing" patch was merged to fix it, but only in the slow path.

In this release, opportunistic lock stealing has been implemented in the fast path, improving the performance of pgbench by double-digit percentages in some cases.

Code: (merge commit)

1.7. mutex locking scalability improvements

The mutex locking scheme, used widely in the Linux kernel, has become more scalable thanks to the use of fewer atomic operations and some queuing changes that reduce cacheline contention. For details, see the commit links.

Code: (commit), (commit)

1.8. TCP optimization: Tail loss probe

This release adds the TCP Tail loss probe (TLP) algorithm. Its goal is to reduce tail latency of short transactions. It achieves this by converting retransmission timeouts (RTOs) occurring due to tail losses (losses at the end of transactions) into fast recovery. TLP transmits one packet in two round-trips when a connection is in Open state and isn't receiving any ACKs. The transmitted packet, aka loss probe, can be either new or a retransmission. When there is tail loss, the ACK from a loss probe triggers FACK/early-retransmit based fast recovery, thus avoiding a costly retransmission timeout.
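The "two round-trips" rule can be made concrete with a small sketch of how a probe timeout might be derived. The text above only states the two round-trip figure; the 10 ms floor, the 200 ms delayed-ACK allowance for a single packet in flight, and the function and parameter names are assumptions based on a reading of the TLP Internet-Draft, not on the kernel code.

```python
def probe_timeout(srtt, rto, packets_in_flight,
                  min_pto=0.010, delayed_ack_allowance=0.200):
    """Return a probe timeout (PTO) in seconds.

    srtt              -- smoothed round-trip time, in seconds
    rto               -- current retransmission timeout, in seconds
    packets_in_flight -- unacknowledged packets currently outstanding
    min_pto, delayed_ack_allowance -- assumed constants (see lead-in)
    """
    # Probe after roughly two round-trips without ACKs.
    pto = max(2 * srtt, min_pto)
    if packets_in_flight == 1:
        # With a single packet in flight the receiver may delay its ACK,
        # so wait a little longer before probing.
        pto = max(pto, 1.5 * srtt + delayed_ack_allowance)
    # The probe is only useful if it fires before the regular RTO.
    return min(pto, rto)


if __name__ == "__main__":
    print(probe_timeout(srtt=0.05, rto=1.0, packets_in_flight=3))
    print(probe_timeout(srtt=0.05, rto=1.0, packets_in_flight=1))
    print(probe_timeout(srtt=0.40, rto=0.5, packets_in_flight=3))
```

Because the probe fires well before the RTO on short-RTT paths, a lost final segment can be recovered through fast recovery instead of waiting for the full timeout.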

Code: (commit 1, 2)

1.9. ARM big.LITTLE support

The ARM big.LITTLE architecture is an ARM SMP solution where, according to this LWN Article, "instead of having a bunch of identical CPU cores put together in a system, the big.LITTLE architecture is effectively pushing the concept further by pulling two different SMP systems together: one being a set of "big" and fast processors, the other one consisting of "little" and power-efficient processors."

Recommended LWN article: Multi-cluster power management

Product site: http://www.arm.com/products/processors/technologies/bigLITTLEprocessing.php

Code: (commit)

1.10. MIPS KVM support

Another Linux architecture has added support for KVM; in this case MIPS. KVM/MIPS should support MIPS32R2 and beyond. For more details, see the release notes.

Code: (commit)

1.11. tracing: tracing snapshots, stack tracing

The tracing framework can now use several tracing buffers, which can be used to take snapshots of the main tracing buffer. These tracing snapshots can be triggered manually or with function probes. It's also possible to record a stack trace in the ring buffer when a given function is called.

Code: (commit 1, 2, 3, 4, 5, 6)

2. Drivers and architectures

All the driver and architecture-specific changes can be found in the Linux_3.10-DriversArch page

3. Core

  • Asynchronous I/O scalability improvements: "Performance wise, the end result of this patch series is that submitting a kiocb writes to _no_ shared cachelines - the penalty for sharing an ioctx is gone there" (commit)

  • Make VT switching to the suspend console optional (commit)

  • posix-timers: Introduce /proc/PID/timers file to get info about what posix timers are configured by processes (commit)

  • kconfig: implement KCONFIG_PROBABILITY for randconfig (commit)

  • modpost: add -T option to read module names from file/stdin. (commit)

  • lib/int_sqrt.c: optimize square root algorithm (commit)

  • device control group: propagate local changes down the hierarchy (commit)

  • Add uid and gid to devtmpfs (commit)

  • Introduce a dummy IRQ handler driver. This module accepts a single 'irq' parameter, which it should register for. The sole purpose of this module is to help with debugging (commit)

  • control groups: introduce sane_behavior mount option (commit)

  • ptrace: add ability to retrieve signals without removing from a queue (commit)

  • cpufreq: Implement per policy instances of governors (commit)


  • Implement sysfs interface for workqueues in /sys/bus/workqueue/devices/WQ_NAME. There currently are two attributes common to both per-cpu and unbound pools and extra attributes for unbound pools including nice level and cpumask (commit)

  • Implement NUMA affinity for unbound workqueues (commit)
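For the lib/int_sqrt.c item in the list above, the following sketch shows a classic shift-and-subtract integer square root, which needs no multiplication or division. It illustrates the family of algorithm involved and is not claimed to be the exact kernel routine; the `bits` parameter stands in for the machine word size and is an assumption of this example.

```python
import math


def int_sqrt(x, bits=64):
    """Return floor(sqrt(x)) for 0 <= x < 2**bits using shifts and adds."""
    if x < 0 or x >= 1 << bits:
        raise ValueError("x out of range for the given word size")
    if x <= 1:
        return x
    y = 0                  # result built up one bit at a time
    m = 1 << (bits - 2)    # highest power of four within the word
    while m:
        b = y + m          # trial value for the current bit
        y >>= 1
        if x >= b:         # the bit belongs in the result
            x -= b
            y += m
        m >>= 2            # move to the next lower power of four
    return y


if __name__ == "__main__":
    for value in (0, 1, 15, 16, 17, 10**12):
        print(value, int_sqrt(value), math.isqrt(value))
```

Each loop iteration settles one bit of the result, so the cost is bounded by half the word size regardless of the input.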


4. Memory management

  • Limit the growth of the memory reserved for other user processes to min(3% current process size, user_reserve_pages) in the OVERCOMMIT_NEVER mode. For more details, see the commit links (commit), (commit)
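The limit described above is a simple minimum of two quantities, shown here as a sketch. The function name is invented, both values are assumed to be expressed in pages, and the 3% figure is taken from the text rather than from the kernel source.

```python
def reserved_for_other_processes(process_pages, user_reserve_pages):
    """Pages kept back for other user processes in OVERCOMMIT_NEVER mode.

    process_pages      -- current size of the allocating process, in pages
    user_reserve_pages -- configured upper bound on the reserve, in pages
    """
    three_percent = process_pages * 3 // 100
    return min(three_percent, user_reserve_pages)


if __name__ == "__main__":
    # Small process: 3% of its size is below the configured bound.
    print(reserved_for_other_processes(1_000, 32_768))
    # Very large process: the configured bound caps the reserve.
    print(reserved_for_other_processes(10_000_000, 32_768))
```

The reserve therefore grows with the process only up to the configured bound, so a single huge process cannot push it arbitrarily high.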

Memory control group

  • Add memory.pressure_level events (commit)

  • Add rss_huge stat to memory.stat (commit)

5. Block layer

  • Expose the block layer bdi_wq workqueue to userland. It appears under /sys/bus/workqueue/devices/writeback/ and allows adjusting maximum concurrency level, cpumask and nice level (commit)

  • Implement runtime power management (commit)

  • md: Allow devices to be re-added to a read-only array. (commit)

6. File systems


ext4

  • Introduce reserved space (commit)

  • Implementation of a new ioctl called EXT4_IOC_SWAP_BOOT (commit)

  • Reserve xattr index for Rich ACL support (commit)


XFS

  • add quota-driven speculative preallocation throttling (commit)

  • increase prealloc size to double that of the previous extent (commit)

  • introduce CONFIG_XFS_WARN (commit)

  • xfs_dquot prealloc throttling watermarks and low free space (commit)


Btrfs

  • Rescan for qgroups (commit)

  • Automatic rescan after "quota enable" command (commit)

  • Create the subvolume qgroup automatically when enabling quota (commit)

  • Deprecate subvolrootid mount option (obsoleted by subvol) (commit)


  • Introduce readahead mode of node pages (commit)


  • NFSv4.1: Enable open-by-filehandle (commit)

7. Networking

  • netlink: Add support for memory mapped netlink I/O (commit 1, 2, 3, 4, 5, 6, 7, 8)

  • Per hash bucket locking for the frag queue hash. This removes two write locks, and the only remaining write lock is for protecting hash rebuild. This essentially reduces the readers-writer lock to a rebuild lock (commit), (commit)

  • IPv6: implement RFC3168 5.3 (ecn protection) for ipv6 fragmentation handling (commit)

  • IPv6: Add support for IPv6 tokenized IIDs, which allow administrators to assign well-known host-part addresses to nodes while still obtaining the global network prefix from Router Advertisements. The specification is currently in draft status (commit)

  • tcp: implement RFC5682 F-RTO (commit)

  • tcp: Remove TCP cookie transactions (commit)

  • tunneling: Add generic Tunnel segmentation offloading support for IPv4-UDP based tunnels (commit)

  • vlan: Add 802.1ad support (commit), (commit), (commit)

  • bond: add support to read speed and duplex via ethtool (commit)

  • xfrm: add rfc4494 AES-CMAC-96 support (commit)

  • sctp: Add buffer utilization fields to /proc/net/sctp/assocs (commit)

  • tipc: Add support for running TIPC on IP-over-InfiniBand devices (commit), (commit)

  • Add MIB counters for checksum errors in IP layer, and TCP/UDP/ICMP layers (commit)

  • Add socket option to enable error queue packets waking select (commit)

  • team: introduce random mode (commit)

  • sock_diag: allow to dump bpf filters (commit)

  • filter: add minimal BPF JIT image disassembler (commit)


  • Allow L2 redirection with L3 switching (commit)

  • Use UDP Tunnel segmentation (commit)

  • Allow setting destination to unicast address. (commit)


  • implement RFC3168 5.3 (ecn protection) for ipv6 fragmentation handling (commit)

  • ipset: Make possible to test elements marked with nomatch (commit)

  • ipset: set match: add support to match the counters (commit)

  • nfnetlink_queue: zero copy support (commit)

  • Diag core and basic socket info dumping (commit)

802.11 (wireless)

  • Extend support for IEEE 802.11r Fast BSS Transition (commit)

  • Add P2P Notice of Absence attribute (commit)

  • Enable TDLS on P2P client interfaces (commit)

  • Introduce critical protocol indication from user-space (commit)

  • mac80211: add P2P NoA settings (commit)

  • Support userspace MPM (commit)


  • RFKILL support (commit)

  • llcp: Implement socket options (commit)

  • llcp: Service Name Lookup SDRES aggregation (commit)

  • llcp: Service Name Lookup netlink interface (commit)

  • llcp: Add support in getsockopt for RW, LTO, and MIU remote parameters (commit)

  • llcp: Aggregated frames support (commit)

8. Crypto

  • Add CMAC support to CryptoAPI (commit)

  • aesni_intel - add more optimized XTS mode for x86-64 (commit)

  • atmel-aes: add support for latest release of the IP (0x130) (commit)

  • atmel-sha - add support for latest release of the IP (0x410) (commit)

  • atmel-tdes - add support for latest release of the IP (0x700) (commit)

  • blowfish: add AVX2/x86_64 implementation of blowfish cipher (commit)

  • camellia: add AVX2/AES-NI/x86_64 assembler implementation of camellia cipher (commit), add more optimized XTS code (commit)

  • sahara: Add driver for SAHARA2 accelerator. (commit)

  • sha256: optimized sha256 x86_64 assembly routine using Supplemental SSE3 instructions (commit), optimized sha256 x86_64 assembly routine with AVX instructions (commit), optimized sha256 x86_64 routine using AVX2's RORX instructions (commit); module providing optimized routines using SSSE3, AVX or AVX2 instructions (commit)

  • sha512: Optimized SHA512 x86_64 assembly routine using AVX instructions. (commit), optimized SHA512 x86_64 assembly routine using AVX2 RORX instruction. (commit), optimized SHA512 x86_64 assembly routine using Supplemental SSE3 instructions. (commit); create module providing optimized SHA512 routines using SSSE3, AVX or AVX2 instructions. (commit)

  • twofish: add AVX2/x86_64 assembler implementation of twofish cipher (commit), use optimized XTS code (commit)

  • Add more optimized XTS-mode for serpent-avx (commit)

9. Virtualization

  • pvpanic: pvpanic device driver (commit)


  • New emulated device API (commit), (commit)

  • x86: Increase the "hard" max VCPU limit (commit)

  • PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls (commit), add kernel emulation for the XICS interrupt controller (commit)


  • caif_virtio: Introduce caif over virtio (commit)

  • virtio-scsi: introduce multiqueue support (commit)

  • vringh: host-side implementation of virtio rings. (commit)


  • Add a new driver to support host initiated backup (commit)

  • balloon: Implement hot-add functionality (commit)


10. Security

  • Smack: add support for modification of existing rules (commit)

  • audit: add an option to control logging of passwords with pam_tty_audit (commit)

  • audit: allow checking the type of audit message in the user filter (commit)

11. Tracing/perf


  • Add new "perf mem" command for memory access profiling (commit 1, 2, 3, 4, 5)

  • perf stat: Add per-core aggregation. This option is used to aggregate system-wide counts on a per physical core basis. On processors with hyperthreading, this means counts of all HT threads running on a physical core are aggregated (commit)

  • perf stat: Introduce --repeat forever (commit), rename --aggr-socket to --per-socket (commit)

  • perf annotate: Add --group option to enable event grouping. When enabled, the information of all group members is shown with the leader, so non-leader events are skipped (commit), (commit), (commit)

  • perf report: Add --no-demangle option (commit)

  • perf tests: Add attr record -C cpu test (commit), add attr stat -C cpu test (commit)

  • Add support for weighted sampling (commit)

  • Make perf_event cgroup hierarchical (commit)


  • Add function probe triggers to enable/disable events (commit)

  • Add "uptime" trace clock that uses jiffies (commit)

  • Add a way to soft disable trace events (commit)

  • Add function-trace option to disable function tracing of latency tracers (commit)

12. Other news sites that track the changes of this release



last edited 2013-07-02 16:20:29 by diegocalleja