Linux 3.5 was released on 21 Jul 2012.
Summary: This release includes support for metadata checksums in ext4, userspace probes for performance profiling with tools like SystemTap or perf, a sandboxing mechanism that allows filtering system calls, a new network queue management algorithm designed to fight bufferbloat, support for checkpointing and restoring TCP connections, support for TCP Early Retransmit (RFC 5827), support for Android-style opportunistic suspend, btrfs I/O failure statistics, and SCSI over FireWire and USB. Many small features, new drivers and fixes are also available.
Contents
1. Prominent features in Linux 3.5
1.1. ext4 metadata checksums
Modern filesystems such as ZFS and Btrfs have proved that ensuring the integrity of the filesystem using checksums is a valuable feature. Ext4 has added the ability to store checksums of various metadata fields. Every time a metadata field is read, the checksum of the read data is compared with the stored checksum; if they differ, the metadata is corrupted (note that this feature doesn't cover data, only the internal metadata structures, and it doesn't have "self-healing" capabilities). The amount of code added to implement this feature is: 1659 insertions(+), 162 deletions(-).
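The verify-on-read idea can be illustrated with a minimal Python sketch. It does not reproduce ext4's on-disk format or what exactly each checksum covers: the MetadataBlock class is a hypothetical name invented for this illustration. The metadata_csum feature uses CRC32C, so the sketch carries a small (slow, bitwise) CRC32C of its own.

```python
CRC32C_POLY = 0x82F63B78  # reflected Castagnoli polynomial


def crc32c(data: bytes) -> int:
    """Bitwise CRC32C: slow, but free of external dependencies."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


class MetadataBlock:
    """Hypothetical metadata block that carries its own checksum."""

    def __init__(self, payload: bytes):
        self.payload = bytearray(payload)
        # The checksum is computed once, when the block is written.
        self.stored_csum = crc32c(payload)

    def read(self) -> bytes:
        """Return the payload, or raise if it no longer matches the checksum."""
        if crc32c(bytes(self.payload)) != self.stored_csum:
            # Detection only: there is no second copy to repair from.
            raise IOError("metadata checksum mismatch")
        return bytes(self.payload)
```

Flipping a single bit in the payload after construction makes read() fail, which mirrors the "detect but do not heal" behaviour described above.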
Any ext4 filesystem can be upgraded to use checksums using the "tune2fs -O metadata_csum" command, or "mkfs -O metadata_csum" at creation time. Once this feature is enabled in a filesystem, older kernels with no checksum support will only be able to mount it in read-only mode.
As far as performance impact goes, it shouldn't be noticeable for common desktop and server workloads. A mail server ffsb simulation shows nearly no change. On a test doing only file creation and deletion and extent tree modifications, a performance drop of about 20 percent was measured. However, that workload is very heavily oriented towards metadata. In most real-world workloads metadata is a small fraction of total I/O, so unless your workload is metadata-oriented, the cost of enabling this feature should be negligible.
Recommended LWN article: "Improving ext4: bigalloc, inline data, and metadata checksums"
Implementation details: Ext4 Metadata checksums
Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)
1.2. Uprobes: userspace probes
Uprobes, the user-space counterpart of kprobes, makes it possible to place performance probes at any memory address of a user application and collect debugging and performance information non-disruptively, which can be used to find performance problems. These probes can be placed dynamically in a running process; there is no need to restart the program or modify the binaries. The probes are usually managed with an instrumentation application, such as perf probe, SystemTap or LTTng.
A sample usage of uprobes with perf could be to profile libc's malloc() calls:
$ perf probe -x /lib64/libc.so.6 malloc
Added new event: probe_libc:malloc (on 0x7eac0)
A probe has been created. Now, let's record the global usage of malloc across the whole system for 1 second:
$ perf record -e probe_libc:malloc -agR sleep 1
Now you can view the results in the TUI with "$ perf report", or as plain text without the call graph info with "$ perf report -g flat --stdio".
If you don't know which function you want to probe, you can get a list of probe-able functions in libraries and executables using the -F parameter, for example: "$ perf probe -F -x /lib64/libc.so.6" or "$ perf probe -F -x /bin/zsh". You can use multiple probes as well and mix them with kprobes and regular PMU events or kernel tracepoints.
The uprobes code is one of the longest standing out-of-the-tree patches. It originates from SystemTap and has been included for years in Fedora and RHEL kernels.
Recommended LWN article: Uprobes in 3.5
Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
1.3. Seccomp-based system call filtering
Seccomp (short for "secure computing") is a simple sandboxing mechanism added back in 2.6.12 that allows a process to transition to a state where it cannot make any system calls except a very restricted set (exit, sigreturn, read and write to already open file descriptors). Seccomp has now been extended: instead of a fixed and very limited set of system calls, seccomp has evolved into a filtering mechanism that allows processes to specify an arbitrary filter of system calls (expressed as a Berkeley Packet Filter program) that should be forbidden. This can be used to implement different types of security mechanisms; for example, the Linux port of the Chromium web browser supports this feature to run plugins in a sandbox.
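To make "a filter expressed as a Berkeley Packet Filter program" more concrete, here is a minimal Python sketch that builds an allow-list filter as classic BPF instructions and evaluates it with a toy interpreter. It installs nothing in the kernel: a real program would hand the packed instructions to prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...) and should also check the architecture field of struct seccomp_data, which this sketch skips. The opcode and return values come from the kernel headers; the names build_allowlist, pack and run_filter are made up for this example, and the syscall numbers in the usage note are the x86-64 ones.

```python
import struct

# Classic BPF opcodes, as defined in the kernel's BPF headers.
BPF_LD_W_ABS = 0x20   # BPF_LD | BPF_W | BPF_ABS: load the 32-bit word at offset k
BPF_JEQ_K = 0x15      # BPF_JMP | BPF_JEQ | BPF_K: branch on accumulator == k
BPF_RET_K = 0x06      # BPF_RET | BPF_K: return the constant k

# Seccomp filter return values from <linux/seccomp.h>.
SECCOMP_RET_KILL = 0x00000000
SECCOMP_RET_ALLOW = 0x7FFF0000

NR_OFFSET = 0         # offsetof(struct seccomp_data, nr)


def build_allowlist(allowed_nrs):
    """Return (code, jt, jf, k) instructions allowing only allowed_nrs."""
    n = len(allowed_nrs)
    if n > 255:
        raise ValueError("jump offsets are 8 bits wide in classic BPF")
    prog = [(BPF_LD_W_ABS, 0, 0, NR_OFFSET)]
    for i, nr in enumerate(allowed_nrs):
        # On a match, skip the remaining comparisons and the KILL
        # instruction so execution lands on the final ALLOW.
        prog.append((BPF_JEQ_K, n - i, 0, nr))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_KILL))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    return prog


def pack(prog):
    """Serialize to the struct sock_filter layout: u16 code, u8 jt, u8 jf, u32 k."""
    return b"".join(struct.pack("=HBBI", *insn) for insn in prog)


def run_filter(prog, syscall_nr):
    """Toy interpreter covering only the three opcodes used above."""
    acc = 0
    pc = 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD_W_ABS:
            acc = syscall_nr          # only the nr field is modelled
        elif code == BPF_JEQ_K:
            pc += jt if acc == k else jf
        elif code == BPF_RET_K:
            return k
        else:
            raise ValueError("opcode not modelled in this sketch")
```

For example, build_allowlist([0, 1, 15, 60]) yields a filter that lets read, write, rt_sigreturn and exit through on x86-64 and returns SECCOMP_RET_KILL for every other system call number.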
The systemd init daemon has added support for this feature. A unit file can use the SystemCallFilter option to list the syscalls that will be allowed to run; any other syscall will not be allowed:
[Service]
ExecStart=/bin/echo "I am in a sandbox"
SystemCallFilter=brk mmap access open fstat close read fstat mprotect arch_prctl munmap write
Recommended links: Documentation and Samples.
Recommended LWN article: Yet another new approach to seccomp
1.4. Bufferbloat fighting: CoDel queue management
CoDel (short for "controlled delay") is a new queue management algorithm designed to fight the problems associated with excessive buffering across an entire network path, a problem known as "bufferbloat". According to Jim Gettys, who coined the term bufferbloat, "this work is the culmination of their at three major attempts to solve the problems with AQM algorithms over the last 14 years".
ACM paper detailing the algorithm, by Kathleen Nichols and Van Jacobson: Controlling Queue Delay
Codel bufferbloat project page: http://www.bufferbloat.net/projects/codel/wiki
Recommended LWN article: The CoDel queue management algorithm
1.5. TCP connection repair
As part of an ongoing effort to implement process checkpointing/restart, this release adds support for stopping a TCP connection and restarting it on another host. Container virtualization implementations will use this feature to relocate an entire network connection from one host to another transparently for the remote end. This is achieved by putting the socket in a "repair" mode that allows gathering the necessary information or restoring previous state into a new socket.
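As a rough illustration of the "gather the necessary information" half, the following Python sketch switches a socket into repair mode and reads the sequence numbers of its send and receive queues. It is a sketch under assumptions, not a checkpointing tool: the option numbers are copied by hand from <linux/tcp.h> because the Python socket module is not assumed to export them, the function name read_queue_seqs is invented here, and the kernel only accepts TCP_REPAIR from a caller with CAP_NET_ADMIN, so the function returns None when that is refused.

```python
import socket

# Socket option numbers from <linux/tcp.h>, defined by hand (assumption:
# the Python socket module may not export them).
TCP_REPAIR = 19
TCP_REPAIR_QUEUE = 20
TCP_QUEUE_SEQ = 21
TCP_RECV_QUEUE = 1
TCP_SEND_QUEUE = 2


def read_queue_seqs(sock):
    """Return {'send': seq, 'recv': seq} for sock, or None if the kernel
    refuses repair mode (for example, without CAP_NET_ADMIN)."""
    tcp = socket.IPPROTO_TCP
    try:
        sock.setsockopt(tcp, TCP_REPAIR, 1)            # enter repair mode
    except OSError:
        return None
    try:
        seqs = {}
        for name, queue in (("send", TCP_SEND_QUEUE), ("recv", TCP_RECV_QUEUE)):
            sock.setsockopt(tcp, TCP_REPAIR_QUEUE, queue)   # pick a queue
            # Sequence numbers are unsigned 32-bit values.
            seqs[name] = sock.getsockopt(tcp, TCP_QUEUE_SEQ) & 0xFFFFFFFF
        return seqs
    except OSError:
        return None
    finally:
        try:
            sock.setsockopt(tcp, TCP_REPAIR, 0)        # leave repair mode
        except OSError:
            pass
```

A restore tool would do the reverse on a fresh socket: enter repair mode, set the saved sequence numbers and queue contents, and then leave repair mode.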
Documentation: http://criu.org/TCP_connection
Recommended LWN article: TCP connection repair
1.6. TCP Early Retransmit
TCP (and SCTP) Early Retransmit (RFC 5827) allows the transport, under certain conditions, to reduce the number of duplicate acknowledgments required to trigger a fast retransmission. This lets it use fast retransmit to recover segment losses that would otherwise require a lengthy retransmission timeout. In other words, connections recover from lost packets faster, which improves latency. A large-scale web server experiment on the performance impact of ER is summarized in section 6 of the paper "Proportional Rate Reduction for TCP".
Early retransmit is controlled by the tcp_early_retrans sysctl, found at /proc/sys/net/ipv4/tcp_early_retrans. It accepts three values: "0" (disables early retransmit), "1" (enables it), and "2", the default, which enables early retransmit but delays fast recovery and fast retransmit by a fourth of the RTT (this mitigates false recoveries when the network has a small degree of reordering).
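The current setting can be inspected from a script. The sketch below reads the sysctl file named above and maps the three values documented here to a short description. The helper names read_mode and describe are invented for this example, and any value outside 0-2 is reported as unknown rather than guessed at.

```python
from pathlib import Path

SYSCTL_PATH = Path("/proc/sys/net/ipv4/tcp_early_retrans")

# Meanings as documented for this release.
MODES = {
    0: "early retransmit disabled",
    1: "early retransmit enabled",
    2: "early retransmit enabled, recovery delayed by RTT/4 (default)",
}


def read_mode(path=SYSCTL_PATH):
    """Return the sysctl value as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def describe(value):
    """Map a sysctl value to the description given in the text."""
    return MODES.get(value, "unknown value (not documented for this release)")


if __name__ == "__main__":
    mode = read_mode()
    if mode is None:
        print("tcp_early_retrans is not available on this system")
    else:
        print(f"tcp_early_retrans = {mode}: {describe(mode)}")
```

Changing the value requires root, for example by writing "1" to the same file.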
1.7. Android-style opportunistic suspend
The most controversial issue in the merge of Android code into Linux is the functionality called "suspend blockers" or "wakelocks". They are part of a specific approach to power management, which is based on aggressive utilization of full system suspend as much as possible. The natural state of the system is a sleep state, in which energy is only used for refreshing memory and providing power to a few devices that can wake the system up. The system only uses the full power state when it has to do some real work, and when it finishes it goes back to a suspend state.
This is a good idea, but the kernel developers didn't like Android's "suspend blockers" (a full technical analysis on the issue can be found here). Endless flames have been going on for years, and little progress has been made, which was a huge problem for the convergence of Android and Linux, because drivers of Android devices use the suspend blocker APIs, and the lack of such APIs in Linux makes it impossible to merge them. But in this release, the kernel incorporates a similar functionality, called "autosleep and wake locks". It is expected/hoped that Android will be able to use it, and that merging drivers from Android devices will be easier.
Recommended LWN article: Autosleep and wake locks
1.8. Btrfs: I/O failure statistics, latency improvements
Support for I/O failure statistics has been added. I/O errors, CRC errors, and generation checks of metadata blocks are tracked for each drive. The Btrfs command to retrieve and print the device stats, to be included in future btrfs-progs, should be "btrfs device stats".
This release also includes fairly large changes that make Btrfs much friendlier to memory reclaim and lower latencies quite a lot for synchronous I/O.
1.9. SCSI over FireWire and USB
This release includes a driver for using an IEEE-1394 connection as a SCSI transport. This makes it possible to expose SCSI devices, for example hard disk drives, to other nodes on the FireWire bus. The functionality is similar to FireWire Target Disk Mode on many Apple computers.
This release also adds a usb-gadget driver that does the same with USB. The driver supports two USB protocols: BBB or BOT (Bulk Only Transport) and UAS (USB Attached SCSI). BOT is advertised on alternative interface 0 (primary) and UAS is on alternative interface 1. Both protocols can work on USB 2.0 and USB 3.0. UAS utilizes the USB 3.0 feature called streams support.
2. Driver and architecture-specific changes
All the driver and architecture-specific changes can be found in the Linux_3.5_DriverArch page
3. Various core changes
Introduce /proc/<pid>/task/<tid>/children entry, which provides information about task children. This is useful for process checkpointing/restore (commit)
Report file/anon bit in /proc/pid/pagemap (commit)
Add skew_tick boot option: offsets the periodic timer tick per CPU to mitigate xtime_lock contention on larger systems, and/or RCU lock contention on all systems with CONFIG_MAXSMP set. It increases power consumption, thus should only be enabled if running jitter sensitive (HPC/RT) workloads (commit)
microoptimization: move inode stat information closer together (commit)
fuse: add fallocate() operation (commit)
process scheduler: remove stale power aware scheduling remnants and dysfunctional knobs (commit)
epoll(): Add a flag, EPOLLWAKEUP, to prevent suspend while epoll events are ready (commit)
Add Apple NLS (Native Language Support) tables (commit)
ramoops: use pstore interface (commit), add ECC support (commit)
Connect tools/ to the kernel build system. "make tools/<toolname>" will build the project (commit)
- RCU locking
- IPC mqueue
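The /proc/<pid>/task/<tid>/children entry introduced in this section holds a space-separated list of child PIDs, so it is easy to consume from a script. A minimal sketch follows; the function names are invented for this example, and the file only exists when the kernel was built with the corresponding option, so a missing file is reported as None instead of an error.

```python
import os
from pathlib import Path


def parse_children(text):
    """Parse the space-separated PID list found in the children file."""
    return [int(token) for token in text.split()]


def children_of(pid, tid=None):
    """Return the child PIDs of thread tid in process pid, or None if the
    entry is missing or unreadable on this kernel."""
    tid = pid if tid is None else tid
    path = Path(f"/proc/{pid}/task/{tid}/children")
    try:
        return parse_children(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print(children_of(os.getpid()))
```

A checkpointing tool could call children_of() recursively, starting from the root of a process tree, to enumerate everything that needs to be frozen and dumped.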
4. Memory Management
Frontswap support. Frontswap is so named because it can be thought of as the opposite of a "backing" store for a swap device. The data is stored into "transcendent memory", memory that is not directly accessible or addressable by the kernel and is of unknown and possibly time-varying size. When space in transcendent memory is available, a significant swap I/O reduction may be achieved. When none is available, all frontswap calls are reduced to a single pointer-compare-against-NULL resulting in a negligible performance hit and swap data is stored as normal on the matching swap device (commit 1, 2, 3, 4)
Add a Contiguous Memory Allocator (recommended LWN article: A deep dive into CMA). This is a memory allocator that attempts to provide big contiguous allocations of memory. It operates on memory regions where only movable pages can be allocated. This way, the kernel can use the memory for pagecache and, when a device driver requests it, the allocated pages can be migrated away (commit)
Remove swap token code and lumpy reclaim: they no longer fit in the current VM model (commit), (commit)
5. Block
dm thin target: provide userspace access to pool metadata (commit)
dm thin: use dedicated slab caches prefixed with a "dm_" name rather than relying on kmalloc mempools backed by generic slab caches (commit)
raid5: add AVX optimized RAID5 checksumming (commit)
raid6: Add SSSE3 optimized recovery functions (commit)
md: allow a reshape operation to be reversed. (commit)
raid10: add reshape support (commit)
6. Perf/tracing
- annotate browser
7. Virtualization
KVM: Introduce direct MSI message injection for in-kernel irqchips (commit)
8. Security
- SELinux
- Smack
TOMOYO: Accept manager programs which do not start with / . (commit)
Yama: add additional ptrace scopes (commit)
KEYS: Add support for invalidating a key (commit)
9. Networking
mac802154: hardware-independent IEEE 802.15.4 networking stack for SoftMAC devices (the ones implementing only PHY level of IEEE 802.15.4 standard) (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
TCP microoptimization: 10Gb+ TCP sender was dropping a lot of incoming ACKs because of sk_rcvbuf limit in sk_add_backlog() (commit)
team: add binary option type (commit), add loadbalance mode (commit), add per-port option for enabling/disabling ports (commit), add support for per-port options (commit), allow to enable/disable ports (commit)
Infiniband: Add raw packet QP type (commit)
ipv6: treat ND option 31 as userland (DNSSL support) (commit)
6lowpan: IPv6 link-local address (commit)
batman-adv: add basic bridge loop avoidance code (commit), (commit), remove old bridge loop avoidance code (commit)
caif: set traffic class for CAIF packets (commit)
Add generic PF_BRIDGE:RTM_ FDB hooks (commit)
pktsched: netem: add ECN capability (commit)
Delete all instances of special processing for token ring (commit), (commit)
econet: remove ancient bug-ridden protocol (commit)
dcb: Add an optional max rate attribute (commit), add CEE notify calls (commit)
- 802.11 (Wireless)
- Netfilter
- L2TP
- NFC
10. File systems
- Btrfs
- Tmpfs
- XFS
Introduce lseek(2) SEEK_DATA/SEEK_HOLE support (commit)
- CIFS
Introduce SMB2 mounts as vers=2.1 (commit)
- JFFS2
Add parameter to reserve disk space for root (commit)
- exofs
Add sysfs info for autologin/pNFS export (commit)
- Cifs
Add a cache= option to better describe the different cache flavors (commit)
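The lseek(2) SEEK_DATA/SEEK_HOLE support listed for XFS in this section lets userspace walk the allocated regions of a sparse file without reading the holes. A minimal Python sketch, using the os.SEEK_DATA and os.SEEK_HOLE constants from the standard library; the function name data_extents is invented for this example. On a filesystem that does not track holes, the kernel may report the whole file as a single data extent, which the sketch handles as well.

```python
import errno
import os
import tempfile


def data_extents(fd):
    """Return a list of (start, end) byte ranges reported as data."""
    size = os.fstat(fd).st_size
    extents = []
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:   # no data at or after offset
                break
            raise
        # There is always an implicit hole at end of file, so this succeeds.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        extents.append((start, end))
        offset = end
    return extents


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        f.seek(1 << 20)            # leave a 1 MiB hole at the start
        f.write(b"payload")
        f.flush()
        print(data_extents(f.fileno()))
```

Backup and copy tools can use the same loop to copy only the data ranges and recreate the holes on the destination.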
11. Other news sites that track the changes of this release
H-Online part 1 - Networking, part 2 - filesystems and storage, part 3 - architecture, part 4 - drivers, part 5 - infrastructure