Linux 3.0 released on 21 Jul, 2011

Summary: Besides a new version numbering scheme, Linux 3.0 also has several new features: Btrfs data scrubbing and automatic defragmentation, XEN Dom0 support, unprivileged ICMP_ECHO, wake on WLAN, Berkeley Packet Filter JIT filtering, a memcached-like system for the page cache, a sendmmsg() syscall that batches sendmsg() calls and setns(), a syscall that allows better handling of light virtualization systems such as containers. New hardware support has been added: for example, Microsoft Kinect, AMD Llano Fusion APUs, Intel iwlwifi 105 and 135, Intel C600 serial-attached-scsi controller, Ralink RT5370 USB, several Realtek RTL81xx devices or the Apple iSight webcam. Many other drivers and small improvements have been added.

1. Prominent features

1.1. Btrfs: Automatic defragmentation, scrubbing, performance improvements

Automatic defragmentation

COW (copy-on-write) filesystems have many advantages, but they also have some disadvantages, for example fragmentation. Btrfs lays out the data sequentially when files are written to the disk for first time, but a COW design implies that any subsequent modification to the file must not be written on top of the old data, but be placed in a free block, which will cause fragmentation (RPM databases are a common case of this problem). Aditionally, it suffers the fragmentation problems common to all filesystems.

Btrfs already offers alternatives to fight this problem: First, it supports online defragmentation using the command "btrfs filesystem defragment". Second, it has a mount option, -o nodatacow, that disables COW for data. Now btrfs adds a third option, the -o autodefrag mount option. This mechanism detects small random writes into files and queues them up for an automatic defrag process, so the filesystem will defragment itself while it's used. It isn't suited to virtualization or big database workloads yet, but works well for smaller files such as rpm, SQLite or bdb databases. Code: (commit)


"Scrubbing" is the process of checking the integrity of the data in the filesystem. This initial implementation of scrubbing will check the checksums of all the extents in the filesystem. If an error occurs (checksum or IO error), a good copy is searched for. If one is found, the bad copy will be rewritten. Code: (commit 1, 2)

Other improvements

-File creation/deletion speedup: The performance of file creation and deletion on btrfs was very poor. The reason is that for each creation or deletion, btrfs must do a lot of b+ tree insertions, such as inode item, directory name item, directory name index and so on. Now btrfs can do some delayed b+ tree insertions or deletions, which allows to batch these modifications. Microbenchmarks of file creation have been speed up by ~15%, and file deletion by ~20%. Code: (commit)

-Do not flush csum items of unchanged file data: speeds up fsync. A sysbench workload doing "random write + fsync" went from 112.75 requests/sec to 1216 requests/sec. Code: (commit)

-Quasi-round-robin for space allocation in multidevice setups: the chunk allocator currently always allocates space on the devices in the same order. This leads to a very uneven distribution, especially with RAID1 or RAID10 and an uneven number of devices. Now Btrfs always sorts the devices before allocating, and allocates the stripes on the devices with the most available space. Code: (commit)

1.2. sendmmsg(): batching of sendmsg() calls

Recvmsg() and sendmsg() are the syscalls used to receive/send data to the network. In 2.6.33, Linux added recvmmsg(), a syscall that allows to receive in a single call data that would need multiple recvmsg() calls, improving throughput and latency for a number of scenarios. Now, a equivalent sendmmsg() syscall has been added. A microbenchmark saw a 20% improvement in throughput on UDP send and 30% on raw socket send

Code: (commit)

1.3. XEN dom0 support

Finally, Linux has got Xen dom0 support

1.4. Cleancache

Recommended LWN article: Cleancache and Frontswap

Cleancache is an optional feature that can potentially increases page cache performance. It could be described as a memcached-like system, but for cache memory pages. It provides memory storage not directly accessible or addressable by the kernel, and it does not guarantee that the data will not vanish. It can be used by virtualization software to improve memory handling for guests, but it can also be useful to implement things like a compressed cache.

Code: (commit), (commit)

1.5. Berkeley Packet Filter just-in-time filtering

Recommended LWN article: A JIT for packet filters

The Berkeley Packet Filter filtering capabilities, used by tools like libpcap/tcpdump, are normally handled by an interpreter. This release adds a simple JIT that generates native code when filter is loaded in memory (something already done by other OSes, like FreeBSD). Admin need to enable this feature writting "1" to /proc/sys/net/core/bpf_jit_enable

Code: (commit)

1.6. Wake on WLAN support

Wake on Wireless is a feature to allow the system to go into a low-power state (e.g. ACPI S3 suspend) while the wireless NIC remains active and does varying things for the host, e.g. staying connected to an AP or searching for networks. The 802.11 stack has added support for it.

Code: (commit 1, 2)

1.7. Unprivileged ICMP_ECHO messages

Recommended LWN article: ICMP sockets

This release makes it possible to send ICMP_ECHO messages (ping) and receive the corresponding ICMP_ECHOREPLY messages without any special privileges, similar to what is implemented in Mac OS X. In other words, the patch makes it possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. Initially this functionality was written for Linux 2.4.32, but unfortunately it was never made public. The new functionality is disabled by default, and is enabled at bootup by supporting Linux distributions, optionally with restriction to a group or a group range.

Code: (commit)

1.8. setns() syscall: better namespace handling

Recommended LWN article: Namespace file descriptors

Linux supports different namespaces for many of the resources its handles; for example, lightweight forms of virtualization such as containers or systemd-nspaw show to the virtualized processes a virtual PID different from the real PID. The same thing can be done with the filesystem directory structure, network resources, IPC, etc. The only way to set different namespace configurations was using different flags in the clone() syscall, but that system didn't do things like allow to one processes to access to other process' namespace. The setns() syscall solves that problem-

Code: (commit 1, 2, 3, 4, 5, 6)

1.9. Alarm-timers

Recommended LWN article: Waking systems from suspend

Alarm-timers are a hybrid style timer, similar to high-resolution timers, but when the system is suspended, the RTC device is set to fire and wake the system for when the soonest alarm-timer expires. The concept for Alarm-timers was inspired by the Android Alarm driver, and the interface to userland uses the POSIX clock and timers interface, using two new clockids:CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM.

Code: (commit 1, 2)

2. Driver and architecture-specific changes

All the driver and architecture-specific changes can be found in the Linux_3.0_DriverArch page

3. VFS

4. Process scheduler

5. Memory management

6. Networking

7. File systems







8. Crypto

9. Virtualization

10. Security

11. Tracing/profiling

12. Various core changes


KernelNewbies: Linux_3.0 (last edited 2017-12-30 01:30:17 by localhost)