Linux 3.0 released on 21 Jul, 2011
Summary: Besides a new version numbering scheme, Linux 3.0 also has several new features: Btrfs data scrubbing and automatic defragmentation, XEN Dom0 support, unprivileged ICMP_ECHO, wake on WLAN, Berkeley Packet Filter JIT filtering, a memcached-like system for the page cache, a sendmmsg() syscall that batches sendmsg() calls and setns(), a syscall that allows better handling of light virtualization systems such as containers. New hardware support has been added: for example, Microsoft Kinect, AMD Llano Fusion APUs, Intel iwlwifi 105 and 135, Intel C600 serial-attached-scsi controller, Ralink RT5370 USB, several Realtek RTL81xx devices or the Apple iSight webcam. Many other drivers and small improvements have been added.
Contents
1. Prominent features
1.1. Btrfs: Automatic defragmentation, scrubbing, performance improvements
Automatic defragmentation
COW (copy-on-write) filesystems have many advantages, but they also have some disadvantages, for example fragmentation. Btrfs lays out the data sequentially when files are written to the disk for first time, but a COW design implies that any subsequent modification to the file must not be written on top of the old data, but be placed in a free block, which will cause fragmentation (RPM databases are a common case of this problem). Aditionally, it suffers the fragmentation problems common to all filesystems.
Btrfs already offers alternatives to fight this problem: First, it supports online defragmentation using the command "btrfs filesystem defragment". Second, it has a mount option, -o nodatacow, that disables COW for data. Now btrfs adds a third option, the -o autodefrag mount option. This mechanism detects small random writes into files and queues them up for an automatic defrag process, so the filesystem will defragment itself while it's used. It isn't suited to virtualization or big database workloads yet, but works well for smaller files such as rpm, SQLite or bdb databases. Code: (commit)
Scrub
"Scrubbing" is the process of checking the integrity of the data in the filesystem. This initial implementation of scrubbing will check the checksums of all the extents in the filesystem. If an error occurs (checksum or IO error), a good copy is searched for. If one is found, the bad copy will be rewritten. Code: (commit 1, 2)
Other improvements
-File creation/deletion speedup: The performance of file creation and deletion on btrfs was very poor. The reason is that for each creation or deletion, btrfs must do a lot of b+ tree insertions, such as inode item, directory name item, directory name index and so on. Now btrfs can do some delayed b+ tree insertions or deletions, which allows to batch these modifications. Microbenchmarks of file creation have been speed up by ~15%, and file deletion by ~20%. Code: (commit)
-Do not flush csum items of unchanged file data: speeds up fsync. A sysbench workload doing "random write + fsync" went from 112.75 requests/sec to 1216 requests/sec. Code: (commit)
-Quasi-round-robin for space allocation in multidevice setups: the chunk allocator currently always allocates space on the devices in the same order. This leads to a very uneven distribution, especially with RAID1 or RAID10 and an uneven number of devices. Now Btrfs always sorts the devices before allocating, and allocates the stripes on the devices with the most available space. Code: (commit)
1.2. sendmmsg(): batching of sendmsg() calls
Recvmsg() and sendmsg() are the syscalls used to receive/send data to the network. In 2.6.33, Linux added recvmmsg(), a syscall that allows to receive in a single call data that would need multiple recvmsg() calls, improving throughput and latency for a number of scenarios. Now, a equivalent sendmmsg() syscall has been added. A microbenchmark saw a 20% improvement in throughput on UDP send and 30% on raw socket send
Code: (commit)
1.3. XEN dom0 support
Finally, Linux has got Xen dom0 support
1.4. Cleancache
Recommended LWN article: Cleancache and Frontswap
Cleancache is an optional feature that can potentially increases page cache performance. It could be described as a memcached-like system, but for cache memory pages. It provides memory storage not directly accessible or addressable by the kernel, and it does not guarantee that the data will not vanish. It can be used by virtualization software to improve memory handling for guests, but it can also be useful to implement things like a compressed cache.
1.5. Berkeley Packet Filter just-in-time filtering
Recommended LWN article: A JIT for packet filters
The Berkeley Packet Filter filtering capabilities, used by tools like libpcap/tcpdump, are normally handled by an interpreter. This release adds a simple JIT that generates native code when filter is loaded in memory (something already done by other OSes, like FreeBSD). Admin need to enable this feature writting "1" to /proc/sys/net/core/bpf_jit_enable
Code: (commit)
1.6. Wake on WLAN support
Wake on Wireless is a feature to allow the system to go into a low-power state (e.g. ACPI S3 suspend) while the wireless NIC remains active and does varying things for the host, e.g. staying connected to an AP or searching for networks. The 802.11 stack has added support for it.
1.7. Unprivileged ICMP_ECHO messages
Recommended LWN article: ICMP sockets
This release makes it possible to send ICMP_ECHO messages (ping) and receive the corresponding ICMP_ECHOREPLY messages without any special privileges, similar to what is implemented in Mac OS X. In other words, the patch makes it possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. Initially this functionality was written for Linux 2.4.32, but unfortunately it was never made public. The new functionality is disabled by default, and is enabled at bootup by supporting Linux distributions, optionally with restriction to a group or a group range.
Code: (commit)
1.8. setns() syscall: better namespace handling
Recommended LWN article: Namespace file descriptors
Linux supports different namespaces for many of the resources its handles; for example, lightweight forms of virtualization such as containers or systemd-nspaw show to the virtualized processes a virtual PID different from the real PID. The same thing can be done with the filesystem directory structure, network resources, IPC, etc. The only way to set different namespace configurations was using different flags in the clone() syscall, but that system didn't do things like allow to one processes to access to other process' namespace. The setns() syscall solves that problem-
Code: (commit 1, 2, 3, 4, 5, 6)
1.9. Alarm-timers
Recommended LWN article: Waking systems from suspend
Alarm-timers are a hybrid style timer, similar to high-resolution timers, but when the system is suspended, the RTC device is set to fire and wake the system for when the soonest alarm-timer expires. The concept for Alarm-timers was inspired by the Android Alarm driver, and the interface to userland uses the POSIX clock and timers interface, using two new clockids:CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM.
2. Driver and architecture-specific changes
All the driver and architecture-specific changes can be found in the Linux_3.0_DriverArch page
3. VFS
Cache xattr security drop check for write: benchmarking on btrfs showed that a major scaling bottleneck on large systems on btrfs is currently the xattr lookup on every write, which causes an additional tree walk, hitting some per file system locks and quite bad scalability. This is also a problem in ext4, where it hits the global mbcache lock. Caching this check solves the problem (commit)
4. Process scheduler
Increase SCHED_LOAD_SCALE resolution: With this extra resolution, the scheduler can handle deeper cgroup hiearchies and do better shares distribution and load balancing on larger systems (especially for low weight task groups) (commit), (commit)
Move the second half of ttwu() to the remote CPU: avoids having to take rq->lock and doing the task enqueue remotely, saving lots on cacheline transfers. A semaphore benchmark goes from 647278 worker burns per second to 816715 (commit)
Next buddy hint on sleep and preempt path: a worst-case benchmark consisting of 2 tbench client processes with 2 threads each running on a single CPU changed from 105.84 MB/sec to 112.42 MB/sec (commit)
5. Memory management
Make mmu_gather preempemtible (commit)
Batch activate_page() calls to reduce zone->lru_lock contention (commit)
tmpfs: implement generic xattr support (commit)
- Memory cgroup controller:
6. Networking
Allow setting the network namespace by fd (commit)
- Wireless
Allow ethtool to set interface in loopback mode. (commit)
Allow no-cache copy from user on transmit (commit)
ipset: SCTP, UDPLITE support added (commit)
sctp: implement socket option SCTP_GET_ASSOC_ID_LIST (commit), implement event notification SCTP_SENDER_DRY_EVENT (commit)
bridge: allow creating bridge devices with netlink (commit), allow creating/deleting fdb entries via netlink (commit)
batman-adv: multi vlan support for bridge loop detection (commit)
pkt_sched: QFQ - quick fair queue scheduler (commit)
RDMA: Add netlink infrastructure that allows for registration of RDMA clients (commit)
7. File systems
BLOCK LAYER
Submit discard bio in batches in blkdev_issue_discard() - makes discarding data faster (commit)
EXT4
Enable "punch hole" functionality (recommended LWN article) (commit), (commit)
Add support for multiple mount protection (commit)
CIFS
Add support for mounting Windows 2008 DFS shares (commit)
Convert cifs_writepages to use async writes (commit), (commit)
Add rwpidforward mount option that enables a mode when CIFS forwards pid of a process who opened a file to any read and write operation (commit)
OCFS2
NILFS2
Implement resize ioctl (commit)
XFS
Add online discard support (commit)
8. Crypto
caam - Add support for the Freescale SEC4/CAAM (commit)
padlock - Add SHA-1/256 module for VIA Nano (commit)
s390: add System z hardware support for CTR mode (commit), add System z hardware support for GHASH (commit), add System z hardware support for XTS mode (commit)
s5p-sss - add S5PV210 advanced crypto engine support (commit)
9. Virtualization
User-mode Linux: add earlyprintk support (commit), add ucast Ethernet transport (commit)
xen: add blkback support (commit)
10. Security
Allow the application of capability limits to usermode helpers (commit)
- SELinux
11. Tracing/profiling
perf stat: Add -d -d and -d -d -d options to show more CPU events (commit), (commit)
perf stat: Add --sync/-S option (commit)
12. Various core changes
rcu: priority boosting for TREE_PREEMPT_RCU (commit)
ulimit: raise default hard ulimit on number of files to 4096 (commit)
- cgroups
kbuild: implement several W= levels (commit)
PM/Hibernate: Add sysfs knob to control size of memory for drivers (commit)
posix-timers: RCU conversion (commit)
coredump: add support for exe_file in core name (commit)