
Linux 2.6.35

Linux 2.6.35 was released on 1 August 2010.

Summary: Linux 2.6.35 includes support for transparent spreading of incoming network load across CPUs, Direct-IO support for Btrfs, a new experimental journal mode for XFS, the KDB debugger UI built on top of KGDB, improvements to 'perf', H.264 and VC1 video acceleration in Intel G45+ chips, support for the future Intel Cougarpoint graphics chip, power management for AMD Radeon chips, a memory defragmentation mechanism, support for the Layer 2 Tunneling Protocol version 3 (RFC 3931), support for multiple multicast route tables, support for the CAIF protocol used by ST-Ericsson products, support for the ACPI Platform Error Interface, and many new drivers and small improvements.

Note: Details on architecture-specific and driver changes have been moved to this page: Linux_2_6_35-DriversArch

  1. Prominent features (the cool stuff)
    1. Transparent spreading of incoming network traffic load across CPUs
    2. Btrfs improvements
    3. XFS Delayed logging
    4. KDB kernel debugger frontend
    5. perf improvements
    6. Graphic improvements
    7. Memory compaction
    8. Support for multiple multicast route tables
    9. L2TP Version 3 (RFC 3931) support
    10. CAIF Protocol support
    11. ACPI Platform Error Interface support
  2. Various core changes
  3. Filesystems
  4. Block
  5. Memory management
  6. Networking
  7. Tracing/Profiling
  8. Crypto
  9. Virtualization
  10. MD
  11. CPU scheduler
  12. Cpufreq/cpuidle
  13. Security

1. Prominent features (the cool stuff)

1.1. Transparent spreading of incoming network traffic load across CPUs

Recommended LWN articles: "Receive packet steering" and "Receive flow steering"

Network card bandwidth has grown to the point where it is hard for a single modern CPU to keep up. Two new features contributed by Google aim to spread the load of network handling across the CPUs available in the system: Receive Packet Steering (RPS) and Receive Flow Steering (RFS).

RPS distributes the load of received packet processing across multiple CPUs. This allows protocol processing (e.g. IP and TCP) to be performed on packets in parallel, which the previous code did not do. For each device (or each receive queue in a multi-queue device), a hash of the packet header is used to index into a mask of CPUs (which can be configured manually in /sys/class/net/<device>/queues/rx-<n>/rps_cpus) to decide which CPU will process a packet. If the network card computes the hash in hardware, that hash is used.

RFS adds a heuristic on top of this: instead of choosing the CPU purely from the hash, it tries to use the CPU on which the application calling recvmsg() is running or has recently run, to improve cache utilization. Together, these features effectively emulate what a multi-queue NIC can provide, but they are implemented in software and work with all kinds of network hardware, single-queue and multi-queue cards alike.

Benchmarks with 500 instances of the netperf TCP_RR test, using 1-byte requests and responses, show the potential benefit of this feature: with an e1000e network card on an 8-core Intel server, throughput goes from 104K tps at 30% CPU usage to 303K tps at 61% CPU usage when using RPS+RFS. An RPC test, which is similar in structure to the netperf RR test with 100 threads on each host but does more work in userspace than netperf, goes from 103K tps at 48% CPU utilization to 223K tps at 73% CPU utilization, with much lower latency.

Code: (commit 1, 2, 3)
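
The CPU selection described above can be sketched in user-space C. This is a simplified model, not the kernel's code. The mask parser only handles a single hex word of up to 64 CPUs, while the real rps_cpus file uses comma-separated 32-bit groups on larger systems. The flow hash is taken as an input; in the kernel it is computed from the packet's addresses and ports, or supplied by the NIC. The RFS part also omits the per-queue bookkeeping the kernel needs to avoid reordering packets within a flow. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_CPUS 64
#define FLOW_TABLE_SIZE 256          /* hypothetical size, must be a power of two */
#define NO_CPU (-1)

/* CPUs allowed to process packets for one receive queue (like rps_cpus). */
struct cpu_map {
    unsigned len;
    unsigned cpus[MAX_CPUS];
};

/* RFS state: CPU on which the application last called recvmsg(), per flow. */
struct flow_table {
    int cpu[FLOW_TABLE_SIZE];
};

/* Parse a single hex word such as "f" (CPUs 0-3) into a list of CPUs. */
int parse_rps_mask(const char *hex, struct cpu_map *map)
{
    char *end;
    unsigned long long mask = strtoull(hex, &end, 16);

    if (end == hex)
        return -1;
    map->len = 0;
    for (unsigned cpu = 0; cpu < MAX_CPUS; cpu++)
        if (mask & (1ULL << cpu))
            map->cpus[map->len++] = cpu;
    return 0;
}

/* RPS: map a 32-bit flow hash onto one CPU of the map. Multiplying and
   shifting by 32 scales the hash onto [0, len) without a division, and
   packets of the same flow always land on the same CPU.
   Returns NO_CPU if the map is empty (RPS disabled). */
int rps_select_cpu(const struct cpu_map *map, uint32_t flow_hash)
{
    if (map->len == 0)
        return NO_CPU;
    return (int)map->cpus[((uint64_t)flow_hash * map->len) >> 32];
}

void rfs_init(struct flow_table *t)
{
    for (unsigned i = 0; i < FLOW_TABLE_SIZE; i++)
        t->cpu[i] = NO_CPU;
}

/* Conceptually called from recvmsg(): remember where the reader runs. */
void rfs_record(struct flow_table *t, uint32_t flow_hash, int cpu)
{
    t->cpu[flow_hash & (FLOW_TABLE_SIZE - 1)] = cpu;
}

/* RFS: prefer the reader's CPU for cache locality, else fall back to RPS. */
int rfs_select_cpu(const struct flow_table *t, const struct cpu_map *map,
                   uint32_t flow_hash)
{
    int cpu = t->cpu[flow_hash & (FLOW_TABLE_SIZE - 1)];

    return cpu != NO_CPU ? cpu : rps_select_cpu(map, flow_hash);
}
```

With a mask of "f", a flow whose hash is 0x80000000 is handled by CPU 2 until a reader of that flow is recorded on another CPU. From then on, RFS steers the flow to that CPU.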

1.2. Btrfs improvements

  • Direct I/O support: Direct I/O bypasses the filesystem cache. For most workloads this harms performance, but it is widely used by high-performance software such as some databases, which implement their own caching. Code: (commit 1, 2)

  • Complete -ENOSPC support: Linux 2.6.32 already added reliable -ENOSPC support for common filesystem usage, but some corner cases could still be hit, for example during volume management operations. The -ENOSPC code added in this version handles all the difficult corner cases, such as space balancing, drive management, fsync logging and many others. Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
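
Applications request direct I/O with the O_DIRECT open() flag. The following is a minimal C sketch. It assumes a 4096-byte alignment for the buffer address, file offset and transfer length; the actual requirement depends on the device and filesystem, but 4096 covers common block sizes. On a filesystem that does not support O_DIRECT, open() fails, typically with EINVAL. The helper names dio_alloc and dio_read are hypothetical.

```c
#define _GNU_SOURCE              /* for O_DIRECT */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Assumed alignment; the real requirement is device/filesystem specific. */
#define DIO_ALIGN 4096

/* Allocate a buffer whose address is suitably aligned for O_DIRECT. */
void *dio_alloc(size_t len)
{
    void *buf = NULL;

    /* The transfer length must also be a multiple of the block size. */
    assert(len % DIO_ALIGN == 0);
    if (posix_memalign(&buf, DIO_ALIGN, len) != 0)
        return NULL;
    return buf;
}

/* Read len bytes at offset (also DIO_ALIGN-aligned), bypassing the page
   cache. Returns bytes read, or -1 with errno set. */
ssize_t dio_read(const char *path, void *buf, size_t len, off_t offset)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;

    ssize_t n = pread(fd, buf, len, offset);
    int saved = errno;           /* keep pread's errno across close() */
    close(fd);
    errno = saved;
    return n;
}
```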

1.3. XFS Delayed logging

This version adds a logging (journaling) mode called delayed logging, which is loosely modeled after the journaling mode in the ext3/4 and reiserfs filesystems. It accumulates the changes of multiple asynchronous transactions in memory instead of possibly writing the same metadata out to the log many times. The I/O bandwidth used for the log decreases by orders of magnitude, and performance on metadata-intensive workloads increases massively. The on-disk log format is not changed, only the in-memory data structures and code. This feature is experimental, so it is not recommended for end users or production servers. Those who want to test it can enable it with the "-o delaylog" mount option. Code: (commit 1, 2)
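
The benefit of accumulating changes in memory can be illustrated with a toy model. It is loosely inspired by the XFS in-memory committed item list (CIL) but does not reproduce its data structures. Repeated modifications of the same metadata object between checkpoints are merged, so each dirty object is written to the log once per checkpoint instead of once per transaction. The object numbers, the fixed capacity and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITEMS 64            /* hypothetical capacity of the in-memory list */

/* Latest in-memory state of one dirty metadata object (e.g. an inode). */
struct log_item {
    unsigned long obj;          /* hypothetical object number */
    unsigned changes;           /* transactions merged into this item */
};

/* Toy list of committed-but-not-yet-logged items. */
struct cil {
    struct log_item items[MAX_ITEMS];
    size_t nitems;
    size_t log_writes;          /* items written to the on-disk log so far */
};

/* Checkpoint: write each dirty object's latest state to the log once. */
size_t cil_checkpoint(struct cil *c)
{
    size_t written = c->nitems;

    c->log_writes += written;
    c->nitems = 0;
    return written;
}

/* Commit a transaction that modified obj: instead of writing to the log,
   merge the change into the in-memory item for that object. */
void cil_commit(struct cil *c, unsigned long obj)
{
    for (size_t i = 0; i < c->nitems; i++) {
        if (c->items[i].obj == obj) {
            c->items[i].changes++;   /* already dirty: relog in memory */
            return;
        }
    }
    if (c->nitems == MAX_ITEMS)      /* list full: force a checkpoint */
        cil_checkpoint(c);
    c->items[c->nitems].obj = obj;
    c->items[c->nitems].changes = 1;
    c->nitems++;
}
```

In this model, 1000 transactions alternating between two inodes cost 2 log writes at the next checkpoint, whereas writing every transaction to the log would have cost 1000.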

1.4. KDB kernel debugger frontend

The Linux kernel has had a kernel debugger, called Kgdb, since 2.6.26. But Kgdb is not the only Linux kernel debugger: there is also KDB, developed years ago by SGI. The key difference between them is that Kgdb requires an additional computer to run a gdb frontend, and it allows source-level debugging. KDB, on the other hand, can be run on the local machine and can be used to inspect the system, but it doesn't do source-level debugging. In this version, Jason Wessel, from Wind River, has ported KDB to work on top of the Kgdb core, making it possible to use either interface.

Code: (commit 1, 2, 3, 4, 5, 6, 7, 8)

1.5. perf improvements

  • perf-inject live mode. Until now, users had to run "perf record" and "perf report" as two separate commands. perf inject introduces a 'live mode', which allows recording and reporting in a single command line, for example perf record -o - ./hackbench 10 | perf inject -v -b | perf report -v -i - . Because this is too complex for normal users, support has been added to invoke live mode automatically when neither record nor report is specified, for example:

    # perf trace rwtop 5 

    Any of the scripts listed in 'perf trace -l' can now be used directly in live mode, with the expected arguments, by simply specifying the script and args to 'perf trace' (commit 1, 2)

  • perf kvm tool for monitoring guest performance from host (commit)

  • perf probe: Support accessing data structure members. perf probe now accepts data structure members (that is, the dot '.' and arrow '->' operators) as probe arguments. Code: (commit 1, 2). Examples:

    # perf probe --add 'schedule:44 rq->curr'

    # perf probe --add 'vfs_read file->f_op->read file->f_path.dentry'

  • perf probe: Improve --list to show existing probes with their line number and file name, so users can easily check which lines are already probed. Code: (commit). For example:

    # perf probe --list
    probe:vfs_read       (on vfs_read:8@linux-2.6-tip/fs/read_write.c)

  • Implement a console UI using newt (commit 1, 2)

1.6. Graphic improvements

This version carries a bunch of interesting improvements to the graphics stack.

  • i915: Support for H.264 and VC1 hardware acceleration on G45+ (commit), support for the graphics in the future Intel Cougarpoint chipset (commit 1, 2, 4, 5, 6, 7, 8), power monitoring support (commit), support for memory self-refresh on Ironlake (commit) and support for interlaced display (commit)

  • Radeon: Initial power management support (commit 1, 2, 3, 4), simplification and improvement of the GPU reset handling (commit 1, 2), implementation of several important pieces needed to support the Evergreen hardware (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11), use of unmappable VRAM (commit 12) and polling support for when nothing is connected (commit 13, 14, 15)

1.7. Memory compaction

Recommended LWN article: "Memory compaction"

The memory compaction mechanism tries to reduce external memory fragmentation in a memory zone by moving used pages so that they form a big block of contiguous used pages. When compaction finishes, the memory zone will have a big block of used pages and a big block of free pages. This makes it easier to allocate bigger chunks of physically contiguous memory. The mechanism is called "compaction" to distinguish it from other forms of defragmentation.

In this implementation, a full compaction run involves two scanners operating within a zone: a migration scanner and a free scanner. The migration scanner starts at the beginning of a zone and finds all used pages that can be moved. The free scanner begins at the end of the zone and searches for enough free pages to migrate all the used pages found by the migration scanner. A compaction run completes within a zone when the two scanners meet and the used pages have been migrated to the free pages found. Testing has shown that the amount of I/O required to satisfy a huge page allocation is reduced significantly.
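
The two-scanner algorithm can be modeled on an array of page slots. This is a simplified sketch, not the kernel's implementation. It ignores unmovable pages, pageblock boundaries, page isolation and watermarks, and "migration" is just copying a slot. As described above, the migration scanner walks up from the start of the zone and the free scanner walks down from the end. Used pages therefore end up packed at the top of the zone and free pages at the bottom. The names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { PAGE_FREE = 0, PAGE_USED = 1 };

/* Compact a zone modeled as an array of page states.
   Returns the number of pages migrated. */
size_t compact_zone(int *pages, size_t n)
{
    size_t migrate_pfn = 0;   /* migration scanner: moves up from the start */
    size_t free_pfn = n;      /* free scanner: examines pages[free_pfn - 1] */
    size_t moved = 0;

    assert(pages != NULL || n == 0);
    for (;;) {
        /* Migration scanner: find the next used (movable) page. */
        while (migrate_pfn < free_pfn && pages[migrate_pfn] == PAGE_FREE)
            migrate_pfn++;
        /* Free scanner: find the next free page below free_pfn. */
        while (free_pfn > migrate_pfn && pages[free_pfn - 1] != PAGE_FREE)
            free_pfn--;
        /* The run completes when the two scanners meet. */
        if (migrate_pfn >= free_pfn)
            break;
        /* Migrate the used page into the free page near the zone's end. */
        pages[free_pfn - 1] = pages[migrate_pfn];
        pages[migrate_pfn] = PAGE_FREE;
        moved++;
    }
    return moved;
}
```

For example, a zone {used, free, used, free, used, free} becomes {free, free, free, used, used, used} after two migrations, leaving a three-page contiguous free block.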

Memory compaction can be triggered in one of three ways. Writing any value to /proc/sys/vm/compact_memory compacts all of memory. Writing any value to /sys/devices/system/node/nodeN/compact, where N is the node ID, compacts a single node. Finally, when a process fails to allocate a high-order page, it may compact memory in an attempt to satisfy the allocation instead of entering direct reclaim. Explicit compaction does not finish until the two scanners meet, whereas direct compaction ends as soon as a suitable page that would meet the watermarks becomes available.

Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
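
Explicit compaction can also be requested from a program by writing to the files named above. The following is a minimal C sketch. It assumes a kernel built with compaction support (CONFIG_COMPACTION) and sufficient privileges (typically root). The helper names are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write "1" to a control file; any value triggers compaction. */
static int write_trigger(const char *path)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, "1", 1);
    close(fd);
    return n == 1 ? 0 : -1;
}

/* Build the per-node trigger path for NUMA node `node`. */
int compact_node_path(char *buf, size_t len, int node)
{
    int n = snprintf(buf, len, "/sys/devices/system/node/node%d/compact", node);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Compact all of memory. */
int compact_all_memory(void)
{
    return write_trigger("/proc/sys/vm/compact_memory");
}

/* Compact only the memory of one NUMA node. */
int compact_node(int node)
{
    char path[64];

    if (compact_node_path(path, sizeof(path), node) < 0)
        return -1;
    return write_trigger(path);
}
```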

1.8. Support for multiple multicast route tables

Normally, a multicast router runs a userspace daemon that decides what to do with a multicast packet based on its source and destination addresses. This feature adds support for multiple independent multicast routing instances, so the kernel is able to take interfaces and packet marks into account and run multiple instances of userspace daemons simultaneously, each one handling a single table. Code: (commit 1, 2, 3)
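
Each daemon selects its table on its multicast routing socket before initializing that socket. The following is a minimal C sketch. It assumes a kernel built with CONFIG_IP_MROUTE_MULTIPLE_TABLES and a process with CAP_NET_ADMIN. The MRT_* values mirror <linux/mroute.h> and are defined here only as a fallback, to avoid mixing kernel and libc headers. The policy rules that map interfaces and packet marks to tables are configured separately and are not shown. The helper name is hypothetical.

```c
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fallback values mirroring <linux/mroute.h>. */
#ifndef MRT_BASE
#define MRT_BASE 200
#endif
#ifndef MRT_INIT
#define MRT_INIT (MRT_BASE)
#endif
#ifndef MRT_TABLE
#define MRT_TABLE (MRT_BASE + 9)
#endif

/* Open a multicast routing socket bound to table `table_id`.
   MRT_TABLE must be set before MRT_INIT. Returns the fd, or -1. */
int open_mroute_table(uint32_t table_id)
{
    int one = 1;
    int fd = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_IP, MRT_TABLE, &table_id, sizeof(table_id)) < 0 ||
        setsockopt(fd, IPPROTO_IP, MRT_INIT, &one, sizeof(one)) < 0) {
        int saved = errno;       /* keep setsockopt's errno across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```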

1.9. L2TP Version 3 (RFC 3931) support

This version adds support for Layer 2 Tunneling Protocol (L2TP) version 3, RFC 3931. L2TP provides a dynamic mechanism for tunneling Layer 2 (L2) "circuits" across a packet-oriented data network (e.g., over IP). L2TP, as originally defined in RFC 2661, is a standard method for tunneling Point-to-Point Protocol (PPP) [RFC1661] sessions. It has since been adopted for tunneling a number of other L2 protocols, including ATM, Frame Relay, HDLC and even raw Ethernet frames; version 3 defines the base protocol independently of the L2 payload being tunneled. Code: (commit 1, 2, 3, 4)
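
When L2TPv3 runs directly over IP (IP protocol 115), RFC 3931 defines a very small data-message header. It consists of a 32-bit Session ID (zero is reserved for control messages), optionally followed by a 32- or 64-bit Cookie. The following is a minimal C sketch that builds this header, assuming the caller appends the L2-specific sublayer and the payload. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build an L2TPv3-over-IP data header in buf.
   Returns the header length, or 0 on invalid arguments. */
size_t l2tpv3_ip_header(uint8_t *buf, size_t cap, uint32_t session_id,
                        const uint8_t *cookie, size_t cookie_len)
{
    if (session_id == 0)                       /* reserved for control */
        return 0;
    if (cookie_len != 0 && cookie_len != 4 && cookie_len != 8)
        return 0;
    if (cookie_len != 0 && cookie == NULL)
        return 0;

    size_t len = 4 + cookie_len;
    if (cap < len)
        return 0;

    /* Session ID in network byte order. */
    buf[0] = (uint8_t)(session_id >> 24);
    buf[1] = (uint8_t)(session_id >> 16);
    buf[2] = (uint8_t)(session_id >> 8);
    buf[3] = (uint8_t)session_id;
    /* The cookie is opaque and copied verbatim. */
    if (cookie_len)
        memcpy(buf + 4, cookie, cookie_len);
    return len;
}
```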

1.10. CAIF Protocol support

This version adds support for the CAIF protocol. CAIF is a MUX protocol used by ST-Ericsson cellular modems for communication between the modem and the host. Host processes can open virtual AT channels, initiate GPRS data connections, video channels and utility channels. The utility channels are general-purpose pipes between the modem and the host. ST-Ericsson modems support a number of transports between modem and host; currently, UART and loopback are available for Linux. Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)

1.11. ACPI Platform Error Interface support

This version adds support for the ACPI Platform Error Interface (APEI), which especially improves NMI handling. In addition, it supports the APEI Error Record Serialization Table (ERST), the APEI Generic Hardware Error Source, APEI Error INJection (EINJ), and saving MCE (Machine Check Exception) errors into flash. For more information about APEI, please refer to the ACPI Specification version 4.0, chapter 17. Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)

2. Various core changes

  • Add a new configuration interface: make nconfig. Based on menuconfig, it offers a more modern-looking interface using ncurses and the ncurses satellite libraries (menu, panel, form) (commit)

  • Add support for in-kernel initramfs compressed with LZO (commit)

  • Add support for querying and resizing (growing or shrinking) a pipe's buffer via the F_GETPIPE_SZ and F_SETPIPE_SZ fcntl() operations (commit)

  • IPC: solve a spinlock contention problem seen with some large databases (commit 1, 2, 3, 4)

  • Add support for augmented rbtrees, a modification of regular rbtrees that speeds up some kinds of searches (recommended LWN article) (commit)

  • Implement sysfs tagged directory support to allow network namespaces in sysfs (commit 1, 2)

  • fuse: splice() support (commit 1, 2, 3)

  • Add devname module aliases to allow on-demand module auto-loading (commit)

  • crc32: major optimization (commit)

  • char drivers: RAM oops/panic logger (commit)

  • SFI: add sysfs interface /sys/firmware/sfi/tables/ for SFI tables, analogous to ACPI's /sys/firmware/acpi/tables/... (commit)

  • fault-injection: add CPU notifier error injection module (commit)

  • devtmpfs: support kernels built without tmpfs (!CONFIG_TMPFS) (commit)
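
The new pipe fcntl() operations can be used as in the following minimal C sketch. The helper name is hypothetical. The kernel may round the requested size up, and unprivileged processes are limited to a system-wide maximum, so the code reads the size back with F_GETPIPE_SZ. The fallback constant values are the Linux ones, for C libraries whose headers predate these operations.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ 1031   /* F_LINUX_SPECIFIC_BASE + 7 */
#endif
#ifndef F_GETPIPE_SZ
#define F_GETPIPE_SZ 1032   /* F_LINUX_SPECIFIC_BASE + 8 */
#endif

/* Create a pipe and try to set its capacity to `bytes`. Returns the
   capacity actually granted (possibly rounded up), or -1 on error. */
int make_pipe_with_size(int fds[2], int bytes)
{
    if (pipe(fds) < 0)
        return -1;
    if (fcntl(fds[1], F_SETPIPE_SZ, bytes) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    return fcntl(fds[1], F_GETPIPE_SZ);
}
```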

3. Filesystems

  • OCFS2

    • Implement allocation reservations, which reduces fragmentation significantly (commit 1, 2, 3, 4, 5)

    • Optimize the hole-punching code, which significantly speeds up some rare operations (commit)

    • Discontiguous block groups, necessary to improve some kinds of allocations. The feature sets an incompatible feature bit, so filesystems that use it cannot be mounted by older kernels (commit 1, 2, 3)

    • Make nointr ("don't allow file operations to be interrupted") a default mount option (commit)

  • Squashfs: XATTR support (commit 1, 2, 3, 4)

  • Ext2: Remove BKL from ext2 filesystem (commit)

  • Ext4: check for a good block group before loading buddy pages (it speeds up allocations in some cases where partitions are relatively full) (commit)

  • UFS: Support UFS Borderware variation (commit)

  • NILFS2: change default of 'errors' mount option to 'remount-ro' mode (commit)

4. Block

  • laptop-mode: Make flushes per-device, to avoid unnecessarily waking up devices that have nothing to flush (commit)

  • Block I/O controller (blkio)

    • Add the following statistics: io_service_time, io_wait_time, io_serviced, io_service_bytes (commit), io_merged (commit), and io_queued and avg_queue_size (commit)

    • Add more debug-only per-cgroup stats (group_wait_time, empty_time, idle_time) (commit)

    • Add a new interface "weight_device" for IO-Controller (commit), (commit)

    • Add a reset_stats interface to reset statistics (commit)

  • Generate "change" uevent for loop device (commit)

  • brd: support discard request (commit)

5. Memory management

  • Memory resource controller

    • Userspace oom notifier (commit 1, 2, 3)

    • Add support for moving charges of file pages (normal files, tmpfs files and swapped-out tmpfs pages) when a task moves to another cgroup. It is enabled by setting bit 1 (value 2) of <target cgroup>/memory.move_charge_at_immigrate (commit)

  • page allocator: reduce fragmentation in buddy allocator by adding buddies that are merging to the tail of the free lists (commit)

  • slab: add memory hotplug support (commit)

  • percpu: implement kernel memory based chunk allocation for nommu SMP machines (commit)
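
The memory.move_charge_at_immigrate value is a bitmask: bit 0 selects anonymous pages and bit 1 selects file pages, so writing 3 moves both. The following is a minimal C sketch that sets it for a cgroup directory. Where the memory controller is mounted, and the helper and macro names, are assumptions.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MOVE_CHARGE_ANON (1u << 0)   /* bit 0: anonymous pages */
#define MOVE_CHARGE_FILE (1u << 1)   /* bit 1: file pages (added in 2.6.35) */

/* Write `mask` to <cgroup_dir>/memory.move_charge_at_immigrate.
   Returns 0 on success, -1 on error. */
int set_move_charge(const char *cgroup_dir, unsigned mask)
{
    char path[4096];
    char val[16];
    int n = snprintf(path, sizeof(path), "%s/memory.move_charge_at_immigrate",
                     cgroup_dir);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    int len = snprintf(val, sizeof(val), "%u", mask);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t w = write(fd, val, (size_t)len);
    close(fd);
    return w == len ? 0 : -1;
}
```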

6. Networking

  • IPv6

    • Add GSO ("Generic Segmentation Offload") support on IPv6 forwarding path (commit)

    • Add support for the Generic TTL security mechanism (RFC 5082), equivalent to the IPv4 functionality merged in 2.6.34 (commit)

    • Complete support of the IPV6_DONTFRAG socket option (commit)

  • mac80211 (wireless stack)

    • Add support for connection quality monitoring (commit), (commit), (commit)

    • Enable QoS explicitly in AP mode (commit)

    • Implement AP isolation support (commit), (commit)

    • Add support for offloading the channel switch operation to devices that support such operation (commit)

    • Use different MAC addresses for virtual interfaces automatically (commit)

    • Allow controlling aggregation manually in debugfs (commit)

  • L2TP (Layer 2 Tunneling Protocol)

    • Add netlink control API for L2TP (commit)

    • Add L2TP ethernet pseudowire support (commit)

    • Add debugfs files for dumping l2tp debug info (commit)

  • Bridge

    • per-cpu (scalable) packet statistics (commit)

    • IPv6 MLD snooping support. (commit)

  • Bluetooth

    • L2CAP: add a socket option to set and get the transmission window size (commit)

    • Add SOCK_STREAM support to L2CAP (commit)

  • 9P: Add support for 9p2000.L protocol (commit), (commit), (commit), (commit)

  • RDS: Enable per-cpu workqueue threads, which scale better (commit)

  • netpoll: add generic support for bridge and bonding devices (commit), (commit), (commit)

  • Add netlink support for virtual port management (was iovnl) (commit)

  • Implement an SCTP association probing module (commit)

  • Microoptimize alloc_skb(), a critical fast path (commit)

  • Netfilter xtables: merge xt_CONNMARK into xt_connmark (commit), merge xt_MARK into xt_mark (commit), inclusion of xt_TEE (commit)

  • Add 64-bit userspace support for interface counters (commit)
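
With IPV6_DONTFRAG set on a datagram socket, the sending host does not fragment outgoing packets that exceed the path MTU. Per RFC 3542, the send fails instead (with EMSGSIZE on Linux). The following is a minimal C sketch that enables the option on a UDP/IPv6 socket. The helper name is hypothetical, and the fallback constant is the Linux value.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IPV6_DONTFRAG
#define IPV6_DONTFRAG 62    /* value from the Linux <linux/in6.h> header */
#endif

/* Open a UDP/IPv6 socket whose datagrams are never fragmented by this
   host. Returns the fd, or -1 (e.g. if IPv6 is unavailable). */
int udp6_socket_dontfrag(void)
{
    int on = 1;
    int fd = socket(AF_INET6, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_IPV6, IPV6_DONTFRAG, &on, sizeof(on)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```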

7. Tracing/Profiling

  • Implement the Intel PEBS (Precise Event-Based Sampling) infrastructure. PEBS is an alternative counter mode in which the counter triggers a hardware assist to collect information on events (commit), (commit)

  • Implement an initial PMU driver for P4/NetBurst processors, which do performance monitoring differently from other Intel processors (commit), (commit)

  • Make perf's --pid parameter collect process-wide instead of thread-wide (commit)

  • perf lock: Add "info" subcommand for dumping misc information (commit)

  • perf probe: add --dry-run option (commit), add --max-probes option (commit)

  • scripting: Add rwtop and sctop scripts (commit), enable scripting shell scripts for live mode (commit)

  • Reduce the size and memory footprint of tracepoints (commit 1, 2, 3, 4, 5, 6, 7, 8, 9)

  • Delete the never-used BTS-ptrace code (commit)

8. Crypto

9. Virtualization

  • KVM

    • SVM: Make lazy FPU switching work with nested SVM (commit)

    • Support for tracing emulated instructions (commit)

    • PPC: Add host MMU Support for 32 bit Book3S (commit)

  • virtio: Add disk identification ioctl (commit), (commit)

10. MD

  • Add support for Raid0->Raid10 takeover (commit)

  • Add support for Raid0->Raid5 takeover (commit)

  • Add support for Raid5->Raid0 and Raid10->Raid0 takeover (commit)

  • Add support for Raid5->Raid4 takeover (commit)

  • Add support for Raid0->Raid4 takeover (commit)

  • Add support for Raid4->Raid0 takeover (commit)

11. CPU scheduler

12. Cpufreq/cpuidle

13. Security

  • TOMOYO: Add pathname grouping support. (commit)


last edited 2010-08-02 06:51:27 by ZiyadAlBATLY