KernelNewbies:

Linux 3.5 was [https://lkml.org/lkml/2012/7/21/114 released] on 21 Jul 2012.

Summary: This release includes support for metadata checksums in ext4, userspace probes for performance profiling with tools like SystemTap or perf, a sandboxing mechanism that filters system calls, a new network queue management algorithm designed to fight bufferbloat, support for checkpointing and restoring TCP connections, support for TCP Early Retransmit (RFC 5827), support for Android-style opportunistic suspend, btrfs I/O failure statistics, and SCSI over FireWire and USB. Many small features, new drivers and fixes are also available.


1. Prominent features in Linux 3.5

1.1. ext4 metadata checksums

Modern filesystems such as ZFS and Btrfs have proved that ensuring the integrity of the filesystem using checksums is a valuable feature. Ext4 has added the ability to store checksums of various metadata fields. Every time a metadata field is read, the checksum of the read data is compared with the stored checksum; if they differ, the metadata is corrupted. Note that this feature doesn't cover data, only the internal metadata structures, and it doesn't have "self-healing" capabilities. The amount of code added to implement this feature is: 1659 insertions(+), 162 deletions(-).
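The verify-on-read idea can be sketched in a few lines of Python. This is only an illustration, not ext4's on-disk format: the MetadataBlock class and the write_block/read_block helpers are hypothetical names, and zlib.crc32 stands in for the CRC32c algorithm that ext4 uses, because Python's standard library does not ship CRC32c.

```python
import zlib
from dataclasses import dataclass


class ChecksumError(Exception):
    """Raised when the stored and the recomputed checksum disagree."""


@dataclass
class MetadataBlock:
    payload: bytes   # the metadata fields themselves
    checksum: int    # checksum stored next to them at write time


def write_block(payload: bytes) -> MetadataBlock:
    # The checksum is computed once, when the metadata is written.
    return MetadataBlock(payload, zlib.crc32(payload))


def read_block(block: MetadataBlock) -> bytes:
    # On every read the checksum is recomputed and compared.
    if zlib.crc32(block.payload) != block.checksum:
        raise ChecksumError("metadata checksum mismatch")
    return block.payload


block = write_block(b"inode 12: size=4096 links=1")
read_block(block)  # passes silently: payload and checksum agree

# Simulate on-disk corruption by flipping one bit of the payload.
damaged = bytes([block.payload[0] ^ 0x01]) + block.payload[1:]
try:
    read_block(MetadataBlock(damaged, block.checksum))
except ChecksumError as err:
    print("detected:", err)
```

As with the ext4 feature, the sketch can only detect the corruption; it has no way to repair the damaged block.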

Any ext4 filesystem can be upgraded to use checksums using the "tune2fs -O metadata_csum" command, or "mkfs -O metadata_csum" at creation time. Once this feature is enabled in a filesystem, older kernels with no checksum support will only be able to mount it in read-only mode.

As far as performance impact goes, it shouldn't be noticeable for common desktop and server workloads. A mail server ffsb simulation shows nearly no change. On a test doing only file creation and deletion and extent tree modifications, a performance drop of about 20 percent was measured. However, that workload is very heavily oriented towards metadata. In most real-world workloads metadata is a small fraction of total I/O, so unless your workload is metadata-oriented, the cost of enabling this feature should be negligible.

Recommended LWN article: [https://lwn.net/Articles/469805/ "Improving ext4: bigalloc, inline data, and metadata checksums"]

Implementation details: [https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums Ext4 Metadata checksums]

Code: [http://git.kernel.org/linus/e93376c20b70d1e62bb3246acd1bbe21fe58859f (commit 1], [http://git.kernel.org/linus/dbe89444042ab6540bc304343cfdcbc8b95d003d 2], [http://git.kernel.org/linus/cc8e94fd126ab2d2e4bcb1b65c7316196f0cec8c 3], [http://git.kernel.org/linus/5c359a47e7d999a0ea7f397da2c15590d0a82815 4], [http://git.kernel.org/linus/fa77dcfafeaa6bc73293c646bfc3d5192dcf0be2 5], [http://git.kernel.org/linus/41a246d1ff75a95d2be3191ca6e6db139dc0f430 6], [http://git.kernel.org/linus/b0336e8d2108e6302aecaadd17c6c0bd686da22d 7], [http://git.kernel.org/linus/814525f4df50a196464ce2c7abe91f693203060f 8], [http://git.kernel.org/linus/a9c4731780544d52b243bf46e4dd635c67fa9f84 9], [http://git.kernel.org/linus/e615391896064eb5a0c760d086b8e1c6ecfffeab 10], [http://git.kernel.org/linus/f84891289e62a74e9b4942eaad80617368b2d778 11], [http://git.kernel.org/linus/0441984a3398970ab4820410b9cf4ff85bf3a6b0 12], [http://git.kernel.org/linus/feb0ab32a57e4e6c8b24f6fb68f0ce08efe4603c 13], [http://git.kernel.org/linus/01b5adcebb977bc61b64167adce6d8260c9da33c 14], [http://git.kernel.org/linus/d25425f8e0ed01fc0167c043aee7e619fc3f6ab2 15], [http://git.kernel.org/linus/7ac5990d5a3e2e465162880819cc46c6427d3b6f 16], [http://git.kernel.org/linus/8f888ef846d4481e24c74b4a91ece771d2bcbcb5 17], [http://git.kernel.org/linus/1f56c5890e3e815c6f4eabfc87a8a81f439b6f3d 18], [http://git.kernel.org/linus/c390087591dcbecd244c31d979ccdad49ae83364 19], [http://git.kernel.org/linus/3caa487f53f65fd1e3950a6b6ae1709e6c43b334 20], [http://git.kernel.org/linus/4fd5ea43bc11602bfabe2c8f5378586d34bd2b0a 21], [http://git.kernel.org/linus/42a7106de636ebf9c0b93d25b4230e14f5f2682e 22], [http://git.kernel.org/linus/25ed6e8a54df904c875365eebedbd19138a47328 23], [http://git.kernel.org/linus/2db938bee32e7469ca8ed9bfb3a05535f28c680d 24)]

1.2. Uprobes: userspace probes

Uprobes, the user-space counterpart of kprobes, makes it possible to place performance probes at any memory address of a user application and to collect debugging and performance information non-disruptively, which can be used to find performance problems. These probes can be placed dynamically in a running process; there is no need to restart the program or modify the binaries. The probes are usually managed with an instrumentation application, such as perf probe, SystemTap or LTTng.

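A sample usage of uprobes with perf could be to profile libc's malloc() calls. Assuming libc lives at /lib64/libc.so.6 (the path varies between distributions), the probe is created with "$ perf probe -x /lib64/libc.so.6 malloc".

A probe has been created; perf names the new event probe_libc:malloc. Now, let's record the global usage of malloc across the whole system for 1 second with "$ perf record -e probe_libc:malloc -ag sleep 1".

Now you can view the results in the TUI with "$ perf report", or as plain text without the call graph info on stdio with "$ perf report -g flat --stdio".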

If you don't know which function you want to probe, you can get a list of probeable functions in libraries and executables using the -F parameter, for example: "$ perf probe -F -x /lib64/libc.so.6" or "$ perf probe -F -x /bin/zsh". You can use multiple probes as well and mix them with kprobes, regular PMU events or kernel tracepoints.

The uprobes code is one of the longest standing out-of-the-tree patches. It originates from SystemTap and has been included for years in Fedora and RHEL kernels.

Recommended LWN article: [https://lwn.net/Articles/499190/ Uprobes in 3.5]

Code: [http://git.kernel.org/linus/225466f1c2d816c33b4341008f45dfdc83a9f0cb (commit 1], [http://git.kernel.org/linus/f3f096cfedf8113380c56fc855275cc75cd8cf55 2], [http://git.kernel.org/linus/2b144498350860b6ee9dc57ff27a93ad488de5dc 3], [http://git.kernel.org/linus/d4b3b6384f98f8692ad0209891ccdbc7e78bbefe 4], [http://git.kernel.org/linus/7b2d81d48a2d8e37efb6ce7b4d5ef58822b30d89 5], [http://git.kernel.org/linus/cbc91f71b51b8335f1fc7ccfca8011f31a717367 6], [http://git.kernel.org/linus/0326f5a94ddea33fa331b2519f4172f4fb387baa 7], [http://git.kernel.org/linus/7396fa818d6278694a44840f389ddc40a3269a9a 8], [http://git.kernel.org/linus/04a3d984d32e47983770d314cdb4e4d8f38fccb7 9], [http://git.kernel.org/linus/900771a483ef28915a48066d7895d8252315607a 10], [http://git.kernel.org/linus/e3343e6a2819ff5d0dfc4bb5c9fb7f9a4d04da73 11], [http://git.kernel.org/linus/3ff54efdfaace9e9b2b7c1959a865be6b91de96c 12], [http://git.kernel.org/linus/682968e0c425c60f0dde37977e5beb2b12ddc4cc 13], [http://git.kernel.org/linus/96379f60075c75b261328aa7830ef8aa158247ac 14], [http://git.kernel.org/linus/5cb4ac3a583d4ee18c8682ab857e093c4a0d0895 15)]

1.3. Seccomp-based system call filtering

Seccomp (short for "secure computing") is a simple sandboxing mechanism added back in [http://git.kernel.org/?p=linux/kernel/git/tglx/history.git;a=commit;h=d949d0ec9c601f2b148bed3cdb5f87c052968554 2.6.12] that allows a process to transition to a state where it cannot make any system calls except a very restricted set (exit, sigreturn, read and write to already open file descriptors). Seccomp has now been extended: instead of a fixed and very limited set of system calls, it has evolved into a filtering mechanism that lets processes specify an arbitrary filter (expressed as a [http://en.wikipedia.org/wiki/Berkeley_Packet_Filter Berkeley Packet Filter] program) that decides which system calls are allowed and which are forbidden. This can be used to implement different types of security mechanisms; for example, the Linux port of the Chromium web browser [http://src.chromium.org/viewvc/chrome/trunk/src/sandbox/linux/seccomp-bpf/ supports this feature] to run plugins in a sandbox.
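To make the idea of "a filter expressed as a BPF program" more concrete, the following Python sketch builds an allow-list filter as classic BPF instructions and evaluates it with a tiny interpreter. It does not install anything in the kernel. The helper names (build_allow_list, pack, run) are made up for this illustration, the syscall numbers are the x86-64 ones, and only the three opcodes needed here are modelled.

```python
import struct

# Classic BPF opcodes, as defined in the kernel's BPF headers.
BPF_LD_W_ABS = 0x20   # BPF_LD | BPF_W | BPF_ABS: load a 32-bit word
BPF_JEQ_K = 0x15      # BPF_JMP | BPF_JEQ | BPF_K: compare with a constant
BPF_RET_K = 0x06      # BPF_RET | BPF_K: return a constant

# Filter return values, as defined in <linux/seccomp.h>.
SECCOMP_RET_KILL = 0x00000000
SECCOMP_RET_ALLOW = 0x7FFF0000

SYSCALL_NR_OFFSET = 0  # offset of the `nr` field in struct seccomp_data


def build_allow_list(allowed_nrs):
    """Return the filter as a list of (code, jt, jf, k) instructions."""
    prog = [(BPF_LD_W_ABS, 0, 0, SYSCALL_NR_OFFSET)]
    for nr in allowed_nrs:
        # On a match fall through to ALLOW, otherwise jump over it.
        prog.append((BPF_JEQ_K, 0, 1, nr))
        prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_KILL))
    return prog


def pack(prog):
    """Serialize as an array of struct sock_filter (u16, u8, u8, u32)."""
    return b"".join(struct.pack("=HBBI", *insn) for insn in prog)


def run(prog, syscall_nr):
    """Toy interpreter for the three opcodes used above."""
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD_W_ABS:
            acc = syscall_nr  # only the load of `nr` is modelled
        elif code == BPF_JEQ_K:
            pc += jt if acc == k else jf
        elif code == BPF_RET_K:
            return k


# x86-64 syscall numbers (an assumption; other architectures differ).
X86_64_NR = {"read": 0, "write": 1, "open": 2, "rt_sigreturn": 15, "exit": 60}
prog = build_allow_list(
    X86_64_NR[name] for name in ("read", "write", "rt_sigreturn", "exit")
)
print(len(pack(prog)), "bytes of filter")
print("write allowed:", run(prog, X86_64_NR["write"]) == SECCOMP_RET_ALLOW)
print("open allowed:", run(prog, X86_64_NR["open"]) == SECCOMP_RET_ALLOW)
```

A real filter should also check the arch field of struct seccomp_data before trusting the syscall number, and it is handed to the kernel with prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...), normally after prctl(PR_SET_NO_NEW_PRIVS, 1). The documentation and samples linked in this section show the complete procedure.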

The systemd init daemon has [https://plus.google.com/115547683951727699051/posts/cb3uNFMNUyK added support] for this feature. A unit file can use the SystemCallFilter option to list the syscalls that are allowed to run; any other syscall will not be allowed:

[Service]
ExecStart=/bin/echo "I am in a sandbox"
SystemCallFilter=brk mmap access open fstat close read fstat mprotect arch_prctl munmap write

Recommended links: [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/prctl/seccomp_filter.txt;hb=HEAD Documentation] and [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=tree;f=samples/seccomp;hb=HEAD Samples].

Recommended LWN article: [https://lwn.net/Articles/475043/ Yet another new approach to seccomp]

Code: [http://git.kernel.org/linus/e2cfabdfd075648216f99c2c03821cf3f47c1727 (commit 1], [http://git.kernel.org/linus/fb0fadf9b213f55ca9368f3edafe51101d5d2deb 2], [http://git.kernel.org/linus/bb6ea4301a1109afdacaee576fedbfcd7152fc86 3], [http://git.kernel.org/linus/acf3b2c71ed20c53dc69826683417703c2a88059 4], [http://git.kernel.org/linus/c6cfbeb4029610c8c330c312dcf4d514cc067554 5)]

1.4. Bufferbloat fighting: CoDel queue management

CoDel (short for "controlled delay") is a new queue management algorithm designed to fight the problems associated with excessive buffering across an entire network path - a problem known as "bufferbloat". [http://gettys.wordpress.com/2012/05/22/a-milestone-reached-codel-is-in-linux/ According to Jim Gettys], who coined the term bufferbloat, "this work is the culmination of their at three major attempts to solve the problems with AQM algorithms over the last 14 years".
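The core of the algorithm, as described in the paper by Nichols and Jacobson, is its control law: once the queueing delay has stayed above a small target (5 ms by default) for a whole interval (100 ms by default), CoDel starts dropping packets, and the time to the next drop shrinks with the square root of the number of drops so far. The Python sketch below only illustrates that schedule. The function names are made up for this example, and it is not the kernel implementation.

```python
import math

INTERVAL = 0.100  # seconds; the default interval from the CoDel paper


def control_law(t, count, interval=INTERVAL):
    """Time of the next drop, given the last drop time and the drop count."""
    return t + interval / math.sqrt(count)


def drop_schedule(start, drops, interval=INTERVAL):
    """Drop times if the queue delay never falls back below the target."""
    times, t = [], start
    for count in range(1, drops + 1):
        times.append(t)
        t = control_law(t, count, interval)
    return times


for when in drop_schedule(0.0, 5):
    print(f"drop at {when * 1000:.1f} ms")
```

Drops start one interval apart and then get closer together, which pushes a persistent sender harder the longer the standing queue survives.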

ACM paper detailing the algorithm, by Kathleen Nichols and Van Jacobson: [http://queue.acm.org/detail.cfm?id=2209336 Controlling Queue Delay]

CoDel bufferbloat project page: http://www.bufferbloat.net/projects/codel/wiki

Recommended LWN article: [https://lwn.net/Articles/496509/ The CoDel queue management algorithm]

Code: [http://git.kernel.org/linus/76e3cc126bb223013a6b9a0e2a51238d1ef2e409 (commit 1], [http://git.kernel.org/linus/4b549a2ef4bef9965d97cbd992ba67930cd3e0fe 2)]

1.5. TCP connection repair

As part of an ongoing effort to implement [http://criu.org process checkpointing/restart], Linux adds in this release support for stopping a TCP connection and restarting it on another host. Container virtualization implementations will use this feature to relocate an entire network connection from one host to another, transparently for the remote end. This is achieved by putting the socket in a "repair" mode that makes it possible to gather the necessary information from it or to restore previous state into a new socket.
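As a rough sketch of what "repair" mode looks like from userspace, the Python snippet below switches a socket into repair mode and reads the sequence numbers of its receive and send queues. The option values are the ones defined in <linux/tcp.h>. The function name dump_sequence_numbers is made up for this example, the caller is assumed to hold CAP_NET_ADMIN, and a real checkpoint tool also has to save queue contents and TCP options, which is not shown here.

```python
import socket

# Socket option numbers from <linux/tcp.h>.
TCP_REPAIR = 19
TCP_REPAIR_QUEUE = 20
TCP_QUEUE_SEQ = 21

# Queue selectors accepted by TCP_REPAIR_QUEUE.
TCP_RECV_QUEUE = 1
TCP_SEND_QUEUE = 2


def dump_sequence_numbers(sock):
    """Return the sequence numbers of both queues of a TCP socket.

    Requires CAP_NET_ADMIN; without it the first setsockopt() call
    raises PermissionError and the socket is left untouched.
    """
    sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 1)
    try:
        seqs = {}
        for name, queue in (("recv", TCP_RECV_QUEUE), ("send", TCP_SEND_QUEUE)):
            # Select a queue, then ask for its current sequence number.
            sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR_QUEUE, queue)
            seqs[name] = sock.getsockopt(socket.IPPROTO_TCP, TCP_QUEUE_SEQ)
        return seqs
    finally:
        # Leave repair mode again.
        sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 0)
```

Restoring works in the opposite direction: a new socket is put into repair mode, the saved sequence numbers are written back with setsockopt(), and connect() then establishes the connection state without exchanging packets with the peer. The linked CRIU documentation describes the full procedure.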

Documentation: http://criu.org/TCP_connection

Recommended LWN article: [https://lwn.net/Articles/495304/ TCP connection repair]

Code: [http://git.kernel.org/linus/ee9952831cfd0bbe834f4a26489d7dce74582e37 (commit 1], [http://git.kernel.org/linus/370816aef0c5436c2adbec3966038f36ca326933 2], [http://git.kernel.org/linus/b139ba4e90dccbf4cd4efb112af96a5c9e0b098c 3], [http://git.kernel.org/linus/c0e88ff0f256958401778ff692da4b8891acb5a9 4], [http://git.kernel.org/linus/5e6a3ce6573f0c519d1ff57df60e3877bb2d3151 5)]

1.6. TCP Early Retransmit

TCP (and SCTP) Early Retransmit ([http://tools.ietf.org/html/rfc5827 RFC 5827]) allows the sender, under certain conditions, to reduce the number of duplicate acknowledgments required to trigger a fast retransmission. This allows the transport to use fast retransmit to recover segment losses that would otherwise require a lengthy retransmission timeout. In other words, connections recover from lost packets faster, which improves latency. A large scale web server experiment on the performance impact of ER is summarized in section 6 of the paper "[http://conferences.sigcomm.org/imc/2011/docs/p155.pdf Proportional Rate Reduction for TCP]".
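The segment-based rule from RFC 5827 can be written down in a few lines. The Python function below is a simplified sketch of that rule, not the kernel code: the function name is made up, and it additionally assumes that at least two segments are outstanding, since with a single outstanding segment no duplicate acknowledgment can arrive at all.

```python
STANDARD_DUPTHRESH = 3  # duplicate ACKs normally needed for fast retransmit


def dupack_threshold(outstanding_segments, can_send_new_data):
    """Duplicate ACKs needed before fast retransmit is triggered.

    Early retransmit applies when fewer than four segments are
    outstanding and no new data can be sent to generate more ACKs;
    the threshold then drops to (outstanding segments - 1).
    """
    small_window = 2 <= outstanding_segments < 4
    if small_window and not can_send_new_data:
        return outstanding_segments - 1
    return STANDARD_DUPTHRESH


for oseg in (2, 3, 10):
    print(oseg, "outstanding ->", dupack_threshold(oseg, False), "dupacks")
```

With three segments in flight and the first one lost, only two duplicate acknowledgments can ever arrive, so the standard threshold of three would force the connection to wait for a retransmission timeout.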

Early retransmit is controlled with the tcp_early_retrans sysctl, found at /proc/sys/net/ipv4/tcp_early_retrans. It accepts three values: "0" (disables early retransmit), "1" (enables it), and "2", the default, which enables early retransmit but delays fast recovery and fast retransmit by a fourth of the RTT (this mitigates false recoveries when the network has a small degree of reordering).
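A small Python helper can read the sysctl and translate it into the meanings listed above. This is only a convenience sketch with made-up function names. It treats any value other than 0, 1 and 2 as unknown and returns None if the file cannot be read, which may be the case on kernels that do not provide this sysctl.

```python
from pathlib import Path

SYSCTL_PATH = Path("/proc/sys/net/ipv4/tcp_early_retrans")

MODES = {
    0: "early retransmit disabled",
    1: "early retransmit enabled",
    2: "early retransmit enabled, fast recovery and fast retransmit "
       "delayed by a fourth of the RTT",
}


def describe(raw):
    """Translate the raw sysctl contents into a description."""
    value = int(raw.strip())
    return MODES.get(value, f"unknown value {value}")


def current_mode(path=SYSCTL_PATH):
    """Describe the running kernel's setting, or None if unreadable."""
    try:
        return describe(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print(current_mode())
```

Changing the setting means writing the new value to the same file as root, or using the sysctl utility with the key net.ipv4.tcp_early_retrans.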

Code: [http://git.kernel.org/linus/eed530b6c67624db3f2cf477bac7c4d005d8f7ba (commit 1], [http://git.kernel.org/linus/750ea2bafa55aaed208b2583470ecd7122225634 2], [http://git.kernel.org/linus/1fbc340514fc3003514bd681b372e1f47ae6183f 3)]

1.7. Android-style opportunistic suspend

The most controversial issue in the merge of Android code into Linux is the functionality called "suspend blockers" or "wakelocks". They are part of a specific approach to power management, which is based on aggressive utilization of full system suspend as much as possible. The natural state of the system is a sleep state, in which energy is only used for refreshing memory and providing power to a few devices that can wake the system up. The system only uses the full power state when it has to do some real work, and when it finishes it goes back to a suspend state.

This is a good idea, but the kernel developers didn't like Android's "suspend blockers" (a full technical analysis of the issue can be found [https://lwn.net/images/pdf/suspend_blockers.pdf here]). Endless flames have been going on for years, and little progress has been made, which was a huge problem for the convergence of Android and Linux: drivers of Android devices use the suspend blocker APIs, and the lack of such APIs in Linux makes it impossible to merge them. But in this release, the kernel incorporates a similar functionality, called "autosleep and wake locks". It is expected/hoped that Android will be able to use it, and that merging drivers from Android devices will become easier.
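From userspace, the new functionality is driven through files in /sys/power: writing a sleep state such as "mem" to autosleep turns opportunistic suspend on ("off" turns it off), and writing a name to wake_lock or wake_unlock takes or releases a wake lock that keeps the system awake. The Python sketch below wraps this in two helpers. The helper names are made up for this example, the kernel is assumed to be built with autosleep and wake lock support, and writing to the real files requires root.

```python
from contextlib import contextmanager
from pathlib import Path

POWER_DIR = Path("/sys/power")


def set_autosleep(state, power_dir=POWER_DIR):
    """Write a sleep state such as "mem", or "off" to disable autosleep."""
    (power_dir / "autosleep").write_text(state)


@contextmanager
def wake_lock(name, power_dir=POWER_DIR):
    """Hold a named wake lock for the duration of a `with` block."""
    (power_dir / "wake_lock").write_text(name)
    try:
        yield
    finally:
        # Always release the lock, even if the work raised an exception.
        (power_dir / "wake_unlock").write_text(name)
```

A program that must finish some work before the system goes back to sleep would wrap that work in "with wake_lock("my_download"):", where my_download is an arbitrary example name.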

Recommended LWN article: [https://lwn.net/Articles/479841/ Autosleep and wake locks]

Code: [http://git.kernel.org/linus/55850945e872531644f31fefd217d61dd15dcab8 (commit 1], [http://git.kernel.org/linus/b86ff9820fd5df69295273b9aa68e58786ffc23f 2], [http://git.kernel.org/linus/30e3ce6dcbe3fc29c343b17e768b07d4a795de21 3], [http://git.kernel.org/linus/7483b4a4d9abf9dcf1ffe6e805ead2847ec3264e 4)]

1.8. Btrfs: I/O failure statistics, latency improvements

Support for I/O failure statistics has been added. I/O errors, CRC errors, and generation checks of metadata blocks are tracked for each drive. The Btrfs command to retrieve and print the device stats, to be included in future btrfs-progs, should be "btrfs device stats".

This release also includes fairly large changes that make Btrfs much friendlier to memory reclaim and lower latencies quite a lot for synchronous I/O.

Code: [http://git.kernel.org/linus/442a4f6308e694e0fa6025708bd5e4e424bbf51c (commit 1], [http://git.kernel.org/linus/c11d2c236cc260b36ef644700fbe99bcc7e7da33 2], [http://git.kernel.org/linus/733f4fbbc1083aa343da739f46ee839705d6cfe3 3)]

1.9. SCSI over FireWire and USB

This release includes a driver for using an IEEE-1394 connection as a SCSI transport. This makes it possible to expose SCSI devices, for example hard disk drives, to other nodes on the FireWire bus. The functionality is similar to FireWire [http://en.wikipedia.org/wiki/Target_Disk_Mode Target Disk Mode] on many Apple computers.

This release also adds a usb-gadget driver that does the same with USB. The driver supports two USB protocols: BBB or BOT (Bulk Only Transport) and UAS (USB Attached SCSI). BOT is advertised on alternative interface 0 (primary) and UAS on alternative interface 1. Both protocols can work on USB 2.0 and USB 3.0. UAS utilizes the USB 3.0 feature called streams support.

Code: [http://git.kernel.org/linus/a511ce3397803558a3591e55423f3ae6aa28c9db (commit)], [http://git.kernel.org/linus/c52661d60f636d17e26ad834457db333bd1df494 (commit)]

2. Driver and architecture-specific changes

All the driver and architecture-specific changes can be found in the [http://kernelnewbies.org/Linux_3.5_DriverArch Linux_3.5_DriverArch page]

3. Various core changes

4. Memory Management

5. Block

6. Perf/tracing

7. Virtualization

8. Security

9. Networking

10. File systems

11. Other news sites that track the changes of this release


KernelNewbies: Linux_3.5 (last edited 2013-02-22 16:55:19 by diegocalleja)