== Testing ==

Generally, there are many ways in which you can test the Linux kernel, but we will concentrate on the following four approaches:

1. Using a test version of the kernel for normal work.
2. Running special test suites, like LTP, on the new kernel.
3. Doing unusual things with the new kernel installed.
4. Measuring the system performance with the new kernel installed.

Of course, all of them can be used within one combined test procedure, so they can be regarded as different phases of the testing process.

=== 2.1 Phase One ===

The first phase of kernel testing is simple: we try to boot the kernel and use it for normal work.

* Before starting the system in a fully functional configuration, it is recommended to boot the kernel with the {{{init=/bin/bash}}} command line argument, which makes it start only one bash process. From there you can check if the filesystems are mounted and unmounted properly and you can test some more complex kernel functions, like suspend to disk or to RAM, in the minimal configuration. In that case the only kernel modules loaded are the ones present in the initrd image mentioned in Subsection 1.6.6. Generally, you should refer to the documentation of your boot loader for more information about passing command line arguments to the kernel manually (in our opinion it is easier if GRUB is used).
* Next, it is advisable to start the system in runlevel 2 (usually, by passing the number 2 to the kernel as the last command line argument), in which case network servers and the X server are not started (your system may be configured to use another runlevel for this purpose, although this is very unusual, so you should look into {{{/etc/inittab}}} to be sure). In this configuration you can check if the network interfaces work and you can try to run the X server manually to make sure that it does not crash.
* Finally, you can boot the system into runlevel 5 (i.e. fully functional) or 3 (i.e. fully functional without X), depending on your needs.

Now you are ready to use the system in a normal way for some time. Still, if you want to test the kernel quickly, you can carry out some typical operations in a row, like downloading some files, reading email, browsing some web pages, ripping some audio tracks (from a legally bought audio CD, we presume), burning a CD or DVD etc., to check if any of them fail in a way that would indicate a kernel problem.

=== 2.2 Phase Two (AutoTest) ===

In the next phase of testing we use special programs designed for checking if specific kernel subsystems work correctly. We also carry out regression and performance tests of the kernel. The latter are particularly important for kernel developers (and for us), since they allow us to identify changes that hurt performance. For example, if the performance of one of our filesystems is 10% worse after we have upgraded the 2.6.x-rc1 kernel to the 2.6.x-rc2 one, it is definitely a good idea to find the patch that causes this to happen.

For automated kernel testing we recommend that you use the AutoTest suite (http://test.kernel.org/autotest/), which consists of many test applications and profiling tools combined with a fairly simple user interface.

To install AutoTest you can go into the {{{/usr/local}}} directory (as root) and run

{{{
# svn checkout svn://test.kernel.org/autotest/trunk autotest
}}}

Although it normally is not recommended to run such commands as root, this particular one should be safe, unless you cannot trust your DNS server, because it only downloads some files and saves them in {{{/usr/local}}}. Besides, you will need to run AutoTest as root, since some of its tests require superuser privileges to complete. For this reason you should not use AutoTest on a production system: in extreme cases the data stored in the system the privileged tests are run on can be damaged or even destroyed, and we believe that you would not like this to happen to your production data.
By design, AutoTest is noninteractive, so once started, it will not require your attention (of course, if something goes really wrong, you will have to recover the system, but this is a different kettle of fish). To start it you can go to {{{/usr/local/autotest/client}}} (we assume that AutoTest has been installed in {{{/usr/local}}}) and execute (as root)

{{{
# bin/autotest tests/test_name/control
}}}

where test_name is the name of the directory in {{{/usr/local/autotest/client/tests}}} that contains the test you want to run.

The control file {{{tests/test_name/control}}} contains instructions for AutoTest. In the simplest cases only one such instruction is needed, namely

{{{
job.run_test('test_name')
}}}

where test_name is the name of the directory that contains the control file. The contents of more sophisticated control files can look like this:

{{{
job.run_test('pktgen', 'eth0', 50000, 0, tag='clone_skb_off')
job.run_test('pktgen', 'eth0', 50000, 1, tag='clone_skb_on')
}}}

where the values after the test name are the arguments that should be passed to the test application. You can modify these arguments, but first you should read the documentation of the test application as well as the script {{{tests/test_name/test_name.py}}} (e.g. {{{tests/pktgen/pktgen.py}}}) used by AutoTest to actually run the test (as you have probably noticed, the AutoTest scripts are written in Python). The results of the execution of the script {{{tests/test_name/test_name.py}}} are saved in the directory {{{results/default/test_name/}}}, where the status file indicates whether or not the test has been completed successfully. To cancel the test, press Ctrl+C while it is being executed.

If you want to run several tests in a row, it is best to prepare a single file containing multiple instructions for AutoTest. The instructions in this file should be similar to the ones contained in the above-mentioned control files.
For example, the file {{{samples/all_tests}}} contains instructions for running all of the available tests and its first five lines are the following:

{{{
job.run_test('aiostress')
job.run_test('bonnie')
job.run_test('dbench')
job.run_test('fio')
job.run_test('fsx')
}}}

To run all of the tests requested by the instructions in this file, you can use the command

{{{
bin/autotest samples/all_tests
}}}

but you should remember that it will take a lot of time to complete. Analogously, to run a custom selection of tests, put the instructions for AutoTest into one file and provide its name as a command line argument to autotest.

To run several tests in parallel, you will need to prepare a special control file containing instructions like these:

{{{
def kernbench():
    job.run_test('kernbench', 2, 5)

def dbench():
    job.run_test('dbench')

job.parallel([kernbench], [dbench])
}}}

While the tests are being executed, you can stop them by pressing Ctrl+C at any time.

For people who do not like the command line and configuration files, ATCC (AutoTest Control Center) has been created. If you run it, for example by using the command {{{ui/menu}}}, you will be provided with a simple menu-driven user interface allowing you to select tests and profiling tools, view the results of their execution and, to a limited extent, configure them.

If you are bored with the selection of tools available in the AutoTest package, you can visit the web page http://ltp.sourceforge.net/tooltable.php containing a comprehensive list of tools that can be used for testing the Linux kernel.

=== 2.3 Phase Three ===

Has your new kernel passed the first two phases of testing? Now you can start to experiment. That is, you can do silly things that nobody sane would do during normal work, so no one knows yet that they can crash the kernel. What exactly should be done? Well, if there had been a "standard" procedure, it would certainly have been included in some test suite.
The third phase can be started, for example, by unplugging and replugging USB devices. While in theory the replugging of a USB device should not change anything, at least from the user's point of view, doing it many times in a row may cause the kernel to crash if there is a bug in the USB subsystem (of course, you are only likely to find such a bug if no one has tried this on a similarly configured system before). Note, however, that this is also stressful to your hardware, so such experiments are better carried out on add-on cards rather than on the USB ports attached directly to your computer's mainboard.

Next, you can write a script that will read the contents of files from the {{{/proc}}} directory in a loop or some such. In short, in the third phase you should do things that are never done by normal users (or that are done very rarely: why would anyone mount and unmount a certain filesystem in an infinite loop? :)).

=== 2.4 Measuring performance ===

As we have already mentioned, it is good to check the effects of the changes made to the kernel on the performance of the entire system (by the way, this may be an excellent task for beginner testers, who do not want to deal with development kernels yet, although they eagerly want to help develop the kernel). Still, to do this efficiently, you need to know how to do it and where to begin.

To start with, it is recommended to choose one subsystem that you will test regularly, since in that case your reports will be more valuable to the kernel developers. Namely, from time to time messages like "Hello, I've noticed that the performance of my network adapter decreased substantially after I had upgraded from 2.6.8 to 2.6.20. Can anyone help me?" appear on the LKML. Of course, in such cases usually no one has the slightest idea of what could have happened, because the kernel 2.6.20 was released two and a half years (and a gazillion random patches) after 2.6.8.
Now, in turn, if you report that the performance of your network adapter has dropped 50% between the kernels 2.6.x-rc3 and 2.6.x-rc4, it will be relatively easy to find out why. For this reason it is important to carry out the measurements of performance regularly.

Another thing that you should pay attention to is how your tests actually work. Ideally, you should learn as much as possible about the benchmark that you want to use, so that you know how to obtain reliable results from it. For example, in some Internet or press publications you can find the opinion that running

{{{
$ time make
}}}

in the kernel source directory is a good test of performance, as it allows you to measure how much time it takes to build the kernel on a given system. While it is true that you can use this kind of test to get some general idea of how "fast" (or how "slow") the system is, it generally should not be regarded as a measurement of performance, since it is not sufficiently precise. In particular, the kernel compilation time depends not only on the "speed" of the CPU and memory, but also on the time needed to load the necessary data into memory from the hard disk, which in turn may depend on where exactly these data are physically located. In fact, the kernel compilation is quite I/O-intensive and the time needed to complete it may depend on some more or less random factors. Moreover, if you run it twice in a row, the first run usually takes more time to complete than the second one, since the kernel caches the necessary data in memory during the first run and afterwards they can simply be read from there. Thus in order to obtain reproducible results, it is necessary to suppress the impact of the I/O, for example by forcing the kernel to load the data into memory before running the test.
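The idea of loading the data into memory before the measurement can be sketched as follows. This is only an illustration under stated assumptions, not part of kernbench or AutoTest: the helper names {{{warm_page_cache}}} and {{{time_command}}} are hypothetical, and reading every file once merely makes it likely (not certain) that later reads are served from the page cache.

```python
import subprocess
import sys
import time
from pathlib import Path


def warm_page_cache(root: Path) -> int:
    """Read every regular file under root once; return the number of bytes read.

    Assumption: after this pass, later reads of the same files are likely
    (not guaranteed) to come from the page cache instead of the disk.
    """
    total = 0
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        try:
            with path.open("rb") as handle:
                while chunk := handle.read(1 << 20):
                    total += len(chunk)
        except OSError:
            # Unreadable files are skipped; they do not matter for the timing.
            continue
    return total


def time_command(command: list[str], cwd: Path, runs: int = 3) -> list[float]:
    """Run command `runs` times in cwd; return wall-clock seconds for each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(command, cwd=cwd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Hypothetical usage: python warm_and_time.py /path/to/tree make -s
    tree = Path(sys.argv[1])
    print("bytes read while warming the cache:", warm_page_cache(tree))
    print("seconds:", time_command(sys.argv[2:], cwd=tree, runs=1))
```

Note that a build command such as {{{make}}} has little left to do on a second run unless the tree is cleaned in between, so the example times a single run; repeated runs only make sense for commands that do the same work every time.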
Generally, if you want to carry out the "{{{time make}}}" kind of benchmarks, it is best to use the kernbench script (http://ck.kolivas.org/kernbench/), a newer version of which is included in the AutoTest suite (see Section 2.2). Several good benchmarks are also available from http://ltp.sourceforge.net/tooltable.php (some of them are included in AutoTest too). Still, if you are interested in testing the kernel rather than in testing hardware, you should carefully read the documentation of the chosen benchmark, because it usually contains some information that you may need.

The next important thing that you should always remember is the stability (i.e. invariability) of the environment in which the measurements are carried out. In particular, if you test the kernel, you should not change anything but the kernel in your system, since otherwise you would test two (or more) things at a time and it would not be easy to identify the influence of each of them on the results. For instance, if the measurement is based on building the kernel, it should always be run against the same kernel tree with exactly the same configuration file, using the same compiler and the other necessary tools (of course, once you have upgraded at least one of these tools, the results that you obtain from this moment on should not be compared with the results obtained before the upgrade).

Generally, you should always do your best to compare apples to apples. For example, if you want to test the performance of three different filesystems, you should not install them on three different partitions of the same disk, since the time needed to read (or write) data from (or to) the disk generally depends on where exactly the operation takes place. Instead, you should create one partition on which you will install each of the tested filesystems in turn. Moreover, in such a case it is better to restart the system between consecutive measurements in order to suppress the effect of the caching of data.
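As a worked example of comparing like with like, the sketch below reduces repeated timings taken on two kernels to a single relative change. The function name {{{relative_change}}} and the sample numbers are made up for illustration; using the median is only one reasonable way to damp the effect of an occasional outlier.

```python
from statistics import median


def relative_change(baseline: list[float], candidate: list[float]) -> float:
    """Return (median(candidate) - median(baseline)) / median(baseline).

    For timings, a positive result means the candidate kernel was slower.
    """
    base = median(baseline)
    return (median(candidate) - base) / base


if __name__ == "__main__":
    # Made-up build times in seconds, measured in the same environment.
    old_kernel = [200.0, 198.0, 202.0]
    new_kernel = [220.0, 219.0, 222.0]
    print(f"change: {relative_change(old_kernel, new_kernel):+.1%}")
```

A figure obtained this way is only meaningful if both series were measured in the same environment, as described above.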
Concluding, we can say that, as far as the measurements of performance are concerned, it is important to

* carry out the tests regularly
* know the "details" allowing one to obtain reliable results
* ensure the stability of the test environment
* compare things that are directly comparable

If all of these conditions are met, the resulting data will be a very valuable source of information on the performance of a given kernel subsystem.

=== 2.5 Hello world!, or what exactly are we looking for? ===

What is it we are looking for? Well, below you can find some examples of messages that are related to kernel problems. Usually, such messages appear on the system console, but sometimes (e.g. when the problem is not sufficiently serious to make the kernel crash immediately) you can also see them in the logs.

{{{
=============================================
[ INFO: possible recursive locking detected ]
---------------------------------------------
idle/1 is trying to acquire lock:
(lock_ptr){....}, at: [] acpi_os_acquire_lock+0x8/0xa
but task is already holding lock:
(lock_ptr){....}, at: [] acpi_os_acquire_lock+0x8/0xa
other info that might help us debug this:
1 lock held by idle/1:
#0: (lock_ptr){....}, at: [] acpi_os_acquire_lock+0x8/0xa
stack backtrace:
[] show_trace+0xd/0x10
[] dump_stack+0x19/0x1b
[] __lock_acquire+0x7d9/0xa50
[] lock_acquire+0x71/0x91
[] _spin_lock_irqsave+0x2c/0x3c
[] acpi_os_acquire_lock+0x8/0xa
[] acpi_ev_gpe_detect+0x4d/0x10e
[] acpi_ev_sci_xrupt_handler+0x15/0x1d
[] acpi_irq+0xe/0x18
[] request_irq+0xbe/0x10c
[] acpi_os_install_interrupt_handler+0x59/0x87
[] acpi_ev_install_sci_handler+0x1c/0x21
[] acpi_ev_install_xrupt_handlers+0x9/0x50
[] acpi_enable_subsystem+0x7d/0x9a
[] acpi_init+0x3f/0x170
[] _stext+0x116/0x26c
[] kernel_thread_helper+0x5/0xb
}}}

The above message indicates that the kernel's runtime locking correctness validator (often referred to as "lockdep") has detected a possible locking error.
The errors detected by lockdep need not be critical, so if you have enabled lockdep in the kernel configuration, which is recommended for testing (see Subsection 1.6.4), from time to time you can see them in the system logs or in the output of dmesg. They are always worth reporting, although sometimes lockdep may think that there is a problem even if the locking is used in a correct way. Still, in such a case your report will tell the kernel developers that they should teach lockdep not to trigger in this particular place any more.

{{{
BUG: sleeping function called from invalid context at /usr/src/linux-mm/sound/core/info.c:117
in_atomic():1, irqs_disabled():0
show_trace+0xd/0xf
dump_stack+0x17/0x19
__might_sleep+0x93/0x9d
snd_iprintf+0x1b/0x84 [snd]
snd_card_module_info_read+0x34/0x4e [snd]
snd_info_entry_open+0x20f/0x2cc [snd]
__dentry_open+0x133/0x260
nameidata_to_filp+0x1c/0x2e
do_filp_open+0x2e/0x35
do_sys_open+0x54/0xd7
sys_open+0x16/0x18
sysenter_past_esp+0x54/0x75

BUG: using smp_processor_id() in preemptible [00000001] code: init/1
caller is __handle_mm_fault+0x2b/0x20d
[] show_trace+0xd/0xf
[] dump_stack+0x17/0x19
[] debug_smp_processor_id+0x8c/0xa0
[] __handle_mm_fault+0x2b/0x20d
[] do_page_fault+0x226/0x61f
[] error_code+0x39/0x40
[] padzero+0x19/0x28
[] load_elf_binary+0x836/0xc02
[] search_binary_handler+0x123/0x35a
[] load_script+0x221/0x230
[] search_binary_handler+0x123/0x35a
[] do_execve+0x164/0x215
[] sys_execve+0x3b/0x7e
[] syscall_call+0x7/0xb
}}}

The above messages mean that a kernel function has been called from a wrong place. In the first case, the execution of the function snd_iprintf() might be suspended until a certain condition is satisfied (in such cases the kernel developers say that the function might sleep), so it should not be called from the portions of code that have to be executed atomically, such as interrupt handlers (i.e. from atomic context).
However, apparently snd_iprintf() has been called from atomic context and the kernel reports this as a potential problem. Such problems need not cause the kernel to crash and the related messages, similar to the above one, can appear in the system logs or in the output of dmesg, but they are serious and should always be reported.

The next message is a so-called Oops, which means that it represents a problem causing the kernel to stop working. In other words, it means that something really bad has happened and the kernel cannot continue running, since otherwise your hardware might be damaged or your data might be corrupted. Such messages are often accompanied by so-called kernel panics (the origin of the term "kernel panic" as well as its possible meanings are explained in the OSWeekly.com article by Puru Govind which is available at http://www.osweekly.com/index.php?option=com_content&task=view&id=2241&Itemid=449).

{{{
BUG: unable to handle kernel paging request at virtual address 6b6b6c07
printing eip:
c0138722
*pde = 00000000
Oops: 0002 [#1]
4K_STACKS PREEMPT SMP
last sysfs file: /devices/pci0000:00/0000:00:1d.7/uevent
Modules linked in: snd_timer snd soundcore snd_page_alloc intel_agp agpgart ide_cd cdrom ipv6 w83627hf hwmon_vid hwmon i2c_isa i2c_i801 skge af_packet ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink xt_tcpudp iptable_filter ip_tables x_tables cpufreq_userspace p4_clockmod speedstep_lib binfmt_misc thermal processor fan container rtc unix
CPU: 0
EIP: 0060:[] Not tainted VLI
EFLAGS: 00010046 (2.6.18-rc2-mm1 #78)
EIP is at __lock_acquire+0x362/0xaea
eax: 00000000 ebx: 6b6b6b6b ecx: c0360358 edx: 00000000
esi: 00000000 edi: 00000000 ebp: f544ddf4 esp: f544ddc0
ds: 007b es: 007b ss: 0068
Process udevd (pid: 1353, ti=f544d000 task=f6fce8f0 task.ti=f544d000)
Stack: 00000000 00000000 00000000 c7749ea4 f6fce8f0 c0138e74 000001e8 00000000
00000000 f6653fa4 00000246 00000000 00000000 f544de1c c0139214 00000000
00000002 00000000 c014fe3a c7749ea4 c7749e90 f6fce8f0 f5b19b04 f544de34
Call Trace:
[] lock_acquire+0x71/0x91
[] _spin_lock+0x23/0x32
[] __delayacct_blkio_ticks+0x16/0x67
[] do_task_stat+0x3df/0x6c1
[] proc_tgid_stat+0xd/0xf
[] proc_info_read+0x50/0xb3
[] vfs_read+0xcb/0x177
[] sys_read+0x3b/0x71
[] sysenter_past_esp+0x56/0x8d
DWARF2 unwinder stuck at sysenter_past_esp+0x56/0x8d
Leftover inexact backtrace:
[] show_stack_log_lvl+0x8c/0x97
[] show_registers+0x15c/0x1ed
[] die+0x1b2/0x2b7
[] do_page_fault+0x410/0x4f0
[] error_code+0x39/0x40
[] lock_acquire+0x71/0x91
[] _spin_lock+0x23/0x32
[] __delayacct_blkio_ticks+0x16/0x67
[] do_task_stat+0x3df/0x6c1
[] proc_tgid_stat+0xd/0xf
[] proc_info_read+0x50/0xb3
[] vfs_read+0xcb/0x177
[] sys_read+0x3b/0x71
[] sysenter_past_esp+0x56/0x8d
Code: 68 4b 75 2f c0 68 d5 04 00 00 68 b9 75 31 c0 68 e3 06 31 c0 e8 ce 7e fe ff e8 87 c2 fc ff 83 c4 10 eb 08 85 db 0f 84 6b 07 00 00 ff 83 9c 00 00 00 8b 55 dc 8b 92 5c 05 00 00 89 55 e4 83 fa
EIP: [] __lock_acquire+0x362/0xaea SS:ESP 0068:f544ddc0
}}}

The following messages represent errors resulting from situations that, according to the kernel developers, cannot happen:

{{{
KERNEL: assertion ((int)tp->lost_out >= 0) failed at net/ipv4/tcp_input.c (2148)
KERNEL: assertion ((int)tp->lost_out >= 0) failed at net/ipv4/tcp_input.c (2148)
KERNEL: assertion ((int)tp->sacked_out >= 0) failed at net/ipv4/tcp_input.c (2147)
KERNEL: assertion ((int)tp->sacked_out >= 0) failed at net/ipv4/tcp_input.c (2147)
}}}

{{{
BUG: warning at /usr/src/linux-mm/kernel/cpu.c:56/unlock_cpu_hotplug()
[] dump_trace+0x70/0x176
[] show_trace_log_lvl+0x12/0x22
[] show_trace+0xd/0xf
[] dump_stack+0x17/0x19
[] unlock_cpu_hotplug+0x46/0x7c
[] cpufreq_set+0x81/0x8b [cpufreq_userspace]
[] store_speed+0x35/0x40 [cpufreq_userspace]
[] store+0x38/0x49
[] flush_write_buffer+0x23/0x2b
[] sysfs_write_file+0x4b/0x6c
[] vfs_write+0xcb/0x173
[] sys_write+0x3b/0x71
[] sysenter_past_esp+0x56/0x8d
[] 0xb7fbe410
[] show_trace_log_lvl+0x12/0x22
[] show_trace+0xd/0xf
[] dump_stack+0x17/0x19
[] unlock_cpu_hotplug+0x46/0x7c
[] cpufreq_set+0x81/0x8b [cpufreq_userspace]
[] store_speed+0x35/0x40 [cpufreq_userspace]
[] store+0x38/0x49
[] flush_write_buffer+0x23/0x2b
[] sysfs_write_file+0x4b/0x6c
[] vfs_write+0xcb/0x173
[] sys_write+0x3b/0x71
[] sysenter_past_esp+0x56/0x8d
}}}

Apart from the messages that can appear on the console, in the output of dmesg or in the system logs, some problems can be reported in a less direct way. For example, if the kernel memory leak detector is used (so far, it has not been included in stable kernels), there is the file {{{/sys/kernel/debug/memleak}}}, in which possible kernel memory leaks are registered, e.g.

{{{
orphan pointer 0xf5a6fd60 (size 39):
c0173822: <__kmalloc>
c01df500:
c01df679:
c01d7eee:
f884f019:
f8850698:
c02a88c2:
c02a9c7a:
}}}

This information, supplemented with the kernel configuration file (see Section 1.6), may allow the kernel developers to fix the bug that you have managed to find.

The examples shown above are related to problems that occur when the kernel is running, called run-time errors. Obviously, to get a run-time error you need to build, install and boot the kernel. Surprisingly, however, it is possible that you will not be able to build the kernel due to a compilation error. This is not a frequent problem and it usually indicates that the author of a certain piece of kernel code was not careful enough. Still, this happens to many developers, including us, and if you find a compilation error, report it immediately (you can even try to fix it if you are good at programming in C). To find examples of what happens after someone finds a compilation problem in the kernel, you can look at one of the discussions taking place on the LKML after Andrew Morton announces a new -mm kernel (for more information about the -mm tree see Section 1.5).

It should be stressed that some kernel bugs are not immediately visible.
Some of them show up only in specific situations and may manifest themselves, for example, in hanging random processes or dropping random data into files that are written to. For instance, there is a whole category of kernel problems that appear only when the system is suspended to RAM or hibernated (i.e. suspended to disk), either during the suspend, or while the kernel is resuming normal operations. All in all, you will never know what surprises the kernel has got for you, so you had better be prepared.

Generally, kernel run-time errors can be divided into three categories:

* easily reproducible: we know exactly what to do to provoke them
* fairly reproducible: they occur quite regularly and we know more or less in what situations
* difficult to reproduce: they occur in (seemingly) random circumstances and we have no idea how to make them occur

The easily reproducible bugs are the easiest to fix, since in these cases it is quite easy, albeit often quite time-consuming, to find the patch that has introduced the problem. For this reason, the easily reproducible bugs are usually fixed relatively quickly. In turn, the bugs that are difficult to reproduce usually take a lot of time to get fixed, since in these cases the source of the problem cannot be easily identified. If you encounter such a bug, you will probably need the help of developers who know the relevant kernel subsystem and you will have to be very patient.

=== 2.6 Binary drivers and distribution kernels ===

Quite often you can hear that so-called "binary" drivers are "evil" and you should not use them. Well, this is generally true, apart from the fact that sometimes you have no choice (e.g. new AMD/ATI graphics adapters are not supported by any Open Source driver known to us). In our opinion there is at least one practical argument for not using binary drivers.
Namely, if you find a bug in the kernel that occurs while you are using a binary driver, the kernel developers may be unable to help you, because they have no access to the driver's source code.

When you are using a binary driver, the kernel is "tainted", which means that the source of possible problems may be unrelated to the kernel code (see https://secure-support.novell.com/KanisaPlatform/Publishing/250/3582750_f.SAL_Public.html for more details). You can check whether or not the kernel was tainted when the problem occurred by looking at the corresponding error message. If you can see something similar to the following line:

{{{
EIP: 0060:[] Tainted: P VLI
}}}

(the word Tainted is crucial here), the kernel was tainted and most probably the kernel developers will not be able to help you. In that case you should try to reproduce the problem without the binary driver loaded. Moreover, if the problem does not occur without it, you should send a bug report to the creators of the binary driver and ask them to fix it.

In the file {{{Documentation/admin-guide/tainted-kernels.rst}}}, included in the kernel sources, there is a list of reasons why the kernel can be considered as tainted. As follows from this document, the presence of a binary module is not the only possible reason for tainting the kernel, but in practice it turns out to be the most frequent one. Generally, you should avoid reporting problems in tainted kernels to the LKML (or to the kernel developers in general) and the problems related to binary drivers should be reported to their providers.

Another case in which you should not report kernel problems to the LKML, or directly to the kernel developers, is when you are using a distribution kernel.
The main reasons for this are the following:

* Distribution kernels often contain modifications that are not included in the kernels available from http://www.kernel.org and have not been accepted by the maintainers of the relevant kernel subsystems
* Some of these modifications are very experimental and they tend to introduce bugs that are not present in the "official" kernels
* Distribution kernels are meant to be supported by their distributors rather than by the kernel developers, so the problems in these kernels should be reported to the distributors in the first place (usually, the distributor will contact the kernel developers anyway if that is necessary)

Of course, if the problem can be reproduced using the "original" kernel on which the distribution one is based, it can be reported to the kernel developers. Still, it usually is better to let the distributor know of the problem anyway and in our opinion it does not make sense to report the same problem twice, does it?
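Besides looking for the word Tainted in an error message, the taint state of a running kernel can also be inspected directly. The sketch below assumes a kernel that exposes the taint bitmask in {{{/proc/sys/kernel/tainted}}}; the helper names are hypothetical, and only a few flags are decoded, with bit positions as listed in {{{Documentation/admin-guide/tainted-kernels.rst}}} of recent kernels (older kernels define fewer flags).

```python
from pathlib import Path

# Partial table only: bit number -> (letter, meaning).
# Assumption: bit positions follow tainted-kernels.rst in recent kernels.
TAINT_FLAGS = {
    0: ("P", "proprietary module was loaded"),
    7: ("D", "kernel has died recently (Oops or BUG)"),
    9: ("W", "kernel issued a warning"),
    12: ("O", "out-of-tree module was loaded"),
}


def decode_taint(value: int) -> list[str]:
    """Return descriptions of the known taint flags set in value."""
    return [
        f"{letter}: {meaning}"
        for bit, (letter, meaning) in sorted(TAINT_FLAGS.items())
        if value & (1 << bit)
    ]


def read_taint(path: Path = Path("/proc/sys/kernel/tainted")) -> int:
    """Read the taint bitmask of the running kernel (Linux only)."""
    return int(path.read_text().strip())


if __name__ == "__main__":
    mask = read_taint()
    if mask == 0:
        print("kernel is not tainted")
    else:
        print(f"taint mask {mask}:")
        for line in decode_taint(mask):
            print("  " + line)
        print("  (bits not in the partial table above are not shown)")
```

A mask of 0 means that the kernel is not tainted; any other value is worth decoding before you decide where to send a bug report.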