Linux_Kernel_Tester's_Guide_Chapter2

Testing

Generally, there are many ways in which you can test the Linux kernel, but we will concentrate on the following four approaches:

Using a test version of the kernel for normal work.
Running special test suites, like LTP, on the new kernel.
Doing unusual things with the new kernel installed.
Measuring the system performance with the new kernel installed.

Of course, all of them can be used within one combined test procedure, so they can be regarded as different phases of the testing process.

2.1 Phase One

The first phase of kernel testing is simple: we try to boot the kernel and use it for normal work.

Before starting the system in a fully functional configuration, it is recommended to boot the kernel with the init=/bin/bash command line argument, which makes it start only one bash process. From there you can check whether the filesystems are mounted and unmounted properly, and you can test some more complex kernel functions, like suspend to disk or to RAM, in the minimal configuration. In that case the only kernel modules loaded are the ones present in the initrd image mentioned in Subsection 1.6.6. Generally, you should refer to the documentation of your boot loader for more information about passing command line arguments to the kernel manually (in our opinion it is easier if GRUB is used).
Next, it is advisable to start the system in runlevel 2 (usually, by passing the number 2 to the kernel as the last command line argument), in which case network servers and the X server are not started (your system may be configured to use another runlevel for this purpose, although this is very unusual, so you should look into /etc/inittab to be sure). In this configuration you can check whether the network interfaces work and you can try to run the X server manually to make sure that it does not crash.
Finally, you can boot the system into runlevel 5 (ie. fully functional) or 3 (ie. fully functional without X), depending on your needs.
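
To see which runlevel a system enters by default, you can look for the initdefault entry in /etc/inittab. The following is only a minimal sketch, assuming a sysvinit-style file in which that entry has the form id:N:initdefault: (the function name default_runlevel is ours and is used purely for illustration):

```python
def default_runlevel(inittab_text):
    """Return the default runlevel from sysvinit-style inittab text, or None."""
    for line in inittab_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Each entry is assumed to look like id:runlevels:action:process
        fields = line.split(":")
        if len(fields) >= 3 and fields[2] == "initdefault":
            return fields[1]
    return None
```

For example, default_runlevel(open("/etc/inittab").read()) gives the runlevel as a string, or None if the file has no initdefault entry. Systems that do not use sysvinit may not have this file at all.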

Now, you are ready to use the system in a normal way for some time. Still, if you want to test the kernel quickly, you can carry out some typical operations, like downloading some files, reading email, browsing some web pages, ripping some audio tracks (from a legally bought audio CD, we presume), burning a CD or DVD etc., in a row to check if any of them fail in a way that would indicate a kernel problem.

2.2 Phase Two (AutoTest)

In the next phase of testing we use special programs designed for checking if specific kernel subsystems work correctly. We also carry out regression and performance tests of the kernel. The latter are particularly important for kernel developers (and for us), since they allow us to identify changes that hurt performance. For example, if the performance of one of our filesystems is 10% worse after we have upgraded the 2.6.x-rc1 kernel to the 2.6.x-rc2 one, it is definitely a good idea to find the patch that causes this to happen.

For automated kernel testing we recommend using the AutoTest suite (http://test.kernel.org/autotest/), which consists of many test applications and profiling tools combined with a fairly simple user interface.

To install AutoTest you can go into the /usr/local directory (as root) and run

# svn checkout svn://test.kernel.org/autotest/trunk autotest

Although it normally is not recommended to run such commands as root, this particular one should be safe, unless you cannot trust your DNS server, because it only downloads some files and saves them in /usr/local. Besides, you will need to run AutoTest as root, since some of its tests require superuser privileges to complete. For this reason you should not use AutoTest on a production system: in extreme cases the data stored in the system the privileged tests are run on can be damaged or even destroyed, and we believe that you would not like this to happen to your production data.

By design, AutoTest is noninteractive, so once started, it will not require your attention (of course, if something goes really wrong, you will have to recover the system, but this is a different kettle of fish). To start it you can go to /usr/local/autotest/client (we assume that AutoTest has been installed in /usr/local) and execute (as root)

# bin/autotest tests/test_name/control

where test_name is the name of the directory in /usr/local/autotest/client/tests that contains the test you want to run. The control file tests/test_name/control contains instructions for AutoTest. In the simplest cases only one such instruction is needed, namely

job.run_test('test_name')

where test_name is the name of the directory that contains the control file. The contents of more sophisticated control files can look like this:

job.run_test('pktgen', 'eth0', 50000, 0, tag='clone_skb_off')
job.run_test('pktgen', 'eth0', 50000, 1, tag='clone_skb_on')

where the values after the test name represent arguments that should be passed to the test application. You can modify these arguments, but first you should read the documentation of the test application as well as the script tests/test_name/test_name.py (eg. tests/pktgen/pktgen.py) used by AutoTest to actually run the test (as you have probably noticed, the AutoTest scripts are written in Python). The results of the execution of the script tests/test_name/test_name.py are saved in the directory results/default/test_name/, where the status file indicates whether or not the test has been completed successfully. To cancel the test, press Ctrl+C while it is being executed.

If you want to run several tests in a row, it is best to prepare a single file containing multiple instructions for AutoTest. The instructions in this file should be similar to the ones contained in the above-mentioned control files. For example, the file samples/all_tests contains instructions for running all of the available tests and its first five lines are the following:

job.run_test('aiostress')
job.run_test('bonnie')
job.run_test('dbench')
job.run_test('fio')
job.run_test('fsx')

To run all of the tests requested by the instructions in this file, you can use the command bin/autotest samples/all_tests, but you should remember that it will take a lot of time to complete. Analogously, to run a custom selection of tests, put the instructions for AutoTest into one file and provide its name as a command line argument to autotest.

To run several tests in parallel, you will need to prepare a special control file containing instructions like these:

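# Each function wraps one test so that it can be run as a separate task.
def kernbench():
    job.run_test('kernbench', 2, 5)
def dbench():
    job.run_test('dbench')
# Run both tasks at the same time.
job.parallel([kernbench], [dbench])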

While the tests are being executed, you can stop them by pressing Ctrl+C at any time.

For people who do not like the command line and configuration files, ATCC (AutoTest Control Center) has been created. If you run it, for example by using the command ui/menu, you will be provided with a simple menu-driven user interface allowing you to select tests and profiling tools, view the results of their execution and, to a limited extent, configure them.

If you are bored with the selection of tools available in the AutoTest package, you can visit the web page http://ltp.sourceforge.net/tooltable.php containing a comprehensive list of tools that can be used for testing the Linux kernel.

2.3 Phase Three

Has your new kernel passed the first two phases of testing? Now, you can start to experiment. That is, to do stupid things that nobody sane would do during normal work, so no one knows that they can crash the kernel. What exactly should be done? Well, if there had been a “standard” procedure, it would have certainly been included in some test suite.

The third phase can start, for example, with unplugging and replugging USB devices. While in theory the replugging of a USB device should not change anything, at least from the user’s point of view, doing it many times in a row may cause the kernel to crash if there is a bug in the USB subsystem (this is likely to expose a problem only if no one has ever tried it on a similarly configured system). Note, however, that this is also stressful to your hardware, so such experiments are better carried out on add-on cards than on the USB ports attached directly to your computer’s mainboard.

Next, you can write a script that will read the contents of files from the /proc directory in a loop, or something similar. In short, in the third phase you should do things that are never done by normal users (or that are done very rarely: why would anyone mount and unmount a certain filesystem in an infinite loop? :)).
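
A script of the kind mentioned above could look roughly like the following. This is a sketch only: the function names and the per-file read limit are our own choices, and the use of O_NONBLOCK is an assumption meant to keep the script from hanging on files that block until data arrives.

```python
import os


def read_tree_once(root, max_bytes=65536):
    """Read up to max_bytes from every file under root.

    Returns (files_read, files_failed). Errors are counted, not raised,
    because many files under /proc are unreadable or vanish while we walk.
    """
    files_read = 0
    files_failed = 0
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Non-blocking open, so that a file with no data ready
                # makes the read fail instead of stalling the whole loop.
                fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
            except OSError:
                files_failed += 1
                continue
            try:
                os.read(fd, max_bytes)
                files_read += 1
            except OSError:
                files_failed += 1
            finally:
                os.close(fd)
    return files_read, files_failed


def stress_read(root, passes):
    """Call read_tree_once() `passes` times and return the summed counts."""
    total_read = 0
    total_failed = 0
    for _ in range(passes):
        ok, failed = read_tree_once(root)
        total_read += ok
        total_failed += failed
    return total_read, total_failed
```

Calling stress_read("/proc", 100) would then walk /proc a hundred times. As with everything in this phase, it is best tried on a test machine while you watch the console and the output of dmesg.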

2.4 Measuring performance

As we have already mentioned, it is good to check the effects of the changes made to the kernel on the performance of the entire system (by the way, this may be an excellent task for beginner testers, who do not want to deal with development kernels yet, although they eagerly want to help develop the kernel). Still, to do this efficiently, you need to know how to do it and where to begin.

To start with, it is recommended to choose one subsystem that you will test regularly, since in that case your reports will be more valuable to the kernel developers. Namely, from time to time messages like “Hello, I’ve noticed that the performance of my network adapter decreased substantially after I had upgraded from 2.6.8 to 2.6.20. Can anyone help me?” appear on the LKML. Of course, in such cases usually no one has the slightest idea of what could have happened, because the kernel 2.6.20 was released two and a half years (and a gazillion random patches) after 2.6.8. Now, in turn, if you report that the performance of your network adapter has dropped 50% between the kernels 2.6.x-rc3 and 2.6.x-rc4, it will be relatively easy to find out why. For this reason it is important to carry out the measurements of performance regularly.
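
A figure such as “dropped 50%” is simply the change between two measurements expressed as a percentage of the earlier one. The helper below is a hypothetical illustration of that arithmetic (the name relative_change is ours) and assumes both numbers were obtained with the same benchmark under the same conditions:

```python
def relative_change(before, after):
    """Return the change from before to after as a percentage of before.

    A negative result means the measured value went down; whether that is
    good or bad depends on the metric (throughput versus elapsed time).
    """
    if before == 0:
        raise ValueError("baseline measurement must be non-zero")
    return 100.0 * (after - before) / before
```

For instance, if a network throughput test gave 100 units on one kernel and 50 units on the next one, relative_change(100.0, 50.0) returns -50.0, which is the kind of number worth including in a report.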

Another thing that you should pay attention to is how your tests actually work. Ideally, you should learn as much as possible about the benchmark that you want to use, so that you know how to obtain reliable results from it. For example, in some Internet or press publications you can find the opinion that running

$ time make

in the kernel source directory is a good test of performance, as it allows you to measure how much time it takes to build the kernel on a given system. While it is true that you can use this kind of test to get some general idea of how “fast” (or how “slow”) the system is, it generally should not be regarded as a measurement of performance, since it is not sufficiently precise. In particular, the kernel compilation time depends not only on the “speed” of the CPU and memory, but also on the time needed to load the necessary data into memory from the hard disk, which in turn may depend on where exactly these data are physically located. In fact, the kernel compilation is quite I/O-intensive and the time needed to complete it may depend on some more or less random factors. Moreover, if you run it twice in a row, the first run usually takes more time to complete than the second one, since the kernel caches the necessary data in memory during the first run and afterwards they can simply be read from there. Thus in order to obtain reproducible results, it is necessary to suppress the impact of the I/O, for example by forcing the kernel to load the data into memory before running the test.

Generally, if you want to carry out the “time make” kind of benchmarks, it is best to use the kernbench script (http://ck.kolivas.org/kernbench/), a newer version of which is included in the AutoTest suite (see Section 2.2). Several good benchmarks are also available from http://ltp.sourceforge.net/tooltable.php (some of them are included in AutoTest too). Still, if you are interested in testing the kernel rather than in testing hardware, you should carefully read the documentation of the chosen benchmark, because it usually contains some information that you may need.
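
The idea of suppressing the effect of caching can be illustrated with a small timing harness that runs a command a few times and discards the first, “cold” run. This is a sketch under our own assumptions, not a replacement for kernbench: the function names are hypothetical, and the command is assumed to do the same amount of work on every run (for a kernel build this would mean starting from an identically cleaned tree each time, a step that is left out here).

```python
import statistics
import subprocess
import time


def time_command(argv, warmup=1, runs=3):
    """Run argv warmup+runs times; return wall-clock seconds of the timed runs."""
    timings = []
    for i in range(warmup + runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        elapsed = time.perf_counter() - start
        # Warm-up runs only bring the data into memory; discard their times.
        if i >= warmup:
            timings.append(elapsed)
    return timings


def summarize(timings):
    """Return (mean, sample standard deviation) of the timings."""
    mean = statistics.mean(timings)
    spread = statistics.stdev(timings) if len(timings) > 1 else 0.0
    return mean, spread
```

A large standard deviation relative to the mean suggests that the measurement is still dominated by more or less random factors, in which case the numbers should not be used for comparing kernels.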

The next important thing that you should always remember is the stability (ie. invariableness) of the environment in which the measurements are carried out. In particular, if you test the kernel, you should not change anything but the kernel in your system, since otherwise you would test two (or more) things at a time and it would not be easy to identify the influence of each of them on the results. For instance, if the measurement is based on building the kernel, it should always be run against the same kernel tree with exactly the same configuration file, using the same compiler and the other necessary tools (of course, once you have upgraded at least one of these tools, the results that you will obtain from this moment on should not be compared with the results obtained before the upgrade).

Generally, you should always do your best to compare apples to apples. For example, if you want to test the performance of three different filesystems, you should not install them on three different partitions of the same disk, since the time needed to read (or write) data from (or to) the disk generally depends on where exactly the operation takes place. Instead, you should create one partition on which you will install each of the tested filesystems. Moreover, in such a case it is better to restart the system between consecutive measurements in order to suppress the effect of the caching of data.

Concluding, we can say that, as far as the measurements of performance are concerned, it is important to

carry out the tests regularly
know the “details” allowing one to obtain reliable results
ensure the stability of the test environment
compare things that are directly comparable

If all of these conditions are met, the resulting data will be a very valuable source of information on the performance of a given kernel subsystem.

2.5 Hello world!, or what exactly are we looking for?

What is it we are looking for? Well, below you can find some examples of messages that are related to kernel problems. Usually, such messages appear on the system console, but sometimes (eg. when the problem is not sufficiently serious to make the kernel crash immediately) you can also see them in the logs.

=============================================
[ INFO: possible recursive locking detected ]
---------------------------------------------
idle/1 is trying to acquire lock:
 (lock_ptr){....}, at: [<c021cbd2>] acpi_os_acquire_lock+0x8/0xa
but task is already holding lock:
 (lock_ptr){....}, at: [<c021cbd2>] acpi_os_acquire_lock+0x8/0xa
other info that might help us debug this:
1 lock held by idle/1:
 #0: (lock_ptr){....}, at: [<c021cbd2>] acpi_os_acquire_lock+0x8/0xa
stack backtrace:
 [<c0103e89>] show_trace+0xd/0x10
 [<c0104483>] dump_stack+0x19/0x1b
 [<c01395fa>] __lock_acquire+0x7d9/0xa50
 [<c0139a98>] lock_acquire+0x71/0x91
 [<c02f0beb>] _spin_lock_irqsave+0x2c/0x3c
 [<c021cbd2>] acpi_os_acquire_lock+0x8/0xa
 [<c0222d95>] acpi_ev_gpe_detect+0x4d/0x10e
 [<c02215c3>] acpi_ev_sci_xrupt_handler+0x15/0x1d
 [<c021c8b1>] acpi_irq+0xe/0x18
 [<c014d36e>] request_irq+0xbe/0x10c
 [<c021cf33>] acpi_os_install_interrupt_handler+0x59/0x87
 [<c02215e7>] acpi_ev_install_sci_handler+0x1c/0x21
 [<c0220d41>] acpi_ev_install_xrupt_handlers+0x9/0x50
 [<c0231772>] acpi_enable_subsystem+0x7d/0x9a
 [<c0416656>] acpi_init+0x3f/0x170
 [<c01003ae>] _stext+0x116/0x26c
 [<c0101005>] kernel_thread_helper+0x5/0xb

The above message indicates that the kernel’s runtime locking correctness validator (often referred to as “lockdep”) has detected a possible locking error. The errors detected by lockdep need not be critical, so if you have enabled lockdep in the kernel configuration, which is recommended for testing (see Subsection 1.6.4), from time to time you can see them in the system logs or in the output of dmesg. They are always worth reporting, although sometimes lockdep may think that there is a problem even if the locking is used in a correct way. Still, in such a case your report will tell the kernel developers that they should teach lockdep not to trigger in this particular place any more.

BUG: sleeping function called from invalid context at /usr/src/linux-mm/sound/core/info.c:117
in_atomic():1, irqs_disabled():0
 <c1003ef9> show_trace+0xd/0xf
 <c100440c> dump_stack+0x17/0x19
 <c10178ce> __might_sleep+0x93/0x9d
 <f988eeb5> snd_iprintf+0x1b/0x84 [snd]
 <f988d808> snd_card_module_info_read+0x34/0x4e [snd]
 <f988f197> snd_info_entry_open+0x20f/0x2cc [snd]
 <c1067a17> __dentry_open+0x133/0x260
 <c1067bb7> nameidata_to_filp+0x1c/0x2e
 <c1067bf7> do_filp_open+0x2e/0x35
 <c1068bf2> do_sys_open+0x54/0xd7
 <c1068ca1> sys_open+0x16/0x18
 <c11dab67> sysenter_past_esp+0x54/0x75

BUG: using smp_processor_id() in preemptible [00000001] code: init/1
caller is __handle_mm_fault+0x2b/0x20d
 [<c0103ba8>] show_trace+0xd/0xf
 [<c0103c7a>] dump_stack+0x17/0x19
 [<c0203bcc>] debug_smp_processor_id+0x8c/0xa0
 [<c0160e60>] __handle_mm_fault+0x2b/0x20d
 [<c0116f7b>] do_page_fault+0x226/0x61f
 [<c0103959>] error_code+0x39/0x40
 [<c019d4c1>] padzero+0x19/0x28
 [<c019e716>] load_elf_binary+0x836/0xc02
 [<c017db53>] search_binary_handler+0x123/0x35a
 [<c019d3b9>] load_script+0x221/0x230
 [<c017db53>] search_binary_handler+0x123/0x35a
 [<c017deee>] do_execve+0x164/0x215
 [<c0101e7a>] sys_execve+0x3b/0x7e
 [<c02fabc3>] syscall_call+0x7/0xb

The first of the two messages above means that one of the kernel’s functions has been called from a wrong place. In this particular case the execution of the function snd_iprintf() might be suspended until a certain condition is satisfied (in such cases the kernel developers say that the function might sleep), so it should not be called from the portions of code that have to be executed atomically, such as interrupt handlers (ie. from atomic context). However, apparently snd_iprintf() has been called from atomic context and the kernel reports this as a potential problem. Such problems need not cause the kernel to crash and the related messages, similar to the above one, can appear in the system logs or in the output of dmesg, but they are serious and should always be reported.

The next message is a so-called Oops, which means that it represents a problem causing the kernel to stop working. In other words, it means that something really bad has happened and the kernel cannot continue running, since your hardware might be damaged or your data might be corrupted otherwise. Such messages are often accompanied by so-called kernel panics (the origin of the term “kernel panic” as well as its possible meanings are explained in the OSWeekly.com article by Puru Govind which is available at http://www.osweekly.com/index.php?option=com_content&task=view&id=2241&Itemid=449).

BUG: unable to handle kernel paging request at virtual address 6b6b6c07
 printing eip:
c0138722
*pde = 00000000
Oops: 0002 [#1]
4K_STACKS PREEMPT SMP
last sysfs file: /devices/pci0000:00/0000:00:1d.7/uevent
Modules linked in: snd_timer snd soundcore snd_page_alloc intel_agp agpgart
ide_cd cdrom ipv6 w83627hf hwmon_vid hwmon i2c_isa i2c_i801 skge af_packet
ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink xt_tcpudp
iptable_filter ip_tables x_tables cpufreq_userspace p4_clockmod speedstep_lib
binfmt_misc thermal processor fan container rtc unix
CPU:     0
EIP:     0060:[<c0138722>]     Not tainted VLI
EFLAGS: 00010046     (2.6.18-rc2-mm1 #78)
EIP is at __lock_acquire+0x362/0xaea
eax: 00000000    ebx: 6b6b6b6b    ecx: c0360358  edx: 00000000
esi: 00000000    edi: 00000000    ebp: f544ddf4  esp: f544ddc0
ds: 007b    es: 007b    ss: 0068
Process udevd (pid: 1353, ti=f544d000 task=f6fce8f0 task.ti=f544d000)
Stack: 00000000 00000000 00000000 c7749ea4 f6fce8f0 c0138e74 000001e8 00000000
         00000000 f6653fa4 00000246 00000000 00000000 f544de1c c0139214 00000000
         00000002 00000000 c014fe3a c7749ea4 c7749e90 f6fce8f0 f5b19b04 f544de34
Call Trace:
[<c0139214>] lock_acquire+0x71/0x91
[<c02f2bfb>] _spin_lock+0x23/0x32
[<c014fe3a>] __delayacct_blkio_ticks+0x16/0x67
[<c01a4f76>] do_task_stat+0x3df/0x6c1
[<c01a5265>] proc_tgid_stat+0xd/0xf
[<c01a29dd>] proc_info_read+0x50/0xb3
[<c0171cbb>] vfs_read+0xcb/0x177
[<c017217c>] sys_read+0x3b/0x71
[<c0103119>] sysenter_past_esp+0x56/0x8d
DWARF2 unwinder stuck at sysenter_past_esp+0x56/0x8d
Leftover inexact backtrace:
[<c0104318>] show_stack_log_lvl+0x8c/0x97
[<c010447f>] show_registers+0x15c/0x1ed
[<c01046c2>] die+0x1b2/0x2b7
[<c0116f5f>] do_page_fault+0x410/0x4f0
[<c0103d1d>] error_code+0x39/0x40
[<c0139214>] lock_acquire+0x71/0x91
[<c02f2bfb>] _spin_lock+0x23/0x32
[<c014fe3a>] __delayacct_blkio_ticks+0x16/0x67
[<c01a4f76>] do_task_stat+0x3df/0x6c1
[<c01a5265>] proc_tgid_stat+0xd/0xf
[<c01a29dd>] proc_info_read+0x50/0xb3
[<c0171cbb>] vfs_read+0xcb/0x177
[<c017217c>] sys_read+0x3b/0x71
[<c0103119>] sysenter_past_esp+0x56/0x8d
Code: 68 4b 75 2f c0 68 d5 04 00 00 68 b9 75 31 c0 68 e3 06 31 c0 e8 ce 7e fe ff
e8 87 c2 fc ff 83 c4 10 eb 08 85 db 0f 84 6b 07 00 00 <f0> ff 83 9c 00 00 00 8b
55 dc 8b 92 5c 05 00 00 89 55 e4 83 fa
EIP: [<c0138722>] __lock_acquire+0x362/0xaea SS:ESP 0068:f544ddc0

The following message represents an error resulting from a situation that, according to the kernel developers, cannot happen:

KERNEL: assertion ((int)tp->lost_out >= 0) failed at net/ipv4/tcp_input.c (2148)
KERNEL: assertion ((int)tp->lost_out >= 0) failed at net/ipv4/tcp_input.c (2148)
KERNEL: assertion ((int)tp->sacked_out >= 0) failed at net/ipv4/tcp_input.c (2147)
KERNEL: assertion ((int)tp->sacked_out >= 0) failed at net/ipv4/tcp_input.c (2147)

BUG: warning at /usr/src/linux-mm/kernel/cpu.c:56/unlock_cpu_hotplug()
 [<c0103e41>] dump_trace+0x70/0x176
 [<c0103fc1>] show_trace_log_lvl+0x12/0x22
 [<c0103fde>] show_trace+0xd/0xf
 [<c01040b0>] dump_stack+0x17/0x19
 [<c0140e19>] unlock_cpu_hotplug+0x46/0x7c
 [<fd9560b0>] cpufreq_set+0x81/0x8b [cpufreq_userspace]
 [<fd956109>] store_speed+0x35/0x40 [cpufreq_userspace]
 [<c02ac9f2>] store+0x38/0x49
 [<c01aec16>] flush_write_buffer+0x23/0x2b
 [<c01aec69>] sysfs_write_file+0x4b/0x6c
 [<c01770af>] vfs_write+0xcb/0x173
 [<c0177203>] sys_write+0x3b/0x71
 [<c010312d>] sysenter_past_esp+0x56/0x8d
 [<b7fbe410>] 0xb7fbe410
 [<c0103fc1>] show_trace_log_lvl+0x12/0x22
 [<c0103fde>] show_trace+0xd/0xf
 [<c01040b0>] dump_stack+0x17/0x19
 [<c0140e19>] unlock_cpu_hotplug+0x46/0x7c
 [<fd9560b0>] cpufreq_set+0x81/0x8b [cpufreq_userspace]
 [<fd956109>] store_speed+0x35/0x40 [cpufreq_userspace]
 [<c02ac9f2>] store+0x38/0x49
 [<c01aec16>] flush_write_buffer+0x23/0x2b
 [<c01aec69>] sysfs_write_file+0x4b/0x6c
 [<c01770af>] vfs_write+0xcb/0x173
 [<c0177203>] sys_write+0x3b/0x71
 [<c010312d>] sysenter_past_esp+0x56/0x8d
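
Messages like the ones shown above can also be searched for automatically, for example in the saved output of dmesg. The following is a rough sketch: the patterns are taken only from the examples in this section, so the list is illustrative rather than complete, and the names PROBLEM_PATTERNS and find_problems are ours.

```python
import re

# Patterns derived from the example messages in this section. Real kernels
# print many other kinds of messages, so treat this list as a starting point.
PROBLEM_PATTERNS = [
    ("lockdep", re.compile(r"INFO: possible .* detected")),
    ("bug", re.compile(r"\bBUG: ")),
    ("oops", re.compile(r"\bOops: ")),
    ("assertion", re.compile(r"KERNEL: assertion .* failed")),
]


def find_problems(log_text):
    """Return a list of (line_number, kind, line) for suspicious lines."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        for kind, pattern in PROBLEM_PATTERNS:
            if pattern.search(line):
                hits.append((number, kind, line.strip()))
                break  # report each line once, under the first matching kind
    return hits
```

The text passed to find_problems() can come from a log file or from the output of dmesg. A hit only tells you where to look; the whole message, including the backtrace that follows it, is what should go into a report.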

Apart from the messages that can appear on the console, in the output of dmesg or in the system logs, some problems can be reported in a less direct way. For example, if the kernel memory leak detector is used (so far, it has not been included in stable kernels), there is the file /sys/kernel/debug/memleak, in which possible kernel memory leaks are registered, eg.

orphan pointer 0xf5a6fd60 (size 39):
c0173822: <__kmalloc>
c01df500: <context_struct_to_string>
c01df679: <security_sid_to_context>
c01d7eee: <selinux_socket_getpeersec_dgram>
f884f019: <unix_get_peersec_dgram>
f8850698: <unix_dgram_sendmsg>
c02a88c2: <sock_sendmsg>
c02a9c7a: <sys_sendto>

This information, supplemented with the kernel configuration file (see Section 1.6), may allow the kernel developers to fix the bug that you have managed to find.

The examples shown above are related to problems that occur when the kernel is running, called run-time errors. Obviously, to get a run-time error you need to build, install and boot the kernel. Surprisingly, however, it is possible that you will not be able to build the kernel due to a compilation error. This is not a frequent problem and it usually indicates that the author of a certain piece of kernel code was not careful enough. Still, this happens to many developers, including us, and if you find a compilation error, report it immediately (you can even try to fix it if you are good at programming in C). To find examples of what happens after someone finds a compilation problem in the kernel, you can look at one of the discussions taking place on the LKML after Andrew Morton announces a new -mm kernel (for more information about the -mm tree see Section 1.5).

It should be stressed that some kernel bugs are not immediately visible. Some of them show up only in specific situations and may manifest themselves, for example, in hanging random processes or dropping random data into files that are written to. For instance, there is a whole category of kernel problems that appear only when the system is suspended to RAM or hibernated (ie. suspended to disk), either during the suspend, or while the kernel is resuming normal operations. All in all, you will never know what surprises the kernel has got for you, so you had better be prepared.

Generally, kernel run-time errors can be divided into three categories:

easily reproducible – such that we know exactly what to do to provoke them to happen
fairly reproducible – such that occur quite regularly and we know more or less in what situations
difficult to reproduce – such that occur in (seemingly) random circumstances and we have no idea how to make them occur

The easily reproducible bugs are the easiest to fix, since in these cases it is quite easy, albeit often quite time-consuming, to find a patch that has introduced the problem. For this reason, the easily reproducible bugs are usually fixed relatively quickly. In turn, the bugs that are difficult to reproduce usually take a lot of time to get fixed, since in these cases the source of the problem cannot be easily identified. If you encounter such a bug, you will probably need the help of developers who know the relevant kernel subsystem and you will have to be very patient.

2.6 Binary drivers and distribution kernels

Quite often you can hear that so-called “binary” drivers are “evil” and you should not use them. Well, this is generally true, apart from the fact that sometimes you have no choice (eg. new AMD/ATI graphics adapters are not supported by any Open Source driver known to us). In our opinion there is at least one practical argument for not using binary drivers. Namely, if you find a bug in the kernel that occurs while you are using a binary driver, the kernel developers may be unable to help you, because they have no access to the driver’s source code.

When you are using a binary driver, the kernel is “tainted”, which means that the source of possible problems may be unrelated to the kernel code (see https://secure-support.novell.com/KanisaPlatform/Publishing/250/3582750_f.SAL_Public.html for more details). You can check whether or not the kernel was tainted when the problem occurred by looking at the corresponding error message. If you can see something similar to the following line:

EIP: 0060:[<c046c7c3>] Tainted: P VLI

(the word Tainted is crucial here), the kernel was tainted and most probably the kernel developers will not be able to help you. In that case you should try to reproduce the problem without the binary driver loaded. Moreover, if the problem does not occur without it, you should send a bug report to the creators of the binary driver and ask them to fix it.
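
The check described above can be sketched in a few lines. The first function looks for the Tainted: marker in a line of an error message, as in the example line shown earlier. The second one reads /proc/sys/kernel/tainted, which holds a number that is zero for an untainted kernel; we assume here that the running kernel provides this file. Both function names are hypothetical.

```python
import re

# Matches the taint marker as it appears in the example line above.
TAINTED_RE = re.compile(r"\bTainted: (\S+)")


def taint_flags(message_line):
    """Return the flag letters after 'Tainted:', or None if the marker is absent."""
    match = TAINTED_RE.search(message_line)
    return match.group(1) if match else None


def read_taint_value(path="/proc/sys/kernel/tainted"):
    """Return the running kernel's taint value as an integer (0 means not tainted)."""
    with open(path) as f:
        return int(f.read().strip())
```

For the example line above, taint_flags() returns "P". For a line containing “Not tainted” it returns None.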

In the file Documentation/oops-tracing.txt, included in the kernel sources, there is a list of reasons why the kernel can be considered tainted. As follows from this document, the presence of a binary module is not the only possible reason for tainting the kernel, but in practice it turns out to be the most frequent one. Generally, you should avoid reporting problems in tainted kernels to the LKML (or to the kernel developers in general) and the problems related to binary drivers should be reported to their providers.

Another case in which you should not report kernel problems to the LKML, or directly to the kernel developers, is when you are using a distribution kernel. The main reasons for this are the following:

Distribution kernels often contain modifications that are not included in the kernels available from ftp://ftp.kernel.org and have not been accepted by the maintainers of relevant kernel subsystems
Some of these modifications are very experimental and they tend to introduce bugs that are not present in the “official” kernels
Distribution kernels are meant to be supported by their distributors rather than by the kernel developers, so the problems in these kernels should be reported to the distributors in the first place (usually, the distributor will contact the kernel developers anyway if that is necessary)

Of course, if the problem can be reproduced using the “original” kernel on which the distribution one is based, it can be reported to the kernel developers. Still, it usually is better to let the distributor know of the problem anyway and in our opinion it does not make sense to report the same problem twice, does it?

KernelNewbies: Linux_Kernel_Tester's_Guide_Chapter2 (last edited 2007-06-11 22:02:02 by unregister003160097217)