Testing of hardware
Before you carry out any serious kernel tests, you should make sure that your hardware is functioning correctly. In principle it should suffice to thoroughly test all of the hardware components once. Later, if you suspect that one of them may be faulty, you can check it separately once again. For example, hard disks are generally prone to failures, so you may want to test your hard disk from time to time.
For scanning the hard disk surface for unusable areas you can use the standard tool badblocks. By default it performs a read-only test, so the data on the disk are not modified. After running it:
# /sbin/badblocks -v /dev/<your disk device>
you should learn relatively quickly if the disk is as good as you would want it to be.
Additional information on the state of the hard disk may be obtained from its S.M.A.R.T. data. To collect it, start a long self-test by running
# smartctl --test=long /dev/<your disk device>
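and then, once the self-test has completed (it runs in the background on the drive itself and may take a long time), run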
# smartctl -a /dev/<your disk device>
to see the result of the test.
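The full report printed by smartctl -a is long, so it can help to pull out just the overall health verdict. The following is only a minimal sketch: smart_health is a hypothetical helper name, and it assumes the ATA-style report line "SMART overall-health self-assessment test result: ..." (other drive types may word this line differently, in which case nothing is printed).

```shell
# Read `smartctl -a` output on stdin and print only the overall health verdict.
# Assumption: the report contains the ATA-style line
#   SMART overall-health self-assessment test result: PASSED
# If the line is absent, nothing is printed.
smart_health() {
    sed -n 's/^SMART overall-health self-assessment test result: *//p'
}

# Typical use (as root; replace the placeholder with your disk device):
#   smartctl -a /dev/<your disk device> | smart_health
```

The helper only filters text, so it is still worth reading the self-test log section of the full report to see how the long test itself ended.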
For memory testing you can use the program Memtest86+ (it is recommended to download the ISO image from the project’s web page at http://www.memtest.org/ and run the program from a CD). For this purpose you can also use the older program Memtest86 (http://www.memtest86.com/) or any other memory-testing utility known to work well.
If you overclock the CPU, the memory, or the graphics adapter, you should stop doing so while the kernel is being tested, since overclocked hardware may behave unreliably and cause random errors that are hard to tell apart from kernel bugs.
It is worth checking whether the voltages used to power the components of your computer are correct. You can do this with the help of the lm_sensors program. Alternatively, you can sometimes use a utility provided by the motherboard vendor for this purpose (unfortunately, these utilities are usually Windows-only).
Additionally, you should keep in mind that various hardware-related errors may appear even if the hardware is not broken. For example, cosmic rays and strong electromagnetic fields may sometimes cause hardware to fail and there are no 100% effective safeguards against them. For this reason, various technologies, such as ECC (Error Correction Codes), have been developed in order to detect and, where possible, correct the errors caused by unexpected physical interactions of this kind. For instance, ECC memory modules store additional bits of information, often referred to as the parity bits, used to check if the regular data bits stored in the memory are correct and to restore the right values of those bits if need be. Of course, such memory modules are more expensive than non-ECC modules of similar characteristics, but it generally is a good idea to use them, especially in mission-critical systems.
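The checking idea can be illustrated with a single even-parity bit per byte. The following is only a minimal sketch in POSIX shell: the function names parity_bit and check_byte and the sample value are made up for illustration. Real ECC memory stores several check bits per data word, which is what allows it to correct an error rather than merely detect it; a single parity bit, as here, can only detect an odd number of flipped bits.

```shell
# Even-parity bit of a non-negative integer:
# prints 1 if the number of set bits is odd, 0 otherwise.
parity_bit() (
    v=$1
    p=0
    while [ "$v" -gt 0 ]; do
        p=$(( p ^ (v & 1) ))   # fold the lowest bit into the running parity
        v=$(( v >> 1 ))
    done
    echo "$p"
)

# Compare a byte read back from "memory" ($1) with its stored parity bit ($2).
check_byte() (
    if [ "$(parity_bit "$1")" -eq "$2" ]; then
        echo ok
    else
        echo error
    fi
)

data=178                          # binary 10110010, four bits set
stored=$(parity_bit "$data")      # parity bit saved alongside the data: 0
flipped=$(( data ^ 8 ))           # simulate one bit flipped by a disturbance

check_byte "$data" "$stored"      # prints: ok
check_byte "$flipped" "$stored"   # prints: error
```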
While testing the kernel you can also encounter so-called MCEs (Machine Check Exceptions), raised by the CPU in some problematic situations. Some processors generate them whenever a parity error is detected in an ECC memory module, and they can also be generated if, for example, there is a problem with the CPU’s internal cache or the CPU is overheating. In such cases the kernel may mark itself as "tainted" (see Section 2.6), and if it crashes in that state, the corresponding error message will contain the letter 'M' in the instruction pointer status line, e.g.
EIP: 0060:[<c046c7c3>] Tainted: PM VLI
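On a running system the taint state can also be inspected without waiting for a crash, because the kernel exposes it as a bit mask in /proc/sys/kernel/tainted. The sketch below assumes, as in mainline kernels, that the 'M' flag corresponds to bit 4 of that mask (value 16); has_mce_taint is a hypothetical helper name, not a standard tool.

```shell
# Succeed (exit status 0) if the taint bit mask given as $1 has the
# machine-check bit set.
# Assumption: 'M' is bit 4 (value 16), as in mainline kernels.
has_mce_taint() {
    [ $(( $1 & 16 )) -ne 0 ]
}

# Typical use on a running system:
#   if has_mce_taint "$(cat /proc/sys/kernel/tainted)"; then
#       echo "kernel is tainted by a machine check exception"
#   fi
```

A mask of 0 means the kernel is not tainted at all; any other value can be decoded bit by bit in the same way.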
There is one more potential source of hardware-related problems that you should be aware of, which is your computer’s BIOS (Basic Input-Output System). In the vast majority of contemporary computers the BIOS, sometimes referred to as the platform firmware, is responsible for configuring the hardware before the operating system kernel is loaded. It also provides the operating system kernel with the essential information on the hardware configuration and capabilities. Thus, if the BIOS is buggy, the Linux kernel may not be able to manage the hardware in the right way and the entire system may not work correctly.
To check if your computer’s BIOS is compatible with the Linux kernel you can use the Linux-ready Firmware Developer Kit (https://www.linux.com/news/intel-linux-ready-firmware-developer-kit-1/). Of course, if it turns out that the BIOS is not Linux-compatible, you will not be able to do much about that, except for updating the BIOS and notifying the mainboard vendor of the problem, but you will know that some issues are likely to appear. This, in turn, may help you assess whether the unexpected behavior of the kernel that you observe is the result of a software bug or stems from the BIOS incompatibility.