Wade Mealing
Email: <wmealing AT SPAMFREE gmail DOT com>
I'm going to document some of the things that I work on, specifically how I use crash on RHEL-style kernels. Perhaps this will be useful to someone else who also needs to figure out
the output of crash and what some of the craziness means. I'm still filling in sections, so this is not complete yet.
What is crash:
The crash utility is a mechanism for running a "gdb-like" session against a kernel image. The image itself can be from a running kernel or a "core image" (known as a vmcore) captured after the system has panicked. While crash commands may seem familiar to gdb users, gdb is designed to debug userspace applications. Because of this, gdb works under fewer constraints and is able to see and manipulate the user process. The crash program, while much like gdb, has to work under a tighter set of constraints and is unable to use all gdb functionality. While the crash utility can be run against a live system, I will not discuss that scenario, although I'd imagine the information provided in this document will still be useful.
Core dumps
A core dump is
- what produces them
- - disk dump / net dump / kdump - links to setting these up from rhkbase.
Getting set up
- Get right vmlinuz
- Get correct debugging modules
- Setting up the environment (requirements, modules)
- vmcore incomplete
So, I've received a vmcore from a 32-bit system running kernel 2.6.9-34.0.2. The core was captured via a netdump server. After installing the correct kernel debuginfo package and starting crash, I was greeted with the usual style 'omgpanic' info from crash shown below:
[wmealing@core-i386 work]$ ./crash

crash 4.0-5.0.3
Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...

      KERNEL: /cores/vmlinux
    DUMPFILE: /cores/251703.vmcore
        CPUS: 2
        DATE: Thu Dec 18 00:23:40 2008
      UPTIME: 459 days, 21:16:32
LOAD AVERAGE: 1.05, 0.77, 0.68
       TASKS: 280
    NODENAME: paranioa
     RELEASE: 2.6.9-34.0.2.ELsmp
     VERSION: #1 SMP Fri Jun 30 10:33:58 EDT 2006
     MACHINE: i686 (3801 Mhz)
      MEMORY: 4.4 GB
       PANIC: "Oops: 0002 [#1]" (check log for details)
         PID: 1629
     COMMAND: "kjournald"
        TASK: c37fedb0 [THREAD_INFO: c37a4000]
         CPU: 0
       STATE: TASK_UNINTERRUPTIBLE (PANIC)
A smart punter in this readership may have noticed that this doesn't actually tell you much, besides that the actual panic information is in the log. Sometimes the BUG() or the problem appears in this initial screen, but we are not that lucky today. The log command in crash is pretty much the same as the dmesg output the system would have had if it had continued to run instead of falling over. Because this can be very long, I'm going to tail the last 75 lines for brevity and your sanity. You should probably look through most of it, as previous oopses or problems can appear in this output.
crash> log | tail -n 75
cdrom: open failed.
cdrom: open failed.
cdrom: open failed.
cdrom: open failed.
cdrom: open failed.
cdrom: open failed.
cdrom: open failed.
cdrom: open failed.
cdrom: open failed.
cdrom: open failed.
Unable to handle kernel NULL pointer dereference at virtual address 0000023c
 printing eip:
f8855437
*pde = 30619001
Oops: 0002 [#1]
SMP
Modules linked in: iptable_filter ip_tables parport_pc parport st seos(U) eAC_mini(U) sg cpqci(U) netconsole netdump dm_mirror dm_mod uhci_hcd ehci_hcd hw_random e1000(U) tg3 bond1(U) bonding(U) floppy ext3 jbd cciss sd_mod scsi_mod
CPU: 0
EIP: 0060:[<f8855437>] Tainted: P VLI
EFLAGS: 00010087 (2.6.9-34.0.2.ELsmp)
EIP is at do_cciss_intr+0xdc/0x4b4 [cciss]
eax: 00000000 ebx: 00000004 ecx: 00000004 edx: 00000000
esi: f7400000 edi: 00000000 ebp: c3765800 esp: c03eafbc
ds: 007b es: 007b ss: 0068
Process kjournald (pid: 1629, threadinfo=c03ea000 task=c37fedb0)
Stack: 00000000 00000001 00000001 00000082 f7dd4800 00000001 00000000 c37a4ab8
       c0107472 c37a4a9c c03ea000 c0387900 c37a4000 c01079d2 00000032 c37a4ab8
       f7dd4800
Call Trace:
 [<c0107472>] handle_IRQ_event+0x25/0x4f
 [<c01079d2>] do_IRQ+0x11c/0x1ae
 =======================
 [<c02d304c>] common_interrupt+0x18/0x20
 [<f885510e>] do_cciss_request+0x9e/0x2eb [cciss]
 [<c0142742>] mempool_alloc+0x7b/0x135
 [<c0120291>] autoremove_wake_function+0x0/0x2d
 [<c0142742>] mempool_alloc+0x7b/0x135
 [<c0120291>] autoremove_wake_function+0x0/0x2d
 [<c022a6ce>] __cfq_get_queue+0x91/0xf6
 [<c0120291>] autoremove_wake_function+0x0/0x2d
 [<c022a763>] cfq_get_queue+0x30/0x37
 [<c022aa13>] cfq_set_request+0x33/0x6b
 [<c022a9e0>] cfq_set_request+0x0/0x6b
 [<c0223557>] get_request+0x1de/0x1e8
 [<c012026d>] finish_wait+0x2c/0x50
 [<c0222b9a>] ll_back_merge_fn+0x175/0x1de
 [<c022174b>] elv_merged_request+0x9/0xa
 [<c0224174>] __make_request+0x452/0x46c
 [<c014285c>] mempool_free+0x60/0x64
 [<c022a55a>] cfq_dispatch_requests+0x55/0x80
 [<c022a5a6>] cfq_next_request+0x21/0x35
 [<c0222fa0>] __generic_unplug_device+0x2b/0x2d
 [<c0222fb7>] generic_unplug_device+0x15/0x21
 [<c0222fd2>] blk_backing_dev_unplug+0xf/0x10
 [<c015b3d9>] sync_buffer+0x2c/0x2d
 [<c015b4d7>] __wait_on_buffer+0x67/0x83
 [<c015b384>] bh_wake_function+0x0/0x29
 [<c015e199>] submit_bh+0x15a/0x166
 [<c015b384>] bh_wake_function+0x0/0x29
 [<f8863ac2>] journal_commit_transaction+0x8a7/0xfc1 [jbd]
 [<c0120291>] autoremove_wake_function+0x0/0x2d
 [<c0120291>] autoremove_wake_function+0x0/0x2d
 [<c011dcf7>] find_busiest_group+0xdd/0x2ba
 [<c011e115>] load_balance_newidle+0x56/0x82
 [<c02d05c1>] schedule+0x83d/0x8d3
 [<c02d05f1>] schedule+0x86d/0x8d3
 [<c0129d4a>] del_timer_sync+0x7a/0x9c
 [<f8865e8d>] kjournald+0xc7/0x219 [jbd]
 [<c0120291>] autoremove_wake_function+0x0/0x2d
 [<c0120291>] autoremove_wake_function+0x0/0x2d
 [<c011d549>] schedule_tail+0x31/0xa7
 [<f8865dc0>] commit_timeout+0x0/0x5 [jbd]
 [<f8865dc6>] kjournald+0x0/0x219 [jbd]
 [<c01041f5>] kernel_thread_helper+0x5/0xb
Code: 95 30 03 00 00 74 38 8b 86 3c 02 00 00 39 f0 74 2e 39 b5 30 03 00 00 75 06 89 85 30 03 00 00 8b 86 38 02 00 00 8b 96 3c 02 00 00 <89> 90 3c 02 00 00 8b 96 3c 02 00 00 89 82 38 02 00 00 eb 06 c7
What we are looking at is what people call a "panic message". I'll try to dissect it below so that you have an understanding and a common lexicon when discussing problems with fellow hackers. The line numbers do not appear in the original panic; they are shown inline to make my explanation easier to follow.
1: Unable to handle kernel NULL pointer dereference at virtual address 0000023c
    printing eip:
2: f8855437
3: *pde = 30619001
4: Oops: 0002 [#1]
1) The problem as understood by the kernel is: Unable to handle kernel NULL pointer dereference at virtual address 0000023c.
2) The EIP[1], or "extended instruction pointer", is the location in the loaded code in memory that the CPU was executing. Further down the panic message, the kernel resolves this address into a function and instruction offset (do_cciss_intr+0xdc/0x4b4). I guess the raw address is printed at this point in case it cannot be resolved.
3) *pde = 30619001 is the contents of the page directory entry (PDE) covering the faulting address.
4) Oops: 0002 [#1]: 0002 is the page fault error code, and [#1] is the oops counter, meaning this is the first oops since boot rather than one of several.
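The numbers in lines 2 and 4 can be unpacked a little further. The following is a minimal Python sketch, not something crash provides: `decode_oops_code` and `parse_symbol` are hypothetical helper names, and the bit meanings assumed are the low three bits of the 32-bit x86 page fault error code (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode).

```python
def decode_oops_code(code):
    """Interpret the low three bits of an x86 page fault error code."""
    return {
        "cause": "protection violation" if code & 0x1 else "page not present",
        "access": "write" if code & 0x2 else "read",
        "mode": "user" if code & 0x4 else "kernel",
    }


def parse_symbol(text):
    """Split 'name+0xOFFSET/0xSIZE' into (name, offset, size)."""
    name, rest = text.split("+", 1)
    offset, size = rest.split("/", 1)
    return name, int(offset, 16), int(size, 16)


# Values copied from the panic message above.
info = decode_oops_code(int("0002", 16))
name, offset, size = parse_symbol("do_cciss_intr+0xdc/0x4b4")
eip = 0xF8855437

print(info)
# Start address of the function = EIP minus the offset into the function.
print(name, hex(eip - offset), hex(size))
```

Read this way, 0002 describes a write from kernel mode to a page that is not present, which fits a store through a NULL-based pointer at offset 0x23c. The symbol string says the faulting instruction sits 0xdc bytes into do_cciss_intr, a function 0x4b4 bytes long, so the function would start at 0xf885535b.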
Module Information
Modules linked in: iptable_filter ip_tables parport_pc parport st seos(U) eAC_mini(U) sg cpqci(U) netconsole netdump dm_mirror dm_mod uhci_hcd ehci_hcd hw_random e1000(U) tg3 bond1(U) bonding(U) floppy ext3 jbd cciss sd_mod scsi_m
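When the module list is long, it can help to pull out just the entries that carry a bracketed flag, such as the (U) entries above. The following is a minimal Python sketch; `flagged_modules` is a hypothetical helper, not part of crash, and the input line is copied from the log output earlier in this document.

```python
import re

# "Modules linked in" line as it appears in the log output.
MODULES_LINE = (
    "Modules linked in: iptable_filter ip_tables parport_pc parport st "
    "seos(U) eAC_mini(U) sg cpqci(U) netconsole netdump dm_mirror dm_mod "
    "uhci_hcd ehci_hcd hw_random e1000(U) tg3 bond1(U) bonding(U) floppy "
    "ext3 jbd cciss sd_mod scsi_mod"
)


def flagged_modules(line):
    """Return {module_name: flags} for entries written as name(FLAGS)."""
    _, _, names = line.partition(":")
    flagged = {}
    for entry in names.split():
        match = re.fullmatch(r"([^()]+)\(([^()]+)\)", entry)
        if match:
            flagged[match.group(1)] = match.group(2)
    return flagged


flagged = flagged_modules(MODULES_LINE)
for module, flags in flagged.items():
    print(module, flags)
```

For this vmcore the sketch picks out seos, eAC_mini, cpqci, e1000, bond1 and bonding, all flagged U. Note that cciss, the module named in the EIP line, is not among them.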
The "Modules linked in" line lists the modules that were loaded at the time of the kernel panic. It is unlikely that your module list is the same as this one. You'll see a number of modules with a character in brackets beside them. These flags mark any strange or non-standard conditions involved in loading the module[2]. In this case, the modules listed with a U in brackets did not ship with the RHEL kernel. All modules shipped with the kernel are "signed" and will not have a U in the list. Knowing which modules are not standard kernel code lets you hunt the bug down in the right section of source code.
References:
1: Program counter: http://en.wikipedia.org/wiki/Program_counter
2: Tainted module flags: http://www.mjmwired.net/kernel/Documentation/oops-tracing.txt
Further reading:
Crash usage whitepaper: http://people.redhat.com/anderson/crash_whitepaper/
Overview of using kernel crash: http://www.redhatmagazine.com/2007/08/15/a-quick-overview-of-linux-kernel-crash-dump-analysis/