From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Oct 29 2008 - 12:31:10 PST
Dominique,

I must say that we have not seen an error like this before. However,
we've not done testing ourselves on a kernel more recent than a vanilla
2.6.26. It seems likely that BLCR is not prepared for something in the
2.6.26.6-49.fc8 kernel.

I hope to begin beta testing of BLCR 0.8.0 in late November, and we are
busy with other tasks between now and then. So, I am afraid there is
little chance that your problem will be resolved prior to that. However,
I will try to see that we include a kernel like yours in our testing of
0.8.0 to ensure we identify the cause of the problem and get it solved
as quickly as we can.

-Paul

Dominique DELANDE wrote:
>
> Dear Sir or Madam,
>
> I am using blcr-0.7.3 to checkpoint MPI jobs, compiled either
> with MVAPICH2-1.2rc2 or OpenMPI-1.3b1. In both cases,
> checkpointing is apparently working, but restart ends
> with a kernel oops on each node running an MPI process (see below).
> The kernel oops is similar when using OpenMPI or MVAPICH2, I thus
> suspect that the problem comes from blcr.
> I am running Linux Fedora 8 with a 2.6.26 Fedora x86_64 kernel.
> Each node has two Dual Core AMD Opteron Processors 285 with 8 GBytes
> of memory, Infiniband DDR, and the OFED-1.4-rc3 stack.
>
> Any help will be appreciated.
>
> Best regards
>
> Dominique Delande
>
> *******************************
>
> Output of 'uname -a':
> Linux node11.cluster.local 2.6.26.6-49.fc8 #1 SMP Fri Oct 17 15:33:32
> EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
>
> Example:
>
> $ cr_restart context.20722
> node11 kernel: Oops: 0002 [1] SMP
> node11 kernel: Code: 48 8b 11 31 c0 c3 48 83 e9 07 eb 00 31 d2 48 c7
> c0 f2 ff ff ff c3 90 90 90 90 90 90 90 90 90 90 48 89 f8 89 d1 c1 e9
> 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 66 66 66 2e 0f 1f 84 00 00 00
> 00 00
> node11 kernel: CR2: 0000000000000000
>
> dmesg output is:
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> IP: [<ffffffff81140d1b>] memcpy_c+0xb/0x20
> PGD 22e4ed067 PUD 22dccf067 PMD 0
> Oops: 0002 [1] SMP
> CPU 1
> Modules linked in: blcr(U) blcr_vmadump(U) blcr_imports(U) nfs lockd
> nfs_acl sunrpc rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U)
> ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U)
> mlx4_core(U) dm_mirror dm_log dm_multipath dm_mod cfi_cmdset_0002
> cfi_util tg3 jedec_probe cfi_probe ib_mthca(U) gen_probe ib_mad(U)
> ck804xrom ib_core(U) mtd k8temp i2c_nforce2 i2c_core pcspkr chipreg
> hwmon shpchp map_funcs sr_mod cdrom sg pata_amd ata_generic pata_acpi
> sata_nv libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd
> ehci_hcd [last unloaded: scsi_wait_scan]
> Pid: 20789, comm: cr_restart Not tainted 2.6.26.6-49.fc8 #1
> RIP: 0010:[<ffffffff81140d1b>] [<ffffffff81140d1b>] memcpy_c+0xb/0x20
> RSP: 0018:ffff81022e4c7770 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000040
> RDX: 0000000000000000 RSI: ffff81022e4c7788 RDI: 0000000000000000
> RBP: ffff81022e4c7b58 R08: 0000000000000000 R09: 0000000000000001
> R10: 0000000000000000 R11: 0000000000000001 R12: ffff81022e4c7788
> R13: ffff810128c2a240 R14: 0000000000000000 R15: ffffffffffffffff
> FS: 00007f4cabcdc6f0(0000) GS:ffff81022fa25300(0000)
> knlGS:0000000055788b90
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 000000012e96b000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process cr_restart (pid: 20789, threadinfo ffff81022e4c6000, task
> ffff81022e4596a0)
> Stack: ffffffffa0334524 0000000000000000 0000000000000000
> 000000000000037f
> 0000000000000000 0000000000000000 0000ffff00009fc0 0000000000000000
> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> Call Trace:
> [<ffffffffa0334524>] ? :blcr_vmadump:vmadump_restore_cpu+0x197/0x5b4
> [<ffffffff810aa7a0>] ? do_sync_read+0xe2/0x126
> [<ffffffff810492dc>] ? autoremove_wake_function+0x0/0x38
> [<ffffffff8100c12a>] ? system_call_after_swapgs+0x8a/0x8f
> [<ffffffffa0332048>] ? :blcr_vmadump:read_user+0x48/0x72
> [<ffffffffa0333846>] :blcr_vmadump:vmadump_thaw_proc+0x169/0xb6b
> [<ffffffff8102fb8d>] ? hrtick_set+0xe0/0xe9
> [<ffffffff8129748d>] ? thread_return+0x7e/0xab
> [<ffffffffa03455dc>] :blcr:cr_thaw_threads+0x18d/0x1f1
> [<ffffffff810492dc>] ? autoremove_wake_function+0x0/0x38
> [<ffffffffa0343413>] :blcr:cr_rstrt_child+0xa1e/0x1ca2
> [<ffffffffa033e0b8>] ? :blcr:ctrl_ioctl+0x0/0x1bc
> [<ffffffffa033e22f>] :blcr:ctrl_ioctl+0x177/0x1bc
> [<ffffffff810e8aae>] proc_reg_unlocked_ioctl+0xa6/0xc6
> [<ffffffff810b6aaa>] vfs_ioctl+0x2a/0x77
> [<ffffffff810b6d45>] do_vfs_ioctl+0x24e/0x26b
> [<ffffffff8100a92a>] ? __switch_to+0xd8/0x376
> [<ffffffff810b6db9>] sys_ioctl+0x57/0x7a
> [<ffffffff8100c12a>] system_call_after_swapgs+0x8a/0x8f
>
>
> Code: 48 8b 11 31 c0 c3 48 83 e9 07 eb 00 31 d2 48 c7 c0 f2 ff ff ff
> c3 90 90 90 90 90 90 90 90 90 90 48 89 f8 89 d1 c1 e9 03 83 e2 07 <f3>
> 48 a5 89 d1 f3 a4 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
> RIP [<ffffffff81140d1b>] memcpy_c+0xb/0x20
> RSP <ffff81022e4c7770>
> CR2: 0000000000000000
> ---[ end trace 6c22587b3026d5be ]---
> blcr: rstrt_watchdog: 'cr_restart' (tgid/pid 20788/20789) exited with
> signal 9 during restart
>
>

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900