Re: kernel oops with blcr-0.7.3


From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Nov 25 2008 - 11:10:02 PST

Next message: Paul H. Hargrove: "Re: kernel oops with blcr-0.7.3"

Previous message: Jerry Mersel: "Re: callback function for parallel apps."
In reply to: Paul H. Hargrove: "Re: kernel oops with blcr-0.7.3"
Next in thread: Paul H. Hargrove: "Re: kernel oops with blcr-0.7.3"
Reply: Paul H. Hargrove: "Re: kernel oops with blcr-0.7.3"

Dominique,
  I am now testing for a Beta release of 0.8.0 and I have been able to 
reproduce your Oops with the kernel version you indicated, and also with 
a vanilla 2.6.26 from kernel.org.  The bug relates to the restore of the 
FPU state, and since none of the tests in our suite perform any floating-point 
math, the bug went unnoticed in our testing of 0.7.3 against the 2.6.26 
kernel.
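
As a rough cross-check of the FPU diagnosis against the oops quoted further down, the short Python sketch below decodes the page-fault error code and works out how many bytes the faulting copy still had pending. The register values and error code are copied from the quoted dmesg output. The helper names (decode_page_fault, bytes_left_to_copy) are hypothetical and not part of BLCR or the kernel, and reading the result as "an FXSAVE-sized copy to a NULL destination" is an interpretation, not something the log states.

```python
# Values copied from the oops quoted in this thread. The interpretation
# below is an illustrative sketch, not BLCR or kernel code.
OOPS_ERROR_CODE = 0x0002  # from "Oops: 0002 [1] SMP"
REGS = {
    "RCX": 0x40,                 # remaining 8-byte words for 'rep movsq'
    "RDX": 0x0,                  # remaining tail bytes for 'rep movsb'
    "RDI": 0x0,                  # copy destination
    "RSI": 0xffff81022e4c7788,   # copy source (on the kernel stack)
}

FXSAVE_AREA_BYTES = 512  # size of the x86 FXSAVE/FXRSTOR memory image


def decode_page_fault(code):
    """Decode the three low bits of an x86 page-fault error code."""
    return {
        "page_present": bool(code & 0x1),  # 0 means a not-present page
        "write": bool(code & 0x2),         # 1 means the access was a write
        "user_mode": bool(code & 0x4),     # 0 means the fault hit in kernel mode
    }


def bytes_left_to_copy(regs):
    """Bytes still pending at the faulting 'rep movsq' (f3 48 a5).

    The instructions just before it in the Code: line set
    rcx = count >> 3 and edx = count & 7, so the pending total is
    rcx * 8 + rdx.
    """
    return regs["RCX"] * 8 + regs["RDX"]


if __name__ == "__main__":
    pending = bytes_left_to_copy(REGS)
    print(decode_page_fault(OOPS_ERROR_CODE))
    print("destination (RDI): %#x" % REGS["RDI"])
    print("bytes pending: %d" % pending)
    print("matches FXSAVE area size:", pending == FXSAVE_AREA_BYTES)
```

Under these assumptions, the fault is a kernel-mode write to a not-present page at address 0, with 512 bytes still to copy. The stack word at the source address (RSI = RSP + 0x18) is 0x37f, which is the x87 default control word, so the source buffer looks like a saved FPU image. This fits the diagnosis above but does not by itself identify the faulty code path.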

  I have a possible fix in testing right now, and it is one of the 
things on my list to finish before the 0.8.0 beta release.

-Paul

Paul H. Hargrove wrote:
> Dominique,
>
>  I must say that we have not seen an error like this before.  However, 
> we've not done testing ourselves on a kernel more recent than a 
> vanilla 2.6.26.  It seems likely that BLCR is not prepared for 
> something in the 2.6.26.6-49.fc8 kernel.
>  I hope to begin beta testing on BLCR 0.8.0 in late November, and we 
> are busy with other tasks between now and then.  So, I am afraid there 
> is little chance that your problem will be resolved prior to that.
>  However, I will try to see that we include a kernel like yours in our 
> testing of 0.8.0 to ensure we identify the cause of the problem and 
> get it solved as quickly as we can.
>
> -Paul
>
>
> Dominique DELANDE wrote:
>>         Dear Sir or Madam,
>>
>> I am using blcr-0.7.3 to checkpoint MPI jobs, compiled either
>> with MVAPICH2-1.2rc2 or OpenMPI-1.3b1. In both cases,
>> checkpointing is apparently working, but restart ends
>> with a kernel oops on each node running an MPI process (see below).
>> The kernel oops is similar when using OpenMPI or MVAPICH2, I thus 
>> suspect that the problem comes from blcr.
>> I am running Linux Fedora 8 with a 2.6.26 Fedora x86_64 kernel.
>> Each node has two Dual Core AMD Opteron Processors 285 with 8 GBytes
>> of memory, Infiniband DDR, and the OFED-1.4-rc3 stack.
>>
>> Any help will be appreciated.
>>
>> Best regards
>>
>> Dominique Delande
>>
>> *******************************
>>
>> Output of 'uname -a':
>> Linux node11.cluster.local 2.6.26.6-49.fc8 #1 SMP Fri Oct 17 15:33:32 
>> EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
>>
>> Example:
>>
>> $ cr_restart context.20722
>>  node11 kernel: Oops: 0002 [1] SMP
>>  node11 kernel: Code: 48 8b 11 31 c0 c3 48 83 e9 07 eb 00 31 d2 48 c7 
>> c0 f2 ff ff ff c3 90 90 90 90 90 90 90 90 90 90 48 89 f8 89 d1 c1 e9 
>> 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 66 66 66 2e 0f 1f 84 00 00 
>> 00 00 00
>>  node11 kernel: CR2: 0000000000000000
>>
>> dmesg output is:
>>
>> BUG: unable to handle kernel NULL pointer dereference at 
>> 0000000000000000
>> IP: [<ffffffff81140d1b>] memcpy_c+0xb/0x20
>> PGD 22e4ed067 PUD 22dccf067 PMD 0
>> Oops: 0002 [1] SMP
>> CPU 1
>> Modules linked in: blcr(U) blcr_vmadump(U) blcr_imports(U) nfs lockd 
>> nfs_acl sunrpc rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) 
>> ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) 
>> mlx4_core(U) dm_mirror dm_log dm_multipath dm_mod cfi_cmdset_0002 
>> cfi_util tg3 jedec_probe cfi_probe ib_mthca(U) gen_probe ib_mad(U) 
>> ck804xrom ib_core(U) mtd k8temp i2c_nforce2 i2c_core pcspkr chipreg 
>> hwmon shpchp map_funcs sr_mod cdrom sg pata_amd ata_generic pata_acpi 
>> sata_nv libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd 
>> ehci_hcd [last unloaded: scsi_wait_scan]
>> Pid: 20789, comm: cr_restart Not tainted 2.6.26.6-49.fc8 #1
>> RIP: 0010:[<ffffffff81140d1b>]  [<ffffffff81140d1b>] memcpy_c+0xb/0x20
>> RSP: 0018:ffff81022e4c7770  EFLAGS: 00010246
>> RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000040
>> RDX: 0000000000000000 RSI: ffff81022e4c7788 RDI: 0000000000000000
>> RBP: ffff81022e4c7b58 R08: 0000000000000000 R09: 0000000000000001
>> R10: 0000000000000000 R11: 0000000000000001 R12: ffff81022e4c7788
>> R13: ffff810128c2a240 R14: 0000000000000000 R15: ffffffffffffffff
>> FS:  00007f4cabcdc6f0(0000) GS:ffff81022fa25300(0000) 
>> knlGS:0000000055788b90
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 0000000000000000 CR3: 000000012e96b000 CR4: 00000000000006e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Process cr_restart (pid: 20789, threadinfo ffff81022e4c6000, task 
>> ffff81022e4596a0)
>> Stack:  ffffffffa0334524 0000000000000000 0000000000000000 
>> 000000000000037f
>>  0000000000000000 0000000000000000 0000ffff00009fc0 0000000000000000
>>  0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> Call Trace:
>>  [<ffffffffa0334524>] ? :blcr_vmadump:vmadump_restore_cpu+0x197/0x5b4
>>  [<ffffffff810aa7a0>] ? do_sync_read+0xe2/0x126
>>  [<ffffffff810492dc>] ? autoremove_wake_function+0x0/0x38
>>  [<ffffffff8100c12a>] ? system_call_after_swapgs+0x8a/0x8f
>>  [<ffffffffa0332048>] ? :blcr_vmadump:read_user+0x48/0x72
>>  [<ffffffffa0333846>] :blcr_vmadump:vmadump_thaw_proc+0x169/0xb6b
>>  [<ffffffff8102fb8d>] ? hrtick_set+0xe0/0xe9
>>  [<ffffffff8129748d>] ? thread_return+0x7e/0xab
>>  [<ffffffffa03455dc>] :blcr:cr_thaw_threads+0x18d/0x1f1
>>  [<ffffffff810492dc>] ? autoremove_wake_function+0x0/0x38
>>  [<ffffffffa0343413>] :blcr:cr_rstrt_child+0xa1e/0x1ca2
>>  [<ffffffffa033e0b8>] ? :blcr:ctrl_ioctl+0x0/0x1bc
>>  [<ffffffffa033e22f>] :blcr:ctrl_ioctl+0x177/0x1bc
>>  [<ffffffff810e8aae>] proc_reg_unlocked_ioctl+0xa6/0xc6
>>  [<ffffffff810b6aaa>] vfs_ioctl+0x2a/0x77
>>  [<ffffffff810b6d45>] do_vfs_ioctl+0x24e/0x26b
>>  [<ffffffff8100a92a>] ? __switch_to+0xd8/0x376
>>  [<ffffffff810b6db9>] sys_ioctl+0x57/0x7a
>>  [<ffffffff8100c12a>] system_call_after_swapgs+0x8a/0x8f
>>
>>
>> Code: 48 8b 11 31 c0 c3 48 83 e9 07 eb 00 31 d2 48 c7 c0 f2 ff ff ff 
>> c3 90 90 90 90 90 90 90 90 90 90 48 89 f8 89 d1 c1 e9 03 83 e2 07 
>> <f3> 48 a5 89 d1 f3 a4 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
>> RIP  [<ffffffff81140d1b>] memcpy_c+0xb/0x20
>>  RSP <ffff81022e4c7770>
>> CR2: 0000000000000000
>> ---[ end trace 6c22587b3026d5be ]---
>> blcr: rstrt_watchdog: 'cr_restart' (tgid/pid 20788/20789) exited with 
>> signal 9 during restart
>>
>>
>
>


-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
