Re: kernel oops with blcr-0.7.3

From: Dominique DELANDE (Dominique.Delande_at_spectro.jussieu.fr)
Date: Tue Jan 06 2009 - 06:43:36 PST

    Paul,
    
    Sorry for the very late reply; I was out of town in December 2008.
    I applied the patch, which solves the problem: cr_restart now works 
    fine. I tested with MVAPICH2-1.2p1. I will soon test OpenMPI too.
    
    Thanks a lot for your work: blcr is great software.
    
    Dominique
    
    Paul H. Hargrove wrote:
    > I have created a bugzilla entry for this issue at  
    > http://upc-bugs.lbl.gov/bugzilla/show_bug.cgi?id=2454
    > 
    > I believe I have fixed this bug for the upcoming 0.8.0 beta release, but 
    > can't be 100% certain my test case fully duplicates the circumstances of 
    > the reported failure.  So, I have also attached to the bugzilla entry a 
    > patch relative to the 0.7.3 sources.  I'd appreciate it if Dominique (or 
    > anyone else who has seen this problem) could report success or failure 
    > when using the patch.
    > 
    > -Paul
    > 
    > Paul H. Hargrove wrote:
    >> Dominique,
    >>  I am now testing for a Beta release of 0.8.0 and I have been able to 
    >> reproduce your Oops with the kernel version you indicated, and also 
    >> with a vanilla 2.6.26 from kernel.org.  The bug relates to the restore 
    >> of the FPU state, and since none of our test suites perform any 
    >> floating point math, the bug went unnoticed in our testing of 0.7.3 
    >> against the 2.6.26 kernel.
    >>
    >>  I have a possible fix in testing right now, and it is one of the 
    >> things on my list to finish before the 0.8.0 beta release.
    >>
    >> -Paul
    >>
    >> Paul H. Hargrove wrote:
    >>> Dominique,
    >>>
    >>>  I must say that we have not seen an error like this before.  
    >>> However, we've not done testing ourselves on a kernel more recent 
    >>> than a vanilla 2.6.26.  It seems likely that BLCR is not prepared for 
    >>> something in the 2.6.26.6-49.fc8 kernel.
    >>>  I hope to begin beta testing on BLCR 0.8.0 in late November, and we 
    >>> are busy with other tasks between now and then.  So, I am afraid 
    >>> there is little chance that your problem will be resolved prior to that.
    >>>  However, I will try to see that we include a kernel like yours in our 
    >>> testing of 0.8.0 to ensure we identify the cause of the problem and 
    >>> get it solved as quickly as we can.
    >>>
    >>> -Paul
    >>>
    >>>
    >>> Dominique DELANDE wrote:
    >>>>         Dear Sir or Madam,
    >>>>
    >>>> I am using blcr-0.7.3 to checkpoint MPI jobs, compiled either
    >>>> with MVAPICH2-1.2rc2 or OpenMPI-1.3b1. In both cases,
    >>>> checkpointing is apparently working, but restart ends
    >>>> with a kernel oops on each node running an MPI process (see below).
    >>>> The kernel oops is similar when using OpenMPI or MVAPICH2, I thus 
    >>>> suspect that the problem comes from blcr.
    >>>> I am running Linux Fedora 8 with a 2.6.26 Fedora x86_64 kernel.
    >>>> Each node has two Dual Core AMD Opteron Processors 285 with 8 GBytes
    >>>> of memory, Infiniband DDR, and the OFED-1.4-rc3 stack.
    >>>>
    >>>> Any help will be appreciated.
    >>>>
    >>>> Best regards
    >>>>
    >>>> Dominique Delande
    >>>>
    >>>> *******************************
    >>>>
    >>>> Output of 'uname -a':
    >>>> Linux node11.cluster.local 2.6.26.6-49.fc8 #1 SMP Fri Oct 17 
    >>>> 15:33:32 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
    >>>>
    >>>> Example:
    >>>>
    >>>> $ cr_restart context.20722
    >>>>  node11 kernel: Oops: 0002 [1] SMP
    >>>>  node11 kernel: Code: 48 8b 11 31 c0 c3 48 83 e9 07 eb 00 31 d2 48 
    >>>> c7 c0 f2 ff ff ff c3 90 90 90 90 90 90 90 90 90 90 48 89 f8 89 d1 c1 
    >>>> e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 66 66 66 2e 0f 1f 84 00 
    >>>> 00 00 00 00
    >>>>  node11 kernel: CR2: 0000000000000000
    >>>>
    >>>> dmesg output is:
    >>>>
    >>>> BUG: unable to handle kernel NULL pointer dereference at 
    >>>> 0000000000000000
    >>>> IP: [<ffffffff81140d1b>] memcpy_c+0xb/0x20
    >>>> PGD 22e4ed067 PUD 22dccf067 PMD 0
    >>>> Oops: 0002 [1] SMP
    >>>> CPU 1
    >>>> Modules linked in: blcr(U) blcr_vmadump(U) blcr_imports(U) nfs lockd 
    >>>> nfs_acl sunrpc rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) 
    >>>> ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) 
    >>>> mlx4_ib(U) mlx4_core(U) dm_mirror dm_log dm_multipath dm_mod 
    >>>> cfi_cmdset_0002 cfi_util tg3 jedec_probe cfi_probe ib_mthca(U) 
    >>>> gen_probe ib_mad(U) ck804xrom ib_core(U) mtd k8temp i2c_nforce2 
    >>>> i2c_core pcspkr chipreg hwmon shpchp map_funcs sr_mod cdrom sg 
    >>>> pata_amd ata_generic pata_acpi sata_nv libata sd_mod scsi_mod ext3 
    >>>> jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: scsi_wait_scan]
    >>>> Pid: 20789, comm: cr_restart Not tainted 2.6.26.6-49.fc8 #1
    >>>> RIP: 0010:[<ffffffff81140d1b>]  [<ffffffff81140d1b>] memcpy_c+0xb/0x20
    >>>> RSP: 0018:ffff81022e4c7770  EFLAGS: 00010246
    >>>> RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000040
    >>>> RDX: 0000000000000000 RSI: ffff81022e4c7788 RDI: 0000000000000000
    >>>> RBP: ffff81022e4c7b58 R08: 0000000000000000 R09: 0000000000000001
    >>>> R10: 0000000000000000 R11: 0000000000000001 R12: ffff81022e4c7788
    >>>> R13: ffff810128c2a240 R14: 0000000000000000 R15: ffffffffffffffff
    >>>> FS:  00007f4cabcdc6f0(0000) GS:ffff81022fa25300(0000) 
    >>>> knlGS:0000000055788b90
    >>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    >>>> CR2: 0000000000000000 CR3: 000000012e96b000 CR4: 00000000000006e0
    >>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    >>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    >>>> Process cr_restart (pid: 20789, threadinfo ffff81022e4c6000, task 
    >>>> ffff81022e4596a0)
    >>>> Stack:  ffffffffa0334524 0000000000000000 0000000000000000 
    >>>> 000000000000037f
    >>>>  0000000000000000 0000000000000000 0000ffff00009fc0 0000000000000000
    >>>>  0000000000000000 0000000000000000 0000000000000000 0000000000000000
    >>>> Call Trace:
    >>>>  [<ffffffffa0334524>] ? :blcr_vmadump:vmadump_restore_cpu+0x197/0x5b4
    >>>>  [<ffffffff810aa7a0>] ? do_sync_read+0xe2/0x126
    >>>>  [<ffffffff810492dc>] ? autoremove_wake_function+0x0/0x38
    >>>>  [<ffffffff8100c12a>] ? system_call_after_swapgs+0x8a/0x8f
    >>>>  [<ffffffffa0332048>] ? :blcr_vmadump:read_user+0x48/0x72
    >>>>  [<ffffffffa0333846>] :blcr_vmadump:vmadump_thaw_proc+0x169/0xb6b
    >>>>  [<ffffffff8102fb8d>] ? hrtick_set+0xe0/0xe9
    >>>>  [<ffffffff8129748d>] ? thread_return+0x7e/0xab
    >>>>  [<ffffffffa03455dc>] :blcr:cr_thaw_threads+0x18d/0x1f1
    >>>>  [<ffffffff810492dc>] ? autoremove_wake_function+0x0/0x38
    >>>>  [<ffffffffa0343413>] :blcr:cr_rstrt_child+0xa1e/0x1ca2
    >>>>  [<ffffffffa033e0b8>] ? :blcr:ctrl_ioctl+0x0/0x1bc
    >>>>  [<ffffffffa033e22f>] :blcr:ctrl_ioctl+0x177/0x1bc
    >>>>  [<ffffffff810e8aae>] proc_reg_unlocked_ioctl+0xa6/0xc6
    >>>>  [<ffffffff810b6aaa>] vfs_ioctl+0x2a/0x77
    >>>>  [<ffffffff810b6d45>] do_vfs_ioctl+0x24e/0x26b
    >>>>  [<ffffffff8100a92a>] ? __switch_to+0xd8/0x376
    >>>>  [<ffffffff810b6db9>] sys_ioctl+0x57/0x7a
    >>>>  [<ffffffff8100c12a>] system_call_after_swapgs+0x8a/0x8f
    >>>>
    >>>>
    >>>> Code: 48 8b 11 31 c0 c3 48 83 e9 07 eb 00 31 d2 48 c7 c0 f2 ff ff ff 
    >>>> c3 90 90 90 90 90 90 90 90 90 90 48 89 f8 89 d1 c1 e9 03 83 e2 07 
    >>>> <f3> 48 a5 89 d1 f3 a4 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
    >>>> RIP  [<ffffffff81140d1b>] memcpy_c+0xb/0x20
    >>>>  RSP <ffff81022e4c7770>
    >>>> CR2: 0000000000000000
    >>>> ---[ end trace 6c22587b3026d5be ]---
    >>>> blcr: rstrt_watchdog: 'cr_restart' (tgid/pid 20788/20789) exited 
    >>>> with signal 9 during restart
    >>>>
    >>>>
    >>>
    >>>
    >>
    >>
    > 
    > 
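
For anyone reading the quoted oops, here is a minimal sketch of how its header and register dump can be decoded. It assumes the standard x86 page-fault error-code layout (bit 0 = protection violation, bit 1 = write, bit 2 = user mode) and that the faulting instruction `f3 48 a5` in the Code: line is `rep movsq`. The helper names decode_error_code and parse_registers are made up for illustration; they are not part of BLCR or of any kernel tool.

```python
import re

# Excerpts copied from the dmesg output quoted in this thread.
OOPS_LINE = "Oops: 0002 [1] SMP"
REG_LINES = [
    "RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000040",
    "RDX: 0000000000000000 RSI: ffff81022e4c7788 RDI: 0000000000000000",
]


def decode_error_code(oops_line):
    """Decode the low three bits of an x86 page-fault error code."""
    code = int(re.search(r"Oops: ([0-9a-fA-F]{4})", oops_line).group(1), 16)
    return {
        "cause": "protection violation" if code & 1 else "page not present",
        "access": "write" if code & 2 else "read",
        "mode": "user" if code & 4 else "kernel",
    }


def parse_registers(lines):
    """Collect 'NAME: 16-hex-digit value' pairs from register dump lines."""
    regs = {}
    for line in lines:
        for name, value in re.findall(r"\b([A-Z0-9]+): ([0-9a-f]{16})", line):
            regs[name] = int(value, 16)
    return regs


fault = decode_error_code(OOPS_LINE)
regs = parse_registers(REG_LINES)

# 'rep movsq' copies RCX quadwords from [RSI] to [RDI]; in this memcpy
# variant RDX holds the 0-7 byte tail that a later 'rep movsb' would copy.
pending_bytes = regs["RCX"] * 8 + regs["RDX"]

print(fault)
print("destination:", hex(regs["RDI"]), "bytes pending:", pending_bytes)
```

Under those assumptions the oops is a kernel-mode write to a non-present page, with a NULL destination (RDI) and 512 bytes still to copy. 512 bytes is the size of an x86 FXSAVE area, which would be consistent with Paul's diagnosis that the bug is in the restore of the FPU state.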
    
    
    -- 
        Dominique Delande ([email protected])
        Laboratoire Kastler-Brossel - Case 74 - Universite P. et M. Curie
        4, place Jussieu, F-75252 Paris Cedex 05, FRANCE
        Phone : 33 (0)1 44 27 27 97 - Fax : 33 (0)1 44 27 38 45
        Acces : Pyramide de la Scolarite Paris VI - 1er etage - Bureau 214
    
