Re: Error while using cr_checkpoint on ARM

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Aug 08 2008 - 16:28:16 PDT

    Manish,
    
      Thanks for the logs.  Based on their contents, I am pretty certain 
    where the ENOMEM originates, but am not sure about *why*.
      If you could please apply the attached 1-line patch and recompile and 
    generate new logs, the added output should (I hope) explain what the 
    failed memory allocation is like.  From there I should be able to work 
    toward a solution.
    
    -Paul
    
    Manish Dwivedi wrote:
    > Hi Paul,
    >
    > We have reloaded the modules with the cr_ktrace_mask variable set and
    > got the logs as follows:
    > ==========================================================
    >
    [snip]
    >
    > ================================================================
    >
    > Thanks a lot for your help.
    >
    > Regards,
    > Manish
    > On Fri, Aug 8, 2008 at 1:20 AM, Paul H. Hargrove
    > <PHHargrove_at_lbl_dot_gov> wrote:
    >
    >     Manish,
    >
    >      There is no stated/known minimum memory requirement for BLCR, but
    >     it is still possible that we are too aggressive with memory.  I
    >     run an emulated ARM environment in QEMU and have not yet tried
    >     running with so little memory (though I plan to try today).
    >      The default level of tracing detail didn't produce much output
    >     for your case because the failure appears to come relatively
    >     early.  By requesting more detailed tracing, we should be able to
    >     narrow down when in BLCR we've failed to allocate memory.
    >      Please reload the kernel modules with "make insmod
    >     cr_ktrace_mask=0xffffffff", which will enable the most detailed
    >     tracing.  Then rerun your failed checkpoint and, again, send the
    >     output.  Hopefully this time there will be enough for me to move
    >     forward on diagnosing your problem.
    >
    >     Thanks for your patience,
    >     Paul
    >
    >     Manish Dwivedi wrote:
    >
    >         Hi Paul,
    >
    >         Thanks for the information. We tried compiling it with the
    >         --enable-debug option today, but we didn't get much information
    >         in the log (the log file is attached to this e-mail).
    >
    >         By the way, we have 64 MB of RAM in the system. Is there a
    >         limitation or minimum RAM requirement for BLCR?
    >
    >         Regards,
    >         Manish
    >
    >         PS: We followed exactly the same process for x86 and it is
    >         working fine for us.
    >
    >
    >         On Wed, Aug 6, 2008 at 10:58 PM, Paul H. Hargrove
    >         <PHHargrove_at_lbl_dot_gov> wrote:
    >
    >            Manish,
    >
    >             I am sorry to hear that you are having problems.  From the
    >            information you provide below, it is hard to say what the
    >            problem is, other than to guess that your ARM system is low
    >            on memory.
    >             I am aware of a kernel-side memory leak in blcr-0.7.2, which
    >            should be fixed in the 0.7.3 release expected later this week
    >            or early next week.  So, I'd like to know if the failure you
    >            describe happens on the very first use of cr_checkpoint, or
    >            does it happen after BLCR has been used several times (for
    >            instance by running "make check")?  If it works for a while
    >            and then begins to fail, I'd suspect the known memory leak
    >            and suggest that you wait for blcr-0.7.3.
    >             If you are seeing failure on the very first attempt to use
    >            blcr, then I suggest that you rebuild blcr with debugging
    >            enabled and send me the information dumped to the system logs
    >            (run dmesg or see /var/log/messages to find the logs).  To do
    >            this, you'll need to start at the beginning of the
    >            configure/make/install process and pass the "--enable-debug"
    >            option to configure, and then proceed with the rest of the
    >            build/install process.  Be sure to "make insmod" (or manually
    >            rmmod the old modules and insmod/modprobe the new ones);
    >            otherwise the kernel modules from your previous (non-debug)
    >            build may still be running.  With the new kernel modules
    >            loaded, you should retry your failing command and then look
    >            for messages with "blcr: " in them in the system logs.
    >
    >             I also should tell you that there is an ARM-specific mailing
    >            list (very low volume) for BLCR that may help you reach other
    >            ARM users.  You can find list info and subscribe (required to
    >            post) at
    >            https://hpcrdm.lbl.gov/mailman/listinfo/blcr-arm
    >
    >            -Paul
    >
    >
    >            Manish Dwivedi wrote:
    >
    >                Hi All,
    >
    >                I am trying to use BLCR for ARM. But when I try to use
    >                cr_checkpoint with a hello.c program, it gives me the
    >                error below:
    >
    >                cr_checkpoint --term <pid> (command run)
    >                Checkpoint failed: Cannot allocate memory
    >
    >                I have compiled hello.c on the same kernel as mentioned in
    >                the release notes. I am using blcr-0.7.2.tar.gz for this.
    >
    >                Could anyone help me resolve this issue so that I can
    >                test it? It works fine for me on an x86 machine.
    >
    >                Regards,
    >                Manish
    >
    >
    >
    >            -- 
    >            Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >            Future Technologies Group
    >            HPC Research Department                   Tel: +1-510-495-2352
    >            Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    >
    >
    >
    >
    >     -- 
    >     Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >     Future Technologies Group
    >     HPC Research Department                   Tel: +1-510-495-2352
    >     Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    >
    >
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
    
    --- cr_module/cr_dump_self.c	26 Jun 2008 00:05:42 -0000	1.202.2.5
    +++ cr_module/cr_dump_self.c	8 Aug 2008 23:25:09 -0000
    @@ -827,6 +827,7 @@
                 result = -ENOMEM;
                 groups = vmalloc(sizeof_groups);
                 if (groups == NULL) {
    +CR_ERR("vmalloc(%ld) failed w/ ngroups=%ld NGROUPS_MAX=%ld", (long)sizeof_groups, (long)cf_creds.ngroups, (long)NGROUPS_MAX);
                     goto out_early_mutex;
                 }
     
    

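For readers who want to see the shape of the diagnostic outside the kernel, here is a minimal user-space sketch of what the one-line patch above does: when the allocation of the supplementary-groups array fails, it reports the requested size, the group count, and NGROUPS_MAX. Everything named here (alloc_groups, alloc_fn, always_fail_alloc) is hypothetical and not part of BLCR. An ordinary allocator such as malloc() stands in for vmalloc(), the message is written to a caller-supplied buffer rather than through CR_ERR, and the fallback value for NGROUPS_MAX is an assumption used only if <limits.h> does not define it.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>     /* NGROUPS_MAX on most POSIX systems */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>  /* gid_t */

#ifndef NGROUPS_MAX
#define NGROUPS_MAX 65536  /* assumed fallback, only for this sketch */
#endif

/* Allocator signature, so a failing allocator can be injected. */
typedef void *(*alloc_fn)(size_t);

/* Hypothetical stub that simulates an allocation failure. */
void *always_fail_alloc(size_t n)
{
    (void)n;
    return NULL;
}

/*
 * Hypothetical stand-in for the patched code path.
 * On success: returns 0, stores the array in *out, leaves msg empty.
 * On failure: returns -ENOMEM and writes a diagnostic that mirrors
 * the fields printed by the patch (size, ngroups, NGROUPS_MAX).
 */
int alloc_groups(alloc_fn alloc, long ngroups, gid_t **out,
                 char *msg, size_t msglen)
{
    size_t sizeof_groups;
    gid_t *groups;

    assert(alloc != NULL && out != NULL && msg != NULL && msglen > 0);
    assert(ngroups >= 0);

    msg[0] = '\0';
    *out = NULL;

    sizeof_groups = (size_t)ngroups * sizeof(gid_t);
    groups = alloc(sizeof_groups);
    if (groups == NULL) {
        snprintf(msg, msglen,
                 "alloc(%ld) failed w/ ngroups=%ld NGROUPS_MAX=%ld",
                 (long)sizeof_groups, ngroups, (long)NGROUPS_MAX);
        return -ENOMEM;
    }

    *out = groups;
    return 0;
}
```

Calling alloc_groups(malloc, 4, &groups, msg, sizeof msg) exercises the success path, while passing always_fail_alloc instead exercises the failure path and fills msg with the diagnostic. Logging the size and the count together is what lets a reader tell a genuinely large request apart from an unexpected or degenerate one.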