From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Aug 08 2008 - 16:28:16 PDT
Manish,

Thanks for the logs. Based on their contents, I am pretty certain where
the ENOMEM originates, but am not sure about *why*. If you could please
apply the attached 1-line patch and recompile and generate new logs, the
added output should (I hope) explain what the failed memory allocation
is like. From there I should be able to work toward a solution.

-Paul

Manish Dwivedi wrote:
> Hi Paul,
>
> We have reloaded the modules with the cr_ktrace_mask variable and got
> the logs as follows:
> ==========================================================
> [snip]
>
> ================================================================
>
> Thanks a lot for your help.
>
> Regards,
> Manish
>
> On Fri, Aug 8, 2008 at 1:20 AM, Paul H. Hargrove
> <PHHargrove_at_lbl_dot_gov> wrote:
>
>     Manish,
>
>     There is no stated/known minimum memory requirement for BLCR, but
>     it is still possible that we are too aggressive with memory. I
>     run an emulated ARM environment in QEMU and have not yet tried
>     running with so little memory (though I plan to try today).
>     The default level of tracing detail didn't produce much output
>     for your case because the failure appears to come relatively
>     early. By requesting more detailed tracing, we should be able to
>     narrow down when in BLCR we've failed to allocate memory.
>     Please reload the kernel modules with "make insmod
>     cr_ktrace_mask=0xffffffff", which will enable the most detailed
>     tracing. Then rerun your failed checkpoint and, again, send the
>     output. Hopefully this time there will be enough for me to move
>     forward on diagnosing your problem.
>
>     Thanks for your patience,
>     Paul
>
>     Manish Dwivedi wrote:
>
>         Hi Paul,
>
>         Thanks for the information. We tried compiling it with the
>         enable-debug option today. But we didn't get much information
>         in the log (log file is attached in the e-mail).
>
>         By the way, we have 64 MB of RAM in the system. Is there a
>         limit or minimum RAM requirement for BLCR?
>
>         Regards,
>         Manish
>
>         PS: We followed exactly the same process for x86 and it is
>         working fine for us.
>
>
>         On Wed, Aug 6, 2008 at 10:58 PM, Paul H. Hargrove
>         <PHHargrove_at_lbl_dot_gov> wrote:
>
>             Manish,
>
>             I am sorry to hear that you are having problems. From the
>             information you provide below, it is hard to say what the
>             problem is, other than to guess that your ARM system is
>             low on memory. I am aware of a kernel-side memory leak in
>             blcr-0.7.2, which should be fixed in the 0.7.3 release
>             expected later this week or early next week. So, I'd like
>             to know if the failure you describe happens on the very
>             first use of cr_checkpoint, or does it happen after BLCR
>             has been used several times (for instance by running
>             "make check")? If it works for a while and then begins to
>             fail, I'd suspect the known memory leak and suggest that
>             you wait for blcr-0.7.3.
>             If you are seeing failure on the very first attempt to use
>             blcr, then I suggest that you rebuild blcr with debugging
>             enabled and send me the information dumped to the system
>             logs (run dmesg or see /var/log/messages to find the
>             logs). To do this, you'll need to start at the beginning
>             of the configure/make/install process and pass the
>             "--enable-debug" option to configure, and then proceed
>             with the rest of the build/install process. Be sure to
>             "make insmod" (or manually rmmod the old modules and
>             insmod/modprobe the new ones); otherwise the kernel
>             modules from your previous (non-debug) build may still be
>             running. With the new kernel modules loaded, you should
>             retry your failing command and then look for messages
>             with "blcr: " in them in the system logs.
>
>             I also should tell you that there is an ARM-specific
>             mailing list (very low volume) for BLCR that may help you
>             reach other ARM users. You can find list info and
>             subscribe (required to post) at
>             https://hpcrdm.lbl.gov/mailman/listinfo/blcr-arm
>
>             -Paul
>
>
>             Manish Dwivedi wrote:
>
>                 Hi All,
>
>                 I am trying to use BLCR for ARM. But when I try to use
>                 cr_checkpoint with a hello.c program, it gives me the
>                 error below:
>
>                 cr_checkpoint --term <pid> (command run)
>                 Checkpoint failed: Cannot allocate memory
>
>                 I have compiled hello.c in the same kernel as mentioned
>                 in the release notes; I am using blcr-0.7.2.tar.gz for
>                 this.
>
>                 Could anyone help me resolve this issue so that I can
>                 test it? It works fine for me on an x86 machine.
>
>                 Regards,
>                 Manish
>
>
>
>             -- 
>             Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
>             Future Technologies Group
>             HPC Research Department                   Tel: +1-510-495-2352
>             Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>
>
>
>
>     -- 
>     Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
>     Future Technologies Group
>     HPC Research Department                   Tel: +1-510-495-2352
>     Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900

--- cr_module/cr_dump_self.c	26 Jun 2008 00:05:42 -0000	1.202.2.5
+++ cr_module/cr_dump_self.c	8 Aug 2008 23:25:09 -0000
@@ -827,6 +827,7 @@
 	result = -ENOMEM;
 	groups = vmalloc(sizeof_groups);
 	if (groups == NULL) {
+CR_ERR("vmalloc(%ld) failed w/ ngroups=%ld NGROUPS_MAX=%ld", (long)sizeof_groups, (long)cf_creds.ngroups, NGROUPS_MAX);
 	    goto out_early_mutex;
 	}
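
The patch logs the requested allocation size together with ngroups and
NGROUPS_MAX at the point where vmalloc() fails. A rough userspace
analogue of that pattern is sketched below. It is not BLCR code:
alloc_groups() is a hypothetical helper, malloc() stands in for
vmalloc(), and the up-front range check with -EINVAL is an added
assumption about one thing the extra output might reveal (an
out-of-range group count), which the thread does not confirm.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper (not part of BLCR): allocate room for `ngroups`
 * group IDs. Counts outside [0, max_groups] are rejected before the
 * allocator is called, and every failure path reports the values that
 * the patch above prints via CR_ERR. */
static int alloc_groups(long ngroups, long max_groups, gid_t **out)
{
    size_t sizeof_groups;
    gid_t *groups;

    assert(out != NULL);
    *out = NULL;

    if (ngroups < 0 || ngroups > max_groups) {
        fprintf(stderr, "bad ngroups=%ld (max=%ld)\n", ngroups, max_groups);
        return -EINVAL;
    }

    sizeof_groups = (size_t)ngroups * sizeof(gid_t);
    if (sizeof_groups == 0)
        return 0; /* nothing to allocate; *out stays NULL */

    groups = malloc(sizeof_groups); /* userspace stand-in for vmalloc() */
    if (groups == NULL) {
        fprintf(stderr, "malloc(%ld) failed w/ ngroups=%ld max=%ld\n",
                (long)sizeof_groups, ngroups, max_groups);
        return -ENOMEM;
    }

    *out = groups;
    return 0;
}
```

Called with a negative count, or one above max_groups, the helper
returns -EINVAL without allocating. With ngroups = 4 it returns 0 and
hands back a buffer that the caller must free().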