From: Manish Dwivedi (mdwivedi_at_gmail_dot_com)
Date: Thu Aug 07 2008 - 22:46:27 PDT
Hi Paul,

We have reloaded the modules with the cr_ktrace_mask variable and got the following logs:

==========================================================
blcr: Berkeley Lab Checkpoint/Restart (BLCR) module version 0.7.2.
blcr: Tracing enabled (trace_mask=0xffffffff)
blcr: Supports kernel interface version 0.9.0.
blcr: Supports context file format version 7.
blcr: http://ftg.lbl.gov/checkpoint
cr_proc_init <cr_proc.c:47>, pid 704: entering
eth0: spurious interrupt (mask = 0xb3)
ctrl_open <cr_fops.c:231>, pid 712: entering
ctrl_ioctl <cr_fops.c:133>, pid 712: entering op=4004a130 arg=0x9
ctrl_ioctl <cr_fops.c:133>, pid 712: entering op=4004a107 arg=0x4
cr_phase2_register <cr_sync.c:73>, pid 712: entering
__cr_task_get <cr_task.c:98>, pid 712: Alloc cr_task_t c53e8628 for pid 712
ctrl_ioctl <cr_fops.c:133>, pid 714: entering op=4004a105 arg=0x4
cr_phase1_register <cr_async.c:145>, pid 714: entering
__cr_task_get <cr_task.c:98>, pid 714: Alloc cr_task_t c53e85cc for pid 714
ctrl_ioctl <cr_fops.c:133>, pid 714: entering op=c004a106 arg=0x0
cr_suspend <cr_async.c:79>, pid 714: entering
ctrl_open <cr_fops.c:231>, pid 712: entering
ctrl_ioctl <cr_fops.c:133>, pid 712: entering op=4004a130 arg=0x9
ctrl_ioctl <cr_fops.c:133>, pid 712: entering op=4004a110 arg=0xbe9ed9b4
cr_chkpt_req <cr_chkpt_req.c:856>, pid 712: entering
cr_log_request <cr_chkpt_req.c:834>, pid 712: checkpointing process tree 709
cr_chkpt_req <cr_chkpt_req.c:878>, pid 712: checkpoint params: secs= 0, opts=00000000, fd=3
alloc_request <cr_chkpt_req.c:254>, pid 712: Alloc cr_chkpt_req_t c4f6bb38
ctrl_open <cr_fops.c:231>, pid 712: entering
cr_loc_init <cr_dest_file.c:158>, pid 712: Calling do_init_reg on fd 3
build_req_tree <cr_chkpt_req.c:707>, pid 712: in build_req_tree
add_proc <cr_chkpt_req.c:562>, pid 712: Add proc pid=709
add_task <cr_chkpt_req.c:436>, pid 712: entering task=c0536040 (709)
__cr_task_get <cr_task.c:98>, pid 712: Alloc cr_task_t c53e8570 for pid 709
build_req_tree <cr_chkpt_req.c:722>, pid 712: scanning children
build_req_tree <cr_chkpt_req.c:728>, pid 712: found child 709
do_trigger <cr_trigger.c:94>, pid 712: triggered pid 709 (hello) w/ retval=0
ctrl_ioctl <cr_fops.c:133>, pid 712: entering op=c004a111 arg=0x0
cr_chkpt_done <cr_chkpt_req.c:1383>, pid 712: entering
ctrl_ioctl <cr_fops.c:133>, pid 709: entering op=4004a101 arg=0x4000
cr_dump_self <cr_dump_self.c:999>, pid 709: entering flags=0x4000
cr_dump_self <cr_dump_self.c:1019>, pid 709: NOTIFY(&req->preshared_barrier)
cr_dump_self <cr_dump_self.c:1020>, pid 709: TEST(&req->preshared_barrier) returning 1
cr_save_file_header <cr_dump_self.c:697>, pid 709: Dumping file header
cr_signal_predump_barrier <cr_chkpt_req.c:1148>, pid 709: NOTIFY(&proc_req->predump_barrier)
cr_signal_predump_barrier <cr_chkpt_req.c:1149>, pid 709: ONCE(&proc_req->predump_barrier, 1) begin
cr_signal_predump_barrier <cr_chkpt_req.c:1149>, pid 709: ONCE(&proc_req->predump_barrier, 1) returning 1
cr_do_vmadump <cr_dump_self.c:777>, pid 709: Preparing to dump 1 threads of hello
cr_save_header <cr_dump_self.c:725>, pid 709: Dumping header for 1 threads
cr_do_vmadump <cr_dump_self.c:786>, pid 709: Writing the per-process linkage.
cr_do_vmadump <cr_dump_self.c:813>, pid 709: Writing credentials
cr_do_vmadump <cr_dump_self.c:938>, pid 709: NOTIFY(&proc_req->vmadump_barrier)
cr_dump_self <cr_dump_self.c:1115>, pid 709: ENTER(&req->postdump_barrier) begin
cr_dump_self <cr_dump_self.c:1115>, pid 709: ENTER(&req->postdump_barrier) returning 1
cr_dump_self <cr_dump_self.c:1120>, pid 709: Writing the trailer.
cr_save_header <cr_dump_self.c:725>, pid 709: Dumping header for 0 threads
cr_signal_chkpt_complete_barrier <cr_chkpt_req.c:1178>, pid 709: NOTIFY(&proc_req->pre_complete_barrier)
cr_signal_chkpt_complete_barrier <cr_chkpt_req.c:1180>, pid 709: ONCE(&proc_req->pre_complete_barrier, 1) begin
cr_signal_chkpt_complete_barrier <cr_chkpt_req.c:1180>, pid 709: ONCE(&proc_req->pre_complete_barrier, 1) returning 1
cr_chkpt_task_complete <cr_chkpt_req.c:1305>, pid 709: NOTIFY(&proc_req->post_complete_barrier)
cr_chkpt_task_complete <cr_chkpt_req.c:1306>, pid 709: WAIT(&proc_req->post_complete_barrier) begin
cr_chkpt_task_complete <cr_chkpt_req.c:1306>, pid 709: WAIT(&proc_req->post_complete_barrier) returning 1
__cr_task_put <cr_task.c:126>, pid 709: Free cr_task_t c53e8570
cr_dump_self <cr_dump_self.c:1152>, pid 709: leaving Returning -12
cr_chkpt_done <cr_chkpt_req.c:1424>, pid 712: leaving Returning 1
ctrl_ioctl <cr_fops.c:133>, pid 712: entering op=0000a112 arg=0xffffffff
ctrl_release <cr_fops.c:246>, pid 712: entering
release_request <cr_chkpt_req.c:52>, pid 712: Free cr_chkpt_req_t c4f6bb38
ctrl_release <cr_fops.c:246>, pid 712: entering
cr_suspend <cr_async.c:117>, pid 714: leaving with pending signal
ctrl_release <cr_fops.c:246>, pid 712: entering
__cr_task_put <cr_task.c:126>, pid 712: Free cr_task_t c53e85cc
__cr_task_put <cr_task.c:126>, pid 712: Free cr_task_t c53e8628
================================================================

Thanks a lot for your help.

Regards,
Manish

On Fri, Aug 8, 2008 at 1:20 AM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:

> Manish,
>
> There is no stated/known minimum memory requirement for BLCR, but it is
> still possible that we are too aggressive with memory. I run an emulated
> ARM environment in QEMU and have not yet tried running with so little memory
> (though I plan to try today).
>
> The default level of tracing detail didn't produce much output for your
> case because the failure appears to come relatively early. By requesting
> more detailed tracing, we should be able to narrow down when in BLCR we've
> failed to allocate memory.
> Please reload the kernel modules with "make insmod
> cr_ktrace_mask=0xffffffff", which will enable the most detailed tracing.
> Then rerun your failed checkpoint and, again, send the output. Hopefully
> this time there will be enough for me to move forward on diagnosing your
> problem.
>
> Thanks for your patience,
> Paul
>
> Manish Dwivedi wrote:
>
>> Hi Paul,
>>
>> Thanks for the information. We tried compiling it with the enable-debug
>> option today. But we didn't get much information in the log (log file is
>> attached in the e-mail).
>>
>> In between, we have 64 MB RAM in the system, is there a limitation or
>> minimum requirement of the RAM in BLCR ?
>>
>> Regards,
>> Manish
>>
>> Ps: We followed exactly the same process for X86 and it is working fine
>> for us.
>>
>>
>> On Wed, Aug 6, 2008 at 10:58 PM, Paul H. Hargrove
>> <PHHargrove_at_lbl_dot_gov> wrote:
>>
>>     Manish,
>>
>>     I am sorry to hear that you are having problems. From the
>>     information you provide below, it is hard to say what the problem
>>     is, other than to guess that your ARM system is low on memory.
>>     I am aware of a kernel-side memory leak in blcr-0.7.2, which
>>     should be fixed in the 0.7.3 release expected later this week or
>>     early next week. So, I'd like to know if the failure you describe
>>     happens on the very first use of cr_checkpoint, or does it happen
>>     after BLCR has been used several times (for instance by running
>>     "make check")? If it works for a while and then begins to fail,
>>     I'd suspect the known memory leak and suggest that you wait for
>>     blcr-0.7.3.
>>
>>     If you are seeing failure on the very first attempt to use blcr,
>>     then I suggest that you rebuild blcr with debugging enabled and
>>     send me the information dumped to the system logs (run dmesg or
>>     see /var/log/messages to find the logs). To do this, you'll need
>>     to start at the beginning of the configure/make/install process
>>     and pass the "--enable-debug" option to configure, and then
>>     proceed with the rest of the build/install process. Be sure to
>>     "make insmod" (or manually rmmod the old modules and
>>     insmod/modprobe the new ones); otherwise the kernel modules from
>>     your previous (non-debug) build may still be running. With the
>>     new kernel modules loaded, you should retry your failing command
>>     and then look for messages with "blcr: " in them in the system logs.
>>
>>     I also should tell you that there is an ARM-specific mailing list
>>     (very low volume) for BLCR that may help you reach other ARM
>>     users. You can find list info and subscribe (required to post) at
>>     https://hpcrdm.lbl.gov/mailman/listinfo/blcr-arm
>>
>>     -Paul
>>
>>
>>     Manish Dwivedi wrote:
>>
>>         Hi All,
>>
>>         I am trying to use BLCR for ARM. But when I am trying to use
>>         cr_checkpoint with a hello.c program it is giving me an error
>>         as below:
>>
>>         cr_checkpoint --term <pid> (command run)
>>         Checkpoint failed: Cannot allocate memory
>>
>>         I have compiled hello.c in the same kernel as mentioned in the
>>         release notes, I am using blcr-0.7.2.tar.gz for this.
>>
>>         Could anyone help me out resolving this issue so that I can
>>         test it. It works fine for me on a X86 machine.
>>
>>         Regards,
>>         Manish
>>
>>
>>
>>     --
>>     Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
>>     Future Technologies Group
>>     HPC Research Department                   Tel: +1-510-495-2352
>>     Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>
>>
>>
>
> --
> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
> Future Technologies Group
> HPC Research Department                   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>
>
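
A note on reading the trace at the top of this message: cr_dump_self leaves with "Returning -12", and on Linux errno 12 is ENOMEM, whose message text is "Cannot allocate memory", the same error cr_checkpoint printed. The following is only a rough Python sketch for pulling negative return values out of a saved trace. It is not part of BLCR; the helper name failing_returns is made up for illustration, and it assumes the trace uses the "leaving Returning N" wording seen above and that the machine running the script uses the same errno numbering as the kernel that produced the log.

```python
import errno
import os
import re

# Matches trace lines of the form seen in the log above, e.g.
#   cr_dump_self <cr_dump_self.c:1152>, pid 709: leaving Returning -12
# The pattern is an assumption based on that one log, not a BLCR format spec.
TRACE_RE = re.compile(
    r"(?P<func>\w+) <(?P<file>[\w.]+):(?P<line>\d+)>, pid (?P<pid>\d+): "
    r"leaving Returning (?P<ret>-?\d+)"
)


def failing_returns(log_text):
    """Yield (func, pid, retval, errno_name) for each negative return value.

    Hypothetical helper: negative values are treated as -errno, which is the
    usual Linux kernel convention.
    """
    for match in TRACE_RE.finditer(log_text):
        ret = int(match.group("ret"))
        if ret < 0:
            name = errno.errorcode.get(-ret, "unknown")
            yield match.group("func"), int(match.group("pid")), ret, name


# Two lines copied from the trace above.
sample = (
    "cr_dump_self <cr_dump_self.c:1152>, pid 709: leaving Returning -12\n"
    "cr_chkpt_done <cr_chkpt_req.c:1424>, pid 712: leaving Returning 1\n"
)

for func, pid, ret, name in failing_returns(sample):
    print(f"{func} (pid {pid}) returned {ret}: {name} ({os.strerror(-ret)})")
```

The same function can be fed the output of dmesg saved to a file. It only reports where an error code was returned, not which allocation inside BLCR failed.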