From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Feb 25 2009 - 12:22:39 PST
Andrea Autiero S143785 wrote:
> i'm using shared memory in my program
> removing every line refering to them let blcr checkpoint my applications..
> could be this the problem?
>

Yes, that is almost certainly the problem. In the dmesg output you sent I found

    blcr: vfs_read returned -22
    blcr: write returned -22 on copy-out of mmap()ed data
    blcr: vfs_read returned -22
    blcr: write returned -22 on copy-out of mmap()ed data

which is consistent with use of SysV or POSIX shared memory.

Unfortunately, BLCR does not yet have support for SysV or POSIX shared
memory. However, if you can change your program to instead use an
anonymous mmap() to obtain shared memory, that *is* supported by BLCR.

Additionally, it is possible to construct a program with BLCR callbacks
that would disconnect from the shared memory when a checkpoint request is
received, allowing the checkpoint to be taken, and then reconnect
afterwards. However, that opens up the messy issue of adding a mechanism
for preserving the shared memory values.

-Paul

> On Mon, 23 Feb 2009 13:50:39 -0800, "Paul H. Hargrove" <PHHargrove_at_lbl_dot_gov>
> wrote:
>
>> Andrea,
>>
>> I cannot tell from the information you have provided what the problem
>> might be. If I construct a simple example program that behaves as you
>> describe, and I compile it as you describe, then I am able to checkpoint
>> it and restart it just fine.
>> Could you please check the output of the "dmesg" command and/or your
>> system logs to see if there are any kernel messages that might help
>> explain the failure.
>>
>> -Paul
>>
>> Andrea Autiero S143785 wrote:
>>
>>> hi!
>>> it's me another time..
>>> after made statically linked file with blcr I've got another problem..
>>> I'm trying to checkpoint a program after it forks twice
>>> then from another shell (but in the future it will be done by the program
>>> itself)
>>> i try to checkpoint it and the answer is:
>>> >ps -a
>>>   PID TTY          TIME CMD
>>>  5878 pts/0    00:00:00 controller
>>>  5879 pts/0    00:00:02 controller
>>>  5880 pts/0    00:00:02 controller
>>>  5881 pts/1    00:00:00 ps
>>> >cr_checkpoint 5878
>>> Checkpoint failed: Invalid argument
>>>
>>> 5878 is the father..
>>> i've compiled it by
>>> >gcc -o controller controller.c -L/usr/local/lib/ -lcr_run -u
>>> cr_run_link_me -ldl -lpthread
>>> >nm controller | grep _link_me
>>> U cr_run_link_me
>>>
>>> (now is not statically linked because I'm trying on a pc and not on an
>>> embedded system, but is in the last one that it must work)
>>> why it do this?could you help me to make it works?
>>> thanks..
>>> have a good day
>>> Andrea Autiero
>>>
>>>

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
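
For reference, the anonymous mmap() approach suggested in the reply above
might look roughly like the sketch below. It is only an illustration, not
code from BLCR or from the original controller program: the helper name
share_counter_with_child and the single-counter layout are hypothetical.
It assumes a Linux system where MAP_ANONYMOUS is available, and it relies
on the mapping being created before fork() so that parent and child share
the same pages.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of BLCR): creates one int of shared
 * memory with an anonymous mmap(), forks a child that increments it
 * `increments` times, and returns the value the parent sees afterwards.
 * Returns -1 on error.
 */
int share_counter_with_child(int increments)
{
    /* MAP_SHARED | MAP_ANONYMOUS: no backing file and no SysV/POSIX
     * shm object; the pages are shared with children across fork(). */
    int *counter = mmap(NULL, sizeof *counter, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (counter == MAP_FAILED) {
        perror("mmap");
        return -1;
    }
    *counter = 0;

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        munmap(counter, sizeof *counter);
        return -1;
    }
    if (pid == 0) {
        /* Child: writes land in the shared pages, not a private copy. */
        for (int i = 0; i < increments; i++)
            (*counter)++;
        _exit(0);
    }

    /* Parent: wait for the child, then read what it wrote. */
    int status;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        munmap(counter, sizeof *counter);
        return -1;
    }
    int result = *counter;
    munmap(counter, sizeof *counter);
    return result;
}
```

Calling share_counter_with_child(5) from main() should return 5, since the
child's increments are visible to the parent. A program that forks several
workers would create the mapping once in the parent, before the first
fork(), in place of its shmget()/shmat() or shm_open() calls.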