Re: using blcr on program with fork

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Feb 25 2009 - 12:22:39 PST

    Andrea Autiero S143785 wrote:
    > I'm using shared memory in my program.
    > Removing every line referring to it lets BLCR checkpoint my application.
    > Could this be the problem?
    Yes, that is almost certainly the problem.  In the dmesg output you sent 
    I found
        blcr: vfs_read returned -22
        blcr: write returned -22 on copy-out of mmap()ed data
        blcr: vfs_read returned -22
        blcr: write returned -22 on copy-out of mmap()ed data
    which is consistent with use of SysV or POSIX shared memory.
    Unfortunately, BLCR does not yet have support for SysV or POSIX shared
    memory.  However, if you can change your program to instead use an 
    anonymous mmap() to obtain shared memory, that *is* supported by BLCR.
    Additionally, it is possible to construct a program with BLCR callbacks 
    that would disconnect from the shared memory when a checkpoint request 
    is received, allowing the checkpoint to be taken, and then reconnect 
    afterwards.  However, that opens up the messy issue of adding a 
    mechanism for preserving the shared memory values.
    > On Mon, 23 Feb 2009 13:50:39 -0800, "Paul H. Hargrove" <PHHargrove_at_lbl_dot_gov>
    > wrote:
    >> Andrea,
    >>   I cannot tell from the information you have provided what the problem 
    >> might be.  If I construct a simple example program that behaves as you 
    >> describe, and I compile it as you describe, then I am able to checkpoint 
    >> it and restart it just fine.
    >>   Could you please check the output of the "dmesg" command and/or your 
    >> system logs to see if there are any kernel messages that might help 
    >> explain the failure.
    >> -Paul
    >> Andrea Autiero S143785 wrote:
    >>> Hi!
    >>> It's me again.
    >>> After building a statically linked binary with BLCR I've got another problem.
    >>> I'm trying to checkpoint a program after it forks twice.
    >>> Then from another shell (but in the future it will be done by the
    >>> program itself)
    >>> I try to checkpoint it and the answer is:
    >>>  >ps -a
    >>>    PID TTY          TIME CMD
    >>>    5878 pts/0    00:00:00 controller
    >>>    5879 pts/0    00:00:02 controller
    >>>    5880 pts/0    00:00:02 controller
    >>>    5881 pts/1    00:00:00 ps
    >>>  >cr_checkpoint 5878
    >>> Checkpoint failed: Invalid argument
    >>> 5878 is the parent.
    >>> I compiled it with
    >>>     >gcc -o controller controller.c -L/usr/local/lib/ -lcr_run -u cr_run_link_me -ldl -lpthread
    >>>     >nm controller | grep _link_me
    >>>          U cr_run_link_me
    >>> (It is not statically linked right now because I'm testing on a PC and
    >>> not on an embedded system, but it must work on the embedded system.)
    >>> Why does it do this? Could you help me make it work?
    >>> thanks..
    >>> have a good day
    >>> Andrea Autiero
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
