Re: Checkpointing

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Oct 31 2008 - 12:42:23 PST

    I am sorry about the slow reply.  I am very busy right now.
    
    We don't have LAM/MPI installed anywhere for our own testing.  However, I 
    have tried the following simple non-MPI program based on your code:
    
    #include <stdio.h>
    int main(int argc, char **argv)
    {
       int i;
       scanf("%d",&i); printf("1st read: %d\n", i);
       scanf("%d",&i); printf("2nd read: %d\n", i);
       return 0;
    }
    
    If I checkpoint while the program is blocked at the first scanf(), and then 
    restart, I find that the application does not respond to input.  However, if 
    I hit ^Z and then type "fg" at the shell, the application behaves normally:
    
    $ ./bin/cr_restart context.2307
    [ENTER]
    [ENTER]
    [^Z]
    [1]+  Stopped                 ./bin/cr_restart context.2307
    $ fg
    ./bin/cr_restart context.2307
    1
    1st read: 1
    1
    2nd read: 1
    
    
    So, there does appear to be something odd about how the read() has been 
    restarted.  We don't normally deal much with applications that read standard 
    input, but this certainly seems like a BLCR bug.
    
    My recommendation is to avoid using I/O in this way.  In general, reading 
    stdin in an MPI program is poorly defined anyway (for instance, does only rank 
    0 get the input, or is it cloned for all ranks?).
    
    I am guessing you wanted a way to cause your program to wait for a checkpoint 
    to be taken.  In my own test codes, I often call "pause()" for this reason. 
    Because BLCR's checkpoints are initiated using signals, the pause() will 
    return only after the checkpoint has been taken.
    
    Let us (checkpoint_at_lbl_dot_gov) know if you need any more assistance, but be 
    warned that our response is likely to be slow between now and the end of November.
    
    
    -Paul
    
    
    [email protected] wrote:
    > 
    > Please respond to my previous mail. So far I have not been able to
    > determine what to do. Please do reply; I would be thankful.
    > 
    > Thanking you.
    > 
    >> Dear Paul,
    >>
    >> I have executed a simple program as per the instructions in the LAM/MPI
    >> documentation. I ran mpirun once on the head node only and once across
    >> all the nodes. In both cases "lamcheckpoint" was successful and generated
    >> the context files (i.e. context.mpirun.3270, context.3270-n0-3271, etc.)
    >> for all the processes. Up to this step I think everything is OK.
    >>
    >> Please also note that once the program starts, it asks for input. At
    >> that point I checkpointed the program and killed it.
    >>
    >> The problem is in the restart. When I give the "lamrestart" command, the
    >> job restarts but does not behave according to the program: it does not
    >> respond. In this state the process cannot be killed. The process status
    >> is Dl+ for the job.
    >>
    >> Am I doing this right, or is there something wrong with my test program?
    >> For your convenience I have attached my test program.
    >>
    >> Please advise me on how to proceed.
    >>
    >>
    >> Thanking you.
    >>
    >> Dhruba
    >> IIT Guwahati, India
    >>
    >>
    >>
    >>> If you have not yet done so, please read the instructions for
    >>> "lamcheckpoint", "lamrestart" and "checkpoint/restart of MPI jobs" -
    >>> these are sections 7.2, 7.9 and 9.5 in the LAM/MPI User Guide:
    >>> http://www.lam-mpi.org/download/files/7.1.4-user.pdf
    >>>
    >>> If after following the instructions in the User Guide, you still have
    >>> questions, you should ask again with some information about how the
    >>> restart fails.  For instance, if there are any error messages or syslog
    >>> messages from the compute nodes that might explain the failure.
    >>>
    >>> -Paul
    >>>
    >>>
    >>> [email protected] wrote:
    >>>> Dear sir,
    >>>>
    >>>> I am working on a project named "Fault tolerance using checkpoint and
    >>>> recovery protocol using cluster based Distributed system" in the
    >>>> Computer Science and Engineering Department, Indian Institute of
    >>>> Technology, Guwahati, India.
    >>>>
    >>>> I have already set up a cluster with one head node and six client nodes
    >>>> using OSCAR 5.0, and installed the LAM/MPI beta version integrated with
    >>>> BLCR. Now I have a problem restarting the checkpointed process. Can you
    >>>> tell me the proper procedure for checkpointing an MPI program?
    >>>>
    >>>> Thanking you.
    >>>>
    >>>> Dhruba
    >>>> IIT Guwahati,India
    >>>>
    >>>>
    >>>
    >>> --
    >>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >>> Future Technologies Group
    >>> HPC Research Department                   Tel: +1-510-495-2352
    >>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    >>>
    > 
    > 
    > 
    > 
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
