drbj153_at_iitg.ernet.in
Date: Mon Nov 03 2008 - 20:27:43 PST
Many, many thanks. For your information, checkpointing is now working. If I
face any further problems, I will let you know. You replied even though you
are so busy, and that is more than enough. Thank you again.

---Dhruba

> I am sorry about the slow reply. I am very busy right now.
>
> We don't have LAM/MPI installed anywhere for testing of our own. However, I
> have tried the following simple non-MPI program based on your code:
>
> #include <stdio.h>
> int main(int argc, char **argv)
> {
>     int i;
>     scanf("%d",&i); printf("1st read: %d\n", i);
>     scanf("%d",&i); printf("2nd read: %d\n", i);
>     return 0;
> }
>
> If I checkpoint while the program is blocked at the first scanf(), and then I
> restart, I find that the application is not responding to input. However, if
> I hit ^Z and then type "fg" to the shell the application behaves normally:
>
> $ ./bin/cr_restart context.2307
> [ENTER]
> [ENTER]
> [^Z]
> [1]+ Stopped ./bin/cr_restart context.2307
> $ fg
> ./bin/cr_restart context.2307
> 1
> 1st read: 1
> 1
> 2nd read: 1
>
> So, there does appear to be something odd about how the read() has been
> restarted. We don't normally deal much with applications with standard input,
> but this certainly seems like a BLCR bug.
>
> My recommendation is to avoid using I/O in this way. In general, reading
> stdin in an MPI program is poorly defined anyway (for instance, does only rank
> 0 get the input, or is it cloned for all ranks?).
>
> I am guessing you wanted a way to cause your program to wait for a checkpoint
> to be taken. In my own test codes, I often call "pause()" for this reason.
> Because BLCR's checkpoints are initiated using signals, the pause() will
> return only after the checkpoint has been taken.
>
> Let us (checkpoint_at_lbl_dot_gov) know if you need any more assistance, but be
> warned that our response is likely to be slow between now and the end of
> November.
>
> -Paul
>
> [email protected] wrote:
>>
>> Please respond to the previous mail.
>> I still have not been able to determine what to do next. Please do reply;
>> I will be thankful to you.
>>
>> Thanking you.
>>
>>> Dear Paul,
>>>
>>> I have run a simple program following the instructions in the LAM/MPI
>>> documentation. I ran mpirun once on the head node only and once across
>>> all the nodes. In both cases "lamcheckpoint" was successful and generated
>>> the context files (i.e. context.mpirun.3270, context.3270-n0-3271, etc.)
>>> for all the processes. Up to this step I think everything is OK.
>>>
>>> Note that once it is started, the program asks for an input. At that
>>> point I checkpointed the program and killed it.
>>>
>>> The problem is in the restart. When I give the "lamrestart" command, the
>>> job restarts, but it does not behave as the program should: it does not
>>> respond. In this situation the process cannot be killed. The process
>>> status is Dl+ for the job.
>>>
>>> Am I doing this right, or is there something wrong with my test program?
>>> For your convenience I have attached my test program.
>>>
>>> Please advise me on how to proceed.
>>>
>>> Thanking you.
>>>
>>> Dhruba
>>> IIT Guwahati, India
>>>
>>>> If you have not yet done so, please read the instructions for
>>>> "lamcheckpoint", "lamrestart" and "checkpoint/restart of MPI jobs" -
>>>> these are sections 7.2, 7.9 and 9.5 in the LAM/MPI User Guide:
>>>> http://www.lam-mpi.org/download/files/7.1.4-user.pdf
>>>>
>>>> If after following the instructions in the User Guide, you still have
>>>> questions, you should ask again with some information about how the
>>>> restart fails. For instance, if there are any error messages or syslog
>>>> messages from the compute nodes that might explain the failure.
>>>>
>>>> -Paul
>>>>
>>>> [email protected] wrote:
>>>>> Dear sir,
>>>>>
>>>>> I am working on a project named "Fault tolerance using checkpoint and
>>>>> recovery protocol using cluster based Distributed system" in the
>>>>> Computer Science and Engineering Department, Indian Institute of
>>>>> Technology, Guwahati, India.
>>>>>
>>>>> I have already set up a cluster with one head node and six client
>>>>> nodes using OSCAR 5.0, and installed the LAM-MPI beta version
>>>>> integrated with BLCR. Now I have a problem restarting the
>>>>> checkpointed process. Can you tell me the proper procedure for
>>>>> checkpointing an MPI program?
>>>>>
>>>>> Thanking you.
>>>>>
>>>>> Dhruba
>>>>> IIT Guwahati, India
>>>>
>>>> --
>>>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
>>>> Future Technologies Group
>>>> HPC Research Department                   Tel: +1-510-495-2352
>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>
> --
> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
> Future Technologies Group
> HPC Research Department                   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
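
The pause() suggestion in Paul's reply can be sketched roughly as follows. This
is only an illustration, not code from the thread: the function names
wait_for_checkpoint() and demo_with_alarm() are hypothetical, and SIGALRM is
used purely as a stand-in so that the sketch runs on a machine without BLCR.
Under BLCR the wake-up would come from the checkpoint signal itself, and the
program would not need to install a handler of its own.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Handler that does nothing; it exists only so that SIGALRM is "caught"
 * and therefore interrupts pause() instead of terminating the process. */
static void noop_handler(int signo)
{
    (void)signo;
}

/* Hypothetical helper: block in pause() until a caught signal has been
 * handled. pause() only ever returns -1 with errno set to EINTR, so the
 * return value is 1 when the wait ended that way and 0 otherwise. */
int wait_for_checkpoint(void)
{
    errno = 0;
    int rc = pause();
    return rc == -1 && errno == EINTR;
}

/* Assumption for this sketch only: simulate the arrival of a signal by
 * scheduling SIGALRM after `seconds`, then wait for it with pause(). */
int demo_with_alarm(unsigned seconds)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = noop_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return 0;               /* could not install the stand-in handler */

    alarm(seconds);             /* stand-in for an external checkpoint request */
    return wait_for_checkpoint();
}
```

In a test program, the call to wait_for_checkpoint() would take the place of
the blocking scanf(): the process sleeps without touching stdin until a signal
is delivered and handled, which avoids the restarted read() discussed above.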