From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Oct 31 2008 - 12:42:23 PST
I am sorry about the slow reply. I am very busy right now.

We don't have LAM/MPI installed anywhere for testing of our own. However,
I have tried the following simple non-MPI program based on your code:

    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int i;
        scanf("%d",&i);
        printf("1st read: %d\n", i);
        scanf("%d",&i);
        printf("2nd read: %d\n", i);
        return 0;
    }

If I checkpoint while the program is blocked at the first scanf(), and
then I restart, I find that the application is not responding to input.
However, if I hit ^Z and then type "fg" to the shell the application
behaves normally:

    $ ./bin/cr_restart context.2307
    [ENTER]
    [ENTER]
    [^Z]
    [1]+  Stopped                 ./bin/cr_restart context.2307
    $ fg
    ./bin/cr_restart context.2307
    1
    1st read: 1
    1
    2nd read: 1

So, there does appear to be something odd about how the read() has been
restarted. We don't normally deal much with applications with standard
input, but this certainly seems like a BLCR bug.

My recommendation is to avoid using I/O in this way. In general, reading
stdin in an MPI program is poorly defined anyway (for instance, does only
rank 0 get the input, or is it cloned for all ranks?). I am guessing you
wanted a way to cause your program to wait for a checkpoint to be taken.
In my own test codes, I often call "pause()" for this reason. Because
BLCR's checkpoints are initiated using signals, the pause() will return
only after the checkpoint has been taken.

Let us (checkpoint_at_lbl_dot_gov) know if you need any more assistance,
but be warned that our response is likely to be slow between now and the
end of November.

-Paul

[email protected] wrote:
>
> Please respond to the previous mail. Till now I could not determine what
> to do. Please do reply to me. I will be thankful to you.
>
> Thanking you.
>
>> Dear Paul,
>>
>> I have executed a simple program as per the instructions in the LAM/MPI
>> documentation. Once I ran mpirun only on the head node, and the next time
>> on all the nodes. In both cases "lamcheckpoint" was successful and
>> generated the context files (i.e. context.mpirun.3270,
>> context.3270-n0-3271, etc.) for all the processes. Up to this step I
>> think everything is OK.
>>
>> Also, please note that after the program is started it asks for
>> an input. At this point I checkpointed the program and killed it.
>>
>> But the problem is in restart. When I give the "lamrestart" command, the
>> job is restarted but the behaviour is not according to the program. It
>> does not respond. In this situation the process cannot be killed. The
>> PID status is Dl+ for the job.
>>
>> Am I doing this right? Or is there anything wrong with my test program?
>> For your convenience I have attached my test program.
>>
>> Please advise me how to proceed.
>>
>>
>> Thanking you.
>>
>> Dhruba
>> IIT Guwahati, India
>>
>>
>>
>>> If you have not yet done so, please read the instructions for
>>> "lamcheckpoint",
>>> "lamrestart" and "checkpoint/restart of MPI jobs" - these are sections
>>> 7.2,
>>> 7.9 and 9.5 in the LAM/MPI User Guide:
>>> http://www.lam-mpi.org/download/files/7.1.4-user.pdf
>>>
>>> If after following the instructions in the User Guide, you still have
>>> questions, you should ask again with some information about how the
>>> restart
>>> fails. For instance, if there are any error messages or syslog messages
>>> from
>>> the compute nodes that might explain the failure.
>>>
>>> -Paul
>>>
>>>
>>> [email protected] wrote:
>>>> Dear sir,
>>>>
>>>> I am working on a project named "Fault tolerance using checkpoint and
>>>> recovery protocol using cluster based Distributed system" in the Computer
>>>> Science and Engineering Department, Indian Institute of
>>>> Technology, Guwahati, India.
>>>>
>>>> I have already set up a cluster with one head node and six client nodes
>>>> using OSCAR 5.0 and installed the LAM-MPI beta version integrated with BLCR.
>>>> Now I have a problem restarting the checkpointed process. Can
>>>> you tell me the proper procedure to checkpoint an MPI program?
>>>>
>>>> Thanking you.
>>>>
>>>> Dhruba
>>>> IIT Guwahati, India
>>>>
>>>>
>>>
>>> --
>>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
>>> Future Technologies Group
>>> HPC Research Department                   Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>
>
>
>
>

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
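
The pause() suggestion in the reply above can be sketched roughly as follows. This is only an illustration, not code from the thread: the names wait_for_signal() and noop_handler() are hypothetical, and the no-op handler is an assumption added so the helper can be tried without BLCR by sending the process an ordinary signal. Under BLCR, per the reply, the checkpoint itself is delivered via a signal, so pause() should return once the checkpoint has been taken.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical no-op handler (not from the thread). Installing it for,
 * say, SIGUSR1 lets the helper below be exercised without BLCR:
 * "kill -USR1 <pid>" then wakes the process. */
void noop_handler(int sig)
{
    (void)sig; /* nothing to do; running the handler is what matters */
}

/* Hypothetical helper: block until some signal handler has run.
 * pause() returns only after a caught signal's handler returns,
 * and then it returns -1 with errno set to EINTR.
 * Returns 0 in that case, -1 on anything unexpected. */
int wait_for_signal(void)
{
    errno = 0;
    pause();
    return (errno == EINTR) ? 0 : -1;
}
```

A test program would call wait_for_signal() at the point where it previously blocked in scanf(), print a message after it returns, and be checkpointed while it is waiting. To try it without BLCR, install the handler first with signal(SIGUSR1, noop_handler).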