From: Jeff Squyres (jsquyres_at_lam-mpi.org)
Date: Wed Mar 16 2005 - 08:45:18 PST
Deward --

I apologize for not replying earlier; I admit that I totally forgot about this issue. I see that I did actually reply with *something* back in January, but you were not CC'ed on it. Sorry about that. :-\

I seem to recall running this and it *not* working (just like you said). I believe that the issue has something to do with the fact that the lamd's are not checkpointed -- and they contain the trace data. So if you restart processes, they won't have the association with the trace files in the daemons, and I can see how things would go downhill from there.

In the future, if you have LAM questions, you might want to mail the LAM user's mailing list directly (see http://www.lam-mpi.org/MailArchives/) -- not the BLCR guys. :-)

On Jan 7, 2005, at 10:31 PM, Jeff Squyres wrote:

> Oy, yes, this might be a problem.
>
> -ton tells the MPI processes to dump trace information down to the LAM
> daemons. When the MPI processes restart, I can see how the trace
> information would not be associated with them anymore.
>
> I'll check this out over the weekend and see if that works. I kinda
> doubt it.
>
> On Jan 7, 2005, at 8:25 PM, jcduell_at_lbl_dot_gov wrote:
>
>> Paul:
>>
>> Do you know anything about the LAM mpirun '-ton' tracing flag? It
>> sounds like jobs started with it won't restart correctly.
>>
>> --
>> Jason Duell               Future Technologies Group
>> <jcduell_at_lbl_dot_gov>  Computational Research Division
>> Tel: +1-510-495-2354      Lawrence Berkeley National Laboratory
>>
>>
>> ----- Forwarded message from [email protected] -----
>>
>> From: [email protected]
>> Subject: can blcr work well with the 'mpirun -ton .....'?
>> Date: Tue, 04 Jan 2005 14:53:08 +0800 (BEIST)
>> To: JCDuell_at_lbl_dot_gov
>> Cc:
>> X-Mailer: SkyMiracle WorldPost 8.0.1
>>
>>
>> Dear Sir or Madam:
>>
>> I try to checkpoint and restart mpi programs with blcr in LAM
>> environment !
>>
>> I want to checkpoint some mpi programs which are launched with the
>> '-ton' so that I can get the trace files that LAM has produced.
>> After I restart the context file, the processes such as mpirun,
>> cr_restart and mpi program, have been restarted, but they don't
>> continue to run. When I checkpoint the mpi programs without the
>> '-ton', everything is ok ! It is so weird !
>> can blcr work well with the "mpirun -ton ....." ?
>> Thanks very much!
>>
>> the first commands are as follows (with '-ton'):
>> mpirun C -ton ./ring
>> cr_checkpoint pid of mpirun
>> cr_restart context.XXXX (restart failed, the processes have been
>> restarted but don't continue)
>>
>> the second commands are as follows (without '-ton'):
>> mpirun C ./ring
>> cr_checkpoint pid of mpirun
>> cr_restart context.XXXX (restart is ok)
>>
>>
>> redhat 9
>> the version of blcr is 0.2.2.3b8
>> the lam version is 7.0.4
>> deward
>>
>>
>> ----- End forwarded message -----
>>
>
> --
> {+} Jeff Squyres
> {+} [email protected]
> {+} http://www.lam-mpi.org/
>

--
{+} Jeff Squyres
{+} [email protected]
{+} http://www.lam-mpi.org/