From: Jeff Squyres (jsquyres_at_lam-mpi.org)
Date: Fri Jan 07 2005 - 19:31:07 PST
Oy, yes, this might be a problem.

-ton tells the MPI processes to dump trace information down to the LAM daemons. When the MPI processes restart, I can see how the trace information would not be associated with them anymore.

I'll check this out over the weekend and see if that works. I kinda doubt it. I'll reply to the guy (and checkpoint@lbl) with what I find.

Why do people keep sending LAM questions to you guys? Is there some web page that is not clear about who to send checkpoint vs. LAM questions?

On Jan 7, 2005, at 8:25 PM, jcduell_at_lbl_dot_gov wrote:

> Paul:
>
> Do you know anything about the LAM mpirun '-ton' tracing flag? It
> sounds like jobs started with it won't restart correctly.
>
> --
> Jason Duell               Future Technologies Group
> <jcduell_at_lbl_dot_gov>  Computational Research Division
> Tel: +1-510-495-2354      Lawrence Berkeley National Laboratory
>
> ----- Forwarded message from [email protected] -----
>
> From: [email protected]
> Subject: can blcr work well with the 'mpirun -ton .....'?
> Date: Tue, 04 Jan 2005 14:53:08 +0800 (BEIST)
> To: JCDuell_at_lbl_dot_gov
> Cc:
> X-Mailer: SkyMiracle WorldPost 8.0.1
>
> Dear Sir or Madam:
>
> I try to checkpoint and restart mpi programs with blcr in LAM
> environment !
>
> I want to checkpoint some mpi programs which are launched with the
> '-ton' so that I can get the trace files that LAM has
> produced. After I restart the context file, the processes such as
> mpirun, cr_restart and mpi program, have been restarted, but they
> don't continue to run. when I checkpoint the mpi programs
> without the '-ton', everything is ok ! It is so weird !
> can blcr work well with the "mpirun -ton ....." ?
> Thanks very much!
>
> the first commands are as followings:(with 'ton')
>   mpirun C -ton ./ring
>   cr_checkpoint pid of mpirun
>   cr_restart context.XXXX (restart failed, the
>     processed have been restarted but don't continue)
>
> the second comands are as following:(without '-ton')
>   mpirun C ./ring
>   cr_checkpoint pid of mpirun
>   cr_restart context.XXXX (restart is ok)
>
> redhat 9
> the version of blcr is 0.2.2.3b8
> the lam version is 7.0.4
> deward
>
> ----- End forwarded message -----

--
{+} Jeff Squyres
{+} [email protected]
{+} http://www.lam-mpi.org/