jcduell_at_lbl_dot_gov
Date: Fri Jan 07 2005 - 17:25:40 PST
Paul:

Do you know anything about the LAM mpirun '-ton' tracing flag? It sounds
like jobs started with it won't restart correctly.

-- 
Jason Duell               Future Technologies Group
<jcduell_at_lbl_dot_gov>  Computational Research Division
Tel: +1-510-495-2354      Lawrence Berkeley National Laboratory

----- Forwarded message from [email protected] -----

From: [email protected]
Subject: can blcr work well with the 'mpirun -ton .....'?
Date: Tue, 04 Jan 2005 14:53:08 +0800 (BEIST)
To: JCDuell_at_lbl_dot_gov
Cc:
X-Mailer: SkyMiracle WorldPost 8.0.1

Dear Sir or Madam:

I am trying to checkpoint and restart MPI programs with blcr in a LAM
environment. I want to checkpoint MPI programs that are launched with
'-ton', so that I can get the trace files that LAM produces.

After I restart from the context file, the processes (mpirun, cr_restart
and the MPI program) are restarted, but they do not continue to run.
When I checkpoint the MPI programs without '-ton', everything is OK.
It is so weird!

Can blcr work well with "mpirun -ton ....."?

Thanks very much!

The first set of commands is as follows (with '-ton'):

    mpirun C -ton ./ring
    cr_checkpoint <pid of mpirun>
    cr_restart context.XXXX

(The restart fails: the processes are restarted but do not continue.)

The second set of commands is as follows (without '-ton'):

    mpirun C ./ring
    cr_checkpoint <pid of mpirun>
    cr_restart context.XXXX

(The restart is OK.)

Environment:

    OS:    Red Hat 9
    blcr:  0.2.2.3b8
    LAM:   7.0.4

deward

----- End forwarded message -----