Re: [dehua999@sjtu.edu.cn: can blcr work well with the 'mpirun -ton .....'?]

From: Jeff Squyres (jsquyres_at_lam-mpi.org)
Date: Wed Mar 16 2005 - 08:45:18 PST

  • Next message: Paolo Victor: "Unresolved symbol errors when trying to install BLCR modules"
    Deward --
    
    I apologize for not replying earlier; I admit that I totally forgot
    about this issue.  I see that I did actually reply with *something* 
    back in January, but you were not CC'ed on it.  Sorry about that.  :-\
    
    I seem to recall running this and it *not* working (just like you 
    said).  I believe that the issue has something to do with the fact that 
    the lamd's are not checkpointed -- and they contain the trace data.  So 
    if you restart processes, they won't be associated with the trace
    files in the daemons, and I can see how things would go downhill from 
    there.
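
To make the suspected failure mode concrete, here is a toy Python model. It is not LAM or BLCR code: the `Daemon` and `MpiProcess` classes, their methods, and `survives_restart` are all hypothetical names invented for illustration, and the sketch assumes that the daemon-side trace state for a process is simply absent when that process is restarted.

```python
# Toy model only: these classes are hypothetical and do not mirror LAM or BLCR internals.

class Daemon:
    """Stands in for a lamd: holds trace state and is NOT part of the checkpoint."""

    def __init__(self):
        self.trace_buffers = {}  # registration id -> list of trace records

    def register(self):
        reg_id = len(self.trace_buffers) + 1
        self.trace_buffers[reg_id] = []
        return reg_id

    def dump(self, reg_id, record):
        # Raises KeyError if the daemon has no state for this registration.
        self.trace_buffers[reg_id].append(record)


class MpiProcess:
    """Stands in for an MPI process; only its own fields are checkpointed."""

    def __init__(self, daemon, tracing):
        self.tracing = tracing
        self.reg_id = daemon.register() if tracing else None

    def step(self, daemon, record):
        if self.tracing:
            daemon.dump(self.reg_id, record)

    def checkpoint(self):
        return dict(vars(self))

    @classmethod
    def restart(cls, image):
        proc = cls.__new__(cls)  # skip __init__: a restart does not re-register
        vars(proc).update(image)
        return proc


def survives_restart(tracing):
    daemon = Daemon()
    proc = MpiProcess(daemon, tracing)
    proc.step(daemon, "before checkpoint")
    image = proc.checkpoint()

    # Assumption: the daemon-side trace state for the old process is gone at restart time.
    daemon_after = Daemon()
    restarted = MpiProcess.restart(image)
    try:
        restarted.step(daemon_after, "after restart")
    except KeyError:
        return False
    return True


print(survives_restart(tracing=False))  # True
print(survives_restart(tracing=True))   # False
```

In the actual report the restarted processes hung rather than failing with an error; the exception in this sketch only marks the point where the process refers to daemon state that no longer exists.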
    
    In the future, if you have LAM questions, you might want to mail the 
    LAM user's mailing list directly (see 
    http://www.lam-mpi.org/MailArchives/) -- not the BLCR guys.  :-)
    
    
    On Jan 7, 2005, at 10:31 PM, Jeff Squyres wrote:
    
    > Oy, yes, this might be a problem.
    >
    > -ton tells the MPI processes to dump trace information down to the LAM 
    > daemons.  When the MPI processes restart, I can see how the trace 
    > information would not be associated with them anymore.
    >
    > I'll check this out over the weekend and see if that works.  I kinda 
    > doubt it.
    >
    > On Jan 7, 2005, at 8:25 PM, jcduell_at_lbl_dot_gov wrote:
    >
    >> Paul:
    >>
    >> Do you know anything about the LAM mpirun '-ton' tracing flag?  It
    >> sounds like jobs started with it won't restart correctly.
    >>
    >> -- 
    >> Jason Duell             Future Technologies Group
    >> <jcduell_at_lbl_dot_gov>       Computational Research Division
    >> Tel: +1-510-495-2354    Lawrence Berkeley National Laboratory
    >>
    >>
    >> ----- Forwarded message from dehua999@sjtu.edu.cn -----
    >>
    >> From: dehua999@sjtu.edu.cn
    >> Subject: can blcr work well with the 'mpirun -ton .....'?
    >> Date: Tue, 04 Jan 2005 14:53:08 +0800 (BEIST)
    >> To: JCDuell_at_lbl_dot_gov
    >> Cc:
    >> X-Mailer: SkyMiracle WorldPost 8.0.1
    >>
    >>
    >> Dear Sir or Madam:
    >>
    >>      I am trying to checkpoint and restart MPI programs with BLCR in
    >>      a LAM environment.
    >>
    >>      I want to checkpoint some MPI programs that are launched with
    >>      '-ton' so that I can get the trace files that LAM produces.
    >>      After I restart from the context file, the processes (mpirun,
    >>      cr_restart and the MPI program) are restarted, but they don't
    >>      continue to run. When I checkpoint the MPI programs without
    >>      '-ton', everything is OK! It is so weird!
    >>      Can BLCR work well with "mpirun -ton ....."?
    >>      Thanks very much!
    >>
    >>    The first set of commands is as follows (with '-ton'):
    >>        mpirun C -ton  ./ring
    >>        cr_checkpoint   pid of mpirun
    >>        cr_restart  context.XXXX            (restart failed: the
    >>                                             processes are restarted
    >>                                             but don't continue)
    >>
    >>    The second set of commands is as follows (without '-ton'):
    >>        mpirun C ./ring
    >>        cr_checkpoint   pid of mpirun
    >>        cr_restart  context.XXXX            (restart is OK)
    >>
    >>
    >>    redhat 9
    >>    the version of blcr is 0.2.2.3b8
    >>    the lam version is 7.0.4
    >>                                                        deward
    >>
    >>
    >> ----- End forwarded message -----
    >>
    >
    > -- 
    > {+} Jeff Squyres
    > {+} jsquyres@lam-mpi.org
    > {+} http://www.lam-mpi.org/
    >
    
    -- 
    {+} Jeff Squyres
    {+} jsquyres@lam-mpi.org
    {+} http://www.lam-mpi.org/
    
