Re: lam/mpi blcr problem

From: Jeff Squyres (jsquyres_at_lam-mpi.org)
Date: Wed Mar 23 2005 - 11:16:11 PST

  • Next message: rmingming: "file appending related"
    Actually, this is quite likely (that LAM is not reconnecting stdin).
    
    On Mar 23, 2005, at 12:50 PM, Paul H. Hargrove wrote:
    
    > At restart time mpirun should be reconnected to stdin just fine (in 
    > fact there is presently no way to prevent it from doing so).  Similarly 
    > the stdin from the lamd to the app should have been restarted, but 
    > this is less certain than mpirun.
    >
    > It is also possible that LAM is not restoring the stdin *forwarding*. 
    > Jeff, can you determine from an examination of the LAM source whether 
    > the stdin forwarding is properly reestablished?  I know if I "lamexec 
    > n1 sleep 120" that the launched process gets /dev/null for stdin.  If 
    > the mechanism that runs cr_restart on the compute nodes is similar, it 
    > may also be getting /dev/null as *its* input which it then connects to 
    > the application.
    >
    > To check on what the app is connected to, try 'ls -l /proc/PID/fd/0' 
    > for the PID of mpirun and for each of the application processes.  If 
    > these turn out to be incorrect, then we'll still need to determine if 
    > BLCR or LAM is getting this wrong.
    >
    > -Paul
    >
    > Jeff Squyres wrote:
    >> It could well be that stdin is not being checkpointed.
    >> Paul?
    >> On Mar 23, 2005, at 12:18 PM, 任明明 wrote:
    >>>
    >>> I switched to another program that just does matrix multiplication;
    >>> this time checkpoint and restart of the MPI program worked very well.
    >>>
    >>>
    >>> In your message, you wrote:
    >>>
    >>>> From: "任明明" <0110018@mail.nankai.edu.cn>
    >>>> Reply-To: "任明明" <0110018@mail.nankai.edu.cn>
    >>>> To: checkpoint_at_lbl_dot_gov
    >>>> Subject: Re: lam/mpi blcr problem
    >>>> Date: Thu, 24 Mar 2005 00:33:10 +0800
    >>>>
    >>>>
    >>>> It seems OK now; at least I can see the context files for each
    >>>> process. But my cpi program needs input from the first process,
    >>>> and I checkpointed it while it was waiting for keyboard input.
    >>>> When I use cr_restart, the program quits quickly.
    >>>> By the way, when I run cr_checkpoint PID-of-mpirun (without
    >>>> --term) on this cpi example program, it stops running. I don't
    >>>> know what the problem is, and I hope I have expressed it
    >>>> clearly. :-)
    >>>>
    >>>> Thank you for your valuable information.
    >>>>
    >>>>
    >>>> In your message, you wrote:
    >>>>
    >>>>> From: "任明明" <0110018@mail.nankai.edu.cn>
    >>>>> Reply-To: "任明明" <0110018@mail.nankai.edu.cn>
    >>>>> To: checkpoint_at_lbl_dot_gov
    >>>>> Subject: Re: lam/mpi blcr problem
    >>>>> Date: Wed, 23 Mar 2005 23:34:11 +0800
    >>>>>
    >>>>>
    >>>>> I will, I will use this version:
    >>>>>
    >>>>> http://www.lam-mpi.org/download/files/lam-7.1.2b18.tar.bz2
    >>>>>
    >>>>> In your message, you wrote:
    >>>>>
    >>>>>> From: Jeff Squyres <jsquyres@lam-mpi.org>
    >>>>>> Reply-To:
    >>>>>> To: "任明明" <0110018@mail.nankai.edu.cn>
    >>>>>> Subject: Re: lam/mpi blcr problem
    >>>>>> Date: Wed, 23 Mar 2005 10:17:38 -0500
    >>>>>>
    >>>>>> If you wouldn't mind, could you try the beta and ensure that it
    >>>>>> works for you?
    >>>>>>
    >>>>>>
    >>>>>> On Mar 23, 2005, at 9:35 AM, 任明明 wrote:
    >>>>>>
    >>>>>>>
    >>>>>>> Thank you very much! I will wait for the new version.
    >>>>>>> And thank you all.
    >
    > -- 
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group
    > HPC Research Department                   Tel: +1-510-495-2352
    > Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    >
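
The check Paul describes in the quoted text ('ls -l /proc/PID/fd/0' for mpirun and for each application process) can be scripted. What follows is only a minimal sketch, assuming a Linux /proc filesystem; the function names stdin_target and report and the proc_root parameter are made up for illustration and are not part of LAM or BLCR. It reads the fd 0 symlink for each PID and flags /dev/null, which is the symptom Paul mentions for processes started with lamexec.

```python
import os
import sys


def stdin_target(pid, proc_root="/proc"):
    """Return what fd 0 of process `pid` points to, or None if unreadable.

    proc_root is a hypothetical parameter added so the function can be
    exercised against a fake directory tree instead of the real /proc.
    """
    link = os.path.join(proc_root, str(pid), "fd", "0")
    try:
        return os.readlink(link)
    except OSError:
        # Process has exited, PID does not exist, or permission was denied.
        return None


def report(pids, proc_root="/proc"):
    """Build one human-readable line per PID describing its stdin."""
    lines = []
    for pid in pids:
        target = stdin_target(pid, proc_root)
        if target is None:
            note = "unreadable"
        elif target == "/dev/null":
            note = "/dev/null (stdin may not be forwarded)"
        else:
            note = target
        lines.append(f"{pid}: {note}")
    return lines


if __name__ == "__main__":
    # Usage: python check_stdin.py PID-of-mpirun PID-of-app ...
    for line in report(sys.argv[1:]):
        print(line)
```

Run it once before the checkpoint and once after cr_restart, on the node where each process lives, and compare the two sets of lines. A difference only shows that stdin changed; it does not by itself say whether BLCR or LAM is responsible.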
    
    -- 
    {+} Jeff Squyres
    {+} jsquyres@lam-mpi.org
    {+} http://www.lam-mpi.org/
    
