Re: lam/mpi blcr problem

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Mar 23 2005 - 09:50:49 PST

  • Next message: Jeff Squyres: "Re: lam/mpi blcr problem"
    At restart time mpirun should be reconnected to stdin just fine (in fact 
    there is presently no way to prevent it from doing so).  Similarly the 
    stdin from the lamd to the app should have been restarted, but this is 
    less certain than for mpirun.
    
    It is also possible that LAM is not restoring the stdin *forwarding*. 
    Jeff, can you determine from an examination of the LAM source whether 
    the stdin forwarding is properly reestablished?  I know if I "lamexec n1 
    sleep 120" that the launched process gets /dev/null for stdin.  If the 
    mechanism that runs cr_restart on the compute nodes is similar, it may 
    also be getting /dev/null as *its* input which it then connects to the 
    application.
    
    To check on what the app is connected to, try 'ls -l /proc/PID/fd/0' for 
    the PID of mpirun and for each of the application processes.  If these 
    turn out to be incorrect, then we'll still need to determine if BLCR or 
    LAM is getting this wrong.
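    
    The check above can be scripted.  What follows is only a minimal sketch, 
    assuming a Linux /proc filesystem and that you already know the PIDs of 
    mpirun and of each application process.  The function names stdin_target 
    and show_stdin are made up for illustration and are not part of BLCR or 
    LAM.

```shell
#!/bin/sh
# Report what file descriptor 0 (stdin) of a process points to.
# Linux-specific: relies on the /proc/PID/fd symlinks.
stdin_target() {
    readlink "/proc/$1/fd/0" 2>/dev/null || echo "(unreadable or no such process)"
}

# Print one "PID<TAB>target" line for each PID given as an argument.
show_stdin() {
    for pid in "$@"; do
        printf '%s\t%s\n' "$pid" "$(stdin_target "$pid")"
    done
}

# Usage (the PIDs here are placeholders, not real values):
#   show_stdin 12345 12346
```

    Running it once before the checkpoint and once after cr_restart makes 
    the two sets of targets easy to compare.  A target of /dev/null after 
    restart, where there was a tty or pipe before, would point at the 
    forwarding problem described above.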
    
    -Paul
    
    Jeff Squyres wrote:
    > It could well be that stdin is not being checkpointed.
    > 
    > Paul?
    > 
    > 
    > On Mar 23, 2005, at 12:18 PM, 任明明 wrote:
    > 
    >>
    >> I changed for another program which just does matrix multiplication,
    >> this time checkpoint and restart of the MPI program worked very well.
    >>
    >>
    >> In your message you wrote:
    >>
    >>> From: "任明明" <0110018@mail.nankai.edu.cn>
    >>> Reply-To: "任明明" <0110018@mail.nankai.edu.cn>
    >>> To: checkpoint_at_lbl_dot_gov
    >>> Subject: Re: lam/mpi blcr problem
    >>> Date:Thu, 24 Mar 2005 00:33:10 +0800
    >>>
    >>>
    >>> It seems OK now; at least I can see the context files for each process.
    >>> But with my cpi program (it needs input from the first process, and I
    >>> checkpointed it while it was waiting for keyboard input),
    >>> the program quits quickly when I use cr_restart.
    >>> By the way, when I run cr_checkpoint PID-of-mpirun (without --term) on
    >>> this cpi example program, it quits running. I don't know what the
    >>> problem is, and I hope I have expressed it clearly. :-)
    >>>
    >>> Thank you for your valuable information.
    >>>
    >>>
    >>> In your message you wrote:
    >>>
    >>>> From: "任明明" <0110018@mail.nankai.edu.cn>
    >>>> Reply-To: "任明明" <0110018@mail.nankai.edu.cn>
    >>>> To: checkpoint_at_lbl_dot_gov
    >>>> Subject: Re: lam/mpi blcr problem
    >>>> Date:Wed, 23 Mar 2005 23:34:11 +0800
    >>>>
    >>>>
    >>>> I will, I will use this version:
    >>>>
    >>>> http://www.lam-mpi.org/download/files/lam-7.1.2b18.tar.bz2
    >>>>
    >>>> In your message you wrote:
    >>>>
    >>>>> From: Jeff Squyres <jsquyres@lam-mpi.org>
    >>>>> Reply-To:
    >>>>> To: "任明明" <0110018@mail.nankai.edu.cn>
    >>>>> Subject: Re: lam/mpi blcr problem
    >>>>> Date:Wed, 23 Mar 2005 10:17:38 -0500
    >>>>>
    >>>>> If you wouldn't mind, could you try the beta and ensure that it works
    >>>>> for you?
    >>>>>
    >>>>>
    >>>>> On Mar 23, 2005, at 9:35 AM, 任明明 wrote:
    >>>>>
    >>>>>>
    >>>>>> Thank you very much! I will wait for the new version.
    >>>>>> And Thank you all.
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
