From: Jeff Squyres (jsquyres_at_lam-mpi.org)
Date: Wed Mar 23 2005 - 11:16:11 PST
Actually, this is quite likely (that LAM is not reconnecting stdin).

On Mar 23, 2005, at 12:50 PM, Paul H. Hargrove wrote:

> At restart time mpirun should be reconnected to stdin just fine (in
> fact there is presently no way to prevent it from doing so). Similarly
> the stdin from the lamd to the app should have been restarted, but
> this is less certain than mpirun.
>
> It is also possible that LAM is not restoring the stdin *forwarding*.
> Jeff, can you determine from an examination of the LAM source whether
> the stdin forwarding is properly reestablished? I know if I "lamexec
> n1 sleep 120" that the launched process gets /dev/null for stdin. If
> the mechanism that runs cr_restart on the compute nodes is similar, it
> may also be getting /dev/null as *its* input which it then connects to
> the application.
>
> To check on what the app is connected to, try 'ls -l /proc/PID/fd/0'
> for the PID of mpirun and for each of the application processes. If
> these turn out to be incorrect, then we'll still need to determine if
> BLCR or LAM is getting this wrong.
>
> -Paul
>
> Jeff Squyres wrote:
>> It could well be that stdin is not being checkpointed.
>> Paul?
>> On Mar 23, 2005, at 12:18 PM, 任明明 wrote:
>>>
>>> I switched to another program which just does matrix multiplication;
>>> this time checkpoint and restart of the MPI program worked very well.
>>>
>>> In your message you wrote:
>>>
>>>> From: "任明明" <[email protected]>
>>>> Reply-To: "任明明" <[email protected]>
>>>> To: checkpoint_at_lbl_dot_gov
>>>> Subject: Re: lam/mpi blcr problem
>>>> Date: Thu, 24 Mar 2005 00:33:10 +0800
>>>>
>>>> It seems OK now; at least I can see the context files for each
>>>> process.
>>>> But as to my cpi program (it needs input from the first process, and
>>>> I checkpointed it while it was waiting for the keyboard input),
>>>> when I use cr_restart, the program quits quickly.
>>>> By the way, when I use cr_checkpoint PID-of-mpirun (without --term)
>>>> on this cpi example program, it quits running. I don't know what the
>>>> problem is, and I hope I have expressed this problem clearly. :-)
>>>>
>>>> Thank you for your valuable information.
>>>>
>>>> In your message you wrote:
>>>>
>>>>> From: "任明明" <[email protected]>
>>>>> Reply-To: "任明明" <[email protected]>
>>>>> To: checkpoint_at_lbl_dot_gov
>>>>> Subject: Re: lam/mpi blcr problem
>>>>> Date: Wed, 23 Mar 2005 23:34:11 +0800
>>>>>
>>>>> I will. I will use this version:
>>>>>
>>>>> http://www.lam-mpi.org/download/files/lam-7.1.2b18.tar.bz2
>>>>>
>>>>> In your message you wrote:
>>>>>
>>>>>> From: Jeff Squyres <[email protected]>
>>>>>> Reply-To:
>>>>>> To: "任明明" <[email protected]>
>>>>>> Subject: Re: lam/mpi blcr problem
>>>>>> Date: Wed, 23 Mar 2005 10:17:38 -0500
>>>>>>
>>>>>> If you wouldn't mind, could you try the beta and ensure that it
>>>>>> works for you?
>>>>>>
>>>>>> On Mar 23, 2005, at 9:35 AM, 任明明 wrote:
>>>>>>
>>>>>>> Thank you very much! I will wait for the new version.
>>>>>>> And thank you all.
>
> --
> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
> Future Technologies Group
> HPC Research Department                   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900

--
{+} Jeff Squyres
{+} [email protected]
{+} http://www.lam-mpi.org/
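
For anyone wanting to script the 'ls -l /proc/PID/fd/0' check that Paul suggests in the quoted message, here is a minimal sketch in Python. It assumes a Linux /proc filesystem; the function names `stdin_target` and `describe_stdin` are hypothetical helpers made up for this illustration and are not part of LAM or BLCR. It only reports what fd 0 points at; it cannot say whether BLCR or LAM is responsible for a wrong value.

```python
import os
import sys
from typing import Optional


def stdin_target(pid: int) -> Optional[str]:
    """Return what fd 0 of `pid` points at, or None if it cannot be read.

    Equivalent in spirit to: ls -l /proc/PID/fd/0
    """
    try:
        return os.readlink(f"/proc/{pid}/fd/0")
    except OSError:
        # No such process, no permission, or no /proc on this system.
        return None


def describe_stdin(target: Optional[str]) -> str:
    """Turn the raw link target into a short human-readable note."""
    if target is None:
        return "unreadable (no such process, no permission, or no /proc)"
    if target == "/dev/null":
        # Reads from /dev/null return EOF immediately.
        return "/dev/null (no input will reach this process)"
    if target.startswith(("/dev/pts/", "/dev/tty")):
        return f"terminal {target}"
    if target.startswith(("pipe:", "socket:")):
        return f"{target} (possibly a forwarding channel)"
    return target


if __name__ == "__main__":
    # Usage: pass the PID of mpirun and of each application process.
    for arg in sys.argv[1:]:
        pid = int(arg)
        print(pid, describe_stdin(stdin_target(pid)))
```

Run it with the PID of mpirun followed by the PIDs of the application processes, once before the checkpoint and once after cr_restart, and compare the two outputs.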