From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Mar 23 2005 - 09:50:49 PST
At restart time mpirun should be reconnected to stdin just fine (in fact
there is presently no way to prevent it from doing so). Similarly, the
stdin from the lamd to the app should have been restarted, but this is
less certain than for mpirun. It is also possible that LAM is not
restoring the stdin *forwarding*. Jeff, can you determine from an
examination of the LAM source whether the stdin forwarding is properly
reestablished?

I know that if I run "lamexec n1 sleep 120", the launched process gets
/dev/null for stdin. If the mechanism that runs cr_restart on the
compute nodes is similar, it may also be getting /dev/null as *its*
input, which it then connects to the application.

To check what the app is connected to, try 'ls -l /proc/PID/fd/0' for
the PID of mpirun and for each of the application processes. If these
turn out to be incorrect, then we'll still need to determine whether
BLCR or LAM is getting this wrong.

-Paul

Jeff Squyres wrote:
> It could well be that stdin is not being checkpointed.
>
> Paul?
>
>
> On Mar 23, 2005, at 12:18 PM, 任明明 wrote:
>
>>
>> I changed for another program which just does matrix multiplication,
>> this time checkpoint and restart of the MPI program worked very well.
>>
>>
>> In your message you wrote:
>>
>>> From: "任明明" <[email protected]>
>>> Reply-To: "任明明" <[email protected]>
>>> To: checkpoint_at_lbl_dot_gov
>>> Subject: Re: lam/mpi blcr problem
>>> Date: Thu, 24 Mar 2005 00:33:10 +0800
>>>
>>>
>>> it seems ok now, at least i can see the context files for each process.
>>> but as to my cpi program(it needs input from the first process, and i
>>> checkpointed it when it is waiting for the keyboard input),
>>> when use cr_restart, the program quits quickly.
>>> by the way, when use cr_checkpoint PID-of-mpirun(doesn't use --term) to
>>> this cpi example program, it quits running. I don't know what's the
>>> problem
>>> is, and wish i have expressed this problem clearly.:-)
>>>
>>> Thank you for your valuable information.
>>>
>>>
>>> In your message you wrote:
>>>
>>>> From: "任明明" <[email protected]>
>>>> Reply-To: "任明明" <[email protected]>
>>>> To: checkpoint_at_lbl_dot_gov
>>>> Subject: Re: lam/mpi blcr problem
>>>> Date: Wed, 23 Mar 2005 23:34:11 +0800
>>>>
>>>>
>>>> I will, I will use this version:
>>>>
>>>> http://www.lam-mpi.org/download/files/lam-7.1.2b18.tar.bz2
>>>>
>>>> In your message you wrote:
>>>>
>>>>> From: Jeff Squyres <[email protected]>
>>>>> Reply-To:
>>>>> To: "任明明" <[email protected]>
>>>>> Subject: Re: lam/mpi blcr problem
>>>>> Date: Wed, 23 Mar 2005 10:17:38 -0500
>>>>>
>>>>> If you wouldn't mind, could you try the beta and ensure that it works
>>>>> for you?
>>>>>
>>>>>
>>>>> On Mar 23, 2005, at 9:35 AM, 任明明 wrote:
>>>>>
>>>>>>
>>>>>> Thank you very much! I will wait for the new version.
>>>>>> And Thank you all.

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
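
The stdin check suggested in this message (looking at /proc/PID/fd/0 for
mpirun and for each application process) could be scripted roughly as
follows. This is only a sketch in Python and is not part of BLCR or LAM:
the names stdin_target and report, and the proc_root parameter, are
hypothetical, and it assumes a Linux-style /proc filesystem and enough
permission to inspect the target processes.

```python
import os
import sys


def stdin_target(pid, proc_root="/proc"):
    """Return what file descriptor 0 of `pid` points to, or None if unreadable."""
    link = os.path.join(proc_root, str(pid), "fd", "0")
    try:
        return os.readlink(link)
    except OSError:
        # The process is gone, or we lack permission to inspect it.
        return None


def report(pids, proc_root="/proc"):
    """Print and return the stdin target of each PID, flagging /dev/null."""
    results = {}
    for pid in pids:
        target = stdin_target(pid, proc_root)
        results[pid] = target
        flag = "  <-- stdin is /dev/null" if target == "/dev/null" else ""
        print(f"{pid}: {target}{flag}")
    return results


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_stdin.py PID [PID ...]
    report([int(arg) for arg in sys.argv[1:] if arg.isdigit()])
```

Running it with the PID of mpirun and the PIDs of the restarted
application processes shows which of them, if any, ended up with
/dev/null as stdin. It does not by itself say whether BLCR or LAM is
responsible.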