Re: Questions on BLCR..

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Jan 11 2005 - 10:53:34 PST

  • Next message: jcduell_at_lbl_dot_gov: "Re: [dehua999@sjtu.edu.cn: can blcr work well with the 'mpirun -ton .....'?]"
    Tarun,
       I am still working on Linux 2.6 and Opteron support.  I had hoped to 
    be done with 2.6 by Jan 1, but am running behind.  At this point blcr 
    passes the single-threaded tests on an Athlon running SuSE Linux 9.2 (a 
    2.6.8 kernel), but gets a kernel Oops on the multi-threaded tests.  I 
    believe that there is an uninitialized pointer or a similar problem in 
    the kernel module, which is proving difficult to track down.
    
       I am afraid I don't have a very accurate estimate on session or 
    process group support at this time.  I'd certainly like to see this 
    support done in time for an April release.
    
       I am also sorry to tell you that there is no way to checkpoint a 
    process tree with the current BLCR.  The problem is that at restart 
    time there is presently no "resource naming" that would allow 
    identification of the file descriptors shared among the processes 
    (such as the common connection to stdin and stdout, or the pipes 
    between processes).
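    
    To illustrate what is meant by shared descriptors, here is a minimal 
    sketch in Python (not BLCR code, and not how BLCR does or will do it). 
    It assumes a Linux system, and the helper names resource_name and 
    child_sees_same_pipe are hypothetical.  It uses the (device, inode) 
    pair from fstat as one possible "name" for the object behind a 
    descriptor, and shows that a parent and its forked child see the same 
    pipe.  Note that this pair identifies the pipe itself, not the open 
    file description, so a real scheme would need more than this.

```python
import os


def resource_name(fd):
    """Hypothetical 'name' for the object behind fd: (device, inode)."""
    st = os.fstat(fd)
    return (st.st_dev, st.st_ino)


def child_sees_same_pipe():
    """Fork a child and check that it sees the same pipe as the parent."""
    r, w = os.pipe()
    parent_name = resource_name(w)

    pid = os.fork()
    if pid == 0:
        # Child: compute its own view of the inherited write end and
        # report it back to the parent through that same pipe.
        os.close(r)
        dev, ino = resource_name(w)
        os.write(w, f"{dev},{ino}".encode())
        os._exit(0)

    # Parent: close its write end so that read() sees EOF once the
    # child has exited, then collect the child's report.
    os.close(w)
    data = b""
    while True:
        chunk = os.read(r, 64)
        if not chunk:
            break
        data += chunk
    os.close(r)
    os.waitpid(pid, 0)

    dev, ino = (int(x) for x in data.decode().split(","))
    return parent_name == (dev, ino)


if __name__ == "__main__":
    print(child_sees_same_pipe())
```

    A checkpoint of the parent and child taken separately would record two 
    descriptors with no indication that they refer to one pipe, which is 
    the information that would be needed at restart time.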
    
    -Paul
    
    
    Tarun Agarwal wrote:
    > Hi Paul,
    > 
    > I had met you at SC2004. As I had said I am working on integrating
    > checkpointing support using BLCR in a batch system here at UIUC. Saving
    > sessions seems critical to using BLCR for checkpointing. You had put that
    > in ongoing work at that time. I'd appreciate it if you could tell me when
    > this support can be expected. Alternatively, is there some way of 
    > checkpointing a process subtree (say a shell script and its forks) in the 
    > current version?
    > 
    > Thanks
    > Tarun
    > 
    > On Wed, 3 Nov 2004, Paul H. Hargrove wrote:
    > 
    > 
    >>I am hoping to have the 2.6 port for ia32 done by Jan 1.  I expect that the
    >>Opteron-specific support will be finished at about the same time, or soon after
    >>that.  The speed with which we can get Opteron support implemented will depend
    >>in part on availability of test platforms.
    >>
    >>-Paul
    >>
    >>Tarun Agarwal wrote:
    >>
    >>>Thanks for the quick response. Is there some time frame that you have in
    >>>mind for the 2.6 kernel compatible release of BLCR?
    >>>
    >>>Thanks
    >>>Tarun
    >>>
    >>>
    >>>
    >>>On Tue, 2 Nov 2004, Paul H. Hargrove wrote:
    >>>
    >>>
    >>>
    >>>>BLCR does not support the Opteron at all at this time.
    >>>>Support for Opteron will be for the 2.6 kernel only, and that work is
    >>>>still in
    >>>>progress.
    >>>>
    >>>>-Paul
    >>>>
    >>>>Tarun Agarwal wrote:
    >>>>
    >>>>
    >>>>>Hi
    >>>>>
    >>>>>I am trying to use BLCR on Linux 2.4 running on Opteron machine. Does
    >>>>>BLCR
    >>>>>work on the AMD Opteron architecture running 2.4 kernel? I got the
    >>>>>following error upon running make :
    >>>>>
    >>>>># make
    >>>>>make  all-recursive
    >>>>>make[1]: Entering directory `/home/kale/testmpi/tarun/blcr-0.2.3'
    >>>>>Making all in man
    >>>>>make[2]: Entering directory `/home/kale/testmpi/tarun/blcr-0.2.3/man'
    >>>>>make[2]: Nothing to be done for `all'.
    >>>>>make[2]: Leaving directory `/home/kale/testmpi/tarun/blcr-0.2.3/man'
    >>>>>Making all in include
    >>>>>make[2]: Entering directory
    >>>>>`/home/kale/testmpi/tarun/blcr-0.2.3/include'
    >>>>>make[2]: Nothing to be done for `all'.
    >>>>>make[2]: Leaving directory `/home/kale/testmpi/tarun/blcr-0.2.3/include'
    >>>>>Making all in cr_module
    >>>>>make[2]: Entering directory
    >>>>>`/home/kale/testmpi/tarun/blcr-0.2.3/cr_module'
    >>>>>if gcc -DHAVE_CONFIG_H -I. -I. -I.. -I../include -I../include
    >>>>>-I../vmadump
    >>>>>-I/usr/src/linux-2.4/include -D__KERNEL__ -DMODULE   -Wall
    >>>>>-Wstrict-prototypes -O2 -fomit-frame-pointer  -g -O2 -MT cr_dump_self.o
    >>>>>-MD
    >>>>>-MP -MF ".deps/cr_dump_self.Tpo" \
    >>>>> -c -o cr_dump_self.o `test -f 'cr_dump_self.c' || echo
    >>>>>'./'`cr_dump_self.c; \
    >>>>>then mv -f ".deps/cr_dump_self.Tpo" ".deps/cr_dump_self.Po"; \
    >>>>>else rm -f ".deps/cr_dump_self.Tpo"; exit 1; \
    >>>>>fi
    >>>>>In file included from cr_dump_self.c:35:
    >>>>>../vmadump/vmadump.h:84:2: #error VMADUMP does not support this
    >>>>>architecture
    >>>>>cr_dump_self.c: In function `cr_do_coredump':
    >>>>>cr_dump_self.c:70: warning: implicit declaration of function
    >>>>>`get_pt_regs'
    >>>>>cr_dump_self.c:71: warning: passing arg 2 of pointer to function makes
    >>>>>pointer from integer without a cast
    >>>>>cr_dump_self.c: In function `cr_do_vmadump':
    >>>>>cr_dump_self.c:1103: warning: passing arg 2 of `vmadump_freeze_threads'
    >>>>>makes pointer from integer without a cast
    >>>>>make[2]: *** [cr_dump_self.o] Error 1
    >>>>>make[2]: Leaving directory
    >>>>>`/home/kale/testmpi/tarun/blcr-0.2.3/cr_module'
    >>>>>make[1]: *** [all-recursive] Error 1
    >>>>>make[1]: Leaving directory `/home/kale/testmpi/tarun/blcr-0.2.3'
    >>>>>make: *** [all] Error 2
    >>>>>#
    >>>>>
    >>>>>Thanks
    >>>>>Tarun Agarwal
    >>>>>Graduate Student, CS, UIUC.
    >>>>
    >>>>
    >>>>-- 
    >>>>Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >>>>Future Technologies Group
    >>>>HPC Research Department                   Tel: +1-510-495-2352
    >>>>Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    >>>>
    >>>>
    >>
    >>
    >>-- 
    >>Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >>Future Technologies Group
    >>HPC Research Department                   Tel: +1-510-495-2352
    >>Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    >>
    >>
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
