Re: Problems with BLCR?

From: Jeff Squyres (jsquyres_at_open-mpi.org)
Date: Wed Jul 27 2005 - 12:32:48 PDT

    I didn't dig, but I'm guessing that it calls aio_init() (or whatever) 
    -- doesn't that spawn off another thread and/or set up things with 
    resources that could be non-checkpointable?
    
    
    On Jul 27, 2005, at 11:21 AM, Paul H. Hargrove wrote:
    
    > Jeff,
    >
    >  I am not sure this explains why a simple hello world program should 
    > fail to restart.  Even if romio runs some initialization code at 
    > MPI_Init time, I can't see how any actual async I/O would be started.
    >
    > -Paul
    >
    > Jeff Squyres wrote:
    >
    >> On Jul 26, 2005, at 5:01 PM, Paul H. Hargrove wrote:
    >>
    >>>   There is no support in current BLCR versions for either POSIX or 
    >>> Linux-native async I/O.  While this has nothing to do with 
    >>> whatever linker problems Jeff mentioned, it could be the cause of 
    >>> the problems you've been seeing.
    >>
    >>
    >> I'm inferring from Pradeep's mail that there was an RPM that was 
    >> removed, but has now been replaced (LAM won't use libaio unless it 
    >> finds it during configure -- so it must have been there at some point 
    >> and then was later removed).
    >>
    >>>   How/when is async I/O used in LAM?  Is there a simple way to 
    >>> disable it via ssi params?
    >>
    >>
    >> It's used in ROMIO.  There are currently no SSI params to remove its 
    >> use -- part of the problem is that the wrapper compilers add "-laio".  
    >> So it's not just a run-time switch to change ROMIO's behavior, it's 
    >> a compile-time decision (ROMIO makes a bunch of decisions and sets 
    >> #defines based on whether AIO is present or not) for both LAM and 
    >> ROMIO.
    >>
    >> But this also explains why we rarely (never?) saw this problem in our 
    >> own testing -- the vast majority of our manual testing builds disable 
    >> ROMIO because it takes so long to compile.  Urgh.  This also explains 
    >> why my LAM build on Pradeep's system worked -- I configured and built 
    >> LAM after the libaio-devel RPM was removed, so my build did not add 
    >> -laio.
    >>
    >> The quick and easy solution is to disable ROMIO ("--without-romio").  
    >> Not really an optimal solution, but it'll work.
    >>
    >
    > -- 
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group                 HPC Research Department
    >                                           Tel: +1-510-495-2352
    > Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    >
    >
    
    -- 
    {+} Jeff Squyres
    {+} The Open MPI Project
    {+} http://www.open-mpi.org/
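
To illustrate the point above that AIO use is a compile-time decision rather than a run-time switch, here is a minimal sketch in C. It is not ROMIO's or LAM's actual code. HAVE_AIO is a hypothetical macro standing in for whatever symbol configure would define, io_backend() and write_all() are made-up names, and the async path uses POSIX AIO from <aio.h> rather than the Linux-native libaio that "-laio" refers to. On glibc the POSIX AIO path is serviced by helper threads and, on older versions, needs an extra link flag (-lrt).

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* HAVE_AIO is a hypothetical name: a stand-in for whatever macro a
 * configure script would define when it finds an AIO library. */
#ifdef HAVE_AIO
#include <aio.h>
#include <errno.h>
#endif

/* Report which I/O path was compiled in. */
const char *io_backend(void)
{
#ifdef HAVE_AIO
    return "async";
#else
    return "blocking";
#endif
}

/* Write len bytes from buf to fd using the path chosen at build time.
 * Returns the number of bytes written, or -1 on error. */
ssize_t write_all(int fd, const void *buf, size_t len)
{
#ifdef HAVE_AIO
    struct aiocb cb;
    const struct aiocb *const list[1] = { &cb };

    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = (void *)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;            /* the sketch always writes at offset 0 */

    if (aio_write(&cb) != 0)
        return -1;
    /* Block until the request is no longer in progress. */
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);
    return aio_return(&cb);
#else
    return write(fd, buf, len);
#endif
}
```

Built without -DHAVE_AIO, the binary contains no AIO calls at all, which matches the observation that a build configured after libaio-devel was removed did not add -laio. Built with it, the async path is baked in and no run-time parameter can turn it off.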
    
