Re: dlopen() libcr.so, and problem with C++ compilers?


From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Feb 13 2009 - 11:56:48 PST

Next message: Ted Cabeen: "Problems with --enable-restore-ids"

Previous message: Alan Woodland: "dlopen() libcr.so, and problem with C++ compilers?"
In reply to: Alan Woodland: "dlopen() libcr.so, and problem with C++ compilers?"

Alan,
  Thanks for your interest in BLCR.  Please see my answers below.
-Paul

Alan Woodland wrote:
> Hi,
>
> I've been working on using BLCR with my application, and I've
> encountered a few issues:
>
> Q1:
>
> I've been trying to integrate BLCR into one of my applications such
> that it will transparently work provided the machine the application
> is running on has the BLCR library and kernel modules available.
>
> I was under the impression from the documentation that a sensible way
> to make this work on both machines with and without BLCR was to use
> dlopen()/dlsym() at run time, but the problem is that
> cr_initialize_restart_args_t and cr_initialize_checkpoint_args_t are
> both macros, which means they're not symbols in libcr.so - my only
> options are to use the private interfaces they call (nasty) or link at
> compile time here.
>
> Any suggestions for a better work-around than using the private
> interfaces? It makes the software engineer in me die a little to do
> that!
>   

I am glad to hear that you have that inner software engineer.  The 
point of making internal interfaces internal is that we need to change 
them from time to time, and they DO change.  I know of at least one 
high-profile project that is stuck with an older version of BLCR 
because it uses some internal interfaces that changed.

The OpenMPI project is doing pretty much what you are (one build that 
works both w/ and w/o BLCR present) using dlopen().  The only difference 
is that you will need one level of indirection.  You are almost there 
when you say "or link at compile time". What you need is to build 
"myblcrsupport.o" or "myblcrsupport.so" that does link to libcr.so at 
compile time, but the rest of your project does not.  In that 
object/library your calls to the initializer macros will be expanded.  
Then it is the "myblcrsupport.{o,so}" that you dlopen() instead of libcr 
(or in addition to it, depending how you want to deal with RTLD_NOW vs 
RTLD_LAZY).
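
A minimal sketch of the application side of that indirection follows. It 
assumes a wrapper library that exports a single function named 
my_blcr_checkpoint taking a file descriptor; that name, its signature, and 
the blcr_support_* helpers are hypothetical and are not part of BLCR. The 
wrapper itself (not shown) would be the only code that includes libcr.h and 
expands the initializer macros.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical entry point exported by myblcrsupport.so.  Inside that
 * library, which is linked against libcr.so at build time, the function
 * would expand the initializer macros and request the checkpoint. */
typedef int (*my_checkpoint_fn)(int fd);

struct blcr_support {
    void *handle;                /* dlopen() handle, NULL when unavailable */
    my_checkpoint_fn checkpoint; /* NULL when unavailable */
};

/* Try to load the wrapper.  Returns 1 if checkpoint support is available,
 * 0 otherwise.  The application keeps running either way. */
static int blcr_support_load(struct blcr_support *s, const char *path)
{
    assert(s != NULL && path != NULL);
    s->handle = NULL;
    s->checkpoint = NULL;

    void *h = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (h == NULL) {
        fprintf(stderr, "checkpoint support disabled: %s\n", dlerror());
        return 0;
    }

    /* Assign through a void ** so the object-pointer result of dlsym()
     * can be stored in a function pointer without a direct cast. */
    my_checkpoint_fn fn = NULL;
    *(void **)&fn = dlsym(h, "my_blcr_checkpoint");
    if (fn == NULL) {
        dlclose(h);
        return 0;
    }

    s->handle = h;
    s->checkpoint = fn;
    return 1;
}

/* Release the wrapper if it was loaded. */
static void blcr_support_unload(struct blcr_support *s)
{
    if (s->handle != NULL) {
        dlclose(s->handle);
    }
    s->handle = NULL;
    s->checkpoint = NULL;
}
```

Because the wrapper lists libcr.so as a dependency, dlopen() of the wrapper 
fails on a machine where libcr.so cannot be found, so the single failure 
path above covers hosts without BLCR.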

> Q2:
>
> This one's quite minor - I should really just use a C compiler instead
> I guess...
>
> The macro  CR_RSTRT_RELOCATE_SIZE(CR_MAX_RSTRT_RELOC) has made life in
> C++ harder. It evaluates to:
>
> (sizeof(struct cr_rstrt_relocate)
>  + (16) * sizeof(struct cr_rstrt_relocate_pair))
>
> But I think in C++ this is one of those subtle C/C++ differences. In
> C++ I think it needs to be
> sizeof(cr_rstrt_relocate::cr_rstrt_relocate_pair)?
> (sizeof(struct cr_rstrt_relocate) + (CR_MAX_RSTRT_RELOC *
> sizeof(cr_rstrt_relocate::cr_rstrt_relocate_pair))) works instead.
>
> If cr_rstrt_relocate_pair were defined and declared outside of
> cr_rstrt_relocate that would work for both C and C++ with the current
> macro? Or a #ifdef __cplusplus for two versions of that macro?
>
>   

I see the point here and since I almost never write C++ code myself 
(though I read it just fine), I missed this problem when writing this 
macro.  I think the preferred solution is your first: define the 
relocate_pair at file scope, rather than in a nested scope.  I will try 
to get that change in 0.8.1 (expected early March or when the 2.6.29 
kernel is released).
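
To illustrate the file-scope layout, here is a small sketch using stand-in 
types. The demo_* names and the oldpath field are assumptions made for 
illustration and are not BLCR's real definitions. With the pair struct 
declared at file scope, "struct demo_relocate_pair" names the same type in 
C and in C++, so one sizing macro serves both. The flexible array member 
is C99; common C++ compilers accept it as an extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in types for illustration only. */
struct demo_relocate_pair {          /* declared at file scope */
    const char *oldpath;
    const char *newpath;
};

struct demo_relocate {
    unsigned int count;              /* number of entries in path[] */
    struct demo_relocate_pair path[]; /* entries follow the header */
};

/* Bytes needed for a header plus n pairs.  No scope qualifier is needed
 * because the pair type is not nested inside demo_relocate. */
#define DEMO_RELOCATE_SIZE(n) \
    (sizeof(struct demo_relocate) + (n) * sizeof(struct demo_relocate_pair))

/* Allocate a zeroed table with room for n pairs; the caller frees it. */
static struct demo_relocate *demo_relocate_alloc(unsigned int n)
{
    struct demo_relocate *r =
        (struct demo_relocate *)calloc(1, DEMO_RELOCATE_SIZE(n));
    if (r != NULL) {
        r->count = n;
    }
    return r;
}
```

Had the pair struct been nested inside demo_relocate, a C++ compiler would 
require the qualified name demo_relocate::demo_relocate_pair in the macro, 
which is the mismatch described in the question.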

As a side note, you are lucky you didn't try a C++ compiler with the 
BLCR 0.7.x series.  Back then the "newpath" member in struct 
cr_rstrt_relocate_pair was named "new"!!

> Q3:
>
> Is it safe to write things into the file before the checkpoint itself?
> I want to write information about relocations that will be needed by
> my application into the same file. It seems to be working, but would
> it be better being written after? Could it ever seek to the beginning
> of a file explicitly during loading? Or will it always just start from
> where the file was when it was given it? Does it ignore extra bits at
> the end of a file? It seems to work fine with extra info at the
> beginning of the file provided I make the seek on the file handle to
> the appropriate point first. Is this 'as designed' and guaranteed to
> work with future versions?
>   

By design, BLCR will never seek to the beginning of the context file (no 
seeks at all, in fact).  This was decided both to allow exactly what you 
are doing now, as well as to allow a checkpoint to be sent through a 
non-seekable channel such as a pipe between processes or a socket 
between nodes.  So, it *is* guaranteed to continue working.   The one 
thing you might want to be aware of is that if there is an error while 
checkpointing (or restarting), there is no guarantee about how many 
bytes have been written (or read).  For instance, a failed checkpoint 
may have written a useless partial file, while a failed restart may have 
read only a portion of the file that was written at checkpoint time.  
So, if you plan to have some way to recover from such a failure, then 
you may need to take this into account if you ever place your own data 
after BLCR's data.
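
As a sketch of placing application data ahead of the context without ever 
seeking, the helpers below write a length-prefixed header and read exactly 
that many bytes back, leaving the descriptor positioned at the first byte 
after the header. The app_header_* names and the 4-byte host-order length 
prefix are assumptions for illustration, not anything BLCR defines. The 
checkpoint or restart request would then be issued on the same descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Write exactly len bytes, retrying on short writes and EINTR. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = (const char *)buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes; fails on error or on EOF before len bytes. */
static int read_all(int fd, void *buf, size_t len)
{
    char *p = (char *)buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        if (n == 0) return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Emit a 4-byte length (host byte order) followed by the header bytes. */
static int app_header_write(int fd, const void *data, uint32_t len)
{
    assert(data != NULL || len == 0);
    if (write_all(fd, &len, sizeof len) != 0) return -1;
    return write_all(fd, data, len);
}

/* Consume the header into buf.  Returns its length, or -1 on error or if
 * it does not fit in cap bytes.  Uses only forward reads, so it works on
 * pipes and sockets as well as regular files. */
static long app_header_read(int fd, void *buf, uint32_t cap)
{
    uint32_t len;
    assert(buf != NULL);
    if (read_all(fd, &len, sizeof len) != 0) return -1;
    if (len > cap) return -1;
    if (read_all(fd, buf, len) != 0) return -1;
    return (long)len;
}
```

Keeping the application data in front means it has been fully written or 
read before BLCR touches the stream, so a failed checkpoint or restart 
cannot leave the header half-processed.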

> Thanks,
> Alan
>   

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
