Re: using blcr on program with fork

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Mar 10 2009 - 11:41:40 PDT

  • Next message: Andrea Autiero S143785: "Re: using blcr on program with fork"
    Andrea,
    
      You are correct that the "restarter" should not need to be linked to
    any BLCR libraries if it uses system() to request the checkpoint and the
    restart.  If you later want to use the C equivalent of the
    cr_checkpoint and cr_restart utilities (for instance, to have more
    control), you would need to link the "full" libcr.a.
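
      As a rough illustration of the system()-based approach, here is a
    minimal sketch of a "restarter" that needs no BLCR libraries at link
    time.  The helper names (build_cmd, run_cmd, request_checkpoint,
    request_restart) are hypothetical and exist only for this example.  It
    assumes cr_checkpoint and cr_restart are on the PATH and that the
    context file name is supplied by the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Build "<tool> <arg>" into buf. Returns 0 on success, -1 if it does not fit. */
static int build_cmd(char *buf, size_t len, const char *tool, const char *arg)
{
    assert(buf != NULL && tool != NULL && arg != NULL);
    int n = snprintf(buf, len, "%s %s", tool, arg);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Run a command through the shell; return its exit status,
 * or -1 if it could not be run or did not exit normally. */
static int run_cmd(const char *cmd)
{
    int status = system(cmd);
    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* Request a checkpoint of pid by running "cr_checkpoint <pid>". */
int request_checkpoint(pid_t pid)
{
    char pidstr[32], cmd[128];
    snprintf(pidstr, sizeof pidstr, "%ld", (long)pid);
    if (build_cmd(cmd, sizeof cmd, "cr_checkpoint", pidstr) != 0)
        return -1;
    return run_cmd(cmd);
}

/* Request a restart by running "cr_restart <context_file>".
 * The string is passed to the shell unquoted, so it must be trusted. */
int request_restart(const char *context_file)
{
    char cmd[512];
    if (build_cmd(cmd, sizeof cmd, "cr_restart", context_file) != 0)
        return -1;
    return run_cmd(cmd);
}
```

      A caller would invoke request_checkpoint(child_pid) and later
    request_restart() with the path of the context file that cr_checkpoint
    wrote.  A nonzero return value means the utility reported a failure or
    could not be run at all.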
    
      I cannot be certain what the problem is with your CR_ENOSUPPORT error,
    but I do have a couple of things you could try.
    
    1)  The warning about dlopen in statically linked applications is just a 
    warning, not an error, and BLCR should know what to do when dlopen() 
    fails.  However, I don't typically test on a system w/o shared libraries 
    and so if BLCR is getting this wrong that could be one possible reason 
    for your failure.  Looking quickly at the code, the following one-line 
    change might fix things, but I am not very confident about that:
    
    --- libcr/cr_libinit.c  14 Feb 2009 02:31:36 -0000      1.14.6.1
    +++ libcr/cr_libinit.c  10 Mar 2009 18:15:44 -0000
    @@ -143,7 +143,7 @@
         //
         if (CR_SIGNUM != __libc_current_sigrtmax()) {
            // Signal is already allocated.  Should we keep or replace?
    -       void *full_handler = NULL;
    +       void *full_handler = (void*)&cri_init; /* Cannot match */
            void *dlhandle = dlopen(NULL, RTLD_LAZY);
            if (dlhandle) {
                // Note that the preloaded one has been name-shifted
    
    2)  If the one-line patch above doesn't fix the problem, then I must ask
    whether you have been able to run the BLCR testsuite successfully on your
    embedded platform.  You can find instructions for this in
    config/cross_helper.c.  If you get failures running the testsuite, we
    should focus our attention there rather than on your specific application.
    
    -Paul
    
    
    Andrea Autiero S143785 wrote:
    > hello..that's me another time..
    > now i've the following problem
    >
    > andrea@chisone:~/Desktop/materiale_tesi> source
    > ../programmi_per_tesi/eldk/eldk_init 4xx
    > ARCH=ppc
    > CROSS_COMPILE=ppc_4xx-
    > DEPMOD=/home/andrea/Desktop/programmi_per_tesi/eldk/usr/bin/depmod.pl
    > PATH=/home/andrea/Desktop/programmi_per_tesi/eldk/usr/bin:/home/andrea/Desktop/programmi_per_tesi/eldk/bin:/home/andrea/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/opt/kde3/bin:/opt/cross/bin:/usr/lib/jvm/jre/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/usr/local/bin:/usr/local/bin
    > andrea@chisone:~/Desktop/materiale_tesi> ${CROSS_COMPILE}gcc -static -o
    > ppc_controller controller2.c -Wall
    > -L/ppc_blcr/builddir/ppc_blcr/builddir/lib/ -lcr_run -u cr_run_link_me -ldl
    > -lpthread
    > /ppc_blcr/builddir/ppc_blcr/builddir/lib//libcr_run.a(libcr_run_la-cr_run.o):
    > In function `cri_init':
    > /home/andrea/Desktop/blcr-0.7.3/builddir/libcr/../../libcr/cr_libinit.c:148:
    > warning: Using 'dlopen' in statically linked applications requires at
    > runtime the shared libraries from the glibc version used for linking
    >
    > what would be the matter?
    > I'm trying to create an application which will be checkpointed and
    > restarted from another application
    > (via system("cr_checkpoint pid")..)
    > i think that the "restarter" doesn't need the link with blcr..
    > the file to be checkpointed doesn't work and give me an error
    >
    > blcr: retry request on -CR_ENOSUPPORT
    > Checkpoint failed: support missing from application
    >
    > thanks for any suggestions..
    > Andrea Autiero
    >
    >
    > On Wed, 25 Feb 2009 12:22:39 -0800, "Paul H. Hargrove" <PHHargrove_at_lbl_dot_gov>
    > wrote:
    >   
    >> Andrea Autiero S143785 wrote:
    >>     
    >>> i'm using shared memory in my program
    >>> removing every line refering to them let blcr checkpoint my
    >>> applications..
    >>> could be this the problem?
    >>>   
    >>>       
    >> Yes, that is almost certainly the problem.  In the dmesg output you sent 
    >> I found
    >>     blcr: vfs_read returned -22
    >>     blcr: write returned -22 on copy-out of mmap()ed data
    >>     blcr: vfs_read returned -22
    >>     blcr: write returned -22 on copy-out of mmap()ed data
    >> which is consistent with use of SysV or POSIX shared memory.
    >>
    >> Unfortunately, BLCR does not yet have support for SysV or POSIX shared 
    >> memory.  However, if you can change your program to instead use an 
    >> anonymous mmap() to obtain shared memory, that *is* supported by BLCR.
    >>
    >> Additionally, it is possible to construct a program with BLCR callbacks 
    >> that would disconnect from the shared memory when a checkpoint request 
    >> is received, allowing the checkpoint to be taken, and then reconnect 
    >> afterwards.  However, that opens up the messy issue of adding a 
    >> mechanism for preserving the shared memory values.
    >>
    >> -Paul
    >>
    >>
    >>     
    >>> On Mon, 23 Feb 2009 13:50:39 -0800, "Paul H. Hargrove"
    >>> <PHHargrove_at_lbl_dot_gov>
    >>> wrote:
    >>>   
    >>>       
    >>>> Andrea,
    >>>>
    >>>>   I cannot tell from the information you have provided what the problem
    >>>> might be.  If I construct a simple example program that behaves as you 
    >>>> describe, and I compile it as you describe, then I am able to checkpoint
    >>>> it and restart it just fine.
    >>>>   Could you please check the output of the "dmesg" command and/or your 
    >>>> system logs to see if there are any kernel messages that might help 
    >>>> explain the failure.
    >>>>
    >>>> -Paul
    >>>>
    >>>> Andrea Autiero S143785 wrote:
    >>>>     
    >>>>         
    >>>>> hi!
    >>>>> it's me another time..
    >>>>> after made statically linked file with blcr I've got another problem..
    >>>>> I'm trying to checkpoint a program after it forks twice
    >>>>> then from another shell (but in the future it will be done by the program
    >>>>> itself)
    >>>>> i try to checkpoint it and the answer is:
    >>>>>  >ps -a
    >>>>>    PID TTY          TIME CMD
    >>>>>    5878 pts/0    00:00:00 controller
    >>>>>    5879 pts/0    00:00:02 controller
    >>>>>    5880 pts/0    00:00:02 controller
    >>>>>    5881 pts/1    00:00:00 ps
    >>>>>  >cr_checkpoint 5878
    >>>>> Checkpoint failed: Invalid argument
    >>>>>
    >>>>> 5878 is the father..
    >>>>> i've compiled it by 
    >>>>>     >gcc -o controller controller.c -L/usr/local/lib/ -lcr_run -u
    >>>>> cr_run_link_me -ldl -lpthread
    >>>>>     >nm controller | grep _link_me
    >>>>>          U cr_run_link_me
    >>>>>
    >>>>> (now is not statically linked because I'm trying on a pc and not on an
    >>>>> embedded system, but is in the last one that it must work)
    >>>>> why it do this?could you help me to make it works?
    >>>>> thanks..
    >>>>> have a good day
    >>>>> Andrea Autiero
    >>>>>
    >>>>>
    >>>>>           
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
    
