Re: berkeley checkpointing

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Jan 02 2008 - 13:13:50 PST

  • Next message: Jerry Mersel: "Re: berkeley checkpointing"
      Let's try to deal with one problem at a time.  First I'd like to 
    address the "relocation error" and see if resolving it still leaves the 
    second error.
      The purpose of cr_run is to set LD_PRELOAD just as you have done 
    manually.  If you could, please tell me if the following two commands 
    (executed via SGE) each produce the same relocation error:
    ${BLCR_HOME}/bin/cr_run matlab -nojvm -nodisplay -nosplash < $H/test.m
    env matlab -nojvm -nodisplay -nosplash < $H/test.m
    If you could, also send the output of "env ldd /bin/cat" executed both
    from the command line and via SGE.
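
    One way to gather that comparison is sketched below. It is only a
    minimal sketch: the function name collect_diag and the file names in
    the usage note are hypothetical, not part of BLCR or SGE. It prints
    the preload-related environment and the "ldd /bin/cat" output, so the
    same script can be run from the command line and as an SGE job and
    the two results compared.

```shell
#!/bin/sh
# Print the dynamic-linker settings visible to this process, followed by
# the libraries that /bin/cat resolves to in this environment.
# collect_diag is an illustrative helper name (an assumption, not a BLCR tool).
collect_diag() {
    # ${VAR:-} expands to an empty string when the variable is unset.
    echo "LD_PRELOAD=${LD_PRELOAD:-}"
    echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-}"
    echo "--- ldd /bin/cat ---"
    # Keep going even if ldd itself fails, so the report is still produced.
    ldd /bin/cat 2>&1 || echo "ldd failed"
}

collect_diag
```

    Assuming the script is saved as diag.sh (a hypothetical name), running
    "sh diag.sh > cli.txt" from a login shell and submitting the same
    script through SGE (for example "qsub -cwd -j y -o sge.txt diag.sh")
    gives two files that can be compared with "diff cli.txt sge.txt".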
    Jerry Mersel wrote:
    > I managed to checkpoint matlab processes from the command line.
    > But when I want to use SGE I get these errors:
    > /lib64/ relocation error: /lib64/tls/ symbol errno, version GLIBC_PRIVATE not defined in file with link time reference
    > Restart failed: No such device or address
    > I get the relocation error at startup when using cr_run.
    > I get the "Restart failed" error when trying to restart.
    > I start matlab thus:
    > ${BLCR_HOME}/bin/cr_run env matlab -nojvm -nodisplay -nosplash < $H/test.m
    > and try to restart thus:
    > ${BLCR_HOME}/bin/cr_restart $ckptfile
    > my log file says this:
    > Jan  2 14:24:36 kam02 kernel: Skipping a socket.
    > Jan  2 14:24:36 kam02 kernel: Skipping a socket.
    > Jan  2 14:26:03 kam02 kernel: Failed to open chrdev major=5 minor=0 path='/dev/tty')
    > Jan  2 14:26:03 kam02 kernel: cr_restore_all_files [28703]:  Unable to restore fd 3 (type=6,err=-6)
    > Jan  2 14:26:03 kam02 kernel: cr_rstrt_child [28703]:  Unable to restore files!  (err=-6)
    > Perhaps something to do with the socket.
    > What do you think?
    >                                Regards,
    >                                   Jerry
    > P.S. I have prelinking turned off.
    > Paul H. Hargrove wrote:
    >> Jerry Mersel wrote:
    >>> Hi:
    >>>  I am trying to migrate jobs on a grid after checkpointing.
    >>> Must the "prelinking" fix mentioned in the FAQ be applied on both
    >>> the checkpointed node and the migrated-to node?
    >>>                                     Regards,
    >>>                                        Jerry
    >> Yes, the prelinking of libraries should be disabled on both the 
    >> "checkpointed on" and "migrated to" nodes.
    >> I will clarify this in the next FAQ version.
    >> -Paul
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
