Re: problem migrating jobs

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon May 11 2009 - 08:49:12 PDT

  • Next message: Sergio Díaz: "Re: problem migrating jobs"
    Sergio,
    
       Since --save-private allowed you to migrate the job when not using SGE, 
    and disabling prelinking allowed success both with and without SGE, I think we 
    can conclude that prelinking was the original cause of your problems.  So, I 
    recommend disabling prelinking as your best option.
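
    One way to check whether libraries end up at different addresses on two
    hosts is to compare the /proc/<pid>/maps listings of the same program run
    on each host.  The following is a minimal sketch in Python and is not part
    of BLCR; the function names library_bases and differing_bases are
    hypothetical, and it assumes the two listings were saved as text (for
    example with "cat /proc/<pid>/maps" on each host).

```python
import re

# One line of /proc/<pid>/maps: "start-end perms offset dev inode  path".
# Only file-backed mappings (path starting with "/") are matched.
_MAPS_LINE = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+)\s+\S+\s+[0-9a-f]+\s+\S+\s+\d+\s+(/\S.*)$"
)


def library_bases(maps_text):
    """Return {path: lowest start address} for mapped shared objects."""
    bases = {}
    for line in maps_text.splitlines():
        match = _MAPS_LINE.match(line.strip())
        if not match:
            continue  # anonymous mapping, [stack], [heap], etc.
        start = int(match.group(1), 16)
        path = match.group(3)
        if ".so" not in path:
            continue  # skip the executable and other non-library files
        bases[path] = min(start, bases.get(path, start))
    return bases


def differing_bases(maps_a, maps_b):
    """Return {path: (base_on_a, base_on_b)} for libraries whose base differs."""
    a = library_bases(maps_a)
    b = library_bases(maps_b)
    return {
        path: (a[path], b[path])
        for path in sorted(a.keys() & b.keys())
        if a[path] != b[path]
    }
```

    An empty result means every library common to both listings starts at the
    same address.  Address-space randomization can also move libraries between
    runs, so a difference is a hint rather than proof that prelinking is
    responsible.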
    
       The fact that use of --save-private and/or --save-exe caused errors with 
    SGE is not something I would have guessed in advance.  However, I suspect that 
    it has something to do with resource limits (like the limit or ulimit shell 
    built-ins).  This might be something BLCR could work around in the future, but 
    at the moment I have no idea how.  If for some reason you do not wish to 
    disable prelinking, then there may be some resource limit setting in SGE that 
    could be changed to eliminate the "Failed to locate newborn mmap()ed space" 
    problem.  However, I am not an SGE expert and so don't know where you would 
    start looking.
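
    A simple way to test the resource-limit guess is to print the limits seen
    by a process inside an SGE job and by one started from an interactive
    shell, then compare the two outputs.  The sketch below uses Python's
    standard resource module; the function names snapshot_limits and
    format_limits are hypothetical, and it assumes Python is available on the
    compute nodes.  It only reports limits and does not change any SGE
    setting.

```python
import resource


def snapshot_limits():
    """Return {name: (soft, hard)} for every RLIMIT_* this platform defines."""
    limits = {}
    for name in dir(resource):
        if not name.startswith("RLIMIT_"):
            continue
        try:
            limits[name] = resource.getrlimit(getattr(resource, name))
        except (ValueError, OSError):
            continue  # constant is defined but not supported by this kernel
    return limits


def format_limits(limits):
    """Render one 'NAME soft=... hard=...' line per limit, sorted by name."""
    def show(value):
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    return "\n".join(
        f"{name} soft={show(soft)} hard={show(hard)}"
        for name, (soft, hard) in sorted(limits.items())
    )


if __name__ == "__main__":
    # Run once from a login shell and once from within an SGE job script,
    # saving each output to a file, then diff the two files.
    print(format_limits(snapshot_limits()))
```

    Any line that differs between the two outputs is a candidate for the
    limit that interferes with the restart.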
    
       You asked how disabling prelinking would affect your systems.  The 
    answer is that it will cost you a small amount of performance, mostly at 
    program startup.  It will not introduce errors in any programs.  Prelinking is 
    also viewed by some as a security improvement.  You can read more about 
    prelinking at http://en.wikipedia.org/wiki/Prelinking
    
    -Paul
    
    Sergio Díaz wrote:
    > Hi Paul and Adolfo,
    > 
    > Adolfo, running the job without SGE doesn't work.
    > 
    > Paul, checkpointing with "--save-private" works fine only if I 
    > run the job without SGE. But if I submit the job to SGE, it doesn't 
    > restart properly, not even on the same host. I get the following error:
    > 
    >> - Failed to locate newborn mmap()ed space
    >> - cr_rstrt_child [20249]:  Unable to load mmap()ed data!  (err=-22)
    >> Restart failed: Invalid argument
    > 
    > I don't understand why it doesn't work, because SGE shouldn't make a 
    > difference: the script that I use for the checkpoint is basically the same.
    > I also tried the option --save-all, but it doesn't work either; I got the 
    > same error. With the options --save-shared and --save-exe I got the 
    > segmentation fault.
    > 
    > Second attempt: with pre-linking disabled and cr_checkpoint run with 
    > --save-private, I got the same error. But running cr_checkpoint 
    > without --save-private works fine!! I did a successful migration and 
    > the job finished fine.
    > 
    > I have to research how disabling pre-linking could affect the 
    > hosts. Any idea? Lower performance? Problems with my 
    > applications?
    > 
    > 
    > Thanks a lot,
    > Sergio
    > 
    > 
    > 
    > Paul H. Hargrove wrote:
    >> Sergio,
    >>
    >>   Your problem sounds like a problem of not having identical shared 
    >> libraries on host A and host B.  One possibility is that the two hosts 
    >> have different versions of libs installed, and a second possibility is 
    >> that they could have the same versions installed, but that 
    >> "prelinking" may be mapping them to different addresses on the two hosts.
    >>
    >>   If you think that the libraries installed on the two hosts are the 
    >> same, then try the instructions in our FAQ for disabling pre-linking: 
    >> http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink .
    >>
    >>   If you know that the library versions are /not/ the same, or if 
    >> disabling pre-linking does not help, then you will need to add the 
    >> "--save-private" flag to the cr_checkpoint command in the SGE 
    >> migration script to request that BLCR include copies of the libraries 
    >> in the context file.
    >>
    >>   I hope one of the two suggestions above resolves your problem.  If 
    >> not, let us know and we'll see what else we can try.
    >>
    >> -Paul
    >>
    >> Sergio Díaz wrote:
    >>> Hi all,
    >>>
    >>> I am using BLCR + SGE to checkpoint my jobs. It's working fine, 
    >>> and I can also migrate a job (with qmod -s JOB_ID).
    >>> The problem is the following: if I have a job running on host A and I do a 
    >>> qmod -s JOB_ID (to migrate the job), SGE launches the migration script, 
    >>> does a checkpoint, kills the job and puts the job in the queue. When 
    >>> a host is free, SGE runs the job on that host. If the job runs on 
    >>> host A, it finishes fine, but if the job is run on another host 
    >>> (host B, for instance) the job fails.
    >>>
    >>> Running strace on the command cr_restart archivo_checkpoint, I see 
    >>> the following:
    >>>
    >>> If the job runs on the same host:
    >>>> .....
    >>>> close(5)                                = 0
    >>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    >>>> wait4(27782, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 
    >>>> __WCLONE|__WALL, NULL) = 27782
    >>>> --- SIGCHLD (Child exited) @ 0 (0) ---
    >>>> exit_group(0)                           = ?
    >>>> Process 27972 detached
    >>>
    >>> If the job runs on another host:
    >>>
    >>>> ....
    >>>> close(5)                                = 0
    >>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    >>>> wait4(27782, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], 
    >>>> __WCLONE|__WALL, NULL) = 27782
    >>>> --- SIGCHLD (Child exited) @ 0 (0) ---
    >>>> setrlimit(RLIMIT_CORE, {rlim_cur=0, rlim_max=0}) = 0
    >>>> rt_sigaction(SIGSEGV, {SIG_DFL}, NULL, 8) = 0
    >>>> tgkill(8889, 8889, SIGSEGV)             = 0
    >>>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
    >>>> +++ killed by SIGSEGV +++
    >>>> Process 8889 detached
    >>>
    >>>
    >>> Any ideas??
    >>>
    >>> Regards,
    >>> Sergio
    >>>
    >>>
    >>>
    >>>
    >>
    >>
    > 
    > 
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
