Re: problem migrating jobs

From: Sergio Díaz (sdiaz_at_cesga.es)
Date: Mon May 11 2009 - 08:33:48 PDT

  • Next message: Paul H. Hargrove: "Re: problem migrating jobs"
    Hi Paul and Adolfo,
    
    Adolfo, the job doesn't work when I run it without SGE.
    
    Paul, checkpointing with "--save-private" works fine only if I run 
    the job without SGE. If I submit the job to SGE, it doesn't restart 
    correctly, not even on the same host. I get the following error:
    
    > - Failed to locate newborn mmap()ed space
    > - cr_rstrt_child [20249]:  Unable to load mmap()ed data!  (err=-22)
    > Restart failed: Invalid argument
    
    I don't understand why it fails, because SGE shouldn't make a 
    difference: the script I use for the checkpoint is basically the same.
    I also tried the --save-all option, but it doesn't work either; I get 
    the same error. With the --save-shared and --save-exe options I get a 
    segmentation fault.
    
    Second attempt: disabling pre-linking and running cr_checkpoint with 
    --save-private. I got the same error. But running cr_checkpoint 
    without --save-private works fine! I did a successful migration and 
    the job finished correctly.
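
One way to check whether pre-linking places libraries at different
addresses on the two hosts is to compare the memory mappings of the same
program on each of them. The following is only a minimal sketch in
Python, not part of BLCR or SGE. It assumes the contents of
/proc/<pid>/maps have been saved for the process on each host; the
function names library_bases and compare_hosts are made up for this
example.

```python
def library_bases(maps_text):
    """Return {path: lowest start address} for each shared library
    found in text formatted like /proc/<pid>/maps."""
    bases = {}
    for line in maps_text.splitlines():
        # Fields: address-range perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        path = fields[5].strip()
        if ".so" not in path:
            continue  # keep only shared libraries
        start = int(fields[0].split("-")[0], 16)
        bases[path] = min(start, bases.get(path, start))
    return bases


def compare_hosts(maps_a, maps_b):
    """List libraries that are missing on one host or that are
    mapped at a different base address on the two hosts."""
    a, b = library_bases(maps_a), library_bases(maps_b)
    report = []
    for path in sorted(set(a) | set(b)):
        if path not in a or path not in b:
            report.append((path, "missing on one host"))
        elif a[path] != b[path]:
            report.append((path, "different base address"))
    return report
```

To use it, read the two saved maps files and pass their contents to
compare_hosts. An empty list means that the libraries seen in both files
start at the same addresses; it says nothing about whether the library
files themselves are identical, which would need a separate checksum
comparison.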
    
    I have to investigate how disabling pre-linking could affect the 
    hosts. Any ideas? Lower performance? Problems with my 
    applications?
    
    
    Thanks a lot,
    Sergio
    
    
    
    Paul H. Hargrove wrote:
    > Sergio,
    >
    >   Your problem sounds like a problem of not having identical shared 
    > libraries on host A and host B.  One possibility is that the two hosts 
    > have different versions of libs installed, and a second possibility is 
    > that they could have the same versions installed, but that 
    > "prelinking" may be mapping them to different addresses on the two hosts.
    >
    >   If you think that the libraries installed on the two hosts are the 
    > same, then try the instructions in our FAQ for disabling pre-linking: 
    > http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink .
    >
    >   If you know that the library versions are /not/ the same, or if 
    > disabling pre-linking does not help, then you will need to add the 
    > "--save-private" flag to the cr_checkpoint command in the SGE 
    > migration script to request that BLCR include copies of the libraries 
    > in the context file.
    >
    >   I hope one of the two suggestions above resolves your problem.  If 
    > not, let us know and we'll see what else we can try.
    >
    > -Paul
    >
    > Sergio Díaz wrote:
    >> Hi all,
    >>
    >> I am using BLCR + SGE to checkpoint my jobs. It's working fine, 
    >> and I can also migrate a job (with qmod -s JOB_ID).
    >> The problem is the following: if I have a job running on host A and 
    >> I run qmod -s JOB_ID (to migrate the job), SGE launches the migration 
    >> script, takes a checkpoint, kills the job and puts it in the queue. 
    >> When a host is free, SGE runs the job on that host. If the job runs 
    >> on host A, it finishes fine, but if the job runs on another host 
    >> (host B, for instance), the job fails.
    >>
    >> Running strace on the command cr_restart archivo_checkpoint, I see 
    >> the following:
    >>
    >> If the job runs on the same host:
    >>> .....
    >>> close(5)                                = 0
    >>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    >>> wait4(27782, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 
    >>> __WCLONE|__WALL, NULL) = 27782
    >>> --- SIGCHLD (Child exited) @ 0 (0) ---
    >>> exit_group(0)                           = ?
    >>> Process 27972 detached
    >>
    >> If the job runs on another host:
    >>
    >>> ....
    >>> close(5)                                = 0
    >>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    >>> wait4(27782, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], 
    >>> __WCLONE|__WALL, NULL) = 27782
    >>> --- SIGCHLD (Child exited) @ 0 (0) ---
    >>> setrlimit(RLIMIT_CORE, {rlim_cur=0, rlim_max=0}) = 0
    >>> rt_sigaction(SIGSEGV, {SIG_DFL}, NULL, 8) = 0
    >>> tgkill(8889, 8889, SIGSEGV)             = 0
    >>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
    >>> +++ killed by SIGSEGV +++
    >>> Process 8889 detached
    >>
    >>
    >> Any ideas??
    >>
    >> Regards,
    >> Sergio
    >>
    >>
    >>
    >>
    >
    >
    
    
    -- 
    Sergio Díaz Montes
    Centro de Supercomputacion de Galicia
    Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
    Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
    email: sdiaz_at_cesga.es ; http://www.cesga.es/
    ------------------------------------------------ 
    
