problem migrating jobs

From: Sergio Díaz (sdiaz_at_cesga.es)
Date: Fri May 08 2009 - 02:53:58 PDT

    Hi all,
    
    I am using BLCR + SGE to checkpoint my jobs. It works fine, and I can
    also migrate a job (with qmod -s JOB_ID).
    The problem is the following: if a job is running on host A and I run
    qmod -s JOB_ID (to migrate it), SGE launches the migration script,
    takes a checkpoint, kills the job, and puts it back in the queue. When
    a host becomes free, SGE runs the job on that host. If the job is
    restarted on host A, it finishes fine, but if it is restarted on
    another host (host B, for instance), it fails.
    
    Running strace on the command cr_restart archivo_checkpoint, I see
    the following:
    
    If the job is restarted on the same host:
    > .....
    > close(5)                                = 0
    > rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    > wait4(27782, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], __WCLONE|__WALL, NULL) = 27782
    > --- SIGCHLD (Child exited) @ 0 (0) ---
    > exit_group(0)                           = ?
    > Process 27972 detached
    
    If the job is restarted on another host:
    
    > ....
    > close(5)                                = 0
    > rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    > wait4(27782, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], __WCLONE|__WALL, NULL) = 27782
    > --- SIGCHLD (Child exited) @ 0 (0) ---
    > setrlimit(RLIMIT_CORE, {rlim_cur=0, rlim_max=0}) = 0
    > rt_sigaction(SIGSEGV, {SIG_DFL}, NULL, 8) = 0
    > tgkill(8889, 8889, SIGSEGV)             = 0
    > --- SIGSEGV (Segmentation fault) @ 0 (0) ---
    > +++ killed by SIGSEGV +++
    > Process 8889 detached
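
Reading the two traces side by side: in the first, wait4 reports that the restarted process (PID 27782) exited with status 0; in the second, it reports that the process was killed by SIGSEGV, after which cr_restart resets SIGSEGV to its default action and sends the signal to itself, so its own termination mirrors the child's. The following is a minimal Python sketch, not part of BLCR or SGE, of how a wrapper around the restart command could tell these two outcomes apart without strace. The helper names describe_exit and run_and_describe are hypothetical, chosen only for this illustration.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Describe a subprocess return code in strace-like wording."""
    if returncode < 0:
        # subprocess reports death by signal as the negated signal number.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


def run_and_describe(argv):
    """Run the command in argv, wait for it, and describe how it ended."""
    return describe_exit(subprocess.run(argv).returncode)


# Hypothetical usage on a host where BLCR is installed:
#   print(run_and_describe(["cr_restart", "archivo_checkpoint"]))
```

Under these assumptions, the same-host restart would be reported as "exited with status 0" and the failing restart on the other host as "killed by SIGSEGV", matching what the traces show.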
    
    
    Any ideas?
    
    Regards,
    Sergio
    
    
    
    
    -- 
    Sergio Díaz Montes
    Centro de Supercomputacion de Galicia
    Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
    Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
    email: sdiaz_at_cesga.es ; http://www.cesga.es/
    ------------------------------------------------ 
    
