From: Sergio Díaz (sdiaz_at_cesga.es)
Date: Fri May 08 2009 - 02:53:58 PDT
Hi all,

I am using BLCR + SGE to checkpoint my jobs. It works fine, and I can also migrate a job (with qmod -s JOB_ID).

The problem is the following: if I have a job running on host A and I do a qmod -s JOB_ID (to migrate the job), SGE launches the migration script, takes a checkpoint, kills the job and puts the job back in the queue. When a host is free, SGE runs the job on that host. If the job runs on host A again, it finishes fine, but if the job is run on another host (host B, for instance), the job fails.

Running strace on the command cr_restart archivo_checkpoint, I see the following.

If the job runs on the same host:

> .....
> close(5) = 0
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> wait4(27782, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], __WCLONE|__WALL, NULL) = 27782
> --- SIGCHLD (Child exited) @ 0 (0) ---
> exit_group(0) = ?
> Process 27972 detached

If the job runs on another host:

> ....
> close(5) = 0
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> wait4(27782, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], __WCLONE|__WALL, NULL) = 27782
> --- SIGCHLD (Child exited) @ 0 (0) ---
> setrlimit(RLIMIT_CORE, {rlim_cur=0, rlim_max=0}) = 0
> rt_sigaction(SIGSEGV, {SIG_DFL}, NULL, 8) = 0
> tgkill(8889, 8889, SIGSEGV) = 0
> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> +++ killed by SIGSEGV +++
> Process 8889 detached

So on the other host the restarted process is killed by SIGSEGV, and cr_restart then terminates itself with the same signal.

Any ideas?

Regards,
Sergio

--
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur)
15706 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: sdiaz_at_cesga.es ; http://www.cesga.es/