From: Sergio Díaz (sdiaz_at_cesga.es)
Date: Fri May 08 2009 - 04:25:47 PDT
I'm using Gaussian03 (64-bit) for these tests. I have done some more tests. If the job is running on host A and I take the checkpoint and reboot host A, then once the host is available again the job can restart on host A without problems. So I guess there are no problems with environment variables or anything allocated in memory...

Regards,
Sergio

Sergio Díaz wrote:
> Hi all,
>
> I am using BLCR + SGE to checkpoint my jobs. It's working fine
> and I can also migrate the job (doing qmod -s JOB_ID).
> The problem is the following: if I have a job running on host A and I do a
> qmod -s JOB_ID (to migrate the job), SGE launches the migration script
> and does a checkpoint, kills the job and puts the job in the queue. When a
> host is free, SGE runs the job on that host. If the job runs on
> host A, it finishes fine, but if the job is run on another host (host
> B, for instance) the job fails.
>
> Running strace on the command cr_restart archivo_checkpoint, I can see
> the following:
>
> If the job runs on the same host:
>> .....
>> close(5) = 0
>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>> wait4(27782, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}],
>> __WCLONE|__WALL, NULL) = 27782
>> --- SIGCHLD (Child exited) @ 0 (0) ---
>> exit_group(0) = ?
>> Process 27972 detached
>
> If the job runs on another host:
>
>> ....
>> close(5) = 0
>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>> wait4(27782, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}],
>> __WCLONE|__WALL, NULL) = 27782
>> --- SIGCHLD (Child exited) @ 0 (0) ---
>> setrlimit(RLIMIT_CORE, {rlim_cur=0, rlim_max=0}) = 0
>> rt_sigaction(SIGSEGV, {SIG_DFL}, NULL, 8) = 0
>> tgkill(8889, 8889, SIGSEGV) = 0
>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>> +++ killed by SIGSEGV +++
>> Process 8889 detached
>
> Any ideas?
>
> Regards,
> Sergio

--
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: [email protected] ; http://www.cesga.es/
------------------------------------------------