From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Sun May 10 2009 - 10:23:09 PDT
Sergio,

Your problem sounds like a problem of not having identical shared libraries on host A and host B. One possibility is that the two hosts have different versions of the libraries installed. A second possibility is that they have the same versions installed, but that "prelinking" is mapping them to different addresses on the two hosts.

If you think that the libraries installed on the two hosts are the same, then try the instructions in our FAQ for disabling pre-linking:
http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink

If you know that the library versions are /not/ the same, or if disabling pre-linking does not help, then you will need to add the "--save-private" flag to the cr_checkpoint command in the SGE migration script to request that BLCR include copies of the libraries in the context file.

I hope one of the two suggestions above resolves your problem. If not, let us know and we'll see what else we can try.

-Paul

Sergio Díaz wrote:
> Hi all,
>
> I am using BLCR + SGE to do checkpoint to my jobs. It's working fine and
> also I can migrate the job (doing qmod -s JOB_ID).
> The problem is the next: If I have a job running in host A and I do a
> qmod -s JOB_ID (to migrate the job), SGE launch the migration script and
> do a checkpoint, kill the job and put the job in the queue. When a host
> is free, SGE runs the job in the host. If the job runs in the host A, it
> finishes fine but if the job is runned in other host (host B for
> instance) the job fails.
>
> Doing a strace to the command cr_restart archivo_checkpoint I can see
> the following:
>
> If the job runs in the same host:
>> .....
>> close(5) = 0
>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>> wait4(27782, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], __WCLONE|__WALL,
>> NULL) = 27782
>> --- SIGCHLD (Child exited) @ 0 (0) ---
>> exit_group(0) = ?
>> Process 27972 detached
>
> If the job runs in other host:
>
>> ....
>> close(5) = 0
>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>> wait4(27782, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}],
>> __WCLONE|__WALL, NULL) = 27782
>> --- SIGCHLD (Child exited) @ 0 (0) ---
>> setrlimit(RLIMIT_CORE, {rlim_cur=0, rlim_max=0}) = 0
>> rt_sigaction(SIGSEGV, {SIG_DFL}, NULL, 8) = 0
>> tgkill(8889, 8889, SIGSEGV) = 0
>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>> +++ killed by SIGSEGV +++
>> Process 8889 detached
>
> Any ideas??
>
> Regards,
> Sergio

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
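
One rough way to check whether the shared libraries on host A and host B really are identical is to compare checksums of the library files from each host. The sketch below assumes you have already run a checksum tool such as sha256sum over the libraries on each host and saved the output as text in the usual "digest  path" format. The function names and the sample digests are made up for illustration. Since prelinking rewrites the library files on disk, a prelink difference should normally show up as a checksum mismatch, the same as a version difference would, so a mismatch alone does not tell you which of the two cases you have.

```python
def parse_checksums(text):
    """Parse 'digest  path' lines (md5sum/sha256sum output format)."""
    sums = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, path = parts
            # A leading '*' on the path marks binary mode in the tool output.
            sums[path.strip().lstrip("*")] = digest.lower()
    return sums


def compare_libraries(host_a_text, host_b_text):
    """Return (paths present on only one host, paths whose digests differ)."""
    a = parse_checksums(host_a_text)
    b = parse_checksums(host_b_text)
    only_one = sorted(set(a) ^ set(b))
    differing = sorted(p for p in set(a) & set(b) if a[p] != b[p])
    return only_one, differing


if __name__ == "__main__":
    # Hypothetical sample data; real input would come from each host.
    host_a = "aaa  /lib/libc.so.6\nbbb  /lib/libm.so.6\n"
    host_b = "aaa  /lib/libc.so.6\nccc  /lib/libm.so.6\nddd  /lib/libz.so.1\n"
    only_one, differing = compare_libraries(host_a, host_b)
    print("only on one host:", only_one)
    print("different contents:", differing)
```

If both lists come back empty for the libraries the job uses, mismatched libraries are a less likely cause of the SIGSEGV on restart.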