From: Sergio Díaz (sdiaz_at_cesga.es)
Date: Mon May 11 2009 - 08:33:48 PDT
Hi Paul and Adolfo,

Adolfo, running the job without SGE doesn't work.

Paul, checkpointing with "--save-private" works fine only if I submit the jobs without SGE. If I submit the job through SGE, it doesn't restart correctly, not even on the same host. I get the following error:

> - Failed to locate newborn mmap()ed space
> - cr_rstrt_child [20249]: Unable to load mmap()ed data! (err=-22)
> Restart failed: Invalid argument

I don't understand why it fails, because SGE shouldn't make a difference: the script I use for the checkpoint is basically the same. I also tried the --save-all option, but it doesn't work either; I got the same error. With the --save-shared and --save-exe options I got the segmentation fault.

2nd attempt: disabling pre-linking and running cr_checkpoint with --save-private. I got the same error. But running cr_checkpoint without --save-private works fine!! I did a successful migration and the job finished fine.

I still have to find out how disabling pre-linking could affect the hosts. Any ideas? Lower performance? Problems with my applications?

Thanks a lot,
Sergio

Paul H. Hargrove wrote:
> Sergio,
>
> Your problem sounds like a problem of not having identical shared
> libraries on host A and host B. One possibility is that the two hosts
> have different versions of libs installed, and a second possibility is
> that they could have the same versions installed, but that
> "prelinking" may be mapping them to different addresses on the two hosts.
>
> If you think that the libraries installed on the two hosts are the
> same, then try the instructions in our FAQ for disabling pre-linking:
> http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink .
>
> If you know that the library versions are /not/ the same, or if
> disabling pre-linking does not help, then you will need to add the
> "--save-private" flag to the cr_checkpoint command in the SGE
> migration script to request that BLCR include copies of the libraries
> in the context file.
>
> I hope one of the two suggestions above resolves your problem. If
> not, let us know and we'll see what else we can try.
>
> -Paul
>
> Sergio Díaz wrote:
>> Hi all,
>>
>> I am using BLCR + SGE to checkpoint my jobs. It's working fine
>> and I can also migrate the job (doing qmod -s JOB_ID).
>> The problem is the following: if I have a job running on host A and I
>> do a qmod -s JOB_ID (to migrate the job), SGE launches the migration
>> script, which takes a checkpoint, kills the job, and puts the job back
>> in the queue. When a host is free, SGE runs the job on that host. If
>> the job runs on host A, it finishes fine, but if the job is run on
>> another host (host B, for instance), the job fails.
>>
>> Running strace on the command cr_restart archivo_checkpoint, I can see
>> the following:
>>
>> If the job runs on the same host:
>>> .....
>>> close(5) = 0
>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>> wait4(27782, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}],
>>> __WCLONE|__WALL, NULL) = 27782
>>> --- SIGCHLD (Child exited) @ 0 (0) ---
>>> exit_group(0) = ?
>>> Process 27972 detached
>>
>> If the job runs on another host:
>>
>>> ....
>>> close(5) = 0
>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>> wait4(27782, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}],
>>> __WCLONE|__WALL, NULL) = 27782
>>> --- SIGCHLD (Child exited) @ 0 (0) ---
>>> setrlimit(RLIMIT_CORE, {rlim_cur=0, rlim_max=0}) = 0
>>> rt_sigaction(SIGSEGV, {SIG_DFL}, NULL, 8) = 0
>>> tgkill(8889, 8889, SIGSEGV) = 0
>>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>>> +++ killed by SIGSEGV +++
>>> Process 8889 detached
>>
>>
>> Any ideas??
>>
>> Regards,
>> Sergio
>>
>

--
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur)
15706 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: [email protected] ; http://www.cesga.es/
------------------------------------------------
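Note on checking the "identical shared libraries" hypothesis from the quoted reply: one rough way to test it is to capture the `ldd` output for the application on host A and host B, hash each resolved library file on its own host, and compare the two results. The Python sketch below is only an illustration under stated assumptions. It is not part of BLCR or SGE, the function names and the sample `ldd` text are hypothetical, and it assumes that a differently prelinked library differs on disk (so its digest would differ too), which should be verified on the hosts in question.

```python
import hashlib
import re

# Matches ldd lines such as "libc.so.6 => /lib64/libc.so.6 (0x...)".
# Lines without a resolved absolute path (e.g. the vdso) are skipped.
_LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(/\S+)")

# Hypothetical ldd output, used only to illustrate the parser.
SAMPLE_LDD = (
    "\tlinux-vdso.so.1 =>  (0x00007fff5a1fe000)\n"
    "\tlibc.so.6 => /lib64/libc.so.6 (0x0000003a1c600000)\n"
    "\t/lib64/ld-linux-x86-64.so.2 (0x0000003a1c200000)\n"
)


def parse_ldd(output):
    """Map each library name in ldd output to its resolved path."""
    libs = {}
    for line in output.splitlines():
        match = _LDD_LINE.match(line)
        if match:
            libs[match.group(1)] = match.group(2)
    return libs


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(ldd_output):
    """Build {library name: digest}; run this on the host that owns the files."""
    return {name: file_digest(path)
            for name, path in parse_ldd(ldd_output).items()}


def diff_manifests(host_a, host_b):
    """List libraries that are missing on one host or whose digests differ."""
    names = set(host_a) | set(host_b)
    return sorted(n for n in names if host_a.get(n) != host_b.get(n))
```

Running `manifest()` on each host and passing both dictionaries to `diff_manifests()` gives the list of libraries worth inspecting. An empty list would suggest the libraries match on disk, though it says nothing about where they end up mapped at run time.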