From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon May 11 2009 - 08:49:12 PDT
Sergio,

Since --save-private allowed you to migrate the job when not using SGE,
and disabling prelinking allowed success both with and without SGE, I
think we can conclude that prelinking was the original cause of your
problems. So, I recommend disabling prelinking as your best option.

The fact that use of --save-private and/or --save-exe caused errors with
SGE is not something I would have guessed in advance. However, I suspect
that it has something to do with resource limits (like those set by the
limit or ulimit shell built-ins). This might be something BLCR could
work around in the future, but I have no guess at the moment how.

If for some reason you do not wish to disable prelinking, then there may
be some resource limit setting in SGE that could be changed to eliminate
the "Failed to locate newborn mmap()ed space" problem. However, I am not
an SGE expert and so don't know where you would start looking.

You asked how disabling prelinking would affect your systems. The answer
is that it will cost you a small amount of performance, mostly at
program startup. It will not introduce errors in any programs.
Prelinking is also viewed by some as a security improvement. You can
read more about prelinking at http://en.wikipedia.org/wiki/Prelinking

-Paul

Sergio Díaz wrote:
> Hi Paul and Adolfo,
>
> Adolfo, running the job without SGE, it doesn't work.
>
> Paul, doing the checkpoint with "--save-private" it works fine only if I
> send the jobs without SGE. But If I send the job to SGE, it doesn't
> restart fine. Neither in the same host. I get the following error:
>
>> - Failed to locate newborn mmap()ed space
>> - cr_rstrt_child [20249]: Unable to load mmap()ed data! (err=-22)
>> Restart failed: Invalid argument
>
> I don't understand why it doesn't work because SGE shouldn't affect
> because the script that I use to the checkpoint is basically the same.
> I also tried using the option --save-all but don't work. I got the same
> error. With the option --save-shared and --save-exe I got the
> segmentation fault.
>
> 2nd attempt... disabling pre-linking and doing the cr_checkpoint with
> the --save-private. I got the same error. But doing the cr_checkpoint
> without --save-private, it works fine!! I did a successful migration and
> the job finished fine.
>
> I have to research in which aspect could be affected the hosts if I
> disable the pre-linking. Any idea? Less performance? problems with my
> applications?
>
>
> Thanks a lot,
> Sergio
>
>
> Paul H. Hargrove wrote:
>> Sergio,
>>
>> Your problem sounds like a problem of not having identical shared
>> libraries on host A and host B. One possibility is that the two hosts
>> have different versions of libs installed, and a second possibility is
>> that they could have the same versions installed, but that
>> "prelinking" may be mapping them to different addresses on the two hosts.
>>
>> If you think that the libraries installed on the two hosts are the
>> same, then try the instructions in our FAQ for disabling pre-linking:
>> http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink .
>>
>> If you know that the library versions are /not/ the same, or if
>> disabling pre-linking does not help, then you will need to add the
>> "--save-private" flag to the cr_checkpoint command in the SGE
>> migration script to request that BLCR include copies of the libraries
>> in the context file.
>>
>> I hope one of the two suggestions above resolves your problem. If
>> not, let us know and we'll see what else we can try.
>>
>> -Paul
>>
>> Sergio Díaz wrote:
>>> Hi all,
>>>
>>> I am using BLCR + SGE to do checkpoint to my jobs. It's working fine
>>> and also I can migrate the job (doing qmod -s JOB_ID).
>>> The problem is the next: If I have a job running in host A and I do a
>>> qmod -s JOB_ID (to migrate the job), SGE launch the migration script
>>> and do a checkpoint, kill the job and put the job in the queue. When
>>> a host is free, SGE runs the job in the host. If the job runs in the
>>> host A, it finishes fine but if the job is run in other host
>>> (host B for instance) the job fails.
>>>
>>> Doing a strace to the command cr_restart archivo_checkpoint I can see
>>> the following:
>>>
>>> If the job runs in the same host:
>>>> .....
>>>> close(5) = 0
>>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>>> wait4(27782, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}],
>>>> __WCLONE|__WALL, NULL) = 27782
>>>> --- SIGCHLD (Child exited) @ 0 (0) ---
>>>> exit_group(0) = ?
>>>> Process 27972 detached
>>>
>>> If the job runs in other host:
>>>
>>>> ....
>>>> close(5) = 0
>>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>>> wait4(27782, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}],
>>>> __WCLONE|__WALL, NULL) = 27782
>>>> --- SIGCHLD (Child exited) @ 0 (0) ---
>>>> setrlimit(RLIMIT_CORE, {rlim_cur=0, rlim_max=0}) = 0
>>>> rt_sigaction(SIGSEGV, {SIG_DFL}, NULL, 8) = 0
>>>> tgkill(8889, 8889, SIGSEGV) = 0
>>>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>>>> +++ killed by SIGSEGV +++
>>>> Process 8889 detached
>>>
>>>
>>> Any ideas??
>>>
>>> Regards,
>>> Sergio
>>>
>>
>

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
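
On the resource-limit suspicion mentioned in the reply: one way to test it would be to record the limits a process actually sees when started from an interactive shell and again from inside an SGE job, then compare the two. The following is only a minimal sketch using Python's standard resource module. The set of limits listed is a guess at what might matter for mapping memory at restart; nothing in this thread confirms which limit, if any, is involved. The helper names snapshot_limits and diff_limits are made up for this illustration.

```python
import resource

# Limits that might plausibly affect mapping memory at restart time.
# ASSUMPTION: this selection is a guess; the thread does not identify
# which limit (if any) triggers the "newborn mmap()ed space" error.
LIMIT_NAMES = [
    "RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_STACK",
    "RLIMIT_MEMLOCK", "RLIMIT_CORE", "RLIMIT_NOFILE",
]


def _fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def snapshot_limits(names=LIMIT_NAMES):
    """Return {limit name: (soft, hard)} for the current process."""
    limits = {}
    for name in names:
        const = getattr(resource, name, None)
        if const is None:
            # This limit is not defined on the current platform.
            continue
        soft, hard = resource.getrlimit(const)
        limits[name] = (_fmt(soft), _fmt(hard))
    return limits


def diff_limits(a, b):
    """Return only the limits whose values differ between two snapshots."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    # Run once from a login shell and once from inside an SGE job script,
    # save both outputs, and compare them (for example with diff).
    for name, (soft, hard) in sorted(snapshot_limits().items()):
        print(f"{name} soft={soft} hard={hard}")
```

If the two outputs differ, the differing limit would be a reasonable first candidate to adjust in the SGE configuration before retrying cr_restart.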