From: Sergio Díaz (sdiaz_at_cesga.es)
Date: Thu May 14 2009 - 05:19:50 PDT
Hi Paul,

Regarding the limits: the difference is that SGE sets the two limits below to the vmem value given to qsub. I used vmem=1G, so SGE sets both limits to 1G. If you work on the host without SGE, these limits are "unlimited". I tested the checkpoint without SGE, with the limits set to 1048576, and it worked fine. So I guess the limits are not relevant.

About the environment (env): SGE sets some variables, but I tested without SGE using the same environment and the same limits, and it worked fine.

I'll try to do more tests because I can't understand why it doesn't work with SGE. Meanwhile, I think disabling prelink is the best option to continue with my work.

data seg size (kbytes, -d) 1048576
virtual memory (kbytes, -v) 1048576

Thanks!
Sergio

Paul H. Hargrove wrote:
> Sergio,
>
> Since --save-private allowed you to migrate the job when not using
> SGE, and disabling prelinking allowed success both with and without
> SGE, I think we can conclude that prelinking was the original cause of
> your problems. So, I recommend disabling prelinking as your best option.
>
> The fact that use of --save-private and/or --save-exe caused errors
> with SGE is not something I would have guessed in advance. However, I
> suspect that it has something to do with resource limits (like the
> limit or ulimit shell built-ins). This might be something BLCR could
> work around in the future, but I have no guess at the moment how. If
> for some reason you do not wish to disable prelinking, then there may
> be some resource limit setting in SGE that could be changed to
> eliminate the "Failed to locate newborn mmap()ed space" problem.
> However, I am not an SGE expert and so don't know where you would
> start looking.
>
> You asked how disabling of prelinking would affect your systems.
> The answer is that it will cost you a small amount of performance,
> mostly at program startup. It will not introduce errors in any
> programs.
> prelinking is also viewed by some as a security
> improvement. You can read more about prelinking at
> http://en.wikipedia.org/wiki/Prelinking
>
> -Paul
>
> Sergio Díaz wrote:
>> Hi Paul and Adolfo,
>>
>> Adolfo, running the job without SGE doesn't work.
>>
>> Paul, doing the checkpoint with "--save-private" works fine only
>> if I send the jobs without SGE. But if I send the job to SGE, it
>> doesn't restart correctly, not even on the same host. I get the
>> following error:
>>
>>> - Failed to locate newborn mmap()ed space
>>> - cr_rstrt_child [20249]: Unable to load mmap()ed data! (err=-22)
>>> Restart failed: Invalid argument
>>
>> I don't understand why it doesn't work: SGE shouldn't matter,
>> because the script that I use for the checkpoint is basically the same.
>> I also tried the option --save-all but it doesn't work; I got the
>> same error. With the options --save-shared and --save-exe I got a
>> segmentation fault.
>>
>> 2nd attempt... disabling pre-linking and doing the cr_checkpoint with
>> --save-private, I got the same error. But doing the cr_checkpoint
>> without --save-private, it works fine!! I did a successful migration
>> and the job finished fine.
>>
>> I have to research how the hosts could be affected if I
>> disable pre-linking. Any idea? Less performance? Problems with my
>> applications?
>>
>>
>> Thanks a lot,
>> Sergio
>>
>>
>> Paul H. Hargrove wrote:
>>> Sergio,
>>>
>>> Your problem sounds like a problem of not having identical shared
>>> libraries on host A and host B. One possibility is that the two
>>> hosts have different versions of libs installed, and a second
>>> possibility is that they could have the same versions installed, but
>>> that "prelinking" may be mapping them to different addresses on the
>>> two hosts.
>>>
>>> If you think that the libraries installed on the two hosts are the
>>> same, then try the instructions in our FAQ for disabling
>>> pre-linking: http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink .
>>>
>>> If you know that the library versions are /not/ the same, or if
>>> disabling pre-linking does not help, then you will need to add the
>>> "--save-private" flag to the cr_checkpoint command in the SGE
>>> migration script to request that BLCR include copies of the
>>> libraries in the context file.
>>>
>>> I hope one of the two suggestions above resolves your problem. If
>>> not, let us know and we'll see what else we can try.
>>>
>>> -Paul
>>>
>>> Sergio Díaz wrote:
>>>> Hi all,
>>>>
>>>> I am using BLCR + SGE to checkpoint my jobs. It's working
>>>> fine and I can also migrate a job (doing qmod -s JOB_ID).
>>>> The problem is the following: if I have a job running on host A and
>>>> I do a qmod -s JOB_ID (to migrate the job), SGE launches the
>>>> migration script, does a checkpoint, kills the job and puts the job
>>>> back in the queue. When a host is free, SGE runs the job on that
>>>> host. If the job runs on host A, it finishes fine, but if the job
>>>> is run on another host (host B for instance) the job fails.
>>>>
>>>> Running strace on the command cr_restart archivo_checkpoint I can
>>>> see the following:
>>>>
>>>> If the job runs on the same host:
>>>>> .....
>>>>> close(5) = 0
>>>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>>>> wait4(27782, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}],
>>>>> __WCLONE|__WALL, NULL) = 27782
>>>>> --- SIGCHLD (Child exited) @ 0 (0) ---
>>>>> exit_group(0) = ?
>>>>> Process 27972 detached
>>>>
>>>> If the job runs on another host:
>>>>
>>>>> ....
>>>>> close(5) = 0
>>>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>>>> wait4(27782, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}],
>>>>> __WCLONE|__WALL, NULL) = 27782
>>>>> --- SIGCHLD (Child exited) @ 0 (0) ---
>>>>> setrlimit(RLIMIT_CORE, {rlim_cur=0, rlim_max=0}) = 0
>>>>> rt_sigaction(SIGSEGV, {SIG_DFL}, NULL, 8) = 0
>>>>> tgkill(8889, 8889, SIGSEGV) = 0
>>>>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>>>>> +++ killed by SIGSEGV +++
>>>>> Process 8889 detached
>>>>
>>>>
>>>> Any ideas??
>>>>
>>>> Regards,
>>>> Sergio
>>>>
>>>
>>
>

--
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur)
15706 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: [email protected] ; http://www.cesga.es/
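
For anyone repeating the limits test from the top message outside SGE, here is a minimal sketch of one way to do it. It is not the script used in this thread: the helper name run_with_limits is hypothetical, and it only assumes the standard ulimit shell built-in with the two limits (-d and -v, in kbytes) quoted above. The limits are set inside a subshell so the calling shell keeps its own "unlimited" values.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): run a command with the
# data-segment (-d) and virtual-memory (-v) limits set to the same
# value, in kbytes, mimicking what SGE did for vmem=1G.
run_with_limits() {
    limit_kb="$1"
    shift
    (
        # The parentheses start a subshell, so the limits apply only
        # to the command run here, not to the calling shell.
        ulimit -d "$limit_kb" || exit 1
        ulimit -v "$limit_kb" || exit 1
        "$@"
    )
}

# Demonstration: show the limits as seen by a child process.
run_with_limits 1048576 sh -c 'ulimit -d; ulimit -v'
```

To reproduce the actual test, the demonstration command would be replaced by the restart command from the thread, for example run_with_limits 1048576 cr_restart archivo_checkpoint.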