Re: problem migrating jobs

From: Sergio Díaz (sdiaz_at_cesga.es)
Date: Thu May 14 2009 - 05:19:50 PDT
Next message: Paul H. Hargrove: "Re: math.ct failure"
Previous message: Paul H. Hargrove: "Re: problem migrating jobs"
In reply to: Paul H. Hargrove: "Re: problem migrating jobs"
Next in thread: Sergio Díaz: "Re: problem migrating jobs"
Hi Paul,

Regarding the limits: the difference is that SGE sets the two limits 
copied below to the value of vmem that you pass to qsub. I used 
vmem=1G, so SGE sets both limits to 1G. If you work on the 
host without SGE, these limits are "unlimited". I tested the checkpoint 
without SGE, setting the limits to 1048576, and it worked fine. So I 
guess the limits are not relevant. About the environment (env): SGE 
sets some variables, but I tested without SGE using the same 
environment and the same limits, and it worked fine.

I'll run more tests, because I can't understand why it doesn't work 
with SGE. Meanwhile, I think disabling prelink is the best option 
to continue with my work.


data seg size           (kbytes, -d) 1048576
virtual memory          (kbytes, -v) 1048576
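
For anyone who wants to repeat the test without SGE, here is a minimal sketch of one way to start a command with the same two soft limits. It is not part of BLCR or SGE: the helper names soft_limit_bytes and run_with_limits are hypothetical, and it assumes a Linux host where Python's resource module maps "ulimit -d" to RLIMIT_DATA and "ulimit -v" to RLIMIT_AS. Only the soft limits are changed, so the hard limits of the host stay as they were.

```python
import resource
import subprocess
import sys

# Value reported by "ulimit -d" and "ulimit -v" inside the SGE job (kbytes).
LIMIT_KB = 1048576


def soft_limit_bytes(limit_kb, hard):
    """Return the soft limit in bytes, never above the current hard limit."""
    wanted = limit_kb * 1024
    if hard == resource.RLIM_INFINITY:
        return wanted
    return min(wanted, hard)


def apply_limits(limit_kb=LIMIT_KB):
    """Set the data segment and virtual memory soft limits (hypothetical helper)."""
    for res in (resource.RLIMIT_DATA, resource.RLIMIT_AS):
        _soft, hard = resource.getrlimit(res)
        resource.setrlimit(res, (soft_limit_bytes(limit_kb, hard), hard))


def run_with_limits(argv, limit_kb=LIMIT_KB):
    """Run argv in a child process that has the limits applied before exec."""
    return subprocess.run(
        argv,
        preexec_fn=lambda: apply_limits(limit_kb),
        capture_output=True,
        text=True,
    )


# Example (assumes cr_restart is in PATH and the context file exists):
# result = run_with_limits(["cr_restart", "archivo_checkpoint"])
# print(result.returncode, result.stderr)
```

If the restart works under these limits but fails inside an SGE job, that supports the guess that the limits alone are not the cause.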

Thanks!,
Sergio



Paul H. Hargrove wrote:
> Sergio,
>
>   Since --save-private allowed you to migrate the job when not using 
> SGE, and disabling prelinking allowed success both with and without 
> SGE I think we can conclude that prelinking was the original cause of 
> your problems.  So, I recommend disabling prelinking as your best option.
>
>   The fact that use of --save-private and/or --save-exe caused errors 
> with SGE is not something I would have guessed in advance.  However, I 
> suspect that it has something to do with resource limits (like the 
> limit or ulimit shell built-ins).  This might be something BLCR could 
> work around in the future, but I have no guess at the moment how.  If 
> for some reason you do not wish to disable prelinking, then there may 
> be some resource limit setting in SGE that could be changed to 
> eliminate the "Failed to locate newborn mmap()ed space" problem.  
> However, I am not an SGE expert and so don't know where you would 
> start looking.
>
>   You asked how disabling of prelinking would affect your systems.  
> The answer is that it will cost you a small amount of performance, 
> mostly at program startup.  It will not introduce errors in any 
> programs.  Prelinking is also viewed by some as a security 
> improvement.  You can read more about prelinking at 
> http://en.wikipedia.org/wiki/Prelinking
>
> -Paul
>
> Sergio Díaz wrote:
>> Hi Paul and Adolfo,
>>
>> Adolfo, running the job without SGE does not work.
>>
>> Paul, checkpointing with "--save-private" works fine only 
>> if I submit the job without SGE. But if I submit the job to SGE, it 
>> does not restart correctly, not even on the same host. I get the 
>> following error:
>>
>>> - Failed to locate newborn mmap()ed space
>>> - cr_rstrt_child [20249]:  Unable to load mmap()ed data!  (err=-22)
>>> Restart failed: Invalid argument
>>
>> I don't understand why it doesn't work: SGE shouldn't make a difference, 
>> because the script that I use for the checkpoint is basically the same.
>> I also tried the --save-all option, but it doesn't work either; I got 
>> the same error. With the --save-shared and --save-exe options I got a 
>> segmentation fault.
>>
>> Second attempt: disabling pre-linking and running cr_checkpoint with 
>> --save-private. I got the same error. But running cr_checkpoint 
>> without --save-private works fine! I did a successful migration 
>> and the job finished fine.
>>
>> I need to find out how disabling pre-linking could affect the hosts. 
>> Any ideas? Lower performance? Problems with my 
>> applications?
>>
>>
>> Thanks a lot,
>> Sergio
>>
>>
>>
>> Paul H. Hargrove wrote:
>>> Sergio,
>>>
>>>   Your problem sounds like a problem of not having identical shared 
>>> libraries on host A and host B.  One possibility is that the two 
>>> hosts have different versions of libs installed, and a second 
>>> possibility is that they could have the same versions installed, but 
>>> that "prelinking" may be mapping them to different addresses on the 
>>> two hosts.
>>>
>>>   If you think that the libraries installed on the two hosts are the 
>>> same, then try the instructions in our FAQ for disabling 
>>> pre-linking: http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink .
>>>
>>>   If you know that the library versions are /not/ the same, or if 
>>> disabling pre-linking does not help, then you will need to add the 
>>> "--save-private" flag to the cr_checkpoint command in the SGE 
>>> migration script to request that BLCR include copies of the 
>>> libraries in the context file.
>>>
>>>   I hope one of the two suggestions above resolves your problem.  If 
>>> not, let us know and we'll see what else we can try.
>>>
>>> -Paul
>>>
>>> Sergio Díaz wrote:
>>>> Hi all,
>>>>
>>>> I am using BLCR + SGE to checkpoint my jobs. It is working 
>>>> fine, and I can also migrate a job (with qmod -s JOB_ID).
>>>> The problem is the following: if I have a job running on host A and 
>>>> I run qmod -s JOB_ID (to migrate the job), SGE launches the migration 
>>>> script, takes a checkpoint, kills the job and puts the job in the 
>>>> queue. When a host is free, SGE runs the job on that host. If the 
>>>> job runs on host A, it finishes fine, but if the job is run 
>>>> on another host (host B, for instance), the job fails.
>>>>
>>>> Running strace on the command cr_restart archivo_checkpoint, I 
>>>> see the following:
>>>>
>>>> If the job runs on the same host:
>>>>> .....
>>>>> close(5)                                = 0
>>>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>>>> wait4(27782, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 
>>>>> __WCLONE|__WALL, NULL) = 27782
>>>>> --- SIGCHLD (Child exited) @ 0 (0) ---
>>>>> exit_group(0)                           = ?
>>>>> Process 27972 detached
>>>>
>>>> If the job runs on another host:
>>>>
>>>>> ....
>>>>> close(5)                                = 0
>>>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>>>> wait4(27782, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], 
>>>>> __WCLONE|__WALL, NULL) = 27782
>>>>> --- SIGCHLD (Child exited) @ 0 (0) ---
>>>>> setrlimit(RLIMIT_CORE, {rlim_cur=0, rlim_max=0}) = 0
>>>>> rt_sigaction(SIGSEGV, {SIG_DFL}, NULL, 8) = 0
>>>>> tgkill(8889, 8889, SIGSEGV)             = 0
>>>>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>>>>> +++ killed by SIGSEGV +++
>>>>> Process 8889 detached
>>>>
>>>>
>>>> Any ideas??
>>>>
>>>> Regards,
>>>> Sergio
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>


-- 
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: [email protected] ; http://www.cesga.es/