Re: problem migrating jobs


From: Sergio Díaz (sdiaz_at_cesga.es)
Date: Mon May 11 2009 - 08:33:48 PDT

Next message: Paul H. Hargrove: "Re: problem migrating jobs"

Previous message: Paul H. Hargrove: "Re: Question about BLCR syscall"
In reply to: Paul H. Hargrove: "Re: problem migrating jobs"
Next in thread: Paul H. Hargrove: "Re: problem migrating jobs"
Reply: Paul H. Hargrove: "Re: problem migrating jobs"

Hi Paul and Adolfo,

Adolfo, when I run the job without SGE, it doesn't work.

Paul, checkpointing with "--save-private" works fine, but only if I
submit the jobs without SGE. If I submit the job through SGE, it doesn't
restart correctly, not even on the same host. I get the following error:

> - Failed to locate newborn mmap()ed space
> - cr_rstrt_child [20249]:  Unable to load mmap()ed data!  (err=-22)
> Restart failed: Invalid argument

I don't understand why it fails, because SGE shouldn't make a
difference: the script I use for the checkpoint is basically the same.
I also tried the --save-all option, but that doesn't work either; I got
the same error. With the --save-shared and --save-exe options I got the
segmentation fault.

Second attempt: I disabled pre-linking and ran cr_checkpoint with
--save-private. I got the same error. But running cr_checkpoint
without --save-private works fine! I did a successful migration and
the job finished correctly.

I have to investigate how disabling pre-linking could affect the hosts.
Any ideas? Lower performance? Problems with my applications?
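
One way to check whether pre-linking really places the same libraries at
different addresses on the two hosts is to compare the contents of
/proc/PID/maps for the same program on each host. The following is only a
minimal sketch under that assumption: the function names and the sample
mappings are made up for illustration and are not part of BLCR or SGE.

```python
from typing import Dict, Tuple


def library_bases(maps_text: str) -> Dict[str, int]:
    """Return the lowest mapped address of each shared library in a maps dump."""
    bases: Dict[str, int] = {}
    for line in maps_text.splitlines():
        # Fields: address range, perms, offset, device, inode, pathname.
        fields = line.split(None, 5)
        if len(fields) < 6 or ".so" not in fields[5]:
            continue  # anonymous mapping or not a shared library
        start = int(fields[0].split("-")[0], 16)
        path = fields[5].strip()
        bases[path] = min(start, bases.get(path, start))
    return bases


def differing_libraries(maps_a: str, maps_b: str) -> Dict[str, Tuple[int, int]]:
    """Libraries present in both dumps whose base addresses differ."""
    a, b = library_bases(maps_a), library_bases(maps_b)
    return {p: (a[p], b[p]) for p in a.keys() & b.keys() if a[p] != b[p]}


# Hypothetical excerpts of /proc/PID/maps from host A and host B.
HOST_A = """\
00400000-0040b000 r-xp 00000000 08:01 1234 /bin/cat
3a1ba00000-3a1ba1c000 r-xp 00000000 08:01 5600 /lib64/ld-2.5.so
3a1be00000-3a1bf4d000 r-xp 00000000 08:01 5678 /lib64/libc-2.5.so
3a1bf4d000-3a1c14d000 ---p 0014d000 08:01 5678 /lib64/libc-2.5.so
"""
HOST_B = """\
00400000-0040b000 r-xp 00000000 08:01 1234 /bin/cat
3a1ba00000-3a1ba1c000 r-xp 00000000 08:01 5600 /lib64/ld-2.5.so
2b5f3c000000-2b5f3c14d000 r-xp 00000000 08:01 5678 /lib64/libc-2.5.so
2b5f3c14d000-2b5f3c34d000 ---p 0014d000 08:01 5678 /lib64/libc-2.5.so
"""

if __name__ == "__main__":
    for path, (addr_a, addr_b) in differing_libraries(HOST_A, HOST_B).items():
        print(f"{path}: host A {addr_a:#x}, host B {addr_b:#x}")
```

In practice the two strings would come from files copied off each host
(for example the output of "cat /proc/PID/maps" for the job's process).
An empty result does not prove the libraries are identical; it only
shows that they are mapped at the same base addresses.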


Thanks a lot,
Sergio



Paul H. Hargrove wrote:
> Sergio,
>
>   Your problem sounds like a problem of not having identical shared 
> libraries on host A and host B.  One possibility is that the two hosts 
> have different versions of libs installed, and a second possibility is 
> that they could have the same versions installed, but that 
> "prelinking" may be mapping them to different addresses on the two hosts.
>
>   If you think that the libraries installed on the two hosts are the 
> same, then try the instructions in our FAQ for disabling pre-linking: 
> http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink .
>
>   If you know that the library versions are /not/ the same, or if 
> disabling pre-linking does not help, then you will need to add the 
> "--save-private" flag to the cr_checkpoint command in the SGE 
> migration script to request that BLCR include copies of the libraries 
> in the context file.
>
>   I hope one of the two suggestions above resolves your problem.  If 
> not, let us know and we'll see what else we can try.
>
> -Paul
>
> Sergio Díaz wrote:
>> Hi all,
>>
>> I am using BLCR + SGE to checkpoint my jobs. It's working fine 
>> and I can also migrate the job (with qmod -s JOB_ID).
>> The problem is the following: if I have a job running on host A and 
>> I do a qmod -s JOB_ID (to migrate the job), SGE launches the 
>> migration script and does a checkpoint, kills the job and puts the 
>> job in the queue. When a host is free, SGE runs the job on that host. 
>> If the job runs on host A, it finishes fine, but if the job is run on 
>> another host (host B, for instance) the job fails.
>>
>> Running strace on the command cr_restart archivo_checkpoint, I can 
>> see the following:
>>
>> If the job runs on the same host:
>>> .....
>>> close(5)                                = 0
>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>> wait4(27782, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 
>>> __WCLONE|__WALL, NULL) = 27782
>>> --- SIGCHLD (Child exited) @ 0 (0) ---
>>> exit_group(0)                           = ?
>>> Process 27972 detached
>>
>> If the job runs on another host:
>>
>>> ....
>>> close(5)                                = 0
>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>> wait4(27782, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], 
>>> __WCLONE|__WALL, NULL) = 27782
>>> --- SIGCHLD (Child exited) @ 0 (0) ---
>>> setrlimit(RLIMIT_CORE, {rlim_cur=0, rlim_max=0}) = 0
>>> rt_sigaction(SIGSEGV, {SIG_DFL}, NULL, 8) = 0
>>> tgkill(8889, 8889, SIGSEGV)             = 0
>>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>>> +++ killed by SIGSEGV +++
>>> Process 8889 detached
>>
>>
>> Any ideas?
>>
>> Regards,
>> Sergio
>>
>>
>>
>>
>
>
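
The two strace excerpts quoted above differ in the status that wait4()
reports for the restarted child: a normal exit with code 0 on the same
host, and termination by SIGSEGV on the other host. As a rough
illustration of how such a wait status is decoded, here is a small
sketch using only the Python standard library on a Unix-like system. The
helper names are hypothetical and nothing here calls BLCR; the child
processes merely stand in for the two cases.

```python
import os
import signal
from typing import Callable


def describe_status(status: int) -> str:
    """Decode a wait status the same way strace prints it."""
    if os.WIFEXITED(status):
        return f"exited with code {os.WEXITSTATUS(status)}"
    if os.WIFSIGNALED(status):
        return f"killed by {signal.Signals(os.WTERMSIG(status)).name}"
    return f"other status {status:#x}"


def run_child(action: Callable[[], None]) -> int:
    """Fork, run `action` in the child, and return the raw wait status."""
    pid = os.fork()
    if pid == 0:
        try:
            action()
        finally:
            os._exit(0)  # leave the child without running parent cleanup
    _, status = os.waitpid(pid, 0)
    return status


def crash() -> None:
    """Stand-in for a child that dies with a segmentation fault."""
    os.kill(os.getpid(), signal.SIGSEGV)


if __name__ == "__main__":
    print(describe_status(run_child(lambda: None)))
    print(describe_status(run_child(crash)))
```

The first case corresponds to the WIFEXITED/WEXITSTATUS line in the
successful trace, the second to the WIFSIGNALED/WTERMSIG line in the
failing one.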


-- 
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: sdiaz_at_cesga.es ; http://www.cesga.es/
------------------------------------------------
