Re: problem migrating jobs

From: Sergio Díaz (sdiaz_at_cesga.es)
Date: Thu May 14 2009 - 05:19:50 PDT

    Hi Paul,
    
    Regarding the limits: the difference is that SGE sets the two limits 
    copied below to the vmem value given to qsub. I used vmem=1G, so SGE 
    sets both limits to 1G. When working on the host without SGE, these 
    limits are "unlimited". I tested the checkpoint without SGE with both 
    limits set to 1048576, and it worked fine, so I guess the limits are 
    not relevant. As for the environment (env), SGE sets some variables, 
    but I tested without SGE using the same environment and the same 
    limits, and it also worked fine.
    
    I'll run more tests, because I can't understand why it doesn't work 
    with SGE. Meanwhile, I think disabling prelinking is the best option 
    for continuing with my work.
    
    
    data seg size           (kbytes, -d) 1048576
    virtual memory          (kbytes, -v) 1048576
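
For anyone who wants to repeat the test without SGE, here is a minimal sketch of applying the two limits above in a subshell before running a command. The helper name run_with_sge_limits is hypothetical, and the exact commands used in the original test are not shown in this message. In a real test the wrapped command would be the job together with its cr_checkpoint and cr_restart calls.

```shell
#!/bin/sh
# Sketch: apply the two soft limits that SGE set for vmem=1G, then run a command.
# run_with_sge_limits is a hypothetical helper name, not part of SGE or BLCR.
run_with_sge_limits() {
    (
        # The subshell keeps the calling shell's own limits unchanged.
        ulimit -S -d 1048576 || exit 1   # data seg size, in kbytes
        ulimit -S -v 1048576 || exit 1   # virtual memory, in kbytes
        "$@"
    )
}

# Usage: print the soft limits as seen by a child process.
run_with_sge_limits sh -c 'ulimit -S -d; ulimit -S -v'
```

Only the soft limits are lowered here, so the sketch does not need root and does not reproduce any hard limits that SGE may also set.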
    
    Thanks!
    Sergio
    
    
    
    Paul H. Hargrove wrote:
    > Sergio,
    >
    >   Since --save-private allowed you to migrate the job when not using 
    > SGE, and disabling prelinking allowed success both with and without 
    > SGE I think we can conclude that prelinking was the original cause of 
    > your problems.  So, I recommend disabling prelinking as your best option.
    >
    >   The fact that use of --save-private and/or --save-exe caused errors 
    > with SGE is not something I would have guessed in advance.  However, I 
    > suspect that it has something to do with resource limits (like the 
    > limit or ulimit shell built-ins).  This might be something BLCR could 
    > work around in the future, but I have no guess at the moment how.  If 
    > for some reason you do not wish to disable prelinking, then there may 
    > be some resource limit setting in SGE that could be changed to 
    > eliminate the "Failed to locate newborn mmap()ed space" problem.  
    > However, I am not an SGE expert and so don't know where you would 
    > start looking.
    >
    >   You asked how disabling of prelinking would affect your systems.  
    > The answer is that it will cost you a small amount of performance, 
    > mostly at program startup.  It will not introduce errors in any 
    > programs.  Prelinking is also viewed by some as a security 
    > improvement.  You can read more about prelinking at 
    > http://en.wikipedia.org/wiki/Prelinking
    >
    > -Paul
    >
    > Sergio Díaz wrote:
    >> Hi Paul and Adolfo,
    >>
    >> Adolfo, running the job without SGE doesn't work.
    >>
    >> Paul, checkpointing with "--save-private" works fine only if I run 
    >> the job without SGE. If I submit the job to SGE, it doesn't restart 
    >> correctly, not even on the same host. I get the following error:
    >>
    >>> - Failed to locate newborn mmap()ed space
    >>> - cr_rstrt_child [20249]:  Unable to load mmap()ed data!  (err=-22)
    >>> Restart failed: Invalid argument
    >>
    >> I don't understand why it doesn't work: SGE shouldn't make a 
    >> difference, because the script I use for the checkpoint is basically 
    >> the same. I also tried the --save-all option, but it doesn't work 
    >> either; I got the same error. With the --save-shared and --save-exe 
    >> options I got the segmentation fault.
    >>
    >> Second attempt: with pre-linking disabled and cr_checkpoint run with 
    >> --save-private, I got the same error. But running cr_checkpoint 
    >> without --save-private works fine!! I did a successful migration 
    >> and the job finished fine.
    >>
    >> I need to find out how the hosts could be affected if I disable 
    >> pre-linking. Any ideas? Lower performance? Problems with my 
    >> applications?
    >>
    >>
    >> Thanks a lot,
    >> Sergio
    >>
    >>
    >>
    >> Paul H. Hargrove wrote:
    >>> Sergio,
    >>>
    >>>   Your problem sounds like a problem of not having identical shared 
    >>> libraries on host A and host B.  One possibility is that the two 
    >>> hosts have different versions of libs installed, and a second 
    >>> possibility is that they could have the same versions installed, but 
    >>> that "prelinking" may be mapping them to different addresses on the 
    >>> two hosts.
    >>>
    >>>   If you think that the libraries installed on the two hosts are the 
    >>> same, then try the instructions in our FAQ for disabling 
    >>> pre-linking: http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink .
    >>>
    >>>   If you know that the library versions are /not/ the same, or if 
    >>> disabling pre-linking does not help, then you will need to add the 
    >>> "--save-private" flag to the cr_checkpoint command in the SGE 
    >>> migration script to request that BLCR include copies of the 
    >>> libraries in the context file.
    >>>
    >>>   I hope one of the two suggestions above resolves your problem.  If 
    >>> not, let us know and we'll see what else we can try.
    >>>
    >>> -Paul
    >>>
    >>> Sergio Díaz wrote:
    >>>> Hi all,
    >>>>
    >>>> I am using BLCR + SGE to checkpoint my jobs. It's working fine, 
    >>>> and I can also migrate a job (with qmod -s JOB_ID).
    >>>> The problem is the following: if I have a job running on host A 
    >>>> and I run qmod -s JOB_ID (to migrate the job), SGE launches the 
    >>>> migration script, takes a checkpoint, kills the job, and puts the 
    >>>> job in the queue. When a host is free, SGE runs the job on that 
    >>>> host. If the job runs on host A, it finishes fine, but if the job 
    >>>> is run on another host (host B, for instance), the job fails.
    >>>>
    >>>> Running strace on the command cr_restart archivo_checkpoint, I can 
    >>>> see the following:
    >>>>
    >>>> If the job runs on the same host:
    >>>>> .....
    >>>>> close(5)                                = 0
    >>>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    >>>>> wait4(27782, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 
    >>>>> __WCLONE|__WALL, NULL) = 27782
    >>>>> --- SIGCHLD (Child exited) @ 0 (0) ---
    >>>>> exit_group(0)                           = ?
    >>>>> Process 27972 detached
    >>>>
    >>>> If the job runs on another host:
    >>>>
    >>>>> ....
    >>>>> close(5)                                = 0
    >>>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    >>>>> wait4(27782, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], 
    >>>>> __WCLONE|__WALL, NULL) = 27782
    >>>>> --- SIGCHLD (Child exited) @ 0 (0) ---
    >>>>> setrlimit(RLIMIT_CORE, {rlim_cur=0, rlim_max=0}) = 0
    >>>>> rt_sigaction(SIGSEGV, {SIG_DFL}, NULL, 8) = 0
    >>>>> tgkill(8889, 8889, SIGSEGV)             = 0
    >>>>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
    >>>>> +++ killed by SIGSEGV +++
    >>>>> Process 8889 detached
    >>>>
    >>>>
    >>>> Any ideas??
    >>>>
    >>>> Regards,
    >>>> Sergio
    >>>>
    >>>>
    >>>>
    >>>>
    >>>
    >>>
    >>
    >>
    >
    >
    
    
    -- 
    Sergio Díaz Montes
    Centro de Supercomputacion de Galicia
    Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
    Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
    email: sdiaz@cesga.es ; http://www.cesga.es/
    ------------------------------------------------ 
    
