Re: problem migrating jobs

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri May 15 2009 - 09:29:28 PDT

    Sergio,
      If I am recalling correctly, Adolfo was seeing cr_restart fail with 
    SIGABRT, not SIGSEGV.  So, this might not be the same problem.
    
    Adolfo,
      Thanks for trying to help Sergio.
      BLCR should now (in 0.8.x) be automatically retrying the 
    pthread_create() call with a smaller stack if the first attempt fails.  
    If you have the time, I'd appreciate hearing if things work for you 
    after removing your manipulation of the stack limits.
    
    -Paul
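
The retry behaviour described above can be illustrated with a minimal sketch. It is written in Python, not BLCR's own C code, and it is not BLCR's implementation. It assumes that a refused thread creation surfaces as RuntimeError (which is what CPython raises when the OS will not start a thread) and simply halves the requested stack until creation succeeds. The function name and the min_stack_bytes floor are hypothetical.

```python
import threading


def start_thread_with_fallback(target, stack_bytes, min_stack_bytes=64 * 1024):
    """Start `target` in a new thread, halving the stack request after each failure.

    Returns the started thread and the stack size that finally worked.
    Illustrative only: the halving policy and the floor are assumptions.
    """
    size = stack_bytes
    while True:
        # stack_size() applies to threads created afterwards and returns the old value.
        previous = threading.stack_size(size)
        try:
            thread = threading.Thread(target=target)
            thread.start()
            return thread, size
        except RuntimeError:
            # The OS refused to create the thread; retry with a smaller stack.
            if size // 2 < min_stack_bytes:
                raise
            size //= 2
        finally:
            # Restore the process-wide default for later threads.
            threading.stack_size(previous)
```

A call such as start_thread_with_fallback(worker, 8 * 1024 * 1024) would first try an 8 MiB stack and fall back to 4 MiB, 2 MiB and so on only if the thread cannot be created.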
    
    Sergio Díaz wrote:
    > Hi Adolfo,
    >
    > Thanks for your collaboration. It is very interesting to test.
    > I have tested this, and I don't think my problem is related to stack 
    > size, because I ran some tests submitting jobs with h_stack=8M and 
    > h_stack=16M and the jobs still failed. It's true that two threads are 
    > created when restarting, but the same threads are created with and 
    > without the --save-private option. With this option the restart 
    > fails, and without it the restart succeeds.
    > Did you do something special to set the stack size to half of the 
    > available memory?
    >
    > Regards,
    > Sergio
    >
    >
    >
    >
    >
    > Adolfo J. Banchio wrote:
    >> Hi Sergio,
    >>
    >> A while ago, when upgrading to BLCR 0.7 (I think that from
    >> then on BLCR itself started using threads), I had
    >> some problems restarting within SGE, even after disabling
    >> prelinking (in this sense it is different from your case,
    >> but it might help anyway).
    >> At that time I found that the problem was related to the
    >> resources set by SGE. Namely, if Stack Size was set to unlimited (the 
    >> default value in the queue definition), SGE allocated a Stack
    >> Size equal to the whole memory for each thread of cr_restart, so the 
    >> second thread (from cr_restart itself, not related to the
    >> actual job, which could be just serial) never got any free
    >> memory, and it crashed.
    >>
    >> If I remember correctly, the only workaround I found was to set
    >> Stack Size to half of the available memory, so that after
    >> migration there was enough space for both threads (since
    >> cr_restart seems to have two threads, and SGE sets the stack
    >> for each from that limit). This was the solution at that time,
    >> and I still have those limits set.
    >>
    >> I am not sure that this still applies (in my case disabling
    >> prelinking did not help, so, the problem was probably not the
    >> same as yours), but I wanted to share this with you, just in case. 
    >> regards,
    >>
    >> adolfo
    >>
    >>
    >>
    >>
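
For readers asking how the "half of the available memory" figure in the workaround above might be obtained, here is a small sketch under stated assumptions. It reads "available memory" as the host's physical memory, queried through os.sysconf, and reports half of it in kbytes (the unit that ulimit prints). The function name is hypothetical, and where the value is then entered (the thread mentions the queue definition and h_stack) is left to the SGE configuration.

```python
import os


def half_of_physical_memory_kbytes(total_bytes=None):
    """Return half of physical memory in kbytes.

    `total_bytes` can be passed explicitly; otherwise it is read from the host.
    Interpreting "available memory" as physical memory is an assumption.
    """
    if total_bytes is None:
        total_bytes = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    return total_bytes // 2 // 1024
```

On a host with 8 GiB of RAM this gives 4194304 kbytes, i.e. a 4 GiB stack limit per thread, which leaves room for the two cr_restart threads described above.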
    >> On Thu, 2009-05-14 at 14:19 +0200, Sergio Díaz wrote:
    >>  
    >>> Hi Paul,
    >>>
    >>> Regarding the limits: the difference is that SGE sets the two limits 
    >>> copied below to the value of vmem that you put in the qsub. I used 
    >>> vmem=1G, so SGE sets the limits to 1G. If you are working on the 
    >>> host without SGE, these limits are "unlimited". I tested the 
    >>> checkpoint without SGE, setting the limits to 1048576, and it 
    >>> worked fine. So, I guess, the limits are not relevant.  About the 
    >>> environment (env): SGE sets some variables, but I tested without SGE, 
    >>> setting the same environment and the limits, and it worked fine.
    >>>
    >>> I'll try to do more tests because I can't understand why it doesn't 
    >>> work with SGE. Meanwhile, I think disabling prelink could be the 
    >>> best option to continue with my work.
    >>>
    >>>
    >>> data seg size           (kbytes, -d) 1048576
    >>> virtual memory          (kbytes, -v) 1048576
    >>>
    >>> Thanks!,
    >>> Sergio
    >>>
    >>>
    >>>
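
The experiment described above (reproducing SGE's limits on a host without SGE) can be sketched as follows. This is only an illustration: it assumes a Unix host, uses Python's resource module to set the soft "data seg size" and "virtual memory" limits in a child process before the command is executed, and never raises a limit above the existing hard limit. The function name is hypothetical.

```python
import resource
import subprocess

KIB = 1024


def run_with_limits(argv, data_kbytes, vmem_kbytes):
    """Run `argv` with soft RLIMIT_DATA / RLIMIT_AS set, like `ulimit -d` / `ulimit -v`."""

    def apply_limits():
        # Runs in the child between fork() and exec(); hard limits are left untouched.
        for res, kbytes in ((resource.RLIMIT_DATA, data_kbytes),
                            (resource.RLIMIT_AS, vmem_kbytes)):
            _, hard = resource.getrlimit(res)
            soft = kbytes * KIB
            if hard != resource.RLIM_INFINITY:
                soft = min(soft, hard)  # a soft limit cannot exceed the hard limit
            resource.setrlimit(res, (soft, hard))

    return subprocess.run(argv, preexec_fn=apply_limits,
                          capture_output=True, text=True)
```

For instance, run_with_limits(["cr_restart", "archivo_checkpoint"], 1048576, 1048576) would attempt the restart under the same 1048576-kbyte limits shown in the message; the context file name is just the placeholder used elsewhere in this thread.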
    >>> Paul H. Hargrove wrote:
    >>>    
    >>>> Sergio,
    >>>>
    >>>>   Since --save-private allowed you to migrate the job when not 
    >>>> using SGE, and disabling prelinking allowed success both with and 
    >>>> without SGE, I think we can conclude that prelinking was the 
    >>>> original cause of your problems.  So, I recommend disabling 
    >>>> prelinking as your best option.
    >>>>
    >>>>   The fact that use of --save-private and/or --save-exe caused 
    >>>> errors with SGE is not something I would have guessed in advance.  
    >>>> However, I suspect that it has something to do with resource limits 
    >>>> (like the limit or ulimit shell built-ins).  This might be 
    >>>> something BLCR could work around in the future, but I have no guess 
    >>>> at the moment how.  If for some reason you do not wish to disable 
    >>>> prelinking, then there may be some resource limit setting in SGE 
    >>>> that could be changed to eliminate the "Failed to locate newborn 
    >>>> mmap()ed space" problem.  However, I am not an SGE expert and so 
    >>>> don't know where you would start looking.
    >>>>
    >>>>   You asked how disabling of prelinking would affect your systems.  
    >>>> The answer is that it will cost you a small amount of performance, 
    >>>> mostly at program startup.  It will not introduce errors in any 
    >>>> programs.  Prelinking is also viewed by some as a security 
    >>>> improvement.  You can read more about prelinking at 
    >>>> http://en.wikipedia.org/wiki/Prelinking
    >>>>
    >>>> -Paul
    >>>>
    >>>> Sergio Díaz wrote:
    >>>>      
    >>>>> Hi Paul and Adolfo,
    >>>>>
    >>>>> Adolfo, running the job without SGE, it doesn't work.
    >>>>>
    >>>>> Paul, doing the checkpoint with "--save-private" works fine 
    >>>>> only if I submit the jobs without SGE. If I submit the job to SGE, 
    >>>>> it doesn't restart properly, not even on the same host. I get the 
    >>>>> following error:
    >>>>>
    >>>>>        
    >>>>>> - Failed to locate newborn mmap()ed space
    >>>>>> - cr_rstrt_child [20249]:  Unable to load mmap()ed data!  (err=-22)
    >>>>>> Restart failed: Invalid argument
    >>>>>>           
    >>>>> I don't understand why it doesn't work, because SGE shouldn't 
    >>>>> have any effect: the script that I use for the checkpoint is 
    >>>>> basically the same.
    >>>>> I also tried using the option --save-all, but it doesn't work. I got 
    >>>>> the same error. With the options --save-shared and --save-exe I got 
    >>>>> the segmentation fault.
    >>>>>
    >>>>> 2nd attempt... disabling pre-linking and doing the cr_checkpoint 
    >>>>> with the --save-private. I got the same error. But doing the 
    >>>>> cr_checkpoint without --save-private, it works fine!! I did a 
    >>>>> successful migration and the job finished fine.
    >>>>>
    >>>>> I have to research how the hosts could be affected if 
    >>>>> I disable pre-linking. Any ideas? Lower performance? Problems 
    >>>>> with my applications?
    >>>>>
    >>>>>
    >>>>> Thanks a lot,
    >>>>> Sergio
    >>>>>
    >>>>>
    >>>>>
    >>>>> Paul H. Hargrove wrote:
    >>>>>        
    >>>>>> Sergio,
    >>>>>>
    >>>>>>   Your problem sounds like a problem of not having identical 
    >>>>>> shared libraries on host A and host B.  One possibility is that 
    >>>>>> the two hosts have different versions of libs installed, and a 
    >>>>>> second possibility is that they could have the same versions 
    >>>>>> installed, but that "prelinking" may be mapping them to different 
    >>>>>> addresses on the two hosts.
    >>>>>>
    >>>>>>   If you think that the libraries installed on the two hosts are 
    >>>>>> the same, then try the instructions in our FAQ for disabling 
    >>>>>> pre-linking: 
    >>>>>> http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink .
    >>>>>>
    >>>>>>   If you know that the library versions are /not/ the same, or if 
    >>>>>> disabling pre-linking does not help, then you will need to add 
    >>>>>> the "--save-private" flag to the cr_checkpoint command in the SGE 
    >>>>>> migration script to request that BLCR include copies of the 
    >>>>>> libraries in the context file.
    >>>>>>
    >>>>>>   I hope one of the two suggestions above resolves your problem.  
    >>>>>> If not, let us know and we'll see what else we can try.
    >>>>>>
    >>>>>> -Paul
    >>>>>>
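
One way to check the second possibility raised above (the same libraries mapped at different addresses on the two hosts) is to compare the memory maps of the same program on each host. The sketch below assumes Linux, where /proc/<pid>/maps lists each mapping with its address range and path, and assumes the maps text has already been collected from both hosts. The function names are hypothetical, and the comparison is only meaningful where load addresses are otherwise stable from run to run, as is intended with prelinked libraries.

```python
def library_bases(maps_text):
    """Map each shared-library path to its lowest mapped start address."""
    bases = {}
    for line in maps_text.splitlines():
        # Format: address perms offset dev inode pathname
        fields = line.split(None, 5)
        if len(fields) < 6 or ".so" not in fields[5]:
            continue  # anonymous mapping, or not a shared library
        start = int(fields[0].split("-")[0], 16)
        path = fields[5].strip()
        bases[path] = min(start, bases.get(path, start))
    return bases


def differing_libraries(maps_a, maps_b):
    """Return {path: (base_on_a, base_on_b)} for libraries whose base addresses differ."""
    a, b = library_bases(maps_a), library_bases(maps_b)
    return {path: (a.get(path), b.get(path))
            for path in sorted(set(a) | set(b))
            if a.get(path) != b.get(path)}
```

A library that appears with None on one side is mapped on only one host, which points at differing installed versions or paths; two different addresses for the same path are consistent with the prelinking explanation.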
    >>>>>> Sergio Díaz wrote:
    >>>>>>          
    >>>>>>> Hi all,
    >>>>>>>
    >>>>>>> I am using BLCR + SGE to checkpoint my jobs. It's working 
    >>>>>>> fine and I can also migrate a job (doing qmod -s JOB_ID).
    >>>>>>> The problem is the following: if I have a job running on host A 
    >>>>>>> and I do a qmod -s JOB_ID (to migrate the job), SGE launches the 
    >>>>>>> migration script, which does a checkpoint, kills the job and puts 
    >>>>>>> the job back in the queue. When a host is free, SGE runs the job 
    >>>>>>> on that host. If the job runs on host A, it finishes fine, but if 
    >>>>>>> the job is run on another host (host B, for instance) the job 
    >>>>>>> fails.
    >>>>>>>
    >>>>>>> Running strace on the command cr_restart archivo_checkpoint, I 
    >>>>>>> can see the following:
    >>>>>>>
    >>>>>>> If the job runs on the same host:
    >>>>>>>            
    >>>>>>>> .....
    >>>>>>>> close(5)                                = 0
    >>>>>>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    >>>>>>>> wait4(27782, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 
    >>>>>>>> __WCLONE|__WALL, NULL) = 27782
    >>>>>>>> --- SIGCHLD (Child exited) @ 0 (0) ---
    >>>>>>>> exit_group(0)                           = ?
    >>>>>>>> Process 27972 detached
    >>>>>>>>               
    >>>>>>> If the job runs on another host:
    >>>>>>>
    >>>>>>>            
    >>>>>>>> ....
    >>>>>>>> close(5)                                = 0
    >>>>>>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    >>>>>>>> wait4(27782, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], 
    >>>>>>>> __WCLONE|__WALL, NULL) = 27782
    >>>>>>>> --- SIGCHLD (Child exited) @ 0 (0) ---
    >>>>>>>> setrlimit(RLIMIT_CORE, {rlim_cur=0, rlim_max=0}) = 0
    >>>>>>>> rt_sigaction(SIGSEGV, {SIG_DFL}, NULL, 8) = 0
    >>>>>>>> tgkill(8889, 8889, SIGSEGV)             = 0
    >>>>>>>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
    >>>>>>>> +++ killed by SIGSEGV +++
    >>>>>>>> Process 8889 detached
    >>>>>>>>               
    >>>>>>> Any ideas??
    >>>>>>>
    >>>>>>> Regards,
    >>>>>>> Sergio
    >>>>>>>
    >>>>>>>
    >>>>>>>
    >>>>>>>
    >>>>>>>             
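
The two strace excerpts above differ in how the child's wait status decodes: WIFEXITED with status 0 on the same host, WIFSIGNALED with SIGSEGV on the other. A wrapper script can make the same distinction without strace. The sketch below is an illustration in Python, not part of BLCR or SGE; it relies on subprocess reporting a child killed by signal N as return code -N, and the function names are hypothetical.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable description."""
    if returncode < 0:
        # subprocess encodes "terminated by signal N" as -N.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_and_describe(argv):
    """Run a command and report whether it exited normally or died from a signal."""
    return describe_exit(subprocess.run(argv).returncode)
```

Calling run_and_describe(["cr_restart", "archivo_checkpoint"]) would then separate a SIGSEGV failure from a SIGABRT one, which is the distinction that matters when comparing the two reports in this thread.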
    >>>>>>           
    >>>>>         
    >>>>       
    >>>     
    >
    >
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
    
