Re: BLCR on IA64

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Sat Nov 29 2008 - 21:03:36 PST

    I am sorry to hear that you are not willing to share the code.  I am always
    saddened to find that there are users of Open Source Software who are willing
    to make extensive changes, but are not willing to share them.
    
    I have made the only suggestion I can without being able to see and use the
    code.  There are others in the BLCR community that would benefit from an IA64
    port, and some would probably be able to help if you made the source code
    available.  Since you are not willing to share the code, you will probably
    also find there is nobody willing or able to help you with your problem.
    
    -Paul
    
    
    强 马 wrote:
    > 
    > Thanks Paul.
    >  
    >  I'm sorry, I can speak only a little English. Thank you very much for your patience.
    >  I'm sure the ar.bsp and ar.bspstore registers are saved and restored 
    > correctly. Our CR system runs well on IA64 for many programs, but not for 
    > some mvapich programs. These MPI programs run on an InfiniBand cluster 
    > and checkpoint/restart successfully on an x86 InfiniBand cluster. Now 
    > I want to get CR working on an IA64 InfiniBand cluster.
    >  
    > I'm sorry that I cannot share this code, because our competitors would 
    > benefit from it.
    > --- *On Thu, Nov 27, 2008, Paul H. Hargrove /<PHHargrove_at_lbl_dot_gov>/* wrote:
    > 
    >     From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    >     Subject: Re: BLCR on IA64
    >     To: vera_wx_cn_at_yahoo_dot_com.cn
    >     Cc: checkpoint_at_lbl_dot_gov
    >     Date: Thu, Nov 27, 2008, 6:52 AM
    > 
    >     Thank you for your interest in BLCR and your kind words.
    > 
    >     We have not ported BLCR to IA64, and I am therefore very interested in
    >     what you have done. When you say "So I fixed BLCR because it couldn't
    >     work on IA64", I believe you mean that you have ported BLCR to the IA64
    >     architecture. There are many things that might have gone wrong in
    >     performing such a port, and some of them would result in a Segmentation
    >     Fault at restart and/or the all-zero registers you see with gdb.
    > 
    >     One possibility that comes to mind is the Register Stack Engine (RSE). I
    >     am not an expert on the IA64 architecture, but your words "Segmentation
    >     fault always happened after the processes restarted for a while", makes
    >     me think that perhaps the backing store for the RSE has not been
    >     saved/restored. That would show up after returning from one or more
    >     nested function calls, and would probably show up as zeroed registers.
    >     It is also possible that the ar.bsp and ar.bspstore registers that
    >     control use of the backing store might be incorrect, which would have
    >     the same result.
    > 
    >     Rather than testing with mvapich, I'd first suggest that you ensure that
    >     "make check" in the BLCR build directory PASSes all the tests (or might
    >     SKIP one or two). Have you done that yet?
    > 
    >     Also, I'd be very interested in including your IA64 port in the BLCR
    >     distribution, even if it is incomplete or imperfect. It is possible that
    >     other people could find and fix any bugs that may remain in the port. If
    >     you are prepared to share your IA64 port with the world, please see
    >     BLCR's README.devel for information about the Signed-off-by line that is
    >     needed to allow me to redistribute your contributions. Large patches or
    >     tar files should be sent to me directly at PHHargrove_at_lbl_dot_gov rather
    >     than by reply to the checkpoint_at_lbl_dot_gov list.
    > 
    >     I understand that English is not your first language. So, please let me
    >     know if anything I have said is not clear to you.
    > 
    >     -Paul
    > 
    >     强 马 wrote:
    >     > Hello
    >     > BLCR is wonderful!
    >     > We have developed a checkpoint/restart system for mvapich programs
    >     > based on BLCR.
    >     > It's running on an X86 cluster and being ported to IA64. So I fixed BLCR
    >     > because it couldn't work on IA64.
    >     > Now I have a problem on IA64. Although my mvapich processes restarted
    >     > from checkpoint files successfully, Segmentation fault always happened
    >     > after the processes restarted for a while. I checked the core file with
    >     > gdb: all the registers are zero, so no stack information can be
    >     > obtained. I guess it's a memory fault.
    >     > If I don't cancel the program after the checkpoints are finished and
    >     > let it continue to run, it runs fine until it terminates normally.
    >     > However, if I cancel the program when the checkpoints are finished and then
    >     > restart it from the checkpoint files, I get the above segmentation fault.
    >     > How can I resolve this problem? Can you help me and give me any tips?
    >     > Thank you in advance.
    >     >
    >     >
    > 
    > 
    > 
    >     -- 
    >     Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >     Future Technologies Group                 Tel: +1-510-495-2352
    >     HPC Research Department                   Fax: +1-510-486-6900
    >     Lawrence Berkeley National Laboratory     
    > 
    > 
    > 
    > 
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
