Re: BLCR on IA64

From: 强 马
Date: Wed Nov 26 2008 - 23:32:29 PST

  • Next message: Paul H. Hargrove: "Re: BLCR on IA64"
    Thanks, Paul.
    I'm sorry that I speak only a little English. Thank you very much for your patience.
    I am sure that the ar.bsp and ar.bspstore registers are saved and restored exactly. Our CR system runs well on IA64 for many programs, but fails for some mvapich programs. These MPI programs run on an InfiniBand cluster, and checkpoint/restart already works for them on an x86 InfiniBand cluster. Now I want to get CR working on an IA64 InfiniBand cluster.
    I'm sorry, but I cannot share this code, because our competitors would benefit from it.
    --- On Thursday, Nov 27, 2008, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:
    From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    Subject: Re: BLCR on IA64
    Cc: checkpoint_at_lbl_dot_gov
    Date: Thursday, Nov 27, 2008, 6:52 AM
    Thank you for your interest in BLCR and your kind words.
    We have not ported BLCR to IA64, and I am therefore very interested in
    what you have done. When you say "So I fixed BLCR because it couldn't
    work on IA64", I believe you mean that you have ported BLCR to the IA64
    architecture. There are many things that might have gone wrong in
    performing such a port, and some of them would result in a Segmentation
    Fault at restart and/or the all-zero registers you see with gdb.
    One possibility that comes to mind is the Register Stack Engine (RSE). I
    am not an expert on the IA64 architecture, but your words "Segmentation
    fault always happened after the processes restarted for a while", makes
    me think that perhaps the backing store for the RSE has not been
    saved/restored. That would show up after returning from one or more
    nested function calls, and would probably show up as zeroed registers.
    It is also possible that the ar.bsp and ar.bspstore registers that
    control use of the backing store might be incorrect, which would have
    the same result.
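
The failure mode described above can be illustrated with a toy model. This is only a sketch under loud assumptions: it is not IA64 code, it is not part of BLCR, and the names ToyRSE, checkpoint, restart and run are hypothetical. It models a register file that holds only a few call frames and spills older frames to a backing store in memory; it does not model ar.bsp or ar.bspstore at all. If the checkpoint does not save the backing store, the restarted program looks healthy until it returns far enough to reload a spilled frame, which then comes back as zeros.

```python
# Toy model only: NOT real IA64 semantics and NOT BLCR code.
# All names here are hypothetical and exist purely for illustration.

class ToyRSE:
    def __init__(self, phys_frames=2):
        self.phys_frames = phys_frames  # frames that fit in the register file
        self.frames = []                # frames held in registers, innermost last
        self.backing_store = []         # frames spilled to memory, oldest first

    def call(self, values):
        # A nested call needs a new frame; spill the oldest one if full.
        if len(self.frames) == self.phys_frames:
            self.backing_store.append(self.frames.pop(0))
        self.frames.append(list(values))

    def ret(self):
        # Drop the current frame; reload the caller's frame from the
        # backing store if it is no longer held in registers.
        self.frames.pop()
        if not self.frames and self.backing_store:
            self.frames.append(self.backing_store.pop())
        return self.frames[-1] if self.frames else None


def checkpoint(rse, save_backing_store):
    regs = [list(f) for f in rse.frames]
    if save_backing_store:
        mem = [list(f) for f in rse.backing_store]
    else:
        # Unsaved memory is modeled as zero-filled after restart.
        mem = [[0] * len(f) for f in rse.backing_store]
    return regs, mem


def restart(image, phys_frames=2):
    rse = ToyRSE(phys_frames)
    rse.frames = [list(f) for f in image[0]]
    rse.backing_store = [list(f) for f in image[1]]
    return rse


def run(save_backing_store):
    rse = ToyRSE()
    for depth in (1, 2, 3, 4):          # four nested calls
        rse.call([depth, depth])
    rse = restart(checkpoint(rse, save_backing_store))
    seen = []
    while True:                         # return all the way out
        frame = rse.ret()
        if frame is None:
            break
        seen.append(frame)
    return seen


print(run(True))   # [[3, 3], [2, 2], [1, 1]]
print(run(False))  # [[3, 3], [0, 0], [0, 0]]
```

In the second run the first return still sees correct values, because that frame was in the register file at checkpoint time. The zeros appear only on later returns, which matches a fault that shows up some time after restart rather than immediately.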
    Rather than testing with mvapich, I'd first suggest that you ensure that
    "make check" in the BLCR build directory PASSes all the tests (or
    SKIPs one or two). Have you done that yet?
    Also, I'd be very interested in including your IA64 port in the BLCR
    distribution, even if it is incomplete or imperfect. It is possible that
    other people could find and fix any bugs that may remain in the port. If
    you are prepared to share your IA64 port with the world, please see
    BLCR's README.devel for information about the Signed-off-by line that is
    needed to allow me to redistribute your contributions. Large patches or
    tar files should be sent to me directly at PHHargrove_at_lbl_dot_gov rather
    than by reply to the checkpoint_at_lbl_dot_gov list.
    I understand that English is not your first language. So, please let me
    know if anything I have said is not clear to you.
    强 马 wrote:
    > Hello,
    > BLCR is wonderful!
    > We have developed a checkpoint/restart system for mvapich programs
    > based on BLCR.
    > It's running on an X86 cluster and being ported to IA64. So I fixed BLCR
    > because it couldn't work on IA64.
    > Now I have a problem on IA64. Although my mvapich processes restarted
    > from checkpoint files successfully, Segmentation fault always happened
    > after the processes restarted for a while. I checked the core file with
    > gdb; all the registers are zero, so no stack information can be
    > obtained. I guess it's a memory fault.
    > If I don't cancel the program after the checkpoints are finished and
    > let it continue to run, it runs fine until it terminates normally.
    > But if I cancel the program when the checkpoints are finished and then
    > restart it from the checkpoint files, I get the above segmentation fault.
    > How can I resolve this problem? Can you help me and give me any tips?
    > Thank you in advance.
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
