From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Sat Nov 29 2008 - 21:03:36 PST
I am sorry to hear that you are not willing to share the code. I am
always saddened to find that there are users of Open Source Software who
are willing to make extensive changes, but are not willing to share them.

I have made the only suggestion I can without being able to see and use
the code. There are others in the BLCR community that would benefit from
an IA64 port, and some would probably be able to help if you made the
source code available. Since you are not willing to share the code, you
will probably also find there is nobody willing or able to help you with
your problem.

-Paul

vera_wx_cn_at_yahoo_dot_com.cn wrote:
>
> Thanks Paul.
>
> I'm sorry, I can speak only a little English. I thank you very much for
> your patience.
> I'm sure the ar.bsp and ar.bspstore registers are stored and resumed
> exactly. Our CR system runs well on IA64 for a lot of programs, but not
> for some mvapich programs. These MPI programs are running on an
> Infiniband cluster and checkpoint/restart successfully on an
> X86 & Infiniband cluster. Now I want to resolve CR on an
> IA64 & Infiniband cluster.
>
> I'm sorry about not sharing this code, because our opponent will
> benefit from this.
>
> --- On Thu, 27 Nov 2008, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:
>
> From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
> Subject: Re: BLCR on IA64
> To: vera_wx_cn_at_yahoo_dot_com.cn
> Cc: checkpoint_at_lbl_dot_gov
> Date: Thursday, 27 Nov 2008, 6:52
>
> Thank you for your interest in BLCR and your kind words.
>
> We have not ported BLCR to IA64, and I am therefore very interested in
> what you have done. When you say "So I fixed BLCR because it couldn't
> work on IA64", I believe you mean that you have ported BLCR to the IA64
> architecture. There are many things that might have gone wrong in
> performing such a port, and some of them would result in a Segmentation
> Fault at restart and/or the all-zero registers you see with gdb.
>
> One possibility that comes to mind is the Register Stack Engine (RSE). I
> am not an expert on the IA64 architecture, but your words "Segmentation
> fault always happened after the processes restarted for a while", makes
> me think that perhaps the backing store for the RSE has not been
> saved/restored. That would show up after returning from one or more
> nested function calls, and would probably show up as zeroed registers.
> It is also possible that the ar.bsp and ar.bspstore registers that
> control use of the backing store might be incorrect, which would have
> the same result.
>
> Rather than testing with mvapich, I'd first suggest that you ensure that
> "make check" in the BLCR build directory PASSes all the tests (or might
> SKIP one or two). Have you done that yet?
>
> Also, I'd be very interested in including your IA64 port in the BLCR
> distribution, even if it is incomplete or imperfect. It is possible that
> other people could find and fix any bugs that may remain in the port. If
> you are prepared to share your IA64 port with the world, please see
> BLCR's README.devel for information about the Signed-off-by line that is
> needed to allow me to redistribute your contributions. Large patches or
> tar files should be sent to me directly at PHHargrove_at_lbl_dot_gov rather
> than by reply to the checkpoint_at_lbl_dot_gov list.
>
> I understand that English is not your first language. So, please let me
> know if anything I have said is not clear to you.
>
> -Paul
>
> vera_wx_cn_at_yahoo_dot_com.cn wrote:
> > Hello
> > BLCR is wonderful!
> > We have developed a checkpoint/restart system for mvapich program
> > based on BLCR.
> > It's running on X86 cluster and being planted to IA64. So I fixed BLCR
> > because it couldn't work on IA64.
> > Now I have a trouble on IA64. Although my mvapich processes restarted
> > from checkpoint files successfully, Segmentation fault always happened
> > after the processes restarted for a while. I check the core file by
> > gdb, all the registers are zero, so no any stack information can be
> > got. I guess it's memory fault.
> > If I don't cancel the program after the checkpoints are finished and
> > let it continue to run, it runs kindly until terminated normally.
> > Otherwise, I cancel the program when checkpoints are finished, then
> > restarted it from checkpoint files, I find the above segment fault.
> > How to resolve this problem? Can you help me, and give me any tips?
> > Thank you in advance.
>
> --
> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
> Future Technologies Group                 Tel: +1-510-495-2352
> HPC Research Department                   Fax: +1-510-486-6900
> Lawrence Berkeley National Laboratory

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900