From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Nov 26 2008 - 14:52:30 PST
Thank you for your interest in BLCR and your kind words.

We have not ported BLCR to IA64, and I am therefore very interested in what you have done. When you say "So I fixed BLCR because it couldn't work on IA64", I believe you mean that you have ported BLCR to the IA64 architecture.

There are many things that might have gone wrong in performing such a port, and some of them would result in a Segmentation Fault at restart and/or the all-zero registers you see with gdb. One possibility that comes to mind is the Register Stack Engine (RSE). I am not an expert on the IA64 architecture, but your words "Segmentation fault always happened after the processes restarted for a while" make me think that perhaps the backing store for the RSE has not been saved/restored. That would show up after returning from one or more nested function calls, and would probably show up as zeroed registers. It is also possible that the ar.bsp and ar.bspstore registers that control use of the backing store might be incorrect, which would have the same result.

Rather than testing with mvapich, I'd first suggest that you ensure that "make check" in the BLCR build directory PASSes all the tests (it might SKIP one or two). Have you done that yet?

Also, I'd be very interested in including your IA64 port in the BLCR distribution, even if it is incomplete or imperfect. It is possible that other people could find and fix any bugs that may remain in the port. If you are prepared to share your IA64 port with the world, please see BLCR's README.devel for information about the Signed-off-by line that is needed to allow me to redistribute your contributions. Large patches or tar files should be sent to me directly at PHHargrove_at_lbl_dot_gov rather than by reply to the checkpoint_at_lbl_dot_gov list.

I understand that English is not your first language. So, please let me know if anything I have said is not clear to you.

-Paul

[sender name garbled in archive] wrote:
> Hello
> BLCR is wonderful!
> We have developed a checkpoint/restart system for mvapich program
> based on BLCR.
> It's running on X86 cluster and being planted to IA64. So I fixed BLCR
> because it couldn't work on IA64.
> Now I have a trouble on IA64. Although my mvapich processes restarted
> from checkpoint files successfully, Segmentation fault always happened
> after the processes restarted for a while. I check the core file by
> gdb, all the registers are zero, so no any stack information can be
> got. I guess it's memory fault.
> If I don't cancel the program after the checkpoints are finished and
> let it continue to run, it runs kindly until terminated normally.
> Otherwise, I cancel the program when checkpoints are finished, then
> restarted it from checkpoint files, I find the above segment fault.
> How to resolve this problem? Can you help me, and give me any tips?
> thanks you on advanced.

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory