From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Jul 06 2007 - 08:26:13 PDT
Mallikarjuna Shastry,

I'd love to help you get BLCR running on your system, but you are going to need to provide some more information if I am to give useful answers. Please see my comments below.

-Paul

Mallikarjuna Shastry wrote:
> dear sir/ madam
>
> this is mallikarjuna shastry, currently pursuing the
> Ph.D in fault tolerance in distributed systems.
> i am dealing with analysis of roll-back recovery
> protocols such as checkpointing and message logging.
>
> i have installed blcr-0.5.5 on my linux 9 system.

There is no version 9 of the Linux kernel, and no distribution known as simply "linux 9". Do you mean "Red Hat Linux 9", "SuSE Linux 9" or perhaps "SLES 9"? Knowing which distribution you are running may help narrow down the answer to your next question. Checking the contents of the file /etc/issue will probably tell you the name of your distribution.

> each time i switch off my system i have to reinstal
> blcr and run the commands like cr_run,cr_checkpoint
> and cr_restart otherwise they do not work.
>
> what my be the problem with this? and how do i
> overcome this?
> plz advise me in this regard.

I am unclear on a couple of things you are trying to say here.

First, when you say "reinstall", what command or commands do you need to repeat each time the system is restarted? For instance, are you doing "make install", "make insmod" or something else?

Second, "otherwise they do not work" doesn't tell me very much. What sort of error message or type of failure do you see if you try to use any of these three commands without first "reinstalling blcr"?

With clarifications of "reinstall" and "do not work", I can probably identify your problem and provide a solution.

>
> i have the folowing queries.

BLCR provides a single-node checkpointer from which to build the functionality you are asking for. I comment on some of your queries below.

> 1. how do i take the periodic checkpoint on a process
> ?

If you want to checkpoint a process periodically, then you will need to write a program or script that invokes cr_checkpoint periodically. While there would be some value to doing this within BLCR itself, it has not been done.

> 2. hod do i restart a process on another node after
> taking checkpoint and terminate it(process-migration)

First, take a look at this FAQ entry (http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink) for a warning about the most common problem seen when trying to migrate a process.

Second, as with the periodic checkpointing question, this is not something BLCR attempts to automate for you. When a process is checkpointed, a "context file" is created containing the information needed to recreate that process. If you run "cr_restart your_context_file" on another node then the process should resume running on the new node (assuming an "identical environment" such as same kernel, same shared libraries, same filesystem paths to files open at checkpoint time, etc.).

Note that BLCR does nothing to deal with network connections. One can deal with them in your own code through the BLCR callback mechanism. You could take a look at LAM/MPI to see how they deal with communication.

> 3. does blcr support co-ordinated blocking checkpoint
> protocol or non-blocking co-ordinated checkpoint ?
> 4. how do i implement the protocols like
> a.unco-ordinated checkpoint ?
> b.communication indeced checkpoint ?
> c.message logging protocols such as pessimistic,
> optimistic and causal protocols.?

As a single-node checkpointer, BLCR does not implement any sort of coordination, nor does it deal on its own with communication. Rather, BLCR is a building block from which *you* can implement all of these things. You should be able to find more information on all these protocols in the CS literature. If you look at LAM/MPI, MVAPICH2 or a recent MPICH-V you can see how they have used the BLCR callback mechanism to deal with communication.

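Returning to question 1, the "program or script that invokes cr_checkpoint periodically" could look roughly like the Python sketch below. It is only an illustration under stated assumptions: it assumes the target was started under cr_run, and that cr_checkpoint accepts "-f FILE" to name the context file followed by the process ID. Flags may differ between BLCR versions, so check "cr_checkpoint --help" on your installation. The function names, parameters and the context-file naming scheme are hypothetical and not part of BLCR.

```python
import subprocess
import time


def checkpoint_command(pid, context_file):
    """Build the argument list for one cr_checkpoint invocation.

    Assumption: `-f FILE` names the context file and the process ID is the
    final argument. Verify against `cr_checkpoint --help` locally.
    """
    return ["cr_checkpoint", "-f", context_file, str(pid)]


def periodic_checkpoint(pid, interval_s, count,
                        run=subprocess.call, sleep=time.sleep):
    """Checkpoint `pid` up to `count` times, `interval_s` seconds apart.

    Each checkpoint goes to its own file, so an earlier context file
    survives if a later checkpoint fails. Stops at the first non-zero
    exit status and returns the context files written so far.
    """
    written = []
    for i in range(count):
        sleep(interval_s)
        context_file = f"context.{pid}.{i}"  # hypothetical naming scheme
        if run(checkpoint_command(pid, context_file)) != 0:
            break  # e.g. the target has exited or BLCR is not loaded
        written.append(context_file)
    return written
```

For example, periodic_checkpoint(12345, 600, 6) would attempt a checkpoint of process 12345 every ten minutes for an hour. Any of the resulting files can then be passed to cr_restart as described above. The `run` and `sleep` parameters exist only so the loop can be exercised without BLCR installed.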
> kindly send the details regarding this and advise me
> how do i proceed further.
>
> regards
>
> m.shastry
>
>
> mallikarjuna shastry

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900