Re: bugs in blcr

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Jul 06 2007 - 08:26:13 PDT

  • Next message: Jerry Mersel: "Re: berkeley checkpoint and matlab"
    Mallikarjuna Shastry,
      I'd love to help you get BLCR running on your system, but you are
    going to need to provide some more information if I am to give useful
    answers.  Please see my comments below.
    Mallikarjuna Shastry wrote:
    > dear sir/ madam
    > this is mallikarjuna shastry, currently pursuing the
    > Ph.D in fault tolerance in distributed systems.
    > i am dealing with analysis of roll-back recovery
    > protocols such as checkpointing and message logging.
    > i have installed blcr-0.5.5 on my linux 9 system.
    There is no version 9 of the Linux kernel, and no distribution known as
    simply "linux 9".  Do you mean "Red Hat Linux 9", "SuSE Linux 9" or
    perhaps "SLES 9"?  Knowing which distribution you are running may help
    narrow down the answer to your next question.  Checking the contents of
    the file /etc/issue will probably tell you the name of your distribution.
    > each time i switch off my system i have to reinstal
    > blcr and run the commands like cr_run,cr_checkpoint
    > and cr_restart otherwise they do not work.
    > what my be the problem with this? and how do i
    > overcome this?
    > plz advise me in this regard.
    I am unclear on a couple of things you are trying to say here.
    First, when you say "reinstall" what command or commands do you need to
    repeat each time the system is restarted?  For instance, are you doing
    "make install", "make insmod" or something else?
    Second, "otherwise they do not work" doesn't tell me very much.  What
    sort of error message or type of failure do you see if you try to use
    any of these three commands without first "reinstalling blcr"?
    With clarifications of "reinstall" and "do not work", I can probably
    identify your problem and provide a solution.
    >  i have the folowing queries.
    BLCR provides a single-node checkpointer from which to build the
    functionality you are asking for.  I comment on some of your queries below.
    > 1. how do i take the periodic checkpoint on a process
    > ?
    If you want to checkpoint a process periodically, then you will need to
    write a program or script that invokes cr_checkpoint periodically.
    While there would be some value to doing this within BLCR itself, it has
    not been done.
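    As a minimal sketch of such a script (not part of BLCR): the function
    names, the context-file naming scheme and the injectable "run" and
    "sleep" parameters are all illustrative.  It also assumes that
    "cr_checkpoint -f FILE PID" writes the context file to FILE; please
    confirm that against "cr_checkpoint --help" on your installation.

```python
import subprocess
import time


def checkpoint_command(pid, context_file):
    # Assumption: "cr_checkpoint -f FILE PID" writes the context file to FILE.
    # Verify the option with "cr_checkpoint --help" before relying on it.
    return ["cr_checkpoint", "-f", context_file, str(pid)]


def periodic_checkpoint(pid, interval_s, count,
                        run=subprocess.call, sleep=time.sleep):
    """Invoke cr_checkpoint up to `count` times, `interval_s` seconds apart.

    Returns the names of the context files whose command exited with 0.
    """
    written = []
    for n in range(count):
        sleep(interval_s)
        context_file = "context.%d.%d" % (pid, n)  # hypothetical naming scheme
        if run(checkpoint_command(pid, context_file)) != 0:
            break  # the checkpoint failed, e.g. because the process exited
        written.append(context_file)
    return written
```

    For example, periodic_checkpoint(1234, 600, 6) would attempt a checkpoint
    of process 1234 every ten minutes for an hour, provided that process was
    started under cr_run.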
    > 2. hod do i restart a process on another node after
    > taking checkpoint and terminate it(process-migration)
    First, take a look at this FAQ entry for a
    warning about the most common problem seen when trying to migrate a process.
    Second, as with the periodic checkpointing question this is not
    something BLCR attempts to automate for you.  When a process is
    checkpointed, a "context file" is created containing the information
    needed to recreate that process.  If you run "cr_restart
    your_context_file" on another node then the process should resume
    running on the new node (assuming an "identical environment" such as
    same kernel, same shared libraries, same filesystem paths to files open
    at checkpoint time, etc.).
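    The manual steps just described could be scripted roughly as follows.
    This is only a sketch: the use of scp and ssh as the transport, the
    function names and the "run" parameter are assumptions for illustration,
    and none of it is something BLCR provides.  Only the "cr_restart
    your_context_file" invocation comes from the description above.

```python
import subprocess


def migration_commands(context_file, host, remote_path):
    # "scp" and "ssh" are one possible transport, chosen for illustration;
    # a shared filesystem would make the copy step unnecessary.
    copy = ["scp", context_file, "%s:%s" % (host, remote_path)]
    # "cr_restart <context_file>" is the invocation described above.
    restart = ["ssh", host, "cr_restart", remote_path]
    return [copy, restart]


def migrate(context_file, host, remote_path, run=subprocess.check_call):
    """Copy a context file to `host`, then restart the process there."""
    for command in migration_commands(context_file, host, remote_path):
        run(command)  # check_call raises CalledProcessError on failure
```

    The ssh step will normally not return until the restarted process exits,
    and the sketch leaves terminating the original process on the first node
    to you.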
    Note that BLCR does nothing to deal with network connections.  One can
    deal with them in your own code through the BLCR callback mechanism.  You
    could take a look at LAM/MPI to see how they deal with communication.
    > 3. does blcr support co-ordinated blocking checkpoint
    > protocol or non-blocking co-ordinated checkpoint ?
    > 4. how do i implement the protocols like 
    > a.unco-ordinated checkpoint ? 
    > b.communication indeced checkpoint ?
    > c.message logging protocols such as pessimistic,
    > optimistic and causal protocols.?
    As a single-node checkpointer BLCR does not implement any sort of
    coordination, nor does it deal on its own with communication.  Rather,
    BLCR is a building-block from which *you* can implement all of these
    things.  You should be able to find more information on all these
    protocols in the CS literature.  If you look at LAM/MPI, MVAPICH2 or a
    recent MPICH-V you can see how they have used the BLCR callback
    mechanism to deal with communication.
    > kindly send the details regarding this and advise me
    > how do i proceed further.
    > regards
    > m.shastry
    > mallikarjuna shastry
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
