Re: checkpointing multiple processes

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Nov 02 2006 - 21:52:07 PST

    Anton,
    
      What you describe is not "multithreaded" (clone() or pthread_create()), 
    but instead is "multiprocess" (fork()), which you correctly note we 
    don't yet have support for.  I am afraid that we are already quite late 
    in producing the process group support, which should have been out back 
    in about April.  I am also afraid that I don't have a good estimate of 
    when it will be available, though December would be my best (optimistic) 
    guess.  The snapshots available at http://mantis.lbl.gov/blcr-dist 
    include the latest I have in CVS, but process group support is not yet 
    available even as an unstable version.
      As for the situation you are encountering, I am not clear on why you 
    can't restart either process.  The parent process that calls waitpid() 
    is sure to fail to restart correctly, because its parent-child 
    relationship is not restored.  However, there is no reason I can see 
    (based on your description) why the "child" process running filecounting 
    shouldn't restart correctly.  Is there anything else unusual?
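    
      For reference, here is a minimal sketch of the fork()/exec/waitpid() 
    pattern under discussion.  It uses only standard POSIX calls and nothing 
    from BLCR; the helper name spawn_and_wait() and the use of execvp() are 
    assumptions made for illustration, not taken from your program.  The 
    point it shows is that waitpid() only works on a child of the calling 
    process, which is why the parent depends on the parent-child 
    relationship being intact.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of BLCR): fork, exec `prog` in the child,
 * and wait for it in the parent.  Returns the child's exit status, or -1
 * on error or if the child did not exit normally. */
int spawn_and_wait(const char *prog, char *const argv[])
{
    assert(prog != NULL && argv != NULL);

    pid_t p = fork();
    if (p < 0) {
        perror("fork");
        return -1;
    }
    if (p == 0) {
        /* Child: replace this process image; the PID stays the same. */
        execvp(prog, argv);
        perror("execvp");      /* reached only if exec failed */
        _exit(127);
    }

    /* Parent: two distinct PIDs now exist, one per process. */
    fprintf(stderr, "parent pid=%ld, child pid=%ld\n",
            (long)getpid(), (long)p);

    int status;
    pid_t r;
    do {
        /* waitpid() succeeds only while `p` is a child of this process. */
        r = waitpid(p, &status, 0);
    } while (r < 0 && errno == EINTR);
    if (r < 0) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

      Calling it with an argv such as { "sh", "-c", "exit 3", NULL } should 
    return 3.  The two PIDs it prints are the ones you would pass 
    separately to cr_checkpoint.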
    
    -Paul
     
    
    Anton V. Uzunov wrote:
    
    >Hi, 
    >
    >I am currently testing BLCR in the hope of using it as our
    >checkpoint/restore library, and I have encountered a problem with
    >checkpointing multi-process applications. For example, BLCR has trouble
    >(or perhaps I am not doing something correctly?) checkpointing a simple C
    >program which uses (a slightly modified version of) the "filecounting"
    >example provided with BLCR:
    >...
    >pid_t p = fork();
    >if (p == 0)
    >  execlp( "filecounting", ... );
    >waitpid( p, ... );
    >...
    >(The slight modification in "filecounting" consists of making it multi-threaded
    >as per the other BLCR example, "pthread_counting"). 
    >In such a case two PIDs are created, one for the parent and child
    >processes respectively, and while both processes can be checkpointed
    >using cr_checkpoint PID,  neither of them can be restored via
    >cr_restart. Perhaps this has to do with BLCR not having implemented
    >checkpointing of process groups? If this is the case, do you know
    >(approximately) when this functionality will be implemented? Is there
    >perhaps a (newer, not entirely stable) CVS snapshot that has (some of)
    >this functionality? Or should I perhaps use the library hooks to
    >implement multi-process checkpointing myself, if this has not already
    >been implemented? I would appreciate any information on this. 
    >
    >Best regards, 
    >Anton V. Uzunov
    >
    >  
    >
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
