From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Nov 02 2006 - 21:52:07 PST
Anton,

What you describe is not "multithreaded" (clone() or pthread_create()) but instead is "multiprocess" (fork()), which you correctly note we don't yet have support for. I am afraid that we are already quite late in producing the process group support, which should have been out back in about April. I am also afraid that I don't have a good estimate of when it will be available, though December would be my best (optimistic) guess. The snapshots available at http://mantis.lbl.gov/blcr-dist include the latest I have in CVS, but process group support is not yet available even as an unstable version.

As for the situation you are encountering, I am not clear on why you can't restart either process. The parent process that calls waitpid() is sure to fail to restart correctly, because its parent-child relationship is not restored. However, there is no reason I can see (based on your description) why the "child" process running filecounting shouldn't restart correctly. Is there anything else unusual?

-Paul

Anton V. Uzunov wrote:
>Hi,
>
>I am currently testing BLCR in the hope of using it as our
>checkpoint/restore library, and I have encountered a problem with
>checkpointing multi-process applications. For example, BLCR has trouble
>(or perhaps I am not doing something correct?) checkpointing a simple C
>program which uses (a slightly modified version of) the "filecounting"
>example provided with BLCR:
>...
>pid_t p = fork();
>if (p == 0)
>    execlp( "filecounting", ... );
>waitpid( p, ... );
>...
>(The slight modification in "filecounting" consists of making it multi-threaded
>as per the other BLCR example, "pthread_counting").
>In such a case two PIDs are created, one for the parent and child
>processes respectively, and while both processes can be checkpointed
>using cr_checkpoint PID, neither of them can be restored via
>cr_restart. Perhaps this has to do with BLCR not having implemented
>checkpointing of process groups?
>If this is the case, do you know
>(approximately) when this functionality will be implemented? Is there
>perhaps a (newer, not entirely stable) CVS snapshot that has (some of)
>this functionality? Or should I perhaps use the library hooks to
>implement multi-process checkpointing myself, if this has not already
>been implemented? I would appreciate any information on this.
>
>Best regards,
>Anton V. Uzunov
>

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
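
For reference, the fork()/exec/waitpid() pattern discussed in this thread can be written out as a minimal C sketch. It is only an illustration of the "multiprocess" case: the helper name spawn_and_wait is hypothetical, it is not part of BLCR, and the arguments elided as "..." in the quoted fragment are not reconstructed here. The caller supplies the program name and argument vector.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not a BLCR function): fork a child, exec `file`
 * with `argv` in it, and block in waitpid() until the child exits.
 * Returns the child's exit status, or -1 on error or abnormal exit. */
int spawn_and_wait(const char *file, char *const argv[])
{
    pid_t p = fork();
    if (p < 0) {
        perror("fork");
        return -1;
    }
    if (p == 0) {
        /* Child: a separate process with its own PID. */
        execvp(file, argv);
        /* Only reached if the exec failed. */
        perror("execvp");
        _exit(127);
    }

    /* Parent: this wait depends on the parent-child relationship,
     * which is the part described above as not being restored. */
    int status;
    if (waitpid(p, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling spawn_and_wait("filecounting", args) with a suitable NULL-terminated args array would yield the two PIDs mentioned in the original report: the parent blocked in waitpid() and the child running the exec'd program.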