From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Apr 01 2008 - 12:07:43 PST
Yuan,

Sorry for the delay in getting back to you. I had to ask a colleague to install R for me, and then I left on travel about the time that was finished.

I tried today with BLCR 0.6.5 and was able to checkpoint and restart the script you provided. I verified that /usr/lib64/gconv/gconv-modules.cache was mmapped (it was not when I had LANG=C in my environment, but changing it to LANG=en_US.UTF-8 caused it to be mmapped).

Since I cannot reproduce your problem, I am not sure what I can do at this point to help you. If you have any ideas about what makes your system different, please let me know.

While not related to a "permission denied" error, it is worth noting that your test script looks at wallclock time, which BLCR does not "virtualize". So if I restart more than 180 seconds after the original program began, then I get only a single ">" line as output. Not exactly a problem, but I was confused by it initially.

-Paul

Yuan Wan wrote:
>
> Paul,
> --------------------------------------------------------------------------------------
>
> $ ls -l /usr/lib64/gconv/gconv-modules.cache
> -rw-r--r-- 1 root root 21546 Oct 2 14:51
> /usr/lib64/gconv/gconv-modules.cache
> $ tcsh -c 'cat /proc/$$/maps' | grep gconv
> 2a9892f000-2a98935000 r--s 00000000 08:01 522135
> /usr/lib64/gconv/gconv-modules.cache
> ---------------------------------------------------------------------------------------
>
> I cannot see any difference on permission.
>
> Can you restart my test script from checkpoint on your machine?
>
> -------------------------------------------
> #!/bin/sh
>
> PATHTOR=/usr/bin
> # Below, the phrase "EOF" marks the beginning and end of the HERE document.
> $PATHTOR/R --no-save <<EOF
> mod<-function (x, y)
> {
>   x1 <- trunc(trunc(x/y) * y)
>   z <- trunc(x) - x1
>   z
> }
>
> z0 <- unclass(Sys.time())
>
> repeat{
>
>   z1<-unclass(Sys.time())
>   secs<-floor(z1-z0)
>   if (mod(secs, 10)==0) print(secs)
>   if ((secs)>180) break
>
> }
> EOF
>
> -------------------------------------------
>
> --Yuan
>
> On Fri, 14 Mar 2008, Paul H. Hargrove wrote:
>
>> Yuan,
>>
>> What do you get if you run the following two commands?
>> $ ls -l /usr/lib64/gconv/gconv-modules.cache
>> $ tcsh -c 'cat /proc/$$/maps' | grep gconv
>>
>> What I see is a world readable file and a shared read-only mmap in tcsh:
>> $ ls -l /usr/lib64/gconv/gconv-modules.cache
>> -rw-r--r-- 1 root root 21514 Jun 3 2005
>> /usr/lib64/gconv/gconv-modules.cache
>> $ tcsh -c 'cat /proc/$$/maps' | grep gconv
>> 2b8e36967000-2b8e3696d000 r--s 00000000 00:0f 9486631
>> /usr/lib64/gconv/gconv-modules.cache
>>
>> So, there shouldn't be a problem unless there is something different
>> about your system.
>>
>> -Paul
>>
>> Paul H. Hargrove wrote:
>>> Yuan,
>>>
>>> I've not seen that particular failure before, but some quick research
>>> indicates that gconv-modules.cache is a part of glibc and I suspect that
>>> it is getting mapped in much the same way as the NSCD file is. I will
>>> continue to look into the problem to see what BLCR might be able to do
>>> differently.
>>>
>>> -Paul
>>>
>>> Yuan Wan wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> Thanks for replying.
>>>> The error message I got from /var/log/messages is as follows:
>>>>
>>>> vmadump: mmap failed: /usr/lib64/gconv/gconv-modules.cache
>>>> thaw_threads returned error, aborting. -13
>>>>
>>>> The failure seems not caused by NSCD. What do you think?
>>>>
>>>> --Yuan
>>>>
>>>> On Mon, 10 Mar 2008, Paul H. Hargrove wrote:
>>>>
>>>>> Yuan,
>>>>>
>>>>> The most likely cause is that the restart failed to open one of the
>>>>> files that was open() or mmap()ed at the time the checkpoint was
>>>>> taken.
>>>>> Based on the fact that you see this w/ a shell script, but not C code,
>>>>> my best guess is that you are encountering a problem with the file that
>>>>> the Name Service Cache Daemon (NSCD) uses. Please see the following FAQ
>>>>> entry for more detail (including what to look for in the system logs)
>>>>> http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#nscd
>>>>> The only known work-around is to remove NSCD from your system.
>>>>>
>>>>> -Paul
>>>>>
>>>>> Yuan Wan wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'm trying to restart my shell script jobs (bash and R) with BLCR but
>>>>>> failed with the following error:
>>>>>>
>>>>>> "Restart failed: Permission denied"
>>>>>>
>>>>>> I can checkpoint the job and get context file. The restart will be
>>>>>> successful if executed by root but fail if run by normal users. The
>>>>>> context file does belong to me, so I'm wondering where the permission
>>>>>> is required. I can also restart a C code as a regular user without
>>>>>> problem.
>>>>>>
>>>>>> Anyone know the possible reason? Thanks
>>>>>>
>>>>>> --Yuan
>>>>>>
>>>>>> Yuan Wan

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
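
The check used in this thread (grepping /proc/$$/maps for gconv) can be wrapped in a small helper when comparing two systems or two locale settings. The following is only a minimal sketch and is not part of BLCR: the function name maps_perms is hypothetical, and it assumes the Linux /proc/<pid>/maps layout seen in the output quoted above, where the second field holds the permissions and mapped files carry a pathname as a sixth field.

```shell
#!/bin/sh
# maps_perms FILE PATTERN
# Hypothetical helper (not part of BLCR): for every file-backed mapping
# in a /proc/<pid>/maps-style FILE whose line contains the literal
# string PATTERN, print the permission field (for example "r--s").
# Lines with fewer than six fields have no pathname and are skipped.
maps_perms() {
    awk -v pat="$2" 'NF >= 6 && index($0, pat) { print $2 }' "$1"
}
```

For example, `maps_perms /proc/$$/maps gconv-modules.cache` inspects the shell that runs the function; no output means the cache file is not mapped in that shell. Whether anything is printed depends on the shell and on the locale (LANG) in effect, as noted in the reply above.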