From: Eric Roman (ESRoman_at_berkeley_dot_edu)
Date: Thu Sep 25 2008 - 17:27:18 PDT
Dear Jin,

I'm not sure exactly what's wrong, but based on your error, it sounds like the shell script wasn't started with cr_run. That's where you would get the 'Checkpoint failed: support missing from application' message.

The first thing I would try is writing a wrapper around cr_run that writes messages to the system log. Something like this:

Change the MOM configuration from

    $checkpoint_run_exe /usr/local/bin/cr_run

to

    $checkpoint_run_exe /usr/local/bin/cr_run.logging

and create a script /usr/local/bin/cr_run.logging:

    #!/bin/bash
    # Log the invocation, then hand off to the real cr_run.
    logger "$0: $*"
    /usr/local/bin/cr_run "$@"
    # Pass cr_run's exit status back to MOM.
    exit $?

You'll see messages in the system log every time a checkpointable job starts. If you don't see any messages, then for some reason MOM isn't invoking cr_run.

If you do see messages, it means that (for some reason) the checkpoint library isn't being preloaded into your application. This could happen for a few reasons. The obvious ones are that the checkpoint library wasn't found (unlikely), that one of the processes is statically linked, or that the LD_PRELOAD set by cr_run is being lost somewhere in the environment. If that's the case, try checkpointing something else.

Eric

On Tue, Sep 23, 2008 at 04:29:45PM +1000, Jin Zhang wrote:
> Dear BLCR,
>
> I've got a problem by using BLCR.
> I install BLCR in cluster, and tried to run with Torque for a serial job.
> I've configured Torque with --enable-blcr, I've installed BLCR into kernel with insmod, and I've create the script that mom_priv need.
>
> However, when I run qhold, there was an error message as following:
>
> Sep 23 15:43:00 wayland003 pbs_mom: mach_checkpoint, checkpoint args: /usr/spool/PBS/mom_priv/blcr_checkpoint_script 28676 155.wayland.in.vpac.org wl /usr/spool/PBS/checkpoint ckpt.155.wayland.in.vpac.org.1222148580 15
> Sep 23 15:43:00 wayland003 checkpoint_script: Invoked: /usr/spool/PBS/mom_priv/blcr_checkpoint_script 28676 155.wayland.in.vpac.org wl /usr/spool/PBS/checkpoint ckpt.155.wayland.in.vpac.org.1222148580 15
> Sep 23 15:43:00 wayland003 checkpoint_script: Subcommand (cr_checkpoint --signal 15 --tree 28676 --file ckpt.155.wayland.in.vpac.org.1222148580) failed with rc=16777215:
>
> Then I check qstat -f 155, Job_state = R, it still running.
>
> When I ran:
> cr_checkpoint --signal 15 --tree 28676 --file ckpt.155.wayland.in.vpac.org.1222148580,
> there was another error:
> Checkpoint failed: support missing from application
>
> Can you please tell me what's the problem
>
> Thanks
>
> --
> Jin Zhang
>
> Systems Administrator
> Victorian Partnership for Advanced Computing
> 110 Victoria St. Carlton South, VIC, 3053 AU
> E: jin_at_vpac_dot_org P: +61 (03) 9925 4942

--
Eric Roman                     Department of Physics
510-642-7302                   UC Berkeley
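
The two preload-related causes named in the reply (LD_PRELOAD lost from the environment, or the library never loaded) can be checked on the compute node while the job is running. The following is only a rough sketch, not part of BLCR or Torque: `check_preload` is a hypothetical helper, it assumes a Linux /proc filesystem, and it assumes the preloaded checkpoint library has `libcr` somewhere in its file name, so adjust the pattern to match your installation.

```shell
#!/bin/bash
# check_preload PID [PATTERN]
# Hypothetical diagnostic helper (not shipped with BLCR or Torque).
# Assumes Linux /proc; PATTERN defaults to "libcr", which is an assumption
# about how the preloaded checkpoint library is named on your system.
check_preload() {
    local pid=$1
    local pattern=${2:-libcr}

    if [ ! -r "/proc/$pid/environ" ] || [ ! -r "/proc/$pid/maps" ]; then
        echo "cannot read /proc/$pid (no such process, or not permitted)"
        return 2
    fi

    # environ is NUL-separated; convert to one variable per line.
    if tr '\0' '\n' < "/proc/$pid/environ" | grep -q '^LD_PRELOAD='; then
        echo "LD_PRELOAD is set in the environment of $pid"
    else
        echo "LD_PRELOAD is NOT set in the environment of $pid"
    fi

    # maps lists every shared object actually mapped into the process.
    if grep -q "$pattern" "/proc/$pid/maps"; then
        echo "a library matching '$pattern' is mapped into $pid"
        return 0
    fi
    echo "no library matching '$pattern' is mapped into $pid"
    return 1
}
```

Run it as root (or as the job owner) against the PID that MOM passes to the checkpoint script, for example `check_preload 28676`. A return value of 1 with LD_PRELOAD set would be consistent with a statically linked process; LD_PRELOAD missing altogether would suggest the variable was dropped somewhere between cr_run and the application.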