From: Guy Coates (gmpc_at_sanger.ac.uk)
Date: Wed Jun 17 2009 - 06:41:23 PDT
Hi all,

Attached are a couple of simple scripts to allow checkpoint integration
with the Platform LSF job scheduler. They have been tested with single
CPU jobs.

Installation
------------

Copy erestart.blcr and echkpnt.blcr into your LSF_SERVERDIR directory.
The default is something like:

/path/to/lsf/7.0/linux2.6-glibc2.3-x86_64/etc

Set the permissions to match the other binaries in this directory.

Use
---

To submit a job that will be checkpointed every 24 hours (the checkpoint
period is given in minutes, so 1440), run:

bsub -k "/path/to/your/checkpoint/directory method=blcr 1440" \
  ...(other LSF options) ... \
  cr_run /path/to/your/wrapper/script

To restart a job from a checkpoint, run:

brestart /path/to/your/checkpoint/directory/jobid

Job migration (bmig) and forced checkpointing (bchkpnt) should also work.

Cheers,

Guy

--
Dr. Guy Coates, Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 496802

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

#!/bin/bash
#
# LSF script to checkpoint a job using blcr
# Copyright Genome Research Ltd 2009
# $Revision: 1.1 $
# $Author: gmpc $ $Date: 2009-06-17 12:08:53 $
#
# Parse command line options. -k means checkpoint and then kill the job.
#
KILL="FALSE"
while [ "$1" != "" ] ; do
    case "$1" in
        -k ) KILL="TRUE"
             ;;
    esac
    shift
done

#
# Find the PID of the job to checkpoint. Restarted jobs need to be handled
# differently from new jobs. At each level of the process tree we follow
# the lowest-numbered child PID.
JOBLAUNCHER=`ps --pid $LSB_JOBRES_PID --no-heading | awk '{print $4}'`
case $JOBLAUNCHER in
    sbatchd )
        # We are a restarted job; walk the process tree 3 steps.
        ID1=`ps --no-heading --ppid $LSB_JOBRES_PID | awk '{print $1}' | sort -r -n | tail -1`
        ID2=`ps --no-heading --ppid $ID1 | awk '{print $1}' | sort -r -n | tail -1`
        ID=`ps --no-heading --ppid $ID2 | awk '{print $1}' | sort -r -n | tail -1`
        ;;
    res )
        # We are a new job; walk the process tree 2 steps.
        ID1=`ps --no-heading --ppid $LSB_JOBRES_PID | awk '{print $1}' | sort -r -n | tail -1`
        ID=`ps --no-heading --ppid $ID1 | awk '{print $1}' | sort -r -n | tail -1`
        ;;
    * )
        # Unrecognised launcher; we can't find the job's PID.
        exit 1
        ;;
esac

if [ "$KILL" = "TRUE" ] ; then
    # Checkpoint, then terminate the job.
    /usr/bin/cr_checkpoint --term -f "$LSB_CHKPNT_DIR/jobstate.context" $ID
    status=$?
else
    # Checkpoint and let the job keep running.
    /usr/bin/cr_checkpoint --run -f "$LSB_CHKPNT_DIR/jobstate.context" $ID
    status=$?
fi
exit $status

#!/bin/sh
# Restart an LSF job from a blcr checkpoint
# Copyright Genome Research Ltd 2009
# $Revision: 1.2 $
# $Author: gmpc $ $Date: 2009-06-17 13:13:28 $

# Write the restart command where LSF will pick it up.
echo "LSB_RESTART_CMD=/usr/bin/cr_restart $LSB_CHKPNT_DIR/jobstate.context" > "$LSB_CHKPNT_DIR/.restart_cmd"
exit $?
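
The two branches of the checkpoint script differ only in how many levels of the process tree they descend, so the walk could be factored into a helper. The following is a minimal sketch of that idea, not part of the attached scripts: the function names lowest_child and walk_down are hypothetical, and it assumes a procps-style ps that supports --ppid.

```shell
#!/bin/bash
# Hypothetical helpers illustrating the process-tree walk; these are not
# part of the attached echkpnt.blcr script.

# Print the lowest-numbered child PID of the given parent PID.
# Prints nothing if the parent has no children.
lowest_child() {
    ps -o pid= --ppid "$1" 2>/dev/null | sort -n | head -n 1 | tr -d ' '
}

# Descend a fixed number of levels from a starting PID, following the
# lowest-numbered child at each level. Returns non-zero if some level
# has no children, so the caller never checkpoints an empty PID.
walk_down() {
    local pid="$1" depth="$2"
    while [ "$depth" -gt 0 ]; do
        pid=$(lowest_child "$pid")
        [ -n "$pid" ] || return 1
        depth=$((depth - 1))
    done
    echo "$pid"
}
```

With helpers like these, the "res" branch might become something like ID=$(walk_down "$LSB_JOBRES_PID" 2) || exit 1, and the "sbatchd" branch the same with a depth of 3. The explicit failure return is a design choice of this sketch: the original script would go on to call cr_checkpoint with an empty PID if any level of the walk found no children.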