blcr LSF integration

From: Guy Coates (gmpc_at_sanger.ac.uk)
Date: Wed Jun 17 2009 - 06:41:23 PDT

    Hi all,
    
    Attached are a couple of simple scripts that integrate BLCR checkpointing with
    the Platform LSF job scheduler. They have been tested with single-CPU jobs.
    
    
    Installation
    ------------
    
    Copy erestart.blcr and echkpnt.blcr into your LSF_SERVERDIR directory. The
    default is something like:
    
    /path/to/lsf/7.0/linux2.6-glibc2.3-x86_64/etc
    
    Set the permissions to match the other binaries in this directory.
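The installation step could be sketched as a small shell function. This is only an illustration: the function name `install_blcr_hooks` is hypothetical, and mode 755 is an assumption, so compare it with the other files in the directory and adjust as needed.

```shell
# Hypothetical helper: copy both hook scripts into an LSF server directory.
# Usage: install_blcr_hooks SOURCE_DIR DEST_DIR
install_blcr_hooks() {
    src_dir=$1
    dest_dir=$2
    for script in echkpnt.blcr erestart.blcr; do
        cp "$src_dir/$script" "$dest_dir/$script" || return 1
        # 755 is an assumption; check `ls -l "$dest_dir"` and match what you see.
        chmod 755 "$dest_dir/$script" || return 1
    done
}

# Example (not run here): install_blcr_hooks . "$LSF_SERVERDIR"
```

LSF_SERVERDIR is normally set once the LSF environment has been loaded into your shell; if it is empty, use the full path instead.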
    
    
    Use
    ---
    
    To submit a job that will be checkpointed every 24 hours (1440 minutes), run:
    
    bsub -k "/path/to/your/checkpoint/directory method=blcr 1440"  \
    ...(other LSF options) ... \
    cr_run /path/to/your/wrapper/script
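The checkpoint period in the -k string is given in minutes, so 24 hours becomes 1440. A minimal sketch of building that string from a period in hours follows; the helper name `chkpnt_spec` is hypothetical and not part of LSF or BLCR.

```shell
# Hypothetical helper: build the argument for `bsub -k`.
# Usage: chkpnt_spec CHECKPOINT_DIR HOURS
chkpnt_spec() {
    dir=$1
    hours=$2
    # LSF expects the period in minutes.
    printf '%s method=blcr %d\n' "$dir" "$((hours * 60))"
}

chkpnt_spec /path/to/your/checkpoint/directory 24
# prints: /path/to/your/checkpoint/directory method=blcr 1440
```

The result can be passed straight to bsub, for example `bsub -k "$(chkpnt_spec /path/to/your/checkpoint/directory 24)" ...`.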
    
    
    To restart a job from a checkpoint, run:
    
    brestart /path/to/your/checkpoint/directory/jobid
    
    
    Job migration (bmig) and forced checkpointing (bchkpnt) should also work.
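For forced checkpointing, a sketch of the two common cases is shown here. The wrapper name `force_checkpoint` and the job ID are hypothetical. `bchkpnt -k` asks LSF to checkpoint and then kill the job; as far as I can tell, LSF passes that -k through to the echkpnt script, which is the case the attached script handles with `cr_checkpoint --term`.

```shell
# Hypothetical wrapper around bchkpnt.
# Usage: force_checkpoint JOB_ID [kill]
force_checkpoint() {
    job_id=$1
    if [ "$2" = "kill" ]; then
        # Checkpoint, then terminate the job.
        bchkpnt -k "$job_id"
    else
        # Checkpoint and let the job keep running.
        bchkpnt "$job_id"
    fi
}

# Example (not run here): force_checkpoint 1234 kill
```

Migration uses the same job ID form: `bmig 1234`.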
    
    Cheers,
    
    Guy
    
    -- 
    Dr. Guy Coates,  Informatics System Group
    The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
    Tel: +44 (0)1223 834244 x 6925
    Fax: +44 (0)1223 496802
    
    
    
    
    #!/bin/bash
    #
    # echkpnt.blcr: LSF script to checkpoint a job using blcr
    # Copyright Genome Research Ltd 2009
    # $Revision: 1.1 $
    # $Author: gmpc $  $Date: 2009-06-17 12:08:53 $


    #
    # Parse command line options. Only -k (kill the job after the
    # checkpoint has been taken) is acted on; anything else is ignored.
    #
    KILL="FALSE"

    while [ $# -gt 0 ] ; do
        case "$1" in
            -k )
                KILL="TRUE"
                ;;
        esac
        shift
    done


    #
    # Find the PID of the job to checkpoint. Restarted jobs need to be handled
    # differently from new jobs. At each level of the process tree we follow
    # the child with the lowest PID.
    #
    JOBLAUNCHER=`ps --pid "$LSB_JOBRES_PID" --no-heading | awk '{print $4}'`

    case "$JOBLAUNCHER" in
        sbatchd )
            # We are a restarted job; walk the process tree 3 steps.
            ID1=`ps --no-heading --ppid "$LSB_JOBRES_PID" | awk '{print $1}' | sort -r -n | tail -1`
            ID2=`ps --no-heading --ppid "$ID1" | awk '{print $1}' | sort -r -n | tail -1`
            ID=`ps --no-heading --ppid "$ID2" | awk '{print $1}' | sort -r -n | tail -1`
            ;;
        res )
            # We are a new job; walk the process tree 2 steps.
            ID1=`ps --no-heading --ppid "$LSB_JOBRES_PID" | awk '{print $1}' | sort -r -n | tail -1`
            ID=`ps --no-heading --ppid "$ID1" | awk '{print $1}' | sort -r -n | tail -1`
            ;;
        * )
            # Unrecognised launcher; we can't find the PID to checkpoint.
            exit 1
            ;;
    esac


    #
    # Take the checkpoint. --term kills the job afterwards; --run lets it
    # carry on running.
    #
    if [ "$KILL" = "TRUE" ] ; then
        /usr/bin/cr_checkpoint --term -f "$LSB_CHKPNT_DIR/jobstate.context" "$ID"
        status=$?
    else
        /usr/bin/cr_checkpoint --run -f "$LSB_CHKPNT_DIR/jobstate.context" "$ID"
        status=$?
    fi
    exit $status
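The `sort -r -n | tail -1` idiom used at each step of the process-tree walk picks the numerically smallest PID among the children. A minimal sketch of the same selection, using a hypothetical helper name `lowest_pid` and made-up PIDs:

```shell
# Hypothetical helper: read one PID per line on stdin, print the smallest.
# Equivalent to `sort -r -n | tail -1` in the checkpoint script.
lowest_pid() {
    sort -n | head -n 1
}

printf '300\n120\n250\n' | lowest_pid
# prints: 120
```

This presumably selects the first child that was started, but that only holds while PIDs are handed out in increasing order; after a PID wrap-around the lowest PID need not be the oldest child.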
    
    
    
    #!/bin/sh
    # erestart.blcr: restart an LSF job from a blcr checkpoint
    # Copyright Genome Research Ltd 2009
    # $Revision: 1.2 $
    # $Author: gmpc $  $Date: 2009-06-17 13:13:28 $
    
    
    echo "LSB_RESTART_CMD=/usr/bin/cr_restart $LSB_CHKPNT_DIR/jobstate.context" > "$LSB_CHKPNT_DIR/.restart_cmd"
    exit $?
    
