Automatically self-resubmitting batch scripts

To totally automate the processing of a very long job (or a large set of short jobs), each individual batch job has to be able to submit the next job to keep the chain of jobs going. Usually this involves a batch script that actually resubmits itself to the queuing system after doing as much work as it has time for. Alternatively you could have a sequence of scripts that each submit the next in the chain but this is a bit restrictive.

Basics of the script

The exact details of how jobs are configured to resubmit and restart will depend somewhat on the application. The steps needed to get self-resubmission to work might be:
1. Set up your code to checkpoint.
2. Set up your code to be able to start from both an initial state and from a checkpoint file (the code should flag an incorrect checkpoint file format).
3. Develop a mechanism to tell the code which option to take. This could be the existence or not of a checkpoint file (which can be checked from any language) or having the code edit its own input file.
4. Write a script which conditionally submits itself to the batch queuing system, taking a counter as input and passing an incremented value to the next submission.

Steps 2 and 3 could be eliminated by having your code always assume it is starting from a restart file, even initially. You will then need a supplementary code which only sets up the initial state, checkpoints and then immediately exits.
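As a rough illustration of step 3, the existence test for a checkpoint file could be done in the batch script itself. This is only a sketch: the file name checkpoint.dat, the function start_mode and the two mode names are invented for the example, and how the chosen mode is passed to your code is application specific.

```shell
#!/bin/sh
# Sketch only: choose between an initial start and a restart by testing
# for a checkpoint file. CHECKPOINT, start_mode and the mode names are
# hypothetical. Assumes a POSIX shell (for the $(...) form).
CHECKPOINT=${CHECKPOINT:-checkpoint.dat}

# Print "restart" if a non-empty checkpoint file exists, else "initial".
start_mode() {
    if [ -s "$1" ]; then
        echo restart
    else
        echo initial
    fi
}

MODE=$(start_mode "$CHECKPOINT")
echo "Start mode: $MODE"
```

The test uses -s rather than -f so that an empty file left behind by a failed checkpoint is not mistaken for a usable one. Checking the format of the file is still best left to the application.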
The basic tasks included in the script to do step 4 might look something like:

    if (whole job finished or job counter too large) exit
    increment job counter
    run executable limited by time
    submit myself with new job counter

How each of these tasks is done will depend on the application, the scripting language and your tastes, but we will refer to the example Bourne shell script below. It's probably a reasonable starting point for your own customization.
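Before involving the batch system, the control flow of these tasks can be tried out locally. The following dry run is a sketch with stand-ins: run_step replaces the executable, a loop iteration replaces each resubmitted job, and FINISHED_FLAG replaces whatever test your application provides. Nothing is submitted to any queue.

```shell
#!/bin/sh
# Dry run of the resubmission logic with stubs. run_step, FINISHED_FLAG
# and the two tallies are hypothetical. Assumes a POSIX shell.
JOB_COUNT=0
MAX_JOB_COUNT=3
TOTAL_JOBS=0      # every "job" that starts, including the final no-op
WORKING_JOBS=0    # jobs that actually ran the stub executable
FINISHED_FLAG=no  # a real script would test the application's own signal

run_step() {
    # Stand-in for "run executable limited by time".
    WORKING_JOBS=$((WORKING_JOBS + 1))
}

while :; do
    TOTAL_JOBS=$((TOTAL_JOBS + 1))
    # Exit if the whole job is finished or the counter is too large.
    if [ "$FINISHED_FLAG" = yes ] || [ "$JOB_COUNT" -ge "$MAX_JOB_COUNT" ]; then
        break
    fi
    # Increment the job counter.
    JOB_COUNT=$((JOB_COUNT + 1))
    # Run the executable (stub).
    run_step
    # "Submit myself": the next loop iteration plays the next job.
done

echo "jobs started: $TOTAL_JOBS, jobs that did work: $WORKING_JOBS"
# prints: jobs started: 4, jobs that did work: 3
```

With a limit of 3, three jobs do work and a fourth starts only to find the limit reached.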
- Counters
- An integral part of this script is a job counter to limit the total number of jobs in the sequence -- without such a limiter you may inadvertently run thousands of jobs filling logfiles, NQS databases, filesystems and generally wreaking havoc!!
Always use a limited job counter. Our example uses two environment variables to be set before the job set starts:
    JOB_COUNT - used as a counter and initially set to 0
    MAX_JOB_COUNT - reasonable upper limit on job count, say 10.
$MAX_JOB_COUNT + 1 jobs will run but the last will do nothing.
Alternatively, the counter and the limiter could be two numbers in a file which gets read and updated with each job.
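If you prefer to keep the counter and the limiter in a file rather than in environment variables, something along the following lines might do. It is a sketch under assumptions: the file name job_counter, its one-line "count max" layout and both function names are made up for illustration.

```shell
#!/bin/sh
# Sketch of a file-based job counter. COUNTER_FILE, its "count max"
# layout and the function names are hypothetical. Assumes a POSIX shell.
COUNTER_FILE=${COUNTER_FILE:-job_counter}

# Create the counter file with a count of 0 and the given limit.
init_counter() {
    echo "0 $1" > "$COUNTER_FILE"
}

# If the stored count is below the limit, write back the incremented
# count and return 0. Return 1 when the limit has been reached or the
# counter file is missing.
next_job_allowed() {
    [ -f "$COUNTER_FILE" ] || return 1
    read -r count max < "$COUNTER_FILE"
    [ "$count" -lt "$max" ] || return 1
    echo "$((count + 1)) $max" > "$COUNTER_FILE"
    return 0
}
```

You would call init_counter once (for example init_counter 10) before the first submission, and each job would then run and resubmit only if next_job_allowed succeeds. A missing file is treated as "stop", so deleting the file is a simple way to halt the chain.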
- Application executable
- The script assumes most of the "restart" work and checkpointing will be handled within the executable (app.exe in this case); the executable is assumed to write out and read in enough information for the state of the run to be passed between batch jobs. This would include which checkpoint file to start from next, the time-step or iteration this checkpoint file represents, etc. It is also assumed to create a file (called done_job in the example, as set by the FINISHED variable) when the entire run is complete; the script simply has to check for its existence. The specification of a MAX_JOB_COUNT is a safety net to control jobs that fail, never finish and endlessly resubmit.
- Time limits
- The example script relies on the application being a Fortran executable and hence able to be CPU-time limited by the -Wl,-t runtime option. RASH sets the environment variable LIMIT_CPU to the job CPU limit in seconds, so it can be used to stop the executable a minute or two early. If the application is written in C, some other mechanism (SIGXCPU signal handling, for example) is required to shut the application down early to allow time for resubmission.
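The margin calculation is simple, but it is worth guarding against a job limit that is shorter than the margin itself. A possible sketch follows; the names MARGIN and exec_cpu_limit are invented here, and the 14400 second fallback is only there so the fragment can be tried outside a batch job where LIMIT_CPU is not set.

```shell
#!/bin/sh
# Sketch: derive the executable's CPU limit from the job's CPU limit,
# leaving a margin for resubmission. MARGIN and exec_cpu_limit are
# hypothetical names. Assumes a POSIX shell.
MARGIN=120

# Print the limit (in seconds) to pass to the executable, or fail if the
# job limit leaves no run time at all.
exec_cpu_limit() {
    job_limit=$1
    if [ "$job_limit" -le "$MARGIN" ]; then
        echo "job CPU limit ($job_limit s) leaves no time to run" >&2
        return 1
    fi
    echo $((job_limit - MARGIN))
}

# LIMIT_CPU is set inside a batch job; 14400 is an assumed fallback for
# interactive testing only.
JOB_CPU_LIMIT=$(exec_cpu_limit "${LIMIT_CPU:-14400}") || JOB_CPU_LIMIT=
echo "executable CPU limit: ${JOB_CPU_LIMIT:-none}"
```

If the function fails, JOB_CPU_LIMIT is left empty and the script should skip both the run and the resubmission.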
- Resubmitting
- One complication on the VPP is that your batch job is unlikely to be running on a processor that can accept batch submissions - only PE0 (vpp00) knows about nqsub. Hence you must rsh to vpp00 to run nqsub from within your script. In our case JOB_COUNT and MAX_JOB_COUNT have to be added to the environment of the submitting shell on vpp00 as well.
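When setting this up it can help to see exactly what will be sent to vpp00 before anything is submitted. The sketch below assembles the remote command and, by default, only prints it. DRY_RUN, SUBMIT_HOST and submit_next are names made up for this example; the setenv syntax assumes your login shell on vpp00 is csh or tcsh.

```shell
#!/bin/sh
# Sketch: build the remote resubmission command and print it unless
# DRY_RUN is set to "no". DRY_RUN, SUBMIT_HOST and submit_next are
# hypothetical names.
SUBMIT_HOST=vpp00

submit_next() {
    # setenv assumes a csh-style login shell on the submit host.
    remote_cmd="cd $QSUB_WORKDIR; setenv JOB_COUNT $JOB_COUNT; setenv MAX_JOB_COUNT $MAX_JOB_COUNT; nqsub $QSUB_REQNAME"
    if [ "${DRY_RUN:-yes}" = yes ]; then
        # Show what would be run without contacting the submit host.
        echo "rsh $SUBMIT_HOST \"$remote_cmd\""
    else
        rsh "$SUBMIT_HOST" "$remote_cmd"
    fi
}
```

Inside a job, calling submit_next with DRY_RUN=no would perform the real submission. Leaving the default in place just writes the command to the job's output so it can be checked.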
- Getting it all started
- Assuming the script below is in a file called run_job, you must make this script executable using chmod +x run_job. To start a sequence of jobs (limited to a maximum of 9), the user enters the commands
    % setenv JOB_COUNT 0
    % setenv MAX_JOB_COUNT 9
    % nqsub run_job

Note that these are the same commands given in the rsh line in the script except that we are giving values to MAX_JOB_COUNT and QSUB_REQNAME and an initial value for JOB_COUNT.

Example shell script

    #!/bin/sh
    #
    # Self submitting NQS script for a sequentially ordered set of jobs.
    # ==================================================================
    #
    # ---------
    # Counters:
    # ---------
    # To stop an infinite chain of self-submitting jobs, a counter is used
    # to control the total number of jobs in the job set. This
    # requires two environment variables to be set before the job set
    # starts:
    #    JOB_COUNT     - used as a counter and initially set to 0
    #    MAX_JOB_COUNT - reasonable upper limit on job count, say 10
    #
    # $MAX_JOB_COUNT + 1 jobs will run but the last will do nothing.
    #
    # ------
    # Files:
    # ------
    # Set the following filenames - can be set relative to $QSUB_WORKDIR
    # ($QSUB_WORKDIR is set by NQS to the directory from which you submit
    # the initial job).
    # The application $EXEC is assumed to create a file $FINISHED when
    # the series of jobs has completed.
    #
    EXEC='/home/123/abc123/my_app/bin/app.exe'
    FINISHED='done_job'
    LOG_FILE='app_job.log'
    #
    # ------------
    # Time limits:
    # ------------
    # This script relies on $EXEC being a Fortran executable and hence
    # able to be cputime limited by the -Wl,-t runtime option.
    #
    #    JOB_CPU_LIMIT = $LIMIT_CPU - 120 seconds
    #
    # If the code is C some other mechanism (SIGXCPU signal handling
    # for example) is required to shut $EXEC down early to allow time
    # for resubmission.
    #
    # ------------
    # NQS options:
    # ------------
    # Set the NQS job parameters - they will be fixed for all jobs
    # in the job set. The only nqsub parameter needed is the name of
    # this script and that is provided by NQS as $QSUB_REQNAME after
    # the initial submission.
    #
    # @$-q normal       # Queue to submit to
    # @$-lT 4:00:00     # Timelimit on each individual NQS job
    # @$-lM 80MB        # Memory limit
    # @$-x              # Copy environment. Needed to export JOB_COUNT and
    #                   # MAX_JOB_COUNT to the next job
    #
    # ------------------------------------------------
    # The rest of this script should not need changing
    # ------------------------------------------------

    cd $QSUB_WORKDIR

    # Do nothing if the run has finished or the job limit has been reached.
    if [ ! -f $FINISHED ] && [ $JOB_COUNT -lt $MAX_JOB_COUNT ]
    then
        JOB_COUNT=`expr $JOB_COUNT + 1`
        NEXT_JOB=`expr $JOB_COUNT + 1`
        JOB_CPU_LIMIT=`expr $LIMIT_CPU - 120`

        # Run the application, stopping 120 seconds short of the job limit.
        $EXEC -Wl,-t$JOB_CPU_LIMIT > $LOG_FILE.$JOB_COUNT

        # Resubmit this script from vpp00, passing on the counters.
        echo Submitting job number $NEXT_JOB
        rsh vpp00 "cd $QSUB_WORKDIR; \
                   setenv JOB_COUNT $JOB_COUNT ; \
                   setenv MAX_JOB_COUNT $MAX_JOB_COUNT ; \
                   nqsub $QSUB_REQNAME"
    fi

    exit

Extensions

Note that environment variables could be used instead of command line arguments and the method of indicating that the total job has finished could take various forms.
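As one illustration of an alternative "finished" test, the script could look for a completion line in the most recent log file as well as (or instead of) the marker file. This is a sketch only: the log line RUN COMPLETE is a hypothetical convention that your application would have to write, and the function names are invented.

```shell
#!/bin/sh
# Sketch: two interchangeable "finished" tests. The marker file name
# follows the example script; the "RUN COMPLETE" log line and all
# function names are hypothetical.
FINISHED=${FINISHED:-done_job}

# True if the marker file exists.
finished_by_file() {
    [ -f "$FINISHED" ]
}

# True if the given log file contains a line starting "RUN COMPLETE".
finished_by_log() {
    [ -f "$1" ] && grep -q '^RUN COMPLETE' "$1"
}

# True if either test says the run is complete.
# $1 is the log file written by the most recent job.
run_is_finished() {
    finished_by_file || finished_by_log "$1"
}
```

In a script like the example, the condition [ ! -f $FINISHED ] would then become a call such as ! run_is_finished $LOG_FILE.$JOB_COUNT, made before the counter is incremented so that it refers to the previous job's log.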
This script is very simple and assumes that the total job process is sequential and only one job executes at a time. Stopping concurrent jobs or running concurrent jobs when the total job is not sequential requires a little more sophistication. Contact ANUSF if you would like to know more.
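One common way to stop two jobs from the same set running at once is a lock directory in the working directory. The sketch below is not ANUSF's method, just a generic technique; LOCK_DIR and the function names are made up, and whether mkdir behaves atomically on your particular shared filesystem is something you would need to check.

```shell
#!/bin/sh
# Sketch: a lock directory so that only one job of the set runs at a
# time. LOCK_DIR, acquire_lock and release_lock are hypothetical names.
# mkdir is used because it fails if the directory already exists.
LOCK_DIR=${LOCK_DIR:-job_set.lock}

# Succeeds only for the first caller; later callers get a failure
# until the lock is released.
acquire_lock() {
    mkdir "$LOCK_DIR" 2>/dev/null
}

# Remove the lock so that the next job can proceed.
release_lock() {
    rmdir "$LOCK_DIR" 2>/dev/null
}
```

A job would call acquire_lock before running the executable, skip its work if that fails, and call release_lock just before resubmitting. If a job dies without releasing the lock, the directory has to be removed by hand before the chain can continue.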