ANU/APAC PBS Modifications

 

Modification identification

The code base is OpenPBS 2.3 (with patches up to pl12 added by hand where applicable). Filenames given below are relative to $PBSSRCDIR/src.

All modifications are (should be?) inside #ifdef ANUPBS

Particular modifications are also #ifdef'd by:

      JOBFS     - code in Mom to support jobfs
      PSETS     - code in Mom to support processor sets
      POD       - code in server and MOM to improve server/mom connectivity
      EXECHOST  - code to support changing exec_host format
      RMS       - code to support RMS on the SC
      ANURASH   - code in server to output RASH accounting info
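
The nesting described above can be sketched as follows. This is only an illustration of the guard layout: the function name anupbs_jobfs_enabled() is hypothetical, and the two macros are defined in the file only so the sketch compiles standalone (a real build would presumably pass them as compiler flags).

```c
#include <stdio.h>

/* Assumption: normally supplied by the build (e.g. -DANUPBS -DJOBFS). */
#ifndef ANUPBS
#define ANUPBS
#endif
#ifndef JOBFS
#define JOBFS
#endif

/* Hypothetical helper: returns 1 when the jobfs code path is compiled in. */
int anupbs_jobfs_enabled(void)
{
#ifdef ANUPBS
#ifdef JOBFS
    return 1;   /* local modification and jobfs support both enabled */
#else
    return 0;   /* ANUPBS build without jobfs */
#endif /* JOBFS */
#else
    return 0;   /* stock OpenPBS code path */
#endif /* ANUPBS */
}
```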
 

Job suspension/resumption

Also known as "job preemption" - now supported in PBSPro:

Scheduler:
  • supports the concept of preemption and initiates the suspend and resume of jobs directly.
Server:
  • consistently assigns the JOB_SUBSTATE_SUSPEND state (substate of JOB_STATE_RUNNING) and saves job state on transitions
    (req_signal.c, pbsd_init.c, req_holdjob.c, req_quejob.c, req_runjob.c)
  • frees and allocates cpus of nodes of suspended and resumed jobs
    (req_signal.c, node_manager.c)
MOM:
  • process tree operations
  • suspends walltime
 

CPU migration during suspension/resumption

Jobs are assigned cpus within a node. Migration allows a suspended job to be resumed on different cpu(s) from those on which it was previously running.
Scheduler:
  • performs a simple check for possible migration and passes the new exec_host
Server:
  • frees and allocates the specified cpus of nodes of suspended and resumed jobs
MOM:
  • updates exec_host in start_exec.c:job_nodes()
  • process tree operations
  • suspends walltime
RMS: cannot migrate RMS resources so multi-cpu jobs cannot be migrated on the SC. (Possibly could with an indirect cpu numbering vector.)
 

Processor sets

A means of enforcing the ncpus resource request on SMP nodes. Relies on the underlying operating system supporting a processor set concept (true for Solaris, IRIX and Tru64, possibly true for AIX).

Modifications only in MOM:

Getting PBS and the system to agree on the state of psets can be problematic, occasionally leading to jobs failing to start. Needs more built-in checking and condition handling.

RMS: currently RMS processes cannot be put under psets since they are not children of MOM. Only an issue when jobs don't fill a node.

 

Specification of exec_host

Part 1:
A somewhat aesthetic change. It matters most on larger SMP nodes, but in general it makes job location more readable.

The original PBS has an exec_host job attribute format of

     "node1/cpu1+node1/cpu2+...+noden/cpum-1+noden/cpum"
Changed to
     "node1/cpu1.cpu2..+...+noden/...cpum-1.cpum"
e.g. sc17/0.1.2.3+sc18/1.3

exec_host is only parsed in a couple of places - node_manager.c and req_signal.c (for migrate on resume) in the server and job_nodes() in start_exec.c in the MOM.
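
A minimal sketch of expanding the new format into (node, cpu) pairs, assuming cpu ids are non-negative decimal integers and node names contain no '/' or '+'. The names struct eh_slot, EH_NODE_MAX and eh_parse() are hypothetical and are not taken from node_manager.c, req_signal.c or start_exec.c.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define EH_NODE_MAX 64          /* hypothetical limit on node name length */

struct eh_slot {                /* hypothetical: one (node, cpu) pair */
    char node[EH_NODE_MAX];
    int  cpu;
};

/*
 * Expand "node/c1.c2+node/c3" into (node, cpu) pairs.
 * Returns the number of pairs written, or -1 if the string is
 * malformed or more than max pairs would be needed.
 */
int eh_parse(const char *exec_host, struct eh_slot *out, int max)
{
    const char *p = exec_host;
    int n = 0;

    while (*p != '\0') {
        const char *slash = strchr(p, '/');
        const char *q;
        size_t len;

        if (slash == NULL)
            return -1;
        len = (size_t)(slash - p);
        if (len == 0 || len >= EH_NODE_MAX || memchr(p, '+', len) != NULL)
            return -1;

        q = slash + 1;
        for (;;) {                       /* one cpu id per iteration */
            char *end;
            long cpu;

            if (!isdigit((unsigned char)*q) || n >= max)
                return -1;
            cpu = strtol(q, &end, 10);
            memcpy(out[n].node, p, len);
            out[n].node[len] = '\0';
            out[n].cpu = (int)cpu;
            n++;
            q = end;
            if (*q != '.')
                break;
            q++;                         /* skip '.' between cpu ids */
        }

        if (*q == '+') {
            p = q + 1;                   /* next node segment */
            if (*p == '\0')
                return -1;               /* trailing '+' */
        } else if (*q == '\0') {
            p = q;                       /* end of string */
        } else {
            return -1;
        }
    }
    return n;
}
```

With the example above, eh_parse("sc17/0.1.2.3+sc18/1.3", slots, 8) fills six slots: four on sc17 and two on sc18.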

Part 2:
Normally the pbs_runjob() request from scheduler to server specifies the number of processors per node on particular nodes, e.g. sc23:ppn=3+sc24:ppn=2

So that the scheduler can start multiple jobs per scheduling cycle and be sure that jobs end up where it decides, the full exec_host is specified in the final char *extend extension argument to pbs_runjob() and handled in req_runjob.c and node_manager.c in the server.
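
To show how the two forms relate, here is a sketch that derives the ppn form from a full exec_host string. The helper eh_to_ppn() is hypothetical and is not code from the scheduler or server. It only counts the '.' separators and does not validate the cpu ids themselves.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build "node:ppn=N+node:ppn=M" from "node/c1.c2...+node/c3...".
 * Returns 0 on success, -1 if the input is malformed or out is too small.
 */
int eh_to_ppn(const char *exec_host, char *out, size_t outlen)
{
    const char *p = exec_host;
    size_t used = 0;

    if (outlen == 0)
        return -1;
    out[0] = '\0';

    while (*p != '\0') {
        size_t seg = strcspn(p, "+");            /* one node/cpulist segment */
        const char *slash = memchr(p, '/', seg);
        const char *c;
        int ncpus = 1;
        int w;

        /* need a non-empty node name and a non-empty cpu list */
        if (slash == NULL || slash == p || slash + 1 == p + seg)
            return -1;
        for (c = slash + 1; c < p + seg; c++)    /* cpus are '.'-separated */
            if (*c == '.')
                ncpus++;

        w = snprintf(out + used, outlen - used, "%s%.*s:ppn=%d",
                     used ? "+" : "", (int)(slash - p), p, ncpus);
        if (w < 0 || (size_t)w >= outlen - used)
            return -1;
        used += (size_t)w;

        p += seg;
        if (*p == '+') {
            p++;
            if (*p == '\0')
                return -1;                       /* trailing '+' */
        }
    }
    return 0;
}
```

For example, "sc23/0.1.2+sc24/1.3" yields "sc23:ppn=3+sc24:ppn=2", matching the request form shown above.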

 

Jobfs

jobfs is a disk resource on every node, i.e. an allocated local scratch disk. Must be requested as part of a job request.
Scheduler:
  • Correct allocation like any other resource
Server:
  • Abstract resource - no change
MOM:
  • Form jobfs pathname and add PBS_JOBFS to environment
  • Create directory (and possibly apply quota to user on jobfs filesystem)
  • Add to list of resources that are monitored in mom_get_sample()
  • At job completion, traverse the jobfs directory deleting files then remove directory
RMS: The only node resource not available from RMS node_stats. Fairly static so could just be a table.
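
The MOM steps above can be sketched roughly as below, assuming a POSIX system. All function names (jobfs_path, jobfs_env, jobfs_create, jobfs_remove), the JOBFS_PATH_MAX limit and the "<root>/<jobid>" layout are hypothetical. Quota handling, ownership changes and usage sampling are left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define JOBFS_PATH_MAX 4096     /* hypothetical path length limit */

/* Form "<root>/<name>"; returns 0 on success, -1 if it does not fit. */
int jobfs_path(char *buf, size_t len, const char *root, const char *name)
{
    int w = snprintf(buf, len, "%s/%s", root, name);
    return (w < 0 || (size_t)w >= len) ? -1 : 0;
}

/* Form the "PBS_JOBFS=<path>" entry for the job's environment. */
int jobfs_env(char *buf, size_t len, const char *path)
{
    int w = snprintf(buf, len, "PBS_JOBFS=%s", path);
    return (w < 0 || (size_t)w >= len) ? -1 : 0;
}

/* Create the per-job directory, accessible to its owner only. */
int jobfs_create(const char *path)
{
    return mkdir(path, 0700);
}

/* Delete everything under path, then path itself. Returns 0 or -1. */
int jobfs_remove(const char *path)
{
    DIR *d = opendir(path);
    struct dirent *e;
    int rc = 0;

    if (d == NULL)
        return -1;
    while ((e = readdir(d)) != NULL) {
        char child[JOBFS_PATH_MAX];
        struct stat st;

        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;
        /* lstat() so that symlinks are unlinked, never followed */
        if (jobfs_path(child, sizeof child, path, e->d_name) != 0 ||
            lstat(child, &st) != 0) {
            rc = -1;
            continue;
        }
        if (S_ISDIR(st.st_mode)) {
            if (jobfs_remove(child) != 0)
                rc = -1;
        } else if (unlink(child) != 0) {
            rc = -1;
        }
    }
    closedir(d);
    if (rmdir(path) != 0)
        rc = -1;
    return rc;
}
```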
 

MOM-server connectivity robustness

A common reliability problem for OpenPBS is the server/scheduler hanging when a node (or MOM on a node) dies. In particular requests to MOM for job resource usage or node status can hang. With this mod, job resource usage reporting is initiated by the MOM, not the server. The server just keeps the latest usage info sent in - this may be way out of date if a node/MOM dies.
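
A minimal sketch of the server side of this scheme, assuming the server timestamps each report it receives so that out-of-date usage can at least be recognised. The struct and function names are hypothetical, and the staleness check is an illustration rather than something the modification is known to do.

```c
#include <time.h>

struct usage_rec {              /* hypothetical per-job record in the server */
    long   cput_secs;           /* cpu time last reported by the MOM */
    long   mem_kb;              /* memory use last reported by the MOM */
    time_t received;            /* when that report arrived */
};

/* Called when a MOM pushes a usage report; the server never polls. */
void usage_update(struct usage_rec *r, long cput_secs, long mem_kb, time_t now)
{
    r->cput_secs = cput_secs;
    r->mem_kb    = mem_kb;
    r->received  = now;
}

/* Returns 1 if the last report is older than max_age_secs, else 0. */
int usage_is_stale(const struct usage_rec *r, time_t now, double max_age_secs)
{
    return difftime(now, r->received) > max_age_secs;
}
```

Because the server only reads its own record, a dead node or MOM cannot block it. The cost is that the record silently ages, which is what the timestamp makes visible.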
 

Scheduler

New scheduler from the ground up - a book in itself ...
 

MOM enhancements

process tree tracking and signals

resource utilization

shared memory segment memory usage tracking

integrated mom_mach.c (into mom_mach_common.c) for those platforms of interest (Linux, Solaris and Tru64) so that common functionality is supported.

limit the size of stdout and stderr files in local spool area.

setgid(PROJECT)

time varying sampling

send_sisters

at least 60 seconds grace between SIGXCPU and SIGKILL (needs to be user specifiable)

epilog "script" for appending resource usage info to stdout
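
The SIGXCPU/SIGKILL grace period listed above could be sketched as a small decision function called from MOM's polling loop. The names struct overlimit, KILL_GRACE_SECS and overlimit_signal() are hypothetical, and the 60 second value is hard-coded here although the note above says it should be user specifiable.

```c
#include <signal.h>
#include <time.h>

#define KILL_GRACE_SECS 60      /* hypothetical constant for the grace period */

struct overlimit {              /* hypothetical per-job state */
    int    xcpu_sent;           /* nonzero once SIGXCPU has been delivered */
    time_t xcpu_time;           /* when SIGXCPU was delivered */
};

/*
 * Call each time a job is found to be over its limit.
 * Returns the signal to send now, or 0 if the job is still within
 * the grace period and nothing should be sent.
 */
int overlimit_signal(struct overlimit *s, time_t now)
{
    if (!s->xcpu_sent) {
        s->xcpu_sent = 1;
        s->xcpu_time = now;
        return SIGXCPU;         /* first warning: let the job clean up */
    }
    if (difftime(now, s->xcpu_time) >= KILL_GRACE_SECS)
        return SIGKILL;         /* grace period has expired */
    return 0;
}
```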

 

RMS integration

The Resource Management System (RMS) on the SC provides a "context" for MPI jobs, i.e. access rights to network interfaces etc. RMS must allocate a resource (a list of cpus on nodes) to all MPI jobs.

Due to limits on how resource shapes can be specified, the current implementation involves an independent daemon, rmsresd, which is responsible for making resource requests to RMS and ensuring that the resource shape returned is the one requested by PBS. The scheduler decides on cpus and makes a request to rmsresd via a Unix domain socket. rmsresd makes sure unwanted cpus are allocated to existing jobs or "fill" resources. It then makes a request to RMS for the required number of cpus and checks that the resource allocated matches the requested shape. rmsresd is also responsible for making the resource suspend and resume requests to RMS when a multi-cpu job is suspended and resumed.

MOM's role in RMS:

suspend/resume/kill resource
inquiring about RMS resource usage (prolog.c:run_pelog())
allocating RMS resources with V2.5 support

prun modification

 

Attributes and Resources

PBS is designed to be extensible in the sense of adding attributes or resources to jobs, queues etc. Support needs to be added to the scheduler and possibly MOM for such additional attributes or resources.

stime (time):
start time of job (needed for expansion factor)

cwd (boolean):
qsub option -wd for starting job in submission directory (PBS_O_WORKDIR)

use_nodes (string):
queue attribute (set in qmgr) for restricting queue use of nodes. Form is either inclusive "node1+node2+..." or exclusive "!node1+node2+..."

jobfs (size):
job resource (see above)

ncpus (int):
already in OpenPBS - supported directly in scheduler

alt_id (int):
already in OpenPBS - job attribute to hold RMS resource id for multi-cpu jobs

nr_vmem (long):
Not actually an attribute - added to noderes struct in job.h. Support added in MOM.

nr_syst (TODO):
as for nr_vmem.
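
As an illustration of the use_nodes form described above, here is a sketch of a membership check. It assumes that a leading "!" negates the whole list and that an unset or empty value means no restriction. The function name use_nodes_allows() is hypothetical and is not the scheduler's own code.

```c
#include <string.h>

/*
 * Returns 1 if a queue with the given use_nodes value may use node,
 * 0 otherwise. spec is "n1+n2+..." (inclusive) or "!n1+n2+..." (exclusive).
 */
int use_nodes_allows(const char *spec, const char *node)
{
    size_t nlen = strlen(node);
    int exclusive = 0;
    int listed = 0;
    const char *p;

    if (spec == NULL || *spec == '\0')
        return 1;                        /* assumption: unset means any node */
    if (*spec == '!') {
        exclusive = 1;
        spec++;
    }
    for (p = spec; *p != '\0'; ) {
        size_t seg = strcspn(p, "+");    /* length of one node name */

        if (seg == nlen && strncmp(p, node, seg) == 0)
            listed = 1;
        p += seg;
        if (*p == '+')
            p++;
    }
    return exclusive ? !listed : listed;
}
```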

 

Commands

Source for cmd is cmds/cmd.c
qsub
Parses resource requests and checks against RASH limits

qstat
Minor change to provide a message when the server is down

qorder
Standard qorder (really pbs_orderjob() support in server) only supports swapping two jobs in a queue. Added argv[3] (supported by char *extend arg of pbs_orderjob()) so that job_argv[2] is placed "BEFORE" or "AFTER" (the value of argv[3]) job_argv[1].

pbsnodes
Shortened the output of the -a invocation (possibly too much). Added -p (through char *extend arg of pbs_statnode()) to cause the server to ping all nodes immediately (used when restarting a MOM - default is 600 secs between pings). Support added in server/req_stat.c:req_stat_node().

nqstat
"qstat-like" command from RASH modified to request info from PBS.

pestat (NEW)
Provides status and resource summary from nodes. Can take a node name as an argument.

jobnodes (NEW)
Tabular presentation of jobids on cpus of nodes. Includes suspended jobs and indication of job's queue via "*".

jobs_on_node (NEW)
Summary of jobs and their resource usage on a given node. (Needs work on formatting.)

pbs_rusage (NEW)
Effectively produces the same job resource usage output as the MOM epilog script (used by some users who ignore the stdout file). Uses either PBS_JOBID or an argument to determine the jobid.
 

tm API and support

Not used on the SC.

Important for integrating public domain MPI libraries like MPICH and LAM.

 

Miscellaneous hacks

Shifted the port numbers to 40001, 40002, etc. on the SC because of apparent conflicts with 15001, etc. (include/pbs_ifl.h).

Allows requests from any host with the same basename as the PBS server hostname (lib/Libsite/site_check_u.c). Should be fixed.