Modification identification
The code base is OpenPBS 2.3 (with patches up to pl12 added by hand where applicable). Filenames given below are relative to $PBSSRCDIR/src. All modifications are (should be?) inside #ifdef ANUPBS.
Particular modifications are also #ifdef'd by:
- JOBFS: code in MOM to support jobfs
- PSETS: code in MOM to support processor sets
- POD: code in server and MOM to improve server/MOM connectivity
- EXECHOST: code to support changing exec_host format
- RMS: code to support RMS on the SC
- ANURASH: code in server to output RASH accounting info
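The guard nesting described above can be sketched as follows. Only the macro names come from these notes; the function and its body are hypothetical, and the macros are defined in the file only so the sketch compiles on its own (a real build would pass them as compiler flags).

```c
/* Normally supplied by the build (e.g. -DANUPBS -DJOBFS); defined here
 * only so that this sketch is self-contained. */
#define ANUPBS
#define JOBFS

/* Hypothetical helper: counts which optional features were compiled in. */
static int anupbs_feature_count(void)
{
    int n = 0;
#ifdef ANUPBS
    /* All site-specific code sits inside the outer guard ... */
#ifdef JOBFS
    n++;    /* ... and each feature has its own inner guard. */
#endif
#ifdef PSETS
    n++;    /* not compiled here because PSETS is undefined */
#endif
#endif /* ANUPBS */
    return n;
}
```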
Job suspension/resumption
Also known as "job preemption" - now supported in PBSPro:
- Attempt to increase the utilization and allow easier scheduling of parallel and high priority jobs.
- Running job processes are sent SIGSTOP before the preempting job is started. Jobs are resumed by sending SIGCONT to their processes.
- Based on the Cray code in original PBS which used Cray native job operations. Special signals are "suspend" and "resume" (users can use SIGSTOP and SIGCONT on their jobs but PBS still considers those jobs to be running).
- The job's elapsed time clock is stopped while the job is suspended.
- Scheduler:
- supports the concept of preemption and initiates the suspend and resume of jobs directly.
- Server:
- consistently assigns the JOB_SUBSTATE_SUSPEND state (substate of JOB_STATE_RUNNING) and saves job state on transitions (req_signal.c, pbsd_init.c, req_holdjob.c, req_quejob.c, req_runjob.c)
- frees and allocates cpus of nodes of suspended and resumed jobs (req_signal.c, node_manager.c)
- MOM:
- process tree operations
- suspends walltime
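A minimal sketch of the signal side of suspend/resume, assuming the MOM already holds the list of process ids in the job's process tree. The helper names are hypothetical and this is not the actual MOM code; the second function is only a self-check that forks a stand-in "job process".

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: stop (suspend != 0) or continue every process in
 * the list. Returns the number of processes successfully signalled. */
static int signal_job_procs(const pid_t *pids, int npids, int suspend)
{
    int sig = suspend ? SIGSTOP : SIGCONT;
    int ok = 0;

    for (int i = 0; i < npids; i++) {
        if (kill(pids[i], sig) == 0)
            ok++;
    }
    return ok;
}

/* Self-check: fork a child, suspend it, confirm that it stopped, resume
 * it, then clean up. Returns 0 on success, -1 on any failure. */
static int demo_suspend_resume(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {                 /* child: stand-in for a job process */
        for (;;)
            pause();
    }
    if (signal_job_procs(&pid, 1, 1) != 1)
        return -1;
    /* WUNTRACED lets the parent observe the stop without the child exiting. */
    if (waitpid(pid, &status, WUNTRACED) != pid || !WIFSTOPPED(status))
        return -1;
    if (signal_job_procs(&pid, 1, 0) != 1)
        return -1;
    kill(pid, SIGKILL);             /* end the stand-in process */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return 0;
}
```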
CPU migration during suspension/resumption
Jobs are assigned cpus within a node. Migration allows a suspended job to be resumed on different cpu(s) from those on which it was previously running.
RMS: cannot migrate RMS resources, so multi-cpu jobs cannot be migrated on the SC. (Possibly could with an indirect cpu numbering vector.)
- Scheduler:
- Performs a simple check for possible migration and passes the new exec_host
- Server:
- frees and allocates the specified cpus of nodes of suspended and resumed jobs
- MOM:
- updates exec_host in start_exec.c:job_nodes()
- process tree operations
- suspends walltime
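"Suspends walltime" can be pictured as a clock that only accrues time while the job is running. The following is a sketch under that assumption; the struct and function names are hypothetical and do not come from the MOM source.

```c
#include <time.h>

/* Hypothetical walltime clock that excludes suspended intervals. */
struct walltime_clock {
    long   accrued_secs;   /* wall time from completed running intervals */
    time_t resumed_at;     /* start of the current running interval */
    int    suspended;      /* non-zero while the job is suspended */
};

static void wt_start(struct walltime_clock *c, time_t now)
{
    c->accrued_secs = 0;
    c->resumed_at = now;
    c->suspended = 0;
}

static void wt_suspend(struct walltime_clock *c, time_t now)
{
    if (!c->suspended) {
        /* Bank the interval that has just ended. */
        c->accrued_secs += (long)difftime(now, c->resumed_at);
        c->suspended = 1;
    }
}

static void wt_resume(struct walltime_clock *c, time_t now)
{
    if (c->suspended) {
        c->resumed_at = now;
        c->suspended = 0;
    }
}

/* Wall time charged to the job so far. */
static long wt_elapsed(const struct walltime_clock *c, time_t now)
{
    if (c->suspended)
        return c->accrued_secs;
    return c->accrued_secs + (long)difftime(now, c->resumed_at);
}
```

With this layout a job started at t=100, suspended at t=160 and resumed at t=500 is charged 60 seconds for the whole suspended period and 90 seconds at t=530.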
Processor sets
A means of enforcing the ncpus resource request on SMP nodes. Relies on the underlying operating system supporting a processor set concept (true for Solaris, IRIX and Tru64, possibly true for AIX).
Getting PBS and the system to agree on the state of psets can be problematic, occasionally leading to jobs failing to start. Needs more built-in checking and condition handling.
Modifications only in MOM:
- Attaches initial proto-job (the forked MOM process before execing shell script) to either a single cpu or a processor set for a multi-cpu job.
- Unbinds processes of job being suspended from their pset/cpu
- Rebinds all processes of job (to a possibly different cpu/pset) on resumption.
- Responsible for tracking psets (current error in job_nodes() ?)
RMS: currently RMS processes cannot be put under psets since they are not children of MOM. Only an issue when jobs don't fill a node.
Specification of exec_host
Part 1:
Somewhat aesthetic change! More significant for use with larger SMP nodes but makes job location more readable.
Part 2:
The original PBS has an exec_host job attribute format of
"node1/cpu1+node1/cpu2+...+noden/cpum-1+noden/cpum"
Changed to
"node1/cpu1.cpu2..+...+noden/...cpum-1.cpum"
e.g. sc17/0.1.2.3+sc18/1.3
exec_host is only parsed in a couple of places: node_manager.c and req_signal.c (for migrate on resume) in the server, and job_nodes() in start_exec.c in the MOM.
Normally the pbs_runjob() request from the scheduler to the server specifies the number of processors per node on particular nodes, e.g. sc23:ppn=3+sc24:ppn=2
So that the scheduler can start multiple jobs per scheduling cycle and be sure that jobs end up where it decides, the full exec_host is specified in the final char *extend argument to pbs_runjob() and handled in req_runjob.c and node_manager.c in the server.
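The format change can be illustrated by a small converter from the original exec_host form to the compact one. This is a sketch only: the function name is hypothetical, it is not the parsing code in node_manager.c or job_nodes(), and it assumes that entries for the same node are adjacent in the input.

```c
#include <stdio.h>
#include <string.h>

/* Collapse "node/cpu+node/cpu+..." into "node/cpu.cpu+...".
 * Assumes entries for the same node are adjacent in the input.
 * Returns 0 on success, -1 on malformed input or if out is too small. */
static int exec_host_compact(const char *in, char *out, size_t outlen)
{
    char prev[256] = "";    /* node name of the previous entry */
    size_t used = 0;

    if (outlen == 0)
        return -1;
    out[0] = '\0';

    while (*in != '\0') {
        size_t entry_len = strcspn(in, "+");
        const char *slash = memchr(in, '/', entry_len);
        size_t node_len, cpu_len;
        int n;

        if (slash == NULL || slash == in)
            return -1;                      /* missing node or '/' */
        node_len = (size_t)(slash - in);
        cpu_len = entry_len - node_len - 1;
        if (cpu_len == 0 || node_len >= sizeof prev)
            return -1;

        if (strlen(prev) == node_len && strncmp(prev, in, node_len) == 0) {
            /* Same node as the previous entry: append ".cpu". */
            n = snprintf(out + used, outlen - used, ".%.*s",
                         (int)cpu_len, slash + 1);
        } else {
            /* New node: start a new "+node/cpu" group. */
            n = snprintf(out + used, outlen - used, "%s%.*s/%.*s",
                         used ? "+" : "", (int)node_len, in,
                         (int)cpu_len, slash + 1);
        }
        if (n < 0 || (size_t)n >= outlen - used)
            return -1;
        used += (size_t)n;

        memcpy(prev, in, node_len);
        prev[node_len] = '\0';

        in += entry_len;
        if (*in == '+') {
            in++;
            if (*in == '\0')
                return -1;                  /* trailing '+' */
        }
    }
    return 0;
}
```

Given "sc17/0+sc17/1+sc17/2+sc17/3+sc18/1+sc18/3" this writes the example string from the text, "sc17/0.1.2.3+sc18/1.3".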
Jobfs
jobfs is a disk resource on every node, i.e. an allocated local scratch disk. It must be requested as part of a job request.
RMS: The only node resource not available from RMS node_stats. Fairly static so could just be a table.
- Scheduler:
- Correct allocation like any other resource
- Server:
- Abstract resource - no change
- MOM:
- Form jobfs pathname and add PBS_JOBFS to environment
- Create directory (and possibly apply quota to user on jobfs filesystem)
- Add to list of resources that are monitored in mom_get_sample()
- At job completion, traverse the jobfs directory deleting files then remove directory
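The MOM steps above (form the pathname, export PBS_JOBFS, create the directory, remove it at job completion) might look roughly like this. It is a sketch with hypothetical function names; the quota step and the mom_get_sample() monitoring are left out, and the layout <root>/<jobid> is an assumption.

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create <root>/<jobid> and export it as PBS_JOBFS.
 * The <root>/<jobid> layout is an assumption. Returns 0 on success. */
static int jobfs_setup(const char *root, const char *jobid,
                       char *path, size_t len)
{
    int n = snprintf(path, len, "%s/%s", root, jobid);

    if (n < 0 || (size_t)n >= len)
        return -1;
    if (mkdir(path, 0700) != 0 && errno != EEXIST)
        return -1;
    return setenv("PBS_JOBFS", path, 1);
}

/* Delete everything under path, then path itself. Returns 0 on success. */
static int jobfs_remove(const char *path)
{
    struct stat st;
    struct dirent *e;
    DIR *d;
    int rc = 0;

    if (lstat(path, &st) != 0)
        return -1;
    if (!S_ISDIR(st.st_mode))
        return unlink(path);        /* plain file or symlink */

    d = opendir(path);
    if (d == NULL)
        return -1;
    while (rc == 0 && (e = readdir(d)) != NULL) {
        char child[4096];
        int n;

        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;
        n = snprintf(child, sizeof child, "%s/%s", path, e->d_name);
        rc = (n < 0 || (size_t)n >= sizeof child) ? -1 : jobfs_remove(child);
    }
    closedir(d);
    return rc == 0 ? rmdir(path) : -1;
}
```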
MOM-server connectivity robustness
A common reliability problem for OpenPBS is the server/scheduler hanging when a node (or the MOM on a node) dies. In particular, requests to MOM for job resource usage or node status can hang. With this mod, job resource usage reporting is initiated by the MOM, not the server. The server just keeps the latest usage info sent in, which may be way out of date if a node/MOM dies.
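The server side of this push model reduces to storing the most recent report per job. A sketch, with a hypothetical record layout and function names; the staleness check is an addition for illustration (the notes only say that the info may be out of date), not something the mod is known to do.

```c
#include <time.h>

/* Hypothetical record of the last usage report a MOM pushed for one job. */
struct usage_report {
    long   cput_secs;   /* CPU time reported by the MOM */
    long   mem_kb;      /* memory use reported by the MOM */
    time_t received;    /* when the report arrived; 0 = none yet */
};

/* Called when a report arrives; the server never polls the MOM. */
static void usage_store(struct usage_report *slot, long cput_secs,
                        long mem_kb, time_t now)
{
    slot->cput_secs = cput_secs;
    slot->mem_kb = mem_kb;
    slot->received = now;
}

/* Illustrative addition: returns 1 if there is no report or the latest
 * one is older than max_age_secs, e.g. because the node or MOM died. */
static int usage_is_stale(const struct usage_report *slot, time_t now,
                          long max_age_secs)
{
    if (slot->received == 0)
        return 1;
    return difftime(now, slot->received) > (double)max_age_secs;
}
```

Because readers only ever look at the stored record, a dead MOM costs freshness rather than blocking the server.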
Scheduler
New scheduler from the ground up - a book in itself ...
MOM enhancements
process tree tracking and signals
resource utilization
shared memory segment memory usage tracking
integrated mom_mach.c (into mom_mach_common.c) for those platforms of interest (Linux, Solaris and Tru64) so that common functionality is supported.
limit the size of stdout and stderr files in local spool area.
setgid(PROJECT)
time varying sampling
send_sisters
at least 60 seconds grace between SIGXCPU and SIGKILL (needs to be user specifiable)
epilog "script" for appending resource usage info to stdout
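The SIGXCPU/SIGKILL grace period from the list above can be sketched as follows, with the grace period as a parameter (the notes say it should become user specifiable). The function name is hypothetical, and the sketch assumes the target is a direct child of the caller so that waitpid() can observe it.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Send SIGXCPU, give the process up to grace_secs to exit, then SIGKILL.
 * Assumes pid is a direct child of the caller.
 * Returns the signal that terminated the process, 0 if it exited
 * normally, or -1 on error. */
static int enforce_cpu_limit(pid_t pid, unsigned grace_secs)
{
    int status;
    pid_t r;

    if (kill(pid, SIGXCPU) != 0)
        return -1;

    /* Poll once a second so that an early exit is noticed promptly. */
    for (unsigned waited = 0; ; waited++) {
        r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            break;                      /* process has gone */
        if (r < 0)
            return -1;
        if (waited >= grace_secs) {     /* grace period used up */
            kill(pid, SIGKILL);
            if (waitpid(pid, &status, 0) != pid)
                return -1;
            break;
        }
        sleep(1);
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

A caller matching the notes would pass a grace period of at least 60 seconds.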
RMS integration
The Resource Management System (RMS) on the SC provides a "context" for MPI jobs, i.e. access rights to network interfaces etc. RMS must allocate a resource (a list of cpus on nodes) to all MPI jobs.
Due to limits on how resource shapes can be specified, the current implementation involves an independent daemon, rmsresd, which is responsible for making resource requests to RMS and ensuring that the resource shape returned is the one requested by PBS. The scheduler decides on cpus and makes a request to rmsresd via a Unix domain socket. rmsresd makes sure unwanted cpus are allocated to existing jobs or "fill" resources. It then makes a request to RMS for the required number of cpus and checks that the resource allocated matches the requested shape. rmsresd is also responsible for making the resource suspend and resume requests to RMS when a multi-cpu job is suspended and resumed.
MOM's role in RMS:
suspend/resume/kill resource
inquiring about RMS resusage prolog.c:run_pelog()
allocating RMS resources with V2.5 support
prun modification
Attributes and Resources
PBS is designed to be extensible in the sense of adding attributes or resources to jobs, queues etc. Support needs to be added to the scheduler and possibly MOM for such additional attributes or resources.
- stime (time):
- start time of job (needed for expansion factor)
- cwd (boolean):
- qsub option -wd for starting job in submission directory (PBS_O_WORKDIR)
- use_nodes (string):
- queue attribute (set in qmgr) for restricting queue use of nodes. Form is either inclusive "node1+node2+..." or exclusive "!node1+node2+..."
- jobfs (size):
- job resource (see above)
- ncpus (int):
- already in OpenPBS - supported directly in scheduler
- alt_id (int):
- already in OpenPBS - job attribute to hold RMS resource id for multi-cpu jobs
- nr_vmem (long)
- Not actually an attribute - added to noderes struct in job.h. Support added in MOM.
- nr_syst (TODO)
- as for nr_vmem.
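As an illustration of the use_nodes forms listed above, the following sketch decides whether a queue may use a given node. The function name is hypothetical, and it assumes that a leading "!" negates the whole list and that an unset or empty value places no restriction; neither assumption is confirmed by these notes.

```c
#include <string.h>

/* Returns 1 if a queue whose use_nodes value is spec may use node.
 * Assumptions: a leading '!' negates the entire list, and a NULL or
 * empty spec allows every node. */
static int queue_may_use_node(const char *spec, const char *node)
{
    int exclusive = 0;
    int listed = 0;
    size_t node_len = strlen(node);

    if (spec == NULL || *spec == '\0')
        return 1;
    if (*spec == '!') {
        exclusive = 1;
        spec++;
    }

    /* Walk the '+'-separated node names looking for an exact match. */
    while (*spec != '\0') {
        size_t len = strcspn(spec, "+");

        if (len == node_len && strncmp(spec, node, len) == 0)
            listed = 1;
        spec += len;
        if (*spec == '+')
            spec++;
    }
    return exclusive ? !listed : listed;
}
```

Matching on the full name length keeps "sc2" from matching "sc20".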
Commands
Source for cmd is cmds/cmd.c
- qsub
- Parses resource requests and checks against RASH limits
- qstat
- Minor change to provide a message when the server is down
- qorder
- Standard qorder (really pbs_orderjob() support in server) only supports swapping two jobs in a queue. Added argv[3] (supported by the char *extend arg of pbs_orderjob()) to place job argv[2] "BEFORE"/"AFTER" (argv[3]) job argv[1].
- pbsnodes
- Shortened the output of the -a invocation (possibly too much). Added -p (through the char *extend arg of pbs_statnode()) to cause the server to ping all nodes immediately (used when restarting a MOM; the default is 600 secs between pings). Support added in server/req_stat.c:req_stat_node().
- nqstat
- "qstat-like" command from RASH modified to request info from PBS.
- pestat (NEW)
- Provides status and resource summary from nodes. Can take a node name as an argument.
- jobnodes (NEW)
- Tabular presentation of jobids on cpus of nodes. Includes suspended jobs and indication of job's queue via "*".
- jobs_on_node (NEW)
- Summary of jobs and their resource usage on a given node. (Needs work on formatting.)
- pbs_rusage (NEW)
- Effectively produces the same job resource usage output as the MOM epilog script (used by some users who ignore the stdout file). Uses either PBS_JOBID or an argument to determine the jobid.
tm API and support
Not used on the SC.
Important for integrating public domain MPI libraries like MPICH and LAM.
Miscellaneous hacks
Shifted the port numbers to 40001, 40002, etc. on the SC because of apparent conflicts with 15001, etc. (include/pbs_ifl.h).
Allows requests from any host with the same basename as the PBS server hostname (lib/Libsite/site_check_u.c). Should be fixed.