$Id: sched-config.html,v 1.1 2000/03/20 23:11:26 bayucan Exp $

Origin2000 Scheduler Configuration Guide

INTRODUCTION

This document describes how to configure the NAS Origin2000 PBS job scheduler. A brief overview of the operation and algorithms used by the scheduler is presented in the file overview.html in this directory. Each of the 40 configuration options is listed below, cross-referenced to any related options.

There is also a complete index of all of the options at the end of this guide. The index is organized both alphabetically and by functional grouping.

HIGH-LEVEL OVERVIEW OF SCHEDULING

A generic PBS scheduler periodically queries the PBS server for a list of jobs that are running and queued. From this list, the scheduler chooses the next job that should be run and, if the resources are available, instructs the server to start that job running. In making the decisions, it may query external sources of information and modify its algorithms based on the returned data.

The specifics of how the next job is chosen are mostly a policy issue. Your site must specify a policy, and configure the scheduler to implement that policy. The Origin scheduler is able to implement many common policies simply by being configured in various ways. However, for very complicated policies, it may be necessary to modify or extend the current algorithms.

For the Origin2000 systems running at NAS, the only strict requirement for a scheduler is that it must set the "Resource_List.ssinodes" attribute of a job and instruct the server to start its execution. The "ssinodes" attribute is used by the PBS resource monitor to allocate a portion of the execution host's resources to the job when it is started. "ssinodes" was added by NAS to allow a job to specify a number of computing elements on a single host (which PBS refers to as a "node").

CONFIGURATION FILE SYNTAX

The Origin scheduler gets most of its configuration and run-time options from the file specified at compile-time by CONFIGFILE (defined in the Makefile.in template). By default, the path of the configuration file is "$PBS_SERVER_HOME/sched_priv/config". This file is read by the scheduler when it starts up. Sending a SIGHUP to the scheduler will cause it to reconfigure itself from the current contents of the config file.

The configuration file is read and parsed line-by-line. Comments begin with a '#' character, and extend to the end of the line. They may be placed anywhere in the file, as may blank lines. Each line is parsed into a pair of white-space delimited words, an option and an argument.

The name of the option must be all-uppercase (for clarity). White-space is currently not allowed in the argument -- options that take lists of items or multiple values use commas to separate the individual items within the argument string. Syntax errors in the configuration file are caught by the parser, and the offending line number and option are noted in the scheduler logs. The scheduler will not start while there are syntax errors in its configuration file.
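The parsing rules above can be sketched in a few lines of Python. This is an illustration only, not the scheduler's own parser; the names parse_config and ConfigError are hypothetical. It strips comments, skips blank lines, splits each remaining line into an option and an argument, and reports the line number when a line does not fit.

```python
class ConfigError(Exception):
    """Raised for a syntax error; the message carries the line number."""


def parse_config(text):
    """Return a list of (line_number, option, argument) tuples."""
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Comments begin with '#' and extend to the end of the line.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue  # blank or comment-only line
        words = line.split()
        # Each line is a pair of whitespace-delimited words.
        if len(words) != 2:
            raise ConfigError(f"line {lineno}: expected '<OPTION> <argument>'")
        option, argument = words
        if option != option.upper():
            raise ConfigError(f"line {lineno}: option '{option}' must be uppercase")
        entries.append((lineno, option, argument))
    return entries


sample = """\
# Minimal configuration
SUBMIT_QUEUE    submit
BATCH_QUEUES    small@32p-o2k.bar.com   # only 28p max
"""
for lineno, option, argument in parse_config(sample):
    print(lineno, option, argument)
```

Because the argument is a single word, a list such as "a,b,c" reaches the caller as one string and is split on commas afterwards.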

Note that many of the configuration options (e.g. AVOID_FRAGMENTATION) are left over from experimental algorithms that were tried at various stages of the lifetime of the Origin2000s. These "historical" options are marked in the descriptions below with the tag [deprecated]. They will probably be removed in future releases, and their use is discouraged.

DEBUGGING AND LOG INFORMATION

The scheduler puts a great deal of debugging information in its log files. These files can be found in /PBS/sched_logs/<YYYYMMDD>, and contain mostly human-readable output.

For lower-level debugging, the scheduler also dumps various debugging and state data onto its stdout stream. If the daemon is built with -DDEBUG, this output will be directed into the file /PBS/sched_priv/sched_out. The volume of output generated can be quite large for an active system, and it is intended for use only by those who have some familiarity with the code.

The sched_out file will grow without bound and eventually fill up the /PBS partition. It is recommended that it be pruned periodically to remove old information. Each iteration is tagged with a timestamp, both in ctime(3) format and as seconds since the Epoch. This should make it relatively easy to write a script to prune out the old debugging data.

ARGUMENT TYPES

The following basic types are allowed as arguments. The actual type of argument that is expected varies from option to option.

<boolean>
A boolean value. The strings "true", "yes", "on" and "1" are all "True"; anything else evaluates to "False".
If a string in the format 'MM/DD/YYYY@HH:MM:SS' is supplied (e.g. "09/07/1999@05:01:00"), the option will be "False" until the given date and time, and "True" after that. If only the date is specified, the time is assumed to be 00:00:00 on the morning of that day.
<integer>
An integer value (e.g. "64"). The range is determined by the option, but is typically a small positive number.
<pathname>
The path to some file in the local filesystem. If an absolute path is not specified, the path is relative to PBS' scheduler directory ($PBS_SERVER_HOME/sched_priv/). To avoid confusion, using absolute pathnames is recommended.
<queue>
The name of a PBS queue. May be either 'queue@hostname' or just 'queue'. If the hostname is not specified, it defaults to the hostname of the local machine.
<real>
A real-valued (floating point) number (for example, "0.80" or "-1.0").
<string>
An uninterpreted string that is passed to other programs. Note that, due to limitations in the parser, whitespace is not allowed in strings. Leading and trailing quotes are not removed by the parser.
<timespec>
A string of the form HH:MM:SS (e.g. 00:30:00 for thirty minutes, 4:00:00 for four hours). Note that the timespec is parsed from right-to-left, so "01:30" is 90 seconds, not an hour and 30 minutes.
<variance>
A negative and positive deviation from some base value. The syntax is '-mm%,+nn%' (e.g. '-10%,+15%' for minus 10 percent and plus 15 percent from some given value). Variances describe acceptable deviation from some ideal value.
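The <timespec>, <boolean> and <variance> rules are the ones most often misread, so a small Python sketch may help. The function names are hypothetical and this is not the scheduler's code. Two details are assumptions rather than documented behavior: boolean strings are matched case-insensitively (the samples in this guide write "True"), and a date/time argument becomes "True" at the stated instant.

```python
from datetime import datetime


def parse_timespec(spec):
    """Convert an HH:MM:SS string to seconds, reading fields right to left."""
    fields = spec.split(":")
    if not 1 <= len(fields) <= 3:
        raise ValueError(f"bad timespec: {spec!r}")
    seconds = 0
    # The rightmost field is seconds, the next minutes, the next hours.
    for multiplier, field in zip((1, 60, 3600), reversed(fields)):
        seconds += multiplier * int(field)
    return seconds


def parse_boolean(arg, now):
    """Evaluate a <boolean> argument as of the datetime 'now'."""
    # ASSUMPTION: matching is case-insensitive.
    if arg.lower() in ("true", "yes", "on", "1"):
        return True
    for fmt in ("%m/%d/%Y@%H:%M:%S", "%m/%d/%Y"):
        try:
            switch_time = datetime.strptime(arg, fmt)
        except ValueError:
            continue
        # ASSUMPTION: the option is True from the stated instant onward.
        return now >= switch_time
    return False  # anything else evaluates to "False"


def parse_variance(arg):
    """Split '-mm%,+nn%' into (negative, positive) percentages."""
    low, high = arg.split(",")
    return float(low.rstrip("%")), float(high.rstrip("%"))


print(parse_timespec("01:30"))  # seconds, not an hour and a half
print(parse_boolean("09/07/1999@05:01:00", datetime(1999, 9, 7, 5, 0, 0)))
print(parse_variance("-10%,+15%"))
```

A date-only argument falls through to the second format, which yields midnight on that day.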

Any previous value set for an option is over-ridden if the option is set again later in the file. The exception to this rule is the <queue> specifier, which appends the given queue(s) to the appropriate list.

INTERACTION WITH PBS SERVER AND MOM(S)

In order to determine what resources are available to be scheduled, the scheduler communicates with the PBS server and the PBS mom on each host that is under its control. The server supplies global configuration information, specifically a list of all the jobs and resource information for each of the batch queues. The mom on each host tells the scheduler how many nodes are available on that host, the load average, etc.

The PBS server should be configured with maximum resource limits for each queue. The scheduler uses these to determine whether a job "fits" in the batch queue. Each queue should have at least the following resources defined for it :

	resources_max.mem
	resources_max.ncpus
	resources_max.walltime

The resources_max.mem and resources_max.ncpus values should be even multiples of a single "ssinode" -- a physical computing node in the Origin2000 architecture. This means that the 'ncpus' and 'mem' maximum limits should be multiples of 2 and 490mb, respectively. The machines are scheduled as sets of atomic "nodes", each consisting of a pair of CPUs and a bank of memory. See the document O2K-config.html in this directory for more information about the machine's physical architecture and layout.

It's also a good idea to set the resources_default.* resources for the submit queue. At NAS, these are usually set to 5 minutes and 1 node:

	set queue submit resources_default.mem = 490mb
	set queue submit resources_default.ncpus = 2
	set queue submit resources_default.walltime = 00:05:00

The server itself should also be configured to have a set of resource limits. These limits should be the superset of all queue limits. In other words, if you have a 4-node queue with an 8-hour walltime limit, and a 64-node queue with a 1-hour walltime limit, you should set your server limits to 64 nodes and 8 hours. The scheduler will reject any jobs that do not fit into any single queue definition (e.g. a 4-hour, 64-node job):

	set server resources_max.mem = 31360mb
	set server resources_max.ncpus = 128
	set server resources_max.walltime = 08:00:00
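The arithmetic behind these limits is easy to check by hand or with a short Python sketch. It assumes the node size given above (2 CPUs and 490mb per node) and uses the two example queues from the text; the queue names and the helper functions are hypothetical.

```python
NODE_CPUS = 2      # CPUs per Origin2000 node
NODE_MEM_MB = 490  # schedulable memory per node, in megabytes


def node_limits(nodes):
    """resources_max values that correspond to a whole number of nodes."""
    return {"ncpus": nodes * NODE_CPUS, "mem_mb": nodes * NODE_MEM_MB}


# name -> (nodes, walltime in hours); names are illustrative only
queues = {"small": (4, 8), "large": (64, 1)}

# The server limits are the maximum of each resource over all queues.
server_nodes = max(nodes for nodes, _ in queues.values())
server_hours = max(hours for _, hours in queues.values())


def fits_some_queue(job_nodes, job_hours):
    """True if the job fits entirely within at least one queue."""
    return any(job_nodes <= n and job_hours <= h for n, h in queues.values())


print(node_limits(server_nodes), server_hours)
print(fits_some_queue(64, 4))  # within the server limits, but in no queue
```

The 64-node result is 128 CPUs and 31360mb, which is where the server values shown above come from.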

The PBS mom must respond to requests for resource information from the scheduler. Usually this means that you must add a '$clienthost' entry for the scheduler's host in the mom's configuration file. Note that if the scheduler is running on the same host as the mom or server, you will probably need a '$clienthost' entry for localhost as well as the local machine's hostname.

BASIC SCHEDULER CONFIGURATION

The Origin2000 scheduler is designed to be useful with a minimal set of configuration options. A minimal, working configuration must include at least the following options:

	SUBMIT_QUEUE		<queue>
	BATCH_QUEUES		<queue>[,<queue> ...]

The minimal scheduler configuration should select jobs from the submit queue, then pack and run them on the batch queues. The jobs will be chosen for execution in a permuted order based on their size, queued time, and recent usage of the system by each user. The default behavior can be modified by including declarations for the options listed below. Some sample configuration files are included below.

A trivial case (one submit queue and one execution queue on the local host, default policy but do not attempt to increase fairness by sorting based on past-usage):

	SUBMIT_QUEUE		submit
	BATCH_QUEUES		execute
	SORT_BY_PAST_USAGE	False

More complicated configurations are possible. For instance, to schedule jobs across several hosts, with primetime from 8:00AM to 5:00PM limited to 3 hours, and special priority to the "special" queue:

	# Our site has several different sized hosts (named 32p-o2k,
	# 64p-o2k and 128p-o2k).  Schedule the queues 'small', 'medium'
	# and 'large' on them.  Note that a few CPUs are reserved for
	# system activity.
	SUBMIT_QUEUE		submit
	BATCH_QUEUES 		small@32p-o2k.bar.com # only 28p max
	BATCH_QUEUES		medium@64p-o2k.bar.com	# only 60p max
	BATCH_QUEUES		large@128p-o2k.bar.com	# only 124p max
	#
	# "Special access" queue -- by request only, controlled by queue
	# ACL's in the server configuration (qmgr(1)).
	SPECIAL_QUEUE		special
	#
	# Our policy states that primetime is from 8:00 to 17:00 M-F
	# (except on holidays).  Jobs may run for up to 3 hours within
	# primetime.
	ENFORCE_PRIME_TIME	True
	PRIME_TIME_START	08:00:00
	PRIME_TIME_END		17:00:00
	PRIME_TIME_WALLT_LIMIT	03:00:00
	#
	# Attempt to foster fair use of the system by giving lower
	# priority to jobs whose owners have recently used many hours
	# of system time.
	SORT_BY_PAST_USAGE	True

Many more options are available for implementing various policies. See the comments in the sample configuration file and the descriptions of each option below.

IMPORTANT/REQUIRED FILES

The following files are important to the operation of the scheduler. Most of the pathnames are configurable at build-time or run-time.

$SERVER_HOME/sched_priv/config
Origin2000 Scheduler configuration file (required)
$SERVER_HOME/sched_priv/decay_usage
Per-user recent past-usage database file
$SERVER_HOME/sched_priv/sched_out
Unformatted low-level debugging output
$SERVER_HOME/sched_priv/sortedjobs
Dump of job list after sorting (in preferred order to run)
$SERVER_HOME/pbs_acctdir/allocations
Per-group node-hour allocations database file
$SERVER_HOME/pbs_acctdir/current
Per-user database of current (YTD) node-hour usage
/usr/lib/acct/holidays
List of holidays this year, with day numbers (counting from 1)

CONFIGURATION OPTIONS (GROUPED BY FUNCTION)

Here are the various configuration options, grouped by function. Unless noted otherwise, these options may appear anywhere in the configuration file.

Queue Definitions:

SUBMIT_QUEUE <queue> :
The submit queue is the default holding queue in which jobs are queued for "normal" scheduling. Jobs remain in the submit queue until they are chosen for execution. When a job is executed, it is moved to the proper batch queue, and the server is asked to execute the job on the host specified with the batch queue.

BATCH_QUEUES <queue>[,<queue>...] :
Batch queues specify the set of "execution" queues in which the scheduler should pack the jobs found in the submit queue. Jobs are tried against each of the batch queues in the sequence listed. The "size" and time limits for job execution in each queue are taken from the server's definitions of the queue "resources_max.*" attributes. Batch queues are kept in a list internally in the scheduler, and may be specified either in a comma separated list, or line-by-line.

Typically these queues are named either 'execute', or named after the execution host. 'execute' may be less confusing for a single-host installation, while using the hostname for the execution queue allows users to easily determine where their jobs are running (e.g. from the output of the 'qstat -r' command). For sites with more than a single execution host, we recommend using the hostname for queues.

It is possible for each execution host to have more than one queue, but for most installations this is unnecessary. A potential use of this feature is to dedicate a portion of a machine to a specific project or user with the queue 'user_acl' resource. The remaining resources can be placed in an un-ACL'd queue, and the two queues scheduled together. The ACL'd queue should be placed first in the list, to give it precedence. Then, the privileged user is guaranteed access to the resources in the ACL'd queue, but also has a fair shot at the remaining resources. Other mechanisms exist for handling the problem of giving a set of users an elevated priority on the hosts -- see the descriptions of SPECIAL_QUEUE and EXTERNAL_QUEUES below.

Because jobs are compared against queue limits in the order listed, it is a good idea to list the queues in ascending size/time order. This will cause the "smaller" queues to be filled first, leaving the larger queues less utilized and allowing large jobs to be run quickly. If there are no large jobs with a sufficiently high priority, several small jobs will be packed into a large queue in order to keep up utilization and throughput.
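The effect of queue ordering can be illustrated with a first-fit sketch in Python. It is a simplification under stated assumptions, not the scheduler's packing code: each queue is reduced to a node limit, a walltime limit and a count of free nodes, and the queue names and sizes are hypothetical.

```python
def first_fit(job, queues):
    """Return the name of the first listed queue that can take the job."""
    for queue in queues:
        within_limits = (job["nodes"] <= queue["max_nodes"]
                         and job["walltime"] <= queue["max_walltime"])
        if within_limits and job["nodes"] <= queue["free_nodes"]:
            return queue["name"]
    return None  # the job stays in the submit queue


# Queues listed in ascending size/time order, as recommended above.
queues = [
    {"name": "small", "max_nodes": 14, "max_walltime": 4 * 3600, "free_nodes": 14},
    {"name": "large", "max_nodes": 62, "max_walltime": 8 * 3600, "free_nodes": 62},
]

print(first_fit({"nodes": 8, "walltime": 3600}, queues))
print(first_fit({"nodes": 32, "walltime": 3600}, queues))
```

With the small queue listed first, the 8-node job lands there and the large queue stays free for the 32-node job. Reversing the list would send both jobs to the large queue.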

A typical BATCH_QUEUES configuration might look like this:

	# Execution queue on our 64-p box, which spans the whole machine:
	BATCH_QUEUES	execute@64p-o2k.bar.com
	
	# Execution queues on the 128-p.  Chess is a small queue dedicated
	# to the folks developing a new chess algorithm (controlled by an
	# ACL).  The remaining resources are covered by the execute queue.
	BATCH_QUEUES	chess@128p-o2k.bar.com,execute@128p-o2k.bar.com

Note also that there are two 'execute' queues -- the hostnames differ, allowing the scheduler to track them as independent objects.

The option name "BATCH_QUEUES" is a hold-over from an earlier, obsolete design that differentiated between "interactive" and "batch" queues.

SPECIAL_QUEUE <queue> :
The current scheduler implementation uses a modified-FIFO algorithm to prioritize the jobs waiting to be run. Basically, it sorts the list of jobs into the preferred order of execution, then permutes the list according to each user's recent past usage. It then attempts to run the jobs in the order of the resulting list.

As a brute-force means of allowing higher-priority access to the systems, a "special-access" queue can be specified. It is usually a clone of the submit queue, although its resource limits may be made smaller to reduce the size of jobs that can be submitted to it. The scheduler will check for jobs on the special queue, and place them at the head of the permuted list. This gives them the highest "priority" in the system.

Jobs that are enqueued on the special queue remain in the special queue until they are moved to an execution queue and started. The special queue is merely a container to differentiate the queues. The user ACL's provided by PBS should be used to limit access to the queue. Note that access to the special queue should be closely controlled. If it is used carelessly, the special jobs can starve all other jobs in the system (until they all become waiting, which leads to more problems).

We suggest a formal special-access request procedure, and a short (two weeks) time limit for the access. "Special" jobs are pre-empted only by jobs that have waited for too long and are marked "outstanding". Neither "outstanding" nor "special" jobs are subject to most policy-based restrictions (e.g. prime-time walltime limits, past-usage, etc).

In order to help the other users to understand why their jobs have been preempted, all running "special" jobs are marked in the job comment field with a message like:

"Started on Fri Aug 13 10:13:31 PST 1999 (special access job)"

See also:
INTERACTIVE_LONG_WAIT
MAX_QUEUED_TIME

DEDICATED_QUEUES <queue>[,<queue>...] :
For large installations, it is very likely that one or more users will want to run jobs on the whole system, without competing for resources with other users. The way that NAS deals with this problem is to instruct the users to submit their jobs to the dedicated queue (or queues, if the scheduler is managing multiple machines).

The administrator then schedules a dedicated time for the user at the requested time. These jobs remain queued in the dedicated queue until the machine enters dedicated time, at which point they are run. There should be a separate dedicated queue for each machine. NAS names the dedicated queues based on the hostname, e.g. for hopper, the dedicated queue would be named 'hopper_ded'.

Jobs in the dedicated queue will be run in-place, in the order they were enqueued. They are packed, but not otherwise re-ordered. There is no reason to reprioritize them (dedicated time is inherently unfair), and most users want control over the execution sequence of their jobs (i.e. they submit jobs in the order they should run).

Again, ACL's should be used on the dedicated queues to prevent users from accidentally submitting jobs to them without scheduling a dedicated time in which they can be run.

See also:
ENFORCE_DEDICATED_TIME
DEDICATED_TIME_CACHE_SECS (required)
DEDICATED_TIME_COMMAND (required)
SYSTEM_NAME

EXTERNAL_QUEUES <queue>[,<queue>...] :
One or more "external" queues may also be specified. External queues provide a mechanism to allow users to "schedule" jobs themselves. Jobs are enqueued in the external queue, where they are scheduled in the order they were submitted. The term "external" comes from the idea of an externally-routed queue -- jobs just "appear" in the EXTERNAL_QUEUES, ready for the scheduler to run in-place.

This functionality was added to support a small 'compile' queue on an early set of machines, allowing the users to submit larger jobs to the global submit pool, or place their job in line in the 'compile' queue.

External queues are seldom used, but allow the same scheduler to simultaneously act as both a FIFO-packing and a priority-packing algorithm.

Primetime Configuration and Walltime Limits

PRIME_TIME_WALLT_LIMIT <timespec>
The site's scheduling policy often specifies two distinct policies, one for "non-primetime" (when long-running batch jobs should execute), and one for "primetime" (the time during the day when most people are at work). This allows one machine to service the incompatible requirements of maximizing throughput while allowing long-running jobs to run.

During the non-primetime hours (usually late at night when the users have mostly gone home), the scheduler attempts to maximize the utilization of the machine, while running the long jobs that cannot run during the day. For this reason, the longer, larger jobs are favored over shorter and smaller jobs.

In order to keep throughput up during the workday, most sites will choose to have a shorter walltime limit for jobs running in the day than during the night. Because of the shorter walltime and need to run more jobs, during primetime the sorting algorithms are adjusted to favor small, short jobs over longer, larger jobs.

A typical value for walltime limits during primetime is 2 hours:

	# Only run jobs up to 2 hours long during the day.
	PRIME_TIME_WALLT_LIMIT		02:00:00
	ENFORCE_PRIME_TIME		True

Primetime is only enforced during work-weekdays (i.e. Monday through Friday). Primetime is not observed on any holiday listed in the file /usr/lib/acct/holidays (be sure to update this file each year). The holidays file contains a list of holidays, one entry per line. Each entry has the format:

	<day_of_year> <month> <day_of_month> <description>

Lines beginning with a '*' are considered comments. The scheduler only parses the <day_of_year> field, which should be a number from 1 to 365 according to the date of the holiday.
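As a rough illustration of how such a file can be consumed, the Python sketch below reads only the <day_of_year> field and then decides whether primetime applies on a given date. The function names and the sample file contents are hypothetical, and the sketch is not the scheduler's own reader.

```python
from datetime import date


def parse_holidays(text):
    """Return the set of day-of-year numbers listed in holidays-file text."""
    days = set()
    for line in text.splitlines():
        if not line.strip() or line.startswith("*"):
            continue  # blank line or comment
        # Only the first field, <day_of_year>, is used.
        days.add(int(line.split()[0]))
    return days


def is_primetime_day(day, holidays):
    """True for Monday through Friday, unless the day is a listed holiday."""
    return day.weekday() < 5 and day.timetuple().tm_yday not in holidays


# HYPOTHETICAL file contents for 1999.
sample = """\
* day-of-year  month  day  description
1    Jan  1   New Year's Day
329  Nov  25  Thanksgiving Day
"""
holidays = parse_holidays(sample)
print(is_primetime_day(date(1999, 11, 24), holidays))  # an ordinary Wednesday
print(is_primetime_day(date(1999, 11, 25), holidays))  # a listed holiday
```

Because the file is keyed on day numbers rather than dates, the same number refers to a different calendar date after a leap day, which is one reason to refresh the file every year.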

During non-primetime (or if primetime is not enforced), the walltime limit is taken from the 'resources_max.walltime' value returned by the PBS server. Note that the primetime limit applies to the amount of time the job would run during each primetime period, not to the total time for the job. This means that no job may run for more than the primetime walltime limit during any given primetime period.

Remember that the <timespec> is parsed from right-to-left, so the value "02:00" is two minutes, not two hours.

See also:
PRIME_TIME_START (required)
PRIME_TIME_END (required)
ENFORCE_PRIME_TIME

PRIME_TIME_START <timespec>
PRIME_TIME_END <timespec>
These options declare the starting and ending times of prime-time. Between these hours, only jobs with walltime requests that fit within the PRIME_TIME_WALLT_LIMIT will be considered runnable.

In addition, during primetime, jobs are sorted from smallest/shortest to largest/longest. This favors short, small jobs which should improve throughput and response time, allowing more iterations in the "submit, run, debug, resubmit" batch development cycle.

For instance, to declare primetime from 8AM to 5PM (local time), the following options should be set:

	# Prime-time is from 8:00AM to 5:00PM (local time).
	PRIME_TIME_START	08:00:00
	PRIME_TIME_END		17:00:00

Special and outstanding jobs are not subject to primetime walltime limits. Remember that the <timespec> is parsed from right-to-left, so "02:00" is 2 minutes past midnight, not 2:00 AM.

See also:
PRIME_TIME_WALLT_LIMIT (required)
ENFORCE_PRIME_TIME

ENFORCE_PRIME_TIME <boolean>
Setting this option enables the enforcement of separate policies for primetime and non-primetime. During primetime (usually configured to be the "normal" work-day), the scheduler chooses jobs in an order that should increase throughput. It also favors interactive jobs, and enforces a shorter walltime limit on any portion of a job that runs in primetime. During non-primetime, the walltime restrictions are lifted, and the job may run up to the maximum time specified on the execution queue. Batch jobs (non-interactive) are favored during non-primetime periods.

The argument to the ENFORCE_PRIME_TIME option may be either a boolean or a date/time specification (see the description of argument types above). Note that the date/time is the first time at which the scheduler should consider primetime enforced. Any jobs considered before this time will not be subject to primetime constraints, even if the job would run past the time that the option starts being enforced.

Also note that it is the portion of the job that will be running in primetime that is compared against the primetime walltime limit, not the total execution time requested. If the requested walltime is long enough to span more than one primetime period, it is allowed to run up to the primetime walltime limit for each period. This means that, with primetime from 8:00AM to 5:00PM, a job could be started that runs, for instance, 2 hours in primetime, 15 hours in non-primetime, and then another 2 hours in the next primetime period.
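The per-period comparison can be made concrete with a Python sketch. It is a simplified model under stated assumptions, not the scheduler's algorithm: primetime is taken to be 08:00 to 17:00 on every day (weekends and holidays are ignored), and the function names are hypothetical.

```python
from datetime import datetime, time, timedelta


def primetime_portions(start, walltime, pt_start=time(8, 0), pt_end=time(17, 0)):
    """Yield the time a job spends inside each primetime period it touches."""
    end = start + walltime
    day = start.date()
    # Walk forward one day at a time until primetime begins after the job ends.
    while datetime.combine(day, pt_start) < end:
        period_start = datetime.combine(day, pt_start)
        period_end = datetime.combine(day, pt_end)
        overlap = min(end, period_end) - max(start, period_start)
        if overlap > timedelta(0):
            yield overlap
        day += timedelta(days=1)


def fits_primetime(start, walltime, limit):
    """True if no single primetime period holds more than 'limit' of the job."""
    return all(part <= limit for part in primetime_portions(start, walltime))


start = datetime(1999, 11, 1, 15, 0)  # 3:00PM, two hours before primetime ends
limit = timedelta(hours=2)
print(list(primetime_portions(start, timedelta(hours=19))))
print(fits_primetime(start, timedelta(hours=19), limit))
print(fits_primetime(start, timedelta(hours=20), limit))
```

The 19-hour job touches two primetime periods for two hours each and passes a 2-hour limit, while the 20-hour job spends three hours in the second period and does not.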

Primetime is only enforced on regular working days (i.e. Monday through Friday, excepting holidays). The list of which days are holidays is parsed from the holidays file, in /usr/lib/acct/holidays by default (this can be changed at compile time). The scheduler monitors the timestamp of this file, and will re-read it if it is modified.

An example of configuring the ENFORCE_PRIME_TIME option using the date/time format (assuming an 8-hour maximum walltime and a 2-hour primetime limit from 8:00AM to 5:00PM):

	# Enforce prime-time walltime limits starting on Nov. 1st.
	# Remember to start enforcing max_walltime - pt_limit (6) hours
	# before primetime starts, so there are no long jobs leaked into
	# primetime.
	ENFORCE_PRIME_TIME	11/01/1999@02:00:00
	PRIME_TIME_WALLT_LIMIT	02:00:00

See also:
PRIME_TIME_WALLT_LIMIT (required)
PRIME_TIME_START (required)
PRIME_TIME_END (required)

Job Size-Based Walltime Limits

SMALL_JOB_MAX <integer>
Some sites may want a more complex walltime policy, with a different walltime limit depending upon the size (in nodes) of the job. A simple mechanism has been implemented to support this sort of policy, based on a concept of "small" and "large" jobs. The value of the SMALL_JOB_MAX option determines the cut-off size (in nodes) for a job to be considered a "small job" (or removes the distinction if the value is zero).

If SMALL_JOB_MAX is defined, then jobs are subject to the required options WALLT_LIMIT_SMALL_JOB and WALLT_LIMIT_LARGE_JOB. In addition, small jobs can be given a separate primetime walltime limit by specifying the options PRIME_TIME_SMALL_NODE_LIMIT and PRIME_TIME_SMALL_WALLT_LIMIT. These options are described in more detail below.

Note that "special" jobs are not subject to the small/large walltime distinction.

See also:
WALLT_LIMIT_SMALL_JOB (required)
WALLT_LIMIT_LARGE_JOB (required)

WALLT_LIMIT_SMALL_JOB <timespec>
WALLT_LIMIT_LARGE_JOB <timespec>
These options specify the maximum walltime limits for small and large jobs (as defined by the SMALL_JOB_MAX option). This allows the site to set a policy with different allowable walltimes for jobs of different sizes. A common policy is to allow small jobs to run for a shorter period of time than larger ones. In theory, this will encourage users to submit larger jobs.

Both values must be defined if the SMALL_JOB_MAX option is declared with a non-zero argument. Remember that the <timespec> is parsed from right to left, so "02:00" is 2 minutes, not 2 hours. Jobs that request more than the limit specified for their size will be rejected by the scheduler, with a notice to that effect being delivered to the user.

For example, to limit jobs of 16 nodes or fewer to no more than four hours, while allowing other jobs to run for the full 8 hours, you may specify:

	# Our definition of a "small job" is 16 nodes or less (not all
	# that small, really).  Small jobs get only 4 hours to run, all
	# other jobs get 8 hours max.  Primetime is 2 hours for everyone.
	SMALL_JOB_MAX		16
	WALLT_LIMIT_SMALL_JOB	04:00:00
	WALLT_LIMIT_LARGE_JOB	08:00:00
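The selection rule implied by these three options can be sketched in Python. The function name is hypothetical, and one detail is an assumption taken from the comment in the sample above: a job of exactly SMALL_JOB_MAX nodes is treated as small.

```python
def walltime_limit(nodes, small_job_max, small_limit, large_limit):
    """Return the walltime limit in seconds for a job of 'nodes' nodes.

    Returns None when SMALL_JOB_MAX is zero, meaning that the small/large
    distinction is off and only the queue limit applies.
    """
    if small_job_max == 0:
        return None
    # ASSUMPTION: the cut-off is inclusive ("16 nodes or less").
    return small_limit if nodes <= small_job_max else large_limit


FOUR_HOURS = 4 * 3600
EIGHT_HOURS = 8 * 3600
print(walltime_limit(16, 16, FOUR_HOURS, EIGHT_HOURS))
print(walltime_limit(17, 16, FOUR_HOURS, EIGHT_HOURS))
```

A job whose request exceeds the returned value is the kind the scheduler rejects with a notice to the user.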

See also:
SMALL_JOB_MAX (required)

PRIME_TIME_SMALL_NODE_LIMIT <integer>
Some sites may wish to institute a policy that specifies different walltime limits for small and large jobs during primetime. This could be used to discourage people from running very small jobs during primetime (if, for instance, a front-end is available for this use).

Primetime must be enforced (see the ENFORCE_PRIME_TIME option) in order for this option to have any effect. The PRIME_TIME_SMALL_WALLT_LIMIT must be defined along with this option.

This variable defines the maximum number of nodes that a job can request and still be considered a "small" job for the purposes of primetime. "Small jobs" (as defined by PRIME_TIME_SMALL_NODE_LIMIT) will be subject to the walltime limit declared by PRIME_TIME_SMALL_WALLT_LIMIT. Jobs larger than PRIME_TIME_SMALL_NODE_LIMIT nodes are subject to the normal primetime limits (given by PRIME_TIME_WALLT_LIMIT).

Note that the values of PRIME_TIME_SMALL_NODE_LIMIT and SMALL_JOB_MAX are independent of each other. To avoid confusion, it is recommended that they be set to the same value if they are used together. However, if the local policy dictates such, they may be given different values.

See also:
PRIME_TIME_SMALL_WALLT_LIMIT (required)
PRIME_TIME_WALLT_LIMIT (required)
ENFORCE_PRIME_TIME (required)

PRIME_TIME_SMALL_WALLT_LIMIT <timespec>
This option specifies the walltime limit for "small" jobs (as defined by PRIME_TIME_SMALL_NODE_LIMIT) during primetime. The walltime limit during primetime for a "normal" (i.e. bigger than the PRIME_TIME_SMALL_NODE_LIMIT) job will be constrained by the primetime limit PRIME_TIME_WALLT_LIMIT.

Remember that the <timespec> is parsed from right to left, so "02:00" is a two-minute limit, not a two-hour limit.

An example of this set of options is :

	# Start enforcing primetime on the 1st of November this year.
	ENFORCE_PRIME_TIME		11/01/1999@00:00:00
	# According to policy, jobs of 8 nodes or fewer only get one hour
	# during primetime, instead of the usual 2 hours.  This should
	# encourage users to scale their jobs.
	PRIME_TIME_WALLT_LIMIT		02:00:00
	PRIME_TIME_SMALL_NODE_LIMIT	8
	PRIME_TIME_SMALL_WALLT_LIMIT	01:00:00

See also:
PRIME_TIME_SMALL_NODE_LIMIT (required)
ENFORCE_PRIME_TIME (required)
PRIME_TIME_WALLT_LIMIT (required)

Dedicated Time Configuration and Support

ENFORCE_DEDICATED_TIME <boolean>
This option determines if dedicated times are being enforced. Dedicated times provide an easy mechanism for the administrator to grant access to an entire machine for a set of users. It may also be used to ensure that jobs are not allowed to run during a period in which the machine will be under maintenance or otherwise unavailable.

If dedicated time is being enforced, each job is compared against the list of upcoming dedicated times (see the description of DEDICATED_TIME_COMMAND for details) for each execution host (or for the "system" as a whole -- see the SYSTEM_NAME option). If the job would run over into a dedicated time on a machine, it will not be allowed to run. Thus, running jobs tend to drain off as the upcoming dedicated time approaches.
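The drain behavior follows from a single comparison, sketched here in Python. The function name is hypothetical, and the treatment of a job that would finish exactly when dedicated time starts (allowed here) is an assumption.

```python
from datetime import datetime, timedelta


def may_start(now, walltime, dedicated_starts):
    """False if a job started now would run over into the next dedicated time."""
    upcoming = [start for start in dedicated_starts if start > now]
    if not upcoming:
        return True  # nothing scheduled, so nothing to run into
    # ASSUMPTION: finishing exactly at the start of dedicated time is allowed.
    return now + walltime <= min(upcoming)


dedicated = [datetime(1999, 7, 7, 16, 0)]
now = datetime(1999, 7, 7, 13, 0)
print(may_start(now, timedelta(hours=2), dedicated))
print(may_start(now, timedelta(hours=4), dedicated))
```

As the dedicated time approaches, fewer and fewer walltime requests pass this test, which is why the machine empties out on its own.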

As soon as the system clock passes the start of the dedicated time for a machine, the scheduler switches to the dedicated queue defined for that host (if any) and runs any jobs found in that queue in FIFO order (with packing). If no jobs are enqueued on the proper dedicated time queue (or one is not defined), the host will be idled for the duration of the scheduled time.

If a current dedicated time is no longer needed, the scheduler may be returned to "normal" operation before the scheduled end of the period by temporarily disabling this option. It is recommended that the date/time form of the boolean argument be used, as shown below. A common form of failure occurs when an operator forgets to re-enable dedicated times, and it is only discovered when the system fails to drain for the next scheduled dedicated time.

For example, the following is the correct way to temporarily disable the enforcement of dedicated time (to return to "normal" operation). Assume the current dedicated time was scheduled to end at 5:00PM on 11/10/1999.

	# Dedicated time completed early.  Return to normal operation
	# until immediately after dedicated time was scheduled to complete.
	ENFORCE_DEDICATED_TIME		11/10/1999@17:00:01

The compile-time option SORT_DEDTIME_JOBS will cause the jobs in the queue to be sorted from largest to smallest size. This option is disabled by default. We have found that most users that run during dedicated times wish to control the order of execution for their jobs (by submitting them in order). For brevity, dedicated times are called "Outages" in the scheduler source code.

Special and outstanding jobs will not be scheduled to run over into dedicated time.

See also:
DEDICATED_TIME_COMMAND (required)
DEDICATED_TIME_CACHE_SECS
MAX_DEDICATED_JOBS
SYSTEM_NAME

DEDICATED_TIME_COMMAND <pathname>
This option specifies the pathname to an executable that is run by the scheduler to determine when a given host (or the entire batch system) will be in dedicated time. The scheduler was designed to interface with the home-grown system downtime "schedule" service in use at the NAS facility.

The specified command is invoked with the name of the batch system or host being queried as the only argument. Any output from the command is parsed, and an internal list of the upcoming dedicated times for each host is created in the scheduler. The results are cached for a short period of time to reduce the load on the server (see DEDICATED_TIME_CACHE_SECS).

This service is based around a simple client-server model that queries a database of scheduled outages for the various hosts and groups of hosts on-site at NAS. The scheduler's parser expects to receive data in the format returned by the NAS "schedule" service, but can be easily modified to accept other input.

A sample of the types of input expected from the DEDICATED_TIME_COMMAND is given below. The command should return the dedicated times for only the host specified as the argument, although the parser will cull out any entries that don't match the system requested.

    GRUMPY       07/07/1999 16:00-19:00 07/07/1999  Large job runs (berferd)
    HAPPY        07/09/1999 16:00-19:00 07/09/1999  Administrative period 
                                                    (preventive maintenance)
    SLEEPY       07/14/1999 16:00-19:00 07/14/1999  Install latest compiler.
    SLEEPY       07/28/1999 10:00-16:00 07/28/1999  Preventive maintenance
    SPORK        08/20/1999 09:00-09:00 08/21/1999  All day outage for power
                                                    supply and CPU board swap.

If there is no upcoming dedicated time for the specified host, the command should return the string:

"No scheduled downtime for the specified period."

The DEDICATED_TIME_COMMAND may be a binary executable or a shell script. An easy method for getting started with the scheduler is to write a simple shell script that just cat(1)'s the current "schedule" as a here-document. It is recommended that sites requiring more complicated mechanisms should consider writing their own interface to their system.
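
For sites that need to adapt the parser or write their own interface, the following Python sketch shows one way to read entries in the sample format shown above. It is not the scheduler's parser: the regular expression, the function name parse_schedule, and the case-insensitive host match are assumptions made for illustration. Description text and wrapped continuation lines are ignored, as is the "No scheduled downtime" message.

```python
import re
from datetime import datetime

# HOST  MM/DD/YYYY HH:MM-HH:MM MM/DD/YYYY  free-form description
ENTRY = re.compile(
    r"^\s*(\S+)\s+(\d{2}/\d{2}/\d{4})\s+(\d{2}:\d{2})-(\d{2}:\d{2})"
    r"\s+(\d{2}/\d{2}/\d{4})\s*(.*)$"
)


def parse_schedule(text, host):
    """Return a list of (start, end) datetimes for `host` found in `text`."""
    outages = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match is None:
            continue  # continuation line, blank line, or "No scheduled downtime..."
        name, start_date, start_time, end_time, end_date, _desc = match.groups()
        if name.lower() != host.lower():
            continue  # cull entries for other systems (case rule is an assumption)
        start = datetime.strptime(f"{start_date} {start_time}", "%m/%d/%Y %H:%M")
        end = datetime.strptime(f"{end_date} {end_time}", "%m/%d/%Y %H:%M")
        outages.append((start, end))
    return outages


SAMPLE = """\
    SLEEPY       07/14/1999 16:00-19:00 07/14/1999  Install latest compiler.
    SLEEPY       07/28/1999 10:00-16:00 07/28/1999  Preventive maintenance
    SPORK        08/20/1999 09:00-09:00 08/21/1999  All day outage for power
                                                    supply and CPU board swap.
"""

for start, end in parse_schedule(SAMPLE, "sleepy"):
    print(start.isoformat(), "to", end.isoformat())
```

Note that the first date on each line pairs with the start time and the second date with the end time, which is how the all-day SPORK entry spans two calendar days.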

DEDICATED_TIME_COMMAND is queried for each host listed with BATCH_QUEUES, DEDICATED_QUEUE, or EXTERNAL_QUEUES. If SYSTEM_NAME is defined, it will also be asked for dedicated times for the specified "system". For more information, see the description below of the SYSTEM_NAME option.

See also:
ENFORCE_DEDICATED_TIME (required)
DEDICATED_TIME_CACHE_SECS
MAX_DEDICATED_JOBS
SYSTEM_NAME

DEDICATED_TIME_CACHE_SECS <integer>
In order to reduce the load on the "schedule" server (see the description of the DEDICATED_TIME_COMMAND option), the results of the last good request for a host are cached within the scheduler. This cache also allows the scheduler to keep running reasonably if the "schedule" server is unavailable (as sometimes happens at NAS).

The DEDICATED_TIME_CACHE_SECS option specifies the amount of time (in seconds) that the host outage data should be kept cached. When the cache becomes stale, the scheduler will attempt to refresh it with the current data. Until the refresh is successful, the scheduler will continue to use the last-known upcoming dedicated times.

The cache can be disabled altogether by setting DEDICATED_TIME_CACHE_SECS to '0', and can be cleared by sending a SIGHUP to the scheduler.
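
The caching behavior described above (serve cached data while it is fresh, try to refresh once it is stale, and keep using the last good answer if the refresh fails) might be sketched in Python as follows. The class OutageCache, its fetch callback, and the injectable clock are hypothetical names used only for illustration; the scheduler's internal implementation may differ.

```python
import time


class OutageCache:
    """Per-host cache of outage lists (hypothetical helper, for illustration)."""

    def __init__(self, fetch, max_age_secs, clock=time.time):
        self.fetch = fetch                # callable(host) -> list; raises on failure
        self.max_age_secs = max_age_secs  # plays the role of DEDICATED_TIME_CACHE_SECS
        self.clock = clock
        self.entries = {}                 # host -> (time fetched, outages)

    def get(self, host):
        if self.max_age_secs <= 0:
            return self.fetch(host)       # a value of 0 disables caching
        now = self.clock()
        cached = self.entries.get(host)
        if cached is not None and now - cached[0] < self.max_age_secs:
            return cached[1]              # still fresh
        try:
            outages = self.fetch(host)
        except Exception:
            if cached is None:
                raise                     # nothing to fall back on
            return cached[1]              # stale, but the last good answer
        self.entries[host] = (now, outages)
        return outages

    def clear(self):
        """Forget all cached data, as a SIGHUP does in the scheduler."""
        self.entries.clear()
```

A failed refresh does not update the stored timestamp, so the next call tries the server again rather than waiting out another full cache period.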

See also:
ENFORCE_DEDICATED_TIME (required)

SYSTEM_NAME <string>
The SYSTEM_NAME feature allows an entire set of hosts to be simultaneously scheduled for dedicated time. This can be very useful when scheduling a large collection of hosts, each of which may have its own individual dedicated times along with a "system" dedicated time. It can also be used to key the other hosts to a machine that provides required services, e.g. an NFS server or the PBS server.

If this option is set to a string or hostname, the scheduler will use the DEDICATED_TIME_COMMAND to request dedicated time information for the named "system". These dedicated times are then merged into the other dedicated times, resulting in a "system" dedicated time for each host being scheduled. The merge is performed so that any overlap between dedicated times is resolved in favor of the system.

For example:

	# Use 'schedule' to get dedicated times for the hosts in the cluster.
	ENFORCE_DEDICATED_TIME		True
	DEDICATED_TIME_COMMAND		/usr/local/bin/schedule
	# "cluster" covers all of the hosts.  Dedicated times for "cluster"
	# will cause all hosts in the cluster to be in dedicated time.
	SYSTEM_NAME			cluster

See also:
ENFORCE_DEDICATED_TIME (required)
DEDICATED_TIME_COMMAND

System Draining and Early Non-Primetime Startup

NONPRIME_DRAIN_SYS <boolean>
If you wish to drain the system just prior to starting non-primetime, set this option to true. Any normal job that would otherwise be runnable across the primetime/non-primetime transition will not be allowed to run.

Enabling this option will reduce utilization somewhat, as there will be a short period just before non-primetime in which no useful job can be run. However, draining the system periodically has been shown to be a good way to recover from the fragmentation that inevitably occurs when scheduling these machines node-by-node. It will also ensure that at least one very large job will be runnable at the beginning of non-primetime every night.

If your site regularly sees a long period of idle time on the machine before non-primetime starts, this can be addressed by specifying an early start for non-primetime. See the descriptions of the NP_DRAIN_BACKTIME and NP_DRAIN_IDLETIME options below for more information.

Special and outstanding jobs are not subject to this constraint. If one or more special or outstanding jobs is scheduled to cross the transition, then this restriction will be lifted. This is because the system cannot be drained in this case, so restricting other jobs is futile.

See also:
ENFORCE_PRIME_TIME (required)
NP_DRAIN_BACKTIME
NP_DRAIN_IDLETIME

NP_DRAIN_BACKTIME <timespec>
NP_DRAIN_IDLETIME <timespec>
In many cases, the systems will go idle for an hour or more each night if the NONPRIME_DRAIN_SYS feature is enabled. This idle time is spent waiting for some "last-minute" job to slip in and run within the few remaining minutes of primetime. While this does happen (some users find this idle time useful to submit very short large test jobs), it is fairly uncommon to see any jobs taking advantage of the gap. Non-primetime "early start" is an attempt to decrease the wasted cycles when jobs are not using primetime.

The NP_DRAIN_BACKTIME option specifies the maximum amount of time that non-primetime can be pulled back into primetime. NP_DRAIN_IDLETIME specifies the minimum time a queue must have been idle before the early start of non-primetime will be applied. If the queue has been idle long enough, and it is now close enough to the start of non-primetime, primetime enforcement for this queue will be temporarily disabled.

An example may help explain this idea. Assume the following configuration options are specified:

	# Enforce primetime from 8:00AM through 6:00PM, draining the system
	# before starting non-primetime.
	ENFORCE_PRIME_TIME		True
	PRIME_TIME_START		08:00:00	# 8:00AM
	PRIME_TIME_END			18:00:00	# 6:00PM
	NONPRIME_DRAIN_SYS		True
	#
	# If it is within 1/2 hour of non-primetime, and a queue has been 
	# idle for more than 15 minutes, assume no last-minute jobs will
	# be submitted and start non-primetime early.
	NP_DRAIN_BACKTIME		00:30:00	# up to 1/2 hour early
	NP_DRAIN_IDLETIME		00:15:00	# must be idle 15 min

Once it is after 5:30PM, and an execution queue has been idle for more than 15 minutes, prime-time enforcement will no longer be enabled for that queue. This will allow a long-walltime job to start executing early, taking up the cycles that would probably have been wasted otherwise.
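
The decision in this example can be written out as a small Python sketch. It is an illustration under stated assumptions, not the scheduler's code: the function name is hypothetical, the constants simply mirror the configuration above, weekends and holidays are ignored, and the handling of the exact 5:30PM boundary is a guess.

```python
from datetime import datetime, time, timedelta

PRIME_TIME_START = time(8, 0)
PRIME_TIME_END = time(18, 0)
NP_DRAIN_BACKTIME = timedelta(minutes=30)
NP_DRAIN_IDLETIME = timedelta(minutes=15)


def primetime_enforced(now, queue_idle_for):
    """Return True if primetime limits still apply to a queue that has
    been idle for `queue_idle_for` at wall-clock time `now`."""
    start = datetime.combine(now.date(), PRIME_TIME_START)
    end = datetime.combine(now.date(), PRIME_TIME_END)
    if not (start <= now < end):
        return False  # it is already non-primetime
    close_enough = end - now <= NP_DRAIN_BACKTIME
    idle_enough = queue_idle_for > NP_DRAIN_IDLETIME
    return not (close_enough and idle_enough)


day = datetime(1999, 11, 10)
print(primetime_enforced(day.replace(hour=17, minute=20), timedelta(minutes=40)))  # True
print(primetime_enforced(day.replace(hour=17, minute=40), timedelta(minutes=20)))  # False
print(primetime_enforced(day.replace(hour=17, minute=40), timedelta(minutes=10)))  # True
```

In the first call the queue has been idle long enough, but it is still more than half an hour before 6:00PM; in the third it is late enough, but the queue has not been idle for 15 minutes.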

Both NP_DRAIN_BACKTIME and NP_DRAIN_IDLETIME must be specified in the configuration file. Sending a SIGHUP to or restarting the scheduler will reset the idle timers on each queue.

Note that, while this option will cause the primetime enforcement to be disabled, it does not cause the jobs to be re-sorted into non-primetime order. This may cause somewhat surprising results as the scheduler may not choose the largest, longest job to be run first. This is a known bug in the implementation, but in practice appears not to be a show-stopper.

See also:
NONPRIME_DRAIN_SYS (required)
ENFORCE_PRIME_TIME (required)

Host Resource Usage Limits:

[ Most of these options were implemented when the Origin2000 was still being scheduled as time-share instead of space-share resources. While they are no longer usable for their intended purpose, they may come in handy for some rare situations. ]

MAX_JOBS <integer>     [deprecated]
MIN_JOBS <integer>     [deprecated]
These options were used by a very early implementation of the scheduler. They may still be used to specify a minimum and/or maximum number of jobs that may be running simultaneously on any machine (as opposed to the queue-based "max_running" resource). Specifying a minimum job count is basically meaningless at this point, but the maximum might be useful in some rare instances.

One possible (but not very probable) use for this option would be to limit the job count due to some resource issue. An example configuration might specify:

	# Only allow two jobs to run on any host.  Our latest changes
	# to the OS have made it unstable with more than two running jobs.
	# This is only for the weekend until we have time to fix it and
	# reboot the systems.
	MAX_JOBS		2

If you wish to limit the number of jobs that can be run in a specific queue, you may do so by setting the "max_running" resource on the queue with qmgr(1). This gives much finer-grained control, and with one queue per host, can be used to control the maximum number of jobs running on a host.

Note that the scheduler will not oversubscribe the execution host, regardless of the values specified for minimum or maximum running job counts.

TARGET_LOAD_PCT <integer>[%]     [deprecated]
TARGET_LOAD_VARIANCE <variance>     [deprecated]
A very early implementation of the Origin scheduler attempted to schedule jobs based upon the load average of the machine. The scheduler attempted to maintain the load around the TARGET_LOAD_PCT, within the tolerances set by the +/- specifications of the TARGET_LOAD_VARIANCE. By default, the target load is 90%, with a -15%/+10% variance.

This set of options is more-or-less useless at this point (as the Origin is now scheduled node-by-node, rather than by load average). However, the scheduler will not schedule jobs on a host whose load average (as reported by the PBS mom) is higher than the sum of the target load and positive variance. This may prove useful to prevent new jobs being scheduled in the case of a runaway job or unauthorized use of the system:

	# Don't schedule hosts with a higher-than-expected load average.
	TARGET_LOAD_PCT		90%
	TARGET_LOAD_VARIANCE	-90%,+10%	# From 0-100% is okay.
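
The remaining effect of these options is a single comparison, sketched below in Python. The function name is hypothetical, and expressing the load average reported by the PBS mom as a percentage is an assumption; the default arguments mirror the example configuration above.

```python
def host_too_loaded(load_pct, target_pct=90, plus_variance_pct=10):
    """Return True if no new job should be scheduled on the host because
    its load exceeds the target load plus the positive variance."""
    return load_pct > target_pct + plus_variance_pct


print(host_too_loaded(95))   # False: 95 does not exceed 90 + 10
print(host_too_loaded(130))  # True: a runaway job has pushed the load past 100
```

Only the positive variance matters for this check, which is why the negative variance in the example can be set as wide as -90%.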

Per-Group System Time Allocations Support

ENFORCE_ALLOCATION <boolean>
As most supercomputing sites have more users than cycles, it is usually necessary to allocate only some portion of the total available compute resources to each researcher or group. At NAS, each project is given a new UNIX gid at the beginning of the operational year. Each group is then provided with an allocation for the year, measured in node-hours.

The size of the allocation for any given group is usually determined by the expected computing needs of that group, balanced by the financial contribution towards the cost of the machine made by the group's funding source(s). Choosing allocations for a given set of groups is a difficult process, which is outside of the scope of this document.

As users run jobs on the system, the node-hours consumed by their jobs are charged against their group's allocation. When the remaining allocation for their group reaches zero, they must request and be granted more time from their program manager before they will be allowed to run any more jobs.

If ENFORCE_ALLOCATION is enabled, the scheduler will enforce the allocations for groups. Each PBS job is tagged with the default UNIX group of the submitter. On each iteration, the scheduler looks up the current node-hour usage for that group. If the group has exceeded their allocated usage, the job will be rejected. A message is sent to the user informing them of the current and allocated usage for the group.

For more details on how the "current usage" of a group is maintained, see the description of the SCHED_ACCT_DIR option. The SCHED_ACCT_DIR option is required -- the scheduler uses files from this directory to enforce the allocations.

An example of configuring allocations support:

	# Starting with FY'99, our site will be charging for access to
	# our machines.  Need to start tracking allocations at that time.
	ENFORCE_ALLOCATION		11/01/1998
	SCHED_ACCT_DIR			/PBS/server_priv/pbs_acctdir

See also:
SCHED_ACCT_DIR (required)

SCHED_ACCT_DIR <pathname>
This option specifies the path to a directory containing the allocations accounting files. The given path must be a directory, and must contain at least the files "allocations" and "current". ENFORCE_ALLOCATION must also be specified with this option.

The "allocations" and "current" files are maintained by an accounting package called ACCT++, which was developed by NAS. The managers of each high-level project are responsible for generating the lists of groups, users within those groups, their responsibilities, and the allocation granted to the group. ACCT++ distributes this list of allocations, and maintains the database of "current usage" for each user and group.

The scheduler watches the timestamp on both files, and re-reads the contents of a file whenever the timestamp is touched. Sending a SIGHUP to or restarting the scheduler will cause both files to be re-read as well.

The "allocations" file:
The "allocations" file is a flat text file containing a list of allocation records, one per line, formatted like this (the leading 'P= ' is required for historical reasons):

	P= <title/name of research> = <UNIX groupname> = <allocation>

For instance, the following record might be used for staff users, who are typically given an unlimited allocation for debugging, testing, etc.:

	P= Systems Support Staff = staff = -1.0

An allocation for a set of researchers doing numerical analysis might be:

	P= Numerical Analysis Research Group = g10003 = 1500.0

In this case, the research group has been assigned the gid "g10003", and has been granted an allocation of 1500 node-hours for the operational year.

The following special values are defined for the node-hour field:
    Allocation   Interpretation of Special Value
    -1.0         Unlimited allocation. Users in this group are not subject
                 to any allocation.
    0.0          NO allocation. Users in this group have been allocated no
                 node-hours for this operational period. Any job submitted
                 by this group will be immediately rejected.
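
As an illustration of the record format and the two special values, the following Python sketch parses "allocations" records and decides whether a group may still run jobs. It is not the scheduler's implementation: the function names are hypothetical, and both the treatment of unlisted groups (no allocation) and the rejection of a group that has used exactly its allocation are assumptions.

```python
def parse_allocations(text):
    """Map each UNIX group name to its allocation in node-hours."""
    allocations = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("P="):
            continue  # every record carries the leading 'P='
        # Split from the right so a title containing '=' does not confuse us.
        fields = line[2:].rsplit("=", 2)
        if len(fields) != 3:
            continue
        _title, group, amount = (field.strip() for field in fields)
        allocations[group] = float(amount)
    return allocations


def may_run(group, used_node_hours, allocations):
    """Apply the special values: -1.0 is unlimited, 0.0 rejects everything."""
    allocation = allocations.get(group, 0.0)  # unlisted group: assumed none
    if allocation < 0:
        return True
    return used_node_hours < allocation


SAMPLE = """\
P= Systems Support Staff = staff = -1.0
P= Numerical Analysis Research Group = g10003 = 1500.0
"""
allocations = parse_allocations(SAMPLE)
print(may_run("staff", 99999.0, allocations))   # True
print(may_run("g10003", 1500.0, allocations))   # False
```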

At NAS, each project is assigned a new group ID at the beginning of the operational year. The base gid for the group is assigned to the number "##00" (where ## is the number of years since 1990). For instance, for 1999, the groups were numbered starting at 901, while the groups for the 2000 operational period were numbered from 10001. The groups are named in the /etc/group file as 'g<number>', so gid 10601 would be "g10601". Any gid less than the current year's base gid is given a '0.0' allocation by the accounting software, preventing them from running jobs.

A notable potential "gotcha" is the fact that the group allocations table is a fixed size (currently 1024 entries). This is a known issue and will be fixed in a future release.

The "current" file:
A simple database of the recent resource usage of a group is maintained in the file "$SCHED_ACCT_DIR/current" by the ACCT++ accounting system. This file consists of usage records for each user and group, one per line, in the following format:

	<username> <groupname> <jobs_run> <nodehours_used>

There is a record for every username/groupname tuple that has executed a job on the system since the start of the operational period. For example, John Smith and Gilbert Held, both in group "g10003", have been running jobs.

Gilbert has run 1 job that used 32 nodes for almost an hour. John has been running several 8-node jobs that each run about 1-1/2 hours. The current file would contain something like the following:

	smithj g10003 15 203.587
	gilbert g10003 1 31.192

The records in the file are parsed, and the total node-hour usage for each listed group is computed. Any group that is not listed is assumed to have used no node-hours during this operational period.
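
The per-group totals can be computed with a few lines of Python. This is a minimal sketch of the summation described above, using a hypothetical function name; malformed lines are simply skipped, which is an assumption about error handling.

```python
from collections import defaultdict


def group_usage(text):
    """Sum the node-hours in the "current" file for each group."""
    totals = defaultdict(float)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # not a <username> <groupname> <jobs_run> <nodehours_used> record
        _user, group, _jobs_run, node_hours = fields
        totals[group] += float(node_hours)
    return dict(totals)


SAMPLE = """\
smithj g10003 15 203.587
gilbert g10003 1 31.192
"""
usage = group_usage(SAMPLE)
print(f"{usage['g10003']:.3f}")  # 234.779
```

A group that does not appear in the result has used no node-hours, so a lookup such as usage.get(group, 0.0) matches the rule stated above.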

Tracking allocations and resource usage:
Every time a job is created, changes state, or is moved to a different queue, an entry is logged in the PBS accounting logs. The accounting logs are located in the directory $PBS_SERVER_HOME/server_priv/accounting, one file for each day. These logs appear only on the server host, and allow an accounting system to track jobs and their resource usages.

Of particular interest is the "E" record, logged when a running job has exited. This record contains the user and group that submitted the job, a list of the resources that it requested, and a list of the resources the job actually used. From this, it is possible to determine the exact number of node hours used by the job (by multiplying the value recorded in "resources_used.walltime" by the "Resource_List.ssinodes"). The ACCT++ software uses this data to track the actual node-hour usage, number of jobs run, and other statistics for each group and user.
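
The node-hour calculation is a single multiplication once the walltime has been converted to hours. The Python sketch below assumes the attributes of an "E" record have already been split into a dictionary of strings and that the walltime is in HH:MM:SS form; the function names are hypothetical and this is not how ACCT++ itself is implemented.

```python
def walltime_to_hours(walltime):
    """Convert an HH:MM:SS string to hours (the hours field may exceed 24)."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours + minutes / 60 + seconds / 3600


def node_hours(attrs):
    """Node-hours charged for one job: walltime actually used times nodes held."""
    used = walltime_to_hours(attrs["resources_used.walltime"])
    return used * int(attrs["Resource_List.ssinodes"])


# Hypothetical record: an 8-node job that ran for an hour and a half.
record = {"resources_used.walltime": "01:30:00", "Resource_List.ssinodes": "8"}
print(node_hours(record))  # 12.0
```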

At NAS, the "current" file is not exactly current -- it is only updated periodically by the ACCT++ accounting software. Between updates, the scheduler fills in the gaps by adding a job's requested node-hours to the "last known" value from the accounting software. For groups that are very close to their allocation limit and often submit jobs that request more walltime than necessary, this can cause the scheduler to incorrectly reject a job due to an artificially high usage. At the next update, however, the scheduler's current usage will be updated with the actual node-hour usage, and the group may continue to submit jobs.

This behavior is rarely a serious problem, and could be considered to be a method of encouraging users to estimate their walltime requirements as closely as possible. Choosing a good walltime is a difficult problem for the user. A job that requests far more time than it needs may prevent another user's job from running, even though it later turns out that the other job could have run in the remaining time. Running longer jobs than necessary may lower the user's scheduling priority (see the description of the SORT_BY_PAST_USAGE option). However, specifying too short of a walltime will cause PBS to terminate the job before it completes.

See also:
ENFORCE_ALLOCATION (required)

Increasing Fairness In Resource Usage

SORT_BY_PAST_USAGE <boolean>
Any large site is bound to have at least one "power user" who tries to monopolize the resources of the machine. Usually this means that other users will not be able to get their work done in a timely manner. This situation tends to be sticky from an administrative point of view, since the users argue (rightly) that it's not their problem if the scheduler chooses their jobs more often than another user's.

There are a number of ways users abuse the system -- for instance, by submitting a large number of jobs of different sizes, thereby increasing their chances of a job fitting into a "hole" left by other jobs. Other techniques we have seen include the use of self-submitting scripts that submit a new copy of themselves just before terminating. The new job "just happens" to be a perfect fit for the hole left by the job that just completed.

The obvious solution to this problem is to track each user's recent usage of the machine, and lower their scheduling priority based upon their accumulated usage. The more node-hours used by a specific user, the less likely it is that another of their jobs will be run. This algorithm allows the scheduler to provide more fair access to the limited resources of the machine. It is similar to the priority-based process scheduler used in most UNIX implementations.

When the SORT_BY_PAST_USAGE option is enabled, the list of jobs being scheduled is permuted by a complex iterative process. This permutation walks through the list of jobs, looking for the job owned by the user with the least recent usage. It then shuffles that job to the top of the list and adds the resources requested to the owner's usage (the usage is tracked as if each job were run immediately). The algorithm then finds the new user with the least recent usage, and continues.

The result of this permutation is a new list of jobs re-ordered into a more fair schedule, with the jobs owned by the least-active users placed at the top of the list. These jobs are then packed in a FIFO order, with backfilling to improve utilization. This algorithm, while not perfect, does provide any given user a chance to use the system, and will favor those who have not recently used the system.
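
The permutation can be sketched in Python as a repeated selection. This is an illustration of the idea only: the function name and the job tuples are hypothetical, and charging each job's requested node-hours (nodes multiplied by walltime) to its owner is an assumption about what "resources requested" means here.

```python
def permute_by_past_usage(jobs, recent_usage):
    """Reorder `jobs`, a list of (job_id, owner, requested_node_hours) in
    queue order, so owners with the least recent usage are served first."""
    usage = dict(recent_usage)   # work on a copy; charges here are provisional
    remaining = list(jobs)
    ordered = []
    while remaining:
        # min() returns the first minimum, so ties keep their queue order.
        best = min(remaining, key=lambda job: usage.get(job[1], 0.0))
        remaining.remove(best)
        ordered.append(best)
        # Charge the owner as if the job had been run immediately.
        usage[best[1]] = usage.get(best[1], 0.0) + best[2]
    return ordered


jobs = [("1", "alice", 64.0), ("2", "bob", 128.0),
        ("3", "bob", 128.0), ("4", "carol", 32.0)]
recent = {"alice": 200.0, "bob": 0.0, "carol": 50.0}
print([job[0] for job in permute_by_past_usage(jobs, recent)])  # ['2', '4', '3', '1']
```

Bob's first job goes to the top because he has no recent usage, but the provisional charge for it lets carol's job move ahead of his second one.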

The statistics of recent usage are periodically written to the file $PBS_SERVER_HOME/sched_priv/decay_usage, and recovered when the scheduler is initialized. Every 24 hours, the recent-usage figures are "decayed" -- the current values are each multiplied by a decay factor, and re-written to disk. See the description of the options DECAY_FACTOR and OA_DECAY_FACTOR for more information.

	# Permute the list of jobs based on recent past usage of the machine
	# by each user.  The statistics for each user will be written hourly
	# to the file $PBS_SERVER_HOME/sched_priv/decay_usage.
	# Every 24 hours, the usage statistics are decayed by the factors
	# in DECAY_FACTOR and OA_DECAY_FACTOR.
	SORT_BY_PAST_USAGE		True

Note that the permuted list of jobs often appears to be more-or-less random to the casual observer. This makes user support somewhat more difficult, as it is often hard to determine why a particular job was chosen over an equally eligible (maybe even "better") job. Usually, the explanation lies in the relative recent usage of the users in question.

DECAY_FACTOR <real>
OA_DECAY_FACTOR <real>
These options control how quickly each user's recent usage of the machine is decayed away. Every 24 hours, each entry in the usage database is multiplied by the DECAY_FACTOR (or OA_DECAY_FACTOR if the user's group is over their allocation). This simple algorithm implements an exponential decay, not unlike that of a UNIX load average (in that the usage can be raised instantaneously, but is reduced exponentially).

The scheduler maintains the usage statistics internally. Every hour, it writes the statistics to the file $PBS_SERVER_HOME/sched_priv/decay_usage. Consulting this file can be helpful when debugging the sorting algorithms.

Values specified for these options are floating point numbers, and should be under 1.0. The lower the number, the more quickly the past usage statistics will fall to zero. A possible exception might be to make the OA_DECAY_FACTOR 1.0 or larger -- this will cause users whose groups are over their allocation to be at a continuous disadvantage in the sorting process.

By default, the values of DECAY_FACTOR and OA_DECAY_FACTOR are 0.75 and 0.95, respectively. If you wish to be more or less lenient, you may specify different values in the configuration file:

	# Forget recent usage fairly quickly for users that are still under
	# their allocation, but don't be so generous with those who have
	# used more than their share.
	DECAY_FACTOR            0.5     # Cut usage in half each day.
	OA_DECAY_FACTOR         0.95    # Make them wait it out.
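
To get a feel for how quickly usage fades under a given factor, the daily multiplication can be worked through in Python. The function names and the dictionary of per-user usage are hypothetical; the default arguments are the default factors quoted above.

```python
def decay_once(usage_by_user, over_allocation, decay=0.75, oa_decay=0.95):
    """Apply one 24-hour decay step. Users listed in `over_allocation`
    (their group has exceeded its allocation) get the slower factor."""
    return {
        user: usage * (oa_decay if user in over_allocation else decay)
        for user, usage in usage_by_user.items()
    }


def decayed_usage(usage, factor, days):
    """Usage remaining after `days` daily steps with no new jobs."""
    return usage * factor ** days


print(decayed_usage(100.0, 0.5, 3))   # 12.5
print(decayed_usage(100.0, 0.75, 3))  # 42.1875
```

With a factor of 0.5, 100 node-hours of recent usage shrinks to 12.5 after three idle days; with the default of 0.75, about 42 remain.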

Note that the scheduler must be running to decay the recent usage database, or the usage for all users may be higher than expected. Since the scheduler makes decisions based upon relative usage values, not the absolute numbers, this should have little impact on operations.

See also:
SORT_BY_PAST_USAGE (required)

Long-Waiting ("outstanding") Job Support

MAX_QUEUED_TIME <timespec>
Due to the various policy and fairness algorithms (primetime, dedicated times, draining for non-primetime, "special" jobs, sorting by past usage, etc, etc), it is often the case that some jobs simply cannot be run for a long period of time. This is especially true when a large number of users are attempting to run jobs on a small machine. Unfortunately, the offered workload at most sites commonly exceeds the capacity of the available resources.

This job "starvation" can continue indefinitely if the scheduler does not actively correct it. The mechanism implemented in the Origin scheduler (and previously in the IBM SP2 scheduler) is to place a limit on the longest "reasonable" waiting time for a job. This limit on waiting time may be specified by the MAX_QUEUED_TIME option.

Any job that has been queued in the submit queue for more than this time will be given a very high priority in the scheduler. The scheduler will ignore most policy-based restrictions (e.g. primetime) for outstanding jobs, even giving them higher priority than "special" jobs. Large jobs that have waited too long may even cause the system to be drained in order to free up the resources necessary to run them.

Note that this does not mean that the job will run immediately after the MAX_QUEUED_TIME has elapsed, only that the scheduler will arrange for it to run as soon as possible after it becomes outstanding.

For example, in order to "boost" jobs that have been waiting in the submit queue for over 2 days (not uncommon at large sites), add the following to the configuration file:

	# Go out of our way to run any job that has waited in the queue
	# for more than two days.  The time period may need to be adjusted
	# depending upon changes in the offered load.
	MAX_QUEUED_TIME        48:00:00        # 48 hours maximum wait

Unfortunately, this "heroism" on the behalf of the waiting job can cause poor utilization of the machine, especially if draining is necessary. A more serious problem can arise if MAX_QUEUED_TIME is too short relative to the offered load and typical turnaround time. If waiting jobs become commonplace, the scheduler will begin to thrash attempting to make each one its highest priority. This is a strong indication that some part of your policy is at odds with the offered load.

Common causes of thrashing due to outstanding jobs are over-use of the special queue (causing "normal" jobs to be starved), and long dedicated times or outages (which prevent any jobs from running). The option may be temporarily disabled to prevent the scheduler from thrashing in these cases.

Additionally, early PBS servers allowed users to exploit this mechanism by placing a job on hold for hours or days. When the user later released the hold, the job instantly became the highest priority. PBS 2.x servers now reset the job's 'etime' attribute when the hold is released, so it appears to be newly enqueued. A patch has been released that addresses this problem, and should be easily portable to 1.1.x PBS servers.

See also:
INTERACTIVE_LONG_WAIT

INTERACTIVE_LONG_WAIT <timespec>
As noted in the description of MAX_QUEUED_TIME, it is common for jobs on busy systems to be enqueued for several hours, even days, before being executed. For batch jobs, this level of delay is tolerable. However, if the job is interactive, making the person sitting at the terminal wait this long is unreasonable. By specifying an INTERACTIVE_LONG_WAIT time, the administrator may attempt to mitigate this problem.

During primetime, if a job has been waiting for more than the sum of its requested walltime and INTERACTIVE_LONG_WAIT, the scheduler will mark it as "outstanding" (see MAX_QUEUED_TIME). This will make the scheduler arrange for the now-overdue interactive job to be started as soon as possible.

The maximum wait time was made dependent upon the requested walltime in order to favor users who are just trying to run a very quick test case. These jobs are typically short -- only a minute or so is usually required to see if a long-running batch version of the job will crash immediately. Our observations at NAS indicate that a time around 30 minutes to an hour works well for this option.

To enable this functionality, add the following to the configuration file:

	# Provide a priority boost for interactive jobs during primetime.
	# Interactive jobs waiting more than their requested walltime plus
	# a half hour will be made outstanding.
	INTERACTIVE_LONG_WAIT		00:30:00
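
The two ways a job can become outstanding can be combined into one check, sketched here in Python. The function and parameter names are hypothetical, and passing None for an option that is not configured is an assumption of this sketch rather than a description of the scheduler's internals.

```python
from datetime import timedelta


def is_outstanding(waited, walltime, interactive, in_primetime,
                   max_queued_time=None, interactive_long_wait=None):
    """Return True if a queued job should be marked as outstanding."""
    # MAX_QUEUED_TIME applies to every job, at any time of day.
    if max_queued_time is not None and waited > max_queued_time:
        return True
    # INTERACTIVE_LONG_WAIT applies only to interactive jobs during primetime,
    # and scales with the walltime the job asked for.
    if (interactive and in_primetime and interactive_long_wait is not None
            and waited > walltime + interactive_long_wait):
        return True
    return False


half_hour = timedelta(minutes=30)
test_run = timedelta(minutes=10)  # a quick interactive test case
print(is_outstanding(timedelta(minutes=45), test_run, True, True,
                     interactive_long_wait=half_hour))  # True
print(is_outstanding(timedelta(minutes=35), test_run, True, True,
                     interactive_long_wait=half_hour))  # False
```

A 10-minute interactive job becomes outstanding after 40 minutes in the queue, while an interactive job requesting 4 hours would have to wait 4-1/2 hours, which is how short test cases are favored.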

Note that primetime must be active for this option to have any effect. Also note that an interactive job that is made outstanding just before the end of primetime will be allowed to override the NONPRIME_DRAIN_SYS option, and will run across the primetime/non-primetime boundary.

See also:
MAX_QUEUED_TIME
ENFORCE_PRIME_TIME (required)

Sorted Job Dump Files

SORTED_JOB_DUMPFILE <pathname>
This option specifies the path to a file into which the list of queued jobs and other information is dumped. This file is re-written on each iteration of the scheduler, and contains the current date, the policy in effect (e.g. primetime), and the sorted list of jobs. The jobs are listed in the order in which the scheduler would choose to run them (if the resources were available).

In addition to the job IDs, the dumpfile also lists the requested nodes and walltime for each job, the job's owner and group, and the amount of time the job has been queued and eligible. It also lists various flags from the scheduler's internal representation of the job.

	# Dump the sorted job list into a file in the sched_priv directory.
	SORTED_JOB_DUMPFILE		/PBS/sched_priv/sortedjobs

Note that this file will be created with owner/group root and readable only by the owner, if it does not exist. If the file exists, the permissions are not changed -- the contents are merely overwritten with each iteration. The administrator should decide if the sorted job file should be world-readable or not.

The possible flags, and their meanings are listed below:
    Flag   Meaning/Effect of Flag On Job
    Int    Job has PBS 'interactive' resource set to true.
    High   Job is queued in the SPECIAL_QUEUE, and has high priority.
    Wait   Job is marked as outstanding (MAX_QUEUED_TIME or
           INTERACTIVE_LONG_WAIT).
    Ded    Job is queued in one of the DEDICATED_QUEUES.
    HPM    Job has requested access to the HPM counters.
    Run    Job requests that it be run only on a specific host.

See also:
SPECIAL_QUEUE
DEDICATED_QUEUES
MANAGE_HPM_COUNTERS
MAX_QUEUED_TIME
INTERACTIVE_LONG_WAIT

Managing HPM/Perfex Performance Counters

MANAGE_HPM_COUNTERS <boolean>
One of the requirements for system analysts at many sites is that they perform statistical analysis of the performance of the machines. This is understandable -- large computers are very expensive, and their purchasers wish to know that they are getting the promised performance from their machines.

IRIX provides an interface to the on-chip event counters on the MIPS CPUs used in the Origin2000. These counters can be set up to count, among other things, the number of clock cycles that have elapsed, and the integer and floating point instructions that have been executed. From these numbers, a FLOPS and MIPS rating for the entire machine can be constructed. These can then be used to determine if the machine is operating as promised, and how efficiently it is being utilized as a resource. See the manpage for r10k_counters for details on the hardware performance counters.

While the global view is interesting and important, many users wish to use these counters to help optimize the performance of their applications. The operating system can also be told to monitor the execution of just the user application (ignoring global events). Applications like perfex(1) and SpeedShop (cf. ssrun(1)) use these counters to make their performance evaluations.

Unfortunately, the counters cannot be operated in both "global" and "user" modes at the same time. So that global statistics can be collected except while a user needs the counters in "user" mode, the scheduler can ask the PBS mom to place the counters into user mode for a job that requests them, then return them to system-wide monitoring when all of the jobs that use them have completed.

The scheduler manages the HPM counters on the execution hosts by querying the 'hpm_ctl' resource on the PBS mom. In the PBS mom's configuration file, there should be an entry like this:

	hpm_ctl		!/usr/local/pbs/sbin/hpm_ctl %mode

The 'hpm_ctl' program should expect a single argument, and return one of the following responses:
%mode argument    Effect on HPM counters on execution host
query             Prints either "user" or "global", depending upon state of counters. Outputs "???" if unable to determine the state.
user              Attempts to set the counters to user mode. Outputs "OKAY" or "FAILED", depending upon success.
global            Attempts to set the counters to global mode. Outputs "OKAY" or "FAILED", depending upon success.
revoke            Attempts to revoke the counters by killing processes identified as holding a lock on the counters. This action is very expensive and potentially dangerous - use it carefully. Requires the 'ecfind' script supplied by SGI (which it uses to grovel in the kernel).
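
As a rough illustration of this protocol, the following Python sketch maps each argument to its response. It is not the hpm_ctl script used at NAS. The FakeCounters class and its state(), set_mode() and revoke() methods are hypothetical stand-ins for site-specific code that talks to the counters, and the responses shown for "revoke" and for an unrecognized argument are assumptions, since only the query, user and global responses are specified above.

```python
class FakeCounters:
    """Hypothetical in-memory stand-in for the real hardware counters."""

    def __init__(self, mode="global", locked=False):
        self.mode = mode        # "user", "global", or None if unknown
        self.locked = locked    # True while some process holds a reference

    def state(self):
        return self.mode

    def set_mode(self, mode):
        if self.locked:
            return False        # cannot switch while a reference is held
        self.mode = mode
        return True

    def revoke(self):
        self.locked = False     # a real script would kill the holders here
        return True


def hpm_ctl(mode, counters):
    """Return the response string for a single %mode argument."""
    if mode == "query":
        state = counters.state()
        return state if state in ("user", "global") else "???"
    if mode in ("user", "global"):
        return "OKAY" if counters.set_mode(mode) else "FAILED"
    if mode == "revoke":
        # Assumption: revoke reports success the same way user/global do.
        return "OKAY" if counters.revoke() else "FAILED"
    # Assumption: anything else is treated as a failure.
    return "FAILED"
```

A real script would print the returned string on standard output so that the PBS mom can pass it back to the scheduler.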

The hpm_ctl script used by NAS records the state of the counters before switching in and out of user mode. This allows system analysts to construct a continuous profile of the performance of the machine, with occasional blank spots where the counters were in "user" mode.

To enable the HPM support, set the MANAGE_HPM_COUNTERS option to True in the configuration file:

	# By default, the HPM counters run in global-monitoring mode.  In
	# order for user jobs to access them, they must be set to "user"
	# mode.  The scheduler manages the counters for jobs that request
	# the counters with the '-l hpm=1' job resource request.
	MANAGE_HPM_COUNTERS		True

Jobs that wish to use the counters (or the utilities that rely on them) must specify this as a requested resource. The job must be submitted with the '-l hpm=1' flag to qsub(1), or the qalter(1) command may be used to set the resource. Any job attempting to use the HPM counters without requesting them via '-l hpm=1' will fail or be terminated.

See also:
REVOKE_HPM_COUNTERS

REVOKE_HPM_COUNTERS <boolean>
An unfortunate implementation detail of the HPM counters on the Origin2000 is that they are "unowned" and cannot be revoked from user to global mode when a process has a reference to them. This becomes a problem when users forget to specify that they need access to the counters. In most cases, when their scripts run, the attempts to access the counters will fail because they are in use monitoring the global system performance.

However, sometimes the counters are in user mode (because another running job correctly specified the '-l hpm=1' attribute). The second job can then acquire a reference on the counters, and will happily run as if it had requested the counters itself.

The problem arises when the original well-behaved job terminates. As far as the scheduler can tell, the counters are now free to be returned to global mode. However, the request to return them to global monitoring will fail since the second job still maintains its reference. Although the scheduler will attempt to reclaim the counters on each iteration, it is possible for jobs to pass the illicit reference to the counters back and forth until there happen to be no running processes using the counters.

If the hpm_ctl script supports a "revoke" mode (see the description in MANAGE_HPM_COUNTERS above), the scheduler can use it to attempt to revoke the counters from the jobs that did not request them. Setting the option REVOKE_HPM_COUNTERS to "True" will enable this functionality.
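
The reclaim sequence described in this section might be sketched as follows. This is one reading of the text, not the scheduler's actual code: the function name, the returned labels, and the choice to attempt "global" once more right after a revoke are all assumptions. The hpm_ctl parameter is any callable that takes a mode string and returns the script's response.

```python
def reclaim_counters(hpm_ctl, jobs_using_hpm, revoke_enabled):
    """One scheduling iteration's attempt to return the counters to global mode.

    hpm_ctl         callable: mode string -> response string
    jobs_using_hpm  number of running jobs that requested '-l hpm=1'
    revoke_enabled  value of REVOKE_HPM_COUNTERS
    """
    if jobs_using_hpm > 0:
        return "in-use"                  # a well-behaved job still needs them
    if hpm_ctl("query") == "global":
        return "already-global"
    if hpm_ctl("global") == "OKAY":
        return "reclaimed"
    # Some job that never requested the counters still holds a reference.
    if not revoke_enabled:
        return "retry-next-iteration"
    hpm_ctl("revoke")
    # Assumption: try again immediately after the revoke.
    if hpm_ctl("global") == "OKAY":
        return "reclaimed"
    return "retry-next-iteration"
```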

Be aware that on some systems, using the ecfind(1) script to discover which PIDs have a reference on the counters can take a long time, so consider setting the scheduler's alarm to a larger value (e.g. 'pbs_sched -a 90 ...'). The ecfind(1) script may also crash the operating system. Use this option at your own risk.

See also:
MANAGE_HPM_COUNTERS (Required)

Restart Action After Server or Mom Crash

SCHED_RESTART_ACTION <string>
When PBS is terminated unexpectedly (e.g. by a system crash or errant SIGKILL), the jobs that were running at the time of the crash may be left in the execution queue, but in the "Queued" state. As these jobs are no longer in the submit queue, they will remain orphaned in the execution queues unless acted upon by the scheduler.

The SCHED_RESTART_ACTION option defines how the scheduler should handle these jobs. Its argument must be one of the following strings:
Method      Effect on queued jobs in execution queues
none        Leave the jobs in the execution queues, where they will be ignored until manually moved back into a submit queue. This is the default disposition.
restart     Assume that any jobs queued in the execution queue were running at the instant the system crashed. Re-run each job found queued, then recycle and start a normal scheduling cycle. Restart is the most robust of the three actions.
resubmit    Return each job to its original queue (named by the variable PBS_O_QUEUE), then start scheduling. The queued jobs will not maintain their original priority or ordering in the queues.
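
The three dispositions can be pictured with the short Python sketch below. It only builds a list of intended operations; the function name, the tuple layout, and the representation of a job as a dictionary with "id" and "PBS_O_QUEUE" keys are hypothetical and do not reflect the scheduler's internal data structures.

```python
def plan_restart(action, orphaned_jobs):
    """Return (operation, job id, destination queue) tuples for orphaned jobs.

    orphaned_jobs is a list of dicts describing jobs found in the "Queued"
    state in an execution queue.
    """
    if action == "none":
        return []                                   # leave jobs where they are
    if action == "restart":
        return [("run", job["id"], None) for job in orphaned_jobs]
    if action == "resubmit":
        return [("move", job["id"], job["PBS_O_QUEUE"])
                for job in orphaned_jobs]
    raise ValueError("unknown SCHED_RESTART_ACTION: %r" % (action,))
```

Rejecting an unknown string outright is a design choice of the sketch; how the scheduler itself treats an unrecognized value is not covered here.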

For example, the configuration file might include the following:

	# Restart any jobs that were running when the system crashed.
	# These may bomb out immediately (esp. if they were interactive)
	# so recycle and start a new scheduling run afterwards.
	SCHED_RESTART_ACTION		restart

Experimental Options

SMALL_QUEUED_TIME <timespec>     [experimental]
This option can be used to change the ordering of outstanding jobs to take into account the relative wait times between any two outstanding jobs.

It is not clear what effect, if any, this option will have on a normal workload.

See also:
MAX_QUEUED_TIME (required)
INTERACTIVE_LONG_WAIT

AVOID_FRAGMENTATION <boolean>     [experimental]
A natural side-effect of packing jobs into a queue is that the queue tends to become "fragmented". As jobs are packed into the queue, the available resources become smaller, decreasing the possibilities of a large job being able to run. Since the small jobs are more likely to be runnable, they tend to perpetuate the fragmentation problem. Without some mechanism for recovering from queue fragmentation, it is possible that larger jobs will be starved forever.

This experimental algorithm was implemented to allow the scheduler to recover from this queue fragmentation. The algorithm computes the size of a "fragment" by dividing the total node resources in a queue by the maximum number of jobs. If the queue is empty, it will allow any job to be run (and recover later if necessary). If it discovers that the average size of a job in that queue is less than a fragment, it will refuse to allow the fragmentation to continue. The scheduler will only start a job that requests less than a fragment if it will run no longer than the time until the queue is expected to be emptied. Any job larger than a fragment will be allowed to run, as it will not contribute to the fragmentation problem.
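
A minimal Python sketch of this test follows. It is an interpretation of the paragraph above rather than the scheduler's code: the function and parameter names are hypothetical, a job exactly one fragment in size is assumed to be allowed, and the walltime restriction is assumed to apply only once the queue's average job size has dropped below a fragment.

```python
def may_start(job_nodes, job_walltime, queue_nodes, queue_max_jobs,
              running_nodes, time_until_empty):
    """Decide whether starting a job would worsen queue fragmentation.

    running_nodes     node counts of the jobs now running in the queue
    time_until_empty  seconds until the queue is expected to drain
    """
    if queue_max_jobs <= 0:
        raise ValueError("queue_max_jobs must be positive")
    fragment = queue_nodes / queue_max_jobs

    if not running_nodes:
        return True                      # empty queue: allow any job
    average = sum(running_nodes) / len(running_nodes)
    if average >= fragment:
        return True                      # queue is not fragmented
    if job_nodes >= fragment:
        return True                      # large jobs do not add fragmentation
    # Small job in a fragmented queue: only if it ends before the queue drains.
    return job_walltime <= time_until_empty
```

For example, a 64-node queue limited to 4 jobs has a fragment of 16 nodes. With jobs of 4, 4 and 8 nodes running and the queue expected to drain in 1800 seconds, a 2-node job asking for 3600 seconds is refused, while the same job asking for 600 seconds is allowed.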

On the whole, this algorithm only slightly improved turn-around and utilization, but was often confusing to the users and staff. It is left mostly for historical reference. Use of this feature is discouraged.

Options For Testing and Debugging

TEST_ONLY <boolean>
TEST_ONLY will cause the scheduler to just "go through the motions". To be specific, a normal scheduling cycle will be performed, but instead of performing any action which could change the state of the batch system, a message will be logged.

The TEST_ONLY option is very useful when making changes to the layout of a system, adding or removing queues or machines, or changing policies. It is also especially handy when adjusting the accounting machinery, since it will prevent the scheduler from deleting jobs in the case of mishaps with the allocations files.
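
The "log instead of act" behaviour can be pictured with a small Python sketch. The perform() helper, the logger name and the message text are hypothetical; the scheduler's real log messages are not reproduced here.

```python
import logging

log = logging.getLogger("sched")


def perform(description, action, test_only):
    """Run a state-changing action, or only log it when TEST_ONLY is set.

    Returns True if the action was actually carried out.
    """
    if test_only:
        log.info("TEST_ONLY: would have done: %s", description)
        return False
    action()
    return True
```

Routing every state-changing call through one such guard is what lets the rest of a scheduling cycle run unchanged in test mode.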

To enable the test mode:

	# "Test-only" mode.  Log any actions that would affect the state
	# of the machine, instead of changing things.
	TEST_ONLY		True

See also:
FAKE_MACHINE_MULT

FAKE_MACHINE_MULT <integer>
Many sites have a small "test" platform they may use when testing new configurations, policy changes, etc, before deploying the changes on larger machines. The FAKE_MACHINE_MULT option may be useful in these cases for testing how the scheduler will react on a large machine. If FAKE_MACHINE_MULT is set to a non-zero value, all machine resources are multiplied by that value.

For example, the following configuration will allow an administrator to smoke-test a 256-processor scheduler configuration on an 8-processor test machine:

	# "test" submit queue -- don't test with the real "submit" queue
	SUBMIT_QUEUE		test
	# "fake" 256-processor batch queue
	BATCH_QUEUES		q256p

	# Run in testing-only mode, treating this 8-p machine as a 256.
	# All node counts, load average, etc, will be multiplied by 32.
	TEST_ONLY		True
	FAKE_MACHINE_MULT	32

Note that your test batch queues must be configured in PBS to have the correct "resource_max.*" limits for a 256-p machine.
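
The scaling itself is simple arithmetic: 8 processors multiplied by 32 gives the 256 processors of the target machine. The Python sketch below shows the idea, with hypothetical resource names and a hypothetical apply_fake_mult() helper.

```python
def apply_fake_mult(resources, mult):
    """Return a copy of resources with every value multiplied by mult.

    A value of 0 for mult leaves the resources unchanged.
    """
    if not mult:
        return dict(resources)
    return {name: value * mult for name, value in resources.items()}


real = {"processors": 8, "load_average": 1.5}
fake = apply_fake_mult(real, 32)
# fake == {"processors": 256, "load_average": 48.0}
```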

See also:
TEST_ONLY (required)


ALPHABETICAL AND FUNCTIONAL INDEX

Configuration Options (By Function)
Queue and Server Configuration
SUBMIT_QUEUE
BATCH_QUEUES
SPECIAL_QUEUE
DEDICATED_QUEUES
EXTERNAL_QUEUES

Primetime Job Walltime Configuration
PRIME_TIME_WALLT_LIMIT
PRIME_TIME_START
PRIME_TIME_END
ENFORCE_PRIME_TIME
SMALL_JOB_MAX
WALLT_LIMIT_SMALL_JOB
WALLT_LIMIT_LARGE_JOB
PRIME_TIME_SMALL_NODE_LIMIT
PRIME_TIME_SMALL_WALLT_LIMIT

Dedicated Time and System Draining Support
DEDICATED_TIME_COMMAND
DEDICATED_TIME_CACHE_SECS
SYSTEM_NAME
NONPRIME_DRAIN_SYS
NP_DRAIN_BACKTIME
NP_DRAIN_IDLETIME

Support For Per-Host Resource Usage Limits
MAX_JOBS
MIN_JOBS
TARGET_LOAD_PCT
TARGET_LOAD_VARIANCE

Support for Group Resource Allocation
ENFORCE_ALLOCATION
SCHED_ACCT_DIR
SORT_BY_PAST_USAGE
DECAY_FACTOR
OA_DECAY_FACTOR

Long-Waiting ("outstanding") Job Support
MAX_QUEUED_TIME
INTERACTIVE_LONG_WAIT

Sorted Job Dumpfile
SORTED_JOB_DUMPFILE

Managing HPM/Perfex Performance Counters
MANAGE_HPM_COUNTERS
REVOKE_HPM_COUNTERS

Restart Action After Server or Mom Crash
SCHED_RESTART_ACTION

Experimental Options
SMALL_QUEUED_TIME
AVOID_FRAGMENTATION

Options For Testing and Debugging
TEST_ONLY
FAKE_MACHINE_MULT

Configuration Options (Alphabetical)
AVOID_FRAGMENTATION
BATCH_QUEUES
DECAY_FACTOR
DEDICATED_QUEUES
DEDICATED_TIME_CACHE_SECS
DEDICATED_TIME_COMMAND
ENFORCE_ALLOCATION
ENFORCE_PRIME_TIME
EXTERNAL_QUEUES
FAKE_MACHINE_MULT
INTERACTIVE_LONG_WAIT
MANAGE_HPM_COUNTERS
MAX_JOBS
MAX_QUEUED_TIME
MIN_JOBS
NONPRIME_DRAIN_SYS
NP_DRAIN_BACKTIME
NP_DRAIN_IDLETIME
OA_DECAY_FACTOR
PRIME_TIME_END
PRIME_TIME_SMALL_NODE_LIMIT
PRIME_TIME_SMALL_WALLT_LIMIT
PRIME_TIME_START
PRIME_TIME_WALLT_LIMIT
REVOKE_HPM_COUNTERS
SCHED_ACCT_DIR
SCHED_RESTART_ACTION
SMALL_JOB_MAX
SMALL_QUEUED_TIME
SORTED_JOB_DUMPFILE
SORT_BY_PAST_USAGE
SPECIAL_QUEUE
SUBMIT_QUEUE
SYSTEM_NAME
TARGET_LOAD_PCT
TARGET_LOAD_VARIANCE
TEST_ONLY
WALLT_LIMIT_LARGE_JOB
WALLT_LIMIT_SMALL_JOB

Each configuration option takes one argument. The argument may be one of the following types:
<boolean>
<integer>
<pathname>
<queue>
<real>
<string>
<timespec>
<variance>