overview.html
in this directory. Each of the 40 configuration options is listed below,
cross-referenced to any related options.
There is also a complete index of all of the options at the end of this guide. The index is organized both alphabetically and by functional grouping.
The specifics of how the next job is chosen are mostly a policy issue. Your site must specify a policy, and configure the scheduler to implement that policy. The Origin scheduler is able to implement many common policies simply by being configured in various ways. However, for very complicated policies, it may be necessary to modify or extend the current algorithms.
For the Origin2000 systems running at NAS, the only strict requirement for
a scheduler is that it must set the "Resource_List.ssinodes"
attribute of a job and instruct the server to start its execution. The
"ssinodes" attribute is used by the PBS resource monitor to
allocate a portion of the execution host's resources to the job when it
is started. "ssinodes" was added by NAS to allow a job to
specify a number of computing elements on a single host (which PBS refers
to as a "node").
CONFIGFILE (defined in
the Makefile.in template). By default, the path of the
configuration file is "$PBS_SERVER_HOME/sched_priv/config".
This file is read by the scheduler when it starts up. Sending a
SIGHUP to the scheduler will cause it to reconfigure itself
from the current contents of the config file.
The configuration file is read and parsed line-by-line. Comments begin
with a '#' character, and extend to the end of the line. They
may be placed anywhere in the file, as may blank lines. Each line is parsed
into a pair of white-space delimited words, an option and an argument.
The name of the option must be all-uppercase (for clarity). White-space is currently not allowed in the argument -- options that take lists of items or multiple values use commas to separate the individual items within the argument string. Syntax errors in the configuration file are caught by the parser, and the offending line number and option are noted in the scheduler logs. The scheduler will not start while there are syntax errors in its configuration file.
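The parsing rules described above can be sketched as follows. This is a minimal illustration in Python, not the scheduler's own parser; the function name parse_config and the way errors are collected and reported are assumptions made for the example.

```python
def parse_config(text):
    """Parse scheduler-style configuration text.

    Returns a list of (option, argument) pairs in file order.
    Raises ValueError listing (line number, offending text) for bad lines.
    """
    options = []
    errors = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        # A '#' starts a comment that runs to the end of the line.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank or comment-only line
        words = line.split()
        # Each line must be exactly one option and one argument.
        if len(words) != 2:
            errors.append((lineno, line))
            continue
        option, argument = words
        # Option names must be all-uppercase.
        if option != option.upper():
            errors.append((lineno, option))
            continue
        options.append((option, argument))
    if errors:
        raise ValueError("syntax errors in configuration: %r" % errors)
    return options
```

Because an option may appear on several lines (BATCH_QUEUES, for instance), the sketch returns a list of pairs rather than a dictionary.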
Note that many of the configuration options (e.g. AVOID_FRAGMENTATION) are
left over from experimental algorithms that were tried at various stages
of the lifetime of the Origin2000s. These "historical" options are
marked in the descriptions below with the tag [deprecated]. They
will probably be removed in future releases, and their use is discouraged.
For lower-level debugging, the scheduler also dumps various debugging and
state data onto its stdout stream. If the daemon is built with
-DDEBUG, this output will be directed into the file
/PBS/sched_priv/sched_out. The volume of output generated can
be quite large for an active system, and it is intended for use only by those who have some familiarity with the code.
The sched_out file will grow without bound and eventually fill
up the /PBS partition. It is recommended that it be pruned
periodically to remove old information. Each iteration is tagged with a
timestamp, both in ctime(3) format and as seconds since the
Epoch. This should make it relatively easy to write a script to prune out
the old debugging data.
<boolean> |
A boolean value. The strings "true", "yes",
"on" and "1" are all "True", anything
else evaluates to "False". If a string in the format 'MM/DD/YYYY@HH:MM:SS' is supplied (e.g. "09/07/1999@05:01:00"), the option will be "False" until the given date and time, then will be "True" after that. If only the date is specified, then the time is assumed to be 00:00:00 on the morning of that day. |
<integer> |
An integer value (e.g. "64"). The range is determined by the option, but is typically a small positive number. |
<pathname> |
The path to some file in the local filesystem. If an absolute path is not specified, the path is relative to PBS' scheduler directory ($PBS_SERVER_HOME/sched_priv/). To avoid confusion, using absolute pathnames is recommended. |
<queue> |
The name of a PBS queue. May be either 'queue@hostname' or just 'queue'. If the hostname is not specified, it defaults to the hostname of the local machine. |
<real> |
A real-valued (floating point) number (for example, "0.80" or "-1.0"). |
<string> |
An uninterpreted string that is passed to other programs. Note that, due to limitations in the parser, whitespace is not allowed in strings. Leading and trailing quotes are not removed by the parser. |
<timespec> |
A string of the form HH:MM:SS (e.g. 00:30:00
for thirty minutes, 4:00:00 for four hours). Note that the
timespec is parsed from right-to-left, so "01:30" is 90 seconds,
not an hour and 30 minutes.
|
<variance> |
A negative and positive deviation from some base value. The syntax is
'-mm%,+nn%' (e.g. '-10%,+15%' for minus 10 percent
and plus 15 percent from some given value). Variances describe acceptable deviation
from some ideal value.
|
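As an illustration of the boolean rules in the table above, the following Python sketch evaluates an argument at a given time. The name eval_boolean is hypothetical. Two details are assumptions rather than documented behavior: the strings "true", "yes" and "on" are matched case-insensitively (the sample configurations in this guide write "True"), and the option is taken to become "True" at the exact instant named.

```python
from datetime import datetime

TRUE_STRINGS = {"true", "yes", "on", "1"}

def eval_boolean(arg, now):
    """Evaluate a boolean-style argument at time `now` (a datetime)."""
    if "/" in arg:
        # Date/time form: False before the given moment, True from then on.
        # A bare date means 00:00:00 on that day.
        fmt = "%m/%d/%Y@%H:%M:%S" if "@" in arg else "%m/%d/%Y"
        try:
            return now >= datetime.strptime(arg, fmt)
        except ValueError:
            return False  # anything unrecognized evaluates to False
    # Assumption: case-insensitive match against the "True" strings.
    return arg.lower() in TRUE_STRINGS
```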
<queue> specifier, which appends the given queue(s) to
the appropriate list.
The PBS server should be configured with maximum resource limits for each queue. The scheduler uses these to determine whether a job "fits" in the batch queue. Each queue should have at least the following resources defined for it:
resources_max.mem
resources_max.ncpus
resources_max.walltime
The resources_max.mem and resources_max.ncpus
values should be even multiples of a single "ssinode" -- a
physical computing node in the Origin2000 architecture. This means that
the 'ncpus' and 'mem' maximum limits should be
multiples of 2 and 490mb, respectively. The
machines are scheduled as sets of atomic "nodes", each consisting of a
pair of CPUs and a bank of memory. See the document
O2K-config.html in this directory for more information
about the machine's physical architecture and layout.
It's also a good idea to set the resources_default.* resources
for the submit queue. At NAS, these are usually set to 5 minutes and 1 node:
set queue submit resources_default.mem = 490mb
set queue submit resources_default.ncpus = 2
set queue submit resources_default.walltime = 00:05:00
The server itself should also be configured to have a set of resource limits. These limits should be the superset of all queue limits. In other words, if you have a 4-node queue with an 8-hour walltime limit, and a 64-node queue with a 1-hour walltime limit, you should set your server limits to 64 nodes and 8 hours. The scheduler will reject any jobs that do not fit into any single queue definition (e.g. a 4-hour, 64-node job):
set server resources_max.mem = 31360mb
set server resources_max.ncpus = 128
set server resources_max.walltime = 08:00:00
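The arithmetic behind these server limits can be checked with a short Python sketch. It assumes the node size given earlier (2 CPUs and 490mb per node); the function names and the representation of a queue as a (nodes, hours) pair are hypothetical, chosen only for this example.

```python
NODE_CPUS = 2      # CPUs per node, as described above
NODE_MEM_MB = 490  # memory per node, in megabytes

def server_limits(queues):
    """Compute server-wide limits as the superset of all queue limits.

    `queues` is a list of (max_nodes, max_walltime_hours) pairs.
    """
    max_nodes = max(nodes for nodes, _ in queues)
    max_hours = max(hours for _, hours in queues)
    return {
        "resources_max.mem": "%dmb" % (max_nodes * NODE_MEM_MB),
        "resources_max.ncpus": max_nodes * NODE_CPUS,
        "resources_max.walltime": "%02d:00:00" % max_hours,
    }

def fits_some_queue(job_nodes, job_hours, queues):
    """True if the job fits within the limits of at least one single queue."""
    return any(job_nodes <= nodes and job_hours <= hours
               for nodes, hours in queues)
```

With the two queues from the text, server_limits([(4, 8), (64, 1)]) gives 31360mb, 128 CPUs and 08:00:00, while a 4-hour, 64-node job passes the server limits but fits neither queue.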
The PBS mom must respond to requests for resource information from the
scheduler. Usually this means that you must add a '$clienthost'
entry for the scheduler's host in the mom's configuration file. Note that
if the scheduler is running on the same host as the mom or server, you will
probably need a '$clienthost' entry for localhost
as well as the local machine's hostname.
SUBMIT_QUEUE <queue>
BATCH_QUEUES <queue>[,<queue> ...]
The minimal scheduler configuration should select jobs from the submit queue, then pack and run them on the batch queues. The jobs will be chosen for execution in a permuted order based on their size, queued time, and recent usage of the system by each user. The default behavior can be modified by including declarations for the options listed below. Some sample configuration files are included below.
A trivial case (one submit queue and one execution queue on the local host, default policy but do not attempt to increase fairness by sorting based on past-usage):
SUBMIT_QUEUE submit
BATCH_QUEUES execute
SORT_BY_PAST_USAGE False
More complicated configurations are possible. For instance, to schedule jobs across several hosts, with primetime from 8:00AM to 5:00PM limited to 3 hours, and special priority to the "special" queue:
# Our site has several different sized hosts (named 32p-o2k,
# 64p-o2k and 128p-o2k). Schedule the queues 'small', 'medium'
# and 'large' on them. Note that a few CPUs are reserved for
# system activity.
SUBMIT_QUEUE submit
BATCH_QUEUES small@32p-o2k.bar.com # only 28p max
BATCH_QUEUES medium@64p-o2k.bar.com # only 60p max
BATCH_QUEUES large@128p-o2k.bar.com # only 124p max
#
# "Special access" queue -- by request only, controlled by queue
# ACL's in the server configuration (qmgr(1)).
SPECIAL_QUEUE special
#
# Our policy states that primetime is from 8:00 to 17:00 M-F
# (except on holidays). Jobs may run for up to 3 hours within
# primetime.
ENFORCE_PRIME_TIME True
PRIME_TIME_WALLT_LIMIT 03:00:00
#
# Attempt to foster fair use of the system by giving lower
# priority to jobs whose owners have recently used many hours
# of system time.
SORT_BY_PAST_USAGE True
Many more options are available for implementing various policies. See the comments in the sample configuration file and the descriptions of each option below.
$SERVER_HOME/sched_priv/config |
Origin2000 Scheduler configuration file (required) |
$SERVER_HOME/sched_priv/decay_usage |
Per-user recent past-usage database file |
$SERVER_HOME/sched_priv/sched_out |
Unformatted low-level debugging output |
$SERVER_HOME/sched_priv/sortedjobs |
Dump of job list after sorting (in preferred order to run) |
$SERVER_HOME/pbs_acctdir/allocations |
Per-group node-hour allocations database file |
$SERVER_HOME/pbs_acctdir/current |
Per-user database of current (YTD) node-hour usage |
/usr/lib/acct/holidays |
List of holidays this year, with day numbers (counting from 1) |
SUBMIT_QUEUE <queue> :
BATCH_QUEUES <queue>[,<queue>...] :
resources_max.*" attributes.
Batch queues are kept in a list internally in the scheduler, and may be
specified either in a comma separated list, or line-by-line.
Typically these queues are named either 'execute', or named after the execution host. 'execute' may be less confusing for a single-host installation, while using the hostname for the execution queue allows users to easily determine where their jobs are running (e.g. from the output of the 'qstat -r' command). For sites with more than a single execution host, we recommend using the hostname for queues.
It is possible for each execution host to have more than one queue, but
for most installations this is unnecessary. A potential use of this
feature is to dedicate a portion of a machine to a specific project or
user with the queue 'user_acl' resource. The remaining resources
can be placed in an un-ACL'd queue, and the two queues scheduled together.
The ACL'd queue should be placed first in the list, to give it precedence.
Then, the privileged user is guaranteed access to the resources in the
ACL'd queue, but also has a fair shot at the remaining resources. Other
mechanisms exist for handling the problem of giving a set of users an
elevated priority on the hosts -- see the descriptions of SPECIAL_QUEUE
and EXTERNAL_QUEUES below.
Because jobs are compared against queue limits in the order listed, it is a good idea to list the queues in ascending size/time order. This will cause the "smaller" queues to be filled first, leaving the larger queues less utilized and allowing large jobs to be run quickly. If there are no large jobs with a sufficiently high priority, several small jobs will be packed into a large queue in order to keep up utilization and throughput.
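The effect of queue ordering can be illustrated with a simple first-fit sketch in Python. It only compares a job against static queue limits in the listed order; the real scheduler also has to consider what is currently free on each host. The function name and the dictionary keys are hypothetical.

```python
def first_fitting_queue(job, queues):
    """Return the name of the first queue whose limits the job fits, else None.

    `job` has "nodes" and "walltime" (seconds); each queue has "name",
    "max_nodes" and "max_walltime" (seconds). `queues` is in the order
    given in the configuration file.
    """
    for queue in queues:
        if (job["nodes"] <= queue["max_nodes"]
                and job["walltime"] <= queue["max_walltime"]):
            return queue["name"]
    return None
```

With the queues listed in ascending size, a 4-node job lands in the smallest queue that can hold it, leaving the larger queues free for jobs that need them.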
A typical BATCH_QUEUES configuration might look like this:
# Execution queue on our 64-p box, which spans the whole machine:
BATCH_QUEUES execute@64p-o2k.bar.com
# Execution queues on the 128-p. Chess is a small queue dedicated
# to the folks developing a new chess algorithm (controlled by an
# ACL). The remaining resources are covered by the execute queue.
BATCH_QUEUES chess@128p-o2k.bar.com,execute@128p-o2k.bar.com
Note also that there are two 'execute' queues -- the hostnames
differ, allowing the scheduler to track them as independent objects.
The option name "BATCH_QUEUES" is a hold-over from an earlier, obsolete
design that differentiated between "interactive" and "batch" queues.
SPECIAL_QUEUE <queue> :
As a brute-force means of allowing higher-priority access to the systems, a "special-access" queue can be specified. It is usually a clone of the submit queue, although its resource limits may be made smaller to reduce the size of jobs that can be submitted to it. The scheduler will check for jobs on the special queue, and place them at the head of the permuted list. This gives them the highest "priority" in the system.
Jobs that are enqueued on the special queue remain in the special queue until they are moved to an execution queue and started. The special queue is merely a container to differentiate the queues. The user ACL's provided by PBS should be used to limit access to the queue. Note that access to the special queue should be closely controlled. If it is used carelessly, the special jobs can starve all other jobs in the system (until they all become waiting, which leads to more problems).
We suggest a formal special-access request procedure, and a short (two weeks) time limit for the access. "Special" jobs are pre-empted only by jobs that have waited for too long and are marked "outstanding". Neither "outstanding" nor "special" jobs are subject to most policy-based restrictions (e.g. prime-time walltime limits, past-usage, etc).
In order to help the other users to understand why their jobs have been preempted, all running "special" jobs are marked in the job comment field with a message like:
"Started on Fri Aug 13 10:13:31 PST 1999 (special access job)"
INTERACTIVE_LONG_WAIT
MAX_QUEUED_TIME
DEDICATED_QUEUES <queue>[,<queue>...] :
The administrator then requests a dedicated time for the user at the
requested time. These jobs remain queued in the dedicated queue until the
machine enters dedicated time, at which time they are run. There should
be a separate dedicated queue for each machine. NAS names the dedicated
queues based on the hostname, e.g. for hopper, the dedicated queue would
be named 'hopper_ded'.
Jobs in the dedicated queue will be run in-place, in the order they were enqueued. They are packed, but not otherwise re-ordered. There is no reason to reprioritize them (dedicated time is inherently unfair), and most users want control over the execution sequence of their jobs (i.e. they submit jobs in the order they should run).
Again, ACL's should be used on the dedicated queues to prevent users from accidentally submitting jobs to them without scheduling a dedicated time in which they can be run.
ENFORCE_DEDICATED_TIME
DEDICATED_TIME_CACHE_SECS (required)
DEDICATED_TIME_COMMAND (required)
SYSTEM_NAME
EXTERNAL_QUEUES <queue>[,<queue>...] :
EXTERNAL_QUEUES,
ready for the scheduler to run in-place.
This functionality was added to support a small 'compile' queue
on an early set of machines, allowing the users to submit larger jobs to the
global submit pool, or place their job in line in the 'compile'
queue.
External queues are seldom used, but allow the same scheduler to simultaneously act as both a FIFO-packing and a priority-packing algorithm.
PRIME_TIME_WALLT_LIMIT <timespec>
During the non-primetime hours (usually late at night when the users have mostly gone home), the scheduler attempts to maximize the utilization of the machine, while running the long jobs that cannot run during the day. For this reason, the longer, larger jobs are favored over shorter and smaller jobs.
In order to keep throughput up during the workday, most sites will choose to have a shorter walltime limit for jobs running in the day than during the night. Because of the shorter walltime and need to run more jobs, during primetime the sorting algorithms are adjusted to favor small, short jobs over longer, larger jobs.
A typical value for walltime limits during primetime is 2 hours:
# Only run jobs up to 2 hours long during the day.
PRIME_TIME_WALLT_LIMIT 02:00:00
ENFORCE_PRIME_TIME True
Primetime is only enforced during work-weekdays (i.e. Monday through Friday). Primetime is not observed on any holiday listed in the file /usr/lib/acct/holidays (be sure to update this file each year). The holidays file contains a list of holidays, one entry per line. Each entry has the format:
<day_of_year> <month> <day_of_month> <description>
Lines beginning with a '*' are considered comments. The scheduler only parses the <day_of_year> field, which should be a number from 1 to 365 according to the date of the holiday.
During non-primetime (or if primetime is not enforced), the walltime limit
is taken from the 'resources_max.walltime' value returned by the
PBS server. Note that the limit is on the amount of time the job would run
during each prime-time period, not on the total time for the job. This
means that no job may run for more than the primetime walltime limit
during any given primetime period.
Remember that the <timespec> is parsed from right-to-left,
so the value "02:00" is two minutes, not two hours.
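The right-to-left rule is easy to get wrong, so a small Python sketch of it may help. The function name parse_timespec is hypothetical, and accepting a single field as plain seconds is an assumption that follows from the right-to-left rule rather than something stated in this guide.

```python
def parse_timespec(spec):
    """Convert a timespec string to seconds, reading fields right to left.

    The rightmost field is seconds, the next is minutes, the next is hours.
    """
    fields = [int(field) for field in spec.split(":")]
    if not 1 <= len(fields) <= 3:
        raise ValueError("bad timespec: %r" % spec)
    seconds = 0
    for unit, value in zip((1, 60, 3600), reversed(fields)):
        seconds += unit * value
    return seconds
```

Here parse_timespec("02:00") is 120 seconds, while parse_timespec("02:00:00") is 7200 seconds, which is why the full HH:MM:SS form is the safer habit.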
PRIME_TIME_START (required)
PRIME_TIME_END (required)
ENFORCE_PRIME_TIME
PRIME_TIME_START <timespec>
PRIME_TIME_END <timespec>
PRIME_TIME_WALLT_LIMIT will be considered runnable.
In addition, during primetime, jobs are sorted from smallest/shortest to largest/longest. This favors short, small jobs which should improve throughput and response time, allowing more iterations in the "submit, run, debug, resubmit" batch development cycle.
For instance, to declare primetime from 8AM to 5PM (local time), the following options should be set:
# Prime-time is from 8:00AM to 5:00PM (local time).
PRIME_TIME_START 08:00:00
PRIME_TIME_END 17:00:00
Special and outstanding jobs are not subject to primetime walltime limits.
Remember that the <timespec> is parsed from right-to-left,
so "02:00" is 2 minutes past midnight, not 2:00 AM.
PRIME_TIME_WALLT_LIMIT (required)
ENFORCE_PRIME_TIME
ENFORCE_PRIME_TIME <boolean>
The argument to the ENFORCE_PRIME_TIME option may be either a boolean, or
it may be a date/time specification. (See description of argument types
above). Note that the date/time is the first time at which the scheduler
should consider primetime enforced. Any jobs considered before this time
will not be subject to primetime constraints, even if the job would run
past the time that the option starts being enforced.
Also note that it is the portion of the job that will be running in primetime that is compared against the primetime walltime limit, not the total execution time requested. If the requested walltime is long enough to span more than one primetime period, it is allowed to run up to the primetime walltime limit for each period. This means that a job could be started that runs, for instance, 2 hours in primetime, 13 hours in non-primetime, and then another 2 hours in the next primetime period.
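The per-period comparison can be sketched in Python as follows. This is an illustration of the rule as described, not the scheduler's code: it treats every Monday through Friday as a primetime day and ignores holidays, and the function names are hypothetical. The primetime start, end and limit are given in seconds.

```python
from datetime import datetime, timedelta

def primetime_portions(start, walltime, pt_start, pt_end):
    """Seconds the job spends in each weekday primetime period it touches.

    `start` is a datetime, `walltime` is in seconds, and `pt_start` and
    `pt_end` are seconds since midnight. Holidays are ignored here.
    """
    end = start + timedelta(seconds=walltime)
    portions = []
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    while day < end:
        if day.weekday() < 5:  # Monday (0) through Friday (4)
            window_start = day + timedelta(seconds=pt_start)
            window_end = day + timedelta(seconds=pt_end)
            overlap = min(end, window_end) - max(start, window_start)
            if overlap.total_seconds() > 0:
                portions.append(int(overlap.total_seconds()))
        day += timedelta(days=1)
    return portions

def allowed_by_primetime(start, walltime, pt_start, pt_end, pt_limit):
    """True if no single primetime period holds more than pt_limit seconds."""
    return all(portion <= pt_limit
               for portion in primetime_portions(start, walltime, pt_start, pt_end))
```

For primetime from 08:00 to 17:00 with a 2-hour limit, a 19-hour job started at 15:00 on a Monday spends 2 hours in Monday's primetime and 2 hours in Tuesday's, so it is allowed, while a 3-hour job started at 10:00 is not.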
Primetime is only enforced on regular working days (i.e. Monday through Friday, excepting holidays). The list of which days are holidays is parsed from the holidays file, in /usr/lib/acct/holidays by default (this can be changed at compile time). The scheduler monitors the timestamp of this file, and will re-read it if it is modified.
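A minimal Python sketch of reading the holidays file and deciding whether a given date is a primetime day is shown here. It follows the entry format described earlier (comment lines start with '*', and only the leading day-of-year number is used). The function names are hypothetical, and silently skipping lines that do not start with a number is an assumption.

```python
from datetime import date

def parse_holidays(text):
    """Return the set of day-of-year numbers listed in holidays-file text."""
    days = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("*"):
            continue  # blank line or comment
        first = line.split()[0]
        if first.isdigit():  # assumption: skip lines without a leading number
            days.add(int(first))
    return days

def is_primetime_day(day, holidays):
    """True for Monday through Friday, unless the day is a listed holiday."""
    day_of_year = day.timetuple().tm_yday  # counts from 1
    return day.weekday() < 5 and day_of_year not in holidays
```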
An example of configuring the ENFORCE_PRIME_TIME option using the
date/time format (assume 8-hour maximum walltime, and 2-hour primetime
limits from 8:00AM to 5:00PM).
# Enforce prime-time walltime limits starting on Nov. 1st.
# Remember to start enforcing max_walltime - pt_limit (6) hours
# before primetime starts, so there are no long jobs leaked into
# primetime.
ENFORCE_PRIME_TIME 11/01/1999@02:00:00
PRIME_TIME_WALLT_LIMIT 02:00:00
PRIME_TIME_WALLT_LIMIT (required)
PRIME_TIME_START (required)
PRIME_TIME_END (required)
SMALL_JOB_MAX <integer>
SMALL_JOB_MAX
option determines the cut-off size (in nodes) for a job to be considered a
"small job" (or removes the distinction if the value is zero).
If SMALL_JOB_MAX is defined, then jobs are subject to the required options
WALLT_LIMIT_SMALL_JOB and WALLT_LIMIT_LARGE_JOB. In addition, small jobs can be
given a separate primetime walltime limit by specifying the options
PRIME_TIME_SMALL_NODE_LIMIT and PRIME_TIME_SMALL_WALLT_LIMIT. These
options are described in more detail below.
Note that "special" jobs are not subject to the small/large walltime distinction.
WALLT_LIMIT_SMALL_JOB (required)
WALLT_LIMIT_LARGE_JOB (required)
WALLT_LIMIT_SMALL_JOB <timespec>
WALLT_LIMIT_LARGE_JOB <timespec>
SMALL_JOB_MAX option). This allows the site to
set a policy with different allowable walltimes for jobs of different
sizes. A common policy is to allow small jobs to run for a shorter period
of time than larger ones. In theory, this will encourage users to submit
larger jobs.
Both values must be defined if the SMALL_JOB_MAX option is declared with a
non-zero argument. Remember that the <timespec> is parsed
from right to left, so "02:00" is 2 minutes, not 2 hours. Jobs
that request more than the limit specified for their size will be rejected
by the scheduler, with a notice to that effect being delivered to the user.
For example, to limit jobs of 16 nodes or fewer to no more than four hours, while allowing other jobs to run for the full 8 hours, you may specify:
# Our definition of a "small job" is 16 nodes or less (not all
# that small, really). Small jobs get only 4 hours to run, all
# other jobs get 8 hours max. Primetime is 2 hours for everyone.
SMALL_JOB_MAX 16
WALLT_LIMIT_SMALL_JOB 04:00:00
WALLT_LIMIT_LARGE_JOB 08:00:00
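The size-based limit selection can be sketched in Python as shown here. The function names are hypothetical, all times are in seconds, and treating a job of exactly SMALL_JOB_MAX nodes as "small" is an assumption based on the "16 nodes or less" comment in the sample.

```python
def walltime_limit(nodes, small_job_max, small_limit, large_limit):
    """Return the walltime limit (seconds) for a job of `nodes` nodes.

    Returns None when small_job_max is zero, meaning no small/large
    distinction is made.
    """
    if small_job_max == 0:
        return None
    if nodes <= small_job_max:  # assumption: the cut-off itself is "small"
        return small_limit
    return large_limit

def job_accepted(nodes, requested, small_job_max, small_limit, large_limit):
    """True if the requested walltime is within the limit for the job's size."""
    limit = walltime_limit(nodes, small_job_max, small_limit, large_limit)
    return limit is None or requested <= limit
```

With the sample values, an 8-node job asking for 5 hours is rejected, while a 32-node job asking for the same 5 hours is accepted.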
SMALL_JOB_MAX (required)
PRIME_TIME_SMALL_NODE_LIMIT <integer>
Primetime must be enforced (see the ENFORCE_PRIME_TIME option) in order
for this option to have any effect. The PRIME_TIME_SMALL_WALLT_LIMIT must
be defined along with this option.
This variable defines the maximum number of nodes that a job can request
and still be considered a "small" job for the purposes of primetime.
"Small jobs" (as defined by PRIME_TIME_SMALL_NODE_LIMIT) will be subject
to the walltime limit declared by PRIME_TIME_SMALL_WALLT_LIMIT. Jobs
larger than PRIME_TIME_SMALL_NODE_LIMIT nodes are subject to the normal
primetime limits (given by PRIME_TIME_WALLT_LIMIT).
Note that the values of PRIME_TIME_SMALL_NODE_LIMIT and SMALL_JOB_MAX are
independent of each other. To avoid confusion, it is recommended that they
be set to the same value if they are used together. However, if the local
policy dictates such, they may be given different values.
PRIME_TIME_SMALL_WALLT_LIMIT (required)
PRIME_TIME_WALLT_LIMIT (required)
ENFORCE_PRIME_TIME (required)
PRIME_TIME_SMALL_WALLT_LIMIT <timespec>
PRIME_TIME_SMALL_NODE_LIMIT) during primetime. The walltime limit during
primetime for a "normal" (i.e. bigger than the
PRIME_TIME_SMALL_NODE_LIMIT) job will be constrained by the primetime
limit PRIME_TIME_WALLT_LIMIT.
Remember that the <timespec> is parsed from right to left,
so "02:00" is a two-minute limit, not a two-hour limit.
An example of this set of options is:
# Start enforcing primetime on the 1st of November this year.
ENFORCE_PRIME_TIME 11/01/1999@00:00:00
# According to policy, jobs under 8 nodes only get one hour during
# primetime, instead of the usual 2 hours. This should encourage
# users to scale their jobs.
PRIME_TIME_WALLT_LIMIT 02:00:00
PRIME_TIME_SMALL_NODE_LIMIT 8
PRIME_TIME_SMALL_WALLT_LIMIT 01:00:00
PRIME_TIME_SMALL_NODE_LIMIT (required)
ENFORCE_PRIME_TIME (required)
PRIME_TIME_WALLT_LIMIT (required)
ENFORCE_DEDICATED_TIME <boolean>
If dedicated time is being enforced, each job is compared against the list
of upcoming dedicated times (see the description of DEDICATED_TIME_COMMAND
for details) for each execution host (or for the "system" as a whole --
see the SYSTEM_NAME option). If the job would run over into a dedicated
time on a machine, it will not be allowed to run. Thus, running jobs tend
to drain off as the upcoming dedicated time approaches.
As soon as the system clock passes the start of the dedicated time for a machine, the scheduler switches to the dedicated queue defined for that host (if any) and runs any jobs found in that queue in FIFO order (with packing). If no jobs are enqueued on the proper dedicated time queue (or one is not defined), the host will be idled for the duration of the scheduled time.
If a current dedicated time is no longer needed, the scheduler may be returned to "normal" operation before the scheduled end of the period by temporarily disabling this option. It is recommended that the date/time form of the boolean argument be used, as shown below. A common form of failure occurs when an operator forgets to re-enable dedicated times, and it is only discovered when the system fails to drain for the next scheduled dedicated time.
For example, the following is the correct way to temporarily disable the enforcement of dedicated time (to return to "normal" operation). Assume the current dedicated time was scheduled to end at 5:00PM on 11/10/1999.
# Dedicated time completed early. Return to normal operation
# until immediately after dedicated time was scheduled to complete.
ENFORCE_DEDICATED_TIME 11/10/1999@17:00:01
The compile-time option SORT_DEDTIME_JOBS will cause the jobs in
the queue to be sorted from largest to smallest size. This option is disabled
by default. We have found that most users that run during dedicated times
wish to control the order of execution for their jobs (by submitting them
in order). For brevity, dedicated times are called "Outages"
in the scheduler source code.
Special and outstanding jobs will not be scheduled to run over into dedicated time.
DEDICATED_TIME_COMMAND (required)
DEDICATED_TIME_CACHE_SECS
MAX_DEDICATED_JOBS
SYSTEM_NAME
DEDICATED_TIME_COMMAND <pathname>
The specified command is invoked with the name of the batch system or
host being queried as the only argument. Any output from the command is
parsed, and an internal list of the upcoming dedicated times for each host
is created in the scheduler. The results are cached for a short period of
time to reduce the load on the server (see DEDICATED_TIME_CACHE_SECS).
This service is based around a simple client-server model that queries a database of scheduled outages for the various hosts and groups of hosts on-site at NAS. The scheduler's parser expects to receive data in the format returned by the NAS "schedule" service, but can be easily modified to accept other input.
A sample of the types of input expected from the DEDICATED_TIME_COMMAND is
given below. The command should return the dedicated times for only the
host specified as the argument, although the parser will cull out any entries
that don't match the system requested.
GRUMPY 07/07/1999 16:00-19:00 07/07/1999 Large job runs (berferd)
HAPPY 07/09/1999 16:00-19:00 07/09/1999 Administrative period
(preventive maintenance)
SLEEPY 07/14/1999 16:00-19:00 07/14/1999 Install latest compiler.
SLEEPY 07/28/1999 10:00-16:00 07/28/1999 Preventive maintenance
SPORK 08/20/1999 09:00-09:00 08/21/1999 All day outage for power
supply and CPU board swap.
If there is no upcoming dedicated time for the specified host, the command should return the string:
"No scheduled downtime for the specified period."
The DEDICATED_TIME_COMMAND may be a binary executable or a shell script.
An easy method for getting started with the scheduler is to write a simple
shell script that just cat(1)'s the current "schedule" as a
here-document.
It is recommended that sites requiring more complicated mechanisms should
consider writing their own interface to their system.
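For sites adapting the parser, the sample output shown above can be read with a sketch like the following, written in Python for illustration. The field interpretation is inferred from the samples and is an assumption: host name, start date, start and end times, end date, then a free-form description that may continue on following lines. The function name parse_outages and the case-insensitive host comparison are also assumptions.

```python
import re
from datetime import datetime

# HOST  MM/DD/YYYY  HH:MM-HH:MM  MM/DD/YYYY  description...
OUTAGE_LINE = re.compile(
    r"^(\S+)\s+(\d{2}/\d{2}/\d{4})\s+(\d{2}:\d{2})-(\d{2}:\d{2})\s+(\d{2}/\d{2}/\d{4})\b"
)

def parse_outages(output, host):
    """Return a list of (start, end) datetimes for `host` from command output.

    Lines that do not look like an outage entry (description continuation
    lines, or the "No scheduled downtime" message) are skipped, as are
    entries for other hosts.
    """
    outages = []
    for line in output.splitlines():
        match = OUTAGE_LINE.match(line.strip())
        if not match:
            continue
        name, start_date, start_time, end_time, end_date = match.groups()
        if name.lower() != host.lower():  # assumption: ignore case of host names
            continue
        start = datetime.strptime(start_date + " " + start_time, "%m/%d/%Y %H:%M")
        end = datetime.strptime(end_date + " " + end_time, "%m/%d/%Y %H:%M")
        outages.append((start, end))
    return outages
```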
DEDICATED_TIME_COMMAND is queried for each host listed with BATCH_QUEUES,
DEDICATED_QUEUES, or EXTERNAL_QUEUES. If SYSTEM_NAME is defined, it will
also be asked for dedicated times for the specified "system". For more
information, see the description below of the SYSTEM_NAME option.
ENFORCE_DEDICATED_TIME (required)
DEDICATED_TIME_CACHE_SECS
MAX_DEDICATED_JOBS
SYSTEM_NAME
DEDICATED_TIME_CACHE_SECS <integer>
schedule" server (see the
description of the DEDICATED_TIME_COMMAND option), the results of the last
good request for a host are cached within the scheduler. This cache also
allows the scheduler to keep running reasonably if the "schedule"
server is unavailable (as sometimes happens at NAS).
The DEDICATED_TIME_CACHE_SECS option specifies the amount of time (in
seconds) that the host outage data should be kept cached. When the cache
becomes stale, the scheduler will attempt to refresh it with the current
data. Until the refresh is successful, the scheduler will continue to use
the last-known upcoming dedicated times.
The cache can be disabled altogether by setting DEDICATED_TIME_CACHE_SECS to
'0', and can be cleared by sending a SIGHUP to
the scheduler.
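The caching behavior described above can be sketched in Python as follows. The class name OutageCache, its methods, and the injectable clock are hypothetical and exist only to make the idea concrete. How the real scheduler behaves on a failed query when the cache is disabled is not stated here, so the sketch simply passes the failure on in that case.

```python
import time

class OutageCache:
    """Cache of per-host query results that falls back to the last good data."""

    def __init__(self, fetch, max_age, clock=time.time):
        self.fetch = fetch      # callable(host) -> data; raises on failure
        self.max_age = max_age  # seconds; 0 disables caching
        self.clock = clock
        self.entries = {}       # host -> (time fetched, data)

    def get(self, host):
        now = self.clock()
        cached = self.entries.get(host)
        if cached is not None and now - cached[0] < self.max_age:
            return cached[1]  # still fresh
        try:
            data = self.fetch(host)
        except Exception:
            if cached is not None:
                return cached[1]  # refresh failed: keep the last good result
            raise
        if self.max_age > 0:
            self.entries[host] = (now, data)
        return data

    def clear(self):
        """Forget everything, as a SIGHUP would."""
        self.entries.clear()
```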
ENFORCE_DEDICATED_TIME (required)
SYSTEM_NAME <string>
SYSTEM_NAME feature allows an entire set of hosts to be simultaneously
scheduled for dedicated time. This can be very useful when scheduling a
large collection of hosts, each of which may have their own individual
dedicated times along with a "system" dedicated time. It can also be used
to key the other hosts to a required services machine, e.g. an NFS server
or the PBS server.
If this option is set to a string or hostname, the scheduler will use the
DEDICATED_TIME_COMMAND to request dedicated time information for the named
"system". These dedicated times are then merged into the other dedicated
times, resulting in a "system" dedicated time for each host being
scheduled. The merge is performed so that an overlapping dedtime is
resolved in favor of the system.
For example:
# Use 'schedule' to get dedicated times for the hosts in the cluster.
ENFORCE_DEDICATED_TIME True
DEDICATED_TIME_COMMAND /usr/local/bin/schedule
# "cluster" covers all of the hosts. Dedicated times for "cluster"
# will cause all hosts in the cluster to be in dedicated time.
SYSTEM_NAME cluster
ENFORCE_DEDICATED_TIME (required)
DEDICATED_TIME_COMMAND
NONPRIME_DRAIN_SYS <boolean>
Enabling this option will reduce utilization somewhat, as there will be a short period just before non-primetime in which no useful job can be run. However, draining the system periodically has been shown to be a good way to recover from the fragmentation that inevitably occurs when scheduling these machines node-by-node. It will also ensure that at least one very large job will be runnable at the beginning of non-primetime every night.
If your site regularly sees a long period of idle time on the machine
before non-primetime starts, this can be addressed by specifying an early
start for non-primetime. See the descriptions of the NP_DRAIN_BACKTIME
and NP_DRAIN_IDLETIME options below for more information.
Special and outstanding jobs are not subject to this constraint. If one or more special or outstanding jobs is scheduled to cross the transition, then this restriction will be lifted. This is because the system cannot be drained in this case, so restricting other jobs is futile.
ENFORCE_PRIME_TIME (required)
NP_DRAIN_BACKTIME
NP_DRAIN_IDLETIME
NP_DRAIN_BACKTIME <timespec>
NP_DRAIN_IDLETIME <timespec>
NONPRIME_DRAIN_SYS feature is enabled. This idle time is spent
waiting for some "last-minute" job to slip in and run within the few
remaining minutes of primetime. While this does happen (some users find
this idle time useful to submit very short large test jobs), it is fairly
uncommon to see any jobs taking advantage of the gap. Non-primetime
"early start" is an attempt to decrease the wasted cycles when jobs are
not using primetime.
The NP_DRAIN_BACKTIME option specifies the maximum amount of time that
non-primetime can be pulled back into primetime. NP_DRAIN_IDLETIME
specifies the minimum idle time for a queue before early non-primetime will be
applied. If the queue has been idle long enough, and it is now close
enough to the start of non-primetime, primetime enforcement for this queue
will be temporarily disabled.
An example may help explain this idea. Assume the following configuration options are specified:
# Enforce primetime from 8:00AM through 6:00PM, draining the system
# before starting non-primetime.
ENFORCE_PRIME_TIME True
PRIME_TIME_START 08:00:00 # 8:00AM
PRIME_TIME_END 18:00:00 # 6:00PM
NONPRIME_DRAIN_SYS True
#
# If it is within 1/2 hour of non-primetime, and a queue has been
# idle for more than 15 minutes, assume no last-minute jobs will
# be submitted and start non-primetime early.
NP_DRAIN_BACKTIME 00:30:00 # up to 1/2 hour early
NP_DRAIN_IDLETIME 00:15:00 # must be idle 15 min
Once it is after 5:30PM, and an execution queue has been idle for more than 15 minutes, prime-time enforcement will no longer be enabled for that queue. This will allow a long-walltime job to start executing early, taking up the cycles that would probably have been wasted otherwise.
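The early-start decision can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the scheduler's actual code: the function name primetime_enforced and its arguments are hypothetical, and the sketch assumes that the comparison against NP_DRAIN_BACKTIME is inclusive.

```python
from datetime import datetime, timedelta

def primetime_enforced(now, prime_end, queue_idle, backtime, idletime):
    """Return False if non-primetime may start early for this queue.

    now, prime_end -- datetime objects (current time, PRIME_TIME_END today)
    queue_idle     -- timedelta the queue has been idle
    backtime       -- NP_DRAIN_BACKTIME as a timedelta
    idletime       -- NP_DRAIN_IDLETIME as a timedelta
    """
    if now >= prime_end:
        return False                      # primetime is already over
    close_enough = prime_end - now <= backtime
    idle_enough = queue_idle > idletime
    return not (close_enough and idle_enough)

# Values from the example configuration above.
prime_end = datetime(1999, 1, 4, 18, 0, 0)
backtime = timedelta(minutes=30)
idletime = timedelta(minutes=15)

# 5:40PM, queue idle for 20 minutes: enforcement is lifted for this queue.
print(primetime_enforced(datetime(1999, 1, 4, 17, 40), prime_end,
                         timedelta(minutes=20), backtime, idletime))
```

Because the idle time is tracked per queue, a busy queue remains subject to primetime enforcement even when a neighboring idle queue has already started non-primetime early.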
Both NP_DRAIN_BACKTIME and NP_DRAIN_IDLETIME must be specified in the
configuration file. Sending a SIGHUP to or restarting the
scheduler will reset the idle timers on each queue.
Note that, while this option will cause the primetime enforcement to be disabled, it does not cause the jobs to be re-sorted into non-primetime order. This may cause somewhat surprising results as the scheduler may not choose the largest, longest job to be run first. This is a known bug in the implementation, but in practice appears not to be a show-stopper.
NONPRIME_DRAIN_SYS (required)
ENFORCE_PRIME_TIME (required)
MAX_JOBS <integer> [deprecated]
MIN_JOBS <integer> [deprecated]
[...] "max_running" resource). Specifying a minimum job
count is basically meaningless at this point, but the maximum might be useful
in some rare instances.
One possible (but not very probable) use for this option would be to limit the job count due to some resource issue. An example configuration might specify:
# Only allow two jobs to run on any host. Our latest changes
# to the OS have made it unstable with more than two running jobs.
# This is only for the weekend until we have time to fix it and
# reboot the systems.
MAX_JOBS                2
If you wish to limit the number of jobs that can be run in a specific queue, you may do so by setting the "max_running" resource on the queue with qmgr(1). This gives much finer-grained control, and with one queue per host, can be used to control the maximum number of jobs running on a host.
Note that the scheduler will not oversubscribe the execution host, regardless of the values specified for minimum or maximum running job counts.
TARGET_LOAD_PCT <integer>[%] [deprecated]
TARGET_LOAD_VARIANCE <variance> [deprecated]
[...] TARGET_LOAD_PCT, within the tolerances set
by the +/- specifications of the TARGET_LOAD_VARIANCE. By default, the
target load is 90%, with a -15%/+10% variance.
This set of options is more-or-less useless at this point (as the Origin is now scheduled node-by-node, rather than by load average). However, the scheduler will not schedule jobs on a host whose load average (as reported by the PBS mom) is higher than the sum of the target load and positive variance. This may prove useful to prevent new jobs being scheduled in the case of a runaway job or unauthorized use of the system:
# Don't schedule hosts with a higher-than-expected load average.
TARGET_LOAD_PCT         90%
TARGET_LOAD_VARIANCE    -90%,+10%   # From 0-100% is okay.
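As a rough illustration of the load check, the following Python sketch parses a variance argument and applies the "target plus positive variance" test. The helper names are hypothetical, and the sketch assumes the load reported by the PBS mom has already been converted to a percentage; the real scheduler may represent it differently.

```python
def parse_variance(arg):
    """Split an argument such as '-90%,+10%' into (minus, plus) percentages."""
    minus, plus = 0, 0
    for item in arg.split(","):
        value = int(item.strip().rstrip("%"))
        if value < 0:
            minus = -value
        else:
            plus = value
    return minus, plus

def host_overloaded(load_pct, target_pct, plus_variance):
    """True if the host load exceeds the target plus the positive variance."""
    return load_pct > target_pct + plus_variance

minus, plus = parse_variance("-90%,+10%")
print(host_overloaded(130, 90, plus))   # a runaway job pushes the load past 100%
print(host_overloaded(85, 90, plus))    # normal load, host may be scheduled
```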
ENFORCE_ALLOCATION <boolean>
The size of the allocation for any given group is usually determined by the expected computing needs of that group, balanced by the financial contribution towards the cost of the machine made by the group's funding source(s). Choosing allocations for a given set of groups is a difficult process, which is outside of the scope of this document.
As users run jobs on the system, the node-hours consumed by their jobs are charged against their allocations. When the allocation for their group reaches zero, they must request and be granted more time from their program manager before they will be allowed to run any more jobs.
If ENFORCE_ALLOCATION is enabled, the scheduler will enforce the
allocations for groups. Each PBS job is tagged with the default UNIX
group of the submitter. On each iteration, the scheduler looks up the
current node-hour usage for that group. If the group has exceeded their
allocated usage, the job will be rejected. A message is sent to the user
informing them of the current and allocated usage for the group.
For more details on how the "current usage" of a group is maintained, see
the description of the SCHED_ACCT_DIR option. The SCHED_ACCT_DIR option
is required -- the scheduler uses files from this directory to enforce the
allocations.
An example of configuring allocations support:
# Starting with FY'99, our site will be charging for access to
# our machines. Need to start tracking allocations at that time.
ENFORCE_ALLOCATION 11/01/1998
SCHED_ACCT_DIR /PBS/server_priv/pbs_acctdir
SCHED_ACCT_DIR (required)
SCHED_ACCT_DIR <pathname>
ENFORCE_ALLOCATION must also be specified with this option.
The "allocations" and "current" files are maintained by an accounting package called ACCT++, which was developed by NAS. The managers of each high-level project are responsible for generating the lists of groups, users within those groups, their responsibilities, and the allocation granted to the group. ACCT++ distributes this list of allocations, and maintains the database of "current usage" for each user and group.
The scheduler watches the timestamp on both files, and re-reads the
contents of a file whenever the timestamp is touched. Sending a
SIGHUP to or restarting the scheduler will cause both files to
be re-read as well.
The "allocations" file:
The "allocations" file is a flat text file containing a list
of allocation records, one per line, formatted like this (the leading
'P= ' is required for historical reasons):
P= <title/name of research> = <UNIX groupname> = <allocation>
For instance, the following record might be used for staff users, who are typically given an unlimited allocation for debugging, testing, etc.:
P= Systems Support Staff = staff = -1.0
An allocation for a set of researchers doing numerical analysis might be:
P= Numerical Analysis Research Group = g10003 = 1500.0
In this case, the research group has been assigned the gid "g10003", and has been granted an allocation of 1500 node-hours for the operational year.
The following special values are defined for the node-hour field:
| Allocation | Interpretation of Special Value |
|---|---|
| -1.0 | Unlimited allocation. Users in this group are not subject to any allocation. |
| 0.0 | NO allocation. Users in this group have been allocated no node-hours for this operational period. Any job submitted by this group will be immediately rejected. |
A notable potential "gotcha" is the fact that the group allocations table is a fixed size (currently 1024 entries). This is a known issue and will be fixed in a future release.
The "current" file:
A simple database of the recent resource usage of a group is maintained in
the file "$SCHED_ACCT_DIR/current" by the ACCT++
accounting system. This file consists of usage records for each user and
group, one per line, in the following format:
<username> <groupname> <jobs_run> <nodehours_used>
There is a record for every username/groupname tuple that has executed a job on the system since the start of the operational period. For example, John Smith and Gilbert Held, both in group "g10003", have been running jobs.
Gilbert has run 1 job that used 32 nodes for almost an hour. John has been running several 8-node jobs that each run about 1-1/2 hours. The current file would contain something like the following:
smithj g10003 15 203.587
gilbert g10003 1 31.192
The records in the file are parsed, and the total node-hour usage for each listed group is computed. Any group that is not listed is assumed to have used no node-hours during this operational period.
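The following Python sketch shows one way the two files could be combined into an allocation check. It is a simplified illustration, not the scheduler's implementation: the function names are hypothetical, and treating a group that is missing from the "allocations" file as having no allocation is an assumption made here for the sake of the example.

```python
def parse_allocations(lines):
    """Map UNIX group name -> allocated node-hours ('P= title = group = hours')."""
    allocations = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("P="):
            continue
        # Split from the right so an '=' inside the title cannot shift the fields.
        _title, group, hours = line[2:].rsplit("=", 2)
        allocations[group.strip()] = float(hours)
    return allocations

def group_usage(lines):
    """Sum the node-hours of every 'user group jobs hours' record per group."""
    usage = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue
        _user, group, _jobs, hours = fields
        usage[group] = usage.get(group, 0.0) + float(hours)
    return usage

def over_allocation(group, allocations, usage):
    """True if jobs from this group should be rejected."""
    allocated = allocations.get(group, 0.0)   # assumption: unlisted means none
    if allocated < 0:                         # -1.0 means unlimited
        return False
    return usage.get(group, 0.0) >= allocated

allocations = parse_allocations([
    "P= Systems Support Staff = staff = -1.0",
    "P= Numerical Analysis Research Group = g10003 = 1500.0",
])
usage = group_usage([
    "smithj g10003 15 203.587",
    "gilbert g10003 1 31.192",
])
print(usage["g10003"], over_allocation("g10003", allocations, usage))
```

With the sample records above, group "g10003" has used roughly 235 of its 1500 node-hours, so its jobs would still be accepted.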
Tracking allocations and resource usage:
Every time a job is created, changes state, or is moved to a different
queue, an entry is logged in the PBS accounting logs. The accounting logs
are located in the directory
$PBS_SERVER_HOME/server_priv/accounting, one file for each day.
These logs appear only on the server host, and allow an accounting system
to track jobs and their resource usages.
Of particular interest is the "E" record, logged when a running
job has exited. This record contains the user and group that submitted the
job, a list of the resources that it requested, and a list of the resources
the job actually used. From this, it is possible to determine the exact
number of node hours used by the job (by multiplying the value recorded in
"resources_used.walltime" by the
"Resource_List.ssinodes"). The ACCT++
software uses this data to track the actual node-hour usage, number of
jobs run, and other statistics for each group and user.
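As a worked example of the node-hour computation, the Python sketch below multiplies the used walltime by the requested ssinodes. It assumes the attributes of an "E" record are available as space-separated key=value pairs and that the walltime is written as HH:MM:SS; the function names and the sample record are illustrative only, and this is not the ACCT++ code.

```python
def walltime_hours(walltime):
    """Convert an 'HH:MM:SS' walltime string to hours."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours + minutes / 60 + seconds / 3600

def node_hours(attributes):
    """attributes: space-separated key=value pairs taken from an "E" record."""
    fields = dict(item.split("=", 1) for item in attributes.split())
    return (walltime_hours(fields["resources_used.walltime"])
            * int(fields["Resource_List.ssinodes"]))

# Hypothetical record: an 8-node job that ran for one and a half hours.
sample = ("user=smithj group=g10003 Resource_List.ssinodes=8 "
          "resources_used.walltime=01:30:00")
print(node_hours(sample))
```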
At NAS, the "current" file is not exactly current -- it is only updated periodically by the ACCT++ accounting software. Between updates, the scheduler fills in the gaps by adding a job's requested node-hours to the "last known" value from the accounting software. For groups that are very close to their allocation limit and often submit jobs that request more walltime than necessary, this can cause the scheduler to incorrectly reject a job due to an artificially high usage. At the next update, however, the scheduler's current usage will be updated with the actual node-hour usage, and the group may continue to submit jobs.
This behavior is rarely a serious problem, and could be considered to be a
method of encouraging users to estimate their walltime requirements as
closely as possible. Choosing a good walltime is a difficult problem for
the user. A job that requests far more time than it needs may prevent
another user's job from running, even though it later turns out that the
other job could have run in the remaining time. Running longer jobs than
necessary may lower the user's scheduling priority (see the description of
the SORT_BY_PAST_USAGE option). However, specifying too short a
walltime will cause PBS to terminate the job before it completes.
ENFORCE_ALLOCATION (required)
SORT_BY_PAST_USAGE <boolean>
There are a number of ways users abuse the system -- for instance, by submitting a large number of jobs of different sizes, thereby increasing their chances of a job fitting into a "hole" left by other jobs. Other techniques we have seen include the use of self-submitting scripts that submit a new copy of themselves just before terminating. The new job "just happens" to be a perfect fit for the hole left by the job that just completed.
The obvious solution to this problem is to track each user's recent usage of the machine, and lower their scheduling priority based upon their accumulated usage. The more node-hours used by a specific user, the less likely it is that another of their jobs will be run. This algorithm allows the scheduler to provide fairer access to the limited resources of the machine. It is similar to the priority-based process scheduler used in most UNIX implementations.
When the SORT_BY_PAST_USAGE option is enabled, the list of jobs being
scheduled is permuted by a complex iterative process. This permutation
walks through the list of jobs, looking for the job owned by the user with
the least recent usage. It then shuffles that job to the top of the list
and adds the resources requested to the owner's usage (the usage is
tracked as if each job were run immediately). The algorithm then finds
the new user with the least recent usage, and continues.
The result of this permutation is a new list of jobs re-ordered into a more fair schedule, with the jobs owned by the least-active users placed at the top of the list. These jobs are then packed in a FIFO order, with backfilling to improve utilization. This algorithm, while not perfect, does provide any given user a chance to use the system, and will favor those who have not recently used the system.
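The permutation can be sketched in Python as follows. This is a simplified reading of the description above rather than the scheduler's own code: the job tuples, the function name, and the tie-breaking rule (the earliest job in the list wins) are assumptions made for illustration.

```python
def sort_by_past_usage(jobs, usage):
    """Reorder jobs so that the least-active users come first.

    jobs  -- list of (jobid, owner, node_hours_requested) in queue order
    usage -- mapping of owner -> recent node-hours used
    """
    usage = dict(usage)          # work on a copy; the caller's data is untouched
    remaining = list(jobs)
    ordered = []
    while remaining:
        # First remaining job owned by the user with the least recent usage.
        job = min(remaining, key=lambda j: usage.get(j[1], 0.0))
        remaining.remove(job)
        ordered.append(job)
        # Charge the request as if the job had been run immediately.
        usage[job[1]] = usage.get(job[1], 0.0) + job[2]
    return ordered

jobs = [("1", "alice", 10), ("2", "alice", 10), ("3", "bob", 5), ("4", "carol", 20)]
recent = {"alice": 0, "bob": 8, "carol": 30}
print([jobid for jobid, _owner, _hours in sort_by_past_usage(jobs, recent)])
```

In this example alice's first job moves to the top, but charging its request to her usage lets bob's job overtake her second one.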
The statistics of recent usage are periodically written to the file
$PBS_SERVER_HOME/sched_priv/decay_usage, and recovered when
the scheduler is initialized. Every 24 hours, the recent-usage figures
are "decayed" -- the current values are each multiplied by a decay
factor, and re-written to disk. See the description of the options
DECAY_FACTOR and OA_DECAY_FACTOR for more information.
# Permute the list of jobs based on recent past usage of the machine
# by each user. The statistics for each user will be written hourly
# to the file $PBS_SERVER_HOME/sched_priv/decay_usage.
# Every 24 hours, the usage statistics are decayed by the factors
# in DECAY_FACTOR and OA_DECAY_FACTOR.
SORT_BY_PAST_USAGE      True
Note that the permuted list of jobs often appears to be more-or-less random to the casual observer. This makes user support somewhat more difficult, as it is often difficult to determine why a particular job was chosen over an equally eligible (maybe even "better") job. Usually, this is related to the relative recent usage of the users in question.
DECAY_FACTOR <real>
OA_DECAY_FACTOR <real>
[...] DECAY_FACTOR (or OA_DECAY_FACTOR if the user's group is
over their allocation). This simple algorithm implements an exponential
decay, not unlike that of a UNIX load average (in that the usage can be
instantaneously raised, but is reduced exponentially).
The scheduler maintains the usage statistics internally. Every hour, it writes the statistics to the file $PBS_SERVER_HOME/sched_priv/decay_usage. Consulting this file can be helpful when debugging the sorting algorithms.
Values specified for these options are floating point numbers, and should
be under 1.0. The lower the number, the more quickly the past
usage statistics will fall to zero. A possible exception might be to make the
OA_DECAY_FACTOR 1.0 or larger -- this will cause users whose
groups are over their allocation to be at a continuous disadvantage in the
sorting process.
By default, the values of DECAY_FACTOR and OA_DECAY_FACTOR are
0.75 and 0.95, respectively. If you wish to be
more or less lenient, you may specify different values in the configuration
file:
# Forget recent usage fairly quickly for users that are still under
# their allocation, but don't be so generous with those who have
# used more than their share.
DECAY_FACTOR            0.5     # Cut usage in half each day.
OA_DECAY_FACTOR         0.95    # Make them wait it out.
Note that the scheduler must be running to decay the recent usage database, or the usage for all users may be higher than expected. Since the scheduler makes decisions based upon relative usage values, not the absolute numbers, this should have little impact on operations.
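To make the effect of the two factors concrete, the Python sketch below applies one daily decay step to a table of recent usage, using the default factors quoted above. The function name and the data layout are hypothetical; the scheduler's internal representation is not described here.

```python
def decay_usage(usage, over_allocation, decay_factor=0.75, oa_decay_factor=0.95):
    """Apply one 24-hour decay step to a {user: node_hours} mapping.

    over_allocation -- set of users whose group is over its allocation
    """
    decayed = {}
    for user, hours in usage.items():
        factor = oa_decay_factor if user in over_allocation else decay_factor
        decayed[user] = hours * factor
    return decayed

# Two users start with the same recent usage; gilbert's group is over allocation.
usage = {"smithj": 200.0, "gilbert": 200.0}
for day in range(1, 4):
    usage = decay_usage(usage, over_allocation={"gilbert"})
    print(day, usage)
```

After a few days the user whose group is over its allocation still carries most of the original usage, and so continues to sort behind the other user.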
SORT_BY_PAST_USAGE (required)
MAX_QUEUED_TIME <timespec>
This job "starvation" can continue indefinitely if the scheduler does not
actively correct it. The mechanism implemented in the Origin scheduler
(and previously in the IBM SP2 scheduler) is to place a limit on the
longest "reasonable" waiting time for a job. This waiting-time limit may be
specified by the MAX_QUEUED_TIME option.
Any job that has been queued in the submit queue for more than this time is marked "outstanding" and given a very high priority in the scheduler. The scheduler will ignore most policy-based restrictions (e.g. primetime) for outstanding jobs, even giving them higher priority than "special" jobs. Large jobs that have waited too long may even cause the system to be drained in order to free up the resources necessary to run them.
Note that this does not mean that the job will run immediately after the
MAX_QUEUED_TIME has elapsed, only that the scheduler will arrange for it
to run as soon as possible after it becomes outstanding.
For example, in order to "boost" jobs that have been waiting in the submit queue for over 2 days (not uncommon at large sites), add the following to the configuration file:
# Go out of our way to run any job that has waited in the queue
# for more than two days. The time period may need to be adjusted
# depending upon changes in the offered load.
MAX_QUEUED_TIME         48:00:00    # 48 hours maximum wait
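The test for an outstanding job amounts to a simple comparison. The Python sketch below is an illustration only: it assumes the time specification is written as HH:MM:SS, as in the examples in this guide, and the function names are hypothetical.

```python
def timespec_seconds(spec):
    """Convert an 'HH:MM:SS' time specification to seconds."""
    hours, minutes, seconds = (int(part) for part in spec.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def is_outstanding(queued_seconds, max_queued_time):
    """True once a job has waited longer than MAX_QUEUED_TIME."""
    return queued_seconds > timespec_seconds(max_queued_time)

print(is_outstanding(50 * 3600, "48:00:00"))   # waited 50 hours
print(is_outstanding(6 * 3600, "48:00:00"))    # waited 6 hours
```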
Unfortunately, this "heroism" on the behalf of the waiting job can cause
poor utilization of the machine, especially if draining is necessary. A
more serious problem can arise if MAX_QUEUED_TIME is too short relative to
the offered load and typical turnaround time. If waiting jobs become
commonplace, the scheduler will begin to thrash attempting to make each
one its highest priority. This is a strong indication that some part of
your policy is at odds with the offered load.
Common causes of thrashing due to outstanding jobs are over-use of the special queue (causing "normal" jobs to be starved), and long dedicated times or outages (which prevent any jobs from running). The option may be temporarily disabled to prevent the scheduler from thrashing in these cases.
Additionally, early PBS servers allowed users to exploit this mechanism by placing a job on hold for hours or days. When the user later released the hold, the job instantly became the highest priority. PBS 2.x servers now reset the job's 'etime' attribute when the hold is released, so it appears to be newly enqueued. A patch has been released that addresses this problem, and should be easily portable to 1.1.x PBS servers.
INTERACTIVE_LONG_WAIT
INTERACTIVE_LONG_WAIT <timespec>
[...] MAX_QUEUED_TIME, it is common for jobs on
busy systems to be enqueued for several hours, even days, before being
executed. For batch jobs, this level of delay is tolerable. However, if
the job is interactive, making the person sitting at the terminal wait
this long is unreasonable. By specifying an INTERACTIVE_LONG_WAIT time,
the administrator may attempt to mitigate this problem.
During primetime, if a job has been waiting for more than the sum of its
requested walltime and INTERACTIVE_LONG_WAIT, the scheduler will mark it
as "outstanding" (see MAX_QUEUED_TIME). This will make the scheduler
arrange for the now-overdue interactive job to be started as soon as
possible.
The maximum wait time was made dependent upon the requested walltime in order to favor users who are just trying to run a very quick test case. These jobs are typically short -- only a minute or so is usually required to see if a long-running batch version of the job will crash immediately. Our observations at NAS indicate that a time around 30 minutes to an hour works well for this option.
To enable this functionality, add the following to the configuration file:
# Provide a priority boost for interactive jobs during primetime.
# Interactive jobs waiting more than their requested walltime plus
# a half hour will be made outstanding.
INTERACTIVE_LONG_WAIT   00:30:00
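Because the limit depends on the requested walltime, short test jobs are boosted much sooner than long ones. The Python sketch below illustrates the rule with the half-hour setting shown above; the function name and its arguments are hypothetical, and all times are given in seconds.

```python
def interactive_outstanding(is_interactive, in_primetime, waited, walltime,
                            long_wait=1800):
    """True if an interactive job should be marked outstanding.

    waited    -- seconds the job has been queued
    walltime  -- requested walltime in seconds
    long_wait -- INTERACTIVE_LONG_WAIT in seconds (00:30:00 in the example)
    """
    return (is_interactive and in_primetime
            and waited > walltime + long_wait)

# A 5-minute test job that has waited 36 minutes is boosted ...
print(interactive_outstanding(True, True, waited=36 * 60, walltime=5 * 60))
# ... while a 2-hour interactive job that has waited just as long is not.
print(interactive_outstanding(True, True, waited=36 * 60, walltime=2 * 3600))
```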
Note that primetime must be active for this option to have any effect.
Also note that an interactive job that is made outstanding just before the
end of primetime will be allowed to override the NONPRIME_DRAIN_SYS
option, and will run across the primetime/non-primetime boundary.
MAX_QUEUED_TIME
ENFORCE_PRIME_TIME (required)
SORTED_JOB_DUMPFILE <pathname>
In addition to the job IDs, the dumpfile also lists the requested nodes and walltime for the job, the job's owner and group, and the amount of time the job has been queued and eligible. It also lists various flags from the scheduler's internal representation of the job.
# Dump the sorted job list into a file in the sched_priv directory.
SORTED_JOB_DUMPFILE     /PBS/sched_priv/sortedjobs
Note that this file will be created with owner/group root and readable only by the owner, if it does not exist. If the file exists, the permissions are not changed -- the contents are merely overwritten with each iteration. The administrator should decide if the sorted job file should be world-readable or not.
The possible flags, and their meanings are listed below:
| Flag | Meaning/Effect of Flag On Job |
|---|---|
| Int | Job has PBS 'interactive' resource set to true. |
| High | Job is queued in the SPECIAL_QUEUE, and has high priority. |
| Wait | Job is marked as outstanding (MAX_QUEUED_TIME or INTERACTIVE_LONG_WAIT). |
| Ded | Job is queued in one of the DEDICATED_QUEUES. |
| HPM | Job has requested access to the HPM counters. |
| Run | Job requests that it be run only on a specific host. |
SPECIAL_QUEUE
DEDICATED_QUEUES
MANAGE_HPM_COUNTERS
MAX_QUEUED_TIME
INTERACTIVE_LONG_WAIT
MANAGE_HPM_COUNTERS <boolean>
IRIX provides an interface to the on-chip event counters on the MIPS CPUs
used in the Origin2000. These counters can be set up to count, among other
things, the number of clock cycles that have elapsed, and integer and
floating point instructions that have been executed. From these numbers,
a FLOPS and MIPS rating for the entire machine can be constructed. These
can then be used to determine if the machine is operating as promised, and
how efficiently it is being utilized as a resource. See the manpage for
r10k_counters for details on the hardware performance counters.
While the global view is interesting and important, many users wish to use
these counters to help optimize the performance of their applications. The
operating system can also be told to monitor the execution of just the
user application (ignoring global events). Applications like
perfex(1) and SpeedShop (cf. ssrun(1)) use these
counters to make their performance evaluations.
Unfortunately, the counters cannot be operated in both "global"
and "user" modes at the same time. In order to allow the
global statistics to be collected unless a user wishes to use the counters
in "user" mode, the
scheduler can request that the PBS mom place the counters into user mode
for a job that requests them, then return them to system-wide monitoring
when the jobs that use them have completed.
The scheduler manages the HPM counters on the execution hosts by querying
the 'hpm_ctl' resource on the PBS mom. In the PBS mom's
configuration file, there should be an entry like this:
hpm_ctl !/usr/local/pbs/sbin/hpm_ctl %mode
The 'hpm_ctl' program should expect a single argument, and
return one of the following responses:
| %mode argument | Effect on HPM counters on execution host |
|---|---|
| query | Prints either "user" or "global", depending upon state of counters. Outputs "???" if unable to determine the state. |
| user | Attempts to set the counters to user mode. Outputs "OKAY" or "FAILED", depending upon success. |
| global | Attempts to set the counters to global mode. Outputs "OKAY" or "FAILED", depending upon success. |
| revoke | Attempts to revoke the counters by killing processes identified as holding a lock on the counters. This action is very expensive and potentially dangerous - use it carefully. Requires the 'ecfind' script supplied by SGI (which it uses to grovel in the kernel). |
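The request and response protocol in the table can be mocked up as follows. This Python sketch is not the hpm_ctl program used at NAS: the FakeCounters class is a hypothetical in-memory stand-in for the real counter interface, and the "revoke" mode is deliberately left out because its output is not specified above.

```python
class FakeCounters:
    """Hypothetical stand-in for the hardware counters (state kept in memory)."""

    def __init__(self, state="global"):
        self.state = state

    def get(self):
        return self.state

    def set(self, state):
        self.state = state
        return True           # a real implementation could fail here


def hpm_ctl(mode, counters):
    """Return the response string for a single %mode argument."""
    if mode == "query":
        state = counters.get()
        return state if state in ("user", "global") else "???"
    if mode in ("user", "global"):
        return "OKAY" if counters.set(mode) else "FAILED"
    raise ValueError("unsupported mode: %s" % mode)


counters = FakeCounters()
for mode in ("query", "user", "query", "global"):
    print(mode, "->", hpm_ctl(mode, counters))
```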
The hpm_ctl script used by NAS records the state of the
counters before switching in and out of user mode. This allows system
analysts to construct a continuous profile of the performance of the
machine, with occasional blank spots where the counters were in
"user" mode.
To enable the HPM support, set the MANAGE_HPM_COUNTERS option to True in
the configuration file:
# By default, the HPM counters run in global-monitoring mode. In
# order for user jobs to access them, they must be set to "user"
# mode. The scheduler manages the counters for jobs that request
# the counters with the '-l hpm=1' job resource request.
MANAGE_HPM_COUNTERS     True
Jobs that wish to use the counters (or the utilities that rely on them)
must specify this as a requested resource. The job must be submitted with
the '-l hpm=1' flag to qsub(1), or the
qalter(1) command may be used to set the resource. Any job
attempting to use the HPM counters without requesting them via
'-l hpm=1' will fail or be terminated.
REVOKE_HPM_COUNTERS
REVOKE_HPM_COUNTERS <boolean>
However, sometimes the counters are in user mode (because another running
job correctly specified the '-l hpm=1' attribute). The second
job can then acquire a reference on the counters, and will happily run as
if it had requested the counters itself.
The problem arises when the original well-behaved job terminates. As far as the scheduler can tell, the counters are now free to be returned to global mode. However, the request to return them to global monitoring will fail since the second job still maintains its reference. Although the scheduler will attempt to reclaim the counters on each iteration, it is possible for jobs to pass the illicit reference to the counters back and forth until there happen to be no running processes using the counters.
If the hpm_ctl script supports a "revoke" mode
(see the description in MANAGE_HPM_COUNTERS above), the scheduler can use
it to attempt to revoke the counters from the jobs that did not request it.
Setting the option REVOKE_HPM_COUNTERS to "True" will enable
this functionality.
Be aware that on some systems, using the ecfind(1) script to
discover what PIDs have a reference on the counters can take a long period
of time, so consider setting the scheduler's alarm to a larger value (e.g.
'pbs_sched -a 90 ...'). The ecfind(1) script may
also crash the operating system. Use this option at your own risk.
MANAGE_HPM_COUNTERS (required)
SCHED_RESTART_ACTION <string>
[...] SIGKILL), the jobs that were running at the time of the crash
may be left in the execution queue, but in the "Queued" state.
As these jobs are no longer in the submit queue, they will remain orphaned in
the execution queues unless acted upon by the scheduler.
The SCHED_RESTART_ACTION option defines how the scheduler should handle these jobs.
Its argument must be one of the following strings:
| Method | Effect on queued jobs in execution queues |
|---|---|
| none | Leave the jobs in the execution queues, where they will be ignored until manually moved back into a submit queue. This is the default disposition. |
| restart | Assume that any jobs queued in the execution queue were running at the instant the system crashed. Re-run each job found queued, then recycle and start normal scheduling cycle. Restart is the most robust of the three actions. |
| resubmit | Return each job to its original queue (named by the variable PBS_O_QUEUE), then start scheduling. The queued jobs will not maintain their original priority or ordering in the queues. |
# Restart any jobs that were running when the system crashes.
# These may bomb out immediately (esp. if they were interactive)
# so recycle and start a new scheduling run afterwards.
SCHED_RESTART_ACTION    restart
SMALL_QUEUED_TIME <timespec> [experimental]
It is not clear what effect, if any, this option will have on a normal workload.
MAX_QUEUED_TIME (required)
INTERACTIVE_LONG_WAIT
AVOID_FRAGMENTATION <boolean> [experimental]
This experimental algorithm was implemented to allow the scheduler to recover from this queue fragmentation. The algorithm computes the size of a "fragment" by dividing the total node resources in a queue by the maximum number of jobs. If the queue is empty, it will allow any job to be run (and recover later if necessary). If it discovers that the average size of a job in that queue is less than a fragment, it will refuse to allow the fragmentation to continue. The scheduler will only start a job that requests less than a fragment if it will run no longer than the time until the queue is expected to be emptied. Any job larger than a fragment will be allowed to run, as it will not contribute to the fragmentation problem.
On the whole, this algorithm only slightly improved turn-around and utilization, but was often confusing to the users and staff. It is left mostly for historical reference. Use of this feature is discouraged.
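For historical reference, the fragment test can be sketched in Python as below. This is one possible reading of the description above, not the scheduler's code: the function name and arguments are hypothetical, and allowing a job that is exactly one fragment in size is an assumption.

```python
def may_start(job_nodes, job_walltime, running_nodes, queue_nodes, max_jobs,
              time_until_empty):
    """Decide whether starting a job would worsen queue fragmentation.

    running_nodes    -- node counts of the jobs now running in the queue
    queue_nodes      -- total node resources of the queue
    max_jobs         -- maximum number of jobs allowed in the queue
    time_until_empty -- seconds until the queue is expected to be emptied
    """
    fragment = queue_nodes / max_jobs
    if not running_nodes:
        return True                       # empty queue: allow anything
    if job_nodes >= fragment:
        return True                       # large jobs do not fragment the queue
    average = sum(running_nodes) / len(running_nodes)
    if average >= fragment:
        return True                       # queue is not fragmented yet
    # Fragmented queue: a small job may only run if it finishes in time.
    return job_walltime <= time_until_empty

# 64-node queue limited to 4 jobs: a fragment is 16 nodes.
print(may_start(4, 7200, [4, 8], 64, 4, time_until_empty=3600))
print(may_start(4, 1800, [4, 8], 64, 4, time_until_empty=3600))
```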
TEST_ONLY <boolean>
TEST_ONLY will cause the scheduler to just "go through the motions". To
be specific, a normal scheduling cycle will be performed, but instead of
performing any action which could change the state of the batch system, a
message will be logged.
The TEST_ONLY option is very useful when making changes to the layout of a
system, adding or removing queues or machines, or changing policies. It
is also especially handy when adjusting the accounting machinery, since it
will prevent the scheduler from deleting jobs in the case of mishaps with
the allocations files.
To enable the test mode:
# "Test-only" mode. Log any actions that would affect the state
# of the machine, instead of changing things.
TEST_ONLY               True
FAKE_MACHINE_MULT
FAKE_MACHINE_MULT <integer>
[...] FAKE_MACHINE_MULT option may be useful in these
cases for testing how the scheduler will react on a large machine. If
FAKE_MACHINE_MULT is set to a non-zero value, all machine resources are
multiplied by that value.
For example, the following configuration will allow an administrator to smoke-test a 256-processor scheduler configuration on an 8-processor test machine:
# "test" submit queue -- don't test with the real "submit" queue
SUBMIT_QUEUE            test
# "fake" 256-processor batch queue
BATCH_QUEUES            q256p
# Run in testing-only mode, treating this 8-p machine as a 256.
# All node counts, load average, etc, will be multiplied by 32.
TEST_ONLY               True
FAKE_MACHINE_MULT       32
Note that your test batch queues must be configured in PBS to have the
correct "resource_max.*" limits for a 256-p machine.
TEST_ONLY (required)