The downside of this approach is that the resources in the machine (memory, CPU/thread scheduling, I/O bandwidth, etc.) are allocated and scheduled much as if the machine were a single-user workstation. While this allows interactive and event-based computing (e.g. database and WWW servers, general-purpose compute servers, etc.) to make good use of the resources, considerably more control is necessary to support a batch job execution environment. Over the past two years, the NAS Parallel Systems group has been developing software and making modifications to the operating system to support the unique requirements of a large batch system on a single-system image NUMA machine.
The basic requirements for which these changes were made are:
The changes and policies made by the Parallel Systems group fall into five basic categories:
Each module can also be interconnected by external cabling to several other modules, allowing module-by-module expansion of the machine to suit the budget and computing needs of the organization. SGI ships machines ranging from 8 to 256 processors in size, with a 512-processor machine in development (a joint project with NAS). The topology of the connecting network becomes increasingly complicated as the number of processors or nodes grows, but is basically a modified hyper-cube architecture.
The Origin2000 belongs to a class of machines designated CC/NUMA, for "cache-coherent non-uniform memory architecture". This means that the time to access any given memory space of the machine is non-uniform (memory accesses between nodes are routed across the interconnecting network), and that the hardware presents the illusion of a large, contiguous physical memory space. A cache-line ownership and invalidation protocol performed by the hardware guarantees that every processor on the machine is given a coherent view of the memory space on every node.
For very large (i.e. 256-processor and larger) machines, the cache coherency protocol enters a "coarse" operating mode. This is necessary due to limitations in the hardware that restrict the number of distinct cache-lines. When in coarse mode, cache invalidations affect entire modules instead of just the node on which the memory lies.
I/O and network peripherals are attached to specific nodes in the internal network. Accesses to the I/O device are handled by the node to which it is connected (i.e. interrupts and DMA are serviced by the processors and memory on that node). For off-node requests, the request is forwarded to the connected node, and data is transferred back to the requesting node. For high-interrupt or high-bandwidth devices (e.g. HIPPI or Fibre-Channel interfaces), the memory and processing load on the attached node can be significant.
Additionally, CPU 0 is considered "special" by much of the kernel. The kernel runs mostly on CPU 0. Most of the memory used by the kernel is allocated at boot-time from the memory on node 0 (which is attached to cpus 0 and 1). A small amount of memory (~8MB) is also allocated on each node to support the kernel heap.
The architecture of the machine suggests that a portion of the nodes be dedicated to "system" usage (e.g. running the kernel, login sessions, various daemons, etc.), and the remainder of the machine be reserved for exclusive use by batch jobs. The primary goal of splitting the machine into the "system" and "compute" resources is to reduce the amount of interference to running jobs caused by system overhead. This overhead is composed mainly of kernel overhead, device interrupts and exception threads, and system daemons and user processes.
For small machines (i.e. 128 processors or fewer), a "system" partition of 4 cpus (2 nodes) is probably sufficient. For machines with 256 or more processors, a full 8-cpu partition (4 nodes) is reasonable -- with the "coarse-mode" cache coherency in effect, using a full rack (two modules) will produce a noticeable performance boost. The resulting layout (for example, a 64-processor, 32-node machine) can be visualized as follows:
node  0             7               15              23              31
     [ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ]
      \_/ \_________________________________________________________/
       |                               |
"system" nodes                  "compute" nodes
In order to prevent device interrupts from interfering with the execution of user processes, it is recommended that all active devices be physically connected to one of the "system" nodes. This will prevent a CPU that is executing a user process from being periodically interrupted to service an I/O device.
IRIX maintains an affinity between a process and its first-touch memory. For a well-behaved set of processes, this tends to maintain memory to process locality, and jobs tend to remain clustered around their initial execution node. However, once a process overruns the memory local to its node, IRIX attempts to honor the allocation from another memory "close" to the requestor. This causes the locality to break down, and (depending on how much of the remote node's memory is allocated) can cause another process to be forced to go off-node for memory. This situation can quickly go from bad to worse on a heavily subscribed machine, causing the memory for any set of processes to become scattered around the machine.
A kernel interface was provided in IRIX 6.2 (and carried forward into the 6.5 branch) to allow a process to specify the nodes from which it would "prefer" to allocate memory. The kernel was supposed to attempt to allocate memory from this set of nodes (the "nodemask") first, which would cause the processes in the job to remain clustered around those nodes. Unfortunately, this mechanism was basically a "hint" that the kernel was not required to honor.
Even more unfortunately, it was discovered about two years after NAS began using it that the nodemask interface did not, in fact, work. The loop indices in the "next nearest node" lookup routine were reversed, causing the kernel to choose nodes essentially at random. This was corrected, but the improvement in job performance was minimal. The nodemask interface was soon abandoned in favor of the use of cpusets.
A cpuset is a collection of CPUs to which a process (and all of its children) can be bound. The cpuset can be made exclusive, so that only processes that are attached to it may access a CPU within it. Exclusive cpusets also constrain attached processes to run only on the CPUs specified in the cpuset. For instance, an exclusive cpuset could be created that contains CPUs 4, 5, 6, and 7. These CPUs would be accessible only to processes that were attached to the cpuset, and those processes could only run on one of those four cpus. If 8 processes were required to run in this cpuset, at most 4 of those eight would be runnable at any time (as only 4 cpus are available to execute threads at any given instant).
Cpusets do an excellent job of restricting the CPU usage of a set of processes. However, in stock IRIX, there are no restrictions on the memory usage of the processes in a cpuset. As memory is not a pre-emptable resource like CPUs are, it is critical that it not be over-subscribed or allowed to be allocated to "unauthorized" processes. One of the major modifications made to IRIX by NAS was to provide memory controls on cpusets.
When the cpuset is created (with the appropriate flags set), the kernel computes an "effective nodemask" for that cpuset. This nodemask contains a '1' bit for every node corresponding to a CPU in the cpuset. When a process associated with the cpuset attempts to allocate a page, the kernel first attempts to get a page from the local node's memory. If this fails, the effective nodemask is consulted for the next-nearest node, and the kernel attempts to allocate the page from that memory. This process continues for each node in the effective nodemask, until all of the nodes have been exhausted. The choice of "next-nearest node" follows a much-improved algorithm over the stock IRIX memory allocator.
At this point, there is no un-allocated memory on any of the nodes in the
effective nodemask. As a last resort, the kernel then walks through the
set of nodes in the effective nodemask, looking for any named memory
pages. These pages are local backing store for objects like shared
libraries, mmap(2)'d disk files, etc. Any pages whose contents can be
reconstructed on demand are flushed, and placed back on the free list.
Once the reclamation process (affectionately dubbed "the squeeze")
completes, if any pages were discarded, the allocation request is granted.
If, after the squeeze, there are no available free pages on the nodes,
then a SIGKILL is enqueued for every process in the session group. This
signal is tagged with a special flag that indicates that it was generated
due to a resource limit being exceeded. The kernel then returns the
thread to the run queue. Upon transitioning from the kernel to user mode,
the SIGKILL is delivered, and all members of the session are killed. With
this simple but elegant mechanism in place, it is possible to proactively
control the resource usage of processes directly in the kernel.
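The allocation order just described can be modelled in a few lines of Python. This is only a toy model, not the kernel code: page counts stand in for real memory, "nearest" is approximated by the difference in node numbers (the real topology-aware ordering is not described here), and an exception stands in for the flagged SIGKILL. All names are illustrative.

```python
class CpusetLimitExceeded(Exception):
    """Stands in for the flagged SIGKILL sent to every process in the session."""


def allocate_page(local, nodes, free, named):
    """Pick a node for one page, following the order described in the text.

    local -- node the requesting process is running on
    nodes -- nodes in the cpuset's effective nodemask
    free  -- dict: node -> number of free pages
    named -- dict: node -> number of reclaimable ("named") pages
    """
    # 1. Local node first, then the other nodes in the mask, nearest first.
    #    Assumption: distance is approximated by node-number difference.
    order = sorted(nodes, key=lambda n: (n != local, abs(n - local)))
    for node in order:
        if free[node] > 0:
            free[node] -= 1
            return node

    # 2. "The squeeze": discard reconstructible pages on every node in the mask.
    reclaimed = 0
    for node in nodes:
        reclaimed += named[node]
        free[node] += named[node]
        named[node] = 0

    # 3. Grant the request if anything was reclaimed, otherwise kill the session.
    if reclaimed == 0:
        raise CpusetLimitExceeded("no free or reclaimable pages in cpuset")
    return allocate_page(local, nodes, free, named)


# Node 2 has one free page; node 3 only has a reclaimable page.
demo_free = {2: 1, 3: 0}
demo_named = {2: 0, 3: 1}
print(allocate_page(2, [2, 3], demo_free, demo_named))
```

A second request in the same state would trigger the squeeze and be satisfied from node 3; a third would raise the exception, mirroring the point at which the kernel gives up and kills the session.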
Note that no attempt is made to control virtual-memory. Having a large
address space that is sparsely committed with physical pages makes the
solution of many CFD-style applications much more efficient. A PBS job
may request a virtual memory limit (e.g. 'qsub -l vmem=1gb') but users
rarely do so, and the value of it is questionable on these machines.
In order to prevent random user processes from wandering out into the
"compute nodes" and allocating memory from them, they should be restricted
to the "system" nodes. To accomplish this, NAS modified the IRIX init(1)
binary to create a cpuset at boot-time and attach itself (and all
children) to it. This cpuset, named "system", should cover the nodes
reserved for system use. It is configured by adding a line like the
following to /etc/inittab and rebooting:
stup:c:off:2
This will cause the "system" cpuset to be created covering the first 2
nodes. For large machines, the number should be '4'. init(1) will attach
itself to this cpuset, and all children of init(1) will inherit this
affinity. Since all processes on the system are descendants of init(1),
this will cause all processes to be constrained to the memory and cpus
located in the nodes reserved for system use, leaving the rest of the
system empty and idle. Incidentally, "stup" is short for "startup".
Note that CPU 0 cannot be included in a cpuset. Setting up the "system" cpuset also prevents user processes from running on CPU 0 where they might increase the time required for the kernel to acquire resources.
Before the introduction of memory control provided by the NAS cpuset modifications, there was no way to guarantee that a minimum amount of memory would be available on any given node. When using cpusets, however, only a very small amount of memory is taken up by kernel and OS overhead. Nearly the entire physical memory on each node is available for allocation by user processes. A conservative estimate (used by PBS) of the free memory is 490MB/node -- in practice, the actual free memory typically runs from 495MB to 500MB/node.
Because of the physical interdependency between the cpus and memory on a
node, it is somewhat fallacious to schedule them as orthogonal resources.
The PBS scheduler will adjust the 'ncpus' and 'mem' requested by jobs to
the appropriate values for the smallest number of nodes that meets or
exceeds both requested resource values. The PBS job scheduler views the
machines as sets of schedulable nodes, scheduling jobs onto the
non-reserved nodes based on their requested resources.
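The rounding of 'ncpus' and 'mem' to whole nodes amounts to a pair of ceiling divisions. The sketch below uses the figures given in this document (2 cpus and a conservative 490MB per node); the function names and the one-node minimum are assumptions made for illustration, not the scheduler's actual code.

```python
CPUS_PER_NODE = 2       # each node has 2 cpus
MEM_PER_NODE_MB = 490   # conservative free-memory estimate per node


def nodes_required(ncpus, mem_mb):
    """Smallest node count that meets or exceeds both requests."""
    by_cpu = -(-ncpus // CPUS_PER_NODE)       # ceiling division
    by_mem = -(-mem_mb // MEM_PER_NODE_MB)
    return max(by_cpu, by_mem, 1)             # assumption: at least one node


def adjusted_request(ncpus, mem_mb):
    """Return (nodes, ncpus, mem_mb) rounded up to whole nodes."""
    nodes = nodes_required(ncpus, mem_mb)
    return nodes, nodes * CPUS_PER_NODE, nodes * MEM_PER_NODE_MB


# A 4-cpu job asking for 2048MB is memory-bound: it needs 5 nodes.
print(adjusted_request(4, 2048))
```

In this example the memory request, not the cpu count, determines the size of the job, so it is granted 10 cpus and 2450MB.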
When the scheduler decides to have PBS run a job, it sets the job's
"Resource_List.ssinodes" resource to the number of nodes required to run
this job. This value is passed to the PBS mom on the execution host when
the job is started. The PBS mom then chooses a set of nodes that
satisfies the job's resource request, assigns them to the job, and
executes the job script.
Note that the scheduler no longer calculates, assigns, stores or modifies
the nodemask of the jobs. Although the "nodemask" job attribute is still
provided, it exists mainly for debugging and backwards-compatibility.
The PBS mom, after assigning the nodes for the job, updates the job
attribute "Resources_used.nodemask". This string, a long hexadecimal
number corresponding to the job's effective nodemask, is recorded in the
accounting and server logs, making it easier to determine the placement of
the job on the machine.
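To make the relationship between a cpuset's CPUs and the recorded nodemask concrete, here is a minimal Python sketch. It assumes that node n holds CPUs 2n and 2n+1 (consistent with node 0 holding cpus 0 and 1) and that bit n of the mask stands for node n. The function names are illustrative and are not part of PBS or IRIX.

```python
CPUS_PER_NODE = 2  # assumption: node n holds CPUs 2n and 2n+1


def effective_nodemask(cpus):
    """Return an integer bitmask with bit n set for each node owning a listed CPU."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << (cpu // CPUS_PER_NODE)
    return mask


def nodes_in_mask(mask):
    """List the node numbers whose bits are set in the mask, lowest first."""
    return [n for n in range(mask.bit_length()) if (mask >> n) & 1]


# CPUs 4-7 live on nodes 2 and 3, so bits 2 and 3 are set.
mask = effective_nodemask([4, 5, 6, 7])
print(hex(mask), nodes_in_mask(mask))
```

Reading the mask back into a node list, as nodes_in_mask does, is the step one would perform by hand when working out a job's placement from the logs.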
Resource_List.ssinodes". The
value of this resource is set by the scheduler to the number of nodes
(each node having 2 cpus and 490MB) required by the job. The mom chooses
a set of unallocated nodes to be assigned to the job, and creates an
exclusive cpuset named with the first 8 characters of the jobid (for
instance, "123.pigl"). The job shell is then attached to the cpuset, and
the script begins executing.
The newly-created cpuset grants the job access to the memory of each node that was assigned, as well as both CPUs on that node. The job can make use of these resources any way the submitter wishes, as no other process can use the assigned cpus or memory. However, the job is also constrained to these resources -- it can only allocate the memory on those nodes, and it can only run processes on the CPUs that have been allocated to it.
After the job completes its execution (or is terminated for some reason), the cpuset is destroyed, the nodes returned to the free pool, and the system is ready to run one or more new jobs. The main difficulty with cpusets is that they cannot be destroyed (thus allowing their resources to be granted to another job) until no processes remain attached to them. In most cases, all processes have terminated when the job exits, and the cleanup proceeds smoothly. However, it is possible for processes to be stuck and unkillable (typically while writing very large coredumps over NFS), causing PBS to be temporarily unable to destroy the cpuset.
When PBS determines that a cpuset is "stuck" (i.e. the job has terminated but the cpuset cannot be reclaimed), it places the job on a list of "stuck cpusets" and logs the event. The resources are in limbo at this point -- they are not officially in use, but they cannot be returned to service until the operating system releases them. The mom periodically walks down this list of stuck cpusets, and attempts to reclaim each set. If the cpuset can now be destroyed, the nodes that were held by that cpuset are returned to the free node pool, and are available for immediate use by another job.
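The bookkeeping for assigned nodes and stuck cpusets might look roughly like the toy model below. It is a sketch under stated assumptions: the class and method names are invented for illustration, the lowest-numbered free nodes are chosen (the mom's real selection policy is not described here), and try_destroy is a hypothetical callback standing in for the operating system call that removes a cpuset.

```python
def cpuset_name(jobid):
    """First 8 characters of the job id, as in the "123.pigl" example."""
    return jobid[:8]


class NodePool:
    """Toy model of the mom's free-node pool and stuck-cpuset list."""

    def __init__(self, nodes):
        self.free = set(nodes)
        self.assigned = {}   # cpuset name -> set of nodes
        self.stuck = []      # names of cpusets that could not be destroyed

    def assign(self, jobid, count):
        """Take `count` free nodes for the job (assumption: lowest-numbered first)."""
        if count > len(self.free):
            raise RuntimeError("not enough free nodes")
        chosen = set(sorted(self.free)[:count])
        self.free -= chosen
        self.assigned[cpuset_name(jobid)] = chosen
        return chosen

    def release(self, jobid, try_destroy):
        """Destroy the job's cpuset, or park it on the stuck list."""
        name = cpuset_name(jobid)
        if try_destroy(name):
            self.free |= self.assigned.pop(name)
        else:
            self.stuck.append(name)

    def retry_stuck(self, try_destroy):
        """Periodic sweep: reclaim any stuck cpuset that can now be destroyed."""
        still_stuck = []
        for name in self.stuck:
            if try_destroy(name):
                self.free |= self.assigned.pop(name)
            else:
                still_stuck.append(name)
        self.stuck = still_stuck
```

The nodes of a stuck cpuset stay out of the free pool until a later sweep succeeds, which matches the "limbo" state described above.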
In addition, any extra threads will not be scheduled in another job's cpuset, even if the CPUs are idle. This is a significant change from the work-stealing and load-balancing algorithms usually employed by MP machines that do mostly interactive and server work. It was decided early on that it was best to reduce inter-job interference by preventing jobs from stealing CPU cycles from one another.
The memory limits are somewhat less forgiving. An attempt to allocate
more memory than was assigned to the job will cause the processes in the
session to be terminated with a SIGKILL. The signal is enqueued by the
kernel, and a special flag is set that allows the parent to differentiate
between a "normal" SIGKILL (e.g. 'kill -9 -<pgroupid>') and one delivered
on behalf of the memory subsystem.
PBS checks for this flag when it reaps processes from terminated jobs, and tags ill-behaved jobs with the following line on stderr (a message is also logged in the mom logs):
>> Job 170107.fermi.nas.nasa.gov exceeded resource allocation -- killed
Unfortunately, PBS becomes aware of the problem only after the job has been terminated and memory is being reclaimed by the system. There is no way to accurately compute the amount of memory that was allocated immediately before the fatal allocation request. Because of the use of cpusets and their ability to prevent unauthorized use of memory, the job must have allocated at least as much as requested. The exact figure over that amount is almost impossible to determine.
Memory usage for a job is computed by walking through all of the processes
in the job, requesting a mapping of their address space, and summing up
all the private memory. In addition, a "share" of all shared memory
segments is charged against each process, as there is no single "owner" of
a shared segment. For instance, if an 8MB shared memory segment is owned
by four processes, each will be charged for 2MB of the allocated memory. A
few segments required by the rld(1) runtime linker are also ignored, and the
memory used by pre-loaded system shared libraries (see the
Miscellaneous section below) is not charged.
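The charging rule can be illustrated with a short Python sketch. The data layout (tuples of private size plus a list of attached segment ids) is invented for the example, and the sharers of a segment are counted only among the processes passed in; how the mom actually obtains the sharing count is not described here.

```python
from collections import Counter


def job_memory_mb(processes, segment_mb, exempt=frozenset()):
    """Total job memory: private pages plus a per-process share of shared segments.

    processes  -- list of (private_mb, [attached segment ids]) tuples
    segment_mb -- dict: segment id -> size in MB
    exempt     -- segment ids that are not charged (e.g. pre-loaded system libraries)
    """
    # Count how many of the listed processes attach each segment.
    attached = Counter(seg for _, segs in processes for seg in segs)

    total = 0.0
    for private_mb, segs in processes:
        total += private_mb
        for seg in segs:
            if seg in exempt:
                continue
            # Each sharer is charged an equal fraction of the segment.
            total += segment_mb[seg] / attached[seg]
    return total


# Four processes, 10MB private each, sharing one 8MB segment (2MB charged to each).
procs = [(10, ["shm0"])] * 4
print(job_memory_mb(procs, {"shm0": 8}))
```

Summed over all four processes the shared segment is charged exactly once, so the job total is 40MB of private memory plus the 8MB segment.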
After walking through the address space of each process, totalling up its share of allocated pages, the grand total is computed and reported as the job's memory usage. For a set of processes whose memory space is no longer changing (i.e. the dataset has been loaded and the job is now in a computational phase), the results should be accurate to within a page. However, since it is impossible to do this tallying accurately when the address space or process membership of the job is changing, the results are precise but inaccurate unless the job is fully compute-bound.
The interfaces used to get the memory maps and other process information were not designed to be efficient. For larger systems, the system calls can block for long periods of time, often requiring several minutes to complete. There is no way to interrupt the call, and no way to tell whether the next invocation will block indefinitely. The result of this was that the mom would "lock up" for long periods of time until this call completed and normal execution continued. The only reasonable solution was to make the memory usage collection an asynchronous task.
At startup, the PBS mom now starts a "collector" thread which runs independently of the main PBS mom. Shared memory is used to pass data between the main PBS mom (which responds to queries, executes jobs, etc) and its sub-threads. This allows the sub-threads to run completely asynchronously from the main mom, preventing the mom from blocking.
In fact, the collector can be started and stopped without affecting
running jobs (except that their memory usage will not be reported). The
collector thread periodically wakes up (or is signaled by the main mom)
and generates a list of processes owned by the various PBS jobs. It then
runs through each of their address space maps, calculating the virtual and
physical memory usage for each process. Finally, it sums up the grand
totals and hands those back to the main mom process, which updates the
"Resources_used" job attributes.
If querying the memory maps blocks for a long period of time, the only
impact this has on PBS is that the memory usage values are not updated.
The introduction of this functionality completely removed the mom stalls
that plagued the large machines when they were first installed. Memory
reporting may be temporarily disabled by sending a SIGQUIT to the collector
thread, or by permanently disabling enforcement of all memory classes in
the file /PBS/mom_priv/config:
# Disable memory enforcement (and hence collection of usage statistics)
enforce !mem,!vmem,!procvmem
To re-enable the reporting, remove the 'enforce' line (if necessary) and
send a SIGHUP to the main PBS mom (the correct pid is listed in the file
/PBS/mom_priv/mom.lock). Note that sending a SIGKILL to the collector
will result in mom restarting it -- you must use SIGQUIT to indicate that
the thread should not be restarted.
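The collector arrangement described above can be approximated in Python with a background thread. This is a sketch of the general pattern only: a Python thread, lock and event stand in for the mom's collector, its shared memory and its signals, and the class and method names are invented for illustration.

```python
import threading


class UsageCollector:
    """Background sampler; the main loop only ever reads the last result."""

    def __init__(self, sample, interval=1.0):
        self._sample = sample            # callable that may block for a long time
        self._interval = interval        # seconds between periodic samples
        self._lock = threading.Lock()
        self._latest = {}                # most recent totals, keyed by job
        self._wake = threading.Event()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def poke(self):
        """Ask for an early sample (the main mom "signals" the collector)."""
        self._wake.set()

    def stop(self):
        """Stop sampling; the last totals remain readable."""
        self._stop.set()
        self._wake.set()
        self._thread.join()

    def latest(self):
        """Non-blocking read of the most recent totals."""
        with self._lock:
            return dict(self._latest)

    def _run(self):
        while not self._stop.is_set():
            totals = self._sample()      # the slow call happens off the main thread
            with self._lock:
                self._latest = totals
            self._wake.wait(self._interval)
            self._wake.clear()
```

The point of the design is that latest() never waits on the sampling call: if sampling stalls, the caller simply keeps seeing the previous totals, which is the behaviour described for the mom.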
Although the "system" cpuset will constrain the processes spawned by login
sessions, attempting to schedule the additional processes on the limited
resources can cause a performance impact on the machine. This load will
affect the throughput of all processes on the system, because the thread
scheduler will have to consider the extra threads even if there aren't
available resources in the "system" cpuset.
Users tend to want to log in to the batch machines to retrieve or edit the
results of previous runs, to compile and test their programs, run quick
test runs, etc. Most of this activity can be performed on the front-ends,
and job output data is available over NFS. In addition, interactive
logins on the batch machines can be started through PBS (with 'qsub -I
...'), so the policy has been retained even though the actual impact is
minimal.
Originally, the hammer was implemented as a shell-script that queried the running jobs from the PBS server, and looked for running processes that didn't correlate to a running job. This was cumbersome and error-prone, and it relied on the server keeping the list of running jobs. However, the PBS mom knows exactly what jobs are supposed to be running (as well as the session id's and ASHes for those jobs) at any instant. The PBS mom is the ideal place for this functionality, so the hammer was re-implemented as a thread of the mom.
The PBS hammer sweeps the process table about once every 15 seconds,
examining every listed process against the following criteria. If the
process does not meet one of the exemptions below, a SIGKILL is sent to
it. This is a rather Machiavellian approach, but for the sake of
minimizing interference to the running jobs, it is imperative that the
processes be removed immediately. In addition, the "no nonsense" aspect
quickly discourages users from attempting to bend the rules.
Any process that matches any of the following criteria is exempt from being hammered:
The hammer may be permanently disabled, or enabled in either log-only or kill mode, by adding one of the following options to the mom configuration file (/PBS/mom_priv/config):
# Turn off hammer functionality
enforce !hammer
# Turn on hammer, but just log offending processes
enforce hammer,nokill
# Turn on hammer, and send SIGKILL to any offending processes
enforce hammer,!nokill
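For readers who want to see how such 'enforce' lines could be interpreted, the following Python sketch parses them into on/off flags. It is not the mom's parser: the rules that a later line overrides an earlier one, and that an unmentioned option counts as off, are assumptions made for the example, as are the function names.

```python
def parse_enforce(lines):
    """Collect 'enforce' options into a dict of option name -> bool.

    A leading '!' turns an option off. Comment lines, blank lines, and
    any other directives are skipped.
    """
    flags = {}
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) != 2 or parts[0] != "enforce":
            continue
        for opt in parts[1].split(","):
            opt = opt.strip()
            if opt:
                # Assumption: a later setting replaces an earlier one.
                flags[opt.lstrip("!")] = not opt.startswith("!")
    return flags


def hammer_mode(flags):
    """Map parsed flags onto 'off', 'log', or 'kill'."""
    if not flags.get("hammer", False):   # assumption: unset means off
        return "off"
    return "log" if flags.get("nokill", False) else "kill"


config = [
    "# Turn on hammer, but just log offending processes",
    "enforce hammer,nokill",
]
print(hammer_mode(parse_enforce(config)))
```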
Note that running the hammer with 'nokill' enabled can cause the mom's
logs to grow quite quickly (if many hammer-able processes exist on the
system). It is recommended that some utility watch the log files for
offending processes and take some action to remove them.
The hammer runs asynchronously from the rest of the PBS mom. It may be
killed (by sending it a SIGKILL or SIGQUIT) to temporarily disable it.
Sending a SIGHUP to the main PBS mom will restart the hammer (unless it is
disabled).
libldr):
Imagine that PBS runs an MPI job, for instance, in a small cpuset on a
host. When the job script invokes 'mpirun ...', the shared library
'libmpi.so' will be loaded into the memory on the nodes assigned to that
job. For most libraries, this is a small amount of memory, but some of
the libraries are quite large (several megabytes). They may also cause
more memory to be allocated when they are first loaded to handle their
initialization. In any case, some memory from the job's nodes is now
dedicated to holding this shared object.
PBS then starts another MPI job in a different cpuset. This time, the
'mpirun ...' command does not need to load the shared library. It merely
makes a reference to the copy of libmpi.so that the first job loaded (into
the memory assigned to the first job) and the new job begins execution.
After a few minutes, the first job encounters some error, and terminates.
The kernel frees all the pages that were allocated in the memory on the
first job's nodes, except for the pages that were used by the shared
library. Since it is still marked as in-use, that memory cannot be freed.
Now the nodes, although technically "idle", do not have as much free memory
as expected, which may cause the next job that runs on them to fail.
Page migration could be employed to address this problem. When the kernel noticed that it was going off-node to access the shared-object, it could force the pages to be loaded into the process' local memory and free the remote pages. However, in a batch environment, this would lead to the library being shuffled around the machine as jobs started and terminated, impacting performance and wasting cycles and memory bandwidth.
The obvious alternative, implemented as the 'libldr' program, is to load
all of the commonly- and not-so-commonly-used libraries into the memory on
the "system" nodes at boot time. The three main advantages of doing this
are that 1) the libraries do not need to be loaded by the job, improving
startup-time, 2) the memory consumed by the libraries is confined to
the "system" nodes, and 3) PBS jobs are not charged for the memory used by
the shared objects.
The library loader is shipped with the nas_eoe patches for IRIX 6.5.4. Sources are in the farmers' CVS repository. libldr has a separate binary for
the n32 and 64-bit ABIs. It is started at boot time, early in the
startup sequence. After reading configuration information from the file
/etc/config/libldr.config, the libldr then walks through the system libraries,
loading them into memory. It also resolves any dependencies listed in the
libraries, loading the dependent libraries as well. Finally, it creates
an index of the libraries that were loaded in the file /lib{32,64}/index.
The library loader then goes to sleep forever, leaving the pre-loaded
libraries in place on the machine. On a typical system, the memory used
to load all of the libraries is approximately 191MB for the 64-bit
libraries, and about 183MB for the n32 libraries.
Because the cross-referencing and indexing of the libraries can take a
very long time (especially the NFS-mounted /opt directory), the libldr
writes out a cache file containing all of the cross-dependency information
for the libraries. This allows it to skip the time-consuming indexing at
boot, and just load the libraries into memory. The cache is considered
stale if any of the library directories have been updated more recently
than the cache, implying that a library has been added, removed, or
updated.
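The staleness rule is a simple comparison of modification times. A minimal Python sketch follows; the function name and the choice to treat a missing cache or an unreadable directory as stale are assumptions, not a description of libldr's code.

```python
import os


def cache_is_stale(cache_path, library_dirs):
    """True if the cache is missing or older than any library directory."""
    try:
        cache_mtime = os.path.getmtime(cache_path)
    except OSError:
        return True  # no cache yet: everything must be indexed
    for directory in library_dirs:
        try:
            if os.path.getmtime(directory) > cache_mtime:
                return True  # a library was added, removed, or replaced
        except OSError:
            return True  # assumption: an unreadable directory forces a rebuild
    return False
```

A directory's modification time changes when entries are added, removed, or renamed within it, which is why comparing directory timestamps is enough to detect a library that has been installed or deleted.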
The PBS mom uses the /lib32/index and /lib64/index files to correlate process shared-object segments with the pre-loaded system libraries. The PBS jobs are not charged for the memory used by the system libraries. However, if the job uses its own shared libraries, these are charged against the job (and loaded into its allocated memory).
mmscd) to display the memory usage
on each node in the system.
The bargraph shows the total memory installed on each node in yellow, scaled so that the largest memory size fills the entire bargraph. The memory used by the kernel is displayed in blue (usually only visible on node 0, where the kernel loads its text and stack). Red bars indicate memory that has been allocated by user processes on the system. The display is, by default, updated every 1/3 second.
By looking at the display on a lightly-loaded batch system, you can immediately see that the shared libraries are loaded on nodes 0 and 1 (in fact, the taller bar on node 0 corresponds to the 64-bit libraries, which are slightly larger). Memory consumed by batch jobs will be placed in clusters, depending upon the nodes allocated by PBS.
It is instructive to submit a job that uses up all of its available memory
and causes the cpuset memory limits to be reached (e.g. a loop that calls
malloc(3) continuously). It will begin allocating memory on the local
node, overflowing into the next-nearest nodes. On each of these nodes, it
will allocate all but the last 16MB -- this is left as "breathing room"
for the kernel to quickly service I/O requests and other activities that
may require a memory allocation. Eventually the reserved 16MB will be
allocated, leaving no unallocated memory on the node. The application now
pauses while the node memories are "squeezed", and more free memory should
appear. Once that memory is completely allocated, the kernel will
terminate the job. You should see the memory being de-allocated in roughly
the reverse of the order in which it was allocated. The "fresh-squeezed" nodes
should be almost completely free once the job terminates.
The modified MMSC daemon is installed in /usr/etc/mmscd.nas on the Parallel Origin2000s. The changes to the stock mmscd are expected to be incorporated into IRIX 6.5.7.