$Id: O2K-config.html,v 1.1 2000/03/20 23:11:24 bayucan Exp $

NAS Origin2000 Configuration For Batch Job Execution

INTRODUCTION

The multithreaded design of the 6.5 family of IRIX coupled with the scalable architecture of the Origin2000 allows the operating system to provide a single system image view of the machine. The benefits of this approach include ease of administration (the entire machine "feels" like a workstation), easier implementation of shared memory and IPC between cooperating processes in highly parallel jobs, and access to network and I/O devices from any part of the computer. In fact, much of IRIX uses the same algorithms and strategies on the largest supercomputers as on a desktop single-user workstation.

The downside of this approach is that the resources in the machine (memory, cpu/thread scheduling, I/O bandwidth, etc.) are allocated and scheduled much as if the machine were a single-user workstation. While this allows interactive and event-based computing (e.g. database and WWW servers, general-purpose compute servers, etc.) to make good use of the resources, considerably more control is necessary to support a batch job execution environment. Over the past two years, the NAS Parallel Systems group has been developing software and making modifications to the operating system to support the unique requirements of a large batch system on a single-system image NUMA machine.

The basic requirements for which these changes were made are:

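The changes and policies made by the Parallel Systems group fall into five basic categories, each covered by a section below: hardware configuration and layout; OS support for resource management; job scheduling and resource allocation policy; PBS configuration, enhancements and features; and miscellaneous changes.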

HARDWARE CONFIGURATION AND LAYOUT:

The Origin2000 architecture is made up of a set of "nodes", each one of which consists of two R10000 or R12000 processors and a bank of memory (512MB per node for most of the NAS machines). These nodes are arranged in sets of four in rack-mount "modules" and are interconnected by a set of routers to a very fast internal network. There are some considerations for performance when physically configuring the machines.

Each module can also be interconnected by external cabling to several other modules, allowing module-by-module expansion of the machine to suit the budget and computing needs of the organization. SGI ships machines ranging from 8 to 256 processors in size, with a 512-processor machine in development (a joint project with NAS). The topology of the connecting network becomes increasingly complicated as the number of processors or nodes grows, but is basically a modified hypercube architecture.

The Origin2000 belongs to a class of machines designated CC/NUMA, for "cache-coherent non-uniform memory architecture". This means that the time to access any given memory space of the machine is non-uniform (memory accesses between nodes are routed across the interconnecting network), and that the hardware presents the illusion of a large, contiguous physical memory space. A cache-line ownership and invalidation protocol performed by the hardware guarantees that every processor on the machine is given a coherent view of the memory space on every node.

For very large (i.e. 256-processor and larger) machines, the cache coherency protocol enters a "coarse" operating mode. This is necessary due to limitations in the hardware that restrict the number of distinct cache-lines. When in coarse mode, cache invalidations affect entire modules instead of just the node on which the memory lies.

I/O and network peripherals are attached to specific nodes in the internal network. Accesses to the I/O device are handled by the node to which it is connected (i.e. interrupts and DMA are serviced by the processors and memory on that node). For off-node requests, the request is forwarded to the connected node, and data is transferred back to the requesting node. For high-interrupt or high-bandwidth devices (i.e. HIPPI or Fibre-Channel interfaces), the memory and processing load on the attached node can be significant.

Additionally, CPU 0 is considered "special" by much of the kernel. The kernel runs mostly on CPU 0. Most of the memory used by the kernel is allocated at boot-time from the memory on node 0 (which is attached to cpus 0 and 1). A small amount of memory (~8MB) is also allocated on each node to support the kernel heap.

The architecture of the machine suggests that a portion of the nodes be dedicated to "system" usage (i.e. running the kernel, login sessions, various daemons, etc), and the remainder of the machine be reserved for exclusive use by batch jobs. The primary goal of splitting the machine into the "system" and "compute" resources is to reduce the amount of interference to running jobs caused by system overhead. This overhead is composed mainly of kernel overhead, device interrupts and exception threads, and system daemons and user processes.

For small machines (i.e. 128 processors or fewer), a "system" partition of 4 cpus (2 nodes) is probably sufficient. For machines with 256 processors or more, a full 8-cpu partition (4 nodes) is reasonable -- the "coarse-mode" cache coherency mode will produce a noticeable performance boost for using a full rack (two modules). The resulting layout (for example, a 64-processor, 32-node machine) can be visualized as follows:

node   0             7              15               23              31
      [ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ]
       \_/ \_________________________________________________________/
        |                                  |
    "system" nodes                 "compute" nodes

In order to prevent device interrupts from interfering with the execution of user processes, it is recommended that all active devices be physically connected to one of the "system" nodes. This will prevent a CPU that is executing a user process from being periodically interrupted to service an I/O device.

OS SUPPORT FOR RESOURCE MANAGEMENT:

Resource management and enforcement is a serious issue on these systems. The biggest problem is in restricting the allocation of memory. Since the Origin2000 is a NUMA machine, the relative locality of memory and the processes accessing it is a big factor in the performance of those processes. Access to memory across the machine may be two to four times slower than access to memory local to the requesting CPU.

IRIX maintains an affinity between a process and its first-touch memory. For a well-behaved set of processes, this tends to maintain memory to process locality, and jobs tend to remain clustered around their initial execution node. However, once a process overruns the memory local to its node, IRIX attempts to honor the allocation from another memory "close" to the requestor. This causes the locality to break down, and (depending on how much of the remote node's memory is allocated) can cause another process to be forced to go off-node for memory. This situation can quickly go from bad to worse on a heavily subscribed machine, causing the memory for any set of processes to become scattered around the machine.

A kernel interface was provided in IRIX 6.2 (and carried forward into the 6.5 branch) to allow a process to specify the nodes from which it would "prefer" to allocate memory. The kernel was supposed to attempt to allocate memory from this set of nodes (the "nodemask") first, which would cause the processes in the job to remain clustered around those nodes. Unfortunately, this mechanism was basically a "hint" that the kernel was not required to honor.

Even more unfortunately, it was discovered about two years after NAS began using it that the nodemask interface did not, in fact, work. The loop indices in the "next nearest node" lookup routine were reversed, causing the kernel to choose nodes essentially at random. This was corrected, but the improvement in job performance was minimal. The nodemask interface was soon abandoned in favor of the use of cpusets.

A cpuset is a collection of CPUs to which a process (and all of its children) can be bound. The cpuset can be made exclusive, so that only processes that are attached to it may access a CPU within it. Exclusive cpusets also constrain attached processes to run only on the CPUs specified in the cpuset. For instance, an exclusive cpuset could be created that contains CPUs 4, 5, 6, and 7. These CPUs would be accessible only to processes that were attached to the cpuset, and those processes could only run on one of those four cpus. If 8 processes were required to run in this cpuset, at most 4 of those eight would be runnable at any time (as only 4 cpus are available to execute threads at any given instant).

Cpusets do an excellent job of restricting the CPU usage of a set of processes. However, in stock IRIX, there are no restrictions on the memory usage of the processes in a cpuset. As memory is not a pre-emptable resource like CPUs are, it is critical that it not be over-subscribed or allowed to be allocated to "unauthorized" processes. One of the major modifications made to IRIX by NAS was to provide memory controls on cpusets.

When the cpuset is created (with the appropriate flags set), the kernel computes an "effective nodemask" for that cpuset. This nodemask contains a '1' bit for every node corresponding to a CPU in the cpuset. When a process associated with the cpuset attempts to allocate a page, the kernel first attempts to get a page from the local node's memory. If this fails, the effective nodemask is consulted for the next-nearest node, and the kernel attempts to allocate the page from that memory. This process continues for each node in the effective nodemask, until all of the nodes have been exhausted. The choice of "next-nearest node" follows a much-improved algorithm over the stock IRIX memory allocator.
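
As a rough illustration of the two steps above, the following Python sketch derives an effective nodemask from a CPU list and then orders the nodes for allocation. It assumes that CPU c resides on node c // 2 (consistent with node 0 holding CPUs 0 and 1). The function names and the caller-supplied distance function are hypothetical stand-ins; the kernel's actual next-nearest-node algorithm is not reproduced here.

```python
CPUS_PER_NODE = 2  # each Origin2000 node carries two processors


def effective_nodemask(cpus):
    """Return an integer with bit n set for every node owning a CPU in the cpuset."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << (cpu // CPUS_PER_NODE)  # assumed CPU-to-node mapping
    return mask


def allocation_order(local_node, mask, distance):
    """List the nodes in the mask: local node first, then by increasing distance.

    distance(a, b) is a hypothetical metric supplied by the caller; the real
    interconnect topology is a modified hypercube, not modelled here.
    """
    nodes = [n for n in range(mask.bit_length()) if (mask >> n) & 1]
    return sorted(nodes, key=lambda n: (n != local_node, distance(local_node, n), n))
```

For the cpuset containing CPUs 4 through 7 used as an example earlier, effective_nodemask([4, 5, 6, 7]) yields a mask with bits 2 and 3 set.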

At this point, there is no un-allocated memory on any of the nodes in the effective nodemask. As a last resort, the kernel then walks through the set of nodes in the effective nodemask, looking for any named memory pages. These pages are local backing store for objects like shared libraries, mmap(2)'d disk files, etc. Any pages whose contents can be reconstructed on demand are flushed, and placed back on the free list. Once the reclamation process (affectionately dubbed "the squeeze") completes, if any pages were discarded, the allocation request is granted.

If, after the squeeze, there are no available free pages on the nodes, then a SIGKILL is enqueued for every process in the session group. This signal is tagged with a special flag that indicates that it was generated due to a resource limit being exceeded. The kernel then returns the thread to the run queue. Upon transitioning from the kernel to user mode, the SIGKILL is delivered, and all members of the session are killed. With this simple but elegant mechanism in place, it is possible to proactively control the resource usage of processes directly in the kernel.
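
The allocation path described above, including the squeeze and the final kill decision, can be modelled in a few lines of Python. This is only a toy model: the per-node counters, the function name and the return convention are assumptions made for illustration and do not reflect the kernel's data structures.

```python
def allocate_page(nodes, order):
    """Try to allocate one page for a process in a cpuset.

    nodes: dict mapping node id -> {"free": pages, "reclaimable": pages}
    order: node ids from the effective nodemask, nearest first
    Returns the node that supplied the page, or None when the request
    cannot be met and the caller would enqueue the flagged SIGKILL.
    """
    # Normal path: take a free page from the nearest node that has one.
    for n in order:
        if nodes[n]["free"] > 0:
            nodes[n]["free"] -= 1
            return n

    # "The squeeze": discard pages whose contents can be rebuilt on demand.
    reclaimed = 0
    for n in order:
        reclaimed += nodes[n]["reclaimable"]
        nodes[n]["free"] += nodes[n]["reclaimable"]
        nodes[n]["reclaimable"] = 0

    if reclaimed:
        return allocate_page(nodes, order)  # retry now that pages are free
    return None  # nothing left: the session would be killed
```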

Note that no attempt is made to control virtual memory. Having a large address space that is sparsely committed with physical pages makes the solution of many CFD-style applications much more efficient. A PBS job may request a virtual memory limit (e.g. 'qsub -l vmem=1gb') but users rarely do so, and the value of it is questionable on these machines.

In order to prevent random user processes from wandering out into the "compute nodes" and allocating memory from them, they should be restricted to the "system" nodes. To accomplish this, NAS modified the IRIX init(1) binary to create a cpuset at boot-time and attach itself (and all children) to it. This cpuset, named "system", should cover the nodes reserved for system use. It is configured by adding a line like the following to /etc/inittab and rebooting:

stup:c:off:2

This will cause the "system" cpuset to be created covering the first 2 nodes. For large machines, the number should be '4'. init(1) will attach itself to this cpuset, and all children of init(1) will inherit this affinity. Since all processes on the system are descendants of init(1), this will cause all processes to be constrained to the memory and cpus located in the nodes reserved for system use, leaving the rest of the system empty and idle. Incidentally, "stup" is short for "startup".

Note that CPU 0 cannot be included in a cpuset. Setting up the "system" cpuset also prevents user processes from running on CPU 0 where they might increase the time required for the kernel to acquire resources.

JOB SCHEDULING AND RESOURCE ALLOCATION POLICY:

After booting with the "system" cpuset correctly configured, a large portion of the machine is idle. Any process that does not specifically create a new cpuset and attach itself to it will remain constrained to the "system nodes". PBS schedules the un-reserved, idle nodes on the system, leaving the "system" nodes to service miscellaneous daemons, OS requests, etc.

Before the introduction of memory control provided by the NAS cpuset modifications, there was no way to guarantee that a minimum amount of memory would be available on any given node. When using cpusets, however, only a very small amount of memory is taken up by kernel and OS overhead. Nearly the entire physical memory on each node is available for allocation by user processes. A conservative estimate (used by PBS) of the free memory is 490MB/node -- in practice, the actual free memory typically runs from 495MB to 500MB/node.

Because of the physical interdependency between the cpus and memory on a node, it is somewhat fallacious to schedule them as orthogonal resources. The PBS scheduler will adjust the 'ncpus' and 'mem' requested by jobs to the appropriate values for the smallest number of nodes that meets or exceeds both requested resource values. The PBS job scheduler views the machines as sets of schedulable nodes, scheduling jobs onto the non-reserved nodes based on their requested resources.
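
The rounding described above amounts to a small piece of arithmetic, sketched here in Python. The figures of 2 CPUs and 490MB per node come from this document; the function names are hypothetical and the sketch is not the scheduler's actual code.

```python
import math

CPUS_PER_NODE = 2      # two processors per node
MEM_PER_NODE_MB = 490  # conservative free-memory estimate used by PBS


def nodes_required(ncpus, mem_mb):
    """Smallest node count that meets or exceeds both requested resources."""
    by_cpu = math.ceil(ncpus / CPUS_PER_NODE)
    by_mem = math.ceil(mem_mb / MEM_PER_NODE_MB)
    return max(1, by_cpu, by_mem)


def adjusted_request(ncpus, mem_mb):
    """Return (nodes, ncpus, mem_mb) after rounding up to whole nodes."""
    n = nodes_required(ncpus, mem_mb)
    return n, n * CPUS_PER_NODE, n * MEM_PER_NODE_MB
```

Under these assumptions a request for 16 CPUs and 1024MB is rounded up to 8 nodes (16 CPUs, 3920MB), while a request for 2 CPUs and 2000MB becomes 5 nodes (10 CPUs, 2450MB).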

When the scheduler decides to have PBS run a job, it sets the job's "Resource_List.ssinodes" resource to the number of nodes required to run this job. This value is passed to the PBS mom on the execution host when the job is started. The PBS mom then chooses a set of nodes that satisfies the job's resource request, assigns them to the job, and executes the job script.

Note that the scheduler no longer calculates, assigns, stores or modifies the nodemask of the jobs. Although the "nodemask" job attribute is still provided, it exists mainly for debugging and backwards compatibility. The PBS mom, after assigning the nodes for the job, updates the job attribute "Resources_used.nodemask". This string, a long hexadecimal number corresponding to the job's effective nodemask, is recorded in the accounting and server logs, making it easier to determine the placement of the job on the machine.
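
When reading these log entries it can help to convert between a node list and its hexadecimal form. The Python sketch below assumes that bit n of the mask corresponds to node n, with node 0 as the least significant bit; the exact formatting PBS uses (padding, prefix, letter case) is not specified here, so the helper names and output format are illustrative only.

```python
def nodemask_hex(nodes):
    """Encode a collection of node numbers as a hexadecimal bitmask string."""
    mask = 0
    for n in nodes:
        mask |= 1 << n  # assumed: bit n represents node n
    return format(mask, "x")


def nodes_from_hex(text):
    """Decode a hexadecimal nodemask string back into a sorted node list."""
    mask = int(text, 16)
    return [n for n in range(mask.bit_length()) if (mask >> n) & 1]
```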

PBS CONFIGURATION, ENHANCEMENTS AND FEATURES:

The version of PBS that is running on the Origin2000s has been heavily modified to support the various OS features that allow better control over the running jobs. The changes are concentrated mainly in the PBS mom, as it is most directly in control of the job's running processes and environment. The most notable enhancements to the PBS mom are:

Support for dynamic creation of a cpuset for each job:

When the PBS server requests that the PBS mom start the execution of a job, the mom looks for the job attribute "Resource_List.ssinodes". The value of this resource is set by the scheduler to the number of nodes (each node having 2 cpus and 490MB) required by the job. The mom chooses a set of unallocated nodes to be assigned to the job, and creates an exclusive cpuset named with the first 8 characters of the jobid (for instance, "123.pigl"). The job shell is then attached to the cpuset, and the script begins executing.

The newly-created cpuset grants the job access to the memory of each node that was assigned, as well as both CPUs on that node. The job can make use of these resources any way the submitter wishes, as no other process can use the assigned cpus or memory. However, the job is also constrained to these resources -- it can only allocate the memory on those nodes, and it can only run processes on the CPUs that have been allocated to it.

After the job completes its execution (or is terminated for some reason), the cpuset is destroyed, the nodes returned to the free pool, and the system is ready to run one or more new jobs. The main difficulty with cpusets is that they cannot be destroyed (thus allowing their resources to be granted to another job) until no processes remain attached to them. In most cases, all processes have terminated when the job exits, and the cleanup proceeds smoothly. However, it is possible for processes to be stuck and unkillable (typically while writing very large coredumps over NFS), causing PBS to be temporarily unable to destroy the cpuset.

When PBS determines that a cpuset is "stuck" (i.e. the job has terminated but the cpuset cannot be reclaimed), it places the job on a list of "stuck cpusets" and logs the event. The resources are in limbo at this point -- they are not officially in use, but they cannot be returned to service until the operating system releases them. The mom periodically walks down this list of stuck cpusets, and attempts to reclaim each set. If the cpuset can now be destroyed, the nodes that were held by that cpuset are returned to the free node pool, and are available for immediate use by another job.
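
The life cycle of a job's cpuset, including the handling of stuck cpusets, can be sketched as a small bookkeeping class. This is a simplified model in Python, not the mom's implementation: the class and method names are hypothetical, the policy of handing out the lowest-numbered free nodes is an assumption, and destroy is a caller-supplied callback standing in for the operating system's attempt to destroy a cpuset.

```python
class NodePool:
    """Track free nodes, per-job cpusets, and cpusets that could not be destroyed."""

    def __init__(self, total_nodes, system_nodes):
        # Nodes below system_nodes belong to the "system" cpuset.
        self.free = set(range(system_nodes, total_nodes))
        self.cpusets = {}  # cpuset name -> set of assigned nodes
        self.stuck = []    # names of cpusets awaiting reclamation

    def start_job(self, jobid, ssinodes):
        """Assign ssinodes free nodes; return the cpuset name, or None if short."""
        if len(self.free) < ssinodes:
            return None
        name = jobid[:8]  # cpuset is named with the first 8 characters of the jobid
        nodes = set(sorted(self.free)[:ssinodes])  # assumed placement policy
        self.free -= nodes
        self.cpusets[name] = nodes
        return name

    def end_job(self, name, destroy):
        """Destroy the cpuset if possible; otherwise remember it as stuck."""
        if destroy(name):
            self.free |= self.cpusets.pop(name)
        else:
            self.stuck.append(name)

    def retry_stuck(self, destroy):
        """Periodic pass over the stuck list, reclaiming whatever can now be freed."""
        remaining = []
        for name in self.stuck:
            if destroy(name):
                self.free |= self.cpusets.pop(name)
            else:
                remaining.append(name)
        self.stuck = remaining
```

Until retry_stuck succeeds for a given cpuset, its nodes appear in neither the free set nor any running job, which mirrors the "limbo" state described above.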

Reporting of exceeded resource allocation:

By assigning a cpuset to each job, CPU and memory restrictions can be enforced by the kernel. To be exact, a job can start as many threads as it likes, but they will only be runnable on the assigned CPUs. Any "extra" threads will sleep until one of the assigned CPUs becomes available. In general, it is recommended that jobs request a minimum of 1 CPU per execution thread (e.g. 16 or 17 CPUs for a 16-way MPI job).

In addition, any extra threads will not be scheduled in another job's cpuset, even if the CPUs are idle. This is a significant change from the work-stealing and load-balancing algorithms usually employed by MP machines that do mostly interactive and server work. It was decided early on that it was best to reduce inter-job interference by preventing jobs from stealing CPU cycles from one another.

The memory limits are somewhat less forgiving. An attempt to allocate more memory than was assigned to the job will cause the processes in the session to be terminated with a SIGKILL. The signal is enqueued by the kernel, and a special flag is set that allows the parent to differentiate between a "normal" SIGKILL (i.e. 'kill -9 -<pgroupid>') and one delivered on behalf of the memory subsystem.

PBS checks for this flag when it reaps processes from terminated jobs, and tags ill-behaved jobs with the following line on stderr (a message is also logged in the mom logs):

>> Job 170107.fermi.nas.nasa.gov exceeded resource allocation -- killed

Unfortunately, PBS becomes aware of the problem only after the job has been terminated and memory is being reclaimed by the system. There is no way to accurately compute the amount of memory that was allocated immediately before the fatal allocation request. Because of the use of cpusets and their ability to prevent unauthorized use of memory, the job must have allocated at least as much as requested. The exact figure over that amount is almost impossible to determine.

Multi-threading of mom server and more precise memory usage reporting:

Because of the strict enforcement of memory limits, users want to know exactly how much memory their jobs are using. On a shared- and virtual-memory machine such as the Origin2000, answering the question of "how much memory is this job using?" is very difficult.

Memory usage for a job is computed by walking through all of the processes in the job, requesting a mapping of their address space, and summing up all the private memory. In addition, a "share" of all shared memory segments is charged against each process, as there is no single "owner" of a shared segment. For instance, if an 8MB shared memory segment is owned by four processes, each will be charged for 2MB of the allocated memory. A few segments required by the rld(1) runtime linker are also ignored, and the memory used by pre-loaded system shared libraries (see the Miscellaneous section below) is not charged.
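
The charging rule can be illustrated with a short Python sketch. The data layout (a list of per-process dictionaries plus a table of segment sizes) and the function name are assumptions made for the example, and the attach count is taken over the processes passed in rather than over the whole system.

```python
from collections import Counter


def job_memory_mb(processes, segment_mb, ignored=()):
    """Sum private memory plus each process's share of its shared segments.

    processes:  list of {"private_mb": float, "shared": [segment names]}
    segment_mb: dict mapping segment name -> size in MB
    ignored:    segment names not charged (e.g. pre-loaded system libraries)
    """
    # Count how many processes are attached to each shared segment.
    attached = Counter(seg for p in processes for seg in p["shared"])

    total = 0.0
    for p in processes:
        total += p["private_mb"]
        for seg in p["shared"]:
            if seg not in ignored:
                # Each attached process is charged an equal share of the segment.
                total += segment_mb[seg] / attached[seg]
    return total
```

With four processes that each hold 10MB of private memory and share one 8MB segment, every process is charged 2MB for the segment and the job total comes to 48MB.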

After walking through the address space of each process, totalling up its share of allocated pages, the grand total is computed and reported as the job's memory usage. For a set of processes whose memory space is no longer changing (i.e. the dataset has been loaded and the job is now in a computational phase), the results should be accurate to within a page. However, since it is impossible to do this tallying accurately when the address space or process membership of the job is changing, the results are precise but inaccurate unless the job is fully compute-bound.

The interfaces used to get the memory maps and other process information were not designed to be efficient. For larger systems, the system calls can block for long periods of time, often requiring several minutes to complete. There is no way to interrupt the call, and no way to tell whether the next invocation will block indefinitely. The result of this was that the mom would "lock up" for long periods of time until this call completed and normal execution continued. The only reasonable solution was to make the memory usage collection an asynchronous task.

At startup, the PBS mom now starts a "collector" thread which runs independently of the main PBS mom. Shared memory is used to pass data between the main PBS mom (which responds to queries, executes jobs, etc) and its sub-threads. This allows the sub-threads to run completely asynchronously from the main mom, preventing the mom from blocking.

In fact, the collector can be started and stopped without affecting running jobs (except that their memory usage will not be reported). The collector thread periodically wakes up (or is signaled by the main mom) and generates a list of processes owned by the various PBS jobs. It then runs through each of their address space maps, calculating the virtual and physical memory usage for each process. Finally, it sums up the grand totals and hands those back to the main mom process, which updates the "Resources_used" job attributes.
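
The separation between the main mom and the collector can be sketched with a Python thread. This is an analogy rather than the mom's design: the real collector exchanges data through shared memory and responds to signals, whereas the sketch uses a lock-protected dictionary and event objects. The class name, attribute names and the sample callback are all hypothetical.

```python
import threading


class Collector(threading.Thread):
    """Gather memory usage in the background so the main loop never blocks."""

    def __init__(self, sample, interval=1.0):
        super().__init__(daemon=True)
        self.sample = sample              # callable returning {job name: usage}
        self.interval = interval          # seconds between periodic wake-ups
        self.latest = {}                  # last completed set of totals
        self.lock = threading.Lock()
        self.wake = threading.Event()     # set by the main thread to force a pass
        self.stop_flag = threading.Event()
        self.updated = threading.Event()  # set after each completed pass

    def run(self):
        while not self.stop_flag.is_set():
            usage = self.sample()         # may block for a long time
            with self.lock:
                self.latest = usage
            self.updated.set()
            self.wake.wait(self.interval) # sleep until the timer or a wake-up
            self.wake.clear()

    def snapshot(self):
        """Return the most recent totals without waiting on the sampler."""
        with self.lock:
            return dict(self.latest)
```

If the sampling call stalls, snapshot() simply keeps returning the previous totals, which corresponds to the behaviour in which usage values stop updating while the rest of the mom continues to run.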

If querying the memory maps blocks for a long period of time, the only impact this has on PBS is that the memory usage values are not updated. The introduction of this functionality completely removed the mom stalls that plagued the large machines when they were first installed. Memory reporting may be temporarily disabled by sending a SIGQUIT to the collector thread, or permanently disabled by turning off enforcement of all memory classes in the file /PBS/mom_priv/config:

	# Disable memory enforcement (and hence collection of usage statistics)
	enforce !mem,!vmem,!procvmem

To re-enable the reporting, remove the 'enforce' line (if necessary) and send a SIGHUP to the main PBS mom (the correct pid is listed in the file /PBS/mom_priv/mom.lock). Note that sending a SIGKILL to the collector will result in mom restarting it -- you must use SIGQUIT to indicate that the thread should not be restarted.

Hammer functionality as a thread of PBS mom:

Access to the compute engines is allowed only through PBS. By stated policy, users are not allowed to log in directly to the compute machines (a 24-processor Origin2000 front-end host is part of the cluster). Since PBS is unable to track login sessions (and the resources they consume on the machines), preventing them allows better control over the usage of the machine.

Although the "system" cpuset will constrain the processes spawned by login sessions, attempting to schedule the additional processes on the limited resources can cause a performance impact on the machine. This load will affect the throughput of all processes on the system, because the thread scheduler will have to consider the extra threads even if there aren't available resources in the "system" cpuset.

Users tend to want to log in to the batch machines to retrieve or edit the results of previous runs, to compile and test their programs, run quick test runs, etc. Most of this activity can be performed on the front-ends, and job output data is available over NFS. In addition, interactive logins on the batch machines can be started through PBS (with 'qsub -I ...'), so the policy has been retained even though the actual impact is minimal.

Originally, the hammer was implemented as a shell-script that queried the running jobs from the PBS server, and looked for running processes that didn't correlate to a running job. This was cumbersome and error-prone, and it relied on the server keeping the list of running jobs. However, the PBS mom knows exactly what jobs are supposed to be running (as well as the session id's and ASHes for those jobs) at any instant. The PBS mom is the ideal place for this functionality, so the hammer was re-implemented as a thread of the mom.

The PBS hammer sweeps the process table about once every 15 seconds, examining every listed process against the following criteria. If the process does not meet one of the exemptions below, a SIGKILL is sent to it. This is a rather Machiavellian approach, but for the sake of minimizing interference to the running jobs, it is imperative that the processes be removed immediately. In addition, the "no nonsense" aspect quickly discourages users from attempting to bend the rules.

Any process that matches any of the following criteria is exempt from being hammered:

The hammer may be permanently disabled, or enabled in either log-only or kill mode, by adding one of the following options to the mom configuration file (/PBS/mom_priv/config):

	# Turn off hammer functionality
	enforce !hammer
	# Turn on hammer, but just log offending processes
	enforce hammer,nokill
	# Turn on hammer, and send SIGKILL to any offending processes
	enforce hammer,!nokill

Note that running the hammer with 'nokill' enabled can cause the mom's logs to grow quite quickly (if many hammer-able processes exist on the system). It is recommended that some utility watch the log files for offending processes and take some action to remove them.

The hammer runs asynchronously from the rest of the PBS mom. It may be killed (send a SIGKILL or SIGQUIT to it) to temporarily disable it. Sending a SIGHUP to the main PBS mom will restart the hammer (unless it is disabled).
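
The sweep and its two enforcement modes can be sketched as follows. The Python function below is illustrative only: the process records, the function name and the form of the exemption rules are assumptions, and the exemption list actually used by the mom is not reproduced. The single rule shown in the usage lines (membership in a running job's session) is merely an example of the kind of test the mom is in a position to make.

```python
def hammer_sweep(processes, exemptions, kill=True):
    """Return the action the hammer would take for each non-exempt process.

    processes:  list of dicts describing entries in the process table
    exemptions: list of predicates; a process matching any of them is spared
    kill:       True for kill mode, False for log-only ('nokill') mode
    """
    actions = []
    for proc in processes:
        if any(rule(proc) for rule in exemptions):
            continue  # exempt: leave the process alone
        actions.append(("kill" if kill else "log", proc["pid"]))
    return actions


# Hypothetical usage: spare processes whose session belongs to a running job.
job_sessions = {5}
table = [{"pid": 10, "sid": 5}, {"pid": 11, "sid": 9}]
rules = [lambda p: p["sid"] in job_sessions]
offenders = hammer_sweep(table, rules, kill=False)
```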

MISCELLANEOUS CHANGES:

Some operational changes and features have been added to help make the machines more usable and controllable. The most important of these are listed below:

Boot-time system library loader (libldr):

IRIX uses dynamic shared objects for virtually every system library. These shared objects are loaded into memory on demand, and once loaded, are referenced by other programs. The advantages of this mechanism are well-known -- smaller executables, faster startup, more efficient use of memory, less disk traffic, etc, etc. However, there is a problem with using shared objects on a system that uses cpusets to partition memory.

Imagine that PBS runs an MPI job, for instance, in a small cpuset on a host. When the job script invokes 'mpirun ...', the shared library 'libmpi.so' will be loaded into the memory on the nodes assigned to that job. For most libraries, this is a small amount of memory, but some of the libraries are quite large (several megabytes). They may also cause more memory to be allocated when they are first loaded to handle their initialization. In any case, some memory from the job's nodes is now dedicated to holding this shared object.

PBS then starts another MPI job in a different cpuset. This time, the 'mpirun ...' command does not need to load the shared library. It merely makes a reference to the copy of libmpi.so that the first job loaded (into the memory assigned to the first job) and the new job begins execution. After a few minutes, the first job encounters some error, and terminates. The kernel frees all the pages that were allocated in the memory on the first job's nodes, except for the pages that were used by the shared library. Since it is still marked as in-use, that memory cannot be freed. Now the nodes, although technically "idle", do not have as much free memory as expected, which may cause the next job that runs on them to fail.

Page migration could be employed to address this problem. When the kernel noticed that it was going off-node to access the shared-object, it could force the pages to be loaded into the process' local memory and free the remote pages. However, in a batch environment, this would lead to the library being shuffled around the machine as jobs started and terminated, impacting performance and wasting cycles and memory bandwidth.

The obvious alternative, implemented as the 'libldr' program, is to load all of the commonly- and not-so-commonly-used libraries into the memory on the "system" nodes at boot time. The three main advantages of doing this are that 1) the libraries do not need to be loaded by the job, improving startup-time, 2) the memory consumed by the libraries is confined to the "system" nodes, and 3) PBS jobs are not charged for the memory used by the shared objects.

The library loader is shipped with the nas_eoe patches for IRIX 6.5.4. Sources are in the farmers' CVS repository. libldr has a separate binary for each of the n32 and 64-bit ABIs. It is started at boot time, early in the startup sequence. After reading configuration information from the file /etc/config/libldr.config, the libldr then walks through the system libraries, loading them into memory. It also resolves any dependencies listed in the libraries, loading the dependent libraries as well. Finally, it creates an index of the libraries that were loaded in the file /lib{32,64}/index. The library loader then goes to sleep forever, leaving the pre-loaded libraries in place on the machine. On a typical system, the memory used to load all of the libraries is approximately 191MB for the 64-bit libraries, and about 183MB for the n32 libraries.
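
Resolving dependencies so that every library is loaded exactly once is a transitive-closure walk, sketched here in Python. The function name and the dependency table format are hypothetical, as is the choice to list dependencies before the libraries that need them; libfoo.so is an invented name used only for the example.

```python
def load_order(requested, deps):
    """Return the requested libraries plus all dependencies, each listed once.

    requested: iterable of library names to pre-load
    deps:      dict mapping a library name -> list of libraries it depends on
    Dependencies appear before their dependents (an assumption of this sketch).
    """
    seen = set()
    order = []

    def visit(lib):
        if lib in seen:
            return        # already scheduled; also guards against cycles
        seen.add(lib)
        for dep in deps.get(lib, ()):
            visit(dep)
        order.append(lib)

    for lib in requested:
        visit(lib)
    return order
```

For example, if libfoo.so depends on libmpi.so, which in turn depends on libc.so.1, requesting libfoo.so alone schedules all three libraries.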

Because the cross-referencing and indexing of the libraries can take a very long time (especially the NFS-mounted /opt directory), the libldr writes out a cache file containing all of the cross-dependency information for the libraries. This allows it to skip the time-consuming indexing at boot, and just load the libraries into memory. The cache is considered stale if any of the library directories have been updated more recently than the cache, implying that a library has been added, removed, or updated.
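
The staleness rule lends itself to a direct check on modification times. The Python sketch below shows one way to express it; the function name and arguments are hypothetical, and treating a missing cache file as stale is an assumption not stated above.

```python
import os


def cache_is_stale(cache_path, library_dirs):
    """Return True if the cache is missing or older than any library directory."""
    if not os.path.exists(cache_path):
        return True  # assumed: no cache means a full re-index is needed
    cache_mtime = os.path.getmtime(cache_path)
    # A directory's mtime changes when entries are added, removed, or replaced.
    return any(os.path.getmtime(d) > cache_mtime for d in library_dirs)
```

Note that a directory's modification time changes when entries are added, removed or renamed, but not necessarily when an existing file is rewritten in place, so a check of this kind may miss in-place updates.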

The PBS mom uses the /lib32/index and /lib64/index files to correlate process shared-object segments with the pre-loaded system libraries. The PBS jobs are not charged for the memory used by the system libraries. However, if the job uses its own shared libraries, these are charged against the job (and loaded into its allocated memory).

Memory utilization display for MMSC:

The Origin2000 provides a display panel that is controlled by a daemon running on the machine. Stock IRIX uses this display panel to show the CPU utilization of each processor on the machine. However, experience has shown that the CPU load is less useful than memory usage for judging the health and utilization of the machine. For this reason, NAS added an option to the MMSC controller daemon (mmscd) to display the memory usage on each node in the system.

The bargraph shows the total memory installed on each node in yellow, scaled so that the largest memory size fills the entire bargraph. The memory used by the kernel is displayed in blue (usually only visible on node 0, where the kernel loads its text and stack). Red bars indicate memory that has been allocated by user processes on the system. The display is, by default, updated every 1/3 second.

By looking at the display on a lightly-loaded batch system, you can immediately see that the shared libraries are loaded on nodes 0 and 1 (in fact, the taller bar on node 0 corresponds to the 64-bit libraries, which are slightly larger). Memory consumed by batch jobs will be placed in clusters, depending upon the nodes allocated by PBS.

It is instructive to submit a job that uses up all of its available memory and causes the cpuset memory limits to be reached (e.g. a loop that calls malloc(3) continuously). It will begin allocating memory on the local node, overflowing into the next-nearest nodes. On each of these nodes, it will allocate all but the last 16MB -- this is left as "breathing room" for the kernel to quickly service I/O requests and other activities that may require a memory allocation. Eventually the reserved 16MB will be allocated, leaving no unallocated memory on the node. The application now pauses while the node memories are "squeezed", and more free memory should appear. Once that memory is completely allocated, the kernel will terminate the job. You should see the memory being de-allocated in roughly the reverse of the order in which it was allocated. The "fresh-squeezed" nodes should be almost completely free once the job terminates.

The modified MMSC daemon is installed in /usr/etc/mmscd.nas on the Parallel Origin2000's. The changes to the stock mmscd are expected to be incorporated into IRIX 6.5.7.