LAM 6.0 Release Notes

Version 6.0 is a major overhaul of version 5.2. Our general objectives for LAM 6.0 were:

Observability

LAM already had powerful facilities for examining the execution state of a process and its message queue. Version 6.0 also reveals the group membership of MPI communicators, the type map of derived datatypes, and the contents of messages, formatted according to the datatype. See mpitask(1) and mpimsg(1) for the new -d, -c and -m options.

Execution Tracing

LAM 6.0 has a three-stage trace collection system that minimizes intrusiveness. Traces are initially buffered within the process that produced them. When the buffer fills or the process is signalled, it is flushed to the local LAM daemon. A new command, lamtrace(1), collects trace data from the remote nodes and stores it in a file for subsequent visualization (by XMPI 2.0).

The trace collection system itself is unaware of the trace format; only the instrumented MPI library and the visualization tools need to agree on it.

A new mpirun(1) option, -t, enables an application to generate traces for all communication; no recompilation is required. MPIL_Trace_on(2) and MPIL_Trace_off(2) are LAM/MPI extensions that toggle trace generation at runtime, allowing an application to avoid voluminous trace files and focus on the interesting phases of the computation.
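
For example, an application might bracket just the phase under study. This is a minimal sketch; MPIL_Trace_on(2) and MPIL_Trace_off(2) are assumed here to take no arguments and return an MPI error code, so consult the manual pages for the exact prototypes.

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int i, n = 0;

        MPI_Init(&argc, &argv);

        /* Startup communication is not interesting; keep it out of
         * the trace file. */
        MPIL_Trace_off();
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Trace only the phase under study. */
        MPIL_Trace_on();
        for (i = 0; i < 10; ++i) {
            MPI_Barrier(MPI_COMM_WORLD);    /* stand-in for real work */
        }
        MPIL_Trace_off();

        MPI_Finalize();
        return 0;
    }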

Improved Documentation

"MPI Primer / Developing with LAM" is a new 84 page document with a 26 page pull-out chapter, "MPI Programming Primer", that is purely MPI and can be used with any implementation. The primer focuses on the more commonly used MPI routines. Of course, the main body of LAM documentation is still the extensive set of manual pages. Begin your tour of the manual pages with lam(7).

Robust MPI Resource Management

Applications may legitimately fail on some implementations but not on others, due to an escape hatch in the MPI Standard called "resource limitations". Most resources are managed locally, and it is easy for an implementation to provide a helpful error code and/or message when such a resource is exhausted. Buffer space for message envelopes, however, is often a remote resource that is difficult to manage. An overflow may not be reported to the process that caused it. Moreover, interpretations of the MPI progress guarantee can confuse the debugging of an application that actually died of envelope overflow.

LAM 6.0 has a property called "Guaranteed Envelope Resources" (GER) which serves two purposes. First, it is a promise from the implementation to the application that a minimum amount of envelope buffering will be available to each process pair. Second, it ensures that a producer of messages that overflows this resource is throttled or cited with an error, as necessary.
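
The effect can be illustrated with ordinary MPI calls. In the sketch below, a fast producer floods a deliberately slow consumer; under GER the producer blocks in MPI_Send or is cited with an error once the guaranteed per-pair envelope buffering is exhausted, instead of silently corrupting the consumer's node.

    #include <mpi.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        int rank, i, v = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Flood rank 1 with small messages.  GER throttles this
             * loop or reports an error when the guaranteed envelope
             * resource runs out. */
            for (i = 0; i < 100000; ++i)
                MPI_Send(&v, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            sleep(10);          /* a deliberately slow consumer */
            for (i = 0; i < 100000; ++i)
                MPI_Recv(&v, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         &status);
        }

        MPI_Finalize();
        return 0;
    }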

Dynamic LAM Nodes and Fault Tolerance

The set of nodes that constitutes a multicomputer is no longer static after start-up with lamboot(1). Two new commands, lamgrow(1) and lamshrink(1), add and delete nodes at any time. A resource manager or job scheduler could use these commands to respond to changing processor availability. lamshrink(1) has an option to notify processes on the doomed node with a new signal (LAM_SIGFUSE), giving them a grace period in which to terminate.
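
A process that wants to use the grace period can install a handler for the new signal. The sketch below assumes that LAM_SIGFUSE is defined in a LAM header and is delivered as an ordinary catchable signal; see lamshrink(1) for the actual delivery mechanism, and note that the cleanup routine is a hypothetical placeholder.

    #include <signal.h>
    #include <stdlib.h>

    /* Hypothetical stand-in for real application cleanup. */
    static void save_state_and_exit(void)
    {
        /* flush application state here, then terminate cleanly */
        exit(0);
    }

    /* Assumption: LAM_SIGFUSE comes from a LAM header and behaves
     * like a normal catchable signal. */
    static void on_fuse(int sig)
    {
        (void) sig;
        /* This node is being deleted: finish within the grace period. */
        save_state_and_exit();
    }

    void install_fuse_handler(void)
    {
        signal(LAM_SIGFUSE, on_fuse);
    }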

All LAM nodes can watch one another for possible failure. When a failure is detected, the remaining nodes adjust their network information to the reduced system size and notify all application processes with a new signal (LAM_SIGSHRINK). Further attempts by any process to use the dead node are reported as errors.

Dynamic MPI Processes

In anticipation of MPI-2, LAM 6.0 allows MPI applications to create new processes and establish communicators with them. All of LAM's monitoring and debugging commands and tools cope with the presence of multiple MPI world communicators. The new routines are described below, followed by a sketch that combines them.

MPIL_Spawn(2)
As with mpirun(1), a single executable program or an application schema file specifies and assigns resources to a group of processes under a new world communicator. The routine is collective over the group of parent processes, and an inter-communicator is returned for communication with the children.
MPIL_Comm_parent(2)
The children call this routine either to get an inter-communicator connecting the parent processes or to learn that they were created by mpirun(1).
MPIL_Universe_size(2)
This routine returns the current number of nodes in the LAM session. Call it once to determine how many processes to spawn. Call it periodically to learn if more nodes are available for further spawns.
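
Taken together, the three routines support a simple parent/child pattern. The sketch below is illustrative only: the argument lists and the MPI_COMM_NULL convention for mpirun-started processes are assumptions, and the manual pages cited above are authoritative.

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm parent, children;
        int nnodes;

        MPI_Init(&argc, &argv);

        /* Assumed convention: a process started by mpirun(1) gets
         * MPI_COMM_NULL back instead of a parent inter-communicator. */
        MPIL_Comm_parent(&parent);

        if (parent == MPI_COMM_NULL) {
            /* Parent: size the spawn by the current session. */
            MPIL_Universe_size(&nnodes);

            /* Illustrative argument list: program, root, parent
             * communicator, resulting inter-communicator. */
            MPIL_Spawn("child_app", 0, MPI_COMM_WORLD, &children);

            /* ... communicate with the children via `children' ... */
        } else {
            /* Child: communicate with the parents via `parent'. */
        }

        MPI_Finalize();
        return 0;
    }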

Improved Performance

A completely rebuilt client-daemon interface makes every part of LAM/MPI faster. The biggest performance gains in message passing come from bypassing the daemon altogether, which MPI processes can now do simply by turning on an option (-c2c) in mpirun(1). The trade-off is reduced visibility for debugging; the expectation is that the default daemon-based communication will be used while developing an application and direct communication will be used for production runs.
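
For example, an application debugged under the default daemon-routed transport might later be launched for production with something like "mpirun -c2c n0-7 myapp"; the exact node and option syntax is given in mpirun(1).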

The direct communication module in LAM is driven by only six function calls. The only implementation of this clear, concise module in LAM 6.0 is based on sockets and TCP/IP, for portability. It is straightforward to replace the module with another six-call implementation that targets a specific architecture. Future releases will include a combined TCP/IP and shared-memory implementation to handle clustered SMP architectures optimally. The Ohio Supercomputer Center is interested in collaborating with other groups on this project.


Ohio Supercomputer Center, lam@tbag.osc.edu, http://www.osc.edu/lam.html