LAM 6.0 Release Notes
Version 6.0 is a major overhaul of version 5.2. Our general
objectives for LAM 6.0 were:
- observability of more MPI objects and status information
- execution tracing for performance visualization and advanced debugging
- improved documentation emphasizing MPI features
- very robust MPI request handling and resource management
- dynamic LAM nodes and fault tolerance
- dynamic MPI processes
- improved performance all around, particularly the ability to bypass
the LAM daemon for MPI communication
Observability
LAM already had powerful abilities to examine the execution state
of a process and its message queue. Version 6.0 also reveals
the group membership of MPI communicators, the type map of
derived datatypes and the contents of messages, formatted according
to the datatype. See mpitask(1) and mpimsg(1) for the new
-d, -c and -m options.
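As an illustration, the fragment below builds a derived datatype with
standard MPI-1 calls; the new options can display the type map of such
a datatype and format pending message contents according to it. The
program is hypothetical, and which option belongs to which command is
detailed in the manual pages.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Datatype column;

        MPI_Init(&argc, &argv);

        /* One column of an 8x8 array of doubles: 8 blocks of 1 element,
           a stride of 8 elements apart. */
        MPI_Type_vector(8, 1, 8, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        /* ... sends and receives using "column" would appear in the
           monitoring commands, formatted by this type map ... */

        MPI_Type_free(&column);
        MPI_Finalize();
        return 0;
    }
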
Execution Tracing
LAM 6.0 has a three-stage trace collection system to minimize
intrusiveness. Traces are initially stored in a buffer within
the process that produced them. When the buffer fills or the process is
signalled, this internal buffer is flushed to the local LAM
daemon. A new command, lamtrace(1), collects trace data from remote
nodes and stores it in a file for subsequent visualization
(by XMPI 2.0).
This trace collection system is unaware of the format of traces.
Only the instrumented MPI library and the visualization tools agree
on formats.
A new mpirun(1) option, -t, enables an application to generate traces
for all communication without recompiling. MPIL_Trace_on(2) and
MPIL_Trace_off(2) are LAM/MPI extensions that toggle trace generation
at runtime, allowing an application to avoid voluminous trace files
and focus on interesting phases of the computation.
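For example, an application can bracket just the phase of interest with
these calls. This is a minimal sketch assuming MPIL_Trace_on(2) and
MPIL_Trace_off(2) take no arguments and return an MPI error code; the
two phase functions are hypothetical stand-ins.

    #include <mpi.h>

    static void setup_phase(void)  { /* ... untraced initialization ... */ }
    static void solver_phase(void) { /* ... communication of interest ... */ }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPIL_Trace_off();        /* suppress traces during setup */
        setup_phase();

        MPIL_Trace_on();         /* trace only the interesting phase */
        solver_phase();
        MPIL_Trace_off();

        MPI_Finalize();
        return 0;
    }
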
Improved Documentation
"MPI Primer / Developing with LAM" is a new 84 page document with
a 26 page pull-out chapter, "MPI Programming Primer", that is purely
MPI and can be used with any implementation. The primer focuses
on the more commonly used MPI routines. Of course, the main body
of LAM documentation is still the extensive set of manual pages.
Begin your tour of the manual pages with lam(7).
Robust MPI Resource Management
Applications may fail, legitimately, on some implementations but not
others due to an escape hatch in the MPI Standard called "resource
limitations". Most resources are managed locally and it is easy for
an implementation to provide a helpful error code and/or message when
a resource is exhausted. Buffer space for message envelopes, however,
is often a remote resource that is difficult to manage. An overflow
may not be reported to the process that caused it. Moreover,
interpretation of the MPI guarantee on message progress may confuse
the debugging of an application that actually died of envelope overflow.
LAM 6.0 has a property called "Guaranteed Envelope Resources" (GER)
which serves two purposes. First, it is a promise from the
implementation to the application that a minimum amount of envelope
buffering will be available to each process pair. Second, it ensures
that a producer of messages that overflows this resource will be
throttled or cited with an error, as appropriate.
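The situation GER addresses can be reproduced with nothing but standard
MPI-1 calls. In the sketch below (the ranks, message count and delay
are arbitrary) a producer runs far ahead of a slow consumer, so
unreceived envelopes accumulate at the destination; under GER the
producer is throttled or receives an error instead of causing a silent
remote overflow.

    #include <unistd.h>
    #include <mpi.h>

    #define NMSGS 100000   /* arbitrary, but enough to stress buffering */

    int main(int argc, char **argv)
    {
        int rank, i, value = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Producer: each unreceived standard-mode send consumes
               envelope buffering at the destination. */
            for (i = 0; i < NMSGS; ++i)
                MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Slow consumer: the delay lets the producer run far ahead. */
            sleep(30);
            for (i = 0; i < NMSGS; ++i)
                MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        }

        MPI_Finalize();
        return 0;
    }
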
Dynamic LAM Nodes and Fault Tolerance
The set of nodes that constitute a multicomputer is no longer static after
start-up with lamboot(1). Two new commands, lamgrow(1) and lamshrink(1),
add and delete nodes at any time. A resource manager and job scheduler
could use these commands to respond to changing processor availability.
lamshrink(1) has the option to notify processes on the doomed processor
with a new signal (LAM_SIGFUSE) and give them a grace period to terminate.
All LAM nodes can watch each other for possible failure. When a failure
is detected, all remaining nodes adjust their network information to the
lesser system size and notify all application processes with a new signal
(LAM_SIGSHRINK). Further attempts by any process to use the dead node
are reported as errors.
Dynamic MPI Processes
In anticipation of MPI-2, LAM 6.0 allows MPI applications to create
new processes and establish communicators with them. All of LAM's
monitoring and debugging commands and tools can cope with the
presence of multiple MPI world communicators.
- MPIL_Spawn(2)
  Like mpirun(1), a single executable program or an application schema
  file is used to specify and assign resources to a group of processes
  under a new world communicator. The routine is collective over the
  group of parent processes, and an inter-communicator is returned for
  communication with the children.
- MPIL_Comm_parent(2)
  The children call this routine either to get an inter-communicator
  connecting them to the parent processes or to learn that they were
  created by mpirun(1).
- MPIL_Universe_size(2)
  This routine returns the current number of nodes in the LAM session.
  Call it once to determine how many processes to spawn. Call it
  periodically to learn if more nodes are available for further spawns.
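A minimal sketch of how these routines might be combined is shown
below. The argument lists are assumptions made for illustration, as is
the program name "worker"; the manual pages cited above give the
authoritative interfaces.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm parent, children;
        int nnodes;

        MPI_Init(&argc, &argv);

        /* Assumed to return MPI_COMM_NULL when this process was started
           by mpirun(1) rather than by a spawning parent. */
        MPIL_Comm_parent(&parent);

        if (parent == MPI_COMM_NULL) {
            /* Parent side: see how many nodes the LAM session currently
               has, then spawn children under a new world communicator
               (collective call; argument order assumed). */
            MPIL_Universe_size(&nnodes);
            printf("spawning across %d nodes\n", nnodes);
            MPIL_Spawn(MPI_COMM_WORLD, "worker", 0, &children);
        } else {
            /* Child side: "parent" is the inter-communicator back to
               the processes that spawned us. */
        }

        MPI_Finalize();
        return 0;
    }
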
Improved Performance
A completely rebuilt client-daemon interface makes every part of
LAM/MPI faster. The biggest performance gains in message-passing
will come from bypassing the daemon altogether, which MPI processes
can now do by simply turning on an option (-c2c) in mpirun(1).
The trade-off is reduced visibility for debugging; the expectation is
that the default daemon-based communication will be used while
developing an application and direct communication for production runs.
The direct communication module in LAM is driven by only six function
calls. The only implementation of this clear, concise module in
LAM 6.0 is based on sockets and TCP/IP for portability. It is
straightforward to replace this module with another six-call
implementation that targets a specific architecture. Future releases
will include a combined TCP/IP and shared-memory implementation to optimally
handle clustered SMP architectures. Ohio Supercomputer Center is
interested in collaborating with other groups on this project.
Ohio Supercomputer Center, lam@tbag.osc.edu, http://www.osc.edu/lam.html