OpenMPI for Dummies
OpenMPI has a huge documentation base, but I thought a short, summarized document might be helpful. To read more, visit:
OpenMPI abstraction layer architecture
OpenMPI has three abstraction layers:
- Open, Portable Access Layer (OPAL): OPAL is the bottom layer of Open MPI’s abstractions. Its abstractions are focused on individual processes (versus parallel jobs). It provides utility and glue code such as generic linked lists, string manipulation, debugging controls, and other mundane—yet necessary—functionality. OPAL also provides Open MPI’s core portability between different operating systems, such as discovering IP interfaces, sharing memory between processes on the same server, processor and memory affinity, high-precision timers, etc.
- Open MPI Run-Time Environment (ORTE) (pronounced “or-tay”): An MPI implementation must provide not only the required message passing API, but also an accompanying run-time system to launch, monitor, and kill parallel jobs. In Open MPI’s case, a parallel job comprises one or more processes that may span multiple operating system instances and are bound together to act as a single, cohesive unit. In Open MPI 5.0, ORTE is replaced by PRRTE.
- Open MPI (OMPI): The MPI layer is the highest abstraction layer and is the only one exposed to applications. The MPI API is implemented in this layer, as are all the message passing semantics defined by the MPI standard.
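A quick way to see these layers in an existing installation is to look at the shared libraries an MPI program links against. The sketch below assumes an Open MPI 4.x build installed under /opt/openmpi; library names, paths, and version suffixes are illustrative and vary per installation (and in 5.x the ORTE library is gone, since PRRTE takes over that role):
shell$ ldd hello_world_mpi | grep -E 'libmpi|libopen'
        libmpi.so.40 => /opt/openmpi/lib/libmpi.so.40          # OMPI layer
        libopen-rte.so.40 => /opt/openmpi/lib/libopen-rte.so.40  # ORTE layer
        libopen-pal.so.40 => /opt/openmpi/lib/libopen-pal.so.40  # OPAL layer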
OpenMPI Plugin architecture
The component concept is utilized throughout all three layers of Open MPI, and in each layer, there are many different types of components. Each type of component is enclosed in a framework. A component belongs to exactly one framework, and a framework supports exactly one kind of component.
Open MPI’s set of layers, frameworks, and components is referred to as the Modular Component Architecture (MCA).
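You can list the frameworks and their available components with the ompi_info command. For example, to see which components exist for the btl framework (output abbreviated; version numbers are illustrative and depend on your build):
shell$ ompi_info | grep ' btl:'
                 MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.0.0)
                 MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.0.0)
                 MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.0.0)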
MPI layer frameworks
Here is a list of all the component frameworks in the MPI layer of Open MPI:
- bml: BTL management layer
- coll: MPI collective algorithms
- fbtl: file byte transfer layer: abstraction for individual blocking and non-blocking read and write operations
- fcoll: Collective read and write operations for MPI I/O
- fs: File system functions for MPI I/O
- hook: Make calls at various points of the MPI process life-cycle
- io: MPI I/O
- mtl: Matching transport layer, used for MPI point-to-point messages on some types of networks
- op: Back end computations for intrinsic MPI_Op operators
- osc: MPI one-sided communications
- pml: MPI point-to-point management layer
- part: MPI partitioned communication
- sharedfp: shared file pointer operations for MPI I/O
- topo: MPI topology routines
- vprotocol: Protocols for the “v” PML
Miscellaneous frameworks
- allocator: Memory allocator
- backtrace: Debugging call stack backtrace support
- btl: Point-to-point Byte Transfer Layer
- dl: Dynamic loading library interface
- hwloc: Hardware locality (hwloc) versioning support
- if: OS IP interface support
- installdirs: Installation directory relocation services
- memchecker: Run-time memory checking
- memcpy: Memory copy support
- memory: Memory management hooks
- mpool: Memory pooling
- patcher: Symbol patcher hooks
- pmix: Process management interface (exascale)
- rcache: Memory registration cache
- reachable: Reachability matrix between endpoints of a given pair of hosts
- shmem: Shared memory support (NOT related to OpenSHMEM)
- smsc: Shared Memory Single Copy
- threads: Thread management and support
- timer: High-resolution timers
Framework notes
Each framework typically has one or more components that are used at run-time. For example, the btl framework is used by the MPI layer to send bytes across different types of underlying networks. The tcp btl, for example, sends messages across TCP-based networks; the ucx pml sends messages across InfiniBand-based networks.
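For instance, to force the ucx pml for a run (hello_world_mpi is just a placeholder executable; the --mca mechanism is explained below):
shell$ mpirun --mca pml ucx -np 2 hello_world_mpi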
Each component typically has some tunable parameters that can be changed at run-time. Use the ompi_info command to check a component to see what its tunable parameters are. For example:
shell$ ompi_info --param btl tcp
shows some of the parameters (and default values) for the tcp btl component (use --level to show all the parameters; see below).
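Its output is one line per parameter and looks roughly like this (abbreviated and illustrative; the exact parameters, levels, and default values depend on your Open MPI version):
                 MCA btl tcp: parameter "btl_tcp_if_include" (current value: "", data source: default, level: 1 user/basic, type: string)
                 MCA btl tcp: parameter "btl_tcp_frag_size" (current value: "65536", data source: default, level: 4 tuner/basic, type: unsigned_int)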
Note that ompi_info only shows a small number of a component’s MCA parameters by default. Each MCA parameter has a “level” value from 1 to 9, corresponding to the MPI-3 MPI_T tool interface levels. In Open MPI, we have interpreted these nine levels as three groups of three:
- End user / basic
- End user / detailed
- End user / all
- Application tuner / basic
- Application tuner / detailed
- Application tuner / all
- MPI/OpenSHMEM developer / basic
- MPI/OpenSHMEM developer / detailed
- MPI/OpenSHMEM developer / all
Here’s how the three sub-groups are defined:
- End user: Generally, these are parameters that are required for correctness, meaning that someone may need to set these just to get their MPI/OpenSHMEM application to run correctly.
- Application tuner: Generally, these are parameters that can be used to tweak MPI application performance.
- MPI/OpenSHMEM developer: Parameters that either don’t fit in the other two, or are specifically intended for debugging/development of Open MPI itself.
Each sub-group is broken down into three classifications:
- Basic: For parameters that everyone in this category will want to see.
- Detailed: Parameters that are useful, but you probably won’t need to change them often.
- All: All other parameters – probably including some fairly esoteric parameters.
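For example, to show everything up through the “application tuner / basic” group (level 4) for the tcp btl:
shell$ ompi_info --param btl tcp --level 4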
To see all available parameters for a given component, specify that ompi_info should use level 9:
shell$ ompi_info --param btl tcp --level 9
These values can be overridden at run-time in several ways. At run-time, the following locations are examined (in order) for new values of parameters:
- PREFIX/etc/openmpi-mca-params.conf: This file is intended to set any system-wide default MCA parameter values – it will apply, by default, to all users who use this Open MPI installation. The default file that is installed contains many comments explaining its format.
- $HOME/.openmpi/mca-params.conf: If this file exists, it should be in the same format as PREFIX/etc/openmpi-mca-params.conf. It is intended to provide per-user default parameter values.
- Environment variables of the form OMPI_MCA_<name> set equal to a VALUE, where <name> is the name of the parameter. For example, set the variable named OMPI_MCA_btl_tcp_frag_size to the value 65536 (Bourne-style shells):
  OMPI_MCA_btl_tcp_frag_size=65536
  export OMPI_MCA_btl_tcp_frag_size
- The mpirun/oshrun command line: --mca NAME VALUE, where NAME is the name of the parameter. For example:
  mpirun --mca btl_tcp_frag_size 65536 -np 2 hello_world_mpi
These locations are checked in order. For example, a parameter value passed on the mpirun command line will override an environment variable; an environment variable will override the system-wide defaults.
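As a sketch, a per-user $HOME/.openmpi/mca-params.conf might look like the following; the file format is plain "name = value" lines with # comments, and the parameter names here are simply the ones used in the examples above:
# per-user MCA parameter defaults (illustrative)
btl = tcp,sm,self
btl_tcp_frag_size = 65536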
Each component typically activates itself when relevant. For example, the usNIC component will detect that usNIC devices are present and will automatically be used for MPI communications. The SLURM component will automatically detect when running inside a SLURM job and activate itself. And so on. Components can be manually activated or deactivated if necessary, of course. The most common components that are manually activated, deactivated, or tuned are the btl components – components that are used for MPI point-to-point communications on many common types of networks.
For example, to ensure that only the tcp and self (process loopback) components are used for MPI communications, specify them in a comma-delimited list to the btl MCA parameter:
mpirun --mca btl tcp,self hello_world_mpi
To add shared memory support, add sm to the comma-delimited list (list order does not matter):
mpirun --mca btl tcp,sm,self hello_world_mpi
(there used to be a vader BTL for shared memory support; it was renamed to sm in Open MPI v5.0.0, but the alias vader still works as well)
To deactivate a specific component, the comma-delimited list can be prefixed with a ^ to negate it:
mpirun --mca btl ^tcp hello_world_mpi
The above command will use any btl component other than the tcp component.
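The same selection and negation syntax also works through the environment-variable form shown earlier; for example (Bourne-style shells):
shell$ export OMPI_MCA_btl=^tcp
shell$ mpirun -np 2 hello_world_mpi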