MVAPICH2 1.4 User Guide

MVAPICH Team
Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University
http://mvapich.cse.ohio-state.edu
Copyright 2003-2009
Network-Based Computing Laboratory,
headed by Dr. D. K. Panda.
All rights reserved.

Last revised: August 31, 2009

Contents

1 Overview of the Open-Source MVAPICH Project
2 How to use this User Guide?
3 MVAPICH2 1.4 Features
4 Installation Instructions
 4.1 Building from a Tarball
 4.2 Obtaining and Building the Source from Anonymous SVN
 4.3 Selecting a Process Manager
  4.3.1 Using SLURM
 4.4 Configuring a build for OpenFabrics IB/iWARP
 4.5 Configuring a build for uDAPL
 4.6 Configuring a build for QLogic InfiniPath
 4.7 Configuring a build for TCP/IP
5 Basic Usage Instructions
 5.1 Compile MPI Applications
 5.2 Run MPI Applications
  5.2.1 Run MPI Applications using mpirun_rsh (for OpenFabrics IB/iWARP, QLogic InfiniPath and uDAPL Devices)
  5.2.2 Run MPI Applications using SLURM
  5.2.3 Setting MPD Environment for Running Applications with mpiexec
  5.2.4 Run MPI Applications using mpiexec with OpenFabrics IB Device or QLogic InfiniPath Device
  5.2.5 Run MPI-2 Application with Dynamic Process Management support
  5.2.6 Run MPI Application with mpiexec using OpenFabrics iWARP Device
  5.2.7 Run MPI Application using mpiexec with uDAPL Device
  5.2.8 Run MPI Application using mpiexec with TCP/IP
  5.2.9 Run MPI applications using ADIO driver for Lustre
  5.2.10 Run MPI Applications using Shared Library Support
  5.2.11 Run MPI Application using TotalView Debugger Support
6 Advanced Usage Instructions
 6.1 Run MPI applications on Multi-Rail Configurations (for OpenFabrics IB/iWARP Devices)
 6.2 Run MPI application with Customized Optimizations (for OpenFabrics IB/iWARP Devices)
 6.3 Run MPI application with Checkpoint/Restart Support (for OpenFabrics IB Device)
 6.4 Run MPI application with RDMA CM support (for OpenFabrics IB/iWARP Devices)
 6.5 Run MPI application with Shared Memory Collectives
 6.6 Run MPI Application with Hot-Spot and Congestion Avoidance (for OpenFabrics IB Device)
 6.7 Run MPI Application with Network Fault Tolerance Support (for OpenFabrics IB Device)
 6.8 Run MPI Application with User Defined CPU (Core) Mapping
 6.9 Run MPI Application with LiMIC2
7 Obtaining MVAPICH2 Library Version Information
8 Using OSU Benchmarks
9 FAQ and Troubleshooting with MVAPICH2
 9.1 General Questions and Troubleshooting
  9.1.1 Invalid Communicators Error
  9.1.2 Are fork() and system() supported?
  9.1.3 Cannot Build with the PathScale Compiler
  9.1.4 MPI+OpenMP shows bad performance
  9.1.5 Error message “No such file or directory” when using Lustre file system
  9.1.6 My program segfaults with: File locking failed in ADIOI_Set_lock?
  9.1.7 Running MPI programs built with gfortran
  9.1.8 Does MVAPICH2 work across AMD and Intel systems?
 9.2 Failure with Job Launchers
  9.2.1 Cannot find mpd.conf
  9.2.2 The MPD mpiexec fails with “no msg recvd from mpd when expecting ack of request.”
  9.2.3 /usr/bin/env: mpispawn: No such file or directory
  9.2.4 Totalview complains that “The MPI library contains no suitable type definition for struct MPIR_PROCDESC”
 9.3 With Gen2 Interface
  9.3.1 Cannot Open HCA
  9.3.2 Checking state of IB Link
  9.3.3 Undefined reference to ibv_get_device_list
  9.3.4 Creation of CQ or QP failure
  9.3.5 Hang with the HSAM Functionality
  9.3.6 Failure with Automatic Path Migration
  9.3.7 Error opening file
  9.3.8 RDMA CM Address error
  9.3.9 RDMA CM Route error
 9.4 With Gen2-iWARP Interface
  9.4.1 Error opening file
  9.4.2 RDMA CM Address error
  9.4.3 RDMA CM Route error
  9.4.4 No Fortran interface on the MacOS platform
 9.5 With uDAPL Interface
  9.5.1 Cannot Open IA
  9.5.2 DAT Insufficient Resource
  9.5.3 Cannot Find libdat.so
  9.5.4 Cannot Find mpd.conf
  9.5.5 uDAPL over IB Does Not Scale Beyond 256 Nodes with rdma_cm Provider
 9.6 Checkpoint/Restart
10 Scalable features for Large Scale Clusters and Performance Tuning
 10.1 Job Launch Tuning
 10.2 Basic QP Resource Tuning
 10.3 RDMA Based Point-to-Point Tuning
 10.4 Shared Receive Queue (SRQ) Tuning
 10.5 eXtended Reliable Connection (XRC)
 10.6 Shared Memory Tuning
 10.7 On-demand Connection Management Tuning
 10.8 Scalable Collectives Tuning
11 MVAPICH2 Parameters
 11.1 MV2_CKPT_FILE
 11.2 MV2_CKPT_INTERVAL
 11.3 MV2_CKPT_MAX_SAVE_CKPTS
 11.4 MV2_CKPT_MPD_BASE_PORT
 11.5 MV2_CKPT_MPIEXEC_PORT
 11.6 MV2_CKPT_NO_SYNC
 11.7 MV2_CM_RECV_BUFFERS
 11.8 MV2_CM_SPIN_COUNT
 11.9 MV2_CM_TIMEOUT
 11.10 MV2_CPU_MAPPING
 11.11 MV2_DAPL_PROVIDER
 11.12 MV2_DEFAULT_MAX_SEND_WQE
 11.13 MV2_DEFAULT_MAX_RECV_WQE
 11.14 MV2_DEFAULT_MTU
 11.15 MV2_DEFAULT_PKEY
 11.16 MV2_ENABLE_AFFINITY
 11.17 MV2_FASTSSH_THRESHOLD
 11.18 MV2_GET_FALLBACK_THRESHOLD
 11.19 MV2_IBA_EAGER_THRESHOLD
 11.20 MV2_IBA_HCA
 11.21 MV2_INITIAL_PREPOST_DEPTH
 11.22 MV2_KNOMIAL_INTRA_NODE_FACTOR
 11.23 MV2_KNOMIAL_INTER_NODE_FACTOR
 11.24 MV2_KNOMIAL_2LEVEL_BCAST_THRESHOLD
 11.25 MV2_MAX_INLINE_SIZE
 11.26 MV2_MPD_RECVTIMEOUT_MULTIPLIER
 11.27 MV2_MPIRUN_TIMEOUT
 11.28 MV2_MT_DEGREE
 11.29 MV2_NDREG_ENTRIES
 11.30 MV2_NUM_HCAS
 11.31 MV2_NUM_PORTS
 11.32 MV2_NUM_QP_PER_PORT
 11.33 MV2_NUM_RDMA_BUFFER
 11.34 MV2_ON_DEMAND_THRESHOLD
 11.35 MV2_PREPOST_DEPTH
 11.36 MV2_PSM_DEBUG
 11.37 MV2_PSM_DUMP_FREQUENCY
 11.38 MV2_PUT_FALLBACK_THRESHOLD
 11.39 MV2_RDMA_CM_ARP_TIMEOUT
 11.40 MV2_RDMA_CM_MAX_PORT
 11.41 MV2_RDMA_CM_MIN_PORT
 11.42 MV2_RNDV_PROTOCOL
 11.43 MV2_R3_THRESHOLD
 11.44 MV2_R3_NOCACHE_THRESHOLD
 11.45 MV2_SHMEM_ALLREDUCE_MSG
 11.46 MV2_SHMEM_BCAST_LEADERS
 11.47 MV2_SHMEM_BCAST_MSG
 11.48 MV2_SHMEM_COLL_MAX_MSG_SIZE
 11.49 MV2_SHMEM_COLL_NUM_COMM
 11.50 MV2_SHMEM_DIR
 11.51 MV2_SHMEM_REDUCE_MSG
 11.52 MV2_SM_SCHEDULING
 11.53 MV2_SMP_USE_LIMIC2
 11.54 MV2_SRQ_LIMIT
 11.55 MV2_SRQ_SIZE
 11.56 MV2_STRIPING_THRESHOLD
 11.57 MV2_SUPPORT_DPM
 11.58 MV2_USE_APM
 11.59 MV2_USE_APM_TEST
 11.60 MV2_USE_BLOCKING
 11.61 MV2_USE_COALESCE
 11.62 MV2_USE_HSAM
 11.63 MV2_USE_IWARP_MODE
 11.64 MV2_USE_KNOMIAL_2LEVEL_BCAST
 11.65 MV2_USE_LAZY_MEM_UNREGISTER
 11.66 MV2_USE_RDMA_CM
 11.67 MV2_USE_RDMA_FAST_PATH
 11.68 MV2_USE_RDMA_ONE_SIDED
 11.69 MV2_USE_RING_STARTUP
 11.70 MV2_USE_SHARED_MEM
 11.71 MV2_USE_SHMEM_ALLREDUCE
 11.72 MV2_USE_SHMEM_BARRIER
 11.73 MV2_USE_SHMEM_BCAST
 11.74 MV2_USE_SHMEM_COLL
 11.75 MV2_USE_SHMEM_REDUCE
 11.76 MV2_USE_SRQ
 11.77 MV2_USE_XRC
 11.78 MV2_VBUF_POOL_SIZE
 11.79 MV2_VBUF_SECONDARY_POOL_SIZE
 11.80 MV2_VBUF_TOTAL_SIZE
 11.81 SMP_EAGERSIZE
 11.82 SMPI_LENGTH_QUEUE
 11.83 SMP_NUM_SEND_BUFFER
 11.84 SMP_SEND_BUF_SIZE

1 Overview of the Open-Source MVAPICH Project

InfiniBand and 10GbE/iWARP are emerging as high-performance interconnects delivering low latency and high bandwidth. They are also getting widespread acceptance due to their open standards.

MVAPICH (pronounced as “em-vah-pich”) is open-source MPI software that exploits the novel features and mechanisms of InfiniBand, iWARP and other RDMA-enabled interconnects to deliver the best performance and scalability to MPI applications. This software is developed in the Network-Based Computing Laboratory (NBCL), headed by Prof. Dhabaleswar K. (DK) Panda.

Currently, there are two versions of this MPI: MVAPICH with MPI-1 semantics and MVAPICH2 with MPI-2 semantics. This open-source MPI software project started in 2001 and a first high-performance implementation was demonstrated at the Supercomputing ’02 conference. Since then, this software has been steadily gaining acceptance in the HPC, InfiniBand and 10GigE/iWARP communities. As of August 31, 2009, more than 975 organizations (National Labs, Universities and Industry) world-wide have downloaded this software directly from OSU’s web site. In addition, many InfiniBand and 10GigE/iWARP vendors, server vendors, and systems integrators have been incorporating MVAPICH/MVAPICH2 into their software stacks and distributing it. Several InfiniBand systems using MVAPICH/MVAPICH2 have obtained positions in the TOP500 ranking. MVAPICH and MVAPICH2 are also available with the OpenFabrics Enterprise Distribution (OFED) stack. Both MVAPICH and MVAPICH2 distributions are available under BSD licensing.

More details on the MVAPICH/MVAPICH2 software, users list, mailing lists, sample performance numbers on a wide range of platforms and interconnects, a set of OSU benchmarks, related publications, and other InfiniBand- and iWARP-related projects (parallel file systems, storage, data centers) can be obtained from the following URL:

http://mvapich.cse.ohio-state.edu

This document contains necessary information for MVAPICH2 users to download, install, test, use, tune and troubleshoot MVAPICH2 1.4. As we get feedback from users and fix bugs, we introduce new tarballs and continuously update this document. Thus, we strongly encourage you to refer to our web page for updates.

2 How to use this User Guide?

This guide is designed to take the user through all the steps involved in configuring, installing, running and tuning MPI applications over InfiniBand using MVAPICH2 1.4.

In Section 3 we describe all the features in MVAPICH2 1.4. As you read through this section, please note our new features (highlighted as NEW) in the 1.4 series. Some of these features are designed to optimize specific types of MPI applications and achieve greater scalability. Section 4 describes the configuration and installation steps in detail. This section enables the user to identify specific compilation flags which can be used to turn some of the features on or off. Basic usage of MVAPICH2 is explained in Section 5. Section 6 provides instructions for running MVAPICH2 with some of the advanced features. Section 8 describes the usage of the OSU Benchmarks. If you have any problems using MVAPICH2, please check Section 9 where we list some of the common problems people face. In Section 10 we suggest some tuning techniques for multi-thousand node clusters using some of our new features. Finally, in Section 11 we list all important run-time parameters, their default values, and a short description of what each parameter stands for.

3 MVAPICH2 1.4 Features

MVAPICH2 (MPI-2 over InfiniBand) is an MPI-2 implementation based on MPICH2 ADI3 layer. It also supports all MPI-1 functionalities. MVAPICH2 1.4 is available as a single integrated package (with MPICH2 1.0.8p1).

The current release supports the following five underlying transport interfaces:

  • OpenFabrics Gen2-IB: This interface supports all InfiniBand compliant devices based on the OpenFabrics Gen2 layer. This interface has the most features and is most widely used. For example, this interface can be used over all Mellanox InfiniBand adapters, IBM eHCA adapters and Qlogic adapters.
  • OpenFabrics Gen2-iWARP: This interface supports all iWARP compliant devices supported by OpenFabrics. For example, this layer supports Chelsio T3 adapters with the native iWARP mode.
  • (NEW) QLogic InfiniPath: This interface provides native support for InfiniPath adapters from QLogic over PSM interface. It provides high-performance point-to-point communication for both one-sided and two-sided operations.
  • uDAPL: This interface supports all network-adapters and software stacks which implement the portable DAPL interface from the DAT Collaborative. For example, this interface can be used over all Mellanox adapters, Chelsio adapters and NetEffect adapters. It can also be used with Solaris uDAPL-IBTL implementation over InfiniBand adapters.
  • TCP/IP: The standard TCP/IP interface (provided by MPICH2) to work with a range of network adapters supporting TCP/IP interface. This interface can be used with IPoIB (TCP/IP over InfiniBand network) support of InfiniBand also. However, it will not deliver good performance/scalability as compared to the other interfaces.

Please note that the support for VAPI interface has been deprecated since MVAPICH2 1.2 because OpenFabrics interface is getting more popular. MVAPICH2 users still using VAPI interface are strongly requested to migrate to the OpenFabrics-IB interface.

MVAPICH2-1.4 delivers better performance (especially with one-copy intra-node communication support with LiMIC2) compared to MVAPICH 1.1, the latest release package of MVAPICH supporting MPI-1 standard. In addition, MVAPICH2 1.4 provides support and optimizations for other MPI-2 features, multi-threading and fault-tolerance (Checkpoint-restart). A complete set of features of MVAPICH2 1.4 are:

  • Design for scaling to multi-thousand nodes with highest performance and reduced memory usage
    • (NEW) Support for MPI-2 Dynamic Process Management on InfiniBand clusters
    • (NEW) eXtended Reliable Connection (XRC) support
    • (NEW) Multiple CQ-based design for Chelsio 10GigE/iWARP
    • Scalable and robust daemon-less job startup
      • Enhanced and robust mpirun_rsh framework (non-MPD-based) to provide scalable job launching on multi-thousand core clusters
      • (NEW) Hierarchical ssh to nodes to speedup job start-up
      • Available for OpenFabrics (IB and iWARP) and uDAPL interfaces (including Solaris)
    • On-demand Connection Management: This feature enables InfiniBand connections to be setup dynamically, enhancing the scalability of MVAPICH2 on clusters of thousands of nodes
      • Native InfiniBand Unreliable Datagram (UD) based asynchronous connection management for OpenFabrics Gen2-IB interface
      • RDMA CM based on-demand connection management for OpenFabrics Gen2-iWARP and OpenFabrics Gen2-IB interfaces
      • uDAPL on-demand connection management based on standard uDAPL interface
    • Message coalescing support to enable reduction of per Queue-pair send queues for reduction in memory requirement on large scale clusters. This design also increases the small message messaging rate significantly. Available for OpenFabrics Gen2-IB interface
    • Hot-Spot Avoidance Mechanism (HSAM) for alleviating network congestion in large scale clusters. Available for OpenFabrics Gen2-IB interface
    • RDMA Read utilized for increased overlap of computation and communication for OpenFabrics device. Available for OpenFabrics Gen2-IB and iWARP interfaces
    • Shared Receive Queue (SRQ) with flow control. This design uses significantly less memory for MPI library. Available for OpenFabrics Gen2-IB interface.
    • Adaptive RDMA Fast Path with Polling Set for low-latency messaging. Available for OpenFabrics Gen2-IB and iWARP interfaces.
    • Enhanced scalability for RDMA-based direct one-sided communication with less communication resource. Available for OpenFabrics (IB and iWARP) interfaces.
  • (NEW) Dynamic Process Management (DPM) support with mpirun_rsh framework. Available for OpenFabrics IB interface.
  • Fault tolerance support
    • Checkpoint-restart support for application transparent systems-level fault tolerance. BLCR-based support using OpenFabrics Gen2-IB interface.
      • (NEW) Scalable Checkpoint-restart with mpirun_rsh framework
      • (NEW) Scalable Checkpoint-restart with Fault Tolerance Backplane (FTB) framework
      • Checkpoint-restart with intra-node shared memory (user-level) support
      • (NEW) Checkpoint-restart with intra-node shared memory (kernel-level with LiMIC2) support
      • (NEW) Checkpoint-restart with Fault-Tolerant Backplane (FTB-CR) support
      • Allows best performance and scalability with fault-tolerance support
    • Application-initiated system-level checkpointing is also supported. User application can request a whole program checkpoint synchronously by calling special MVAPICH2 functions.
      • Flexible interface to work with different file systems. Tested with ext3 (local disk), NFS and PVFS2
    • Network-Level fault tolerance with Automatic Path Migration (APM) for tolerating intermittent network failures over InfiniBand
  • Enhancement to software installation
    • Full autoconf-based configuration
    • Automatically detects system architecture and adapter types and optimizes MVAPICH2 for any particular installation
    • An application (mpiname) for querying the MVAPICH2 library version and configuration information
  • Optimized intra-node communication support by taking advantage of shared-memory communication. Available for all interfaces.
    • (NEW) Kernel-level single-copy intra-node communication solution based on LiMIC2
      • LiMIC2 is designed and developed by System Software Laboratory at Konkuk University, Korea
    • Efficient Buffer Organization for Memory Scalability of Intra-node Communication
    • Multi-core optimized.
    • Optimized for Bus-based SMP and NUMA-Based SMP systems
    • Efficient support for diskless clusters
    • Enhanced processor affinity using PLPA for multi-core architectures
      • Allows user-defined flexible processor affinity
  • Shared memory optimizations for collective communication operations. Available for all interfaces.
    • (NEW) K-nomial tree-based solution together with shared memory-based broadcast for scalable MPI_Bcast operations
    • Optimized and tuned MPI_Alltoall
    • Efficient algorithms and optimizations for barrier, reduce and all-reduce operations
  • Integrated multi-rail communication support. Available for OpenFabrics Gen2-IB and iWARP interfaces.
    • Multiple queue pairs per port
    • Multiple ports per adapter
    • Multiple adapters
    • Support for both one-sided and point-to-point operations
    • Support for OpenFabrics Gen2-iWARP interface and RDMA CM (for Gen2-IB).
  • Multi-threading support. Available for all interfaces, including TCP/IP.
  • High-performance optimized and scalable support for one-sided communication: Put, Get and Accumulate. Supported synchronization calls: Fence, Active Target, Passive (lock and unlock). Available for all interfaces.
    • Direct RDMA based One-sided communication support for OpenFabrics Gen2-iWARP and RDMA CM (with Gen2-IB)
    • Enhanced scalability for RDMA-based direct one-sided communication with less communication resource
  • Two modes of communication progress
    • Polling
    • Blocking (enables running multiple MPI processes/processor). Available for Open Fabrics Gen2-IB interface.
  • Scalable job startup schemes
    • Enhanced and robust mpirun_rsh framework
    • (NEW) Hierarchical ssh-based schemes to nodes
    • Using in-band IB communication with MPD
    • Support for SLURM
  • Advanced AVL tree-based Resource-aware registration cache
  • Memory Hook Support provided by integration with ptmalloc2 library. This provides safe release of memory to the Operating System and is expected to benefit the memory usage of applications that heavily use malloc and free operations.
  • High Performance and Portable Support for multiple networks and operating systems through uDAPL interface.
    • InfiniBand (tested with)
      • uDAPL over OpenFabrics Gen2-IB on Linux
      • uDAPL over IBTL on Solaris

    This uDAPL support is generic and can work with other networks that provide uDAPL interface. Please note that the stability and performance of MVAPICH2 with uDAPL depends on the stability and performance of the uDAPL library used. Starting from version 1.2, MVAPICH2 supports both uDAPL v2 and v1.2 on Linux.

  • Support for TotalView debugger with mpirun_rsh framework.
  • Shared Library Support for existing binary MPI application programs to run
  • ROMIO Support for MPI-IO
    • Optimized, high-performance ADIO driver for Lustre
  • Single code base for the following platforms (Architecture, OS, Compilers, Devices and InfiniBand adapters)
    • Architecture: (tested with) EM64T, Opteron and IA-32; IBM PPC and Mac G5
    • Operating Systems: (tested with) Linux and Solaris; and Mac OSX
    • Compilers: (tested with) gcc, intel, pathscale, pgi and sun studio
    • Devices: (tested with) OpenFabrics Gen2-IB, OpenFabrics Gen2-iWARP, and uDAPL; and TCP/IP
    • InfiniBand adapters (tested with):
      • Mellanox adapters with PCI-X and PCI-Express (SDR and DDR with mem-full and mem-free cards)
      • Mellanox ConnectX (DDR)
      • Mellanox ConnectX (QDR) with PCI-Express Gen2
      • (NEW) QLogic adapter (SDR)
      • (NEW) QLogic adapter (DDR) with PCI-Express Gen2
    • 10GigE adapters:
      • (tested with) Chelsio T3 adapter with iWARP support

The MVAPICH2 1.4 package and the project also includes the following provisions:

  • Public SVN access of the codebase
  • A set of micro-benchmarks (including multi-threading latency test) for carrying out MPI-level performance evaluation after the installation
  • Public mvapich-discuss mailing list for mvapich users to
    • Ask for help and support from each other and get prompt response
    • Enable users and developers to contribute patches and enhancements

4 Installation Instructions

The MVAPICH2 installation process is designed to enable the most widely utilized features on the target build OS by default. Supported operating systems include Linux and Solaris. The default interface is OpenFabrics IB/iWARP on Linux and uDAPL on Solaris. uDAPL, QLogic InfiniPath and TCP/IP devices can also be explicitly selected on Linux. The installation section provides generic instructions for building from a Tarball or our latest sources. Please see the subsection for the device you are targeting for specific configuration instructions.

4.1 Building from a Tarball

The MVAPICH2 1.4 source code package includes MPICH2 1.0.8p1. All the required files are present as a single tarball. Download the most recent distribution tarball from:

 http://mvapich.cse.ohio-state.edu/download/mvapich2

Unpack the tarball and use the standard GNU procedure to compile:

$ tar xzf mvapich2-1.4.tar.gz
$ cd mvapich2-1.4
$ ./configure
$ make
$ make install

4.2 Obtaining and Building the Source from Anonymous SVN

These instructions assume you have already installed subversion.

The MVAPICH2 SVN repository is available at:

 https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich2

Please keep in mind the following guidelines before deciding which version to check out:

  • “branches/1.4” is a stable version with bug fixes. New features are not added to this branch.
    • To obtain the source code from branches/1.4:

      $ svn co https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich2/branches/1.4 mvapich2

  • “trunk” will contain the latest source code as we enhance and improve MVAPICH2. It may contain newer features and bug fixes, but is lightly tested.
    • To obtain the source code from trunk:

      $ svn co https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich2/trunk mvapich2

  • “tags/1.4” is the exact version released with no updates for bug fixes or new features.
    • To obtain the source code from tags/1.4:

      $ svn co https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich2/tags/1.4 mvapich2

The mvapich2 directory under your present working directory contains a working copy of the MVAPICH2 source code. Now that you have obtained a copy of the source code, you need to update the files in the source tree:

$ cd mvapich2
$ maint/updatefiles

This script will generate all of the source and configuration files you need to build MVAPICH2. If the command "autoconf" on your machine does not run autoconf 2.59 or later, but you do have a new enough autoconf available, then you can specify the correct one with the AUTOCONF environment variable (the AUTOHEADER environment variable is similar). Once you’ve prepared the working copy by running maint/updatefiles, just follow the usual configuration and build procedure:

$ ./configure
$ make
$ make install

4.3 Selecting a Process Manager

With this release of MVAPICH2, the mpirun_rsh/mpispawn framework from the MVAPICH distribution is now provided as an alternative to mpd/mpiexec. By default both process managers are installed.

The mpirun_rsh/mpispawn framework launches jobs on demand in a manner more scalable than mpd/mpiexec. Using mpirun_rsh also alleviates the need to start daemons in advance on nodes used for MPI jobs.

4.3.1 Using SLURM

There is now a configuration option that can be used to allow mpicc and the other MPI compiler commands to automatically link MPI programs with SLURM’s PMI library.

--with-slurm=<path to slurm installation>

4.4 Configuring a build for OpenFabrics IB/iWARP

OpenFabrics IB/iWARP is the default interface on Linux. It can be explicitly selected by configuring with:

$ ./configure --with-rdma=gen2

Configuration Options for OpenFabrics IB/iWARP

  • Berkeley Lab Checkpoint/Restart Support
    • Default: disabled
    • Enable: --enable-blcr
      --with-blcr-libpath=path --with-blcr-include=path
  • Berkeley Lab Checkpoint/Restart Support with FTB
    • Default: disabled
    • Enable: --enable-blcr --enable-ftb
      --with-blcr-libpath=path --with-blcr-include=path
      --with-ftb-libpath=path --with-ftb-include=path
  • Header Caching
    • Default: enabled
    • Disable: --disable-header-caching
  • Path to OpenFabrics Header Files
    • Default: Your PATH
    • Specify: --with-ib-include=path
  • Path to OpenFabrics Libraries
    • Default: The system’s search path for libraries.
    • Specify: --with-ib-libpath=path
  • Support for RDMA CM
    • Default: enabled, except when BLCR support is enabled
    • Disable: --disable-rdma-cm
  • Registration Cache
    • Default: enabled
    • Disable: --disable-registration-cache
  • ADIO driver for Lustre: When compiled with this support, MVAPICH2 will use the optimized driver for Lustre. In order to enable this feature, the flags
    --enable-romio --with-file-system=lustre
    should be passed to configure (--enable-romio is optional as it is enabled by default). You can add support for more file systems using
    --enable-romio --with-file-system=lustre+nfs+pvfs2
  • LiMIC2 Support
    • Default: disabled
    • Enable:
      --with-limic2[=<path to LiMIC2 installation>]
      --with-limic2-include=<path to LiMIC2 headers>
      --with-limic2-libpath=<path to LiMIC2 library>
  • eXtended Reliable Connection
    • Default: disabled
    • Enable: --enable-xrc

4.5 Configuring a build for uDAPL

The uDAPL interface is the default on Solaris. It can be explicitly selected on both Solaris and Linux by configuring with:

$ ./configure --with-rdma=udapl

Configuration options for uDAPL

  • Cluster Size
    • Default: small
    • Specify: --with-cluster-size=level
      • Where level is one of:
        • small: < 128 processor cores
        • medium: 128 – 1024 cores
        • large: > 1024 cores
  • Path to the DAPL Header Files
    • Default: Your PATH
    • Specify: --with-dapl-include=path
  • Path to the DAPL Library
    • Default: The system’s search path for libraries.
    • Specify: --with-dapl-libpath=path
  • Default DAPL Provider
    • Default: OpenIB-cma on Linux; ibd0 on Solaris
    • Specify: --with-dapl-provider=type
      • Where type can be found in:
        • /etc/dat.conf on Linux
        • /etc/dat/dat.conf on Solaris
  • DAPL Version
    • Default: 1.2
    • Specify: --with-dapl-version=version
  • Header Caching
    • Default: enabled
    • Disable: --disable-header-caching
  • Path to OpenFabrics Header Files
    • Default: Your PATH
    • Specify: --with-ib-include=path
  • Path to OpenFabrics Libraries
    • Default: The system’s search path for libraries.
    • Specify: --with-ib-libpath=path
  • I/O Bus
    • Default: PCI Express
    • Specify: --with-io-bus=type
      • Where type is one of:
        • PCI_EX for PCI Express
        • PCI_X for PCI-X
  • Link Speed
    • Default: SDR
    • Specify: --with-link=type
      • Where type is one of:
        • DDR
        • SDR
  • Registration Cache
    • Default: enabled on Linux; enabled and not configurable on Solaris
    • Disable (Linux only): --disable-registration-cache

4.6 Configuring a build for QLogic InfiniPath

The QLogic PSM interface needs to be built to use MVAPICH2 on InfiniPath adapters. It can be built with:

$ ./configure --with-device=ch3:psm

Configuration options for QLogic PSM channel

  • Path to QLogic PSM header files
    • Default: The system’s search path for header files
    • Specify: --with-psm-include=path
  • Path to QLogic PSM library
    • Default: The system’s search path for libraries
    • Specify: --with-psm=path

To build and install the library we will need to run:

$ make

$ make install

4.7 Configuring a build for TCP/IP

The use of TCP/IP requires the explicit selection of a TCP/IP enabled channel. The recommended channel is ch3:sock and it can be selected by configuring with:

$ ./configure --with-device=ch3:sock

Additional instructions for configuring with TCP/IP can be found in the MPICH2 documentation available at:

 http://www.mcs.anl.gov/research/projects/mpich2/documentation/index.php?s=docs

5 Basic Usage Instructions

5.1 Compile MPI Applications

MVAPICH2 provides a variety of MPI compilers to support applications written in different programming languages. Please use mpicc, mpif77, mpiCC, or mpif90 to compile applications. The correct compiler should be selected depending upon the programming language of your MPI application.

These compilers are available in the MVAPICH2_HOME/bin directory. If you specified a different installation directory through $PREFIX, then all the above compilers will be present in the $PREFIX/bin directory.
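
As a quick sanity check of the compiler wrappers, a minimal MPI program such as the following (a hypothetical hello.c, not shipped with MVAPICH2) can be compiled with "$ mpicc -o hello hello.c" and launched with any of the methods described in the next section.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    /* Initialize MPI and report this process' rank and the job size. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}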

5.2 Run MPI Applications

5.2.1 Run MPI Applications using mpirun_rsh (for OpenFabrics IB/iWARP, QLogic InfiniPath and uDAPL Devices)

Prerequisites:

  • Either ssh or rsh should be enabled between the front nodes and the computing nodes. In addition to this setup, you should be able to login to the remote nodes without any password prompts.
  • Each hostname should resolve to the same IP address on all machines. For instance, if a machine’s hostname resolves to 127.0.0.1 due to the default /etc/hosts on some Linux distributions, this leads to incorrect behavior of the library.

Examples of running programs using mpirun_rsh:

$ mpirun_rsh -np 4 n0 n1 n2 n3 ./cpi

This command launches cpi on nodes n0, n1, n2 and n3, one process per node. By default ssh is used.

$ mpirun_rsh -rsh -np 4 n0 n1 n2 n3 ./cpi

This command launches cpi on nodes n0, n1, n2 and n3, one process per each node using rsh instead of ssh.

$ mpirun_rsh -np 4 -hostfile hosts ./cpi

A list of target nodes must be provided in the file hosts, one per line. MPI ranks are assigned in the order of the hosts listed in the hosts file or in the order they are passed to mpirun_rsh. I.e., if the nodes are listed as n0 n1 n0 n1, then n0 will have two processes, rank 0 and rank 2, whereas n1 will have rank 1 and rank 3. This rank distribution is known as “cyclic”. If the nodes are listed as n0 n0 n1 n1, then n0 will have ranks 0 and 1, whereas n1 will have ranks 2 and 3. This rank distribution is known as “block”.

Many parameters of the MPI library can be configured at run-time using environmental variables. In order to pass any environment variable to the application, simply put the variable names and values just before the executable name, like in the following example:

$ mpirun_rsh -np 4 -hostfile hosts ENV1=value ENV2=value ./cpi

Note that the environmental variables should be put immediately before the executable.

Alternatively, you may also place environmental variables in your shell environment (e.g. .bashrc). These will be automatically picked up when the application starts executing.

Note that there are many different parameters which could be used to improve the performance of applications depending upon their requirements from the MPI library. For a discussion on how to identify such parameters, see Section 10.

Other options of mpirun_rsh can be obtained using

$ mpirun_rsh --help

Note that mpirun_rsh is sensitive to the ordering of the command-line options.

5.2.2 Run MPI Applications using SLURM

SLURM is an open-source resource manager designed by Lawrence Livermore National Laboratory. SLURM software package and its related documents can be downloaded from:
http://www.llnl.gov/linux/slurm/

Once SLURM is installed and the daemons are started, applications compiled with MVAPICH2 can be launched by SLURM, e.g.

$ srun -n2 --mpi=none ./a.out

The use of SLURM enables many good features such as explicit CPU and memory binding. For example, if you have two processes and want to bind the first process to CPU 0 and Memory 0, and the second process to CPU 4 and Memory 1, then it can be achieved by:

$ srun --cpu_bind=v,map_cpu:0,4 --mem_bind=v,map_mem:0,1 -n2 --mpi=none ./a.out

For more information about SLURM and its features please visit SLURM website.

5.2.3 Setting MPD Environment for Running Applications with mpiexec

Prerequisites: ssh should be enabled between the front nodes and the computing nodes.

Please follow these steps to setup MPD:

  • Please ensure that you have .mpd.conf and .mpdpasswd in your home directory. They are typically single-line files containing the following:
    secretword=56rtG9.
    The content of .mpd.conf and .mpdpasswd should be exactly the same. (Of course, your secretword should be different from this example.)
  • Please include MPD path into your path

    $ export MPD_BIN=$MVAPICH2_HOME/bin
    $ export PATH=$MVAPICH2_HOME/bin:$PATH

    $MVAPICH2_HOME is the installation path of your MVAPICH2, as specified by $PREFIX when you configure MVAPICH2.

  • Prepare hostfile

    Specify the hostnames of the compute nodes in a file. If you have a hostfile like:

    n0
    n1
    n2
    n3

    then one process per node will be started on each one of these compute nodes.

  • Start MPD Daemons on the compute nodes

    $ mpdboot -n 4 -f hostfile

    Note: The command, mpdboot, also takes a default hostfile name mpd.hosts. If you have created the hostfile as mpd.hosts, you can omit the option “-f hostfile”.

  • Check status of MPD Daemons on the compute nodes (optional)

    $ mpdtrace

This should list all the nodes specified in the hostfile, not necessarily in the order specified in the hostfile.

Up to this point, we have described how to set up the environment, which is independent of the underlying device supported by MVAPICH2. In the next sections, we present details specific to different devices.

5.2.4 Run MPI Applications using mpiexec with OpenFabrics IB Device or QLogic InfiniPath Device

To start multiple processes, mpiexec can be used in the following fashion:

$ mpiexec -n 4 ./cpi

Four processes will be started on the compute nodes n0, n1, n2 and n3. mpiexec can also be run with several options; "$ mpiexec --help" lists all the possible options. A useful option is to specify a machinefile which holds the process-to-machine mapping. It can also be used to specify the number of processes to be run on each host. The machinefile option can be used with mpiexec as follows:

$ mpiexec -machinefile mf -n 4 ./cpi

where the machine file "mf" contains the process-to-machine mapping. For example, if you want to run all 4 processes on n0, then "mf" contains the following lines:

$ cat mf
n0
n0
n0
n0

Environmental variables can be set with mpiexec as follows:

$ mpiexec -n 4 -env ENV1 value1 -env ENV2 value2 ./cpi

Note that the environmental variables should be put immediately before the executable file. The mpiexec command also propagates exported variables in its runtime environment to all processes by default. Exporting a variable before running mpiexec has the same effect as explicitly passing its value with the -env command line option. The command above could be done in the following manner when using a Bourne shell derivative:

$ export ENV1=value1
$ export ENV2=value2
$ mpiexec -n 4 ./cpi

5.2.5 Run MPI-2 Application with Dynamic Process Management support

MVAPICH2 provides MPI-2 dynamic process management. This feature allows MPI applications to spawn new MPI processes according to MPI-2 semantics. The following commands provide an example of how to run your application.

  • To run your application using mpirun_rsh
    $ mpirun_rsh -np 2 -hostfile hosts MV2_SUPPORT_DPM=1 ./spawn1
    Note: It is necessary to provide the hostfile when running dynamic process management applications using mpirun_rsh.
  • To run your application using mpiexec
    $ mpiexec -n 2 -env MV2_SUPPORT_DPM 1 ./spawn1

Please refer to Section 11.57 for information about the MV2_SUPPORT_DPM environment variable.
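
For reference, the sketch below shows what a minimal spawning application might look like; it is only an illustration, and the child executable name ./spawn_child is hypothetical (the child side would call MPI_Comm_get_parent to obtain the intercommunicator).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm intercomm;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Collectively spawn 2 additional MPI processes running ./spawn_child
       (hypothetical executable) and obtain an intercommunicator to them. */
    MPI_Comm_spawn("./spawn_child", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

    if (rank == 0)
        printf("Parent: spawned 2 child processes\n");

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}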

5.2.6 Run MPI Application with mpiexec using OpenFabrics iWARP Device

In MVAPICH2, Gen2-iWARP support is enabled with the use of the run-time environment variable MV2_USE_IWARP_MODE.

In addition to this flag, all the systems to be used need the following one time setup for enabling RDMA CM usage.

  • Setup the RDMA CM device: RDMA CM device needs to be setup, configured with an IP address and connected to the network.
  • Setup the Local Address File: Create the file (/etc/mv2.conf) with the local IP address to be used by RDMA CM. (Multiple IP addresses can be listed (one per line) for multirail configurations).
    $ echo 10.1.1.1 >> /etc/mv2.conf

Programs can be executed as follows:

$ mpiexec -n 4 -env MV2_USE_IWARP_MODE 1 -env ENV1 value1 prog

The iWARP device also provides TotalView debugging and shared library support. Please refer to Sections 5.2.10 and 5.2.11 for shared library and TotalView support, respectively.

5.2.7 Run MPI Application using mpiexec with uDAPL Device

MVAPICH2 can be configured with the uDAPL device, as described in Section 4.5. To compile MPI applications, please refer to Section 5.1. In order to run MPI applications with uDAPL support, please specify the environment variable MV2_DAPL_PROVIDER. As an example,

$ mpiexec -n 4 -env MV2_DAPL_PROVIDER OpenIB-cma ./cpi

or:

$ export MV2_DAPL_PROVIDER=OpenIB-cma

$ mpiexec -n 4 ./cpi

Please check the /etc/dat.conf file on Linux or /etc/dat/dat.conf on Solaris to find all the available uDAPL service providers. The default value for the uDAPL provider will be chosen, if no environment variable is provided at runtime. If you are using OpenFabrics software stack on Linux, the default DAPL provider is OpenIB-cma for DAPL-1.2, and ofa-v2-ib0 for DAPL-2.0. If you are using Solaris, the default DAPL provider is ibd0.

The uDAPL device also provides TotalView debugging and shared library support. Please refer to Sections 5.2.10 and 5.2.11 for shared library and TotalView support, respectively.

5.2.8 Run MPI Application using mpiexec with TCP/IP

If you would like to run an MPI job using IPoIB, but your IB card is not the default interface for IP traffic, you have two options. For both options, assume that you have a cluster set up as follows:


#hostname   Eth Addr      IPoIB Addr
compute1    192.168.0.1   192.168.1.1
compute2    192.168.0.2   192.168.1.2
compute3    192.168.0.3   192.168.1.3
compute4    192.168.0.4   192.168.1.4

The MPI Job Uses IPoIB: In this scenario, you will start up mpd like normal. However, you will need to create a machine file for mpiexec that tells mpiexec to use a particular interface. Example:
$ cat - > $(MPD_HOSTFILE)
compute1
compute2
compute3
compute4

$ mpdboot -n 4 -f $(MPD_HOSTFILE)

$ cat - > $(MACHINE_FILE)
compute1 ifhn=192.168.1.1
compute2 ifhn=192.168.1.2
compute3 ifhn=192.168.1.3
compute4 ifhn=192.168.1.4

The ifhn portion tells mpiexec to use the interface associated with that IP address for each machine. You can now run your MPI application using IPoIB similar to the following.
$ mpiexec -n $(NUM_PROCESS) -machinefile $(MACHINE_FILE) $(MPI_APPLICATION)

Both MPD And the MPI Job Use IPoIB: In this scenario you will start up mpd in a modified fashion. However, you will not need to create a machine file for mpiexec. Your hosts file for mpdboot must contain the IP addresses, or hostnames mapped to these addresses, of each machine’s IPoIB interface. The only exception is that you do not list the IP address or hostname of the local machine. This will be specified on the command line of the mpdboot command using the --ifhn option. Example:
$ cat - > $(MPD_HOSTFILE)
192.168.1.2
192.168.1.3
192.168.1.4

$ mpdboot -n 4 -f $(MPD_HOSTFILE) --ifhn=192.168.1.1

The --ifhn option tells mpdboot to use the interface corresponding to that IP address to create the mpd ring and run MPI jobs. You can now run your MPI application using IPoIB similar to the following.
$ mpiexec -n $(NUM_PROCESS) $(MPI_APPLICATION)

Note: For both options, you can replace the IPoIB addresses with aliases.

5.2.9 Run MPI applications using ADIO driver for Lustre

MVAPICH2 contains optimized Lustre ADIO support for the OpenFabrics/Gen2 device. The Lustre directory should be mounted on all nodes on which MVAPICH2 processes will be running. Compile MVAPICH2 with ADIO support for Lustre as described in Section 4. If your Lustre mount is /mnt/datafs on nodes n0 and n1, on node n0, you can compile and run your program as follows:

$ mpicc -o perf romio/test/perf.c
$ mpirun_rsh -np 2 n0 n1 <path to perf>/perf -fname /mnt/datafs/testfile

If you have enabled support for multiple file systems, append the prefix "lustre:" to the name of the file. For example:

$ mpicc -o perf romio/test/perf.c
$ mpirun_rsh -np 2 n0 n1 ./perf -fname lustre:/mnt/datafs/testfile
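
For applications that use MPI-IO directly, the "lustre:" prefix is simply passed as part of the file name to MPI_File_open. The following is a minimal sketch (the /mnt/datafs mount point is taken from the example above); it assumes ROMIO was built with support for multiple file systems.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    MPI_File fh;
    int rank;
    char buf[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    snprintf(buf, sizeof(buf), "hello from rank %d\n", rank);

    /* The "lustre:" prefix explicitly selects the Lustre ADIO driver. */
    MPI_File_open(MPI_COMM_WORLD, "lustre:/mnt/datafs/testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at(fh, (MPI_Offset)rank * sizeof(buf), buf,
                      (int)strlen(buf), MPI_CHAR, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}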

5.2.10 Run MPI Applications using Shared Library Support

MVAPICH2 provides shared library support. This feature allows you to build your application on top of the MPI shared library. If you choose this option, you will still be able to compile applications with static libraries. By default, when shared library support is enabled, your applications will be built on top of the shared libraries automatically. The following commands provide some examples of how to build and run your application with shared library support.

  • To compile your application with shared library support. Run the following command:
    $ mpicc -o cpi cpi.c
  • To execute an application compiled with shared library support, you need to specify the path to the shared library by putting LD_LIBRARY_PATH=path-to-shared-libraries in the command line. For example:
    $ mpiexec -np 2 -env LD_LIBRARY_PATH $MVAPICH2_BUILD/lib/shared ./cpi
    or
    $ mpirun_rsh -np 2 n0 n1 LD_LIBRARY_PATH=/path/to/shared/lib ./cpi
  • To disable MVAPICH2 shared library support even if you have installed MVAPICH2, run the following command:
    $ mpicc -noshlib -o cpi cpi.c

5.2.11 Run MPI Application using TotalView Debugger Support

MVAPICH2 provides TotalView support. The following commands provide an example of how to build and run your application with TotalView support. Note: running TotalView requires a correct setup in your environment; if you encounter any problems with your setup, please check with your system administrator for help.

  • Define ssh as the TVDSVRLAUNCHCMD variable in your default shell. For example, with bash, you can do:
    $ echo "export TVDSVRLAUNCHCMD=ssh" >> $HOME/.bashrc
  • Configure MVAPICH2 with the configure options --enable-g=dbg --enable-sharedlibs=kind --enable-debuginfo in addition to the default options and then build MVAPICH2.
  • Compile your program with the -g flag:
    $ mpicc -g -o prog prog.c
  • Define the correct path to TotalView as the TOTALVIEW variable. For example, for mpirun_rsh, under the bash shell:
    $ export TOTALVIEW=<path_to_TotalView>
    or for mpiexec, under the bash shell:
    $ export MPIEXEC_TOTALVIEW=<path_to_TotalView>
  • Run your program:
    $ mpirun_rsh -tv -np 2 n0 n1 LD_LIBRARY_PATH=$MVAPICH2_BUILD/lib/shared:
    $MVAPICH2_BUILD/lib prog
    or
    $ mpiexec -tv -np 2 -env LD_LIBRARY_PATH $MVAPICH2_BUILD/lib/shared:
    $MVAPICH2_BUILD/lib prog
  • Troubleshooting:
    • X authentication errors: check if you have enabled X11 forwarding
      $ echo "ForwardX11 yes" >> $HOME/.ssh/config
    • ssh authentication error: ssh to the compute node with its long form hostname, for example, ssh i0.domain.osu.edu

6 Advanced Usage Instructions

In this section, we present the usage instructions for advanced features provided by MVAPICH2.

6.1 Run MPI applications on Multi-Rail Configurations (for OpenFabrics IB/iWARP Devices)

MVAPICH2 has integrated multi-rail support. Run-time variables are used to specify the control parameters of the multi-rail support: the number of adapters with MV2_NUM_HCAS (section 11.30), the number of ports per adapter with MV2_NUM_PORTS (section 11.31), and the number of queue pairs per port with MV2_NUM_QP_PER_PORT (section 11.32). These variables default to 1 if you do not specify them.

Large messages are striped across all HCAs. The striping threshold is (MV2_VBUF_TOTAL_SIZE × MV2_NUM_PORTS × MV2_NUM_QP_PER_PORT × MV2_NUM_HCAS).

MVAPICH2 also gives the flexibility to balance short message traffic over multiple HCAs in a multi-rail configuration. The run-time variable MV2_SM_SCHEDULING can be used to choose between the various load balancing options available. It can be set to USE_FIRST (default) or ROUND_ROBIN. In the USE_FIRST scheme, the HCA in slot 0 is always used to transmit the short messages. If ROUND_ROBIN is chosen, messages are sent across all HCAs alternately.

Following is an example to run multi-rail support with two adapters, using one port per adapter and one queue pair per port:

$ mpirun_rsh -np 2 n0 n1 MV2_NUM_HCAS=2 MV2_NUM_PORTS=1 MV2_NUM_QP_PER_PORT=1 prog
or
$ mpiexec -n 2 -env MV2_NUM_HCAS 2 -env MV2_NUM_PORTS 1 -env MV2_NUM_QP_PER_PORT 1 prog

Note that you don’t need to specify MV2_NUM_PORTS and MV2_NUM_QP_PER_PORT since they default to 1, so you can type:

$ mpirun_rsh -np 2 n0 n1 MV2_NUM_HCAS=2 prog
or
$ mpirun_rsh -np 2 n0 n1 MV2_NUM_HCAS=2 MV2_SM_SCHEDULING=ROUND_ROBIN prog
or
$ mpiexec -n 2 -env MV2_NUM_HCAS 2 prog

6.2 Run MPI application with Customized Optimizations (for OpenFabrics IB/iWARP Devices)

In MVAPICH2 1.4, run-time variables are used to switch various optimization schemes on and off. Following is a list of optimization schemes and the environment variables that control them; for a full list please refer to Section 11:

  • Adaptive RDMA fast path: using RDMA write to enhance performance for short messages. Default: on; to disable:
    $ mpirun_rsh -np 2 n0 n1 MV2_USE_RDMA_FAST_PATH=0 prog
    or
    $ mpiexec -n 2 -env MV2_USE_RDMA_FAST_PATH 0 prog
  • Shared-receive Queue: This feature is available only with Gen2-IB devices. This is targeted for using Shared Receive Queue (SRQ). Default: on; to disable:
    $ mpirun_rsh -np 2 n0 n1 MV2_USE_SRQ=0 prog
    or
    $ mpiexec -n 2 -env MV2_USE_SRQ 0 prog
  • Optimizations for one sided communication: One sided operations can be directly built on RDMA operations. Currently this scheme will be disabled if on-demand connection management is used. Default: on; to disable:
    $ mpirun_rsh -np 2 n0 n1 MV2_USE_RDMA_ONE_SIDED=0 prog
    or
    $ mpiexec -n 2 -env MV2_USE_RDMA_ONE_SIDED 0 prog
  • Lazy memory unregistration: user-level registration cache. Default: on; to disable:
    $ mpirun_rsh -np 2 n0 n1 MV2_USE_LAZY_MEM_UNREGISTER=0 prog
    or
    $ mpiexec -n 2 -env MV2_USE_LAZY_MEM_UNREGISTER 0 prog

6.3 Run MPI application with Checkpoint/Restart Support (for OpenFabrics IB Device)

MVAPICH2 provides system-level checkpoint/restart functionality for the OpenFabrics Gen2-IB interface with the option of using BLCR in standalone mode or using BLCR in conjunction with FTB support. FTB enables faults to be handled in a co-ordinated and holistic manner in the entire system, providing for an infrastructure which can be used by different software systems to exchange fault-related information.

Three methods are provided to invoke checkpointing: Manual, Automated and Application-Initiated Synchronous Checkpointing. In order to utilize the checkpoint/restart functionality, there are a few steps that need to be followed.

  • Download and install the BLCR (Berkeley Lab’s Checkpoint/Restart) package. The packages can be downloaded from the BLCR website.
  • Make sure the BLCR packages are installed on every node and the LD_LIBRARY_PATH must contain the path to the shared library of BLCR, usually $BLCR_HOME/lib.
  • MVAPICH2 needs to be compiled with checkpoint/restart support, see section 4.4.
  • BLCR’s kernel modules must be loaded on all the compute nodes.
  • Make sure the PATH contains the path to the executables of BLCR, usually $BLCR_HOME/bin.

Users are strongly encouraged to read the Administrators guide of BLCR, and test the BLCR on the target platform, before using the checkpointing feature of MVAPICH2.

If using the FTB frame-work for checkpoint/restart, in addition to the above following needs to be done.

  • Download and install the FTB (Fault Tolerance Backplane) package. The packages can be downloaded from the FTB project website.
  • Make sure the FTB packages are installed on every node and the LD_LIBRARY_PATH must contain the path to the shared library of FTB, usually $FTB_HOME/lib.
  • MVAPICH2 needs to be compiled with checkpoint/restart as well as FTB support, see section 4.4.
  • Start FTB Database server ($FTB_HOME/sbin/ftb_database_server) on one of the nodes, this node will act as server node for all the FTB agents.
  • Start FTB agents ($FTB_HOME/sbin/ftb_agent) on all the compute nodes.

Now, your system is set up to use the Checkpoint/Restart features of MVAPICH2. Several parameters are provided by MVAPICH2 for flexibility in configuring and using the Checkpoint/Restart features. If mpiexec is used as the job startup mechanism, these parameters need to be set in the user’s environment through the BASH shell’s export command, or the equivalent command for other shells. If mpirun_rsh is used as the job startup mechanism, these parameters need to be passed to mpirun_rsh through the command line.

  • MV2_CKPT_FILE: This parameter specifies the path and the base filename for checkpoint files of MPI processes. Please note that file system performance is critical to the performance of checkpointing. This parameter controls which file system will be used to store the checkpoint files. For example, if your PVFS2 is mounted at /mnt/pvfs2, using MV2_CKPT_FILE=/mnt/pvfs2/ckptfile will cause the checkpoint files to be stored in the PVFS2 file system. See Section 11.1 for details.
  • MV2_CKPT_INTERVAL: This parameter can be used to enable automatic checkpointing. See Section 11.2 for details.
  • MV2_CKPT_MAX_SAVE_CKPTS: This parameter is used to limit the number of checkpoints saved on file system. See Section 11.3 for details.
  • MV2_CKPT_NO_SYNC: This parameter is used to control whether the checkpoint files are forced to be synced to disk before the program continues execution. See Section 11.6 for details.
  • MV2_CKPT_MPD_BASE_PORT: Not applicable to mpirun_rsh. See Section 11.4 for details.
  • MV2_CKPT_MPIEXEC_PORT: Not applicable to mpirun_rsh. See Section 11.5 for details.

In order to provide maximum flexibility to end users who wish to use the checkpoint/restart features of MVAPICH2, we’ve provided three different methods which can be used to take the checkpoints during the execution of the MPI application. These methods are described as follows:

  • Manual Checkpointing: In this mode, the user simply launches an MPI application and chooses when to checkpoint it. This mode can be primarily used for experimentation during deployment stages. In order to use this mode, the MPI application is launched normally using mpiexec or mpirun_rsh. When the user decides to take a checkpoint, the user can issue the BLCR command "cr_checkpoint" with the process id (PID) of the mpiexec or mpirun_rsh process. In order to simplify the process, the script mv2_checkpoint can be used. This script is available in the same directory as mpiexec and mpirun_rsh.
  • Automated Checkpointing: In this mode, the user can launch the MPI application normally using mpiexec or mpirun_rsh. However, instead of manually issuing checkpoints as described above, a parameter (MV2_CKPT_INTERVAL) can be set to automatically take checkpoints at user-defined intervals. Please refer to Section 11.2 for a complete usage description of this variable. This mode can be used to take checkpoints of a long running application, for example every 1 hour, 2 hours, etc., based on the user’s choice.
  • Application-Initiated Synchronous Checkpointing: In this mode, the running MPI application can itself request a checkpoint. The application can request a whole program checkpoint synchronously by calling MVAPICH2_Sync_Checkpoint. Note that this is a collective operation, and the function must be called from all processes to take the checkpoint. This mode is expected to be used by applications that can be modified and have well defined synchronization points. These points can be effectively used to take checkpoints. An example of how this mode can be activated is given below.
#include "mpi.h"
#include <unistd.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    printf("Computation\n");
    sleep(5);
    MPI_Barrier(MPI_COMM_WORLD);
    MVAPICH2_Sync_Checkpoint();
    MPI_Barrier(MPI_COMM_WORLD);
    printf("Computation\n");
    sleep(5);
    MPI_Finalize();
    return 0;
}

To restart a job from a checkpoint, users need to issue another BLCR command, "cr_restart", with the checkpoint file name of the MPI job console as the parameter, usually context.<pid>. The checkpoint file name of the MPI job console can be specified when issuing the checkpoint; see "cr_checkpoint --help" for more information. Please note that the names of the checkpoint files of the MPI processes will be assigned according to the environment variable MV2_CKPT_FILE ($MV2_CKPT_FILE.<number of checkpoint>.<process rank>).

Please refer to the Section 9.6 for troubleshooting with Checkpoint/Restart.

6.4 Run MPI application with RDMA CM support (for OpenFabrics IB/iWARP Devices)

In MVAPICH2, to use RDMA CM the run-time variable MV2_USE_RDMA_CM needs to be set, as described in Section 11.

In addition to these flags, all the systems to be used need the following one time setup for enabling RDMA CM usage.

  • Setup the RDMA CM device: RDMA CM device needs to be setup, configured with an IP address and connected to the network.
  • Setup the Local Address File: Create the file (/etc/mv2.conf) with the local IP address to be used by RDMA CM. (Multiple IP addresses can be listed (one per line) for multirail configurations).
    $ echo 10.1.1.1 >> /etc/mv2.conf

Programs can be executed as follows:
$ mpirun_rsh -np 2 n0 n1 MV2_USE_RDMA_CM=1 prog
or
$ mpiexec -n 2 -env MV2_USE_RDMA_CM 1 prog

6.5 Run MPI application with Shared Memory Collectives

In MVAPICH2, support for shared memory based collectives has been enabled for MPI applications running over OpenFabrics Gen2-IB, Gen2-iWARP and uDAPL stack. Currently, this support is available for the following collective operations:

  • MPI_Allreduce
  • MPI_Reduce
  • MPI_Barrier
  • MPI_Bcast

Optionally, these features can be turned off at runtime by using the following parameters:

  • MV2_USE_SHMEM_COLL (section 11.74)
  • MV2_USE_SHMEM_ALLREDUCE (section 11.71)
  • MV2_USE_SHMEM_REDUCE (section 11.75)
  • MV2_USE_SHMEM_BARRIER (section 11.72)
  • MV2_USE_SHMEM_BCAST (section 11.73)

Please refer to Section 11 for further details.
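
No changes to application code are needed to benefit from these optimizations; the shared memory paths are used automatically for the intra-node portions of the collectives listed above. The following is a minimal sketch that exercises them.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Broadcast a value from rank 0, then sum it across all ranks. */
    value = (rank == 0) ? 42 : 0;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("broadcast value = %d, allreduce sum = %d\n", value, sum);

    MPI_Finalize();
    return 0;
}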

6.6 Run MPI Application with Hot-Spot and Congestion Avoidance (for OpenFabrics IB Device)

MVAPICH2 supports hot-spot and congestion avoidance using the InfiniBand multi-pathing mechanism. This support is available for MPI applications using the OpenFabrics stack and InfiniBand adapters.

To enable this functionality, a run-time variable, MV2_USE_HSAM (Section 11.62) can be enabled, as shown in the following example:
$ mpirun_rsh -np 2 n0 n1 MV2_USE_HSAM=1 ./cpi
or
$ mpiexec -n 2 -env MV2_USE_HSAM 1 ./cpi

This functionality automatically defines the number of paths for hot-spot avoidance. Alternatively, the maximum number of paths to be used between a pair of processes can be defined by using a run-time variable MV2_NUM_QP_PER_PORT (Section 11.32).

We expect this functionality to show benefits in the presence of at least partially non-overlapping paths in the network. OpenSM, the subnet manager distributed with OpenFabrics, supports the LMC mechanism, which can be used to create multiple paths:

$ opensm -l4

will start the subnet manager with an LMC value of four, creating sixteen paths between every pair of nodes.

6.7 Run MPI Application with Network Fault Tolerance Support (for OpenFabrics IB Device)

MVAPICH2 supports network fault recovery by using the InfiniBand Automatic Path Migration mechanism. This support is available for MPI applications using the OpenFabrics stack and InfiniBand adapters.

To enable this functionality, a run-time variable, MV2_USE_APM (section 11.58) can be enabled, as shown in the following example:
$ mpirun_rsh -np 2 n0 n1 MV2_USE_APM=1 ./cpi
or
$ mpiexec -n 2 -env MV2_USE_APM 1 ./cpi

MVAPICH2 also supports testing Automatic Path Migration in the subnet in the absence of network faults. This can be controlled by using a run-time variable MV2_USE_APM_TEST (section 11.59). This should be combined with MV2_USE_APM as follows:
$ mpirun_rsh -np 2 n0 n1 MV2_USE_APM=1 MV2_USE_APM_TEST=1 ./cpi
or
$ mpiexec -n 2 -env MV2_USE_APM 1 -env MV2_USE_APM_TEST 1 ./cpi

6.8 Run MPI Application with User Defined CPU (Core) Mapping

MVAPICH2 supports user defined CPU mapping through Portable Linux Processor Affinity (PLPA) library (http://www.open-mpi.org/projects/plpa/). The feature is especially useful on multi-core systems, where performance may be different if processes are mapped to different cores. The mapping can be specified by setting the environment variable MV2_CPU_MAPPING.

For example, if you want to run 4 processes per node and utilize cores 0, 1, 4, 5 on each node, you can specify:

$ mpirun_rsh -np 64 -hostfile hosts MV2_CPU_MAPPING=0:1:4:5 ./a.out

or

$ mpiexec -n 64 -env MV2_CPU_MAPPING 0:1:4:5 ./a.out

In this way, process 0 on each node will be mapped to core 0, process 1 will be mapped to core 1, process 2 will be mapped to core 4, and process 3 will be mapped to core 5. The core bindings for the individual processes are separated by a single “:”.

PLPA supports more flexible notations when specifying core mapping. More details can be found at:

http://www.open-mpi.org/community/lists/plpa-users/2007/04/0035.php

6.9 Run MPI Application with LiMIC2

MVAPICH2 supports LiMIC2 for intra-node communication for medium and large messages to get higher performance. It is disabled by default because it requires the LiMIC2 package to be installed beforehand. As a convenience, we have distributed the latest LiMIC2 package (as of this release) with our sources.

To install this package, please take the following steps.

  • Navigate to the LiMIC2 source
    $ cd limic2-0.5.2
  • Configure and build the source
    limic2-0.5.2$ ./configure --enable-module --sysconfdir=/etc && make
  • Install
    limic2-0.5.2$ sudo make install

Before using LiMIC2 you’ll need to load the kernel module. If you followed the instructions above you can do this using the following command (LSB init script).

  • $ /etc/init.d/limic start

Please note that supplying ‘--sysconfdir=/etc’ in the configure line above tells the package to install the init script and a udev rule in the standard location for system packages. This is optional but recommended.

Now you can use LiMIC2 with MVAPICH2 by simply supplying the ‘--with-limic2’ option when configuring MVAPICH2. You can run your applications as normal and LiMIC2 will be used for medium and large intra-node messages. To disable it at run time, use the environment variable:

$ mpirun_rsh -np 64 -hostfile hosts MV2_SMP_USE_LIMIC2=0 ./a.out

7 Obtaining MVAPICH2 Library Version Information

The mpiname application is provided with MVAPICH2 to assist with determining the MPI library version and related information. The usage of mpiname is as follows:

Usage: [OPTION]…

Print MPI library information. With no OPTION, the output is the same as -v.

-a print all information

-c print compilers

-d print device

-h display this help and exit

-n print the MPI name

-o print configuration options

-r print release date

-v print library version
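
For example, to print all of the information listed above, one can run:

$ mpiname -a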

8 Using OSU Benchmarks

If you have arrived at this point, you have successfully installed MVAPICH2. Congratulations!! In the mvapich2-1.4/osu_benchmarks directory, we provide these basic performance tests:

  • One-way latency test (osu_latency.c)
  • One-way bandwidth test (osu_bw.c)
  • Bi-directional bandwidth (osu_bibw.c)
  • Multiple Bandwidth / Message Rate test (osu_mbw_mr.c)
  • One-sided put latency (osu_put_latency.c)
  • One-sided put bandwidth (osu_put_bw.c)
  • One-sided put bi-directional bandwidth (osu_put_bibw.c)
  • One-sided get latency (osu_get_latency.c)
  • One-sided get bandwidth (osu_get_bw.c)
  • One-sided accumulate latency (osu_acc_latency.c)
  • One-way multi-threaded latency test (osu_latency_mt.c) – Multi-threading support must be compiled in to run this test.
  • Broadcast test (osu_bcast.c)
  • Alltoall test (osu_alltoall.c)

The benchmarks are also periodically updated. The latest copy of the benchmarks can be downloaded from http://mvapich.cse.ohio-state.edu/benchmarks/. Sample performance numbers for these benchmarks on representative platforms with InfiniBand and iWARP adapters are also included on our project’s web page. You are welcome to compare your performance numbers with our numbers. If you see any big discrepancy, please let us know by sending an email to mvapich-discuss@cse.ohio-state.edu.

9 FAQ and Troubleshooting with MVAPICH2

Based on our experience and feedback we have received from our users, here we include some of the problems a user may experience and the steps to resolve them. If you are experiencing any other problem, please feel free to contact us by sending an email to mvapich-discuss@cse.ohio-state.edu.

MVAPICH2 can be used over five underlying transport interfaces, namely OpenFabrics (Gen2), OpenFabrics (Gen2-iWARP), QLogic InfiniPath, uDAPL and TCP/IP. Based on the underlying library being utilized, the troubleshooting steps may be different. However, some of the troubleshooting hints are common for all underlying libraries. Thus, in this section, we have grouped the troubleshooting tips into general questions, job launcher failures, interface-specific issues (Gen2, Gen2-iWARP and uDAPL) and Checkpoint/Restart.

9.1 General Questions and Troubleshooting

9.1.1 Invalid Communicators Error

This is a problem which typically occurs due to the presence of multiple installations of MVAPICH2 on the same set of nodes. The problem is caused by including an mpi.h other than the one belonging to the installation used for executing the program. It can be resolved by making sure that the mpi.h from the other installation is not included.

9.1.2 Are fork() and system() supported?

fork() and system() are supported for the OpenFabrics device as long as the kernel being used is Linux 2.6.16 or newer. Additionally, the version of OFED used should be 1.2 or higher. The environment variable IBV_FORK_SAFE=1 must also be set to enable fork support.
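
For example, one way to set this variable when launching with mpirun_rsh could be:

$ mpirun_rsh -np 2 n0 n1 IBV_FORK_SAFE=1 ./a.out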

9.1.3 Cannot Build with the PathScale Compiler

There is a known bug with the PathScale compiler (before version 2.5) when building MVAPICH2. This problem will be solved in the next major release of the PathScale compiler. To work around this bug, use the “-LNO:simd=0” C compiler option. This can be set in the build script similarly to:

export CC="pathcc -LNO:simd=0"

Please note the use of double quotes. If you are building shared libraries and are using the PathScale compiler (version below 2.5), then you should add “-g” to your CFLAGS, in order to get around a compiler bug.

9.1.4 MPI+OpenMP shows bad performance

MVAPICH2 uses CPU affinity to achieve better performance for single-threaded programs. For multi-threaded programs, e.g., MPI+OpenMP, it may schedule all the threads of a process to run on the same CPU. CPU affinity should be disabled in this case to solve the problem, i.e., set -env MV2_ENABLE_AFFINITY 0.
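
For example, a launch command disabling CPU affinity might look like:

$ mpiexec -n 2 -env MV2_ENABLE_AFFINITY 0 ./a.out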

9.1.5 Error message “No such file or directory” when using Lustre file system

If you are using ADIO support for Lustre, please make sure that:
– Lustre is setup correctly, and that you are able to create, read to and write from files in the Lustre mounted directory.
– The Lustre directory is mounted on all nodes on which MVAPICH2 processes with ADIO support for Lustre are running.
– The path to the file is correctly specified.
– The permissions for the file or directory are correctly specified.

9.1.6 My program segfaults with:
File locking failed in ADIOI_Set_lock?

If you are using ADIO support for Lustre, recent Lustre releases require an additional mount option to have correct file locks.
Please include the following option with your Lustre mount command: ”-o localflock”.
For example:
$ mount -o localflock -t lustre xxxx@o2ib:/datafs /mnt/datafs

9.1.7 Running MPI programs built with gfortran

MPI programs built with gfortran might not appear to run correctly due to the default output buffering used by gfortran. If it seems there is an issue with program output, the GFORTRAN_UNBUFFERED_ALL variable can be set to “y” and exported into the environment before using the mpiexec command to launch the program, as done in the bash shell example below:

$ export GFORTRAN_UNBUFFERED_ALL=y

Or, if using mpirun_rsh, export the environment variable as in the example:

$ mpirun_rsh -np 2 n1 n2 GFORTRAN_UNBUFFERED_ALL=y ./a.out

9.1.8 Does MVAPICH2 work across AMD and Intel systems?

Yes, as long as you compile MVAPICH2 and your programs on one of the systems, either AMD or Intel, and run the same binary across the systems. MVAPICH2 has platform specific parameters for performance optimizations and it may not work if you compile MVAPICH2 and your programs on different systems and try to run the binaries together.

9.2 Failure with Job Launchers

9.2.1 Cannot find mpd.conf

If you get this error, please set your .mpd.conf and .mpdpasswd files.

9.2.2 The MPD mpiexec fails with “no msg recvd from mpd when expecting ack of request.”

This failure may be an indication that there is a problem with your cluster configuration. If you are confident in the correctness of your cluster configuration, then you can tune the timeout with MV2_MPD_RECVTIMEOUT_MULTIPLIER.

9.2.3 /usr/bin/env: mpispawn: No such file or directory

If mpirun_rsh fails with this error message, it was unable to locate a necessary utility. This can be fixed by ensuring that all MVAPICH2 executables are in the PATH on all nodes.

If PATHs cannot be set up as mentioned, then invoke mpirun_rsh with a path prefix. For example:

/path/to/mpirun_rsh -np 2 node1 node2 ./mpi_proc

or

../../path/to/mpirun_rsh -np 2 node1 node2 ./mpi_proc

9.2.4 Totalview complains that “The MPI library contains no suitable type definition for struct MPIR_PROCDESC”

Ensure that the MVAPICH2 job launcher mpirun_rsh is compiled with debug symbols. Details are available in Section 5.2.11.

9.3 With Gen2 Interface

9.3.1 Cannot Open HCA

The above error reports that the InfiniBand Adapter is not ready for communication. Make sure that the drivers are up. This can be done by executing the following command, which gives the path at which the drivers are set up.

% locate libibverbs

9.3.2 Checking state of IB Link

In order to check the status of the IB link, one of the following commands can be used:
% ibstatus
or
% ibv_devinfo

9.3.3 Undefined reference to ibv_get_device_list

This error means that your Gen2 installation is old and needs to be updated. As a workaround, add the -DGEN2_OLD_DEVICE_LIST_VERB macro to CFLAGS and rebuild MVAPICH2-gen2.
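
One possible way to do this, assuming a configure-based build from the tarball (your usual configure options still apply), is to add the macro to CFLAGS before re-running configure and rebuilding:

$ export CFLAGS="-DGEN2_OLD_DEVICE_LIST_VERB $CFLAGS"
$ ./configure && make && make install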

9.3.4 Creation of CQ or QP failure

A possible reason could be inability to pin the memory required. Make sure the following steps are taken.

  1. In /etc/security/limits.conf add the following
    * soft memlock phys_mem_in_KB

  2. After this, add the following to /etc/init.d/sshd
    ulimit -l  phys_mem_in_KB

  3. Restart sshd

With some distros, we’ve found that adding the ulimit -l line to the sshd init script is no longer necessary. For instance, the following steps work for our RHEL5 systems.

  1. Add the following lines to /etc/security/limits.conf
    * soft memlock unlimited  
    * hard memlock unlimited

  2. Restart sshd

9.3.5 Hang with the HSAM Functionality

HSAM functionality uses the multi-pathing mechanism with LMC functionality. However, some versions of OpenFabrics drivers (including OpenFabrics Enterprise Distribution (OFED) 1.1) using the Up*/Down* routing engine do not configure the routes correctly with the LMC mechanism. We strongly suggest upgrading to OFED 1.2, which supports the Up*/Down* routing engine and LMC mechanism correctly.

9.3.6 Failure with Automatic Path Migration

MVAPICH2 provides network fault tolerance with Automatic Path Migration (APM). However, APM is supported only with OFED 1.2 onwards. With OFED 1.1 and prior versions of OpenFabrics drivers, APM functionality is not completely supported. Please refer to Sections 11.58 and 11.59 for details.

9.3.7 Error opening file

If you configure MVAPICH2 with RDMA_CM and see this error, you need to verify that you have set up the local IP address to be used by RDMA_CM in the file /etc/mv2.conf. Further, you need to make sure that this file has the appropriate file read permissions. Please follow Section 6.4 for more details on this.

9.3.8 RDMA CM Address error

If you get this error, please verify that the IP address specified in /etc/mv2.conf matches the IP address of the device you plan to use RDMA_CM with.

9.3.9 RDMA CM Route error

If you see this error, you need to check whether the specified network is working or not.

9.4 With Gen2-iWARP Interface

9.4.1 Error opening file

If you configure MVAPICH2 with RDMA_CM and see this error, you need to verify that you have set up the local IP address to be used by RDMA_CM in the file /etc/mv2.conf. Further, you need to make sure that this file has the appropriate file read permissions. Please follow Section 5.2.6 for more details on this.

9.4.2 RDMA CM Address error

If you get this error, please verify that the IP address specified in /etc/mv2.conf matches the IP address of the device you plan to use RDMA_CM with.

9.4.3 RDMA CM Route error

If you see this error, you need to check whether the specified network is working or not.

9.4.4 No Fortran interface on the MacOS platform

To enable Fortran support, you would need to install the IBM compiler (a 60-day free trial version is available from IBM).

Once you unpack the tarball, you can customize and use make.mvapich2.vapi to compile and install the package or manually configure, compile and install the package.

9.5 With uDAPL Interface

9.5.1 Cannot Open IA

If you configure MVAPICH2 with uDAPL and see this error, you need to check whether you have specified the correct uDAPL service provider (Section 5.2.7). If you have specified the uDAPL provider but still see this error, you need to check whether the specified network is working or not. If you are using OpenFabrics software stack on Linux, the default DAPL provider is OpenIB-cma for DAPL-1.2, and ofa-v2-ib0 for DAPL-2.0. If you are using Solaris, the default DAPL provider is ibd0.

9.5.2 DAT Insufficient Resource

If you configure MVAPICH2 with uDAPL and see this error, you need to reduce the value of the environmental variables RDMA_DEFAULT_MAX_SEND_WQE and/or RDMA_DEFAULT_MAX_RECV_WQE depending on the underlying network.

9.5.3 Cannot Find libdat.so

If you get the error “error while loading shared libraries, libdat.so”, the location of the DAT shared library is incorrect. You need to find the correct path of libdat.so and set LD_LIBRARY_PATH to this correct location. For example:

$ mpirun_rsh -np 2 n1 n2 LD_LIBRARY_PATH=/path/to/libdat.so ./a.out

or

$ export LD_LIBRARY_PATH=/path/to/libdat.so:$LD_LIBRARY_PATH

$ mpiexec -n 2 ./a.out

9.5.4 Cannot Find mpd.conf

If you get this error, please set your .mpd.conf and .mpdpasswd files as mentioned in Section 5.2.4.

9.5.5 uDAPL over IB Does Not Scale Beyond 256 Nodes with rdma_cm Provider

We recommend that uDAPL IB consumers needing large scale-out use the socket CM provider (libdaplscm.so) in lieu of rdma_cm (libdaplcma.so). iWARP users can continue using the uDAPL rdma_cm provider. For a detailed discussion of this issue, please refer to:

http://lists.openfabrics.org/pipermail/general/2008-June/051814.html

9.6 Checkpoint/Restart

Please make sure of the following for a successful restart:

  • The MPD must be started on all the compute nodes and the console node before a restart.
  • The BLCR modules must be loaded on all the compute nodes and the console node before a restart.
  • The checkpoint file of MPI job console must be accessible from the console node.
  • The corresponding checkpoint files of the MPI processes must be accessible from the compute nodes using the same path as when checkpoint was taken.

The following things can cause a restart to fail:

  • The job which was checkpointed has not terminated, or some processes in that job were not cleaned up properly. Usually they will be cleaned up automatically; otherwise, since the pid cannot be used by BLCR to restart, the restart will fail.
  • The processes in the job have opened temporary files and these temporary files are removed or not accessible from the nodes where the processes are restarted on.

The FAQ regarding Berkeley Lab Checkpoint/Restart (BLCR) can be found at
http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html and the user guide for BLCR can be found at
http://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Users_Guide.html

If you encounter any problem with the Checkpoint/Restart support, please feel free to contact us at mvapich-discuss@cse.ohio-state.edu.

10 Scalable features for Large Scale Clusters and Performance Tuning

MVAPICH2 provides many different parameters for tuning performance for a wide variety of platforms and applications. These parameters can be either compile time parameters or runtime parameters. Please refer to Section 11 for a complete description of all these parameters. In this section we classify these parameters depending on what you are tuning for and provide guidelines on how to use them.

10.1 Job Launch Tuning

Starting with version 1.2, MVAPICH2 has a new, scalable job launcher, mpirun_rsh, which uses a tree based mechanism to spawn processes. The degree of this tree is determined dynamically to keep the depth low. For large clusters, it might be beneficial to further flatten the tree by specifying a higher degree. The degree can be overridden with the environment variable MV2_MT_DEGREE (see Section 11.28).

When running on a large number of nodes, MVAPICH2 can use a faster, hierarchical launching mechanism. This mechanism can also be enabled manually by using the environment variable MV2_FASTSSH_THRESHOLD (see Section 11.17).
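
For example, on a large cluster one might flatten the launch tree by exporting a higher degree before starting the job (the value below is illustrative only):

$ export MV2_MT_DEGREE=64
$ mpirun_rsh -np 1024 -hostfile hosts ./a.out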

10.2 Basic QP Resource Tuning

The following parameters affect memory requirements for each QP.

  • MV2_DEFAULT_MAX_SEND_WQE
  • MV2_DEFAULT_MAX_RECV_WQE
  • MV2_MAX_INLINE_SIZE

MV2_DEFAULT_MAX_SEND_WQE and MV2_DEFAULT_MAX_RECV_WQE control the maximum number of WQEs per QP and MV2_MAX_INLINE_SIZE controls the maximum inline size. Reducing the values of these parameters leads to less memory consumption. They are especially important for large scale clusters with a large number of connections and multiple rails.

These parameters are run-time adjustable. Please refer to Sections 11.12, 11.13 and 11.25 for details.
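
For example, the per-QP resources could be reduced at run time as follows (the values are illustrative only):

$ mpirun_rsh -np 1024 -hostfile hosts MV2_DEFAULT_MAX_SEND_WQE=16 MV2_DEFAULT_MAX_RECV_WQE=64 MV2_MAX_INLINE_SIZE=0 ./a.out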

10.3 RDMA Based Point-to-Point Tuning

The following parameters are important in tuning the memory requirements for adaptive rdma fast path feature.

  • MV2_VBUF_TOTAL_SIZE
  • MV2_NUM_RDMA_BUFFER
  • MV2_RDMA_VBUF_POOL_SIZE

MV2_RDMA_VBUF_POOL_SIZE defines a fixed-size pool of vbufs. These vbufs can be shared among all connections depending on the communication needs of each connection.

On the other hand, the product of MV2_VBUF_TOTAL_SIZE and MV2_NUM_RDMA_BUFFER generally is a measure of the amount of memory registered for eager message passing. These buffers are not shared across connections.

In MVAPICH2, MV2_VBUF_TOTAL_SIZE is adjustable through an environment variable. Please refer to Section 11.80 for details.
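
For example, the fast path buffers could be tuned at run time as follows (the values are illustrative; Section 11.19 suggests keeping MV2_IBA_EAGER_THRESHOLD equal to MV2_VBUF_TOTAL_SIZE):

$ mpirun_rsh -np 2 n0 n1 MV2_NUM_RDMA_BUFFER=16 MV2_VBUF_TOTAL_SIZE=9216 MV2_IBA_EAGER_THRESHOLD=9216 ./a.out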

10.4 Shared Receive Queue (SRQ) Tuning

The main environmental parameters controlling the behavior of the Shared Receive Queue design are:

MV2_SRQ_SIZE is the maximum size of the Shared Receive Queue. You may increase this value to 1000 if the application requires a very large number of processes (4K and beyond).
MV2_SRQ_LIMIT defines the low watermark for the flow control handler. This can be reduced if your aim is to reduce the number of interrupts.
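
For example, a larger SRQ could be requested at run time for a very large job (values illustrative):

$ mpirun_rsh -np 4096 -hostfile hosts MV2_SRQ_SIZE=1000 ./a.out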

10.5 eXtended Reliable Connection (XRC)

MVAPICH2 now supports the eXtended Reliable Connection (XRC) transport available in recent Mellanox HCAs. This transport helps reduce the number of QPs needed on multi-core systems. Set MV2_USE_XRC (Section 11.77) to use XRC with MVAPICH2.
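
For example, assuming the library was configured with --enable-xrc=yes, XRC can be enabled at run time as follows:

$ mpirun_rsh -np 16 -hostfile hosts MV2_USE_XRC=1 ./a.out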

10.6 Shared Memory Tuning

MVAPICH2 uses a shared memory communication channel to achieve high-performance message passing among processes on the same physical node. The two main parameters used for tuning shared memory performance for small messages are SMPI_LENGTH_QUEUE (Section 11.82) and SMP_EAGERSIZE (Section 11.81). The two main parameters used for tuning shared memory performance for large messages are SMP_SEND_BUF_SIZE (Section 11.84) and SMP_NUM_SEND_BUFFER (Section 11.83).

SMPI_LENGTH_QUEUE is the size of the shared memory buffer which is used to store outstanding small and control messages. SMP_EAGERSIZE defines the switch point from Eager protocol to Rendezvous protocol.

Messages larger than SMP_EAGERSIZE are packetized and sent out in a pipelined manner.
SMP_SEND_BUF_SIZE is the packet size, i.e. the send buffer size. SMP_NUM_SEND_BUFFER is the number of send buffers.
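
For example, the run-time shared memory parameters could be adjusted as follows (values, in KBytes, are illustrative only):

$ mpirun_rsh -np 8 -hostfile hosts SMP_EAGERSIZE=32 SMPI_LENGTH_QUEUE=128 ./a.out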

10.7 On-demand Connection Management Tuning

MVAPICH2 uses on-demand connection management to reduce the memory usage of the MPI library. There are 4 parameters to tune the connection manager: MV2_ON_DEMAND_THRESHOLD (Section 11.34), MV2_CM_RECV_BUFFERS (Section 11.7), MV2_CM_TIMEOUT (Section 11.9), and MV2_CM_SPIN_COUNT (Section 11.8). The first one applies to the Gen2-IB, Gen2-iWARP and uDAPL devices and the other three only apply to the Gen2 device.

MV2_ON_DEMAND_THRESHOLD defines the threshold for enabling the on-demand connection management scheme. When the size of the job is larger than the threshold value, on-demand connection management will be used.

MV2_CM_RECV_BUFFERS defines the number of buffers used by the connection manager to establish new connections. These buffers are quite small and are shared for all connections, so this value may be increased to 8192 for large clusters to avoid retries in case of packet drops.

MV2_CM_TIMEOUT is the timeout value associated with connection management messages via UD channel. Decreasing this value may lead to faster retries but at the cost of generating duplicate messages.

MV2_CM_SPIN_COUNT is the number of times the connection manager polls for new control messages from the UD channel for each interrupt. This may be increased to reduce the interrupt overhead when many control messages arrive from the UD channel at the same time.
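
For example, on a large cluster the connection manager could be tuned at run time as follows (values illustrative):

$ mpirun_rsh -np 4096 -hostfile hosts MV2_ON_DEMAND_THRESHOLD=32 MV2_CM_RECV_BUFFERS=8192 ./a.out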

10.8 Scalable Collectives Tuning

MVAPICH2 uses shared memory to get the best performance for many collective operations: MPI_Allreduce, MPI_Reduce, MPI_Barrier, and MPI_Bcast.

The important parameters for tuning these collectives are as follows. For MPI_Allreduce, the optimized shared memory algorithm is used for messages smaller than the MV2_SHMEM_ALLREDUCE_MSG threshold (Section 11.45).

Similarly, for MPI_Reduce the corresponding threshold is MV2_SHMEM_REDUCE_MSG (Section 11.51) and for MPI_Bcast the threshold can be set using MV2_SHMEM_BCAST_MSG (Section 11.47). The default value for the MV2_SHMEM_BCAST_LEADERS parameter is set to 4K for this release.

The current version of MVAPICH2 also supports a 2-level point-to-point based knomial algorithm for MPI_Bcast. It is currently active for all messages of size less than MPIR_BCAST_SHORT_MSG. Users can set the MV2_KNOMIAL_2LEVEL_BCAST_THRESHOLD (Section 11.24) parameter to select the lower threshold for using the knomial-based algorithm. With this setting, the normal binomial algorithm will be used for message sizes smaller than the chosen value and the new knomial algorithm will be used for larger messages until the MPIR_BCAST_SHORT_MSG size is reached. Users can also choose the inter-node and intra-node k-degree of the knomial bcast algorithm by using the parameters MV2_KNOMIAL_INTER_NODE_FACTOR (Section 11.23) and MV2_KNOMIAL_INTRA_NODE_FACTOR (Section 11.22). These values currently default to 4.
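
For example, the knomial broadcast parameters could be adjusted at run time as follows (values illustrative):

$ mpirun_rsh -np 64 -hostfile hosts MV2_USE_KNOMIAL_2LEVEL_BCAST=1 MV2_KNOMIAL_INTER_NODE_FACTOR=8 MV2_KNOMIAL_INTRA_NODE_FACTOR=4 ./a.out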

11 MVAPICH2 Parameters

11.1 MV2_CKPT_FILE

  • Class: Run Time
  • Default: /tmp/ckpt
  • Applicable interface(s): Gen2

This parameter specifies the path and the base filename for checkpoint files of MPI processes. The checkpoint files will be named as $MV2_CKPT_FILE.<number of checkpoint>.<process rank>; for example, /tmp/ckpt.1.0 is the checkpoint file for process 0’s first checkpoint. To checkpoint on network-based file systems, users just need to specify the path, such as /mnt/pvfs2/my_ckpt_file.
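
For example, to place checkpoint files on a network-based file system (the path below is illustrative):

$ mpirun_rsh -np 2 n0 n1 MV2_CKPT_FILE=/mnt/pvfs2/my_ckpt_file ./a.out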

11.2 MV2_CKPT_INTERVAL

  • Class: Run Time
  • Default: 0
  • Unit: minutes
  • Applicable interface(s): Gen2

This parameter can be used to enable automatic checkpointing. To let MPI job console automatically take checkpoints, this value needs to be set to the desired checkpointing interval. A zero will disable automatic checkpointing. Using automatic checkpointing, the checkpoint file for the MPI job console will be named as $MV2_CKPT_FILE.<number of checkpoint>.auto. Users need to use this file for restart.

11.3 MV2_CKPT_MAX_SAVE_CKPTS

  • Class: Run Time
  • Default: 0
  • Applicable interface(s): Gen2

This parameter is used to limit the number of checkpoints saved on the file system in order to save file system space. When set to a positive value N, only the last N checkpoints will be saved.

11.4 MV2_CKPT_MPD_BASE_PORT

  • Class: Run Time
  • Default: 24678
  • Applicable interface(s): Gen2

This parameter specifies the ports of socket connections used to pass checkpointing control messages between the MPD manager and the MPI processes. Users need to have a set of unused ports starting with $MV2_CKPT_MPD_BASE_PORT on the compute nodes. The port used for each MPI process will be
$MV2_CKPT_MPD_BASE_PORT + <process rank>.

11.5 MV2_CKPT_MPIEXEC_PORT

  • Class: Run Time
  • Default: 14678
  • Applicable interface(s): Gen2

This parameter specifies the port of the socket connection for passing checkpointing control messages on the MPI job console node. Users need to have an unused port on the console node to be set as
$MV2_CKPT_MPIEXEC_PORT.

11.6 MV2_CKPT_NO_SYNC

  • Class: Run Time
  • Applicable interface(s): Gen2

When this parameter is set to any value, the checkpoints will not be required to sync to disk. This can reduce the checkpointing delay in many cases. However, if users are using a local file system, or any parallel file system with a local cache, to store the checkpoints, it is recommended not to set this parameter; otherwise the checkpoint files will be cached in local memory and will likely be lost upon failure.

11.7 MV2_CM_RECV_BUFFERS

  • Class: Run Time
  • Default: 1024
  • Applicable interface(s): Gen2

This defines the number of buffers used by the connection manager to establish new connections. These buffers are quite small and are shared for all connections, so this value may be increased to 8192 for large clusters to avoid retries in case of packet drops.

11.8 MV2_CM_SPIN_COUNT

  • Class: Run Time
  • Default: 5000
  • Applicable interface(s): Gen2

This is the number of times the connection manager polls for new control messages from the UD channel for each interrupt. This may be increased to reduce the interrupt overhead when many control messages arrive from the UD channel at the same time.

11.9 MV2_CM_TIMEOUT

  • Class: Run Time
  • Default: 500
  • Unit: milliseconds
  • Applicable interface(s): Gen2

This is the timeout value associated with connection management messages via UD channel. Decreasing this value may lead to faster retries but at the cost of generating duplicate messages.

11.10 MV2_CPU_MAPPING

  • Class: Run Time
  • Default: Local rank based mapping
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL (Linux)

This allows users to specify process to CPU (core) mapping. The detailed usage of this parameter is described in Section 6.8. This parameter will not take effect if MV2_ENABLE_AFFINITY is set to 0. MV2_CPU_MAPPING is currently not supported on Solaris.

11.11 MV2_DAPL_PROVIDER

  • Class: Run time
  • Default: ofa-v2-ib0 (Linux DAPL v2.0), OpenIB-cma (Linux DAPL v1.2), ibd0 (Solaris)
  • Applicable interface(s): uDAPL

This is to specify the underlying uDAPL library that the user would like to use if MVAPICH2 is built with uDAPL.

11.12 MV2_DEFAULT_MAX_SEND_WQE

  • Class: Run time
  • Default: 64
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

This specifies the maximum number of send WQEs on each QP. Please note that for Gen2 and Gen2-iWARP, the default value of this parameter will be 16 if the number of processes is larger than 256 for better memory scalability.

11.13 MV2_DEFAULT_MAX_RECV_WQE

  • Class: Run time
  • Default: 128
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

This specifies the maximum number of receive WQEs on each QP (maximum number of receives that can be posted on a single QP).

11.14 MV2_DEFAULT_MTU

  • Class: Run time
  • Default: Gen2: IBV_MTU_1024 for IB SDR cards and IBV_MTU_2048 for IB DDR and QDR cards. uDAPL: Network dependent.
  • Applicable interface(s): Gen2, uDAPL

The internal MTU size. For Gen2, this parameter should be a string instead of an integer. Valid values are: IBV_MTU_256, IBV_MTU_512, IBV_MTU_1024, IBV_MTU_2048, IBV_MTU_4096.
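
For example, to request a 4KB MTU on the Gen2 interface at run time:

$ mpirun_rsh -np 2 n0 n1 MV2_DEFAULT_MTU=IBV_MTU_4096 ./a.out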

11.15 MV2_DEFAULT_PKEY

  • Class: Run Time
  • Applicable device(s): Gen2

Select the partition to be used for the job.

11.16 MV2_ENABLE_AFFINITY

  • Class: Run time
  • Default: 1
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL (Linux)

Enable CPU affinity by setting MV2_ENABLE_AFFINITY to 1 or disable it by setting
MV2_ENABLE_AFFINITY to 0. MV2_ENABLE_AFFINITY is currently not supported on Solaris.

11.17 MV2_FASTSSH_THRESHOLD

  • Class: Run time
  • Default: 200
  • Applicable device(s): All

Number of nodes beyond which to use hierarchical ssh during startup. This parameter is only relevant for mpirun_rsh based startup.

11.18 MV2_GET_FALLBACK_THRESHOLD

  • Class: Run time
  • This threshold value needs to be set in bytes.
  • This option is effective only if the ONE_SIDED flag is defined.
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

This defines the threshold beyond which the MPI_Get implementation is based on direct one sided RDMA operations.

11.19 MV2_IBA_EAGER_THRESHOLD

  • Class: Run time
  • Default: Architecture dependent (12KB for IA-32)
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

This specifies the switch point between eager and rendezvous protocol in MVAPICH2. For better performance, the value of MV2_IBA_EAGER_THRESHOLD should be set the same as MV2_VBUF_TOTAL_SIZE.

11.20 MV2_IBA_HCA

  • Class: Run time
  • Default: Unset
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

This specifies the HCA to be used for performing network operations.

11.21 MV2_INITIAL_PREPOST_DEPTH

  • Class: Run time
  • Default: 10
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

This defines the initial number of pre-posted receive buffers for each connection. If communication happens for a particular connection, the number of buffers will be increased to
RDMA_PREPOST_DEPTH.

11.22 MV2_KNOMIAL_INTRA_NODE_FACTOR

  • Class: Run time
  • Default: 4
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

This defines the degree of the knomial operation during the intra-node knomial broadcast phase.

11.23 MV2_KNOMIAL_INTER_NODE_FACTOR

  • Class: Run time
  • Default: 4
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

This defines the degree of the knomial operation during the inter-node knomial broadcast phase.

11.24 MV2_KNOMIAL_2LEVEL_BCAST_THRESHOLD

  • Class: Run time
  • Default: 0
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

This defines the minimum message size in bytes for the knomial bcast algorithm to be used during a call to MPI_Bcast. As of now, knomial bcast is being called for all messages less than MPIR_BCAST_SHORT_MSG.

11.25 MV2_MAX_INLINE_SIZE

  • Class: Run time
  • Default: Network card dependent (128 for most networks including InfiniBand)
  • Applicable interface(s): Gen2, Gen2-iWARP

This defines the maximum inline size for data transfer. Please note that the default value of this parameter will be 0 when the number of processes is larger than 256 to improve memory usage scalability.

11.26 MV2_MPD_RECVTIMEOUT_MULTIPLIER

  • Class: Run time
  • Default: 0.05

The multiplier to be added to the MPD mpiexec timeout for each process in a job.

11.27 MV2_MPIRUN_TIMEOUT

  • Class: Run time
  • Default: Dynamic – based on number of nodes

The number of seconds after which mpirun_rsh aborts job launch. Note that unlike most other parameters described in this section, this is an environment variable that has to be set in the runtime environment (e.g., through export in the bash shell).

11.28 MV2_MT_DEGREE

  • Class: Run time
  • Default: Dynamic – based on number of nodes

The degree of the hierarchical tree used by mpirun_rsh. By default mpirun_rsh uses a value that tries to keep the depth of the tree to 4. Note that unlike most other parameters described in this section, this is an environment variable that has to be set in the runtime environment (e.g., through export in the bash shell).

11.29 MV2_NDREG_ENTRIES

  • Class: Run time
  • Default: 1000
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

This defines the total number of buffers that can be stored in the registration cache. It has no effect if MV2_USE_LAZY_MEM_UNREGISTER is not set. A larger value will lead to less frequent lazy de-registration.

11.30 MV2_NUM_HCAS

  • Class: Run time
  • Default: 1
  • Applicable interface(s): Gen2, Gen2-iWARP

This parameter indicates the number of InfiniBand adapters to be used for communication on an end node.

11.31 MV2_NUM_PORTS

  • Class: Run time
  • Default: 1
  • Applicable interface(s): Gen2, Gen2-iWARP

This parameter indicates the number of ports per InfiniBand adapter to be used for communication on an end node.

11.32 MV2_NUM_QP_PER_PORT

  • Class: Run time
  • Default: 1
  • Applicable interface(s): Gen2, Gen2-iWARP

This parameter indicates the number of queue pairs per port to be used for communication on an end node. This is useful in the presence of multiple send/recv engines available per port for data transfer.

11.33 MV2_NUM_RDMA_BUFFER

  • Class: Run time
  • Default: Architecture dependent (32 for EM64T)
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

The number of RDMA buffers used for the RDMA fast path. This fast path is used to reduce latency and overhead of small data and control messages. This value will be ineffective if MV2_USE_RDMA_FAST_PATH is not set.

11.34 MV2_ON_DEMAND_THRESHOLD

  • Class: Run Time
  • Default: 64 (Gen2, uDAPL), 16 (Gen2-iWARP)
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

This defines the threshold for enabling the on-demand connection management scheme. When the size of the job is larger than the threshold value, on-demand connection management will be used.

11.35 MV2_PREPOST_DEPTH

  • Class: Run time
  • Default: 64
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

This defines the number of buffers pre-posted for each connection to handle send/receive operations.

11.36 MV2_PSM_DEBUG

  • Class: Run time (Debug)
  • Default: 0
  • Applicable interface: PSM

This parameter enables the dumping of run-time debug counters from the MVAPICH2-PSM progress engine. Counters are dumped every MV2_PSM_DUMP_FREQUENCY seconds.

11.37 MV2_PSM_DUMP_FREQUENCY

  • Class: Run time (Debug)
  • Default: 10 seconds
  • Applicable interface: PSM

This parameter sets the frequency for dumping MVAPICH2-PSM debug counters. The value takes effect only if MV2_PSM_DEBUG is enabled.

11.38 MV2_PUT_FALLBACK_THRESHOLD

  • Class: Run time
  • This threshold value needs to be set in bytes.
  • This option is effective only if the ONE_SIDED flag is defined.
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

This defines the threshold beyond which the MPI_Put implementation is based on direct one sided RDMA operations.

11.39 MV2_RDMA_CM_ARP_TIMEOUT

  • Class: Run Time
  • Default: 2000 ms
  • Applicable interface(s): Gen2, Gen2-iWARP

This parameter specifies the ARP timeout to be used by the RDMA CM module.

11.40 MV2_RDMA_CM_MAX_PORT

  • Class: Run Time
  • Default: Unset
  • Applicable interface(s): Gen2, Gen2-iWARP

This parameter specifies the upper limit of the port range to be used by the RDMA CM module when choosing the port on which it listens for connections.

11.41 MV2_RDMA_CM_MIN_PORT

  • Class: Run Time
  • Default: Unset
  • Applicable interface(s): Gen2, Gen2-iWARP

This parameter specifies the lower limit of the port range to be used by the RDMA CM module when choosing the port on which it listens for connections.

11.42 MV2_RNDV_PROTOCOL

  • Class: Run time
  • Default: RPUT
  • Applicable interface(s): Gen2, Gen2-iWARP

The value of this variable can be set to choose among different rendezvous protocols: RPUT (default, RDMA-Write based), RGET (RDMA-Read based) or R3 (send/recv based).
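
For example, to select the RDMA-Read based rendezvous protocol at run time:

$ mpirun_rsh -np 2 n0 n1 MV2_RNDV_PROTOCOL=RGET ./a.out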

11.43 MV2_R3_THRESHOLD

  • Class: Run time
  • Default: 4096
  • Applicable interface(s): Gen2, Gen2-iWARP

The value of this variable controls what message sizes go over the R3 rendezvous protocol. Messages above this message size use MV2_RNDV_PROTOCOL.

11.44 MV2_R3_NOCACHE_THRESHOLD

  • Class: Run time
  • Default: 32768
  • Applicable interface(s): Gen2, Gen2-iWARP

The value of this variable controls what message sizes go over the R3 rendezvous protocol when the registration cache is disabled (MV2_USE_LAZY_MEM_UNREGISTER=0). Messages above this message size use MV2_RNDV_PROTOCOL.

11.45 MV2_SHMEM_ALLREDUCE_MSG

  • Class: Run Time
  • Default: 1 << 15
  • Applicable interface(s): Gen2, Gen2-iWARP

The shmem allreduce is used for messages less than this threshold.

11.46 MV2_SHMEM_BCAST_LEADERS

  • Class: Run time
  • Default: 4096

The number of leader processes that will take part in the shmem broadcast operation. Must be greater than the number of nodes in the job.

11.47 MV2_SHMEM_BCAST_MSG

  • Class: Run Time
  • Default: 1 << 20
  • Applicable interface(s): Gen2, Gen2-iWARP

The shmem bcast is used for messages less than this threshold.

11.48 MV2_SHMEM_COLL_MAX_MSG_SIZE

  • Class: Run Time
  • Applicable interface(s): Gen2, Gen2-iWARP

This parameter can be used to select the maximum buffer size of messages for shared memory collectives.

11.49 MV2_SHMEM_COLL_NUM_COMM

  • Class: Run Time
  • Applicable interface(s): Gen2, Gen2-iWARP

This parameter can be used to select the number of communicators using shared memory collectives.

11.50 MV2_SHMEM_DIR

  • Class: Run Time
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL
  • Default: /dev/shm for Linux and /tmp for Solaris

This parameter can be used to specify the path to the shared memory files for intra-node communication.

11.51 MV2_SHMEM_REDUCE_MSG

  • Class: Run Time
  • Default: 1 << 10
  • Applicable interface(s): Gen2, Gen2-iWARP

The shmem reduce is used for messages less than this threshold.

11.52 MV2_SM_SCHEDULING

  • Class: Run Time
  • Default: USE_FIRST (Options: ROUND_ROBIN)
  • Applicable interface(s): Gen2, Gen2-iWARP

11.53 MV2_SMP_USE_LIMIC2

  • Class: Run Time
  • Default: On if configured with --with-limic2
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

This parameter enables/disables LiMIC2 at run time. It does not take effect if MVAPICH2 is not configured with --with-limic2.

11.54 MV2_SRQ_LIMIT

  • Class: Run Time
  • Default: 30
  • Applicable interface(s): Gen2, Gen2-iWARP

This is the low watermark limit for the Shared Receive Queue. If the number of available work entries on the SRQ drops below this limit, the flow control will be activated.

11.55 MV2_SRQ_SIZE

  • Class: Run Time
  • Default: 512
  • Applicable interface(s): Gen2, Gen2-iWARP

This is the maximum number of work requests allowed on the Shared Receive Queue.

11.56 MV2_STRIPING_THRESHOLD

  • Class: Run Time
  • Default: 8192
  • Applicable interface(s): Gen2, Gen2-iWARP

This parameter specifies the message size above which we begin to stripe messages across multiple rails (if present).

11.57 MV2_SUPPORT_DPM

  • Class: Run time
  • Default: 0 (disabled)
  • Applicable interface: Gen2

This option enables the dynamic process management interface and on-demand connection management.

11.58 MV2_USE_APM

  • Class: Run Time
  • Applicable interface(s): Gen2

This parameter is used for recovery from network faults using Automatic Path Migration. This functionality is beneficial in the presence of multiple paths in the network, which can be enabled by using the LMC mechanism.

11.59 MV2_USE_APM_TEST

  • Class: Run Time
  • Applicable interface(s): Gen2

This parameter is used for testing the Automatic Path Migration functionality. It periodically moves the alternate path as the primary path of communication and re-loads another alternate path.

11.60 MV2_USE_BLOCKING

  • Class: Run time
  • Default: 0
  • Applicable interface(s): Gen2

Setting this parameter enables mvapich2 to use blocking mode progress. MPI applications do not take up any CPU when they are waiting for incoming messages.

11.61 MV2_USE_COALESCE

  • Class: Run time
  • Default: set
  • Applicable interface(s): Gen2, Gen2-iWARP

Setting this parameter enables message coalescing to increase small message throughput.

11.62 MV2_USE_HSAM

  • Class: Run Time
  • Applicable interface(s): Gen2

This parameter is used for utilizing hot-spot avoidance with InfiniBand clusters. To leverage this functionality, the subnet should be configured with an LMC value greater than zero. Please refer to Section 6.6 for detailed information.

11.63 MV2_USE_IWARP_MODE

  • Class: Run Time
  • Default: unset
  • Applicable interface(s): Gen2, Gen2-iWARP

This parameter enables the library to run in iWARP mode. The library has to be built using the flag -DRDMA_CM for using this feature.

11.64 MV2_USE_KNOMIAL_2LEVEL_BCAST

  • Class: Run time
  • Default: 1
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL (Linux)

Enable Knomial Broadcast by setting MV2_USE_KNOMIAL_2LEVEL_BCAST to 1 or disable it by setting
MV2_USE_KNOMIAL_2LEVEL_BCAST to 0. The other knomial related variables are :

  • MV2_KNOMIAL_INTRA_NODE_FACTOR
  • MV2_KNOMIAL_INTER_NODE_FACTOR
  • MV2_KNOMIAL_2LEVEL_BCAST_THRESHOLD

11.65 MV2_USE_LAZY_MEM_UNREGISTER

  • Class: Run time
  • Default: set
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

Setting this parameter enables mvapich2 to use memory registration cache.

11.66 MV2_USE_RDMA_CM

  • Class: Run Time
  • Default: Network dependent (set for Gen2-iWARP and unset for Gen2)
  • Applicable interface(s): Gen2, Gen2-iWARP

This parameter enables the use of RDMA CM for establishing the connections. The library has to be built using the flag -DRDMA_CM for using this feature.

11.67 MV2_USE_RDMA_FAST_PATH

  • Class: Run time
  • Default: set
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

Setting this parameter enables mvapich2 to use adaptive rdma fast path features for Gen2 interface and static rdma fast path features for uDAPL interface.

11.68 MV2_USE_RDMA_ONE_SIDED

  • Class: Run time
  • Default: set
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

Setting this parameter allows mvapich2 to use an optimized one-sided implementation based on RDMA operations.

11.69 MV2_USE_RING_STARTUP

  • Class: Run time
  • Default: set
  • Applicable interface(s): Gen2

Setting this parameter enables mvapich2 to use ring based startup.

11.70 MV2_USE_SHARED_MEM

  • Class: Run time
  • Default: set
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

Use shared memory for intra-node communication.

11.71 MV2_USE_SHMEM_ALLREDUCE

  • Class: Run Time
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL, VAPI

This parameter can be used to turn off shared memory based MPI_Allreduce for Gen2 over IBA by setting this to 0.

11.72 MV2_USE_SHMEM_BARRIER

  • Class: Run Time
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL, VAPI

This parameter can be used to turn off shared memory based MPI_Barrier for Gen2 over IBA by setting this to 0.

11.73 MV2_USE_SHMEM_BCAST

  • Class: Run Time
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

This parameter can be used to turn off shared memory based MPI_Bcast for Gen2 over IBA by setting this to 0.

11.74 MV2_USE_SHMEM_COLL

  • Class: Run time
  • Default: set
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

Use shared memory for collective communication. Set this to 0 for disabling shared memory collectives.

11.75 MV2_USE_SHMEM_REDUCE

  • Class: Run Time
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL, VAPI

This parameter can be used to turn off shared memory based MPI_Reduce for Gen2 over IBA by setting this to 0.

11.76 MV2_USE_SRQ

  • Class: Run time
  • Default: set
  • Applicable interface(s): Gen2, Gen2-iWARP

Setting this parameter enables mvapich2 to use shared receive queue.

11.77 MV2_USE_XRC

  • Class: Run time
  • Default: 0
  • Applicable device(s): Gen2

Use the XRC InfiniBand transport available with Mellanox ConnectX adapters. This feature requires an OFED version later than 1.3. It also automatically enables SRQ and on-demand connection management. Note that the MVAPICH2 library needs to have been configured with --enable-xrc=yes to use this feature.

11.78 MV2_VBUF_POOL_SIZE

  • Class: Run time
  • Default: 512
  • Applicable interface(s): Gen2, Gen2-iWARP

The number of vbufs in the initial pool. This pool is shared among all the connections.

11.79 MV2_VBUF_SECONDARY_POOL_SIZE

  • Class: Run time
  • Default: 128
  • Applicable interface(s): Gen2, Gen2-iWARP

The number of vbufs allocated each time the initial (global) pool runs out. These vbufs are also shared among all the connections.

11.80 MV2_VBUF_TOTAL_SIZE

  • Class: Run time
  • Default: Architecture dependent (6 KB for EM64T)
  • Applicable interface(s): Gen2, Gen2-iWARP

The size of each vbuf, the basic communication buffer of MVAPICH2. For better performance, the value of MV2_IBA_EAGER_THRESHOLD should be set the same as MV2_VBUF_TOTAL_SIZE.

11.81 SMP_EAGERSIZE

  • Class: Run time
  • Default: Architecture dependent
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

This parameter defines the switch point from Eager protocol to Rendezvous protocol for intra-node communication. Note that this variable should be set in KBytes.

11.82 SMPI_LENGTH_QUEUE

  • Class: Run time
  • Default: Architecture dependent
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

This parameter defines the size of shared buffer between every two processes on the same node for transferring messages smaller than or equal to SMP_EAGERSIZE. Note that this variable should be set in KBytes.

11.83 SMP_NUM_SEND_BUFFER

  • Class: Run time
  • Default: Architecture dependent
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

This parameter defines the number of internal send buffers for sending intra-node messages larger than SMP_EAGERSIZE.

11.84 SMP_SEND_BUF_SIZE

  • Class: Compile time
  • Default: Architecture dependent
  • Applicable interface(s): Gen2, Gen2-iWARP, uDAPL

This parameter defines the packet size when sending intra-node messages larger than SMP_EAGERSIZE.
