HP-MPI Options

This section describes the specific options included in <mpirun_options> for all of the preceding examples. They are grouped into the following categories:

Interconnect selection

Launching specifications

Debugging and informational

RDMA control

MPI-2 functionality

Environment control

Special HP-MPI mode

Windows CCP

Interconnect selection options

Network selection

-elan/-ELAN
Explicit command line interconnect selection to use Quadrics Elan (available on Linux only). The lower case option is taken as advisory and indicates that the interconnect should be used if it is available. The upper case option is taken as mandatory and instructs MPI to abort if the interconnect is unavailable. The interaction between these options and the related MPI_IC_ORDER variable is that any command line interconnect selection here is implicitly prepended to MPI_IC_ORDER.

-gm/-GM
Explicit command line interconnect selection to use Myrinet GM (available on Linux only). The lower and upper case options are analogous to the Elan options (explained above).

-ibal/-IBAL
Explicit command line interconnect selection to use the Windows IB Access Layer (available on Windows only). The lower and upper case options are analogous to the Elan options (explained above).

-ibv/-IBV
Explicit command line interconnect selection to use OpenFabrics InfiniBand (available on Linux only). The lower and upper case options are analogous to the Elan options (explained above).

-itapi/-ITAPI
Explicit command line interconnect selection to use ITAPI (available on HP-UX only). The lower and upper case options are analogous to the Elan options (explained above).

-mx/-MX
Explicit command line interconnect selection to use Myrinet MX (available on Linux only). The lower and upper case options are analogous to the Elan options (explained above).

-psm/-PSM
Explicit command line interconnect selection to use QLogic InfiniBand (available on Linux only). The lower and upper case options are analogous to the Elan options (explained above).

-TCP
Specifies that TCP/IP should be used instead of another high-speed interconnect. If you have multiple TCP/IP interconnects, use -netaddr to specify which one to use. Use -prot to see which one was selected. Example:

% $MPI_ROOT/bin/mpirun -TCP -srun -N8 ./a.out

-udapl/-UDAPL
Explicit command line interconnect selection to use uDAPL (available on Linux only). The lower and upper case options are analogous to the Elan options (explained above).

Dynamic linking is required with uDAPL. Do not link -static.

-vapi/-VAPI
Explicit command line interconnect selection to use Mellanox Verbs API (available on Linux only). The lower and upper case options are analogous to the Elan options (explained above).

Dynamic linking is required with VAPI. Do not link -static.

-commd
Routes all off-host communication through daemons rather than between processes.

Local host communication method

-intra=mix
Use shared memory for messages up to 256KB (the default, or the size set by MPI_RDMA_INTRALEN); larger messages use the interconnect for better bandwidth. The same functionality is available through the environment variable MPI_INTRA, which can be set to shm, nic, or mix.

This option does not work with TCP, Elan, MX, or PSM.

-intra=nic
Use the interconnect for all intra-host data transfers. (Not recommended for high performance solutions.)

-intra=shm
Use shared memory for all intra-host data transfers. This is the default.

TCP interface selection

-netaddr
This option is similar to -subnet, but allows finer control of the selection process for TCP/IP connections. MPI has two main sets of connections: those between ranks and/or daemons where all the real message traffic occurs, and connections between mpirun and the daemons where little traffic occurs (but are still necessary).

The -netaddr option can be used to specify a single IP/mask to use for both of these purposes, or specify them individually. The latter might be needed if mpirun happens to be run on a remote machine that doesn’t have access to the same ethernet network as the rest of the cluster. To specify both, the syntax would be -netaddr IP-specification[/mask]. To specify them individually it would be -netaddr mpirun:spec,rank:spec. The string launch: can be used in place of mpirun:.

The IP-specification can be a numeric IP address like 172.20.0.1 or it can be a hostname. If a hostname is used, the value will be the first IP address returned by gethostbyname(). The optional mask can be specified as a dotted quad, or can be given as a number representing how many bits are to be matched. So, for example, a mask of “11” would be equivalent to a mask of “255.224.0.0”.

If an IP and mask are given, it is expected that exactly one IP will match at each lookup. An error or warning is printed as appropriate if there are no matches, or too many. If no mask is specified, the IP matching is simply done by the longest matching prefix.
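
For example, the following commands (addresses are illustrative only) restrict rank and daemon traffic to one network, and then additionally direct the mpirun-to-daemon connections to a different address:

% $MPI_ROOT/bin/mpirun -netaddr 172.20.0.0/255.255.0.0 -f appfile

% $MPI_ROOT/bin/mpirun -netaddr mpirun:10.1.0.5,rank:172.20.0.0/16 -f appfile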

This functionality can also be accessed using the environment variable MPI_NETADDR.

-subnet
Allows the user to select which default interconnect should be used for communication for TCP/IP. The interconnect is chosen by using the subnet associated with the hostname or IP address specified with -subnet.

% $MPI_ROOT/bin/mpirun -subnet \
<hostname-or-IP-address>

This option will be deprecated in favor of -netaddr in a future release.

Launching specifications options

Job launcher/scheduler

Options for LSF users

These options launch ranks, as in appfile mode, on the hosts specified in the corresponding LSF environment variable.

-lsb_hosts
Launches the same executable across multiple hosts. Uses the list of hosts in the environment variable $LSB_HOSTS. Can be used with -np option.

-lsb_mcpu_hosts
Launches the same executable across multiple hosts. Uses the list of hosts in the environment variable $LSB_MCPU_HOSTS. Can be used with -np option.
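
For example, a job might be submitted through LSF so that $LSB_MCPU_HOSTS is set by the scheduler before mpirun reads it (the bsub options shown are illustrative only):

% bsub -n 8 $MPI_ROOT/bin/mpirun -lsb_mcpu_hosts ./a.out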

Options for prun users

-prun
Enables start-up with Elan usage. Only supported when linking with shared libraries. Some features like mpirun -stdio processing are unavailable. The -np option is not allowed with -prun. Any arguments on the mpirun command line that follow -prun are passed down to the prun command.

Options for SLURM users

-srun
Enables start-up on XC clusters. Some features like mpirun -stdio processing are unavailable. The -np option is not allowed with -srun. Any arguments on the mpirun command line that follow -srun are passed to the srun command. Start-up directly from the srun command is not supported.

Remote shell launching

-f appfile
Specifies the appfile that mpirun parses to get program and process count information for the run. Refer to “Creating an appfile” for details about setting up your appfile.

-hostfile <filename>
Launches the same executable across multiple hosts. Filename is a text file with hostnames separated by spaces or new lines. Can be used with the -np option.

-hostlist <list>
Launches the same executable across multiple hosts. Can be used with the -np option. This hostlist may be delimited with spaces or commas. Hosts can be followed with an optional rank count, which is delimited from the hostname with either a space or colon. If spaces are used as delimiters anywhere in the hostlist, it may be necessary to place the entire hostlist inside quotes to prevent the command shell from interpreting it as multiple options.
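
For example, the following command (hypothetical hostnames) runs four ranks on hostA and four on hostB; the quotes keep the space-delimited list together as a single argument, and -hostlist hostA:4,hostB:4 is an equivalent comma/colon form:

% $MPI_ROOT/bin/mpirun -hostlist "hostA 4 hostB 4" ./a.out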

-h host
Specifies a host on which to start the processes (default is local_host). Only applicable when running in single host mode (mpirun -np …). Refer to the -hostlist option which provides more flexibility.

-l user
Specifies the username on the target host (default is local username).

-l is not available on HP-MPI for Windows.

-np #
Specifies the number of processes to run. Generally used in single host mode, but also valid with -hostfile, -hostlist, -lsb_hosts, and -lsb_mcpu_hosts.

-stdio=[options]
Specifies standard IO options. Refer to “External input and output” for more information on standard IO, as well as a complete list of stdio options. This applies to appfiles only.

Process placement

-cpu_bind
Binds a rank to an ldom to prevent a process from moving to a different ldom after startup. Refer to “CPU binding” for details on how to use this option.

Application bitness specification

-mpi32
Option for running on Opteron and Intel®64. Should be used to indicate the bitness of the application to be invoked so that the availability of interconnect libraries can be properly determined by the HP-MPI utilities mpirun and mpid. The default is -mpi64.

-mpi64
Option for running on Opteron and Intel®64. Should be used to indicate the bitness of the application to be invoked so that the availability of interconnect libraries can be properly determined by the HP-MPI utilities mpirun and mpid. The default is -mpi64.

Debugging and informational options

-help
Prints usage information for mpirun.

-version
Prints the major and minor version numbers.

-prot
Prints the communication protocol between each host (e.g. TCP/IP or shared memory). The exact format and content presented by this option is subject to change as new interconnects and communication protocols are added to HP-MPI.

-ck
Behaves like the -p option, but supports two additional checks of your MPI application; it checks if the specified host machines and programs are available, and also checks for access or permission problems. This option is only supported when using appfile mode.

-d
Debug mode. Prints additional information about application launch.

-j
Prints the HP-MPI job ID.

-p
Turns on pretend mode. That is, the system goes through the motions of starting an HP-MPI application but does not create processes. This is useful for debugging and checking whether the appfile is set up correctly. This option is for appfiles only.

-v
Turns on verbose mode.

-i spec
Enables runtime instrumentation profiling for all processes. spec specifies options used when profiling. The options are the same as those for the environment variable MPI_INSTR. For example, the following is a valid command line:

% $MPI_ROOT/bin/mpirun -i mytrace:l:nc \
-f appfile

Refer to “MPI_INSTR” for an explanation of -i options.

-T
Prints user and system times for each MPI rank.

-tv
Specifies that the application runs with the TotalView® debugger for LSF launched applications. TV is only supported on XC systems.

RDMA control options

-dd
Use deferred deregistration when registering and deregistering memory for RDMA message transfers. The default is to use deferred deregistration. Note that using this option also produces a statistical summary of the deferred deregistration activity when MPI_Finalize is called. The option is ignored if the underlying interconnect does not use an RDMA transfer mechanism, or if the deferred deregistration is managed directly by the interconnect library.

Occasionally deferred deregistration is incompatible with a particular application or negatively impacts performance. Use -ndd to disable this feature if necessary.

Deferred deregistration of memory on RDMA networks is not supported on HP-MPI for Windows.

-ndd
Disable the use of deferred deregistration. Refer to the -dd option for more information.

-rdma
Specifies the use of envelope pairs for short message transfer. The amount of pre-pinned memory increases with the job size.

-srq
Specifies use of the shared receiving queue protocol when OpenFabrics, Myrinet GM, ITAPI, Mellanox VAPI or uDAPL V1.2 interfaces are used. This protocol uses less pre-pinned memory for short message transfers. For more information, refer to “Scalability”.

MPI-2 functionality options

-1sided
Enables one-sided communication. Extends the communication mechanism of HP-MPI by allowing one process to specify all communication parameters, both for the sending side and for the receiving side.

The best performance is achieved if an RDMA enabled interconnect, like InfiniBand, is used. With this interconnect, the memory for the one-sided windows can come from MPI_Alloc_mem or from malloc. If TCP/IP is used, the performance will be lower, and in that case the memory for the one-sided windows must come from MPI_Alloc_mem.

-spawn
Enables dynamic processes. See “Dynamic Processes” for more information.

Environment control options

-e var[=val]
Sets the environment variable var for the program and gives it the value val if provided. Environment variable substitutions (for example, $FOO) are supported in the val argument. In order to append additional settings to an existing variable, %VAR can be used as in the example in “Setting remote environment variables”.
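
For example, the following sets a hypothetical variable MY_DEBUG_LEVEL to 2 in the environment of every rank started from the appfile:

% $MPI_ROOT/bin/mpirun -e MY_DEBUG_LEVEL=2 -f appfile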

-sp paths
Sets the target shell PATH environment variable to paths. Search paths are separated by a colon.

Special HP-MPI mode option
-ha
Eliminates a teardown when ranks exit abnormally. Further communications involved with ranks that are unreachable return error class MPI_ERR_EXITED, but do not force the application to teardown, as long as the MPI_Errhandler is set to MPI_ERRORS_RETURN. Some restrictions apply:

Communication is done via TCP/IP (shared memory is not used for intranode communication).

Cannot be used with the diagnostic library.

Cannot be used with the -i option

Windows CCP
The following are specific mpirun command line options for Windows CCP users.

-ccp
Indicates that the job is being submitted through the Windows CCP job scheduler/launcher. This is the recommended method for launching jobs. Required when the user doesn’t provide an appfile.

-ccperr <filename>
Assigns the job’s standard error file to the given filename when starting a job through the Windows CCP automatic job scheduler/launcher feature of HP-MPI. This flag has no effect if used for an existing CCP job.

-ccpin <filename>
Assigns the job’s standard input file to the given filename when starting a job through the Windows CCP automatic job scheduler/launcher feature of HP-MPI. This flag has no effect if used for an existing CCP job.

-ccpout <filename>
Assigns the job’s standard output file to the given filename when starting a job through the Windows CCP automatic job scheduler/launcher feature of HP-MPI. This flag has no effect if used for an existing CCP job.

-ccpwait
Causes the mpirun command to wait for the CCP job to finish before returning to the command prompt when starting a job through automatic job submittal feature of HP-MPI. By default, mpirun automatic jobs will not wait. This flag has no effect if used for an existing CCP job.

-headnode <headnode>
This option is used on Windows CCP to indicate the headnode to submit the mpirun job. If omitted, localhost is used. This option can only be used as a command line option when using the mpirun automatic submittal functionality.

-hosts
This option used on Windows CCP allows you to specify a node list to HP-MPI. Ranks are scheduled according to the host list. The nodes in the list must be in the job allocation or a scheduler error will occur. The HP-MPI program %MPI_ROOT%\bin\mpi_nodes.exe returns a string in the proper -hosts format with scheduled job resources.

-jobid <job-id>

This flag used on Windows CCP will schedule an HP-MPI job as a task to an existing job. It will submit the command as a single CPU mpirun task to the existing job indicated by the parameter job-id. This option can only be used as a command line option when using the mpirun automatic submittal functionality.

-nodex
Used on Windows CCP in addition to -ccp to indicate that only one rank is to be used per node, regardless of the number of CPUs allocated on each host.

Runtime environment variables
 

Environment variables are used to alter the way HP-MPI executes an application. The variable settings determine how an application behaves and how an application allocates internal resources at runtime.

Many applications run without setting any environment variables. However, applications that use a large number of nonblocking messaging requests, require debugging support, or need to control process placement may need a more customized configuration.

Launching methods influence how environment variables are propagated. To ensure propagating environment variables to remote hosts, specify each variable in an appfile using the -e option. See “Creating an appfile” for more information.

Setting environment variables on the command line for HP-UX and Linux
Environment variables can be set globally on the mpirun command line. Command line options take precedence over environment variables. For example, on HP-UX and Linux:

% $MPI_ROOT/bin/mpirun -e MPI_FLAGS=y -f appfile

In the above example, if some MPI_FLAGS setting was specified in the appfile, then the global setting on the command line would override the setting in the appfile. To add to an environment variable rather than replacing it, use %VAR as in the following command:

% $MPI_ROOT/bin/mpirun -e MPI_FLAGS=%MPI_FLAGS,y -f appfile

In the above example, if the appfile specified MPI_FLAGS=z, then the resulting MPI_FLAGS seen by the application would be z, y.

% $MPI_ROOT/bin/mpirun -e \
LD_LIBRARY_PATH=%LD_LIBRARY_PATH:/path/to/third/party/lib \
-f appfile

In the above example, the user is appending to LD_LIBRARY_PATH.

Setting environment variables in an hpmpi.conf file
HP-MPI supports setting environment variables in an hpmpi.conf file. These variables are read by mpirun and exported globally, as if they had been included on the mpirun command line as “-e VAR=VAL” settings. The hpmpi.conf file search is performed in three places and each one is parsed, which allows the last one parsed to overwrite values set by the previous files. The three locations are:

$MPI_ROOT/etc/hpmpi.conf

/etc/hpmpi.conf

$HOME/.hpmpi.conf

This feature can be used for any environment variable, and is most useful for interconnect specifications. A collection of variables is available which tells HP-MPI which interconnects to search for and which libraries and modules to look for with each interconnect. These environment variables are the primary use of hpmpi.conf.

Syntactically, single and double quotes in hpmpi.conf can be used to create values containing spaces. If a value containing a quote is desired, two adjacent quotes are interpreted as a quote to be included in the value. When not contained within quotes, spaces are interpreted as element separators in a list, and are stored as tabs.

 
 
 
 NOTE: This explanation of the hpmpi.conf file is provided only for awareness that this functionality is available. Making changes to the hpmpi.conf file without contacting HP-MPI support is strongly discouraged.
 
 
 

List of runtime environment variables
 

The environment variables that affect the behavior of HP-MPI at runtime are described in the following sections categorized by the following functions:

General

CPU bind

Miscellaneous

Interconnect

InfiniBand

Memory usage

Connection related

RDMA

prun/srun

TCP

Elan

Rank ID

All environment variables are listed below in alphabetical order.

“MPI_2BCOPY”

“MPI_BIND_MAP”

“MPI_COMMD”

“MPI_CPU_AFFINITY”

“MPI_CPU_SPIN”

“MPI_DLIB_FLAGS”

“MPI_ELANLOCK”

“MPI_FLAGS”

“MPI_FLUSH_FCACHE”

“MP_GANG”

“MPI_GLOBMEMSIZE ”

“MPI_IB_CARD_ORDER”

“MPI_IB_PKEY”

“MPI_IBV_QPPARAMS”

“MPI_IC_ORDER ”

“MPI_IC_SUFFIXES”

“MPI_INSTR”

“MPI_LOCALIP”

“MPI_MAX_REMSH”

“MPI_MAX_WINDOW”

“MPI_MT_FLAGS”

“MPI_NETADDR”

“MPI_NO_MALLOCLIB”

“MPI_NOBACKTRACE”

“MPI_PAGE_ALIGN_MEM”

“MPI_PHYSICAL_MEMORY”

“MPI_PIN_PERCENTAGE”

“MPI_PRUNOPTIONS”

“MPI_RANKMEMSIZE”

“MPI_RDMA_INTRALEN”

“MPI_RDMA_MSGSIZE”

“MPI_RDMA_NENVELOPE”

“MPI_RDMA_NFRAGMENT”

“MPI_RDMA_NONESIDED”

“MPI_RDMA_NSRQRECV”

“MPI_REMSH”

“MPI_ROOT”

“MPI_SHMEMCNTL”

“MPI_SOCKBUFSIZE”

“MPI_SPAWN_PRUNOPTIONS”

“MPI_SPAWN_SRUNOPTIONS”

“MPI_SRUNOPTIONS”

“MPI_TCP_CORECVLIMIT”

“MPI_USE_LIBELAN”

“MPI_USE_LIBELAN_SUB”

“MPI_USE_MALLOPT_AVOID_MMAP”

“MPI_USEPRUN”

“MPI_USEPRUN_IGNORE_ARGS”

“MPI_USESRUN”

“MPI_USESRUN_IGNORE_ARGS”

“MPI_VAPI_QPPARAMS”

“MPI_WORKDIR”

“MPIRUN_OPTIONS”

“TOTALVIEW”

General environment variables
MPIRUN_OPTIONS
MPIRUN_OPTIONS is a mechanism for specifying additional command line arguments to mpirun. If this environment variable is set, then any mpirun command will behave as if the arguments in MPIRUN_OPTIONS had been specified on the mpirun command line. For example:

% export MPIRUN_OPTIONS="-v -prot"

% $MPI_ROOT/bin/mpirun -np 2 /path/to/program.x

would be equivalent to running

% $MPI_ROOT/bin/mpirun -v -prot -np 2 /path/to/program.x

When settings are supplied on the command line, in the MPIRUN_OPTIONS variable, and in an hpmpi.conf file, the resulting command line is as if the hpmpi.conf settings had appeared first, followed by the MPIRUN_OPTIONS, followed by the actual command line. And since the settings are parsed left to right, this means the later settings have higher precedence than the earlier ones.

MPI_FLAGS
MPI_FLAGS modifies the general behavior of HP-MPI. The MPI_FLAGS syntax is a comma separated list as follows:

[edde,][exdb,][egdb,][eadb,][ewdb,][epathdb,][l,][f,][i,] [s[a|p][#],][y[#],][o,][+E2,][C,][D,][E[on|off],][T,][z]

where

edde
Starts the application under the dde debugger. The debugger must be in the command search path. See “Debugging HP-MPI applications” for more information.

exdb
Starts the application under the xdb debugger. The debugger must be in the command search path. See “Debugging HP-MPI applications” for more information.

egdb
Starts the application under the gdb debugger. The debugger must be in the command search path. See “Debugging HP-MPI applications” for more information.

eadb
Starts the application under adb—the absolute debugger. The debugger must be in the command search path. See “Debugging HP-MPI applications” for more information.

ewdb
Starts the application under the wdb debugger. The debugger must be in the command search path. See “Debugging HP-MPI applications” for more information.

epathdb
Starts the application under the path debugger. The debugger must be in the command search path. See “Debugging HP-MPI applications” for more information.

l
Reports memory leaks caused by not freeing memory allocated when an HP-MPI job is run. For example, when you create a new communicator or user-defined datatype after you call MPI_Init, you must free the memory allocated to these objects before you call MPI_Finalize. In C, this is analogous to making calls to malloc() and free() for each object created during program execution.

Setting the l option may decrease application performance.

f
Forces MPI errors to be fatal. Using the f option sets the MPI_ERRORS_ARE_FATAL error handler, ignoring the programmer’s choice of error handlers. This option can help you detect nondeterministic error problems in your code.

If your code has a customized error handler that does not report that an MPI call failed, you will not know that a failure occurred. Thus your application could be catching an error with a user-written error handler (or with MPI_ERRORS_RETURN) which masks a problem.

i
Turns on language interoperability concerning the MPI_BOTTOM constant.

MPI_BOTTOM Language Interoperability—Previous versions of HP-MPI were not compliant with Section 4.12.6.1 of the MPI-2 Standard which requires that sends/receives based at MPI_BOTTOM on a data type created with absolute addresses must access the same data regardless of the language in which the data type was created. If compliance with the standard is desired, set MPI_FLAGS=i to turn on language interoperability concerning the MPI_BOTTOM constant. Compliance with the standard can break source compatibility with some MPICH code.

s[a|p][#]
Selects signal and maximum time delay for guaranteed message progression. The sa option selects SIGALRM. The sp option selects SIGPROF. The # option is the number of seconds to wait before issuing a signal to trigger message progression. The default value for the MPI library is sp0, which never issues a progression related signal. If the application uses both signals for its own purposes, you cannot enable the heart-beat signals.

This mechanism may be used to guarantee message progression in applications that use nonblocking messaging requests followed by prolonged periods of time in which HP-MPI routines are not called.

Generating a UNIX signal introduces a performance penalty every time the application processes are interrupted. As a result, while some applications will benefit from it, others may experience a decrease in performance. As part of tuning the performance of an application, you can control the behavior of the heart-beat signals by changing their time period or by turning them off. This is accomplished by setting the time period of the s option in the MPI_FLAGS environment variable (for example: s600). Time is in seconds.

You can use the s[a|p]# option with the thread-compliant library as well as the standard non-thread-compliant library. Setting s[a|p]# for the thread-compliant library has the same effect as setting MPI_MT_FLAGS=ct when you use a value greater than 0 for #. The default value for the thread-compliant library is sp0. MPI_MT_FLAGS=ct takes priority over the default MPI_FLAGS=sp0.

Refer to “MPI_MT_FLAGS” and “Thread-compliant library” for additional information.

Set MPI_FLAGS=sa1 to guarantee that MPI_Cancel works for canceling sends.

To use gprof on XC systems, set the following environment variables:

MPI_FLAGS=s0

GMON_OUT_PREFIX=/tmp/app/name

These options are ignored on HP-MPI for Windows.

y[#]
Enables spin-yield logic. # is the spin value and is an integer between zero and 10,000. The spin value specifies the number of milliseconds a process should block waiting for a message before yielding the CPU to another process.

How you apply spin-yield logic depends on how well synchronized your processes are. For example, if you have a process that wastes CPU time blocked, waiting for messages, you can use spin-yield to ensure that the process relinquishes the CPU to other processes. Do this in your appfile, by setting y[#] to y0 for the process in question. This specifies zero milliseconds of spin (that is, immediate yield).

If you are running an application stand-alone on a dedicated system, the default setting of MPI_FLAGS=y allows MPI to busy spin, thereby improving latency. To avoid unnecessary CPU consumption when using more ranks than cores, consider using a setting such as MPI_FLAGS=y40.

Specifying y without a spin value is equivalent to MPI_FLAGS=y10000, which is the default.

 
 
 
 NOTE: Except when using srun or prun to launch, if the ranks under a single mpid exceed the number of CPUs on the node and a value of MPI_FLAGS=y is not specified, the default is changed to MPI_FLAGS=y0. 
 
 
 

If the time a process is blocked waiting for messages is short, you can possibly improve performance by setting a spin value (between 0 and 10,000) that ensures the process does not relinquish the CPU until after the message is received, thereby reducing latency.

The system treats a nonzero spin value as a recommendation only. It does not guarantee that the value you specify is used.

o
Writes an optimization report to stdout. MPI_Cart_create and MPI_Graph_create optimize the mapping of processes onto the virtual topology only if rank reordering is enabled (reorder=1).

In the declaration statement below, note the reorder=1 setting:

 
int numtasks, rank, source, dest, outbuf, i, tag=1,
    inbuf[4]={MPI_PROC_NULL,MPI_PROC_NULL,MPI_PROC_NULL,MPI_PROC_NULL},
    nbrs[4], dims[2]={4,4}, periods[2]={0,0}, reorder=1, coords[2];
 

For example:

 
/opt/mpi/bin/mpirun -np 16 -e MPI_FLAGS=o ./a.out
Reordering ranks for the call
MPI_Cart_create(comm(size=16), ndims=2, dims=[4 4], periods=[false false], reorder=true)
Default mapping of processes would result communication paths
         between hosts                      :    0
         between subcomplexes               :    0
         between hypernodes                 :    0
         between CPUs within a hypernode/SMP:   24
Reordered mapping results communication paths
         between hosts                      :    0
         between subcomplexes               :    0
         between hypernodes                 :    0
         between CPUs within a hypernode/SMP:   24
Reordering will not reduce overall communication cost.
Void the optimization and adopted unreordered mapping.
rank= 2 coords= 0 2  neighbors(u,d,l,r)= -1 6 1 3
rank= 0 coords= 0 0  neighbors(u,d,l,r)= -1 4 -1 1
rank= 1 coords= 0 1  neighbors(u,d,l,r)= -1 5 0 2
rank= 10 coords= 2 2  neighbors(u,d,l,r)= 6 14 9 11
rank= 2   inbuf(u,d,l,r)= -1 6 1 3
rank= 6 coords= 1 2  neighbors(u,d,l,r)= 2 10 5 7
rank= 7 coords= 1 3  neighbors(u,d,l,r)= 3 11 6 -1
rank= 4 coords= 1 0  neighbors(u,d,l,r)= 0 8 -1 5
rank= 0   inbuf(u,d,l,r)= -1 4 -1 1
rank= 5 coords= 1 1  neighbors(u,d,l,r)= 1 9 4 6
rank= 11 coords= 2 3  neighbors(u,d,l,r)= 7 15 10 -1
rank= 1   inbuf(u,d,l,r)= -1 5 0 2
rank= 14 coords= 3 2  neighbors(u,d,l,r)= 10 -1 13 15
rank= 9 coords= 2 1  neighbors(u,d,l,r)= 5 13 8 10
rank= 13 coords= 3 1  neighbors(u,d,l,r)= 9 -1 12 14
rank= 15 coords= 3 3  neighbors(u,d,l,r)= 11 -1 14 -1
rank= 10   inbuf(u,d,l,r)= 6 14 9 11
rank= 12 coords= 3 0  neighbors(u,d,l,r)= 8 -1 -1 13
rank= 8 coords= 2 0  neighbors(u,d,l,r)= 4 12 -1 9
rank= 3 coords= 0 3  neighbors(u,d,l,r)= -1 7 2 -1
rank= 6   inbuf(u,d,l,r)= 2 10 5 7
rank= 7   inbuf(u,d,l,r)= 3 11 6 -1
rank= 4   inbuf(u,d,l,r)= 0 8 -1 5
rank= 5   inbuf(u,d,l,r)= 1 9 4 6
rank= 11   inbuf(u,d,l,r)= 7 15 10 -1
rank= 14   inbuf(u,d,l,r)= 10 -1 13 15
rank= 9   inbuf(u,d,l,r)= 5 13 8 10
rank= 13   inbuf(u,d,l,r)= 9 -1 12 14
rank= 15   inbuf(u,d,l,r)= 11 -1 14 -1
rank= 8   inbuf(u,d,l,r)= 4 12 -1 9
rank= 12   inbuf(u,d,l,r)= 8 -1 -1 13
rank= 3   inbuf(u,d,l,r)= -1 7 2 -1
 

+E2
Sets -1 as the value of .TRUE. and 0 as the value for .FALSE. when returning logical values from HP-MPI routines called within Fortran 77 applications.

C
Disables ccNUMA support. Allows you to treat the system as a symmetric multiprocessor (SMP).

D
Dumps shared memory configuration information. Use this option to get shared memory values that are useful when you want to set the MPI_SHMEMCNTL flag.

E[on|off]
Function parameter error checking is turned off by default. It can be turned on by setting MPI_FLAGS=Eon.

T
Prints the user and system times for each MPI rank.

z
Enables zero-buffering mode. Set this flag to convert MPI_Send and MPI_Rsend calls in your code to MPI_Ssend, without rewriting your code.

MPI_MT_FLAGS
MPI_MT_FLAGS controls runtime options when you use the thread-compliant version of HP-MPI. The MPI_MT_FLAGS syntax is a comma separated list as follows:

[ct,][single,][fun,][serial,][mult]

where

ct
Creates a hidden communication thread for each rank in the job. When you enable this option, be careful not to oversubscribe your system. For example, if you enable ct for a 16-process application running on a 16-way machine, the result will be a 32-way job.

single
Asserts that only one thread executes.

fun
Asserts that a process can be multithreaded, but only the main thread makes MPI calls (that is, all calls are funneled to the main thread).

serial
Asserts that a process can be multithreaded, and multiple threads can make MPI calls, but calls are serialized (that is, only one call is made at a time).

mult
Asserts that multiple threads can call MPI at any time with no restrictions.

Setting MPI_MT_FLAGS=ct has the same effect as setting MPI_FLAGS=s[a|p]# with a value of # greater than 0. MPI_MT_FLAGS=ct takes priority over the default MPI_FLAGS=sp0 setting. Refer to “MPI_FLAGS”.

The single, fun, serial, and mult options are mutually exclusive. For example, if you specify the serial and mult options in MPI_MT_FLAGS, only the last option specified is processed (in this case, the mult option). If no runtime option is specified, the default is mult.

For more information about using MPI_MT_FLAGS with the
thread-compliant library, refer to “Thread-compliant library”.
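
For example, a sketch of enabling the hidden communication thread for a run that uses the thread-compliant library (take care not to oversubscribe the node, as noted above):

% $MPI_ROOT/bin/mpirun -e MPI_MT_FLAGS=ct -np 4 ./a.out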

MPI_ROOT
MPI_ROOT indicates the location of the HP-MPI tree. If you move the HP-MPI installation directory from its default location in /opt/mpi for HP-UX and /opt/hpmpi for Linux, set the MPI_ROOT environment variable to point to the new location. See “Directory structure for HP-UX and Linux” for more information.

MPI_WORKDIR
MPI_WORKDIR changes the execution directory. This variable is ignored when srun or prun is used.

CPU Bind environment variables
MPI_BIND_MAP
MPI_BIND_MAP allows specification of the integer CPU numbers, ldom numbers, or CPU masks. These are a list of integers separated by commas (,).

MPI_CPU_AFFINITY
MPI_CPU_AFFINITY is an alternative method to using -cpu_bind on the command line for specifying binding strategy. The possible settings are LL, RANK, MAP_CPU, MASK_CPU, LDOM, CYCLIC, BLOCK, RR, FILL, PACKED, SLURM, and MAP_LDOM.
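
For example, a minimal sketch of selecting one of the listed strategies through the environment instead of -cpu_bind (see “CPU binding” for what each setting does):

% $MPI_ROOT/bin/mpirun -e MPI_CPU_AFFINITY=RANK -np 4 ./a.out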

MPI_CPU_SPIN
MPI_CPU_SPIN allows selection of spin value. The default is 2 seconds.

MPI_FLUSH_FCACHE
MPI_FLUSH_FCACHE clears the file-cache (buffer-cache). Add “-e MPI_FLUSH_FCACHE[=x]” to the mpirun command line and the file-cache will be flushed before the code starts; where =x is an optional percent of memory at which to flush. If the memory in the file-cache is greater than x, the memory is flushed. The default value is 0 (in which case a flush is always performed). Only the lowest rank# on each host flushes the file-cache; limited to one flush per host/job.

Setting this environment variable saves time if, for example, the file-cache is currently using 8% of the memory and =x is set to 10. In this case, no flush is performed.

Example output:

 
MPI_FLUSH_FCACHE set to 0, fcache pct = 22, attempting to flush fcache on host opteron2
 
 
MPI_FLUSH_FCACHE set to 10, fcache pct = 3, no fcache flush required on host opteron2
 

Memory is allocated with mmap, then munmap’d afterwards.

MP_GANG
MP_GANG enables gang scheduling on HP-UX systems only. Gang scheduling improves the latency for synchronization by ensuring that all runnable processes in a gang are scheduled simultaneously. Processes waiting at a barrier, for example, do not have to wait for processes that are not currently scheduled. This proves most beneficial for applications with frequent synchronization operations. Applications with infrequent synchronization, however, may perform better if gang scheduling is disabled.

Process priorities for gangs are managed identically to timeshare policies. The timeshare priority scheduler determines when to schedule a gang for execution. While it is likely that scheduling a gang will preempt one or more higher priority timeshare processes, the gang-schedule policy is fair overall. In addition, gangs are scheduled for a single time slice, which is the same for all processes in the system.

MPI processes are allocated statically at the beginning of execution. As an MPI process creates new threads, they are all added to the same gang if MP_GANG is enabled.

The MP_GANG syntax is as follows:

[ON|OFF]

where

ON
Enables gang scheduling.

OFF
Disables gang scheduling.

For multihost configurations, you need to set MP_GANG for each appfile entry. Refer to the -e option in “Creating an appfile”.

You can also use the HP-UX utility mpsched to enable gang scheduling. Refer to the HP-UX gang_sched and mpsched man pages for more information.

 
 
 
 NOTE: The MP_GANG feature will be deprecated in a future release.
 
 
 

Miscellaneous environment variables
MPI_2BCOPY
Point-to-point bcopy() is disabled by setting MPI_2BCOPY to 1. Valid on PA-RISC only.

MPI_MAX_WINDOW
MPI_MAX_WINDOW is used for one-sided applications. It specifies the maximum number of windows a rank can have at the same time. It tells HP-MPI to allocate enough table entries. The default is 5.

% export MPI_MAX_WINDOW=10

The above example allows 10 windows to be established for one-sided communication.

Diagnostic/debug environment variables
MPI_DLIB_FLAGS
MPI_DLIB_FLAGS controls runtime options when you use the diagnostics library. The MPI_DLIB_FLAGS syntax is a comma separated list as follows:

[ns,][h,][strict,][nmsg,][nwarn,][dump:prefix,]
[dumpf:prefix][xNUM]

where

ns
Disables message signature analysis.

h
Disables default behavior in the diagnostic library that ignores user specified error handlers. The default considers all errors to be fatal.

strict
Enables MPI object-space corruption detection. Setting this option for applications that make calls to routines in the MPI-2 standard may produce false error messages.

nmsg
Disables detection of multiple buffer writes during receive operations and detection of send buffer corruptions.

nwarn
Disables the warning messages that the diagnostic library generates by default when it identifies a receive that expected more bytes than were sent.

dump:prefix
Dumps (unformatted) all sent and received messages to prefix.msgs.rank where rank is the rank of a specific process.

dumpf:prefix
Dumps (formatted) all sent and received messages to prefix.msgs.rank where rank is the rank of a specific process.

xNUM
Defines a type-signature packing size. NUM is an unsigned integer that specifies the number of signature leaf elements. For programs with diverse derived datatypes the default value may be too small. If NUM is too small, the diagnostic library issues a warning during the MPI_Finalize operation.

Refer to “Using the diagnostics library” for more information.

MPI_NOBACKTRACE
On PA-RISC systems, a stack trace is printed when the following signals occur within an application:

SIGILL

SIGBUS

SIGSEGV

SIGSYS

In the event one of these signals is not caught by a user signal handler, HP-MPI will display a brief stack trace that can be used to locate the signal in the code.

 
Signal 10: bus error
PROCEDURE TRACEBACK:

(0)   0x0000489c   bar + 0xc        [././a.out]
(1)   0x000048c4   foo + 0x1c       [././a.out]
(2)   0x000049d4   main + 0xa4      [././a.out]
(3)   0xc013750c   _start + 0xa8    [/usr/lib/libc.2]
(4)   0x0003b50    $START$ + 0x1a0  [././a.out]
 

This feature can be disabled for an individual signal handler by declaring a user-level signal handler for the signal. To disable for all signals, set the environment variable MPI_NOBACKTRACE:

% setenv MPI_NOBACKTRACE

See “Backtrace functionality” for more information.

MPI_INSTR
MPI_INSTR enables counter instrumentation for profiling HP-MPI applications. The MPI_INSTR syntax is a colon-separated list (no spaces between options) as follows:

prefix[:l][:nc][:off]

where

prefix
Specifies the instrumentation output file prefix. The rank zero process writes the application’s measurement data to prefix.instr in ASCII. If the prefix does not represent an absolute pathname, the instrumentation output file is opened in the working directory of the rank zero process when MPI_Init is called.

l
Locks ranks to CPUs and uses the CPU’s cycle counter for less invasive timing. If used with gang scheduling, the :l is ignored.

nc
Specifies no clobber. If the instrumentation output file exists, MPI_Init aborts.

off
Specifies counter instrumentation is initially turned off and only begins after all processes collectively call MPIHP_Trace_on.
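
For example, the environment-variable equivalent of the -i example shown earlier (-i mytrace:l:nc) would be:

% $MPI_ROOT/bin/mpirun -e MPI_INSTR=mytrace:l:nc -f appfile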

Refer to “Using counter instrumentation” for more information.

Even though you can specify profiling options through the MPI_INSTR environment variable, the recommended approach is to use the mpirun command with the -i option instead. Using mpirun to specify profiling options guarantees that multihost applications do profiling in a consistent manner. Refer to “mpirun ” for more information.

Counter instrumentation and trace-file generation are mutually exclusive profiling techniques.

 
 
 
 NOTE: When you enable instrumentation for multihost runs, and invoke mpirun either on a host where at least one MPI process is running, or on a host remote from all your MPI processes, HP-MPI writes the instrumentation output file (prefix.instr) to the working directory on the host that is running rank 0.
 
 
 

TOTALVIEW
When you use the TotalView debugger, HP-MPI uses your PATH variable to find TotalView. You can also set the absolute path and TotalView specific options in the TOTALVIEW environment variable. This environment variable is used by mpirun.

% setenv TOTALVIEW /opt/totalview/bin/totalview

Interconnect selection environment variables
MPI_IC_ORDER
MPI_IC_ORDER is an environment variable whose default contents are "ibv:vapi:udapl:psm:mx:gm:elan:itapi:TCP" and instructs HP-MPI to search in a specific order for the presence of an interconnect. Lowercase selections imply use if detected, otherwise keep searching. An uppercase option demands that the interconnect option be used, and if it cannot be selected the application will terminate with an error. For example:

% export MPI_IC_ORDER="ibv:vapi:udapl:psm:mx:gm:elan: \
itapi:TCP"

% export MPIRUN_OPTIONS="-prot"

% $MPI_ROOT/bin/mpirun -srun -n4 ./a.out

The command line for the above will appear to mpirun as $MPI_ROOT/bin/mpirun -prot -srun -n4 ./a.out and the interconnect decision will look for the presence of Elan and use it if found. Otherwise, interconnects will be tried in the order specified by MPI_IC_ORDER.

The following is an example of using TCP over GigE, assuming GigE is installed and 192.168.1.1 corresponds to the ethernet interface with GigE. Note the implicit use of -netaddr 192.168.1.1 is required to effectively get TCP over the proper subnet.

% export MPI_IC_ORDER="ibv:vapi:udapl:psm:mx:gm:elan: \
itapi:TCP"

% export MPIRUN_SYSTEM_OPTIONS="-netaddr 192.168.1.1"

% $MPI_ROOT/bin/mpirun -prot -TCP -srun -n4 ./a.out

On an XC system, the cluster installation will define the MPI interconnect search order based on what is present on the system.

MPI_IC_SUFFIXES
When HP-MPI is determining the availability of a given interconnect on Linux, it tries to open libraries and find loaded modules based on a collection of environment variables.

This is described in more detail in “Interconnect support”.

The use of interconnect environment variables MPI_ICLIB_ELAN, MPI_ICLIB_GM, MPI_ICLIB_ITAPI, MPI_ICLIB_MX, MPI_ICLIB_UDAPL, MPI_ICLIB_VAPI, and MPI_ICLIB_VAPIDIR has been deprecated. Refer to “Interconnect support” for more information on interconnect environment variables.

MPI_COMMD
MPI_COMMD routes all off-host communication through daemons rather than between processes. The MPI_COMMD syntax is as follows:

out_frags,in_frags

where

out_frags
Specifies the number of 16Kbyte fragments available in shared memory for outbound messages. Outbound messages are sent from processes on a given host to processes on other hosts using the communication daemon.

The default value for out_frags is 64. Increasing the number of fragments for applications with a large number of processes improves system throughput.

in_frags
Specifies the number of 16Kbyte fragments available in shared memory for inbound messages. Inbound messages are sent from processes on one or more hosts to processes on a given host using the communication daemon.

The default value for in_frags is 64. Increasing the number of fragments for applications with a large number of processes improves system throughput.

Only works with the -commd option. When -commd is used, MPI_COMMD specifies daemon communication fragments.

InfiniBand environment variables
MPI_IB_CARD_ORDER
Defines mapping of ranks to IB cards.

% setenv MPI_IB_CARD_ORDER <card#>[:port#]

Where:

card#
ranges from 0 to N-1

port#
ranges from 0 to 1

Card:port can be a comma separated list which drives the assignment of ranks to cards and ports within the cards.

Note that HP-MPI numbers the ports on a card from 0 to N-1, whereas utilities such as vstat display ports numbered 1 to N.

Examples:

To use the 2nd IB card:

% mpirun -e MPI_IB_CARD_ORDER=1 …

To use the 2nd port of the 2nd card:

% mpirun -e MPI_IB_CARD_ORDER=1:1 …

To use the 1st IB card:

% mpirun -e MPI_IB_CARD_ORDER=0 …

To assign ranks to multiple cards:

% mpirun -e MPI_IB_CARD_ORDER=0,1,2
will assign the local ranks per node in order to each card.

% mpirun -hostlist "host0 4 host1 4"
creates ranks 0-3 on host0 and ranks 4-7 on host1. Will assign rank 0 to card 0, rank 1 to card 1, rank 2 to card 2, rank 3 to card 0 all on host0. And will assign rank 4 to card 0, rank 5 to card 1, rank 6 to card 2, rank 7 to card 0 all on host1.

% mpirun -np 8 -hostlist "host0 host1"
creates ranks 0 through 7, alternating between host0 and host1. Will assign rank 0 to card 0, rank 2 to card 1, rank 4 to card 2, rank 6 to card 0 all on host0. And will assign rank 1 to card 0, rank 3 to card 1, rank 5 to card 2, rank 7 to card 0 all on host1.

MPI_IB_PKEY
HP-MPI supports IB partitioning via Mellanox VAPI and OpenFabrics Verbs API.

By default, HP-MPI searches for the unique full-membership partition key in the partition key table of the port used. If no such pkey is found, an error is issued. If multiple such pkeys are found, all of them are printed and an error message is issued.

If the environment variable MPI_IB_PKEY has been set to a value, either in hex or decimal, the value is treated as the pkey, and the pkey table is searched for the same pkey. If the pkey is not found, an error message is issued.

When a rank selects a pkey to use, a check is made to ensure that all ranks are using the same pkey. If ranks are not using the same pkey, an error message is issued.
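
For example, to select a specific partition key explicitly for all ranks (the value shown is hypothetical):

% $MPI_ROOT/bin/mpirun -e MPI_IB_PKEY=0x8001 -f appfile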

MPI_IBV_QPPARAMS
MPI_IBV_QPPARAMS=a,b,c,d,e Specifies QP settings for IBV where:

a
Time-out value for IBV retry if no response from target. Minimum is 1. Maximum is 31. Default is 18.

b
The retry count after time-out before error is issued. Minimum is 0. Maximum is 7. Default is 7.

c
The minimum Receiver Not Ready (RNR) NAK timer. After this time, an RNR NAK is sent back to the sender. Values: 1(0.01ms) – 31(491.52ms); 0(655.36ms). The default is 24(40.96ms).

d
RNR retry count before error is issued. Minimum is 0. Maximum is 7. Default is 7 (infinite).

e
The max inline data size. Default is 128 bytes.
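
For example, the following spells out the default settings explicitly (time-out 18, retry count 7, RNR timer 24, RNR retry count 7, 128-byte max inline data); raising only the first value would lengthen the IBV time-out:

% export MPI_IBV_QPPARAMS=18,7,24,7,128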

MPI_VAPI_QPPARAMS
MPI_VAPI_QPPARAMS=a,b,c,d specifies time-out setting for VAPI where:

a
Time out value for VAPI retry if no response from target. Minimum is 1. Maximum is 31. Default is 18.

b
The retry count after time-out before error is issued. Minimum is 0. Maximum is 7. Default is 7.

c
The minimum Receiver Not Ready (RNR) NAK timer. After this time, an RNR NAK is sent back to the sender. Values: 1(0.01ms) – 31(491.52ms); 0(655.36ms). The default is 24(40.96ms).

d
RNR retry count before error is issued. Minimum is 0. Maximum is 7. Default is 7 (infinite).

Memory usage environment variables
MPI_GLOBMEMSIZE
MPI_GLOBMEMSIZE=e, where e is the total bytes of shared memory for the job. If the job size is N ranks, each rank gets e/N bytes of shared memory. 12.5% is used as generic memory and 87.5% as fragments. The only way to change this ratio is to use MPI_SHMEMCNTL.
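
For example, to give an 8-rank job a total of 1GB of shared memory, so that each rank receives 128MB (split 87.5% fragments, 12.5% generic):

% export MPI_GLOBMEMSIZE=1073741824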

MPI_NO_MALLOCLIB
Set MPI_NO_MALLOCLIB to avoid using HP-MPI’s ptmalloc implementation and instead use the standard libc implementation (or perhaps a malloc implementation contained in the application).

See “Improved deregistration via ptmalloc (Linux only)” for more information.

MPI_PAGE_ALIGN_MEM
MPI_PAGE_ALIGN_MEM causes the HP-MPI library to page align and page pad memory. This is for multi-threaded InfiniBand support.

% export MPI_PAGE_ALIGN_MEM=1

MPI_PHYSICAL_MEMORY
MPI_PHYSICAL_MEMORY allows the user to specify the amount of physical memory in kilobytes available on the system. MPI normally attempts to determine the amount of physical memory for the purpose of determining how much memory to pin for RDMA message transfers on InfiniBand and Myrinet GM. The value determined by HP-MPI can be displayed using the -dd option. If HP-MPI specifies an incorrect value for physical memory, this environment variable can be used to specify the value explicitly:

% export MPI_PHYSICAL_MEMORY=1048576

The above example specifies that the system has 1GB of physical memory.

MPI_PIN_PERCENTAGE and MPI_PHYSICAL_MEMORY are ignored unless InfiniBand or Myrinet GM is in use.

MPI_RANKMEMSIZE
MPI_RANKMEMSIZE=d, where d is the total bytes of shared memory for each rank. 12.5% is used as generic memory and 87.5% as fragments. The only way to change this ratio is to use MPI_SHMEMCNTL. MPI_RANKMEMSIZE differs from MPI_GLOBMEMSIZE, which specifies the total shared memory across all ranks on the host. MPI_RANKMEMSIZE takes precedence over MPI_GLOBMEMSIZE if both are set. Both MPI_RANKMEMSIZE and MPI_GLOBMEMSIZE are mutually exclusive with MPI_SHMEMCNTL: if MPI_SHMEMCNTL is set, the other two cannot be set, and vice versa.

MPI_PIN_PERCENTAGE
MPI_PIN_PERCENTAGE communicates the maximum percentage of physical memory (see MPI_PHYSICAL_MEMORY) that can be pinned at any time. The default is 20%.

% export MPI_PIN_PERCENTAGE=30

The above example permits the HP-MPI library to pin (lock in memory) up to 30% of physical memory. The pinned memory is shared between ranks of the host that were started as part of the same mpirun invocation. Running multiple MPI applications on the same host can cumulatively cause more than one application’s MPI_PIN_PERCENTAGE to be pinned. Increasing MPI_PIN_PERCENTAGE can improve communication performance for communication intensive applications in which nodes send and receive multiple large messages at a time, such as is common with collective operations. Increasing MPI_PIN_PERCENTAGE allows more large messages to be progressed in parallel using RDMA transfers, however pinning too much of physical memory may negatively impact computation performance. MPI_PIN_PERCENTAGE and MPI_PHYSICAL_MEMORY are ignored unless InfiniBand or Myrinet GM is in use.

MPI_SHMEMCNTL
MPI_SHMEMCNTL controls the subdivision of each process’s shared memory for the purposes of point-to-point and collective communications. It cannot be used in conjunction with MPI_GLOBMEMSIZE. The MPI_SHMEMCNTL syntax is a comma separated list as follows:

nenv, frag, generic

where

nenv
Specifies the number of envelopes per process pair. The default is 8.

frag
Denotes the size in bytes of the message-passing fragments region. The default is 87.5 percent of shared memory after mailbox and envelope allocation.

generic
Specifies the size in bytes of the generic-shared memory region. The default is 12.5 percent of shared memory after mailbox and envelope allocation. The generic region is typically used for collective communication.

MPI_SHMEMCNTL=a,b,c where:

a
The number of envelopes for shared memory communication. The default is 8.

b
The bytes of shared memory to be used as fragments for messages.

c
The bytes of shared memory for other generic use, such as MPI_Alloc_mem() call.
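
For example, the following illustrative setting requests 16 envelopes per process pair, 32MB of fragment space, and 8MB of generic space (recall that MPI_SHMEMCNTL cannot be combined with MPI_GLOBMEMSIZE or MPI_RANKMEMSIZE):

% export MPI_SHMEMCNTL=16,33554432,8388608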

MPI_USE_MALLOPT_AVOID_MMAP
Instructs the underlying malloc implementation to avoid mmaps and instead use sbrk() to get all the memory used. The default is MPI_USE_MALLOPT_AVOID_MMAP=0.

Connection related environment variables
MPI_LOCALIP
MPI_LOCALIP specifies the host IP address that is assigned throughout a session. Ordinarily, mpirun determines the IP address of the host it is running on by calling gethostbyaddr. However, when a host uses a SLIP or PPP protocol, the host’s IP address is dynamically assigned only when the network connection is established. In this case, gethostbyaddr may not return the correct IP address.

The MPI_LOCALIP syntax is as follows:

xxx.xxx.xxx.xxx

where xxx.xxx.xxx.xxx specifies the host IP address.

MPI_MAX_REMSH
MPI_MAX_REMSH=N HP-MPI includes a startup scalability enhancement when using the -f option to mpirun. This enhancement allows a large number of HP-MPI daemons (mpid) to be created without requiring mpirun to maintain a large number of remote shell connections.

When running with a very large number of nodes, the number of remote shells normally required to start all of the daemons can exhaust the available file descriptors. To create the necessary daemons, mpirun uses the remote shell specified with MPI_REMSH to create up to 20 daemons only, by default. This number can be changed using the environment variable MPI_MAX_REMSH. When the number of daemons required is greater than MPI_MAX_REMSH, mpirun will create only MPI_MAX_REMSH number of remote daemons directly. The directly created daemons will then create the remaining daemons using an n-ary tree, where n is the value of MPI_MAX_REMSH. Although this process is generally transparent to the user, the new startup requires that each node in the cluster is able to use the specified MPI_REMSH command (e.g. rsh, ssh) to each node in the cluster without a password. The value of MPI_MAX_REMSH is used on a per-world basis. Therefore, applications which spawn a large number of worlds may need to use a small value for MPI_MAX_REMSH. MPI_MAX_REMSH is only relevant when using the -f option to mpirun. The default value is 20.
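
For example, to allow mpirun to open up to 64 direct remote-shell connections before falling back to the n-ary tree startup:

% export MPI_MAX_REMSH=64

% $MPI_ROOT/bin/mpirun -f appfile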

MPI_NETADDR
Allows control of the selection process for TCP/IP connections. The same functionality can be accessed by using the -netaddr option to mpirun. See “mpirun options” for more information.

MPI_REMSH
By default, HP-MPI attempts to use ssh on Linux and remsh on HP-UX. On Linux, we recommend that ssh users set StrictHostKeyChecking=no in their ~/.ssh/config.

To use rsh on Linux instead, the following script needs to be run as root on each node in the cluster:

% /opt/hpmpi/etc/mpi.remsh.default

Or, to use rsh on Linux, use the alternative method of manually populating the files /etc/profile.d/hpmpi.csh and /etc/profile.d/hpmpi.sh with the following settings respectively:

setenv MPI_REMSH rsh

export MPI_REMSH=rsh

On HP-UX, MPI_REMSH specifies a command other than the default remsh to start remote processes. The mpirun, mpijob, and mpiclean utilities support MPI_REMSH. For example, you can set the environment variable to use a secure shell:

% setenv MPI_REMSH /bin/ssh

HP-MPI allows users to specify the remote execution tool to use when HP-MPI needs to start processes on remote hosts. The tool specified must have a call interface similar to that of the standard utilities: rsh, remsh and ssh. An alternate remote execution tool, such as ssh, can be used on HP-UX by setting the environment variable MPI_REMSH to the name or full path of the tool to use:

% export MPI_REMSH=ssh

% $MPI_ROOT/bin/mpirun <options> -f <appfile>

HP-MPI also supports setting MPI_REMSH using the -e option to mpirun:

% $MPI_ROOT/bin/mpirun -e MPI_REMSH=ssh <options> -f \
<appfile>

This release also supports setting MPI_REMSH to a command which includes additional arguments:

% $MPI_ROOT/bin/mpirun -e MPI_REMSH="ssh -x" <options> \
-f <appfile>

When using ssh, first ensure that it is possible to use ssh from the host where mpirun is executed to the other nodes without ssh requiring any interaction from the user.

RDMA tunable environment variables
MPI_RDMA_INTRALEN
-e MPI_RDMA_INTRALEN=262144 Specifies the size (in bytes) of the transition from shared memory to interconnect when -intra=mix is used. For messages less than or equal to the specified size, shared memory will be used. For messages greater than that size, the interconnect will be used. TCP/IP, Elan, MX, and PSM do not have mixed mode.
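
For example, to use mixed mode with a 128KB crossover point, so that messages up to 131072 bytes go through shared memory and larger ones through the interconnect:

% $MPI_ROOT/bin/mpirun -intra=mix -e MPI_RDMA_INTRALEN=131072 -f appfile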

MPI_RDMA_MSGSIZE
MPI_RDMA_MSGSIZE=a,b,c Specifies message protocol length where:

a
Short message protocol threshold. If the message length is bigger than this value, middle or long message protocol is used. The default is 16384 bytes, but on HP-UX 32768 bytes is used.

b
Middle message protocol. If the message length is less than or equal to b, consecutive short messages are used to send the whole message. By default, b is set to 16384 bytes, the same as a, to effectively turn off middle message protocol. On IBAL, the default is 131072 bytes.

c
Long message fragment size. If the message is greater than b, the message is fragmented into pieces up to c in length (or actual length if less than c) and the corresponding piece of the user’s buffer is pinned directly. The default is 4194304 bytes, but on Myrinet GM and IBAL the default is 1048576 bytes.

MPI_RDMA_NENVELOPE
MPI_RDMA_NENVELOPE=N Specifies the number of short message envelope pairs for each connection if RDMA protocol is used, where N is the number of envelope pairs. The default is between 8 and 128 depending on the number of ranks.

MPI_RDMA_NFRAGMENT
MPI_RDMA_NFRAGMENT=N Specifies the number of long message fragments that can be concurrently pinned down for each process, either sending or receiving. The max number of fragments that can be pinned down for a process is 2*N. The default value of N is 128.

MPI_RDMA_NONESIDED
MPI_RDMA_NONESIDED=N Specifies the number of one-sided operations that can be posted concurrently for each rank, no matter the destination. The default is 8.

MPI_RDMA_NSRQRECV
MPI_RDMA_NSRQRECV=K Specifies the number of receiving buffers used when the shared receiving queue is used, where K is the number of receiving buffers. If N is the number of off-host connections from a rank, the default is the smaller of N*8 and 2048. That is, the number of receiving buffers is calculated as 8 times the number of off-host connections, capped at a maximum of 2048.

prun/srun environment variables
MPI_SPAWN_PRUNOPTIONS
Allows prun options to be implicitly added to the launch command when SPAWN functionality is used to create new ranks with prun.

MPI_SPAWN_SRUNOPTIONS
Allows srun options to be implicitly added to the launch command when SPAWN functionality is used to create new ranks with srun.

MPI_SRUNOPTIONS
Allows additional srun options to be specified, such as --label.

% setenv MPI_SRUNOPTIONS <option>

MPI_USEPRUN
HP-MPI provides the capability to automatically assume that prun is the default launching mechanism. This mode of operation automatically classifies arguments into ‘prun’ and ‘mpirun’ arguments and correctly places them on the command line. The assumed prun mode also allows appfiles to be interpreted for command line arguments and translated into prun mode. The implied prun method of launching is useful for applications which embed or generate their mpirun invocations deeply within the application.

See Appendix C for more information.

MPI_USEPRUN_IGNORE_ARGS
Provides an easy way to modify the arguments contained in an appfile by supplying a list of space-separated arguments that mpirun should ignore.

% setenv MPI_USEPRUN_IGNORE_ARGS <option>

MPI_USESRUN
HP-MPI provides the capability to automatically assume that srun is the default launching mechanism. This mode of operation automatically classifies arguments into ‘srun’ and ‘mpirun’ arguments and correctly places them on the command line. The assumed srun mode also allows appfiles to be interpreted for command line arguments and translated into srun mode. The implied srun method of launching is useful for applications which embed or generate their mpirun invocations deeply within the application. This allows existing ports of an application from an HP-MPI supported platform to run on XC.

See Appendix C for more information.

MPI_USESRUN_IGNORE_ARGS
Provides an easy way to modify the arguments contained in an appfile by supplying a list of space-separated arguments that mpirun should ignore.

% setenv MPI_USESRUN_IGNORE_ARGS <option>

In the example below, the command line contains a reference to -stdio=bnone which is filtered out because it is set in the ignore list.

% setenv MPI_USESRUN_VERBOSE 1

% setenv MPI_USESRUN_IGNORE_ARGS -stdio=bnone

% setenv MPI_USESRUN 1

% setenv MPI_SRUNOPTIONS --label

% bsub -I -n4 -ext "SLURM[nodes=4]" \
$MPI_ROOT/bin/mpirun -stdio=bnone -f appfile -- pingpong

 
Job <369848> is submitted to default queue <normal>.
<<Waiting for dispatch ...>>
<<Starting on lsfhost.localdomain>>
/opt/hpmpi/bin/mpirun
unset MPI_USESRUN;/opt/hpmpi/bin/mpirun -srun ./pallas.x -npmin 4 pingpong
 

MPI_PRUNOPTIONS
Allows prun specific options to be added automatically to the mpirun command line. For example:

% export MPI_PRUNOPTIONS="-m cyclic -x host0"

% mpirun -prot -prun -n2 ./a.out

is equivalent to:

% mpirun -prot -prun -m cyclic -x host0 -n2 ./a.out

TCP environment variables
MPI_TCP_CORECVLIMIT
The integer value indicates the number of simultaneous messages larger than 16KB that may be transmitted to a single rank at once via TCP/IP. Setting this variable to a larger value can allow HP-MPI to utilize more parallelism during its low-level message transfers, but can greatly reduce performance by causing switch congestion. Setting MPI_TCP_CORECVLIMIT to zero will not limit the number of simultaneous messages a rank may receive at once. The default value is 0.
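
For example, to allow up to four simultaneous large-message TCP/IP transfers per receiving rank (an illustrative value; larger values risk switch congestion, as noted above):

% export MPI_TCP_CORECVLIMIT=4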

MPI_SOCKBUFSIZE
Specifies, in bytes, the amount of system buffer space to allocate for sockets when using the TCP/IP protocol for communication. Setting MPI_SOCKBUFSIZE results in calls to setsockopt (…, SOL_SOCKET, SO_SNDBUF, …) and setsockopt (…, SOL_SOCKET, SO_RCVBUF, …). If unspecified, the system default (which on many systems is 87380 bytes) is used.
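
For example, to request 1MB socket send and receive buffers instead of the system default:

% export MPI_SOCKBUFSIZE=1048576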

Elan environment variables
MPI_USE_LIBELAN
By default when Elan is in use, the HP-MPI library uses Elan’s native collective operations for performing MPI_Bcast and MPI_Barrier operations on MPI_COMM_WORLD sized communicators. This behavior can be changed by setting MPI_USE_LIBELAN to “false” or “0”, in which case these operations will be implemented using point-to-point Elan messages.

To turn off:

% export MPI_USE_LIBELAN=0

MPI_USE_LIBELAN_SUB
The use of Elan’s native collective operations may be extended to include communicators which are smaller than MPI_COMM_WORLD by setting the MPI_USE_LIBELAN_SUB environment variable to a positive integer. By default, this functionality is disabled due to the fact that libelan memory resources are consumed and may eventually cause runtime failures when too many sub-communicators are created.

% export MPI_USE_LIBELAN_SUB=10

MPI_ELANLOCK
By default, HP-MPI only provides exclusive window locks via Elan lock when using the Elan interconnect. In order to use HP-MPI shared window locks, the user must turn off Elan lock and use window locks via shared memory. In this way, both exclusive and shared locks are from shared memory. To turn off Elan locks, set MPI_ELANLOCK to zero.

% export MPI_ELANLOCK=0

Rank Identification Environment Variables
HP-MPI sets several environment variables to let the user access information about the MPI rank layout prior to calling MPI_Init. These variables differ from the others in this section in that the user doesn’t set these to provide instructions to HP-MPI; HP-MPI sets them to give information to the user’s application.

HPMPI=1
This variable is set so that an application can conveniently tell if it is running under HP-MPI.

MPI_NRANKS
This is set to the number of ranks in the MPI job.

MPI_RANKID
This is set to the rank number of the current process.

MPI_LOCALNRANKS
This is set to the number of ranks on the local host.

MPI_LOCALRANKID
This is set to the rank number of the current process relative to the local host (0.. MPI_LOCALNRANKS-1).

Note that these settings are not available when running under srun or prun. However, similar information can be gathered from the variables set by those systems; such as SLURM_NPROCS and SLURM_PROCID.
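
A minimal sketch of using these variables from a wrapper script (the script name is hypothetical); each rank reports its global and host-local placement before exec'ing the real executable:

% cat myrank.sh
#!/bin/sh
# Report where HP-MPI placed this process, then start the application.
echo "host=`hostname` rank=$MPI_RANKID of $MPI_NRANKS local=$MPI_LOCALRANKID of $MPI_LOCALNRANKS"
exec ./a.out

% $MPI_ROOT/bin/mpirun -np 4 ./myrank.sh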
