Loose MPICH MPI Integration with Grid Engine 5.3
Related FAQ entries:
[lart=33 lang=en]
[lart=34 lang=en]
Background:
Grid Engine can support multiple, customized “parallel environments” allowing for parallel aware applications to run within the cluster. Each parallel environment (PE) within Grid Engine can be customized to satisfy the potentially unique startup/shutdown/cleanup demands of the parallel environment (typically MPICH, LAM-MPI or PVM) or even specific scientific applications.
“Loose” vs. “Tight” Integration of Parallel Environments:
The term “loose integration” is used to describe an integration approach in which the cluster scheduler is only responsible for finding available parallel job slots within the cluster and dispatching pending jobs at the appropriate time. If a parallel job is requesting 8 CPUs the scheduler will hold the job until 8 free slots are available within the cluster. Once the resources are available, the scheduler will dispatch the job along with a unique machine file that designates which machines and/or CPUs the parallel job is allowed to run on.
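For example, if an 8-CPU job were dispatched across four dual-processor nodes, the machine file handed to the job might look something like the listing below (the hostnames are purely illustrative, and the exact layout depends on the MPI implementation being used):
node01
node01
node02
node02
node03
node03
node04
node04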
The primary advantage of “loose” integration is its simplicity. Because the scheduler does not have to do much more than match available job slots with requested CPUs (and decide when pending jobs get launched!) it is fairly easy to quickly add Grid Engine support for all sorts of parallel application environments including PVM, MPICH, LAM-MPI, LINDA, etc.
There are several disadvantages to “loose” integration. The primary downside is that the parallel tasks are not running under the control and direction of a sge_shepherd daemon. This means that Grid Engine may not be able to accurately account for resource utilization or clean up tasks left over from a runaway job. With “loose” integration you must also trust the parallel application itself to honor the customized machine file being provided. There are no technical barriers to prevent the job from ignoring the provided machinefile and just launching parallel tasks at will on every cluster node.
“Tight Integration” approaches solve these sorts of problems by binding Grid Engine more directly into the parallel application environment. With “tight” integration, Grid Engine does far more than just kicking out a custom machine file — it also takes over the responsibility for launching and managing the parallel tasks themselves. There is far more control and monitoring of the parallel jobs.
The primary problem with “tight” integration is that it tends to be highly specific to the parallel environment being used. In some cases it may not be enough to build tightly integrated PEs for MPICH or LAM-MPI; you may be forced to integrate on an application-by-application basis.
As a general rule, we recommend starting first with loosely integrated parallel environments. Then, if needed, tight integration can be explored for critical applications or highly popular parallel environments.
What this document covers:
This document covers one way of setting up a loosely integrated parallel environment within Grid Engine (version 5.3, not 6.x) that supports the MPICH implementation of the MPI standard.
This document was written using MPICH-1.2.6 built on an Apple Xserve cluster running Mac OS X.
Grid Engine 5.3 versus Grid Engine 6.0:
The SGE information in this document is specific to Grid Engine 5.3 — both the “standard” and “enterprise” editions. The last public release of Grid Engine 5.3 is v5.3p6.
Why? Because the treatment of parallel environments has changed slightly in Grid Engine 6. We will add Grid Engine 6-specific instructions shortly.
Creating the Parallel Environment:
You will need to be logged into a system that is considered an “admin host” by Grid Engine. Your account should also be one that has Grid Engine manager authority.
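If you are not sure about either of these, you can check from the command line. “qconf -sh” lists the hosts that have admin host privileges and “qconf -sm” lists the accounts that have manager privileges:
# qconf -sh
# qconf -sm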
A new parallel environment (PE) is created by issuing the “qconf -ap <pe_name>” command. Let’s name our PE ‘mpich’.
Grid Engine happens to ship with simple PE start and stop scripts for MPICH environments. They are suitable for our loose integration needs. These scripts can be found in your $SGE_ROOT directory inside the ‘mpi/’ folder. For iNquiry clusters this location would be /common/sge/mpi/.
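The essential job of startmpi.sh is to convert the $pe_hostfile that Grid Engine writes for each parallel job (one line per host: hostname, slot count, queue, processor range) into a machines file under $TMPDIR that mpirun can consume. The fragment below is only a simplified sketch of that idea, not the shipped script, which also handles options such as optional rsh wrappers:
#!/bin/sh
# Simplified illustration of the machine file generation step only.
pe_hostfile=$1
machines=$TMPDIR/machines
cat /dev/null > $machines
# Emit each hostname once per slot granted on that host.
while read host nslots rest; do
    i=0
    while [ $i -lt $nslots ]; do
        echo $host >> $machines
        i=`expr $i + 1`
    done
done < $pe_hostfile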
The example below sets the number of available MPICH slots to “14”. This is because our testbed cluster has 7 dual-processor compute nodes and we want at most one parallel task running per CPU. You will need to adjust the slots value to reflect your local cluster environment.
The example below also sets the allocation_rule to something called “$fill_up” — this forces Grid Engine to fill all available parallel job slots on one machine before moving on to the next machine.
This may not match what you want on your local cluster — some people do not want to “pack” their tasks onto a smaller number of machines. If you set the value of allocation_rule to $round_robin, Grid Engine will attempt to spread the parallel tasks across as many machines as possible. The “best” setting for this rule may be different for various parallel application types and is one of the main reasons why you may want to consider setting up separate parallel environments for certain cluster applications.
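As a purely hypothetical illustration, a 5-slot job landing on dual-CPU nodes that all have free slots might be allocated roughly as follows under the two rules:
$fill_up:      node01=2, node02=2, node03=1   (pack each node before moving on)
$round_robin:  node01=1, node02=1, node03=1, node04=1, node05=1   (spread one slot at a time)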
Issue the command:
qconf -ap mpich
When the editor pops up, populate the fields with the following information (remember to customize the slots and allocation_rule values):
pe_name mpich
queue_list all
slots 14
user_lists NONE
xuser_lists NONE
start_proc_args /common/sge/mpi/startmpi.sh $pe_hostfile
stop_proc_args /common/sge/mpi/stopmpi.sh
allocation_rule $fill_up
control_slaves FALSE
job_is_first_task TRUE
If you look at the startmpi.sh and stopmpi.sh scripts you will see that while startmpi.sh is doing some useful work, all that the stop script is doing is deleting the machines file from a temporary directory. This is not really useful at all so feel free to disable the “stop_proc_args” stop script by inserting “/bin/true” in place of the stopmpi.sh script.
How? Within Grid Engine, the configuration of an existing parallel environment is modified with the command “qconf -mp <pe_name>”.
So to modify our mpich PE we would issue the command:
qconf -mp mpich
and simply edit the settings as necessary.
At any time you may print out the current configuration of the mpich PE by issuing the command:
# qconf -sp mpich
pe_name mpich
queue_list all
slots 14
user_lists NONE
xuser_lists NONE
start_proc_args /common/sge/mpi/startmpi.sh $pe_hostfile
stop_proc_args /bin/true
allocation_rule $fill_up
control_slaves FALSE
job_is_first_task TRUE
Testing the newly created MPICH PE:
We are going to take the example program ‘cpi’ that is distributed with the MPICH code and try to submit it to our newly created Grid Engine parallel environment.
As a reminder, this document assumes that you have already installed MPICH and compiled the ‘cpi’ binary similarly to what is described in our FAQ entry: [lart=33 lang=en]
Because Grid Engine 5.3 does not allow for direct submission of a binary, we have to throw together a trivial wrapper script. In the script below we embed Grid Engine commands so we don’t have to include them when we submit the script via ‘qsub’. The embedded commands are simple: we assign a name to the job, request the ‘mpich’ parallel environment with a range of CPUs, and tell Grid Engine to default to the current directory whenever full pathnames are not used:
#!/bin/csh -f
#
### Begin embedded Grid Engine arguments
# (name the job)
#$ -N MPI_Job
# (ask for our newly created PE and a range of 3-5 CPUs)
#$ -pe mpich 3-5
# (assume current working directory for paths)
#$ -cwd
### End embedded Grid Engine commands
echo "Got $NSLOTS slots. (tempdir=$TMPDIR)"
/usr/local/mpich-1.2.6/ch_p4/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./cpi
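Before submitting, you may want a way to confirm which hosts a given run was actually handed. One optional debugging addition is to print the generated machine file from inside the wrapper script, just before the mpirun line:
cat $TMPDIR/machines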
Trying to run the test script:
[workgroupcluster:~/mpitest] www% qsub ./mpi_cpi.sh
your job 2700 ("MPI_Job") has been submitted
[workgroupcluster:~/mpitest] www%
[workgroupcluster:~/mpitest] www% qstat
job-ID prior name user state submit/start at queue master ja-task-ID
-----------------------------------------------------------------------------------------
2700 0 MPI_Job www t 09/30/2004 13:56:24 node01.q SLAVE
0 MPI_Job www t 09/30/2004 13:56:24 node01.q SLAVE
2700 0 MPI_Job www t 09/30/2004 13:56:24 node03.q MASTER
0 MPI_Job www t 09/30/2004 13:56:24 node03.q SLAVE
2700 0 MPI_Job www t 09/30/2004 13:56:24 node04.q SLAVE
[workgroupcluster:~/mpitest]
If it works, you will see several output files. The standard output will contain something like this:
Got 5 slots. (tempdir=/tmp/2700.1.node03.q)
pi is approximately 3.1416009869231245, Error is 0.0000083333333314
wall clock time = 0.008615
And the standard error file will contain something similar to:
Process 0 on node03.cluster.private
Process 3 on node04.cluster.private
Process 1 on node01.cluster.private
Process 2 on node01.cluster.private
Process 4 on node01.cluster.private
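Both files are written to the job’s working directory (thanks to the -cwd flag) and are named from the job name and job ID, so for this particular run you could inspect them with something like:
[workgroupcluster:~/mpitest] www% cat MPI_Job.o2700
[workgroupcluster:~/mpitest] www% cat MPI_Job.e2700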
Putting it all together:
Using your cluster to calculate values of Pi via MPI is fun for a very short period of time.
If you have gotten to the point where you can run the ‘cpi’ program within Grid Engine, you have completed 90% of the work necessary to get the parallel applications you really are interested in up and running.
Consider what you had to do to get to this point:
1. Build, configure and test MPICH
2. Create, debug and test a Grid Engine parallel environment
3. Successfully run a MPI job with Grid Engine
The final step is running your application of choice under this newly created Grid Engine PE.
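As a starting point, a wrapper for your own application can follow exactly the same pattern as the cpi test script. In the sketch below, the job name, slot range, application name and input file are all placeholders you would replace with your own values:
#!/bin/csh -f
### Begin embedded Grid Engine arguments
#$ -N My_Parallel_Job
#$ -pe mpich 4-8
#$ -cwd
### End embedded Grid Engine arguments
# Replace the program name and arguments with your real MPI-enabled application
/usr/local/mpich-1.2.6/ch_p4/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./my_mpi_application my_input_file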
Want some homework? Looking for a good life science test case? Well if you are interested in phylogenetic analysis, consider trying to build and use a parallel-enabled version of MrBayes by referencing this FAQ entry: [lart=34 lang=en]