(Slide 1: Overview of Parallelization with MPI)
The purpose of this presentation is to provide an overview of parallelization with MPI, the Message Passing Interface.
(Slide 2: Presentation Outline)
First I will present
background information, then
the MPI programming model,
the commonly used blocking point-to-point routines,
the commonly used blocking collective routines,
the format for MPI programs
column and row block distribution
And finally my concluding remarks
(Slide 3: Background)
MPI is a message passing library for writing parallel Fortran, C, and C++ programs. MPI 1.0 became available in 1994. MPI continues to evolve and MPI 3.1 became available in 2016. MPI programs can run on both shared memory and distributed memory parallel machines.
(Slide 4: The MPI Programming Model)
MPI uses the Single Program Multiple Data programming model. This means that all MPI processes run the same program. The user specifies the number of MPI processes. Then MPI assigns each MPI process a rank: rank 0, rank 1, …, rank (p-1) where p is the number of MPI processes selected. This allows different MPI processes to perform different calculations. The following example shows how this can be done.
(Slide 5: Example)
Notice that the code for the rank 0 process can be different from the code used for all the other processes.
(Slide 6: Collective & Point-to-Point Routines)
To send data between MPI processes, one calls MPI collective and/or point-to-point routines. MPI collective routines involve all p processes in the communication. MPI point-to-point routines involve only two processes in the communication. Both collective and point-to-point routines can be blocking and nonblocking. Blocking routines are easier to use than nonblocking and are recommended for beginners. A discussion of blocking and nonblocking routines is beyond the scope of this presentation. When possible, one should use collective routines instead of point-to-point.
(Slide 7: Commonly Used Blocking Point-to-Point Routines)
The commonly used blocking point-to-point routines are:
mpi_send is used for sending messages.
mpi_ssend is the synchronous send which is useful for program debugging.
mpi_recv is used for receiving messages.
mpi_sendrecv is used for exchanging messages between MPI processes when the send and receive buffers are not the same.
mpi_sendrecv_replace is used for exchanging messages between MPI processes when the send and receive buffers are the same.
(Slide 8: Commonly Used Blocking Collective Routines)
The commonly used blocking collective routines are:
mpi_barrier is used to ensure all p processes must reach the barrier before execution continues.
mpi_bcast is used to broadcast a message on the root process to all other MPI processes.
mpi_gather gathers messages from all MPI processes to a root process.
mpi_allgather gathers messages from all MPI processes and places the gathered message on all processes.
mpi_scatter scatters data from a root process to all other processes.
mpi_alltoall causes every MPI process sends data to all MPI processes.
mpi_reduce reduces values on all processes to a single value and places it on the root process.
mpi_allreduce is similar to mpi_reduce except the reduced values are placed on all processes.
(Slide 9: Format for MPI Programs)
The following shows the format for an MPI program and lists the MPI routines needed for every MPI program:
“use mpi” must be the first statement in the program and makes all MPI routines available to the program.
The next section contains the declaration of program variables. For convenience of programming, I suggest including the following statement:
integer,parameter :: comm=mpi_comm_world, dp=mpi_double_precision
since one can then use “comm” instead of “mpi_comm_world” and “dp” instead of “mpi_double_precision”.
The call to mpi_init initializes the MPI environment and must be called prior to calling other MPI routines.
The call to mpi_comm_size sets the value of p to the number of MPI processes used.
The call to mpi_comm_rank sets rank equal to the rank of the executing process.
The next section is the “program”. This is followed by
calling mpi_finalize to terminate the MPI environment.
The end statement then ends program execution.
(Slide 10: Column and Row Block Distribution)
Before writing an MPI program, one must first decide how to divide data and computation among processes in a way to provide good load balancing and minimize message passing time. For example, if the program data is a 2-dimensional, rectangular computational domain then it can be column or row blocked distributed among p processes.
(Slide 11: Example 1: A= B + C – page 1)
Example 1 shows how to parallelize A = B + C where A, B and C are 2 dimensional, double precision arrays of size n by n that are column blocked among p processes so on each process their size is n by m where m = n/p. Notice that A, B and C are declared to be allocatable since their size depends on the number of prosesses which is not known until program execution. The value of 4000 for n was selected arbitrarily.
(Slide 12: Example 1: A= B + C – page 2)
Page 2 of example 1 shows how the n columns of the arrays are divided among p processes. Notice it does not assume that p divides n. Then B and C are allocated and initialized.
(Slide 13: Example 1: A= B + C – page 3)
Page 3 of example 1 shows the last portion of the program by computing the sum on each column block. The program is then terminated by calling mpi_finalize and “end”. Notice that the answer to the global sum of A is distributed among all p processes and that message passing is not needed for example 1.
(Slide 14: Example 2: sum = dot product(a,b)
Example 2 shows how to compute the global dot product of two 1 dimensional arrays a and b each of size n distributed across p processes. The value of the global dot product of a and b and is returned in the variable named dot_product on all p processes. Notice that local sums are first computed and then the global sum is calculated and put on all p processes by calling mpi_allreduce.
(Slide 15: Concluding Remarks)
In conclusion, the MPI message passing library is used to parallelize Fortran, C and C++ programs for both distributed memory and shared memory parallel computers.
To achieve high performance, serial execution must be fast, load balancing must be good, and there must be minimal overhead from MPI.
A comprehnsive MPI tutorial is available on the Lawrence Livermore National Laboratory web site.
I would like to thank the College of Liberal Arts and Sciences for their support for this project.