Research Computing | OpenMP

Slide 1: 
The purpose of this presentation is to provide an overview of shared memory parallelization with OpenMP.

Slide 2: 
For this presentation, I will first give background information, then talk about loop parallelization, race conditions, critical regions, using private and shared variables, and the reduction clause, and then I’ll give my concluding remarks.

Slide 3: 
OpenMP provides an easy-to-use method to write parallel programs for shared memory computers. OpenMP is a collection of directives for Fortran and pragmas for C and C++ that are added to an existing program. OpenMP 1.0 was released in 1997 and was simple. Later versions are considerably more complex: support for accelerators was added in OpenMP 4.0, and OpenMP 5.0 was released in 2018. This presentation uses mostly OpenMP 1.0 constructs.

Slide 4: 
OpenMP programs consist of serial and parallel regions. All OpenMP programs begin with a serial region that is executed by thread 0. Parallel regions are executed by multiple threads. Parallel regions are started by using a parallel directive or pragma and terminated by another directive or pragma. For example, in a Fortran program parallel regions are usually started using the “omp parallel” or “omp parallel do” directives. OpenMP’s primary parallelization method is the parallelization of loops within parallel regions.
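
The transcript does not include code for this slide; the following is a minimal free-form Fortran sketch of the structure just described (the print statements and the use of omp_lib are illustrative, not from the slides):

    program regions
       use omp_lib                ! provides omp_get_thread_num()
       implicit none
       ! Serial region: executed by thread 0 only.
       print *, 'serial region, thread', omp_get_thread_num()
    !$omp parallel
       ! Parallel region: every thread executes this block.
       print *, 'parallel region, thread', omp_get_thread_num()
    !$omp end parallel
       ! Back in a serial region after the parallel region ends.
       print *, 'serial region again, thread', omp_get_thread_num()
    end program regions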

Slide 5: 
Loop iterations must be independent of their order of execution to allow parallelization.
Loop iterations are scheduled onto OpenMP threads for parallel execution, and OpenMP provides several ways to schedule them: auto, static, dynamic, or guided scheduling. A detailed discussion of scheduling options is beyond the scope of this presentation. However, if loop iterations take about the same amount of time to execute, static scheduling will usually give the best performance; otherwise, auto scheduling will usually perform best. Program variables in a parallel region must be either private or shared. Each thread has its own copy of a private variable, while a shared variable has the same value on all threads.
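
As an illustration of the scheduling choice, here is a hypothetical loop whose iterations do increasing amounts of work, so static scheduling would balance the load poorly; the chunk size of 16 is an arbitrary example value:

    program uneven
       implicit none
       integer, parameter :: n = 10000
       integer :: i, j
       real(kind=8) :: work(n)
       work = 0.0d0
       ! Iteration i does i units of work, so iteration costs vary widely;
       ! dynamic scheduling hands chunks of 16 iterations to idle threads.
    !$omp parallel do schedule(dynamic,16) private(j)
       do i = 1, n
          do j = 1, i
             work(i) = work(i) + 1.0d0
          end do
       end do
    !$omp end parallel do
       print *, work(n)
    end program uneven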

Slide 6: 
This example shows how one can parallelize the vector add operation using OpenMP. Notice that a, b, and c are shared variables and the loop index is private. The schedule clause shows how one can select among the scheduling options. The “omp parallel do” directive indicates that the following loop is to be parallelized. The “omp end parallel do” directive terminates the parallel region, so subsequent statements are executed serially on thread 0.
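
The slide’s code is not reproduced in the transcript; the sketch below is consistent with the narration (the array length, initial values, and choice of static scheduling are illustrative):

    program vecadd
       implicit none
       integer, parameter :: n = 100000
       integer :: i
       real :: a(n), b(n), c(n)
       b = 1.0
       c = 2.0
       ! a, b, and c are shared; the loop index i is private to each thread.
    !$omp parallel do shared(a,b,c) private(i) schedule(static)
       do i = 1, n
          a(i) = b(i) + c(i)
       end do
    !$omp end parallel do
       ! Back in the serial region, executed by thread 0 only.
       print *, a(1)
    end program vecadd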

Slide 7: 
When writing an OpenMP program, one needs to be aware of the possibility of race conditions. A race condition exists when answers depend on the order of thread execution. The following code segment illustrates a common race condition in the sum operation, since the shared variable sum may be updated by multiple threads at the same time.
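
That code segment is not included in the transcript; the following sketch shows the kind of race described, assuming a sum over an array a (names and sizes are illustrative):

    program racy
       implicit none
       integer, parameter :: n = 100000
       integer :: i
       real :: a(n), sum
       a = 1.0
       sum = 0.0
       ! RACE: sum is shared, so several threads may read, add to, and
       ! write it back at the same time, losing updates.
    !$omp parallel do shared(a,sum) private(i)
       do i = 1, n
          sum = sum + a(i)
       end do
    !$omp end parallel do
       print *, sum   ! often prints less than n
    end program racy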

Slide 8: 
To eliminate the above race condition, one may use an OpenMP critical region, since a critical region is executed by only one thread at a time. The code between the “omp critical” and “omp end critical” directives shows how a critical region is designated. Even though this corrects the race condition, execution is very slow, since the critical region serializes the calculation. Using private variables to compute local sums, one can significantly improve the performance.
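
Continuing the same illustrative sum, here is a sketch of the critical-region version described above:

    program critsum
       implicit none
       integer, parameter :: n = 100000
       integer :: i
       real :: a(n), sum
       a = 1.0
       sum = 0.0
    !$omp parallel do shared(a,sum) private(i)
       do i = 1, n
          ! Only one thread at a time may execute the critical region, so
          ! the result is correct but the updates are serialized.
    !$omp critical
          sum = sum + a(i)
    !$omp end critical
       end do
    !$omp end parallel do
       print *, sum
    end program critsum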

Slide 9: 
The following is a general OpenMP optimization principle: Perform calculations using private variables as much as possible. For example, the performance of the above sum code is greatly improved when using private variables for computing a local sum on each thread.

Slide 10: 
This example does not use the “omp parallel do” directive but instead creates a parallel region using the “omp parallel” directive. The “omp do” directive is used to parallelize the loop, and the “omp end parallel” directive terminates the parallel region. Notice that the local sums are computed using the private variable sloc, and only the update of sum with sloc is placed in a critical region. This greatly improves performance.
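
A sketch matching that description (the array contents are illustrative; the variable name sloc is from the narration):

    program localsum
       implicit none
       integer, parameter :: n = 100000
       integer :: i
       real :: a(n), sum, sloc
       a = 1.0
       sum = 0.0
    !$omp parallel shared(a,sum) private(i,sloc)
       sloc = 0.0   ! each thread accumulates into its own private copy
    !$omp do
       do i = 1, n
          sloc = sloc + a(i)
       end do
    !$omp end do
       ! One short critical update per thread instead of one per iteration.
    !$omp critical
       sum = sum + sloc
    !$omp end critical
    !$omp end parallel
       print *, sum
    end program localsum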

Slide 11: 
The sum operation is used so often that OpenMP provides a special clause to simplify programming and to allow compilers to optimize the sum and other reduction operations. Using the reduction clause, the sum operation can be written by adding the “reduction(+:sum)” clause to the “omp do” directive. The reduction clause in this example performs the same computation as in the previous two examples, but it allows the compiler to perform further optimizations.
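
A sketch of the reduction version, using the same illustrative array as before:

    program redsum
       implicit none
       integer, parameter :: n = 100000
       integer :: i
       real :: a(n), sum
       a = 1.0
       sum = 0.0
    !$omp parallel shared(a) private(i)
       ! Each thread gets a private copy of sum, initialized to zero;
       ! the copies are combined with + at the end of the loop.
    !$omp do reduction(+:sum)
       do i = 1, n
          sum = sum + a(i)
       end do
    !$omp end do
    !$omp end parallel
       print *, sum
    end program redsum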

Lawrence Livermore National Laboratory: https://computing.llnl.gov/tutorials/...