Assignment 10: Hybrid MPI+OpenMP
Note (Apr 15, 2025): the value given below to use for D (i.e., Δt) in the bigger case (N=200) was wrong; it should have been D=0.000125, i.e., 10x smaller than what was given below. If you have already done your assignment with the old value, we will of course accept it, but there would likely not have been a speedup from going hybrid. With the correct value of D, it should be possible to obtain a speedup.
You are given an MPI code that simulates the three-dimensional KPP-Fisher equation:
∂u/∂t − ∂²u/∂x² − ∂²u/∂y² − ∂²u/∂z² = u(1−u)
which is a diffusion equation with a non-linear reaction term.
The code can be found below or on the Teach cluster by issuing the command "git clone ~lcl_uotphy1610s1001/assignment10".
The x, y, and z values are restricted to a cube [0,L]x[0,L]x[0,L], with the condition that u = A on all the boundaries.
The initial value is given by u(x,y,z,0)=0 (except on the boundaries).
The solution is computed numerically with an explicit forward Euler time-stepping scheme with timestep Δt up to a time T, on a discretization of each dimension into N points. The simulation uses a straightforward stencil method without calls to BLAS or FFTW.
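Schematically, one such time step updates every interior point from the values of its six neighbours. The following is only a sketch under assumed names (u, unew, timestep, and fac are made up, and dx would presumably be L/(N−1) for N points on [0,L]); the actual code in pkkfisher3d.cpp also divides this work over the MPI processes:

    #include <rarray>

    // Serial sketch of one forward Euler time step on the interior points
    // (hypothetical names; boundary points stay at u=A and are not updated).
    void timestep(rarray<double,3>& u, rarray<double,3>& unew,
                  double dt, double dx, int N)
    {
        const double fac = dt/(dx*dx);  // stencil prefactor from the Laplacian
        for (int i = 1; i < N-1; i++)
            for (int j = 1; j < N-1; j++)
                for (int k = 1; k < N-1; k++)
                    unew[i][j][k] = u[i][j][k]
                        + fac*(u[i+1][j][k] + u[i-1][j][k]
                             + u[i][j+1][k] + u[i][j-1][k]
                             + u[i][j][k+1] + u[i][j][k-1]
                             - 6.0*u[i][j][k])
                        + dt*u[i][j][k]*(1.0 - u[i][j][k]);  // reaction term u(1-u)
    }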
The main code is in pkkfisher3d.cpp. It is fairly flat (i.e., not very modular), to make it clearer (hopefully) what is going on. The little bit of modularity there is can be found in the helper files output.h, output.cpp, readcommandline.h, and readcommandline.cpp, which deal with the data output and the input parameters.
"make" compiles the application, producing the application "pkkfisher3d". On the Teach cluster, this requires the following modules to be loaded: gcc/12.3, rarray/2.8.0, boost/1.85.0, and openmpi/4.1.5. There's a file called 'setup' as well, so you can type "source setup" to load these modules.
During the simulation, P snapshots of the values of u are written to a file F, on all interior points, i.e., the (N−2)³ points excluding the boundary where u=A. The format of the output is such that each line contains five numbers separated by spaces: the time, the position in each of the three dimensions, and the value of u at that time and position.
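A hedged sketch of how one such line could be written (the names write_point and fout are made up; the actual code in output.cpp will differ in detail):

    #include <fstream>

    // Hypothetical sketch: write one snapshot line in the format
    // "time x y z u", i.e., five space-separated numbers.
    void write_point(std::ofstream& fout, double t,
                     double x, double y, double z, double u)
    {
        fout << t << " " << x << " " << y << " " << z << " " << u << "\n";
    }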
P, L, A, N, T, Δt, and F are input parameters of the program.
"make run" runs all the applications using the following parameters: L=15, A=0.2, N=100, T=10, P=10, and Δt=0.001, once in serial and once with 4 processes. The results stored in the files output1.dat and output4.dat and should be identical.
Your task is to create a hybrid MPI+OpenMP version of this code, and to use it to scale up to a larger problem with N=200 and D=0.00125 (keep the other parameters the same as above), using up to 3 Teach-cluster nodes of 40 cores each. [Apr 15, 2025: Please see the note above; this value of D is incorrect, and the correct value is D=0.000125.] In other words:
* Using the existing code, write and submit job scripts to run this on 1, 2, and
  3 nodes with 40, 80, and 120 MPI processes, respectively. Make sure
  you time each run (one way to do this in the code itself is sketched after this list).
* Modify the code in pkkfisher3d.cpp and output.cpp to use OpenMP
  parallelization. Hint: the i-direction is already parallelized over
  the MPI processes; focus on the deeper loop levels (a sketch of what
  this could look like is given after this list).
* Create further job scripts, all for 3 nodes, but running this with
  various hybrid combinations of the sbatch options --ntasks-per-node
  (the number of MPI processes per node) and --cpus-per-task (the number
  of OpenMP threads per process). A small check of the hybrid setup is
  also sketched after this list.
* Note: the output files will become large; do not add them to your repo!
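For the timing in the first item, you can use the time command in the job script, or do the timing in the code itself with MPI_Wtime. A minimal sketch of the latter (the helper report_elapsed is made up and not part of the given code):

    #include <mpi.h>
    #include <iostream>

    // Hypothetical helper: print the wall-clock time elapsed since tstart,
    // where tstart was recorded earlier with MPI_Wtime(). The barrier makes
    // sure no rank is still computing when rank 0 reports.
    void report_elapsed(double tstart)
    {
        MPI_Barrier(MPI_COMM_WORLD);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            std::cout << "Elapsed time: " << MPI_Wtime() - tstart << " s\n";
    }

One would record tstart = MPI_Wtime() right after MPI is initialized and call this just before MPI_Finalize().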
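For the OpenMP part (second item), here is a sketch of what the parallelized loops could look like, reusing the hypothetical names of the time step sketch above; ilo and ihi stand for the local i-bounds of one MPI rank, and the actual loops in pkkfisher3d.cpp will differ:

    #include <rarray>

    // Sketch: same update as before, but with the j and k loops shared
    // among OpenMP threads, while the (MPI-distributed) i loop stays
    // serial within each rank.
    void timestep_omp(rarray<double,3>& u, rarray<double,3>& unew,
                      double dt, double dx, int N, int ilo, int ihi)
    {
        const double fac = dt/(dx*dx);
        // Open the parallel region once, outside the i loop, so the
        // thread team is not re-created at every i iteration.
        #pragma omp parallel
        {
            for (int i = ilo; i < ihi; i++) {
                // collapse(2) merges the j and k loops into a single
                // worksharing loop of (N-2)^2 iterations per i; the
                // implicit barrier at its end keeps the threads in step.
                #pragma omp for collapse(2)
                for (int j = 1; j < N-1; j++)
                    for (int k = 1; k < N-1; k++)
                        unew[i][j][k] = u[i][j][k]
                            + fac*(u[i+1][j][k] + u[i-1][j][k]
                                 + u[i][j+1][k] + u[i][j-1][k]
                                 + u[i][j][k+1] + u[i][j][k-1]
                                 - 6.0*u[i][j][k])
                            + dt*u[i][j][k]*(1.0 - u[i][j][k]);
            }
        }
    }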
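For the hybrid runs (third item), it is easy to end up with a different number of threads per process than intended, e.g., when OMP_NUM_THREADS does not match --cpus-per-task. A small sketch (not part of the given code) of a startup check:

    #include <mpi.h>
    #include <omp.h>
    #include <iostream>

    int main(int argc, char** argv)
    {
        // A hybrid code should initialize MPI with MPI_Init_thread to
        // request thread support; MPI_THREAD_FUNNELED suffices if only
        // the main thread makes MPI calls.
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (rank == 0)
            std::cout << "Running with " << size << " MPI processes and "
                      << omp_get_max_threads() << " OpenMP threads per process\n";
        MPI_Finalize();
    }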
As usual, use git, and comment the code (modifications). Add the job scripts to your repo, as well as the timings of the runs. What gave the most speedup? Append your findings to the README file. Zip up the repo, including the working directory, and submit by April 15th.
- 8 April 2025, 11:22 AM