Solution of the Laplace Equation by the Jacobi Method
| Name | Surname | Student ID | UniTS email | Personal email | Master course |
|---|---|---|---|---|---|
| Piero | Zappi | SM3600004 | [email protected] | [email protected] | SDIC |
This repository is organized with the following structure:
```text
.
├── LICENSE        # MIT License file
├── Makefile       # Makefile for compilation
├── README.md      # This file
├── build.sh       # Bash script for automated compilation
├── images         # Folder containing images for the presentation
├── jacobi.c       # Parallel C code for the Jacobi method
├── plot.plt       # Gnuplot script for plotting the solution
├── plots.ipynb    # Jupyter notebook for plotting the time measurements
├── presentation   # Folder containing the presentation files
├── run.sh         # Bash script for automated execution
└── times          # Folder containing the time measurements
```
This repository contains all the materials and code related to the final project of the course Advanced High Performance Computing of the Scientific and Data-Intensive Computing Master's Degree at the University of Trieste and SISSA.
The project focuses on the solution of the Laplace equation using the Jacobi method. The code is implemented in C and is designed to be compiled and executed on a high-performance computing (HPC) cluster.
The algorithm implemented in the code solves the Laplace equation, which is a second-order partial differential equation, given by:
$$
\nabla^2 \phi = \frac{\partial^2 \phi}{\partial x^2} + \frac{\partial^2 \phi}{\partial y^2} = 0,
$$

where $\phi(x, y)$ is the potential defined on a two-dimensional domain.
Specifically, the code solves the Laplace equation on an evenly spaced grid by applying the Jacobi iteration technique, which employs numerical second derivatives to approximate the solution.
Consider, for example, the steady-state heat equation on a square metal plate, where the source of the heat flow is located at one of the corners of the plate. A two-dimensional grid is created, which divides the plate evenly into square regions. The solution to the equation is then represented as a square matrix, whose entries correspond to the value of the potential at a given spatial point on the grid.
The code iteratively updates the potential values at each grid point using the following formula:
$$
\phi^{(k+1)}_{i,j} = \frac{1}{4} \left( \phi^{(k)}_{i-1,j} + \phi^{(k)}_{i+1,j} + \phi^{(k)}_{i,j-1} + \phi^{(k)}_{i,j+1} \right),
$$

where $\phi^{(k)}_{i,j}$ is the value of the potential at grid point $(i, j)$ at iteration $k$: the new value at each point is the average of its four nearest neighbours at the previous iteration.
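As an illustration, a minimal serial sketch of one such sweep in C is shown below (the grid size and array names are illustrative, not taken from jacobi.c):

```c
/* One serial Jacobi sweep on an n x n grid stored in row-major order.
   Boundary points (first/last row and column) are left untouched. */
void jacobi_sweep(int n, const double *phi_old, double *phi_new)
{
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            /* New value = average of the four nearest neighbours. */
            phi_new[i * n + j] = 0.25 * (phi_old[(i - 1) * n + j] +
                                         phi_old[(i + 1) * n + j] +
                                         phi_old[i * n + (j - 1)] +
                                         phi_old[i * n + (j + 1)]);
}
```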
The code is fully parallel and is designed to be executed on a high-performance computing (HPC) cluster. Specifically, it leverages MPI for distributed-memory parallelism and efficient inter-process communication, and OpenMP for shared-memory parallelism and GPU offloading.
The global matrix is decomposed by rows across MPI processes. Each process is assigned a block of contiguous rows of the global matrix; if the number of rows is not divisible by the number of processes, the first few processes are assigned an additional row to ensure the complete coverage of the domain.
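A minimal sketch of this decomposition is shown below; the function and variable names are illustrative and not necessarily those used in jacobi.c:

```c
/* Block row decomposition: every rank gets n_rows / size rows, and the
   first n_rows % size ranks receive one extra row so that the whole
   domain is covered. */
static void decompose_rows(int n_rows, int rank, int size,
                           int *local_rows, int *first_row)
{
    int base      = n_rows / size;   /* rows assigned to every rank */
    int remainder = n_rows % size;   /* rows left over              */

    *local_rows = base + (rank < remainder ? 1 : 0);
    *first_row  = rank * base + (rank < remainder ? rank : remainder);
}
```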
Within each MPI process, OpenMP is used to parallelize the initialization of the local matrix and the application of boundary conditions, distributing the workload across multiple threads.
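For instance, the threaded initialization of the local block could look like the sketch below (a simplified assumption; the actual boundary conditions applied in jacobi.c are omitted):

```c
/* Zero-initialize the local block of rows, distributing the work
   among the OpenMP threads of the MPI process. */
static void init_local(double *phi, int local_rows, int n_cols)
{
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < local_rows; i++)
        for (int j = 0; j < n_cols; j++)
            phi[i * n_cols + j] = 0.0;   /* interior starts from zero */
}
```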
The computation is performed using OpenMP offloading to the GPU. The initial state of the local matrix is copied to the device memory at the beginning of the computation, while the final result is copied back to the host memory at the end. To do so, the #pragma omp target data map directive is used to create a target data region that maps the local matrix to the device memory.
The computation of each iteration is offloaded to the GPU using the #pragma omp target teams distribute parallel for simd collapse(2) num_teams(108) directive. This directive enables efficient parallelization by distributing the workload across multiple teams of threads on the GPU. Each team is responsible for a subset of the grid, and within each team, threads collaboratively compute the new values for the grid points. The collapse(2) clause ensures that the two nested loops over the grid dimensions are flattened into a single loop, maximizing parallelism and improving performance. The num_teams(108) clause specifies the number of teams to be created, which can be tuned based on the GPU architecture to achieve optimal utilization. Finally, the simd clause enables vectorization of the loop, allowing the compiler to generate SIMD (Single Instruction, Multiple Data) instructions for further performance improvements.
At the end of each iteration, the matrices storing the new and old values are swapped to be used in the next iteration.
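Putting these pieces together, the offloaded iteration loop could be organized roughly as follows. This is a sketch under the assumptions above (buffer names, map clauses and the halo handling are simplified); the offload directive is the one quoted from the code:

```c
/* Offloaded Jacobi iterations on the local block (local_rows interior rows
   plus 2 halo rows of n_cols columns each). Returns the host pointer whose
   buffer holds the final result after the pointer swaps. */
static double *jacobi_gpu(double *phi_old, double *phi_new,
                          int local_rows, int n_cols, int n_iter)
{
    size_t n = (size_t)(local_rows + 2) * n_cols;

    /* Map both buffers once; results are copied back when the region ends. */
    #pragma omp target data map(tofrom: phi_old[0:n]) map(tofrom: phi_new[0:n])
    {
        for (int it = 0; it < n_iter; it++) {
            /* In the full code, the halo rows are refreshed here via the
               RMA exchange described below. */
            #pragma omp target teams distribute parallel for simd collapse(2) num_teams(108)
            for (int i = 1; i <= local_rows; i++)
                for (int j = 1; j < n_cols - 1; j++)
                    phi_new[i * n_cols + j] = 0.25 *
                        (phi_old[(i - 1) * n_cols + j] + phi_old[(i + 1) * n_cols + j] +
                         phi_old[i * n_cols + (j - 1)] + phi_old[i * n_cols + (j + 1)]);

            /* Swap host pointers; both buffers stay mapped on the device,
               so the next target region picks up the right one. */
            double *tmp = phi_old; phi_old = phi_new; phi_new = tmp;
        }
    }
    return phi_old;
}
```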
Each iteration of the Jacobi method requires communication between neighboring MPI processes to exchange boundary values. In particular, to compute the new values each process needs to send its top row to the process above and its bottom row to the process below, as well as receive the top row from the process below and the bottom row from the process above.
Each MPI process allocates two additional rows in its local matrix to store the boundary data exchanged with its neighboring processes. These extra rows are used to hold the top boundary received from the process above and the bottom boundary received from the process below. The first and last processes in the MPI communicator have MPI_PROC_NULL as their top and bottom neighbors, respectively, to safely handle communication boundaries.
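A sketch of this setup, with illustrative names, might look like:

```c
#include <mpi.h>
#include <stdlib.h>

/* Allocate the local block with two extra halo rows (row 0 and row
   local_rows + 1) and determine the neighbouring ranks; the first and
   last ranks get MPI_PROC_NULL so the exchange degenerates safely. */
static double *alloc_local_block(int local_rows, int n_cols,
                                 int rank, int size, int *up, int *down)
{
    *up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    *down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;
    return calloc((size_t)(local_rows + 2) * n_cols, sizeof(double));
}
```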
The communication is performed using one-sided communication (Remote Memory Access, RMA) to reduce the latency and the synchronization overhead associated with the communication. Each process opens two memory windows, one for the top and one for the bottom rows of its local matrix, exposing them to the neighboring processes, which in this way can directly access the data they need without requiring explicit send and receive operations. Specifically, the MPI_Win_create function is used to create the memory windows and the MPI_Get function is used to read the data from the neighboring processes; MPI_Win_fence is used for synchronization.
These RMA calls are issued inside a #pragma omp target data use_device_ptr region to ensure that the data is accessed directly from device memory, avoiding unnecessary data transfers between the host and the device: in this way the RMA accesses operate directly on device pointers, which can significantly improve performance.
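A sketch of one such exchange is shown below. It assumes a CUDA-aware MPI library, uses illustrative names, and creates the windows inside the routine for brevity (in practice they would typically be created once, outside the iteration loop):

```c
#include <mpi.h>

/* Exchange the halo rows of the local block via MPI RMA, operating on the
   device copy of phi (which must already be mapped on the GPU). */
static void exchange_halos(double *phi, int local_rows, int n_cols,
                           int up, int down)
{
    #pragma omp target data use_device_ptr(phi)
    {
        MPI_Win win_top, win_bottom;

        /* Expose the first and last interior rows to the neighbours. */
        MPI_Win_create(&phi[1 * n_cols], n_cols * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win_top);
        MPI_Win_create(&phi[local_rows * n_cols], n_cols * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win_bottom);

        MPI_Win_fence(0, win_top);
        MPI_Win_fence(0, win_bottom);

        /* Top halo (row 0) <- bottom interior row of the rank above;
           bottom halo (row local_rows + 1) <- top interior row of the rank below. */
        MPI_Get(&phi[0], n_cols, MPI_DOUBLE, up, 0, n_cols, MPI_DOUBLE, win_bottom);
        MPI_Get(&phi[(local_rows + 1) * n_cols], n_cols, MPI_DOUBLE, down,
                0, n_cols, MPI_DOUBLE, win_top);

        MPI_Win_fence(0, win_top);
        MPI_Win_fence(0, win_bottom);

        MPI_Win_free(&win_top);
        MPI_Win_free(&win_bottom);
    }
}
```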
After the completion of the iterations, each MPI process writes its local matrix to a binary file (solution.bin) using the MPI_File_write_at_all() function, which enables parallel I/O, with each rank writing its data at the correct offset. The output file contains the solution to the heat equation, which can be visualized with suitable software such as gnuplot.
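A minimal sketch of this step, assuming each rank writes only its interior rows and reusing the first_row index from the decomposition sketch above:

```c
#include <mpi.h>

/* Collectively write the interior rows of the local block to solution.bin,
   each rank at the byte offset of its first global row. */
static void write_solution(const double *phi, int first_row,
                           int local_rows, int n_cols)
{
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "solution.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)first_row * n_cols * sizeof(double);
    /* The interior starts at row 1 (row 0 is a halo row). */
    MPI_File_write_at_all(fh, offset, &phi[n_cols],
                          local_rows * n_cols, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}
```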
The code is designed to be compiled and executed on a high-performance computing (HPC) cluster. Specifically, it has been tested on the CINECA LEONARDO supercomputer, which is equipped with NVIDIA GPUs and supports both MPI and OpenMP parallelization.
The code can be compiled from the root folder of the project using the provided Makefile.
The build.sh bash script is provided for quick and automated compilation on the Booster partition of the CINECA LEONARDO supercomputer.
The code has been executed on the nodes of the Booster partition of the CINECA LEONARDO supercomputer; each node is equipped with 4 NVIDIA Tensor Core GPUs and a single Intel CPU with 32 cores.
The run.sh bash script is provided for automated execution of the code on the Booster partition of the CINECA LEONARDO supercomputer. To run the code, simply use the following command:
```bash
sbatch --nodes <N> run.sh
```

where N is the number of nodes to be used for the execution. The script automatically allocates the requested number of nodes and runs the code on them. Specifically, the number of MPI tasks per node is set to 4, the number of OpenMP threads per MPI task to 8 and the number of GPUs per node to 4. The code runs for 1000 iterations on a matrix of size 10000 x 10000; the output is written to a binary file named solution.bin in the current directory.
The plot.plt script is provided for plotting the solution using gnuplot.
The code has been tested on the CINECA LEONARDO supercomputer, using the Booster partition. Specifically, a performance analysis of the code scaling has been performed, running the code on a matrix of size 10000 x 10000 for 1000 iterations using from 1 to 10 nodes. Time measurements have been taken considering initialization, copy to device, computation and communication times; the results are located in the times/ folder.
The plots.ipynb Jupyter notebook is provided for plotting the results of the performance analysis. The following plots have been generated:
- Total Time vs Number of Nodes: this plot shows the total time taken by the code as a function of the number of nodes used for the execution.
- Time Breakdown: this plot shows the breakdown of the total time into initialization, copy to device, computation and communication times as a function of the number of nodes used for the execution. It shows how the time is distributed among the different phases of the code and how each phase scales with the number of nodes.
- Speedup: this plot shows the speedup of the code as a function of the number of nodes used for the execution. The speedup is defined as the ratio of the time taken by the code on a single node to the time taken on multiple nodes, and indicates how well the code scales with the number of nodes.
- Efficiency: this plot shows the efficiency of the code as a function of the number of nodes used for the execution. The efficiency is defined as the ratio of the speedup to the number of nodes, and indicates how effectively the available resources are being used. Both metrics follow the standard definitions summarized below.
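In symbols, with $T(N)$ denoting the total execution time measured on $N$ nodes:

$$
S(N) = \frac{T(1)}{T(N)}, \qquad E(N) = \frac{S(N)}{N}.
$$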
The plots are saved in the images/ folder.
Distributed under the MIT License. See LICENSE for more information.