PanguLU is an open source software package for solving a linear system Ax = b on heterogeneous distributed platforms. The library is written in C and exploits parallelism through MPI, OpenMP and CUDA. The sparse LU factorisation algorithm used in PanguLU splits the sparse matrix into multiple equally-sized sparse matrix blocks and computes on them using sparse BLAS. The latest version of PanguLU uses a synchronisation-free communication strategy to reduce the overall latency overhead, and adaptively selects among a variety of block-wise sparse BLAS methods to improve efficiency on CPUs and GPUs. Currently, PanguLU supports both single and double precision, and both real and complex values. In addition, our team at the SSSLab is constantly optimising and updating PanguLU.
PanguLU/README instructions on installation
PanguLU/src C and CUDA source code, to be compiled into libpangulu.a and libpangulu.so
PanguLU/examples example code
PanguLU/include contains the header file pangulu.h
PanguLU/lib contains the libraries libpangulu.a and libpangulu.so
PanguLU/reordering_omp Parallel graph partitioning and fill-reducing matrix ordering
PanguLU/Makefile top-level Makefile that does installation and testing
PanguLU/make.inc compiler and compiler flags, included in all Makefiles (except examples/Makefile)
"make" is an automatic build tool and is required to build PanguLU. It is available on most GNU/Linux distributions; you can install it using a package manager such as apt or yum.
PanguLU requires an MPI library; install one together with its header files. Tested MPI library: MPICH 4.3.0.
If GPUs are used, CUDA is required. Tested version: CUDA 12.9.
A BLAS library is required if the CPU takes part in the algebraic computation of numeric factorisation. Tested version: OpenBLAS 0.3.26.
Search for /path/to in make.inc and replace each occurrence with the actual path on your computer.
The Makefile of the example code does not include make.inc. Search for /path/to in examples/Makefile and replace each occurrence with the actual path on your computer.
GPU is enabled by default. If you want to disable GPU, you should:
- Remove GPU_CUDA from build_list.csv;
- Remove -DGPU_OPEN from PANGULU_FLAGS. You can find PANGULU_FLAGS in make.inc;
- Comment out LINK_CUDA in examples/Makefile.
To enable GPU again, reverse these steps.
Make sure the working directory of your terminal is the root directory of PanguLU when you run make. If PanguLU was successfully built, you will find libpangulu.a and libpangulu.so in the lib directory, and pangulu_example.elf in the examples directory.
PANGULU_FLAGS influences build behaviour. You can edit PANGULU_FLAGS in make.inc to enable different features of PanguLU. The available flags are:
Use -DGPU_OPEN to use GPU; omit it to run without GPU. Note that setting this flag is not the only thing required to use GPU. Please check Step 7 in the Installation part (the step on enabling and disabling GPU).
Use -DCALCULATE_TYPE_R64 (double-precision real), -DCALCULATE_TYPE_CR64 (double-precision complex), -DCALCULATE_TYPE_R32 (single-precision real) or -DCALCULATE_TYPE_CR32 (single-precision complex) to select the value type. The same option must also be added to the compilation command in examples/Makefile.
Use -DPANGULU_MC64 to enable the MC64 algorithm. Note that MC64 is not supported when matrix entries are complex numbers: if a complex value type is selected and the -DPANGULU_MC64 flag is used, MC64 is not enabled.
Use -DMETIS if you want to use METIS; otherwise, our parallel reordering algorithm is used by default.
Select at most one of these flags: -DPANGULU_LOG_INFO, -DPANGULU_LOG_WARNING or -DPANGULU_LOG_ERROR. Log level "INFO" prints all messages to standard output (including warnings and errors). Log level "WARNING" prints only warnings and errors. Log level "ERROR" prints only fatal errors that cause PanguLU to terminate abnormally.
Use -DPANGULU_PERF to output additional performance information, such as kernel time per GPU and GFLOPS of numeric factorisation. Note that this slows down numeric factorisation.
Hyper-threading is not recommended. If you cannot turn off hyper-threading and each core of your CPU has 2 threads, using -DHT_IS_OPEN may yield a performance gain.
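The precision flags have to match between the library build and the code that calls it, which is why the same option is repeated in examples/Makefile. The sketch below shows one way a header could turn such flags into a value type, and how the MC64 restriction on complex values could be expressed. It is an illustration under assumptions, not PanguLU's actual header: demo_value_t, demo_value_bytes, demo_value_is_complex and demo_mc64_enabled are hypothetical names, and defaulting to double when no flag is given is a choice made only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of flag-driven type selection (not PanguLU's header). */
#if defined(CALCULATE_TYPE_CR64)
typedef double _Complex demo_value_t;
#define DEMO_VALUE_IS_COMPLEX 1
#elif defined(CALCULATE_TYPE_CR32)
typedef float _Complex demo_value_t;
#define DEMO_VALUE_IS_COMPLEX 1
#elif defined(CALCULATE_TYPE_R32)
typedef float demo_value_t;
#define DEMO_VALUE_IS_COMPLEX 0
#else /* CALCULATE_TYPE_R64, or no flag: default assumed for this sketch */
typedef double demo_value_t;
#define DEMO_VALUE_IS_COMPLEX 0
#endif

/* Size in bytes of one matrix entry for the selected value type. */
size_t demo_value_bytes(void)
{
    return sizeof(demo_value_t);
}

/* 1 if the selected value type is complex, 0 if it is real. */
int demo_value_is_complex(void)
{
    return DEMO_VALUE_IS_COMPLEX;
}

/* MC64 is only usable with real values, so the flag is ignored otherwise. */
int demo_mc64_enabled(void)
{
#if defined(PANGULU_MC64)
    return !DEMO_VALUE_IS_COMPLEX;
#else
    return 0;
#endif
}
```

If the library were built with one of these flags and the caller with another, the two sides would disagree on the size of each entry in the value and right-hand side arrays, so buffers passed across the interface would be misread.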
To make it easier to call PanguLU in your software, PanguLU provides the following function interfaces:
void pangulu_init(
    sparse_index_t pangulu_n, // Specifies the number of rows in the CSC matrix.
    sparse_pointer_t pangulu_nnz, // Specifies the total number of non-zero elements in the CSC matrix.
    sparse_pointer_t *csc_colptr, // Points to an array that stores pointers to columns of the CSC matrix.
    sparse_index_t *csc_rowidx, // Points to an array that stores indices to rows of the CSC matrix.
    sparse_value_t *csc_value, // Points to an array that stores the values of non-zero elements of the CSC matrix.
    pangulu_init_options *init_options, // Pointer to a pangulu_init_options structure containing initialisation parameters for the solver.
    void **pangulu_handle // On return, contains a handle pointer to the library's internal state.
);
void pangulu_gstrf(
    pangulu_gstrf_options *gstrf_options, // Pointer to a pangulu_gstrf_options structure.
    void **pangulu_handle // Pointer to the library internal state handle returned on initialisation.
);
void pangulu_gstrs(
    sparse_value_t *rhs, // Pointer to the right-hand side vector.
    pangulu_gstrs_options *gstrs_options, // Pointer to a pangulu_gstrs_options structure.
    void **pangulu_handle // Pointer to the library internal state handle returned on initialisation.
);
void pangulu_gssv(
    sparse_value_t *rhs, // Pointer to the right-hand side vector.
    pangulu_gstrf_options *gstrf_options, // Pointer to a pangulu_gstrf_options structure.
    pangulu_gstrs_options *gstrs_options, // Pointer to a pangulu_gstrs_options structure.
    void **pangulu_handle // Pointer to the library internal state handle returned on initialisation.
);
void pangulu_finalize(
    void **pangulu_handle // Pointer to the library internal state handle returned on initialisation.
);
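pangulu_init() takes the matrix in compressed sparse column (CSC) form. As a minimal sketch of what the three arrays contain, the helper below converts a small dense matrix into CSC. The name dense_to_csc, the plain long/int/double element types and the zero-based indexing are assumptions made for illustration; in PanguLU the element types are sparse_pointer_t, sparse_index_t and sparse_value_t.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of PanguLU): convert an n x n dense matrix,
 * stored column by column, into CSC arrays with zero-based indices.
 * colptr must hold n + 1 entries; rowidx and value must hold at least as
 * many entries as the matrix has non-zeros. Returns the non-zero count. */
long dense_to_csc(int n, const double *dense,
                  long *colptr, int *rowidx, double *value)
{
    assert(n >= 0 && colptr != NULL);
    long nnz = 0;
    for (int col = 0; col < n; col++) {
        colptr[col] = nnz;                  /* where this column starts */
        for (int row = 0; row < n; row++) {
            double v = dense[(size_t)col * (size_t)n + (size_t)row];
            if (v != 0.0) {
                rowidx[nnz] = row;          /* row index of this entry */
                value[nnz] = v;             /* numerical value of this entry */
                nnz++;
            }
        }
    }
    colptr[n] = nnz;                        /* one past the last entry */
    return nnz;
}
```

For the 3 x 3 matrix with rows (4 0 1), (0 3 0), (2 0 5), the helper produces colptr = {0, 2, 3, 5}, rowidx = {0, 2, 1, 0, 2} and value = {4, 2, 3, 1, 5}, with 5 non-zeros in total.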
example.c is a sample program that calls PanguLU; you can refer to this file when adding PanguLU calls to your own code. You should first create the distributed matrix using pangulu_init(). If you need to solve for multiple right-hand side vectors while the matrix is unchanged, you can call pangulu_gstrs() multiple times after calling pangulu_gstrf(). If you need to factorise a number of different matrices, call pangulu_finalize() after completing the solution of one matrix, and then use pangulu_init() to initialise the next matrix.
The test routines are placed in the examples directory. The routine in examples/example.c first calls pangulu_gstrf() to perform LU factorisation, and then calls pangulu_gstrs() to solve the linear system.
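After a factorise-and-solve run, a common sanity check is to measure how well the computed solution satisfies Ax = b. The sketch below computes the largest absolute entry of b - Ax for a matrix in CSC form. It is independent of PanguLU: csc_residual_inf is a hypothetical helper, real double values and zero-based indices are assumed, and the caller is assumed to have kept a copy of the original right-hand side b alongside the solution x.

```c
#include <assert.h>

/* Hypothetical helper (not part of PanguLU): infinity norm of b - A*x,
 * where A is an n x n CSC matrix with zero-based indices.
 * work is caller-provided scratch space of length n. */
double csc_residual_inf(int n, const long *colptr, const int *rowidx,
                        const double *value, const double *x,
                        const double *b, double *work)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        work[i] = b[i];                          /* start from b */
    for (int col = 0; col < n; col++)            /* subtract A*x column by column */
        for (long k = colptr[col]; k < colptr[col + 1]; k++)
            work[rowidx[k]] -= value[k] * x[col];
    double worst = 0.0;
    for (int i = 0; i < n; i++) {
        double a = work[i] < 0.0 ? -work[i] : work[i];
        if (a > worst)
            worst = a;                           /* keep the largest magnitude */
    }
    return worst;
}
```

A result close to the rounding level of the chosen precision, relative to the size of the entries of b, suggests the solve succeeded; a large value points to a problem with the input or the factorisation.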
mpirun -np process_count ./pangulu_example.elf -nb block_size -f path_to_mtx -r path_to_rhs
process_count: number of MPI processes used to launch PanguLU;
block_size: order (number of rows and columns) of each non-zero block;
path_to_mtx: path to the matrix file in mtx format;
path_to_rhs: path to the right-hand side vector file (optional).
You can also use run.sh:
bash run.sh path_to_mtx block_size process_count
For example:
mpirun -np 4 ./pangulu_example.elf -nb 10 -f Trefethen_20b.mtx
or use run.sh:
bash run.sh Trefethen_20b.mtx 10 4
In this example, 4 processes are used, the block_size is 10, and the matrix file is Trefethen_20b.mtx.
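To get a feel for what block_size controls, the sketch below computes how many blocks an n x n matrix is cut into per dimension and which block a given row or column index falls in. This is only arithmetic on the description above, not PanguLU's internal layout: blocks_per_dim and block_of are hypothetical names, indices are zero-based, and it is assumed that a trailing partial block still counts as one block.

```c
#include <assert.h>

/* Hypothetical helper: number of block rows (and block columns) when an
 * n x n matrix is cut into nb x nb blocks, rounding up for the last one. */
int blocks_per_dim(int n, int nb)
{
    assert(n > 0 && nb > 0);
    return (n + nb - 1) / nb;
}

/* Hypothetical helper: block coordinate of a zero-based row or column index. */
int block_of(int index, int nb)
{
    assert(index >= 0 && nb > 0);
    return index / nb;
}
```

For a hypothetical 25 x 25 matrix with block_size 10, this gives 3 blocks per dimension, and the entry at row 12, column 24 falls in block (1, 2). A smaller block_size produces more, finer blocks; a larger one produces fewer, coarser blocks.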
- Added a task aggregator to increase numeric factorisation performance;
- Optimised performance of preprocessing phase;
- Added parallel reordering algorithm on CPU as the default reordering algorithm;
- Optimised GPU memory layout to reduce GPU memory usage.
- Optimised memory usage of numeric factorisation and solving;
- Added parallel building support.
- Optimised user interfaces of solver routines;
- Optimised performance of numeric factorisation phase on CPU platform;
- Added support on complex matrix solving;
- Optimised preprocessing performance;
- Updated the pre-processing phase with OpenMP.
- Updated the compilation method of PanguLU to compile libpangulu.so and libpangulu.a at the same time.
- Updated timing for the reorder phase, the symbolic factorisation phase, the pre-processing phase.
- Added GFLOPS for the numeric factorisation phase.
- Used adaptive selection sparse BLAS in the numeric factorisation phase.
- Added the reorder phase.
- Added the symbolic factorisation phase.
- Added MC64 sorting algorithm in the reorder phase.
- Added interface for 64-bit METIS package in the reorder phase.
- Used a synchronisation-free scheduling strategy in the numeric factorisation phase.
- Updated the MPI communication method in the numeric factorisation phase.
- Added single precision in the numeric factorisation phase.
- Used a rule-based 2D LU factorisation scheduling strategy.
- Used Sparse BLAS for floating point calculations on GPUs.
- Added the pre-processing phase.
- Added the numeric factorisation phase.
- Added the triangular solve phase.
- [1] Xu Fu, Bingbin Zhang, Tengcheng Wang, Wenhao Li, Yuechen Lu, Enxin Yi, Jianqi Zhao, Xiaohan Geng, Fangying Li, Jingwen Zhang, Zhou Jin, Weifeng Liu. PanguLU: A Scalable Regular Two-Dimensional Block-Cyclic Sparse Direct Solver on Distributed Heterogeneous Systems. 36th ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '23). 2023.