This repository provides a high-performance General Matrix Multiplication (GEMM) library for single-node multi-GPU HPC systems. The library uses autotuning to analyze each problem's communication, scheduling, and imbalance characteristics on the fly and create a customized static schedule with optimized data movement, caching, and overlap for that problem (more in our paper).
- CUDA toolkit 10+ (Latest release tested with 12.x versions on V100 and A100 clusters)
- A gcc/g++ compiler compatible with the above CUDA (tested with 11.x, 12.x)
- OpenBLAS, installed with the same gcc compiler.
- Boost, installed with the same gcc compiler.
- CMake minimum version 3.10
- numactl
- Python 3.x (packages: os, pandas. Additionally for plotting: math, numpy, scipy, matplotlib, seaborn)
- The nvbandwidth tool (installation & microbenchmarks are already included in
./deploy.sh, see below).
PARALiA-GEMMex installation consists of two mandatory steps (plus one optional):
- Fill `config_system.sh` with compiler/library details, module loads, etc.
- (Optional) Modify `CMakeLists.txt` to enable/disable optimizations and/or try experimental features.
- Run `./deploy.sh` on the target system (cross-compilation is not supported due to the deployment microbenchmarks). This script:
  - Installs PARALiA-GEMMex.
  - Downloads and installs nvbandwidth (if done manually, you must set its path instead).
  - Performs some fast microbenchmarks (~minutes) to estimate the communication capabilities of the target interconnect.
  - Runs tests for PARALiA-GEMMex functions to validate installation correctness (~minutes).
After a successful installation, you should have:
- `${PARALIA_GEMMEX_INSTALL_PREFIX}/lib`, containing the shared `.so` files (main library: `libparalia.so`).
- `${PARALIA_GEMMEX_INSTALL_PREFIX}/include`, containing all header files (main header: `PARALiA.hpp`).
- `${PARALIA_GEMMEX_INSTALL_PREFIX}/testing-bin`, containing tests and benchmarks for PARALiA-GEMMex and cuBLASXt routines.
To use PARALiA-GEMMex functions:
- Always: `source config_system.sh` first.
- Compiling: include the main header `PARALiA.hpp` in your code and add `-I${PARALIA_GEMMEX_INSTALL_PREFIX}/include` during compilation.
- Linking: use `-L${PARALIA_GEMMEX_INSTALL_PREFIX}/lib -lparalia` during linking.
- PARALiA BLAS functions accept the usual BLAS parameters, similarly to OpenBLAS/cuBLASXt etc. (also see the benchmarks below):
  - `PARALiADgemm(TransA, TransB, M, N, K, alpha, A, ldA, B, ldB, beta, C, ldC)`
  - A, B and C can reside in CPU or (any) GPU memory.
`${PARALIA_GEMMEX_INSTALL_PREFIX}/testing-bin` contains PARALiA-GEMMex runners for double and single precision (also half, but still experimental; the validators are not updated for low precision).
- Usage: `./[s,d]gemm_runner dev_num dev_ids T cache_max_size TransA TransB alpha beta D1 D2 D3 loc1 loc2 loc3 outloc`
- Control parameters (leave all of these at -1 unless you know what you are doing):
  - `dev_num`: The number of GPUs for the benchmark. All system GPUs are used if < 0.
  - `dev_ids`: A list of the GPUs used for execution. Ignored if `dev_num` < 0. Input form example: `0101` for devices = [0,2], `1111` for devices = [0,1,2,3], etc.
  - `T`: The internal tiling size. Problem-tailored automatic tiling-size selection if < 0.
  - `cache_max_size`: The maximum cache size that can be allocated on each GPU. Defined automatically if < 0.
- Routine input parameters:
  - `TransA, TransB`: N or T. The transpose parameters used for the GEMM routine invocation for the A and B matrices, respectively.
  - `alpha, beta`: The GEMM constants for the routine invocation.
  - `D1 D2 D3`: D1 = M, D2 = N, D3 = K for the routine invocation.
- Data placement parameters:
  - `loc1 loc2 loc3`: The locations of the A, B and C matrices in memory. Input form:
    - `loc` = 0 to (system_gpu_num - 1): matrix initially on the corresponding GPU memory (order = nvidia-smi).
    - `loc` = system_gpu_num: matrix on pinned host memory (NUMA-unaware, not advised).
    - `loc` = system_gpu_num + 1: matrix on pinned NUMA-interleaved host memory (advised for host allocations).
  - `outloc`: The output location of the C matrix. Currently always set to `loc3`.
For questions, issues, or collaboration inquiries, please reach out via email: panastas@cslab.ece.ntua.gr
- This work is open-source and distributed under the GPL-3.0 license.
- Main PARALiA-GEMMex implementation: Uncut-GEMMs: Communication-Aware Matrix Multiplication on Multi-GPU Nodes
- Scheduling & autotuning: PARALiA: A Performance Aware Runtime for Auto-tuning Linear Algebra on heterogeneous systems
- Overlap modeling & T selection: CoCoPeLia: Communication-Computation Overlap Prediction for Efficient Linear Algebra on GPUs