PARALiA-GEMMex (Uncut-GEMMs)

This repository provides a high-performance General Matrix Multiplication (GEMM) library for single-node multi-GPU HPC systems. The library uses autotuning to analyze each problem's communication, scheduling, and imbalance characteristics on the fly and create a customized static schedule with optimized data movement, caching, and overlap for that problem (more in our paper).

Software/Compiler requirements

CUDA toolkit 10+ (Latest release tested with 12.x versions on V100 and A100 clusters)
A gcc/g++ compiler compatible with the above CUDA (tested with 11.x, 12.x)
OpenBLAS, installed with the same gcc compiler.
Boost, installed with the same gcc compiler.
CMake minimum version 3.10
numactl
Python 3.x (packages: os, pandas. Additionally for plotting: math, numpy, scipy, matplotlib, seaborn)
The nvbandwidth tool (installation and microbenchmarks are already handled by ./deploy.sh, see below).

Installation

PARALiA-GEMMex installation consists of two required steps and one optional step:

Fill config_system.sh with compiler/library details, module loads etc.
(Optional) Modify CMakeLists.txt to enable/disable optimizations and/or check experimental features.
Run ./deploy.sh on the target system (cross-compilation is not supported because deployment runs microbenchmarks). This script:
- Installs PARALiA-GEMMex.
- Downloads and installs nvbandwidth (if you install it manually, you have to set its path instead).
- Runs short microbenchmarks (a few minutes) to estimate the communication capabilities of the target interconnect.
- Runs tests for PARALiA-GEMMex functions to validate installation correctness (a few minutes).

Usage

After a successful installation, you should have:

${PARALIA_GEMMEX_INSTALL_PREFIX}/lib, which contains the shared .so files (main library: libparalia.so)
${PARALIA_GEMMEX_INSTALL_PREFIX}/include, which contains all header files (main header: PARALiA.hpp)
${PARALIA_GEMMEX_INSTALL_PREFIX}/testing-bin with tests and benchmarks for PARALiA-GEMMex and cuBLASXt routines.

To use PARALiA-GEMMex functions:

Always: source config_system.sh
Compiling: Include the main header PARALiA.hpp in code and -I${PARALIA_GEMMEX_INSTALL_PREFIX}/include during compilation.
Linking: use -L${PARALIA_GEMMEX_INSTALL_PREFIX}/lib -lparalia during linking.
PARALiA BLAS functions accept the usual BLAS parameters, similarly to OpenBLAS, cuBLASXt, etc. (see also the benchmarks below).
- PARALiADgemm(TransA, TransB, M, N, K, alpha, A, ldA, B, ldB, beta, C, ldC)
  - A, B, C can reside in CPU or (any) GPU memory.
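
As a reminder of what these parameters mean, the sketch below is a plain-Python reference for the standard BLAS GEMM operation C = alpha * op(A) * op(B) + beta * C. It assumes column-major storage with leading dimensions, as in the reference BLAS; this README does not state the storage order PARALiA expects, so treat that as an assumption. `ref_dgemm` is a hypothetical helper for illustration only and is not part of the library.

```python
def ref_dgemm(trans_a, trans_b, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc):
    """Reference C = alpha*op(A)*op(B) + beta*C on flat column-major lists.

    Hypothetical helper, not part of PARALiA. Assumes column-major storage:
    element (i, j) of a matrix X with leading dimension ldx is x[i + j*ldx].
    """
    def op(x, ldx, trans, i, j):
        # op(X)[i, j] is X[i, j] for 'N' and X[j, i] for 'T'.
        return x[i + j * ldx] if trans == "N" else x[j + i * ldx]

    for j in range(n):
        for i in range(m):
            acc = 0.0
            for p in range(k):
                acc += op(a, lda, trans_a, i, p) * op(b, ldb, trans_b, p, j)
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc]
    return c


# 2x2 example: A = [[1, 2], [3, 4]] stored column-major, B = identity, C = ones.
A = [1.0, 3.0, 2.0, 4.0]
B = [1.0, 0.0, 0.0, 1.0]
C = [1.0, 1.0, 1.0, 1.0]
ref_dgemm("N", "N", 2, 2, 2, 2.0, A, 2, B, 2, 0.5, C, 2)  # C = 2*A + 0.5*C
```

A reference like this is only practical for tiny sizes, but it can be handy for checking the argument order and leading dimensions before calling the library on real data.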

Prebuilt Benchmarks

${PARALIA_GEMMEX_INSTALL_PREFIX}/testing-bin contains PARALiA-GEMMex runners for double and single precision
- Also half, but still experimental (validators not updated for low precision).
Usage: ./[s,d]gemm_runner dev_num dev_ids T cache_max_size TransA TransB alpha beta D1 D2 D3 loc1 loc2 loc3 outloc
- Control parameters (you should leave all these to -1 unless you know what you are doing):
  - dev_num: The number of GPUs for the benchmark. Use all system GPUs if < 0.
  - dev_ids: A list of the GPUs used for execution. Ignored if dev_num < 0.
    - Input form example: 0101 for devices = [0,2], 1111 for devices = [0,1,2,3] etc.
  - T: The internal tiling size. Problem-tailored automatic tiling size selection if < 0.
  - cache_max_size: The maximum cache size that can be allocated in each GPU. Defined automatically if < 0.
- Routine input parameters:
  - TransA, TransB: N or T. The transpose parameter used for GEMM routine invocation for the A,B matrices, respectively.
  - alpha, beta: The GEMM constants for the routine invocation.
  - D1 D2 D3: D1 = M, D2 = N, D3 = K for the routine invocation.
- Data placement parameters:
  - loc1 loc2 loc3: The locations of the A, B and C matrices in memory. Accepted values:
    - loc = 0 to (system_gpu_num - 1): Matrix initially in the corresponding GPU memory (order = nvidia-smi).
    - loc = system_gpu_num: Matrix in pinned host memory (NUMA-unaware, not advised).
    - loc = system_gpu_num + 1: Matrix in pinned NUMA-interleaved host memory (advised for host allocations).
  - outloc: The output location of the C matrix. Currently always set to loc3.
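
The parameter encoding above can be sketched in a few lines of Python. The helpers below (`decode_dev_ids`, `describe_loc`, `runner_command`) are hypothetical and not shipped with the repository. The bit order in `decode_dev_ids` is inferred from the example `0101` for devices [0,2], which implies that the rightmost character stands for device 0; the default `alpha` and `beta` values are arbitrary choices for illustration.

```python
def decode_dev_ids(mask):
    """Decode a dev_ids string such as '0101' into a list of GPU ids.

    Assumption inferred from the README example ('0101' -> [0, 2]):
    the rightmost character corresponds to device 0.
    """
    return [i for i, bit in enumerate(reversed(mask)) if bit == "1"]


def describe_loc(loc, system_gpu_num):
    """Map a loc value to a readable placement, following the list above."""
    if 0 <= loc < system_gpu_num:
        return f"GPU {loc}"
    if loc == system_gpu_num:
        return "pinned host memory (NUMA-unaware)"
    if loc == system_gpu_num + 1:
        return "pinned NUMA-interleaved host memory"
    raise ValueError(f"invalid loc {loc} for {system_gpu_num} GPUs")


def runner_command(precision, m, n, k, loc_a, loc_b, loc_c,
                   trans_a="N", trans_b="N", alpha=1.0, beta=1.0):
    """Build the argument list for ./[s,d]gemm_runner.

    All control parameters are left at -1 (automatic) and outloc is set
    equal to loc3, as the README advises.
    """
    control = [-1, -1, -1, -1]  # dev_num, dev_ids, T, cache_max_size
    args = control + [trans_a, trans_b, alpha, beta, m, n, k,
                      loc_a, loc_b, loc_c, loc_c]
    return [f"./{precision}gemm_runner"] + [str(x) for x in args]


# Example: double precision on a 4-GPU node, all three matrices in
# NUMA-interleaved host memory (loc = system_gpu_num + 1 = 5).
cmd = runner_command("d", 8192, 8192, 8192, 5, 5, 5)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` from inside `${PARALIA_GEMMEX_INSTALL_PREFIX}/testing-bin` after sourcing config_system.sh.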

Contact

For questions, issues, or collaboration inquiries, please reach out via email: panastas@cslab.ece.ntua.gr

Licence

This work is open-source and distributed under a GPL-3.0 license.

Related Publications/Citations:

Main PARALiA-GEMMex implementation: Uncut-GEMMs: Communication-Aware Matrix Multiplication on Multi-GPU Nodes
Scheduling & autotuning: PARALiA: A Performance Aware Runtime for Auto-tuning Linear Algebra on heterogeneous systems
Overlap modeling & T selection: CoCoPeLia: Communication-Computation Overlap Prediction for Efficient Linear Algebra on GPUs
