# MAD-HiSpMV: MAtrix-Adaptive Design for a Highly Imbalanced SpMV Accelerator (with GeMV Support) on HBM-based FPGAs
MAD-HiSpMV is a high-performance FPGA accelerator for Sparse Matrix–Vector Multiplication (SpMV) with optional dense overlay for GeMV support. It builds on our previous HiSpMV work with several key enhancements:
- Scalable HBM support: Multiple HBM channels are used to load input vectors and matrices efficiently.
- Hybrid Row Distribution Network: Routes PE outputs to dedicated y_Ax handlers for accumulation, balancing workload.
- Adder Chain Groups (ACG): Optional pre-addition of multiplication results, avoiding RAW dependencies in output accumulation and reducing pipeline stalls.
- Dense Overlay Support: Allows a single kernel to handle both SpMV and GeMV for mixed sparse-dense workloads.
Data Flow Summary:
- Sparse matrix `A` and input vector `x` are streamed from HBM channels to PEGs.
- PEs multiply nonzero elements of `A` with the corresponding entries of `x`.
- Results are routed through the hybrid row distribution network to the correct y_Ax handlers.
- Optional adder chains pre-accumulate results before final accumulation.
- Final output `y` is streamed back to HBM.
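The accumulation idea above can be sketched in software. This is an illustrative Python model only: `chain_len`, the flush policy, and the function names are assumptions for exposition, not the actual RTL behavior.

```python
# Illustrative software model of the pre-accumulation idea behind the adder
# chains (names and grouping are assumptions, not the hardware design).

def spmv_stream(nnz_stream, x, num_rows, chain_len=4):
    """nnz_stream: iterable of (row, col, value) nonzeros."""
    y = [0.0] * num_rows
    chain = []  # pending (row, product) pairs, like an adder chain group

    def flush():
        # Pre-accumulate products that target the same row before committing
        # to y; in hardware this reduces read-modify-write (RAW) hits on the
        # output buffer and the pipeline stalls they cause.
        partial = {}
        for row, p in chain:
            partial[row] = partial.get(row, 0.0) + p
        for row, s in partial.items():
            y[row] += s
        chain.clear()

    for row, col, val in nnz_stream:
        chain.append((row, val * x[col]))  # PE multiply stage
        if len(chain) == chain_len:
            flush()
    flush()
    return y
```

For example, streaming the nonzeros of a 2x3 matrix through this model yields the same result as a plain matrix-vector product.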
⚠️ Note: The PASTA+AutoBridge repo is private until publication. Please request access if needed.
1. Create and activate a Conda environment, then install PASTA following its instructions.
2. Clone this repository and set up the environment:
   ```sh
   load_vitis23
   source miniconda3/bin/activate your_conda_env
   cd HiSpMV
   source setup
   cd -
   export CONDA_LOC=$(pwd)/miniconda3
   ```
   - `load_vitis23`: loads the Vitis HLS & XRT path variables.
   - `setup`: sets the required environment variables for MAD-HiSpMV.
3. Install Python dependencies:
   ```sh
   pip install -r requirements.txt
   ```
4. Download the benchmarking matrices:
   ```sh
   python get_tb_matrices.py
   ```
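As a quick sanity check on a downloaded matrix, the size line of a coordinate-format Matrix Market (`.mtx`) file can be read with a few lines of Python. This is a dependency-free sketch with a hypothetical helper name (`mtx_header`); real code should prefer `scipy.io.mmread`.

```python
# Minimal sanity check for a coordinate-format Matrix Market (.mtx) file:
# skip the banner/comment lines, then read "rows cols nnz".

def mtx_header(path):
    """Return (rows, cols, nnz) from the first non-comment line."""
    with open(path) as f:
        for line in f:
            if line.startswith('%'):
                continue  # %%MatrixMarket banner and % comments
            rows, cols, nnz = map(int, line.split()[:3])
            return rows, cols, nnz
    raise ValueError("no size line found")
```

For example, `mtx_header("matrices/poli_large/poli_large.mtx")` would report the dimensions and nonzero count of that benchmark matrix.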
Repository structure:

```
apps                 # Python apps: run SpMV/GeMV + sample DNN model
automation_tool      # Scripts to auto-generate accelerator configs (matrix-adaptive)
builds               # Source code + xclbin for U280/U50 configs, usage reports, floorplans
common               # Common host + kernel source code
cpu                  # CPU benchmarking (Intel MKL SpMV/GeMV + power measurement)
gpu                  # GPU benchmarking (cuSPARSE SpMV + power measurement)
matrices             # Storage for benchmarking matrices (downloaded by script)
pyhispmv             # pybind11 wrapper to invoke FPGA kernels via XRT
get_tb_matrices.py   # Script to fetch test/benchmarking matrices
requirements.txt     # Python dependencies
setup                # Environment setup script
README.md            # Project documentation
```
1. Build the `pyhispmv` package:
   ```sh
   cd pyhispmv
   python setup.py build_ext --inplace
   cd ..
   ```
2. Run the SpMV/GeMV tests:
   - General test (no arguments):
     ```sh
     cd apps
     python general_test.py
     ```
   - DNN model test (configurable):
     ```sh
     cd apps
     python model_test.py \
       --batch_size 1 \
       --input_size 4096 \
       --hidden_size_1 8192 \
       --hidden_size_2 8192 \
       --output_size 1024 \
       --density1 0.1 \
       --density2 0.25
     ```
   - Note on device selection: both scripts require setting `device_id` (the FPGA index). To list available devices, run `xbutil examine`, then update `device_id` in the scripts to match the U280 board.
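For intuition, the DNN test's parameters suggest a small MLP whose hidden layers use sparse weight matrices with the given densities. The sketch below is a conceptual pure-Python model; all names are hypothetical, and the real model lives in `apps/model_test.py`, which offloads its layers to the FPGA kernel.

```python
# Conceptual model of what the DNN test exercises (assumed, for exposition):
# two sparse hidden layers (SpMV work) feeding a dense output layer (GeMV work).
import random

def random_sparse(rows, cols, density):
    """Row-major matrix with roughly `density` fraction of nonzeros."""
    return [[random.random() if random.random() < density else 0.0
             for _ in range(cols)] for _ in range(rows)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def relu(v):
    return [max(0.0, e) for e in v]

def forward(x, W1, W2, W3):
    h1 = relu(matvec(W1, x))   # hidden layer 1 (sparse, e.g. density1)
    h2 = relu(matvec(W2, h1))  # hidden layer 2 (sparse, e.g. density2)
    return matvec(W3, h2)      # dense output layer
```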
-
cd cpu
make clean all
./run_spmv.sh # Run SpMV benchmarks
./run_gemv.sh # Run GeMV benchmarks
cd gpu
make clean all
./run_all.sh # Run all SpMV benchmarks
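When comparing CPU, GPU, and FPGA numbers, SpMV throughput is conventionally reported as `2 * nnz` floating-point operations (one multiply and one add per nonzero) divided by runtime. A small helper assuming that convention (the benchmark scripts may report metrics differently):

```python
# Conventional SpMV throughput metric: 2 FLOPs per nonzero / elapsed time.
def spmv_gflops(nnz, seconds):
    """GFLOP/s for an SpMV over a matrix with `nnz` nonzeros."""
    return 2.0 * nnz / seconds / 1e9
```

For example, a matrix with one million nonzeros processed in 1 ms corresponds to 2.0 GFLOP/s.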
The automation tool generates accelerator configurations either automatically (matrix-adaptive) or manually (with explicit parameters).

Automatic mode: `automation_tool/src/main.py` analyzes the input matrix and automatically chooses optimal parameters such as HBM channel usage and optimizations.

Command:

```sh
cd automation_tool/src
python main.py <build_dir> --device {U50|U280|V80} [--matrices <file_or_dir>] [--dense-overlay]
```
Arguments:
- `build_dir` (positional): Path to the build directory.
- `--device`: Target device (`U50`, `U280`, or `V80`) [required].
- `--matrices`: Path to a matrix file or a directory containing matrices.
- `--dense-overlay`: Enable dense overlay mode (SpMV kernel with GeMV support).
- In normal mode (without `--dense-overlay`), the tool uses the input matrix to tailor the accelerator design.
- In dense overlay mode, the design is not tailored to the input sparse matrix, and the `--matrices` argument is ignored. The generated kernel supports both SpMV and GeMV for mixed workloads.
Examples:
- Generate an SpMV design for U280 with a matrix directory:
  ```sh
  python main.py ../../builds --device U280 --matrices ../matrices/
  ```
- Generate an SpMV+GeMV hybrid design for U50 (no matrices needed):
  ```sh
  python main.py ../../builds --device U50 --dense-overlay
  ```
Manual mode: `automation_tool/src/rsc/spmvcodegen.py` provides fine-grained control over accelerator parameters instead of relying on automation.

Command:

```sh
cd automation_tool/src/
python spmvcodegen.py <output_dir> --device {U50|U280} [options]
```
Arguments:
- `output_dir`: Path to the output directory.
- `--device`: Target FPGA device (`U50` or `U280`) [required].
- `--num-ch-A`: Number of HBM channels for the sparse matrix A (default: 16).
- `--num-ch-x`: Number of HBM channels for the input vector x (default: 1).
- `--num-ch-y`: Number of HBM channels for the output vector y (default: 1).
- `--ch-width`: Width of the HBM channels in bits (default: 512).
- `--urams-per-pe`: URAM banks per PE for output accumulation (default: 2).
- `--dense-overlay`: Enable dense overlay for GeMV support.
- `--pre-accumulator`: Enable the pre-accumulator optimization.
- `--row-dist-net`: Enable the row distribution network.
- `--high-freq`: Build hardware for a 400 MHz kernel clock.
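For a rough sense of what the channel options imply, peak streaming bandwidth scales with channel count, channel width, and kernel clock. A back-of-envelope sketch (illustrative arithmetic only; sustained HBM bandwidth is lower in practice):

```python
# Peak streaming bandwidth implied by the channel options, assuming every
# channel delivers ch_width bits per kernel clock cycle (an upper bound).
def peak_bw_gbps(num_channels, ch_width_bits, clock_mhz):
    """Aggregate bandwidth in GB/s."""
    return num_channels * (ch_width_bits / 8) * clock_mhz * 1e6 / 1e9
```

With the defaults (16 channels of 512 bits) at the `--high-freq` 400 MHz clock, this upper bound is 409.6 GB/s for the matrix stream alone.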
Example (small dense-overlay design):

```sh
python ../../automation_tool/src/spmvcodegen.py ../ --device U280 \
  --num-ch-A 4 --num-ch-x 1 --num-ch-y 1 --urams-per-pe 1 --row-dist-net --dense-overlay
```

Example log output:

```
20250822:204011 [INFO] Resource: FPGAResource(bram=128, uram=32, dsp=613, lut=134724, reg=135873)
20250822:204011 [INFO] Successfully Generated Code at ../Dense-HI-SpMV-4-1-1
```
1. Navigate to the generated design directory. The script automatically names the directory with configuration info:
   ```sh
   cd ../Dense-HI-SpMV-4-1-1
   ```
2. Build the host code:
   ```sh
   make host
   ```
3. Run C simulation (HLS source code):
   - Sparse matrix input (SpMV):
     ```sh
     ./spmv-host ../../matrices/poli_large/poli_large.mtx
     ```
   - Dense matrix input (dense overlay / GeMV):
     ```sh
     ./spmv-host 512 512
     ```
     where `512 512` specifies the rows and columns of the dense matrix.
4. Run hardware-software co-simulation. First, synthesize the RTL code:
   ```sh
   make tapa
   ```
   Then run co-simulation using the Vivado TAPA fast cosim:
   ```sh
   ./spmv-host 512 512 --bitstream="spmv.xilinx_u280_gen3x16_xdma_1_202211_1.hw.xo"
   ```
   Note: more details about the TAPA fast co-simulation for RTL simulation can be found at https://tapa.readthedocs.io/en/main/user/cosim.html
5. Build the final hardware bitstream:
   ```sh
   make hw
   ```
6. Run on actual FPGA hardware:
   ```sh
   ./spmv-host ../../matrices/analytics/analytics.mtx \
     --bitstream="vitis_run_hw/SpMV_xilinx_u280_gen3x16_xdma_1_202211_1.xclbin"
   ```

This workflow covers dense-overlay design generation, C simulation, co-simulation, and execution on real FPGA hardware.
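For spot-checking dense-overlay results (e.g. the 512x512 runs above) against a known-good answer, a plain row-major GeMV reference is handy. This is a generic sketch, not code taken from the host program.

```python
# Reference dense matrix-vector product (y = A x) for verifying GeMV output;
# A is a row-major list of rows, x a list of length len(A[0]).
def gemv_ref(A, x):
    rows, cols = len(A), len(A[0])
    y = [0.0] * rows
    for i in range(rows):
        acc = 0.0
        for j in range(cols):
            acc += A[i][j] * x[j]
        y[i] = acc
    return y
```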
If you use MAD-HiSpMV in your work, please cite our upcoming publication (to be added here after acceptance).