
MAD-HiSpMV

MAtrix Adaptive Design for Highly Imbalanced SpMV Accelerator (with GeMV Support) on HBM-based FPGAs


Figure: Overview of the MAD-HiSpMV architecture

MAD-HiSpMV is a high-performance FPGA accelerator for Sparse Matrix–Vector Multiplication (SpMV) with optional dense overlay for GeMV support. It builds on our previous HiSpMV work with several key enhancements:

  • Scalable HBM support: Multiple HBM channels are used to load input vectors and matrices efficiently.
  • Hybrid Row Distribution Network: Routes PE outputs to dedicated y_Ax handlers for accumulation, balancing workload.
  • Adder Chain Groups (ACG): Optional pre-addition of multiplication results to avoid RAW dependencies in output accumulation and reduce pipeline stalls (see the sketch after this list).
  • Dense Overlay Support: Allows a single kernel to handle both SpMV and GeMV for mixed sparse-dense workloads.
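
The benefit of the adder chain groups can be pictured with a small functional model (an illustrative Python sketch only, not the repository's HLS kernel): products destined for the same output row are combined inside a fixed-width chain before they reach the accumulator, so the accumulator performs fewer read-modify-write updates per row.

  # Functional sketch of pre-accumulation (adder chain groups) -- illustrative
  # only, not the HLS kernel. Products arrive as (row, value) pairs; a chain of
  # width CHAIN_W combines neighbouring same-row products before accumulation.

  CHAIN_W = 4  # assumed chain width, for illustration only

  def pre_accumulate(products, chain_w=CHAIN_W):
      """Combine runs of same-row products inside each chain-sized window."""
      reduced = []
      for i in range(0, len(products), chain_w):
          window = products[i:i + chain_w]
          row, acc = window[0]
          for r, v in window[1:]:
              if r == row:
                  acc += v              # summed in the chain: no accumulator access
              else:
                  reduced.append((row, acc))
                  row, acc = r, v
          reduced.append((row, acc))
      return reduced

  products = [(0, 1.0), (0, 2.0), (0, 3.0), (1, 4.0), (1, 5.0), (2, 6.0)]
  print(pre_accumulate(products))       # 4 accumulator updates instead of 6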

Data Flow Summary:

  1. Sparse matrix A and input vector x are streamed from HBM channels to PEGs.
  2. PEs multiply nonzero elements of A with the corresponding entries of x.
  3. Results are routed through the hybrid row distribution network to the correct y_Ax handlers.
  4. Optional adder chains pre-accumulate results before final accumulation.
  5. Final output y is streamed back to HBM.
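
Functionally, this pipeline computes the standard y = A·x product. A minimal software model of steps 2-4 (a sketch using SciPy, independent of the kernel code) is:

  # Software model of the data flow above -- a reference only, not kernel code:
  # multiply each nonzero of A by the matching entry of x, then accumulate per row.
  import numpy as np
  import scipy.sparse as sp

  A = sp.random(8, 8, density=0.3, format="coo", dtype=np.float32)
  x = np.random.rand(8).astype(np.float32)

  y = np.zeros(A.shape[0], dtype=np.float32)
  for r, c, v in zip(A.row, A.col, A.data):
      y[r] += v * x[c]              # steps 2-4: multiply, route to row r, accumulate

  assert np.allclose(y, A @ x)      # matches the standard SpMV result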

Software Requirements

⚠️ Note: The PASTA+AutoBridge repo is private until publication. Please request access if needed.


⚙️ Setup Instructions

  1. Create and activate a Conda environment
    Install PASTA into this environment by following its instructions.

  2. Clone this repository and set up environment

    load_vitis23
    source miniconda3/bin/activate your_conda_env 
    cd HiSpMV 
    source setup
    cd -
    export CONDA_LOC=$(pwd)/miniconda3
    • load_vitis23: loads Vitis HLS & XRT path variables.
    • setup: sets required environment variables for MAD-HiSpMV.
  3. Install Python dependencies

    pip install -r requirements.txt
  4. Download benchmarking matrices

    python get_tb_matrices.py
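
To sanity-check the download, the fetched matrices can be listed from Python (a sketch that assumes the script places Matrix Market .mtx files somewhere under matrices/):

  # Quick inspection of the downloaded benchmark matrices -- assumes .mtx files
  # end up under matrices/, possibly inside per-matrix subdirectories.
  from pathlib import Path
  import scipy.io as sio

  for path in sorted(Path("matrices").rglob("*.mtx")):
      A = sio.mmread(str(path))     # returns a sparse COO matrix for sparse inputs
      print(f"{path.name}: {A.shape[0]} x {A.shape[1]}, nnz = {A.nnz}")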

📂 Repository Structure

apps             # Python apps: run SpMV/GeMV + sample DNN model
automation_tool  # Scripts to auto-generate accelerator configs (matrix-adaptive)
builds           # Source code + xclbin for U280/U50 configs, usage reports, floorplans
common           # Common host + kernel source code
cpu              # CPU benchmarking (Intel MKL SpMV/GeMV + power measurement)
gpu              # GPU benchmarking (cuSPARSE SpMV + power measurement)
matrices         # Storage for benchmarking matrices (downloaded by script)
pyhispmv         # pybind11 wrapper to invoke FPGA kernels via XRT
get_tb_matrices.py  # Script to fetch test/benchmarking matrices
requirements.txt # Python dependencies
setup            # Environment setup script
README.md        # Project documentation

🚀 Example Usage

FPGA Benchmarks (Python Apps)

  1. Build the pyhispmv package

    cd pyhispmv
    python setup.py build_ext --inplace
    cd ..
  2. Run SpMV/GeMV tests

    • General test (no arguments):

      cd apps
      python general_test.py
    • DNN model test (configurable; a CPU reference sketch follows this list):

      cd apps
      python model_test.py \
        --batch_size 1 \
        --input_size 4096 \
        --hidden_size_1 8192 \
        --hidden_size_2 8192 \
        --output_size 1024 \
        --density1 0.1 \
        --density2 0.25
    • Note on device selection:
      Both scripts require setting device_id (the FPGA index).
      To find available devices, run:

      xbutil examine

      Update device_id in the scripts to match the U280 board.
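
For reference, the default model_test.py arguments above describe a three-layer MLP in which the two hidden layers use sparse weights (densities 0.1 and 0.25). A CPU sketch of an equivalent computation follows; the choice of which layers are sparse and the use of ReLU are assumptions made here for illustration, not taken from the script.

  # CPU reference for the shapes used by model_test.py above -- an illustrative
  # sketch, not the script itself. Which layers are sparse and the ReLU
  # activations are assumptions. Generating the random sparse weights at these
  # sizes can take several seconds.
  import numpy as np
  import scipy.sparse as sp

  batch, d_in, h1, h2, d_out = 1, 4096, 8192, 8192, 1024
  W1 = sp.random(h1, d_in, density=0.10, format="csr", dtype=np.float32)  # sparse
  W2 = sp.random(h2, h1, density=0.25, format="csr", dtype=np.float32)    # sparse
  W3 = np.random.rand(d_out, h2).astype(np.float32)                       # dense

  x = np.random.rand(d_in, batch).astype(np.float32)
  h = np.maximum(W1 @ x, 0)   # sparse layer 1 -> SpMV
  h = np.maximum(W2 @ h, 0)   # sparse layer 2 -> SpMV
  y = W3 @ h                  # dense output layer -> GeMV (dense overlay)
  print(y.shape)              # (1024, 1)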


CPU Benchmarks (Intel MKL)

cd cpu
make clean all
./run_spmv.sh   # Run SpMV benchmarks
./run_gemv.sh   # Run GeMV benchmarks

GPU Benchmarks (NVIDIA cuSPARSE)

cd gpu
make clean all
./run_all.sh    # Run all SpMV benchmarks

Automation Tool (Matrix-Adaptive Design Generation)

The automation tool allows generating accelerator configurations either automatically (matrix-adaptive) or manually (explicit parameters).


Option 1: Automatic Configuration (main.py)

automation_tool/src/main.py analyzes the input matrix and automatically chooses design parameters such as HBM channel allocation and which architectural optimizations to enable.

Command:

cd automation_tool/src
python main.py <build_dir> --device {U50|U280|V80} [--matrices <file_or_dir>] [--dense-overlay]

Arguments:

  • build_dir (positional): Path to the build directory.
  • --device: Target device (U50, U280, or V80) [required].
  • --matrices: Path to a matrix file or a directory containing matrices.
  • --dense-overlay: Enable dense overlay mode (SpMV kernel with GeMV support).

⚠️ Important Notes:

  • In normal mode (without --dense-overlay), the tool uses the input matrix to tailor the accelerator design.
  • In dense overlay mode, the design is not tailored to the input sparse matrix, and the --matrices argument is ignored. The generated kernel supports both SpMV and GeMV for mixed workloads.

Examples:

  • Generate SpMV design for U280 with matrix directory:
    python main.py ../../builds --device U280 --matrices ../matrices/
  • Generate SpMV+GeMV hybrid design for U50 (no matrices needed):
    python main.py ../../builds --device U50 --dense-overlay

Option 2: Manual Configuration (spmvcodegen.py)

automation_tool/src/rsc/spmvcodegen.py provides fine-grained control over accelerator parameters instead of relying on automation.

Command:

cd automation_tool/src/
python spmvcodegen.py <output_dir> --device {U50|U280} [options]

Arguments:

  • output_dir: Path to the output directory.
  • --device: Target FPGA device (U50 or U280) [required].
  • --num-ch-A: Number of HBM channels for sparse matrix A (default: 16).
  • --num-ch-x: Number of HBM channels for input vector x (default: 1).
  • --num-ch-y: Number of HBM channels for output vector y (default: 1).
  • --ch-width: Width of HBM channels in bits (default: 512).
  • --urams-per-pe: URAM banks per PE for output accumulation (default: 2).
  • --dense-overlay: Enable dense overlay for GeMV support.
  • --pre-accumulator: Enable pre-accumulator optimization.
  • --row-dist-net: Enable row distribution network.
  • --high-freq: Build hardware for 400 MHz kernel clock.
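
As a rough guide to what the channel parameters imply, the peak bandwidth of the sparse-matrix stream scales with --num-ch-A and --ch-width. A back-of-the-envelope estimate (assuming each channel delivers one ch-width beat per kernel clock cycle, which is an idealization) is shown below.

  # Back-of-the-envelope HBM bandwidth for the matrix-A stream -- an idealized
  # estimate (one ch-width beat per channel per cycle), not a measured figure.
  num_ch_a = 16          # --num-ch-A (default)
  ch_width_bits = 512    # --ch-width (default)
  clock_hz = 400e6       # kernel clock with --high-freq

  bytes_per_cycle = num_ch_a * ch_width_bits / 8
  peak_gbps = bytes_per_cycle * clock_hz / 1e9
  print(f"{bytes_per_cycle:.0f} B/cycle -> {peak_gbps:.1f} GB/s peak for matrix A")
  # 1024 B/cycle -> 409.6 GB/s peak for matrix A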

Example (small dense-overlay design):

python ../../automation_tool/src/spmvcodegen.py ../ --device U280 \
  --num-ch-A 4 --num-ch-x 1 --num-ch-y 1 --urams-per-pe 1 --row-dist-net --dense-overlay

Example log output:

20250822:204011 [INFO]  Resource: FPGAResource(bram=128, uram=32, dsp=613, lut=134724, reg=135873)
20250822:204011 [INFO]  Successfully Generated Code at ../Dense-HI-SpMV-4-1-1

Build and Test the Generated Design

  1. Navigate to the generated design directory
    The script automatically names the directory with configuration info:

    cd ../Dense-HI-SpMV-4-1-1
  2. Build host code

    make host
  3. Run C simulation (HLS source code)

    • Sparse matrix input (SpMV):
      ./spmv-host ../../matrices/poli_large/poli_large.mtx
    • Dense matrix input (dense overlay / GeMV):
      ./spmv-host 512 512
      where 512 512 specifies rows and columns of the dense matrix.
  4. Run hardware-software co-simulation
    First, synthesize the RTL code:

    make tapa

    Then run co-simulation using the Vivado TAPA fast cosim:

    ./spmv-host 512 512 --bitstream="spmv.xilinx_u280_gen3x16_xdma_1_202211_1.hw.xo"

Note: More details about the TAPA fast co-simulation flow for RTL simulation are available at https://tapa.readthedocs.io/en/main/user/cosim.html

  5. Build final hardware bitstream

    make hw
  6. Run on actual FPGA hardware

    ./spmv-host ../../matrices/analytics/analytics.mtx \
        --bitstream="vitis_run_hw/SpMV_xilinx_u280_gen3x16_xdma_1_202211_1.xclbin"

This workflow covers dense-overlay design generation, C simulation, co-simulation, and execution on real FPGA hardware.


📖 Citation

If you use MAD-HiSpMV in your work, please cite our upcoming publication (to be added here after acceptance).

About

[TRETS 2025][FPGA 2024] FPGA Accelerator for Imbalanced SpMV using HLS
