HiPACK is an efficient sub-8-bit direct convolution acceleration library that maximizes the performance of quantized neural network execution on Arm processors.
HiPACK follows the theoretical approach of adopting long-bitwidth multiplication for low-bitwidth convolution and develops a series of novel techniques, based on SIMD optimization and bitwise management, to close the efficiency gap of low-bitwidth convolution on wimpy processors. HiPACK is built upon the following principles:
- Multiplication-based Convolution: Adopts long-bitwidth multiplication for low-bitwidth convolution.
- Data Dependency Elimination: Identifies and handles the data dependencies that arise when large-bitwidth multiplication is used for low-bitwidth convolution operations.
- SIMD Optimization: Utilizes SIMD instructions to maximize data reuse, with operation decoupling and reordering to improve data parallelism.
- Bitwise Management: Develops an optimal segmentation bitwidth identification mechanism and a dual interleaved register mechanism to improve the efficiency of low-bitwidth convolution on wimpy processors.
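To make the multiplication-based idea concrete, here is a minimal, illustrative Python sketch (not HiPACK's actual kernel): several low-bitwidth operands are packed into disjoint bit segments of one wide integer, so a single long multiplication produces all of their products at once, provided each segment has enough guard bits to avoid carries into its neighbor. The segment width of 8 below is an arbitrary choice for illustration; selecting it systematically is what the segmentation bitwidth identification mechanism addresses.

# Illustration only: one wide multiplication computes four 3-bit products at once.
a = [5, 3, 7, 1]                 # four 3-bit activations
w = 6                            # one 3-bit weight
SEG = 8                          # assumed segment width: 6 product bits plus guard bits

packed = sum(v << (SEG * i) for i, v in enumerate(a))     # pack activations into one integer
product = packed * w                                      # a single long multiplication

# Each SEG-bit segment of the product holds one activation-weight product.
unpacked = [(product >> (SEG * i)) & ((1 << SEG) - 1) for i in range(len(a))]
assert unpacked == [v * w for v in a]                     # [30, 18, 42, 6]

Accumulating many such products across input channels gradually fills the guard bits, which is why larger bitwidth settings carry an overflow risk (see the WA_bits note below).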
The synergistic combination of the above methods is thoroughly evaluated with various CNN models on Arm processors. Key features include:
- Dynamic Bitwidth Support: Adapts to quantized computations with bitwidths lower than 8-bit.
- High Performance: Significant performance improvements, achieving a minimum of 3.2x speedup.
- PyTorch Integration: Provides PyTorch operator interfaces in torch_func, making it easy to integrate into existing deep learning workflows.
- Support for Various Convolution Shapes:
  - DirectConv (nx3): Native support for nx3 convolution shapes.
  - DirectConv (nxn): Extended implementation for arbitrary nxn shapes by tiling them into multiple nx3 convolutions.
The native support for nx3 kernels is implemented in C++ and located in the src folder. The other convolution kernel sizes are implemented by tiling the convolution into multiple nx3 convolutions through PyTorch function calls (detailed in the torch_func folder).
# C++17
# g++ 10.2.1
# PyTorch >= 2.2.2 (With PyTorch C++ extension)
# OpenMP
# Clone this repository to your local Raspberry Pi 4B+ platform
- N: Input batch size. (Supported values: 1, 2, 4, 8)
- Ci: Number of input channels. (Supported values: 32, 64, 128, 256)
- H: Height of input feature map. (Supported values: 8, 16, 32)
- W: Width of input feature map. (Currently only widths divisible by 12 are supported; other widths are zero-padded to the nearest multiple of 12, e.g., 32 is padded to 36. Recommended values: 12, 24, 36)
- Co: Number of output channels. (Supported values: 32, 64, 128, 256)
- WA_bits: Bitwidth of weights and activations. (Supported values: 1, 2, 3, 4, 5, 6. Note: values greater than 4 may have the risk of overflow.)
- verbose: Whether to print verbose information. (Supported values: 0, 1)
- debug: Whether to verify the correctness of the computation. (Supported values: 0, 1)
Based on these parameters, the tensor dimensions for computation are represented as:
- Input shape: [N, Ci, H, W]
- Weight shape: [Co, Ci, 3, 3]
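As a quick sanity check on the width constraint, the helper below (illustrative only, not part of the library) rounds a width up to the next multiple of 12, mirroring the zero-padding rule described above:

# Illustration only: widths not divisible by 12 are zero-padded up to the next multiple of 12.
def padded_width(W, multiple=12):
    return ((W + multiple - 1) // multiple) * multiple

assert padded_width(32) == 36    # matches the 32 -> 36 example above
assert padded_width(24) == 24    # already a multiple of 12, left unchanged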
Use the following command to run the fast experiments on a Raspberry Pi 4B+ platform.
$ cd src
# The make command is included in the shell script
$ bash run_bench.sh
If there are no compilation or execution errors, you will see output like the following (each configuration followed by its performance):
config: N1 Ci2 H2 W2 Co2 W3A3 debug1 verbose0
[W3A3] input[1,2,2,12] * weight[2,2,3,3]: Test pass
[W3A3] input[1,2,2,12] * weight[2,2,3,3]: Elapsed time: 0.000168 seconds Performance: 0.023943 GFLOPS.
config: N1 Ci2 H2 W2 Co4 W3A3 debug1 verbose0
[W3A3] input[1,2,2,12] * weight[4,2,3,3]: Test pass
[W3A3] input[1,2,2,12] * weight[4,2,3,3]: Elapsed time: 0.001268 seconds Performance: 0.006360 GFLOPS.
...
...
config: W3A3, save to: logs/test_hipack_perf_W3A3.log
[W3A3] input[16,3,224,228] * weight[32,3,3,3]: Elapsed time: 0.224631 seconds Performance: 6.397795 GFLOPS.
[W3A3] input[16,32,112,120] * weight[64,32,3,3]: Elapsed time: 0.248804 seconds Performance: 32.970821 GFLOPS.
...
...
[W3A3] input[16,512,7,12] * weight[1024,512,3,3]: Elapsed time: 0.221781 seconds Performance: 85.784536 GFLOPS.
[W3A3] input[16,1024,7,12] * weight[1024,1024,3,3]: Elapsed time: 0.446465 seconds Performance: 85.226671 GFLOPS.
Navigate to the torch_func folder.
cd torch_func/
# compile commands are scripted in compile.sh
bash compile.sh
Once compiled, the direct_conv operator is ready to use for convolutions. Refer to the file usage_of_directconv.py for an example of how to use the direct_conv operator for efficient convolutions.
The following is a simple example.
import torch
from direct_conv2d import direct_conv2d

# Problem size: batch, channels, spatial dimensions, output channels, and bitwidths (W3A3).
N, Ci, H, W, Co, W_bits, A_bits = 16, 256, 32, 36, 256, 3, 3
flops = 2 * N * Ci * Co * H * W * 3 * 3          # nominal FLOP count of the 3x3 convolution

# Random low-bitwidth activations and weights stored as int32 tensors.
inp = torch.randint(0, 2 ** A_bits - 1, (N, Ci, H, W)).int()
weight = torch.randint(0, 2 ** W_bits - 1, (Co, Ci, 3, 3)).int()

output = direct_conv2d(inp, weight, W_bits, A_bits, 1, 1, 0, 0)
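To reproduce the style of the per-layer GFLOPS figures in the benchmark logs, you can time the call and divide the nominal FLOP count (the flops variable above) by the elapsed time. A minimal sketch, continuing the example:

import time

start = time.perf_counter()
output = direct_conv2d(inp, weight, W_bits, A_bits, 1, 1, 0, 0)
elapsed = time.perf_counter() - start
print(f"Elapsed time: {elapsed:.6f} seconds  Performance: {flops / elapsed / 1e9:.6f} GFLOPS")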
nxn shape support is implemented by tiling nxn convolutions into multiple nx3 convolutions, as illustrated by the sketch after the examples below. For example:
- A 5x5 convolution can be tiled into 2 5x3 convolutions.
- A 9x9 convolution can be tiled into 3 9x3 convolutions.
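This decomposition can be verified with plain PyTorch. The sketch below is an illustrative equivalence check using float tensors and F.conv2d (it is not the code in extend_conv2d.py): the kernel width is zero-padded up to a multiple of 3, split into nx3 slices, and the column-shifted partial outputs are summed.

import torch
import torch.nn.functional as F

N, C, H, W, Co, K = 1, 4, 16, 24, 8, 5           # a 5x5 kernel, as in the first example
x = torch.randn(N, C, H, W)
k = torch.randn(Co, C, K, K)
ref = F.conv2d(x, k)                             # ordinary KxK convolution (stride 1, no padding)

pad_w = (-K) % 3                                 # zero columns needed to reach a multiple of 3
k_pad = F.pad(k, (0, pad_w))                     # pad the kernel width on the right
x_pad = F.pad(x, (0, pad_w))                     # pad the input width by the same amount
out = None
for s in range((K + pad_w) // 3):
    part = F.conv2d(x_pad, k_pad[:, :, :, 3 * s:3 * s + 3].contiguous())  # one Kx3 convolution
    part = part[:, :, :, 3 * s:3 * s + ref.shape[-1]]                     # shift by the column offset
    out = part if out is None else out + part

assert torch.allclose(ref, out, atol=1e-4)       # the two 5x3 pieces reproduce the 5x5 result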
We evaluate conv2d with various kernel sizes, including 3x3, 5x5, 7x7, 9x9, and 11x11, in the script extend_conv2d.py.
You can run the following command to run the fast experiments, but please ensure that you have compiled the PyTorch integration beforehand.
cd torch_func
python extend_conv2d.py
You should then see output similar to the following:
Evaluate Conv2D with input size of 16x512x24x24 and kernel size of 512x512x3x3
Float 3x3 time: 3.4242 s Performance: 12.6996GFLOPS
Qint8 3x3 time: 2.6350 s Performance: 16.5035GFLOPS
HIPACK 3x3 time: 1.0159 s Performance: 42.8046GFLOPS
--------------------------------------------------------------------------------
Evaluate Conv2D with input size of 16x512x24x24 and kernel size of 512x512x5x5
Float 5x5 time: 9.5982 s Performance: 12.5853GFLOPS
Qint8 5x5 time: 7.4285 s Performance: 16.2611GFLOPS
HIPACK 5x5 time: 2.6788 s Performance: 45.0934GFLOPS
--------------------------------------------------------------------------------
Evaluate Conv2D with input size of 16x512x24x24 and kernel size of 512x512x7x7
Float 7x7 time: 19.5089 s Performance: 12.1360GFLOPS
Qint8 7x7 time: 15.2863 s Performance: 15.4884GFLOPS
HIPACK 7x7 time: 5.5651 s Performance: 42.5435GFLOPS
--------------------------------------------------------------------------------
Evaluate Conv2D with input size of 16x512x24x24 and kernel size of 512x512x9x9
Float 9x9 time: 33.2281 s Performance: 11.7786GFLOPS
Qint8 9x9 time: 26.6702 s Performance: 14.6748GFLOPS
HIPACK 9x9 time: 6.7355 s Performance: 58.1069GFLOPS
--------------------------------------------------------------------------------
Evaluate Conv2D with input size of 16x512x24x24 and kernel size of 512x512x11x11
Float 11x11 time: 37.7351 s Performance: 15.4936GFLOPS
Qint8 11x11 time: 49.6404 s Performance: 11.7778GFLOPS
HIPACK 11x11 time: 11.9832 s Performance: 48.7894GFLOPS
--------------------------------------------------------------------------------
The comparison results are shown below.
Additionally, you can find more settings in the script extend_conv2d.py, including the input shape and kernel size settings. Feel free to modify these settings to conduct other experiments. Refer to extend_conv2d.py for details on using the extended convolution operator.
We have conducted comprehensive model evaluations, including VGG16, ResNet18, and ResNet34, as detailed in our manuscript. These evaluations can be found in the file torch_func/full_model_eval.py.
In these models, only the 3x3 convolutions have been replaced with direct_conv2d, the PyTorch integration of HiPACK described above.
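To see how much of each network this replacement touches, the snippet below (illustrative, assuming torchvision is installed; it is not taken from full_model_eval.py) counts the 3x3 convolution layers in the evaluated models:

import torch.nn as nn
from torchvision.models import vgg16, resnet18, resnet34

for ctor in (vgg16, resnet18, resnet34):
    model = ctor(weights=None)      # architecture only; older torchvision uses pretrained=False
    convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
    threes = [m for m in convs if m.kernel_size == (3, 3)]
    print(f"{ctor.__name__}: {len(threes)} of {len(convs)} convolution layers are 3x3")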
We take the W3A3 settings as an example.
You can use the following command to run the fast experiments on a Raspberry Pi 4B+ platform (you need to compile the PyTorch integration first).
cd torch_func
python full_model_eval.py
You should then see the following outputs:
Evaluate latency on VGG16 with batchsize of 16:
Float time: 55.3923 s
Qint8 time: 14.0694 s
HIPACK-W3A3 time: 11.6774 s
Evaluate latency on ResNet18 with batchsize of 16:
Float time: 6.3254 s
Qint8 time: 3.2957 s
HIPACK-W3A3 time: 2.9776 s
Evaluate latency on ResNet34 with batchsize of 16:
Float time: 10.2774 s
Qint8 time: 6.0597 s
HIPACK-W3A3 time: 5.3619 s
- The PyTorch function calls cause some performance degradation.
@inproceedings{hipack2025micro,
author={Chen, Yao and Gong, Cheng and He, Bingsheng},
booktitle={IEEE/ACM International Symposium on Microarchitecture (MICRO)},
title={HiPACK: Efficient Sub-8-Bit Direct Convolution with SIMD and Bitwise Management},
year={2025},
volume={},
number={},
pages={},
}