HiPACK is an efficient sub-8-bit direct convolution acceleration library that maximizes the performance of quantized neural network execution on Arm processors.
HiPACK follows the theoretical approach of adopting long-bitwidth multiplication for low-bitwidth convolution and develops a series of novel techniques, based on SIMD optimization and bitwise management, to close the efficiency gap of low-bitwidth convolution on wimpy processors. HiPACK is built upon the following principles:
- Multiplication-based Convolution: Adopts long-bitwidth multiplication for low-bitwidth convolution.
- Data Dependency Elimination: Identifies and handles the data dependencies that arise when large-bitwidth multiplication is used for low-bitwidth convolution operations.
- SIMD Optimization: Utilizes SIMD instructions to maximize data reuse, with operation decoupling and reordering to improve data parallelism.
- Bitwise Management: Develops an optimal segmentation bitwidth identification mechanism and a dual interleaved register mechanism to improve the efficiency of low-bitwidth convolution on wimpy processors.
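To make the multiplication-based idea concrete, here is a minimal, illustrative Python sketch (not HiPACK's actual kernel): several low-bitwidth operands are packed into disjoint bit segments of one wide integer, so a single long multiplication produces all of their products at once, provided each segment has enough guard bits to avoid carries into its neighbor. The segment width of 8 below is an arbitrary choice for illustration; selecting it systematically is what the segmentation bitwidth identification mechanism addresses.

# Illustration only: one wide multiplication computes four 3-bit products at once.
a = [5, 3, 7, 1]                 # four 3-bit activations
w = 6                            # one 3-bit weight
SEG = 8                          # assumed segment width: 6 product bits plus guard bits

packed = sum(v << (SEG * i) for i, v in enumerate(a))     # pack activations into one integer
product = packed * w                                      # a single long multiplication

# Each SEG-bit segment of the product holds one activation-weight product.
unpacked = [(product >> (SEG * i)) & ((1 << SEG) - 1) for i in range(len(a))]
assert unpacked == [v * w for v in a]                     # [30, 18, 42, 6]

Accumulating many such products across input channels gradually fills the guard bits, which is why larger bitwidth settings carry an overflow risk (see the WA_bits note below).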
The synergistic combination of the above methods is thoroughly evaluated with various CNN models on Arm processors. Key features include:
- Dynamic Bitwidth Support: Adapts to quantized computations with bitwidths lower than 8-bit.
- High Performance: Significant performance improvements, achieving a minimum of 3.2x speedup.
- PyTorch Integration: Provides PyTorch operator interfaces in torch_func, making it easy to integrate into existing deep learning workflows.
- Support for Various Convolution Shapes:
  - DirectConv (nx3): Native support for nx3 convolution shapes.
  - DirectConv (nxn): Extended implementation for arbitrary nxn shapes by tiling them into multiple nx3 convolutions.
The native support for nx3 kernels is implemented in C++ and located in the src folder. The other convolution kernel sizes are implemented by tiling the convolution into multiple nx3 convolutions through PyTorch function calls (detailed in the torch_func folder).
# C++17
# g++ 10.2.1
# PyTorch >= 2.2.2 (With PyTorch C++ extension)
# OpenMP
# Clone this repository to your local Raspberry Pi 4B+ platform
- N: Input batch size. (Supported values: 1, 2, 4, 8)
- Ci: Number of input channels. (Supported values: 32, 64, 128, 256)
- H: Height of input feature map. (Supported values: 8, 16, 32)
- W: Width of input feature map. (Currently only widths divisible by 12 are supported; other widths are zero-padded to the nearest multiple of 12, e.g., 32 is padded to 36. Recommended values: 12, 24, 36)
- Co: Number of output channels. (Supported values: 32, 64, 128, 256)
- WA_bits: Bitwidth of weights and activations. (Supported values: 1, 2, 3, 4, 5, 6. Note: values greater than 4 may have the risk of overflow.)
- verbose: Whether to print verbose information. (Supported values: 0, 1)
- debug: Whether to verify the correctness of the computation. (Supported values: 0, 1)
Based on these parameters, the tensor dimensions for computation are represented as:
- Input shape: [N, Ci, H, W]
- Weight shape: [Co, Ci, 3, 3]
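As a quick sanity check on the width constraint, the helper below (illustrative only, not part of the library) rounds a width up to the next multiple of 12, mirroring the zero-padding rule described above:

# Illustration only: widths not divisible by 12 are zero-padded up to the next multiple of 12.
def padded_width(W, multiple=12):
    return ((W + multiple - 1) // multiple) * multiple

assert padded_width(32) == 36    # matches the 32 -> 36 example above
assert padded_width(24) == 24    # already a multiple of 12, left unchanged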
Use the following command to run the fast experiments on a Raspberry Pi 4B+ platform.
$ cd src
# The make command is included in the shell script
$ bash run_bench.sh
If there are no compilation or execution errors, you will see output like the following (each configuration followed by its performance):
config: N1 Ci2 H2 W2 Co2 W3A3 debug1 verbose0
[W3A3] input[1,2,2,12] * weight[2,2,3,3]: Test pass
[W3A3] input[1,2,2,12] * weight[2,2,3,3]: Elapsed time: 0.000168 seconds Performance: 0.023943 GFLOPS.
config: N1 Ci2 H2 W2 Co4 W3A3 debug1 verbose0
[W3A3] input[1,2,2,12] * weight[4,2,3,3]: Test pass
[W3A3] input[1,2,2,12] * weight[4,2,3,3]: Elapsed time: 0.001268 seconds Performance: 0.006360 GFLOPS.
...
...
config: W3A3, save to: logs/test_hipack_perf_W3A3.log
[W3A3] input[16,3,224,228] * weight[32,3,3,3]: Elapsed time: 0.224631 seconds Performance: 6.397795 GFLOPS.
[W3A3] input[16,32,112,120] * weight[64,32,3,3]: Elapsed time: 0.248804 seconds Performance: 32.970821 GFLOPS.
...
...
[W3A3] input[16,512,7,12] * weight[1024,512,3,3]: Elapsed time: 0.221781 seconds Performance: 85.784536 GFLOPS.
[W3A3] input[16,1024,7,12] * weight[1024,1024,3,3]: Elapsed time: 0.446465 seconds Performance: 85.226671 GFLOPS.
Navigate to the torch_func folder.
cd torch_func/
# compile commands are scripted in compile.sh
bash compile.sh
Once compiled, the direct_conv operator is ready to use for convolutions. Refer to the file usage_of_directconv.py for an example of how to use the direct_conv operator for efficient convolutions.
The following is a simple example.
import torch
from direct_conv2d import direct_conv2d

# Problem size: batch, channels, spatial dimensions, output channels, and bitwidths (W3A3).
N, Ci, H, W, Co, W_bits, A_bits = 16, 256, 32, 36, 256, 3, 3
flops = 2 * N * Ci * Co * H * W * 3 * 3          # nominal FLOP count of the 3x3 convolution

# Random low-bitwidth activations and weights stored as int32 tensors.
inp = torch.randint(0, 2 ** A_bits - 1, (N, Ci, H, W)).int()
weight = torch.randint(0, 2 ** W_bits - 1, (Co, Ci, 3, 3)).int()

output = direct_conv2d(inp, weight, W_bits, A_bits, 1, 1, 0, 0)
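To reproduce the style of the per-layer GFLOPS figures in the benchmark logs, you can time the call and divide the nominal FLOP count (the flops variable above) by the elapsed time. A minimal sketch, continuing the example:

import time

start = time.perf_counter()
output = direct_conv2d(inp, weight, W_bits, A_bits, 1, 1, 0, 0)
elapsed = time.perf_counter() - start
print(f"Elapsed time: {elapsed:.6f} seconds  Performance: {flops / elapsed / 1e9:.6f} GFLOPS")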
nxn shape support is implemented by tiling nxn convolutions into multiple nx3 convolutions, as illustrated by the sketch after the examples below. For example:
- A 5x5 convolution can be tiled into 2 5x3 convolutions.
- A 9x9 convolution can be tiled into 3 9x3 convolutions.
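This decomposition can be verified with plain PyTorch. The sketch below is an illustrative equivalence check using float tensors and F.conv2d (it is not the code in extend_conv2d.py): the kernel width is zero-padded up to a multiple of 3, split into nx3 slices, and the column-shifted partial outputs are summed.

import torch
import torch.nn.functional as F

N, C, H, W, Co, K = 1, 4, 16, 24, 8, 5           # a 5x5 kernel, as in the first example
x = torch.randn(N, C, H, W)
k = torch.randn(Co, C, K, K)
ref = F.conv2d(x, k)                             # ordinary KxK convolution (stride 1, no padding)

pad_w = (-K) % 3                                 # zero columns needed to reach a multiple of 3
k_pad = F.pad(k, (0, pad_w))                     # pad the kernel width on the right
x_pad = F.pad(x, (0, pad_w))                     # pad the input width by the same amount
out = None
for s in range((K + pad_w) // 3):
    part = F.conv2d(x_pad, k_pad[:, :, :, 3 * s:3 * s + 3].contiguous())  # one Kx3 convolution
    part = part[:, :, :, 3 * s:3 * s + ref.shape[-1]]                     # shift by the column offset
    out = part if out is None else out + part

assert torch.allclose(ref, out, atol=1e-4)       # the two 5x3 pieces reproduce the 5x5 result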
We evaluate conv2d with various kernel sizes, including 3x3, 5x5, 7x7, 9x9, and 11x11, in the script extend_conv2d.py.
You can run the following command to run the fast experiments, but please ensure that you have compiled the PyTorch integration beforehand.
cd torch_func
python extend_conv2d.py
You should then see output similar to the following:
Evaluate Conv2D with input size of 16x512x24x24 and kernel size of 512x512x3x3
Float 3x3 time: 3.4242 s Performance: 12.6996GFLOPS
Qint8 3x3 time: 2.6350 s Performance: 16.5035GFLOPS
HIPACK 3x3 time: 1.0159 s Performance: 42.8046GFLOPS
--------------------------------------------------------------------------------
Evaluate Conv2D with input size of 16x512x24x24 and kernel size of 512x512x5x5
Float 5x5 time: 9.5982 s Performance: 12.5853GFLOPS
Qint8 5x5 time: 7.4285 s Performance: 16.2611GFLOPS
HIPACK 5x5 time: 2.6788 s Performance: 45.0934GFLOPS
--------------------------------------------------------------------------------
Evaluate Conv2D with input size of 16x512x24x24 and kernel size of 512x512x7x7
Float 7x7 time: 19.5089 s Performance: 12.1360GFLOPS
Qint8 7x7 time: 15.2863 s Performance: 15.4884GFLOPS
HIPACK 7x7 time: 5.5651 s Performance: 42.5435GFLOPS
--------------------------------------------------------------------------------
Evaluate Conv2D with input size of 16x512x24x24 and kernel size of 512x512x9x9
Float 9x9 time: 33.2281 s Performance: 11.7786GFLOPS
Qint8 9x9 time: 26.6702 s Performance: 14.6748GFLOPS
HIPACK 9x9 time: 6.7355 s Performance: 58.1069GFLOPS
--------------------------------------------------------------------------------
Evaluate Conv2D with input size of 16x512x24x24 and kernel size of 512x512x11x11
Float 11x11 time: 37.7351 s Performance: 15.4936GFLOPS
Qint8 11x11 time: 49.6404 s Performance: 11.7778GFLOPS
HIPACK 11x11 time: 11.9832 s Performance: 48.7894GFLOPS
--------------------------------------------------------------------------------
The comparison results are shown below.
Additionally, you can find more settings in the script extend_conv2d.py, including the input shape and kernel size settings. Feel free to modify these settings to conduct other experiments. Refer to extend_conv2d.py for details on using the extended convolution operator.
We have conducted comprehensive model evaluations, including VGG16, ResNet18, and ResNet34, as detailed in our manuscript. These evaluations can be found in the file torch_func/full_model_eval.py.
In these models, only the 3x3 convolutions have been replaced with direct_conv2d, the PyTorch integration of HiPACK described above.
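To see how much of each network this replacement touches, the snippet below (illustrative, assuming torchvision is installed; it is not taken from full_model_eval.py) counts the 3x3 convolution layers in the evaluated models:

import torch.nn as nn
from torchvision.models import vgg16, resnet18, resnet34

for ctor in (vgg16, resnet18, resnet34):
    model = ctor(weights=None)      # architecture only; older torchvision uses pretrained=False
    convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
    threes = [m for m in convs if m.kernel_size == (3, 3)]
    print(f"{ctor.__name__}: {len(threes)} of {len(convs)} convolution layers are 3x3")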
We take the W3A3 settings as an example.
You can use the following command to run the fast experiments on a Raspberry Pi 4B+ platform (you need to compile the PyTorch integration first).
cd torch_func
python full_model_eval.py
You should then see the following outputs:
Evaluate latency on VGG16 with batchsize of 16:
Float time: 55.3923 s
Qint8 time: 14.0694 s
HIPACK-W3A3 time: 11.6774 s
Evaluate latency on ResNet18 with batchsize of 16:
Float time: 6.3254 s
Qint8 time: 3.2957 s
HIPACK-W3A3 time: 2.9776 s
Evaluate latency on ResNet34 with batchsize of 16:
Float time: 10.2774 s
Qint8 time: 6.0597 s
HIPACK-W3A3 time: 5.3619 s
- The PyTorch function calls cause some performance degradation.
@inproceedings{hipack2025micro,
author={Chen, Yao and Gong, Cheng and He, Bingsheng},
booktitle={IEEE/ACM International Symposium on Microarchitecture (MICRO)},
title={HiPACK: Efficient Sub-8-Bit Direct Convolution with SIMD and Bitwise Management},
year={2025},
volume={},
number={},
pages={},
}