Large Language Models (LLMs) require substantial computational resources, which limits their deployment on resource-constrained hardware. Ternary LLMs mitigate these demands by quantizing weights to 2-bit ternary values {-1, 0, +1}, achieving significant compression and, typically, 50-90% weight sparsity. However, existing approaches face three limitations:
- Existing CPUs and GPUs do not natively support 2-bit operations, and mainstream libraries such as PyTorch and CUDA lack dedicated compute kernels for ternary weights.
- Existing sparse formats such as Compressed Sparse Column (CSC) are not optimized for ternary values, incurring extra storage and decompression overhead (see the sketch after this list).
- Methods tailored to ternary weights, such as BitNet, RSR, and RSR++, fail to exploit sparsity structures.
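As a minimal illustration (a sketch only, not this repository's kernels or storage format), the C++ snippet below shows why ternary weights invite addition-based sparse GEMM: storing each weight row as two index lists, one for the +1 entries and one for the -1 entries, skips zeros entirely and replaces every multiplication with an addition or a subtraction. The names `TernaryRow` and `ternary_gemm` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical ternary row layout: zero weights are not stored at all.
struct TernaryRow {
    std::vector<std::size_t> plus;   // column indices where the weight is +1
    std::vector<std::size_t> minus;  // column indices where the weight is -1
};

// Y (M x N) = W (M x K, ternary) * X (K x N, dense), all row-major.
// Every nonzero weight contributes one add or one subtract -- no multiplies.
void ternary_gemm(const std::vector<TernaryRow>& W,
                  const std::vector<float>& X, std::size_t N,
                  std::vector<float>& Y) {
    const std::size_t M = W.size();
    Y.assign(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t k : W[m].plus)      // +1 weights: add row k of X
            for (std::size_t n = 0; n < N; ++n) Y[m * N + n] += X[k * N + n];
        for (std::size_t k : W[m].minus)     // -1 weights: subtract row k of X
            for (std::size_t n = 0; n < N; ++n) Y[m * N + n] -= X[k * N + n];
    }
}

int main() {
    // W = [[+1,  0, -1],
    //      [ 0, +1, +1]]  (2 x 3 ternary); X is 3 x 2 dense.
    std::vector<TernaryRow> W = {{{0}, {2}}, {{1, 2}, {}}};
    std::vector<float> X = {1, 2,   // row 0
                            3, 4,   // row 1
                            5, 6};  // row 2
    std::vector<float> Y;
    ternary_gemm(W, X, /*N=*/2, Y);
    // Expected Y = [[1-5, 2-6], [3+5, 4+6]] = [[-4, -4], [8, 10]]
    return 0;
}
```

By contrast, a generic format such as CSC keeps an explicit value array next to the indices and still performs multiplications, which is the overhead the projects below are designed to avoid.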
We therefore address these challenges with novel algorithms, code optimizations, and hardware accelerators. This repository contains code for three projects:
- SSR: Sparse Segment Reduction for Ternary GEMM Acceleration (targets limitation 3)
- Fast Ternary Large Language Model Inference with Addition-Based Sparse GEMM on Edge Devices (targets limitations 1 and 2)
- An Accelerator for Ternary Language Models based on FPGA (targets limitation 1)
File organization and main contributors:
- SSR: Adeline Pittet, Valerie Verdan, and Shien Zhu
- ternaryLLM_CPU: Mila Kjoseva and Shien Zhu
- ternaryLLM_GPU: Guanshujie Fu
- ternaryLLM_FPGA: Gabriele Giacone
Please refer to the README inside each folder for detailed experiment setups. If you find this repository helpful, please cite the following paper(s):
@inproceedings{SSR_DATE_2026,
  title={SSR: Sparse Segment Reduction for Ternary GEMM Acceleration},
  author={Adeline Pittet and Shien Zhu and Valerie Verdan and Gustavo Alonso},
  booktitle={Design, Automation and Test in Europe (DATE)},
  year={2026}
}