This repository contains the pseudo quantization workflow for M2XFP and provides a lightweight way to evaluate accuracy (e.g., perplexity) on LLaMA-3 and other LLMs.
- Create and activate the conda environment:

```bash
conda create -n mxq python=3.10
conda activate mxq
```

- Install the package in development mode:

```bash
pip install vllm==0.7.0 --extra-index-url https://download.pytorch.org/whl/cu128
pip install -e .
```

Run the main quantization workflow:
```bash
# Perplexity evaluation on WikiText for LLaMA-3
bash llama3_run.sh wikitext

# Reasoning benchmarks
bash reasoning.sh
```

- `entry.py` - Main entry point for quantization
- `llama3_run.sh` - An example script to run LLaMA-3 quantization
- `quantize/` - Core quantization modules
  - `quant_func.py` - Quantization configuration and functions
  - `quantizer.py` - Main quantization logic
  - `linear.py` - Quantized linear layer implementation
  - `pre_quant.py` - Pre-quantization utilities
- `utils/` - Utility modules
  - `module.py` - Module manipulation utilities
  - `dataload_utils.py` - Data loading utilities
  - `parallel.py` - Parallel processing utilities
  - `calib_data.py` - Calibration data handling
  - `utils.py` - General utilities
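The `quantize/` modules above implement the M2XFP format itself. As background on what pseudo (fake) quantization means, here is a minimal NumPy sketch of generic MX-style block quantization: values are quantized to a low-bit grid and immediately dequantized, so model accuracy can be evaluated entirely in floating point. The function name, block size, FP4 (E2M1) element grid, and power-of-two shared scale follow the generic OCP Microscaling convention, not this repository's API, and the metadata augmentation that defines M2XFP is omitted.

```python
import numpy as np

# Representable magnitudes of an FP4 (E2M1) element.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def pseudo_quantize(x, block_size=32):
    """Fake-quantize a 1-D array block-wise with a shared power-of-two scale.

    Illustrative MX-style sketch only; not the M2XFP format from this repo.
    """
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % block_size           # pad so the array splits into blocks
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    out = np.empty_like(blocks)
    for i, blk in enumerate(blocks):
        amax = np.max(np.abs(blk))
        if amax == 0:
            out[i] = 0.0
            continue
        # Shared power-of-two scale; emax of the E2M1 element format is 2.
        scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
        scaled = blk / scale
        # Round each element to the nearest representable E2M1 magnitude
        # (values beyond 6.0 saturate to the largest grid point).
        idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID).argmin(axis=1)
        out[i] = np.sign(scaled) * E2M1_GRID[idx] * scale
    return out.reshape(-1)[:len(x)]

w = np.array([0.11, -0.52, 0.98, 3.7, -0.03, 0.25, 1.4, -2.1])
wq = pseudo_quantize(w, block_size=8)
print(wq)  # dequantized weights now lie on the scaled E2M1 grid
```

Running the real workflow applies the same fake-quantize-then-dequantize idea to model weights, which is why perplexity can be measured with standard full-precision inference.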
If you find this repository useful in your research or project, please cite:
```bibtex
@misc{hu2026m2xfpmetadataaugmentedmicroscalingdata,
      title={M2XFP: A Metadata-Augmented Microscaling Data Format for Efficient Low-bit Quantization},
      author={Weiming Hu and Zihan Zhang and Haoyan Zhang and Chen Zhang and Cong Guo and Yu Feng and Tianchi Hu and Guanglin Li and Guipeng Hu and Junsong Wang and Jingwen Leng},
      year={2026},
      eprint={2601.19213},
      archivePrefix={arXiv},
      primaryClass={cs.AR},
      url={https://arxiv.org/abs/2601.19213},
}
```
We sincerely thank the authors and contributors of the following open-source projects. Our implementation builds upon their excellent codebases: