Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition
We introduce Uni-MuMER, which fully fine-tunes the Qwen2.5-VL-3B model for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions.
Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves new state-of-the-art performance, surpassing the best lightweight specialized model, SSAN, by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% in the zero-shot setting.
- 2025-09-18: This work was accepted to NeurIPS 2025 as a Spotlight (688/21575).
- 2025-09-09: Released the dataset (Uni-MuMER-Data and valid/test data) and the training code. [See Training]
- 2025-06-02: Released model weights and inference scripts.
- Download `data.zip` from GitHub, Hugging Face, or the Google Drive link.
- Unzip it at the project root. After extraction, you should have:
```
data
├── CROHME/
├── CROHME2023/
├── HME100K/
├── Im2LaTeXv2/
├── MathWriting/
└── MNE/
```
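Before running inference, it can be worth verifying the layout above is in place. The helper below is a hypothetical sanity check (not part of the repo) that reports any missing dataset folders:

```python
from pathlib import Path

# Expected top-level dataset folders after unzipping data.zip
EXPECTED = ["CROHME", "CROHME2023", "HME100K", "Im2LaTeXv2", "MathWriting", "MNE"]

def missing_datasets(root: str = "data") -> list[str]:
    """Return the names of expected dataset folders not found under `root`."""
    root_path = Path(root)
    return [name for name in EXPECTED if not (root_path / name).is_dir()]

if __name__ == "__main__":
    missing = missing_datasets()
    print("OK" if not missing else f"Missing: {', '.join(missing)}")
```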
After the dataset is in place, you can run batch inference over all three test sets with one of the commands below.

```shell
bash eval/eval_crohme.sh -i <input-dir> -o <output-dir> -m <model> -b <batch_size>
```

Example:

```shell
bash eval/eval_all.sh -m models/Uni-MuMER-3B -s test1 -b 32768
```

```shell
python scripts/vllm_infer.py --input-dir <input-dir> --output-dir <output-dir> --model <model> --batch_size <batch_size>
```

Tips:
- To select GPUs on multi-GPU machines, export `CUDA_VISIBLE_DEVICES` before running the script, e.g., `export CUDA_VISIBLE_DEVICES=1,2`.
- Use the `--batch_size` argument to control the number of samples per `vLLM.generate()` call. The default is 32768, which caps the per-call batch to avoid OOM errors; lower it if you still run out of memory.
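The batching behavior described above can be sketched in plain Python. The `generate_fn` stand-in below is hypothetical; in the real script it would wrap vLLM's `LLM.generate()`:

```python
from typing import Callable, List

def batched_generate(prompts: List[str],
                     generate_fn: Callable[[List[str]], List[str]],
                     batch_size: int = 32768) -> List[str]:
    """Run generate_fn over prompts in chunks of at most batch_size.

    Capping the number of samples handed to each generate call bounds
    peak memory use, which is what the --batch_size flag controls.
    """
    outputs: List[str] = []
    for start in range(0, len(prompts), batch_size):
        outputs.extend(generate_fn(prompts[start:start + batch_size]))
    return outputs

if __name__ == "__main__":
    # Toy stand-in for the model call, just to show the pattern:
    echo = lambda batch: [p.upper() for p in batch]
    print(batched_generate(["a", "b", "c"], echo, batch_size=2))  # ['A', 'B', 'C']
```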
Our training code depends on LLaMA-Factory.
For training dependencies, please refer to LLaMA-Factory or requirements_training.txt.
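For orientation, a LLaMA-Factory full-parameter SFT config generally looks like the sketch below. All values here are illustrative assumptions (dataset name, template, hyperparameters), not the contents of `train/Uni-MuMER-train.yaml`:

```yaml
### model (illustrative; see train/Uni-MuMER-train.yaml for the real settings)
model_name_or_path: Qwen/Qwen2.5-VL-3B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full

### dataset (name assumed; must be registered in LLaMA-Factory's dataset_info.json)
dataset: uni_mumer_data
template: qwen2_vl
cutoff_len: 2048

### training (illustrative hyperparameters)
output_dir: saves/uni-mumer-3b
per_device_train_batch_size: 4
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 1.0
bf16: true
```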
```shell
llamafactory-cli train train/Uni-MuMER-train.yaml
```

- Inference code and pretrained models.
- Evaluation code.
- Training code.
- Training data.
- Preprocess code.
Thanks to the following projects:
If you find Uni-MuMER useful for your study or research, please cite our paper with:

```bibtex
@article{li2025unimumer,
  title   = {Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition},
  author  = {Li, Yu and Jiang, Jin and Zhu, Jianhua and Peng, Shuai and Wei, Baole and Zhou, Yuxuan and Gao, Liangcai},
  journal = {arXiv preprint arXiv:2505.23566},
  year    = {2025},
}
```

