
Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition

Description

We introduce Uni-MuMER, which fully fine-tunes the Qwen2.5-VL-3B model for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions.

[Figure: Uni-MuMER framework overview]

Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves new state-of-the-art performance, surpassing the best lightweight specialized model, SSAN, by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% in the zero-shot setting.


📢 Updates

  • 2025-09-18: This work was accepted to NeurIPS 2025 as a Spotlight (688/21575).
  • 2025-09-09: Released the dataset (Uni-MuMER-Data and valid/test data) and the training code. [See Training]
  • 2025-06-02: Released model weights and inference scripts.

📦 Dataset Preparation

  1. Download data.zip from GitHub, Hugging Face, or Google Drive.
  2. Unzip it at the project root. After extraction, you should have:
data/
├── CROHME/
├── CROHME2023/
├── HME100K/
├── Im2LaTeXv2/
├── MathWriting/
└── MNE/
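After unzipping, a quick check like the following can confirm the layout matches the tree above (a minimal sketch; it assumes you run it from the project root):

```shell
# Verify the expected dataset directories exist under ./data
# (names taken from the tree above; run from the project root).
for d in CROHME CROHME2023 HME100K Im2LaTeXv2 MathWriting MNE; do
  if [ -d "data/$d" ]; then
    echo "ok: data/$d"
  else
    echo "missing: data/$d"
  fi
done
```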

🏃 Inference

After the dataset is in place, you can run batch inference over all three test sets with either of the two commands below.

Shell wrapper (recommended)

bash eval/eval_crohme.sh  -i <input-dir> -o <output-dir> -m <model> -b <batch_size>

Example

bash eval/eval_all.sh -m models/Uni-MuMER-3B -s test1 -b 32768

Direct Python call

python scripts/vllm_infer.py --input-dir <input-dir> --output-dir <output-dir> --model <model> --batch_size <batch_size>

Tip:

  • On multi-GPU machines, select GPUs by exporting CUDA_VISIBLE_DEVICES before running the script, e.g., export CUDA_VISIBLE_DEVICES=1,2.

  • The --batch_size argument controls the number of samples per vLLM.generate() call. The default is 32768, which caps the batch size to avoid out-of-memory (OOM) errors; lower it if you still run out of memory.
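The two tips can be combined. As a sketch (the input/output paths below are illustrative placeholders, not locations mandated by the repo), compose the command, inspect it, then execute it:

```shell
# Pin two GPUs and shrink the batch; paths below are illustrative placeholders.
export CUDA_VISIBLE_DEVICES=1,2
CMD="python scripts/vllm_infer.py \
  --input-dir data/CROHME \
  --output-dir outputs/CROHME \
  --model models/Uni-MuMER-3B \
  --batch_size 8192"
echo "$CMD"   # dry run: inspect the command, then execute with: eval "$CMD"
```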

🏋️ Training

Our training code depends on LLaMA-Factory.

For training dependencies, please refer to LLaMA-Factory or requirements_training.txt.

llamafactory-cli train train/Uni-MuMER-train.yaml
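For orientation, a full-fine-tuning config in LLaMA-Factory's SFT style typically looks like the sketch below. This is not the repository's actual train/Uni-MuMER-train.yaml; the dataset and output names are placeholders, and the dataset must be registered in LLaMA-Factory's dataset_info.json before use.

```yaml
# Illustrative sketch only -- defer to the repo's train/Uni-MuMER-train.yaml.
model_name_or_path: Qwen/Qwen2.5-VL-3B-Instruct
stage: sft
do_train: true
finetuning_type: full
dataset: uni_mumer_data        # placeholder: register in dataset_info.json
template: qwen2_vl
output_dir: saves/uni-mumer-3b # placeholder
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 1.0
bf16: true
```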

✅ TODO

  • Inference code and pretrained models.
  • Evaluation code.
  • Training code.
  • Training data.
  • Preprocess code.

🙏 Acknowledgements

Thanks to the projects this work builds on, including LLaMA-Factory and vLLM.

📝 Citation

If you find Uni-MuMER useful for your study or research, please cite our paper with:

@article{li2025unimumer,
  title   = {Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition},
  author  = {Li, Yu and Jiang, Jin and Zhu, Jianhua and Peng, Shuai and Wei, Baole and Zhou, Yuxuan and Gao, Liangcai},
  journal = {arXiv preprint arXiv:2505.23566},
  year    = {2025},
}
