This project provides a modular implementation for fine-tuning vision-language models (VLMs) on LaTeX OCR tasks. Read the paper
Contributed to MinerU under Shanghai AI Lab.
- Install the required dependencies:

```bash
pip install -r requirements.txt
```

- Make sure you have CUDA available if you want to use GPU acceleration.
All configuration settings are centralized in config.py. Key settings include:
- Model Configuration: Model name, output directories
- Dataset Configuration: Dataset name and config
- Training Configuration: Batch size, learning rate, epochs, etc.
- Hardware Configuration: Mixed precision, gradient checkpointing, etc.
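As a rough illustration of how such settings can be centralized, here is a hypothetical sketch of a `config.py`; the field names and default values are illustrative assumptions, not the actual module contents:

```python
# Illustrative sketch only -- the real config.py may use different
# names, values, and structure.
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    # Model configuration (model id and dataset name are assumptions)
    model_name: str = "Qwen/Qwen2-VL-2B-Instruct"
    output_dir: str = "./qwen-latex-ocr-finetuned"
    # Dataset configuration
    dataset_name: str = "latex-ocr-dataset"
    # Training configuration
    batch_size: int = 2
    learning_rate: float = 2e-5
    num_epochs: int = 3
    # Hardware configuration
    mixed_precision: bool = True
    gradient_checkpointing: bool = True  # enabled by default

config = TrainingConfig()
```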
Train only:

```bash
python main.py --mode train
```

Evaluate a fine-tuned model:

```bash
python main.py --mode evaluate --model_path ./path/to/finetuned/model
```

Train and then evaluate:

```bash
python main.py --mode both
```

Evaluate with custom options:

```bash
python main.py --mode evaluate \
    --model_path ./qwen-latex-ocr-finetuned/final \
    --batch_size 2 \
    --max_new_tokens 512 \
    --num_examples 500
```

- Training: Model checkpoints saved to `./qwen-latex-ocr-finetuned/`
- Final Model: Saved to `./qwen-latex-ocr-finetuned/final/`
- Evaluation Results: Saved to `evaluation_results_5.json`
- Logs: TensorBoard logs in the output directory
The evaluation computes:
- BLEU Score: Measures text similarity between predicted and reference LaTeX
- CER Score: Character error rate, i.e. the number of character-level edits (insertions, deletions, substitutions) needed to turn the predicted string into the reference, normalized by the reference length
- Exact Match: Percentage of exactly matching predictions
- Inference Time: Average time per batch
- Memory Usage: Peak GPU memory allocation
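To make the string metrics concrete, here is a minimal pure-Python sketch of CER (as Levenshtein distance over characters) and exact match; the project's evaluation code may use a metrics library instead, so these functions are illustrative:

```python
def cer(pred: str, ref: str) -> float:
    """Character error rate: edit distance between prediction and
    reference, normalized by reference length."""
    m, n = len(pred), len(ref)
    prev = list(range(n + 1))  # distances for the empty prediction prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / max(n, 1)

def exact_match(preds, refs):
    """Fraction of predictions that match the reference exactly."""
    return sum(p.strip() == r.strip() for p, r in zip(preds, refs)) / len(preds)
```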
To customize the training:
- Modify `config.py` for different hyperparameters
- Extend `LatexOCRDataset` in `data_preprocessing.py` for different data formats
- Adjust `collate_fn` for different input processing
- Modify training arguments in `train.py` for different training strategies
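For orientation, the core job of a `collate_fn` is to pad variable-length token sequences to a common batch length. The project's implementation works on processor outputs and PyTorch tensors; the sketch below shows only the padding idea in plain Python, with illustrative names:

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the batch max length and build an
    attention mask (1 = real token, 0 = padding)."""
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```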
- The code supports both full fine-tuning and quantized training
- Gradient checkpointing is enabled by default to save memory
- The model uses chat templates for consistent input formatting
- Evaluation automatically extracts assistant responses from generated text
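The assistant-response extraction can be done by splitting on the chat template's role markers. The sketch below assumes ChatML-style markers (`<|im_start|>` / `<|im_end|>`, as used by Qwen-family templates); the actual template and extraction logic in the project may differ:

```python
def extract_assistant_response(generated: str) -> str:
    """Return the text of the last assistant turn, stripped of the
    end-of-turn token. Marker strings assume a ChatML-style template."""
    marker = "<|im_start|>assistant"
    if marker in generated:
        generated = generated.split(marker)[-1]
    return generated.split("<|im_end|>")[0].strip()
```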