This project provides a modular implementation for fine-tuning vision-language models (VLMs) on LaTeX OCR tasks. Read the paper
Contributed to MinerU under Shanghai AI Lab.
- Install the required dependencies:

```bash
pip install -r requirements.txt
```

- Make sure you have CUDA available if you want to use GPU acceleration.
All configuration settings are centralized in config.py. Key settings include:
- Model Configuration: Model name, output directories
- Dataset Configuration: Dataset name and config
- Training Configuration: Batch size, learning rate, epochs, etc.
- Hardware Configuration: Mixed precision, gradient checkpointing, etc.
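As a rough illustration of how such settings can be centralized, here is a hypothetical sketch of a `config.py`; the field names and default values are illustrative assumptions, not the actual module contents:

```python
# Illustrative sketch only -- the real config.py may use different
# names, values, and structure.
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    # Model configuration (model id and dataset name are assumptions)
    model_name: str = "Qwen/Qwen2-VL-2B-Instruct"
    output_dir: str = "./qwen-latex-ocr-finetuned"
    # Dataset configuration
    dataset_name: str = "latex-ocr-dataset"
    # Training configuration
    batch_size: int = 2
    learning_rate: float = 2e-5
    num_epochs: int = 3
    # Hardware configuration
    mixed_precision: bool = True
    gradient_checkpointing: bool = True  # enabled by default

config = TrainingConfig()
```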
Train only:

```bash
python main.py --mode train
```

Evaluate a fine-tuned model:

```bash
python main.py --mode evaluate --model_path ./path/to/finetuned/model
```

Train and then evaluate:

```bash
python main.py --mode both
```

Evaluate with custom options:

```bash
python main.py --mode evaluate \
    --model_path ./qwen-latex-ocr-finetuned/final \
    --batch_size 2 \
    --max_new_tokens 512 \
    --num_examples 500
```

- Training: Model checkpoints saved to `./qwen-latex-ocr-finetuned/`
- Final Model: Saved to `./qwen-latex-ocr-finetuned/final/`
- Evaluation Results: Saved to `evaluation_results_5.json`
- Logs: TensorBoard logs in the output directory
The evaluation computes:
- BLEU Score: Measures text similarity between predicted and reference LaTeX
- CER Score: Character error rate, i.e. the number of character-level edits (insertions, deletions, substitutions) needed to turn the predicted string into the reference, normalized by the reference length
- Exact Match: Percentage of exactly matching predictions
- Inference Time: Average time per batch
- Memory Usage: Peak GPU memory allocation
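To make the string metrics concrete, here is a minimal pure-Python sketch of CER (as Levenshtein distance over characters) and exact match; the project's evaluation code may use a metrics library instead, so these functions are illustrative:

```python
def cer(pred: str, ref: str) -> float:
    """Character error rate: edit distance between prediction and
    reference, normalized by reference length."""
    m, n = len(pred), len(ref)
    prev = list(range(n + 1))  # distances for the empty prediction prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / max(n, 1)

def exact_match(preds, refs):
    """Fraction of predictions that match the reference exactly."""
    return sum(p.strip() == r.strip() for p, r in zip(preds, refs)) / len(preds)
```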
To customize the training:
- Modify `config.py` for different hyperparameters
- Extend `LatexOCRDataset` in `data_preprocessing.py` for different data formats
- Adjust `collate_fn` for different input processing
- Modify training arguments in `train.py` for different training strategies
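For orientation, the core job of a `collate_fn` is to pad variable-length token sequences to a common batch length. The project's implementation works on processor outputs and PyTorch tensors; the sketch below shows only the padding idea in plain Python, with illustrative names:

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the batch max length and build an
    attention mask (1 = real token, 0 = padding)."""
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```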
- The code supports both full fine-tuning and quantized training
- Gradient checkpointing is enabled by default to save memory
- The model uses chat templates for consistent input formatting
- Evaluation automatically extracts assistant responses from generated text
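The assistant-response extraction can be done by splitting on the chat template's role markers. The sketch below assumes ChatML-style markers (`<|im_start|>` / `<|im_end|>`, as used by Qwen-family templates); the actual template and extraction logic in the project may differ:

```python
def extract_assistant_response(generated: str) -> str:
    """Return the text of the last assistant turn, stripped of the
    end-of-turn token. Marker strings assume a ChatML-style template."""
    marker = "<|im_start|>assistant"
    if marker in generated:
        generated = generated.split(marker)[-1]
    return generated.split("<|im_end|>")[0].strip()
```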