
LaTeX OCR Fine-tuning

SWIN-mBART

This project provides a modular implementation for fine-tuning vision-language models (VLMs) on LaTeX OCR tasks. Read the paper

Contributed to MinerU, a project under Shanghai AI Lab.

Results

[Image: results overview]

Visualized comparison of a simple test case:

[Image: prediction vs. reference comparison]

Installation

  1. Install the required dependencies:

     pip install -r requirements.txt

  2. Make sure you have CUDA available if you want to use GPU acceleration.

Configuration

All configuration settings are centralized in config.py. Key settings include:

  • Model Configuration: Model name, output directories
  • Dataset Configuration: Dataset name and config
  • Training Configuration: Batch size, learning rate, epochs, etc.
  • Hardware Configuration: Mixed precision, gradient checkpointing, etc.
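
A minimal sketch of what `config.py` might look like, grouped by the categories above. The key names and values here are illustrative assumptions; the actual repository may use different identifiers and defaults.

```python
# Illustrative shape of config.py -- keys and values are assumptions,
# not the repository's actual settings.

MODEL_CONFIG = {
    "model_name": "placeholder-vlm",          # base model to fine-tune
    "output_dir": "./qwen-latex-ocr-finetuned",
}

DATASET_CONFIG = {
    "dataset_name": "placeholder-latex-ocr",  # dataset identifier
    "dataset_config": "default",
}

TRAINING_CONFIG = {
    "batch_size": 2,
    "learning_rate": 2e-5,
    "num_epochs": 3,
}

HARDWARE_CONFIG = {
    "mixed_precision": "bf16",
    "gradient_checkpointing": True,
}
```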

Usage

Training Only

python main.py --mode train

Evaluation Only

python main.py --mode evaluate --model_path ./path/to/finetuned/model

Both Training and Evaluation

python main.py --mode both

Custom Evaluation Parameters

python main.py --mode evaluate \
    --model_path ./qwen-latex-ocr-finetuned/final \
    --batch_size 2 \
    --max_new_tokens 512 \
    --num_examples 500
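
The CLI above can be sketched with `argparse`. This is a hedged reconstruction of what `main.py`'s argument parsing might look like, based only on the flags shown in the commands; the real parser may define additional options or defaults.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI matching the documented flags (defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="LaTeX OCR fine-tuning")
    parser.add_argument("--mode", choices=["train", "evaluate", "both"],
                        default="both", help="which pipeline stage(s) to run")
    parser.add_argument("--model_path", default="./qwen-latex-ocr-finetuned/final",
                        help="path to a fine-tuned checkpoint for evaluation")
    parser.add_argument("--batch_size", type=int, default=2)
    parser.add_argument("--max_new_tokens", type=int, default=512)
    parser.add_argument("--num_examples", type=int, default=500)
    return parser
```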

Output

  • Training: Model checkpoints saved to ./qwen-latex-ocr-finetuned/
  • Final Model: Saved to ./qwen-latex-ocr-finetuned/final/
  • Evaluation Results: Saved to evaluation_results_5.json
  • Logs: TensorBoard logs in the output directory

Metrics

The evaluation computes:

  • BLEU Score: Measures n-gram overlap between the predicted and reference LaTeX
  • CER (Character Error Rate): Number of character-level edits needed to turn the prediction into the reference, normalized by the reference length
  • Exact Match: Percentage of exactly matching predictions
  • Inference Time: Average time per batch
  • Memory Usage: Peak GPU memory allocation
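
The CER and exact-match metrics above can be sketched in a few lines. The function names are illustrative, not the project's actual API; CER is computed here as Levenshtein edit distance divided by reference length.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def cer(prediction: str, reference: str) -> float:
    """Character error rate: edits / reference length."""
    return levenshtein(prediction, reference) / max(len(reference), 1)


def exact_match(predictions, references) -> float:
    """Fraction of predictions that match their reference exactly."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)
```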

Customization

To customize the training:

  1. Modify config.py for different hyperparameters
  2. Extend LatexOCRDataset in data_preprocessing.py for different data formats
  3. Adjust collate_fn for different input processing
  4. Modify training arguments in train.py for different training strategies
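
As a sketch of step 2, a subclass of `LatexOCRDataset` could add support for a JSON Lines format. The base class shown here is a hypothetical stand-in; the real class in `data_preprocessing.py` may expose a different interface.

```python
import json


class LatexOCRDataset:
    """Hypothetical base: maps image paths to LaTeX strings."""

    def __init__(self, samples):
        self.samples = samples  # list of {"image": path, "latex": str}

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


class JsonlLatexOCRDataset(LatexOCRDataset):
    """Load samples from a JSON Lines file instead of the default format."""

    def __init__(self, jsonl_path):
        with open(jsonl_path) as f:
            samples = [json.loads(line) for line in f if line.strip()]
        super().__init__(samples)
```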

Notes

  • The code supports both full fine-tuning and quantized training
  • Gradient checkpointing is enabled by default to save memory
  • The model uses chat templates for consistent input formatting
  • Evaluation automatically extracts assistant responses from generated text
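
The last note, extracting the assistant response from chat-template output, might look like the following. The marker string is an assumption: the actual delimiter depends on the model's chat template.

```python
def extract_assistant_response(generated: str, marker: str = "assistant\n") -> str:
    """Return the text after the last assistant marker (marker is an assumed default)."""
    if marker in generated:
        return generated.rsplit(marker, 1)[-1].strip()
    return generated.strip()
```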