VGT fine-tuning pipeline

Adapted from Huridocs' pdf-document-layout-analysis service, this repository is streamlined for a single purpose: fine-tuning the Vision Grid Transformer (VGT) on DocLayNet-style datasets. It is designed to be run on Modal.com.

Overview

Repository structure

  • Entrypoint to Modal app in modal_train.py.
  • Configuration files in src/configuration.py and src/model_configuration/.
  • Core model and trainer in src/ditod/ (e.g., VGTTrainer.py, VGT.py, dataset_mapper.py, Wordnn_embedding.py).
  • Bros tokenizer in src/bros/.
  • PDF feature parsing in src/pdf_features/.
  • Data pre-processing utilities in src/vgt/.
  • Utility to download a fine-tunable VGT model in src/download_models.py.

Models

By default, we use the "microsoft/layoutlm-base-uncased" model for the word-grid embedding and the "HURIDOCS/pdf-document-layout-analysis" model as the base VGT model.
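
If you want to confirm the embedding model is reachable from your environment, a minimal sketch using the Hugging Face transformers library is shown below. This is only a connectivity check, not how the pipeline loads the model internally:

# Minimal sketch (not part of the pipeline) to confirm the embedding model
# referenced above can be fetched from the Hugging Face Hub.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = AutoModel.from_pretrained("microsoft/layoutlm-base-uncased")
print(type(model).__name__, "hidden size:", model.config.hidden_size)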

Data format

The dataset is expected to be in COCO format, which is a standard format for object detection datasets. It consists of a JSON file with the following structure:

{
  "info": { ... }, // dataset metadata (optional)
  "licenses": [ ... ], // image licenses (optional)
  "images": [ // image list
    {
      "id": 1, // image id
      "file_name": "your_document_001.png", // relative file name
      "height": 1000, // px
      "width": 750 // px
    },
    // ...
  ],

  "categories": [ // class labels (ids start at 1)
    {
      "id": 1, // category id
      "name": "header", // class name
      "supercategory": "layout" // optional group
    },
    {
      "id": 2,
      "name": "custom_table",
      "supercategory": "layout"
    },
    // ...
  ],

  "annotations": [ // bounding boxes
    {
      "id": 1, // annotation id
      "image_id": 1, // refs images.id
      "category_id": 2, // refs categories.id
      "bbox": [100, 200, 50, 75], // [x, y, w, h] in px
      "area": 3750, // w*h
      "iscrowd": 0, // 0 for detection
      "segmentation": [] // empty for boxes
    },
    // ...
  ]
}

Data is expected to be in datasets/<your_dataset_name>, with images and annotations subdirectories. images should contain train and val subdirectories, and annotations should contain train.json and val.json files.
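
If you assemble the COCO files yourself, a quick stdlib-only sanity check like the sketch below (paths are examples) can catch broken cross-references before training:

import json

with open("datasets/my_dataset/annotations/train.json") as f:  # example path
    coco = json.load(f)

image_ids = {img["id"] for img in coco["images"]}
category_ids = {cat["id"] for cat in coco["categories"]}

for ann in coco["annotations"]:
    assert ann["image_id"] in image_ids, f"annotation {ann['id']}: unknown image_id"
    assert ann["category_id"] in category_ids, f"annotation {ann['id']}: unknown category_id"
    x, y, w, h = ann["bbox"]
    assert w > 0 and h > 0, f"annotation {ann['id']}: degenerate bbox"

print(f"OK: {len(coco['images'])} images, {len(coco['annotations'])} annotations")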

Utilities are provided to help with dataset preparation; see "Usage > Data prep" below.

Usage

Setup

  1. Install system packages (Ubuntu):
sudo apt-get install -y libgomp1 ffmpeg libsm6 libxext6 pdftohtml git ninja-build g++ qpdf
  2. Install Python deps:
uv sync
uv pip install git+https://github.com/facebookresearch/detectron2.git@70f454304e1a38378200459dd2dbca0f0f4a5ab4
uv pip install pycocotools==2.0.8
  3. Download the fine-tunable VGT model:
uv run python -m src.download_models doclaynet

Data prep

We've created a GUI Document Layout Annotation Editor for preparing training data. That tool will export an array of flat JSON objects with the following keys: left, top, width, height, page_number, page_width, page_height, text, type, id.
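
For illustration only, a single record from that export might look like the following (the values and units are invented; the authoritative format is whatever the editor writes and convert_to_coco.py reads):

# Hypothetical example of one exported annotation object; values are made up.
example_record = {
    "id": 1,
    "page_number": 1,
    "page_width": 612,
    "page_height": 792,
    "left": 72,      # box geometry, assumed to use the same units as page_width/page_height
    "top": 90,
    "width": 468,
    "height": 24,
    "text": "Quarterly Report",
    "type": "header",  # label name
}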

src/vgt/convert_to_coco.py is a command-line tool that converts JSON files in this format to COCO. Run it per document to produce per-document COCO JSON and page images:

uv run python -m src.vgt.convert_to_coco \
    --pdf /path/to/foo.pdf \
    --json /path/to/foo.json
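
Since the converter runs once per document, a small driver script can loop over your PDF/JSON pairs. A hypothetical sketch (directory locations are examples; it assumes each foo.pdf has a matching foo.json):

import subprocess
from pathlib import Path

pdf_dir = Path("/path/to/pdfs")            # example locations
json_dir = Path("/path/to/editor_exports")

for pdf in sorted(pdf_dir.glob("*.pdf")):
    json_path = json_dir / f"{pdf.stem}.json"
    subprocess.run(
        ["uv", "run", "python", "-m", "src.vgt.convert_to_coco",
         "--pdf", str(pdf), "--json", str(json_path)],
        check=True,
    )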

Then aggregate per-document outputs into train/val (and optional test) splits with consistent categories and unique IDs using src/vgt/aggregate_coco_splits.py:

uv run python -m src.vgt.aggregate_coco_splits \
  --in-annotations-dir /path/to/datasets/my_dataset/annotations \
  --in-images-root /path/to/datasets/my_dataset/images \
  --out-root /path/to/datasets/my_dataset \
  --val-ratio 0.2 \
  --test-ratio 0.0 \
  --seed 42

This creates:

/path/to/datasets/my_dataset/
├── images/
│   ├── train/
│   │   └── {doc_base}/page_0001.jpg ...
│   └── val/
│       └── {doc_base}/page_0007.jpg ...
└── annotations/
    ├── train.json
    └── val.json
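
Before moving on, it can be worth verifying that every image referenced by each split's JSON exists on disk. A stdlib-only sketch, assuming file_name values are stored relative to the split's images directory as in the tree above:

import json
from pathlib import Path

root = Path("/path/to/datasets/my_dataset")  # example path
for split in ("train", "val"):
    coco = json.loads((root / "annotations" / f"{split}.json").read_text())
    missing = [img["file_name"] for img in coco["images"]
               if not (root / "images" / split / img["file_name"]).exists()]
    print(f"{split}: {len(coco['images'])} images listed, {len(missing)} missing")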

Run src.vgt.create_word_grid on each split to produce word-grid pickles used by the dataset mapper:

uv run python -m src.vgt.create_word_grid \
  --images_dir /my_dataset/images/train \
  --annotations /my_dataset/annotations/train.json \
  --output_dir /my_dataset/word_grids/train

uv run python -m src.vgt.create_word_grid \
  --images_dir /my_dataset/images/val \
  --annotations /my_dataset/annotations/val.json \
  --output_dir /my_dataset/word_grids/val

Make sure the tokenizer vocab is available (BROS vocab will be downloaded automatically).
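
To spot-check the output, the word-grid files are ordinary Python pickles, so a quick inspection sketch like the one below works (the .pkl suffix and the exact contents are assumptions; create_word_grid.py defines the real structure):

import pickle
from pathlib import Path

# Assumes a .pkl suffix; adjust the glob if the script writes a different one.
path = next(Path("/my_dataset/word_grids/train").glob("*.pkl"))
with open(path, "rb") as f:
    grid = pickle.load(f)

print(path.name, type(grid))
if isinstance(grid, dict):
    print(sorted(grid.keys()))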

Category definition and NUM_CLASSES

  • Look at src/configuration.py for DOCLAYNET_TYPE_BY_ID. This shows the 11 categories the base model was trained on.
  • If your categories are the same, keep MODEL.ROI_HEADS.NUM_CLASSES: 11 in src/model_configuration/doclaynet_VGT_cascade_PTM.yaml.
  • If you change categories, ensure your aggregated COCO categories reflect them and set MODEL.ROI_HEADS.NUM_CLASSES accordingly. Example:
MODEL:
  ROI_HEADS:
    NUM_CLASSES: 7
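
A quick way to double-check the value: NUM_CLASSES should equal the number of entries in the "categories" list of your aggregated COCO files. For example (path is an example):

import json

with open("/path/to/datasets/my_dataset/annotations/train.json") as f:
    categories = json.load(f)["categories"]

print(len(categories), "classes:", [c["name"] for c in categories])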

Training configuration and dataset names

src/model_configuration/doclaynet_VGT_cascade_PTM.yaml is the default model configuration file. The dataset names are set to match what modal_train.py registers (doc_train, doc_val). If you register different names, update the YAML:

DATASETS:
  TRAIN: ("doc_train",)
  TEST: ("doc_val",)

When calling Modal training, point to the split directories and JSONs from the aggregator:

modal run modal_train.py -- \
  --train_images /vol/datasets/my_dataset/images/train \
  --train_ann /vol/datasets/my_dataset/annotations/train.json \
  --val_images /vol/datasets/my_dataset/images/val \
  --val_ann /vol/datasets/my_dataset/annotations/val.json \
  --config src/model_configuration/doclaynet_VGT_cascade_PTM.yaml \
  --output /vol/outputs/run1

Best Practices and Potential Pitfalls

  • Start Small: Before converting your entire dataset, try it with a small subset (e.g., 50 images) to ensure your data formatting and training pipeline work end-to-end.
  • Data Quality: The success of fine-tuning is highly dependent on the quality of your annotations. Ensure your bounding boxes are tight and labels are consistent.
  • Learning Rate: The learning rate is the most important hyperparameter. If your model doesn't learn, try lowering it. If it learns too slowly, you can try raising it slightly.
  • Freezing Layers: For a very small dataset, you might get better results by "freezing" the backbone of the network and only training the final layers (the ROI heads). This prevents the model from "forgetting" the powerful low-level features it learned from DocLayNet. This is an advanced technique that may require code changes; see the sketch after this list.
  • Class Imbalance: If your dataset has many instances of one class (e.g., Text) and very few of another (e.g., Logo), the model may struggle with the rare class. Look into techniques like class-weighted loss if this becomes an issue.
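
The "freezing" idea above can be prototyped with plain PyTorch by switching off gradients for backbone parameters. A generic sketch; the "backbone." prefix is an assumption, so inspect model.named_parameters() in VGTTrainer to find the right prefixes for VGT:

def freeze_backbone(model, prefixes=("backbone.",)):
    """Disable gradients for parameters whose names start with the given prefixes."""
    frozen = 0
    for name, param in model.named_parameters():
        if name.startswith(prefixes):
            param.requires_grad = False
            frozen += 1
    print(f"Froze {frozen} parameter tensors")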

Using the fine-tuned model

Once training is complete, your best model will be saved as model_final.pth in your OUTPUT_DIR. To use it for inference, deploy it with our PDF Document Layout Analysis API service.
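
As a quick sanity check that training produced a usable checkpoint, the file can be opened with PyTorch directly (the exact keys depend on detectron2's checkpointer, so treat this as an inspection sketch only):

import torch

ckpt = torch.load("/vol/outputs/run1/model_final.pth",  # example path
                  map_location="cpu", weights_only=False)
print(list(ckpt.keys()))
if "model" in ckpt:
    print(len(ckpt["model"]), "tensors in the model state dict")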
