Adapted from Huridocs' pdf-document-layout-analysis service, this repository is streamlined for a single purpose: fine-tuning the Vision Grid Transformer (VGT) on DocLayNet-style datasets. It is designed to be run on Modal.com.
- Entrypoint to the Modal app in `modal_train.py`.
- Configuration files in `src/configuration.py` and `src/model_configuration/`.
- Core model and trainer in `src/ditod/` (e.g., `VGTTrainer.py`, `VGT.py`, `dataset_mapper.py`, `Wordnn_embedding.py`).
- BROS tokenizer in `src/bros/`.
- PDF feature parsing in `src/pdf_features/`.
- Data pre-processing utilities in `src/vgt/`.
- Utility to download a fine-tunable VGT model in `src/download_models.py`.
By default, we use the `microsoft/layoutlm-base-uncased` model for the word-grid embedding and the `HURIDOCS/pdf-document-layout-analysis` model as the base VGT model.
The dataset is expected to be in COCO format, a standard JSON-based format for object detection datasets.
Data is expected to be in `datasets/<your_dataset_name>`, with `images` and `annotations` subdirectories. `images` should contain `train` and `val` subdirectories, and `annotations` should contain `train.json` and `val.json` files.
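For reference, each annotations file (e.g., `train.json`) follows the standard COCO layout; the snippet below is a minimal sketch with illustrative values, not data from this repository:

```json
{
  "images": [
    {"id": 1, "file_name": "doc_a/page_0001.jpg", "width": 1224, "height": 1584}
  ],
  "annotations": [
    {"id": 1, "image_id": 1, "category_id": 1,
     "bbox": [84.0, 132.5, 412.0, 58.0], "area": 23896.0, "iscrowd": 0}
  ],
  "categories": [
    {"id": 1, "name": "Text"}
  ]
}
```

Note that COCO's `bbox` is `[x, y, width, height]` in image pixels.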
Utilities are provided to help with dataset preparation; see "Usage > Data prep" below.
- Install system packages (Ubuntu):

  ```bash
  sudo apt-get install -y libgomp1 ffmpeg libsm6 libxext6 pdftohtml git ninja-build g++ qpdf
  ```

- Install Python dependencies:

  ```bash
  uv sync
  uv pip install git+https://github.com/facebookresearch/detectron2.git@70f454304e1a38378200459dd2dbca0f0f4a5ab4
  uv pip install pycocotools==2.0.8
  ```

- Download the fine-tunable VGT model:

  ```bash
  uv run python -m src.download_models doclaynet
  ```
We've created a GUI Document Layout Annotation Editor for preparing training data. That tool exports an array of flat JSON objects with the following keys: `left`, `top`, `width`, `height`, `page_number`, `page_width`, `page_height`, `text`, `type`, `id`.
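An exported file looks roughly like this (a single-box sketch; all values are illustrative):

```json
[
  {
    "left": 84.0,
    "top": 132.5,
    "width": 412.0,
    "height": 58.0,
    "page_number": 1,
    "page_width": 612,
    "page_height": 792,
    "text": "Lorem ipsum dolor sit amet...",
    "type": "Text",
    "id": "box-0001"
  }
]
```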
`src/vgt/convert_to_coco.py` is a command-line tool that converts JSON files in this format to COCO. Run it per document to produce per-document COCO JSON and page images:

```bash
uv run python -m src.vgt.convert_to_coco \
  --pdf /path/to/foo.pdf \
  --json /path/to/foo.json
```
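With many documents, you can drive the converter in a loop; the snippet below is a sketch that assumes each PDF has a matching annotation file with the same basename (`foo.pdf` ↔ `foo.json`):

```bash
# Convert every PDF/JSON pair in a directory of editor exports.
for pdf in /path/to/exports/*.pdf; do
  uv run python -m src.vgt.convert_to_coco \
    --pdf "$pdf" \
    --json "${pdf%.pdf}.json"
done
```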
Then aggregate the per-document outputs into train/val (and optional test) splits with consistent categories and unique IDs using `src/vgt/aggregate_coco_splits.py`:

```bash
uv run python -m src.vgt.aggregate_coco_splits \
  --in-annotations-dir /path/to/datasets/my_dataset/annotations \
  --in-images-root /path/to/datasets/my_dataset/images \
  --out-root /path/to/datasets/my_dataset \
  --val-ratio 0.2 \
  --test-ratio 0.0 \
  --seed 42
```
This creates:

```
/path/to/datasets/my_dataset/
├── images/
│   ├── train/
│   │   └── {doc_base}/page_0001.jpg ...
│   └── val/
│       └── {doc_base}/page_0007.jpg ...
└── annotations/
    ├── train.json
    └── val.json
```
Run `src.vgt.create_word_grid` on each split to produce the word-grid pickles used by the dataset mapper:

```bash
uv run python -m src.vgt.create_word_grid \
  --images_dir /my_dataset/images/train \
  --annotations /my_dataset/annotations/train.json \
  --output_dir /my_dataset/word_grids/train

uv run python -m src.vgt.create_word_grid \
  --images_dir /my_dataset/images/val \
  --annotations /my_dataset/annotations/val.json \
  --output_dir /my_dataset/word_grids/val
```

The tokenizer vocabulary must be available; the BROS vocab is downloaded automatically on first use.
- Look at `src/configuration.py` for `DOCLAYNET_TYPE_BY_ID`. This shows the 11 categories the base model was trained on.
- If your categories are the same, keep `MODEL.ROI_HEADS.NUM_CLASSES: 11` in `src/model_configuration/doclaynet_VGT_cascade_PTM.yaml`.
- If you change categories, ensure your aggregated COCO `categories` reflect them and set `MODEL.ROI_HEADS.NUM_CLASSES` accordingly. Example:

  ```yaml
  MODEL:
    ROI_HEADS:
      NUM_CLASSES: 7
  ```
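A quick way to confirm the class count matches is to inspect the aggregated annotations; this is a small sketch (the dataset path is illustrative):

```python
import json

# Load the aggregated COCO annotations and list the categories;
# NUM_CLASSES must equal the number of entries.
with open("datasets/my_dataset/annotations/train.json") as f:
    coco = json.load(f)

for cat in sorted(coco["categories"], key=lambda c: c["id"]):
    print(cat["id"], cat["name"])
print("Set MODEL.ROI_HEADS.NUM_CLASSES to:", len(coco["categories"]))
```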
`src/model_configuration/doclaynet_VGT_cascade_PTM.yaml` is the default model configuration file. The dataset names are set to match what `modal_train.py` registers (`doc_train`, `doc_val`). If you register different names, update the YAML:

```yaml
DATASETS:
  TRAIN: ("doc_train",)
  TEST: ("doc_val",)
```
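For context, detectron2 dataset registration typically looks like the sketch below; `modal_train.py` handles this for `doc_train`/`doc_val` already, so this is only illustrative (the paths are placeholders):

```python
from detectron2.data.datasets import register_coco_instances

# Register COCO-format datasets under the names the YAML refers to.
register_coco_instances(
    "doc_train", {},
    "/vol/datasets/my_dataset/annotations/train.json",
    "/vol/datasets/my_dataset/images/train",
)
register_coco_instances(
    "doc_val", {},
    "/vol/datasets/my_dataset/annotations/val.json",
    "/vol/datasets/my_dataset/images/val",
)
```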
When calling Modal training, point to the split directories and JSONs from the aggregator:

```bash
modal run modal_train.py -- \
  --train_images /vol/datasets/my_dataset/images/train \
  --train_ann /vol/datasets/my_dataset/annotations/train.json \
  --val_images /vol/datasets/my_dataset/images/val \
  --val_ann /vol/datasets/my_dataset/annotations/val.json \
  --config src/model_configuration/doclaynet_VGT_cascade_PTM.yaml \
  --output /vol/outputs/run1
```
- Start Small: Before converting your entire dataset, try it with a small subset (e.g., 50 images) to ensure your data formatting and training pipeline work end-to-end.
- Data Quality: The success of fine-tuning is highly dependent on the quality of your annotations. Ensure your bounding boxes are tight and labels are consistent.
- Learning Rate: The learning rate is the most important hyperparameter. If your model doesn't learn, try lowering it; if it learns too slowly, try raising it slightly (see the config sketch after this list).
- Freezing Layers: For a very small dataset, you might get better results by "freezing" the backbone of the network and only training the final layers (the ROI heads). This prevents the model from "forgetting" the powerful low-level features it learned from DocLayNet. This is an advanced technique that may require code changes.
- Class Imbalance: If your dataset has many instances of one class (e.g., `Text`) and very few of another (e.g., `Logo`), the model may struggle with the rare class. Look into techniques like class-weighted loss if this becomes an issue.
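For the learning-rate tip above, the relevant knobs live under detectron2's `SOLVER` section of the YAML config; the values below are illustrative starting points to tune, not recommendations:

```yaml
SOLVER:
  BASE_LR: 0.0001    # lower (e.g., 0.00005) if loss diverges; raise slightly if learning is too slow
  MAX_ITER: 20000    # total training iterations
  IMS_PER_BATCH: 4   # global batch size across all GPUs
```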
Once training is complete, your best model will be saved as `model_final.pth` in your `OUTPUT_DIR`. To use it for inference, deploy it with our PDF Document Layout Analysis API service.