FormNet is a state-of-the-art document extraction system designed to parse complex forms with irregular layouts, tables, and columns. It significantly improves upon standard sequence models by leveraging Rich Attention for 2D spatial reasoning and Graph Convolutional Networks (GCN) for structural context.
- Spatial Embedding: Learns rich representations from 2D token coordinates.
- Rich Attention: A modified self-attention mechanism that captures direct spatial relationships between tokens.
- GCN Super-Tokens: Refines token representations using a spatial graph (K-NN) to aggregate neighborhood context.
- Robust Serialization: Minimizes layout-serialization errors common in flat BERT-like models.
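Rich Attention can be thought of as ordinary self-attention with extra score terms that depend on token geometry. FormNet learns these terms; the sketch below hardcodes a simple distance penalty purely to illustrate the idea (all names here are illustrative, not this repository's API):

```python
import numpy as np

def spatial_biased_attention(q, k, v, centers, sigma=50.0):
    """Toy self-attention with a 2D distance bias (illustration only).

    q, k, v: (n, d) arrays; centers: (n, 2) token box centers.
    Spatially closer tokens receive a smaller penalty, loosely mimicking
    how a spatial-aware attention down-weights far-apart token pairs.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                       # (n, n) content scores
    dist = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    scores = scores - (dist / sigma) ** 2               # spatial penalty term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ v
```

In the real model the spatial terms are learned functions of token order and distance rather than a fixed Gaussian penalty.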
The system is built with modular components:
- Embedding Layer: Fuses DistilBERT text embeddings with learnable spatial embeddings.
- Rich Attention: Incorporates 2D spatial biases into the self-attention mechanism, allowing the model to "see" the layout.
- GCN Module: Refines token representations using a K-Nearest Neighbors (KNN) graph constructed from bounding box coordinates.
- Decoder: A classification head for BIO tagging (Entity Extraction) or sequence generation.
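As a rough illustration of how the GCN's spatial K-NN graph might be built from bounding boxes (an independent sketch, not this repository's implementation):

```python
import numpy as np

def knn_graph(bboxes, k=3):
    """Build a K-NN adjacency matrix from [x0, y0, x1, y1] boxes.

    Each token is connected to its k nearest neighbors by box-center
    distance; a GCN then aggregates features over these edges.
    """
    boxes = np.asarray(bboxes, dtype=float)
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    dist = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                 # exclude self-edges
    n = len(boxes)
    adj = np.zeros((n, n), dtype=int)
    nearest = np.argsort(dist, axis=1)[:, :k]      # k closest per token
    adj[np.repeat(np.arange(n), k), nearest.ravel()] = 1
    return adj
```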
For a detailed explanation of the codebase, see the Developer Guide.
- Clone the repository:

  git clone https://github.com/pronzzz/docxformnet.git
  cd docxformnet

- Install dependencies:

  pip3 install -r requirements.txt
FormNet supports CORD (Receipts) and FUNSD (Forms) datasets.
python3 data/download_datasets.py

The script automatically handles dataset downloading and formatting using the Hugging Face datasets library.
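The formatting step produces token-level BIO labels for the tagging head. As an illustration of the labeling scheme only (not the download script's actual code), converting entity spans to BIO tags looks like:

```python
def spans_to_bio(num_tokens, entities):
    """Convert entity spans to per-token BIO tags.

    entities: list of (start, end, label) with end exclusive,
    e.g. (1, 3, "menu.nm") tags token 1 as B-menu.nm and token 2
    as I-menu.nm; untagged tokens stay "O".
    """
    tags = ["O"] * num_tokens
    for start, end, label in entities:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

spans_to_bio(5, [(1, 3, "menu.nm")])
# -> ["O", "B-menu.nm", "I-menu.nm", "O", "O"]
```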
Train the model on the CORD dataset:
python3 train.py --data_dir data/CORD --epochs 10 --batch_size 4

This command will:
- Initialize the dataset and tokenizer.
- Construct spatial graphs on-the-fly during batch collation.
- Train the FormNet model and save checkpoints to checkpoints/.
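Building graphs during collation means each mini-batch pads its sequences and constructs per-sample spatial graphs at load time rather than in preprocessing. A minimal, framework-free sketch of the padding side (names are illustrative; the repository's collator differs):

```python
def collate_batch(samples, pad_id=0):
    """Pad variable-length samples to a common length (illustration only).

    Each sample is a dict with 'input_ids' (list[int]) and 'bboxes'
    (list of [x0, y0, x1, y1] boxes); a real collator would also build
    the K-NN spatial graph for each sample at this point.
    """
    max_len = max(len(s["input_ids"]) for s in samples)
    batch = {"input_ids": [], "bboxes": [], "attention_mask": []}
    for s in samples:
        pad = max_len - len(s["input_ids"])
        batch["input_ids"].append(s["input_ids"] + [pad_id] * pad)
        batch["bboxes"].append(s["bboxes"] + [[0, 0, 0, 0]] * pad)
        batch["attention_mask"].append([1] * len(s["input_ids"]) + [0] * pad)
    return batch
```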
Evaluate a trained model on the test set:
python3 evaluate.py --checkpoint_path checkpoints/model_epoch_10.pt --data_dir data/CORD

You can visualize the model's understanding of the document layout (bounding boxes and labels):
from utils.visualize import visualize_document
# ... load your data ...
visualize_document("path/to/image.png", words, bboxes, labels)

This project is licensed under the MIT License - see the LICENSE file for details.