FormNet: Document Extraction with Rich Attention and GCN

FormNet is a document extraction system designed to parse complex forms with irregular layouts, tables, and columns. It extends a standard sequence model with Rich Attention for 2D spatial reasoning and Graph Convolutional Networks (GCN) for structural context.

🚀 Key Features

Spatial Embedding: Learns rich representations from 2D token coordinates.
Rich Attention: A modified self-attention mechanism that captures direct spatial relationships between tokens.
GCN Super-Tokens: Refines token representations using a K-Nearest Neighbors (KNN) spatial graph to aggregate neighborhood context.
Robust Serialization: Minimizes layout-serialization errors common in flat BERT-like models.
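To make the Rich Attention idea concrete, here is a minimal sketch of self-attention with an additive 2D distance penalty. It is an illustration under assumptions, not the code in this repository: the function name, the `distance_weight` parameter, and the use of a fixed Euclidean penalty are all hypothetical, and the real module may compute its spatial bias differently.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def spatially_biased_attention(queries, keys, values, centers, distance_weight=1.0):
    """Scaled dot-product attention with an additive 2D distance penalty.

    queries, keys, values: one float vector per token.
    centers: one (x, y) token center per token, e.g. normalized to [0, 1].
    distance_weight: hypothetical fixed scalar; a trained model would learn its bias.
    Returns (outputs, attention_weights).
    """
    dim = len(queries[0])
    out_dim = len(values[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            content = sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            distance = math.dist(centers[i], centers[j])
            # Tokens that are far apart on the page get a lower score.
            scores.append(content - distance_weight * distance)
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(out_dim)]
        )
    return outputs, all_weights
```

With identical queries and keys, the attention weights depend only on the layout, so a token attends most to itself and least to the token farthest away on the page.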

🏗 Architecture Overview

The system is built with modular components:

Embedding Layer: Fuses DistilBERT text embeddings with learnable spatial embeddings.
Rich Attention: Incorporates 2D spatial biases into the self-attention mechanism, allowing the model to "see" the layout.
GCN Module: Refines token representations using a K-Nearest Neighbors (KNN) graph constructed from bounding box coordinates.
Decoder: A classification head for BIO tagging (Entity Extraction) or sequence generation.
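The GCN Module step can be sketched roughly as follows: build a KNN graph from bounding-box centers, then average each token's features with those of its neighbors. This is a simplified, parameter-free stand-in for a graph convolution, and the function names, the `(x0, y0, x1, y1)` box format, and the default `k` are assumptions made for illustration rather than the repository's actual API.

```python
import math


def bbox_center(bbox):
    """Center of an (x0, y0, x1, y1) bounding box (assumed format)."""
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def build_knn_edges(bboxes, k=2):
    """Return directed edges (i, j) from each token i to its k nearest tokens j."""
    centers = [bbox_center(b) for b in bboxes]
    edges = []
    for i, center in enumerate(centers):
        others = [j for j in range(len(centers)) if j != i]
        others.sort(key=lambda j: math.dist(center, centers[j]))
        edges.extend((i, j) for j in others[:k])
    return edges


def mean_aggregate(features, edges):
    """Average each token's feature vector with those of its graph neighbors.

    A learned GCN layer would also apply a weight matrix and a nonlinearity;
    both are omitted here to keep the sketch short.
    """
    neighbors = [[i] for i in range(len(features))]  # self-loop for every node
    for i, j in edges:
        neighbors[i].append(j)
    dim = len(features[0])
    return [
        [sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        for nbrs in neighbors
    ]
```

For example, `mean_aggregate(features, build_knn_edges(bboxes, k=1))` mixes each token with its single closest neighbor on the page, regardless of where that neighbor falls in the serialized token order.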

📂 Project Structure

For a detailed explanation of the codebase, see the Developer Guide (GUIDE.md).

🛠 Installation

Clone the repository:

```
git clone https://github.com/pronzzz/docxformnet.git
cd docxformnet
```

Install dependencies:
```
pip3 install -r requirements.txt
```

📊 Dataset Preparation

FormNet supports the CORD (receipts) and FUNSD (forms) datasets.

```
python3 data/download_datasets.py
```

The script automatically handles dataset downloading and formatting using the Hugging Face datasets library.

🏋️‍♀️ Training

Train the model on the CORD dataset:

```
python3 train.py --data_dir data/CORD --epochs 10 --batch_size 4
```

This command will:

Initialize the dataset and tokenizer.
Construct spatial graphs on the fly during batch collation.
Train the FormNet model and save checkpoints to checkpoints/.
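As a rough picture of what building graphs during batch collation can look like, the sketch below pads token IDs to a common length and computes one KNN edge list per document. The example keys (`input_ids`, `bboxes`), the `PAD_ID` value, and the function names are assumptions for illustration; the collate function in `train.py` may be organized differently.

```python
import math

PAD_ID = 0  # assumed padding token id


def knn_edges(bboxes, k):
    """Directed KNN edges over the centers of (x0, y0, x1, y1) boxes."""
    centers = [((x0 + x1) / 2, (y0 + y1) / 2) for x0, y0, x1, y1 in bboxes]
    edges = []
    for i, center in enumerate(centers):
        ranked = sorted(
            (j for j in range(len(centers)) if j != i),
            key=lambda j: math.dist(center, centers[j]),
        )
        edges.extend((i, j) for j in ranked[:k])
    return edges


def collate_with_graphs(examples, k=2):
    """Pad a list of examples and attach a per-document spatial graph."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "edges": []}
    for ex in examples:
        n = len(ex["input_ids"])
        pad = max_len - n
        batch["input_ids"].append(ex["input_ids"] + [PAD_ID] * pad)
        batch["attention_mask"].append([1] * n + [0] * pad)
        # The graph is built here, per batch, rather than stored on disk.
        batch["edges"].append(knn_edges(ex["bboxes"], k))
    return batch
```

A function of this shape could be passed as `collate_fn` to a PyTorch `DataLoader`, with the lists converted to tensors before they reach the model.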

📈 Evaluation

Evaluate a trained model on the test set:

```
python3 evaluate.py --checkpoint_path checkpoints/model_epoch_10.pt --data_dir data/CORD
```
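This README does not state which metric `evaluate.py` reports. For BIO-tagged output, a common choice is entity-level F1, which the sketch below illustrates: decode tags into spans, then compare predicted spans with gold spans. The function names and the `TOTAL` / `DATE` labels in the usage note are hypothetical.

```python
def bio_to_spans(tags):
    """Convert BIO tags into (label, start, end) spans, with end exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            # A new entity begins; a stray I- tag is treated like a B- tag.
            if label is not None:
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag == "O":
            if label is not None:
                spans.append((label, start, i))
            start, label = None, None
        # A matching I- tag simply extends the current span.
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


def entity_f1(gold_tags, pred_tags):
    """Exact-match entity-level F1 for a single tag sequence."""
    gold = set(bio_to_spans(gold_tags))
    pred = set(bio_to_spans(pred_tags))
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, `bio_to_spans(["B-TOTAL", "I-TOTAL", "O", "B-DATE"])` yields one `TOTAL` span covering tokens 0 to 1 and one `DATE` span covering token 3.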

🔍 Visualization

You can visualize the model's understanding of the document layout (bounding boxes and labels):

```
from utils.visualize import visualize_document
# ... load your data ...
visualize_document("path/to/image.png", words, bboxes, labels)
```

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.
