- Introduction
- Illustration
- OCR Pipeline
- Datasets
- Efficient Data Augmentation
- Toolkit: Handwritten Text Recognition Data Generator
- Methodology
- Evaluation Metrics
- How do I use it?
- Inference
- Demo
- Acknowledgements
- More processing scripts
Introduction:
This project enhances state-of-the-art language models for Hungarian handwritten text recognition. It leverages the Transformer architecture end to end: a vision model (CV) for image understanding and a language model (LM) for wordpiece-level text generation. The baseline model is TrOCR.
Illustration
A sample of leveraging a vision model (CV) with language models (LMs) (self-edited); sources: DeiT, BERT & GPT-2 models.
OCR Pipeline (Self-Made)
The OCR models are provided in the Hugging Face format. [Models]
The data we used consists of human data that was collected and synthetic data that we generated. To see how the generation step was done, visit HuTRDG. The dataset has been split into three sets: train (80%), validation (10%) to tune the hyper-parameters, and test (10%) to measure how well the trained and fine-tuned models generalize.
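The 80/10/10 split above can be sketched as follows. This is a minimal illustration: the function name, the fixed seed, and the treatment of records as an in-memory list are assumptions, not the project's actual splitting code.

```python
import random

def split_dataset(records, seed=42):
    """Shuffle and split records into train (80%), validation (10%), test (10%)."""
    records = list(records)
    random.Random(seed).shuffle(records)  # fixed seed for a reproducible split
    n_train = int(0.8 * len(records))
    n_val = int(0.1 * len(records))
    train = records[:n_train]
    val = records[n_train:n_train + n_val]
    test = records[n_train + n_val:]  # remainder (~10%)
    return train, val, test
```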
The baseline models are trained on a proprietary dataset. This private dataset contains images (in JPG format) that have been segmented by lines and annotated with the corresponding text in a text file. Each annotation line contains the image name, a status flag indicating whether the sample is okay, several other meta parameters (not important for our task), and, in the last field, the text for the image (the target), with words separated by the (|) character; the (+) sign concatenates the next line with the current sentence. As shown below.
Sample image:
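As a rough illustration of the annotation format described above, a parser might look like the sketch below. The whitespace-separated field layout and the function names are assumptions for illustration; adjust the indices to the real annotation file.

```python
def parse_annotation(line):
    """Parse one annotation line: image name, status, meta..., text."""
    fields = line.strip().split()
    image_name = fields[0]
    status = fields[1]                 # e.g. "ok" or an error flag
    raw_text = fields[-1]              # last field holds the transcription
    text = raw_text.replace("|", " ")  # '|' separates words
    return image_name, status, text

def merge_continuations(texts):
    """A transcription ending in '+' continues on the next line."""
    merged, buffer = [], ""
    for t in texts:
        if t.endswith("+"):
            buffer += t[:-1]           # drop the '+' and keep accumulating
        else:
            merged.append(buffer + t)
            buffer = ""
    if buffer:                         # trailing unfinished continuation
        merged.append(buffer)
    return merged
```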
The script for the data-augmentation toolkit can be found at HTR_Aug.
The synthetic dataset was generated by extending an existing toolkit for Handwritten Text Data Generation (HTDG): HuTRDG.
The models have been developed in two stages: the first is pre-training on synthetic Hungarian data, and the second is fine-tuning on human data from DH-Lab, starting from the checkpoints of stage one.

```shell
pip install virtualenv
python3.8 -m venv env
# activate the virtual environment
source env/bin/activate
git clone https://github.com/Mohammed20201991/OCR_HU_Tra2022.git
cd OCR_HU_Tra2022
cd HuTrOCR
pip install -r requirements.txt
```
- Task (one stage only): model-selection experiments can be found in the MODEL_SELECTION.md file
- Task (one stage only): results of fine-tuning all TrOCR baseline models on the DH-Lab dataset can be found at
- Task (two stages, pre-training & fine-tuning): word-level results can be found in WORDS_LEVEL.md
- Task (two stages, pre-training & fine-tuning): lines_hu_v4 results can be found in LINES_HU_V4.md
| Model Name | CER (%) | WER (%) |
|---|---|---|
| TrOCR-large-handwritten | 1.792 | 4.944 |
| Deit+Roberta-base | 2.327 | 6.2332 |
| Deit-PULI-BERT | 2.129 | 4.691 |
| Model Name | Data | CER (%) | WER (%) |
|---|---|---|---|
| TrOCR-large-handwritten (fine-tuning stage only, English weights) | DH-Lab | 5.764 | 23.297 |
| TrOCR-large-handwritten (fine-tuning stage only, English weights) | DH-Lab (augmented) | 6.473 | 22.211 |
| TrOCR-large-handwritten | DH-Lab | 3.681 | 16.189 |
| TrOCR-large-handwritten | DH-Lab (augmented) | 5.221 | 18.46 |
| Deit+Roberta-base | DH-Lab | 8.374 | 29.121 |
| Deit+Roberta-base | DH-Lab (augmented) | 4.889 | 18.558 |
| Deit-PULI-Bert | DH-Lab | 5.381 | 16.091 |
| Deit-PULI-Bert | DH-Lab (augmented) | 6.123 | 16.357 |
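The CER and WER columns in the tables above are edit-distance based: the Levenshtein distance between prediction and reference at the character (CER) or word (WER) level, divided by the reference length. A minimal pure-Python sketch is shown below; real evaluations typically use a library such as jiwer or Hugging Face evaluate.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution
        prev = cur
    return prev[n]

def cer(reference, hypothesis):
    """Character Error Rate: character edits / reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    """Word Error Rate: word edits / number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
```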
```shell
export CUDA_VISIBLE_DEVICES=3
python3 train.py \
  --text_path="../Data/lines_hu_v2_1/train.jsonl" \
  --images_path="../Data/lines_hu_v2_1/images/" \
  --epochs=25 \
  --train_batch_size=100 \
  --eval_batch_size=100 \
  --logging_steps=500 \
  --save_steps=5000 \
  --eval_steps=5000 \
  --learning_rate=5e-5 \
  --gradient_checkpointing=True \
  --full_train=False
```
Set `--full_train=False` because we are re-pre-training.
i- If you want to use the models from the Hugging Face library (access to our private models is granted upon request), type in your terminal: huggingface-cli login
Then copy and paste the provided token = "xxxx".
ii- Select a GPU: export CUDA_VISIBLE_DEVICES=3
iii- For testing, use the command below:

```shell
python3 test.py \
  --text_path="../Data/lines_hu_v2_1/test.jsonl" \
  --images_path="../Data/lines_hu_v2_1/images/" \
  --get_by_model_id=False
```
```shell
export CUDA_VISIBLE_DEVICES=3
python3 train.py \
  --epochs=25 \
  --train_batch_size=8 \
  --eval_batch_size=8 \
  --logging_steps=500 \
  --save_steps=1000 \
  --eval_steps=500 \
  --learning_rate=5e-5 \
  --full_train=False \
  --processor_dir="/Models/TrOCR_large_handwritten/processor" \
  --ft_model_id="/Models/TrOCR_large_handwritten/checkpoint-xxxx" \
  --max_length=64 \
  --working_dir="Models_ft/TrOCR_large_handwritten_ft/"
```
Set `--full_train=False` because we are fine-tuning.
```shell
export CUDA_VISIBLE_DEVICES=3
python3 test.py \
  --get_by_model_id=False \
  --load_model_from_checkpoint_dir="./Models_ft/TrOCR_large_handwritten_ft/checkpoint-xxxx" \
  --max_length=64
```
```shell
export CUDA_VISIBLE_DEVICES=3
python3 train.py \
  --text_path="../Data/lines_hu_v2_1/train.jsonl" \
  --images_path="../Data/lines_hu_v2_1/images/" \
  --epochs=25 \
  --train_batch_size=24 \
  --eval_batch_size=24 \
  --logging_steps=100 \
  --save_steps=5000 \
  --eval_steps=5000 \
  --learning_rate=4e-5 \
  --leveraging=True \
  --max_length=96 \
  --working_dir="Models/PULI-BERT_Deit/"
```
```shell
export CUDA_VISIBLE_DEVICES=3
python3 test.py \
  --text_path="../Data/lines_hu_v2_1/test.jsonl" \
  --images_path="../Data/lines_hu_v2_1/images/" \
  --get_by_model_id=False \
  --load_model_from_checkpoint_dir="./Models/PULI-BERT_Deit" \
  --max_length=96
```
```shell
export CUDA_VISIBLE_DEVICES=3
python3 train.py \
  --epochs=25 \
  --train_batch_size=24 \
  --eval_batch_size=24 \
  --logging_steps=100 \
  --save_steps=2000 \
  --eval_steps=1000 \
  --learning_rate=4e-5 \
  --full_train=False \
  --processor_dir="/Models/PULI-BERT_Deit/processor" \
  --ft_model_id="/Models/PULI-BERT_Deit/checkpoint-xxxx" \
  --max_length=64 \
  --working_dir="Models_ft/PULI-BERT_Deit_ft/"
```
Set `--full_train=False` because we are fine-tuning.
```shell
export CUDA_VISIBLE_DEVICES=3
python3 test.py \
  --get_by_model_id=False \
  --load_model_from_checkpoint_dir="./Models_ft/PULI-BERT_Deit_ft/checkpoint-xxxx" \
  --max_length=64
```
```shell
export CUDA_VISIBLE_DEVICES=3
python3 train.py \
  --text_path="../Data/lines_hu_v2_1/train.jsonl" \
  --images_path="../Data/lines_hu_v2_1/images/" \
  --epochs=25 \
  --train_batch_size=32 \
  --eval_batch_size=32 \
  --logging_steps=100 \
  --save_steps=5000 \
  --eval_steps=5000 \
  --learning_rate=4e-5 \
  --nlp_model_dir="Roberta-base" \
  --leveraging=True \
  --max_length=96 \
  --working_dir="Models/Roberta-base_Deit/"
```
```shell
export CUDA_VISIBLE_DEVICES=3
python3 test.py \
  --text_path="../Data/lines_hu_v2_1/test.jsonl" \
  --images_path="../Data/lines_hu_v2_1/images/" \
  --get_by_model_id=False \
  --load_model_from_checkpoint_dir="./Models/Roberta-base_Deit" \
  --max_length=96
```
```shell
export CUDA_VISIBLE_DEVICES=3
python3 train.py \
  --epochs=25 \
  --train_batch_size=24 \
  --eval_batch_size=24 \
  --logging_steps=100 \
  --save_steps=2000 \
  --eval_steps=1000 \
  --learning_rate=4e-5 \
  --full_train=False \
  --processor_dir="/Models/Roberta-base_Deit/processor" \
  --ft_model_id="/Models/Roberta-base_Deit/checkpoint-xxxx" \
  --max_length=64 \
  --working_dir="Models_ft/Roberta-base_Deit_ft/"
```
Set `--full_train=False` because we are fine-tuning.
```shell
export CUDA_VISIBLE_DEVICES=3
python3 test.py \
  --get_by_model_id=False \
  --load_model_from_checkpoint_dir="./Models_ft/Roberta-base_Deit_ft/checkpoint-xxxx" \
  --max_length=64
```
Please see the details in pic_inference.py, or you can see
the Image2Text script, or
for the OCR live demo using Gradio, see:
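For orientation, recognition with a TrOCR-style checkpoint can be sketched with the Hugging Face transformers API (TrOCRProcessor plus VisionEncoderDecoderModel). The checkpoint path and function name below are placeholders; pic_inference.py contains the project's actual script.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

def image_to_text(image_path,
                  checkpoint="Models_ft/TrOCR_large_handwritten_ft/checkpoint-xxxx",
                  max_length=64):
    """Run OCR on a single line image and return the decoded text."""
    processor = TrOCRProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values, max_length=max_length)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```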
- The source code is free.
- The dataset may not be reused, as it is private data provided only for academic research.
Author: Mohammed A.S. Al-Hitawi
Email: [email protected] , [email protected], [email protected]
- Gyöngyössy Natabara Máté, Email: [email protected], my supervisor during the AI Project Labs and the thesis, Eötvös Loránd University
- Dr. János Botzheim, Email: [email protected], my supervisor during the AI Project Labs, Eötvös Loránd University
- Szekrényes István and Nemeskey Dávid, researchers at the Hungarian Digital Heritage Lab (DH-Lab), who provided the historical handwriting benchmark dataset of János Arany and valuable compute (8 A100 80GB GPUs)
- The thesis includes results from months-long GPU runs in which the most recent technologies are utilized as intended
- Addressing the double start token present in many international models; see the notebook
- Generated more than two million synthetic samples for the Hungarian language
- The results surpass the state-of-the-art TrOCR model for Hungarian handwriting recognition
- Leveraging new state-of-the-art vision-language models in the OCR architecture
- What else is left? Future work, open for contribution:
- Replace the decoder in the TrOCR architecture with a Hungarian GPT-2
- Use the TrOCR model in parallel with PULI-GPT-2; see the draft notebook: Parallel
- Generate more variations of synthetic data or collect more human-annotated data: HTR_Aug
- TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei, AAAI 2023
- Jönnek a nagyok! BERT-Large, GPT-2 és GPT-3 nyelvmodellek magyar nyelvre (The big ones are coming! BERT-Large, GPT-2 and GPT-3 language models for Hungarian), or PULI-BERT-Large
- DeiT: Training data-efficient image transformers & distillation through attention
- RoBERTa: A Robustly Optimized BERT Pretraining Approach, Roberta-base
- The official implementation of TrOCR: this repo
- The human dataset used is a private dataset provided by the Hungarian Digital Heritage Lab (DH-Lab), written by the author János Arany
- Tools used: Python, Hugging Face, PyTorch, VS Code, TensorBoard, and Linux (Ubuntu distribution)