TrOCR_HU_2022

Introduction:

This project enhances state-of-the-art language models for Hungarian handwritten text recognition. It leverages the Transformer architecture end to end, combining a vision model (CV) for image understanding with a language model (LM) for wordpiece-level text generation. The baseline model is TrOCR.

Illustration: sample of leveraging vision (CV) with language models (LMs) (self-edited); source: the DeiT, BERT & GPT-2 models.

OCR pipeline (self-made)

The OCR models are provided in the Hugging Face format. [Models]
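
For intuition, the leveraging idea above can be expressed with the Hugging Face VisionEncoderDecoder API. The sketch below pairs a DeiT encoder with a RoBERTa decoder, mirroring the Deit+Roberta-base setup described later; the checkpoint names are assumptions, and this is not the repository's actual training code.

# Minimal sketch (assumption): compose a DeiT image encoder with a RoBERTa text decoder.
from transformers import AutoTokenizer, VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/deit-base-distilled-patch16-384",   # vision encoder (assumed checkpoint)
    "roberta-base",                               # language-model decoder (assumed checkpoint)
)

# The joint model needs the decoder's special tokens configured before training/generation.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.vocab_size = model.config.decoder.vocab_size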


DataSets

The data we used contains human data that was collected and synthetic data that we generated. To see how the generation step was done, visit HuTRDG. The dataset has been split into three sets: train (80%), validation (10%) to tune the hyper-parameters, and test (10%) to measure how well the trained and fine-tuned models generalize.
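
For illustration only (this is not the repository's preprocessing script), a JSONL annotation file could be split 80/10/10 as follows; the file name lines.jsonl is a placeholder.

# Minimal 80/10/10 split sketch for a JSONL annotation file (placeholder file names).
import json
import random

random.seed(42)                                   # make the split reproducible
with open("lines.jsonl", encoding="utf-8") as f:  # placeholder input file
    records = [json.loads(line) for line in f]

random.shuffle(records)
n = len(records)
splits = {
    "train": records[: int(0.8 * n)],
    "val":   records[int(0.8 * n): int(0.9 * n)],
    "test":  records[int(0.9 * n):],
}
for name, rows in splits.items():
    with open(f"{name}.jsonl", "w", encoding="utf-8") as out:
        for row in rows:
            out.write(json.dumps(row, ensure_ascii=False) + "\n")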

1- Human data (DH-Lab)

The baseline models are trained with a proprietary dataset. The data is private and contains images (in JPG format) that have been segmented into lines and annotated with the corresponding text in a text file. Each annotation contains the image name, a status flag indicating whether the line is okay or not, several other meta parameters (not important for our task), and, last, the text for the image (the features), with words separated by the | character; the + sign concatenates the next line with the current sentence, as shown below (a small parsing sketch follows the sample).

Sample image:


And the corresponding text: tott űlése határozata folytán
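
As a rough sketch only, here is how such an annotation line could be parsed; the exact column layout of the private DH-Lab files is an assumption based on the description above, and parse_annotation is a hypothetical helper, not part of the repository.

# Hypothetical parser for a DH-Lab-style annotation line. Assumed layout:
# "<image_name> <status> <meta...> <text>", where the text words are joined by "|"
# and a trailing "+" means the next physical line continues the current sentence.
def parse_annotation(line: str):
    fields = line.strip().split()
    image_name, status = fields[0], fields[1]
    raw_text = fields[-1]                         # last column: words joined by "|"
    continues = raw_text.endswith("+")            # "+" concatenates the next line
    text = raw_text.rstrip("+").replace("|", " ").strip()
    return image_name, status, text, continues

# Example (made-up annotation line):
# parse_annotation("img_0001.jpg ok 0 0 0 tott|űlése|határozata|folytán")
# -> ("img_0001.jpg", "ok", "tott űlése határozata folytán", False)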

2- Human data augmentation (DH-Lab Aug.)

The script for the data-augmentation toolkit can be found at HTR_Aug. Open In Colab

3- Synthetic Data

This dataset was generated synthetically by extending an existing toolkit for Handwritten Text Data Generation (HTDG); see HuTRDG.
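
For intuition only (the HuTRDG toolkit itself is considerably more elaborate), a synthetic line image can be rendered from Hungarian text with Pillow; the handwriting-style font file below is an assumption.

# Toy rendering sketch: draw a Hungarian text line onto a white image with Pillow.
from PIL import Image, ImageDraw, ImageFont

text = "tott űlése határozata folytán"              # sample Hungarian line
font = ImageFont.truetype("handwriting.ttf", 48)    # assumed handwriting-style TTF font
left, top, right, bottom = font.getbbox(text)
img = Image.new("RGB", (right - left + 40, bottom - top + 40), "white")
ImageDraw.Draw(img).text((20, 20 - top), text, font=font, fill="black")
img.save("synthetic_line.jpg")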

The methodology follows the figure below. The models are developed in two stages: first, pre-training on synthetic Hungarian data; second, fine-tuning the resulting stage-one checkpoints on the human DH-Lab data.

(Figure: two-stage training methodology)

The Evaluation Metrics

  1. Character Error Rate (CER) demo
  2. Word Error Rate (WER) demo
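
For reference, both metrics can be computed with the Hugging Face evaluate library; this is a generic sketch (pip install evaluate jiwer), not the repository's evaluation code, and the prediction/reference strings are made up.

# Generic CER/WER computation sketch with the evaluate library.
import evaluate

cer_metric = evaluate.load("cer")
wer_metric = evaluate.load("wer")

predictions = ["tott űlése hatarozata folytán"]     # example model output (one wrong character)
references  = ["tott űlése határozata folytán"]     # example ground truth

print("CER:", cer_metric.compute(predictions=predictions, references=references))
print("WER:", wer_metric.compute(predictions=predictions, references=references))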

How do I use it?

Installation

Create a new virtual environment

pip install virtualenv
python3.8 -m venv env

To activate the virtual environment: source env/bin/activate

git clone https://github.com/Mohammed20201991/OCR_HU_Tra2022.git
cd OCR_HU_Tra2022
cd HuTrOCR
pip install -r requirements.txt

Task (one stage only): model selection experiments can be found in the MODEL_SELECTION.md file

Task (one stage only): results of fine-tuning all TrOCR baseline models on the DH-Lab dataset can be found at

Task (two stages, pre-training & fine-tuning): word-level results can be found in WORDS_LEVEL.md

Task (two stages, pre-training & fine-tuning): lines_hu_v4 results can be found in LINES_HU_V4.md

Task (two stages, pre-training & fine-tuning): lines_hu_v2_1 results:

1- Pre-training Test Results (First Stage):

Model Name                 CER (%)   WER (%)
TrOCR-large-handwritten    1.792     4.944
Deit+Roberta-base          2.327     6.2332
Deit-PULI-BERT             2.129     4.691

2- Fine-tuning Test Results (Second Stage):

Model Name                                                        Data                 CER (%)   WER (%)
TrOCR-large-handwritten (fine-tuning stage only, English weights) DH-Lab               5.764     23.297
TrOCR-large-handwritten (fine-tuning stage only, English weights) DH-Lab (Augmented)   6.473     22.211
TrOCR-large-handwritten                                           DH-Lab               3.681     16.189
TrOCR-large-handwritten                                           DH-Lab (Augmented)   5.221     18.46
Deit+Roberta-base                                                 DH-Lab               8.374     29.121
Deit+Roberta-base                                                 DH-Lab (Augmented)   4.889     18.558
Deit-PULI-Bert                                                    DH-Lab               5.381     16.091
Deit-PULI-Bert                                                    DH-Lab (Augmented)   6.123     16.357

TrOCR-large-handwritten

Re-pre-training on the synthetic lines_hu_v2_1 dataset

export CUDA_VISIBLE_DEVICES=3

python3 train.py \
  --text_path="../Data/lines_hu_v2_1/train.jsonl" \
  --images_path="../Data/lines_hu_v2_1/images/" \
  --epochs=25 \
  --train_batch_size=100 \
  --eval_batch_size=100 \
  --logging_steps=500 \
  --save_steps=5000 \
  --eval_steps=5000 \
  --learning_rate=5e-5 \
  --gradient_checkpointing=True \
  --full_train=False

Set --full_train=False because this is the re-pre-training stage.

Evaluation on lines_hu_v2_1 (Test set)

i- If you want to use the models from the Hugging Face Hub: our models are private and available upon request. To access them, run huggingface-cli login in your terminal, then copy and paste the provided token ("xxxx").

ii- export CUDA_VISIBLE_DEVICES=3

iii- For testing, use the command below, or Open it in Colab

python3 test.py \
  --text_path="../Data/lines_hu_v2_1/test.jsonl" \
  --images_path="../Data/lines_hu_v2_1/images/" \
  --get_by_model_id=False

Fine-tune on the DH-Lab dataset from the synthetic pre-trained checkpoint (lines_hu_v2_1)

export CUDA_VISIBLE_DEVICES=3

python3 train.py \
  --epochs=25 \
  --train_batch_size=8 \
  --eval_batch_size=8 \
  --logging_steps=500 \
  --save_steps=1000 \
  --eval_steps=500 \
  --learning_rate=5e-5 \
  --full_train=False \
  --processor_dir="/Models/TrOCR_large_handwritten/processor" \
  --ft_model_id="/Models/TrOCR_large_handwritten/checkpoint-xxxx" \
  --max_length=64 \
  --working_dir="Models_ft/TrOCR_large_handwritten_ft/"

Set --full_train=False because this is the fine-tuning stage.

Evaluation on DH-Lab (Test set)

export CUDA_VISIBLE_DEVICES=3

python3 test.py \
  --get_by_model_id=False \
  --load_model_from_checkpoint_dir="./Models_ft/TrOCR_large_handwritten_ft/checkpoint-xxxx" \
  --max_length=64

Leveraging DeiT with the PULI-BERT model

Pre-training on the synthetic lines_hu_v2_1 dataset (first stage)

export CUDA_VISIBLE_DEVICES=3

python3 train.py \
  --text_path="../Data/lines_hu_v2_1/train.jsonl" \
  --images_path="../Data/lines_hu_v2_1/images/" \
  --epochs=25 \
  --train_batch_size=24 \
  --eval_batch_size=24 \
  --logging_steps=100 \
  --save_steps=5000 \
  --eval_steps=5000 \
  --learning_rate=4e-5 \
  --leveraging=True \
  --max_length=96 \
  --working_dir="Models/PULI-BERT_Deit/"

Evaluation on lines_hu_v2_1 (Test set)

export CUDA_VISIBLE_DEVICES=3

python3 test.py \
  --text_path="../Data/lines_hu_v2_1/test.jsonl" \
  --images_path="../Data/lines_hu_v2_1/images/" \
  --get_by_model_id=False \
  --load_model_from_checkpoint_dir="./Models/PULI-BERT_Deit" \
  --max_length=96

Fine-tune on the DH-Lab dataset from the synthetic pre-trained checkpoint (lines_hu_v2_1), second stage

export CUDA_VISIBLE_DEVICES=3

python3 train.py \
  --epochs=25 \
  --train_batch_size=24 \
  --eval_batch_size=24 \
  --logging_steps=100 \
  --save_steps=2000 \
  --eval_steps=1000 \
  --learning_rate=4e-5 \
  --full_train=False \
  --processor_dir="/Models/PULI-BERT_Deit/processor" \
  --ft_model_id="/Models/PULI-BERT_Deit/checkpoint-xxxx" \
  --max_length=64 \
  --working_dir="Models_ft/PULI-BERT_Deit_ft/"

Set --full_train=False because this is the fine-tuning stage.

Evaluation on DH-Lab (Test set)

export CUDA_VISIBLE_DEVICES=3

python3 test.py \
  --get_by_model_id=False \
  --load_model_from_checkpoint_dir="./Models_ft/PULI-BERT_Deit_ft/checkpoint-xxxx" \
  --max_length=64

Leveraging DeiT with the RoBERTa-base model

Pre-training on the synthetic lines_hu_v2_1 dataset (first stage)

export CUDA_VISIBLE_DEVICES=3

python3 train.py \
  --text_path="../Data/lines_hu_v2_1/train.jsonl" \
  --images_path="../Data/lines_hu_v2_1/images/" \
  --epochs=25 \
  --train_batch_size=32 \
  --eval_batch_size=32 \
  --logging_steps=100 \
  --save_steps=5000 \
  --eval_steps=5000 \
  --learning_rate=4e-5 \
  --nlp_model_dir="Roberta-base" \
  --leveraging=True \
  --max_length=96 \
  --working_dir="Models/Roberta-base_Deit/"

Evaluation on lines_hu_v2_1 (Test set)

export CUDA_VISIBLE_DEVICES=3

python3 test.py \
  --text_path="../Data/lines_hu_v2_1/test.jsonl" \
  --images_path="../Data/lines_hu_v2_1/images/" \
  --get_by_model_id=False \
  --load_model_from_checkpoint_dir="./Models/Roberta-base_Deit" \
  --max_length=96

Fine-tune on the DH-Lab dataset from the synthetic pre-trained checkpoint (lines_hu_v2_1), second stage

export CUDA_VISIBLE_DEVICES=3

python3 train.py \
  --epochs=25 \
  --train_batch_size=24 \
  --eval_batch_size=24 \
  --logging_steps=100 \
  --save_steps=2000 \
  --eval_steps=1000 \
  --learning_rate=4e-5 \
  --full_train=False \
  --processor_dir="/Models/Roberta-base_Deit/processor" \
  --ft_model_id="/Models/Roberta-base_Deit/checkpoint-xxxx" \
  --max_length=64 \
  --working_dir="Models_ft/Roberta-base_Deit_ft/"

Set --full_train=False because this is the fine-tuning stage.

Evaluation on DH-Lab (Test set)

export CUDA_VISIBLE_DEVICES=3

python3 test.py \
  --get_by_model_id=False \
  --load_model_from_checkpoint_dir="./Models_ft/Roberta-base_Deit_ft/checkpoint-xxxx" \
  --max_length=64

An Inference Example

Please see the details in pic_inference.py, or Open it in Colab.
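
For a rough idea of what the inference step looks like with the transformers library (a sketch under assumptions, not the actual contents of pic_inference.py; the paths reuse the placeholders from the training commands above):

# Minimal inference sketch: load the processor and a fine-tuned checkpoint, decode one image.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("Models/TrOCR_large_handwritten/processor")
model = VisionEncoderDecoderModel.from_pretrained(
    "Models_ft/TrOCR_large_handwritten_ft/checkpoint-xxxx"   # placeholder checkpoint
)

image = Image.open("line_image.jpg").convert("RGB")           # a line-segmented input image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values, max_length=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])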

Demo

See the Image2Text script, or Open I2T Pipeline in Colab. For the OCR live demo using Gradio, see: Open in Colab
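
As a rough sketch of how such a Gradio demo can be wired up (generic code under the same placeholder paths, not the repository's actual demo script):

# Minimal Gradio demo sketch: upload a line image, return the recognized text.
import gradio as gr
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("Models/TrOCR_large_handwritten/processor")
model = VisionEncoderDecoderModel.from_pretrained(
    "Models_ft/TrOCR_large_handwritten_ft/checkpoint-xxxx"    # placeholder checkpoint
)

def ocr_line(image: Image.Image) -> str:
    # Run the model on one uploaded line image and decode the prediction.
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values, max_length=64)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

gr.Interface(fn=ocr_line, inputs=gr.Image(type="pil"), outputs="text",
             title="Hungarian handwritten line OCR").launch()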

A video can be found here:

License

  • The source code is free.
  • The dataset may not be reused, as it is private data provided only for academic research.

Contact Information

Author: Mohammed A.S. Al-Hitawi. Email: [email protected], [email protected], [email protected]

Acknowledgement

Contribution

  • The thesis includes results from months-long GPU training runs that make use of recent state-of-the-art technologies, as intended.
  • Addressing the double start-token issue present in many international models; see the notebook.
  • More than two million synthetic samples were generated for the Hungarian language.
  • The results surpass the state-of-the-art TrOCR model for Hungarian handwriting recognition.
  • Leveraging new state-of-the-art vision and language models in the OCR architecture.
  • What is left for future work (open for contribution):
    1. Replace the decoder in the TrOCR architecture with a Hungarian GPT-2.
    2. Use the TrOCR model in parallel with PULI-GPT-2; see the draft notebook Parallel.
    3. Generate more variations of synthetic data or collect more human-annotated data; see HTR_Aug.
