- Introduction
- Illustration
- OCR Pipeline
- Datasets
- Efficient Data Augmentation
- Toolkit: Handwritten Text Recognition Data Generator
- Methodology
- Evaluation Metrics
- How do I use it?
- Inference
- Demo
- Acknowledgements
- More processing scripts
Introduction:
This project enhances state-of-the-art language models for Hungarian handwritten text recognition. It leverages the Transformer architecture end to end: a vision model (CV) for image understanding and a language model (LM) for wordpiece-level text generation. The baseline model is TrOCR.
Illustration
A sample of leveraging a vision model (CV) with language models (LMs) (self-edited); sources: DeiT, BERT & GPT-2 models.
OCR Pipeline (Self-Made)
The OCR models are provided in the Hugging Face format. [Models]
The data we used consists of human data that was collected and synthetic data that we generated. To see how the generation step was done, visit HuTRDG. The dataset has been split into three sets: train (80%), validation (10%) to tune the hyper-parameters, and test (10%) to measure how well the trained and fine-tuned models generalize.
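The 80/10/10 split above can be sketched as follows. This is a minimal illustration: the function name, the fixed seed, and the treatment of records as an in-memory list are assumptions, not the project's actual splitting code.

```python
import random

def split_dataset(records, seed=42):
    """Shuffle and split records into train (80%), validation (10%), test (10%)."""
    records = list(records)
    random.Random(seed).shuffle(records)  # fixed seed for a reproducible split
    n_train = int(0.8 * len(records))
    n_val = int(0.1 * len(records))
    train = records[:n_train]
    val = records[n_train:n_train + n_val]
    test = records[n_train + n_val:]  # remainder (~10%)
    return train, val, test
```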
The baseline models are trained on a proprietary dataset. This private dataset contains images (in JPG format) that have been segmented by lines and annotated with the corresponding text in a text file. Each annotation line contains the image name, a status flag indicating whether the sample is okay, several other meta parameters (not important for our task), and, in the last field, the text for the image (the target), with words separated by the (|) character; the (+) sign concatenates the next line with the current sentence. As shown below.
Sample image:
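As a rough illustration of the annotation format described above, a parser might look like the sketch below. The whitespace-separated field layout and the function names are assumptions for illustration; adjust the indices to the real annotation file.

```python
def parse_annotation(line):
    """Parse one annotation line: image name, status, meta..., text."""
    fields = line.strip().split()
    image_name = fields[0]
    status = fields[1]                 # e.g. "ok" or an error flag
    raw_text = fields[-1]              # last field holds the transcription
    text = raw_text.replace("|", " ")  # '|' separates words
    return image_name, status, text

def merge_continuations(texts):
    """A transcription ending in '+' continues on the next line."""
    merged, buffer = [], ""
    for t in texts:
        if t.endswith("+"):
            buffer += t[:-1]           # drop the '+' and keep accumulating
        else:
            merged.append(buffer + t)
            buffer = ""
    if buffer:                         # trailing unfinished continuation
        merged.append(buffer)
    return merged
```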
The script for the data-augmentation toolkit can be found at HTR_Aug.
The synthetic dataset was generated by extending an existing toolkit for Handwritten Text Data Generation (HTDG): HuTRDG.
The models have been developed in two stages: the first is pre-training on synthetic Hungarian data, and the second is fine-tuning on human data from DH-Lab, starting from the checkpoints of stage one.

```shell
pip install virtualenv
python3.8 -m venv env
# activate the virtual environment
source env/bin/activate
git clone https://github.com/Mohammed20201991/OCR_HU_Tra2022.git
cd OCR_HU_Tra2022
cd HuTrOCR
pip install -r requirements.txt
```
- Task (one stage only): model-selection experiments can be found in the MODEL_SELECTION.md file
- Task (one stage only): results of fine-tuning all TrOCR baseline models on the DH-Lab dataset can be found at
- Task (two stages, pre-training & fine-tuning): word-level results can be found in WORDS_LEVEL.md
- Task (two stages, pre-training & fine-tuning): lines_hu_v4 results can be found in LINES_HU_V4.md
| Model Name | CER (%) | WER (%) |
|---|---|---|
| TrOCR-large-handwritten | 1.792 | 4.944 |
| Deit+Roberta-base | 2.327 | 6.2332 |
| Deit-PULI-BERT | 2.129 | 4.691 |
| Model Name | Data | CER (%) | WER (%) |
|---|---|---|---|
| TrOCR-large-handwritten (fine-tuning stage only, English weights) | DH-Lab | 5.764 | 23.297 |
| TrOCR-large-handwritten (fine-tuning stage only, English weights) | DH-Lab (augmented) | 6.473 | 22.211 |
| TrOCR-large-handwritten | DH-Lab | 3.681 | 16.189 |
| TrOCR-large-handwritten | DH-Lab (augmented) | 5.221 | 18.46 |
| Deit+Roberta-base | DH-Lab | 8.374 | 29.121 |
| Deit+Roberta-base | DH-Lab (augmented) | 4.889 | 18.558 |
| Deit-PULI-Bert | DH-Lab | 5.381 | 16.091 |
| Deit-PULI-Bert | DH-Lab (augmented) | 6.123 | 16.357 |
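The CER and WER columns in the tables above are edit-distance based: the Levenshtein distance between prediction and reference at the character (CER) or word (WER) level, divided by the reference length. A minimal pure-Python sketch is shown below; real evaluations typically use a library such as jiwer or Hugging Face evaluate.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution
        prev = cur
    return prev[n]

def cer(reference, hypothesis):
    """Character Error Rate: character edits / reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    """Word Error Rate: word edits / number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
```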
```shell
export CUDA_VISIBLE_DEVICES=3
python3 train.py \
  --text_path="../Data/lines_hu_v2_1/train.jsonl" \
  --images_path="../Data/lines_hu_v2_1/images/" \
  --epochs=25 \
  --train_batch_size=100 \
  --eval_batch_size=100 \
  --logging_steps=500 \
  --save_steps=5000 \
  --eval_steps=5000 \
  --learning_rate=5e-5 \
  --gradient_checkpointing=True \
  --full_train=False
```
Set `--full_train=False` because we are re-pre-training.
i- If you want to use the models from the Hugging Face library (access to our private models is granted upon request), type in your terminal: huggingface-cli login
Then copy and paste the provided token = "xxxx".
ii- Select a GPU: export CUDA_VISIBLE_DEVICES=3
iii- For testing, use the command below:

```shell
python3 test.py \
  --text_path="../Data/lines_hu_v2_1/test.jsonl" \
  --images_path="../Data/lines_hu_v2_1/images/" \
  --get_by_model_id=False
```
```shell
export CUDA_VISIBLE_DEVICES=3
python3 train.py \
  --epochs=25 \
  --train_batch_size=8 \
  --eval_batch_size=8 \
  --logging_steps=500 \
  --save_steps=1000 \
  --eval_steps=500 \
  --learning_rate=5e-5 \
  --full_train=False \
  --processor_dir="/Models/TrOCR_large_handwritten/processor" \
  --ft_model_id="/Models/TrOCR_large_handwritten/checkpoint-xxxx" \
  --max_length=64 \
  --working_dir="Models_ft/TrOCR_large_handwritten_ft/"
```
Set `--full_train=False` because we are fine-tuning.
```shell
export CUDA_VISIBLE_DEVICES=3
python3 test.py \
  --get_by_model_id=False \
  --load_model_from_checkpoint_dir="./Models_ft/TrOCR_large_handwritten_ft/checkpoint-xxxx" \
  --max_length=64
```
```shell
export CUDA_VISIBLE_DEVICES=3
python3 train.py \
  --text_path="../Data/lines_hu_v2_1/train.jsonl" \
  --images_path="../Data/lines_hu_v2_1/images/" \
  --epochs=25 \
  --train_batch_size=24 \
  --eval_batch_size=24 \
  --logging_steps=100 \
  --save_steps=5000 \
  --eval_steps=5000 \
  --learning_rate=4e-5 \
  --leveraging=True \
  --max_length=96 \
  --working_dir="Models/PULI-BERT_Deit/"
```
```shell
export CUDA_VISIBLE_DEVICES=3
python3 test.py \
  --text_path="../Data/lines_hu_v2_1/test.jsonl" \
  --images_path="../Data/lines_hu_v2_1/images/" \
  --get_by_model_id=False \
  --load_model_from_checkpoint_dir="./Models/PULI-BERT_Deit" \
  --max_length=96
```
```shell
export CUDA_VISIBLE_DEVICES=3
python3 train.py \
  --epochs=25 \
  --train_batch_size=24 \
  --eval_batch_size=24 \
  --logging_steps=100 \
  --save_steps=2000 \
  --eval_steps=1000 \
  --learning_rate=4e-5 \
  --full_train=False \
  --processor_dir="/Models/PULI-BERT_Deit/processor" \
  --ft_model_id="/Models/PULI-BERT_Deit/checkpoint-xxxx" \
  --max_length=64 \
  --working_dir="Models_ft/PULI-BERT_Deit_ft/"
```
Set `--full_train=False` because we are fine-tuning.
```shell
export CUDA_VISIBLE_DEVICES=3
python3 test.py \
  --get_by_model_id=False \
  --load_model_from_checkpoint_dir="./Models_ft/PULI-BERT_Deit_ft/checkpoint-xxxx" \
  --max_length=64
```
```shell
export CUDA_VISIBLE_DEVICES=3
python3 train.py \
  --text_path="../Data/lines_hu_v2_1/train.jsonl" \
  --images_path="../Data/lines_hu_v2_1/images/" \
  --epochs=25 \
  --train_batch_size=32 \
  --eval_batch_size=32 \
  --logging_steps=100 \
  --save_steps=5000 \
  --eval_steps=5000 \
  --learning_rate=4e-5 \
  --nlp_model_dir="Roberta-base" \
  --leveraging=True \
  --max_length=96 \
  --working_dir="Models/Roberta-base_Deit/"
```
```shell
export CUDA_VISIBLE_DEVICES=3
python3 test.py \
  --text_path="../Data/lines_hu_v2_1/test.jsonl" \
  --images_path="../Data/lines_hu_v2_1/images/" \
  --get_by_model_id=False \
  --load_model_from_checkpoint_dir="./Models/Roberta-base_Deit" \
  --max_length=96
```
```shell
export CUDA_VISIBLE_DEVICES=3
python3 train.py \
  --epochs=25 \
  --train_batch_size=24 \
  --eval_batch_size=24 \
  --logging_steps=100 \
  --save_steps=2000 \
  --eval_steps=1000 \
  --learning_rate=4e-5 \
  --full_train=False \
  --processor_dir="/Models/Roberta-base_Deit/processor" \
  --ft_model_id="/Models/Roberta-base_Deit/checkpoint-xxxx" \
  --max_length=64 \
  --working_dir="Models_ft/Roberta-base_Deit_ft/"
```
Set `--full_train=False` because we are fine-tuning.
```shell
export CUDA_VISIBLE_DEVICES=3
python3 test.py \
  --get_by_model_id=False \
  --load_model_from_checkpoint_dir="./Models_ft/Roberta-base_Deit_ft/checkpoint-xxxx" \
  --max_length=64
```
Please see the details in pic_inference.py, or you can see
the Image2Text script, or
for the OCR live demo using Gradio, see:
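For orientation, recognition with a TrOCR-style checkpoint can be sketched with the Hugging Face transformers API (TrOCRProcessor plus VisionEncoderDecoderModel). The checkpoint path and function name below are placeholders; pic_inference.py contains the project's actual script.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

def image_to_text(image_path,
                  checkpoint="Models_ft/TrOCR_large_handwritten_ft/checkpoint-xxxx",
                  max_length=64):
    """Run OCR on a single line image and return the decoded text."""
    processor = TrOCRProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values, max_length=max_length)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```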
- The source code is free.
- The dataset may not be reused, as it is private data provided only for academic research.
Author: Mohammed A.S. Al-Hitawi
Email: [email protected] , [email protected], [email protected]
- Gyöngyössy Natabara Máté, Email: [email protected], my supervisor during the AI Project Labs and the thesis, Eötvös Loránd University
- Dr. János Botzheim, Email: [email protected], my supervisor during the AI Project Labs, Eötvös Loránd University
- Szekrényes István and Nemeskey Dávid, researchers at the Hungarian Digital Heritage Lab (DH-Lab), who provided the historical handwriting benchmark dataset of János Arany and valuable compute (8 A100 80GB GPUs)
- The thesis includes results from months-long GPU runs in which the most recent technologies are utilized as intended
- Addressing the double start token present in many international models; see the notebook
- Generated more than two million synthetic samples for the Hungarian language
- The results surpass the state-of-the-art TrOCR model for Hungarian handwriting recognition
- Leveraging new state-of-the-art vision-language models in the OCR architecture
- What else is left? Future work, open for contribution:
- Replace the decoder in the TrOCR architecture with a Hungarian GPT-2
- Use the TrOCR model in parallel with PULI-GPT-2; see the draft notebook: Parallel
- Generate more variations of synthetic data or collect more human-annotated data: HTR_Aug
- TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei, AAAI 2023
- Jönnek a nagyok! BERT-Large, GPT-2 és GPT-3 nyelvmodellek magyar nyelvre (The big ones are coming! BERT-Large, GPT-2 and GPT-3 language models for Hungarian), or PULI-BERT-Large
- DeiT: Training data-efficient image transformers & distillation through attention
- RoBERTa: A Robustly Optimized BERT Pretraining Approach, Roberta-base
- The official implementation of TrOCR: this repo
- The human dataset used is a private dataset provided by the Hungarian Digital Heritage Lab (DH-Lab), written by the author János Arany
- Tools used: Python, Hugging Face, PyTorch, VS Code, TensorBoard, and Linux (Ubuntu distribution)