This project provides a complete pipeline for extracting specific text fields—GTIN, Serial Number, LOT Number, and Expiry Date—from product label images. It leverages a deep learning approach, combining object detection with optical character recognition (OCR) for accurate and automated data extraction.
The core of this project is a two-stage process:
- Object Detection: A YOLOv8-obb (Oriented Bounding Box) model is trained to identify and locate the precise regions of the four target text fields on an image.
- Text Recognition: PaddleOCR is then used to perform OCR on these specific regions to extract the text content.
The models are also converted to TensorRT for optimized inference performance on NVIDIA GPUs.
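As a rough illustration, a minimal single-image sketch of that two-stage flow might look like the following (assuming the Ultralytics and PaddleOCR 2.x Python APIs, trained weights at `runs/obb/train/weights/best.pt`, and a hypothetical input image `sample_label.jpg`; the notebooks in this repository implement the full pipeline):

```python
import cv2
from ultralytics import YOLO
from paddleocr import PaddleOCR

# Stage 1: the trained YOLOv8-obb model locates the four target fields.
detector = YOLO("runs/obb/train/weights/best.pt")
ocr = PaddleOCR(use_angle_cls=True, lang="en")

image = cv2.imread("sample_label.jpg")   # hypothetical example image
result = detector(image)[0]

# Stage 2: crop each detected region and run PaddleOCR on the crop.
# .xyxy gives axis-aligned boxes around the oriented detections (a simplification).
for box, cls_id in zip(result.obb.xyxy, result.obb.cls):
    x1, y1, x2, y2 = map(int, box.tolist())
    crop = image[y1:y2, x1:x2]
    ocr_out = ocr.ocr(crop, cls=True)
    if ocr_out and ocr_out[0]:
        text = " ".join(line[1][0] for line in ocr_out[0])
        print(result.names[int(cls_id)], "->", text)
```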
- Targeted Extraction: Specifically trained to find and read:
  - GTIN (Global Trade Item Number)
  - SR_NO (Serial Number)
  - LOT (Lot Number)
  - EXP (Expiry Date)
- High Accuracy: The YOLOv8 model is trained on a custom dataset to achieve robust detection.
- Optimized for Speed: Includes notebooks for converting the models to TensorRT, significantly reducing inference time.
- End-to-End Pipeline: From model training to final text extraction and saving results to a CSV file.
```
Text-Extraction-Project/
├── extracted_results/
│   ├── extracted_data.csv              # Final extracted text output
│   └── processing_summary.txt          # Summary of the image processing run
├── runs/
│   └── obb/train/
│       ├── args.yaml                   # YOLO training configuration
│       ├── results.csv                 # Training metrics per epoch
│       └── weights/
│           └── best.pt                 # Best trained YOLO model weights
├── data.yaml                           # Dataset configuration for YOLO
├── PaddleOCR Model.ipynb               # Notebook for OCR logic
├── Requirements.txt                    # Project dependencies
├── TensorRT conversion YOLO.ipynb      # Notebook for YOLO to TensorRT conversion
├── TensorRT deployment.ipynb           # Notebook for running the optimized models
├── Untitled.ipynb                      # Main notebook for batch image processing
└── YOLO Training.ipynb                 # Notebook for training the YOLOv8 model
```
- NVIDIA GPU
- NVIDIA Driver
- CUDA Toolkit
- cuDNN
- TensorRT
It is critical that the versions of CUDA, cuDNN, and TensorRT are compatible with your version of PaddlePaddle-GPU; refer to the official PaddlePaddle documentation for the supported combinations. Errors encountered during development of this project, such as `cudnn64_8.dll not found`, typically indicate a mismatch between these libraries.
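A quick way to check for such a mismatch (a minimal sketch, not part of the notebooks) is to print the CUDA/cuDNN versions each framework was built against and compare them with what is installed on the machine:

```python
import paddle
import torch

# CUDA/cuDNN versions PaddlePaddle-GPU was compiled against.
print("Paddle CUDA:", paddle.version.cuda())
print("Paddle cuDNN:", paddle.version.cudnn())

# CUDA/cuDNN versions the installed PyTorch build (used by Ultralytics) expects.
print("Torch CUDA:", torch.version.cuda)
print("Torch cuDNN:", torch.backends.cudnn.version())
```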
```bash
git clone <repository-url>
cd Text-Extraction-Project
```

It is highly recommended to use a virtual environment (e.g., venv or conda):

```bash
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate
pip install -r Requirements.txt
```
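After installing the dependencies, a short sanity check (again a sketch, not one of the project notebooks) confirms that both PaddlePaddle and PyTorch can actually see the GPU before you open the notebooks:

```python
import paddle
import torch

# Verify that PaddlePaddle-GPU can run a small program on the GPU.
paddle.utils.run_check()

# Verify that PyTorch (used by Ultralytics/YOLO) sees the same device.
print("torch.cuda.is_available():", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```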
- Data Preparation: Organize your labeled dataset and update the data.yaml file with the correct paths for training, validation, and test sets.
- Run Training: Open and execute the YOLO Training.ipynb notebook. The training process will run for 100 epochs by default. The best model weights (best.pt) will be saved in the runs/obb/train/weights/ directory (see the sketch after this list).
- YOLO Conversion: Run the TensorRT conversion YOLO.ipynb notebook to convert the best.pt model into a .engine file for faster inference.
- PaddleOCR Conversion: Run the TensorRT conversion PaddleOCR.ipynb notebook to do the same for the PaddleOCR models.
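On the YOLO side, both the training run and the TensorRT export boil down to a few lines of the Ultralytics Python API. The following is a minimal sketch; the exact arguments used in the notebooks may differ, and `imgsz=640` and the `yolov8n-obb.pt` starting checkpoint are assumptions:

```python
from ultralytics import YOLO

# Train the oriented-bounding-box detector on the dataset described in data.yaml.
model = YOLO("yolov8n-obb.pt")            # pretrained OBB checkpoint as a starting point
model.train(data="data.yaml", epochs=100, imgsz=640)

# Export the best weights to a TensorRT engine for faster GPU inference.
best = YOLO("runs/obb/train/weights/best.pt")
best.export(format="engine", half=True)   # writes a .engine file next to best.pt
```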
- Configure Paths: Open the Untitled.ipynb notebook. This is the main script for processing images.
- Set Model Path: In the ProductLabelReader class, ensure the model_path points to your trained YOLO model (best.pt or the converted .engine file).
- Set Image Directory: Specify the path to the directory containing the images you want to process.
- Execute: Run all cells in the notebook. As sketched after this list, the script will:
  - Detect text regions in each image.
  - Crop these regions.
  - Use PaddleOCR to extract the text.
  - Clean the extracted text.
  - Save the final results in extracted_results/extracted_data.csv.
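Assuming the per-image detection-and-OCR loop (as sketched in the overview) has been wrapped to produce one dict of the four fields per image, the cleaning and CSV-writing steps reduce to a short pandas block. The cleaning rule and column order below are assumptions, not the notebook's exact logic:

```python
import pathlib
import pandas as pd

FIELDS = ["GTIN", "SR_NO", "LOT", "EXP"]

def clean(value: str) -> str:
    # Collapse whitespace; field-specific normalisation (e.g. date formats)
    # could be added here as well.
    return " ".join(str(value).split())

# `rows` is assumed to be filled by the per-image loop, one dict per image,
# e.g. {"filename": "label_001.jpg", "GTIN": "...", "SR_NO": "...", ...}.
rows: list[dict] = []

cleaned = [
    {"filename": r.get("filename", ""), **{f: clean(r.get(f, "")) for f in FIELDS}}
    for r in rows
]

pathlib.Path("extracted_results").mkdir(exist_ok=True)
pd.DataFrame(cleaned, columns=["filename"] + FIELDS).to_csv(
    "extracted_results/extracted_data.csv", index=False
)
```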
The primary output is extracted_results/extracted_data.csv, a table containing the filename and the extracted text for each of the four fields.
The YOLOv8-obb model was trained for 100 epochs and achieved a mean Average Precision (mAP50-95) of 0.6918, indicating good performance in detecting the text regions.
- Dependency Errors: The project currently fails during the OCR step due to issues with NVIDIA library paths. Errors like `RuntimeError: TensorRT dynamic library is not found` and `PreconditionNotMet: Could not find registered platform with id: 0x...` point to problems with the CUDA/cuDNN/TensorRT installation or environment variables.
- Solution: Carefully reinstall NVIDIA libraries, ensuring they are compatible with your PyTorch and PaddlePaddle versions. Make sure the library paths are correctly set in your system's environment variables.
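To narrow down which library is actually missing, a small diagnostic can attempt to load each library directly. This is a sketch assuming a Windows setup consistent with the `cudnn64_8.dll` error above; the names and version suffixes differ per installation and on Linux:

```python
import ctypes
import os

# DLLs referenced by the errors above; adjust the names/versions to match
# the CUDA/cuDNN/TensorRT builds installed on your machine.
libraries = ["cudnn64_8.dll", "nvinfer.dll", "cudart64_110.dll"]

for name in libraries:
    try:
        ctypes.CDLL(name)
        print(f"OK      {name}")
    except OSError as exc:
        print(f"MISSING {name}: {exc}")

# On Windows the loader searches PATH, so inspect it for the NVIDIA bin folders.
print(os.environ.get("PATH", ""))
```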
- Robust Error Handling: Implement more specific error handling to gracefully manage images where text cannot be found or read.
- Text Validation: Use regular expressions to validate the format of the extracted text (e.g., ensure EXP is a valid date format and GTIN contains only digits); see the sketch after this list.
- Environment Dockerization: Create a Dockerfile to encapsulate the entire environment, making it much easier to replicate and run the project without dependency headaches.
- Streamlined Scripting: Combine the Jupyter notebooks into a single, modular Python script that can be run from the command line with arguments for the image directory and model paths.
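For the text-validation item above, the checks could look roughly like this (a sketch; the exact patterns are assumptions and must be adapted to the real label formats):

```python
import re

# Illustrative patterns; tighten these to the actual label conventions.
PATTERNS = {
    "GTIN": re.compile(r"\d{14}"),                         # GTIN-14: digits only
    "SR_NO": re.compile(r"[A-Z0-9\-]{4,30}"),              # alphanumeric serial number
    "LOT": re.compile(r"[A-Z0-9\-]{1,20}"),                # alphanumeric lot code
    "EXP": re.compile(r"\d{4}-(0[1-9]|1[0-2])(-\d{2})?"),  # YYYY-MM or YYYY-MM-DD
}

def validate(field: str, value: str) -> bool:
    # True when the extracted value matches the expected shape for its field.
    pattern = PATTERNS.get(field)
    return bool(pattern and pattern.fullmatch(value.strip()))

print(validate("GTIN", "00312345678906"))  # True
print(validate("EXP", "2026-13"))          # False: 13 is not a valid month
```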