This script automates the process of Optical Character Recognition (OCR) on images and organises the resulting files into appropriate directories. The script utilises AWS Textract for OCR and performs several key steps including configuration setup, image selection, OCR processing, and file sorting.
- Folder Verification: Ensures that all required folders are present and configured correctly using the `CoreConfig` class.
- OCR Processing: Performs OCR on images located in the input folder using the `TextractOCR` class.
- File Sorting: Organises the processed OCR output files into appropriate directories using the `SortOCR` class.
- Error Handling: Implements retry logic for robust execution, with logging of any errors encountered during the process in either JSON or TXT format.
- ALTO XML Generation: Generates ALTO XML files after OCR processing.
- `CoreConfig`: Manages configuration settings and folder paths. Ensures all necessary folders are available.
- `TextractOCR`: Handles OCR operations using AWS Textract.
- `SortOCR`: Organises OCR output files into the correct directories.
- `LogActivities`: Manages logging activities, capturing both routine operations and errors.
- `AltoGenerator`: Generates ALTO XML files from OCR output.
- `LibNas`: Manages file transfers to and from a network-attached storage (NAS) system. The input and output locations can also be ordinary folders; they do not have to be on LIBNAS.
- `CheckEmptyFolder`: Monitors the status of folders to ensure they are not empty before processing.
- `JsonLogger`: Generates logs in JSON format.
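To illustrate how the OCR and sorting steps might fit together, here is a minimal sketch; it is not the project's actual code. It assumes the OCR step calls Textract's `detect_document_text` through boto3 and that sorting compares the mean confidence of the detected lines against a threshold. The function names, the 80.0 threshold, and the way results are mapped to folder names are hypothetical; the real values and logic live in `CoreConfig`, `TextractOCR`, and `SortOCR`.

```python
from statistics import mean

# Hypothetical threshold; the real value is provided by CoreConfig.
LOW_CONFIDENCE_THRESHOLD = 80.0


def detect_text(image_path):
    """Send one image to AWS Textract and return the raw response."""
    import boto3  # imported lazily so the helpers below work without AWS access

    # Credentials and region are taken from the environment.
    client = boto3.client("textract")
    with open(image_path, "rb") as image_file:
        return client.detect_document_text(Document={"Bytes": image_file.read()})


def mean_line_confidence(response):
    """Average the confidence (0-100) of all LINE blocks; 0.0 if none exist."""
    scores = [
        block["Confidence"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Confidence" in block
    ]
    return mean(scores) if scores else 0.0


def choose_folder(response, threshold=LOW_CONFIDENCE_THRESHOLD):
    """Pick a destination folder name for one OCR result (assumed rule)."""
    if not response.get("Blocks"):
        return "failed_ocr_folder"
    if mean_line_confidence(response) < threshold:
        return "low_confidence_folder"
    return "processed_folder"
```

Keeping the routing decision separate from the AWS call means the sorting rule can be checked against a saved response without network access.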
- Run the Script:
  - The script can be executed directly. It will automatically verify the necessary folders, process images using OCR, sort the results, and handle any errors with retry logic.
- OCR Process:
  - The script selects images from the input folder and processes them using AWS Textract.
  - The results are sorted and organised into folders as per the configuration.
- Retry Logic:
  - If errors are encountered, the script will log the error, wait for a specified delay, and retry the process up to a maximum number of retries.
- Finalisation:
  - After processing, the script returns the processed files to the NAS storage.
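The retry behaviour described above can be sketched roughly as follows. This is an illustrative pattern rather than the script's own implementation: `run_with_retries`, its parameters, and the logger name are hypothetical, and the actual retry limit and delay come from the configuration.

```python
import logging
import time

logger = logging.getLogger("ocr_pipeline")  # hypothetical logger name


def run_with_retries(task, max_retries=3, delay_seconds=5.0, sleep=time.sleep):
    """Run task(); on failure log the error, wait, and retry.

    The task is attempted once and then retried up to max_retries times.
    If every attempt fails, the last exception is re-raised.
    """
    failures = 0
    while True:
        try:
            return task()
        except Exception as error:
            failures += 1
            logger.error("Attempt %d failed: %s", failures, error)
            if failures > max_retries:
                raise
            sleep(delay_seconds)
```

Passing `sleep` as a parameter is a design choice for this sketch only: it lets the delay be replaced with a no-op when trying the function out.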
- Optionally, set up a virtual environment before installing the dependencies to avoid conflicts with other projects.
- Install all the libraries listed in the requirements.txt file, then run the script:

```bash
pip install -r requirements.txt
python main.py
```

The `CoreConfig` class manages the configuration settings and folder paths required for the OCR processing pipeline. It creates the necessary folders and provides configuration values used throughout the pipeline. Make changes to it with caution.
- Folder Verification: Ensures that all required folders exist and creates them if necessary.
- Configuration Management: Provides essential configuration settings such as retry limits, file extensions, and confidence thresholds.
- Customisable Paths: Allows customisation of input and output paths to suit your project's needs.
The CoreConfig class defines a set of folder paths required for the OCR process. These paths are returned as a dictionary by the requiredFolders method.
Below is the default folder structure created and managed by the CoreConfig class:
- `input_folder`: Path for input files, the entry point to the pipeline.
- `core_folders`: Base path for core folders, containing all subsequent folders.
- `logs_folder`: Path for log files.
- `json_folder`: Path for JSON files.
- `images_folder`: Path for image files.
- `json_sorter`: Path for sorted JSON files.
- `failed_folder`: Path for failed processing files.
- `images_sorter`: Path for sorted image files.
- `failed_ocr_folder`: Path for files with failed OCR.
- `processed_folder`: Path for successfully processed files.
- `low_confidence_folder`: Path for files with low OCR confidence.
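As a rough illustration of folder verification, the sketch below builds a name-to-path dictionary and creates whatever is missing. It is an assumption-laden example, not the `CoreConfig` source: the `required_folders` and `ensure_folders` functions are hypothetical, only a subset of the folder names is shown, and the nesting under `core_folders` is a guess at the layout.

```python
from pathlib import Path


def required_folders(base):
    """Return a name -> path mapping (hypothetical layout, subset of names)."""
    core = Path(base) / "core_folders"
    return {
        "input_folder": Path(base) / "input_folder",
        "core_folders": core,
        "logs_folder": core / "logs_folder",
        "processed_folder": core / "processed_folder",
        "low_confidence_folder": core / "low_confidence_folder",
    }


def ensure_folders(folders):
    """Create any missing folders and return the names that were created."""
    created = []
    for name, path in folders.items():
        if not path.is_dir():
            # parents=True also creates missing intermediate directories.
            path.mkdir(parents=True, exist_ok=True)
            created.append(name)
    return created
```

Calling `ensure_folders(required_folders("some/base/path"))` a second time returns an empty list, since every folder already exists.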
The CoreConfig class also includes specific paths for the input and output folders on the LIBNAS system. By default, these paths are set as follows:
"libnas_input": r"z:\\OCR outputs\\p155_kerby_miller\\Letters\\ocr_test\\input",
"libnas_output": r"z:\\OCR outputs\\p155_kerby_miller\\Letters\\ocr_test\\output",

Change these to match your desired paths on LIBNAS.