A robust, end-to-end pipeline for training/retraining a Tesseract OCR model specifically optimised for Ghana ECG (Electricity Company of Ghana) postpaid meter reading images.
- Project Structure
- Quick Start
- Makefile Reference
- Pipeline Phases
- Configuration
- Scripts Reference
- Notebooks
- Evaluation
- Troubleshooting
- System Overview
ecg-ocr-project/
├── Makefile # Convenient shortcuts for all pipeline operations (make help)
├── raw_images/ # Original meter reading photos
├── preprocessed/ # Cleaned/deskewed images ready for OCR
├── ground_truth/ # .gt.txt label files (paired with .tif)
├── augmented/ # Synthetically augmented images
├── eval_data/ # Holdout test set (auto-populated by 04_prepare_training_data.py)
├── corrections/ # Human-corrected samples for iterative retraining
├── results/ # Inference outputs, CSV reports, test_set.txt
├── models/
│   └── ecg_meter/
│       └── tessdata/ # Final .traineddata model installed here
├── tesstrain/ # git submodule; owns all training internals
│   └── data/
│       └── ecg_meter-ground-truth/ # ← 04_prepare_training_data.py writes here
├── ecg-meter-api/ # Production REST API
│   ├── app.py # Flask application
│   ├── ocr.py # OCR engine wrapper
│   └── model/ # Trained model (.traineddata)
├── scripts/ # All pipeline scripts
├── config/ # YAML configs
├── logs/ # Training logs, CER curves
├── notebooks/
│   └── ecg_ocr_analysis.ipynb # Interactive dataset analysis & preprocessing visualisation
├── tests/ # Unit tests
└── docs/ # Additional documentation
# 1. Clone the repo with the tesstrain submodule
git clone --recurse-submodules <your-repo-url>
cd ecg-ocr-project
# 2.1 Read docs/macos_setup.md
# 2.2 Install system and Python dependencies
bash scripts/install_dependencies.sh
# 3. Configure the project
cp config/config.example.yaml config/config.yaml
# Edit config/config.yaml with your paths
# 4. Preprocess your raw images
python scripts/01_preprocess.py --input raw_images/ --output preprocessed/
# 5. Label your images
# Option A: GUI-based (Label Studio)
python scripts/02_annotate.py --launch --images preprocessed/
# Option B: CLI-based (quick for small datasets)
python scripts/02_annotate.py --manual --images preprocessed/ --output ground_truth/
# 6. Augment the dataset
python scripts/03_augment.py --input preprocessed/ --output augmented/ --factor 5
# 7. Copy pairs into tesstrain's ground-truth directory
# This also splits data: train+val → tesstrain/, test → eval_data/
python scripts/04_prepare_training_data.py
# 8. Run fine-tuning via tesstrain
bash scripts/05_run_training.sh
# 9. Evaluate the model
python scripts/06_evaluate.py --model models/ecg_meter/tessdata/ecg_meter.traineddata
# (Optional) Add additional test images to eval_data/
python scripts/add_to_eval_data.py --image new_meter.jpg --interactive
# 10. (Optional) Iterative error correction
python scripts/08_iterative_correction.py --find-errors --threshold 0.1
python scripts/08_iterative_correction.py --corrections corrections/ --retrain --rounds 3
# 11. Run inference on new images
python scripts/07_inference.py --input raw_images/new/ --output results/

The Makefile wraps every pipeline step so you don't need to remember long command strings.
make help # Print all available targets

| Target | Description |
|---|---|
| make setup | Install all system + Python dependencies |
| make preprocess | Run 01_preprocess.py on raw_images/ → preprocessed/ |
| make preprocess-single | Preprocess a single image (prompts for path) |
| make preprocess-debug | Preprocess with debug visualisation |
| make annotate | Launch CLI annotation tool (02_annotate.py --manual) |
| make annotate-studio | Launch Label Studio GUI for annotation |
| make validate-gt | Validate all .gt.txt files for common errors |
| make augment | Run 03_augment.py with factor 5 |
| make augment-preview | Preview augmentations on a single image |
| make prepare | Run 04_prepare_training_data.py |
| make stats | Print dataset statistics without copying any files |
| make train | Run 05_run_training.sh |
| make train-resume | Resume training from the last checkpoint |
| make evaluate | Run 06_evaluate.py on eval_data/ |
| make evaluate-compare | Evaluate and compare against the base eng model |
| make infer | Run 07_inference.py on raw_images/ and save CSV output |
| make infer-json | Same as infer but output JSON |
| make add-eval | Preprocess and add a single image to eval_data/ (prompts for path) |
| make add-eval-dir | Preprocess and add a directory of images to eval_data/ (prompts for dir) |
| make correct | Run 3 rounds of iterative error correction and trigger retraining |
| make find-errors | Report high-CER samples (threshold 0.10) without correcting |
| make plot-curves | Plot CER/WER curves from the latest training log |
| make test | Run the test suite with pytest |
| make test-cov | Run tests with HTML coverage report |
| make full-pipeline | Shortcut: preprocess → augment → prepare → train → evaluate |
| make clean | Remove generated files (raw images + ground truth are preserved) |
| make clean-generated | Remove preprocessed/, augmented/, tesstrain/data/ |
| make clean-models | Remove model checkpoints (final .traineddata preserved) |
| make clean-results | Remove everything in results/ |

Converts raw meter photos into clean, normalised images:
- Automatic deskew & perspective correction
- Adaptive thresholding (handles glare, shadows)
- ROI (region of interest) extraction (isolates the meter display)
- Upscaling to ≥300 DPI equivalent
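
To make the adaptive thresholding step concrete, here is a minimal pure-Python sketch of mean adaptive thresholding. It is an illustration only: the function name `adaptive_mean_threshold` is hypothetical, the image is assumed to be a list of rows of 0-255 grayscale values, and the defaults mirror `adaptive_thresh_block_size: 11` and `adaptive_thresh_c: 2` from the config. How 01_preprocess.py actually implements this step is not shown in this README.

```python
def adaptive_mean_threshold(gray, block_size=11, c=2):
    """Binarise a grayscale image (list of rows of 0-255 ints).

    Each pixel is compared with the mean of its block_size x block_size
    neighbourhood minus c, so a shadowed region gets a lower local
    threshold than a region under glare.
    """
    if block_size < 3 or block_size % 2 == 0:
        raise ValueError("block_size must be an odd number >= 3")

    height, width = len(gray), len(gray[0])
    half = block_size // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0
            for dy in range(-half, half + 1):
                # Clamp indices so edge pixels are replicated at the border.
                yy = min(max(y + dy, 0), height - 1)
                for dx in range(-half, half + 1):
                    xx = min(max(x + dx, 0), width - 1)
                    total += gray[yy][xx]
            local_mean = total / (block_size * block_size)
            row.append(255 if gray[y][x] > local_mean - c else 0)
        result.append(row)
    return result
```

This loop is far too slow for real photos. With OpenCV, `cv2.adaptiveThreshold` performs the same kind of operation natively and takes the block size and constant as arguments.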
Ground truth labeling with validation:
- Label Studio GUI integration (--launch mode) for browser-based annotation
- Built-in CLI annotation tool (--manual mode) for quick terminal-based labeling
- Automatic format conversion to .gt.txt
- Quality checks on labels
Synthetic data generation to expand small datasets:
- Brightness/contrast variation
- Slight rotation & perspective warp
- Gaussian noise & blur
- Morphological degradation
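
As a rough illustration of two of these augmentations (brightness/contrast variation and Gaussian noise), the sketch below generates several variants of one grayscale image. The names `augment` and `augment_many`, the parameter ranges, and the reading of `--factor 5` as "five variants per source image" are all assumptions. Rotation, perspective warp, blur, and morphological degradation are not covered.

```python
import random


def augment(gray, rng, max_brightness=30, contrast_range=(0.8, 1.2), noise_sigma=5.0):
    """Return one randomly perturbed copy of a grayscale image.

    gray is a list of rows of 0-255 ints; rng is a random.Random instance.
    """
    alpha = rng.uniform(*contrast_range)                 # contrast gain
    beta = rng.uniform(-max_brightness, max_brightness)  # brightness shift
    augmented = []
    for row in gray:
        new_row = []
        for pixel in row:
            # Scale contrast around mid-grey, shift brightness, add noise.
            value = alpha * (pixel - 128) + 128 + beta + rng.gauss(0, noise_sigma)
            new_row.append(int(min(255, max(0, round(value)))))
        augmented.append(new_row)
    return augmented


def augment_many(gray, factor=5, seed=0):
    """Produce `factor` variants; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    return [augment(gray, rng) for _ in range(factor)]
```

Each variant would reuse the source image's .gt.txt label, since these transforms do not change the text shown on the meter.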
Converts annotated images to Tesseract-ready format:
- Paired .tif + .gt.txt files
- Train/validation/test split (80/10/10 by default)
- Training pairs → tesstrain/data/<model>-ground-truth/
- Test pairs → eval_data/ (held out for unbiased evaluation)
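
The 80/10/10 split can be sketched as follows. This is a minimal illustration, not the code in 04_prepare_training_data.py: the function name `split_pairs`, the seed, and the use of file stems as sample identifiers are assumptions.

```python
import random


def split_pairs(stems, ratios=(0.80, 0.10, 0.10), seed=42):
    """Split sample stems (e.g. 'meter_001') into train/val/test lists."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1.0")

    # Sort first so the result depends only on the seed, not on listing order.
    shuffled = sorted(stems)
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # remainder goes to the holdout set
    }
```

If augmented variants are split independently of their source image, near-duplicates of a test image can end up in the training set. Splitting on the source image's stem and then assigning its variants to the same subset avoids that.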
LSTM fine-tuning from eng base model via tesstrain:
- Configurable iterations and learning rate
- Checkpoint saving
- Real-time CER monitoring
Comprehensive metrics:
- Character Error Rate (CER)
- Word Error Rate (WER)
- Field-level (reading, account number) accuracy
- Confusion matrix for misread characters
Production-ready OCR with:
- Character whitelist for meter domains
- Post-processing validation (regex + business rules)
- Confidence thresholding & human review flagging
- CSV/JSON output
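
The post-processing steps above might look roughly like this. The whitelist and the threshold of 60 come from config/config.yaml. The five-digit reading pattern is a hypothetical rule inferred from the example API response ("00005"), and `postprocess` is an illustrative name, not the function used in 07_inference.py.

```python
import re

# Whitelist copied from the ocr.whitelist setting in config/config.yaml.
WHITELIST = set("0123456789.-kWhKWH/ABCDEFGHIJKLMNOPQRSTUVWXYZ ")

# Hypothetical business rule: a reading is five digits, optionally
# followed by one decimal digit.
READING_RE = re.compile(r"\b\d{5}(?:\.\d)?\b")


def postprocess(raw_text, confidence, confidence_threshold=60):
    """Filter OCR text, extract readings, and flag doubtful results."""
    cleaned = "".join(ch for ch in raw_text if ch in WHITELIST).strip()
    readings = READING_RE.findall(cleaned)
    # Flag for human review when confidence is low or nothing valid was found.
    flagged = confidence < confidence_threshold or not readings
    return {
        "raw_text": cleaned,
        "meter_readings": readings,
        "confidence": confidence,
        "flagged": flagged,
    }
```

For example, `postprocess("00005\n", 95.2)` returns a result with `meter_readings == ["00005"]` and `flagged == False`, while the same text at confidence 40 is flagged.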
Continuous model improvement through error analysis:
- Identify worst-performing samples from evaluation
- Present failures for human review and correction
- Add corrected samples to corrections/ directory
- Merge corrections into training set
- Trigger incremental retraining cycles
- Re-evaluate and track CER improvements
- Repeat until target accuracy achieved
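
The first and last steps of this loop reduce to two small checks, sketched below. The names `find_errors` and `reached_target` and the dictionary input are assumptions for illustration. The 0.10 threshold matches `--threshold 0.1` and the 0.02 target matches `target_cer` in the config.

```python
def find_errors(per_image_cer, threshold=0.10):
    """Return (image, cer) pairs above the threshold, worst first.

    per_image_cer maps an image name to its character error rate (0.0-1.0).
    """
    failures = [(name, cer) for name, cer in per_image_cer.items() if cer > threshold]
    return sorted(failures, key=lambda item: item[1], reverse=True)


def reached_target(per_image_cer, target_cer=0.02):
    """True when the mean CER over the evaluated images meets the target."""
    if not per_image_cer:
        return False
    mean_cer = sum(per_image_cer.values()) / len(per_image_cer)
    return mean_cer <= target_cer
```

Reviewing the worst samples first puts human correction effort where it is likely to reduce CER the most.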
Production REST API for real-time OCR:
- Flask-based HTTP endpoint
- Image upload via POST multipart/form-data
- Automatic preprocessing pipeline
- JSON response with readings, confidence scores
- Health check endpoint
The ecg-meter-api/ directory contains a lightweight Flask API for production deployment.
cd ecg-meter-api
python app.py

The API will start on http://127.0.0.1:5000
POST /read-meter
- Upload meter image and get OCR results
- Content-Type: multipart/form-data
- Field name: image
- Supported formats: .jpg, .jpeg, .png, .tif, .tiff, .bmp
- Max size: 10MB
Example using curl:
curl -X POST http://127.0.0.1:5000/read-meter \
  -F "image=@/path/to/meter.jpg"

Response:
{
"raw_text": "00005",
"meter_readings": ["00005"],
"meter_serial": [],
"account_numbers": [],
"dates": [],
"confidence": 95.2,
"flagged": false
}

GET /health
- Health check endpoint
- Returns: {"status": "ok"}
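
For callers that prefer Python to curl, the following standard-library sketch uploads an image to POST /read-meter and decides whether the result needs human review. The endpoint, the `image` field name, and the response keys come from the description above. The helper names (`build_multipart`, `read_meter`, `needs_review`) and the 30-second timeout are assumptions.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field, filename, payload):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header).
    """
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def read_meter(image_path, url="http://127.0.0.1:5000/read-meter"):
    """Upload an image to the running API and return the parsed JSON."""
    path = Path(image_path)
    body, content_type = build_multipart("image", path.name, path.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())


def needs_review(result):
    """Treat flagged or empty results as requiring a human check."""
    return bool(result.get("flagged")) or not result.get("meter_readings")
```

With the server running, `read_meter("/path/to/meter.jpg")` should return a dictionary shaped like the example response above.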
For production deployment, use a WSGI server like Gunicorn:
pip install gunicorn
gunicorn -w 4 -b 0.0.0.0:8000 app:app

Or use Docker (see ecg-meter-api/Dockerfile if available).
Edit config/config.yaml:
model:
  name: ecg_meter
  base_model: eng
  max_iterations: 10000
  learning_rate: 0.0001
  target_cer: 0.02
training:
  split: [0.80, 0.10, 0.10]
  augmentation_factor: 5
preprocessing:
  target_dpi: 300
  adaptive_thresh_block_size: 11
  adaptive_thresh_c: 2
  min_image_width: 800
ocr:
  psm: 6
  oem: 1
  whitelist: "0123456789.-kWhKWH/ABCDEFGHIJKLMNOPQRSTUVWXYZ "
  confidence_threshold: 60

| Script | Purpose |
|---|---|
| 01_preprocess.py | Image cleaning, deskew, ROI extraction |
| 02_annotate.py | Label Studio launcher + format converter |
| 03_augment.py | Synthetic data augmentation |
| 04_prepare_training_data.py | Build tesstrain-ready file pairs |
| 05_run_training.sh | Execute tesstrain fine-tuning |
| 06_evaluate.py | Full evaluation with metrics |
| 07_inference.py | Production inference pipeline |
| 08_iterative_correction.py | dshea89-style error correction loop |
| add_to_eval_data.py | Add new images to eval_data/ for testing |
| install_dependencies.sh | System + Python dependency installer |
| plot_training_curves.py | Visualise CER/WER over training |

An interactive Jupyter notebook for exploring the dataset, visualising the preprocessing pipeline, and inspecting OCR results.
# Open in VS Code
code notebooks/ecg_ocr_analysis.ipynb
# Or launch Jupyter in the browser
jupyter lab notebooks/ecg_ocr_analysis.ipynb

What it covers:
- Dataset overview: image counts across preprocessed/, augmented/, ground_truth/, and tesstrain/data/ecg_meter-ground-truth/
- Step-by-step preprocessing visualisation (resize → deskew → perspective correction → threshold → denoise → ROI)
- Ground truth text distribution and length analysis
- OCR results comparison across different pipeline stages
The 06_evaluate.py script provides comprehensive model evaluation with detailed metrics.
# Evaluate on the holdout test set (populated by 04_prepare_training_data.py)
python scripts/06_evaluate.py --model models/ecg_meter/tessdata/ecg_meter.traineddata
# Compare custom model against base eng model
python scripts/06_evaluate.py \
--model models/ecg_meter/tessdata/ecg_meter.traineddata \
  --compare

Note: The eval_data/ directory contains your held-out test set (10% of data by default), which was never seen during training. This gives an unbiased estimate of model performance.
To test your model on completely new images (not from your original dataset):
# Add a single image (interactive annotation)
python scripts/add_to_eval_data.py --image new_meter.jpg --interactive
# Add a directory of images with existing ground truth files
python scripts/add_to_eval_data.py \
--input new_test_images/ \
--gt-dir ground_truth_labels/
# Add images without annotation (for inference testing only)
python scripts/add_to_eval_data.py \
--input new_test_images/ \
  --no-annotation

This preprocesses the images and saves them to eval_data/ in the correct format (.tif + .gt.txt).
--test-dir DIR Test images directory (default: eval_data/)
--gt-dir DIR Ground truth directory (default: ground_truth/)
--model FILE Path to .traineddata model file
--lang LANG Tesseract language code (default: eng)
--output DIR Output directory for reports (default: results/)
--compare Compare against base eng model
--config FILE Path to config.yaml (default: config/config.yaml)
- Character Error Rate (CER): Levenshtein distance at character level
- Word Error Rate (WER): Levenshtein distance at word level
- Field-level accuracy: Meter reading, account number, date extraction
- Per-character confusion matrix: Which characters are commonly misread
- Confidence score distribution: OCR confidence levels
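
As a reference for how CER and WER are commonly computed, the sketch below divides the Levenshtein distance by the length of the reference text. That normalisation is the usual convention and is assumed here. The exact formula in 06_evaluate.py is not shown in this README, and the function names are illustrative.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return levenshtein(ref_words, hyp_words) / len(ref_words)
```

For a ground truth of "00005" read as "00006", one of five characters is wrong, so `cer` returns 0.2. Because the hypothesis can be longer than the reference, both rates can exceed 1.0.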
All evaluation results are saved to results/:
- evaluation_report.csv: Detailed per-image metrics
- confusion_matrix.png: Character error visualization
- cer_distribution.png: CER histogram across test set
- Console summary with rich tables
| CER Range | Quality | Action |
|---|---|---|
| < 1% | Excellent | Deploy to production |
| 1-3% | Good | Optional iterative correction |
| 3-5% | Acceptable | Run iterative correction |
| > 5% | Poor | Review preprocessing & add more training data |
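
The table can be turned into a small helper for scripts that need to act on an evaluation result. The function name `interpret_cer` is hypothetical. The table does not say which band an exact boundary value belongs to, so this sketch assumes 1% counts as Good and both 3% and 5% count as Acceptable.

```python
def interpret_cer(cer):
    """Map a CER (fraction, e.g. 0.02 for 2%) to (quality, action)."""
    if cer < 0.01:
        return ("Excellent", "Deploy to production")
    if cer < 0.03:
        return ("Good", "Optional iterative correction")
    if cer <= 0.05:
        return ("Acceptable", "Run iterative correction")
    return ("Poor", "Review preprocessing & add more training data")
```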
Low accuracy on specific meter brands
Add more images of that meter type. ECG uses Actaris, Landis+Gyr, and Conlog meters, each with different digit fonts.
High CER despite large dataset
Check preprocessing: glare, perspective distortion, and low contrast are the
most common culprits. Run scripts/01_preprocess.py --debug to visualise each
preprocessing step.
Training loss not decreasing
Lower learning_rate to 0.00005. If the model is overfitting instead (training loss keeps falling while validation CER rises), reduce max_iterations.
Tesseract not found
Run bash scripts/install_dependencies.sh and ensure the directory containing the
tesseract binary (for example /usr/bin) is on your PATH.
This project follows a structured OCR pipeline:
- Image preprocessing
- Annotation & augmentation
- Model training (tesstrain)
- Inference & evaluation
Detailed diagrams are available in /docs.
The flowchart below shows how a user works through the ECG OCR system end-to-end.
flowchart TD
U[User / Researcher]
U --> A[Put Images in raw_images/]
A --> B[01_preprocess.py]
B --> C[preprocessed/]
C --> D[02_annotate.py]
D --> E[ground_truth/]
E --> F[03_augment.py]
F --> G[augmented/]
G --> H[04_prepare_training_data.py]
H --> I[05_run_training.sh]
I --> J[models/]
J --> K{Choose Inference Method}
K -->|Batch Processing| L[07_inference.py]
K -->|Production API| M[ecg-meter-api/]
L --> N[results/]
M --> O[Real-time OCR via HTTP]
N --> P[06_evaluate.py]
P --> Q{Acceptable CER?}
Q -->|No| R[08_iterative_correction.py]
R --> S[corrections/]
S -->|Merge & Retrain| G
Q -->|Yes| T[Reports / Plots]
T --> U