This is a library for processing scanned or digital EEO-1 and EEO-5 PDF reports as required by Chapter 141 of the Acts of 2024 (Massachusetts Salary Range Transparency Law).
The repository also includes tools for post-processing, data aggregation and analysis of the extracted data.
The repository provides custom parsing logic for each form type (e.g., EEO-1, EEO-5), which reduces parsing errors and improves extraction accuracy.
The pipeline processes EEO-1 and EEO-5 forms (PDF or images) in batches and includes the following stages:
1. Preprocessing – Enhances image quality for improved OCR performance.
2. Optical Character Recognition – Extracts text using a deep learning-based OCR engine.
3. Postprocessing & Parsing – Segments content into structured fields.
4. Validation & Cleaning – Validates extracted data (e.g., zip codes, city names) against public datasets.
5. Aggregation & Analysis – Groups, aggregates, and exports data for reporting.
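The validation and cleaning stage can be illustrated with a minimal sketch. It is not the repository's actual code: the `KNOWN_CITIES` lookup, the `clean_zip` and `clean_city` helpers, the OCR character substitutions, and the 0.8 similarity cutoff are all assumptions made for illustration. A real run would load a public ZIP code and city dataset instead of the tiny inline dictionary.

```python
from __future__ import annotations

import difflib
import re

# Hypothetical reference data; stands in for a public ZIP/city dataset.
KNOWN_CITIES = {"02108": "Boston", "01002": "Amherst", "01608": "Worcester"}

# Accepts 5-digit ZIP codes with an optional ZIP+4 suffix.
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")

# Assumed OCR confusions: letters that are often misread digits.
OCR_DIGIT_FIXES = str.maketrans("OoIlS", "00115")


def clean_zip(raw: str) -> str | None:
    """Return a normalized 5-digit ZIP code, or None if it cannot be repaired."""
    fixed = raw.strip().translate(OCR_DIGIT_FIXES)
    if not ZIP_PATTERN.match(fixed):
        return None
    return fixed[:5]


def clean_city(raw: str, zip_code: str | None = None) -> str | None:
    """Correct a city name by fuzzy-matching it against the reference list."""
    candidates = sorted(set(KNOWN_CITIES.values()))
    match = difflib.get_close_matches(
        raw.strip().title(), candidates, n=1, cutoff=0.8
    )
    if match:
        return match[0]
    # Fall back to the city implied by a valid ZIP code, if there is one.
    return KNOWN_CITIES.get(zip_code) if zip_code else None
```

Fuzzy matching is used here because OCR errors in names tend to be single-character substitutions, which a similarity cutoff tolerates while still rejecting unrelated strings.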
Component | Tool Used | Description |
---|---|---|
Image Preprocessing | OpenCV, PIL | Deduplication, formatting, scaling, padding |
OCR Engine | DocTR | Handwritten and printed text recognition |
Postprocessing | Python | Field extraction, form segmentation, validation and correction |
Data Aggregation | Pandas | CSV/JSON parsing, group-by, statistical summaries |
Each component is built as an independent, reusable module, facilitating extensibility and debugging.
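One way to picture independent, reusable modules is as plain functions chained by a small runner. The sketch below is an assumption about how such a design could look, not the repository's actual structure: `Pipeline`, `fake_ocr`, and `parse_fields` are hypothetical names, and the OCR stage returns canned text instead of calling a real engine.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

# A stage takes a record (a dict) and returns an updated record.
Stage = Callable[[dict], dict]


@dataclass
class Pipeline:
    """Runs named stages in order and records which stages were applied."""

    stages: list[tuple[str, Stage]] = field(default_factory=list)

    def add(self, name: str, stage: Stage) -> "Pipeline":
        self.stages.append((name, stage))
        return self

    def run(self, record: dict) -> dict:
        for name, stage in self.stages:
            record = stage(record)
            # Keep a trail of completed stages to help with debugging.
            record.setdefault("history", []).append(name)
        return record


def fake_ocr(record: dict) -> dict:
    """Placeholder OCR stage that returns canned text (hypothetical)."""
    return {**record, "text": "CITY: Boston\nZIP: 02108"}


def parse_fields(record: dict) -> dict:
    """Split 'KEY: value' lines into a dictionary of fields."""
    fields = dict(line.split(": ", 1) for line in record["text"].splitlines())
    return {**record, "fields": fields}


pipeline = Pipeline().add("ocr", fake_ocr).add("parse", parse_fields)
result = pipeline.run({"source": "form_001.pdf"})
print(result["fields"])
```

Because each stage only depends on the record it receives, a stage can be swapped out or tested in isolation without touching the others.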
- Ubuntu 22.04.5 LTS
- Python ≥ 3.10.12
- Dependencies listed in `requirements.txt`
- Offline OCR models downloaded and stored locally
1. Clone the repository: `git clone`
2. Create and activate a virtual environment: `python3 -m venv venv`, then `source venv/bin/activate`
3. Install dependencies: `pip install -r requirements.txt`
4. (Optional) Deactivate when done: `deactivate`
The entire pipeline is designed to run in an air-gapped environment. All models and tools used are available offline after the initial setup.
For this project, we evaluated three OCR tools: Tesseract, Nougat and DocTR.
OCR Engine | Accuracy | Supports Complex Layouts | Supports Handwritten Forms | Comments |
---|---|---|---|---|
Tesseract | Medium accuracy | No | No | Cannot distinguish form borders or handle complex layouts |
Donut | High accuracy on plain-text documents | Yes | Yes | Requires a GPU for reasonable inference time; needs to be fine-tuned for each task |
DocTR | High accuracy across varied document types | Yes | No | Limited support for handwritten forms |
Because the EEO-1 and EEO-5 forms are structured with complex layouts and we only have CPU resources, we chose DocTR as the OCR engine.
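For orientation, here is a minimal sketch of running DocTR on a PDF and flattening its output into text lines. It uses DocTR's default pretrained predictor, which may differ from the models and settings this repository configures. `run_doctr` and `export_to_lines` are hypothetical helper names. In an air-gapped setup, the model weights would need to be present in the local cache before the first call.

```python
from __future__ import annotations


def run_doctr(pdf_path: str) -> dict:
    """Run DocTR's default pretrained OCR pipeline on a PDF and return its export."""
    # Imported lazily so export_to_lines can be used without DocTR installed.
    from doctr.io import DocumentFile
    from doctr.models import ocr_predictor

    model = ocr_predictor(pretrained=True)  # default detection + recognition models
    document = DocumentFile.from_pdf(pdf_path)
    return model(document).export()


def export_to_lines(export: dict) -> list[str]:
    """Flatten a DocTR export (pages > blocks > lines > words) into text lines."""
    lines = []
    for page in export.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                words = line.get("words", [])
                lines.append(" ".join(word["value"] for word in words))
    return lines
```

A typical call would be `export_to_lines(run_doctr("form.pdf"))`, where `form.pdf` is a placeholder path. The flattened lines can then be passed to the form-specific parsing logic.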
- Handwritten data variability – Especially problematic with cursive or non-standard characters
- Integrate a layout detection model for dynamic form segmentation
- Expand support for additional form types and formats
- Train a custom OCR model fine-tuned on EEO forms
- Build an interactive viewer for browsing OCR results
Jida Li: https://github.com/jidalii
Haodong Xu: https://github.com/chuckhxu
Rohit Vemparala: https://github.com/RVKarmani