EEO Toolkit

1. Introduction

This is a library for processing scanned or digital EEO-1 and EEO-5 PDF reports, as required by Chapter 141 of the Acts of 2024 (Massachusetts Salary Range Transparency Law).

The repository also includes tools for post-processing, data aggregation and analysis of the extracted data.

The repository provides custom parsing logic for each form type (e.g., EEO-1, EEO-5), which reduces parsing errors and improves extraction accuracy.
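One common way to organize per-form parsing logic is a small registry that maps a form type to its parser. The sketch below is illustrative only: `PARSERS`, `register`, `parse_eeo1`, `parse_eeo5`, and `parse_form` are hypothetical names, not this repository's API, and the parser bodies are placeholders.

```python
from typing import Callable, Dict

# Hypothetical registry: form type -> parser function (not the repository's API).
PARSERS: Dict[str, Callable[[str], dict]] = {}


def register(form_type: str):
    """Decorator that records a parser under the given form type."""
    def decorator(func: Callable[[str], dict]) -> Callable[[str], dict]:
        PARSERS[form_type] = func
        return func
    return decorator


@register("EEO-1")
def parse_eeo1(text: str) -> dict:
    # Placeholder logic: a real parser would segment the EEO-1 layout.
    return {"form_type": "EEO-1", "line_count": len(text.splitlines())}


@register("EEO-5")
def parse_eeo5(text: str) -> dict:
    # Placeholder logic: a real parser would segment the EEO-5 layout.
    return {"form_type": "EEO-5", "line_count": len(text.splitlines())}


def parse_form(form_type: str, text: str) -> dict:
    """Dispatch OCR text to the parser registered for its form type."""
    try:
        parser = PARSERS[form_type]
    except KeyError:
        raise ValueError(f"No parser registered for form type: {form_type!r}") from None
    return parser(text)
```

With this layout, supporting a new form type means adding one decorated function; the dispatch code stays unchanged.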

2. Library Overview

The pipeline processes EEO-1 and EEO-5 forms (PDF or images) in batches and includes the following stages:

1. Preprocessing – Enhances image quality for improved OCR performance.
2. Optical Character Recognition – Extracts text using a deep learning-based OCR engine.
3. Postprocessing & Parsing – Segments content into structured fields.
4. Validation & Cleaning – Validates extracted data (e.g., zip codes, city names) against public datasets.
5. Aggregation & Analysis – Groups, aggregates, and exports data for reporting.
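As an illustration of the Validation & Cleaning stage, the following minimal sketch checks a ZIP code against a reference table and snaps a misread city name to its closest known spelling. It assumes a tiny in-memory sample in place of the public datasets; `KNOWN_ZIPS`, `validate_zip`, and `correct_city` are hypothetical names, and fuzzy matching via `difflib` is one possible correction technique, not necessarily the one used here.

```python
import difflib
import re
from typing import Optional

# Hypothetical sample; a real run would load these from a public dataset.
KNOWN_ZIPS = {"02108": "Boston", "02139": "Cambridge", "01608": "Worcester"}
KNOWN_CITIES = sorted(set(KNOWN_ZIPS.values()))

# Accepts 5-digit ZIP codes and ZIP+4 codes.
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")


def validate_zip(raw: str) -> Optional[str]:
    """Return the 5-digit ZIP if it is well formed and known, else None."""
    candidate = raw.strip()
    if not ZIP_RE.match(candidate):
        return None
    zip5 = candidate[:5]
    return zip5 if zip5 in KNOWN_ZIPS else None


def correct_city(raw: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the closest known city name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(
        raw.strip().title(), KNOWN_CITIES, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None
```

A high `cutoff` keeps the correction conservative, so an OCR slip such as `Cambrldge` is repaired while unrelated strings are left for manual review.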

Pipeline Components

| Component | Tool Used | Description |
| --- | --- | --- |
| Image Preprocessing | OpenCV, PIL | Deduplication, formatting, scaling, padding |
| OCR Engine | DocTR | Handwritten and printed text recognition |
| Postprocessing | Python | Field extraction, form segmentation, validation and correction |
| Data Aggregation | Pandas | CSV/JSON parsing, group-by, statistical summaries |

Each component is built as an independent, reusable module, facilitating extensibility and debugging.
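To show what independent, reusable stages can look like in practice, here is a minimal sketch in which every stage is a plain function taking and returning the same container. All names (`Document`, `fake_ocr`, `parse_fields`, `run_pipeline`) are hypothetical, and `fake_ocr` is a stand-in that does no real recognition.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical container passed from stage to stage."""
    source: str
    text: str = ""
    fields: Dict[str, str] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)


Stage = Callable[[Document], Document]


def fake_ocr(doc: Document) -> Document:
    # Stand-in for the OCR stage: treat the source string as recognized text.
    doc.text = doc.source
    return doc


def parse_fields(doc: Document) -> Document:
    # Split "key: value" lines into a dictionary of fields.
    for line in doc.text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            doc.fields[key.strip().lower()] = value.strip()
    return doc


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order and record which stages ran."""
    for stage in stages:
        doc = stage(doc)
        doc.history.append(stage.__name__)
    return doc
```

Because each stage shares one signature, a stage can be tested alone or swapped out (for example, replacing `fake_ocr` with a real OCR call) without touching the others.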

3. How to Use

3.1 Requirements

Ubuntu 22.04.5 LTS
Python ≥ 3.10.12
Dependencies listed in requirements.txt
Offline OCR models downloaded and stored locally
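A short preflight check can confirm these requirements before a batch run. The sketch below assumes nothing about where the models live: `MIN_PYTHON` mirrors the version listed above, while `python_ok` and `missing_files` are hypothetical helpers and the model paths must be supplied by the caller.

```python
import sys
from pathlib import Path
from typing import Iterable, List, Optional, Tuple

# Minimum interpreter version taken from the requirements list.
MIN_PYTHON: Tuple[int, int, int] = (3, 10, 12)


def python_ok(
    version: Optional[Tuple[int, int, int]] = None,
    minimum: Tuple[int, int, int] = MIN_PYTHON,
) -> bool:
    """Return True if the given (or running) Python version meets the minimum."""
    current = tuple(version) if version is not None else tuple(sys.version_info[:3])
    return current >= minimum


def missing_files(paths: Iterable[str]) -> List[str]:
    """Return the subset of paths that do not exist on disk."""
    return [str(p) for p in map(Path, paths) if not p.exists()]
```

Calling `missing_files` with the expected local model paths gives a quick list of anything that still has to be downloaded before moving to an offline machine.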

3.2 Steps

Clone the repository:
```
git clone
```

Create and activate a virtual environment:
```
python3 -m venv venv
source venv/bin/activate
```

Install dependencies:
```
pip install -r requirements.txt
```
(Optional) Deactivate when done:
```
deactivate
```

4. OCR Tools Explored

The entire pipeline is designed to run in an air-gapped environment. All models and tools used are available offline after the initial setup.

For this project, we explored three OCR tools: Tesseract, Nougat and DocTR.

| OCR Engine | Accuracy | Supports Complex Layouts | Supports Handwritten Forms | Comments |
| --- | --- | --- | --- | --- |
| Tesseract | Medium accuracy | No | No | Cannot distinguish form borders or handle complex layouts |
| Donut | High accuracy on plaintext documents | Yes | Yes | Requires a GPU for reasonable inference time; needs to be fine-tuned for different tasks |
| DocTR | High accuracy across varied document types | Yes | No | Limited support for handwritten forms |

Because the EEO-1 and EEO-5 forms are structured with a complex layout, and we only have CPU resources, we chose DocTR as the OCR engine.

5. Limitations and Challenges

Handwritten data variability – Especially problematic with cursive or non-standard characters

6. Ongoing Work

Integrate a layout detection model for dynamic form segmentation
Expand support for additional form types and formats
Train a custom OCR model fine-tuned on EEO forms
Build an interactive viewer for browsing OCR results

Authors

Jida Li: https://github.com/jidalii

Haodong Xu: https://github.com/chuckhxu

Rohit Vemparala: https://github.com/RVKarmani
