Commit 264e0f5

Use settings.toml to configure LLM OCR reader

- Implement UV as the python build management tool

2 parents: 2ccb251 + 795229a

26 files changed: +3442 −208 lines

.github/workflows/main.yml

Lines changed: 24 additions & 24 deletions
```diff
@@ -5,37 +5,37 @@ name: Python application

 on:
   push:
-    branches: [ "main" ]
+    branches: ["main"]
   pull_request:
-    branches: [ "main" ]
+    branches: ["main"]

 permissions:
   contents: read

 jobs:
   build:
-
     runs-on: ubuntu-latest

     steps:
-      - uses: actions/checkout@v4
-      - name: Set up Python 3.12
-        uses: actions/setup-python@v5
-        with:
-          python-version: "3.12"
-      - name: Display Python version
-        run: python -c "import sys; print(sys.version)"
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          pip install flake8 pytest pytest-cov
-          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
-      - name: Lint with flake8
-        run: |
-          # stop the build if there are Python syntax errors or undefined names
-          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
-          # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
-          flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
-      - name: Test with pytest
-        run: |
-          pytest --cov app tests/ --cov-report xml --cov-report html --cov-report term
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: "Set up Python"
+        uses: actions/setup-python@v5
+        with:
+          python-version-file: "pyproject.toml"
+
+      - name: Install the project
+        run: uv sync --all-extras --dev
+
+      - name: Display Python version
+        run: python -c "import sys; print(sys.version)"
+
+      - name: Lint with ruff
+        run: |
+          uv run ruff check app tests
+
+      - name: Run tests with coverage
+        run: |
+          uv run pytest tests --cov app tests/ --cov-report term
```

README.md

Lines changed: 17 additions & 22 deletions
````diff
@@ -40,7 +40,7 @@ The goal of the Ballot Initiative project is to reduce the manual labor involved

 ![Core Algorithm](app/ballot_initiative_schematic.png)

-1. **Extraction:** Forms in PDF format are processed through an OCR engine (using [gpt-4o-mini](https://platform.openai.com/docs/models/gpt-4o-mini)) to crop text sections and extract data.
+1. **Extraction:** Forms in PDF format are processed through an OCR engine (using generative AI) to crop text sections and extract data.

 2. **Identification:** The engine identifies and extracts key information (tailored to DC Ballot Initiatives) related to validating signatures:

@@ -63,10 +63,14 @@ An alternate approach to get up and running is to use [Github Codespaces](https:

 ### Prerequisites

-- Python 3.12
-- OpenAI API key[^1]
+- [Python 3.12+](https://wiki.python.org/moin/BeginnersGuide/Download)
+- [UV](https://docs.astral.sh/uv/getting-started/installation/) for building the project
+- API keys for at least one of the following[^1]:
+  - [OpenAI API key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key)
+  - [Mistral API key](https://docs.mistral.ai/getting-started/quickstart/)
+  - [Gemini API key](https://ai.google.dev/gemini-api/docs/api-key)

-[^1]: The OpenAI free tier has a low rate limit. To increase the rate limit, you'll have to have a form of payment on your OpenAI account. [See this page for details](https://platform.openai.com/docs/guides/rate-limits?tier=tier-one)
+[^1]: The free tiers for these services typically have a low rate limit that can cause issues. Many services require adding a payment method to your account to increase rate limits. Please verify your account settings and usage limits before running the application.

 - PDF files of ballot initiative signatures
   - Use fake data in [`sample_data/fake_signed_petitions.pdf`](sample_data/fake_signed_petitions.pdf) folder to test.
@@ -86,8 +90,8 @@ cd ballot-initiative
 2. Create and activate a virtual environment:

    ```bash
-   # Create virtual environment
-   python -m venv venv
+   # Initialise project and install dependencies
+   uv sync --all-extras --dev

    # Activate virtual environment
    # On Windows:
@@ -96,29 +100,20 @@ venv\Scripts\activate
    source venv/bin/activate
    ```

-3. Install dependencies:
-
-   ```bash
-   pip install -r requirements.txt
-   ```
-
-4. Set up your environment:
-   - Create a `.env` file in the project root folder.
-   - Replicate the format shown in the `.env.example` file.
-   - [Get an OpenAI API key](https://www.howtogeek.com/885918/how-to-get-an-openai-api-key/) if you don't have one
-   - Add your OpenAI API key to the `.env` file:
-     ```
-     OPENAI_API_KEY=<YOUR_API_KEY>
-     ```
+3. Configure and save settings:
+   - Make a copy of the `settings.example.toml` file and rename it to `settings.toml`.
+   - Add your GenAI API key to the `api_key` field of the selected model.
+   - Add the name of the model to the `model` field, e.g. `mistral-small-latest` or `gpt-4o-mini`.

 ### Running the Application

 1. Start the Streamlit app:

    ```bash
-   streamlit run app/Home.py
+   uv run main.py
    ```

 2. Upload your files:
    - PDF of signed petitions
    - Voter records file
@@ -131,7 +126,7 @@ streamlit run app/Home.py
 3. Run the following command:

    ```bash
-   python pytest
+   uv run pytest
    ```

 ## Project Documentation
````
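The `settings.example.toml` file referenced in the README hunk above is not shown in this commit view; a plausible sketch of its layout, assuming one table per provider with the `api_key` and `model` fields the instructions mention (field grouping and table names are guesses — check `settings.example.toml` in the repository for the authoritative format):

```toml
# Hypothetical settings.toml layout — inferred from the README steps
# ("api_key" and "model" fields per provider); not taken from the commit.
[openai]
model = "gpt-4o-mini"
api_key = "sk-..."

[mistral]
model = "mistral-small-latest"
api_key = "..."
```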

app/__init__.py

Whitespace-only changes.

app/fuzzy_match_helper.py

Lines changed: 0 additions & 1 deletion
```diff
@@ -9,7 +9,6 @@
 import pandas as pd
 import numpy as np
 from concurrent.futures import ThreadPoolExecutor
-import streamlit as st
 import logging
 from datetime import datetime
```

app/ocr/__init__.py

Lines changed: 3 additions & 0 deletions
```diff
@@ -0,0 +1,3 @@
+from .ocr_client_factory import extract_from_encoding_async
+
+
+__all__ = ["extract_from_encoding_async"]
```

app/ocr/ocr_client_factory.py

Lines changed: 117 additions & 0 deletions
```python
from typing import List, Optional
from langchain_openai import ChatOpenAI
from langchain_mistralai import ChatMistralAI
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.runnables import Runnable
from langchain_core.messages import HumanMessage
from pydantic import BaseModel, Field
from settings import (
    load_settings,
    OpenAiConfig,
    MistralAiConfig,
    GeminiAiConfig,
)
from utils.app_logger import logger
import json


###
## OCR FUNCTIONS
###
class OCREntry(BaseModel):
    """Ballot signatory data"""

    Name: str = Field(description="Name of the petition signer")
    Address: str = Field(description="Address of the petition signatory")
    Date: str = Field(description="Date the form was signed")
    Ward: int = Field(description="The area or 'Ward' that the signer belongs to")


class OCRData(BaseModel):
    Data: List[OCREntry]


def _create_ocr_client() -> Runnable:
    """
    Create an LLM client with the appropriate settings.

    Returns:
        Runnable: An AI client for OCR extraction.
    """

    ocr_config = load_settings().selected_config

    client: Optional[Runnable] = None

    match ocr_config:
        case OpenAiConfig():
            client = ChatOpenAI(
                api_key=ocr_config.api_key,
                temperature=0.0,
                openai_api_base="https://oai.helicone.ai/v1",
                model=ocr_config.model,
            ).with_structured_output(OCRData)
        case MistralAiConfig():
            client = ChatMistralAI(
                api_key=ocr_config.api_key,
                temperature=0.0,
                model_name=ocr_config.model,
            ).with_structured_output(OCRData)
        case GeminiAiConfig():
            client = ChatGoogleGenerativeAI(
                api_key=ocr_config.api_key,
                temperature=0.0,
                model=ocr_config.model,
            ).with_structured_output(OCRData)

    logger.debug(f"Creating client {ocr_config}")

    return client


async def extract_from_encoding_async(base64_image: str) -> List[dict]:
    """
    Extract names and addresses from a single base64-encoded ballot image
    asynchronously.

    Args:
        base64_image: The base64-encoded image to extract data from.

    Returns:
        list: A list of dictionaries with the OCR data.
    """
    logger.debug("Starting OCR extraction for image")

    try:
        # AI client definition
        client = _create_ocr_client()
        # prompt messages
        messages = [
            {
                "type": "text",
                "text": """Using the written text in the image create a list of dictionaries where each dictionary consists of keys 'Name', 'Address', 'Date', and 'Ward'. Fill in the values of each dictionary with the correct entries for each key. Write all the values of the dictionary in full. Only output the list of dictionaries. No other intro text is necessary.""",
            },
            {
                "type": "text",
                "text": """Remove the city name 'Washington, DC' and any zip codes from the 'Address' values.""",
            },
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
            },
        ]

        results = await client.ainvoke([HumanMessage(content=messages)])

        # dictionary results
        parsed_list = json.loads(results.json())["Data"]
        logger.debug(f"Successfully extracted {len(parsed_list)} entries from image")
        return parsed_list

    except Exception as e:
        logger.error(f"Error in OCR extraction: {str(e)}")
        raise
```
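`extract_from_encoding_async` expects a base64 string that it embeds in a `data:image/jpeg;base64,...` URI. A minimal stdlib sketch of producing that encoding from raw image bytes (the helper name and sample bytes are illustrative, not from this commit):

```python
import base64


def encode_image(image_bytes: bytes) -> str:
    """Base64-encode raw image bytes for embedding in a data URI."""
    return base64.b64encode(image_bytes).decode("utf-8")


# Placeholder bytes; in practice these would come from a rendered PDF page.
page_bytes = b"\xff\xd8\xff\xe0 fake jpeg bytes"
encoded = encode_image(page_bytes)
data_uri = f"data:image/jpeg;base64,{encoded}"

# The encoding round-trips losslessly.
assert base64.b64decode(encoded) == page_bytes
```

The caller would then pass `encoded` (not the full `data_uri`) to `extract_from_encoding_async`, which builds the URI itself.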
