🏠 Address Extraction Pipeline

This project extracts addresses from documents using a BERT-based Named Entity Recognition (NER) model.

📄 Supported Document Formats

The pipeline supports the following file types:

✅ PDF (.pdf)
✅ Word Documents (.docx, .doc)
✅ Text Files (.txt)

Note: The UnstructuredIO API handles multiple file formats for preprocessing. Additional file types can be supported in future implementations.

🔧 Pipeline Workflow

1️⃣ File Preprocessing

The input document is processed using the UnstructuredIO API.
Extracted text and metadata are stored in JSON format, which includes:
- Bounding boxes
- Page numbers
- File name
- Extracted text content
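
To make the shape of this stored record concrete, here is a minimal sketch of one extracted element serialized to JSON. The class name `ExtractedElement`, the field names, and the `[x0, y0, x1, y1]` bounding-box layout are assumptions for illustration; the actual schema returned by the UnstructuredIO API may differ.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ExtractedElement:
    """One block of extracted text (hypothetical schema, not the real API output)."""
    file_name: str
    page_number: int
    bounding_box: list  # assumed layout: [x0, y0, x1, y1]
    text: str


def to_json(elements):
    """Serialize a list of extracted elements to a JSON string."""
    return json.dumps([asdict(e) for e in elements], indent=2)


elements = [
    ExtractedElement(
        file_name="sample.pdf",
        page_number=1,
        bounding_box=[72.0, 90.5, 310.0, 120.0],
        text="Acme Corp, 12 High Street, London",
    )
]

# Round-trip through JSON to confirm the record keeps all four fields.
record = json.loads(to_json(elements))[0]
```

Keeping the page number and bounding box next to the text lets later steps report where on the page each address was found.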

2️⃣ NER Model Processing

The extracted text is passed through two models:
- Fine-Tuned BERT Model
- NER Address Model

3️⃣ Regex-Based Address Validation

The extracted addresses are checked using a custom regex model.
Only valid addresses are retained, each with its highest confidence score.
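
The validation step could look roughly like the sketch below. The pattern `ADDRESS_PATTERN` and the function names are hypothetical; the project's actual regex is not shown in this README and is likely more thorough.

```python
import re

# Hypothetical pattern: a house number, a street name, and a common street suffix.
ADDRESS_PATTERN = re.compile(
    r"\b\d{1,5}\s+[A-Za-z][A-Za-z .'-]*\s(?:Street|St|Road|Rd|Avenue|Ave|Lane|Ln|Drive|Dr)\b",
    re.IGNORECASE,
)


def is_valid_address(text):
    """Return True if the text contains something shaped like a street address."""
    return ADDRESS_PATTERN.search(text) is not None


def filter_valid(candidates):
    """Keep only the (text, confidence) candidates whose text passes the regex check."""
    return [c for c in candidates if is_valid_address(c[0])]


candidates = [
    ("12 High Street, London", 0.91),
    ("call us tomorrow", 0.99),  # high model confidence, but not an address
    ("4 Elm Road", 0.75),
]
valid = filter_valid(candidates)
```

The second candidate shows why the check is useful: a model can be confident about a span that has no address structure at all.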

4️⃣ Final Address Extraction Output

The extracted information includes:
- 📌 Extracted Address
- 📂 File Number
- 📄 Page Number
- 🔲 Approximate Bounding Boxes
- 🎯 Confidence Score (Maximum value from both models)
- 🏢 Associated Entity (Organization or Person)
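
As an illustration of the "maximum value from both models" rule, the sketch below merges two sets of predictions. The function name `merge_confidences` and the idea of matching predictions by exact address string are assumptions; the pipeline may align the two models' outputs differently.

```python
def merge_confidences(bert_preds, ner_preds):
    """Combine two {address: confidence} dicts, keeping the higher score per address."""
    merged = dict(bert_preds)
    for address, score in ner_preds.items():
        # An address seen by only one model keeps that model's score.
        merged[address] = max(score, merged.get(address, 0.0))
    return merged


bert_preds = {"12 High Street": 0.88, "4 Elm Road": 0.75}
ner_preds = {"12 High Street": 0.93}

merged = merge_confidences(bert_preds, ner_preds)
```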

📜 Assumptions

The associated entity (Organization/Person) appears before the address in the document.
The pipeline assigns the nearest preceding entity as the corresponding entity for each extracted address.
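
The nearest-preceding-entity rule can be sketched as follows, assuming each entity and address comes with a character offset into the document text. The function name `assign_entities` and the offset-based representation are hypothetical.

```python
import bisect


def assign_entities(entities, addresses):
    """Pair each address with the closest entity that starts before it.

    entities, addresses: lists of (char_offset, text) tuples.
    Returns a list of (address_text, entity_text or None).
    """
    entities = sorted(entities)
    offsets = [offset for offset, _ in entities]
    pairs = []
    for offset, address in addresses:
        # Count of entities that start strictly before this address.
        i = bisect.bisect_left(offsets, offset)
        pairs.append((address, entities[i - 1][1] if i > 0 else None))
    return pairs


entities = [(0, "Acme Corp"), (50, "Jane Doe")]
addresses = [(10, "12 High Street"), (60, "4 Elm Road")]

pairs = assign_entities(entities, addresses)
```

An address that appears before any entity gets `None`, which is the case where the assumption above does not hold.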

📌 Project Structure

🏗 Architecture Flow Diagram
📖 Fine-Tuned Model Notebook (Final Version)
experimentation/fine_tuning/bert_fine_tuning_v2.ipynb
🧪 Test Notebook for Address Extraction Pipeline
experimentation/address_extraction_v4.ipynb

🚀 How to Run the Pipeline

1️⃣ Set Up the Virtual Environment

# Create a Python 3 virtual environment
python3 -m venv env

# Activate the virtual environment
# On Windows:
env\Scripts\activate
# On macOS/Linux:
source env/bin/activate

2️⃣ Install Dependencies

pip install -r requirements.txt

3️⃣ Run the Application

streamlit run Home.py

🎯 Using the Streamlit App

The Home Page provides a summary of the pipeline.
The Application Tab allows users to upload a file and visualize extracted addresses:
- 📌 If the file contains addresses, bounding boxes are displayed.
- 📊 A summary table presents extracted information.

