manikrishna-m/address_extraction

🏠 Address Extraction Pipeline

This project extracts addresses from documents using a BERT-based Named Entity Recognition (NER) model.


📄 Supported Document Formats

The pipeline supports the following file types:

  • PDF (.pdf)
  • Word Documents (.docx, .doc)
  • Text Files (.txt)

Note: Because the UnstructuredIO API handles many file formats during preprocessing, support for additional file types can be added in future implementations.


🔧 Pipeline Workflow

1️⃣ File Preprocessing

  • The input document is processed using the UnstructuredIO API.
  • Extracted text and metadata are stored in JSON format, which includes:
    • Bounding boxes
    • Page numbers
    • File name
    • Extracted text content
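
As a rough sketch of the JSON records described above, the snippet below flattens a list of parsed elements into one record per text block. The element layout (`text`, `metadata.filename`, `metadata.page_number`, `metadata.coordinates.points`), the helper name `flatten_elements`, and the sample data are assumptions for illustration. They are not taken from this repository.

```python
import json


def flatten_elements(elements):
    """Turn parsed document elements into flat records (assumed element layout)."""
    records = []
    for element in elements:
        metadata = element.get("metadata", {})
        coordinates = metadata.get("coordinates") or {}
        records.append({
            "file_name": metadata.get("filename"),
            "page_number": metadata.get("page_number"),
            "bounding_box": coordinates.get("points"),
            "text": element.get("text", ""),
        })
    return records


# Hypothetical sample element, shaped like typical preprocessing output.
sample = [{
    "type": "NarrativeText",
    "text": "Acme Corp, 12 High Street, Springfield",
    "metadata": {
        "filename": "invoice.pdf",
        "page_number": 1,
        "coordinates": {"points": [[10, 20], [10, 40], [200, 40], [200, 20]]},
    },
}]

records = flatten_elements(sample)
print(json.dumps(records, indent=2))
```

Keeping each record flat makes it straightforward to carry the page number and bounding box along with the text into the later steps.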

2️⃣ NER Model Processing

3️⃣ Regex-Based Address Validation

  • The extracted addresses are validated against a custom regex model.
  • Only valid addresses are retained, each with its highest confidence score.
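
A minimal sketch of this validation step, under stated assumptions: the pattern below is a hypothetical stand-in for the project's custom regex, and "highest confidence score" is read here as keeping the best score seen for each valid address. The names `ADDRESS_PATTERN`, `is_valid_address`, and `keep_valid` are illustrative only.

```python
import re

# Hypothetical pattern: a street number, one to four words, then a street suffix.
ADDRESS_PATTERN = re.compile(
    r"\b\d{1,5}\s+(?:[A-Za-z]+\s+){1,4}"
    r"(?:Street|St|Road|Rd|Avenue|Ave|Lane|Ln|Drive|Dr)\b",
    re.IGNORECASE,
)


def is_valid_address(text):
    """Return True if the candidate text looks like a street address."""
    return ADDRESS_PATTERN.search(text) is not None


def keep_valid(candidates):
    """Keep valid addresses only, each with the highest score seen for it.

    candidates: list of (address_text, confidence) pairs.
    """
    best = {}
    for text, score in candidates:
        if not is_valid_address(text):
            continue
        if text not in best or score > best[text]:
            best[text] = score
    return best


candidates = [
    ("12 High Street, Springfield", 0.80),
    ("12 High Street, Springfield", 0.93),
    ("Call us tomorrow", 0.99),
]
valid = keep_valid(candidates)
print(valid)
```

A real address pattern would need to cover far more formats (unit numbers, postal codes, non-English suffixes). The point here is only the filter-then-keep-best flow.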

4️⃣ Final Address Extraction Output

  • The extracted information includes:
    • 📌 Extracted Address
    • 📂 File Number
    • 📄 Page Number
    • 🔲 Approximate Bounding Boxes
    • 🎯 Confidence Score (Maximum value from both models)
    • 🏢 Associated Entity (Organization or Person)
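
The output fields above could be modelled roughly as follows. This is a sketch, not the project's actual data structure: the class name `AddressResult`, the field names and types, and the sample values are assumptions, and "both models" is taken to mean the NER model and the regex model.

```python
from dataclasses import dataclass, asdict


@dataclass
class AddressResult:
    address: str
    file_number: int
    page_number: int
    bounding_box: list   # approximate box, e.g. two corner points
    confidence: float
    entity: str          # associated organization or person


def combine_confidence(ner_score, regex_score):
    """Final confidence is the maximum of the two model scores."""
    return max(ner_score, regex_score)


# Hypothetical values for illustration only.
result = AddressResult(
    address="12 High Street, Springfield",
    file_number=1,
    page_number=1,
    bounding_box=[[10, 20], [200, 40]],
    confidence=combine_confidence(0.87, 0.95),
    entity="Acme Corp",
)
print(asdict(result))
```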

📜 Assumptions

  • The associated entity (Organization/Person) appears before the address in the document.
  • The pipeline assigns the nearest preceding entity as the corresponding entity for each extracted address.
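
The nearest-preceding-entity rule can be sketched as below. It assumes entities and addresses are located by character offset within the document text. The function name `nearest_preceding_entity` and the offsets are hypothetical.

```python
def nearest_preceding_entity(address_start, entities):
    """Return the name of the closest entity that starts before the address.

    entities: list of (start_offset, name) pairs.
    Returns None when no entity precedes the address.
    """
    preceding = [(start, name) for start, name in entities if start < address_start]
    if not preceding:
        return None
    # The largest start offset below address_start is the nearest preceding one.
    return max(preceding, key=lambda item: item[0])[1]


entities = [(0, "Acme Corp"), (120, "Jane Doe")]
print(nearest_preceding_entity(60, entities))   # Acme Corp
print(nearest_preceding_entity(150, entities))  # Jane Doe
```

Under this assumption, an address that appears before any entity gets no associated entity, which is the main failure mode of the rule.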

📌 Project Structure

  • 🏗 Architecture Flow Diagram
    Architecture Diagram

  • 📖 Fine-Tuned Model Notebook (Final Version)
    experimentation/fine_tuning/bert_fine_tuning_v2.ipynb

  • 🧪 Test Notebook for Address Extraction Pipeline
    experimentation/address_extraction_v4.ipynb


🚀 How to Run the Pipeline

1️⃣ Set Up the Virtual Environment

# Create a Python 3 virtual environment
python3 -m venv env

# Activate the virtual environment
# On Windows:
env\Scripts\activate
# On macOS/Linux:
source env/bin/activate

2️⃣ Install Dependencies

pip install -r requirements.txt

3️⃣ Run the Application

streamlit run Home.py

🎯 Using the Streamlit App

  • The Home Page provides a summary of the pipeline.
  • The Application Tab allows users to upload a file and visualize extracted addresses:
    • 📌 If the file contains addresses, bounding boxes are displayed.
    • 📊 A summary table presents extracted information.
