This project extracts addresses from documents using a BERT-based Named Entity Recognition (NER) model.
The pipeline supports the following file types:
- ✅ PDF (`.pdf`)
- ✅ Word Documents (`.docx`, `.doc`)
- ✅ Text Files (`.txt`)
Note: The UnstructuredIO API handles multiple file formats for preprocessing. Additional file types can be supported in future implementations.
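As an illustration of the supported file types listed above, an upload could be checked by extension before it is sent for preprocessing. This is a minimal sketch, not the project's actual code: the names `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical.

```python
from pathlib import Path

# Extensions taken from the supported file types list; the constant and
# function names are hypothetical and not part of the project.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc", ".txt"}


def is_supported(filename: str) -> bool:
    """Return True if the file extension is one the pipeline accepts."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

The comparison is case-insensitive, so `report.PDF` is treated the same as `report.pdf`.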
- The input document is processed using the UnstructuredIO API.
- Extracted text and metadata are stored in JSON format, which includes:
  - Bounding boxes
  - Page numbers
  - File name
  - Extracted text content
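
The stored fields listed above could be represented roughly as follows. This is a sketch under assumptions: the class name `ExtractedElement`, the field names, and the `[x_min, y_min, x_max, y_max]` bounding-box layout are hypothetical and may differ from the JSON that the UnstructuredIO API actually returns.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExtractedElement:
    """One block of extracted text with its metadata (hypothetical schema)."""
    file_name: str
    page_number: int
    bounding_box: list[float]  # assumed layout: [x_min, y_min, x_max, y_max]
    text: str


def to_json(elements: list[ExtractedElement]) -> str:
    """Serialize the extracted elements to a JSON string."""
    return json.dumps([asdict(e) for e in elements], indent=2)


# Illustrative values only; not taken from a real document.
element = ExtractedElement(
    file_name="sample.pdf",
    page_number=1,
    bounding_box=[72.0, 90.5, 310.2, 120.0],
    text="Acme Corp, 221 Baker Street, London",
)
serialized = to_json([element])
```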
- The extracted text is passed through two models:
- The extracted addresses are checked using a custom regex model.
- Only valid addresses are retained, each with its highest confidence score.
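
The validation and retention steps above can be sketched as follows. The regular expression here is a hypothetical, deliberately simple pattern for street-style addresses, not the project's custom regex model, and `keep_best` is an illustrative helper name. When the same address is reported more than once, only the highest confidence score is kept.

```python
import re

# Hypothetical pattern: a street number, one to five name tokens, and a
# common street suffix. The project's real regex is not shown in this README.
ADDRESS_PATTERN = re.compile(
    r"\b\d{1,6}\s+(?:[A-Za-z0-9.'-]+\s+){1,5}"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Lane|Ln|Drive|Dr)\b",
    re.IGNORECASE,
)


def is_valid_address(text: str) -> bool:
    """Return True if the text contains something shaped like a street address."""
    return ADDRESS_PATTERN.search(text) is not None


def keep_best(candidates: list[tuple[str, float]]) -> dict[str, float]:
    """Drop invalid candidates and keep the highest score per address."""
    best: dict[str, float] = {}
    for address, score in candidates:
        if not is_valid_address(address):
            continue
        key = " ".join(address.split())  # normalize whitespace before comparing
        if key not in best or score > best[key]:
            best[key] = score
    return best


# Illustrative (address, confidence) candidates.
candidates = [
    ("221 Baker Street", 0.91),
    ("221  Baker Street", 0.97),
    ("call me tomorrow", 0.88),
]
result = keep_best(candidates)
```

In this example the two spellings of the same address collapse into one entry with the higher score, and the non-address candidate is discarded.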
- The extracted information includes:
  - 📌 Extracted Address
  - 📂 File Number
  - 📄 Page Number
  - 🔲 Approximate Bounding Boxes
  - 🎯 Confidence Score (Maximum value from both models)
  - 🏢 Associated Entity (Organization or Person)
- The associated entity (Organization/Person) appears before the address in the document.
- The pipeline assigns the nearest preceding entity as the corresponding entity for each extracted address.
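
The nearest-preceding-entity rule above can be sketched as a single pass over the detected spans in document order. This is an illustration only: the `Span` class, the label names `ORG`, `PER`, and `ADDRESS`, and the use of character offsets are assumptions, not the project's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    label: str  # "ORG", "PER", or "ADDRESS" (label names are assumptions)
    text: str
    start: int  # character offset of the span in the document text


def assign_entities(spans: list[Span]) -> list[tuple[str, Optional[str]]]:
    """Pair each address with the closest organization or person before it."""
    pairs: list[tuple[str, Optional[str]]] = []
    last_entity: Optional[str] = None
    for span in sorted(spans, key=lambda s: s.start):
        if span.label in {"ORG", "PER"}:
            last_entity = span.text  # remember the most recent entity seen
        elif span.label == "ADDRESS":
            pairs.append((span.text, last_entity))
    return pairs


# Illustrative spans, intentionally out of order to show the sort.
spans = [
    Span("ADDRESS", "10 Downing Street", 50),
    Span("ORG", "Acme Corp", 0),
    Span("ADDRESS", "221 Baker Street", 11),
    Span("PER", "Jane Doe", 40),
]
pairs = assign_entities(spans)
```

An address with no entity before it is paired with `None`, which makes the limitation of the assumption visible rather than guessing an entity.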
- 📖 Fine-Tuned Model Notebook (Final Version): `experimentation/fine_tuning/bert_fine_tuning_v2.ipynb`
- 🧪 Test Notebook for Address Extraction Pipeline: `experimentation/address_extraction_v4.ipynb`
# Create a Python 3 virtual environment
python3 -m venv env
# Activate the virtual environment
# On Windows:
env\Scripts\activate
# On macOS/Linux:
source env/bin/activate
pip install -r requirements.txt
streamlit run Home.py
- The Home Page provides a summary of the pipeline.
- The Application Tab allows users to upload a file and visualize extracted addresses:
  - 📌 If the file contains addresses, bounding boxes are displayed.
  - 📊 A summary table presents extracted information.