This project extracts addresses from documents using a BERT-based Named Entity Recognition (NER) model.
The pipeline supports the following file types:
- ✅ PDF (.pdf)
- ✅ Word Documents (.docx, .doc)
- ✅ Text Files (.txt)
Note: The UnstructuredIO API handles multiple file formats for preprocessing. Additional file types can be supported in future implementations.
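As a minimal sketch of how an upload could be screened against the supported file types listed above: the names `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical and are not taken from this project's code.

```python
from pathlib import Path

# Extensions listed above as supported by the pipeline.
# (Hypothetical constant name; the project may organize this differently.)
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc", ".txt"}


def is_supported(filename: str) -> bool:
    """Return True if the file extension is one of the supported types."""
    # Compare case-insensitively so "REPORT.PDF" is treated like "report.pdf".
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

A check like this would typically run before the document is sent for preprocessing, so that unsupported files are rejected early.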
- The input document is processed using the UnstructuredIO API.
- Extracted text and metadata are stored in JSON format, which includes:
- Bounding boxes
- Page numbers
- File name
- Extracted text content
- The extracted text is passed through two models:
- The extracted addresses are checked using a custom regex model.
- Only valid addresses are retained, each with its highest confidence score.
- The extracted information includes:
- 📌 Extracted Address
- 📂 File Number
- 📄 Page Number
- 🔲 Approximate Bounding Boxes
- 🎯 Confidence Score (Maximum value from both models)
- 🏢 Associated Entity (Organization or Person)
- The associated entity (Organization/Person) appears before the address in the document.
- The pipeline assigns the nearest preceding entity as the corresponding entity for each extracted address.
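The confidence rule and the entity assignment described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the `Span` class, its field names, the label strings, and both helper functions are hypothetical, and character offsets are assumed as the measure of "nearest preceding".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """A detected span of text (hypothetical structure for illustration)."""
    text: str
    label: str    # e.g. "ADDRESS", "ORG", "PER" (assumed label names)
    start: int    # character offset of the span in the extracted text
    score: float  # confidence score reported for the span


def merge_confidence(score_a: float, score_b: float) -> float:
    """Keep the maximum confidence value reported by the two models."""
    return max(score_a, score_b)


def nearest_preceding_entity(address: Span, entities: List[Span]) -> Optional[Span]:
    """Return the entity that starts closest before the address, if any."""
    preceding = [e for e in entities if e.start < address.start]
    # The entity with the largest start offset is the nearest preceding one.
    return max(preceding, key=lambda e: e.start, default=None)


entities = [
    Span("Acme Corp", "ORG", 0, 0.98),
    Span("Jane Doe", "PER", 120, 0.95),
]
address = Span("12 Main St, Springfield", "ADDRESS", 40, merge_confidence(0.81, 0.93))
owner = nearest_preceding_entity(address, entities)
print(owner.text, address.score)  # Acme Corp 0.93
```

Here "Jane Doe" is ignored because it appears after the address, so the organization that precedes the address is chosen. An address with no preceding entity yields `None`.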
- 📖 Fine-Tuned Model Notebook (Final Version): experimentation/fine_tuning/bert_fine_tuning_v2.ipynb
- 🧪 Test Notebook for Address Extraction Pipeline: experimentation/address_extraction_v4.ipynb
# Create a Python 3 virtual environment
python3 -m venv env
# Activate the virtual environment
# On Windows:
env\Scripts\activate
# On macOS/Linux:
source env/bin/activate
# Install the dependencies
pip install -r requirements.txt
# Launch the Streamlit app
streamlit run Home.py
- The Home Page provides a summary of the pipeline.
- The Application Tab allows users to upload a file and visualize extracted addresses:
- 📌 If the file contains addresses, bounding boxes are displayed.
- 📊 A summary table presents extracted information.
