Skip to content

Fine-tuned DistilBERT model for detecting sensitive information (EMAIL + SECRET) with hybrid rule-based integration.

Notifications You must be signed in to change notification settings

MaithaAlhammadi98/SensitiveInfoDetector

Repository files navigation

Sensitive Info Detector Banner

🔐 SensitiveInfoDetector

A hybrid NLP system for detecting sensitive information (EMAIL + SECRET) using a fine-tuned DistilBERT model and regex rules.


🧠 Overview

SensitiveInfoDetector identifies sensitive entities such as emails and secrets (tokens / passwords) in free text.
It integrates a transformer-based model (DistilBERT) with rule-based pattern matching to improve precision and recall across different data distributions.


🖼️ App Previews

Example Screenshot
✅ No Sensitive Information
⚠️ Warning
🚨 Critical
⚠️ + 🚨 Combined

🚀App Quick Launch (Colab / Local)

Clone the repository:

!git clone https://github.com/MaithaAlhammadi98/SensitiveInfoDetector.git
%cd SensitiveInfoDetector

Install dependencies:

!pip install -r requirements.txt

Launch the Gradio demo:

!python app/gradio_app.py

🧠 Dataset Details

This project uses a custom synthetic dataset designed for sensitive-information detection. All samples are anonymized and ethically generated.

Aspect Description
Data Source Synthetic / publicly available text snippets
Entities EMAIL, SECRET (API keys, passwords, tokens)
Annotation Manual labeling with Python span tagging
Dataset Size ≈ 1 500 train / 500 eval samples
Balance Equal EMAIL and SECRET representation
Ethical Note No real personal data used

🧩 The dataset trains the DistilBERT model to recognize and label sensitive entities within text data.


🧩 Architecture

Component Description
DistilBERT Fine-tuned transformer model for entity detection (EMAIL, SECRET).
Regex Rules Deterministic patterns to catch edge cases missed by the model.
Hybrid Pipeline Combines transformer predictions with regex results for higher recall.

🧪 Model Training

  • Base model: distilbert-base-uncased
  • Fine-tuning: Custom dataset with labeled EMAIL and SECRET entities.
  • Framework: Hugging Face Transformers + PyTorch.
  • Evaluation: Span-based precision, recall, and F1-score comparison across balanced and skewed datasets.

📊 Experimental Results

Dataset Precision Recall F1-Score
Balanced 0.94 0.99 0.97
Skewed 0.97 0.98 0.98

Hybrid model consistently outperformed both the standalone transformer and rule-based baselines.


📈 Visual Results

Model Confusion Matrix
DistilBERT Model Confusion
Rule-based Rules Confusion

🌐 Online Resources


🧩 Folder Structure

SensitiveInfoDetector/
│
├── app/                 # Gradio interface
│   └── gradio_app.py
│
├── notebooks/           # Model training notebooks
│
├── evaluation/          # Metrics and confusion matrices
│   ├── span_metrics.py
│   ├── model_tp_fp_fn_bal.png
│
├── requirements.txt
└── README.md

🧑‍💻 Author

Maitha Alhammadi Master of Artificial Intelligence — University of Technology Sydney 📍 SensitiveInfoDetector is part of an NLP application-oriented project under Dr. Wei Liu.


🤖 AI Assistance Disclosure

ChatGPT (GPT-5, OpenAI) was used for debugging, report structuring, and documentation polishing in line with UTS academic integrity and ethical use guidelines.


📜 License

This project is licensed under the MIT License — free for academic and research use.


About

Fine-tuned DistilBERT model for detecting sensitive information (EMAIL + SECRET) with hybrid rule-based integration.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published