🔐 SensitiveInfoDetector

A hybrid NLP system for detecting sensitive information (EMAIL + SECRET) using a fine-tuned DistilBERT model and regex rules.

🧠 Overview

SensitiveInfoDetector identifies sensitive entities such as emails and secrets (tokens / passwords) in free text.
It integrates a transformer-based model (DistilBERT) with rule-based pattern matching to improve precision and recall across different data distributions.

🖼️ App Previews

[Screenshots, one per example: ✅ No Sensitive Information, ⚠️ Warning, 🚨 Critical, ⚠️ + 🚨 Combined]

🚀 App Quick Launch (Colab / Local)

Clone the repository:

!git clone https://github.com/MaithaAlhammadi98/SensitiveInfoDetector.git
%cd SensitiveInfoDetector

Install dependencies:

!pip install -r requirements.txt

Launch the Gradio demo:

!python app/gradio_app.py

🧠 Dataset Details

This project uses a custom synthetic dataset designed for sensitive-information detection. All samples are anonymized and ethically generated.

| Aspect | Description |
|---|---|
| Data Source | Synthetic / publicly available text snippets |
| Entities | EMAIL, SECRET (API keys, passwords, tokens) |
| Annotation | Manual labeling with Python span tagging |
| Dataset Size | ≈ 1,500 train / 500 eval samples |
| Balance | Equal EMAIL and SECRET representation |
| Ethical Note | No real personal data used |

🧩 The dataset is used to fine-tune the DistilBERT model to recognize and label sensitive entities in text.
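
The span tagging mentioned in the table can be pictured with a small sketch. It assumes annotations are stored as character offsets `(start, end, label)` and turns them into one BIO tag per whitespace token. The function name `spans_to_bio`, the whitespace tokenization, and the sample sentence are illustrative assumptions, not the project's actual annotation code (a real DistilBERT setup would align labels to WordPiece tokens instead).

```python
import re

def spans_to_bio(text, spans):
    """Convert character-level spans into one BIO tag per whitespace token.

    spans: list of (start, end, label) tuples, with end exclusive.
    Hypothetical helper for illustration only.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if the two character ranges overlap.
            if tok_start < end and tok_end > start:
                # The first token of a span gets B-, later tokens get I-.
                tag = ("B-" if tok_start <= start else "I-") + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags

# Synthetic example sentence (no real data).
text = "Contact jane@example.com and use token abc123XYZ now"
email, secret = "jane@example.com", "abc123XYZ"
e, s = text.index(email), text.index(secret)
spans = [(e, e + len(email), "EMAIL"), (s, s + len(secret), "SECRET")]

tokens, tags = spans_to_bio(text, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Storing offsets rather than tags keeps the annotations independent of the tokenizer, so the same labels can be re-aligned if the tokenization changes.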

🧩 Architecture

| Component | Description |
|---|---|
| DistilBERT | Fine-tuned transformer model for entity detection (`EMAIL`, `SECRET`). |
| Regex Rules | Deterministic patterns to catch edge cases missed by the model. |
| Hybrid Pipeline | Combines transformer predictions with regex results for higher recall. |
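
One way to picture the hybrid step is as a union of two span lists. The sketch below is an assumption about how such a merge could work, not the repository's implementation: the two regex patterns, the function names `regex_spans` and `merge_spans`, and the hard-coded `model_spans` (standing in for DistilBERT output) are all hypothetical.

```python
import re

# Illustrative patterns only; the project's actual regex rules are not shown here.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_RE = re.compile(r"\b(?:sk|ghp|AKIA)[A-Za-z0-9_-]{16,}\b")

def regex_spans(text):
    """Return (start, end, label) spans found by the deterministic rules."""
    spans = [(m.start(), m.end(), "EMAIL") for m in EMAIL_RE.finditer(text)]
    spans += [(m.start(), m.end(), "SECRET") for m in SECRET_RE.finditer(text)]
    return spans

def merge_spans(model_spans, rule_spans):
    """Union of both span lists; overlapping spans with the same label are fused."""
    merged = []
    # Sort by label first so spans of one label are adjacent, then by position.
    ordered = sorted(model_spans + rule_spans, key=lambda sp: (sp[2], sp[0], sp[1]))
    for start, end, label in ordered:
        if merged and merged[-1][2] == label and start < merged[-1][1]:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), label)
        else:
            merged.append((start, end, label))
    return sorted(merged)

# Synthetic input; the model spans below are made up to mimic a partial detection.
text = "Mail bob@example.org, key ghp_abcdefghijklmnop1234"
model_spans = [(5, 16, "EMAIL")]  # hypothetical: model clipped the email, missed the key
hybrid = merge_spans(model_spans, regex_spans(text))
for start, end, label in hybrid:
    print(label, text[start:end])
```

Taking the union favors recall: anything either detector flags is kept, and fusing overlaps avoids reporting the same entity twice.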

🧪 Model Training

Base model: distilbert-base-uncased
Fine-tuning: Custom dataset with labeled EMAIL and SECRET entities.
Framework: Hugging Face Transformers + PyTorch.
Evaluation: Span-based precision, recall, and F1-score comparison across balanced and skewed datasets.
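
For readers unfamiliar with span-based scoring, the sketch below shows one common convention: a predicted span counts as correct only if its start, end, and label all match a gold span exactly. This is an assumed definition for illustration; it is not taken from `evaluation/span_metrics.py`, which may use a different matching rule. The function name `span_prf` and the sample spans are hypothetical.

```python
def span_prf(gold, pred):
    """Exact-match span precision, recall, and F1.

    gold, pred: iterables of (start, end, label) tuples.
    """
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)   # spans predicted exactly right
    fp = len(pred_set - gold_set)   # predicted spans with no exact gold match
    fn = len(gold_set - pred_set)   # gold spans the prediction missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up spans: one exact hit, one boundary error, one spurious prediction.
gold = [(0, 5, "EMAIL"), (10, 20, "SECRET")]
pred = [(0, 5, "EMAIL"), (10, 18, "SECRET"), (30, 35, "SECRET")]
precision, recall, f1 = span_prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Under exact matching, a boundary error costs both a false positive and a false negative, which is why span-level scores are usually stricter than token-level ones.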

📊 Experimental Results

| Dataset | Precision | Recall | F1-Score |
|---|---|---|---|
| Balanced | 0.94 | 0.99 | 0.97 |
| Skewed | 0.97 | 0.98 | 0.98 |

The hybrid model consistently outperformed both the standalone transformer and the rule-based baseline.

📈 Visual Results

[Figures: confusion matrices for the DistilBERT model and the rule-based baseline]

🌐 Online Resources

GitHub Repository: MaithaAlhammadi98/SensitiveInfoDetector
Hugging Face Model: Sensitive Info Detector (DistilBERT)

🧩 Folder Structure

SensitiveInfoDetector/
│
├── app/                 # Gradio interface
│   └── gradio_app.py
│
├── notebooks/           # Model training notebooks
│
├── evaluation/          # Metrics and confusion matrices
│   ├── span_metrics.py
│   ├── model_tp_fp_fn_bal.png
│
├── requirements.txt
└── README.md

🧑‍💻 Author

Maitha Alhammadi
Master of Artificial Intelligence, University of Technology Sydney
📍 SensitiveInfoDetector is part of an NLP application-oriented project under Dr. Wei Liu.

🤖 AI Assistance Disclosure

ChatGPT (GPT-5, OpenAI) was used for debugging, report structuring, and documentation polishing in line with UTS academic integrity and ethical use guidelines.

📜 License

This project is licensed under the MIT License — free for academic and research use.
