A hybrid NLP system for detecting sensitive information (EMAIL + SECRET) using a fine-tuned DistilBERT model and regex rules.
SensitiveInfoDetector identifies sensitive entities such as emails and secrets (tokens / passwords) in free text.
It integrates a transformer-based model (DistilBERT) with rule-based pattern matching to improve precision and recall across different data distributions.
| Example | Screenshot |
|---|---|
| ✅ No Sensitive Information | ![]() ![]() |
| 🚨 Critical | ![]() ![]() |
Clone the repository:

    !git clone https://github.com/MaithaAlhammadi98/SensitiveInfoDetector.git
    %cd SensitiveInfoDetector

Install dependencies:

    !pip install -r requirements.txt

Launch the Gradio demo:

    !python app/gradio_app.py

This project uses a custom synthetic dataset designed for sensitive-information detection. All samples are anonymized and ethically generated.
| Aspect | Description |
|---|---|
| Data Source | Synthetic / publicly available text snippets |
| Entities | EMAIL, SECRET (API keys, passwords, tokens) |
| Annotation | Manual labeling with Python span tagging |
| Dataset Size | ≈ 1 500 train / 500 eval samples |
| Balance | Equal EMAIL and SECRET representation |
| Ethical Note | No real personal data used |
🧩 The dataset is used to fine-tune the DistilBERT model to recognize and label sensitive entities in text.
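
To make the span-tagging step concrete, here is a minimal sketch of how character-level span annotations could be turned into per-token labels for training. The README does not state the label scheme, so BIO tagging and whitespace tokenization are assumptions, and `spans_to_bio` is a hypothetical helper rather than a function from this repository.

```python
import re
from typing import List, Tuple

# (start, end, label) with character offsets; end is exclusive.
Span = Tuple[int, int, str]


def spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Assign a BIO tag to each whitespace-delimited token (assumed scheme)."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if their character ranges overlap.
            if match.start() < end and start < match.end():
                # The first token of a span gets B-, later tokens get I-.
                prefix = "B" if match.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((match.group(), tag))
    return tagged


if __name__ == "__main__":
    sample = "mail bob@example.org now"
    begin = sample.index("bob")
    annotation = [(begin, begin + len("bob@example.org"), "EMAIL")]
    for token, tag in spans_to_bio(sample, annotation):
        print(token, tag)
```

A real fine-tuning run would align labels to DistilBERT's subword tokens instead of whitespace tokens; the overlap test stays the same.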
| Component | Description |
|---|---|
| DistilBERT | Fine-tuned transformer model for entity detection (EMAIL, SECRET). |
| Regex Rules | Deterministic patterns to catch edge cases missed by the model. |
| Hybrid Pipeline | Combines transformer predictions with regex results for higher recall. |
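
The combination step in the table above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's implementation: the two regex patterns are simplified examples, `regex_spans` and `merge_spans` are hypothetical names, and the merge policy (keep every model span, add a rule span only when it does not overlap one) is one plausible choice among several.

```python
import re
from typing import List, Tuple

# (start, end, label) with character offsets; end is exclusive.
Span = Tuple[int, int, str]

# Illustrative patterns only; the project's actual rules may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_RE = re.compile(r"\b(?:sk|ghp|api)_[A-Za-z0-9]{16,}\b")


def regex_spans(text: str) -> List[Span]:
    """Return deterministic rule-based matches as labeled spans."""
    spans = [(m.start(), m.end(), "EMAIL") for m in EMAIL_RE.finditer(text)]
    spans += [(m.start(), m.end(), "SECRET") for m in SECRET_RE.finditer(text)]
    return spans


def merge_spans(model_spans: List[Span], rule_spans: List[Span]) -> List[Span]:
    """Keep all model spans; add rule spans that overlap nothing kept so far."""
    merged = list(model_spans)
    for start, end, label in rule_spans:
        overlaps = any(start < m_end and m_start < end
                       for m_start, m_end, _ in merged)
        if not overlaps:
            merged.append((start, end, label))
    return sorted(merged)


if __name__ == "__main__":
    text = "Contact alice@example.com, key sk_abcdefghijklmnop1234"
    begin = text.index("alice")
    # Pretend the model found only the email; the regex recovers the key.
    model_output = [(begin, begin + len("alice@example.com"), "EMAIL")]
    for start, end, label in merge_spans(model_output, regex_spans(text)):
        print(label, text[start:end])
```

Preferring model spans on overlap avoids duplicate detections, while the added rule spans are what raise recall.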
- Base model: `distilbert-base-uncased`
- Fine-tuning: Custom dataset with labeled `EMAIL` and `SECRET` entities.
- Framework: Hugging Face Transformers + PyTorch.
- Evaluation: Span-based precision, recall, and F1-score comparison across balanced and skewed datasets.
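
For readers unfamiliar with span-based scoring, the sketch below shows one common way to compute it. It assumes exact-match evaluation, where a predicted span counts as correct only if its start, end, and label all equal a gold span. The function name `span_prf` is hypothetical, and the logic in `evaluation/span_metrics.py` may differ.

```python
from typing import Dict, Set, Tuple

# (start, end, label) with character offsets; end is exclusive.
Span = Tuple[int, int, str]


def span_prf(gold: Set[Span], pred: Set[Span]) -> Dict[str, float]:
    """Exact-match span precision, recall, and F1 (assumed matching rule)."""
    tp = len(gold & pred)   # predicted spans that exactly match a gold span
    fp = len(pred - gold)   # predicted spans with no exact gold match
    fn = len(gold - pred)   # gold spans the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    gold = {(0, 5, "EMAIL"), (10, 20, "SECRET")}
    pred = {(0, 5, "EMAIL"), (10, 18, "SECRET"), (30, 35, "SECRET")}
    print(span_prf(gold, pred))
```

Under exact matching, a boundary error such as `(10, 18)` against `(10, 20)` counts as both a false positive and a false negative, which makes span metrics stricter than token-level ones.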
| Dataset | Precision | Recall | F1-Score |
|---|---|---|---|
| Balanced | 0.94 | 0.99 | 0.97 |
| Skewed | 0.97 | 0.98 | 0.98 |
The hybrid model consistently outperformed both the standalone transformer and the rule-based baseline.
| Model | Confusion Matrix |
|---|---|
| DistilBERT | ![]() |
| Rule-based | ![]() |
- GitHub Repository: MaithaAlhammadi98/SensitiveInfoDetector
- Hugging Face Model: Sensitive Info Detector (DistilBERT)
SensitiveInfoDetector/
│
├── app/ # Gradio interface
│ └── gradio_app.py
│
├── notebooks/ # Model training notebooks
│
├── evaluation/ # Metrics and confusion matrices
│ ├── span_metrics.py
│ ├── model_tp_fp_fn_bal.png
│
├── requirements.txt
└── README.md
Maitha Alhammadi

Master of Artificial Intelligence, University of Technology Sydney

📍 SensitiveInfoDetector is part of an application-oriented NLP project under Dr. Wei Liu.
ChatGPT (GPT-5, OpenAI) was used for debugging, report structuring, and documentation polishing in line with UTS academic integrity and ethical use guidelines.
This project is licensed under the MIT License — free for academic and research use.






