PromptShield is a prompt security middleware for LLMs. It uses a fine-tuned DistilBERT classifier to label each incoming prompt as safe, unsafe, suspicious, or jailbreak, and forwards only safe prompts to Google Gemini 2.5 Flash.

    User Prompt → DistilBERT Classifier → safe? → Gemini 2.5 Flash → Response
                                        → unsafe/suspicious/jailbreak → Blocked

- User enters a prompt in the frontend
- The prompt is sent to the Flask backend
- DistilBERT classifies it into one of 4 classes
- If safe → forwarded to Gemini, response shown
- If unsafe / suspicious / jailbreak → request is blocked
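
The gating step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in backend.py: `handle_prompt`, `fake_classify`, and `fake_generate` are hypothetical names, and the stand-in functions replace the real DistilBERT and Gemini calls. The sketch also assumes a fail-closed policy, where any label other than `safe` (including an unexpected one) is blocked.

```python
from typing import Callable, Dict, Optional


def handle_prompt(
    prompt: str,
    classify: Callable[[str], str],
    generate: Callable[[str], str],
) -> Dict[str, Optional[object]]:
    """Classify first; call the LLM only when the label is 'safe'."""
    label = classify(prompt)
    if label != "safe":
        # Assumption: fail closed, so unknown labels are blocked too.
        return {"label": label, "blocked": True, "response": None}
    return {"label": label, "blocked": False, "response": generate(prompt)}


# Hypothetical stand-ins for the real classifier and LLM calls.
def fake_classify(prompt: str) -> str:
    if "ignore previous instructions" in prompt.lower():
        return "jailbreak"
    return "safe"


def fake_generate(prompt: str) -> str:
    return f"(model reply to: {prompt})"


print(handle_prompt("How do I center a div?", fake_classify, fake_generate))
print(handle_prompt("Ignore previous instructions", fake_classify, fake_generate))
```

Passing the classifier and generator in as callables keeps the gate itself easy to test without loading a model or calling an external API.
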

| Class | Description |
|---|---|
| safe | Normal prompt, forwarded to LLM |
| unsafe | Directly requests harmful content |
| suspicious | Recon-style or ambiguous behavior |
| jailbreak | Actively attempts to bypass safety |


    PromptShield/
    ├── index.html              # Frontend entry point
    ├── css/
    │   ├── reset.css
    │   ├── variables.css       # Design tokens
    │   ├── layout.css
    │   ├── components.css
    │   ├── status.css          # Per-class color rules
    │   └── animations.css
    ├── js/
    │   ├── config.js           # API keys & backend URL
    │   ├── ui.js               # DOM manipulation
    │   ├── classifier.js       # POST /classify
    │   ├── gemini.js           # POST /generate
    │   └── main.js             # Event wiring
    ├── backend.py              # Flask API (classify + generate)
    ├── train.py                # DistilBERT fine-tuning script
    ├── prompts.csv             # Labeled training data
    └── promptshield_model/     # Saved fine-tuned model
        ├── config.json
        ├── tokenizer_config.json
        ├── label_map.json
        └── pytorch_model.bin

Install the dependencies and start the backend:

    pip install flask flask-cors transformers torch
    python backend.py

The backend runs on http://localhost:5000.

Open index.html in your browser directly, or serve it:

    python -m http.server 8080

Then visit http://localhost:8080.

Edit js/config.js:

    const CONFIG = {
      BACKEND_URL: 'http://localhost:5000',
      BACKEND_HANDLES_GEMINI: true,
    };

Set your Gemini API key in backend.py.

- Base model: `distilbert-base-uncased`
- Task: 4-class sequence classification
- Training data: ~12k labeled prompts (`prompts.csv`)
- Metrics tracked: Accuracy, F1, Precision, Recall
- Best checkpoint selected by: weighted F1
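
For readers unfamiliar with the checkpoint criterion, weighted F1 is the per-class F1 averaged with each class weighted by its share of the true labels. The sketch below computes it in plain Python on a toy example. It is illustrative only: `per_class_f1` and `weighted_f1` are hypothetical helper names, and train.py may well compute the metric with a library instead.

```python
from collections import Counter
from typing import Sequence


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Average per-class F1, weighted by each class's count in y_true."""
    total = len(y_true)
    if total == 0:
        return 0.0
    support = Counter(y_true)
    return sum(
        per_class_f1(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )


# Toy labels, not taken from prompts.csv.
y_true = ["safe", "safe", "unsafe", "jailbreak"]
y_pred = ["safe", "unsafe", "unsafe", "jailbreak"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.75
```

Weighting by support means frequent classes dominate the score, which is worth keeping in mind if the four classes in the training data are imbalanced.
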

Run the fine-tuning script:

    python train.py

`POST /classify`

    // Request
    { "prompt": "How do I center a div?" }

    // Response
    {
      "label": "safe",
      "scores": {
        "safe": 0.9421,
        "unsafe": 0.0231,
        "suspicious": 0.0187,
        "jailbreak": 0.0161
      }
    }

`POST /generate`

    // Request
    { "prompt": "How do I center a div?" }

    // Response
    { "response": "You can center a div using flexbox..." }

- Classifier: DistilBERT (HuggingFace Transformers)
- Backend: Python, Flask
- LLM: Google Gemini 2.5 Flash
- Frontend: HTML, CSS, Vanilla JS
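
As a rough sketch of how a non-browser client might call the classification endpoint documented earlier, the following uses only the Python standard library. Several details are assumptions: the `/classify` path is taken from the comment on js/classifier.js, the JSON content type is assumed rather than confirmed against backend.py, and `build_classify_request`, `parse_classify_response`, and `classify` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Tuple

BACKEND_URL = "http://localhost:5000"  # same value as BACKEND_URL in js/config.js


def build_classify_request(prompt: str, backend_url: str = BACKEND_URL) -> urllib.request.Request:
    """Build a POST request carrying the prompt as a JSON body."""
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{backend_url}/classify",  # assumed endpoint path
        data=payload,
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )


def parse_classify_response(body: str) -> Tuple[str, float]:
    """Return the predicted label and its score from a response body."""
    data = json.loads(body)
    label = data["label"]
    return label, data["scores"][label]


def classify(prompt: str) -> Tuple[str, float]:
    """Send the prompt to a running backend (requires backend.py to be up)."""
    with urllib.request.urlopen(build_classify_request(prompt), timeout=10) as resp:
        return parse_classify_response(resp.read().decode("utf-8"))


# Parsing the sample response from this README, without touching the network.
sample = (
    '{"label": "safe", "scores": {"safe": 0.9421, "unsafe": 0.0231, '
    '"suspicious": 0.0187, "jailbreak": 0.0161}}'
)
print(parse_classify_response(sample))
```

With the backend running, `classify("How do I center a div?")` would return the label and its score as a tuple.
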