A framework for testing AI model safety and robustness against adversarial prompts. The project tests Ollama models.
This project is a systematic tool for evaluating language model vulnerability to adversarial attacks. It tests models (running via Ollama) with a curated library of harmful/misuse prompts to measure safety and identify potential weaknesses.
Loads prompts from categorized text files and runs them against a specified model, collecting responses for analysis.
Evaluates model responses for signs of harmful output by detecting:
- Refusal patterns (e.g., "I cannot", "I won't")
- Step-by-step instructions or code
- Detailed procedural information
- Response length and structure
Interactive visualization for results:
- Filters by model and prompt category
- Metrics: total tests, average harm score, unique prompts
- Displays risky responses ranked by harm score
- Prompt library statistics
Records results in:
- CSV format for aggregated data
- JSON format with timestamps for raw results
(Will be implementing more robust logging mechanism)
- Load adversarial prompts
- Send each prompt to an Ollama model
- Score responses for harmful content
- Log results to CSV and JSON
- View results in the Streamlit dashboard