Skip to content

wanguiwaweru/AI-red-teaming

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

AI Red-Teaming

A framework for testing AI model safety and robustness against adversarial prompts. The project tests Ollama models.

Overview

This project is a systematic tool for evaluating language model vulnerability to adversarial attacks. It tests models (running via Ollama) with a curated library of harmful/misuse prompts to measure safety and identify potential weaknesses.

Core Components

Redteaming results

Attack Runner

Loads prompts from categorized text files and runs them against a specified model, collecting responses for analysis.

Scoring System

Evaluates model responses for signs of harmful output by detecting:

  • Refusal patterns (e.g., "I cannot", "I won't")
  • Step-by-step instructions or code
  • Detailed procedural information
  • Response length and structure

Dashboard

Interactive visualization for results:

  • Filters by model and prompt category
  • Metrics: total tests, average harm score, unique prompts
  • Displays risky responses ranked by harm score
  • Prompt library statistics

Logging System

Records results in:

  • CSV format for aggregated data
  • JSON format with timestamps for raw results

(Will be implementing more robust logging mechanism)

Workflow

  1. Load adversarial prompts
  2. Send each prompt to an Ollama model
  3. Score responses for harmful content
  4. Log results to CSV and JSON
  5. View results in the Streamlit dashboard

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages