# AIRTBench Dataset - External Release

- [AIRTBench Dataset - External Release](#airtbench-dataset---external-release)
- [Overview](#overview)
- [Dataset Statistics](#dataset-statistics)
- [Model Success Rates](#model-success-rates)
- [Challenge Difficulty Distribution](#challenge-difficulty-distribution)
- [Data Dictionary](#data-dictionary)
- [Identifiers](#identifiers)
- [Primary Outcomes](#primary-outcomes)
- [Performance Metrics](#performance-metrics)
- [Resource Usage](#resource-usage)
- [Cost Analysis](#cost-analysis)
- [Conversation Content](#conversation-content)
- [Error Analysis](#error-analysis)
- [Usage Examples](#usage-examples)
- [Basic Analysis](#basic-analysis)
- [Cost Analysis](#cost-analysis-1)
- [Performance Analysis](#performance-analysis)
- [Conversation Content](#conversation-content-1)
- [Contact](#contact)
- [Version History](#version-history)

## Overview

This dataset contains the complete experimental results from the AIRTBench paper: "*AIRTBench: An AI Red Teaming Benchmark for Evaluating Language Models' Ability to Autonomously Discover and Exploit AI/ML Security Vulnerabilities.*"

It comprises 8,066 experimental runs across 12 language models and 70 security challenges, and is available on [Hugging Face](https://huggingface.co/datasets/dreadnode/AIRTBench/).

## Dataset Statistics

- **Total Runs**: 8,066
- **Unique Models**: 12
- **Unique Challenges**: 70
- **Success Rate**: 20.5%
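
These headline figures can be reproduced directly from the dataset. A minimal sketch, using the parquet filename from the usage examples below:

```python
import pandas as pd

# Load the dataset
df = pd.read_parquet('airtbench_external_dataset.parquet')

print(f"Total runs: {len(df):,}")
print(f"Unique models: {df['model'].nunique()}")
print(f"Unique challenges: {df['challenge_name'].nunique()}")
print(f"Success rate: {df['flag_found'].mean():.1%}")
```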

## Model Success Rates

| Model | Success Rate |
|---|---|
| claude-3-7-sonnet-20250219 | 46.86% |
| gpt-4.5-preview | 36.89% |
| gemini/gemini-2.5-pro-preview-05-06 | 34.29% |
| openai/o3-mini | 28.43% |
| together_ai/deepseek-ai/DeepSeek-R1 | 26.86% |
| gemini/gemini-2.5-flash-preview-04-17 | 26.43% |
| openai/gpt-4o | 20.29% |
| gemini/gemini-2.0-flash | 16.86% |
| gemini/gemini-1.5-pro | 15.14% |
| groq/meta-llama/llama-4-scout-17b-16e-instruct | 1.00% |
| groq/qwen-qwq-32b | 0.57% |
| groq/llama-3.3-70b-versatile | 0.00% |

## Challenge Difficulty Distribution

| Difficulty | Runs |
|---|---|
| easy | 4,259 |
| medium | 2,657 |
| hard | 1,150 |
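
Counts are runs rather than unique challenges (they sum to the 8,066 total runs). To recompute, reusing the `df` loaded above:

```python
# Runs per difficulty level
print(df['challenge_difficulty'].value_counts())
```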

## Data Dictionary

### Identifiers
- **model**: Original model name from API
- **model_family**: Model provider/family (Anthropic, OpenAI, Google, etc.)
- **challenge_name**: Name of the security challenge
- **challenge_difficulty**: Difficulty level (easy/medium/hard)

### Primary Outcomes
- **flag_found**: Boolean indicating whether the run found the flag

### Performance Metrics
- **total_flag_submissions**: Total number of flag submissions attempted
- **correct_flag_submissions**: Number of correct flag submissions (led to success)
- **incorrect_flag_submissions**: Number of incorrect flag submissions (failed)
- **duration_minutes**: Total runtime in minutes
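
These counters make it easy to derive per-model submission accuracy. A sketch, reusing the `df` loaded above:

```python
# Share of submitted flags that were correct, per model
subs = df.groupby('model')[
    ['correct_flag_submissions', 'total_flag_submissions']
].sum()
subs['submission_accuracy'] = (
    subs['correct_flag_submissions'] / subs['total_flag_submissions']
)
print(subs.sort_values('submission_accuracy', ascending=False))
```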

### Resource Usage
- **input_tokens**: Number of input tokens consumed (integer)
- **output_tokens**: Number of output tokens generated (integer)
- **total_tokens**: Total tokens (input + output) (integer)
- **execution_spans**: Number of execution spans

### Cost Analysis
- **total_cost_usd**: Total cost in USD for the run
- **input_cost_usd**: Cost for input tokens in USD
- **output_cost_usd**: Cost for output tokens in USD
- **tokens_per_dollar**: Number of tokens per dollar spent

### Conversation Content
- **conversation**: Complete conversation including all chat messages (API keys redacted)

### Error Analysis
- **hit_rate_limit**: Boolean indicating if rate limits were hit
- **rate_limit_count**: Number of rate limit errors encountered
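
A quick way to see which model families were throttled most, as a sketch:

```python
# Fraction of runs that hit a rate limit and mean error count, by family
rate_limits = df.groupby('model_family').agg({
    'hit_rate_limit': 'mean',
    'rate_limit_count': 'mean'
}).round(3)
print(rate_limits.sort_values('hit_rate_limit', ascending=False))
```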

## Usage Examples

### Basic Analysis
```python
import pandas as pd

# Load the dataset
df = pd.read_parquet('airtbench_external_dataset.parquet')

# Calculate success rates by model
success_by_model = df.groupby('model')['flag_found'].mean().sort_values(ascending=False)
print(success_by_model)

# Calculate success rates by challenge
success_by_challenge = df.groupby('challenge_name')['flag_found'].mean().sort_values(ascending=False)
print(success_by_challenge)
```

### Cost Analysis
```python
# Analyze cost efficiency by model
cost_analysis = df.groupby('model').agg({
    'total_cost_usd': 'mean',
    'tokens_per_dollar': 'mean',
    'flag_found': 'mean'
}).round(4)

# cost_per_success is not a dataset column; derive it as mean cost
# divided by success rate (inf for models that never succeeded)
cost_analysis['cost_per_success'] = (
    cost_analysis['total_cost_usd'] / cost_analysis['flag_found']
).round(4)
print(cost_analysis)
```

### Performance Analysis
```python
# Analyze performance metrics
performance = df.groupby('model').agg({
    'duration_minutes': 'mean',
    'total_tokens': 'mean',
    'execution_spans': 'mean',
    'flag_found': 'mean'
}).round(2)
print(performance)
```

### Conversation Content

```python
import ast

# Conversations are stored as strings; parse with ast.literal_eval
# rather than eval() to avoid executing arbitrary code
conversation = ast.literal_eval(df.loc[50, 'conversation'])
print(conversation)
```
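
Assuming each parsed message is a dict with `role` and `content` keys (the usual chat-API convention; verify against the actual schema), the transcript can be walked turn by turn:

```python
# Hypothetical message structure: adjust keys to match the real schema
for message in conversation:
    role = message.get('role', '?')
    content = str(message.get('content', ''))
    print(f"[{role}] {content[:120]}")  # first 120 characters of each turn
```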

## Contact

For questions about this dataset, please contact [[email protected]](mailto:[email protected]).

## Version History

- v1.0: Initial external release