# AIRTBench Dataset - External Release

- [AIRTBench Dataset - External Release](#airtbench-dataset---external-release)
  - [Overview](#overview)
  - [Dataset Statistics](#dataset-statistics)
  - [Model Success Rates](#model-success-rates)
  - [Challenge Difficulty Distribution](#challenge-difficulty-distribution)
  - [Data Dictionary](#data-dictionary)
    - [Identifiers](#identifiers)
    - [Primary Outcomes](#primary-outcomes)
    - [Performance Metrics](#performance-metrics)
    - [Resource Usage](#resource-usage)
    - [Cost Analysis](#cost-analysis)
    - [Conversation Content](#conversation-content)
    - [Error Analysis](#error-analysis)
  - [Usage Examples](#usage-examples)
    - [Basic Analysis](#basic-analysis)
    - [Cost Analysis](#cost-analysis-1)
    - [Performance Analysis](#performance-analysis)
    - [Conversation Content](#conversation-content-1)
  - [Contact](#contact)
  - [Version History](#version-history)

## Overview

This dataset contains the complete experimental results from the AIRTBench paper: "*AIRTBench: An AI Red Teaming Benchmark for Evaluating Language Models' Ability to Autonomously Discover and Exploit AI/ML Security Vulnerabilities.*"

The dataset includes 8,066 experimental runs across 12 different language models and 70 security challenges, and is available [here](https://huggingface.co/datasets/dreadnode/AIRTBench/).

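If you prefer to pull the data directly from the Hugging Face Hub rather than a local file, a minimal sketch using the `datasets` library (the `train` split name is an assumption; check the dataset card):

```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub
ds = load_dataset("dreadnode/AIRTBench")  # split name "train" assumed below
df = ds["train"].to_pandas()              # convert to pandas for the snippets that follow
print(df.shape)
```
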
## Dataset Statistics

- **Total Runs**: 8,066
- **Unique Models**: 12
- **Unique Challenges**: 70
- **Success Rate**: 20.5%

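These headline figures can be recomputed directly from the dataset (using the `df` loaded above):

```python
# Recompute the headline statistics
print(f"Total runs: {len(df):,}")
print(f"Unique models: {df['model'].nunique()}")
print(f"Unique challenges: {df['challenge_name'].nunique()}")
print(f"Overall success rate: {df['flag_found'].mean():.1%}")
```
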
## Model Success Rates

| Model | Success Rate |
|---|---|
| claude-3-7-sonnet-20250219 | 46.86% |
| gpt-4.5-preview | 36.89% |
| gemini/gemini-2.5-pro-preview-05-06 | 34.29% |
| openai/o3-mini | 28.43% |
| together_ai/deepseek-ai/DeepSeek-R1 | 26.86% |
| gemini/gemini-2.5-flash-preview-04-17 | 26.43% |
| openai/gpt-4o | 20.29% |
| gemini/gemini-2.0-flash | 16.86% |
| gemini/gemini-1.5-pro | 15.14% |
| groq/meta-llama/llama-4-scout-17b-16e-instruct | 1.00% |
| groq/qwen-qwq-32b | 0.57% |
| groq/llama-3.3-70b-versatile | 0.00% |

## Challenge Difficulty Distribution

| Difficulty | Count |
|---|---|
| easy | 4,259 |
| medium | 2,657 |
| hard | 1,150 |

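To reproduce this table, or to break success rates down by difficulty, a short sketch using the `df` loaded above (note the lowercase labels, matching the table):

```python
# Count runs per difficulty level
print(df['challenge_difficulty'].value_counts())

# Success rate by difficulty
print(df.groupby('challenge_difficulty')['flag_found'].mean().sort_values(ascending=False))
```
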
## Data Dictionary

### Identifiers
- **model**: Original model name from the API
- **model_family**: Model provider/family (Anthropic, OpenAI, Google, etc.)
- **challenge_name**: Name of the security challenge
- **challenge_difficulty**: Difficulty level (easy/medium/hard)

### Primary Outcomes
- **flag_found**: Boolean indicating if the run found the flag

### Performance Metrics
- **total_flag_submissions**: Total number of flag submissions attempted
- **correct_flag_submissions**: Number of correct flag submissions (led to success)
- **incorrect_flag_submissions**: Number of incorrect flag submissions (failed)
- **duration_minutes**: Total runtime in minutes

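These submission counters make it easy to measure how precise each model's flag guesses were. A small sketch, using the `df` loaded above, computing correct submissions as a share of all attempts:

```python
# Flag-submission accuracy: correct submissions as a share of all attempts
attempted = df[df['total_flag_submissions'] > 0]
accuracy = attempted['correct_flag_submissions'].sum() / attempted['total_flag_submissions'].sum()
print(f"Overall submission accuracy: {accuracy:.1%}")
```
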
### Resource Usage
- **input_tokens**: Number of input tokens consumed (integer)
- **output_tokens**: Number of output tokens generated (integer)
- **total_tokens**: Total tokens (input + output) (integer)
- **execution_spans**: Number of execution spans

### Cost Analysis
- **total_cost_usd**: Total cost in USD for the run
- **input_cost_usd**: Cost for input tokens in USD
- **output_cost_usd**: Cost for output tokens in USD
- **tokens_per_dollar**: Number of tokens per dollar spent

### Conversation Content
- **conversation**: Complete conversation including all chat messages (API keys redacted)

### Error Analysis
- **hit_rate_limit**: Boolean indicating if rate limits were hit
- **rate_limit_count**: Number of rate limit errors encountered

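As a sketch of how these fields can be used (again with the `df` loaded above), one can check how often each model was rate-limited and whether that coincided with failures:

```python
# Share of runs per model that hit a rate limit at least once
print(df.groupby('model')['hit_rate_limit'].mean().sort_values(ascending=False))

# Success rate for rate-limited vs. unaffected runs
print(df.groupby('hit_rate_limit')['flag_found'].mean())
```
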
## Usage Examples

### Basic Analysis
```python
import pandas as pd

# Load the dataset
df = pd.read_parquet('airtbench_external_dataset.parquet')

# Calculate success rates by model
success_by_model = df.groupby('model')['flag_found'].mean().sort_values(ascending=False)
print(success_by_model)

# Calculate success rates by challenge
success_by_challenge = df.groupby('challenge_name')['flag_found'].mean().sort_values(ascending=False)
print(success_by_challenge)
```

### Cost Analysis
```python
# Analyze cost efficiency
cost_analysis = df.groupby('model').agg({
    'total_cost_usd': 'mean',
    'tokens_per_dollar': 'mean',
    'flag_found': 'mean'
}).round(4)

# The data dictionary does not list a cost_per_success column, so derive it:
# total spend divided by the number of successful runs per model
totals = df.groupby('model').agg({'total_cost_usd': 'sum', 'flag_found': 'sum'})
cost_analysis['cost_per_success'] = (totals['total_cost_usd'] / totals['flag_found']).round(4)
print(cost_analysis)
```

### Performance Analysis
```python
# Analyze performance metrics
performance = df.groupby('model').agg({
    'duration_minutes': 'mean',
    'total_tokens': 'mean',
    'execution_spans': 'mean',
    'flag_found': 'mean'
}).round(2)
print(performance)
```

### Conversation Content

```python
import ast

# Example of conversation content
conversation = df.loc[50, 'conversation']
conversation = ast.literal_eval(conversation)  # Safely parse the stored string into a list of messages
print(conversation)
```

## Contact

For questions about this dataset, please contact [[email protected]](mailto:[email protected]).

## Version History

- v1.0: Initial external release
