1 | | -# AIRTBench Dataset |
| 1 | +# AIRTBench Dataset - External Release |
2 | 2 |
| 3 | +- [AIRTBench Dataset - External Release](#airtbench-dataset---external-release) |
| 4 | + - [Overview](#overview) |
| 5 | + - [Dataset Statistics](#dataset-statistics) |
| 6 | + - [Model Success Rates](#model-success-rates) |
| 7 | + - [Challenge Difficulty Distribution](#challenge-difficulty-distribution) |
| 8 | + - [Data Dictionary](#data-dictionary) |
| 9 | + - [Identifiers](#identifiers) |
| 10 | + - [Primary Outcomes](#primary-outcomes) |
| 11 | + - [Performance Metrics](#performance-metrics) |
| 12 | + - [Resource Usage](#resource-usage) |
| 13 | + - [Cost Analysis](#cost-analysis) |
| 14 | + - [Conversation Content](#conversation-content) |
| 15 | + - [Error Analysis](#error-analysis) |
| 16 | + - [Usage Examples](#usage-examples) |
| 17 | + - [Basic Analysis](#basic-analysis) |
| 18 | + - [Cost Analysis](#cost-analysis-1) |
| 19 | + - [Performance Analysis](#performance-analysis) |
| 20 | + - [Conversation Content](#conversation-content-1) |
| 21 | + - [Contact](#contact) |
| 22 | + - [Version History](#version-history) |
| 23 | + |
| 24 | +## Overview |
| 25 | + |
| 26 | +This dataset contains the complete experimental results from the AIRTBench paper: "*AIRTBench: An AI Red Teaming Benchmark for Evaluating Language Models' Ability to Autonomously Discover and Exploit AI/ML Security Vulnerabilities.*" |
| 27 | + |
| 28 | +The dataset comprises 8,066 experimental runs across 12 language models and 70 security challenges. It is available on [Hugging Face](https://huggingface.co/datasets/dreadnode/AIRTBench/). |
| 29 | + |
| 30 | +## Dataset Statistics |
| 31 | + |
| 32 | +- **Total Runs**: 8,066 |
| 33 | +- **Unique Models**: 12 |
| 34 | +- **Unique Challenges**: 70 |
| 35 | +- **Overall Success Rate**: 20.5% (share of all runs in which the flag was found) |
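
The headline figures above can in principle be recomputed from the per-run rows. The following is a minimal sketch, assuming the column names listed in the Data Dictionary (`model`, `challenge_name`, `flag_found`). The `summarize` helper and the tiny `demo` frame are hypothetical and only illustrate the calculation; they are not real AIRTBench data.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Recompute headline statistics from a frame with one row per run."""
    return {
        "total_runs": len(df),
        "unique_models": df["model"].nunique(),
        "unique_challenges": df["challenge_name"].nunique(),
        # flag_found is boolean, so its mean is the fraction of successful runs
        "success_rate_pct": round(float(df["flag_found"].mean()) * 100, 1),
    }


# Synthetic rows for illustration only (not real AIRTBench data)
demo = pd.DataFrame(
    {
        "model": ["model-a", "model-a", "model-b", "model-b"],
        "challenge_name": ["c1", "c2", "c1", "c2"],
        "flag_found": [True, False, False, False],
    }
)

stats = summarize(demo)
print(stats)
```

On the real dataset, the same helper could be called on the frame loaded in the Usage Examples section.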
| 36 | + |
| 37 | +## Model Success Rates |
| 38 | + |
| 39 | +| Model | Success Rate | |
| 40 | +|---|---| |
| 41 | +| claude-3-7-sonnet-20250219 | 46.86% | |
| 42 | +| gpt-4.5-preview | 36.89% | |
| 43 | +| gemini/gemini-2.5-pro-preview-05-06 | 34.29% | |
| 44 | +| openai/o3-mini | 28.43% | |
| 45 | +| together_ai/deepseek-ai/DeepSeek-R1 | 26.86% | |
| 46 | +| gemini/gemini-2.5-flash-preview-04-17 | 26.43% | |
| 47 | +| openai/gpt-4o | 20.29% | |
| 48 | +| gemini/gemini-2.0-flash | 16.86% | |
| 49 | +| gemini/gemini-1.5-pro | 15.14% | |
| 50 | +| groq/meta-llama/llama-4-scout-17b-16e-instruct | 1.00% | |
| 51 | +| groq/qwen-qwq-32b | 0.57% | |
| 52 | +| groq/llama-3.3-70b-versatile | 0.00% | |
| 53 | + |
| 54 | +## Challenge Difficulty Distribution |
| 55 | + |
| 56 | +| Difficulty | Runs |
| 57 | +|---|---| |
| 58 | +| easy | 4,259 | |
| 59 | +| medium | 2,657 | |
| 60 | +| hard | 1,150 | |
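
The table above counts runs per difficulty level (the three rows sum to the 8,066 total runs). A breakdown of this kind, together with a per-difficulty success rate, could be derived roughly as follows. This is a sketch that assumes the `challenge_difficulty` and `flag_found` columns from the Data Dictionary; `success_by_difficulty` and the `demo` rows are hypothetical.

```python
import pandas as pd


def success_by_difficulty(df: pd.DataFrame) -> pd.DataFrame:
    """Count runs and compute the success rate for each difficulty level."""
    return (
        df.groupby("challenge_difficulty")["flag_found"]
        .agg(runs="count", success_rate="mean")
        .sort_values("runs", ascending=False)
    )


# Synthetic rows for illustration only (not real AIRTBench data)
demo = pd.DataFrame(
    {
        "challenge_difficulty": ["easy", "easy", "easy", "hard"],
        "flag_found": [True, True, False, False],
    }
)

by_difficulty = success_by_difficulty(demo)
print(by_difficulty)
```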
| 61 | + |
| 62 | +## Data Dictionary |
| 63 | + |
| 64 | +### Identifiers |
| 65 | +- **model**: Original model name from API |
| 66 | +- **model_family**: Model provider/family (Anthropic, OpenAI, Google, etc.) |
| 67 | +- **challenge_name**: Name of the security challenge |
| 68 | +- **challenge_difficulty**: Difficulty level (`easy`, `medium`, or `hard`) |
| 69 | + |
| 70 | +### Primary Outcomes |
| 71 | +- **flag_found**: Boolean indicating if the run found the flag. |
| 72 | + |
| 73 | +### Performance Metrics |
| 74 | +- **total_flag_submissions**: Total number of flag submissions attempted |
| 75 | +- **correct_flag_submissions**: Number of correct flag submissions (led to success) |
| 76 | +- **incorrect_flag_submissions**: Number of incorrect flag submissions (failed) |
| 77 | +- **duration_minutes**: Total runtime in minutes |
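
The three submission counters suggest a simple sanity check. The sketch below assumes that `total_flag_submissions` equals `correct_flag_submissions + incorrect_flag_submissions`; the field descriptions imply this, but the dataset documentation does not state it outright. The helper name and the `demo` rows are hypothetical.

```python
import pandas as pd


def inconsistent_submission_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows where the total does not equal correct + incorrect."""
    expected = df["correct_flag_submissions"] + df["incorrect_flag_submissions"]
    return df[df["total_flag_submissions"] != expected]


# Synthetic rows for illustration only; the last row is deliberately inconsistent
demo = pd.DataFrame(
    {
        "total_flag_submissions": [3, 0, 5],
        "correct_flag_submissions": [1, 0, 1],
        "incorrect_flag_submissions": [2, 0, 3],
    }
)

bad_rows = inconsistent_submission_rows(demo)
print(bad_rows)
```

An empty result on the real data would support the assumed relationship. A non-empty result would indicate that the counters are defined differently.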
| 78 | + |
| 79 | +### Resource Usage |
| 80 | +- **input_tokens**: Number of input tokens consumed (integer) |
| 81 | +- **output_tokens**: Number of output tokens generated (integer) |
| 82 | +- **total_tokens**: Total tokens (input + output) (integer) |
| 83 | +- **execution_spans**: Number of execution spans |
| 84 | + |
| 85 | +### Cost Analysis |
| 86 | +- **total_cost_usd**: Total cost in USD for the run |
| 87 | +- **input_cost_usd**: Cost for input tokens in USD |
| 88 | +- **output_cost_usd**: Cost for output tokens in USD |
| 89 | +- **tokens_per_dollar**: Number of tokens per dollar spent |
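
To make the relationship between the cost fields concrete, here is a hedged sketch of how a tokens-per-dollar figure could be derived. It assumes `tokens_per_dollar` is `total_tokens / total_cost_usd`, which is a natural reading of the description but is not confirmed by the source. Runs with zero cost are mapped to NaN to avoid division by zero. The helper and the `demo` rows are hypothetical.

```python
import pandas as pd


def derive_tokens_per_dollar(df: pd.DataFrame) -> pd.Series:
    """Tokens per dollar under the assumed definition total_tokens / total_cost_usd."""
    # Replace non-positive costs with NaN so the division yields NaN instead of inf
    cost = df["total_cost_usd"].where(df["total_cost_usd"] > 0)
    return df["total_tokens"] / cost


# Synthetic rows for illustration only (not real AIRTBench data)
demo = pd.DataFrame(
    {
        "total_tokens": [1000, 500, 200],
        "total_cost_usd": [0.5, 0.0, 0.25],
    }
)

derived = derive_tokens_per_dollar(demo)
print(derived)
```

Comparing the derived values against the published `tokens_per_dollar` column is one way to confirm or refute the assumed definition.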
| 90 | + |
| 91 | +### Conversation Content |
| 92 | +- **conversation**: Complete conversation including all chat messages (API keys redacted) |
| 93 | + |
| 94 | +### Error Analysis |
| 95 | +- **hit_rate_limit**: Boolean indicating if rate limits were hit |
| 96 | +- **rate_limit_count**: Number of rate limit errors encountered |
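
The two rate-limit fields can be combined into a per-model summary, which helps judge whether a low success rate coincides with heavy rate limiting. This is a minimal sketch assuming the `model`, `hit_rate_limit`, and `rate_limit_count` columns described above; `rate_limit_summary` and the `demo` rows are hypothetical.

```python
import pandas as pd


def rate_limit_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize rate-limit exposure per model."""
    return df.groupby("model").agg(
        runs=("hit_rate_limit", "size"),
        # Mean of a boolean column is the share of runs that hit a rate limit
        share_rate_limited=("hit_rate_limit", "mean"),
        rate_limit_errors=("rate_limit_count", "sum"),
    )


# Synthetic rows for illustration only (not real AIRTBench data)
demo = pd.DataFrame(
    {
        "model": ["model-a", "model-a", "model-b"],
        "hit_rate_limit": [True, False, False],
        "rate_limit_count": [3, 0, 0],
    }
)

summary = rate_limit_summary(demo)
print(summary)
```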
| 97 | + |
| 98 | +## Usage Examples |
| 99 | + |
| 100 | +### Basic Analysis |
| 101 | +```python |
| 102 | +import pandas as pd |
| 103 | + |
| 104 | +# Load the dataset |
| 105 | +df = pd.read_parquet('airtbench_external_dataset.parquet') |
| 106 | + |
| 107 | +# Calculate success rates by model |
| 108 | +success_by_model = df.groupby('model')['flag_found'].mean().sort_values(ascending=False) |
| 109 | +print(success_by_model) |
| 110 | + |
| 111 | +# Calculate success rates by challenge |
| 112 | +success_by_challenge = df.groupby('challenge_name')['flag_found'].mean().sort_values(ascending=False) |
| 113 | +print(success_by_challenge) |
| 114 | +``` |
| 115 | + |
| 116 | +### Cost Analysis |
| 117 | +```python |
| 118 | +# Analyze cost efficiency |
| 119 | +cost_analysis = df.groupby('model').agg({ |
| 120 | + 'total_cost_usd': 'mean', |
| 121 | + 'cost_per_success': 'mean', |
| 122 | + 'tokens_per_dollar': 'mean', |
| 123 | + 'flag_found': 'mean' |
| 124 | +}).round(4) |
| 125 | +print(cost_analysis) |
| 126 | +``` |
| 127 | + |
| 128 | +### Performance Analysis |
| 129 | +```python |
| 130 | +# Analyze performance metrics |
| 131 | +performance = df.groupby('model').agg({ |
| 132 | + 'duration_minutes': 'mean', |
| 133 | + 'total_tokens': 'mean', |
| 134 | + 'execution_spans': 'mean', |
| 135 | + 'flag_found': 'mean' |
| 136 | +}).round(2) |
| 137 | +print(performance) |
| 138 | +``` |
| 139 | + |
| 140 | +### Conversation Content |
| 141 | + |
| 142 | +```python |
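| 143 | +import ast  # literal_eval parses Python literals without executing code |
| 144 | +conversation = df.loc[50, 'conversation']  # Conversation stored as a string |
| 145 | +conversation = ast.literal_eval(conversation)  # Convert string to list (safer than eval) |
| 146 | +print(conversation) |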
| 147 | +``` |
| 148 | + |
| 149 | +## Contact |
| 150 | + |
| 151 | +For questions about this dataset, please contact [[email protected]](mailto:[email protected]). |
| 152 | + |
| 153 | +## Version History |
| 154 | + |
| 155 | +- v1.0: Initial external release |