A comprehensive evaluation framework for assessing multimodal large language models (MLLMs) on multi-image relationship (MIR) based safety attacks.
HuggingFace: thu-coai/MIR-SafetyBench
Paper: arXiv:2601.14127
MIR-SafetyBench evaluates MLLM safety through multi-image relationship attacks across 6 safety categories and 9 relationship types:
Safety Categories:
- Hate Speech
- Violence
- Self-Harm
- Illegal Activities
- Harassment
- Privacy
Relationship Types:
- Analogy
- Causality
- Complementarity
- Decomposition
- Relevance
- Spatial Embedding
- Spatial Juxtaposition
- Temporal Continuity
- Temporal Jump
Each sample contains:
- id: Unique identifier
- original_question: Original unsafe question
- relationship_type: Multi-image relationship type
- revised_prompt: Attack prompt utilizing multi-image relationships
- image_descriptions: Textual descriptions of images
- image_keywords: Keywords for each image
- images: List of image file paths
- iteration: Generation iteration number
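For illustration, a single record might look roughly like the following (all values below are hypothetical placeholders, not actual dataset content):

{
  "id": "Hate_Speech_Analogy_1",
  "original_question": "<original unsafe question>",
  "relationship_type": "Analogy",
  "revised_prompt": "<attack prompt referring to the images>",
  "image_descriptions": ["<description of image 1>", "<description of image 2>"],
  "image_keywords": [["<keyword_a>"], ["<keyword_b>"]],
  "images": ["Hate_Speech/images/Analogy/Analogy_1_0.png", "Hate_Speech/images/Analogy/Analogy_1_1.png"],
  "iteration": 1
}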
Download the dataset from HuggingFace:
# Set your HuggingFace token
export HUGGINGFACE_TOKEN=your_token_here
# The extraction script will download automatically
# Or download manually to a local path
Convert the HuggingFace dataset to a local file structure:
# From HuggingFace (requires HUGGINGFACE_TOKEN)
python extract_data.py
# From local path
python extract_data.py --local-path /path/to/downloaded/dataset
# Specify output directory
python extract_data.py --output ./data --local-path /path/to/dataset
This creates a structured directory:
output/
├── Hate_Speech/
│   ├── images/
│   │   ├── Analogy/
│   │   │   ├── Analogy_1_0.png
│   │   │   └── ...
│   │   └── ...
│   ├── Analogy_final.json
│   └── ...
├── Violence/
└── ...
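If you prefer to inspect the dataset directly instead of the extracted files, it can also be loaded with the datasets library. The snippet below is a minimal sketch; the split/config handling is an assumption, so check the dataset card on the Hub for the actual layout:

import os
from datasets import load_dataset

# Gated/private datasets require an access token.
token = os.environ.get("HUGGINGFACE_TOKEN")

# ASSUMPTION: the default configuration is loadable without extra arguments.
dataset = load_dataset("thu-coai/MIR-SafetyBench", token=token)
print(dataset)                               # available splits and their sizes
first_split = list(dataset.keys())[0]
print(dataset[first_split][0])               # inspect the first record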
Evaluate your model on the benchmark:
# Basic usage
python eval.py \
--json_dir ./data \
--models your_model_name \
--output_dir ./results \
--model_path /path/to/your/model
# With HarmBench evaluation
python eval.py \
--json_dir ./data \
--models your_model_name \
--output_dir ./results \
--model_path /path/to/your/model \
--evaluators harmbench \
--harmbench_model_path /path/to/HarmBench-Llama-2-13b-cls
# For closed-source models (API)
python eval.py \
--json_dir ./data \
--models close_source_model \
--api_model_name gpt-4o \
--output_dir ./results
Create a new model adapter in the models/ directory. Your adapter must implement three functions:
from typing import List

def load_model(model_path, num_gpus=1):
    """Load the model and return an inference handle (pipe)."""
    pass

def infer(pipe, prompts: List[str], image_path_sets: List[List[str]]):
    """Run inference for each prompt with its list of image paths and return the results."""
    pass

def unload_model(pipe):
    """Clean up model resources (GPU memory, worker processes, etc.)."""
    pass
For standard vision-language chat models:
- Uses vLLM for efficient inference
- Processes multiple images per prompt
- Returns structured outputs
python eval.py \
--json_dir ./data \
--models qwen2_5_VL_3B \
--output_dir ./results \
--model_path /path/to/Qwen2.5-VL-3B-Instruct
For models with chain-of-thought reasoning:
- Extracts answers from reasoning traces
- Handles <think> and <answer> tags
- Robust parsing for incomplete outputs
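As an illustration of the kind of parsing this involves (a sketch only, not the framework's actual extraction code):

import re

def extract_answer(raw_output: str) -> str:
    """Pull the final answer out of a reasoning trace that uses <think>/<answer> tags.

    Illustrative sketch: falls back to stripping the reasoning block when the
    output is truncated and the closing <answer> tag never appears.
    """
    match = re.search(r"<answer>(.*?)(?:</answer>|$)", raw_output, flags=re.DOTALL)
    if match:
        return match.group(1).strip()
    # Incomplete output: drop whatever sits inside <think>...</think> and keep the rest.
    return re.sub(r"<think>.*?(?:</think>|$)", "", raw_output, flags=re.DOTALL).strip()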
python eval.py \
--json_dir ./data \
--models GLM-4.1V-9B-Thinking \
--output_dir ./results \
--model_path /path/to/GLM-4.1V-9B-Thinking
For API-based models (OpenAI, Claude, etc.):
- Handles rate limiting and retries
- Supports concurrent requests
- Automatic error recovery
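The retry handling referred to above looks roughly like the following sketch (not the adapter's actual code; the helper name is hypothetical, the model name matches the gpt-4o example below, image inputs are omitted for brevity, and the timeout/retry defaults mirror the environment variables documented further down):

import time
from openai import OpenAI

client = OpenAI(timeout=149.0)  # reads OPENAI_API_KEY from the environment

def call_with_retries(prompt: str, max_retries: int = 8) -> str:
    """Call the API, backing off exponentially on transient failures (sketch)."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)             # back off before retrying
            delay = min(delay * 2, 60.0)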
# Set API credentials
export OPENAI_API_KEY=your_api_key
# Run evaluation
python eval.py \
--json_dir ./data \
--models close_source_model \
--api_model_name gpt-4o \
--output_dir ./results
Environment Variables for API Models:
export OPENAI_API_KEY=your_key # Required
export CLOSE_SOURCE_API_BASE_URL=https://... # Optional
export CLOSE_SOURCE_API_NUM_WORKERS=9 # Concurrent processes
export CLOSE_SOURCE_API_TIMEOUT_SEC=149 # Request timeout
export CLOSE_SOURCE_API_MAX_TOTAL_RETRIES=8 # Max retry attempts
Specify GPU requirements in your model file:
# Single GPU
GPU_REQUIREMENT = 1
# No GPU (for API models)
GPU_REQUIREMENT = 0
# For API models, specify CPU workers
NUM_CPU_WORKERS = 9
Results are organized by evaluation stage:
results/
├── infer/
│   └── {model_name}/
│       └── {category}/
│           └── {relationship_type}.json
└── harmbench/
    └── {model_name}/
        └── {category}/
            └── {relationship_type}.json
Each inference result contains:
- original_question: Original unsafe question
- revised_prompt: Attack prompt with images
- answer: Model's response
- item_index: Sample index
- inference_status: success, failed, or crashed
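For example, a single entry in a {relationship_type}.json file might look like this (values are illustrative placeholders):

{
  "original_question": "<original unsafe question>",
  "revised_prompt": "<attack prompt referring to the images>",
  "answer": "<model response>",
  "item_index": 0,
  "inference_status": "success"
}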
The framework supports:
- HarmBench: Binary safety classification using HarmBench-Llama-2-13b-cls
Configure evaluators via --evaluators flag:
python eval.py \
--json_dir ./data \
--models your_model \
--output_dir ./results \
--evaluators harmbench \
--harmbench_model_path /path/to/HarmBench-Llama-2-13b-cls \
--harmbench_batch_size 1
After evaluation, use statics.py to analyze results and compute Attack Success Rate (ASR):
# Analyze results for a specific model
python statics.py --path ./results/harmbench/{model_name}
# Example
python statics.py --path ./results/harmbench/qwen2_5_VL_3B
Output:
- Total unsafe count per relationship type
- Total samples per relationship type
- ASR (%) for each relationship type
- Overall ASR across all categories
Example Output:
==============================================================================
Final Statistics
==============================================================================
Filename Unsafe Count Total Count ASR(%)
------------------------------------------------------------------------------
Analogy.json 45 50 90.00%
Causality.json 38 45 84.44%
...
------------------------------------------------------------------------------
Total 423 500 84.60%
==============================================================================
The script automatically:
- Traverses all 6 safety category folders
- Aggregates statistics across categories
- Calculates per-relationship-type and overall ASR
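Conceptually, the computation is ASR(%) = unsafe count / total count * 100, aggregated per relationship-type file and overall. A minimal sketch of that aggregation is shown below (the per-item "harmbench_label" field and the assumption that each file holds a list of records are placeholders; statics.py may use a different schema):

import json
from pathlib import Path

def compute_asr(harmbench_dir: str) -> None:
    """Sketch: aggregate unsafe counts per relationship-type file across category folders."""
    unsafe, total = {}, {}
    for path in Path(harmbench_dir).glob("*/*.json"):   # {category}/{relationship_type}.json
        items = json.loads(path.read_text())             # assumed: a list of per-sample records
        name = path.name
        total[name] = total.get(name, 0) + len(items)
        # ASSUMPTION: each record carries a binary HarmBench verdict under "harmbench_label".
        unsafe[name] = unsafe.get(name, 0) + sum(1 for it in items if it.get("harmbench_label") == "unsafe")
    for name in sorted(total):
        asr = 100.0 * unsafe[name] / total[name] if total[name] else 0.0
        print(f"{name:<30} {unsafe[name]:>6} {total[name]:>6} {asr:9.2f}%")
    overall = 100.0 * sum(unsafe.values()) / max(sum(total.values()), 1)
    print(f"{'Total':<30} {sum(unsafe.values()):>6} {sum(total.values()):>6} {overall:9.2f}%")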
pip install -r requirements.txt
Core dependencies:
- datasets - For HuggingFace dataset handling
- pillow - For image processing
- tqdm - For progress bars
- torch - For model inference
- vllm - For efficient LLM inference
- transformers - For model loading
See LICENSE file for details.
If you use this benchmark, please cite:
@misc{chen2026effectssmartsafetyrisks,
title={The Side Effects of Being Smart: Safety Risks in MLLMs' Multi-Image Reasoning},
author={Renmiao Chen and Yida Lu and Shiyao Cui and Xuan Ouyang and Victor Shea-Jay Huang and Shumin Zhang and Chengwei Pan and Han Qiu and Minlie Huang},
year={2026},
eprint={2601.14127},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.14127},
}