autoXplain is a framework that combines Vision Language Models (VLMs) with various Class Activation Mapping (CAM) methods to automatically explain and evaluate vision model predictions. It provides detailed explanations, saliency maps, and quantitative evaluations of model performance.
Install the autoXplain package:

```bash
pip install git+https://github.com/phuvinhnguyen/autoXplain.git
```

Or clone and install it:

```bash
git clone https://github.com/phuvinhnguyen/autoXplain.git
cd autoXplain
pip install -e .
```

- Multiple CAM methods support:
- GradCAM
- SmoothGradCAM++
- GradCAM++
- CAM
- ScoreCAM
- LayerCAM
- XGradCAM
- Automatic evaluation using Vision Language Models (VLMs)
- Batch processing of images
- Comprehensive result analysis and reporting
- Support for different vision models (ResNet18, MaxViT)
- Detailed performance metrics and visualizations
```python
from autoXplain.evaluating import CamJudge
from FlowDesign.litellm import LLMInference
from torchcam.methods import GradCAM
from torchvision.models import resnet18
import torchvision
import json
import urllib.request

# Load the ImageNet class labels
url = "https://storage.googleapis.com/download.tensorflow.org/data/imagenet_class_index.json"
with urllib.request.urlopen(url) as response:
    class_idx = json.load(response)
labels = [class_idx[str(i)][1] for i in range(1000)]

# Load the VLM
bot = LLMInference("gemini/gemini-1.5-flash", api_key='<API_TOKEN>')

# Load the vision model
model = resnet18(weights=torchvision.models.ResNet18_Weights.IMAGENET1K_V1)

# Create the workflow
agent = CamJudge(bot, GradCAM, model, labels=labels)

# Run the framework on a single image
output = agent({'image': 'path/to/image.png', 'label': 'label_of_the_image'})
print(output)
```

To process multiple images and generate an XAI confusion matrix, use the examples/process_folder.py script:
```bash
python examples/process_folder.py input_folder \
    --save_dir autoXplain_results \
    --model resnet18 \
    --cam_type gradcam \
    --threshold 2.5 \
    --vlm_model gemini/gemini-1.5-flash \
    --api_key YOUR_API_KEY_1,YOUR_API_KEY_2
```

- `input_folder`: Path to the folder containing images (required)
- `--save_dir`: Path to save processed results (default: 'autoXplain_results')
- `--model`: Vision model to use (choices: 'resnet18', 'maxvit_t', ..., default: 'resnet18')
- `--cam_type`: CAM method to use (choices: 'gradcam', 'smoothgradcam', 'gradcamplusplus', 'cam', 'scorecam', 'layercam', 'xgradcam', default: 'gradcam')
- `--threshold`: Threshold on the VLM score above which an explanation counts as good (default: 2.5)
- `--vlm_model`: Name of the VLM used for evaluation (default: 'gemini/gemini-1.5-flash')
- `--api_key`: Google API key(s) for the Gemini model; multiple keys can be comma-separated (required)
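The flag list above can be sketched as an `argparse` parser. This is a hypothetical reconstruction from the documented options, not the actual source of `process_folder.py`; in particular, the real script accepts more `--model` choices than the two listed here.

```python
import argparse

# Hypothetical sketch of process_folder.py's CLI, built from the option
# list above. The real script defines additional --model choices.
parser = argparse.ArgumentParser()
parser.add_argument("input_folder")
parser.add_argument("--save_dir", default="autoXplain_results")
parser.add_argument("--model", default="resnet18",
                    choices=["resnet18", "maxvit_t"])  # real script has more
parser.add_argument("--cam_type", default="gradcam",
                    choices=["gradcam", "smoothgradcam", "gradcamplusplus",
                             "cam", "scorecam", "layercam", "xgradcam"])
parser.add_argument("--threshold", type=float, default=2.5)
parser.add_argument("--vlm_model", default="gemini/gemini-1.5-flash")
parser.add_argument("--api_key", required=True)

# Parse an example command line (comma-separated keys stay one string)
args = parser.parse_args(["images", "--api_key", "KEY1,KEY2", "--threshold", "3.0"])
print(args.input_folder, args.cam_type, args.threshold)
```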
Images should be named in the format `id_label.extension`, e.g. `001_cat.jpg` or `002_dog.png`.
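A filename in this format can be split into its id and ground-truth label by partitioning on the first underscore, so labels that themselves contain underscores survive intact. The helper below is a minimal sketch, not part of the autoXplain API:

```python
from pathlib import Path

def parse_image_name(path):
    """Split an 'id_label.extension' filename into (id, label).

    Splits only on the first underscore, so a label like 'tiger_cat'
    is preserved whole.
    """
    stem = Path(path).stem                 # e.g. '001_cat'
    image_id, _, label = stem.partition("_")
    return image_id, label

print(parse_image_name("001_cat.jpg"))        # ('001', 'cat')
print(parse_image_name("003_tiger_cat.png"))  # ('003', 'tiger_cat')
```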
The framework provides comprehensive outputs. For each image:

- Saliency map
- Masked CAM image
- Description
- Justification
- Score
- Prediction

Across the whole run:

- All individual image results
- A final report categorizing results into four cases:
  - Correct predictions with high VLM score
  - Correct predictions with low VLM score
  - Wrong predictions with high VLM score
  - Wrong predictions with low VLM score
- Summary statistics
- Detailed analysis of each case
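The four report cases amount to bucketing each result by two booleans: whether the model's prediction matches the label, and whether the VLM score clears the threshold. A minimal sketch of that categorization, assuming hypothetical result keys `prediction`, `label`, and `score` (the actual report code may differ):

```python
def categorize(results, threshold=2.5):
    """Bucket each result into one of the four report cases by
    prediction correctness and VLM score vs. the threshold."""
    cases = {
        "correct_high": [], "correct_low": [],
        "wrong_high": [], "wrong_low": [],
    }
    for r in results:
        correctness = "correct" if r["prediction"] == r["label"] else "wrong"
        level = "high" if r["score"] >= threshold else "low"
        cases[f"{correctness}_{level}"].append(r)
    return cases

# Two toy results: one correct/high, one wrong/low
results = [
    {"prediction": "cat", "label": "cat", "score": 4.0},
    {"prediction": "cat", "label": "dog", "score": 1.0},
]
cases = categorize(results)
print({k: len(v) for k, v in cases.items()})
```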
The pipeline follows these steps:

1. Take the model and images as input
2. Compute attention (saliency maps) using the selected CAM method
3. Use a VLM to evaluate and score each sample
4. Compute the confusion matrix of the VLM's judgment against model accuracy
5. Generate comprehensive reports and visualizations
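The batch loop behind these steps can be sketched as follows. This is a hypothetical illustration, not autoXplain's actual implementation: the stub stands in for a `CamJudge`-style agent, and the output keys `prediction` and `score` are assumptions.

```python
def run_pipeline(agent, samples, threshold=2.5):
    """Run the judge agent on each sample and tally the 2x2 matrix of
    model correctness vs. VLM judgment (hypothetical sketch)."""
    matrix = {(r, c): 0 for r in ("correct", "wrong") for c in ("high", "low")}
    outputs = []
    for sample in samples:
        out = agent(sample)  # CAM computation + VLM scoring happen here
        outputs.append(out)
        row = "correct" if out["prediction"] == sample["label"] else "wrong"
        col = "high" if out["score"] >= threshold else "low"
        matrix[(row, col)] += 1
    return outputs, matrix

# Stub standing in for the real agent, just to exercise the loop
def fake_agent(sample):
    return {"prediction": "cat", "score": 3.0}

samples = [{"image": "001_cat.jpg", "label": "cat"},
           {"image": "002_dog.png", "label": "dog"}]
outputs, matrix = run_pipeline(fake_agent, samples)
print(matrix)
```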
If you use this work in your research, please cite the following paper:

```bibtex
@article{nguyen2025novel,
  title={A Novel Framework for Automated Explain Vision Model Using Vision-Language Models},
  author={Nguyen, Phu-Vinh and Pham, Tan-Hanh and Ngo, Chris and Hy, Truong Son},
  journal={arXiv preprint arXiv:2508.20227},
  year={2025}
}
```