[NeurIPS 2025] VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set
This repository is the official implementation of VL-SAE, which helps users to understand the vision-language alignment of VLMs via concepts.
Create a conda virtual environment and activate it:
```bash
conda create -n vlsae python=3.8 -y
conda activate vlsae
```
Install dependencies:
```bash
pip install -r requirements.txt
```
Download the CC3M dataset from cc3m-wds and put it under ./CC3M.
Run the provided scripts to preprocess the dataset:
```bash
bash cc3m_untar.sh
python cc3m_moving.py
python cc3m_meta.py
```
Download LLaVA 1.5 and put it under ./lvlms/pretrained_models.
For OpenCLIP-ViT-B/32, download the pre-trained VL-SAE weights (SAE weights, metadata) and put them under cvlms/demo.
For LLaVA 1.5, download the pre-trained VL-SAE weights (SAE weights, Auxiliary AE weights, metadata) and put them under lvlms/demo.
We present the demo of VL-SAE with OpenCLIP and LLaVA 1.5 in cvlms/demo/demo.ipynb and lvlms/demo/demo.ipynb, respectively.
Moreover, we provide the notebook lvlms/demo/demo_inference.ipynb, which incorporates VL-SAE to modify representations during the inference process of LVLMs.
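As a rough illustration of what the inference demo does, the sketch below maps a hidden representation to concept activations with a VL-SAE-style sparse autoencoder, rescales selected concepts, and decodes the edited representation back. The checkpoint path, the weight keys (W_enc, b_enc, W_dec, b_dec), and the plain ReLU encoder are assumptions for illustration only; the actual loading code and architecture are in the demo notebooks.
```python
# Hedged sketch (not the repository's exact API): interpret a hidden state as
# concept activations, edit selected concepts, and decode the result back.
import torch

# Hypothetical checkpoint layout; the real demo weights may use different keys.
ckpt = torch.load("lvlms/demo/sae_weights.pth", map_location="cpu")
W_enc, b_enc = ckpt["W_enc"], ckpt["b_enc"]  # (d_model, n_concepts), (n_concepts,)
W_dec, b_dec = ckpt["W_dec"], ckpt["b_dec"]  # (n_concepts, d_model), (d_model,)

def encode(h):
    """Map hidden states (..., d_model) to non-negative concept activations."""
    return torch.relu(h @ W_enc + b_enc)

def decode(z):
    """Reconstruct hidden states from concept activations."""
    return z @ W_dec + b_dec

def edit(h, concept_ids, scale=0.0):
    """Suppress (scale=0) or amplify the chosen concepts, then decode back."""
    z = encode(h)
    z[..., concept_ids] = z[..., concept_ids] * scale
    return decode(z)
```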
The pre-trained VL-SAE weights are provided on ModelScope and HuggingFace.
| Base Model | ModelScope | HuggingFace |
|---|---|---|
| OpenCLIP-ViT-B/32 | SAE weights, metadata | SAE weights, metadata |
| OpenCLIP-ViT-B/16 | SAE weights, metadata | SAE weights, metadata |
| OpenCLIP-ViT-L/14 | SAE weights, metadata | SAE weights, metadata |
| OpenCLIP-ViT-H/14 | SAE weights, metadata | SAE weights, metadata |
| LLaVA-1.5-7B | SAE weights, Auxiliary AE weights, metadata | SAE weights, Auxiliary AE weights, metadata |
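For the HuggingFace copies, huggingface_hub can fetch the files programmatically; the repo id and filenames below are placeholders to be replaced with the entries linked in the table:
```python
# Hedged sketch: download pre-trained VL-SAE files from the HuggingFace Hub.
# "<repo_id>" and the filenames are placeholders; use the links in the table above.
from huggingface_hub import hf_hub_download

sae_path = hf_hub_download(repo_id="<repo_id>",
                           filename="openclip_ViT-B-32_VL_SAE_256_8_best.pth")
meta_path = hf_hub_download(repo_id="<repo_id>", filename="metadata.json")
print(sae_path, meta_path)
```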
This repo supports the construction of VL-SAE for LLaVA-1.5 and OpenCLIP.
First, collect the hidden representations of pre-trained models:
```bash
model_type="cvlms"   # for OpenCLIP
# model_type="lvlms" # for LLaVA
cd ./${model_type}/representation_collection
bash get_activations.sh
```
With a single NVIDIA RTX 4090, this step takes approximately 5 hours for OpenCLIP and 4 days for LLaVA.
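get_activations.sh drives the actual collection; as a hedged sketch of the idea (using OpenCLIP's public encode_image/encode_text rather than whichever hidden layer the script extracts), paired image and text representations can be gathered and saved like this:
```python
# Hedged sketch of representation collection with OpenCLIP (not the repo script).
import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def collect(pairs, out_path="activations.pt"):
    """pairs: iterable of (image_path, caption); saves paired embeddings to disk."""
    img_feats, txt_feats = [], []
    for image_path, caption in pairs:
        image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
        tokens = tokenizer([caption])
        img_feats.append(model.encode_image(image))
        txt_feats.append(model.encode_text(tokens))
    torch.save({"image": torch.cat(img_feats), "text": torch.cat(txt_feats)}, out_path)

# Example with a hypothetical CC3M sample:
# collect([("./CC3M/images/000000001.jpg", "a dog running on the beach")])
```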
Then, train VL-SAE based on the collected representations:
```bash
cd ../sae_trainer
bash train.sh
```
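train.sh launches the actual training. Purely as an illustration of the mechanism, the toy sketch below runs one update step of a generic TopK sparse autoencoder on collected representations; the real VL-SAE objective, including how vision and language representations are tied to a unified concept set, lives in sae_trainer, and the dimensions and k used here are made up.
```python
# Generic TopK sparse autoencoder training step (illustrative, not the repo's trainer).
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model=512, n_concepts=4096, k=256):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_concepts)
        self.decoder = nn.Linear(n_concepts, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))
        # Keep only the k largest concept activations per sample.
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.decoder(z_sparse), z_sparse

sae = TopKSAE()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# acts = torch.load("activations.pt")["image"]  # collected representations (hypothetical file)
acts = torch.randn(64, 512)                      # stand-in batch for illustration
opt.zero_grad()
recon, _ = sae(acts)
loss = nn.functional.mse_loss(recon, acts)
loss.backward()
opt.step()
```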
Visualize the concepts learned by VL-SAE:
```bash
cd ../eval
python visualize_concept.py --topk 256 --ckpt-path ../sae_trainer/sae_weights/openclip_ViT-B-32_VL_SAE_256_8_best.pth
```
Each concept is represented by a set of images stored in the corresponding folder and sentences in the text_interpretation.txt file.
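The output directory can also be browsed programmatically; a small sketch follows (the exact folder layout is assumed from the description above):
```python
# Hedged sketch: list a visualized concept's example images and text interpretations.
# The folder layout is assumed from the description above; adjust paths as needed.
from pathlib import Path

concept_dir = Path("./concept_images/vlsae_ViT-B-32_256")  # output of visualize_concept.py
for folder in sorted(concept_dir.iterdir()):
    if not folder.is_dir():
        continue
    images = sorted(list(folder.glob("*.jpg")) + list(folder.glob("*.png")))
    text_file = folder / "text_interpretation.txt"
    sentences = text_file.read_text().splitlines() if text_file.exists() else []
    print(f"{folder.name}: {len(images)} images, e.g. {sentences[:2]}")
```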
After the concepts have been visualized, their inter-similarity and intra-similarity scores can be computed using CLIP embeddings:
```bash
python eval.py --target-dir ./concept_images/vlsae_ViT-B-32_256
```
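eval.py produces the reported numbers; the sketch below shows one hedged way to compute such scores from CLIP image embeddings (the script's exact definitions may differ): intra-similarity averages pairwise cosine similarity inside each concept folder, and inter-similarity averages cosine similarity between the centroids of different concepts.
```python
# Hedged sketch of intra-/inter-concept similarity from CLIP image embeddings.
from pathlib import Path
import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
model.eval()

@torch.no_grad()
def embed_folder(folder):
    paths = sorted(list(folder.glob("*.jpg")) + list(folder.glob("*.png")))
    imgs = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    return torch.nn.functional.normalize(model.encode_image(imgs), dim=-1)

def mean_offdiag(sim):
    """Mean of a square similarity matrix excluding the diagonal."""
    n = sim.shape[0]
    return (sim.sum() - sim.diagonal().sum()) / (n * n - n)

target = Path("./concept_images/vlsae_ViT-B-32_256")
per_concept = [embed_folder(f) for f in sorted(target.iterdir()) if f.is_dir()]

# Intra: coherence within each concept (higher = more coherent).
intra = torch.stack([mean_offdiag(e @ e.T) for e in per_concept]).mean()
# Inter: similarity between concept centroids (lower = more distinct).
centroids = torch.nn.functional.normalize(torch.stack([e.mean(0) for e in per_concept]), dim=-1)
inter = mean_offdiag(centroids @ centroids.T)
print(f"intra={intra.item():.3f}, inter={inter.item():.3f}")
```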
Finally, generate a JSON file for the trained SAE, which stores the index of each concept along with its mean activation value and its maximum-activation data (image URLs, texts). This file is designed to support the integration of VL-SAE into the model inference process for interpretability purposes:
```bash
python concept2data.py --topk 256 --ckpt-path ../sae_trainer/sae_weights/openclip_ViT-B-32_VL_SAE_256_8_best.pth
```
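The generated file is plain JSON whose entries follow the description above; the filename and key names in the sketch below are assumptions, so inspect the actual output for the authoritative schema:
```python
# Hedged sketch: inspect the generated concept JSON (filename and keys are assumed).
import json

with open("concept_data.json") as f:  # hypothetical output filename of concept2data.py
    concepts = json.load(f)

# Expected fields per concept, following the description above:
#   index                -> concept index in the SAE dictionary
#   mean_activation      -> mean activation value over the dataset
#   max_activation_data  -> image URLs and texts that activate the concept most strongly
first = concepts[0] if isinstance(concepts, list) else next(iter(concepts.values()))
print(json.dumps(first, indent=2))
```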
Integrate the pre-trained VL-SAE into the inference process of LLaVA 1.5 to eliminate hallucinations.
First, download the validation images & annotations of COCO 2014 and put them under lvlms/VCD/data/coco.
Then, run the provided scripts to evaluate the performance of VL-SAE on different benchmarks.
```bash
cd lvlms/VCD/experiments
# For the POPE benchmark
bash cd_scripts/llava1.5_pope.sh
# For the CHAIR benchmark
bash cd_scripts/llava1.5_chair.sh
```
If you find VL-SAE useful for your research and applications, please cite it using this BibTeX:
```bibtex
@misc{shen2025vlsae,
      title={VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set},
      author={Shufan Shen and Junshu Sun and Qingming Huang and Shuhui Wang},
      year={2025},
      eprint={2510.21323},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.21323},
}
```


