Our code will be made available soon.
- We systematically analyze sparse autoencoders (SAEs) in large vision-language models (LVLMs) and uncover two major limitations:
- a majority of SAE neurons exhibit low interpretability, low steerability, or both, rendering them ineffective for downstream use; and
- user-desired concepts are often absent from the SAE's learned dictionary, limiting its practical utility.
- We address these limitations with our proposed Concept Bottleneck Sparse Autoencoders (CB-SAE):
- using a novel post-hoc framework that prunes low-utility neurons; and
- augmenting the SAE latent space with a concept bottleneck aligned to a user-defined concept set.
- Our CB-SAE improves interpretability by ~32% and steerability by ~14% across LVLMs and image generation tasks.
A. Kulkarni, T.-W. Weng, V. Narayanaswamy, S. Liu, W. A. Sakla, K. Thopalli, Interpretable and Steerable Concept Bottleneck Sparse Autoencoders, CVPR 2026
@inproceedings{kulkarni2026interpretable,
title={Interpretable and Steerable Concept Bottleneck Sparse Autoencoders},
author={Kulkarni, Akshay and Weng, Tsui-Wei and Narayanaswamy, Vivek and Liu, Shusen and Sakla, Wesam and Thopalli, Kowshik},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026},
}