
Trustworthy-ML-Lab/CB-SAE

Interpretable and Steerable Concept Bottleneck Sparse Autoencoders (CVPR 2026)

Our code will be made available soon.

  • We systematically analyze sparse autoencoders (SAEs) in large vision-language models (LVLMs) and uncover two major limitations:
    • a majority of SAE neurons exhibit low interpretability, low steerability, or both, rendering them ineffective for downstream use; and
    • user-desired concepts are often absent from the SAE's learned dictionary, limiting its practical utility.
  • We address these limitations with our proposed Concept Bottleneck Sparse Autoencoders (CB-SAE):
    • a novel post-hoc framework that prunes low-utility neurons; and
    • an augmentation of the SAE latent space with a concept bottleneck aligned to a user-defined concept set.
  • Our CB-SAE improves interpretability by ~32% and steerability by ~14% across LVLMs and image generation tasks.
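Since the code is not yet released, the two-step recipe above can only be sketched. The snippet below is a minimal, hypothetical illustration, not the paper's implementation: the dimensions, the random "utility" scores standing in for the paper's interpretability/steerability metrics, and the linear concept head are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): model hidden size,
# SAE dictionary size, and number of user-defined concepts.
d_model, d_sae, n_concepts = 16, 64, 8

# A vanilla SAE encoder: linear map followed by ReLU sparse code.
W_enc = rng.standard_normal((d_model, d_sae)) * 0.1
b_enc = np.zeros(d_sae)

def sae_encode(x):
    return np.maximum(x @ W_enc + b_enc, 0.0)

# Step 1 (post-hoc pruning): keep only neurons whose utility score
# passes a threshold. Here the scores are random placeholders for the
# paper's interpretability/steerability measurements.
utility = rng.random(d_sae)
keep = utility > 0.5  # boolean pruning mask over SAE neurons

# Step 2 (concept bottleneck): a linear head from the pruned latent to a
# user-defined concept set, appended to the dictionary. In the paper this
# alignment is learned; here the weights are random for illustration.
W_concept = rng.standard_normal((int(keep.sum()), n_concepts)) * 0.1

def cb_sae_encode(x):
    z = sae_encode(x)[..., keep]          # pruned SAE latent
    c = np.maximum(z @ W_concept, 0.0)    # concept-bottleneck activations
    return np.concatenate([z, c], axis=-1)

x = rng.standard_normal(d_model)
z_cb = cb_sae_encode(x)
print(z_cb.shape)  # (number of kept neurons + n_concepts,)
```

The point of the sketch is the interface: downstream users read and steer the concatenation of the surviving interpretable neurons and the named concept dimensions, rather than the full raw dictionary.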

Overview

Cite this work

A. Kulkarni, T.-W. Weng, V. Narayanaswamy, S. Liu, W. A. Sakla, K. Thopalli, Interpretable and Steerable Concept Bottleneck Sparse Autoencoders, CVPR 2026

@inproceedings{kulkarni2026interpretable,
    title={Interpretable and Steerable Concept Bottleneck Sparse Autoencoders},
    author={Kulkarni, Akshay and Weng, Tsui-Wei and Narayanaswamy, Vivek and Liu, Shusen and Sakla, Wesam and Thopalli, Kowshik},
    booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2026},
}
