This work is motivated by an observation from my Test-Time Training experiments: information in the CLS tokens of vision transformer (ViT) models is not used by later layers. This can be seen by shuffling the CLS tokens across the samples of large test-set batches, which has no effect on classification accuracy.
Decodable vs causal information in the CLS tokens of ViTs https://lrast.github.io/science/2026/01/20/lab_notes.html
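
The shuffling test described above can be sketched in a few lines. This is a minimal illustration and not the code used in the experiments: it assumes the hidden states at some intermediate layer are available as an array of shape (batch, tokens, dim) with the CLS token at position 0, and the function name `shuffle_cls_tokens` is hypothetical. The layer at which the shuffle is applied is not specified in the summary, so the sketch leaves it open.

```python
import numpy as np

def shuffle_cls_tokens(hidden, rng):
    """Permute the CLS token (assumed to sit at position 0) across the batch.

    hidden: array of shape (batch, tokens, dim). Patch tokens are left untouched,
    so each sample keeps its own patches but receives another sample's CLS token.
    """
    perm = rng.permutation(hidden.shape[0])  # random reordering of batch indices
    shuffled = hidden.copy()
    shuffled[:, 0, :] = hidden[perm, 0, :]   # swap in CLS tokens from other samples
    return shuffled, perm

# Toy stand-in for intermediate ViT activations: 8 samples, 1 CLS + 4 patch tokens.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 5, 4))
shuffled, perm = shuffle_cls_tokens(hidden, rng)
```

In an actual experiment, the shuffled states would be passed through the remaining layers and the resulting accuracy compared with the unshuffled run. If the later layers ignore the CLS token, the two accuracies should match.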