Using Franca's "Removal of Absolute Spatial Attributes" (RASA) post-training (Venkataramanan et al., 2025) to remove positional bias from other SSL-pretrained ViTs is simple and improves downstream performance. I evaluated the models with OverClustering (Ziegler & Asano, 2022) and obtained a 1-3% performance boost on the validation set across different ViT model sizes. Since the original RASA codebase is not easy to reuse, I adjusted it so that any pre-trained encoder can easily be plugged in. I observed a performance increase for both the DINOv2 and DINOv3 encoders.
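To illustrate the idea of positional debiasing (not the actual RASA procedure, which follows Venkataramanan et al., 2025): one simple way to remove a linearly position-predictable component from patch features is to least-squares fit the features from normalized patch coordinates and subtract the fitted part. The function name and arguments below are my own, for illustration only.

```python
import numpy as np

def remove_positional_component(feats, grid_h, grid_w):
    """Subtract the linearly position-predictable part of patch features.

    feats: (N, D) patch embeddings for an N = grid_h * grid_w patch grid.
    NOTE: this is only a sketch of the idea behind positional debiasing,
    not the RASA algorithm itself.
    """
    rows, cols = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    # normalized (row, col) coordinates plus a bias column
    pos = np.stack([rows.ravel() / (grid_h - 1), cols.ravel() / (grid_w - 1)], axis=1)
    pos = np.concatenate([pos, np.ones((pos.shape[0], 1))], axis=1)
    # fit features from positions, then remove the fitted component
    coef, *_ = np.linalg.lstsq(pos, feats, rcond=None)  # (3, D)
    return feats - pos @ coef
```

After this step, the residual features are orthogonal to the (linear) positional design matrix, so a linear probe can no longer recover patch position from them.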
This image displays the patch cosine similarity between a selected patch token and all other patches, as on page 4 of the DINOv3 paper (Siméoni et al., 2025). This qualitative evaluation shows how well the model can distinguish between object types in the image. In the visualization.ipynb notebook I evaluate how encoder size and RASA post-training affect these similarity maps; smaller models still struggle to produce good cosine similarities. See the visualization.ipynb notebook or try it yourself in Google Colab.
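The patch-similarity visualization can be sketched in a few lines: normalize the patch tokens, take the dot product with the selected query token, and reshape to the patch grid. The function name and arguments below are illustrative, not the notebook's exact code.

```python
import numpy as np

def patch_similarity_map(patch_tokens, query_idx, grid_h, grid_w):
    """Cosine similarity between one selected patch token and all patches.

    patch_tokens: (N, D) patch embeddings from a ViT encoder,
    with N = grid_h * grid_w. Returns a (grid_h, grid_w) map
    that can be upsampled and overlaid on the input image.
    """
    # L2-normalize so the dot product equals cosine similarity
    normed = patch_tokens / np.linalg.norm(patch_tokens, axis=1, keepdims=True)
    sims = normed @ normed[query_idx]
    return sims.reshape(grid_h, grid_w)
```

The query patch itself always gets similarity 1; well-separated features give high values on patches of the same object and low values elsewhere.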
Install the packages using the requirements.txt file:

```shell
# using conda
conda create --name dino python=3.11
conda activate dino
pip install -r requirements.txt
```

```shell
# Run the code; adjust ./configs/rasa.yml or the argparse flags
python main.py --exp_name "rasa_vits"
```
Pascal VOC2012
Performance on the validation set with the DINOv3 ViT-S encoder for OverClustering (Ziegler & Asano, 2022) with k = {21, 100, 300}, evaluated on 40 of the 90 validation batches (batch size 16) due to compute constraints.
k | Validation mIoU | After RASA Post-Training Validation mIoU | Δ vs Original |
---|---|---|---|
21 | 15.67% | 16.12% | +0.45% |
100 | 46.56% | 47.64% | +1.08% |
300 | 59.14% | 59.94% | +0.80% |
Performance on the validation set with the DINOv3 ViT-B encoder for OverClustering (Ziegler & Asano, 2022) with k = {21, 100, 300}, evaluated on 40 of the 90 validation batches (batch size 16) due to compute constraints.
k | Validation mIoU | After RASA Post-Training Validation mIoU | Δ vs Original |
---|---|---|---|
21 | 20.07% | 21.56% | +1.49% |
100 | 51.30% | 54.22% | +2.92% |
300 | 68.03% | 66.60% | -1.43% |
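The overclustering protocol used for these numbers can be sketched as follows: cluster the patch features into k clusters (k at or above the number of classes), assign each cluster to the ground-truth class it overlaps most (many-to-one matching), then compute mean IoU over classes. This is a simplified illustration in my own naming; the repo follows Ziegler & Asano (2022).

```python
import numpy as np

def overclustering_miou(pred_clusters, gt_labels, num_classes):
    """Many-to-one overclustering evaluation (illustrative sketch).

    pred_clusters: (N,) cluster ids, e.g. from k-means over patch features.
    gt_labels:     (N,) ground-truth class ids per patch/pixel.
    """
    # map each cluster to its majority ground-truth class
    mapped = np.empty_like(pred_clusters)
    for c in np.unique(pred_clusters):
        mask = pred_clusters == c
        mapped[mask] = np.bincount(gt_labels[mask], minlength=num_classes).argmax()
    # per-class IoU, averaged over classes present in prediction or ground truth
    ious = []
    for cls in range(num_classes):
        inter = np.sum((mapped == cls) & (gt_labels == cls))
        union = np.sum((mapped == cls) | (gt_labels == cls))
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```

With k much larger than the number of classes (e.g. k = 300 vs 21 VOC classes), many small, pure clusters map onto each class, which is why mIoU rises with k in the tables above.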
I picked the best weights based on an intermediate evaluation with OverClustering.
Venkataramanan, S., Pariza, V., Salehi, M., Knobel, L., Gidaris, S., Ramzi, E., Bursuc, A., & Asano, Y. M. (2025). Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning (No. arXiv:2507.14137). arXiv. https://doi.org/10.48550/arXiv.2507.14137
Siméoni, O., Vo, H. V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., … Bojanowski, P. (2025). DINOv3 (No. arXiv:2508.10104). arXiv. https://doi.org/10.48550/arXiv.2508.10104
Ziegler, A., & Asano, Y. M. (2022). Self-Supervised Learning of Object Parts for Semantic Segmentation (No. arXiv:2204.13101). arXiv. https://doi.org/10.48550/arXiv.2204.13101