Authors:
Matteo Bando · Nancy Kalaj · Alberto Vendramini
University of Trento — Trends and Applications of Computer Vision
Paper (PDF): link
Vision–language models have recently demonstrated emergent localization capabilities without explicit spatial supervision. However, whether such localization reflects robust compositional grounding remains unclear. In this work, we analyze the grounding behavior of dino.txt, a text-augmented DINOv2 model, across a set of complementary benchmarks targeting different aspects of visual grounding: zero-shot localization on RefCOCO, compositional robustness via ARPGrounding, and conceptual understanding via Probe-C and Probe-B.
- dino.txt achieves reasonable localization on attribute-based and descriptive referring expressions, benefiting from rich DINOv2 visual features.
- It exhibits systematic failure under subject–object role inversion, performing worse than random chance.
- Errors are dominated by asymmetric failures: the model localizes the relevant objects but cannot consistently assign their linguistic roles.
- Probe-C and Probe-B results show strong object–attribute binding and reduced background dependence, indicating the failure is linguistic rather than visual.
These results highlight an important distinction between emergent localization and true compositional grounding, suggesting that stronger text encoders with explicit cross-modal interaction are required beyond patch-level supervision.
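To make the role-inversion finding more concrete, here is a minimal sketch of how paired expressions could be scored. It is an illustration under stated assumptions, not the evaluation code used in this work: it assumes a pointing-game-style criterion (the highest-scoring patch of a text-to-patch similarity map must fall inside the ground-truth box) and reads "asymmetric failure" as a pair in which exactly one of the two role-swapped expressions is localized correctly. All function names and the toy similarity maps are hypothetical.

```python
from collections import Counter


def argmax_patch(sim_map):
    """Return (row, col) of the highest-scoring patch in a 2D list of similarities."""
    return max(
        ((r, c) for r, row in enumerate(sim_map) for c in range(len(row))),
        key=lambda rc: sim_map[rc[0]][rc[1]],
    )


def hits_box(point, box):
    """True if (row, col) lies inside box = (row_min, col_min, row_max, col_max), inclusive."""
    r, c = point
    r0, c0, r1, c1 = box
    return r0 <= r <= r1 and c0 <= c <= c1


def pair_outcome(hit_a, hit_b):
    """Classify a role-swapped pair by how many of its two expressions were localized."""
    if hit_a and hit_b:
        return "both_correct"
    if hit_a or hit_b:
        return "asymmetric"  # assumed reading: exactly one side of the pair is right
    return "both_wrong"


def tally_pairs(pairs):
    """pairs: iterable of ((sim_map_a, box_a), (sim_map_b, box_b)) in patch-grid coordinates."""
    counts = Counter()
    for (sim_a, box_a), (sim_b, box_b) in pairs:
        hit_a = hits_box(argmax_patch(sim_a), box_a)
        hit_b = hits_box(argmax_patch(sim_b), box_b)
        counts[pair_outcome(hit_a, hit_b)] += 1
    return counts


# Toy 2x2 patch grid (made-up numbers): the similarity map peaks on the same
# patch for both expressions, although the target moves from the top-left
# patch to the bottom-right patch when subject and object are swapped.
toy_pairs = [
    (
        ([[0.9, 0.1], [0.2, 0.3]], (0, 0, 0, 0)),  # e.g. "the dog chasing the cat"
        ([[0.8, 0.2], [0.1, 0.3]], (1, 1, 1, 1)),  # e.g. "the cat chasing the dog"
    ),
]
counts = tally_pairs(toy_pairs)
print(dict(counts))
```

In this toy pair the prediction does not move when the roles are swapped, so one expression is counted as a hit and the other as a miss. That is the pattern the asymmetric-failure category is meant to capture under the assumptions above.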
- Jose et al. DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment. CVPR 2025.
- Zeng et al. Investigating Compositional Challenges in Vision-Language Models for Visual Grounding. CVPR 2024.
- Schiappa et al. Probing Conceptual Understanding of Large Visual-Language Models. CVPR 2024.
- Oquab et al. DINOv2: Learning Robust Visual Features without Supervision. TMLR 2024.
- Radford et al. Learning Transferable Visual Models from Natural Language Supervision. ICML 2021.
If you find this work useful, please cite us:
@misc{bando2025emergent,
title = {Emergent Localization Is Not Compositional Grounding: A Diagnostic Study of dino.txt},
author = {Bando, Matteo and Kalaj, Nancy and Vendramini, Alberto},
year = {2025},
url = {https://github.com/bandomatteo/Emergent-localization-is-not-compositional-grounding/blob/docs/Emergent_Localization_Is%20Not_Compositional_Grounding.pdf}
}