Emergent Localization Is Not Compositional Grounding

A Diagnostic Study of dino.txt

Authors:
Matteo Bando · Nancy Kalaj · Alberto Vendramini
University of Trento — Trends and Applications of Computer Vision

Paper (PDF): link


Abstract

Vision–language models have recently demonstrated emergent localization capabilities without explicit spatial supervision. However, whether such localization reflects robust compositional grounding remains unclear. In this work, we analyze the grounding behavior of dino.txt, a text-augmented DINOv2 model, across a set of complementary benchmarks targeting different aspects of visual grounding: zero-shot localization on RefCOCO, compositional robustness via ARPGrounding, and conceptual understanding via Probe-C and Probe-B.
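To make the evaluation setup concrete, the zero-shot localization protocol described above can be sketched as a patch–text similarity argmax. This is an illustrative sketch, not dino.txt's actual implementation: the function name `localize`, the feature shapes, and the grid layout are assumptions for demonstration only.

```python
import numpy as np

def localize(patch_feats: np.ndarray, text_feat: np.ndarray, grid: int):
    """Zero-shot localization sketch: select the image patch whose feature
    is most cosine-similar to the text embedding.

    patch_feats: (grid*grid, d) patch features, one row per patch.
    text_feat:   (d,) embedding of the referring expression.
    Returns the (row, col) position of the best-matching patch.
    """
    # L2-normalize so the dot product equals cosine similarity
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sims = p @ t                 # cosine similarity per patch
    idx = int(np.argmax(sims))   # index of the best-matching patch
    return divmod(idx, grid)     # convert flat index to (row, col)

# Toy check: on a 2x2 grid, patch 3 matches the text embedding exactly
feats = np.eye(4)
print(localize(feats, feats[3], grid=2))  # -> (1, 1)
```

In a real evaluation the argmax patch (or a thresholded similarity map) would be compared against the RefCOCO ground-truth box to score a hit.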


Key Findings

  • dino.txt achieves reasonable localization on attribute-based and descriptive referring expressions, benefiting from rich DINOv2 visual features.
  • It fails systematically under subject–object role inversion, performing below random chance.
  • Errors are dominated by asymmetric failures — the model localizes relevant objects but cannot consistently assign linguistic roles.
  • Probe-C and Probe-B results show strong object–attribute binding and reduced background dependence, indicating the failure is linguistic rather than visual.

These results highlight an important distinction between emergent localization and true compositional grounding, suggesting that patch-level supervision alone is insufficient and that stronger text encoders with explicit cross-modal interaction are needed.
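The role-inversion diagnostic behind the second finding can be sketched as follows: swap subject and object in a referring expression and check whether both the original and the inverted expression are localized to their respective gold regions. This is a minimal sketch under stated assumptions; the helper `swap_roles` and the scoring convention are hypothetical, not taken from the ARPGrounding code.

```python
def swap_roles(expr: str, subj: str, obj: str) -> str:
    """Build the role-inverted expression (hypothetical helper),
    e.g. 'dog chasing cat' -> 'cat chasing dog'."""
    return expr.replace(subj, "<TMP>").replace(obj, subj).replace("<TMP>", obj)

def inversion_accuracy(preds, preds_swapped, gold, gold_swapped):
    """Fraction of pairs where BOTH the original and the role-inverted
    expression are localized to their gold regions. Randomly assigning
    the two candidate regions to the two expressions scores ~0.5."""
    ok = [p == g and ps == gs
          for p, ps, g, gs in zip(preds, preds_swapped, gold, gold_swapped)]
    return sum(ok) / len(ok)

print(swap_roles("dog chasing cat", "dog", "cat"))          # -> 'cat chasing dog'
print(inversion_accuracy([0, 1], [1, 0], [0, 1], [1, 1]))   # -> 0.5
```

A model that localizes the right objects but assigns roles at random scores near chance on this metric, which is the asymmetric failure mode the study reports.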


References

  • Jose et al. DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment. CVPR 2025.
  • Zeng et al. Investigating Compositional Challenges in Vision-Language Models for Visual Grounding. CVPR 2024.
  • Schiappa et al. Probing Conceptual Understanding of Large Visual-Language Models. CVPR 2024.
  • Oquab et al. DINOv2: Learning Robust Visual Features without Supervision. TMLR 2024.
  • Radford et al. Learning Transferable Visual Models from Natural Language Supervision. ICML 2021.

Citation

If you find this work useful, please cite us:

@misc{bando2025emergent,
  title = {Emergent Localization Is Not Compositional Grounding: A Diagnostic Study of dino.txt},
  author = {Bando, Matteo and Kalaj, Nancy and Vendramini, Alberto},
  year = {2025},
  url = {https://github.com/bandomatteo/Emergent-localization-is-not-compositional-grounding/blob/docs/Emergent_Localization_Is%20Not_Compositional_Grounding.pdf}
}

