Emergent Localization Is Not Compositional Grounding

A Diagnostic Study of dino.txt

Authors:
Matteo Bando · Nancy Kalaj · Alberto Vendramini
University of Trento — Trends and Applications of Computer Vision

Paper (PDF): link


Abstract

Vision–language models have recently demonstrated emergent localization capabilities without explicit spatial supervision. However, whether such localization reflects robust compositional grounding remains unclear. In this work, we analyze the grounding behavior of dino.txt, a text-augmented DINOv2 model, across a set of complementary benchmarks targeting different aspects of visual grounding: zero-shot localization on RefCOCO, compositional robustness via ARPGrounding, and conceptual understanding via Probe-C and Probe-B.
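To make the evaluation setup concrete, the zero-shot localization protocol described above can be sketched as a patch–text similarity argmax. This is an illustrative sketch, not dino.txt's actual implementation: the function name `localize`, the feature shapes, and the grid layout are assumptions for demonstration only.

```python
import numpy as np

def localize(patch_feats: np.ndarray, text_feat: np.ndarray, grid: int):
    """Zero-shot localization sketch: select the image patch whose feature
    is most cosine-similar to the text embedding.

    patch_feats: (grid*grid, d) patch features, one row per patch.
    text_feat:   (d,) embedding of the referring expression.
    Returns the (row, col) position of the best-matching patch.
    """
    # L2-normalize so the dot product equals cosine similarity
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sims = p @ t                 # cosine similarity per patch
    idx = int(np.argmax(sims))   # index of the best-matching patch
    return divmod(idx, grid)     # convert flat index to (row, col)

# Toy check: on a 2x2 grid, patch 3 matches the text embedding exactly
feats = np.eye(4)
print(localize(feats, feats[3], grid=2))  # -> (1, 1)
```

In a real evaluation the argmax patch (or a thresholded similarity map) would be compared against the RefCOCO ground-truth box to score a hit.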


Key Findings

  • dino.txt achieves reasonable localization on attribute-based and descriptive referring expressions, benefiting from rich DINOv2 visual features.
  • It fails systematically under subject–object role inversion, performing below random chance.
  • Errors are dominated by asymmetric failures — the model localizes relevant objects but cannot consistently assign linguistic roles.
  • Probe-C and Probe-B results show strong object–attribute binding and reduced background dependence, indicating the failure is linguistic rather than visual.

These results highlight an important distinction between emergent localization and true compositional grounding, suggesting that patch-level supervision alone is insufficient and that stronger text encoders with explicit cross-modal interaction are needed.
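The role-inversion diagnostic behind the second finding can be sketched as follows: swap subject and object in a referring expression and check whether both the original and the inverted expression are localized to their respective gold regions. This is a minimal sketch under stated assumptions; the helper `swap_roles` and the scoring convention are hypothetical, not taken from the ARPGrounding code.

```python
def swap_roles(expr: str, subj: str, obj: str) -> str:
    """Build the role-inverted expression (hypothetical helper),
    e.g. 'dog chasing cat' -> 'cat chasing dog'."""
    return expr.replace(subj, "<TMP>").replace(obj, subj).replace("<TMP>", obj)

def inversion_accuracy(preds, preds_swapped, gold, gold_swapped):
    """Fraction of pairs where BOTH the original and the role-inverted
    expression are localized to their gold regions. Randomly assigning
    the two candidate regions to the two expressions scores ~0.5."""
    ok = [p == g and ps == gs
          for p, ps, g, gs in zip(preds, preds_swapped, gold, gold_swapped)]
    return sum(ok) / len(ok)

print(swap_roles("dog chasing cat", "dog", "cat"))          # -> 'cat chasing dog'
print(inversion_accuracy([0, 1], [1, 0], [0, 1], [1, 1]))   # -> 0.5
```

A model that localizes the right objects but assigns roles at random scores near chance on this metric, which is the asymmetric failure mode the study reports.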


References

  • Jose et al. DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment. CVPR 2025.
  • Zeng et al. Investigating Compositional Challenges in Vision-Language Models for Visual Grounding. CVPR 2024.
  • Schiappa et al. Probing Conceptual Understanding of Large Visual-Language Models. CVPR 2024.
  • Oquab et al. DINOv2: Learning Robust Visual Features without Supervision. TMLR 2024.
  • Radford et al. Learning Transferable Visual Models from Natural Language Supervision. ICML 2021.

Citation

If you find this work useful, please cite us:

@misc{bando2025emergent,
  title = {Emergent Localization Is Not Compositional Grounding: A Diagnostic Study of dino.txt},
  author = {Bando, Matteo and Kalaj, Nancy and Vendramini, Alberto},
  year = {2025},
  url = {https://github.com/bandomatteo/Emergent-localization-is-not-compositional-grounding/blob/docs/Emergent_Localization_Is%20Not_Compositional_Grounding.pdf}
}

