Clarification of the zero-shot segmentation baselines in Table 16 #307

@benji2264

Hi, thank you so much for providing the models and so many experimental results in the paper!

TL;DR: Table 16 reports baseline numbers for zero-shot segmentation with PE-Core-L-336 using MaskCLIP. Given that PE-Core uses attention pooling, how do you extract the V (value) embeddings? Do you simply ignore the attention pooling layer, extract the value embeddings from the previous layer, and remove the CLS token?

My question is about Table 16, which reports the performance of DINOv3+dino.txt on multiple zero-shot benchmarks, including zero-shot segmentation. I'm assuming that the baselines in this table (e.g. CLIP, SigLIP2, PE) all use MaskCLIP, as in the original dino.txt paper.

MaskCLIP works by extracting for each patch the V (value) embeddings from the last attention layer, projecting them into the text dimension, and measuring the similarity with the prompts of the different classes.
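To make sure we are talking about the same procedure, here is a minimal NumPy sketch of the steps above. It is only an illustration under simplifying assumptions, not the code I actually ran and not the reference MaskCLIP implementation: the function name and the `w_v` / `w_proj` matrices are hypothetical placeholders, and biases, the attention output projection, and any final layer norm are left out.

```python
import numpy as np


def maskclip_segment(patch_tokens, w_v, w_proj, text_emb, grid_hw):
    """Label every patch with its most similar class prompt (MaskCLIP-style sketch).

    All argument names are hypothetical placeholders:
      patch_tokens: (N, D) inputs to the last attention layer, CLS already removed
      w_v:          (D, D) value projection of that layer
      w_proj:       (D, E) projection into the text embedding dimension
      text_emb:     (C, E) one embedding per class prompt
      grid_hw:      (H, W) patch grid with H * W == N
    """
    # Value embeddings only: the query-key attention is skipped on purpose.
    v = patch_tokens @ w_v
    # Project each patch into the text embedding space.
    feats = v @ w_proj
    # Cosine similarity = dot product of L2-normalized vectors.
    feats = feats / np.linalg.norm(feats, axis=-1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = feats @ text.T  # (N, C)
    h, w = grid_hw
    # Per-patch argmax over classes, reshaped to the patch grid.
    return logits.argmax(axis=-1).reshape(h, w)
```

The resulting low-resolution label map would then be upsampled to the image size before computing mIoU.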

When using the original CLIP-L14/336, I get scores that are very close to the ones reported in dino.txt and DINOv3+dino.txt. However, with PE-Core-L-14-336, I get 0-2 mIoU on ADE, Cityscapes, Context, Stuff and VOC. (I have checked that I can reproduce the PE-Core numbers for image/video classification and retrieval from the original paper.)

I'm now wondering if the issue could come from the interaction between MaskCLIP and the attention pooling in PE-Core. How did you extract the V embeddings with PE-Core? Did you skip the attention pooling, extract the value embeddings from the previous layer, and remove the CLS token? Is there something specific to be careful about when applying MaskCLIP-like segmentation to those models?
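To be concrete about the "remove the CLS token" step I am asking about, here is a small sketch. It rests on assumptions I would like confirmed: that the token sequence before the attention pooling has the CLS token at index 0, and that the patch grid is 336 / 14 = 24 per side (576 patches). The function name is hypothetical.

```python
import numpy as np


def split_cls_and_patches(tokens, image_size=336, patch_size=14):
    """Split a (1 + N, D) token sequence into CLS token and patch tokens.

    Assumes (unverified for PE-Core) that the CLS token sits at index 0
    and that the remaining N tokens form a square patch grid.
    """
    side = image_size // patch_size
    expected = side * side + 1
    if tokens.shape[0] != expected:
        raise ValueError(f"expected {expected} tokens, got {tokens.shape[0]}")
    cls_token, patches = tokens[0], tokens[1:]
    return cls_token, patches, (side, side)
```

If the layout is different (no CLS token, or CLS appended at the end), the patch tokens would be misaligned with the grid, which is the kind of pitfall I am trying to rule out.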

Sorry for the long question and thank you very much for your contributions to the community! 😊
