
model: nvidia/llama-nemotron-embed-vl-1b-v2 for ViDoRe #4192

Open
gabrielspmoreira wants to merge 15 commits into embeddings-benchmark:main from gabrielspmoreira:nemotron_vl_1b

Conversation

@gabrielspmoreira
Contributor

@gabrielspmoreira gabrielspmoreira commented Mar 3, 2026

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
  • I have tested the implementation works on a representative set of tasks.
  • The model is public, i.e., is available either as an API or the weights are publicly available to download

Closes #4169

gabrielspmoreira and others added 7 commits March 3, 2026 13:38
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
@Samoed Samoed added the "new model" label (Questions related to adding a new model to the benchmark) Mar 3, 2026
Member

@Samoed Samoed left a comment


Looks good. Added a minor comment.

@KennethEnevoldsen
Contributor

@gabrielspmoreira, can you take a look at @Samoed's comment? Once that is addressed, we can merge this.

Comment on lines +370 to +371
use_image_modality: bool = False,
use_text_modality: bool = True,
Member


I don't think that disabling one of the modalities by default is expected behavior for models.

Contributor


I could see it as a routing thing: when there is text + images, use just the text. But this will definitely punish the model on other tasks where the text doesn't provide enough information (I don't think that reflects the intended use).

Contributor


You can submit text-only as an experiment (docs)

Contributor Author


I see, @Samoed. I did that so that users can reproduce the results PR I submitted, which was run with these default values (use_text_modality=True, use_image_modality=False). The reason is that for ViDoRe V3, the text-only modality provides slightly higher accuracy than the image modality, and its embedding throughput is much higher, so it offers a better trade-off for this dataset. Does that make sense?

Member


Yes, that makes sense, but I think the model should use all modalities by default.


@boliu61 boliu61 Mar 10, 2026


Hi @Samoed, I think each model should use its best modality.

E.g., we implemented and evaluated Qwen3-VL-Embedding in three modalities: image only, image + text, and text only. The best one was image only (higher accuracy than image + text), so we reported image-only results and made image only the default (see #4198).

Member


I understand this, but it would create problems for users running text-only tasks. Your approach is ViDoRe-focused, but I think we should take a more general approach when adding models.

Contributor Author


@KennethEnevoldsen @Samoed, we have followed your suggestions. The default modality for Nemotron Embed VL 1B is now image + text (when available). Text-only results were submitted as an experiment.
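The resolved behavior can be sketched as a small routing helper. This is a hypothetical illustration, not this PR's actual implementation: only the flag names `use_text_modality` and `use_image_modality` come from the diff above; the function name, signature, and dict layout are assumptions.

```python
def select_modalities(texts=None, images=None,
                      use_text_modality=True, use_image_modality=True):
    """Hypothetical sketch of modality routing for a VL embedding model.

    With both flags on (the default agreed in this thread), image + text
    is used whenever both inputs are available; an experiment such as the
    text-only ViDoRe V3 run would pass use_image_modality=False.
    """
    inputs = {}
    if use_text_modality and texts is not None:
        inputs["texts"] = texts
    if use_image_modality and images is not None:
        inputs["images"] = images
    if not inputs:
        raise ValueError("at least one enabled modality must be provided")
    return inputs
```

With the defaults, `select_modalities(texts=["q"], images=["page.png"])` returns both modalities; disabling the image flag reproduces a text-only run without touching the inputs themselves.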


Labels

new model Questions related to adding a new model to the benchmark


Development

Successfully merging this pull request may close these issues.

Add model: nvidia/llama-nemotron-embed-vl-1b-v2

5 participants