model: nvidia/llama-nemotron-embed-vl-1b-v2 for ViDoRe #4192
gabrielspmoreira wants to merge 15 commits into embeddings-benchmark:main
Conversation
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Samoed left a comment:
Looks good. Added a minor comment.
@gabrielspmoreira can you take a look at @Samoed's comment? Once that is addressed we can merge this.
use_image_modality: bool = False,
use_text_modality: bool = True,
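To make the discussion concrete, here is a minimal sketch (not the PR's actual implementation) of how boolean modality flags like these could route which inputs get encoded. The function name `encode_batch`, the string "embeddings", and the combine step are all hypothetical placeholders for real encoder calls:

```python
# Hypothetical sketch: modality flags decide which inputs contribute
# to each item's embedding. Real encoders are replaced by string tags.
def encode_batch(texts, images, use_text_modality=True, use_image_modality=False):
    """Return one placeholder 'embedding' per item from the enabled modalities."""
    if not (use_text_modality or use_image_modality):
        raise ValueError("At least one modality must be enabled")
    embeddings = []
    for text, image in zip(texts, images):
        parts = []
        if use_text_modality and text is not None:
            parts.append(f"text:{text}")   # stand-in for a text embedding
        if use_image_modality and image is not None:
            parts.append(f"img:{image}")   # stand-in for an image embedding
        embeddings.append("+".join(parts))
    return embeddings
```

With the PR's original defaults (`use_text_modality=True`, `use_image_modality=False`), only the text branch runs, which is the behavior the reviewers are questioning; enabling both flags fuses the modalities per item.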
I don't think that disabling one of the modalities by default is the expected behavior for models.
I could see it as a routing thing: when there is text + images, use just the text. But this will definitely punish the model on other tasks where the text doesn't provide enough information (I don't think that reflects the intended use).
You can submit text-only as an experiment (docs)
I see @Samoed. I did that so that users can reproduce the results PR I submitted, which was run with these default values (use_text_modality = True, use_image_modality = False). The reason is that for ViDoRe V3, the text-only modality provides slightly higher accuracy than the image modality, and the embedding throughput is much higher, providing a better trade-off for this dataset. Does that make sense?
Yes, that makes sense, but I think the model should use all modalities by default.
Hi @Samoed, I think each model should use its best modality.
E.g. we implemented and evaluated Qwen3-VL-Embedding in three modalities: image only, image + text, and text only. The best one is image only (higher accuracy than image + text), so we reported image-only results and made image only the default (see #4198).
I understand this, but this would create problems for users who run text-only tasks. Your approach is ViDoRe-focused, but I think we should use a more general approach when we're adding models.
@KennethEnevoldsen @Samoed we have followed your suggestions. The default modality for Nemotron Embed VL 1B is image + text (when available). Results with text-only were submitted as an experiment.
mteb.get_model(model_name, revision) and mteb.get_model_meta(model_name, revision)

Closes #4169