Hi, I have a quick question about the Vidore test set.
Looking at the preprocessing code (e.g., for vidore_arxivqa), the query is the question, and the candidate is constructed as prompt + image. As far as I can tell, the answer text for each question is never used in either training or evaluation.
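To make this concrete, here is a minimal sketch of how I read the preprocessing. The function name `build_pair`, the field names (`question`, `image`, `answer`), and the prompt string are my assumptions for illustration, not the actual code:

```python
from typing import Any, Dict


def build_pair(example: Dict[str, Any],
               prompt: str = "Represent the given image.") -> Dict[str, Any]:
    """Build one (query, candidate) retrieval pair from a raw example.

    Hypothetical sketch: field names and the prompt text are assumptions.
    """
    # The query side is just the question text.
    query = example["question"]
    # The candidate side is the prompt plus the image.
    candidate = {"text": prompt, "image": example["image"]}
    # Note that example["answer"] is never read anywhere in this function.
    return {"query": query, "candidate": candidate}


example = {
    "question": "What does the figure show?",
    "image": "fig_001.png",  # placeholder path standing in for the image
    "answer": "B",
}
pair = build_pair(example)
```

If this matches the real pipeline, the answer field is dropped before the model ever sees the example.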
I just want to double-check:
• Is it expected that the answer field is not used at all?
• If so, does the model learn only from (question → image) relevance, without ever seeing the actual answer?
I want to confirm whether this is the intended design. Thanks!