Differences in MMsearch engine result reproducibility metrics #7

Hello, I carefully followed the instructions in the repo and used the gpt-4o model to reproduce the MMsearch-engine method. My results for the requery, rerank, and summary metrics were all slightly lower than those reported in the paper, but the end2end result differed from the paper by a large margin. What might be causing this gap?
<img width="2075" height="389" alt="Image" src="https://github.com/user-attachments/assets/e25f404c-8d4c-47af-881b-f993527e6901" />
