Hello, I carefully followed the instructions in the repo and used the gpt-4o model to reproduce the MMsearch-engine method. The results for the requirement, rerank, and summary metrics were all slightly lower than those reported in the paper, but the end2end result differed from the paper's by a large margin. What might be the reason for this discrepancy?
