Description
Hello, thank you for your excellent work and for open-sourcing your model.
I am trying to reproduce your LongBench evaluation results from Figure 4.
The Issue
I used longbench_pred.py with YuWangX/memoryllm-7b and 8b to get predictions.
For evaluation, I applied the logic from your metrics/qa_f1_score function (text normalization plus token-level F1; a sketch of my implementation is included below). However, my resulting F1 scores are significantly lower than those reported in the paper.
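For reference, here is a minimal sketch of the scoring I applied, modeled on the standard SQuAD/LongBench-style qa_f1_score (lowercasing, punctuation and article removal, whitespace collapsing, then token-level F1). The exact normalization in your metrics module may differ, so please point out anything that deviates from what you used:

```python
# Minimal sketch of my scoring, modeled on the standard SQuAD/LongBench-style
# qa_f1_score; the normalization in the repo's metrics module may differ.
import re
import string
from collections import Counter


def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def qa_f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```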
Questions
To help with reproduction, could you please clarify:
Metric: Is qa_f1_score (with its normalization) the correct metric for the LongBench QA tasks (e.g., hotpotqa)?
Prediction Settings: Are there any crucial parameters for longbench_pred.py (like max_length or specific prompts) required to match the paper's results?
Evaluation Script: Would it be possible to share the evaluation script you used to process the outputs from longbench_pred.py? (For reference, my current scoring loop is sketched after this list.)
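In the meantime, this is roughly how I am aggregating scores over the longbench_pred.py outputs, reusing the qa_f1_score sketch above. The file path and the "pred"/"answers" field names are assumptions about the output format on my side, so please correct me if they differ from what your script emits:

```python
# Hypothetical scoring loop over a jsonl prediction file from longbench_pred.py.
# Assumes each line has a "pred" string and an "answers" list of references;
# these field names and the path below are assumptions, not the repo's spec.
import json


def score_file(path: str) -> float:
    total, count = 0.0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            prediction = record["pred"]
            # Take the best F1 over all reference answers for this example.
            best = max(qa_f1_score(prediction, gt) for gt in record["answers"])
            total += best
            count += 1
    return 100.0 * total / count if count else 0.0


# Example path (assumed layout on my side).
print(score_file("pred/memoryllm-7b/hotpotqa.jsonl"))
```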
This would be extremely helpful for the community. Thank you!