Discrepancy in LongBench Evaluation Results #14

@smiling-k

Description

Hello, thank you for your excellent work and for open-sourcing your model.

I am trying to reproduce your LongBench evaluation results from Figure 4.

The Issue

I used longbench_pred.py with YuWangX/memoryllm-7b and 8b to get predictions.

For evaluation, I applied the logic from your metrics/qa_f1_score function (text normalization followed by token-level F1). However, my resulting F1 scores are significantly lower than those reported in your paper.
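Concretely, this is the metric logic I applied. It follows the standard LongBench qa_f1_score as far as I can tell; if the version in your metrics/ module differs in any way, that could explain the gap:

```python
import re
import string
from collections import Counter


def normalize_answer(s):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    def remove_articles(text):
        return re.sub(r"\b(a|an|the)\b", " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    return white_space_fix(remove_articles(remove_punc(s.lower())))


def qa_f1_score(prediction, ground_truth):
    """Token-level F1 between the normalized prediction and ground truth."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```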

Questions

To help with reproduction, could you please clarify:

Metric: Is qa_f1_score (with its normalization) the correct metric for the LongBench QA tasks (e.g., hotpotqa)?

Prediction Settings: Are there any crucial parameters for longbench_pred.py (like max_length or specific prompts) required to match the paper's results?

Evaluation Script: Would it be possible to share the evaluation script you used to process the outputs from longbench_pred.py?
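For context, here is roughly how I am aggregating scores from the longbench_pred.py outputs, using the qa_f1_score sketched above. The output directory and the "pred" / "answers" field names are my assumptions about the prediction format, so please let me know if your setup differs:

```python
import json
import os

# Hypothetical output directory; adjust to wherever longbench_pred.py writes predictions.
PRED_DIR = "pred/memoryllm-7b"

scores = {}
for filename in sorted(os.listdir(PRED_DIR)):
    if not filename.endswith(".jsonl"):
        continue
    dataset = filename[: -len(".jsonl")]
    total, count = 0.0, 0
    with open(os.path.join(PRED_DIR, filename)) as f:
        for line in f:
            sample = json.loads(line)
            # Assumed fields: "pred" (model output) and "answers" (list of references).
            prediction = sample["pred"]
            references = sample["answers"]
            # Take the best F1 over all reference answers.
            total += max((qa_f1_score(prediction, ref) for ref in references), default=0.0)
            count += 1
    scores[dataset] = round(100 * total / count, 2) if count else 0.0

print(json.dumps(scores, indent=2))
```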

This would be extremely helpful for the community. Thank you!
