Discrepancy in LongBench Evaluation Results #14

@smiling-k

Description

Hello, thank you for your excellent work and for open-sourcing your model.

I am trying to reproduce your LongBench evaluation results from Figure 4.

The Issue

I used longbench_pred.py with YuWangX/memoryllm-7b and 8b to get predictions.

For evaluation, I applied the logic from your metrics/qa_f1_score function (text normalization followed by token-level F1). However, my resulting F1 scores are significantly lower than those reported in your paper.
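Concretely, this is the metric logic I applied. It follows the standard LongBench qa_f1_score as far as I can tell; if the version in your metrics/ module differs in any way, that could explain the gap:

```python
import re
import string
from collections import Counter


def normalize_answer(s):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    def remove_articles(text):
        return re.sub(r"\b(a|an|the)\b", " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    return white_space_fix(remove_articles(remove_punc(s.lower())))


def qa_f1_score(prediction, ground_truth):
    """Token-level F1 between the normalized prediction and ground truth."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```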

Questions

To help with reproduction, could you please clarify:

Metric: Is qa_f1_score (with its normalization) the correct metric for the LongBench QA tasks (e.g., hotpotqa)?

Prediction Settings: Are there any crucial parameters for longbench_pred.py (like max_length or specific prompts) required to match the paper's results?

Evaluation Script: Would it be possible to share the evaluation script you used to process the outputs from longbench_pred.py?
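For context, here is roughly how I am aggregating scores from the longbench_pred.py outputs, using the qa_f1_score sketched above. The output directory and the "pred" / "answers" field names are my assumptions about the prediction format, so please let me know if your setup differs:

```python
import json
import os

# Hypothetical output directory; adjust to wherever longbench_pred.py writes predictions.
PRED_DIR = "pred/memoryllm-7b"

scores = {}
for filename in sorted(os.listdir(PRED_DIR)):
    if not filename.endswith(".jsonl"):
        continue
    dataset = filename[: -len(".jsonl")]
    total, count = 0.0, 0
    with open(os.path.join(PRED_DIR, filename)) as f:
        for line in f:
            sample = json.loads(line)
            # Assumed fields: "pred" (model output) and "answers" (list of references).
            prediction = sample["pred"]
            references = sample["answers"]
            # Take the best F1 over all reference answers.
            total += max((qa_f1_score(prediction, ref) for ref in references), default=0.0)
            count += 1
    scores[dataset] = round(100 * total / count, 2) if count else 0.0

print(json.dumps(scores, indent=2))
```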

This would be extremely helpful for the community. Thank you!
