Reproducibility Question #119

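Thank you for your work and contributions. I ran into some problems while reproducing the LongBench-V2 results.

The model is Qwen2.5-7B-Instruct + YaRN + CoT.

1. Running your script as-is, without changing the config file, I get Short 40.0 and Medium 27.9, which is noticeably below the results in the paper (43.9, 32.6).
2. With single-pass CoT (that is, no second LLM call to extract the answer from the CoT; the prompt is changed to ask for the final answer in one pass), the Short score is 40.4. A different single-pass CoT prompt gives 38.7.
3. Without CoT, I get Short 45.5, which differs considerably from the 40.6 in the paper (fluctuation is perhaps easier to understand when the model directly outputs a one-token answer).

Are fluctuations of this size plausible? Did you observe similar fluctuations in your own experiments, and did you try different prompts (which may have a large effect on small models)?

Thanks!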
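For context on how large these gaps are, here is a rough sketch that computes the binomial standard error of an accuracy score and expresses each gap in units of it. The subset sizes (180 Short and 215 Medium questions) are my assumption about the benchmark split and should be checked against the dataset. The function names are only for illustration.

```python
import math


def accuracy_std_error(acc_percent: float, n_questions: int) -> float:
    """Binomial standard error, in percentage points, of an accuracy
    measured on n_questions independent multiple-choice questions."""
    p = acc_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


def gap_in_std_errors(reported: float, reproduced: float, n_questions: int) -> float:
    """Absolute gap between two scores, in units of the reported score's standard error."""
    return abs(reported - reproduced) / accuracy_std_error(reported, n_questions)


# ASSUMPTION: subset sizes are not stated in this issue; verify before relying on them.
N_SHORT = 180
N_MEDIUM = 215

# (setting, score in the paper, score I reproduced, assumed subset size)
runs = [
    ("CoT, Short", 43.9, 40.0, N_SHORT),
    ("CoT, Medium", 32.6, 27.9, N_MEDIUM),
    ("no CoT, Short", 40.6, 45.5, N_SHORT),
]

for name, reported, reproduced, n in runs:
    se = accuracy_std_error(reported, n)
    ratio = gap_in_std_errors(reported, reproduced, n)
    print(f"{name}: gap {abs(reported - reproduced):.1f} pts, SE {se:.1f} pts, ratio {ratio:.2f}")
```

This is only a yardstick for how much a score on a subset of that size can move. It does not model run-to-run noise on a fixed question set, so it cannot by itself say whether the differences come from decoding settings, the prompt, or the environment.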

The single-pass CoT prompt is as follows:
```
Please read the following text and answer the question below.

<text>
$DOC$
</text>

What is the correct answer to this question: $Q$
Choices:
(A) $C_A$
(B) $C_B$
(C) $C_C$
(D) $C_D$

Please reason step by step. Format your final answer as follows: "The correct answer is (insert answer here)".
```
