Just to confirm once more: does "the average token length for answering one question" cover only the answering stage of the evaluation, or is the memory-related part also amortized into it?

<img width="1271" height="730" alt="Image" src="https://github.com/user-attachments/assets/a41a84de-d66b-47ee-bf76-adc7361d1c05" />