When I ran the experiment with Qwen-4B, I obtained the following results:
- baseline: 70%
- textmas: 69%
- LatentMAS: 66%
These results are markedly different from those reported in the paper:
- baseline: 47.7%
- textmas: 65.3%
- LatentMAS: 66.3%
I used exactly the parameters provided in the project’s README for my setup. The discrepancy is especially surprising for the baseline: its performance as a standalone LLM differs drastically from the paper’s figure.
To be clear, I have no intention of discrediting this work; on the contrary, I find it intriguing and insightful, which is why I am trying to reproduce the results. For reference, my experiment was run via Hugging Face (HF) on a V100S 32 GB GPU. Hardware differences can affect outcomes, but I would not expect a gap this large from that factor alone.
I would greatly appreciate any guidance or insights into what might be causing these inconsistencies.
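For reference, here is a minimal sketch of the sanity check I have been running to rule out decoding-parameter mismatches (greedy vs. sampling, temperature, max tokens are the usual culprits for large baseline gaps). The parameter values shown are placeholders, not the README’s actual settings:

```python
def diff_params(readme: dict, mine: dict) -> dict:
    """Return {name: (readme_value, my_value)} for every setting that differs."""
    keys = set(readme) | set(mine)
    return {k: (readme.get(k), mine.get(k))
            for k in sorted(keys) if readme.get(k) != mine.get(k)}

# Placeholder values -- substitute the actual README settings and the ones
# in the model's generation_config.json / my launch script.
readme = {"do_sample": True, "temperature": 0.6, "top_p": 0.95, "max_new_tokens": 4096}
mine   = {"do_sample": False, "temperature": 0.6, "top_p": 0.95, "max_new_tokens": 4096}

print(diff_params(readme, mine))  # flags the greedy-vs-sampling mismatch
```

In my runs this diff comes back empty, which is why I suspect something other than the decoding settings.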