When I ran the experiment with Qwen-4B, I obtained the following results:
- baseline: 70%
- textmas: 69%
- LatentMAS: 66%
These results are markedly different from those reported in the paper:
- baseline: 47.7%
- textmas: 65.3%
- LatentMAS: 66.3%
I used exactly the parameters provided in the project’s README for my setup. The discrepancy is especially surprising for the baseline: its performance as a standalone LLM differs drastically from the paper’s figure.
To be clear, I have no intention of discrediting this work; on the contrary, I find it intriguing and insightful, which is why I am trying to reproduce the results. For reference, my experiment was run via Hugging Face (HF) on a V100S 32 GB GPU. Hardware differences can affect outcomes, but I would not expect a gap this large from that factor alone.
I would greatly appreciate any guidance or insights into what might be causing these inconsistencies.
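For reference, here is a minimal sketch of the sanity check I have been running to rule out decoding-parameter mismatches (greedy vs. sampling, temperature, max tokens are the usual culprits for large baseline gaps). The parameter values shown are placeholders, not the README’s actual settings:

```python
def diff_params(readme: dict, mine: dict) -> dict:
    """Return {name: (readme_value, my_value)} for every setting that differs."""
    keys = set(readme) | set(mine)
    return {k: (readme.get(k), mine.get(k))
            for k in sorted(keys) if readme.get(k) != mine.get(k)}

# Placeholder values -- substitute the actual README settings and the ones
# in the model's generation_config.json / my launch script.
readme = {"do_sample": True, "temperature": 0.6, "top_p": 0.95, "max_new_tokens": 4096}
mine   = {"do_sample": False, "temperature": 0.6, "top_p": 0.95, "max_new_tokens": 4096}

print(diff_params(readme, mine))  # flags the greedy-vs-sampling mismatch
```

In my runs this diff comes back empty, which is why I suspect something other than the decoding settings.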