Result about MedQA using Qwen-4B #29

@Linyi617

When I ran the experiment with Qwen-4B, I obtained the following results:

  • baseline: 70%
  • textmas: 69%
  • LatentMAS: 66%

These results are markedly different from those reported in the paper:

  • baseline: 47.7%
  • textmas: 65.3%
  • LatentMAS: 66.3%

I strictly followed the parameters provided in the project's README for my setup. The discrepancy is particularly surprising for the baseline model: its performance as a standalone LLM differs drastically from the paper's findings.

To clarify, I have no intention of discrediting this work; on the contrary, I find it intriguing and insightful, which is why I am attempting to reproduce the results. For reference, my experiment was run via Hugging Face (HF) on a V100S 32 GB GPU. While hardware differences can affect outcomes, I would not expect such a significant gap from this factor alone.

I would greatly appreciate any guidance or insights into what might be causing these inconsistencies.
