
Inquiring about reproducing LatentMAS results with Qwen3-4B (Sequential setting) #22

@wonjun-chung

Description


Hi LatentMAS team,

First of all, thank you for sharing your amazing work, LatentMAS. I am deeply impressed by the core ideas of your project and have a keen interest in your research.

I am currently working on reproducing your experiments, but I am facing some difficulties and would like to request your guidance. I have been conducting experiments using the Qwen3-4B model in a sequential setting, but I have not been able to reproduce the reported performance across all benchmarks.

Below are the settings, results, and commands I used for MBPP+ and HumanEval+:

Settings

| Model | max_new_tokens | temperature | top_p | prompt | latent_steps | split | generate_bs |
|---|---|---|---|---|---|---|---|
| Qwen3-4B | 4096 | 0.6 | 0.95 | sequential | 20 | test | 1 |

Results

| Method | MBPP+ | HumanEval+ |
|---|---|---|
| Baseline | 0.727 | 0.762 |
| TextMAS | 0.761 | 0.878 |
| LatentMAS | 0.613 | 0.682 |
| LatentMAS (think) | 0.571 | 0.640 |
| LatentMAS (think + realign) | 0.558 | 0.670 |

Command

```shell
# baseline
python run.py \
    --method baseline \
    --model_name "Qwen/Qwen3-4B" \
    --task "$task" \
    --max_samples -1 \
    --use_vllm \
    --generate_bs 10

# TextMAS
python run.py \
    --method text_mas \
    --model_name "Qwen/Qwen3-4B" \
    --task "$task" \
    --max_samples -1 \
    --prompt "sequential" \
    --use_vllm \
    --generate_bs 10

# LatentMAS
python run.py \
    --method latent_mas \
    --model_name "Qwen/Qwen3-4B" \
    --task "$task" \
    --max_samples -1 \
    --prompt "sequential" \
    --latent_steps $LATENT_STEPS \
    --generate_bs 1
```

As shown in the table, the Baseline and TextMAS scores come out higher than the reported results, while LatentMAS scores even lower than the Baseline.

Additionally, when I swept latent_steps on MBPP+, I observed performance decreasing as latent_steps increases.

Results

| latent_steps | MBPP+ |
|---|---|
| 0 | 0.714 |
| 10 | 0.621 |
| 20 | 0.613 |
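For reference, this sweep was driven by a loop like the sketch below. It only reuses the LatentMAS command shown above; `$task` is assumed to be set, and the `run.py` invocation is kept commented out since it depends on the repo environment:

```shell
# Sweep latent_steps for LatentMAS (sketch based on the command above;
# $task is assumed to be exported, e.g. pointing at MBPP+).
for steps in 0 10 20; do
    echo "running latent_steps=$steps"
    # python run.py \
    #     --method latent_mas \
    #     --model_name "Qwen/Qwen3-4B" \
    #     --task "$task" \
    #     --max_samples -1 \
    #     --prompt "sequential" \
    #     --latent_steps "$steps" \
    #     --generate_bs 1
done
```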

Furthermore, I set generate_bs to 1 for LatentMAS because I observed a significant performance drop under batch inference. To address this, I tried left-padding (since Qwen is a decoder-only model), but the issue persisted.
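For context, the left-padding I tried follows the usual recipe for batched generation with decoder-only models (with Hugging Face tokenizers this corresponds to setting `tokenizer.padding_side = "left"`). A minimal sketch of the idea in plain Python, where pad id 0 is a placeholder rather than Qwen's actual pad token id:

```python
def left_pad(batch, pad_id=0):
    """Left-pad variable-length token-id sequences to equal length.

    Decoder-only models generate from the right edge of the prompt,
    so padding must go on the left to keep the last real token
    adjacent to the newly generated tokens.
    """
    max_len = max(len(seq) for seq in batch)
    padded = [[pad_id] * (max_len - len(seq)) + seq for seq in batch]
    # Attention mask: 0 over padding, 1 over real tokens.
    masks = [[0] * (max_len - len(seq)) + [1] * len(seq) for seq in batch]
    return padded, masks

# Example: the shorter prompt is padded on the left.
padded, masks = left_pad([[5, 6, 7], [8, 9]])
# padded -> [[5, 6, 7], [0, 8, 9]]
# masks  -> [[1, 1, 1], [0, 1, 1]]
```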

Thank you again for sharing such an interesting piece of work. I would deeply appreciate it if you could look into these discrepancies I’ve observed.
