
Inquiring about reproducing LatentMAS results with Qwen3-4B (Sequential setting) #22

@wonjun-chung

Description


Hi LatentMAS team,

First of all, thank you for sharing your amazing work, LatentMAS. I am deeply impressed by the core ideas of your project and have a keen interest in your research.

I am currently working on reproducing your experiments, but I am facing some difficulties and would like to request your guidance. I have been conducting experiments using the Qwen3-4B model in a sequential setting, but I have not been able to reproduce the reported performance across all benchmarks.

Below are the settings, results, and commands I used for MBPP+ and HumanEval+:

Settings

| Model | max_new_tokens | temperature | top_p | prompt | latent_steps | split | generate_bs |
|---|---|---|---|---|---|---|---|
| Qwen3-4B | 4096 | 0.6 | 0.95 | sequential | 20 | test | 1 |

Results

| Method | MBPP+ | HumanEval+ |
|---|---|---|
| Baseline | 0.727 | 0.762 |
| TextMAS | 0.761 | 0.878 |
| LatentMAS | 0.613 | 0.682 |
| LatentMAS (think) | 0.571 | 0.640 |
| LatentMAS (think + realign) | 0.558 | 0.670 |

Command

```shell
# baseline
python run.py \
    --method baseline \
    --model_name "Qwen/Qwen3-4B" \
    --task "$task" \
    --max_samples -1 \
    --use_vllm \
    --generate_bs 10

# TextMAS
python run.py \
    --method text_mas \
    --model_name "Qwen/Qwen3-4B" \
    --task "$task" \
    --max_samples -1 \
    --prompt "sequential" \
    --use_vllm \
    --generate_bs 10

# LatentMAS
python run.py \
    --method latent_mas \
    --model_name "Qwen/Qwen3-4B" \
    --task "$task" \
    --max_samples -1 \
    --prompt "sequential" \
    --latent_steps $LATENT_STEPS \
    --generate_bs 1
```

As shown in the table, the Baseline and TextMAS scores come out higher than the reported results, while LatentMAS scores even lower than the Baseline.

Additionally, when I swept latent_steps on MBPP+, I observed performance decreasing as latent_steps increases.

Results

| latent_steps | MBPP+ |
|---|---|
| 0 | 0.714 |
| 10 | 0.621 |
| 20 | 0.613 |
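For reference, this sweep was driven by a loop like the sketch below. It only reuses the LatentMAS command shown above; `$task` is assumed to be set, and the `run.py` invocation is kept commented out since it depends on the repo environment:

```shell
# Sweep latent_steps for LatentMAS (sketch based on the command above;
# $task is assumed to be exported, e.g. pointing at MBPP+).
for steps in 0 10 20; do
    echo "running latent_steps=$steps"
    # python run.py \
    #     --method latent_mas \
    #     --model_name "Qwen/Qwen3-4B" \
    #     --task "$task" \
    #     --max_samples -1 \
    #     --prompt "sequential" \
    #     --latent_steps "$steps" \
    #     --generate_bs 1
done
```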

Furthermore, I set generate_bs to 1 for LatentMAS because I observed a significant performance drop under batch inference. To address this, I tried left-padding (since Qwen is a decoder-only model), but the issue persisted.
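For context, the left-padding I tried follows the usual recipe for batched generation with decoder-only models (with Hugging Face tokenizers this corresponds to setting `tokenizer.padding_side = "left"`). A minimal sketch of the idea in plain Python, where pad id 0 is a placeholder rather than Qwen's actual pad token id:

```python
def left_pad(batch, pad_id=0):
    """Left-pad variable-length token-id sequences to equal length.

    Decoder-only models generate from the right edge of the prompt,
    so padding must go on the left to keep the last real token
    adjacent to the newly generated tokens.
    """
    max_len = max(len(seq) for seq in batch)
    padded = [[pad_id] * (max_len - len(seq)) + seq for seq in batch]
    # Attention mask: 0 over padding, 1 over real tokens.
    masks = [[0] * (max_len - len(seq)) + [1] * len(seq) for seq in batch]
    return padded, masks

# Example: the shorter prompt is padded on the left.
padded, masks = left_pad([[5, 6, 7], [8, 9]])
# padded -> [[5, 6, 7], [0, 8, 9]]
# masks  -> [[1, 1, 1], [0, 1, 1]]
```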

Thank you again for sharing such an interesting piece of work. I would deeply appreciate it if you could look into these discrepancies I’ve observed.
