Hi LatentMAS team,
First of all, thank you for sharing your amazing work, LatentMAS. I am deeply impressed by the core ideas of your project and have a keen interest in your research.
I am currently working on reproducing your experiments, but I am facing some difficulties and would like to request your guidance. I have been conducting experiments using the Qwen3-4B model in a sequential setting, but I have not been able to reproduce the reported performance across all benchmarks.
Below are the settings, results, and commands I used for MBPP+ and HumanEval+:
Settings
| Model | max_new_tokens | temperature | top_p | prompt | latent_steps | split | generate_bs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-4B | 4096 | 0.6 | 0.95 | sequential | 20 | test | 1 |
Results
| Method | MBPP+ | HumanEval+ |
| --- | --- | --- |
| Baseline | 0.727 | 0.762 |
| TextMAS | 0.761 | 0.878 |
| LatentMAS | 0.613 | 0.682 |
| LatentMAS (think) | 0.571 | 0.640 |
| LatentMAS (think + realign) | 0.558 | 0.670 |
Command
# baseline
python run.py \
--method baseline \
--model_name "Qwen/Qwen3-4B" \
--task "$task" \
--max_samples -1 \
--use_vllm \
--generate_bs 10
# TextMAS
python run.py \
--method text_mas \
--model_name "Qwen/Qwen3-4B" \
--task "$task" \
--max_samples -1 \
--prompt "sequential" \
--use_vllm \
--generate_bs 10
# LatentMAS
python run.py \
--method latent_mas \
--model_name "Qwen/Qwen3-4B" \
--task "$task" \
--max_samples -1 \
--prompt "sequential" \
--latent_steps $LATENT_STEPS \
--generate_bs 1
As shown in the table, while the Baseline and TextMAS scores come out higher than the reported results, LatentMAS scores even lower than the Baseline.
Additionally, when I swept latent_steps on MBPP+, I observed that performance decreases as latent_steps increases.
Results
| latent_steps | MBPP+ |
| --- | --- |
| 0 | 0.714 |
| 10 | 0.621 |
| 20 | 0.613 |
Furthermore, I set generate_bs to 1 for LatentMAS because I observed a significant performance drop during batched inference. Since Qwen is a decoder-only model, I tried left-padding to address this, but the issue persisted.
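For clarity, the left-padding scheme I applied follows the sketch below (a minimal illustration, not the actual run.py code; `left_pad_batch`, `pad_token_id`, and the token ids are placeholders I made up for this example). The idea is to pad on the left so that the last real token of every sequence lines up at the final position, which is what a decoder-only model expects during batched generation:

```python
def left_pad_batch(sequences, pad_token_id):
    """Pad variable-length token-id lists on the LEFT and build the
    matching attention mask (0 = padding, 1 = real token)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append([pad_token_id] * n_pad + seq)
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask

ids, mask = left_pad_batch([[5, 6, 7], [8]], pad_token_id=0)
# ids  == [[5, 6, 7], [0, 0, 8]]
# mask == [[1, 1, 1], [0, 0, 1]]
```

Even with this padding (and the corresponding attention mask) in place, batched LatentMAS results remained well below the generate_bs=1 results.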
Thank you again for sharing such an interesting piece of work. I would deeply appreciate it if you could look into these discrepancies I’ve observed.