This is the official implementation of the paper: Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
If you would like to build on top of this project, refer to sglang_soft_thinking_pkg/README.md, or review the differences from SGLang v0.4.6.post1 in sglang_soft_thinking_pkg/change_0.4.6.post1.diff.
Our implementation now includes support for Dirichlet and Gumbel-Softmax noise in Soft Thinking sampling, as detailed in the study LLMs are Single-threaded Reasoners: Demystifying the Working Mechanism of Soft Thinking. For more details, see scripts/st/qwq32b_gumble.sh.
Relevant parameters:
--add_noise_gumbel_softmax \
--gumbel_softmax_temperature 0.5 \
--add_noise_dirichlet \
--dirichlet_temperature 1.0 \
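As an illustrative sketch (not the exact paper configuration; see scripts/st/qwq32b_gumble.sh for that), the noise flags are appended to the usual Soft Thinking invocation of run_sglang_softthinking.py; remaining arguments follow the Soft Thinking command shown later in this README:
# Sketch only: Gumbel-Softmax noise on top of a Soft Thinking run
python run_sglang_softthinking.py \
--dataset "aime2024" \
--model_name "./models/Qwen/QwQ-32B" \
--enable_soft_thinking \
--add_noise_gumbel_softmax \
--gumbel_softmax_temperature 0.5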
To set up the virtual environment for SGLang Soft Thinking inference, execute each line in configure.sh:
conda create -n st python=3.11 -y && conda activate st
pip install --upgrade pip
pip install torch transformers accelerate jsonlines math_verify openai torch_memory_saver
pip install flash_attn --no-build-isolation # may take a while (~20 min); try `pip install flash_attn==2.7.3 --no-build-isolation` if you hit an undefined-symbol error
# Install SGLang (0.4.6.post1) tailored for Soft Thinking
cd sglang_soft_thinking_pkg
pip install -e "python[all]"
cd ..
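As an optional sanity check after installation, you can confirm the patched SGLang is importable (assuming the package exposes a version string; the expected value below corresponds to the bundled 0.4.6.post1 release):
python -c "import sglang; print(sglang.__version__)"  # expect 0.4.6.post1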
Some results are hard to reproduce exactly across different devices due to numerical-precision differences. We therefore recommend setting up the environment with Docker by following docker.sh:
# For Docker
cd Soft-Thinking
docker build -t soft-thinking:st-cu124-py311 .
# NVIDIA Container Toolkit is required
docker run --gpus all --ipc=host --rm -it \
-v $PWD:/workspace \
soft-thinking:st-cu124-py311 bash
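The repository is mounted at /workspace inside the container, so you can then run the scripts directly, for example:
cd /workspace
bash scripts/baseline/qwq32b.sh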
- Clone the repository:
git clone https://github.com/your-repo/soft_thinking.git
cd soft_thinking
- Set up the environment: Follow the Environment Setup instructions (Docker is recommended).
- Run a baseline test:
bash scripts/baseline/qwq32b.sh
Use your own OpenAI key in each script.
export OPENAI_API_KEY=""
We use gpt-4.1-2025-04-14 as the LLM judge.
Run the baseline script:
bash scripts/baseline/qwq32b.sh
First, download the model to the models/ directory:
python ./models/download.py --model_name "Qwen/QwQ-32B"
Then, run the baseline inference:
export OPENAI_API_KEY=""
python run_sglang_softthinking.py \
--dataset "aime2024" \
--model_name "./models/Qwen/QwQ-32B" \ # you can use Qwen/QwQ-32B without downloading to ./models
--max_topk 10 \
--max_generated_tokens 32768 \
--temperature 0.6 \
--top_p 0.95 \
--top_k 30 \
--min_p 0.0 \
--after_thinking_temperature 0.6 \
--after_thinking_top_p 0.95 \
--after_thinking_top_k 30 \
--after_thinking_min_p 0.0 \
--early_stopping_entropy_threshold 0.0 \
--early_stopping_length_threshold 256 \
--mem_fraction_static 0.8 \
--start_idx 0 \
--end_idx 100000 \
--num_gpus 8 \
--num_samples 16 \
--use_llm_judge \
--judge_model_name "gpt-4.1-2025-04-14"
Run the Soft Thinking script:
bash scripts/st/qwq32b_st_math.sh
Or directly execute:
export OPENAI_API_KEY=""
python run_sglang_softthinking.py \
--dataset "aime2024" \
--model_name "./models/Qwen/QwQ-32B" \
--max_topk 10 \
--max_generated_tokens 32768 \
--temperature 0.6 \
--top_p 0.95 \
--top_k 30 \
--min_p 0.001 \
--after_thinking_temperature 0.6 \
--after_thinking_top_p 0.95 \
--after_thinking_top_k 30 \
--after_thinking_min_p 0.0 \
--early_stopping_entropy_threshold 0.01 \
--early_stopping_length_threshold 256 \
--mem_fraction_static 0.8 \
--start_idx 0 \
--end_idx 100000 \
--num_gpus 8 \
--num_samples 1 \
--enable_soft_thinking \
--use_llm_judge \
--judge_model_name "gpt-4.1-2025-04-14"
When running coding benchmarks (HumanEval, MBPP, and LiveCodeBench), first execute the script without the --reeval flag, then run it again with --reeval to perform the evaluation. This two-pass procedure works around a multiprocessing bug.
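For example (a sketch; the dataset identifier "humaneval" is illustrative, and the remaining arguments mirror the Soft Thinking command above):
# Pass 1: generate outputs, without --reeval
python run_sglang_softthinking.py \
--dataset "humaneval" \
--model_name "./models/Qwen/QwQ-32B" \
--enable_soft_thinking
# Pass 2: rerun with --reeval to evaluate the saved generations
python run_sglang_softthinking.py \
--dataset "humaneval" \
--model_name "./models/Qwen/QwQ-32B" \
--enable_soft_thinking \
--reeval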
We have uploaded results in ./results for reproduction. We use the following hyperparameters:
- max_topk: 10
- min_p: 0.001
- early_stopping_entropy_threshold: 0.01
- early_stopping_length_threshold: 256
For optimal results on each benchmark, adjust the following hyperparameters within these ranges (a sweep sketch follows the list):
- max_topk: between 5 and 20
- min_p: between 0.0 and 0.005
- early_stopping_entropy_threshold: between 0.0 and 0.1
- early_stopping_length_threshold: between 256 and 1024
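As a sketch of how such tuning might be scripted (illustrative only; the chosen grid values are examples within the ranges above, and the remaining arguments mirror the Soft Thinking command earlier in this README):
for topk in 5 10 20; do
  for minp in 0.0 0.001 0.005; do
    python run_sglang_softthinking.py \
      --dataset "aime2024" \
      --model_name "./models/Qwen/QwQ-32B" \
      --enable_soft_thinking \
      --max_topk "$topk" \
      --min_p "$minp" \
      --early_stopping_entropy_threshold 0.01 \
      --early_stopping_length_threshold 256 \
      --temperature 0.6 --top_p 0.95 --top_k 30 \
      --num_gpus 8 --num_samples 1
  done
done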
Note:
- Results may vary across different devices even with the same hyperparameters, due to differences in computation precision. We use NVIDIA H100 GPUs for all experiments. We recommend using Docker for reproduction.
This project utilizes a modified version of the SGLang library. The licensing structure is as follows:
- Our Original Code: The code original to this project (i.e., all code outside the ./sglang_soft_thinking_pkg directory) is licensed under the MIT License. A copy of the MIT License can be found in the root LICENCE file.
- Modified SGLang: The code within the ./sglang_soft_thinking_pkg directory is a derivative work of SGLang (version 0.4.6.post1) and is therefore licensed under the Apache License 2.0. The original Apache 2.0 license is included in the ./sglang_soft_thinking_pkg/LICENSE file. We provide a changes_0.4.6.post1.diff file in that directory to show our modifications.
If you use this code or dataset, please cite our paper:
@article{zhang2025soft,
  title={Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space},
  author={Zhang, Zhen and He, Xuehai and Yan, Weixiang and Shen, Ao and Zhao, Chenyang and Wang, Shuohang and Shen, Yelong and Wang, Xin Eric},
  journal={arXiv preprint arXiv:2505.15778},
  year={2025}
}