ThinkPilot Framework Overview

ThinkPilot: Steering Reasoning Models via Automated Think-prefixes Optimization

A training-free framework for Large Reasoning Models (LRMs) that automatically optimizes their reasoning process.

📖 Overview

Despite significant advancements in Large Reasoning Models (LRMs), they commonly suffer from inefficient reasoning and goal misalignment. Existing training-free methods either rely on rigid heuristics or provide only descriptive analysis, lacking effective automated guidance. This raises a core question: how can we automatically discover and steer the model towards a more efficient and task-aligned reasoning process?

In this project, we introduce ThinkPilot, a plug-and-play, training-free framework that automatically optimizes "think-prefixes" to guide the reasoning process of LRMs through a workflow inspired by evolutionary algorithms.

The core innovations of ThinkPilot include:

🧬 Evolutionary Prefix Optimization: Based on a taxonomy of reasoning behaviors, ThinkPilot uses an evolutionary process of selection, crossover, and mutation to automatically discover and iteratively optimize instructional prefixes that guide the model to achieve optimal performance (a conceptual sketch of this loop follows this list).

🧭 Precise Control over Reasoning Behaviors: Experiments show that think-prefixes can reliably control the model's reasoning behaviors (such as planning and reflection), and ThinkPilot can automatically identify and activate the preferred combination of behaviors for a task to maximize performance.

🚀 Comprehensive and Significant Performance Improvements: ThinkPilot has demonstrated exceptional results across multiple tasks, including improving reasoning efficiency, significantly enhancing safety (reducing the generation rate of unsafe content from 27.0% to 0.7%), and improving instruction-following capabilities.
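To make the evolutionary loop from the first point concrete, here is a minimal sketch in Python of a selection-crossover-mutation cycle over think-prefixes. It is purely illustrative: evolve_prefixes, evaluate, crossover, and mutate are hypothetical names rather than functions from this repository, whose actual loop is driven by pipeline/run.sh.

import random

def evolve_prefixes(seed_prefixes, evaluate, crossover, mutate,
                    num_loops=2, max_eval_num=8, topn=2):
    """Illustrative sketch of prefix evolution: score, select, breed.

    evaluate, crossover, and mutate are placeholders for benchmark scoring
    and LLM-driven prefix generation; they are not ThinkPilot's actual APIs.
    """
    population = list(seed_prefixes)
    for _ in range(num_loops):
        # Sample a limited number of candidate prefixes and score each one.
        candidates = random.sample(population, min(max_eval_num, len(population)))
        ranked = sorted(candidates, key=evaluate, reverse=True)
        # Selection: only the top-N prefixes survive into the next round.
        parents = ranked[:topn]
        # Crossover and mutation: an LLM combines and rewrites parent prefixes.
        children = [mutate(crossover(a, b)) for a in parents for b in parents if a is not b]
        population = parents + children
    # Return the best prefix found in the final population.
    return max(population, key=evaluate)

The num_loops, max_eval_num, and topn arguments mirror the --num-loops, --max-eval-num, and --topn options documented in the "Iterative Optimization" step below.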

ThinkPilot Performance

ThinkPilot's effect on AIME 24: accuracy increases by 3.4% and token consumption drops by 20.6% after just two iterations.

🚀 Getting Started

🔔 If you want to directly use the think-prefixes provided in the paper, skip Step 4: Iterative Optimization.

1. Installation

Clone the repository and install the dependencies:

git clone https://github.com/teqkilla/ThinkPilot.git
cd ThinkPilot
pip install -e .

2. API Deployment

You need to deploy the reasoning model as an API service, plus a Judge model and a mutation model if your workflow requires them (see the note at the end of this step). We provide deployment scripts based on vLLM.

Use the pipeline/vllm.sh script to start the service, configuring it with command-line arguments:

bash pipeline/vllm.sh \
  --model-path /path/to/your/reasoning-model \
  --port 8012 \
  --cuda-devices "0,1"

Deployment Parameter Details:

Parameter Description
--model-path PATH [Required] Path to the model to be deployed.
--port PORT API service port.
--cuda-devices DEVICES GPU IDs to use, separated by commas (e.g., "0,1,2").

You will need to deploy separate API services for the reasoning model, the judge model, and the mutation model.
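For example, assuming the placeholder model paths below and the ports used in the later examples (8012 for the reasoning model, 8014 for the Judge model, 8015 for the mutation model), the three services could be started like this; the port and GPU assignments are only an illustration:

bash pipeline/vllm.sh --model-path /path/to/your/reasoning-model --port 8012 --cuda-devices "0,1"
bash pipeline/vllm.sh --model-path /path/to/your/judge-model --port 8014 --cuda-devices "2"
bash pipeline/vllm.sh --model-path /path/to/your/mutation-model --port 8015 --cuda-devices "3"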

3. Using Pre-built Think-Prefixes (Optional)

To help you reproduce the results from the paper or to directly evaluate performance, we have provided all the think-prefixes used in the paper's experiments in the /paper_prefixes directory.

If you would rather use these optimized prefixes than start the iteration from scratch, you can skip the next "Iterative Optimization" step and proceed directly to "Direct Evaluation", specifying the corresponding file at runtime with the --prefix-path parameter.

4. Iterative Optimization

This is the core feature of ThinkPilot, which automatically finds the optimal think-prefix through an evolutionary algorithm. Use the pipeline/run.sh script to start the iterative process.

Runnable Example Command:

bash pipeline/run.sh \
  --tokenizer /path/to/your/tokenizer \
  --enabled math500,ifeval \
  --save-dir results/my_iteration \
  --api-url http://localhost:8012/v1 \
  --judge-api http://localhost:8014/v1 \
  --num-loops 2 \
  --api-url-template http://localhost:8015/v1 \
  --prefix-path pipeline/extract.txt

Parameter Details:

Parameter Description
--tokenizer PATH [Required] Directory path for the tokenizer corresponding to the evaluation model.
--enabled BENCHMARKS Benchmarks to be evaluated, separated by commas (e.g., math500,ifeval).
--save-dir PATH Directory to save the iteration process and results.
--api-url URL API address of the model to be evaluated.
--judge-api URL API address of the Judge LLM used for scoring.
--prefix-path PATH File path for the initial seed prefixes.
--num-loops NUM Total number of optimization iterations.
--max-eval-num NUM Number of prefixes to sample in each iteration.
--api-url-template URL Model API used to generate new templates (for crossover and mutation).
--topn NUM At the end of each round, select the top N results from each benchmark for the next evolution.
--max-tokens NUM Maximum number of tokens for the reasoning model to generate.
--temperature TEMP Generation temperature for the reasoning model.
--max-concurrent NUM Number of concurrent inference requests.
-h, --help Display help information.

5. Direct Evaluation

If you have already found satisfactory think-prefixes through iteration, or want to use the pre-built prefixes we provide to verify the model's performance, you can use the pipeline/run_eval.sh script.

Runnable Example Command:

bash pipeline/run_eval.sh \
  --tokenizer /path/to/your/tokenizer \
  --enabled aime2024,gpqa \
  --save-dir results/eval_with_paper_prefix \
  --api-url http://localhost:8012/v1 \
  --prefix-path pipeline/extract.txt \
  --num-runs 3

Parameter Details:

Parameter Description
--tokenizer PATH [Required] Directory path for the tokenizer corresponding to the evaluation model.
--enabled BENCHMARKS Benchmarks to be evaluated, separated by commas.
--save-dir PATH Directory to save the evaluation results.
--api-url URL API address of the model to be evaluated.
--prefix-path PATH File path for the think-prefix to be evaluated.
--num-runs NUM Number of times to repeat the evaluation. If greater than 1, results will be saved in subdirectories with a _runX suffix.
--judge-api URL API address of the Judge LLM used for scoring.
--max-tokens NUM Maximum number of tokens for the reasoning model to generate.
--temperature TEMP Generation temperature for the reasoning model.
--max-concurrent NUM Number of concurrent inference requests.
-h, --help Display help information.

6. (Optional) Post-evaluation: Calculate Output Length

After the evaluation is complete, you can use the pipeline/length_token.py script to calculate the token length of the model's output.

Configuration and Execution

  1. Open the pipeline/length_token.py file and configure internal variables, such as base_dir and local_tokenizer_dir.
  2. Run the script from the project root directory:
python pipeline/length_token.py

The results will be saved in the output directory you specified.
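As a rough illustration of what this step computes, the sketch below counts output tokens with a Hugging Face tokenizer. It is not the repository's script: the result-file layout and the "output" field name are assumptions made for this example, and only base_dir and local_tokenizer_dir mirror the variables mentioned above.

import json
from pathlib import Path

from transformers import AutoTokenizer

base_dir = Path("results/eval_with_paper_prefix")  # evaluation outputs (assumed layout)
local_tokenizer_dir = "/path/to/your/tokenizer"    # same tokenizer used for evaluation

tokenizer = AutoTokenizer.from_pretrained(local_tokenizer_dir)

lengths = []
for result_file in base_dir.rglob("*.json"):
    records = json.loads(result_file.read_text())
    # Assume each file holds a list of records with an "output" field for the model's text.
    for record in (records if isinstance(records, list) else [records]):
        text = record.get("output", "")
        lengths.append(len(tokenizer.encode(text)))

if lengths:
    print(f"{len(lengths)} samples, average output length: {sum(lengths) / len(lengths):.1f} tokens")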

📋 Experimental Prefixes

To ensure full transparency, we have released the think-prefixes corresponding to all experimental data points in the paper. You can find them in the /paper_prefixes directory, organized according to the tables and figures in the paper.

A Note on Reproducibility: When using these prefixes, please be aware that performance scores may vary slightly from those reported in the paper. This is an expected characteristic when working with large language models, due to their inherent stochasticity and sensitivity to different environments. While exact numerical replication can be challenging, the main scientific conclusions, performance trends, and the relative advantages of our method should remain consistent.

✍️ Citation

If you use our work in your research, please cite the following paper:

@article{li2025thinkpilot,
  title={ThinkPilot: Steering Reasoning Models via Automated Think-prefixes Optimization},
  author={Sunzhu Li and Zhiyu Lin and Shuling Yang and Jiale Zhao and Wei Chen},
  journal={arXiv preprint arXiv:2510.12063},
  year={2025}
}
