This repository contains the code and scripts used for the experiments in the paper:
"Distilling Tool Knowledge into Language Models via Back-Translated Traces"
Please cite the paper if you use this code or data for your research.
This repository provides a collection of tools and pipelines for synthetic data generation, model finetuning, and evaluation for mathematical reasoning tasks. The codebase is structured into modular components to facilitate experimentation and reproduction of the results presented in the paper.
The codebase is based on CAMEL: https://github.com/camel-ai/camel.
A comprehensive toolkit for mathematical problem solving and dataset processing, featuring AI-powered math agents and back-translation capabilities for distilling tool knowledge into language models.
math-dataset/
├── src/
│ ├── solver_agent/ # Math problem solving agent
│ ├── back_translation/ # Back translation and reasoning enhancement
│ └── finetuning/ # Model fine-tuning utilities
├── tests/ # Integration and unit tests
├── scripts/ # Utility scripts for setup and testing
├── data/ # Dataset files and results
├── logs/ # Experiment logs and outputs
└── MATH/ # Math dataset files
- Python 3.10-3.12
- uv package manager
- Clone the repository:
  git clone <repository-url>
  cd math-dataset
- Run the setup script:
  ./scripts/setup.sh
- Activate the environment:
  source .venv/bin/activate
The pipeline follows four main stages as described in the paper:
The first step is to use the solver_agent to solve problems and generate Tool-Integrated Reasoning (TIR) traces. These traces capture the model's step-by-step reasoning process when using external tools.
cd src/solver_agent
python main.py --num 10 --dataset algebra --level 1 --model gpt-4o-mini --sympy_toolkit --code_toolkit
Example usage:
# Solve 10 algebra problems using GPT-4o-mini with toolkits
python main.py --num 10 --dataset algebra --level 1 --model gpt-4o-mini --sympy_toolkit
# Use with code execution toolkit for computational problems
python main.py --num 5 --dataset intermediate_algebra --code_toolkit --model gpt-4o-mini
# Multi-step reasoning for complex problems
python main.py --num 3 --dataset precalculus --multi_step --model gpt-4o-mini
Next, the generated TIR traces are processed by the back-translation pipeline located in src/back_translation/. This stage refines and polishes the raw traces into high-quality, human-readable solutions suitable for training.
cd src/back_translation
python back_translation_main.py
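To illustrate what this stage produces, the snippet below sketches a raw TIR trace being turned into a polished training example. The field names are hypothetical and chosen for readability; they are not the project's actual schema.

# Illustrative only: field names are hypothetical, not the project's actual schema.
raw_trace = {
    "problem": "Solve x^2 - 5x + 6 = 0.",
    "steps": [
        {"thought": "Factor the quadratic with SymPy.",
         "tool_call": "solve(x**2 - 5*x + 6, x)",
         "tool_output": "[2, 3]"},
    ],
    "final_answer": "x = 2 or x = 3",
}

# Back-translation rewrites the trace into a self-contained, human-readable
# solution suitable for supervised finetuning.
back_translated = {
    "problem": raw_trace["problem"],
    "solution": (
        "Factoring gives x^2 - 5x + 6 = (x - 2)(x - 3), "
        "so the roots are x = 2 and x = 3."
    ),
}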
With the smoothed dataset, you can finetune a language model using the modular scripts in src/finetuning/.
cd src/finetuning
python main_finetune.py \
--model "Qwen/Qwen2.5-7B-Instruct" \
--train_epochs 3 \
--rank 64 \
--cuda_device 0 \
--repo_name "my-awesome-math-model" \
--hf_token "your_hf_token_here"
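The --rank flag suggests LoRA-style adapters. The sketch below shows the kind of adapter configuration that flag likely controls, assuming the Hugging Face transformers and peft libraries; it is an illustration of the technique, not the script's actual internals.

# Minimal LoRA sketch (assumes transformers + peft); not main_finetune.py's internals.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
lora_cfg = LoraConfig(
    r=64,                      # corresponds to --rank 64
    lora_alpha=128,            # a common choice: 2 * r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()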
Finally, the solver_agent can be used again to evaluate the finetuned model's performance on standard benchmarks. In this mode, the agent solves problems without the tool-integration and back-translation pipeline to measure its final reasoning capabilities.
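For example, an evaluation run might look like the following, reusing only the flags shown above. Whether --model accepts a finetuned checkpoint name depends on the project's model configuration, so treat this as a hedged illustration with a hypothetical placeholder.

# Hedged illustration: no toolkit flags, so the agent answers on its own
cd src/solver_agent
python main.py --num 50 --dataset algebra --level 5 --model <finetuned-model-name>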
The same entry point can be reused to run multiple finetuning jobs in parallel with different hyperparameters:
# Finetune with rank 32 on GPU 0
CUDA_VISIBLE_DEVICES=0 python src/finetuning/main_finetune.py --rank 32 --repo_name model-rank32 &
# Finetune with rank 64 on GPU 1
CUDA_VISIBLE_DEVICES=1 python src/finetuning/main_finetune.py --rank 64 --repo_name model-rank64 &
wait
This approach gives you the flexibility to parallelize your workflow on any multi-GPU machine.
The core math-solving component that uses AI models with various toolkits:
- Multi-step conversation for complex problem solving
- SymPy toolkit for symbolic mathematics
- Code execution toolkit for computational problems
- Geometry toolkit for geometric problems (when available)
- Evaluation system for solution verification
Key features:
- Support for multiple AI models (OpenAI, Qwen, etc.)
- Comprehensive logging and database storage
- Configurable toolkits and solving strategies
- Real-time performance metrics
Enhances mathematical reasoning by generating explanations and verifying solutions:
- Solution enhancement with detailed explanations
- Reasoning quality assessment
- Chain-of-thought generation
- Solution verification using multiple models
Tools for training and fine-tuning models on mathematical datasets:
- Dataset preparation and preprocessing
- Training pipelines for various model architectures
- Evaluation metrics and benchmarking
- Model optimization techniques
Run the comprehensive integration test suite:
./scripts/run_tests.sh
This will test:
- Math agent initialization and problem solving
- Back translation workflow
- Component integration
- End-to-end functionality (with API key)
Create a .env file in the project root:
# OpenAI API Key (required for GPT models)
OPENAI_API_KEY=your_openai_api_key
# Optional: Other API keys for different models
MISTRAL_API_KEY=your_mistral_key
GROQ_API_KEY=your_groq_key
SAMBA_API_KEY=your_samba_key
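To confirm the keys are visible to Python, a quick check like the one below works, assuming environment variables are loaded with python-dotenv (an assumption; the project's actual loading mechanism may differ).

# Sanity check that the API key is picked up.
# Assumes python-dotenv is installed; the project may load .env differently.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
print("OpenAI key loaded:", os.getenv("OPENAI_API_KEY")[:8] + "...")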
The project supports various AI models:
- OpenAI models: gpt-4o-mini, gpt-4, gpt-3.5-turbo
- Qwen models: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-Math-7B
- Other models: Mistral, Groq, etc.
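Since the codebase builds on CAMEL, model selection presumably goes through CAMEL's model abstraction. The snippet below is a hedged sketch of that pattern, not this repository's exact configuration code.

# Hedged sketch of CAMEL-style model creation; not this repo's exact config code.
from camel.models import ModelFactory
from camel.types import ModelPlatformType, ModelType

model = ModelFactory.create(
    model_platform=ModelPlatformType.OPENAI,
    model_type=ModelType.GPT_4O_MINI,
)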
Available toolkits for enhanced problem solving:
- SymPy: Symbolic mathematics and equation solving
- Code Execution: Python code execution for computational problems
- Math: Basic arithmetic operations
- Geometry: Geometric problem solving (when available)
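To make concrete what the SymPy toolkit wraps, here is the kind of symbolic call it exposes to the agent. This is plain SymPy, not the toolkit's own interface.

# Plain SymPy calls of the kind the SymPy toolkit wraps.
import sympy as sp

x = sp.symbols("x")
print(sp.solve(sp.Eq(x**2 - 5*x + 6, 0), x))    # [2, 3]
print(sp.simplify(sp.sin(x)**2 + sp.cos(x)**2))  # 1
print(sp.integrate(sp.exp(-x), (x, 0, sp.oo)))   # 1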
The project works with various mathematical datasets:
- MATH: Competition mathematics dataset
- GSM8K: Grade school math word problems
- AIME: American Invitational Mathematics Examination
- AMC: American Mathematics Competitions
- Custom datasets: Support for custom problem formats
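As a point of reference, MATH problems are stored as JSON records with problem, level, type, and solution fields. The snippet below loads one record; the exact directory layout under MATH/ may differ in this repository, so the path is illustrative.

# Hedged example: load one problem from the MATH dataset.
# The exact layout under MATH/ may differ; the path below is illustrative.
import json
from pathlib import Path

record = json.loads(Path("MATH/train/algebra/1.json").read_text())
print(record["problem"])   # problem statement (LaTeX)
print(record["level"])     # e.g. "Level 1"
print(record["type"])      # e.g. "Algebra"
print(record["solution"])  # reference solution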
- src/solver_agent/: Core math-solving logic
- src/back_translation/: Reasoning enhancement tools
- src/finetuning/: Model training utilities
- tests/: Integration and unit tests
- scripts/: Setup and utility scripts
- New Toolkits: Add to src/solver_agent/math_solver.py (a hedged toolkit sketch follows this list)
- New Models: Configure in model initialization
- New Datasets: Extend src/solver_agent/math_loader.py
- Tests: Add integration tests in tests/
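Because the solver builds on CAMEL, a new toolkit would most likely follow CAMEL's BaseToolkit/FunctionTool pattern. The class below is a hedged sketch of that pattern with a hypothetical tool, not this repository's actual interface.

# Hedged sketch of a CAMEL-style toolkit; not this repo's actual interface.
from camel.toolkits import FunctionTool
from camel.toolkits.base import BaseToolkit


class ModularArithmeticToolkit(BaseToolkit):
    """Hypothetical toolkit exposing a single modular-arithmetic tool."""

    def power_mod(self, base: int, exponent: int, modulus: int) -> int:
        """Compute (base ** exponent) % modulus efficiently."""
        return pow(base, exponent, modulus)

    def get_tools(self) -> list[FunctionTool]:
        return [FunctionTool(self.power_mod)]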
The project follows Python best practices:
- Type hints where applicable
- Comprehensive logging
- Error handling and validation
- Modular design for extensibility
The system tracks various performance metrics:
- Accuracy: Percentage of correctly solved problems
- Tool Usage: Which toolkits were employed
- Solving Time: Time taken per problem
- Error Analysis: Types and frequencies of errors
Results are stored in:
- Database (SQLite) for structured data
- CSV files for analysis
- Log files for debugging
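For quick analysis, the CSV results can be summarized with pandas. The file path and column names below are hypothetical, so adjust them to whatever the actual result files contain.

# Hypothetical path and column names; adjust to the actual CSV schema in data/.
import pandas as pd

df = pd.read_csv("data/results.csv")
accuracy = df["correct"].mean()
avg_time = df["solving_time_s"].mean()
print(f"Accuracy: {accuracy:.1%}, average solving time: {avg_time:.1f}s")
print(df.groupby("dataset")["correct"].mean())  # per-dataset accuracy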
- Import Errors: Ensure the virtual environment is activated
- API Errors: Check API keys in the .env file
- Missing Data: Download required datasets to the MATH/ directory
- Permission Errors: Make scripts executable with chmod +x
- Check the logs in the logs/ directory
- Run tests to verify setup: ./scripts/run_tests.sh
- Review component-specific READMEs in src/*/README.md
If you find this work useful, please cite the following paper:
@inproceedings{huang2025distillingtoolknowledgelanguage,
title={Distilling Tool Knowledge into Language Models via Back-Translated Traces},
author={Xingyue Huang and Xianglong Hu and Zifeng Ding and Yuan He and Rishabh and Waleed Alzarooni and Ziyu Ye and Wendong Fan and Bailan He and Haige Bo and Changran Hu and Guohao Li},
year={2025},
booktitle={ICML 2025 Workshop on Multi-Agent Systems in the Era of Foundation Models: Opportunities, Challenges and Futures},
}
[Add your license information here]
[Add contribution guidelines here]