This is the official repository for "When More is Less: Understanding Chain-of-Thought Length in LLMs" (ICLR 2026).
This repository contains two main components:
- Synthetic Experiments: Training small transformer models on arithmetic and dynamic programming tasks to study optimal CoT length in controlled settings
- Real-world Analysis: Analyzing CoT length patterns in state-of-the-art LLMs on the MATH500 and WinoGrande datasets (the code can be adapted to other datasets)
The synthetic training component is inspired by karpathy/nanoGPT.
cot-length/
├── synthetic/ # Synthetic experiments
│ ├── dataset/ # Dataset generation
│ │ ├── gen_arith_data.py # Arithmetic dataset generator
│ │ ├── gen_dp_data.py # DP dataset generator
│ │ └── gen_*_test_data.py # Test data generators
│ ├── model/ # Model architectures
│ │ ├── vanilla_gpt2.py # Standard GPT-2 implementation
│ │ └── looped_gpt2.py # Looped transformer variant
│ ├── scripts/ # Training/evaluation scripts
│ │ ├── run_train.sh # Training script
│ │ └── run_eval.sh # Evaluation script
│ ├── train.py # Main training script
│ ├── eval.py # Main evaluation script
│ └── tokenizor.py # Tokenization utilities
├── real/ # Real-world analysis
│ ├── run_math500.py # MATH500 sample generation
│ ├── run_winogrande.py # WinoGrande sample generation
│ ├── eval_*_cot_length.py # CoT length analysis
│ └── eval_*_task_difficulty.py # Task difficulty analysis
└── README.md
The synthetic experiments train small transformer models to understand optimal CoT length patterns on two algorithmic tasks:
- Arithmetic Dataset: Addition problems with step-by-step solutions.
- Dynamic Programming Dataset: The Maximum Path Sum in a Number Triangle problem with bottom-up DP solutions.
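For readers unfamiliar with the second task, the following is a minimal sketch of a bottom-up DP for the maximum path sum in a number triangle. It illustrates the algorithm only: the function name `max_path_sum` and the example triangle are assumptions made for this sketch, and it does not reproduce the format emitted by `gen_dp_data.py`.

```python
from typing import List


def max_path_sum(triangle: List[List[int]]) -> int:
    """Return the largest top-to-bottom path sum in a number triangle."""
    if not triangle:
        raise ValueError("triangle must have at least one row")
    # best[j] is the best sum of any downward path starting at column j
    # of the row currently being processed (initially the bottom row).
    best = list(triangle[-1])
    # Walk upward from the second-to-last row, folding each row into `best`.
    for row in reversed(triangle[:-1]):
        best = [value + max(best[j], best[j + 1]) for j, value in enumerate(row)]
    return best[0]


# Hypothetical example triangle (not taken from the repository's datasets).
example = [
    [3],
    [7, 4],
    [2, 4, 6],
    [8, 5, 9, 3],
]
print(max_path_sum(example))  # 23, via the path 3 -> 7 -> 4 -> 9
```

Each upward pass over a row is a natural unit of intermediate work, which is why a bottom-up solution lends itself to being written out as a step-by-step chain of thought.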
Choose one of the two available datasets:
# Generate arithmetic dataset
python3 -m synthetic.dataset.gen_arith_data
# OR generate dynamic programming dataset
python3 -m synthetic.dataset.gen_dp_data
# Generate test data for the chosen dataset
python3 -m synthetic.dataset.gen_arith_test_data
# OR
python3 -m synthetic.dataset.gen_dp_test_data
Use the training script to train a transformer model:
# Basic training command
python3 synthetic/train.py --model_size=6 --device='cuda' --iter=25000 --T=80 --t=12
# Or use the provided script
bash synthetic/scripts/run_train.sh
Training Parameters:
- --model_size: Model size parameter (controls model dimensions)
- --device: Training device ('cuda', 'mps', or 'cpu')
- --iter: Number of training iterations
- --T: Maximum sequence length during training
- --t: Target CoT length during training
After training, evaluate the model across different CoT lengths:
# Basic evaluation
python3 synthetic/eval.py --test_t=3 --test_T=32 --model_size=6 --t=12 --T=80
# Or use the provided script for comprehensive evaluation
bash synthetic/scripts/run_eval.sh
Evaluation Parameters:
- --test_t: CoT length to evaluate
- --test_T: Maximum sequence length during evaluation
- --model_size: Model size (must match training)
- --device: Evaluation device
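To evaluate across several CoT lengths without editing `run_eval.sh`, one option is to generate the `eval.py` command lines programmatically. The sketch below only builds and prints the commands; it uses the flags shown in the basic evaluation command above, while the helper names and the swept range of `--test_t` values are assumptions made for illustration.

```python
import shlex
from typing import Iterable, List


def build_eval_cmd(test_t: int, test_T: int = 32, model_size: int = 6,
                   t: int = 12, T: int = 80) -> List[str]:
    """Build one eval.py invocation using the flags from the basic command."""
    return [
        "python3", "synthetic/eval.py",
        f"--test_t={test_t}", f"--test_T={test_T}",
        f"--model_size={model_size}", f"--t={t}", f"--T={T}",
    ]


def sweep_commands(test_ts: Iterable[int]) -> List[str]:
    """Return one shell-ready command line per evaluated CoT length."""
    return [shlex.join(build_eval_cmd(test_t)) for test_t in test_ts]


# The range below is an arbitrary illustration, not a recommended setting.
for line in sweep_commands(range(1, 5)):
    print(line)
```

The printed lines can be pasted into a shell, or each argument list from `build_eval_cmd` can be passed to `subprocess.run` from the repository root.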
To facilitate reproducibility, we have released the checkpoints of the base models used for RL post-training on Hugging Face: acetocarmine/M_6_T_80_t_12 and acetocarmine/M_9_T_80_t_12. These are 6-layer and 9-layer GPT-2 models, respectively, trained on a dataset with max operators = 80 and max ops/step = 12.
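The relationship between the two dataset settings can be illustrated with a small sketch. It assumes that "max operators" counts the addition operators in a problem and that "max ops/step" bounds how many operators a single CoT step resolves; this reading, the function name `cot_steps`, and the textual step format are all assumptions made for this illustration and are not taken from the repository's generators.

```python
from typing import List


def cot_steps(operands: List[int], ops_per_step: int) -> List[str]:
    """Reduce a sum left to right, resolving at most `ops_per_step` operators per step."""
    if ops_per_step < 1:
        raise ValueError("ops_per_step must be at least 1")
    if not operands:
        raise ValueError("operands must not be empty")
    remaining = list(operands)
    steps = []
    while len(remaining) > 1:
        # Resolving k operators collapses the first k + 1 operands into one number.
        k = min(ops_per_step, len(remaining) - 1)
        remaining = [sum(remaining[: k + 1])] + remaining[k + 1:]
        steps.append(" + ".join(str(x) for x in remaining))
    return steps


# Four operators resolved two at a time give a two-step chain.
print(cot_steps([1, 2, 3, 4, 5], ops_per_step=2))  # ['6 + 4 + 5', '15']
```

Under this reading, a problem with a fixed number of operators yields a longer chain when fewer operators are resolved per step, and a shorter chain when each step does more work.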
To inspect the generated datasets:
python3 -m synthetic.tokenizor
The real-world component analyzes CoT length patterns in production LLMs on established benchmarks.
- MATH500: Mathematical reasoning problems
- WinoGrande: Commonsense reasoning tasks
For MATH500:
python3 real/run_math500.py --data math500.jsonl --out outputs --samples 30
For WinoGrande:
python3 real/run_winogrande.py --data winogrande_xs --out outputs --samples 30
Common Parameters:
- --data: Input dataset path/name
- --out: Output directory for results
- --samples: Number of samples to generate per question
- --model: Model name (default: qwen models)
- --temperature: Sampling temperature (default: 0.7)
- --max_tokens: Maximum tokens per completion (default: 1024)
- --max_retries: Maximum API retry attempts (default: 5)
- --debug: Enable debug logging
After generating samples, analyze CoT length patterns:
Length Analysis:
# Analyze CoT length vs accuracy patterns
python3 real/eval_math500_cot_length.py --data_dir outputs/math500 --output_dir results
python3 real/eval_winogrande_cot_length.py --data_dir outputs/winogrande --output_dir results
Task Difficulty Analysis:
# Analyze optimal CoT length vs task difficulty correlation
python3 real/eval_math500_task_difficulty.py --data_dir outputs/math500 --output_dir difficulty_results
python3 real/eval_winogrande_task_difficulty.py --data_dir outputs/winogrande --output_dir difficulty_results
Analysis Parameters:
- --data_dir: Directory containing model outputs
- --output_dir: Directory to save analysis results
- --model: Specific model to analyze (optional, analyzes all if not specified)
- --no-filter: Skip filtering questions with all correct/incorrect answers
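The default filtering that `--no-filter` disables, followed by a length-versus-accuracy summary, might look roughly like the sketch below. The record layout `(question_id, cot_length, is_correct)`, the helper names, and the toy data are assumptions made for illustration; how the analysis scripts actually measure CoT length and store outputs is not shown here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical record layout: (question_id, cot_length, is_correct).
Sample = Tuple[str, int, bool]


def filter_informative(samples: List[Sample]) -> List[Sample]:
    """Drop questions whose samples are all correct or all incorrect."""
    by_question = defaultdict(list)
    for sample in samples:
        by_question[sample[0]].append(sample)
    kept = []
    for group in by_question.values():
        n_correct = sum(1 for _, _, ok in group if ok)
        if 0 < n_correct < len(group):
            kept.extend(group)
    return kept


def accuracy_by_length(samples: List[Sample]) -> Dict[int, float]:
    """Return the fraction of correct samples for each observed CoT length."""
    totals = defaultdict(lambda: [0, 0])  # length -> [correct, total]
    for _, length, ok in samples:
        totals[length][0] += int(ok)
        totals[length][1] += 1
    return {length: c / n for length, (c, n) in sorted(totals.items())}


# Toy data: q2 is answered correctly every time, so it is filtered out.
toy = [
    ("q1", 2, False), ("q1", 4, True), ("q1", 8, False),
    ("q2", 2, True), ("q2", 4, True),
    ("q3", 4, True), ("q3", 8, False),
]
kept = filter_informative(toy)
print(accuracy_by_length(kept))  # {2: 0.0, 4: 1.0, 8: 0.0}
```

Questions that every sample answers identically carry no signal about which length works best, which is the usual motivation for removing them before looking for an optimal length.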
This project is licensed under the MIT License - see the LICENSE file for details.
- Synthetic training component inspired by karpathy/nanoGPT
- Real-world datasets: MATH500, WinoGrande, MMLU, GPQA benchmarks
If you find this work useful, please consider citing our paper:
@article{wu2025optcotl,
title={When More is Less: Understanding Chain-of-Thought Length in LLMs},
author={Yuyang Wu and Yifei Wang and Ziyu Ye and Tianqi Du and Stefanie Jegelka and Yisen Wang},
journal={arXiv preprint arXiv:2502.07266},
year={2025}
}