A Python library for tool call mastery through self-play.
Given a set of functions (tool calls), we generate input-output examples from them and then train an LLM to predict the correct list of functions that maps the inputs to the outputs (the Induction task). The library also supports training the LLM to generate appropriately challenging input-output examples (the Proposal task).
- Scripts for SFT and RL on Induction and Proposal tasks using the TRL library.
- Synthetic data generation using LLMs.
- Wandb integration for monitoring and analyzing the metrics.
- Evaluation scripts, website to visualize the evaluation metrics.
- A web interface to visualize and interact with the data samples.
- Clean code: 300+ unit tests, CI using GitHub Actions.
- Can train small models (0.1B) locally within a few hours.
Currently, LLMs are trained to imitate text on the internet. As a result, they do not know what they know and what they do not know, which causes hallucination. They lack a world model (see Sutton's age of experience). Instead of next-token prediction, we would like a training approach that lets models take actions and learn from their outcomes. We call this self-supervised tool-use learning (SSTL). We can consider function calling or tool use as taking an action. We also want a task that is infinitely scalable for learning, bound only by computation. Programming by Example (PBE) provides such an environment, where the model is tasked with finding a series of functions that transforms given inputs into outputs.
We would like to develop some metacognition capabilities in the model. Given any task, the model should be able to classify it into one of these three possibilities:
- Confidently say it can solve this
- Confidently say it cannot solve this
- Unsure
This lets it keep learning boundlessly in the future by exploring the unsure tasks. A value function in the RL paradigm should help with this.
Let us assume each state represents a list of values (e.g. the integers [1,2,3]), and each function moves us from one state to another. The two tasks described after the diagram follow the naming convention of Absolute Zero Reasoner.
flowchart LR
A["[1, 2, 3]"] -->|inc| B["[2, 3, 4]"]
A -->|double| C["[2, 4, 6]"]
A -->|neg| D["[-1, -2, -3]"]
C -->|inc| E["[3, 5, 7]"]
B -->|double| I["[4,6,8]"]
A -->|int_to_str| F["['1', '2', '3']"]
F -->|repeat| G["['11', '22', '33']"]
D -->|neg| A
E -->|neg| H["[-3, -5, -7]"]
Induction: We would like to train a solver model that can give us the shortest path between two states (e.g. [1,2,3] -> [3,5,7]). A path is defined as a DAG consisting of pure functions (e.g. double, plus1). For simplicity, each function currently takes a single input and returns a single output. In the future we would like to support multi-argument functions to make this library more practical.
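The induction task can be illustrated with a small breadth-first search over function sequences. This is only a sketch under stated assumptions: the `FUNCTIONS` table and the `shortest_path` helper are hypothetical stand-ins, not the library's solver API, and a path is simplified here to a linear sequence of function names, not a general DAG.

```python
from collections import deque
from typing import Callable, Optional

# Hypothetical stand-ins for a few built-in functions (not the library's own).
FUNCTIONS: dict[str, Callable[[int], int]] = {
    "inc": lambda x: x + 1,
    "double": lambda x: x * 2,
    "neg": lambda x: -x,
}


def shortest_path(inputs, outputs, max_depth: int = 3) -> Optional[list[str]]:
    """Return the shortest list of function names mapping inputs to outputs."""
    start, goal = tuple(inputs), tuple(outputs)
    if start == goal:
        return []
    queue = deque([(start, [])])  # (state, path taken to reach it)
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if len(path) == max_depth:
            continue  # do not expand beyond the depth limit
        for name, fn in FUNCTIONS.items():
            nxt = tuple(fn(v) for v in state)  # apply fn element-wise
            if nxt == goal:
                return path + [name]
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None  # no path found within max_depth


print(shortest_path([1, 2, 3], [3, 5, 7]))  # ['double', 'inc']
```

Because breadth-first search explores all sequences of length n before any of length n + 1, the first match it returns is a shortest one.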
Propose: Propose a new task for the solver that is neither too easy nor too hard.
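One way to make "neither too easy nor too hard" concrete is to score a proposed task by how often the solver succeeds on it. The sketch below follows the general shape of the learnability reward described for Absolute Zero Reasoner; whether this library uses the same formula is an assumption, and `learnability_reward` is a hypothetical name.

```python
def learnability_reward(solver_successes: list[bool]) -> float:
    """Score a proposed task from several solver attempts (hypothetical helper).

    Tasks the solver never solves score 0 (too hard or invalid), tasks it
    always solves score 0 (too easy), and rarely solved tasks score highest.
    """
    if not solver_successes:
        return 0.0
    success_rate = sum(solver_successes) / len(solver_successes)
    if success_rate == 0.0:
        return 0.0  # never solved: no learning signal
    return 1.0 - success_rate  # always solved gives 1 - 1 = 0


print(learnability_reward([True, True, True, True]))      # 0.0
print(learnability_reward([False, False, False, False]))  # 0.0
print(learnability_reward([True, False, False, False]))   # 0.75
```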
Initial Results in WandB
This project requires Python 3.12 or later.
git clone https://github.com/abhishekraok/wandering-light.git
cd wandering-light
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
Run the full test suite with:
pytest
- FunctionDef: Class for immutable functions, e.g. double.
- TrajectorySpec: Represents planned sequences of functions without execution, e.g. [double, plus1].
- Trajectory: Class representing a triplet of inputs, outputs and function sequence, e.g. [(1,2,3), (3,5,7), (double, plus1)].
- Execute trajectories and evaluate results with Executor class.
- Several solvers included:
- RandomSolve: tries random function sequences within a budget.
- BFSSolve: performs breadth-first search up to a maximum depth.
- LLM solvers: use OpenAI, Gemini, Ollama or local models to propose functions.
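To make the relationship between these concepts concrete, here is a minimal sketch using frozen dataclasses. The `Toy*` classes and `toy_execute` are simplified, hypothetical stand-ins written for illustration. They are not the library's actual `FunctionDef`, `TrajectorySpec`, `Trajectory`, or `Executor` implementations, whose fields and signatures may differ.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToyFunctionDef:
    """Immutable named function (illustrative stand-in)."""
    name: str
    fn: Callable[[Any], Any]


@dataclass(frozen=True)
class ToyTrajectorySpec:
    """A planned sequence of functions that has not been executed yet."""
    functions: tuple[ToyFunctionDef, ...]


@dataclass(frozen=True)
class ToyTrajectory:
    """Triplet of inputs, outputs, and the function sequence linking them."""
    inputs: tuple
    outputs: tuple
    spec: ToyTrajectorySpec


def toy_execute(spec: ToyTrajectorySpec, inputs) -> ToyTrajectory:
    """Apply each function in order, element-wise, and record the result."""
    values = tuple(inputs)
    for f in spec.functions:
        values = tuple(f.fn(v) for v in values)
    return ToyTrajectory(tuple(inputs), values, spec)


double = ToyFunctionDef("double", lambda x: x * 2)
plus1 = ToyFunctionDef("plus1", lambda x: x + 1)
traj = toy_execute(ToyTrajectorySpec((double, plus1)), [1, 2, 3])
print(traj.outputs)  # (3, 5, 7)
```

Separating the plan (the spec) from its executed result (the trajectory) lets a solver propose a sequence first and have it checked against the target outputs afterwards.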
.
├── wandering_light/ # Main package
│ ├── __init__.py # Package initialization
│ ├── executor.py # Executes individual functions and trajectories
│ ├── function_def.py # Defines the FunctionDef class
│ ├── common_functions.py # Built-in helper functions
│ ├── trajectory.py # Defines TrajectorySpec and Trajectory classes
│ ├── solver.py # Implements search-based solvers
│ ├── synthesize.py # Utilities for synthesizing new functions
│ ├── typed_list.py # Typed list container
│ ├── llm_utils.py # LLM integration utilities
│ ├── constants.py # Project constants
│ ├── evals/ # Evaluation scripts and workflows
│ ├── training/ # Training scripts (SFT, RL)
│ └── web_ui/ # Web interface
├── tests/ # pytest test cases for core functionality
├── pyproject.toml # Project configuration and dependencies
├── pytest.ini # pytest configuration
└── README.md # This file
You can search for a trajectory that maps a specific input to an output using one of the built-in solvers.
from wandering_light.solver import get_solver_by_name
solver = get_solver_by_name("bfs", budget=3)
trajectory = solver.solve(input_list=[1, 2, 3], output_list=[3, 5, 7])
print("Found trajectory:", trajectory)
Create an evaluation file
python wandering_light/evals/create_data.py --save
Run using an LLM API
python wandering_light/evals/run_evaluation.py --eval_file=evals/data/eval_data_v20250831_160239.py --solver_names=["gemini"] --num_samples 100 --budget 1
See solver.py for additional solver names (openai, ollama, local models, etc.).
Quick check, without online evaluation
python wandering_light/training/sft.py --no-eval
To train on the full dataset, use
python wandering_light/training/sft.py --full-run
Suppose you store the output directory in SFT_OUTPUT_DIR. Evaluate the model to ensure you have a decent success rate (ideally between 30% and 70%). You can use the evaluation command below.
Next you can do RL.
python wandering_light/training/rl_grpo.py --batch-size 32 --model-name $SFT_OUTPUT_DIR --full-run --wandb-run-name $NAME
Assuming you have a checkpoint such as the one shown below:
python wandering_light/evals/run_evaluation.py --budget=1 --eval_file=wandering_light/evals/data/random_inputs_500.py --solver_names=[trained_local] --model-name abhishekraok/induction-basicfns-opt125m-longsft
To see the results of all the past evaluations, run
streamlit run wandering_light/evals/dashboard.py
The data generator.
First fine-tune the proposer using SFT with the --task proposer flag, then evaluate it.
python wandering_light/evals/evaluate_proposer.py --model abhishekraok/proposer-basicfns-opt125m-sft2k --solver-model abhishekraok/induction-basicfns-opt125m-longsft
which should output
EvalResult(parse_rate=0.96, avg_function_count=2.02, avg_function_count_ratio=1.38, solver_success_rate=0.15, num_samples=100, frac_non_zero_std=0.31)