Complexity: 🛑 Advanced
This example demonstrates how to use the NeMo Agent Toolkit finetuning harness with OpenPipe ART (Agent Reinforcement Trainer) to improve an LLM's performance at playing Tic-Tac-Toe through reinforcement learning.
The model learns to play against a random opponent, receiving rewards based on game-theoretic position evaluation rather than simple win/loss outcomes. This continuous reward signal enables more effective learning than sparse binary rewards.
- Prerequisites
- How the Example Works
- Step 1: Running Pre-Training Baseline Evaluation
- Step 2: Starting the OpenPipe ART Training Server
- Step 3: Running Finetuning
- Step 4: Understanding the Reward Function
- Step 5: Viewing Training Logs and Metrics
- Step 6: Running Post-Training Evaluation
- Best Practices and Troubleshooting
| Component | Minimum | Recommended |
|---|---|---|
| GPU | 40GB VRAM (A100) | 80GB VRAM (H100) |
| RAM | 32GB | 64GB |
| Storage | 50GB free | 100GB free |
Note: The Qwen2.5-3B-Instruct model requires approximately 20GB of VRAM for inference and additional memory for training gradients. An 80GB H100 provides comfortable headroom for larger batch sizes and sequence lengths.
- Python 3.11+
- NeMo Agent Toolkit with the OpenPipe ART plugin. This example is meant to be run using a NeMo Agent Toolkit installation from source. You can follow the NeMo Agent Toolkit Installation Guide to set up your environment.
- OpenPipe ART installed in a separate virtual environment. OpenPipe ART has specific dependency requirements that may conflict with NeMo Agent Toolkit, so we recommend installing it in an isolated environment:

  ```bash
  # Create a separate virtual environment for ART
  uv venv art-env --python 3.13
  source art-env/bin/activate
  export HF_TOKEN=<your_huggingface_token>

  # Install OpenPipe ART
  uv pip install --no-cache 'openpipe-art[backend]==0.4.11'

  # Verify installation
  art --help
  ```

  For detailed installation instructions, see the OpenPipe ART Getting Started Guide.
- This example package installed in your NeMo Agent Toolkit environment:

  ```bash
  uv pip install -e examples/finetuning/rl_with_openpipe_art
  ```

The rest of this example assumes you are in the root of the NeMo Agent Toolkit repository. Please execute all commands from there.
The LLM plays Tic-Tac-Toe against a random opponent. In each game:

- The LLM is assigned a role (`X` or `O`)
- Players alternate turns, with `X` always going first
- The LLM must output valid moves in XML format:

  ```xml
  <move>
    <row>2</row>
    <col>2</col>
  </move>
  ```

- The game continues until someone wins or the board is full (draw)
Training against a random opponent provides several benefits:
- Consistent difficulty: The opponent doesn't improve, providing a stable training signal
- Exploitable patterns: The model can learn to capitalize on random mistakes
- Clear improvement signal: Win rate against random play is a meaningful metric
- Faster iteration: No need to manage self-play complexity
Against a random opponent, a perfect Tic-Tac-Toe player should win or draw almost every game (winning ~95% when going first as X).
The workflow is defined in `src/rl_with_openpipe_art/rl_with_openpipe_art.py`:

```python
@register_function(config_type=RlWithOpenpipeArtFunctionConfig)
async def rl_with_openpipe_art_function(config, builder):
    player_model = await builder.get_llm(config.player_model)
    opponent_model = await builder.get_llm(config.opponent_model) if config.opponent_model else player_model

    async def _play_game(role: str) -> str:
        # Create players
        player_x = LLMTicTacToePlayer(...)  # X goes first
        player_o = LLMTicTacToePlayer(...)

        # Run the game
        game = TicTacToeGame(player_x, player_o, role)
        winner = game.play()

        # Return result
        if role == "X":
            return "Win!" if winner == 1 else "Lose!" if winner == -1 else "Draw!"
        else:
            return "Win!" if winner == -1 else "Lose!" if winner == 1 else "Draw!"

    yield FunctionInfo.from_fn(_play_game)
```

The workflow:
- Creates two LLM players (or one LLM + one random player)
- Runs a complete game, tracking intermediate steps
- Records move quality scores at each step for reward shaping
- Returns the game outcome
The training data (`data/data.json`) contains game scenarios:

```json
[
  {"id": 1, "question": "X", "answer": "Win!"},
  {"id": 2, "question": "O", "answer": "Win!"},
  ...
]
```

- `question`: The role the LLM plays (`X` or `O`)
- `answer`: The expected outcome (always `Win!`, since the goal is to learn to win)
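If you edit or extend this dataset, a quick shape check can save a failed run. The helper below is hypothetical (not part of the example package) and assumes the `id`/`question`/`answer` schema shown above:

```python
import json
import os
import tempfile

def check_dataset(path: str) -> int:
    """Validate every record's shape and return the number of records."""
    with open(path) as f:
        records = json.load(f)
    for rec in records:
        assert set(rec) == {"id", "question", "answer"}
        assert rec["question"] in ("X", "O")  # role the LLM plays
        assert rec["answer"] == "Win!"        # target outcome is always a win
    return len(records)

# demo with the two records shown above
sample = [{"id": 1, "question": "X", "answer": "Win!"},
          {"id": 2, "question": "O", "answer": "Win!"}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample, f)
print(check_dataset(f.name))  # → 2
os.remove(f.name)
```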
Before training, establish a baseline to measure improvement.
In your ART virtual environment, start vLLM to serve the base model:
```bash
# Activate the ART environment
source art-env/bin/activate
export HF_TOKEN=<your_huggingface_token>

# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-3B-Instruct
```

Wait for the server to fully load the model. You should see:

```
INFO: Started server process
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000
```
Verify the server is running:
```bash
curl http://localhost:8000/v1/models
```

In a separate terminal with your NeMo Agent Toolkit environment activated:

```bash
# This is a dummy key for local vLLM usage
export OPENAI_API_KEY=default

# Run the pre-training evaluation
nat eval --config_file examples/finetuning/rl_with_openpipe_art/configs/config_pre_train.yml --reps 3
```

This runs 72 games (12 as X, 12 as O, 3 times each) and reports the win percentage.
Record this baseline score for comparison after training.
Once the evaluation completes, stop the vLLM server (Ctrl+C) to free GPU memory for training.
The ART server handles both inference and training. It runs vLLM for serving the model and Unsloth for GRPO weight updates using LoRA adapters by default.
Note: The default configuration uses Unsloth LoRA finetuning. Full-weight training requires additional TorchTune configuration through the `torchtune_args` field in the trainer adapter backend config. Refer to the OpenPipe ART documentation for details.
In your ART virtual environment:
```bash
# Activate the ART environment
source art-env/bin/activate
export HF_TOKEN=<your_huggingface_token>

# Start the ART server
art --host 0.0.0.0 --port 7623
```

Note: The ART server listens on port `7623` for training commands and starts vLLM internally on port `8000` for inference.
Wait for the server to initialize. You should see output indicating:
- Training server ready
- API endpoints available
Sample output:
```
INFO:     Started server process [3671624]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:7623 (Press CTRL+C to quit)
```

With the ART server running, start the finetuning process.
The training configuration is in `src/rl_with_openpipe_art/configs/config.yml`:

```yaml
llms:
  openpipe_llm:
    _type: openai
    # With LoRA finetuning (default): model_name must match backend.name below
    # so that inference routes to the latest LoRA checkpoint, not the base model.
    # With full-weight training: model_name must match backend.base_model below
    # as updated weights are loaded directly into vLLM under the base model name.
    model_name: tic_tac_toe_training_run
    base_url: http://localhost:8000/v1
    api_key: default
    temperature: 0.4  # Some randomness for exploration

workflow:
  _type: rl_with_openpipe_art
  player_model: openpipe_llm
  max_parser_retries: 2  # Retry on malformed XML

eval:
  general:
    max_concurrency: 16  # Parallel game execution
    output_dir: .tmp/nat/examples/rl_openpipe/eval/finetune
    dataset:
      _type: json
      file_path: examples/finetuning/rl_with_openpipe_art/src/rl_with_openpipe_art/data/data.json
  evaluators:
    rl_accuracy:
      _type: step_value_computation  # Uses alpha-beta reward function

trajectory_builders:
  openpipe_traj_builder:
    _type: openpipe_art_traj_builder
    num_generations: 1  # Games per example per epoch

trainer_adapters:
  openpipe_trainer_adapter:
    _type: openpipe_art_trainer_adapter
    backend:
      ip: "0.0.0.0"
      port: 7623
      name: "tic_tac_toe_training_run"
      project: "tic_tac_toe_project"
      base_model: "Qwen/Qwen2.5-3B-Instruct"
      api_key: "default"
      init_args:
        max_seq_length: 8192
      engine_args:
        gpu_memory_utilization: 0.9
        tensor_parallel_size: 1
    training:
      learning_rate: 1e-5
      beta: 0.1

finetuning:
  enabled: true
  trainer: openpipe_trainer
  trajectory_builder: openpipe_traj_builder
  trainer_adapter: openpipe_trainer_adapter
  reward_function:
    name: rl_accuracy
  num_epochs: 8
  output_dir: ./.tmp/nat/finetuning/tic_tac_toe
```

Important: With LoRA finetuning (the default), the ART backend registers each LoRA adapter in vLLM under the training run name (`backend.name`). The `model_name` in the LLM config must match this name so that inference requests are routed to the latest LoRA checkpoint. If `model_name` points to the base model (`Qwen/Qwen2.5-3B-Instruct`), every epoch will evaluate the unchanged base model, and GRPO training will have no effect.
In your NeMo Agent Toolkit environment:
```bash
# This is a dummy key for local vLLM usage
export OPENAI_API_KEY=default

nat finetune --config_file examples/finetuning/rl_with_openpipe_art/configs/config.yml
```

Training progress is logged to the console and saved to files:
```
INFO - Starting finetuning with config: src/rl_with_openpipe_art/configs/config.yml
INFO - Initializing OpenPipe ART Runner
INFO - Successfully registered with ART backend.
INFO - Starting finetuning run with 30 epochs
INFO - Starting epoch 1 for run art_run_a1b2c3d4
INFO - Starting 1 evaluation runs for run_id: art_run_a1b2c3d4
INFO - Built 48 trajectories across 48 examples
INFO - Epoch 1 progress logged - Avg Reward: 0.4523, Trajectories: 48
INFO - Training art_run_a1b2c3d4 completed successfully.
INFO - Completed epoch 1/30
INFO - Starting epoch 2 for run art_run_a1b2c3d4
...
```
Training typically takes upwards of 40 minutes for 10 epochs on an H100.
The reward function is the key to effective RL training. This example uses a sophisticated alpha-beta pruning based reward instead of simple win/loss signals.
Simple win/loss rewards have significant problems for training:
| Issue | Description |
|---|---|
| Sparsity | Reward only at game end (after 5-9 moves) |
| Credit assignment | Which moves caused the win/loss? |
| No gradient for draws | Draws give 0 reward, no learning signal |
| Binary signal | No difference between "barely won" and "dominated" |
Alpha-beta pruning is a search algorithm that determines the optimal play in two-player games. It works by:
- Building a game tree: All possible future moves and responses
- Minimax evaluation: Assuming both players play optimally
- Pruning branches: Skipping moves that can't affect the outcome
For Tic-Tac-Toe, alpha-beta can solve the entire game tree, determining:
- Forced win: A position where perfect play guarantees victory
- Forced loss: A position where the opponent can force a win
- Drawn position: Neither player can force a win
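The same idea can be shown with a compact, self-contained toy solver (an illustrative sketch, independent of the `core.py` implementation; boards are flat 9-element lists with `+1` for X, `-1` for O, `0` for empty):

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return +1 if X has three in a row, -1 for O, else 0."""
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def solve(board, side, alpha=-1, beta=1):
    """Game value from X's perspective: +1 forced X win, -1 forced O win, 0 draw."""
    w = winner(board)
    if w != 0:
        return w
    moves = [i for i in range(9) if board[i] == 0]
    if not moves:
        return 0  # board full: draw
    best = -2 if side == 1 else 2
    for m in moves:
        board[m] = side
        value = solve(board, -side, alpha, beta)
        board[m] = 0  # undo the move
        if side == 1:          # X maximizes
            best = max(best, value)
            alpha = max(alpha, best)
        else:                  # O minimizes
            best = min(best, value)
            beta = min(beta, best)
        if alpha >= beta:
            break  # prune: remaining moves cannot change the result
    return best

print(solve([0] * 9, 1))  # perfect play from the empty board → 0 (draw)
```

Because the full game tree is tiny, the solver labels any position as a forced win, forced loss, or draw almost instantly.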
The reward function is implemented in two files:
Located at: `src/rl_with_openpipe_art/core.py:110-285`

```python
def evaluate_board_for_player(board: np.ndarray, player_val: int) -> float:
    """
    Evaluate the position from the perspective of `player_val`.

    Output ranges:
    - Non-terminal positions: [0, 1] continuous
    - Forced future win: (1, 11] = base + 10
    - Immediate win: (1, 16] = base + 15
    - Forced loss or already lost: 0.0
    """
```

The function combines two components:
1. Static Heuristic Evaluation (continuous, no search):

```python
def static_eval(b: np.ndarray) -> float:
    """Heuristic position evaluation in [-1, 1]."""
    # Count threats, control, position quality
    score_raw = (
        4.0 * (my_two_open - opp_two_open)    # Strong threats
        + 1.5 * (my_one_open - opp_one_open)  # Influence
        + 1.5 * center                        # Center control
        + 0.75 * corners.sum()                # Corner control
        + 0.25 * edges.sum()                  # Edge control
    )
    return float(np.tanh(score_raw / 5.0))  # Squash to [-1, 1]
```

2. Game-Theoretic Solver (alpha-beta search):
```python
def solve_outcome(b: np.ndarray, side_to_move: int, alpha=-1.0, beta=1.0) -> float:
    """
    Full-depth minimax with alpha-beta pruning.

    Returns: +1 (forced win), 0 (draw), -1 (forced loss)
    """
    # Recursively evaluate all possible continuations
    # Prune branches that can't improve the result
    if side_to_move == player_val:
        # Maximizing: find best move for us
        for move in available_moves(b):
            best = max(best, solve_outcome(child, -side_to_move, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # Beta cut-off
    else:
        # Minimizing: opponent's best response
        for move in available_moves(b):
            best = min(best, solve_outcome(child, -side_to_move, alpha, beta))
            beta = min(beta, best)
            if alpha >= beta:
                break  # Alpha cut-off
    return best
```

3. Combined Reward Mapping:
| Position Type | Reward Range | Description |
|---|---|---|
| Already lost | `0.0` | Terminal loss state |
| Forced future loss | `0.0` | Opponent can force win |
| Game-theoretic draw | `[0, 1]` | Continuous heuristic |
| Non-terminal (no forced outcome) | `[0, 1]` | Continuous heuristic |
| Forced future win | `base + 10` | `(1, 11]` |
| Immediate win (on board) | `base + 15` | `(1, 16]` |
Located at: `src/rl_with_openpipe_art/accuracy_evaluator.py:39-72`

```python
@staticmethod
def episode_value_from_states(
    state_values: list[float],   # Rewards from each move
    gamma_base: float = 0.8,     # Temporal discount
    delta_bonus: float = 0.95,   # Bonus decay
) -> float:
    """Compute episode value with temporal discounting."""
    s = np.asarray(state_values, dtype=float)
    T = len(s) - 1

    # 1) Split into base [0,1] and bonus (>0 if forced/actual win)
    base = np.minimum(s, 1.0)
    bonus = np.maximum(s - 1.0, 0.0)

    # 2) Reverse-discounted base: earlier moves matter more
    exponents = np.arange(T, -1, -1)  # T, T-1, ..., 0
    w = gamma_base ** exponents
    w = w / w.sum()
    R_base = float(np.dot(w, base))  # Weighted average in [0, 1]

    # 3) Bonus: max spike, time-decayed (reward early wins)
    if np.any(bonus > 0):
        bonus_weights = delta_bonus ** exponents
        U_time = float(np.max(bonus * bonus_weights))
    else:
        U_time = 0.0

    # 4) Final episode score
    return R_base + U_time
```

| Property | Benefit for RL |
|---|---|
| Continuous | Smooth gradients, stable training |
| Dense | Reward at every move, not just game end |
| Informative | Distinguishes good moves from great moves |
| Theoretically grounded | Based on perfect play analysis |
| Temporally weighted | Earlier good moves are more valuable |
| Bonus for winning | Strong signal to learn winning patterns |
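To see how the aggregation behaves, here is a standalone re-implementation of the formula above applied to made-up step values (a decent opening at `0.5`, a strong move at `0.8`, then an immediate win scored `16.0`, i.e. base `1.0` plus bonus `15`):

```python
import numpy as np

def episode_value(state_values, gamma_base=0.8, delta_bonus=0.95):
    """Same computation as episode_value_from_states, reproduced standalone."""
    s = np.asarray(state_values, dtype=float)
    T = len(s) - 1
    base = np.minimum(s, 1.0)          # heuristic part, in [0, 1]
    bonus = np.maximum(s - 1.0, 0.0)   # > 0 only for forced/actual wins
    exponents = np.arange(T, -1, -1)   # T, T-1, ..., 0
    w = gamma_base ** exponents
    w = w / w.sum()                    # earlier moves get larger weight
    r_base = float(np.dot(w, base))
    u_time = float(np.max(bonus * delta_bonus ** exponents)) if np.any(bonus > 0) else 0.0
    return r_base + u_time

print(round(episode_value([0.5, 0.8, 16.0]), 3))  # → 15.803
```

The base component contributes roughly `0.80` (a discount-weighted average of `0.5`, `0.8`, and `1.0`), and the win bonus of `15` passes through undecayed because it occurs on the final move.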
During each game, the workflow records move quality:
```python
# In rl_with_openpipe_art.py
if current_player.name == self.role:
    # Record intermediate step with position value
    self.step_manager.push_intermediate_step(
        IntermediateStepPayload(
            event_type=IntermediateStepType.CUSTOM_END,
            name="agent_move",
            metadata={
                "step": turn_index,
                "value": evaluate_board_for_player(self.board, current_player.value)
            }
        )
    )
```

The evaluator then aggregates these step-level values into an episode reward.
After training, check the output directory:
```
.tmp/nat/finetuning/tic_tac_toe/
├── training_metrics.jsonl   # Per-epoch metrics
├── reward_history.json      # Reward progression
└── reward_plot.png          # Visual reward chart
```
The `training_metrics.jsonl` file contains detailed per-epoch data:

```json
{
  "epoch": 0,
  "timestamp": "2025-01-15T10:30:45.123456",
  "run_id": "art_run_a1b2c3d4",
  "avg_reward": 0.4523,
  "min_reward": 0.0,
  "max_reward": 1.2341,
  "num_trajectories": 48,
  "num_groups": 48
}
```

When training is complete, view the reward progression plot (`reward_plot.png`). The Y-axis shows average episode reward, and the X-axis shows epochs. Results may vary between runs.
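Because the metrics file is plain JSONL, it is easy to post-process. The snippet below is a hypothetical helper (not shipped with the example) that extracts the per-epoch average reward:

```python
import json
import os
import tempfile

def reward_progression(path):
    """Return (epoch, avg_reward) pairs from a training_metrics.jsonl file."""
    rows = []
    with open(path) as f:
        for line in f:
            if line.strip():
                rec = json.loads(line)
                rows.append((rec["epoch"], rec["avg_reward"]))
    return rows

# demo with two synthetic epochs
records = [{"epoch": 0, "avg_reward": 0.45}, {"epoch": 1, "avg_reward": 0.61}]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
print(reward_progression(f.name))  # → [(0, 0.45), (1, 0.61)]
os.remove(f.name)
```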
After training completes, evaluate the improved model.
The ART server continues serving the finetuned model weights. Do not restart it, as the updated weights are in memory.
```bash
# This is a dummy key for local vLLM usage
export OPENAI_API_KEY=default

nat eval --config_file examples/finetuning/rl_with_openpipe_art/configs/config_post_train.yml --reps 3
```

Compare the post-training win percentage against the pre-training baseline. You should see a notable improvement.
Note: Due to the stochastic nature of reinforcement learning, you may notice a decrease in performance in some training attempts. Please try running the training again or follow the troubleshooting guide below.
| Value | Effect |
|---|---|
| `1e-7` | Very stable, slow learning |
| `1e-6` | Recommended starting point |
| `5e-6` | Faster learning, may be unstable |
| `1e-5` | Aggressive, risk of divergence |
```yaml
trajectory_builders:
  openpipe_traj_builder:
    num_generations: 4  # Try 4-8 for better GRPO signal
```

More generations per example provide better comparison signal for GRPO but increase training time. When setting `num_generations` to 1, the trajectory builder uses all examples in the dataset in one large group. Conversely, increasing `num_generations` causes each input data point to be evaluated multiple times per epoch, generating more trajectories and finer reward comparisons. Each example then gets its own group.
```yaml
llms:
  openpipe_llm:
    temperature: 0.4  # Balance exploration/exploitation
```

| Value | Effect |
|---|---|
| `0.0` | Deterministic, no exploration |
| `0.2-0.4` | Recommended for training |
| `0.6+` | High exploration, noisier gradients |
| `0.1` | Use for final evaluation (near-deterministic) |
Start with 20-30 epochs and monitor the reward plot. Stop if:
- Rewards plateau for 5+ epochs
- Validation performance decreases (overfitting)
Enable curriculum learning for more stable training:
```yaml
finetuning:
  curriculum_learning:
    enabled: true
    initial_percentile: 0.3    # Start with easiest 30%
    increment_percentile: 0.2  # Add 20% each expansion
    expansion_interval: 5      # Expand every 5 epochs
```

Cause: ART server not running or wrong port.
Solution:
```bash
# Check if ART server is running
curl http://localhost:7623/health
```

Cause: Insufficient GPU memory.
Solutions:
- Reduce `gpu_memory_utilization`:

  ```yaml
  engine_args:
    gpu_memory_utilization: 0.7
  ```

- Reduce `max_seq_length`:

  ```yaml
  init_args:
    max_seq_length: 4096
  ```

- Reduce `max_concurrency`:

  ```yaml
  eval:
    general:
      max_concurrency: 8
  ```
Cause: Workflow not producing intermediate steps or evaluator errors.
Solutions:
- Check workflow registration:

  ```bash
  nat info --components
  ```

- Verify evaluator is registered:

  ```bash
  nat info --evaluators
  ```

- Run a single game manually to debug:

  ```bash
  nat eval --config_file=... --max_examples=1
  ```
Cause: Model not following the prompt format.
Solutions:
- Increase `max_parser_retries`:

  ```yaml
  workflow:
    max_parser_retries: 3
  ```
- Lower temperature for more deterministic outputs
- Check if base model supports the task (try a larger model)
Possible causes:
- Learning rate too low: Try `5e-6`
- Not enough generations: Increase `num_generations` to 2-4
- Model already optimal: Check if baseline is already high
- Reward function issue: Verify evaluator is computing rewards correctly
- Increase batch parallelism:

  ```yaml
  eval:
    general:
      max_concurrency: 32  # If GPU memory allows
  ```

- Use multiple generations:

  ```yaml
  trajectory_builders:
    openpipe_traj_builder:
      num_generations: 4  # Better GRPO signal
  ```

- Enable prefix caching (vLLM):

  ```bash
  python -m vllm.entrypoints.openai.api_server \
      --model Qwen/Qwen2.5-3B-Instruct \
      --enable-prefix-caching
  ```
| File | Description |
|---|---|
| `src/rl_with_openpipe_art/rl_with_openpipe_art.py` | Main workflow: game loop, player management |
| `src/rl_with_openpipe_art/core.py` | Game logic, board evaluation, alpha-beta solver |
| `src/rl_with_openpipe_art/llm_agents.py` | LLM player wrapper, move parsing, prompts |
| `src/rl_with_openpipe_art/accuracy_evaluator.py` | Reward computation, episode aggregation |
| `src/rl_with_openpipe_art/evaluator_register.py` | Evaluator registration |
| `src/rl_with_openpipe_art/register.py` | Workflow component registration |
| `configs/config.yml` | Training configuration |
| `configs/config_pre_train.yml` | Pre-training evaluation configuration |
| `configs/config_post_train.yml` | Post-training evaluation configuration |
| `data/data.json` | Training dataset |
| `data/eval_data.json` | Evaluation dataset |
