Reinforcement Learning with OpenPipe ART: Tic-Tac-Toe Example

Complexity: 🛑 Advanced

This example demonstrates how to use the NeMo Agent Toolkit finetuning harness with OpenPipe ART (Agent Reinforcement Trainer) to improve an LLM's performance at playing Tic-Tac-Toe through reinforcement learning.

The model learns to play against a random opponent, receiving rewards based on game-theoretic position evaluation rather than simple win/loss outcomes. This continuous reward signal enables more effective learning than sparse binary rewards.

Prerequisites
How the Example Works
Step 1: Running Pre-Training Baseline Evaluation
Step 2: Starting the OpenPipe ART Training Server
Step 3: Running Finetuning
Step 4: Understanding the Reward Function
Step 5: Viewing Training Logs and Metrics
Step 6: Running Post-Training Evaluation
Best Practices and Troubleshooting

Prerequisites

Hardware Requirements

Component	Minimum	Recommended
GPU	40GB VRAM (A100)	80GB VRAM (H100)
RAM	32GB	64GB
Storage	50GB free	100GB free

Note: The Qwen2.5-3B-Instruct model requires approximately 20GB of VRAM for inference and additional memory for training gradients. An 80GB H100 provides comfortable headroom for larger batch sizes and sequence lengths.

Software Requirements

Python 3.11+
NeMo Agent Toolkit with the OpenPipe ART plugin. This example is meant to be run using a NeMo Agent Toolkit installation from source. You can follow the NeMo Agent Toolkit Installation Guide to set up your environment.
OpenPipe ART installed in a separate virtual environment:

OpenPipe ART has specific dependency requirements that may conflict with NeMo Agent Toolkit. We recommend installing it in an isolated environment:
```
# Create a separate virtual environment for ART
uv venv art-env --python 3.13
source art-env/bin/activate
export HF_TOKEN=<your_huggingface_token>
# Install OpenPipe ART
uv pip install --no-cache 'openpipe-art[backend]==0.4.11'

# Verify installation
art --help
```
For detailed installation instructions, see the OpenPipe ART Getting Started Guide.
This example package in your NeMo Agent Toolkit environment:
```
uv pip install -e examples/finetuning/rl_with_openpipe_art
```
The rest of this example assumes you are in the root of the NeMo Agent Toolkit repository. Please execute all commands from there.

How the Example Works

The Task

The LLM plays Tic-Tac-Toe against a random opponent. In each game:

The LLM is assigned a role (X or O)
Players alternate turns, with X always going first
The LLM must output valid moves in XML format:
```
<move>
  <row>2</row>
  <col>2</col>
</move>
```
The game continues until someone wins or the board is full (draw)
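To make the expected move format concrete, here is a minimal parsing sketch. The helper name `parse_move` and the regular expression are illustrative and are not the example's actual parser (that lives in `llm_agents.py`). The sketch also assumes 1-based coordinates on a 3x3 board, which the text above does not confirm.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for the <move> block shown above; whitespace between tags is allowed.
MOVE_RE = re.compile(
    r"<move>\s*<row>\s*(\d+)\s*</row>\s*<col>\s*(\d+)\s*</col>\s*</move>"
)


def parse_move(text: str, board_size: int = 3) -> Optional[Tuple[int, int]]:
    """Return (row, col) from the first <move> block, or None if it is missing or out of range.

    Assumption: coordinates are 1-based. Adjust the range check for 0-based indexing.
    """
    match = MOVE_RE.search(text)
    if match is None:
        return None
    row, col = int(match.group(1)), int(match.group(2))
    if not (1 <= row <= board_size and 1 <= col <= board_size):
        return None
    return row, col


reply = "I will take the center.\n<move>\n  <row>2</row>\n  <col>2</col>\n</move>"
print(parse_move(reply))           # (2, 2)
print(parse_move("no move here"))  # None
```

Returning `None` for malformed or out-of-range output gives the caller a single signal on which to retry, which is the role `max_parser_retries` plays in the workflow configuration.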

Why a Random Opponent?

Training against a random opponent provides several benefits:

Consistent difficulty: The opponent doesn't improve, providing a stable training signal
Exploitable patterns: The model can learn to capitalize on random mistakes
Clear improvement signal: Win rate against random play is a meaningful metric
Faster iteration: No need to manage self-play complexity

Against a random opponent, a perfect Tic-Tac-Toe player never loses and should win most games (roughly 95% when going first as X).

The Workflow

The workflow is defined in src/rl_with_openpipe_art/rl_with_openpipe_art.py:

@register_function(config_type=RlWithOpenpipeArtFunctionConfig)
async def rl_with_openpipe_art_function(config, builder):
    player_model = await builder.get_llm(config.player_model)
    opponent_model = await builder.get_llm(config.opponent_model) if config.opponent_model else player_model

    async def _play_game(role: str) -> str:
        # Create players
        player_x = LLMTicTacToePlayer(...)  # X goes first
        player_o = LLMTicTacToePlayer(...)

        # Run the game
        game = TicTacToeGame(player_x, player_o, role)
        winner = game.play()

        # Return result
        if role == "X":
            return "Win!" if winner == 1 else "Lose!" if winner == -1 else "Draw!"
        else:
            return "Win!" if winner == -1 else "Lose!" if winner == 1 else "Draw!"

    yield FunctionInfo.from_fn(_play_game)

The workflow:

Creates two LLM players (or one LLM + one random player)
Runs a complete game, tracking intermediate steps
Records move quality scores at each step for reward shaping
Returns the game outcome

Training Data

The training data (data/data.json) contains game scenarios:

[
  {"id": 1, "question": "X", "answer": "Win!"},
  {"id": 2, "question": "O", "answer": "Win!"},
  ...
]

question: The role the LLM plays (X or O)
answer: The expected outcome (always Win! since the goal is to learn to win)
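If you want to regenerate or resize a dataset of this shape, a sketch like the following may help. The function name `build_dataset` and the strict X/O alternation are assumptions; only the three field names and the constant `Win!` answer come from the description above.

```python
import json


def build_dataset(games_per_role: int) -> list[dict]:
    """Build entries shaped like data.json: alternating X and O roles, answer always "Win!"."""
    entries = []
    for i in range(2 * games_per_role):
        role = "X" if i % 2 == 0 else "O"
        entries.append({"id": i + 1, "question": role, "answer": "Win!"})
    return entries


dataset = build_dataset(games_per_role=2)
print(json.dumps(dataset, indent=2))
```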

Step 1: Running Pre-Training Baseline Evaluation

Before training, establish a baseline to measure improvement.

1.1 Start the vLLM Inference Server

In your ART virtual environment, start vLLM to serve the base model:

# Activate the ART environment
source art-env/bin/activate
export HF_TOKEN=<your_huggingface_token>
# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-3B-Instruct

Wait for the server to fully load the model. You should see:

INFO:     Started server process
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000

Verify the server is running:

curl http://localhost:8000/v1/models
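The same check can be scripted. This sketch relies on the OpenAI-compatible response shape for `/v1/models` (a `data` list whose entries carry an `id`). The function names are illustrative, and the base URL assumes the default port shown above.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-compatible /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:8000/v1", timeout: float = 5.0) -> list[str]:
    """Query the server and return the IDs of the models it serves."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as response:
        return model_ids(json.load(response))


if __name__ == "__main__":
    try:
        print(fetch_model_ids())
    except OSError as exc:  # urllib's URLError is a subclass of OSError
        print(f"Server not reachable yet: {exc}")
```

For the baseline evaluation, the returned list should include `Qwen/Qwen2.5-3B-Instruct`.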

1.2 Run Pre-Training Evaluation

In a separate terminal with your NeMo Agent Toolkit environment activated:

# This is a dummy key for local vLLM usage
export OPENAI_API_KEY=default
# Run the pre-training evaluation
nat eval --config_file examples/finetuning/rl_with_openpipe_art/configs/config_pre_train.yml --reps 3

This runs 72 games (12 as X, 12 as O, 3 times each) and reports the win percentage.

Record this baseline score for comparison after training.
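The game count and a win percentage can be worked out by hand as below. `summarize_outcomes` is a hypothetical helper for tallying the outcome strings the workflow returns; it is not part of the toolkit's evaluation output.

```python
from collections import Counter


def summarize_outcomes(outcomes: list[str]) -> dict:
    """Tally "Win!", "Lose!", and "Draw!" strings and compute the win percentage."""
    counts = Counter(outcomes)
    total = len(outcomes)
    win_pct = 100.0 * counts["Win!"] / total if total else 0.0
    return {"games": total, "wins": counts["Win!"], "win_pct": win_pct}


# 24 dataset entries (12 as X, 12 as O), each repeated 3 times via --reps 3
total_games = (12 + 12) * 3
print(total_games)  # 72

sample = ["Win!", "Lose!", "Draw!", "Win!"]
print(summarize_outcomes(sample))
```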

1.3 Stop the vLLM Server

Once the evaluation completes, stop the vLLM server (Ctrl+C) to free GPU memory for training.

Step 2: Starting the OpenPipe ART Training Server

The ART server handles both inference and training. It runs vLLM for serving the model and Unsloth for GRPO weight updates using LoRA adapters by default.

Note: The default configuration uses Unsloth LoRA finetuning. Full-weight training requires additional TorchTune configuration through the torchtune_args field in the trainer adapter backend config. Refer to the OpenPipe ART documentation for details.

In your ART virtual environment:

# Activate the ART environment
source art-env/bin/activate
export HF_TOKEN=<your_huggingface_token>

# Start the ART server
art --host 0.0.0.0 --port 7623

Note: The ART server listens on port 7623 for training commands and starts vLLM internally on port 8000 for inference.

Wait for the server to initialize. You should see output indicating:

Training server ready
API endpoints available

Sample output:

INFO:     Started server process [3671624]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:7623 (Press CTRL+C to quit)

Step 3: Running Finetuning

With the ART server running, start the finetuning process.

3.1 Review the Training Configuration

The training configuration is in src/rl_with_openpipe_art/configs/config.yml:

llms:
  openpipe_llm:
    _type: openai
    # With LoRA finetuning (default): model_name must match backend.name below
    # so that inference routes to the latest LoRA checkpoint, not the base model.
    # With full-weight training: model_name must match backend.base_model below
    # as updated weights are loaded directly into vLLM under the base model name.
    model_name: tic_tac_toe_training_run
    base_url: http://localhost:8000/v1
    api_key: default
    temperature: 0.4  # Some randomness for exploration

workflow:
  _type: rl_with_openpipe_art
  player_model: openpipe_llm
  max_parser_retries: 2  # Retry on malformed XML

eval:
  general:
    max_concurrency: 16  # Parallel game execution
    output_dir: .tmp/nat/examples/rl_openpipe/eval/finetune
    dataset:
      _type: json
      file_path: examples/finetuning/rl_with_openpipe_art/src/rl_with_openpipe_art/data/data.json

  evaluators:
    rl_accuracy:
      _type: step_value_computation  # Uses alpha-beta reward function

trajectory_builders:
  openpipe_traj_builder:
    _type: openpipe_art_traj_builder
    num_generations: 1  # Games per example per epoch

trainer_adapters:
  openpipe_trainer_adapter:
    _type: openpipe_art_trainer_adapter
    backend:
      ip: "0.0.0.0"
      port: 7623
      name: "tic_tac_toe_training_run"
      project: "tic_tac_toe_project"
      base_model: "Qwen/Qwen2.5-3B-Instruct"
      api_key: "default"
      init_args:
        max_seq_length: 8192
      engine_args:
        gpu_memory_utilization: 0.9
        tensor_parallel_size: 1
    training:
      learning_rate: 1e-5
      beta: 0.1

finetuning:
  enabled: true
  trainer: openpipe_trainer
  trajectory_builder: openpipe_traj_builder
  trainer_adapter: openpipe_trainer_adapter
  reward_function:
    name: rl_accuracy
  num_epochs: 8
  output_dir: ./.tmp/nat/finetuning/tic_tac_toe

Important: With LoRA finetuning (the default), the ART backend registers each LoRA adapter in vLLM under the training run name (backend.name). The model_name in the LLM config must match this name so that inference requests are routed to the latest LoRA checkpoint. If model_name points to the base model (Qwen/Qwen2.5-3B-Instruct), every epoch will evaluate the unchanged base model, and GRPO training will have no effect.
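A quick sanity check for this naming rule can be run before training. The sketch below works on a plain dictionary that mirrors the relevant keys of the configuration above; `check_lora_routing` is a hypothetical helper, not a toolkit function, and it only covers the default LoRA case.

```python
def check_lora_routing(config: dict, llm_name: str, adapter_name: str) -> list[str]:
    """Return a list of problems with how the LLM's model_name routes to the ART backend."""
    model_name = config["llms"][llm_name]["model_name"]
    backend = config["trainer_adapters"][adapter_name]["backend"]
    problems = []
    if model_name == backend["base_model"]:
        problems.append("model_name points at the base model; LoRA updates will not be evaluated")
    elif model_name != backend["name"]:
        problems.append(
            f"model_name {model_name!r} does not match backend.name {backend['name']!r}"
        )
    return problems


# Only the keys needed for the check, copied from the configuration above.
config = {
    "llms": {"openpipe_llm": {"model_name": "tic_tac_toe_training_run"}},
    "trainer_adapters": {
        "openpipe_trainer_adapter": {
            "backend": {
                "name": "tic_tac_toe_training_run",
                "base_model": "Qwen/Qwen2.5-3B-Instruct",
            }
        }
    },
}
print(check_lora_routing(config, "openpipe_llm", "openpipe_trainer_adapter"))  # []
```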

3.2 Start Training

In your NeMo Agent Toolkit environment:

# This is a dummy key for local vLLM usage
export OPENAI_API_KEY=default
nat finetune --config_file examples/finetuning/rl_with_openpipe_art/configs/config.yml

3.3 Monitor Training Progress

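Training progress is logged to the console and saved to files. The sample output below is illustrative; the epoch total it reports (30) will instead match your num_epochs setting: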

INFO - Starting finetuning with config: src/rl_with_openpipe_art/configs/config.yml
INFO - Initializing OpenPipe ART Runner
INFO - Successfully registered with ART backend.
INFO - Starting finetuning run with 30 epochs

INFO - Starting epoch 1 for run art_run_a1b2c3d4
INFO - Starting 1 evaluation runs for run_id: art_run_a1b2c3d4
INFO - Built 48 trajectories across 48 examples
INFO - Epoch 1 progress logged - Avg Reward: 0.4523, Trajectories: 48
INFO - Training art_run_a1b2c3d4 completed successfully.
INFO - Completed epoch 1/30

INFO - Starting epoch 2 for run art_run_a1b2c3d4
...

Training typically takes upwards of 40 minutes for 10 epochs on an H100.

Step 4: Understanding the Reward Function

The reward function is the key to effective RL training. This example uses a reward based on alpha-beta pruning instead of simple win/loss signals.

4.1 Why Not Use Simple Win/Loss Rewards?

Simple win/loss rewards have significant problems for training:

Issue	Description
Sparsity	Reward only at game end (after 5-9 moves)
Credit assignment	Which moves caused the win/loss?
No gradient for draws	Draws give 0 reward, no learning signal
Binary signal	No difference between "barely won" and "dominated"

4.2 Alpha-Beta Pruning: A Primer

Alpha-beta pruning is a search algorithm that determines the optimal play in two-player games. It works by:

Building a game tree: All possible future moves and responses
Minimax evaluation: Assuming both players play optimally
Pruning branches: Skipping moves that can't affect the outcome

For Tic-Tac-Toe, alpha-beta can solve the entire game tree, determining:

Forced win: A position where perfect play guarantees victory
Forced loss: A position where the opponent can force a win
Drawn position: Neither player can force a win
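The three outcomes above can be reproduced with a small self-contained solver. This is a minimal sketch, not the code in `core.py`: it assumes a flat 9-cell board with `+1` for X, `-1` for O, and `0` for empty, and the names `LINES`, `winner`, and `solve` are illustrative.

```python
from typing import List

# All eight winning lines on a flat 3x3 board (indices 0..8, row-major).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board: List[int]) -> int:
    """Return +1 or -1 if that side has three in a row, otherwise 0."""
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0


def solve(board: List[int], side: int, player: int, alpha: int = -1, beta: int = 1) -> int:
    """Game value for `player` with `side` to move: +1 forced win, 0 draw, -1 forced loss."""
    won = winner(board)
    if won != 0:
        return 1 if won == player else -1
    moves = [i for i, cell in enumerate(board) if cell == 0]
    if not moves:
        return 0  # Board full with no winner: draw
    maximizing = side == player
    best = -1 if maximizing else 1
    for move in moves:
        board[move] = side                       # Try the move
        value = solve(board, -side, player, alpha, beta)
        board[move] = 0                          # Undo it
        if maximizing:
            best = max(best, value)
            alpha = max(alpha, best)
        else:
            best = min(best, value)
            beta = min(beta, best)
        if alpha >= beta:
            break                                # Remaining moves cannot change the result
    return best


print(solve([0] * 9, side=1, player=1))                        # 0: the empty board is a draw
print(solve([1, 1, 0, -1, -1, 0, 0, 0, 0], side=1, player=1))  # 1: X completes the top row
```

Because every value is one of -1, 0, or +1, the search window `[-1, 1]` is as wide as it ever needs to be, and a cut-off fires as soon as one side has secured its best possible result.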

4.3 The Reward Function Implementation

The reward function is implemented in two files:

`core.py`: Position Evaluation (`evaluate_board_for_player`)

Located at: src/rl_with_openpipe_art/core.py:110-285

def evaluate_board_for_player(board: np.ndarray, player_val: int) -> float:
    """
    Evaluate the position from the perspective of `player_val`.

    Output ranges:
      - Non-terminal positions: [0, 1] continuous
      - Forced future win: (1, 11] = base + 10
      - Immediate win: (1, 16] = base + 15
      - Forced loss or already lost: 0.0
    """

The function combines two components:

1. Static Heuristic Evaluation (continuous, no search):

def static_eval(b: np.ndarray) -> float:
    """Heuristic position evaluation in [-1, 1]."""
    # Count threats, control, position quality
    score_raw = (
        4.0 * (my_two_open - opp_two_open)    # Strong threats
        + 1.5 * (my_one_open - opp_one_open)  # Influence
        + 1.5 * center                         # Center control
        + 0.75 * corners.sum()                 # Corner control
        + 0.25 * edges.sum()                   # Edge control
    )
    return float(np.tanh(score_raw / 5.0))  # Squash to [-1, 1]
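The excerpt does not show how the threat counts are computed. One plausible reading, offered here as an assumption, is that a line is "open" for a player when it holds none of the opponent's marks, so `my_two_open` counts open lines with two of the player's marks and `my_one_open` counts open lines with one. The helper `count_open_lines` below is hypothetical and uses a flat 9-cell board with `+1`, `-1`, and `0`.

```python
from typing import List

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def count_open_lines(board: List[int], player: int, marks: int) -> int:
    """Count lines holding exactly `marks` of `player`'s pieces and none of the opponent's."""
    total = 0
    for line in LINES:
        cells = [board[i] for i in line]
        if cells.count(player) == marks and cells.count(-player) == 0:
            total += 1
    return total


# X on the two top-left cells, O in the center.
board = [1, 1, 0,
         0, -1, 0,
         0, 0, 0]
print(count_open_lines(board, player=1, marks=2))   # 1: the top row
print(count_open_lines(board, player=1, marks=1))   # 1: the left column
print(count_open_lines(board, player=-1, marks=1))  # 2: the middle row and the anti-diagonal
```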

2. Game-Theoretic Solver (alpha-beta search):

def solve_outcome(b: np.ndarray, side_to_move: int, alpha=-1.0, beta=1.0) -> float:
    """
    Full-depth minimax with alpha-beta pruning.
    Returns: +1 (forced win), 0 (draw), -1 (forced loss)
    """
    # Recursively evaluate all possible continuations
    # Prune branches that can't improve the result
    if side_to_move == player_val:
        # Maximizing: find best move for us
        for move in available_moves(b):
            best = max(best, solve_outcome(child, -side_to_move, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # Beta cut-off
    else:
        # Minimizing: opponent's best response
        for move in available_moves(b):
            best = min(best, solve_outcome(child, -side_to_move, alpha, beta))
            beta = min(beta, best)
            if alpha >= beta:
                break  # Alpha cut-off
    return best

3. Combined Reward Mapping:

Position Type	Reward Range	Description
Already lost	`0.0`	Terminal loss state
Forced future loss	`0.0`	Opponent can force win
Game-theoretic draw	`[0, 1]`	Continuous heuristic
Non-terminal (no forced outcome)	`[0, 1]`	Continuous heuristic
Forced future win	`base + 10`	`(1, 11]`
Immediate win (on board)	`base + 15`	`(1, 16]`
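Read literally, the table translates into a small mapping function. This sketch is a restatement of the table rather than the code in `core.py`; `map_reward` and its parameters are hypothetical, and `base` is assumed to be the heuristic already rescaled to `[0, 1]`.

```python
def map_reward(base: float, outcome: int, won_on_board: bool = False) -> float:
    """Map a heuristic base value in [0, 1] and a solver outcome to a reward.

    outcome: +1 forced win, 0 draw or no forced result, -1 forced or actual loss.
    won_on_board: True when the winning line is already on the board.
    """
    if not 0.0 <= base <= 1.0:
        raise ValueError("base must lie in [0, 1]")
    if outcome < 0:
        return 0.0          # Already lost or forced future loss
    if won_on_board:
        return base + 15.0  # Immediate win
    if outcome > 0:
        return base + 10.0  # Forced future win
    return base             # Draw or no forced outcome: heuristic only


print(map_reward(0.6, outcome=-1))                    # 0.0
print(map_reward(0.25, outcome=0))                    # 0.25
print(map_reward(0.5, outcome=1))                     # 10.5
print(map_reward(0.5, outcome=1, won_on_board=True))  # 15.5
```

Under this reading, forced-win rewards land in `[10, 11]` and immediate wins in `[15, 16]`, both of which sit inside the wider ranges listed in the table.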

`accuracy_evaluator.py`: Episode Reward Aggregation

Located at: src/rl_with_openpipe_art/accuracy_evaluator.py:39-72

@staticmethod
def episode_value_from_states(
    state_values: list[float],  # Rewards from each move
    gamma_base: float = 0.8,    # Temporal discount
    delta_bonus: float = 0.95,  # Bonus decay
) -> float:
    """Compute episode value with temporal discounting."""
    s = np.asarray(state_values, dtype=float)
    T = len(s) - 1

    # 1) Split into base [0,1] and bonus (>0 if forced/actual win)
    base = np.minimum(s, 1.0)
    bonus = np.maximum(s - 1.0, 0.0)

    # 2) Reverse-discounted base: weights grow toward the final move, so later moves matter more
    exponents = np.arange(T, -1, -1)  # T, T-1, ..., 0
    w = gamma_base ** exponents
    w = w / w.sum()
    R_base = float(np.dot(w, base))  # Weighted average in [0, 1]

    # 3) Bonus: max spike, time-decayed (reward early wins)
    if np.any(bonus > 0):
        bonus_weights = delta_bonus ** exponents
        U_time = float(np.max(bonus * bonus_weights))
    else:
        U_time = 0.0

    # 4) Final episode score
    return R_base + U_time
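To see how the weighting behaves on concrete numbers, here is a pure-Python restatement of the same arithmetic (no NumPy, so it can be run anywhere). The name `episode_value` is illustrative; the defaults are the ones shown above.

```python
from typing import List


def episode_value(state_values: List[float], gamma_base: float = 0.8,
                  delta_bonus: float = 0.95) -> float:
    """Pure-Python restatement of the aggregation above, for experimentation."""
    if not state_values:
        raise ValueError("state_values must not be empty")
    last = len(state_values) - 1
    exponents = [last - t for t in range(last + 1)]      # T, T-1, ..., 0
    base = [min(s, 1.0) for s in state_values]
    bonus = [max(s - 1.0, 0.0) for s in state_values]
    weights = [gamma_base ** e for e in exponents]       # Final state has weight 1
    r_base = sum(w * b for w, b in zip(weights, base)) / sum(weights)
    u_time = max(b * delta_bonus ** e for b, e in zip(bonus, exponents))
    return r_base + u_time


print(round(episode_value([0.5, 0.5, 0.5]), 4))  # 0.5: constant values average to themselves
print(round(episode_value([0.0, 1.0]), 4))       # 0.5556: a good final state
print(round(episode_value([1.0, 0.0]), 4))       # 0.4444: a good first state counts for less
print(round(episode_value([0.5, 10.5]), 4))      # 10.2778: base 0.7778 plus bonus 9.5
```

With `gamma_base` below 1, the exponent `T - t` shrinks toward the end of the game, so the final state always receives the largest weight.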

4.4 Why This Reward Function Works

Property	Benefit for RL
Continuous	Smooth gradients, stable training
Dense	Reward at every move, not just game end
Informative	Distinguishes good moves from great moves
Theoretically grounded	Based on perfect play analysis
Temporally weighted	Positions reached later in the game carry more weight
Bonus for winning	Strong signal to learn winning patterns

4.5 How Intermediate Steps Are Tracked

During each game, the workflow records move quality:

# In rl_with_openpipe_art.py
if current_player.name == self.role:
    # Record intermediate step with position value
    self.step_manager.push_intermediate_step(
        IntermediateStepPayload(
            event_type=IntermediateStepType.CUSTOM_END,
            name="agent_move",
            metadata={
                "step": turn_index,
                "value": evaluate_board_for_player(self.board, current_player.value)
            }
        )
    )

The evaluator then aggregates these step-level values into an episode reward.

Step 5: Viewing Training Logs and Metrics

5.1 Output Directory Structure

After training, check the output directory:

.tmp/nat/finetuning/tic_tac_toe/
├── training_metrics.jsonl      # Per-epoch metrics
├── reward_history.json         # Reward progression
└── reward_plot.png             # Visual reward chart

5.2 Training Metrics

The training_metrics.jsonl file contains detailed per-epoch data:

{
  "epoch": 0,
  "timestamp": "2025-01-15T10:30:45.123456",
  "run_id": "art_run_a1b2c3d4",
  "avg_reward": 0.4523,
  "min_reward": 0.0,
  "max_reward": 1.2341,
  "num_trajectories": 48,
  "num_groups": 48
}
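A few lines of Python are enough to pull the reward trend out of this file. The sketch assumes each non-empty line of `training_metrics.jsonl` holds one JSON object with the `epoch` and `avg_reward` fields shown above; the function names are illustrative.

```python
import json
from typing import Iterable, List, Tuple


def parse_metrics(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL content, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def load_metrics(path: str) -> List[dict]:
    """Read a metrics file such as .tmp/nat/finetuning/tic_tac_toe/training_metrics.jsonl."""
    with open(path, encoding="utf-8") as handle:
        return parse_metrics(handle)


def reward_trend(records: List[dict]) -> List[Tuple[int, float]]:
    """Return (epoch, avg_reward) pairs in epoch order."""
    ordered = sorted(records, key=lambda record: record["epoch"])
    return [(record["epoch"], record["avg_reward"]) for record in ordered]


sample_lines = [
    '{"epoch": 1, "avg_reward": 0.5}',
    "",
    '{"epoch": 0, "avg_reward": 0.4523}',
]
print(reward_trend(parse_metrics(sample_lines)))  # [(0, 0.4523), (1, 0.5)]
```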

5.3 Example Reward Plot

When training is complete, open reward_plot.png in the output directory to view the reward progression. The Y-axis shows average episode reward, and the X-axis shows epochs. The exact curve varies from run to run.

Step 6: Running Post-Training Evaluation

After training completes, evaluate the improved model.

6.1 Keep the ART Server Running

The ART server continues serving the finetuned model weights. Do not restart it, as the updated weights are in memory.

6.2 Run Post-Training Evaluation

# This is a dummy key for local vLLM usage
export OPENAI_API_KEY=default
nat eval --config_file examples/finetuning/rl_with_openpipe_art/configs/config_post_train.yml --reps 3

Compare the post-training win percentage against the pre-training baseline. You should see a notable improvement.

Note

Because reinforcement learning is stochastic, some training runs may lower performance instead of raising it. If that happens, rerun the training or follow the troubleshooting guide below.

Best Practices and Troubleshooting

Training Parameter Suggestions

Learning Rate

Value	Effect
`1e-7`	Very stable, slow learning
`1e-6`	Recommended starting point
`5e-6`	Faster learning, may be unstable
`1e-5`	Aggressive, risk of divergence

Number of Generations (`num_generations`)

trajectory_builders:
  openpipe_traj_builder:
    num_generations: 4  # Try 4-8 for better GRPO signal

More generations per example provide better comparison signal for GRPO but increase training time.

When num_generations is set to 1, the trajectory builder places all examples in the dataset into one large group. Increasing num_generations causes each input data point to be evaluated multiple times per epoch, which generates more trajectories and finer reward comparisons; each example then gets its own group.
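The reason grouping matters is that GRPO scores each trajectory relative to the other trajectories in its group. The sketch below shows that general idea with mean and standard deviation normalization; it is not OpenPipe ART's implementation, and both function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Score each reward relative to its group: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def trajectories_per_epoch(num_examples: int, num_generations: int) -> int:
    """Each example is played num_generations times per epoch."""
    return num_examples * num_generations


# Four games played from the same example (num_generations: 4)
print([round(a, 3) for a in group_advantages([0.2, 0.5, 0.5, 1.2])])
# Identical rewards carry no comparison signal: every advantage is zero
print(group_advantages([0.7, 0.7, 0.7]))
print(trajectories_per_epoch(48, 4))  # 192
```

A group whose members all score the same produces zero advantages, which is why several generations per example tend to give a more useful signal than one.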

Temperature

llms:
  openpipe_llm:
    temperature: 0.4  # Balance exploration/exploitation

Value	Effect
`0.0`	Deterministic, no exploration
`0.1`	Use for final evaluation (near-deterministic)
`0.2-0.4`	Recommended for training
`0.6+`	High exploration, noisier gradients

Number of Epochs

Start with 20-30 epochs and monitor the reward plot. Stop if:

Rewards plateau for 5+ epochs
Validation performance decreases (overfitting)
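The plateau rule can be checked automatically against the per-epoch average rewards. This is one possible definition, stated as an assumption: training has plateaued when the best reward in the last `patience` epochs fails to beat the earlier best by at least `min_delta`. The function name and the `min_delta` default are illustrative.

```python
from typing import List


def has_plateaued(avg_rewards: List[float], patience: int = 5, min_delta: float = 0.01) -> bool:
    """True when the last `patience` epochs did not improve on the earlier best by min_delta."""
    if len(avg_rewards) <= patience:
        return False  # Not enough history to judge
    best_before = max(avg_rewards[:-patience])
    best_recent = max(avg_rewards[-patience:])
    return best_recent < best_before + min_delta


print(has_plateaued([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]))  # False: still improving
print(has_plateaued([0.5, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6]))  # True: flat for five epochs
```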

Curriculum Learning (Optional)

Enable curriculum learning for more stable training:

finetuning:
  curriculum_learning:
    enabled: true
    initial_percentile: 0.3      # Start with easiest 30%
    increment_percentile: 0.2     # Add 20% each expansion
    expansion_interval: 5         # Expand every 5 epochs
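One way to read these three settings is as a step schedule for the fraction of the dataset in use. The sketch below encodes that reading; the formula and the 0-based epoch index are assumptions, so check the toolkit documentation for the exact semantics.

```python
def curriculum_fraction(epoch: int, initial_percentile: float = 0.3,
                        increment_percentile: float = 0.2,
                        expansion_interval: int = 5) -> float:
    """Fraction of the dataset in use at a given 0-based epoch, capped at 1.0."""
    expansions = epoch // expansion_interval
    return min(1.0, initial_percentile + increment_percentile * expansions)


for epoch in (0, 4, 5, 10, 15, 20):
    print(epoch, round(curriculum_fraction(epoch), 2))
```

Under these assumptions the full dataset is reached at epoch 20, so a run with fewer epochs than that would never train on the hardest examples.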

Common Issues

"Failed to connect to ART backend"

Cause: ART server not running or wrong port.

Solution:

# Check if ART server is running
curl http://localhost:7623/health
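If the request above fails, it helps to separate "nothing is listening on the port" from "the server is up but the endpoint misbehaves". A plain TCP check answers the first question; `port_is_open` is a hypothetical helper built on the standard library only.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 7623 is the ART server port and 8000 the internal vLLM port used in this example.
    for port in (7623, 8000):
        print(port, "open" if port_is_open("localhost", port) else "closed")
```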

"CUDA out of memory"

Cause: Insufficient GPU memory.

Solutions:

Reduce gpu_memory_utilization:

engine_args:
  gpu_memory_utilization: 0.7

Reduce max_seq_length:
```
init_args:
  max_seq_length: 4096
```
Reduce max_concurrency:
```
eval:
  general:
    max_concurrency: 8
```

"No trajectories collected"

Cause: Workflow not producing intermediate steps or evaluator errors.

Solutions:

Check workflow registration:
```
nat info --components
```
Verify evaluator is registered:
```
nat info --evaluators
```

Run a single game manually to debug:

nat eval --config_file=... --max_examples=1

"Model outputs invalid XML"

Cause: Model not following the prompt format.

Solutions:

Increase max_parser_retries:
```
workflow:
  max_parser_retries: 3
```
Lower temperature for more deterministic outputs
Check if base model supports the task (try a larger model)

"Rewards not improving"

Possible causes:

Learning rate too low: Try 5e-6
Not enough generations: Increase num_generations to 2-4
Model already optimal: Check if baseline is already high
Reward function issue: Verify evaluator is computing rewards correctly

Performance Optimization

Increase batch parallelism:

eval:
  general:
    max_concurrency: 32  # If GPU memory allows

Use multiple generations:

trajectory_builders:
  openpipe_traj_builder:
    num_generations: 4  # Better GRPO signal

Enable prefix caching (vLLM):

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-3B-Instruct \
    --enable-prefix-caching

File Reference

File	Description
`src/rl_with_openpipe_art/rl_with_openpipe_art.py`	Main workflow: game loop, player management
`src/rl_with_openpipe_art/core.py`	Game logic, board evaluation, alpha-beta solver
`src/rl_with_openpipe_art/llm_agents.py`	LLM player wrapper, move parsing, prompts
`src/rl_with_openpipe_art/accuracy_evaluator.py`	Reward computation, episode aggregation
`src/rl_with_openpipe_art/evaluator_register.py`	Evaluator registration
`src/rl_with_openpipe_art/register.py`	Workflow component registration
`configs/config.yml`	Training configuration
`configs/config_pre_train.yml`	Pre-training evaluation configuration
`configs/config_post_train.yml`	Post-training evaluation configuration
`data/data.json`	Training dataset
`data/eval_data.json`	Evaluation dataset
