This guide covers the integration between the NVIDIA NeMo Agent Toolkit finetuning harness and OpenPipe ART (Agent Reinforcement Trainer), an open-source framework for teaching LLMs through reinforcement learning.
OpenPipe ART is designed to improve agent performance and reliability through experience. It provides:
- GRPO Training: Uses Group Relative Policy Optimization, which compares multiple responses to the same prompt rather than requiring a separate value function
- Async Client-Server Architecture: Separates inference from training, allowing you to run inference anywhere while training happens on GPU infrastructure
- Easy Integration: Designed to work with existing LLM applications with minimal code changes
- Built-in Observability: Integrations with Weights & Biases, Langfuse, and OpenPipe for monitoring and debugging
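GRPO scores each response relative to the other responses sampled for the same prompt, rather than against a learned value function. A minimal sketch of that group-relative advantage (illustrative only; the normalization details here are assumptions, not ART's exact implementation):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one group of responses to the same prompt."""
    baseline = mean(rewards)         # the group mean replaces a value function
    spread = pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - baseline) / spread for r in rewards]

# Four sampled responses to one prompt, scored by a reward function:
adv = group_relative_advantages([0.9, 0.1, 0.5, 0.5])
print(adv)  # above-average responses get positive advantages, below-average negative
```

Responses that beat the group mean are reinforced; those below it are discouraged. A group where every response scores the same carries no learning signal, which is why filtering on reward variance matters during training.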
ART is well-suited for scenarios where:
- You want to improve agent reliability on specific tasks
- You have a way to score agent performance (even if you don't have "correct" answers)
- You're working with agentic workflows that make decisions or take actions
- You want to iterate quickly with online training
```text
┌─────────────────────────────────────────────────────────────────────────┐
│ Your Application │
│ │
│ ┌─────────────────────┐ │
│ │ Workflow │ ◄──── Uses model for inference │
│ └─────────────────────┘ │
│ │ │
│ │ Trajectories │
│ ▼ │
│ ┌─────────────────────┐ ┌─────────────────────────────────┐ │
│ │ ARTTrajectoryBuilder│────────►│ ART Backend Server │ │
│ │ ARTTrainerAdapter │ │ │ │
│ └─────────────────────┘ │ ┌─────────────────────────────┐│ │
│ │ │ │ vLLM Inference Engine ││ │
│ │ Training request │ │ (serves updated weights) ││ │
│ │ │ └─────────────────────────────┘│ │
│ └─────────────────────►│ ┌─────────────────────────────┐│ │
│ │ │ GRPO Trainer (TorchTune) ││ │
│ │ │ (updates model weights) ││ │
│ │ └─────────────────────────────┘│ │
│ └─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
```
The ART backend runs on GPU infrastructure and provides:
- vLLM Inference Engine: Serves the model for inference with log probability support
- GRPO Trainer: Performs weight updates based on submitted trajectories
NeMo Agent Toolkit connects to this backend through the `ARTTrainerAdapter`, which handles the protocol for submitting trajectories and monitoring training.
The following table highlights the current support matrix for using ART with different agent frameworks in the NeMo Agent Toolkit:
| Agent Framework | Support |
|---|---|
| LangChain or LangGraph | ✅ Supported |
| Google ADK | ✅ Supported |
| LlamaIndex | ✅ Supported |
| All others | 🛠️ In Progress |
Install the OpenPipe ART plugin package:
```shell
pip install nvidia-nat-openpipe-art
```

This provides:

- `openpipe_art_traj_builder`: The trajectory builder implementation
- `openpipe_art_trainer_adapter`: The trainer adapter for ART
- `openpipe_art_trainer`: The trainer orchestrator
You'll also need to set up an ART backend server. See the ART documentation for server setup instructions.
```yaml
llms:
  training_llm:
    _type: openai
    model_name: Qwen/Qwen2.5-3B-Instruct
    base_url: http://localhost:8000/v1  # ART inference endpoint
    api_key: default
    temperature: 0.4

workflow:
  _type: my_workflow
  llm: training_llm

eval:
  general:
    max_concurrency: 16
    output_dir: .tmp/nat/finetuning/eval
    dataset:
      _type: json
      file_path: data/training_data.json
  evaluators:
    my_reward:
      _type: my_custom_evaluator

trajectory_builders:
  art_builder:
    _type: openpipe_art_traj_builder
    num_generations: 2

trainer_adapters:
  art_adapter:
    _type: openpipe_art_trainer_adapter
    backend:
      ip: "localhost"
      port: 7623
      name: "my_training_run"
      project: "my_project"
      base_model: "Qwen/Qwen2.5-3B-Instruct"
      api_key: "default"
    training:
      learning_rate: 1e-6

trainers:
  art_trainer:
    _type: openpipe_art_trainer

finetuning:
  enabled: true
  trainer: art_trainer
  trajectory_builder: art_builder
  trainer_adapter: art_adapter
  reward_function:
    name: my_reward
  num_epochs: 20
  output_dir: .tmp/nat/finetuning/output
```

```yaml
trajectory_builders:
  art_builder:
    _type: openpipe_art_traj_builder
    num_generations: 2  # Trajectories per example
```

| Field | Type | Default | Description |
|---|---|---|---|
| `num_generations` | int | `2` | Number of trajectory generations per example. More generations provide a better GRPO signal but increase computation time. |
```yaml
trainer_adapters:
  art_adapter:
    _type: openpipe_art_trainer_adapter
    backend:
      ip: "0.0.0.0"
      port: 7623
      name: "training_run_name"
      project: "project_name"
      base_model: "Qwen/Qwen2.5-3B-Instruct"
      api_key: "default"
      delete_old_checkpoints: false
      init_args:
        max_seq_length: 8192
      engine_args:
        gpu_memory_utilization: 0.9
        tensor_parallel_size: 1
    training:
      learning_rate: 1e-6
      beta: 0.0
```

**Backend Configuration**

| Field | Type | Default | Description |
|---|---|---|---|
| `ip` | str | - | IP address of the ART backend server |
| `port` | int | - | Port of the ART backend server |
| `name` | str | `"trainer_run"` | Name for this training run |
| `project` | str | `"trainer_project"` | Project name for organization |
| `base_model` | str | `"Qwen/Qwen2.5-7B-Instruct"` | Base model being trained (must match the server) |
| `api_key` | str | `"default"` | API key for authentication |
| `delete_old_checkpoints` | bool | `false` | Delete old checkpoints before training |

**Model Initialization Arguments (`init_args`)**

| Field | Type | Default | Description |
|---|---|---|---|
| `max_seq_length` | int | - | Maximum sequence length for the model |

**vLLM Engine Arguments (`engine_args`)**

| Field | Type | Default | Description |
|---|---|---|---|
| `gpu_memory_utilization` | float | - | Fraction of GPU memory to use (0.0-1.0) |
| `tensor_parallel_size` | int | - | Number of GPUs for tensor parallelism |

**Training Arguments**

| Field | Type | Default | Description |
|---|---|---|---|
| `learning_rate` | float | `5e-5` | Learning rate for GRPO updates |
| `beta` | float | `0.0` | KL penalty coefficient |
```yaml
trainers:
  art_trainer:
    _type: openpipe_art_trainer
```

The trainer has no additional configuration options; it uses the shared finetuning configuration.
The `ARTTrajectoryBuilder` collects training trajectories through the NeMo Agent Toolkit evaluation system:
```text
┌─────────────────────────────────────────────────────────────────────────┐
│ ARTTrajectoryBuilder Flow │
│ │
│ start_run() │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Launch N parallel evaluation runs (num_generations) │ │
│ │ │ │
│ │ Each run: │ │
│ │ 1. Loads the training dataset │ │
│ │ 2. Runs the workflow on each example │ │
│ │ 3. Captures intermediate steps (with logprobs from LLM calls) │ │
│ │ 4. Computes reward using configured evaluator │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ finalize() │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Wait for all evaluation runs to complete │ │
│ │ │ │
│ │ For each result: │ │
│ │ 1. Extract reward from evaluator output │ │
│ │ 2. Filter intermediate steps to target functions │ │
│ │ 3. Parse steps into OpenAI message format │ │
│ │ 4. Validate assistant messages have logprobs │ │
│ │ 5. Group trajectories by example ID │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Return TrajectoryCollection │
│ (grouped by example for GRPO) │
└─────────────────────────────────────────────────────────────────────────┘
```
Key Implementation Details:

- **Parallel Generation**: Multiple evaluation runs execute concurrently using `asyncio.create_task()`. This generates diverse trajectories for the same inputs.
- **Log Probability Extraction**: The builder parses intermediate steps to extract log probabilities from LLM responses. Messages without logprobs are skipped because they cannot be used for training.
- **Target Function Filtering**: Only steps from functions listed in `finetuning.target_functions` are included. This lets you focus training on specific parts of complex workflows.
- **Grouping for GRPO**: Trajectories are organized as `list[list[Trajectory]]`, where each inner list contains all generations for a single example. This structure enables group-relative policy optimization.
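The grouping step can be sketched as follows. This is a minimal sketch with assumed dict-shaped trajectories, not the toolkit's actual classes:

```python
from collections import defaultdict

def group_by_example(trajectories: list[dict]) -> list[list[dict]]:
    """Collect all generations for the same example into one inner list."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for traj in trajectories:
        groups[traj["example_id"]].append(traj)
    return list(groups.values())

# Two generations each for two examples:
trajs = [
    {"example_id": "ex1", "reward": 0.8},
    {"example_id": "ex2", "reward": 0.2},
    {"example_id": "ex1", "reward": 0.4},
    {"example_id": "ex2", "reward": 0.6},
]
grouped = group_by_example(trajs)
print([len(g) for g in grouped])  # each inner list holds one example's generations
```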
The `ARTTrainerAdapter` converts NeMo Agent Toolkit trajectories to ART's format and manages training:
```text
┌─────────────────────────────────────────────────────────────────────────┐
│ ARTTrainerAdapter Flow │
│ │
│ initialize() │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ 1. Create ART Backend client │ │
│ │ 2. Create TrainableModel with configuration │ │
│ │ 3. Register model with backend │ │
│ │ 4. Verify backend health │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ submit(trajectories) │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ 1. Validate episode ordering │ │
│ │ - First message: user or system │ │
│ │ - No consecutive assistant messages │ │
│ │ │ │
│ │ 2. Convert to ART TrajectoryGroup format │ │
│ │ - EpisodeItem → dict or Choice │ │
│ │ - Include logprobs in Choice objects │ │
│ │ │ │
│ │ 3. Submit via model.train() (async) │ │
│ │ │ │
│ │ 4. Return TrainingJobRef for tracking │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ wait_until_complete() │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Poll task status until done │ │
│ │ Return final TrainingJobStatus │ │
│ └────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
```
Key Implementation Details:

- **ART Client Management**: The adapter maintains an `art.Backend` client and an `art.TrainableModel` instance that persist across epochs.
- **Trajectory Conversion**: NeMo Agent Toolkit `Trajectory` objects are converted to ART's `art.Trajectory` format:

  ```python
  # NeMo Agent Toolkit format
  EpisodeItem(role=EpisodeItemRole.ASSISTANT, content="...", logprobs=...)

  # Converted to ART format
  Choice(index=0, logprobs=..., message={"role": "assistant", "content": "..."}, finish_reason="stop")
  ```

- **Message Validation**: The adapter validates that conversations follow expected patterns (user or system message first, no consecutive assistant messages).
- **Async Training**: Training is submitted as an async task, allowing the trainer to monitor progress without blocking.
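The ordering rules above can be sketched as a small validator. This is a hypothetical helper for illustration, not the adapter's actual code:

```python
def validate_episode(messages: list[dict]) -> bool:
    """Check the conversation pattern the adapter expects:
    the first message is user or system, and no two assistant
    messages appear back to back."""
    if not messages or messages[0]["role"] not in ("user", "system"):
        return False
    for prev, curr in zip(messages, messages[1:]):
        if prev["role"] == "assistant" and curr["role"] == "assistant":
            return False
    return True

ok = [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]
bad = [{"role": "assistant", "content": "hello"}]
print(validate_episode(ok), validate_episode(bad))  # True False
```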
The `ARTTrainer` orchestrates the complete training loop:
```text
┌─────────────────────────────────────────────────────────────────────────┐
│ ARTTrainer Flow │
│ │
│ initialize() │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ 1. Generate unique run ID │ │
│ │ 2. Initialize trajectory builder │ │
│ │ 3. Initialize trainer adapter │ │
│ │ 4. Set up curriculum learning state │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ run(num_epochs) │
│ │ │
│ for epoch in range(num_epochs): │
│ │ │
│ ├─── Validation (if interval reached) ─────────────────────────┐ │
│ │ │ │
│ │ ┌───────────────────────────────────────────────────────┐ │ │
│ │ │ Run evaluation on validation dataset │ │ │
│ │ │ Record metrics (avg_reward, etc.) │ │ │
│ │ │ Store in validation history │ │ │
│ │ └───────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ ◄──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ├─── run_epoch() ──────────────────────────────────────────────┐ │
│ │ │ │
│ │ ┌───────────────────────────────────────────────────────┐ │ │
│ │ │ 1. Start trajectory collection │ │ │
│ │ │ 2. Finalize and compute metrics │ │ │
│ │ │ 3. Apply curriculum learning (filter groups) │ │ │
│ │ │ 4. Submit to trainer adapter │ │ │
│ │ │ 5. Log progress and generate plots │ │ │
│ │ └───────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ ◄──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ├─── Wait for training to complete │
│ │ │
│ └─── Check status, break on failure │
│ │
│ Return list of TrainingJobStatus │
└─────────────────────────────────────────────────────────────────────────┘
```
Key Implementation Details:

- **Curriculum Learning**: The trainer implements curriculum learning to progressively include harder examples:
  - Groups trajectories by average reward
  - Filters out groups with insufficient variance (no learning signal)
  - Starts with the easiest fraction and expands at intervals
- **Validation**: Optionally runs evaluation on a separate validation dataset to monitor generalization.
- **Progress Visualization**: Generates reward plots (`reward_plot.png`) showing training and validation reward progression.
- **Metrics Logging**: Writes detailed metrics to JSONL files for analysis:
  - `training_metrics.jsonl`: Per-epoch metrics
  - `reward_history.json`: Reward progression
  - `curriculum_state.json`: Curriculum learning state
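The curriculum pass described above can be sketched as follows. This is an illustrative sketch only: groups are represented as plain lists of rewards, and the selection details are assumptions rather than the trainer's exact logic:

```python
from statistics import mean

def curriculum_filter(groups: list[list[float]],
                      min_reward_diff: float = 0.1,
                      keep_fraction: float = 0.3) -> list[list[float]]:
    """Keep groups with enough reward variance, then take the easiest fraction."""
    # Drop groups where all generations scored (nearly) the same: no GRPO signal.
    varied = [g for g in groups if max(g) - min(g) >= min_reward_diff]
    # Sort easiest first (highest average reward) and keep a fraction.
    varied.sort(key=mean, reverse=True)
    keep = max(1, int(len(varied) * keep_fraction)) if varied else 0
    return varied[:keep]

# Four example groups, two generations each:
groups = [[0.9, 0.1], [0.5, 0.5], [0.8, 0.6], [0.2, 0.0]]
print(curriculum_filter(groups, keep_fraction=0.5))
```

The `[0.5, 0.5]` group is discarded outright: identical rewards give the group-relative objective nothing to optimize.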
- **ART Backend Server**: You need a running ART server with your model loaded. See the ART documentation for setup.
- **LLM with Logprobs**: Your LLM must return log probabilities. For vLLM, use the `--enable-log-probs` flag.
- **Training Dataset**: A JSON/JSONL dataset with your training examples.
- **Reward Function**: An evaluator that can score workflow outputs.
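Conceptually, any callable that maps a workflow output to a score can serve as the reward signal, even without a single correct answer. Below is a hypothetical keyword-coverage reward with mild length shaping; it is illustrative only and not tied to the toolkit's evaluator API:

```python
def reward(output: str, expected_keywords: list[str]) -> float:
    """Score an output by keyword coverage, lightly penalizing verbosity."""
    hits = sum(1 for kw in expected_keywords if kw.lower() in output.lower())
    coverage = hits / len(expected_keywords) if expected_keywords else 0.0
    length_penalty = min(len(output) / 2000, 0.2)  # cap the penalty at 0.2
    return max(0.0, coverage - length_penalty)

print(reward("The capital of France is Paris.", ["Paris", "France"]))
```

Smooth, shaped rewards like this tend to give GRPO more signal than a binary pass/fail score, because groups are more likely to contain reward variance.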
You must have the OpenPipe ART plugin (`nvidia-nat-openpipe-art`) installed and an OpenPipe ART server running and configured to accept training jobs.
```shell
# Basic training
nat finetune --config_file=configs/finetune.yml

# With validation
nat finetune --config_file=configs/finetune.yml \
  --validation_dataset=data/val.json \
  --validation_interval=5

# Override epochs
nat finetune --config_file=configs/finetune.yml \
  -o finetuning.num_epochs 50
```

During training, check:
- **Console Output**: Shows epoch progress, reward statistics, and trajectory counts.
- **Metrics Files**: In your `output_dir`:
  - `training_metrics.jsonl`: Detailed per-epoch metrics
  - `reward_plot.png`: Visual reward progression
  - `reward_history.json`: Raw reward data
- **ART Server Logs**: Training progress from the ART side.
Example console output:

```text
INFO - Starting epoch 1 for run art_run_a1b2c3d4
INFO - Starting 2 evaluation runs for run_id: art_run_a1b2c3d4
INFO - Built 100 trajectories across 50 examples for run_id: art_run_a1b2c3d4
INFO - Submitted 100 trajectories in 50 groups for training
INFO - Epoch 1 progress logged - Avg Reward: 0.4523, Trajectories: 100
INFO - Training art_run_a1b2c3d4 completed successfully.
INFO - Completed epoch 1/20
```
For larger models, configure tensor parallelism:
```yaml
trainer_adapters:
  art_adapter:
    _type: openpipe_art_trainer_adapter
    backend:
      engine_args:
        tensor_parallel_size: 2  # Use 2 GPUs
        gpu_memory_utilization: 0.85
```

If you encounter OOM errors:
```yaml
trainer_adapters:
  art_adapter:
    _type: openpipe_art_trainer_adapter
    backend:
      init_args:
        max_seq_length: 4096  # Reduce sequence length
      engine_args:
        gpu_memory_utilization: 0.7  # Leave more headroom
```

Enable curriculum learning to improve training stability:
```yaml
finetuning:
  curriculum_learning:
    enabled: true
    initial_percentile: 0.3    # Start with the easiest 30%
    increment_percentile: 0.2  # Add 20% each expansion
    expansion_interval: 5      # Expand every 5 epochs
    min_reward_diff: 0.1       # Filter out no-variance groups
    sort_ascending: false      # Easy-to-hard
```

For multi-component workflows, target specific functions:
```yaml
finetuning:
  target_functions:
    - my_agent_function
    - tool_calling_function
  target_model: training_llm  # Only include steps from this model
```

**"Failed to connect to ART backend"**
- Verify the server is running: `curl http://localhost:7623/health`
- Check the IP and port in your configuration
- Verify network connectivity (firewalls, etc.)
**"No valid assistant messages with logprobs"**

- Ensure your LLM provider returns logprobs
- For vLLM: verify the `--enable-log-probs` flag
- Check your LLM configuration
**"CUDA out of memory"**

- Reduce `gpu_memory_utilization`
- Reduce `max_seq_length`
- Reduce `num_generations` (fewer parallel trajectories)
- Increase `tensor_parallel_size` (distribute across GPUs)
**"No trajectories collected for epoch"**

- Check that `target_functions` matches your workflow
- Verify that the workflow produces intermediate steps
- Check that the evaluator is returning rewards
- Look for errors in the evaluation logs
**Rewards not increasing**

- Increase `num_generations` for a better GRPO signal
- Try curriculum learning to focus on learnable examples
- Adjust the learning rate
- Verify the reward function is well-calibrated
- Check for sufficient variance in trajectory groups
The `examples/finetuning/rl_with_openpipe_art` directory contains a complete working example demonstrating:
- Custom workflow with intermediate step tracking
- Custom reward evaluator with reward shaping
- Full configuration for ART integration
- Training and evaluation datasets
See the example's README for detailed instructions.
- Finetuning Concepts - Core concepts and RL fundamentals
- Extending the Finetuning Harness - Creating custom components
- OpenPipe ART Documentation - Official ART documentation
- Custom Evaluators - Creating reward functions