:::{warning}
Experimental Feature: The Finetuning Harness is experimental and may change in future releases. Future versions may introduce breaking changes without notice.
:::
The NeMo Agent Toolkit provides a powerful finetuning harness designed for in-situ reinforcement learning of agentic LLM workflows. This guide introduces the foundational concepts, explains the design philosophy, and provides the background knowledge needed to effectively use the harness.
Finetuning is the process of taking a pre-trained language model and further training it on a specific task or domain. Unlike training from scratch, finetuning leverages the knowledge the model already has and adapts it for your particular use case.
There are several approaches to finetuning:
| Approach | Description | Use Case |
|---|---|---|
| Supervised Fine-Tuning (SFT) | Train on input-output pairs with known correct answers | When you have labeled examples of desired behavior |
| Reinforcement Learning (RL) | Train based on reward signals from outcomes | When you can evaluate quality but don't have "correct" answers |
| Direct Preference Optimization (DPO) | Train on pairs of preferred vs. rejected outputs | When you have human preference data |
| RLHF | RL guided by a learned reward model from human feedback | Complex alignment tasks |
The finetuning harness is designed primarily for reinforcement learning approaches, where agents learn through trial and error based on reward signals.
To understand the finetuning harness, you need to understand core RL concepts. This section explains them in the context of LLM agents.
Reinforcement learning is a paradigm where an agent learns to make decisions by interacting with an environment and receiving rewards.
```text
┌─────────────────────────────────────────────────────────────────┐
│                         The RL Loop                             │
│                                                                 │
│   ┌─────────┐    action     ┌─────────────┐                     │
│   │  Agent  │ ───────────►  │ Environment │                     │
│   │  (LLM)  │               │ (Task/API)  │                     │
│   └─────────┘ ◄───────────  └─────────────┘                     │
│        ▲        state, reward                                   │
│        │                                                        │
│        └──── Agent updates policy based on rewards              │
└─────────────────────────────────────────────────────────────────┘
```
In the context of LLM agents:
- Agent: The language model making decisions (generating text, calling tools, etc.)
- Environment: The task, tools, APIs, or simulated world the agent interacts with
- State: The current context (conversation history, tool outputs, etc.)
- Action: The agent's response (generated text, tool call, decision)
- Reward: A numerical signal indicating how well the agent performed
A policy is the agent's strategy for choosing actions given a state. For LLMs, the policy is essentially the model's probability distribution over possible next tokens given the conversation history.
When we finetune an LLM with RL, we're adjusting its policy to favor actions that lead to higher rewards.
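This update rule can be sketched as a minimal REINFORCE-style loss (a hypothetical illustration, not the harness's actual training code; `token_logprobs` and `reward` are assumed inputs):

```python
def reinforce_loss(token_logprobs: list[float], reward: float) -> float:
    """REINFORCE-style loss for one sampled generation.

    Minimizing -reward * sum(logprobs) raises the probability of the
    generation when the reward is positive and lowers it when negative.
    """
    return -reward * sum(token_logprobs)

# The same generation, rewarded vs. penalized
reinforce_loss([-0.1, -0.2, -0.05], reward=1.0)
reinforce_loss([-0.1, -0.2, -0.05], reward=-1.0)
```

In practice the gradient of this loss with respect to the model parameters is what adjusts the policy; frameworks compute it automatically from the logprobs.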
An episode is a complete interaction from start to finish. In a conversational agent, an episode might be:
- User asks a question
- Agent thinks and calls tools
- Agent receives tool results
- Agent formulates a response
- User provides feedback or the task completes
A trajectory (also called a rollout) is the recorded sequence of everything that happened during an episode:
Trajectory = [State₀, Action₀, Reward₀, State₁, Action₁, Reward₁, ..., Stateₙ, Actionₙ, Rewardₙ]
For LLM agents, a trajectory typically looks like:
```python
trajectory = [
    {"role": "system", "content": "You are a helpful assistant..."},
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": "<tool_call>get_weather('Paris')</tool_call>"},
    {"role": "tool", "content": "Sunny, 22°C"},
    {"role": "assistant", "content": "The weather in Paris is sunny at 22°C."},
]
# Final reward: 1.0 (correct answer)
```

:::{note}
Trajectory vs. Rollout: These terms are often used interchangeably. "Rollout" emphasizes the process of generating the sequence (rolling out the policy), while "trajectory" emphasizes the recorded data. In NeMo Agent Toolkit, we use "trajectory" for the data structure.
:::
A reward is the immediate feedback signal after an action. Rewards can be:
- Sparse: Only given at the end (e.g., task success = 1, failure = 0)
- Dense: Given at each step (e.g., partial credit for intermediate progress)
The return is the total accumulated reward over an episode, often with discounting:
Return = R₀ + γR₁ + γ²R₂ + ... + γⁿRₙ
Where γ (gamma) is the discount factor (typically 0.9-0.99). Discounting means:
- Immediate rewards are worth more than future rewards
- Prevents infinite returns in continuing tasks
- Encourages efficient solutions
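The return formula can be sketched as a small helper (illustrative only, not part of the toolkit):

```python
def discounted_return(rewards: list[float], gamma: float = 0.99) -> float:
    """Compute R0 + gamma*R1 + gamma^2*R2 + ... for one episode."""
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total

# Sparse reward: only the final step pays off, discounted by episode length
discounted_return([0.0, 0.0, 0.0, 1.0], gamma=0.9)  # 0.9**3 ≈ 0.729
```

Note how a longer episode shrinks the discounted value of the same final reward, which is what pushes the agent toward shorter solutions.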
One of the hardest problems in RL is credit assignment: figuring out which actions were responsible for the final outcome.
If your agent had a 10-step conversation and got a reward at the end, which of those 10 steps were good? Which were bad? This is particularly challenging for LLM agents with long conversations.
Common approaches:
- Outcome-based: Assign the same reward to all steps (simple but noisy)
- Reward shaping: Provide intermediate rewards for good behaviors
- Advantage estimation: Use value functions to estimate which actions were better than expected
The harness supports reward shaping through intermediate step metadata, allowing you to record step-quality signals during execution.
- On-policy: The agent learns from trajectories generated by its current policy. The data must be "fresh" because old trajectories were generated by a different policy.
- Off-policy: The agent can learn from trajectories generated by any policy, including old versions or even other agents.
Most modern LLM RL methods (like GRPO, PPO) are on-policy, meaning you need to regenerate trajectories after each training update. This is why the harness runs evaluation (to collect trajectories) at the start of each epoch.
GRPO is the algorithm used by OpenPipe ART. Instead of comparing actions to a baseline value function, GRPO compares multiple responses to the same prompt:
```text
Given prompt P, generate N responses: [R₁, R₂, ..., Rₙ]
Score each response: [S₁, S₂, ..., Sₙ]
Learn to increase probability of high-scoring responses
Learn to decrease probability of low-scoring responses
```
This is why the harness groups trajectories by example ID—each group contains multiple generations for the same input, enabling GRPO optimization.
Advantages of GRPO:
- No need to train a separate value function
- More stable than PPO for language tasks
- Natural fit for LLM generation (sample multiple completions)
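The grouping idea can be sketched as group-normalized advantages (a simplified illustration; real GRPO implementations may normalize differently):

```python
from statistics import mean, stdev

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: score each response against its own group,
    so above-average responses get a positive advantage and below-average
    responses a negative one. No separate value function is required."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 1.0
    if sigma == 0:
        return [0.0 for _ in rewards]  # identical rewards -> no learning signal
    return [(r - mu) / sigma for r in rewards]

# Four generations for the same prompt
group_advantages([0.9, 0.7, 0.95, 0.85])
```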
PPO is a popular RL algorithm that constrains policy updates to prevent large changes:
- Collect trajectories with current policy
- Compute advantages (how much better/worse than expected)
- Update policy, but clip updates to stay close to the old policy
- Repeat
PPO requires a value function (critic) that estimates expected returns, adding complexity compared to GRPO.
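The clipping step can be sketched for a single action (a simplification; real PPO operates on batches of token-level ratios):

```python
import math

def ppo_clip_objective(logp_new: float, logp_old: float,
                       advantage: float, eps: float = 0.2) -> float:
    """Clipped surrogate objective for one action.

    The probability ratio is clipped to [1 - eps, 1 + eps], so a single
    update cannot move the policy far from the one that collected the data."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)

# A large probability increase gets clipped to 1.2x the advantage
ppo_clip_objective(logp_new=0.0, logp_old=-1.0, advantage=1.0)
```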
DPO sidesteps RL entirely by treating preference learning as a classification problem:
- Given pairs of (preferred, rejected) responses
- Train the model to increase probability of preferred response
- Simultaneously decrease probability of rejected response
DPO is simpler than RL methods but requires preference data rather than reward signals.
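The per-pair objective can be sketched as follows (a simplified illustration; `beta` and the frozen reference-model logprobs follow the standard DPO formulation):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    The model is pushed to widen the (chosen - rejected) log-probability
    margin relative to a frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid

# No margin yet: loss starts at -log(0.5)
dpo_loss(0.0, 0.0, 0.0, 0.0)
```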
Curriculum learning is a training strategy inspired by how humans learn: starting with easy examples and gradually introducing harder ones.
Without curriculum learning, your model trains on all examples equally. This can cause problems:
- Easy examples dominate: If 90% of examples are easy, the model focuses on those
- Hard examples cause instability: Difficult examples with high variance can destabilize training
- Inefficient learning: Time spent on already-mastered examples is wasted
A typical easy-to-hard schedule:

```text
Epoch 1-5:   Train on easiest 30% of examples
Epoch 6-10:  Train on easiest 50% of examples
Epoch 11-15: Train on easiest 70% of examples
Epoch 16+:   Train on all examples
```
The harness determines difficulty by the average reward achieved on each example group. Examples where the model already performs well are "easy"; examples where it struggles are "hard."
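This selection logic can be sketched as follows (illustrative only; the harness's actual filtering also applies `min_reward_diff` and the expansion schedule):

```python
def select_curriculum_groups(group_rewards: list[list[float]],
                             percentile: float) -> list[int]:
    """Pick indices of the 'easiest' groups: those with the highest average
    reward, keeping the top `percentile` fraction (easy-to-hard order)."""
    avg = [sum(r) / len(r) for r in group_rewards]
    order = sorted(range(len(avg)), key=lambda i: avg[i], reverse=True)
    keep = max(1, round(len(order) * percentile))
    return order[:keep]

# Three example groups; a ~30% percentile keeps only the easiest one
select_curriculum_groups([[0.2, 0.4], [0.9, 0.8], [0.5, 0.6]], 0.34)  # [1]
```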
```yaml
finetuning:
  curriculum_learning:
    enabled: true
    initial_percentile: 0.3    # Start with easiest 30%
    increment_percentile: 0.2  # Add 20% more each expansion
    expansion_interval: 5      # Expand every 5 epochs
    min_reward_diff: 0.1       # Skip groups with no variance
    sort_ascending: false      # false = easy-to-hard
```

Key parameters:
| Parameter | Description |
|---|---|
| `initial_percentile` | Fraction of examples to start with (0.0-1.0) |
| `increment_percentile` | How much to add at each expansion |
| `expansion_interval` | Epochs between expansions |
| `min_reward_diff` | Minimum reward variance to include a group |
| `sort_ascending` | `true` for hard-to-easy, `false` for easy-to-hard |
The `min_reward_diff` parameter is crucial. If all trajectories for an example receive the same reward, there is no learning signal: the model cannot learn what is better or worse.
```text
Example A: Trajectories with rewards [0.8, 0.9, 0.7, 0.85]
→ Variance exists, model can learn to prefer 0.9 trajectory

Example B: Trajectories with rewards [1.0, 1.0, 1.0, 1.0]
→ No variance, all trajectories equally good, no learning signal
→ Filtered out if reward_diff < min_reward_diff
```
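The filter can be sketched as a spread check (illustrative; shown here as a max-min spread, which may differ from the harness's exact variance test):

```python
def has_learning_signal(rewards: list[float], min_reward_diff: float = 0.1) -> bool:
    """A group is trainable only if its rewards actually differ: with
    identical rewards there is nothing to prefer, so the group is dropped."""
    return (max(rewards) - min(rewards)) >= min_reward_diff

has_learning_signal([0.8, 0.9, 0.7, 0.85])  # True  (Example A)
has_learning_signal([1.0, 1.0, 1.0, 1.0])   # False (Example B)
```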
Log probabilities (logprobs) are essential for policy gradient methods. When the model generates a token, it assigns probabilities to all possible tokens. The logprob is the log of that probability.
Policy gradient methods update the model by:
- Looking at what the model generated
- Checking the probability it assigned to that generation
- Increasing/decreasing that probability based on reward
:::{note}
Without logprobs, we can't compute this gradient. This is why:

- The harness requires logprobs for assistant messages
- Your LLM inference endpoint must return logprobs
- Trajectories without logprobs are filtered out during training
:::
For OpenAI-compatible APIs:

```python
response = client.chat.completions.create(
    model="your-model",
    messages=messages,
    logprobs=True,   # Enable logprobs
    top_logprobs=5,  # How many alternative tokens to return
)
```

For vLLM, logprobs are requested per call through the same OpenAI-compatible API; the server only needs to allow enough logprobs per token (capped by `--max-logprobs`):

```bash
# Start the vLLM OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
    --model your-model \
    --max-logprobs 5
```

The finetuning harness is built on three foundational principles:
The harness is intentionally decoupled from training backends and optimization algorithms. This separation allows:
- Backend Flexibility: Train with any RL backend (OpenPipe ART, NeMo Aligner, custom implementations)
- Algorithm Agnosticism: Support GRPO, PPO, DPO, or SFT without code changes
- Infrastructure Independence: Run locally, on cloud GPUs, or across distributed clusters
The decoupling is achieved through abstract interfaces that define what needs to happen, not how:
```python
from abc import ABC, abstractmethod

# Interface defines the contract
class TrainerAdapter(ABC):

    @abstractmethod
    async def submit(self, trajectories: TrajectoryCollection) -> TrainingJobRef:
        """Submit trajectories for training."""
        raise NotImplementedError

# Implementation handles the specifics
class ARTTrainerAdapter(TrainerAdapter):

    async def submit(self, trajectories: TrajectoryCollection) -> TrainingJobRef:
        ...  # Convert to ART format, submit to ART server, return job reference
```

The harness uses a three-component architecture that separates concerns:
```text
┌─────────────────────────────────────────────────────────────────────────┐
│                              Trainer                                    │
│        (Orchestrates the entire finetuning loop across epochs)          │
│                                                                         │
│   ┌───────────────────────┐         ┌───────────────────────────┐       │
│   │  TrajectoryBuilder    │         │     TrainerAdapter        │       │
│   │                       │         │                           │       │
│   │ - Runs evaluations    │ ──────► │ - Validates trajectories  │       │
│   │ - Collects episodes   │         │ - Submits to backend      │       │
│   │ - Computes rewards    │         │ - Monitors training       │       │
│   │ - Groups trajectories │         │ - Reports status          │       │
│   └───────────────────────┘         └───────────────────────────┘       │
└─────────────────────────────────────────────────────────────────────────┘
                                              │
                                              ▼
                                ┌─────────────────────────┐
                                │    Remote Training      │
                                │       Backend           │
                                │  (OpenPipe ART, etc.)   │
                                └─────────────────────────┘
```
This architecture ensures:
- Single responsibility: Each component does one thing well
- Independent evolution: Components can be upgraded separately
- Easy testing: Mock any component for unit tests
- Flexibility: Mix and match components for different scenarios
A trajectory in NeMo Agent Toolkit represents a complete interaction sequence:
```python
class Trajectory(BaseModel):
    episode: list[EpisodeItem] | list[DPOItem]  # The sequence of messages/actions
    reward: float                               # The outcome reward for this trajectory
    shaped_rewards: list[float] | None          # Optional step-wise rewards
    metadata: dict | None                       # Additional context
```

An episode item represents a single message or action:
```python
class EpisodeItem(BaseModel):
    role: EpisodeItemRole  # USER, ASSISTANT, SYSTEM, TOOL, etc.
    content: str           # The message content
    logprobs: Any | None   # Log probabilities (required for ASSISTANT)
    metadata: dict | None  # Step-specific metadata
```

The role can be:
| Role | Description |
|---|---|
| `USER` | Human or system input to the agent |
| `ASSISTANT` | Model-generated response |
| `SYSTEM` | System prompt or instructions |
| `TOOL` | Tool/function call result |
| `FUNCTION` | Function call (legacy format) |
| `ENVIRONMENT` | Environment state or feedback |
For DPO training, a trajectory consists of preferred and rejected responses:
```python
class DPOItem(BaseModel):
    """
    A single step in an episode for DPO training.
    """
    prompt: list[OpenAIMessage] | str = Field(description="The prompt messages leading to the response.")
    chosen_response: str = Field(description="The response chosen as better by the reward model.")
    rejected_response: str = Field(description="The response rejected as worse by the reward model.")
```

The OpenAIMessage type is the standard message format used in OpenAI-compatible chat APIs. It consists of:
```python
class OpenAIMessage(BaseModel):
    """
    A message in the OpenAI chat format.
    """
    role: str = Field(description="The role of the message (e.g., 'user', 'assistant').")
    content: str = Field(description="The content of the message.")
```

Trajectories are organized into collections that group related examples:
```python
class TrajectoryCollection(BaseModel):
    trajectories: list[list[Trajectory]]  # Grouped trajectories
    run_id: str                           # Unique identifier
```

The nested list structure (`list[list[Trajectory]]`) is critical:
```python
trajectories = [
    # Group 1: All trajectories for "What is Python?"
    [
        Trajectory(episode=[...], reward=0.9),   # Generation 1
        Trajectory(episode=[...], reward=0.7),   # Generation 2
        Trajectory(episode=[...], reward=0.95),  # Generation 3
    ],
    # Group 2: All trajectories for "Explain recursion"
    [
        Trajectory(episode=[...], reward=0.6),
        Trajectory(episode=[...], reward=0.8),
        Trajectory(episode=[...], reward=0.5),
    ],
    # ... more groups
]
```
This structure enables:
- GRPO: Compare responses to the same prompt
- Curriculum learning: Filter groups by average reward
- Variance analysis: Identify examples with no learning signal
Reward functions determine how well an agent performed. The harness uses the NeMo Agent Toolkit evaluator system to compute rewards:
```yaml
eval:
  evaluators:
    my_reward:
      _type: custom_evaluator
      # Evaluator configuration...

finetuning:
  reward_function:
    name: my_reward  # References the evaluator above
```

This design allows:
- Reuse of evaluation metrics as training signals
- Complex multi-criteria rewards through evaluator composition
- Consistent scoring between evaluation and training
A typical training loop in the NeMo Agent Toolkit harness:
```text
┌────────────────────────────────────────────────────────────────────────┐
│                           Training Loop                                │
│                                                                        │
│  for epoch in range(num_epochs):                                       │
│      │                                                                 │
│      ▼                                                                 │
│  ┌──────────────────────────────────────────────────────────────┐      │
│  │ 1. TRAJECTORY COLLECTION                                     │      │
│  │    - Run workflow on training dataset                        │      │
│  │    - Generate N trajectories per example                     │      │
│  │    - Compute rewards using configured evaluator              │      │
│  │    - Group trajectories by example ID                        │      │
│  └──────────────────────────────────────────────────────────────┘      │
│      │                                                                 │
│      ▼                                                                 │
│  ┌──────────────────────────────────────────────────────────────┐      │
│  │ 2. CURRICULUM FILTERING (if enabled)                         │      │
│  │    - Sort groups by average reward                           │      │
│  │    - Filter out low-variance groups                          │      │
│  │    - Select top percentile of groups                         │      │
│  │    - Expand percentile at intervals                          │      │
│  └──────────────────────────────────────────────────────────────┘      │
│      │                                                                 │
│      ▼                                                                 │
│  ┌──────────────────────────────────────────────────────────────┐      │
│  │ 3. TRAINING SUBMISSION                                       │      │
│  │    - Convert trajectories to backend format                  │      │
│  │    - Submit to training backend                              │      │
│  │    - Wait for training to complete                           │      │
│  └──────────────────────────────────────────────────────────────┘      │
│      │                                                                 │
│      ▼                                                                 │
│  ┌──────────────────────────────────────────────────────────────┐      │
│  │ 4. LOGGING & MONITORING                                      │      │
│  │    - Record metrics (avg reward, num trajectories, etc.)     │      │
│  │    - Generate visualizations                                 │      │
│  │    - Run validation (if configured)                          │      │
│  └──────────────────────────────────────────────────────────────┘      │
│      │                                                                 │
│      └──────────────────► Next epoch                                   │
└────────────────────────────────────────────────────────────────────────┘
```
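The four stages above can be sketched in plain Python (hypothetical function and callback names that mirror the diagram, not the toolkit's API; the groups are simplified to bare reward lists):

```python
def finetune(collect_trajectories, submit_training, num_epochs: int,
             curriculum_filter=None):
    """Run the epoch loop: collect, filter, train, log."""
    history = []
    for epoch in range(num_epochs):
        # 1. Trajectory collection: run the workflow and score each rollout
        groups = collect_trajectories()

        # 2. Curriculum filtering (optional)
        if curriculum_filter is not None:
            groups = curriculum_filter(groups, epoch)

        # 3. Training submission: hand the surviving groups to the backend
        submit_training(groups)

        # 4. Logging & monitoring: track average reward per epoch
        rewards = [r for group in groups for r in group]
        history.append(sum(rewards) / len(rewards))
    return history

# Stub collection and submission for illustration
finetune(lambda: [[0.5, 1.0], [0.0, 0.5]], lambda groups: None, num_epochs=2)
```

Because on-policy methods need fresh data, step 1 reruns the workflow at the top of every epoch rather than reusing old trajectories.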
```yaml
llms:
  training_model:
    _type: openai
    model_name: Qwen/Qwen2.5-3B-Instruct
    base_url: http://localhost:8000/v1
    api_key: default

workflow:
  _type: my_workflow
  llm: training_model

eval:
  general:
    max_concurrency: 16
    output_dir: .tmp/nat/finetuning/eval
    dataset:
      _type: json
      file_path: data/training_data.json
  evaluators:
    accuracy:
      _type: my_accuracy_evaluator

trajectory_builders:
  my_builder:
    _type: my_trajectory_builder
    num_generations: 2

trainer_adapters:
  my_adapter:
    _type: my_trainer_adapter

trainers:
  my_trainer:
    _type: my_trainer

finetuning:
  enabled: true
  trainer: my_trainer
  trajectory_builder: my_builder
  trainer_adapter: my_adapter
  reward_function:
    name: accuracy
  num_epochs: 10
  output_dir: .tmp/nat/finetuning
```

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | `bool` | `false` | Whether finetuning is enabled |
| `trainer` | `str` | - | Name of the trainer to use |
| `trajectory_builder` | `str` | - | Name of the trajectory builder |
| `trainer_adapter` | `str` | - | Name of the trainer adapter |
| `reward_function.name` | `str` | - | Name of the evaluator for rewards |
| `target_functions` | `list[str]` | `["<workflow>"]` | Functions to extract trajectories from |
| `target_model` | `str` | `null` | Specific model to target |
| `num_epochs` | `int` | `1` | Number of training epochs |
| `output_dir` | `Path` | `.tmp/nat/finetuning` | Output directory |
| `curriculum_learning` | object | see below | Curriculum learning config |
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | `bool` | `false` | Enable curriculum learning |
| `initial_percentile` | `float` | `0.3` | Starting fraction of examples |
| `increment_percentile` | `float` | `0.2` | Fraction to add each expansion |
| `expansion_interval` | `int` | `5` | Epochs between expansions |
| `min_reward_diff` | `float` | `0.1` | Minimum variance threshold |
| `sort_ascending` | `bool` | `false` | Sort direction (`false` = easy-to-hard) |
| `random_subsample` | `float` | `null` | Optional random subsampling |
Run finetuning from the command line:
```bash
nat finetune --config_file=path/to/config.yml
```

| Option | Description |
|---|---|
| `--config_file` | Path to the configuration file (required) |
| `--dataset` | Override the dataset path from config |
| `--result_json_path` | JSON path to extract results (default: `$`) |
| `--endpoint` | Remote endpoint for workflow execution |
| `--endpoint_timeout` | HTTP timeout in seconds (default: 300) |
| `--override`, `-o` | Override config values |
| `--validation_dataset` | Path to validation dataset |
| `--validation_interval` | Validate every N epochs (default: 5) |
| `--validation_config_file` | Separate config for validation |
```bash
# Basic finetuning
nat finetune --config_file=configs/finetune.yml

# Override number of epochs
nat finetune --config_file=configs/finetune.yml -o finetuning.num_epochs 20

# With validation
nat finetune --config_file=configs/finetune.yml \
    --validation_dataset=data/val.json \
    --validation_interval=3

# Using remote endpoint
nat finetune --config_file=configs/finetune.yml \
    --endpoint=http://localhost:8000/generate \
    --endpoint_timeout=600
```

- Extending the Finetuning Harness - Creating custom components
- OpenPipe ART Integration - Using the ART backend
- Evaluating Workflows - Understanding evaluators for rewards