Complexity: 🟨 Intermediate
This example demonstrates how to evaluate and profile AI agent performance using the NVIDIA NeMo Agent Toolkit. You'll learn to systematically measure your agent's accuracy and analyze its behavior using the Simple Calculator workflow.
- Tunable RAG Evaluator Integration: Demonstrates the nat eval command with Tunable RAG Evaluator to measure agent response accuracy against ground truth datasets.
- Performance Analysis Framework: Shows systematic evaluation of agent behavior, accuracy, and response quality using standardized test datasets.
- Question-by-Question Analysis: Provides detailed breakdown of individual responses with comprehensive metrics for identifying failure patterns and areas for improvement.
- Evaluation Dataset Management: Demonstrates how to work with structured evaluation datasets (simple_calculator.json) for consistent and reproducible testing.
- Results Interpretation: Shows how to analyze evaluation metrics and generate comprehensive performance reports for agent optimization.
- Accuracy Evaluation: Measure and validate agent responses using the Tunable RAG Evaluator
- Performance Analysis: Understand agent behavior through systematic evaluation
- Dataset Management: Work with evaluation datasets for consistent testing
- Results Interpretation: Analyze evaluation metrics to improve agent performance
- Agent Toolkit: Ensure that NeMo Agent Toolkit is installed. If it is not, follow the instructions in the Install Guide to create the development environment and install it.
- Base workflow: This example builds upon the Getting Started Simple Calculator example. Make sure you are familiar with the example before proceeding.
Install this evaluation example:
uv pip install -e examples/evaluation_and_profiling/simple_calculator_eval

Evaluate the Simple Calculator agent's accuracy against a test dataset:

nat eval --config_file examples/evaluation_and_profiling/simple_calculator_eval/configs/config-tunable-rag-eval.yml

Note
If you encounter rate limiting ([429] Too Many Requests) during evaluation, try lowering the eval.general.max_concurrency value, either directly in the YAML or on the command line with: --override eval.general.max_concurrency 1.
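If you launch evaluations from a script, the override can be attached programmatically. The following is a minimal sketch that assembles only the flags shown above (--config_file and --override); the helper names build_eval_command and run_eval are hypothetical and are not part of the toolkit.

```python
import subprocess

# Config path copied from the command above.
CONFIG_FILE = (
    "examples/evaluation_and_profiling/simple_calculator_eval/"
    "configs/config-tunable-rag-eval.yml"
)


def build_eval_command(config_file, max_concurrency=None):
    """Build the `nat eval` argument list, optionally capping concurrency.

    Hypothetical helper: it only assembles flags that appear in this README.
    """
    command = ["nat", "eval", "--config_file", config_file]
    if max_concurrency is not None:
        # Command-line alternative to editing eval.general.max_concurrency
        # in the YAML file.
        command += [
            "--override",
            "eval.general.max_concurrency",
            str(max_concurrency),
        ]
    return command


def run_eval(config_file=CONFIG_FILE, max_concurrency=1):
    """Run the evaluation; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(
        build_eval_command(config_file, max_concurrency), check=True
    )
```

Calling run_eval() from the repository root, with the development environment active, should be equivalent to typing the command by hand with the override appended.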
The configuration file passed to nat eval above configures both the evaluation and the profiling capabilities of NeMo Agent Toolkit. For details on evaluation configuration, see the evaluation guide; for profiling configuration, see the profiling guide.
This command:
- Uses the test dataset from examples/getting_started/simple_calculator/data/simple_calculator.json
- Applies the Tunable RAG Evaluator to measure response accuracy
- Saves detailed results to .tmp/nat/examples/getting_started/simple_calculator/tuneable_eval_output.json
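Once the run finishes, it can help to look at the shape of the results file before writing any analysis against it. This README does not document the file's schema, so the sketch below assumes nothing beyond the file being valid JSON; describe and summarize_results are hypothetical helper names.

```python
import json
from pathlib import Path

# Path copied from this README; adjust it if your output directory differs.
RESULTS_PATH = Path(
    ".tmp/nat/examples/getting_started/simple_calculator/tuneable_eval_output.json"
)


def describe(value, depth=0, max_depth=3):
    """Return lines describing the shape of a parsed JSON value."""
    indent = "  " * depth
    if isinstance(value, dict):
        lines = [f"{indent}object ({len(value)} keys)"]
        if depth < max_depth:
            for key, child in value.items():
                lines.append(f"{indent}  {key!r}:")
                lines.extend(describe(child, depth + 2, max_depth))
        return lines
    if isinstance(value, list):
        lines = [f"{indent}array ({len(value)} items)"]
        if value and depth < max_depth:
            # Only the first item is described, on the assumption that
            # all items in an array share one shape.
            lines.extend(describe(value[0], depth + 1, max_depth))
        return lines
    return [f"{indent}{type(value).__name__}"]


def summarize_results(path=RESULTS_PATH):
    """Load the results file and return a printable outline of its structure."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return "\n".join(describe(data))
```

Running print(summarize_results()) from the repository root prints an outline of keys and value types, which shows which fields hold the per-question scores in your version of the toolkit.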
The evaluation generates comprehensive metrics including:
- Accuracy Scores: Quantitative measures of response correctness
- Question-by-Question Analysis: Detailed breakdown of individual responses
- Performance Metrics: Overall quality assessments
- Error Analysis: Identification of common failure patterns
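To turn per-question results into an overall score and a list of likely failures, something like the following sketch may be useful. It assumes each record is a dictionary with the keys "question" and "score", and it uses a cutoff of 0.5; both the key names and the threshold are hypothetical, so rename and tune them to match the fields in your own results file.

```python
from statistics import mean


def summarize_scores(records, threshold=0.5):
    """Summarize per-question scores.

    `records` is an iterable of dicts with the hypothetical keys
    "question" and "score"; `threshold` is an arbitrary illustrative cutoff.
    """
    # Skip records that have no numeric score (for example, failed runs).
    scored = [r for r in records if isinstance(r.get("score"), (int, float))]
    if not scored:
        return {"count": 0, "average": None, "below_threshold": []}
    return {
        "count": len(scored),
        "average": mean(r["score"] for r in scored),
        # Questions scoring under the cutoff are candidates for error analysis.
        "below_threshold": [
            r.get("question") for r in scored if r["score"] < threshold
        ],
    }
```

The below_threshold list gives a starting point for spotting common failure patterns, since it isolates the questions the evaluator scored lowest.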