Simple Calculator - Evaluation and Profiling

Complexity: 🟨 Intermediate

This example demonstrates how to evaluate and profile AI agent performance using the NVIDIA NeMo Agent Toolkit. You'll learn to systematically measure your agent's accuracy and analyze its behavior using the Simple Calculator workflow.

Key Features

Tunable RAG Evaluator Integration: Demonstrates the nat eval command with the Tunable RAG Evaluator to measure agent response accuracy against ground-truth datasets.
Performance Analysis Framework: Shows systematic evaluation of agent behavior, accuracy, and response quality using standardized test datasets.
Question-by-Question Analysis: Provides detailed breakdown of individual responses with comprehensive metrics for identifying failure patterns and areas for improvement.
Evaluation Dataset Management: Demonstrates how to work with structured evaluation datasets (simple_calculator.json) for consistent and reproducible testing.
Results Interpretation: Shows how to analyze evaluation metrics and generate comprehensive performance reports for agent optimization.

What You'll Learn

Accuracy Evaluation: Measure and validate agent responses using the Tunable RAG Evaluator
Performance Analysis: Understand agent behavior through systematic evaluation
Dataset Management: Work with evaluation datasets for consistent testing
Results Interpretation: Analyze evaluation metrics to improve agent performance

Prerequisites

NeMo Agent Toolkit: Ensure you have the NeMo Agent Toolkit installed. If you have not already done so, follow the instructions in the Install Guide to create the development environment and install it.
Base workflow: This example builds upon the Getting Started Simple Calculator example. Make sure you are familiar with that example before proceeding.

Installation

Install this evaluation example:

uv pip install -e examples/evaluation_and_profiling/simple_calculator_eval

Run the Workflow

Running Evaluation

Evaluate the Simple Calculator agent's accuracy against a test dataset:

nat eval --config_file examples/evaluation_and_profiling/simple_calculator_eval/configs/config-tunable-rag-eval.yml

Note

If you encounter rate limiting ([429] Too Many Requests) during evaluation, try lowering the eval.general.max_concurrency value, either directly in the YAML or on the command line with: --override eval.general.max_concurrency 1.
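The dotted key eval.general.max_concurrency names a nested location in the YAML: max_concurrency under general, under eval. The following is a minimal Python sketch of that mapping, intended only to show where the setting lives. It is not the toolkit's own override logic; set_by_dotted_key is a hypothetical helper, and the starting value of 8 is made up for illustration.

```python
def set_by_dotted_key(config, dotted_key, value):
    """Set a value in a nested dict at the location named by a dotted key."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Walk down one level per key segment, creating levels that are missing.
        node = node.setdefault(name, {})
    node[leaf] = value
    return config


# Illustrative starting config; the value 8 is an assumption, not a toolkit default.
config = {"eval": {"general": {"max_concurrency": 8}}}
set_by_dotted_key(config, "eval.general.max_concurrency", 1)
print(config)
```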

The configuration file specified above configures both the evaluation and the profiler capabilities of the NeMo Agent Toolkit. See the evaluation guide for details on evaluation configuration and the profiling guide for details on profiling configuration.

This command:

Uses the test dataset from examples/getting_started/simple_calculator/data/simple_calculator.json
Applies the Tunable RAG Evaluator to measure response accuracy
Saves detailed results to .tmp/nat/examples/getting_started/simple_calculator/tuneable_eval_output.json
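Once the run finishes, a quick way to confirm that results were written is to load the output file and look at its top-level structure. The sketch below is a minimal Python helper that assumes only that the file is valid JSON. describe_json is a hypothetical name, and nothing about the file's schema is assumed.

```python
import json
from pathlib import Path

# Output path taken from the list above.
OUTPUT_PATH = Path(
    ".tmp/nat/examples/getting_started/simple_calculator/tuneable_eval_output.json"
)


def describe_json(path):
    """Return a short description of a JSON file's top-level structure."""
    path = Path(path)
    if not path.exists():
        return f"{path} not found (has the evaluation finished?)"
    with path.open(encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        return "object with keys: " + ", ".join(sorted(data))
    if isinstance(data, list):
        return f"array with {len(data)} items"
    return f"scalar of type {type(data).__name__}"


if __name__ == "__main__":
    print(describe_json(OUTPUT_PATH))
```

Run it from the repository root so that the relative path resolves.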

Understanding Results

The evaluation generates comprehensive metrics including:

Accuracy Scores: Quantitative measures of response correctness
Question-by-Question Analysis: Detailed breakdown of individual responses
Performance Metrics: Overall quality assessments
Error Analysis: Identification of common failure patterns
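To turn the question-by-question output into a short summary, a small script can compute the mean score and list the lowest-scoring items. The sketch below is a hedged example: it assumes the output file is a JSON object with an eval_output_items list whose entries carry id, score, and reasoning fields. That layout is an assumption, so check it against your own output file and adjust the field names if they differ. The summarize and load_and_report helpers and the 0.5 threshold are hypothetical.

```python
import json
from pathlib import Path


def summarize(results, threshold=0.5):
    """Return (mean_score, low_items) for the assumed results layout."""
    # Assumed layout: {"eval_output_items": [{"id", "score", "reasoning"}, ...]}
    items = results.get("eval_output_items", [])
    # Skip items that have no numeric score.
    scored = [it for it in items if isinstance(it.get("score"), (int, float))]
    mean = sum(it["score"] for it in scored) / len(scored) if scored else None
    low = [it for it in scored if it["score"] < threshold]
    return mean, low


def load_and_report(path, threshold=0.5):
    """Print the mean score and every item scoring below the threshold."""
    results = json.loads(Path(path).read_text(encoding="utf-8"))
    mean, low = summarize(results, threshold)
    print(f"mean score: {mean}")
    for it in low:
        print(f"low score {it['score']} for item {it.get('id')}: {it.get('reasoning')}")
```

Calling load_and_report with the output path from the Running Evaluation section prints one line per low-scoring question, which is a reasonable starting point for spotting failure patterns.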
