Draft pull request (17 commits):
- 588f349 feat: add AG-UI protocol integration for event-based agent evaluation (contextablemark, Oct 27, 2025)
- fbc510f Merge branch 'explodinggradients:main' into feature/ag-ui (contextablemark, Oct 27, 2025)
- ee86714 feat: add FastAPI endpoint integration for AG-UI agents (contextablemark, Oct 27, 2025)
- 569354c Merge branch 'explodinggradients:main' into feature/ag-ui (contextablemark, Oct 28, 2025)
- 9bdf83a Update to get things to run for AI Tinkerers. (contextablemark, Oct 31, 2025)
- 7346b44 feat: add chunk event support to AG-UI integration (contextablemark, Nov 1, 2025)
- 5f3aaa6 Merge branch 'explodinggradients:main' into feature/ag-ui (contextablemark, Nov 1, 2025)
- 260cf45 feat: add MultiTurnSample support to AG-UI integration (contextablemark, Nov 1, 2025)
- 4ba4a07 refactor: use type-based checking for AG-UI message snapshots (contextablemark, Nov 1, 2025)
- a67261a Slight refactoring on ag_ui.py integration. Added detailed example u… (contextablemark, Nov 1, 2025)
- 013ce37 chore: revert trivial whitespace changes to CLAUDE.md (contextablemark, Nov 1, 2025)
- f1f0848 Updated README with different framework names. (contextablemark, Nov 2, 2025)
- 545124e Addressing issues with make run-ci (contextablemark, Nov 4, 2025)
- 008d6c4 Added "How-to" docs and Jupyter notebook. (contextablemark, Nov 4, 2025)
- 4fc413a Addressing formatting issue. (contextablemark, Nov 4, 2025)
- fb7dd52 More formatting / import issues. (contextablemark, Nov 4, 2025)
- d6fe626 Merge branch 'explodinggradients:main' into feature/ag-ui (contextablemark, Nov 5, 2025)
315 changes: 315 additions & 0 deletions examples/ragas_examples/ag_ui_agent_evals/README.md
@@ -0,0 +1,315 @@
# AG-UI Agent Evaluation Examples

This example demonstrates how to evaluate agents built with the **AG-UI protocol** using Ragas metrics.

## What is AG-UI?

AG-UI (Agent-to-UI) is a protocol for streaming agent events from backend to frontend. It defines a standardized event format for agent-to-UI communication, enabling real-time streaming of agent actions, tool calls, and responses.
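
Concretely, an AG-UI endpoint answers a request with a stream of typed JSON events over Server-Sent Events (SSE). The event names and fields below are illustrative only; consult the AG-UI documentation for the authoritative event catalogue:

```
data: {"type": "RUN_STARTED", "threadId": "thread_1", "runId": "run_1"}
data: {"type": "TEXT_MESSAGE_CONTENT", "messageId": "msg_1", "delta": "Albert Einstein originated..."}
data: {"type": "TOOL_CALL_START", "toolCallId": "call_1", "toolCallName": "weatherTool"}
data: {"type": "RUN_FINISHED", "threadId": "thread_1", "runId": "run_1"}
```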

## Prerequisites

Before running these examples, you need to have an AG-UI compatible agent running. Follow the [AG-UI Quickstart Guide](https://docs.ag-ui.com/quickstart/applications) to set up your agent.

### Popular AG-UI Compatible Frameworks

- **Google ADK (Agent Development Kit)** - Google's framework for building AI agents
- **Pydantic AI** - Type-safe agent framework using Pydantic
- **Mastra** - Modular, TypeScript-based agentic AI framework
- **CrewAI** - Python framework for orchestrating collaborative, specialized AI agent teams
- And more...

### Example Setup

Here's a quick overview of setting up an AG-UI agent (refer to the [official documentation](https://docs.ag-ui.com/quickstart/applications) for detailed instructions):

1. Choose your agent framework (e.g., Google ADK, Pydantic AI)
2. Implement your agent with the required tools
3. Start the AG-UI server (typically runs at `http://localhost:8000/chat` or `http://localhost:8000/agentic_chat`)
4. Verify the endpoint is accessible (a quick reachability check is sketched below)
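
As a sanity check before running the evaluations, confirm that something is listening at the endpoint. The snippet below is only a reachability probe; many AG-UI servers accept POST only, so an HTTP error status such as 405 or 422 still means the server is up:

```python
import requests  # any HTTP client works; requests is used here for brevity

try:
    # A plain GET is enough to prove the server is reachable, even if the
    # endpoint itself only accepts AG-UI POST requests.
    resp = requests.get("http://localhost:8000/agentic_chat", timeout=5)
    print(f"Endpoint reachable (HTTP {resp.status_code})")
except requests.ConnectionError:
    print("Endpoint not reachable. Is the AG-UI agent running?")
```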

## Installation

Install the required dependencies:

```bash
# From the ragas repository root
uv pip install -e ".[dev]"

# Or install specific dependencies
pip install ragas langchain-openai
```
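
To verify the installation, check that `ragas` imports cleanly (the `__version__` attribute is assumed to be available, as in recent releases):

```python
import ragas

print(ragas.__version__)  # should print the installed ragas version
```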

## Evaluation Scenarios

This example includes two evaluation scenarios:

### 1. Scientist Biographies (Factual Correctness)

Tests the agent's ability to provide factually correct information about famous scientists.

- **Metric**: `FactualCorrectness` - Measures how accurate the agent's responses are compared to reference answers
- **Dataset**: `test_data/scientist_biographies.csv` - 5 questions about scientists (Einstein, Fleming, Newton, etc.)
- **Sample Type**: `SingleTurnSample` - Simple question-answer pairs

### 2. Weather Tool Usage (Tool Call F1)

Tests the agent's ability to correctly invoke the weather tool when appropriate.

- **Metric**: `ToolCallF1` - F1 score measuring precision and recall of tool invocations
- **Dataset**: `test_data/weather_tool_calls.csv` - 5 queries requiring weather tool calls
- **Sample Type**: `MultiTurnSample` - Multi-turn conversations with tool call expectations
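
To see what these metrics do in isolation (independent of the AG-UI endpoint), here is a minimal sketch that scores one hand-written `SingleTurnSample` with `FactualCorrectness`. It assumes the standard Ragas API and an `OPENAI_API_KEY` in the environment:

```python
import asyncio

from langchain_openai import ChatOpenAI
from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import FactualCorrectness

# Wrap an evaluator LLM for the metric to use when judging factual claims.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
metric = FactualCorrectness(llm=evaluator_llm)

sample = SingleTurnSample(
    user_input="Who discovered penicillin and when?",
    response="Alexander Fleming discovered penicillin in 1928.",
    reference="Penicillin was discovered by Alexander Fleming in 1928.",
)

score = asyncio.run(metric.single_turn_ascore(sample))
print(score)  # 1.0 when every claim in the response is supported by the reference
```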

## Usage

### Basic Usage

Run both evaluation scenarios:

```bash
cd examples/ragas_examples/ag_ui_agent_evals
python evals.py --endpoint-url http://localhost:8000/agentic_chat
```

### Command Line Options

```bash
# Specify a different endpoint
python evals.py --endpoint-url http://localhost:8010/chat

# Use a different evaluator model
python evals.py --evaluator-model gpt-4o

# Skip the factual correctness evaluation
python evals.py --skip-factual

# Skip the tool call evaluation
python evals.py --skip-tool-eval

# Specify output directory for results
python evals.py --output-dir ./results

# Combine options
python evals.py \
  --endpoint-url http://localhost:8000/agentic_chat \
  --evaluator-model gpt-4o-mini \
  --output-dir ./my_results
```

### Using uv (Recommended)

```bash
# Run with uv from the examples directory
cd examples
uv run python ragas_examples/ag_ui_agent_evals/evals.py --endpoint-url http://localhost:8000/agentic_chat
```

## Expected Output

### Console Output

The script will print detailed evaluation results:

```
================================================================================
Starting Scientist Biographies Evaluation
================================================================================
Loading scientist biographies dataset from .../test_data/scientist_biographies.csv
Loaded 5 scientist biography samples
Evaluating against endpoint: http://localhost:8000/agentic_chat

================================================================================
Scientist Biographies Evaluation Results
================================================================================
user_input ... factual_correctness(mode=f1)
0 Who originated the theory of relativity... ... 0.75
1 Who discovered penicillin and when... ... 1.00
...

Average Factual Correctness: 0.7160
Perfect scores (1.0): 2/5

Results saved to: .../scientist_biographies_results_20250101_143022.csv

================================================================================
Starting Weather Tool Usage Evaluation
================================================================================
...
Average Tool Call F1: 1.0000
Perfect scores (F1=1.0): 5/5
Failed scores (F1=0.0): 0/5

Results saved to: .../weather_tool_calls_results_20250101_143045.csv

================================================================================
All evaluations completed successfully!
================================================================================
```

### CSV Output Files

Results are saved as timestamped CSV files:

- `scientist_biographies_results_YYYYMMDD_HHMMSS.csv`
- `weather_tool_calls_results_YYYYMMDD_HHMMSS.csv`

Example CSV structure:

```csv
user_input,response,reference,factual_correctness(mode=f1)
"Who originated the theory of relativity...","Albert Einstein...","Albert Einstein originated...",0.75
```
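
The result files are plain CSVs, so any tool can inspect them; for example, a quick pandas summary (the file name below is just the sample name from above, so substitute the timestamped file produced by your run):

```python
import pandas as pd

# Hypothetical file name; use the actual timestamped output from your run.
df = pd.read_csv("scientist_biographies_results_20250101_143022.csv")
print(df["factual_correctness(mode=f1)"].describe())
print("Average:", df["factual_correctness(mode=f1)"].mean())
```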

## Customizing the Evaluation

### Adding New Test Cases

#### For Factual Correctness

Edit `test_data/scientist_biographies.csv`:

```csv
user_input,reference
"Your question here","Your reference answer here"
```

#### For Tool Call Evaluation

Edit `test_data/weather_tool_calls.csv`:

```csv
user_input,reference_tool_calls
"What's the weather in Paris?","[{\"name\": \"weatherTool\", \"args\": {\"location\": \"Paris\"}}]"
```
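
The JSON in `reference_tool_calls` corresponds to Ragas `ToolCall` objects. As a rough sketch (the message contents here are made up, and the actual conversion is handled by the loader in `evals.py`), one such row could be turned into a `MultiTurnSample` like this:

```python
import json

from ragas import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall

row_user_input = "What's the weather in Paris?"
row_reference = '[{"name": "weatherTool", "args": {"location": "Paris"}}]'

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content=row_user_input),
        # The AI turn (including the tool calls it actually made) comes from the
        # agent's AG-UI event stream at evaluation time; shown here as a placeholder.
        AIMessage(
            content="Checking the weather in Paris.",
            tool_calls=[ToolCall(name="weatherTool", args={"location": "Paris"})],
        ),
    ],
    reference_tool_calls=[ToolCall(**tc) for tc in json.loads(row_reference)],
)
```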

### Using Different Metrics

Modify `evals.py` to include additional Ragas metrics:

```python
from ragas.metrics import AnswerRelevancy, ContextPrecision

# In evaluate_scientist_biographies function:
metrics = [
    FactualCorrectness(),
    AnswerRelevancy(),  # Add additional metrics
]
```
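
Note that metrics have their own requirements: `AnswerRelevancy`, for example, also needs an evaluator LLM and embeddings configured, and context-based metrics such as `ContextPrecision` expect `retrieved_contexts` on each sample. Check the Ragas metrics documentation for what each metric consumes.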

### Evaluating Your Own Agent

1. **Ensure your agent supports the AG-UI protocol**
   - The agent must expose an endpoint that accepts AG-UI messages
   - The agent must return Server-Sent Events (SSE) in the AG-UI event format

2. **Update the endpoint URL**

   ```bash
   python evals.py --endpoint-url http://your-agent:port/your-endpoint
   ```

3. **Customize the test data**
   - Create new CSV files with your test cases
   - Update the loader functions in `evals.py` if needed

## Troubleshooting

### Connection Errors

```
Error: Connection refused at http://localhost:8000/agentic_chat
```

**Solution**: Ensure your AG-UI agent is running and accessible at the specified endpoint.

### Import Errors

```
ImportError: No module named 'ragas'
```

**Solution**: Install ragas and its dependencies:
```bash
pip install ragas langchain-openai
```

### API Key Errors

```
Error: OpenAI API key not found
```

**Solution**: Set your OpenAI API key:
```bash
export OPENAI_API_KEY='your-api-key-here'
```

### Agent Timeout

```
Error: Request timeout after 60.0 seconds
```

**Solution**: Your agent may be slow to respond. You can increase the timeout in the code or optimize your agent's performance.

## Understanding the Results

### Factual Correctness Metric

- **Range**: 0.0 to 1.0
- **1.0**: Perfect match between response and reference
- **0.5-0.9**: Partially correct with some missing or incorrect information
- **<0.5**: Significant discrepancies with the reference

### Tool Call F1 Metric

- **Range**: 0.0 to 1.0
- **1.0**: Perfect tool call accuracy (correct tools with correct arguments)
- **0.5-0.9**: Some tool calls correct, but with missed or extraneous calls
- **0.0**: Incorrect tool usage or no tool calls when expected
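
For intuition, here is a toy computation of an F1 score over tool calls, assuming exact matching of tool name and arguments (the metric's actual matching rules are defined by Ragas):

```python
# Expected vs. actual tool calls, represented as (name, argument) pairs.
expected = {("weatherTool", ("location", "Paris")), ("weatherTool", ("location", "London"))}
actual = {("weatherTool", ("location", "Paris")), ("searchTool", ("query", "Paris"))}

true_positives = len(expected & actual)              # 1 matching call
precision = true_positives / len(actual)             # 1/2 = 0.5
recall = true_positives / len(expected)              # 1/2 = 0.5
f1 = 2 * precision * recall / (precision + recall)   # 0.5
print(f1)
```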

## Integration with Your Workflow

### CI/CD Integration

You can integrate these evaluations into your CI/CD pipeline:

```bash
# In your CI script
python evals.py \
  --endpoint-url http://staging-agent:8000/chat \
  --output-dir ./test-results \
  || exit 1
```

### Tracking Performance Over Time

Save results with timestamps to track improvements:

```bash
# Run evaluations regularly
python evals.py --output-dir ./historical-results/$(date +%Y%m%d)
```

### Automated Testing

Create a simple test harness:

```python
import subprocess
import sys

result = subprocess.run(
    ["python", "evals.py", "--endpoint-url", "http://localhost:8000/chat"],
    capture_output=True,
)

if result.returncode != 0:
    print("Evaluation failed!")
    sys.exit(1)
```

## Additional Resources

- [AG-UI Documentation](https://docs.ag-ui.com)
- [AG-UI Quickstart](https://docs.ag-ui.com/quickstart/applications)
- [Ragas Documentation](https://docs.ragas.io)
- [Ragas AG-UI Integration Guide](https://docs.ragas.io/integrations/ag-ui)
52 changes: 52 additions & 0 deletions examples/ragas_examples/ag_ui_agent_evals/__init__.py
@@ -0,0 +1,52 @@
"""
AG-UI Agent Evaluation Examples

This package demonstrates how to evaluate agents built with the AG-UI protocol
using Ragas metrics.

## What is AG-UI?

AG-UI (Agent-to-UI) is a protocol for streaming agent events from backend to frontend.
It defines a standardized event format for agent-to-UI communication.

## Getting Started

Before running these examples, you'll need to have an AG-UI compatible agent running.
Follow the AG-UI quickstart guide to set up your agent:

https://docs.ag-ui.com/quickstart/applications

Popular agent frameworks that support AG-UI include:
- Google ADK (Agent Development Kit)
- Pydantic AI
- And more...

## Running the Examples

Once you have your AG-UI agent endpoint running (typically at
http://localhost:8000/chat or http://localhost:8000/agentic_chat), you can run
the evaluation examples:

```bash
# From the examples directory
cd ragas_examples/ag_ui_agent_evals
uv run python evals.py --endpoint-url http://localhost:8000/agentic_chat
```

## Evaluation Scenarios

This package includes two evaluation scenarios:

1. **Scientist Biographies** - Tests factual correctness of agent responses
using the FactualCorrectness metric with SingleTurnSample datasets.

2. **Weather Tool Usage** - Tests tool calling accuracy using the ToolCallF1
metric with MultiTurnSample datasets.

## Results

Evaluation results are saved as CSV files with timestamps for tracking performance
over time.
"""

__version__ = "0.1.0"