Merged
16 commits
622aed4
Add new category for 'Evaluation from SDK' in JSON configuration
mmabrouk Nov 12, 2025
e4680e0
Add quick start guide and example notebook for Agenta SDK evaluations
mmabrouk Nov 12, 2025
b27daa8
Update tutorial links and add new Jupyter notebooks for capturing use…
mmabrouk Nov 12, 2025
5ec8ccf
Update .gitignore to include all files and directories in the tests/d…
mmabrouk Nov 12, 2025
37b5bde
Add documentation for managing testsets in Agenta SDK
mmabrouk Nov 12, 2025
d974709
Remove Jupyter notebook for evaluations with SDK, consolidating docum…
mmabrouk Nov 12, 2025
31d2bd7
Add documentation for configuring evaluators in Agenta SDK
mmabrouk Nov 12, 2025
c169100
Add documentation for configuring applications in Agenta SDK
mmabrouk Nov 12, 2025
f963c72
Add documentation for running evaluations programmatically from the SDK
mmabrouk Nov 12, 2025
1141565
Update quick start guide to include troubleshooting resources for eva…
mmabrouk Nov 12, 2025
769b5c7
Enhance quick start guide for Agenta SDK evaluations by adding OpenAI…
mmabrouk Nov 12, 2025
ed0cfdc
Update quick start guide to include OpenAI API key setup for LLM-as-a…
mmabrouk Nov 12, 2025
ce934da
Enhance quick start guide by adding evaluation name and description p…
mmabrouk Nov 12, 2025
bd8d5e1
Refactor testset management documentation by removing description par…
mmabrouk Nov 12, 2025
ad7b387
Add documentation for managing testsets and configuring evaluators in…
mmabrouk Nov 12, 2025
74e6e0d
Refactor quick start guide by removing redundant header and improving…
mmabrouk Nov 12, 2025
3 changes: 2 additions & 1 deletion .gitignore
@@ -47,4 +47,5 @@ sdk/agenta/templates/agenta.py
web/ee/public/__env.js
web/oss/public/__env.js

web/oss/tests/datalayer/results
web/oss/tests/datalayer/results
.*
1 change: 1 addition & 0 deletions api/ee/tests/manual/evaluations/sdk/quick_start.py
@@ -169,6 +169,7 @@ async def run_evaluation():
    # Run evaluation
    print("Running evaluation...")
    eval_result = await aevaluate(
        name="My First Eval",
        testsets=[my_testset.id],
        applications=[capital_quiz_app],
        evaluators=[
2 changes: 1 addition & 1 deletion docs/blog/entries/annotate-your-llm-response-preview.mdx
@@ -20,7 +20,7 @@ This is useful to:
- Run custom evaluation workflows
- Measure application performance in real-time

Check out the how to [annotate traces from API](/observability/trace-with-python-sdk/annotate-traces) for more details. Or try our new tutorial (available as [jupyter notebook](https://github.com/Agenta-AI/agenta/blob/main/examples/jupyter/capture_user_feedback.ipynb)) [here](/tutorials/cookbooks/capture-user-feedback).
Check out the how to [annotate traces from API](/observability/trace-with-python-sdk/annotate-traces) for more details. Or try our new tutorial (available as [jupyter notebook](https://github.com/Agenta-AI/agenta/blob/main/examples/jupyter/observability/capture_user_feedback.ipynb)) [here](/tutorials/cookbooks/capture-user-feedback).

<Image
style={{
296 changes: 296 additions & 0 deletions docs/docs/evaluation/evaluation-from-sdk/01-quick-start.mdx
@@ -0,0 +1,296 @@
---
title: "Quick Start"
sidebar_label: "Quick Start"
description: "Learn how to run evaluations programmatically with the Agenta SDK in under 5 minutes"
sidebar_position: 1
---

import GoogleColabButton from "@site/src/components/GoogleColabButton";

This guide shows you how to create your first evaluation using the Agenta SDK. You'll build a simple application that answers geography questions, then create evaluators to check if the answers are correct.

<GoogleColabButton notebookPath="examples/jupyter/evaluation/quick-start.ipynb">
Open in Google Colaboratory
</GoogleColabButton>

## What You'll Build

By the end of this guide, you'll have:
- An application that returns country capitals
- Two evaluators that check if answers are correct
- A complete evaluation run with results

The entire example takes less than 100 lines of code.

## Prerequisites

Install the Agenta SDK:

```bash
pip install agenta
```

Set your environment variables:

```bash
export AGENTA_API_KEY="your-api-key"
export AGENTA_HOST="https://cloud.agenta.ai"
export OPENAI_API_KEY="your-openai-api-key" # Required for LLM-as-a-judge evaluator
```

## Step 1: Initialize Agenta

Create a new Python file and initialize the SDK:

```python
import agenta as ag

ag.init()
```

## Step 2: Create Your Application

An application is any function that processes inputs and returns outputs. Use the `@ag.application` decorator to mark your function:

```python
@ag.application(
    slug="capital_finder",
    name="Capital Finder",
    description="Returns the capital of a given country"
)
async def capital_finder(country: str):
    """
    Your application logic goes here.
    For this example, we'll use a simple dictionary lookup.
    """
    capitals = {
        "Germany": "Berlin",
        "France": "Paris",
        "Spain": "Madrid",
        "Italy": "Rome",
    }
    return capitals.get(country, "Unknown")
```

The function receives parameters from your test data. In this case, it gets `country` from the testcase and returns the capital city.
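
If you want to sanity-check the application before wiring it into an evaluation, you can usually call it directly. This is a minimal sketch, assuming the `@ag.application` decorator leaves the underlying async function callable with its original signature; it is a smoke test, not part of the evaluation run:

```python
import asyncio

# Hypothetical smoke test: call the decorated function directly,
# assuming @ag.application preserves the original call signature.
answer = asyncio.run(capital_finder(country="Germany"))
print(answer)  # Expected: "Berlin"
```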

## Step 3: Create an Evaluator

An evaluator checks if your application's output is correct. Use the `@ag.evaluator` decorator:

```python
@ag.evaluator(
    slug="exact_match",
    name="Exact Match Evaluator",
    description="Checks if the output exactly matches the expected answer"
)
async def exact_match(capital: str, outputs: str):
    """
    Compare the application's output to the expected answer.

    Args:
        capital: The expected answer from the testcase
        outputs: What your application returned

    Returns:
        A dictionary with score and success flag
    """
    is_correct = outputs == capital
    return {
        "score": 1.0 if is_correct else 0.0,
        "success": is_correct,
    }
```

The evaluator receives two types of inputs:
- Fields from your testcase (like `capital`)
- The application's output (always called `outputs`)
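
Not every check needs a reference answer from the testcase. As a minimal sketch (reusing the `@ag.evaluator` pattern above; the `non_empty_output` slug and logic are illustrative, and this assumes an evaluator may declare only `outputs`), an evaluator can look at the output alone:

```python
@ag.evaluator(
    slug="non_empty_output",  # hypothetical slug, for illustration only
    name="Non-Empty Output",
)
async def non_empty_output(outputs: str):
    # Uses only the application output; no testcase fields are required.
    has_answer = outputs not in ("", "Unknown")
    return {
        "score": 1.0 if has_answer else 0.0,
        "success": has_answer,
    }
```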

## Step 4: Create Test Data

Define your test cases as a list of dictionaries:

```python
test_data = [
    {"country": "Germany", "capital": "Berlin"},
    {"country": "France", "capital": "Paris"},
    {"country": "Spain", "capital": "Madrid"},
    {"country": "Italy", "capital": "Rome"},
]
```

Each dictionary represents one test case. The keys become parameters that your application and evaluators can access.

## Step 5: Run the Evaluation

Import the evaluation functions and run your test:

```python
import asyncio
from agenta.sdk.evaluations import aevaluate

async def run_evaluation():
    # Create a testset from your data
    testset = await ag.testsets.acreate(
        name="Country Capitals",
        data=test_data,
    )

    # Run evaluation
    result = await aevaluate(
        testsets=[testset.id],
        applications=[capital_finder],
        evaluators=[exact_match],
    )

    return result

# Run the evaluation
if __name__ == "__main__":
    eval_result = asyncio.run(run_evaluation())
    print("Evaluation complete!")
```

## Complete Example

Here's the full code in one place:

```python
import asyncio
import agenta as ag
from agenta.sdk.evaluations import aevaluate

# Initialize SDK
ag.init()

# Define test data
test_data = [
    {"country": "Germany", "capital": "Berlin"},
    {"country": "France", "capital": "Paris"},
    {"country": "Spain", "capital": "Madrid"},
    {"country": "Italy", "capital": "Rome"},
]

# Create application
@ag.application(
    slug="capital_finder",
    name="Capital Finder",
)
async def capital_finder(country: str):
    capitals = {
        "Germany": "Berlin",
        "France": "Paris",
        "Spain": "Madrid",
        "Italy": "Rome",
    }
    return capitals.get(country, "Unknown")

# Create evaluator
@ag.evaluator(
    slug="exact_match",
    name="Exact Match",
)
async def exact_match(capital: str, outputs: str):
    is_correct = outputs == capital
    return {
        "score": 1.0 if is_correct else 0.0,
        "success": is_correct,
    }

# Run evaluation
async def main():
    testset = await ag.testsets.acreate(
        name="Country Capitals",
        data=test_data,
    )

    result = await aevaluate(
        testsets=[testset.id],
        applications=[capital_finder],
        evaluators=[exact_match],
    )

    print("Evaluation complete!")

if __name__ == "__main__":
    asyncio.run(main())
```

## Understanding the Data Flow

When you run an evaluation, here's what happens:

1. **Testcase data** flows to the application
- Input: `{"country": "Germany", "capital": "Berlin"}`
- Application receives: `country="Germany"`
- Application returns: `"Berlin"`

2. **Both testcase data and application output** flow to the evaluator
- Evaluator receives: `capital="Berlin"` (expected answer from testcase)
- Evaluator receives: `outputs="Berlin"` (what the application returned)
- Evaluator compares them and returns: `{"score": 1.0, "success": True}`

3. **Results are collected** and stored in Agenta
- You can view them in the web interface
- Or access them programmatically from the result object
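
To make the flow concrete, here is a minimal sketch that traces a single testcase through the two functions defined above by calling them directly. It assumes the decorators keep the underlying async functions callable with their original signatures; in a real run, `aevaluate` handles this wiring for you:

```python
import asyncio

async def trace_one_testcase():
    testcase = {"country": "Germany", "capital": "Berlin"}

    # 1. The application receives the testcase fields it declares as parameters
    answer = await capital_finder(country=testcase["country"])

    # 2. The evaluator receives testcase fields plus the application output
    verdict = await exact_match(capital=testcase["capital"], outputs=answer)

    print(answer)   # "Berlin"
    print(verdict)  # {"score": 1.0, "success": True}

asyncio.run(trace_one_testcase())
```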

## Next Steps

Now that you've created your first evaluation, you can:

- Learn how to [configure custom evaluators](/evaluation/evaluation-from-sdk/configuring-evaluators) with different scoring logic
- Explore [built-in evaluators](/evaluation/evaluation-from-sdk/configuring-evaluators#built-in-evaluators) like LLM-as-a-judge
- Understand how to [configure your application](/evaluation/evaluation-from-sdk/configuring-applications) for different use cases
- Run [multiple evaluators](/evaluation/evaluation-from-sdk/running-evaluations) in a single evaluation

## Common Patterns

### Using Multiple Evaluators

You can run several evaluators on the same application:

```python
result = await aevaluate(
    testsets=[testset.id],
    applications=[capital_finder],
    evaluators=[
        exact_match,
        case_insensitive_match,
        similarity_check,
    ],
)
```

Each evaluator runs independently and produces its own scores.
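
For example, the `case_insensitive_match` evaluator referenced above could be defined with the same decorator pattern shown earlier (a sketch; the slug and comparison logic here are illustrative):

```python
@ag.evaluator(
    slug="case_insensitive_match",  # illustrative definition of the evaluator used above
    name="Case-Insensitive Match",
)
async def case_insensitive_match(capital: str, outputs: str):
    # Same idea as exact_match, but ignoring letter case and surrounding whitespace
    is_correct = outputs.strip().lower() == capital.strip().lower()
    return {
        "score": 1.0 if is_correct else 0.0,
        "success": is_correct,
    }
```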

### Accessing Additional Test Data

Your evaluators can access any field from the testcase:

```python
@ag.evaluator(slug="region_aware")
async def region_aware(country: str, region: str, outputs: str):
# You can access multiple fields from the testcase
# and use them in your evaluation logic
pass
```

### Returning Multiple Metrics

Evaluators can return multiple scores:

```python
@ag.evaluator(slug="detailed_eval")
async def detailed_eval(expected: str, outputs: str):
return {
"exact_match": 1.0 if outputs == expected else 0.0,
"length_diff": abs(len(outputs) - len(expected)),
"success": outputs == expected,
}
```

## Getting Help

If you run into issues:
- Join our [Discord community](https://discord.gg/agenta)
- Open an issue on [GitHub](https://github.com/agenta-ai/agenta)