Merged
16 commits
622aed4
Add new category for 'Evaluation from SDK' in JSON configuration
mmabrouk Nov 12, 2025
e4680e0
Add quick start guide and example notebook for Agenta SDK evaluations
mmabrouk Nov 12, 2025
b27daa8
Update tutorial links and add new Jupyter notebooks for capturing use…
mmabrouk Nov 12, 2025
5ec8ccf
Update .gitignore to include all files and directories in the tests/d…
mmabrouk Nov 12, 2025
37b5bde
Add documentation for managing testsets in Agenta SDK
mmabrouk Nov 12, 2025
d974709
Remove Jupyter notebook for evaluations with SDK, consolidating docum…
mmabrouk Nov 12, 2025
31d2bd7
Add documentation for configuring evaluators in Agenta SDK
mmabrouk Nov 12, 2025
c169100
Add documentation for configuring applications in Agenta SDK
mmabrouk Nov 12, 2025
f963c72
Add documentation for running evaluations programmatically from the SDK
mmabrouk Nov 12, 2025
1141565
Update quick start guide to include troubleshooting resources for eva…
mmabrouk Nov 12, 2025
769b5c7
Enhance quick start guide for Agenta SDK evaluations by adding OpenAI…
mmabrouk Nov 12, 2025
ed0cfdc
Update quick start guide to include OpenAI API key setup for LLM-as-a…
mmabrouk Nov 12, 2025
ce934da
Enhance quick start guide by adding evaluation name and description p…
mmabrouk Nov 12, 2025
bd8d5e1
Refactor testset management documentation by removing description par…
mmabrouk Nov 12, 2025
ad7b387
Add documentation for managing testsets and configuring evaluators in…
mmabrouk Nov 12, 2025
74e6e0d
Refactor quick start guide by removing redundant header and improving…
mmabrouk Nov 12, 2025
3 changes: 2 additions & 1 deletion .gitignore
@@ -47,4 +47,5 @@ sdk/agenta/templates/agenta.py
web/ee/public/__env.js
web/oss/public/__env.js

web/oss/tests/datalayer/results
web/oss/tests/datalayer/results
.*
1 change: 1 addition & 0 deletions api/ee/tests/manual/evaluations/sdk/quick_start.py
@@ -169,6 +169,7 @@ async def run_evaluation():
    # Run evaluation
    print("Running evaluation...")
    eval_result = await aevaluate(
        name="My First Eval",
        testsets=[my_testset.id],
        applications=[capital_quiz_app],
        evaluators=[
2 changes: 1 addition & 1 deletion docs/blog/entries/annotate-your-llm-response-preview.mdx
@@ -20,7 +20,7 @@ This is useful to:
- Run custom evaluation workflows
- Measure application performance in real-time

Check out the how to [annotate traces from API](/observability/trace-with-python-sdk/annotate-traces) for more details. Or try our new tutorial (available as [jupyter notebook](https://github.com/Agenta-AI/agenta/blob/main/examples/jupyter/capture_user_feedback.ipynb)) [here](/tutorials/cookbooks/capture-user-feedback).
Check out the how to [annotate traces from API](/observability/trace-with-python-sdk/annotate-traces) for more details. Or try our new tutorial (available as [jupyter notebook](https://github.com/Agenta-AI/agenta/blob/main/examples/jupyter/observability/capture_user_feedback.ipynb)) [here](/tutorials/cookbooks/capture-user-feedback).

<Image
style={{
296 changes: 296 additions & 0 deletions docs/docs/evaluation/evaluation-from-sdk/01-quick-start.mdx
@@ -0,0 +1,296 @@
---
title: "Quick Start"
sidebar_label: "Quick Start"
description: "Learn how to run evaluations programmatically with the Agenta SDK in under 5 minutes"
sidebar_position: 1
---

import GoogleColabButton from "@site/src/components/GoogleColabButton";

This guide shows you how to create your first evaluation using the Agenta SDK. You'll build a simple application that answers geography questions, then create evaluators to check if the answers are correct.

<GoogleColabButton notebookPath="examples/jupyter/evaluation/quick-start.ipynb">
Open in Google Colaboratory
</GoogleColabButton>

## What You'll Build

By the end of this guide, you'll have:
- An application that returns country capitals
- Two evaluators that check if answers are correct
- A complete evaluation run with results

The entire example takes less than 100 lines of code.

## Prerequisites

Install the Agenta SDK:

```bash
pip install agenta
```

Set your environment variables:

```bash
export AGENTA_API_KEY="your-api-key"
export AGENTA_HOST="https://cloud.agenta.ai"
export OPENAI_API_KEY="your-openai-api-key" # Required for LLM-as-a-judge evaluator
```

## Step 1: Initialize Agenta

Create a new Python file and initialize the SDK:

```python
import agenta as ag

ag.init()
```

## Step 2: Create Your Application

An application is any function that processes inputs and returns outputs. Use the `@ag.application` decorator to mark your function:

```python
@ag.application(
    slug="capital_finder",
    name="Capital Finder",
    description="Returns the capital of a given country"
)
async def capital_finder(country: str):
    """
    Your application logic goes here.
    For this example, we'll use a simple dictionary lookup.
    """
    capitals = {
        "Germany": "Berlin",
        "France": "Paris",
        "Spain": "Madrid",
        "Italy": "Rome",
    }
    return capitals.get(country, "Unknown")
```

The function receives parameters from your test data. In this case, it gets `country` from the testcase and returns the capital city.
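
If you want to sanity-check the application before wiring it into an evaluation, you can usually call it directly. This is a minimal sketch, assuming the `@ag.application` decorator leaves the underlying async function callable with its original signature; it is a smoke test, not part of the evaluation run:

```python
import asyncio

# Hypothetical smoke test: call the decorated function directly,
# assuming @ag.application preserves the original call signature.
answer = asyncio.run(capital_finder(country="Germany"))
print(answer)  # Expected: "Berlin"
```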

## Step 3: Create an Evaluator

An evaluator checks if your application's output is correct. Use the `@ag.evaluator` decorator:

```python
@ag.evaluator(
    slug="exact_match",
    name="Exact Match Evaluator",
    description="Checks if the output exactly matches the expected answer"
)
async def exact_match(capital: str, outputs: str):
    """
    Compare the application's output to the expected answer.

    Args:
        capital: The expected answer from the testcase
        outputs: What your application returned

    Returns:
        A dictionary with score and success flag
    """
    is_correct = outputs == capital
    return {
        "score": 1.0 if is_correct else 0.0,
        "success": is_correct,
    }
```

The evaluator receives two types of inputs:
- Fields from your testcase (like `capital`)
- The application's output (always called `outputs`)
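
Not every check needs a reference answer from the testcase. As a minimal sketch (reusing the `@ag.evaluator` pattern above; the `non_empty_output` slug and logic are illustrative, and this assumes an evaluator may declare only `outputs`), an evaluator can look at the output alone:

```python
@ag.evaluator(
    slug="non_empty_output",  # hypothetical slug, for illustration only
    name="Non-Empty Output",
)
async def non_empty_output(outputs: str):
    # Uses only the application output; no testcase fields are required.
    has_answer = outputs not in ("", "Unknown")
    return {
        "score": 1.0 if has_answer else 0.0,
        "success": has_answer,
    }
```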

## Step 4: Create Test Data

Define your test cases as a list of dictionaries:

```python
test_data = [
    {"country": "Germany", "capital": "Berlin"},
    {"country": "France", "capital": "Paris"},
    {"country": "Spain", "capital": "Madrid"},
    {"country": "Italy", "capital": "Rome"},
]
```

Each dictionary represents one test case. The keys become parameters that your application and evaluators can access.

## Step 5: Run the Evaluation

Import the evaluation functions and run your test:

```python
import asyncio
from agenta.sdk.evaluations import aevaluate

async def run_evaluation():
    # Create a testset from your data
    testset = await ag.testsets.acreate(
        name="Country Capitals",
        data=test_data,
    )

    # Run evaluation
    result = await aevaluate(
        testsets=[testset.id],
        applications=[capital_finder],
        evaluators=[exact_match],
    )

    return result

# Run the evaluation
if __name__ == "__main__":
    eval_result = asyncio.run(run_evaluation())
    print("Evaluation complete!")
```

## Complete Example

Here's the full code in one place:

```python
import asyncio
import agenta as ag
from agenta.sdk.evaluations import aevaluate

# Initialize SDK
ag.init()

# Define test data
test_data = [
    {"country": "Germany", "capital": "Berlin"},
    {"country": "France", "capital": "Paris"},
    {"country": "Spain", "capital": "Madrid"},
    {"country": "Italy", "capital": "Rome"},
]

# Create application
@ag.application(
    slug="capital_finder",
    name="Capital Finder",
)
async def capital_finder(country: str):
    capitals = {
        "Germany": "Berlin",
        "France": "Paris",
        "Spain": "Madrid",
        "Italy": "Rome",
    }
    return capitals.get(country, "Unknown")

# Create evaluator
@ag.evaluator(
    slug="exact_match",
    name="Exact Match",
)
async def exact_match(capital: str, outputs: str):
    is_correct = outputs == capital
    return {
        "score": 1.0 if is_correct else 0.0,
        "success": is_correct,
    }

# Run evaluation
async def main():
    testset = await ag.testsets.acreate(
        name="Country Capitals",
        data=test_data,
    )

    result = await aevaluate(
        testsets=[testset.id],
        applications=[capital_finder],
        evaluators=[exact_match],
    )

    print("Evaluation complete!")

if __name__ == "__main__":
    asyncio.run(main())
```

## Understanding the Data Flow

When you run an evaluation, here's what happens:

1. **Testcase data** flows to the application
- Input: `{"country": "Germany", "capital": "Berlin"}`
- Application receives: `country="Germany"`
- Application returns: `"Berlin"`

2. **Both testcase data and application output** flow to the evaluator
- Evaluator receives: `capital="Berlin"` (expected answer from testcase)
- Evaluator receives: `outputs="Berlin"` (what the application returned)
- Evaluator compares them and returns: `{"score": 1.0, "success": True}`

3. **Results are collected** and stored in Agenta
- You can view them in the web interface
- Or access them programmatically from the result object
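
To make the flow concrete, here is a minimal sketch that traces a single testcase through the two functions defined above by calling them directly. It assumes the decorators keep the underlying async functions callable with their original signatures; in a real run, `aevaluate` handles this wiring for you:

```python
import asyncio

async def trace_one_testcase():
    testcase = {"country": "Germany", "capital": "Berlin"}

    # 1. The application receives the testcase fields it declares as parameters
    answer = await capital_finder(country=testcase["country"])

    # 2. The evaluator receives testcase fields plus the application output
    verdict = await exact_match(capital=testcase["capital"], outputs=answer)

    print(answer)   # "Berlin"
    print(verdict)  # {"score": 1.0, "success": True}

asyncio.run(trace_one_testcase())
```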

## Next Steps

Now that you've created your first evaluation, you can:

- Learn how to [configure custom evaluators](/evaluation/evaluation-from-sdk/configuring-evaluators) with different scoring logic
- Explore [built-in evaluators](/evaluation/evaluation-from-sdk/configuring-evaluators#built-in-evaluators) like LLM-as-a-judge
- Understand how to [configure your application](/evaluation/evaluation-from-sdk/configuring-applications) for different use cases
- Run [multiple evaluators](/evaluation/evaluation-from-sdk/running-evaluations) in a single evaluation

## Common Patterns

### Using Multiple Evaluators

You can run several evaluators on the same application:

```python
result = await aevaluate(
    testsets=[testset.id],
    applications=[capital_finder],
    evaluators=[
        exact_match,
        case_insensitive_match,
        similarity_check,
    ],
)
```

Each evaluator runs independently and produces its own scores.
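
For example, the `case_insensitive_match` evaluator referenced above could be defined with the same decorator pattern shown earlier (a sketch; the slug and comparison logic here are illustrative):

```python
@ag.evaluator(
    slug="case_insensitive_match",  # illustrative definition of the evaluator used above
    name="Case-Insensitive Match",
)
async def case_insensitive_match(capital: str, outputs: str):
    # Same idea as exact_match, but ignoring letter case and surrounding whitespace
    is_correct = outputs.strip().lower() == capital.strip().lower()
    return {
        "score": 1.0 if is_correct else 0.0,
        "success": is_correct,
    }
```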

### Accessing Additional Test Data

Your evaluators can access any field from the testcase:

```python
@ag.evaluator(slug="region_aware")
async def region_aware(country: str, region: str, outputs: str):
# You can access multiple fields from the testcase
# and use them in your evaluation logic
pass
```

### Returning Multiple Metrics

Evaluators can return multiple scores:

```python
@ag.evaluator(slug="detailed_eval")
async def detailed_eval(expected: str, outputs: str):
return {
"exact_match": 1.0 if outputs == expected else 0.0,
"length_diff": abs(len(outputs) - len(expected)),
"success": outputs == expected,
}
```

## Getting Help

If you run into issues:
- Join our [Discord community](https://discord.gg/agenta)
- Open an issue on [GitHub](https://github.com/agenta-ai/agenta)