
Commit c3c924d

Merge pull request #2900 from Agenta-AI/release/v0.62.1
v0.62.1
2 parents 90cd47a + bc39174 commit c3c924d

File tree

14 files changed: +160, -33 lines


api/oss/src/resources/evaluators/evaluators.py

Lines changed: 6 additions & 6 deletions
@@ -229,12 +229,12 @@
         "description": "Extract information from the user's response.",
         "type": "object",
         "properties": {
-            "correctness": {
+            "score": {
                 "type": "boolean",
                 "description": "The grade results",
             }
         },
-        "required": ["correctness"],
+        "required": ["score"],
         "strict": True,
     },
 },
@@ -264,12 +264,12 @@
         "description": "Extract information from the user's response.",
         "type": "object",
         "properties": {
-            "correctness": {
+            "score": {
                 "type": "boolean",
                 "description": "The hallucination detection result",
             }
         },
-        "required": ["correctness"],
+        "required": ["score"],
         "strict": True,
     },
 },
@@ -339,12 +339,12 @@
         "description": "Extract information from the user's response.",
         "type": "object",
        "properties": {
-            "correctness": {
+            "score": {
                 "type": "boolean",
                 "description": "The grade results",
             }
         },
-        "required": ["correctness"],
+        "required": ["score"],
         "strict": True,
     },
 },
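All three hunks apply the same rename: the boolean field the judge returns is now `score` rather than `correctness`, updated in both `properties` and `required`. As a rough sketch, the structured-output schema after the change looks like the following (reconstructed from the hunks above; the variable name and exact nesting level are assumptions, not verbatim source):

```python
# Sketch reconstructed from the hunks above; the variable name and the
# exact nesting level are assumptions, not verbatim from evaluators.py.
judge_output_schema = {
    "description": "Extract information from the user's response.",
    "type": "object",
    "properties": {
        "score": {  # renamed from "correctness"
            "type": "boolean",
            "description": "The grade results",
        }
    },
    "required": ["score"],  # must stay in sync with the property name
    "strict": True,
}
```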

api/oss/src/services/converters.py

Lines changed: 0 additions & 1 deletion
@@ -13,7 +13,6 @@
     HumanEvaluationScenario,
     EvaluationScenarioOutput,
 )
-from oss.src.services import db_manager
 from oss.src.models.db_models import (
     EvaluationDB,
     HumanEvaluationDB,

api/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [project]
 name = "api"
-version = "0.62.0"
+version = "0.62.1"
 description = "Agenta API"
 authors = [
     { name = "Mahmoud Mabrouk", email = "mahmoud@agenta.ai" },
Lines changed: 71 additions & 0 deletions (new file)

@@ -0,0 +1,71 @@
+---
+title: "Customize LLM-as-a-Judge Output Schemas"
+slug: customize-llm-as-a-judge-output-schemas
+date: 2025-11-10
+tags: [v0.62.0]
+description: "Learn how to customize LLM-as-a-Judge evaluator output schemas with binary, multiclass, or custom JSON formats. Enable reasoning for better evaluation quality and structure feedback to match your workflow needs."
+---
+
+import Image from "@theme/IdealImage";
+
+The LLM-as-a-Judge evaluator now supports custom output schemas. You can define exactly what feedback structure you need for your evaluations.
+
+
+<div style={{display: 'flex', justifyContent: 'center', gap: '24px', margin: '20px 0'}}>
+  <Image
+    img={require('/static/images/changelog/changelog-llm-as-a-judge-response-1.png')}
+    alt="Custom output schemas in LLM-as-a-Judge - Example 1"
+    style={{width: '48%', minWidth: 0}}
+  />
+  <Image
+    img={require('/static/images/changelog/changelog-llm-as-a-judge-response-2.png')}
+    alt="Custom output schemas in LLM-as-a-Judge - Example 2"
+    style={{width: '48%', minWidth: 0}}
+  />
+</div>
+
+## What's New
+
+### **Flexible Output Types**
+Configure the evaluator to return different types of outputs:
+- **Binary**: Return a simple yes/no or pass/fail score
+- **Multiclass**: Choose from multiple predefined categories
+- **Custom JSON**: Define any structure that fits your use case
+
+### **Include Reasoning for Better Quality**
+Enable the reasoning option to have the LLM explain its evaluation. This improves prediction quality because the model thinks through its assessment before providing a score.
+
+When you include reasoning, the evaluator returns both the score and a detailed explanation of how it arrived at that judgment.
+
+### **Advanced: Raw JSON Schema**
+For complete control, provide a raw JSON schema. The evaluator will return responses that match your exact structure.
+
+This lets you capture multiple scores, categorical labels, confidence levels, and custom fields in a single evaluation pass. You can structure the output however your workflow requires.
+
+### **Use Custom Schemas in Evaluation**
+Once configured, your custom schemas work seamlessly in the evaluation workflow. The results display in the evaluation dashboard with all your custom fields visible.
+
+This makes it easy to analyze multiple dimensions of quality in a single evaluation run.
+
+## Example Use Cases
+
+**Binary Score with Reasoning:**
+Return a simple correct/incorrect judgment along with an explanation of why the output succeeded or failed.
+
+**Multi-dimensional Feedback:**
+Capture separate scores for accuracy, relevance, completeness, and tone in one evaluation. Include reasoning for each dimension.
+
+**Structured Classification:**
+Return categorical labels (excellent/good/fair/poor) along with specific issues found and suggestions for improvement.
+
+## Getting Started
+
+To use custom output schemas with LLM-as-a-Judge:
+
+1. Open the evaluator configuration
+2. Select your desired output type (binary, multiclass, or custom)
+3. Enable reasoning if you want explanations
+4. For advanced use, provide your JSON schema
+5. Run your evaluation
+
+Learn more in the [LLM-as-a-Judge documentation](/evaluation/configure-evaluators/llm-as-a-judge).
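To make the "Multi-dimensional Feedback" use case above concrete, a raw JSON schema for it might look like the sketch below (the field names and 1-5 ranges are illustrative assumptions, not part of the release):

```json
{
  "type": "object",
  "properties": {
    "accuracy": { "type": "integer", "minimum": 1, "maximum": 5 },
    "relevance": { "type": "integer", "minimum": 1, "maximum": 5 },
    "completeness": { "type": "integer", "minimum": 1, "maximum": 5 },
    "tone": { "type": "integer", "minimum": 1, "maximum": 5 },
    "reasoning": { "type": "string" }
  },
  "required": ["accuracy", "relevance", "completeness", "tone", "reasoning"]
}
```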

docs/blog/main.mdx

Lines changed: 27 additions & 0 deletions
@@ -10,6 +10,33 @@ import Image from "@theme/IdealImage";

 <section class="changelog">

+### [Customize LLM-as-a-Judge Output Schemas](/changelog/customize-llm-as-a-judge-output-schemas)
+
+_10 November 2025_
+
+**v0.62.0**
+
+<div style={{display: 'flex', justifyContent: 'center', gap: '24px', margin: '20px 0'}}>
+  <Image
+    img={require('/static/images/changelog/changelog-llm-as-a-judge-response-1.png')}
+    alt="Custom output schemas in LLM-as-a-Judge - Example 1"
+    style={{width: '48%', minWidth: 0}}
+  />
+  <Image
+    img={require('/static/images/changelog/changelog-llm-as-a-judge-response-2.png')}
+    alt="Custom output schemas in LLM-as-a-Judge - Example 2"
+    style={{width: '48%', minWidth: 0}}
+  />
+</div>
+
+The LLM-as-a-Judge evaluator now supports custom output schemas. Create multiple feedback outputs per evaluator with any structure you need.
+
+You can configure output types (binary, multiclass), include reasoning to improve prediction quality, or provide a raw JSON schema with any structure you define. Use these custom schemas in your evaluations to capture exactly the feedback you need.
+
+Learn more in the [LLM-as-a-Judge documentation](/evaluation/configure-evaluators/llm-as-a-judge).
+
+---
+
 ### [Documentation Overhaul](/changelog/documentation-architecture-overhaul)

 _3 November 2025_

docs/docs/evaluation/configure-evaluators/05-llm-as-a-judge.mdx

Lines changed: 27 additions & 1 deletion
@@ -2,6 +2,8 @@
 title: "LLM-as-a-Judge"
 ---

+import Image from "@theme/IdealImage";
+
 LLM-as-a-Judge is an evaluator that uses an LLM to assess LLM outputs. It's particularly useful for evaluating text generation tasks or chatbots where there's no single correct answer.

 ![Configuration of LLM-as-a-judge](/images/evaluation/configure-evaluators-3.png)
@@ -56,4 +58,28 @@ ANSWER ONLY THE SCORE. DO NOT USE MARKDOWN. DO NOT PROVIDE ANYTHING OTHER THAN T

 ### The Model

-The model can be configured to select one of the supported options (`gpt-3.5-turbo`, `gpt-4o`, `gpt-5`, `gpt-5-mini`, `gpt-5-nano`, `claude-3-5-sonnet`, `claude-3-5-haiku`, `claude-3-5-opus`). To use LLM-as-a-Judge, you'll need to set your OpenAI or Anthropic API key in the settings. The key is saved locally and only sent to our servers for evaluation—it's not stored there.
+The model can be configured to select one of the supported options (`gpt-4o`, `gpt-5`, `gpt-5-mini`, `gpt-5-nano`, `claude-3-5-sonnet`, `claude-3-5-haiku`, `claude-3-5-opus`). To use LLM-as-a-Judge, you'll need to set your OpenAI or Anthropic API key in the settings. The key is saved locally and only sent to our servers for evaluation; it's not stored there.
+
+### Output Schema
+
+You can configure the output schema to control what the LLM evaluator returns. This allows you to get structured feedback tailored to your evaluation needs.
+
+#### Basic Configuration
+
+The basic configuration lets you choose from common output types:
+
+- **Binary**: Returns a simple pass/fail or yes/no judgment
+- **Multiclass**: Returns a classification from a predefined set of categories
+- **Continuous**: Returns a score between a minimum and maximum value
+
+You can also enable **Include Reasoning** to have the evaluator explain its judgment. This option significantly improves the quality of evaluations by making the LLM's decision process transparent.
+
+<Image img={require('/static/images/changelog/changelog-llm-as-a-judge-response-1.png')} alt="Basic output schema configuration" style={{display: 'block', margin: '20px auto', textAlign: 'center'}} />
+
+
+#### Advanced Configuration
+
+For complete control, you can provide a custom JSON schema. This lets you define any output structure you need. For example, you could return multiple scores, confidence levels, detailed feedback categories, or any combination of fields.
+
+
+<Image img={require('/static/images/changelog/changelog-llm-as-a-judge-response-2.png')} alt="Advanced output schema configuration" style={{display: 'block', margin: '20px auto', textAlign: 'center'}} />
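As an illustration of the advanced configuration described above, a custom schema for a binary judgment with reasoning could look like this sketch (the field names are illustrative, not prescribed by the docs; placing `reasoning` before `score` encourages the model to explain before it grades):

```json
{
  "type": "object",
  "properties": {
    "reasoning": {
      "type": "string",
      "description": "Step-by-step explanation of the judgment"
    },
    "score": {
      "type": "boolean",
      "description": "Whether the output passes"
    }
  },
  "required": ["reasoning", "score"]
}
```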
Two new binary images added for the changelog (66.7 KB and 105 KB): changelog-llm-as-a-judge-response-1.png, changelog-llm-as-a-judge-response-2.png

sdk/agenta/sdk/workflows/handlers.py

Lines changed: 8 additions & 4 deletions
@@ -511,20 +511,24 @@ def field_match_test_v0(
     correct_answer = inputs[correct_answer_key]

     if not isinstance(outputs, str) and not isinstance(outputs, dict):
-        raise InvalidOutputsV0Error(expected=["dict", "str"], got=outputs)
+        # raise InvalidOutputsV0Error(expected=["dict", "str"], got=outputs)
+        return {"success": False}

     outputs_dict = outputs
     if isinstance(outputs, str):
         try:
             outputs_dict = loads(outputs)
         except json.JSONDecodeError as e:
-            raise InvalidOutputsV0Error(expected="dict", got=outputs) from e
+            # raise InvalidOutputsV0Error(expected="dict", got=outputs) from e
+            return {"success": False}

     if not isinstance(outputs_dict, dict):
-        raise InvalidOutputsV0Error(expected=["dict", "str"], got=outputs)
+        # raise InvalidOutputsV0Error(expected=["dict", "str"], got=outputs)
+        return {"success": False}

     if not json_field in outputs_dict:
-        raise MissingOutputV0Error(path=json_field)
+        # raise MissingOutputV0Error(path=json_field)
+        return {"success": False}

     # --------------------------------------------------------------------------
     success = outputs_dict[json_field] == correct_answer
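The net effect of this hunk is that `field_match_test_v0` no longer raises on malformed evaluator outputs; every validation failure now degrades to a failed match. A minimal sketch of the resulting control flow (simplified signature; the real function takes more parameters and pulls `correct_answer` from `inputs`):

```python
import json


def field_match_sketch(outputs, json_field: str, correct_answer) -> dict:
    # Non-str/non-dict outputs used to raise InvalidOutputsV0Error.
    if not isinstance(outputs, (str, dict)):
        return {"success": False}
    # String outputs must parse as JSON; parse errors used to raise too.
    if isinstance(outputs, str):
        try:
            outputs = json.loads(outputs)
        except json.JSONDecodeError:
            return {"success": False}
    # The parsed value must be an object containing the target field
    # (previously InvalidOutputsV0Error / MissingOutputV0Error).
    if not isinstance(outputs, dict) or json_field not in outputs:
        return {"success": False}
    return {"success": outputs[json_field] == correct_answer}
```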

sdk/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "agenta"
-version = "0.62.0"
+version = "0.62.1"
 description = "The SDK for agenta is an open-source LLMOps platform."
 readme = "README.md"
 authors = [
