
Commit dba25c1

Eval/feature/unified eval inputs (#37523)
* mid-progress
* more base eval stuff
* async stuff
* change 1 eval to use base
* fix initial bugs
* apply new base to rai evals
* add reworked evals, and remove content safety chat
* update tests
* renaming evals
* refactor to derive conversation converter, add 2 evals, surface retrieval
* changelog and init
* run black
* pylint
* cspell
* re-skip
* optional cred
* change async passthrough
* test full
* comments
* cspell
* cspell
* run black
* typehint
* fix types for saving
* upddate conftest and recorddigns
* pylint
* fix fluency unit tests
* last new assets
1 parent e1e2e0f commit dba25c1

36 files changed (+1095, -1501 lines)

sdk/evaluation/azure-ai-evaluation/CHANGELOG.md

Lines changed: 14 additions & 0 deletions
@@ -5,9 +5,23 @@
 ### Features Added

 - Added `type` field to `AzureOpenAIModelConfiguration` and `OpenAIModelConfiguration`
+- The following evaluators now support `conversation` as an alternative input to their usual single-turn inputs:
+  - `ViolenceEvaluator`
+  - `SexualEvaluator`
+  - `SelfHarmEvaluator`
+  - `HateUnfairnessEvaluator`
+  - `ProtectedMaterialEvaluator`
+  - `IndirectAttackEvaluator`
+  - `CoherenceEvaluator`
+  - `RelevanceEvaluator`
+  - `FluencyEvaluator`
+  - `GroundednessEvaluator`
+- Surfaced `RetrievalEvaluator`, formerly an internal part of `ChatEvaluator`, as a standalone conversation-only evaluator.

 ### Breaking Changes

+
+- Removed `ContentSafetyChatEvaluator` and `ChatEvaluator`
 - The `evaluator_config` parameter of `evaluate` now maps each evaluator name to a dictionary `EvaluatorConfig`, which is a `TypedDict`. The
   `column_mapping` between `data` or `target` and evaluator field names should now be specified inside this new dictionary:

sdk/evaluation/azure-ai-evaluation/assets.json

Lines changed: 1 addition & 1 deletion
@@ -2,5 +2,5 @@
   "AssetsRepo": "Azure/azure-sdk-assets",
   "AssetsRepoPrefixPath": "python",
   "TagPrefix": "python/evaluation/azure-ai-evaluation",
-  "Tag": "python/evaluation/azure-ai-evaluation_26cf396fa1"
+  "Tag": "python/evaluation/azure-ai-evaluation_051cb9dfbd"
 }

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/__init__.py

Lines changed: 2 additions & 4 deletions
@@ -4,10 +4,8 @@

 from ._evaluate._evaluate import evaluate
 from ._evaluators._bleu import BleuScoreEvaluator
-from ._evaluators._chat import ChatEvaluator
 from ._evaluators._coherence import CoherenceEvaluator
 from ._evaluators._content_safety import (
-    ContentSafetyChatEvaluator,
     ContentSafetyEvaluator,
     HateUnfairnessEvaluator,
     SelfHarmEvaluator,
@@ -22,6 +20,7 @@
 from ._evaluators._protected_material import ProtectedMaterialEvaluator
 from ._evaluators._qa import QAEvaluator
 from ._evaluators._relevance import RelevanceEvaluator
+from ._evaluators._retrieval import RetrievalEvaluator
 from ._evaluators._rouge import RougeScoreEvaluator, RougeType
 from ._evaluators._similarity import SimilarityEvaluator
 from ._evaluators._xpia import IndirectAttackEvaluator
@@ -41,17 +40,16 @@
     "RelevanceEvaluator",
     "SimilarityEvaluator",
     "QAEvaluator",
-    "ChatEvaluator",
     "ViolenceEvaluator",
     "SexualEvaluator",
     "SelfHarmEvaluator",
     "HateUnfairnessEvaluator",
     "ContentSafetyEvaluator",
-    "ContentSafetyChatEvaluator",
     "IndirectAttackEvaluator",
     "BleuScoreEvaluator",
     "GleuScoreEvaluator",
     "MeteorScoreEvaluator",
+    "RetrievalEvaluator",
     "RougeScoreEvaluator",
     "RougeType",
     "ProtectedMaterialEvaluator",

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py

Lines changed: 6 additions & 0 deletions
@@ -158,6 +158,12 @@ def _validate_input_data_for_evaluator(evaluator, evaluator_name, df_data, is_ta
         ]

     missing_inputs = [col for col in required_inputs if col not in df_data.columns]
+    if missing_inputs and "conversation" in required_inputs:
+        non_conversation_inputs = [val for val in required_inputs if val != "conversation"]
+        if len(missing_inputs) == len(non_conversation_inputs) and all(
+            input in non_conversation_inputs for input in missing_inputs
+        ):
+            missing_inputs = []
     if missing_inputs:
         if not is_target_fn:
             msg = f"Missing required inputs for evaluator {evaluator_name} : {missing_inputs}."
