articles/ai-foundry/concepts/observability.md
author: lgayhardt
ms.author: lagayhar
manager: scottpolly
ms.reviewer: mithigpe
ms.date: 07/31/2025
ms.service: azure-ai-foundry
ms.topic: concept-article
ms.custom:
This is where evaluators become essential. These specialized tools measure both the quality and the safety of AI responses.

Evaluators are specialized tools that measure the quality, safety, and reliability of AI responses. By implementing systematic evaluations throughout the AI development lifecycle, teams can identify and address potential issues before they impact users. The following supported evaluators provide comprehensive assessment capabilities across different AI application types and concerns:

### Textual similarity

| Evaluator | Purpose | Inputs |
|--|--|--|
| F1 Score | Harmonic mean of precision and recall in token overlaps between response and ground truth. | Response, ground truth |
| BLEU | Bilingual Evaluation Understudy score for translation quality measures overlaps in n-grams between response and ground truth. | Response, ground truth |
| GLEU | Google-BLEU variant for sentence-level assessment measures overlaps in n-grams between response and ground truth. | Response, ground truth |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation measures overlaps in n-grams between response and ground truth. | Response, ground truth |
| METEOR | Metric for Evaluation of Translation with Explicit Ordering measures overlaps in n-grams between response and ground truth. | Response, ground truth |

To learn more, see [Textual similarity evaluators](./evaluation-evaluators/textual-similarity-evaluators.md).

### Agents (preview)

| Evaluator | Purpose |
|--|--|
| Intent Resolution | Measures how accurately the agent identifies and addresses user intentions. |
| Task Adherence | Measures how well the agent follows through on identified tasks. |
| Tool Call Accuracy | Measures how well the agent selects and calls the correct tools. |
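
The agent evaluators are LLM-judged and ship in preview. A hedged sketch, assuming the preview `IntentResolutionEvaluator` class from `azure-ai-evaluation` and placeholder judge-model values:

```python
from azure.ai.evaluation import IntentResolutionEvaluator

# Placeholder judge-model configuration; supply your own values.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<judge-deployment>",
}

intent = IntentResolutionEvaluator(model_config=model_config)
result = intent(
    query="Book a table for two at 7 PM tonight.",
    response="Done! Your table for two is reserved for 7 PM this evening.",
)
print(result)  # score plus the judge's reasoning
```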

### Retrieval-augmented generation (RAG)

| Evaluator | Purpose | Inputs |
|--|--|--|
| Retrieval | Measures how effectively the system retrieves relevant information. | Query, context |
| Document Retrieval | Measures accuracy in retrieval results given ground truth. | Ground truth, retrieved documents |
| Groundedness | Measures how consistent the response is with respect to the retrieved context. | Query (optional), context, response |
| Groundedness Pro | Measures whether the response is consistent with respect to the retrieved context. | Query, context, response |
| Relevance | Measures how relevant the response is with respect to the query. | Query, response |
| Response Completeness | Measures to what extent the response is complete (not missing critical information) with respect to the ground truth. | Response, ground truth |

To learn more, see [Retrieval-augmented Generation (RAG) evaluators](./evaluation-evaluators/rag-evaluators.md).
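
As an illustration of the inputs each RAG evaluator expects, here's a minimal sketch assuming `azure-ai-evaluation` with a placeholder judge deployment:

```python
from azure.ai.evaluation import GroundednessEvaluator, RetrievalEvaluator

# Placeholder judge-model configuration; supply your own values.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<judge-deployment>",
}

query = "What is the return policy?"
context = "Contoso accepts returns within 30 days with a receipt."

# Groundedness: is the response consistent with the retrieved context?
groundedness = GroundednessEvaluator(model_config)
print(groundedness(
    query=query,  # optional for this evaluator
    context=context,
    response="You can return items within 30 days if you have a receipt.",
))

# Retrieval: did the retrieved context actually address the query?
retrieval = RetrievalEvaluator(model_config)
print(retrieval(query=query, context=context))
```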

### General purpose

| Evaluator | Purpose |
|--|--|
| Fluency | Measures natural language quality and readability. |
| Coherence | Measures logical consistency and flow of responses. |
| QA | Comprehensively measures various quality aspects in question answering. |
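
These quality evaluators are also LLM-judged. A short sketch under the same SDK assumptions as above:

```python
from azure.ai.evaluation import CoherenceEvaluator, FluencyEvaluator

# Placeholder judge-model configuration; supply your own values.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<judge-deployment>",
}

# Fluency judges the readability of the response on its own.
fluency = FluencyEvaluator(model_config)
print(fluency(response="Setup takes about five minutes once the device is plugged in."))

# Coherence judges the logical flow of the response against the query.
coherence = CoherenceEvaluator(model_config)
print(coherence(
    query="How long does setup take?",
    response="Setup takes about five minutes once the device is plugged in.",
))
```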

### Safety and security (preview)

| Evaluator | Purpose | Inputs |
|--|--|--|
| Hate and Unfairness | Identifies biased, discriminatory, or hateful content. | Query, response |
| Sexual | Identifies inappropriate sexual content. | Query, response |

To learn more, see [Safety and security evaluators](./evaluation-evaluators/risk-safety-evaluators.md).
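
Unlike the prompt-based evaluators, the safety evaluators are backed by the Azure AI project service, so they take a project reference and a credential rather than a model config. A hedged sketch with placeholder project values:

```python
from azure.ai.evaluation import HateUnfairnessEvaluator
from azure.identity import DefaultAzureCredential

# Placeholder project reference; supply your own values.
azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}

hate_unfairness = HateUnfairnessEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project=azure_ai_project,
)
result = hate_unfairness(
    query="Describe your hiring process.",
    response="All candidates are assessed against the same criteria.",
)
print(result)  # severity label, numeric score, and reasoning
```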

### Azure OpenAI graders (preview)

| Evaluator | Purpose | Inputs |
|--|--|--|
| Model Labeler | Classifies content using custom guidelines and labels. | Query, response, ground truth |
| Model Scorer | Generates numerical scores (customized range) for content based on custom guidelines. | Query, response, ground truth |
| String Checker | Performs flexible text validations and pattern matching. | Response |
| Text Similarity | Evaluates the quality of text or determines semantic closeness. | Response, ground truth |

To learn more, see [Azure OpenAI Graders](./evaluation-evaluators/azure-openai-graders.md).
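
Graders run over a dataset through the batch `evaluate()` entry point. A hedged sketch of a String Checker, assuming the `AzureOpenAIStringCheckGrader` class from `azure-ai-evaluation` (the `{{item.column}}` templates refer to dataset columns):

```python
from azure.ai.evaluation import AzureOpenAIStringCheckGrader, evaluate

# Placeholder Azure OpenAI configuration; supply your own values.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<deployment>",
}

# Pass/fail check: does each row's response mention "refund"?
mentions_refund = AzureOpenAIStringCheckGrader(
    model_config=model_config,
    input="{{item.response}}",
    name="mentions_refund",
    operation="like",  # eq, ne, like, or ilike
    reference="refund",
)

results = evaluate(
    data="responses.jsonl",  # one {"response": ...} object per line
    evaluators={"mentions_refund": mentions_refund},
)
print(results["metrics"])
```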
### Evaluators in the development lifecycle
By using these evaluators strategically throughout the development lifecycle, teams can build more reliable, safe, and effective AI applications that meet user needs while minimizing potential risks.
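
In practice, applying evaluators strategically often means wiring a few of them into a recurring batch run. A hypothetical CI-style gate, under the same SDK assumption (aggregate metric key names can vary by version):

```python
from azure.ai.evaluation import F1ScoreEvaluator, GroundednessEvaluator, evaluate

# Placeholder judge-model configuration; supply your own values.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<judge-deployment>",
}

results = evaluate(
    data="eval_data.jsonl",  # rows with query, context, response, ground_truth
    evaluators={
        "f1": F1ScoreEvaluator(),
        "groundedness": GroundednessEvaluator(model_config),
    },
)

# Fail the pipeline if mean groundedness drops below a chosen bar.
if results["metrics"].get("groundedness.groundedness", 0) < 3.0:
    raise SystemExit("Groundedness regression: blocking release.")
```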