Commit 5952713

Merge pull request #6352 from MicrosoftDocs/main

Auto Publish – main to live - 2025-08-02 05:01 UTC

2 parents 8d9fce9 + 2ead236 · commit 5952713
File tree: 1 file changed (+57 -49 lines)
articles/ai-foundry/concepts/observability.md

Lines changed: 57 additions & 49 deletions
@@ -6,7 +6,7 @@ author: lgayhardt
 ms.author: lagayhar
 manager: scottpolly
 ms.reviewer: mithigpe
-ms.date: 05/19/2025
+ms.date: 07/31/2025
 ms.service: azure-ai-foundry
 ms.topic: concept-article
 ms.custom:
@@ -32,71 +32,79 @@ This is where evaluators become essential. These specialized tools measure both

 Evaluators are specialized tools that measure the quality, safety, and reliability of AI responses. By implementing systematic evaluations throughout the AI development lifecycle, teams can identify and address potential issues before they impact users. The following supported evaluators provide comprehensive assessment capabilities across different AI application types and concerns:

+### General purpose

-[**RAG (Retrieval Augmented Generation)**:](./evaluation-evaluators/rag-evaluators.md)
+| Evaluator | Purpose | Inputs |
+|--|--|--|
+| Coherence | Measures logical consistency and flow of responses. | Query, response |
+| Fluency | Measures natural language quality and readability. | Response |
+| QA | Comprehensively measures various quality aspects of question answering. | Query, context, response, ground truth |

-| Evaluator | Purpose |
-|--|--|
-| Retrieval | Measures how effectively the system retrieves relevant information. |
-| Document Retrieval | Measures accuracy in retrieval results given ground truth. |
-| Groundedness | Measures how consistent the response is with respect to the retrieved context. |
-| Groundedness Pro | Measures whether the response is consistent with respect to the retrieved context. |
-| Relevance | Measures how relevant the response is with respect to the query. |
-| Response Completeness | Measures to what extent the response is complete (not missing critical information) with respect to the ground truth. |
+To learn more, see [General purpose evaluators](./evaluation-evaluators/general-purpose-evaluators.md).

+### Textual similarity

-[**Agents (preview):**](./evaluation-evaluators/agent-evaluators.md)
+| Evaluator | Purpose | Inputs |
+|--|--|--|
+| Similarity | AI-assisted textual similarity measurement. | Query, context, ground truth |
+| F1 Score | Harmonic mean of precision and recall in token overlaps between response and ground truth. | Response, ground truth |
+| BLEU | Bilingual Evaluation Understudy score for translation quality; measures n-gram overlaps between response and ground truth. | Response, ground truth |
+| GLEU | Google-BLEU variant for sentence-level assessment; measures n-gram overlaps between response and ground truth. | Response, ground truth |
+| ROUGE | Recall-Oriented Understudy for Gisting Evaluation; measures n-gram overlaps between response and ground truth. | Response, ground truth |
+| METEOR | Metric for Evaluation of Translation with Explicit Ordering; measures n-gram overlaps between response and ground truth. | Response, ground truth |

-| Evaluator | Purpose |
-|--|--|
-| Intent Resolution | Measures how accurately the agent identifies and addresses user intentions.|
-| Task Adherence | Measures how well the agent follows through on identified tasks. |
-| Tool Call Accuracy | Measures how well the agent selects and calls the correct tools to.|
+To learn more, see [Textual similarity evaluators](./evaluation-evaluators/textual-similarity-evaluators.md).

+### RAG (retrieval augmented generation)

-[**General Purpose:**](./evaluation-evaluators/general-purpose-evaluators.md)
+| Evaluator | Purpose | Inputs |
+|--|--|--|
+| Retrieval | Measures how effectively the system retrieves relevant information. | Query, context |
+| Document Retrieval | Measures accuracy in retrieval results given ground truth. | Ground truth, retrieved documents |
+| Groundedness | Measures how consistent the response is with respect to the retrieved context. | Query (optional), context, response |
+| Groundedness Pro | Measures whether the response is consistent with respect to the retrieved context. | Query, context, response |
+| Relevance | Measures how relevant the response is with respect to the query. | Query, response |
+| Response Completeness | Measures to what extent the response is complete (not missing critical information) with respect to the ground truth. | Response, ground truth |

-| Evaluator | Purpose |
-|--|--|
-| Fluency | Measures natural language quality and readability. |
-| Coherence | Measures logical consistency and flow of responses.|
-| QA | Measures comprehensively various quality aspects in question-answering.|
+To learn more, see [Retrieval-augmented generation (RAG) evaluators](./evaluation-evaluators/rag-evaluators.md).

+### Safety and security (preview)

-[**Safety and Security (preview):**](./evaluation-evaluators/risk-safety-evaluators.md)
+| Evaluator | Purpose | Inputs |
+|--|--|--|
+| Hate and Unfairness | Identifies biased, discriminatory, or hateful content. | Query, response |
+| Sexual | Identifies inappropriate sexual content. | Query, response |
+| Violence | Detects violent content or incitement. | Query, response |
+| Self-Harm | Detects content promoting or describing self-harm. | Query, response |
+| Content Safety | Comprehensive assessment of various safety concerns. | Query, response |
+| Protected Materials | Detects unauthorized use of copyrighted or protected content. | Query, response |
+| Code Vulnerability | Identifies security issues in generated code. | Query, response |
+| Ungrounded Attributes | Detects fabricated or hallucinated information inferred from user interactions. | Query, context, response |

-| Evaluator | Purpose |
-|--|--|
-| Violence | Detects violent content or incitement. |
-| Sexual | Identifies inappropriate sexual content. |
-| Self-Harm | Detects content promoting or describing self-harm.|
-| Hate and Unfairness | Identifies biased, discriminatory, or hateful content. |
-| Ungrounded Attributes | Detects fabricated or hallucinated information inferred from user interactions. |
-| Code Vulnerability | Identifies security issues in generated code. |
-| Protected Materials | Detects unauthorized use of copyrighted or protected content. |
-| Content Safety | Comprehensive assessment of various safety concerns. |
+To learn more, see [Risk and safety evaluators](./evaluation-evaluators/risk-safety-evaluators.md).

+### Agents (preview)

-[**Textual Similarity:**](./evaluation-evaluators/textual-similarity-evaluators.md)
+| Evaluator | Purpose | Inputs |
+|--|--|--|
+| Intent Resolution | Measures how accurately the agent identifies and addresses user intentions. | Query, response |
+| Task Adherence | Measures how well the agent follows through on identified tasks. | Query, response, tool definitions (optional) |
+| Tool Call Accuracy | Measures how well the agent selects and calls the correct tools. | Query, either response or tool calls, tool definitions |

-| Evaluator | Purpose |
-|--|--|
-| Similarity | AI-assisted textual similarity measurement. |
-| F1 Score | Harmonic mean of precision and recall in token overlaps between response and ground truth. |
-| BLEU | Bilingual Evaluation Understudy score for translation quality measures overlaps in n-grams between response and ground truth. |
-| GLEU | Google-BLEU variant for sentence-level assessment measures overlaps in n-grams between response and ground truth. |
-| ROUGE | Recall-Oriented Understudy for Gisting Evaluation measures overlaps in n-grams between response and ground truth. |
-| METEOR | Metric for Evaluation of Translation with Explicit Ordering measures overlaps in n-grams between response and ground truth. |
+To learn more, see [Agent evaluators](./evaluation-evaluators/agent-evaluators.md).

+### Azure OpenAI graders (preview)

-[**Azure OpenAI Graders (preview):**](./evaluation-evaluators/azure-openai-graders.md)
+| Evaluator | Purpose | Inputs |
+|--|--|--|
+| Model Labeler | Classifies content using custom guidelines and labels. | Query, response, ground truth |
+| String Checker | Performs flexible text validations and pattern matching. | Response |
+| Text Similarity | Evaluates the quality of text or determines semantic closeness. | Response, ground truth |
+| Model Scorer | Generates numerical scores (customized range) for content based on custom guidelines. | Query, response, ground truth |

-| Evaluator | Purpose |
-|--|--|
-| Model Labeler | Classifies content using custom guidelines and labels. |
-| Model Scorer | Generates numerical scores (customized range) for content based on custom guidelines. |
-| String Checker | Performs flexible text validations and pattern matching. |
-| Textual Similarity | Evaluates the quality of text or determine semantic closeness. |
+To learn more, see [Azure OpenAI Graders](./evaluation-evaluators/azure-openai-graders.md).
+
+### Evaluators in the development lifecycle

 By using these evaluators strategically throughout the development lifecycle, teams can build more reliable, safe, and effective AI applications that meet user needs while minimizing potential risks.

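For readers landing on this diff without the surrounding article: the built-in evaluators listed in the tables above are exposed programmatically through the Azure AI Evaluation SDK for Python. The following is a minimal sketch of scoring a single query/response pair with two of the AI-assisted quality evaluators. It assumes the `azure-ai-evaluation` package is installed, and the endpoint, key, and deployment values are placeholders, so treat the exact wiring as illustrative rather than definitive.

```python
# Minimal sketch: score one query/response pair with two built-in quality
# evaluators from the azure-ai-evaluation package. Endpoint, key, and
# deployment values below are placeholders (assumptions, not real values).
from azure.ai.evaluation import CoherenceEvaluator, GroundednessEvaluator

# AI-assisted evaluators need a judge model; this dict follows the SDK's
# Azure OpenAI model-configuration shape.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",  # placeholder
    "api_key": "<your-api-key>",                                   # placeholder
    "azure_deployment": "gpt-4o-mini",                             # placeholder
}

coherence = CoherenceEvaluator(model_config=model_config)
groundedness = GroundednessEvaluator(model_config=model_config)

query = "When does the Eiffel Tower close?"
context = "The Eiffel Tower is open daily from 9:30 AM to 11:45 PM."
response = "The Eiffel Tower closes at 11:45 PM."

# Each evaluator instance is callable and returns a dict of scores (and, for
# AI-assisted metrics, a reason string explaining the score).
print(coherence(query=query, response=response))
print(groundedness(query=query, context=context, response=response))
```

Note that the safety and security evaluators in the table above are constructed differently (with an Azure AI project reference and a credential rather than a judge-model config), so this pattern covers only the AI-assisted quality metrics.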
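Scaling beyond a single row is typically done with the SDK's `evaluate()` helper, which runs a set of evaluators over a JSONL dataset and aggregates the results. Again, this is a hedged sketch: the dataset path, expected column names, and output path are hypothetical placeholders chosen for illustration.

```python
# Sketch of a batch run: apply several evaluators to every row of a JSONL
# dataset. File names and column names are hypothetical placeholders.
from azure.ai.evaluation import (
    CoherenceEvaluator,
    GroundednessEvaluator,
    RelevanceEvaluator,
    evaluate,
)

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",  # placeholder
    "api_key": "<your-api-key>",                                   # placeholder
    "azure_deployment": "gpt-4o-mini",                             # placeholder
}

# Each line of eval_data.jsonl is assumed to contain "query", "context",
# and "response" fields matching the evaluators' expected inputs.
result = evaluate(
    data="eval_data.jsonl",            # hypothetical dataset path
    evaluators={
        "coherence": CoherenceEvaluator(model_config=model_config),
        "groundedness": GroundednessEvaluator(model_config=model_config),
        "relevance": RelevanceEvaluator(model_config=model_config),
    },
    output_path="eval_results.json",   # hypothetical output file
)

# The result holds per-row scores plus aggregate metrics per evaluator.
print(result["metrics"])
```

Running the batch job once per significant prompt or model change is the usual way these evaluators get embedded in the development lifecycle the article describes.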