
Commit bfbbcff

Update Groundedness and Relevance Evaluators Prompts (#43514)
* Enhance Relevance Evaluator Prompt with Improved Multi-turn Conversation Handling
* Remove RelevanceEvaluator from test_prompty_based_evaluator_custom_credentials
* Update asset tag
* Merge groundedness changes and update assets
* Update Relevance Prompt
Parent: 20b7288

File tree

5 files changed (+57, -36 lines)


sdk/evaluation/azure-ai-evaluation/assets.json

Lines changed: 1 addition & 1 deletion
@@ -2,5 +2,5 @@
   "AssetsRepo": "Azure/azure-sdk-assets",
   "AssetsRepoPrefixPath": "python",
   "TagPrefix": "python/evaluation/azure-ai-evaluation",
-  "Tag": "python/evaluation/azure-ai-evaluation_b613e35220"
+  "Tag": "python/evaluation/azure-ai-evaluation_86c673042d"
 }

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py

Lines changed: 1 addition & 1 deletion
@@ -238,7 +238,7 @@ def _has_context(self, eval_input: dict) -> bool:

     @override
     async def _do_eval(self, eval_input: Dict) -> Dict[str, Union[float, str]]:
-        if "query" not in eval_input:
+        if eval_input.get("query", None) is None:
             return await super()._do_eval(eval_input)

         contains_context = self._has_context(eval_input)
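
The subtle point of this one-line change: with the old membership test, an input whose `query` key was present but explicitly set to `None` still took the query-based path. A minimal sketch (plain Python, not SDK code) of the difference:

```python
# Sketch of the guard change: membership test vs. get-is-None test
# when "query" is present but explicitly None.
eval_input = {"query": None, "response": "...", "context": "..."}

old_check = "query" not in eval_input              # False -> query-based path
new_check = eval_input.get("query", None) is None  # True  -> query-less path

print(old_check, new_check)  # False True
```

With the new check, an explicit `query=None` takes the same fallback path as an absent query.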

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_groundedness/groundedness_without_query.prompty

Lines changed: 19 additions & 14 deletions
@@ -29,11 +29,16 @@ system:

 user:
 # Definition
-**Groundedness** refers to how faithfully a response adheres to the information provided in the CONTEXT, ensuring that all content is directly supported by the context without introducing unsupported information or omitting critical details. It evaluates the fidelity and precision of the response in relation to the source material.
+**Groundedness** refers to how well a response is anchored in the provided context, evaluating its relevance, accuracy, and completeness based exclusively on that context. It assesses the extent to which the response directly and fully addresses the information without introducing unrelated or incorrect information.
+
+> Context is the source of truth for evaluating the response.
+> Evaluate the groundedness of the response message based on the provided context.

 # Ratings
-## [Groundedness: 1] (Completely Ungrounded Response)
-**Definition:** The response is entirely unrelated to the CONTEXT, introducing topics or information that have no connection to the provided material.
+## [Groundedness: 1] (Completely Unrelated Response)
+**Definition:** A response that does not relate to the context in any way.
+- Does not relate to the context at all.
+- Talks about the general topic but does not respond to the context.

 **Examples:**
 **Context:** The company's profits increased by 20% in the last quarter.
@@ -42,8 +47,8 @@ user:
 **Context:** The new smartphone model features a larger display and improved battery life.
 **Response:** The history of ancient Egypt is fascinating and full of mysteries.

-## [Groundedness: 2] (Contradictory Response)
-**Definition:** The response directly contradicts or misrepresents the information provided in the CONTEXT.
+## [Groundedness: 2] (Attempts to Respond but Contains Incorrect Information)
+**Definition:** A response that attempts to relate to the context but includes incorrect information not supported by the context. It may misstate facts, misinterpret the context, or provide erroneous details. Even if some points are correct, the presence of inaccuracies makes the response unreliable.

 **Examples:**
 **Context:** The company's profits increased by 20% in the last quarter.
@@ -52,18 +57,18 @@ user:
 **Context:** The new smartphone model features a larger display and improved battery life.
 **Response:** The new smartphone model has a smaller display and shorter battery life.

-## [Groundedness: 3] (Accurate Response with Unsupported Additions)
-**Definition:** The response accurately includes information from the CONTEXT but adds details, opinions, or explanations that are not supported by the provided material.
+## [Groundedness: 3] (Accurate but Vague Response)
+**Definition:** A response that provides accurate information from the context but is overly generic or vague, not meaningfully engaging with the specific details in the context. The information is correct but lacks specificity and detail.

 **Examples:**
-**Context:** The company's profits increased by 20% in the last quarter.
-**Response:** The company's profits increased by 20% in the last quarter due to their aggressive marketing strategy.
+**Context:** The company's profits increased by 20% in the last quarter, marking the highest growth rate in its history.
+**Response:** The company is doing well financially.

-**Context:** The new smartphone model features a larger display and improved battery life.
-**Response:** The new smartphone model features a larger display, improved battery life, and comes with a free case.
+**Context:** The new smartphone model features a larger display, improved battery life, and an upgraded camera system.
+**Response:** The smartphone has some nice features.

-## [Groundedness: 4] (Incomplete Response Missing Critical Details)
-**Definition:** The response contains information from the CONTEXT but omits essential details that are necessary for a comprehensive understanding of the main point.
+## [Groundedness: 4] (Partially Correct Response)
+**Definition:** A response that provides correct information from the context but is incomplete or lacks specific details mentioned in the context. It captures some of the necessary information but omits key elements needed for a full understanding.

 **Examples:**
 **Context:** The company's profits increased by 20% in the last quarter, marking the highest growth rate in its history.
@@ -73,7 +78,7 @@ user:
 **Response:** The new smartphone model features a larger display and improved battery life.

 ## [Groundedness: 5] (Fully Grounded and Complete Response)
-**Definition:** The response is entirely based on the CONTEXT, accurately and thoroughly conveying all essential information without introducing unsupported details or omitting critical points.
+**Definition:** A response that thoroughly and accurately conveys information from the context, including all relevant details. It directly addresses the context with precise information, demonstrating complete understanding without adding extraneous information.

 **Examples:**
 **Context:** The company's profits increased by 20% in the last quarter, marking the highest growth rate in its history.
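
For orientation, here is a minimal usage sketch of how this query-less prompt is reached through the evaluator's public API. The `model_config` values are placeholders, and the printed result is illustrative, not an actual output:

```python
from azure.ai.evaluation import GroundednessEvaluator

# Placeholder model configuration; fill in real values for your deployment.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-deployment>",
}

groundedness = GroundednessEvaluator(model_config)

# No `query` supplied (equivalently, query=None after the _do_eval fix above),
# so evaluation is scored against the query-less groundedness prompt.
result = groundedness(
    response="The company is doing well financially.",
    context="The company's profits increased by 20% in the last quarter.",
)
print(result)  # illustrative: {"groundedness": 3.0, "groundedness_reason": "..."}
```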

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_relevance/relevance.prompty

Lines changed: 36 additions & 19 deletions
@@ -20,22 +20,25 @@ inputs:
 ---

 system:
-You are a Relevance-Judge, an impartial evaluator that scores how well the RESPONSE addresses the QUERY using the definitions provided.
+You are a Relevance-Judge, an impartial evaluator that scores how well the RESPONSE addresses the user's queries in the CONVERSATION_HISTORY using the definitions provided.

 user:
 ROLE
 ====
-You are a Relevance Evaluator. Your task is to judge how relevant a RESPONSE is to a QUERY using the Relevance definitions provided.
+You are a Relevance Evaluator. Your task is to judge how relevant a RESPONSE is to the CONVERSATION_HISTORY using the Relevance definitions provided.

 INPUT
 =====
-QUERY: {{query}}
+CONVERSATION_HISTORY: {{query}}
 RESPONSE: {{response}}

+CONVERSATION_HISTORY is the full dialogue between the user and the agent up to the user's latest message. For single-turn interactions, this will be just the user's query.
+RESPONSE is the agent's reply to the user's latest message.
+
 TASK
 ====
 Output a JSON object with:
-1) a concise explanation of 15-60 words justifying your score based on how well the response is relevant to the query.
+1) a concise explanation of 15-60 words justifying your score based on how well the response is relevant to the user's queries in the CONVERSATION_HISTORY.
 2) an integer score from 1 (very poor) to 5 (excellent) using the rubric below.

 The explanation should always precede the score and should clearly justify the score based on the rubric definitions.
@@ -49,13 +52,14 @@ Response format exactly as follows:

 EVALUATION STEPS
 ================
-A. Read the QUERY and RESPONSE carefully.
-B. Compare the RESPONSE against the rubric below:
-   - Does the response directly address the query?
+A. Read the CONVERSATION_HISTORY and RESPONSE carefully.
+B. Identify the user's query from the latest message (use conversation history for context if needed).
+C. Compare the RESPONSE against the rubric below:
+   - Does the response directly address the user's query?
    - Is the information complete, partial, or off-topic?
    - Is it vague, generic, or insightful?
-C. Match the response to the best score from the rubric.
-D. Provide a short explanation and the score using the required format.
+D. Match the response to the best score from the rubric.
+E. Provide a short explanation and the score using the required format.

 SCORING RUBRIC
 ==============
@@ -64,7 +68,7 @@ SCORING RUBRIC
 Definition: The response is unrelated to the question. It provides off-topic information and does not attempt to address the question posed.

 **Example A**
-QUERY: What is the team preparing for?
+CONVERSATION_HISTORY: What is the team preparing for?
 RESPONSE: I went grocery shopping yesterday evening.

 Expected Output:
@@ -75,7 +79,7 @@ Expected Output:


 **Example B**
-QUERY: When will the company's new product line launch?
+CONVERSATION_HISTORY: When will the company's new product line launch?
 RESPONSE: International travel can be very rewarding and educational.

 Expected Output:
@@ -89,7 +93,7 @@ Expected Output:
 Definition: The response is loosely or formally related to the query but fails to deliver any meaningful, specific, or useful information. This includes vague phrases, non-answers, or failure/error messages.

 **Example A**
-QUERY: What is the event about?
+CONVERSATION_HISTORY: What is the event about?
 RESPONSE: It’s something important.

 Expected Output:
@@ -99,7 +103,7 @@ Expected Output:
 }

 **Example B**
-QUERY: What’s the weather in Paris?
+CONVERSATION_HISTORY: What’s the weather in Paris?
 RESPONSE: I tried to find the forecast but the query failed.

 Expected Output:
@@ -112,7 +116,7 @@ Expected Output:
 Definition: The response addresses the query and includes relevant information, but omits essential components or detail. The answer is on-topic but insufficient to fully satisfy the request.

 **Example A**
-QUERY: What amenities does the new apartment complex provide?
+CONVERSATION_HISTORY: What amenities does the new apartment complex provide?
 RESPONSE: The apartment complex has a gym.

 Expected Output:
@@ -122,7 +126,7 @@ Expected Output:
 }

 **Example B**
-QUERY: What services does the premium membership include?
+CONVERSATION_HISTORY: What services does the premium membership include?
 RESPONSE: It includes priority customer support.

 Expected Output:
@@ -137,7 +141,7 @@ Expected Output:
 Definition: The response fully addresses the question with accurate and sufficient information, covering all essential aspects. Very minor omissions are acceptable as long as the core information is intact and the intent is clearly conveyed.

 **Example A**
-QUERY: What amenities does the new apartment complex provide?
+CONVERSATION_HISTORY: What amenities does the new apartment complex provide?
 RESPONSE: The apartment complex provides a gym, swimming pool, and 24/7 security.

 Expected Output:
@@ -147,7 +151,7 @@ Expected Output:
 }

 **Example B**
-QUERY: What services does the premium membership include?
+CONVERSATION_HISTORY: What services does the premium membership include?
 RESPONSE: The premium membership includes priority customer support, exclusive content access, and early product releases.

 Expected Output:
@@ -161,7 +165,7 @@ Expected Output:
 Definition: The response not only fully and accurately answers the question, but also adds meaningful elaboration, interpretation, or context that enhances the user's understanding. This goes beyond just listing relevant details — it offers insight into why the information matters, how it's useful, or what impact it has.

 **Example A**
-QUERY: What amenities does the new apartment complex provide?
+CONVERSATION_HISTORY: What amenities does the new apartment complex provide?
 RESPONSE: The apartment complex provides a gym, swimming pool, and 24/7 security, designed to offer residents a comfortable and active lifestyle while ensuring their safety.

 Expected Output:
@@ -171,11 +175,24 @@ Expected Output:
 }

 **Example B**
-QUERY: What services does the premium membership include?
+CONVERSATION_HISTORY: What services does the premium membership include?
 RESPONSE: The premium membership includes priority customer support, exclusive content access, and early product releases — tailored for users who want quicker resolutions and first access to new features.

 Expected Output:
 {
   "explanation": "The response covers all essential services and adds valuable insight about the target user and benefits, enriching the response beyond basic listing.",
   "score": 5
 }
+
+### Multi-turn Conversation Example
+When evaluating responses in a multi-turn conversation, consider the conversation context to understand the user's intent:
+
+**Example - Multi-turn Context**
+CONVERSATION_HISTORY: [{"role":"user","content":"I'm planning a vacation to Europe."},{"role":"assistant","content":"That sounds exciting! What time of year are you thinking of traveling?"},{"role":"user","content":"Probably in July. What's the weather like then?"}]
+RESPONSE: [{"role":"assistant","content":"July is summer in Europe with generally warm and pleasant weather. Most countries have temperatures between 20-25°C (68-77°F). It's a popular travel time, so expect crowds at major tourist attractions and higher accommodation prices."}]
+
+Expected Output:
+{
+  "explanation": "The response directly addresses the weather question while providing valuable context about crowds and pricing that's relevant to vacation planning established in the conversation.",
+  "score": 5
+}
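
To see the new CONVERSATION_HISTORY handling from the caller's side, here is a minimal usage sketch. The `model_config` values are placeholders; the single-turn call maps `query` onto CONVERSATION_HISTORY, and the multi-turn form assumes the SDK's standard `conversation` input shape for built-in quality evaluators:

```python
from azure.ai.evaluation import RelevanceEvaluator

# Placeholder model configuration; fill in real values for your deployment.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-deployment>",
}

relevance = RelevanceEvaluator(model_config)

# Single turn: `query` becomes the CONVERSATION_HISTORY in the prompt.
single = relevance(
    query="What amenities does the new apartment complex provide?",
    response="The apartment complex has a gym.",
)

# Multi turn: pass the full dialogue; the final assistant message is the
# RESPONSE and the preceding messages form the CONVERSATION_HISTORY.
multi = relevance(
    conversation={
        "messages": [
            {"role": "user", "content": "I'm planning a vacation to Europe."},
            {"role": "assistant", "content": "What time of year are you traveling?"},
            {"role": "user", "content": "Probably in July. What's the weather like then?"},
            {"role": "assistant", "content": "July is warm across most of Europe, typically 20-25°C."},
        ]
    }
)
print(single, multi)  # illustrative score dicts, e.g. {"relevance": 2.0, ...}
```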

sdk/evaluation/azure-ai-evaluation/tests/e2etests/test_builtin_evaluators.py

Lines changed: 0 additions & 1 deletion
@@ -1232,7 +1232,6 @@ def test_multimodal_evaluator_protected_material_json(self, request, proj_scope,
     @pytest.mark.parametrize(
         "evaluator_cls",
         [
-            RelevanceEvaluator,
             FluencyEvaluator,
             SimilarityEvaluator,
             CoherenceEvaluator,
