- Extract large evaluation prompts to template files for better maintainability (a loading sketch follows this list)
- Remove redundant API key checks in test methods (already covered by @pytest.mark.requires_api_keys)
- Optimize API-dependent tests to reduce CI timeout risk
- Reduce test iterations and sample sizes for faster CI execution
Addresses Copilot feedback and CI stability issues.
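Because the prompts now live in standalone template files, each judge class can read its prompt from disk instead of embedding a 50-line string constant. A minimal sketch of that pattern, assuming a hypothetical tests/templates/ directory and file name (the actual paths and helper live in the test modules, not here):

from pathlib import Path

# Hypothetical location of the extracted prompt templates (adjust to the real repo layout).
TEMPLATES_DIR = Path(__file__).parent / "templates"


def load_template(name: str) -> str:
    """Read a prompt template file as UTF-8 text."""
    return (TEMPLATES_DIR / name).read_text(encoding="utf-8")


# Example: the grounding judge would then format the loaded template at evaluation time.
EVALUATION_PROMPT = load_template("contextual_grounding_evaluation.txt")

The two extracted prompt templates added by this PR follow below.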
You are an expert evaluator of contextual grounding in text. Your task is to assess how well contextual references (pronouns, temporal expressions, spatial references, etc.) have been resolved to their concrete referents.

INPUT CONTEXT MESSAGES:
{context_messages}

ORIGINAL TEXT WITH CONTEXTUAL REFERENCES:
{original_text}

GROUNDED TEXT (what the system produced):
{grounded_text}

EXPECTED GROUNDINGS:
{expected_grounding}

Please evaluate the grounding quality on these dimensions:

1. PRONOUN_RESOLUTION (0-1): How well are pronouns (he/she/they/him/her/them) resolved to specific entities? If no pronouns are present, score as 1.0. If pronouns remain unchanged from the original text, this indicates no grounding was performed and should receive a low score (0.0-0.2).

2. TEMPORAL_GROUNDING (0-1): How well are relative time expressions converted to absolute times? If no temporal expressions are present, score as 1.0. If temporal expressions remain unchanged when they should be grounded, this indicates incomplete grounding.

3. SPATIAL_GROUNDING (0-1): How well are place references (there/here/that place) resolved to specific locations? If no spatial references are present, score as 1.0. If spatial references remain unchanged when they should be grounded, this indicates incomplete grounding.

4. COMPLETENESS (0-1): Are all context-dependent references that exist in the text properly resolved? This should be high (0.8-1.0) if all relevant references were grounded, moderate (0.4-0.7) if some were missed, and low (0.0-0.3) if most/all were missed.

5. ACCURACY (0-1): Are the groundings factually correct given the context?

IMPORTANT SCORING PRINCIPLES:
- Only penalize dimensions that are actually relevant to the text
- If no pronouns exist, pronoun_resolution_score = 1.0 (not applicable = perfect)
- If no temporal expressions exist, temporal_grounding_score = 1.0 (not applicable = perfect)
- If no spatial references exist, spatial_grounding_score = 1.0 (not applicable = perfect)
- The overall_score should reflect performance on relevant dimensions only

CRITICAL: If the grounded text is identical to the original text, this means NO grounding was performed. In this case:
- Set relevant dimension scores to 0.0 based on what should have been grounded
- Set irrelevant dimension scores to 1.0 (not applicable)
- COMPLETENESS should be 0.0 since nothing was resolved
- OVERALL_SCORE should be very low (0.0-0.2) if grounding was expected

Return your evaluation as JSON in this format:
{{
"pronoun_resolution_score": 0.95,
"temporal_grounding_score": 0.90,
"spatial_grounding_score": 0.85,
"completeness_score": 0.92,
"accuracy_score": 0.88,
"overall_score": 0.90,
"explanation": "Brief explanation of the scoring rationale"
}}

Be strict in your evaluation - only give high scores when grounding is complete and accurate.
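Since the judge is instructed to reply with a single JSON object, the calling test presumably parses that reply before asserting on individual scores. A small, hedged sketch of such parsing (the field names come from the template above; fetching the model's reply is out of scope, and the class name here is made up):

import json
from dataclasses import dataclass, fields


@dataclass
class GroundingScores:
    pronoun_resolution_score: float
    temporal_grounding_score: float
    spatial_grounding_score: float
    completeness_score: float
    accuracy_score: float
    overall_score: float
    explanation: str


def parse_grounding_evaluation(raw_reply: str) -> GroundingScores:
    """Parse the judge's JSON reply into a typed record; raises KeyError if a field is missing."""
    data = json.loads(raw_reply)
    return GroundingScores(**{f.name: data[f.name] for f in fields(GroundingScores)})

The second extracted template, for memory-extraction evaluation, follows.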
You are an expert evaluator of memory extraction systems. Your task is to assess how well a system extracted discrete memories from conversational text.

ORIGINAL CONVERSATION:
{original_conversation}

EXTRACTED MEMORIES:
{extracted_memories}

EXPECTED EXTRACTION CRITERIA:
{expected_criteria}

Please evaluate the memory extraction quality on these dimensions:

1. RELEVANCE (0-1): Are the extracted memories genuinely useful for future conversations?
2. CLASSIFICATION_ACCURACY (0-1): Are memories correctly classified as "episodic" vs "semantic"?
3. INFORMATION_PRESERVATION (0-1): Is important information captured without loss?
4. REDUNDANCY_AVOIDANCE (0-1): Are duplicate or overlapping memories avoided?
5. COMPLETENESS (0-1): Are all extractable valuable memories identified?
6. ACCURACY (0-1): Are the extracted memories factually correct?

CLASSIFICATION GUIDELINES:
- EPISODIC: Personal experiences, events, user preferences, specific interactions
- SEMANTIC: General knowledge, facts, procedures, definitions not in training data

Return your evaluation as JSON in this format:
{{
"relevance_score": 0.95,
"classification_accuracy_score": 0.90,
"information_preservation_score": 0.85,
"redundancy_avoidance_score": 0.92,
"completeness_score": 0.88,
"accuracy_score": 0.94,
"overall_score": 0.90,
"explanation": "Brief explanation of the scoring rationale",
"suggested_improvements": "Specific suggestions for improvement"
}}

Be strict in your evaluation - only give high scores when extraction is comprehensive and accurate.
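One detail worth noting in both templates: the JSON skeleton uses doubled braces ({{ and }}) while the input slots use single braces. That is the standard Python str.format convention, so calling .format() substitutes the placeholders and collapses the doubled braces into literal ones. A tiny illustration with made-up values:

template = 'EXTRACTED MEMORIES:\n{extracted_memories}\n\nReturn JSON like {{"overall_score": 0.90}}'

# Single braces are substituted; doubled braces come out as literal braces.
prompt = template.format(extracted_memories="- User prefers morning meetings")
print(prompt)
# EXTRACTED MEMORIES:
# - User prefers morning meetings
#
# Return JSON like {"overall_score": 0.90}

The diffs below show the corresponding deletions from the two test modules.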
tests/test_contextual_grounding_integration.py: 9 additions, 56 deletions
@@ -11,6 +11,7 @@
 import json
 import os
 from datetime import UTC, datetime, timedelta
+from pathlib import Path

 import pytest
 import ulid
@@ -180,62 +181,16 @@ def get_all_examples(cls):
 class LLMContextualGroundingJudge:
     """LLM-as-a-Judge system for evaluating contextual grounding quality"""

-    EVALUATION_PROMPT = """
-You are an expert evaluator of contextual grounding in text. Your task is to assess how well contextual references (pronouns, temporal expressions, spatial references, etc.) have been resolved to their concrete referents.
-
-INPUT CONTEXT MESSAGES:
-{context_messages}
-
-ORIGINAL TEXT WITH CONTEXTUAL REFERENCES:
-{original_text}
-
-GROUNDED TEXT (what the system produced):
-{grounded_text}
-
-EXPECTED GROUNDINGS:
-{expected_grounding}
-
-Please evaluate the grounding quality on these dimensions:
-
-1. PRONOUN_RESOLUTION (0-1): How well are pronouns (he/she/they/him/her/them) resolved to specific entities? If no pronouns are present, score as 1.0. If pronouns remain unchanged from the original text, this indicates no grounding was performed and should receive a low score (0.0-0.2).
-
-2. TEMPORAL_GROUNDING (0-1): How well are relative time expressions converted to absolute times? If no temporal expressions are present, score as 1.0. If temporal expressions remain unchanged when they should be grounded, this indicates incomplete grounding.
-
-3. SPATIAL_GROUNDING (0-1): How well are place references (there/here/that place) resolved to specific locations? If no spatial references are present, score as 1.0. If spatial references remain unchanged when they should be grounded, this indicates incomplete grounding.
-
-4. COMPLETENESS (0-1): Are all context-dependent references that exist in the text properly resolved? This should be high (0.8-1.0) if all relevant references were grounded, moderate (0.4-0.7) if some were missed, and low (0.0-0.3) if most/all were missed.
-
-5. ACCURACY (0-1): Are the groundings factually correct given the context?
-
-IMPORTANT SCORING PRINCIPLES:
-- Only penalize dimensions that are actually relevant to the text
-- If no pronouns exist, pronoun_resolution_score = 1.0 (not applicable = perfect)
-- If no temporal expressions exist, temporal_grounding_score = 1.0 (not applicable = perfect)
-- If no spatial references exist, spatial_grounding_score = 1.0 (not applicable = perfect)
-- The overall_score should reflect performance on relevant dimensions only
-
-CRITICAL: If the grounded text is identical to the original text, this means NO grounding was performed. In this case:
-- Set relevant dimension scores to 0.0 based on what should have been grounded
-- Set irrelevant dimension scores to 1.0 (not applicable)
-- COMPLETENESS should be 0.0 since nothing was resolved
-- OVERALL_SCORE should be very low (0.0-0.2) if grounding was expected
-
-Return your evaluation as JSON in this format:
-{{
-"pronoun_resolution_score": 0.95,
-"temporal_grounding_score": 0.90,
-"spatial_grounding_score": 0.85,
-"completeness_score": 0.92,
-"accuracy_score": 0.88,
-"overall_score": 0.90,
-"explanation": "Brief explanation of the scoring rationale"
-}}
-
-Be strict in your evaluation - only give high scores when grounding is complete and accurate.
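The capture ends before the replacement lines, so the new code that now supplies EVALUATION_PROMPT is not visible in this hunk. Purely as an illustration of what such a replacement might look like, with a guessed template path that is not taken from the PR:

class LLMContextualGroundingJudge:
    """LLM-as-a-Judge system for evaluating contextual grounding quality"""

    # Hypothetical replacement: read the extracted template from disk once at import time.
    EVALUATION_PROMPT = (
        Path(__file__).parent / "templates" / "contextual_grounding_evaluation.txt"
    ).read_text(encoding="utf-8")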
tests/test_llm_judge_evaluation.py: 11 additions, 76 deletions
@@ -9,7 +9,7 @@
 """

 import json
-import os
+from pathlib import Path

 import pytest

@@ -22,49 +22,14 @@
 class MemoryExtractionJudge:
     """LLM-as-a-Judge system for evaluating discrete memory extraction quality"""

-    EXTRACTION_EVALUATION_PROMPT = """
-You are an expert evaluator of memory extraction systems. Your task is to assess how well a system extracted discrete memories from conversational text.
-
-ORIGINAL CONVERSATION:
-{original_conversation}
-
-EXTRACTED MEMORIES:
-{extracted_memories}
-
-EXPECTED EXTRACTION CRITERIA:
-{expected_criteria}
-
-Please evaluate the memory extraction quality on these dimensions:
-
-1. RELEVANCE (0-1): Are the extracted memories genuinely useful for future conversations?
-2. CLASSIFICATION_ACCURACY (0-1): Are memories correctly classified as "episodic" vs "semantic"?
-3. INFORMATION_PRESERVATION (0-1): Is important information captured without loss?
-4. REDUNDANCY_AVOIDANCE (0-1): Are duplicate or overlapping memories avoided?
-5. COMPLETENESS (0-1): Are all extractable valuable memories identified?
-6. ACCURACY (0-1): Are the extracted memories factually correct?
-
-CLASSIFICATION GUIDELINES:
-- EPISODIC: Personal experiences, events, user preferences, specific interactions
-- SEMANTIC: General knowledge, facts, procedures, definitions not in training data
-
-Return your evaluation as JSON in this format:
-{{
-"relevance_score": 0.95,
-"classification_accuracy_score": 0.90,
-"information_preservation_score": 0.85,
-"redundancy_avoidance_score": 0.92,
-"completeness_score": 0.88,
-"accuracy_score": 0.94,
-"overall_score": 0.90,
-"explanation": "Brief explanation of the scoring rationale",
-"suggested_improvements": "Specific suggestions for improvement"
-}}
-
-Be strict in your evaluation - only give high scores when extraction is comprehensive and accurate.