 1. Only extract information that would be genuinely useful for future interactions.
 2. Do not extract procedural knowledge - that is handled by the system's built-in tools and prompts.
 3. You are a large language model - do not extract facts that you already know.
+4. CRITICAL: ALWAYS ground ALL contextual references - never leave ANY pronouns, relative times, or vague place references unresolved.
+5. MANDATORY: Replace every instance of "he/she/they/him/her/them/his/hers/theirs" with the actual person's name.
+6. MANDATORY: Replace possessive pronouns like "her experience" with "Sarah's experience" (if "her" refers to Sarah).
+7. If you cannot determine what a contextual reference refers to, either omit that memory or use generic terms like "someone" instead of ungrounded pronouns.
 
 Message:
 {message}
 
+STEP-BY-STEP PROCESS:
+1. First, identify all pronouns in the text: he, she, they, him, her, them, his, hers, theirs
+2. Determine what person each pronoun refers to based on the context
+3. Replace every single pronoun with the actual person's name
+4. Extract the grounded memories with NO pronouns remaining
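The "NO pronouns remaining" requirement in the step-by-step process above can also be checked mechanically on the extracted memories. The following is a minimal sketch, not code from this change: the helper name `find_ungrounded_pronouns` is hypothetical, and the pronoun list is copied from step 1 of the prompt.

```python
import re
from typing import List

# Pronoun list taken from step 1 of the prompt's STEP-BY-STEP PROCESS.
PRONOUNS = ("he", "she", "they", "him", "her", "them", "his", "hers", "theirs")

# Word boundaries keep "the" or "there" from matching "he".
_PRONOUN_RE = re.compile(r"\b(?:" + "|".join(PRONOUNS) + r")\b", re.IGNORECASE)


def find_ungrounded_pronouns(memory_text: str) -> List[str]:
    """Hypothetical helper: return the pronouns still present, lowercased, in order."""
    return [match.group(0).lower() for match in _PRONOUN_RE.finditer(memory_text)]


if __name__ == "__main__":
    print(find_ungrounded_pronouns("Sarah shared her experience with them"))
    print(find_ungrounded_pronouns("Sarah shared Sarah's experience with the team"))
```

A check like this only detects leftover pronouns. It cannot tell whether a substituted name is the correct referent, which is what the judge's ACCURACY dimension is for.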
tests/test_contextual_grounding_integration.py
Lines changed: 26 additions & 8 deletions
@@ -197,12 +197,29 @@ class LLMContextualGroundingJudge:
 
 Please evaluate the grounding quality on these dimensions:
 
-1. PRONOUN_RESOLUTION (0-1): How well are pronouns (he/she/they/him/her/them) resolved to specific entities?
-2. TEMPORAL_GROUNDING (0-1): How well are relative time expressions converted to absolute times?
-3. SPATIAL_GROUNDING (0-1): How well are place references (there/here/that place) resolved to specific locations?
-4. COMPLETENESS (0-1): Are all context-dependent references resolved (no "he", "there", "yesterday" left ungrounded)?
+1. PRONOUN_RESOLUTION (0-1): How well are pronouns (he/she/they/him/her/them) resolved to specific entities? If no pronouns are present, score as 1.0. If pronouns remain unchanged from the original text, this indicates no grounding was performed and should receive a low score (0.0-0.2).
+
+2. TEMPORAL_GROUNDING (0-1): How well are relative time expressions converted to absolute times? If no temporal expressions are present, score as 1.0. If temporal expressions remain unchanged when they should be grounded, this indicates incomplete grounding.
+
+3. SPATIAL_GROUNDING (0-1): How well are place references (there/here/that place) resolved to specific locations? If no spatial references are present, score as 1.0. If spatial references remain unchanged when they should be grounded, this indicates incomplete grounding.
+
+4. COMPLETENESS (0-1): Are all context-dependent references that exist in the text properly resolved? This should be high (0.8-1.0) if all relevant references were grounded, moderate (0.4-0.7) if some were missed, and low (0.0-0.3) if most/all were missed.
+
 5. ACCURACY (0-1): Are the groundings factually correct given the context?
 
+IMPORTANT SCORING PRINCIPLES:
+- Only penalize dimensions that are actually relevant to the text
+- If no pronouns exist, pronoun_resolution_score = 1.0 (not applicable = perfect)
+- If no temporal expressions exist, temporal_grounding_score = 1.0 (not applicable = perfect)
+- If no spatial references exist, spatial_grounding_score = 1.0 (not applicable = perfect)
+- The overall_score should reflect performance on relevant dimensions only
+
+CRITICAL: If the grounded text is identical to the original text, this means NO grounding was performed. In this case:
+- Set relevant dimension scores to 0.0 based on what should have been grounded
+- Set irrelevant dimension scores to 1.0 (not applicable)
+- COMPLETENESS should be 0.0 since nothing was resolved
+- OVERALL_SCORE should be very low (0.0-0.2) if grounding was expected
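The principle that the overall score "should reflect performance on relevant dimensions only" can be illustrated with a small sanity check on the judge's output. This is a sketch under stated assumptions, not part of the diff: the function name `overall_from_relevant`, the dimension keys, and the use of a plain average are all hypothetical, since the prompt leaves the aggregation to the judging model.

```python
from typing import Dict

# Hypothetical keys for the three dimensions that can be "not applicable".
DIMENSIONS = ("pronoun_resolution", "temporal_grounding", "spatial_grounding")


def overall_from_relevant(scores: Dict[str, float], relevant: Dict[str, bool]) -> float:
    """Average only the dimensions flagged as relevant to the text.

    Assumption: when no dimension is relevant, nothing needed grounding,
    so the result is 1.0 (mirroring "not applicable = perfect").
    """
    used = [scores[name] for name in DIMENSIONS if relevant.get(name, False)]
    if not used:
        return 1.0
    return sum(used) / len(used)


if __name__ == "__main__":
    # Grounded text identical to the original, and only pronouns needed grounding:
    # the relevant dimension is 0.0, the irrelevant ones are 1.0.
    scores = {"pronoun_resolution": 0.0, "temporal_grounding": 1.0, "spatial_grounding": 1.0}
    relevant = {"pronoun_resolution": True}
    print(overall_from_relevant(scores, relevant))
```

Averaging over all three dimensions in that case would give roughly 0.67, which is the inflated result the CRITICAL rule is meant to prevent. Restricting the average to relevant dimensions keeps the overall score inside the 0.0-0.2 band the prompt asks for.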