Feat: Implement contextual grounding #46
Merged
9 commits
9e0adf4 Fix contextual grounding test integration issues (abrookins)
2cbfe81 Implement contextual grounding in memory extraction (abrookins)
7ceb930 Implement thread-aware contextual grounding for memory extraction (abrookins)
9ac6400 Address PR review feedback (abrookins)
94bd3df Fix CI test failures (abrookins)
6d84edd Apply more aggressive CI stability fixes (abrookins)
8147121 Improve temporal grounding by providing current datetime context (abrookins)
aca0d76 Fix contextual grounding integration tests to use thread-aware extrac… (abrookins)
754939b Fix remaining integration tests to use thread-aware extraction (abrookins)
51 changes: 51 additions & 0 deletions
tests/templates/contextual_grounding_evaluation_prompt.txt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,51 @@ | ||
You are an expert evaluator of contextual grounding in text. Your task is to assess how well contextual references (pronouns, temporal expressions, spatial references, etc.) have been resolved to their concrete referents. | ||
|
||
INPUT CONTEXT MESSAGES: | ||
{context_messages} | ||
|
||
ORIGINAL TEXT WITH CONTEXTUAL REFERENCES: | ||
{original_text} | ||
|
||
GROUNDED TEXT (what the system produced): | ||
{grounded_text} | ||
|
||
EXPECTED GROUNDINGS: | ||
{expected_grounding} | ||
|
||
Please evaluate the grounding quality on these dimensions: | ||
|
||
1. PRONOUN_RESOLUTION (0-1): How well are pronouns (he/she/they/him/her/them) resolved to specific entities? If no pronouns are present, score as 1.0. If pronouns remain unchanged from the original text, this indicates no grounding was performed and should receive a low score (0.0-0.2). | ||
|
||
2. TEMPORAL_GROUNDING (0-1): How well are relative time expressions converted to absolute times? If no temporal expressions are present, score as 1.0. If temporal expressions remain unchanged when they should be grounded, this indicates incomplete grounding. | ||
|
||
3. SPATIAL_GROUNDING (0-1): How well are place references (there/here/that place) resolved to specific locations? If no spatial references are present, score as 1.0. If spatial references remain unchanged when they should be grounded, this indicates incomplete grounding. | ||
|
||
4. COMPLETENESS (0-1): Are all context-dependent references that exist in the text properly resolved? This should be high (0.8-1.0) if all relevant references were grounded, moderate (0.4-0.7) if some were missed, and low (0.0-0.3) if most/all were missed. | ||
|
||
5. ACCURACY (0-1): Are the groundings factually correct given the context? | ||
|
||
IMPORTANT SCORING PRINCIPLES: | ||
- Only penalize dimensions that are actually relevant to the text | ||
- If no pronouns exist, pronoun_resolution_score = 1.0 (not applicable = perfect) | ||
- If no temporal expressions exist, temporal_grounding_score = 1.0 (not applicable = perfect) | ||
- If no spatial references exist, spatial_grounding_score = 1.0 (not applicable = perfect) | ||
- The overall_score should reflect performance on relevant dimensions only | ||
|
||
CRITICAL: If the grounded text is identical to the original text, this means NO grounding was performed. In this case: | ||
- Set relevant dimension scores to 0.0 based on what should have been grounded | ||
- Set irrelevant dimension scores to 1.0 (not applicable) | ||
- COMPLETENESS should be 0.0 since nothing was resolved | ||
- OVERALL_SCORE should be very low (0.0-0.2) if grounding was expected | ||
|
||
Return your evaluation as JSON in this format: | ||
{{ | ||
"pronoun_resolution_score": 0.95, | ||
"temporal_grounding_score": 0.90, | ||
"spatial_grounding_score": 0.85, | ||
"completeness_score": 0.92, | ||
"accuracy_score": 0.88, | ||
"overall_score": 0.90, | ||
"explanation": "Brief explanation of the scoring rationale" | ||
}} | ||
|
||
Be strict in your evaluation - only give high scores when grounding is complete and accurate. |
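
The doubled braces (`{{` and `}}`) around the JSON example suggest the template is filled with Python's `str.format`, which treats single braces as placeholders and collapses doubled ones into literal braces. The following is a minimal sketch under that assumption. `TEMPLATE` is an abbreviated stand-in for the file above, and `render_prompt` and `parse_evaluation` are hypothetical helper names, not functions from this PR.

```python
import json

# Abbreviated stand-in for the template file; the real file has more text.
TEMPLATE = """INPUT CONTEXT MESSAGES:
{context_messages}

ORIGINAL TEXT WITH CONTEXTUAL REFERENCES:
{original_text}

GROUNDED TEXT (what the system produced):
{grounded_text}

EXPECTED GROUNDINGS:
{expected_grounding}

Return your evaluation as JSON in this format:
{{
"overall_score": 0.90,
"explanation": "Brief explanation of the scoring rationale"
}}
"""

# Score fields named in the template's JSON example.
SCORE_KEYS = (
    "pronoun_resolution_score",
    "temporal_grounding_score",
    "spatial_grounding_score",
    "completeness_score",
    "accuracy_score",
    "overall_score",
)


def render_prompt(template: str, **fields: str) -> str:
    """Fill the placeholders; str.format turns '{{' and '}}' into literal braces."""
    return template.format(**fields)


def parse_evaluation(raw: str) -> dict:
    """Parse the evaluator's JSON reply and check every score lies in [0, 1]."""
    data = json.loads(raw)
    for key in SCORE_KEYS:
        value = float(data[key])  # KeyError here means a score field is missing
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key} out of range: {value}")
        data[key] = value
    return data
```

Range-checking the reply catches an evaluator that returns, say, a 0-10 scale instead of the requested 0-1 scale before the numbers reach any test assertion.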
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
You are an expert evaluator of memory extraction systems. Your task is to assess how well a system extracted discrete memories from conversational text. | ||
|
||
ORIGINAL CONVERSATION: | ||
{original_conversation} | ||
|
||
EXTRACTED MEMORIES: | ||
{extracted_memories} | ||
|
||
EXPECTED EXTRACTION CRITERIA: | ||
{expected_criteria} | ||
|
||
Please evaluate the memory extraction quality on these dimensions: | ||
|
||
1. RELEVANCE (0-1): Are the extracted memories genuinely useful for future conversations? | ||
2. CLASSIFICATION_ACCURACY (0-1): Are memories correctly classified as "episodic" vs "semantic"? | ||
3. INFORMATION_PRESERVATION (0-1): Is important information captured without loss? | ||
4. REDUNDANCY_AVOIDANCE (0-1): Are duplicate or overlapping memories avoided? | ||
5. COMPLETENESS (0-1): Are all extractable valuable memories identified? | ||
6. ACCURACY (0-1): Are the extracted memories factually correct? | ||
|
||
CLASSIFICATION GUIDELINES: | ||
- EPISODIC: Personal experiences, events, user preferences, specific interactions | ||
- SEMANTIC: General knowledge, facts, procedures, definitions not in training data | ||
|
||
Return your evaluation as JSON in this format: | ||
{{ | ||
"relevance_score": 0.95, | ||
"classification_accuracy_score": 0.90, | ||
"information_preservation_score": 0.85, | ||
"redundancy_avoidance_score": 0.92, | ||
"completeness_score": 0.88, | ||
"accuracy_score": 0.94, | ||
"overall_score": 0.90, | ||
"explanation": "Brief explanation of the scoring rationale", | ||
"suggested_improvements": "Specific suggestions for improvement" | ||
}} | ||
|
||
Be strict in your evaluation - only give high scores when extraction is comprehensive and accurate. |
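
A model asked to "return JSON" will sometimes wrap the object in a code fence or add a sentence before it. The sketch below shows one way a test harness might tolerate that. It is an assumption, not code from this PR: `extract_json_object` and `missing_keys` are hypothetical helpers, and the brace-slicing heuristic would fail if the surrounding prose itself contained braces.

```python
import json

# Fields listed in the template's JSON example for memory extraction.
REQUIRED_KEYS = {
    "relevance_score",
    "classification_accuracy_score",
    "information_preservation_score",
    "redundancy_avoidance_score",
    "completeness_score",
    "accuracy_score",
    "overall_score",
    "explanation",
    "suggested_improvements",
}


def extract_json_object(reply: str) -> dict:
    """Parse the span from the first '{' to the last '}' in the reply (heuristic)."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    return json.loads(reply[start : end + 1])


def missing_keys(evaluation: dict) -> set:
    """Return the expected fields that the evaluator left out."""
    return REQUIRED_KEYS - set(evaluation)
```

Reporting the missing fields by name, rather than failing on the first `KeyError`, makes a flaky evaluator reply easier to diagnose in CI logs.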