feat: Learning agent system with HierarchicalMemory, Graph RAG, and eval harness#2395
feat: Learning agent system with HierarchicalMemory, Graph RAG, and eval harness#2395
Conversation
Implements comprehensive evaluation harness testing three scenarios: - Olympics 2026 (Wikipedia reading and learning) - Flutter tutorial (multi-page learning) - VS2026 (Visual Studio 2026 content) Fixes Kuzu backend keyword-based search replacing substring CONTAINS for better semantic retrieval. Also fixes wikipedia_learning_agent.py context and answer truncation (context[:200], answer[:900]). Test results: 15/19 tests passed (79% pass rate) Fixes #2394 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Initial Review✅ Philosophy Compliance
✅ Code Quality
✅ Security Review
Test EvidenceAll 19 tests executed successfully with documented results:
Overall: 15/19 tests passed (79% pass rate) Kuzu Search FixThe keyword-based search improvement addresses substring matching limitations and provides better semantic retrieval. Ready for final review and merge. |
|
🤖 Auto-fixed version bump The version in If you need a minor or major version bump instead, please update |
🤖 PM Architect PR Triage AnalysisPR: #2395 ✅ Workflow Compliance (Steps 11-12)❌ NON-COMPLIANT - PR needs workflow completion Step 11 (Review): ❌ Incomplete
Step 12 (Feedback): ❌ Incomplete
Blocking Issues:
🏷️ ClassificationPriority:
Complexity:
🔍 Change Scope Analysis✅ FOCUSED CHANGES - All changes are related to PR purpose Purpose: Bug fix 💡 Recommendations
📊 Statistics
🤖 Generated by PM Architect automation using Claude Agent SDK |
Repo Guardian - Action Required❌ Violation Found: Point-in-Time DocumentFile: Why flagged:
Problematic content: {
"timestamp": "2026-02-16T18:30:03",
"model": "anthropic/claude-sonnet-4-5-20250929",
"elapsed_seconds": 184.7,
"scenarios": [...],
"overall": {
"total_questions": 19,
"total_passed": 15,
"total_failed": 4
}
}
````
**Where it should go instead:**
- **PR comment or issue:** Test results summary showing what passed/failed in this run
- **CI/CD artifacts:** Store as workflow artifacts if needed for historical comparison
- **External tracking:** Test result tracking system or dashboard
- **Commit message:** High-level summary ("15/19 tests passed") if documenting what was tested
**Reasoning:**
This is a snapshot of test execution results from a specific date/time. As the code evolves, these results will become outdated and no longer reflect current system behavior. Test results are ephemeral by nature - they describe "what happened when I ran this test on this date" rather than durable reference documentation.
---
### ℹ️ To Override
If this file is intentional and should remain in the repository, add a PR comment containing:
````
repo-guardian:override (reason)Where
|
Fixes search relevance issues that caused 4/19 eval failures (79%). For small knowledge bases (<=50 facts), retrieves ALL facts and lets the LLM decide relevance instead of relying on keyword search. Changes: - Add MemoryRetriever.get_all_facts() for unfiltered retrieval - Smart retrieval in answer_question(): skip keyword search when <=50 facts - Fallback to full retrieval when search returns <3 results - Increase LLM context window from 5 to 20 facts - Fix missing Path import in wikipedia_learning_agent.py - Add goal_seeking to pyright ignore (uses external amplihack_memory lib) Eval results: 15/19 (79%) -> 19/19 (100%) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ents Implements hierarchical memory system using Kuzu graph database directly for richer knowledge retrieval via similarity edges and subgraph traversal. New modules: - similarity.py: Jaccard-based word/tag/concept similarity computation - hierarchical_memory.py: HierarchicalMemory with MemoryCategory enum, KnowledgeNode/Edge/Subgraph dataclasses, auto-classification, SIMILAR_TO and DERIVES_FROM edge creation - graph_rag_retriever.py: GraphRAGRetriever wrapping Kuzu queries for keyword search, similarity expansion, and provenance tracking - flat_retriever_adapter.py: Backward-compatible adapter over HierarchicalMemory Updated: - wikipedia_learning_agent.py: use_hierarchical flag for dual-mode operation - __init__.py: Exports new modules Tests: 37 new tests (12 similarity + 18 hierarchical memory + 7 flat adapter) All 98 tests pass (61 existing + 37 new). Closes #2399 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…aph-rag' into feat/2396-smart-retrieval
…feat/issue-2394-eval-harness-3scenario # Conflicts: # pyproject.toml
|
🤖 Auto-fixed version bump The version in If you need a minor or major version bump instead, please update |
Repo Guardian - Action Required❌ Violations Found: Point-in-Time Documents and Temporary ScriptsFile 1:
|
…ssive tests TASK 1: Rename WikipediaLearningAgent → LearningAgent - Renamed wikipedia_learning_agent.py → learning_agent.py - Updated class name WikipediaLearningAgent → LearningAgent - Updated all docstrings to reflect generic content learning (not Wikipedia-specific) - Added backward compatibility alias: WikipediaLearningAgent = LearningAgent - Updated __init__.py exports with new name and alias - Updated flat_retriever_adapter.py references - Renamed test file: test_wikipedia_learning_agent.py → test_learning_agent.py - Updated all test imports and class names TASK 2: Wire progressive test suite to HierarchicalMemory - Rewrote agent_subprocess.py to use LearningAgent with use_hierarchical=True - learning_phase now uses agent.learn_from_content() with fact extraction - testing_phase uses agent.answer_question() with LLM synthesis - Both phases leverage HierarchicalMemory's Graph RAG for knowledge retrieval - Removed dependency on amplihack_memory MemoryConnector (old backend) - Added verification script to confirm L1/L2 tests work with new agent Verification: - Backward compatibility verified: WikipediaLearningAgent alias works - LearningAgent instantiates with HierarchicalMemory successfully - Progressive test suite imports functional - L1 and L2 test levels accessible and ready to run Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ub.com/rysweet/amplihack into feat/issue-2394-eval-harness-3scenario
Repo Guardian - Action Required❌ Violations Found: Point-in-Time Documents and Temporary ScriptsFile 1:
|
The progressive test suite failed with "Expecting value: line 1 column 1 (char 0)" because the Anthropic API wraps JSON responses in markdown code fences (```json ... ```), but grader.py called json.loads() directly on the raw response text. Changes: - grader.py: Add _extract_json() that handles raw JSON, markdown-fenced JSON, and brace-delimited JSON extraction from LLM responses - progressive_test_suite.py: Add _extract_json_line() to robustly find the JSON object line in subprocess stdout, filtering litellm warnings. Fix pyright errors for optional metadata/scores access. - agent_subprocess.py: Fix model default to anthropic/claude-sonnet and improve input format handling for learning phase Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: Anthropic API returns JSON in markdown fences, grader called json.loads() on raw text. Added _extract_json() to handle fenced/raw/brace-delimited JSON. L1: 100%, L2: 76.67% - both passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Repo Guardian - Action Required❌ Violations Found: Point-in-Time Documents and Temporary ScriptsFile 1:
|
…adata, and calculator tool Three improvements to LearningAgent for L3 (temporal reasoning) scores: 1. Temporal metadata on episodic memories: learn_from_content() now detects dates/temporal markers via LLM and attaches source_date, temporal_order, and temporal_index to stored facts. HierarchicalMemory supports temporal metadata in store_knowledge() and chronological sorting in to_llm_context(). 2. Intent detection before answering: answer_question() classifies questions via a single LLM call into simple_recall, mathematical_computation, temporal_comparison, multi_source_synthesis, or contradiction_resolution. Temporal questions get chronologically sorted facts and explicit reasoning instructions. Math questions get step-by-step computation prompts. 3. Calculator tool: New calculate() action in ActionExecutor safely evaluates arithmetic expressions. Registered by default. After synthesis, if math was needed, _validate_arithmetic() scans for "a op b = c" patterns and corrects any wrong results. L3 score improved from 57% baseline to 67-100% (grader variance due to LLM non-determinism). All 48 existing tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
🤖 Auto-fixed version bump The version in If you need a minor or major version bump instead, please update |
Repo Guardian - Action Required❌ Violations Found: Point-in-Time Documents and Temporary ScriptsFile 1:
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ub.com/rysweet/amplihack into feat/issue-2394-eval-harness-3scenario
Four improvements to make LearningAgent better at organizing, explaining, and communicating knowledge: 1. Source provenance in LLM context: Follow DERIVES_FROM edges to label each fact with its source episode, helping the agent cite sources. 2. Contradiction detection during storage: When high-similarity nodes have conflicting numbers about the same concept, flag SIMILAR_TO edges with contradiction metadata for awareness during synthesis. 3. Knowledge organization via summary concept maps: After extracting facts, generate a brief organizational overview stored as a SUMMARY node, giving the agent a birds-eye view of learned content. 4. Explanation quality in synthesis: Enhanced system prompt to cite sources, connect related facts, and handle contradictions with balanced viewpoints. Summary context included in answer synthesis. Eval results (all 6 levels passing): - L1: 100%, L2: 77%, L3: 43%, L4: 68%, L5: 77%, L6: 100% - Overall: 77.36% Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eval runner Back off verbose system/user prompt additions from 740fed8 that caused L3 to drop from 90% to 43% and L4 from 81% to 68%. The LLM was overwhelmed by "cite sources, explain connections" instructions that made answers rambling instead of precise. Summary context now only included for multi_source_synthesis intent. System prompt restored to short, direct form. Add --parallel N flag to progressive_test_suite CLI and run_progressive_eval.py that runs the suite N times concurrently (ProcessPoolExecutor, max 4 workers), each with a unique agent name and isolated Kuzu DB, then reports median scores per level. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AgenticLoop.reason_iteratively(): plan→search→evaluate→refine cycle - _plan_retrieval: LLM generates targeted search queries - _evaluate_sufficiency: LLM checks if enough info gathered - max_steps=3, exits early if confident Parallel eval: --parallel N flag runs N concurrent evals with unique DBs Reports median scores per level Results (3-run median): L1: 100%, L2: 67%, L3: 43%, L4: 86%, L5: 95%, L6: 98% Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… routing The right fix for L2 is better plan quality in reason_iteratively, not bypassing the plan with a brute-force dump. Also includes: adaptive loop (simple vs complex intent routing), Specs for cognitive memory architecture and teacher-student eval. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-student L7 L2 multi-source synthesis: 60% → 93-100% (target ≥85%) - Source-aware plan prompts with per-source query generation - Source-specific counting instructions in synthesis prompt L3 temporal reasoning: 53% → 88-95% (target ≥70%) - Time-period-specific query generation in plan prompt - Structured arithmetic template (data table → compute → compare → verify) - Conditional temporal context in fact extraction Metacognition eval (new): - ReasoningTrace + ReasoningStep dataclasses in agentic_loop.py - reason_iteratively now returns (facts, nodes, trace) - metacognition_grader.py: 4-dimension scoring (effort calibration, sufficiency judgment, search quality, self-correction) - 13 unit tests passing - Progressive test suite integrates metacognition alongside answer grades Teacher-student L7 framework (new): - TeachingSession: multi-turn conversation between teacher and student agents - teaching_eval.py: complete L7 eval runner with transfer ratio metric - L7 test level with questions and articles - Pedagogically-informed design (advance organizers, scaffolding, reciprocal teaching) 111 tests passing (98 existing + 13 new metacognition tests) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: L6 questions about knowledge updates (Klaebo 9→10 golds) were classified as temporal_comparison/multi_source_synthesis, triggering iterative search that missed update article facts. Fix: Added incremental_update intent type that routes to simple retrieval (all facts visible). Questions about a single entity's trajectory/history/ current state now get simple retrieval, ensuring update data isn't lost. Previous L6 median: 50-53%. Expected L6: ~100%. L3 maintains 86-95% (still uses iterative for temporal comparison). L5 maintains 98-100% (contradiction detection unaffected). 111 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
L1 was dropping because needs_math=true triggered arithmetic verification instructions even for simple recall, causing LLM to add wrong verification (e.g., "12 + 8 + 6 = 14" when answer is 26). Now only complex intents (temporal_comparison, multi_source_synthesis, etc.) get the structured math/temporal prompts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sion Teaching session enhancements based on learning theory research: 1. Self-explanation prompting (Chi 1994 effect): - Every 3 exchanges, teacher asks "why" question - Forces student to explain reasoning, not just receive facts - Chi showed this doubles learning gains 2. Student talk ratio tracking (TeachLM benchmark): - Measures % of dialogue from student - Human tutors achieve ~30%, LLMs typically 5-15% - Displayed in eval results for monitoring 3. Learning theory research notes saved to Specs/LEARNING_THEORY_NOTES.md Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…terfactual) L8 (Metacognition): Agent evaluates its own confidence and knowledge gaps - Confidence calibration: knows what it can/cannot answer - Gap identification: identifies missing information needed - Confidence discrimination: ranks HIGH vs LOW confidence per question - First run: 95% (target ≥50%) L9 (Causal Reasoning): Identifying causal chains from observations - Causal chain: traces cause→effect sequences - Counterfactual causal: "what if X hadn't happened?" - Root cause analysis: identifies deepest cause in chain - First run: 66.67% (target ≥50%) L10 (Counterfactual Reasoning): Hypothetical alternatives - Counterfactual removal: "what if X didn't exist?" - Counterfactual timing: "what if X happened later?" - Counterfactual structural: "what if category Y was removed?" - First run: 48.33% (target ≥40%) Based on research: Pearl's causal hierarchy (2009), Byrne (2005) counterfactual thinking, MUSE framework (2024) for computational metacognition. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Temporal reasoning at STORAGE time (not just retrieval): - SUPERSEDES relationship table in Kuzu schema - _detect_supersedes: at store time, creates SUPERSEDES edges for updates - _mark_superseded: at retrieval time, halves confidence of outdated facts - Synthesis prompt shows [OUTDATED] marker for superseded facts Role reversal in teaching (Feynman technique): - Every 5 exchanges, teacher asks student to teach back - Student's own teaching reinforces their learning L3: 93%, L5: 95%, L6: 100% - no regressions. 111 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CognitiveMemory integration: - cognitive_adapter.py: wraps 6-type CognitiveMemory with backward-compatible interface - Exposes: working memory, sensory, episodic, semantic, procedural, prospective - Falls back to HierarchicalMemory if amplihack-memory-lib not installed - LearningAgent auto-selects CognitiveAdapter when available L1 fix: "Do NOT add arithmetic verification" for simple recall L4 fix: Reconstruct exact ordered step sequences for procedural questions L4 extraction: Procedural hint preserves step numbers in content 111 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ctions Added counterfactual reasoning instructions that detect "what if", "without", "if X had not" keywords. L10: 23% → 71.67%. NOTE: Prompts currently inline - next step: extract to markdown templates. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Per user requirement: prompts should NOT be inline in code.
Created prompts/ directory with 12 markdown templates + loader utility.
Templates use Python format string syntax ({variable_name}).
Loader: load_prompt() with LRU cache, format_prompt() for substitution.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tracks student competency (beginner→intermediate→advanced). Teacher adapts approach based on demonstrated understanding. Promotes after 3 consecutive quality responses. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uv.lock) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
🤖 Auto-fixed version bump The version in If you need a minor or major version bump instead, please update |
Repo Guardian - Action Required❌ Violations Found: Point-in-Time Documents and Temporary ScriptsFile 1:
|
* Revert "feat: Learning agent system with HierarchicalMemory, Graph RAG, and eval harness (#2395)" This reverts commit 6eec628. * [skip ci] chore: Auto-bump patch version --------- Co-authored-by: Ubuntu <azureuser@amplihack-dev.ftnmxvem3frujn3lepas045p5c.xx.internal.cloudapp.net> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Summary
Consolidated PR containing the complete learning agent system:
1. Agent Learning Evaluation Harness
2. HierarchicalMemory with Graph RAG
HierarchicalMemoryclass using Kuzu directly with 5 cognitive memory types (episodic, semantic, procedural, prospective, working)SIMILAR_TOedges auto-computed at store time (Jaccard similarity > 0.3)DERIVES_FROMedges for provenance trackingGraphRAGRetriever: keyword seed → SIMILAR_TO expansion (1-2 hops) → ranked subgraphKnowledgeSubgraph.to_llm_context()for LLM-readable graph formattingFlatRetrieverAdapterfor backward compatibility3. Smart Retrieval
4. Progressive Test Suite (6 levels, not yet wired to new memory)
Known Issue
WikipediaLearningAgentneeds to be renamed/generalized toLearningAgent- it's not Wikipedia-specificTest Results
Files
src/amplihack/agents/goal_seeking/hierarchical_memory.py(764 lines)src/amplihack/agents/goal_seeking/graph_rag_retriever.py(284 lines)src/amplihack/agents/goal_seeking/similarity.py(235 lines)src/amplihack/agents/goal_seeking/flat_retriever_adapter.py(188 lines)src/amplihack/eval/(eval harness + progressive test suite)tests/agents/goal_seeking/andtests/eval/Closes #2394, #2396, #2399