H-5847: R&D Planning Agent #8188
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #8188 +/- ##
=======================================
Coverage 58.90% 58.90%
=======================================
Files 1193 1193
Lines 112723 112723
Branches 5013 5013
=======================================
+ Hits 66394 66396 +2
+ Misses 45571 45569 -2
Partials 758 758
PR Summary
Introduces an LLM-driven R&D planning framework (PlanSpec schema, planner agent), adds deterministic/LLM scorers, validation/topology tools, fixtures with E2E tests, and minor config/script updates.
Written by Cursor Bugbot for commit a597c38.
Benchmark results
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| resolve_policies_for_actor | user: empty, selectivity: high, policies: 2002 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: medium, policies: 1001 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: high, policies: 3314 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: medium, policies: 1526 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: high, policies: 2078 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: medium, policies: 1033 | Flame Graph |
policy_resolution_medium
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| resolve_policies_for_actor | user: empty, selectivity: high, policies: 102 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: medium, policies: 51 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: high, policies: 269 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: medium, policies: 107 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: high, policies: 133 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: medium, policies: 63 | Flame Graph |
policy_resolution_none
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| resolve_policies_for_actor | user: empty, selectivity: high, policies: 2 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: medium, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: high, policies: 8 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: medium, policies: 3 | Flame Graph |
policy_resolution_small
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| resolve_policies_for_actor | user: empty, selectivity: high, policies: 52 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: medium, policies: 25 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: high, policies: 94 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: medium, policies: 26 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: high, policies: 66 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: medium, policies: 29 | Flame Graph |
read_scaling_complete
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| entity_by_id;one_depth | 1 entities | Flame Graph | |
| entity_by_id;one_depth | 10 entities | Flame Graph | |
| entity_by_id;one_depth | 25 entities | Flame Graph | |
| entity_by_id;one_depth | 5 entities | Flame Graph | |
| entity_by_id;one_depth | 50 entities | Flame Graph | |
| entity_by_id;two_depth | 1 entities | Flame Graph | |
| entity_by_id;two_depth | 10 entities | Flame Graph | |
| entity_by_id;two_depth | 25 entities | Flame Graph | |
| entity_by_id;two_depth | 5 entities | Flame Graph | |
| entity_by_id;two_depth | 50 entities | Flame Graph | |
| entity_by_id;zero_depth | 1 entities | Flame Graph | |
| entity_by_id;zero_depth | 10 entities | Flame Graph | |
| entity_by_id;zero_depth | 25 entities | Flame Graph | |
| entity_by_id;zero_depth | 5 entities | Flame Graph | |
| entity_by_id;zero_depth | 50 entities | Flame Graph |
read_scaling_linkless
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| entity_by_id | 1 entities | Flame Graph | |
| entity_by_id | 10 entities | Flame Graph | |
| entity_by_id | 100 entities | Flame Graph | |
| entity_by_id | 1000 entities | Flame Graph | |
| entity_by_id | 10000 entities | Flame Graph |
representative_read_entity
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/block/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/book/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/building/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/organization/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/page/v/2 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/person/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/playlist/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/song/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/uk-address/v/1 | Flame Graph | |
representative_read_entity_type
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| get_entity_type_by_id | Account ID: bf5a9ef5-dc3b-43cf-a291-6210c0321eba | Flame Graph | |
representative_read_multiple_entities
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| entity_by_property | traversal_paths=0 | 0 | |
| entity_by_property | traversal_paths=255 | 1,resolve_depths=inherit:1;values:255;properties:255;links:127;link_dests:126;type:true | |
| entity_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:0;links:0;link_dests:0;type:false | |
| entity_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:0;links:1;link_dests:0;type:true | |
| entity_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:2;links:1;link_dests:0;type:true | |
| entity_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:2;properties:2;links:1;link_dests:0;type:true | |
| link_by_source_by_property | traversal_paths=0 | 0 | |
| link_by_source_by_property | traversal_paths=255 | 1,resolve_depths=inherit:1;values:255;properties:255;links:127;link_dests:126;type:true | |
| link_by_source_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:0;links:0;link_dests:0;type:false | |
| link_by_source_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:0;links:1;link_dests:0;type:true | |
| link_by_source_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:2;links:1;link_dests:0;type:true | |
| link_by_source_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:2;properties:2;links:1;link_dests:0;type:true |
scenarios
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| full_test | query-limited | Flame Graph | |
| full_test | query-unlimited | Flame Graph | |
| linked_queries | query-limited | Flame Graph | |
| linked_queries | query-unlimited | Flame Graph |
🌟 What is the purpose of this PR?
Introduces a technical design for decomposing complex R&D goals into structured, executable plans using a planning agent.
A core concept here is treating LLM planning as a "compiler front-end" that produces an Intermediate Representation (IR), the `PlanSpec`, which can be validated, scored, and eventually compiled into executable workflows.

This PR establishes the foundational patterns for plan generation and quality evaluation. Artifacts for this phase are mostly tests and scorers.

There are examples of four test runs' console outputs in the Demo section below.
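
To make the "compiler front-end" idea concrete, here is a minimal sketch of a `PlanSpec`-like IR and one check that such an IR makes possible. It is not the PR's actual schema: the real `PlanSpec` is a Zod schema, and the fields `id`, `description`, and `dependsOn` as well as the helper `topologicalWaves` are assumptions made for illustration. Only the four step types come from the PR description.

```typescript
// Hypothetical, simplified stand-in for the PlanSpec IR (the real one is a Zod schema).
type StepType = "research" | "synthesize" | "experiment" | "develop";

interface PlanStep {
  id: string;
  type: StepType;
  description: string;
  dependsOn: string[]; // assumed field: ids of steps that must finish first
}

interface PlanSpec {
  goal: string;
  steps: PlanStep[];
}

// Kahn's algorithm: groups step ids into "waves" that could run in parallel.
// Returns null if a dependency is unknown or the graph contains a cycle.
function topologicalWaves(plan: PlanSpec): string[][] | null {
  const indegree = new Map<string, number>();
  const dependents = new Map<string, string[]>();
  for (const step of plan.steps) {
    indegree.set(step.id, step.dependsOn.length);
    dependents.set(step.id, []);
  }
  for (const step of plan.steps) {
    for (const dep of step.dependsOn) {
      const list = dependents.get(dep);
      if (list === undefined) return null; // dependency on an unknown step
      list.push(step.id);
    }
  }
  const waves: string[][] = [];
  let ready = plan.steps.filter((s) => s.dependsOn.length === 0).map((s) => s.id);
  let visited = 0;
  while (ready.length > 0) {
    waves.push(ready);
    visited += ready.length;
    const next: string[] = [];
    for (const id of ready) {
      for (const child of dependents.get(id) ?? []) {
        const remaining = (indegree.get(child) ?? 0) - 1;
        indegree.set(child, remaining);
        if (remaining === 0) next.push(child);
      }
    }
    ready = next;
  }
  // Steps left unvisited are part of a cycle.
  return visited === plan.steps.length ? waves : null;
}

const examplePlan: PlanSpec = {
  goal: "Summarize recent papers on a topic",
  steps: [
    { id: "S1", type: "research", description: "Find papers", dependsOn: [] },
    { id: "S2", type: "research", description: "Find surveys", dependsOn: [] },
    { id: "S3", type: "synthesize", description: "Combine findings", dependsOn: ["S1", "S2"] },
  ],
};

// Two waves: S1 and S2 can run in parallel, then S3.
console.log(topologicalWaves(examplePlan));
```

Because the plan is data rather than free text, ordering and parallelism fall out of a standard graph algorithm instead of another LLM call.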
🔗 Related links
- `agent/docs/PLAN-task-decomposition.md`: Full design document and implementation plan
- `agent/docs/E2E-test-results-2024-12-17.md`: Latest E2E test outputs

🚫 Blocked by
None
🔍 What does this change?
Core Schema & Types
- `schemas/plan-spec.ts`: Full Zod schema for `PlanSpec` with 4 step types:
  - `research`: Parallelizable information gathering
  - `synthesize`: Combining findings (integrative) or evaluating results (evaluative)
  - `experiment`: Testing hypotheses (exploratory or confirmatory with preregistration)
  - `develop`: Building/implementing artifacts
- `schemas/planning-fixture.ts`: Types for test fixtures (`PlanningFixture`, `ExpectedPlanCharacteristics`)
- `constants.ts`: 12 agent capability profiles with `canHandle` mappings for executor assignment

Validation & Analysis
- `tools/plan-validator.ts`: 12 structural validation checks
- `tools/topology-analyzer.ts`: DAG analysis utilities

Scoring System
- `scorers/plan-scorers.ts`: 4 deterministic scorers (no LLM, fast):
  - `scorePlanStructure`: DAG validity, parallelism, step type diversity
  - `scorePlanCoverage`: Requirement/hypothesis coverage
  - `scoreExperimentRigor`: Preregistration, success criteria
  - `scoreUnknownsCoverage`: Epistemic completeness
- `scorers/plan-llm-scorers.ts`: 3 LLM-based judges:
  - `goalAlignmentScorer`: Does the plan address the goal?
  - `planGranularityScorer`: Are steps appropriately sized?
  - `hypothesisTestabilityScorer`: Are hypotheses testable?

Planning Agent
- `agents/planner-agent.ts`: `generatePlan(goal, context)` function that uses structured output to produce valid `PlanSpec` instances

Test Fixtures
4 fixtures of increasing complexity in `fixtures/decomposition-prompts/`:
- `summarize-papers`
- `explore-and-recommend`
- `hypothesis-validation`
- `ct-database-goal`

E2E Test Suite
- `workflows/planning-workflow.test.ts`: Comprehensive E2E tests, with the LLM scorers enabled via `RUN_LLM_SCORERS=true`

Pre-Merge Checklist 🚀
🚢 Has this modified a publishable library?
This PR:
📜 Does this require a change to the docs?
The changes in this PR:
🕸️ Does this require a change to the Turbo Graph?
The changes in this PR:
⚠️ Known issues
- ct-database-goal fixture fails validation: The LLM occasionally generates confirmatory experiments without `preregisteredCommitments`. This is a known prompt engineering issue that will be addressed in the revision workflow.
- explore-and-recommend generates unexpected content: The LLM adds hypotheses and experiments not specified in the fixture expectations. This is valid behavior (more thorough than the minimum), but it indicates that the fixture expectations may need adjustment.
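
The ct-database-goal failure comes down to a structural rule that a validator can enforce mechanically. Below is a minimal sketch of such a check under a simplified step shape. `preregisteredCommitments` and the exploratory/confirmatory distinction come from this PR; the `mode` field, `PlanValidationError`, and `checkPreregistration` are hypothetical names, not the actual `plan-validator.ts` API.

```typescript
// Hypothetical, simplified step shapes for illustration only.
interface ExperimentStep {
  id: string;
  type: "experiment";
  mode: "exploratory" | "confirmatory"; // assumed field name
  preregisteredCommitments?: string[];
}

interface OtherStep {
  id: string;
  type: "research" | "synthesize" | "develop";
}

type Step = ExperimentStep | OtherStep;

interface PlanValidationError {
  stepId: string;
  message: string;
}

// Flags every confirmatory experiment with missing or empty commitments.
function checkPreregistration(steps: Step[]): PlanValidationError[] {
  const errors: PlanValidationError[] = [];
  for (const step of steps) {
    if (step.type !== "experiment" || step.mode !== "confirmatory") continue;
    const commitments = step.preregisteredCommitments ?? [];
    if (commitments.length === 0) {
      errors.push({
        stepId: step.id,
        message: "Confirmatory experiment is missing preregisteredCommitments",
      });
    }
  }
  return errors;
}

const exampleSteps: Step[] = [
  { id: "S12", type: "research" },
  { id: "S13", type: "experiment", mode: "exploratory" },
  { id: "S14", type: "experiment", mode: "confirmatory" }, // no commitments
];

// Reports a single error, for step S14.
console.log(checkPreregistration(exampleSteps));
```

Returning a list of errors rather than throwing on the first one keeps the result usable as feedback for a later regeneration attempt.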
🐾 Next steps
Per `PLAN-task-decomposition.md` Section 18:
- `dountil` loop: generate → validate → feedback → regenerate (max 3 attempts)

🛡 What tests cover this?
- `plan-validator.test.ts`: 25 negative fixture tests for validation
- `plan-scorers.test.ts`: 23 unit tests for deterministic scorers
- `plan-llm-scorers.test.ts`: 6 tests for LLM judges
- `fixtures.test.ts`: 4 fixture validation tests
- `planning-workflow.test.ts`: E2E pipeline tests (3/4 passing)

❓ How to test this?
- `cd apps/hash-ai-agent`
- `npx vitest run src/mastra/scorers/plan-scorers.test.ts`
- `npx vitest run src/mastra/workflows/planning-workflow.test.ts`
- `RUN_LLM_SCORERS=true npx vitest run src/mastra/workflows/planning-workflow.test.ts`

📹 Demo
Individual Fixture Tests
summarize-papers (4.2s) — PASS
explore-and-recommend (13.9s) — PASS (with notes)
Note: The LLM generated hypotheses and experiments that the fixture didn't expect. This is not a validation failure — the plan is valid, just more thorough than the minimum expected.
hypothesis-validation (15.4s) — PASS
ct-database-goal (15.8s) — FAIL
Failure Reason: The LLM generated a confirmatory experiment (S14) without including `preregisteredCommitments`. This is a known issue: the prompt needs to emphasize this requirement more strongly, or a revision loop needs to catch and fix it.

Summary Report Test
Summary Report (49.0s) — runs all fixtures sequentially
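
The revision loop named under Next steps (generate → validate → feedback → regenerate, max 3 attempts) could look roughly like the sketch below. Everything here is a hypothetical stand-in: a plain `while` loop replaces the workflow-level `dountil`, the stub generator replaces the LLM-backed `generatePlan`, and `DraftPlan`, `generateWithRevision`, `validateDraft`, and `stubGenerate` are names invented for this example.

```typescript
// Hypothetical stand-ins: the real generatePlan(goal, context) calls an LLM,
// and the real validator lives in tools/plan-validator.ts.
interface DraftPlan {
  steps: { id: string; confirmatory: boolean; preregisteredCommitments: string[] }[];
}

type GenerateFn = (goal: string, feedback: string[]) => Promise<DraftPlan>;
type ValidateFn = (plan: DraftPlan) => string[];

interface RevisionResult {
  plan: DraftPlan;
  errors: string[]; // empty when the final plan is valid
  attempts: number;
}

async function generateWithRevision(
  goal: string,
  generate: GenerateFn,
  validate: ValidateFn,
  maxAttempts = 3,
): Promise<RevisionResult> {
  let plan = await generate(goal, []);
  let errors = validate(plan);
  let attempts = 1;
  // Regenerate with the validation errors as feedback until the plan is
  // valid or the attempt budget is used up.
  while (errors.length > 0 && attempts < maxAttempts) {
    plan = await generate(goal, errors);
    errors = validate(plan);
    attempts += 1;
  }
  return { plan, errors, attempts };
}

// Validator stub: flags confirmatory steps without preregistered commitments.
const validateDraft: ValidateFn = (plan) =>
  plan.steps
    .filter((s) => s.confirmatory && s.preregisteredCommitments.length === 0)
    .map((s) => `${s.id}: confirmatory experiment lacks preregisteredCommitments`);

// Generator stub: omits the commitments until it receives feedback.
const stubGenerate: GenerateFn = async (_goal, feedback) => ({
  steps: [
    {
      id: "S14",
      confirmatory: true,
      preregisteredCommitments: feedback.length > 0 ? ["primary metric fixed in advance"] : [],
    },
  ],
});

generateWithRevision("example goal", stubGenerate, validateDraft).then((result) => {
  console.log(result.attempts, result.errors); // 2 attempts, no remaining errors
});
```

Returning the remaining errors instead of throwing lets the caller decide whether an invalid plan after the last attempt is a hard failure or something to report, as the E2E summary does.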