H-5847: R&D Planning Agent #8188

lunelson · 2025-12-16T17:12:24Z

🌟 What is the purpose of this PR?

Introduces a technical design for decomposing complex R&D goals into structured, executable plans using a planning agent.

A core concept here is treating LLM planning as a "compiler front-end" that produces an Intermediate Representation (IR), the PlanSpec, which can be validated, scored, and eventually compiled into executable workflows.

This PR establishes the foundational patterns for plan generation and quality evaluation. Artifacts for this phase are mostly tests and scorers.

Console output from four test runs is shown in the Demo section below.

🔗 Related links

agent/docs/PLAN-task-decomposition.md — Full design document and implementation plan
agent/docs/E2E-test-results-2024-12-17.md — Latest E2E test outputs

🚫 Blocked by

None

🔍 What does this change?

Core Schema & Types

schemas/plan-spec.ts — Full Zod schema for PlanSpec with 4 step types:
- research — Parallelizable information gathering
- synthesize — Combining findings (integrative) or evaluating results (evaluative)
- experiment — Testing hypotheses (exploratory or confirmatory with preregistration)
- develop — Building/implementing artifacts
schemas/planning-fixture.ts — Types for test fixtures (PlanningFixture, ExpectedPlanCharacteristics)
constants.ts — 12 agent capability profiles with canHandle mappings for executor assignment
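
To make the IR idea concrete, here is a minimal sketch of what a PlanSpec-like shape could look like. It is not the schema from schemas/plan-spec.ts: the PR defines that with Zod, while this sketch uses plain TypeScript types to stay dependency-free. The field names `dependsOn`, `mode`, `description`, and `goalSummary`, the helper `countStepTypes`, and the step descriptions in the example are assumptions made for illustration; only the four step types, `preregisteredCommitments`, and the plan ID and goal summary come from this PR.

```typescript
// Hypothetical, simplified stand-in for the PlanSpec IR (not the PR's Zod schema).
type StepBase = {
  id: string; // e.g. "S1"
  description: string;
  dependsOn: string[]; // assumed field name: ids of steps that must finish first
};

type ResearchStep = StepBase & { type: "research" };
type SynthesizeStep = StepBase & {
  type: "synthesize";
  mode: "integrative" | "evaluative"; // assumed field name
};
type ExperimentStep = StepBase & {
  type: "experiment";
  mode: "exploratory" | "confirmatory"; // assumed field name
  preregisteredCommitments?: string[];
};
type DevelopStep = StepBase & { type: "develop" };

type PlanStep = ResearchStep | SynthesizeStep | ExperimentStep | DevelopStep;

type PlanSpecSketch = {
  id: string;
  goalSummary: string;
  steps: PlanStep[];
};

// Count steps per type, similar to the "Step types" line in the demo output.
function countStepTypes(plan: PlanSpecSketch): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const step of plan.steps) {
    counts[step.type] = (counts[step.type] ?? 0) + 1;
  }
  return counts;
}

// Example plan; step descriptions and dependencies are invented for this sketch.
const examplePlan: PlanSpecSketch = {
  id: "rag-paper-summary-comparison-plan",
  goalSummary: "Summarize 3 recent RAG papers and create a comparison table.",
  steps: [
    { id: "S1", type: "research", description: "Find candidate papers", dependsOn: [] },
    { id: "S2", type: "research", description: "Extract key claims", dependsOn: ["S1"] },
    {
      id: "S3",
      type: "synthesize",
      mode: "integrative",
      description: "Build comparison table",
      dependsOn: ["S2"],
    },
  ],
};

console.log(JSON.stringify(countStepTypes(examplePlan)));
```

Modelling the step types as a discriminated union on `type` lets a validator or scorer narrow to experiment-specific fields without casts.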

Validation & Analysis

tools/plan-validator.ts — 12 structural validation checks:
- DAG validity (no cycles, valid references)
- Executor compatibility
- Preregistration requirements for confirmatory experiments
- Input/output consistency
tools/topology-analyzer.ts — DAG analysis utilities:
- Entry/exit point detection
- Critical path calculation
- Parallel group identification
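
The DAG checks and topology utilities listed above can be sketched roughly as follows. This is an illustrative approximation, not the code in tools/plan-validator.ts or tools/topology-analyzer.ts: the `DagStep` shape, the `dependsOn` field, and the `analyzeDag` function are hypothetical. It uses a Kahn-style pass that repeatedly peels off steps whose dependencies are already resolved. Treating each peeled level as a "parallel group" is an assumption.

```typescript
// Minimal step shape for this sketch; `dependsOn` is an assumed field name.
type DagStep = { id: string; dependsOn: string[] };

type DagReport = {
  valid: boolean;
  errors: string[];
  entryPoints: string[]; // steps with no dependencies
  exitPoints: string[]; // steps that nothing else depends on
  criticalPathSteps: number; // steps on the longest dependency chain (meaningful only when valid)
  levels: string[][]; // steps grouped by depth; steps in one group do not depend on each other
};

function analyzeDag(steps: DagStep[]): DagReport {
  const errors: string[] = [];
  const ids = new Set(steps.map((s) => s.id));
  const referenced = new Set<string>();

  // Reference check: every dependency must point at a known step.
  for (const step of steps) {
    for (const dep of step.dependsOn) {
      referenced.add(dep);
      if (!ids.has(dep)) {
        errors.push(`Step "${step.id}" references unknown step "${dep}"`);
      }
    }
  }

  // Kahn-style pass: each round resolves the steps whose dependencies are all resolved.
  // If a round resolves nothing while steps remain, the remaining steps contain a cycle.
  const resolved = new Set<string>();
  const levels: string[][] = [];
  let pending = errors.length === 0 ? steps.slice() : [];
  while (pending.length > 0) {
    const ready = pending.filter((s) => s.dependsOn.every((d) => resolved.has(d)));
    if (ready.length === 0) {
      errors.push(`Cycle detected; unresolved steps: ${pending.map((s) => s.id).join(", ")}`);
      break;
    }
    levels.push(ready.map((s) => s.id));
    ready.forEach((s) => resolved.add(s.id));
    pending = pending.filter((s) => !resolved.has(s.id));
  }

  return {
    valid: errors.length === 0,
    errors,
    entryPoints: steps.filter((s) => s.dependsOn.length === 0).map((s) => s.id),
    exitPoints: steps.filter((s) => !referenced.has(s.id)).map((s) => s.id),
    criticalPathSteps: levels.length,
    levels,
  };
}

// Invented example: two parallel entry steps feeding a third, then a final step.
const diamond: DagStep[] = [
  { id: "S1", dependsOn: [] },
  { id: "S2", dependsOn: [] },
  { id: "S3", dependsOn: ["S1", "S2"] },
  { id: "S4", dependsOn: ["S3"] },
];
const dagReport = analyzeDag(diamond);
console.log(dagReport.entryPoints, dagReport.exitPoints, dagReport.criticalPathSteps);
```

A step lands in round k exactly when its longest chain of predecessors has k steps, so the number of rounds equals the critical path length in steps.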

Scoring System

scorers/plan-scorers.ts — 4 deterministic scorers (no LLM, fast):
- scorePlanStructure — DAG validity, parallelism, step type diversity
- scorePlanCoverage — Requirement/hypothesis coverage
- scoreExperimentRigor — Preregistration, success criteria
- scoreUnknownsCoverage — Epistemic completeness
scorers/plan-llm-scorers.ts — 3 LLM-based judges:
- goalAlignmentScorer — Does plan address the goal?
- planGranularityScorer — Are steps appropriately sized?
- hypothesisTestabilityScorer — Are hypotheses testable?
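
As a rough illustration of what "deterministic" means here, the sketch below scores requirement coverage as the fraction of requirements addressed by at least one step. It is not the implementation of scorePlanCoverage: the `requirementIds` field, the `scoreRequirementCoverage` function, and the scoring formula are assumptions, and the real scorer may weight requirements and hypotheses differently.

```typescript
// Hypothetical shapes; the real PlanSpec fields may differ.
type Requirement = { id: string };
type ScoredStep = { id: string; requirementIds: string[] };

type CoverageResult = {
  score: number; // 0..1
  uncovered: string[]; // requirement ids that no step addresses
};

// Pure function of the plan: no LLM call, so results are repeatable and fast.
function scoreRequirementCoverage(
  requirements: Requirement[],
  steps: ScoredStep[],
): CoverageResult {
  // Assumption: a plan with no requirements is treated as fully covered.
  if (requirements.length === 0) {
    return { score: 1, uncovered: [] };
  }
  const addressed = new Set<string>();
  steps.forEach((step) => step.requirementIds.forEach((id) => addressed.add(id)));

  const uncovered = requirements.filter((r) => !addressed.has(r.id)).map((r) => r.id);
  const score = (requirements.length - uncovered.length) / requirements.length;
  return { score, uncovered };
}

// Invented example: three of four requirements are addressed.
const coverage = scoreRequirementCoverage(
  [{ id: "R1" }, { id: "R2" }, { id: "R3" }, { id: "R4" }],
  [
    { id: "S1", requirementIds: ["R1", "R2"] },
    { id: "S2", requirementIds: ["R2", "R3"] },
  ],
);
console.log(`${(coverage.score * 100).toFixed(1)}%`, coverage.uncovered);
```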

Planning Agent

agents/planner-agent.ts — generatePlan(goal, context) function that uses structured output to produce valid PlanSpec instances

Test Fixtures

4 fixtures of increasing complexity in fixtures/decomposition-prompts/:

Fixture	Complexity	Step Types
`summarize-papers`	Simple linear	research → synthesize
`explore-and-recommend`	Parallel research	research (parallel) → synthesize (evaluative)
`hypothesis-validation`	With experiments	research → experiment → synthesize
`ct-database-goal`	Full R&D cycle	All 4 types, hypotheses, experiments

E2E Test Suite

workflows/planning-workflow.test.ts — Comprehensive E2E tests:
- Runs all 4 fixtures through the full pipeline
- Validates generated plans
- Runs deterministic scorers
- Optional LLM scorers via RUN_LLM_SCORERS=true
- Generates summary report with score table

Pre-Merge Checklist 🚀

🚢 Has this modified a publishable library?

This PR:

does not modify any publishable blocks or libraries, or modifications do not need publishing

📜 Does this require a change to the docs?

The changes in this PR:

are internal and do not require a docs change

🕸️ Does this require a change to the Turbo Graph?

The changes in this PR:

do not affect the execution graph

⚠️ Known issues

- ct-database-goal fixture fails validation: the LLM occasionally generates confirmatory experiments without preregisteredCommitments. This is a known prompt engineering issue that will be addressed in the revision workflow.
- explore-and-recommend generates unexpected content: the LLM adds hypotheses and experiments not specified in the fixture expectations. This is valid behavior (more thorough than the minimum), but indicates fixture expectations may need adjustment.
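
For reference, the preregistration rule behind the first issue can be sketched as a small structural check. The error code and message follow the validator output shown in the Demo section. The step shape, the `mode` field, the `checkPreregistration` function, and the choice to treat an empty array the same as a missing one are assumptions, not the code in tools/plan-validator.ts.

```typescript
// Hypothetical step shapes for this sketch.
type ExperimentLike = {
  id: string;
  type: "experiment";
  mode: "exploratory" | "confirmatory"; // assumed field name
  preregisteredCommitments?: string[];
};
type NonExperimentStep = { id: string; type: "research" | "synthesize" | "develop" };
type CheckedStep = ExperimentLike | NonExperimentStep;

type ValidationError = { code: string; message: string };

function checkPreregistration(steps: CheckedStep[]): ValidationError[] {
  const errors: ValidationError[] = [];
  for (const step of steps) {
    // Only confirmatory experiments carry the preregistration requirement.
    if (step.type !== "experiment" || step.mode !== "confirmatory") continue;

    // Assumption: an empty list counts as missing.
    const commitments = step.preregisteredCommitments ?? [];
    if (commitments.length === 0) {
      errors.push({
        code: "MISSING_PREREGISTERED_COMMITMENTS",
        message: `Confirmatory experiment "${step.id}" must have preregistered commitments`,
      });
    }
  }
  return errors;
}

// Invented example: S14 is confirmatory but has no commitments.
const preregErrors = checkPreregistration([
  { id: "S12", type: "research" },
  { id: "S13", type: "experiment", mode: "exploratory" },
  { id: "S14", type: "experiment", mode: "confirmatory" },
  {
    id: "S15",
    type: "experiment",
    mode: "confirmatory",
    preregisteredCommitments: ["Primary metric and threshold fixed before the run"],
  },
]);
console.log(preregErrors);
```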

🐾 Next steps

Per PLAN-task-decomposition.md Section 18:

- Revision workflow loop: implement a dountil loop (generate → validate → feedback → regenerate, max 3 attempts)
- Supervisor agent: LLM approval gate before plan finalization
- Prompt improvements: strengthen the preregisteredCommitments requirement
- Stub execution: low priority, deferred
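
The planned revision loop could look roughly like the sketch below. It is a plain async loop standing in for the workflow-level dountil, not the eventual implementation: `generateWithRevision`, the `generate` and `validate` callbacks, and the `Issue` shape are hypothetical. The planner is injected as a function so the control flow can be exercised without an LLM.

```typescript
type Issue = { code: string; message: string };

type RevisionResult<P> = {
  plan: P;
  valid: boolean;
  attempts: number;
  issues: Issue[]; // issues from the last attempt (empty when valid)
};

// generate -> validate -> feed issues back -> regenerate, up to maxAttempts.
async function generateWithRevision<P>(
  generate: (feedback: Issue[]) => Promise<P>, // hypothetical wrapper around the planner agent
  validate: (plan: P) => Issue[],
  maxAttempts = 3,
): Promise<RevisionResult<P>> {
  let plan = await generate([]);
  let issues = validate(plan);
  let attempts = 1;

  while (issues.length > 0 && attempts < maxAttempts) {
    plan = await generate(issues); // previous validation errors become feedback
    issues = validate(plan);
    attempts += 1;
  }
  return { plan, valid: issues.length === 0, attempts, issues };
}

// Fake planner for illustration: it only adds commitments once it receives feedback.
type DraftPlan = { hasCommitments: boolean };

const fakeGenerate = async (feedback: Issue[]): Promise<DraftPlan> => ({
  hasCommitments: feedback.length > 0,
});

const fakeValidate = (plan: DraftPlan): Issue[] =>
  plan.hasCommitments
    ? []
    : [
        {
          code: "MISSING_PREREGISTERED_COMMITMENTS",
          message: 'Confirmatory experiment "S14" must have preregistered commitments',
        },
      ];

generateWithRevision(fakeGenerate, fakeValidate).then((result) => {
  console.log(result.valid, result.attempts); // true 2
});
```

Returning the last set of issues along with the plan keeps a failed run debuggable after the attempt limit is reached.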

🛡 What tests cover this?

plan-validator.test.ts — 25 negative fixture tests for validation
plan-scorers.test.ts — 23 unit tests for deterministic scorers
plan-llm-scorers.test.ts — 6 tests for LLM judges
fixtures.test.ts — 4 fixture validation tests
planning-workflow.test.ts — E2E pipeline tests (3/4 passing)

❓ How to test this?

Check out the branch
cd apps/hash-ai-agent
Run unit tests: npx vitest run src/mastra/scorers/plan-scorers.test.ts
Run E2E tests: npx vitest run src/mastra/workflows/planning-workflow.test.ts
(Optional) Run with LLM scorers: RUN_LLM_SCORERS=true npx vitest run src/mastra/workflows/planning-workflow.test.ts

📹 Demo

Individual Fixture Tests

summarize-papers (4.2s) — PASS

============================================================
  FIXTURE: summarize-papers
============================================================
Goal: Summarize 3 recent papers on retrieval-augmented generation (RAG) 
           and produce a comparis...

--- Generating Plan ---
  ID: rag-paper-summary-comparison-plan
  Goal Summary: Summarize 3 recent RAG papers and create a comparison table....
  Steps: 3
  Requirements: 3
  Hypotheses: 0
  Step types: {"research":2,"synthesize":1}

--- Validation ---
  Valid: true
  Errors: 0

--- Topology Analysis ---
  Entry points: [S1]
  Exit points: [S3]
  Critical path: 3 steps
  Parallel groups: 3

--- Deterministic Scores ---
  Overall: 92.8%
  Structure: 76.7%
  Coverage: 100.0%
  Experiment Rigor: 100.0%
  Unknowns Coverage: 93.3%

--- Expected Characteristics Check ---
  All expected characteristics met

  (LLM scorers skipped — set RUN_LLM_SCORERS=true to enable)

  Duration: 4.2s

explore-and-recommend (13.9s) — PASS (with notes)

============================================================
  FIXTURE: explore-and-recommend
============================================================
Goal: Research approaches to vector database indexing and recommend 
           the best approach for our ...

--- Generating Plan ---
  ID: vector-db-indexing-research-plan
  Goal Summary: Research vector database indexing approaches and recommend the best for 10M docu...
  Steps: 11
  Requirements: 7
  Hypotheses: 2
  Step types: {"research":4,"synthesize":5,"experiment":2}

--- Validation ---
  Valid: true
  Errors: 0

--- Topology Analysis ---
  Entry points: [S1]
  Exit points: [S11]
  Critical path: 8 steps
  Parallel groups: 8

--- Deterministic Scores ---
  Overall: 92.5%
  Structure: 85.9%
  Coverage: 92.9%
  Experiment Rigor: 92.5%
  Unknowns Coverage: 100.0%

--- Expected Characteristics Check ---
  Issues:
    - Unexpected hypotheses: 2
    - Unexpected experiment steps: 2

  (LLM scorers skipped — set RUN_LLM_SCORERS=true to enable)

  Duration: 13.9s

Note: The LLM generated hypotheses and experiments that the fixture didn't expect. This is not a validation failure — the plan is valid, just more thorough than the minimum expected.

hypothesis-validation (15.4s) — PASS

============================================================
  FIXTURE: hypothesis-validation
============================================================
Goal: Test whether fine-tuning a small LLM (e.g., Llama 3 8B) on 
           domain-specific data outperfo...

--- Generating Plan ---
  ID: entity-extraction-llm-comparison-plan
  Goal Summary: Compare fine-tuned small LLM vs. few-shot large LLM for entity extraction....
  Steps: 12
  Requirements: 4
  Hypotheses: 2
  Step types: {"research":3,"synthesize":3,"experiment":3,"develop":3}

--- Validation ---
  Valid: true
  Errors: 0

--- Topology Analysis ---
  Entry points: [S1, S2, S3]
  Exit points: [S12]
  Critical path: 8 steps
  Parallel groups: 8

--- Deterministic Scores ---
  Overall: 95.3%
  Structure: 86.0%
  Coverage: 100.0%
  Experiment Rigor: 95.0%
  Unknowns Coverage: 100.0%

--- Expected Characteristics Check ---
  All expected characteristics met

  (LLM scorers skipped — set RUN_LLM_SCORERS=true to enable)

  Duration: 15.4s

ct-database-goal (15.8s) — FAIL

============================================================
  FIXTURE: ct-database-goal
============================================================
Goal: Create a backend language and database that is natively aligned 
           with category-theoretica...

--- Generating Plan ---
  ID: ct-db-backend-plan
  Goal Summary: Create a backend language and database natively aligned with category theory, su...
  Steps: 17
  Requirements: 8
  Hypotheses: 4
  Step types: {"research":4,"synthesize":8,"experiment":4,"develop":1}

--- Validation ---
  Valid: false
  Errors: 1
    [MISSING_PREREGISTERED_COMMITMENTS] Confirmatory experiment "S14" must have preregistered commitments

  Duration: 15.8s

Failure Reason: The LLM generated a confirmatory experiment (S14) without including preregisteredCommitments. This is a known issue — the prompt needs to more strongly emphasize this requirement, or a revision loop needs to catch and fix it.

Summary Report Test

Summary Report (49.0s) — runs all fixtures sequentially

============================================================
  SUMMARY REPORT
============================================================

Total: 4 fixtures
Successful: 3
Failed: 1

Failures:
  - ct-database-goal: Validation failed: Confirmatory experiment "S14" must have preregistered commitments

Deterministic Scores:
  Fixture                     | Overall | Structure | Coverage | Rigor | Unknowns
  -------------------------------------------------------------------------------------
  summarize-papers             |     93% |       77% |     100% |  100% |      93%
  explore-and-recommend        |     92% |       86% |      93% |   93% |     100%
  hypothesis-validation        |     95% |       86% |     100% |   95% |     100%

Total duration: 49.0s

codecov · 2025-12-16T17:16:15Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.90%. Comparing base (06cc531) to head (a597c38).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #8188   +/-   ##
=======================================
  Coverage   58.90%   58.90%           
=======================================
  Files        1193     1193           
  Lines      112723   112723           
  Branches     5013     5013           
=======================================
+ Hits        66394    66396    +2     
+ Misses      45571    45569    -2     
  Partials      758      758

Flag	Coverage Δ
rust.harpc-codec	`84.70% <ø> (ø)`
rust.hash-graph-validation	`83.45% <ø> (ø)`
rust.hashql-hir	`89.10% <ø> (ø)`
rust.hashql-syntax-jexpr	`94.05% <ø> (ø)`

cursor · 2025-12-17T15:28:35Z

PR Summary

Introduces an LLM-driven R&D planning framework (PlanSpec schema, planner agent), adds deterministic/LLM scorers, validation/topology tools, fixtures with E2E tests, and minor config/script updates.

Planning Framework (apps/hash-ai-agent)
- Schema: Add schemas/plan-spec.ts (PlanSpec IR with 4 step types) and schemas/planning-fixture.ts.
- Agent: Add agents/planner-agent.ts (generatePlan() with structured output); register in src/mastra/index.ts.
- Constants: Add executor capability profiles in src/mastra/constants.ts.
- Scorers:
  - Deterministic: scorers/plan-scorers.ts (+ unit tests).
  - LLM-based: scorers/plan-llm-scorers.ts (+ opt-in tests via RUN_LLM_SCORERS).
- Validation/Analysis: Integrate existing plan-validator and add topology usage in tests.
- Fixtures & Tests: Add fixtures in fixtures/decomposition-prompts/* and suite fixtures.test.ts; includes complex CT database case.
Docs & Wiki
- Add planning design and results: agent/plans/PLAN-task-decomposition.md, E2E-test-results-2024-12-17.md, prompt templates and wiki notes.
Tooling
- Update package.json test/eval scripts and add baseline-browser-mapping.
- Update markdownlint ignores and extend AGENTS.md with contextual rules.

Written by Cursor Bugbot for commit a597c38.

github-actions · 2025-12-22T18:21:46Z

Benchmark results

@rust/hash-graph-benches – Integrations

policy_resolution_large

Function	Value	Mean	Flame graphs
resolve_policies_for_actor	user: empty, selectivity: high, policies: 2002	$$26.4 \mathrm{ms} \pm 209 \mathrm{μs}\left({\color{gray}0.843 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: empty, selectivity: low, policies: 1	$$3.23 \mathrm{ms} \pm 12.9 \mathrm{μs}\left({\color{gray}0.562 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: empty, selectivity: medium, policies: 1001	$$11.9 \mathrm{ms} \pm 76.5 \mathrm{μs}\left({\color{gray}1.39 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: seeded, selectivity: high, policies: 3314	$$41.8 \mathrm{ms} \pm 289 \mathrm{μs}\left({\color{gray}-0.652 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: seeded, selectivity: low, policies: 1	$$13.8 \mathrm{ms} \pm 82.8 \mathrm{μs}\left({\color{gray}-0.002 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: seeded, selectivity: medium, policies: 1526	$$23.0 \mathrm{ms} \pm 140 \mathrm{μs}\left({\color{gray}-0.438 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: system, selectivity: high, policies: 2078	$$30.1 \mathrm{ms} \pm 195 \mathrm{μs}\left({\color{lightgreen}-28.786 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: system, selectivity: low, policies: 1	$$3.57 \mathrm{ms} \pm 15.4 \mathrm{μs}\left({\color{lightgreen}-82.224 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: system, selectivity: medium, policies: 1033	$$13.6 \mathrm{ms} \pm 97.9 \mathrm{μs}\left({\color{lightgreen}-51.137 \mathrm{\%}}\right) $$	Flame Graph

policy_resolution_medium

Function	Value	Mean	Flame graphs
resolve_policies_for_actor	user: empty, selectivity: high, policies: 102	$$3.61 \mathrm{ms} \pm 18.0 \mathrm{μs}\left({\color{gray}0.006 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: empty, selectivity: low, policies: 1	$$2.84 \mathrm{ms} \pm 12.8 \mathrm{μs}\left({\color{gray}0.808 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: empty, selectivity: medium, policies: 51	$$3.18 \mathrm{ms} \pm 14.8 \mathrm{μs}\left({\color{gray}-0.340 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: seeded, selectivity: high, policies: 269	$$4.97 \mathrm{ms} \pm 26.4 \mathrm{μs}\left({\color{gray}0.146 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: seeded, selectivity: low, policies: 1	$$3.37 \mathrm{ms} \pm 12.3 \mathrm{μs}\left({\color{gray}-0.002 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: seeded, selectivity: medium, policies: 107	$$3.93 \mathrm{ms} \pm 13.9 \mathrm{μs}\left({\color{gray}0.576 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: system, selectivity: high, policies: 133	$$4.22 \mathrm{ms} \pm 23.8 \mathrm{μs}\left({\color{gray}4.03 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: system, selectivity: low, policies: 1	$$3.26 \mathrm{ms} \pm 15.7 \mathrm{μs}\left({\color{gray}0.812 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: system, selectivity: medium, policies: 63	$$3.85 \mathrm{ms} \pm 22.1 \mathrm{μs}\left({\color{gray}0.661 \mathrm{\%}}\right) $$	Flame Graph

policy_resolution_none

Function	Value	Mean	Flame graphs
resolve_policies_for_actor	user: empty, selectivity: high, policies: 2	$$2.53 \mathrm{ms} \pm 12.8 \mathrm{μs}\left({\color{red}6.44 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: empty, selectivity: low, policies: 1	$$2.42 \mathrm{ms} \pm 9.62 \mathrm{μs}\left({\color{gray}4.41 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: empty, selectivity: medium, policies: 1	$$2.48 \mathrm{ms} \pm 10.0 \mathrm{μs}\left({\color{gray}3.04 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: system, selectivity: high, policies: 8	$$2.79 \mathrm{ms} \pm 11.3 \mathrm{μs}\left({\color{red}5.74 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: system, selectivity: low, policies: 1	$$2.62 \mathrm{ms} \pm 14.1 \mathrm{μs}\left({\color{red}5.04 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: system, selectivity: medium, policies: 3	$$2.77 \mathrm{ms} \pm 12.4 \mathrm{μs}\left({\color{gray}3.16 \mathrm{\%}}\right) $$	Flame Graph

policy_resolution_small

Function	Value	Mean	Flame graphs
resolve_policies_for_actor	user: empty, selectivity: high, policies: 52	$$2.91 \mathrm{ms} \pm 11.9 \mathrm{μs}\left({\color{red}5.03 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: empty, selectivity: low, policies: 1	$$2.62 \mathrm{ms} \pm 13.7 \mathrm{μs}\left({\color{red}8.20 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: empty, selectivity: medium, policies: 25	$$2.79 \mathrm{ms} \pm 12.5 \mathrm{μs}\left({\color{red}7.65 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: seeded, selectivity: high, policies: 94	$$3.23 \mathrm{ms} \pm 12.7 \mathrm{μs}\left({\color{gray}3.33 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: seeded, selectivity: low, policies: 1	$$2.84 \mathrm{ms} \pm 11.8 \mathrm{μs}\left({\color{red}6.33 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: seeded, selectivity: medium, policies: 26	$$3.05 \mathrm{ms} \pm 15.4 \mathrm{μs}\left({\color{red}5.25 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: system, selectivity: high, policies: 66	$$3.17 \mathrm{ms} \pm 14.6 \mathrm{μs}\left({\color{red}5.69 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: system, selectivity: low, policies: 1	$$2.79 \mathrm{ms} \pm 10.6 \mathrm{μs}\left({\color{red}6.02 \mathrm{\%}}\right) $$	Flame Graph
resolve_policies_for_actor	user: system, selectivity: medium, policies: 29	$$3.06 \mathrm{ms} \pm 15.7 \mathrm{μs}\left({\color{red}6.12 \mathrm{\%}}\right) $$	Flame Graph

read_scaling_complete

Function	Value	Mean	Flame graphs
entity_by_id;one_depth	1 entities	$$38.9 \mathrm{ms} \pm 130 \mathrm{μs}\left({\color{gray}2.23 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id;one_depth	10 entities	$$76.4 \mathrm{ms} \pm 358 \mathrm{μs}\left({\color{gray}1.87 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id;one_depth	25 entities	$$43.3 \mathrm{ms} \pm 149 \mathrm{μs}\left({\color{gray}-0.125 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id;one_depth	5 entities	$$45.7 \mathrm{ms} \pm 217 \mathrm{μs}\left({\color{gray}1.51 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id;one_depth	50 entities	$$54.0 \mathrm{ms} \pm 285 \mathrm{μs}\left({\color{gray}2.68 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id;two_depth	1 entities	$$40.4 \mathrm{ms} \pm 144 \mathrm{μs}\left({\color{gray}0.440 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id;two_depth	10 entities	$$418 \mathrm{ms} \pm 763 \mathrm{μs}\left({\color{gray}1.29 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id;two_depth	25 entities	$$94.0 \mathrm{ms} \pm 398 \mathrm{μs}\left({\color{gray}1.09 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id;two_depth	5 entities	$$84.0 \mathrm{ms} \pm 298 \mathrm{μs}\left({\color{gray}1.08 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id;two_depth	50 entities	$$278 \mathrm{ms} \pm 678 \mathrm{μs}\left({\color{gray}0.272 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id;zero_depth	1 entities	$$14.5 \mathrm{ms} \pm 54.1 \mathrm{μs}\left({\color{gray}-1.691 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id;zero_depth	10 entities	$$14.9 \mathrm{ms} \pm 82.1 \mathrm{μs}\left({\color{gray}1.38 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id;zero_depth	25 entities	$$15.2 \mathrm{ms} \pm 82.7 \mathrm{μs}\left({\color{gray}2.45 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id;zero_depth	5 entities	$$14.9 \mathrm{ms} \pm 52.9 \mathrm{μs}\left({\color{gray}2.80 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id;zero_depth	50 entities	$$18.0 \mathrm{ms} \pm 105 \mathrm{μs}\left({\color{gray}1.06 \mathrm{\%}}\right) $$	Flame Graph

read_scaling_linkless

Function	Value	Mean	Flame graphs
entity_by_id	1 entities	$$14.7 \mathrm{ms} \pm 66.9 \mathrm{μs}\left({\color{gray}2.53 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id	10 entities	$$14.8 \mathrm{ms} \pm 80.8 \mathrm{μs}\left({\color{gray}2.26 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id	100 entities	$$14.7 \mathrm{ms} \pm 80.1 \mathrm{μs}\left({\color{gray}1.96 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id	1000 entities	$$15.0 \mathrm{ms} \pm 86.3 \mathrm{μs}\left({\color{gray}-1.371 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id	10000 entities	$$22.2 \mathrm{ms} \pm 153 \mathrm{μs}\left({\color{gray}0.126 \mathrm{\%}}\right) $$	Flame Graph

representative_read_entity

Function	Value	Mean	Flame graphs
entity_by_id	entity type ID: `https://blockprotocol.org/@alice/types/entity-type/block/v/1`	$$29.0 \mathrm{ms} \pm 238 \mathrm{μs}\left({\color{gray}-1.358 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id	entity type ID: `https://blockprotocol.org/@alice/types/entity-type/book/v/1`	$$30.6 \mathrm{ms} \pm 313 \mathrm{μs}\left({\color{gray}1.63 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id	entity type ID: `https://blockprotocol.org/@alice/types/entity-type/building/v/1`	$$29.3 \mathrm{ms} \pm 294 \mathrm{μs}\left({\color{gray}-0.161 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id	entity type ID: `https://blockprotocol.org/@alice/types/entity-type/organization/v/1`	$$29.4 \mathrm{ms} \pm 230 \mathrm{μs}\left({\color{gray}1.25 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id	entity type ID: `https://blockprotocol.org/@alice/types/entity-type/page/v/2`	$$29.9 \mathrm{ms} \pm 303 \mathrm{μs}\left({\color{gray}2.60 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id	entity type ID: `https://blockprotocol.org/@alice/types/entity-type/person/v/1`	$$29.2 \mathrm{ms} \pm 297 \mathrm{μs}\left({\color{gray}-1.022 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id	entity type ID: `https://blockprotocol.org/@alice/types/entity-type/playlist/v/1`	$$28.7 \mathrm{ms} \pm 286 \mathrm{μs}\left({\color{gray}-3.507 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id	entity type ID: `https://blockprotocol.org/@alice/types/entity-type/song/v/1`	$$29.5 \mathrm{ms} \pm 287 \mathrm{μs}\left({\color{gray}2.43 \mathrm{\%}}\right) $$	Flame Graph
entity_by_id	entity type ID: `https://blockprotocol.org/@alice/types/entity-type/uk-address/v/1`	$$29.9 \mathrm{ms} \pm 241 \mathrm{μs}\left({\color{gray}1.86 \mathrm{\%}}\right) $$	Flame Graph

representative_read_entity_type

Function	Value	Mean	Flame graphs
get_entity_type_by_id	Account ID: `bf5a9ef5-dc3b-43cf-a291-6210c0321eba`	$$8.06 \mathrm{ms} \pm 33.1 \mathrm{μs}\left({\color{gray}0.162 \mathrm{\%}}\right) $$	Flame Graph

representative_read_multiple_entities

Function	Value	Mean	Flame graphs
entity_by_property	traversal_paths=0	0	$$45.9 \mathrm{ms} \pm 206 \mathrm{μs}\left({\color{gray}0.455 \mathrm{\%}}\right) $$
entity_by_property	traversal_paths=255	1,resolve_depths=inherit:1;values:255;properties:255;links:127;link_dests:126;type:true	$$93.9 \mathrm{ms} \pm 427 \mathrm{μs}\left({\color{gray}0.586 \mathrm{\%}}\right) $$
entity_by_property	traversal_paths=2	1,resolve_depths=inherit:0;values:0;properties:0;links:0;link_dests:0;type:false	$$51.4 \mathrm{ms} \pm 253 \mathrm{μs}\left({\color{gray}0.042 \mathrm{\%}}\right) $$
entity_by_property	traversal_paths=2	1,resolve_depths=inherit:0;values:0;properties:0;links:1;link_dests:0;type:true	$$59.5 \mathrm{ms} \pm 291 \mathrm{μs}\left({\color{gray}-0.240 \mathrm{\%}}\right) $$
entity_by_property	traversal_paths=2	1,resolve_depths=inherit:0;values:0;properties:2;links:1;link_dests:0;type:true	$$67.9 \mathrm{ms} \pm 318 \mathrm{μs}\left({\color{gray}0.312 \mathrm{\%}}\right) $$
entity_by_property	traversal_paths=2	1,resolve_depths=inherit:0;values:2;properties:2;links:1;link_dests:0;type:true	$$74.1 \mathrm{ms} \pm 304 \mathrm{μs}\left({\color{gray}0.352 \mathrm{\%}}\right) $$
link_by_source_by_property	traversal_paths=0	0	$$49.5 \mathrm{ms} \pm 242 \mathrm{μs}\left({\color{gray}0.043 \mathrm{\%}}\right) $$
link_by_source_by_property	traversal_paths=255	1,resolve_depths=inherit:1;values:255;properties:255;links:127;link_dests:126;type:true	$$76.5 \mathrm{ms} \pm 365 \mathrm{μs}\left({\color{gray}0.657 \mathrm{\%}}\right) $$
link_by_source_by_property	traversal_paths=2	1,resolve_depths=inherit:0;values:0;properties:0;links:0;link_dests:0;type:false	$$56.0 \mathrm{ms} \pm 276 \mathrm{μs}\left({\color{gray}0.056 \mathrm{\%}}\right) $$
link_by_source_by_property	traversal_paths=2	1,resolve_depths=inherit:0;values:0;properties:0;links:1;link_dests:0;type:true	$$63.9 \mathrm{ms} \pm 367 \mathrm{μs}\left({\color{gray}0.452 \mathrm{\%}}\right) $$
link_by_source_by_property	traversal_paths=2	1,resolve_depths=inherit:0;values:0;properties:2;links:1;link_dests:0;type:true	$$66.2 \mathrm{ms} \pm 317 \mathrm{μs}\left({\color{gray}0.946 \mathrm{\%}}\right) $$
link_by_source_by_property	traversal_paths=2	1,resolve_depths=inherit:0;values:2;properties:2;links:1;link_dests:0;type:true	$$66.0 \mathrm{ms} \pm 331 \mathrm{μs}\left({\color{gray}0.961 \mathrm{\%}}\right) $$

scenarios

Function	Value	Mean	Flame graphs
full_test	query-limited	$$136 \mathrm{ms} \pm 494 \mathrm{μs}\left({\color{gray}4.38 \mathrm{\%}}\right) $$	Flame Graph
full_test	query-unlimited	$$132 \mathrm{ms} \pm 470 \mathrm{μs}\left({\color{gray}-0.484 \mathrm{\%}}\right) $$	Flame Graph
linked_queries	query-limited	$$38.8 \mathrm{ms} \pm 161 \mathrm{μs}\left({\color{lightgreen}-62.011 \mathrm{\%}}\right) $$	Flame Graph
linked_queries	query-unlimited	$$582 \mathrm{ms} \pm 1.04 \mathrm{ms}\left({\color{gray}-0.191 \mathrm{\%}}\right) $$	Flame Graph

github-actions bot added the area/deps, area/apps > hash*, area/infra, area/libs, and area/apps labels Dec 16, 2025

vercel bot deployed to Preview – petrinaut December 16, 2025 17:13 View deployment

vercel bot deployed to Preview – hash December 16, 2025 17:16 View deployment

lunelson changed the base branch from main to ln/h-5746-sync-research-and-plans December 16, 2025 17:38

lunelson force-pushed the ln/h-5746-sync-research-and-plans branch from fdb65e6 to 426a377 Compare December 16, 2025 17:40

lunelson force-pushed the ln/h-5847-dynamic-workflows branch from 7bfffe8 to f1255fd Compare December 16, 2025 17:40

vercel bot temporarily deployed to Preview – petrinaut December 16, 2025 17:40 Inactive

github-actions bot removed the area/deps and area/libs labels Dec 16, 2025

vercel bot deployed to Preview – hash December 16, 2025 17:53 View deployment

vilkinsons assigned lunelson Dec 16, 2025

lunelson force-pushed the ln/h-5847-dynamic-workflows branch from f1255fd to cf06b67 Compare December 17, 2025 14:30

vercel bot deployed to Preview – petrinaut December 17, 2025 14:34 View deployment

vercel bot deployed to Preview – hash December 17, 2025 14:37 View deployment

lunelson force-pushed the ln/h-5847-dynamic-workflows branch from cf06b67 to 0e31161 Compare December 17, 2025 15:12

github-advanced-security bot found potential problems Dec 17, 2025

lunelson force-pushed the ln/h-5847-dynamic-workflows branch from 0e31161 to 43952c1 Compare December 17, 2025 15:15

vercel bot deployed to Preview – petrinaut December 17, 2025 15:19 View deployment

lunelson force-pushed the ln/h-5847-dynamic-workflows branch from 43952c1 to f7aecae Compare December 17, 2025 15:27

lunelson marked this pull request as ready for review December 17, 2025 15:28

graphite-app bot requested a review from a team December 17, 2025 15:28

vercel bot deployed to Preview – petrinaut December 17, 2025 15:31 View deployment

cursor bot reviewed Dec 17, 2025

apps/hash-ai-agent/src/mastra/fixtures/decomposition-prompts/fixtures.test.ts (resolved)

apps/hash-ai-agent/src/mastra/fixtures/decomposition-prompts/fixtures.test.ts (resolved)

vercel bot deployed to Preview – hash December 17, 2025 15:35 View deployment

more review fixes

b352721

vercel bot temporarily deployed to Preview – petrinaut December 22, 2025 10:41 Inactive

vercel bot deployed to Preview – hash December 22, 2025 10:48 View deployment

cursor bot reviewed Dec 22, 2025

apps/hash-ai-agent/src/mastra/scorers/plan-llm-scorers.ts (resolved)

further review fixes

1628d78

vercel bot temporarily deployed to Preview – petrinaut December 22, 2025 11:22 Inactive

cursor bot reviewed Dec 22, 2025

apps/hash-ai-agent/src/mastra/scorers/plan-scorers.ts (outdated, resolved)

apps/hash-ai-agent/src/mastra/agents/planner-agent.ts (outdated, resolved)

review fixes again

7800d23

vercel bot temporarily deployed to Preview – petrinaut December 22, 2025 14:02 Inactive

cursor bot reviewed Dec 22, 2025

apps/hash-ai-agent/src/mastra/scorers/plan-scorers.ts (outdated, resolved)

vercel bot temporarily deployed to Preview – petrinaut December 22, 2025 14:43 Inactive

github-actions bot added the area/deps label Dec 22, 2025

lunelson force-pushed the ln/h-5847-dynamic-workflows branch from f22fe4f to 1b42d20 Compare December 22, 2025 15:43

vercel bot deployed to Preview – hashdotdesign December 22, 2025 15:47 View deployment

vercel bot deployed to Preview – petrinaut December 22, 2025 15:49 View deployment

cursor bot reviewed Dec 22, 2025

apps/hash-ai-agent/src/mastra/scorers/plan-scorers.test.ts (resolved)

apps/hash-ai-agent/src/mastra/agents/planner-agent.ts (outdated, resolved)

vercel bot deployed to Preview – hash December 22, 2025 15:57 View deployment

review and format fixes

c6bd39a

lunelson force-pushed the ln/h-5847-dynamic-workflows branch from 1b42d20 to c6bd39a Compare December 22, 2025 17:00

vercel bot deployed to Preview – hashdotdesign December 22, 2025 17:02 View deployment

vercel bot deployed to Preview – petrinaut December 22, 2025 17:04 View deployment

vercel bot deployed to Preview – hash December 22, 2025 17:08 View deployment

review fixes

a597c38

vercel bot temporarily deployed to Preview – petrinaut December 22, 2025 17:14 Inactive

CiaranMn approved these changes Dec 23, 2025

lunelson added this pull request to the merge queue Dec 23, 2025

Merged via the queue into main with commit b6338db Dec 23, 2025
169 checks passed

lunelson deleted the ln/h-5847-dynamic-workflows branch December 23, 2025 09:12
