
feat(evals): add retrieval-accuracy evaluation framework #45

Open
amondnet wants to merge 1 commit into main from feat/retrieval-evals

Conversation

@amondnet
Contributor

@amondnet amondnet commented Feb 11, 2026

Summary

  • Add IR-style retrieval evals adapting the rpg-encoder queries.json approach to measure whether agents can correctly identify relevant files in the Next.js codebase
  • 20-query dataset spanning 7 categories (client, routing, server, build, devtools, telemetry) with ground truth validated against nextjs-rpg.json
  • Baseline (cc-retrieval) and RPG-enhanced (cc-rpg-retrieval) experiment configs for A/B comparison

What's included

  • Query dataset: fixtures/nextjs-queries.json (20 queries, 3 difficulty levels)
  • Eval template: evals/_templates/retrieval/{EVAL.ts,PROMPT.md,package.json}
  • Generated evals: 20 evals/retrieval-nq-* directories (one per query)
  • Generator script: scripts/generate-retrieval-evals.ts
  • Experiments: experiments/cc-retrieval.ts, experiments/cc-rpg-retrieval.ts
  • Aggregation: scripts/aggregate-retrieval-metrics.ts

Metrics computed

  • Accuracy@1, @3, @5, @10
  • Mean Reciprocal Rank (MRR)
  • Precision and Recall (top-10)
  • Breakdowns by difficulty and category
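For concreteness, the metrics above can be sketched as follows. This is a minimal illustration, not the PR's actual EVAL.ts template; note that this sketch deliberately deduplicates predictions and scopes every metric, including MRR, to the same top-10 cutoff:

```typescript
// Minimal sketch of the retrieval metrics listed above; illustrative only,
// the PR's EVAL.ts template may differ in details.
interface RetrievalMetrics {
  accuracyAt: Record<number, number> // 1 if any relevant file appears in top-N
  mrr: number // 1 / rank of the first relevant file (0 if none in top-K)
  precision: number // relevant files in top-K / files returned (up to K)
  recall: number // relevant files in top-K / total relevant files
}

function computeMetrics(
  predicted: string[],
  expected: string[],
  k = 10,
): RetrievalMetrics {
  const truth = new Set(expected)
  // Deduplicate so a repeated path cannot count as two hits.
  const unique = [...new Set(predicted)]
  const topK = unique.slice(0, k)

  const accuracyAt: Record<number, number> = {}
  for (const n of [1, 3, 5, k]) {
    accuracyAt[n] = unique.slice(0, n).some((p) => truth.has(p)) ? 1 : 0
  }

  const firstHit = topK.findIndex((p) => truth.has(p))
  const hits = topK.filter((p) => truth.has(p)).length

  return {
    accuracyAt,
    mrr: firstHit === -1 ? 0 : 1 / (firstHit + 1),
    precision: topK.length > 0 ? hits / topK.length : 0,
    recall: truth.size > 0 ? hits / truth.size : 0,
  }
}
```

For example, an agent answer of `['b.ts', 'a.ts']` against ground truth `['a.ts', 'c.ts']` yields Acc@1 = 0, Acc@3 = 1, MRR = 0.5, and precision = recall = 0.5.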

Test plan

  • npx tsx scripts/generate-retrieval-evals.ts produces 20 directories
  • npx tsc --noEmit passes for experiment configs
  • bun run lint:fix passes
  • npx tsx scripts/aggregate-retrieval-metrics.ts handles empty results gracefully
  • npx @pleaseai/agent-eval cc-retrieval runs baseline
  • npx @pleaseai/agent-eval cc-rpg-retrieval runs RPG-enhanced

Summary by cubic

Adds a retrieval-accuracy evaluation suite to measure whether agents can correctly find relevant Next.js source files. Includes a 20-query dataset, per-query evals, and scripts to generate and aggregate metrics.

  • New Features

    • Shared EVAL.ts and PROMPT.md templates with per-query ground truth; computes Acc@1/3/5/10, MRR, precision, and recall.
    • 20-query dataset (fixtures/nextjs-queries.json) across multiple categories; generator script creates eval dirs.
    • Baseline (cc-retrieval) and RPG-enhanced (cc-rpg-retrieval) experiment configs for A/B comparison.
    • Aggregation script produces overall and difficulty/category breakdowns.
  • Bug Fixes

    • README: update env setup to use CLAUDE_CODE_OAUTH_TOKEN.

Written for commit 25b2abb. Summary will update on new commits.

Add IR-style retrieval evals that measure whether agents (with/without
RPG tools) can correctly identify relevant files in the Next.js codebase,
adapting the rpg-encoder benchmark queries.json approach.

- 20-query dataset (nextjs-queries.json) spanning 7 categories
- Shared EVAL.ts template computing Acc@K, MRR, Precision, Recall
- Generator script to produce per-query eval directories
- Baseline (cc-retrieval) and RPG-enhanced (cc-rpg-retrieval) experiments
- Metrics aggregation script with difficulty/category breakdowns
@amondnet amondnet force-pushed the feat/retrieval-evals branch from 804812f to 25b2abb on February 11, 2026 at 12:38
@gemini-code-assist
Contributor

Summary of Changes

Hello @amondnet, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the evaluation capabilities for AI coding agents by introducing a comprehensive retrieval-accuracy evaluation framework. This framework allows for precise measurement of an agent's ability to locate relevant code files given a natural language query, backed by a new dataset and detailed metric aggregation. Concurrently, it integrates an interactive encoding protocol into the Model Context Protocol (MCP) server, providing a structured, step-by-step process for agents to build and refine a Repository Planning Graph (RPG) through semantic analysis and hierarchical organization.

Highlights

  • Retrieval Accuracy Evaluation Framework: Introduced a new framework to measure how effectively AI agents can identify relevant source files in the Next.js codebase based on natural-language queries.
  • Query Dataset and Ground Truth: Added a 20-query dataset spanning 7 categories (client, routing, server, build, devtools, telemetry) with ground truth validated against nextjs-rpg.json.
  • Experiment Configurations: Included baseline (cc-retrieval) and RPG-enhanced (cc-rpg-retrieval) experiment configurations to facilitate A/B comparison of agent performance.
  • Metric Aggregation: Provided a script to aggregate and report standard Information Retrieval (IR) metrics, including Accuracy@K, Mean Reciprocal Rank (MRR), Precision, and Recall, with breakdowns by difficulty and category.
  • Interactive Encoding Protocol: Integrated a new interactive encoding protocol into the Model Context Protocol (MCP) server, offering tools and resources for guided semantic lifting, feature synthesis, and hierarchical organization of code entities.
  • Encoder Utility Refactoring: Refactored core file discovery, entity extraction, and dependency injection logic into shared, exported utility functions within the RPGEncoder module.
Changelog
  • agent-evals/README.md
    • Updated API key instructions to reflect the use of CLAUDE_CODE_OAUTH_TOKEN.
  • agent-evals/evals/_templates/retrieval/EVAL.ts
    • Added a new TypeScript evaluation script for measuring retrieval accuracy.
  • agent-evals/evals/_templates/retrieval/PROMPT.md
    • Added a new Markdown prompt template for retrieval evaluation queries.
  • agent-evals/evals/_templates/retrieval/package.json
    • Added a new package.json file for the retrieval evaluation template, including vitest.
  • agent-evals/evals/retrieval-nq-001-find-the-files-that-handle-image-optimiz/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-001'.
  • agent-evals/evals/retrieval-nq-001-find-the-files-that-handle-image-optimiz/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-001'.
  • agent-evals/evals/retrieval-nq-001-find-the-files-that-handle-image-optimiz/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-001'.
  • agent-evals/evals/retrieval-nq-001-find-the-files-that-handle-image-optimiz/package.json
    • Added a specific package.json for retrieval query 'nq-001'.
  • agent-evals/evals/retrieval-nq-002-find-the-files-responsible-for-client-si/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-002'.
  • agent-evals/evals/retrieval-nq-002-find-the-files-responsible-for-client-si/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-002'.
  • agent-evals/evals/retrieval-nq-002-find-the-files-responsible-for-client-si/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-002'.
  • agent-evals/evals/retrieval-nq-002-find-the-files-responsible-for-client-si/package.json
    • Added a specific package.json for retrieval query 'nq-002'.
  • agent-evals/evals/retrieval-nq-003-find-the-files-that-implement-server-sid/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-003'.
  • agent-evals/evals/retrieval-nq-003-find-the-files-that-implement-server-sid/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-003'.
  • agent-evals/evals/retrieval-nq-003-find-the-files-that-implement-server-sid/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-003'.
  • agent-evals/evals/retrieval-nq-003-find-the-files-that-implement-server-sid/package.json
    • Added a specific package.json for retrieval query 'nq-003'.
  • agent-evals/evals/retrieval-nq-004-find-the-files-that-handle-the-webpack-b/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-004'.
  • agent-evals/evals/retrieval-nq-004-find-the-files-that-handle-the-webpack-b/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-004'.
  • agent-evals/evals/retrieval-nq-004-find-the-files-that-handle-the-webpack-b/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-004'.
  • agent-evals/evals/retrieval-nq-004-find-the-files-that-handle-the-webpack-b/package.json
    • Added a specific package.json for retrieval query 'nq-004'.
  • agent-evals/evals/retrieval-nq-005-find-the-files-responsible-for-error-dis/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-005'.
  • agent-evals/evals/retrieval-nq-005-find-the-files-responsible-for-error-dis/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-005'.
  • agent-evals/evals/retrieval-nq-005-find-the-files-responsible-for-error-dis/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-005'.
  • agent-evals/evals/retrieval-nq-005-find-the-files-responsible-for-error-dis/package.json
    • Added a specific package.json for retrieval query 'nq-005'.
  • agent-evals/evals/retrieval-nq-006-find-the-files-that-implement-the-web-re/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-006'.
  • agent-evals/evals/retrieval-nq-006-find-the-files-that-implement-the-web-re/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-006'.
  • agent-evals/evals/retrieval-nq-006-find-the-files-that-implement-the-web-re/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-006'.
  • agent-evals/evals/retrieval-nq-006-find-the-files-that-implement-the-web-re/package.json
    • Added a specific package.json for retrieval query 'nq-006'.
  • agent-evals/evals/retrieval-nq-007-find-the-files-that-handle-font-optimiza/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-007'.
  • agent-evals/evals/retrieval-nq-007-find-the-files-that-handle-font-optimiza/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-007'.
  • agent-evals/evals/retrieval-nq-007-find-the-files-that-handle-font-optimiza/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-007'.
  • agent-evals/evals/retrieval-nq-007-find-the-files-that-handle-font-optimiza/package.json
    • Added a specific package.json for retrieval query 'nq-007'.
  • agent-evals/evals/retrieval-nq-008-find-the-files-that-implement-the-build/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-008'.
  • agent-evals/evals/retrieval-nq-008-find-the-files-that-implement-the-build/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-008'.
  • agent-evals/evals/retrieval-nq-008-find-the-files-that-implement-the-build/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-008'.
  • agent-evals/evals/retrieval-nq-008-find-the-files-that-implement-the-build/package.json
    • Added a specific package.json for retrieval query 'nq-008'.
  • agent-evals/evals/retrieval-nq-009-find-the-files-that-handle-server-action/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-009'.
  • agent-evals/evals/retrieval-nq-009-find-the-files-that-handle-server-action/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-009'.
  • agent-evals/evals/retrieval-nq-009-find-the-files-that-handle-server-action/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-009'.
  • agent-evals/evals/retrieval-nq-009-find-the-files-that-handle-server-action/package.json
    • Added a specific package.json for retrieval query 'nq-009'.
  • agent-evals/evals/retrieval-nq-010-find-the-files-that-implement-the-next-j/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-010'.
  • agent-evals/evals/retrieval-nq-010-find-the-files-that-implement-the-next-j/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-010'.
  • agent-evals/evals/retrieval-nq-010-find-the-files-that-implement-the-next-j/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-010'.
  • agent-evals/evals/retrieval-nq-010-find-the-files-that-implement-the-next-j/package.json
    • Added a specific package.json for retrieval query 'nq-010'.
  • agent-evals/evals/retrieval-nq-011-find-the-files-responsible-for-telemetry/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-011'.
  • agent-evals/evals/retrieval-nq-011-find-the-files-responsible-for-telemetry/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-011'.
  • agent-evals/evals/retrieval-nq-011-find-the-files-responsible-for-telemetry/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-011'.
  • agent-evals/evals/retrieval-nq-011-find-the-files-responsible-for-telemetry/package.json
    • Added a specific package.json for retrieval query 'nq-011'.
  • agent-evals/evals/retrieval-nq-012-find-the-files-that-implement-hot-module/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-012'.
  • agent-evals/evals/retrieval-nq-012-find-the-files-that-implement-hot-module/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-012'.
  • agent-evals/evals/retrieval-nq-012-find-the-files-that-implement-hot-module/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-012'.
  • agent-evals/evals/retrieval-nq-012-find-the-files-that-implement-hot-module/package.json
    • Added a specific package.json for retrieval query 'nq-012'.
  • agent-evals/evals/retrieval-nq-013-find-the-files-that-handle-html-metadata/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-013'.
  • agent-evals/evals/retrieval-nq-013-find-the-files-that-handle-html-metadata/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-013'.
  • agent-evals/evals/retrieval-nq-013-find-the-files-that-handle-html-metadata/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-013'.
  • agent-evals/evals/retrieval-nq-013-find-the-files-that-handle-html-metadata/package.json
    • Added a specific package.json for retrieval query 'nq-013'.
  • agent-evals/evals/retrieval-nq-014-find-the-files-that-implement-react-serv/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-014'.
  • agent-evals/evals/retrieval-nq-014-find-the-files-that-implement-react-serv/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-014'.
  • agent-evals/evals/retrieval-nq-014-find-the-files-that-implement-react-serv/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-014'.
  • agent-evals/evals/retrieval-nq-014-find-the-files-that-implement-react-serv/package.json
    • Added a specific package.json for retrieval query 'nq-014'.
  • agent-evals/evals/retrieval-nq-015-find-the-files-that-implement-static-pag/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-015'.
  • agent-evals/evals/retrieval-nq-015-find-the-files-that-implement-static-pag/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-015'.
  • agent-evals/evals/retrieval-nq-015-find-the-files-that-implement-static-pag/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-015'.
  • agent-evals/evals/retrieval-nq-015-find-the-files-that-implement-static-pag/package.json
    • Added a specific package.json for retrieval query 'nq-015'.
  • agent-evals/evals/retrieval-nq-016-find-the-files-that-implement-the-script/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-016'.
  • agent-evals/evals/retrieval-nq-016-find-the-files-that-implement-the-script/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-016'.
  • agent-evals/evals/retrieval-nq-016-find-the-files-that-implement-the-script/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-016'.
  • agent-evals/evals/retrieval-nq-016-find-the-files-that-implement-the-script/package.json
    • Added a specific package.json for retrieval query 'nq-016'.
  • agent-evals/evals/retrieval-nq-017-find-the-files-that-implement-the-use-ca/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-017'.
  • agent-evals/evals/retrieval-nq-017-find-the-files-that-implement-the-use-ca/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-017'.
  • agent-evals/evals/retrieval-nq-017-find-the-files-that-implement-the-use-ca/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-017'.
  • agent-evals/evals/retrieval-nq-017-find-the-files-that-implement-the-use-ca/package.json
    • Added a specific package.json for retrieval query 'nq-017'.
  • agent-evals/evals/retrieval-nq-018-find-the-files-that-handle-custom-route/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-018'.
  • agent-evals/evals/retrieval-nq-018-find-the-files-that-handle-custom-route/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-018'.
  • agent-evals/evals/retrieval-nq-018-find-the-files-that-handle-custom-route/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-018'.
  • agent-evals/evals/retrieval-nq-018-find-the-files-that-handle-custom-route/package.json
    • Added a specific package.json for retrieval query 'nq-018'.
  • agent-evals/evals/retrieval-nq-019-find-the-files-that-implement-the-base-s/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-019'.
  • agent-evals/evals/retrieval-nq-019-find-the-files-that-implement-the-base-s/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-019'.
  • agent-evals/evals/retrieval-nq-019-find-the-files-that-implement-the-base-s/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-019'.
  • agent-evals/evals/retrieval-nq-019-find-the-files-that-implement-the-base-s/package.json
    • Added a specific package.json for retrieval query 'nq-019'.
  • agent-evals/evals/retrieval-nq-020-find-the-files-that-implement-dynamic-im/EVAL.ts
    • Added a specific retrieval evaluation script for query 'nq-020'.
  • agent-evals/evals/retrieval-nq-020-find-the-files-that-implement-dynamic-im/GROUND_TRUTH.json
    • Added ground truth data for retrieval query 'nq-020'.
  • agent-evals/evals/retrieval-nq-020-find-the-files-that-implement-dynamic-im/PROMPT.md
    • Added a specific prompt for retrieval query 'nq-020'.
  • agent-evals/evals/retrieval-nq-020-find-the-files-that-implement-dynamic-im/package.json
    • Added a specific package.json for retrieval query 'nq-020'.
  • agent-evals/experiments/cc-retrieval.ts
    • Added a new experiment configuration for baseline retrieval evaluation using 'claude-code'.
  • agent-evals/experiments/cc-rpg-retrieval.ts
    • Added a new experiment configuration for RPG-enhanced retrieval evaluation, including setup for an RPG MCP server.
  • agent-evals/fixtures/nextjs-queries.json
    • Added a JSON dataset containing 20 natural language queries for Next.js codebase retrieval evaluations.
  • agent-evals/scripts/aggregate-retrieval-metrics.ts
    • Added a new script to collect and aggregate retrieval metrics from experiment results, generating a Markdown report.
  • agent-evals/scripts/generate-retrieval-evals.ts
    • Added a new script to programmatically generate individual retrieval evaluation directories based on the nextjs-queries.json fixture.
  • src/encoder/encoder.ts
    • Updated imports to include ParseResult from ../utils/ast.
    • Moved discoverFiles, generateEntityId, extractEntitiesFromFile, resolveImportPath, and injectDependencies into exported utility functions.
    • Removed private helper methods discoverFiles, walkDirectory, matchesPattern, globMatch, matchSegments, matchSegment, generateEntityId, mapEntityType, injectDependencies, buildFilePathMap, extractFileDependencies, and resolveImportPath from the RPGEncoder class.
    • Modified RPGEncoder to utilize the newly exported utility functions for file discovery, entity extraction, and dependency injection.
  • src/mcp/interactive/encoder.ts
    • Added a new InteractiveEncoder class to manage the interactive encoding workflow, including methods for building the structural index, submitting semantic features, finalizing features, synthesizing file-level features, submitting hierarchy assignments, and routing entities.
  • src/mcp/interactive/index.ts
    • Added an index file for the interactive MCP module, exporting the InteractiveEncoder and related types.
    • Added registerInteractiveProtocol function to register interactive encoding tools, resources, and prompts on an MCP server.
  • src/mcp/interactive/prompt-texts.ts
    • Added constants defining instruction prompts for semantic feature extraction, domain discovery, hierarchy assignment, entity routing, file-level synthesis, and the overall encoding workflow.
  • src/mcp/interactive/prompts.ts
    • Added functions to register two new prompts, rpg-encode-repo and rpg-route-entities, on the MCP server to guide agents through interactive encoding workflows.
  • src/mcp/interactive/resources.ts
    • Added functions to register five new read-only resources on the MCP server, providing access to encoding status, entity batches, hierarchy context, routing candidates, and synthesis batches.
  • src/mcp/interactive/state.ts
    • Added a new InteractiveState class to manage the state of interactive encoding, including entities, batch boundaries, lifted features, file-level features, synthesized features, hierarchy assignments, pending routing, and graph revision tracking.
  • src/mcp/interactive/tools.ts
    • Added functions to register six new mutation tools on the MCP server: rpg_build_index, rpg_submit_features, rpg_finalize_features, rpg_submit_synthesis, rpg_submit_hierarchy, and rpg_submit_routing, enabling agents to interactively build and refine an RPG.
  • src/mcp/server.ts
    • Updated createMcpServer to optionally register the interactive encoding protocol.
    • Modified the main function to accept an --interactive command-line flag to enable the interactive protocol.
  • tests/interactive-encoder.test.ts
    • Added unit tests for the InteractiveState class, covering batching logic, coverage tracking, graph revision generation, and entity retrieval by scope and batch.
  • tests/mcp-interactive.integration.test.ts
    • Added integration tests for the full interactive encoding workflow, including building the structural index, submitting features, finalizing features, synthesizing file features, submitting hierarchy, and handling errors and pagination.
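The batching and coverage tracking described for InteractiveState above can be pictured with a small sketch. The class shape and method names here are assumptions inferred from the changelog, not the PR's actual implementation:

```typescript
// Hypothetical sketch of batched entity iteration and coverage tracking,
// as described for InteractiveState; names are assumptions, not the PR's code.
type Entity = { id: string; filePath: string }

class InteractiveStateSketch {
  private lifted = new Map<string, string>() // entityId -> semantic feature

  constructor(
    private entities: Entity[],
    private batchSize: number,
  ) {}

  /** Entities for a zero-based batch index (simple pagination). */
  getBatch(index: number): Entity[] {
    const start = index * this.batchSize
    return this.entities.slice(start, start + this.batchSize)
  }

  get batchCount(): number {
    return Math.ceil(this.entities.length / this.batchSize)
  }

  submitFeature(entityId: string, feature: string): void {
    this.lifted.set(entityId, feature)
  }

  /** Fraction of entities with a lifted feature (coverage tracking). */
  get coverage(): number {
    return this.entities.length ? this.lifted.size / this.entities.length : 0
  }
}
```

Under this sketch, an agent would page through `getBatch(0) … getBatch(batchCount - 1)`, calling `submitFeature` per entity, and a finalize step could refuse to proceed until `coverage` reaches 1.0.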

@codecov

codecov bot commented Feb 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive retrieval-accuracy evaluation framework and a new interactive encoding protocol for MCP. The changes are extensive and well-structured, including new scripts for generating and aggregating evaluations, templates, experiment configurations, and a significant refactoring of the encoder to support the new interactive mode. The code quality is high, and the new features are well-tested. My review focuses on improving the robustness of the new scripts and suggesting the use of a standard library to reduce custom code maintenance.

I am having trouble creating individual review comments; my feedback is included below.

agent-evals/scripts/aggregate-retrieval-metrics.ts (106-111)

medium

The script will crash if any metrics.json file is malformed or empty, which can happen with failed or interrupted experiment runs. It would be more robust to wrap the JSON.parse call and subsequent logic in a try...catch block to handle potential parsing errors gracefully. This would allow the script to continue aggregating results from other valid files instead of failing completely.

try {
  const metrics: Metrics = JSON.parse(readFileSync(metricsPath, 'utf-8'))
  const key = `${experiment.name}/${model.name}`
  if (!experimentMetrics.has(key)) {
    experimentMetrics.set(key, [])
  }
  experimentMetrics.get(key)!.push(metrics)
} catch {
  console.warn(`[WARN] Skipping malformed metrics file: ${metricsPath}`)
}

src/encoder/encoder.ts (143-145)

medium

This file includes a custom implementation for glob pattern matching. While it covers basic cases, it's generally better to use a well-established and thoroughly tested library like micromatch. Using a standard library would make the code more robust, handle a wider range of glob syntax and edge cases correctly, and reduce the amount of custom code to maintain.

After adding micromatch as a dependency, this function and the related globMatch, matchSegments, and matchSegment helpers could be replaced with a simpler implementation:

import micromatch from 'micromatch';

function matchesPattern(filePath: string, patterns: string[]): boolean {
  // micromatch handles path normalization and multiple patterns
  return micromatch.isMatch(filePath, patterns);
}


@cubic-dev-ai cubic-dev-ai bot left a comment


16 issues found across 100 files

Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed.

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="agent-evals/README.md">

<violation number="1" location="agent-evals/README.md:18">
P2: Typo: "Claud Code" should be "Claude Code".</violation>
</file>

<file name="agent-evals/evals/retrieval-nq-008-find-the-files-that-implement-the-build/EVAL.ts">

<violation number="1" location="agent-evals/evals/retrieval-nq-008-find-the-files-that-implement-the-build/EVAL.ts:113">
P2: MRR is computed over the unbounded predicted list while all other metrics are scoped to top-10. This means a correct result at rank 50 yields a non-zero MRR, inconsistent with accuracy@10, precision@10, and recall@10. Pass the same top-10 slice for consistency.</violation>
</file>

<file name="agent-evals/evals/_templates/retrieval/package.json">

<violation number="1" location="agent-evals/evals/_templates/retrieval/package.json:5">
P2: Vitest major version mismatch with parent project. The root `agent-evals/package.json` uses `"vitest": "^2.1.0"` but this template specifies `"^3.1.3"`. Align on a single major version to avoid breaking-change surprises and dependency conflicts.</violation>
</file>

<file name="agent-evals/evals/retrieval-nq-007-find-the-files-that-handle-font-optimiza/EVAL.ts">

<violation number="1" location="agent-evals/evals/retrieval-nq-007-find-the-files-that-handle-font-optimiza/EVAL.ts:68">
P1: Duplicate predictions inflate `recall` (and `precision`) — `recall` can exceed 1.0. The `predicted` array is not deduplicated before counting hits, so duplicate matching paths each count as a separate hit. Deduplicate normalized predictions before computing set-based metrics.</violation>
</file>

<file name="agent-evals/evals/retrieval-nq-009-find-the-files-that-handle-server-action/EVAL.ts">

<violation number="1" location="agent-evals/evals/retrieval-nq-009-find-the-files-that-handle-server-action/EVAL.ts:113">
P2: MRR is computed over the full `predicted` array while all other metrics are capped at top-10. This inconsistency can produce misleading results—e.g., a non-zero MRR alongside zero accuracy@10/precision/recall. Cap MRR at the same K=10 cutoff for consistency.</violation>
</file>

<file name="agent-evals/evals/retrieval-nq-017-find-the-files-that-implement-the-use-ca/EVAL.ts">

<violation number="1" location="agent-evals/evals/retrieval-nq-017-find-the-files-that-implement-the-use-ca/EVAL.ts:68">
P2: Duplicate predicted paths inflate `recall` (can exceed 1.0) and `precision`. The `hits` count uses `.filter()` on the raw array, so duplicate entries of a relevant file each count as a separate hit. Deduplicate the normalized predicted paths before computing metrics to match standard IR definitions.</violation>
</file>

<file name="agent-evals/evals/retrieval-nq-012-find-the-files-that-implement-hot-module/EVAL.ts">

<violation number="1" location="agent-evals/evals/retrieval-nq-012-find-the-files-that-implement-hot-module/EVAL.ts:29">
P2: `JSON.parse(raw)` on untrusted agent output (`answer.json`) is not wrapped in a try/catch. If the agent produces malformed JSON, this throws a raw `SyntaxError` instead of returning `[]` or a meaningful failure. Since `loadAnswer` already handles the missing-file case defensively, it should handle parse errors the same way.</violation>
</file>

<file name="agent-evals/evals/retrieval-nq-010-find-the-files-that-implement-the-next-j/EVAL.ts">

<violation number="1" location="agent-evals/evals/retrieval-nq-010-find-the-files-that-implement-the-next-j/EVAL.ts:64">
P1: Bug: `recall` can return values > 1.0 when `predicted` contains duplicate file paths. The `filter` counts every duplicate that matches the expected set, so `hits` can exceed `expected.size`. Deduplicate the normalized predictions before computing the intersection.</violation>
</file>

<file name="agent-evals/evals/_templates/retrieval/EVAL.ts">

<violation number="1" location="agent-evals/evals/_templates/retrieval/EVAL.ts:113">
P2: MRR is computed over the full predicted array with no K limit, unlike all other metrics which are bounded to the top 10. If the agent returns >10 files and the first hit is beyond position 10, MRR will be non-zero while accuracy@10 / precision@10 / recall@10 are all zero, giving contradictory results.</violation>
</file>

<file name="agent-evals/evals/retrieval-nq-007-find-the-files-that-handle-font-optimiza/GROUND_TRUTH.json">

<violation number="1" location="agent-evals/evals/retrieval-nq-007-find-the-files-that-handle-font-optimiza/GROUND_TRUTH.json:6">
P1: Ground truth contradicts the eval prompt. The PROMPT.md instructs agents to "Focus on implementation files (`.ts`, `.tsx`), not type declarations (`.d.ts`)" yet 2 of the 3 expected files here are `.d.ts` type declarations. This is the only query (out of 20) with this mismatch — agents following instructions correctly will score poorly. Replace the `.d.ts` entries with the actual implementation files for font optimization (e.g., `src/server/font-utils.ts` plus the implementation sources under `font/google/` and `font/local/`).</violation>
</file>

<file name="agent-evals/evals/retrieval-nq-014-find-the-files-that-implement-react-serv/EVAL.ts">

<violation number="1" location="agent-evals/evals/retrieval-nq-014-find-the-files-that-implement-react-serv/EVAL.ts:64">
P2: Bug: `recall` (and `precision`) count duplicate predicted paths as separate hits, which can produce recall values > 1.0. Since `predicted` comes from agent-generated `answer.json`, duplicates are plausible. Deduplicate the normalized predicted paths before counting hits.</violation>
</file>

<file name="agent-evals/evals/retrieval-nq-002-find-the-files-responsible-for-client-si/EVAL.ts">

<violation number="1" location="agent-evals/evals/retrieval-nq-002-find-the-files-responsible-for-client-si/EVAL.ts:113">
P2: MRR is computed over the full predicted list while all other metrics are capped at K=10, creating an inconsistency. If the agent lists a correct file beyond position 10, MRR will be nonzero but accuracy@10, precision@10, and recall@10 will all be zero, making cross-metric comparisons misleading. Cap the MRR computation at 10 for consistency (MRR@10).</violation>
</file>

<file name="agent-evals/evals/retrieval-nq-006-find-the-files-that-implement-the-web-re/EVAL.ts">

<violation number="1" location="agent-evals/evals/retrieval-nq-006-find-the-files-that-implement-the-web-re/EVAL.ts:68">
P2: Precision and recall can produce invalid values (recall > 1.0) when predicted paths contain duplicates after normalization. For example, if the agent outputs both `./foo.ts` and `foo.ts`, both normalize to `foo.ts` and are counted as separate hits. Deduplicate normalized predictions before computing hits.</violation>
</file>

<file name="agent-evals/evals/retrieval-nq-004-find-the-files-that-handle-the-webpack-b/EVAL.ts">

<violation number="1" location="agent-evals/evals/retrieval-nq-004-find-the-files-that-handle-the-webpack-b/EVAL.ts:68">
P2: Bug: `recall` (and `precision`) count duplicate predictions, allowing `recall` to exceed 1.0. Normalize and deduplicate `predicted` before counting hits to ensure correct metric computation.</violation>
</file>

<file name="agent-evals/evals/retrieval-nq-003-find-the-files-that-implement-server-sid/EVAL.ts">

<violation number="1" location="agent-evals/evals/retrieval-nq-003-find-the-files-that-implement-server-sid/EVAL.ts:68">
P2: Recall (and precision) can produce invalid values > 1.0 when predicted paths contain duplicates after normalization (e.g., `./a.ts` and `a.ts` both normalize to `a.ts`). Deduplicate normalized predicted paths before computing hit counts.</violation>
</file>

<file name="agent-evals/evals/retrieval-nq-005-find-the-files-responsible-for-error-dis/EVAL.ts">

<violation number="1" location="agent-evals/evals/retrieval-nq-005-find-the-files-responsible-for-error-dis/EVAL.ts:44">
P2: `normalizePath` only strips a single leading `./` or `/` — paths like `././foo` normalize to `./foo` instead of `foo`, and embedded `./` segments (e.g., `a/./b`) are not resolved. This can cause false metric mismatches. Consider using a more robust normalization, e.g., replacing all `./` segments.</violation>
</file>
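One robust alternative, assuming a Node environment, is to delegate segment resolution to `path.posix.normalize` before stripping the leading prefix:

```typescript
import { posix } from 'node:path'

// Sketch of a more robust normalizePath: posix.normalize resolves "./"
// segments (including embedded ones like "a/./b") and collapses repeated
// slashes; the regex then strips any leading "./" or "/" that remains.
function normalizePath(p: string): string {
  return posix.normalize(p).replace(/^(\.\/|\/)+/, '')
}
```

Paths that escape the repo root (`../foo`) are left untouched here; whether those should be rejected outright is a separate design decision.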


@cubic-dev-ai commented on agent-evals/evals/retrieval-nq-007-find-the-files-that-handle-font-optimiza/EVAL.ts, line 68 (Feb 11, 2026):

P1: Duplicate predictions inflate `recall` (and `precision`): `recall` can exceed 1.0. The `predicted` array is not deduplicated before counting hits, so duplicate matching paths each count as a separate hit. Deduplicate normalized predictions before computing set-based metrics.

Context:

  return hits / predicted.length
}

function recall(predicted: string[], expected: Set<string>): number {
  if (expected.size === 0)
    return 0



@cubic-dev-ai commented on agent-evals/README.md, line 18 (Feb 11, 2026):

P2: Typo: "Claud Code" should be "Claude Code".

Context:

Edit `.env.local` and add your API keys:
- `AI_GATEWAY_API_KEY` - Vercel AI Gateway API key ([get yours](https://vercel.com/dashboard))
- `VERCEL_TOKEN` - Vercel personal access token ([create one](https://vercel.com/account/tokens))
- `CLAUDE_CODE_OAUTH_TOKEN` - Claud Code OAuth Token AI Gateway API key (`claude setup-token`)

@cubic-dev-ai commented on agent-evals/evals/retrieval-nq-008-find-the-files-that-implement-the-build/EVAL.ts, line 113 (Feb 11, 2026):

P2: MRR is computed over the unbounded predicted list while all other metrics are scoped to the top 10. A correct result at rank 50 yields a non-zero MRR, inconsistent with accuracy@10, precision@10, and recall@10. Pass the same top-10 slice for consistency.

Context:

    accuracy_at_3: accuracyAtK(predicted, expected, 3),
    accuracy_at_5: accuracyAtK(predicted, expected, 5),
    accuracy_at_10: accuracyAtK(predicted, expected, 10),
    mrr: meanReciprocalRank(predicted, expected),
    precision: precision(predicted.slice(0, 10), expected),
    recall: recall(predicted.slice(0, 10), expected),






@sonarqubecloud commented:

Quality Gate failed.

Failed conditions:
- 63.1% Duplication on New Code (required ≤ 3%)
- C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud.