feat(evals): add retrieval-accuracy evaluation framework #45
Conversation
Add IR-style retrieval evals that measure whether agents (with/without RPG tools) can correctly identify relevant files in the Next.js codebase, adapting the rpg-encoder benchmark `queries.json` approach.

- 20-query dataset (`nextjs-queries.json`) spanning 7 categories
- Shared `EVAL.ts` template computing Acc@K, MRR, Precision, Recall
- Generator script to produce per-query eval directories
- Baseline (`cc-retrieval`) and RPG-enhanced (`cc-rpg-retrieval`) experiments
- Metrics aggregation script with difficulty/category breakdowns
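For reference, the Acc@K metric in the shared template follows the standard IR definition: a query scores 1 if any relevant file appears in the top-K predictions, 0 otherwise. A minimal sketch (the `accuracyAtK` name mirrors the template; paths are assumed to be pre-normalized):

```typescript
// Sketch: accuracy@K — 1 if any of the top-K predicted paths is relevant.
// Assumes both predicted and expected paths are already normalized.
function accuracyAtK(
  predicted: string[],
  expected: Set<string>,
  k: number,
): number {
  return predicted.slice(0, k).some(p => expected.has(p)) ? 1 : 0
}
```

Averaging this 0/1 score across the 20 queries would give the aggregate Acc@K reported by the aggregation script.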
Force-pushed from 804812f to 25b2abb
Summary of Changes

Hello @amondnet, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the evaluation capabilities for AI coding agents by introducing a comprehensive retrieval-accuracy evaluation framework. This framework allows for precise measurement of an agent's ability to locate relevant code files given a natural language query, backed by a new dataset and detailed metric aggregation. Concurrently, it integrates an interactive encoding protocol into the Model Context Protocol (MCP) server, providing a structured, step-by-step process for agents to build and refine a Repository Planning Graph (RPG) through semantic analysis and hierarchical organization.

Highlights
Changelog
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Code Review
This pull request introduces a comprehensive retrieval-accuracy evaluation framework and a new interactive encoding protocol for MCP. The changes are extensive and well-structured, including new scripts for generating and aggregating evaluations, templates, experiment configurations, and a significant refactoring of the encoder to support the new interactive mode. The code quality is high, and the new features are well-tested. My review focuses on improving the robustness of the new scripts and suggesting the use of a standard library to reduce custom code maintenance.
I am having trouble creating individual review comments. My feedback is listed below.
agent-evals/scripts/aggregate-retrieval-metrics.ts (106-111)
The script will crash if any metrics.json file is malformed or empty, which can happen with failed or interrupted experiment runs. It would be more robust to wrap the JSON.parse call and subsequent logic in a try...catch block to handle potential parsing errors gracefully. This would allow the script to continue aggregating results from other valid files instead of failing completely.
```typescript
try {
  const metrics: Metrics = JSON.parse(readFileSync(metricsPath, 'utf-8'))
  const key = `${experiment.name}/${model.name}`
  if (!experimentMetrics.has(key)) {
    experimentMetrics.set(key, [])
  }
  experimentMetrics.get(key)!.push(metrics)
} catch (e) {
  console.warn(`[WARN] Skipping malformed metrics file: ${metricsPath}`)
}
```

src/encoder/encoder.ts (143-145)
This file includes a custom implementation for glob pattern matching. While it covers basic cases, it's generally better to use a well-established and thoroughly tested library like micromatch. Using a standard library would make the code more robust, handle a wider range of glob syntax and edge cases correctly, and reduce the amount of custom code to maintain.
After adding micromatch as a dependency, this function and the related globMatch, matchSegments, and matchSegment helpers could be replaced with a simpler implementation:
```typescript
import micromatch from 'micromatch';

function matchesPattern(filePath: string, patterns: string[]): boolean {
  // micromatch handles path normalization and multiple patterns
  return micromatch.isMatch(filePath, patterns);
}
```
16 issues found across 100 files
Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="agent-evals/README.md">
<violation number="1" location="agent-evals/README.md:18">
P2: Typo: "Claud Code" should be "Claude Code".</violation>
</file>
<file name="agent-evals/evals/retrieval-nq-008-find-the-files-that-implement-the-build/EVAL.ts">
<violation number="1" location="agent-evals/evals/retrieval-nq-008-find-the-files-that-implement-the-build/EVAL.ts:113">
P2: MRR is computed over the unbounded predicted list while all other metrics are scoped to top-10. This means a correct result at rank 50 yields a non-zero MRR, inconsistent with accuracy@10, precision@10, and recall@10. Pass the same top-10 slice for consistency.</violation>
</file>
<file name="agent-evals/evals/_templates/retrieval/package.json">
<violation number="1" location="agent-evals/evals/_templates/retrieval/package.json:5">
P2: Vitest major version mismatch with parent project. The root `agent-evals/package.json` uses `"vitest": "^2.1.0"` but this template specifies `"^3.1.3"`. Align on a single major version to avoid breaking-change surprises and dependency conflicts.</violation>
</file>
<file name="agent-evals/evals/retrieval-nq-007-find-the-files-that-handle-font-optimiza/EVAL.ts">
<violation number="1" location="agent-evals/evals/retrieval-nq-007-find-the-files-that-handle-font-optimiza/EVAL.ts:68">
P1: Duplicate predictions inflate `recall` (and `precision`) — `recall` can exceed 1.0. The `predicted` array is not deduplicated before counting hits, so duplicate matching paths each count as a separate hit. Deduplicate normalized predictions before computing set-based metrics.</violation>
</file>
<file name="agent-evals/evals/retrieval-nq-009-find-the-files-that-handle-server-action/EVAL.ts">
<violation number="1" location="agent-evals/evals/retrieval-nq-009-find-the-files-that-handle-server-action/EVAL.ts:113">
P2: MRR is computed over the full `predicted` array while all other metrics are capped at top-10. This inconsistency can produce misleading results—e.g., a non-zero MRR alongside zero accuracy@10/precision/recall. Cap MRR at the same K=10 cutoff for consistency.</violation>
</file>
<file name="agent-evals/evals/retrieval-nq-017-find-the-files-that-implement-the-use-ca/EVAL.ts">
<violation number="1" location="agent-evals/evals/retrieval-nq-017-find-the-files-that-implement-the-use-ca/EVAL.ts:68">
P2: Duplicate predicted paths inflate `recall` (can exceed 1.0) and `precision`. The `hits` count uses `.filter()` on the raw array, so duplicate entries of a relevant file each count as a separate hit. Deduplicate the normalized predicted paths before computing metrics to match standard IR definitions.</violation>
</file>
<file name="agent-evals/evals/retrieval-nq-012-find-the-files-that-implement-hot-module/EVAL.ts">
<violation number="1" location="agent-evals/evals/retrieval-nq-012-find-the-files-that-implement-hot-module/EVAL.ts:29">
P2: `JSON.parse(raw)` on untrusted agent output (`answer.json`) is not wrapped in a try/catch. If the agent produces malformed JSON, this throws a raw `SyntaxError` instead of returning `[]` or a meaningful failure. Since `loadAnswer` already handles the missing-file case defensively, it should handle parse errors the same way.</violation>
</file>
<file name="agent-evals/evals/retrieval-nq-010-find-the-files-that-implement-the-next-j/EVAL.ts">
<violation number="1" location="agent-evals/evals/retrieval-nq-010-find-the-files-that-implement-the-next-j/EVAL.ts:64">
P1: Bug: `recall` can return values > 1.0 when `predicted` contains duplicate file paths. The `filter` counts every duplicate that matches the expected set, so `hits` can exceed `expected.size`. Deduplicate the normalized predictions before computing the intersection.</violation>
</file>
<file name="agent-evals/evals/_templates/retrieval/EVAL.ts">
<violation number="1" location="agent-evals/evals/_templates/retrieval/EVAL.ts:113">
P2: MRR is computed over the full predicted array with no K limit, unlike all other metrics which are bounded to the top 10. If the agent returns >10 files and the first hit is beyond position 10, MRR will be non-zero while accuracy@10 / precision@10 / recall@10 are all zero, giving contradictory results.</violation>
</file>
<file name="agent-evals/evals/retrieval-nq-007-find-the-files-that-handle-font-optimiza/GROUND_TRUTH.json">
<violation number="1" location="agent-evals/evals/retrieval-nq-007-find-the-files-that-handle-font-optimiza/GROUND_TRUTH.json:6">
P1: Ground truth contradicts the eval prompt. The PROMPT.md instructs agents to "Focus on implementation files (`.ts`, `.tsx`), not type declarations (`.d.ts`)" yet 2 of the 3 expected files here are `.d.ts` type declarations. This is the only query (out of 20) with this mismatch — agents following instructions correctly will score poorly. Replace the `.d.ts` entries with the actual implementation files for font optimization (e.g., `src/server/font-utils.ts` plus the implementation sources under `font/google/` and `font/local/`).</violation>
</file>
<file name="agent-evals/evals/retrieval-nq-014-find-the-files-that-implement-react-serv/EVAL.ts">
<violation number="1" location="agent-evals/evals/retrieval-nq-014-find-the-files-that-implement-react-serv/EVAL.ts:64">
P2: Bug: `recall` (and `precision`) count duplicate predicted paths as separate hits, which can produce recall values > 1.0. Since `predicted` comes from agent-generated `answer.json`, duplicates are plausible. Deduplicate the normalized predicted paths before counting hits.</violation>
</file>
<file name="agent-evals/evals/retrieval-nq-002-find-the-files-responsible-for-client-si/EVAL.ts">
<violation number="1" location="agent-evals/evals/retrieval-nq-002-find-the-files-responsible-for-client-si/EVAL.ts:113">
P2: MRR is computed over the full predicted list while all other metrics are capped at K=10, creating an inconsistency. If the agent lists a correct file beyond position 10, MRR will be nonzero but accuracy@10, precision@10, and recall@10 will all be zero, making cross-metric comparisons misleading. Cap the MRR computation at 10 for consistency (MRR@10).</violation>
</file>
<file name="agent-evals/evals/retrieval-nq-006-find-the-files-that-implement-the-web-re/EVAL.ts">
<violation number="1" location="agent-evals/evals/retrieval-nq-006-find-the-files-that-implement-the-web-re/EVAL.ts:68">
P2: Precision and recall can produce invalid values (recall > 1.0) when predicted paths contain duplicates after normalization. For example, if the agent outputs both `./foo.ts` and `foo.ts`, both normalize to `foo.ts` and are counted as separate hits. Deduplicate normalized predictions before computing hits.</violation>
</file>
<file name="agent-evals/evals/retrieval-nq-004-find-the-files-that-handle-the-webpack-b/EVAL.ts">
<violation number="1" location="agent-evals/evals/retrieval-nq-004-find-the-files-that-handle-the-webpack-b/EVAL.ts:68">
P2: Bug: `recall` (and `precision`) count duplicate predictions, allowing `recall` to exceed 1.0. Normalize and deduplicate `predicted` before counting hits to ensure correct metric computation.</violation>
</file>
<file name="agent-evals/evals/retrieval-nq-003-find-the-files-that-implement-server-sid/EVAL.ts">
<violation number="1" location="agent-evals/evals/retrieval-nq-003-find-the-files-that-implement-server-sid/EVAL.ts:68">
P2: Recall (and precision) can produce invalid values > 1.0 when predicted paths contain duplicates after normalization (e.g., `./a.ts` and `a.ts` both normalize to `a.ts`). Deduplicate normalized predicted paths before computing hit counts.</violation>
</file>
<file name="agent-evals/evals/retrieval-nq-005-find-the-files-responsible-for-error-dis/EVAL.ts">
<violation number="1" location="agent-evals/evals/retrieval-nq-005-find-the-files-responsible-for-error-dis/EVAL.ts:44">
P2: `normalizePath` only strips a single leading `./` or `/` — paths like `././foo` normalize to `./foo` instead of `foo`, and embedded `./` segments (e.g., `a/./b`) are not resolved. This can cause false metric mismatches. Consider using a more robust normalization, e.g., replacing all `./` segments.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
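Several of the flagged issues concern unguarded handling of agent output (notably the bare `JSON.parse` on `answer.json`). A defensive `loadAnswer` along the lines the reviewer suggests might look like this sketch — note that the `{ files: string[] }` shape of `answer.json` is an assumption, not confirmed by the diff:

```typescript
import { existsSync, readFileSync } from 'node:fs'

// Sketch: defensive answer loading. Malformed or missing agent output
// scores as an empty prediction list instead of crashing the eval.
// The { files: string[] } shape of answer.json is an assumption.
function loadAnswer(answerPath: string): string[] {
  if (!existsSync(answerPath)) return []
  try {
    const parsed = JSON.parse(readFileSync(answerPath, 'utf-8'))
    return Array.isArray(parsed?.files) ? parsed.files : []
  } catch {
    return []
  }
}
```

This mirrors the existing missing-file handling, so both failure modes degrade to a zero-score prediction rather than a thrown `SyntaxError`.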
```typescript
  return hits / predicted.length
}

function recall(predicted: string[], expected: Set<string>): number {
```
P1: Duplicate predictions inflate recall (and precision) — recall can exceed 1.0. The predicted array is not deduplicated before counting hits, so duplicate matching paths each count as a separate hit. Deduplicate normalized predictions before computing set-based metrics.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At agent-evals/evals/retrieval-nq-007-find-the-files-that-handle-font-optimiza/EVAL.ts, line 68:
<comment>Duplicate predictions inflate `recall` (and `precision`) — `recall` can exceed 1.0. The `predicted` array is not deduplicated before counting hits, so duplicate matching paths each count as a separate hit. Deduplicate normalized predictions before computing set-based metrics.</comment>
<file context>
@@ -0,0 +1,123 @@
+ return hits / predicted.length
+}
+
+function recall(predicted: string[], expected: Set<string>): number {
+ if (expected.size === 0)
+ return 0
</file context>
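The deduplication fix this comment asks for can be sketched as follows. The `normalizePath` helper is reproduced from the template; `expected` is assumed to already hold normalized paths:

```typescript
// Reproduced from the eval template: strips a leading ./ or /.
function normalizePath(p: string): string {
  return p.replace(/^\.?\//, '').replace(/\/+/g, '/')
}

// Sketch: count hits over the deduplicated, normalized prediction set,
// so duplicates like 'a.ts' and './a.ts' contribute a single hit.
function dedupedHits(predicted: string[], expected: Set<string>): number {
  const unique = new Set(predicted.map(normalizePath))
  let hits = 0
  for (const p of unique) {
    if (expected.has(p)) hits++
  }
  return hits
}

function precision(predicted: string[], expected: Set<string>): number {
  const unique = new Set(predicted.map(normalizePath))
  if (unique.size === 0) return 0
  return dedupedHits(predicted, expected) / unique.size
}

function recall(predicted: string[], expected: Set<string>): number {
  if (expected.size === 0) return 0
  return dedupedHits(predicted, expected) / expected.size
}
```

With the intersection taken over a `Set`, recall is bounded by `expected.size / expected.size = 1.0` by construction.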
```typescript
function precision(predicted: string[], expected: Set<string>): number {
  if (predicted.length === 0)
    return 0
  const hits = predicted.filter(p => expected.has(normalizePath(p))).length
```
P1: Bug: recall can return values > 1.0 when predicted contains duplicate file paths. The filter counts every duplicate that matches the expected set, so hits can exceed expected.size. Deduplicate the normalized predictions before computing the intersection.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At agent-evals/evals/retrieval-nq-010-find-the-files-that-implement-the-next-j/EVAL.ts, line 64:
<comment>Bug: `recall` can return values > 1.0 when `predicted` contains duplicate file paths. The `filter` counts every duplicate that matches the expected set, so `hits` can exceed `expected.size`. Deduplicate the normalized predictions before computing the intersection.</comment>
<file context>
@@ -0,0 +1,123 @@
+function precision(predicted: string[], expected: Set<string>): number {
+ if (predicted.length === 0)
+ return 0
+ const hits = predicted.filter(p => expected.has(normalizePath(p))).length
+ return hits / predicted.length
+}
</file context>
| "query": "Find the files that handle font optimization and loading", | ||
| "expect": [ | ||
| "src/server/font-utils.ts", | ||
| "font/google/index.d.ts", |
P1: Ground truth contradicts the eval prompt. The PROMPT.md instructs agents to "Focus on implementation files (.ts, .tsx), not type declarations (.d.ts)" yet 2 of the 3 expected files here are .d.ts type declarations. This is the only query (out of 20) with this mismatch — agents following instructions correctly will score poorly. Replace the .d.ts entries with the actual implementation files for font optimization (e.g., src/server/font-utils.ts plus the implementation sources under font/google/ and font/local/).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At agent-evals/evals/retrieval-nq-007-find-the-files-that-handle-font-optimiza/GROUND_TRUTH.json, line 6:
<comment>Ground truth contradicts the eval prompt. The PROMPT.md instructs agents to "Focus on implementation files (`.ts`, `.tsx`), not type declarations (`.d.ts`)" yet 2 of the 3 expected files here are `.d.ts` type declarations. This is the only query (out of 20) with this mismatch — agents following instructions correctly will score poorly. Replace the `.d.ts` entries with the actual implementation files for font optimization (e.g., `src/server/font-utils.ts` plus the implementation sources under `font/google/` and `font/local/`).</comment>
<file context>
@@ -0,0 +1,11 @@
+ "query": "Find the files that handle font optimization and loading",
+ "expect": [
+ "src/server/font-utils.ts",
+ "font/google/index.d.ts",
+ "font/local/index.d.ts"
+ ],
</file context>
```markdown
Edit `.env.local` and add your API keys:
- `AI_GATEWAY_API_KEY` - Vercel AI Gateway API key ([get yours](https://vercel.com/dashboard))
- `VERCEL_TOKEN` - Vercel personal access token ([create one](https://vercel.com/account/tokens))
- `CLAUDE_CODE_OAUTH_TOKEN` - Claud Code OAuth Token AI Gateway API key (`claude setup-token`)
```
P2: Typo: "Claud Code" should be "Claude Code".
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At agent-evals/README.md, line 18:
<comment>Typo: "Claud Code" should be "Claude Code".</comment>
<file context>
@@ -15,8 +15,7 @@ Test AI coding agents to measure what actually works.
Edit `.env.local` and add your API keys:
- - `AI_GATEWAY_API_KEY` - Vercel AI Gateway API key ([get yours](https://vercel.com/dashboard))
- - `VERCEL_TOKEN` - Vercel personal access token ([create one](https://vercel.com/account/tokens))
+ - `CLAUDE_CODE_OAUTH_TOKEN` - Claud Code OAuth Token AI Gateway API key (`claude setup-token`)
## Running Evals
</file context>
```typescript
  accuracy_at_3: accuracyAtK(predicted, expected, 3),
  accuracy_at_5: accuracyAtK(predicted, expected, 5),
  accuracy_at_10: accuracyAtK(predicted, expected, 10),
  mrr: meanReciprocalRank(predicted, expected),
```
P2: MRR is computed over the unbounded predicted list while all other metrics are scoped to top-10. This means a correct result at rank 50 yields a non-zero MRR, inconsistent with accuracy@10, precision@10, and recall@10. Pass the same top-10 slice for consistency.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At agent-evals/evals/retrieval-nq-008-find-the-files-that-implement-the-build/EVAL.ts, line 113:
<comment>MRR is computed over the unbounded predicted list while all other metrics are scoped to top-10. This means a correct result at rank 50 yields a non-zero MRR, inconsistent with accuracy@10, precision@10, and recall@10. Pass the same top-10 slice for consistency.</comment>
<file context>
@@ -0,0 +1,123 @@
+ accuracy_at_3: accuracyAtK(predicted, expected, 3),
+ accuracy_at_5: accuracyAtK(predicted, expected, 5),
+ accuracy_at_10: accuracyAtK(predicted, expected, 10),
+ mrr: meanReciprocalRank(predicted, expected),
+ precision: precision(predicted.slice(0, 10), expected),
+ recall: recall(predicted.slice(0, 10), expected),
</file context>
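Capping MRR at the same cutoff as the other metrics is a small change; a sketch of an MRR@K variant (the signature is assumed from the template, with K defaulting to 10 to match accuracy@10):

```typescript
// Sketch: MRR@K — reciprocal rank of the first relevant prediction,
// restricted to the top K so a hit at rank 50 scores 0, consistent
// with accuracy@10, precision@10, and recall@10.
function meanReciprocalRank(
  predicted: string[],
  expected: Set<string>,
  k = 10,
): number {
  const topK = predicted.slice(0, k)
  for (let i = 0; i < topK.length; i++) {
    if (expected.has(topK[i])) return 1 / (i + 1)
  }
  return 0
}
```

Since the call sites pass only two arguments, the default K keeps the template's `mrr:` field working unchanged.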
```typescript
  accuracy_at_3: accuracyAtK(predicted, expected, 3),
  accuracy_at_5: accuracyAtK(predicted, expected, 5),
  accuracy_at_10: accuracyAtK(predicted, expected, 10),
  mrr: meanReciprocalRank(predicted, expected),
```
P2: MRR is computed over the full predicted list while all other metrics are capped at K=10, creating an inconsistency. If the agent lists a correct file beyond position 10, MRR will be nonzero but accuracy@10, precision@10, and recall@10 will all be zero, making cross-metric comparisons misleading. Cap the MRR computation at 10 for consistency (MRR@10).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At agent-evals/evals/retrieval-nq-002-find-the-files-responsible-for-client-si/EVAL.ts, line 113:
<comment>MRR is computed over the full predicted list while all other metrics are capped at K=10, creating an inconsistency. If the agent lists a correct file beyond position 10, MRR will be nonzero but accuracy@10, precision@10, and recall@10 will all be zero, making cross-metric comparisons misleading. Cap the MRR computation at 10 for consistency (MRR@10).</comment>
<file context>
@@ -0,0 +1,123 @@
+ accuracy_at_3: accuracyAtK(predicted, expected, 3),
+ accuracy_at_5: accuracyAtK(predicted, expected, 5),
+ accuracy_at_10: accuracyAtK(predicted, expected, 10),
+ mrr: meanReciprocalRank(predicted, expected),
+ precision: precision(predicted.slice(0, 10), expected),
+ recall: recall(predicted.slice(0, 10), expected),
</file context>
```typescript
  return hits / predicted.length
}

function recall(predicted: string[], expected: Set<string>): number {
```
P2: Precision and recall can produce invalid values (recall > 1.0) when predicted paths contain duplicates after normalization. For example, if the agent outputs both ./foo.ts and foo.ts, both normalize to foo.ts and are counted as separate hits. Deduplicate normalized predictions before computing hits.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At agent-evals/evals/retrieval-nq-006-find-the-files-that-implement-the-web-re/EVAL.ts, line 68:
<comment>Precision and recall can produce invalid values (recall > 1.0) when predicted paths contain duplicates after normalization. For example, if the agent outputs both `./foo.ts` and `foo.ts`, both normalize to `foo.ts` and are counted as separate hits. Deduplicate normalized predictions before computing hits.</comment>
<file context>
@@ -0,0 +1,123 @@
+ return hits / predicted.length
+}
+
+function recall(predicted: string[], expected: Set<string>): number {
+ if (expected.size === 0)
+ return 0
</file context>
```typescript
  return hits / predicted.length
}

function recall(predicted: string[], expected: Set<string>): number {
```
P2: Bug: recall (and precision) count duplicate predictions, allowing recall to exceed 1.0. Normalize and deduplicate predicted before counting hits to ensure correct metric computation.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At agent-evals/evals/retrieval-nq-004-find-the-files-that-handle-the-webpack-b/EVAL.ts, line 68:
<comment>Bug: `recall` (and `precision`) count duplicate predictions, allowing `recall` to exceed 1.0. Normalize and deduplicate `predicted` before counting hits to ensure correct metric computation.</comment>
<file context>
@@ -0,0 +1,123 @@
+ return hits / predicted.length
+}
+
+function recall(predicted: string[], expected: Set<string>): number {
+ if (expected.size === 0)
+ return 0
</file context>
```typescript
  return hits / predicted.length
}

function recall(predicted: string[], expected: Set<string>): number {
```
P2: Recall (and precision) can produce invalid values > 1.0 when predicted paths contain duplicates after normalization (e.g., ./a.ts and a.ts both normalize to a.ts). Deduplicate normalized predicted paths before computing hit counts.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At agent-evals/evals/retrieval-nq-003-find-the-files-that-implement-server-sid/EVAL.ts, line 68:
<comment>Recall (and precision) can produce invalid values > 1.0 when predicted paths contain duplicates after normalization (e.g., `./a.ts` and `a.ts` both normalize to `a.ts`). Deduplicate normalized predicted paths before computing hit counts.</comment>
<file context>
@@ -0,0 +1,123 @@
+ return hits / predicted.length
+}
+
+function recall(predicted: string[], expected: Set<string>): number {
+ if (expected.size === 0)
+ return 0
</file context>
```typescript
}

function normalizePath(p: string): string {
  return p.replace(/^\.?\//, '').replace(/\/+/g, '/')
```
P2: normalizePath only strips a single leading ./ or / — paths like ././foo normalize to ./foo instead of foo, and embedded ./ segments (e.g., a/./b) are not resolved. This can cause false metric mismatches. Consider using a more robust normalization, e.g., replacing all ./ segments.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At agent-evals/evals/retrieval-nq-005-find-the-files-responsible-for-error-dis/EVAL.ts, line 44:
<comment>`normalizePath` only strips a single leading `./` or `/` — paths like `././foo` normalize to `./foo` instead of `foo`, and embedded `./` segments (e.g., `a/./b`) are not resolved. This can cause false metric mismatches. Consider using a more robust normalization, e.g., replacing all `./` segments.</comment>
<file context>
@@ -0,0 +1,123 @@
+}
+
+function normalizePath(p: string): string {
+ return p.replace(/^\.?\//, '').replace(/\/+/g, '/')
+}
+
</file context>
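A sturdier normalization can lean on Node's `path.posix` instead of hand-rolled regexes; a sketch:

```typescript
import { posix } from 'node:path'

// Sketch: posix.normalize collapses `././foo`, `a/./b`, and `a//b`
// in one pass; then strip any leading slash so absolute-style and
// relative-style predictions compare equal.
function normalizePath(p: string): string {
  return posix.normalize(p).replace(/^\/+/, '')
}
```

Using `posix` explicitly (rather than the platform-dependent default export) keeps the comparison stable if evals ever run on Windows.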
Summary
- Adapts the `queries.json` approach to measure whether agents can correctly identify relevant files in the Next.js codebase
- `nextjs-rpg.json`
- Baseline (`cc-retrieval`) and RPG-enhanced (`cc-rpg-retrieval`) experiment configs for A/B comparison

What's included
- `fixtures/nextjs-queries.json` (20 queries, 3 difficulty levels)
- `evals/_templates/retrieval/{EVAL.ts,PROMPT.md,package.json}`
- `evals/retrieval-nq-*` directories (one per query)
- `scripts/generate-retrieval-evals.ts`
- `experiments/cc-retrieval.ts`, `experiments/cc-rpg-retrieval.ts`
- `scripts/aggregate-retrieval-metrics.ts`

Metrics computed
Test plan
- `npx tsx scripts/generate-retrieval-evals.ts` produces 20 directories
- `npx tsc --noEmit` passes for experiment configs
- `bun run lint:fix` passes
- `npx tsx scripts/aggregate-retrieval-metrics.ts` handles empty results gracefully
- `npx @pleaseai/agent-eval cc-retrieval` runs baseline
- `npx @pleaseai/agent-eval cc-rpg-retrieval` runs RPG-enhanced

Summary by cubic
Adds a retrieval-accuracy evaluation suite to measure whether agents can correctly find relevant Next.js source files. Includes a 20-query dataset, per-query evals, and scripts to generate and aggregate metrics.
New Features
Bug Fixes
Written for commit 25b2abb. Summary will update on new commits.