Test AI targets on real repo tasks and measure what actually works.
- Local-first — runs on your machine, no cloud accounts or API keys for eval infrastructure
- Repo-backed workspaces — reuse real repos, setup scripts, and existing harnesses instead of rebuilding synthetic tasks
- Portable artifacts — results, traces, and reports are saved in a durable format other tools can consume
- Version-controlled — evals, judges, and results all live in Git
- Hybrid graders — deterministic code checks + LLM-based subjective scoring
- CI/CD native — exit codes, JSONL output, threshold flags for pipeline gating
- Any target — run against agents, model providers, gateways, replay targets, CLI wrappers, transcript providers, and future app or service wrappers
- Eval suite / imports / tests are the task corpus: the prompts, cases, datasets, and imported benchmarks you want to evaluate.
- Category is derived from where the eval lives, such as folder path and file name. Use paths to organize the corpus instead of repeating category labels in every eval.
- Workspace / fixtures / graders are task-owned context: repos, setup scripts, files, fixtures, isolation, deterministic checks, and LLM grading prompts.
- Target is the system under test: an agent, provider, gateway, replay target, CLI wrapper, transcript provider, or future app/service wrapper. Each eval selects one target, either by label from targets.yaml or with an eval-local target object.
- Tags are run/result grouping labels. tags.experiment is the default experiment namespace, such as with-skills or without-skills; keep suite/category and target/model names out of that tag.
- Evaluate options configure runner-level behavior such as repeat policy, optional timeouts, and max_concurrency under evaluate_options.
- Default test configures inherited per-test defaults such as score threshold.
- Run is one concrete execution of a tagged eval against a resolved target that writes portable artifacts for readers such as Dashboard, compare, and trend.
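The path-based category rule above can be pictured with a small sketch. This is only an illustration, not AgentV's actual derivation: the helper name `categoryFromPath`, the default `evals` root, and the extension-stripping rules are all assumptions.

```typescript
// Hypothetical helper: AgentV's real category derivation may differ.
function categoryFromPath(evalPath: string, root = "evals"): string {
  // Normalize Windows separators and drop empty or "." segments.
  const segments = evalPath
    .replace(/\\/g, "/")
    .split("/")
    .filter((s) => s !== "" && s !== ".");

  // Keep only what follows the corpus root, if the root is present.
  const rootIndex = segments.indexOf(root);
  const relative = rootIndex >= 0 ? segments.slice(rootIndex + 1) : segments;

  // Strip the extension from the file name, e.g. "b.eval.yaml" -> "b".
  const fileName = relative[relative.length - 1] ?? "";
  const baseName = fileName.replace(/(\.eval)?\.(ya?ml|ts)$/, "");

  return [...relative.slice(0, -1), baseName].filter((s) => s !== "").join("/");
}

// Folder path and file name together form the category.
const exampleCategory = categoryFromPath("evals/codegen/python/fizzbuzz.yaml");
```

Under these assumptions, moving an eval file to another folder changes its category without touching the eval itself, which is the point of organizing by path.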
1. Install and initialize:
npm install -g agentv
agentv init

2. Configure targets in .agentv/targets.yaml: point to the system under test, such as an agent, provider, gateway, replay source, or CLI wrapper. Provider-specific budgets belong here:
targets:
  - label: copilot-sdk
    provider: anthropic
    model: claude-sonnet-4.6

3. Create an eval in evals/:
description: Code generation quality
tags:
  experiment: with-skills
target: copilot-sdk
evaluate_options:
  repeat:
    count: 3
    strategy: pass_any
    early_exit: false
  max_concurrency: 3
default_test:
  threshold: 0.8
workspace:
  scope: attempt
  repos:
    - path: ./fixture
      repo: EntityProcess/agentv-contract-fixture
      commit: 21a34daed7ebcfe36cbed053607622a55e5e94cb
tests:
  - id: fizzbuzz
    input: Write FizzBuzz in Python
    assert:
      - type: contains
        value: "fizz"
      - Implements correct FizzBuzz logic for multiples of 3, 5, and 15
      - type: script
        command: ["python3", "./validators/check_syntax.py"]
      - type: llm-rubric
        value:
          - outcome: Solution is simple and idiomatic Python
            weight: 0.5
          - outcome: Handles the 3, 5, and 15 branches correctly
            weight: 1.5

Plain assertion strings are short-form rubric criteria: AgentV groups them into
llm-rubric and writes each criterion to grading.json.assertion_results for the
Dashboard. Use explicit type: llm-rubric when you need weights, required flags, or
score_ranges; use string value for promptfoo-compatible free-form rubric
checks; use type: llm-grader only when you need a custom grader prompt,
grader target, or preprocessing. Executable graders use type: script.
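To make the role of rubric weights concrete, here is a minimal sketch of one way weighted criteria could be combined into a single score. The source does not say how AgentV aggregates criteria, so the weighted mean, the default weight of 1, and the names `RubricCriterion`, `CriterionResult`, and `weightedRubricScore` are all assumptions made for illustration.

```typescript
// Illustrative only: AgentV's actual aggregation is not specified here.
interface RubricCriterion {
  outcome: string;
  weight?: number; // assumed to default to 1 when omitted
}

interface CriterionResult {
  criterion: RubricCriterion;
  score: number; // grader verdict for this criterion, expected in [0, 1]
}

function weightedRubricScore(results: CriterionResult[]): number {
  let weightSum = 0;
  let weightedTotal = 0;
  for (const { criterion, score } of results) {
    const weight = criterion.weight ?? 1;
    // Clamp each score so one out-of-range verdict cannot skew the mean.
    const clamped = Math.min(1, Math.max(0, score));
    weightSum += weight;
    weightedTotal += weight * clamped;
  }
  return weightSum === 0 ? 0 : weightedTotal / weightSum;
}

// The two weighted criteria from the eval above, with hypothetical verdicts:
// the style criterion passes, the correctness criterion fails.
const exampleRubricScore = weightedRubricScore([
  { criterion: { outcome: "Solution is simple and idiomatic Python", weight: 0.5 }, score: 1 },
  { criterion: { outcome: "Handles the 3, 5, and 15 branches correctly", weight: 1.5 }, score: 0 },
]);
```

With a weighted mean, failing the heavier correctness criterion pulls the score to 0.25, well under a 0.8 threshold, even though the lighter style criterion passed.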
The target can be an eval-local object when this eval needs target settings of its own:
description: Code generation quality with Copilot target settings
tags:
  experiment: with-skills
target:
  extends: copilot-sdk
  model: claude-sonnet-4.6
evaluate_options:
  repeat:
    count: 2
    strategy: pass_any
default_test:
  threshold: 0.85
tests:
  - id: fizzbuzz
    input: Write FizzBuzz in Python

target: copilot-sdk resolves the target label from .agentv/targets.yaml or targets.yaml and uses its default provider, model, hooks, and provider settings. The object form above starts from copilot-sdk, then applies the eval-local fields for this eval. If extends is omitted, the object defines the full target inline and must include enough provider configuration to run. AgentV records the resolved target information in run artifacts so results can be audited and replayed. The tags.experiment label stays with-skills because the condition is unchanged; the model/provider variation belongs to the resolved target metadata.
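The resolution rules described above (label lookup, extends plus eval-local overrides, or a fully inline object) can be sketched roughly as follows. This is an assumption-laden illustration rather than AgentV's resolver: the `TargetConfig` shape, the shallow merge, the provider check for inline targets, and the `example-model` value are all hypothetical.

```typescript
// Hypothetical shape; real target configs carry more fields (hooks, budgets, ...).
type TargetConfig = {
  label?: string;
  extends?: string;
  provider?: string;
  model?: string;
  [key: string]: unknown;
};

function resolveTarget(target: string | TargetConfig, registry: TargetConfig[]): TargetConfig {
  const findByLabel = (label: string): TargetConfig => {
    const found = registry.filter((t) => t.label === label)[0];
    if (!found) throw new Error(`Unknown target label: ${label}`);
    return found;
  };

  // String form: use the registry entry as-is.
  if (typeof target === "string") return { ...findByLabel(target) };

  const { extends: baseLabel, ...overrides } = target;
  if (baseLabel === undefined) {
    // Inline form: the object must be runnable on its own (assumed check).
    if (!overrides.provider) throw new Error("Inline target needs a provider");
    return overrides;
  }

  // Object form with extends: start from the base, then apply eval-local fields.
  // A shallow merge is an assumption; nested settings may merge differently.
  return { ...findByLabel(baseLabel), ...overrides };
}

const exampleRegistry: TargetConfig[] = [
  { label: "copilot-sdk", provider: "anthropic", model: "claude-sonnet-4.6" },
];
const exampleResolved = resolveTarget(
  { extends: "copilot-sdk", model: "example-model" }, // "example-model" is a placeholder
  exampleRegistry,
);
```

In this sketch the resolved object keeps the base provider and label while the eval-local model wins, which matches the idea that the variation lives in target metadata and not in tags.experiment.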
Use default_test.threshold for the inherited per-test pass cutoff. default_test can also point at a shared file, matching promptfoo's external defaults pattern:
default_test: file://{{ env.AGENTV_REPO_ROOT }}/.agentv/default-test.yaml

AgentV makes AGENTV_REPO_ROOT available during eval/config interpolation. Projects that prefer a short name can define their own reference in .agentv/config.yaml; global-default below is just an example key:
refs:
  global-default: file://{{ env.AGENTV_REPO_ROOT }}/.agentv/default-test.yaml

Then eval files in that project can use default_test: ref://global-default.
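As a rough mental model of how a ref:// value and the {{ env.NAME }} placeholder might expand, consider the sketch below. It is not AgentV's interpolation code: the function names, the error handling, and the `/work/repo` path are assumptions chosen for the example.

```typescript
// Hypothetical helpers; AgentV's real interpolation may support more syntax.
type Env = Record<string, string | undefined>;

function interpolateEnv(template: string, env: Env): string {
  // Replace each {{ env.NAME }} placeholder with the matching variable.
  return template.replace(
    /\{\{\s*env\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g,
    (_match, name: string) => {
      const value = env[name];
      if (value === undefined) throw new Error(`Missing environment variable: ${name}`);
      return value;
    },
  );
}

function resolveReference(value: string, refs: Record<string, string>, env: Env): string {
  const prefix = "ref://";
  const isRef = value.slice(0, prefix.length) === prefix;
  // Look up ref://<key> in the project's refs; other values pass straight through.
  const target = isRef ? refs[value.slice(prefix.length)] : value;
  if (target === undefined) throw new Error(`Unknown reference: ${value}`);
  return interpolateEnv(target, env);
}

const exampleRefs = {
  "global-default": "file://{{ env.AGENTV_REPO_ROOT }}/.agentv/default-test.yaml",
};
const exampleDefaultTest = resolveReference(
  "ref://global-default",
  exampleRefs,
  { AGENTV_REPO_ROOT: "/work/repo" }, // placeholder checkout path
);
```

Under these assumptions both spellings, the direct file:// value and the ref://global-default alias, end up pointing at the same shared defaults file.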
4. Run it:
agentv eval evals/my-eval.yaml

5. Compare two runs (pass two index.jsonl manifests, e.g. before and after a change):
agentv compare .agentv/results/<baseline-run-id>/index.jsonl .agentv/results/<candidate-run-id>/index.jsonl

Each run writes a portable bundle directly under .agentv/results/<run_id>/. In this example, tags.experiment: with-skills names the condition being measured and target: copilot-sdk selects the system under test from targets.yaml; both are recorded as metadata, not path segments. The root index.jsonl manifest is the portable row index used by scripts, CI, and agentv compare; per-case sidecars include the resolved eval and target configuration used for the run.
agentv eval evals/my-eval.yaml
cat .agentv/results/<run_id>/index.jsonl

Run bundle layout:
.agentv/results/
├── 2026-06-30T08-30-00-000Z/ # <run_id> — one committed run bundle
│ ├── index.jsonl # row index for scripts/CI and `agentv compare`
│ ├── summary.json # run rollup: metadata, pass rate, counts, cost
│ └── fizzbuzz--a1b2c3d4/ # <result_dir> for one test/target row
│ ├── summary.json # optional per-case rollup across attempts
│ ├── test/ # generated test bundle: frozen inputs for reproducibility
│ │ ├── EVAL.yaml # resolved eval spec
│ │ ├── targets.yaml # resolved target config
│ │ └── graders/ # grader files used
│ └── attempt-1/ # one materialized attempt
│ ├── result.json # compact attempt manifest
│ ├── grading.json # assertion_results and grader evidence
│ ├── metrics.json # tool calls, transcript stats, behavior metrics
│ ├── timing.json # duration, token usage, cost
│ ├── transcript.json # normalized agent transcript
│ ├── transcript-raw.jsonl # raw agent output (debugging)
│ └── outputs/ # captured stdout and grader outputs
├── .indexes/ # reserved local/rebuildable indexes
└── .cache/ # reserved local cache
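Because index.jsonl is a line-oriented row index, a script can gate on it without any AgentV code. The sketch below shows the general shape of such a check; the row fields `test_id` and `score` are assumptions, since the manifest schema is not spelled out here, so inspect a real index.jsonl before relying on them.

```typescript
// Assumed row shape: the real index.jsonl schema may use different field names.
interface IndexRow {
  test_id: string;
  score: number;
}

// Parse JSONL text: one JSON object per non-empty line.
function parseJsonl(text: string): IndexRow[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "")
    .map((line) => JSON.parse(line) as IndexRow);
}

// Fraction of rows whose score meets the per-test threshold.
function passRate(rows: IndexRow[], threshold: number): number {
  if (rows.length === 0) return 0;
  const passed = rows.filter((row) => row.score >= threshold).length;
  return passed / rows.length;
}

// Two hypothetical rows standing in for a real manifest.
const sampleManifest = [
  '{"test_id":"fizzbuzz","score":0.9}',
  '{"test_id":"other-case","score":0.5}',
].join("\n");
const sampleRate = passRate(parseJsonl(sampleManifest), 0.8);
```

In a pipeline, the manifest text would come from .agentv/results/<run_id>/index.jsonl, and the script would exit non-zero when the rate falls below whatever level the project requires.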
Use evaluate() when your application owns the run:
import { evaluate } from '@agentv/sdk';

const { results, summary } = await evaluate({
  experiment: 'with-skills',
  task: async (input) => runMyAppTarget(input),
  threshold: 0.8,
  tests: [
    {
      id: 'fizzbuzz',
      input: 'Write FizzBuzz in Python',
      assert: [
        { type: 'contains', value: 'fizz' },
        'Implements correct FizzBuzz logic for multiples of 3, 5, and 15',
        { type: 'script', command: ['python3', './validators/check_syntax.py'] },
        { type: 'llm-rubric', value: ['Solution is simple and idiomatic Python'] },
      ],
    },
  ],
});

console.log(`${summary.passed}/${summary.total} passed`);

Use defineEval() when you want AgentV to run the TypeScript eval file:
import { defineEval } from '@agentv/sdk';

export default defineEval({
  description: 'Code generation quality',
  tags: { experiment: 'with-skills' },
  target: {
    extends: 'copilot-sdk',
    model: 'claude-sonnet-4.6',
  },
  repeat: {
    count: 3,
    strategy: 'pass_any',
    earlyExit: false,
  },
  threshold: 0.8,
  workspace: {
    scope: 'attempt',
    repos: [
      {
        path: './fixture',
        repo: 'EntityProcess/agentv-contract-fixture',
        commit: '21a34daed7ebcfe36cbed053607622a55e5e94cb',
      },
    ],
  },
  tests: [
    {
      id: 'fizzbuzz',
      input: 'Write FizzBuzz in Python',
      assert: [
        { type: 'contains', value: 'fizz' },
        'Implements correct FizzBuzz logic for multiples of 3, 5, and 15',
        { type: 'script', command: ['python3', './validators/check_syntax.py'] },
        { type: 'llm-rubric', value: ['Solution is simple and idiomatic Python'] },
      ],
    },
  ],
});

Full docs at agentv.dev/docs.
- Eval files — format and structure
- Custom graders — script graders in any language
- Rubrics — structured criteria scoring
- Targets — configure agents and providers
- Compare results — A/B testing and regression detection
- Ecosystem — how AgentV fits with Agent Control and Langfuse
git clone https://github.com/EntityProcess/agentv.git
cd agentv
bun install && bun run build
bun test

See AGENTS.md for development guidelines.
To simulate a one-command production deployment of AgentV Dashboard with the AgentV examples project and a remote results repository:
AGENTV_RESULTS_REPO=EntityProcess/agentv-evalresults \
  scripts/setup-dashboard-deployment.sh

The script clones AgentV examples into ~/agentv-dashboard, clones the results
repo, writes the Dashboard project registry under the $AGENTV_HOME config
pair, builds the Docker image, and starts Dashboard at http://localhost:3117.
MIT