Migrate example eval suites to suite.yaml #1618

Merged: christso merged 6 commits into main from fix/suite-yaml-examples on Jul 3, 2026
christso commented Jul 3, 2026

Collaborator

Summary

  • Rename committed example suites from dataset.eval.yaml / generic eval.yaml to suite.yaml, with baseline files renamed to suite.baseline.jsonl and external row files renamed to cases.* where applicable.
  • Migrate example authoring from test-level criteria to explicit assert entries, normalize file-backed prompt references to file://..., and update README/docs references.
  • Add consistent suite.yaml discovery/validation support and baseline checking coverage; migrate example target files from the removed name field to label.
  • Add committed local OpenAI-compatible dogfood targets (pi-cli-openai, codex-sdk-openai, copilot-sdk-openai) and remove hard-deprecated cc-mirror target references.
  • Update .agentv/config.yaml to discover current suite names and auto-publish completed CLI/Dashboard result bundles to agentv/results/v1; remove legacy/default trace and OTel sidecar outputs from repo defaults.
  • Align the project-local results: config with the project entries in the global ~/.agentv/config.yaml: an authored results config uses repo/path/branch/auto_push, validation now rejects a redundant results.mode, and runtime loading still tolerates the legacy mode: github for compatibility.
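
The split between strict validation and tolerant runtime loading in the last bullet can be sketched roughly as follows. This is an illustration only, not AgentV's actual code: the function names, the error messages, the assumption that repo is required, and the example repo value are all hypothetical. Only the field names repo, path, branch, auto_push and the legacy mode: github come from the description above.

```typescript
// Hypothetical shape; only the field names come from the PR description.
interface ResultsConfig {
  repo: string;
  path?: string;
  branch?: string;
  auto_push?: boolean;
}

type RawResults = Record<string, unknown>;

// Strict authoring-time check: a `mode` key is reported as redundant.
function validateResults(raw: RawResults): string[] {
  const errors: string[] = [];
  if ("mode" in raw) {
    errors.push("results.mode is redundant; remove it");
  }
  const repo = raw["repo"];
  if (typeof repo !== "string" || repo.length === 0) {
    // Assumption: repo is required. The PR does not say so explicitly.
    errors.push("results.repo must be a non-empty string");
  }
  return errors;
}

// Tolerant runtime loading: accept legacy `mode: github`, then drop it.
function loadResults(raw: RawResults): ResultsConfig {
  const { mode, ...rest } = raw;
  if (mode !== undefined && mode !== "github") {
    // Assumption: only the legacy value "github" is tolerated.
    throw new Error(`unsupported results.mode: ${String(mode)}`);
  }
  const repo = rest["repo"];
  if (typeof repo !== "string") {
    throw new Error("results.repo is required");
  }
  const path = rest["path"];
  const branch = rest["branch"];
  const autoPush = rest["auto_push"];
  return {
    repo,
    path: typeof path === "string" ? path : undefined,
    branch: typeof branch === "string" ? branch : undefined,
    auto_push: typeof autoPush === "boolean" ? autoPush : undefined,
  };
}

// A legacy config fails validation but still loads at runtime.
const legacyExample: RawResults = {
  mode: "github",
  repo: "example-org/example-results", // hypothetical repo name
  branch: "agentv/results/v1",
  auto_push: true,
};
console.log(validateResults(legacyExample));
console.log(loadResults(legacyExample));
```

Keeping the two paths separate means newly authored configs are pushed toward the current shape while existing configs keep working.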

Verification

  • bun run build
  • bun test apps/cli/test/commands/eval/shared.test.ts packages/core/test/evaluation/validation/file-type.test.ts packages/core/test/evaluation/category.test.ts packages/core/test/evaluation/providers/targets.test.ts packages/core/test/evaluation/validation/targets-validator.test.ts
  • bun test packages/core/test/evaluation/loaders/config-loader.test.ts packages/core/test/evaluation/validation/config-validator.test.ts packages/core/test/evaluation/results-repo.test.ts apps/cli/test/commands/results/remote-auto-export.test.ts (169 pass)
  • bun apps/cli/src/cli.ts validate examples .agentv/targets.yaml apps/cli/src/templates/.agentv/targets.yaml (109 valid, 0 invalid)
  • bun apps/cli/src/cli.ts validate .agentv/config.yaml (1 valid, 0 invalid)
  • git diff --check

Dogfood

  • Live local OpenAI-compatible endpoint model: gpt-5.3-codex-spark via LOCAL_OPENAI_PROXY_BASE_URL=http://127.0.0.1:10531/v1.
  • High-threshold live dogfood on ignored temporary suite .agentv/results/suite-yaml-live-dogfood-pass.yaml with --threshold 1:
    • pi-cli-openai: 100% PASS, bundle .agentv/results/suite-yaml-live-pass-pi-cli-openai
    • codex-sdk-openai: 100% PASS, bundle .agentv/results/suite-yaml-live-pass-codex-sdk-openai
    • copilot-sdk-openai: 100% PASS after using default chat format, bundle .agentv/results/suite-yaml-live-pass-copilot-sdk-openai-chat
  • Results-branch publish proof: pi-cli-openai on examples/features/readme-quickstart/evals/my-eval.eval.yaml with --threshold 1 --results-require-push and no explicit --output generated timestamped run .agentv/results/2026-07-03T09-20-51-754Z, passed 100%, and pushed to agentv/results/v1:2026-07-03T09-20-51-754Z.
  • Explicit Pi + LLM grader proof: run 2026-07-03T09-20-51-754Z used agent target pi-cli-openai; its manifest contains two live llm-rubric score entries with target=local-openai-grader, each scoring 1.0, with one and two rubric assertions respectively.
  • Broad example execution smoke included renamed suites for suite-level input, basic JSONL/cases, external datasets, local CLI, batch CLI, tool trajectory, and workspace artifacts. batch-cli intentionally keeps its missing-output case per README; threshold-0 runs are recorded as smoke only, not correctness dogfood.
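
The timestamped run ID quoted above (2026-07-03T09-20-51-754Z) looks like an ISO-8601 UTC timestamp with the colons and the period replaced by hyphens. Assuming that is how it is derived (the PR does not say), a minimal sketch of producing such an ID and the branch:run reference form follows; both helper names are hypothetical.

```typescript
// Assumption: the run ID is the ISO-8601 UTC timestamp with ":" and "."
// replaced by "-", which keeps it safe to use as a directory name.
function runIdFromDate(date: Date): string {
  return date.toISOString().replace(/[:.]/g, "-");
}

// Hypothetical helper: builds the "<branch>:<runId>" form quoted in the PR.
function publishedRef(branch: string, runId: string): string {
  return `${branch}:${runId}`;
}

// Months are zero-based in Date.UTC, so 6 is July.
const exampleRunId = runIdFromDate(new Date(Date.UTC(2026, 6, 3, 9, 20, 51, 754)));
console.log(exampleRunId); // 2026-07-03T09-20-51-754Z
console.log(publishedRef("agentv/results/v1", exampleRunId));
// agentv/results/v1:2026-07-03T09-20-51-754Z
```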

Notes

  • Private run bundles remain ignored under .agentv/results/ and are not committed.
  • Published run bundles go to the agentv/results/v1 branch through the existing Git-backed results publishing path.
  • The verification guide now states that threshold-0 execution is smoke coverage, not dogfood evidence.
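
The distinction between smoke coverage and dogfood evidence can be made concrete with a small classifier. This is a sketch under stated assumptions, not the verification guide's wording: it assumes the threshold and the pass rate are fractions in [0, 1], that a run passes when its pass rate meets the threshold, and that only runs at or above a chosen minimum threshold (here 1, matching the --threshold 1 runs above) count as dogfood. The names classifyRun and minDogfoodThreshold are hypothetical.

```typescript
type Evidence = "dogfood" | "smoke" | "failed";

// Assumption: a run passes when passRate >= threshold, both in [0, 1].
function classifyRun(
  threshold: number,
  passRate: number,
  minDogfoodThreshold = 1, // hypothetical cut-off for "high-threshold"
): Evidence {
  for (const value of [threshold, passRate, minDogfoodThreshold]) {
    if (!(value >= 0 && value <= 1)) {
      throw new RangeError(`expected a fraction in [0, 1], got ${value}`);
    }
  }
  if (passRate < threshold) {
    return "failed";
  }
  // A threshold-0 run can never fail, so it only shows that the suite executed.
  return threshold >= minDogfoodThreshold ? "dogfood" : "smoke";
}

console.log(classifyRun(1, 1)); // dogfood
console.log(classifyRun(0, 0.4)); // smoke
console.log(classifyRun(1, 0.9)); // failed
```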

Entire-Checkpoint: 4f8edb57e3d1

cloudflare-workers-and-pages Bot commented Jul 3, 2026


Deploying agentv with Cloudflare Pages

Latest commit: bdcd540
Status: ✅  Deploy successful!
Preview URL: https://edff7662.agentv.pages.dev
Branch Preview URL: https://fix-suite-yaml-examples.agentv.pages.dev


christso marked this pull request as ready for review July 3, 2026 11:56
christso merged commit f1b0c02 into main Jul 3, 2026
8 checks passed
christso deleted the fix/suite-yaml-examples branch July 3, 2026 11:56