eval: include structured results, run config, and summary in JSON output #2309

Merged
dgageot merged 5 commits into main from eval-structured-json-output on Apr 2, 2026
Conversation

@hamza-jeddad (Contributor)

What

The eval JSON output (<run-name>.json) now includes everything that was previously only in the log file:

  • Run metadata: name, timestamp, duration, config (agent, judge model, concurrency, evals dir), and aggregate summary
  • Per-session eval_result with overall pass/fail, human-readable successes/failures, error, cost, output tokens
  • Structured checks: size (actual vs expected), tool_calls (F1 score; see the sketch after this list), and relevance (per-criterion pass/fail with judge reasoning)
  • Judge reasons for ALL criteria — both passed and failed. The LLM judge was already generating reasons for passed criteria but the code was discarding them.
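A note on the tool_calls F1 mentioned above: the PR doesn't spell out the formula, but assuming it is the standard set-based harmonic mean of precision and recall over expected vs. actual tool calls, it would look roughly like this hypothetical helper (f1Score is an illustration, not the repository's actual code):

```go
package main

import "fmt"

// f1Score computes a set-based F1 over expected vs. actual tool-call
// names. Hypothetical illustration only: the real check may treat
// duplicates or call arguments differently.
func f1Score(expected, actual []string) float64 {
	want := make(map[string]bool)
	for _, name := range expected {
		want[name] = true
	}
	got := make(map[string]bool)
	for _, name := range actual {
		got[name] = true
	}
	matched := 0
	for name := range got {
		if want[name] {
			matched++
		}
	}
	if matched == 0 {
		return 0
	}
	precision := float64(matched) / float64(len(got))
	recall := float64(matched) / float64(len(want))
	return 2 * precision * recall / (precision + recall)
}

func main() {
	// Expected two calls; the agent made those two plus an extra one.
	fmt.Printf("%.2f\n", f1Score(
		[]string{"list_files", "read_file"},
		[]string{"list_files", "read_file", "run_shell"},
	)) // precision 2/3, recall 1.0 -> F1 0.80
}
```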

JSON output format change

The output changes from a bare session array to a RunOutput wrapper:

{
  "name": "happy-panda-1234",
  "timestamp": "...",
  "duration": "42s",
  "config": { "agent": "...", "judge_model": "...", "concurrency": 8, "evals_dir": "..." },
  "summary": { "total_evals": 5, ... },
  "sessions": [
    {
      "title": "LedgerLite Container",
      "evals": { ... },
      "eval_result": {
        "passed": true,
        "successes": ["relevance 5/5"],
        "checks": {
          "relevance": {
            "passed": true,
            "passed_count": 5,
            "total": 5,
            "results": [
              { "criterion": "Uses app.py as entrypoint", "passed": true, "reason": "..." },
              { "criterion": "Keeps sample data optional", "passed": false, "reason": "..." }
            ]
          }
        }
      },
      "messages": [...]
    }
  ]
}
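For reference, the Go types behind this shape probably look something like the sketch below. The type names follow the Files changed list further down, but the fields and json tags are inferred from the JSON keys above, so treat this as an approximation rather than the actual definitions:

```go
// Approximate sketch of the wrapper types. Field types are guesses;
// Summary stands in for whatever aggregate type backs the "summary"
// block, and session.Session is the existing type from pkg/session.
type RunOutput struct {
	Name      string            `json:"name"`
	Timestamp time.Time         `json:"timestamp"`
	Duration  string            `json:"duration"`
	Config    RunOutputConfig   `json:"config"`
	Summary   Summary           `json:"summary"`
	Sessions  []session.Session `json:"sessions"`
}

type RunOutputConfig struct {
	Agent       string `json:"agent"`
	JudgeModel  string `json:"judge_model"`
	Concurrency int    `json:"concurrency"`
	EvalsDir    string `json:"evals_dir"`
}
```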

Files changed

  • pkg/session/session.go — New types: EvalResult, EvalResultChecks, SizeCheck, ToolCallsCheck, RelevanceCheck, RelevanceCriterionResult
  • pkg/evaluation/types.go — RunOutput, RunOutputConfig, Config field on EvalRun
  • pkg/evaluation/judge.go — CheckRelevance now returns []RelevanceResult (all criteria with reasons) instead of (passed int, failed []RelevanceResult); see the sketch after this list
  • pkg/evaluation/save.go — populateEvalResult + updated SaveRunSessionsJSON
  • pkg/evaluation/eval.go — Updated CheckRelevance caller, set Config on EvalRun
  • Tests updated accordingly
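To make the CheckRelevance change concrete, a before/after sketch; only the result types are stated in the PR, so the receiver, parameters, and error return are guesses:

```go
// Before: a pass count plus only the failing criteria, so reasons
// for passed criteria were dropped on the floor.
//   func CheckRelevance(ctx context.Context, criteria []string) (int, []RelevanceResult, error)

// After: one entry per criterion, pass or fail, each carrying the
// judge's reason.
//   func CheckRelevance(ctx context.Context, criteria []string) ([]RelevanceResult, error)

type RelevanceResult struct {
	Criterion string
	Passed    bool // new field; previously implicit (returned results were all failures)
	Reason    string
}
```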

Notes

  • No changes to the session DB schema — eval_result only appears in the JSON output
  • This is a breaking change to the JSON output format (array → object wrapper)
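For consumers of the file, the breaking change amounts to one extra level of nesting when decoding; a minimal sketch, assuming the RunOutput type above and the standard encoding/json package:

```go
// Before: the file was a bare array.
//   var sessions []session.Session
//   err := json.Unmarshal(data, &sessions)

// After: decode the wrapper and read its Sessions field.
var out RunOutput
if err := json.Unmarshal(data, &out); err != nil {
	log.Fatalf("decode eval output: %v", err)
}
sessions := out.Sessions
```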

The eval JSON output now includes:
- Run metadata: name, timestamp, duration, config (agent, judge model,
  concurrency, evals dir), and aggregate summary
- Per-session eval_result with pass/fail, human-readable successes/failures,
  error, cost, and output tokens
- Structured checks: size (actual vs expected), tool_calls (F1 score),
  and relevance (per-criterion pass/fail with judge reasoning)
- Relevance results list ALL criteria with passed bool, not just failures

Output format changes from a bare session array to a RunOutput wrapper.
No changes to the session DB schema.

Assisted-By: docker-agent

The LLM judge already generates a reason for every criterion (pass and
fail), but CheckRelevance was discarding reasons for passed ones.

Changes:
- RelevanceResult now includes a Passed bool field
- CheckRelevance returns []RelevanceResult (all criteria) instead of
  (passed int, failed []RelevanceResult)
- populateEvalResult copies reasons for all criteria into the JSON output
- The JSON output now shows reason on every criterion, not just failures

Assisted-By: docker-agent
hamza-jeddad requested a review from a team as a code owner on April 2, 2026 at 13:37

docker-agent bot left a comment


Assessment: 🔴 CRITICAL

Found 1 critical data integrity issue in the relevance checking error handling.

When a criterion check errors (context cancelled, judge API failure),
the continue left results[i] as a zero-value RelevanceResult with empty
Criterion. Pre-populate criterion names before the loop so every entry
is well-formed regardless of errors.

Addresses review feedback from #2309.

Assisted-By: docker-agent
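The fix reads roughly like the sketch below; the loop shape and the checkCriterion helper are hypothetical, reconstructed from the commit message rather than copied from the diff:

```go
// Pre-populate criterion names so an errored check still leaves a
// well-formed entry. checkCriterion is a made-up helper standing in
// for the real per-criterion judge call.
results := make([]RelevanceResult, len(criteria))
for i, criterion := range criteria {
	results[i].Criterion = criterion // well-formed even if the check below errors
}
for i, criterion := range criteria {
	passed, reason, err := checkCriterion(ctx, criterion)
	if err != nil {
		continue // entry keeps its Criterion; Passed stays false
	}
	results[i].Passed = passed
	results[i].Reason = reason
}
```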
@hamza-jeddad (Contributor, Author)

/review


docker-agent bot left a comment


Assessment: 🟢 APPROVE

No issues found in the changed code. The PR properly refactors the eval JSON output format to include structured results, run configuration, and summary metadata. All changes are well-tested and error handling is appropriate.

Key changes reviewed:

  • RunOutput wrapper structure with metadata, config, and summary
  • CheckRelevance now returns results for all criteria (passed and failed) with judge reasoning
  • Proper updates to all call sites and comprehensive test coverage
  • Breaking change to JSON output format is documented in PR description

dgageot merged commit 429fed8 into main on Apr 2, 2026
8 checks passed