Add HTTP steps, retry, resume, failure classification, HTML reports, and CI setup to harness #3048

Open
Myestery wants to merge 51 commits into main from implement-missing-harness-issues

Conversation

@Myestery

Summary

  • HTTP behavioral check steps with status/body assertions for testing deployed endpoints
  • Retry logic for flaky agent interactions (retry: { attempts, delay })
  • --resume-from flag to skip steps before a given step ID, enabling re-runs from a specific point
  • Failure classification mapping error prefixes (BUILD_FAILED, DEPLOY_FAILED, etc.) to categories with actionable guidance
  • HTML summary report generation alongside summary.json with scenario details, failures, and matrix views
  • Golem state cleanup (golem deploy --reset) between scenarios with --no-cleanup and --workspace flags
  • CI workflow: Postgres 16 service container + Go/Python toolchain setup
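
The retry option in the list above can be sketched roughly as follows. This is a minimal illustration, not the harness implementation: `RetrySpec`, `withRetry`, and `sleep` are hypothetical names, and treating `delay` as seconds is an assumption.

```typescript
// Hypothetical shape mirroring the `retry: { attempts, delay }` option.
// Assumption: `delay` is in seconds; the real harness may use another unit.
interface RetrySpec {
  attempts: number;
  delay: number;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Runs `action` up to `retry.attempts` times (at least once), waiting
// `retry.delay` seconds between failed attempts. If every attempt fails,
// the error from the last attempt is rethrown.
async function withRetry<T>(
  action: (attempt: number) => Promise<T>,
  retry: RetrySpec,
): Promise<T> {
  const attempts = Math.max(1, retry.attempts);
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await action(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < attempts) {
        await sleep(retry.delay * 1000);
      }
    }
  }
  throw lastError;
}
```

A flaky agent interaction would then be wrapped as `withRetry(() => runStep(step), step.retry)`, where `runStep` stands in for whatever the executor actually calls.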

Closes

Test plan

  • Unit tests for HTTP steps, retry logic, resume, failure classification, and HTML report
  • Manual verification of HTTP endpoints against a deployed Golem DB app
  • Harness scenario run with --scenario golem-db-app produces HTML report
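
As an illustration of what the resume unit tests might exercise, here is a sketch of the `--resume-from` step filter. The `Step` type and `stepsToRun` function are hypothetical, and rejecting an unknown step ID with an error is an assumption rather than confirmed behavior.

```typescript
// Hypothetical minimal step shape; real steps carry an action as well.
interface Step {
  id: string;
}

// Returns the steps that should run: the step whose id equals `resumeFrom`
// and everything after it. Without `resumeFrom`, all steps run.
function stepsToRun<S extends Step>(steps: S[], resumeFrom?: string): S[] {
  if (resumeFrom === undefined) {
    return steps;
  }
  const index = steps.findIndex((step) => step.id === resumeFrom);
  if (index === -1) {
    // Assumption: an unknown id is treated as a usage error.
    throw new Error(`Unknown step id for --resume-from: ${resumeFrom}`);
  }
  return steps.slice(index);
}
```

Steps before the given ID are skipped, and the named step itself still runs.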

Myestery and others added 30 commits March 5, 2026 03:49
* Add agent testing harness for skill verification

Implements the skill testing harness (#2872-#2882, #2898) that validates
AI coding agents can discover, load, and follow Golem skill files to
produce correct build artifacts.

- CLI entrypoint with arg parsing and scenario filtering
- YAML scenario loader with Zod schema validation
- AgentDriver interface with Claude Code and Gemini (stub) drivers
- SkillWatcher with inotifywait (Linux), fswatch (macOS), atime and
  presence-based fallback detection
- Skill activation assertion engine (default, strict, allowedExtras)
- Build verification with golem.yaml directory discovery
- JSON report generation per scenario run
- Bootstrap scenario: golem-new-project-ts
- golem-new-project skill (SKILL.md)
- CI workflow for unit + integration tests on Ubuntu
- 21 unit tests (watcher, executor assertions, loader validation)

* Fix CI test glob pattern for Linux compatibility

Change **/*.test.js to *.test.js since /bin/sh (dash) on Linux
does not support globstar. All test files are in a flat directory
so ** is unnecessary.

* Fix CI: golem binary, health check, and exit code on failure

- Download golem v1.4.2 binary from golemcloud/golem releases
- Use golem server run with /healthcheck readiness loop
- Fix executor health check to use /healthcheck endpoint
- Exit with code 1 when any scenario fails
…oke, shell/sleep/trigger, create/delete agent, opencode driver, aggregated report

- Add assertion engine with exit_code, stdout, body, status, and result_json checks (#2895)
- Add scenario-level settings, prerequisites, step timeout, and continue_session (#2887)
- Add deploy verification with implicit build (#2889)
- Add invoke check with expect assertions (#2890)
- Add shell, sleep, and trigger step actions (#2893)
- Add create_agent and delete_agent step actions (#2894)
- Add OpenCode driver stub with `opencode run` (#2897)
- Add aggregated summary report with summary.json output (#2912)
- Add --approval-mode yolo to Gemini CLI to enable all tools (run_shell_command, activate_skill, etc.)
- Symlink skills to .gemini/skills/ so Gemini can discover them
- Watch all agent skill dirs (.claude, .gemini, .agents) for activation
- Fix macOS APFS relatime: reset atime before mtime so reads trigger updates
- Add fswatch -a (access) and -L (follow-links) flags for macOS
- Remove presence-check fallback, log full paths for detected skills
- Move skills dir default from golem/skills to skills/
- Add opencode (opencode-ai) to matrix agents
- Replace Gemini CLI placeholder with actual npm install
- Update path triggers from golem/skills/ to skills/
…nd GitHub summary to harness

- Template variable substitution ({{agent}}, {{language}}, {{workspace}}, {{scenario}}) in step fields
- Conditional step execution with only_if/skip_if on agent, language, and os
- --dry-run flag to validate scenarios and print step summaries without executing
- Graceful Ctrl+C handling with partial result writing via AbortController
- GitHub Actions job summary markdown output via GITHUB_STEP_SUMMARY
- Remove issue number references from comments
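
The template substitution described in this commit could look roughly like the sketch below. `substitute` and `TemplateVars` are hypothetical names, and leaving unknown placeholders untouched is an assumption; the harness may instead reject them.

```typescript
type TemplateVars = Record<string, string>;

// Replaces {{name}} placeholders with values from `vars`.
// Placeholders with no matching variable are left as-is (assumption).
function substitute(template: string, vars: TemplateVars): string {
  return template.replace(
    /\{\{\s*(\w+)\s*\}\}/g,
    (match: string, name: string) => {
      const value = vars[name];
      return value !== undefined ? value : match;
    },
  );
}

// Example: expanding a shell step command.
const expanded = substitute("cd {{workspace}} && {{agent}} {{unknown}}", {
  workspace: "/tmp/ws",
  agent: "claude",
});
// expanded === "cd /tmp/ws && claude {{unknown}}"
```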
Each driver now declares a skillDirs array instead of duplicating
the symlink loop. Removes ~60 lines of repeated code.
Adds CodexAgentDriver using codex exec with session resume support.
Makes --scenarios default to ./scenarios so --scenario can be used alone.
Adds version check fallback between golem and golem-cli binaries
and clarifies that golem should not be built from scratch.
- Pre-create ~/.gemini/ dir to prevent ENOENT on projects.json
- Use GEMINI_API_KEY secret directly
- Add codex agent to CI matrix with OPENAI_API_KEY
OpenCode expects GOOGLE_GENERATIVE_AI_API_KEY, not GEMINI_API_KEY.
Codex CLI requires explicit login rather than reading OPENAI_API_KEY
directly from the environment.
Copies harness and skills to /tmp/harness-run/ with a fresh git init
so agents cannot crawl up into the golem repo.
Restructure to group skill definitions and the testing harness
under a single top-level golem-skills/ directory. Update CI
workflow paths, .gitignore, and AGENTS.md accordingly.
Move issue tracking to PR description. Update OpenCode driver
comment to clarify session continuity status.
Add .refine() to StepSpecSchema ensuring exactly one action field
per step. Define StepSpec as a union type for better type narrowing.
Add negative tests for zero and multiple actions per step.
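
A dependency-free sketch of the kind of predicate that could be passed to Zod's `.refine()` for this check. The key list is an assumption assembled from action names mentioned in earlier commit messages; the real schema's field names may differ.

```typescript
// Assumed action keys; not the harness's authoritative list.
const ACTION_KEYS = [
  "shell",
  "sleep",
  "trigger",
  "invoke",
  "create_agent",
  "delete_agent",
  "http",
] as const;

// True when exactly one of the known action fields is set on the step.
function hasExactlyOneAction(step: Record<string, unknown>): boolean {
  const present = ACTION_KEYS.filter((key) => step[key] !== undefined);
  return present.length === 1;
}
```

The negative tests mentioned above would correspond to `hasExactlyOneAction({ id: "s1" })` (zero actions) and `hasExactlyOneAction({ id: "s1", shell: "ls", sleep: 1 })` (two actions), both returning `false`.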
Extract createDriver() function, add SUPPORTED_AGENTS and
SUPPORTED_LANGUAGES constants, wrap scenario loop in agent/language
matrix. Update report filenames to include agent-language prefix.
Show default timeout (300s) in help text.
Add file existence check via shell step and deploy verification
to demonstrate more harness capabilities.
Share the default timeout value between executor and run.ts
help text via an exported constant.
The scaffolded project has no components, so deploy produces an empty
diff that the CLI misreads as a concurrent modification error.
# Conflicts:
#	golem-skills/tests/harness/src/driver/base.ts
#	golem-skills/tests/harness/src/executor.ts
#	golem-skills/tests/harness/tests/abort.test.ts
#	golem-skills/tests/harness/tests/conditions.test.ts
#	golem-skills/tests/harness/tests/github-summary.test.ts
#	golem-skills/tests/harness/tests/loader.test.ts
#	golem-skills/tests/harness/tests/variables-integration.test.ts
#	golem-skills/tests/harness/tests/variables.test.ts
#	tests/harness/src/driver/claude.ts
#	tests/harness/src/driver/gemini.ts
#	tests/harness/src/driver/opencode.ts
#	tests/harness/src/run.ts
#	tests/harness/tests/assertions.test.ts
#	tests/harness/tests/executor.test.ts
#	tests/harness/tests/watcher.test.ts
…nd GitHub summary to harness (#2960)

- Template variable substitution ({{agent}}, {{language}}, {{workspace}}, {{scenario}}) in step fields
- Conditional step execution with only_if/skip_if on agent, language, and os
- --dry-run flag to validate scenarios and print step summaries without executing
- Graceful Ctrl+C handling with partial result writing via AbortController
- GitHub Actions job summary markdown output via GITHUB_STEP_SUMMARY
- Remove issue number references from comments
… HTML reports to harness

- HTTP behavioral check steps with status/body assertions (#2892)
- Retry logic for flaky agent interactions with attempts/delay (#2916)
- --resume-from flag to skip steps before a given step ID (#2915)
- Failure classification mapping error prefixes to categories with guidance (#2902)
- HTML summary report generation alongside summary.json (#2903)
- Golem state cleanup (deploy --reset) between scenarios (#2913)
- Postgres 16 service container with env vars (#2910)
- Go and Python setup steps for multi-language scenarios (#2909)
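
The failure classification item can be sketched as a prefix lookup. Only the `BUILD_FAILED` and `DEPLOY_FAILED` prefixes come from the PR summary; the category names, guidance strings, and the `classifyFailure` function are invented for illustration.

```typescript
interface FailureClass {
  prefix: string;
  category: string;
  guidance: string;
}

// Hypothetical table: categories and guidance text are placeholders.
const FAILURE_CLASSES: FailureClass[] = [
  {
    prefix: "BUILD_FAILED",
    category: "build",
    guidance: "Inspect the build output for compiler errors.",
  },
  {
    prefix: "DEPLOY_FAILED",
    category: "deploy",
    guidance: "Check that the Golem server is reachable.",
  },
];

const UNKNOWN_FAILURE: FailureClass = {
  prefix: "",
  category: "unknown",
  guidance: "Inspect the scenario log for details.",
};

// Maps an error message to the first class whose prefix it starts with,
// falling back to an "unknown" class when nothing matches.
function classifyFailure(message: string): FailureClass {
  for (const cls of FAILURE_CLASSES) {
    if (message.startsWith(cls.prefix)) {
      return cls;
    }
  }
  return UNKNOWN_FAILURE;
}
```

Keeping the table as data means a report generator can render the category and guidance without knowing about individual prefixes.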
@github-actions
github-actions bot commented Mar 20, 2026

✅ All contributors have signed the CLA.
Posted by the CLA Assistant Lite bot.

Base automatically changed from skills to main March 20, 2026 20:32
…ness-issues

# Conflicts:
#	.github/workflows/skills-test.yaml
#	golem-skills/tests/harness/src/executor.ts
#	golem-skills/tests/harness/src/html-report.ts
#	golem-skills/tests/harness/src/run.ts
#	golem-skills/tests/harness/tests/loader.test.ts
@Myestery Myestery requested a review from vigoo March 23, 2026 11:45
Myestery added 11 commits March 23, 2026 19:12
Download golem-cli and golem-server from build-golem-binaries workflow
artifacts with search_artifacts to find runs that produced binaries.
Adds GOLEM_PATH=$GITHUB_WORKSPACE to the Run Skill Tests env block
so golem new/build use local SDKs from the repo checkout instead of
published registry versions. Also removes stale html-report test file.
- New skill: golem-db-app — teaches agents to build PostgreSQL-backed
  Golem apps with HTTP endpoints using golem:rdbms/postgres
- New scenario: golem-db-app-ts — creates app, deploys, verifies HTTP
  POST/GET endpoints, and checks DB rows via psql
- Tested with codex (pass) and gemini (build/deploy pass)
headers?: Record<string, string>;
};

export type StepSpec =

This is not really a comment on this PR, but I still don't understand this type.
In #2932 (comment) I pointed out that it does not make sense to have all these different nullable fields in one record, as each step must contain exactly one of them.
I did not properly review it when you changed it in that PR, so I am commenting on it now:

This is still very confusing and redundant with all the ?: undefined fields. Can we just have a discriminator tag and the actual payload for each, or some other way to avoid this?
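
One possible reading of the discriminator suggestion, as a sketch only. The `kind` tag, the three variants, and their payload fields are assumptions and do not reflect the actual `StepSpec` in the PR.

```typescript
// Each variant carries a `kind` tag plus only the payload it needs,
// so no variant has to declare the other actions as `?: undefined`.
type StepAction =
  | { kind: "shell"; command: string }
  | { kind: "sleep"; seconds: number }
  | {
      kind: "http";
      url: string;
      method: string;
      headers?: Record<string, string>;
    };

// Fields shared by every step live next to the tagged action.
type StepSpec = { id: string } & StepAction;

// The compiler narrows `step` in each branch based on `kind`.
function describeStep(step: StepSpec): string {
  switch (step.kind) {
    case "shell":
      return `run ${step.command}`;
    case "sleep":
      return `sleep ${step.seconds}s`;
    case "http":
      return `${step.method} ${step.url}`;
  }
}
```

With this shape, a step that mixes two actions is rejected at compile time rather than by a runtime refinement.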

};
}

private async executeStepBody(

There are more and more different steps; let's extract an execute function for each to keep this more readable.
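
A rough sketch of what such an extraction might look like. The step variants, the `StepResult` type, and the handler bodies are placeholders: the handlers only describe what they would do, and the real `executeStepBody` is async and covers more step kinds.

```typescript
type Step =
  | { kind: "shell"; command: string }
  | { kind: "sleep"; seconds: number };

interface StepResult {
  ok: boolean;
  detail: string;
}

// One small function per step kind. These stubs do not run anything;
// they stand in for the real per-step logic.
function executeShell(step: Extract<Step, { kind: "shell" }>): StepResult {
  return { ok: true, detail: `would run: ${step.command}` };
}

function executeSleep(step: Extract<Step, { kind: "sleep" }>): StepResult {
  return { ok: true, detail: `would sleep: ${step.seconds}s` };
}

// The dispatcher stays short: it only routes to the matching handler.
function executeStepBody(step: Step): StepResult {
  switch (step.kind) {
    case "shell":
      return executeShell(step);
    case "sleep":
      return executeSleep(step);
  }
}
```

Adding a new step kind then means adding one variant, one handler, and one `case`, and the compiler flags the switch if the `case` is missing.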

@vigoo vigoo left a comment


Looks good, I just added some code organization comments.

curl -L --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/cargo-bins/cargo-binstall/main/install-from-binstall-release.sh | bash
cargo binstall --no-confirm cargo-component@0.21.1

- name: Build TS SDK (for GOLEM_PATH local SDK references)

This has a cargo make command (build-ts-sdk)

Contributor Author

okay
