Add HTTP steps, retry, resume, failure classification, HTML reports, and CI setup to harness by Myestery · Pull Request #3048 · golemcloud/golem

Myestery · 2026-03-20T19:30:48Z

Summary

- HTTP behavioral check steps with status/body assertions for testing deployed endpoints
- Retry logic for flaky agent interactions (retry: { attempts, delay })
- --resume-from flag to skip steps before a given step ID, enabling re-runs from a specific point
- Failure classification mapping error prefixes (BUILD_FAILED, DEPLOY_FAILED, etc.) to categories with actionable guidance
- HTML summary report generation alongside summary.json with scenario details, failures, and matrix views
- Golem state cleanup (golem deploy --reset) between scenarios with --no-cleanup and --workspace flags
- CI workflow: Postgres 16 service container + Go/Python toolchain setup
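The retry and --resume-from items above could look roughly like the following. This is a minimal sketch, not the harness's actual code: the `attempts` and `delay` fields come from the summary, while the `Step` shape, the helper names `withRetry` and `stepsToRun`, and the assumption that `delay` is in seconds are all hypothetical.

```typescript
// Hypothetical shapes; the harness's real types may differ.
interface RetrySpec {
  attempts: number; // total tries, including the first one
  delay: number; // pause between tries, assumed here to be in seconds
}

interface Step {
  id: string;
  retry?: RetrySpec;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Run `action` until it succeeds or the attempt budget is used up.
async function withRetry<T>(
  action: () => Promise<T>,
  retry: RetrySpec = { attempts: 1, delay: 0 },
): Promise<T> {
  const attempts = Math.max(1, retry.attempts);
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await action();
    } catch (err) {
      lastError = err;
      // Only wait if another attempt is coming.
      if (attempt < attempts) await sleep(retry.delay * 1000);
    }
  }
  throw lastError;
}

// Skip every step before `resumeFrom`; the named step itself still runs.
function stepsToRun(steps: Step[], resumeFrom?: string): Step[] {
  if (resumeFrom === undefined) return steps;
  const index = steps.findIndex((step) => step.id === resumeFrom);
  if (index === -1) throw new Error(`Unknown step id: ${resumeFrom}`);
  return steps.slice(index);
}
```

Rejecting an unknown step ID, rather than silently running everything, is a design choice of this sketch; the PR does not say how the harness handles that case.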

Closes

- Closes #2892 (Implement behavioral check: http)
- Closes #2916 (Add retry logic for flaky agent interactions)
- Closes #2915 (Add --resume-from flag (skip passed steps))
- Closes #2902 (Implement failure classification with actionable guidance)
- Closes #2903 (Implement HTML summary report)
- Closes #2913 (Implement Golem state cleanup between scenarios)
- Closes #2910 (CI step: Optional Postgres/MySQL sidecar for DB scenarios)
- Closes #2909 (CI step: Install language toolchains)

Test plan

- Unit tests for HTTP steps, retry logic, resume, failure classification, and HTML report
- Manual verification of HTTP endpoints against a deployed Golem DB app
- Harness scenario run with --scenario golem-db-app produces HTML report
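For illustration, the status/body assertions behind the HTTP check could be unit-tested through a pure helper like the one below. This is a sketch under assumptions: `HttpExpectation`, `ObservedResponse`, `checkHttpResponse`, and the `bodyContains` field are hypothetical names, and the harness may express these checks differently.

```typescript
// Hypothetical expectation shape; the harness's real field names may differ.
interface HttpExpectation {
  status?: number; // exact status code to require
  bodyContains?: string; // substring that must appear in the response body
}

interface ObservedResponse {
  status: number;
  body: string;
}

// Return human-readable mismatches; an empty list means the check passed.
function checkHttpResponse(
  observed: ObservedResponse,
  expected: HttpExpectation,
): string[] {
  const failures: string[] = [];
  if (expected.status !== undefined && observed.status !== expected.status) {
    failures.push(`expected status ${expected.status}, got ${observed.status}`);
  }
  if (
    expected.bodyContains !== undefined &&
    !observed.body.includes(expected.bodyContains)
  ) {
    failures.push(`body does not contain "${expected.bodyContains}"`);
  }
  return failures;
}
```

Keeping the comparison separate from the network call means the assertion logic can be tested without a deployed endpoint.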

* Add agent testing harness for skill verification

  Implements the skill testing harness (#2872-#2882, #2898) that validates AI coding agents can discover, load, and follow Golem skill files to produce correct build artifacts.

  - CLI entrypoint with arg parsing and scenario filtering
  - YAML scenario loader with Zod schema validation
  - AgentDriver interface with Claude Code and Gemini (stub) drivers
  - SkillWatcher with inotifywait (Linux), fswatch (macOS), atime and presence-based fallback detection
  - Skill activation assertion engine (default, strict, allowedExtras)
  - Build verification with golem.yaml directory discovery
  - JSON report generation per scenario run
  - Bootstrap scenario: golem-new-project-ts
  - golem-new-project skill (SKILL.md)
  - CI workflow for unit + integration tests on Ubuntu
  - 21 unit tests (watcher, executor assertions, loader validation)

* Fix CI test glob pattern for Linux compatibility

  Change **/*.test.js to *.test.js since /bin/sh (dash) on Linux does not support globstar. All test files are in a flat directory so ** is unnecessary.

* Fix CI: golem binary, health check, and exit code on failure

  - Download golem v1.4.2 binary from golemcloud/golem releases
  - Use golem server run with /healthcheck readiness loop
  - Fix executor health check to use /healthcheck endpoint
  - Exit with code 1 when any scenario fails

…oke, shell/sleep/trigger, create/delete agent, opencode driver, aggregated report

- Add assertion engine with exit_code, stdout, body, status, and result_json checks (#2895)
- Add scenario-level settings, prerequisites, step timeout, and continue_session (#2887)
- Add deploy verification with implicit build (#2889)
- Add invoke check with expect assertions (#2890)
- Add shell, sleep, and trigger step actions (#2893)
- Add create_agent and delete_agent step actions (#2894)
- Add OpenCode driver stub with `opencode run` (#2897)
- Add aggregated summary report with summary.json output (#2912)

- Add --approval-mode yolo to Gemini CLI to enable all tools (run_shell_command, activate_skill, etc.)
- Symlink skills to .gemini/skills/ so Gemini can discover them
- Watch all agent skill dirs (.claude, .gemini, .agents) for activation
- Fix macOS APFS relatime: reset atime before mtime so reads trigger updates
- Add fswatch -a (access) and -L (follow-links) flags for macOS
- Remove presence-check fallback, log full paths for detected skills
- Move skills dir default from golem/skills to skills/

# Conflicts:
#   golem-skills/tests/harness/src/driver/base.ts
#   golem-skills/tests/harness/src/executor.ts
#   golem-skills/tests/harness/tests/abort.test.ts
#   golem-skills/tests/harness/tests/conditions.test.ts
#   golem-skills/tests/harness/tests/github-summary.test.ts
#   golem-skills/tests/harness/tests/loader.test.ts
#   golem-skills/tests/harness/tests/variables-integration.test.ts
#   golem-skills/tests/harness/tests/variables.test.ts
#   tests/harness/src/driver/claude.ts
#   tests/harness/src/driver/gemini.ts
#   tests/harness/src/driver/opencode.ts
#   tests/harness/src/run.ts
#   tests/harness/tests/assertions.test.ts
#   tests/harness/tests/executor.test.ts
#   tests/harness/tests/watcher.test.ts

…nd GitHub summary to harness (#2960)

- Template variable substitution ({{agent}}, {{language}}, {{workspace}}, {{scenario}}) in step fields
- Conditional step execution with only_if/skip_if on agent, language, and os
- --dry-run flag to validate scenarios and print step summaries without executing
- Graceful Ctrl+C handling with partial result writing via AbortController
- GitHub Actions job summary markdown output via GITHUB_STEP_SUMMARY
- Remove issue number references from comments
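The template variable substitution mentioned in this commit can be pictured with a small helper. This is a sketch only: the four variable names are taken from the commit message, but `TemplateVars`, `substituteVariables`, and the decision to leave unknown placeholders untouched are assumptions, not the harness's confirmed behavior.

```typescript
// The four variables named in the commit message.
type TemplateVars = Record<"agent" | "language" | "workspace" | "scenario", string>;

// Replace {{name}} placeholders with values from `vars`.
// Unknown placeholders are left as-is (an assumption of this sketch).
function substituteVariables(input: string, vars: TemplateVars): string {
  return input.replace(/\{\{(\w+)\}\}/g, (placeholder: string, name: string) =>
    Object.prototype.hasOwnProperty.call(vars, name)
      ? vars[name as keyof TemplateVars]
      : placeholder,
  );
}

// Example values are made up for illustration.
const exampleVars: TemplateVars = {
  agent: "claude",
  language: "ts",
  workspace: "/tmp/ws",
  scenario: "golem-new-project-ts",
};

console.log(substituteVariables("cd {{workspace}} && echo {{agent}}-{{language}}", exampleVars));
```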

… HTML reports to harness

- HTTP behavioral check steps with status/body assertions (#2892)
- Retry logic for flaky agent interactions with attempts/delay (#2916)
- --resume-from flag to skip steps before a given step ID (#2915)
- Failure classification mapping error prefixes to categories with guidance (#2902)
- HTML summary report generation alongside summary.json (#2903)
- Golem state cleanup (deploy --reset) between scenarios (#2913)
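As an illustration of the failure classification item, a prefix table could be matched like this. It is a hedged sketch: BUILD_FAILED and DEPLOY_FAILED are the only prefixes named in the PR, and the category names, guidance strings, and the `classifyFailure` helper are invented for the example.

```typescript
// Category names are hypothetical; the PR does not list them.
type FailureCategory = "build" | "deploy" | "unknown";

interface Classification {
  category: FailureCategory;
  guidance: string;
}

// Only these two prefixes appear in the PR text; guidance strings are illustrative.
const PREFIX_TABLE: ReadonlyArray<[string, Classification]> = [
  ["BUILD_FAILED", { category: "build", guidance: "Inspect the build output for compiler errors." }],
  ["DEPLOY_FAILED", { category: "deploy", guidance: "Check that the Golem server is reachable." }],
];

// Map an error message to a category by its leading prefix.
function classifyFailure(message: string): Classification {
  for (const [prefix, classification] of PREFIX_TABLE) {
    if (message.startsWith(prefix)) return classification;
  }
  return { category: "unknown", guidance: "No guidance available for this error." };
}
```

A table keeps the mapping declarative, so adding a prefix does not require touching the matching logic.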

github-actions · 2026-03-20T19:30:59Z

✅ All contributors have signed the CLA.
Posted by the CLA Assistant Lite bot.

…ness-issues

# Conflicts:
#   .github/workflows/skills-test.yaml
#   golem-skills/tests/harness/src/executor.ts
#   golem-skills/tests/harness/src/html-report.ts
#   golem-skills/tests/harness/src/run.ts
#   golem-skills/tests/harness/tests/loader.test.ts

- New skill: golem-db-app (teaches agents to build PostgreSQL-backed Golem apps with HTTP endpoints using golem:rdbms/postgres)
- New scenario: golem-db-app-ts (creates app, deploys, verifies HTTP POST/GET endpoints, and checks DB rows via psql)
- Tested with codex (pass) and gemini (build/deploy pass)

vigoo · 2026-03-24T07:37:36Z

golem-skills/tests/harness/src/executor.ts

+  headers?: Record<string, string>;
+};

 export type StepSpec =


This is not really a comment for this PR, but I still don't understand this type.
In #2932 (comment) I pointed out that it does not make sense to have all these different nullable fields in one record, as each step must contain exactly one of them.
I did not properly review when you changed it in that PR, so I'm commenting on it now:

This is still very confusing and redundant with all the ?: undefined fields. Can we just have a discriminator tag and the actual payload for each, or some other way to avoid this?
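One possible reading of the "discriminator tag plus payload" suggestion is sketched below. The `kind` tag, the payload interfaces, and `describeAction` are hypothetical; only the `headers?: Record<string, string>` field is taken from the diff above, and the real StepSpec has more variants.

```typescript
// Hypothetical payloads; field names are illustrative, not the harness's schema.
interface HttpPayload {
  url: string;
  method?: string;
  headers?: Record<string, string>; // this field appears in the diff above
}
interface ShellPayload {
  command: string;
}
interface SleepPayload {
  seconds: number;
}

// Each variant carries a tag and exactly one payload; no `?: undefined` fields.
type StepAction =
  | { kind: "http"; http: HttpPayload }
  | { kind: "shell"; shell: ShellPayload }
  | { kind: "sleep"; sleep: SleepPayload };

// Switching on the tag narrows the type, so each branch sees only its payload.
function describeAction(action: StepAction): string {
  switch (action.kind) {
    case "http":
      return `${action.http.method ?? "GET"} ${action.http.url}`;
    case "shell":
      return `run: ${action.shell.command}`;
    case "sleep":
      return `sleep ${action.sleep.seconds}s`;
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a variant is missed.
      const unreachable: never = action;
      throw new Error(`Unhandled action: ${JSON.stringify(unreachable)}`);
    }
  }
}
```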

vigoo · 2026-03-24T07:38:30Z

golem-skills/tests/harness/src/executor.ts

    };
  }

+  private async executeStepBody(


There are more and more step types; let's extract an execute function for each to keep this readable.
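The extraction suggested here might take the following shape, with `executeStepBody` reduced to a dispatcher. This is only a sketch: `executeStepBody` is the method name from the diff, while the `Step` variants, `StepResult`, and the stubbed `executeShell`/`executeSleep` functions are assumptions that do no real work.

```typescript
// Hypothetical, simplified step variants for illustration.
type Step =
  | { kind: "shell"; command: string }
  | { kind: "sleep"; seconds: number };

interface StepResult {
  ok: boolean;
  detail: string;
}

// One small function per step type. These are stubs, not real execution.
async function executeShell(step: Extract<Step, { kind: "shell" }>): Promise<StepResult> {
  return { ok: true, detail: `would run: ${step.command}` };
}

async function executeSleep(step: Extract<Step, { kind: "sleep" }>): Promise<StepResult> {
  await new Promise<void>((resolve) => setTimeout(resolve, step.seconds * 1000));
  return { ok: true, detail: `slept ${step.seconds}s` };
}

// The body only routes to the matching function.
async function executeStepBody(step: Step): Promise<StepResult> {
  switch (step.kind) {
    case "shell":
      return executeShell(step);
    case "sleep":
      return executeSleep(step);
    default: {
      const unreachable: never = step;
      throw new Error(`Unhandled step: ${JSON.stringify(unreachable)}`);
    }
  }
}
```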

vigoo

Looks good, just added some code organizational comments

vigoo · 2026-03-25T11:15:46Z

.github/workflows/skills-test.yaml

+          curl -L --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/cargo-bins/cargo-binstall/main/install-from-binstall-release.sh | bash
+          cargo binstall --no-confirm cargo-component@0.21.1
+
+      - name: Build TS SDK (for GOLEM_PATH local SDK references)


This has a cargo make command (build-ts-sdk)

Myestery and others added 30 commits March 5, 2026 03:49

Move golem-new-project skill to golem/skills/, remove adding-dependencies scenario, update paths

d7cb9b1

Merge branch 'main' into skills

822ee01

Add opencode agent to CI, uncomment Gemini install, update paths

9959988

- Add opencode (opencode-ai) to matrix agents - Replace Gemini CLI placeholder with actual npm install - Update path triggers from golem/skills/ to skills/

Merge branch 'main' into skills

1da3c87

Add --help flag to harness run.ts (#2871)

9125b2e

Extract skill symlink logic into BaseAgentDriver.linkSkills()

3c516ec

Each driver now declares a skillDirs array instead of duplicating the symlink loop. Removes ~60 lines of repeated code.

Add Codex agent driver and default --scenarios flag

f35aa63

Adds CodexAgentDriver using codex exec with session resume support. Makes --scenarios default to ./scenarios so --scenario can be used alone.

Update golem-new-project skill to check golem/golem-cli availability

ade2e95

Adds version check fallback between golem and golem-cli binaries and clarifies that golem should not be built from scratch.

Fix Gemini CI failure and add Codex to CI matrix

ab1beb0

- Pre-create ~/.gemini/ dir to prevent ENOENT on projects.json - Use GEMINI_API_KEY secret directly - Add codex agent to CI matrix with OPENAI_API_KEY

Add GOOGLE_GENERATIVE_AI_API_KEY for OpenCode in CI

81e6fb4

OpenCode expects GOOGLE_GENERATIVE_AI_API_KEY, not GEMINI_API_KEY.

Add codex login step in CI using --with-api-key

e1ec14b

Codex CLI requires explicit login rather than reading OPENAI_API_KEY directly from the environment.

Run harness tests in isolated directory outside the repo

9c05312

Copies harness and skills to /tmp/harness-run/ with a fresh git init so agents cannot crawl up into the golem repo.

Fix git init in CI isolated dir: set user identity and branch name

5ff705c

Fix isolated dir: copy dirs individually, use global tsx

e037b22

Move skills/ and tests/harness/ into golem-skills/ directory

689bcc7

Restructure to group skill definitions and the testing harness under a single top-level golem-skills/ directory. Update CI workflow paths, .gitignore, and AGENTS.md accordingly.

Strip inline ticket references from code comments

aaabacc

Move issue tracking to PR description. Update OpenCode driver comment to clarify session continuity status.

Add discriminated union validation for StepSpec

cb61348

Add .refine() to StepSpecSchema ensuring exactly one action field per step. Define StepSpec as a union type for better type narrowing. Add negative tests for zero and multiple actions per step.
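The "exactly one action field per step" rule from this commit can be illustrated without Zod. The sketch below swaps the commit's `.refine()` for plain TypeScript so that it has no dependencies. The key list is assembled from step actions mentioned elsewhere in this PR and may not match the real schema, and `validateStep` is a hypothetical helper.

```typescript
// Action keys gathered from this PR's commit messages; the real schema may differ.
const ACTION_KEYS = [
  "invoke",
  "shell",
  "sleep",
  "trigger",
  "create_agent",
  "delete_agent",
  "http",
] as const;

// Return null when the step is valid, otherwise an error message.
function validateStep(step: Record<string, unknown>): string | null {
  const present: string[] = ACTION_KEYS.filter((key) => step[key] !== undefined);
  if (present.length === 1) return null;
  return present.length === 0
    ? "step must define exactly one action, found none"
    : `step must define exactly one action, found: ${present.join(", ")}`;
}
```

The two failure branches correspond to the negative tests the commit describes: zero actions and multiple actions per step.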

Make --agent and --language optional, default to all

00b9e31

Extract createDriver() function, add SUPPORTED_AGENTS and SUPPORTED_LANGUAGES constants, wrap scenario loop in agent/language matrix. Update report filenames to include agent-language prefix. Show default timeout (300s) in help text.

Expand golem-new-project-ts scenario with deploy and shell steps

275a002

Add file existence check via shell step and deploy verification to demonstrate more harness capabilities.

Extract DEFAULT_STEP_TIMEOUT_SECONDS constant

0fd1472

Share the default timeout value between executor and run.ts help text via an exported constant.

Format code with prettier

58b9e96

Remove deploy step from golem-new-project-ts scenario

f632aae

The scaffolded project has no components, so deploy produces an empty diff that the CLI misreads as a concurrent modification error.

Merge remote-tracking branch 'origin/harness-enhancements' into implement-missing-harness-issues

38deb4c

Merge remote-tracking branch 'origin/main' into implement-missing-harness-issues

dd1a875

Myestery added 2 commits March 20, 2026 20:30

Add Postgres and language toolchain setup to CI workflow

7692c84

- Postgres 16 service container with env vars (#2910) - Go and Python setup steps for multi-language scenarios (#2909)

Base automatically changed from skills to main March 20, 2026 20:32

Myestery requested a review from vigoo March 23, 2026 11:45

Myestery added 11 commits March 23, 2026 19:12

rebuild golem artifact

f83cce7

rebuild golem artifact

0839d37

ci: Install Golem binaries from workflow artifacts instead of direct download.

7752751

Use golem binary artifacts from build workflow instead of v1.4.2 release

71eb392

Download golem-cli and golem-server from build-golem-binaries workflow artifacts with search_artifacts to find runs that produced binaries.

ci: Set GOLEM_PATH in skills-test workflow for local SDK overrides

0df7fa8

Adds GOLEM_PATH=$GITHUB_WORKSPACE to the Run Skill Tests env block so golem new/build use local SDKs from the repo checkout instead of published registry versions. Also removes stale html-report test file.

Update scenario prompt to set GOLEM_TS_PACKAGES_PATH before golem new

0d9ff14

Merge remote-tracking branch 'origin/main' into implement-missing-harness-issues

7f61566

Update scenario prompt to fix SDK versions to 1.0.0-dev.1 after project creation

9224cb7

Allow golem-db-app as extra skill in golem-new-project-ts scenario

df25c8e

Add SDK version fix instruction to golem-db-app-ts scenario prompt

e9eb72f

vigoo reviewed Mar 24, 2026

vigoo requested changes Mar 24, 2026

Myestery and others added 5 commits March 24, 2026 16:48

Remove SDK version fix from db-app scenario - GOLEM_PATH provides compatible local SDKs

38dfb10

ci: Build TS SDK before skill tests so GOLEM_PATH local refs work

a29ebcb

ci: Install wasm-rquickjs-cli before building TS SDK

4ef971f

Merge branch 'main' into implement-missing-harness-issues

cc11d0e

ci: Build agent_guest.wasm for TS SDK - install cargo-component and wasm32-wasip1 target

97de1bb

vigoo reviewed Mar 25, 2026

Myestery added 2 commits March 25, 2026 12:33

ci: Use cargo make build-sdk-ts for TS SDK build in skill tests

d87b71e

Clean golem-temp before verification deploy to avoid stale cached artifacts

176abed

Myestery wants to merge 51 commits into main from implement-missing-harness-issues
