Add agent testing harness for skill verification#2921

Merged
Myestery merged 3 commits into skills from agent-testing-harness on Mar 5, 2026

Conversation

Myestery (Contributor) commented Mar 4, 2026

Summary

  • Implements the full agent testing harness (tests/harness/), which validates that AI coding agents can discover and follow Golem skill files
  • Adds golem-new-project skill and bootstrap scenario (golem-new-project-ts)
  • Adds CI workflow (skills-test.yaml) with unit tests + integration tests on Ubuntu

Issues

Closes #2872
Closes #2873
Closes #2874
Closes #2875
Closes #2876
Closes #2877
Closes #2878
Closes #2879
Closes #2880
Closes #2881
Closes #2882
Closes #2898
Closes #2906
Closes #2907

What's included

  • CLI entrypoint (run.ts): arg parsing, scenario filtering, config summary
  • Scenario loader: YAML parsing with Zod schema validation
  • AgentDriver interface + Claude Code driver (with skill symlinking, session resumption) + Gemini stub
  • SkillWatcher: inotifywait (Linux), fswatch (macOS), atime comparison, presence-based fallback
  • Assertion engine: default, strict, and allowedExtraSkills modes
  • Build verification: golem build with automatic golem.yaml directory discovery
  • JSON report generation: per-scenario results with timing, skills, and error details
  • 21 unit tests: watcher path extraction, executor assertions, loader validation
  • CI workflow: unit tests job + integration test matrix (claude-code, gemini) x (ts)
  • CI: Install and start Golem server (v1.4.2 binary, /healthcheck readiness loop)
  • CI: Install coding agents (Claude Code via npm, Gemini stub)
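
The three assertion modes listed above can be sketched roughly as follows. This is a minimal illustration, not the harness's code: the names `SkillExpectation` and `assertSkills` are hypothetical, and the semantics are assumed (default tolerates extra activated skills, strict rejects any extra, and allowedExtraSkills permits only the listed extras).

```typescript
// Hypothetical shapes; the real harness types and field names may differ.
interface SkillExpectation {
  expectedSkills: string[];
  strict?: boolean;
  allowedExtraSkills?: string[];
}

interface AssertionResult {
  passed: boolean;
  missing: string[];    // expected skills that were never activated
  unexpected: string[]; // activated skills the chosen mode does not permit
}

function assertSkills(activated: string[], exp: SkillExpectation): AssertionResult {
  const activatedSet = new Set(activated);
  const expected = new Set(exp.expectedSkills);

  const missing = exp.expectedSkills.filter((s) => !activatedSet.has(s));
  const extras = Array.from(activatedSet).filter((s) => !expected.has(s));

  let unexpected: string[] = [];
  if (exp.strict) {
    // Assumed strict semantics: no skill outside the expected set may activate.
    unexpected = extras;
  } else if (exp.allowedExtraSkills !== undefined) {
    // Assumed allow-list semantics: extras must appear in the list.
    const allowed = new Set(exp.allowedExtraSkills);
    unexpected = extras.filter((s) => !allowed.has(s));
  }
  // Assumed default semantics: extras are tolerated, so `unexpected` stays empty.

  return { passed: missing.length === 0 && unexpected.length === 0, missing, unexpected };
}
```

Returning the `missing` and `unexpected` lists rather than a bare boolean would let the JSON report explain why a scenario failed.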

Key fixes applied during development

  • stdio: ['pipe', ...] → ['ignore', ...] to prevent a stdin pipe hang with Claude Code
  • Removed invalid fswatch --event Access flag (not a supported event type on macOS)
  • Added workspace cleanup before each run to prevent stale directory conflicts
  • Added findGolemProjectDir() to handle golem new creating a subdirectory
  • Fixed test glob pattern (**/*.test.js → *.test.js) for POSIX shell compatibility
  • Fixed golem binary download (plain binary from golemcloud/golem, not tarball from golemcloud/golem-cli)
  • Fixed health check endpoint (/healthcheck instead of /version)
  • Fixed harness exit code to return 1 on scenario failures
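
The stdio fix from the list above can be illustrated with a small sketch. It is not the driver's actual code: `runWithoutStdin` is a hypothetical helper, and `spawnSync` is used only to keep the example short (the driver presumably uses the asynchronous `spawn`, which accepts the same `stdio` option). With stdin set to `'ignore'`, the child sees end-of-file immediately instead of waiting on a pipe the parent never writes to or closes.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical helper: run a command with stdin closed and stdout/stderr captured.
function runWithoutStdin(command: string, args: string[]): { status: number | null; stdout: string } {
  const result = spawnSync(command, args, {
    // stdin: 'ignore' (child reads EOF at once); stdout and stderr: 'pipe' (captured).
    stdio: ["ignore", "pipe", "pipe"],
    encoding: "utf8",
  });
  return { status: result.status, stdout: result.stdout };
}

// Stand-in for an agent CLI: a child that waits for stdin to end before finishing.
const childScript =
  "process.stdin.resume(); process.stdin.on('end', () => console.log('stdin closed'));";
const out = runWithoutStdin(process.execPath, ["-e", childScript]);
console.log(out.status, out.stdout.trim());
```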

Test plan

  • cd tests/harness && npm run build && npm test — 21/21 unit tests pass
  • CI workflow unit-tests job passes
  • CI workflow integration-tests jobs run (and fail as expected when API keys are not configured)

Implements the skill testing harness (#2872-#2882, #2898) that validates
AI coding agents can discover, load, and follow Golem skill files to
produce correct build artifacts.

- CLI entrypoint with arg parsing and scenario filtering
- YAML scenario loader with Zod schema validation
- AgentDriver interface with Claude Code and Gemini (stub) drivers
- SkillWatcher with inotifywait (Linux), fswatch (macOS), atime and
  presence-based fallback detection
- Skill activation assertion engine (default, strict, allowedExtras)
- Build verification with golem.yaml directory discovery
- JSON report generation per scenario run
- Bootstrap scenario: golem-new-project-ts
- golem-new-project skill (SKILL.md)
- CI workflow for unit + integration tests on Ubuntu
- 21 unit tests (watcher, executor assertions, loader validation)
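
The golem.yaml directory discovery mentioned above might look roughly like this. The function name `findGolemProjectDir` comes from the PR description, but the body below is an assumed reimplementation: the depth limit, the skipped directory names, and the `null` return for "not found" are illustrative choices, not confirmed behaviour.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Assumed behaviour: return the first directory at or below `root` that
// contains golem.yaml, searching at most `maxDepth` levels down.
function findGolemProjectDir(root: string, maxDepth = 3): string | null {
  if (fs.existsSync(path.join(root, "golem.yaml"))) return root;
  if (maxDepth === 0) return null;
  for (const entry of fs.readdirSync(root, { withFileTypes: true })) {
    // Skip files, hidden directories, and dependency folders (an assumption).
    if (!entry.isDirectory() || entry.name.startsWith(".") || entry.name === "node_modules") continue;
    const found = findGolemProjectDir(path.join(root, entry.name), maxDepth - 1);
    if (found !== null) return found;
  }
  return null;
}

// Usage: simulate `golem new` creating a subdirectory inside the workspace.
const workspace = fs.mkdtempSync(path.join(os.tmpdir(), "harness-"));
const projectDir = path.join(workspace, "my-app"); // hypothetical project name
fs.mkdirSync(projectDir);
fs.writeFileSync(path.join(projectDir, "golem.yaml"), "");
console.log(findGolemProjectDir(workspace) === projectDir);
```

Build verification could then run `golem build` with the returned directory as its working directory.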
github-actions bot commented Mar 4, 2026

Thank you for your contribution! Before we can merge this PR, we need you to sign our Contributor License Agreement. Please read the CLA and post the comment below to sign.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

Change **/*.test.js to *.test.js since /bin/sh (dash) on Linux
does not support globstar. All test files are in a flat directory
so ** is unnecessary.
Myestery force-pushed the agent-testing-harness branch 4 times, most recently from 28dda12 to fc5f6f4, on March 5, 2026 02:15
- Download golem v1.4.2 binary from golemcloud/golem releases
- Use golem server run with /healthcheck readiness loop
- Fix executor health check to use /healthcheck endpoint
- Exit with code 1 when any scenario fails
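
The readiness loop and exit-code behaviour described in this commit could be sketched as below. All names (`waitUntilReady`, `exitCodeFor`, `ScenarioResult`) are hypothetical, and the attempt count, delay, and base URL in the wiring comment are placeholders rather than values taken from the harness.

```typescript
// Hypothetical result shape; the real report carries more fields.
interface ScenarioResult {
  name: string;
  passed: boolean;
}

// Poll `probe` until it reports ready or the attempts run out.
async function waitUntilReady(
  probe: () => Promise<boolean>,
  attempts: number,
  delayMs: number,
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    try {
      if (await probe()) return true;
    } catch {
      // The server is not accepting connections yet; treat as "not ready".
    }
    if (i < attempts - 1) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}

// Exit code 1 when any scenario failed, 0 otherwise.
function exitCodeFor(results: ScenarioResult[]): number {
  return results.every((r) => r.passed) ? 0 : 1;
}

// Example wiring (assumes Node 18+ global fetch and a placeholder base URL):
//   const ready = await waitUntilReady(
//     async () => (await fetch(`${baseUrl}/healthcheck`)).ok, 30, 1000);
//   process.exit(ready ? exitCodeFor(results) : 1);
```

Passing the probe in as a function keeps the loop testable without a running Golem server.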
Myestery force-pushed the agent-testing-harness branch from fc5f6f4 to 0ccf003 on March 5, 2026 02:23
Myestery marked this pull request as ready for review on March 5, 2026 02:49
Myestery merged commit 3290b88 into skills on Mar 5, 2026
26 of 32 checks passed
Myestery deleted the agent-testing-harness branch on March 5, 2026 02:49
github-actions bot locked and limited conversation to collaborators on Mar 5, 2026