
feat: add AIDLC Evaluation & Reporting Framework#115

Merged
scottschreckengaust merged 29 commits into awslabs:main from harmjeff:feature/aidlc-evaluator
Mar 19, 2026

Conversation

@harmjeff
Contributor

@harmjeff commented Mar 13, 2026

Summary

  • Adds scripts/aidlc-evaluator/ — a complete automated evaluation framework for validating changes to the AIDLC workflows repository end-to-end
  • Implements a 6-stage pipeline: execution → post-run tests → quantitative analysis → contract testing → qualitative scoring → report generation
  • Supports single-model, batch (multi-model), cross-model comparison, CLI harness (Claude Code, Kiro CLI), and IDE harness (Cursor, Cline, Copilot, Kiro, Windsurf) evaluation modes
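
The 6-stage sequence above can be sketched as a simple driver loop. Stage names follow the summary; the function names, file layout, and stub behavior here are illustrative, not the framework's real API:

```python
from pathlib import Path

# Illustrative only: the real framework dispatches to separate uv-workspace
# packages; here each stage is a stub that records its completion on disk.
STAGES = ["execution", "post_run_tests", "quantitative",
          "contracttest", "qualitative", "reporting"]

def run_stage(name: str, run_dir: Path) -> Path:
    """Run one stage and return the artifact it wrote (stubbed here)."""
    artifact = run_dir / f"{name}.yaml"
    artifact.write_text(f"stage: {name}\nstatus: pass\n", encoding="utf-8")
    return artifact

def run_pipeline(run_dir: Path) -> list[Path]:
    """Run all stages strictly in order; each communicates only via files."""
    run_dir.mkdir(parents=True, exist_ok=True)
    return [run_stage(name, run_dir) for name in STAGES]
```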

What's included

Packages (uv workspace with 8 internal packages):

| Package | Purpose |
| --- | --- |
| execution | Two-agent Strands swarm that drives the full AIDLC workflow |
| qualitative | Semantic scoring of generated docs vs. golden baseline via Bedrock LLM |
| quantitative | Static analysis: linting (ruff/eslint), security (bandit/semgrep/npm audit), duplication (PMD CPD) |
| contracttest | Spins up generated apps and validates API endpoints against OpenAPI specs |
| nonfunctional | NFR evaluation: token consumption, execution timing, cross-model consistency |
| reporting | Consolidated Markdown + HTML report generation with baseline regression comparison |
| cli-harness | Adapter framework for CLI-based AI assistants |
| ide-harness | Adapter framework for third-party IDE AI assistants |

Other additions:

  • run.py master entry point dispatching to all evaluation modes
  • config/ with default.yaml and per-model overrides for 9 Bedrock models (Claude, Nova, Mistral families)
  • test_cases/ with two golden test cases: sci-calc (default) and all-stages (full workflow)
  • Docker sandbox (docker/sandbox/) for isolated execution of AI-generated code
  • Extension hook testing support for validating progressive disclosure opt-in configurations
  • Full unit test coverage across all packages (~180 test files, ~29K lines added)
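
A run.py entry point dispatching across evaluation modes could look like the sketch below. The mode names come from the test plan; the handler functions and flags are hypothetical stand-ins, not the actual run.py internals:

```python
import argparse

# Hypothetical dispatcher sketch; the real run.py's flags and internals
# are not shown in this PR description.
def cmd_full(args): print("running full pipeline")
def cmd_batch(args): print("running batch (multi-model) evaluation")
def cmd_cli(args): print("running CLI-harness evaluation")
def cmd_ide(args): print("running IDE-harness evaluation")
def cmd_test(args): print("running unit tests")

MODES = {"full": cmd_full, "batch": cmd_batch,
         "cli": cmd_cli, "ide": cmd_ide, "test": cmd_test}

def main(argv=None):
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("mode", choices=sorted(MODES))
    args = parser.parse_args(argv)
    MODES[args.mode](args)

if __name__ == "__main__":
    main()
```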

Architecture

The pipeline uses a file-based YAML communication model — no in-process state crosses stage boundaries, making each package independently testable and runnable in isolation. The execution stage uses a two-agent Strands swarm (Executor + Simulator) with sandboxed file and command execution to safely run AI-generated code.
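
A minimal sketch of that file-based handoff, assuming flat key/value stage results. The real packages presumably use a full YAML library; this stdlib-only version only illustrates the contract at the stage boundary:

```python
from pathlib import Path

def write_stage_result(path: Path, result: dict[str, str]) -> None:
    """Persist a flat stage result as `key: value` lines (a YAML subset)."""
    lines = [f"{key}: {value}" for key, value in result.items()]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")

def read_stage_result(path: Path) -> dict[str, str]:
    """Parse the flat `key: value` file back into a dict."""
    result = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        key, sep, value = line.partition(":")
        if sep:
            result[key.strip()] = value.strip()
    return result
```

Because each stage only reads the previous stage's file, any stage can be rerun in isolation against an existing run folder.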

Test plan

  • uv sync in scripts/aidlc-evaluator/
  • uv run python run.py test — unit tests pass (note: 7 tests in test_run_command.py are expected to fail on Windows due to Unix shell commands)
  • uv run python run.py full with valid AWS/Bedrock credentials runs end-to-end and produces a report under runs/
  • uv run python run.py batch --list shows available model configs
  • uv run python run.py cli --list shows claude-code, kiro-cli
  • uv run python run.py ide --list shows IDE adapters
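
The `--list` behavior above can be as simple as enumerating the per-model override files that sit next to default.yaml in config/; this is a guess at the mechanism, not the framework's actual code:

```python
from pathlib import Path

def list_model_configs(config_dir: Path) -> list[str]:
    """Return model config names, excluding the shared default.yaml."""
    return sorted(p.stem for p in config_dir.glob("*.yaml")
                  if p.stem != "default")
```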

🤖 Generated with Claude Code

Evaluation and reporting framework for validating AI-DLC workflow changes.
Includes execution, qualitative/quantitative scoring, contract testing,
reporting packages, and CLI/IDE harness adapters.

Also fixes pytest import-mode collision for same-named test files across
packages, and documents known Windows test_run_command.py failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
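
The pytest collision mentioned here occurs because test files with identical basenames in different packages (e.g. several test_run.py files) clash under pytest's default prepend import mode. The usual remedy, assumed here since the PR does not show the exact change, is switching collection to importlib mode:

```toml
# pyproject.toml — assumed fix, not quoted from this PR. importlib mode
# gives each test file a unique module identity, so same-named test
# files across packages no longer collide during collection.
[tool.pytest.ini_options]
addopts = "--import-mode=importlib"
```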
@harmjeff requested a review from a team as a code owner March 13, 2026 16:23

@MichaelWalker-git left a comment


LGTM

Member

@scottschreckengaust left a comment


fix failed tests

harmjeff and others added 2 commits March 16, 2026 23:24
Replace shell-specific commands with Python equivalents to ensure tests
pass on all platforms (Windows/Mac/Linux) when using shell=False:

- Replace `echo 'content' > file` with Python pathlib file writing
- Replace shell builtin `exit N` with Python `sys.exit(N)`
- Replace `echo 'msg' >&2` with Python `sys.stderr.write()`
- Update command-not-found test to handle both OSError and exit code 127

All 245 tests now pass successfully on Windows.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
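
The substitutions described in this commit can be shown in a minimal, self-contained form; the actual test code isn't included in the PR description, so these commands are representative only:

```python
import subprocess
import sys

# Instead of the shell builtin `exit 3` (which needs shell=True and a
# Unix shell), spawn the current interpreter with shell=False:
exit_result = subprocess.run(
    [sys.executable, "-c", "import sys; sys.exit(3)"], shell=False)

# Instead of `echo 'msg' >&2`, write to stderr from Python:
stderr_result = subprocess.run(
    [sys.executable, "-c", "import sys; sys.stderr.write('msg\\n')"],
    shell=False, capture_output=True, text=True)
```

Invoking `sys.executable` keeps the commands portable across Windows, macOS, and Linux, since no shell interpretation is involved.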
@harmjeff
Contributor Author

I fixed the Mac issues.

harmjeff and others added 2 commits March 17, 2026 15:16
…rompt_template.py

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
…rompt_template.py

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
harmjeff and others added 10 commits March 19, 2026 10:52
Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
…/server.py


Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
…st_run.py


Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
…st_run.py


Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
…st_run.py


Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
…st_run.py


Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
@scottschreckengaust
Member

============================================================
  Evaluation Complete
============================================================
  Run folder:              ~/github.com/awslabs/aidlc-workflows/scripts/aidlc-evaluator/runs/sci-calc/20260319T152257-aidlc-workflows_main
  Run metrics:             ~/github.com/awslabs/aidlc-workflows/scripts/aidlc-evaluator/runs/sci-calc/20260319T152257-aidlc-workflows_main/run-metrics.yaml
  Post-run tests:          PASS  (168 passed, 0 failed (of 168))
  Code quality:            PASS  (lint: 0 finding(s), 0 error(s); security: 1 finding(s), 0 high)
  Contract tests:          PASS  (88/88 passed, 0 failed, 0 errors)
  Qualitative comparison:  ~/github.com/awslabs/aidlc-workflows/scripts/aidlc-evaluator/runs/sci-calc/20260319T152257-aidlc-workflows_main/qualitative-comparison.yaml
  Qualitative score:       (see above)

Member

@scottschreckengaust left a comment


works

Contributor

@Kalindi-Dev left a comment


LGTM

@scottschreckengaust added this pull request to the merge queue Mar 19, 2026
Merged via the queue into awslabs:main with commit aaca23d Mar 19, 2026
@scottschreckengaust deleted the feature/aidlc-evaluator branch March 19, 2026 17:06