feat: add AIDLC Evaluation & Reporting Framework #115
Merged
scottschreckengaust merged 29 commits into awslabs:main on Mar 19, 2026
Evaluation and reporting framework for validating AI-DLC workflow changes. Includes execution, qualitative/quantitative scoring, contract testing, reporting packages, and CLI/IDE harness adapters. Also fixes pytest import-mode collision for same-named test files across packages, and documents known Windows test_run_command.py failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace shell-specific commands with Python equivalents to ensure tests pass on all platforms (Windows/Mac/Linux) when using shell=False:
- Replace `echo 'content' > file` with Python pathlib file writing
- Replace shell builtin `exit N` with Python `sys.exit(N)`
- Replace `echo 'msg' >&2` with Python `sys.stderr.write()`
- Update command-not-found test to handle both OSError and exit code 127

All 245 tests now pass successfully on Windows.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
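The replacements listed in this commit message can be sketched roughly as follows. This is an illustration only, not the PR's actual test code: the helper `run_no_shell`, the file name `out.txt`, and the made-up command name are all hypothetical. The idea is that each shell builtin is swapped for a call to the current Python interpreter, which behaves the same under `shell=False` on every platform.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_no_shell(args):
    """Hypothetical helper: run a command without a shell.

    Returns (exit_code, stdout, stderr). A missing executable raises
    OSError, which is mapped here to 127, the conventional shell code
    for "command not found".
    """
    try:
        proc = subprocess.run(args, shell=False, capture_output=True, text=True)
    except OSError as exc:  # FileNotFoundError is a subclass of OSError
        return 127, "", str(exc)
    return proc.returncode, proc.stdout, proc.stderr


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "out.txt"  # hypothetical file name

    # Instead of: echo 'content' > file
    write_cmd = [
        sys.executable,
        "-c",
        "import pathlib, sys; pathlib.Path(sys.argv[1]).write_text('content')",
        str(target),
    ]
    write_code, _, _ = run_no_shell(write_cmd)
    written = target.read_text()

# Instead of the shell builtin: exit 3
exit_code, _, _ = run_no_shell([sys.executable, "-c", "import sys; sys.exit(3)"])

# Instead of: echo 'msg' >&2
_, _, err = run_no_shell([sys.executable, "-c", "import sys; sys.stderr.write('msg')"])

# Command-not-found: the OSError path is normalized to 127 by the helper
missing_code, _, _ = run_no_shell(["definitely-not-a-real-command-xyz"])
```

Normalizing the OSError case to 127 is one way to let a single assertion cover both outcomes mentioned in the commit message; a test could equally accept either outcome explicitly.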
I fixed the Mac issues.
MichaelWalker-git
previously approved these changes
Mar 17, 2026
scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/prompt_template.py
scripts/aidlc-evaluator/packages/ide-harness/src/ide_harness/prompt_template.py
…rompt_template.py Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
scripts/aidlc-evaluator/packages/contracttest/src/contracttest/server.py
scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/post_run.py
Codebuilder fixes Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
…/server.py Codebuilder fixes Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
…st_run.py Codebuilder fixes Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
Summary
`scripts/aidlc-evaluator/`: a complete automated evaluation framework for validating changes to the AIDLC workflows repository end-to-end

What's included
Packages (uv workspace with 8 internal packages):
- execution
- qualitative
- quantitative
- contracttest
- nonfunctional
- reporting
- cli-harness
- ide-harness

Other additions:
- `run.py`: master entry point dispatching to all evaluation modes
- `config/` with `default.yaml` and per-model overrides for 9 Bedrock models (Claude, Nova, Mistral families)
- `test_cases/` with two golden test cases: `sci-calc` (default) and `all-stages` (full workflow)
- `docker/sandbox/` for isolated execution of AI-generated code

Architecture
The pipeline uses a file-based YAML communication model — no in-process state crosses stage boundaries, making each package independently testable and runnable in isolation. The execution stage uses a two-agent Strands swarm (Executor + Simulator) with sandboxed file and command execution to safely run AI-generated code.
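A minimal sketch of the file-based handoff described above, under loud assumptions: the stage functions, the file names `execution.yaml` and `score.yaml`, and the fields are hypothetical, and the hand-rolled reader and writer cover only flat `key: value` mappings (a small YAML subset) so the example runs with the standard library alone. The real packages presumably use a full YAML library and richer schemas.

```python
import tempfile
from pathlib import Path


def write_flat_yaml(path, data):
    """Write a flat mapping as 'key: value' lines (a small YAML subset)."""
    path.write_text("".join(f"{key}: {value}\n" for key, value in data.items()))


def read_flat_yaml(path):
    """Read the flat subset written above; all values come back as strings."""
    result = {}
    for line in path.read_text().splitlines():
        key, _, value = line.partition(": ")
        result[key] = value
    return result


def execution_stage(run_dir):
    # Hypothetical stage: records its outcome on disk and returns nothing.
    write_flat_yaml(
        run_dir / "execution.yaml",
        {"test_case": "sci-calc", "status": "completed"},
    )


def scoring_stage(run_dir):
    # Hypothetical stage: reads only the upstream file, never in-process state.
    upstream = read_flat_yaml(run_dir / "execution.yaml")
    score = 1 if upstream["status"] == "completed" else 0
    write_flat_yaml(
        run_dir / "score.yaml",
        {"test_case": upstream["test_case"], "score": score},
    )


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    execution_stage(run_dir)
    scoring_stage(run_dir)
    report = read_flat_yaml(run_dir / "score.yaml")
```

Because each stage's only input is a file, `scoring_stage` can be tested by writing a fixture `execution.yaml` by hand, without running the execution stage at all.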
Test plan
- `uv sync` in `scripts/aidlc-evaluator/`
- `uv run python run.py test`: unit tests pass (note: 7 tests in `test_run_command.py` are expected to fail on Windows due to Unix shell commands)
- `uv run python run.py full` with valid AWS/Bedrock credentials runs end-to-end and produces a report under `runs/`
- `uv run python run.py batch --list` shows available model configs
- `uv run python run.py cli --list` shows `claude-code`, `kiro-cli`
- `uv run python run.py ide --list` shows IDE adapters

🤖 Generated with Claude Code