
feat: add AIDLC Evaluation & Reporting Framework#115

Merged
scottschreckengaust merged 29 commits into awslabs:main from harmjeff:feature/aidlc-evaluator
Mar 19, 2026

Conversation

@harmjeff
Contributor

@harmjeff commented Mar 13, 2026

Summary

  • Adds scripts/aidlc-evaluator/ — a complete automated evaluation framework for validating changes to the AIDLC workflows repository end-to-end
  • Implements a 6-stage pipeline: execution → post-run tests → quantitative analysis → contract testing → qualitative scoring → report generation
  • Supports single-model, batch (multi-model), cross-model comparison, CLI harness (Claude Code, Kiro CLI), and IDE harness (Cursor, Cline, Copilot, Kiro, Windsurf) evaluation modes
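
The 6-stage sequence above can be sketched as a simple driver loop. Stage names follow the summary; the function names, file layout, and stub behavior here are illustrative, not the framework's real API:

```python
from pathlib import Path

# Illustrative only: the real framework dispatches to separate uv-workspace
# packages; here each stage is a stub that records its completion on disk.
STAGES = ["execution", "post_run_tests", "quantitative",
          "contracttest", "qualitative", "reporting"]

def run_stage(name: str, run_dir: Path) -> Path:
    """Run one stage and return the artifact it wrote (stubbed here)."""
    artifact = run_dir / f"{name}.yaml"
    artifact.write_text(f"stage: {name}\nstatus: pass\n", encoding="utf-8")
    return artifact

def run_pipeline(run_dir: Path) -> list[Path]:
    """Run all stages strictly in order; each communicates only via files."""
    run_dir.mkdir(parents=True, exist_ok=True)
    return [run_stage(name, run_dir) for name in STAGES]
```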

What's included

Packages (uv workspace with 8 internal packages):

| Package | Purpose |
| --- | --- |
| execution | Two-agent Strands swarm that drives the full AIDLC workflow |
| qualitative | Semantic scoring of generated docs vs. golden baseline via Bedrock LLM |
| quantitative | Static analysis: linting (ruff/eslint), security (bandit/semgrep/npm audit), duplication (PMD CPD) |
| contracttest | Spins up generated apps and validates API endpoints against OpenAPI specs |
| nonfunctional | NFR evaluation: token consumption, execution timing, cross-model consistency |
| reporting | Consolidated Markdown + HTML report generation with baseline regression comparison |
| cli-harness | Adapter framework for CLI-based AI assistants |
| ide-harness | Adapter framework for third-party IDE AI assistants |

Other additions:

  • run.py master entry point dispatching to all evaluation modes
  • config/ with default.yaml and per-model overrides for 9 Bedrock models (Claude, Nova, Mistral families)
  • test_cases/ with two golden test cases: sci-calc (default) and all-stages (full workflow)
  • Docker sandbox (docker/sandbox/) for isolated execution of AI-generated code
  • Extension hook testing support for validating progressive disclosure opt-in configurations
  • Full unit test coverage across all packages (~180 test files, ~29K lines added)
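
A run.py entry point dispatching across evaluation modes could look like the sketch below. The mode names come from the test plan; the handler functions and flags are hypothetical stand-ins, not the actual run.py internals:

```python
import argparse

# Hypothetical dispatcher sketch; the real run.py's flags and internals
# are not shown in this PR description.
def cmd_full(args): print("running full pipeline")
def cmd_batch(args): print("running batch (multi-model) evaluation")
def cmd_cli(args): print("running CLI-harness evaluation")
def cmd_ide(args): print("running IDE-harness evaluation")
def cmd_test(args): print("running unit tests")

MODES = {"full": cmd_full, "batch": cmd_batch,
         "cli": cmd_cli, "ide": cmd_ide, "test": cmd_test}

def main(argv=None):
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("mode", choices=sorted(MODES))
    args = parser.parse_args(argv)
    MODES[args.mode](args)

if __name__ == "__main__":
    main()
```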

Architecture

The pipeline uses a file-based YAML communication model — no in-process state crosses stage boundaries, making each package independently testable and runnable in isolation. The execution stage uses a two-agent Strands swarm (Executor + Simulator) with sandboxed file and command execution to safely run AI-generated code.
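
A minimal sketch of that file-based handoff, assuming flat key/value stage results. The real packages presumably use a full YAML library; this stdlib-only version only illustrates the contract at the stage boundary:

```python
from pathlib import Path

def write_stage_result(path: Path, result: dict[str, str]) -> None:
    """Persist a flat stage result as `key: value` lines (a YAML subset)."""
    lines = [f"{key}: {value}" for key, value in result.items()]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")

def read_stage_result(path: Path) -> dict[str, str]:
    """Parse the flat `key: value` file back into a dict."""
    result = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        key, sep, value = line.partition(":")
        if sep:
            result[key.strip()] = value.strip()
    return result
```

Because each stage only reads the previous stage's file, any stage can be rerun in isolation against an existing run folder.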

Test plan

  • uv sync in scripts/aidlc-evaluator/
  • uv run python run.py test — unit tests pass (note: 7 tests in test_run_command.py are expected to fail on Windows due to Unix shell commands)
  • uv run python run.py full with valid AWS/Bedrock credentials runs end-to-end and produces a report under runs/
  • uv run python run.py batch --list shows available model configs
  • uv run python run.py cli --list shows claude-code, kiro-cli
  • uv run python run.py ide --list shows IDE adapters
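
The `--list` behavior above can be as simple as enumerating the per-model override files that sit next to default.yaml in config/; this is a guess at the mechanism, not the framework's actual code:

```python
from pathlib import Path

def list_model_configs(config_dir: Path) -> list[str]:
    """Return model config names, excluding the shared default.yaml."""
    return sorted(p.stem for p in config_dir.glob("*.yaml")
                  if p.stem != "default")
```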

🤖 Generated with Claude Code

Evaluation and reporting framework for validating AI-DLC workflow changes.
Includes execution, qualitative/quantitative scoring, contract testing,
reporting packages, and CLI/IDE harness adapters.

Also fixes pytest import-mode collision for same-named test files across
packages, and documents known Windows test_run_command.py failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
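
The pytest collision mentioned here occurs because test files with identical basenames in different packages (e.g. several test_run.py files) clash under pytest's default prepend import mode. The usual remedy, assumed here since the PR does not show the exact change, is switching collection to importlib mode:

```toml
# pyproject.toml — assumed fix, not quoted from this PR. importlib mode
# gives each test file a unique module identity, so same-named test
# files across packages no longer collide during collection.
[tool.pytest.ini_options]
addopts = "--import-mode=importlib"
```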
@harmjeff requested a review from a team as a code owner March 13, 2026 16:23

@MichaelWalker-git left a comment


LGTM

Member

@scottschreckengaust left a comment


fix failed tests

harmjeff and others added 2 commits March 16, 2026 23:24
Replace shell-specific commands with Python equivalents to ensure tests
pass on all platforms (Windows/Mac/Linux) when using shell=False:

- Replace `echo 'content' > file` with Python pathlib file writing
- Replace shell builtin `exit N` with Python `sys.exit(N)`
- Replace `echo 'msg' >&2` with Python `sys.stderr.write()`
- Update command-not-found test to handle both OSError and exit code 127

All 245 tests now pass successfully on Windows.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
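
The substitutions described in this commit can be shown in a minimal, self-contained form; the actual test code isn't included in the PR description, so these commands are representative only:

```python
import subprocess
import sys

# Instead of the shell builtin `exit 3` (which needs shell=True and a
# Unix shell), spawn the current interpreter with shell=False:
exit_result = subprocess.run(
    [sys.executable, "-c", "import sys; sys.exit(3)"], shell=False)

# Instead of `echo 'msg' >&2`, write to stderr from Python:
stderr_result = subprocess.run(
    [sys.executable, "-c", "import sys; sys.stderr.write('msg\\n')"],
    shell=False, capture_output=True, text=True)
```

Invoking `sys.executable` keeps the commands portable across Windows, macOS, and Linux, since no shell interpretation is involved.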
@harmjeff
Contributor Author

I fixed the Mac issues.

harmjeff and others added 2 commits March 17, 2026 15:16
…rompt_template.py

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
…rompt_template.py

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
harmjeff and others added 10 commits March 19, 2026 10:52
Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
…/server.py


Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
…st_run.py


Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
…st_run.py


Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
…st_run.py


Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
…st_run.py


Codebuilder fixes

Co-authored-by: Scott Schreckengaust <scottschreckengaust@users.noreply.github.com>
@scottschreckengaust
Member

============================================================
  Evaluation Complete
============================================================
  Run folder:              ~/github.com/awslabs/aidlc-workflows/scripts/aidlc-evaluator/runs/sci-calc/20260319T152257-aidlc-workflows_main
  Run metrics:             ~/github.com/awslabs/aidlc-workflows/scripts/aidlc-evaluator/runs/sci-calc/20260319T152257-aidlc-workflows_main/run-metrics.yaml
  Post-run tests:          PASS  (168 passed, 0 failed (of 168))
  Code quality:            PASS  (lint: 0 finding(s), 0 error(s); security: 1 finding(s), 0 high)
  Contract tests:          PASS  (88/88 passed, 0 failed, 0 errors)
  Qualitative comparison:  ~/github.com/awslabs/aidlc-workflows/scripts/aidlc-evaluator/runs/sci-calc/20260319T152257-aidlc-workflows_main/qualitative-comparison.yaml
  Qualitative score:       (see above)

Member

@scottschreckengaust left a comment


works

Contributor

@Kalindi-Dev left a comment


LGTM

@scottschreckengaust added this pull request to the merge queue Mar 19, 2026
Merged via the queue into awslabs:main with commit aaca23d Mar 19, 2026
@scottschreckengaust deleted the feature/aidlc-evaluator branch March 19, 2026 17:06