Implement automated eval test suite for Angular Skills #17007
base: master
Changes from 2 commits
5b7cca0
23aecf0
ac1335a
f807aa3
c183089
6e7b838
b2047d8
1691296
2df335e
b22b13f
c684351
94d4bf8
18f3e25
568b04d
5da6711
b181ca0
665264b
b3fa973
1330989
a9da524
566551b
148691b
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,88 @@ | ||
| name: Skill Eval | ||
|
|
||
| on: | ||
| pull_request: | ||
| paths: | ||
| - 'skills/**' | ||
| - 'evals/**' | ||
|
|
||
| jobs: | ||
| eval: | ||
| runs-on: ubuntu-latest | ||
| timeout-minutes: 30 | ||
|
|
||
| steps: | ||
| - name: Checkout repository | ||
| uses: actions/checkout@v4 | ||
|
|
||
| - name: Set up Node.js | ||
| uses: actions/setup-node@v4 | ||
| with: | ||
| node-version: '20' | ||
|
|
||
| - name: Install eval dependencies | ||
| working-directory: evals | ||
| run: npm install | ||
|
|
||
| - name: Run skill evals | ||
| working-directory: evals | ||
| run: npx skill-eval _ --suite=all --trials=5 | ||
| env: | ||
| ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} | ||
| GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }} | ||
|
|
||
| - name: Upload results | ||
| if: always() | ||
| uses: actions/upload-artifact@v4 | ||
| with: | ||
| name: skill-eval-results | ||
| path: evals/results/ | ||
| retention-days: 30 | ||
|
|
||
| - name: Post summary comment | ||
| if: always() && github.event_name == 'pull_request' | ||
| uses: actions/github-script@v7 | ||
| with: | ||
| script: | | ||
| const fs = require('fs'); | ||
| const path = require('path'); | ||
|
|
||
| const resultsDir = 'evals/results'; | ||
| let summary = '## 📊 Skill Eval Results\n\n'; | ||
|
|
||
| try { | ||
| const files = fs.readdirSync(resultsDir).filter(f => f.endsWith('.json')); | ||
| if (files.length === 0) { | ||
| summary += '> ⚠️ No eval results found. The eval run may have failed.\n'; | ||
| } else { | ||
| summary += '| Task | Pass Rate | pass@5 | Status |\n'; | ||
| summary += '|---|---|---|---|\n'; | ||
|
|
||
| for (const file of files) { | ||
| try { | ||
| const data = JSON.parse(fs.readFileSync(path.join(resultsDir, file), 'utf8')); | ||
| const taskName = data.task || file.replace('.json', ''); | ||
| const passRate = data.passRate != null ? `${(data.passRate * 100).toFixed(0)}%` : 'N/A'; | ||
| const passAtK = data.passAtK != null ? `${(data.passAtK * 100).toFixed(0)}%` : 'N/A'; | ||
| const status = data.passAtK >= 0.8 ? '✅' : data.passAtK >= 0.6 ? '⚠️' : '❌'; | ||
| summary += `| ${taskName} | ${passRate} | ${passAtK} | ${status} |\n`; | ||
| } catch (e) { | ||
| summary += `| ${file} | Error | Error | ❌ |\n`; | ||
| } | ||
| } | ||
|
|
||
| summary += '\n### Thresholds\n'; | ||
| summary += '- ✅ `pass@5 ≥ 80%` — merge gate passed\n'; | ||
| summary += '- ⚠️ `pass@5 ≥ 60%` — needs investigation\n'; | ||
| summary += '- ❌ `pass@5 < 60%` — blocks merge for affected skill\n'; | ||
| } | ||
| } catch (e) { | ||
| summary += `> ⚠️ Could not read results: ${e.message}\n`; | ||
| } | ||
|
|
||
| await github.rest.issues.createComment({ | ||
| owner: context.repo.owner, | ||
| repo: context.repo.repo, | ||
| issue_number: context.issue.number, | ||
| body: summary, | ||
| }); | ||
|
||
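The Status column in the summary script above comes from two cut-offs on `passAtK`. A minimal sketch of that mapping as standalone functions follows, assuming the same result fields (`task`, `passRate`, `passAtK`) that the workflow script reads. The names `statusFor`, `formatPercent`, and `formatRow` are hypothetical, and reporting a missing `passAtK` as `N/A` rather than `❌` is a choice made for this sketch, not the workflow's current behaviour.

```javascript
// Hypothetical helpers; field names follow the workflow script above.
const MERGE_GATE = 0.8;  // pass@5 at or above this passes the merge gate
const INVESTIGATE = 0.6; // pass@5 at or above this, but below the gate, needs investigation

function statusFor(passAtK) {
  // Assumption of this sketch: a missing metric is reported as "N/A"
  // instead of falling through to the failure branch.
  if (passAtK == null || Number.isNaN(passAtK)) return 'N/A';
  if (passAtK >= MERGE_GATE) return '✅';
  if (passAtK >= INVESTIGATE) return '⚠️';
  return '❌';
}

function formatPercent(value) {
  return value != null ? `${(value * 100).toFixed(0)}%` : 'N/A';
}

function formatRow(data, fallbackName) {
  const taskName = data.task || fallbackName;
  return `| ${taskName} | ${formatPercent(data.passRate)} | ` +
    `${formatPercent(data.passAtK)} | ${statusFor(data.passAtK)} |`;
}

console.log(formatRow({ task: 'grid-basic-setup', passRate: 0.5, passAtK: 0.75 }, 'grid'));
```

Keeping the mapping in one function makes the cut-offs easy to unit test and keeps the comment table and the threshold legend from drifting apart.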
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,155 @@ | ||
| # Ignite UI for Angular — Skill Evals | ||
|
|
||
| Automated evaluation suite for the Ignite UI for Angular agent skills. Uses the | ||
| [skill-eval](https://github.com/mgechev/skill-eval) framework to measure skill | ||
| quality, detect regressions, and gate merges. | ||
|
|
||
| ## Overview | ||
|
|
||
| The suite tests three skills: | ||
|
|
||
| | Skill | Task ID | What it tests | | ||
| |---|---|---| | ||
| | `igniteui-angular-grids` | `grid-basic-setup` | Flat grid with sorting and pagination on employee data | | ||
| | `igniteui-angular-components` | `component-combo-reactive-form` | Multi-select combo bound to a reactive form control | | ||
| | `igniteui-angular-theming` | `theming-palette-generation` | Custom branded palette with `palette()` and `theme()` | | ||
|
|
||
| Each task includes: | ||
|
|
||
| - **`instruction.md`** — the prompt given to the agent | ||
| - **`tests/test.sh`** — deterministic grader (file checks, compilation, lint) | ||
| - **`prompts/quality.md`** — LLM rubric grader (intent routing, API usage) | ||
| - **`solution/solve.sh`** — reference solution for baseline validation | ||
| - **`environment/Dockerfile`** — isolated environment for agent execution | ||
| - **`skills/`** — symlinked or copied skill files under test | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - Node.js 20+ | ||
| - Docker (for isolated agent execution) | ||
| - An API key for the agent provider (Gemini or Anthropic) | ||
|
|
||
| ## Running Evals Locally | ||
|
|
||
| ### Install dependencies | ||
|
|
||
| ```bash | ||
| cd evals | ||
| npm install | ||
| ``` | ||
|
|
||
| ### Run a single task | ||
|
|
||
| ```bash | ||
| # Gemini (default) | ||
| GEMINI_API_KEY=your-key npm run eval -- grid-basic-setup | ||
|
|
||
| # Claude | ||
| ANTHROPIC_API_KEY=your-key npm run eval -- grid-basic-setup --agent=claude | ||
| ``` | ||
|
|
||
| ### Run all tasks | ||
|
|
||
| ```bash | ||
| GEMINI_API_KEY=your-key npm run eval:all | ||
| ``` | ||
|
|
||
| ### Options | ||
|
|
||
| ```bash | ||
| # Adjust trials (default: 5) | ||
| npm run eval -- grid-basic-setup --trials=10 | ||
|
|
||
| # Run locally without Docker | ||
| npm run eval -- grid-basic-setup --provider=local | ||
|
|
||
| # Validate graders against the reference solution | ||
| npm run eval -- grid-basic-setup --validate --provider=local | ||
|
|
||
| # Run multiple trials in parallel | ||
| npm run eval -- grid-basic-setup --parallel=3 | ||
| ``` | ||
|
|
||
| ### Preview results | ||
|
|
||
| ```bash | ||
| # CLI report | ||
| npm run preview | ||
|
|
||
| # Web UI at http://localhost:3847 | ||
| npm run preview:browser | ||
| ``` | ||
|
|
||
| ## Adding a New Task | ||
|
|
||
| 1. Create a directory under `evals/tasks/<task-id>/` with the standard structure: | ||
|
|
||
| ``` | ||
| tasks/<task-id>/ | ||
| ├── task.toml # Config: graders, timeouts, resource limits | ||
| ├── instruction.md # Agent prompt | ||
| ├── environment/Dockerfile # Container setup | ||
| ├── tests/test.sh # Deterministic grader | ||
| ├── prompts/quality.md # LLM rubric grader | ||
| ├── solution/solve.sh # Reference solution | ||
| └── skills/ # Skill files under test | ||
| └── <skill-name>/SKILL.md | ||
| ``` | ||
|
|
||
| 2. Write a clear, unambiguous `instruction.md` that tells the agent exactly what | ||
| to build. | ||
|
|
||
| 3. Write `tests/test.sh` to check **outcomes** (files exist, project compiles, | ||
| correct selectors are present) rather than specific steps. | ||
|
|
||
| 4. Write `prompts/quality.md` with rubric dimensions that sum to 1.0. | ||
|
|
||
| 5. Write `solution/solve.sh` — a shell script that proves the task is solvable | ||
| and validates that the graders work correctly. | ||
|
|
||
| 6. Validate graders before submitting: | ||
|
|
||
| ```bash | ||
| npm run eval -- <task-id> --validate --provider=local | ||
| ``` | ||
|
|
||
| ## Pass / Fail Thresholds | ||
|
|
||
| Following [Anthropic's recommendations](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents): | ||
|
|
||
| | Condition | Role | Effect | | ||
| |---|---|---| | ||
| | `pass@5 ≥ 80%` | **Merge gate** | The estimated chance of at least 1 success in 5 trials must reach 80% | | ||
| | `pass^5 ≥ 60%` | **Tracked** | The estimated chance that all 5 trials succeed; tracked against a 60% target to flag flaky skills for investigation | | ||
| | `pass@5 < 60%` | **Blocks merge** | On PRs touching the relevant skill | | ||
|
|
||
| ## CI Integration | ||
|
|
||
| The GitHub Actions workflow at `.github/workflows/skill-eval.yml` runs | ||
| automatically on PRs that modify `skills/**` or `evals/**`. It: | ||
|
|
||
| 1. Checks out the repo | ||
| 2. Installs eval dependencies | ||
| 3. Runs all tasks with 5 trials | ||
| 4. Uploads results as an artifact | ||
| 5. Posts a summary comment on the PR | ||
|
|
||
| ## Grading Strategy | ||
|
|
||
| **Deterministic grader (60% weight)** — checks: | ||
| - Project builds without errors | ||
| - Correct Ignite UI selector is present in the generated template | ||
| - Required imports exist | ||
| - No use of forbidden alternatives | ||
|
|
||
|
|
||
| **LLM rubric grader (40% weight)** — evaluates: | ||
| - Correct intent routing | ||
| - Idiomatic API usage | ||
| - Absence of hallucinated APIs | ||
| - Following the skill's guidance | ||
|
|
||
| ## Results | ||
|
|
||
| Baseline results are stored in `evals/results/baseline.json` and used for | ||
| regression comparison on PRs. The CI workflow uploads per-run results as | ||
| GitHub Actions artifacts. | ||
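The `pass@5` and `pass^5` metrics in the README's threshold table measure different things: the first asks whether at least one of five trials succeeds, the second whether all five do. A minimal sketch of how both could be estimated from recorded trial outcomes is shown below. It assumes independent trials and a simple empirical success rate; the function names are hypothetical, and skill-eval may compute these metrics differently.

```javascript
// Sketch only: not taken from skill-eval's implementation.

// Fraction of trials that succeeded.
function passRate(trials) {
  if (trials.length === 0) throw new Error('no trials recorded');
  return trials.filter(Boolean).length / trials.length;
}

// pass@k: estimated probability that at least one of k independent trials succeeds.
function passAtK(p, k) {
  return 1 - Math.pow(1 - p, k);
}

// pass^k: estimated probability that all k independent trials succeed.
function passPowK(p, k) {
  return Math.pow(p, k);
}

const trials = [true, false, true, false]; // hypothetical outcomes
const p = passRate(trials);
console.log({ p, passAt5: passAtK(p, 5), passPow5: passPowK(p, 5) });
```

At a per-trial success rate of 50%, `pass@5` is about 97% while `pass^5` is about 3%, which is why a skill can clear the merge gate and still be flagged as flaky.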
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| { | ||
| "name": "igniteui-angular-skill-evals", | ||
| "version": "1.0.0", | ||
| "description": "Evaluation suite for Ignite UI for Angular agent skills", | ||
| "private": true, | ||
| "scripts": { | ||
| "eval": "npx skill-eval", | ||
| "eval:grid": "npx skill-eval grid-basic-setup", | ||
| "eval:combo": "npx skill-eval component-combo-reactive-form", | ||
| "eval:theming": "npx skill-eval theming-palette-generation", | ||
| "eval:all": "npx skill-eval _ --suite=all", | ||
| "preview": "npx skill-eval preview", | ||
| "preview:browser": "npx skill-eval preview browser" | ||
| }, | ||
| "dependencies": { | ||
| "skill-eval": "^1.0.0" | ||
| }, | ||
| "engines": { | ||
| "node": ">=20.0.0" | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| { | ||
| "generated_at": "2026-03-08T07:00:00.000Z", | ||
| "framework_version": "1.0.0", | ||
| "description": "Initial baseline results for skill evals. Actual scores will be populated after the first full eval run with an API key.", | ||
| "thresholds": { | ||
| "pass_at_5_merge_gate": 0.8, | ||
| "pass_at_5_block": 0.6, | ||
| "pass_pow_5_tracked": 0.6 | ||
| }, | ||
| "tasks": { | ||
| "grid-basic-setup": { | ||
| "skill": "igniteui-angular-grids", | ||
| "trials": 5, | ||
| "pass_rate": null, | ||
| "pass_at_5": null, | ||
| "pass_pow_5": null, | ||
| "status": "pending_first_run" | ||
| }, | ||
| "component-combo-reactive-form": { | ||
| "skill": "igniteui-angular-components", | ||
| "trials": 5, | ||
| "pass_rate": null, | ||
| "pass_at_5": null, | ||
| "pass_pow_5": null, | ||
| "status": "pending_first_run" | ||
| }, | ||
| "theming-palette-generation": { | ||
| "skill": "igniteui-angular-theming", | ||
| "trials": 5, | ||
| "pass_rate": null, | ||
| "pass_at_5": null, | ||
| "pass_pow_5": null, | ||
| "status": "pending_first_run" | ||
| } | ||
| } | ||
| } |
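The README states that this baseline is used for regression comparison on PRs. A rough sketch of such a comparison follows, assuming the key names from the baseline file above (`tasks`, `pass_at_5`, `thresholds.pass_at_5_merge_gate`, `thresholds.pass_at_5_block`). The shape of the `current` input (task ID mapped to a `pass@5` value), the `compareToBaseline` name, and the returned labels are hypothetical; the diff does not show how skill-eval performs this comparison.

```javascript
// Hypothetical comparison of a run against the baseline file's structure.
function compareToBaseline(baseline, current) {
  const { pass_at_5_merge_gate: gate, pass_at_5_block: block } = baseline.thresholds;
  const report = {};
  for (const [taskId, base] of Object.entries(baseline.tasks)) {
    const now = current[taskId];
    if (now == null) {
      report[taskId] = 'missing';      // task was not run
    } else if (now < block) {
      report[taskId] = 'blocked';      // below the blocking threshold
    } else if (base.pass_at_5 != null && now < base.pass_at_5) {
      report[taskId] = 'regressed';    // worse than the recorded baseline
    } else if (now < gate) {
      report[taskId] = 'investigate';  // between the block and gate thresholds
    } else {
      report[taskId] = 'ok';
    }
  }
  return report;
}

// Example with one task still pending its first run (null baseline).
const exampleBaseline = {
  thresholds: { pass_at_5_merge_gate: 0.8, pass_at_5_block: 0.6 },
  tasks: {
    'grid-basic-setup': { pass_at_5: null },
    'theming-palette-generation': { pass_at_5: 0.9 },
  },
};
console.log(compareToBaseline(exampleBaseline, {
  'grid-basic-setup': 0.85,
  'theming-palette-generation': 0.85,
}));
```

A `null` baseline, as in the `pending_first_run` entries, is treated here as "nothing to regress from", so only the absolute thresholds apply until the first full run populates the file.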
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| FROM node:20-slim | ||
|
|
||
| WORKDIR /workspace | ||
|
|
||
| RUN npm install -g @angular/cli@latest | ||
|
|
||
| RUN ng new eval-app --skip-git --skip-install --style=scss --ssr=false && \ | ||
| cd eval-app && \ | ||
| npm install && \ | ||
| npm install igniteui-angular | ||
|
|
||
| WORKDIR /workspace/eval-app | ||
|
|
||
| COPY . . | ||
|
|
||
| RUN mkdir -p logs/verifier | ||
| CMD ["bash"] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,40 @@ | ||
| # Task: Add a Multi-Select Combo in a Reactive Form | ||
|
|
||
| You are working in an Angular 20+ project that already has `igniteui-angular` installed and a theme applied. | ||
|
|
||
| ## Requirements | ||
|
|
||
| Create a `UserSettingsComponent` with a reactive form that includes a multi-select combo for choosing notification channels. | ||
|
|
||
| 1. **Component location**: `src/app/user-settings/user-settings.component.ts` (with its template) | ||
|
|
||
| 2. **Form structure**: Create a reactive form (`FormGroup`) with a `notificationChannels` control | ||
|
|
||
| 3. **Data source**: Use the following list of notification channels: | ||
|
|
||
| ```typescript | ||
| channels = [ | ||
| { id: 1, name: 'Email', icon: 'email' }, | ||
| { id: 2, name: 'SMS', icon: 'sms' }, | ||
| { id: 3, name: 'Push Notification', icon: 'notifications' }, | ||
| { id: 4, name: 'Slack', icon: 'chat' }, | ||
| { id: 5, name: 'Microsoft Teams', icon: 'groups' }, | ||
| ]; | ||
| ``` | ||
|
|
||
| 4. **Combo configuration**: | ||
| - Use the Ignite UI for Angular Combo component for multi-selection | ||
| - Bind it to the `notificationChannels` form control | ||
| - Display the `name` field in the dropdown | ||
| - Use the `id` field as the value key | ||
|
|
||
| 5. **Form validation**: The `notificationChannels` control must be required (at least one channel must be selected) | ||
|
|
||
| 6. **Submit button**: Add a submit button that is disabled when the form is invalid | ||
|
|
||
| ## Constraints | ||
|
|
||
| - Use the Ignite UI `igx-combo` component — do NOT use a native `<select multiple>`, `igx-select`, or Angular Material `mat-select`. | ||
| - Import from the correct `igniteui-angular` entry point. | ||
| - The component must be standalone and use `ChangeDetectionStrategy.OnPush`. | ||
| - Use reactive forms (`FormGroup` / `FormControl`), not template-driven forms. |
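The constraints above lend themselves to outcome-based checks of the kind the README assigns to the deterministic grader (correct selector present, no forbidden alternatives). The sketch below illustrates the idea with plain regular expressions over the generated template and component source. It is an approximation only: the actual grader is `tests/test.sh`, which this diff does not show, and the `checkComboTask` name and the sample strings are hypothetical.

```javascript
// Hypothetical outcome checks for the combo task; regexes are a rough approximation.
function checkComboTask(template, componentSource) {
  const forbidden = [
    { name: 'native <select multiple>', pattern: /<select\b[^>]*\bmultiple\b/i },
    { name: 'igx-select', pattern: /<igx-select(?=[\s>/])/i },
    { name: 'mat-select', pattern: /<mat-select(?=[\s>/])/i },
  ];
  return {
    // Required selector from the task constraints.
    usesIgxCombo: /<igx-combo(?=[\s>/])/i.test(template),
    // Names of any forbidden alternatives found in the template.
    forbiddenFound: forbidden.filter(f => f.pattern.test(template)).map(f => f.name),
    usesOnPush: /ChangeDetectionStrategy\.OnPush/.test(componentSource),
    usesReactiveForms: /\b(FormGroup|FormControl)\b/.test(componentSource),
  };
}

// Sample inputs, made up for illustration.
const sampleTemplate = '<form [formGroup]="form"><igx-combo formControlName="notificationChannels"></igx-combo></form>';
const sampleSource = 'changeDetection: ChangeDetectionStrategy.OnPush; form = new FormGroup({});';
console.log(checkComboTask(sampleTemplate, sampleSource));
```

Text matching like this only confirms that the expected building blocks appear; a build step is still needed to confirm the component compiles.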