Commit 87542d9

feat: built-in PR comments, baseline comparison, release cleanup
- PR comment is now built into the action (`post_pr_comment` input, default `true`). Users no longer need a separate github-script step.
- Fetch previous experiment scores and show deltas in the summary table (arrow + percentage-point change per metric)
- Delete e2e-test.yml (contained a hardcoded tunnel URL)
- Fix README metrics table to use snake_case names with categories (Universal, RAG, Agent) and add all 17 supported metrics
- Simplify the "Run on Every PR" example to just 5 lines of YAML

Made-with: Cursor
1 parent fc6b4a5 commit 87542d9
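The baseline comparison described in the commit message renders each metric's change against the previous experiment as an arrow plus a percentage-point delta. A minimal sketch of that formatting (a hypothetical helper, not the action's actual code; it assumes scores are 0–1 fractions):

```python
def format_delta(current: float, previous: float) -> str:
    """Render the change vs. the previous experiment as an arrow
    plus a percentage-point difference, e.g. '▲ +3.0pp'."""
    delta_pp = (current - previous) * 100  # 0..1 scores -> percentage points
    if delta_pp > 0:
        arrow = "▲"
    elif delta_pp < 0:
        arrow = "▼"
    else:
        arrow = "="
    return f"{arrow} {delta_pp:+.1f}pp"

# Example: correctness went from 0.72 to 0.75
print(format_delta(0.75, 0.72))  # ▲ +3.0pp
```

The string would drop into one extra column of the existing Markdown summary table, one cell per metric.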

4 files changed: +173 −95 lines changed


.github/workflows/e2e-test.yml

Lines changed: 0 additions & 40 deletions
This file was deleted.

README.md

Lines changed: 28 additions & 42 deletions
````diff
@@ -62,7 +62,7 @@ The action will:
 
 ### Run on Every Pull Request
 
-Copy this file to `.github/workflows/llm-eval.yml` in your repository. That's it — every PR against `main` or `develop` will be evaluated automatically.
+Copy this file to `.github/workflows/llm-eval.yml` in your repository. That's it — every PR will be evaluated and results posted as a comment automatically.
 
 ```yaml
 # .github/workflows/llm-eval.yml
@@ -83,7 +83,6 @@ jobs:
       - uses: actions/checkout@v4
 
       - name: Run evaluation
-        id: eval
         uses: verifywise-ai/verifywise-eval-action@v1
         with:
           api_url: https://app.verifywise.ai
@@ -95,37 +94,15 @@ jobs:
           threshold: '0.7'
           vw_api_token: ${{ secrets.VW_API_TOKEN }}
           llm_api_key: ${{ secrets.LLM_API_KEY }}
-
-      # Optional: post results as a PR comment
-      - name: Comment on PR
-        if: always() && github.event_name == 'pull_request'
-        uses: actions/github-script@v7
-        with:
-          script: |
-            const fs = require('fs');
-            const path = '${{ steps.eval.outputs.summary_path }}';
-            if (!path || !fs.existsSync(path)) return;
-            const body = fs.readFileSync(path, 'utf8');
-            const tag = '<!-- verifywise-eval -->';
-            const { data: comments } = await github.rest.issues.listComments({
-              owner: context.repo.owner, repo: context.repo.repo,
-              issue_number: context.issue.number,
-            });
-            const prev = comments.find(c => c.body.includes(tag));
-            const full = `${tag}\n${body}`;
-            if (prev) {
-              await github.rest.issues.updateComment({
-                owner: context.repo.owner, repo: context.repo.repo,
-                comment_id: prev.id, body: full,
-              });
-            } else {
-              await github.rest.issues.createComment({
-                owner: context.repo.owner, repo: context.repo.repo,
-                issue_number: context.issue.number, body: full,
-              });
-            }
 ```
 
+The action automatically:
+- Runs the evaluation and waits for results
+- Posts a summary comment on the PR (updates the same comment on re-runs)
+- Fails the check if any metric is below threshold
+- Compares scores against the previous experiment (shows deltas)
+- Uploads JSON results and Markdown summary as build artifacts
+
 **Required secrets** — add these in your repo's Settings > Secrets and variables > Actions:
 
 | Secret | Required | Where to get it |
@@ -158,6 +135,7 @@ jobs:
 | `poll_interval_seconds` | no | `15` | Seconds between status checks |
 | `experiment_name` | no | *(auto)* | Custom name for the experiment |
 | `fail_on_threshold` | no | `true` | Set to `false` to report without failing |
+| `post_pr_comment` | no | `true` | Post results as a comment on the PR |
 
 ## Outputs
 
@@ -172,17 +150,25 @@ jobs:
 
 ## Metrics
 
-| Metric | Direction | What it measures |
-|--------|-----------|------------------|
-| `correctness` | Higher is better | Are the answers factually right? |
-| `completeness` | Higher is better | Does the answer cover all parts of the question? |
-| `answerRelevancy` | Higher is better | Is the response relevant to what was asked? |
-| `faithfulness` | Higher is better | Is the response grounded in the provided context? |
-| `contextualPrecision` | Higher is better | Is the retrieved context precise and relevant? |
-| `contextualRecall` | Higher is better | Was all relevant context retrieved? |
-| `hallucination` | **Lower is better** | How much of the response is fabricated? |
-| `toxicity` | **Lower is better** | Does the response contain harmful content? |
-| `bias` | **Lower is better** | Does the response exhibit unfair bias? |
+| Metric | Category | Direction | What it measures |
+|--------|----------|-----------|------------------|
+| `answer_relevancy` | Universal | Higher is better | Is the response relevant to what was asked? |
+| `correctness` | Universal | Higher is better | Are the answers factually right? |
+| `completeness` | Universal | Higher is better | Does the answer cover all parts of the question? |
+| `instruction_following` | Universal | Higher is better | Does the response follow the instructions? |
+| `hallucination` | Universal | **Lower is better** | How much of the response is fabricated? |
+| `toxicity` | Universal | **Lower is better** | Does the response contain harmful content? |
+| `bias` | Universal | **Lower is better** | Does the response exhibit unfair bias? |
+| `faithfulness` | RAG | Higher is better | Is the response grounded in the provided context? |
+| `contextual_relevancy` | RAG | Higher is better | Is the retrieved context relevant? |
+| `context_precision` | RAG | Higher is better | Is the retrieved context precise? |
+| `context_recall` | RAG | Higher is better | Was all relevant context retrieved? |
+| `tool_correctness` | Agent | Higher is better | Are the right tools selected? |
+| `argument_correctness` | Agent | Higher is better | Are tool arguments correct? |
+| `task_completion` | Agent | Higher is better | Is the overall task completed? |
+| `step_efficiency` | Agent | Higher is better | Are steps efficient (no redundancy)? |
+| `plan_quality` | Agent | Higher is better | Is the execution plan well-structured? |
+| `plan_adherence` | Agent | Higher is better | Does execution follow the plan? |
 
 **How thresholds work:** For standard metrics (higher is better), the score must be **at or above** the threshold to pass. For inverted metrics (lower is better), the score must be **at or below** the threshold to pass. A threshold of `0.7` means "70% correct is the minimum" for standard metrics, or "30% hallucination is the maximum" for inverted ones.
````
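The README's threshold rule comes down to a small predicate. A sketch of the stated "at or above / at or below" logic (illustrative only, not the action's implementation; `INVERTED_METRICS` and `passes_threshold` are hypothetical names):

```python
# Metrics where lower scores are better; the pass rule flips for these.
INVERTED_METRICS = {"hallucination", "toxicity", "bias"}

def passes_threshold(metric: str, score: float, threshold: float) -> bool:
    """Standard metrics pass at or above the threshold;
    inverted metrics pass at or below it."""
    if metric in INVERTED_METRICS:
        return score <= threshold
    return score >= threshold

print(passes_threshold("correctness", 0.72, 0.7))      # True
print(passes_threshold("hallucination", 0.25, 0.3))    # True
print(passes_threshold("hallucination", 0.45, 0.3))    # False
```

Under this literal rule, an inverted metric's maximum allowed score is the threshold value itself.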

action.yml

Lines changed: 37 additions & 2 deletions
```diff
@@ -69,6 +69,10 @@ inputs:
     description: 'Fail the step when thresholds are not met'
     required: false
     default: 'true'
+  post_pr_comment:
+    description: 'Post results as a comment on the pull request'
+    required: false
+    default: 'true'
   vw_api_token:
     description: 'VerifyWise API token'
     required: true
@@ -164,12 +168,10 @@ runs:
           exit 0
         fi
 
-        # Write Markdown summary to Job Summary (shown on the run page)
         if [ -f "$SUMMARY_PATH" ]; then
           cat "$SUMMARY_PATH" >> "$GITHUB_STEP_SUMMARY"
         fi
 
-        # Parse results, create annotations, set outputs
        python3 << 'PYEOF'
        import json, os
 
@@ -207,6 +209,39 @@ runs:
            raise SystemExit(1)
        PYEOF
 
+    - name: Post PR comment
+      if: always() && inputs.post_pr_comment == 'true' && github.event_name == 'pull_request'
+      uses: actions/github-script@v7
+      with:
+        script: |
+          const fs = require('fs');
+          const summaryPath = '${{ steps.eval.outputs.summary_path }}';
+          if (!summaryPath || !fs.existsSync(summaryPath)) return;
+          const body = fs.readFileSync(summaryPath, 'utf8');
+          const marker = '<!-- verifywise-eval -->';
+          const { data: comments } = await github.rest.issues.listComments({
+            owner: context.repo.owner,
+            repo: context.repo.repo,
+            issue_number: context.issue.number,
+          });
+          const existing = comments.find(c => c.body.includes(marker));
+          const fullBody = `${marker}\n${body}`;
+          if (existing) {
+            await github.rest.issues.updateComment({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              comment_id: existing.id,
+              body: fullBody,
+            });
+          } else {
+            await github.rest.issues.createComment({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              issue_number: context.issue.number,
+              body: fullBody,
+            });
+          }
+
     - name: Upload results
       if: always()
       uses: actions/upload-artifact@v7
```
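Since the comment step defaults on, a consumer who wants the old no-comment behaviour would presumably opt out via the new input. A hedged usage sketch (mirrors the README example; only `post_pr_comment` is the input added by this commit):

```yaml
- name: Run evaluation
  uses: verifywise-ai/verifywise-eval-action@v1
  with:
    api_url: https://app.verifywise.ai
    post_pr_comment: 'false'   # new in this commit; defaults to 'true'
    vw_api_token: ${{ secrets.VW_API_TOKEN }}
    llm_api_key: ${{ secrets.LLM_API_KEY }}
```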
