- PR comment is now built into the action (post_pr_comment input,
default true). Users no longer need a separate github-script step.
- Fetch previous experiment scores and show deltas in the summary
table (arrow + percentage point change per metric)
- Delete e2e-test.yml (contained hardcoded tunnel URL)
- Fix README metrics table to use snake_case names with categories
(Universal, RAG, Agent) and add all 17 supported metrics
- Simplify the "Run on Every PR" example to just 5 lines of YAML
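The delta column described in the summary-table bullet could be produced by a helper along these lines (a hypothetical sketch, not the action's actual code; the function name and arrow glyphs are assumptions):

```python
def format_delta(prev_score: float, curr_score: float) -> str:
    """Format the change between two 0-1 metric scores as an arrow
    plus a signed percentage-point difference (hypothetical helper)."""
    delta_pp = (curr_score - prev_score) * 100  # convert to percentage points
    if delta_pp > 0:
        arrow = "▲"
    elif delta_pp < 0:
        arrow = "▼"
    else:
        arrow = "="
    return f"{arrow} {delta_pp:+.1f}pp"
```

For example, a metric that moved from 0.70 in the previous experiment to 0.75 would render as `▲ +5.0pp` in the summary table.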
Made-with: Cursor
README.md: 28 additions & 42 deletions
@@ -62,7 +62,7 @@ The action will:
### Run on Every Pull Request
- Copy this file to `.github/workflows/llm-eval.yml` in your repository. That's it — every PR against `main` or `develop` will be evaluated automatically.
+ Copy this file to `.github/workflows/llm-eval.yml` in your repository. That's it — every PR will be evaluated and results posted as a comment automatically.
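Per the commit message, the simplified example is about five lines of YAML. It would have roughly this shape (a sketch only; the `uses:` slug is a placeholder, not taken from this diff):

```yaml
# Hypothetical minimal workflow; replace the `uses:` slug with the real action reference.
name: LLM Eval
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: your-org/llm-eval-action@v1  # placeholder
```

With `post_pr_comment` defaulting to `true`, no separate `github-script` step is needed to post results.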
| Metric | Category | Direction | What it measures |
| --- | --- | --- | --- |
| `answer_relevancy` | Universal | Higher is better | Is the response relevant to what was asked? |
| `correctness` | Universal | Higher is better | Are the answers factually right? |
| `completeness` | Universal | Higher is better | Does the answer cover all parts of the question? |
| `instruction_following` | Universal | Higher is better | Does the response follow the instructions? |
| `hallucination` | Universal | **Lower is better** | How much of the response is fabricated? |
| `toxicity` | Universal | **Lower is better** | Does the response contain harmful content? |
| `bias` | Universal | **Lower is better** | Does the response exhibit unfair bias? |
| `faithfulness` | RAG | Higher is better | Is the response grounded in the provided context? |
| `contextual_relevancy` | RAG | Higher is better | Is the retrieved context relevant? |
| `context_precision` | RAG | Higher is better | Is the retrieved context precise? |
| `context_recall` | RAG | Higher is better | Was all relevant context retrieved? |
| `tool_correctness` | Agent | Higher is better | Are the right tools selected? |
| `argument_correctness` | Agent | Higher is better | Are tool arguments correct? |
| `task_completion` | Agent | Higher is better | Is the overall task completed? |
| `step_efficiency` | Agent | Higher is better | Are steps efficient (no redundancy)? |
| `plan_quality` | Agent | Higher is better | Is the execution plan well-structured? |
| `plan_adherence` | Agent | Higher is better | Does execution follow the plan? |
**How thresholds work:** For standard metrics (higher is better), the score must be **at or above** the threshold to pass. For inverted metrics (lower is better), the score must be **at or below** the threshold to pass. A threshold of `0.7` means "70% correct is the minimum" for standard metrics, or "70% hallucination is the maximum" for inverted ones.
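The pass/fail rule above can be sketched as a small predicate (illustrative only; the function and parameter names are assumptions, not the action's source):

```python
def metric_passes(score: float, threshold: float, lower_is_better: bool = False) -> bool:
    """Return True when a metric score satisfies its threshold
    (hypothetical helper mirroring the documented rule)."""
    if lower_is_better:
        # Inverted metrics (hallucination, toxicity, bias):
        # the score must be at or below the threshold.
        return score <= threshold
    # Standard metrics: the score must be at or above the threshold.
    return score >= threshold
```

For instance, with a `0.7` threshold a correctness score of `0.72` passes, while a hallucination score of `0.72` (inverted) fails.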