- PR comment is now built into the action (post_pr_comment input,
default true). Users no longer need a separate github-script step.
- Fetch previous experiment scores and show deltas in the summary
table (arrow + percentage point change per metric)
- Delete e2e-test.yml (contained hardcoded tunnel URL)
- Fix README metrics table to use snake_case names with categories
(Universal, RAG, Agent) and add all 17 supported metrics
- Simplify the "Run on Every PR" example to just 5 lines of YAML
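The delta column described in the summary-table bullet could be produced by a helper along these lines (a hypothetical sketch, not the action's actual code; the function name and arrow glyphs are assumptions):

```python
def format_delta(prev_score: float, curr_score: float) -> str:
    """Format the change between two 0-1 metric scores as an arrow
    plus a signed percentage-point difference (hypothetical helper)."""
    delta_pp = (curr_score - prev_score) * 100  # convert to percentage points
    if delta_pp > 0:
        arrow = "▲"
    elif delta_pp < 0:
        arrow = "▼"
    else:
        arrow = "="
    return f"{arrow} {delta_pp:+.1f}pp"
```

For example, a metric that moved from 0.70 in the previous experiment to 0.75 would render as `▲ +5.0pp` in the summary table.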
Made-with: Cursor
README.md: 28 additions & 42 deletions
@@ -62,7 +62,7 @@ The action will:
### Run on Every Pull Request
- Copy this file to `.github/workflows/llm-eval.yml` in your repository. That's it — every PR against `main` or `develop` will be evaluated automatically.
+ Copy this file to `.github/workflows/llm-eval.yml` in your repository. That's it — every PR will be evaluated and results posted as a comment automatically.
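Per the commit message, the simplified example is about five lines of YAML. It would have roughly this shape (a sketch only; the `uses:` slug is a placeholder, not taken from this diff):

```yaml
# Hypothetical minimal workflow; replace the `uses:` slug with the real action reference.
name: LLM Eval
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: your-org/llm-eval-action@v1  # placeholder
```

With `post_pr_comment` defaulting to `true`, no separate `github-script` step is needed to post results.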
| Metric | Category | Direction | What it measures |
| --- | --- | --- | --- |
| `answer_relevancy` | Universal | Higher is better | Is the response relevant to what was asked? |
| `correctness` | Universal | Higher is better | Are the answers factually right? |
| `completeness` | Universal | Higher is better | Does the answer cover all parts of the question? |
| `instruction_following` | Universal | Higher is better | Does the response follow the instructions? |
| `hallucination` | Universal | **Lower is better** | How much of the response is fabricated? |
| `toxicity` | Universal | **Lower is better** | Does the response contain harmful content? |
| `bias` | Universal | **Lower is better** | Does the response exhibit unfair bias? |
| `faithfulness` | RAG | Higher is better | Is the response grounded in the provided context? |
| `contextual_relevancy` | RAG | Higher is better | Is the retrieved context relevant? |
| `context_precision` | RAG | Higher is better | Is the retrieved context precise? |
| `context_recall` | RAG | Higher is better | Was all relevant context retrieved? |
| `tool_correctness` | Agent | Higher is better | Are the right tools selected? |
| `argument_correctness` | Agent | Higher is better | Are tool arguments correct? |
| `task_completion` | Agent | Higher is better | Is the overall task completed? |
| `step_efficiency` | Agent | Higher is better | Are steps efficient (no redundancy)? |
| `plan_quality` | Agent | Higher is better | Is the execution plan well-structured? |
| `plan_adherence` | Agent | Higher is better | Does execution follow the plan? |
**How thresholds work:** For standard metrics (higher is better), the score must be **at or above** the threshold to pass. For inverted metrics (lower is better), the score must be **at or below** the threshold to pass. A threshold of `0.7` means "70% correct is the minimum" for standard metrics, or "70% hallucination is the maximum" for inverted ones.
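The pass/fail rule above can be sketched as a small predicate (illustrative only; the function and parameter names are assumptions, not the action's source):

```python
def metric_passes(score: float, threshold: float, lower_is_better: bool = False) -> bool:
    """Return True when a metric score satisfies its threshold
    (hypothetical helper mirroring the documented rule)."""
    if lower_is_better:
        # Inverted metrics (hallucination, toxicity, bias):
        # the score must be at or below the threshold.
        return score <= threshold
    # Standard metrics: the score must be at or above the threshold.
    return score >= threshold
```

For instance, with a `0.7` threshold a correctness score of `0.72` passes, while a hallucination score of `0.72` (inverted) fails.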