
Conversation

@keshavramji

  • Added TestBasedEval component for evaluation with generator and judge sessions.
  • Implements the m eval run CLI command with configurable models and backends for the generator and judge.
  • Supports flexible test format with multiple examples per test, each containing input-target pairs evaluated against custom instructions.
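
For illustration, a test file might look like the following sketch (field names other than examples/input/target are provisional and may still change):

[
  {
    "instructions": "Answers must be concise and factually consistent with the target.",
    "examples": [
      { "input": "What is the capital of France?", "target": "Paris" },
      { "input": "What is 2 + 2?", "target": "4" }
    ]
  }
]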

@mergify

mergify bot commented Nov 6, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

@avinash2692 avinash2692 self-assigned this Nov 7, 2025
@avinash2692 avinash2692 self-requested a review November 7, 2025 20:05
@avinash2692 avinash2692 removed their assignment Nov 7, 2025

@avinash2692 (Contributor) left a comment


@keshavramji : Thanks for your contribution!

This might be due to my ignorance, but I'm having some trouble understanding the overall goal of the PR and what a "Test-based Evaluation" entails. It might be helpful if you add:

  • More details in the PR description that could later translate into an entry in the library docs.
  • Some tests (including sample .json files that show users the expected overall input schema).

I've also left some inline comments on the files.

Comment on lines +1 to +7
# KR files
kr_results/
kr_data/
xet/
job.sh
hub/


Could this be removed? It looks user-specific.

+ from mellea.stdlib.test_based_eval import TestBasedEval

- __all__ = ["MelleaSession", "generative", "model_ids", "start_session"]
+ __all__ = ["MelleaSession", "TestBasedEval", "generative", "model_ids", "start_session"]

Is there a reason for this to be a package-level import?

@@ -0,0 +1,46 @@
import typer

eval_app = typer.Typer(name="eval")

Could you add some documentation on what this CLI app is meant to do? It would give end users some clarity on what the command is for.
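
For instance, even a short help string on the Typer app would go a long way; the wording below is only a suggestion:

eval_app = typer.Typer(
    name="eval",
    help=(
        "Run test-based evaluations: a generator model answers each "
        "example in a test file, then a judge model scores the output "
        "against the target and instructions."
    ),
)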

Comment on lines +32 to +57
self.judge_prompt = """**Input to the model**
{input}
**Model output to be rated**
{prediction}
**Ground truth text**
{target}
**Rating Guidelines**
The model output should adhere to the following guidelines:
{guidelines}
**Scoring Criteria**
* Score 0: The model output violates any of the guidelines.
* Score 1: The model output is well aligned with the ground truth - if it exists, the input to the model, and adheres to all guidelines.
**Return Your Rating**
Return your rating in the following format:
{{\"score\": your_score, \"justification\": \"your_justification\"}}
Your rating:
"""

This might be because I don't have a full picture of what you're trying to achieve here, but could this be moved to a Jinja template under mellea/templates/prompts and surfaced as a mellea.stdlib.base.TemplateRepresentation? You might also want to use the format_for_llm method in tandem with the TemplateRepresentation to create the Component.
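
Very roughly, and with the caveat that I'm guessing at the exact TemplateRepresentation fields, something like:

from mellea.stdlib.base import Component, TemplateRepresentation

class JudgeRating(Component):
    # Sketch only: the class name and fields here are illustrative,
    # not the real API surface.
    def __init__(self, input: str, prediction: str, target: str, guidelines: str):
        self.input = input
        self.prediction = prediction
        self.target = target
        self.guidelines = guidelines

    def parts(self):
        return []

    def format_for_llm(self) -> TemplateRepresentation:
        # The prompt text itself would live in a Jinja template under
        # mellea/templates/prompts instead of being inlined as a string.
        return TemplateRepresentation(
            obj=self,
            args={
                "input": self.input,
                "prediction": self.prediction,
                "target": self.target,
                "guidelines": self.guidelines,
            },
        )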

"""Run all 'unit test' evaluations"""
all_test_evals: List[TestBasedEval] = []

for test_file in test_files:

It would be good to have an example of what you expect in a test_file.


test_evals = []
for test_data in data:
examples = test_data.get("examples", [])

Could you create a schema here, using either dataclasses or pydantic, to pin down what the input test_data should look like? Validating at load time would surface malformed files early.
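
Something along these lines (field names are only a guess at your intended schema):

from pydantic import BaseModel

class ExampleCase(BaseModel):
    input: str
    target: str

class TestSpec(BaseModel):
    instructions: str
    examples: list[ExampleCase] = []

# Validating at load time fails fast on malformed files:
# test_spec = TestSpec.model_validate(test_data)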
