
Conversation

@keshavramji

  • Added TestBasedEval component for evaluation with generator and judge sessions.
  • Implements the m eval run CLI command with configurable models and backends for the generator and judge.
  • Supports flexible test format with multiple examples per test, each containing input-target pairs evaluated against custom instructions.
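
For illustration, a test file might look like the following sketch (field names other than examples/input/target are provisional and may still change):

[
  {
    "instructions": "Answers must be concise and factually consistent with the target.",
    "examples": [
      { "input": "What is the capital of France?", "target": "Paris" },
      { "input": "What is 2 + 2?", "target": "4" }
    ]
  }
]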

@mergify

mergify bot commented Nov 6, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

@avinash2692 avinash2692 self-assigned this Nov 7, 2025
@avinash2692 avinash2692 self-requested a review November 7, 2025 20:05
@avinash2692 avinash2692 removed their assignment Nov 7, 2025

@avinash2692 (Contributor) left a comment


@keshavramji : Thanks for your contribution!

This might be due to my ignorance, but I'm having some trouble understanding the overall goal of the PR and what a "Test-based Evaluation" entails. It might be helpful if you add:

  • More details in the PR description that could later translate into an entry in the library docs.
  • Some tests (including sample .json files that show users the expected overall input schema).

I've also left some inline comments on the files.

Comment on lines +1 to +7
# KR files
kr_results/
kr_data/
xet/
job.sh
hub/


Could this be removed? It looks user-specific.

+ from mellea.stdlib.test_based_eval import TestBasedEval

- __all__ = ["MelleaSession", "generative", "model_ids", "start_session"]
+ __all__ = ["MelleaSession", "TestBasedEval", "generative", "model_ids", "start_session"]

Is there a reason for this to be a package-level import?

@@ -0,0 +1,46 @@
import typer

eval_app = typer.Typer(name="eval")

Could you add some documentation on what this CLI app is meant to do? It would give end users some clarity on what the command is for.
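
For instance, even a short help string on the Typer app would go a long way; the wording below is only a suggestion:

eval_app = typer.Typer(
    name="eval",
    help=(
        "Run test-based evaluations: a generator model answers each "
        "example in a test file, then a judge model scores the output "
        "against the target and instructions."
    ),
)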

Comment on lines +32 to +57
self.judge_prompt = """**Input to the model**
{input}
**Model output to be rated**
{prediction}
**Ground truth text**
{target}
**Rating Guidelines**
The model output should adhere to the following guidelines:
{guidelines}
**Scoring Criteria**
* Score 0: The model output violates any of the guidelines.
* Score 1: The model output is well aligned with the ground truth - if it exists, the input to the model, and adheres to all guidelines.
**Return Your Rating**
Return your rating in the following format:
{{\"score\": your_score, \"justification\": \"your_justification\"}}
Your rating:
"""

This might be because I don't have a full picture of what you're trying to achieve here, but could this be moved to a Jinja template under mellea/templates/prompts and surfaced as a mellea.stdlib.base.TemplateRepresentation? You might also want to use the format_for_llm method in tandem with the TemplateRepresentation to create the Component.
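
Very roughly, and with the caveat that I'm guessing at the exact TemplateRepresentation fields, something like:

from mellea.stdlib.base import Component, TemplateRepresentation

class JudgeRating(Component):
    # Sketch only: the class name and fields here are illustrative,
    # not the real API surface.
    def __init__(self, input: str, prediction: str, target: str, guidelines: str):
        self.input = input
        self.prediction = prediction
        self.target = target
        self.guidelines = guidelines

    def parts(self):
        return []

    def format_for_llm(self) -> TemplateRepresentation:
        # The prompt text itself would live in a Jinja template under
        # mellea/templates/prompts instead of being inlined as a string.
        return TemplateRepresentation(
            obj=self,
            args={
                "input": self.input,
                "prediction": self.prediction,
                "target": self.target,
                "guidelines": self.guidelines,
            },
        )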

"""Run all 'unit test' evaluations"""
all_test_evals: List[TestBasedEval] = []

for test_file in test_files:

It would be good to have an example of what you expect in a test_file.


test_evals = []
for test_data in data:
examples = test_data.get("examples", [])

Could you create a schema here, using either dataclasses or pydantic, to pin down what the input test_data should look like? Validating at load time would surface malformed files early.
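
Something along these lines (field names are only a guess at your intended schema):

from pydantic import BaseModel

class ExampleCase(BaseModel):
    input: str
    target: str

class TestSpec(BaseModel):
    instructions: str
    examples: list[ExampleCase] = []

# Validating at load time fails fast on malformed files:
# test_spec = TestSpec.model_validate(test_data)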
