feat: Test-based Evaluation with LLM-as-a-judge #225
Conversation
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
avinash2692 left a comment:
@keshavramji: Thanks for your contribution!
This might be due to my ignorance, but I'm having some trouble understanding the overall goal of the PR and what a "test-based evaluation" entails. It might help if you add:
- More detail to the PR description, which could later become an entry in the library docs.
- Some tests, including sample `.json` files that show users the expected input schema (a sample is sketched below).
Also left some comments on the files here.
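For illustration, a test file might look something like the sketch below. This is a guess at the schema, not something the PR confirms: the top-level `examples` key comes from the `test_data.get("examples", [])` call in a later diff, and the per-example fields mirror the judge prompt's `{input}`, `{target}`, and `{guidelines}` placeholders.

```json
{
  "name": "summarization-faithfulness",
  "guidelines": "The summary must not introduce facts that are absent from the input.",
  "examples": [
    {
      "input": "Summarize: The quarterly report showed revenue of $2.1M, up 8% year over year.",
      "target": "Revenue was $2.1M for the quarter, an 8% increase year over year."
    }
  ]
}
```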
```
# KR files
kr_results/
kr_data/
xet/
job.sh
hub/
```
Could this be removed? I think it's user-specific.
```diff
 from mellea.stdlib.test_based_eval import TestBasedEval

-__all__ = ["MelleaSession", "generative", "model_ids", "start_session"]
+__all__ = ["MelleaSession", "TestBasedEval", "generative", "model_ids", "start_session"]
```
Is there a reason for this to be a package-level import?
```diff
@@ -0,0 +1,46 @@
+import typer
+
+eval_app = typer.Typer(name="eval")
```
Could you add some documentation on what this CLI app is trying to achieve? It would give end users clarity on what the command is for.
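One lightweight way to do that with `typer` is to attach help text to the app and its commands so it surfaces under `--help`. A minimal sketch; the wording and the `run` command's signature are invented here, not taken from the PR:

```python
import typer

eval_app = typer.Typer(
    name="eval",
    help="Run test-based evaluations: generate model outputs and score them with an LLM judge.",
)

@eval_app.command("run")
def run(test_dir: str = typer.Argument(..., help="Directory containing .json test files.")):
    """Run every test-based evaluation found in TEST_DIR."""
    ...
```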
```python
self.judge_prompt = """**Input to the model**
{input}
**Model output to be rated**
{prediction}
**Ground truth text**
{target}
**Rating Guidelines**
The model output should adhere to the following guidelines:
{guidelines}
**Scoring Criteria**
* Score 0: The model output violates any of the guidelines.
* Score 1: The model output is well aligned with the ground truth - if it exists, the input to the model, and adheres to all guidelines.
**Return Your Rating**
Return your rating in the following format:
{{\"score\": your_score, \"justification\": \"your_justification\"}}
Your rating:
"""
```
This might be because I don't have a proper understanding of what you're trying to achieve here, but could this prompt be moved to a Jinja template under `mellea/templates/prompts` and exposed as a `mellea.stdlib.base.TemplateRepresentation`? You might also want to use the `format_for_llm` method in tandem with the `TemplateRepresentation` to create the `Component`.
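Something along these lines, with the prompt body living in a Jinja file. This is only a shape sketch under assumptions: the `TemplateRepresentation` constructor is assumed to take `obj` and an `args` dict, the class name is hypothetical, and any other abstract methods on `Component` (e.g. `parts`) are stubbed out rather than verified against the library.

```python
from mellea.stdlib.base import Component, TemplateRepresentation

class TestBasedEvalJudge(Component):
    """Hypothetical Component whose judge prompt renders from a Jinja template."""

    def __init__(self, input: str, prediction: str, target: str, guidelines: str):
        self.input = input
        self.prediction = prediction
        self.target = target
        self.guidelines = guidelines

    def parts(self):
        # Stubbed; included only in case the Component interface requires it.
        return []

    def format_for_llm(self) -> TemplateRepresentation:
        # Argument names mirror the prompt placeholders above; the exact
        # TemplateRepresentation contract is an assumption, not the real API.
        return TemplateRepresentation(
            obj=self,
            args={
                "input": self.input,
                "prediction": self.prediction,
                "target": self.target,
                "guidelines": self.guidelines,
            },
        )
```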
| """Run all 'unit test' evaluations""" | ||
| all_test_evals: List[TestBasedEval] = [] | ||
|
|
||
| for test_file in test_files: |
It would be good to have an example of what you expect in a `test_file` here (see the sample file sketched earlier).
```python
test_evals = []
for test_data in data:
    examples = test_data.get("examples", [])
```
Could you create a schema here, using either dataclasses or pydantic, to pin down what the input `test_data` should look like? That would make malformed files fail early when they are loaded.
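A minimal sketch of what that could look like with pydantic; the field names are hypothetical and mirror the sample file shape sketched earlier, not the PR's actual schema.

```python
from pydantic import BaseModel

class TestExample(BaseModel):
    input: str
    target: str | None = None  # the judge prompt treats the ground truth as optional

class TestSpec(BaseModel):
    name: str
    guidelines: str
    examples: list[TestExample] = []

# Validating while loading fails early on malformed files:
# specs = [TestSpec.model_validate(d) for d in data]
```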
`m eval run` CLI command with configurable models and backends for the generator and judge.
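Usage might then look something like the following; every flag below is illustrative, since the PR excerpt does not show the actual options.

```shell
# Hypothetical invocation: flag names are illustrative, not from the PR.
m eval run ./tests \
  --model <generator-model> --backend <generator-backend> \
  --judge-model <judge-model> --judge-backend <judge-backend>
```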