@habuma (Member) commented Jun 28, 2024

This PR adds a new evaluator to judge the correctness of a response. It is loosely based on LlamaIndex's correctness.py evaluator.

Note that I couldn't decide on the best way to provide the reference answer. I saw three choices:

  • Add it directly to EvaluationRequest - This seemed like the worst choice, as it would be a property that is probably only used for this evaluator.
  • Override the evaluate() method to take an EvaluationRequest as well as the reference answer. I started this way, but it felt clunky.
  • Extend EvaluationRequest with CorrectnessEvaluationRequest and implement evaluate() to check for a CorrectnessEvaluationRequest and use its reference answer. This is the option I chose (sketched below).
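
To illustrate option 3, here is a minimal sketch using simplified stand-in types rather than the actual Spring AI classes; the field and method names (referenceAnswer, pass, feedback, and so on) are illustrative assumptions, not the PR's exact API:

```java
// Simplified stand-ins for illustration; not the actual Spring AI types.
record EvaluationResponse(boolean pass, float score, String feedback) {}

class EvaluationRequest {
    private final String userText;
    private final String responseContent;

    EvaluationRequest(String userText, String responseContent) {
        this.userText = userText;
        this.responseContent = responseContent;
    }

    String getUserText() { return userText; }
    String getResponseContent() { return responseContent; }
}

// Option 3: a subclass that carries the reference answer alongside the request.
class CorrectnessEvaluationRequest extends EvaluationRequest {
    private final String referenceAnswer;

    CorrectnessEvaluationRequest(String userText, String responseContent, String referenceAnswer) {
        super(userText, responseContent);
        this.referenceAnswer = referenceAnswer;
    }

    String getReferenceAnswer() { return referenceAnswer; }
}

// The evaluator narrows the request type and pulls out the reference answer.
class CorrectnessEvaluator {

    EvaluationResponse evaluate(EvaluationRequest request) {
        if (!(request instanceof CorrectnessEvaluationRequest correctnessRequest)) {
            throw new IllegalArgumentException(
                    "CorrectnessEvaluator requires a CorrectnessEvaluationRequest");
        }
        String reference = correctnessRequest.getReferenceAnswer();
        // ... build the judge prompt from the user text, the response, and the reference,
        // call the model, and derive the score, pass/fail, and feedback from its output ...
        return new EvaluationResponse(true, 5.0f, "Placeholder feedback; compared against: " + reference);
    }
}
```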

Also note that this change builds on the change in #967, so that EvaluationRequest takes a String for the response.

@ilopezluna (Contributor) left a comment

Some time ago, I worked on a project very similar to this one. I found it very useful to include the reason in the response. You can see an example here:
ValidatorAgent.java

I believe using a score to evaluate the answers might not be the best option, as the expected output for a test is typically a boolean (test passes or not). Using a score could make it harder to properly test an answer because the developer would need to decide on a threshold, which could be very arbitrary.
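
As a comparison, here is an illustrative sketch (not taken from the linked project) of how a boolean verdict plus a reason maps directly onto a test assertion:

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

// Illustrative only: one possible shape for a boolean verdict plus its explanation.
record ValidationResult(boolean passed, String reason) {}

class AnswerValidationTest {

    @Test
    void answerIsJudgedCorrect() {
        // In practice this would come from the LLM judge; hard-coded here for illustration.
        ValidationResult result = new ValidationResult(true, "The answer matches the reference.");

        // A boolean verdict needs no threshold: it maps straight onto an assertion,
        // with the reason available as the failure message.
        assertTrue(result.passed(), result.reason());
    }
}
```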

@habuma (Member, Author) commented Jul 4, 2024

This returns both a pass/fail boolean and an explanation. The pass/fail is determined by a score threshold: if the score falls below the threshold, the test fails. The explanation is provided in the feedback property of the EvaluationResponse.
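
For example, the threshold mapping might look roughly like this (the 0-5 scale and the 4.0 threshold are illustrative values, not necessarily the PR's defaults):

```java
// Illustrative sketch of deriving pass/fail from a judge score; names and values are assumptions.
record EvaluationResponse(boolean pass, float score, String feedback) {}

class ScoreThresholdExample {

    // The judge model returns a score and an explanation; pass/fail is derived
    // by comparing the score against a configurable threshold.
    static EvaluationResponse toResponse(float score, String feedback, float passingThreshold) {
        return new EvaluationResponse(score >= passingThreshold, score, feedback);
    }

    public static void main(String[] args) {
        // e.g. a score of 3.5 on a 0-5 scale with a threshold of 4.0 fails the check
        EvaluationResponse response =
                toResponse(3.5f, "Partially correct, but omits the requested date.", 4.0f);
        System.out.println(response.pass() + " - " + response.feedback());
    }
}
```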

@ilopezluna (Contributor) replied to the comment above:

I believe it makes more sense to rely on the judgment of the LLM to determine if the test passes or not. However, I've seen many examples where evaluations are based on scores, so I might be wrong.

If you always want the explanation included in the response, I suggest being more explicit in the prompt. Currently, you are asking for:

Output a single score that represents a holistic evaluation.

I believe this could lead the LLM to omit the explanation. When I was working on this, I realized that I had to be extremely explicit about what I wanted.
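
For instance, the instruction could spell out both parts of the expected output; the wording below is only a suggestion, not the PR's actual prompt text:

```java
class CorrectnessPrompts {

    // Hypothetical, more explicit instruction for the judge model.
    static final String EXPLICIT_INSTRUCTION = """
            Evaluate the generated answer against the reference answer.
            Respond with exactly two lines:
            Score: a number between 0 and 5, where 5 means the answer is fully correct.
            Explanation: one or two sentences justifying the score.
            """;
}
```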

@markpollack added this to the 1.0.x milestone on May 16, 2025
@markpollack removed this from the 1.1.0.M1 milestone on Sep 25, 2025