@habuma (Member) commented Jun 28, 2024

This PR adds a new evaluator to judge the correctness of a response. It is loosely based on LlamaIndex's correctness.py evaluator.

Note that I couldn't decide on the best way to provide the reference answer. I saw three choices:

  • Add it directly to EvaluationRequest - This seemed like the worst choice, as it would be a property that is probably only used for this evaluator.
  • Override the evaluate() method to take an EvaluationRequest as well as the reference answer. I started this way, but it felt clunky.
  • Extend EvaluationRequest with CorrectnessEvaluationRequest and implement evaluate() to check for a CorrectnessEvaluationRequest and use its reference answer. This is the option I chose (sketched below).
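
To illustrate option 3, here is a minimal sketch using simplified stand-in types rather than the actual Spring AI classes; the field and method names (referenceAnswer, pass, feedback, and so on) are illustrative assumptions, not the PR's exact API:

```java
// Simplified stand-ins for illustration; not the actual Spring AI types.
record EvaluationResponse(boolean pass, float score, String feedback) {}

class EvaluationRequest {
    private final String userText;
    private final String responseContent;

    EvaluationRequest(String userText, String responseContent) {
        this.userText = userText;
        this.responseContent = responseContent;
    }

    String getUserText() { return userText; }
    String getResponseContent() { return responseContent; }
}

// Option 3: a subclass that carries the reference answer alongside the request.
class CorrectnessEvaluationRequest extends EvaluationRequest {
    private final String referenceAnswer;

    CorrectnessEvaluationRequest(String userText, String responseContent, String referenceAnswer) {
        super(userText, responseContent);
        this.referenceAnswer = referenceAnswer;
    }

    String getReferenceAnswer() { return referenceAnswer; }
}

// The evaluator narrows the request type and pulls out the reference answer.
class CorrectnessEvaluator {

    EvaluationResponse evaluate(EvaluationRequest request) {
        if (!(request instanceof CorrectnessEvaluationRequest correctnessRequest)) {
            throw new IllegalArgumentException(
                    "CorrectnessEvaluator requires a CorrectnessEvaluationRequest");
        }
        String reference = correctnessRequest.getReferenceAnswer();
        // ... build the judge prompt from the user text, the response, and the reference,
        // call the model, and derive the score, pass/fail, and feedback from its output ...
        return new EvaluationResponse(true, 5.0f, "Placeholder feedback; compared against: " + reference);
    }
}
```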

Also note that this change builds on the change in #967, so that EvaluationRequest takes a String for the response.

@ilopezluna (Contributor) left a comment

Some time ago, I worked on a project very similar to this one. I found it very useful to include the reason in the response. You can see an example here:
ValidatorAgent.java

I believe using a score to evaluate the answers might not be the best option, as the expected output for a test is typically a boolean (test passes or not). Using a score could make it harder to properly test an answer because the developer would need to decide on a threshold, which could be very arbitrary.
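
As a comparison, here is an illustrative sketch (not taken from the linked project) of how a boolean verdict plus a reason maps directly onto a test assertion:

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

// Illustrative only: one possible shape for a boolean verdict plus its explanation.
record ValidationResult(boolean passed, String reason) {}

class AnswerValidationTest {

    @Test
    void answerIsJudgedCorrect() {
        // In practice this would come from the LLM judge; hard-coded here for illustration.
        ValidationResult result = new ValidationResult(true, "The answer matches the reference.");

        // A boolean verdict needs no threshold: it maps straight onto an assertion,
        // with the reason available as the failure message.
        assertTrue(result.passed(), result.reason());
    }
}
```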

@habuma (Member, Author) commented Jul 4, 2024

This returns both a pass/fail boolean and an explanation. The pass/fail is determined by a score threshold: if the score falls below the threshold, the test fails. The explanation is provided in the feedback property of the EvaluationResponse.
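
For example, the threshold mapping might look roughly like this (the 0-5 scale and the 4.0 threshold are illustrative values, not necessarily the PR's defaults):

```java
// Illustrative sketch of deriving pass/fail from a judge score; names and values are assumptions.
record EvaluationResponse(boolean pass, float score, String feedback) {}

class ScoreThresholdExample {

    // The judge model returns a score and an explanation; pass/fail is derived
    // by comparing the score against a configurable threshold.
    static EvaluationResponse toResponse(float score, String feedback, float passingThreshold) {
        return new EvaluationResponse(score >= passingThreshold, score, feedback);
    }

    public static void main(String[] args) {
        // e.g. a score of 3.5 on a 0-5 scale with a threshold of 4.0 fails the check
        EvaluationResponse response =
                toResponse(3.5f, "Partially correct, but omits the requested date.", 4.0f);
        System.out.println(response.pass() + " - " + response.feedback());
    }
}
```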

@ilopezluna (Contributor) replied to the comment above:

I believe it makes more sense to rely on the judgment of the LLM to determine if the test passes or not. However, I've seen many examples where evaluations are based on scores, so I might be wrong.

If you always want the explanation included in the response, I suggest being more explicit in the prompt. Currently, you are asking for:

Output a single score that represents a holistic evaluation.

I believe this could lead the LLM to omit the explanation. When I was working on this, I realized that I had to be extremely explicit about what I wanted.
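
For instance, the instruction could spell out both parts of the expected output; the wording below is only a suggestion, not the PR's actual prompt text:

```java
class CorrectnessPrompts {

    // Hypothetical, more explicit instruction for the judge model.
    static final String EXPLICIT_INSTRUCTION = """
            Evaluate the generated answer against the reference answer.
            Respond with exactly two lines:
            Score: a number between 0 and 5, where 5 means the answer is fully correct.
            Explanation: one or two sentences justifying the score.
            """;
}
```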

@markpollack added this to the 1.0.x milestone on May 16, 2025
@markpollack removed this from the 1.1.0.M1 milestone on Sep 25, 2025