
Commit 7d603fe (1 parent: 97be707)

Add documentation for CorrectnessEvaluator


1 file changed: spring-ai-docs/src/main/antora/modules/ROOT/pages/api/testing.adoc (69 additions, 1 deletion)
@@ -103,6 +103,7 @@ The 'claim' and 'document' are presented to the AI model for evaluation. Smaller

=== Usage
The FactCheckingEvaluator constructor takes a ChatClient.Builder as a parameter:

[source,java]
----
public FactCheckingEvaluator(ChatClient.Builder chatClientBuilder) {
@@ -147,4 +148,71 @@ void testFactChecking() {
    assertFalse(evaluationResponse.isPass(), "The claim should not be supported by the context");

}
----

== CorrectnessEvaluator

Whereas `FactCheckingEvaluator` establishes whether the generated content is factual given some context data, `CorrectnessEvaluator` determines whether the generated content is correct when compared with a reference answer that is known to be correct. It also produces a score (ranging from 1 to 5) that gauges how correct the generated content is.

The `CorrectnessEvaluator` submits the following system prompt to the AI model as guidelines for determining correctness:

[source,text]
----
You are an expert evaluation system for a question answering chatbot.
You are given the following information:
- a user query, and
- a generated answer
You may also be given a reference answer to use for reference in your evaluation.
Your job is to judge the relevance and correctness of the generated answer.
Output a single score that represents a holistic evaluation.
Follow these guidelines for scoring:
- Your score has to be between 1 and 5, where 1 is the worst and 5 is the best.
- If the generated answer is not relevant to the user query,
you should give a score of 1.
- If the generated answer is relevant but contains mistakes,
you should give a score between 2 and 3.
- If the generated answer is relevant and fully correct,
you should give a score between 4 and 5.
Example Response:
4.0
The generated answer has the exact same metrics as the reference answer,
but it is not as concise.
----

Along with the system prompt, the query input, the generated answer, and the reference answer are provided in the user prompt:

[source,text]
----
{query}
## Reference Answer
{reference_answer}
## Generated Answer
{generated_answer}
----
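
How the evaluator fills these placeholders is internal to it, but as an illustration, here is a minimal sketch that renders the same template with Spring AI's `PromptTemplate`. The sample values are hypothetical, and the evaluator's actual rendering mechanism may differ:

[source,java]
----
import org.springframework.ai.chat.prompt.PromptTemplate;

import java.util.Map;

// Illustrative only: render the user prompt template with sample values.
// The placeholder names match the template above.
String template = """
        {query}
        ## Reference Answer
        {reference_answer}
        ## Generated Answer
        {generated_answer}
        """;

String userPrompt = new PromptTemplate(template).render(Map.of(
        "query", "Why is the sky blue?",
        "reference_answer", "Light scattering makes the sky blue.",
        "generated_answer", "Air molecules scatter blue light, making the sky look blue."));
----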

Here is an example of a JUnit test that asks the AI model a question and then uses `CorrectnessEvaluator` to judge whether an answer to that question is correct:

[source,java]
----
void testEvaluation() {
    String userText = "Why is the sky blue?";

    // Get an answer from the model (not evaluated directly here; the
    // EvaluationRequest below passes a fixed answer for the evaluator to judge)
    ChatResponse response = ChatClient.builder(chatModel)
        .build().prompt()
        .user(userText)
        .call()
        .chatResponse();

    // Scores of 3.5 or higher pass the evaluation
    var correctnessEvaluator = new CorrectnessEvaluator(ChatClient.builder(chatModel), 3.5f);

    EvaluationResponse evaluationResponse = correctnessEvaluator.evaluate(
        new EvaluationRequest(
            userText,
            List.of(),
            "Light scattering makes the sky blue."));

    assertTrue(evaluationResponse.isPass(), "Response is incorrect");
}
----

The `CorrectnessEvaluator` is created with a `ChatClient.Builder` as well as a threshold that the score must be greater than or equal to for the evaluation to be considered correct.
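
To make the threshold concrete, here is a minimal sketch of inspecting an evaluation result. It assumes `EvaluationResponse` exposes the numeric score through a `getScore()` accessor; verify this against the Spring AI version you are using:

[source,java]
----
// Minimal sketch: with a threshold of 3.5, a score of 3.5 or higher passes.
var evaluator = new CorrectnessEvaluator(ChatClient.builder(chatModel), 3.5f);

EvaluationResponse evaluation = evaluator.evaluate(
    new EvaluationRequest(
        "Why is the sky blue?",
        List.of(),
        "Air molecules scatter blue light more strongly than red light."));

// Assumption: getScore() returns the holistic 1-5 score described in the system prompt.
float score = evaluation.getScore();
boolean pass = evaluation.isPass(); // expected to equal (score >= 3.5)
----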
