
Commit 9100134

Add documentation for CorrectnessEvaluator
1 parent 8cb2fb4 commit 9100134

File tree

1 file changed: 69 additions, 1 deletion

  • spring-ai-docs/src/main/antora/modules/ROOT/pages/api/testing.adoc

spring-ai-docs/src/main/antora/modules/ROOT/pages/api/testing.adoc

Lines changed: 69 additions & 1 deletion

@@ -92,4 +92,72 @@ void testEvaluation() {
}
----

The code above is from the example application located https://github.com/rd-1-2022/ai-azure-rag.git[here].

== CorrectnessEvaluator

Whereas `RelevancyEvaluator` establishes whether the generated content is relevant to the input, `CorrectnessEvaluator` determines whether the generated content is correct when compared with a reference answer. It also produces a score, on a scale of 1 to 5, that gauges how correct the generated content is.

The `CorrectnessEvaluator` submits the following system prompt to the AI model as guidelines for determining correctness:

[source,text]
----
You are an expert evaluation system for a question answering chatbot.
You are given the following information:
- a user query, and
- a generated answer
You may also be given a reference answer to use for reference in your evaluation.
Your job is to judge the relevance and correctness of the generated answer.
Output a single score that represents a holistic evaluation.
Follow these guidelines for scoring:
- Your score has to be between 1 and 5, where 1 is the worst and 5 is the best.
- If the generated answer is not relevant to the user query,
you should give a score of 1.
- If the generated answer is relevant but contains mistakes,
you should give a score between 2 and 3.
- If the generated answer is relevant and fully correct,
you should give a score between 4 and 5.
Example Response:
4.0
The generated answer has the exact same metrics as the reference answer,
but it is not as concise.
----

Along with the system prompt, the query, the generated answer, and the reference answer are provided in the user prompt:

[source,text]
----
{query}
## Reference Answer
{reference_answer}
## Generated Answer
{generated_answer}
----
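
This section does not show how the placeholders are filled in. As a rough sketch only, Spring AI's `PromptTemplate` could render the user prompt as shown below; the template string and the example values are illustrative assumptions, not part of the `CorrectnessEvaluator` API.

[source,java]
----
// Illustrative sketch only -- the evaluator's internals may differ.
// Uses org.springframework.ai.chat.prompt.PromptTemplate to fill the
// {query}, {reference_answer} and {generated_answer} placeholders.
String userPromptTemplate = """
        {query}
        ## Reference Answer
        {reference_answer}
        ## Generated Answer
        {generated_answer}
        """;

String userPrompt = new PromptTemplate(userPromptTemplate)
    .render(Map.of(
        "query", "Why is the sky blue?",
        "reference_answer", "Light scattering makes the sky blue.",
        "generated_answer", "The sky looks blue because of Rayleigh scattering."));
----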

Here is an example of a JUnit test that asks the AI model a question and then evaluates whether the generated answer is correct, as compared with a reference answer.

[source,java]
----
@Test
void testEvaluation() {
    String userText = "Why is the sky blue?";

    // Ask the model the question under test
    ChatResponse response = ChatClient.builder(chatModel)
        .build().prompt()
        .user(userText)
        .call()
        .chatResponse();

    // The evaluation passes only if the correctness score is at least 3.5
    var correctnessEvaluator = new CorrectnessEvaluator(ChatClient.builder(chatModel), 3.5f);

    EvaluationResponse evaluationResponse = correctnessEvaluator.evaluate(
        new EvaluationRequest(
            userText,
            List.of(),
            "Light scattering makes the sky blue."));

    assertTrue(evaluationResponse.isPass(), "Response is incorrect");
}
----

The `CorrectnessEvaluator` is created with a `ChatClient` builder as well as a threshold that the score must be greater than or equal to for the evaluation to be considered correct.
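
Exactly how the evaluator derives the pass/fail result is not shown here. The sketch below is a hypothetical illustration, assuming the numeric score is read from the first line of the model's reply (as in the example response earlier in this section) and compared against the configured threshold; it is not the evaluator's actual implementation.

[source,java]
----
// Hypothetical sketch: turn a reply like the example response shown earlier
// into a score, feedback text, and a pass/fail decision against the threshold.
float threshold = 3.5f;

String evaluatorReply = """
        4.0
        The generated answer has the exact same metrics as the reference answer,
        but it is not as concise.
        """;

String[] parts = evaluatorReply.split("\n", 2);
float score = Float.parseFloat(parts[0].trim());           // 4.0
String feedback = parts.length > 1 ? parts[1].trim() : ""; // explanation text

boolean pass = score >= threshold;                          // 4.0 >= 3.5 -> pass
----

With the 3.5 threshold used in the test above, the example score of 4.0 would pass.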
