spring-ai-docs/src/main/antora/modules/ROOT/pages/api/testing.adoc

The 'claim' and 'document' are presented to the AI model for evaluation.

=== Usage
The `FactCheckingEvaluator` constructor takes a `ChatClient.Builder` as a parameter:

[source,java]
----
public FactCheckingEvaluator(ChatClient.Builder chatClientBuilder) {
    // ... (constructor body elided)
}
----

An example test (shown here in part) asserts that a claim unsupported by the given context fails evaluation:

[source,java]
----
void testFactChecking() {
    // ... (test setup elided)
assertFalse(evaluationResponse.isPass(), "The claim should not be supported by the context");
}
----
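
For reference, here is a minimal end-to-end sketch of such a test. It is a sketch under stated assumptions, not the documented example: `chatModel` is assumed to be an already configured `ChatModel`, the context and claim strings are purely illustrative, and the two-argument `EvaluationRequest(List<Document>, String)` constructor pairs the context documents with the claim under evaluation.

[source,java]
----
// Sketch only. Key types: org.springframework.ai.evaluation.FactCheckingEvaluator,
// EvaluationRequest, EvaluationResponse, and org.springframework.ai.document.Document.
// Assumes a configured ChatModel named chatModel.
@Test
void testFactChecking() {
    var factCheckingEvaluator = new FactCheckingEvaluator(ChatClient.builder(chatModel));

    // Context to check against, and a claim that contradicts it (illustrative strings)
    String context = "The Earth is the third planet from the Sun.";
    String claim = "The Earth is the fourth planet from the Sun.";

    // The context travels as a Document list; the claim is the content under evaluation
    EvaluationRequest evaluationRequest =
            new EvaluationRequest(List.of(new Document(context)), claim);
    EvaluationResponse evaluationResponse = factCheckingEvaluator.evaluate(evaluationRequest);

    assertFalse(evaluationResponse.isPass(), "The claim should not be supported by the context");
}
----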
== CorrectnessEvaluator

Whereas `FactCheckingEvaluator` establishes whether the generated content is factual given some context data, `CorrectnessEvaluator` determines whether the generated content is correct when compared with a reference answer that is known to be correct. It also produces a score (on a scale of 1 to 5) to gauge how correct the generated content is.

The `CorrectnessEvaluator` submits the following system prompt to the AI model as guidelines for determining correctness:

[source,text]
----
You are an expert evaluation system for a question answering chatbot.
You are given the following information:
- a user query, and
- a generated answer
You may also be given a reference answer to use for reference in your evaluation.
Your job is to judge the relevance and correctness of the generated answer.
Output a single score that represents a holistic evaluation.
Follow these guidelines for scoring:
- Your score has to be between 1 and 5, where 1 is the worst and 5 is the best.
- If the generated answer is not relevant to the user query,
you should give a score of 1.
- If the generated answer is relevant but contains mistakes,
you should give a score between 2 and 3.
- If the generated answer is relevant and fully correct,
you should give a score between 4 and 5.
Example Response:
4.0
The generated answer has the exact same metrics as the reference answer,
but it is not as concise.
----
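
To make the expected response format concrete, here is a purely illustrative snippet that splits such a response into its numeric score and feedback text; it does not claim to show how `CorrectnessEvaluator` parses the model output internally.

[source,java]
----
// Illustrative parsing of the response format shown above:
// first line is the score, any remaining lines are feedback.
String modelOutput = """
        4.0
        The generated answer has the exact same metrics as the reference answer,
        but it is not as concise.""";

String[] parts = modelOutput.split("\n", 2);
float score = Float.parseFloat(parts[0].trim());           // 4.0
String feedback = parts.length > 1 ? parts[1].trim() : "";
----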

Along with the system prompt, the query input, generated answer, and the reference answer are provided in the user prompt:

[source,text]
----
{query}
## Reference Answer
{reference_answer}
## Generated Answer
{generated_answer}
----
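
As an illustration of how the placeholders are filled, the template can be rendered with Spring AI's `PromptTemplate` (`org.springframework.ai.chat.prompt.PromptTemplate`); the query and answer values below are made up for the example.

[source,java]
----
// Render the user prompt shown above; the variable names match the placeholders
PromptTemplate userPromptTemplate = new PromptTemplate("""
        {query}
        ## Reference Answer
        {reference_answer}
        ## Generated Answer
        {generated_answer}""");

String userPrompt = userPromptTemplate.render(Map.of(
        "query", "What is the capital of Denmark?",
        "reference_answer", "Copenhagen.",
        "generated_answer", "The capital of Denmark is Copenhagen."));
----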

Here is an example of a JUnit test that performs a RAG query over a PDF document loaded into a Vector Store and then evaluates whether the response is correct with respect to the reference answer:

[source,java]
----
// ... (test setup elided)
assertTrue(evaluationResponse.isPass(), "Response is incorrect");
}
----
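
A fuller sketch of such a test follows, under explicit assumptions: `chatModel` and a `vectorStore` already loaded with the PDF are provided by the test setup, the `CorrectnessEvaluator` constructor mirrors the description below (a `ChatClient` plus a minimum passing score), and the reference answer is passed as the evaluation context. The real API may wire these differently.

[source,java]
----
// Sketch only; the test names, sample strings, and the CorrectnessEvaluator
// constructor signature are assumptions, not the documented API.
@Test
void testCorrectness() {
    String userText = "What does the report conclude about battery life?";         // illustrative
    String referenceAnswer = "The report concludes battery life improved by 20%."; // illustrative

    // RAG query: answer the question from documents retrieved from the Vector Store
    String responseContent = ChatClient.builder(chatModel)
            .build()
            .prompt()
            .advisors(new QuestionAnswerAdvisor(vectorStore))
            .user(userText)
            .call()
            .content();

    // Assumed constructor: a ChatClient plus the minimum passing score (see below)
    var correctnessEvaluator = new CorrectnessEvaluator(ChatClient.builder(chatModel).build(), 4.0f);

    // Assumption: the reference answer is supplied as the evaluation context
    EvaluationRequest evaluationRequest = new EvaluationRequest(userText,
            List.of(new Document(referenceAnswer)), responseContent);
    EvaluationResponse evaluationResponse = correctnessEvaluator.evaluate(evaluationRequest);

    assertTrue(evaluationResponse.isPass(), "Response is incorrect");
}
----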

The `CorrectnessEvaluator` is created with a `ChatClient` as well as a threshold; the score must be greater than or equal to this threshold for the evaluation to be considered correct.