`FactualCorrectness` is a metric that evaluates the factual accuracy of the generated `response` against the `reference`, measuring the extent to which the response aligns with the reference. The score ranges from 0 to 1, with higher values indicating better performance. To measure this alignment, the metric uses an LLM to first break the response and reference down into claims, and then uses natural language inference to determine the factual overlap between them. Factual overlap is quantified using precision, recall, or F1 score, selected via the `mode` parameter.
### Example

```python
from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.metrics.collections import FactualCorrectness

# Setup LLM
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)

# Create metric
scorer = FactualCorrectness(llm=llm)

# Evaluate
result = await scorer.ascore(
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris. It has a height of 1000ft."
)
print(f"Factual Correctness Score: {result.value}")
```

Output:

```
Factual Correctness Score: 0.67
```

By default, the mode is set to `f1`. You can change the mode to `precision` or `recall` by setting the `mode` parameter:

```python
# Precision mode - measures what fraction of response claims are supported by reference
scorer = FactualCorrectness(llm=llm, mode="precision")
result = await scorer.ascore(
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris. It has a height of 1000ft."
)
print(f"Precision Score: {result.value}")
```

Output:

```
Precision Score: 1.0
```

You can also configure the claim decomposition granularity using the `atomicity` and `coverage` parameters:

```python
# High granularity - more detailed claim decomposition
scorer = FactualCorrectness(
    llm=llm,
    mode="f1",
    atomicity="high",  # More atomic claims
    coverage="high"    # Comprehensive coverage
)
```

!!! note "Synchronous Usage"
    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:

    ```python
    result = scorer.score(
        response="The Eiffel Tower is located in Paris.",
        reference="The Eiffel Tower is located in Paris. It has a height of 1000ft."
    )
    ```

### How It's Calculated
The formulas for True Positive (TP), False Positive (FP), and False Negative (FN) are as follows:

$$
\text{True Positive (TP)} = \text{Number of claims in response that are present in reference}
$$

$$
\text{False Positive (FP)} = \text{Number of claims in response that are not present in reference}
$$

$$
\text{False Negative (FN)} = \text{Number of claims in reference that are not present in response}
$$

From these, the score for each `mode` is computed:

$$
\text{Precision} = {\text{TP} \over (\text{TP} + \text{FP})}
$$

$$
\text{Recall} = {\text{TP} \over (\text{TP} + \text{FN})}
$$

$$
\text{F1 Score} = {2 \times \text{Precision} \times \text{Recall} \over (\text{Precision} + \text{Recall})}
$$
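Applied to the Eiffel Tower example above, these formulas reproduce the reported score of 0.67. The claim decomposition in the comments below is illustrative; the actual claims are produced by the LLM and may vary:

```python
# Response claim:   "The Eiffel Tower is located in Paris."  -> supported by the reference
# Reference claims: "The Eiffel Tower is located in Paris."  (covered by the response)
#                   "It has a height of 1000ft."             (missing from the response)

tp = 1  # response claims present in the reference
fp = 0  # response claims not present in the reference
fn = 1  # reference claims not present in the response

precision = tp / (tp + fp)                          # 1.0
recall = tp / (tp + fn)                             # 0.5
f1 = 2 * precision * recall / (precision + recall)  # 0.666...

print(round(f1, 2))  # 0.67
```

This also explains the earlier `precision` mode result: the only response claim is supported by the reference, so precision is 1.0 even though the height claim is missed.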
### Controlling the Number of Claims

Each sentence in the response and reference can be broken down into one or more claims. The number of claims that are generated from a single sentence is determined by the level of `atomicity` and `coverage` required for your application.
By adjusting both atomicity and coverage, you can customize the level of detail in the claims to suit your application's needs:

- Use **Low Atomicity and Low Coverage** when only the key information is necessary, such as for summarization (see the configuration sketch below).

This flexibility in controlling the number of claims helps ensure that the information is presented at the right level of granularity for your application's requirements.
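For instance, that recommendation maps onto configurations like the following sketch (reusing the `llm` object from the earlier examples; the parameter values are the ones shown above):

```python
# Low atomicity and coverage: fewer, coarser claims - suitable when only
# the key information matters, e.g. grading summaries
summary_scorer = FactualCorrectness(llm=llm, atomicity="low", coverage="low")

# High atomicity and coverage: many fine-grained claims - suitable for
# strict, detailed fact verification
strict_scorer = FactualCorrectness(llm=llm, atomicity="high", coverage="high")
```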
## Legacy Metrics API

The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.

!!! warning "Deprecation Timeline"
    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.
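The mapping between the two APIs is mechanical. Below is a minimal side-by-side sketch, assuming `llm` and `evaluator_llm` are configured as in the examples on this page:

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._factual_correctness import FactualCorrectness as LegacyFactualCorrectness
from ragas.metrics.collections import FactualCorrectness

sample = SingleTurnSample(
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris. It has a height of 1000ft."
)

# Legacy pattern: wrap the inputs in a sample object and call single_turn_ascore
legacy_score = await LegacyFactualCorrectness(llm=evaluator_llm).single_turn_ascore(sample)

# Collections pattern: pass response and reference directly to ascore
result = await FactualCorrectness(llm=llm).ascore(
    response=sample.response,
    reference=sample.reference,
)
```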
### Example with SingleTurnSample

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._factual_correctness import FactualCorrectness

sample = SingleTurnSample(
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris. It has a height of 1000ft."
)

# evaluator_llm: a previously configured evaluator LLM wrapper
scorer = FactualCorrectness(llm=evaluator_llm)
await scorer.single_turn_ascore(sample)
```
Output:

```
0.67
```

### Changing the Mode
By default, the mode is set to `f1`. You can change the mode to `precision` or `recall` by setting the `mode` parameter:

```python
scorer = FactualCorrectness(llm=evaluator_llm, mode="precision")
await scorer.single_turn_ascore(sample)
```
Output:

```
1.0
```
### Controlling Atomicity

```python
scorer = FactualCorrectness(llm=evaluator_llm, mode="precision", atomicity="low")
await scorer.single_turn_ascore(sample)
```
Output:

```
1.0
```