Probabilistic LLM metrics are LLM-as-a-Judge metrics that provide score distributions with their associated confidence levels, enabling assessment of model certainty in its evaluations. These distributions are derived from the model's token-level log probabilities.
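For intuition, the sketch below shows one way such a distribution can be obtained from the log probabilities of the candidate score tokens. It is purely illustrative: the log-probability values are made up, and this is not the library's internal implementation.

```python
import math

# Hypothetical log probabilities returned by the judge model for the
# single-token scores "1", "2" and "3" (made-up numbers for illustration).
logprobs = {"1": -3.2, "2": -1.1, "3": -0.4}

# Exponentiate the log probabilities and normalize so the candidate
# scores form a proper probability distribution.
probs = {score: math.exp(lp) for score, lp in logprobs.items()}
total = sum(probs.values())
distribution = {score: p / total for score, p in probs.items()}

# The most likely score and its probability (the model's "confidence").
best_score = max(distribution, key=distribution.get)
print(distribution)                     # ≈ {'1': 0.04, '2': 0.32, '3': 0.64}
print(best_score, distribution[best_score])
```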
## Custom Probabilistic Metric

Similar to the custom LLM-as-a-Judge metric, you can define your own probabilistic metric by extending the `ProbabilisticCustomMetric` class.
```python
from continuous_eval.metrics.base.metric import Arg
from continuous_eval.metrics.base.response_type import Integer
from continuous_eval.metrics.custom import ProbabilisticCustomMetric

rubric = """1: The joke is not funny or inappropriate.
2: The joke is somewhat funny and appropriate.
3: The joke is very funny and appropriate."""

metric = ProbabilisticCustomMetric(
    name="FunnyJoke",
    criteria="Joke is funny and appropriate",
    rubric=rubric,
    arguments={"joke": Arg(type=str, description="The joke to evaluate.")},
    response_format=Integer(ge=1, le=3),
)

print(metric(
    joke="""Scientists released a new way to measure AI performance.
It's so accurate, even the AI said, ‘Finally, someone understands me!’"""
))
```
Optionally, you can also add examples to the metric.

> Note: See the [limitations section](#current-limitations) for more information about the response format.

#### Example Output
```py
{
    'FunnyJoke_score': 3,
    'FunnyJoke_reasoning': 'The joke is clever as it plays on the idea of AI having feelings.',
}
```
Sometimes the criteria, rubric, and examples are not enough to define the metric. In that case, you can define your own probabilistic metric by extending the `ProbabilisticMetric` class.

## Current Limitations

1. The `response_format` must be a **single token value**. We provide a few predefined ones: `GoodOrBad`, `YesOrNo`, `Boolean`, and `Integer`, but it is possible to define your own. For integer scoring, negative values are not supported (they tokenize as two tokens), nor are values greater than 9 (see the token-count check sketched below).
2. Arbitrary JSON response formats are not yet supported for probabilistic metrics.
3. At the moment, **only OpenAI models are supported for probabilistic metrics**.
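If you are unsure whether a custom response value satisfies the single-token constraint, you can inspect its tokenization directly. The sketch below is illustrative only: it uses the separate `tiktoken` package (not part of continuous-eval) and assumes the `cl100k_base` encoding.

```python
import tiktoken

# Tokenizer used by many recent OpenAI chat models (assumed for illustration).
enc = tiktoken.get_encoding("cl100k_base")

# Candidate response values to check against the single-token constraint.
for value in ["Yes", "No", "Good", "Bad", "3", "-1", "10"]:
    tokens = enc.encode(value)
    status = "single token" if len(tokens) == 1 else f"{len(tokens)} tokens"
    print(f"{value!r}: {status}")
```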