LLM-as-a-Judge evaluators in Langfuse can now return boolean scores in addition to numeric and categorical ones. This makes it easier to model simple decisions directly as native `true` or `false` scores and analyze them across your existing score tooling.
This is especially useful when the right answer is a binary judgment:
- Detect `User Disagreement` as `true` or `false`
- Detect `Out-of-Scope Request` as `true` or `false`
- Detect `Insufficient Answer` as `true` or `false`
Numeric scores are still the right fit for continuous dimensions like helpfulness or faithfulness. Categorical scores remain best when you need more than two explicit labels. Boolean scores are the simplest option when the evaluator should return `true` or `false`. For concrete prompt examples, see [LLM-as-a-Judge for Production Monitoring](/blog/2026-04-01-llm-as-a-judge-production-monitoring).
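To make the boolean case concrete, here is a sketch of what a boolean judge verdict might look like once parsed. The field names (`reasoning`, `score`) are illustrative assumptions, not a fixed Langfuse schema — the actual shape depends on how you write your evaluator prompt:

```python
import json

# Hypothetical structured output from a boolean LLM-as-a-Judge evaluator.
# The judge answers one binary question and explains itself; "score" maps
# directly onto a native boolean score.
judge_response = json.loads("""
{
  "reasoning": "The user explicitly rejects the assistant's suggestion twice.",
  "score": true
}
""")

# A boolean verdict needs no threshold or label mapping before ingestion.
assert isinstance(judge_response["score"], bool)
print(judge_response["score"])  # -> True
```

The appeal over a numeric score with a cutoff is that there is no threshold to pick or document: the judge's decision is the score.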
## What's New
- Choose `Boolean` when creating a custom LLM-as-a-Judge evaluator
- Store `true` / `false` outcomes as native boolean scores
- Analyze boolean evaluator outputs in dashboards, filters, and score analytics alongside your existing scores
## content/docs/evaluation/evaluation-methods/llm-as-a-judge.mdx (10 additions, 10 deletions)
```diff
@@ -26,7 +26,7 @@ A typical LLM-as-a-Judge prompt includes:
 3. **Output to evaluate** — the application's response
 4. **Optional reference** — ground truth or expected output for comparison

-The judge model then returns a structured score and reasoning that can be tracked, aggregated, and analyzed over time. In Langfuse, that score can be numeric or categorical. Use numeric scores for continuous judgments like helpfulness from `0` to `1`. Use categorical scores when you want explicit labels such as `correct`, `partially_correct`, or `incorrect`.
+The judge model then returns a structured score and reasoning that can be tracked, aggregated, and analyzed over time. In Langfuse, that score can be numeric, categorical, or boolean. Use numeric scores for continuous judgments like helpfulness from `0` to `1`. Use categorical scores when you want explicit labels such as `correct`, `partially_correct`, or `incorrect`. Use boolean scores for binary decisions where the outcome is `true` or `false`, such as whether a user is disagreeing with the assistant, whether a request is out-of-scope, or whether an answer violates policy. For more production-monitoring examples, see [LLM-as-a-Judge for Production Monitoring](/blog/2026-04-01-llm-as-a-judge-production-monitoring).

 Run evaluators on individual observations within your traces—such as LLM calls, retrieval operations, embedding generations, or tool calls.
```
```diff
@@ -232,9 +232,9 @@ Langfuse ships a growing catalog of evaluators built and maintained by us and pa
 When the library doesn't fit your specific needs, add your own:

 1. Draft an evaluation prompt with `{{variables}}` placeholders (`input`, `output`, `ground_truth` ...).
-2. Choose a **score type**: use **Numeric** for gradients like helpfulness or faithfulness, or **Categorical** for discrete labels.
-3. If you choose **Categorical**, define the allowed categories. Optionally enable **Allow multiple matches** if more than one label can apply. Langfuse will create one score per selected category.
-4. Optional: Customize the **score reasoning** prompt and the numeric **score output** prompt or categorical **category selection** prompt.
+2. Choose a **score type**: use **Numeric** for gradients like helpfulness or faithfulness, **Categorical** for discrete labels, or **Boolean** for `true` / `false` decisions such as `User Disagreement`, `Out-of-Scope Request`, or `Insufficient Answer`.
+3. If you choose **Categorical**, define the allowed categories. Optionally enable **Allow multiple matches** if more than one label can apply. Langfuse will create one score per selected category. **Boolean** evaluators do not require a category list.
+4. Optional: Customize the **score reasoning** prompt and the output prompt for your selected score type.
 5. Optional: Pin a custom dedicated model for this evaluator. If no custom model is specified, it will use the default evaluation model (see Section 2).
 6. Save → the evaluator can now be reused across your project.
```
```diff
@@ -243,12 +243,12 @@ When the library doesn't fit your specific needs, add your own:
 ### Choose which Data to Evaluate

-With your evaluator and model selected, configure which data to run the evaluations on. See the [How it works](#how-it-works) section above to understand which option fits your use case.
+With your evaluator and model selected, configure which data to run the evaluations on. See the [Understanding Each Evaluation Target](#understanding-each-evaluation-target) section above to understand which option fits your use case.

-<Tabs items={["Live Production Data", "Offline Experiment Data"]} storageKey="eval-data-type">
+<Tabs items={["Live Production Data", "Offline Experiment Data"]}>
```
```diff
@@ -318,7 +318,7 @@ We recommend migrating to [observation-level evaluators](/faq/all/llm-as-a-judge
 You now need to teach Langfuse _which properties_ of your observation, trace, or experiment item represent the actual data to populate these variables for a sensible evaluation. For instance, you might map your system's logged observation input to the prompt's `{{input}}` variable, and the LLM response (observation output) to the prompt's `{{output}}` variable. This mapping is crucial for ensuring the evaluation is sensible and relevant.

-<Tabs items={["Live Production Data", "Offline Experiment Data"]} storageKey="eval-data-type">
+<Tabs items={["Live Production Data", "Offline Experiment Data"]}>

 <Tab>

 - **Prompt Preview**: As you configure the mapping, Langfuse shows a **live preview of the evaluation prompt populated with actual data**. This preview uses historical data from the last 24 hours that matched your filters. You can navigate through several examples to see how their respective data fills the prompt, helping you build confidence that the mapping is correct.
```
## content/faq/all/what-are-scores.mdx (3 additions, 3 deletions)
```diff
@@ -8,7 +8,7 @@ tags: [evaluation]
 Scores are Langfuse's universal data object for storing **evaluation results**. Any time you want to assign a quality judgment to an LLM output, whether by a human, an LLM judge, a programmatic check, or end-user feedback, the result is stored as a score.

-Every score has a **name** (like `"correctness"` or `"helpfulness"`) and a **value**. The value can be one of three data types: numeric, categorical, or boolean.
+Every score has a **name** (like `"correctness"` or `"helpfulness"`) and a **value**. The value can be one of three data types: numeric, categorical, or boolean (`true` or `false`).

 Scores can be attached to [traces](/docs/observability/data-model#traces), [observations](/docs/observability/data-model#observations), [sessions](/docs/observability/data-model#sessions), or [dataset runs](/docs/evaluation/experiments/data-model). Most commonly, scores are attached to traces to evaluate a single end-to-end interaction.
@@ -18,7 +18,7 @@ Scores become useful when you want to go beyond observing what your application
 - **Collecting user feedback**: Capture thumbs up/down or star ratings from your users and attach them to traces. See the [user feedback guide](/docs/observability/features/user-feedback).

-- **Monitoring production quality**: Set up automated evaluators (like [LLM-as-a-Judge](/docs/evaluation/evaluation-methods/llm-as-a-judge)) to continuously score live traces for things like hallucination, relevance, tone, or discrete outcomes such as `correct`, `partially_correct`, and `incorrect`.
+- **Monitoring production quality**: Set up automated evaluators (like [LLM-as-a-Judge](/docs/evaluation/evaluation-methods/llm-as-a-judge)) to continuously score live traces for things like hallucination, relevance, tone, boolean checks that return `true` or `false`, or discrete outcomes such as `correct`, `partially_correct`, and `incorrect`.

 - **Running guardrails**: Score whether outputs pass safety checks, like PII detection, format validation, or content policy compliance. These programmatic checks run in your application and write results back as scores.
@@ -30,7 +30,7 @@ Once you have scores, they show up in [score analytics](/docs/evaluation/evaluat
 There are four ways to add scores:

-- **LLM-as-a-Judge**: Set up [automated evaluators](/docs/evaluation/evaluation-methods/llm-as-a-judge) that score traces based on custom criteria (e.g. hallucination, tone, relevance). These can return numeric or categorical scores plus reasoning, and can run on live production traces or on experiment results.
+- **LLM-as-a-Judge**: Set up [automated evaluators](/docs/evaluation/evaluation-methods/llm-as-a-judge) that score traces based on custom criteria (e.g. hallucination, tone, relevance). These can return numeric, categorical, or boolean (`true` / `false`) scores plus reasoning, and can run on live production traces or on experiment results.

 - **Annotation in the UI**: Team members [manually score](/docs/evaluation/evaluation-methods/annotation) traces, observations, or sessions directly in the Langfuse dashboard. Requires a [score config](/faq/all/manage-score-configs) to be set up first.
```