
Commit 59ca5f8

annabellscha, cursoragent, and felixkrrr authored
proposal: changelog boolean (#2779)
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: felixkrrr <felixkrrr@users.noreply.github.com>
Co-authored-by: felixkrrr <57024447+felixkrrr@users.noreply.github.com>
1 parent 29a5bd1 commit 59ca5f8

File tree

4 files changed: +49, -13 lines changed

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+---
+date: 2026-04-08
+title: Boolean LLM-as-a-Judge Scores
+description: LLM-as-a-Judge evaluators can now return boolean scores for `true` / `false` decisions.
+ogImage: /images/changelog/2026-04-08-boolean-llm-as-a-judge-scores.jpg
+author: Hassieb
+canonical: /docs/evaluation/evaluation-methods/llm-as-a-judge
+---
+
+import { ChangelogHeader } from "@/components/changelog/ChangelogHeader";
+import { Book } from "lucide-react";
+
+<ChangelogHeader />
+
+LLM-as-a-Judge evaluators in Langfuse can now return boolean scores in addition to numeric and categorical ones. This makes it easier to model simple decisions directly as native `true` or `false` scores and analyze them across your existing score tooling.
+
+This is especially useful when the right answer is a binary judgment:
+
+- Detect `User Disagreement` as `true` or `false`
+- Detect `Out-of-Scope Request` as `true` or `false`
+- Detect `Insufficient Answer` as `true` or `false`
+
+Numeric scores are still the right fit for continuous dimensions like helpfulness or faithfulness. Categorical scores remain best when you need more than two explicit labels. Boolean scores are the simplest option when the evaluator should return `true` or `false`. For concrete prompt examples, see [LLM-as-a-Judge for Production Monitoring](/blog/2026-04-01-llm-as-a-judge-production-monitoring).
+
+## What's New
+
+- Choose `Boolean` when creating a custom LLM-as-a-Judge evaluator
+- Store `true` / `false` outcomes as native boolean scores
+- Analyze boolean evaluator outputs in dashboards, filters, and score analytics alongside your existing scores
+
+## Get started
+
+<Cards num={2}>
+<Card title="LLM-as-a-Judge Documentation" href="/docs/evaluation/evaluation-methods/llm-as-a-judge" icon={<Book />} />
+<Card title="What Are Scores?" href="/faq/all/what-are-scores" icon={<Book />} />
+</Cards>
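Editor's note: to make "store `true` / `false` outcomes as native boolean scores" concrete, here is a minimal sketch of recording a boolean score via Langfuse's public scores endpoint. The endpoint and `dataType` field exist in the public API, but the host, keys, trace ID, score name, and the `0`/`1` encoding of boolean values shown here are assumptions for illustration; verify the exact shape against the current API reference.

```python
import requests

# Minimal sketch: record a true/false judgment as a boolean score on a trace.
# Keys, host, and trace ID are hypothetical placeholders; the 0/1 encoding of
# boolean values is an assumption to check against the API reference.
LANGFUSE_HOST = "https://cloud.langfuse.com"
AUTH = ("pk-lf-...", "sk-lf-...")  # public key / secret key (HTTP basic auth)

resp = requests.post(
    f"{LANGFUSE_HOST}/api/public/scores",
    auth=AUTH,
    json={
        "traceId": "trace-123",       # hypothetical trace ID
        "name": "user_disagreement",
        "value": 1,                   # assumed encoding: 1 = true, 0 = false
        "dataType": "BOOLEAN",
        "comment": "User explicitly contradicted the assistant's answer.",
    },
    timeout=10,
)
resp.raise_for_status()
```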

content/docs/evaluation/evaluation-methods/llm-as-a-judge.mdx

Lines changed: 10 additions & 10 deletions
@@ -26,7 +26,7 @@ A typical LLM-as-a-Judge prompt includes:
3. **Output to evaluate** — the application's response
4. **Optional reference** — ground truth or expected output for comparison

-The judge model then returns a structured score and reasoning that can be tracked, aggregated, and analyzed over time. In Langfuse, that score can be numeric or categorical. Use numeric scores for continuous judgments like helpfulness from `0` to `1`. Use categorical scores when you want explicit labels such as `correct`, `partially_correct`, or `incorrect`.
+The judge model then returns a structured score and reasoning that can be tracked, aggregated, and analyzed over time. In Langfuse, that score can be numeric, categorical, or boolean. Use numeric scores for continuous judgments like helpfulness from `0` to `1`. Use categorical scores when you want explicit labels such as `correct`, `partially_correct`, or `incorrect`. Use boolean scores for binary decisions where the outcome is `true` or `false`, such as whether a user is disagreeing with the assistant, whether a request is out-of-scope, or whether an answer violates policy. For more production-monitoring examples, see [LLM-as-a-Judge for Production Monitoring](/blog/2026-04-01-llm-as-a-judge-production-monitoring).

## Why use LLM-as-a-Judge?

@@ -111,12 +111,12 @@ import { Card, CardContent, CardDescription, CardHeader, CardTitle } from "@/com

### Understanding Each Evaluation Target

-<Tabs items={["Live Production Data", "Offline Experiment Data"]} storageKey="eval-data-type">
+<Tabs items={["Live Production Data", "Offline Experiment Data"]}>
<Tab>

Evaluate live production traffic to monitor your LLM application performance in real-time.

-<Tabs items={["Observations (Recommended)", "Traces (Legacy)"]} storageKey="eval-live-target">
+<Tabs items={["Observations (Recommended)", "Traces (Legacy)"]}>
<Tab>

Run evaluators on individual observations within your traces—such as LLM calls, retrieval operations, embedding generations, or tool calls.
@@ -232,9 +232,9 @@ Langfuse ships a growing catalog of evaluators built and maintained by us and pa
When the library doesn't fit your specific needs, add your own:

1. Draft an evaluation prompt with `{{variables}}` placeholders (`input`, `output`, `ground_truth` ...).
-2. Choose a **score type**: use **Numeric** for gradients like helpfulness or faithfulness, or **Categorical** for discrete labels.
-3. If you choose **Categorical**, define the allowed categories. Optionally enable **Allow multiple matches** if more than one label can apply. Langfuse will create one score per selected category.
-4. Optional: Customize the **score reasoning** prompt and the numeric **score output** prompt or categorical **category selection** prompt.
+2. Choose a **score type**: use **Numeric** for gradients like helpfulness or faithfulness, **Categorical** for discrete labels, or **Boolean** for `true` / `false` decisions such as `User Disagreement`, `Out-of-Scope Request`, or `Insufficient Answer`.
+3. If you choose **Categorical**, define the allowed categories. Optionally enable **Allow multiple matches** if more than one label can apply. Langfuse will create one score per selected category. **Boolean** evaluators do not require a category list.
+4. Optional: Customize the **score reasoning** prompt and the output prompt for your selected score type.
5. Optional: Pin a custom dedicated model for this evaluator. If no custom model is specified, it will use the default evaluation model (see Section 2).
6. Save → the evaluator can now be reused across your project.
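Editor's note: to picture steps 1 and 2 with the new **Boolean** option, here is an illustrative evaluator prompt using `{{variables}}` placeholders, plus the kind of structured verdict a judge model might return. The prompt wording and the verdict field names are invented for illustration and are not Langfuse's internal schema.

```python
# Illustrative only: a boolean evaluator prompt with {{variables}} placeholders.
# The wording and the JSON verdict shape are assumptions, not Langfuse internals.
JUDGE_PROMPT = """\
You are evaluating a support assistant.

User input:
{{input}}

Assistant output:
{{output}}

Did the user express disagreement with the assistant?
Answer strictly as JSON: {"reasoning": "<one sentence>", "value": true | false}
"""

# One verdict such a judge might return; conceptually, `value` would be stored
# as a native boolean score named after the evaluator.
example_verdict = {
    "reasoning": "The user said the proposed fix did not work and asked again.",
    "value": True,
}
```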
@@ -243,12 +243,12 @@ When the library doesn't fit your specific needs, add your own:

### Choose which Data to Evaluate

-With your evaluator and model selected, configure which data to run the evaluations on. See the [How it works](#how-it-works) section above to understand which option fits your use case.
+With your evaluator and model selected, configure which data to run the evaluations on. See the [Understanding Each Evaluation Target](#understanding-each-evaluation-target) section above to understand which option fits your use case.

-<Tabs items={["Live Production Data", "Offline Experiment Data"]} storageKey="eval-data-type">
+<Tabs items={["Live Production Data", "Offline Experiment Data"]}>
<Tab>

-<Tabs items={["Observations (Recommended)", "Traces (Legacy)"]} storageKey="eval-live-target">
+<Tabs items={["Observations (Recommended)", "Traces (Legacy)"]}>
<Tab>

**Configuration Steps**
@@ -318,7 +318,7 @@ We recommend migrating to [observation-level evaluators](/faq/all/llm-as-a-judge

You now need to teach Langfuse _which properties_ of your observation, trace, or experiment item represent the actual data to populate these variables for a sensible evaluation. For instance, you might map your system's logged observation input to the prompt's `{{input}}` variable, and the LLM response (observation output) to the prompt's `{{output}}` variable. This mapping is crucial for ensuring the evaluation is sensible and relevant.

-<Tabs items={["Live Production Data", "Offline Experiment Data"]} storageKey="eval-data-type">
+<Tabs items={["Live Production Data", "Offline Experiment Data"]}>
<Tab>

- **Prompt Preview**: As you configure the mapping, Langfuse shows a **live preview of the evaluation prompt populated with actual data**. This preview uses historical data from the last 24 hours that matched your filters. You can navigate through several examples to see how their respective data fills the prompt, helping you build confidence that the mapping is correct.
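Editor's note: the variable mapping described above is easy to picture as plain template substitution, where mapped observation fields replace the `{{variable}}` placeholders before the judge model runs. A minimal sketch with hypothetical data; the regex fill stands in for whatever Langfuse does internally:

```python
import re

# Hypothetical logged observation; field names are placeholders.
observation = {
    "input": "How do I rotate my API key?",
    "output": "You can rotate keys under Project Settings > API Keys.",
}

template = "User input:\n{{input}}\n\nAssistant output:\n{{output}}"

# Map observation properties onto prompt variables (input -> {{input}}, ...).
mapping = {"input": observation["input"], "output": observation["output"]}

# Substitute each {{variable}} with its mapped value, leaving unknowns intact.
filled = re.sub(
    r"\{\{(\w+)\}\}",
    lambda m: mapping.get(m.group(1), m.group(0)),
    template,
)
print(filled)  # roughly what the live "Prompt Preview" concept shows per example
```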

content/faq/all/what-are-scores.mdx

Lines changed: 3 additions & 3 deletions
@@ -8,7 +8,7 @@ tags: [evaluation]

Scores are Langfuse's universal data object for storing **evaluation results**. Any time you want to assign a quality judgment to an LLM output, whether by a human, an LLM judge, a programmatic check, or end-user feedback, the result is stored as a score.

-Every score has a **name** (like `"correctness"` or `"helpfulness"`) and a **value**. The value can be one of three data types: numeric, categorical, or boolean.
+Every score has a **name** (like `"correctness"` or `"helpfulness"`) and a **value**. The value can be one of three data types: numeric, categorical, or boolean (`true` or `false`).

Scores can be attached to [traces](/docs/observability/data-model#traces), [observations](/docs/observability/data-model#observations), [sessions](/docs/observability/data-model#sessions), or [dataset runs](/docs/evaluation/experiments/data-model). Most commonly, scores are attached to traces to evaluate a single end-to-end interaction.

@@ -18,7 +18,7 @@ Scores become useful when you want to go beyond observing what your application

- **Collecting user feedback**: Capture thumbs up/down or star ratings from your users and attach them to traces. See the [user feedback guide](/docs/observability/features/user-feedback).

-- **Monitoring production quality**: Set up automated evaluators (like [LLM-as-a-Judge](/docs/evaluation/evaluation-methods/llm-as-a-judge)) to continuously score live traces for things like hallucination, relevance, tone, or discrete outcomes such as `correct`, `partially_correct`, and `incorrect`.
+- **Monitoring production quality**: Set up automated evaluators (like [LLM-as-a-Judge](/docs/evaluation/evaluation-methods/llm-as-a-judge)) to continuously score live traces for things like hallucination, relevance, tone, boolean checks that return `true` or `false`, or discrete outcomes such as `correct`, `partially_correct`, and `incorrect`.

- **Running guardrails**: Score whether outputs pass safety checks, like PII detection, format validation, or content policy compliance. These programmatic checks run in your application and write results back as scores.

@@ -30,7 +30,7 @@ Once you have scores, they show up in [score analytics](/docs/evaluation/evaluat

There are four ways to add scores:

-- **LLM-as-a-Judge**: Set up [automated evaluators](/docs/evaluation/evaluation-methods/llm-as-a-judge) that score traces based on custom criteria (e.g. hallucination, tone, relevance). These can return numeric or categorical scores plus reasoning, and can run on live production traces or on experiment results.
+- **LLM-as-a-Judge**: Set up [automated evaluators](/docs/evaluation/evaluation-methods/llm-as-a-judge) that score traces based on custom criteria (e.g. hallucination, tone, relevance). These can return numeric, categorical, or boolean (`true` / `false`) scores plus reasoning, and can run on live production traces or on experiment results.

- **Annotation in the UI**: Team members [manually score](/docs/evaluation/evaluation-methods/annotation) traces, observations, or sessions directly in the Langfuse dashboard. Requires a [score config](/faq/all/manage-score-configs) to be set up first.

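Editor's note: for readers comparing the three data types mentioned in this diff, here are illustrative score payloads side by side. The shapes loosely follow the public scores API, but the exact field names and the `0`/`1` boolean encoding are assumptions to check against the current reference.

```python
# Illustrative payloads for the three score data types (field names assumed;
# verify against the current public API reference).
numeric_score = {"name": "helpfulness", "value": 0.8, "dataType": "NUMERIC"}

categorical_score = {
    "name": "correctness",
    "value": "partially_correct",
    "dataType": "CATEGORICAL",
}

# Assumed encoding for boolean scores: 1 = true, 0 = false.
boolean_score = {"name": "out_of_scope_request", "value": 0, "dataType": "BOOLEAN"}
```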
877 KB (binary file)
