
Commit 59ca5f8

annabellscha, cursoragent, and felixkrrr authored
proposal: changelog boolean (#2779)
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: felixkrrr <felixkrrr@users.noreply.github.com>
Co-authored-by: felixkrrr <57024447+felixkrrr@users.noreply.github.com>
1 parent 29a5bd1 commit 59ca5f8

File tree

4 files changed: +49, -13 lines changed

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+---
+date: 2026-04-08
+title: Boolean LLM-as-a-Judge Scores
+description: LLM-as-a-Judge evaluators can now return boolean scores for `true` / `false` decisions.
+ogImage: /images/changelog/2026-04-08-boolean-llm-as-a-judge-scores.jpg
+author: Hassieb
+canonical: /docs/evaluation/evaluation-methods/llm-as-a-judge
+---
+
+import { ChangelogHeader } from "@/components/changelog/ChangelogHeader";
+import { Book } from "lucide-react";
+
+<ChangelogHeader />
+
+LLM-as-a-Judge evaluators in Langfuse can now return boolean scores in addition to numeric and categorical ones. This makes it easier to model simple decisions directly as native `true` or `false` scores and analyze them across your existing score tooling.
+
+This is especially useful when the right answer is a binary judgment:
+
+- Detect `User Disagreement` as `true` or `false`
+- Detect `Out-of-Scope Request` as `true` or `false`
+- Detect `Insufficient Answer` as `true` or `false`
+
+Numeric scores are still the right fit for continuous dimensions like helpfulness or faithfulness. Categorical scores remain best when you need more than two explicit labels. Boolean scores are the simplest option when the evaluator should return `true` or `false`. For concrete prompt examples, see [LLM-as-a-Judge for Production Monitoring](/blog/2026-04-01-llm-as-a-judge-production-monitoring).
+
+## What's New
+
+- Choose `Boolean` when creating a custom LLM-as-a-Judge evaluator
+- Store `true` / `false` outcomes as native boolean scores
+- Analyze boolean evaluator outputs in dashboards, filters, and score analytics alongside your existing scores
+
+## Get started
+
+<Cards num={2}>
+<Card title="LLM-as-a-Judge Documentation" href="/docs/evaluation/evaluation-methods/llm-as-a-judge" icon={<Book />} />
+<Card title="What Are Scores?" href="/faq/all/what-are-scores" icon={<Book />} />
+</Cards>
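Editor's note: to make "store `true` / `false` outcomes as native boolean scores" concrete, here is a minimal sketch of recording a boolean score via Langfuse's public scores endpoint. The endpoint and `dataType` field exist in the public API, but the host, keys, trace ID, score name, and the `0`/`1` encoding of boolean values shown here are assumptions for illustration; verify the exact shape against the current API reference.

```python
import requests

# Minimal sketch: record a true/false judgment as a boolean score on a trace.
# Keys, host, and trace ID are hypothetical placeholders; the 0/1 encoding of
# boolean values is an assumption to check against the API reference.
LANGFUSE_HOST = "https://cloud.langfuse.com"
AUTH = ("pk-lf-...", "sk-lf-...")  # public key / secret key (HTTP basic auth)

resp = requests.post(
    f"{LANGFUSE_HOST}/api/public/scores",
    auth=AUTH,
    json={
        "traceId": "trace-123",       # hypothetical trace ID
        "name": "user_disagreement",
        "value": 1,                   # assumed encoding: 1 = true, 0 = false
        "dataType": "BOOLEAN",
        "comment": "User explicitly contradicted the assistant's answer.",
    },
    timeout=10,
)
resp.raise_for_status()
```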

content/docs/evaluation/evaluation-methods/llm-as-a-judge.mdx

Lines changed: 10 additions & 10 deletions
@@ -26,7 +26,7 @@ A typical LLM-as-a-Judge prompt includes:
3. **Output to evaluate** — the application's response
4. **Optional reference** — ground truth or expected output for comparison

-The judge model then returns a structured score and reasoning that can be tracked, aggregated, and analyzed over time. In Langfuse, that score can be numeric or categorical. Use numeric scores for continuous judgments like helpfulness from `0` to `1`. Use categorical scores when you want explicit labels such as `correct`, `partially_correct`, or `incorrect`.
+The judge model then returns a structured score and reasoning that can be tracked, aggregated, and analyzed over time. In Langfuse, that score can be numeric, categorical, or boolean. Use numeric scores for continuous judgments like helpfulness from `0` to `1`. Use categorical scores when you want explicit labels such as `correct`, `partially_correct`, or `incorrect`. Use boolean scores for binary decisions where the outcome is `true` or `false`, such as whether a user is disagreeing with the assistant, whether a request is out-of-scope, or whether an answer violates policy. For more production-monitoring examples, see [LLM-as-a-Judge for Production Monitoring](/blog/2026-04-01-llm-as-a-judge-production-monitoring).

## Why use LLM-as-a-Judge?

@@ -111,12 +111,12 @@ import { Card, CardContent, CardDescription, CardHeader, CardTitle } from "@/com

### Understanding Each Evaluation Target

-<Tabs items={["Live Production Data", "Offline Experiment Data"]} storageKey="eval-data-type">
+<Tabs items={["Live Production Data", "Offline Experiment Data"]}>
<Tab>

Evaluate live production traffic to monitor your LLM application performance in real-time.

-<Tabs items={["Observations (Recommended)", "Traces (Legacy)"]} storageKey="eval-live-target">
+<Tabs items={["Observations (Recommended)", "Traces (Legacy)"]}>
<Tab>

Run evaluators on individual observations within your traces—such as LLM calls, retrieval operations, embedding generations, or tool calls.
@@ -232,9 +232,9 @@ Langfuse ships a growing catalog of evaluators built and maintained by us and pa
When the library doesn't fit your specific needs, add your own:

1. Draft an evaluation prompt with `{{variables}}` placeholders (`input`, `output`, `ground_truth` ...).
-2. Choose a **score type**: use **Numeric** for gradients like helpfulness or faithfulness, or **Categorical** for discrete labels.
-3. If you choose **Categorical**, define the allowed categories. Optionally enable **Allow multiple matches** if more than one label can apply. Langfuse will create one score per selected category.
-4. Optional: Customize the **score reasoning** prompt and the numeric **score output** prompt or categorical **category selection** prompt.
+2. Choose a **score type**: use **Numeric** for gradients like helpfulness or faithfulness, **Categorical** for discrete labels, or **Boolean** for `true` / `false` decisions such as `User Disagreement`, `Out-of-Scope Request`, or `Insufficient Answer`.
+3. If you choose **Categorical**, define the allowed categories. Optionally enable **Allow multiple matches** if more than one label can apply. Langfuse will create one score per selected category. **Boolean** evaluators do not require a category list.
+4. Optional: Customize the **score reasoning** prompt and the output prompt for your selected score type.
5. Optional: Pin a custom dedicated model for this evaluator. If no custom model is specified, it will use the default evaluation model (see Section 2).
6. Save → the evaluator can now be reused across your project.
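Editor's note: to picture steps 1 and 2 with the new **Boolean** option, here is an illustrative evaluator prompt using `{{variables}}` placeholders, plus the kind of structured verdict a judge model might return. The prompt wording and the verdict field names are invented for illustration and are not Langfuse's internal schema.

```python
# Illustrative only: a boolean evaluator prompt with {{variables}} placeholders.
# The wording and the JSON verdict shape are assumptions, not Langfuse internals.
JUDGE_PROMPT = """\
You are evaluating a support assistant.

User input:
{{input}}

Assistant output:
{{output}}

Did the user express disagreement with the assistant?
Answer strictly as JSON: {"reasoning": "<one sentence>", "value": true | false}
"""

# One verdict such a judge might return; conceptually, `value` would be stored
# as a native boolean score named after the evaluator.
example_verdict = {
    "reasoning": "The user said the proposed fix did not work and asked again.",
    "value": True,
}
```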
@@ -243,12 +243,12 @@ When the library doesn't fit your specific needs, add your own:

### Choose which Data to Evaluate

-With your evaluator and model selected, configure which data to run the evaluations on. See the [How it works](#how-it-works) section above to understand which option fits your use case.
+With your evaluator and model selected, configure which data to run the evaluations on. See the [Understanding Each Evaluation Target](#understanding-each-evaluation-target) section above to understand which option fits your use case.

-<Tabs items={["Live Production Data", "Offline Experiment Data"]} storageKey="eval-data-type">
+<Tabs items={["Live Production Data", "Offline Experiment Data"]}>
<Tab>

-<Tabs items={["Observations (Recommended)", "Traces (Legacy)"]} storageKey="eval-live-target">
+<Tabs items={["Observations (Recommended)", "Traces (Legacy)"]}>
<Tab>

**Configuration Steps**
@@ -318,7 +318,7 @@ We recommend migrating to [observation-level evaluators](/faq/all/llm-as-a-judge

You now need to teach Langfuse _which properties_ of your observation, trace, or experiment item represent the actual data to populate these variables for a sensible evaluation. For instance, you might map your system's logged observation input to the prompt's `{{input}}` variable, and the LLM response (observation output) to the prompt's `{{output}}` variable. This mapping is crucial for ensuring the evaluation is sensible and relevant.

-<Tabs items={["Live Production Data", "Offline Experiment Data"]} storageKey="eval-data-type">
+<Tabs items={["Live Production Data", "Offline Experiment Data"]}>
<Tab>

- **Prompt Preview**: As you configure the mapping, Langfuse shows a **live preview of the evaluation prompt populated with actual data**. This preview uses historical data from the last 24 hours that matched your filters. You can navigate through several examples to see how their respective data fills the prompt, helping you build confidence that the mapping is correct.
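Editor's note: the variable mapping described above is easy to picture as plain template substitution, where mapped observation fields replace the `{{variable}}` placeholders before the judge model runs. A minimal sketch with hypothetical data; the regex fill stands in for whatever Langfuse does internally:

```python
import re

# Hypothetical logged observation; field names are placeholders.
observation = {
    "input": "How do I rotate my API key?",
    "output": "You can rotate keys under Project Settings > API Keys.",
}

template = "User input:\n{{input}}\n\nAssistant output:\n{{output}}"

# Map observation properties onto prompt variables (input -> {{input}}, ...).
mapping = {"input": observation["input"], "output": observation["output"]}

# Substitute each {{variable}} with its mapped value, leaving unknowns intact.
filled = re.sub(
    r"\{\{(\w+)\}\}",
    lambda m: mapping.get(m.group(1), m.group(0)),
    template,
)
print(filled)  # roughly what the live "Prompt Preview" concept shows per example
```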

content/faq/all/what-are-scores.mdx

Lines changed: 3 additions & 3 deletions
@@ -8,7 +8,7 @@ tags: [evaluation]

Scores are Langfuse's universal data object for storing **evaluation results**. Any time you want to assign a quality judgment to an LLM output, whether by a human, an LLM judge, a programmatic check, or end-user feedback, the result is stored as a score.

-Every score has a **name** (like `"correctness"` or `"helpfulness"`) and a **value**. The value can be one of three data types: numeric, categorical, or boolean.
+Every score has a **name** (like `"correctness"` or `"helpfulness"`) and a **value**. The value can be one of three data types: numeric, categorical, or boolean (`true` or `false`).

Scores can be attached to [traces](/docs/observability/data-model#traces), [observations](/docs/observability/data-model#observations), [sessions](/docs/observability/data-model#sessions), or [dataset runs](/docs/evaluation/experiments/data-model). Most commonly, scores are attached to traces to evaluate a single end-to-end interaction.

@@ -18,7 +18,7 @@ Scores become useful when you want to go beyond observing what your application

- **Collecting user feedback**: Capture thumbs up/down or star ratings from your users and attach them to traces. See the [user feedback guide](/docs/observability/features/user-feedback).

-- **Monitoring production quality**: Set up automated evaluators (like [LLM-as-a-Judge](/docs/evaluation/evaluation-methods/llm-as-a-judge)) to continuously score live traces for things like hallucination, relevance, tone, or discrete outcomes such as `correct`, `partially_correct`, and `incorrect`.
+- **Monitoring production quality**: Set up automated evaluators (like [LLM-as-a-Judge](/docs/evaluation/evaluation-methods/llm-as-a-judge)) to continuously score live traces for things like hallucination, relevance, tone, boolean checks that return `true` or `false`, or discrete outcomes such as `correct`, `partially_correct`, and `incorrect`.

- **Running guardrails**: Score whether outputs pass safety checks, like PII detection, format validation, or content policy compliance. These programmatic checks run in your application and write results back as scores.

@@ -30,7 +30,7 @@ Once you have scores, they show up in [score analytics](/docs/evaluation/evaluat

There are four ways to add scores:

-- **LLM-as-a-Judge**: Set up [automated evaluators](/docs/evaluation/evaluation-methods/llm-as-a-judge) that score traces based on custom criteria (e.g. hallucination, tone, relevance). These can return numeric or categorical scores plus reasoning, and can run on live production traces or on experiment results.
+- **LLM-as-a-Judge**: Set up [automated evaluators](/docs/evaluation/evaluation-methods/llm-as-a-judge) that score traces based on custom criteria (e.g. hallucination, tone, relevance). These can return numeric, categorical, or boolean (`true` / `false`) scores plus reasoning, and can run on live production traces or on experiment results.

- **Annotation in the UI**: Team members [manually score](/docs/evaluation/evaluation-methods/annotation) traces, observations, or sessions directly in the Langfuse dashboard. Requires a [score config](/faq/all/manage-score-configs) to be set up first.

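Editor's note: for readers comparing the three data types mentioned in this diff, here are illustrative score payloads side by side. The shapes loosely follow the public scores API, but the exact field names and the `0`/`1` boolean encoding are assumptions to check against the current reference.

```python
# Illustrative payloads for the three score data types (field names assumed;
# verify against the current public API reference).
numeric_score = {"name": "helpfulness", "value": 0.8, "dataType": "NUMERIC"}

categorical_score = {
    "name": "correctness",
    "value": "partially_correct",
    "dataType": "CATEGORICAL",
}

# Assumed encoding for boolean scores: 1 = true, 0 = false.
boolean_score = {"name": "out_of_scope_request", "value": 0, "dataType": "BOOLEAN"}
```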
877 KB (binary file)
