
Commit 024095c

wochinge and claude authored
docs(TEXT scores): add documentation for the new TEXT score type (#2769)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent b2fd54c · commit 024095c
29 files changed: +465 −225 lines

.agents/cursor/rules/available-internal-links.mdc
Lines changed: 4 additions & 2 deletions

```diff
@@ -74,13 +74,15 @@ Pages: Self Hosting, Administration, Automated Access Provisioning, Headless Ini
 - [Core Concepts](/docs/evaluation/core-concepts.md)
 - [Annotation Queues](/docs/evaluation/evaluation-methods/annotation-queues.md)
 - [Llm As A Judge](/docs/evaluation/evaluation-methods/llm-as-a-judge.md)
-- [Score Analytics](/docs/evaluation/evaluation-methods/score-analytics.md)
 - [Scores Via Sdk](/docs/evaluation/evaluation-methods/scores-via-sdk.md)
 - [Scores Via Ui](/docs/evaluation/evaluation-methods/scores-via-ui.md)
-- [Data Model](/docs/evaluation/experiments/data-model.md)
+- [Scores Overview](/docs/evaluation/scores/overview.md)
+- [Score Analytics](/docs/evaluation/scores/score-analytics.md)
+- [Scores Data Model](/docs/evaluation/scores/data-model.md)
 - [Datasets](/docs/evaluation/experiments/datasets.md)
 - [Experiments Via Sdk](/docs/evaluation/experiments/experiments-via-sdk.md)
 - [Experiments Via Ui](/docs/evaluation/experiments/experiments-via-ui.md)
+- [Experiments Data Model](/docs/evaluation/experiments/data-model.md)
 - [Overview](/docs/evaluation/overview.md)
 - [Troubleshooting And Faq](/docs/evaluation/troubleshooting-and-faq.md)
 - [Glossary](/docs/glossary.md)
```

components/Glossary.tsx
Lines changed: 2 additions & 2 deletions

```diff
@@ -384,15 +384,15 @@ const glossaryTerms: GlossaryTerm[] = [
     term: "Score",
     id: "score",
     definition: "The output of an evaluation. Scores can be numeric, categorical, or boolean and are assigned to traces, observations, sessions, or dataset runs.",
-    link: "/docs/evaluation/experiments/data-model#scores",
+    link: "/docs/evaluation/scores/data-model#scores",
     categories: ["EVALUATION"],
     relatedTerms: ["Score Config", "Evaluator", "LLM-as-a-Judge", "Annotation Queue"],
   },
   {
     term: "Score Config",
     id: "score-config",
     definition: "A configuration defining how a score is calculated and interpreted. Includes data type, value constraints, and categories for standardized scoring.",
-    link: "/docs/evaluation/experiments/data-model#score-config",
+    link: "/docs/evaluation/scores/data-model#score-config",
     categories: ["EVALUATION"],
     relatedTerms: ["Score", "LLM-as-a-Judge"],
   },
```

content/blog/2025-09-05-automated-evaluations.mdx
Lines changed: 1 addition & 1 deletion

```diff
@@ -62,7 +62,7 @@ For this guide, we will set up an evaluator for the "Out of Scope" failure mode.
 
 ## How to Measure [#how-to-measure]
 
-In Langfuse, all evaluations are tracked as **Scores**, which [can be attached to traces, observations, sessions or dataset runs](/docs/evaluation/experiments/data-model#scores). Evaluations in Langfuse can be set up in two main ways:
+In Langfuse, all evaluations are tracked as **Scores**, which [can be attached to traces, observations, sessions or dataset runs](/docs/evaluation/scores/overview). Evaluations in Langfuse can be set up in two main ways:
 
 **In the Langfuse UI:** In Langfuse, you can set up **LLM-as-a-Judge Evaluators** that use another LLM to evaluate your application's output on subjective and nuanced criteria. These are easily configured directly in Langfuse. For a guide on setting them up in the UI, check the documentation on **[LLM-as-a-Judge evaluators](/docs/evaluation/evaluation-methods/llm-as-a-judge)**.
```

content/changelog/2024-08-19-score-analytics.mdx
Lines changed: 1 addition & 1 deletion

```diff
@@ -4,7 +4,7 @@ title: Advanced Score Analytics Charts
 description: Explore customizable charts for aggregate and time series analytics, grouped by score type, source and name.
 author: Marlies
 ogVideo: https://static.langfuse.com/docs-videos/score_analytics.mp4
-canonical: /docs/evaluation/evaluation-methods/score-analytics
+canonical: /docs/evaluation/scores/score-analytics
 ---
 
 import { ChangelogHeader } from "@/components/changelog/ChangelogHeader";
```

content/changelog/2025-04-28-session-level-scores.mdx
Lines changed: 1 addition & 1 deletion

```diff
@@ -4,7 +4,7 @@ description: Create and manage scores at the session level for more comprehensiv
 date: 2025-04-28
 author: Marlies
 ogVideo: https://static.langfuse.com/docs-videos/session_scores.mp4
-canonical: /docs/evaluation/experiments/data-model#scores
+canonical: /docs/evaluation/scores/data-model#scores
 ---
 
 import { ChangelogHeader } from "@/components/changelog/ChangelogHeader";
```

content/changelog/2025-05-07-run-level-scores.mdx
Lines changed: 1 addition & 1 deletion

```diff
@@ -3,7 +3,7 @@ title: Dataset Run Level Scores
 description: Score dataset runs to assess the overall quality of each run
 date: 2025-05-07
 author: Marlies
-canonical: /docs/evaluation/experiments/data-model#scores
+canonical: /docs/evaluation/scores/data-model#scores
 ---
 
 import { ChangelogHeader } from "@/components/changelog/ChangelogHeader";
```

content/changelog/2025-11-07-score-analytics-multi-score-comparison.mdx
Lines changed: 2 additions & 2 deletions

```diff
@@ -5,7 +5,7 @@ badge: Launch Week 4 🚀
 description: Validate evaluation reliability and uncover insights with comprehensive score analysis. Compare different evaluation methods, track trends over time, and measure agreement between human annotators and LLM judges.
 author: Michael
 ogImage: /images/changelog/score-analytics-compare-numeric.png
-canonical: /docs/evaluation/evaluation-methods/score-analytics
+canonical: /docs/evaluation/scores/score-analytics
 ---
 
 import { ChangelogHeader } from "@/components/changelog/ChangelogHeader";
@@ -82,7 +82,7 @@ Score Analytics provides a lightweight, zero-configuration way to analyze your s
 import { Book, Calendar } from "lucide-react";
 
 <Cards num={1}>
-  <Card title="Score Analytics Documentation" href="/docs/evaluation/evaluation-methods/score-analytics" icon={<Book />} />
+  <Card title="Score Analytics Documentation" href="/docs/evaluation/scores/score-analytics" icon={<Book />} />
   <Card title="Score Configuration Management" href="/faq/all/manage-score-configs" icon={<Book />} />
   <Card title="See all Launch Week releases" href="/blog/2025-10-29-launch-week-4" icon={<Calendar />} />
 </Cards>
```

content/docs/administration/billable-units.mdx
Lines changed: 1 addition & 1 deletion

```diff
@@ -5,7 +5,7 @@ description: Learn how billable units are calculated in Langfuse.
 
 # Billable Units
 
-Langfuse Cloud [pricing](/pricing) is based on the number of ingested units per billing period. Units are either [traces](/docs/observability/data-model#traces), [observations](/docs/observability/data-model#observations) or [scores](/docs/evaluation/experiments/data-model#scores).
+Langfuse Cloud [pricing](/pricing) is based on the number of ingested units per billing period. Units are either [traces](/docs/observability/data-model#traces), [observations](/docs/observability/data-model#observations) or [scores](/docs/evaluation/scores/data-model#scores).
 
 `Units` = `Count of Traces` + `Count of Observations` + `Count of Scores`
 
```
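The billing formula quoted in the hunk above is plain addition. As a minimal, illustrative sketch (the function name is ours, not part of any Langfuse API):

```python
# Illustrative sketch only: the billable-unit formula from the docs above.
def billable_units(traces: int, observations: int, scores: int) -> int:
    """Units = Count of Traces + Count of Observations + Count of Scores."""
    return traces + observations + scores


# A period with 1,000 traces, 4,500 observations, and 800 scores.
print(billable_units(1_000, 4_500, 800))  # 6300
```

Note that every score ingested (including the new TEXT scores) counts as one unit, the same as a trace or observation.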

content/docs/evaluation/core-concepts.mdx
Lines changed: 11 additions & 5 deletions

```diff
@@ -17,7 +17,7 @@ Ready to start?
 
 LLM applications often have a constant loop of testing and monitoring.
 
-**Offline evaluation** lets you test your application against a fixed dataset before you deploy. You run your new prompt or model against test cases, review the scores, iterate until the results look good, then deploy your changes. In Langfuse, you can do that by running [Experiments](/docs/evaluation/core-concepts#experiments).
+**Offline evaluation** lets you test your application against a fixed dataset before you deploy. You run your new prompt or model against test cases, review the [scores](#scores), iterate until the results look good, then deploy your changes. In Langfuse, you can do that by running [Experiments](/docs/evaluation/core-concepts#experiments).
 
 **Online evaluation** scores live traces to catch issues in real traffic. When you find edge cases your dataset didn't cover, you add them back to your dataset so future experiments will catch them.
 
@@ -38,9 +38,15 @@ LLM applications often have a constant loop of testing and monitoring.
 >
 > Over time, your dataset grows from a couple of examples to a diverse, representative set of real-world test cases.
 
+## Scores [#scores]
+
+[Scores](/docs/evaluation/scores/overview) are Langfuse's universal data object for storing evaluation results. Any time you want to assign a quality judgment to an LLM output, whether by a human annotation, an LLM judge, a programmatic check, or end-user feedback, the result is stored as a score.
+
+Scores can be attached to traces, observations, sessions, or dataset runs. Every score has a **name**, a **value**, and a **data type** (`NUMERIC`, `CATEGORICAL`, `BOOLEAN`, or `TEXT`). Learn more about [score types](/docs/evaluation/scores/overview#score-types), [how to create scores](/docs/evaluation/scores/overview#how-to-create-scores), and [score analytics](/docs/evaluation/scores/score-analytics) on the dedicated [Scores](/docs/evaluation/scores/overview) page.
+
 ## Evaluation Methods [#evaluation-methods]
 
-Evaluation methods are the functions that score traces, observations, sessions, or dataset runs. You can use a variety of evaluation methods to add [scores](/docs/evaluation/experiments/data-model#scores).
+Evaluation methods are the functions that score traces, observations, sessions, or dataset runs. You can use a variety of evaluation methods to add [scores](#scores).
 
 
 | Method | What | Use when |
@@ -50,7 +56,7 @@ Evaluation methods are the functions that score traces, observations, sessions,
 | [Annotation Queues](/docs/evaluation/evaluation-methods/annotation-queues) | Structured human review workflows with customizable queues | Building ground truth, systematic labeling, team collaboration |
 | [Scores via API/SDK](/docs/evaluation/evaluation-methods/scores-via-sdk) | Programmatically add scores using the Langfuse API or SDK | Custom evaluation pipelines, deterministic checks, automated workflows |
 
-When setting up new evaluation methods, you can use [Score Analytics](/docs/evaluation/evaluation-methods/score-analytics) to analyze or sense-check the scores you produce.
+When setting up new evaluation methods, you can use [Score Analytics](/docs/evaluation/scores/score-analytics) to analyze or sense-check the scores you produce.
 ## Experiments [#experiments]
 
 An experiment runs your application against a dataset and evaluates the outputs. This is how you test changes before deploying to production.
@@ -65,7 +71,7 @@ Before diving into experiments, it's helpful to understand the building blocks i
 | **Dataset item** | One item in a dataset. Each dataset item contains an input (the scenario to test) and optionally an expected output. |
 | **Task** | The application code that you want to test in an experiment. This will be performed on each dataset item, and you will score the output.
 | **Evaluation Method** | A function that scores experiment results. In the context of a Langfuse experiment, this can be a [deterministic check](/docs/evaluation/evaluation-methods/custom-scores), or [LLM-as-a-Judge](/docs/evaluation/evaluation-methods/llm-as-a-judge). |
-| **Score** | The output of an evaluation. This can be numeric, categorical, or boolean. See [Scores](/docs/evaluation/experiments/data-model#scores) for more details.|
+| **Score** | The output of an evaluation. See [Scores](#scores) for the available data types and details.|
 | **Experiment Run** | A single execution of your task against all items in a dataset, producing outputs (and scores). |
 
 You can find the data model for these objects [here](/docs/evaluation/experiments/data-model).
@@ -85,7 +91,7 @@ Often, you want to score these experiment results. You can use various [evaluati
 
 You can compare experiment runs to see if a new prompt version improves scores, or identify specific inputs where your application struggles. Based on these experiment results, you can decide whether the change is ready to be deployed to production.
 
-You can find more details on how these objects link together under the hood on the [data model page](/docs/evaluation/experiments/data-model).
+You can find more details on how these objects link together under the hood on the [data model page](/docs/evaluation/experiments/data-model).
 
 
 ### Two ways to run experiments
```
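The new "Scores" section added in this hunk says every score has a name, a value, and one of four data types. As a hypothetical sketch (this is not Langfuse's actual schema or SDK; the class, field names, and validation rules are illustrative only), that shape could be modeled as:

```python
# Hypothetical sketch, NOT Langfuse's actual data model: a minimal object
# with the three fields the docs describe (name, value, data type).
from dataclasses import dataclass
from typing import Union

VALID_DATA_TYPES = {"NUMERIC", "CATEGORICAL", "BOOLEAN", "TEXT"}


@dataclass
class Score:
    name: str
    value: Union[float, str]
    data_type: str

    def __post_init__(self) -> None:
        if self.data_type not in VALID_DATA_TYPES:
            raise ValueError(f"unknown score data type: {self.data_type}")
        # Assumption for this sketch: NUMERIC/BOOLEAN scores carry numbers,
        # while CATEGORICAL and the new TEXT type carry strings.
        if self.data_type in ("NUMERIC", "BOOLEAN") and not isinstance(self.value, (int, float)):
            raise TypeError(f"{self.data_type} score requires a numeric value")
        if self.data_type in ("CATEGORICAL", "TEXT") and not isinstance(self.value, str):
            raise TypeError(f"{self.data_type} score requires a string value")


# A TEXT score holds free-form feedback rather than a number or a category.
feedback = Score(name="reviewer_notes", value="Correct, but overly verbose.", data_type="TEXT")
accuracy = Score(name="accuracy", value=0.92, data_type="NUMERIC")
```

The point of the TEXT type, per the commit message, is exactly this last case: evaluation output that is prose (reviewer comments, judge rationales) rather than a scalar.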

content/docs/evaluation/evaluation-methods/annotation-queues.mdx
Lines changed: 2 additions & 2 deletions

```diff
@@ -5,7 +5,7 @@ description: Manage your annotation tasks with ease using our new workflow tooli
 
 # Annotation Queues [#annotation-queues]
 
-Annotation Queues are a manual [evaluation method](/docs/evaluation/core-concepts#evaluation-methods) which is build for domain experts to add [scores](/docs/evaluation/evaluation-methods/data-model) and comments to traces, observations or sessions.
+Annotation Queues are a manual [evaluation method](/docs/evaluation/core-concepts#evaluation-methods) which is build for domain experts to add [scores](/docs/evaluation/scores/overview) and comments to traces, observations or sessions.
 
 <Video
   src="https://static.langfuse.com/docs-videos/2025-12-19-annotation-queues.mp4"
@@ -27,7 +27,7 @@ Annotation Queues are a manual [evaluation method](/docs/evaluation/core-concept
 ### Create a new Annotation Queue
 
 - Click on `New Queue` to create a new queue.
-- Select the [`Score Configs`](/docs/evaluation/experiments/data-model#score-config) you want to use for this queue.
+- Select the [`Score Configs`](/docs/evaluation/scores/data-model#score-config) you want to use for this queue.
 - Set the `Queue name` and `Description` (optional).
 - Assign users to the queue (optional).
```
