**content/blog/2025-09-05-automated-evaluations.mdx** (+1 −1)

```diff
@@ -62,7 +62,7 @@ For this guide, we will set up an evaluator for the "Out of Scope" failure mode.
 ## How to Measure [#how-to-measure]

-In Langfuse, all evaluations are tracked as **Scores**, which [can be attached to traces, observations, sessions or dataset runs](/docs/evaluation/experiments/data-model#scores). Evaluations in Langfuse can be set up in two main ways:
+In Langfuse, all evaluations are tracked as **Scores**, which [can be attached to traces, observations, sessions or dataset runs](/docs/evaluation/scores/overview). Evaluations in Langfuse can be set up in two main ways:

 **In the Langfuse UI:** In Langfuse, you can set up **LLM-as-a-Judge Evaluators** that use another LLM to evaluate your application's output on subjective and nuanced criteria. These are easily configured directly in Langfuse. For a guide on setting them up in the UI, check the documentation on **[LLM-as-a-Judge evaluators](/docs/evaluation/evaluation-methods/llm-as-a-judge)**.
```
**content/changelog/2025-11-07-score-analytics-multi-score-comparison.mdx** (+2 −2)

```diff
@@ -5,7 +5,7 @@ badge: Launch Week 4 🚀
 description: Validate evaluation reliability and uncover insights with comprehensive score analysis. Compare different evaluation methods, track trends over time, and measure agreement between human annotators and LLM judges.
```
**content/docs/administration/billable-units.mdx** (+1 −1)

```diff
@@ -5,7 +5,7 @@ description: Learn how billable units are calculated in Langfuse.
 # Billable Units

-Langfuse Cloud [pricing](/pricing) is based on the number of ingested units per billing period. Units are either [traces](/docs/observability/data-model#traces), [observations](/docs/observability/data-model#observations) or [scores](/docs/evaluation/experiments/data-model#scores).
+Langfuse Cloud [pricing](/pricing) is based on the number of ingested units per billing period. Units are either [traces](/docs/observability/data-model#traces), [observations](/docs/observability/data-model#observations) or [scores](/docs/evaluation/scores/data-model#scores).

 `Units` = `Count of Traces` + `Count of Observations` + `Count of Scores`
```
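The billing formula in this file is simple arithmetic, so it can be sketched directly. This is a minimal illustration; the helper name is invented here and is not part of Langfuse:

```python
def billable_units(traces: int, observations: int, scores: int) -> int:
    """Compute billable units per the docs: Units = Traces + Observations + Scores."""
    if min(traces, observations, scores) < 0:
        raise ValueError("counts must be non-negative")
    return traces + observations + scores

# Example billing period: 100 traces, 350 observations, 40 scores
print(billable_units(100, 350, 40))  # 490
```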
**content/docs/evaluation/core-concepts.mdx** (+11 −5)

```diff
@@ -17,7 +17,7 @@ Ready to start?
 LLM applications often have a constant loop of testing and monitoring.

-**Offline evaluation** lets you test your application against a fixed dataset before you deploy. You run your new prompt or model against test cases, review the scores, iterate until the results look good, then deploy your changes. In Langfuse, you can do that by running [Experiments](/docs/evaluation/core-concepts#experiments).
+**Offline evaluation** lets you test your application against a fixed dataset before you deploy. You run your new prompt or model against test cases, review the [scores](#scores), iterate until the results look good, then deploy your changes. In Langfuse, you can do that by running [Experiments](/docs/evaluation/core-concepts#experiments).

 **Online evaluation** scores live traces to catch issues in real traffic. When you find edge cases your dataset didn't cover, you add them back to your dataset so future experiments will catch them.
@@ -38,9 +38,15 @@ LLM applications often have a constant loop of testing and monitoring.
 >
 > Over time, your dataset grows from a couple of examples to a diverse, representative set of real-world test cases.

+## Scores [#scores]
+
+[Scores](/docs/evaluation/scores/overview) are Langfuse's universal data object for storing evaluation results. Any time you want to assign a quality judgment to an LLM output, whether by a human annotation, an LLM judge, a programmatic check, or end-user feedback, the result is stored as a score.
+
+Scores can be attached to traces, observations, sessions, or dataset runs. Every score has a **name**, a **value**, and a **data type** (`NUMERIC`, `CATEGORICAL`, `BOOLEAN`, or `TEXT`). Learn more about [score types](/docs/evaluation/scores/overview#score-types), [how to create scores](/docs/evaluation/scores/overview#how-to-create-scores), and [score analytics](/docs/evaluation/scores/score-analytics) on the dedicated [Scores](/docs/evaluation/scores/overview) page.
+
 ## Evaluation Methods [#evaluation-methods]

-Evaluation methods are the functions that score traces, observations, sessions, or dataset runs. You can use a variety of evaluation methods to add [scores](/docs/evaluation/experiments/data-model#scores).
+Evaluation methods are the functions that score traces, observations, sessions, or dataset runs. You can use a variety of evaluation methods to add [scores](#scores).

 | Method | What | Use when |
@@ -50,7 +56,7 @@ Evaluation methods are the functions that score traces, observations, sessions,
 |[Annotation Queues](/docs/evaluation/evaluation-methods/annotation-queues)| Structured human review workflows with customizable queues | Building ground truth, systematic labeling, team collaboration |
 |[Scores via API/SDK](/docs/evaluation/evaluation-methods/scores-via-sdk)| Programmatically add scores using the Langfuse API or SDK | Custom evaluation pipelines, deterministic checks, automated workflows |

-When setting up new evaluation methods, you can use [Score Analytics](/docs/evaluation/evaluation-methods/score-analytics) to analyze or sense-check the scores you produce.
+When setting up new evaluation methods, you can use [Score Analytics](/docs/evaluation/scores/score-analytics) to analyze or sense-check the scores you produce.

 ## Experiments [#experiments]

 An experiment runs your application against a dataset and evaluates the outputs. This is how you test changes before deploying to production.
@@ -65,7 +71,7 @@ Before diving into experiments, it's helpful to understand the building blocks i
 |**Dataset item**| One item in a dataset. Each dataset item contains an input (the scenario to test) and optionally an expected output. |
 | **Task** | The application code that you want to test in an experiment. This will be performed on each dataset item, and you will score the output. |
 |**Evaluation Method**| A function that scores experiment results. In the context of a Langfuse experiment, this can be a [deterministic check](/docs/evaluation/evaluation-methods/custom-scores), or [LLM-as-a-Judge](/docs/evaluation/evaluation-methods/llm-as-a-judge). |
-|**Score**| The output of an evaluation. This can be numeric, categorical, or boolean. See [Scores](/docs/evaluation/experiments/data-model#scores) for more details.|
+|**Score**| The output of an evaluation. See [Scores](#scores) for the available data types and details.|
 |**Experiment Run**| A single execution of your task against all items in a dataset, producing outputs (and scores). |

 You can find the data model for these objects [here](/docs/evaluation/experiments/data-model).
@@ -85,7 +91,7 @@ Often, you want to score these experiment results. You can use various [evaluati
 You can compare experiment runs to see if a new prompt version improves scores, or identify specific inputs where your application struggles. Based on these experiment results, you can decide whether the change is ready to be deployed to production.

-You can find more details on how these objects link together under the hood on the [data model page](/docs/evaluation/experiments/data-model).
+You can find more details on how these objects link together under the hood on the [data model page](/docs/evaluation/experiments/data-model).
```
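The score fields described in the new core-concepts section above (a name, a value, a data type, and a target such as a trace) can be sketched as a small validation helper. This is a hypothetical illustration, not the Langfuse SDK or API; only the field names and the four data types come from the docs text:

```python
# Data types listed in the Scores section of core-concepts.mdx
VALID_DATA_TYPES = {"NUMERIC", "CATEGORICAL", "BOOLEAN", "TEXT"}

def build_score(name: str, value, data_type: str, trace_id: str) -> dict:
    """Assemble a score record (hypothetical helper): every score has a
    name, a value, a data type, and is attached to a target like a trace."""
    if data_type not in VALID_DATA_TYPES:
        raise ValueError(f"unknown data type: {data_type}")
    if data_type == "NUMERIC" and not isinstance(value, (int, float)):
        raise ValueError("NUMERIC scores need a numeric value")
    if data_type == "BOOLEAN" and not isinstance(value, bool):
        raise ValueError("BOOLEAN scores need a boolean value")
    return {"name": name, "value": value, "dataType": data_type, "traceId": trace_id}

# e.g. an LLM-judge result attached to a trace
print(build_score("helpfulness", 0.9, "NUMERIC", "trace-123"))
```

The same shape covers the other evaluation methods in the table: a human annotation might produce a `CATEGORICAL` value, a deterministic check a `BOOLEAN` one.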
**content/docs/evaluation/evaluation-methods/annotation-queues.mdx** (+2 −2)

```diff
@@ -5,7 +5,7 @@ description: Manage your annotation tasks with ease using our new workflow tooli
 # Annotation Queues [#annotation-queues]

-Annotation Queues are a manual [evaluation method](/docs/evaluation/core-concepts#evaluation-methods) which is build for domain experts to add [scores](/docs/evaluation/evaluation-methods/data-model) and comments to traces, observations or sessions.
+Annotation Queues are a manual [evaluation method](/docs/evaluation/core-concepts#evaluation-methods) built for domain experts to add [scores](/docs/evaluation/scores/overview) and comments to traces, observations or sessions.
```