Merged
Changes from 3 commits
2 changes: 1 addition & 1 deletion .agents/cursor/rules/available-internal-links.mdc
Original file line number Diff line number Diff line change
@@ -70,7 +70,7 @@ Pages: Self Hosting, Administration, Automated Access Provisioning, Headless Ini
- [Score Analytics](/docs/evaluation/evaluation-methods/score-analytics.md)
- [Scores Via Sdk](/docs/evaluation/evaluation-methods/scores-via-sdk.md)
- [Scores Via Ui](/docs/evaluation/evaluation-methods/scores-via-ui.md)
- [Data Model](/docs/evaluation/experiments/data-model.md)
- [Data Model](/docs/evaluation/evaluation-methods/data-model.md)
- [Datasets](/docs/evaluation/experiments/datasets.md)
- [Experiments Via Sdk](/docs/evaluation/experiments/experiments-via-sdk.md)
- [Experiments Via Ui](/docs/evaluation/experiments/experiments-via-ui.md)
10 changes: 5 additions & 5 deletions components/Glossary.tsx
@@ -135,15 +135,15 @@ const glossaryTerms: GlossaryTerm[] = [
term: "Dataset Item",
id: "dataset-item",
definition: "An individual test case within a dataset. Each item contains an input (the scenario to test) and optionally an expected output.",
link: "/docs/evaluation/experiments/data-model#datasetitem-object",
link: "/docs/evaluation/evaluation-methods/data-model#datasetitem-object",
categories: ["EVALUATION"],
relatedTerms: ["Dataset", "Dataset Experiment", "Task"],
},
{
term: "Dataset Experiment",
id: "dataset-experiment",
definition: "Also known as a Dataset Run. The execution of a dataset through your LLM application, producing outputs that can be evaluated. Links dataset items to their corresponding traces.",
link: "/docs/evaluation/experiments/data-model#datasetrun-experiment-run",
link: "/docs/evaluation/evaluation-methods/data-model#datasetrun-experiment-run",
categories: ["EVALUATION"],
relatedTerms: ["Dataset", "Dataset Item", "Task", "Score"],
synonyms: ["Dataset Run", "Experiment Run"],
@@ -384,15 +384,15 @@ const glossaryTerms: GlossaryTerm[] = [
term: "Score",
id: "score",
definition: "The output of an evaluation. Scores can be numeric, categorical, or boolean and are assigned to traces, observations, sessions, or dataset runs.",
link: "/docs/evaluation/experiments/data-model#scores",
link: "/docs/evaluation/evaluation-methods/data-model#scores",
categories: ["EVALUATION"],
relatedTerms: ["Score Config", "Evaluator", "LLM-as-a-Judge", "Annotation Queue"],
},
{
term: "Score Config",
id: "score-config",
definition: "A configuration defining how a score is calculated and interpreted. Includes data type, value constraints, and categories for standardized scoring.",
link: "/docs/evaluation/experiments/data-model#score-config",
link: "/docs/evaluation/evaluation-methods/data-model#score-config",
categories: ["EVALUATION"],
relatedTerms: ["Score", "LLM-as-a-Judge"],
},
@@ -434,7 +434,7 @@ const glossaryTerms: GlossaryTerm[] = [
term: "Task",
id: "task",
definition: "A function definition that processes dataset items during an experiment. The task represents the application code you want to test.",
link: "/docs/evaluation/experiments/data-model#task",
link: "/docs/evaluation/evaluation-methods/data-model#task",
categories: ["EVALUATION"],
relatedTerms: ["Dataset", "Dataset Item", "Dataset Experiment"],
},
2 changes: 1 addition & 1 deletion content/blog/2025-09-05-automated-evaluations.mdx
@@ -62,7 +62,7 @@ For this guide, we will set up an evaluator for the "Out of Scope" failure mode.

## How to Measure [#how-to-measure]

In Langfuse, all evaluations are tracked as **Scores**, which [can be attached to traces, observations, sessions or dataset runs](/docs/evaluation/experiments/data-model#scores). Evaluations in Langfuse can be set up in two main ways:
In Langfuse, all evaluations are tracked as **Scores**, which [can be attached to traces, observations, sessions or dataset runs](/docs/evaluation/core-concepts#scores). Evaluations in Langfuse can be set up in two main ways:

**In the Langfuse UI:** In Langfuse, you can set up **LLM-as-a-Judge Evaluators** that use another LLM to evaluate your application's output on subjective and nuanced criteria. These are easily configured directly in Langfuse. For a guide on setting them up in the UI, check the documentation on **[LLM-as-a-Judge evaluators](/docs/evaluation/evaluation-methods/llm-as-a-judge)**.

2 changes: 1 addition & 1 deletion content/changelog/2025-04-28-session-level-scores.mdx
@@ -4,7 +4,7 @@ description: Create and manage scores at the session level for more comprehensiv
date: 2025-04-28
author: Marlies
ogVideo: https://static.langfuse.com/docs-videos/session_scores.mp4
canonical: /docs/evaluation/experiments/data-model#scores
canonical: /docs/evaluation/evaluation-methods/data-model#scores
---

import { ChangelogHeader } from "@/components/changelog/ChangelogHeader";
2 changes: 1 addition & 1 deletion content/changelog/2025-05-07-run-level-scores.mdx
@@ -3,7 +3,7 @@ title: Dataset Run Level Scores
description: Score dataset runs to assess the overall quality of each run
date: 2025-05-07
author: Marlies
canonical: /docs/evaluation/experiments/data-model#scores
canonical: /docs/evaluation/evaluation-methods/data-model#scores
---

import { ChangelogHeader } from "@/components/changelog/ChangelogHeader";
2 changes: 1 addition & 1 deletion content/docs/administration/billable-units.mdx
@@ -5,7 +5,7 @@ description: Learn how billable units are calculated in Langfuse.

# Billable Units

Langfuse Cloud [pricing](/pricing) is based on the number of ingested units per billing period. Units are either [traces](/docs/observability/data-model#traces), [observations](/docs/observability/data-model#observations) or [scores](/docs/evaluation/experiments/data-model#scores).
Langfuse Cloud [pricing](/pricing) is based on the number of ingested units per billing period. Units are either [traces](/docs/observability/data-model#traces), [observations](/docs/observability/data-model#observations) or [scores](/docs/evaluation/evaluation-methods/data-model#scores).

`Units` = `Count of Traces` + `Count of Observations` + `Count of Scores`

70 changes: 65 additions & 5 deletions content/docs/evaluation/core-concepts.mdx
@@ -17,7 +17,7 @@ Ready to start?

LLM applications often have a constant loop of testing and monitoring.

**Offline evaluation** lets you test your application against a fixed dataset before you deploy. You run your new prompt or model against test cases, review the scores, iterate until the results look good, then deploy your changes. In Langfuse, you can do that by running [Experiments](/docs/evaluation/core-concepts#experiments).
**Offline evaluation** lets you test your application against a fixed dataset before you deploy. You run your new prompt or model against test cases, review the [scores](#scores), iterate until the results look good, then deploy your changes. In Langfuse, you can do that by running [Experiments](/docs/evaluation/core-concepts#experiments).

**Online evaluation** scores live traces to catch issues in real traffic. When you find edge cases your dataset didn't cover, you add them back to your dataset so future experiments will catch them.

@@ -38,9 +38,69 @@ LLM applications often have a constant loop of testing and monitoring.
>
> Over time, your dataset grows from a couple of examples to a diverse, representative set of real-world test cases.

## Scores [#scores]

Scores are Langfuse's universal data object for storing evaluation results. Whenever you assign a quality judgment to an LLM output, whether through [human annotation](/docs/evaluation/evaluation-methods/scores-via-ui), an [LLM judge](/docs/evaluation/evaluation-methods/llm-as-a-judge), a [programmatic check](/docs/evaluation/evaluation-methods/scores-via-sdk), or end-user feedback, the result is stored as a score.

Every score has a **name** (like `"correctness"` or `"helpfulness"`), a **value**, and a **[data type](#score-types)**. Scores also support an optional **[comment](#score-comments)** for additional context.

Scores can be attached to [traces](/docs/observability/data-model#traces), [observations](/docs/observability/data-model#observations), [sessions](/docs/observability/data-model#sessions), or [dataset runs](/docs/evaluation/evaluation-methods/data-model). Most commonly, scores are attached to traces to evaluate a single end-to-end interaction.

Once you have scores, they show up in [score analytics](/docs/evaluation/evaluation-methods/score-analytics), can be visualized in [custom dashboards](/docs/metrics/features/custom-dashboards), and can be queried via the [API](/docs/api).

### When to Use Scores [#when-to-use-scores]

Scores become useful when you want to go beyond observing what your application does and start measuring how well it does it. Common use cases:

- **Collecting user feedback**: Capture thumbs up/down or star ratings from your users and attach them to traces. See the [user feedback guide](/docs/observability/features/user-feedback).
- **Monitoring production quality**: Set up automated evaluators (like [LLM-as-a-Judge](/docs/evaluation/evaluation-methods/llm-as-a-judge)) to continuously score live traces for things like hallucination, relevance, or tone.
- **Running guardrails**: Score whether outputs pass safety checks like PII detection, format validation, or content policy compliance.
- **Comparing changes with experiments**: When you change a prompt, model, or pipeline, run an [experiment](/docs/evaluation/experiments) to score the new version against a dataset.

### Score Types [#score-types]

Langfuse supports four score data types:

| Type | Value | Use when |
| --- | --- | --- |
| `NUMERIC` | Float (e.g. `0.9`) | Continuous judgments like accuracy, relevance, or similarity scores |
| `CATEGORICAL` | String from predefined categories (e.g. `"correct"`, `"partially correct"`) | Discrete classifications where the set of possible values is known upfront |
| `BOOLEAN` | `0` or `1` | Pass/fail checks like hallucination detection or format validation |
| `TEXT` | Free-form string (1–500 characters) | Open-ended annotations like reviewer notes or qualitative feedback. Often used for [open coding](https://en.wikipedia.org/wiki/Open_coding) before formalizing into quantifiable scores via [axial coding](https://en.wikipedia.org/wiki/Axial_coding). |

<Callout type="info">
Text scores are designed for qualitative, open-ended scoring. Because free-form text cannot be meaningfully aggregated or compared, text scores are not supported in [experiments](#experiments), [LLM-as-a-Judge](/docs/evaluation/evaluation-methods/llm-as-a-judge), or [score analytics](/docs/evaluation/evaluation-methods/score-analytics).
</Callout>
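The four data types above can be sketched as a small validation helper. This is a hypothetical illustration of the type constraints from the table, not code from the Langfuse SDK:

```python
# Hypothetical sketch of the four Langfuse score data types.
# Not SDK code; it only mirrors the constraints listed in the table above.
def validate_score(data_type: str, value) -> bool:
    """Check that a score value matches its declared data type."""
    if data_type == "NUMERIC":
        # Any float or int; bool is excluded because True/False are ints in Python.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if data_type == "CATEGORICAL":
        # In practice the allowed strings would come from a score config.
        return isinstance(value, str)
    if data_type == "BOOLEAN":
        return value in (0, 1)
    if data_type == "TEXT":
        # Free-form string, 1-500 characters per the table above.
        return isinstance(value, str) and 1 <= len(value) <= 500
    return False

print(validate_score("NUMERIC", 0.9))     # True
print(validate_score("BOOLEAN", 0.5))     # False
print(validate_score("TEXT", "x" * 501))  # False
```

The `BOOLEAN` branch accepts `0`/`1` rather than Python booleans to match how the value is stored per the table.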

### How to Create Scores [#how-to-create-scores]

There are four ways to add scores:

- **LLM-as-a-Judge**: Set up [automated evaluators](/docs/evaluation/evaluation-methods/llm-as-a-judge) that score traces based on custom criteria (e.g. hallucination, tone, relevance). These can return numeric or categorical scores plus reasoning, and can run on live production traces or on experiment results.
- **Scores via UI**: Team members [manually score](/docs/evaluation/evaluation-methods/scores-via-ui) traces, observations, or sessions directly in the Langfuse UI. Requires a [score config](/faq/all/manage-score-configs) to be set up first.
- **Annotation Queues**: Set up [structured review workflows](/docs/evaluation/evaluation-methods/annotation-queues) where reviewers work through batches of traces.
- **Scores via API/SDK**: [Programmatically add scores](/docs/evaluation/evaluation-methods/scores-via-sdk) from your application code. This is the way to go for user feedback (thumbs up/down, star ratings), guardrail results, or custom evaluation pipelines.
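To illustrate the API/SDK path, a user-feedback score might be assembled like the sketch below. The payload fields mirror the score object described above (name, value, data type, optional comment); the SDK call is shown only as a comment because the exact method name and signature depend on your SDK version.

```python
# Sketch of a programmatic score: thumbs up/down user feedback on a trace.
# The helper and field names are illustrative, not part of the Langfuse SDK.
def build_user_feedback_score(trace_id: str, thumbs_up: bool) -> dict:
    """Turn a thumbs up/down into a BOOLEAN score payload for a trace."""
    return {
        "trace_id": trace_id,
        "name": "user-feedback",
        "value": 1 if thumbs_up else 0,   # BOOLEAN scores use 0 or 1
        "data_type": "BOOLEAN",
        "comment": "Collected from the in-app feedback widget.",  # optional
    }

score = build_user_feedback_score("trace-123", thumbs_up=True)
print(score["value"])  # 1

# With the Langfuse Python SDK, a payload like this would be passed to a
# score-creation method, e.g. (assumed name, check your SDK version's docs):
# langfuse.create_score(**score)
```

The same shape works for guardrail results: swap the name for e.g. `"pii-check"` and set the value from your check's outcome.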

### Should I Use Scores or Tags? [#scores-vs-tags]

| | Scores | Tags |
|---|---|---|
| **Purpose** | Measure _how good_ something is | Describe _what_ something is |
| **Data** | Numeric, categorical, boolean, or text value | Simple string label |
| **When added** | Can be added at any time, including long after the trace was created | Set during tracing and cannot be changed afterwards |
| **Used for** | Quality measurement, analytics, experiments | Filtering, segmentation, organizing |

As a rule of thumb: if you already know the category at tracing time (e.g. which feature or API endpoint triggered the trace), use a [tag](/docs/observability/features/tags). If you need to classify or evaluate traces later, use a score.

### Score Comments [#score-comments]

Every score supports an optional **comment** field. Use it to capture reasoning (e.g. why an LLM judge assigned a particular score), reviewer notes, or context that helps others understand the score value. Comments are shown alongside scores in the Langfuse UI.

Use a [`TEXT` score](#score-types) instead of comments to capture standalone qualitative feedback — comments are best for additional reasoning on an existing score.

## Evaluation Methods [#evaluation-methods]

Evaluation methods are the functions that score traces, observations, sessions, or dataset runs. You can use a variety of evaluation methods to add [scores](/docs/evaluation/experiments/data-model#scores).
Evaluation methods are the functions that score traces, observations, sessions, or dataset runs. You can use a variety of evaluation methods to add [scores](#scores).


| Method | What | Use when |
@@ -65,10 +125,10 @@ Before diving into experiments, it's helpful to understand the building blocks i
| **Dataset item** | One item in a dataset. Each dataset item contains an input (the scenario to test) and optionally an expected output. |
| **Task** | The application code that you want to test in an experiment. It runs on each dataset item, and you score its output. |
| **Evaluation Method** | A function that scores experiment results. In the context of a Langfuse experiment, this can be a [deterministic check](/docs/evaluation/evaluation-methods/custom-scores), or [LLM-as-a-Judge](/docs/evaluation/evaluation-methods/llm-as-a-judge). |
| **Score** | The output of an evaluation. This can be numeric, categorical, or boolean. See [Scores](/docs/evaluation/experiments/data-model#scores) for more details.|
| **Score** | The output of an evaluation. See [Scores](#scores) for the available data types and details.|
| **Experiment Run** | A single execution of your task against all items in a dataset, producing outputs (and scores). |

You can find the data model for these objects [here](/docs/evaluation/experiments/data-model).
You can find the data model for these objects [here](/docs/evaluation/evaluation-methods/data-model).


### How these work together
@@ -85,7 +145,7 @@ Often, you want to score these experiment results. You can use various [evaluati

You can compare experiment runs to see if a new prompt version improves scores, or identify specific inputs where your application struggles. Based on these experiment results, you can decide whether the change is ready to be deployed to production.

You can find more details on how these objects link together under the hood on the [data model page](/docs/evaluation/experiments/data-model).
You can find more details on how these objects link together under the hood on the [data model page](/docs/evaluation/evaluation-methods/data-model).


### Two ways to run experiments
@@ -5,7 +5,7 @@ description: Manage your annotation tasks with ease using our new workflow tooli

# Annotation Queues [#annotation-queues]

Annotation Queues are a manual [evaluation method](/docs/evaluation/core-concepts#evaluation-methods) which is build for domain experts to add [scores](/docs/evaluation/evaluation-methods/data-model) and comments to traces, observations or sessions.
Annotation Queues are a manual [evaluation method](/docs/evaluation/core-concepts#evaluation-methods) built for domain experts to add [scores](/docs/evaluation/core-concepts#scores) and comments to traces, observations, or sessions.

<Video
src="https://static.langfuse.com/docs-videos/2025-12-19-annotation-queues.mp4"
@@ -27,7 +27,7 @@ Annotation Queues are a manual [evaluation method](/docs/evaluation/core-concept
### Create a new Annotation Queue

- Click on `New Queue` to create a new queue.
- Select the [`Score Configs`](/docs/evaluation/experiments/data-model#score-config) you want to use for this queue.
- Select the [`Score Configs`](/docs/evaluation/evaluation-methods/data-model#score-config) you want to use for this queue.
- Set the `Queue name` and `Description` (optional).
- Assign users to the queue (optional).
