LLM-as-a-judge on observations instead of traces #7754
-
I have some LLM-as-a-judge evaluators configured in my production environment. I've configured a hallucination eval, and I want to be able to build a dashboard breaking it down by the model used in a given trace. Is there a way to configure an eval in the UI to run on observations instead of at the trace level?
Replies: 7 comments 11 replies
-
Langfuse's UI does not currently support configuring LLM-as-a-judge evals directly at the observation level. Evals can only be set up at the trace or dataset level in the UI, and there is no option to target observations as standalone units for evaluation or dashboarding. If you want to break down eval results by model when the model info only exists at the observation level, you'll need to use the Langfuse SDK to fetch traces and their observations, then aggregate or analyze the results by model in your own code: for example, fetch traces, extract the observation-level model info and eval scores, and summarize them with Python/pandas (see the sketch below). If you want dashboards in the UI, a common workaround is to propagate model metadata from observations up to the trace level during ingestion (e.g. by setting a trace-level field based on the observation's model), but this requires custom logic in your pipeline. Let me know if you want a sample code snippet or more details on SDK usage! To reply, just mention @dosu.
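A minimal sketch of the fetch-and-aggregate approach, assuming the v2 Python SDK's `fetch_traces` / `fetch_trace` methods and an eval score named `hallucination`; the method names, score name, and page size are assumptions, so adapt them to your SDK version and eval setup:

```python
import pandas as pd
from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from the environment.
langfuse = Langfuse()

rows = []
for trace in langfuse.fetch_traces(limit=50).data:
    # Fetch the full trace, which includes its observations and scores.
    detail = langfuse.fetch_trace(trace.id).data

    # Collect the model(s) used by observations on this trace. This assumes one
    # model per trace; adjust if a single trace mixes models.
    models = {o.model for o in detail.observations if getattr(o, "model", None)}
    model = next(iter(models), "unknown")

    for score in detail.scores:
        if score.name == "hallucination":  # assumed eval score name
            rows.append({"trace_id": trace.id, "model": model, "score": score.value})

df = pd.DataFrame(rows)
print(df.groupby("model")["score"].agg(["count", "mean"]))
```

This only pages through the first 50 traces; in practice you would loop over pages and filter by time range before aggregating.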
-
This is in the works @thiagosalvatore -- we'll follow up here with more detail soon.
-
Any updates regarding this issue?
-
Any updates?
-
Yeah, also curious. Our traces have hundreds of spans, so evaluating a full trace is a no-go without exploding the context...
-
It appears that Langfuse simply picks the "last" input or output when running on a trace, instead of looping through the observations/spans and running against all of them. For my use case this makes the feature not useful.
-
We're finally launching support for running LLM-as-a-judge evals on observations. We are looking for 3–5 teams to beta test the feature. If you are on Langfuse Cloud and want early access, reply to this comment with your Langfuse user email.

We currently do not support grouping by metadata. Sorry that the AI response misled you here! You could do this today via tags on traces, as these are groupable in dashboards (see the sketch below).
As Clemens mentioned, evals on observations are a planned change and will enable this workflow!
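To make the tag workaround concrete, here is a minimal sketch assuming the v2 Python SDK's `@observe` decorator; `call_llm`, the `model:` tag prefix, and the stubbed completion are illustrative choices, not an established Langfuse convention:

```python
from langfuse.decorators import observe, langfuse_context

@observe(as_type="generation")
def call_llm(prompt: str, model: str = "gpt-4o-mini") -> str:
    completion = "stubbed response"  # replace with your actual LLM call

    # Record the model on the generation observation, and mirror it as a
    # trace-level tag so dashboards can group eval scores by it.
    langfuse_context.update_current_observation(model=model)
    langfuse_context.update_current_trace(tags=[f"model:{model}"])
    return completion
```

Once traces carry a `model:<name>` tag, trace-level eval scores can be grouped by that tag in dashboards until observation-level evals ship.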