LLM-as-a-judge on observations instead of traces #7754
-
I have some LLM-as-a-judge evaluators configured in my production environment. I've configured a hallucination eval, and I want to be able to build a dashboard breaking it down by the model used in a given trace. Is there a way to configure an eval in the UI to run on observations instead of at the trace level?
Replies: 7 comments 11 replies
-
Langfuse's UI does not currently support configuring LLM-as-a-judge evals directly at the observation level. Evals can only be set up at the trace or dataset level in the UI, and there is no option to target observations as standalone units for evaluation or dashboarding. If you want to break down eval results by model when the model info only exists at the observation level, you'll need to use the Langfuse SDK to fetch traces and their observations, then aggregate or analyze the results by model in your own code: for example, fetch traces, extract the observation-level model info and eval scores, and summarize them with Python/pandas (see the sketch below). If you want dashboards in the UI, a common workaround is to propagate model metadata from observations up to the trace level during ingestion (e.g. by setting a trace-level field based on the observation's model), but this requires custom logic in your pipeline. Let me know if you want a sample code snippet or more details on SDK usage! To reply, just mention @dosu.
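A minimal sketch of the fetch-and-aggregate approach, assuming the v2 Python SDK's `fetch_traces` / `fetch_trace` methods and an eval score named `hallucination`; the method names, score name, and page size are assumptions, so adapt them to your SDK version and eval setup:

```python
import pandas as pd
from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from the environment.
langfuse = Langfuse()

rows = []
for trace in langfuse.fetch_traces(limit=50).data:
    # Fetch the full trace, which includes its observations and scores.
    detail = langfuse.fetch_trace(trace.id).data

    # Collect the model(s) used by observations on this trace. This assumes one
    # model per trace; adjust if a single trace mixes models.
    models = {o.model for o in detail.observations if getattr(o, "model", None)}
    model = next(iter(models), "unknown")

    for score in detail.scores:
        if score.name == "hallucination":  # assumed eval score name
            rows.append({"trace_id": trace.id, "model": model, "score": score.value})

df = pd.DataFrame(rows)
print(df.groupby("model")["score"].agg(["count", "mean"]))
```

This only pages through the first 50 traces; in practice you would loop over pages and filter by time range before aggregating.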
-
This is in the works @thiagosalvatore -- we'll follow up here with more detail soon.
-
Any updates regarding this issue?
-
Any updates?
-
Yeah, also curious. Our traces have hundreds of spans, so evaluating a full trace is a no-go without exploding the context...
-
It appears that Langfuse simply picks the "last" input or output when running on a trace, instead of looping through the observations/spans and running against all of them. For my use case this makes the feature not useful.
-
We're finally launching support for running LLM-as-a-judge evals on observations. We are looking for 3–5 teams to beta test the feature. If you are on Langfuse Cloud and want early access, reply to this comment with your Langfuse user email.

We currently do not support grouping by metadata. Sorry that the AI response misled you here! You could do this today via tags on traces, as these are groupable in dashboards (see the sketch below).
As Clemens mentioned, evals on observations are a planned change and will enable this workflow!
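To make the tag workaround concrete, here is a minimal sketch assuming the v2 Python SDK's `@observe` decorator; `call_llm`, the `model:` tag prefix, and the stubbed completion are illustrative choices, not an established Langfuse convention:

```python
from langfuse.decorators import observe, langfuse_context

@observe(as_type="generation")
def call_llm(prompt: str, model: str = "gpt-4o-mini") -> str:
    completion = "stubbed response"  # replace with your actual LLM call

    # Record the model on the generation observation, and mirror it as a
    # trace-level tag so dashboards can group eval scores by it.
    langfuse_context.update_current_observation(model=model)
    langfuse_context.update_current_trace(tags=[f"model:{model}"])
    return completion
```

Once traces carry a `model:<name>` tag, trace-level eval scores can be grouped by that tag in dashboards until observation-level evals ship.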