Replies: 4 comments 1 reply
You've identified a real architectural gap between SDK experiments and LLM-as-a-judge evaluators. You're correct that these features weren't designed to work together seamlessly in the way you're attempting. The SDK experiment framework is synchronous by design: evaluators run inline during the experiment and return scores immediately. LLM-as-a-judge evaluators are asynchronous and trigger based on trace tags/filters after traces are ingested.

Your analysis of the options is accurate:

1 - There's no API to fetch custom evaluator configurations from Langfuse to reuse in SDK experiments.
2 - Item evaluators run inline during the experiment, while LLM-as-a-judge scores are only written after trace ingestion, so sleeping and polling inside an evaluator is unreliable.
3 - This is currently your most viable path: replicating the evaluation logic locally as synchronous evaluators.

Workaround suggestion: you could create a run-level evaluator that fetches scores after the experiment completes.
This is hacky and introduces timing dependencies, but it might bridge the gap until there's native support for combining these features. Your feedback about wanting seamless integration between configured evaluators and SDK experiments is valuable; this does seem like a feature gap worth addressing.
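A minimal sketch of that workaround in Python. `fetch_scores` is a hypothetical callable (it would wrap whatever score-listing call your Langfuse SDK version exposes) that returns the scores currently attached to one trace; the score names and thresholds are placeholders:

```python
import time
from typing import Callable

# Hedged sketch of the "poll after the experiment" workaround.
# `fetch_scores` is an assumed helper: given a trace ID, return a
# {score_name: value} dict of scores Langfuse has written so far.

def wait_for_scores(trace_id: str,
                    fetch_scores: Callable[[str], dict],
                    expected: set,
                    timeout_s: float = 120.0,
                    poll_s: float = 5.0) -> dict:
    """Poll until every expected score name appears on the trace, or time out."""
    deadline = time.monotonic() + timeout_s
    scores: dict = {}
    while time.monotonic() < deadline:
        scores = fetch_scores(trace_id)
        if expected.issubset(scores):
            return scores
        time.sleep(poll_s)
    raise TimeoutError(f"missing scores for {trace_id}: {expected - set(scores)}")

def passes_thresholds(scores: dict, thresholds: dict) -> bool:
    """True if every thresholded score meets its minimum value."""
    return all(scores.get(name, 0.0) >= min_val
               for name, min_val in thresholds.items())
```

The timing dependency mentioned above lives entirely in `timeout_s`: if the async judge is slower than the timeout, the run fails spuriously.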
Facing the same problem here. Maybe we can get an answer from a human? :)
Hi @jkealey @joegas, thanks! If I understand your message correctly, your LLM-as-a-Judge currently targets live traces. However, you can also set up an LLM-as-a-Judge that specifically targets the dataset runs you made via the SDK. You can see how in our LLM-as-a-Judge docs (Target data: Experiment Runs): https://langfuse.com/docs/evaluation/evaluation-methods/llm-as-a-judge#choose-which-data-to-evaluate

Setting it up this way will give you aggregated score metrics in your Dataset Compare view. Does this help?

We also have a guide on fetching the Dataset Experiment scores via the SDK, in case you need it: https://langfuse.com/faq/all/retrieve-experiment-scores
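For the fetch-scores route, a rough standard-library-only sketch. The endpoint path and response shape below are assumptions based on the linked guide, not verified; check the FAQ for the exact call your SDK version provides:

```python
import base64
import json
import os
import statistics
import urllib.request

# Hedged sketch: fetch a dataset run from the Langfuse public API, then
# average each score name across run items. Endpoint path is an assumption.

def fetch_dataset_run(host: str, dataset: str, run: str) -> dict:
    """GET one dataset run; auth via basic auth with the project API keys."""
    auth = base64.b64encode(
        f"{os.environ['LANGFUSE_PUBLIC_KEY']}:"
        f"{os.environ['LANGFUSE_SECRET_KEY']}".encode()
    ).decode()
    req = urllib.request.Request(
        f"{host}/api/public/datasets/{dataset}/runs/{run}",  # assumed path
        headers={"Authorization": f"Basic {auth}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def average_scores(scores: list) -> dict:
    """Average score values grouped by name.

    `scores` is assumed to be a flat list like
    [{"name": "correctness", "value": 0.8}, ...].
    """
    by_name: dict = {}
    for s in scores:
        by_name.setdefault(s["name"], []).append(s["value"])
    return {name: statistics.mean(vals) for name, vals in by_name.items()}
```

The aggregation half is the useful part regardless of transport: once you have score objects, grouping by name gives you the same per-run averages the Dataset Compare view shows.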
We are running our experiment via the SDK as part of our CI/CD pipeline. We currently need to re-implement our evaluators in the pipeline code in order to evaluate the results from the experiment runs synchronously (as described in the SDK docs). We want the pipeline to fail based on score thresholds. If I understand the docs you linked correctly, LLM-as-a-judge targeted at experiments runs in the background? Or is there a way to get the results synchronously?
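The threshold gate described above can be sketched independently of where the scores come from. Everything here is a placeholder: the score name, the threshold, and the shape of `results` (one dict of scores per dataset item, produced by your own re-implemented synchronous evaluators):

```python
import sys

# Sketch of a CI gate over synchronous evaluator results.
# `results` stands in for whatever your SDK experiment loop produced
# locally, e.g. [{"accuracy": 1.0}, {"accuracy": 0.8}].

THRESHOLDS = {"accuracy": 0.8}  # hypothetical score name and minimum average

def gate(results: list, thresholds: dict) -> list:
    """Return (name, average, minimum) triples for every failed threshold."""
    failures = []
    for name, minimum in thresholds.items():
        values = [r[name] for r in results if name in r]
        avg = sum(values) / len(values) if values else 0.0
        if avg < minimum:
            failures.append((name, avg, minimum))
    return failures

if __name__ == "__main__":
    results = [{"accuracy": 1.0}, {"accuracy": 0.8}]  # placeholder data
    failed = gate(results, THRESHOLDS)
    if failed:
        for name, avg, minimum in failed:
            print(f"FAIL: {name} avg {avg:.2f} < {minimum}")
        sys.exit(1)  # non-zero exit fails the pipeline
    print("all score thresholds met")
```

A non-zero exit code is all most CI systems need, so this part works the same whether the scores were computed inline or fetched after the fact.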
I previously defined a few LLM-as-a-judge evaluators which run when my traces have certain tags.
I'm now using SDK experiments.
My goal:
After chatting with your AI bot, I understand that:
Options:
1 - I thought I could fetch my custom evaluators on Langfuse with the SDK and attach them to the experiment run
2 - I thought I could just write a custom evaluator that would sleep a while and fetch scores on the trace
3 - Therefore, it sounds like I'd need to rewrite the LLM-as-a-judge logic locally
Overall, seems like I'm trying to mix two different feature sets that weren't designed to go together.
Thought I'd share that I initially thought these things would be more seamlessly integrated.
I'd have liked the ability to fetch scores generated by auto-fired evals on any trace created during a dataset item run to be baked into the platform.
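For what option 3 could look like: a hypothetical synchronous evaluator that re-implements a judge prompt locally. `call_llm` is an assumed wrapper around whatever chat-completion client is in use, and the prompt/JSON contract is illustrative, not the actual evaluator configured in Langfuse:

```python
import json
from typing import Callable

# Hypothetical sketch of rewriting an LLM-as-a-judge evaluator as a
# synchronous function. `call_llm` is assumed: prompt in, raw text out.

JUDGE_PROMPT = (
    "Rate the following answer for correctness from 0.0 to 1.0.\n"
    'Respond with JSON: {{"score": <float>, "reasoning": "<why>"}}\n\n'
    "Question: {question}\nAnswer: {answer}"
)

def correctness_evaluator(question: str, answer: str,
                          call_llm: Callable[[str], str]) -> dict:
    """Synchronous judge: format the prompt, call the model, parse the score."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    parsed = json.loads(raw)
    return {
        "name": "correctness",
        "value": float(parsed["score"]),
        "comment": parsed.get("reasoning", ""),
    }
```

Because it returns a plain score dict immediately, a function like this can run inline in an SDK experiment and feed a threshold check in the same process, at the cost of keeping the prompt in sync with the evaluator configured in the Langfuse UI by hand.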