Replies: 4 comments 1 reply
You've identified a real architectural gap between SDK experiments and LLM-as-a-judge evaluators. You're correct that these features weren't designed to work together seamlessly in the way you're attempting. The SDK experiment framework is synchronous by design: evaluators run inline during the experiment and return scores immediately. LLM-as-a-judge evaluators are asynchronous and trigger based on trace tags/filters after traces are ingested.

Your analysis of the options is accurate:

1 - There's no API to fetch custom evaluator configurations from Langfuse to reuse in SDK experiments.
2 - Item evaluators run inline during the experiment, while LLM-as-a-judge scores are only written after trace ingestion, so sleeping and polling inside an evaluator is unreliable.
3 - This is currently your most viable path: replicating the evaluation logic locally as synchronous evaluators.

Workaround suggestion: you could create a run-level evaluator that fetches scores after the experiment completes.
This is hacky and introduces timing dependencies, but it might bridge the gap until there's native support for combining these features. Your feedback about wanting seamless integration between configured evaluators and SDK experiments is valuable; this does seem like a feature gap worth addressing.
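A minimal sketch of that workaround in Python. `fetch_scores` is a hypothetical callable (it would wrap whatever score-listing call your Langfuse SDK version exposes) that returns the scores currently attached to one trace; the score names and thresholds are placeholders:

```python
import time
from typing import Callable

# Hedged sketch of the "poll after the experiment" workaround.
# `fetch_scores` is an assumed helper: given a trace ID, return a
# {score_name: value} dict of scores Langfuse has written so far.

def wait_for_scores(trace_id: str,
                    fetch_scores: Callable[[str], dict],
                    expected: set,
                    timeout_s: float = 120.0,
                    poll_s: float = 5.0) -> dict:
    """Poll until every expected score name appears on the trace, or time out."""
    deadline = time.monotonic() + timeout_s
    scores: dict = {}
    while time.monotonic() < deadline:
        scores = fetch_scores(trace_id)
        if expected.issubset(scores):
            return scores
        time.sleep(poll_s)
    raise TimeoutError(f"missing scores for {trace_id}: {expected - set(scores)}")

def passes_thresholds(scores: dict, thresholds: dict) -> bool:
    """True if every thresholded score meets its minimum value."""
    return all(scores.get(name, 0.0) >= min_val
               for name, min_val in thresholds.items())
```

The timing dependency mentioned above lives entirely in `timeout_s`: if the async judge is slower than the timeout, the run fails spuriously.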
Facing the same problem here. Maybe we can get an answer from a human? :)
Hi @jkealey @joegas, thanks! If I understand your message correctly, your LLM-as-a-Judge currently targets live traces. However, you can also set up an LLM-as-a-Judge that specifically targets the dataset runs you made via the SDK. You can see how in our LLM-as-a-Judge docs (Target data: Experiment Runs): https://langfuse.com/docs/evaluation/evaluation-methods/llm-as-a-judge#choose-which-data-to-evaluate

Setting it up this way will give you aggregated score metrics in your Dataset Compare view. Does this help?

We also have a guide on fetching the Dataset Experiment scores via the SDK, in case you need it: https://langfuse.com/faq/all/retrieve-experiment-scores
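For the fetch-scores route, a rough standard-library-only sketch. The endpoint path and response shape below are assumptions based on the linked guide, not verified; check the FAQ for the exact call your SDK version provides:

```python
import base64
import json
import os
import statistics
import urllib.request

# Hedged sketch: fetch a dataset run from the Langfuse public API, then
# average each score name across run items. Endpoint path is an assumption.

def fetch_dataset_run(host: str, dataset: str, run: str) -> dict:
    """GET one dataset run; auth via basic auth with the project API keys."""
    auth = base64.b64encode(
        f"{os.environ['LANGFUSE_PUBLIC_KEY']}:"
        f"{os.environ['LANGFUSE_SECRET_KEY']}".encode()
    ).decode()
    req = urllib.request.Request(
        f"{host}/api/public/datasets/{dataset}/runs/{run}",  # assumed path
        headers={"Authorization": f"Basic {auth}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def average_scores(scores: list) -> dict:
    """Average score values grouped by name.

    `scores` is assumed to be a flat list like
    [{"name": "correctness", "value": 0.8}, ...].
    """
    by_name: dict = {}
    for s in scores:
        by_name.setdefault(s["name"], []).append(s["value"])
    return {name: statistics.mean(vals) for name, vals in by_name.items()}
```

The aggregation half is the useful part regardless of transport: once you have score objects, grouping by name gives you the same per-run averages the Dataset Compare view shows.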
We are running our experiment via the SDK as part of our CI/CD pipeline. We currently need to re-implement our evaluators in the pipeline code in order to evaluate the results from the experiment runs synchronously (as described in the SDK docs). We want the pipeline to fail based on score thresholds. If I understand the docs you linked correctly, LLM-as-a-judge targeted at experiments runs in the background? Or is there a way to get the results synchronously?
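The threshold gate described above can be sketched independently of where the scores come from. Everything here is a placeholder: the score name, the threshold, and the shape of `results` (one dict of scores per dataset item, produced by your own re-implemented synchronous evaluators):

```python
import sys

# Sketch of a CI gate over synchronous evaluator results.
# `results` stands in for whatever your SDK experiment loop produced
# locally, e.g. [{"accuracy": 1.0}, {"accuracy": 0.8}].

THRESHOLDS = {"accuracy": 0.8}  # hypothetical score name and minimum average

def gate(results: list, thresholds: dict) -> list:
    """Return (name, average, minimum) triples for every failed threshold."""
    failures = []
    for name, minimum in thresholds.items():
        values = [r[name] for r in results if name in r]
        avg = sum(values) / len(values) if values else 0.0
        if avg < minimum:
            failures.append((name, avg, minimum))
    return failures

if __name__ == "__main__":
    results = [{"accuracy": 1.0}, {"accuracy": 0.8}]  # placeholder data
    failed = gate(results, THRESHOLDS)
    if failed:
        for name, avg, minimum in failed:
            print(f"FAIL: {name} avg {avg:.2f} < {minimum}")
        sys.exit(1)  # non-zero exit fails the pipeline
    print("all score thresholds met")
```

A non-zero exit code is all most CI systems need, so this part works the same whether the scores were computed inline or fetched after the fact.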
I previously defined a few LLM-as-a-judge evaluators which run when my traces have certain tags.
I'm now using SDK experiments.
My goal:
After chatting with your AI bot, I understand that:
Options:
1 - I thought I could fetch my custom evaluators on Langfuse with the SDK and attach them to the experiment run
2 - I thought I could just write a custom evaluator that would sleep a while and fetch scores on the trace
3 - Therefore, it sounds like I'd need to rewrite the LLM-as-a-judge logic locally
Overall, seems like I'm trying to mix two different feature sets that weren't designed to go together.
Thought I'd share that I initially thought these things would be more seamlessly integrated.
I'd have liked the ability to fetch scores generated by auto-fired evals on any trace created during a dataset item run to be baked into the platform.
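For what option 3 could look like: a hypothetical synchronous evaluator that re-implements a judge prompt locally. `call_llm` is an assumed wrapper around whatever chat-completion client is in use, and the prompt/JSON contract is illustrative, not the actual evaluator configured in Langfuse:

```python
import json
from typing import Callable

# Hypothetical sketch of rewriting an LLM-as-a-judge evaluator as a
# synchronous function. `call_llm` is assumed: prompt in, raw text out.

JUDGE_PROMPT = (
    "Rate the following answer for correctness from 0.0 to 1.0.\n"
    'Respond with JSON: {{"score": <float>, "reasoning": "<why>"}}\n\n'
    "Question: {question}\nAnswer: {answer}"
)

def correctness_evaluator(question: str, answer: str,
                          call_llm: Callable[[str], str]) -> dict:
    """Synchronous judge: format the prompt, call the model, parse the score."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    parsed = json.loads(raw)
    return {
        "name": "correctness",
        "value": float(parsed["score"]),
        "comment": parsed.get("reasoning", ""),
    }
```

Because it returns a plain score dict immediately, a function like this can run inline in an SDK experiment and feed a threshold check in the same process, at the cost of keeping the prompt in sync with the evaluator configured in the Langfuse UI by hand.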