Langfuse LLM-as-Judge: Can evaluator models use web search/tools during scoring? #11784
**Describe your question**

Hi Langfuse team — when running LLM-as-a-judge evaluations, can the judge model actually browse/search the web (or call tools/functions) during scoring? We're using LiteLLM with AWS Bedrock (Claude 4.5 Sonnet), and the judge appears to rely on stale priors instead of verifying facts, even when the prompt says "MUST use web search." Is tool/browsing access supported for judges, and if so, how do we enable/configure it? Or is passing retrieved context the recommended approach?

**Langfuse Cloud or Self-Hosted?**
Langfuse Cloud

**If Self-Hosted**
No response

**If Langfuse Cloud**
https://aix-lens-stage.cnbc.com/project/cmkg8fepf003plj07g28lk8zq

**SDK and integration versions**
No response

**Pre-Submission Checklist**
Based on the available documentation, Langfuse's LLM-as-a-Judge evaluators do not currently support web search or tool calling during the evaluation process (1). The evaluator works by sending a prompt to the configured LLM with variables populated from your trace data (input, output, ground_truth, etc.) (1), and the model is expected to return a structured output containing a score and reasoning (1).

Recommended approach: instead of relying on the judge model to search the web, pass any necessary context or retrieved information as part of the evaluation variables (1). When setting up your evaluator, you can map additional data from your traces using JSONPath expressions to include the context the judge needs for accurate scoring (1). For example, if you need the judge to verify facts, you would:

1. Retrieve the current, verified information in your own application (e.g., via a search or RAG step you control).
2. Log that information onto the trace, for example in trace metadata or as an observation output.
3. Map it into an evaluator variable with a JSONPath expression so the judge prompt can reference it directly.
This ensures the judge has access to current, verified information without needing real-time web-search capabilities.

Note: while gpt-5 with web search is mentioned in GitHub Ideas discussions (2), this appears to be a feature request rather than a currently available capability.

📚 Sources: Have another question? Just tag @inkeep.
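The logging step above can be sketched as follows. This is an illustrative sketch, not official Langfuse code: it assumes a Langfuse v2 Python SDK client (created with `Langfuse()`), whose low-level `trace()` method accepts `input`, `output`, and `metadata`. The `retrieve` and `generate` callables are placeholders for your own search and LiteLLM/Bedrock calls.

```python
# Sketch: attach retrieved, verified facts to the trace at generation time
# so the judge evaluator can map them in as a variable via a JSONPath
# expression such as $.metadata.retrieved_context.
# `client` is assumed to be a Langfuse v2 Python SDK client (Langfuse()).

def answer_and_log(client, question: str, retrieve, generate) -> str:
    facts = retrieve(question)          # your app-side web/RAG lookup
    answer = generate(question, facts)  # your LiteLLM/Bedrock completion
    client.trace(                       # v2 SDK low-level trace creation
        name="qa",
        input=question,
        output=answer,
        metadata={"retrieved_context": facts},
    )
    return answer
```

With this in place, the evaluator's variable mapping can point at `metadata.retrieved_context` and the judge never needs live browsing.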
Hi @aniket-versant
Langfuse's LLM-as-a-Judge evaluators currently do not support tool calling or web search during the evaluation process.
You can follow the Inkeep bot's approach (passing retrieved context as evaluation variables), or set up an external evaluation pipeline that fetches traces, runs your own tool-enabled judge, and writes scores back via the API.
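An external pipeline along those lines might look like the sketch below. This is a hedged illustration, not an official recipe: it assumes the Langfuse v2 Python SDK, where `fetch_traces()` and `score()` exist (the v3 SDK uses `create_score()` instead), and `run_judge` is a placeholder for your own tool-enabled LiteLLM/Bedrock call returning a dict with `score` and `reason` keys.

```python
# Hedged sketch of an external evaluation pipeline (assumes Langfuse v2 SDK).
# `run_judge` is a placeholder for a judge call with web-search tools enabled.

def build_judge_prompt(question, answer, retrieved_context):
    # Inline the verified context so the judge never needs live web search.
    return (
        "Grade the answer for factual accuracy using ONLY the context below.\n"
        f"Question: {question}\n"
        f"Answer under review: {answer}\n"
        f"Verified context:\n{retrieved_context}\n"
        "Return a score between 0 and 1 plus a one-sentence reason."
    )

def score_recent_traces(langfuse, run_judge, limit=50):
    for trace in langfuse.fetch_traces(limit=limit).data:
        meta = trace.metadata or {}
        prompt = build_judge_prompt(
            trace.input, trace.output, meta.get("retrieved_context", "")
        )
        verdict = run_judge(prompt)   # {"score": ..., "reason": ...}
        langfuse.score(               # write the result back to the trace
            trace_id=trace.id,
            name="factuality",
            value=verdict["score"],
            comment=verdict["reason"],
        )
```

Running this on a schedule (cron, queue worker) gives the judge full tool access while keeping scores visible in Langfuse alongside the traces.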
There is a related feature request for this; feel free to upvote it so that you're notified once we make this functionality available: https://github.com/orgs/langfuse/discussions/9346