Aggregated metrics for experiments #11404
Describe your question
We would like to use Langfuse experiments for multi-label classification tasks, where we usually use aggregated metrics like F2, recall & precision. Is this currently supported? I know it was on the roadmap, but I'm not sure whether it has been done yet.

Langfuse Cloud or Self-Hosted?
Langfuse Cloud
Yes, aggregated metrics for experiments are now supported through run-level scores(1). Langfuse supports two types of experiment scores(2): item-level scores, which are attached to individual dataset items, and run-level scores, which apply to the experiment run as a whole.
The experiment runner SDK supports both item-level and run-level evaluators(3). You can implement custom aggregated metrics like F2, recall, and precision using run evaluators(3). Here's an example of how to calculate aggregated metrics:

```python
def average_accuracy_evaluator(*, item_results, **kwargs):
    # Calculate average accuracy across all items
    accuracies = [
        evaluation["value"]
        for result in item_results
        for evaluation in result["evaluations"]
        if evaluation["name"] == "accuracy"
    ]
    if not accuracies:
        return {"name": "avg_accuracy", "value": None}
    avg = sum(accuracies) / len(accuracies)
    return {
        "name": "avg_accuracy",
        "value": avg,
        "comment": f"Average accuracy: {avg * 100:.1f}%",
    }

result = langfuse.run_experiment(
    name="My Experiment",
    data=test_data,
    task=my_task,
    evaluators=[item_level_evaluator],
    run_evaluators=[average_accuracy_evaluator],
)
```

You can access run-level evaluation results directly from the experiment result object(3). For your multi-label classification tasks, you would implement custom run evaluators that calculate F2, recall, and precision based on the item-level results (see the sketch below).

📚 Sources:

Have another question? Just tag @inkeep.
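Building on the pattern in the example above, here is a minimal sketch of how micro-averaged precision, recall, and F2 could be computed for a multi-label task. The evaluator names, the `output`/`expected_output` parameter names, and the `tp`/`fp`/`fn` score names are assumptions for illustration, not confirmed SDK behavior: item-level evaluators emit per-item true-positive, false-positive, and false-negative counts, and a run-level evaluator aggregates them.

```python
# Sketch only: the evaluator signatures and evaluation-dict shape mirror the
# example above; exact parameter names are assumptions, not confirmed SDK behavior.

def _label_counts(output, expected_output):
    """Per-item (tp, fp, fn), treating output/expected_output as collections of labels."""
    predicted, expected = set(output), set(expected_output)
    tp = len(predicted & expected)
    fp = len(predicted - expected)
    fn = len(expected - predicted)
    return tp, fp, fn

# Item-level evaluators: emit raw counts so the run evaluator can micro-average them.
def tp_evaluator(*, output, expected_output, **kwargs):
    return {"name": "tp", "value": _label_counts(output, expected_output)[0]}

def fp_evaluator(*, output, expected_output, **kwargs):
    return {"name": "fp", "value": _label_counts(output, expected_output)[1]}

def fn_evaluator(*, output, expected_output, **kwargs):
    return {"name": "fn", "value": _label_counts(output, expected_output)[2]}

def multilabel_f2_evaluator(*, item_results, **kwargs):
    # Run-level evaluator: sum the per-item counts, then compute micro-averaged
    # precision, recall, and F2 (F-beta with beta=2, which weights recall higher).
    def total(name):
        return sum(
            evaluation["value"]
            for result in item_results
            for evaluation in result["evaluations"]
            if evaluation["name"] == name
        )

    tp, fp, fn = total("tp"), total("fp"), total("fn")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f2 = 5 * precision * recall / (4 * precision + recall) if (precision + recall) else 0.0
    return {
        "name": "micro_f2",
        "value": f2,
        "comment": f"precision={precision:.3f}, recall={recall:.3f}",
    }
```

Summing counts and micro-averaging at the run level avoids averaging per-item precision/recall values, which would weight every item equally regardless of how many labels it has.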
Hi @shaharhyro, inkeep is right. You can execute experiments via the SDK and attach custom aggregated scores to the results (see the wiring sketch below).
We are also planning to support this in the UI - tracking this here: https://github.com/orgs/langfuse/discussions/6087
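For completeness, the evaluators sketched above would be attached the same way as in the inkeep example; the dataset and task names here are placeholders, not part of the original discussion.

```python
# Hypothetical wiring of the sketch above into an experiment run; `test_data`
# and `my_task` are placeholders, mirroring the earlier example.
result = langfuse.run_experiment(
    name="Multi-label classification eval",
    data=test_data,
    task=my_task,
    evaluators=[tp_evaluator, fp_evaluator, fn_evaluator],
    run_evaluators=[multilabel_f2_evaluator],
)
```

The aggregated F2 would then be recorded as a run-level score, in line with the answer above.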