Aggregated metrics for experiments #11404
Describe your question
We would like to use Langfuse experiments for multi-label classification tasks, where we usually use aggregated metrics like F2, recall & precision. Is this currently supported? I know it was on the roadmap, but I'm not sure whether it has been done yet.

Langfuse Cloud or Self-Hosted?
Langfuse Cloud
Yes, aggregated metrics for experiments are now supported through run-level scores(1). Langfuse supports two types of experiment scores(2): item-level scores, which are attached to individual dataset items, and run-level scores, which apply to the experiment run as a whole.
The experiment runner SDK supports both item-level and run-level evaluators(3). You can implement custom aggregated metrics like F2, recall, and precision using run evaluators(3). Here's an example of how to calculate aggregated metrics:

```python
def average_accuracy_evaluator(*, item_results, **kwargs):
    # Calculate average accuracy across all items
    accuracies = [
        evaluation["value"]
        for result in item_results
        for evaluation in result["evaluations"]
        if evaluation["name"] == "accuracy"
    ]
    if not accuracies:
        return {"name": "avg_accuracy", "value": None}
    avg = sum(accuracies) / len(accuracies)
    return {
        "name": "avg_accuracy",
        "value": avg,
        "comment": f"Average accuracy: {avg * 100:.1f}%",
    }

result = langfuse.run_experiment(
    name="My Experiment",
    data=test_data,
    task=my_task,
    evaluators=[item_level_evaluator],
    run_evaluators=[average_accuracy_evaluator],
)
```

You can access run-level evaluation results directly from the experiment result object(3). For your multi-label classification tasks, you would implement custom run evaluators that calculate F2, recall, and precision based on the item-level results (see the sketch below).

📚 Sources:

Have another question? Just tag @inkeep.
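Building on the pattern in the example above, here is a minimal sketch of how micro-averaged precision, recall, and F2 could be computed for a multi-label task. The evaluator names, the `output`/`expected_output` parameter names, and the `tp`/`fp`/`fn` score names are assumptions for illustration, not confirmed SDK behavior: item-level evaluators emit per-item true-positive, false-positive, and false-negative counts, and a run-level evaluator aggregates them.

```python
# Sketch only: the evaluator signatures and evaluation-dict shape mirror the
# example above; exact parameter names are assumptions, not confirmed SDK behavior.

def _label_counts(output, expected_output):
    """Per-item (tp, fp, fn), treating output/expected_output as collections of labels."""
    predicted, expected = set(output), set(expected_output)
    tp = len(predicted & expected)
    fp = len(predicted - expected)
    fn = len(expected - predicted)
    return tp, fp, fn

# Item-level evaluators: emit raw counts so the run evaluator can micro-average them.
def tp_evaluator(*, output, expected_output, **kwargs):
    return {"name": "tp", "value": _label_counts(output, expected_output)[0]}

def fp_evaluator(*, output, expected_output, **kwargs):
    return {"name": "fp", "value": _label_counts(output, expected_output)[1]}

def fn_evaluator(*, output, expected_output, **kwargs):
    return {"name": "fn", "value": _label_counts(output, expected_output)[2]}

def multilabel_f2_evaluator(*, item_results, **kwargs):
    # Run-level evaluator: sum the per-item counts, then compute micro-averaged
    # precision, recall, and F2 (F-beta with beta=2, which weights recall higher).
    def total(name):
        return sum(
            evaluation["value"]
            for result in item_results
            for evaluation in result["evaluations"]
            if evaluation["name"] == name
        )

    tp, fp, fn = total("tp"), total("fp"), total("fn")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f2 = 5 * precision * recall / (4 * precision + recall) if (precision + recall) else 0.0
    return {
        "name": "micro_f2",
        "value": f2,
        "comment": f"precision={precision:.3f}, recall={recall:.3f}",
    }
```

Summing counts and micro-averaging at the run level avoids averaging per-item precision/recall values, which would weight every item equally regardless of how many labels it has.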
Hi @shaharhyro, inkeep is right. You can execute experiments via the SDK and attach custom aggregated scores to the results (see the wiring sketch below).
We are also planning to support this in the UI - tracking this here: https://github.com/orgs/langfuse/discussions/6087
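For completeness, the evaluators sketched above would be attached the same way as in the inkeep example; the dataset and task names here are placeholders, not part of the original discussion.

```python
# Hypothetical wiring of the sketch above into an experiment run; `test_data`
# and `my_task` are placeholders, mirroring the earlier example.
result = langfuse.run_experiment(
    name="Multi-label classification eval",
    data=test_data,
    task=my_task,
    evaluators=[tp_evaluator, fp_evaluator, fn_evaluator],
    run_evaluators=[multilabel_f2_evaluator],
)
```

The aggregated F2 would then be recorded as a run-level score, in line with the answer above.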