Hello,
Thank you for the insightful paper “Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems”. I'm trying to reproduce the results in Figure 3, but the exact metric values cannot be read off the plot.
Could you please clarify:
- For Figure 3, are the reported agent-level and step-level accuracies computed with ground-truth labels available to the judge, without ground truth, or averaged across both settings?
- Could you share the exact numeric values underlying Figure 3 (per model and per method: Random / All-at-Once / Step-by-Step / Binary Search), ideally in a small table or CSV? A sketch of one possible layout follows this list.
- If Figure 3 averages across the “with GT” and “without GT” settings, could you also share the breakdown for each setting separately?
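For the second question, a hypothetical layout like the following would be ideal (the column names and the `<model>` placeholder are only suggestions; the method names are the four from Figure 3):

```csv
model,method,agent_level_accuracy,step_level_accuracy
<model>,Random,...,...
<model>,All-at-Once,...,...
<model>,Step-by-Step,...,...
<model>,Binary Search,...,...
```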
Thank you very much in advance!