Hello,
Thank you for the insightful paper “Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems”. I'm trying to reproduce the results in Figure 3, but the exact metric values cannot be read off the plot.
Could you please clarify:
- For Figure 3, are the reported agent-level and step-level accuracies computed with ground-truth labels available to the judge, without ground truth, or averaged across both settings?
- Could you share the exact numeric values underlying Figure 3 (per model and per method: Random / All-at-Once / Step-by-Step / Binary Search), ideally in a small table or CSV? A sketch of one possible layout follows this list.
- If Figure 3 averages across the “with GT” and “without GT” settings, could you also share the breakdown for each setting separately?
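For the second question, a hypothetical layout like the following would be ideal (the column names and the `<model>` placeholder are only suggestions; the method names are the four from Figure 3):

```csv
model,method,agent_level_accuracy,step_level_accuracy
<model>,Random,...,...
<model>,All-at-Once,...,...
<model>,Step-by-Step,...,...
<model>,Binary Search,...,...
```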
Thank you very much in advance!