Fig. 3 exact values and GT setting (with/without/avg) #11

@alinzh

Description

Hello,

Thank you for the insightful paper "Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems". I'm trying to reproduce the results in Figure 3, but the exact metric values can't be read off the plot.

Could you please clarify:

  1. For Figure 3, are the reported agent-level and step-level accuracies computed with ground-truth labels available to the judge, without ground-truth, or averaged across both settings?
  2. Could you share the exact numeric values underlying Figure 3 (per model and per method: Random / All-at-Once / Step-by-Step / Binary Search), ideally in a small table or CSV? (A sketch of a possible CSV schema follows this list.)
  3. If Figure 3 uses averaging across the “with GT” and “without GT” scenarios, could you also share the breakdown for each scenario separately?
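
For concreteness, here is a minimal sketch of how I'd consume such a CSV and reproduce the averaging from question 3. The file name and column names (`model`, `method`, `setting`, `agent_level_acc`, `step_level_acc`) are just my guess at a possible schema, not anything taken from the paper:

```python
import csv
from collections import defaultdict

# Hypothetical schema (my assumption, not the paper's actual export format):
# model, method, setting (with_gt / without_gt), agent_level_acc, step_level_acc
with open("fig3_values.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Average agent-level accuracy over the with/without-GT settings
# per (model, method) pair -- the breakdown asked about in question 3.
avg = defaultdict(list)
for r in rows:
    avg[(r["model"], r["method"])].append(float(r["agent_level_acc"]))

for (model, method), vals in sorted(avg.items()):
    print(f"{model:<20} {method:<15} avg agent-level acc: {sum(vals) / len(vals):.3f}")
```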

Thank you very much in advance!
