Datasets Task Classification

Could you please clarify which dataset the Multi-hop TQA accuracy in Table 2 is based on? I've noticed that both SQA and WTQ belong to this task category, but the accuracy and recall for WTQ fall significantly short of the values reported in the table.