Skip to content

Possible erroneous gold SQL in the dataset? #112

@yicun0720

Description

@yicun0720

I found a few subtle and imperceptible erroneous gold SQLs in the test dataset.
I suppose it may affect the evaluation result of model accuracy when using Test Suite Accuracy metric.

  1. JOIN to IN-subquery. (case 18 in test dataset)
    Natural language query: Show names of clubs that have players with more than 2 win counts.
    Original gold SQL:
    SELECT T1.Name FROM club AS T1 JOIN player AS T2 ON T1.Club_ID = T2.Club_ID WHERE T2.Wins_count > 2.
    It should be:
    SELECT name FROM club WHERE club_id IN (SELECT club_id FROM player WHERE wins_count > 2).
    The JOIN operator would introduce duplicated rows in the result set. Consider a database instance having one club of name 'club_1' with 10 players in this club, then the name 'club_1' would appear 10 times in the result set, which is not suitable.
    Adding a 'DISTINCT' may also drop duplicated names from DIFFERENT clubs.
    I propose the SQL with IN-subquery as the best-suited one to this natural language query.

  2. INNER JOIN -> LEFT JOIN. (case 129 in test dataset)
    Natural language query: Find the number of programs for each channel. Return the name of each channel as well.
    Original gold SQL:
    SELECT t1.name , count(*) FROM channel AS t1 JOIN program AS t2 ON t1.channel_id = t2.channel_id GROUP BY t1.channel_id.
    It should be:
    SELECT c.name, COUNT(p.program_id) FROM channel AS c LEFT JOIN program AS p ON c.channel_id = p.channel_id GROUP BY c.channel_id.
    The natural language query asks for the number of EACH channel. The INNER JOIN accidentally drops the channel records with NO programs while the LEFT JOIN preserves them. LEFT JOIN is better-suited to fit the natural language query semantics.

Note that similar errors also exist in the training set, fixing them could improve the quality of training data.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions