# WikiSQL

WikiSQL is a large crowd-sourced dataset for developing natural language interfaces for relational databases.

## Citation

If you use WikiSQL, please cite the following work:

> Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning.

## Installation

Both the evaluation script and the dataset are stored in this repo.
To track the data file, we use [Git LFS](https://git-lfs.github.com/).
The installation steps are as follows:

```bash
git clone https://github.com/MetaMind/WikiSQL
cd WikiSQL
pip install -r requirements.txt
tar xvjf data.tar.bz2
```

This will unpack the data files into a directory called `data`.
## Content and format

Inside the `data` folder you will find the files in `jsonl` and `db` format.
The former can be read line by line, where each line is a serialized JSON object.
The latter is a SQLite3 database.
| 33 | + |
### Question, query and table ID

These files are contained in the `*.jsonl` files. A line looks like the following:

```json
{
   "phase":1,
   "question":"who is the manufacturer for the order year 1998?",
   "sql":{
      "conds":[
         [
            0,
            0,
            "1998"
         ]
      ],
      "sel":1,
      "agg":0
   },
   "table_id":"1-10007452-3"
}
```

The fields represent the following:

- `phase`: the phase in which the dataset was collected. We collected WikiSQL in two phases.
- `question`: the natural language question written by the worker.
- `table_id`: the ID of the table to which this question is addressed.
- `sql`: the SQL query corresponding to the question. This has the following subfields:
  - `sel`: the numerical index of the column that is being selected. You can find the actual column from the table.
  - `agg`: the numerical index of the aggregation operator that is being used. You can find the actual operator from `Query.agg_ops` in `lib/query.py`.
  - `conds`: a list of triplets `(column_index, operator_index, condition)` where:
    - `column_index`: the numerical index of the condition column that is being used. You can find the actual column from the table.
    - `operator_index`: the numerical index of the condition operator that is being used. You can find the actual operator from `Query.cond_ops` in `lib/query.py`.
    - `condition`: the comparison value for the condition, in either `string` or `float` type.
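
To make the indices concrete, the sketch below renders a `sql` dict as a readable query string. The operator lists are reproduced here for illustration only; the authoritative versions are `Query.agg_ops` and `Query.cond_ops` in `lib/query.py`:

```python
# Operator lists mirroring Query.agg_ops and Query.cond_ops in lib/query.py.
AGG_OPS = ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG']
COND_OPS = ['=', '>', '<', 'OP']

def sql_to_string(sql, header):
    """Render a WikiSQL `sql` dict as a readable query over `header` columns."""
    sel = header[sql['sel']]
    agg = AGG_OPS[sql['agg']]
    select = f'{agg}({sel})' if agg else sel
    where = ' AND '.join(
        f"{header[col]} {COND_OPS[op]} {val!r}"
        for col, op, val in sql['conds'])
    query = f'SELECT {select} FROM table'
    return f'{query} WHERE {where}' if where else query
```

For the example record above, with a hypothetical header `['Order year', 'Manufacturer']`, this returns `SELECT Manufacturer FROM table WHERE Order year = '1998'`.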

### Tables

These files are contained in the `*.tables.jsonl` files. A line looks like the following:

```json
{
   "id":"1-1000181-1",
   "header":[
      "State/territory",
      "Text/background colour",
      "Format",
      "Current slogan",
      "Current series",
      "Notes"
   ],
   "types":[
      "text",
      "text",
      "text",
      "text",
      "text",
      "text"
   ],
   "rows":[
      [
         "Australian Capital Territory",
         "blue/white",
         "Yaa\u00b7nna",
         "ACT \u00b7 CELEBRATION OF A CENTURY 2013",
         "YIL\u00b700A",
         "Slogan screenprinted on plate"
      ],
      [
         "New South Wales",
         "black/yellow",
         "aa\u00b7nn\u00b7aa",
         "NEW SOUTH WALES",
         "BX\u00b799\u00b7HI",
         "No slogan on current series"
      ],
      [
         "New South Wales",
         "black/white",
         "aaa\u00b7nna",
         "NSW",
         "CPX\u00b712A",
         "Optional white slimline series"
      ],
      [
         "Northern Territory",
         "ochre/white",
         "Ca\u00b7nn\u00b7aa",
         "NT \u00b7 OUTBACK AUSTRALIA",
         "CB\u00b706\u00b7ZZ",
         "New series began in June 2011"
      ],
      [
         "Queensland",
         "maroon/white",
         "nnn\u00b7aaa",
         "QUEENSLAND \u00b7 SUNSHINE STATE",
         "999\u00b7TLG",
         "Slogan embossed on plate"
      ],
      [
         "South Australia",
         "black/white",
         "Snnn\u00b7aaa",
         "SOUTH AUSTRALIA",
         "S000\u00b7AZD",
         "No slogan on current series"
      ],
      [
         "Victoria",
         "blue/white",
         "aaa\u00b7nnn",
         "VICTORIA - THE PLACE TO BE",
         "ZZZ\u00b7562",
         "Current series will be exhausted this year"
      ]
   ]
}
```
The fields represent the following:

- `id`: the table ID.
- `header`: a list of column names in the table.
- `types`: a list of column types, one per column in `header`.
- `rows`: a list of rows. Each row is a list of row entries.

Tables are also contained in a corresponding `*.db` file.
This is a SQLite3 database with the same information.
Note that due to the flexible format of HTML tables, the column names of tables in the database have been symbolized.
For example, for a table with the columns `['foo', 'bar']`, the columns in the database are actually `col0` and `col1`.
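
The mapping between original and symbolized names can be derived from the table's `header`. A minimal, self-contained sketch using an in-memory database and made-up data (not the actual WikiSQL `*.db` schema or table names):

```python
import sqlite3

# A table with columns ['foo', 'bar'] is stored as col0, col1;
# map original names to their symbolized counterparts.
header = ['foo', 'bar']
col_map = {name: f'col{i}' for i, name in enumerate(header)}

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (col0 text, col1 text)')
conn.execute('INSERT INTO t VALUES (?, ?)', ('a', 'b'))

# To select the original column 'bar', query its symbolized name:
rows = conn.execute(f"SELECT {col_map['bar']} FROM t").fetchall()
```

The same lookup applies when querying the real database: find the column's position in the table's `header` and address it as `col<i>`.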
| 163 | + |
## Scripts

`evaluate.py` contains the evaluation script, whose options are:

```
usage: evaluate.py [-h] source_file db_file pred_file

positional arguments:
  source_file  source file for the prediction
  db_file      source database for the prediction
  pred_file    predictions by the model

optional arguments:
  -h, --help   show this help message and exit
```

The `pred_file`, which is supplied by the user, should contain lines of serialized JSON objects.
Each JSON object should contain a `query` field, which corresponds to the query predicted for the corresponding line in the input `*.jsonl` file and has the same format as the `sql` field of the input.
In particular, it should contain:

- `sel`: the numerical index of the column that is being selected. You can find the actual column from the table.
- `agg`: the numerical index of the aggregation operator that is being used. You can find the actual operator from `Query.agg_ops` in `lib/query.py`.
- `conds`: a list of triplets `(column_index, operator_index, condition)` where:
  - `column_index`: the numerical index of the condition column that is being used. You can find the actual column from the table.
  - `operator_index`: the numerical index of the condition operator that is being used. You can find the actual operator from `Query.cond_ops` in `lib/query.py`.
  - `condition`: the comparison value for the condition, in either `string` or `float` type.

An example predictions file can be found in `test/example.pred.dev.jsonl`.
The `lib` directory contains dependencies of `evaluate.py`.

## Integration Test

We supply a sample predictions file for the dev set in `test/example.pred.dev.jsonl.bz2`.
You can decompress this file with `bunzip2 -k test/example.pred.dev.jsonl.bz2` to see what a real predictions file should look like.
We distribute a Dockerfile which installs the necessary dependencies of this library and runs the evaluation script on this file.
The Dockerfile also serves as an example of how to use the evaluation script.

To run the test, first build the image from the root directory:

```bash
docker build -t wikisqltest -f test/Dockerfile .
```

Next, run the image:

```bash
docker run --rm --name wikisqltest wikisqltest
```

If everything works correctly, the output should be:

```json
{
  "ex_accuracy": 0.37036632039365774,
  "lf_accuracy": 0.2334609075997813
}
```