
Commit 2f890b4

committed: initial
0 parents  commit 2f890b4

15 files changed, +589 -0 lines changed

.dockerignore

Lines changed: 1 addition & 0 deletions
data/

.gitattributes

Lines changed: 1 addition & 0 deletions
*.bz2 filter=lfs diff=lfs merge=lfs -text

.gitignore

Lines changed: 3 additions & 0 deletions
*.py[cod]
__pycache__/
data/

LICENSE

Lines changed: 29 additions & 0 deletions
BSD 3-Clause License

Copyright (c) 2017, Salesforce Research
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

README.md

Lines changed: 220 additions & 0 deletions
# WikiSQL

WikiSQL is a large crowd-sourced dataset for developing natural language interfaces for relational databases.

## Citation

If you use WikiSQL, please cite the following work:

> Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning.

## Installation

Both the evaluation script and the dataset are stored in the repo.
To track the data file, we use [Git LFS](https://git-lfs.github.com/).
The installation steps are as follows:

```bash
git clone https://github.com/MetaMind/WikiSQL
cd WikiSQL
pip install -r requirements.txt
tar xvjf data.tar.bz2
```

This will unpack the data files into a directory called `data`.

## Content and format

Inside the `data` folder you will find the files in `jsonl` and `db` format.
The former can be read line by line, where each line is a serialized JSON object.
The latter is a SQLite3 database.
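For example, each line of a `*.jsonl` file can be parsed with the standard `json` module. This is only an illustrative sketch; the file name below is a placeholder for whichever split you unpacked.

```python
import json

# Placeholder file name; any of the unpacked *.jsonl files is read the same way.
with open('data/dev.jsonl') as f:
    for line in f:
        example = json.loads(line)
        print(example['question'], example['table_id'])
```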

### Question, query and table ID

These are contained in the `*.jsonl` files. A line looks like the following:

```json
{
   "phase":1,
   "question":"who is the manufacturer for the order year 1998?",
   "sql":{
      "conds":[
         [
            0,
            0,
            "1998"
         ]
      ],
      "sel":1,
      "agg":0
   },
   "table_id":"1-10007452-3"
}
```

The fields represent the following (a small decoding sketch follows this list):

- `phase`: the phase in which the dataset was collected. We collected WikiSQL in two phases.
- `question`: the natural language question written by the worker.
- `table_id`: the ID of the table to which this question is addressed.
- `sql`: the SQL query corresponding to the question. This has the following subfields:
  - `sel`: the numerical index of the column that is being selected. You can find the actual column from the table.
  - `agg`: the numerical index of the aggregation operator that is being used. You can find the actual operator from `Query.agg_ops` in `lib/query.py`.
  - `conds`: a list of triplets `(column_index, operator_index, condition)` where:
    - `column_index`: the numerical index of the condition column that is being used. You can find the actual column from the table.
    - `operator_index`: the numerical index of the condition operator that is being used. You can find the actual operator from `Query.cond_ops` in `lib/query.py`.
    - `condition`: the comparison value for the condition, in either `string` or `float` type.
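As a sketch of how these indices map back to a readable query, the hypothetical helper below combines the table header with the operator lists exposed as `Query.agg_ops` and `Query.cond_ops` in `lib/query.py` (the helper itself is not part of the repo):

```python
from lib.query import Query  # Query.agg_ops and Query.cond_ops index the operators


def to_readable(sql, header):
    """Render the sel/agg/conds indices of one example as a query string."""
    column = header[sql['sel']]
    agg = Query.agg_ops[sql['agg']]
    select = '{}({})'.format(agg, column) if agg else column
    conds = ' AND '.join(
        '{} {} {!r}'.format(header[col], Query.cond_ops[op], val)
        for col, op, val in sql['conds'])
    return 'SELECT {} FROM table{}'.format(
        select, ' WHERE ' + conds if conds else '')

# e.g. to_readable(example['sql'], table['header']) for the example above.
```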

### Tables

The tables are contained in the `*.tables.jsonl` files. A line looks like the following:

```json
{
   "id":"1-1000181-1",
   "header":[
      "State/territory",
      "Text/background colour",
      "Format",
      "Current slogan",
      "Current series",
      "Notes"
   ],
   "types":[
      "text",
      "text",
      "text",
      "text",
      "text",
      "text"
   ],
   "rows":[
      [
         "Australian Capital Territory",
         "blue/white",
         "Yaa\u00b7nna",
         "ACT \u00b7 CELEBRATION OF A CENTURY 2013",
         "YIL\u00b700A",
         "Slogan screenprinted on plate"
      ],
      [
         "New South Wales",
         "black/yellow",
         "aa\u00b7nn\u00b7aa",
         "NEW SOUTH WALES",
         "BX\u00b799\u00b7HI",
         "No slogan on current series"
      ],
      [
         "New South Wales",
         "black/white",
         "aaa\u00b7nna",
         "NSW",
         "CPX\u00b712A",
         "Optional white slimline series"
      ],
      [
         "Northern Territory",
         "ochre/white",
         "Ca\u00b7nn\u00b7aa",
         "NT \u00b7 OUTBACK AUSTRALIA",
         "CB\u00b706\u00b7ZZ",
         "New series began in June 2011"
      ],
      [
         "Queensland",
         "maroon/white",
         "nnn\u00b7aaa",
         "QUEENSLAND \u00b7 SUNSHINE STATE",
         "999\u00b7TLG",
         "Slogan embossed on plate"
      ],
      [
         "South Australia",
         "black/white",
         "Snnn\u00b7aaa",
         "SOUTH AUSTRALIA",
         "S000\u00b7AZD",
         "No slogan on current series"
      ],
      [
         "Victoria",
         "blue/white",
         "aaa\u00b7nnn",
         "VICTORIA - THE PLACE TO BE",
         "ZZZ\u00b7562",
         "Current series will be exhausted this year"
      ]
   ]
}
```

The fields represent the following:

- `id`: the table ID.
- `header`: a list of column names in the table.
- `types`: a list of the types of the columns in the table.
- `rows`: a list of rows. Each row is a list of row entries.

Tables are also contained in a corresponding `*.db` file.
This is a SQL database with the same information.
Note that due to the flexible format of HTML tables, the column names of tables in the database have been symbolized.
For example, for a table with the columns `['foo', 'bar']`, the columns in the database are actually `col0` and `col1`.
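For illustration, here is a minimal sketch of querying the symbolized columns directly with Python's built-in `sqlite3` module. The database file name and table ID below are placeholders; the `table_...` naming follows the convention used in `lib/dbengine.py`.

```python
import sqlite3

# Placeholder path and table ID; adjust to your unpacked data.
conn = sqlite3.connect('data/dev.db')

# A table with ID "1-10007452-3" is stored as "table_1_10007452_3"
# (dashes become underscores, see lib/dbengine.py), with columns col0, col1, ...
for row in conn.execute('SELECT col0, col1 FROM table_1_10007452_3 LIMIT 3'):
    print(row)
```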

## Scripts

`evaluate.py` contains the evaluation script, whose options are:

```
usage: evaluate.py [-h] source_file db_file pred_file

positional arguments:
  source_file  source file for the prediction
  db_file      source database for the prediction
  pred_file    predictions by the model

optional arguments:
  -h, --help   show this help message and exit
```

The `pred_file`, which is supplied by the user, should contain lines of serialized JSON objects.
Each JSON object should contain a `query` field that corresponds to the query predicted for the corresponding line in the input `*.jsonl` file and should be similar to the `sql` field of the input.
In particular, it should contain:

- `sel`: the numerical index of the column that is being selected. You can find the actual column from the table.
- `agg`: the numerical index of the aggregation operator that is being used. You can find the actual operator from `Query.agg_ops` in `lib/query.py`.
- `conds`: a list of triplets `(column_index, operator_index, condition)` where:
  - `column_index`: the numerical index of the condition column that is being used. You can find the actual column from the table.
  - `operator_index`: the numerical index of the condition operator that is being used. You can find the actual operator from `Query.cond_ops` in `lib/query.py`.
  - `condition`: the comparison value for the condition, in either `string` or `float` type.

An example predictions file can be found in `test/example.pred.dev.jsonl`.
The `lib` directory contains dependencies of `evaluate.py`.
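As a rough sketch of producing such a file (not the reference implementation; the output file name and the example query are placeholders), note that `evaluate.py` also reads an `error` field from each prediction line, so a falsy value is written for it here:

```python
import json

# Placeholder predictions, one dict per line of the input *.jsonl file.
predictions = [
    {'sel': 1, 'agg': 0, 'conds': [[0, 0, '1998']]},
]

with open('pred.dev.jsonl', 'w') as f:
    for query in predictions:
        # evaluate.py reads the 'query' field and checks 'error' for each line.
        f.write(json.dumps({'query': query, 'error': False}) + '\n')
```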

## Integration Test

We supply a sample predictions file for the dev set in `test/example.pred.dev.jsonl.bz2`.
You can unzip this file using `bunzip2 test/example.pred.dev.jsonl.bz2 -k` to see what a real predictions file looks like.
We distribute a Dockerfile that installs the necessary dependencies of this library and runs the evaluation script on this file.
The Dockerfile also serves as an example of how to use the evaluation script.

To run the test, first build the image from the root directory:

```bash
docker build -t wikisqltest -f test/Dockerfile .
```

Next, run the image:

```bash
docker run --rm --name wikisqltest wikisqltest
```

If everything works correctly, the output should be:

```json
{
  "ex_accuracy": 0.37036632039365774,
  "lf_accuracy": 0.2334609075997813
}
```

data.tar.bz2

Lines changed: 3 additions & 0 deletions
version https://git-lfs.github.com/spec/v1
oid sha256:c2edd896d4da457e1444db2ce7beec41b78ba9d7afd69ea57f7d408c704541d3
size 26363890

evaluate.py

Lines changed: 41 additions & 0 deletions
#!/usr/bin/env python
import json
from argparse import ArgumentParser
from tqdm import tqdm
from lib.dbengine import DBEngine
from lib.query import Query
from lib.common import count_lines


if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('source_file', help='source file for the prediction')
    parser.add_argument('db_file', help='source database for the prediction')
    parser.add_argument('pred_file', help='predictions by the model')
    args = parser.parse_args()

    engine = DBEngine(args.db_file)
    exact_match = []
    with open(args.source_file) as fs, open(args.pred_file) as fp:
        grades = []
        for ls, lp in tqdm(zip(fs, fp), total=count_lines(args.source_file)):
            eg = json.loads(ls)  # gold example from the source file
            ep = json.loads(lp)  # predicted example from the predictions file
            qg = Query.from_dict(eg['sql'])
            gold = engine.execute_query(eg['table_id'], qg, lower=True)
            pred = ep['error']
            qp = None
            if not ep['error']:
                try:
                    qp = Query.from_dict(ep['query'])
                    pred = engine.execute_query(eg['table_id'], qp, lower=True)
                except Exception as e:
                    pred = repr(e)
            correct = pred == gold  # execution accuracy: same result set
            match = qp == qg        # logical form accuracy: same query
            grades.append(correct)
            exact_match.append(match)
        print(json.dumps({
            'ex_accuracy': sum(grades) / len(grades),
            'lf_accuracy': sum(exact_match) / len(exact_match),
        }, indent=2))

lib/__init__.py

Whitespace-only changes.

lib/common.py

Lines changed: 10 additions & 0 deletions
def count_lines(fname):
    # number of lines in the file, used for tqdm progress totals
    with open(fname) as f:
        return sum(1 for line in f)


def detokenize(tokens):
    # join token surface forms ('gloss') with the whitespace that followed
    # each token ('after') to recover the original string
    ret = ''
    for g, a in zip(tokens['gloss'], tokens['after']):
        ret += g + a
    return ret.strip()
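For illustration, a hypothetical token dict and what `detokenize` returns for it (the field layout follows the function above):

```python
from lib.common import detokenize

# 'gloss' holds the token strings, 'after' the whitespace that followed each token.
tokens = {'gloss': ['who', 'is', 'the', 'manufacturer', '?'],
          'after': [' ', ' ', ' ', '', '']}
print(detokenize(tokens))  # -> 'who is the manufacturer?'
```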

lib/dbengine.py

Lines changed: 49 additions & 0 deletions
import records
import re
from babel.numbers import parse_decimal, NumberFormatError
from lib.query import Query


schema_re = re.compile(r'\((.+)\)')
num_re = re.compile(r'[-+]?\d*\.\d+|\d+')


class DBEngine:

    def __init__(self, fdb):
        self.db = records.Database('sqlite:///{}'.format(fdb))

    def execute_query(self, table_id, query, *args, **kwargs):
        return self.execute(table_id, query.sel_index, query.agg_index, query.conditions, *args, **kwargs)

    def execute(self, table_id, select_index, aggregation_index, conditions, lower=True):
        if not table_id.startswith('table'):
            table_id = 'table_{}'.format(table_id.replace('-', '_'))
        # recover the column name -> type mapping from the CREATE TABLE statement
        table_info = self.db.query('SELECT sql from sqlite_master WHERE tbl_name = :name', name=table_id).all()[0].sql
        schema_str = schema_re.findall(table_info)[0]
        schema = {}
        for tup in schema_str.split(', '):
            c, t = tup.split()
            schema[c] = t
        select = 'col{}'.format(select_index)
        agg = Query.agg_ops[aggregation_index]
        if agg:
            select = '{}({})'.format(agg, select)
        where_clause = []
        where_map = {}
        for col_index, op, val in conditions:
            if lower and isinstance(val, str):
                val = val.lower()
            if schema['col{}'.format(col_index)] == 'real' and not isinstance(val, (int, float)):
                # coerce string conditions on real-typed columns to floats
                try:
                    val = float(parse_decimal(val))
                except NumberFormatError as e:
                    val = float(num_re.findall(val)[0])
            where_clause.append('col{} {} :col{}'.format(col_index, Query.cond_ops[op], col_index))
            where_map['col{}'.format(col_index)] = val
        where_str = ''
        if where_clause:
            where_str = 'WHERE ' + ' AND '.join(where_clause)
        query = 'SELECT {} AS result FROM {} {}'.format(select, table_id, where_str)
        out = self.db.query(query, **where_map)
        return [o.result for o in out]
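As a rough usage sketch (the file names are placeholders; `Query.from_dict` is used the same way in `evaluate.py`), `DBEngine` can execute the gold query of a dataset line against the corresponding database:

```python
import json
from lib.dbengine import DBEngine
from lib.query import Query

# Placeholder paths; use the matching *.jsonl and *.db files from data/.
engine = DBEngine('data/dev.db')
with open('data/dev.jsonl') as f:
    example = json.loads(next(f))

query = Query.from_dict(example['sql'])
print(engine.execute_query(example['table_id'], query, lower=True))
```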
