Commit 1eb77cf

rebuild
1 parent 43e1bbb commit 1eb77cf

15 files changed: +831 -1104 lines changed


Examples/LogisticExample/ScoringExample.ipynb

Lines changed: 63 additions & 485 deletions
Large diffs are not rendered by default.

Examples/Simplification/query_simplification.ipynb

Lines changed: 34 additions & 83 deletions
@@ -2,21 +2,18 @@
  "cells": [
  {
  "cell_type": "markdown",
- "metadata": {
- "collapsed": true,
- "pycharm": {
- "name": "#%% md\n"
- }
- },
  "source": [
  "[`data_algebra`](https://github.com/WinVector/data_algebra) version of this [`rquery` example](http://www.win-vector.com/blog/2019/12/what-is-new-for-rquery-december-2019/).\n",
  "\n",
  "First lets import our modules and set up our operator pipeline."
- ]
+ ],
+ "metadata": {
+ "collapsed": false
+ }
  },
  {
  "cell_type": "code",
- "execution_count": 15,
+ "execution_count": 1,
  "outputs": [
  {
  "name": "stdout",
@@ -28,7 +25,7 @@
  " extend({\n",
  " 'sum23': 'col2 + col3',\n",
  " 'x': '5'}) .\\\n",
- " select_columns(['x', 'sum23'])\n"
+ " select_columns(['x', 'sum23', 'col3'])\n"
  ],
  "output_type": "stream"
  }
@@ -63,7 +60,7 @@
  " extend({\n",
  " 'x': 5\n",
  " }). \\\n",
- " select_columns(['x', 'sum23'])\n",
+ " select_columns(['x', 'sum23', 'col3'])\n",
  "\n",
  "\n",
  "print(ops)\n"
@@ -85,24 +82,21 @@
  "These operations can be applied to data."
  ],
  "metadata": {
- "collapsed": false,
- "pycharm": {
- "name": "#%% md\n"
- }
+ "collapsed": false
  }
  },
  {
  "cell_type": "code",
- "execution_count": 16,
+ "execution_count": 2,
  "outputs": [
  {
  "data": {
- "text/plain": " x sum23\n0 5 7\n1 5 9",
- "text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>x</th>\n <th>sum23</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>5</td>\n <td>7</td>\n </tr>\n <tr>\n <th>1</th>\n <td>5</td>\n <td>9</td>\n </tr>\n </tbody>\n</table>\n</div>"
+ "text/plain": " x sum23 col3\n0 5 7 4\n1 5 9 5",
+ "text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>x</th>\n <th>sum23</th>\n <th>col3</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>5</td>\n <td>7</td>\n <td>4</td>\n </tr>\n <tr>\n <th>1</th>\n <td>5</td>\n <td>9</td>\n <td>5</td>\n </tr>\n </tbody>\n</table>\n</div>"
  },
  "metadata": {},
  "output_type": "execute_result",
- "execution_count": 16
+ "execution_count": 2
  }
  ],
  "source": [
@@ -125,7 +119,7 @@
  {
  "cell_type": "markdown",
  "source": [
- "We are working on adaptors for near-`Pandas` systems such as `modin` and others.\n",
+ "We are working on adapters for near-`Pandas` systems such as `modin` and others.\n",
  "\n",
  "We can also convert the query into `SQL` query."
  ],
@@ -135,64 +129,15 @@
  },
  {
  "cell_type": "code",
- "execution_count": 17,
+ "execution_count": 3,
  "outputs": [
  {
  "name": "stdout",
  "text": [
- "SELECT \"sum23\",\n",
- " \"x\"\n",
- "FROM\n",
- " (SELECT \"col2\" + \"col3\" AS \"sum23\",\n",
- " 5 AS \"x\"\n",
- " FROM\n",
- " (SELECT \"col2\",\n",
- " \"col3\"\n",
- " FROM \"d\") \"sq_0\") \"sq_1\"\n"
- ],
- "output_type": "stream"
- }
- ],
- "source": [
- "pg_model = data_algebra.PostgreSQL.PostgreSQLModel()\n",
- "\n",
- "print(ops.to_sql(db_model=pg_model, pretty=True))"
- ],
- "metadata": {
- "collapsed": false,
- "pycharm": {
- "name": "#%%\n",
- "is_executing": false
- }
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "The excess inner query is working around the issue that the `PostgresSQL` `SQL` dialect does not accept table names in parenthesis in some situations.\n",
- "\n",
- "When we do not have such a constraint (such as with `SQLite`) we can generate a shorter query. \n"
- ],
- "metadata": {
- "collapsed": false,
- "pycharm": {
- "name": "#%% md\n"
- }
- }
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "outputs": [
- {
- "name": "stdout",
- "text": [
- "SELECT \"sum23\",\n",
- " \"x\"\n",
- "FROM\n",
- " (SELECT \"col2\" + \"col3\" AS \"sum23\",\n",
- " 5 AS \"x\"\n",
- " FROM (\"d\") \"SQ_0\") \"SQ_1\"\n"
+ "SELECT 5 AS \"x\",\n",
+ " \"col2\" + \"col3\" AS \"sum23\",\n",
+ " \"col3\"\n",
+ "FROM \"d\"\n"
  ],
  "output_type": "stream"
  }
@@ -213,33 +158,39 @@
  {
  "cell_type": "markdown",
  "source": [
- "One per-`SQL` dialect translations and affordances is one of the intents of the `data_algebra`.\n",
+ "Notice this query is fairly compact. `data_algebra` optimizations do not combine steps with different concerns, but they do have some nice features:\n",
+ "\n",
+ " * Queries are shortened: some steps that are not used are not preserved.\n",
+ " * Queries are narrowed: values not used in the result are not brought through intermediate queries.\n",
+ " * Non-terminal row-orders are thrown away (as they are not semantic in many data-stores).\n",
+ " * `select_column()` steps are implicit, change other steps but not translated as explicit queries.\n",
+ " * Tables are used by name when deeper in queries.\n",
+ " \n",
+ "This make for tighter query generation than the current version of [`rquery`](https://github.com/WinVector/rquery/) (which [itself one of the best query generators in `R`](http://www.win-vector.com/blog/2019/12/what-is-new-for-rquery-december-2019/)).\n",
  "\n",
  "And we can easily demonstrate the query in action."
  ],
  "metadata": {
- "collapsed": false,
- "pycharm": {
- "name": "#%% md\n"
- }
+ "collapsed": false
  }
  },
  {
  "cell_type": "code",
- "execution_count": 19,
+ "execution_count": 4,
  "outputs": [
  {
  "data": {
- "text/plain": " sum23 x\n0 7 5\n1 9 5",
- "text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>sum23</th>\n <th>x</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>7</td>\n <td>5</td>\n </tr>\n <tr>\n <th>1</th>\n <td>9</td>\n <td>5</td>\n </tr>\n </tbody>\n</table>\n</div>"
+ "text/plain": " x sum23 col3\n0 5 7 4\n1 5 9 5",
+ "text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>x</th>\n <th>sum23</th>\n <th>col3</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>5</td>\n <td>7</td>\n <td>4</td>\n </tr>\n <tr>\n <th>1</th>\n <td>5</td>\n <td>9</td>\n <td>5</td>\n </tr>\n </tbody>\n</table>\n</div>"
  },
  "metadata": {},
  "output_type": "execute_result",
- "execution_count": 19
+ "execution_count": 4
  }
  ],
  "source": [
  "conn = sqlite3.connect(':memory:')\n",
+ "sql_model.prepare_connection(conn)\n",
  "sql_model.insert_table(conn, d, table_name='d')\n",
  "\n",
  "conn.execute('CREATE TABLE res AS ' + ops.to_sql(db_model=sql_model))\n",
@@ -255,7 +206,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 20,
+ "execution_count": 5,
  "outputs": [],
  "source": [
  "conn.close()\n"

build/lib/data_algebra/PostgreSQL.py

Lines changed: 0 additions & 5 deletions
@@ -31,8 +31,3 @@ def build_qualified_table_name(self, table_name, *, qualifiers=None):
         if "schema" in qualifiers.keys():
             qt = self.quote_identifier(qualifiers["schema"]) + "." + qt
         return qt
-
-    def table_def_to_sql(self, table_def, *, using=None, force_sql=False):
-        return super().table_def_to_sql(
-            table_def=table_def, using=using, force_sql=True
-        )
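The context lines retained in this hunk show the schema-qualification pattern the dialect models share: quote the table name, and prefix a quoted schema when one is supplied. A minimal standalone sketch of that logic, with `quote_identifier` written here as a stand-in for the db-model method (assuming double-quote doubling for escapes):

```python
def quote_identifier(name):
    # Stand-in for the db-model's quote_identifier: double-quote the name,
    # escaping embedded double quotes by doubling them.
    return '"' + name.replace('"', '""') + '"'


def build_qualified_table_name(table_name, qualifiers=None):
    # Quote the bare table name, then prefix a quoted schema if present,
    # mirroring the retained context lines above.
    qt = quote_identifier(table_name)
    if qualifiers is not None and "schema" in qualifiers:
        qt = quote_identifier(qualifiers["schema"]) + "." + qt
    return qt


print(build_qualified_table_name("d"))                      # "d"
print(build_qualified_table_name("d", {"schema": "public"}))  # "public"."d"
```

The deleted `table_def_to_sql` override (which forced `force_sql=True`) becomes unnecessary once the base class handles table rendering, as the `data_ops.py` changes below in this commit arrange.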

build/lib/data_algebra/SparkSQL.py

Lines changed: 0 additions & 5 deletions
@@ -31,8 +31,3 @@ def build_qualified_table_name(self, table_name, *, qualifiers=None):
         if "schema" in qualifiers.keys():
             qt = self.quote_identifier(qualifiers["schema"]) + "." + qt
         return qt
-
-    def table_def_to_sql(self, table_def, *, using=None, force_sql=False):
-        return super().table_def_to_sql(
-            table_def=table_def, using=using, force_sql=True
-        )
)

build/lib/data_algebra/arrow.py

Lines changed: 5 additions & 2 deletions
@@ -66,7 +66,10 @@ def __init__(self, pipeline, *, free_table_key=None, strict=False):
         self.outgoing_columns = pipeline.column_names.copy()
         self.outgoing_columns.sort()
         self.outgoing_types = None
-        if isinstance(pipeline, data_algebra.data_ops.TableDescription) and self.incoming_types is not None:
+        if (
+            isinstance(pipeline, data_algebra.data_ops.TableDescription)
+            and self.incoming_types is not None
+        ):
             self.outgoing_types = self.incoming_types.copy()
         Arrow.__init__(self)

@@ -104,7 +107,7 @@ def apply_to(self, b):
         new_pipeline = self.pipeline.apply_to(
             b.pipeline, target_table_key=self.free_table_key
         )
-        new_pipeline.get_tables() # check tables are compatible
+        new_pipeline.get_tables()  # check tables are compatible
         res = DataOpArrow(pipeline=new_pipeline, free_table_key=b.free_table_key)
         res.incoming_types = b.incoming_types
         res.outgoing_types = self.outgoing_types

build/lib/data_algebra/data_ops.py

Lines changed: 29 additions & 28 deletions
@@ -13,6 +13,7 @@
 import data_algebra.env
 from data_algebra.data_ops_types import *
 import data_algebra.data_ops_utils
+import data_algebra.near_sql

 _have_black = False
 try:
@@ -234,6 +235,9 @@ def to_sql(self, db_model, *, pretty=False, encoding=None, sqlparse_options=None
         sql_str = self.to_sql_implementation(
             db_model=db_model, using=None, temp_id_source=temp_id_source
         )
+        if isinstance(sql_str, str):
+            print("break")
+        sql_str = sql_str.to_sql(db_model=db_model, force_sql=True)
         if pretty and _have_sqlparse:
             try:
                 sql_str = sqlparse.format(
@@ -664,22 +668,10 @@ def eval_implementation(self, *, data_map, eval_env, data_model):
     def columns_used_from_sources(self, using=None):
         return []  # no inputs to table description

-    def to_sql(self, db_model, *, pretty=False, encoding=None, sqlparse_options=None):
-        if sqlparse_options is None:
-            sqlparse_options = {"reindent": True, "keyword_case": "upper"}
-        self.columns_used()  # for table consistency check/raise
-        temp_id_source = [0]
-        sql_str = self.to_sql_implementation(
-            db_model=db_model, using=None, temp_id_source=temp_id_source, force_sql=True
+    def to_sql_implementation(self, db_model, *, using, temp_id_source):
+        return db_model.table_def_to_sql(
+            self, using=using, temp_id_source=temp_id_source
         )
-        if pretty and _have_sqlparse:
-            sql_str = sqlparse.format(sql_str, encoding=encoding, **sqlparse_options)
-        return sql_str
-
-    def to_sql_implementation(
-        self, db_model, *, using, temp_id_source, force_sql=False
-    ):
-        return db_model.table_def_to_sql(self, using=using, force_sql=force_sql)

     # comparable to other table descriptions
     def __lt__(self, other):
@@ -768,7 +760,7 @@ def apply_to(self, a, *, target_table_key=None):
         data_map, a = self._reach_in(a)
         return WrappedOperatorPlatform(
             underlying=self.underlying.apply_to(a, target_table_key=target_table_key),
-            data_map=data_map
+            data_map=data_map,
         )

     # noinspection PyPep8Naming
@@ -1200,7 +1192,9 @@ def apply_to(self, a, *, target_table_key=None):
         new_sources = [
             s.apply_to(a, target_table_key=target_table_key) for s in self.sources
         ]
-        return new_sources[0].project_parsed(parsed_ops=self.ops, group_by=self.group_by)
+        return new_sources[0].project_parsed(
+            parsed_ops=self.ops, group_by=self.group_by
+        )

     def _equiv_nodes(self, other):
         if not self.group_by == other.group_by:
@@ -1517,9 +1511,7 @@ def apply_to(self, a, *, target_table_key=None):
             s.apply_to(a, target_table_key=target_table_key) for s in self.sources
         ]
         return new_sources[0].order_rows(
-            columns=self.order_columns,
-            reverse=self.reverse,
-            limit=self.limit,
+            columns=self.order_columns, reverse=self.reverse, limit=self.limit
        )

     def _equiv_nodes(self, other):
@@ -1948,8 +1940,7 @@ def apply_to(self, a, *, target_table_key=None):
             s.apply_to(a, target_table_key=target_table_key) for s in self.sources
         ]
         return new_sources[0].convert_records(
-            record_map=self.record_map,
-            blocks_out_table=self.blocks_out_table,
+            record_map=self.record_map, blocks_out_table=self.blocks_out_table
         )

     def _equiv_nodes(self, other):
@@ -2007,18 +1998,28 @@ def to_python_implementation(self, *, indent=0, strict=True, print_sources=True)
         return s

     def to_sql_implementation(self, db_model, *, using, temp_id_source):
-        res = self.sources[0].to_sql_implementation(
+        sub_query = self.sources[0].to_sql_implementation(
             db_model=db_model, using=using, temp_id_source=temp_id_source
         )
+        query = sub_query.to_sql(columns=using, db_model=db_model)
         if self.record_map.blocks_in is not None:
-            res = db_model.blocks_to_row_recs_query(
-                res, record_spec=self.record_map.blocks_in
+            query = db_model.blocks_to_row_recs_query(
+                query, record_spec=self.record_map.blocks_in
            )
         if self.record_map.blocks_out is not None:
-            res = db_model.row_recs_to_blocks_query(
-                res, record_spec=self.record_map.blocks_out, record_view=self.sources[1]
+            query = db_model.row_recs_to_blocks_query(
+                query,
+                record_spec=self.record_map.blocks_out,
+                record_view=self.sources[1],
             )
-        return res
+        if temp_id_source is None:
+            temp_id_source = [0]
+        view_name = "convert_records_" + str(temp_id_source[0])
+        temp_id_source[0] = temp_id_source[0] + 1
+        near_sql = data_algebra.near_sql.NearSQLq(
+            quoted_query_name=db_model.quote_identifier(view_name), query=query
+        )
+        return near_sql

     def eval_implementation(self, *, data_map, eval_env, data_model):
         if data_model is None:
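The change above moves SQL generation from returning raw SQL strings to returning named "near SQL" objects that render on demand (`to_sql_implementation` now hands back a `NearSQLq` carrying a generated view name like `convert_records_0`). Here is a minimal sketch of that pattern under assumed interfaces; only part of `NearSQLq`'s real signature is visible in this diff, so the class and method names below are illustrative, not the `data_algebra` API:

```python
class NearSQLSketch:
    """Sketch: hold a finished sub-query plus a quoted name for composition.

    Illustrative stand-in for the NearSQLq object introduced in this commit;
    not the real data_algebra.near_sql interface.
    """

    def __init__(self, quoted_query_name, query):
        self.quoted_query_name = quoted_query_name
        self.query = query

    def to_sql(self):
        # Terminal rendering: emit the stored query text. The real method
        # also takes a db_model and a force_sql flag, per the diff.
        return self.query

    def as_named_subquery(self):
        # Non-terminal rendering: wrap the query so an enclosing SELECT can
        # refer to this step by its generated view name.
        return "( " + self.query + " ) " + self.quoted_query_name


step = NearSQLSketch('"convert_records_0"', 'SELECT "col2", "col3" FROM "d"')
print(step.to_sql())
print("SELECT * FROM " + step.as_named_subquery())
```

Deferring the string-building this way lets the outer `to_sql` decide whether a step becomes a nested sub-query or is referenced by name, which is what enables the flattened queries shown in the notebook diff above.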

0 commit comments