Commit 9ea0ce1

committed
edit
1 parent b1ddd9a commit 9ea0ce1

1 file changed

+80
-23
lines changed

Examples/xicor/xicor.ipynb

Lines changed: 80 additions & 23 deletions
@@ -9,21 +9,25 @@
99
}
1010
},
1111
"source": [
12-
"Professor Sourav Chatterjee's xicor coefficient of correlation (<a href=\"https://win-vector.com/2021/12/29/exploring-the-xi-correlation-coefficient/\">Nina Zumel's tutorial</a>, <a href=\"https://doi.org/10.1080/01621459.2020.1758115\">JASA</a>; original sources: <a href=\"https://CRAN.R-project.org/package=XICOR\">R package</a>, <a href=\"https://arxiv.org/abs/1909.10140\">Arxiv</a>, <a href=\"https://news.ycombinator.com/item?id=29687613\">Hacker News</a>, and <a href=\"https://github.com/czbiohub/xicor\">a Python package</a> (different author))."
12+
"For a lark, I decided to try and translate Professor Sourav Chatterjee's xicor coefficient of correlation into a [data algebra](https://github.com/WinVector/data_algebra) query that could be run in the database. (xicor refs: <a href=\"https://win-vector.com/2021/12/29/exploring-the-xi-correlation-coefficient/\">Nina Zumel's tutorial</a>, <a href=\"https://doi.org/10.1080/01621459.2020.1758115\">JASA</a>; original sources: <a href=\"https://CRAN.R-project.org/package=XICOR\">R package</a>, <a href=\"https://arxiv.org/abs/1909.10140\">arXiv</a>, <a href=\"https://news.ycombinator.com/item?id=29687613\">Hacker News</a>, and <a href=\"https://github.com/czbiohub/xicor\">a Python package</a> (different author)). The more serious reason is that more complex tests help drive the development of the package.\n",
13+
"\n",
14+
"I found I could translate the R reference implementation of xicor line by line into data algebra steps.\n",
15+
"\n",
16+
"I could re-run many examples in R and in the data algebra to confirm the implementation."
1317
]
1418
},
1519
{
1620
"cell_type": "code",
1721
"execution_count": 3,
1822
"outputs": [],
1923
"source": [
20-
"from typing import Iterable\n",
2124
"import numpy as np\n",
2225
"import pandas as pd\n",
23-
"from data_algebra.data_ops import descr, TableDescription, ViewRepresentation\n",
26+
"from data_algebra.data_ops import descr, TableDescription\n",
2427
"import data_algebra.BigQuery\n",
2528
"import data_algebra.solutions\n",
26-
"import yaml\n"
29+
"import yaml\n",
30+
"\n"
2731
],
2832
"metadata": {
2933
"collapsed": false,
@@ -32,18 +36,6 @@
3236
}
3337
}
3438
},
35-
{
36-
"cell_type": "code",
37-
"execution_count": 3,
38-
"outputs": [],
39-
"source": [],
40-
"metadata": {
41-
"collapsed": false,
42-
"pycharm": {
43-
"name": "#%%\n"
44-
}
45-
}
46-
},
4739
{
4840
"cell_type": "code",
4941
"execution_count": 4,
@@ -145,6 +137,22 @@
145137
}
146138
}
147139
},
140+
{
141+
"cell_type": "markdown",
142+
"source": [
143+
"The results matched expectation, but the code was very slow.\n",
144+
"\n",
145+
"Then I added \"group by\" clauses to the data algebra realization of the xicor calculation. This sped up the Pandas implementation immensely, as translation overhead was now amortized over a large efficient calculation.\n",
146+
"\n",
147+
"I could now confirm very many xicor calculations at once, by putting them all in a shared table (identifiable by row labels)."
148+
],
149+
"metadata": {
150+
"collapsed": false,
151+
"pycharm": {
152+
"name": "#%% md\n"
153+
}
154+
}
155+
},
148156
{
149157
"cell_type": "code",
150158
"execution_count": 8,
@@ -194,6 +202,18 @@
194202
}
195203
}
196204
},
205+
{
206+
"cell_type": "markdown",
207+
"source": [
208+
"I can even repeat the calculation and compute aggregates just by joining and projecting."
209+
],
210+
"metadata": {
211+
"collapsed": false,
212+
"pycharm": {
213+
"name": "#%% md\n"
214+
}
215+
}
216+
},
197217
{
198218
"cell_type": "code",
199219
"execution_count": 10,
@@ -243,7 +263,7 @@
243263
")\n",
244264
"xicor_results = grouped_calc.eval({'d': example_frames, 'rep_frame': rep_frame})\n",
245265
"\n",
246-
"xicor_results\n"
266+
"xicor_results"
247267
],
248268
"metadata": {
249269
"collapsed": false,
@@ -252,6 +272,18 @@
252272
}
253273
}
254274
},
275+
{
276+
"cell_type": "markdown",
277+
"source": [
278+
"And these accelerated grouped calculations still match the reference R implementation."
279+
],
280+
"metadata": {
281+
"collapsed": false,
282+
"pycharm": {
283+
"name": "#%% md\n"
284+
}
285+
}
286+
},
255287
{
256288
"cell_type": "code",
257289
"execution_count": 12,
@@ -337,6 +369,18 @@
337369
}
338370
}
339371
},
372+
{
373+
"cell_type": "markdown",
374+
"source": [
375+
"And, as always, the fact that this is a pure data algebra calculation means we can run it in a database (meaning we can apply it to big data)."
376+
],
377+
"metadata": {
378+
"collapsed": false,
379+
"pycharm": {
380+
"name": "#%% md\n"
381+
}
382+
}
383+
},
340384
{
341385
"cell_type": "code",
342386
"execution_count": 13,
@@ -353,8 +397,10 @@
353397
"source": [
354398
"# try it in database\n",
355399
"db_handle = data_algebra.BigQuery.example_handle()\n",
400+
"# insert the data; in real applications the data is already in the database\n",
356401
"db_handle.insert_table(example_frames, table_name='d', allow_overwrite=True)\n",
357-
"db_handle.insert_table(rep_frame, table_name='rep_frame', allow_overwrite=True)"
402+
"db_handle.insert_table(rep_frame, table_name='rep_frame', allow_overwrite=True)\n",
403+
"db_handle.drop_table(\"xicor\")"
358404
],
359405
"metadata": {
360406
"collapsed": false,
@@ -365,10 +411,10 @@
365411
},
366412
{
367413
"cell_type": "code",
368-
"execution_count": 14,
414+
"execution_count": 15,
369415
"outputs": [],
370416
"source": [
371-
"db_handle.drop_table(\"xicor\")"
417+
"db_handle.execute(f\"CREATE TABLE {db_handle.db_model.table_prefix}.xicor AS {db_handle.to_sql(grouped_calc)}\")"
372418
],
373419
"metadata": {
374420
"collapsed": false,
@@ -379,10 +425,9 @@
379425
},
380426
{
381427
"cell_type": "code",
382-
"execution_count": 15,
428+
"execution_count": null,
383429
"outputs": [],
384430
"source": [
385-
"db_handle.execute(f\"CREATE TABLE {db_handle.db_model.table_prefix}.xicor AS {db_handle.to_sql(grouped_calc)}\")\n",
386431
"db_res = db_handle.read_query(f\"SELECT * FROM {db_handle.db_model.table_prefix}.xicor ORDER BY vname\")"
387432
],
388433
"metadata": {
@@ -481,7 +526,7 @@
481526
"db_handle.drop_table(\"rep_frame\")\n",
482527
"db_handle.drop_table(\"xicor\")\n",
483528
"db_handle.close()\n",
484-
"# show we made it to here, adn did not assert earlier\n",
529+
"# show we made it to here, and did not assert earlier\n",
485530
"print('done')"
486531
],
487532
"metadata": {
@@ -490,6 +535,18 @@
490535
"name": "#%%\n"
491536
}
492537
}
538+
},
539+
{
540+
"cell_type": "markdown",
541+
"source": [
542+
"And this is an example of a non-trivial statistical calculation being ported to the database."
543+
],
544+
"metadata": {
545+
"collapsed": false,
546+
"pycharm": {
547+
"name": "#%% md\n"
548+
}
549+
}
493550
}
494551
],
495552
"metadata": {

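For readers who want to see what is being ported, here is a minimal NumPy/pandas sketch of the xicor calculation itself. It is not the notebook's data algebra pipeline and not the R package's code. The function name `xicor_sketch` is hypothetical, and ties in `x` are broken by a stable sort, which is an assumption made here for determinism (the reference implementation, as I understand it, breaks such ties at random).

```python
import numpy as np
import pandas as pd


def xicor_sketch(x, y):
    """Illustrative xi correlation of y as a function of x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Order rows by x; stable sort is an assumed deterministic tie-break.
    order = np.argsort(x, kind="stable")
    y_series = pd.Series(y)
    # fr[i]: fraction of rows with y <= y[i]; gr[i]: fraction with y >= y[i].
    fr = y_series.rank(method="max").to_numpy() / n
    gr = (-y_series).rank(method="max").to_numpy() / n
    # Sum of absolute changes in fr between x-adjacent rows.
    a1 = np.abs(np.diff(fr[order])).sum() / (2 * n)
    # Normalizing term; it is zero when y is constant, so xi is undefined there.
    cu = np.mean(gr * (1 - gr))
    return 1 - a1 / cu


xi_monotone = xicor_sketch(range(10), range(10))
```

For a strictly increasing relationship with n = 10 distinct points, this evaluates to 1 - 3/(n + 1) = 8/11. The grouped version in the notebook performs the same rank, sort, and sum steps once per group label.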