|
9 | 9 | } |
10 | 10 | }, |
11 | 11 | "source": [ |
12 | | - "Professor Sourav Chatterjee's xicor coefficient of correlation (<a href=\"https://win-vector.com/2021/12/29/exploring-the-xi-correlation-coefficient/\">Nina Zumel's tutorial</a>, <a href=\"https://doi.org/10.1080/01621459.2020.1758115\">JASA</a>; original sources: <a href=\"https://CRAN.R-project.org/package=XICOR\">R package</a>, <a href=\"https://arxiv.org/abs/1909.10140\">Arxiv</a>, <a href=\"https://news.ycombinator.com/item?id=29687613\">Hacker News</a>, and <a href=\"https://github.com/czbiohub/xicor\">a Python package</a> (different author))." |
| 12 | + "For a lark, I decided to try to translate Professor Sourav Chatterjee's xicor coefficient of correlation into a [data algebra](https://github.com/WinVector/data_algebra) query that can be run in a database. (xicor refs: <a href=\"https://win-vector.com/2021/12/29/exploring-the-xi-correlation-coefficient/\">Nina Zumel's tutorial</a>, <a href=\"https://doi.org/10.1080/01621459.2020.1758115\">JASA</a>; original sources: <a href=\"https://CRAN.R-project.org/package=XICOR\">R package</a>, <a href=\"https://arxiv.org/abs/1909.10140\">arXiv</a>, <a href=\"https://news.ycombinator.com/item?id=29687613\">Hacker News</a>, and <a href=\"https://github.com/czbiohub/xicor\">a Python package</a> (different author).) The more serious reason is that more complex tests help drive the development of the package.\n", |
| 13 | + "\n", |
| 14 | + "I found I could translate the R reference implementation of xicor line by line into data algebra steps.\n", |
| 15 | + "\n", |
| 16 | + "I could then re-run many examples in both R and the data algebra to confirm that the two implementations agree." |
13 | 17 | ] |
14 | 18 | }, |
15 | 19 | { |
16 | 20 | "cell_type": "code", |
17 | 21 | "execution_count": 3, |
18 | 22 | "outputs": [], |
19 | 23 | "source": [ |
20 | | - "from typing import Iterable\n", |
21 | 24 | "import numpy as np\n", |
22 | 25 | "import pandas as pd\n", |
23 | | - "from data_algebra.data_ops import descr, TableDescription, ViewRepresentation\n", |
| 26 | + "from data_algebra.data_ops import descr, TableDescription\n", |
24 | 27 | "import data_algebra.BigQuery\n", |
25 | 28 | "import data_algebra.solutions\n", |
26 | | - "import yaml\n" |
| 29 | + "import yaml\n", |
| 30 | + "\n" |
27 | 31 | ], |
28 | 32 | "metadata": { |
29 | 33 | "collapsed": false, |
|
32 | 36 | } |
33 | 37 | } |
34 | 38 | }, |
35 | | - { |
36 | | - "cell_type": "code", |
37 | | - "execution_count": 3, |
38 | | - "outputs": [], |
39 | | - "source": [], |
40 | | - "metadata": { |
41 | | - "collapsed": false, |
42 | | - "pycharm": { |
43 | | - "name": "#%%\n" |
44 | | - } |
45 | | - } |
46 | | - }, |
47 | 39 | { |
48 | 40 | "cell_type": "code", |
49 | 41 | "execution_count": 4, |
|
145 | 137 | } |
146 | 138 | } |
147 | 139 | }, |
| 140 | + { |
| 141 | + "cell_type": "markdown", |
| 142 | + "source": [ |
| 143 | + "The results matched expectations, but the code was very slow.\n", |
| 144 | + "\n", |
| 145 | + "Then I added \"group by\" clauses to the data algebra realization of the xicor calculation. This sped up the Pandas implementation immensely, as the translation overhead was now amortized over one large, efficient calculation.\n", |
| 146 | + "\n", |
| 147 | + "I could now confirm many xicor calculations at once by putting them all in a shared table, with each calculation identified by its row labels." |
| 148 | + ], |
| 149 | + "metadata": { |
| 150 | + "collapsed": false, |
| 151 | + "pycharm": { |
| 152 | + "name": "#%% md\n" |
| 153 | + } |
| 154 | + } |
| 155 | + }, |
148 | 156 | { |
149 | 157 | "cell_type": "code", |
150 | 158 | "execution_count": 8, |
|
194 | 202 | } |
195 | 203 | } |
196 | 204 | }, |
| 205 | + { |
| 206 | + "cell_type": "markdown", |
| 207 | + "source": [ |
| 208 | + "I can even repeat the calculation and compute aggregates just by joining and projecting." |
| 209 | + ], |
| 210 | + "metadata": { |
| 211 | + "collapsed": false, |
| 212 | + "pycharm": { |
| 213 | + "name": "#%% md\n" |
| 214 | + } |
| 215 | + } |
| 216 | + }, |
197 | 217 | { |
198 | 218 | "cell_type": "code", |
199 | 219 | "execution_count": 10, |
|
243 | 263 | ")\n", |
244 | 264 | "xicor_results = grouped_calc.eval({'d': example_frames, 'rep_frame': rep_frame})\n", |
245 | 265 | "\n", |
246 | | - "xicor_results\n" |
| 266 | + "xicor_results" |
247 | 267 | ], |
248 | 268 | "metadata": { |
249 | 269 | "collapsed": false, |
|
252 | 272 | } |
253 | 273 | } |
254 | 274 | }, |
| 275 | + { |
| 276 | + "cell_type": "markdown", |
| 277 | + "source": [ |
| 278 | + "And these accelerated grouped calculations still match the reference R implementation." |
| 279 | + ], |
| 280 | + "metadata": { |
| 281 | + "collapsed": false, |
| 282 | + "pycharm": { |
| 283 | + "name": "#%% md\n" |
| 284 | + } |
| 285 | + } |
| 286 | + }, |
255 | 287 | { |
256 | 288 | "cell_type": "code", |
257 | 289 | "execution_count": 12, |
|
337 | 369 | } |
338 | 370 | } |
339 | 371 | }, |
| 372 | + { |
| 373 | + "cell_type": "markdown", |
| 374 | + "source": [ |
| 375 | + "And, as always, because this is a pure data algebra calculation, we can run it in a database (meaning we can apply it to big data)." |
| 376 | + ], |
| 377 | + "metadata": { |
| 378 | + "collapsed": false, |
| 379 | + "pycharm": { |
| 380 | + "name": "#%% md\n" |
| 381 | + } |
| 382 | + } |
| 383 | + }, |
340 | 384 | { |
341 | 385 | "cell_type": "code", |
342 | 386 | "execution_count": 13, |
|
353 | 397 | "source": [ |
354 | 398 | "# try it in database\n", |
355 | 399 | "db_handle = data_algebra.BigQuery.example_handle()\n", |
| 400 | + "# insert the example data; in real applications the data is already in the database\n", |
356 | 401 | "db_handle.insert_table(example_frames, table_name='d', allow_overwrite=True)\n", |
357 | | - "db_handle.insert_table(rep_frame, table_name='rep_frame', allow_overwrite=True)" |
| 402 | + "db_handle.insert_table(rep_frame, table_name='rep_frame', allow_overwrite=True)\n", |
| 403 | + "db_handle.drop_table(\"xicor\")" |
358 | 404 | ], |
359 | 405 | "metadata": { |
360 | 406 | "collapsed": false, |
|
365 | 411 | }, |
366 | 412 | { |
367 | 413 | "cell_type": "code", |
368 | | - "execution_count": 14, |
| 414 | + "execution_count": 15, |
369 | 415 | "outputs": [], |
370 | 416 | "source": [ |
371 | | - "db_handle.drop_table(\"xicor\")" |
| 417 | + "db_handle.execute(f\"CREATE TABLE {db_handle.db_model.table_prefix}.xicor AS {db_handle.to_sql(grouped_calc)}\")" |
372 | 418 | ], |
373 | 419 | "metadata": { |
374 | 420 | "collapsed": false, |
|
379 | 425 | }, |
380 | 426 | { |
381 | 427 | "cell_type": "code", |
382 | | - "execution_count": 15, |
| 428 | + "execution_count": null, |
383 | 429 | "outputs": [], |
384 | 430 | "source": [ |
385 | | - "db_handle.execute(f\"CREATE TABLE {db_handle.db_model.table_prefix}.xicor AS {db_handle.to_sql(grouped_calc)}\")\n", |
386 | 431 | "db_res = db_handle.read_query(f\"SELECT * FROM {db_handle.db_model.table_prefix}.xicor ORDER BY vname\")" |
387 | 432 | ], |
388 | 433 | "metadata": { |
|
481 | 526 | "db_handle.drop_table(\"rep_frame\")\n", |
482 | 527 | "db_handle.drop_table(\"xicor\")\n", |
483 | 528 | "db_handle.close()\n", |
484 | | - "# show we made it to here, adn did not assert earlier\n", |
| 529 | + "# show we made it to here, and did not assert earlier\n", |
485 | 530 | "print('done')" |
486 | 531 | ], |
487 | 532 | "metadata": { |
|
490 | 535 | "name": "#%%\n" |
491 | 536 | } |
492 | 537 | } |
| 538 | + }, |
| 539 | + { |
| 540 | + "cell_type": "markdown", |
| 541 | + "source": [ |
| 542 | + "And this is an example of a non-trivial statistical calculation being ported to the database." |
| 543 | + ], |
| 544 | + "metadata": { |
| 545 | + "collapsed": false, |
| 546 | + "pycharm": { |
| 547 | + "name": "#%% md\n" |
| 548 | + } |
| 549 | + } |
493 | 550 | } |
494 | 551 | ], |
495 | 552 | "metadata": { |
|