JohnMount · JohnMount · commit 9ea0ce1a5bab · 2022-01-09T13:33:56.000-08:00
diff --git a/Examples/xicor/xicor.ipynb b/Examples/xicor/xicor.ipynb
@@ -9,21 +9,25 @@
     }
    },
    "source": [
-    "Professor Sourav Chatterjee's xicor coefficient of correlation (<a href=\"https://win-vector.com/2021/12/29/exploring-the-xi-correlation-coefficient/\">Nina Zumel's tutorial</a>, <a href=\"https://doi.org/10.1080/01621459.2020.1758115\">JASA</a>; original sources: <a href=\"https://CRAN.R-project.org/package=XICOR\">R package</a>, <a href=\"https://arxiv.org/abs/1909.10140\">Arxiv</a>, <a href=\"https://news.ycombinator.com/item?id=29687613\">Hacker News</a>, and <a href=\"https://github.com/czbiohub/xicor\">a Python package</a> (different author))."
+    "For a lark, I decided to try to translate Professor Sourav Chatterjee's xicor coefficient of correlation into a [data algebra](https://github.com/WinVector/data_algebra) query that could be run in the database. (xicor refs: <a href=\"https://win-vector.com/2021/12/29/exploring-the-xi-correlation-coefficient/\">Nina Zumel's tutorial</a>, <a href=\"https://doi.org/10.1080/01621459.2020.1758115\">JASA</a>; original sources: <a href=\"https://CRAN.R-project.org/package=XICOR\">R package</a>, <a href=\"https://arxiv.org/abs/1909.10140\">Arxiv</a>, <a href=\"https://news.ycombinator.com/item?id=29687613\">Hacker News</a>, and <a href=\"https://github.com/czbiohub/xicor\">a Python package</a> (different author)). Actually, the serious reason is that more complex tests help drive the development of the package.\n",
+    "\n",
+    "I found I could translate the R reference implementation of xicor line by line into data algebra steps.\n",
+    "\n",
+    "I could re-run many examples in R and in the data algebra to confirm the implementation."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 3,
    "outputs": [],
    "source": [
-    "from typing import Iterable\n",
     "import numpy as np\n",
     "import pandas as pd\n",
-    "from data_algebra.data_ops import descr, TableDescription, ViewRepresentation\n",
+    "from data_algebra.data_ops import descr, TableDescription\n",
     "import data_algebra.BigQuery\n",
     "import data_algebra.solutions\n",
-    "import yaml\n"
+    "import yaml\n",
+    "\n"
    ],
    "metadata": {
     "collapsed": false,
@@ -32,18 +36,6 @@
     }
    }
   },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "outputs": [],
-   "source": [],
-   "metadata": {
-    "collapsed": false,
-    "pycharm": {
-     "name": "#%%\n"
-    }
-   }
-  },
   {
    "cell_type": "code",
    "execution_count": 4,
@@ -145,6 +137,22 @@
     }
    }
   },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "The results matched expectation, but the code was very slow.\n",
+    "\n",
+    "Then I added \"group by\" clauses to the data algebra realization of the xicor calculation. This sped up the Pandas implementation immensely, as translation overhead was now amortized over a large efficient calculation.\n",
+    "\n",
+    "I could now confirm very many xicor calculations at once, by putting them all in a shared table (identifiable by row labels)."
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
+  },
   {
    "cell_type": "code",
    "execution_count": 8,
@@ -194,6 +202,18 @@
     }
    }
   },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "I can even repeat the calculation and compute aggregates just by joining and projecting."
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
+  },
   {
    "cell_type": "code",
    "execution_count": 10,
@@ -243,7 +263,7 @@
     ")\n",
     "xicor_results = grouped_calc.eval({'d': example_frames, 'rep_frame': rep_frame})\n",
     "\n",
-    "xicor_results\n"
+    "xicor_results"
    ],
    "metadata": {
     "collapsed": false,
@@ -252,6 +272,18 @@
     }
    }
   },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "And these accelerated grouped calculations still match the reference R implementation."
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
+  },
   {
    "cell_type": "code",
    "execution_count": 12,
@@ -337,6 +369,18 @@
     }
    }
   },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "And, as always, the fact that this is a pure data algebra calculation means we can run it in a database (meaning we can apply it to big data)."
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
+  },
   {
    "cell_type": "code",
    "execution_count": 13,
@@ -353,8 +397,10 @@
    "source": [
     "# try it in database\n",
     "db_handle = data_algebra.BigQuery.example_handle()\n",
+    "# insert the example data; in real applications the data is already in the database\n",
     "db_handle.insert_table(example_frames, table_name='d', allow_overwrite=True)\n",
-    "db_handle.insert_table(rep_frame, table_name='rep_frame', allow_overwrite=True)"
+    "db_handle.insert_table(rep_frame, table_name='rep_frame', allow_overwrite=True)\n",
+    "db_handle.drop_table(\"xicor\")"
    ],
    "metadata": {
     "collapsed": false,
@@ -365,10 +411,10 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 14,
+   "execution_count": 15,
    "outputs": [],
    "source": [
-    "db_handle.drop_table(\"xicor\")"
+    "db_handle.execute(f\"CREATE TABLE {db_handle.db_model.table_prefix}.xicor AS {db_handle.to_sql(grouped_calc)}\")"
    ],
    "metadata": {
     "collapsed": false,
@@ -379,10 +425,9 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 15,
+   "execution_count": null,
    "outputs": [],
    "source": [
-    "db_handle.execute(f\"CREATE TABLE {db_handle.db_model.table_prefix}.xicor AS {db_handle.to_sql(grouped_calc)}\")\n",
     "db_res = db_handle.read_query(f\"SELECT * FROM {db_handle.db_model.table_prefix}.xicor ORDER BY vname\")"
    ],
    "metadata": {
@@ -481,7 +526,7 @@
     "db_handle.drop_table(\"rep_frame\")\n",
     "db_handle.drop_table(\"xicor\")\n",
     "db_handle.close()\n",
-    "# show we made it to here, adn did not assert earlier\n",
+    "# show we made it to here, and did not assert earlier\n",
     "print('done')"
    ],
    "metadata": {
@@ -490,6 +535,18 @@
      "name": "#%%\n"
     }
    }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "And this is an example of a non-trivial statistical calculation being ported to the database."
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
   }
  ],
  "metadata": {