|
1 | 1 | { |
2 | 2 | "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "source": [ |
| 6 | + "# Values as Columns\n", |
| 7 | + "\n", |
| 8 | + "A [SQL](https://en.wikipedia.org/wiki/SQL) feature I really like is the interchangeability of values and columns: a constant can be used where a column would be. It is a small convenience, but a nice one.\n", |
| 9 | + "\n", |
| 10 | + "Let's work an example to illustrate the point. Our task will be to count how many rows are in each group of a data frame.\n", |
| 11 | + "\n", |
| 12 | + "In the [data algebra](https://github.com/WinVector/data_algebra) over [Pandas](https://pandas.pydata.org) this looks like the following.\n", |
| 13 | + "\n", |
| 14 | + "First we import our packages and set up our example Pandas data frame." |
| 15 | + ], |
| 16 | + "metadata": { |
| 17 | + "collapsed": false, |
| 18 | + "pycharm": { |
| 19 | + "name": "#%% md\n" |
| 20 | + } |
| 21 | + } |
| 22 | + }, |
3 | 23 | { |
4 | 24 | "cell_type": "code", |
5 | | - "execution_count": null, |
| 25 | + "execution_count": 1, |
6 | 26 | "metadata": { |
7 | 27 | "collapsed": true |
8 | 28 | }, |
9 | | - "outputs": [], |
| 29 | + "outputs": [ |
| 30 | + { |
| 31 | + "data": { |
| 32 | + "text/plain": " group one\n0 a 1\n1 a 1\n2 b 1\n3 b 1\n4 b 1", |
| 33 | + "text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>group</th>\n <th>one</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>a</td>\n <td>1</td>\n </tr>\n <tr>\n <th>1</th>\n <td>a</td>\n <td>1</td>\n </tr>\n <tr>\n <th>2</th>\n <td>b</td>\n <td>1</td>\n </tr>\n <tr>\n <th>3</th>\n <td>b</td>\n <td>1</td>\n </tr>\n <tr>\n <th>4</th>\n <td>b</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n</div>" |
| 34 | + }, |
| 35 | + "execution_count": 1, |
| 36 | + "metadata": {}, |
| 37 | + "output_type": "execute_result" |
| 38 | + } |
| 39 | + ], |
10 | 40 | "source": [ |
11 | 41 | "import pandas as pd\n", |
12 | | - "from data_algebra.data_ops import data, descr, ex\n", |
13 | | - "from data_algebra.BigQuery import BigQueryModel, BigQuery_DBHandle\n", |
| 42 | + "from data_algebra.data_ops import descr\n", |
| 43 | + "from data_algebra.BigQuery import BigQueryModel\n", |
14 | 44 | "\n", |
15 | 45 | "\n", |
16 | 46 | "d = pd.DataFrame({\n", |
17 | 47 | " 'group': ['a', 'a', 'b', 'b', 'b'],\n", |
18 | 48 | " 'one': [1, 1, 1, 1, 1],\n", |
19 | 49 | "})\n", |
20 | 50 | "\n", |
| 51 | + "d\n" |
| 52 | + ] |
| 53 | + }, |
| 54 | + { |
| 55 | + "cell_type": "markdown", |
| 56 | + "source": [ |
| 57 | + "Now we specify our grouped counting operations, using a data algebra project step." |
| 58 | + ], |
| 59 | + "metadata": { |
| 60 | + "collapsed": false, |
| 61 | + "pycharm": { |
| 62 | + "name": "#%% md\n" |
| 63 | + } |
| 64 | + } |
| 65 | + }, |
| 66 | + { |
| 67 | + "cell_type": "code", |
| 68 | + "execution_count": 2, |
| 69 | + "outputs": [], |
| 70 | + "source": [ |
21 | 71 | "ops = (\n", |
22 | 72 | " descr(d=d)\n", |
23 | 73 | " .project(\n", |
|
27 | 77 | " },\n", |
28 | 78 | " group_by=['group']\n", |
29 | 79 | " )\n", |
30 | | - ")\n", |
| 80 | + ")" |
| 81 | + ], |
| 82 | + "metadata": { |
| 83 | + "collapsed": false, |
| 84 | + "pycharm": { |
| 85 | + "name": "#%%\n" |
| 86 | + } |
| 87 | + } |
| 88 | + }, |
| 89 | + { |
| 90 | + "cell_type": "markdown", |
| 91 | + "source": [ |
| 92 | + "The point is that we are free to count either by summing the values in a column (such as the column `one`) *or* by summing a constant directly (such as `1`). The parentheses in `(1).sum()` are there so that the dot is interpreted as an attribute lookup, and not as a decimal point.\n", |
31 | 93 | "\n", |
| 94 | + "As desired, both calculations return the same result." |
| 95 | + ], |
| 96 | + "metadata": { |
| 97 | + "collapsed": false, |
| 98 | + "pycharm": { |
| 99 | + "name": "#%% md\n" |
| 100 | + } |
| 101 | + } |
| 102 | + }, |
| 103 | + { |
| 104 | + "cell_type": "code", |
| 105 | + "execution_count": 3, |
| 106 | + "outputs": [ |
| 107 | + { |
| 108 | + "data": { |
| 109 | + "text/plain": " group sum_one sum_1\n0 a 2 2\n1 b 3 3", |
| 110 | + "text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>group</th>\n <th>sum_one</th>\n <th>sum_1</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>a</td>\n <td>2</td>\n <td>2</td>\n </tr>\n <tr>\n <th>1</th>\n <td>b</td>\n <td>3</td>\n <td>3</td>\n </tr>\n </tbody>\n</table>\n</div>" |
| 111 | + }, |
| 112 | + "execution_count": 3, |
| 113 | + "metadata": {}, |
| 114 | + "output_type": "execute_result" |
| 115 | + } |
| 116 | + ], |
| 117 | + "source": [ |
32 | 118 | "ops.transform(d)" |
33 | | - ] |
| 119 | + ], |
| 120 | + "metadata": { |
| 121 | + "collapsed": false, |
| 122 | + "pycharm": { |
| 123 | + "name": "#%%\n" |
| 124 | + } |
| 125 | + } |
| 126 | + }, |
| 127 | + { |
| 128 | + "cell_type": "markdown", |
| 129 | + "source": [ |
| 130 | + "The equivalent SQL is as follows." |
| 131 | + ], |
| 132 | + "metadata": { |
| 133 | + "collapsed": false, |
| 134 | + "pycharm": { |
| 135 | + "name": "#%% md\n" |
| 136 | + } |
| 137 | + } |
34 | 138 | }, |
35 | 139 | { |
36 | 140 | "cell_type": "code", |
37 | | - "execution_count": null, |
38 | | - "outputs": [], |
| 141 | + "execution_count": 4, |
| 142 | + "outputs": [ |
| 143 | + { |
| 144 | + "name": "stdout", |
| 145 | + "output_type": "stream", |
| 146 | + "text": [ |
| 147 | + "-- data_algebra SQL https://github.com/WinVector/data_algebra\n", |
| 148 | + "-- dialect: BigQueryModel\n", |
| 149 | + "-- string quote: \"\n", |
| 150 | + "-- identifier quote: `\n", |
| 151 | + "SELECT -- .project({ 'sum_one': 'one.sum()', 'sum_1': '(1).sum()'}, group_by=['group'])\n", |
| 152 | + " SUM(`one`) AS `sum_one` ,\n", |
| 153 | + " SUM(1) AS `sum_1` ,\n", |
| 154 | + " `group`\n", |
| 155 | + "FROM\n", |
| 156 | + " `d`\n", |
| 157 | + "GROUP BY\n", |
| 158 | + " `group`\n", |
| 159 | + "\n" |
| 160 | + ] |
| 161 | + } |
| 162 | + ], |
39 | 163 | "source": [ |
40 | 164 | "db_model = BigQueryModel()\n", |
41 | 165 | "\n", |
42 | | - "print(db_model.to_sql(ops))\n" |
| 166 | + "sql_str = db_model.to_sql(ops)\n", |
| 167 | + "\n", |
| 168 | + "print(sql_str)" |
43 | 169 | ], |
44 | 170 | "metadata": { |
45 | 171 | "collapsed": false, |
46 | 172 | "pycharm": { |
47 | 173 | "name": "#%%\n" |
48 | 174 | } |
49 | 175 | } |
| 176 | + }, |
| 177 | + { |
| 178 | + "cell_type": "markdown", |
| 179 | + "source": [ |
| 180 | + "This equivalence of values and columns is a principle borrowed from SQL." |
| 181 | + ], |
| 182 | + "metadata": { |
| 183 | + "collapsed": false, |
| 184 | + "pycharm": { |
| 185 | + "name": "#%% md\n" |
| 186 | + } |
| 187 | + } |
50 | 188 | } |
51 | 189 | ], |
52 | 190 | "metadata": { |
|