|
1 | 1 | { |
2 | 2 | "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "source": [ |
| 6 | + "The [`data_algebra`](https://github.com/WinVector/data_algebra/tree/master/data_algebra) `Locum` stand-in lets us build up pipelines out of larger pieces.\n", |
| 7 | + "\n", |
| 8 | + "A traditional all-in-one way of building up a pipeline looks like the following." |
| 9 | + ], |
| 10 | + "metadata": { |
| 11 | + "collapsed": false, |
| 12 | + "pycharm": { |
| 13 | + "name": "#%% md\n" |
| 14 | + } |
| 15 | + } |
| 16 | + }, |
3 | 17 | { |
4 | 18 | "cell_type": "code", |
5 | 19 | "execution_count": 1, |
|
51 | 65 | "print(ops)" |
52 | 66 | ] |
53 | 67 | }, |
| 68 | + { |
| 69 | + "cell_type": "markdown", |
| 70 | + "source": [ |
| 71 | + "Instead, we can build up this calculation in three major steps: computing probability, rank-based selection, and cleanup." |
| 72 | + ], |
| 73 | + "metadata": { |
| 74 | + "collapsed": false, |
| 75 | + "pycharm": { |
| 76 | + "name": "#%% md\n" |
| 77 | + } |
| 78 | + } |
| 79 | + }, |
54 | 80 | { |
55 | 81 | "cell_type": "code", |
56 | 82 | "execution_count": 2, |
|
138 | 164 | } |
139 | 165 | } |
140 | 166 | }, |
| 167 | + { |
| 168 | + "cell_type": "markdown", |
| 169 | + "source": [ |
| 170 | + "We can then combine this into a new pipeline as follows." |
| 171 | + ], |
| 172 | + "metadata": { |
| 173 | + "collapsed": false, |
| 174 | + "pycharm": { |
| 175 | + "name": "#%% md\n" |
| 176 | + } |
| 177 | + } |
| 178 | + }, |
141 | 179 | { |
142 | 180 | "cell_type": "code", |
143 | 181 | "execution_count": 5, |
|
150 | 188 | "output_type": "stream" |
151 | 189 | } |
152 | 190 | ], |
| 191 | + "source": [ |
| 192 | + "ops = Locum(). \\\n", |
| 193 | + " append(prob_calculation). \\\n", |
| 194 | + " append(top_rank). \\\n", |
| 195 | + " append(clean_up_columns). \\\n", |
| 196 | + " apply_to(data_algebra.data_ops.describe_table(d_local, 'd'))\n", |
| 197 | + "\n", |
| 198 | + "print(ops)" |
| 199 | + ], |
| 200 | + "metadata": { |
| 201 | + "collapsed": false, |
| 202 | + "pycharm": { |
| 203 | + "name": "#%%\n", |
| 204 | + "is_executing": false |
| 205 | + } |
| 206 | + } |
| 207 | + }, |
| 208 | + { |
| 209 | + "cell_type": "markdown", |
| 210 | + "source": [ |
| 211 | + "The pipeline is applied to data as follows." |
| 212 | + ], |
| 213 | + "metadata": { |
| 214 | + "collapsed": false, |
| 215 | + "pycharm": { |
| 216 | + "name": "#%% md\n" |
| 217 | + } |
| 218 | + } |
| 219 | + }, |
| 220 | + { |
| 221 | + "cell_type": "code", |
| 222 | + "execution_count": 6, |
| 223 | + "outputs": [ |
| 224 | + { |
| 225 | + "data": { |
| 226 | + "text/plain": " subjectID diagnosis probability\n0 1 withdrawal behavior 0.670622\n1 2 positive re-framing 0.558974", |
| 227 | + "text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>subjectID</th>\n <th>diagnosis</th>\n <th>probability</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>withdrawal behavior</td>\n <td>0.670622</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>positive re-framing</td>\n <td>0.558974</td>\n </tr>\n </tbody>\n</table>\n</div>" |
| 228 | + }, |
| 229 | + "metadata": {}, |
| 230 | + "output_type": "execute_result", |
| 231 | + "execution_count": 6 |
| 232 | + } |
| 233 | + ], |
| 234 | + "source": [ |
| 235 | + "ops.transform(d_local)" |
| 236 | + ], |
| 237 | + "metadata": { |
| 238 | + "collapsed": false, |
| 239 | + "pycharm": { |
| 240 | + "name": "#%%\n", |
| 241 | + "is_executing": false |
| 242 | + } |
| 243 | + } |
| 244 | + }, |
| 245 | + { |
| 246 | + "cell_type": "markdown", |
| 247 | + "source": [ |
| 248 | + "Or we can use `+`/append notation to build up the pipeline." |
| 249 | + ], |
| 250 | + "metadata": { |
| 251 | + "collapsed": false, |
| 252 | + "pycharm": { |
| 253 | + "name": "#%% md\n" |
| 254 | + } |
| 255 | + } |
| 256 | + }, |
| 257 | + { |
| 258 | + "cell_type": "code", |
| 259 | + "execution_count": 7, |
| 260 | + "outputs": [ |
| 261 | + { |
| 262 | + "name": "stdout", |
| 263 | + "text": [ |
| 264 | + "TableDescription(table_name='d', column_names=['subjectID', 'surveyCategory', 'assessmentTotal', 'irrelevantCol1', 'irrelevantCol2']) .\\\n extend({'probability': '(assessmentTotal * 0.237).exp()'}) .\\\n extend({'total': 'probability.sum()'}, partition_by=['subjectID']) .\\\n extend({'probability': 'probability / total'}) .\\\n extend({'sort_key': '-probability'}) .\\\n extend({'row_number': '_row_number()'}, partition_by=['subjectID'], order_by=['sort_key']) .\\\n select_rows('row_number == 1') .\\\n select_columns(['subjectID', 'surveyCategory', 'probability']) .\\\n rename_columns({'diagnosis': 'surveyCategory'}) .\\\n order_rows(['subjectID'], reverse=['subjectID'])\n" |
| 265 | + ], |
| 266 | + "output_type": "stream" |
| 267 | + } |
| 268 | + ], |
153 | 269 | "source": [ |
154 | 270 | "ops = data_algebra.data_ops.describe_table(d_local, 'd') +\\\n", |
155 | 271 | " prob_calculation +\\\n", |
|
166 | 282 | } |
167 | 283 | } |
168 | 284 | }, |
| 285 | + { |
| 286 | + "cell_type": "markdown", |
| 287 | + "source": [ |
| 288 | + "We could also \"pipe\" the data into the operators, but that is less \"Pythonic\" (that is, less idiomatic in Python)." |
| 289 | + ], |
| 290 | + "metadata": { |
| 291 | + "collapsed": false, |
| 292 | + "pycharm": { |
| 293 | + "name": "#%% md\n" |
| 294 | + } |
| 295 | + } |
| 296 | + }, |
169 | 297 | { |
170 | 298 | "cell_type": "code", |
171 | | - "execution_count": 6, |
| 299 | + "execution_count": 8, |
172 | 300 | "outputs": [ |
173 | 301 | { |
174 | 302 | "data": { |
|
177 | 305 | }, |
178 | 306 | "metadata": {}, |
179 | 307 | "output_type": "execute_result", |
180 | | - "execution_count": 6 |
| 308 | + "execution_count": 8 |
181 | 309 | } |
182 | 310 | ], |
183 | 311 | "source": [ |
|