|
14 | 14 | "source": [ |
15 | 15 | "# Predict NYC Taxi Tips using Spark ML and Azure Open Datasets\n", |
16 | 16 | "\n", |
17 | | - "The notebook ingests, vizualises, prepares and then trains a model based on an Open Dataset that tracks NYC Yellow Taxi trips and various attributes around them.\n", |
| 17 | + "The notebook ingests, visualizes, prepares and then trains a model based on an Open Dataset that tracks NYC Yellow Taxi trips and various attributes around them.\n", |
18 | 18 | "The goal is to predict for a given trip whether there will be a trip or not.\n", |
19 | 19 | "" |
20 | 20 | ], |
|
91 | 91 | "source": [ |
92 | 92 | "## Exploratory Data Analysis\n", |
93 | 93 | "\n", |
94 | | - "Look at the data and evaluate its suitablility for use in a model, do this via some basic charts focussed on tip values and relatoinships." |
| 94 | + "Look at the data and evaluate its suitability for use in a model, do this via some basic charts focussed on tip values and relationships." |
95 | 95 | ], |
96 | 96 | "attachments": {} |
97 | 97 | }, |
|
178 | 178 | "source": [ |
179 | 179 | "## Data Prep and Featurization\n", |
180 | 180 | "\n", |
181 | | - "It's clear from the vizualisations above that there are a bunch of outliers in the data. These will need to be filtered out in addition there are extra variables that are not going to be useful in the model we build at the end.\n", |
| 181 | + "It's clear from the visualizations above that there are a bunch of outliers in the data. These will need to be filtered out in addition there are extra variables that are not going to be useful in the model we build at the end.\n", |
182 | 182 | "\n", |
183 | 183 | "Finally there is a need to create some new (derived) variables that will work better with the model.\n", |
184 | 184 | "" |
|
246 | 246 | "source": [ |
247 | 247 | "## Encoding\n", |
248 | 248 | "\n", |
249 | | - "Different ML alogirthms support different type sof input, for this example Logistic Regression is being used for Binry Classification. This means that any Categorical (string) variables must be converted to numbers.\n", |
| 249 | + "Different ML algorithms support different types of input, for this example Logistic Regression is being used for Binary Classification. This means that any Categorical (string) variables must be converted to numbers.\n", |
250 | 250 | "\n", |
251 | 251 | "The process is not as simple as a \"map\" style function as the relationship between the numbers can introduce a bias in the resulting model, the approach is to index the variable and then encode using a std approach called One Hot Encoding.\n", |
252 | 252 | "\n", |
|
350 | 350 | "cell_type": "markdown", |
351 | 351 | "metadata": {}, |
352 | 352 | "source": [ |
353 | | - "## Evaluate and Vizualise\n", |
| 353 | + "## Evaluate and Visualize\n", |
354 | 354 | "\n", |
355 | 355 | "Plot the actual curve to develop a better understanding of the model.\n", |
356 | 356 | "" |
|