|
47 | 47 | "\n",
|
48 | 48 | "* What kind of machine learning problem/task is this?\n",
|
49 | 49 | "* What are our goals? What are we predicting?\n",
|
50 |
| - "* If this were a research task, does it make sense what were doing?\n", |
| 50 | + "* If this were a research task, does what we're doing make sense?\n", |
51 | 51 | "* How would a human answer the question?\n",
|
52 | 52 | "* What performance would we be happy with?\n",
|
53 | 53 | "* Do we have any assumptions yet? Can we verify these?\n",
|
|
193 | 193 | "What does this tell us?\n",
|
194 | 194 | "\n",
|
195 | 195 | "1. Median income: This does not look like it is expressed in USD. The team that created this data let you know that the data has been scaled, and capped at ~15, and at ~0.5 for upper and lower median incomes. The numbers represent roughly k * 10,000 USD.\n",
|
196 |
| - "2. Median house value: Note the value counts of the 500k USD bin: this data has also been capped. This could lead to our model learning that house prices can never go about this value. If we want our model to perform well, we either need to collect accurate labels (house prices) for these districts, or remove these districts (from the train and test sets) altogether. This means that the model will not be evaluated poorly on these districts, but would not be able to be used to provide predictions. What are the business requirements here? One solution might be preferable.\n", |
| 196 | + "2. Median house value: Note the value counts of the 500k USD bin: this data has also been capped. This could lead to our model learning that house prices can never exceed this value. If we want our model to perform well, we either need to collect accurate labels (house prices) for these districts, or remove these districts (from the train and test sets) altogether. This means that the model will not be evaluated poorly on these districts, but would not be able to be used to provide predictions. What are the business requirements here? One solution might be preferable.\n", |
197 | 197 | "3. Note the differing scales of the features. We might need to re-scale these.\n",
|
198 | 198 | "4. Note that many distributions are quite tail-heavy. We might need to transform these later on."
|
199 | 199 | ]
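Here is a minimal sketch of checking for capped labels. It uses scikit-learn's built-in copy of the California housing data as a stand-in for this notebook's `housing` DataFrame, so the column name (`MedHouseVal`) and units ($100k) differ from the notebook's:

```python
from sklearn.datasets import fetch_california_housing

housing_sk = fetch_california_housing(as_frame=True).frame

# A large spike at the maximum label value suggests the labels were capped.
cap = housing_sk["MedHouseVal"].max()
n_capped = (housing_sk["MedHouseVal"] == cap).sum()
print(f"{n_capped} districts sit at the cap of {cap:.2f} (units of $100k)")

# One option: drop capped districts so the model is neither trained on
# nor evaluated against artificially clipped prices.
housing_uncapped = housing_sk[housing_sk["MedHouseVal"] < cap]
```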
|
|
363 | 363 | "cell_type": "markdown",
|
364 | 364 | "metadata": {},
|
365 | 365 | "source": [
|
366 |
| - "Final note: You need to be careful selecting strata, and ensuring that there are enough samples in each stratum." |
| 366 | + "Final note: You need to be careful when selecting strata, ensuring that there are enough samples in each stratum (see the sketch below)." |
367 | 367 | ]
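As a sketch of that care in practice, here is one common way to stratify on income, assuming this notebook's `housing` DataFrame with a `median_income` column. The bin edges are illustrative, not prescriptive:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Bin the continuous income into a small number of strata.
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

# Sanity-check stratum sizes first: tiny strata make the split unstable.
print(housing["income_cat"].value_counts())

# Stratified 80/20 split that preserves income-category proportions.
train_set, test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)
```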
|
368 | 368 | },
|
369 | 369 | {
|
|
374 | 374 | "\n",
|
375 | 375 | "* Any data with latitude and longitude columns should be plotted immediately!\n",
|
376 | 376 | "* However, as we are just exploring at the moment, lets create a copy of our training set for safety.\n",
|
377 |
| - "* If our dataset was very large, we might want to randomly sample 10% (or another amount) from our training set to make it easier to work with in local memory." |
| 377 | + "* If our dataset were very large, we might want to randomly sample a subset (10%, for example) of our training set to make it easier to work with in local memory (see the sketch below)." |
378 | 378 | ]
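A short sketch of this exploration step, assuming the `train_set` from the split above with `longitude`/`latitude` columns:

```python
# Work on a copy so exploratory changes never touch the real training set.
housing_explore = train_set.copy()

# For very large datasets, a random 10% sample keeps plotting responsive.
housing_sample = housing_explore.sample(frac=0.1, random_state=42)

# Low alpha reveals density: high-density areas show up darker.
housing_sample.plot(kind="scatter", x="longitude", y="latitude", alpha=0.2)
```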
|
379 | 379 | },
|
380 | 380 | {
|
|
596 | 596 | "* Depending on your model and type of task, there might be some preparation of data for your machine learning algorithm.\n",
|
597 | 597 | "* Remember that you will need to apply these transformations to the test set as well. Hence, these functions and processes should be as modular as possible.\n",
|
598 | 598 | "* There are standard transformations for each model and type of task. Some of these can be applied immediately (with know-how of course). For example, some models require things like one-hot encoding, where categoric columns are split into columns of binary representations.\n",
|
599 |
| - "* However, it is generally not a good idea to throw all of these processes at your data before starting. For example, lets say we want to replicate a research paper to check the performance of a model. If it has a complicated serious of transformations before training, it might be worth checking the performance of a simple linear model operating on minimally transformed data. More than a few research papers present highly complex models and pre-processing stages just to perform worse than linear regression!\n", |
| 599 | + "* However, it is generally not a good idea to throw all of these processes at your data before starting. For example, let's say we want to replicate a research paper to check the performance of a model. If it has a complicated series of transformations before training, it might be worth checking the performance of a simple linear model operating on minimally transformed data (see the baseline sketch below). More than a few research papers present highly complex models and pre-processing stages just to perform worse than linear regression!\n", |
600 | 600 | "\n",
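To make the "simple baseline first" point concrete, here is a hedged sketch; column names are assumed from this notebook's dataset:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Numeric features only, minimally transformed: the simplest credible baseline.
X = train_set.select_dtypes("number").drop(columns="median_house_value")
y = train_set["median_house_value"]

# Quick median fill just so the baseline runs; proper imputation comes later.
X = X.fillna(X.median())

scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_root_mean_squared_error", cv=5)
print(f"Baseline RMSE: {-scores.mean():,.0f}")
```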
|
601 | 601 | "Note: I am wrapping up data cleaning and feature engineering in the same stage here. The line between these is a bit blurry. I generally consider cleaning with feature engineering as you shouldnt make data cleaning decisions until you know what machine learning you are going to perform. What if you clean something useful?\n",
|
602 | 602 | "\n",
|
|
737 | 737 | "source": [
|
738 | 738 | "imputer = SimpleImputer(strategy=\"median\")\n",
|
739 | 739 | "\n",
|
740 |
| - "# Drop text attributes as we cant impute these\n", |
| 740 | + "# Drop text attributes as we can't impute these\n", |
741 | 741 | "housing_num = housing.select_dtypes(include=[np.number])"
|
742 | 742 | ]
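Continuing from the cell above, a sketch of the standard scikit-learn fit/transform pattern applied here (this assumes `from sklearn.impute import SimpleImputer` was run earlier in the notebook; `imputer` and `housing_num` come from the cell above):

```python
import pandas as pd

# Learn per-column medians from the numeric training data only.
imputer.fit(housing_num)

# transform() returns a NumPy array; wrap it back into a DataFrame so
# column names and the index are preserved.
housing_tr = pd.DataFrame(imputer.transform(housing_num),
                          columns=housing_num.columns,
                          index=housing_num.index)
```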
|
743 | 743 | },
|
|
850 | 850 | "\n",
|
851 | 851 | "* Many machine learning models require processing of categorical/text/string attributes. Generally, machine learning algorithms prefer numbers!\n",
|
852 | 852 | "* One common processing step is called ordinal encoding, where we replace the categories with a numeric value.\n",
|
853 |
| - "* Another common processing step is called one-hot encoding, where a single categorical attribute of n categories is transformed into n (multiplied by rows) binary columns. \n", |
| 853 | + "* Another common processing step is *one-hot* encoding, where a categorical attribute with n categories is converted into n binary attributes, one for each category. Each attribute takes a value of 1 for its corresponding category and 0 for all others (see the sketch below).\n", |
854 | 854 | "* Lets try ordinal encoding of the ocean proximity attribute."
|
855 | 855 | ]
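A sketch of both encoders on this attribute; the column name `ocean_proximity` matches this dataset, and everything else is standard scikit-learn:

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

housing_cat = housing[["ocean_proximity"]]   # 2-D input, as the encoders expect

# Ordinal: one integer per category. Implies an ordering that may not exist!
ordinal_encoder = OrdinalEncoder()
housing_cat_ordinal = ordinal_encoder.fit_transform(housing_cat)

# One-hot: one binary column per category, no implied ordering.
onehot_encoder = OneHotEncoder()
housing_cat_1hot = onehot_encoder.fit_transform(housing_cat)  # sparse matrix

print(ordinal_encoder.categories_)
```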
|
856 | 856 | },
|
|