articles/machine-learning/team-data-science-process/spark-advanced-data-exploration-modeling.md
Modeling examples using CV and hyperparameter sweeping are shown for the binary classification problem. Simpler examples (without parameter sweeps) are presented in the main topic for regression tasks. But in the appendix, validation using elastic net for linear regression and CV with parameter sweep for random forest regression are also presented. The **elastic net** is a regularized regression method for fitting linear regression models that linearly combines the L1 and L2 penalties of the [lasso](https://en.wikipedia.org/wiki/Lasso%20%28statistics%29) and [ridge](https://en.wikipedia.org/wiki/Tikhonov_regularization) methods.
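As a hedged illustration of the combined penalty the elastic net uses, here is a minimal pure-Python sketch (not MLlib code; the names `lam` and `alpha` are chosen for this example to denote the regularization strength and the L1/L2 mixing parameter):

```python
def elastic_net_penalty(weights, lam=0.1, alpha=0.5):
    """lam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2).

    alpha = 1 reduces to the pure lasso (L1) penalty,
    alpha = 0 to the pure ridge (L2) penalty.
    """
    l1 = sum(abs(w) for w in weights)          # lasso term
    l2 = sum(w * w for w in weights)           # ridge term
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2)

w = [0.5, -1.0, 2.0]
print(elastic_net_penalty(w, lam=0.1, alpha=1.0))  # pure L1: close to 0.35
print(elastic_net_penalty(w, lam=0.1, alpha=0.0))  # pure L2: close to 0.2625
```

Intermediate values of `alpha` blend the two penalties, which is what lets the elastic net shrink coefficients like ridge while still zeroing some of them like the lasso.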
> [!NOTE]
> Although the Spark MLlib toolkit is designed to work on large datasets, a relatively small sample (~30 MB, about 170K rows, roughly 0.1% of the original NYC dataset) is used here for convenience. The exercise runs efficiently (in about 10 minutes) on an HDInsight cluster with two worker nodes. The same code, with minor modifications, can be used to process larger datasets, with appropriate changes for caching data in memory and sizing the cluster.
## Setup: Spark clusters and notebooks
Setup steps and code are provided in this walkthrough for an HDInsight Spark 1.6 cluster, but Jupyter notebooks are provided for both HDInsight Spark 1.6 and Spark 2.0 clusters. A description of the notebooks and links to them are provided in the [Readme.md](https://github.com/Azure/Azure-MachineLearning-DataScience/blob/master/Misc/Spark/pySpark/Readme.md) for the GitHub repository containing them. Moreover, the code here and in the linked notebooks is generic and should work on any Spark cluster. If you are not using HDInsight Spark, the cluster setup and management steps may be slightly different from what is shown here. For convenience, here are the links to the Jupyter notebooks for Spark 1.6 and 2.0 to be run in the pyspark kernel of the Jupyter Notebook server:
This code creates a local data frame from the query output and plots the data. The `%%local` magic creates a local data frame, `sqlResults`, which can be used for plotting with matplotlib.
> [!NOTE]
> This PySpark magic is used multiple times in this walkthrough. If the amount of data is large, you should sample to create a data frame that can fit in local memory.
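To make the sampling advice above concrete, here is a hedged pure-Python sketch of down-sampling rows before pulling them into local memory (the data, column name, and fraction are made up for illustration; in PySpark you would typically sample the distributed data frame before collecting it):

```python
import random

# Simulate a large distributed result set with 100,000 rows.
random.seed(42)
rows = [{"passenger_count": random.randint(1, 6)} for _ in range(100_000)]

# Keep roughly 1% of rows so the local data frame fits in memory.
fraction = 0.01
sampled = [r for r in rows if random.random() < fraction]

print(len(sampled))  # roughly 1,000 of the 100,000 rows survive
```

The sampled rows are then small enough to convert to a local data frame and plot, while still reflecting the distribution of the full dataset.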
```python
# RUN THE CODE LOCALLY ON THE JUPYTER SERVER
%%local
```
### Generic cross validation and hyperparameter sweeping used with the logistic regression algorithm for binary classification
The code in this section shows how to train, evaluate, and save a logistic regression model with [LBFGS](https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm) that predicts whether or not a tip is paid for a trip in the NYC taxi trip and fare dataset. The model is trained using cross validation (CV) and hyperparameter sweeping implemented with custom code that can be applied to any of the learning algorithms in MLlib.
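The shape of such custom CV-with-sweep code can be sketched in pure Python (this is an illustration of the pattern, not the article's MLlib code; the threshold "model", the data, and the metric are all made up, and a real implementation would train a model on the non-validation folds at each step):

```python
import random

# Toy binary-classification data: feature x, label y.
random.seed(0)
data = [(random.random(), 0) for _ in range(50)] + \
       [(0.5 + random.random(), 1) for _ in range(50)]
random.shuffle(data)

def accuracy(threshold, rows):
    """Fraction of rows where (x > threshold) matches the label."""
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)

def cv_sweep(rows, thresholds, k=5):
    """For each hyperparameter value, average a metric over k folds."""
    fold = len(rows) // k
    best = None
    for t in thresholds:                      # hyperparameter sweep
        scores = []
        for i in range(k):                    # k-fold cross-validation
            val = rows[i * fold:(i + 1) * fold]
            scores.append(accuracy(t, val))   # evaluate on held-out fold
        mean = sum(scores) / k
        if best is None or mean > best[1]:
            best = (t, mean)
    return best

best_t, best_score = cv_sweep(data, [0.25, 0.5, 0.75, 1.0])
```

The point of the pattern is that the outer sweep and inner fold loop are independent of the learner, which is why the article's version of it can wrap any MLlib algorithm.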
> [!NOTE]
> The execution of this custom CV code can take several minutes.
**Train the logistic regression model using CV and hyperparameter sweeping**
### Use MLlib's CrossValidator pipeline function with logistic regression (Elastic regression) model
The code in this section shows how to train, evaluate, and save a logistic regression model with [LBFGS](https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm) that predicts whether or not a tip is paid for a trip in the NYC taxi trip and fare dataset. The model is trained using cross validation (CV) and hyperparameter sweeping implemented with the MLlib CrossValidator pipeline function.
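Conceptually, MLlib's `ParamGridBuilder` expands lists of candidate values into every combination, which `CrossValidator` then evaluates by CV. A hedged pure-Python sketch of that expansion (the parameter names `regParam` and `maxIter` match MLlib's logistic regression, but this grid is built by hand for illustration):

```python
from itertools import product

# Candidate hyperparameter values to sweep over.
reg_params = [0.01, 0.1]
max_iters = [10, 50, 100]

# Expand into every (regParam, maxIter) combination, as a grid
# builder would, one dict per candidate model configuration.
param_grid = [{"regParam": r, "maxIter": m}
              for r, m in product(reg_params, max_iters)]

print(len(param_grid))  # 2 * 3 = 6 candidate settings
```

Each of the six settings is then fitted and scored on every CV fold, so the cost of the sweep grows multiplicatively with the number of values per parameter.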
> [!NOTE]
> The execution of this MLlib CV code can take several minutes.
```python
# RECORD START TIME
import datetime

timestart = datetime.datetime.now()
```
2. **Model evaluation** on a test data set with metrics
3. **Saving model** in blob for future consumption
> [!NOTE]
> Cross-validation is not used with the three regression models in this section, since this was shown in detail for the logistic regression models. An example showing how to use CV with Elastic Net for linear regression is provided in the Appendix of this topic.
> [!NOTE]
> In our experience, there can be issues with the convergence of LinearRegressionWithSGD models, and parameters need to be adjusted carefully to obtain a valid model. Scaling of variables significantly helps with convergence. Elastic net regression, shown in the Appendix to this topic, can also be used instead of LinearRegressionWithSGD.
### Linear regression with SGD
The code in this section shows how to use scaled features to train a linear regression that uses stochastic gradient descent (SGD) for optimization, and how to score, evaluate, and save the model in Azure Blob Storage (WASB).
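A hedged pure-Python sketch of the idea (this is not the article's MLlib code; the data, learning rate, and epoch count are made up, and the note above about convergence is exactly why the raw feature is standardized before the SGD updates):

```python
import random

# Synthetic data: y = 3*x + 50 + noise, with a raw feature on a
# large scale (0..1000) that would make naive SGD hard to tune.
random.seed(1)
xs = [random.uniform(0, 1000) for _ in range(200)]
ys = [3.0 * x + 50 + random.gauss(0, 10) for x in xs]

# Standardize the feature (as a feature scaler would).
mean = sum(xs) / len(xs)
std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
xs_scaled = [(x - mean) / std for x in xs]

# Stochastic gradient descent on squared error, one row at a time.
w, b, lr = 0.0, 0.0, 0.01
for epoch in range(20):
    for x, y in zip(xs_scaled, ys):
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
```

With the feature scaled to unit variance, a single fixed learning rate works across the whole input range; on the raw 0..1000 feature, the same step size would make the updates diverge.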
### Random Forest regression
The code in this section shows how to train, evaluate, and save a random forest model that predicts tip amount for the NYC taxi trip data.
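The core idea behind random forest regression, averaging many trees fit on bootstrap samples, can be sketched in pure Python (a hedged illustration, not the article's MLlib code: each "tree" here is a single-split stump to keep the example short, and the data is made up):

```python
import random

# Toy regression data: y is roughly x squared plus noise.
random.seed(2)
data = [(x / 100.0, (x / 100.0) ** 2 + random.gauss(0, 0.05))
        for x in range(100)]

def fit_stump(rows):
    """Fit the best single-threshold split minimizing squared error."""
    best = None
    for t in [r[0] for r in rows]:
        left = [y for x, y in rows if x <= t]
        right = [y for x, y in rows if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + \
              sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

# Bootstrap: each stump sees a resampled copy of the data.
forest = [fit_stump([random.choice(data) for _ in data])
          for _ in range(50)]

def predict(x):
    """Average the ensemble's predictions."""
    return sum(stump(x) for stump in forest) / len(forest)
```

Averaging the bootstrapped stumps smooths the individual two-level predictions into a curve that tracks the target; real forests additionally grow deep trees and randomize the features considered at each split.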
> [!NOTE]
> Cross-validation with parameter sweeping using custom code is provided in the appendix.