Commit 03daf76: Fix NOTE rendering
Parent: 259e5ac

File tree: 1 file changed (+32, -15 lines)


articles/machine-learning/team-data-science-process/spark-advanced-data-exploration-modeling.md

Lines changed: 32 additions & 15 deletions
@@ -36,10 +36,12 @@ The models we use include logistic and linear regression, random forests, and gr
 
 Modeling examples using CV and hyperparameter sweep are shown for the binary classification problem. Simpler examples (without parameter sweeps) are presented in the main topic for regression tasks. But in the appendix, validation using elastic net for linear regression and CV with parameter sweep for random forest regression are also presented. The **elastic net** is a regularized regression method for fitting linear regression models that linearly combines the L1 and L2 metrics as penalties of the [lasso](https://en.wikipedia.org/wiki/Lasso%20%28statistics%29) and [ridge](https://en.wikipedia.org/wiki/Tikhonov_regularization) methods.
 
+<!-- -->
+
 > [!NOTE]
 > Although the Spark MLlib toolkit is designed to work on large datasets, a relatively small sample (~30 Mb using 170K rows, about 0.1% of the original NYC dataset) is used here for convenience. The exercise given here runs efficiently (in about 10 minutes) on an HDInsight cluster with 2 worker nodes. The same code, with minor modifications, can be used to process larger datasets, with appropriate modifications for caching data in memory and changing the cluster size.
->
->
+
+<!-- -->
 
 ## Setup: Spark clusters and notebooks
 Setup steps and code are provided in this walkthrough for using an HDInsight Spark 1.6 cluster. But Jupyter notebooks are provided for both HDInsight Spark 1.6 and Spark 2.0 clusters. A description of the notebooks and links to them are provided in the [Readme.md](https://github.com/Azure/Azure-MachineLearning-DataScience/blob/master/Misc/Spark/pySpark/Readme.md) for the GitHub repository containing them. Moreover, the code here and in the linked notebooks is generic and should work on any Spark cluster. If you are not using HDInsight Spark, the cluster setup and management steps may be slightly different from what is shown here. For convenience, here are the links to the Jupyter notebooks for Spark 1.6 and 2.0 to be run in the pyspark kernel of the Jupyter Notebook server:
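The elastic net penalty mentioned in the hunk above is a weighted blend of the lasso (L1) and ridge (L2) terms. A minimal plain-Python sketch of that penalty (not part of the article's PySpark code; the function name and the `alpha`/`l1_ratio` convention follow common ML libraries and are assumptions here):

```python
# Elastic net penalty: alpha * (l1_ratio * ||w||_1 + (1 - l1_ratio) / 2 * ||w||_2^2)
# Illustrative only; mirrors the convention used by common ML libraries.

def elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.5):
    """Combine the L1 (lasso) and L2 (ridge) penalties on a weight vector."""
    l1 = sum(abs(w) for w in weights)          # lasso term
    l2 = sum(w * w for w in weights)           # ridge term (squared L2 norm)
    return alpha * (l1_ratio * l1 + (1.0 - l1_ratio) / 2.0 * l2)

# l1_ratio=1.0 recovers pure lasso; l1_ratio=0.0 recovers pure ridge.
print(elastic_net_penalty([1.0, -2.0], alpha=1.0, l1_ratio=0.5))  # 2.75
```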
@@ -203,10 +205,12 @@ This query retrieves the trips by passenger count.
 
 This code creates a local data-frame from the query output and plots the data. The `%%local` magic creates a local data-frame, `sqlResults`, which can be used for plotting with matplotlib.
 
+<!-- -->
+
 > [!NOTE]
 > This PySpark magic is used multiple times in this walkthrough. If the amount of data is large, you should sample to create a data-frame that can fit in local memory.
->
->
+
+<!-- -->
 
 # RUN THE CODE LOCALLY ON THE JUPYTER SERVER
 %%local
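The note above recommends sampling before pulling data into local memory. In PySpark that would typically be `DataFrame.sample(withReplacement=False, fraction, seed)` before converting to a local frame; here is a plain-Python analogue of that Bernoulli sampling (illustrative stand-in, not the article's code):

```python
import random

def bernoulli_sample(rows, fraction, seed=42):
    """Keep each row independently with probability `fraction`,
    mimicking Spark's DataFrame.sample(withReplacement=False, fraction, seed)."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

rows = list(range(100_000))
local = bernoulli_sample(rows, fraction=0.01)
print(len(local))  # roughly 1,000 rows, small enough to hold locally
```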
@@ -569,10 +573,12 @@ We show how to do cross-validation (CV) with parameter sweeping in two ways:
 ### Generic cross validation and hyperparameter sweeping used with the logistic regression algorithm for binary classification
 The code in this section shows how to train, evaluate, and save a logistic regression model with [LBFGS](https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm) that predicts whether or not a tip is paid for a trip in the NYC taxi trip and fare dataset. The model is trained using cross validation (CV) and hyperparameter sweeping implemented with custom code that can be applied to any of the learning algorithms in MLlib.
 
+<!-- -->
+
 > [!NOTE]
 > The execution of this custom CV code can take several minutes.
->
->
+
+<!-- -->
 
 **Train the logistic regression model using CV and hyperparameter sweeping**
 
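The custom CV-with-hyperparameter-sweep approach described in this hunk can be sketched, independent of Spark, as a grid search over k shuffled folds. This is an illustrative stand-in for the article's custom code, not the code itself; all names and the toy model are assumptions:

```python
import itertools
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle row indices and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_sweep(train_fn, score_fn, data, param_grid, k=3):
    """Evaluate every parameter combination with k-fold CV;
    return the (params, mean_score) pair with the highest mean score."""
    folds = k_fold_indices(len(data), k)
    names = sorted(param_grid)
    best = None
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = []
        for i in range(k):
            test = [data[j] for j in folds[i]]
            train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
            model = train_fn(train, **params)
            scores.append(score_fn(model, test))
        mean = statistics.mean(scores)
        if best is None or mean > best[1]:
            best = (params, mean)
    return best

# Toy example: the "model" is a constant prediction; higher score is better.
data = [(x, 2 * x) for x in range(30)]
train_fn = lambda rows, shift: statistics.mean(y for _, y in rows) + shift
score_fn = lambda m, rows: -statistics.mean((m - y) ** 2 for _, y in rows)
params, score = cv_sweep(train_fn, score_fn, data, {"shift": [-5, 0, 5]})
print(params)  # shift=0 wins: any offset only adds to the squared error
```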
@@ -795,10 +801,12 @@ Time taken to execute above cell: 34.57 seconds
 ### Use MLlib's CrossValidator pipeline function with logistic regression (Elastic regression) model
 The code in this section shows how to train, evaluate, and save a logistic regression model with [LBFGS](https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm) that predicts whether or not a tip is paid for a trip in the NYC taxi trip and fare dataset. The model is trained using cross validation (CV) and hyperparameter sweeping implemented with the MLlib CrossValidator pipeline function for CV with parameter sweep.
 
+<!-- -->
+
 > [!NOTE]
 > The execution of this MLlib CV code can take several minutes.
->
->
+
+<!-- -->
 
 # RECORD START TIME
 timestart = datetime.datetime.now()
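The `timestart = datetime.datetime.now()` line shown above is the notebook's cell-timing pattern, which pairs with the "Time taken to execute above cell" output quoted in the hunk headers. A minimal sketch of the full pattern (the end-time computation here is my reconstruction, not the article's exact code):

```python
import datetime

# RECORD START TIME (pattern used throughout the notebook)
timestart = datetime.datetime.now()

# ... the training / CV work of the cell would run here ...
total = sum(i * i for i in range(100_000))

# Compute and report elapsed wall-clock seconds for the cell
timeend = datetime.datetime.now()
timedelta = round((timeend - timestart).total_seconds(), 2)
print("Time taken to execute above cell: " + str(timedelta) + " seconds")
```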
@@ -992,12 +1000,19 @@ These models were described in the introduction. Each model building code sectio
 2. **Model evaluation** on a test data set with metrics
 3. **Saving model** in blob for future consumption
 
-> [!NOTE] Cross-validation is not used with the three regression models in this section, since this was shown in detail for the logistic regression models. An example showing how to use CV with Elastic Net for linear regression is provided in the Appendix of this topic.
+<!-- -->
 
-
-> [!NOTE] In our experience, there can be issues with convergence of LinearRegressionWithSGD models, and parameters need to be changed/optimized carefully for obtaining a valid model. Scaling of variables significantly helps with convergence. Elastic net regression, shown in the Appendix to this topic, can also be used instead of LinearRegressionWithSGD.
->
->
+> [!NOTE]
+> Cross-validation is not used with the three regression models in this section, since this was shown in detail for the logistic regression models. An example showing how to use CV with Elastic Net for linear regression is provided in the Appendix of this topic.
+
+<!-- -->
+
+<!-- -->
+
+> [!NOTE]
+> In our experience, there can be issues with convergence of LinearRegressionWithSGD models, and parameters need to be changed/optimized carefully for obtaining a valid model. Scaling of variables significantly helps with convergence. Elastic net regression, shown in the Appendix to this topic, can also be used instead of LinearRegressionWithSGD.
+
+<!-- -->
 
 ### Linear regression with SGD
 The code in this section shows how to use scaled features to train a linear regression that uses stochastic gradient descent (SGD) for optimization, and how to score, evaluate, and save the model in Azure Blob Storage (WASB).
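The note above says that scaling variables significantly helps SGD convergence. The idea can be shown with a plain-Python standardization sketch (in Spark itself the article's notebooks would use an MLlib scaler such as `StandardScaler`; this stand-in and its sample values are illustrative):

```python
import statistics

def standardize(column):
    """Scale a feature column to zero mean and unit variance, which keeps
    SGD step sizes comparable across features and helps models such as
    LinearRegressionWithSGD converge."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column) or 1.0  # guard against constant columns
    return [(x - mean) / std for x in column]

fares = [3.5, 12.0, 52.0, 8.25, 30.0]   # made-up taxi fare amounts
scaled = standardize(fares)
print(round(statistics.mean(scaled), 6), round(statistics.pstdev(scaled), 6))  # ~0.0 and 1.0
```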
@@ -1060,10 +1075,12 @@ Time taken to execute above cell: 38.62 seconds
 ### Random Forest regression
 The code in this section shows how to train, evaluate, and save a random forest model that predicts tip amount for the NYC taxi trip data.
 
+<!-- -->
+
 > [!NOTE]
 > Cross-validation with parameter sweeping using custom code is provided in the appendix.
->
->
+
+<!-- -->
 
 #PREDICT TIP AMOUNTS USING RANDOM FOREST
 