
Commit 9f0e3be

wangmiao1981 authored and mengxr committed
[SPARK-18797][SPARKR] Update spark.logit in sparkr-vignettes
## What changes were proposed in this pull request?

spark.logit is added in 2.1. We need to update spark-vignettes to reflect the changes. This is part of SparkR QA work.

## How was this patch tested?

Manual build html. Please see attached image for the result.

![test](https://cloud.githubusercontent.com/assets/5033592/21032237/01b565fe-bd5d-11e6-8b59-4de4b6ef611d.jpeg)

Author: [email protected] <[email protected]>

Closes apache#16222 from wangmiao1981/veg.

(cherry picked from commit 2aa16d0)
Signed-off-by: Xiangrui Meng <[email protected]>
1 parent 9dc5fa5 commit 9f0e3be

File tree

1 file changed: 38 additions, 7 deletions


R/pkg/vignettes/sparkr-vignettes.Rmd

Lines changed: 38 additions & 7 deletions
@@ -565,7 +565,7 @@ head(aftPredictions)
 
 #### Gaussian Mixture Model
 
-(Coming in 2.1.0)
+(Added in 2.1.0)
 
 `spark.gaussianMixture` fits multivariate [Gaussian Mixture Model](https://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model) (GMM) against a `SparkDataFrame`. [Expectation-Maximization](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) (EM) is used to approximate the maximum likelihood estimator (MLE) of the model.
 
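As an aside, not part of this patch, a minimal sketch of calling `spark.gaussianMixture`; the toy data, the `V1`/`V2` column names, and `k = 2` are illustrative assumptions:

```{r}
# Illustrative two-cluster data; V1/V2 are placeholder column names.
toy <- data.frame(V1 = c(-0.1, 0.1, 9.8, 10.2), V2 = c(-0.2, 0.3, 10.1, 9.9))
df <- createDataFrame(toy)
# Fit a two-component GMM on both columns and inspect the mixture parameters.
gmmModel <- spark.gaussianMixture(df, ~ V1 + V2, k = 2)
summary(gmmModel)
head(predict(gmmModel, df))
```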
@@ -584,7 +584,7 @@ head(select(gmmFitted, "V1", "V2", "prediction"))
 
 #### Latent Dirichlet Allocation
 
-(Coming in 2.1.0)
+(Added in 2.1.0)
 
 `spark.lda` fits a [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) model on a `SparkDataFrame`. It is often used in topic modeling in which topics are inferred from a collection of text documents. LDA can be thought of as a clustering algorithm as follows:
 
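Likewise as an aside, not part of this patch, a minimal sketch of `spark.lda`; the toy corpus and `k = 3` are illustrative, and the text is assumed to live in the default `features` column:

```{r}
# Illustrative corpus; "features" is assumed to be the default input column name.
corpus <- createDataFrame(data.frame(features = c(
  "spark sparkr vignette",
  "latent dirichlet allocation topics",
  "gaussian mixture model clustering")))
ldaModel <- spark.lda(corpus, k = 3, maxIter = 10)
summary(ldaModel)
# Posterior topic distribution and perplexity on the same data.
posterior <- spark.posterior(ldaModel, corpus)
spark.perplexity(ldaModel, corpus)
```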
@@ -657,7 +657,7 @@ perplexity
 
 #### Multilayer Perceptron
 
-(Coming in 2.1.0)
+(Added in 2.1.0)
 
 Multilayer perceptron classifier (MLPC) is a classifier based on the [feedforward artificial neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network). MLPC consists of multiple layers of nodes. Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs by a linear combination of the inputs with the node’s weights $w$ and bias $b$ and applying an activation function. This can be written in matrix form for MLPC with $K+1$ layers as follows:
 $$
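As an aside, not part of this patch, a minimal sketch of `spark.mlp`, assuming the formula interface; the layer sizes `c(4, 5, 3)` (4 iris features, a hidden layer of 5 units, 3 species) and `maxIter = 50` are illustrative:

```{r}
# Illustrative network: 4 inputs, one hidden layer of 5 units, 3 output classes.
df <- createDataFrame(iris)
mlpModel <- spark.mlp(df, Species ~ ., layers = c(4, 5, 3), maxIter = 50)
head(predict(mlpModel, df))
```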
@@ -694,7 +694,7 @@ MLPC employs backpropagation for learning the model. We use the logistic loss fu
 
 #### Collaborative Filtering
 
-(Coming in 2.1.0)
+(Added in 2.1.0)
 
 `spark.als` learns latent factors in [collaborative filtering](https://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering) via [alternating least squares](http://dl.acm.org/citation.cfm?id=1608614).
 
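As an aside, not part of this patch, a minimal sketch of `spark.als` on a tiny explicit-ratings frame; the data and the small `maxIter` are illustrative, and the column names follow the documented defaults:

```{r}
# Illustrative explicit ratings with the default user/item/rating column names.
ratings <- data.frame(user   = c(0, 0, 1, 1, 2),
                      item   = c(0, 1, 1, 2, 2),
                      rating = c(4.0, 2.0, 3.0, 4.0, 5.0))
df <- createDataFrame(ratings)
alsModel <- spark.als(df, "rating", "user", "item", maxIter = 5)
summary(alsModel)
head(predict(alsModel, df))
```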
@@ -725,7 +725,7 @@ head(predicted)
 
 #### Isotonic Regression Model
 
-(Coming in 2.1.0)
+(Added in 2.1.0)
 
 `spark.isoreg` fits an [Isotonic Regression](https://en.wikipedia.org/wiki/Isotonic_regression) model against a `SparkDataFrame`. It solves a weighted univariate a regression problem under a complete order constraint. Specifically, given a set of real observed responses $y_1, \ldots, y_n$, corresponding real features $x_1, \ldots, x_n$, and optionally positive weights $w_1, \ldots, w_n$, we want to find a monotone (piecewise linear) function $f$ to minimize
 $$
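As an aside, not part of this patch, a minimal sketch of fitting `spark.isoreg` on a toy, roughly monotone series; the data and the `x`/`y` column names are illustrative:

```{r}
# Illustrative data: a noisy but mostly increasing response y over x.
toy <- data.frame(x = 1:6, y = c(1.0, 2.0, 3.5, 3.0, 5.0, 6.0))
df <- createDataFrame(toy)
isoModel <- spark.isoreg(df, y ~ x)
# Fitted values are constrained to be monotone in x.
head(predict(isoModel, df))
```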
@@ -768,8 +768,39 @@ newDF <- createDataFrame(data.frame(x = c(1.5, 3.2)))
 head(predict(isoregModel, newDF))
 ```
 
-#### What's More?
-We also expect Decision Tree, Random Forest, Kolmogorov-Smirnov Test coming in the next version 2.1.0.
+### Logistic Regression Model
+
+(Added in 2.1.0)
+
+[Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) is a widely-used model when the response is categorical. It can be seen as a special case of the [Generalized Linear Predictive Model](https://en.wikipedia.org/wiki/Generalized_linear_model).
+We provide `spark.logit` on top of `spark.glm` to support logistic regression with advanced hyper-parameters.
+It supports both binary and multiclass classification with elastic-net regularization and feature standardization, similar to `glmnet`.
+
+We use a simple example to demonstrate `spark.logit` usage. In general, there are three steps of using `spark.logit`:
+1). Create a dataframe from a proper data source; 2). Fit a logistic regression model using `spark.logit` with a proper parameter setting;
+and 3). Obtain the coefficient matrix of the fitted model using `summary` and use the model for prediction with `predict`.
+
+Binomial logistic regression
+```{r, warning=FALSE}
+df <- createDataFrame(iris)
+# Create a DataFrame containing two classes
+training <- df[df$Species %in% c("versicolor", "virginica"), ]
+model <- spark.logit(training, Species ~ ., regParam = 0.5)
+summary(model)
+```
+
+Predict values on training data
+```{r}
+fitted <- predict(model, training)
+```
+
+Multinomial logistic regression against three classes
+```{r, warning=FALSE}
+df <- createDataFrame(iris)
+# Note in this case, Spark infers it is multinomial logistic regression, so family = "multinomial" is optional.
+model <- spark.logit(df, Species ~ ., regParam = 0.5)
+summary(model)
+```
 
 ### Model Persistence
 The following example shows how to save/load an ML model by SparkR.
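The `spark.logit` model added above fits the same save/load pattern; as an aside, not part of this patch, a minimal sketch using `write.ml`/`read.ml`, where the temporary path is a placeholder:

```{r}
# Illustrative persistence round-trip for the fitted logistic regression model.
modelPath <- tempfile(pattern = "spark-logit", fileext = ".tmp")
write.ml(model, modelPath)
savedModel <- read.ml(modelPath)
summary(savedModel)
unlink(modelPath, recursive = TRUE)
```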
