This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Commit 0d94201

wangmiao1981 authored and Felix Cheung committed
[SPARK-18865][SPARKR] SparkR vignettes MLP and LDA updates
## What changes were proposed in this pull request?

While doing the QA work, I found the following issues:

1. `spark.mlp` doesn't include an example;
2. `spark.mlp` and `spark.lda` have redundant parameter explanations;
3. the `spark.lda` documentation misses default values for some parameters.

I also changed the `spark.logit` `regParam` in the examples, as we discussed in apache#16222.

## How was this patch tested?

Manual test

Author: [email protected] <[email protected]>

Closes apache#16284 from wangmiao1981/ks.

(cherry picked from commit 3243885)
Signed-off-by: Felix Cheung <[email protected]>
1 parent 280c35a commit 0d94201

File tree

1 file changed: +26 -30 lines changed


R/pkg/vignettes/sparkr-vignettes.Rmd

Lines changed: 26 additions & 30 deletions
@@ -636,22 +636,6 @@ To use LDA, we need to specify a `features` column in `data` where each entry re

 * libSVM: Each entry is a collection of words and will be processed directly.

-There are several parameters LDA takes for fitting the model.
-
-* `k`: number of topics (default 10).
-
-* `maxIter`: maximum iterations (default 20).
-
-* `optimizer`: optimizer to train an LDA model, "online" (default) uses [online variational inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf). "em" uses [expectation-maximization](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm).
-
-* `subsamplingRate`: For `optimizer = "online"`. Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1] (default 0.05).
-
-* `topicConcentration`: concentration parameter (commonly named beta or eta) for the prior placed on topic distributions over terms, default -1 to set automatically on the Spark side. Use `summary` to retrieve the effective topicConcentration. Only 1-size numeric is accepted.
-
-* `docConcentration`: concentration parameter (commonly named alpha) for the prior placed on documents distributions over topics (theta), default -1 to set automatically on the Spark side. Use `summary` to retrieve the effective docConcentration. Only 1-size or k-size numeric is accepted.
-
-* `maxVocabSize`: maximum vocabulary size, default 1 << 18.
-
 Two more functions are provided for the fitted model.

 * `spark.posterior` returns a `SparkDataFrame` containing a column of posterior probabilities vectors named "topicDistribution".
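
For reference, the bullets deleted above describe arguments that `spark.lda` still accepts (they remain documented in `?spark.lda`). A minimal sketch of passing them explicitly, assuming a `corpusDF` SparkDataFrame with a text `features` column like the one used later in the vignette:

```r
# Sketch only: a spark.lda call spelling out the parameters the deleted
# bullets describe. `corpusDF` is assumed to hold a "features" column.
model <- spark.lda(corpusDF,
                   k = 10,                   # number of topics (default 10)
                   maxIter = 20,             # maximum iterations (default 20)
                   optimizer = "online",     # or "em"
                   subsamplingRate = 0.05,   # used by the "online" optimizer
                   topicConcentration = -1,  # -1: set automatically by Spark
                   docConcentration = -1,    # -1: set automatically by Spark
                   maxVocabSize = bitwShiftL(1, 18))  # default 1 << 18

# summary() reports the effective (auto-chosen) concentration parameters.
summary(model)
```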
@@ -690,7 +674,6 @@ perplexity <- spark.perplexity(model, corpusDF)
 perplexity
 ```

-
 #### Multilayer Perceptron

 (Added in 2.1.0)
@@ -714,19 +697,32 @@ The number of nodes $N$ in the output layer corresponds to the number of classes

 MLPC employs backpropagation for learning the model. We use the logistic loss function for optimization and L-BFGS as an optimization routine.

-`spark.mlp` requires at least two columns in `data`: one named `"label"` and the other one `"features"`. The `"features"` column should be in libSVM-format. According to the description above, there are several additional parameters that can be set:
-
-* `layers`: integer vector containing the number of nodes for each layer.
-
-* `solver`: solver parameter, supported options: `"gd"` (minibatch gradient descent) or `"l-bfgs"`.
+`spark.mlp` requires at least two columns in `data`: one named `"label"` and the other one `"features"`. The `"features"` column should be in libSVM-format.

-* `maxIter`: maximum iteration number.
-
-* `tol`: convergence tolerance of iterations.
-
-* `stepSize`: step size for `"gd"`.
+We use iris data set to show how to use `spark.mlp` in classification.
+```{r, warning=FALSE}
+df <- createDataFrame(iris)
+# fit a Multilayer Perceptron Classification Model
+model <- spark.mlp(df, Species ~ ., blockSize = 128, layers = c(4, 3), solver = "l-bfgs", maxIter = 100, tol = 0.5, stepSize = 1, seed = 1, initialWeights = c(0, 0, 0, 0, 0, 5, 5, 5, 5, 5, 9, 9, 9, 9, 9))
+```

-* `seed`: seed parameter for weights initialization.
+To avoid lengthy display, we only present partial results of the model summary. You can check the full result from your sparkR shell.
+```{r, include=FALSE}
+ops <- options()
+options(max.print=5)
+```
+```{r}
+# check the summary of the fitted model
+summary(model)
+```
+```{r, include=FALSE}
+options(ops)
+```
+```{r}
+# make predictions use the fitted model
+predictions <- predict(model, df)
+head(select(predictions, predictions$prediction))
+```

 #### Collaborative Filtering

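The deleted `solver`/`stepSize` bullets noted a `"gd"` (minibatch gradient descent) option alongside the default `"l-bfgs"`. A hedged sketch of that variant, reusing the iris setup from the added example; the hidden-layer size of 5 is an illustrative choice, not from the patch:

```r
# Sketch only: spark.mlp with the "gd" solver; stepSize applies to "gd".
# layers = c(4, 5, 3): 4 iris features in, one hidden layer of 5, 3 classes out.
df <- createDataFrame(iris)
model_gd <- spark.mlp(df, Species ~ ., layers = c(4, 5, 3),
                      solver = "gd", stepSize = 0.03,
                      maxIter = 100, tol = 1e-6, seed = 1)
summary(model_gd)
```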
@@ -821,7 +817,7 @@ Binomial logistic regression
 df <- createDataFrame(iris)
 # Create a DataFrame containing two classes
 training <- df[df$Species %in% c("versicolor", "virginica"), ]
-model <- spark.logit(training, Species ~ ., regParam = 0.5)
+model <- spark.logit(training, Species ~ ., regParam = 0.00042)
 summary(model)
 ```

@@ -834,7 +830,7 @@ Multinomial logistic regression against three classes
 ```{r, warning=FALSE}
 df <- createDataFrame(iris)
 # Note in this case, Spark infers it is multinomial logistic regression, so family = "multinomial" is optional.
-model <- spark.logit(df, Species ~ ., regParam = 0.5)
+model <- spark.logit(df, Species ~ ., regParam = 0.056)
 summary(model)
 ```
