[SPARK-18865][SPARKR] SparkR vignettes MLP and LDA updates
## What changes were proposed in this pull request?
While doing the QA work, I found the following issues:
1). `spark.mlp` doesn't include an example;
2). `spark.mlp` and `spark.lda` have redundant parameter explanations;
3). the `spark.lda` documentation is missing default values for some parameters.
I also changed the `spark.logit` `regParam` value in the examples, as discussed in apache#16222.
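For context, a call of that shape looks roughly like the sketch below; the data and the `regParam` value are illustrative placeholders, not the values settled on in apache#16222:

```r
library(SparkR)
sparkR.session()

# Illustrative placeholder values only, not the exact example from the patch:
# binary logistic regression on two of the three iris classes, with an
# explicit regularization parameter.
training <- createDataFrame(iris[iris$Species != "virginica", ])
model <- spark.logit(training, Species ~ Sepal_Length + Sepal_Width,
                     regParam = 0.01)
summary(model)
```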
## How was this patch tested?
Manual test
Author: [email protected] <[email protected]>
Closes apache#16284 from wangmiao1981/ks.
(cherry picked from commit 3243885)
Signed-off-by: Felix Cheung <[email protected]>
`R/pkg/vignettes/sparkr-vignettes.Rmd` (26 additions, 30 deletions)
````diff
@@ -636,22 +636,6 @@ To use LDA, we need to specify a `features` column in `data` where each entry represents a document.
 
 * libSVM: Each entry is a collection of words and will be processed directly.
 
-There are several parameters LDA takes for fitting the model.
-
-* `k`: number of topics (default 10).
-
-* `maxIter`: maximum iterations (default 20).
-
-* `optimizer`: optimizer to train an LDA model, "online" (default) uses [online variational inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf). "em" uses [expectation-maximization](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm).
-
-* `subsamplingRate`: For `optimizer = "online"`. Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1] (default 0.05).
-
-* `topicConcentration`: concentration parameter (commonly named beta or eta) for the prior placed on topic distributions over terms, default -1 to set automatically on the Spark side. Use `summary` to retrieve the effective topicConcentration. Only 1-size numeric is accepted.
-
-* `docConcentration`: concentration parameter (commonly named alpha) for the prior placed on documents distributions over topics (theta), default -1 to set automatically on the Spark side. Use `summary` to retrieve the effective docConcentration. Only 1-size or k-size numeric is accepted.
-
-* `maxVocabSize`: maximum vocabulary size, default 1 << 18.
-
 Two more functions are provided for the fitted model.
 
 * `spark.posterior` returns a `SparkDataFrame` containing a column of posterior probabilities vectors named "topicDistribution".
````
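For reference, the parameters removed from the vignette above are still accepted by the API. A minimal sketch of how they are passed to `spark.lda`; the `corpusDF` data frame and the specific values here are illustrative assumptions, not part of this patch:

```r
# corpusDF is an assumed SparkDataFrame whose "features" column holds one
# document per row (character or libSVM entries, as described above).
model <- spark.lda(corpusDF, k = 10, maxIter = 20, optimizer = "online",
                   subsamplingRate = 0.05)

# With topicConcentration/docConcentration left at the default of -1,
# summary() reports the effective values chosen on the Spark side.
summary(model)
```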
````diff
@@ -714,19 +697,32 @@ The number of nodes $N$ in the output layer corresponds to the number of classes.
 
 MLPC employs backpropagation for learning the model. We use the logistic loss function for optimization and L-BFGS as an optimization routine.
 
-`spark.mlp` requires at least two columns in `data`: one named `"label"` and the other one `"features"`. The `"features"` column should be in libSVM-format. According to the description above, there are several additional parameters that can be set:
-
-* `layers`: integer vector containing the number of nodes for each layer.
-
-* `maxIter`: maximum iteration number.
-
-* `tol`: convergence tolerance of iterations.
-
-* `stepSize`: step size for `"gd"`.
+`spark.mlp` requires at least two columns in `data`: one named `"label"` and the other one `"features"`. The `"features"` column should be in libSVM-format.
+
+We use iris data set to show how to use `spark.mlp` in classification.
+```{r, warning=FALSE}
+df <- createDataFrame(iris)
+# fit a Multilayer Perceptron Classification Model
````
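For reference, a rough sketch of what a `spark.mlp` fit looks like using the parameter names from the removed list; `trainDF` and all settings here are illustrative assumptions, not the code merged in this patch:

```r
# trainDF is an assumed SparkDataFrame with a "label" column and a
# libSVM-style "features" column, as the vignette text requires.
# layers: 4 input nodes, one hidden layer of 5 nodes, 3 output classes.
model <- spark.mlp(trainDF, layers = c(4, 5, 3), maxIter = 100,
                   tol = 1e-6, stepSize = 0.03, seed = 1)

# summary() shows the fitted layer sizes and weights.
summary(model)
```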