
Commit da79944

j143 authored and Baunsgaard committed
[MINOR][DOCS] Algorithms Docs Review Edit
- tweak heading numbering and tex syntax
- clustering header site name and table link
- minor changes to regression
- minor tweaks in matrix factorization
- minor tweaks in survival analysis
- correct the syntax
- correct the syntax for descriptive statistics
1 parent 2cd9efd commit da79944

7 files changed: +205 −187 lines changed

docs/site/algorithms-classification.md

Lines changed: 44 additions & 45 deletions
@@ -46,7 +46,7 @@ Just as linear regression estimates the mean value $\mu_i$ of a
 numerical response variable, logistic regression does the same for
 category label probabilities. In linear regression, the mean of $y_i$ is
 estimated as a linear combination of the features:
-$$\mu_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m} = \beta_0 + x_i\beta_{1:m}$$.
+$\mu_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m} = \beta_0 + x_i\beta_{1:m}$.
 In logistic regression, the label probability has to lie between 0
 and 1, so a link function is applied to connect it to
 $\beta_0 + x_i\beta_{1:m}$. If there are just two possible category
@@ -59,10 +59,10 @@ Prob[y_i\,{=}\,0\mid x_i; \beta] \,=\,
 \frac{1}{1 + e^{\,\beta_0 + x_i\beta_{1:m}}}$$
 
 Here category label 0
-serves as the *baseline*, and function $$\exp(\beta_0 + x_i\beta_{1:m})$$
+serves as the *baseline*, and function $\exp(\beta_0 + x_i\beta_{1:m})$
 shows how likely we expect to see "$y_i = 1$" in comparison to the
 baseline. Like in a loaded coin, the predicted odds of seeing 1 versus 0
-are $$\exp(\beta_0 + x_i\beta_{1:m})$$ to 1, with each feature $$x_{i,j}$$
+are $\exp(\beta_0 + x_i\beta_{1:m})$ to 1, with each feature $x_{i,j}$
 multiplying its own factor $\exp(\beta_j x_{i,j})$ to the odds. Given a
 large collection of pairs $(x_i, y_i)$, $i=1\ldots n$, logistic
 regression seeks to find the $\beta_j$’s that maximize the product of
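
Editor's note (not part of the commit): the odds-to-probability relationship this hunk describes is easy to sanity-check numerically. The sketch below uses made-up coefficients and a made-up feature vector, purely to illustrate that the odds $\exp(\beta_0 + x_i\beta_{1:m})$ and the two class probabilities are consistent.

```python
import numpy as np

# Hypothetical fitted parameters: intercept beta_0 and coefficients beta_{1:m}.
beta_0 = -1.0
beta = np.array([0.8, -0.5, 0.3])
x_i = np.array([1.2, 0.4, 2.0])   # one feature vector

eta = beta_0 + x_i @ beta         # linear predictor beta_0 + x_i * beta_{1:m}
odds = np.exp(eta)                # predicted odds of y_i = 1 versus the baseline y_i = 0
p1 = odds / (1.0 + odds)          # Prob[y_i = 1 | x_i; beta]
p0 = 1.0 / (1.0 + odds)           # Prob[y_i = 0 | x_i; beta]
print(odds, p1, p0, p1 + p0)      # the two probabilities sum to 1
```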
@@ -76,11 +76,11 @@ $k \geq 3$ possible categories. Again we identify one category as the
 baseline, for example the $k$-th category. Instead of a coin, here we
 have a loaded multisided die, one side per category. Each non-baseline
 category $l = 1\ldots k\,{-}\,1$ has its own vector
-$$(\beta_{0,l}, \beta_{1,l}, \ldots, \beta_{m,l})$$ of regression
+$(\beta_{0,l}, \beta_{1,l}, \ldots, \beta_{m,l})$ of regression
 parameters with the intercept, making up a matrix $B$ of size
 $(m\,{+}\,1)\times(k\,{-}\,1)$. The predicted odds of seeing
 non-baseline category $l$ versus the baseline $k$ are
-$$\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)$$
+$\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)$
 to 1, and the predicted probabilities are:
 
 $$
@@ -99,7 +99,7 @@ $$
 
 The goal of the regression
 is to estimate the parameter matrix $B$ from the provided dataset
-$(X, Y) = (x_i, y_i)_{i=1}^n$ by maximizing the product of $$Prob[y_i\mid x_i; B]$$ over the
+$(X, Y) = (x_i, y_i)_{i=1}^n$ by maximizing the product of $Prob[y_i\mid x_i; B]$ over the
 observed labels $y_i$. Taking its logarithm, negating, and adding a
 regularization term gives us a minimization objective:
 
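
Editor's note (not part of the commit): for readers reconstructing Eqs. (1) and (2) around this hunk, here is a minimal NumPy sketch of how the class probabilities and the (unregularized) negative log-likelihood could be evaluated for a given parameter matrix $B$. All sizes and values are invented; this is not code from the SystemDS scripts.

```python
import numpy as np

n, m, k = 5, 3, 4                      # examples, features, categories; category k is the baseline
rng = np.random.default_rng(0)
X = rng.normal(size=(n, m))            # feature matrix
y = rng.integers(1, k + 1, size=n)     # observed labels in 1..k
B = rng.normal(size=(m + 1, k - 1))    # one column per non-baseline category; last row holds the intercepts

# Linear scores beta_{0,l} + sum_j x_{i,j} beta_{j,l}; the baseline gets a fixed score of 0.
scores = X @ B[:m] + B[m]              # shape (n, k-1)
scores = np.hstack([scores, np.zeros((n, 1))])

# Predicted probabilities: exponentiate and normalize, with exp(0) = 1 for the baseline.
probs = np.exp(scores)
probs /= probs.sum(axis=1, keepdims=True)

# Product of Prob[y_i | x_i; B] over observed labels, taken as a negative log for minimization.
nll = -np.log(probs[np.arange(n), y - 1]).sum()
print(probs.sum(axis=1), nll)          # each row of probs sums to 1
```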
@@ -114,7 +114,7 @@ $$
 
 The optional regularization term is added to
 mitigate overfitting and degeneracy in the data; to reduce bias, the
-intercepts $$\beta_{0,l}$$ are not regularized. Once the $\beta_{j,l}$’s
+intercepts $\beta_{0,l}$ are not regularized. Once the $\beta_{j,l}$’s
 are accurately estimated, we can make predictions about the category
 label $y$ for a new feature vector $x$ using
 Eqs. (1) and (2).
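
Editor's note (not part of the commit): the minimization objective referred to just above appears on the page but not in this hunk. Up to notation, it has the following shape; this display is a paraphrase added for context, with the inner sum starting at $j=1$ so that the intercepts $\beta_{0,l}$ stay unregularized:

$$f(B) \;=\; -\sum_{i=1}^{n} \log Prob[y_i\mid x_i; B] \;+\; \frac{\lambda}{2} \sum_{j=1}^{m}\sum_{l=1}^{k-1} \beta_{j,l}^2$$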
@@ -154,7 +154,7 @@ Newton method for logistic regression described in [[Lin2008]](algorithms-biblio
 For convenience, let us make some changes in notation:
 
 - Convert the input vector of observed category labels into an indicator
-matrix $Y$ of size $n \times k$ such that $$Y_{i, l} = 1$$ if the $i$-th
+matrix $Y$ of size $n \times k$ such that $Y_{i, l} = 1$ if the $i$-th
 category label is $l$ and $Y_{i, l} = 0$ otherwise.
 - Append an extra column of all ones, i.e. $(1, 1, \ldots, 1)^T$, as the
 $m\,{+}\,1$-st column to the feature matrix $X$ to represent the
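
Editor's note (not part of the commit): the two notation changes in this hunk, the indicator matrix $Y$ and the appended column of ones, can be mirrored in a few lines of NumPy. The data below is invented and only illustrates the shapes involved.

```python
import numpy as np

labels = np.array([2, 1, 3, 2])        # category labels y_i in 1..k
n, k, m = labels.size, 3, 2
X = np.arange(n * m, dtype=float).reshape(n, m)

# Indicator matrix: Y[i, l] = 1 if the i-th category label is l + 1, and 0 otherwise.
Y = np.zeros((n, k))
Y[np.arange(n), labels - 1] = 1.0

# Append a column of ones as the (m+1)-st column of X to represent the intercept.
X_ext = np.hstack([X, np.ones((n, 1))])
print(Y)
print(X_ext.shape)                     # (n, m + 1)
```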
@@ -203,30 +203,28 @@ $$\varDelta f(S; B) \,\,\,\approx\,\,\, (1/2)\,\,{\textstyle\sum}\,\,S \cdot \ma
 
 This approximation is then
 minimized by trust-region conjugate gradient iterations (the *inner*
-iterations) subject to the constraint
-$\|S\|_2 \leq \delta$
-. The trust
-region size $\delta$ is initialized as
-$0.5\sqrt{m}\,/ \max_i \|x_i\|_2$
-and updated as described
-in [[Lin2008]](algorithms-bibliography.html).
+iterations) subject to the constraint
+
+$$\|S\|_2 \leq \delta$$
+
+The trust
+region size $\delta$ is initialized as $0.5\sqrt{m}\,/ \max_i \|x_i\|_2$
+and updated as described in [[Lin2008]](algorithms-bibliography.html).
 Users can specify the maximum number of the outer
 and the inner iterations with input parameters `moi` and
 `mii`, respectively. The iterative minimizer terminates
-successfully if
+successfully if
+
 $$\|\nabla f\|_2 < \varepsilon \|\nabla f_{B=0} \|_2$$
-, where ${\varepsilon}> 0$ is a tolerance supplied by the user via input
-parameter `tol`.
+
+, where ${\varepsilon}> 0$ is a tolerance supplied by the user via input parameter `tol`.
 
 ### Returns
 
-The estimated regression parameters (the
-$$\hat{\beta}_{j, l}$$)
-are
+The estimated regression parameters (the $\hat{\beta}_{j, l}$) are
 populated into a matrix and written to an HDFS file whose path/name was
 provided as the `B` input argument. Only the non-baseline
-categories ($1\leq l \leq k\,{-}\,1$) have their
-$$\hat{\beta}_{j, l}$$
+categories ($1\leq l \leq k\,{-}\,1$) have their $\hat{\beta}_{j, l}$
 in the output; to add the baseline category, just append a column of zeros.
 If `icpt=0` in the input command line, no intercepts are used
 and `B` has size
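
Editor's note (not part of the commit): the two scalar quantities this hunk reflows, the initial trust-region radius and the outer-loop stopping test, can be written out as below. Variable names and data are invented; the actual logic lives in the DML script, this is only a reading aid.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 100, 8
X = rng.normal(size=(n, m))            # stand-in feature matrix
eps = 1e-6                             # plays the role of the `tol` input parameter

# Initial trust-region radius: 0.5 * sqrt(m) / max_i ||x_i||_2
delta0 = 0.5 * np.sqrt(m) / np.max(np.linalg.norm(X, axis=1))

# Outer-iteration termination: ||grad f||_2 < eps * ||grad f at B = 0||_2
def converged(grad, grad_at_zero):
    return np.linalg.norm(grad) < eps * np.linalg.norm(grad_at_zero)

print(delta0)
```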
@@ -290,7 +288,7 @@ specified.
 
 * * *
 
-### 2.2.2 Multi-Class Support Vector Machines
+### Multi-Class Support Vector Machines
 
 #### Multi SVM Description
 
@@ -429,19 +427,20 @@ determine which test to include, is to compare impurities of the tree
 nodes induced by the test. The *node impurity* measures the
 homogeneity of the labels at the node. This implementation supports two
 commonly used impurity measures (denoted by $\mathcal{I}$):
-*Entropy* $$\mathcal{E}=\sum_{i=1}^{C}-f_i \log f_i$$, as
+*Entropy*
+$\mathcal{E}=\sum_{i=1}^{C}-f_i \log f_i$, as
 well as *Gini impurity*
-$$\mathcal{G}=\sum_{i=1}^{C}f_i (1-f_i)$$, where $C$ denotes the number of
+$\mathcal{G}=\sum_{i=1}^{C}f_i (1-f_i)$, where $C$ denotes the number of
 unique labels and $f_i$ is the frequency of label $i$. Once the impurity
 at the tree nodes has been obtained, the *best split* is
 chosen from a set of possible splits that maximizes the
 *information gain* at the node, i.e.,
-$$\arg\max_{s}\mathcal{IG}(X,s)$$, where $\mathcal{IG}(X,s)$ denotes the
+$\arg\max_{s}\mathcal{IG}(X,s)$, where $\mathcal{IG}(X,s)$ denotes the
 information gain when the splitting test $s$ partitions the feature
 matrix $X$. Assuming that $s$ partitions $X$ that contains $N$ feature
-vectors into $$X_\text{left}$$ and $$X_\text{right}$$ each including
-$$N_\text{left}$$ and $$N_\text{right}$$ feature vectors, respectively,
-$$\mathcal{IG}(X,s)$$ is given by
+vectors into $X_\text{left}$ and $X_\text{right}$ each including
+$N_\text{left}$ and $N_\text{right}$ feature vectors, respectively,
+$\mathcal{IG}(X,s)$ is given by
 
 $$\mathcal{IG}(X,s)=\mathcal{I}(X)-\frac{N_\text{left}}{N}\mathcal{I}(X_\text{left})-\frac{N_\text{right}}{N}\mathcal{I}(X_\text{right})$$
 
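
Editor's note (not part of the commit): the entropy, Gini impurity, and information-gain formulas restated in this hunk can be checked with the short sketch below. The label vector and the candidate split are made up.

```python
import numpy as np

def entropy(labels):
    # E = sum_i -f_i * log(f_i) over the label frequencies f_i
    _, counts = np.unique(labels, return_counts=True)
    f = counts / counts.sum()
    return float(-(f * np.log(f)).sum())

def gini(labels):
    # G = sum_i f_i * (1 - f_i)
    _, counts = np.unique(labels, return_counts=True)
    f = counts / counts.sum()
    return float((f * (1.0 - f)).sum())

def information_gain(labels, mask, impurity=entropy):
    # IG(X, s) = I(X) - N_left/N * I(X_left) - N_right/N * I(X_right)
    left, right = labels[mask], labels[~mask]
    n = labels.size
    return impurity(labels) \
        - left.size / n * impurity(left) \
        - right.size / n * impurity(right)

labels = np.array([1, 1, 2, 2, 2, 3])
split = np.array([True, True, False, False, False, False])  # a candidate splitting test s
print(entropy(labels), gini(labels), information_gain(labels, split))
```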
@@ -504,10 +503,10 @@ which results in the maximum information gain is then selected.
 in a matrix $M$ that contains at least 6 rows. Each column in the matrix
 contains the parameters relevant to a single node in the tree. Note that
 for building the tree model, our implementation splits the feature
-matrix $X$ into $$X_\text{cont}$$ containing continuous-valued features
-and $$X_\text{cat}$$ containing categorical features. In the following,
+matrix $X$ into $X_\text{cont}$ containing continuous-valued features
+and $X_\text{cat}$ containing categorical features. In the following,
 the continuous-valued (resp. categorical) feature-ids correspond to the
-indices of the features in $$X_\text{cont}$$ (resp. $$X_\text{cat}$$).
+indices of the features in $X_\text{cont}$ (resp. $X_\text{cat}$).
 Moreover, we refer to an internal node as a continuous-valued
 (categorical) node if the feature that this nodes looks at is
 continuous-valued (categorical). Below is a description of what each row
@@ -518,8 +517,8 @@ in the matrix contains.
 - Row 2: for internal nodes stores the offsets (the number of columns)
 in $M$ to the left child, and otherwise `0`.
 - Row 3: stores the feature index of the feature (id of a
-continuous-valued feature in $$X_\text{cont}$$ if the feature is
-continuous-valued or id of a categorical feature in $$X_\text{cat}$$
+continuous-valued feature in $X_\text{cont}$ if the feature is
+continuous-valued or id of a categorical feature in $X_\text{cat}$
 if the feature is categorical) that this node looks at if the node
 is an internal node, otherwise `0`.
 - Row 4: store the type of the feature that this node looks at if the
@@ -547,7 +546,7 @@ its matrix representation.
 
 #### Figure 2
 
-**Figure 2**: (a) An example tree and its (b) matrix representation. $x$ denotes an example and $x_j$ is the value of the $j$th continuous-valued (resp. categorical) feature in $$X_\text{cont}$$ (resp. $$X_\text{cat}$$). In this example all leaf nodes are pure and no training example is misclassified.
+**Figure 2**: (a) An example tree and its (b) matrix representation. $x$ denotes an example and $x_j$ is the value of the $j$th continuous-valued (resp. categorical) feature in $X_\text{cont}$ (resp. $X_\text{cat}$). In this example all leaf nodes are pure and no training example is misclassified.
 
 (a) ![Figure 2](../img/algorithms-reference/example-tree.png "Figure 2")
 
@@ -570,10 +569,10 @@ its matrix representation.
 The matrix corresponding to the learned model as well as the training
 accuracy (if requested) is written to a file in the format specified.
 See details where the structure of the model matrix is described. Recall
-that in our implementation $X$ is split into $$X_\text{cont}$$ and
-$$X_\text{cat}$$. If requested, the mappings of the continuous-valued
-feature-ids in $$X_\text{cont}$$ (stored at `S_map`) and the
-categorical feature-ids in $$X_\text{cat}$$ (stored at
+that in our implementation $X$ is split into $X_\text{cont}$ and
+$X_\text{cat}$. If requested, the mappings of the continuous-valued
+feature-ids in $X_\text{cont}$ (stored at `S_map`) and the
+categorical feature-ids in $X_\text{cat}$ (stored at
 `C_map`) to the global feature-ids in $X$ will be provided.
 Depending on what arguments are provided during invocation, the
 `decision-tree-predict.dml` script may compute one or more of
@@ -641,15 +640,15 @@ with the difference that the tree-ids are stored in the second row and
 rows $2,3,\ldots$ from the decision tree model are shifted by one. See
 [Decision Trees](algorithms-classification.html#decision-trees) for a description of the model.
 
-### Returns
+### Random Forests Returns
 
 The matrix corresponding to the learned model is written to a file in
 the format specified. See [Decision Trees](algorithms-classification.html#decision-trees) where the
 details about the structure of the model matrix is described. Similar to
-`decision-tree.dml`, $X$ is split into $$X_\text{cont}$$ and
-$$X_\text{cat}$$. If requested, the mappings of the continuous feature-ids
-in $$X_\text{cont}$$ (stored at `S_map`) as well as the
-categorical feature-ids in $$X_\text{cat}$$ (stored at
+`decision-tree.dml`, $X$ is split into $X_\text{cont}$ and
+$X_\text{cat}$. If requested, the mappings of the continuous feature-ids
+in $X_\text{cont}$ (stored at `S_map`) as well as the
+categorical feature-ids in $X_\text{cat}$ (stored at
 `C_map`) to the global feature-ids in $X$ will be provided.
 The `random-forest-predict.dml` script may compute one or
 more of predictions, accuracy, confusion matrix, and `OOB` error estimate

docs/site/algorithms-clustering.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 ---
 layout: site
-title: SystemDS Algorithms Reference Clustering
+title: Algorithms Reference Clustering
 ---
 <!--
 {% comment %}
@@ -348,4 +348,4 @@ best WCSS value, as well as some information about the performance of
 the other runs, is printed during the script execution. The scoring
 script `Kmeans-predict.dml` prints all its results in a
 self-explanatory manner, as defined in
-[**Table 6**](table-6).
+[**Table 6**](#table-6).
