@@ -46,7 +46,7 @@ Just as linear regression estimates the mean value $\mu_i$ of a
numerical response variable, logistic regression does the same for
category label probabilities. In linear regression, the mean of $y_i$ is
estimated as a linear combination of the features:
- $$\mu_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m} = \beta_0 + x_i\beta_{1:m}$$.
+ $\mu_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m} = \beta_0 + x_i\beta_{1:m}$.
In logistic regression, the label probability has to lie between 0
and 1, so a link function is applied to connect it to
$\beta_0 + x_i\beta_{1:m}$. If there are just two possible category
@@ -59,10 +59,10 @@ Prob[y_i\,{=}\,0\mid x_i; \beta] \,=\,
\frac{1}{1 + e^{\,\beta_0 + x_i\beta_{1:m}}} $$

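For concreteness, the following Python/NumPy sketch (an illustration only, not the DML script; the values of `x_i` and the $\beta$'s are made up) evaluates these two probabilities and the corresponding odds for a single feature vector:

```python
import numpy as np

# Made-up feature vector x_i (m = 3 features) and parameters:
# beta0 is the intercept, beta[j-1] plays the role of beta_j.
x_i = np.array([0.5, -1.2, 2.0])
beta0 = 0.1
beta = np.array([0.8, -0.3, 0.05])

eta = beta0 + x_i @ beta                 # linear predictor beta_0 + x_i * beta_{1:m}
p1 = np.exp(eta) / (1.0 + np.exp(eta))   # Prob[y_i = 1 | x_i; beta]
p0 = 1.0 / (1.0 + np.exp(eta))           # Prob[y_i = 0 | x_i; beta]

print(p1, p0, p1 + p0)   # the two probabilities sum to 1
print(np.exp(eta))       # odds of observing label 1 versus label 0
```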

Here category label 0
- serves as the *baseline*, and function $$\exp(\beta_0 + x_i\beta_{1:m})$$
+ serves as the *baseline*, and function $\exp(\beta_0 + x_i\beta_{1:m})$
shows how likely we expect to see "$y_i = 1$" in comparison to the
baseline. As with a loaded coin, the predicted odds of seeing 1 versus 0
- are $$\exp(\beta_0 + x_i\beta_{1:m})$$ to 1, with each feature $$x_{i,j}$$
+ are $\exp(\beta_0 + x_i\beta_{1:m})$ to 1, with each feature $x_{i,j}$
multiplying the odds by its own factor $\exp(\beta_j x_{i,j})$. Given a
large collection of pairs $(x_i, y_i)$, $i=1\ldots n$, logistic
regression seeks to find the $\beta_j$’s that maximize the product of
@@ -76,11 +76,11 @@ $k \geq 3$ possible categories. Again we identify one category as the
baseline, for example the $k$-th category. Instead of a coin, here we
have a loaded multisided die, one side per category. Each non-baseline
category $l = 1\ldots k\,{-}\,1$ has its own vector
- $$(\beta_{0,l}, \beta_{1,l}, \ldots, \beta_{m,l})$$ of regression
+ $(\beta_{0,l}, \beta_{1,l}, \ldots, \beta_{m,l})$ of regression
parameters with the intercept, making up a matrix $B$ of size
$(m\,{+}\,1)\times(k\,{-}\,1)$. The predicted odds of seeing
non-baseline category $l$ versus the baseline $k$ are
- $$\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)$$
+ $\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)$
to 1, and the predicted probabilities are:

$$

The goal of the regression
is to estimate the parameter matrix $B$ from the provided dataset
- $(X, Y) = (x_i, y_i)_{i=1}^n$ by maximizing the product of $$Prob[y_i\mid x_i; B]$$ over the
+ $(X, Y) = (x_i, y_i)_{i=1}^n$ by maximizing the product of $Prob[y_i\mid x_i; B]$ over the
observed labels $y_i$. Taking its logarithm, negating, and adding a
regularization term gives us a minimization objective:


The optional regularization term is added to
mitigate overfitting and degeneracy in the data; to reduce bias, the
- intercepts $$\beta_{0,l}$$ are not regularized. Once the $\beta_{j,l}$’s
+ intercepts $\beta_{0,l}$ are not regularized. Once the $\beta_{j,l}$’s
are accurately estimated, we can make predictions about the category
label $y$ for a new feature vector $x$ using
Eqs. (1) and (2).
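To make the baseline construction concrete, here is a small Python/NumPy sketch (illustrative only, with made-up sizes and values; it assumes the intercepts $\beta_{0,l}$ occupy the last row of $B$, matching the appended column of ones introduced below) that turns a parameter matrix $B$ of size $(m+1)\times(k-1)$ into predicted category probabilities for one feature vector, with category $k$ as the baseline:

```python
import numpy as np

m, k = 3, 4                              # toy sizes: 3 features, 4 categories; category 4 is the baseline
rng = np.random.default_rng(0)
B = rng.normal(size=(m + 1, k - 1))      # column l holds beta_{1,l}..beta_{m,l} and, in the last row, beta_{0,l}
x = rng.normal(size=m)                   # one feature vector

# Odds of non-baseline category l versus the baseline: exp(beta_{0,l} + sum_j x_j beta_{j,l})
odds = np.exp(B[-1, :] + x @ B[:-1, :])

prob_nonbase = odds / (1.0 + odds.sum())  # probabilities of categories 1..k-1
prob_base = 1.0 / (1.0 + odds.sum())      # probability of the baseline category k

probs = np.append(prob_nonbase, prob_base)
print(probs, probs.sum())                 # sums to 1
print(int(np.argmax(probs)) + 1)          # most likely category label for x
```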
@@ -154,7 +154,7 @@ Newton method for logistic regression described in [[Lin2008]](algorithms-biblio
For convenience, let us make some changes in notation:

- Convert the input vector of observed category labels into an indicator
-   matrix $Y$ of size $n \times k$ such that $$Y_{i, l} = 1$$ if the $i$-th
+   matrix $Y$ of size $n \times k$ such that $Y_{i, l} = 1$ if the $i$-th
  category label is $l$ and $Y_{i, l} = 0$ otherwise.
- Append an extra column of all ones, i.e. $(1, 1, \ldots, 1)^T$, as the
  $m\,{+}\,1$-st column to the feature matrix $X$ to represent the
@@ -203,30 +203,28 @@ $$\varDelta f(S; B) \,\,\,\approx\,\,\, (1/2)\,\,{\textstyle\sum}\,\,S \cdot \ma

This approximation is then
minimized by trust-region conjugate gradient iterations (the *inner*
- iterations) subject to the constraint
- $\|S\|_2 \leq \delta$
- . The trust
- region size $\delta$ is initialized as
- $0.5\sqrt{m} \, / \max_i \|x_i\|_2$
- and updated as described
- in [[Lin2008]](algorithms-bibliography.html).
+ iterations) subject to the constraint
+
+ $$ \|S\|_2 \leq \delta $$
+
+ The trust
+ region size $\delta$ is initialized as $0.5\sqrt{m} \, / \max_i \|x_i\|_2$
+ and updated as described in [[Lin2008]](algorithms-bibliography.html).
Users can specify the maximum number of the outer
and the inner iterations with input parameters `moi` and
`mii`, respectively. The iterative minimizer terminates
- successfully if
+ successfully if
+
$$ \|\nabla f\|_2 < \varepsilon \|\nabla f_{B=0}\|_2 $$
- , where $\varepsilon > 0$ is a tolerance supplied by the user via input
- parameter `tol`.
+
+ where $\varepsilon > 0$ is a tolerance supplied by the user via input parameter `tol`.

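The two scalar quantities defined in this passage, the initial trust-region size and the relative-gradient stopping test, can be written out as follows. This Python/NumPy sketch is purely illustrative (it is not the DML implementation); the data are made up, `reg` stands for an assumed regularization weight, and the intercept row of $B$ is assumed to be the last one, matching the appended column of ones:

```python
import numpy as np

def probabilities(X1, B):
    """Non-baseline probabilities, n x (k-1); X1 already carries the column of ones."""
    odds = np.exp(X1 @ B)
    return odds / (1.0 + odds.sum(axis=1, keepdims=True))

def grad_f(X1, Y, B, reg):
    """Gradient of the regularized negative log-likelihood; Y holds the
    non-baseline columns of the indicator matrix, and the intercept row
    (last row of B) is left unregularized."""
    penalty = reg * B
    penalty[-1, :] = 0.0
    return X1.T @ (probabilities(X1, B) - Y) + penalty

# Made-up data
rng = np.random.default_rng(1)
n, m, k = 50, 3, 4
X = rng.normal(size=(n, m))
X1 = np.hstack([X, np.ones((n, 1))])                    # append the column of ones
labels = rng.integers(1, k + 1, size=n)
Y = (labels[:, None] == np.arange(1, k)).astype(float)  # indicator matrix, non-baseline columns

# Initial trust-region size: 0.5 * sqrt(m) / max_i ||x_i||_2
delta = 0.5 * np.sqrt(m) / np.max(np.linalg.norm(X, axis=1))

# Stopping test: ||grad f||_2 < tol * ||grad f at B = 0||_2
tol, reg = 1e-6, 1.0
B = 0.01 * rng.normal(size=(m + 1, k - 1))
g0_norm = np.linalg.norm(grad_f(X1, Y, np.zeros_like(B), reg))
converged = np.linalg.norm(grad_f(X1, Y, B, reg)) < tol * g0_norm
print(delta, converged)
```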

### Returns

- The estimated regression parameters (the
- $$\hat{\beta}_{j, l}$$)
- are
+ The estimated regression parameters (the $\hat{\beta}_{j, l}$) are
populated into a matrix and written to an HDFS file whose path/name was
provided as the `B` input argument. Only the non-baseline
- categories ($1\leq l \leq k\,{-}\,1$) have their
- $$\hat{\beta}_{j, l}$$
+ categories ($1\leq l \leq k\,{-}\,1$) have their $\hat{\beta}_{j, l}$
in the output; to add the baseline category, just append a column of zeros.
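For example, a Python/NumPy sketch of that post-processing (made-up values; it assumes intercepts were estimated, so $B$ has $m+1$ rows, and it does not show reading the HDFS file):

```python
import numpy as np

m, k = 3, 4
B_hat = np.arange((m + 1) * (k - 1), dtype=float).reshape(m + 1, k - 1) / 10.0  # as returned: one column per non-baseline category

# Append a column of zeros for the baseline category k.
B_full = np.hstack([B_hat, np.zeros((m + 1, 1))])

# Predict a new feature vector x: with the baseline column fixed at zero,
# a softmax over the per-category linear predictors reproduces Eqs. (1) and (2).
x1 = np.append(np.array([0.2, -0.5, 1.0]), 1.0)   # feature vector plus the intercept entry
scores = x1 @ B_full
probs = np.exp(scores) / np.exp(scores).sum()
print(probs, int(np.argmax(probs)) + 1)
```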

If `icpt=0` in the input command line, no intercepts are used
and `B` has size
@@ -290,7 +288,7 @@ specified.

* * *

- ### 2.2.2 Multi-Class Support Vector Machines
+ ### Multi-Class Support Vector Machines

#### Multi SVM Description

@@ -429,19 +427,20 @@ determine which test to include, is to compare impurities of the tree
nodes induced by the test. The *node impurity* measures the
homogeneity of the labels at the node. This implementation supports two
commonly used impurity measures (denoted by $\mathcal{I}$):
- *Entropy* $$\mathcal{E}=\sum_{i=1}^{C}-f_i \log f_i$$, as
+ *Entropy*
+ $\mathcal{E}=\sum_{i=1}^{C}-f_i \log f_i$, as
well as *Gini impurity*
- $$\mathcal{G}=\sum_{i=1}^{C}f_i (1-f_i)$$, where $C$ denotes the number of
+ $\mathcal{G}=\sum_{i=1}^{C}f_i (1-f_i)$, where $C$ denotes the number of
unique labels and $f_i$ is the frequency of label $i$. Once the impurity
at the tree nodes has been obtained, the *best split* is
chosen from a set of possible splits that maximizes the
*information gain* at the node, i.e.,
- $$\arg\max_{s}\mathcal{IG}(X,s)$$, where $\mathcal{IG}(X,s)$ denotes the
+ $\arg\max_{s}\mathcal{IG}(X,s)$, where $\mathcal{IG}(X,s)$ denotes the
information gain when the splitting test $s$ partitions the feature
matrix $X$. Assuming that $s$ partitions $X$, which contains $N$ feature
- vectors into $$X_\text{left}$$ and $$X_\text{right}$$ each including
- $$N_\text{left}$$ and $$N_\text{right}$$ feature vectors, respectively,
- $$\mathcal{IG}(X,s)$$ is given by
+ vectors, into $X_\text{left}$ and $X_\text{right}$ each including
+ $N_\text{left}$ and $N_\text{right}$ feature vectors, respectively,
+ $\mathcal{IG}(X,s)$ is given by

$$ \mathcal{IG}(X,s)=\mathcal{I}(X)-\frac{N_\text{left}}{N}\mathcal{I}(X_\text{left})-\frac{N_\text{right}}{N}\mathcal{I}(X_\text{right}) $$

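As a sanity check on these definitions, here is a short Python/NumPy sketch (illustrative only; the label vector and the candidate split are made up) that computes both impurity measures and the resulting information gain:

```python
import numpy as np

def entropy(labels):
    """E = sum_i -f_i * log(f_i) over the unique labels."""
    _, counts = np.unique(labels, return_counts=True)
    f = counts / counts.sum()
    return float(-(f * np.log(f)).sum())

def gini(labels):
    """G = sum_i f_i * (1 - f_i) over the unique labels."""
    _, counts = np.unique(labels, return_counts=True)
    f = counts / counts.sum()
    return float((f * (1.0 - f)).sum())

def information_gain(labels, left_mask, impurity):
    """IG(X, s) = I(X) - N_left/N * I(X_left) - N_right/N * I(X_right)."""
    n = len(labels)
    left, right = labels[left_mask], labels[~left_mask]
    return (impurity(labels)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

# Made-up labels at a node and one candidate split of its rows.
y = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 1])
split = np.array([True, True, True, True, False, False, False, False, False, True])
print(information_gain(y, split, entropy), information_gain(y, split, gini))
```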
@@ -504,10 +503,10 @@ which results in the maximum information gain is then selected.
in a matrix $M$ that contains at least 6 rows. Each column in the matrix
contains the parameters relevant to a single node in the tree. Note that
for building the tree model, our implementation splits the feature
- matrix $X$ into $$X_\text{cont}$$ containing continuous-valued features
- and $$X_\text{cat}$$ containing categorical features. In the following,
+ matrix $X$ into $X_\text{cont}$ containing continuous-valued features
+ and $X_\text{cat}$ containing categorical features. In the following,
the continuous-valued (resp. categorical) feature-ids correspond to the
- indices of the features in $$X_\text{cont}$$ (resp. $$X_\text{cat}$$).
+ indices of the features in $X_\text{cont}$ (resp. $X_\text{cat}$).
Moreover, we refer to an internal node as a continuous-valued
(categorical) node if the feature that this node looks at is
continuous-valued (categorical). Below is a description of what each row
@@ -518,8 +517,8 @@ in the matrix contains.
- Row 2: for internal nodes, stores the offset (the number of columns)
  in $M$ to the left child, and otherwise `0`.
- Row 3: stores the feature index of the feature (id of a
-   continuous-valued feature in $$X_\text{cont}$$ if the feature is
-   continuous-valued or id of a categorical feature in $$X_\text{cat}$$
+   continuous-valued feature in $X_\text{cont}$ if the feature is
+   continuous-valued or id of a categorical feature in $X_\text{cat}$
  if the feature is categorical) that this node looks at if the node
  is an internal node, otherwise `0`.
- Row 4: stores the type of the feature that this node looks at if the
@@ -547,7 +546,7 @@ its matrix representation.

#### Figure 2

- **Figure 2**: (a) An example tree and its (b) matrix representation. $x$ denotes an example and $x_j$ is the value of the $j$th continuous-valued (resp. categorical) feature in $$X_\text{cont}$$ (resp. $$X_\text{cat}$$). In this example all leaf nodes are pure and no training example is misclassified.
+ **Figure 2**: (a) An example tree and its (b) matrix representation. $x$ denotes an example and $x_j$ is the value of the $j$th continuous-valued (resp. categorical) feature in $X_\text{cont}$ (resp. $X_\text{cat}$). In this example all leaf nodes are pure and no training example is misclassified.

(a) ![Figure 2](../img/algorithms-reference/example-tree.png "Figure 2")

@@ -570,10 +569,10 @@ its matrix representation.
The matrix corresponding to the learned model as well as the training
accuracy (if requested) is written to a file in the format specified.
See above, where the structure of the model matrix is described. Recall
- that in our implementation $X$ is split into $$X_\text{cont}$$ and
- $$X_\text{cat}$$. If requested, the mappings of the continuous-valued
- feature-ids in $$X_\text{cont}$$ (stored at `S_map`) and the
- categorical feature-ids in $$X_\text{cat}$$ (stored at
+ that in our implementation $X$ is split into $X_\text{cont}$ and
+ $X_\text{cat}$. If requested, the mappings of the continuous-valued
+ feature-ids in $X_\text{cont}$ (stored at `S_map`) and the
+ categorical feature-ids in $X_\text{cat}$ (stored at
`C_map`) to the global feature-ids in $X$ will be provided.
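To illustrate this bookkeeping (a sketch only, with made-up data; the real `S_map` and `C_map` files are produced by the script), splitting a feature matrix by a column-type indicator and recording the mappings back to the global feature-ids could look like:

```python
import numpy as np

X = np.arange(20, dtype=float).reshape(4, 5)            # made-up feature matrix with 5 features
is_categorical = np.array([False, True, False, False, True])

X_cont = X[:, ~is_categorical]                          # continuous-valued features
X_cat = X[:, is_categorical]                            # categorical features

# S_map / C_map style mappings: position j in X_cont (X_cat) -> global feature-id in X (1-based).
S_map = np.where(~is_categorical)[0] + 1                # -> [1, 3, 4]
C_map = np.where(is_categorical)[0] + 1                 # -> [2, 5]
print(S_map, C_map)
```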

Depending on what arguments are provided during invocation, the
`decision-tree-predict.dml` script may compute one or more of
@@ -641,15 +640,15 @@ with the difference that the tree-ids are stored in the second row and
rows $2,3,\ldots$ from the decision tree model are shifted by one. See
[Decision Trees](algorithms-classification.html#decision-trees) for a description of the model.

- ### Returns
+ ### Random Forests Returns

The matrix corresponding to the learned model is written to a file in
the format specified. See [Decision Trees](algorithms-classification.html#decision-trees) where the
details about the structure of the model matrix are described. Similar to
- `decision-tree.dml`, $X$ is split into $$X_\text{cont}$$ and
- $$X_\text{cat}$$. If requested, the mappings of the continuous feature-ids
- in $$X_\text{cont}$$ (stored at `S_map`) as well as the
- categorical feature-ids in $$X_\text{cat}$$ (stored at
+ `decision-tree.dml`, $X$ is split into $X_\text{cont}$ and
+ $X_\text{cat}$. If requested, the mappings of the continuous feature-ids
+ in $X_\text{cont}$ (stored at `S_map`) as well as the
+ categorical feature-ids in $X_\text{cat}$ (stored at
`C_map`) to the global feature-ids in $X$ will be provided.
The `random-forest-predict.dml` script may compute one or
more of predictions, accuracy, confusion matrix, and `OOB` error estimate