
Commit 873dbf5

Merge pull request #287 from turkunov/logistic_reg
New Problem: Training Logistic Reg w/ Grad Descent
2 parents 320abc9 + 3859112 commit 873dbf5

File tree

2 files changed (+162, −0 lines)

Problems/106_train_logreg/learn.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
## Overview
Logistic regression is a model used for binary classification problems.

## Prerequisites for a regular logistic regression
Logistic regression is based on the concept of "log-odds". **Odds** are a measure of how frequently we encounter success, and they let us shift our domain from probabilities in $[0, 1]$ to $[0,\infty)$. Consider a probability of scoring a goal $p=0.8$; then our $odds=\frac{0.8}{0.2}=4$, which means that, on average, we can expect four goals for every miss. So the higher the odds, the more consistent our streak of goals. The **logit** is the inverse of the standard logistic function, i.e. the sigmoid: $logit(p)=\sigma^{-1}(p)=\ln\frac{p}{1-p}$. In our case $p$ is a probability, so we call $\frac{p}{1-p}$ the "odds". The logit further expands our domain from $[0,\infty)$ to $(-\infty,\infty)$.
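To make the odds and logit concrete, here is a quick numerical sketch of the goal-scoring example (assuming NumPy; the variable names are just for illustration):

```python
import numpy as np

p = 0.8                 # probability of scoring a goal
odds = p / (1 - p)      # 0.8 / 0.2 = 4.0, lives in [0, inf)
logit = np.log(odds)    # ln(4) ~= 1.386, lives in (-inf, inf)
print(odds, logit)
```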

With this domain expansion we can treat our problem as a linear regression and try to approximate our logit function: $X\beta=logit(p)$. However, what we really want from this approximation is predictions for probabilities:
$$
X\beta=\ln\frac{p}{1-p} \\
e^{-X\beta}=\frac{1-p}{p} \\
e^{-X\beta}+1 = \frac{1}{p} \\
p = \frac{1}{e^{-X\beta}+1}
$$
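As a quick sanity check of this derivation, a minimal sketch (assuming NumPy; `logit` and `sigmoid` are our own helpers) showing that the sigmoid undoes the logit:

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

p = np.array([0.1, 0.5, 0.8, 0.99])
# the sigmoid is the inverse of the logit, so we should recover p
print(np.allclose(sigmoid(logit(p)), p))  # True
```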

What we practically just did is take the inverse of the logit w.r.t. our approximation and go back to the sigmoid. This is also the backbone of the regular logistic regression, which is commonly defined as:
$$
\pi=\frac{e^{\alpha+X\beta}}{1+e^{\alpha+X\beta}}=\frac{1}{1+e^{-(\alpha+X\beta)}}.
$$

## Loss in logistic regression
The loss function used for solving logistic regression for $\beta$ is derived from MLE (Maximum Likelihood Estimation). This method searches for the $\beta$ that maximizes our **likelihood function** $L(\beta)$. This function tells us how likely it is that $X$ has come from the distribution generated by $\beta$: $L(\beta)=L(\beta|X)=P(X|\beta)=\prod_{\{x\in X\}}f^{univar}_X(x;\beta)$, where $f$ is a PMF and $univar$ means univariate, i.e. applied to a single variable.

In the case of a regular logistic regression we expect our output to come from a single Bernoulli-distributed random variable (hence the univariate PMF), since our true label is either $y_i=0$ or $y_i=1$. The Bernoulli PMF is defined as $P(Y=y)=p^y(1-p)^{(1-y)}$, where $y\in\{0, 1\}$. Let's also denote $\{x\in X\}$ simply as $X$ and refer to a single pair of vectors from the training set as $(x_i, y_i)$. Thus, our likelihood function looks like this:
$$
\prod_X p\left(x_i\right)^{y_i} \times\left[1-p\left(x_i\right)\right]^{1-y_i}
$$

Then we convert our function from likelihood to log-likelihood by taking $\ln$ (or $\log$) of it:
$$
\sum_X y_i \log \left[p\left(x_i\right)\right]+\left(1-y_i\right) \log \left[1-p\left(x_i\right)\right]
$$

And then we replace $p(x_i)$ with the sigmoid from the previously derived equality to get the final version of our log-likelihood, which (after negation) will serve as our **loss function**:
$$
\sum_X y_i \log \left(\frac{1}{1+e^{-x_i\beta}}\right)+\left(1-y_i\right)\log \left(1-\frac{1}{1+e^{-x_i\beta}}\right)
$$
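For illustration, a short sketch (assuming NumPy; the toy data and `beta` are made up) that evaluates this sum over a few samples; its negative is the loss we minimize in the next section:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# toy data: 3 samples, 2 features, and an arbitrary parameter vector
X = np.array([[0.5, -1.2], [1.0, 0.3], [-0.7, 0.8]])
y = np.array([1.0, 0.0, 1.0])
beta = np.array([0.1, -0.2])

p = sigmoid(X @ beta)
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
loss = -log_likelihood  # sum-reduced binary cross entropy
print(log_likelihood, loss)
```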

## Optimization objective
Recall that originally we wanted to search for the $\beta$ that maximizes the likelihood function. Since $\log$ is a monotonic transformation, our maximization objective does not change, and we can confidently say that we can equally search for the $\beta$ that maximizes our log-likelihood. Hence we can finally write our actual objective as:

$$
\arg\max_\beta \left[\sum_X y_i \log\sigma(x_i\beta)+\left(1-y_i\right)\log (1-\sigma(x_i\beta))\right] = \\
= \arg\min_\beta -\left[\sum_X y_i \log\sigma(x_i\beta)+\left(1-y_i\right)\log (1-\sigma(x_i\beta))\right]
$$

where $\sigma$ is the sigmoid. The function we're trying to minimize is also called the **Binary Cross Entropy** (BCE) loss function. To find its minimum we need to take the gradient of this LLF (Log-Likelihood Function), i.e. find the vector of derivatives with respect to every individual $\beta_j$.

### Step 1
To do that we're going to use the chain rule, which relates the derivative of a composite function to the derivatives of the variables it is made of. In our case the log-likelihood function depends on the sigmoid $\sigma$, $\sigma$ depends on $X\beta$, and $X\beta$ finally depends on $\beta_j$, hence:

$$
\frac{\partial LLF}{\partial\beta_j}=\frac{\partial LLF}{\partial\sigma}\frac{\partial\sigma}{\partial[X\beta]}\frac{\partial[X\beta]}{\partial\beta_j}= \\
=-\sum_{i=1}^n\left(y^{(i)} \frac{1}{\sigma\left(x^{(i)}\beta\right)}-(1-y^{(i)} ) \frac{1}{1-\sigma\left(x^{(i)}\beta\right)}\right) \frac{\partial\sigma}{\partial[x^{(i)}\beta]}
$$

### Step 2
Then we use the derivative of the sigmoid function, which is $\frac{\partial\sigma(x)}{\partial x}=\sigma(x)(1-\sigma(x))$:
$$
-\sum_{i=1}^n\left(y^{(i)} \frac{1}{\sigma\left(x^{(i)}\beta\right)}-(1-y^{(i)} ) \frac{1}{1-\sigma\left(x^{(i)}\beta\right)}\right) \sigma\left(x^{(i)}\beta\right)\left(1-\sigma\left(x^{(i)}\beta\right)\right) \frac{\partial[x^{(i)}\beta]}{\partial\beta_j} = \\
=-\sum_{i=1}^n\left(y^{(i)}\left(1-\sigma\left(x^{(i)}\beta\right)\right)-(1-y^{(i)} ) \sigma\left(x^{(i)}\beta\right)\right) x_j^{(i)} = \\
=-\sum_{i=1}^n\left(y^{(i)}-\sigma\left(x^{(i)}\beta\right)\right) x_j^{(i)} = \\
=\sum_{i=1}^n\left(\sigma\left(x^{(i)}\beta\right)-y^{(i)}\right) x_j^{(i)}.
$$

The resulting sum can then be rewritten in a more convenient matrix (gradient) form as:
$$
X^T(\sigma(X\beta)-Y)
$$
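A small sketch (assuming NumPy; the finite-difference check and random toy data are ours) verifying that this matrix form matches a numerical gradient of the BCE loss:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(beta, X, y):
    p = sigmoid(X @ beta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = (rng.random(10) > 0.5).astype(float)
beta = rng.normal(size=3)

analytic = X.T @ (sigmoid(X @ beta) - y)

# central finite differences, one coordinate of beta at a time
eps = 1e-6
numeric = np.zeros_like(beta)
for j in range(beta.size):
    e = np.zeros_like(beta)
    e[j] = eps
    numeric[j] = (bce(beta + e, X, y) - bce(beta - e, X, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```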

Then we can finally use gradient descent in order to iteratively update our parameters:
$$
\beta_{t+1}=\beta_t - \eta [X^T(\sigma(X\beta_t)-Y)]
$$
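Putting it all together, a minimal training-loop sketch under stated assumptions (NumPy, synthetic data, a fixed learning rate, and no intercept column, unlike the full solution below):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# synthetic data generated from a known parameter vector
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
true_beta = np.array([1.5, -2.0, 0.5])
y = (rng.random(100) < sigmoid(X @ true_beta)).astype(float)

beta = np.zeros(3)
eta = 0.01  # learning rate
for _ in range(1000):
    # beta_{t+1} = beta_t - eta * X^T (sigma(X beta_t) - y)
    beta -= eta * X.T @ (sigmoid(X @ beta) - y)

print(beta)  # should land roughly near true_beta
```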
Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
import numpy as np


def train_logreg(X: np.ndarray, y: np.ndarray,
                 learning_rate: float, iterations: int) -> tuple[list[float], ...]:
    """
    Gradient-descent training algorithm for logistic regression that collects
    sum-reduced BCE losses and accuracies. Assigns label "0" if P(x_i) <= 0.5
    and "1" otherwise.

    Returns
    -------
    B : list[float]
        1xM updated parameter vector rounded to 4 decimal places
    losses : list[float]
        collected values of the BCE loss function (LLF) rounded to 4 decimal places
    """

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def accuracy(y_pred, y_true):
        return (y_true == np.rint(y_pred)).sum() / len(y_true)

    def bce_loss(y_pred, y_true):
        return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    y = y.reshape(-1, 1)
    # prepend a column of ones so that B[0] acts as the intercept
    X = np.hstack((np.ones((X.shape[0], 1)), X))
    B = np.zeros((X.shape[1], 1))
    accuracies, losses = [], []

    for epoch in range(iterations):
        y_pred = sigmoid(X @ B)
        # gradient step: B <- B - lr * X^T (sigmoid(XB) - y)
        B -= learning_rate * X.T @ (y_pred - y)
        losses.append(round(bce_loss(y_pred, y), 4))
        accuracies.append(round(accuracy(y_pred, y), 4))

    return B.flatten().round(4).tolist(), losses


def test_train_logreg():
    # Test 1
    X = np.array([[ 0.76743473, -0.23413696, -0.23415337,  1.57921282],
                  [-1.4123037 ,  0.31424733, -1.01283112, -0.90802408],
                  [-0.46572975,  0.54256004, -0.46947439, -0.46341769],
                  [-0.56228753, -1.91328024,  0.24196227, -1.72491783],
                  [-1.42474819, -0.2257763 ,  1.46564877,  0.0675282 ],
                  [ 1.85227818, -0.29169375, -0.60063869, -0.60170661],
                  [ 0.37569802,  0.11092259, -0.54438272, -1.15099358],
                  [ 0.19686124, -1.95967012,  0.2088636 , -1.32818605],
                  [ 1.52302986, -0.1382643 ,  0.49671415,  0.64768854],
                  [-1.22084365, -1.05771093, -0.01349722,  0.82254491]])
    y = np.array([1., 0., 0., 0., 1., 1., 0., 0., 1., 0.])
    learning_rate = 1e-3
    iterations = 10
    b, llf = train_logreg(X, y, learning_rate, iterations)
    assert b == [-0.0097, 0.0286, 0.015, 0.0135, 0.0316] and \
        llf == [6.9315, 6.9075, 6.8837, 6.8601, 6.8367, 6.8134, 6.7904, 6.7675, 6.7448, 6.7223], \
        'Test case 1 failed'

    # Test 2
    X = np.array([[ 0.76743473,  1.57921282, -0.46947439],
                  [-0.23415337,  1.52302986, -0.23413696],
                  [ 0.11092259, -0.54438272, -1.15099358],
                  [-0.60063869,  0.37569802, -0.29169375],
                  [-1.91328024,  0.24196227, -1.72491783],
                  [-1.01283112, -0.56228753,  0.31424733],
                  [-0.1382643 ,  0.49671415,  0.64768854],
                  [-0.46341769,  0.54256004, -0.46572975],
                  [-1.4123037 , -0.90802408,  1.46564877],
                  [ 0.0675282 , -0.2257763 , -1.42474819]])
    y = np.array([1., 1., 0., 0., 0., 0., 1., 1., 0., 0.])
    learning_rate = 1e-1
    iterations = 10
    b, llf = train_logreg(X, y, learning_rate, iterations)
    assert b == [-0.2509, 0.9325, 1.6218, 0.6336] and \
        llf == [6.9315, 5.5073, 4.6382, 4.0609, 3.6503, 3.3432, 3.1045, 2.9134, 2.7567, 2.6258], \
        'Test case 2 failed'

    print('All tests passed')


if __name__ == '__main__':
    test_train_logreg()
