
Commit feb4879

Merge branch 'Open-Deep-ML:main' into main
2 parents bb2e5bd + 0c631f4 commit feb4879

File tree: 300 files changed, +2365 -58 lines changed
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
## Binary Classification with Logistic Regression

Logistic regression is a fundamental algorithm for binary classification. Given input features and learned model parameters (weights and bias), your task is to implement the prediction function that computes class probabilities.

### Mathematical Background

The logistic regression model makes predictions using the sigmoid function:

$\sigma(z) = \frac{1}{1 + e^{-z}}$

where $z$ is the linear combination of features and weights plus bias:

$z = \mathbf{w}^T\mathbf{x} + b = \sum_{i=1}^{n} w_ix_i + b$

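For example, with weights $\mathbf{w} = [1, 1]$, bias $b = 0$, and a sample $\mathbf{x} = [2, 2]$ (the values used in the first test case), $z = 1\cdot2 + 1\cdot2 + 0 = 4$ and $\sigma(4) \approx 0.982$, so the sample is assigned class 1 under a 0.5 threshold.
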
### Implementation Requirements

Your task is to implement a function that:

- Takes a batch of samples $\mathbf{X}$ (shape: N × D), weights $\mathbf{w}$ (shape: D), and bias $b$
- Computes $z = \mathbf{X}\mathbf{w} + b$ for all samples
- Applies the sigmoid function to get probabilities
- Returns binary predictions, i.e. 0 or 1, using a threshold of 0.5

### Important Considerations

- Handle numerical stability in the sigmoid computation
- Use efficient vectorized NumPy operations
- Return binary predictions, i.e. zeros and ones

### Hint

To prevent overflow in the exponential of the sigmoid, use np.clip to limit the z values:

```python
z = np.clip(z, -500, 500)
```

This ensures numerical stability when dealing with large input values.
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
import numpy as np

def predict_logistic(X: np.ndarray, weights: np.ndarray, bias: float) -> np.ndarray:
    """Return binary predictions (0/1) for a batch of samples X given weights and bias."""
    z = np.dot(X, weights) + bias
    z = np.clip(z, -500, 500)  # Prevent overflow in exp
    probabilities = 1 / (1 + np.exp(-z))
    return (probabilities >= 0.5).astype(int)

def test_predict_logistic():
    # Test case 1: Simple linearly separable case
    X1 = np.array([[1, 1], [2, 2], [-1, -1], [-2, -2]])
    w1 = np.array([1, 1])
    b1 = 0
    expected1 = np.array([1, 1, 0, 0])
    assert np.array_equal(predict_logistic(X1, w1, b1), expected1), "Test case 1 failed"

    # Test case 2: Decision boundary case
    X2 = np.array([[0, 0], [0.1, 0.1], [-0.1, -0.1]])
    w2 = np.array([1, 1])
    b2 = 0
    expected2 = np.array([1, 1, 0])
    assert np.array_equal(predict_logistic(X2, w2, b2), expected2), "Test case 2 failed"

    # Test case 3: Higher dimensional input
    X3 = np.array([[1, 2, 3], [-1, -2, -3], [0.5, 1, 1.5]])
    w3 = np.array([0.1, 0.2, 0.3])
    b3 = -1
    expected3 = np.array([1, 0, 0])
    assert np.array_equal(predict_logistic(X3, w3, b3), expected3), "Test case 3 failed"

    # Test case 4: Single feature
    X4 = np.array([[1], [2], [-1], [-2]]).reshape(-1, 1)
    w4 = np.array([2])
    b4 = 0
    expected4 = np.array([1, 1, 0, 0])
    assert np.array_equal(predict_logistic(X4, w4, b4), expected4), "Test case 4 failed"

    # Test case 5: Numerical stability test with large values
    X5 = np.array([[1000, 2000], [-1000, -2000]])
    w5 = np.array([0.1, 0.1])
    b5 = 0
    result5 = predict_logistic(X5, w5, b5)
    assert result5[0] == 1 and result5[1] == 0, "Test case 5 failed"

if __name__ == "__main__":
    test_predict_logistic()
    print("All test cases passed!")
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
## Overview
Softmax regression extends logistic regression to the multiclass problem: it outputs a vector $P$ with a probability for each distinct class and predicts $argmax(P)$.

## Connection to a regular logistic regression
Recall that a standard logistic regression is aimed at approximating
$$
p = \frac{1}{e^{-X\beta}+1} = \frac{e^{X\beta}}{1+e^{X\beta}},
$$

which aligns with the definition of the softmax function:
$$
softmax(z_i)=\sigma(z_i)=\frac{e^{z_i}}{\sum_{j=1}^{C}e^{z_j}},
$$

where $C$ is the number of classes and the outputs sum to $1$. Hence softmax simply extends the sigmoid to more than 2 classes and can be used to assign probability values in a categorical distribution, i.e. softmax regression searches for the following vector approximation:
$$
p^{(i)}_j=\frac{e^{x^{(i)}\beta_j}}{\sum_{k=1}^{C}e^{x^{(i)}\beta_k}}
$$

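As a quick illustration (a sketch of mine, not part of the original lesson), the softmax can be computed row-wise in NumPy; subtracting the row maximum before exponentiating is a common stability trick that leaves the result unchanged:

```python
import numpy as np

def softmax(Z: np.ndarray) -> np.ndarray:
    """Row-wise softmax; shifting by the row max avoids overflow without changing the output."""
    Z_shifted = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z_shifted)
    return expZ / expZ.sum(axis=1, keepdims=True)

# Each row sums to 1 and can be read as a categorical distribution over C classes.
print(softmax(np.array([[1.0, 2.0, 3.0],
                        [0.0, 0.0, 0.0],
                        [100.0, 0.0, -100.0]])))
```
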
## Loss in softmax regression
**tl;dr** the key differences from the logistic regression loss are replacing the sigmoid with the softmax and computing a separate gradient for each vector $\beta_j$ corresponding to a particular class $j\in\{1,...,C\}$.

Recall that we use MLE in logistic regression. The same holds for softmax regression, except that instead of a Bernoulli-distributed random variable we have a categorical distribution, which extends the Bernoulli distribution to more than 2 labels. Its PMF is defined as:
$$
f(y|p)=\prod_{j=1}^C p_j^{[j=y]},
$$

Hence, our log-likelihood looks like:
$$
\sum_X \sum_{j=1}^C [y_i=j] \log \left[p_j\left(x_i\right)\right]
$$

where we replace the probability function with the softmax:
$$
\sum_X \sum_{j=1}^C [y_i=j] \log \frac{e^{x_i\beta_j}}{\sum_{k=1}^C e^{x_i\beta_k}}
$$

where $[j=y]$ is the indicator (Iverson bracket) that returns $0$ if $j\neq y$ and $1$ otherwise, and $C$ is the number of distinct classes (labels). You can see that, since we expect a $1\times C$ output for $y$, just like in the neuron backprop problem, we keep a separate vector $\beta_j$ for every class $j$ out of $C$.

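To make the loss concrete, here is a small self-contained sketch (function and variable names are my own) of the summed cross entropy computed from one-hot labels and predicted probabilities, matching the formula above:

```python
import numpy as np

def cross_entropy(Y_onehot: np.ndarray, P: np.ndarray) -> float:
    """Summed CE: negative log probability assigned to the true class, summed over samples."""
    true_idx = np.argmax(Y_onehot, axis=1)           # true class index per sample
    return float(-np.sum(np.log(P[np.arange(len(P)), true_idx])))

Y = np.eye(3)[[0, 2, 1]]                             # one-hot labels for 3 samples, 3 classes
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6],
              [0.2, 0.5, 0.3]])
print(cross_entropy(Y, P))                           # -(log 0.7 + log 0.6 + log 0.5) ≈ 1.56
```
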
## Optimization objective
The optimization objective is the same as with logistic regression. The function we are minimizing is also commonly referred to as **Cross Entropy** (CE):

$$
argmin_\beta -\left[\sum_X \sum_{j=1}^C [y_i=j] \log \frac{e^{x_i\beta_j}}{\sum_{k=1}^C e^{x_i\beta_k}}\right]
$$

Then we once again use the chain rule to compute the partial derivative of $CE$ with respect to $\beta$:

$$
\frac{\partial CE}{\partial\beta^{(j)}_i}=\frac{\partial CE}{\partial\sigma}\frac{\partial\sigma}{\partial[X\beta^{(j)}]}\frac{\partial[X\beta^{(j)}]}{\partial\beta^{(j)}_i}
$$

which eventually reduces to a gradient matrix form similar to the one in logistic regression:
$$
X^T(\sigma(X\beta^{(j)})-Y)
$$

Then we can finally use gradient descent to iteratively update our parameters with respect to a particular class:
$$
\beta^{(j)}_{t+1}=\beta^{(j)}_t - \eta [X^T(\sigma(X\beta^{(j)}_t)-Y)]
$$
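
Stacking the per-class vectors $\beta^{(j)}$ as columns of a single matrix $B$ lets all class updates be performed in one matrix operation, which is essentially the vectorized update used in the implementation below. A minimal self-contained sketch (names and toy data are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack((np.ones((6, 1)), rng.normal(size=(6, 2))))  # 6 samples: bias column + 2 features
Y = np.eye(3)[rng.integers(0, 3, size=6)]                   # one-hot labels, C = 3 classes
B = np.zeros((X.shape[1], 3))                               # one column of parameters per class
eta = 0.05

for _ in range(100):
    Z = X @ B
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)                       # softmax probabilities, shape (N, C)
    B -= eta * X.T @ (P - Y)                                # all per-class updates at once

print(B.round(3))
```
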
Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
import numpy as np


def train_softmaxreg(X: np.ndarray, y: np.ndarray,
                     learning_rate: float, iterations: int) -> tuple[list[list[float]], list[float]]:
    """
    Gradient-descent training for softmax regression that tracks the summed CE loss
    and the accuracy at every iteration (only the parameters and losses are returned).

    Returns
    -------
    B : list[list[float]]
        C x M parameter matrix (one row per class), rounded to 4 decimal places
    losses : list[float]
        collected Cross Entropy values, rounded to 4 decimal places
    """

    def softmax(z):
        return np.exp(z) / np.sum(np.exp(z), axis=1, keepdims=True)

    def accuracy(y_pred, y_true):
        return (np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1)).sum() / len(y_true)

    def ce_loss(y_pred, y_true):
        true_labels_idx = np.argmax(y_true, axis=1)
        return -np.sum(np.log(y_pred)[list(range(len(y_pred))), true_labels_idx])

    y = y.astype(int)
    C = y.max() + 1  # we assume that classes start from 0
    y = np.eye(C)[y]  # one-hot encode the labels
    X = np.hstack((np.ones((X.shape[0], 1)), X))  # prepend a bias column
    B = np.zeros((X.shape[1], C))
    accuracies, losses = [], []

    for epoch in range(iterations):
        y_pred = softmax(X @ B)
        B -= learning_rate * X.T @ (y_pred - y)
        losses.append(round(ce_loss(y_pred, y), 4))
        accuracies.append(round(accuracy(y_pred, y), 4))

    return B.T.round(4).tolist(), losses


def test_train_softmaxreg():
    # Test 1
    X = np.array([[ 2.52569869,  2.33335813,  1.77303921,  0.41061103, -1.66484491],
                  [ 1.51013861,  1.30237106,  1.31989315,  1.36087958,  0.46381252],
                  [-2.09699866, -1.35960405, -1.04035503, -2.25481082, -0.32359947],
                  [-0.96660088, -0.60680633, -0.72017167, -1.73257187, -1.12811486],
                  [-0.38096611, -0.24852455,  0.18789426,  0.52359424,  1.30725962],
                  [ 0.54828787,  0.33156614,  0.10676247,  0.30694669, -0.37555384],
                  [-3.03393135, -2.01966141, -0.6546858 , -0.90330912,  2.89185791],
                  [ 0.28602304, -0.1265    , -0.52209915,  0.28309144, -0.5865882 ],
                  [-0.26268117,  0.76017979,  1.84095557, -0.23245038,  1.80716891],
                  [ 0.30283562, -0.40231495, -1.29550644, -0.1422727 , -1.78121713]])
    y = np.array([2, 3, 0, 0, 1, 3, 0, 1, 2, 1])
    learning_rate = 3e-2
    iterations = 10
    expected_b = [[-0.0841, -0.5693, -0.3651, -0.2423, -0.5344, 0.0339],
                  [0.2566, 0.0535, -0.2104, -0.4004, 0.2709, -0.1461],
                  [-0.1318, 0.2109, 0.3998, 0.523, -0.1001, 0.0545],
                  [-0.0407, 0.3049, 0.1757, 0.1197, 0.3637, 0.0576]]
    expected_losses = [13.8629, 10.7201, 9.3163, 8.4942, 7.9132,
                       7.4598, 7.0854, 6.7653, 6.4851, 6.2358]
    b, ce = train_softmaxreg(X, y, learning_rate, iterations)
    assert b == expected_b and ce == expected_losses, 'Test case 1 failed'

    # Test 2
    X = np.array([[-0.55605887, -0.74922526, -0.1913345 ,  0.41584056],
                  [-1.05481124, -1.13763371, -1.28685937, -1.0710115 ],
                  [-1.17111877, -1.46866663, -0.75898143,  0.15915148],
                  [-1.21725723, -1.55590285, -0.69318542,  0.3580615 ],
                  [-1.90316075, -2.06075824, -2.2952422 , -1.87885386],
                  [-0.79089629, -0.98662696, -0.52955027,  0.07329079],
                  [ 1.97170638,  2.65609694,  0.6802377 , -1.47090364],
                  [ 1.46907396,  1.61396429,  1.69602021,  1.29791351],
                  [ 0.03095068,  0.15148081, -0.34698116, -0.74306029],
                  [-1.40292946, -1.99308861, -0.1478281 ,  1.72332995]])
    y = np.array([1., 0., 0., 1., 0., 1., 0., 1., 0., 1.])
    learning_rate = 1e-2
    iterations = 7
    expected_b = [[-0.0052, 0.0148, 0.0562, -0.113, -0.2488],
                  [0.0052, -0.0148, -0.0562, 0.113, 0.2488]]
    expected_losses = [6.9315, 6.4544, 6.0487, 5.7025, 5.4055, 5.1493, 4.9269]
    b, ce = train_softmaxreg(X, y, learning_rate, iterations)
    assert b == expected_b and ce == expected_losses, 'Test case 2 failed'

    print('All tests passed')


if __name__ == '__main__':
    test_train_softmaxreg()

Problems/106_train_logreg/learn.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
## Overview
Logistic regression is a model used for binary classification problems.

## Prerequisites for a regular logistic regression
Logistic regression is built on the concept of log-odds (the logit). **Odds** are a measure of how frequently we encounter success, and they shift our probability domain from $[0, 1]$ to $[0,\infty)$. Consider a probability of scoring a goal $p=0.8$; then $odds=\frac{0.8}{0.2}=4$, meaning we expect four goals for every miss. So the higher the odds, the more consistent our streak of goals. The **logit** is the inverse of the standard logistic function, i.e. the sigmoid: $logit(p)=\sigma^{-1}(p)=\ln\frac{p}{1-p}$. In our case $p$ is a probability, therefore we call $\frac{p}{1-p}$ the "odds". The logit further expands our domain from $[0,\infty)$ to $(-\infty,\infty)$.

With this domain expansion we can treat our problem as a linear regression and approximate the logit: $X\beta=logit(p)$. However, what we really want from this approximation is predictions for probabilities:
$$
X\beta=\ln\frac{p}{1-p} \\
e^{-X\beta}=\frac{1-p}{p} \\
e^{-X\beta}+1 = \frac{1}{p} \\
p = \frac{1}{e^{-X\beta}+1}
$$

What we have effectively done is take the inverse of the logit of our approximation and arrive back at the sigmoid. This is also the backbone of the regular logistic regression, which is commonly defined as:
$$
\pi=\frac{e^{\alpha+X\beta}}{1+e^{\alpha+X\beta}}=\frac{1}{1+e^{-(\alpha+X\beta)}}.
$$

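As a quick sanity check of this inversion (an illustrative snippet of mine, not part of the original lesson), applying the sigmoid to the logit recovers the original probability:

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))      # log-odds: maps (0, 1) to the whole real line

def sigmoid(z):
    return 1 / (1 + np.exp(-z))     # inverse of the logit

p = np.array([0.1, 0.5, 0.8, 0.99])
print(logit(p))                     # unbounded real values
print(sigmoid(logit(p)))            # recovers [0.1, 0.5, 0.8, 0.99]
```
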
## Loss in logistic regression
The loss function used for solving logistic regression for $\beta$ is derived from MLE (Maximum Likelihood Estimation). This method searches for the $\beta$ that maximizes the **likelihood function** $L(\beta)$, which tells us how likely it is that $X$ came from the distribution generated by $\beta$: $L(\beta)=L(\beta|X)=P(X|\beta)=\prod_{\{x\in X\}}f^{univar}_X(x;\beta)$, where $f$ is a PMF and $univar$ means univariate, i.e. applied to a single variable.

In the case of a regular logistic regression we expect our output to follow a single Bernoulli-distributed random variable (hence the univariance), since each true label is either $y_i=0$ or $y_i=1$. The Bernoulli PMF is defined as $P(Y=y)=p^y(1-p)^{(1-y)}$, where $y\in\{0, 1\}$. Let's also denote $\{x\in X\}$ simply as $X$ and refer to a single pair of vectors from the training set as $(x_i, y_i)$. Thus, our likelihood function looks like:
$$
\prod_X p\left(x_i\right)^{y_i} \times\left[1-p\left(x_i\right)\right]^{1-y_i}
$$

Then we convert the likelihood to a log-likelihood by taking $\ln$ (or $\log$) of it:
$$
\sum_X y_i \log \left[p\left(x_i\right)\right]+\left(1-y_i\right) \log \left[1-p\left(x_i\right)\right]
$$

And then we replace $p(x_i)$ with the sigmoid from the previously derived equality to get the final version of our **loss function**:
$$
\sum_X y_i \log \left(\frac{1}{1+e^{-x_i\beta}}\right)+\left(1-y_i\right)\log \left(1-\frac{1}{1+e^{-x_i\beta}}\right)
$$

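As a small self-contained illustration (the names here are my own), the negative of this log-likelihood, i.e. the binary cross entropy, can be evaluated directly in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(X, y, beta):
    """Summed binary cross entropy: the negative of the log-likelihood above."""
    p = sigmoid(X @ beta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])  # first column acts as the intercept
y = np.array([1, 0, 1])
print(bce(X, y, np.zeros(2)))                        # 3 * log(2) ≈ 2.079 at beta = 0
```
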
## Optimization objective
Recall that we originally wanted to search for the $\beta$ that maximizes the likelihood function. Since $\log$ is a monotonic transformation, our maximization objective does not change, so we can equally search for the $\beta$ that maximizes the log-likelihood. Hence we can finally write our actual objective as:

$$
argmax_\beta [\sum_X y_i \log\sigma(x_i\beta)+\left(1-y_i\right)\log (1-\sigma(x_i\beta))] = \\
= argmin_\beta -[\sum_X y_i \log\sigma(x_i\beta)+\left(1-y_i\right)\log (1-\sigma(x_i\beta))]
$$

where $\sigma$ is the sigmoid. The function we are minimizing is also called the **Binary Cross Entropy** (BCE) loss. To find the minimum we need the gradient of this LLF (Log-Likelihood Function), i.e. the vector of derivatives with respect to every individual $\beta_j$.

### Step 1
To do that we use the chain rule, which decomposes the derivative through the variables our original function is composed of. In our case the log-likelihood depends on the sigmoid $\sigma$, $\sigma$ depends on $X\beta$, and $X\beta$ in turn depends on $\beta_j$, hence:

$$
\frac{\partial LLF}{\partial\beta_j}=\frac{\partial LLF}{\partial\sigma}\frac{\partial\sigma}{\partial[X\beta]}\frac{\partial[X\beta]}{\partial\beta_j}= \\
=-\sum_{i=1}^n\left(y^{(i)} \frac{1}{\sigma\left(x^{(i)}\beta\right)}-(1-y^{(i)} ) \frac{1}{1-\sigma\left(x^{(i)}\beta\right)}\right) \frac{\partial\sigma}{\partial[x^{(i)}\beta]}
$$

### Step 2
Then we use the derivative of the sigmoid function, $\frac{\partial\sigma(x)}{\partial x}=\sigma(x)(1-\sigma(x))$:
$$
-\sum_{i=1}^n\left(y^{(i)} \frac{1}{\sigma\left(x^{(i)}\beta\right)}-(1-y^{(i)} ) \frac{1}{1-\sigma\left(x^{(i)}\beta\right)}\right) \sigma\left(x^{(i)}\beta\right)\left(1-\sigma\left(x^{(i)}\beta\right)\right) \frac{\partial[x^{(i)}\beta]}{\partial\beta_j} = \\
=-\sum_{i=1}^n\left(y^{(i)}\left(1-\sigma\left(x^{(i)}\beta\right)\right)-(1-y^{(i)} ) \sigma\left(x^{(i)}\beta\right)\right) x_j^{(i)} = \\
=-\sum_{i=1}^n\left(y^{(i)}-\sigma\left(x^{(i)}\beta\right)\right) x_j^{(i)} = \\
=\sum_{i=1}^n\left(\sigma\left(x^{(i)}\beta\right)-y^{(i)}\right) x_j^{(i)}.
$$

The resulting sum can then be rewritten in the more convenient gradient matrix form:
$$
X^T(\sigma(X\beta)-Y)
$$

Then we can finally use gradient descent to iteratively update our parameters:
$$
\beta_{t+1}=\beta_t - \eta [X^T(\sigma(X\beta_t)-Y)]
$$
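
Putting the update rule into code, a minimal training sketch could look like the following (function and variable names are my own and this is not the repository's reference solution):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_logreg(X, y, eta=0.1, iterations=100):
    """Plain gradient descent on BCE using the matrix-form gradient X^T (sigma(X beta) - y)."""
    X = np.hstack((np.ones((X.shape[0], 1)), X))   # prepend an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(iterations):
        beta -= eta * X.T @ (sigmoid(X @ beta) - y)
    return beta

# Toy example: the label is 1 exactly when the single feature is positive
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
beta = train_logreg(X, y)
X_b = np.hstack((np.ones((len(X), 1)), X))
print(beta)                                        # intercept near 0, positive slope
print((sigmoid(X_b @ beta) >= 0.5).astype(int))    # [0 0 0 1 1 1]
```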
