
Commit feb4879

Merge branch 'Open-Deep-ML:main' into main
2 parents bb2e5bd + 0c631f4 commit feb4879

File tree: 300 files changed, +2365 -58 lines changed
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
## Binary Classification with Logistic Regression

Logistic regression is a fundamental algorithm for binary classification. Given input features and learned model parameters (weights and bias), your task is to implement the prediction function that computes class probabilities.

### Mathematical Background

The logistic regression model makes predictions using the sigmoid function:

$\sigma(z) = \frac{1}{1 + e^{-z}}$

where $z$ is the linear combination of features and weights plus bias:

$z = \mathbf{w}^T\mathbf{x} + b = \sum_{i=1}^{n} w_ix_i + b$

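For example, with weights $\mathbf{w} = [1, 1]$, bias $b = 0$, and a sample $\mathbf{x} = [2, 2]$ (the values used in the first test case), $z = 1\cdot2 + 1\cdot2 + 0 = 4$ and $\sigma(4) \approx 0.982$, so the sample is assigned class 1 under a 0.5 threshold.
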
### Implementation Requirements

Your task is to implement a function that:

- Takes a batch of samples $\mathbf{X}$ (shape: N × D), weights $\mathbf{w}$ (shape: D), and bias $b$
- Computes $z = \mathbf{X}\mathbf{w} + b$ for all samples
- Applies the sigmoid function to get probabilities
- Returns binary predictions, i.e. 0 or 1, using a threshold of 0.5

### Important Considerations

- Handle numerical stability in the sigmoid computation
- Use efficient vectorized NumPy operations
- Return binary predictions, i.e. zeros and ones

### Hint

To prevent overflow in the exponential of the sigmoid, use np.clip to limit the z values:

```python
z = np.clip(z, -500, 500)
```

This ensures numerical stability when dealing with large input values.
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
import numpy as np

def predict_logistic(X: np.ndarray, weights: np.ndarray, bias: float) -> np.ndarray:
    """Return binary predictions (0/1) for a batch of samples X given weights and bias."""
    z = np.dot(X, weights) + bias
    z = np.clip(z, -500, 500)  # Prevent overflow in exp
    probabilities = 1 / (1 + np.exp(-z))
    return (probabilities >= 0.5).astype(int)

def test_predict_logistic():
    # Test case 1: Simple linearly separable case
    X1 = np.array([[1, 1], [2, 2], [-1, -1], [-2, -2]])
    w1 = np.array([1, 1])
    b1 = 0
    expected1 = np.array([1, 1, 0, 0])
    assert np.array_equal(predict_logistic(X1, w1, b1), expected1), "Test case 1 failed"

    # Test case 2: Decision boundary case
    X2 = np.array([[0, 0], [0.1, 0.1], [-0.1, -0.1]])
    w2 = np.array([1, 1])
    b2 = 0
    expected2 = np.array([1, 1, 0])
    assert np.array_equal(predict_logistic(X2, w2, b2), expected2), "Test case 2 failed"

    # Test case 3: Higher dimensional input
    X3 = np.array([[1, 2, 3], [-1, -2, -3], [0.5, 1, 1.5]])
    w3 = np.array([0.1, 0.2, 0.3])
    b3 = -1
    expected3 = np.array([1, 0, 0])
    assert np.array_equal(predict_logistic(X3, w3, b3), expected3), "Test case 3 failed"

    # Test case 4: Single feature
    X4 = np.array([[1], [2], [-1], [-2]]).reshape(-1, 1)
    w4 = np.array([2])
    b4 = 0
    expected4 = np.array([1, 1, 0, 0])
    assert np.array_equal(predict_logistic(X4, w4, b4), expected4), "Test case 4 failed"

    # Test case 5: Numerical stability test with large values
    X5 = np.array([[1000, 2000], [-1000, -2000]])
    w5 = np.array([0.1, 0.1])
    b5 = 0
    result5 = predict_logistic(X5, w5, b5)
    assert result5[0] == 1 and result5[1] == 0, "Test case 5 failed"

if __name__ == "__main__":
    test_predict_logistic()
    print("All test cases passed!")
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
## Overview
Softmax regression extends logistic regression to the multiclass problem: it outputs a vector $P$ with a probability for each distinct class and predicts $argmax(P)$.

## Connection to a regular logistic regression
Recall that a standard logistic regression is aimed at approximating
$$
p = \frac{1}{e^{-X\beta}+1} = \frac{e^{X\beta}}{1+e^{X\beta}},
$$

which aligns with the definition of the softmax function:
$$
softmax(z_i)=\sigma(z_i)=\frac{e^{z_i}}{\sum_{j=1}^{C}e^{z_j}},
$$

where $C$ is the number of classes and the outputs sum to $1$. Hence softmax simply extends the sigmoid to more than 2 classes and can be used to assign probability values in a categorical distribution, i.e. softmax regression searches for the following vector approximation:
$$
p^{(i)}_j=\frac{e^{x^{(i)}\beta_j}}{\sum_{k=1}^{C}e^{x^{(i)}\beta_k}}
$$

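As a quick illustration (a sketch of mine, not part of the original lesson), the softmax can be computed row-wise in NumPy; subtracting the row maximum before exponentiating is a common stability trick that leaves the result unchanged:

```python
import numpy as np

def softmax(Z: np.ndarray) -> np.ndarray:
    """Row-wise softmax; shifting by the row max avoids overflow without changing the output."""
    Z_shifted = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z_shifted)
    return expZ / expZ.sum(axis=1, keepdims=True)

# Each row sums to 1 and can be read as a categorical distribution over C classes.
print(softmax(np.array([[1.0, 2.0, 3.0],
                        [0.0, 0.0, 0.0],
                        [100.0, 0.0, -100.0]])))
```
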
## Loss in softmax regression
**tl;dr** the key differences from the logistic regression loss are replacing the sigmoid with the softmax and computing a separate gradient for each vector $\beta_j$ corresponding to a particular class $j\in\{1,...,C\}$.

Recall that we use MLE in logistic regression. The same holds for softmax regression, except that instead of a Bernoulli-distributed random variable we have a categorical distribution, which extends the Bernoulli distribution to more than 2 labels. Its PMF is defined as:
$$
f(y|p)=\prod_{j=1}^C p_j^{[j=y]},
$$

Hence, our log-likelihood looks like:
$$
\sum_X \sum_{j=1}^C [y_i=j] \log \left[p_j\left(x_i\right)\right]
$$

where we replace the probability function with the softmax:
$$
\sum_X \sum_{j=1}^C [y_i=j] \log \frac{e^{x_i\beta_j}}{\sum_{k=1}^C e^{x_i\beta_k}}
$$

where $[j=y]$ is the indicator (Iverson bracket) that returns $0$ if $j\neq y$ and $1$ otherwise, and $C$ is the number of distinct classes (labels). You can see that, since we expect a $1\times C$ output for $y$, just like in the neuron backprop problem, we keep a separate vector $\beta_j$ for every class $j$ out of $C$.

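To make the loss concrete, here is a small self-contained sketch (function and variable names are my own) of the summed cross entropy computed from one-hot labels and predicted probabilities, matching the formula above:

```python
import numpy as np

def cross_entropy(Y_onehot: np.ndarray, P: np.ndarray) -> float:
    """Summed CE: negative log probability assigned to the true class, summed over samples."""
    true_idx = np.argmax(Y_onehot, axis=1)           # true class index per sample
    return float(-np.sum(np.log(P[np.arange(len(P)), true_idx])))

Y = np.eye(3)[[0, 2, 1]]                             # one-hot labels for 3 samples, 3 classes
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6],
              [0.2, 0.5, 0.3]])
print(cross_entropy(Y, P))                           # -(log 0.7 + log 0.6 + log 0.5) ≈ 1.56
```
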
## Optimization objective
The optimization objective is the same as with logistic regression. The function we are minimizing is also commonly referred to as **Cross Entropy** (CE):

$$
argmin_\beta -\left[\sum_X \sum_{j=1}^C [y_i=j] \log \frac{e^{x_i\beta_j}}{\sum_{k=1}^C e^{x_i\beta_k}}\right]
$$

Then we once again use the chain rule to compute the partial derivative of $CE$ with respect to $\beta$:

$$
\frac{\partial CE}{\partial\beta^{(j)}_i}=\frac{\partial CE}{\partial\sigma}\frac{\partial\sigma}{\partial[X\beta^{(j)}]}\frac{\partial[X\beta^{(j)}]}{\partial\beta^{(j)}_i}
$$

which eventually reduces to a gradient matrix form similar to the one in logistic regression:
$$
X^T(\sigma(X\beta^{(j)})-Y)
$$

Then we can finally use gradient descent to iteratively update our parameters with respect to a particular class:
$$
\beta^{(j)}_{t+1}=\beta^{(j)}_t - \eta [X^T(\sigma(X\beta^{(j)}_t)-Y)]
$$
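
Stacking the per-class vectors $\beta^{(j)}$ as columns of a single matrix $B$ lets all class updates be performed in one matrix operation, which is essentially the vectorized update used in the implementation below. A minimal self-contained sketch (names and toy data are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack((np.ones((6, 1)), rng.normal(size=(6, 2))))  # 6 samples: bias column + 2 features
Y = np.eye(3)[rng.integers(0, 3, size=6)]                   # one-hot labels, C = 3 classes
B = np.zeros((X.shape[1], 3))                               # one column of parameters per class
eta = 0.05

for _ in range(100):
    Z = X @ B
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)                       # softmax probabilities, shape (N, C)
    B -= eta * X.T @ (P - Y)                                # all per-class updates at once

print(B.round(3))
```
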
Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
import numpy as np


def train_softmaxreg(X: np.ndarray, y: np.ndarray,
                     learning_rate: float, iterations: int) -> tuple[list[list[float]], list[float]]:
    """
    Gradient-descent training for softmax regression that tracks the summed CE loss
    and the accuracy at every iteration (only the parameters and losses are returned).

    Returns
    -------
    B : list[list[float]]
        C x M parameter matrix (one row per class), rounded to 4 decimal places
    losses : list[float]
        collected Cross Entropy values, rounded to 4 decimal places
    """

    def softmax(z):
        return np.exp(z) / np.sum(np.exp(z), axis=1, keepdims=True)

    def accuracy(y_pred, y_true):
        return (np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1)).sum() / len(y_true)

    def ce_loss(y_pred, y_true):
        true_labels_idx = np.argmax(y_true, axis=1)
        return -np.sum(np.log(y_pred)[list(range(len(y_pred))), true_labels_idx])

    y = y.astype(int)
    C = y.max() + 1  # we assume that classes start from 0
    y = np.eye(C)[y]  # one-hot encode the labels
    X = np.hstack((np.ones((X.shape[0], 1)), X))  # prepend a bias column
    B = np.zeros((X.shape[1], C))
    accuracies, losses = [], []

    for epoch in range(iterations):
        y_pred = softmax(X @ B)
        B -= learning_rate * X.T @ (y_pred - y)
        losses.append(round(ce_loss(y_pred, y), 4))
        accuracies.append(round(accuracy(y_pred, y), 4))

    return B.T.round(4).tolist(), losses


def test_train_softmaxreg():
    # Test 1
    X = np.array([[ 2.52569869,  2.33335813,  1.77303921,  0.41061103, -1.66484491],
                  [ 1.51013861,  1.30237106,  1.31989315,  1.36087958,  0.46381252],
                  [-2.09699866, -1.35960405, -1.04035503, -2.25481082, -0.32359947],
                  [-0.96660088, -0.60680633, -0.72017167, -1.73257187, -1.12811486],
                  [-0.38096611, -0.24852455,  0.18789426,  0.52359424,  1.30725962],
                  [ 0.54828787,  0.33156614,  0.10676247,  0.30694669, -0.37555384],
                  [-3.03393135, -2.01966141, -0.6546858 , -0.90330912,  2.89185791],
                  [ 0.28602304, -0.1265    , -0.52209915,  0.28309144, -0.5865882 ],
                  [-0.26268117,  0.76017979,  1.84095557, -0.23245038,  1.80716891],
                  [ 0.30283562, -0.40231495, -1.29550644, -0.1422727 , -1.78121713]])
    y = np.array([2, 3, 0, 0, 1, 3, 0, 1, 2, 1])
    learning_rate = 3e-2
    iterations = 10
    expected_b = [[-0.0841, -0.5693, -0.3651, -0.2423, -0.5344, 0.0339],
                  [0.2566, 0.0535, -0.2104, -0.4004, 0.2709, -0.1461],
                  [-0.1318, 0.2109, 0.3998, 0.523, -0.1001, 0.0545],
                  [-0.0407, 0.3049, 0.1757, 0.1197, 0.3637, 0.0576]]
    expected_losses = [13.8629, 10.7201, 9.3163, 8.4942, 7.9132,
                       7.4598, 7.0854, 6.7653, 6.4851, 6.2358]
    b, ce = train_softmaxreg(X, y, learning_rate, iterations)
    assert b == expected_b and ce == expected_losses, 'Test case 1 failed'

    # Test 2
    X = np.array([[-0.55605887, -0.74922526, -0.1913345 ,  0.41584056],
                  [-1.05481124, -1.13763371, -1.28685937, -1.0710115 ],
                  [-1.17111877, -1.46866663, -0.75898143,  0.15915148],
                  [-1.21725723, -1.55590285, -0.69318542,  0.3580615 ],
                  [-1.90316075, -2.06075824, -2.2952422 , -1.87885386],
                  [-0.79089629, -0.98662696, -0.52955027,  0.07329079],
                  [ 1.97170638,  2.65609694,  0.6802377 , -1.47090364],
                  [ 1.46907396,  1.61396429,  1.69602021,  1.29791351],
                  [ 0.03095068,  0.15148081, -0.34698116, -0.74306029],
                  [-1.40292946, -1.99308861, -0.1478281 ,  1.72332995]])
    y = np.array([1., 0., 0., 1., 0., 1., 0., 1., 0., 1.])
    learning_rate = 1e-2
    iterations = 7
    expected_b = [[-0.0052, 0.0148, 0.0562, -0.113, -0.2488],
                  [0.0052, -0.0148, -0.0562, 0.113, 0.2488]]
    expected_losses = [6.9315, 6.4544, 6.0487, 5.7025, 5.4055, 5.1493, 4.9269]
    b, ce = train_softmaxreg(X, y, learning_rate, iterations)
    assert b == expected_b and ce == expected_losses, 'Test case 2 failed'

    print('All tests passed')


if __name__ == '__main__':
    test_train_softmaxreg()

Problems/106_train_logreg/learn.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
## Overview
Logistic regression is a model used for binary classification problems.

## Prerequisites for a regular logistic regression
Logistic regression is built on the concept of log-odds (the logit). **Odds** are a measure of how frequently we encounter success, and they shift our probability domain from $[0, 1]$ to $[0,\infty)$. Consider a probability of scoring a goal $p=0.8$; then $odds=\frac{0.8}{0.2}=4$, meaning we expect four goals for every miss. So the higher the odds, the more consistent our streak of goals. The **logit** is the inverse of the standard logistic function, i.e. the sigmoid: $logit(p)=\sigma^{-1}(p)=\ln\frac{p}{1-p}$. In our case $p$ is a probability, therefore we call $\frac{p}{1-p}$ the "odds". The logit further expands our domain from $[0,\infty)$ to $(-\infty,\infty)$.

With this domain expansion we can treat our problem as a linear regression and approximate the logit: $X\beta=logit(p)$. However, what we really want from this approximation is predictions for probabilities:
$$
X\beta=\ln\frac{p}{1-p} \\
e^{-X\beta}=\frac{1-p}{p} \\
e^{-X\beta}+1 = \frac{1}{p} \\
p = \frac{1}{e^{-X\beta}+1}
$$

What we have effectively done is take the inverse of the logit of our approximation and arrive back at the sigmoid. This is also the backbone of the regular logistic regression, which is commonly defined as:
$$
\pi=\frac{e^{\alpha+X\beta}}{1+e^{\alpha+X\beta}}=\frac{1}{1+e^{-(\alpha+X\beta)}}.
$$

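As a quick sanity check of this inversion (an illustrative snippet of mine, not part of the original lesson), applying the sigmoid to the logit recovers the original probability:

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))      # log-odds: maps (0, 1) to the whole real line

def sigmoid(z):
    return 1 / (1 + np.exp(-z))     # inverse of the logit

p = np.array([0.1, 0.5, 0.8, 0.99])
print(logit(p))                     # unbounded real values
print(sigmoid(logit(p)))            # recovers [0.1, 0.5, 0.8, 0.99]
```
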
## Loss in logistic regression
The loss function used for solving logistic regression for $\beta$ is derived from MLE (Maximum Likelihood Estimation). This method searches for the $\beta$ that maximizes the **likelihood function** $L(\beta)$, which tells us how likely it is that $X$ came from the distribution generated by $\beta$: $L(\beta)=L(\beta|X)=P(X|\beta)=\prod_{\{x\in X\}}f^{univar}_X(x;\beta)$, where $f$ is a PMF and $univar$ means univariate, i.e. applied to a single variable.

In the case of a regular logistic regression we expect our output to follow a single Bernoulli-distributed random variable (hence the univariance), since each true label is either $y_i=0$ or $y_i=1$. The Bernoulli PMF is defined as $P(Y=y)=p^y(1-p)^{(1-y)}$, where $y\in\{0, 1\}$. Let's also denote $\{x\in X\}$ simply as $X$ and refer to a single pair of vectors from the training set as $(x_i, y_i)$. Thus, our likelihood function looks like:
$$
\prod_X p\left(x_i\right)^{y_i} \times\left[1-p\left(x_i\right)\right]^{1-y_i}
$$

Then we convert the likelihood to a log-likelihood by taking $\ln$ (or $\log$) of it:
$$
\sum_X y_i \log \left[p\left(x_i\right)\right]+\left(1-y_i\right) \log \left[1-p\left(x_i\right)\right]
$$

And then we replace $p(x_i)$ with the sigmoid from the previously derived equality to get the final version of our **loss function**:
$$
\sum_X y_i \log \left(\frac{1}{1+e^{-x_i\beta}}\right)+\left(1-y_i\right)\log \left(1-\frac{1}{1+e^{-x_i\beta}}\right)
$$

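As a small self-contained illustration (the names here are my own), the negative of this log-likelihood, i.e. the binary cross entropy, can be evaluated directly in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(X, y, beta):
    """Summed binary cross entropy: the negative of the log-likelihood above."""
    p = sigmoid(X @ beta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])  # first column acts as the intercept
y = np.array([1, 0, 1])
print(bce(X, y, np.zeros(2)))                        # 3 * log(2) ≈ 2.079 at beta = 0
```
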
## Optimization objective
Recall that we originally wanted to search for the $\beta$ that maximizes the likelihood function. Since $\log$ is a monotonic transformation, our maximization objective does not change, so we can equally search for the $\beta$ that maximizes the log-likelihood. Hence we can finally write our actual objective as:

$$
argmax_\beta [\sum_X y_i \log\sigma(x_i\beta)+\left(1-y_i\right)\log (1-\sigma(x_i\beta))] = \\
= argmin_\beta -[\sum_X y_i \log\sigma(x_i\beta)+\left(1-y_i\right)\log (1-\sigma(x_i\beta))]
$$

where $\sigma$ is the sigmoid. The function we are minimizing is also called the **Binary Cross Entropy** (BCE) loss. To find the minimum we need the gradient of this LLF (Log-Likelihood Function), i.e. the vector of derivatives with respect to every individual $\beta_j$.

### Step 1
To do that we use the chain rule, which decomposes the derivative through the variables our original function is composed of. In our case the log-likelihood depends on the sigmoid $\sigma$, $\sigma$ depends on $X\beta$, and $X\beta$ in turn depends on $\beta_j$, hence:

$$
\frac{\partial LLF}{\partial\beta_j}=\frac{\partial LLF}{\partial\sigma}\frac{\partial\sigma}{\partial[X\beta]}\frac{\partial[X\beta]}{\partial\beta_j}= \\
=-\sum_{i=1}^n\left(y^{(i)} \frac{1}{\sigma\left(x^{(i)}\beta\right)}-(1-y^{(i)} ) \frac{1}{1-\sigma\left(x^{(i)}\beta\right)}\right) \frac{\partial\sigma}{\partial[x^{(i)}\beta]}
$$

### Step 2
Then we use the derivative of the sigmoid function, $\frac{\partial\sigma(x)}{\partial x}=\sigma(x)(1-\sigma(x))$:
$$
-\sum_{i=1}^n\left(y^{(i)} \frac{1}{\sigma\left(x^{(i)}\beta\right)}-(1-y^{(i)} ) \frac{1}{1-\sigma\left(x^{(i)}\beta\right)}\right) \sigma\left(x^{(i)}\beta\right)\left(1-\sigma\left(x^{(i)}\beta\right)\right) \frac{\partial[x^{(i)}\beta]}{\partial\beta_j} = \\
=-\sum_{i=1}^n\left(y^{(i)}\left(1-\sigma\left(x^{(i)}\beta\right)\right)-(1-y^{(i)} ) \sigma\left(x^{(i)}\beta\right)\right) x_j^{(i)} = \\
=-\sum_{i=1}^n\left(y^{(i)}-\sigma\left(x^{(i)}\beta\right)\right) x_j^{(i)} = \\
=\sum_{i=1}^n\left(\sigma\left(x^{(i)}\beta\right)-y^{(i)}\right) x_j^{(i)}.
$$

The resulting sum can then be rewritten in the more convenient gradient matrix form:
$$
X^T(\sigma(X\beta)-Y)
$$

Then we can finally use gradient descent to iteratively update our parameters:
$$
\beta_{t+1}=\beta_t - \eta [X^T(\sigma(X\beta_t)-Y)]
$$
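
Putting the update rule into code, a minimal training sketch could look like the following (function and variable names are my own and this is not the repository's reference solution):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_logreg(X, y, eta=0.1, iterations=100):
    """Plain gradient descent on BCE using the matrix-form gradient X^T (sigma(X beta) - y)."""
    X = np.hstack((np.ones((X.shape[0], 1)), X))   # prepend an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(iterations):
        beta -= eta * X.T @ (sigmoid(X @ beta) - y)
    return beta

# Toy example: the label is 1 exactly when the single feature is positive
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
beta = train_logreg(X, y)
X_b = np.hstack((np.ones((len(X), 1)), X))
print(beta)                                        # intercept near 0, positive slope
print((sigmoid(X_b @ beta) >= 0.5).astype(int))    # [0 0 0 1 1 1]
```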
