
Commit 873dbf5

Merge pull request #287 from turkunov/logistic_reg
New Problem: Training Logistic Reg w/ Grad Descent
2 parents 320abc9 + 3859112 commit 873dbf5

File tree

2 files changed (+162, −0 lines)

Problems/106_train_logreg/learn.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
## Overview
Logistic regression is a model used for binary classification problems.

## Prerequisites for a regular logistic regression
Logistic regression is based on the concept of "log-odds". **Odds** are a measure of how frequently we encounter success, and they let us shift our domain from probabilities in $[0, 1]$ to $[0,\infty)$. Consider a probability of scoring a goal $p=0.8$; then our $odds=\frac{0.8}{0.2}=4$, which means that, on average, we can expect four goals for every miss. So the higher the odds, the more consistent our streak of goals. The **logit** is the inverse of the standard logistic function, i.e. the sigmoid: $logit(p)=\sigma^{-1}(p)=\ln\frac{p}{1-p}$. In our case $p$ is a probability, so we call $\frac{p}{1-p}$ the "odds". The logit further expands our domain from $[0,\infty)$ to $(-\infty,\infty)$.
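To make the odds and logit concrete, here is a quick numerical sketch of the goal-scoring example (assuming NumPy; the variable names are just for illustration):

```python
import numpy as np

p = 0.8                 # probability of scoring a goal
odds = p / (1 - p)      # 0.8 / 0.2 = 4.0, lives in [0, inf)
logit = np.log(odds)    # ln(4) ~= 1.386, lives in (-inf, inf)
print(odds, logit)
```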

With this domain expansion we can treat our problem as a linear regression and try to approximate our logit function: $X\beta=logit(p)$. However, what we really want from this approximation is predictions for probabilities:
$$
X\beta=\ln\frac{p}{1-p} \\
e^{-X\beta}=\frac{1-p}{p} \\
e^{-X\beta}+1 = \frac{1}{p} \\
p = \frac{1}{e^{-X\beta}+1}
$$
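As a quick sanity check of this derivation, a minimal sketch (assuming NumPy; `logit` and `sigmoid` are our own helpers) showing that the sigmoid undoes the logit:

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

p = np.array([0.1, 0.5, 0.8, 0.99])
# the sigmoid is the inverse of the logit, so we should recover p
print(np.allclose(sigmoid(logit(p)), p))  # True
```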

What we practically just did is take the inverse of the logit w.r.t. our approximation and go back to the sigmoid. This is also the backbone of the regular logistic regression, which is commonly defined as:
$$
\pi=\frac{e^{\alpha+X\beta}}{1+e^{\alpha+X\beta}}=\frac{1}{1+e^{-(\alpha+X\beta)}}.
$$

## Loss in logistic regression
The loss function used for solving logistic regression for $\beta$ is derived from MLE (Maximum Likelihood Estimation). This method searches for the $\beta$ that maximizes our **likelihood function** $L(\beta)$. This function tells us how likely it is that $X$ has come from the distribution generated by $\beta$: $L(\beta)=L(\beta|X)=P(X|\beta)=\prod_{\{x\in X\}}f^{univar}_X(x;\beta)$, where $f$ is a PMF and $univar$ means univariate, i.e. applied to a single variable.

In the case of a regular logistic regression we expect our output to come from a single Bernoulli-distributed random variable (hence the univariate PMF), since our true label is either $y_i=0$ or $y_i=1$. The Bernoulli PMF is defined as $P(Y=y)=p^y(1-p)^{(1-y)}$, where $y\in\{0, 1\}$. Let's also denote $\{x\in X\}$ simply as $X$ and refer to a single pair of vectors from the training set as $(x_i, y_i)$. Thus, our likelihood function looks like this:
$$
\prod_X p\left(x_i\right)^{y_i} \times\left[1-p\left(x_i\right)\right]^{1-y_i}
$$

Then we convert our function from likelihood to log-likelihood by taking $\ln$ (or $\log$) of it:
$$
\sum_X y_i \log \left[p\left(x_i\right)\right]+\left(1-y_i\right) \log \left[1-p\left(x_i\right)\right]
$$

And then we replace $p(x_i)$ with the sigmoid from the previously derived equality to get the final version of our log-likelihood, which (after negation) will serve as our **loss function**:
$$
\sum_X y_i \log \left(\frac{1}{1+e^{-x_i\beta}}\right)+\left(1-y_i\right)\log \left(1-\frac{1}{1+e^{-x_i\beta}}\right)
$$
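For illustration, a short sketch (assuming NumPy; the toy data and `beta` are made up) that evaluates this sum over a few samples; its negative is the loss we minimize in the next section:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# toy data: 3 samples, 2 features, and an arbitrary parameter vector
X = np.array([[0.5, -1.2], [1.0, 0.3], [-0.7, 0.8]])
y = np.array([1.0, 0.0, 1.0])
beta = np.array([0.1, -0.2])

p = sigmoid(X @ beta)
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
loss = -log_likelihood  # sum-reduced binary cross entropy
print(log_likelihood, loss)
```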

## Optimization objective
Recall that originally we wanted to search for the $\beta$ that maximizes the likelihood function. Since $\log$ is a monotonic transformation, our maximization objective does not change, and we can confidently say that we can equally search for the $\beta$ that maximizes our log-likelihood. Hence we can finally write our actual objective as:

$$
\arg\max_\beta \left[\sum_X y_i \log\sigma(x_i\beta)+\left(1-y_i\right)\log (1-\sigma(x_i\beta))\right] = \\
= \arg\min_\beta -\left[\sum_X y_i \log\sigma(x_i\beta)+\left(1-y_i\right)\log (1-\sigma(x_i\beta))\right]
$$

where $\sigma$ is the sigmoid. The function we're trying to minimize is also called the **Binary Cross Entropy** (BCE) loss function. To find its minimum we need to take the gradient of this LLF (Log-Likelihood Function), i.e. find the vector of derivatives with respect to every individual $\beta_j$.

### Step 1
To do that we're going to use the chain rule, which relates the derivative of a composite function to the derivatives of the variables it is made of. In our case the log-likelihood function depends on the sigmoid $\sigma$, $\sigma$ depends on $X\beta$, and $X\beta$ finally depends on $\beta_j$, hence:

$$
\frac{\partial LLF}{\partial\beta_j}=\frac{\partial LLF}{\partial\sigma}\frac{\partial\sigma}{\partial[X\beta]}\frac{\partial[X\beta]}{\partial\beta_j}= \\
=-\sum_{i=1}^n\left(y^{(i)} \frac{1}{\sigma\left(x^{(i)}\beta\right)}-(1-y^{(i)} ) \frac{1}{1-\sigma\left(x^{(i)}\beta\right)}\right) \frac{\partial\sigma}{\partial[x^{(i)}\beta]}
$$

### Step 2
Then we use the derivative of the sigmoid function, which is $\frac{\partial\sigma(x)}{\partial x}=\sigma(x)(1-\sigma(x))$:
$$
-\sum_{i=1}^n\left(y^{(i)} \frac{1}{\sigma\left(x^{(i)}\beta\right)}-(1-y^{(i)} ) \frac{1}{1-\sigma\left(x^{(i)}\beta\right)}\right) \sigma\left(x^{(i)}\beta\right)\left(1-\sigma\left(x^{(i)}\beta\right)\right) \frac{\partial[x^{(i)}\beta]}{\partial\beta_j} = \\
=-\sum_{i=1}^n\left(y^{(i)}\left(1-\sigma\left(x^{(i)}\beta\right)\right)-(1-y^{(i)} ) \sigma\left(x^{(i)}\beta\right)\right) x_j^{(i)} = \\
=-\sum_{i=1}^n\left(y^{(i)}-\sigma\left(x^{(i)}\beta\right)\right) x_j^{(i)} = \\
=\sum_{i=1}^n\left(\sigma\left(x^{(i)}\beta\right)-y^{(i)}\right) x_j^{(i)}.
$$

The resulting sum can then be rewritten in a more convenient matrix (gradient) form as:
$$
X^T(\sigma(X\beta)-Y)
$$
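A small sketch (assuming NumPy; the finite-difference check and random toy data are ours) verifying that this matrix form matches a numerical gradient of the BCE loss:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(beta, X, y):
    p = sigmoid(X @ beta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = (rng.random(10) > 0.5).astype(float)
beta = rng.normal(size=3)

analytic = X.T @ (sigmoid(X @ beta) - y)

# central finite differences, one coordinate of beta at a time
eps = 1e-6
numeric = np.zeros_like(beta)
for j in range(beta.size):
    e = np.zeros_like(beta)
    e[j] = eps
    numeric[j] = (bce(beta + e, X, y) - bce(beta - e, X, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```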

Then we can finally use gradient descent in order to iteratively update our parameters:
$$
\beta_{t+1}=\beta_t - \eta [X^T(\sigma(X\beta_t)-Y)]
$$
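Putting it all together, a minimal training-loop sketch under stated assumptions (NumPy, synthetic data, a fixed learning rate, and no intercept column, unlike the full solution below):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# synthetic data generated from a known parameter vector
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
true_beta = np.array([1.5, -2.0, 0.5])
y = (rng.random(100) < sigmoid(X @ true_beta)).astype(float)

beta = np.zeros(3)
eta = 0.01  # learning rate
for _ in range(1000):
    # beta_{t+1} = beta_t - eta * X^T (sigma(X beta_t) - y)
    beta -= eta * X.T @ (sigmoid(X @ beta) - y)

print(beta)  # should land roughly near true_beta
```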
Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
import numpy as np


def train_logreg(X: np.ndarray, y: np.ndarray,
                 learning_rate: float, iterations: int) -> tuple[list[float], ...]:
    """
    Gradient-descent training algorithm for logistic regression that collects
    sum-reduced BCE losses and accuracies. Assigns label "0" if P(x_i) <= 0.5
    and "1" otherwise.

    Returns
    -------
    B : list[float]
        1xM updated parameter vector rounded to 4 decimal places
    losses : list[float]
        collected values of the BCE loss function (LLF) rounded to 4 decimal places
    """

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def accuracy(y_pred, y_true):
        return (y_true == np.rint(y_pred)).sum() / len(y_true)

    def bce_loss(y_pred, y_true):
        return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    y = y.reshape(-1, 1)
    # prepend a column of ones so that B[0] acts as the intercept
    X = np.hstack((np.ones((X.shape[0], 1)), X))
    B = np.zeros((X.shape[1], 1))
    accuracies, losses = [], []

    for epoch in range(iterations):
        y_pred = sigmoid(X @ B)
        # gradient step: B <- B - lr * X^T (sigmoid(XB) - y)
        B -= learning_rate * X.T @ (y_pred - y)
        losses.append(round(bce_loss(y_pred, y), 4))
        accuracies.append(round(accuracy(y_pred, y), 4))

    return B.flatten().round(4).tolist(), losses


def test_train_logreg():
    # Test 1
    X = np.array([[ 0.76743473, -0.23413696, -0.23415337,  1.57921282],
                  [-1.4123037 ,  0.31424733, -1.01283112, -0.90802408],
                  [-0.46572975,  0.54256004, -0.46947439, -0.46341769],
                  [-0.56228753, -1.91328024,  0.24196227, -1.72491783],
                  [-1.42474819, -0.2257763 ,  1.46564877,  0.0675282 ],
                  [ 1.85227818, -0.29169375, -0.60063869, -0.60170661],
                  [ 0.37569802,  0.11092259, -0.54438272, -1.15099358],
                  [ 0.19686124, -1.95967012,  0.2088636 , -1.32818605],
                  [ 1.52302986, -0.1382643 ,  0.49671415,  0.64768854],
                  [-1.22084365, -1.05771093, -0.01349722,  0.82254491]])
    y = np.array([1., 0., 0., 0., 1., 1., 0., 0., 1., 0.])
    learning_rate = 1e-3
    iterations = 10
    b, llf = train_logreg(X, y, learning_rate, iterations)
    assert b == [-0.0097, 0.0286, 0.015, 0.0135, 0.0316] and \
        llf == [6.9315, 6.9075, 6.8837, 6.8601, 6.8367, 6.8134, 6.7904, 6.7675, 6.7448, 6.7223], \
        'Test case 1 failed'

    # Test 2
    X = np.array([[ 0.76743473,  1.57921282, -0.46947439],
                  [-0.23415337,  1.52302986, -0.23413696],
                  [ 0.11092259, -0.54438272, -1.15099358],
                  [-0.60063869,  0.37569802, -0.29169375],
                  [-1.91328024,  0.24196227, -1.72491783],
                  [-1.01283112, -0.56228753,  0.31424733],
                  [-0.1382643 ,  0.49671415,  0.64768854],
                  [-0.46341769,  0.54256004, -0.46572975],
                  [-1.4123037 , -0.90802408,  1.46564877],
                  [ 0.0675282 , -0.2257763 , -1.42474819]])
    y = np.array([1., 1., 0., 0., 0., 0., 1., 1., 0., 0.])
    learning_rate = 1e-1
    iterations = 10
    b, llf = train_logreg(X, y, learning_rate, iterations)
    assert b == [-0.2509, 0.9325, 1.6218, 0.6336] and \
        llf == [6.9315, 5.5073, 4.6382, 4.0609, 3.6503, 3.3432, 3.1045, 2.9134, 2.7567, 2.6258], \
        'Test case 2 failed'

    print('All tests passed')


if __name__ == '__main__':
    test_train_logreg()
