\titledquestion{Kernels (applications)}

Suppose that you have $N$ datapoints $x_i \in \mathcal{X}$ with labels $y_i \in \{-1,1\}$ that you want to classify, where $1\leqslant i \leqslant N$. To do so, you will try two different classical methods, enhanced with the kernel trick. For the whole exercise, we introduce the following kernel-related definitions:
\begin{itemize}
    \item $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS),
    \item $\langle \cdot, \cdot \rangle_{\mathcal{H}} : \mathcal{H} \times \mathcal{H} \to \mathbb{R}$ is the inner product associated with $\mathcal{H}$,
    \item $\| \cdot \|_{\mathcal{H}} : \mathcal{H} \to \mathbb{R}^+$ is the induced norm in $\mathcal{H}$ such that $ \| f \|_{\mathcal{H}} = \sqrt{\langle f, f \rangle_{\mathcal{H}}}$ for any $f\in \mathcal{H}$,
    \item $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is the (symmetric positive-definite) kernel,
    \item $\phi: \mathcal{X} \to \mathcal{H}$ is a feature map such that $k(x,y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$ for any $x,y \in \mathcal{X}$,
    \item $K \in \mathbb{R}^{N \times N}$ is the kernel matrix such that $K_{ij} = k(x_i,x_j) = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}}$.
\end{itemize}

\textbf{A) Method I: hard-margin kernel support-vector machine ($k$-SVM).} $k$-SVM tries to maximize the \textit{margin} of the hyperplane separating the data embedded in the feature space. The decision function for any new input $x \in \mathcal{X}$ thus reads $d_I(x) = \langle v^*, \phi(x) \rangle_{\mathcal{H}} + b^*$, where the direction vector of the hyperplane $v^* \in \mathcal{H}$ and the bias $b^*\in\mathbb{R}$ are found by solving the $k$-SVM optimization problem. Let $\hat{y}_{I}(x)$ be the class predicted by the model at $x$. This decision rule satisfies $\hat{y}_{I}(x) = \sign(d_{I}(x))$, where the sign function is defined as $\sign(t) = -1$ if $t<0$, $\sign(0) = 0$ and $\sign(t) = 1$ if $t>0$.
\begin{enumerate}
    \item A famous theorem allows us to decompose $v^* = \sum_{i=1}^N \phi(x_i) \gamma_i^*$, with $\gamma^* \in \mathbb{R}^N$. What is the name of this theorem?
    \begin{solutionbox}{1cm}
        The \textbf{Representer Theorem}.
    \end{solutionbox}
    \item Write down the (primal) $k$-SVM optimization problem in terms of $\gamma \in \mathbb{R}^N$ and $b\in \mathbb{R}$, using only $K$, $y_i$ and $N$.
    \begin{solutionbox}{7.5cm}
        \begin{align*}
            \min_{\gamma \in \mathbb{R}^N, b \in \mathbb{R}} \quad & \frac{1}{2} \gamma^\top K \gamma \\
            \text{s.t.} \quad & y_i \left( \sum_{j=1}^N K_{ij} \gamma_j + b \right) \geq 1, \quad \forall i \in \{1,\ldots,N\}
        \end{align*}
        or equivalently, $y_i((K\gamma)_i + b) \geq 1$ for all $i$.
    \end{solutionbox}
    \item Let $\gamma^* \in \mathbb{R}^N$ and $b^* \in \mathbb{R}$ be the solutions of the $k$-SVM problem (we assume that they exist and that they are bounded). Give the expression for $d_I(x)$ in terms of $x$, $\gamma^*$, $b^*$, $k$ and $x_i$.

    Moreover, give the expression for $\hat{y}_{I}(x)$.
    \begin{solutionbox}{9cm}
        \begin{align*}
            d_I(x) &= \sum_{i=1}^N \gamma_i^* k(x_i, x) + b^*, \\
            \hat{y}_I(x) &= \sign(d_I(x)) = \sign\left(\sum_{i=1}^N \gamma_i^* k(x_i, x) + b^*\right).
        \end{align*}
    \end{solutionbox}
\end{enumerate}
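As a quick numerical sanity check of the expression for $d_I(x)$, here is a minimal Python sketch. The Gaussian RBF kernel, the toy data, and the coefficients \texttt{gamma\_star}, \texttt{b\_star} are hypothetical placeholders chosen for illustration; in practice the coefficients come from solving the optimization problem above.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) -- one common kernel choice, assumed here."""
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * sigma**2)))

def decision_I(x, X, gamma_star, b_star, k=rbf):
    """d_I(x) = sum_i gamma_i^* k(x_i, x) + b^*."""
    return sum(g * k(xi, x) for g, xi in zip(gamma_star, X)) + b_star

# Hypothetical training points and coefficients (assumed, not solved for).
X = [np.array([-1.0]), np.array([1.0])]
gamma_star, b_star = [-1.0, 1.0], 0.0

print(np.sign(decision_I(np.array([2.0]), X, gamma_star, b_star)))   # class +1
print(np.sign(decision_I(np.array([-2.0]), X, gamma_star, b_star)))  # class -1
```

Note that only kernel evaluations $k(x_i, x)$ are needed at prediction time; the feature map $\phi$ never has to be computed explicitly.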

\newpage
\textbf{B) Method II: (squared) distance to the (lifted) mean.}
We define the centers of each class as
\[
\mu_+ = \frac{1}{n_+} \sum_{i = 1 \atop y_i = 1}^N \phi(x_i),
\qquad
\mu_- = \frac{1}{n_-} \sum_{i = 1 \atop y_i = -1}^N \phi(x_i),
\]
where $n_+$ (resp. $n_-$) is the number of points labeled $+1$ (resp. $-1$). We call $\hat{y}_{II}(x)$ the class predicted for a new point $x$ by the nearest-mean method. The decision rule reads:
\begin{align*}
    \hat{y}_{II}(x) &= \begin{cases} 1 & \textnormal{ if } \hspace{0.5cm} \|\phi(x) - \mu_+\|_{\mathcal{H}}^2 < \|\phi(x) - \mu_-\|_{\mathcal{H}}^2, \\
    0 & \textnormal{ if } \hspace{0.5cm} \|\phi(x) - \mu_+\|_{\mathcal{H}}^2 = \|\phi(x) - \mu_-\|_{\mathcal{H}}^2, \\
    -1 & \textnormal{ if } \hspace{0.5cm} \|\phi(x) - \mu_+\|_{\mathcal{H}}^2 > \|\phi(x) - \mu_-\|_{\mathcal{H}}^2. \end{cases}
\end{align*}
\begin{enumerate}
    \item Propose and motivate an expression for the decision function $d_{II}(x)$ which would satisfy the equation $\hat{y}_{II}(x) = \sign(d_{II}(x))$, using only $\phi(x)$, $\mu_-$, $\mu_+$, $\langle \cdot , \cdot \rangle_{\mathcal{H}}$ and $\| \cdot \|_{\mathcal{H}}^2$.

    Simplify the expression as much as possible.
    \begin{solutionbox}{5.5cm}
        We need to satisfy the condition on $\sign(d_{II}(x))$. A solution is
        \begin{align*}
            d_{II}(x) &= \|\phi(x) - \mu_-\|_{\mathcal{H}}^2 - \|\phi(x) - \mu_+\|_{\mathcal{H}}^2 \\
            &= \|\phi(x)\|_{\mathcal{H}}^2 - 2\langle \phi(x), \mu_- \rangle_{\mathcal{H}} + \|\mu_-\|_{\mathcal{H}}^2 - \|\phi(x)\|_{\mathcal{H}}^2 + 2\langle \phi(x), \mu_+ \rangle_{\mathcal{H}} - \|\mu_+\|_{\mathcal{H}}^2 \\
            &= 2\langle \phi(x), \mu_+ - \mu_- \rangle_{\mathcal{H}} + \|\mu_-\|_{\mathcal{H}}^2 - \|\mu_+\|_{\mathcal{H}}^2.
        \end{align*}
        This choice of $d_{II}(x)$ is natural: it follows directly from the conditions on $\sign(d_{II}(x))$, it simplifies to an affine function of $\phi(x)$, it is cheap to compute, and it is easy to differentiate (e.g.\ for optimization purposes). Working with squared norms is also standard practice (e.g.\ in least-squares methods).
    \end{solutionbox}

    \item Express the squared distances $\|\phi(x)-\mu_+\|_{\mathcal{H}}^2$ and $\|\phi(x)-\mu_-\|_{\mathcal{H}}^2$ using only $x$, $k$, $x_i$, $y_i$, $N$, $n_+$ and $n_-$.
    \begin{solutionbox}{5.5cm}
        \begin{align*}
            \|\phi(x)-\mu_+\|_{\mathcal{H}}^2 &= \|\phi(x)\|_{\mathcal{H}}^2 - 2\langle \phi(x), \mu_+ \rangle_{\mathcal{H}} + \|\mu_+\|_{\mathcal{H}}^2 \\
            &= k(x,x) - \frac{2}{n_+} \sum_{i: y_i = 1} k(x, x_i) + \frac{1}{n_+^2} \sum_{i: y_i = 1} \sum_{j: y_j = 1} k(x_i, x_j), \\
            \|\phi(x)-\mu_-\|_{\mathcal{H}}^2 &= k(x,x) - \frac{2}{n_-} \sum_{i: y_i = -1} k(x, x_i) + \frac{1}{n_-^2} \sum_{i: y_i = -1} \sum_{j: y_j = -1} k(x_i, x_j).
        \end{align*}
    \end{solutionbox}

    \item Just for this subquestion, assume that $n_+ = n_-$ (classes are balanced) and that $\|\mu_+\|_{\mathcal{H}} = \|\mu_-\|_{\mathcal{H}}$. Simplify the rule $\hat{y}_{II}(x)$ to express it only in terms of $x$, $N$, $k$, $x_i$ and $y_i$.

    \textit{Hint:} What is the relation between $N$, $n_+$ and $n_-$?
    \begin{solutionbox}{8cm}
        Since $n_+ = n_- = N/2$ and $\|\mu_+\|_{\mathcal{H}} = \|\mu_-\|_{\mathcal{H}}$, we have:
        \begin{align*}
            d_{II}(x) &= 2\langle \phi(x), \mu_+ - \mu_- \rangle_{\mathcal{H}} \\
            &= \frac{2}{n_+} \sum_{i: y_i = 1} k(x, x_i) - \frac{2}{n_-} \sum_{i: y_i = -1} k(x, x_i) \\
            &= \frac{4}{N} \sum_{i=1}^N y_i k(x, x_i).
        \end{align*}
        Therefore, $\hat{y}_{II}(x) = \sign\left(\frac{4}{N}\sum_{i=1}^N y_i k(x, x_i)\right)=\sign\left(\sum_{i=1}^N y_i k(x, x_i)\right)$.
    \end{solutionbox}
\end{enumerate}
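The kernelized distance formulas of question 2 translate directly into code. The Python sketch below (the Gaussian RBF kernel and the toy data are assumed choices for illustration) implements the nearest-mean rule using only kernel evaluations, and checks that on a balanced dataset with symmetric classes it agrees with the simplified rule $\sign\left(\sum_i y_i k(x,x_i)\right)$ of question 3.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Gaussian RBF kernel (an assumed choice for illustration)."""
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * sigma**2)))

def sq_dist_to_mean(x, Xc, k=rbf):
    """||phi(x) - mu_c||^2 expressed with kernel evaluations only."""
    n = len(Xc)
    return (k(x, x)
            - 2.0 / n * sum(k(x, xi) for xi in Xc)
            + 1.0 / n**2 * sum(k(xi, xj) for xi in Xc for xj in Xc))

def predict_II(x, X, y, k=rbf):
    """Nearest-mean rule: sign of d_II(x) = ||phi(x)-mu_-||^2 - ||phi(x)-mu_+||^2."""
    Xp = [xi for xi, yi in zip(X, y) if yi == 1]
    Xm = [xi for xi, yi in zip(X, y) if yi == -1]
    return int(np.sign(sq_dist_to_mean(x, Xm, k) - sq_dist_to_mean(x, Xp, k)))

def predict_balanced(x, X, y, k=rbf):
    """Simplified rule, valid for balanced classes with equal-norm means."""
    return int(np.sign(sum(yi * k(x, xi) for xi, yi in zip(X, y))))

# Two symmetric, well-separated toy clusters (balanced classes).
X = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
y = [-1, -1, 1, 1]
print(predict_II((5.0, 5.0), X, y), predict_balanced((5.0, 5.0), X, y))  # 1 1
print(predict_II((0.0, 0.0), X, y), predict_balanced((0.0, 0.0), X, y))  # -1 -1
```

Here both classes contain two points at distance $1$, so the within-class kernel sums are equal and the assumption $\|\mu_+\|_{\mathcal{H}} = \|\mu_-\|_{\mathcal{H}}$ of question 3 holds.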

\newpage
\textbf{C) Methods comparison.}
\begin{enumerate}
    \item Are the decision functions $d_{I}(x)$ and $d_{II}(x)$ affine functions of $\phi(x)$ in $\mathcal{H}$? Justify briefly.
    \begin{solutionbox}{8.5cm}
        Yes, both are affine functions of $\phi(x)$:
        \begin{itemize}
            \item $d_I(x) = \langle v^*, \phi(x) \rangle_{\mathcal{H}} + b^*$ is affine (linear plus constant).
            \item $d_{II}(x) = \langle 2( \mu_+ - \mu_- ),\phi(x)\rangle_{\mathcal{H}} + \|\mu_-\|_{\mathcal{H}}^2 - \|\mu_+\|_{\mathcal{H}}^2$ is also affine (linear plus constant).
        \end{itemize}
        Both have the form $\langle w, \phi(x) \rangle_{\mathcal{H}} + c$ for some $w \in \mathcal{H}$ and $c \in \mathbb{R}$.
    \end{solutionbox}
    \item If possible, give simple conditions under which both methods would be equivalent, meaning that for any new datapoint $x\in \mathcal{X}$, they would always predict the same class.
    \begin{solutionbox}{9cm}
        The methods are equivalent iff $d_I(x)$ and $d_{II}(x)$ have the same sign for every $x$. A simple sufficient condition is $d_I(x) = c\, d_{II}(x)$ for some constant $c>0$; taking $c=1$, i.e. $d_{I}(x)=d_{II}(x)$, yields the conditions
        \[
        v^* = 2(\mu_+-\mu_-), \qquad b^* = \|\mu_-\|_{\mathcal{H}}^2 - \|\mu_+\|_{\mathcal{H}}^2.
        \]
    \end{solutionbox}
    \item What method would you prefer to use for your classification task? Choose a method and give at least two arguments in favor of it.
    \begin{solutionbox}{10cm}
        I would prefer \textbf{Method I ($k$-SVM)} for the following reasons:
        \begin{enumerate}
            \item \textbf{Maximum margin:} maximizing the margin is a principled way to choose the separating hyperplane.
            \item \textbf{Sparse and scalable:} the solution typically involves only a subset of the training points (the support vectors), which makes inference more efficient and the method scalable to larger or more complex datasets.
            \item \textbf{Theoretical guarantees:} SVMs have strong theoretical foundations, with generalization bounds based on the margin.
            \item \textbf{Improvable:} it extends to the soft-margin $k$-SVM for more robustness to outliers and better generalization.
        \end{enumerate}
        I would prefer \textbf{Method II (nearest mean)} for the following reasons:
        \begin{enumerate}
            \item \textbf{Simple and interpretable:} the motivation is intuitive, and the rule is easy to understand and interpret.
            \item \textbf{No training:} no optimization problem needs to be solved to obtain the coefficients.
            \item \textbf{Non-separable data:} it still works when the data are not linearly separable in $\mathcal{H}$.
            \item \textbf{Adaptable:} it is easily updated when new datapoints arrive.
        \end{enumerate}
    \end{solutionbox}
\end{enumerate}
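The equivalence conditions above can be verified numerically: setting $\gamma_i^* = 2 y_i / n_{y_i}$ (so that $v^* = \sum_i \phi(x_i)\gamma_i^* = 2(\mu_+ - \mu_-)$) and $b^* = \|\mu_-\|_{\mathcal{H}}^2 - \|\mu_+\|_{\mathcal{H}}^2$ makes $d_I$ and $d_{II}$ coincide everywhere. A minimal Python sketch, with an assumed RBF kernel and toy data:

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Gaussian RBF kernel (assumed choice for illustration)."""
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * sigma**2)))

X = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]  # assumed toy data
y = [-1, -1, 1, 1]
Xp = [xi for xi, yi in zip(X, y) if yi == 1]
Xm = [xi for xi, yi in zip(X, y) if yi == -1]

def mean_norm_sq(Xc):
    """||mu_c||^2 computed via kernel evaluations."""
    return sum(rbf(a, b) for a in Xc for b in Xc) / len(Xc) ** 2

# Coefficients chosen to satisfy the equivalence conditions.
gamma = [2.0 * yi / (len(Xp) if yi == 1 else len(Xm)) for yi in y]
bias = mean_norm_sq(Xm) - mean_norm_sq(Xp)

def d_I(x):
    return sum(g * rbf(xi, x) for g, xi in zip(gamma, X)) + bias

def sq_dist_to_mean(x, Xc):
    n = len(Xc)
    return rbf(x, x) - 2.0 / n * sum(rbf(x, xi) for xi in Xc) + mean_norm_sq(Xc)

def d_II(x):
    return sq_dist_to_mean(x, Xm) - sq_dist_to_mean(x, Xp)

for q in [(2.0, 1.0), (4.0, 4.0), (-1.0, 0.0)]:
    assert abs(d_I(q) - d_II(q)) < 1e-9  # the two decision functions coincide
```

Of course, these hand-picked coefficients will not in general solve the hard-margin $k$-SVM problem; the sketch only illustrates that the stated conditions make the two decision functions identical.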
\clearpage