Theoretical Description
=======================

Three methods for multi-class uncertainty quantification have been implemented in MAPIE so far:
LABEL [1], Adaptive Prediction Sets (APS) [2, 3] and Top-K [3].
The difference between these methods lies in the way the conformity scores are computed.
The figure below illustrates the three methods implemented in MAPIE:

.. image:: images/classification_methods.png
    :width: 600
    :align: center

For a classification problem in a standard independent and identically distributed (i.i.d.) case,
our training data :math:`(X, Y) = \{(x_1, y_1), \ldots, (x_n, y_n)\}` has an unknown distribution :math:`P_{X, Y}`.

For any risk level :math:`\alpha` between 0 and 1, the methods implemented in MAPIE allow the user to construct a prediction
set :math:`\hat{C}_{n, \alpha}(X_{n+1})` for a new observation :math:`\left( X_{n+1}, Y_{n+1} \right)` with a guarantee
on the marginal coverage such that:

.. math::
    P \{Y_{n+1} \in \hat{C}_{n, \alpha}(X_{n+1}) \} \geq 1 - \alpha

In words, for a typical risk level :math:`\alpha` of :math:`10 \%`, we want to construct prediction sets that contain the true labels
for at least :math:`90 \%` of the new test data points.
Note that the guarantee holds only for the marginal coverage, and not for the conditional coverage
:math:`P \{Y_{n+1} \in \hat{C}_{n, \alpha}(X_{n+1}) | X_{n+1} = x_{n+1} \}`, which depends on the location of the new test point in the distribution.
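
As an illustration, the snippet below is a minimal sketch of how such prediction sets could be requested
from MAPIE for a target coverage of 90 % and how the empirical marginal coverage could be checked.
The import path, the ``cv="prefit"`` strategy and the shape of the returned sets correspond to the
split-conformal workflow and may differ between MAPIE versions; the dataset and the base classifier
are purely illustrative.

.. code-block:: python

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from mapie.classification import MapieClassifier  # import path may vary with the MAPIE version

    # Illustrative data: fit on a training set, calibrate on a held-out calibration set
    X, y = make_classification(n_samples=2000, n_classes=4, n_informative=8, random_state=0)
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
    X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    mapie = MapieClassifier(estimator=clf, cv="prefit")  # conformity score chosen via the `method` argument
    mapie.fit(X_cal, y_cal)

    # Prediction sets for alpha = 0.1, i.e. a target marginal coverage of 90 %
    y_pred, y_ps = mapie.predict(X_test, alpha=0.1)  # y_ps: (n_samples, n_classes, n_alpha) boolean mask

    # Empirical marginal coverage: fraction of test points whose true label belongs to the set
    coverage = y_ps[np.arange(len(y_test)), y_test, 0].mean()
    print(f"Empirical marginal coverage: {coverage:.3f}")
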
1. LABEL
--------

In the LABEL method, the conformity score is defined as one minus the score of the true label. For each point :math:`i` of the calibration set:

.. math::
    s_i(X_i, Y_i) = 1 - \hat{\mu}(X_i)_{Y_i}

Once the conformity scores :math:`{s_1, ..., s_n}` are estimated for all calibration points, we compute the
:math:`\frac{\lceil(n+1)(1-\alpha)\rceil}{n}` quantile :math:`\hat{q}` as follows:

.. math::
    \hat{q} = Quantile \left(s_1, ..., s_n ; \frac{\lceil(n+1)(1-\alpha)\rceil}{n}\right)

Finally, we construct a prediction set by including all labels whose score is at least :math:`1 - \hat{q}`:

.. math::
    \hat{C}(X_{test}) = \{y : \hat{\mu}(X_{test})_y \geq 1 - \hat{q}\}

This simple approach allows us to construct prediction sets that come with a theoretical guarantee on the marginal coverage.
However, although this method generally results in small prediction sets, it tends to produce empty ones when the model is uncertain,
for example at the border between two classes.
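
The three steps above translate directly into NumPy. In the sketch below, ``cal_scores``, ``cal_labels``
and ``test_scores`` are illustrative names for arrays of predicted class scores and true labels; they are
assumptions for the example, not MAPIE internals.

.. code-block:: python

    import numpy as np

    def label_prediction_sets(cal_scores, cal_labels, test_scores, alpha=0.1):
        """Sketch of the LABEL procedure on arrays of predicted class scores.

        cal_scores:  (n_cal, n_classes) scores (e.g. predict_proba) on the calibration set
        cal_labels:  (n_cal,) true labels of the calibration set
        test_scores: (n_test, n_classes) scores on the new test points
        """
        n = len(cal_labels)
        # Conformity scores: one minus the score of the true label
        s = 1.0 - cal_scores[np.arange(n), cal_labels]
        # Conformal quantile level ceil((n+1)(1-alpha))/n (assumes n is large enough for it to be <= 1)
        level = np.ceil((n + 1) * (1 - alpha)) / n
        q_hat = np.quantile(s, level, method="higher")
        # Prediction sets: labels whose score is at least 1 - q_hat
        return test_scores >= 1 - q_hat  # boolean mask of shape (n_test, n_classes)
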
2. Adaptive Prediction Sets (APS)
---------------------------------

The so-called Adaptive Prediction Sets (APS) method overcomes the problem encountered by the LABEL method through the construction of
prediction sets which are by definition non-empty.
The conformity score is computed by summing the ranked scores of each label, from the highest to the lowest, until reaching the true
label of the observation:

.. math::
    s_i(X_i, Y_i) = \sum^k_{j=1} \hat{\mu}(X_i)_{\pi_j} \quad \text{where} \quad Y_i = \pi_k

The quantile :math:`\hat{q}` is then computed in the same way as for the LABEL method.
To construct the prediction set for a new test point, the same ranked summing is applied until reaching the quantile,
as described in the following equation:

.. math::
    \hat{C}(X_{test}) = \{\pi_1, ..., \pi_k\} \quad \text{where} \quad k = \inf\{k : \sum^k_{j=1} \hat{\mu}(X_{test})_{\pi_j} \geq \hat{q}\}

By default, the last label, whose cumulative score goes above the quantile, is included in the prediction set.
However, its inclusion can also be decided randomly, based on the difference between its cumulative score and the quantile, so that the effective
coverage remains close to the target (marginal) coverage. We refer the reader to [2, 3] for more details about this aspect.
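
The sketch below illustrates the APS score and the deterministic set construction (the last label is always
included, without randomization). The quantile :math:`\hat{q}` is assumed to be computed from these scores
exactly as in the LABEL sketch above; all variable names are illustrative.

.. code-block:: python

    import numpy as np

    def aps_conformity_scores(cal_scores, cal_labels):
        """Cumulative score of the ranked labels up to (and including) the true label."""
        order = np.argsort(-cal_scores, axis=1)               # labels ranked from highest to lowest score
        sorted_scores = np.take_along_axis(cal_scores, order, axis=1)
        cumsum = np.cumsum(sorted_scores, axis=1)
        # Position of the true label in the ranking, then the cumulative sum at that position
        true_rank = np.argmax(order == cal_labels[:, None], axis=1)
        return cumsum[np.arange(len(cal_labels)), true_rank]

    def aps_prediction_sets(test_scores, q_hat):
        """Include the top-ranked labels until the cumulative score reaches q_hat."""
        order = np.argsort(-test_scores, axis=1)
        sorted_scores = np.take_along_axis(test_scores, order, axis=1)
        cumsum = np.cumsum(sorted_scores, axis=1)
        # A label is kept when the cumulative score *before* adding it is still below q_hat,
        # so the first label is always included and the set is never empty
        keep_sorted = cumsum - sorted_scores < q_hat
        # Map the mask back from the sorted order to the original label order
        sets = np.zeros_like(test_scores, dtype=bool)
        np.put_along_axis(sets, order, keep_sorted, axis=1)
        return sets
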
3. Top-K
--------

Introduced by [3], the Top-K method returns prediction sets of the same size for all observations.
The conformity score is the rank of the true label, with labels ranked from the highest score to the lowest.
The prediction sets are built by taking the :math:`\hat{q}` labels with the highest scores. The procedure is described in the following equations:

.. math::
    s_i(X_i, Y_i) = j \quad \text{where} \quad Y_i = \pi_j \quad \text{and} \quad \hat{\mu}(X_i)_{\pi_1} > ... > \hat{\mu}(X_i)_{\pi_j} > ... > \hat{\mu}(X_i)_{\pi_n}

.. math::
    \hat{q} = \left \lceil Quantile \left(s_1, ..., s_n ; \frac{\lceil(n+1)(1-\alpha)\rceil}{n}\right) \right\rceil

.. math::
    \hat{C}(X_{test}) = \{\pi_1, ..., \pi_{\hat{q}}\}

As with the other methods, this procedure allows the user to build prediction sets with a guarantee on the marginal coverage.
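
A corresponding NumPy sketch, with the same illustrative conventions as above:

.. code-block:: python

    import numpy as np

    def topk_prediction_sets(cal_scores, cal_labels, test_scores, alpha=0.1):
        """Sketch of Top-K: the conformity score is the rank of the true label."""
        n = len(cal_labels)
        order = np.argsort(-cal_scores, axis=1)                       # labels ranked from highest to lowest score
        ranks = np.argmax(order == cal_labels[:, None], axis=1) + 1   # 1-based rank of the true label
        level = np.ceil((n + 1) * (1 - alpha)) / n
        k = int(np.ceil(np.quantile(ranks, level, method="higher")))  # common size of all prediction sets
        # Keep, for every test point, the k labels with the highest scores
        top_k = np.argsort(-test_scores, axis=1)[:, :k]
        sets = np.zeros_like(test_scores, dtype=bool)
        np.put_along_axis(sets, top_k, True, axis=1)
        return sets
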
4. Split- and cross-conformal methods
-------------------------------------

It should be noted that MAPIE includes split- and cross-conformal strategies for the LABEL and APS methods,
but only the split-conformal one for Top-K.
The implementation of the cross-conformal method follows Algorithm 2 of [2].
In short, conformity scores are calculated for all training instances in a cross-validation fashion from their corresponding out-of-fold models.
By analogy with the CV+ method for regression, the estimation of the prediction sets is performed in four main steps:

- We split the training set into *K* disjoint subsets :math:`S_1, S_2, ..., S_K` of equal size.

- *K* classifiers :math:`\hat{\mu}_{-S_k}` are fitted on the training set with the
  corresponding :math:`k^{th}` fold removed.

- The corresponding *out-of-fold* conformity score is computed for each :math:`i^{th}` training point.

- The conformity scores of the training instances are compared with the scores of each label for each new test point in order to
  decide whether or not the label should be included in the prediction set.
  For the APS method, the prediction set is constructed as follows (see equation 11 of [3]):

.. math::
    C_{n, \alpha}(X_{n+1}) =
    \Big\{ y \in \mathcal{Y} : \sum_{i=1}^n {\rm 1} \Big[ E(X_i, Y_i, U_i; \hat{\pi}^{k(i)}) < E(X_{n+1}, y, U_{n+1}; \hat{\pi}^{k(i)}) \Big] < (1-\alpha)(n+1) \Big\}

where:

- :math:`E(X_i, Y_i, U_i; \hat{\pi}^{k(i)})` is the conformity score of training instance :math:`i`,

- :math:`E(X_{n+1}, y, U_{n+1}; \hat{\pi}^{k(i)})` is the conformity score of label :math:`y` for a new test point.
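
The membership test of the equation above can be written directly in NumPy. In the sketch below,
``train_scores`` (one out-of-fold conformity score per training instance) and ``label_scores``
(one conformity score per candidate label of a new point, computed with the corresponding out-of-fold models)
are assumed to be available from the previous steps; the names are illustrative.

.. code-block:: python

    import numpy as np

    def cross_conformal_set(train_scores, label_scores, alpha=0.1):
        """Membership test of the cross-conformal prediction set for one new point.

        train_scores: (n,) out-of-fold conformity scores E(X_i, Y_i, U_i; pi^{k(i)})
        label_scores: (n, n_classes) conformity scores E(X_{n+1}, y, U_{n+1}; pi^{k(i)})
                      of each candidate label y under each out-of-fold model
        """
        n = len(train_scores)
        # For each label y, count the training instances whose score is strictly smaller
        # than the score of y under the corresponding out-of-fold model
        counts = (train_scores[:, None] < label_scores).sum(axis=0)
        # A label is kept in the set when this count stays below (1 - alpha)(n + 1)
        return counts < (1 - alpha) * (n + 1)  # boolean mask over the labels
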
5. References
-------------

[1] Mauricio Sadinle, Jing Lei, and Larry Wasserman.
"Least Ambiguous Set-Valued Classifiers With Bounded Error Levels."
Journal of the American Statistical Association, 114:525, 223-234, 2019.

[2] Yaniv Romano, Matteo Sesia and Emmanuel J. Candès.
"Classification with Valid and Adaptive Coverage."
NeurIPS 2020 (spotlight).

[3] Anastasios Nikolas Angelopoulos, Stephen Bates, Michael Jordan and Jitendra Malik.
"Uncertainty Sets for Image Classifiers using Conformal Prediction."
International Conference on Learning Representations, 2021.