@@ -4,8 +4,6 @@ title: Class-wise Shapley
44
55# Class-wise Shapley
66
7- ## AlgorIntroductionithm
8-
97Class-wise Shapley (CWS) [ @schoch_csshapley_2022] offers a Shapley framework
108tailored for classification problems. Let $D$ be a dataset, $D_ {y_i}$ be the
119subset of $D$ with labels $y_i$, and $D_ {-y_i}$ be the complement of $D_ {y_i}$
@@ -90,35 +88,45 @@ and $g$ for an exploration with different base scores.
9088 )
9189 ```
9290
93- The level curves for $f(x)=x$ and $g(x)=e^x$ are depicted below. The lines
94- illustrate the contour lines, annotated with their respective gradients.
95-
96- ![ Level curves of the class-wise utility] ( img/classwise-shapley-discounted-utility-function.svg ) { align=left width=33% class=invertible }
91+ ??? info "Surface of the discounted utility function"
92+ The level curves for $f(x)=x$ and $g(x)=e^x$ are depicted below. The lines
93+ illustrate the contour lines, annotated with their respective gradients.
94+ ![ Level curves of the class-wise
95+ utility] ( img/classwise-shapley-discounted-utility-function.svg ) { align=left width=33% class=invertible }
9796
9897## Evaluation
9998
10099We evaluate the method on the nine datasets used in [ @schoch_csshapley_2022] ,
101- using the same pre-processing. For images, PCA is used to reduce down to 32 the
102- number of features found by a ` Resnet18 ` model. For more details on the
103- pre-processing steps, please refer to the paper.
104-
105- ??? info "Datasets used for evaluation"
106- | Dataset | Data Type | Classes | Input Dims | OpenML ID |
107- |----------------|-----------|---------|------------|-----------|
108- | Diabetes | Tabular | 2 | 8 | 37 |
109- | Click | Tabular | 2 | 11 | 1216 |
110- | CPU | Tabular | 2 | 21 | 197 |
111- | Covertype | Tabular | 7 | 54 | 1596 |
112- | Phoneme | Tabular | 2 | 5 | 1489 |
113- | FMNIST | Image | 2 | 32 | 40996 |
114- | CIFAR10 | Image | 2 | 32 | 40927 |
115- | MNIST (binary) | Image | 2 | 32 | 554 |
116- | MNIST (multi) | Image | 10 | 32 | 554 |
117-
118- ### Performance for (direct) point removal
119-
120- We compare the mean and the coefficient of variation (CV) of the weighted accuracy drop
121- (WAD) as proposed in [ @schoch_csshapley_2022] . The metric is defined by
100+ using the same pre-processing. For images, PCA is used to project the features
101+ extracted by a pre-trained ` Resnet18 ` model onto 32 principal components. A loc-scale
102+ normalization is performed for all models except gradient boosting, which is not
103+ sensitive to the scale of the features. The following table shows the datasets used in
104+ the experiments.
105+
106+ | Dataset | Data Type | Classes | Input Dims | OpenML ID |
107+ | ----------------| -----------| ---------| ------------| -----------|
108+ | Diabetes | Tabular | 2 | 8 | 37 |
109+ | Click | Tabular | 2 | 11 | 1216 |
110+ | CPU | Tabular | 2 | 21 | 197 |
111+ | Covertype | Tabular | 7 | 54 | 1596 |
112+ | Phoneme | Tabular | 2 | 5 | 1489 |
113+ | FMNIST | Image | 2 | 32 | 40996 |
114+ | CIFAR10 | Image | 2 | 32 | 40927 |
115+ | MNIST (binary) | Image | 2 | 32 | 554 |
116+ | MNIST (multi) | Image | 10 | 32 | 554 |
117+
118+ There are three types of experiments: point removal, noise removal and an analysis of
119+ the distribution of values. For each we report the mean and the coefficient of
120+ variation (CV) $\frac{\sigma}{\mu}$ of an inner metric across runs. The former
121+ measures the performance of a method, the latter its repeatability, so a higher mean
122+ and a lower CV are better. We remark that the same number of _ evaluations of the
123+ marginal utility_ was used for all sampling-based valuation methods, which makes the
124+ algorithms comparable. In practice one should consider using a more sophisticated
125+ stopping criterion.
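As a sketch, the two aggregates for a single method/dataset cell of such a table could be computed as follows (the scores below are made up for illustration):

```python
import numpy as np

# Hypothetical inner-metric scores (e.g. WAD) from five repeated runs of
# one method on one dataset
runs = np.array([0.21, 0.19, 0.22, 0.20, 0.18])

mean = runs.mean()      # higher is better: performance
cv = runs.std() / mean  # lower is better: repeatability (sigma / mu)
```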
126+
127+ ### Dataset pruning for logistic regression
128+
129+ Weighted accuracy drop (WAD) [ @schoch_csshapley_2022] is defined as
122130
123131$$
124132\text{WAD} = \sum_{j=1}^{n} \left ( \frac{1}{j} \sum_{i=1}^{j}
@@ -133,15 +141,16 @@ standard deviation $\sigma_\text{WAD}$. The valuation of the training samples an
133141evaluation on the validation samples are both calculated based on a logistic regression
134142model. Let's have a look at the mean
135143
136- ![ Weighted accuracy drop (Mean)] ( img/classwise-shapley-metric-wad-mean.svg ) { align=left width=50% class=invertible }
144+ ![ Weighted accuracy drop
145+ (Mean)] ( img/classwise-shapley-metric-wad-mean.svg ) { align=left width=50% class=invertible }
137146
138147of the metric WAD. The table shows that CWS is competitive with all three other methods.
139148In all problems except ` MNIST (multi) ` it is better than TMCS, whereas in that
140- case TMCS has a slight advantage. Another important quantity is the CV
141- $\frac{\sigma_ \text{WAD}}{\mu_ \text{WAD}}$. It normalizes the standard
142- deviation by the mean. The results are shown below.
149+ case TMCS has a slight advantage. Another important quantity is the CV. The results are
150+ shown below.
143151
144- ![ Weighted accuracy drop (CV)] ( img/classwise-shapley-metric-wad-cv.svg ) { align=left width=50% class=invertible }
152+ ![ Weighted accuracy drop
153+ (CV)] ( img/classwise-shapley-metric-wad-cv.svg ) { align=left width=50% class=invertible }
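Assuming the inner sum in the definition above telescopes (so that term $j$ reduces to the total accuracy drop after removing the $j$ highest-valued samples, divided by $j$), WAD can be sketched directly from a sequence of validation accuracies; the accuracy values here are made up:

```python
# accuracies[j] is the (hypothetical) validation accuracy after removing the
# j highest-valued training samples; accuracies[0] uses the full training set.
accuracies = [0.90, 0.80, 0.70, 0.65]

# Telescoped form of the WAD sum
wad = sum((accuracies[0] - accuracies[j]) / j for j in range(1, len(accuracies)))
```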
145154
146155It is noteworthy that CWS is not the best method in terms of CV (a lower CV means
147156better performance). For ` CIFAR10 ` , ` Click ` , ` CPU ` and ` MNIST (binary) ` Beta Shapley has the
@@ -155,84 +164,85 @@ Each line represents five runs, whereas bootstrapping was used to estimate the 9
155164confidence intervals.
156165
157166
158- ![ Accuracy after sample removal using values from logistic regression] ( img/classwise-shapley-weighted-accuracy-drop-logistic-regression-to-logistic-regression.svg ) { class=invertible }
167+ ![ Accuracy after sample removal using values from logistic
168+ regression] ( img/classwise-shapley-weighted-accuracy-drop-logistic-regression-to-logistic-regression.svg ) { class=invertible }
159169
160170Samples are removed from high to low valuation order and hence we expect a steep
161171decrease in the curve. Overall we conclude that in terms of mean WAD CWS and TMCS are
162172the best methods. In terms of CV, CWS and Beta Shapley are the clear winners. Hence, CWS
163- is a competitive method for valuation of data sets with a low relative standard
164- deviation. We remark that for all valuation methods the same number of _ evaluations of
165- the marginal utility_ was used.
173+ is a competitive method with a low CV.
166174
167- ### Performance in value transfer for point removal
175+ ### Dataset pruning for a neural network by value transfer
168176
169177Practically more relevant is the transfer of values from one model to another. As
170178before, the values are calculated using logistic regression. However, this time they are
171179used to prune the training set for a neural network. The following plot shows
172180valuation-set accuracy of the network on the y-axis, and the number of samples removed
173181on the x-axis.
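Mechanically, the transfer only reuses the removal order: values computed with the source model determine which training points the target model sees. A sketch of the index logic with hypothetical values (no model training):

```python
import numpy as np

# Hypothetical values computed with the source model (logistic regression)
values = np.array([0.3, -0.1, 0.7, 0.05, 0.5])

# Remove samples from high to low value: indices sorted by descending value
removal_order = np.argsort(-values)

# After pruning the two highest-valued points, the remaining training
# indices would be used to re-train the target model (e.g. a neural network)
remaining = np.sort(removal_order[2:])
```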
174182
175- ![ Accuracy after sample removal using values transferred from logistic regression
176- to an MLP] ( img/classwise-shapley-weighted-accuracy-drop-logistic-regression-to-mlp.svg ) { class=invertible }
177-
178- Again samples are removed from high to low valuation order and hence we expect a steep
179- decrease in the curve. CWS is competitive with the compared methods. Especially
180- in very unbalanced datasets, like ` Click ` , the performance of CWS seems
181- superior. In other datasets, like ` Covertype ` and ` Diabetes ` and ` MNIST (multi) `
182- the performance is on par with TMC. For ` MNIST (binary) ` and ` Phoneme ` the
183- performance is competitive.
183+ ![ Accuracy after sample removal using values transferred from logistic regression to an
184+ MLP] ( img/classwise-shapley-weighted-accuracy-drop-logistic-regression-to-mlp.svg ) { class=invertible }
184185
185- ### Density of values
186-
187- This experiment compares the distribution of values for TMCS (green) and CWS
188- (red). Both methods are chosen due to their competivieness. The following plots show a
189- histogram as well as the density estimated by kernel density estimation (KDE).
186+ As in the previous experiment, samples are removed from high to low valuation order
187+ and hence we expect a steep decrease in the curve. CWS is competitive with the compared
188+ methods. Especially in very unbalanced datasets, like ` Click ` , the performance of CWS
189+ seems superior. In other datasets, like ` Covertype ` , ` Diabetes ` and ` MNIST (multi) ` ,
190+ the performance is on par with TMCS. For ` MNIST (binary) ` and ` Phoneme ` the performance
191+ is competitive.
190192
193+ ### Detection of mis-labelled data points
191194
192- ![ Density of TMCS and CWS] ( img/classwise-shapley-density.svg ) { class=invertible }
193-
194- As apparent in the metric CV from the previous section, the variance of CWS is lower
195- than for TCMS. They seem to approximate the same form of distribution, although their
196- utility functions are different.
197-
198- For ` Click ` TMCS has a multi-modal distribution of values. This is inferior to CWS which
199- has only one-mode and is more stable on that dataset. ` Click ` is a very unbalanced
200- dataset, and we conclude that CWS seems to be more robust on unbalanced datasets.
201-
202- ### Noise removal for 20% of the flipped data
203-
204- Another type of experiment uses the algorithms to explore mis-labelled data points. The
205- indices are chosen randomly. Multi-class datasets are discarded, because they do not
195+ The next experiment uses the algorithms to detect mis-labelled data points. The labels
196+ of 20% of the indices, selected at random, are flipped. Multi-class datasets are discarded, because they do not
206197possess a unique flipping strategy. The following table shows the mean of the area under
207- the curve (AUC) of five runs.
198+ the curve (AUC) for five runs.
208199
209- ![ Area under the Curve (Mean)] ( img/classwise-shapley-metric-auc-mean.svg ) { align=left width=50% class=invertible }
200+ ![ Area under the Curve
201+ (Mean)] ( img/classwise-shapley-metric-auc-mean.svg ) { align=left width=50% class=invertible }
210202
211203In the majority of the cases TMCS has a slight advantage over CWS on average. For
212204` Click ` CWS has a slight edge, most probably due to the unbalanced nature of ` Click ` .
213205The following plot shows the CV for the AUC of the five runs.
214206
215- ![ Area under the Curve (CV)] ( img/classwise-shapley-metric-auc-cv.svg ) { align=left width=50% class=invertible }
207+ ![ Area under the Curve
208+ (CV)] ( img/classwise-shapley-metric-auc-cv.svg ) { align=left width=50% class=invertible }
216209
217- In terms of CV, CWS has a clear edge over TMCS and Beta Shapley. The following plot
218- shows the receiving operator characteristic (ROC) for the mean of five runs.
210+ In terms of CV, CWS has a clear edge over TMCS and Beta Shapley. The receiver
211+ operating characteristic (ROC) curve plots the true positive rate against the false
212+ positive rate: here the classifier flags the $n$ smallest values with respect to the
213+ valuation order as mis-labelled. The following plot shows the ROC curves for the mean
214+ of five runs.
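The curve can be traced by flagging the $k$ lowest-valued samples as mis-labelled for increasing $k$ and computing true and false positive rates against the known flip mask. A minimal sketch with made-up values:

```python
import numpy as np

# Hypothetical values and ground-truth mask of flipped labels
values = np.array([0.4, -0.2, 0.1, -0.5, 0.3])
flipped = np.array([False, True, False, True, False])

order = np.argsort(values)  # lowest values are flagged first
tpr, fpr = [], []
for k in range(1, len(values) + 1):
    flagged = np.zeros(len(values), dtype=bool)
    flagged[order[:k]] = True
    # True/false positive rates for this prefix of the valuation order
    tpr.append((flagged & flipped).sum() / flipped.sum())
    fpr.append((flagged & ~flipped).sum() / (~flipped).sum())
```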
219215
220- ![ Receiver Operating Characteristic] ( img/classwise-shapley-roc-auc-logistic-regression.svg ) { align=left width=50% class=invertible }
216+ ![ Receiver Operating
217+ Characteristic] ( img/classwise-shapley-roc-auc-logistic-regression.svg ) { align=left width=50% class=invertible }
221218
222- The ROC curve is a plot of the true positive rate (TPR) against the false positive rate
223- (FPR). The TPR is the ratio of correctly classified positive samples to all positive
224- samples. The FPR is the ratio of incorrectly classified negative samples to all negative
225- samples. This tuple is calculated for all prefixes of the training set with respect to
226- the values. Although it seems that TMCS is the winner, considering sample efficiency,
219+ Although it seems that TMCS is the winner, when one considers sample efficiency,
227220CWS stays competitive. For a perfectly balanced dataset, CWS needs fewer samples than
228- TCMS on average. CWS is competitive and almost on par with TCMS, while requring less
229- samples on average.
221+ TMCS on average. Furthermore, CWS is almost on par with TMCS performance-wise.
222+
223+ ### Density of values
224+
225+ This experiment compares the distribution of values for TMCS (green) and CWS
226+ (red). Both methods are chosen due to their competitiveness. The plot shows a
227+ histogram as well as the density estimated by kernel density estimation (KDE) for each
228+ dataset.
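Such a density overlay can be reproduced with a histogram plus a Gaussian KDE. The sketch below uses synthetic values and a hand-rolled fixed-bandwidth kernel; in practice `scipy.stats.gaussian_kde` would be used instead:

```python
import numpy as np

# Synthetic stand-in for the values of one method on one dataset
rng = np.random.default_rng(0)
values = rng.normal(0.0, 0.1, size=200)

# Normalized histogram of the values
hist, edges = np.histogram(values, bins=20, density=True)

def kde(x, data, bw=0.02):
    # Gaussian kernel density estimate with fixed bandwidth bw
    return np.mean(np.exp(-0.5 * ((x - data) / bw) ** 2)) / (bw * np.sqrt(2 * np.pi))

grid = np.linspace(values.min(), values.max(), 100)
density = np.array([kde(x, values) for x in grid])
```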
229+
230+ ![ Density of TMCS and
231+ CWS] ( img/classwise-shapley-density.svg ) { class=invertible }
232+
233+ Consistent with the CV results from the previous section, the variance of CWS is
234+ lower than for TMCS. They seem to approximate the same distribution, although their
235+ utility functions are quite different.
236+
237+ For ` Click ` TMCS has a multi-modal distribution of values. This is inferior to CWS,
238+ which has only one mode and is more stable on that dataset. ` Click ` is a very unbalanced
239+ dataset, and we conclude that CWS seems to be more robust on unbalanced datasets.
230240
231241## Conclusion
232242
233243CWS is a reasonable and effective way to handle classification problems. It reduces
234244the computational cost and variance by splitting up the data set into classes. Given the
235245underlying similarities in the architecture of TMCS, Beta Shapley, and CWS, there's a
236- clear pathway for improving convergence rates, sample efficiency, and stabilize variance
237- for TMCS and Beta Shapley.
246+ clear pathway for improving convergence rates, sample efficiency, and stabilizing
247+ variance for TMCS and Beta Shapley.
238248