
Commit 3cee0f8

Markus Semmler committed
Refactor some text and add legend to density plot.
1 parent fdc4a06 commit 3cee0f8

File tree

2 files changed (+91, -81 lines)


docs/value/classwise-shapley.md

Lines changed: 90 additions & 80 deletions
@@ -4,8 +4,6 @@ title: Class-wise Shapley
# Class-wise Shapley

Class-wise Shapley (CWS) [@schoch_csshapley_2022] offers a Shapley framework
tailored for classification problems. Let $D$ be a dataset, $D_{y_i}$ be the
subset of $D$ with labels $y_i$, and $D_{-y_i}$ be the complement of $D_{y_i}$
@@ -90,35 +88,45 @@ and $g$ for an exploration with different base scores.
)
```

??? info "Surface of the discounted utility function"

    The level curves for $f(x)=x$ and $g(x)=e^x$ are depicted below. The lines
    illustrate the contour lines, annotated with their respective gradients.

    ![Level curves of the class-wise utility](img/classwise-shapley-discounted-utility-function.svg){ align=left width=33% class=invertible }
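
The surface can be visualised with a few lines of code. This is an illustrative sketch, not the code used to generate the figure, and it assumes the discounted utility combines the in-class and out-of-class accuracies as $f(a_\text{in}) \cdot g(a_\text{out})$ with $f(x)=x$ and $g(x)=e^x$:

```python
# Minimal sketch (assumption: the discounted utility has the product form
# f(a_in) * g(a_out) with f(x) = x and g(x) = e^x, as in the level-curve plot).
import numpy as np
import matplotlib.pyplot as plt

a_in = np.linspace(0, 1, 200)    # in-class accuracy a_S(D_{y_i})
a_out = np.linspace(0, 1, 200)   # out-of-class accuracy a_S(D_{-y_i})
X, Y = np.meshgrid(a_in, a_out)
U = X * np.exp(Y)                # assumed discounted utility f(X) * g(Y)

fig, ax = plt.subplots()
contours = ax.contour(X, Y, U, levels=10)
ax.clabel(contours, inline=True, fontsize=8)   # annotate the level curves
ax.set_xlabel("in-class accuracy")
ax.set_ylabel("out-of-class accuracy")
plt.show()
```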

## Evaluation

We evaluate the method on the nine datasets used in [@schoch_csshapley_2022],
using the same pre-processing. For images, PCA is used to project the features found
by a pre-trained `Resnet18` model to 32 principal components. A loc-scale normalization
is performed for all models except gradient boosting, since the latter is not sensitive
to the scale of the features. The following table shows the datasets used in the
experiments; a rough sketch of the pre-processing pipeline follows the table.

| Dataset        | Data Type | Classes | Input Dims | OpenML ID |
|----------------|-----------|---------|------------|-----------|
| Diabetes       | Tabular   | 2       | 8          | 37        |
| Click          | Tabular   | 2       | 11         | 1216      |
| CPU            | Tabular   | 2       | 21         | 197       |
| Covertype      | Tabular   | 7       | 54         | 1596      |
| Phoneme        | Tabular   | 2       | 5          | 1489      |
| FMNIST         | Image     | 2       | 32         | 40996     |
| CIFAR10        | Image     | 2       | 32         | 40927     |
| MNIST (binary) | Image     | 2       | 32         | 554       |
| MNIST (multi)  | Image     | 10      | 32         | 554       |
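
The image pre-processing could be sketched roughly as follows. This is an illustrative outline only, not the code used for the experiments; `images` is a hypothetical, already resized batch of inputs, and the torchvision/sklearn calls stand in for whatever pipeline was actually used:

```python
# Rough sketch of the pre-processing: ResNet18 features -> PCA(32) -> loc-scale
# normalization. `images` is a hypothetical tensor of shape (n_samples, 3, 224, 224).
import torch
from torchvision.models import resnet18
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

model = resnet18(weights="DEFAULT")
model.fc = torch.nn.Identity()        # drop the classification head, keep features
model.eval()

with torch.no_grad():
    features = model(images).numpy()  # (n_samples, 512) feature matrix

features_32 = PCA(n_components=32).fit_transform(features)
x = StandardScaler().fit_transform(features_32)   # loc-scale normalization
```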

In general there are three different experiments: point removal, noise removal and a
distribution analysis. Metrics are reported as tables of the mean and the coefficient of
variation (CV) $\frac{\sigma}{\mu}$ of an inner metric. The former measures the
performance of the method, whereas the latter measures its repeatability; the mean
should be maximized and the CV minimized. Furthermore, we remark that for all
sampling-based valuation methods the same number of _evaluations of the marginal
utility_ was used. This is important to make the algorithms comparable. In practice one
should consider using a more sophisticated stopping criterion.
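
For reference, one cell of such a result table could be computed as in the following sketch; the function name and the five numbers are made up:

```python
# Sketch: summarise one (method, dataset) cell of the result tables.
import numpy as np

def mean_and_cv(runs: np.ndarray) -> tuple[float, float]:
    """Mean and coefficient of variation (sigma / mu) of an inner metric
    (e.g. WAD or AUC) over repeated runs of the experiment."""
    mu = float(runs.mean())
    return mu, float(runs.std() / mu)

# five hypothetical repetitions of one experiment
print(mean_and_cv(np.array([0.62, 0.58, 0.65, 0.60, 0.61])))
```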

### Dataset pruning for logistic regression

Weighted accuracy drop (WAD) [@schoch_csshapley_2022] is defined as

$$
\text{WAD} = \sum_{j=1}^{n} \left ( \frac{1}{j} \sum_{i=1}^{j}
@@ -133,15 +141,16 @@ standard deviation $\sigma_\text{WAD}$. The valuation of the training samples an
evaluation on the validation samples are both calculated based on a logistic regression
model. Let's have a look at the mean

![Weighted accuracy drop (Mean)](img/classwise-shapley-metric-wad-mean.svg){ align=left width=50% class=invertible }

of the metric WAD. The table shows that CWS is competitive with all three other methods.
In all problems except `MNIST (multi)` it is better than TMCS, whereas in that
case TMCS has a slight advantage. Another important quantity is the CV. The results are
shown below.

![Weighted accuracy drop (CV)](img/classwise-shapley-metric-wad-cv.svg){ align=left width=50% class=invertible }

It is noteworthy that CWS is not the best method in terms of CV (lower CV means better
performance). For `CIFAR10`, `Click`, `CPU` and `MNIST (binary)` Beta Shapley has the
@@ -155,84 +164,85 @@ Each line represents five runs, whereas bootstrapping was used to estimate the 9
confidence intervals.

![Accuracy after sample removal using values from logistic regression](img/classwise-shapley-weighted-accuracy-drop-logistic-regression-to-logistic-regression.svg){ class=invertible }

Samples are removed from high to low valuation order and hence we expect a steep
decrease in the curve. Overall we conclude that in terms of mean WAD CWS and TMCS are
the best methods. In terms of CV, CWS and Beta Shapley are the clear winners. Hence, CWS
is a competitive method with a low CV.

### Dataset pruning for neural network by value transfer

Practically more relevant is the transfer of values from one model to another one. As
before, the values are calculated using logistic regression. However, this time they are
used to prune the training set for a neural network. The following plot shows
valuation-set accuracy of the network on the y-axis, and the number of samples removed
on the x-axis.

![Accuracy after sample removal using values transferred from logistic regression to an MLP](img/classwise-shapley-weighted-accuracy-drop-logistic-regression-to-mlp.svg){ class=invertible }

As in the previous experiment, samples are removed from high to low valuation order and
hence we expect a steep decrease in the curve. CWS is competitive with the compared
methods. Especially in very unbalanced datasets, like `Click`, the performance of CWS
seems superior. In other datasets, like `Covertype`, `Diabetes` and `MNIST (multi)`,
the performance is on par with TMCS. For `MNIST (binary)` and `Phoneme` the performance
is competitive.
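
The pruning-and-retraining loop behind these transfer curves could look roughly like the sketch below. It is illustrative only: `values` is assumed to have been computed beforehand with a logistic-regression utility, and the data arrays are hypothetical:

```python
# Illustrative sketch of the value-transfer experiment: values computed with a
# logistic-regression utility are only used to decide which points to drop
# before training an MLP on the remaining data.
import numpy as np
from sklearn.neural_network import MLPClassifier

def removal_curve(values, x_train, y_train, x_val, y_val, steps=10):
    order = np.argsort(values)[::-1]          # highest-valued points first
    accuracies = []
    for k in np.linspace(0, len(order) // 2, steps, dtype=int):
        keep = order[k:]                      # drop the k highest-valued points
        model = MLPClassifier(max_iter=300).fit(x_train[keep], y_train[keep])
        accuracies.append(model.score(x_val, y_val))
    return accuracies                         # one point per removal step
```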

### Detection of mis-labelled data points

The next experiment uses the algorithms to detect mis-labelled data points: 20% of the
indices are selected at random and their labels are flipped. Multi-class datasets are
discarded, because they do not possess a unique flipping strategy. A rough sketch of
this setup is shown below.
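
This is a hypothetical illustration of that setup; `y_train` stands for the binary training labels of one dataset:

```python
# Sketch of the noise-injection step: flip the labels of a random 20% of the
# training indices (binary classification assumed).
import numpy as np

rng = np.random.default_rng(seed=0)
n_flipped = int(0.2 * len(y_train))
flipped_idx = rng.choice(len(y_train), size=n_flipped, replace=False)

y_noisy = y_train.copy()
y_noisy[flipped_idx] = 1 - y_noisy[flipped_idx]   # 0 <-> 1
```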

The following table shows the mean of the area under the curve (AUC) for five runs.

![Area under the Curve (Mean)](img/classwise-shapley-metric-auc-mean.svg){ align=left width=50% class=invertible }

In the majority of the cases TMCS has a slight advantage over CWS on average. For
`Click` CWS has a slight edge, most probably due to the unbalanced nature of `Click`.
The following plot shows the CV for the AUC of the five runs.

![Area under the Curve (CV)](img/classwise-shapley-metric-auc-cv.svg){ align=left width=50% class=invertible }

In terms of CV, CWS has a clear edge over TMCS and Beta Shapley. The receiver operating
characteristic (ROC) curve plots the true positive rate (TPR) against the false positive
rate (FPR): for each prefix length $n$ in the valuation order, the $n$ lowest-valued
samples are flagged as mis-labelled and compared against the actually flipped indices. A
sketch of this construction is shown below.
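
One way to build such a curve from the values is sketched here. It is an illustrative reconstruction, not the original evaluation code; `values` are the computed data values and `flipped_idx` the indices whose labels were actually flipped:

```python
# Sketch: detection ROC from data values. For each prefix length n, the n
# lowest-valued samples are flagged as mis-labelled and compared against the
# indices that were actually flipped.
import numpy as np

def detection_roc(values, flipped_idx):
    order = np.argsort(values)                 # lowest values first
    is_flipped = np.isin(order, flipped_idx)   # ground truth along that order
    n_pos = is_flipped.sum()
    n_neg = len(values) - n_pos
    fpr, tpr = [], []
    for n in range(1, len(values) + 1):
        hits = is_flipped[:n].sum()            # flipped points among the flagged
        tpr.append(hits / n_pos)               # true positive rate
        fpr.append((n - hits) / n_neg)         # false positive rate
    return fpr, tpr
```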

The following plot shows the ROC for the mean of five runs.

![Receiver Operating Characteristic](img/classwise-shapley-roc-auc-logistic-regression.svg){ align=left width=50% class=invertible }

Although it seems that TMCS is the winner, CWS stays competitive once sample efficiency
is taken into account: for a perfectly balanced dataset, CWS needs fewer samples than
TMCS on average. Furthermore, CWS is almost on par with TMCS performance-wise.

### Density of values

This experiment compares the distribution of values for TMCS (green) and CWS (red). Both
methods are chosen due to their competitiveness. The plot shows a histogram as well as
the density estimated by kernel density estimation (KDE) for each dataset.

![Density of TMCS and CWS](img/classwise-shapley-density.svg){ class=invertible }

Similar to the behaviour of the CV from the previous section, the variance of CWS is
lower than for TMCS. Both methods seem to approximate the same form of distribution,
although their utility functions are quite different.

For `Click` TMCS has a multi-modal distribution of values. This is inferior to CWS,
which has only one mode and is more stable on that dataset. `Click` is a very unbalanced
dataset, and we conclude that CWS seems to be more robust on unbalanced datasets.
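
For reference, a plot of this kind (histogram plus KDE overlay per method) can be produced along the following lines; `tmcs_values` and `cws_values` are hypothetical value arrays for a single dataset:

```python
# Sketch: histogram + Gaussian KDE of the values for two methods on one axes.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

def plot_value_density(ax, values, color, label):
    ax.hist(values, bins=30, density=True, alpha=0.4, color=color, label=label)
    grid = np.linspace(values.min(), values.max(), 200)
    ax.plot(grid, gaussian_kde(values)(grid), color=color)  # KDE overlay

fig, ax = plt.subplots()
plot_value_density(ax, tmcs_values, "green", "TMCS")  # hypothetical arrays
plot_value_density(ax, cws_values, "red", "CWS")
ax.legend()
plt.show()
```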

## Conclusion

CWS is a reasonable and effective way to handle classification problems. It reduces the
required computing power and the variance by splitting up the data set into classes.
Given the underlying similarities in the architecture of TMCS, Beta Shapley, and CWS,
there's a clear pathway for improving convergence rates, sample efficiency, and
stabilizing variance for TMCS and Beta Shapley.
docs/value/img/classwise-shapley-density.svg

Lines changed: 1 addition & 1 deletion
