
Commit 67a5b06

Merge pull request #520 from aai-institute/feature/msr-banzhaf
MSR method for Banzhaf
2 parents be14b2b + 426b867 commit 67a5b06

File tree

22 files changed: +2189, -73 lines


.notebook_test_durations

Lines changed: 10 additions & 9 deletions
@@ -1,11 +1,12 @@
 {
-  "notebooks/data_oob.ipynb::": 14.608769827000287,
-  "notebooks/influence_imagenet.ipynb::": 13.570316236000508,
-  "notebooks/influence_sentiment_analysis.ipynb::": 20.546479973001624,
-  "notebooks/influence_synthetic.ipynb::": 5.9324631089984905,
-  "notebooks/influence_wine.ipynb::": 16.114133220999065,
-  "notebooks/least_core_basic.ipynb::": 14.312467472000208,
-  "notebooks/shapley_basic_spotify.ipynb::": 15.608795123000164,
-  "notebooks/shapley_knn_flowers.ipynb::": 3.9430189769991557,
-  "notebooks/shapley_utility_learning.ipynb::": 26.96671833400069
+  "notebooks/data_oob.ipynb::": 14.514983271001256,
+  "notebooks/influence_imagenet.ipynb::": 15.937124550999215,
+  "notebooks/influence_sentiment_analysis.ipynb::": 26.479645616000198,
+  "notebooks/influence_synthetic.ipynb::": 6.61773010700017,
+  "notebooks/influence_wine.ipynb::": 16.312171267998565,
+  "notebooks/least_core_basic.ipynb::": 14.375480750999486,
+  "notebooks/msr_banzhaf_digits.ipynb::": 106.6507187110019,
+  "notebooks/shapley_basic_spotify.ipynb::": 15.657225806997303,
+  "notebooks/shapley_knn_flowers.ipynb::": 3.9943819290019746,
+  "notebooks/shapley_utility_learning.ipynb::": 25.939783253001224
 }

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -6,6 +6,9 @@
 
 - New method: `NystroemSketchInfluence`
   [PR #504](https://github.com/aai-institute/pyDVL/pull/504)
+- New method `MSR Banzhaf` with accompanying notebook, and new stopping
+  criterion `RankCorrelation`
+  [PR #520](https://github.com/aai-institute/pyDVL/pull/520)
 - New preconditioned block variant of conjugate gradient
   [PR #507](https://github.com/aai-institute/pyDVL/pull/507)
 - Improvements to documentation: fixes, links, text, example gallery, LFS and
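The `RankCorrelation` stopping criterion named in this CHANGELOG entry halts the computation once the ranking of the estimated values stops changing between successive checks. As a rough sketch of that idea only — not pyDVL's actual implementation, and `rank_correlation_converged` is a hypothetical helper — one could compare consecutive value vectors with Spearman's rank correlation:

```python
# Hedged sketch of a rank-correlation stopping rule; the real
# pydvl.value.stopping.RankCorrelation criterion may differ in detail.
import numpy as np
from scipy.stats import spearmanr

def rank_correlation_converged(prev_values: np.ndarray,
                               new_values: np.ndarray,
                               rtol: float = 0.001) -> bool:
    """Stop when the ranking of values barely changes between two checks."""
    corr, _ = spearmanr(prev_values, new_values)
    return bool(corr >= 1.0 - rtol)
```

In pyDVL the criterion is passed through the `done=` argument, as in the `done=RankCorrelation(rtol=0.001)` call shown in the documentation diff further below.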

docs/assets/pydvl.bib

Lines changed: 1 addition & 1 deletion
@@ -450,7 +450,7 @@ @book{trefethen_numerical_1997
   langid = {english}
 }
 
-@inproceedings{wang_data_2022,
+@inproceedings{wang_data_2023,
   title = {Data {{Banzhaf}}: {{A Robust Data Valuation Framework}} for {{Machine Learning}}},
   shorttitle = {Data {{Banzhaf}}},
   booktitle = {Proceedings of {{The}} 26th {{International Conference}} on {{Artificial Intelligence}} and {{Statistics}}},

docs/examples/index.md

Lines changed: 9 additions & 0 deletions
@@ -53,6 +53,15 @@ alias:
 
     [![](img/data_oob.png)](data_oob/)
 
+- [__Faster Banzhaf values__](msr_banzhaf_digits/)
+
+    ---
+
+    Using Banzhaf values to estimate the value of data points in MNIST, and
+    evaluating convergence speed of MSR.
+
+    [![](img/msr_banzhaf_digits.png)](msr_banzhaf_digits/)
+
 </div>
 
docs/getting-started/glossary.md

Lines changed: 9 additions & 0 deletions
@@ -147,6 +147,15 @@ performance when that point is removed from the training set.
 * [Implementation][pydvl.value.loo.loo.compute_loo]
 * [Documentation][leave-one-out-values]
 
+### Maximum Sample Reuse
+
+MSR is a sampling method for data valuation that updates the value of every
+data point with each evaluated sample, which can lead to much faster
+convergence. Introduced by [@wang_data_2023].
+
+* [Implementation][pydvl.value.sampler.MSRSampler]
+
+
 ### Monte Carlo Least Core
 
 MCLC is a variation of the Least Core that uses a reduced amount of
docs/value/semi-values.md

Lines changed: 34 additions & 3 deletions
@@ -21,7 +21,7 @@ the set $D_{-i}^{(k)}$ contains all subsets of $D$ of size $k$ that do not
 include sample $x_i$, $S_{+i}$ is the set $S$ with $x_i$ added, and $u$ is the
 utility function.
 
-Two instances of this are **Banzhaf indices** [@wang_data_2022],
+Two instances of this are **Banzhaf indices** [@wang_data_2023],
 and **Beta Shapley** [@kwon_beta_2022], with better numerical and
 rank stability in certain situations.
 
@@ -84,7 +84,7 @@ any choice of weight function $w$, one can always construct a utility with
 higher variance where $w$ is greater. Therefore, in a worst-case sense, the best
 one can do is to pick a constant weight.
 
-The authors of [@wang_data_2022] show that Banzhaf indices are more robust to
+The authors of [@wang_data_2023] show that Banzhaf indices are more robust to
 variance in the utility function than Shapley and Beta Shapley values. They are
 available in pyDVL through
 [compute_banzhaf_semivalues][pydvl.value.semivalues.compute_banzhaf_semivalues]:
@@ -98,6 +98,37 @@ values = compute_banzhaf_semivalues(
 )
 ```
 
+### Banzhaf semi-values with MSR sampling
+
+Wang et al. propose a more sample-efficient method for computing Banzhaf
+semivalues in their paper *Data Banzhaf: A Robust Data Valuation Framework
+for Machine Learning* [@wang_data_2023]. This method updates all semivalues
+per evaluation of the utility (i.e. per model trained), based on whether a
+specific data point was included in the data subset or not. The expression
+for computing the semivalues is
+
+$$\hat{\phi}_{MSR}(i) = \frac{1}{|\mathbf{S}_{\ni i}|} \sum_{S \in
+\mathbf{S}_{\ni i}} U(S) - \frac{1}{|\mathbf{S}_{\not{\ni} i}|}
+\sum_{S \in \mathbf{S}_{\not{\ni} i}} U(S),$$
+
+where $\mathbf{S}_{\ni i}$ are the sampled subsets that contain the index $i$
+and $\mathbf{S}_{\not{\ni} i}$ are those that do not.
+
+The function implementing this method is
+[compute_msr_banzhaf_semivalues][pydvl.value.semivalues.compute_msr_banzhaf_semivalues].
+
+```python
+from pydvl.value import compute_msr_banzhaf_semivalues, RankCorrelation, Utility
+
+utility = Utility(model, data)
+values = compute_msr_banzhaf_semivalues(
+    u=utility, done=RankCorrelation(rtol=0.001),
+)
+```
+
+For further details on how to use this method and a comparison of its sample
+efficiency, we suggest taking a look at the example notebook
+[msr_banzhaf_digits](../../examples/msr_banzhaf_digits).
+
 ## General semi-values
 
 As explained above, both Beta Shapley and Banzhaf indices are special cases of
@@ -130,7 +161,7 @@ values = compute_generic_semivalues(
 Allowing any coefficient can help when experimenting with models which are more
 sensitive to changes in training set size. However, Data Banzhaf indices are
 proven to be the most robust to variance in the utility function, in the sense
-of rank stability, across a range of models and datasets [@wang_data_2022].
+of rank stability, across a range of models and datasets [@wang_data_2023].
 
 !!! warning "Careful with permutation sampling"
     This generic implementation of semi-values allowing for any combination of
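As a complement to the documentation hunk above, the MSR estimator can also be evaluated directly from recorded utility evaluations. The following is a hedged sketch, not part of the pyDVL API: `msr_banzhaf_from_samples` is a made-up helper, and each sample is assumed to be a `(subset, utility)` pair collected however one likes.

```python
# Hedged sketch: evaluating the MSR formula from recorded (subset, utility)
# pairs. For illustration only; pyDVL users should call
# compute_msr_banzhaf_semivalues as shown in the documentation diff above.
import numpy as np

def msr_banzhaf_from_samples(samples: list[tuple[set, float]], n: int) -> np.ndarray:
    """phi_MSR(i) = mean U(S) over subsets containing i
                    minus mean U(S) over subsets not containing i."""
    values = np.zeros(n)
    for i in range(n):
        with_i = [u for s, u in samples if i in s]
        without_i = [u for s, u in samples if i not in s]
        # Only indices seen on both sides of the split get a meaningful value.
        if with_i and without_i:
            values[i] = np.mean(with_i) - np.mean(without_i)
    return values

# Toy usage with a utility that simply counts the subset size:
toy = [({0, 1}, 2.0), ({1}, 1.0), ({0, 2}, 2.0), (set(), 0.0)]
print(msr_banzhaf_from_samples(toy, n=3))
```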

docs_includes/abbreviations.md

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@
 *[MLRC]: Machine Learning Reproducibility Challenge
 *[ML]: Machine Learning
 *[MSE]: Mean Squared Error
+*[MSR]: Maximum Sample Reuse
 *[NLRA]: Nyström Low-Rank Approximation
 *[OOB]: Out-of-Bag
 *[PCA]: Principal Component Analysis

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -34,6 +34,7 @@ nav:
     - Data utility learning: examples/shapley_utility_learning.ipynb
     - Least Core: examples/least_core_basic.ipynb
     - Data OOB: examples/data_oob.ipynb
+    - Banzhaf Semivalues: examples/msr_banzhaf_digits.ipynb
   - Influence Function:
     - For CNNs: examples/influence_imagenet.ipynb
     - For mislabeled data: examples/influence_synthetic.ipynb

notebooks/least_core_basic.ipynb

Lines changed: 9 additions & 9 deletions
Large diffs are not rendered by default.

0 commit comments
