
Commit 67a5b06

Merge pull request #520 from aai-institute/feature/msr-banzhaf
MSR method for Banzhaf
2 parents be14b2b + 426b867 commit 67a5b06

File tree

22 files changed: +2189, -73 lines


.notebook_test_durations

Lines changed: 10 additions & 9 deletions
@@ -1,11 +1,12 @@
 {
-  "notebooks/data_oob.ipynb::": 14.608769827000287,
-  "notebooks/influence_imagenet.ipynb::": 13.570316236000508,
-  "notebooks/influence_sentiment_analysis.ipynb::": 20.546479973001624,
-  "notebooks/influence_synthetic.ipynb::": 5.9324631089984905,
-  "notebooks/influence_wine.ipynb::": 16.114133220999065,
-  "notebooks/least_core_basic.ipynb::": 14.312467472000208,
-  "notebooks/shapley_basic_spotify.ipynb::": 15.608795123000164,
-  "notebooks/shapley_knn_flowers.ipynb::": 3.9430189769991557,
-  "notebooks/shapley_utility_learning.ipynb::": 26.96671833400069
+  "notebooks/data_oob.ipynb::": 14.514983271001256,
+  "notebooks/influence_imagenet.ipynb::": 15.937124550999215,
+  "notebooks/influence_sentiment_analysis.ipynb::": 26.479645616000198,
+  "notebooks/influence_synthetic.ipynb::": 6.61773010700017,
+  "notebooks/influence_wine.ipynb::": 16.312171267998565,
+  "notebooks/least_core_basic.ipynb::": 14.375480750999486,
+  "notebooks/msr_banzhaf_digits.ipynb::": 106.6507187110019,
+  "notebooks/shapley_basic_spotify.ipynb::": 15.657225806997303,
+  "notebooks/shapley_knn_flowers.ipynb::": 3.9943819290019746,
+  "notebooks/shapley_utility_learning.ipynb::": 25.939783253001224
 }

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -6,6 +6,9 @@
 
 - New method: `NystroemSketchInfluence`
   [PR #504](https://github.com/aai-institute/pyDVL/pull/504)
+- New method `MSR Banzhaf` with accompanying notebook, and new stopping
+  criterion `RankCorrelation`
+  [PR #520](https://github.com/aai-institute/pyDVL/pull/520)
 - New preconditioned block variant of conjugate gradient
   [PR #507](https://github.com/aai-institute/pyDVL/pull/507)
 - Improvements to documentation: fixes, links, text, example gallery, LFS and
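The `RankCorrelation` stopping criterion named in this CHANGELOG entry halts the computation once the ranking of the estimated values stops changing between successive checks. As a rough sketch of that idea only — not pyDVL's actual implementation, and `rank_correlation_converged` is a hypothetical helper — one could compare consecutive value vectors with Spearman's rank correlation:

```python
# Hedged sketch of a rank-correlation stopping rule; the real
# pydvl.value.stopping.RankCorrelation criterion may differ in detail.
import numpy as np
from scipy.stats import spearmanr

def rank_correlation_converged(prev_values: np.ndarray,
                               new_values: np.ndarray,
                               rtol: float = 0.001) -> bool:
    """Stop when the ranking of values barely changes between two checks."""
    corr, _ = spearmanr(prev_values, new_values)
    return bool(corr >= 1.0 - rtol)
```

In pyDVL the criterion is passed through the `done=` argument, as in the `done=RankCorrelation(rtol=0.001)` call shown in the documentation diff further below.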

docs/assets/pydvl.bib

Lines changed: 1 addition & 1 deletion
@@ -450,7 +450,7 @@ @book{trefethen_numerical_1997
   langid = {english}
 }
 
-@inproceedings{wang_data_2022,
+@inproceedings{wang_data_2023,
   title = {Data {{Banzhaf}}: {{A Robust Data Valuation Framework}} for {{Machine Learning}}},
   shorttitle = {Data {{Banzhaf}}},
   booktitle = {Proceedings of {{The}} 26th {{International Conference}} on {{Artificial Intelligence}} and {{Statistics}}},

docs/examples/index.md

Lines changed: 9 additions & 0 deletions
@@ -53,6 +53,15 @@ alias:
 
     [![](img/data_oob.png)](data_oob/)
 
+- [__Faster Banzhaf values__](msr_banzhaf_digits/)
+
+    ---
+
+    Using Banzhaf values to estimate the value of data points in MNIST, and
+    evaluating convergence speed of MSR.
+
+    [![](img/msr_banzhaf_digits.png)](msr_banzhaf_digits/)
+
 </div>
 
docs/getting-started/glossary.md

Lines changed: 9 additions & 0 deletions
@@ -147,6 +147,15 @@ performance when that point is removed from the training set.
 * [Implementation][pydvl.value.loo.loo.compute_loo]
 * [Documentation][leave-one-out-values]
 
+### Maximum Sample Reuse
+
+MSR is a sampling method for data valuation that updates the value of every
+data point with each evaluated sample, which can lead to much faster
+convergence. Introduced by [@wang_data_2023].
+
+* [Implementation][pydvl.value.sampler.MSRSampler]
+
+
 ### Monte Carlo Least Core
 
 MCLC is a variation of the Least Core that uses a reduced amount of
docs/value/semi-values.md

Lines changed: 34 additions & 3 deletions
@@ -21,7 +21,7 @@ the set $D_{-i}^{(k)}$ contains all subsets of $D$ of size $k$ that do not
 include sample $x_i$, $S_{+i}$ is the set $S$ with $x_i$ added, and $u$ is the
 utility function.
 
-Two instances of this are **Banzhaf indices** [@wang_data_2022],
+Two instances of this are **Banzhaf indices** [@wang_data_2023],
 and **Beta Shapley** [@kwon_beta_2022], with better numerical and
 rank stability in certain situations.
 
@@ -84,7 +84,7 @@ any choice of weight function $w$, one can always construct a utility with
 higher variance where $w$ is greater. Therefore, in a worst-case sense, the best
 one can do is to pick a constant weight.
 
-The authors of [@wang_data_2022] show that Banzhaf indices are more robust to
+The authors of [@wang_data_2023] show that Banzhaf indices are more robust to
 variance in the utility function than Shapley and Beta Shapley values. They are
 available in pyDVL through
 [compute_banzhaf_semivalues][pydvl.value.semivalues.compute_banzhaf_semivalues]:
@@ -98,6 +98,37 @@ values = compute_banzhaf_semivalues(
 )
 ```
 
+### Banzhaf semi-values with MSR sampling
+
+Wang et al. propose a more sample-efficient method for computing Banzhaf
+semivalues in their paper *Data Banzhaf: A Robust Data Valuation Framework
+for Machine Learning* [@wang_data_2023]. This method updates all semivalues
+per evaluation of the utility (i.e. per model trained), based on whether a
+specific data point was included in the data subset or not. The expression
+for computing the semivalues is
+
+$$\hat{\phi}_{MSR}(i) = \frac{1}{|\mathbf{S}_{\ni i}|} \sum_{S \in
+\mathbf{S}_{\ni i}} U(S) - \frac{1}{|\mathbf{S}_{\not{\ni} i}|}
+\sum_{S \in \mathbf{S}_{\not{\ni} i}} U(S),$$
+
+where $\mathbf{S}_{\ni i}$ are the sampled subsets that contain the index $i$
+and $\mathbf{S}_{\not{\ni} i}$ are those that do not.
+
+The function implementing this method is
+[compute_msr_banzhaf_semivalues][pydvl.value.semivalues.compute_msr_banzhaf_semivalues].
+
+```python
+from pydvl.value import compute_msr_banzhaf_semivalues, RankCorrelation, Utility
+
+utility = Utility(model, data)
+values = compute_msr_banzhaf_semivalues(
+    u=utility, done=RankCorrelation(rtol=0.001),
+)
+```
+
+For further details on how to use this method and a comparison of its sample
+efficiency, we suggest taking a look at the example notebook
+[msr_banzhaf_digits](../../examples/msr_banzhaf_digits).
+
 ## General semi-values
 
 As explained above, both Beta Shapley and Banzhaf indices are special cases of
@@ -130,7 +161,7 @@ values = compute_generic_semivalues(
 Allowing any coefficient can help when experimenting with models which are more
 sensitive to changes in training set size. However, Data Banzhaf indices are
 proven to be the most robust to variance in the utility function, in the sense
-of rank stability, across a range of models and datasets [@wang_data_2022].
+of rank stability, across a range of models and datasets [@wang_data_2023].
 
 !!! warning "Careful with permutation sampling"
     This generic implementation of semi-values allowing for any combination of
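As a complement to the documentation hunk above, the MSR estimator can also be evaluated directly from recorded utility evaluations. The following is a hedged sketch, not part of the pyDVL API: `msr_banzhaf_from_samples` is a made-up helper, and each sample is assumed to be a `(subset, utility)` pair collected however one likes.

```python
# Hedged sketch: evaluating the MSR formula from recorded (subset, utility)
# pairs. For illustration only; pyDVL users should call
# compute_msr_banzhaf_semivalues as shown in the documentation diff above.
import numpy as np

def msr_banzhaf_from_samples(samples: list[tuple[set, float]], n: int) -> np.ndarray:
    """phi_MSR(i) = mean U(S) over subsets containing i
                    minus mean U(S) over subsets not containing i."""
    values = np.zeros(n)
    for i in range(n):
        with_i = [u for s, u in samples if i in s]
        without_i = [u for s, u in samples if i not in s]
        # Only indices seen on both sides of the split get a meaningful value.
        if with_i and without_i:
            values[i] = np.mean(with_i) - np.mean(without_i)
    return values

# Toy usage with a utility that simply counts the subset size:
toy = [({0, 1}, 2.0), ({1}, 1.0), ({0, 2}, 2.0), (set(), 0.0)]
print(msr_banzhaf_from_samples(toy, n=3))
```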

docs_includes/abbreviations.md

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@
 *[MLRC]: Machine Learning Reproducibility Challenge
 *[ML]: Machine Learning
 *[MSE]: Mean Squared Error
+*[MSR]: Maximum Sample Reuse
 *[NLRA]: Nyström Low-Rank Approximation
 *[OOB]: Out-of-Bag
 *[PCA]: Principal Component Analysis

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -34,6 +34,7 @@ nav:
     - Data utility learning: examples/shapley_utility_learning.ipynb
     - Least Core: examples/least_core_basic.ipynb
     - Data OOB: examples/data_oob.ipynb
+    - Banzhaf Semivalues: examples/msr_banzhaf_digits.ipynb
   - Influence Function:
     - For CNNs: examples/influence_imagenet.ipynb
     - For mislabeled data: examples/influence_synthetic.ipynb

notebooks/least_core_basic.ipynb

Lines changed: 9 additions & 9 deletions
Large diffs are not rendered by default.

0 commit comments
