@@ -21,7 +21,7 @@ the set $D_{-i}^{(k)}$ contains all subsets of $D$ of size $k$ that do not
 include sample $x_i$, $S_{+i}$ is the set $S$ with $x_i$ added, and $u$ is the
 utility function.
 
-Two instances of this are **Banzhaf indices** [@wang_data_2022],
+Two instances of this are **Banzhaf indices** [@wang_data_2023],
 and **Beta Shapley** [@kwon_beta_2022], with better numerical and
 rank stability in certain situations.
 
@@ -84,7 +84,7 @@ any choice of weight function $w$, one can always construct a utility with
 higher variance where $w$ is greater. Therefore, in a worst-case sense, the best
 one can do is to pick a constant weight.
 
-The authors of [@wang_data_2022] show that Banzhaf indices are more robust to
+The authors of [@wang_data_2023] show that Banzhaf indices are more robust to
 variance in the utility function than Shapley and Beta Shapley values. They are
 available in pyDVL through
 [compute_banzhaf_semivalues][pydvl.value.semivalues.compute_banzhaf_semivalues]:
@@ -98,6 +98,37 @@ values = compute_banzhaf_semivalues(
 )
 ```
 
+### Banzhaf semi-values with MSR sampling
+Wang et al. propose a more sample-efficient method for computing Banzhaf
+semivalues in their paper *Data Banzhaf: A Robust Data Valuation Framework
+for Machine Learning* [@wang_data_2023]. This method updates all semivalues
+per evaluation of the utility (i.e. per model trained) based on whether a
+specific data point was included in the data subset or not. The expression
+for computing the semivalues is
+
+$$\hat{\phi}_{MSR}(i) = \frac{1}{|\mathbf{S}_{\ni i}|} \sum_{S \in
+\mathbf{S}_{\ni i}} U(S) - \frac{1}{|\mathbf{S}_{\not{\ni} i}|}
+\sum_{S \in \mathbf{S}_{\not{\ni} i}} U(S)$$
+
+where $\mathbf{S}_{\ni i}$ are the subsets that contain the index $i$ and
+$\mathbf{S}_{\not{\ni} i}$ are the subsets not containing the index $i$.
+
+The function implementing this method is
+[compute_msr_banzhaf_semivalues][pydvl.value.semivalues.compute_msr_banzhaf_semivalues].
+
+```python
+from pydvl.value import compute_msr_banzhaf_semivalues, RankCorrelation, Utility
+
+utility = Utility(model, data)
+values = compute_msr_banzhaf_semivalues(
+    u=utility, done=RankCorrelation(rtol=0.001),
+)
+```
+For further details on how to use this method and a comparison of the sample
+efficiency, we suggest taking a look at the example notebook
+[msr_banzhaf_spotify](../../examples/msr_banzhaf_spotify).
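To make the MSR expression above concrete, here is a minimal NumPy sketch of the estimator. It is an illustration only, not pyDVL's implementation: the function name `msr_banzhaf`, its arguments, and the additive toy utility are assumptions made for this example. Each row of `memberships` marks which points were in one sampled subset, and `utilities` holds the corresponding values of $U(S)$.

```python
import itertools

import numpy as np


def msr_banzhaf(memberships, utilities):
    """Hypothetical helper (not part of pyDVL): MSR estimate for every point.

    memberships: (n_samples, n_points) boolean array, True where point i
        belongs to the sampled subset.
    utilities: (n_samples,) array with U(S) for each sampled subset.
    """
    m = np.asarray(memberships, dtype=bool).astype(float)
    u = np.asarray(utilities, dtype=float)
    n_in = m.sum(axis=0)            # |S_{∋ i}| for every i
    n_out = (1.0 - m).sum(axis=0)   # |S_{∌ i}| for every i
    sum_in = m.T @ u                # sum of U(S) over subsets containing i
    sum_out = (1.0 - m).T @ u       # sum of U(S) over subsets without i
    # A point that was always (or never) sampled has no estimate: NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        return sum_in / n_in - sum_out / n_out


# Toy check with an assumed additive utility U(S) = sum of weights in S.
# Every utility evaluation is reused for all points at once.
weights = np.array([1.0, 2.0, 0.5, -1.0])
all_subsets = np.array(list(itertools.product([False, True], repeat=4)))
utilities = all_subsets @ weights
estimates = msr_banzhaf(all_subsets, utilities)
```

In this toy setting every subset is enumerated, so the two averages differ exactly by each point's weight. With randomly sampled subsets the same function would return a noisy estimate instead.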
+
+
 ## General semi-values
 
 As explained above, both Beta Shapley and Banzhaf indices are special cases of
@@ -130,7 +161,7 @@ values = compute_generic_semivalues(
 Allowing any coefficient can help when experimenting with models which are more
 sensitive to changes in training set size. However, Data Banzhaf indices are
 proven to be the most robust to variance in the utility function, in the sense
-of rank stability, across a range of models and datasets [@wang_data_2022].
+of rank stability, across a range of models and datasets [@wang_data_2023].
 
 !!! warning "Careful with permutation sampling"
     This generic implementation of semi-values allowing for any combination of