@@ -241,6 +241,7 @@ v_u(x_i) = \frac{1}{n} \sum_{S \subseteq D \setminus \{x_i\}}
 .. code-block:: python
 
-    from pydvl.value import compute_shapley_value
+    from pydvl.value import compute_shapley_values
+
     utility = Utility(...)
     values = compute_shapley_values(utility, mode="combinatorial_exact")
     df = values.to_dataframe(column='value')
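+
+For intuition, here is a brute-force sketch of the combinatorial formula above
+for a toy utility (illustrative only; the helper ``exact_shapley`` and the toy
+utility are not part of pyDVL, and the nested loop is exponential in the number
+of points):
+
+.. code-block:: python
+
+    from itertools import combinations
+    from math import comb
+
+    def exact_shapley(u, n):
+        """Brute-force Shapley values for a utility u over range(n)."""
+        values = []
+        for i in range(n):
+            rest = [j for j in range(n) if j != i]
+            total = 0.0
+            for k in range(n):  # subset sizes |S| = 0, ..., n-1
+                for S in combinations(rest, k):
+                    marginal = u(set(S) | {i}) - u(set(S))
+                    # each size-k subset is weighted by 1 / (n * C(n-1, k))
+                    total += marginal / (n * comb(n - 1, k))
+            values.append(total)
+        return values
+
+    # An additive toy utility: each point contributes its own index.
+    print(exact_shapley(lambda s: sum(s), n=4))  # [0.0, 1.0, 2.0, 3.0]
+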
@@ -264,7 +265,8 @@ same pattern:
 .. code-block:: python
 
     from pydvl.utils import Dataset, Utility
-    from pydvl.value.shapley import compute_shapley_values
+    from pydvl.value import compute_shapley_values
+
     model = ...
     data = Dataset(...)
     utility = Utility(model, data)
@@ -303,7 +305,8 @@ values in pyDVL. First construct the dataset and utility, then call
 .. code-block:: python
 
     from pydvl.utils import Dataset, Utility
-    from pydvl.value.shapley import compute_shapley_values
+    from pydvl.value import compute_shapley_values
+
     model = ...
     dataset = Dataset(...)
-    utility = Utility(data, model)
+    utility = Utility(model, dataset)
@@ -329,11 +332,11 @@ It uses permutations over indices instead of subsets:
 
 $$
 v_u(x_i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)}
-[u(\sigma_{i-1} \cup {i}) − u(\sigma_{i})]
+[u(\sigma_{:i} \cup \{i\}) - u(\sigma_{:i})]
 ,$$
 
-where $\sigma_i$ denotes the set of indices in permutation sigma up until the
-position of index $i$. To approximate this sum (with $\mathcal{O}(n!)$ terms!)
+where $\sigma_{:i}$ denotes the set of indices in permutation $\sigma$ before the
+position where $i$ appears. To approximate this sum (with $\mathcal{O}(n!)$ terms!)
 one uses Monte Carlo sampling of permutations, something which has surprisingly
 low sample complexity. By adding early stopping, the result is the so-called
 **Truncated Monte Carlo Shapley** (:footcite:t:`ghorbani_data_2019`), which is
@@ -342,7 +345,7 @@ efficient enough to be useful in some applications.
 .. code-block:: python
 
     from pydvl.utils import Dataset, Utility
-    from pydvl.value.shapley import compute_shapley_values
+    from pydvl.value import compute_shapley_values
 
     model = ...
     data = Dataset(...)
@@ -364,7 +367,7 @@ and can be used in pyDVL with:
 .. code-block:: python
 
     from pydvl.utils import Dataset, Utility
-    from pydvl.value.shapley import compute_shapley_values
+    from pydvl.value import compute_shapley_values
     from sklearn.neighbors import KNeighborsClassifier
 
     model = KNeighborsClassifier(n_neighbors=5)
@@ -410,7 +413,7 @@ its variance.
 .. code-block:: python
 
     from pydvl.utils import Dataset, Utility
-    from pydvl.value.shapley import compute_shapley_values
+    from pydvl.value import compute_shapley_values
 
     model = ...
     data = Dataset(...)
@@ -449,7 +452,7 @@ It satisfies the following 2 properties:
    The sum of payoffs to the agents in any coalition $S$ is at
    least as large as the amount that these agents could earn by
    forming a coalition on their own.
-   $$\sum_{x_i\in S} v_u(x_i) \geq u(S), \forall S \subseteq D\,$$
+   $$\sum_{x_i\in S} v_u(x_i) \geq u(S), \forall S \subset D\,$$
 
 The second property states that the sum of payoffs to the agents
 in any subcoalition $S$ is at least as large as the amount that
@@ -463,7 +466,7 @@ By relaxing the coalitional rationality property by a subsidy $e \gt 0$,
 we are then able to find approximate payoffs:
 
 $$
-\sum_{x_i\in S} v_u(x_i) + e \geq u(S), \forall S \subseteq D \
+\sum_{x_i\in S} v_u(x_i) + e \geq u(S), \forall S \subset D, S \neq \emptyset
 ,$$
 
 The least core value $v$ of the $i$-th sample in dataset $D$ wrt.
@@ -473,7 +476,7 @@
 \begin{array}{lll}
 \text{minimize} & e & \\
 \text{subject to} & \sum_{x_i\in D} v_u(x_i) = u(D) & \\
-& \sum_{x_i\in S} v_u(x_i) + e \geq u(S) &, \forall S \subseteq D \\
+& \sum_{x_i\in S} v_u(x_i) + e \geq u(S) &, \forall S \subset D, S \neq \emptyset \\
 \end{array}
 $$
 
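+For intuition, here is a minimal sketch of this linear program for a toy
+three-point game, solved with ``scipy.optimize.linprog`` (the hard-coded
+utilities are illustrative assumptions; pyDVL builds and solves the problem for
+you, as shown below):
+
+.. code-block:: python
+
+    from itertools import combinations
+
+    import numpy as np
+    from scipy.optimize import linprog
+
+    # Toy utilities for n = 3 points, keyed by coalition.
+    n = 3
+    u = {(0,): 1.0, (1,): 1.0, (2,): 1.0,
+         (0, 1): 3.0, (0, 2): 3.0, (1, 2): 3.0,
+         (0, 1, 2): 6.0}
+
+    # Variables are [v_0, v_1, v_2, e]; the objective is to minimize e.
+    c = np.array([0.0, 0.0, 0.0, 1.0])
+
+    # Efficiency constraint: sum_i v_i = u(D).
+    A_eq = np.array([[1.0, 1.0, 1.0, 0.0]])
+    b_eq = np.array([u[(0, 1, 2)]])
+
+    # One constraint per nonempty proper subset S:
+    # sum_{i in S} v_i + e >= u(S)  <=>  -sum_{i in S} v_i - e <= -u(S)
+    A_ub, b_ub = [], []
+    for k in (1, 2):
+        for S in combinations(range(n), k):
+            row = np.zeros(n + 1)
+            row[list(S)] = -1.0
+            row[-1] = -1.0
+            A_ub.append(row)
+            b_ub.append(-u[S])
+
+    result = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
+                     A_eq=A_eq, b_eq=b_eq, bounds=(None, None))
+    # The optimal subsidy can be negative when the core is nonempty.
+    values, subsidy = result.x[:n], result.x[-1]
+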
@@ -487,11 +490,12 @@ As such it returns as exact a value as the utility function allows
 .. code-block:: python
 
     from pydvl.utils import Dataset, Utility
-    from pydvl.value.least_core import exact_least_core
+    from pydvl.value import compute_least_core_values
+
     model = ...
     dataset = Dataset(...)
-    utility = Utility(data, model)
-    values = exact_least_core(utility)
+    utility = Utility(model, dataset)
+    values = compute_least_core_values(utility, mode="exact")
 
 Monte Carlo Least Core
 ----------------------
@@ -515,16 +519,20 @@ where $e^{*}$ is the optimal least core subsidy.
 .. code-block:: python
 
     from pydvl.utils import Dataset, Utility
-    from pydvl.value.least_core import montecarlo_least_core
+    from pydvl.value import compute_least_core_values
+
     model = ...
     dataset = Dataset(...)
     n_iterations = ...
-    utility = Utility(data, model)
-    values = montecarlo_least_core(utility, n_iterations=n_iterations)
+    utility = Utility(model, dataset)
+    values = compute_least_core_values(
+        utility, mode="montecarlo", n_iterations=n_iterations
+    )
 
 .. note::
 
-   ``n_iterations`` needs to be at least equal to the number of data points.
+   Although any number is supported, it is best to choose ``n_iterations`` to be
+   at least equal to the number of data points.
 
 Because computing the Least Core values requires the solution of a linear and a
 quadratic problem *after* computing all the utility values, we offer the
@@ -538,6 +546,7 @@ list of problems to solve, then solve them in parallel with
 
     from pydvl.utils import Dataset, Utility
     from pydvl.value.least_core import mclc_prepare_problem, lc_solve_problems
+
     model = ...
     dataset = Dataset(...)
     n_iterations = ...
@@ -548,15 +557,102 @@ list of problems to solve, then solve them in parallel with
     values = lc_solve_problems(problems)
 
 
-Other methods
-=============
+Semi-values
+===========
+
+Shapley values are a particular case of a more general concept called
+semi-value, which generalizes them to different weighting schemes. A
+**semi-value** is any valuation function of the form:
+
+$$
+v_\text{semi}(i) = \sum_{k=1}^n w(k)
+\sum_{S \subset D_{-i}^{(k)}} [U(S_{+i}) - U(S)],
+$$
+
+where the coefficients $w(k)$ satisfy the normalization condition:
+
+$$\sum_{k=1}^n \binom{n-1}{k-1} w(k) = 1.$$
+
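+As a quick sanity check of this condition (illustrative only, not part of
+pyDVL's API), one can verify it for the classical Shapley weights
+$w(k) = 1/(n \binom{n-1}{k-1})$:
+
+.. code-block:: python
+
+    from math import comb
+
+    # Shapley weights w(k) = 1 / (n * C(n-1, k-1)), for k = 1, ..., n.
+    n = 10
+    w = {k: 1 / (n * comb(n - 1, k - 1)) for k in range(1, n + 1)}
+
+    # Normalization: sum_k C(n-1, k-1) * w(k) == 1.
+    assert abs(sum(comb(n - 1, k - 1) * wk for k, wk in w.items()) - 1) < 1e-12
+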
+Two instances of this are **Banzhaf indices** (:footcite:t:`wang_data_2022`)
+and **Beta Shapley** (:footcite:t:`kwon_beta_2022`), with better numerical and
+rank stability in certain situations.
+
+.. note::
+
+   Shapley values are a particular case of semi-values and can therefore also be
+   computed with the methods described here. However, as of version 0.6.0, we
+   recommend using :func:`~pydvl.value.shapley.compute_shapley_values` instead,
+   in particular because it implements truncated Monte Carlo sampling for faster
+   computation.
+
+
+Beta Shapley
+^^^^^^^^^^^^
+
+For some machine learning applications, where the utility is typically the
+performance when trained on a set $S \subset D$, diminishing returns are often
+observed when computing the marginal utility of adding a new data point.
+
+Beta Shapley is a weighting scheme that uses the Beta function to place more
+weight on subsets deemed to be more informative. The weights are defined as:
+
+$$
+w(k) := \frac{B(k+\beta, n-k+1+\alpha)}{B(\alpha, \beta)},
+$$
+
+where $B$ is the `Beta function <https://en.wikipedia.org/wiki/Beta_function>`_,
+and $\alpha$ and $\beta$ are parameters that control the weighting of the
+subsets. Setting both to 1 recovers Shapley values, while setting $\alpha = 1$
+and $\beta = 16$ is reported in :footcite:t:`kwon_beta_2022` to be a good choice
+for some applications. See :ref:`banzhaf indices` below for an alternative
+choice of weights which is reported to work better.
+
+.. code-block:: python
+
+    from pydvl.utils import Dataset, Utility
+    from pydvl.value import compute_semivalues, MaxUpdates
+
+    model = ...
+    data = Dataset(...)
+    utility = Utility(model, data)
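+    # MaxUpdates(500) is a stopping criterion: sampling stops after 500 updates
+    # of the value estimates; alpha and beta select the Beta(1, 16) weighting.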
+    values = compute_semivalues(
+        u=utility, mode="beta_shapley", done=MaxUpdates(500), alpha=1, beta=16
+    )
+
+.. _banzhaf indices:
 
-There are other game-theoretic concepts in pyDVL's roadmap, based on the notion
-of semivalue, which is a generalization to different weighting schemes:
-in particular **Banzhaf indices** and **Beta Shapley**, with better numerical
-and rank stability in certain situations.
+Banzhaf indices
+^^^^^^^^^^^^^^^
 
-Contributions are welcome!
+As noted below in :ref:`problems of data values`, the Shapley value can be very
+sensitive to variance in the utility function. For machine learning applications,
+where the utility is typically the performance when trained on a set $S \subset
+D$, this variance is often largest for smaller subsets $S$. It is therefore
+reasonable to try to reduce the relative contribution of these subsets with
+adequate weights.
+
+One such choice of weights is the Banzhaf index, which is defined as the
+constant:
+
+$$w(k) := 2^{-(n-1)},$$
+
+for all set sizes $k$. The intuition for picking a constant weight is that for
+any choice of weight function $w$, one can always construct a utility with
+higher variance where $w$ is greater. Therefore, in a worst-case sense, the best
+one can do is to pick a constant weight.
+
+The authors of :footcite:t:`wang_data_2022` show that Banzhaf indices are more
+robust to variance in the utility function than Shapley and Beta Shapley values.
+
+.. code-block:: python
+
+    from pydvl.utils import Dataset, Utility
+    from pydvl.value import compute_semivalues, MaxUpdates
+
+    model = ...
+    data = Dataset(...)
+    utility = Utility(model, data)
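+    # Stop sampling once 500 updates of the value estimates have been made.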
+    values = compute_semivalues(u=utility, mode="banzhaf", done=MaxUpdates(500))
 
 
 .. _problems of data values: