
Commit aa8aa46

Merge pull request #319 from appliedAI-Initiative/feature/sampler
Semi-values and samplers
2 parents 996e014 + f7ceff0 commit aa8aa46

File tree: 15 files changed, +984 -57 lines changed


CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -2,6 +2,9 @@
 
 ## Unreleased
 
+- **New method**: Implements generalised semi-values for data valuation,
+  including Data Banzhaf and Beta Shapley, with configurable sampling strategies
+  [PR #319](https://github.com/appliedAI-Initiative/pyDVL/pull/319)
 - Adds kwargs parameter to `from_array` and `from_sklearn`
   Dataset and GroupedDataset class methods
   [PR #316](https://github.com/appliedAI-Initiative/pyDVL/pull/316)

README.md

Lines changed: 7 additions & 0 deletions
@@ -54,6 +54,13 @@ methods from the following papers:
   [Towards Efficient Data Valuation Based on the Shapley Value](http://proceedings.mlr.press/v89/jia19a.html).
   In 22nd International Conference on Artificial Intelligence and Statistics,
   1167–76. PMLR, 2019.
+- Wang, Jiachen T., and Ruoxi Jia.
+  [Data Banzhaf: A Robust Data Valuation Framework for Machine Learning](https://doi.org/10.48550/arXiv.2205.15466).
+  arXiv, October 22, 2022.
+- Kwon, Yongchan, and James Zou.
+  [Beta Shapley: A Unified and Noise-Reduced Data Valuation Framework for Machine Learning](http://arxiv.org/abs/2110.14049).
+  In Proceedings of the 25th International Conference on Artificial Intelligence
+  and Statistics (AISTATS) 2022, Vol. 151. Valencia, Spain: PMLR, 2022.
 
 Influence Functions compute the effect that single points have on an estimator /
 model. We implement methods from the following papers:

docs/30-data-valuation.rst

Lines changed: 116 additions & 20 deletions
@@ -241,6 +241,7 @@ v_u(x_i) = \frac{1}{n} \sum_{S \subseteq D \setminus \{x_i\}}
 .. code-block:: python
 
    from pydvl.value import compute_shapley_values
+
    utility = Utility(...)
    values = compute_shapley_values(utility, mode="combinatorial_exact")
    df = values.to_dataframe(column='value')
@@ -264,7 +265,8 @@ same pattern:
 .. code-block:: python
 
    from pydvl.utils import Dataset, Utility
-   from pydvl.value.shapley import compute_shapley_values
+   from pydvl.value import compute_shapley_values
+
    model = ...
    data = Dataset(...)
    utility = Utility(model, data)
@@ -303,7 +305,8 @@ values in pyDVL. First construct the dataset and utility, then call
 .. code-block:: python
 
    from pydvl.utils import Dataset, Utility
-   from pydvl.value.shapley import compute_shapley_values
+   from pydvl.value import compute_shapley_values
+
    model = ...
    dataset = Dataset(...)
    utility = Utility(model, dataset)
@@ -329,11 +332,11 @@ It uses permutations over indices instead of subsets:
 
 $$
 v_u(x_i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)}
-[u(\sigma_{i-1} \cup \{i\}) - u(\sigma_{i})]
+[u(\sigma_{:i} \cup \{i\}) - u(\sigma_{:i})]
 ,$$
 
-where $\sigma_i$ denotes the set of indices in permutation sigma up until the
-position of index $i$. To approximate this sum (with $\mathcal{O}(n!)$ terms!)
+where $\sigma_{:i}$ denotes the set of indices in permutation $\sigma$ before the
+position where $i$ appears. To approximate this sum (with $\mathcal{O}(n!)$ terms!)
 one uses Monte Carlo sampling of permutations, something which has surprisingly
 low sample complexity. By adding early stopping, the result is the so-called
 **Truncated Monte Carlo Shapley** (:footcite:t:`ghorbani_data_2019`), which is
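
To make the formula in this hunk concrete, here is a toy Monte Carlo permutation
estimator. It is a minimal sketch for illustration, not pyDVL's implementation
(which sits behind `compute_shapley_values`). For the additive utility
u(S) = sum(S) used here, the Shapley value of each index i is exactly i:

    import random

    def utility(subset):
        # Toy additive utility: each element contributes its own value.
        return sum(subset)

    def montecarlo_shapley(indices, n_permutations=2000):
        values = {i: 0.0 for i in indices}
        for _ in range(n_permutations):
            sigma = list(indices)
            random.shuffle(sigma)
            u_prev, prefix = 0.0, set()  # u of the empty set is 0 here
            for i in sigma:
                # Marginal contribution of i given the indices before it in sigma
                prefix.add(i)
                u_curr = utility(prefix)
                values[i] += u_curr - u_prev
                u_prev = u_curr
        return {i: v / n_permutations for i, v in values.items()}

    print(montecarlo_shapley([1, 2, 3]))  # ≈ {1: 1.0, 2: 2.0, 3: 3.0}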
@@ -342,7 +345,7 @@ efficient enough to be useful in some applications.
 .. code-block:: python
 
    from pydvl.utils import Dataset, Utility
-   from pydvl.value.shapley import compute_shapley_values
+   from pydvl.value import compute_shapley_values
 
    model = ...
    data = Dataset(...)
@@ -364,7 +367,7 @@ and can be used in pyDVL with:
 .. code-block:: python
 
    from pydvl.utils import Dataset, Utility
-   from pydvl.value.shapley import compute_shapley_values
+   from pydvl.value import compute_shapley_values
    from sklearn.neighbors import KNeighborsClassifier
 
    model = KNeighborsClassifier(n_neighbors=5)
@@ -410,7 +413,7 @@ its variance.
 .. code-block:: python
 
    from pydvl.utils import Dataset, Utility
-   from pydvl.value.shapley import compute_shapley_values
+   from pydvl.value import compute_shapley_values
 
    model = ...
    data = Dataset(...)
@@ -487,11 +490,12 @@ As such it returns as exact a value as the utility function allows
 .. code-block:: python
 
    from pydvl.utils import Dataset, Utility
-   from pydvl.value.least_core import exact_least_core
+   from pydvl.value import compute_least_core_values
+
    model = ...
    dataset = Dataset(...)
    utility = Utility(model, dataset)
-   values = exact_least_core(utility)
+   values = compute_least_core_values(utility, mode="exact")
 
 Monte Carlo Least Core
 ----------------------
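
For reference, the optimization problem that the exact mode solves can be written
down directly: minimize the subsidy $e$ subject to $\sum_{i \in S} x_i + e \geq u(S)$
for every coalition $S$, with $\sum_i x_i = u(N)$. A minimal sketch with
`scipy.optimize.linprog` (ours, for illustration; not pyDVL's implementation):

    from itertools import chain, combinations

    import numpy as np
    from scipy.optimize import linprog

    def exact_least_core_sketch(n, u):
        # All proper, nonempty coalitions S of {0, ..., n-1}
        subsets = list(chain.from_iterable(combinations(range(n), k) for k in range(1, n)))
        # Variables: x_0, ..., x_{n-1}, e. Objective: minimize e.
        c = np.zeros(n + 1)
        c[-1] = 1.0
        # -(sum_{i in S} x_i) - e <= -u(S), i.e. sum_{i in S} x_i + e >= u(S)
        A_ub = np.zeros((len(subsets), n + 1))
        b_ub = np.zeros(len(subsets))
        for row, S in enumerate(subsets):
            A_ub[row, list(S)] = -1.0
            A_ub[row, -1] = -1.0
            b_ub[row] = -u(S)
        # Efficiency: the payoffs must sum to the utility of the full set
        A_eq = np.ones((1, n + 1))
        A_eq[0, -1] = 0.0
        result = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq,
                         b_eq=[u(tuple(range(n)))], bounds=(None, None))
        return result.x[:n], result.x[-1]  # payoff vector and optimal subsidy e*

    values, subsidy = exact_least_core_sketch(3, lambda S: float(len(S) ** 2))
    print(values, subsidy)  # symmetric utility => equal payoffs of 3.0 each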
@@ -515,16 +519,20 @@ where $e^{*}$ is the optimal least core subsidy.
 .. code-block:: python
 
    from pydvl.utils import Dataset, Utility
-   from pydvl.value.least_core import montecarlo_least_core
+   from pydvl.value import compute_least_core_values
+
    model = ...
    dataset = Dataset(...)
    n_iterations = ...
    utility = Utility(model, dataset)
-   values = montecarlo_least_core(utility, n_iterations=n_iterations)
+   values = compute_least_core_values(
+      utility, mode="montecarlo", n_iterations=n_iterations
+   )
 
 .. note::
 
-   ``n_iterations`` needs to be at least equal to the number of data points.
+   Although any number is supported, it is best to choose ``n_iterations`` to be
+   at least equal to the number of data points.
 
 Because computing the Least Core values requires the solution of a linear and a
 quadratic problem *after* computing all the utility values, we offer the
@@ -538,6 +546,7 @@ list of problems to solve, then solve them in parallel with
 
    from pydvl.utils import Dataset, Utility
    from pydvl.value.least_core import mclc_prepare_problem, lc_solve_problems
+
    model = ...
    dataset = Dataset(...)
    n_iterations = ...
@@ -548,15 +557,102 @@ list of problems to solve, then solve them in parallel with
    values = lc_solve_problems(problems)
 
 
-Other methods
-=============
+Semi-values
+===========
+
+Shapley values are a particular case of a more general concept called semi-value,
+which generalizes them to different weighting schemes. A **semi-value** is
+any valuation function with the form:
+
+$$
+v_\text{semi}(i) = \sum_{k=1}^n w(k)
+\sum_{S \subset D_{-i}^{(k)}} [U(S_{+i})-U(S)],
+$$
+
+where the coefficients $w(k)$ satisfy the property:
+
+$$\sum_{k=1}^n \binom{n-1}{k-1} w(k) = 1.$$
+
+Two instances of this are **Banzhaf indices** (:footcite:t:`wang_data_2022`),
+and **Beta Shapley** (:footcite:t:`kwon_beta_2022`), with better numerical and
+rank stability in certain situations.
+
+.. note::
+
+   Shapley values are a particular case of semi-values and can therefore also be
+   computed with the methods described here. However, as of version 0.5.1, we
+   recommend using :func:`~pydvl.value.shapley.compute_shapley_values` instead,
+   in particular because it implements truncated Monte Carlo sampling for faster
+   computation.
+
+
+Beta Shapley
+^^^^^^^^^^^^
+
+For some machine learning applications, where the utility is typically the
+performance when trained on a set $S \subset D$, diminishing returns are often
+observed when computing the marginal utility of adding a new data point.
+
+Beta Shapley is a weighting scheme that uses the Beta function to place more
+weight on subsets deemed to be more informative. The weights are defined as:
+
+$$
+w(k) := \frac{B(k+\beta, n-k+1+\alpha)}{B(\alpha, \beta)},
+$$
+
+where $B$ is the `Beta function <https://en.wikipedia.org/wiki/Beta_function>`_,
+and $\alpha$ and $\beta$ are parameters that control the weighting of the
+subsets. Setting both to 1 recovers Shapley values, while $\alpha = 1$,
+$\beta = 16$ is reported in :footcite:t:`kwon_beta_2022` to be a good choice for
+some applications. See however :ref:`banzhaf indices` for an alternative choice
+of weights which is reported to work better.
+
+.. code-block:: python
+
+   from pydvl.utils import Dataset, Utility
+   from pydvl.value import MaxUpdates, compute_semivalues
+
+   model = ...
+   data = Dataset(...)
+   utility = Utility(model, data)
+   values = compute_semivalues(
+       u=utility, mode="beta_shapley", done=MaxUpdates(500), alpha=1, beta=16
+   )
+
+.. _banzhaf indices:
 
-There are other game-theoretic concepts in pyDVL's roadmap, based on the notion
-of semivalue, which is a generalization to different weighting schemes:
-in particular **Banzhaf indices** and **Beta Shapley**, with better numerical
-and rank stability in certain situations.
+Banzhaf indices
+^^^^^^^^^^^^^^^
 
-Contributions are welcome!
+As noted below in :ref:`problems of data values`, the Shapley value can be very
+sensitive to variance in the utility function. For machine learning applications,
+where the utility is typically the performance when trained on a set $S \subset
+D$, this variance is often largest for smaller subsets $S$. It is therefore
+reasonable to try reducing the relative contribution of these subsets with
+adequate weights.
+
+One such choice of weights is the Banzhaf index, which is defined as the
+constant:
+
+$$w(k) := \frac{1}{2^{n-1}},$$
+
+for all set sizes $k$. The intuition for picking a constant weight is that for
+any choice of weight function $w$, one can always construct a utility with
+higher variance where $w$ is greater. Therefore, in a worst-case sense, the best
+one can do is to pick a constant weight.
+
+The authors of :footcite:t:`wang_data_2022` show that Banzhaf indices are more
+robust to variance in the utility function than Shapley and Beta Shapley values.
+
+.. code-block:: python
+
+   from pydvl.utils import Dataset, Utility
+   from pydvl.value import MaxUpdates, compute_semivalues
+
+   model = ...
+   data = Dataset(...)
+   utility = Utility(model, data)
+   values = compute_semivalues(u=utility, mode="banzhaf", done=MaxUpdates(500))
 
 
 .. _problems of data values:
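
The weight functions introduced in the new docs section above are easy to compare
numerically. A small sketch (ours; it simply evaluates the formulas as written,
assuming `scipy` is available, and is not pyDVL's internal code):

    import numpy as np
    from scipy.special import beta as beta_fn

    def beta_shapley_weight(k, n, alpha=1.0, beta=16.0):
        # w(k) = B(k + beta, n - k + 1 + alpha) / B(alpha, beta)
        return beta_fn(k + beta, n - k + 1 + alpha) / beta_fn(alpha, beta)

    def banzhaf_weight(k, n):
        # Constant w(k) = 1 / 2^(n-1), independent of the subset size k
        return 1.0 / 2 ** (n - 1)

    n = 8
    ks = np.arange(1, n + 1)
    beta_weights = np.array([beta_shapley_weight(k, n) for k in ks])
    print(beta_weights / beta_weights.sum())  # relative weight per subset size
    print(banzhaf_weight(1, n))               # the same for every k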

docs/conf.py

Lines changed: 1 addition & 1 deletion
@@ -334,7 +334,7 @@ def lineno_from_object_name(source_file, object_name):
 
 # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
 html_show_copyright = True
-copyright = "2022 AppliedAI Institute gGmbH"
+copyright = "AppliedAI Institute gGmbH"
 
 # If true, an OpenSearch description file will be output, and all pages will
 # contain a <link> tag referring to it. The value of this option must be the

src/pydvl/influence/conjugate_gradient.py

Lines changed: 0 additions & 1 deletion
@@ -123,7 +123,6 @@ def batched_preconditioned_conjugate_gradient(
     atol = np.linalg.norm(b, axis=1) * rtol
 
     while iteration < max_iterations:
-
         # remaining fields
         iteration += 1
         not_yet_converged_indices = np.argwhere(np.logical_not(converged))[:, 0]
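
For context, the loop touched by this hunk implements a batched, preconditioned
variant of conjugate gradients. The core iteration for a single right-hand side,
as a minimal sketch rather than the actual pyDVL code, looks like this:

    import numpy as np

    def conjugate_gradient(A, b, rtol=1e-6, max_iterations=100):
        """Solve Ax = b for symmetric positive-definite A."""
        x = np.zeros_like(b)
        r = b - A @ x      # residual
        p = r.copy()       # search direction
        rs_old = r @ r
        for _ in range(max_iterations):
            Ap = A @ p
            alpha = rs_old / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) <= rtol * np.linalg.norm(b):
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x

    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    print(conjugate_gradient(A, b))  # ≈ np.linalg.solve(A, b)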

src/pydvl/utils/numeric.py

Lines changed: 43 additions & 18 deletions
@@ -2,16 +2,14 @@
 This module contains routines for numerical computations used across the
 library.
 """
+from __future__ import annotations
 
 from itertools import chain, combinations
-from typing import Collection, Generator, Iterator, Optional, Tuple, TypeVar
+from typing import Collection, Generator, Iterator, Optional, Tuple, TypeVar, overload
 
 import numpy as np
 from numpy.typing import NDArray
 
-FloatOrArray = TypeVar("FloatOrArray", float, NDArray[np.float_])
-IntOrArray = TypeVar("IntOrArray", int, NDArray[np.int_])
-
 __all__ = [
     "running_moments",
     "linear_regression_analytical_derivative_d2_theta",
@@ -20,6 +18,7 @@
     "num_samples_permutation_hoeffding",
     "powerset",
     "random_matrix_with_condition_number",
+    "random_subset",
     "random_powerset",
     "random_subset_of_size",
     "top_k_value_accuracy",
@@ -63,6 +62,19 @@ def num_samples_permutation_hoeffding(eps: float, delta: float, u_range: float)
     return int(np.ceil(np.log(2 / delta) * 2 * u_range**2 / eps**2))
 
 
+def random_subset(s: NDArray[T], q: float = 0.5) -> NDArray[T]:
+    """Returns one subset at random from ``s``.
+
+    :param s: set to sample from
+    :param q: Sampling probability for elements. The default 0.5 yields a
+        uniform distribution over the power set of s.
+    :return: the subset
+    """
+    rng = np.random.default_rng()
+    selection = rng.uniform(size=len(s)) > q
+    return s[selection]
+
+
 def random_powerset(
     s: NDArray[T], n_samples: Optional[int] = None, q: float = 0.5
 ) -> Generator[NDArray[T], None, None]:
@@ -72,9 +84,8 @@
     See `powerset()` if you wish to deterministically generate all subsets.
 
     To generate subsets, `len(s)` Bernoulli draws with probability `q` are
-    drawn.
-    The default value of `q = 0.5` provides a uniform distribution over the
-    power set of `s`. Other choices can be used e.g. to implement
+    drawn. The default value of `q = 0.5` provides a uniform distribution over
+    the power set of `s`. Other choices can be used e.g. to implement
     :func:`Owen sampling
     <pydvl.value.shapley.montecarlo.owen_sampling_shapley>`.
 
@@ -94,19 +105,17 @@
     if q < 0 or q > 1:
         raise ValueError("Element sampling probability must be in [0,1]")
 
-    rng = np.random.default_rng()
     total = 1
     if n_samples is None:
         n_samples = np.iinfo(np.int32).max
     while total <= n_samples:
-        selection = rng.uniform(size=len(s)) > q
-        subset = s[selection]
-        yield subset
+        yield random_subset(s, q)
         total += 1
 
 
 def random_subset_of_size(s: NDArray[T], size: int) -> NDArray[T]:
-    """Samples a random subset of given size.
+    """Samples a random subset of given size uniformly from the powerset
+    of ``s``.
 
     :param s: Set to sample from
    :param size: Size of the subset to generate
@@ -221,13 +230,29 @@ linear_regression_analytical_derivative_d_x_d_theta(
     return full_derivative / N  # type: ignore
 
 
-# FIXME: FloatOrArray doesn't really work
+@overload
+def running_moments(
+    previous_avg: float, previous_variance: float, count: int, new_value: float
+) -> Tuple[float, float]:
+    ...
+
+
+@overload
+def running_moments(
+    previous_avg: NDArray[np.float_],
+    previous_variance: NDArray[np.float_],
+    count: int,
+    new_value: NDArray[np.float_],
+) -> Tuple[NDArray[np.float_], NDArray[np.float_]]:
+    ...
+
+
 def running_moments(
-    previous_avg: FloatOrArray,
-    previous_variance: FloatOrArray,
-    count: IntOrArray,
-    new_value: FloatOrArray,
-) -> Tuple:  # [FloatOrArray, FloatOrArray]:
+    previous_avg: float | NDArray[np.float_],
+    previous_variance: float | NDArray[np.float_],
+    count: int,
+    new_value: float | NDArray[np.float_],
+) -> Tuple[float | NDArray[np.float_], float | NDArray[np.float_]]:
     """Uses Welford's algorithm to calculate the running average and variance of
     a set of numbers.
 
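
The body of `running_moments` falls outside this hunk. For reference, a Welford
update consistent with the signatures above (a sketch only; the arithmetic works
identically for scalars and arrays, where `count` is the number of samples seen
so far) would be:

    def welford_update(previous_avg, previous_variance, count, new_value):
        # Shift the mean by the scaled residual of the new value.
        new_average = previous_avg + (new_value - previous_avg) / (count + 1)
        # Update the (population) variance using both the old and new means.
        new_variance = previous_variance + (
            (new_value - previous_avg) * (new_value - new_average) - previous_variance
        ) / (count + 1)
        return new_average, new_variance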

src/pydvl/value/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -9,5 +9,7 @@
 from ..utils import Dataset, Scorer, Utility
 from .least_core import *
 from .loo import *
+from .sampler import *
+from .semivalues import *
 from .shapley import *
 from .stopping import *
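
The new `sampler` and `semivalues` modules wired up here are the heart of the
PR: a sampler produces subsets of each index's complement, and a semi-value
method averages weighted marginal utilities over them. As a rough illustration
of that division of labour (hypothetical names; not the actual
`pydvl.value.sampler` interface):

    from itertools import islice

    import numpy as np

    def uniform_sampler(indices):
        # Yield (index, random subset of the remaining indices) pairs forever.
        rng = np.random.default_rng()
        while True:
            idx = int(rng.choice(indices))
            others = indices[indices != idx]
            selection = rng.uniform(size=len(others)) > 0.5
            yield idx, others[selection]

    for idx, subset in islice(uniform_sampler(np.arange(5)), 3):
        print(idx, subset)  # a semi-value method would use u(subset + idx) - u(subset)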
