Commit 63af7cc

Merge pull request #237 from appliedAI-Initiative/157-implement-least-core-v2

Implement Least Core

2 parents c94aafe + 66e0467

23 files changed: +1980 -210 lines

CHANGELOG.md

Lines changed: 5 additions & 0 deletions

@@ -22,6 +22,11 @@
   fixes a bug in `MapReduceJob`'s chunkification when `n_runs` >= `n_jobs`,
   and defines a sequential parallel backend to run all jobs in the current thread
   [PR #232](https://github.com/appliedAI-Initiative/pyDVL/pull/232)
+- **New method**: Implements exact and Monte Carlo Least Core for data valuation,
+  adds `from_arrays()` class method to the `Dataset` and `GroupedDataset` classes,
+  adds `extra_values` argument to `ValuationResult`,
+  adds `compute_removal_score()` and `compute_random_removal_score()` helper functions
+  [PR #237](https://github.com/appliedAI-Initiative/pyDVL/pull/237)
 
 ## 0.3.0 - 💥 Breaking changes

README.md

Lines changed: 23 additions & 16 deletions

@@ -46,6 +46,9 @@ methods from the following papers:
   [A Multilinear Sampling Algorithm to Estimate Shapley Values](https://doi.org/10.1109/ICPR48806.2021.9412511).
   In 2020 25th International Conference on Pattern Recognition (ICPR), 7992–99.
   IEEE, 2021.
+- Yan, T., & Procaccia, A. D.
+  [If You Like Shapley Then You’ll Love the Core](https://ojs.aaai.org/index.php/AAAI/article/view/16721).
+  Proceedings of the AAAI Conference on Artificial Intelligence, 35(6) (2021): 5751-5759.
 
 Influence Functions compute the effect that single points have on an estimator /
 model. We implement methods from the following papers:

@@ -76,18 +79,7 @@ documentation.
 
 # Usage
 
-pyDVL uses [Memcached](https://memcached.org/) to cache certain results and
-speed up computation. You can run it either locally or, using
-[Docker](https://www.docker.com/):
-
-```shell
-docker container run --rm -p 11211:11211 --name pydvl-cache -d memcached:latest
-```
-
-You can read more in the [caching module's
-documentation](https://appliedAI-Initiative.github.io/pyDVL/pydvl/utils/caching.html).
-
-Once that's done, the steps required to compute values for your samples are
+The steps required to compute values for your samples are:
 
 1. Create a `Dataset` object with your train and test splits.
 2. Create an instance of a `SupervisedModel` (basically any sklearn compatible

@@ -108,14 +100,14 @@ from sklearn.model_selection import train_test_split
 
 X, y = np.arange(100).reshape((50, 2)), np.arange(50)
 X_train, X_test, y_train, y_test = train_test_split(
-    X, y, test_size=0.5, random_state=16
-)
+    X, y, test_size=0.5, random_state=16
+)
 dataset = Dataset(X_train, y_train, X_test, y_test)
 model = LinearRegression()
 utility = Utility(model, dataset)
 values = compute_shapley_values(
-    u=utility, max_iterations=100, mode="truncated_montecarlo"
-)
+    u=utility, max_iterations=100, mode="truncated_montecarlo"
+)
 ```
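The hunk above appears to change only whitespace, so the example is split across markers. For reference, here is a consolidated, runnable version of the README snippet; the `pydvl` import paths are not shown in the hunk and are an assumption based on pyDVL's documented API at the time of this PR:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Import paths assumed from pyDVL's documented API; not shown in the hunk.
from pydvl.utils import Dataset, Utility
from pydvl.value.shapley import compute_shapley_values

# Toy data: 50 points in 2 dimensions with integer targets.
X, y = np.arange(100).reshape((50, 2)), np.arange(50)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=16
)
dataset = Dataset(X_train, y_train, X_test, y_test)
model = LinearRegression()
utility = Utility(model, dataset)
values = compute_shapley_values(
    u=utility, max_iterations=100, mode="truncated_montecarlo"
)
```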
 For more instructions and information refer to [Getting

@@ -124,6 +116,21 @@ the documentation. We provide several
 [examples](https://appliedAI-Initiative.github.io/pyDVL/examples/index.html)
 with details on the algorithms and their applications.
 
+## Caching
+
+pyDVL offers the possibility to cache certain results and
+speed up computation. It uses [Memcached](https://memcached.org/) for that.
+
+You can run it either locally or using
+[Docker](https://www.docker.com/):
+
+```shell
+docker container run --rm -p 11211:11211 --name pydvl-cache -d memcached:latest
+```
+
+You can read more in the [caching module's
+documentation](https://appliedAI-Initiative.github.io/pyDVL/pydvl/utils/caching.html).
+
 # Contributing
 
 Please open new issues for bugs, feature requests and extensions. You can read
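To check that the Memcached server started above is reachable before relying on the cache, any client that speaks the Memcached protocol will do. A minimal sketch using `pymemcache` (an assumption; the caching module's documentation covers pyDVL's own configuration):

```python
# Minimal reachability check for the Memcached server started above.
# pymemcache is just one client that speaks the standard Memcached protocol.
from pymemcache.client.base import Client

client = Client(("localhost", 11211))
client.set("pydvl-ping", "pong")
assert client.get("pydvl-ping") == b"pong"  # values are returned as bytes
```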

docs/30-data-valuation.rst

Lines changed: 133 additions & 6 deletions

@@ -205,7 +205,10 @@ The value $v$ of the $i$-th sample in dataset $D$ wrt. utility $u$ is computed
 as a weighted sum of its marginal utility wrt. every possible coalition of
 training samples within the training set:
 
-$$v_u(x_i) = \frac{1}{n} \sum_{S \subseteq D \setminus \{x_i\}} \binom{n-1}{ | S | }^{-1} [u(S \cup \{x_i\}) − u(S)] ,$$
+$$
+v_u(x_i) = \frac{1}{n} \sum_{S \subseteq D \setminus \{x_i\}}
+\binom{n-1}{ | S | }^{-1} [u(S \cup \{x_i\}) − u(S)]
+,$$
 
 .. code-block:: python
@@ -253,7 +256,10 @@ of the utility from $\{0,1\}^n$, where a 1 in position $i$ means that sample
 $x_i$ is used to train the model, to $[0,1]^n$. The ensuing expression for
 Shapley value uses integration instead of discrete weights:
 
-$$v_u(i) = \int_0^1 \mathbb{E}_{S \sim P_q(D_{\backslash \{ i \}})} [u(S \cup {i}) - u(S)].$$
+$$
+v_u(i) = \int_0^1 \mathbb{E}_{S \sim P_q(D_{\backslash \{ i \}})}
+[u(S \cup {i}) - u(S)]
+.$$
 
 Using Owen sampling follows the same pattern as every other method for Shapley
 values in pyDVL. First construct the dataset and utility, then call
@@ -282,7 +288,10 @@ Permutation Shapley
 
 An equivalent way of computing Shapley values appears often in the literature.
 It uses permutations over indices instead of subsets:
 
-$$v_u(x_i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)} [u(\sigma_{i-1} \cup {i}) − u(\sigma_{i})],$$
+$$
+v_u(x_i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)}
+[u(\sigma_{i-1} \cup {i}) − u(\sigma_{i})]
+,$$
 
 where $\sigma_i$ denotes the set of indices in permutation $\sigma$ up until the
 position of index $i$. To approximate this sum (with $\mathcal{O}(n!)$ terms!)
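To make the permutation formula in this hunk concrete, here is a minimal Monte Carlo sketch of the estimator it suggests (illustrative only, with a plain set function standing in for pyDVL's `Utility`):

```python
import random

def permutation_shapley(utility, n_points, n_permutations, seed=16):
    """Estimate Shapley values by averaging marginal utilities over
    randomly sampled permutations of the index set."""
    rng = random.Random(seed)
    values = [0.0] * n_points
    indices = list(range(n_points))
    for _ in range(n_permutations):
        rng.shuffle(indices)
        prefix = set()
        u_prev = utility(prefix)
        for i in indices:
            prefix.add(i)
            u_curr = utility(prefix)
            values[i] += u_curr - u_prev  # marginal contribution of i
            u_prev = u_curr
    return [v / n_permutations for v in values]

# Toy additive utility: the Shapley values recover the weights exactly,
# which makes it easy to sanity-check the estimator.
weights = [0.1, 0.4, 0.2, 0.3]
estimates = permutation_shapley(lambda s: sum(weights[i] for i in s), 4, 100)
```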
@@ -324,13 +333,131 @@ and can be used in pyDVL with:
 
    utility = Utility(model, data)
    values = compute_shapley_values(u=utility, mode="knn")
 
+.. _Least Core:
+
+Core values
+===========
+
+The Shapley values define a fair way to distribute payoffs amongst all
+participants when they form a grand coalition. But they do not consider
+the question of stability: under which conditions do all participants
+form the grand coalition? Would the participants be willing to form
+the grand coalition given how the payoffs are assigned,
+or would some of them prefer to form smaller coalitions?
+
+The Core is another approach to computing data values, originating
+in cooperative game theory, that attempts to ensure this stability.
+It is the set of feasible payoffs that cannot be improved upon
+by any coalition of the participants.
+
+It satisfies the following two properties:
+
+- **Efficiency**:
+  The payoffs are distributed such that it is not possible
+  to make any participant better off
+  without making another one worse off:
+  $$\sum_{x_i\in D} v_u(x_i) = u(D).$$
+
+- **Coalitional rationality**:
+  The sum of payoffs to the agents in any coalition $S$ is at
+  least as large as the amount that these agents could earn by
+  forming a coalition on their own:
+  $$\sum_{x_i\in S} v_u(x_i) \geq u(S), \forall S \subseteq D.$$
+
+Least Core values
+^^^^^^^^^^^^^^^^^
+
+Unfortunately, for many cooperative games the Core may be empty.
+By relaxing the coalitional rationality property by $e \gt 0$,
+we are then able to find approximate payoffs:
+
+$$
+\sum_{x_i\in S} v_u(x_i) + e \geq u(S), \forall S \subseteq D
+.$$
+
+The least core value $v$ of the $i$-th sample in dataset $D$ wrt.
+utility $u$ is computed by solving the following linear program:
+
+$$
+\begin{array}{lll}
+\text{minimize} & e & \\
+\text{subject to} & \sum_{x_i\in D} v_u(x_i) = u(D) & \\
+& \sum_{x_i\in S} v_u(x_i) + e \geq u(S) &, \forall S \subseteq D \\
+\end{array}
+$$
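The linear program above is small enough to solve directly for toy games. A minimal sketch using `scipy.optimize.linprog` with a hypothetical three-player utility (for illustration only; this is not pyDVL's implementation):

```python
from itertools import chain, combinations

import numpy as np
from scipy.optimize import linprog

# Hypothetical 3-player game: players 0 and 1, or 0 and 2, create all the value.
players = (0, 1, 2)
u = {
    (0,): 0.0, (1,): 0.0, (2,): 0.0,
    (0, 1): 1.0, (0, 2): 1.0, (1, 2): 0.0,
    (0, 1, 2): 1.0,
}

n = len(players)
# All proper, non-empty coalitions (the grand coalition is handled
# by the efficiency constraint below).
subsets = list(chain.from_iterable(combinations(players, r) for r in range(1, n)))

# Decision variables x = (v_0, ..., v_{n-1}, e); objective: minimize e.
c = np.zeros(n + 1)
c[-1] = 1.0

# Coalitional rationality, sum_{i in S} v_i + e >= u(S), rewritten for
# linprog's A_ub @ x <= b_ub convention as -sum_{i in S} v_i - e <= -u(S).
A_ub = np.zeros((len(subsets), n + 1))
b_ub = np.zeros(len(subsets))
for row, S in enumerate(subsets):
    A_ub[row, list(S)] = -1.0
    A_ub[row, -1] = -1.0
    b_ub[row] = -u[S]

# Efficiency: sum_i v_i = u(D).
A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
b_eq = np.array([u[players]])

result = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                 bounds=[(None, None)] * (n + 1))
print("least core values:", result.x[:n], "e* =", result.x[-1])
```

For this game the Core is non-empty, so the optimal $e^{*}$ comes out as 0 and player 0, who is needed to create any value, receives the full payoff.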
+Exact Least Core
+----------------
+
+This first algorithm is just a verbatim implementation of the definition.
+As such it returns as exact a value as the utility function allows
+(see what this means in :ref:`problems of data values`).
+
+.. code-block:: python
+
+   from pydvl.utils import Dataset, Utility
+   from pydvl.value.least_core import exact_least_core
+   model = ...
+   dataset = Dataset(...)
+   utility = Utility(model, dataset)
+   values = exact_least_core(utility)
+
+Monte Carlo Least Core
+----------------------
+
+Because the number of subsets $S \subseteq D \setminus \{x_i\}$ is
+$2^{ | D | - 1 }$, one typically must resort to approximations.
+
+The simplest approximation consists of two relaxations of the Least Core
+(:footcite:t:`yan_procaccia_2021`):
+
+- Further relaxing the coalitional rationality property by
+  a constant value $\epsilon > 0$:
+
+  $$
+  \sum_{x_i\in S} v_u(x_i) + e + \epsilon \geq u(S)
+  $$
+
+- Sampling only a fraction of all possible subsets instead of
+  enumerating them.
+
+Combined, this gives us the following property:
+
+$$
+P_{S\sim D}\left[\sum_{x_i\in S} v_u(x_i) + e^{*} + \epsilon \geq u(S)\right]
+\geq 1 - \delta
+,$$
+
+where $e^{*}$ is the optimal least core value.
+
+With these relaxations, we obtain a polynomial running time.
+
+.. code-block:: python
+
+   from pydvl.utils import Dataset, Utility
+   from pydvl.value.least_core import montecarlo_least_core
+   model = ...
+   dataset = Dataset(...)
+   max_iterations = ...
+   utility = Utility(model, dataset)
+   values = montecarlo_least_core(utility, max_iterations=max_iterations)
+
+.. note::
+
+   ``max_iterations`` needs to be at least equal to the number of data points.
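Continuing the toy LP example from the exact section, a sketch of this Monte Carlo relaxation: sample a limited number of coalitions and impose only their rationality constraints (the utility and all names are made up for illustration; `montecarlo_least_core` above is the supported entry point):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(16)
n = 8
weights = np.linspace(0.5, 1.5, n)

def u(indices):
    """Toy utility with diminishing returns in the coalition's total weight."""
    return float(np.sqrt(weights[indices].sum())) if len(indices) else 0.0

# Sample coalitions uniformly; per the note above, use at least n of them.
m = 64
masks = rng.integers(0, 2, size=(m, n)).astype(bool)
masks = masks[masks.any(axis=1)]  # drop empty coalitions

# Same LP as in the exact case, but with only the sampled constraints.
c = np.zeros(n + 1)
c[-1] = 1.0
A_ub = np.hstack([-masks.astype(float), -np.ones((len(masks), 1))])
b_ub = -np.array([u(np.flatnonzero(row)) for row in masks])
A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
b_eq = np.array([u(np.arange(n))])

result = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                 bounds=[(None, None)] * (n + 1))
print("approximate least core values:", result.x[:n].round(3))
```

With more sampled coalitions the estimate approaches the exact least core value, at the cost of a larger LP.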
 
 Other methods
 =============
 
-Other game-theoretic concepts in pyDVL's roadmap are the **Least Core** (in
-progress), and **Banzhaf indices** (the latter is just a different weighting
-scheme with better numerical stability properties). Contributions are welcome!
+There are other game-theoretic concepts in pyDVL's roadmap, based on the notion
+of a semivalue, which generalizes the Shapley value to different weighting
+schemes: in particular **Banzhaf indices** and **Beta Shapley**, which offer
+better numerical and rank stability in certain situations.
+
+Contributions are welcome!
 
 
 .. _problems of data values:

docs/examples/index.rst

Lines changed: 1 addition & 0 deletions

@@ -11,6 +11,7 @@ The following examples illustrate the usage of pyDVL features.
    :glob:
 
    shapley*
+   least_core*
 
 .. toctree::
    :caption: Influence function

docs/pydvl.bib

Lines changed: 14 additions & 0 deletions

@@ -106,3 +106,17 @@ @inproceedings{wang_improving_2022
   archiveprefix = {arXiv},
   langid = {english}
 }
+
+@article{yan_procaccia_2021,
+  title = {If You Like Shapley Then You’ll Love the Core},
+  volume = {35},
+  url = {https://ojs.aaai.org/index.php/AAAI/article/view/16721},
+  doi = {10.1609/aaai.v35i6.16721},
+  abstract = {The prevalent approach to problems of credit assignment in machine learning -- such as feature and data valuation -- is to model the problem at hand as a cooperative game and apply the Shapley value. But cooperative game theory offers a rich menu of alternative solution concepts, which famously includes the core and its variants. Our goal is to challenge the machine learning community’s current consensus around the Shapley value, and make a case for the core as a viable alternative. To that end, we prove that arbitrarily good approximations to the least core -- a core relaxation that is always feasible -- can be computed efficiently (but prove an impossibility for a more refined solution concept, the nucleolus). We also perform experiments that corroborate these theoretical results and shed light on settings where the least core may be preferable to the Shapley value.},
+  number = {6},
+  journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
+  author = {Yan, Tom and Procaccia, Ariel D.},
+  year = {2021},
+  month = {May},
+  pages = {5751-5759}
+}
