
Commit bfa3c71

Merge branch 'develop' into fix/cleanup

2 parents f530e20 + 633621e

25 files changed: +655 −177 lines

CHANGELOG.md

Lines changed: 12 additions & 10 deletions
@@ -1,6 +1,6 @@
 # Changelog
 
-## Unreleased - 💥 More breaking changes!
+## Unreleased - 💥 More breaking changes and new algorithms 🏭
 
 - GH action to mark issues as stale
   [PR #201](https://github.com/appliedAI-Initiative/pyDVL/pull/201)
@@ -15,18 +15,20 @@
   new example notebook
   [PR #195](https://github.com/appliedAI-Initiative/pyDVL/pull/195)
 - **Breaking change**: Passes the input to `MapReduceJob` at initialization,
-  removes `chunkify_inputs` argument from `MapReduceJob`,
-  removes `n_runs` argument from `MapReduceJob`,
-  calls the parallel backend's `put()` method for each generated chunk in `_chunkify()`,
-  renames ParallelConfig's `num_workers` attribute to `n_local_workers`,
-  fixes a bug in `MapReduceJob`'s chunkification when `n_runs` >= `n_jobs`,
-  and defines a sequential parallel backend to run all jobs in the current thread
+  removes `chunkify_inputs` argument from `MapReduceJob`, removes `n_runs`
+  argument from `MapReduceJob`, calls the parallel backend's `put()` method for
+  each generated chunk in `_chunkify()`, renames ParallelConfig's `num_workers`
+  attribute to `n_local_workers`, fixes a bug in `MapReduceJob`'s chunkification
+  when `n_runs` >= `n_jobs`, and defines a sequential parallel backend to run
+  all jobs in the current thread
   [PR #232](https://github.com/appliedAI-Initiative/pyDVL/pull/232)
 - **New method**: Implements exact and monte carlo Least Core for data valuation,
-  adds `from_arrays()` class method to the `Dataset` and `GroupedDataset` classes,
-  adds `extra_values` argument to `ValuationResult`,
-  adds `compute_removal_score()` and `compute_random_removal_score()` helper functions
+  adds `from_arrays()` class method to the `Dataset` and `GroupedDataset`
+  classes, adds `extra_values` argument to `ValuationResult`, adds
+  `compute_removal_score()` and `compute_random_removal_score()` helper functions
   [PR #237](https://github.com/appliedAI-Initiative/pyDVL/pull/237)
+- **New method**: Group Testing Shapley for valuation, from _Jia et al. 2019_
+  [PR #240](https://github.com/appliedAI-Initiative/pyDVL/pull/240)
 - Fixes bug in ray initialization in `RayParallelBackend` class
   [PR #239](https://github.com/appliedAI-Initiative/pyDVL/pull/239)
 
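For readers tracking the API change above: a minimal, hypothetical sketch of the new `MapReduceJob` calling convention. Only what the changelog entry states is confirmed (inputs passed at initialization, no `chunkify_inputs` or `n_runs`, `ParallelConfig` with `n_local_workers`, and a sequential backend running in the current thread); the import paths, the `map_func`/`reduce_func` keyword names, and the invocation style are assumptions for illustration, not the confirmed API.

    # Hypothetical sketch, not the confirmed pyDVL API
    import numpy as np

    from pydvl.utils.config import ParallelConfig   # import path assumed
    from pydvl.utils.parallel import MapReduceJob   # import path assumed

    # Per the changelog, the sequential backend runs all jobs in the current thread
    config = ParallelConfig(backend="sequential", n_local_workers=1)

    job = MapReduceJob(
        np.arange(100),     # inputs are now passed at initialization
        map_func=np.sum,    # keyword names assumed
        reduce_func=np.sum,
        config=config,
    )
    result = job()          # invocation style assumed; could equally be job.run()
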
README.md

Lines changed: 8 additions & 3 deletions
@@ -44,11 +44,16 @@ methods from the following papers:
   Proceedings of the VLDB Endowment 12, no. 11 (1 July 2019): 1610–23.
 - Okhrati, Ramin, and Aldo Lipani.
   [A Multilinear Sampling Algorithm to Estimate Shapley Values](https://doi.org/10.1109/ICPR48806.2021.9412511).
-  In 2020 25th International Conference on Pattern Recognition (ICPR), 7992–99.
+  In 25th International Conference on Pattern Recognition (ICPR 2020), 7992–99.
   IEEE, 2021.
-- Yan, T., & Procaccia, A. D.
+- Yan, T., & Procaccia, A. D.
   [If You Like Shapley Then You’ll Love the Core]().
   Proceedings of the AAAI Conference on Artificial Intelligence, 35(6) (2021): 5751-5759.
+- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve
+  Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J. Spanos.
+  [Towards Efficient Data Valuation Based on the Shapley Value](http://proceedings.mlr.press/v89/jia19a.html).
+  In 22nd International Conference on Artificial Intelligence and Statistics,
+  1167–76. PMLR, 2019.
 
 Influence Functions compute the effect that single points have on an estimator /
 model. We implement methods from the following papers:
@@ -106,7 +111,7 @@ dataset = Dataset(X_train, y_train, X_test, y_test)
 model = LinearRegression()
 utility = Utility(model, dataset)
 values = compute_shapley_values(
-    u=utility, max_iterations=100, mode="truncated_montecarlo"
+    u=utility, n_iterations=100, mode="truncated_montecarlo"
 )
 ```
 
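Pieced together from the context lines above, a self-contained version of the updated README snippet might look as follows. The synthetic data and train/test split are illustrative additions, not part of the commit; the `Dataset`, `Utility` and `compute_shapley_values` usage comes straight from the diff.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    from pydvl.utils import Dataset, Utility
    from pydvl.value.shapley import compute_shapley_values

    # Illustrative synthetic regression data (not part of the commit)
    rng = np.random.default_rng(16)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=16)

    dataset = Dataset(X_train, y_train, X_test, y_test)
    model = LinearRegression()
    utility = Utility(model, dataset)
    values = compute_shapley_values(
        u=utility, n_iterations=100, mode="truncated_montecarlo"
    )
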
docs/30-data-valuation.rst

Lines changed: 52 additions & 5 deletions
@@ -273,7 +273,7 @@ values in pyDVL. First construct the dataset and utility, then call
    dataset = Dataset(...)
    utility = Utility(data, model)
    values = compute_shapley_values(
-      u=utility, mode="owen", max_iterations=4, max_q=200
+      u=utility, mode="owen", n_iterations=4, max_q=200
    )
 
 There are more details on Owen
@@ -309,7 +309,7 @@ efficient enough to be useful in some applications.
    data = Dataset(...)
    utility = Utility(model, data)
    values = compute_shapley_values(
-      u=utility, mode="truncated_montecarlo", max_iterations=100
+      u=utility, mode="truncated_montecarlo", n_iterations=100
    )
 
 
@@ -333,6 +333,53 @@ and can be used in pyDVL with:
    utility = Utility(model, data)
    values = compute_shapley_values(u=utility, mode="knn")
 
+
+Group testing
+^^^^^^^^^^^^^
+
+An alternative approach introduced in :footcite:t:`jia_efficient_2019a`
+first approximates the differences of values with a Monte Carlo sum. With
+
+$$\hat{\Delta}_{i j} \approx v_i - v_j,$$
+
+one then solves the following linear constraint satisfaction problem (CSP) to
+infer the final values:
+
+$$
+\begin{array}{lll}
+\sum_{i = 1}^N v_i & = & U (D)\\
+| v_i - v_j - \hat{\Delta}_{i j} | & \leqslant &
+\frac{\varepsilon}{2 \sqrt{N}}
+\end{array}
+$$
+
+.. warning::
+   We have reproduced this method in pyDVL for completeness and benchmarking,
+   but we don't advocate its use because of the speed and memory cost. Despite
+   our best efforts, the number of samples required in practice for convergence
+   can be several orders of magnitude worse than with e.g. Truncated Monte Carlo.
+
+Usage follows the same pattern as every other Shapley method, but with the
+addition of an ``eps`` parameter required for the solution of the CSP. It should
+be the same value used to compute the minimum number of samples required. This
+can be done with :func:`~pydvl.value.shapley.gt.num_samples_eps_delta`, but note
+that the number returned will be huge! In practice, fewer samples can be enough,
+but the actual number will strongly depend on the utility, in particular its
+variance.
+
+.. code-block:: python
+
+   from pydvl.utils import Dataset, Utility
+   from pydvl.value.shapley import compute_shapley_values
+
+   model = ...
+   data = Dataset(...)
+   utility = Utility(model, data, score_range=(_min, _max))
+   min_iterations = num_samples_eps_delta(epsilon, delta, n, utility.score_range)
+   values = compute_shapley_values(
+      u=utility, mode="group_testing", n_iterations=min_iterations, eps=eps
+   )
+
 .. _Least Core:
 
 Core values
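
Two remarks on the new section above. First, the snippet in the hunk uses `num_samples_eps_delta` without importing it; per the :func: reference its documented home is `pydvl.value.shapley.gt`, and `epsilon`, `delta`, `n`, `eps`, `_min` and `_max` are placeholders for user-supplied values. Second, the CSP is a plain linear feasibility problem, and this commit adds `scipy>=1.7.0` to requirements.txt (see below). Purely as an illustration, and without claiming this is pyDVL's actual solver, here is a hedged sketch posing the constraints with `scipy.optimize.linprog`, where `delta_hat[i, j]` holds the Monte Carlo estimates and `total_utility` is U(D):

    import numpy as np
    from scipy.optimize import linprog

    def solve_gt_csp(delta_hat: np.ndarray, total_utility: float, eps: float) -> np.ndarray:
        """Feasibility LP for the group-testing constraints sketched above."""
        n = delta_hat.shape[0]
        bound = eps / (2 * np.sqrt(n))
        A_ub, b_ub = [], []
        for i in range(n):
            for j in range(i + 1, n):
                row = np.zeros(n)
                row[i], row[j] = 1.0, -1.0
                A_ub.append(row)                     # v_i - v_j <= delta_hat[i, j] + bound
                b_ub.append(delta_hat[i, j] + bound)
                A_ub.append(-row)                    # v_j - v_i <= -delta_hat[i, j] + bound
                b_ub.append(-delta_hat[i, j] + bound)
        # Efficiency: the values must sum to the total utility U(D)
        A_eq, b_eq = np.ones((1, n)), np.array([total_utility])
        # Zero objective: any feasible point satisfies the CSP
        result = linprog(
            c=np.zeros(n),
            A_ub=np.asarray(A_ub), b_ub=np.asarray(b_ub),
            A_eq=A_eq, b_eq=b_eq,
            bounds=[(None, None)] * n,
            method="highs",
        )
        return result.x
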
@@ -441,13 +488,13 @@ With these relaxations, we obtain a polynomial running time.
    from pydvl.value.least_core import montecarlo_least_core
    model = ...
    dataset = Dataset(...)
-   max_iterations = ...
+   n_iterations = ...
    utility = Utility(data, model)
-   values = montecarlo_least_core(utility, max_iterations=max_iterations)
+   values = montecarlo_least_core(utility, n_iterations=n_iterations)
 
 .. note::
 
-   ``max_iterations`` needs to be at least equal to the number of data points.
+   ``n_iterations`` needs to be at least equal to the number of data points.
 
 Other methods
 =============
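
As with the Shapley snippets, a minimal sketch of the renamed Least Core call, reusing the `utility` and `dataset` objects from the surrounding examples; the lower bound on `n_iterations` comes straight from the note in the hunk, and the concrete budget is an illustrative assumption:

    from pydvl.value.least_core import montecarlo_least_core

    # The note above requires n_iterations >= number of data points
    n_iterations = max(len(dataset), 1000)  # 1000 is an arbitrary illustrative budget
    values = montecarlo_least_core(utility, n_iterations=n_iterations)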

docs/pydvl.bib

Lines changed: 15 additions & 0 deletions
@@ -17,6 +17,21 @@ @inproceedings{ghorbani_data_2019
   langid = {english}
 }
 
+@inproceedings{jia_efficient_2019,
+  title = {Towards {{Efficient Data Valuation Based}} on the {{Shapley Value}}},
+  booktitle = {Proceedings of the 22nd {{International Conference}} on {{Artificial Intelligence}} and {{Statistics}}},
+  author = {Jia, Ruoxi and Dao, David and Wang, Boxin and Hubis, Frances Ann and Hynes, Nick and G{\"u}rel, Nezihe Merve and Li, Bo and Zhang, Ce and Song, Dawn and Spanos, Costas J.},
+  year = {2019},
+  month = apr,
+  pages = {1167--1176},
+  publisher = {{PMLR}},
+  issn = {2640-3498},
+  url = {http://proceedings.mlr.press/v89/jia19a.html},
+  urldate = {2021-02-12},
+  abstract = {``How much is my data worth?'' is an increasingly common question posed by organizations and individuals alike. An answer to this question could allow, for instance, fairly distributing profits...},
+  langid = {english}
+}
+
 @article{jia_efficient_2019a,
   title = {Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms},
   shorttitle = {{{VLDB}} 2019},

notebooks/influence_synthetic.ipynb

Lines changed: 1 addition & 1 deletion
@@ -670,7 +670,7 @@
     "    y_test=test_data[1].astype(float),\n",
     "    influence_type=\"up\",\n",
     "    inversion_method=\"cg\",\n",
-    "    inversion_method_kwargs={\"max_iterations\": 10, \"max_step_size\": 1},\n",
+    "    inversion_method_kwargs={\"n_iterations\": 10, \"max_step_size\": 1},\n",
     ")\n",
     "mean_train_influences = np.mean(influence_values, axis=0)\n",
     "\n",

notebooks/least_core_basic.ipynb

Lines changed: 3 additions & 3 deletions
@@ -442,7 +442,7 @@
     "    column_name = f\"estimated_value_{budget}\"\n",
     "    values = montecarlo_least_core(\n",
     "        u=utility,\n",
-    "        max_iterations=budget,\n",
+    "        n_iterations=budget,\n",
     "        n_jobs=n_jobs,\n",
     "        progress=False,\n",
     "    )\n",
@@ -649,7 +649,7 @@
     "    else:\n",
     "        values = montecarlo_least_core(\n",
     "            u=utility,\n",
-    "            max_iterations=budget,\n",
+    "            n_iterations=budget,\n",
     "            n_jobs=n_jobs,\n",
     "            progress=False,\n",
     "        )\n",
@@ -775,7 +775,7 @@
     "    else:\n",
     "        values = montecarlo_least_core(\n",
     "            u=utility,\n",
-    "            max_iterations=budget,\n",
+    "            n_iterations=budget,\n",
     "            n_jobs=n_jobs,\n",
     "            progress=False,\n",
     "        )\n",

notebooks/shapley_basic_spotify.ipynb

Lines changed: 2 additions & 2 deletions
@@ -415,7 +415,7 @@
     "    enable_cache=enable_cache,\n",
     ")\n",
     "dvl_df = compute_shapley_values(\n",
-    "    utility, max_iterations=200, n_jobs=available_cpus(), mode=\"truncated_montecarlo\"\n",
+    "    utility, n_iterations=200, n_jobs=available_cpus(), mode=\"truncated_montecarlo\"\n",
     ").to_dataframe(column=\"data_value\")"
    ]
   },
@@ -703,7 +703,7 @@
     "    enable_cache=True,\n",
     ")\n",
     "dvl_df = compute_shapley_values(\n",
-    "    utility, max_iterations=100, n_jobs=available_cpus(), mode=\"truncated_montecarlo\"\n",
+    "    utility, n_iterations=100, n_jobs=available_cpus(), mode=\"truncated_montecarlo\"\n",
     ").to_dataframe(column=\"data_value\")"
    ]
   },

notebooks/shapley_utility_learning.ipynb

Lines changed: 2 additions & 2 deletions
@@ -295,7 +295,7 @@
     "        mode=\"truncated_montecarlo\",\n",
     "        n_jobs=min(len(dataset), available_cpus()),\n",
     "        progress=False,\n",
-    "        max_iterations=2 ** len(dataset),  # DUL will kick in after training_budget\n",
+    "        n_iterations=2 ** len(dataset),  # DUL will kick in after training_budget\n",
     "    )\n",
     "    .to_dataframe(column=f\"{budget}_{idx}\")\n",
     "    .drop(columns=[f\"{budget}_{idx}_stderr\"])\n",
@@ -567,7 +567,7 @@
     "    u=dul_utility,\n",
     "    mode=\"truncated_montecarlo\",\n",
     "    n_jobs=min(len(corrupted_dataset), available_cpus()),\n",
-    "    max_iterations=2 ** len(corrupted_dataset),\n",
+    "    n_iterations=2 ** len(corrupted_dataset),\n",
     "    progress=False,\n",
     ").to_dataframe(column=\"estimated\")\n",
     "df_corrupted = pd.concat([df_corrupted, dul_df], axis=1)"

requirements.txt

Lines changed: 2 additions & 1 deletion
@@ -6,4 +6,5 @@ joblib
 pymemcache
 cloudpickle
 tqdm
-matplotlib
+matplotlib
+scipy>=1.7.0
