Commit d2d95fd

Merge pull request #194 from appliedAI-Initiative/feature/semivalues
Owen sampling, documentation, cleanup and some refactoring
2 parents cc65698 + e725601 commit d2d95fd

44 files changed, +1712 -926 lines changed

CHANGELOG.md

Lines changed: 7 additions & 1 deletion
@@ -1,6 +1,6 @@
 # Changelog

-## Unreleased
+## 0.3.0 - 💥 Breaking changes

 - Simplified and fixed powerset sampling and testing
 [PR #181](https://github.com/appliedAI-Initiative/pyDVL/pull/181)
@@ -12,6 +12,12 @@
 [PR #185](https://github.com/appliedAI-Initiative/pyDVL/pull/185)
 - Modified Pull Request template to automatically link PR to issue
 [PR ##186](https://github.com/appliedAI-Initiative/pyDVL/pull/186)
+- First implementation of Owen Sampling, squashed scores, better testing
+[PR #194](https://github.com/appliedAI-Initiative/pyDVL/pull/194)
+- Improved documentation on caching, Shapley, caveats of values, bibtex
+[PR #194](https://github.com/appliedAI-Initiative/pyDVL/pull/194)
+- **Breaking change:** Rearranging of modules to accommodate for new methods
+[PR #194](https://github.com/appliedAI-Initiative/pyDVL/pull/194)


 ## 0.2.0 - 📚 Better docs

README.md

Lines changed: 23 additions & 17 deletions
@@ -32,24 +32,28 @@ Data Valuation is the task of estimating the intrinsic value of a data point
 wrt. the training set, the model and a scoring function. We currently implement
 methods from the following papers:

-- Ghorbani, Amirata, and James Zou. ‘Data Shapley: Equitable Valuation of Data for
-Machine Learning’. In International Conference on Machine Learning, 2242–51.
-PMLR, 2019. http://proceedings.mlr.press/v97/ghorbani19c.html.
-- Wang, Tianhao, Yu Yang, and Ruoxi Jia. ‘Improving Cooperative Game Theory-Based
-Data Valuation via Data Utility Learning’. arXiv, 2022.
-https://doi.org/10.48550/arXiv.2107.06336.
+- Ghorbani, Amirata, and James Zou.
+[Data Shapley: Equitable Valuation of Data for Machine Learning](http://proceedings.mlr.press/v97/ghorbani19c.html).
+In International Conference on Machine Learning, 2242–51. PMLR, 2019.
+- Wang, Tianhao, Yu Yang, and Ruoxi Jia.
+[Improving Cooperative Game Theory-Based Data Valuation via Data Utility Learning](https://doi.org/10.48550/arXiv.2107.06336).
+arXiv, 2022.
 - Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li,
-Ce Zhang, Costas Spanos, and Dawn Song. ‘Efficient Task-Specific Data Valuation
-for Nearest Neighbor Algorithms’. Proceedings of the VLDB Endowment 12, no. 11 (1
-July 2019): 1610–23. https://doi.org/10.14778/3342263.3342637.
+Ce Zhang, Costas Spanos, and Dawn Song.
+[Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms](https://doi.org/10.14778/3342263.3342637).
+Proceedings of the VLDB Endowment 12, no. 11 (1 July 2019): 1610–23.
+- Okhrati, Ramin, and Aldo Lipani.
+[A Multilinear Sampling Algorithm to Estimate Shapley Values](https://doi.org/10.1109/ICPR48806.2021.9412511).
+In 2020 25th International Conference on Pattern Recognition (ICPR), 7992–99.
+IEEE, 2021.

 Influence Functions compute the effect that single points have on an estimator /
 model. We implement methods from the following papers:

-- Koh, Pang Wei, and Percy Liang. ‘Understanding Black-Box Predictions via
-Influence Functions’. In Proceedings of the 34th International Conference on
-Machine Learning, 70:1885–94. Sydney, Australia: PMLR, 2017.
-http://proceedings.mlr.press/v70/koh17a.html.
+- Koh, Pang Wei, and Percy Liang.
+[Understanding Black-Box Predictions via Influence Functions](http://proceedings.mlr.press/v70/koh17a.html).
+In Proceedings of the 34th International Conference on Machine Learning,
+70:1885–94. Sydney, Australia: PMLR, 2017.

 # Installation

@@ -98,18 +102,20 @@ Data Shapley values:
 ```python
 import numpy as np
 from pydvl.utils import Dataset, Utility
-from pydvl.shapley import compute_shapley_values
+from pydvl.value.shapley import compute_shapley_values
 from sklearn.linear_model import LinearRegression
 from sklearn.model_selection import train_test_split

 X, y = np.arange(100).reshape((50, 2)), np.arange(50)
 X_train, X_test, y_train, y_test = train_test_split(
-X, y, test_size=0.5, random_state=16
-)
+X, y, test_size=0.5, random_state=16
+)
 dataset = Dataset(X_train, y_train, X_test, y_test)
 model = LinearRegression()
 utility = Utility(model, dataset)
-values, errors = compute_shapley_values(u=utility, max_iterations=100)
+values = compute_shapley_values(
+u=utility, max_iterations=100, mode="truncated_montecarlo"
+)
 ```

 For more instructions and information refer to [Getting
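
Note for downstream code: the README hunk moves `compute_shapley_values` from `pydvl.shapley` to `pydvl.value.shapley`, which matches the module rearrangement that the changelog lists as a breaking change in 0.3.0. Below is a minimal sketch of an import shim for code that has to run against both layouts. The helper `import_first` and its `"module:attribute"` string format are hypothetical illustrations and not part of pyDVL. Only the two module paths are taken from the diff.

```python
import importlib
from typing import Any, Sequence


def import_first(candidates: Sequence[str]) -> Any:
    """Return the first importable object among "module:attribute" candidates.

    Hypothetical helper, not part of pyDVL: it tries each candidate in order
    and falls through on ImportError or AttributeError.
    """
    errors = []
    for candidate in candidates:
        module_name, _, attribute = candidate.partition(":")
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attribute)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{candidate}: {exc}")
    raise ImportError("no candidate could be imported:\n" + "\n".join(errors))


# Module paths as they appear in the README diff: new layout first, old second.
PYDVL_SHAPLEY_CANDIDATES = (
    "pydvl.value.shapley:compute_shapley_values",
    "pydvl.shapley:compute_shapley_values",
)


def load_compute_shapley_values() -> Any:
    """Resolve compute_shapley_values under either module layout."""
    return import_first(PYDVL_SHAPLEY_CANDIDATES)
```

Resolving the import is only part of a migration. Judging from the same hunk, the README call site also changes: the old example unpacks `values, errors`, while the new one assigns a single `values` and passes `mode="truncated_montecarlo"`, so callers would need to branch on that as well.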

docs/10-getting-started.rst

Lines changed: 15 additions & 15 deletions
@@ -4,13 +4,22 @@
 Getting started
 ===============

-Make sure you have :ref:`installed pyDVL <pyDVL Installation>` before proceeding
-further.
+.. warning::
+Make sure you have read :ref:`the installation instructions
+<pyDVL Installation>` before using the library. In particular read about how
+caching and parallelization work, since they require additional setup.

-.. note::
-We provide minimal overviews of key concepts in :ref:`data valuation` and
-:ref:`influence`. For an in-depth survey of the field, we refer to the review on
-the topic at the :tfl:`TransferLab website <>`.
+pyDVL aims to be a repository of production-ready, reference implementations of
+algorithms for data valuation and influence functions. You can read:
+
+* :ref:`data valuation` for key objects and usage patterns for Shapley value
+computation and related methods.
+* :ref:`influence` for instruction on how to compute influence functions (still
+in a pre-alpha state)
+
+We only briefly introduce key concepts in the documentation. For a thorough
+introduction and survey of the field, we refer to **the upcoming review** at the
+:tfl:`TransferLab website <>`.

 Running the examples
 ====================
@@ -24,12 +33,3 @@ by browsing our worked-out examples illustrating pyDVL's capabilities either:
 - Locally, by starting a jupyter server at the root of the project. You will
 have to install jupyter first manually since it's not a dependency of the
 library.
-
-Methods covered
-===============
-
-pyDVL offers algorithms for data valuation and computation of influence
-functions. You can read more about each family of methods here:
-
-- :ref:`data valuation`.
-- :ref:`influence`.

docs/20-install.rst

Lines changed: 1 addition & 1 deletion
@@ -45,7 +45,7 @@ the instructions in their documentation for installation.
 .. _caching setup:

 Setting up the cache
---------------------
+====================

 memcached is an in-memory key-value store accessible over the network. pyDVL
 uses it to cache certain results and speed-up the computations. You can either
