Commit 6681e61

Merge branch 'develop' into dependabot/pip/pillow-10.0.1
2 parents 865f7c6 + 17c97c1 commit 6681e61

68 files changed, 6619 additions and 3924 deletions

.bumpversion.cfg

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
11
[bumpversion]
2-
current_version = 0.7.2.dev0
2+
current_version = 0.8.1.dev0
33
commit = False
44
tag = False
55
allow_dirty = False

CHANGELOG.md

Lines changed: 33 additions & 1 deletion
@@ -1,14 +1,46 @@
11
# Changelog
22

3-
43
## Unreleased
54

5+
### Fixed
6+
7+
- Bug in using `DaskInfluenceCalculator` with `TorchNumpyConverter`
8+
for single dimensional arrays [PR #485](https://github.com/aai-institute/pyDVL/pull/485)
9+
10+
## 0.8.0 - 🆕 New interfaces, scaling computation, bug fixes and improvements 🎁
11+
12+
### Added
13+
14+
- New cache backends: InMemoryCacheBackend and DiskCacheBackend
15+
[PR #458](https://github.com/aai-institute/pyDVL/pull/458)
16+
- New influence function interface `InfluenceFunctionModel`
17+
- Data parallel computation with `DaskInfluenceCalculator`
18+
[PR #26](https://github.com/aai-institute/pyDVL/issues/26)
19+
- Sequential batch-wise computation and write to disk with `SequentialInfluenceCalculator`
20+
[PR #377](https://github.com/aai-institute/pyDVL/issues/377)
21+
- Adapt notebooks to new influence abstractions
22+
[PR #430](https://github.com/aai-institute/pyDVL/issues/430)
23+
624
### Changed
725

26+
- Refactor and simplify caching implementation
27+
[PR #458](https://github.com/aai-institute/pyDVL/pull/458)
28+
- Simplify display of computation progress
29+
[PR #466](https://github.com/aai-institute/pyDVL/pull/466)
830
- Improve readme and explain better the examples
931
[PR #465](https://github.com/aai-institute/pyDVL/pull/465)
1032
- Simplify and improve tests, add CodeCov code coverage
1133
[PR #429](https://github.com/aai-institute/pyDVL/pull/429)
34+
- **Breaking Changes**
35+
- Removed `compute_influences` and all related code.
36+
Replaced by new `InfluenceFunctionModel` interface. Removed modules:
37+
- influence.general
38+
- influence.inversion
39+
- influence.twice_differentiable
40+
- influence.torch.torch_differentiable
41+
42+
### Fixed
43+
- Import bug in README [PR #457](https://github.com/aai-institute/pyDVL/issues/457)
1244

1345
## 0.7.1 - 🆕 New methods, bug fixes and improvements for local tests 🐞🧪
1446

CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
@@ -106,7 +106,7 @@ There are a few important arguments:
106106
To start memcached locally in the background with Docker use:
107107

108108
```shell
109-
run --name pydvl-memcache -p 11211:11211 -d memcached
109+
docker run --name pydvl-memcache -p 11211:11211 -d memcached
110110
```
111111

112112
- `-n` sets the number of parallel workers for

README.md

Lines changed: 42 additions & 50 deletions
@@ -7,27 +7,13 @@
77
</p>
88

99
<p align="center" style="text-align:center;">
10-
<a href="https://pypi.org/project/pydvl/">
11-
<img src="https://img.shields.io/pypi/v/pydvl.svg" alt="PyPI">
12-
</a>
13-
<a href="https://pypi.org/project/pydvl/">
14-
<img src="https://img.shields.io/pypi/pyversions/pydvl.svg" alt="Version">
15-
</a>
16-
<a href="https://pydvl.org">
17-
<img src="https://img.shields.io/badge/docs-All%20versions-009485" alt="documentation">
18-
</a>
19-
<a href="https://raw.githubusercontent.com/aai-institute/pyDVL/master/LICENSE">
20-
<img alt="License" src="https://img.shields.io/pypi/l/pydvl">
21-
</a>
22-
<a href="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml">
23-
<img src="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml/badge.svg" alt="Build status" >
24-
</a>
25-
<a href="https://codecov.io/gh/aai-institute/pyDVL">
26-
<img src="https://codecov.io/gh/aai-institute/pyDVL/graph/badge.svg?token=VN7DNDE0FV"/>
27-
</a>
28-
<a href="https://zenodo.org/badge/latestdoi/354117916">
29-
<img src="https://zenodo.org/badge/354117916.svg" alt="DOI">
30-
</a>
10+
<a href="https://pypi.org/project/pydvl/"><img src="https://img.shields.io/pypi/v/pydvl.svg" alt="PyPI"></a>
11+
<a href="https://pypi.org/project/pydvl/"><img src="https://img.shields.io/pypi/pyversions/pydvl.svg" alt="Version"></a>
12+
<a href="https://pydvl.org"><img src="https://img.shields.io/badge/docs-All%20versions-009485" alt="documentation"></a>
13+
<a href="https://raw.githubusercontent.com/aai-institute/pyDVL/master/LICENSE"><img alt="License" src="https://img.shields.io/pypi/l/pydvl"></a>
14+
<a href="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml"><img src="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml/badge.svg" alt="Build status" ></a>
15+
<a href="https://codecov.io/gh/aai-institute/pyDVL"><img src="https://codecov.io/gh/aai-institute/pyDVL/graph/badge.svg?token=VN7DNDE0FV"/></a>
16+
<a href="https://zenodo.org/badge/latestdoi/354117916"><img src="https://zenodo.org/badge/354117916.svg" alt="DOI"></a>
3117
</p>
3218

3319
**pyDVL** collects algorithms for **Data Valuation** and **Influence Function** computation.
@@ -116,37 +102,34 @@ For influence computation, follow these steps:
116102
import torch
117103
from torch import nn
118104
from torch.utils.data import DataLoader, TensorDataset
119-
from pydvl.influence import compute_influences, InversionMethod
120-
from pydvl.influence.torch import TorchTwiceDifferentiable
105+
106+
from pydvl.influence.torch import DirectInfluence
107+
from pydvl.influence.torch.util import NestedTorchCatAggregator, TorchNumpyConverter
108+
from pydvl.influence import SequentialInfluenceCalculator
121109
```
122110

123111
2. Create PyTorch data loaders for your train and test splits.
124112

125113
```python
126-
torch.manual_seed(16)
127-
128114
input_dim = (5, 5, 5)
129115
output_dim = 3
116+
train_x = torch.rand((10, *input_dim))
117+
train_y = torch.rand((10, output_dim))
118+
test_x = torch.rand((5, *input_dim))
119+
test_y = torch.rand((5, output_dim))
130120

131-
train_data_loader = DataLoader(
132-
TensorDataset(torch.rand((10, *input_dim)), torch.rand((10, output_dim))),
133-
batch_size=2,
134-
)
135-
test_data_loader = DataLoader(
136-
TensorDataset(torch.rand((5, *input_dim)), torch.rand((5, output_dim))),
137-
batch_size=1,
138-
)
121+
train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2)
122+
test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1)
139123
```
140124

141125
3. Instantiate your neural network model.
142126

143127
```python
144128
nn_architecture = nn.Sequential(
145-
nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
146-
nn.Flatten(),
147-
nn.Linear(27, 3),
129+
nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
130+
nn.Flatten(),
131+
nn.Linear(27, 3),
148132
)
149-
nn_architecture.eval()
150133
```
151134

152135
4. Define your loss:
@@ -155,30 +138,38 @@ For influence computation, follow these steps:
155138
loss = nn.MSELoss()
156139
```
157140

158-
5. Wrap your model and loss in a `TorchTwiceDifferentiable` object.
141+
5. Instantiate an `InfluenceFunctionModel` and fit it to the training data
159142

160143
```python
161-
model = TorchTwiceDifferentiable(nn_architecture, loss)
144+
infl_model = DirectInfluence(nn_architecture, loss, hessian_regularization=0.01)
145+
infl_model = infl_model.fit(train_data_loader)
162146
```
163147

164-
6. Compute influence factors by providing training data and inversion method.
165-
Using the conjugate gradient algorithm, this would look like:
148+
6. For small input data, call the `influences` method on the fitted instance.
166149

167150
```python
168-
influences = compute_influences(
169-
model,
170-
training_data=train_data_loader,
171-
test_data=test_data_loader,
172-
inversion_method=InversionMethod.Cg,
173-
hessian_regularization=1e-1,
174-
maxiter=200,
175-
progress=True,
176-
)
151+
influences = infl_model.influences(test_x, test_y, train_x, train_y)
177152
```
178153
The result is a tensor of shape `(training samples x test samples)`
179154
that contains at index `(i, j)` the influence of training sample `i` on
180155
test sample `j`.
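
As a minimal sketch of how such a result could be read, assuming the `(training samples x test samples)` layout stated above: the snippet below uses a small hand-written nested list in place of a real tensor. The values and the helper `most_influential` are hypothetical and not part of pyDVL.

```python
# Hypothetical influence values: rows are training samples, columns are test samples.
influences = [
    [0.10, -0.80],
    [-0.50, 0.20],
    [0.30, 0.05],
]


def most_influential(influences, test_index):
    """Return the index of the training sample whose influence on the given
    test sample has the largest absolute value."""
    column = [row[test_index] for row in influences]
    return max(range(len(column)), key=lambda i: abs(column[i]))


for j in range(len(influences[0])):
    i = most_influential(influences, j)
    print(f"test sample {j}: training sample {i} (influence {influences[i][j]:+.2f})")
```

The same lookup on a real tensor would typically use the framework's own `abs` and `argmax` operations along the training dimension.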
181156

157+
7. For larger data, wrap the model into a
158+
calculator and call methods on the calculator.
159+
```python
160+
infl_calc = SequentialInfluenceCalculator(infl_model)
161+
162+
# Lazy object providing arrays batch-wise in a sequential manner
163+
lazy_influences = infl_calc.influences(test_data_loader, train_data_loader)
164+
165+
# Trigger computation and pull results to memory
166+
influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())
167+
168+
# Trigger computation and write results batch-wise to disk
169+
lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
170+
```
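
The idea of a lazy, batch-wise result can be illustrated with plain Python generators. This is only a sketch of the general pattern, not pyDVL's implementation: `lazy_blocks`, `aggregate` and `toy_block` are hypothetical names, the "blocks" are simple products rather than influence values, and their layout is an assumption of this sketch.

```python
def lazy_blocks(test_batches, train_batches, block_fn):
    """Yield one generator of blocks per test batch.

    No block is computed until the generators are actually iterated.
    """
    for test_batch in test_batches:
        yield (block_fn(test_batch, train_batch) for train_batch in train_batches)


def aggregate(nested_blocks):
    """Trigger the computation and concatenate all blocks into one nested list.

    Blocks of the same test batch are joined column-wise; the results for
    successive test batches are stacked row-wise.
    """
    rows = []
    for blocks in nested_blocks:
        blocks = list(blocks)  # this is where the work for one test batch happens
        for r in range(len(blocks[0])):
            rows.append([value for block in blocks for value in block[r]])
    return rows


def toy_block(test_batch, train_batch):
    """Stand-in for a per-batch computation: pairwise products."""
    return [[t * s for s in train_batch] for t in test_batch]


test_batches = [[1.0], [2.0, 3.0]]
train_batches = [[10.0, 20.0], [30.0]]

lazy = lazy_blocks(test_batches, train_batches, toy_block)  # nothing computed yet
result = aggregate(lazy)
print(result)
```

Writing each block to disk inside the loop, instead of collecting it in `rows`, would keep at most one block in memory at a time.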
171+
172+
182173
The higher the absolute value of the influence of a training sample
183174
on a test sample, the more influential it is for the chosen test sample, model
184175
and data loaders. The sign of the influence determines whether it is
@@ -328,6 +319,7 @@ We currently implement the following papers:
328319
[Scaling Up Influence Functions](http://arxiv.org/abs/2112.03052).
329320
In Proceedings of the AAAI-22. arXiv, 2021.
330321

322+
331323
# License
332324

333325
pyDVL is distributed under

docs/getting-started/first-steps.md

Lines changed: 75 additions & 20 deletions
@@ -9,8 +9,7 @@ alias:
99

1010
!!! Warning
1111
Make sure you have read [[installation]] before using the library.
12-
In particular read about how caching and parallelization work,
13-
since they might require additional setup.
12+
In particular read about which extra dependencies you may need.
1413

1514
## Main concepts
1615

@@ -23,7 +22,6 @@ should be enough to get you started.
2322
computation and related methods.
2423
* [[influence-values]] for instructions on how to compute influence functions.
2524

26-
2725
## Running the examples
2826

2927
If you are somewhat familiar with the concepts of data valuation, you can start
@@ -36,23 +34,22 @@ by browsing our worked-out examples illustrating pyDVL's capabilities either:
3634
have to install jupyter first manually since it's not a dependency of the
3735
library.
3836

39-
# Advanced usage
37+
## Advanced usage
4038

41-
Besides the do's and don'ts of data valuation itself, which are the subject of
39+
Besides the dos and don'ts of data valuation itself, which are the subject of
4240
the examples and the documentation of each method, there are two main things to
4341
keep in mind when using pyDVL.
4442

45-
## Caching
46-
47-
pyDVL uses [memcached](https://memcached.org/) to cache the computation of the
48-
utility function and speed up some computations (see the [installation
49-
guide](installation.md/#setting-up-the-cache)).
43+
### Caching
5044

51-
Caching of the utility function is disabled by default. When it is enabled it
52-
takes into account the data indices passed as argument and the utility function
53-
wrapped into the [Utility][pydvl.utils.utility.Utility] object. This means that
45+
pyDVL can cache (memoize) the computation of the utility function
46+
and speed up some computations for data valuation.
47+
It is however disabled by default.
48+
When it is enabled it takes into account the data indices passed as argument
49+
and the utility function wrapped into the
50+
[Utility][pydvl.utils.utility.Utility] object. This means that
5451
care must be taken when reusing the same utility function with different data,
55-
see the documentation for the [caching module][pydvl.utils.caching] for more
52+
see the documentation for the [caching package][pydvl.utils.caching] for more
5653
information.
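
To see why reuse with different data needs care, consider a toy memoization scheme whose key contains only an identifier for the utility function and the data indices. This is a hedged illustration of the pitfall, not pyDVL's caching code; `make_cached_utility`, `cache` and `stats` are hypothetical names.

```python
cache = {}
stats = {"computed": 0}


def make_cached_utility(name, data):
    """Wrap a toy utility (sum of the selected points) with a memo whose key
    is (function name, indices). The data itself is NOT part of the key."""

    def utility(indices):
        key = (name, frozenset(indices))
        if key not in cache:
            stats["computed"] += 1
            cache[key] = sum(data[i] for i in indices)
        return cache[key]

    return utility


# Same utility name, different data sets.
u_a = make_cached_utility("sum", [1.0, 2.0, 3.0])
u_b = make_cached_utility("sum", [10.0, 20.0, 30.0])

first = u_a([0, 1])   # computed from the first data set
second = u_b([0, 1])  # cache hit: returns the value computed for the first data set
print(first, second, stats["computed"])
```

In this sketch the second call silently returns a stale value, which is the kind of situation the warning above is about. Clearing the cache or using distinct keys per data set avoids it.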
5754

5855
In general, caching won't play a major role in the computation of Shapley values
@@ -61,24 +58,82 @@ the same utility function computation, is very low. However, it can be very
6158
useful when comparing methods that use the same utility function, or when
6259
running multiple experiments with the same data.
6360

61+
pyDVL supports 3 different caching backends:
62+
63+
- [InMemoryCacheBackend][pydvl.utils.caching.memory.InMemoryCacheBackend]:
64+
an in-memory cache backend that uses a dictionary to store and retrieve
65+
cached values. This is used to share cached values between threads
66+
in a single process.
67+
- [DiskCacheBackend][pydvl.utils.caching.disk.DiskCacheBackend]:
68+
a disk-based cache backend that uses pickled values written to and read from disk.
69+
This is used to share cached values between processes in a single machine.
70+
- [MemcachedCacheBackend][pydvl.utils.caching.memcached.MemcachedCacheBackend]:
71+
a [Memcached](https://memcached.org/)-based cache backend that uses pickled values written to
72+
and read from a Memcached server. This is used to share cached values
73+
between processes across multiple machines.
74+
75+
**Note** This specific backend requires optional dependencies.
76+
See [[installation#extras]] for more information.
77+
6478
!!! tip "When is the cache really necessary?"
6579
Crucially, semi-value computations with the
6680
[PermutationSampler][pydvl.value.sampler.PermutationSampler] require caching
6781
to be enabled, or they will take twice as long as the direct implementation
6882
in [compute_shapley_values][pydvl.value.shapley.compute_shapley_values].
6983

70-
## Parallelization
84+
!!! tip "Using the cache"
85+
Continue reading about the cache in the documentation
86+
for the [caching package][pydvl.utils.caching].
87+
88+
#### Setting up the Memcached cache
89+
90+
[Memcached](https://memcached.org/) is an in-memory key-value store accessible
91+
over the network. pyDVL can use it to cache the computation of the utility function
92+
and speed up some computations (in particular, semi-value computations with the
93+
[PermutationSampler][pydvl.value.sampler.PermutationSampler] but other methods
94+
may benefit as well).
95+
96+
You can either install it as a package or run it inside a docker container (the
97+
simplest). For installation instructions, refer to the [Getting
98+
started](https://github.com/memcached/memcached/wiki#getting-started) section in
99+
memcached's wiki. Then you can run it with:
71100

72-
pyDVL supports [joblib](https://joblib.readthedocs.io/en/latest/) for local
73-
parallelization (within one machine) and [ray](https://ray.io) for distributed
74-
parallelization (across multiple machines).
101+
```shell
102+
memcached -u user
103+
```
75104

76-
The former works out of the box but for the latter you will need to provide a
77-
running cluster (or run ray in local mode).
105+
To run memcached inside a container in daemon mode instead, use:
106+
107+
```shell
108+
docker container run -d --rm -p 11211:11211 memcached:latest
109+
```
110+
111+
### Parallelization
112+
113+
pyDVL uses [joblib](https://joblib.readthedocs.io/en/latest/) for local
114+
parallelization (within one machine) and supports using
115+
[Ray](https://ray.io) for distributed parallelization (across multiple machines).
116+
117+
The former works out of the box but for the latter you will need to install
118+
additional dependencies (see [[installation#extras]])
119+
and to provide a running cluster (or run ray in local mode).
78120

79121
As of v0.7.0 pyDVL does not allow requesting resources per task sent to the
80122
cluster, so you will need to make sure that each worker has enough resources to
81123
handle the tasks it receives. A data valuation task using game-theoretic methods
82124
will typically make a copy of the whole model and dataset to each worker, even
83125
if the re-training only happens on a subset of the data. This means that you
84126
should make sure that each worker has enough memory to handle the whole dataset.
127+
128+
#### Ray
129+
130+
Please follow the instructions in Ray's documentation to set up a cluster.
131+
Once you have a running cluster, you can use it by passing the address
132+
of the head node to parallel methods via [ParallelConfig][pydvl.parallel.config.ParallelConfig].
133+
134+
For a local ray cluster you would use:
135+
136+
```python
137+
from pydvl.parallel.config import ParallelConfig
138+
config = ParallelConfig(backend="ray")
139+
```
