Commit 452790c

Merge pull request #673 from aai-institute/feature/tensor-support
Feature/tensor support
2 parents 71a60e3 + 330b2d0 commit 452790c

61 files changed

+5062
-1258
lines changed

.gitignore

Lines changed: 9 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -110,7 +110,7 @@ celerybeat.pid
110110
.venv
111111
env/
112112
venv/
113-
venv38/
113+
venv39/
114114
ENV/
115115
env.bak/
116116
venv.bak/
@@ -148,3 +148,11 @@ docs_build
148148

149149
# pytest-profiling
150150
prof/
151+
152+
# JS tooling
153+
node_modules/
154+
package.json
155+
package-lock.json
156+
157+
#
158+
.serena

CHANGELOG.md

Lines changed: 21 additions & 0 deletions
@@ -1,5 +1,26 @@
11
# Changelog
22

3+
## Unreleased
4+
5+
### Added
6+
7+
- Support for `torch.Tensor` as underlying data type in `Dataset` and
8+
`GroupedDataset`
9+
[PR #673](https://github.com/aai-institute/pyDVL/pull/673)
10+
- Support for PyTorch models in most valuation methods when wrapped in
11+
classes implementing the protocol `TorchSupervisedModel`, e.g. by using
12+
[skorch.NeuralNetClassifier](https://skorch.readthedocs.io/en/stable/classifier.html)
13+
models
14+
[PR #673](https://github.com/aai-institute/pyDVL/pull/673)
15+
16+
### Fixed
17+
18+
- Issues with `Dataset` indexing
19+
[PR #673](https://github.com/aai-institute/pyDVL/pull/673)
20+
21+
### Changed
22+
23+
324
## v0.10.0 - 💥📚🐞🆕 New valuation interface, improved docs, new methods, breaking changes and tons of improvements
425

526

CLAUDE.local.md

Whitespace-only changes.

docs/getting-started/advanced-usage.md

Lines changed: 20 additions & 0 deletions
@@ -72,6 +72,26 @@ anything up.
7272
to each worker, but in general you should make sure that each worker has
7373
enough memory to handle the whole dataset.
7474

75+
### Working with large datasets { #large-datasets-parallelization }
76+
77+
When running in parallel, the utility is copied to each worker. This implies
78+
copying the dataset as well, which can obviously be very expensive. In order to
79+
alleviate the problem, one can memmap the data from disk by setting `mmap=True`
80+
when creating the [Dataset][pydvl.valuation.dataset.Dataset] objects. In case
81+
you create the `Dataset` with previously memory-mapped arrays, you must ensure
82+
that the shapes conform to the requirements, since internal checks are disabled
83+
to avoid additional copying. This amounts to calling
84+
[check_X_y()][pydvl.utils.array.check_X_y] on the arrays beforehand.
85+
86+
If you are working with torch tensors as underlying raw data, you can try
87+
activating shared memory for them using `tensor.share_memory_()`, but whether
88+
this yields a benefit or not will depend on the precise situation.
89+
90+
If you are working on a cluster, the data will be copied to each worker node. In
91+
this case, subclassing of `Dataset` to leverage your particular distributed
92+
storage solution will be necessary. Feel free to open an issue if you need help
93+
with this.
94+
7595

7696
### Influence functions { #influence-parallelization }
7797

docs/getting-started/first-steps.md

Lines changed: 19 additions & 3 deletions
@@ -14,9 +14,8 @@ alias:
1414
## Main concepts
1515

1616
pyDVL aims to be a repository of production-ready, reference implementations of
17-
algorithms for data valuation and influence functions. Even though we only
18-
briefly introduce key concepts in the documentation, the following sections
19-
should be enough to get you started.
17+
algorithms for data valuation and influence functions. Read the following
18+
sections to get started:
2019

2120
<div class="grid cards" markdown>
2221

@@ -36,6 +35,23 @@ should be enough to get you started.
3635

3736
</div>
3837

38+
## Supported frameworks
39+
40+
* The module for influence functions is built around PyTorch. Because of our use
41+
of the `torch.func` stateless API, we do not support jitted modules yet (see
42+
[#640](https://github.com/aai-institute/pyDVL/issues/640)).
43+
44+
* Up until v0.10.0, pyDVL only supported NumPy arrays for data valuation. From
45+
version 0.10.1 onwards, the library also supports PyTorch tensors for most
46+
valuation methods. The implementation attempts to preserve the input data type
47+
for the [Dataset][pydvl.valuation.dataset.Dataset] throughout computations where
48+
possible.
49+
50+
Note that some features have specific requirements or limitations when using
51+
tensors. For details on tensor support and caveats, see the [[tensor-support]]
52+
section.
53+
54+
3955
## Running the examples
4056

4157
If you are somewhat familiar with the concepts of data valuation, you can start

docs/value/data-oob.md

Lines changed: 18 additions & 0 deletions
@@ -59,6 +59,24 @@ makes the list of bootstrapped samples available in some way. This includes
5959
`BaggingRegressor`, `BaggingClassifier`, `ExtraTreesClassifier`,
6060
`ExtraTreesRegressor` and `IsolationForest`.
6161

62+
!!! info "PyTorch support"
63+
With the introduction of version 0.10.1, Data-OOB supports PyTorch tensor
64+
inputs with certain limitations. Standard scikit-learn bagging models (like
65+
[BaggingClassifier][] or [RandomForest][]) require NumPy inputs for training,
66+
even though the dataset used for valuation can contain tensors. For full
67+
tensor support throughout the pipeline, you must implement a custom bagging
68+
model class that implements the [BaggingModel][pydvl.valuation.types.BaggingModel]
69+
interface with support for tensor operations. This custom model must provide
70+
the following attributes:
71+
72+
- `estimators_`: list of fitted base estimators
73+
- `estimators_samples_`: list of sample indices used to train each estimator
74+
(as NumPy arrays)
75+
76+
(There is a mock in `tests.valuation.methods.conftest.TorchBaggingClassifier`.)
77+
See [Tensor Support][tensor-support] for more general information about tensor
78+
support in pyDVL.
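
Reviewer note on the two attributes listed in this admonition: the sketch below is a toy bagging model written with NumPy only. `MajorityClassifier` and `ToyBagging` are hypothetical names, not part of pyDVL or of this commit, and the actual `BaggingModel` protocol may require members beyond the two attributes named in the text, so its definition should be checked before relying on this. The point is only to show what `estimators_` and `estimators_samples_` contain and how out-of-bag indices follow from them.

```python
import numpy as np


class MajorityClassifier:
    """Toy base estimator: always predicts the majority class seen in fit()."""

    def fit(self, x, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, x):
        return np.full(len(x), self.majority_)


class ToyBagging:
    """Hypothetical bagging model exposing estimators_ and estimators_samples_."""

    def __init__(self, n_estimators=5, seed=0):
        self.n_estimators = n_estimators
        self.seed = seed

    def fit(self, x, y):
        rng = np.random.default_rng(self.seed)
        n = len(x)
        self.estimators_ = []
        self.estimators_samples_ = []
        for _ in range(self.n_estimators):
            idx = rng.integers(0, n, size=n)  # bootstrap sample with replacement
            self.estimators_.append(MajorityClassifier().fit(x[idx], y[idx]))
            self.estimators_samples_.append(idx)  # indices kept as NumPy arrays
        return self

    def predict(self, x):
        votes = np.stack([est.predict(x) for est in self.estimators_])
        # Majority vote over estimators for each sample (non-negative int labels).
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)


x = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
model = ToyBagging(n_estimators=4).fit(x, y)

# Out-of-bag indices per estimator: the points it did not see during training.
oob_indices = [
    np.setdiff1d(np.arange(len(x)), samples)
    for samples in model.estimators_samples_
]
```

A tensor-based variant would index tensors with these arrays inside `fit` and `predict` while still storing the sample indices as NumPy arrays, as the admonition requires.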
79+
6280
## Bagging arbitrary models
6381

6482
Through `BaggingClassifier` and `BaggingRegressor`, one can compute values

docs/value/index.md

Lines changed: 56 additions & 0 deletions
@@ -119,6 +119,60 @@ necessary:
119119
computation e.g. when the change in estimates is low, or the number of
120120
iterations or time elapsed exceed some threshold.
121121

122+
### Tensor Support { #tensor-support }
123+
124+
Starting from version 0.10.1, pyDVL supports both NumPy arrays and PyTorch
125+
tensors for data valuation. The implementation follows these key principles:
126+
127+
1. **Type Preservation**: The valuation methods maintain the input data type
128+
throughout computations, whether you provide NumPy arrays or PyTorch tensors
129+
when constructing the [Dataset][pydvl.valuation.dataset.Dataset].
130+
2. **Transparent Usage**: The API remains the same regardless of the input type;
131+
simply provide your data as tensors. The main difference is that the torch
132+
model must be wrapped in a class compatible with the protocol
133+
[TorchSupervisedModel][pydvl.valuation.types.TorchSupervisedModel].
134+
!!! tip "Wrapping torch models"
135+
There is an example implementation of
136+
[TorchSupervisedModel][pydvl.valuation.types.TorchSupervisedModel]
137+
in `notebooks/support/banzhaf.py`. But you should consider using
138+
[skorch](https://github.com/skorch-dev/skorch) models instead, which
139+
are entirely compatible with pyDVL.
140+
3. **Consistent Indexing**: Internally, indices are always managed as NumPy
141+
arrays for consistency and compatibility, but the actual data operations
142+
preserve tensor types when provided. In particular, samplers always return
143+
NumPy arrays, and the [Dataset][pydvl.valuation.dataset.Dataset] class
144+
uses NumPy arrays for indexing.
145+
4. [ValuationResult][pydvl.valuation.result.ValuationResult] objects always
146+
contain NumPy arrays.
147+
148+
??? example "Creating and using a tensor dataset"
149+
```python
150+
import torch
151+
from pydvl.valuation.dataset import Dataset
152+
from sklearn.datasets import make_classification
153+
from skorch import NeuralNetClassifier
154+
155+
X, y = make_classification(n_samples=100, n_features=20, n_classes=3)
156+
X_tensor = torch.tensor(X, dtype=torch.float32)
157+
y_tensor = torch.tensor(y, dtype=torch.long)
158+
159+
train, test = Dataset.from_arrays(X_tensor, y_tensor, stratify_by_target=True)
160+
model = NeuralNetClassifier(SomeNNModule(),
161+
max_epochs=10,
162+
criterion=torch.nn.CrossEntropyLoss,
163+
optimizer=torch.optim.Adam)
164+
scorer = SupervisedScorer(model, test, default=0.0, range=(0, 1))
165+
utility = ModelUtility(model, scorer)
166+
valuation = TMCShapleyValuation(utility)
167+
```
168+
169+
!!! warning "Library-specific requirements"
170+
Some methods that rely on specific libraries may have type requirements:
171+
172+
- Methods that use scikit-learn models directly will convert tensors to
173+
NumPy arrays internally.
174+
- The [KNNShapleyValuation][pydvl.valuation.methods.knn_shapley.KNNShapleyValuation]
175+
method requires NumPy arrays.
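
Reviewer note on the "Transparent Usage" and "Consistent Indexing" principles in this hunk: the sketch below uses only PyTorch and NumPy. `TinyTorchClassifier` is a hypothetical class, and the assumption that `TorchSupervisedModel` expects scikit-learn style `fit`, `predict` and `score` methods operating on tensors is mine, not something this diff states. It illustrates a wrapper of that shape, and shows that selecting rows with NumPy indices leaves the data as tensors.

```python
import numpy as np
import torch

torch.manual_seed(0)


class TinyTorchClassifier:
    """Hypothetical wrapper with fit/predict/score operating on tensors."""

    def __init__(self, n_features, n_classes, epochs=50, lr=0.1):
        self.module = torch.nn.Linear(n_features, n_classes)
        self.epochs = epochs
        self.lr = lr

    def fit(self, x, y):
        optimizer = torch.optim.SGD(self.module.parameters(), lr=self.lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        for _ in range(self.epochs):
            optimizer.zero_grad()
            loss = loss_fn(self.module(x), y)
            loss.backward()
            optimizer.step()
        return self

    def predict(self, x):
        with torch.no_grad():
            return self.module(x).argmax(dim=1)

    def score(self, x, y):
        # Accuracy as a plain Python float.
        return float((self.predict(x) == y).float().mean())


# Made-up data: the label depends on the sign of the first feature.
X = torch.randn(100, 4)
y = (X[:, 0] > 0).long()

# Indices are NumPy arrays; the selected data remains a tensor.
idx = np.array([0, 5, 7])
x_sub = X[torch.from_numpy(idx)]
y_sub = y[torch.from_numpy(idx)]

model = TinyTorchClassifier(n_features=4, n_classes=2).fit(X, y)
accuracy = model.score(X, y)
```

As the tip in this hunk says, skorch models are the recommended route; a hand-written wrapper like this one is mainly useful for understanding what the interface has to provide.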
122176

123177
### Creating a Dataset
124178

@@ -217,6 +271,8 @@ constructor accepts the same types of arguments as those of
217271
[None][] for the default.
218272

219273
```python
274+
import numpy as np
275+
from pydvl.valuation.scorers import SupervisedScorer
220276
scorer = SupervisedScorer("explained_variance", default=0.0, range=(-np.inf, 1))
221277
```
222278

docs/value/shapley.md

Lines changed: 6 additions & 0 deletions
@@ -17,6 +17,12 @@ Shapley values. Empirically, one of the most useful methods is the so-called
1717
[Truncated Monte Carlo Shapley][tmcs-intro] [@ghorbani_data_2019], but several
1818
approximations exist with different convergence rates and computational costs.
1919

20+
??? info "Support for torch models"
21+
Starting from version 0.10.1, all Shapley value methods support both NumPy
22+
arrays and PyTorch tensors as input data types. The implementation preserves
23+
the input type throughout the computation, allowing integration with PyTorch
24+
models. See [Tensor Support][tensor-support] for more details.
25+
2026

2127
## Combinatorial Shapley { #combinatorial-shapley-intro }
2228

docs/value/the-core.md

Lines changed: 13 additions & 0 deletions
@@ -62,6 +62,19 @@ obtain a single valuation to use, one breaks ties by solving a quadratic program
6262
to select the $v$ in the LC with the smallest $\ell_2$ norm. This is called the
6363
_egalitarian least core_.
6464

65+
!!! info "PyTorch support"
66+
As of version 0.10.1, both
67+
[ExactLeastCoreValuation][pydvl.valuation.methods.least_core.ExactLeastCoreValuation]
68+
and
69+
[MonteCarloLeastCoreValuation][pydvl.valuation.methods.least_core.MonteCarloLeastCoreValuation]
70+
support PyTorch tensor inputs. Tensor data is used throughout the coalition
71+
evaluation process to compute utility values. These utility values are then
72+
assembled into NumPy arrays for the constraint matrices used by the linear
73+
programming solver (CVXPY), which operates on CPU. See [Tensor
74+
Support][tensor-support] for more general information about tensor support
75+
in pyDVL.
76+
77+
6578
## Exact Least Core
6679

6780
This first algorithm is just a verbatim implementation of the definition above.

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -163,6 +163,7 @@ plugins:
163163
- https://pandas.pydata.org/docs/objects.inv
164164
- https://scikit-learn.org/stable/objects.inv
165165
- https://pytorch.org/docs/stable/objects.inv
166+
- https://skorch.readthedocs.io/en/latest/objects.inv
166167
- https://pymemcache.readthedocs.io/en/latest/objects.inv
167168
- https://joblib.readthedocs.io/en/stable/objects.inv
168169
- https://loky.readthedocs.io/en/stable/objects.inv
