aai-institute
diff --git a/.gitignore b/.gitignore
Lines changed: 9 additions & 1 deletion
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 21 additions & 0 deletions
diff --git a/CLAUDE.local.md b/CLAUDE.local.md
diff --git a/docs/getting-started/advanced-usage.md b/docs/getting-started/advanced-usage.md
Lines changed: 20 additions & 0 deletions
diff --git a/docs/getting-started/first-steps.md b/docs/getting-started/first-steps.md
Lines changed: 19 additions & 3 deletions
diff --git a/docs/value/data-oob.md b/docs/value/data-oob.md
Lines changed: 18 additions & 0 deletions
diff --git a/docs/value/index.md b/docs/value/index.md
Lines changed: 56 additions & 0 deletions
diff --git a/docs/value/shapley.md b/docs/value/shapley.md
Lines changed: 6 additions & 0 deletions
diff --git a/docs/value/the-core.md b/docs/value/the-core.md
Lines changed: 13 additions & 0 deletions
diff --git a/mkdocs.yml b/mkdocs.yml
Lines changed: 1 addition & 0 deletions
@@ -110,7 +110,7 @@ celerybeat.pid
 .venv
 env/
 venv/
-venv38/
+venv39/
 ENV/
 env.bak/
 venv.bak/
@@ -148,3 +148,11 @@ docs_build
 
 # pytest-profiling
 prof/
+
+# JS tooling
+node_modules/
+package.json
+package-lock.json
+
+#
+.serena
@@ -1,5 +1,26 @@
 # Changelog
 
+## Unreleased
+
+### Added
+
+- Support for `torch.Tensor` as underlying data type in `Dataset` and
+  `GroupedDataset`
+  [PR #673](https://github.com/aai-institute/pyDVL/pull/673)
+- Support for PyTorch models in most valuation methods when wrapped in
+  classes implementing the protocol `TorchSupervisedModel`, e.g. by using
+  [skorch.NeuralNetClassifier](https://skorch.readthedocs.io/en/stable/classifier.html)
+  models
+  [PR #673](https://github.com/aai-institute/pyDVL/pull/673)
+
+### Fixed
+
+- Issues with `Dataset` indexing
+  [PR #673](https://github.com/aai-institute/pyDVL/pull/673)
+
+### Changed
+
+
 ## v0.10.0 - 💥📚🐞🆕 New valuation interface, improved docs, new methods, breaking changes and tons of improvements
 
 
 
@@ -72,6 +72,26 @@ anything up.
     to each worker, but in general you should make sure that each worker has
     enough memory to handle the whole dataset.
 
+### Working with large datasets { #large-datasets-parallelization }
+
+When running in parallel, the utility is copied to each worker. This implies
+copying the dataset as well, which can be very expensive. To alleviate the
+problem, one can memory-map the data from disk by setting `mmap=True` when
+creating the [Dataset][pydvl.valuation.dataset.Dataset] objects. If you create
+the `Dataset` from previously memory-mapped arrays, you must ensure that the
+shapes conform to the requirements, since internal checks are disabled to avoid
+additional copying. This amounts to calling
+[check_X_y()][pydvl.utils.array.check_X_y] on the arrays beforehand.
+
+If you are working with torch tensors as underlying raw data, you can try
+activating shared memory for them using `tensor.share_memory_()`. Whether this
+yields a benefit depends on the precise situation.
+
+If you are working on a cluster, the data will be copied to each worker node. In
+this case, you will need to subclass `Dataset` to leverage your particular
+distributed storage solution. Feel free to open an issue if you need help with
+this.
+
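A minimal sketch of the memory-mapping workflow described in the section above, using only NumPy. The file names, the toy data and the helper `shapes_conform` are illustrative assumptions. The helper merely stands in for the kind of shape validation the text attributes to `check_X_y()`, whose exact rules are not shown in this diff.

```python
import os
import tempfile

import numpy as np

# Toy data standing in for a large dataset (assumption for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# Persist the arrays once, so that workers can map them instead of copying.
tmpdir = tempfile.mkdtemp()
x_path = os.path.join(tmpdir, "x.npy")
y_path = os.path.join(tmpdir, "y.npy")
np.save(x_path, x)
np.save(y_path, y)

# Re-open the files as read-only memory maps: pages are loaded lazily.
x_mm = np.load(x_path, mmap_mode="r")
y_mm = np.load(y_path, mmap_mode="r")


def shapes_conform(features: np.ndarray, targets: np.ndarray) -> bool:
    """Hypothetical pre-check: 2D features and one target row per sample."""
    return (
        features.ndim == 2
        and targets.ndim in (1, 2)
        and features.shape[0] == targets.shape[0]
    )


# Validate before handing the memory-mapped arrays to a Dataset,
# since the text states that internal checks are skipped in that case.
ok = shapes_conform(x_mm, y_mm)
```

Validating up front keeps the check cheap: only the array metadata is inspected, so no data is read from disk.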
 
 ### Influence functions { #influence-parallelization }
 
 
@@ -14,9 +14,8 @@ alias:
 ## Main concepts
 
 pyDVL aims to be a repository of production-ready, reference implementations of
-algorithms for data valuation and influence functions. Even though we only
-briefly introduce key concepts in the documentation, the following sections 
-should be enough to get you started.
+algorithms for data valuation and influence functions. Read the following
+sections to get started:
 
 <div class="grid cards" markdown>
 
@@ -36,6 +35,23 @@ should be enough to get you started.
 
 </div>
 
+## Supported frameworks
+
+* The module for influence functions is built around PyTorch. Because of our use
+  of the `torch.func` stateless API, we do not support jitted modules yet (see
+  [#640](https://github.com/aai-institute/pyDVL/issues/640)).
+
+* Up until v0.10.0, pyDVL only supported NumPy arrays for data valuation. From
+  version 0.10.1 onwards, the library also supports PyTorch tensors for most
+  valuation methods. The implementation attempts to preserve the input data type
+  for the [Dataset][pydvl.valuation.dataset.Dataset] throughout computations where
+  possible.
+
+  Note that some features have specific requirements or limitations when using
+  tensors. For details on tensor support and caveats, see the
+  [Tensor Support][tensor-support] section.
+
+
 ## Running the examples
 
 If you are somewhat familiar with the concepts of data valuation, you can start
 
@@ -59,6 +59,24 @@ makes the list of bootstrapped samples available in some way. This includes
 `BaggingRegressor`, `BaggingClassifier`, `ExtraTreesClassifier`,
 `ExtraTreesRegressor` and `IsolationForest`.
 
+!!! info "PyTorch support"
+    As of version 0.10.1, Data-OOB supports PyTorch tensor
+    inputs with certain limitations. Standard scikit-learn bagging models (like
+    [BaggingClassifier][] or [RandomForest][]) require NumPy inputs for training,
+    even though the dataset used for valuation can contain tensors. For full
+    tensor support throughout the pipeline, you must implement a custom bagging
+    model class that implements the [BaggingModel][pydvl.valuation.types.BaggingModel]
+    interface with support for tensor operations. This custom model must provide
+    the following attributes:
+    
+      - `estimators_`: list of fitted base estimators
+      - `estimators_samples_`: list of sample indices used to train each estimator
+        (as NumPy arrays)
+    
+    A mock implementation is available in
+    `tests.valuation.methods.conftest.TorchBaggingClassifier`.
+    See [Tensor Support][tensor-support] for more general information about tensor
+    support in pyDVL.
+
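To illustrate the two attributes listed in the note above, here is a minimal NumPy-only sketch of a bagging model that records its bootstrap samples. The class names, the trivial base estimator and the helper `out_of_bag_indices` are hypothetical. The full `BaggingModel` interface is not shown in this diff, so only `estimators_` and `estimators_samples_` are taken from the text.

```python
import numpy as np


class MajorityClassEstimator:
    """Hypothetical base estimator: always predicts the most frequent label."""

    def fit(self, x, y):
        labels, counts = np.unique(y, return_counts=True)
        self.label_ = labels[np.argmax(counts)]
        return self

    def predict(self, x):
        return np.full(len(x), self.label_)


class ToyBaggingClassifier:
    """Toy bagging model exposing the attributes named in the text."""

    def __init__(self, n_estimators=5, seed=0):
        self.n_estimators = n_estimators
        self.seed = seed

    def fit(self, x, y):
        rng = np.random.default_rng(self.seed)
        n = len(x)
        self.estimators_ = []          # fitted base estimators
        self.estimators_samples_ = []  # bootstrap indices as NumPy arrays
        for _ in range(self.n_estimators):
            idx = rng.integers(0, n, size=n)  # sample with replacement
            self.estimators_.append(MajorityClassEstimator().fit(x[idx], y[idx]))
            self.estimators_samples_.append(idx)
        return self

    def predict(self, x):
        votes = np.stack([est.predict(x) for est in self.estimators_])
        # Majority vote over estimators for every sample
        return np.array([np.bincount(col).argmax() for col in votes.T])


def out_of_bag_indices(model, n_samples):
    """Indices never drawn for each estimator, i.e. its out-of-bag points."""
    all_idx = np.arange(n_samples)
    return [np.setdiff1d(all_idx, s) for s in model.estimators_samples_]


x = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([0] * 12 + [1] * 8)
model = ToyBaggingClassifier(n_estimators=5, seed=0).fit(x, y)
oob = out_of_bag_indices(model, len(x))
```

Storing the sampled indices is what makes an out-of-bag evaluation possible: each estimator can be scored on exactly the points it did not see during training.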
 ## Bagging arbitrary models
 
 Through `BaggingClassifier` and `BaggingRegressor`, one can compute values
 
@@ -119,6 +119,60 @@ necessary:
    computation e.g. when the change in estimates is low, or the number of
    iterations or time elapsed exceed some threshold.
 
+### Tensor Support { #tensor-support }
+
+Starting from version 0.10.1, pyDVL supports both NumPy arrays and PyTorch
+tensors for data valuation. The implementation follows these key principles:
+
+1. **Type Preservation**: The valuation methods maintain the input data type
+   throughout computations, whether you provide NumPy arrays or PyTorch tensors
+   when constructing the [Dataset][pydvl.valuation.dataset.Dataset].
+2. **Transparent Usage**: The API remains the same regardless of the input type:
+   simply provide your data as tensors. The main difference is that the torch
+   model must be wrapped in a class compatible with the protocol
+   [TorchSupervisedModel][pydvl.valuation.types.TorchSupervisedModel].
+     !!! tip "Wrapping torch models"
+         There is an example implementation of
+         [TorchSupervisedModel][pydvl.valuation.types.TorchSupervisedModel]
+         in `notebooks/support/banzhaf.py`. But you should consider using
+         [skorch](https://github.com/skorch-dev/skorch) models instead, which
+         are entirely compatible with pyDVL.
+3. **Consistent Indexing**: Internally, indices are always managed as NumPy
+   arrays for consistency and compatibility, but the actual data operations
+   preserve tensor types when provided. In particular, samplers always return
+   NumPy arrays, and the [Dataset][pydvl.valuation.dataset.Dataset] class
+   uses NumPy arrays for indexing.
+4. [ValuationResult][pydvl.valuation.result.ValuationResult] objects always
+   contain NumPy arrays.
+
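The "consistent indexing" and "type preservation" principles listed above can be sketched as follows. This is not pyDVL's implementation: the helper `take` and the toy data are assumptions that only show how NumPy indices can select rows while the container type of the data is kept.

```python
import numpy as np
import torch

x = torch.arange(12, dtype=torch.float32).reshape(6, 2)
y = torch.tensor([0, 1, 0, 1, 0, 1])

# Indices as a sampler would hand them out: always a NumPy array.
indices = np.array([4, 0, 3])


def take(data, idx: np.ndarray):
    """Select rows by NumPy indices, keeping the container type of `data`."""
    if isinstance(data, torch.Tensor):
        # Convert the indices only, so the data stays a tensor on its device.
        return data[torch.as_tensor(idx, dtype=torch.long, device=data.device)]
    return data[idx]


x_sub = take(x, indices)             # torch.Tensor
y_sub = take(y, indices)             # torch.Tensor
x_np_sub = take(x.numpy(), indices)  # np.ndarray
```

Converting only the indices avoids a round trip of the data itself through NumPy, which matters when tensors live on an accelerator.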
+??? example "Creating and using a tensor dataset"
+    ```python
+    import torch
+    from pydvl.valuation import ModelUtility, TMCShapleyValuation
+    from pydvl.valuation.dataset import Dataset
+    from pydvl.valuation.scorers import SupervisedScorer
+    from sklearn.datasets import make_classification
+    from skorch import NeuralNetClassifier
+
+    # n_informative=3 is required here: scikit-learn demands
+    # n_classes * n_clusters_per_class <= 2**n_informative
+    X, y = make_classification(
+        n_samples=100, n_features=20, n_informative=3, n_classes=3
+    )
+    X_tensor = torch.tensor(X, dtype=torch.float32)
+    y_tensor = torch.tensor(y, dtype=torch.long)
+
+    train, test = Dataset.from_arrays(X_tensor, y_tensor, stratify_by_target=True)
+    # SomeNNModule is a placeholder for your own torch.nn.Module
+    model = NeuralNetClassifier(SomeNNModule(),
+                                max_epochs=10,
+                                criterion=torch.nn.CrossEntropyLoss,
+                                optimizer=torch.optim.Adam)
+    scorer = SupervisedScorer(model, test, default=0.0, range=(0, 1))
+    utility = ModelUtility(model, scorer)
+    valuation = TMCShapleyValuation(utility)
+    ```
+
+!!! warning "Library-specific requirements"
+    Some methods that rely on specific libraries may have type requirements:
+
+      - Methods that use scikit-learn models directly will convert tensors to
+        NumPy arrays internally.
+      - The [KNNShapleyValuation][pydvl.valuation.methods.knn_shapley.KNNShapleyValuation]
+        method requires NumPy arrays.
 
 ### Creating a Dataset
 
@@ -217,6 +271,8 @@ constructor accepts the same types of arguments as those of
 [None][] for the default.
 
 ```python
+import numpy as np
+from pydvl.valuation.scorers import SupervisedScorer
 scorer = SupervisedScorer("explained_variance", default=0.0, range=(-np.inf, 1))
 ```
 
 
@@ -17,6 +17,12 @@ Shapley values. Empirically, one of the most useful methods is the so-called
 [Truncated Monte Carlo Shapley][tmcs-intro] [@ghorbani_data_2019], but several
 approximations exist with different convergence rates and computational costs.
 
+??? info "Support for torch models"
+    Starting from version 0.10.1, Shapley value methods support both NumPy
+    arrays and PyTorch tensors as input data types, with the exception of
+    KNN-Shapley, which requires NumPy arrays. The implementation preserves
+    the input type throughout the computation, allowing integration with PyTorch
+    models. See [Tensor Support][tensor-support] for more details.
+
 
 ## Combinatorial Shapley  { #combinatorial-shapley-intro }
 
 
@@ -62,6 +62,19 @@ obtain a single valuation to use, one breaks ties by solving a quadratic program
 to select the $v$ in the LC with the smallest $\ell_2$ norm. This is called the
 _egalitarian least core_.
 
+!!! info "PyTorch support"
+    As of version 0.10.1, both
+    [ExactLeastCoreValuation][pydvl.valuation.methods.least_core.ExactLeastCoreValuation]
+    and
+    [MonteCarloLeastCoreValuation][pydvl.valuation.methods.least_core.MonteCarloLeastCoreValuation]
+    support PyTorch tensor inputs. Tensor data is used throughout the coalition
+    evaluation process to compute utility values. These utility values are then
+    assembled into NumPy arrays for the constraint matrices used by the linear
+    programming solver (CVXPY), which operates on CPU. See [Tensor
+    Support][tensor-support] for more general information about tensor support
+    in pyDVL.
+
+
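The assembly step mentioned in the note above can be sketched with NumPy alone. The names `toy_utility` and `build_constraints` are hypothetical, and an additive toy utility replaces model training, so this is an illustration under stated assumptions rather than pyDVL's implementation. Each sampled coalition $S$ contributes one row of the least-core constraint $\sum_{i \in S} v_i + e \geq u(S)$.

```python
import numpy as np


def toy_utility(subset: np.ndarray, n: int) -> float:
    """Hypothetical additive utility: every point contributes 1/n."""
    return len(subset) / n


def build_constraints(subsets, n, utility):
    """Assemble indicator rows and utilities into plain NumPy arrays."""
    a = np.zeros((len(subsets), n))
    b = np.zeros(len(subsets))
    for row, subset in enumerate(subsets):
        a[row, subset] = 1.0  # indicator of coalition membership
        # float() also turns a scalar tensor into a Python number,
        # so the solver only ever sees NumPy data
        b[row] = float(utility(subset, n))
    return a, b


n = 4
rng = np.random.default_rng(0)
# Random coalitions, each given as an array of member indices
subsets = [np.flatnonzero(rng.random(n) < 0.5) for _ in range(8)]
a, b = build_constraints(subsets, n, toy_utility)

# For an additive utility, equal values with zero subsidy satisfy
# every constraint: a @ values + subsidy >= b
values = np.full(n, 1 / n)
subsidy = 0.0
feasible = bool(np.all(a @ values + subsidy >= b - 1e-12))
```

Keeping the constraint data in NumPy means the solver stage is independent of whether the utilities were computed from arrays or tensors.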
 ## Exact Least Core
 
 This first algorithm is just a verbatim implementation of the definition above.
 
@@ -163,6 +163,7 @@ plugins:
             - https://pandas.pydata.org/docs/objects.inv
             - https://scikit-learn.org/stable/objects.inv
             - https://pytorch.org/docs/stable/objects.inv
+            - https://skorch.readthedocs.io/en/latest/objects.inv
             - https://pymemcache.readthedocs.io/en/latest/objects.inv
             - https://joblib.readthedocs.io/en/stable/objects.inv
             - https://loky.readthedocs.io/en/stable/objects.inv