aai-institute
diff --git a/.bumpversion.cfg b/.bumpversion.cfg
Lines changed: 1 addition & 1 deletion
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 33 additions & 1 deletion
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
Lines changed: 1 addition & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 42 additions & 50 deletions
diff --git a/docs/getting-started/first-steps.md b/docs/getting-started/first-steps.md
Lines changed: 75 additions & 20 deletions
@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 0.7.2.dev0
+current_version = 0.8.1.dev0
 commit = False
 tag = False
 allow_dirty = False
 
@@ -1,14 +1,46 @@
 # Changelog
 
-
 ## Unreleased
 
+### Fixed
+
+- Bug in using `DaskInfluenceCalculator` with `TorchNumpyConverter`
+  for single dimensional arrays [PR #485](https://github.com/aai-institute/pyDVL/pull/485)
+
+## 0.8.0 - 🆕 New interfaces, scaling computation, bug fixes and improvements 🎁
+
+### Added
+
+- New cache backends: InMemoryCacheBackend and DiskCacheBackend
+  [PR #458](https://github.com/aai-institute/pyDVL/pull/458)
+- New influence function interface `InfluenceFunctionModel`
+- Data parallel computation with `DaskInfluenceCalculator`
+  [PR #26](https://github.com/aai-institute/pyDVL/issues/26)
+- Sequential batch-wise computation and write to disk with `SequentialInfluenceCalculator` 
+  [PR #377](https://github.com/aai-institute/pyDVL/issues/377)
+- Adapt notebooks to new influence abstractions
+  [PR #430](https://github.com/aai-institute/pyDVL/issues/430)
+
 ### Changed
 
+- Refactor and simplify caching implementation 
+  [PR #458](https://github.com/aai-institute/pyDVL/pull/458)
+- Simplify display of computation progress
+  [PR #466](https://github.com/aai-institute/pyDVL/pull/466)
 - Improve README and better explain the examples
   [PR #465](https://github.com/aai-institute/pyDVL/pull/465)
 - Simplify and improve tests, add CodeCov code coverage
   [PR #429](https://github.com/aai-institute/pyDVL/pull/429)
+- **Breaking Changes**
+  - Removed `compute_influences` and all related code.
+    Replaced by new `InfluenceFunctionModel` interface. Removed modules:
+    - influence.general
+    - influence.inversion
+    - influence.twice_differentiable
+    - influence.torch.torch_differentiable
+
+### Fixed
+- Import bug in README [PR #457](https://github.com/aai-institute/pyDVL/issues/457)
 
 ## 0.7.1 - 🆕 New methods, bug fixes and improvements for local tests 🐞🧪
 
 
@@ -106,7 +106,7 @@ There are a few important arguments:
   To start memcached locally in the background with Docker use:
 
   ```shell
-   run --name pydvl-memcache -p 11211:11211 -d memcached
+  docker run --name pydvl-memcache -p 11211:11211 -d memcached
   ```
 
 - `-n` sets the number of parallel workers for 
 
@@ -7,27 +7,13 @@
 </p>
 
 <p align="center" style="text-align:center;">
-    <a href="https://pypi.org/project/pydvl/">
-        <img src="https://img.shields.io/pypi/v/pydvl.svg" alt="PyPI">
-    </a>
-    <a href="https://pypi.org/project/pydvl/">
-        <img src="https://img.shields.io/pypi/pyversions/pydvl.svg" alt="Version">
-    </a>
-    <a href="https://pydvl.org">
-        <img src="https://img.shields.io/badge/docs-All%20versions-009485" alt="documentation">
-    </a>
-    <a href="https://raw.githubusercontent.com/aai-institute/pyDVL/master/LICENSE">
-        <img alt="License" src="https://img.shields.io/pypi/l/pydvl">
-    </a>
-    <a href="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml">
-        <img src="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml/badge.svg" alt="Build status" >
-    </a>
-    <a href="https://codecov.io/gh/aai-institute/pyDVL">
-      <img src="https://codecov.io/gh/aai-institute/pyDVL/graph/badge.svg?token=VN7DNDE0FV"/>
-    </a>
-    <a href="https://zenodo.org/badge/latestdoi/354117916">
-        <img src="https://zenodo.org/badge/354117916.svg" alt="DOI">
-    </a>
+    <a href="https://pypi.org/project/pydvl/"><img src="https://img.shields.io/pypi/v/pydvl.svg" alt="PyPI"></a>
+    <a href="https://pypi.org/project/pydvl/"><img src="https://img.shields.io/pypi/pyversions/pydvl.svg" alt="Version"></a>
+    <a href="https://pydvl.org"><img src="https://img.shields.io/badge/docs-All%20versions-009485" alt="documentation"></a>
+    <a href="https://raw.githubusercontent.com/aai-institute/pyDVL/master/LICENSE"><img alt="License" src="https://img.shields.io/pypi/l/pydvl"></a>
+    <a href="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml"><img src="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml/badge.svg" alt="Build status" ></a>
+    <a href="https://codecov.io/gh/aai-institute/pyDVL"><img src="https://codecov.io/gh/aai-institute/pyDVL/graph/badge.svg?token=VN7DNDE0FV"/></a>
+    <a href="https://zenodo.org/badge/latestdoi/354117916"><img src="https://zenodo.org/badge/354117916.svg" alt="DOI"></a>
 </p>
 
 **pyDVL** collects algorithms for **Data Valuation** and **Influence Function** computation.
@@ -116,37 +102,34 @@ For influence computation, follow these steps:
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
-   from pydvl.influence import compute_influences, InversionMethod
-   from pydvl.influence.torch import TorchTwiceDifferentiable
+   
+   from pydvl.influence.torch import DirectInfluence
+   from pydvl.influence.torch.util import NestedTorchCatAggregator, TorchNumpyConverter
+   from pydvl.influence import SequentialInfluenceCalculator
    ```
 
 2. Create PyTorch data loaders for your train and test splits.
 
    ```python
-   torch.manual_seed(16)
-   
    input_dim = (5, 5, 5)
    output_dim = 3
+   train_x = torch.rand((10, *input_dim))
+   train_y = torch.rand((10, output_dim))
+   test_x = torch.rand((5, *input_dim))
+   test_y = torch.rand((5, output_dim))
 
-   train_data_loader = DataLoader(
-      TensorDataset(torch.rand((10, *input_dim)), torch.rand((10, output_dim))),
-      batch_size=2,
-   )
-   test_data_loader = DataLoader(
-      TensorDataset(torch.rand((5, *input_dim)), torch.rand((5, output_dim))),
-      batch_size=1,
-   )
+   train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2)
+   test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1)
    ```
 
 3. Instantiate your neural network model.
 
    ```python
    nn_architecture = nn.Sequential(
-      nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
-      nn.Flatten(),
-      nn.Linear(27, 3),
+     nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
+     nn.Flatten(),
+     nn.Linear(27, 3),
    )
-   nn_architecture.eval()
    ```
 
 4. Define your loss:
@@ -155,30 +138,38 @@ For influence computation, follow these steps:
    loss = nn.MSELoss()
    ```
 
-5. Wrap your model and loss in a `TorchTwiceDifferentiable` object.
+5. Instantiate an `InfluenceFunctionModel` and fit it to the training data:
 
    ```python
-   model = TorchTwiceDifferentiable(nn_architecture, loss)
+   infl_model = DirectInfluence(nn_architecture, loss, hessian_regularization=0.01)
+   infl_model = infl_model.fit(train_data_loader)
    ```
 
-6. Compute influence factors by providing training data and inversion method.
-   Using the conjugate gradient algorithm, this would look like:
+6. For small input data, call the `influences` method on the fitted instance:
 
    ```python
-   influences = compute_influences(
-      model,
-      training_data=train_data_loader,
-      test_data=test_data_loader,
-      inversion_method=InversionMethod.Cg,
-      hessian_regularization=1e-1,
-      maxiter=200,
-      progress=True,
-   )
+   influences = infl_model.influences(test_x, test_y, train_x, train_y)
    ```
    The result is a tensor of shape `(training samples x test samples)`
    that contains at index `(i, j)` the influence of training sample `i` on
    test sample `j`.
 
+7. For larger data, wrap the model in a
+   calculator and call methods on the calculator:
+   ```python
+   infl_calc = SequentialInfluenceCalculator(infl_model)
+   
+   # Lazy object providing arrays batch-wise in a sequential manner
+   lazy_influences = infl_calc.influences(test_data_loader, train_data_loader)
+
+   # Trigger computation and pull results to memory
+   influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())
+
+   # Trigger computation and write results batch-wise to disk
+   lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
+   ```
+   
+
    The higher the absolute value of the influence of a training sample
    on a test sample, the more influential it is for the chosen test sample, model
    and data loaders. The sign of the influence determines whether it is 
@@ -328,6 +319,7 @@ We currently implement the following papers:
   [Scaling Up Influence Functions](http://arxiv.org/abs/2112.03052). 
   In Proceedings of the AAAI-22. arXiv, 2021.
 
+  
 # License
 
 pyDVL is distributed under
 
@@ -9,8 +9,7 @@ alias:
 
 !!! Warning
     Make sure you have read [[installation]] before using the library. 
-    In particular read about how caching and parallelization work,
-    since they might require additional setup.
+    In particular read about which extra dependencies you may need.
 
 ## Main concepts
 
@@ -23,7 +22,6 @@ should be enough to get you started.
   computation and related methods.
 * [[influence-values]] for instructions on how to compute influence functions.
 
-
 ## Running the examples
 
 If you are somewhat familiar with the concepts of data valuation, you can start
@@ -36,23 +34,22 @@ by browsing our worked-out examples illustrating pyDVL's capabilities either:
   have to install jupyter first manually since it's not a dependency of the
   library.
 
-# Advanced usage
+## Advanced usage
 
-Besides the do's and don'ts of data valuation itself, which are the subject of
+Besides the dos and don'ts of data valuation itself, which are the subject of
 the examples and the documentation of each method, there are two main things to
 keep in mind when using pyDVL.
 
-## Caching
-
-pyDVL uses [memcached](https://memcached.org/) to cache the computation of the
-utility function and speed up some computations (see the [installation
-guide](installation.md/#setting-up-the-cache)).
+### Caching
 
-Caching of the utility function is disabled by default. When it is enabled it
-takes into account the data indices passed as argument and the utility function
-wrapped into the [Utility][pydvl.utils.utility.Utility] object. This means that
+pyDVL can cache (memoize) the computation of the utility function
+to speed up some computations for data valuation.
+Caching is, however, disabled by default.
+When it is enabled it takes into account the data indices passed as argument
+and the utility function wrapped into the
+[Utility][pydvl.utils.utility.Utility] object. This means that
 care must be taken when reusing the same utility function with different data,
-see the documentation for the [caching module][pydvl.utils.caching] for more
+see the documentation for the [caching package][pydvl.utils.caching] for more
 information.
 
 In general, caching won't play a major role in the computation of Shapley values
@@ -61,24 +58,82 @@ the same utility function computation, is very low. However, it can be very
 useful when comparing methods that use the same utility function, or when
 running multiple experiments with the same data.
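
The pitfall described above (reusing a cached utility with different data) can be illustrated with a minimal standard-library sketch. `MemoizedUtility` and `total` are hypothetical names chosen for illustration and are not part of pyDVL. The sketch assumes, as the text states, that the cache key is built from the utility function and the data indices, not from the data itself.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Sequence, Tuple


def total(values: Sequence[float]) -> float:
    """Toy scoring function standing in for an expensive utility."""
    return float(sum(values))


class MemoizedUtility:
    """Hypothetical stand-in for a cached utility (not pyDVL's implementation)."""

    def __init__(self, score: Callable[[Sequence[float]], float], data: Sequence[float]):
        self.score = score
        self.data = list(data)
        self.cache: Dict[Tuple[str, FrozenSet[int]], float] = {}
        self.n_evaluations = 0

    def __call__(self, indices: Iterable[int]) -> float:
        # The key depends only on the function name and the index set.
        key = (self.score.__name__, frozenset(indices))
        if key not in self.cache:
            self.n_evaluations += 1
            self.cache[key] = self.score([self.data[i] for i in sorted(key[1])])
        return self.cache[key]


utility = MemoizedUtility(total, data=[1.0, 2.0, 3.0])
first = utility([0, 2])   # computed from the data
second = utility([2, 0])  # same index set: served from the cache

# Swapping the data without clearing the cache returns a stale value,
# because the new data is not part of the key.
utility.data = [10.0, 20.0, 30.0]
stale = utility([0, 2])
```

In this toy setting the fix is to clear the cache, or create a new object, whenever the data changes.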
 
+pyDVL supports 3 different caching backends:
+
+- [InMemoryCacheBackend][pydvl.utils.caching.memory.InMemoryCacheBackend]:
+  an in-memory cache backend that uses a dictionary to store and retrieve
+  cached values. This is used to share cached values between threads
+  in a single process.
+- [DiskCacheBackend][pydvl.utils.caching.disk.DiskCacheBackend]:
+  a disk-based cache backend that uses pickled values written to and read from disk.  
+  This is used to share cached values between processes in a single machine.
+- [MemcachedCacheBackend][pydvl.utils.caching.memcached.MemcachedCacheBackend]:
+  a [Memcached](https://memcached.org/)-based cache backend that uses pickled values written to
+  and read from a Memcached server. This is used to share cached values
+  between processes across multiple machines.
+
+  **Note:** This specific backend requires optional dependencies.
+  See [[installation#extras]] for more information.
+
 !!! tip "When is the cache really necessary?"
     Crucially, semi-value computations with the
     [PermutationSampler][pydvl.value.sampler.PermutationSampler] require caching
     to be enabled, or they will take twice as long as the direct implementation
     in [compute_shapley_values][pydvl.value.shapley.compute_shapley_values].
 
-## Parallelization
+!!! tip "Using the cache"
+    Continue reading about the cache in the documentation
+    for the [caching package][pydvl.utils.caching].
+
+#### Setting up the Memcached cache
+
+[Memcached](https://memcached.org/) is an in-memory key-value store accessible
+over the network. pyDVL can use it to cache the computation of the utility function
+and speed up some computations (in particular, semi-value computations with the
+[PermutationSampler][pydvl.value.sampler.PermutationSampler] but other methods
+may benefit as well).
+
+You can either install it as a package or run it inside a Docker container (the
+simplest option). For installation instructions, refer to the [Getting
+started](https://github.com/memcached/memcached/wiki#getting-started) section in
+memcached's wiki. Then you can run it with:
 
-pyDVL supports [joblib](https://joblib.readthedocs.io/en/latest/) for local
-parallelization (within one machine) and [ray](https://ray.io) for distributed
-parallelization (across multiple machines).
+```shell
+memcached -u user
+```
 
-The former works out of the box but for the latter you will need to provide a
-running cluster (or run ray in local mode).
+To run memcached inside a container in daemon mode instead, use:
+
+```shell
+docker container run -d --rm -p 11211:11211 memcached:latest
+```
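
Before pointing pyDVL at the server, it can help to confirm that memcached answers on the published port. The following is a minimal sketch using only Python's standard library. `memcached_version` is a hypothetical helper, not a pyDVL function. It relies on the `version` command of memcached's text protocol and assumes the default port 11211 used in the commands above.

```python
import socket
from typing import Optional


def memcached_version(
    host: str = "localhost", port: int = 11211, timeout: float = 1.0
) -> Optional[str]:
    """Return the server's version string, or None if no memcached server answers."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            # The text protocol replies to "version" with "VERSION <version>".
            conn.sendall(b"version\r\n")
            reply = conn.recv(1024).decode("ascii", errors="replace").strip()
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return None
    if reply.startswith("VERSION "):
        return reply[len("VERSION "):]
    return None
```

Calling `memcached_version()` returns `None` when nothing is listening on `localhost:11211`, for example when the container was started without `-p 11211:11211`.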
+
+### Parallelization
+
+pyDVL uses [joblib](https://joblib.readthedocs.io/en/latest/) for local
+parallelization (within one machine) and supports using
+[Ray](https://ray.io) for distributed parallelization (across multiple machines).
+
+The former works out of the box, but for the latter you will need to install
+additional dependencies (see [[installation#extras]])
+and provide a running cluster (or run Ray in local mode).
 
 As of v0.7.0 pyDVL does not allow requesting resources per task sent to the
 cluster, so you will need to make sure that each worker has enough resources to
 handle the tasks it receives. A data valuation task using game-theoretic methods
 will typically make a copy of the whole model and dataset to each worker, even
 if the re-training only happens on a subset of the data. This means that you
 should make sure that each worker has enough memory to handle the whole dataset.
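
As a rough back-of-the-envelope check of the memory requirement above, the sketch below adds up one full copy of the dataset and the model per worker. The sizes and the worker count are made-up numbers, and `memory_estimate` is a hypothetical helper, not part of pyDVL. The result is only a lower bound, since it ignores whatever memory re-training itself needs.

```python
from typing import Tuple

GIB = 1024 ** 3


def memory_estimate(dataset_bytes: int, model_bytes: int, n_workers: int) -> Tuple[int, int]:
    """Return (bytes per worker, bytes across all workers) for full copies."""
    per_worker = dataset_bytes + model_bytes
    return per_worker, per_worker * n_workers


# Hypothetical sizes: a 2 GiB dataset, a 0.5 GiB model, 8 workers.
per_worker, cluster_total = memory_estimate(2 * GIB, GIB // 2, n_workers=8)
print(f"per worker: {per_worker / GIB:.1f} GiB, cluster: {cluster_total / GIB:.1f} GiB")
```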
+
+#### Ray
+
+Please follow the instructions in Ray's documentation to set up a cluster.
+Once you have a running cluster, you can use it by passing the address
+of the head node to parallel methods via [ParallelConfig][pydvl.parallel.config.ParallelConfig].
+
+For a local Ray cluster, you would use:
+
+```python
+from pydvl.parallel.config import ParallelConfig
+config = ParallelConfig(backend="ray") 
+```