<a href="https://zenodo.org/badge/latestdoi/354117916"><img src="https://zenodo.org/badge/354117916.svg" alt="DOI"></a>
</p>

**pyDVL** collects algorithms for **Data Valuation** and **Influence Function**
computation. Here is the list of
[all methods implemented](https://pydvl.org/devel/getting-started/methods/).

**Data Valuation** for machine learning is the task of assigning a scalar
to each element of a training set which reflects its contribution to the final
performance of some model trained on it. pyDVL focuses on model-dependent
methods.
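
For intuition, one canonical notion of value implemented in pyDVL is the *Data
Shapley* value, which averages the marginal contribution of a training point
$i$ over all subsets $S$ of the remaining data $D \setminus \{i\}$, with $u$ a
user-defined utility such as test accuracy:

$$
v(i) = \sum_{S \subseteq D \setminus \{i\}}
\frac{|S|! \ (|D| - |S| - 1)!}{|D|!}
\left[ u(S \cup \{i\}) - u(S) \right].
$$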

<div align="center" style="text-align:center;">
  <img
    width="60%"
    align="center"
    style="display: block; margin-left: auto; margin-right: auto;"
    src="https://pydvl.org/devel/value/img/mclc-best-removal-10k-natural.svg"
  />
</div>

**Influence Functions** measure the effect
of training samples over individual test points.
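
For intuition, the classical formulation of Koh & Liang (2017), which pyDVL's
implementations approximate in various ways, estimates the influence of a
training point $z$ on a test point $z_{\text{test}}$ as

$$
\mathcal{I}(z, z_{\text{test}}) =
- \nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top \,
H_{\hat{\theta}}^{-1} \,
\nabla_\theta L(z, \hat{\theta}),
$$

where $\hat{\theta}$ are the fitted model parameters, $L$ is the loss and
$H_{\hat{\theta}}$ is the Hessian of the empirical risk.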

<div align="center" style="text-align:center;">
  <img
    width="60%"
    align="center"
    style="display: block; margin-left: auto; margin-right: auto;"
    src="https://pydvl.org/devel/examples/img/influence_functions_example.png"
  />
</div>

# Installation

To use pyDVL with influence functions, install it with the `influence` extra:

```shell
$ pip install pyDVL[influence]
```
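
To check that the installation worked, you can print the version (a quick
sanity check; `pydvl` exposes `__version__` like most Python packages):

```shell
python -c "import pydvl; print(pydvl.__version__)"
```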

For more instructions and information refer to [Installing
pyDVL](https://pydvl.org/stable/getting-started/#installation) in the
documentation.

# Usage
8986
90- In the following subsections, we will showcase the usage of pyDVL
91- for Data Valuation and Influence Functions using simple examples.
92-
93- For more instructions and information refer to [ Getting
94- Started] ( https://pydvl.org/stable/getting-started/first-steps/ ) in
95- the documentation.
96- We provide several examples for data valuation
97- (e.g. [ Shapley Data Valuation] ( https://pydvl.org/stable/examples/shapley_basic_spotify/ ) )
98- and for influence functions
99- (e.g. [ Influence Functions for Neural Networks] ( https://pydvl.org/stable/examples/influence_imagenet/ ) )
100- with details on the algorithms and their applications.
87+ Please read [ Getting
88+ Started] ( https://pydvl.org/stable/getting-started/first-steps/ ) in the
89+ documentation for more instructions. We provide several examples for data
90+ valuation and for influence functions in our [ Example
91+ Gallery] ( https://pydvl.org/stable/examples/ ) .
10192
10293## Influence Functions
10394

For influence computation, follow these steps:

1. Import the necessary packages (the exact ones depend on your specific use
   case).
2. Create PyTorch data loaders for your train and test splits.
3. Instantiate your neural network model and define your loss function.
4. Instantiate an `InfluenceFunctionModel` and fit it to the training data.
5. For small input data, you can call the `influences()` method on the fitted
   instance. The result is a tensor of shape `(training samples, test samples)`
   that contains at index `(i, j)` the influence of training sample `i` on
   test sample `j`.
6. For larger datasets, wrap the model into a "calculator" and call methods on
   it. This splits the computation into smaller chunks and allows for lazy
   evaluation and out-of-core computation.

All of these steps are shown in the snippet below. The higher the absolute
value of the influence of a training sample on a test sample, the more
influential it is for the chosen test sample, model and data loaders. The sign
of the influence determines whether it is useful (positive) or harmful
(negative).

> **Note** pyDVL currently only supports PyTorch for Influence Functions. We
> plan to add support for Jax next.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from pydvl.influence import SequentialInfluenceCalculator
from pydvl.influence.torch import DirectInfluence
from pydvl.influence.torch.util import (
    NestedTorchCatAggregator,
    TorchNumpyConverter,
)

input_dim = (5, 5, 5)
output_dim = 3
train_x, train_y = torch.rand((10, *input_dim)), torch.rand((10, output_dim))
test_x, test_y = torch.rand((5, *input_dim)), torch.rand((5, output_dim))
train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2)
test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1)
model = nn.Sequential(
    nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
    nn.Flatten(),
    nn.Linear(27, 3),
)
loss = nn.MSELoss()

infl_model = DirectInfluence(model, loss, hessian_regularization=0.01)
infl_model = infl_model.fit(train_data_loader)

# For small datasets, instantiate the full influence matrix:
influences = infl_model.influences(test_x, test_y, train_x, train_y)

# For larger datasets, use the influence calculators:
infl_calc = SequentialInfluenceCalculator(infl_model)

# Lazy object providing arrays batch-wise in a sequential manner
lazy_influences = infl_calc.influences(test_data_loader, train_data_loader)

# Trigger computation and pull results to memory
influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())

# Trigger computation and write results batch-wise to disk
lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
```
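
As a minimal sketch of how one might inspect the result (assuming, as
described above, that `influences[i, j]` is the influence of training sample
`i` on test sample `j`; the choice of `test_idx` is arbitrary):

```python
test_idx = 0

# Rank training samples by magnitude of their influence on one test sample
ranking = torch.argsort(influences[:, test_idx].abs(), descending=True)
print(f"Training samples by influence on test sample {test_idx}: {ranking.tolist()}")

# Samples with negative influence are harmful for this test point
harmful = torch.where(influences[:, test_idx] < 0)[0]
print(f"Harmful training samples: {harmful.tolist()}")
```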

## Data Valuation

The steps required to compute data values for your samples are:

1. Import the necessary packages (the exact ones will depend on your specific
   use case).
2. Create a `Dataset` object with your train and test splits.
3. Create an instance of a `SupervisedModel` (basically any sklearn compatible
   predictor), and wrap it in a `Utility` object together with the data and a
   scoring function.
4. Use one of the methods defined in the library to compute the values. In the
   example below, we will use *Permutation Montecarlo Shapley*, an approximate
   method for computing Data Shapley values. The result is a variable of type
   `ValuationResult` that contains the indices and their values as well as
   other attributes.
5. Convert the valuation result to a dataframe, and analyze and visualize the
   values.

The higher the value for an index, the more important it is for the chosen
model, dataset and scorer. Reciprocally, low-value points could be mislabelled
or out-of-distribution, and dropping them can improve the model's performance.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Scorer, Utility
from pydvl.value import (
    MaxUpdates,
    RelativeTruncation,
    permutation_montecarlo_shapley,
)

data = Dataset.from_sklearn(
    load_breast_cancer(),
    train_size=10,
    stratify_by_target=True,
    random_state=16,
)
model = LogisticRegression()
u = Utility(
    model,
    data,
    Scorer("accuracy", default=0.0),
)
values = permutation_montecarlo_shapley(
    u,
    truncation=RelativeTruncation(u, 0.05),
    done=MaxUpdates(1000),
    seed=16,
    progress=True,
)
df = values.to_dataframe(column="data_value")
```
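
As a minimal sketch of the last step (using only the dataframe created above;
showing the five lowest-valued points is an arbitrary choice):

```python
# The lowest-valued points are candidates for inspection or removal
print(df.sort_values("data_value").head(5))

# Points with negative values actively hurt the chosen model and scorer
print(f"{(df.data_value < 0).sum()} points have negative value")
```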

# Contributing