
Commit c4940ef

Merge pull request #674 from aai-institute/feature/dul-extensions-again

DeepSets for DUL

2 parents fa74c06 + 7f51ea5, commit c4940ef

File tree

3 files changed: +222 −255 lines


CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -12,6 +12,8 @@
   [skorch.NeuralNetClassifier](https://skorch.readthedocs.io/en/stable/classifier.html)
   models
   [PR #673](https://github.com/aai-institute/pyDVL/pull/673)
+- Improved documentation and examples using DeepSets for Data Utility Learning
+  [PR #674](https://github.com/aai-institute/pyDVL/pull/674)
 
 ### Fixed

docs/value/dul.md

Lines changed: 68 additions & 53 deletions
@@ -7,8 +7,12 @@ alias:
 
 # Data Utility Learning { #data-utility-learning-intro }
 
+!!! Example
+    See the notebook on [Data Utility Learning](/examples/shapley_utility_learning/)
+    for a complete example.
+
 DUL [@wang_improving_2022] uses an ML model $\hat{u}$ to learn the utility function
-$u:2^N \to \matbb{R}$ during the fitting phase of any valuation method. This
+$u:2^N \to \mathbb{R}$ during the fitting phase of any valuation method. This
 _utility model_ is trained with tuples $(S, U(S))$ for a certain warm-up period.
 Then it is used instead of $u$ in the valuation method. The cost of training
 $\hat{u}$ is quickly amortized by avoiding costly re-evaluations of the original
@@ -20,7 +24,7 @@ utility.
 In other words, DUL accelerates data valuation by learning the utility function
 from a small number of subsets. The process is as follows:
 
-1. Collect a given_budget_ of so-called _utility samples_ (subsets and their
+1. Collect a given _budget_ of so-called _utility samples_ (subsets and their
    utility values) during the normal course of data valuation.
 2. Fit a model $\hat{u}$ to the utility samples. The model is trained to predict
    the utility of new subsets.
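The warm-up-then-predict loop described above can be sketched in a few lines. This is a hypothetical, self-contained illustration (the names `dul_sketch` and `encode` are not part of pyDVL's API): subsets are encoded as indicator vectors and a least-squares linear model stands in for the utility model $\hat{u}$.

``` python
import itertools
import numpy as np

def dul_sketch(n, utility, budget, queries):
    """Toy DUL loop: `utility` maps a frozenset of indices in range(n)
    to a float. The first `budget` subsets are evaluated exactly to
    collect utility samples; a linear surrogate fitted on indicator
    vectors then answers the remaining queries."""
    def encode(S):
        x = np.zeros(n)
        x[list(S)] = 1.0  # indicator vector of the subset
        return x

    # 1. Warm-up: collect (subset, utility) samples.
    X = np.array([encode(S) for S in queries[:budget]])
    y = np.array([utility(S) for S in queries[:budget]])
    # 2. Fit the utility model û (here: least-squares linear regression).
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    # 3. Use û instead of u for all further utility evaluations.
    return [float(encode(S) @ w) for S in queries[budget:]]

# An additive utility is captured exactly by the linear surrogate:
u = lambda S: float(sum(S))
warmup = [frozenset(c) for r in range(4) for c in itertools.combinations(range(3), r)]
preds = dul_sketch(3, u, budget=len(warmup), queries=warmup + [frozenset({0, 2})])
```

A real utility (a model's test score as a function of the training subset) is of course not additive, which is why more expressive utility models are useful in practice.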
@@ -50,10 +54,10 @@ Assuming you have some data valuation algorithm and your `utility` object:
    an indicator vector of the set as done in [@wang_improving_2022], with
    [IndicatorUtilityModel][pydvl.valuation.utility.learning.IndicatorUtilityModel].
    This wrapper accepts any machine learning model for the actual fitting.
-
    An alternative way to encode the data is to use a permutation-invariant model,
    such as [DeepSet][pydvl.valuation.utility.deepset.DeepSet] [@zaheer_deep_2017],
-   which is a simple architecture to learn embeddings for sets of points.
+   which is a simple architecture to learn embeddings for sets of points (see
+   below).
 2. Wrap both your `utility` object and the utility model just constructed within
    a [DataUtilityLearning][pydvl.valuation.utility.learning.DataUtilityLearning].
 3. Use this last object in your data valuation algorithm instead of the original
@@ -73,27 +77,29 @@ implementation of a permutation-invariant model called [Deep
 Sets][deep-sets-intro] which can serve as guidance for a more complex
 architecture.
 
-!!! example "DUL with a linear regression model"
-    ??? Example
-    ``` python
-    from pydvl.valuation import Dataset, DataUtilityLearning, ModelUtility, \
-        Sample, SupervisedScorer
-    from sklearn.linear_model import LinearRegression, LogisticRegression
-    from sklearn.datasets import load_iris
-
-    train, test = Dataset.from_sklearn(load_iris())
-    scorer = SupervisedScorer("accuracy", test, 0, (0,1))
-    utility = ModelUtility(LinearRegression(), scorer)
-    utility_model = IndicatorUtilityModel(LinearRegression(), len(train))
-    dul = DataUtilityLearning(utility, 300, utility_model)
-    valuation = ShapleyValuation(
-        utility=dul,
-        sampler=PermutationSampler(),
-        stopping=MaxUpdates(6000)
-    )
-    # Note: DUL does not support parallel training yet
-    valuation.fit(train)
-    ```
+??? example "DUL with indicator encoding"
+    In this example we use a linear regression model to learn the utility
+    function, with inputs encoded as indicator vectors.
+
+    ``` python
+    from pydvl.valuation import (
+        Dataset, DataUtilityLearning, ModelUtility, SupervisedScorer,
+        ShapleyValuation, PermutationSampler, MaxUpdates,
+    )
+    from pydvl.valuation.utility.learning import IndicatorUtilityModel
+    from sklearn.linear_model import LinearRegression
+    from sklearn.datasets import load_iris
+
+    train, test = Dataset.from_sklearn(load_iris())
+    scorer = SupervisedScorer("accuracy", test, 0, (0, 1))
+    utility = ModelUtility(LinearRegression(), scorer)
+    utility_model = IndicatorUtilityModel(LinearRegression(), len(train))
+    dul = DataUtilityLearning(utility, 300, utility_model)
+    valuation = ShapleyValuation(
+        utility=dul,
+        sampler=PermutationSampler(),
+        stopping=MaxUpdates(6000),
+    )
+    # Note: DUL does not support parallel training yet
+    valuation.fit(train)
+    ```
 
 ## Deep Sets { #deep-sets-intro }
 
@@ -109,31 +115,40 @@ $\rho$ that predicts the output $y$ from the aggregated representation:
 $$ y = \rho(\Phi(S)). $$
 
 
-!!! example "DUL with DeepSets"
-    ??? Example
-    This example requires pytorch installed.
-    ``` python
-    from pydvl.valuation import Dataset, DataUtilityLearning, ModelUtility, \
-        Sample, SupervisedScorer
-    from pydvl.valuation.utility.deepset import DeepSet
-    from sklearn.datasets import load_iris
-
-    train, test = Dataset.from_sklearn(load_iris())
-    scorer = SupervisedScorer("accuracy", test, 0, (0,1))
-    utility = ModelUtility(LinearRegression(), scorer)
-    utility_model = DeepSet(
-        input_dim=len(train),
-        phi_hidden_dim=10,
-        phi_output_dim=20,
-        rho_hidden_dim=10
-    )
-    dul = DataUtilityLearning(utility, 3000, utility_model)
-
-    valuation = ShapleyValuation(
-        utility=dul,
-        sampler=PermutationSampler(),
-        stopping=MaxUpdates(10000)
-    )
-    # Note: DUL does not support parallel training yet
-    valuation.fit(train)
-    ```
+??? example "DUL with DeepSets"
+    This example requires PyTorch to be installed. Here we use a Deep Sets
+    model to learn the utility function.
+
+    ``` python
+    from pydvl.valuation import (
+        Dataset, DataUtilityLearning, ModelUtility, SupervisedScorer,
+        ShapleyValuation, PermutationSampler, MaxUpdates,
+    )
+    from pydvl.valuation.utility.deepset import DeepSetUtilityModel
+    from sklearn.linear_model import LinearRegression
+    from sklearn.datasets import load_iris
+
+    train, test = Dataset.from_sklearn(load_iris())
+    scorer = SupervisedScorer("accuracy", test, 0, (0, 1))
+    utility = ModelUtility(LinearRegression(), scorer)
+    utility_model = DeepSetUtilityModel(
+        input_dim=len(train),
+        phi_hidden_dim=10,
+        phi_output_dim=20,
+        rho_hidden_dim=10,
+    )
+    dul = DataUtilityLearning(utility, 3000, utility_model)
+
+    valuation = ShapleyValuation(
+        utility=dul,
+        sampler=PermutationSampler(),
+        stopping=MaxUpdates(10000),
+    )
+    # Note: DUL does not support parallel training yet
+    valuation.fit(train)
+    ```
+
+## Other architectures
+
+As mentioned above, what makes DeepSets suitable for DUL is the
+permutation invariance of the model, which is a required property of any
+estimator of a function defined over sets, like the utility. Any alternative
+architecture with this property should work as well. Alternatively, one can use
+other encodings of the sets, as long as they are injective and invariant under
+permutations (or defined for fixed orderings, as the indicator encoding above).
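Why sum pooling yields this permutation invariance can be seen in a toy, plain-Python stand-in for a Deep Sets model (no neural networks involved; `phi` and `rho` are fixed functions here, standing in for the learned networks $\phi$ and $\rho$):

``` python
def phi(x):
    """Per-element embedding φ (stand-in for a small neural network)."""
    return (x, x * x)

def rho(z):
    """Read-out ρ applied to the pooled representation Φ(S)."""
    return 0.5 * z[0] + 0.1 * z[1]

def deep_set(S):
    """y = ρ(Φ(S)) with Φ(S) = Σ_{x ∈ S} φ(x).

    Because the component-wise sum is commutative, the output cannot
    depend on the order in which the elements of S are presented.
    """
    pooled = tuple(sum(components) for components in zip(*(phi(x) for x in S)))
    return rho(pooled)

# Reordering the set leaves the prediction unchanged:
y1 = deep_set([1.0, 2.0, 3.0])
y2 = deep_set([3.0, 1.0, 2.0])
```

The same check applied to a sequence model (say, an RNN reading the elements in order) would generally fail, which is why such architectures need order-invariant encodings to serve as utility models.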

notebooks/shapley_utility_learning.ipynb

Lines changed: 152 additions & 202 deletions
Large diffs are not rendered by default.

0 commit comments
