
Commit 9238695

Mention LAVA, add links to semivalue functions in main doc
1 parent 3469361 commit 9238695

File tree

docs/assets/pydvl.bib
docs/value/index.md
docs/value/semi-values.md

3 files changed: +74 -41 lines changed

docs/assets/pydvl.bib

Lines changed: 13 additions & 0 deletions
@@ -132,6 +132,19 @@ @article{jia_efficient_2019a
   keywords = {notion}
 }

+@inproceedings{just_lava_2023,
+  title = {{{LAVA}}: {{Data Valuation}} without {{Pre-Specified Learning Algorithms}}},
+  shorttitle = {{{LAVA}}},
+  author = {Just, Hoang Anh and Kang, Feiyang and Wang, Tianhao and Zeng, Yi and Ko, Myeongseob and Jin, Ming and Jia, Ruoxi},
+  date = {2023-02-01},
+  url = {https://openreview.net/forum?id=JJuP86nBl4q},
+  urldate = {2023-04-25},
+  abstract = {Traditionally, data valuation is posed as a problem of equitably splitting the validation performance of a learning algorithm among the training data. As a result, the calculated data values depend on many design choices of the underlying learning algorithm. However, this dependence is undesirable for many use cases of data valuation, such as setting priorities over different data sources in a data acquisition process and informing pricing mechanisms in a data marketplace. In these scenarios, data needs to be valued before the actual analysis and the choice of the learning algorithm is still undetermined then. Another side-effect of the dependence is that to assess the value of individual points, one needs to re-run the learning algorithm with and without a point, which incurs a large computation burden. This work leapfrogs over the current limits of data valuation methods by introducing a new framework that can value training data in a way that is oblivious to the downstream learning algorithm. Our main results are as follows. \$\textbackslash textbf\{(1)\}\$ We develop a proxy for the validation performance associated with a training set based on a non-conventional \$\textbackslash textit\{class-wise\}\$ \$\textbackslash textit\{Wasserstein distance\}\$ between the training and the validation set. We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions. \$\textbackslash textbf\{(2)\}\$ We develop a novel method to value individual data based on the sensitivity analysis of the \$\textbackslash textit\{class-wise\}\$ Wasserstein distance. Importantly, these values can be directly obtained \$\textbackslash textit\{for free\}\$ from the output of off-the-shelf optimization solvers once the Wasserstein distance is computed. \$\textbackslash textbf\{(3) \}\$We evaluate our new data valuation framework over various use cases related to detecting low-quality data and show that, surprisingly, the learning-agnostic feature of our framework enables a \$\textbackslash textit\{significant improvement\}\$ over the state-of-the-art performance while being \$\textbackslash textit\{orders of magnitude faster.\}\$},
+  eventtitle = {The {{Eleventh International Conference}} on {{Learning Representations}} ({{ICLR}} 2023)},
+  langid = {english},
+  keywords = {notion}
+}
+
 @inproceedings{koh_understanding_2017,
   title = {Understanding {{Black-box Predictions}} via {{Influence Functions}}},
   booktitle = {Proceedings of the 34th {{International Conference}} on {{Machine Learning}}},

docs/value/index.md

Lines changed: 48 additions & 34 deletions
@@ -89,10 +89,24 @@ performance metric used. For instance, accuracy is a poor metric for imbalanced
 sets and this has a stark effect on data values. Some models exhibit great
 variance in some regimes and this again has a detrimental effect on values.

-Nevertheless, some of the most promising applications are: Cleaning of corrupted
-data, pruning unnecessary or irrelevant data, repairing mislabeled data, guiding
-data acquisition and annotation (active learning), anomaly detection and model
-debugging and interpretation.
+Nevertheless, some of the most promising applications are:
+
+* Cleaning of corrupted data.
+* Pruning unnecessary or irrelevant data.
+* Repairing mislabeled data.
+* Guiding data acquisition and annotation (active learning).
+* Anomaly detection and model debugging and interpretation.
+
+Additionally, one of the motivating applications for the whole field is that of
+data markets: a marketplace where data owners can sell their data to interested
+parties. In this setting, data valuation can be a key component in determining
+the price of data. Algorithm-agnostic methods like LAVA [@just_lava_2023] are
+particularly well suited for this, as they use the Wasserstein distance between
+a vendor's data and the buyer's to determine the value of the former.
+
+However, this remains a complex problem with practical obstacles, such as the
+fact that data owners may not wish to disclose their data for valuation.
+

 ## Computing data values

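
To make the idea behind LAVA a bit more concrete: the class-wise Wasserstein distance it relies on can be sketched, very roughly and independently of pyDVL's API, with scipy's one-dimensional Wasserstein distance applied per class and per feature between a vendor's training data and a buyer's validation data. The helper below is purely illustrative; LAVA itself works with a joint feature-label metric and obtains per-point values from quantities the optimal transport solver already computes.

```python
import numpy as np
from scipy.stats import wasserstein_distance


def classwise_wasserstein(x_vendor, y_vendor, x_buyer, y_buyer) -> float:
    """Average 1-d Wasserstein distance between vendor and buyer data,
    computed per shared class label and per feature. A crude stand-in for
    the class-wise distance used in [@just_lava_2023]."""
    distances = []
    for label in np.intersect1d(y_vendor, y_buyer):
        xv = x_vendor[y_vendor == label]
        xb = x_buyer[y_buyer == label]
        for j in range(x_vendor.shape[1]):
            distances.append(wasserstein_distance(xv[:, j], xb[:, j]))
    return float(np.mean(distances))


rng = np.random.default_rng(0)
x_v, y_v = rng.normal(size=(100, 3)), rng.integers(0, 2, size=100)
x_b, y_b = rng.normal(loc=0.5, size=(80, 3)), rng.integers(0, 2, size=80)
print(classwise_wasserstein(x_v, y_v, x_b, y_b))
```

Lower distances indicate vendor data that is closer to the buyer's distribution and hence, in this framing, more valuable.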

@@ -299,26 +313,26 @@ nature of every (non-trivial) ML problem can have an effect:
 that matters, and this tends to be accurate (wrt. to the true ranking) despite
 inaccurate values.

-pyDVL offers a dedicated [function composition][pydvl.utils.score.compose_score]
-for scorer functions which can be used to squash a score.
-The following is defined in module [score][pydvl.utils.score]:
-
-```python
-import numpy as np
-from pydvl.utils import compose_score
-
-def sigmoid(x: float) -> float:
-    return float(1 / (1 + np.exp(-x)))
-
-squashed_r2 = compose_score("r2", sigmoid, "squashed r2")
-
-squashed_variance = compose_score(
-    "explained_variance", sigmoid, "squashed explained variance"
-)
-```
-
-These squashed scores can prove useful in regression problems, but they can
-also introduce issues in the low-value regime.
+??? tip "Squashing scores"
+    pyDVL offers a dedicated [function
+    composition][pydvl.utils.score.compose_score] for scorer functions which
+    can be used to squash a score. The following is defined in module
+    [score][pydvl.utils.score]:
+    ```python
+    import numpy as np
+    from pydvl.utils import compose_score
+
+    def sigmoid(x: float) -> float:
+        return float(1 / (1 + np.exp(-x)))
+
+    squashed_r2 = compose_score("r2", sigmoid, "squashed r2")
+
+    squashed_variance = compose_score(
+        "explained_variance", sigmoid, "squashed explained variance"
+    )
+    ```
+    These squashed scores can prove useful in regression problems, but they
+    can also introduce issues in the low-value regime.

 * **High variance utility**: Classical applications of game theoretic value
   concepts operate with deterministic utilities, but in ML we use an evaluation
@@ -328,27 +342,27 @@ nature of every (non-trivial) ML problem can have an effect:
   configure the caching system to allow multiple evaluations of the utility for
   every index set. A moving average is computed and returned once the standard
   error is small, see [MemcachedConfig][pydvl.utils.config.MemcachedConfig].
-
   [@wang_data_2022] prove that by relaxing one of the Shapley axioms
   and considering the general class of semi-values, of which Shapley is an
   instance, one can prove that a choice of constant weights is the best one can
-  do in a utility-agnostic setting. So-called *Data Banzhaf*.
+  do in a utility-agnostic setting. This method, dubbed *Data Banzhaf*, is
+  available in pyDVL as
+  [compute_banzhaf_semivalues][pydvl.value.semivalues.compute_banzhaf_semivalues].

 * **Data set size**: Computing exact Shapley values is NP-hard, and Monte Carlo
   approximations can converge slowly. Massive datasets are thus impractical, at
-  least with current techniques. A workaround is to group samples and investigate
-  their value together. In pyDVL you can do this using
-  [GroupedDataset][pydvl.utils.dataset.GroupedDataset].
-  There is a fully worked-out [example here](../examples/shapley_basic_spotify).
-  Some algorithms also provide different sampling strategies to reduce
-  the variance, but due to a no-free-lunch-type theorem,
-  no single strategy can be optimal for all utilities.
+  least with [game-theoretical methods][game-theoretical-methods]. A workaround
+  is to group samples and investigate their value together. You can do this using
+  [GroupedDataset][pydvl.utils.dataset.GroupedDataset]. There is a fully
+  worked-out [example here](../examples/shapley_basic_spotify). Some algorithms
+  also provide different sampling strategies to reduce the variance, but due to a
+  no-free-lunch-type theorem, no single strategy can be optimal for all utilities.

 * **Model size**: Since every evaluation of the utility entails retraining the
   whole model on a subset of the data, large models require great amounts of
   computation. But also, they will effortlessly interpolate small to medium
   datasets, leading to great variance in the evaluation of performance on the
   dedicated validation set. One mitigation for this problem is cross-validation,
-  but this would incur massive computational cost. As of v.0.3.0 there are no
+  but this would incur massive computational cost. As of v.0.7.0 there are no
   facilities in pyDVL for cross-validating the utility (note that this would
   require cross-validating the whole value computation).
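
The [compute_banzhaf_semivalues][pydvl.value.semivalues.compute_banzhaf_semivalues] function linked in the hunk above can be called roughly as follows. This is only a sketch based on names used in the documentation; the precise signature and the choice of stopping criterion should be checked against the pyDVL API reference.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value import MaxUpdates, compute_banzhaf_semivalues

# Wrap model, data and the default scorer into a Utility, then estimate
# Data Banzhaf values until a fixed number of value updates has been performed.
data = Dataset.from_sklearn(load_breast_cancer(), train_size=0.7)
utility = Utility(LogisticRegression(max_iter=1000), data)
values = compute_banzhaf_semivalues(utility, done=MaxUpdates(500))
print(values.values)  # one estimated value per training point
```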

docs/value/semi-values.md

Lines changed: 13 additions & 7 deletions
@@ -48,10 +48,10 @@ $$
 
 where $B$ is the [Beta function](https://en.wikipedia.org/wiki/Beta_function),
 and $\alpha$ and $\beta$ are parameters that control the weighting of the
-subsets. Setting both to 1 recovers Shapley values, and setting $\alpha = 1$, and
-$\beta = 16$ is reported in [@kwon_beta_2022] to be a good choice for
-some applications. See however the [Banzhaf indices][banzhaf-indices] section
-for an alternative choice of weights which is reported to work better.
+subsets. Setting both to 1 recovers Shapley values, and setting $\alpha = 1$,
+and $\beta = 16$ is reported in [@kwon_beta_2022] to be a good choice for some
+applications. Beta Shapley values are available in pyDVL through
+[compute_beta_shapley_semivalues][pydvl.value.semivalues.compute_beta_shapley_semivalues]:

 ```python
 from pydvl.value import *
@@ -62,6 +62,9 @@ values = compute_beta_shapley_semivalues(
 )
 ```

+See however the [Banzhaf indices][banzhaf-indices] section
+for an alternative choice of weights which is reported to work better.
+
 ## Banzhaf indices

 As noted in the section [Problems of Data Values][problems-of-data-values], the
@@ -81,8 +84,10 @@ any choice of weight function $w$, one can always construct a utility with
 higher variance where $w$ is greater. Therefore, in a worst-case sense, the best
 one can do is to pick a constant weight.

-The authors of [@wang_data_2022] show that Banzhaf indices are more
-robust to variance in the utility function than Shapley and Beta Shapley values.
+The authors of [@wang_data_2022] show that Banzhaf indices are more robust to
+variance in the utility function than Shapley and Beta Shapley values. They are
+available in pyDVL through
+[compute_banzhaf_semivalues][pydvl.value.semivalues.compute_banzhaf_semivalues]:

 ```python
 from pydvl.value import *
@@ -103,6 +108,8 @@ combination of the three ingredients that define a semi-value:
 - A sampling method
 - A weighting scheme $w$.

+You can construct any combination of these three ingredients with
+[compute_generic_semivalues][pydvl.value.semivalues.compute_generic_semivalues].
 The utility function is the same as for Shapley values, and the sampling method
 can be any of the types defined in [the samplers module][pydvl.value.sampler].
 For instance, the following snippet is equivalent to the above:
@@ -125,7 +132,6 @@ sensitive to changes in training set size. However, Data Banzhaf indices are
 proven to be the most robust to variance in the utility function, in the sense
 of rank stability, across a range of models and datasets [@wang_data_2022].

-
 !!! warning "Careful with permutation sampling"
     This generic implementation of semi-values allowing for any combination of
     sampling and weighting schemes is very flexible and, in principle, it
