
Commit 9238695

Mention LAVA, add links to semivalue functions in main doc
1 parent 3469361 commit 9238695

File tree

docs/assets/pydvl.bib
docs/value/index.md
docs/value/semi-values.md

3 files changed: +74 -41 lines changed

docs/assets/pydvl.bib

Lines changed: 13 additions & 0 deletions
@@ -132,6 +132,19 @@ @article{jia_efficient_2019a
   keywords = {notion}
 }

+@inproceedings{just_lava_2023,
+  title = {{{LAVA}}: {{Data Valuation}} without {{Pre-Specified Learning Algorithms}}},
+  shorttitle = {{{LAVA}}},
+  author = {Just, Hoang Anh and Kang, Feiyang and Wang, Tianhao and Zeng, Yi and Ko, Myeongseob and Jin, Ming and Jia, Ruoxi},
+  date = {2023-02-01},
+  url = {https://openreview.net/forum?id=JJuP86nBl4q},
+  urldate = {2023-04-25},
+  abstract = {Traditionally, data valuation is posed as a problem of equitably splitting the validation performance of a learning algorithm among the training data. As a result, the calculated data values depend on many design choices of the underlying learning algorithm. However, this dependence is undesirable for many use cases of data valuation, such as setting priorities over different data sources in a data acquisition process and informing pricing mechanisms in a data marketplace. In these scenarios, data needs to be valued before the actual analysis and the choice of the learning algorithm is still undetermined then. Another side-effect of the dependence is that to assess the value of individual points, one needs to re-run the learning algorithm with and without a point, which incurs a large computation burden. This work leapfrogs over the current limits of data valuation methods by introducing a new framework that can value training data in a way that is oblivious to the downstream learning algorithm. Our main results are as follows. \$\textbackslash textbf\{(1)\}\$ We develop a proxy for the validation performance associated with a training set based on a non-conventional \$\textbackslash textit\{class-wise\}\$ \$\textbackslash textit\{Wasserstein distance\}\$ between the training and the validation set. We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions. \$\textbackslash textbf\{(2)\}\$ We develop a novel method to value individual data based on the sensitivity analysis of the \$\textbackslash textit\{class-wise\}\$ Wasserstein distance. Importantly, these values can be directly obtained \$\textbackslash textit\{for free\}\$ from the output of off-the-shelf optimization solvers once the Wasserstein distance is computed. \$\textbackslash textbf\{(3) \}\$We evaluate our new data valuation framework over various use cases related to detecting low-quality data and show that, surprisingly, the learning-agnostic feature of our framework enables a \$\textbackslash textit\{significant improvement\}\$ over the state-of-the-art performance while being \$\textbackslash textit\{orders of magnitude faster.\}\$},
+  eventtitle = {The {{Eleventh International Conference}} on {{Learning Representations}} ({{ICLR}} 2023)},
+  langid = {english},
+  keywords = {notion}
+}
+
 @inproceedings{koh_understanding_2017,
   title = {Understanding {{Black-box Predictions}} via {{Influence Functions}}},
   booktitle = {Proceedings of the 34th {{International Conference}} on {{Machine Learning}}},

docs/value/index.md

Lines changed: 48 additions & 34 deletions
@@ -89,10 +89,24 @@ performance metric used. For instance, accuracy is a poor metric for imbalanced
 sets and this has a stark effect on data values. Some models exhibit great
 variance in some regimes and this again has a detrimental effect on values.

-Nevertheless, some of the most promising applications are: Cleaning of corrupted
-data, pruning unnecessary or irrelevant data, repairing mislabeled data, guiding
-data acquisition and annotation (active learning), anomaly detection and model
-debugging and interpretation.
+Nevertheless, some of the most promising applications are:
+
+* Cleaning of corrupted data.
+* Pruning unnecessary or irrelevant data.
+* Repairing mislabeled data.
+* Guiding data acquisition and annotation (active learning).
+* Anomaly detection and model debugging and interpretation.
+
+Additionally, one of the motivating applications for the whole field is that of
+data markets: a marketplace where data owners can sell their data to interested
+parties. In this setting, data valuation can be a key component in determining
+the price of data. Algorithm-agnostic methods like LAVA [@just_lava_2023] are
+particularly well suited for this, as they use the Wasserstein distance between
+a vendor's data and the buyer's to determine the value of the former.
+
+However, this remains a complex problem with practical obstacles, such as the
+fact that data owners may not wish to disclose their data for valuation.
+

 ## Computing data values

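
To make the idea behind LAVA a bit more concrete: the class-wise Wasserstein distance it relies on can be sketched, very roughly and independently of pyDVL's API, with scipy's one-dimensional Wasserstein distance applied per class and per feature between a vendor's training data and a buyer's validation data. The helper below is purely illustrative; LAVA itself works with a joint feature-label metric and obtains per-point values from quantities the optimal transport solver already computes.

```python
import numpy as np
from scipy.stats import wasserstein_distance


def classwise_wasserstein(x_vendor, y_vendor, x_buyer, y_buyer) -> float:
    """Average 1-d Wasserstein distance between vendor and buyer data,
    computed per shared class label and per feature. A crude stand-in for
    the class-wise distance used in [@just_lava_2023]."""
    distances = []
    for label in np.intersect1d(y_vendor, y_buyer):
        xv = x_vendor[y_vendor == label]
        xb = x_buyer[y_buyer == label]
        for j in range(x_vendor.shape[1]):
            distances.append(wasserstein_distance(xv[:, j], xb[:, j]))
    return float(np.mean(distances))


rng = np.random.default_rng(0)
x_v, y_v = rng.normal(size=(100, 3)), rng.integers(0, 2, size=100)
x_b, y_b = rng.normal(loc=0.5, size=(80, 3)), rng.integers(0, 2, size=80)
print(classwise_wasserstein(x_v, y_v, x_b, y_b))
```

Lower distances indicate vendor data that is closer to the buyer's distribution and hence, in this framing, more valuable.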

@@ -299,26 +313,26 @@ nature of every (non-trivial) ML problem can have an effect:
 that matters, and this tends to be accurate (wrt. to the true ranking) despite
 inaccurate values.

-pyDVL offers a dedicated [function composition][pydvl.utils.score.compose_score]
-for scorer functions which can be used to squash a score.
-The following is defined in module [score][pydvl.utils.score]:
-
-```python
-import numpy as np
-from pydvl.utils import compose_score
-
-def sigmoid(x: float) -> float:
-    return float(1 / (1 + np.exp(-x)))
-
-squashed_r2 = compose_score("r2", sigmoid, "squashed r2")
-
-squashed_variance = compose_score(
-    "explained_variance", sigmoid, "squashed explained variance"
-)
-```
-
-These squashed scores can prove useful in regression problems, but they can
-also introduce issues in the low-value regime.
+??? tip "Squashing scores"
+    pyDVL offers a dedicated [function
+    composition][pydvl.utils.score.compose_score] for scorer functions which
+    can be used to squash a score. The following is defined in module
+    [score][pydvl.utils.score]:
+    ```python
+    import numpy as np
+    from pydvl.utils import compose_score
+
+    def sigmoid(x: float) -> float:
+        return float(1 / (1 + np.exp(-x)))
+
+    squashed_r2 = compose_score("r2", sigmoid, "squashed r2")
+
+    squashed_variance = compose_score(
+        "explained_variance", sigmoid, "squashed explained variance"
+    )
+    ```
+    These squashed scores can prove useful in regression problems, but they
+    can also introduce issues in the low-value regime.

 * **High variance utility**: Classical applications of game theoretic value
   concepts operate with deterministic utilities, but in ML we use an evaluation
@@ -328,27 +342,27 @@ nature of every (non-trivial) ML problem can have an effect:
   configure the caching system to allow multiple evaluations of the utility for
   every index set. A moving average is computed and returned once the standard
   error is small, see [MemcachedConfig][pydvl.utils.config.MemcachedConfig].
-
   [@wang_data_2022] prove that by relaxing one of the Shapley axioms
   and considering the general class of semi-values, of which Shapley is an
   instance, one can prove that a choice of constant weights is the best one can
-  do in a utility-agnostic setting. So-called *Data Banzhaf*.
+  do in a utility-agnostic setting. This method, dubbed *Data Banzhaf*, is
+  available in pyDVL as
+  [compute_banzhaf_semivalues][pydvl.value.semivalues.compute_banzhaf_semivalues].

 * **Data set size**: Computing exact Shapley values is NP-hard, and Monte Carlo
   approximations can converge slowly. Massive datasets are thus impractical, at
-  least with current techniques. A workaround is to group samples and investigate
-  their value together. In pyDVL you can do this using
-  [GroupedDataset][pydvl.utils.dataset.GroupedDataset].
-  There is a fully worked-out [example here](../examples/shapley_basic_spotify).
-  Some algorithms also provide different sampling strategies to reduce
-  the variance, but due to a no-free-lunch-type theorem,
-  no single strategy can be optimal for all utilities.
+  least with [game-theoretical methods][game-theoretical-methods]. A workaround
+  is to group samples and investigate their value together. You can do this using
+  [GroupedDataset][pydvl.utils.dataset.GroupedDataset]. There is a fully
+  worked-out [example here](../examples/shapley_basic_spotify). Some algorithms
+  also provide different sampling strategies to reduce the variance, but due to a
+  no-free-lunch-type theorem, no single strategy can be optimal for all utilities.

 * **Model size**: Since every evaluation of the utility entails retraining the
   whole model on a subset of the data, large models require great amounts of
   computation. But also, they will effortlessly interpolate small to medium
   datasets, leading to great variance in the evaluation of performance on the
   dedicated validation set. One mitigation for this problem is cross-validation,
-  but this would incur massive computational cost. As of v.0.3.0 there are no
+  but this would incur massive computational cost. As of v.0.7.0 there are no
   facilities in pyDVL for cross-validating the utility (note that this would
   require cross-validating the whole value computation).
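
The [compute_banzhaf_semivalues][pydvl.value.semivalues.compute_banzhaf_semivalues] function linked in the hunk above can be called roughly as follows. This is only a sketch based on names used in the documentation; the precise signature and the choice of stopping criterion should be checked against the pyDVL API reference.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value import MaxUpdates, compute_banzhaf_semivalues

# Wrap model, data and the default scorer into a Utility, then estimate
# Data Banzhaf values until a fixed number of value updates has been performed.
data = Dataset.from_sklearn(load_breast_cancer(), train_size=0.7)
utility = Utility(LogisticRegression(max_iter=1000), data)
values = compute_banzhaf_semivalues(utility, done=MaxUpdates(500))
print(values.values)  # one estimated value per training point
```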

docs/value/semi-values.md

Lines changed: 13 additions & 7 deletions
@@ -48,10 +48,10 @@ $$
 
 where $B$ is the [Beta function](https://en.wikipedia.org/wiki/Beta_function),
 and $\alpha$ and $\beta$ are parameters that control the weighting of the
-subsets. Setting both to 1 recovers Shapley values, and setting $\alpha = 1$, and
-$\beta = 16$ is reported in [@kwon_beta_2022] to be a good choice for
-some applications. See however the [Banzhaf indices][banzhaf-indices] section
-for an alternative choice of weights which is reported to work better.
+subsets. Setting both to 1 recovers Shapley values, and setting $\alpha = 1$,
+and $\beta = 16$ is reported in [@kwon_beta_2022] to be a good choice for some
+applications. Beta Shapley values are available in pyDVL through
+[compute_beta_shapley_semivalues][pydvl.value.semivalues.compute_beta_shapley_semivalues]:

 ```python
 from pydvl.value import *
@@ -62,6 +62,9 @@ values = compute_beta_shapley_semivalues(
 )
 ```

+See however the [Banzhaf indices][banzhaf-indices] section
+for an alternative choice of weights which is reported to work better.
+
 ## Banzhaf indices

 As noted in the section [Problems of Data Values][problems-of-data-values], the
@@ -81,8 +84,10 @@ any choice of weight function $w$, one can always construct a utility with
 higher variance where $w$ is greater. Therefore, in a worst-case sense, the best
 one can do is to pick a constant weight.

-The authors of [@wang_data_2022] show that Banzhaf indices are more
-robust to variance in the utility function than Shapley and Beta Shapley values.
+The authors of [@wang_data_2022] show that Banzhaf indices are more robust to
+variance in the utility function than Shapley and Beta Shapley values. They are
+available in pyDVL through
+[compute_banzhaf_semivalues][pydvl.value.semivalues.compute_banzhaf_semivalues]:

 ```python
 from pydvl.value import *
@@ -103,6 +108,8 @@ combination of the three ingredients that define a semi-value:
 - A sampling method
 - A weighting scheme $w$.

+You can construct any combination of these three ingredients with
+[compute_generic_semivalues][pydvl.value.semivalues.compute_generic_semivalues].
 The utility function is the same as for Shapley values, and the sampling method
 can be any of the types defined in [the samplers module][pydvl.value.sampler].
 For instance, the following snippet is equivalent to the above:
@@ -125,7 +132,6 @@ sensitive to changes in training set size. However, Data Banzhaf indices are
 proven to be the most robust to variance in the utility function, in the sense
 of rank stability, across a range of models and datasets [@wang_data_2022].

-
 !!! warning "Careful with permutation sampling"
     This generic implementation of semi-values allowing for any combination of
     sampling and weighting schemes is very flexible and, in principle, it
