Commit 63af7cc

Merge pull request #237 from appliedAI-Initiative/157-implement-least-core-v2

Implement Least Core

2 parents c94aafe + 66e0467

23 files changed: +1980 -210 lines

CHANGELOG.md

Lines changed: 5 additions & 0 deletions

@@ -22,6 +22,11 @@
   fixes a bug in `MapReduceJob`'s chunkification when `n_runs` >= `n_jobs`,
   and defines a sequential parallel backend to run all jobs in the current thread
   [PR #232](https://github.com/appliedAI-Initiative/pyDVL/pull/232)
+- **New method**: Implements exact and Monte Carlo Least Core for data valuation,
+  adds `from_arrays()` class method to the `Dataset` and `GroupedDataset` classes,
+  adds `extra_values` argument to `ValuationResult`,
+  adds `compute_removal_score()` and `compute_random_removal_score()` helper functions
+  [PR #237](https://github.com/appliedAI-Initiative/pyDVL/pull/237)
 
 ## 0.3.0 - 💥 Breaking changes

README.md

Lines changed: 23 additions & 16 deletions

@@ -46,6 +46,9 @@ methods from the following papers:
   [A Multilinear Sampling Algorithm to Estimate Shapley Values](https://doi.org/10.1109/ICPR48806.2021.9412511).
   In 2020 25th International Conference on Pattern Recognition (ICPR), 7992–99.
   IEEE, 2021.
+- Yan, T., & Procaccia, A. D.
+  [If You Like Shapley Then You’ll Love the Core](https://ojs.aaai.org/index.php/AAAI/article/view/16721).
+  Proceedings of the AAAI Conference on Artificial Intelligence, 35(6) (2021): 5751-5759.
 
 Influence Functions compute the effect that single points have on an estimator /
 model. We implement methods from the following papers:

@@ -76,18 +79,7 @@ documentation.
 
 # Usage
 
-pyDVL uses [Memcached](https://memcached.org/) to cache certain results and
-speed up computation. You can run it either locally or, using
-[Docker](https://www.docker.com/):
-
-```shell
-docker container run --rm -p 11211:11211 --name pydvl-cache -d memcached:latest
-```
-
-You can read more in the [caching module's
-documentation](https://appliedAI-Initiative.github.io/pyDVL/pydvl/utils/caching.html).
-
-Once that's done, the steps required to compute values for your samples are
+The steps required to compute values for your samples are:
 
 1. Create a `Dataset` object with your train and test splits.
 2. Create an instance of a `SupervisedModel` (basically any sklearn compatible

@@ -108,14 +100,14 @@ from sklearn.model_selection import train_test_split
 
 X, y = np.arange(100).reshape((50, 2)), np.arange(50)
 X_train, X_test, y_train, y_test = train_test_split(
-    X, y, test_size=0.5, random_state=16
-)
+    X, y, test_size=0.5, random_state=16
+)
 dataset = Dataset(X_train, y_train, X_test, y_test)
 model = LinearRegression()
 utility = Utility(model, dataset)
 values = compute_shapley_values(
-    u=utility, max_iterations=100, mode="truncated_montecarlo"
-)
+    u=utility, max_iterations=100, mode="truncated_montecarlo"
+)
 ```
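The hunk above appears to change only whitespace, so the example is split across markers. For reference, here is a consolidated, runnable version of the README snippet; the `pydvl` import paths are not shown in the hunk and are an assumption based on pyDVL's documented API at the time of this PR:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Import paths assumed from pyDVL's documented API; not shown in the hunk.
from pydvl.utils import Dataset, Utility
from pydvl.value.shapley import compute_shapley_values

# Toy data: 50 points in 2 dimensions with integer targets.
X, y = np.arange(100).reshape((50, 2)), np.arange(50)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=16
)
dataset = Dataset(X_train, y_train, X_test, y_test)
model = LinearRegression()
utility = Utility(model, dataset)
values = compute_shapley_values(
    u=utility, max_iterations=100, mode="truncated_montecarlo"
)
```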
 For more instructions and information refer to [Getting

@@ -124,6 +116,21 @@ the documentation. We provide several
 [examples](https://appliedAI-Initiative.github.io/pyDVL/examples/index.html)
 with details on the algorithms and their applications.
 
+## Caching
+
+pyDVL offers the possibility to cache certain results and
+speed up computation. It uses [Memcached](https://memcached.org/) for that.
+
+You can run it either locally or using
+[Docker](https://www.docker.com/):
+
+```shell
+docker container run --rm -p 11211:11211 --name pydvl-cache -d memcached:latest
+```
+
+You can read more in the [caching module's
+documentation](https://appliedAI-Initiative.github.io/pyDVL/pydvl/utils/caching.html).
+
 # Contributing
 
 Please open new issues for bugs, feature requests and extensions. You can read
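To check that the Memcached server started above is reachable before relying on the cache, any client that speaks the Memcached protocol will do. A minimal sketch using `pymemcache` (an assumption; the caching module's documentation covers pyDVL's own configuration):

```python
# Minimal reachability check for the Memcached server started above.
# pymemcache is just one client that speaks the standard Memcached protocol.
from pymemcache.client.base import Client

client = Client(("localhost", 11211))
client.set("pydvl-ping", "pong")
assert client.get("pydvl-ping") == b"pong"  # values are returned as bytes
```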

docs/30-data-valuation.rst

Lines changed: 133 additions & 6 deletions

@@ -205,7 +205,10 @@ The value $v$ of the $i$-th sample in dataset $D$ wrt. utility $u$ is computed
 as a weighted sum of its marginal utility wrt. every possible coalition of
 training samples within the training set:
 
-$$v_u(x_i) = \frac{1}{n} \sum_{S \subseteq D \setminus \{x_i\}} \binom{n-1}{ | S | }^{-1} [u(S \cup \{x_i\}) − u(S)] ,$$
+$$
+v_u(x_i) = \frac{1}{n} \sum_{S \subseteq D \setminus \{x_i\}}
+\binom{n-1}{ | S | }^{-1} [u(S \cup \{x_i\}) − u(S)]
+,$$
 
 .. code-block:: python
@@ -253,7 +256,10 @@ of the utility from $\{0,1\}^n$, where a 1 in position $i$ means that sample
 $x_i$ is used to train the model, to $[0,1]^n$. The ensuing expression for
 Shapley value uses integration instead of discrete weights:
 
-$$v_u(i) = \int_0^1 \mathbb{E}_{S \sim P_q(D_{\backslash \{ i \}})} [u(S \cup {i}) - u(S)].$$
+$$
+v_u(i) = \int_0^1 \mathbb{E}_{S \sim P_q(D_{\backslash \{ i \}})}
+[u(S \cup {i}) - u(S)]
+.$$
 
 Using Owen sampling follows the same pattern as every other method for Shapley
 values in pyDVL. First construct the dataset and utility, then call
@@ -282,7 +288,10 @@ Permutation Shapley
 
 An equivalent way of computing Shapley values appears often in the literature.
 It uses permutations over indices instead of subsets:
 
-$$v_u(x_i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)} [u(\sigma_{i-1} \cup {i}) − u(\sigma_{i})],$$
+$$
+v_u(x_i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)}
+[u(\sigma_{i-1} \cup {i}) − u(\sigma_{i})]
+,$$
 
 where $\sigma_i$ denotes the set of indices in permutation $\sigma$ up until the
 position of index $i$. To approximate this sum (with $\mathcal{O}(n!)$ terms!)
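To make the permutation formula in this hunk concrete, here is a minimal Monte Carlo sketch of the estimator it suggests (illustrative only, with a plain set function standing in for pyDVL's `Utility`):

```python
import random

def permutation_shapley(utility, n_points, n_permutations, seed=16):
    """Estimate Shapley values by averaging marginal utilities over
    randomly sampled permutations of the index set."""
    rng = random.Random(seed)
    values = [0.0] * n_points
    indices = list(range(n_points))
    for _ in range(n_permutations):
        rng.shuffle(indices)
        prefix = set()
        u_prev = utility(prefix)
        for i in indices:
            prefix.add(i)
            u_curr = utility(prefix)
            values[i] += u_curr - u_prev  # marginal contribution of i
            u_prev = u_curr
    return [v / n_permutations for v in values]

# Toy additive utility: the Shapley values recover the weights exactly,
# which makes it easy to sanity-check the estimator.
weights = [0.1, 0.4, 0.2, 0.3]
estimates = permutation_shapley(lambda s: sum(weights[i] for i in s), 4, 100)
```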
@@ -324,13 +333,131 @@ and can be used in pyDVL with:
 
    utility = Utility(model, data)
    values = compute_shapley_values(u=utility, mode="knn")
 
+.. _Least Core:
+
+Core values
+===========
+
+The Shapley values define a fair way to distribute payoffs amongst all
+participants when they form a grand coalition. But they do not consider
+the question of stability: under which conditions do all participants
+form the grand coalition? Would the participants be willing to form
+the grand coalition given how the payoffs are assigned,
+or would some of them prefer to form smaller coalitions?
+
+The Core is another approach to computing data values, originating
+in cooperative game theory, that attempts to ensure this stability.
+It is the set of feasible payoffs that cannot be improved upon
+by any coalition of the participants.
+
+It satisfies the following two properties:
+
+- **Efficiency**:
+  The payoffs are distributed such that it is not possible
+  to make any participant better off
+  without making another one worse off:
+  $$\sum_{x_i\in D} v_u(x_i) = u(D).$$
+
+- **Coalitional rationality**:
+  The sum of payoffs to the agents in any coalition $S$ is at
+  least as large as the amount that these agents could earn by
+  forming a coalition on their own:
+  $$\sum_{x_i\in S} v_u(x_i) \geq u(S), \forall S \subseteq D.$$
+
+Least Core values
+^^^^^^^^^^^^^^^^^
+
+Unfortunately, for many cooperative games the Core may be empty.
+By relaxing the coalitional rationality property by $e \gt 0$,
+we are then able to find approximate payoffs:
+
+$$
+\sum_{x_i\in S} v_u(x_i) + e \geq u(S), \forall S \subseteq D
+.$$
+
+The least core value $v$ of the $i$-th sample in dataset $D$ wrt.
+utility $u$ is computed by solving the following linear program:
+
+$$
+\begin{array}{lll}
+\text{minimize} & e & \\
+\text{subject to} & \sum_{x_i\in D} v_u(x_i) = u(D) & \\
+& \sum_{x_i\in S} v_u(x_i) + e \geq u(S) &, \forall S \subseteq D \\
+\end{array}
+$$
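The linear program above is small enough to solve directly for toy games. A minimal sketch using `scipy.optimize.linprog` with a hypothetical three-player utility (for illustration only; this is not pyDVL's implementation):

```python
from itertools import chain, combinations

import numpy as np
from scipy.optimize import linprog

# Hypothetical 3-player game: players 0 and 1, or 0 and 2, create all the value.
players = (0, 1, 2)
u = {
    (0,): 0.0, (1,): 0.0, (2,): 0.0,
    (0, 1): 1.0, (0, 2): 1.0, (1, 2): 0.0,
    (0, 1, 2): 1.0,
}

n = len(players)
# All proper, non-empty coalitions (the grand coalition is handled
# by the efficiency constraint below).
subsets = list(chain.from_iterable(combinations(players, r) for r in range(1, n)))

# Decision variables x = (v_0, ..., v_{n-1}, e); objective: minimize e.
c = np.zeros(n + 1)
c[-1] = 1.0

# Coalitional rationality, sum_{i in S} v_i + e >= u(S), rewritten for
# linprog's A_ub @ x <= b_ub convention as -sum_{i in S} v_i - e <= -u(S).
A_ub = np.zeros((len(subsets), n + 1))
b_ub = np.zeros(len(subsets))
for row, S in enumerate(subsets):
    A_ub[row, list(S)] = -1.0
    A_ub[row, -1] = -1.0
    b_ub[row] = -u[S]

# Efficiency: sum_i v_i = u(D).
A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
b_eq = np.array([u[players]])

result = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                 bounds=[(None, None)] * (n + 1))
print("least core values:", result.x[:n], "e* =", result.x[-1])
```

For this game the Core is non-empty, so the optimal $e^{*}$ comes out as 0 and player 0, who is needed to create any value, receives the full payoff.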
+Exact Least Core
+----------------
+
+This first algorithm is just a verbatim implementation of the definition.
+As such it returns as exact a value as the utility function allows
+(see what this means in :ref:`problems of data values`).
+
+.. code-block:: python
+
+   from pydvl.utils import Dataset, Utility
+   from pydvl.value.least_core import exact_least_core
+   model = ...
+   dataset = Dataset(...)
+   utility = Utility(model, dataset)
+   values = exact_least_core(utility)
+
+Monte Carlo Least Core
+----------------------
+
+Because the number of subsets $S \subseteq D \setminus \{x_i\}$ is
+$2^{ | D | - 1 }$, one typically must resort to approximations.
+
+The simplest approximation consists of two relaxations of the Least Core
+(:footcite:t:`yan_procaccia_2021`):
+
+- Further relaxing the coalitional rationality property by
+  a constant value $\epsilon > 0$:
+
+  $$
+  \sum_{x_i\in S} v_u(x_i) + e + \epsilon \geq u(S)
+  $$
+
+- Sampling only a fraction of all possible subsets instead of
+  enumerating them.
+
+Combined, this gives us the following property:
+
+$$
+P_{S\sim D}\left[\sum_{x_i\in S} v_u(x_i) + e^{*} + \epsilon \geq u(S)\right]
+\geq 1 - \delta
+,$$
+
+where $e^{*}$ is the optimal least core value.
+
+With these relaxations, we obtain a polynomial running time.
+
+.. code-block:: python
+
+   from pydvl.utils import Dataset, Utility
+   from pydvl.value.least_core import montecarlo_least_core
+   model = ...
+   dataset = Dataset(...)
+   max_iterations = ...
+   utility = Utility(model, dataset)
+   values = montecarlo_least_core(utility, max_iterations=max_iterations)
+
+.. note::
+
+   ``max_iterations`` needs to be at least equal to the number of data points.
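Continuing the toy LP example from the exact section, a sketch of this Monte Carlo relaxation: sample a limited number of coalitions and impose only their rationality constraints (the utility and all names are made up for illustration; `montecarlo_least_core` above is the supported entry point):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(16)
n = 8
weights = np.linspace(0.5, 1.5, n)

def u(indices):
    """Toy utility with diminishing returns in the coalition's total weight."""
    return float(np.sqrt(weights[indices].sum())) if len(indices) else 0.0

# Sample coalitions uniformly; per the note above, use at least n of them.
m = 64
masks = rng.integers(0, 2, size=(m, n)).astype(bool)
masks = masks[masks.any(axis=1)]  # drop empty coalitions

# Same LP as in the exact case, but with only the sampled constraints.
c = np.zeros(n + 1)
c[-1] = 1.0
A_ub = np.hstack([-masks.astype(float), -np.ones((len(masks), 1))])
b_ub = -np.array([u(np.flatnonzero(row)) for row in masks])
A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
b_eq = np.array([u(np.arange(n))])

result = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                 bounds=[(None, None)] * (n + 1))
print("approximate least core values:", result.x[:n].round(3))
```

With more sampled coalitions the estimate approaches the exact least core value, at the cost of a larger LP.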
 
 Other methods
 =============
 
-Other game-theoretic concepts in pyDVL's roadmap are the **Least Core** (in
-progress), and **Banzhaf indices** (the latter is just a different weighting
-scheme with better numerical stability properties). Contributions are welcome!
+There are other game-theoretic concepts in pyDVL's roadmap, based on the notion
+of a semivalue, which generalizes the Shapley value to different weighting
+schemes: in particular **Banzhaf indices** and **Beta Shapley**, which offer
+better numerical and rank stability in certain situations.
+
+Contributions are welcome!
 
 
 .. _problems of data values:

docs/examples/index.rst

Lines changed: 1 addition & 0 deletions

@@ -11,6 +11,7 @@ The following examples illustrate the usage of pyDVL features.
    :glob:
 
    shapley*
+   least_core*
 
 .. toctree::
    :caption: Influence function

docs/pydvl.bib

Lines changed: 14 additions & 0 deletions

@@ -106,3 +106,17 @@ @inproceedings{wang_improving_2022
   archiveprefix = {arXiv},
   langid = {english}
 }
+
+@article{yan_procaccia_2021,
+  title = {If You Like Shapley Then You’ll Love the Core},
+  volume = {35},
+  url = {https://ojs.aaai.org/index.php/AAAI/article/view/16721},
+  doi = {10.1609/aaai.v35i6.16721},
+  abstract = {The prevalent approach to problems of credit assignment in machine learning -- such as feature and data valuation -- is to model the problem at hand as a cooperative game and apply the Shapley value. But cooperative game theory offers a rich menu of alternative solution concepts, which famously includes the core and its variants. Our goal is to challenge the machine learning community’s current consensus around the Shapley value, and make a case for the core as a viable alternative. To that end, we prove that arbitrarily good approximations to the least core -- a core relaxation that is always feasible -- can be computed efficiently (but prove an impossibility for a more refined solution concept, the nucleolus). We also perform experiments that corroborate these theoretical results and shed light on settings where the least core may be preferable to the Shapley value.},
+  number = {6},
+  journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
+  author = {Yan, Tom and Procaccia, Ariel D.},
+  year = {2021},
+  month = {May},
+  pages = {5751-5759}
+}
