
Commit bfa3c71

Merge branch 'develop' into fix/cleanup

2 parents f530e20 + 633621e

25 files changed: +655 −177 lines

CHANGELOG.md

Lines changed: 12 additions & 10 deletions
@@ -1,6 +1,6 @@
 # Changelog
 
-## Unreleased - 💥 More breaking changes!
+## Unreleased - 💥 More breaking changes and new algorithms 🏭
 
 - GH action to mark issues as stale
   [PR #201](https://github.com/appliedAI-Initiative/pyDVL/pull/201)
@@ -15,18 +15,20 @@
   new example notebook
   [PR #195](https://github.com/appliedAI-Initiative/pyDVL/pull/195)
 - **Breaking change**: Passes the input to `MapReduceJob` at initialization,
-  removes `chunkify_inputs` argument from `MapReduceJob`,
-  removes `n_runs` argument from `MapReduceJob`,
-  calls the parallel backend's `put()` method for each generated chunk in `_chunkify()`,
-  renames ParallelConfig's `num_workers` attribute to `n_local_workers`,
-  fixes a bug in `MapReduceJob`'s chunkification when `n_runs` >= `n_jobs`,
-  and defines a sequential parallel backend to run all jobs in the current thread
+  removes `chunkify_inputs` argument from `MapReduceJob`, removes `n_runs`
+  argument from `MapReduceJob`, calls the parallel backend's `put()` method for
+  each generated chunk in `_chunkify()`, renames ParallelConfig's `num_workers`
+  attribute to `n_local_workers`, fixes a bug in `MapReduceJob`'s chunkification
+  when `n_runs` >= `n_jobs`, and defines a sequential parallel backend to run
+  all jobs in the current thread
   [PR #232](https://github.com/appliedAI-Initiative/pyDVL/pull/232)
 - **New method**: Implements exact and monte carlo Least Core for data valuation,
-  adds `from_arrays()` class method to the `Dataset` and `GroupedDataset` classes,
-  adds `extra_values` argument to `ValuationResult`,
-  adds `compute_removal_score()` and `compute_random_removal_score()` helper functions
+  adds `from_arrays()` class method to the `Dataset` and `GroupedDataset`
+  classes, adds `extra_values` argument to `ValuationResult`, adds
+  `compute_removal_score()` and `compute_random_removal_score()` helper functions
   [PR #237](https://github.com/appliedAI-Initiative/pyDVL/pull/237)
+- **New method**: Group Testing Shapley for valuation, from _Jia et al. 2019_
+  [PR #240](https://github.com/appliedAI-Initiative/pyDVL/pull/240)
 - Fixes bug in ray initialization in `RayParallelBackend` class
   [PR #239](https://github.com/appliedAI-Initiative/pyDVL/pull/239)
 
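For readers tracking the API change above: a minimal, hypothetical sketch of the new `MapReduceJob` calling convention. Only what the changelog entry states is confirmed (inputs passed at initialization, no `chunkify_inputs` or `n_runs`, `ParallelConfig` with `n_local_workers`, and a sequential backend running in the current thread); the import paths, the `map_func`/`reduce_func` keyword names, and the invocation style are assumptions for illustration, not the confirmed API.

    # Hypothetical sketch, not the confirmed pyDVL API
    import numpy as np

    from pydvl.utils.config import ParallelConfig   # import path assumed
    from pydvl.utils.parallel import MapReduceJob   # import path assumed

    # Per the changelog, the sequential backend runs all jobs in the current thread
    config = ParallelConfig(backend="sequential", n_local_workers=1)

    job = MapReduceJob(
        np.arange(100),     # inputs are now passed at initialization
        map_func=np.sum,    # keyword names assumed
        reduce_func=np.sum,
        config=config,
    )
    result = job()          # invocation style assumed; could equally be job.run()
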
README.md

Lines changed: 8 additions & 3 deletions
@@ -44,11 +44,16 @@ methods from the following papers:
   Proceedings of the VLDB Endowment 12, no. 11 (1 July 2019): 1610–23.
 - Okhrati, Ramin, and Aldo Lipani.
   [A Multilinear Sampling Algorithm to Estimate Shapley Values](https://doi.org/10.1109/ICPR48806.2021.9412511).
-  In 2020 25th International Conference on Pattern Recognition (ICPR), 7992–99.
+  In 25th International Conference on Pattern Recognition (ICPR 2020), 7992–99.
   IEEE, 2021.
-- Yan, T., & Procaccia, A. D.
+- Yan, T., & Procaccia, A. D.
   [If You Like Shapley Then You’ll Love the Core]().
   Proceedings of the AAAI Conference on Artificial Intelligence, 35(6) (2021): 5751-5759.
+- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve
+  Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J. Spanos.
+  [Towards Efficient Data Valuation Based on the Shapley Value](http://proceedings.mlr.press/v89/jia19a.html).
+  In 22nd International Conference on Artificial Intelligence and Statistics,
+  1167–76. PMLR, 2019.
 
 Influence Functions compute the effect that single points have on an estimator /
 model. We implement methods from the following papers:
@@ -106,7 +111,7 @@ dataset = Dataset(X_train, y_train, X_test, y_test)
 model = LinearRegression()
 utility = Utility(model, dataset)
 values = compute_shapley_values(
-    u=utility, max_iterations=100, mode="truncated_montecarlo"
+    u=utility, n_iterations=100, mode="truncated_montecarlo"
 )
 ```
 
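Pieced together from the context lines above, a self-contained version of the updated README snippet might look as follows. The synthetic data and train/test split are illustrative additions, not part of the commit; the `Dataset`, `Utility` and `compute_shapley_values` usage comes straight from the diff.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    from pydvl.utils import Dataset, Utility
    from pydvl.value.shapley import compute_shapley_values

    # Illustrative synthetic regression data (not part of the commit)
    rng = np.random.default_rng(16)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=16)

    dataset = Dataset(X_train, y_train, X_test, y_test)
    model = LinearRegression()
    utility = Utility(model, dataset)
    values = compute_shapley_values(
        u=utility, n_iterations=100, mode="truncated_montecarlo"
    )
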
docs/30-data-valuation.rst

Lines changed: 52 additions & 5 deletions
@@ -273,7 +273,7 @@ values in pyDVL. First construct the dataset and utility, then call
    dataset = Dataset(...)
    utility = Utility(data, model)
    values = compute_shapley_values(
-      u=utility, mode="owen", max_iterations=4, max_q=200
+      u=utility, mode="owen", n_iterations=4, max_q=200
    )
 
 There are more details on Owen
@@ -309,7 +309,7 @@ efficient enough to be useful in some applications.
    data = Dataset(...)
    utility = Utility(model, data)
    values = compute_shapley_values(
-      u=utility, mode="truncated_montecarlo", max_iterations=100
+      u=utility, mode="truncated_montecarlo", n_iterations=100
    )
 
 
@@ -333,6 +333,53 @@ and can be used in pyDVL with:
    utility = Utility(model, data)
    values = compute_shapley_values(u=utility, mode="knn")
 
+
+Group testing
+^^^^^^^^^^^^^
+
+An alternative approach introduced in :footcite:t:`jia_efficient_2019a`
+first approximates the differences of values with a Monte Carlo sum. With
+
+$$\hat{\Delta}_{i j} \approx v_i - v_j,$$
+
+one then solves the following linear constraint satisfaction problem (CSP) to
+infer the final values:
+
+$$
+\begin{array}{lll}
+\sum_{i = 1}^N v_i & = & U (D)\\
+| v_i - v_j - \hat{\Delta}_{i j} | & \leqslant &
+\frac{\varepsilon}{2 \sqrt{N}}
+\end{array}
+$$
+
+.. warning::
+   We have reproduced this method in pyDVL for completeness and benchmarking,
+   but we don't advocate its use because of the speed and memory cost. Despite
+   our best efforts, the number of samples required in practice for convergence
+   can be several orders of magnitude worse than with e.g. Truncated Monte Carlo.
+
+Usage follows the same pattern as every other Shapley method, but with the
+addition of an ``eps`` parameter required for the solution of the CSP. It should
+be the same value used to compute the minimum number of samples required. This
+can be done with :func:`~pydvl.value.shapley.gt.num_samples_eps_delta`, but note
+that the number returned will be huge! In practice, fewer samples can be enough,
+but the actual number will strongly depend on the utility, in particular its
+variance.
+
+.. code-block:: python
+
+   from pydvl.utils import Dataset, Utility
+   from pydvl.value.shapley import compute_shapley_values
+
+   model = ...
+   data = Dataset(...)
+   utility = Utility(model, data, score_range=(_min, _max))
+   min_iterations = num_samples_eps_delta(epsilon, delta, n, utility.score_range)
+   values = compute_shapley_values(
+      u=utility, mode="group_testing", n_iterations=min_iterations, eps=eps
+   )
+
 .. _Least Core:
 
 Core values
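
Two remarks on the new section above. First, the snippet in the hunk uses `num_samples_eps_delta` without importing it; per the :func: reference its documented home is `pydvl.value.shapley.gt`, and `epsilon`, `delta`, `n`, `eps`, `_min` and `_max` are placeholders for user-supplied values. Second, the CSP is a plain linear feasibility problem, and this commit adds `scipy>=1.7.0` to requirements.txt (see below). Purely as an illustration, and without claiming this is pyDVL's actual solver, here is a hedged sketch posing the constraints with `scipy.optimize.linprog`, where `delta_hat[i, j]` holds the Monte Carlo estimates and `total_utility` is U(D):

    import numpy as np
    from scipy.optimize import linprog

    def solve_gt_csp(delta_hat: np.ndarray, total_utility: float, eps: float) -> np.ndarray:
        """Feasibility LP for the group-testing constraints sketched above."""
        n = delta_hat.shape[0]
        bound = eps / (2 * np.sqrt(n))
        A_ub, b_ub = [], []
        for i in range(n):
            for j in range(i + 1, n):
                row = np.zeros(n)
                row[i], row[j] = 1.0, -1.0
                A_ub.append(row)                     # v_i - v_j <= delta_hat[i, j] + bound
                b_ub.append(delta_hat[i, j] + bound)
                A_ub.append(-row)                    # v_j - v_i <= -delta_hat[i, j] + bound
                b_ub.append(-delta_hat[i, j] + bound)
        # Efficiency: the values must sum to the total utility U(D)
        A_eq, b_eq = np.ones((1, n)), np.array([total_utility])
        # Zero objective: any feasible point satisfies the CSP
        result = linprog(
            c=np.zeros(n),
            A_ub=np.asarray(A_ub), b_ub=np.asarray(b_ub),
            A_eq=A_eq, b_eq=b_eq,
            bounds=[(None, None)] * n,
            method="highs",
        )
        return result.x
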
@@ -441,13 +488,13 @@ With these relaxations, we obtain a polynomial running time.
    from pydvl.value.least_core import montecarlo_least_core
    model = ...
    dataset = Dataset(...)
-   max_iterations = ...
+   n_iterations = ...
    utility = Utility(data, model)
-   values = montecarlo_least_core(utility, max_iterations=max_iterations)
+   values = montecarlo_least_core(utility, n_iterations=n_iterations)
 
 .. note::
 
-   ``max_iterations`` needs to be at least equal to the number of data points.
+   ``n_iterations`` needs to be at least equal to the number of data points.
 
 Other methods
 =============
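
As with the Shapley snippets, a minimal sketch of the renamed Least Core call, reusing the `utility` and `dataset` objects from the surrounding examples; the lower bound on `n_iterations` comes straight from the note in the hunk, and the concrete budget is an illustrative assumption:

    from pydvl.value.least_core import montecarlo_least_core

    # The note above requires n_iterations >= number of data points
    n_iterations = max(len(dataset), 1000)  # 1000 is an arbitrary illustrative budget
    values = montecarlo_least_core(utility, n_iterations=n_iterations)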

docs/pydvl.bib

Lines changed: 15 additions & 0 deletions
@@ -17,6 +17,21 @@ @inproceedings{ghorbani_data_2019
   langid = {english}
 }
 
+@inproceedings{jia_efficient_2019,
+  title = {Towards {{Efficient Data Valuation Based}} on the {{Shapley Value}}},
+  booktitle = {Proceedings of the 22nd {{International Conference}} on {{Artificial Intelligence}} and {{Statistics}}},
+  author = {Jia, Ruoxi and Dao, David and Wang, Boxin and Hubis, Frances Ann and Hynes, Nick and G{\"u}rel, Nezihe Merve and Li, Bo and Zhang, Ce and Song, Dawn and Spanos, Costas J.},
+  year = {2019},
+  month = apr,
+  pages = {1167--1176},
+  publisher = {{PMLR}},
+  issn = {2640-3498},
+  url = {http://proceedings.mlr.press/v89/jia19a.html},
+  urldate = {2021-02-12},
+  abstract = {``How much is my data worth?'' is an increasingly common question posed by organizations and individuals alike. An answer to this question could allow, for instance, fairly distributing profits...},
+  langid = {english}
+}
+
 @article{jia_efficient_2019a,
   title = {Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms},
   shorttitle = {{{VLDB}} 2019},

notebooks/influence_synthetic.ipynb

Lines changed: 1 addition & 1 deletion
@@ -670,7 +670,7 @@
     "    y_test=test_data[1].astype(float),\n",
     "    influence_type=\"up\",\n",
     "    inversion_method=\"cg\",\n",
-    "    inversion_method_kwargs={\"max_iterations\": 10, \"max_step_size\": 1},\n",
+    "    inversion_method_kwargs={\"n_iterations\": 10, \"max_step_size\": 1},\n",
     ")\n",
     "mean_train_influences = np.mean(influence_values, axis=0)\n",
     "\n",

notebooks/least_core_basic.ipynb

Lines changed: 3 additions & 3 deletions
@@ -442,7 +442,7 @@
     "    column_name = f\"estimated_value_{budget}\"\n",
     "    values = montecarlo_least_core(\n",
     "        u=utility,\n",
-    "        max_iterations=budget,\n",
+    "        n_iterations=budget,\n",
     "        n_jobs=n_jobs,\n",
     "        progress=False,\n",
     "    )\n",
@@ -649,7 +649,7 @@
     "    else:\n",
     "        values = montecarlo_least_core(\n",
     "            u=utility,\n",
-    "            max_iterations=budget,\n",
+    "            n_iterations=budget,\n",
     "            n_jobs=n_jobs,\n",
     "            progress=False,\n",
     "        )\n",
@@ -775,7 +775,7 @@
     "    else:\n",
     "        values = montecarlo_least_core(\n",
     "            u=utility,\n",
-    "            max_iterations=budget,\n",
+    "            n_iterations=budget,\n",
     "            n_jobs=n_jobs,\n",
     "            progress=False,\n",
     "        )\n",

notebooks/shapley_basic_spotify.ipynb

Lines changed: 2 additions & 2 deletions
@@ -415,7 +415,7 @@
     "    enable_cache=enable_cache,\n",
     ")\n",
     "dvl_df = compute_shapley_values(\n",
-    "    utility, max_iterations=200, n_jobs=available_cpus(), mode=\"truncated_montecarlo\"\n",
+    "    utility, n_iterations=200, n_jobs=available_cpus(), mode=\"truncated_montecarlo\"\n",
     ").to_dataframe(column=\"data_value\")"
    ]
   },
@@ -703,7 +703,7 @@
     "    enable_cache=True,\n",
     ")\n",
     "dvl_df = compute_shapley_values(\n",
-    "    utility, max_iterations=100, n_jobs=available_cpus(), mode=\"truncated_montecarlo\"\n",
+    "    utility, n_iterations=100, n_jobs=available_cpus(), mode=\"truncated_montecarlo\"\n",
     ").to_dataframe(column=\"data_value\")"
    ]
   },

notebooks/shapley_utility_learning.ipynb

Lines changed: 2 additions & 2 deletions
@@ -295,7 +295,7 @@
     "        mode=\"truncated_montecarlo\",\n",
     "        n_jobs=min(len(dataset), available_cpus()),\n",
     "        progress=False,\n",
-    "        max_iterations=2 ** len(dataset),  # DUL will kick in after training_budget\n",
+    "        n_iterations=2 ** len(dataset),  # DUL will kick in after training_budget\n",
     "    )\n",
     "    .to_dataframe(column=f\"{budget}_{idx}\")\n",
     "    .drop(columns=[f\"{budget}_{idx}_stderr\"])\n",
@@ -567,7 +567,7 @@
     "    u=dul_utility,\n",
     "    mode=\"truncated_montecarlo\",\n",
     "    n_jobs=min(len(corrupted_dataset), available_cpus()),\n",
-    "    max_iterations=2 ** len(corrupted_dataset),\n",
+    "    n_iterations=2 ** len(corrupted_dataset),\n",
     "    progress=False,\n",
     ").to_dataframe(column=\"estimated\")\n",
     "df_corrupted = pd.concat([df_corrupted, dul_df], axis=1)"

requirements.txt

Lines changed: 2 additions & 1 deletion
@@ -6,4 +6,5 @@ joblib
 pymemcache
 cloudpickle
 tqdm
-matplotlib
+matplotlib
+scipy>=1.7.0
