
Commit 3796dc8

Merge branch 'develop' into feature/dul-extensions

2 parents: d7fb6ff + 9f2a5f2

Note: large commits have some content hidden by default; only a subset of the changed files appears below.

64 files changed: +563, -540 lines.

CONTRIBUTING.md

Lines changed: 16 additions & 15 deletions
```diff
@@ -92,7 +92,7 @@ failing pipelines. tox will:
 * generate coverage reports in html, as well as badges.

 You can configure pytest, coverage and ruff by adjusting
-[pyproject.toml](pyproject.toml).
+[pyproject.toml](https://github.com/aai-institute/pyDVL/blob/develop/pyproject.toml).

 Besides the usual unit tests, most algorithms are tested using pytest. This
 requires ray for the parallelization and Memcached for caching. Please install
```
```diff
@@ -132,11 +132,11 @@ There are a few important arguments:
   of slow tests.

 - `--with-cuda` sets the device fixture in [tests/influence/torch/conftest.py](
-  tests/influence/torch/conftest.py) to `cuda` if it is available.
-  Using this fixture within tests, you can run parts of your tests on a `cuda`
-  device. Be aware, that you still have to take care of the usage of the device
-  manually in a specific test. Setting this flag does not result in
-  running all tests on a GPU.
+  https://github.com/aai-institute/pyDVL/blob/develop/tests/influence/torch/conftest.py)
+  to `cuda` if it is available. Using this fixture within tests, you can run parts
+  of your tests on a `cuda` device. Be aware, that you still have to take care of
+  the usage of the device manually in a specific test. Setting this flag does not
+  result in running all tests on a GPU.

 ### Markers

```
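
A minimal sketch of how a test might use the device fixture described in the rewrapped bullet above. The fixture name `device` and the test body are illustrative assumptions based on the bullet's description, not code taken from the repository.

```python
import torch


def test_runs_on_fixture_device(device):
    # `device` is assumed to be the fixture from
    # tests/influence/torch/conftest.py: with --with-cuda it resolves to
    # "cuda" when a GPU is available, and to "cpu" otherwise. As the bullet
    # notes, each test must still place its tensors on the device itself.
    x = torch.ones(4, 2, device=device)
    assert x.device.type in ("cpu", "cuda")
```
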
```diff
@@ -384,7 +384,8 @@ library](https://www.zotero.org/groups/2703043/transferlab/library). All other
 contributors just add the bibtex data, and a maintainer will add it to the group
 library upon merging.

-To add a citation inside a markdown file, use the notation `[@citekey]`. Alas,
+To add a citation inside a markdown file, use the notation `[@ citekey]` (with
+no space). Alas,
 because of when mkdocs-bibtex enters the pipeline, it won't process docstrings.
 For module documentation, we manually inject html into the markdown files. For
 example, in `pydvl.value.shapley.montecarlo` we have:
```
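
An illustration of the citation notation this hunk documents, written here with the space removed so it matches actual usage. The surrounding sentence is invented; the citekey `ghorbani_data_2019` is real and appears in `docs/assets/pydvl.bib`, which this same commit touches.

```markdown
Data Shapley [@ghorbani_data_2019] applies the Shapley value to data valuation.
```

mkdocs-bibtex expands such keys into references when the docs are built but, as the hunk notes, not inside docstrings.
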
```diff
@@ -440,7 +441,7 @@ use braces for legibility like in the first example.
 ### Abbreviations

 We keep the abbreviations used in the documentation inside the
-[docs_include/abbreviations.md](docs_includes%2Fabbreviations.md) file.
+[docs_include/abbreviations.md](https://github.com/aai-institute/pyDVL/blob/develop/docs_includes%2Fabbreviations.md) file.

 The syntax for abbreviations is:

```
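
The hunk is cut off just before the syntax itself. For orientation, a sketch of the standard Markdown abbreviation syntax that such files use; the concrete entry is an assumed example, not quoted from `docs_include/abbreviations.md`:

```markdown
*[ML]: Machine Learning
```

Any later occurrence of `ML` in the rendered docs would then show the expansion as a tooltip.
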
```diff
@@ -569,7 +570,7 @@ act -j lint
 act --artifact-server-path /tmp/artifacts

 # Run a job in a specific workflow (useful if you have duplicate job names)
-act -j lint -W .github/workflows/tox.yml
+act -j lint -W .github/workflows/publish.yml

 # Run in dry-run mode:
 act -n
```
```diff
@@ -727,9 +728,10 @@ PYPI_PASSWORD
 The first 2 are used after tests run on the develop branch's CI workflow
 to automatically publish packages to [TestPyPI](https://test.pypi.org/).

-The last 2 are used in the [publish.yaml](.github/workflows/publish.yaml) CI
-workflow to publish packages to [PyPI](https://pypi.org/) from `develop` after
-a GitHub release.
+The last 2 are used in the
+[publish.yaml](https://github.com/aai-institute/pyDVL/blob/develop/.github/workflows/publish.yaml)
+CI workflow to publish packages to [PyPI](https://pypi.org/) from `develop`
+after a GitHub release.

 #### Publish to TestPyPI

```
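
As a rough sketch of how these repository secrets feed a publishing step; the step below is hypothetical, not copied from publish.yaml, and the secret name `PYPI_USERNAME` is inferred from the naming pattern in the hunk header:

```yaml
# Hypothetical excerpt; the real job is defined in .github/workflows/publish.yaml.
- name: Publish to PyPI
  env:
    TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
    TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
  run: twine upload dist/*
```
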
```diff
@@ -738,6 +740,5 @@ the build part of the version number without commiting or tagging the change
 and then publish a package to TestPyPI from CI using Twine. The version
 has the GitHub run number appended.

-For more details refer to the files
-[.github/workflows/publish.yaml](.github/workflows/publish.yaml) and
-[.github/workflows/tox.yaml](.github/workflows/tox.yaml).
+For more details refer to the file
+[.github/workflows/publish.yaml](https://github.com/aai-institute/pyDVL/blob/develop/.github/workflows/publish.yaml).
```
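
For orientation, the manual equivalent of what the CI job automates might look like the following; the exact invocations live in the workflow file linked above, so treat this as a sketch:

```shell
python -m build                            # build sdist and wheel into dist/ (requires the `build` package)
twine upload --repository testpypi dist/*  # upload the build to TestPyPI
```
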

docs/assets/pydvl.bib

Lines changed: 26 additions & 14 deletions
```diff
@@ -6,7 +6,7 @@ @article{agarwal_secondorder_2017
   shortjournal = {JMLR},
   volume = {18},
   eprint = {1602.03943},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   pages = {1--40},
   url = {https://www.jmlr.org/papers/v18/16-491.html},
   abstract = {First-order stochastic methods are the state-of-the-art in large-scale machine learning optimization owing to efficient per-iteration complexity. Second-order methods, while able to provide faster convergence, have been much less explored due to the high cost of computing the second-order information. In this paper we develop second-order stochastic methods for optimization problems in machine learning that match the per-iteration cost of gradient based methods, and in certain settings improve upon the overall running time over popular first-order methods. Furthermore, our algorithm has the desirable property of being implementable in time linear in the sparsity of the input data.},
@@ -67,7 +67,7 @@ @unpublished{broderick_automatic_2021
   author = {Broderick, Tamara and Giordano, Ryan and Meager, Rachael},
   date = {2021-11-03},
   eprint = {2011.14999},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   url = {https://arxiv.org/abs/2011.14999},
   abstract = {We propose a method to assess the sensitivity of econometric analyses to the removal of a small fraction of the data. Manually checking the influence of all possible small subsets is computationally infeasible, so we provide an approximation to find the most influential subset. Our metric, the "Approximate Maximum Influence Perturbation," is automatically computable for common methods including (but not limited to) OLS, IV, MLE, GMM, and variational Bayes. We provide finite-sample error bounds on approximation performance. At minimal extra cost, we provide an exact finite-sample lower bound on sensitivity. We find that sensitivity is driven by a signal-to-noise ratio in the inference problem, is not reflected in standard errors, does not disappear asymptotically, and is not due to misspecification. While some empirical applications are robust, results of several economics papers can be overturned by removing less than 1\% of the sample.},
   langid = {english},
@@ -118,7 +118,7 @@ @inproceedings{george_fast_2018
   date = {2018},
   volume = {31},
   eprint = {1806.03884},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   publisher = {Curran Associates, Inc.},
   url = {https://proceedings.neurips.cc/paper/2018/hash/48000647b315f6f00f913caa757a70b3-Abstract.html},
   urldate = {2024-01-12},
@@ -133,7 +133,7 @@ @inproceedings{ghorbani_data_2019
   author = {Ghorbani, Amirata and Zou, James},
   date = {2019-05-24},
   eprint = {1904.02868},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   pages = {2242--2251},
   publisher = {PMLR},
   issn = {2640-3498},
@@ -251,7 +251,7 @@ @inproceedings{koh_understanding_2017
   author = {Koh, Pang Wei and Liang, Percy},
   date = {2017-07-17},
   eprint = {1703.04730},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   pages = {1885--1894},
   publisher = {PMLR},
   url = {https://proceedings.mlr.press/v70/koh17a.html},
@@ -283,7 +283,7 @@ @inproceedings{kwon_beta_2022
   date = {2022-01-18},
   volume = {151},
   eprint = {2110.14049},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   publisher = {PMLR},
   location = {Valencia, Spain},
   url = {https://arxiv.org/abs/2110.14049},
@@ -329,7 +329,7 @@ @inproceedings{kwon_efficient_2021
   author = {Kwon, Yongchan and Rivas, Manuel A. and Zou, James},
   date = {2021-03-18},
   eprint = {2007.01357},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   pages = {793--801},
   publisher = {PMLR},
   issn = {2640-3498},
@@ -361,7 +361,7 @@ @article{maleki_bounding_2014
   date = {2014-02-12},
   journaltitle = {ArXiv13064265 Cs},
   eprint = {1306.4265},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   eprintclass = {cs},
   url = {https://arxiv.org/abs/1306.4265},
   urldate = {2020-11-16},
@@ -404,7 +404,7 @@ @inproceedings{okhrati_multilinear_2021
   author = {Okhrati, Ramin and Lipani, Aldo},
   date = {2021-01},
   eprint = {2010.12082},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   pages = {7992--7999},
   publisher = {IEEE},
   issn = {1051-4651},
@@ -425,7 +425,7 @@ @article{schioppa_scaling_2022
   volume = {36},
   number = {8},
   eprint = {2112.03052},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   pages = {8179--8186},
   issn = {2374-3468},
   doi = {10.1609/aaai.v36i8.20791},
@@ -485,7 +485,7 @@ @inproceedings{wang_improving_2022
   author = {Wang, Tianhao and Yang, Yu and Jia, Ruoxi},
   date = {2022-04-07},
   eprint = {2107.06336v2},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   publisher = {arXiv},
   doi = {10.48550/arXiv.2107.06336},
   url = {https://arxiv.org/abs/2107.06336v2},
@@ -501,13 +501,13 @@ @online{watson_accelerated_2023
   author = {Watson, Lauren and Kujawa, Zeno and Andreeva, Rayna and Yang, Hao-Tsung and Elahi, Tariq and Sarkar, Rik},
   date = {2023-11-09},
   eprint = {2311.05346},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   eprintclass = {cs},
   doi = {10.48550/arXiv.2311.05346},
   url = {https://arxiv.org/abs/2311.05346},
   urldate = {2023-12-07},
   abstract = {Data valuation has found various applications in machine learning, such as data filtering, efficient learning and incentives for data sharing. The most popular current approach to data valuation is the Shapley value. While popular for its various applications, Shapley value is computationally expensive even to approximate, as it requires repeated iterations of training models on different subsets of data. In this paper we show that the Shapley value of data points can be approximated more efficiently by leveraging the structural properties of machine learning problems. We derive convergence guarantees on the accuracy of the approximate Shapley value for different learning settings including Stochastic Gradient Descent with convex and non-convex loss functions. Our analysis suggests that in fact models trained on small subsets are more important in the context of data valuation. Based on this idea, we describe \$\textbackslash delta\$-Shapley -- a strategy of only using small subsets for the approximation. Experiments show that this approach preserves approximate value and rank of data, while achieving speedup of up to 9.9x. In pre-trained networks the approach is found to bring more efficiency in terms of accurate evaluation using small subsets.},
-  pubstate = {preprint}
+  pubstate = {prepublished}
 }

 @inproceedings{wu_davinz_2022,
@@ -528,7 +528,7 @@ @inproceedings{wu_davinz_2022

 @inproceedings{yan_if_2021,
   title = {If {{You Like Shapley Then You}}’ll {{Love}} the {{Core}}},
-  booktitle = {Proceedings of the 35th {{AAAI Conference}} on {{Artificial Intelligence}}, 2021},
+  booktitle = {Proceedings of the 35th {{AAAI Conference}} on {{Artificial Intelligence}}},
   author = {Yan, Tom and Procaccia, Ariel D.},
   date = {2021-05-18},
   volume = {6},
@@ -543,3 +543,15 @@ @inproceedings{yan_if_2021
   langid = {english},
   keywords = {notion}
 }
+
+@inproceedings{zaheer_deep_2017,
+  title = {Deep {{Sets}}},
+  booktitle = {Advances in {{Neural Information Processing Systems}}},
+  author = {Zaheer, Manzil and Kottur, Satwik and Ravanbakhsh, Siamak and Poczos, Barnabas and Salakhutdinov, Russ R and Smola, Alexander J},
+  date = {2017},
+  volume = {30},
+  publisher = {Curran Associates, Inc.},
+  url = {https://papers.nips.cc/paper_files/paper/2017/hash/f22e4747da1aa27e363d86d40ff442fe-Abstract.html},
+  urldate = {2025-03-03},
+  abstract = {We study the problem of designing models for machine learning tasks defined on sets. In contrast to the traditional approach of operating on fixed dimensional vectors, we consider objective functions defined on sets and are invariant to permutations. Such problems are widespread, ranging from the estimation of population statistics, to anomaly detection in piezometer data of embankment dams, to cosmology. Our main theorem characterizes the permutation invariant objective functions and provides a family of functions to which any permutation invariant objective function must belong. This family of functions has a special structure which enables us to design a deep network architecture that can operate on sets and which can be deployed on a variety of scenarios including both unsupervised and supervised learning tasks. We demonstrate the applicability of our method on population statistic estimation, point cloud classification, set expansion, and outlier detection.}
+}
```

docs/examples/index.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -5,7 +5,7 @@ alias:
   text: Example gallery
 ---

-## Data valuation
+## Data valuation { #data-valuation-example-gallery }

 <div class="grid cards" markdown>

```
docs/getting-started/advanced-usage.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -18,7 +18,7 @@ using one of Dask, Ray or Joblib. The first is used in
 the [influence][pydvl.influence] package whereas the other two
 are used in the [value][pydvl.value] package.

-### Data valuation
+### Data valuation { #setting-up-parallelization-data-valuation }

 For data valuation, pyDVL uses [joblib](https://joblib.readthedocs.io/en/latest/) for local
 parallelization (within one machine) and supports using
```
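
The attribute-list IDs added in this hunk and the previous one give these headings stable anchors that survive renames. An illustration of how such an anchor could be referenced from another docs page (the linking page and relative path are hypothetical):

```markdown
See the [data valuation examples](../examples/index.md#data-valuation-example-gallery)
for complete notebooks.
```
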
