
Commit 3796dc8

Merge branch 'develop' into feature/dul-extensions

2 parents: d7fb6ff + 9f2a5f2

Note: large commits have some content hidden by default; only a subset of the changed files appears below.

64 files changed: +563, -540 lines.

CONTRIBUTING.md

Lines changed: 16 additions & 15 deletions
```diff
@@ -92,7 +92,7 @@ failing pipelines. tox will:
 * generate coverage reports in html, as well as badges.

 You can configure pytest, coverage and ruff by adjusting
-[pyproject.toml](pyproject.toml).
+[pyproject.toml](https://github.com/aai-institute/pyDVL/blob/develop/pyproject.toml).

 Besides the usual unit tests, most algorithms are tested using pytest. This
 requires ray for the parallelization and Memcached for caching. Please install
```
```diff
@@ -132,11 +132,11 @@ There are a few important arguments:
   of slow tests.

 - `--with-cuda` sets the device fixture in [tests/influence/torch/conftest.py](
-  tests/influence/torch/conftest.py) to `cuda` if it is available.
-  Using this fixture within tests, you can run parts of your tests on a `cuda`
-  device. Be aware, that you still have to take care of the usage of the device
-  manually in a specific test. Setting this flag does not result in
-  running all tests on a GPU.
+  https://github.com/aai-institute/pyDVL/blob/develop/tests/influence/torch/conftest.py)
+  to `cuda` if it is available. Using this fixture within tests, you can run parts
+  of your tests on a `cuda` device. Be aware, that you still have to take care of
+  the usage of the device manually in a specific test. Setting this flag does not
+  result in running all tests on a GPU.

 ### Markers

```
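
A minimal sketch of how a test might use the device fixture described in the rewrapped bullet above. The fixture name `device` and the test body are illustrative assumptions based on the bullet's description, not code taken from the repository.

```python
import torch


def test_runs_on_fixture_device(device):
    # `device` is assumed to be the fixture from
    # tests/influence/torch/conftest.py: with --with-cuda it resolves to
    # "cuda" when a GPU is available, and to "cpu" otherwise. As the bullet
    # notes, each test must still place its tensors on the device itself.
    x = torch.ones(4, 2, device=device)
    assert x.device.type in ("cpu", "cuda")
```
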
```diff
@@ -384,7 +384,8 @@ library](https://www.zotero.org/groups/2703043/transferlab/library). All other
 contributors just add the bibtex data, and a maintainer will add it to the group
 library upon merging.

-To add a citation inside a markdown file, use the notation `[@citekey]`. Alas,
+To add a citation inside a markdown file, use the notation `[@ citekey]` (with
+no space). Alas,
 because of when mkdocs-bibtex enters the pipeline, it won't process docstrings.
 For module documentation, we manually inject html into the markdown files. For
 example, in `pydvl.value.shapley.montecarlo` we have:
```
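
An illustration of the citation notation this hunk documents, written here with the space removed so it matches actual usage. The surrounding sentence is invented; the citekey `ghorbani_data_2019` is real and appears in `docs/assets/pydvl.bib`, which this same commit touches.

```markdown
Data Shapley [@ghorbani_data_2019] applies the Shapley value to data valuation.
```

mkdocs-bibtex expands such keys into references when the docs are built but, as the hunk notes, not inside docstrings.
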
```diff
@@ -440,7 +441,7 @@ use braces for legibility like in the first example.
 ### Abbreviations

 We keep the abbreviations used in the documentation inside the
-[docs_include/abbreviations.md](docs_includes%2Fabbreviations.md) file.
+[docs_include/abbreviations.md](https://github.com/aai-institute/pyDVL/blob/develop/docs_includes%2Fabbreviations.md) file.

 The syntax for abbreviations is:

```
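
The hunk is cut off just before the syntax itself. For orientation, a sketch of the standard Markdown abbreviation syntax that such files use; the concrete entry is an assumed example, not quoted from `docs_include/abbreviations.md`:

```markdown
*[ML]: Machine Learning
```

Any later occurrence of `ML` in the rendered docs would then show the expansion as a tooltip.
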
```diff
@@ -569,7 +570,7 @@ act -j lint
 act --artifact-server-path /tmp/artifacts

 # Run a job in a specific workflow (useful if you have duplicate job names)
-act -j lint -W .github/workflows/tox.yml
+act -j lint -W .github/workflows/publish.yml

 # Run in dry-run mode:
 act -n
```
```diff
@@ -727,9 +728,10 @@ PYPI_PASSWORD
 The first 2 are used after tests run on the develop branch's CI workflow
 to automatically publish packages to [TestPyPI](https://test.pypi.org/).

-The last 2 are used in the [publish.yaml](.github/workflows/publish.yaml) CI
-workflow to publish packages to [PyPI](https://pypi.org/) from `develop` after
-a GitHub release.
+The last 2 are used in the
+[publish.yaml](https://github.com/aai-institute/pyDVL/blob/develop/.github/workflows/publish.yaml)
+CI workflow to publish packages to [PyPI](https://pypi.org/) from `develop`
+after a GitHub release.

 #### Publish to TestPyPI

```
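
As a rough sketch of how these repository secrets feed a publishing step; the step below is hypothetical, not copied from publish.yaml, and the secret name `PYPI_USERNAME` is inferred from the naming pattern in the hunk header:

```yaml
# Hypothetical excerpt; the real job is defined in .github/workflows/publish.yaml.
- name: Publish to PyPI
  env:
    TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
    TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
  run: twine upload dist/*
```
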
```diff
@@ -738,6 +740,5 @@ the build part of the version number without commiting or tagging the change
 and then publish a package to TestPyPI from CI using Twine. The version
 has the GitHub run number appended.

-For more details refer to the files
-[.github/workflows/publish.yaml](.github/workflows/publish.yaml) and
-[.github/workflows/tox.yaml](.github/workflows/tox.yaml).
+For more details refer to the file
+[.github/workflows/publish.yaml](https://github.com/aai-institute/pyDVL/blob/develop/.github/workflows/publish.yaml).
```
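
For orientation, the manual equivalent of what the CI job automates might look like the following; the exact invocations live in the workflow file linked above, so treat this as a sketch:

```shell
python -m build                            # build sdist and wheel into dist/ (requires the `build` package)
twine upload --repository testpypi dist/*  # upload the build to TestPyPI
```
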

docs/assets/pydvl.bib

Lines changed: 26 additions & 14 deletions
```diff
@@ -6,7 +6,7 @@ @article{agarwal_secondorder_2017
   shortjournal = {JMLR},
   volume = {18},
   eprint = {1602.03943},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   pages = {1--40},
   url = {https://www.jmlr.org/papers/v18/16-491.html},
   abstract = {First-order stochastic methods are the state-of-the-art in large-scale machine learning optimization owing to efficient per-iteration complexity. Second-order methods, while able to provide faster convergence, have been much less explored due to the high cost of computing the second-order information. In this paper we develop second-order stochastic methods for optimization problems in machine learning that match the per-iteration cost of gradient based methods, and in certain settings improve upon the overall running time over popular first-order methods. Furthermore, our algorithm has the desirable property of being implementable in time linear in the sparsity of the input data.},
@@ -67,7 +67,7 @@ @unpublished{broderick_automatic_2021
   author = {Broderick, Tamara and Giordano, Ryan and Meager, Rachael},
   date = {2021-11-03},
   eprint = {2011.14999},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   url = {https://arxiv.org/abs/2011.14999},
   abstract = {We propose a method to assess the sensitivity of econometric analyses to the removal of a small fraction of the data. Manually checking the influence of all possible small subsets is computationally infeasible, so we provide an approximation to find the most influential subset. Our metric, the "Approximate Maximum Influence Perturbation," is automatically computable for common methods including (but not limited to) OLS, IV, MLE, GMM, and variational Bayes. We provide finite-sample error bounds on approximation performance. At minimal extra cost, we provide an exact finite-sample lower bound on sensitivity. We find that sensitivity is driven by a signal-to-noise ratio in the inference problem, is not reflected in standard errors, does not disappear asymptotically, and is not due to misspecification. While some empirical applications are robust, results of several economics papers can be overturned by removing less than 1\% of the sample.},
   langid = {english},
@@ -118,7 +118,7 @@ @inproceedings{george_fast_2018
   date = {2018},
   volume = {31},
   eprint = {1806.03884},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   publisher = {Curran Associates, Inc.},
   url = {https://proceedings.neurips.cc/paper/2018/hash/48000647b315f6f00f913caa757a70b3-Abstract.html},
   urldate = {2024-01-12},
@@ -133,7 +133,7 @@ @inproceedings{ghorbani_data_2019
   author = {Ghorbani, Amirata and Zou, James},
   date = {2019-05-24},
   eprint = {1904.02868},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   pages = {2242--2251},
   publisher = {PMLR},
   issn = {2640-3498},
@@ -251,7 +251,7 @@ @inproceedings{koh_understanding_2017
   author = {Koh, Pang Wei and Liang, Percy},
   date = {2017-07-17},
   eprint = {1703.04730},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   pages = {1885--1894},
   publisher = {PMLR},
   url = {https://proceedings.mlr.press/v70/koh17a.html},
@@ -283,7 +283,7 @@ @inproceedings{kwon_beta_2022
   date = {2022-01-18},
   volume = {151},
   eprint = {2110.14049},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   publisher = {PMLR},
   location = {Valencia, Spain},
   url = {https://arxiv.org/abs/2110.14049},
@@ -329,7 +329,7 @@ @inproceedings{kwon_efficient_2021
   author = {Kwon, Yongchan and Rivas, Manuel A. and Zou, James},
   date = {2021-03-18},
   eprint = {2007.01357},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   pages = {793--801},
   publisher = {PMLR},
   issn = {2640-3498},
@@ -361,7 +361,7 @@ @article{maleki_bounding_2014
   date = {2014-02-12},
   journaltitle = {ArXiv13064265 Cs},
   eprint = {1306.4265},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   eprintclass = {cs},
   url = {https://arxiv.org/abs/1306.4265},
   urldate = {2020-11-16},
@@ -404,7 +404,7 @@ @inproceedings{okhrati_multilinear_2021
   author = {Okhrati, Ramin and Lipani, Aldo},
   date = {2021-01},
   eprint = {2010.12082},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   pages = {7992--7999},
   publisher = {IEEE},
   issn = {1051-4651},
@@ -425,7 +425,7 @@ @article{schioppa_scaling_2022
   volume = {36},
   number = {8},
   eprint = {2112.03052},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   pages = {8179--8186},
   issn = {2374-3468},
   doi = {10.1609/aaai.v36i8.20791},
@@ -485,7 +485,7 @@ @inproceedings{wang_improving_2022
   author = {Wang, Tianhao and Yang, Yu and Jia, Ruoxi},
   date = {2022-04-07},
   eprint = {2107.06336v2},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   publisher = {arXiv},
   doi = {10.48550/arXiv.2107.06336},
   url = {https://arxiv.org/abs/2107.06336v2},
@@ -501,13 +501,13 @@ @online{watson_accelerated_2023
   author = {Watson, Lauren and Kujawa, Zeno and Andreeva, Rayna and Yang, Hao-Tsung and Elahi, Tariq and Sarkar, Rik},
   date = {2023-11-09},
   eprint = {2311.05346},
-  eprinttype = {arxiv},
+  eprinttype = {arXiv},
   eprintclass = {cs},
   doi = {10.48550/arXiv.2311.05346},
   url = {https://arxiv.org/abs/2311.05346},
   urldate = {2023-12-07},
   abstract = {Data valuation has found various applications in machine learning, such as data filtering, efficient learning and incentives for data sharing. The most popular current approach to data valuation is the Shapley value. While popular for its various applications, Shapley value is computationally expensive even to approximate, as it requires repeated iterations of training models on different subsets of data. In this paper we show that the Shapley value of data points can be approximated more efficiently by leveraging the structural properties of machine learning problems. We derive convergence guarantees on the accuracy of the approximate Shapley value for different learning settings including Stochastic Gradient Descent with convex and non-convex loss functions. Our analysis suggests that in fact models trained on small subsets are more important in the context of data valuation. Based on this idea, we describe \$\textbackslash delta\$-Shapley -- a strategy of only using small subsets for the approximation. Experiments show that this approach preserves approximate value and rank of data, while achieving speedup of up to 9.9x. In pre-trained networks the approach is found to bring more efficiency in terms of accurate evaluation using small subsets.},
-  pubstate = {preprint}
+  pubstate = {prepublished}
 }

 @inproceedings{wu_davinz_2022,
@@ -528,7 +528,7 @@ @inproceedings{wu_davinz_2022

 @inproceedings{yan_if_2021,
   title = {If {{You Like Shapley Then You}}’ll {{Love}} the {{Core}}},
-  booktitle = {Proceedings of the 35th {{AAAI Conference}} on {{Artificial Intelligence}}, 2021},
+  booktitle = {Proceedings of the 35th {{AAAI Conference}} on {{Artificial Intelligence}}},
   author = {Yan, Tom and Procaccia, Ariel D.},
   date = {2021-05-18},
   volume = {6},
@@ -543,3 +543,15 @@ @inproceedings{yan_if_2021
   langid = {english},
   keywords = {notion}
 }
+
+@inproceedings{zaheer_deep_2017,
+  title = {Deep {{Sets}}},
+  booktitle = {Advances in {{Neural Information Processing Systems}}},
+  author = {Zaheer, Manzil and Kottur, Satwik and Ravanbakhsh, Siamak and Poczos, Barnabas and Salakhutdinov, Russ R and Smola, Alexander J},
+  date = {2017},
+  volume = {30},
+  publisher = {Curran Associates, Inc.},
+  url = {https://papers.nips.cc/paper_files/paper/2017/hash/f22e4747da1aa27e363d86d40ff442fe-Abstract.html},
+  urldate = {2025-03-03},
+  abstract = {We study the problem of designing models for machine learning tasks defined on sets. In contrast to the traditional approach of operating on fixed dimensional vectors, we consider objective functions defined on sets and are invariant to permutations. Such problems are widespread, ranging from the estimation of population statistics, to anomaly detection in piezometer data of embankment dams, to cosmology. Our main theorem characterizes the permutation invariant objective functions and provides a family of functions to which any permutation invariant objective function must belong. This family of functions has a special structure which enables us to design a deep network architecture that can operate on sets and which can be deployed on a variety of scenarios including both unsupervised and supervised learning tasks. We demonstrate the applicability of our method on population statistic estimation, point cloud classification, set expansion, and outlier detection.}
+}
```

docs/examples/index.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -5,7 +5,7 @@ alias:
   text: Example gallery
 ---

-## Data valuation
+## Data valuation { #data-valuation-example-gallery }

 <div class="grid cards" markdown>

```
docs/getting-started/advanced-usage.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -18,7 +18,7 @@ using one of Dask, Ray or Joblib. The first is used in
 the [influence][pydvl.influence] package whereas the other two
 are used in the [value][pydvl.value] package.

-### Data valuation
+### Data valuation { #setting-up-parallelization-data-valuation }

 For data valuation, pyDVL uses [joblib](https://joblib.readthedocs.io/en/latest/) for local
 parallelization (within one machine) and supports using
```
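
The attribute-list IDs added in this hunk and the previous one give these headings stable anchors that survive renames. An illustration of how such an anchor could be referenced from another docs page (the linking page and relative path are hypothetical):

```markdown
See the [data valuation examples](../examples/index.md#data-valuation-example-gallery)
for complete notebooks.
```
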
