Merge pull request #238 from appliedAI-Initiative/fix/cleanup

mdbenito · web-flow · commit 372a34128fe6 · 2023-01-12T16:27:00.000+01:00
Some docs and cleanup
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -183,19 +183,36 @@ any rst files which are not manually created), you can use a file watcher.
 This is not part of the development setup of pyDVL (yet! PRs welcome), but
 modern IDEs provide functionality for this.
 
-Use the **docs** tox environment to build the documentation the same way it is done in CI:
+Use the **docs** tox environment to build the documentation the same way it is
+done in CI:
 
 ```bash
 tox -e docs
 ```
 
-Locally, you can use the **docs-dev** tox environment to continuously rebuild docs on changes:
+Locally, you can use the **docs-dev** tox environment to continuously rebuild
+documentation on changes to the `docs` folder:
 
 ```bash
 tox -e docs-dev
 ```
 
-**NOTE:** This currently only rebuilds on changes to `.rst` files and notebooks.
+**Again:** this only rebuilds on changes to `.rst` files and notebooks inside
+`docs`.
+
+### Using bibliography
+
+Bibliographic citations are managed with the plugin
+[sphinxcontrib-bibtex](https://sphinxcontrib-bibtex.readthedocs.io/en/latest/index.html).
+To enter a citation, first add the entry to `docs/pydvl.bib`. For team
+contributors this should be an export of the Zotero folder `software/pydvl` in
+the [TransferLab Zotero library](https://www.zotero.org/groups/2703043/transferlab/library).
+All other contributors just add the bibtex data, and a maintainer will add it to
+the group library upon merging.
+
+To add a citation inside a module or function's docstring, use the sphinx role
+`:footcite:t:`. A references section is automatically added at the bottom of
+each module's auto-generated documentation.
 
 ### Writing mathematics
 
@@ -269,7 +286,8 @@ satisfied:
 
 Then, a new release can be created using the script
 `build_scripts/release-version.sh` (leave out the version parameter to have
-`bumpversion` automatically derive the next release version by bumping the patch part):
+`bumpversion` automatically derive the next release version by bumping the patch
+part):
 
 ```shell script
 ./scripts/release-version.sh 0.1.6
@@ -285,7 +303,8 @@ If running in interactive mode (without `-y|--yes`), the script will output a
 summary of pending changes and ask for confirmation before executing the
 actions.
 
-Once this is done, a package will be automatically created and published from CI to PyPI.
+Once this is done, a package will be automatically created and published from CI
+to PyPI.
 
 ### Manual release process
 
diff --git a/build_scripts/update_docs.py b/build_scripts/update_docs.py
@@ -22,6 +22,14 @@ def module_template(module_qualname: str):
 .. automodule:: {module_qualname}
    :members:
    :undoc-members:
+   
+   ----
+   
+   Module members
+   ==============
+ 
+.. footbibliography::
+
 """
     return template
 
diff --git a/docs/10-getting-started.rst b/docs/10-getting-started.rst
@@ -14,12 +14,12 @@ algorithms for data valuation and influence functions. You can read:
 
 * :ref:`data valuation` for key objects and usage patterns for Shapley value
   computation and related methods.
-* :ref:`influence` for instruction on how to compute influence functions (still
+* :ref:`influence` for instructions on how to compute influence functions (still
   in a pre-alpha state)
 
 We only briefly introduce key concepts in the documentation. For a thorough
 introduction and survey of the field, we refer to **the upcoming review** at the
-:tfl:`TransferLab website <>`.
+:tfl:`TransferLab website <reviews/data-valuation>`.
 
 Running the examples
 ====================
diff --git a/docs/20-install.rst b/docs/20-install.rst
@@ -10,13 +10,7 @@ To install the latest release use:
 
     pip install pyDVL
 
-You can also install the latest development version from `TestPyPI <https://test.pypi.org/project/pyDVL/>`_:
-
-.. code-block:: shell
-
-    pip install pyDVL --index-url https://test.pypi.org/simple/
-
-To use all features of influence functions execute:
+To use all features of influence functions use instead:
 
 .. code-block:: shell
 
@@ -29,7 +23,14 @@ In order to check the installation you can use:
 
 .. code-block:: shell
 
-    python -c "import valuation; print(pydvl.__version__)"
+    python -c "import pydvl; print(pydvl.__version__)"
+
+You can also install the latest development version from
+`TestPyPI <https://test.pypi.org/project/pyDVL/>`_:
+
+.. code-block:: shell
+
+    pip install pyDVL --index-url https://test.pypi.org/simple/
 
 Dependencies
 ============
diff --git a/docs/30-data-valuation.rst b/docs/30-data-valuation.rst
@@ -460,7 +460,7 @@ Because the number of subsets $S \subseteq D \setminus \{x_i\}$ is
 $2^{ | D | - 1 }$, one typically must resort to approximations.
 
 The simplest approximation consists of two relaxations of the Least Core
-(:footcite:t:`yan_procaccia_2021`):
+(:footcite:t:`yan_if_2021`):
 
 - Further relaxing the coalitional rationality property by
   a constant value $\epsilon > 0$:
diff --git a/docs/conf.py b/docs/conf.py
@@ -70,6 +70,8 @@
 }
 
 bibtex_bibfiles = ["pydvl.bib"]
+bibtex_bibliography_header = "References\n=========="
+bibtex_footbibliography_header = bibtex_bibliography_header
 
 # NBSphinx
 
diff --git a/docs/pydvl.bib b/docs/pydvl.bib
@@ -12,7 +12,7 @@ @inproceedings{ghorbani_data_2019
   issn = {2640-3498},
   url = {http://proceedings.mlr.press/v97/ghorbani19c.html},
   urldate = {2020-11-01},
-  abstract = {As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions. For example, in healthcare...},
+  abstract = {As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions. For example, in healthcare and consumer markets, it has been suggested that individuals should be compensated for the data that they generate, but it is not clear what is an equitable valuation for individual data. In this work, we develop a principled framework to address data valuation in the context of supervised machine learning. Given a learning algorithm trained on n data points to produce a predictor, we propose data Shapley as a metric to quantify the value of each training datum to the predictor performance. Data Shapley uniquely satisfies several natural properties of equitable data valuation. We develop Monte Carlo and gradient-based methods to efficiently estimate data Shapley values in practical settings where complex learning algorithms, including neural networks, are trained on large datasets. In addition to being equitable, extensive experiments across biomedical, image and synthetic data demonstrate that data Shapley has several other benefits: 1) it is more powerful than the popular leave-one-out or leverage score in providing insight on what data is more valuable for a given learning task; 2) low Shapley value data effectively capture outliers and corruptions; 3) high Shapley value data inform what type of new data to acquire to improve the predictor.},
   archiveprefix = {arXiv},
   langid = {english}
 }
@@ -122,16 +122,20 @@ @inproceedings{wang_improving_2022
   langid = {english}
 }
 
-@article{yan_procaccia_2021,
-  title = {If You Like Shapley Then You’ll Love the Core},
-  volume = {35},
-  url = {https://ojs.aaai.org/index.php/AAAI/article/view/16721},
-  doi = {10.1609/aaai.v35i6.16721},
-  abstract = {The prevalent approach to problems of credit assignment in machine learning -- such as feature and data valuation -- is to model the problem at hand as a cooperative game and apply the Shapley value. But cooperative game theory offers a rich menu of alternative solution concepts, which famously includes the core and its variants. Our goal is to challenge the machine learning community’s current consensus around the Shapley value, and make a case for the core as a viable alternative. To that end, we prove that arbitrarily good approximations to the least core -- a core relaxation that is always feasible -- can be computed efficiently (but prove an impossibility for a more refined solution concept, the nucleolus). We also perform experiments that corroborate these theoretical results and shed light on settings where the least core may be preferable to the Shapley value.},
-  number = {6},
-  journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
+@inproceedings{yan_if_2021,
+  title = {If {{You Like Shapley Then You}}'ll {{Love}} the {{Core}}},
+  booktitle = {Proceedings of the 35th {{AAAI Conference}} on {{Artificial Intelligence}}, 2021},
   author = {Yan, Tom and Procaccia, Ariel D.},
   year = {2021},
-  month = {May},
-  pages = {5751-5759}
+  month = may,
+  volume = {6},
+  pages = {5751--5759},
+  publisher = {{Association for the Advancement of Artificial Intelligence}},
+  address = {{Virtual conference}},
+  doi = {10.1609/aaai.v35i6.16721},
+  url = {https://ojs.aaai.org/index.php/AAAI/article/view/16721},
+  urldate = {2021-04-23},
+  abstract = {The prevalent approach to problems of credit assignment in machine learning \textemdash{} such as feature and data valuation\textemdash{} is to model the problem at hand as a cooperative game and apply the Shapley value. But cooperative game theory offers a rich menu of alternative solution concepts, which famously includes the core and its variants. Our goal is to challenge the machine learning community's current consensus around the Shapley value, and make a case for the core as a viable alternative. To that end, we prove that arbitrarily good approximations to the least core \textemdash{} a core relaxation that is always feasible \textemdash{} can be computed efficiently (but prove an impossibility for a more refined solution concept, the nucleolus). We also perform experiments that corroborate these theoretical results and shed light on settings where the least core may be preferable to the Shapley value.},
+  copyright = {Copyright (c) 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.},
+  langid = {english}
 }
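
The renamed key `yan_if_2021` is what docstrings and `.rst` files now have to cite. As a minimal sketch of the convention described in `CONTRIBUTING.md`, the docstring below uses the `:footcite:t:` role and contains no manual `.. footbibliography::` directive, since the module template is expected to add the references section. The function `approximate_least_core` and the helper `cited_keys` are hypothetical and only serve as illustration; they are not part of pyDVL.

```python
import re


def approximate_least_core(values):
    """Toy stand-in for a documented function (hypothetical, not pyDVL code).

    Follows the relaxation described by :footcite:t:`yan_if_2021`.

    :param values: Any sequence of numbers.
    :return: The same values as a list.
    """
    return list(values)


def cited_keys(docstring):
    """Return the BibTeX keys referenced through footcite roles."""
    # Matches both the textual (:footcite:t:) and parenthetical (:footcite:p:) roles.
    return re.findall(r":footcite:(?:t|p):`([^`]+)`", docstring)


print(cited_keys(approximate_least_core.__doc__))
```

A check like `cited_keys` could be used to confirm that every cited key has a matching entry in `docs/pydvl.bib`, although this PR does not add such a check.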
diff --git a/src/pydvl/utils/utility.py b/src/pydvl/utils/utility.py
@@ -217,10 +217,6 @@ class DataUtilityLearning:
     >>> wrapped_u((1, 2, 3)) # Subsequent calls will be computed using the fit model for DUL
     0.0
 
-    .. rubric:: References
-
-    .. footbibliography::
-
     """
 
     def __init__(
diff --git a/src/pydvl/value/banzhaf/__init__.py b/src/pydvl/value/banzhaf/__init__.py
diff --git a/src/pydvl/value/banzhaf/montecarlo.py b/src/pydvl/value/banzhaf/montecarlo.py
diff --git a/src/pydvl/value/least_core/_common.py b/src/pydvl/value/least_core/_common.py
@@ -26,10 +26,12 @@ def _solve_linear_program(
     bounds: BOUNDS_TYPE,
     **options,
 ) -> Optional[NDArray[np.float_]]:
-    """Solves a linear program using scipy's :func:`~scipy.optimize.linprog` function.
+    """Solves a linear program using scipy's :func:`~scipy.optimize.linprog`
+    function.
 
-    > **NOTE**: The following description of the linear program and the parameters is taken verbatim
-    > from scipy
+    .. note::
+       The following description of the linear program and the parameters is
+       taken verbatim from scipy
 
     .. math::
 
@@ -38,26 +40,26 @@ def _solve_linear_program(
         & A_{eq} x = b_{eq},\\
         & l \leq x \leq u ,
 
-     where :math:`x` is a vector of decision variables; :math:`c`,
-    :math:`b_{ub}`, :math:`b_{eq}`, :math:`l`, and :math:`u` are vectors; and
-    :math:`A_{ub}` and :math:`A_{eq}` are matrices.
+    where $x$ is a vector of decision variables; $c$, $b_{ub}$, $b_{eq}$, $l$,
+    and $u$ are vectors, and $A_{ub}$ and $A_{eq}$ are matrices.
 
     :param c: The coefficients of the linear objective function to be minimized.
-    :param A_eq: The equality constraint matrix. Each row of ``A_eq`` specifies the
-        coefficients of a linear equality constraint on ``x``.
-    :param b_eq: The equality constraint vector. Each element of ``A_eq @ x`` must equal
-        the corresponding element of ``b_eq``.
-    :param A_ub: The inequality constraint matrix. Each row of ``A_ub`` specifies the
-        coefficients of a linear inequality constraint on ``x``.
+    :param A_eq: The equality constraint matrix. Each row of ``A_eq`` specifies
+        the coefficients of a linear equality constraint on ``x``.
+    :param b_eq: The equality constraint vector. Each element of ``A_eq @ x``
+        must equal the corresponding element of ``b_eq``.
+    :param A_ub: The inequality constraint matrix. Each row of ``A_ub``
+        specifies the coefficients of a linear inequality constraint on ``x``.
     :param b_ub: The inequality constraint vector. Each element represents an
         upper bound on the corresponding value of ``A_ub @ x``.
-    :param bounds: A sequence of ``(min, max)`` pairs for each element in ``x``, defining
-        the minimum and maximum values of that decision variable. Use ``None``
-        to indicate that there is no bound. By default, bounds are
-        ``(0, None)`` (all decision variables are non-negative).
-        If a single tuple ``(min, max)`` is provided, then ``min`` and
-        ``max`` will serve as bounds for all decision variables.
-    :param options: A dictionary of solver options. Refer to scipy's documentation for all possible values.
+    :param bounds: A sequence of ``(min, max)`` pairs for each element in ``x``,
+        defining the minimum and maximum values of that decision variable. Use
+        ``None`` to indicate that there is no bound. By default, bounds are
+        ``(0, None)`` (all decision variables are non-negative). If a single
+        tuple ``(min, max)`` is provided, then ``min`` and ``max`` will serve as
+        bounds for all decision variables.
+    :param options: A dictionary of solver options. Refer to scipy's
+        documentation for all possible values.
     """
     logger.debug(
         f"Solving linear programming problem: {c=}, {A_eq=}, {b_eq=}, {A_ub=}, {b_ub=}"
diff --git a/src/pydvl/value/least_core/montecarlo.py b/src/pydvl/value/least_core/montecarlo.py
@@ -106,11 +106,11 @@ def montecarlo_least_core(
     :param config: Object configuring parallel computation, with cluster
         address, number of cpus, etc.
     :param epsilon: Relaxation value by which the subset utility is decreased.
-    :param options: LP Solver options. \
-        Refer to this page for more information https://docs.scipy.org/doc/scipy/reference/optimize.linprog-highs.html
-    :param progress: If True, shows a tqdm progress bar
-    :return: Dictionary of {"index or label": exact_value}, sorted by decreasing
-        value.
+    :param options: LP Solver options. Refer to `SciPy's documentation
+        <https://docs.scipy.org/doc/scipy/reference/optimize.linprog-highs.html>`_
+        for more information
+    :param progress: Whether to display a progress bar
+    :return: Object with the data values.
     """
     n = len(u.data)
 
diff --git a/src/pydvl/value/least_core/naive.py b/src/pydvl/value/least_core/naive.py
@@ -16,7 +16,16 @@
 def exact_least_core(
     u: Utility, *, options: Optional[dict] = None, progress: bool = True, **kwargs
 ) -> ValuationResult:
-    r"""Computes the exact Least Core values by solving the following Linear Programming problem:
+    r"""Computes the exact Least Core values.
+
+    .. note::
+       If the training set contains more than 20 instances a warning is printed
+       because the computation is very expensive. This method is mostly used for
+       internal testing and simple use cases. Please refer to the
+       :func:`Monte Carlo method <pydvl.value.least_core.montecarlo.montecarlo_least_core>`
+       for practical applications.
+
+    The least core is the solution to the following Linear Programming problem:
 
     $$
     \begin{array}{lll}
@@ -26,24 +35,14 @@ def exact_least_core(
     \end{array}
     $$
 
-    Where $N = \{1, 2, \dots, n\}$ is the set of the training set's indices.
-
-    If the training set contains more than 20 instances a warning is printed
-    because the computation is very expensive.
-
-    .. note::
-
-        This method is mostly used for internal testing and simple use cases.
-        Please refer to the :func:`Monte Carlo method <pydvl.least_core.montecarlo.montecarlo_least_core>` 
-        for all other cases.
+    Where $N = \{1, 2, \dots, n\}$ are the training set's indices.
 
     :param u: Utility object with model, data, and scoring function
-    :param options: LP Solver options. \
-        Refer to this page for more information https://docs.scipy.org/doc/scipy/reference/optimize.linprog-highs.html
-    :param progress: If True, shows a tqdm progress bar
-
-    :return: Dictionary of {"index or label": exact_value}, sorted by decreasing
-        value.
+    :param options: LP Solver options. Refer to `SciPy's documentation
+        <https://docs.scipy.org/doc/scipy/reference/optimize.linprog-highs.html>`_
+        for more information
+    :param progress: Whether to display a progress bar
+    :return: Object with the data values.
     """
     n = len(u.data)
 
diff --git a/src/pydvl/value/shapley/knn.py b/src/pydvl/value/shapley/knn.py
@@ -32,10 +32,6 @@ def knn_shapley(u: Utility, *, progress: bool = True) -> ValuationResult:
     :raises TypeError: If the model in the utility is not a `KNeighborsClassifier
         <https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html>`_
 
-    . rubric:: References
-
-    .. footbibliography::
-
     .. versionadded:: 0.1.0
 
     """
diff --git a/src/pydvl/value/shapley/montecarlo.py b/src/pydvl/value/shapley/montecarlo.py
diff --git a/src/pydvl/value/shapley/naive.py b/src/pydvl/value/shapley/naive.py
diff --git a/tests/value/conftest.py b/tests/value/conftest.py
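
The note added to `exact_least_core` warns about training sets with more than 20 instances. The sketch below illustrates why, under the assumption that the exact method has to consider every subset of the training indices (the linear program itself is not shown in this diff, so this is an assumption rather than a description of pyDVL's implementation). The helper names are hypothetical.

```python
from itertools import combinations


def count_subsets(n):
    """Number of subsets of a set with n elements."""
    return 2 ** n


def enumerate_subsets(indices):
    """Yield every subset of `indices` as a tuple, smallest subsets first."""
    indices = list(indices)
    for size in range(len(indices) + 1):
        yield from combinations(indices, size)


# Explicit enumeration is only feasible for very small sets.
print(len(list(enumerate_subsets(range(4)))))

# The count doubles with every additional training point.
for n in (10, 20, 30):
    print(n, count_subsets(n))
```

With 20 points there are already more than a million subsets, which is consistent with the recommendation to use the Monte Carlo method for practical applications.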
