Commit 5656be0

Merge pull request #492 from aai-institute/fix/docs-and-cleanup
Docs and cleanup
2 parents f2f7466 + 67c5fb2 commit 5656be0

File tree

27 files changed (+275, -310 lines)


CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -20,6 +20,12 @@
 - Fixed bug with checking for converged values in semivalues
   [PR #341](https://github.com/appliedAI-Initiative/pyDVL/pull/341)

+### Docs
+
+- Add applications of data valuation section, display examples more prominently,
+  make all sections visible in table of contents, use mkdocs material cards
+  in the home page [PR #492](https://github.com/aai-institute/pyDVL/pull/492)
+
 ## 0.8.0 - 🆕 New interfaces, scaling computation, bug fixes and improvements 🎁

 ### Added

CONTRIBUTING.md

Lines changed: 12 additions & 12 deletions
@@ -23,7 +23,7 @@ to make your life easier.

 Run the following to set up the pre-commit git hook to run before pushes:

-```shell script
+```shell
 pre-commit install --hook-type pre-push
 ```

@@ -32,15 +32,15 @@ pre-commit install --hook-type pre-push
 We strongly suggest using some form of virtual environment for working with the
 library. E.g. with venv:

-```shell script
+```shell
 python -m venv ./venv
 . venv/bin/activate # `venv\Scripts\activate` in windows
 pip install -r requirements-dev.txt -r requirements-docs.txt
 ```

 With conda:

-```shell script
+```shell
 conda create -n pydvl python=3.8
 conda activate pydvl
 pip install -r requirements-dev.txt -r requirements-docs.txt
@@ -49,7 +49,7 @@ pip install -r requirements-dev.txt -r requirements-docs.txt
 A very convenient way of working with your library during development is to
 install it in editable mode into your environment by running

-```shell script
+```shell
 pip install -e .
 ```

@@ -58,7 +58,7 @@ suite) [pandoc](https://pandoc.org/) is required. Except for OSX, it should be i
 automatically as a dependency with `requirements-docs.txt`. Under OSX you can
 install pandoc (you'll need at least version 2.11) with:

-```shell script
+```shell
 brew install pandoc
 ```

@@ -152,11 +152,11 @@ Two important markers are:
 To test the notebooks separately, run (see [below](#notebooks) for details):

 ```shell
-tox -e tests -- notebooks/
+tox -e notebook-tests
 ```

 To create a package locally, run:
-```shell script
+```shell
 python setup.py sdist bdist_wheel
 ```

@@ -517,13 +517,13 @@ Then, a new release can be created using the script
 `bumpversion` automatically derive the next release version by bumping the patch
 part):

-```shell script
+```shell
 build_scripts/release-version.sh 0.1.6
 ```

 To find out how to use the script, pass the `-h` or `--help` flags:

-```shell script
+```shell
 build_scripts/release-version.sh --help
 ```

@@ -549,7 +549,7 @@ create a new release manually by following these steps:
 2. When ready to release: From the develop branch create the release branch and
    perform release activities (update changelog, news, ...). For your own
    convenience, define an env variable for the release version
-   ```shell script
+   ```shell
    export RELEASE_VERSION="vX.Y.Z"
    git checkout develop
    git branch release/${RELEASE_VERSION} && git checkout release/${RELEASE_VERSION}
@@ -560,7 +560,7 @@ create a new release manually by following these steps:
    (the `release` part is ignored but required by bumpversion :rolling_eyes:).
 4. Merge the release branch into `master`, tag the merge commit, and push back to the repo.
    The CI pipeline publishes the package based on the tagged commit.
-   ```shell script
+   ```shell
    git checkout master
    git merge --no-ff release/${RELEASE_VERSION}
    git tag -a ${RELEASE_VERSION} -m"Release ${RELEASE_VERSION}"
@@ -571,7 +571,7 @@ create a new release manually by following these steps:
    always strictly more recent than the last published release version from
    `master`.
 6. Merge the release branch into `develop`:
-   ```shell script
+   ```shell
    git checkout develop
    git merge --no-ff release/${RELEASE_VERSION}
    git push origin develop
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+import logging
+import os
+from pathlib import Path
+
+import mkdocs.plugins
+
+logger = logging.getLogger(__name__)
+
+root_dir = Path(__file__).parent.parent
+docs_dir = root_dir / "docs"
+contributing_file = root_dir / "CONTRIBUTING.md"
+target_filepath = docs_dir / contributing_file.name
+
+
+@mkdocs.plugins.event_priority(100)
+def on_pre_build(config):
+    logger.info("Temporarily copying contributing guide to docs directory")
+    try:
+        if os.path.getmtime(contributing_file) <= os.path.getmtime(target_filepath):
+            logger.info(
+                f"Contributing guide '{os.fspath(contributing_file)}' hasn't been updated, skipping."
+            )
+            return
+    except FileNotFoundError:
+        pass
+    logger.info(
+        f"Creating symbolic link for '{os.fspath(contributing_file)}' "
+        f"at '{os.fspath(target_filepath)}'"
+    )
+    target_filepath.symlink_to(contributing_file)
+
+    logger.info("Finished copying contributing guide to docs directory")
+
+
+@mkdocs.plugins.event_priority(-100)
+def on_shutdown():
+    logger.info("Removing temporary contributing guide in docs directory")
+    target_filepath.unlink()

docs/css/extra.css

Lines changed: 1 addition & 0 deletions
@@ -69,6 +69,7 @@ a.autorefs-external:hover::after {
 .nt-card-image:focus {
   filter: invert(32%) sepia(93%) saturate(1535%) hue-rotate(220deg) brightness(102%) contrast(99%);
 }
+
 .md-header__button.md-logo {
   padding: 0;
 }

docs/css/grid-cards.css

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+/* Shadow and Hover */
+.grid.cards > ul > li {
+  box-shadow: 0 2px 2px 0 rgb(0 0 0 / 14%), 0 3px 1px -2px rgb(0 0 0 / 20%), 0 1px 5px 0 rgb(0 0 0 / 12%);
+
+  &:hover {
+    transform: scale(1.05);
+    z-index: 999;
+    background-color: rgba(0, 0, 0, 0.05);
+  }
+}
+
+[data-md-color-scheme="slate"] {
+  .grid.cards > ul > li {
+    box-shadow: 0 2px 2px 0 rgb(4 40 33 / 14%), 0 3px 1px -2px rgb(40 86 94 / 47%), 0 1px 5px 0 rgb(139 252 255 / 64%);
+
+    &:hover {
+      transform: scale(1.05);
+      z-index: 999;
+      background-color: rgba(139, 252, 255, 0.05);
+    }
+  }
+}

docs/css/neoteroi.css

Lines changed: 0 additions & 1 deletion
This file was deleted.

docs/getting-started/first-steps.md

Lines changed: 4 additions & 4 deletions
@@ -1,11 +1,11 @@
 ---
-title: Getting Started
+title: First Steps
 alias:
-  name: getting-started
-  text: Getting Started
+  name: first-steps
+  text: First Steps
 ---

-# Getting started
+# First Steps

 !!! Warning
     Make sure you have read [[installation]] before using the library.

docs/index.md

Lines changed: 28 additions & 15 deletions
@@ -9,26 +9,39 @@ It runs most of them in parallel either locally or in a cluster and supports
 distributed caching of results.

 If you're a first time user of pyDVL, we recommend you to go through the
-[[getting-started]] and [[installation]] guides.
+[[installation]] and [[first-steps]] guides in the Getting Started section.

-::cards:: cols=2
+<div class="grid cards" markdown>

-- title: Installation
-  content: Steps to install and requirements
-  url: getting-started/installation.md
+- :fontawesome-solid-toolbox:{ .lg .middle } __Installation__
+
+    ---
+
+    Steps to install and requirements
+
+    [[installation|:octicons-arrow-right-24: Installation]]
+
+- :fontawesome-solid-scale-unbalanced:{ .lg .middle } __Data valuation__
+
+    ---

-- title: Data valuation
-  content: >
     Basics of data valuation and description of the main algorithms
-  url: value/

-- title: Influence Function
-  content: >
+    [[data-valuation|:octicons-arrow-right-24: Data Valuation]]
+
+- :fontawesome-solid-scale-unbalanced-flip:{ .lg .middle } __Influence Function__
+
+    ---
+
     An introduction to the influence function and its computation with pyDVL
-  url: influence/

-- title: Browse the API
-  content: Full documentation of the API
-  url: api/pydvl/
+    [[influence-values|:octicons-arrow-right-24: Influence Values]]
+
+- :fontawesome-regular-file-code:{ .lg .middle } __API Reference__
+
+    ---
+
+    Full documentation of the API
+
+    [:octicons-arrow-right-24: API Reference](api/pydvl/)

-::/cards::
+</div>

docs/value/applications.md

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
+---
+title: Applications of data valuation
+---
+
+# Applications of data valuation
+
+Data valuation methods hold promise for improving various aspects
+of data engineering and machine learning workflows. When applied judiciously,
+these methods can enhance data quality, model performance, and cost-effectiveness.
+
+However, the results can be inconsistent. Values have a strong dependency
+on the training procedure and the performance metric used. For instance,
+accuracy is a poor metric for imbalanced sets and this has a stark effect
+on data values. Some models exhibit high variance in some regimes
+and this again has a detrimental effect on values.
+
+While still an evolving field with methods requiring careful use, data valuation can
+be applied across a wide range of data engineering tasks. For a comprehensive
+overview, along with concrete examples, please refer to the [Transferlab blog
+post]({{ transferlab.website }}blog/data-valuation-applications/) on this topic.
+
+## Data Engineering
+
+While still an emerging field, judicious use of data valuation techniques
+has the potential to enhance data quality, model performance,
+and the cost-effectiveness of data workflows in many applications.
+Some of the promising applications in data engineering include:
+
+- Removing low-value data points can reduce noise and increase model performance.
+  However, care is needed to avoid overfitting when iteratively retraining on pruned datasets.
+- Pruning redundant samples enables more efficient training of large models.
+  Value-based metrics can determine which data to discard for optimal efficiency gains.
+- Computing value scores for unlabeled data points supports efficient active learning.
+  High-value points can be prioritized for labeling to maximize gains in model performance.
+- Analyzing high- and low-value data provides insights to guide targeted data collection
+  and improve upstream data processes. Low-value points may reveal data issues to address.
+- Data value metrics can also help identify irrelevant or duplicated data
+  when evaluating offerings from data providers.
+
+## Model development
+
+Data valuation techniques can provide insights for model debugging and interpretation.
+Some of the useful applications include:
+
+- Interpretation and debugging: Analyzing the most or least valuable samples
+  for a class can reveal cases where the model relies on confounding features
+  instead of true signal. Investigating influential points for misclassified examples
+  highlights limitations to address.
+- Sensitivity/robustness analysis: Prior work shows removing a small fraction
+  of highly influential data can completely flip model conclusions.
+  This reveals potential issues with the modeling approach, data collection process,
+  or intrinsic difficulty of the problem that require further inspection.
+  Robust models require many points removed before conclusions meaningfully shift.
+  High sensitivity means conclusions heavily depend on small subsets of data,
+  indicating deeper problems to resolve.
+- Monitoring changes in data value during training provides insights into
+  model convergence and overfitting.
+- Continual learning: in order to avoid forgetting when training on new data,
+  a subset of previously seen data is presented again. Data valuation helps
+  in the selection of highly influential samples.
+
+## Attacks
+
+Data valuation techniques have applications in detecting data manipulation and contamination:
+
+- Watermark removal: Points with low value on a correct validation set may be
+  part of a watermarking mechanism. Removing them can strip a model of its fingerprints.
+- Poisoning attacks: Influential points can be shifted to induce large changes
+  in model estimators. However, the feasibility of such attacks is limited,
+  and their value for adversarial training is unclear.
+
+Overall, while data valuation techniques show promise for identifying anomalous
+or manipulated data, more research is needed to develop robust methods suited
+for security applications.
+
+## Data markets
+
+Additionally, one of the motivating applications for the whole field is that of
+data markets: a marketplace where data owners can sell their data to interested
+parties. In this setting, data valuation can be a key component in determining the
+price of data. Market pricing depends on the value addition for buyers
+(e.g. improved model performance) and costs/privacy concerns for sellers.
+
+Game-theoretic valuation methods like Shapley values can help assign fair prices,
+but have limitations around handling duplicates or adversarial data.
+Model-free methods like LAVA [@just_lava_2023] and CRAIG are
+particularly well suited for this, as they use the Wasserstein distance between
+a vendor's data and the buyer's to determine the value of the former.
+
+However, this is a complex problem that faces mundane practical obstacles, such as
+the fact that data owners may not wish to disclose their data for valuation.
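
A note on the Data Engineering applications added above: the pruning and active-learning workflows become concrete once a value is attached to each training point. The sketch below is illustrative only and is not part of this commit. It loosely follows pyDVL's 0.8 getting-started guide; the names `Dataset`, `Utility`, `compute_shapley_values` and `MaxUpdates`, as well as the `values`/`indices` attributes of the result, are assumptions taken from that guide and may differ in other versions.

```python
# Illustrative sketch: rank training points by (truncated Monte Carlo) Shapley
# value and flag the lowest-valued ones for inspection or pruning.
# API names are assumed from the pyDVL 0.8 docs; check the API reference.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value import MaxUpdates, compute_shapley_values

dataset = Dataset.from_sklearn(load_breast_cancer(), train_size=0.7)
model = LogisticRegression(max_iter=1000)
utility = Utility(model, dataset)  # default scorer: the model's own score()

# Stop after a fixed budget of value updates (a convergence criterion could be
# used instead).
values = compute_shapley_values(utility, done=MaxUpdates(100))

# Assumed result layout: one value per training point, aligned with `indices`.
order = np.argsort(values.values)
candidates = values.indices[order][:10]
print(f"Lowest-value training points (candidates for inspection): {candidates}")
```

Values shift once the dataset changes, so in practice one would re-estimate them after each pruning round rather than removing points in a single pass.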

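The Wasserstein-based pricing idea in the Data markets section can likewise be made tangible with a toy, model-free proxy. This is emphatically not LAVA or CRAIG: it only averages per-feature one-dimensional Wasserstein distances between a hypothetical vendor sample and the buyer's reference data using `scipy.stats.wasserstein_distance`, to illustrate what distributional closeness means here. All data in the sketch is synthetic.

```python
# Toy proxy for "how far is a vendor's offering from the buyer's data?".
# NOT the LAVA algorithm: just the mean of per-feature 1D Wasserstein distances.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(42)
buyer = rng.normal(loc=0.0, scale=1.0, size=(1000, 5))   # buyer's reference data
vendor = rng.normal(loc=0.3, scale=1.2, size=(500, 5))   # hypothetical vendor offer

# One distance per feature; lower means the offer resembles the buyer's data.
per_feature = [
    wasserstein_distance(buyer[:, j], vendor[:, j]) for j in range(buyer.shape[1])
]
score = float(np.mean(per_feature))
print(f"Mean per-feature Wasserstein distance: {score:.3f}")
```

A real marketplace valuation would need the multivariate optimal-transport formulation and safeguards against duplicated or adversarial submissions, as noted above.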
docs/value/index.md

Lines changed: 0 additions & 27 deletions
@@ -83,33 +83,6 @@ among all samples, failing to identify repeated ones as unnecessary, with e.g. a
 zero value.


-## Applications of data valuation
-
-Many applications are touted for data valuation, but the results can be
-inconsistent. Values have a strong dependency on the training procedure and the
-performance metric used. For instance, accuracy is a poor metric for imbalanced
-sets and this has a stark effect on data values. Some models exhibit great
-variance in some regimes and this again has a detrimental effect on values.
-
-Nevertheless, some of the most promising applications are:
-
-* Cleaning of corrupted data.
-* Pruning unnecessary or irrelevant data.
-* Repairing mislabeled data.
-* Guiding data acquisition and annotation (active learning).
-* Anomaly detection and model debugging and interpretation.
-
-Additionally, one of the motivating applications for the whole field is that of
-data markets: a marketplace where data owners can sell their data to interested
-parties. In this setting, data valuation can be key component to determine the
-price of data. Algorithm-agnostic methods like LAVA [@just_lava_2023] are
-particularly well suited for this, as they use the Wasserstein distance between
-a vendor's data and the buyer's to determine the value of the former.
-
-However, this is a complex problem which can face practical banal problems like
-the fact that data owners may not wish to disclose their data for valuation.
-
-
 ## Computing data values

 Using pyDVL to compute data values is a simple process that can be broken down
