
Commit 691b509 (parent 6f944f0)

Update readme and add problem_classes

File tree: 3 files changed (+56 -65 lines)

README.md

Lines changed: 53 additions & 64 deletions
@@ -10,15 +10,31 @@
 [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
 [![codecov](https://codecov.io/gh/data-centric-ai/dcbench/branch/main/graph/badge.svg?token=MOLQYUSYQU)](https://codecov.io/gh/data-centric-ai/dcbench)

-dcbench tests various data-centric aspects of improving the quality of machine learning workflows.
+A benchmark of aspects of improving the quality of machine learning workflows.

 [**Getting Started**](⚡️-Quickstart)
-| [**What is Meerkat?**](💡-what-is-Meerkat)
-| [**Docs**](https://meerkat.readthedocs.io/en/latest/index.html)
+| [**What is dcbench?**](💡-what-is-dcbench)
+| [**Docs**](https://dcbench.readthedocs.io/en/latest/index.html)
 | [**Contributing**](CONTRIBUTING.md)
-| [**Blogpost**](https://www.notion.so/sabrieyuboglu/Meerkat-DataPanels-for-Machine-Learning-64891aca2c584f1889eb0129bb747863)
+| [**Website**](https://www.datacentricai.cc/)
 | [**About**](✉️-About)

+
+## ⚡️ Quickstart
+
+```bash
+pip install dcbench
+```
+
+Using a Jupyter notebook or some other interactive environment, you can import the library
+and explore the data-centric problems in the benchmark:
+
+```python
+import dcbench
+dcbench.problem_classes
+```
+
+## 💡 What is dcbench?
 This is a benchmark that tests various data-centric aspects of improving the quality of machine learning workflows.

 It features a growing list of *tasks*:
@@ -29,93 +45,66 @@ It features a growing list of *tasks*:
 * Minimal training dataset selection (`minitrain`)

 Each task features a collection of *scenarios* which are defined by datasets and ML pipeline elements (e.g. a model, feature pre-processors, etc.)
+## ⚙️ How does it work?
+### `Problem`

-## Basic Usage
-
-The very first step is to install the PyPI package:
-
-```bash
-pip install dcai
-```
-
-Then, we advise using Jupyter notebooks or some other interactive environment. You start off by importing the library and listing all the available artefacts:
-
-```python
-from dcai import scenarios
+This benchmark is a collection of *data-centric problems*. *What is a data-centric problem?* A useful analogy: chess problems are to a full chess game as *data-centric problems* are to the full data-centric ML lifecycle. For example, many machine-learning workflows include a label cleaning phase where labels are audited and corrected. Therefore, our benchmark includes a collection of label cleaning *problems*, each with a different dataset and set of sullied labels to be cleaned.

-scenarios.list()
-```
+The benchmark supports a diverse set of problems that may look very different from one another. For example, a slice discovery problem has different inputs and outputs than a data cleaning problem. To deal with this, we group problems by *problem class*. In `dcbench`, each problem class is represented by a subclass of `Problem` (*e.g.* `SliceDiscoveryProblem`, `MiniCleanProblem`). The problems themselves are represented by instances of these subclasses.

-You can then load a specific scenario and view its *artefacts*:
+We can get a list of all the problem classes in `dcbench` with:

 ```python
-scenario = scenarios.get("miniclean/bank")
-scenario.artefacts
-```
+import dcbench
+dcbench.problem_classes

-In the above example we are loading the `bank` scenario of the `miniclean` task. We can then load all the artefacts into a dictionary:
-
-```python
-a = scenario.artefacts.load()
+# OUT:
+[SliceDiscoveryProblem, MiniCleanProblem]
 ```

-This automatically downloads all the available artefacts, saves a local copy and loads it into memory. Artefacts can be accessed directly from the dictionary. We can then go ahead and write the code that will provide us with a scenario-specific solution:
+`dcbench` includes a set of problems for each task. We can list them with:

 ```python
-model.fit(a["X_train_dirty"], a["y_train"])
+from dcbench import SliceDiscoveryProblem
+SliceDiscoveryProblem.instances

-X_train_selection = ...
+# Out: TODO, get the actual dataframe output here
+dataframe
 ```

-Once we have an object (e.g. `X_train_selection`) containing the scenario-specific solution, we can package it into a solution object:
+We can get one of these problems with:

 ```python
-solution = scenario.solve(X_train_selection=X_train_selection)
+problem = SliceDiscoveryProblem.from_id("eda4")
 ```
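The `from_id` lookup above suggests a registry pattern. Here is a minimal sketch of how such a lookup might work; the class body and attribute names are hypothetical, not taken from `dcbench`:

```python
from __future__ import annotations


class Problem:
    """Base class; each subclass keeps its own registry of problem instances."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls._registry: dict[str, Problem] = {}

    def __init__(self, problem_id: str):
        self.problem_id = problem_id
        # Register the instance under its id so it can be found later.
        type(self)._registry[problem_id] = self

    @classmethod
    def from_id(cls, problem_id: str) -> "Problem":
        # Look up a previously registered problem instance by its id.
        return cls._registry[problem_id]


class SliceDiscoveryProblem(Problem):
    pass


SliceDiscoveryProblem("eda4")
problem = SliceDiscoveryProblem.from_id("eda4")
print(problem.problem_id)  # eda4
```

Keeping the registry per subclass (via `__init_subclass__`) means each problem class only ever returns its own instances.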

-We can then perform an evaluation on that solution that will give us the result:
+### `Artefact`

-```python
-solution.evaluate()
-solution.result
-```
-
-After you're happy with the obtained result, you can bundle your solution artefacts and see their location.
+Each *problem* is made up of a set of artefacts: a dataset with labels to clean, a dataset and a model to perform error analysis on. In `dcbench`, these artefacts are represented by instances of `Artefact`. We can think of each `Problem` object as a container for `Artefact` objects.

 ```python
-solution.save()
-solution.location
-```
+problem.artefacts

-After obtaining the `/path/to/your/artefacts` you can upload it as a bundle to [CodaLab](https://codalab.org/):
+# Out:
+{
+    "dataset": CSVArtefact()
+}

-```bash
-cl upload /path/to/your/artefacts
+artefact: CSVArtefact = problem["dataset"]
 ```
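The lazy, pointer-holding behaviour described for `Artefact` could be sketched roughly as follows. This is a hedged illustration only: the real `dcbench` API may differ, and the constructor arguments and placeholder download logic are invented for the example.

```python
import csv
import os
import tempfile


class CSVArtefact:
    """Sketch of a lazy artefact: it holds pointers to data, not the data itself."""

    def __init__(self, artefact_id, local_dir=None):
        self.artefact_id = artefact_id
        self.local_dir = local_dir or os.path.join(tempfile.gettempdir(), "artefacts")
        self.local_path = os.path.join(self.local_dir, artefact_id + ".csv")

    def download(self):
        # Real code would fetch the file from cloud storage; this sketch
        # just writes a placeholder CSV if it is not on disk yet.
        os.makedirs(self.local_dir, exist_ok=True)
        if not os.path.exists(self.local_path):
            with open(self.local_path, "w", newline="") as f:
                csv.writer(f).writerows([["a", "b"], ["1", "2"]])
        return self.local_path

    def load(self):
        # Loading downloads on demand, then reads the file into memory.
        with open(self.download(), newline="") as f:
            return list(csv.reader(f))


rows = CSVArtefact("demo-artefact").load()
```

Because the object stores only `artefact_id` and paths, a `Problem` holding many such artefacts stays lightweight until `load()` is called.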

-This command will display the URL of your uploaded bundle. It assumes that you have a user account on CodaLab ([click here](https://codalab-worksheets.readthedocs.io/en/latest/features/bundles/uploading/) for more info).
-
-After that, you simply go to our [FORM LINK](#), fill it in with all required details and paste the bundle link so we can run a full evaluation on it.
+Note that `Artefact` objects don't actually hold their underlying data in memory. Instead, they hold pointers to where the `Artefact` lives in [dcbench cloud storage](https://console.cloud.google.com/storage/browser/dcbench?authuser=1&project=hai-gcp-fine-grained&pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false) and, if it's been downloaded, where it lives locally on disk. This makes the `Problem` objects very lightweight.

-Congratulations! Your solution is now uploaded to our system and after evaluation it will show up on the [leaderboard](#).
+**Downloading to disk.** By default, `dcbench` downloads artefacts to `~/.dcbench/artefacts`, but this can be configured in the `dcbench` settings (TODO: add support for configuration). To download an `Artefact` via the Python API, use `artefact.download()`. You can also download all the artefacts in a problem with `problem.download()`.

-## Adding a Submitted solution to the Repo
+**Loading into memory.** `dcbench` includes loading functionality for each artefact type. To load an artefact into memory you can use `artefact.load()`. Note that this will also download the artefact if it hasn't yet been downloaded.

-This step is performed manually by us (although it could be possible to automate). It looks like this:
+Finally, we should point out that `problem` is a Python mapping, so we can index it directly to load artefacts.

-```bash
-dcai add-solution \
-    --scenario miniclean/bank \
-    --name MySolution \
-    --paper https://arxiv.org/abs/... \
-    --code https://github.com/... \
-    --artefacts-url https://worksheets.codalab.org/rest/bundles/...
+```python
+# this is equivalent to problem.artefacts["dataset"].load()
+df: pd.DataFrame = problem["dataset"]
 ```
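The "problem is a Python mapping" behaviour can be sketched with `collections.abc.Mapping`, where indexing loads the named artefact. Again, this is a hypothetical illustration with toy classes, not the actual `dcbench` implementation:

```python
from collections.abc import Mapping


class ToyArtefact:
    """Stand-in artefact whose load() returns in-memory data."""

    def __init__(self, data):
        self.data = data

    def load(self):
        return self.data


class ToyProblem(Mapping):
    """Indexing the problem loads the named artefact."""

    def __init__(self, artefacts):
        self.artefacts = artefacts

    def __getitem__(self, name):
        # problem["dataset"] is equivalent to problem.artefacts["dataset"].load()
        return self.artefacts[name].load()

    def __iter__(self):
        return iter(self.artefacts)

    def __len__(self):
        return len(self.artefacts)


problem = ToyProblem({"dataset": ToyArtefact([[1, 2], [3, 4]])})
print(problem["dataset"])  # [[1, 2], [3, 4]]
```

Subclassing `Mapping` only requires `__getitem__`, `__iter__`, and `__len__`; methods like `keys()` and `items()` come for free.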

-## Performing the Full Evaluation
-
-This step is performed by GitHub Actions and is triggered after each commit.
-
-```bash
-dcai evaluate --leaderboard-output /path/to/leaderboard/dir
-```
+## ✉️ About
+`dcbench` is being developed alongside the data-centric-ai benchmark. Reach out to Bojan Karlaš (karlasb [at] inf [dot] ethz [dot] ch) and Sabri Eyuboglu (eyuboglu [at] stanford [dot] edu) if you would like to get involved or contribute!

dcbench/__init__.py

Lines changed: 3 additions & 1 deletion

@@ -3,7 +3,9 @@
 # flake8: noqa

 from .__main__ import main
-from .common import Artefact, Solution
+from .common import Artefact, Problem, Solution
 from .tasks.miniclean import *
 from .tasks.slice import SliceDiscoveryProblem
 from .version import __version__
+
+problem_classes = Problem.__subclasses__()
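The new `problem_classes` attribute relies on Python's built-in `type.__subclasses__()`, which returns the direct subclasses of a class, so any `Problem` subclass imported above is picked up automatically. A self-contained illustration:

```python
class Problem:
    pass


class SliceDiscoveryProblem(Problem):
    pass


class MiniCleanProblem(Problem):
    pass


# __subclasses__() lists the direct (imported) subclasses in definition order.
problem_classes = Problem.__subclasses__()
print([cls.__name__ for cls in problem_classes])
# ['SliceDiscoveryProblem', 'MiniCleanProblem']
```

One consequence of this approach: a problem class only appears in the list once its module has been imported, which is why the package imports the task modules first.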

docs/banner.png

18.4 KB

0 commit comments
