Using a Jupyter notebook or some other interactive environment, you can import the library
and explore the data-centric problems in the benchmark:
```python
import dcbench
dcbench.problem_classes
```
## 💡 What is dcbench?
This is a benchmark that tests various data-centric aspects of improving the quality of machine learning workflows.
It features a growing list of *tasks*:
* Minimal training dataset selection (`minitrain`)
Each task features a collection of *scenarios*, which are defined by datasets and ML pipeline elements (e.g. a model, feature pre-processors, etc.).
## ⚙️ How does it work?
### `Problem`
This benchmark is a collection of *data-centric problems*. *What is a data-centric problem?* A useful analogy: chess problems are to a full chess game as *data-centric problems* are to the full data-centric ML lifecycle. For example, many machine-learning workflows include a label cleaning phase where labels are audited and corrected. Our benchmark therefore includes a collection of label cleaning *problems*, each with a different dataset and a different set of sullied labels to be cleaned.
The benchmark supports a diverse set of problems that may look very different from one another. For example, a slice discovery problem has different inputs and outputs than a data cleaning problem. To deal with this, we group problems by *problem class*. In `dcbench`, each problem class is represented by a subclass of `Problem` (e.g. `SliceDiscoveryProblem`, `MiniCleanProblem`). The problems themselves are represented by instances of these subclasses.
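As a rough illustration of this design (a minimal, self-contained sketch with made-up class bodies — not the actual `dcbench` implementation), a problem class is a subclass of a shared base, and an individual problem is an instance of that subclass:

```python
class Problem:
    """Illustrative base class: every problem class subclasses this."""

    def __init__(self, name: str):
        self.name = name


class SliceDiscoveryProblem(Problem):
    """Illustrative problem class for slice discovery."""


class MiniCleanProblem(Problem):
    """Illustrative problem class for data cleaning."""


# A concrete problem is an instance of its problem class,
# so it is also an instance of the shared base class.
problem = MiniCleanProblem("bank")
print(isinstance(problem, Problem))  # → True
print(type(problem).__name__)        # → MiniCleanProblem
```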
We can get a list of all the problem classes in `dcbench` with:
```python
import dcbench
dcbench.problem_classes
# Out:
[SliceDiscoveryProblem, MiniCleanProblem]
```
`dcbench` includes a set of problems for each task. We can list them with:
```python
from dcbench import SliceDiscoveryProblem
SliceDiscoveryProblem.instances
# Out: a DataFrame describing the available problem instances
```
### `Artefact`
Each *problem* is made up of a set of artefacts: for example, a dataset with labels to clean, or a dataset and a model to perform error analysis on. In `dcbench`, these artefacts are represented by instances of `Artefact`. We can think of each `Problem` object as a container for `Artefact` objects.
```python
problem.artefacts
# Out:
{
"dataset": CSVArtefact()
}
artefact: CSVArtefact = problem["dataset"]
```
Note that `Artefact` objects don't actually hold their underlying data in memory. Instead, they hold pointers to where the `Artefact` lives in [dcbench cloud storage](https://console.cloud.google.com/storage/browser/dcbench?authuser=1&project=hai-gcp-fine-grained&pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false) and, if it's been downloaded, where it lives locally on disk. This makes the `Problem` objects very lightweight.
**Downloading to disk.** By default, `dcbench` downloads artefacts to `~/.dcbench/artefacts`, but this can be configured in the `dcbench` settings (TODO: add support for configuration). To download an `Artefact` via the Python API, use `artefact.download()`. You can also download all the artefacts in a problem with `problem.download()`.
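As a mental model of the caching behaviour (the helpers below are illustrative assumptions, not part of the `dcbench` API), an artefact's local cache location can be thought of as its identifier resolved under the download directory:

```python
from pathlib import Path

# Illustrative sketch only: dcbench's real download logic may differ.
DEFAULT_ARTEFACT_DIR = Path.home() / ".dcbench" / "artefacts"


def local_path(artefact_id: str, root: Path = DEFAULT_ARTEFACT_DIR) -> Path:
    """Map an artefact identifier to its on-disk cache location."""
    return root / artefact_id


def is_downloaded(artefact_id: str, root: Path = DEFAULT_ARTEFACT_DIR) -> bool:
    """An artefact counts as downloaded once its cache file exists."""
    return local_path(artefact_id, root).exists()


print(local_path("slice_discovery/dataset.csv"))
```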
**Loading into memory.** `dcbench` includes loading functionality for each artefact type. To load an artefact into memory, use `artefact.load()`. Note that this will also download the artefact if it hasn't yet been downloaded.
Finally, we should point out that `problem` is a Python mapping, so we can index it directly to load artefacts.
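To make the mapping behaviour concrete, here is a minimal, self-contained sketch; the class bodies and the load-on-index behaviour are illustrative assumptions, not the real `dcbench` internals:

```python
from collections.abc import Mapping


class CSVArtefact:
    """Illustrative artefact: wraps data and loads it on demand."""

    def __init__(self, data):
        self._data = data  # stand-in for a remote/local CSV file

    def load(self):
        return self._data


class Problem(Mapping):
    """Illustrative problem: a mapping from artefact names to loaded data."""

    def __init__(self, artefacts):
        self.artefacts = artefacts

    def __getitem__(self, name):
        # Indexing the problem loads the named artefact.
        return self.artefacts[name].load()

    def __iter__(self):
        return iter(self.artefacts)

    def __len__(self):
        return len(self.artefacts)


problem = Problem({"dataset": CSVArtefact([[1, 2], [3, 4]])})
print(problem["dataset"])  # → [[1, 2], [3, 4]]
```

Because `Problem` implements the `Mapping` interface, the usual mapping operations (`len(problem)`, iteration, `"dataset" in problem`) come for free.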
`dcbench` is being developed alongside the data-centric-ai benchmark. Reach out to Bojan Karlaš (karlasb [at] inf [dot] ethz [dot] ch) and Sabri Eyuboglu (eyuboglu [at] stanford [dot] edu) if you would like to get involved or contribute!