
Commit 2791074: Release 0.14 (#1266)
1 parent 3380bbb, commit 2791074

56 files changed: +1031, -601 lines

.github/workflows/pre-commit.yaml (2 additions, 2 deletions)

```diff
@@ -7,10 +7,10 @@ jobs:
     runs-on: ubuntu-latest
     steps:
     - uses: actions/checkout@v3
-    - name: Setup Python 3.7
+    - name: Setup Python 3.8
       uses: actions/setup-python@v4
       with:
-        python-version: 3.7
+        python-version: 3.8
     - name: Install pre-commit
       run: |
         pip install pre-commit
```

.github/workflows/test.yml (2 additions, 0 deletions)

```diff
@@ -53,6 +53,7 @@ jobs:
           - os: windows-latest
             sklearn-only: 'false'
             scikit-learn: 0.24.*
+            scipy: 1.10.0
       fail-fast: false
       max-parallel: 4
@@ -113,5 +114,6 @@ jobs:
         uses: codecov/codecov-action@v3
         with:
           files: coverage.xml
+          token: ${{ secrets.CODECOV_TOKEN }}
           fail_ci_if_error: true
           verbose: true
```

.pre-commit-config.yaml (11 additions, 3 deletions)

```diff
@@ -1,11 +1,11 @@
 repos:
   - repo: https://github.com/psf/black
-    rev: 22.6.0
+    rev: 23.3.0
     hooks:
       - id: black
         args: [--line-length=100]
   - repo: https://github.com/pre-commit/mirrors-mypy
-    rev: v0.961
+    rev: v1.4.1
     hooks:
       - id: mypy
         name: mypy openml
@@ -19,8 +19,16 @@ repos:
         additional_dependencies:
           - types-requests
           - types-python-dateutil
+      - id: mypy
+        name: mypy top-level-functions
+        files: openml/_api_calls.py
+        additional_dependencies:
+          - types-requests
+          - types-python-dateutil
+        args: [ --disallow-untyped-defs, --disallow-any-generics,
+                --disallow-any-explicit, --implicit-optional ]
   - repo: https://github.com/pycqa/flake8
-    rev: 4.0.1
+    rev: 6.0.0
     hooks:
       - id: flake8
         name: flake8 openml
```

README.md (10 additions, 6 deletions)

````diff
@@ -20,15 +20,19 @@ following paper:
 
 [Matthias Feurer, Jan N. van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, Andreas Müller, Joaquin Vanschoren, Frank Hutter<br/>
 **OpenML-Python: an extensible Python API for OpenML**<br/>
-*arXiv:1911.02490 [cs.LG]*](https://arxiv.org/abs/1911.02490)
+Journal of Machine Learning Research, 22(100):1−5, 2021](https://www.jmlr.org/papers/v22/19-920.html)
 
 Bibtex entry:
 ```bibtex
-@article{feurer-arxiv19a,
-    author = {Matthias Feurer and Jan N. van Rijn and Arlind Kadra and Pieter Gijsbers and Neeratyoy Mallik and Sahithya Ravi and Andreas Müller and Joaquin Vanschoren and Frank Hutter},
-    title = {OpenML-Python: an extensible Python API for OpenML},
-    journal = {arXiv:1911.02490},
-    year = {2019},
+@article{JMLR:v22:19-920,
+    author = {Matthias Feurer and Jan N. van Rijn and Arlind Kadra and Pieter Gijsbers and Neeratyoy Mallik and Sahithya Ravi and Andreas Müller and Joaquin Vanschoren and Frank Hutter},
+    title = {OpenML-Python: an extensible Python API for OpenML},
+    journal = {Journal of Machine Learning Research},
+    year = {2021},
+    volume = {22},
+    number = {100},
+    pages = {1--5},
+    url = {http://jmlr.org/papers/v22/19-920.html}
 }
 ```
 
````

doc/index.rst (15 additions, 11 deletions)

```diff
@@ -30,7 +30,7 @@ Example
             ('estimator', tree.DecisionTreeClassifier())
         ]
     )
-    # Download the OpenML task for the german credit card dataset with 10-fold
+    # Download the OpenML task for the pendigits dataset with 10-fold
     # cross-validation.
     task = openml.tasks.get_task(32)
     # Run the scikit-learn model on the task.
@@ -93,17 +93,21 @@ Citing OpenML-Python
 If you use OpenML-Python in a scientific publication, we would appreciate a
 reference to the following paper:
 
-
-`OpenML-Python: an extensible Python API for OpenML
-<https://arxiv.org/abs/1911.02490>`_,
-Feurer *et al.*, arXiv:1911.02490.
+| Matthias Feurer, Jan N. van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, Andreas Müller, Joaquin Vanschoren, Frank Hutter
+| **OpenML-Python: an extensible Python API for OpenML**
+| Journal of Machine Learning Research, 22(100):1−5, 2021
+| `https://www.jmlr.org/papers/v22/19-920.html <https://www.jmlr.org/papers/v22/19-920.html>`_
 
 Bibtex entry::
 
-    @article{feurer-arxiv19a,
-        author = {Matthias Feurer and Jan N. van Rijn and Arlind Kadra and Pieter Gijsbers and Neeratyoy Mallik and Sahithya Ravi and Andreas Müller and Joaquin Vanschoren and Frank Hutter},
-        title = {OpenML-Python: an extensible Python API for OpenML},
-        journal = {arXiv:1911.02490},
-        year = {2019},
-    }
+    @article{JMLR:v22:19-920,
+        author = {Matthias Feurer and Jan N. van Rijn and Arlind Kadra and Pieter Gijsbers and Neeratyoy Mallik and Sahithya Ravi and Andreas Müller and Joaquin Vanschoren and Frank Hutter},
+        title = {OpenML-Python: an extensible Python API for OpenML},
+        journal = {Journal of Machine Learning Research},
+        year = {2021},
+        volume = {22},
+        number = {100},
+        pages = {1--5},
+        url = {http://jmlr.org/papers/v22/19-920.html}
+    }
```
doc/progress.rst (46 additions, 13 deletions)

```diff
@@ -6,22 +6,55 @@
 Changelog
 =========
 
+0.14.0
+~~~~~~
+
+**IMPORTANT:** This release paves the way towards a breaking update of OpenML-Python. From version
+0.15, functions that had the option to return a pandas DataFrame will return a pandas DataFrame
+by default. This version (0.14) emits a warning if you still use the old access functionality.
+More concretely:
+
+* In 0.15 we will drop the ability to return dictionaries in listing calls and only provide
+  pandas DataFrames. To disable warnings in 0.14 you have to request a pandas DataFrame
+  (using ``output_format="dataframe"``).
+* In 0.15 we will drop the ability to return datasets as numpy arrays and only provide
+  pandas DataFrames. To disable warnings in 0.14 you have to request a pandas DataFrame
+  (using ``dataset_format="dataframe"``).
+
+Furthermore, from version 0.15, OpenML-Python will no longer download datasets and dataset metadata
+by default. This version (0.14) emits a warning if you don't explicitly specify the desired behavior.
+
+Please see the pull requests #1258 and #1260 for further information.
+
+* ADD #1081: New flag that allows disabling downloading dataset features.
+* ADD #1132: New flag that forces a redownload of cached data.
+* FIX #1244: Fixes a rare bug where task listing could fail when the server returned invalid data.
+* DOC #1229: Fixes a comment string for the main example.
+* DOC #1241: Fixes a comment in an example.
+* MAINT #1124: Improve naming of helper functions that govern the cache directories.
+* MAINT #1223, #1250: Update tools used in pre-commit to the latest versions (``black==23.3.0``, ``mypy==1.3.0``, ``flake8==6.0.0``).
+* MAINT #1253: Update the citation request to the JMLR paper.
+* MAINT #1246: Add a warning that checking for duplicate runs on the server cannot be done without an API key.
+
 0.13.1
 ~~~~~~
 
-* ADD #1028: Add functions to delete runs, flows, datasets, and tasks (e.g., ``openml.datasets.delete_dataset``).
-* ADD #1144: Add locally computed results to the ``OpenMLRun`` object's representation if the run was created locally and not downloaded from the server.
-* ADD #1180: Improve the error message when the checksum of a downloaded dataset does not match the checksum provided by the API.
-* ADD #1201: Make ``OpenMLTraceIteration`` a dataclass.
-* DOC #1069: Add argument documentation for the ``OpenMLRun`` class.
-* FIX #1197 #559 #1131: Fix the order of ground truth and predictions in the ``OpenMLRun`` object and in ``format_prediction``.
-* FIX #1198: Support numpy 1.24 and higher.
-* FIX #1216: Allow unknown task types on the server. This is only relevant when new task types are added to the test server.
-* MAINT #1155: Add dependabot github action to automatically update other github actions.
-* MAINT #1199: Obtain pre-commit's flake8 from github.com instead of gitlab.com.
-* MAINT #1215: Support latest numpy version.
-* MAINT #1218: Test Python3.6 on Ubuntu 20.04 instead of the latest Ubuntu (which is 22.04).
-* MAINT #1221 #1212 #1206 #1211: Update github actions to the latest versions.
+* ADD #1081 #1132: Add additional options for (not) downloading datasets in ``openml.datasets.get_dataset`` and for cache management.
+* ADD #1028: Add functions to delete runs, flows, datasets, and tasks (e.g., ``openml.datasets.delete_dataset``).
+* ADD #1144: Add locally computed results to the ``OpenMLRun`` object's representation if the run was created locally and not downloaded from the server.
+* ADD #1180: Improve the error message when the checksum of a downloaded dataset does not match the checksum provided by the API.
+* ADD #1201: Make ``OpenMLTraceIteration`` a dataclass.
+* DOC #1069: Add argument documentation for the ``OpenMLRun`` class.
+* DOC #1241 #1229 #1231: Minor documentation fixes and resolve documentation examples not working.
+* FIX #1197 #559 #1131: Fix the order of ground truth and predictions in the ``OpenMLRun`` object and in ``format_prediction``.
+* FIX #1198: Support numpy 1.24 and higher.
+* FIX #1216: Allow unknown task types on the server. This is only relevant when new task types are added to the test server.
+* FIX #1223: Fix mypy errors for implicit optional typing.
+* MAINT #1155: Add dependabot github action to automatically update other github actions.
+* MAINT #1199: Obtain pre-commit's flake8 from github.com instead of gitlab.com.
+* MAINT #1215: Support latest numpy version.
+* MAINT #1218: Test Python3.6 on Ubuntu 20.04 instead of the latest Ubuntu (which is 22.04).
+* MAINT #1221 #1212 #1206 #1211: Update github actions to the latest versions.
 
 0.13.0
 ~~~~~~
```

examples/20_basic/simple_flows_and_runs_tutorial.py (1 addition, 1 deletion)

```diff
@@ -23,7 +23,7 @@
 # NOTE: We are using dataset 20 from the test server: https://test.openml.org/d/20
 dataset = openml.datasets.get_dataset(20)
 X, y, categorical_indicator, attribute_names = dataset.get_data(
-    dataset_format="array", target=dataset.default_target_attribute
+    target=dataset.default_target_attribute
 )
 clf = neighbors.KNeighborsClassifier(n_neighbors=3)
 clf.fit(X, y)
```

examples/30_extended/configure_logging.py (2 additions, 2 deletions)

```diff
@@ -37,8 +37,8 @@
 
 import logging
 
-openml.config.console_log.setLevel(logging.DEBUG)
-openml.config.file_log.setLevel(logging.WARNING)
+openml.config.set_console_log_level(logging.DEBUG)
+openml.config.set_file_log_level(logging.WARNING)
 openml.datasets.get_dataset("iris")
 
 # Now the log level that was previously written to file should also be shown in the console.
```

examples/30_extended/custom_flow_.py (5 additions, 1 deletion)

```diff
@@ -77,6 +77,8 @@
 # you can use the Random Forest Classifier flow as a *subflow*. It allows for
 # all hyperparameters of the Random Classifier Flow to also be specified in your pipeline flow.
 #
+# Note: you can currently only specify one subflow as part of the components.
+#
 # In this example, the auto-sklearn flow is a subflow: the auto-sklearn flow is entirely executed as part of this flow.
 # This allows people to specify auto-sklearn hyperparameters used in this flow.
 # In general, using a subflow is not required.
@@ -87,6 +89,8 @@
 autosklearn_flow = openml.flows.get_flow(9313)  # auto-sklearn 0.5.1
 subflow = dict(
     components=OrderedDict(automl_tool=autosklearn_flow),
+    # If you do not want to reference a subflow, you can use the following:
+    # components=OrderedDict(),
 )
 
 ####################################################################################################
@@ -124,7 +128,7 @@
     OrderedDict([("oml:name", "time"), ("oml:value", 120), ("oml:component", flow_id)]),
 ]
 
-task_id = 1965  # Iris Task
+task_id = 1200  # Iris Task
 task = openml.tasks.get_task(task_id)
 dataset_id = task.get_dataset().dataset_id
```
examples/30_extended/datasets_tutorial.py (15 additions, 20 deletions)

```diff
@@ -21,10 +21,9 @@
 # * Use the output_format parameter to select output type
 # * Default gives 'dict' (other option: 'dataframe', see below)
 #
-openml_list = openml.datasets.list_datasets()  # returns a dict
-
-# Show a nice table with some key data properties
-datalist = pd.DataFrame.from_dict(openml_list, orient="index")
+# Note: list_datasets will return a pandas dataframe by default from 0.15. When using
+# openml-python 0.14, `list_datasets` will warn you to use output_format='dataframe'.
+datalist = openml.datasets.list_datasets(output_format="dataframe")
 datalist = datalist[["did", "name", "NumberOfInstances", "NumberOfFeatures", "NumberOfClasses"]]
 
 print(f"First 10 of {len(datalist)} datasets...")
@@ -65,23 +64,16 @@
 ############################################################################
 # Get the actual data.
 #
-# The dataset can be returned in 3 possible formats: as a NumPy array, a SciPy
-# sparse matrix, or as a Pandas DataFrame. The format is
-# controlled with the parameter ``dataset_format`` which can be either 'array'
-# (default) or 'dataframe'. Let's first build our dataset from a NumPy array
-# and manually create a dataframe.
-X, y, categorical_indicator, attribute_names = dataset.get_data(
-    dataset_format="array", target=dataset.default_target_attribute
-)
-eeg = pd.DataFrame(X, columns=attribute_names)
-eeg["class"] = y
-print(eeg[:10])
+# openml-python returns data as pandas dataframes (stored in the `eeg` variable below),
+# and also some additional metadata that we don't care about right now.
+eeg, *_ = dataset.get_data()
 
 ############################################################################
-# Instead of manually creating the dataframe, you can already request a
-# dataframe with the correct dtypes.
+# You can optionally choose to have openml separate out a column from the
+# dataset. In particular, many datasets for supervised problems have a set
+# `default_target_attribute` which may help identify the target variable.
 X, y, categorical_indicator, attribute_names = dataset.get_data(
-    target=dataset.default_target_attribute, dataset_format="dataframe"
+    target=dataset.default_target_attribute
 )
 print(X.head())
 print(X.info())
@@ -92,6 +84,9 @@
 # data file. The dataset object can be used as normal.
 # Whenever you use any functionality that requires the data,
 # such as `get_data`, the data will be downloaded.
+# Starting from 0.15, not downloading data will be the default behavior instead.
+# The data will be downloaded automatically when you try to access it through
+# openml objects, e.g., using `dataset.features`.
 dataset = openml.datasets.get_dataset(1471, download_data=False)
 
@@ -100,8 +95,8 @@
 # * Explore the data visually.
 eegs = eeg.sample(n=1000)
 _ = pd.plotting.scatter_matrix(
-    eegs.iloc[:100, :4],
-    c=eegs[:100]["class"],
+    X.iloc[:100, :4],
+    c=y[:100],
     figsize=(10, 10),
     marker="o",
     hist_kwds={"bins": 20},
```
