Commit bd8ae14
Fix 1013: Store run setup_string (#1015)
* Test setup_string is stored and retrievable
* Add setup_string to run dictionary representation
* Add fix to release notes
* Test setup_string in xml without roundtrip.
  Also moved the test to OpenMLRun, since it mainly tests the OpenMLRun behavior, not a function from openml.runs.functions.
* Serialize run_details
* Update with merged PRs since 0.11.0
* Prepare for run_details being provided by the server
* Remove pipeline code from setup_string.
  Long pipelines (e.g. grid searches) could lead to setup strings that were too long, which prevented run uploads.
  Also add mypy ignores for old errors which weren't yet vetted by mypy.
1 parent: f94672e

File tree

5 files changed: +70 -16 lines changed
doc/progress.rst (26 additions, 5 deletions)

@@ -8,12 +8,33 @@ Changelog

 0.11.1
 ~~~~~~
-* MAINT #1018: Refactor data loading and storage. Data is now compressed on the first call to `get_data`.
-* MAINT #891: Changed the way that numerical features are stored. Numerical features that range from 0 to 255 are now stored as uint8, which reduces the storage space required as well as storing and loading times.
-* MAINT #671: Improved the performance of ``check_datasets_active`` by only querying the given list of datasets in contrast to querying all datasets. Modified the corresponding unit test.
-* FIX #964: Validate ``ignore_attribute``, ``default_target_attribute``, ``row_id_attribute`` are set to attributes that exist on the dataset when calling ``create_dataset``.
-* DOC #973: Change the task used in the welcome page example so it no longer fails using a numerical dataset.
+* ADD #964: Validate ``ignore_attribute``, ``default_target_attribute``, ``row_id_attribute`` are set to attributes that exist on the dataset when calling ``create_dataset``.
+* ADD #979: Dataset features and qualities are now also cached in pickle format.
+* ADD #982: Add helper functions for column transformers.
+* ADD #989: ``run_model_on_task`` will now warn the user that the model passed has already been fitted.
 * ADD #1009: Give possibility to not download the dataset qualities. The cached version is used even if the download attribute is false.
+* ADD #1016: Add scikit-learn 0.24 support.
+* ADD #1020: Add option to parallelize evaluation of tasks with joblib.
+* ADD #1022: Allow minimum version of dependencies to be listed for a flow, use more accurate minimum versions for scikit-learn dependencies.
+* ADD #1023: Add admin-only calls for adding topics to datasets.
+* ADD #1029: Add support for fetching dataset from a minio server in parquet format.
+* ADD #1031: Generally improve runtime measurements, add them for some previously unsupported flows (e.g. BaseSearchCV derived flows).
+* DOC #973: Change the task used in the welcome page example so it no longer fails using a numerical dataset.
+* MAINT #671: Improved the performance of ``check_datasets_active`` by only querying the given list of datasets in contrast to querying all datasets. Modified the corresponding unit test.
+* MAINT #891: Changed the way that numerical features are stored. Numerical features that range from 0 to 255 are now stored as uint8, which reduces the storage space required as well as storing and loading times.
+* MAINT #975, #988: Add CI through GitHub Actions.
+* MAINT #977: Allow ``short`` and ``long`` scenarios for unit tests. Reduce the workload for some unit tests.
+* MAINT #985, #1000: Improve unit test stability and output readability, and add load balancing.
+* MAINT #1018: Refactor data loading and storage. Data is now compressed on the first call to ``get_data``.
+* MAINT #1024: Remove flaky decorator for study unit test.
+* FIX #883 #884 #906 #972: Various improvements to the caching system.
+* FIX #980: Speed up ``check_datasets_active``.
+* FIX #984: Add a retry mechanism when the server encounters a database issue.
+* FIX #1004: Fixed an issue that prevented installation on some systems (e.g. Ubuntu).
+* FIX #1013: Fixes a bug where ``OpenMLRun.setup_string`` was not uploaded to the server; prepares for ``run_details`` being sent from the server.
+* FIX #1021: Fixes an issue that could occur when running unit tests and openml-python was not in PATH.
+* FIX #1037: Fixes a bug where a dataset could not be loaded if a categorical value had listed nan-like as a possible category.
+
 0.11.0
 ~~~~~~
 * ADD #753: Allows uploading custom flows to OpenML via OpenML-Python.

openml/extensions/sklearn/extension.py (11 additions, 9 deletions)

@@ -52,7 +52,10 @@

 SIMPLE_NUMPY_TYPES = [
-    nptype for type_cat, nptypes in np.sctypes.items() for nptype in nptypes if type_cat != "others"
+    nptype
+    for type_cat, nptypes in np.sctypes.items()
+    for nptype in nptypes  # type: ignore
+    if type_cat != "others"
 ]
 SIMPLE_TYPES = tuple([bool, int, float, str] + SIMPLE_NUMPY_TYPES)

@@ -546,7 +549,7 @@ def get_version_information(self) -> List[str]:
         major, minor, micro, _, _ = sys.version_info
         python_version = "Python_{}.".format(".".join([str(major), str(minor), str(micro)]))
         sklearn_version = "Sklearn_{}.".format(sklearn.__version__)
-        numpy_version = "NumPy_{}.".format(numpy.__version__)
+        numpy_version = "NumPy_{}.".format(numpy.__version__)  # type: ignore
         scipy_version = "SciPy_{}.".format(scipy.__version__)

         return [python_version, sklearn_version, numpy_version, scipy_version]

@@ -563,8 +566,7 @@ def create_setup_string(self, model: Any) -> str:
         str
         """
         run_environment = " ".join(self.get_version_information())
-        # fixme str(model) might contain (...)
-        return run_environment + " " + str(model)
+        return run_environment

     def _is_cross_validator(self, o: Any) -> bool:
         return isinstance(o, sklearn.model_selection.BaseCrossValidator)

@@ -1237,11 +1239,11 @@ def _check_dependencies(self, dependencies: str, strict_version: bool = True) ->
     def _serialize_type(self, o: Any) -> "OrderedDict[str, str]":
         mapping = {
             float: "float",
-            np.float: "np.float",
+            np.float: "np.float",  # type: ignore
             np.float32: "np.float32",
             np.float64: "np.float64",
             int: "int",
-            np.int: "np.int",
+            np.int: "np.int",  # type: ignore
             np.int32: "np.int32",
             np.int64: "np.int64",
         }

@@ -1253,11 +1255,11 @@ def _serialize_type(self, o: Any) -> "OrderedDict[str, str]":
     def _deserialize_type(self, o: str) -> Any:
         mapping = {
             "float": float,
-            "np.float": np.float,
+            "np.float": np.float,  # type: ignore
             "np.float32": np.float32,
             "np.float64": np.float64,
             "int": int,
-            "np.int": np.int,
+            "np.int": np.int,  # type: ignore
             "np.int32": np.int32,
             "np.int64": np.int64,
         }

@@ -1675,7 +1677,7 @@ def _run_model_on_fold(
         """

         def _prediction_to_probabilities(
-            y: np.ndarray, model_classes: List[Any], class_labels: Optional[List[str]]
+            y: Union[np.ndarray, List], model_classes: List[Any], class_labels: Optional[List[str]]
         ) -> pd.DataFrame:
             """Transforms predicted probabilities to match with OpenML class indices.
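The `create_setup_string` change above stops appending `str(model)` to the setup string, because the repr of a large pipeline (e.g. a grid search over many parameters) could exceed the length the server accepts and block the run upload. A minimal self-contained sketch of the before/after behavior (the helper names and hard-coded version strings below are stand-ins, not the library's actual code):

```python
# Hedged sketch: why dropping str(model) bounds the setup string length.
# All names here are illustrative stand-ins, not openml-python APIs.

def get_version_information():
    # Stand-in for the extension method that reports Python, sklearn,
    # NumPy and SciPy versions (hard-coded example values).
    return ["Python_3.8.0.", "Sklearn_0.24.0.", "NumPy_1.19.0.", "SciPy_1.5.0."]

def create_setup_string_old(model_repr: str) -> str:
    # Old behavior: version info plus str(model) -- unbounded length.
    return " ".join(get_version_information()) + " " + model_repr

def create_setup_string_new() -> str:
    # New behavior: version info only, so the length stays small.
    return " ".join(get_version_information())

# A grid search over many parameters can have a huge repr:
huge_model_repr = "GridSearchCV(" + "param," * 10000 + ")"
print(len(create_setup_string_old(huge_model_repr)))  # tens of thousands of characters
print(len(create_setup_string_new()))                 # a short, bounded string
```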

openml/runs/functions.py (5 additions, 0 deletions)

@@ -805,6 +805,9 @@ def obtain_field(xml_obj, fieldname, from_server, cast=None):
     flow_name = obtain_field(run, "oml:flow_name", from_server)
     setup_id = obtain_field(run, "oml:setup_id", from_server, cast=int)
     setup_string = obtain_field(run, "oml:setup_string", from_server)
+    # run_details is currently not sent by the server, so we need to retrieve it safely.
+    # Whenever that's resolved, we can enforce it being present (OpenML#1087).
+    run_details = obtain_field(run, "oml:run_details", from_server=False)

     if "oml:input_data" in run:
         dataset_id = int(run["oml:input_data"]["oml:dataset"]["oml:did"])

@@ -827,6 +830,7 @@ def obtain_field(xml_obj, fieldname, from_server, cast=None):
     if "oml:output_data" not in run:
         if from_server:
             raise ValueError("Run does not contain output_data (OpenML server error?)")
+        predictions_url = None
     else:
         output_data = run["oml:output_data"]
         predictions_url = None

@@ -911,6 +915,7 @@ def obtain_field(xml_obj, fieldname, from_server, cast=None):
         sample_evaluations=sample_evaluations,
         tags=tags,
         predictions_url=predictions_url,
+        run_details=run_details,
     )
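The new `run_details` lookup passes `from_server=False` so the absent field yields `None` instead of an error. A simplified sketch of the `obtain_field` contract this relies on (assuming, as the surrounding code suggests, that a mandatory server field raises when missing; this is an illustration, not the exact library implementation):

```python
# Simplified stand-in for obtain_field: a field is mandatory when it
# comes from the server, optional otherwise.
def obtain_field(xml_obj, fieldname, from_server, cast=None):
    if fieldname in xml_obj:
        value = xml_obj[fieldname]
        return cast(value) if cast is not None else value
    if from_server:
        raise ValueError("Run XML is missing required field %s." % fieldname)
    return None

run = {"oml:setup_id": "7", "oml:setup_string": "Python_3.8.0. Sklearn_0.24.0."}
print(obtain_field(run, "oml:setup_id", True, cast=int))        # 7
print(obtain_field(run, "oml:run_details", from_server=False))  # None
```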

openml/runs/run.py (10 additions, 2 deletions)

@@ -57,7 +57,9 @@ class OpenMLRun(OpenMLBase):
     run_id: int
     description_text: str, optional
         Description text to add to the predictions file.
-        If left None,
+        If left None, is set to the time the arff file is generated.
+    run_details: str, optional (default=None)
+        Description of the run stored in the run meta-data.
     """

     def __init__(

@@ -86,6 +88,7 @@ def __init__(
         flow=None,
         run_id=None,
         description_text=None,
+        run_details=None,
     ):
         self.uploader = uploader
         self.uploader_name = uploader_name

@@ -112,6 +115,7 @@ def __init__(
         self.tags = tags
         self.predictions_url = predictions_url
         self.description_text = description_text
+        self.run_details = run_details

     @property
     def id(self) -> Optional[int]:

@@ -543,11 +547,15 @@ def _to_dict(self) -> "OrderedDict[str, OrderedDict]":
         description["oml:run"]["@xmlns:oml"] = "http://openml.org/openml"
         description["oml:run"]["oml:task_id"] = self.task_id
         description["oml:run"]["oml:flow_id"] = self.flow_id
+        if self.setup_string is not None:
+            description["oml:run"]["oml:setup_string"] = self.setup_string
         if self.error_message is not None:
             description["oml:run"]["oml:error_message"] = self.error_message
+        if self.run_details is not None:
+            description["oml:run"]["oml:run_details"] = self.run_details
         description["oml:run"]["oml:parameter_setting"] = self.parameter_settings
         if self.tags is not None:
-            description["oml:run"]["oml:tag"] = self.tags  # Tags describing the run
+            description["oml:run"]["oml:tag"] = self.tags
         if (self.fold_evaluations is not None and len(self.fold_evaluations) > 0) or (
             self.sample_evaluations is not None and len(self.sample_evaluations) > 0
         ):

tests/test_runs/test_run.py (18 additions, 0 deletions)

@@ -5,11 +5,13 @@
 import os
 from time import time

+import xmltodict
 from sklearn.dummy import DummyClassifier
 from sklearn.tree import DecisionTreeClassifier
 from sklearn.model_selection import GridSearchCV
 from sklearn.pipeline import Pipeline

+from openml import OpenMLRun
 from openml.testing import TestBase, SimpleImputer
 import openml
 import openml.extensions.sklearn

@@ -215,3 +217,19 @@ def test_publish_with_local_loaded_flow(self):
         # make sure the flow is published as part of publishing the run.
         self.assertTrue(openml.flows.flow_exists(flow.name, flow.external_version))
         openml.runs.get_run(loaded_run.run_id)
+
+    def test_run_setup_string_included_in_xml(self):
+        SETUP_STRING = "setup-string"
+        run = OpenMLRun(
+            task_id=0,
+            flow_id=None,  # if not None, flow parameters are required.
+            dataset_id=0,
+            setup_string=SETUP_STRING,
+        )
+        xml = run._to_xml()
+        run_dict = xmltodict.parse(xml)["oml:run"]
+        assert "oml:setup_string" in run_dict
+        assert run_dict["oml:setup_string"] == SETUP_STRING
+
+        recreated_run = openml.runs.functions._create_run_from_xml(xml, from_server=False)
+        assert recreated_run.setup_string == SETUP_STRING
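The new test depends on `xmltodict` and the `OpenMLRun` internals. The same check, that the generated run XML carries the setup string, can be sketched with only the standard library (a hypothetical stand-in with a hand-built XML snippet, not the project's test):

```python
import xml.etree.ElementTree as ET

NS = "http://openml.org/openml"
SETUP_STRING = "setup-string"

# A hand-built run description in the shape the run XML uses
# (abbreviated; only the fields needed for this check).
xml = (
    '<oml:run xmlns:oml="%s">' % NS
    + "<oml:task_id>0</oml:task_id>"
    + "<oml:setup_string>%s</oml:setup_string>" % SETUP_STRING
    + "</oml:run>"
)

root = ET.fromstring(xml)
# ElementTree expands the oml: prefix to the full namespace URI.
parsed = root.find("{%s}setup_string" % NS).text
print(parsed)  # setup-string
```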

0 commit comments