Commit 7c0a77d

janvanrijn authored and mfeurer committed
Adds flow.get_structure and flow.get_subflow (which are complements of each other). Also fixes #564 (#567)
* fixes minor indentation problems
* initial commit
* adds a function to deduce the flow structure
* removes sklearn converter from this PR
* added main functionality
* fix code quality
* adds flow name to setup test file
* adds functionality to return sklearn parameter name into openml flow name
* PEP8 fixes
* changed structure of PR, such that get_structure is not part of flow class. updated unit tests accordingly
* pep8 fix
* fixes last typo
* flow name doc string
* also added additional filter for task list
* renamed id argument of parameter object (for code quality)
* fix reference to input id
* updated reinitialize model fn
* removed imputer (deprecated)
* fixes PEP8 problems
* pep8
* PEP8
* incorporated changes by Matthias
* fix 604
* bugfix
* flake fix
* import error
* removed sentence
* updated comment
1 parent 04c4d0e commit 7c0a77d

13 files changed, +383 -87 lines changed

examples/run_setup_tutorial.py

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
+"""
+=========
+Run Setup
+=========
+
+By: Jan N. van Rijn
+
+One of the key features of the openml-python library is that it allows you to
+reinstantiate flows with hyperparameter settings that were uploaded before.
+This tutorial uses the concept of setups. Although setups are not extensively
+described in the OpenML documentation (because most users will not directly
+use them), they form an important concept within OpenML distinguishing between
+hyperparameter configurations.
+A setup is the combination of a flow with all its hyperparameters set.
+
+A key requirement for reinstantiating a flow is to have the same scikit-learn
+version as the flow that was uploaded. However, this tutorial will upload the
+flow (that will later be reinstantiated) itself, so it can be run with any
+scikit-learn version that is supported by this library. In this case, the
+requirement of the corresponding scikit-learn versions is automatically met.
+
+In this tutorial we will
+    1) Create a flow and use it to solve a task;
+    2) Download the flow, reinstantiate the model with same hyperparameters,
+       and solve the same task again;
+    3) Verify that the obtained results are exactly the same.
+"""
+import logging
+import numpy as np
+import openml
+import sklearn.ensemble
+import sklearn.impute
+import sklearn.pipeline
+
+
+root = logging.getLogger()
+root.setLevel(logging.INFO)
+
+###############################################################################
+# 1) Create a flow and use it to solve a task
+###############################################################################
+
+# first, let's download the task that we are interested in
+task = openml.tasks.get_task(6)
+
+
+# we will create a fairly complex model, with many preprocessing components and
+# many potential hyperparameters. Of course, the model can be as complex or as
+# simple as you want it to be
+model_original = sklearn.pipeline.make_pipeline(
+    sklearn.impute.SimpleImputer(),
+    sklearn.ensemble.RandomForestClassifier()
+)
+
+
+# Let's change some hyperparameters. Of course, in any good application we
+# would tune them using, e.g., Random Search or Bayesian Optimization, but for
+# the purpose of this tutorial we set them to some specific values that might
+# or might not be optimal
+hyperparameters_original = {
+    'simpleimputer__strategy': 'median',
+    'randomforestclassifier__criterion': 'entropy',
+    'randomforestclassifier__max_features': 0.2,
+    'randomforestclassifier__min_samples_leaf': 1,
+    'randomforestclassifier__n_estimators': 16,
+    'randomforestclassifier__random_state': 42,
+}
+model_original.set_params(**hyperparameters_original)
+
+# solve the task and upload the result (this implicitly creates the flow)
+run = openml.runs.run_model_on_task(
+    model_original,
+    task,
+    avoid_duplicate_runs=False)
+run_original = run.publish()  # this implicitly uploads the flow
+
+###############################################################################
+# 2) Download the flow, reinstantiate the model with same hyperparameters,
+# and solve the same task again.
+###############################################################################
+
+# obtain setup id (note that the setup id is assigned by the OpenML server -
+# therefore it was not yet available in our local copy of the run)
+run_downloaded = openml.runs.get_run(run_original.run_id)
+setup_id = run_downloaded.setup_id
+
+# after this, we can easily reinstantiate the model
+model_duplicate = openml.setups.initialize_model(setup_id)
+# it will automatically have all the hyperparameters set
+
+# and run the task again
+run_duplicate = openml.runs.run_model_on_task(
+    model_duplicate, task, avoid_duplicate_runs=False)
+
+
+###############################################################################
+# 3) We will verify that the obtained results are exactly the same.
+###############################################################################
+
+# the run has stored all predictions in the field data content
+np.testing.assert_array_equal(run_original.data_content,
+                              run_duplicate.data_content)

openml/flows/__init__.py

Lines changed: 5 additions & 4 deletions
@@ -1,7 +1,8 @@
-from .flow import OpenMLFlow, _copy_server_fields
+from .flow import OpenMLFlow
 
-from .sklearn_converter import sklearn_to_flow, flow_to_sklearn, _check_n_jobs
+from .sklearn_converter import sklearn_to_flow, flow_to_sklearn, \
+    openml_param_name_to_sklearn
 from .functions import get_flow, list_flows, flow_exists, assert_flows_equal
 
-__all__ = ['OpenMLFlow', 'create_flow_from_model', 'get_flow', 'list_flows',
-           'sklearn_to_flow', 'flow_to_sklearn', 'flow_exists']
+__all__ = ['OpenMLFlow', 'get_flow', 'list_flows', 'sklearn_to_flow',
+           'flow_to_sklearn', 'flow_exists', 'openml_param_name_to_sklearn']

openml/flows/flow.py

Lines changed: 54 additions & 0 deletions
@@ -346,6 +346,60 @@ def publish(self):
                              (flow_id, message))
         return self
 
+    def get_structure(self, key_item):
+        """
+        Returns for each sub-component of the flow the path of identifiers that
+        should be traversed to reach this component. The resulting dict maps a
+        key (identifying a flow by either its flow_id or name) to the
+        parameter prefix.
+
+        Parameters
+        ----------
+        key_item: str
+            The flow attribute that will be used to identify flows in the
+            structure. Allowed values {flow_id, name}
+
+        Returns
+        -------
+        dict[str, List[str]]
+            The flow structure
+        """
+        if key_item not in ['flow_id', 'name']:
+            raise ValueError('key_item should be in {flow_id, name}')
+        structure = dict()
+        for key, sub_flow in self.components.items():
+            sub_structure = sub_flow.get_structure(key_item)
+            for flow_name, flow_sub_structure in sub_structure.items():
+                structure[flow_name] = [key] + flow_sub_structure
+        structure[getattr(self, key_item)] = []
+        return structure
+
+    def get_subflow(self, structure):
+        """
+        Returns a subflow from the tree of dependencies.
+
+        Parameters
+        ----------
+        structure: list[str]
+            A list of strings, indicating the location of the subflow
+
+        Returns
+        -------
+        OpenMLFlow
+            The OpenMLFlow that corresponds to the structure
+        """
+        if len(structure) < 1:
+            raise ValueError('Please provide a structure list of size >= 1')
+        sub_identifier = structure[0]
+        if sub_identifier not in self.components:
+            raise ValueError('Flow %s does not contain component with '
+                             'identifier %s' % (self.name, sub_identifier))
+        if len(structure) == 1:
+            return self.components[sub_identifier]
+        else:
+            structure = structure[1:]  # copy; do not mutate the caller's list
+            return self.components[sub_identifier].get_subflow(structure)
+
     def push_tag(self, tag):
         """Annotates this flow with a tag on the server.
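Note on get_structure and get_subflow: the commit message calls them complements, meaning a path produced by one can be fed to the other to retrieve the component. The following is a minimal sketch of that relationship on a toy tree. `ToyFlow` and the component names are hypothetical stand-ins, not part of openml-python, and the sketch keys the structure by name only.

```python
class ToyFlow:
    """Hypothetical stand-in for OpenMLFlow: a name plus named sub-components."""

    def __init__(self, name, components=None):
        self.name = name
        self.components = components or {}

    def get_structure(self):
        # Map every flow name in the tree to the list of component
        # identifiers that leads to it; the root maps to an empty path.
        structure = {}
        for key, sub_flow in self.components.items():
            for flow_name, path in sub_flow.get_structure().items():
                structure[flow_name] = [key] + path
        structure[self.name] = []
        return structure

    def get_subflow(self, path):
        # Follow the identifiers one level at a time without mutating `path`.
        if len(path) < 1:
            raise ValueError('Please provide a path of size >= 1')
        head, tail = path[0], path[1:]
        if head not in self.components:
            raise ValueError('%s has no component %s' % (self.name, head))
        sub_flow = self.components[head]
        return sub_flow.get_subflow(tail) if tail else sub_flow


forest = ToyFlow('RandomForestClassifier')
imputer = ToyFlow('SimpleImputer')
inner = ToyFlow('Pipeline(inner)', {'simpleimputer': imputer})
outer = ToyFlow('Pipeline(outer)', {'prep': inner, 'clf': forest})

structure = outer.get_structure()
found = outer.get_subflow(structure['SimpleImputer'])
```

Every non-empty path in `structure` resolves back to the flow it was recorded for. The root has an empty path, which get_subflow rejects by design.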

openml/flows/sklearn_converter.py

Lines changed: 31 additions & 1 deletion
@@ -11,7 +11,6 @@
 import six
 import warnings
 import sys
-import inspect
 
 import numpy as np
 import scipy.stats.distributions
@@ -177,6 +176,37 @@ def flow_to_sklearn(o, components=None, initialize_with_defaults=False):
     return rval
 
 
+def openml_param_name_to_sklearn(openml_parameter, flow):
+    """
+    Converts the name of an OpenMLParameter into the sklearn name, given a flow.
+
+    Parameters
+    ----------
+    openml_parameter: OpenMLParameter
+        The parameter under consideration
+
+    flow: OpenMLFlow
+        The flow that provides context.
+
+    Returns
+    -------
+    sklearn_parameter_name: str
+        The name the parameter will have once used in scikit-learn
+    """
+    if not isinstance(openml_parameter, openml.setups.OpenMLParameter):
+        raise ValueError('openml_parameter should be an instance of '
+                         'OpenMLParameter')
+    if not isinstance(flow, OpenMLFlow):
+        raise ValueError('flow should be an instance of OpenMLFlow')
+
+    flow_structure = flow.get_structure('name')
+    if openml_parameter.flow_name not in flow_structure:
+        raise ValueError('Obtained OpenMLParameter and OpenMLFlow do not '
+                         'correspond. ')
+    name = openml_parameter.flow_name  # for PEP8
+    return '__'.join(flow_structure[name] + [openml_parameter.parameter_name])
+
+
 def _serialize_model(model):
     """Create an OpenMLFlow.
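Note on the name conversion: openml_param_name_to_sklearn joins the component path of the owning flow with the parameter name, using scikit-learn's double-underscore convention for nested parameters. A minimal sketch of that join follows. `to_sklearn_name` and the flow names in the dict are hypothetical and only illustrate the shape of the data.

```python
def to_sklearn_name(flow_structure, flow_name, parameter_name):
    """Join the component path of a flow with a parameter name using '__'."""
    if flow_name not in flow_structure:
        raise ValueError('parameter and flow do not correspond')
    return '__'.join(flow_structure[flow_name] + [parameter_name])


# Hypothetical structure, shaped like the result of flow.get_structure('name').
flow_structure = {
    'toy.Pipeline': [],
    'toy.SimpleImputer': ['simpleimputer'],
    'toy.RandomForestClassifier': ['randomforestclassifier'],
}

nested = to_sklearn_name(flow_structure, 'toy.SimpleImputer', 'strategy')
top_level = to_sklearn_name(flow_structure, 'toy.Pipeline', 'steps')
```

A parameter of the root flow has an empty path, so its name is returned unchanged. A parameter of a sub-component gets one prefix per level of nesting.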

openml/runs/functions.py

Lines changed: 3 additions & 2 deletions
@@ -17,8 +17,9 @@
 import openml._api_calls
 from ..exceptions import PyOpenMLError
 from .. import config
-from ..flows import sklearn_to_flow, get_flow, flow_exists, _check_n_jobs, \
-    _copy_server_fields, OpenMLFlow
+from openml.flows.sklearn_converter import _check_n_jobs
+from openml.flows.flow import _copy_server_fields
+from ..flows import sklearn_to_flow, get_flow, flow_exists, OpenMLFlow
 from ..setups import setup_exists, initialize_model
 from ..exceptions import OpenMLCacheException, OpenMLServerException
 from ..tasks import OpenMLTask

openml/setups/__init__.py

Lines changed: 3 additions & 2 deletions
@@ -1,4 +1,5 @@
-from .setup import OpenMLSetup
+from .setup import OpenMLSetup, OpenMLParameter
 from .functions import get_setup, list_setups, setup_exists, initialize_model
 
-__all__ = ['get_setup', 'list_setups', 'setup_exists', 'initialize_model']
+__all__ = ['OpenMLSetup', 'OpenMLParameter', 'get_setup', 'list_setups',
+           'setup_exists', 'initialize_model']

openml/setups/functions.py

Lines changed: 16 additions & 43 deletions
@@ -211,44 +211,16 @@ def initialize_model(setup_id):
     # transform an openml setup object into
     # a dict of dicts, structured: flow_id maps to dict of
     # parameter_names mapping to parameter_value
-
     setup = get_setup(setup_id)
-    parameters = {}
-    for _param in setup.parameters:
-        _flow_id = setup.parameters[_param].flow_id
-        _param_name = setup.parameters[_param].parameter_name
-        _param_value = setup.parameters[_param].value
-        if _flow_id not in parameters:
-            parameters[_flow_id] = {}
-        parameters[_flow_id][_param_name] = _param_value
-
-    def _reconstruct_flow(_flow, _params):
-        # recursively set the values of flow parameters (and subflows) to
-        # the specific values from a setup. _params is a dict of
-        # dicts, mapping from flow id to param name to param value
-        # (obtained by using the subfunction _to_dict_of_dicts)
-        for _param in _flow.parameters:
-            # It can happen that no parameters of a flow are in a setup,
-            # then the flow_id is not in _params; usually happens for a
-            # sklearn.pipeline.Pipeline object, where the steps parameter is
-            # not in the setup
-            if _flow.flow_id not in _params:
-                continue
-            # It is not guaranteed that a setup on OpenML has all parameter
-            # settings of a flow, thus a param must not be in _params!
-            if _param not in _params[_flow.flow_id]:
-                continue
-            _flow.parameters[_param] = _params[_flow.flow_id][_param]
-        for _identifier in _flow.components:
-            _flow.components[_identifier] = _reconstruct_flow(_flow.components[_identifier], _params)
-        return _flow
-
-    # now we 'abuse' the parameter object by passing in the
-    # parameters obtained from the setup
     flow = openml.flows.get_flow(setup.flow_id)
-    flow = _reconstruct_flow(flow, parameters)
-
-    return openml.flows.flow_to_sklearn(flow)
+    model = openml.flows.flow_to_sklearn(flow)
+    hyperparameters = {
+        openml.flows.openml_param_name_to_sklearn(hp, flow):
+            openml.flows.flow_to_sklearn(hp.value)
+        for hp in setup.parameters.values()
+    }
+    model.set_params(**hyperparameters)
+    return model
 
 
 def _to_dict(flow_id, openml_parameter_settings):
@@ -288,10 +260,11 @@ def _create_setup_from_xml(result_dict):
 
 
 def _create_setup_parameter_from_xml(result_dict):
-    return OpenMLParameter(int(result_dict['oml:id']),
-                           int(result_dict['oml:flow_id']),
-                           result_dict['oml:full_name'],
-                           result_dict['oml:parameter_name'],
-                           result_dict['oml:data_type'],
-                           result_dict['oml:default_value'],
-                           result_dict['oml:value'])
+    return OpenMLParameter(input_id=int(result_dict['oml:id']),
+                           flow_id=int(result_dict['oml:flow_id']),
+                           flow_name=result_dict['oml:flow_name'],
+                           full_name=result_dict['oml:full_name'],
+                           parameter_name=result_dict['oml:parameter_name'],
+                           data_type=result_dict['oml:data_type'],
+                           default_value=result_dict['oml:default_value'],
+                           value=result_dict['oml:value'])
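Note on the new initialize_model: instead of rewriting parameters inside the flow tree, it builds one flat dict of scikit-learn style names and decoded values, then hands that dict to set_params. A rough sketch of the dict-building step follows. `ToyParameter` is a hypothetical stand-in for OpenMLParameter, and `json.loads` stands in for flow_to_sklearn on the assumption that the stored values are JSON-encoded strings.

```python
import json
from collections import namedtuple

# Hypothetical stand-in carrying only the fields the sketch needs.
ToyParameter = namedtuple(
    'ToyParameter', ['flow_name', 'parameter_name', 'value'])


def build_hyperparameters(parameters, flow_structure):
    """Map prefixed parameter names to their decoded values."""
    hyperparameters = {}
    for hp in parameters:
        path = flow_structure[hp.flow_name]
        name = '__'.join(path + [hp.parameter_name])
        # Assumption: values arrive as JSON-encoded strings.
        hyperparameters[name] = json.loads(hp.value)
    return hyperparameters


flow_structure = {
    'pipeline': [],
    'imputer': ['simpleimputer'],
    'forest': ['randomforestclassifier'],
}
parameters = [
    ToyParameter('imputer', 'strategy', '"median"'),
    ToyParameter('forest', 'n_estimators', '16'),
    ToyParameter('forest', 'max_features', '0.2'),
]
hyperparameters = build_hyperparameters(parameters, flow_structure)
```

The resulting dict has the same shape as `hyperparameters_original` in the tutorial, so it could be passed to a pipeline's set_params as keyword arguments.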

openml/setups/setup.py

Lines changed: 24 additions & 19 deletions
@@ -29,27 +29,32 @@ def __init__(self, setup_id, flow_id, parameters):
 class OpenMLParameter(object):
     """Parameter object (used in setup).
 
-        Parameters
-        ----------
-        id : int
-            The input id from the openml database
-        flow id : int
-            The flow to which this parameter is associated
-        full_name : str
-            The name of the flow and parameter combined
-        parameter_name : str
-            The name of the parameter
-        data_type : str
-            The datatype of the parameter. generally unused for sklearn flows
-        default_value : str
-            The default value. For sklearn parameters, this is unknown and a
-            default value is selected arbitrarily
-        value : str
-            If the parameter was set, the value that it was set to.
+    Parameters
+    ----------
+    input_id : int
+        The input id from the openml database
+    flow_id : int
+        The flow to which this parameter is associated
+    flow_name : str
+        The name of the flow (no version number) to which this parameter
+        is associated
+    full_name : str
+        The name of the flow and parameter combined
+    parameter_name : str
+        The name of the parameter
+    data_type : str
+        The datatype of the parameter. Generally unused for sklearn flows
+    default_value : str
+        The default value. For sklearn parameters, this is unknown and a
+        default value is selected arbitrarily
+    value : str
+        If the parameter was set, the value that it was set to.
     """
-    def __init__(self, id, flow_id, full_name, parameter_name, data_type, default_value, value):
-        self.id = id
+    def __init__(self, input_id, flow_id, flow_name, full_name, parameter_name,
+                 data_type, default_value, value):
+        self.id = input_id
         self.flow_id = flow_id
+        self.flow_name = flow_name
         self.full_name = full_name
         self.parameter_name = parameter_name
         self.data_type = data_type

openml/tasks/functions.py

Lines changed: 3 additions & 1 deletion
@@ -172,7 +172,7 @@ def _list_tasks(task_type_id=None, **kwargs):
           - Survival Analysis: 7
           - Subgroup Discovery: 8
     kwargs: dict, optional
-        Legal filter operators: tag, data_tag, status, limit,
+        Legal filter operators: tag, task_id (list), data_tag, status, limit,
         offset, data_id, data_name, number_instances, number_features,
         number_classes, number_missing_values.
     Returns
@@ -184,6 +184,8 @@ def _list_tasks(task_type_id=None, **kwargs):
         api_call += "/type/%d" % int(task_type_id)
     if kwargs is not None:
         for operator, value in kwargs.items():
+            if operator == 'task_id':
+                value = ','.join([str(int(i)) for i in value])
             api_call += "/%s/%s" % (operator, value)
     return __list_tasks(api_call)
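Note on the task_id filter: the added branch turns a list of task ids into one comma-separated path segment before it is appended to the API call. The sketch below isolates that string building. `build_task_list_call` is a hypothetical helper, and the "task/list" base path is an assumption, since the start of _list_tasks is not shown in this diff.

```python
def build_task_list_call(task_type_id=None, **kwargs):
    """Build the path for a task listing call from the given filters."""
    api_call = "task/list"  # assumed base path, not visible in the diff
    if task_type_id is not None:
        api_call += "/type/%d" % int(task_type_id)
    for operator, value in kwargs.items():
        if operator == 'task_id':
            # A list of ids becomes a single comma-separated segment.
            value = ','.join([str(int(i)) for i in value])
        api_call += "/%s/%s" % (operator, value)
    return api_call


call = build_task_list_call(task_type_id=1, task_id=[6, 11, 31], limit=10)
```

Casting each id through `int` before joining rejects non-numeric entries early, with a ValueError, rather than sending them to the server.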
