Commit 94dc09f

Update and reorganize examples (#936)
* update and reorganize examples
* fixing examples
* extend examples
* bugfix and flake8
* incorporate feedback
1 parent 7a84dd4 commit 94dc09f

28 files changed: 533 additions and 346 deletions

autosklearn/estimators.py

Lines changed: 1 addition & 2 deletions
@@ -606,8 +606,7 @@ def _get_automl_class(self):
         raise NotImplementedError()
 
     def get_configuration_space(self, X, y):
-        self._automl = self.build_automl()
-        return self._automl[0].fit(X, y, only_return_configuration_space=True)
+        return self._automl[0].configuration_space
 
 
 class AutoSklearnClassifier(AutoSklearnEstimator):
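
Reviewer note on the hunk above: `get_configuration_space` no longer rebuilds the AutoML object and calls `fit`; it reads the `configuration_space` attribute of an object that already exists. This appears to mean the method now relies on `self._automl` having been populated beforehand. The sketch below illustrates that pattern with toy classes. `ToyBackend`, `ToyEstimator`, the dictionary used as a search space and the explicit `RuntimeError` guard are all hypothetical and are not part of auto-sklearn.

```python
class ToyBackend:
    """Hypothetical stand-in for the object stored in ``self._automl[0]``."""

    def __init__(self):
        self.configuration_space = None

    def fit(self, X, y):
        # Pretend that fitting derives the search space from the data.
        self.configuration_space = {
            "n_features": len(X[0]),
            "n_classes": len(set(y)),
        }
        return self


class ToyEstimator:
    def __init__(self):
        self._automl = None  # populated by fit()

    def fit(self, X, y):
        self._automl = [ToyBackend().fit(X, y)]
        return self

    def get_configuration_space(self):
        # Read the cached attribute instead of refitting. The guard is an
        # addition made for this sketch only.
        if self._automl is None:
            raise RuntimeError("call fit() before get_configuration_space()")
        return self._automl[0].configuration_space


est = ToyEstimator().fit([[0.0, 1.0], [1.0, 0.0]], [0, 1])
print(est.get_configuration_space())
```

The accessor is cheap because it never triggers another fit, at the price of requiring a prior `fit` call.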

doc/manual.rst

Lines changed: 20 additions & 19 deletions
@@ -15,22 +15,23 @@ Examples
 *auto-sklearn* comes with the following examples which demonstrate several
 aspects of its usage:
 
-* `Holdout <examples/example_holdout.html>`_
-* `Cross-validation <examples/example_crossvalidation.html>`_
-* `Parallel usage (n_jobs) <examples/example_parallel_n_jobs.html>`_
-* `Parallel usage (manual) <examples/example_parallel_manual_spawning.html>`_
-* `Sequential usage <examples/example_sequential.html>`_
-* `Regression <examples/example_regression.html>`_
-* `Continuous and categorical data <examples/example_feature_types.html>`_
-* `Using custom metrics <examples/example_metrics.html>`_
-* `Random search <examples/example_random_search.html>`_
-* `EIPS <examples/example_eips.html>`_
-* `Successive Halving <examples/example_successive_halving.html>`_
-* `Extending with a new classifier <examples/example_extending_classification.html>`_
-* `Extending with a new regressor <examples/example_extending_regression.html>`_
-* `Extending with a new preprocessor <examples/example_extending_preprocessor.html>`_
-* `Iterating over the models <examples/example_get_pipeline_components.html>`_
-* `Pandas Train and Test inputs <examples/example_pandas_train_test.html>`_
+* `Classification <examples/20_basic/example_classification.html>`_
+* `Multi-label Classification <examples/20_basic/example_multilabel_classification.html>`_
+* `Regression <examples/20_basic/example_regression.html>`_
+* `Continuous and categorical data <examples/40_advanced/example_feature_types.html>`_
+* `Iterating over the models <examples/40_advanced/example_get_pipeline_components.html>`_
+* `Using custom metrics <examples/40_advanced/example_metrics.html>`_
+* `Pandas Train and Test inputs <examples/40_advanced/example_pandas_train_test.html>`_
+* `Resampling strategies <examples/40_advanced/example_resampling.html>`_
+* `Parallel usage (manual) <examples/60_search/example_parallel_manual_spawning.html>`_
+* `Parallel usage (n_jobs) <examples/60_search/example_parallel_n_jobs.html>`_
+* `Random search <examples/60_search/example_random_search.html>`_
+* `Sequential usage <examples/60_search/example_sequential.html>`_
+* `Successive Halving <examples/60_search/example_successive_halving.html>`_
+* `Extending with a new classifier <examples/80_extending/example_extending_classification.html>`_
+* `Extending with a new regressor <examples/80_extending/example_extending_regression.html>`_
+* `Extending with a new preprocessor <examples/80_extending/example_extending_preprocessor.html>`_
+* `Restrict hyperparameters for a component <examples/80_extending/example_restrict_number_of_hyperparameters.html>`_
 
 
 Time and memory limits
@@ -103,15 +104,15 @@ Supported Inputs
 * Multioutput Regression
 
 You can provide feature and target training pairs (X_train/y_train) to *auto-sklearn* to fit an ensemble of pipelines as described in the next section. This X_train/y_train dataset must belong to one of the supported formats: np.ndarray, pd.DataFrame, scipy.sparse.csr_matrix and python lists.
-Optionally, you can measure the ability of this fitted model to generalize to unseen data by providing an optional testing pair (X_test/Y_test). For further details, please refer to the example `Pandas Train and Test inputs <examples/example_pandas_train_test.html>`_. Supported formats for these training and testing pairs are: np.ndarray, pd.DataFrame, scipy.sparse.csr_matrix and python lists.
+Optionally, you can measure the ability of this fitted model to generalize to unseen data by providing an optional testing pair (X_test/Y_test). For further details, please refer to the example `Train and Test inputs <examples/example_pandas_train_test.html>`_. Supported formats for these training and testing pairs are: np.ndarray, pd.DataFrame, scipy.sparse.csr_matrix and python lists.
 
-If your data contains categorical values (in the features or targets), autosklearn will automatically encode your data using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_ for unidimensional data and a `sklearn.preprocessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_ for multidimensional data.
+If your data contains categorical values (in the features or targets), autosklearn will automatically encode your data using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_ for unidimensional data and a `sklearn.preprocessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_ for multidimensional data.
 
 Regarding the features, there are two methods to guide *auto-sklearn* to properly encode categorical columns:
 * Providing an X_train/X_test numpy array with the optional flag feat_type. For further details, you can check the example `Feature Types <examples/example_feature_types.html>`_.
 * You can provide a pandas DataFrame, with properly formatted columns. If a column has numerical dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. If the column has a categorical/boolean class, it will be encoded. If the column is of any other type (Object or Timeseries), an error will be raised. For further details on how to properly encode your data, you can check the example `Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_. If you are working with time series, it is recommended that you follow this approach `Working with time data <https://stats.stackexchange.com/questions/311494/>`_.
 
-Regarding the targets (y_train/y_test), if the task involves a classification problem, such features will be automatically encoded. It is recommended to provide both y_train and y_test during fit, so that a common encoding is created between these splits (if only y_train is provided during fit, the categorical encoder will not be able to handle new classes that are exclusive to y_test). If the task is regression, no encoding happens on the targets.
+Regarding the targets (y_train/y_test), if the task involves a classification problem, the targets will be automatically encoded. It is recommended to provide both y_train and y_test during fit, so that a common encoding is created between these splits (if only y_train is provided during fit, the categorical encoder will not be able to handle new classes that are exclusive to y_test). If the task is regression, no encoding happens on the targets.
 
 Ensemble Building Process
 =========================
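
Reviewer note on the target-encoding paragraph in the hunk above: the advice to pass both y_train and y_test during fit can be illustrated with a simplified label encoding. The sketch below is a pure-Python stand-in written for this note; `fit_label_mapping` and `encode` are hypothetical helpers, not the encoder auto-sklearn or scikit-learn actually uses.

```python
def fit_label_mapping(*label_splits):
    """Map every distinct label in the given splits to an integer code."""
    classes = sorted(set().union(*label_splits))
    return {label: code for code, label in enumerate(classes)}


def encode(labels, mapping):
    # Raises KeyError for a label that was not seen when the mapping was built.
    return [mapping[label] for label in labels]


y_train = ["cat", "dog", "cat"]
y_test = ["dog", "bird"]  # "bird" never occurs in y_train

# Mapping built from the training split only: the unseen class cannot be encoded.
train_only = fit_label_mapping(y_train)
try:
    encode(y_test, train_only)
except KeyError as unseen:
    print("cannot encode unseen class:", unseen)

# Mapping built from both splits: one common encoding covers every class.
common = fit_label_mapping(y_train, y_test)
print(encode(y_train, common), encode(y_test, common))
```

With the common mapping, both splits share the same integer codes, which is the situation the manual recommends.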

examples/20_basic/README.txt

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+.. _basic_examples:
+
+==============
+Basic Examples
+==============
+
+Examples for basic classification, regression and multi-label classification datasets.

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+# -*- encoding: utf-8 -*-
+"""
+==============
+Classification
+==============
+
+The following example shows how to fit a simple classification model with
+*auto-sklearn*.
+"""
+import sklearn.datasets
+import sklearn.metrics
+
+import autosklearn.classification
+
+
+############################################################################
+# Data Loading
+# ============
+
+X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
+X_train, X_test, y_train, y_test = \
+    sklearn.model_selection.train_test_split(X, y, random_state=1)
+
+############################################################################
+# Build and fit a classifier
+# ==========================
+
+automl = autosklearn.classification.AutoSklearnClassifier(
+    time_left_for_this_task=120,
+    per_run_time_limit=30,
+    tmp_folder='/tmp/autosklearn_classification_example_tmp',
+    output_folder='/tmp/autosklearn_classification_example_out',
+    ml_memory_limit=60,
+)
+automl.fit(X_train, y_train, dataset_name='breast_cancer')
+
+############################################################################
+# Print the final ensemble constructed by auto-sklearn
+# ====================================================
+
+print(automl.show_models())
+
+###########################################################################
+# Get the Score of the final ensemble
+# ===================================
+
+predictions = automl.predict(X_test)
+print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))

examples/example_multilabel_classification.py renamed to examples/20_basic/example_multilabel_classification.py

Lines changed: 5 additions & 6 deletions
@@ -1,15 +1,14 @@
 """
-=================================
-example_multilabel_classification
-=================================
+==========================
+Multi-label Classification
+==========================
 
 This example shows how to format the targets for a multilabel classification
-problem. Details on multilabel classification can be found on
-`here https://scikit-learn.org/stable/modules/multiclass.html>`_).
+problem. Details on multilabel classification can be found
+`here <https://scikit-learn.org/stable/modules/multiclass.html>`_.
 """
 import numpy as np
 
-import sklearn.model_selection
 import sklearn.datasets
 import sklearn.metrics
 from sklearn.utils.multiclass import type_of_target

examples/example_regression.py renamed to examples/20_basic/example_regression.py

Lines changed: 2 additions & 4 deletions
@@ -7,7 +7,6 @@
 The following example shows how to fit a simple regression model with
 *auto-sklearn*.
 """
-import sklearn.model_selection
 import sklearn.datasets
 import sklearn.metrics
 
@@ -19,7 +18,7 @@
 # ============
 
 X, y = sklearn.datasets.load_boston(return_X_y=True)
-feature_types = (['numerical'] * 3) + ['categorical'] + (['numerical'] * 9)
+
 X_train, X_test, y_train, y_test = \
     sklearn.model_selection.train_test_split(X, y, random_state=1)
 
@@ -33,8 +32,7 @@
     tmp_folder='/tmp/autosklearn_regression_example_tmp',
     output_folder='/tmp/autosklearn_regression_example_out',
 )
-automl.fit(X_train, y_train, dataset_name='boston',
-           feat_type=feature_types)
+automl.fit(X_train, y_train, dataset_name='boston')
 
 ############################################################################
 # Print the final ensemble constructed by auto-sklearn

examples/40_advanced/README.txt

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+.. _advanced_examples:
+
+=================
+Advanced Examples
+=================
+
+Examples on customizing Auto-sklearn to one's use case by changing the
+metric to optimize, the train-validation split, giving feature types,
+using pandas dataframes as input and inspecting the results of the search
+procedure.

examples/example_feature_types.py renamed to examples/40_advanced/example_feature_types.py

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@
 # ==========================
 
 cls = autosklearn.classification.AutoSklearnClassifier(
-    time_left_for_this_task=60,
+    time_left_for_this_task=30,
     # Below two flags are provided to speed up calculations
     # Not recommended for a real implementation
     initial_configurations_via_metalearning=0,

Lines changed: 173 additions & 0 deletions
@@ -0,0 +1,173 @@
+# -*- encoding: utf-8 -*-
+"""
+======================
+Obtain run information
+======================
+
+The following example shows how to obtain information from a finished
+Auto-sklearn run. In particular, it shows:
+* how to query which models were evaluated by Auto-sklearn
+* how to query the models in the final ensemble
+* how to get general statistics on what Auto-sklearn evaluated
+
+Auto-sklearn is a wrapper on top of
+the sklearn models. This example illustrates how to interact
+with the sklearn components directly, in this case a PCA preprocessor.
+"""
+import sklearn.datasets
+import sklearn.metrics
+
+import autosklearn.classification
+
+############################################################################
+# Data Loading
+# ============
+
+X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
+X_train, X_test, y_train, y_test = \
+    sklearn.model_selection.train_test_split(X, y, random_state=1)
+
+############################################################################
+# Build and fit the classifier
+# ============================
+
+automl = autosklearn.classification.AutoSklearnClassifier(
+    time_left_for_this_task=30,
+    per_run_time_limit=10,
+    disable_evaluator_output=False,
+    # To simplify querying the models in the final ensemble, we
+    # restrict auto-sklearn to use only pca as a preprocessor
+    include_preprocessors=['pca'],
+)
+automl.fit(X_train, y_train, dataset_name='breast_cancer')
+
+############################################################################
+# Predict using the model
+# =======================
+
+predictions = automl.predict(X_test)
+print("Accuracy score:{}".format(
+    sklearn.metrics.accuracy_score(y_test, predictions))
+)
+
+
+############################################################################
+# Report the models found by Auto-Sklearn
+# =======================================
+#
+# Auto-sklearn uses
+# `Ensemble Selection <https://www.cs.cornell.edu/~alexn/papers/shotgun.icml04.revised.rev2.pdf>`_
+# to construct ensembles in a post-hoc fashion. The ensemble is a linear
+# weighting of all models constructed during the hyperparameter optimization.
+# This prints the final ensemble. It is a list of tuples, each tuple being
+# the model weight in the ensemble and the model itself.
+
+print(automl.show_models())
+
+###########################################################################
+# Report statistics about the search
+# ==================================
+#
+# Print statistics about the auto-sklearn run such as number of
+# iterations, number of models failed with a time out etc.
+print(automl.sprint_statistics())
+
+############################################################################
+# Detailed statistics about the search - part 1
+# =============================================
+#
+# Auto-sklearn also keeps detailed statistics of the hyperparameter
+# optimization procedure, which are stored in a so-called
+# `run history <https://automl.github.io/SMAC3/master/apidoc/smac.
+# runhistory.runhistory.html#smac.runhistory# .runhistory.RunHistory>`_.
+
+print(automl._automl[0].runhistory_)
+
+############################################################################
+# Runs are stored inside an ``OrderedDict`` called ``data``:
+
+print(len(automl._automl[0].runhistory_.data))
+
+############################################################################
+# Let's iterate over all entries
+
+for run_key in automl._automl[0].runhistory_.data:
+    print('#########')
+    print(run_key)
+    print(automl._automl[0].runhistory_.data[run_key])
+
+############################################################################
+# and have a detailed look at one entry:
+
+run_key = list(automl._automl[0].runhistory_.data.keys())[0]
+run_value = automl._automl[0].runhistory_.data[run_key]
+
+############################################################################
+# The ``run_key`` contains all information describing a run:
+
+print("Configuration ID:", run_key.config_id)
+print("Instance:", run_key.instance_id)
+print("Seed:", run_key.seed)
+print("Budget:", run_key.budget)
+
+############################################################################
+# and the configuration can be looked up in the run history as well:
+
+print(automl._automl[0].runhistory_.ids_config[run_key.config_id])
+
+############################################################################
+# The only other important entry is the budget in case you are using
+# auto-sklearn with
+# `successive halving <examples/60_search/example_successive_halving.py>`_.
+# The remaining parts of the key can be ignored for auto-sklearn and are
+# only there because the underlying optimizer, SMAC, can handle more general
+# problems, too.
+
+############################################################################
+# The ``run_value`` contains all output from running the configuration:
+
+print("Cost:", run_value.cost)
+print("Time:", run_value.time)
+print("Status:", run_value.status)
+print("Additional information:", run_value.additional_info)
+print("Start time:", run_value.starttime)
+print("End time:", run_value.endtime)
+
+############################################################################
+# Cost is basically the same as a loss. In case the metric to optimize for
+# should be maximized, it is internally transformed into a minimization
+# metric. Additionally, the status type gives information on whether the run
+# was successful, while the additional information's most interesting entry
+# is the internal training loss. Furthermore, there is detailed information
+# on the runtime available.
+
+############################################################################
+# Detailed statistics about the search - part 2
+# =============================================
+#
+# To maintain compatibility with scikit-learn, Auto-sklearn gives the
+# same data as
+# `cv_results_ <https://scikit-learn.org/stable/modules/generated/sklearn.
+# model_selection.GridSearchCV.html>`_.
+
+print(automl.cv_results_)
+
+############################################################################
+# Inspect the components of the best model
+# ========================================
+#
+# Iterate over the components of the model and print
+# the explained variance ratio per stage
+for i, (weight, pipeline) in enumerate(automl.get_models_with_weights()):
+    for stage_name, component in pipeline.named_steps.items():
+        if 'preprocessor' in stage_name:
+            print(
+                "The {}th pipeline has an explained variance of {}".format(
+                    i,
+                    # The component is an instance of AutoSklearnChoice.
+                    # Access the sklearn object via the choice attribute
+                    # We want the explained variance attribute of
+                    # each principal component
+                    component.choice.preprocessor.explained_variance_ratio_
+                )
+            )
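
Reviewer note on the run-history section of the example above: the lookups it performs (iterate over `data`, read fields of the key and value, resolve `config_id` through `ids_config`) can be tried without a fitted model. The sketch below mimics that layout with standard-library types only. The field names are taken from the attributes the example prints; the namedtuples, the sample values and the `1 - accuracy` cost conversion are assumptions made for illustration, not SMAC or auto-sklearn code.

```python
from collections import OrderedDict, namedtuple

# Field names follow the attributes printed in the example; the tuples
# themselves are illustrative stand-ins, not SMAC classes.
RunKey = namedtuple("RunKey", ["config_id", "instance_id", "seed", "budget"])
RunValue = namedtuple("RunValue", ["cost", "time", "status"])


def accuracy_to_cost(accuracy):
    # Assumption: a score whose optimum is 1.0 is minimized as ``1 - score``.
    return 1.0 - accuracy


# Made-up entries standing in for ``runhistory_.data``.
data = OrderedDict()
data[RunKey(1, None, 0, 0.0)] = RunValue(accuracy_to_cost(0.75), 1.5, "SUCCESS")
data[RunKey(2, None, 0, 0.0)] = RunValue(accuracy_to_cost(0.5), 0.5, "SUCCESS")

# Made-up entries standing in for ``runhistory_.ids_config``.
ids_config = {1: {"classifier": "random_forest"}, 2: {"classifier": "libsvm_svc"}}

# The run with the lowest cost corresponds to the highest accuracy.
best_key = min(data, key=lambda key: data[key].cost)
print("best config id:", best_key.config_id)
print("cost:", data[best_key].cost)
print("configuration:", ids_config[best_key.config_id])
```

Because the cost is minimized, selecting the best run is a `min` over the stored values rather than a `max` over the original score.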
File renamed without changes.
