automl
diff --git a/autosklearn/estimators.py b/autosklearn/estimators.py
Lines changed: 1 addition & 2 deletions
diff --git a/doc/manual.rst b/doc/manual.rst
Lines changed: 20 additions & 19 deletions
diff --git a/examples/20_basic/README.txt b/examples/20_basic/README.txt
Lines changed: 7 additions & 0 deletions
diff --git a/examples/20_basic/example_classification.py b/examples/20_basic/example_classification.py
Lines changed: 48 additions & 0 deletions
diff --git a/examples/example_multilabel_classification.py renamed to examples/20_basic/example_multilabel_classification.py
Lines changed: 5 additions & 6 deletions
diff --git a/examples/example_regression.py renamed to examples/20_basic/example_regression.py
Lines changed: 2 additions & 4 deletions
diff --git a/examples/40_advanced/README.txt b/examples/40_advanced/README.txt
Lines changed: 10 additions & 0 deletions
diff --git a/examples/example_feature_types.py renamed to examples/40_advanced/example_feature_types.py
Lines changed: 1 addition & 1 deletion
diff --git a/examples/40_advanced/example_get_pipeline_components.py b/examples/40_advanced/example_get_pipeline_components.py
Lines changed: 173 additions & 0 deletions
diff --git a/examples/example_metrics.py renamed to examples/40_advanced/example_metrics.py
@@ -606,8 +606,7 @@ def _get_automl_class(self):
         raise NotImplementedError()
 
     def get_configuration_space(self, X, y):
-        self._automl = self.build_automl()
-        return self._automl[0].fit(X, y, only_return_configuration_space=True)
+        return self._automl[0].configuration_space
 
 
 class AutoSklearnClassifier(AutoSklearnEstimator):
 
@@ -15,22 +15,23 @@ Examples
 *auto-sklearn* comes with the following examples which demonstrate several
 aspects of its usage:
 
-* `Holdout <examples/example_holdout.html>`_
-* `Cross-validation <examples/example_crossvalidation.html>`_
-* `Parallel usage (n_jobs) <examples/example_parallel_n_jobs.html>`_
-* `Parallel usage (manual) <examples/example_parallel_manual_spawning.html>`_
-* `Sequential usage <examples/example_sequential.html>`_
-* `Regression <examples/example_regression.html>`_
-* `Continuous and categorical data <examples/example_feature_types.html>`_
-* `Using custom metrics <examples/example_metrics.html>`_
-* `Random search <examples/example_random_search.html>`_
-* `EIPS <examples/example_eips.html>`_
-* `Successive Halving <examples/example_successive_halving.html>`_
-* `Extending with a new classifier <examples/example_extending_classification.html>`_
-* `Extending with a new regressor <examples/example_extending_regression.html>`_
-* `Extending with a new preprocessor <examples/example_extending_preprocessor.html>`_
-* `Iterating over the models <examples/example_get_pipeline_components.html>`_
-* `Pandas Train and Test inputs <examples/example_pandas_train_test.html>`_
+* `Classification <examples/20_basic/example_classification.html>`_
+* `Multi-label Classification <examples/20_basic/example_multilabel_classification.html>`_
+* `Regression <examples/20_basic/example_regression.html>`_
+* `Continuous and categorical data <examples/40_advanced/example_feature_types.html>`_
+* `Iterating over the models <examples/40_advanced/example_get_pipeline_components.html>`_
+* `Using custom metrics <examples/40_advanced/example_metrics.html>`_
+* `Pandas Train and Test inputs <examples/40_advanced/example_pandas_train_test.html>`_
+* `Resampling strategies <examples/40_advanced/example_resampling.html>`_
+* `Parallel usage (manual) <examples/60_search/example_parallel_manual_spawning.html>`_
+* `Parallel usage (n_jobs) <examples/60_search/example_parallel_n_jobs.html>`_
+* `Random search <examples/60_search/example_random_search.html>`_
+* `Sequential usage <examples/60_search/example_sequential.html>`_
+* `Successive Halving <examples/60_search/example_successive_halving.html>`_
+* `Extending with a new classifier <examples/80_extending/example_extending_classification.html>`_
+* `Extending with a new regressor <examples/80_extending/example_extending_regression.html>`_
+* `Extending with a new preprocessor <examples/80_extending/example_extending_preprocessor.html>`_
+* `Restrict hyperparameters for a component <examples/80_extending/example_restrict_number_of_hyperparameters.html>`_
 
 
 Time and memory limits
@@ -103,15 +104,15 @@ Supported Inputs
 * Multioutput Regression
 
 You can provide feature and target training pairs (X_train/y_train) to *auto-sklearn* to fit an ensemble of pipelines as described in the next section. This X_train/y_train dataset must belong to one of the supported formats: np.ndarray, pd.DataFrame, scipy.sparse.csr_matrix and python lists.
- Optionally, you can measure the ability of this fitted model to generalize to unseen data by providing an optional testing pair (X_test/Y_test). For further details, please refer to the example `Pandas Train and Test inputs <examples/example_pandas_train_test.html>`_. Supported formats for these training and testing pairs are: np.ndarray, pd.DataFrame, scipy.sparse.csr_matrix and python lists.
+ Optionally, you can measure the ability of this fitted model to generalize to unseen data by providing an optional testing pair (X_test/Y_test). For further details, please refer to the example `Train and Test inputs <examples/40_advanced/example_pandas_train_test.html>`_. Supported formats for these training and testing pairs are: np.ndarray, pd.DataFrame, scipy.sparse.csr_matrix and python lists.
 
-If your data contains categorical values (in the features or targets), autosklearn will automatically encode your data using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_ for unidimensional data and a `sklearn.preprocessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_ for multidimensional data. 
+If your data contains categorical values (in the features or targets), autosklearn will automatically encode your data using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_ for unidimensional data and a `sklearn.preprocessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_ for multidimensional data.
 
 Regarding the features, there are two methods to guide *auto-sklearn* to properly encode categorical columns:
 * Providing a X_train/X_test numpy array with the optional flag feat_type. For further details, you can check the example `Feature Types <examples/example_feature_types.html>`_.
 * You can provide a pandas DataFrame, with properly formatted columns. If a column has numerical dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. If the column has a categorical/boolean class, it will be encoded. If the column is of any other type (Object or Timeseries), an error will be raised. For further details on how to properly encode your data, you can check the example `Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_). If you are working with time series, it is recommended that you follow this approach `Working with time data <https://stats.stackexchange.com/questions/311494/>`_.
 
-Regarding the targets (y_train/y_test), if the task involves a classification problem, such features will be automatically encoded. It is recommended to provide both y_train and y_test during fit, so that a common encoding is created between these splits (if only y_train is provided during fit, the categorical encoder will not be able to handle new classes that are exclusive to y_test). If the task is regression, no encoding happens on the targets. 
+Regarding the targets (y_train/y_test), if the task involves a classification problem, the targets will be automatically encoded. It is recommended to provide both y_train and y_test during fit, so that a common encoding is created between these splits (if only y_train is provided during fit, the categorical encoder will not be able to handle new classes that are exclusive to y_test). If the task is regression, no encoding happens on the targets.
 
 Ensemble Building Process
 =========================
 
@@ -0,0 +1,7 @@
+.. _basic_examples:
+
+==============
+Basic Examples
+==============
+
+Examples for basic classification, regression and multi-label classification datasets.
@@ -0,0 +1,48 @@
+# -*- encoding: utf-8 -*-
+"""
+==============
+Classification
+==============
+
+The following example shows how to fit a simple classification model with
+*auto-sklearn*.
+"""
+import sklearn.datasets
+import sklearn.metrics
+
+import autosklearn.classification
+
+
+############################################################################
+# Data Loading
+# ============
+
+X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
+X_train, X_test, y_train, y_test = \
+    sklearn.model_selection.train_test_split(X, y, random_state=1)
+
+############################################################################
+# Build and fit a classifier
+# ==========================
+
+automl = autosklearn.classification.AutoSklearnClassifier(
+    time_left_for_this_task=120,
+    per_run_time_limit=30,
+    tmp_folder='/tmp/autosklearn_classification_example_tmp',
+    output_folder='/tmp/autosklearn_classification_example_out',
+    ml_memory_limit=60,
+)
+automl.fit(X_train, y_train, dataset_name='breast_cancer')
+
+############################################################################
+# Print the final ensemble constructed by auto-sklearn
+# ====================================================
+
+print(automl.show_models())
+
+###########################################################################
+# Get the Score of the final ensemble
+# ===================================
+
+predictions = automl.predict(X_test)
+print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))
@@ -1,15 +1,14 @@
 """
-=================================
-example_multilabel_classification
-=================================
+==========================
+Multi-label Classification
+==========================
 
 This example shows how to format the targets for a multilabel classification
-problem. Details on multilabel classification can be found on
-`here https://scikit-learn.org/stable/modules/multiclass.html>`_).
+problem. Details on multilabel classification can be found
+`here <https://scikit-learn.org/stable/modules/multiclass.html>`_.
 """
 import numpy as np
 
-import sklearn.model_selection
 import sklearn.datasets
 import sklearn.metrics
 from sklearn.utils.multiclass import type_of_target
 
@@ -7,7 +7,6 @@
 The following example shows how to fit a simple regression model with
 *auto-sklearn*.
 """
-import sklearn.model_selection
 import sklearn.datasets
 import sklearn.metrics
 
@@ -19,7 +18,7 @@
 # ============
 
 X, y = sklearn.datasets.load_boston(return_X_y=True)
-feature_types = (['numerical'] * 3) + ['categorical'] + (['numerical'] * 9)
+
 X_train, X_test, y_train, y_test = \
     sklearn.model_selection.train_test_split(X, y, random_state=1)
 
@@ -33,8 +32,7 @@
     tmp_folder='/tmp/autosklearn_regression_example_tmp',
     output_folder='/tmp/autosklearn_regression_example_out',
 )
-automl.fit(X_train, y_train, dataset_name='boston',
-           feat_type=feature_types)
+automl.fit(X_train, y_train, dataset_name='boston')
 
 ############################################################################
 # Print the final ensemble constructed by auto-sklearn
 
@@ -0,0 +1,10 @@
+.. _advanced_examples:
+
+=================
+Advanced Examples
+=================
+
+Examples on customizing Auto-sklearn to one's use case by changing the
+metric to optimize, the train-validation split, giving feature types,
+using pandas dataframes as input and inspecting the results of the search
+procedure.
@@ -44,7 +44,7 @@
 # ==========================
 
 cls = autosklearn.classification.AutoSklearnClassifier(
-    time_left_for_this_task=60,
+    time_left_for_this_task=30,
     # The two flags below are provided to speed up calculations
     # Not recommended for a real implementation
     initial_configurations_via_metalearning=0,
 
@@ -0,0 +1,173 @@
+# -*- encoding: utf-8 -*-
+"""
+======================
+Obtain run information
+======================
+
+The following example shows how to obtain information from a finished
+Auto-sklearn run. In particular, it shows:
+* how to query which models were evaluated by Auto-sklearn
+* how to query the models in the final ensemble
+* how to get general statistics on what Auto-sklearn evaluated
+
+Auto-sklearn is a wrapper on top of
+the sklearn models. This example illustrates how to interact
+with the sklearn components directly, in this case a PCA preprocessor.
+"""
+import sklearn.datasets
+import sklearn.metrics
+
+import autosklearn.classification
+
+############################################################################
+# Data Loading
+# ============
+
+X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
+X_train, X_test, y_train, y_test = \
+    sklearn.model_selection.train_test_split(X, y, random_state=1)
+
+############################################################################
+# Build and fit the classifier
+# ============================
+
+automl = autosklearn.classification.AutoSklearnClassifier(
+    time_left_for_this_task=30,
+    per_run_time_limit=10,
+    disable_evaluator_output=False,
+    # To simplify querying the models in the final ensemble, we
+    # restrict auto-sklearn to use only pca as a preprocessor
+    include_preprocessors=['pca'],
+)
+automl.fit(X_train, y_train, dataset_name='breast_cancer')
+
+############################################################################
+# Predict using the model
+# =======================
+
+predictions = automl.predict(X_test)
+print("Accuracy score: {}".format(
+    sklearn.metrics.accuracy_score(y_test, predictions))
+)
+
+
+############################################################################
+# Report the models found by Auto-Sklearn
+# =======================================
+#
+# Auto-sklearn uses
+# `Ensemble Selection <https://www.cs.cornell.edu/~alexn/papers/shotgun.icml04.revised.rev2.pdf>`_
+# to construct ensembles in a post-hoc fashion. The ensemble is a linear
+# weighting of all models constructed during the hyperparameter optimization.
+# This prints the final ensemble. It is a list of tuples, each tuple being
+# the model weight in the ensemble and the model itself.
+
+print(automl.show_models())
+
+###########################################################################
+# Report statistics about the search
+# ==================================
+#
+# Print statistics about the auto-sklearn run, such as the number of
+# iterations and the number of models that failed with a timeout.
+print(automl.sprint_statistics())
+
+############################################################################
+# Detailed statistics about the search - part 1
+# =============================================
+#
+# Auto-sklearn also keeps detailed statistics of the hyperparameter
+# optimization procedure, which are stored in a so-called
+# `run history <https://automl.github.io/SMAC3/master/apidoc/smac.
+# runhistory.runhistory.html#smac.runhistory.runhistory.RunHistory>`_.
+
+print(automl._automl[0].runhistory_)
+
+############################################################################
+# Runs are stored inside an ``OrderedDict`` called ``data``:
+
+print(len(automl._automl[0].runhistory_.data))
+
+############################################################################
+# Let's iterate over all entries
+
+for run_key in automl._automl[0].runhistory_.data:
+    print('#########')
+    print(run_key)
+    print(automl._automl[0].runhistory_.data[run_key])
+
+############################################################################
+# and have a detailed look at one entry:
+
+run_key = list(automl._automl[0].runhistory_.data.keys())[0]
+run_value = automl._automl[0].runhistory_.data[run_key]
+
+############################################################################
+# The ``run_key`` contains all information describing a run:
+
+print("Configuration ID:", run_key.config_id)
+print("Instance:", run_key.instance_id)
+print("Seed:", run_key.seed)
+print("Budget:", run_key.budget)
+
+############################################################################
+# and the configuration can be looked up in the run history as well:
+
+print(automl._automl[0].runhistory_.ids_config[run_key.config_id])
+
+############################################################################
+# The only other important entry is the budget in case you are using
+# auto-sklearn with
+# `successive halving <examples/60_search/example_successive_halving.py>`_.
+# The remaining parts of the key can be ignored for auto-sklearn and are
+# only there because the underlying optimizer, SMAC, can handle more general
+# problems, too.
+
+############################################################################
+# The ``run_value`` contains all output from running the configuration:
+
+print("Cost:", run_value.cost)
+print("Time:", run_value.time)
+print("Status:", run_value.status)
+print("Additional information:", run_value.additional_info)
+print("Start time:", run_value.starttime)
+print("End time:", run_value.endtime)
+
+############################################################################
+# Cost is basically the same as a loss. In case the metric to optimize for
+# should be maximized, it is internally transformed into a minimization
+# metric. Additionally, the status type gives information on whether the run
+# was successful, while the additional information's most interesting entry
+# is the internal training loss. Furthermore, there is detailed information
+# on the runtime available.
+
+############################################################################
+# Detailed statistics about the search - part 2
+# =============================================
+#
+# To maintain compatibility with scikit-learn, Auto-sklearn gives the
+# same data as
+# `cv_results_ <https://scikit-learn.org/stable/modules/generated/sklearn.
+# model_selection.GridSearchCV.html>`_.
+
+print(automl.cv_results_)
+
+############################################################################
+# Inspect the components of the best model
+# ========================================
+#
+# Iterate over the components of the model and print
+# the explained variance ratio per stage
+for i, (weight, pipeline) in enumerate(automl.get_models_with_weights()):
+    for stage_name, component in pipeline.named_steps.items():
+        if 'preprocessor' in stage_name:
+            print(
+                "The {}th pipeline has an explained variance ratio of {}".format(
+                    i,
+                    # The component is an instance of AutoSklearnChoice.
+                    # Access the sklearn object via the choice attribute
+                    # We want the explained variance ratio of
+                    # each principal component
+                    component.choice.preprocessor.explained_variance_ratio_
+                )
+            )
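
Note on the `(weight, pipeline)` tuples used above: the example describes the final ensemble as a linear weighting of models. The sketch below shows what such a weighting could look like for class probabilities. It is only an illustration under stated assumptions, not auto-sklearn's implementation: `weighted_ensemble_proba`, `model_a` and `model_b` are hypothetical names, and the toy models stand in for fitted pipelines.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a fitted pipeline: it maps one sample
# to a list of class probabilities.
Model = Callable[[Sequence[float]], List[float]]


def weighted_ensemble_proba(
    models_with_weights: List[Tuple[float, Model]],
    sample: Sequence[float],
) -> List[float]:
    """Linearly combine the class probabilities of all ensemble members."""
    total_weight = sum(weight for weight, _ in models_with_weights)
    if total_weight <= 0:
        raise ValueError("ensemble weights must sum to a positive value")

    n_classes = len(models_with_weights[0][1](sample))
    combined = [0.0] * n_classes
    for weight, model in models_with_weights:
        for class_index, proba in enumerate(model(sample)):
            combined[class_index] += weight * proba

    # Normalize so that the result is a probability vector even if the
    # weights do not sum to exactly one.
    return [value / total_weight for value in combined]


def model_a(sample: Sequence[float]) -> List[float]:
    # Toy model (assumption): always favors class 0.
    return [0.9, 0.1]


def model_b(sample: Sequence[float]) -> List[float]:
    # Toy model (assumption): slightly favors class 1.
    return [0.4, 0.6]


# Same shape as the list iterated over above: (weight, model) tuples.
ensemble = [(0.75, model_a), (0.25, model_b)]
proba = weighted_ensemble_proba(ensemble, [1.0, 2.0])
predicted_class = max(range(len(proba)), key=proba.__getitem__)
print(proba, predicted_class)
```

With these toy weights the first model dominates, so the combined prediction follows it. In a real run the weights would come from the ensemble reported by the fitted estimator rather than being chosen by hand.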