
Commit a119e7f

Fix broken links in docs and update parallel docs (#1088)

* Fix broken links
* Feedback from comments
* Update manual.rst

Co-authored-by: Matthias Feurer <[email protected]>

1 parent d3aa95e commit a119e7f

File tree

2 files changed: +12 −19 lines


doc/manual.rst

Lines changed: 10 additions & 17 deletions
@@ -104,12 +104,12 @@ Supported Inputs
 * Multioutput Regression

 You can provide feature and target training pairs (X_train/y_train) to *auto-sklearn* to fit an ensemble of pipelines as described in the next section. This X_train/y_train dataset must belong to one of the supported formats: np.ndarray, pd.DataFrame, scipy.sparse.csr_matrix and python lists.
-Optionally, you can measure the ability of this fitted model to generalize to unseen data by providing an optional testing pair (X_test/Y_test). For further details, please refer to the example `Train and Test inputs <examples/example_pandas_train_test.html>`_. Supported formats for these training and testing pairs are: np.ndarray, pd.DataFrame, scipy.sparse.csr_matrix and python lists.
+Optionally, you can measure the ability of this fitted model to generalize to unseen data by providing an optional testing pair (X_test/Y_test). For further details, please refer to the example `Train and Test inputs <examples/40_advanced/example_pandas_train_test.html>`_. Supported formats for these training and testing pairs are: np.ndarray, pd.DataFrame, scipy.sparse.csr_matrix and python lists.

 If your data contains categorical values (in the features or targets), autosklearn will automatically encode your data using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_ for unidimensional data and a `sklearn.preprocessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_ for multidimensional data.

 Regarding the features, there are two methods to guide *auto-sklearn* to properly encode categorical columns:
-* Providing a X_train/X_test numpy array with the optional flag feat_type. For further details, you can check the example `Feature Types <examples/example_feature_types.html>`_.
+* Providing a X_train/X_test numpy array with the optional flag feat_type. For further details, you can check the example `Feature Types <examples/40_advanced/example_feature_types.html>`_.
 * You can provide a pandas DataFrame, with properly formatted columns. If a column has numerical dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. If the column has a categorical/boolean class, it will be encoded. If the column is of any other type (Object or Timeseries), an error will be raised. For further details on how to properly encode your data, you can check the example `Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_. If you are working with time series, it is recommended that you follow this approach `Working with time data <https://stats.stackexchange.com/questions/311494/>`_.

 Regarding the targets (y_train/y_test), if the task involves a classification problem, such features will be automatically encoded. It is recommended to provide both y_train and y_test during fit, so that a common encoding is created between these splits (if only y_train is provided during fit, the categorical encoder will not be able to handle new classes that are exclusive to y_test). If the task is regression, no encoding happens on the targets.
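The encoding behaviour the documentation above describes (a ``LabelEncoder`` for unidimensional data, an ``OrdinalEncoder`` for multidimensional data) can be sketched directly with scikit-learn. The toy data below is illustrative only, not taken from the commit:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Unidimensional categorical targets: auto-sklearn applies a LabelEncoder.
y_train = np.array(["cat", "dog", "cat", "bird"])
y_encoded = LabelEncoder().fit_transform(y_train)
# Classes are sorted alphabetically (bird=0, cat=1, dog=2) -> [1, 2, 1, 0]

# Multidimensional categorical features: auto-sklearn applies an OrdinalEncoder.
X_train = np.array([["red", "S"], ["blue", "M"], ["red", "M"]])
X_encoded = OrdinalEncoder().fit_transform(X_train)
# Per-column sorted categories: blue=0, red=1 and M=0, S=1
```

Passing a pandas DataFrame with ``category`` dtype columns, or a numpy array plus ``feat_type``, triggers the same encoding inside *auto-sklearn* without any manual preprocessing.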
@@ -143,28 +143,21 @@ obtained by running *auto-sklearn*. It additionally prints the number of both su
 algorithm runs.

 The results obtained from the final ensemble can be printed by calling ``show_models()``. *auto-sklearn* ensemble is composed of scikit-learn models that can be inspected as exemplified by
-`model inspection example <examples/example_get_pipeline_components.html>`_
+`model inspection example <examples/40_advanced/example_get_pipeline_components.html>`_
 .

 Parallel computation
 ====================

-*auto-sklearn* supports parallel execution by data sharing on a shared file
-system. In this mode, the SMAC algorithm shares the training data for it's
-model by writing it to disk after every iteration. At the beginning of each
-iteration, SMAC loads all newly found data points. We provide an example
-implementing
-`scikit-learn's n_jobs functionality <examples/example_parallel_n_jobs.html>`_
-and an example on how
-to
-`manually start multiple instances of auto-sklearn <examples/example_parallel_manual_spawning.html>`_
-.
-
 In it's default mode, *auto-sklearn* already uses two cores. The first one is
 used for model building, the second for building an ensemble every time a new
-machine learning model has finished training. The
-`sequential example <examples/example_sequential.html>`_
-shows how to run these tasks sequentially to use only a single core at a time.
+machine learning model has finished training. An example on how to do this sequentially (first searching for individual models, and then building an ensemble from them) can be seen in `sequential auto-sklearn example <examples/60_search/example_sequential.html>`_.
+
+Nevertheless, *auto-sklearn* also supports parallel Bayesian optimization via the use of `Dask.distributed <https://distributed.dask.org/>`_. By providing the arguments ``n_jobs`` to the estimator construction, one can control the number of cores available to *auto-sklearn* (As exemplified in `sequential auto-sklearn example <examples/60_search/example_parallel_n_jobs>`_). Distributed processes are also supported by providing a custom client object to *auto-sklearn* like in the
+example: `sequential auto-sklearn example <examples/60_search/example_parallel_manual_spawning_python>`_. When multiple cores are available, *auto-sklearn*
+will create a worker per core, and use the available workers to both search for better machine learning models as well as building an ensemble with them until the time resource is exhausted.
+
+**Note:** *auto-sklearn* requires all workers to have access to a shared file system for storing training data and models.

 Furthermore, depending on the installation of scikit-learn and numpy,
 the model building procedure may use up to all cores. Such behaviour is
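The ``n_jobs`` argument added to the documentation in this hunk follows the usual scikit-learn convention for core counts. As a rough, hypothetical sketch of that convention using plain scikit-learn (running *auto-sklearn* itself requires a full install, a Dask setup, and a time budget, so a lightweight stand-in estimator is used here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# n_jobs bounds how many cores the estimator may use; auto-sklearn adopts
# the same convention, e.g. AutoSklearnClassifier(n_jobs=4).
X, y = make_classification(n_samples=200, random_state=1)
clf = RandomForestClassifier(n_estimators=20, n_jobs=2, random_state=1)
clf.fit(X, y)
train_acc = clf.score(X, y)  # training accuracy; a forest fits its training data closely
```

With *auto-sklearn*, the same knob additionally controls how many Dask workers share the model search and ensemble building, which is why all workers need the shared file system noted above.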

examples/20_basic/example_classification.py

Lines changed: 2 additions & 2 deletions

@@ -22,8 +22,8 @@
 sklearn.model_selection.train_test_split(X, y, random_state=1)

 ############################################################################
-# Build and fit a regressor
-# =========================
+# Build and fit a classifier
+# ==========================

 automl = autosklearn.classification.AutoSklearnClassifier(
     time_left_for_this_task=120,

0 commit comments