
Commit 6e58865

Merge pull request #1098 from automl/development

Development

2 parents 58e36be + f657ba4

File tree

12 files changed: +201 −37 lines

.github/stale.yml

Lines changed: 48 additions & 0 deletions

@@ -0,0 +1,48 @@
+# Configuration for probot-stale - https://github.com/probot/stale
+
+# Number of days of inactivity before an Issue or Pull Request becomes stale
+daysUntilStale: 60
+
+# Number of days of inactivity before an Issue or Pull Request with the stale label is closed.
+# Set to false to disable. If disabled, issues still need to be closed manually, but will remain marked as stale.
+daysUntilClose: 7
+
+# Only issues or pull requests with all of these labels are checked if stale. Defaults to `[]` (disabled)
+onlyLabels:
+  - Answered
+  - "Feedback Required"
+  - invalid
+  - wontfix
+
+# Issues or Pull Requests with these labels will never be considered stale. Set to `[]` to disable
+exemptLabels:
+  - Bug
+
+# Set to true to ignore issues in a project (defaults to false)
+exemptProjects: false
+
+# Set to true to ignore issues in a milestone (defaults to false)
+exemptMilestones: false
+
+# Set to true to ignore issues with an assignee (defaults to false)
+exemptAssignees: false
+
+# Label to use when marking as stale
+staleLabel: stale
+
+# Comment to post when marking as stale. Set to `false` to disable
+markComment: >
+  This issue has been automatically marked as stale because it has not had
+  recent activity. It will be closed if no further activity occurs for the
+  next 7 days. Thank you for your contributions.
+
+# Comment to post when removing the stale label.
+# unmarkComment: >
+#   Your comment here.
+
+# Comment to post when closing a stale Issue or Pull Request.
+# closeComment: >
+#   Your comment here.
+
+# Limit the number of actions per hour, from 1-30. Default is 30
+limitPerRun: 30

.github/workflows/pytest.yml

Lines changed: 10 additions & 2 deletions

@@ -1,6 +1,11 @@
 name: Tests

-on: [push, pull_request]
+on:
+  push:
+  pull_request:
+  schedule:
+    # Every Monday at 7AM UTC
+    - cron: '0 07 * * 1'

 jobs:
   ubuntu:
@@ -17,12 +22,15 @@ jobs:
         - python-version: 3.7
           use-conda: false
           use-dist: true
-    fail-fast: false
+      fail-fast: false

     steps:
     - uses: actions/checkout@v2
     - name: Setup Python ${{ matrix.python-version }}
       uses: actions/setup-python@v2
+      # A note on checkout: when checking out the repository that
+      # triggered a workflow, this defaults to the reference or SHA for that event.
+      # Otherwise, the default branch (master) is used.
       with:
         python-version: ${{ matrix.python-version }}
     - name: Conda Install test dependencies
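The new `schedule` trigger uses standard five-field cron syntax. As a quick sanity check of the expression above, here is a small sketch, assuming the third-party `croniter` package is available; it prints the next few firing times:

import datetime
from croniter import croniter

# '0 07 * * 1' = minute 0, hour 7 (UTC), any day-of-month, any month, Monday.
schedule = croniter('0 07 * * 1', datetime.datetime(2021, 4, 1))
for _ in range(3):
    print(schedule.get_next(datetime.datetime))  # successive Mondays at 07:00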

autosklearn/__version__.py

Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
 """Version information."""

 # The following line *must* be the last in the module, exactly as formatted:
-__version__ = "0.12.3"
+__version__ = "0.12.4"

autosklearn/experimental/askl2.py

Lines changed: 29 additions & 13 deletions

@@ -25,15 +25,16 @@
     m = hashlib.md5()
     m.update(fh.read().encode('utf8'))
     training_data_hash = m.hexdigest()[:10]
-    sklearn_version = sklearn.__version__
-    autosklearn_version = autosklearn.__version__
-    selector_file = pathlib.Path(
-        os.environ.get(
-            'XDG_CACHE_HOME',
-            '~/.cache/auto-sklearn/askl2_selector_%s_%s_%s.pkl'
-            % (autosklearn_version, sklearn_version, training_data_hash),
-        )
-    ).expanduser()
+    selector_filename = "askl2_selector_%s_%s_%s.pkl" % (
+        autosklearn.__version__,
+        sklearn.__version__,
+        training_data_hash
+    )
+    selector_directory = os.environ.get('XDG_CACHE_HOME')
+    if selector_directory is None:
+        selector_directory = pathlib.Path.home()
+    selector_directory = pathlib.Path(selector_directory).joinpath('auto-sklearn').expanduser()
+    selector_file = selector_directory / selector_filename
     metafeatures = pd.DataFrame(training_data['metafeatures'])
     y_values = np.array(training_data['y_values'])
     strategies = training_data['strategies']
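The rewritten lookup prefers `$XDG_CACHE_HOME` and falls back to the user's home directory, fixing #1072. A minimal standalone sketch of the same resolution logic (the filename components are illustrative placeholders):

import os
import pathlib

# Prefer $XDG_CACHE_HOME; fall back to the user's home directory.
cache_root = os.environ.get('XDG_CACHE_HOME')
if cache_root is None:
    cache_root = pathlib.Path.home()

# Keep all selector files together under an 'auto-sklearn' subdirectory.
selector_directory = pathlib.Path(cache_root).joinpath('auto-sklearn').expanduser()
selector_file = selector_directory / "askl2_selector_<askl-version>_<sklearn-version>_<hash>.pkl"
print(selector_file)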
@@ -53,8 +54,14 @@
         maxima=maxima_for_methods,
     )
     selector_file.parent.mkdir(exist_ok=True, parents=True)
-    with open(selector_file, 'wb') as fh:
-        pickle.dump(selector, fh)
+    try:
+        with open(selector_file, 'wb') as fh:
+            pickle.dump(selector, fh)
+    except Exception as e:
+        print("AutoSklearn2Classifier needs to create a selector file under "
+              "the user's home directory or XDG_CACHE_HOME. However, "
+              "the path {} is not writable.".format(selector_file))
+        raise e


 class SmacObjectCallback:
@@ -156,6 +163,7 @@ class AutoSklearn2Classifier(AutoSklearnClassifier):
     def __init__(
         self,
         time_left_for_this_task: int = 3600,
+        per_run_time_limit=None,
         ensemble_size: int = 50,
         ensemble_nbest: Union[float, int] = 50,
         max_models_on_disc: int = 50,
@@ -183,6 +191,13 @@ def __init__(
         models. By increasing this value, *auto-sklearn* has a higher
         chance of finding better models.

+    per_run_time_limit : int, optional (default=1/10 of time_left_for_this_task)
+        Time limit for a single call to the machine learning model.
+        Model fitting will be terminated if the machine learning
+        algorithm runs over the time limit. Set this value high enough so
+        that typical machine learning algorithms can be fit on the
+        training data.
+
     ensemble_size : int, optional (default=50)
         Number of models added to the ensemble built by *Ensemble
         selection from libraries of models*. Models are drawn with
@@ -255,7 +270,7 @@ def __init__(
     smac_scenario_args : dict, optional (None)
         Additional arguments inserted into the scenario of SMAC. See the
-        `SMAC documentation
+        `SMAC documentation
         <https://automl.github.io/SMAC3/master/options.html?highlight=scenario
         #scenario>`_
         for a list of available arguments.
@@ -272,7 +287,7 @@ def __init__(
         If None is provided, a default metric is selected depending on the task.

     scoring_functions : List[Scorer], optional (None)
-        List of scorers which will be calculated for each pipeline and results will be
+        List of scorers which will be calculated for each pipeline and results will be
         available via ``cv_results``

     load_models : bool, optional (True)
@@ -295,6 +310,7 @@ def __init__(
         include_preprocessors = ["no_preprocessing"]
         super().__init__(
             time_left_for_this_task=time_left_for_this_task,
+            per_run_time_limit=per_run_time_limit,
             initial_configurations_via_metalearning=0,
             ensemble_size=ensemble_size,
             ensemble_nbest=ensemble_nbest,
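With this change (#1050), the per-model budget can be set directly on the experimental classifier and is forwarded to the parent class. A short usage sketch; the time budgets are illustrative, not recommendations:

from autosklearn.experimental.askl2 import AutoSklearn2Classifier

# Illustrative budgets: 10 minutes overall, at most 60 seconds per model fit.
automl = AutoSklearn2Classifier(
    time_left_for_this_task=600,
    per_run_time_limit=60,
)
# automl.fit(X_train, y_train) then enforces both limits during the search.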

autosklearn/metalearning/metalearning/clustering/gmeans.py

Lines changed: 4 additions & 0 deletions

@@ -34,6 +34,10 @@ def fit(self, X):
         indices = KMeans.labels_ == i
         X_ = X[indices]

+        if np.sum(indices) < self.minimum_samples_per_cluster * 2:
+            cluster_centers.append(cluster_center)
+            continue
+
         for i in range(10):
             KMeans_ = sklearn.cluster.KMeans(n_clusters=2,
                                              n_init=self.n_init,
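The guard addresses #732: a cluster with fewer than twice the per-cluster minimum cannot be split into two valid children, so its center is kept as-is. A standalone sketch of the idea (the minimum of 5 is an assumed value for illustration):

import numpy as np

minimum_samples_per_cluster = 5  # assumed value for illustration
labels = np.array([0] * 6 + [1] * 20)  # cluster 0 is too small to split

for i in (0, 1):
    indices = labels == i
    if np.sum(indices) < minimum_samples_per_cluster * 2:
        # Splitting would leave at least one child below the minimum;
        # keep the existing center and skip the 2-means split.
        print(f"cluster {i}: kept as-is")
        continue
    print(f"cluster {i}: eligible for a 2-means split")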
autosklearn/pipeline/components/data_preprocessing/rescaling/power_transformer.py

Lines changed: 32 additions & 0 deletions

@@ -0,0 +1,32 @@
+from autosklearn.pipeline.constants import DENSE, UNSIGNED_DATA, INPUT
+from autosklearn.pipeline.components.data_preprocessing.rescaling.abstract_rescaling \
+    import Rescaling
+from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
+
+
+class PowerTransformerComponent(Rescaling, AutoSklearnPreprocessingAlgorithm):
+    def __init__(self, random_state):
+        from sklearn.preprocessing import PowerTransformer
+        self.preprocessor = PowerTransformer(copy=False)
+
+    @staticmethod
+    def get_properties(dataset_properties=None):
+        return {'shortname': 'PowerTransformer',
+                'name': 'PowerTransformer',
+                'handles_missing_values': False,
+                'handles_nominal_values': False,
+                'handles_numerical_features': True,
+                'prefers_data_scaled': False,
+                'prefers_data_normalized': False,
+                'handles_regression': True,
+                'handles_classification': True,
+                'handles_multiclass': True,
+                'handles_multilabel': True,
+                'handles_multioutput': True,
+                'is_deterministic': True,
+                # TODO find out if this is right!
+                'handles_sparse': False,
+                'handles_dense': True,
+                'input': (DENSE, UNSIGNED_DATA),
+                'output': (INPUT,),
+                'preferred_dtype': None}
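For context, the wrapped scikit-learn transformer applies a Yeo-Johnson power transform and standardizes the result by default. A small sketch of its behaviour on skewed data (the data is synthetic):

import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.exponential(scale=2.0, size=(100, 2))  # right-skewed features

pt = PowerTransformer(copy=False)  # default: Yeo-Johnson, with standardization
X_t = pt.fit_transform(X)
print(X_t.mean(axis=0), X_t.std(axis=0))  # close to zero mean, unit variance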

doc/manual.rst

Lines changed: 10 additions & 17 deletions

@@ -104,12 +104,12 @@ Supported Inputs
 * Multioutput Regression

 You can provide feature and target training pairs (X_train/y_train) to *auto-sklearn* to fit an ensemble of pipelines as described in the next section. This X_train/y_train dataset must belong to one of the supported formats: np.ndarray, pd.DataFrame, scipy.sparse.csr_matrix and python lists.
-Optionally, you can measure the ability of this fitted model to generalize to unseen data by providing an optional testing pair (X_test/Y_test). For further details, please refer to the example `Train and Test inputs <examples/example_pandas_train_test.html>`_. Supported formats for these training and testing pairs are: np.ndarray, pd.DataFrame, scipy.sparse.csr_matrix and python lists.
+Optionally, you can measure the ability of this fitted model to generalize to unseen data by providing an optional testing pair (X_test/Y_test). For further details, please refer to the example `Train and Test inputs <examples/40_advanced/example_pandas_train_test.html>`_. Supported formats for these training and testing pairs are: np.ndarray, pd.DataFrame, scipy.sparse.csr_matrix and python lists.

 If your data contains categorical values (in the features or targets), autosklearn will automatically encode your data using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_ for unidimensional data and a `sklearn.preprocessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_ for multidimensional data.

 Regarding the features, there are two methods to guide *auto-sklearn* to properly encode categorical columns:
-* Providing an X_train/X_test numpy array with the optional flag feat_type. For further details, you can check the example `Feature Types <examples/example_feature_types.html>`_.
+* Providing an X_train/X_test numpy array with the optional flag feat_type. For further details, you can check the example `Feature Types <examples/40_advanced/example_feature_types.html>`_.
 * You can provide a pandas DataFrame, with properly formatted columns. If a column has numerical dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. If the column has a categorical/boolean class, it will be encoded. If the column is of any other type (Object or Timeseries), an error will be raised. For further details on how to properly encode your data, you can check the example `Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_. If you are working with time series, it is recommended that you follow this approach `Working with time data <https://stats.stackexchange.com/questions/311494/>`_.

 Regarding the targets (y_train/y_test), if the task involves a classification problem, the targets will be automatically encoded. It is recommended to provide both y_train and y_test during fit, so that a common encoding is created between these splits (if only y_train is provided during fit, the categorical encoder will not be able to handle new classes that are exclusive to y_test). If the task is regression, no encoding happens on the targets.
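As a quick illustration of the first method above, ``feat_type`` is a per-column list of ``'Categorical'``/``'Numerical'`` markers passed to ``fit()``. A minimal sketch; the toy data and time budget are illustrative:

import numpy as np
import autosklearn.classification

# Toy data: column 0 holds integer-coded categories, column 1 is numerical.
X = np.array([[0, 1.5], [1, 2.7], [0, 3.1], [2, 0.4]])
y = np.array([0, 1, 0, 1])

automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=60)
automl.fit(X, y, feat_type=['Categorical', 'Numerical'])  # one entry per column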
@@ -143,28 +143,21 @@ obtained by running *auto-sklearn*. It additionally prints the number of both successful and unsuccessful
 algorithm runs.

 The results obtained from the final ensemble can be printed by calling ``show_models()``. The *auto-sklearn* ensemble is composed of scikit-learn models that can be inspected as exemplified by the
-`model inspection example <examples/example_get_pipeline_components.html>`_
+`model inspection example <examples/40_advanced/example_get_pipeline_components.html>`_
 .

 Parallel computation
 ====================

-*auto-sklearn* supports parallel execution by data sharing on a shared file
-system. In this mode, the SMAC algorithm shares the training data for its
-model by writing it to disk after every iteration. At the beginning of each
-iteration, SMAC loads all newly found data points. We provide an example
-implementing
-`scikit-learn's n_jobs functionality <examples/example_parallel_n_jobs.html>`_
-and an example on how to
-`manually start multiple instances of auto-sklearn <examples/example_parallel_manual_spawning.html>`_
-.
-
 In its default mode, *auto-sklearn* already uses two cores. The first one is
 used for model building, the second for building an ensemble every time a new
-machine learning model has finished training. The
-`sequential example <examples/example_sequential.html>`_
-shows how to run these tasks sequentially to use only a single core at a time.
+machine learning model has finished training. An example of how to do this sequentially (first searching for individual models, and then building an ensemble from them) can be seen in the `sequential auto-sklearn example <examples/60_search/example_sequential.html>`_.
+
+Nevertheless, *auto-sklearn* also supports parallel Bayesian optimization via `Dask.distributed <https://distributed.dask.org/>`_. By providing the argument ``n_jobs`` at estimator construction, one can control the number of cores available to *auto-sklearn* (as exemplified in the `parallel n_jobs example <examples/60_search/example_parallel_n_jobs>`_). Distributed processes are also supported by providing a custom Dask client object to *auto-sklearn*, as in the `manual spawning example <examples/60_search/example_parallel_manual_spawning_python>`_. When multiple cores are available, *auto-sklearn* will create one worker per core, and use the available workers both to search for better machine learning models and to build an ensemble with them until the time resource is exhausted.
+
+**Note:** *auto-sklearn* requires all workers to have access to a shared file system for storing training data and models.

 Furthermore, depending on the installation of scikit-learn and numpy,
 the model building procedure may use up to all cores. Such behaviour is
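The rewritten paragraph above describes the ``n_jobs`` route. A minimal sketch of what that looks like in user code; the budget and core count are illustrative:

import autosklearn.classification

# Four workers: the available workers both search for new models and
# build an ensemble from them until the time budget runs out.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,
    n_jobs=4,
)
# automl.fit(X_train, y_train)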

doc/releases.rst

Lines changed: 19 additions & 0 deletions

@@ -12,6 +12,25 @@
 Releases
 ========

+Version 0.12.4
+==============
+
+* ADD #660: Enable scikit-learn's power transformation for input features.
+* MAINT: Bump the `pyrfr` minimum dependency to 0.8.1 to automatically download wheels from pypi
+  if possible.
+* FIX #732: Add a missing size check into the GMEANS clustering used for the NeurIPS 2015 paper.
+* FIX #1050: Add missing arguments to the `AutoSklearn2Classifier` signature.
+* FIX #1072: Fixes a bug where the `AutoSklearn2Classifier` could not be created due to trying to
+  cache to the wrong directory.
+
+Contributors v0.12.4
+********************
+
+* Matthias Feurer
+* Francisco Rivera
+* Maximilian Greil
+* Pepe Berba
+
 Version 0.12.3
 ==============

examples/20_basic/example_classification.py

Lines changed: 2 additions & 2 deletions

@@ -22,8 +22,8 @@
     sklearn.model_selection.train_test_split(X, y, random_state=1)

 ############################################################################
-# Build and fit a regressor
-# =========================
+# Build and fit a classifier
+# ==========================

 automl = autosklearn.classification.AutoSklearnClassifier(
     time_left_for_this_task=120,

requirements.txt

Lines changed: 1 addition & 1 deletion

@@ -14,5 +14,5 @@ liac-arff

 ConfigSpace>=0.4.14,<0.5
 pynisher>=0.6.3
-pyrfr>=0.7,<0.9
+pyrfr>=0.8.1,<0.9
 smac>=0.13.1,<0.14
