Update to latest scikit-learn release for deprecation and compatibility #53
Description
Using the current head 0.2.0 release of spark-sklearn and the current release of scikit-learn (0.18.1), I'm getting the following deprecation warning:
/.../python3.4/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
The library needs to be updated to use the new model_selection module and its iterator interfaces.
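As a rough illustration of the iterator interface change (a sketch only, not spark-sklearn's actual code): in sklearn.cross_validation, KFold was built with the number of samples and iterated directly, while in sklearn.model_selection it is built with n_splits only and yields indices through split(). The toy array X is an assumption made for the example.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data, assumed only for this example: 6 samples, 2 features.
X = np.arange(12).reshape(6, 2)

# Deprecated style (sklearn.cross_validation), shown for comparison only:
#     kf = KFold(len(X), n_folds=3)
#     for train, test in kf: ...
#
# New style (sklearn.model_selection): the splitter no longer takes the
# data size up front; the data is passed to split() instead.
kf = KFold(n_splits=3)
folds = [(train.tolist(), test.tolist()) for train, test in kf.split(X)]

for train, test in folds:
    print("train:", train, "test:", test)
```

Code that passes a CV iterator around, as a grid search does, therefore has to call split(X, y) rather than iterate the object itself.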
In addition, due to changes in sklearn.model_selection.GridSearchCV, the attributes available on the fitted spark-sklearn.GridSearchCV are out of date.
sklearn.model_selection.GridSearchCV now has:
- cv_results_ : dict of numpy (masked) ndarrays - A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame.
- best_estimator_ : estimator - Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. Not available if refit=False.
- best_score_ : float - Score of best_estimator on the left out data.
- best_params_ : dict - Parameter setting that gave the best results on the hold out data.
- best_index_ : int - The index (of the cv_results_ arrays) which corresponds to the best candidate parameter setting.
- scorer_ : function - Scorer function used on the held out data to choose the best parameters for the model.
- n_splits_ : int - The number of cross-validation splits (folds/iterations).
While spark-sklearn.GridSearchCV has:
- grid_scores_ : list of named tuples
- best_estimator_ : estimator - Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. Not available if refit=False.
- best_score_ : float - Score of best_estimator on the left out data.
- best_params_ : dict - Parameter setting that gave the best results on the hold out data.
- scorer_ : function - Scorer function used on the held out data to choose the best parameters for the model.
The most critical difference is that sklearn added the more comprehensive cv_results_, which carries data that the formerly compatible grid_scores_ lacks.
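To make the gap concrete, here is a minimal sketch of how grid_scores_-style entries could be derived from a cv_results_-shaped dict, which suggests the old attribute is recoverable from the new one but not the other way round. The helper name grid_scores_from_cv_results and the hand-written cv_results values are hypothetical; the keys (params, mean_test_score, splitN_test_score) and the tuple fields (parameters, mean_validation_score, cv_validation_scores) follow the sklearn naming as I understand it.

```python
from collections import namedtuple

# Field names mirror the entries of the old grid_scores_ list.
CVScoreTuple = namedtuple(
    "CVScoreTuple",
    ["parameters", "mean_validation_score", "cv_validation_scores"],
)


def grid_scores_from_cv_results(cv_results, n_splits):
    """Hypothetical helper: rebuild grid_scores_-like tuples from cv_results_."""
    scores = []
    for i, params in enumerate(cv_results["params"]):
        # Collect this candidate's score on each split.
        per_split = [
            cv_results["split%d_test_score" % k][i] for k in range(n_splits)
        ]
        scores.append(
            CVScoreTuple(params, cv_results["mean_test_score"][i], per_split)
        )
    return scores


# Hand-written stand-in for a fitted search's cv_results_ (illustrative values).
cv_results = {
    "params": [{"C": 1}, {"C": 10}],
    "mean_test_score": [0.75, 0.5],
    "split0_test_score": [0.5, 0.25],
    "split1_test_score": [1.0, 0.75],
}

legacy = grid_scores_from_cv_results(cv_results, n_splits=2)
for entry in legacy:
    print(entry)
```

On a real fitted search, n_splits would come from n_splits_, and cv_results_ would hold additional columns with no grid_scores_ counterpart.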