
Commit 2173b62

Merge pull request #212 from ragrawal/circleci-project-setup
Lots of changes. Please see change log. Also the project maintainer changed from Israel to Ritesh Agrawal.
2 parents 611254d + babf9ab commit 2173b62

12 files changed

+213
-647
lines changed

README.rst

Lines changed: 78 additions & 69 deletions
Original file line number | Diff line number | Diff line change
@@ -6,12 +6,7 @@ Sklearn-pandas
66
:target: https://circleci.com/gh/scikit-learn-contrib/sklearn-pandas
77

88
This module provides a bridge between `Scikit-Learn <http://scikit-learn.org/stable>`__'s machine learning methods and `pandas <https://pandas.pydata.org>`__-style Data Frames.
9-
10-
In particular, it provides:
11-
12-
1. A way to map ``DataFrame`` columns to transformations, which are later recombined into features.
13-
2. A compatibility shim for old ``scikit-learn`` versions to cross-validate a pipeline that takes a pandas ``DataFrame`` as input. This is only needed for ``scikit-learn<0.16.0`` (see `#11 <https://github.com/paulgb/sklearn-pandas/issues/11>`__ for details). It is deprecated and will likely be dropped in ``sklearn-pandas==2.0``.
14-
3. A couple of special transformers that work well with pandas inputs: ``CategoricalImputer`` and ``FunctionTransformer``.
9+
In particular, it provides a way to map ``DataFrame`` columns to transformations, which are later recombined into features.
1510

1611
Installation
1712
------------
@@ -20,6 +15,7 @@ You can install ``sklearn-pandas`` with ``pip``::
2015

2116
# pip install sklearn-pandas
2217

18+
2319
Tests
2420
-----
2521

@@ -36,11 +32,11 @@ Import
3632
Import what you need from the ``sklearn_pandas`` package. The choices are:
3733

3834
* ``DataFrameMapper``, a class for mapping pandas data frame columns to different sklearn transformations
39-
* ``cross_val_score``, similar to ``sklearn.cross_validation.cross_val_score`` but working on pandas DataFrames
35+
4036

4137
For this demonstration, we will import both::
4238

43-
>>> from sklearn_pandas import DataFrameMapper, cross_val_score
39+
>>> from sklearn_pandas import DataFrameMapper
4440

4541
For these examples, we'll also use pandas, numpy, and sklearn::
4642

@@ -136,8 +132,18 @@ of the feature definition::
136132
>>> mapper_alias.transformed_names_
137133
['children_scaled']
138134

135+
Alternatively, you can specify a prefix and/or a suffix to add to the column name. For example::
136+
137+
138+
>>> mapper_alias = DataFrameMapper([
139+
... (['children'], sklearn.preprocessing.StandardScaler(), {'prefix': 'standard_scaled_'}),
140+
... (['children'], sklearn.preprocessing.StandardScaler(), {'suffix': '_raw'})
141+
... ])
142+
>>> _ = mapper_alias.fit_transform(data.copy())
143+
>>> mapper_alias.transformed_names_
144+
['standard_scaled_children', 'children_raw']
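The naming rule shown in the example above can be sketched in plain Python. This is only an illustration of the observed output: ``transformed_name`` is a hypothetical helper and is not part of sklearn-pandas, and the real mapper may treat multi-column or multi-output transformers differently.

```python
def transformed_name(column, options=None):
    """Build an output column name from an optional prefix and suffix.

    Hypothetical helper for illustration only; not the library's implementation.
    """
    options = options or {}
    prefix = options.get("prefix", "")
    suffix = options.get("suffix", "")
    return f"{prefix}{column}{suffix}"


# Mirrors the two feature definitions used in the README example.
names = [
    transformed_name("children", {"prefix": "standard_scaled_"}),
    transformed_name("children", {"suffix": "_raw"}),
]
print(names)
```

With no options the column name is returned unchanged, which matches the default behaviour one would expect when neither key is given.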
139145

140-
Passing Series/DataFrames to the transformers
146+
Passing Series/DataFrames to the transformers
141147
*********************************************
142148

143149
By default the transformers are passed a numpy array of the selected columns
@@ -231,8 +237,9 @@ Multiple transformers for the same column
231237
Multiple transformers can be applied to the same column specifying them
232238
in a list::
233239

240+
>>> from sklearn.impute import SimpleImputer
234241
>>> mapper3 = DataFrameMapper([
235-
... (['age'], [sklearn.preprocessing.Imputer(),
242+
... (['age'], [SimpleImputer(),
236243
... sklearn.preprocessing.StandardScaler()])])
237244
>>> data_3 = pd.DataFrame({'age': [1, np.nan, 3]})
238245
>>> mapper3.fit_transform(data_3)
@@ -302,7 +309,7 @@ into generator, and then use returned definition as ``features`` argument for ``
302309
... classes=[sklearn.preprocessing.LabelEncoder]
303310
... )
304311
>>> feature_def
305-
[('col1', [LabelEncoder()]), ('col2', [LabelEncoder()]), ('col3', [LabelEncoder()])]
312+
[('col1', [LabelEncoder()], {}), ('col2', [LabelEncoder()], {}), ('col3', [LabelEncoder()], {})]
306313
>>> mapper5 = DataFrameMapper(feature_def)
307314
>>> data5 = pd.DataFrame({
308315
... 'col1': ['yes', 'no', 'yes'],
@@ -318,23 +325,42 @@ If it is required to override some of transformer parameters, then a dict with '
318325
transformer parameters should be provided. For example, consider a dataset with missing values.
319326
Then the following code could be used to override default imputing strategy:
320327

328+
>>> from sklearn.impute import SimpleImputer
329+
>>> import numpy as np
321330
>>> feature_def = gen_features(
322331
... columns=[['col1'], ['col2'], ['col3']],
323-
... classes=[{'class': sklearn.preprocessing.Imputer, 'strategy': 'most_frequent'}]
332+
... classes=[{'class': SimpleImputer, 'strategy':'most_frequent'}]
324333
... )
325334
>>> mapper6 = DataFrameMapper(feature_def)
326335
>>> data6 = pd.DataFrame({
327-
... 'col1': [None, 1, 1, 2, 3],
328-
... 'col2': [True, False, None, None, True],
329-
... 'col3': [0, 0, 0, None, None]
336+
... 'col1': [np.nan, 1, 1, 2, 3],
337+
... 'col2': [True, False, np.nan, np.nan, True],
338+
... 'col3': [0, 0, 0, np.nan, np.nan]
330339
... })
331340
>>> mapper6.fit_transform(data6)
332-
array([[1., 1., 0.],
333-
[1., 0., 0.],
334-
[1., 1., 0.],
335-
[2., 1., 0.],
336-
[3., 1., 0.]])
341+
array([[1.0, True, 0.0],
342+
[1.0, False, 0.0],
343+
[1.0, True, 0.0],
344+
[2.0, True, 0.0],
345+
[3.0, True, 0.0]], dtype=object)
337346

347+
You can also specify a global prefix or suffix for the generated transformed column names using the ``prefix`` and ``suffix``
348+
parameters::
349+
350+
>>> feature_def = gen_features(
351+
... columns=['col1', 'col2', 'col3'],
352+
... classes=[sklearn.preprocessing.LabelEncoder],
353+
... prefix="lblencoder_"
354+
... )
355+
>>> mapper5 = DataFrameMapper(feature_def)
356+
>>> data5 = pd.DataFrame({
357+
... 'col1': ['yes', 'no', 'yes'],
358+
... 'col2': [True, False, False],
359+
... 'col3': ['one', 'two', 'three']
360+
... })
361+
>>> _ = mapper5.fit_transform(data5)
362+
>>> mapper5.transformed_names_
363+
['lblencoder_col1', 'lblencoder_col2', 'lblencoder_col3']
338364

339365
Feature selection and other supervised transformations
340366
******************************************************
@@ -356,7 +382,8 @@ Feature selection and other supervised transformations
356382
Working with sparse features
357383
****************************
358384

359-
A ``DataFrameMapper`` will return a dense feature array by default. Setting ``sparse=True`` in the mapper will return a sparse array whenever any of the extracted features is sparse. Example:
385+
A ``DataFrameMapper`` will return a dense feature array by default. Setting ``sparse=True`` in the mapper will return
386+
a sparse array whenever any of the extracted features is sparse. Example:
360387

361388
>>> mapper5 = DataFrameMapper([
362389
... ('pet', CountVectorizer()),
@@ -366,62 +393,44 @@ A ``DataFrameMapper`` will return a dense feature array by default. Setting ``sp
366393

367394
The stacking of the sparse features is done without ever densifying them.
368395

369-
Cross-Validation
370-
****************
371-
372-
Now that we can combine features from pandas DataFrames, we may want to use cross-validation to see whether our model works. ``scikit-learn<0.16.0`` provided features for cross-validation, but they expect numpy data structures and won't work with ``DataFrameMapper``.
373-
374-
To get around this, sklearn-pandas provides a wrapper on sklearn's ``cross_val_score`` function which passes a pandas DataFrame to the estimator rather than a numpy array::
375-
376-
>>> pipe = sklearn.pipeline.Pipeline([
377-
... ('featurize', mapper),
378-
... ('lm', sklearn.linear_model.LinearRegression())])
379-
>>> np.round(cross_val_score(pipe, X=data.copy(), y=data.salary, scoring='r2'), 2)
380-
array([ -1.09, -5.3 , -15.38])
381-
382-
Sklearn-pandas' ``cross_val_score`` function provides exactly the same interface as sklearn's function of the same name.
383-
384-
``CategoricalImputer``
385-
**********************
386-
387-
Since the ``scikit-learn`` ``Imputer`` transformer currently only works with
388-
numbers, ``sklearn-pandas`` provides an equivalent helper transformer that
389-
works with strings, substituting null values with the most frequent value in
390-
that column. Alternatively, you can specify a fixed value to use.
391-
392-
Example: imputing with the mode:
393-
394-
>>> from sklearn_pandas import CategoricalImputer
395-
>>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
396-
>>> imputer = CategoricalImputer()
397-
>>> imputer.fit_transform(data)
398-
array(['a', 'b', 'b', 'b'], dtype=object)
399396

400-
Example: imputing with a fixed value:
401-
402-
>>> from sklearn_pandas import CategoricalImputer
403-
>>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
404-
>>> imputer = CategoricalImputer(strategy='constant', fill_value='a')
405-
>>> imputer.fit_transform(data)
406-
array(['a', 'b', 'b', 'a'], dtype=object)
407-
408-
409-
``FunctionTransformer``
410-
***********************
397+
Using ``NumericalTransformer``
398+
******************************
411399

412-
Often one wants to apply simple transformations to data such as ``np.log``. ``FunctionTransformer`` is a simple wrapper that takes any function and applies vectorization so that it can be used as a transformer.
400+
While you can use ``FunctionTransformer`` to generate arbitrary transformers, it can present serialization issues
401+
when pickling. Use ``NumericalTransformer`` instead, which takes the function name as a string parameter and hence
402+
can be easily serialized.
413403

414-
Example:
404+
>>> from sklearn_pandas import NumericalTransformer
405+
>>> mapper5 = DataFrameMapper([
406+
... ('children', NumericalTransformer('log')),
407+
... ])
408+
>>> mapper5.fit_transform(data)
409+
array([[1.38629436],
410+
[1.79175947],
411+
[1.09861229],
412+
[1.09861229],
413+
[0.69314718],
414+
[1.09861229],
415+
[1.60943791],
416+
[1.38629436]])
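The serialization concern described for this section can be illustrated with the standard library alone. The sketch below is an assumption-laden stand-in, not the library's code: ``NamedFunctionTransformer`` and the ``_FUNCTIONS`` registry are hypothetical names, and it works on plain lists rather than DataFrames. It shows one common failure mode (a lambda cannot be pickled) and why storing only a function name avoids it.

```python
import math
import pickle

# Hypothetical registry mapping names to functions (illustration only).
_FUNCTIONS = {"log": math.log, "log1p": math.log1p}


class NamedFunctionTransformer:
    """Stores only the function *name*, so the instance pickles cleanly."""

    def __init__(self, func_name):
        if func_name not in _FUNCTIONS:
            raise ValueError(f"unknown function name: {func_name!r}")
        self.func_name = func_name

    def fit(self, X, y=None):
        # Stateless: nothing to learn from the data.
        return self

    def transform(self, X):
        func = _FUNCTIONS[self.func_name]
        return [func(x) for x in X]


# A lambda has no importable name, so the pickle module refuses it.
try:
    pickle.dumps(lambda x: math.log(x))
    lambda_pickles = True
except (pickle.PicklingError, AttributeError):
    lambda_pickles = False

# The name-based transformer survives a pickle round trip.
restored = pickle.loads(pickle.dumps(NamedFunctionTransformer("log")))
result = restored.transform([1.0, 10.0])
print(lambda_pickles, restored.func_name)
```

The function is looked up at transform time, so the pickled payload contains only a short string.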
415417

416-
>>> from sklearn_pandas import FunctionTransformer
417-
>>> array = np.array([10, 100])
418-
>>> transformer = FunctionTransformer(np.log10)
419418

420-
>>> transformer.fit_transform(array)
421-
array([1., 2.])
422419

423420
Changelog
424421
---------
422+
2.0.0 (2020-08-01)
423+
******************
424+
* Deprecated support for Python < 3.6.
425+
* Deprecated support for old versions of scikit-learn, pandas and numpy. Please check setup.py for minimum requirement.
426+
* Removed CategoricalImputer, cross_val_score and GridSearchCV. All of this functionality now exists as part of
427+
scikit-learn. Please use SimpleImputer instead of CategoricalImputer. Also,
428+
cross-validation in sklearn now supports DataFrames, so the cross-validation wrapper provided here is no longer
429+
needed.
430+
* Added ``NumericalTransformer`` for common numerical transformations. Currently it implements log and log1p
431+
transformation.
432+
* Added prefix and suffix options. See examples above. These are usually helpful when using gen_features.
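For the ``CategoricalImputer`` removal noted in this changelog entry, a migration might look like the sketch below. It reuses the string data from the removed README examples and assumes a scikit-learn version in which ``SimpleImputer`` accepts object arrays with the ``most_frequent`` and ``constant`` strategies. Unlike the removed examples, ``SimpleImputer`` expects 2-D input, so the data is shaped as a single column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column of strings with a missing value (2-D, as SimpleImputer requires).
data = np.array([["a"], ["b"], ["b"], [np.nan]], dtype=object)

# Replacement for CategoricalImputer(): fill with the most frequent value.
mode_imputer = SimpleImputer(strategy="most_frequent")
filled_mode = mode_imputer.fit_transform(data)

# Replacement for CategoricalImputer(strategy='constant', fill_value='a').
const_imputer = SimpleImputer(strategy="constant", fill_value="a")
filled_const = const_imputer.fit_transform(data)

print(filled_mode.ravel().tolist())
print(filled_const.ravel().tolist())
```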
433+
425434

426435
1.8.0 (2018-12-01)
427436
******************

setup.py

Lines changed: 9 additions & 8 deletions
@@ -29,19 +29,20 @@ def run(self):
2929
raise SystemExit(errno)
3030

3131

32-
setup(name='sklearn-pandas',
32+
setup(name='sklearn-pandas2',
3333
version=__version__,
3434
description='Pandas integration with sklearn',
35-
maintainer='Israel Saeta Pérez',
36-
maintainer_email='israel.saeta@dukebody.com',
37-
url='https://github.com/paulgb/sklearn-pandas',
35+
maintainer='Ritesh Agrawal',
36+
maintainer_email='ragrawal@gmail.com',
37+
url='https://github.com/scikit-learn-contrib/sklearn-pandas',
3838
packages=['sklearn_pandas'],
3939
keywords=['scikit', 'sklearn', 'pandas'],
4040
install_requires=[
41-
'scikit-learn>=0.15.0',
42-
'scipy>=0.14',
43-
'pandas>=0.11.0',
44-
'numpy>=1.6.1'],
41+
'scikit-learn>=0.23.0',
42+
'scipy>=1.4.1',
43+
'pandas>=1.0.5',
44+
'numpy>=1.18.1'
45+
],
4546
tests_require=['pytest', 'mock'],
4647
cmdclass={'test': PyTest},
4748
)
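The raised minimums in ``install_requires`` can be compared against an existing environment before upgrading. The following is a rough sketch using only the standard library. ``MINIMUMS`` copies the values from the diff, while ``parse`` and ``check`` are hypothetical helpers with a deliberately naive version comparison (leading numeric components only), so pre-release tags are not handled properly.

```python
from importlib import metadata

# Minimum versions taken from the updated install_requires list.
MINIMUMS = {
    "scikit-learn": "0.23.0",
    "scipy": "1.4.1",
    "pandas": "1.0.5",
    "numpy": "1.18.1",
}


def parse(version):
    """Naively turn '1.18.1' into (1, 18, 1); stops at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check(minimums):
    """Return a dict mapping each package to 'ok', 'too old' or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if parse(installed) >= parse(minimum) else "too old"
    return report


report = check(MINIMUMS)
for package, status in report.items():
    print(f"{package}: {status}")
```

For anything beyond a quick check, a dedicated version parser is a safer choice than this tuple comparison.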

sklearn_pandas/__init__.py

Lines changed: 2 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,5 @@
1-
__version__ = '1.8.0'
1+
__version__ = '2.0.0'
22

33
from .dataframe_mapper import DataFrameMapper # NOQA
4-
from .cross_validation import cross_val_score, GridSearchCV, RandomizedSearchCV # NOQA
5-
from .transformers import CategoricalImputer, FunctionTransformer # NOQA
64
from .features_generator import gen_features # NOQA
5+
from .transformers import NumericalTransformer # NOQA
