 This module provides a bridge between `Scikit-Learn <http://scikit-learn.org/stable>`__'s machine learning methods and `pandas <https://pandas.pydata.org>`__-style Data Frames.
-
-In particular, it provides:
-
-1. A way to map ``DataFrame`` columns to transformations, which are later recombined into features.
-2. A compatibility shim for old ``scikit-learn`` versions to cross-validate a pipeline that takes a pandas ``DataFrame`` as input. This is only needed for ``scikit-learn<0.16.0`` (see `#11 <https://github.com/paulgb/sklearn-pandas/issues/11>`__ for details). It is deprecated and will likely be dropped in ``sklearn-pandas==2.0``.
-3. A couple of special transformers that work well with pandas inputs: ``CategoricalImputer`` and ``FunctionTransformer``.
+In particular, it provides a way to map ``DataFrame`` columns to transformations, which are later recombined into features.
 
 Installation
 ------------
@@ -20,6 +15,7 @@ You can install ``sklearn-pandas`` with ``pip``::
 
     # pip install sklearn-pandas
 
+
 Tests
 -----
 
@@ -36,11 +32,11 @@ Import
 Import what you need from the ``sklearn_pandas`` package. The choices are:
 
 * ``DataFrameMapper``, a class for mapping pandas data frame columns to different sklearn transformations
-* ``cross_val_score``, similar to ``sklearn.cross_validation.cross_val_score`` but working on pandas DataFrames
+
 
 For this demonstration, we will import both::
 
-    >>> from sklearn_pandas import DataFrameMapper, cross_val_score
+    >>> from sklearn_pandas import DataFrameMapper
 
 For these examples, we'll also use pandas, numpy, and sklearn::
 
@@ -136,8 +132,18 @@ of the feature definition::
     >>> mapper_alias.transformed_names_
     ['children_scaled']
 
+Alternatively, you can also specify a prefix and/or suffix to add to the column name. For example::
@@ -356,7 +382,8 @@ Feature selection and other supervised transformations
 Working with sparse features
 ****************************
 
-A ``DataFrameMapper`` will return a dense feature array by default. Setting ``sparse=True`` in the mapper will return a sparse array whenever any of the extracted features is sparse. Example:
+A ``DataFrameMapper`` will return a dense feature array by default. Setting ``sparse=True`` in the mapper will return
+a sparse array whenever any of the extracted features is sparse. Example:
 
     >>> mapper5 = DataFrameMapper([
     ...     ('pet', CountVectorizer()),
@@ -366,62 +393,44 @@ A ``DataFrameMapper`` will return a dense feature array by default. Setting ``sp
 
 The stacking of the sparse features is done without ever densifying them.
 
-Cross-Validation
-****************
-
-Now that we can combine features from pandas DataFrames, we may want to use cross-validation to see whether our model works. ``scikit-learn<0.16.0`` provided features for cross-validation, but they expect numpy data structures and won't work with ``DataFrameMapper``.
-
-To get around this, sklearn-pandas provides a wrapper on sklearn's ``cross_val_score`` function which passes a pandas DataFrame to the estimator rather than a numpy array::
Often one wants to apply simple transformations to data such as ``np.log``. ``FunctionTransformer`` is a simple wrapper that takes any function and applies vectorization so that it can be used as a transformer.
+While you can use ``FunctionTransformer`` to generate arbitrary transformers, it can present serialization issues
+when pickling. Use ``NumericalTransformer`` instead, which takes the function name as a string parameter and hence
+can be easily serialized.
 
-Example:
+    >>> from sklearn_pandas import NumericalTransformer
+    >>> mapper5 = DataFrameMapper([
+    ...     ('children', NumericalTransformer('log')),
+    ... ])
+    >>> mapper5.fit_transform(data)
+    array([[1.38629436],
+           [1.79175947],
+           [1.09861229],
+           [1.09861229],
+           [0.69314718],
+           [1.09861229],
+           [1.60943791],
+           [1.38629436]])
 
-    >>> from sklearn_pandas import FunctionTransformer
-    >>> array = np.array([10, 100])
-    >>> transformer = FunctionTransformer(np.log10)
 
-    >>> transformer.fit_transform(array)
-    array([1., 2.])
 
 Changelog
 ---------
+2.0.0 (2020-08-01)
+******************
+* Deprecated support for Python < 3.6.
+* Deprecated support for old versions of scikit-learn, pandas and numpy. Please check setup.py for the minimum requirements.
+* Removed CategoricalImputer, cross_val_score and GridSearchCV. All of this functionality now exists as part of
+  scikit-learn. Please use SimpleImputer instead of CategoricalImputer. Also,
+  cross-validation in sklearn now supports DataFrames, so the cross-validation wrapper provided
+  here is no longer needed.
+* Added ``NumericalTransformer`` for common numerical transformations. Currently it implements the log and log1p
+  transformations.
+* Added prefix and suffix options. See examples above. These are usually helpful when using gen_features.
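
The 2.0.0 changelog entry points users of the removed ``CategoricalImputer`` and ``cross_val_score`` wrapper to scikit-learn itself. A minimal migration sketch follows. It is not taken from this commit and rests on some assumptions: the toy frame and its column names (``pet``, ``children``, ``adopted``) are hypothetical, and scikit-learn's own ``ColumnTransformer`` stands in for ``DataFrameMapper`` so that the example depends only on scikit-learn and pandas.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names are illustrative only.
data = pd.DataFrame({
    "pet": ["cat", "dog", np.nan, "dog", "cat", "dog", "cat", "dog"],
    "children": [4.0, 6.0, 3.0, 3.0, 2.0, 3.0, 5.0, 4.0],
    "adopted": [0, 1, 0, 1, 0, 1, 0, 1],
})

# SimpleImputer(strategy="most_frequent") fills missing categorical values
# with the most common one, the role CategoricalImputer used to play.
imputer = SimpleImputer(strategy="most_frequent")
imputed_pet = imputer.fit_transform(data[["pet"]])

# Assumption: ColumnTransformer is used here in place of DataFrameMapper.
features = ColumnTransformer([
    ("pet", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")), ["pet"]),
    ("children", StandardScaler(), ["children"]),
])
pipeline = make_pipeline(features, LogisticRegression())

# scikit-learn's cross_val_score accepts the DataFrame directly,
# so no wrapper is needed.
scores = cross_val_score(pipeline, data[["pet", "children"]],
                         data["adopted"], cv=2)
print(scores)
```

The pipeline receives the ``DataFrame`` unchanged in every fold, which is what the removed wrapper was originally written to guarantee.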