Commit fb01a23

ragrawal and paro1234 authored
minor changes to paro1234-feature/drop (#217)
* explicit drop feature + tests
* updated as per rules
* updated as per PR comments
* updated version number and using empty list as default

Co-authored-by: Parul Singh <[email protected]>
Co-authored-by: ragrawal <[email protected]>
1 parent 03bd0de commit fb01a23

4 files changed, 96 additions and 7 deletions


README.rst

Lines changed: 43 additions & 1 deletion
@@ -210,6 +210,32 @@ attribute.
 
 Note this does not work together with the ``default=True`` or ``sparse=True`` arguments to the mapper.
 
+Dropping columns explicitly
+*******************************
+
+Sometimes it is necessary to drop a specific column or a list of columns.
+For this purpose, the ``drop_cols`` argument of ``DataFrameMapper`` can be used.
+The default value is ``None``.
+
+>>> mapper_df = DataFrameMapper([
+...     ('pet', sklearn.preprocessing.LabelBinarizer()),
+...     (['children'], sklearn.preprocessing.StandardScaler())
+... ], drop_cols=['salary'])
+
+Now running ``fit_transform`` will run transformations on 'pet' and 'children' and drop the 'salary' column:
+
+>>> np.round(mapper_df.fit_transform(data.copy()), 1)
+array([[ 1. ,  0. ,  0. ,  0.2],
+       [ 0. ,  1. ,  0. ,  1.9],
+       [ 0. ,  1. ,  0. , -0.6],
+       [ 0. ,  0. ,  1. , -0.6],
+       [ 1. ,  0. ,  0. , -1.5],
+       [ 0. ,  1. ,  0. , -0.6],
+       [ 1. ,  0. ,  0. ,  1. ],
+       [ 0. ,  0. ,  1. ,  0.2]])
+
+Transformations may require multiple input columns. In these
+
 Transform Multiple Columns
 **************************
 
@@ -395,7 +421,7 @@ The stacking of the sparse features is done without ever densifying them.
 
 
 Using ``NumericalTransformer``
-****************************
+***********************************
 
 While you can use ``FunctionTransformer`` to generate arbitrary transformers, it can present serialization issues
 when pickling. Use ``NumericalTransformer`` instead, which takes the function name as a string parameter and hence
@@ -419,8 +445,15 @@ can be easily serialized.
 
 Changelog
 ---------
+2.0.1 (2020-09-07)
+******************
+
+* Added an option to explicitly drop columns.
+
+
 2.0.0 (2020-08-01)
 ******************
+
 * Deprecated support for Python < 3.6.
 * Deprecated support for old versions of scikit-learn, pandas and numpy. Please check setup.py for minimum requirement.
 * Removed CategoricalImputer, cross_val_score and GridSearchCV. All of this functionality now exists as part of
@@ -430,32 +463,39 @@ Changelog
 * Added ``NumericalTransformer`` for common numerical transformations. Currently it implements log and log1p
   transformation.
 * Added prefix and suffix options. See examples above. These are usually helpful when using gen_features.
+* Added ``drop_cols`` argument to ``DataFrameMapper``. This can be used to explicitly drop columns.
 
 
 1.8.0 (2018-12-01)
 ******************
+
 * Add ``FunctionTransformer`` class (#117).
 * Fix column names derivation for dataframes with multi-index or non-string
   columns (#166).
 * Change behaviour of DataFrameMapper's fit_transform method to invoke each underlying transformers'
   native fit_transform if implemented. (#150)
 
+
 1.7.0 (2018-08-15)
 ******************
+
 * Fix issues with unicode names in ``get_names`` (#160).
 * Update to build using ``numpy==1.14`` and ``python==3.6`` (#154).
 * Add ``strategy`` and ``fill_value`` parameters to ``CategoricalImputer`` to allow imputing
   with values other than the mode (#144), (#161).
 * Preserve input data types when no transform is supplied (#138).
 
+
 1.6.0 (2017-10-28)
 ******************
+
 * Add column name to exception during fit/transform (#110).
 * Add ``gen_feature`` helper function to help generating the same transformation for multiple columns (#126).
 
 
 1.5.0 (2017-06-24)
 ******************
+
 * Allow inputting a dataframe/series per group of columns.
 * Get feature names also from ``estimator.get_feature_names()`` if present.
 * Attempt to derive feature names from individual transformers when applying a
@@ -466,6 +506,7 @@ Changelog
 
 1.4.0 (2017-05-13)
 ******************
+
 * Allow specifying a custom name (alias) for transformed columns (#83).
 * Capture output columns generated names in ``transformed_names_`` attribute (#78).
 * Add ``CategoricalImputer`` that replaces null-like values with the mode
@@ -543,3 +584,4 @@ Other contributors:
 * Timothy Sweetser (@hacktuarial)
 * Vitaley Zaretskey (@vzaretsk)
 * Zac Stewart (@zacstewart)
+* Parul Singh (@paro1234)

sklearn_pandas/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-__version__ = '2.0.0'
+__version__ = '2.0.1'
 
 from .dataframe_mapper import DataFrameMapper  # NOQA
 from .features_generator import gen_features  # NOQA

sklearn_pandas/dataframe_mapper.py

Lines changed: 8 additions & 4 deletions
@@ -63,15 +63,15 @@ class DataFrameMapper(BaseEstimator, TransformerMixin):
     """
 
     def __init__(self, features, default=False, sparse=False, df_out=False,
-                 input_df=False):
+                 input_df=False, drop_cols=None):
         """
         Params:
 
         features    a list of tuples with features definitions.
                     The first element is the pandas column selector. This can
                     be a string (for one column) or a list of strings.
                     The second element is an object that supports
-                    sklearn's transform interface, or a list of such objects.
+                    sklearn's transform interface, or a list of such objects
                     The third element is optional and, if present, must be
                     a dictionary with the options to apply to the
                     transformation. Example: {'alias': 'day_of_week'}
@@ -96,14 +96,16 @@ def __init__(self, features, default=False, sparse=False, df_out=False,
                     as a pandas DataFrame or Series. Otherwise pass them as a
                     numpy array. Defaults to ``False``.
 
+        drop_cols   List of columns to be dropped. Defaults to None.
+
         """
         self.features = features
-        self.built_features = None
         self.default = default
         self.built_default = None
         self.sparse = sparse
         self.df_out = df_out
         self.input_df = input_df
+        self.drop_columns = drop_cols or []
         self.transformed_names_ = []
 
         if (df_out and (sparse or default)):
@@ -144,7 +146,8 @@ def _unselected_columns(self, X):
         """
         X_columns = list(X.columns)
         return [column for column in X_columns if
-                column not in self._selected_columns]
+                column not in self._selected_columns
+                and column not in self.drop_columns]
 
     def __setstate__(self, state):
         # compatibility for older versions of sklearn-pandas
@@ -153,6 +156,7 @@ def __setstate__(self, state):
         self.default = state.get('default', False)
         self.df_out = state.get('df_out', False)
         self.input_df = state.get('input_df', False)
+        self.drop_columns = state.get('drop_columns', [])
         self.built_features = state.get('built_features', self.features)
         self.built_default = state.get('built_default', self.default)
         self.transformed_names_ = state.get('transformed_names_', [])
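
Note on the ``dataframe_mapper.py`` change: the sketch below models only the column bookkeeping, not the real class. ``MiniMapper`` and its method names are hypothetical stand-ins. It assumes the dropped-column list is normalised to ``[]`` in the constructor and that the state-restore step reads the same attribute name with an empty-list fallback, so the membership test never runs against ``None``.

```python
class MiniMapper:
    """Hypothetical stand-in for DataFrameMapper; models column bookkeeping only."""

    def __init__(self, selected, drop_cols=None):
        self.selected = list(selected)
        # Normalise None to an empty list so "in" checks are always valid.
        self.drop_columns = list(drop_cols) if drop_cols else []

    def unselected_columns(self, columns):
        # Columns that are neither selected for transformation nor dropped.
        return [c for c in columns
                if c not in self.selected and c not in self.drop_columns]

    def __setstate__(self, state):
        self.selected = state.get('selected', [])
        # State saved before the attribute existed has no such key;
        # fall back to an empty list rather than None.
        self.drop_columns = state.get('drop_columns', [])


mapper = MiniMapper(selected=['a'], drop_cols=['c'])
remaining = mapper.unselected_columns(['a', 'b', 'c'])

# Simulate restoring an older state dictionary that lacks 'drop_columns'.
legacy = MiniMapper.__new__(MiniMapper)
legacy.__setstate__({'selected': ['a']})
legacy_remaining = legacy.unselected_columns(['a', 'b', 'c'])

print(remaining, legacy_remaining)
```

With these assumptions, a restored object behaves like one built with no ``drop_cols`` at all, which is the backward-compatible outcome.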

tests/test_dataframe_mapper.py

Lines changed: 44 additions & 1 deletion
@@ -649,7 +649,7 @@ def test_selected_columns():
 
 def test_unselected_columns():
     """
-    selected_columns returns a list of the columns not appearing in the
+    unselected_columns returns a list of the columns not appearing in the
     features of the mapper but present in the given dataframe.
     """
     df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
@@ -660,6 +660,49 @@ def test_unselected_columns():
     assert 'c' in mapper._unselected_columns(df)
 
 
+def test_drop_and_default_false():
+    """
+    If default=False, both the columns that are not explicitly selected
+    and the columns listed in drop_cols are discarded.
+    """
+    df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
+    mapper = DataFrameMapper([
+        ('a', None)
+    ], drop_cols=['c'], default=False)
+    transformed = mapper.fit_transform(df)
+    assert transformed.shape == (1, 1)
+    assert mapper.transformed_names_ == ['a']
+
+
+def test_drop_and_default_none():
+    """
+    If default=None, columns listed in drop_cols are discarded and the
+    remaining unselected columns are passed through untransformed.
+    """
+    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})
+    mapper = DataFrameMapper([
+        ('a', None)
+    ], drop_cols=['c'], default=None)
+
+    transformed = mapper.fit_transform(df)
+    assert transformed.shape == (3, 2)
+    assert mapper.transformed_names_ == ['a', 'b']
+
+
+def test_conflicting_drop():
+    """
+    A column that is both selected and listed in drop_cols is still kept.
+    """
+    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})
+    mapper = DataFrameMapper([
+        ('a', None)
+    ], drop_cols=['a'], default=False)
+
+    transformed = mapper.fit_transform(df)
+    assert transformed.shape == (3, 1)
+    assert mapper.transformed_names_ == ['a']
+
+
 def test_default_false():
     """
     If default=False, non explicitly selected columns are discarded.
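
Note on the new tests: together they describe how ``drop_cols`` interacts with ``default``. A rough model of the expected output columns might look like the sketch below. ``output_columns`` is a hypothetical helper, not part of sklearn-pandas, and it works on plain lists of column names rather than dataframes.

```python
def output_columns(columns, selected, drop_cols=None, default=False):
    """Hypothetical model of which input columns reach the output."""
    drop = drop_cols or []
    # Explicitly selected columns are always kept, even if also listed in drop.
    kept = list(selected)
    # default=False discards every unselected column; any other value lets
    # the unselected columns through unless they are listed in drop.
    if default is not False:
        kept += [c for c in columns if c not in selected and c not in drop]
    return kept


case_false = output_columns(['a', 'b', 'c'], ['a'], drop_cols=['c'], default=False)
case_none = output_columns(['a', 'b', 'c'], ['a'], drop_cols=['c'], default=None)
case_conflict = output_columns(['a', 'b'], ['a'], drop_cols=['a'], default=False)
print(case_false, case_none, case_conflict)
```

In this model ``drop_cols`` only changes the result when ``default`` is not ``False``, since unselected columns are discarded anyway in that mode.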
