@@ -26,7 +26,8 @@ scikit-learn :class:`~sklearn.pipeline.Pipeline`. These DAGs may be created from
   ...         ("impute", SimpleImputer()),
   ...         ("pca", PCA()),
   ...         ("lr", LogisticRegression())
-  ...     ]
+  ...     ],
+  ...     infer_dataframe=True,
   ... )

You may view a diagram of the DAG with the :meth:`~skdag.dag.DAG.show` method. In a
@@ -44,18 +45,25 @@ ASCII text:

.. image:: _static/img/dag1.png

+ Note that we also provided an extra option, ``infer_dataframe``. This is entirely
+ optional, but if set, the DAG will ensure that dataframe inputs have their column and
+ index information preserved (or inferred), and that the output of the pipeline will
+ also be a dataframe. This is useful if you wish to filter down the inputs for one
+ particular step to only include certain columns; something we shall see in action later.
+
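+ As a rough sketch of what this means in practice (this example is not part of the
+ original docs: it assumes the diabetes data loaded as a dataframe, a hypothetical
+ binarised target for the logistic regression, and the usual fit/predict interface):
+
+ .. code-block:: python
+
+    >>> from sklearn.datasets import load_diabetes
+    >>> X, y = load_diabetes(return_X_y=True, as_frame=True)
+    >>> y = (y > y.median()).astype(int)  # hypothetical binary target
+    >>> dag = dag.fit(X, y)
+    >>> preds = dag.predict(X.head())
+    >>> # per the note above, preds comes back as a dataframe (index preserved)
+    >>> # rather than a bare numpy array
+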
For more complex DAGs, it is recommended to use a :class:`skdag.dag.DAGBuilder`,
which allows you to define the graph by specifying the dependencies of each new
estimator:

.. code-block:: python

   >>> from skdag import DAGBuilder
+  >>> from sklearn.compose import make_column_selector
   >>> dag = (
-  ...     DAGBuilder()
+  ...     DAGBuilder(infer_dataframe=True)
   ...     .add_step("impute", SimpleImputer())
-  ...     .add_step("vitals", "passthrough", deps={"impute": slice(0, 4)})
-  ...     .add_step("blood", PCA(n_components=2, random_state=0), deps={"impute": slice(4, 10)})
+  ...     .add_step("vitals", "passthrough", deps={"impute": ["age", "sex", "bmi", "bp"]})
+  ...     .add_step("blood", PCA(n_components=2, random_state=0), deps={"impute": make_column_selector("s[0-9]+")})
   ...     .add_step("lr", LogisticRegression(random_state=0), deps=["blood", "vitals"])
   ...     .make_dag()
   ... )
@@ -73,7 +81,16 @@ the remaining columns have dimensionality reduction applied first before being
passed to the same regressor. Note that we can define our graph edges in two
different ways: as a dict (if we need to select only certain columns from the source
node) or as a simple list (if we want to simply grab all columns from all input
- nodes).
+ nodes). Columns may be specified as any kind of iterable (list, slice, etc.) or as a
+ column selector function that conforms to :func:`sklearn.compose.make_column_selector`.
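+
+ For instance, a selector built from a regex can be previewed on its own (a small
+ illustration that is not part of the original example; the dataframe here is made up):
+
+ .. code-block:: python
+
+    >>> import pandas as pd
+    >>> from sklearn.compose import make_column_selector
+    >>> df = pd.DataFrame({"age": [33.0], "s1": [0.1], "s2": [0.2]})
+    >>> make_column_selector("s[0-9]+")(df)  # returns the matching column names
+    ['s1', 's2']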
+
+ If you wish to specify string column names for dependencies, ensure you provide the
+ ``infer_dataframe=True`` option when you create a DAG. This will ensure that all
+ estimator outputs are coerced into dataframes. Where possible, column names will be
+ inferred; otherwise the columns will simply be named after the estimator step, with an
+ index number appended. If you do not specify ``infer_dataframe=True``, the DAG will
+ leave the outputs unmodified, which in most cases will mean numpy arrays that only
+ support numeric column indices.
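+
+ For comparison, the same graph could be built without ``infer_dataframe`` by selecting
+ columns positionally rather than by name (a sketch based on the earlier form of this
+ example; the positions are assumed to match the feature ordering used above):
+
+ .. code-block:: python
+
+    >>> dag = (
+    ...     DAGBuilder()
+    ...     .add_step("impute", SimpleImputer())
+    ...     .add_step("vitals", "passthrough", deps={"impute": slice(0, 4)})
+    ...     .add_step("blood", PCA(n_components=2, random_state=0), deps={"impute": slice(4, 10)})
+    ...     .add_step("lr", LogisticRegression(random_state=0), deps=["blood", "vitals"])
+    ...     .make_dag()
+    ... )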

The DAG may now be used as an estimator in its own right:

@@ -189,7 +206,7 @@ as a dictionary of step name to column indices instead:
   >>> from sklearn.ensemble import RandomForestClassifier
   >>> from sklearn.svm import SVC
   >>> clf_stack = (
-  ...     DAGBuilder()
+  ...     DAGBuilder(infer_dataframe=True)
   ...     .add_step("pass", "passthrough")
   ...     .add_step("rf", RandomForestClassifier(), deps=["pass"])
   ...     .add_step("svr", SVC(), deps=["pass"])