Commit 9c306e0

Merge branch 'main' into develop
2 parents 4fccd2e + 55d8bc3 commit 9c306e0

1 file changed: +71, -6 lines changed

README.rst

Lines changed: 71 additions & 6 deletions
@@ -14,13 +14,78 @@
 skdag - A more flexible alternative to scikit-learn Pipelines
 =============================================================
 
-.. _scikit-learn: https://scikit-learn.org
+scikit-dag (``skdag``) is an open-sourced, MIT-licenced library that provides advanced
+workflow management to any machine learning operations that follow
+scikit-learn_ conventions. Installation is simple:
 
-**skdag** brings the flexibility of Directed Acyclic Graphs (DAGs) to scikit-learn_.
+.. code-block:: bash
 
-It enables the construction of complex workflows, including model stacking, to allow
-data to flow through multiple estimators, with one or more start/endpoints.
+    pip install skdag
 
-.. _documentation: https://skdag.readthedocs.io/
+It works by introducing Directed Acyclic Graphs as a drop-in replacement for traditional
+scikit-learn ``Pipeline``. This gives you a simple interface for a range of use cases
+including complex pre-processing, model stacking and benchmarking.
 
-Refer to the documentation_ for usage details.
+.. code-block:: python
+
+    from skdag import DAGBuilder
+
+    dag = (
+        DAGBuilder()
+        .add_step("impute", SimpleImputer())
+        .add_step("vitals", "passthrough", deps={"impute": slice(0, 4)})
+        .add_step(
+            "blood",
+            PCA(n_components=2, random_state=0),
+            deps={"impute": slice(4, 10)}
+        )
+        .add_step(
+            "rf",
+            RandomForestRegressor(max_depth=5, random_state=0),
+            deps=["blood", "vitals"]
+        )
+        .add_step("svm", SVR(C=0.7), deps=["blood", "vitals"])
+        .add_step(
+            "knn",
+            KNeighborsRegressor(n_neighbors=5),
+            deps=["blood", "vitals"]
+        )
+        .add_step("meta", LinearRegression(), deps=["rf", "svm", "knn"])
+        .make_dag(n_jobs=2, verbose=True)
+    )
+
+    dag.show(detailed=True)
+
+.. image:: doc/_static/img/cover.png
+
+The above DAG imputes missing values, runs PCA on the columns relating to blood test
+results and leaves the other columns as they are. Then they get passed to three
+different regressors before being passed on to a final meta-estimator. Because DAGs
+(unlike pipelines) allow predictors in the middle of a workflow, you can use them to
+implement model stacking. We also chose to run the DAG steps in parallel wherever
+possible.
+
+After building our DAG, we can treat it as any other estimator:
+
+.. code-block:: python
+
+    from sklearn import datasets
+
+    X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
+    X_train, X_test, y_train, y_test = train_test_split(
+        X, y, test_size=0.2, random_state=0
+    )
+
+    dag.fit(X_train, y_train)
+    dag.predict(X_test)
+
+Just like a pipeline, you can optimise it with a grid search, pickle it, etc.
+
+Note that this package does not deal with things like delayed dependencies and
+distributed architectures - consider an `established <https://airflow.apache.org/>`_
+`solution <https://dagster.io/>`_ for such use cases. ``skdag`` is just for building and
+executing local ensembles from estimators.
+
+`Read on <https://skdag.readthedocs.io/>`_ to learn more about ``skdag``...
+
+.. _scikit-learn: https://scikit-learn.org
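
The new README says the DAG steps run "in parallel wherever possible". As a rough illustration of what that can mean for the example graph, the sketch below groups the README's step names into batches whose members do not depend on each other. It uses only the standard-library ``graphlib`` module (Python 3.9+) and is an assumption-laden illustration, not skdag's actual scheduler; the ``parallel_batches`` helper is hypothetical.

```python
from graphlib import TopologicalSorter

# Step name -> names of the steps it depends on, copied from the README's DAG.
deps = {
    "impute": [],
    "vitals": ["impute"],
    "blood": ["impute"],
    "rf": ["blood", "vitals"],
    "svm": ["blood", "vitals"],
    "knn": ["blood", "vitals"],
    "meta": ["rf", "svm", "knn"],
}


def parallel_batches(graph):
    """Group steps into batches whose members do not depend on each other.

    Hypothetical helper for illustration only; skdag may schedule differently.
    """
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        # Every step whose dependencies have all been marked done.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(deps)
for index, batch in enumerate(batches):
    print(index, batch)
```

Under this grouping, the three regressors fall into the same batch, which is where a setting such as ``n_jobs=2`` would have independent work to spread across workers.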
