|
14 | 14 | skdag - A more flexible alternative to scikit-learn Pipelines |
15 | 15 | ============================================================= |
16 | 16 |
|
17 | | -.. _scikit-learn: https://scikit-learn.org |
| 17 | +scikit-dag (``skdag``) is an open-source, MIT-licenced library that provides advanced |
| 18 | +workflow management to any machine learning operations that follow |
| 19 | +scikit-learn_ conventions. Installation is simple: |
18 | 20 |
|
19 | | -**skdag** brings the flexibility of Directed Acyclic Graphs (DAGs) to scikit-learn_. |
| 21 | +.. code-block:: bash |
20 | 22 |
|
21 | | -It enables the construction of complex workflows, including model stacking, to allow |
22 | | -data to flow through multiple estimators, with one or more start/endpoints. |
| 23 | + pip install skdag |
23 | 24 |
|
24 | | -.. _documentation: https://skdag.readthedocs.io/ |
| 25 | +It works by introducing Directed Acyclic Graphs (DAGs) as a drop-in replacement for the traditional |
| 26 | +scikit-learn ``Pipeline``. This gives you a simple interface for a range of use cases |
| 27 | +including complex pre-processing, model stacking and benchmarking. |
25 | 28 |
|
26 | | -Refer to the documentation_ for usage details. |
| 29 | +.. code-block:: python |
| 30 | +
|
| 31 | + from skdag import DAGBuilder |
| 32 | +
|
| 33 | + dag = ( |
| 34 | + DAGBuilder() |
| 35 | + .add_step("impute", SimpleImputer()) |
| 36 | + .add_step("vitals", "passthrough", deps={"impute": slice(0, 4)}) |
| 37 | + .add_step( |
| 38 | + "blood", |
| 39 | + PCA(n_components=2, random_state=0), |
| 40 | + deps={"impute": slice(4, 10)} |
| 41 | + ) |
| 42 | + .add_step( |
| 43 | + "rf", |
| 44 | + RandomForestRegressor(max_depth=5, random_state=0), |
| 45 | + deps=["blood", "vitals"] |
| 46 | + ) |
| 47 | + .add_step("svm", SVR(C=0.7), deps=["blood", "vitals"]) |
| 48 | + .add_step( |
| 49 | + "knn", |
| 50 | + KNeighborsRegressor(n_neighbors=5), |
| 51 | + deps=["blood", "vitals"] |
| 52 | + ) |
| 53 | + .add_step("meta", LinearRegression(), deps=["rf", "svm", "knn"]) |
| 54 | + .make_dag(n_jobs=2, verbose=True) |
| 55 | + ) |
| 56 | +
|
| 57 | + dag.show(detailed=True) |
| 58 | +
|
| 59 | +.. image:: doc/_static/img/cover.png |
| 60 | + |
| 61 | +The above DAG imputes missing values, runs PCA on the columns relating to blood test |
| 62 | +results and leaves the other columns as they are. Both sets of features are then passed to |
| 63 | +three different regressors, whose predictions feed a final meta-estimator. Because DAGs |
| 64 | +(unlike pipelines) allow predictors in the middle of a workflow, you can use them to |
| 65 | +implement model stacking. We also chose to run the DAG steps in parallel wherever |
| 66 | +possible (``n_jobs=2``). |
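For comparison, the same workflow can be approximated with plain scikit-learn building blocks. This is only a sketch, not part of ``skdag``: it assumes ``ColumnTransformer`` for the column split and ``StackingRegressor`` for the stacking stage, and ``StackingRegressor`` trains its final estimator on cross-validated predictions, so results will not necessarily match the DAG above.

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Columns 0-3 pass through untouched; columns 4-9 are reduced with PCA.
features = ColumnTransformer(
    [
        ("vitals", "passthrough", slice(0, 4)),
        ("blood", PCA(n_components=2, random_state=0), slice(4, 10)),
    ]
)

# Three base regressors whose predictions feed a linear meta-estimator.
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(max_depth=5, random_state=0)),
        ("svm", SVR(C=0.7)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
    ],
    final_estimator=LinearRegression(),
)

model = Pipeline(
    [
        ("impute", SimpleImputer()),
        ("features", features),
        ("stack", stack),
    ]
)

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

The nesting of three different composite objects is what the single ``DAGBuilder`` chain replaces.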
| 67 | + |
| 68 | +After building our DAG, we can treat it like any other estimator: |
| 69 | + |
| 70 | +.. code-block:: python |
| 71 | +
|
| 72 | + from sklearn import datasets, model_selection |
| 73 | +
|
| 74 | + X, y = datasets.load_diabetes(return_X_y=True, as_frame=True) |
| 75 | + X_train, X_test, y_train, y_test = model_selection.train_test_split( |
| 76 | + X, y, test_size=0.2, random_state=0 |
| 77 | + ) |
| 78 | +
|
| 79 | + dag.fit(X_train, y_train) |
| 80 | + dag.predict(X_test) |
| 81 | +
|
| 82 | +Just like a pipeline, you can optimise it with a grid search, pickle it, and so on. |
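As a rough illustration of what that looks like, the sketch below tunes and pickles an ordinary scikit-learn ``Pipeline`` as a stand-in estimator. It assumes a DAG exposes step parameters under the same ``<step>__<param>`` naming that pipelines use; check the ``skdag`` documentation before relying on that.

```python
import pickle

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Stand-in estimator; a fitted DAG would take this place.
estimator = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])

# Step parameters are addressed as "<step name>__<parameter name>".
search = GridSearchCV(estimator, {"ridge__alpha": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)

# Round-trip the best estimator through pickle.
restored = pickle.loads(pickle.dumps(search.best_estimator_))
```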
| 83 | + |
| 84 | +Note that this package does not deal with things like delayed dependencies and |
| 85 | +distributed architectures - consider an `established <https://airflow.apache.org/>`_ |
| 86 | +`solution <https://dagster.io/>`_ for such use cases. ``skdag`` is just for building and |
| 87 | +executing local ensembles from estimators. |
| 88 | + |
| 89 | +`Read on <https://skdag.readthedocs.io/>`_ to learn more about ``skdag``... |
| 90 | + |
| 91 | +.. _scikit-learn: https://scikit-learn.org |