@@ -5,7 +5,76 @@ scikit-dag (``skdag``) is an open-sourced, MIT-licenced library that provides ad
 workflow management to any machine learning operations that follow
 :mod:`sklearn` conventions. It does this by introducing Directed Acyclic
 Graphs (:class:`skdag.dag.DAG`) as a drop-in replacement for traditional scikit-learn
-:mod:`sklearn.pipeline.Pipeline`.
+:mod:`sklearn.pipeline.Pipeline`. This gives you a simple interface for a range of use
+cases including complex pre-processing, model stacking and benchmarking.
+
+.. code-block:: python
+
+    from skdag import DAGBuilder
+    from sklearn.decomposition import PCA
+    from sklearn.ensemble import RandomForestRegressor
+    from sklearn.impute import SimpleImputer
+    from sklearn.linear_model import LinearRegression
+    from sklearn.neighbors import KNeighborsRegressor
+    from sklearn.svm import SVR
+
+    dag = (
+        DAGBuilder()
+        .add_step("impute", SimpleImputer())
+        .add_step("vitals", "passthrough", deps={"impute": slice(0, 4)})
+        .add_step(
+            "blood",
+            PCA(n_components=2, random_state=0),
+            deps={"impute": slice(4, 10)}
+        )
+        .add_step(
+            "rf",
+            RandomForestRegressor(max_depth=5, random_state=0),
+            deps=["blood", "vitals"]
+        )
+        .add_step("svm", SVR(C=0.7), deps=["blood", "vitals"])
+        .add_step(
+            "knn",
+            KNeighborsRegressor(n_neighbors=5),
+            deps=["blood", "vitals"]
+        )
+        .add_step("meta", LinearRegression(), deps=["rf", "svm", "knn"])
+        .make_dag(n_jobs=2, verbose=True)
+    )
+
+    dag.show(detailed=True)
+
+.. image:: _static/img/cover.svg
+
+The above DAG imputes missing values, runs PCA on the columns relating to blood test
+results and leaves the other columns as they are. Both sets of columns are then passed
+to three different regressors, whose outputs feed a final meta-estimator. Because DAGs
+(unlike pipelines) allow predictors in the middle of a workflow, you can use them to
+implement model stacking. We also chose to run the DAG steps in parallel wherever
+possible (``n_jobs=2``).
+
+After building our DAG, we can treat it as any other estimator:
+
+.. code-block:: python
+
+    from sklearn import datasets
+    from sklearn.model_selection import train_test_split
+
+    X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
+    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
+
+    # Fit and predict exactly as with any other scikit-learn estimator.
+    dag.fit(X_train, y_train)
+    dag.predict(X_test)
+
+Just like a pipeline, you can optimise it with a grid search, pickle it, and so on.
+
+Note that this package does not deal with things like delayed dependencies and
+distributed architectures - consider an `established <https://airflow.apache.org/>`_
+`solution <https://dagster.io/>`_ for such use cases. ``skdag`` is just for building and
+executing local ensembles from estimators.
+
+:ref:`Read on <quickstart>` to learn more about ``skdag``...
 
 .. toctree::
    :maxdepth: 2
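
As a rough illustration of what the `deps` arguments in this diff express, the sketch below orders the same seven step names so that each step runs after its dependencies, and shows how a `slice` selects a subset of a parent's columns. It is not skdag's implementation: it uses only the standard library, treats every step as a passthrough, and the names `DEPS`, `execution_order`, `gather_inputs` and `run` are hypothetical helpers made up for this example.

```python
from graphlib import TopologicalSorter

# Hypothetical stand-in for the DAG in the diff above.
# step name -> {dependency name: column slice, or None for "all columns"}.
DEPS = {
    "impute": {},
    "vitals": {"impute": slice(0, 4)},
    "blood": {"impute": slice(4, 10)},
    "rf": {"blood": None, "vitals": None},
    "svm": {"blood": None, "vitals": None},
    "knn": {"blood": None, "vitals": None},
    "meta": {"rf": None, "svm": None, "knn": None},
}


def execution_order(deps):
    """Return step names so that every step comes after all of its dependencies."""
    graph = {step: set(parents) for step, parents in deps.items()}
    return list(TopologicalSorter(graph).static_order())


def gather_inputs(step, deps, outputs, row):
    """Concatenate the (optionally sliced) outputs of a step's dependencies."""
    parents = deps[step]
    if not parents:
        # Root steps receive the raw input row.
        return list(row)
    gathered = []
    for parent, cols in parents.items():
        out = outputs[parent]
        gathered.extend(out if cols is None else out[cols])
    return gathered


def run(deps, row):
    """Run every step as a passthrough, recording what each step would receive."""
    outputs = {}
    for step in execution_order(deps):
        outputs[step] = gather_inputs(step, deps, outputs, row)
    return outputs


outputs = run(DEPS, list(range(10)))
print(outputs["vitals"])  # [0, 1, 2, 3]
print(outputs["blood"])   # [4, 5, 6, 7, 8, 9]
```

In the real DAG each step transforms its input (for example, PCA would reduce the six blood columns to two), so the shapes here only indicate which columns flow where.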