| 1 | +# Directed Acyclic Graphs (DAGs) for struct Models |
| 2 | + |
| 3 | +This document describes the DAG functionality added to the struct package. It lets you define a workflow as a directed acyclic graph of struct models and data, then execute that graph. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The DAG functionality consists of three main classes: |
| 8 | + |
| 9 | +1. **`model_dag`** - Represents a directed acyclic graph with edges defining the workflow |
| 10 | +2. **`model_node`** - Represents a node containing a struct model and a mode function |
| 11 | +3. **`data_node`** - Represents a node containing a DatasetExperiment object |
| 12 | + |
| 13 | +## Classes |
| 14 | + |
| 15 | +### model_dag |
| 16 | + |
| 17 | +The `model_dag` class extends `struct_class` and contains an `edges` slot that defines the connections between nodes. |
| 18 | + |
| 19 | +```r |
| 20 | +# Create a DAG |
| 21 | +dag = model_dag( |
| 22 | + name = 'My Workflow', |
| 23 | + description = 'A simple workflow', |
| 24 | + edges = list( |
| 25 | + list(from = 'Data', to = 'Preprocessing'), |
| 26 | + list(from = 'Preprocessing', to = 'Analysis') |
| 27 | + ) |
| 28 | +) |
| 29 | +``` |
| 30 | + |
| 31 | +### model_node |
| 32 | + |
| 33 | +The `model_node` class contains a struct model object and a mode function (for example `model_apply`) that determines how the model is run on the data it receives. |
| 34 | + |
| 35 | +```r |
| 36 | +# Create a model node (PCA() is provided by the structToolbox package) |
| 37 | +pca_model = PCA() |
| 38 | +node = model_node( |
| 39 | + name = 'PCA Analysis', |
| 40 | + description = 'Principal Component Analysis', |
| 41 | + model = pca_model, |
| 42 | + mode = model_apply # or model_train, model_predict, model_reverse |
| 43 | +) |
| 44 | +``` |
| 45 | + |
| 46 | +### data_node |
| 47 | + |
| 48 | +The `data_node` class contains a DatasetExperiment object that serves as input to other nodes. |
| 49 | + |
| 50 | +```r |
| 51 | +# Create a data node |
| 52 | +D = iris_DatasetExperiment() |
| 53 | +data_node = data_node( |
| 54 | + name = 'My Data', |
| 55 | + description = 'Iris dataset', |
| 56 | + data = D |
| 57 | +) |
| 58 | + |
| 59 | +# Access the data |
| 60 | +data_value(data_node) |
| 61 | +``` |
| 62 | + |
| 63 | +## Execution |
| 64 | + |
| 65 | +The `dag_execute` function executes a DAG by: |
| 66 | + |
| 67 | +1. Validating the DAG structure |
| 68 | +2. Performing topological sorting to determine execution order |
| 69 | +3. Executing nodes in the correct order |
| 70 | +4. Passing outputs between nodes according to the edges |
| 71 | + |
| 72 | +```r |
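| 73 | +# Execute a DAG; the list names match the node names used in the edges (preprocessing_node and analysis_node are assumed to be defined already) |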
| 74 | +nodes = list( |
| 75 | + 'Data' = data_node, |
| 76 | + 'Preprocessing' = preprocessing_node, |
| 77 | + 'Analysis' = analysis_node |
| 78 | +) |
| 79 | +results = dag_execute(dag, nodes, verbose = TRUE) |
| 80 | +``` |
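
The topological sorting step can be illustrated in base R, independently of the package. The sketch below uses Kahn's algorithm on an edge list in the same form as the examples above. `topo_order` is a hypothetical helper written for this illustration; it is not part of struct, and `dag_execute` may order nodes differently internally.

```r
# Edges in the same list-of-lists form used by model_dag
edges = list(
    list(from = 'Data', to = 'Preprocessing'),
    list(from = 'Preprocessing', to = 'Analysis')
)

# topo_order() is a hypothetical helper, not part of struct
topo_order = function(edges) {
    from = vapply(edges, function(e) e$from, character(1))
    to = vapply(edges, function(e) e$to, character(1))
    nodes = unique(c(from, to))

    # Count the incoming edges of each node
    indegree = setNames(integer(length(nodes)), nodes)
    for (target in to) indegree[[target]] = indegree[[target]] + 1L

    ordered = character(0)
    ready = nodes[indegree == 0]   # nodes with no unmet dependencies
    while (length(ready) > 0) {
        current = ready[1]
        ready = ready[-1]
        ordered = c(ordered, current)
        # Release the successors of the node that was just scheduled
        for (s in to[from == current]) {
            indegree[[s]] = indegree[[s]] - 1L
            if (indegree[[s]] == 0L) ready = c(ready, s)
        }
    }

    # Any node left unscheduled must sit on a cycle
    if (length(ordered) < length(nodes)) {
        stop('The graph contains a cycle')
    }
    ordered
}

print(topo_order(edges))
```

Because a node is only released once all of its predecessors have been scheduled, the same loop also detects cycles: a cyclic graph leaves some nodes unscheduled.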
| 81 | + |
| 82 | +## Example Workflows |
| 83 | + |
| 84 | +### Simple Preprocessing Workflow |
| 85 | + |
| 86 | +```r |
| 87 | +library(struct); library(structToolbox)  # structToolbox provides mean_centre() and PCA() |
| 88 | +D = iris_DatasetExperiment() |
| 89 | + |
| 90 | +# Create nodes |
| 91 | +data_node = data_node(name = 'Data', data = D) |
| 92 | +mean_center_node = model_node( |
| 93 | + name = 'Mean Centering', |
| 94 | + model = mean_centre(), |
| 95 | + mode = model_apply |
| 96 | +) |
| 97 | +pca_node = model_node( |
| 98 | + name = 'PCA', |
| 99 | + model = PCA(), |
| 100 | + mode = model_apply |
| 101 | +) |
| 102 | + |
| 103 | +# Create DAG |
| 104 | +dag = model_dag( |
| 105 | + name = 'Preprocessing Workflow', |
| 106 | + edges = list( |
| 107 | + list(from = 'Data', to = 'Mean Centering'), |
| 108 | + list(from = 'Mean Centering', to = 'PCA') |
| 109 | + ) |
| 110 | +) |
| 111 | + |
| 112 | +# Execute |
| 113 | +nodes = list( |
| 114 | + 'Data' = data_node, |
| 115 | + 'Mean Centering' = mean_center_node, |
| 116 | + 'PCA' = pca_node |
| 117 | +) |
| 118 | +results = dag_execute(dag, nodes) |
| 119 | +``` |
| 120 | + |
| 121 | +### Complex Workflow with Parallel Paths |
| 122 | + |
| 123 | +```r |
| 124 | +# Define the edges for parallel PCA and PLS branches; the node objects are supplied separately to dag_execute |
| 125 | +dag = model_dag( |
| 126 | + name = 'Complex Analysis', |
| 127 | + edges = list( |
| 128 | + list(from = 'Data', to = 'Preprocessing'), |
| 129 | + list(from = 'Preprocessing', to = 'PCA Train'), |
| 130 | + list(from = 'Preprocessing', to = 'PLS Train'), |
| 131 | + list(from = 'PCA Train', to = 'PCA Predict'), |
| 132 | + list(from = 'PLS Train', to = 'PLS Predict') |
| 133 | + ) |
| 134 | +) |
| 135 | +``` |
| 136 | + |
| 137 | +## Available Modes |
| 138 | + |
| 139 | +The following modes can be used with model nodes: |
| 140 | + |
| 141 | +- `model_apply` - Train and apply the model in one step |
| 142 | +- `model_train` - Train the model only |
| 143 | +- `model_predict` - Apply a trained model |
| 144 | +- `model_reverse` - Apply the reverse transformation |
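
Outside a DAG, the same mode functions can be called directly on a model. The sketch below compares the two-step route (`model_train` followed by `model_predict`) with the one-step `model_apply`. It assumes the structToolbox package is installed, since that is where `mean_centre()` comes from, and it uses `predicted()` to read the model's default output.

```r
library(struct)         # model_train(), model_predict(), model_apply(), predicted()
library(structToolbox)  # mean_centre()

D = iris_DatasetExperiment()

# Two explicit steps: estimate the column means, then subtract them
M = mean_centre()
M = model_train(M, D)
M = model_predict(M, D)
two_step = predicted(M)

# One step: model_apply trains and applies the model to the same data
M2 = model_apply(mean_centre(), D)
one_step = predicted(M2)
```

Both routes should give the same centred data here, because training and prediction use the same dataset. The split matters when a model is trained on one dataset and applied to another.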
| 145 | + |
| 146 | +## Validation |
| 147 | + |
| 148 | +DAG execution checks that: |
| 149 | + |
| 150 | +1. Every node referenced in an edge exists in the supplied node list |
| 151 | +2. Every node is of a supported type (`model_node` or `data_node`) |
| 152 | +3. The graph contains no cycles |
| 153 | +4. Every model node has input data |
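
As a rough illustration of checks 1, 2 and 4, the base R sketch below validates an edge list against a named list of nodes. The nodes here are plain lists tagged with a `type` field, used as hypothetical stand-ins for `data_node` and `model_node` objects, and `check_dag` is a hypothetical helper rather than the package's own validation code. The cycle check is left out because it falls naturally out of topological sorting.

```r
# Hypothetical stand-ins: plain lists tagged with a type, not real struct objects
nodes = list(
    'Data' = list(type = 'data'),
    'Preprocessing' = list(type = 'model'),
    'Analysis' = list(type = 'model')
)
edges = list(
    list(from = 'Data', to = 'Preprocessing'),
    list(from = 'Preprocessing', to = 'Analysis')
)

# check_dag() is a hypothetical helper that returns a character vector of problems
check_dag = function(edges, nodes) {
    from = vapply(edges, function(e) e$from, character(1))
    to = vapply(edges, function(e) e$to, character(1))
    problems = character(0)

    # Check 1: every node named in an edge must be supplied
    missing_nodes = setdiff(unique(c(from, to)), names(nodes))
    if (length(missing_nodes) > 0) {
        problems = c(problems, paste0('Missing node: ', missing_nodes))
    }

    # Check 2: every node must be of a recognised type
    types = vapply(nodes, function(n) n$type, character(1))
    bad_type = names(nodes)[!(types %in% c('data', 'model'))]
    if (length(bad_type) > 0) {
        problems = c(problems, paste0('Invalid node type: ', bad_type))
    }

    # Check 4: every model node needs at least one incoming edge
    model_names = names(nodes)[types == 'model']
    no_input = setdiff(model_names, to)
    if (length(no_input) > 0) {
        problems = c(problems, paste0('No input for model node: ', no_input))
    }
    problems
}

print(check_dag(edges, nodes))   # empty when no problem is found
```

Collecting every problem before reporting, instead of stopping at the first one, lets a user fix several mistakes in a workflow definition in one pass.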
| 154 | + |
| 155 | +## Error Handling |
| 156 | + |
| 157 | +The DAG execution provides informative error messages for common issues: |
| 158 | + |
| 159 | +- Missing nodes referenced in edges |
| 160 | +- Invalid node types |
| 161 | +- Cycles in the graph |
| 162 | +- Missing input data for model nodes |
| 163 | + |
| 164 | +## Benefits |
| 165 | + |
| 166 | +The DAG functionality provides several benefits: |
| 167 | + |
| 168 | +1. **Modularity** - Each step is encapsulated in its own node |
| 169 | +2. **Reusability** - Nodes can be reused in different workflows |
| 170 | +3. **Clarity** - The workflow structure is explicitly defined |
| 171 | +4. **Validation** - Automatic validation of workflow structure |
| 172 | +5. **Flexibility** - Support for complex workflows with parallel paths |