| 1 | +# Provenance Tracking |
| 2 | + |
| 3 | +Sprocket automatically tracks all workflow executions in a SQLite database while |
| 4 | +maintaining an organized filesystem structure for outputs. Both `sprocket run` |
| 5 | +and `sprocket dev server` share the same execution engine and output structure, so |
| 6 | +the concepts described here apply equally to both commands. |
| 7 | + |
| 8 | +> [!NOTE] |
| 9 | +> |
| 10 | +> Provenance is a well-established research area within scientific workflow |
| 11 | +> systems. Formal models such as the |
| 12 | +> [W3C PROV](https://www.w3.org/TR/prov-overview/) family and its |
| 13 | +> workflow-oriented extension |
| 14 | +> [ProvONE](https://purl.dataone.org/provone-v1-dev) define rich vocabularies |
| 15 | +> for describing data lineage, activity chains, and agent relationships (cf. |
| 16 | +> [Ludäscher et al., 2016](https://link.springer.com/chapter/10.1007/978-3-319-40226-0_7); |
| 17 | +> [Deelman et al., 2018](https://journals.sagepub.com/doi/abs/10.1177/1094342017704893)). |
| 18 | +> Sprocket uses the term "provenance" more loosely here to describe its |
| 19 | +> execution tracking capabilities—recording what was run, with which inputs, |
| 20 | +> when, and by whom—rather than implementing the full data lineage and |
| 21 | +> dependency tracking described in those formal models. |
| 22 | + |
| 23 | +For design details, see [RFC #3](https://github.com/stjude-rust-labs/rfcs/pull/3). |
| 24 | + |
| 25 | +## Runs and indexes |
| 26 | + |
| 27 | +The output directory contains two complementary directory hierarchies that |
| 28 | +together address a fundamental tension in workflow management: users need both a |
| 29 | +complete provenance record for reproducibility and auditing _and_ a simplified, |
| 30 | +domain-specific view for everyday access to results. Rather than forcing users to |
| 31 | +choose one or maintain both manually, Sprocket provides both automatically. |
| 32 | + |
| 33 | +The **`runs/`** directory is the immutable record of truth. It organizes every |
| 34 | +execution chronologically by target name and timestamp, preserving the full |
| 35 | +history of inputs, outputs, and individual task attempts. This structure is |
| 36 | +append-only—Sprocket never modifies or removes previous runs—so it serves as |
| 37 | +a reliable audit trail. When a workflow is run multiple times, each execution |
| 38 | +receives its own timestamped directory, and the complete set of attempts is |
| 39 | +always available for inspection. |
| 40 | + |
| 41 | +The **`index/`** directory is an optional, user-curated view layered on top of |
| 42 | +the runs. When the `--index-on` flag is provided, Sprocket creates symlinks |
| 43 | +under `index/` that point back into `runs/`, giving users a way to organize |
| 44 | +results by whatever dimension makes sense for their domain (e.g., by project or |
| 45 | +by experiment) without duplicating any data. Because the index |
| 46 | +consists entirely of relative symlinks, it adds negligible storage overhead and |
| 47 | +can be reconstructed from the provenance database at any time. |
| 48 | + |
| 49 | +This separation means that the provenance record remains intact regardless of how |
| 50 | +the index evolves. Re-running a workflow with the same `--index-on` path updates |
| 51 | +the index symlinks to point to the latest results, but the previous run's |
| 52 | +directory under `runs/` is preserved, and the database records the full history |
| 53 | +of index changes. The design follows a principle of |
| 54 | +[progressive disclosure](https://en.wikipedia.org/wiki/Progressive_disclosure): |
| 55 | +users who simply run `sprocket run` get a well-organized `runs/` directory and a |
| 56 | +provenance database with no extra configuration, and those who need logical |
| 57 | +organization can opt into indexing by adding a single flag. |
| 58 | + |
| 59 | +## Output directory |
| 60 | + |
| 61 | +By default, Sprocket creates an `out/` directory in your current working |
| 62 | +directory to store all workflow outputs and provenance data. This location can |
| 63 | +be configured via: |
| 64 | + |
| 65 | +- The `-o, --output-dir` CLI flag (for `sprocket run`). |
| 66 | +- The `-o, --output-directory` CLI flag (for `sprocket dev server`). |
| 67 | +- The `run.output_dir` configuration option (for run mode). |
| 68 | +- The `server.output_directory` configuration option (for server mode). |
| 69 | + |
| 70 | +### Directory structure |
| 71 | + |
| 72 | +The layout within each run directory differs slightly depending on whether the |
| 73 | +target is a standalone task or a workflow containing multiple task calls. |
| 74 | + |
| 75 | +#### Task runs |
| 76 | + |
| 77 | +When running a task directly, the `attempts/` directory sits at the top level of |
| 78 | +the run directory. |
| 79 | + |
| 80 | +``` |
| 81 | +./out/ |
| 82 | +├── sprocket.db # SQLite provenance database |
| 83 | +├── output.log # Execution log |
| 84 | +├── runs/ |
| 85 | +│ └── <target>/ |
| 86 | +│ ├── <timestamp>/ # Individual run (YYYY-MM-DD_HHMMSSffffff) |
| 87 | +│ │ ├── inputs.json # Serialized inputs for the run |
| 88 | +│ │ ├── outputs.json # Serialized outputs from the run |
| 89 | +│ │ ├── tmp/ # Temporary localization files |
| 90 | +│ │ └── attempts/ |
| 91 | +│ │ └── <n>/ # Attempt number (0, 1, 2, ...) |
| 92 | +│ │ ├── command # Executed shell script |
| 93 | +│ │ ├── stdout # Task standard output |
| 94 | +│ │ ├── stderr # Task standard error |
| 95 | +│ │ └── work/ # Task working directory |
| 96 | +│ └── _latest -> <timestamp>/ # Symlink to most recent run |
| 97 | +└── index/ # Optional output indexing |
| 98 | + └── <output_name>/ |
| 99 | + └── outputs.json # Symlink to run outputs |
| 100 | +``` |
| 101 | + |
| 102 | +#### Workflow runs |
| 103 | + |
| 104 | +When running a workflow, each task call within the workflow gets its own |
| 105 | +subdirectory under `calls/`. Each call directory then contains the same |
| 106 | +`attempts/` and `tmp/` structure as a standalone task run. |
| 107 | + |
| 108 | +``` |
| 109 | +./out/ |
| 110 | +├── sprocket.db |
| 111 | +├── output.log |
| 112 | +├── runs/ |
| 113 | +│ └── <target>/ |
| 114 | +│ ├── <timestamp>/ |
| 115 | +│ │ ├── inputs.json |
| 116 | +│ │ ├── outputs.json |
| 117 | +│ │ ├── tmp/ # Workflow-level temporary files |
| 118 | +│ │ └── calls/ # Task execution directories |
| 119 | +│ │ └── <task_call_id>/ # One per task call in the workflow |
| 120 | +│ │ ├── tmp/ # Task-level temporary files |
| 121 | +│ │ └── attempts/ |
| 122 | +│ │ └── <n>/ |
| 123 | +│ │ ├── command |
| 124 | +│ │ ├── stdout |
| 125 | +│ │ ├── stderr |
| 126 | +│ │ └── work/ |
| 127 | +│ └── _latest -> <timestamp>/ |
| 128 | +└── index/ |
| 129 | + └── <output_name>/ |
| 130 | + └── outputs.json |
| 131 | +``` |
| 132 | + |
| 133 | +### The `_latest` symlink |
| 134 | + |
| 135 | +For each target, Sprocket maintains a `_latest` symlink pointing to the most |
| 136 | +recent execution directory. This provides quick access to the latest results |
| 137 | +without needing to know the exact timestamp. |
| 138 | + |
| 139 | +```shell |
| 140 | +# Access the latest run outputs |
| 141 | +ls out/runs/my_workflow/_latest/ |
| 142 | +``` |
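To see which timestamped run `_latest` currently points to, the symlink can be resolved with `readlink`. The sketch below builds a throwaway directory that mimics the documented layout so it can be run without Sprocket installed. The target name `my_workflow` and the timestamp are placeholders, not real output. Against a real output directory, only the `readlink` line is needed.

```shell
# Build a throwaway layout that mimics out/runs/<target>/ (placeholder names).
demo="$(mktemp -d)"
target_dir="$demo/out/runs/my_workflow"
mkdir -p "$target_dir/2025-01-01_120000000000"
ln -s "2025-01-01_120000000000" "$target_dir/_latest"

# Resolve the symlink to find the timestamped directory it points to.
latest_run="$(readlink "$target_dir/_latest")"
echo "latest run: $latest_run"
```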
| 143 | + |
| 144 | +> [!NOTE] |
| 145 | +> |
| 146 | +> On Windows, creating symlinks may require administrator privileges or |
| 147 | +> Developer Mode. If symlink creation fails, the `_latest` symlink will be |
| 148 | +> omitted but workflow execution will continue normally. |
| 149 | + |
| 150 | +## Provenance database |
| 151 | + |
| 152 | +The `sprocket.db` SQLite database tracks all workflow executions, including: |
| 153 | + |
| 154 | +- **Sessions**: Groups of related workflow submissions. |
| 155 | +- **Runs**: Individual workflow executions with inputs, outputs, and status. |
| 156 | +- **Tasks**: Individual task executions within a workflow run. |
| 157 | + |
| 158 | +## Run contents |
| 159 | + |
| 160 | +Each run creates a timestamped directory under `runs/<target>/` containing |
| 161 | +the following: |
| 162 | + |
| 163 | +| File/Directory | Description | |
| 164 | +|----------------|-------------| |
| 165 | +| `output.log` | Log of all messages emitted during the run | |
| 166 | +| `inputs.json` | Serialized inputs provided for the run | |
| 167 | +| `outputs.json` | Serialized outputs produced by the run | |
| 168 | +| `tmp/` | Temporary files used during input localization | |
| 169 | +| `attempts/` | Directory containing attempt subdirectories (task runs) | |
| 170 | +| `calls/` | Directory containing per-task-call subdirectories (workflow runs) | |
| 171 | +| `attempts/<n>/command` | The shell script that was executed | |
| 172 | +| `attempts/<n>/stdout` | Standard output from the task | |
| 173 | +| `attempts/<n>/stderr` | Standard error from the task | |
| 174 | +| `attempts/<n>/work/` | Task working directory containing output files | |
| 175 | + |
| 176 | +### Retries |
| 177 | + |
| 178 | +When a task fails and is retried, each attempt gets its own numbered |
| 179 | +subdirectory under `attempts/`. This preserves the complete history of all |
| 180 | +execution attempts, which is valuable for debugging intermittent failures. |
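As a rough illustration of how the numbered attempt directories can be used when debugging, the sketch below compares `stderr` across attempts. It creates a placeholder task run with two attempts; the task name, timestamp, and error text are invented for the example.

```shell
# Placeholder layout: one task run with two attempts (names are made up).
demo="$(mktemp -d)"
run_dir="$demo/out/runs/my_task/2025-01-01_120000000000"
mkdir -p "$run_dir/attempts/0" "$run_dir/attempts/1"
echo "transient error" > "$run_dir/attempts/0/stderr"
: > "$run_dir/attempts/1/stderr"

# Attempt directories are numbered, so sort numerically (10 must follow 9).
last_attempt="$(ls "$run_dir/attempts" | sort -n | tail -n 1)"
echo "last attempt: $last_attempt"

# Print stderr from every attempt, oldest first, to compare failures.
for n in $(ls "$run_dir/attempts" | sort -n); do
  echo "--- attempt $n ---"
  cat "$run_dir/attempts/$n/stderr"
done
```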
| 181 | + |
| 182 | +## Output indexing |
| 183 | + |
| 184 | +When the `--index-on` flag is provided, Sprocket indexes run outputs by the |
| 185 | +specified output name. For each run, a symlink is created under |
| 186 | +`index/<output_name>/` pointing to the run's `outputs.json` file. This enables |
| 187 | +efficient lookup of runs by output values without scanning the entire `runs/` |
| 188 | +directory. |
| 189 | + |
| 190 | +```shell |
| 191 | +# Run a workflow with output indexing on the `greeting` output |
| 192 | +sprocket run hello.wdl -t hello --index-on greeting |
| 193 | +``` |
| 194 | + |
| 195 | +The resulting index entry is a relative symlink: |
| 196 | + |
| 197 | +``` |
| 198 | +index/greeting/outputs.json -> ../../runs/hello/<timestamp>/outputs.json |
| 199 | +``` |
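Because the index entry is an ordinary symlink, standard tools can both read through it and trace it back to the run that produced it. The sketch below recreates the `greeting` example in a temporary directory. The timestamp and the contents of `outputs.json` are placeholders and do not reflect Sprocket's actual serialization.

```shell
# Recreate the documented index layout in a temporary directory.
demo="$(mktemp -d)"
run_dir="$demo/out/runs/hello/2025-01-01_120000000000"
mkdir -p "$run_dir" "$demo/out/index/greeting"
echo '{"greeting": "placeholder"}' > "$run_dir/outputs.json"  # made-up content
ln -s "../../runs/hello/2025-01-01_120000000000/outputs.json" \
  "$demo/out/index/greeting/outputs.json"

# Read the run's outputs through the index entry.
indexed_outputs="$(cat "$demo/out/index/greeting/outputs.json")"
echo "$indexed_outputs"

# Trace the index entry back to the run directory it came from.
link_target="$(readlink "$demo/out/index/greeting/outputs.json")"
echo "$link_target"
```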
| 200 | + |
| 201 | +## Portability |
| 202 | + |
| 203 | +The entire output directory is designed to be portable: |
| 204 | + |
| 205 | +- All paths stored in the database are relative to the database file. |
| 206 | +- Symlinks (including index entries) use relative paths. |
| 207 | +- Moving the `out/` directory with `mv` or `rsync -a` (which copies symlinks as symlinks) preserves all relationships. |
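One way to check that a relocated output directory is still intact is to look for dangling symlinks after the move. The sketch below does this on a placeholder layout with invented directory names and timestamp. With `find -L`, the `-type l` test matches only symlinks that fail to resolve.

```shell
# Placeholder output directory with a _latest link and one index entry.
demo="$(mktemp -d)"
run_dir="$demo/out/runs/hello/2025-01-01_120000000000"
mkdir -p "$run_dir" "$demo/out/index/greeting"
echo '{}' > "$run_dir/outputs.json"
ln -s "2025-01-01_120000000000" "$demo/out/runs/hello/_latest"
ln -s "../../runs/hello/2025-01-01_120000000000/outputs.json" \
  "$demo/out/index/greeting/outputs.json"

# Relocate the whole output directory.
mkdir -p "$demo/elsewhere"
mv "$demo/out" "$demo/elsewhere/out"

# Count symlinks that no longer resolve; relative links should all survive.
broken="$(find -L "$demo/elsewhere/out" -type l | wc -l | tr -d ' ')"
echo "broken symlinks: $broken"
```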
| 208 | + |
| 209 | +## Concurrent access |
| 210 | + |
| 211 | +Both `sprocket run` and `sprocket dev server` share the same execution engine and |
| 212 | +can operate on the same output directory simultaneously: |
| 213 | + |
| 214 | +- SQLite's WAL mode lets readers proceed while a write is in progress. |
| 215 | +- Database locks are held briefly (milliseconds per transaction). |
| 216 | +- A workflow submitted via CLI is immediately visible to the server. |
| 217 | +- All workflows share the same database regardless of submission method. |
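To confirm that a database file is in WAL mode, the `sqlite3` command-line tool can report its journal mode. The sketch below assumes `sqlite3` is installed and uses a stand-in database, since the schema of a real `sprocket.db` is not described here. Against a real output directory, only the final `PRAGMA journal_mode;` query applies.

```shell
if command -v sqlite3 >/dev/null 2>&1; then
  # Stand-in database; a real sprocket.db is created by Sprocket itself.
  demo="$(mktemp -d)"
  db="$demo/sprocket.db"
  sqlite3 "$db" "PRAGMA journal_mode=WAL; CREATE TABLE demo (id INTEGER);" >/dev/null

  # WAL mode is persistent, so a fresh connection reports it.
  journal_mode="$(sqlite3 "$db" 'PRAGMA journal_mode;')"
  echo "journal mode: $journal_mode"
fi
```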
| 218 | + |
| 219 | +## Best practices |
| 220 | + |
| 221 | +### Organizing output directories |
| 222 | + |
| 223 | +Use a dedicated output directory for each project or analysis domain. This keeps |
| 224 | +provenance data isolated, makes backups straightforward, and avoids confusion |
| 225 | +when multiple unrelated workflows share the same `runs/` hierarchy. |
| 226 | + |
| 227 | +```shell |
| 228 | +sprocket run pipeline_a.wdl -o ./pipeline-a-out ... |
| 229 | +sprocket run pipeline_b.wdl -o ./pipeline-b-out ... |
| 230 | +``` |
| 231 | + |
| 232 | +### Querying execution history |
| 233 | + |
| 234 | +The REST API (available via `sprocket dev server`) is the recommended way to query |
| 235 | +execution history. The API provides endpoints for listing sessions, runs, and |
| 236 | +tasks with filtering capabilities. See the |
| 237 | +[server documentation](/subcommands/server) for endpoint details and the |
| 238 | +interactive Swagger UI at `/api/v1/swagger-ui` for exploration. |
| 239 | + |
| 240 | +Avoid parsing the `runs/` directory structure directly for programmatic access. |
| 241 | +The layout within `runs/` is an implementation detail that may evolve, whereas |
| 242 | +the API provides a stable interface. The `index/` directory, on the other hand, |
| 243 | +is user-assembled via `--index-on` and is designed to be consumed directly. |
| 244 | + |
| 245 | +### Backing up provenance data |
| 246 | + |
| 247 | +The output directory is self-contained: backing up the entire `out/` directory |
| 248 | +(including `sprocket.db` and the `runs/` and `index/` hierarchies) captures the |
| 249 | +full provenance record. Because all paths in the database and all symlinks are |
| 250 | +relative, a backup can be restored to any location without reconfiguration. |
| 251 | + |
| 252 | +When backing up a live system, be aware that SQLite WAL mode uses auxiliary |
| 253 | +files (`sprocket.db-wal` and `sprocket.db-shm`). For a consistent backup, |
| 254 | +either stop active workflows first, or use SQLite's |
| 255 | +[backup API](https://www.sqlite.org/backup.html) to safely copy the database |
| 256 | +while it is in use. |
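A minimal sketch of the second option, assuming the `sqlite3` command-line tool is available: its `.backup` dot-command uses the online backup API to copy a database while other connections are active. The database below is a stand-in with an invented table, and the paths are placeholders. With a real output directory, the source would be `out/sprocket.db`.

```shell
if command -v sqlite3 >/dev/null 2>&1; then
  demo="$(mktemp -d)"
  mkdir -p "$demo/out" "$demo/backup"

  # Stand-in WAL-mode database (a real sprocket.db has Sprocket's own schema).
  sqlite3 "$demo/out/sprocket.db" \
    "PRAGMA journal_mode=WAL; CREATE TABLE demo (id INTEGER); INSERT INTO demo VALUES (1);" \
    >/dev/null

  # Copy the database safely via the online backup API.
  sqlite3 "$demo/out/sprocket.db" ".backup '$demo/backup/sprocket.db'"

  # Verify that the copy is a consistent database.
  check="$(sqlite3 "$demo/backup/sprocket.db" 'PRAGMA integrity_check;')"
  echo "integrity: $check"
fi
```

The `runs/` and `index/` hierarchies can then be copied alongside the database copy with a tool that preserves symlinks, such as `rsync -a`.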
| 257 | + |
| 258 | +### Preserving the `runs/` directory |
| 259 | + |
| 260 | +The `runs/` hierarchy is the immutable record of truth for all workflow |
| 261 | +executions. Do not modify, rename, or delete files within it, as doing so may |
| 262 | +invalidate provenance records and break index symlinks that reference those |
| 263 | +paths. If disk space becomes a concern, consider archiving older runs rather |
| 264 | +than deleting them. |