Commit ab2932f

docs: add server mode and provenance tracking documentation (#30)
1 parent 5eef914 commit ab2932f

9 files changed: +550 -20 lines changed

.vitepress/config.mts

Lines changed: 2 additions & 0 deletions
@@ -61,6 +61,7 @@ export default defineConfig({
6161
],
6262
},
6363
{ text: "Call Cache", link: "/configuration/cache", docFooterText: "Configuration > Call Cache" },
64+
{ text: "Provenance Tracking", link: "/configuration/provenance", docFooterText: "Configuration > Provenance Tracking" },
6465
],
6566
},
6667
{
@@ -79,6 +80,7 @@ export default defineConfig({
7980
text: "Experimental Commands", collapsed: true, items: [
8081
{ text: "doc", link: "/subcommands/doc", docFooterText: "Experimental Commands > doc" },
8182
{ text: "lock", link: "/subcommands/lock", docFooterText: "Experimental Commands > lock" },
83+
{ text: "server", link: "/subcommands/server", docFooterText: "Experimental Commands > server" },
8284
{ text: "test", link: "/subcommands/test", docFooterText: "Experimental Commands > test" },
8385
]
8486
},

configuration/overview.md

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,10 @@ channels (listed in order of the relative priority during loading).
2323
via the `SPROCKET_CONFIG` environment variable.
2424
* **Current working directory.** Sprocket will attempt to load a `sprocket.toml`
2525
within the current working directory when the `sprocket` command runs.
26+
* **Executable-adjacent configuration.** Sprocket will attempt to load a
27+
`sprocket.toml` located in the same directory as the `sprocket` executable.
28+
This is useful for bundled or deployed installations where a default
29+
configuration should travel with the binary.
2630
* **System-wide configuration locations.** See [the section
2731
below](#system-wide-configuration-locations) on how to use the system-wide
2832
configuration directory.
@@ -96,6 +100,15 @@ the command line. This will disable the searching for and loading of configurati
96100
files. The only configuration loaded will be the one (if any) specified by the `--config`
97101
command line argument.
98102

103+
## Global options
104+
105+
Sprocket provides a few options that apply across all subcommands.
106+
107+
| Option | Config Key | Values | Default | Description |
|--------|-----------|--------|---------|-------------|
| `--color` | `common.color` | `auto`, `always`, `never` | `auto` | Controls output colorization |
| `-m, --report-mode` | `common.report_mode` | `full`, `one-line` | `full` | Controls diagnostic output format |
111+
99112
## Ignoring WDL files and directories
100113

101114
Sprocket is able to parse `.sprocketignore` files found in the current working

configuration/provenance.md

Lines changed: 264 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,264 @@
1+
# Provenance Tracking
2+
3+
Sprocket automatically tracks all workflow executions in a SQLite database while
4+
maintaining an organized filesystem structure for outputs. Both `sprocket run`
5+
and `sprocket dev server` share the same execution engine and output structure, so
6+
the concepts described here apply equally to both commands.
7+
8+
> [!NOTE]
9+
>
10+
> Provenance is a well-established research area within scientific workflow
11+
> systems. Formal models such as the
12+
> [W3C PROV](https://www.w3.org/TR/prov-overview/) family and its
13+
> workflow-oriented extension
14+
> [ProvONE](https://purl.dataone.org/provone-v1-dev) define rich vocabularies
15+
> for describing data lineage, activity chains, and agent relationships (cf.
16+
> [Ludäscher et al., 2016](https://link.springer.com/chapter/10.1007/978-3-319-40226-0_7);
17+
> [Deelman et al., 2018](https://journals.sagepub.com/doi/abs/10.1177/1094342017704893)).
18+
> Sprocket uses the term "provenance" more loosely here to describe its
19+
> execution tracking capabilities—recording what was run, with which inputs,
20+
> when, and by whom—rather than implementing the full data lineage and
21+
> dependency tracking described in those formal models.
22+
23+
For design details, see [RFC #3](https://github.com/stjude-rust-labs/rfcs/pull/3).
24+
25+
## Runs and indexes
26+
27+
The output directory contains two complementary directory hierarchies that
28+
together address a fundamental tension in workflow management: users need both a
29+
complete provenance record for reproducibility and auditing _and_ a simplified,
30+
domain-specific view for everyday access to results. Rather than forcing users to
31+
choose one or maintain both manually, Sprocket provides both automatically.
32+
33+
The **`runs/`** directory is the immutable record of truth. It organizes every
34+
execution chronologically by target name and timestamp, preserving the full
35+
history of inputs, outputs, and individual task attempts. This structure is
36+
append-only—Sprocket never modifies or removes previous runs—so it serves as
37+
a reliable audit trail. When a workflow is run multiple times, each execution
38+
receives its own timestamped directory, and the complete set of attempts is
39+
always available for inspection.
40+
41+
The **`index/`** directory is an optional, user-curated view layered on top of
42+
the runs. When the `--index-on` flag is provided, Sprocket creates symlinks
43+
under `index/` that point back into `runs/`, giving users a way to organize
44+
results by whatever dimension makes sense for their domain (e.g., by project or
45+
by experiment) without duplicating any data. Because the index
46+
consists entirely of relative symlinks, it adds negligible storage overhead and
47+
can be reconstructed from the provenance database at any time.
48+
49+
This separation means that the provenance record remains intact regardless of how
50+
the index evolves. Re-running a workflow with the same `--index-on` path updates
51+
the index symlinks to point to the latest results, but the previous run's
52+
directory under `runs/` is preserved, and the database records the full history
53+
of index changes. The design follows a principle of
54+
[progressive disclosure](https://en.wikipedia.org/wiki/Progressive_disclosure):
55+
users who simply run `sprocket run` get a well-organized `runs/` directory and a
56+
provenance database with no extra configuration, and those who need logical
57+
organization can opt into indexing by adding a single flag.
58+
59+
## Output directory
60+
61+
By default, Sprocket creates an `out/` directory in your current working
62+
directory to store all workflow outputs and provenance data. This location can
63+
be configured via:
64+
65+
- The `-o, --output-dir` CLI flag (for `sprocket run`).
66+
- The `-o, --output-directory` CLI flag (for `sprocket dev server`).
67+
- The `run.output_dir` configuration option (for run mode).
68+
- The `server.output_directory` configuration option (for server mode).
69+
70+
### Directory structure
71+
72+
The layout within each run directory differs slightly depending on whether the
73+
target is a standalone task or a workflow containing multiple task calls.
74+
75+
#### Task runs
76+
77+
When running a task directly, the `attempts/` directory sits at the top level of
78+
the run directory.
79+
80+
```
./out/
├── sprocket.db                      # SQLite provenance database
├── output.log                       # Execution log
├── runs/
│   └── <target>/
│       ├── <timestamp>/             # Individual run (YYYY-MM-DD_HHMMSSffffff)
│       │   ├── inputs.json          # Serialized inputs for the run
│       │   ├── outputs.json         # Serialized outputs from the run
│       │   ├── tmp/                 # Temporary localization files
│       │   └── attempts/
│       │       └── <n>/             # Attempt number (0, 1, 2, ...)
│       │           ├── command      # Executed shell script
│       │           ├── stdout       # Task standard output
│       │           ├── stderr       # Task standard error
│       │           └── work/        # Task working directory
│       └── _latest -> <timestamp>/  # Symlink to most recent run
└── index/                           # Optional output indexing
    └── <output_name>/
        └── outputs.json             # Symlink to run outputs
```
101+
102+
#### Workflow runs
103+
104+
When running a workflow, each task call within the workflow gets its own
105+
subdirectory under `calls/`. Each call directory then contains the same
106+
`attempts/` and `tmp/` structure as a standalone task run.
107+
108+
```
./out/
├── sprocket.db
├── output.log
├── runs/
│   └── <target>/
│       ├── <timestamp>/
│       │   ├── inputs.json
│       │   ├── outputs.json
│       │   ├── tmp/                   # Workflow-level temporary files
│       │   └── calls/                 # Task execution directories
│       │       └── <task_call_id>/    # One per task call in the workflow
│       │           ├── tmp/           # Task-level temporary files
│       │           └── attempts/
│       │               └── <n>/
│       │                   ├── command
│       │                   ├── stdout
│       │                   ├── stderr
│       │                   └── work/
│       └── _latest -> <timestamp>/
└── index/
    └── <output_name>/
        └── outputs.json
```
132+
133+
### The `_latest` symlink
134+
135+
For each target, Sprocket maintains a `_latest` symlink pointing to the most
136+
recent execution directory. This provides quick access to the latest results
137+
without needing to know the exact timestamp.
138+
139+
```shell
140+
# Access the latest run outputs
141+
ls out/runs/my_workflow/_latest/
142+
```
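
As a minimal sketch of using `_latest` from a script, the Python helper below resolves the link and returns `None` when it is absent (for example on Windows, where the link may be omitted). The function name `latest_run_dir` and the `out` and `my_workflow` arguments are illustrative assumptions, not part of Sprocket.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(output_dir: str, target: str) -> Optional[Path]:
    """Return the run directory that `_latest` points to, or None if the link is absent.

    Hypothetical helper: it relies only on the documented
    `runs/<target>/_latest` symlink and nothing else about the layout.
    """
    link = Path(output_dir) / "runs" / target / "_latest"
    if not link.is_symlink():
        # The link may be missing, e.g. when symlink creation failed on Windows.
        return None
    return link.resolve()


if __name__ == "__main__":
    run_dir = latest_run_dir("out", "my_workflow")
    print(run_dir if run_dir is not None else "no _latest symlink found")
```

Returning `None` rather than raising keeps the caller in control of the fallback, such as asking the user for an explicit timestamp.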
143+
144+
> [!NOTE]
145+
>
146+
> On Windows, creating symlinks may require administrator privileges or
147+
> Developer Mode. If symlink creation fails, the `_latest` symlink will be
148+
> omitted but workflow execution will continue normally.
149+
150+
## Provenance database
151+
152+
The `sprocket.db` SQLite database tracks all workflow executions, including:
153+
154+
- **Sessions**: Groups of related workflow submissions.
155+
- **Runs**: Individual workflow executions with inputs, outputs, and status.
156+
- **Tasks**: Individual task executions within a workflow run.
157+
158+
## Run contents
159+
160+
Each run creates a timestamped directory under `runs/<target>/` containing
161+
the following:
162+
163+
| File/Directory | Description |
|----------------|-------------|
| `output.log` | Log of all messages emitted during the run |
| `inputs.json` | Serialized inputs provided for the run |
| `outputs.json` | Serialized outputs produced by the run |
| `tmp/` | Temporary files used during input localization |
| `attempts/` | Directory containing attempt subdirectories (task runs) |
| `calls/` | Directory containing per-task-call subdirectories (workflow runs) |
| `attempts/<n>/command` | The shell script that was executed |
| `attempts/<n>/stdout` | Standard output from the task |
| `attempts/<n>/stderr` | Standard error from the task |
| `attempts/<n>/work/` | Task working directory containing output files |
175+
176+
### Retries
177+
178+
When a task fails and is retried, each attempt gets its own numbered
179+
subdirectory under `attempts/`. This preserves the complete history of all
180+
execution attempts, which is valuable for debugging intermittent failures.
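
For ad hoc debugging, the attempts can be walked in numeric order. The sketch below is a hypothetical helper (`list_attempts` is not a Sprocket API) and assumes only that attempt directories are named with plain integers, as in the trees above. A plain string sort would place `10` before `2`, so it sorts on the integer value.

```python
from pathlib import Path
from typing import List


def list_attempts(attempts_dir: str) -> List[Path]:
    """Return the numbered attempt directories in ascending attempt order."""
    root = Path(attempts_dir)
    # Keep only directories whose names are plain integers (0, 1, 2, ...).
    numbered = [d for d in root.iterdir() if d.is_dir() and d.name.isdigit()]
    # Sort numerically so that attempt 10 comes after attempt 2.
    return sorted(numbered, key=lambda d: int(d.name))


if __name__ == "__main__":
    # Hypothetical path; substitute a real run directory.
    attempts = Path("out/runs/my_task/_latest/attempts")
    if attempts.is_dir():
        for attempt in list_attempts(str(attempts)):
            stderr = attempt / "stderr"
            size = stderr.stat().st_size if stderr.exists() else 0
            print(f"attempt {attempt.name}: {size} bytes of stderr")
```

This is meant for interactive inspection only, since the layout under `runs/` may change between versions.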
181+
182+
## Output indexing
183+
184+
When the `--index-on` flag is provided, Sprocket indexes run outputs by the
185+
specified output name. For each run, a symlink is created under
186+
`index/<output_name>/` pointing to the run's `outputs.json` file. This enables
187+
efficient lookup of runs by output values without scanning the entire `runs/`
188+
directory.
189+
190+
```shell
191+
# Run a workflow with output indexing on the `greeting` output
192+
sprocket run hello.wdl -t hello --index-on greeting
193+
```
194+
195+
The resulting index entry is a relative symlink:
196+
197+
```
198+
index/greeting/outputs.json -> ../../runs/hello/<timestamp>/outputs.json
199+
```
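
To illustrate how a relative link of this shape can be formed, here is a minimal Python sketch. It is not Sprocket's implementation: `link_index_entry` and its parameters are hypothetical names, and the only thing taken from the documentation is the `index/<output_name>/outputs.json` location and the relative target.

```python
import os
from pathlib import Path


def link_index_entry(output_dir: str, output_name: str, run_outputs: str) -> str:
    """Create index/<output_name>/outputs.json as a relative symlink.

    Returns the relative target that was written into the link.
    """
    index_dir = Path(output_dir) / "index" / output_name
    index_dir.mkdir(parents=True, exist_ok=True)
    link = index_dir / "outputs.json"

    # Express the target relative to the directory that holds the link,
    # so the link stays valid if the whole output directory is moved.
    rel_target = os.path.relpath(run_outputs, start=index_dir)

    # Replace any earlier entry so the index points at the newest run.
    if link.is_symlink() or link.exists():
        link.unlink()
    link.symlink_to(rel_target)
    return rel_target
```

Computing the target with `os.path.relpath` from the link's own directory is what yields the `../../runs/...` form shown above.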
200+
201+
## Portability
202+
203+
The entire output directory is designed to be portable:
204+
205+
- All paths stored in the database are relative to the database file.
206+
- Symlinks (including index entries) use relative paths.
207+
- Moving the `out/` directory with `mv` or `rsync` preserves all relationships (with `rsync`, use an option that copies symlinks as symlinks, such as `-a`).
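
One way to confirm that a moved or restored output directory is still intact is to scan it for dangling symlinks. The following is a small sketch, assuming only that the directory contains symlinks; `broken_symlinks` is a hypothetical helper, not a Sprocket command.

```python
import os
from typing import List


def broken_symlinks(root: str) -> List[str]:
    """Return the paths of symlinks under `root` whose targets do not exist."""
    broken = []
    # os.walk does not descend into symlinked directories by default,
    # so each link is inspected once and cycles are not followed.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # os.path.exists follows the link, so it is False for a dangling one.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)


if __name__ == "__main__":
    for path in broken_symlinks("out"):
        print(f"dangling symlink: {path}")
```

An empty result after a move is consistent with every link being relative and every target having been copied along with it.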
208+
209+
## Concurrent access
210+
211+
Both `sprocket run` and `sprocket dev server` share the same execution engine and
212+
can operate on the same output directory simultaneously:
213+
214+
- SQLite's write-ahead logging (WAL) mode lets readers continue while a write is in progress.
215+
- Database locks are held briefly (milliseconds per transaction).
216+
- A workflow submitted via CLI is immediately visible to the server.
217+
- All workflows share the same database regardless of submission method.
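
The effect of WAL mode can be seen with Python's standard `sqlite3` module on a scratch database. This sketch does not touch `sprocket.db` or assume anything about its schema; the `demo` table and the `open_wal` helper are made up for illustration.

```python
import sqlite3
from typing import Tuple


def open_wal(path: str) -> Tuple[sqlite3.Connection, str]:
    """Open a database, switch it to WAL mode, and return (connection, journal mode)."""
    conn = sqlite3.connect(path, timeout=5.0)
    # The pragma returns the journal mode now in effect ("wal" on success).
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode


if __name__ == "__main__":
    import os
    import tempfile

    db_path = os.path.join(tempfile.mkdtemp(), "scratch.db")
    writer, mode = open_wal(db_path)
    print("journal mode:", mode)

    writer.execute("CREATE TABLE demo (id INTEGER)")
    writer.execute("INSERT INTO demo VALUES (1)")
    writer.commit()

    # Start a second write but leave its transaction open.
    writer.execute("INSERT INTO demo VALUES (2)")

    # A separate connection can still read the last committed state.
    reader = sqlite3.connect(db_path)
    print("rows visible to reader:", reader.execute("SELECT COUNT(*) FROM demo").fetchone()[0])

    writer.commit()
    reader.close()
    writer.close()
```

The reader sees only committed rows and is not blocked by the open write transaction, which is the property the list above relies on.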
218+
219+
## Best practices
220+
221+
### Organizing output directories
222+
223+
Use a dedicated output directory for each project or analysis domain. This keeps
224+
provenance data isolated, makes backups straightforward, and avoids confusion
225+
when multiple unrelated workflows share the same `runs/` hierarchy.
226+
227+
```shell
228+
sprocket run pipeline_a.wdl -o ./pipeline-a-out ...
229+
sprocket run pipeline_b.wdl -o ./pipeline-b-out ...
230+
```
231+
232+
### Querying execution history
233+
234+
The REST API (available via `sprocket dev server`) is the recommended way to query
235+
execution history. The API provides endpoints for listing sessions, runs, and
236+
tasks with filtering capabilities. See the
237+
[server documentation](/subcommands/server) for endpoint details and the
238+
interactive Swagger UI at `/api/v1/swagger-ui` for exploration.
239+
240+
Avoid parsing the `runs/` directory structure directly for programmatic access.
241+
The layout within `runs/` is an implementation detail that may evolve, whereas
242+
the API provides a stable interface. The `index/` directory, on the other hand,
243+
is user-assembled via `--index-on` and is designed to be consumed directly.
244+
245+
### Backing up provenance data
246+
247+
The output directory is self-contained: backing up the entire `out/` directory
248+
(including `sprocket.db` and the `runs/` and `index/` hierarchies) captures the
249+
full provenance record. Because all paths in the database and all symlinks are
250+
relative, a backup can be restored to any location without reconfiguration.
251+
252+
When backing up a live system, be aware that SQLite WAL mode uses auxiliary
253+
files (`sprocket.db-wal` and `sprocket.db-shm`). For a consistent backup,
254+
either stop active workflows first, or use SQLite's
255+
[backup API](https://www.sqlite.org/backup.html) to safely copy the database
256+
while it is in use.
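
A minimal sketch of the second option, using the `Connection.backup` method that Python's standard `sqlite3` module exposes over SQLite's backup API. The helper name `backup_database` and the file paths are assumptions for illustration. This copies only the database, so the `runs/` and `index/` hierarchies still need to be copied separately.

```python
import sqlite3


def backup_database(src_path: str, dest_path: str) -> None:
    """Copy a SQLite database to `dest_path` using the online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Copies every page of the source into the destination. The source
        # may stay in use by other connections while this runs.
        src.backup(dest)
    finally:
        dest.close()
        src.close()


if __name__ == "__main__":
    import os

    # Hypothetical paths; adjust to the actual output directory.
    if os.path.exists("out/sprocket.db"):
        backup_database("out/sprocket.db", "sprocket-backup.db")
```

The resulting file is a standalone database and does not need the `-wal` or `-shm` files from the source.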
257+
258+
### Preserving the `runs/` directory
259+
260+
The `runs/` hierarchy is the immutable record of truth for all workflow
261+
executions. Do not modify, rename, or delete files within it, as doing so may
262+
invalidate provenance records and break index symlinks that reference those
263+
paths. If disk space becomes a concern, consider archiving older runs rather
264+
than deleting them.

guided-tour.md

Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -136,14 +136,14 @@ This will error right away, as we haven't told Sprocket which task or workflow
136136
to run.
137137

138138
```txt
139-
error: the `--entrypoint` option is required if no inputs are provided
139+
error: the `--target` option is required if no inputs are provided
140140
```
141141

142142
We want to run the "main" workflow defined in `example.wdl`, so we can try again
143-
but specify the entrypoint to use this time using the `--entrypoint` flag.
143+
but specify the target to use this time using the `--target` flag.
144144

145145
```shell
146-
sprocket run example.wdl --entrypoint main
146+
sprocket run example.wdl --target main
147147
```
148148

149149
After a few seconds, you'll see `sprocket` return an error.
@@ -178,19 +178,19 @@ than create many individual input files.
178178
sprocket run example.wdl hello_defaults.json main.name="Ari"
179179
```
180180

181-
Note that the above command does not specify an entrypoint with the `--entrypoint`
181+
Note that the above command does not specify a target with the `--target`
182182
flag. This is because every input is using fully qualified dot notation; each
183-
input is prefixed with the name of the entrypoint and a period, `main.`.
183+
input is prefixed with the name of the target and a period, `main.`.
184184
This fully qualified dot notation is required for inputs provided within a file.
185185
The dot notation can get repetitive if supplying many key-value pairs on the command line,
186-
so specifying `--entrypoint` allows you to omit the repeated part of the keys.
186+
so specifying `--target` allows you to omit the repeated prefix from those keys.
187187
:::
188188

189189
Here, we can specify the `name` parameter as a key-value pair on the command
190190
line.
191191

192192
```shell
193-
sprocket run example.wdl --entrypoint main name="World"
193+
sprocket run example.wdl --target main name="World"
194194
```
195195

196196
After a few seconds, this job runs successfully with the following outputs.

subcommands/check-lint.md

Lines changed: 35 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -22,4 +22,38 @@ With respect to emitting warnings, there are two levels of warnings in Sprocket:
2222
(which enables the lint warnings).
2323

2424
`sprocket lint` emits both validation warnings and lint warnings — it is
25-
essentially an alias for `sprocket check -l`.
25+
essentially an alias for `sprocket check -l`.
26+
27+
## Rule configuration
28+
29+
Individual lint rules can be configured via the `[check.lint]` section in
30+
`sprocket.toml`. Currently, the following options are supported:
31+
32+
| Option | Type | Description |
|--------|------|-------------|
| `allowed_runtime_keys` | List of strings | Additional runtime keys to allow beyond the WDL specification defaults (used by the `ExpectedRuntimeKeys` rule) |
35+
36+
```toml
37+
[check.lint]
38+
allowed_runtime_keys = ["gpu", "queue"]
39+
```
40+
41+
## Filtering lint rules
42+
43+
The set of active lint rules can be controlled via the `[check]` section in
44+
`sprocket.toml`:
45+
46+
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `except` | List | `[]` | Rule IDs to exclude from running |
| `all_lint_rules` | Boolean | `false` | Enable all lint rules, including those outside the default set |
| `only_lint_tags` | List | `[]` | Restrict linting to rules with these tags |
| `filter_lint_tags` | List | `[]` | Exclude rules with these tags |
52+
53+
For example, to enable all rules except `ContainerUri`:
54+
55+
```toml
56+
[check]
57+
all_lint_rules = true
58+
except = ["ContainerUri"]
59+
```
