2 changes: 2 additions & 0 deletions .vitepress/config.mts
@@ -61,6 +61,7 @@ export default defineConfig({
],
},
{ text: "Call Cache", link: "/configuration/cache", docFooterText: "Configuration > Call Cache" },
{ text: "Provenance Tracking", link: "/configuration/provenance", docFooterText: "Configuration > Provenance Tracking" },
],
},
{
@@ -79,6 +80,7 @@
text: "Experimental Commands", collapsed: true, items: [
{ text: "doc", link: "/subcommands/doc", docFooterText: "Experimental Commands > doc" },
{ text: "lock", link: "/subcommands/lock", docFooterText: "Experimental Commands > lock" },
{ text: "server", link: "/subcommands/server", docFooterText: "Experimental Commands > server" },
{ text: "test", link: "/subcommands/test", docFooterText: "Experimental Commands > test" },
]
},
263 changes: 263 additions & 0 deletions configuration/provenance.md
@@ -0,0 +1,263 @@
# Provenance Tracking

Sprocket automatically tracks all workflow executions in a SQLite database while
maintaining an organized filesystem structure for outputs. Both `sprocket run`
and `sprocket server` share the same execution engine and output structure, so
the concepts described here apply equally to both commands.

> [!NOTE]
>
> Provenance is a well-established research area within scientific workflow
> systems. Formal models such as the
> [W3C PROV](https://www.w3.org/TR/prov-overview/) family and its
> workflow-oriented extension
> [ProvONE](https://purl.dataone.org/provone-v1-dev) define rich vocabularies
> for describing data lineage, activity chains, and agent relationships (cf.
> [Ludäscher et al., 2016](https://link.springer.com/chapter/10.1007/978-3-319-40226-0_7);
> [Deelman et al., 2018](https://journals.sagepub.com/doi/abs/10.1177/1094342017704893)).
> Sprocket uses the term "provenance" more loosely here to describe its
> execution tracking capabilities—recording what was run, with which inputs,
> when, and by whom—rather than implementing the full data lineage and
> dependency tracking described in those formal models.

For design details, see [RFC #3](https://github.com/stjude-rust-labs/rfcs/pull/3).

## Runs and indexes

The output directory contains two complementary directory hierarchies that
together address a fundamental tension in workflow management: users need both a
complete provenance record for reproducibility and auditing _and_ a simplified,
domain-specific view for everyday access to results. Rather than forcing users to
choose one or maintain both manually, Sprocket provides both automatically.

The **`runs/`** directory is the immutable record of truth. It organizes every
execution chronologically by target name and timestamp, preserving the full
history of inputs, outputs, and individual task attempts. This structure is
append-only—Sprocket never modifies or removes previous runs—so it serves as
a reliable audit trail. When a workflow is run multiple times, each execution
receives its own timestamped directory, and the complete set of attempts is
always available for inspection.

The **`index/`** directory is an optional, user-curated view layered on top of
the runs. When the `--index-on` flag is provided, Sprocket creates symlinks
under `index/` that point back into `runs/`, giving users a way to organize
results by whatever dimension makes sense for their domain (e.g., by project or
by experiment) without duplicating any data. Because the index
consists entirely of relative symlinks, it adds negligible storage overhead and
can be reconstructed from the provenance database at any time.

This separation means that the provenance record remains intact regardless of how
the index evolves. Re-running a workflow with the same `--index-on` path updates
the index symlinks to point to the latest results, but the previous run's
directory under `runs/` is preserved, and the database records the full history
of index changes. The design follows a principle of
[progressive disclosure](https://en.wikipedia.org/wiki/Progressive_disclosure):
users who simply run `sprocket run` get a well-organized `runs/` directory and a
provenance database with no extra configuration, and those who need logical
organization can opt into indexing by adding a single flag.
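
For example, running the same target twice with the same index path leaves both
timestamped run directories in place while the index entry moves forward. A small
sketch, using the `hello.wdl` example shown later on this page:

```shell
# First and second executions: each gets its own directory under runs/
sprocket run hello.wdl -t hello --index-on greeting
sprocket run hello.wdl -t hello --index-on greeting

# Both timestamped run directories (plus _latest) are preserved
ls out/runs/hello/

# The index entry now resolves into the most recent run
readlink out/index/greeting/outputs.json
```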

## Output directory

By default, Sprocket creates an `out/` directory in your current working
directory to store all workflow outputs and provenance data. This location can
be configured via:

- The `-o, --output-dir` CLI flag (for `sprocket run`).
- The `-o, --output-directory` CLI flag (for `sprocket server`).
- The `run.output_dir` configuration option (for run mode).
- The `server.output_directory` configuration option (for server mode).

### Directory structure

The layout within each run directory differs slightly depending on whether the
target is a standalone task or a workflow containing multiple task calls.

#### Task runs

When running a task directly, the `attempts/` directory sits at the top level of
the run directory.

```
./out/
├── sprocket.db                      # SQLite provenance database
├── output.log                       # Execution log
├── runs/
│   └── <target>/
│       ├── <timestamp>/             # Individual run (YYYY-MM-DD_HHMMSSffffff)
│       │   ├── inputs.json          # Serialized inputs for the run
│       │   ├── outputs.json         # Serialized outputs from the run
│       │   ├── tmp/                 # Temporary localization files
│       │   └── attempts/
│       │       └── <n>/             # Attempt number (0, 1, 2, ...)
│       │           ├── command      # Executed shell script
│       │           ├── stdout       # Task standard output
│       │           ├── stderr       # Task standard error
│       │           └── work/        # Task working directory
│       └── _latest -> <timestamp>/  # Symlink to most recent run
└── index/                           # Optional output indexing
    └── <output_name>/
        └── outputs.json             # Symlink to run outputs
```

#### Workflow runs

When running a workflow, each task call within the workflow gets its own
subdirectory under `calls/`. Each call directory then contains the same
`attempts/` and `tmp/` structure as a standalone task run.

```
./out/
├── sprocket.db
├── output.log
├── runs/
│   └── <target>/
│       ├── <timestamp>/
│       │   ├── inputs.json
│       │   ├── outputs.json
│       │   ├── tmp/                 # Workflow-level temporary files
│       │   └── calls/               # Task execution directories
│       │       └── <task_call_id>/  # One per task call in the workflow
│       │           ├── tmp/         # Task-level temporary files
│       │           └── attempts/
│       │               └── <n>/
│       │                   ├── command
│       │                   ├── stdout
│       │                   ├── stderr
│       │                   └── work/
│       └── _latest -> <timestamp>/
└── index/
    └── <output_name>/
        └── outputs.json
```

### The `_latest` symlink

For each target, Sprocket maintains a `_latest` symlink pointing to the most
recent execution directory. This provides quick access to the latest results
without needing to know the exact timestamp.

```shell
# Access the latest run outputs
ls out/runs/my_workflow/_latest/
```
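
Since each run directory contains an `outputs.json`, the latest results can also
be read directly through the symlink. A small sketch, assuming `jq` is installed:

```shell
# Pretty-print the outputs of the most recent run of `my_workflow`
jq . out/runs/my_workflow/_latest/outputs.json
```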

> [!NOTE]
>
> On Windows, creating symlinks may require administrator privileges or
> Developer Mode. If symlink creation fails, the `_latest` symlink will be
> omitted but workflow execution will continue normally.

## Provenance database

The `sprocket.db` SQLite database tracks all workflow executions, including:

- **Sessions**: Groups of related workflow submissions.
- **Runs**: Individual workflow executions with inputs, outputs, and status.
- **Tasks**: Individual task executions within a workflow run.
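
Although the REST API (see the best practices below) is the recommended query
interface, the database can also be inspected directly with standard SQLite
tooling. The schema itself is an implementation detail, so the sketch below
discovers table names rather than assuming them:

```shell
# List the tables in the provenance database without modifying it
sqlite3 -readonly out/sprocket.db '.tables'

# Dump the full schema for closer inspection
sqlite3 -readonly out/sprocket.db '.schema'
```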

## Run contents

Each run creates a timestamped directory under `runs/<target>/` containing
the following:

| File/Directory | Description |
|----------------|-------------|
| `inputs.json` | Serialized inputs provided for the run |
| `outputs.json` | Serialized outputs produced by the run |
| `tmp/` | Temporary files used during input localization |
| `attempts/` | Directory containing attempt subdirectories (task runs) |
| `calls/` | Directory containing per-task-call subdirectories (workflow runs) |
| `attempts/<n>/command` | The shell script that was executed |
| `attempts/<n>/stdout` | Standard output from the task |
| `attempts/<n>/stderr` | Standard error from the task |
| `attempts/<n>/work/` | Task working directory containing output files |

### Retries

When a task fails and is retried, each attempt gets its own numbered
subdirectory under `attempts/`. This preserves the complete history of all
execution attempts, which is valuable for debugging intermittent failures.
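
For a standalone task run, the attempts can be compared side by side; the
`my_task` target name below is illustrative (for workflow runs, the same layout
appears under `calls/<task_call_id>/`):

```shell
# Print the standard error of every attempt of the latest `my_task` run
for attempt in out/runs/my_task/_latest/attempts/*/; do
  echo "=== ${attempt} ==="
  cat "${attempt}stderr"
done
```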

## Output indexing

When the `--index-on` flag is provided, Sprocket indexes run outputs by the
specified output name. For each run, a symlink is created under
`index/<output_name>/` pointing to the run's `outputs.json` file. This enables
efficient lookup of runs by output values without scanning the entire `runs/`
directory.

```shell
# Run a workflow with output indexing on the `greeting` output
sprocket run hello.wdl -t hello --index-on greeting
```

The resulting index entry is a relative symlink:

```
index/greeting/outputs.json -> ../../runs/hello/<timestamp>/outputs.json
```
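
An index entry can be resolved back to the run it currently points at (on
systems with GNU coreutils):

```shell
# Print the fully resolved path of the outputs behind the `greeting` index entry
readlink -f out/index/greeting/outputs.json
```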

## Portability

The entire output directory is designed to be portable:

- All paths stored in the database are relative to the database file.
- Symlinks (including index entries) use relative paths.
- Moving the `out/` directory with `mv` or `rsync` preserves all relationships.
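
For example, relocating the directory preserves the database and every symlink
because nothing inside it depends on an absolute path (the destination below is
illustrative):

```shell
# -a preserves symlinks as symlinks, so relative index and _latest links survive
rsync -a out/ /archive/project-out/
```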

## Concurrent access

Both `sprocket run` and `sprocket server` share the same execution engine and
can operate on the same output directory simultaneously:

- The database uses SQLite's write-ahead logging (WAL) mode, which enables concurrent access.
- Database locks are held briefly (milliseconds per transaction).
- A workflow submitted via CLI is immediately visible to the server.
- All workflows share the same database regardless of submission method.
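
For example, a long-running server and ad-hoc CLI submissions can share one
directory. A minimal sketch; the server may accept or require additional
options, so see the [server documentation](/subcommands/server) for details:

```shell
# Terminal 1: serve an existing output directory
sprocket server --output-directory ./out

# Terminal 2: submit a run against the same directory; it is recorded in the
# shared sprocket.db and therefore visible to the server
sprocket run example.wdl --target main -o ./out
```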

## Best practices

### Organizing output directories

Use a dedicated output directory for each project or analysis domain. This keeps
provenance data isolated, makes backups straightforward, and avoids confusion
when multiple unrelated workflows share the same `runs/` hierarchy.

```shell
sprocket run pipeline_a.wdl -o ./pipeline-a-out ...
sprocket run pipeline_b.wdl -o ./pipeline-b-out ...
```

### Querying execution history

The REST API (available via `sprocket server`) is the recommended way to query
execution history. The API provides endpoints for listing sessions, runs, and
tasks with filtering capabilities. See the
[server documentation](/subcommands/server) for endpoint details and the
interactive Swagger UI at `/api/v1/swagger-ui` for exploration.

Avoid parsing the `runs/` directory structure directly for programmatic access.
The layout within `runs/` is an implementation detail that may evolve, whereas
the API provides a stable interface. The `index/` directory, on the other hand,
is user-assembled via `--index-on` and is designed to be consumed directly.

### Backing up provenance data

The output directory is self-contained: backing up the entire `out/` directory
(including `sprocket.db` and the `runs/` and `index/` hierarchies) captures the
full provenance record. Because all paths in the database and all symlinks are
relative, a backup can be restored to any location without reconfiguration.

When backing up a live system, be aware that SQLite WAL mode uses auxiliary
files (`sprocket.db-wal` and `sprocket.db-shm`). For a consistent backup,
either stop active workflows first, or use SQLite's
[backup API](https://www.sqlite.org/backup.html) to safely copy the database
while it is in use.
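
A sketch of a live backup using the `sqlite3` shell's `.backup` command,
followed by copying the filesystem hierarchies (destination paths are
illustrative):

```shell
# Snapshot the database consistently, even while workflows are running
mkdir -p /backups/project-out
sqlite3 out/sprocket.db ".backup /backups/project-out/sprocket.db"

# Copy the run and index hierarchies; -a keeps relative symlinks intact
rsync -a out/runs/ /backups/project-out/runs/
rsync -a out/index/ /backups/project-out/index/
```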

### Preserving the `runs/` directory

The `runs/` hierarchy is the immutable record of truth for all workflow
executions. Do not modify, rename, or delete files within it, as doing so may
invalidate provenance records and break index symlinks that reference those
paths. If disk space becomes a concern, consider archiving older runs rather
than deleting them.
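
For example, an older run can be bundled into a standalone archive (the
timestamp below is hypothetical); verify the archive before removing anything
from `runs/`:

```shell
# Compress a single old run without touching the live hierarchy
tar -czf hello_2024-01-15_101530123456.tar.gz \
    -C out/runs/hello 2024-01-15_101530123456
```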
14 changes: 7 additions & 7 deletions guided-tour.md
@@ -136,14 +136,14 @@ This will error right away, as we haven't told Sprocket which task or workflow
to run.

```txt
error: the `--entrypoint` option is required if no inputs are provided
error: the `--target` option is required if no inputs are provided
```

We want to run the "main" workflow defined in `example.wdl`, so we can try again
but specify the entrypoint to use this time using the `--entrypoint` flag.
but specify the target to use this time using the `--target` flag.

```shell
sprocket run example.wdl --entrypoint main
sprocket run example.wdl --target main
```

After a few seconds, you'll see `sprocket` return an error.
@@ -178,19 +178,19 @@ than create many individual input files.
sprocket run example.wdl hello_defaults.json main.name="Ari"
```

Note that the above command does not specify an entrypoint with the `--entrypoint`
Note that the above command does not specify a target with the `--target`
flag. This is because every input is using fully qualified dot notation; each
input is prefixed with the name of the entrypoint and a period, `main.`.
input is prefixed with the name of the target and a period, `main.`.
This fully qualified dot notation is required for inputs provided within a file.
The dot notation can get repetitive if supplying many key value pairs on the command line,
so specifying `--entrypoint` allows you to omit the repeated part of the keys.
so specifying `--target` allows you to omit the repeated part of the keys on the command line.
:::

Here, we can specify the `name` parameter as a key-value pair on the command
line.

```shell
sprocket run example.wdl --entrypoint main name="World"
sprocket run example.wdl --target main name="World"
```

After a few seconds, this job runs successfully with the following outputs.
46 changes: 36 additions & 10 deletions subcommands/run.md
@@ -10,22 +10,22 @@ See the section on [execution backends](/configuration/backends/overview.md) to
learn more about configuring Sprocket to execute tasks in different
environments.

## Entrypoints
## Targets

The task or workflow to run can be provided explicitly with
the `--entrypoint` argument.
the `--target` argument.

```shell
sprocket run --entrypoint main example.wdl
sprocket run --target main example.wdl
```

Whether or not this argument is _required_ is based on whether inputs are
provided to Sprocket from which the entrypoint can be inferred (e.g., providing
an input of `main.is_pirate` implies an entrypoint of `main`). Conversely, if
you supply an `--entrypoint`, you don't have to prefix your inputs with the
entrypoint fully qualified name.
provided to Sprocket from which the target can be inferred (e.g., providing
an input of `main.is_pirate` implies a target of `main`). Conversely, if
you supply a `--target`, you don't have to prefix your command line inputs with the
target's fully qualified name.

Sprocket will indicate when it cannot infer the entrypoint.
Sprocket will indicate when it cannot infer the target.
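
For example, the following two invocations are equivalent: the first relies on
inference from the fully qualified input name, while the second names the target
explicitly so the prefix can be dropped.

```shell
# Target inferred from the `main.` prefix on the input
sprocket run example.wdl main.name="World"

# Target given explicitly, so the prefix is not needed
sprocket run example.wdl --target main name="World"
```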

## Inputs

@@ -54,7 +54,7 @@ from the guided tour, we can specify the `name`
parameter as a key-value pair on the command line.

```shell
sprocket run example.wdl --entrypoint main name="World"
sprocket run example.wdl --target main name="World"
```

After a few seconds, this job runs successfully with the following outputs.
@@ -100,4 +100,30 @@ This produces the following output.
"Ahoy, Sprocket!"
]
}
```
```

## Output directory

By default, `sprocket run` writes all execution artifacts to `./out`. This can
be changed with the `-o, --output-dir` flag.

```shell
sprocket run example.wdl --target main name="World" -o /path/to/output
```

Individual runs are stored at `<output_dir>/runs/<target>/<timestamp>/`, and a
`_latest` symlink is maintained for each target pointing to its most recent run.
The output directory also contains a SQLite provenance database (`sprocket.db`)
that tracks all executions.

The `--index-on` flag enables output indexing, creating symlinks under
`<output_dir>/index/<output_name>/` for efficient lookup of runs by output
values.

```shell
sprocket run hello.wdl -t hello --index-on greeting
```

For full details on the output directory structure, provenance database, and
output indexing, see the
[Provenance Tracking](/configuration/provenance) documentation.