diff --git a/.vitepress/config.mts b/.vitepress/config.mts index 40539ad..e1b55e0 100644 --- a/.vitepress/config.mts +++ b/.vitepress/config.mts @@ -61,6 +61,7 @@ export default defineConfig({ ], }, { text: "Call Cache", link: "/configuration/cache", docFooterText: "Configuration > Call Cache" }, + { text: "Provenance Tracking", link: "/configuration/provenance", docFooterText: "Configuration > Provenance Tracking" }, ], }, { @@ -79,6 +80,7 @@ export default defineConfig({ text: "Experimental Commands", collapsed: true, items: [ { text: "doc", link: "/subcommands/doc", docFooterText: "Experimental Commands > doc" }, { text: "lock", link: "/subcommands/lock", docFooterText: "Experimental Commands > lock" }, + { text: "server", link: "/subcommands/server", docFooterText: "Experimental Commands > server" }, { text: "test", link: "/subcommands/test", docFooterText: "Experimental Commands > test" }, ] }, diff --git a/configuration/overview.md b/configuration/overview.md index 1e25553..00030d5 100644 --- a/configuration/overview.md +++ b/configuration/overview.md @@ -23,6 +23,10 @@ channels (listed in order of the relative priority during loading). via the `SPROCKET_CONFIG` environment variable. * **Current working directory.** Sprocket will attempt to load a `sprocket.toml` within the current working directory when the `sprocket` command runs. +* **Executable-adjacent configuration.** Sprocket will attempt to load a + `sprocket.toml` located in the same directory as the `sprocket` executable. + This is useful for bundled or deployed installations where a default + configuration should travel with the binary. * **System-wide configuration locations.** See [the section below](#system-wide-configuration-locations) on how to use the system-wide configuration directory. @@ -96,6 +100,15 @@ the command line. This will disable the searching for and loading of configurati files. The only configuration loaded will be that (if) specified by the `--config` command line argument. 
+## Global options + +Sprocket provides a few options that apply across all subcommands. + +| Option | Config Key | Values | Default | Description | +|--------|-----------|--------|---------|-------------| +| `--color` | `common.color` | `auto`, `always`, `never` | `auto` | Controls output colorization | +| `-m, --report-mode` | `common.report_mode` | `full`, `one-line` | `full` | Controls diagnostic output format | + ## Ignoring WDL files and directories Sprocket is able to parse `.sprocketignore` files found in the current working diff --git a/configuration/provenance.md b/configuration/provenance.md new file mode 100644 index 0000000..0005b9f --- /dev/null +++ b/configuration/provenance.md @@ -0,0 +1,264 @@ +# Provenance Tracking + +Sprocket automatically tracks all workflow executions in a SQLite database while +maintaining an organized filesystem structure for outputs. Both `sprocket run` +and `sprocket dev server` share the same execution engine and output structure, so +the concepts described here apply equally to both commands. + +> [!NOTE] +> +> Provenance is a well-established research area within scientific workflow +> systems. Formal models such as the +> [W3C PROV](https://www.w3.org/TR/prov-overview/) family and its +> workflow-oriented extension +> [ProvONE](https://purl.dataone.org/provone-v1-dev) define rich vocabularies +> for describing data lineage, activity chains, and agent relationships (cf. +> [Ludäscher et al., 2016](https://link.springer.com/chapter/10.1007/978-3-319-40226-0_7); +> [Deelman et al., 2018](https://journals.sagepub.com/doi/abs/10.1177/1094342017704893)). +> Sprocket uses the term "provenance" more loosely here to describe its +> execution tracking capabilities—recording what was run, with which inputs, +> when, and by whom—rather than implementing the full data lineage and +> dependency tracking described in those formal models. + +For design details, see [RFC #3](https://github.com/stjude-rust-labs/rfcs/pull/3). 
+ +## Runs and indexes + +The output directory contains two complementary directory hierarchies that +together address a fundamental tension in workflow management: users need both a +complete provenance record for reproducibility and auditing _and_ a simplified, +domain-specific view for everyday access to results. Rather than forcing users to +choose one or maintain both manually, Sprocket provides both automatically. + +The **`runs/`** directory is the immutable record of truth. It organizes every +execution chronologically by target name and timestamp, preserving the full +history of inputs, outputs, and individual task attempts. This structure is +append-only—Sprocket never modifies or removes previous runs—so it serves as +a reliable audit trail. When a workflow is run multiple times, each execution +receives its own timestamped directory, and the complete set of attempts is +always available for inspection. + +The **`index/`** directory is an optional, user-curated view layered on top of +the runs. When the `--index-on` flag is provided, Sprocket creates symlinks +under `index/` that point back into `runs/`, giving users a way to organize +results by whatever dimension makes sense for their domain (e.g., by project or +by experiment) without duplicating any data. Because the index +consists entirely of relative symlinks, it adds negligible storage overhead and +can be reconstructed from the provenance database at any time. + +This separation means that the provenance record remains intact regardless of how +the index evolves. Re-running a workflow with the same `--index-on` path updates +the index symlinks to point to the latest results, but the previous run's +directory under `runs/` is preserved, and the database records the full history +of index changes. 
The design follows a principle of +[progressive disclosure](https://en.wikipedia.org/wiki/Progressive_disclosure): +users who simply run `sprocket run` get a well-organized `runs/` directory and a +provenance database with no extra configuration, and those who need logical +organization can opt into indexing by adding a single flag. + +## Output directory + +By default, Sprocket creates an `out/` directory in your current working +directory to store all workflow outputs and provenance data. This location can +be configured via: + +- The `-o, --output-dir` CLI flag (for `sprocket run`). +- The `-o, --output-directory` CLI flag (for `sprocket dev server`). +- The `run.output_dir` configuration option (for run mode). +- The `server.output_directory` configuration option (for server mode). + +### Directory structure + +The layout within each run directory differs slightly depending on whether the +target is a standalone task or a workflow containing multiple task calls. + +#### Task runs + +When running a task directly, the `attempts/` directory sits at the top level of +the run directory. + +``` +./out/ +├── sprocket.db # SQLite provenance database +├── output.log # Execution log +├── runs/ +│ └── / +│ ├── / # Individual run (YYYY-MM-DD_HHMMSSffffff) +│ │ ├── inputs.json # Serialized inputs for the run +│ │ ├── outputs.json # Serialized outputs from the run +│ │ ├── tmp/ # Temporary localization files +│ │ └── attempts/ +│ │ └── / # Attempt number (0, 1, 2, ...) +│ │ ├── command # Executed shell script +│ │ ├── stdout # Task standard output +│ │ ├── stderr # Task standard error +│ │ └── work/ # Task working directory +│ └── _latest -> / # Symlink to most recent run +└── index/ # Optional output indexing + └── / + └── outputs.json # Symlink to run outputs +``` + +#### Workflow runs + +When running a workflow, each task call within the workflow gets its own +subdirectory under `calls/`. 
Each call directory then contains the same +`attempts/` and `tmp/` structure as a standalone task run. + +``` +./out/ +├── sprocket.db +├── output.log +├── runs/ +│ └── / +│ ├── / +│ │ ├── inputs.json +│ │ ├── outputs.json +│ │ ├── tmp/ # Workflow-level temporary files +│ │ └── calls/ # Task execution directories +│ │ └── / # One per task call in the workflow +│ │ ├── tmp/ # Task-level temporary files +│ │ └── attempts/ +│ │ └── / +│ │ ├── command +│ │ ├── stdout +│ │ ├── stderr +│ │ └── work/ +│ └── _latest -> / +└── index/ + └── / + └── outputs.json +``` + +### The `_latest` symlink + +For each target, Sprocket maintains a `_latest` symlink pointing to the most +recent execution directory. This provides quick access to the latest results +without needing to know the exact timestamp. + +```shell +# Access the latest run outputs +ls out/runs/my_workflow/_latest/ +``` + +> [!NOTE] +> +> On Windows, creating symlinks may require administrator privileges or +> Developer Mode. If symlink creation fails, the `_latest` symlink will be +> omitted but workflow execution will continue normally. + +## Provenance database + +The `sprocket.db` SQLite database tracks all workflow executions, including: + +- **Sessions**: Groups of related workflow submissions. +- **Runs**: Individual workflow executions with inputs, outputs, and status. +- **Tasks**: Individual task executions within a workflow run. 
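The schema of `sprocket.db` is not spelled out here, so the following is only a minimal sketch for inspecting it without assuming any table names. It opens the file read-only and lists whatever tables exist by querying SQLite's built-in `sqlite_master` catalog. The helper name `list_tables` is hypothetical and not part of Sprocket.

```python
import sqlite3
from pathlib import Path
from typing import List


def list_tables(db_path: Path) -> List[str]:
    """Return the names of all tables in a SQLite database, opened read-only."""
    # `mode=ro` asks SQLite to open the file without write access.
    uri = f"{db_path.resolve().as_uri()}?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling `list_tables(Path("out/sprocket.db"))` against the default output directory would print the actual table names, which may differ between Sprocket versions.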
+ +## Run contents + +Each run creates a timestamped directory under `runs//` containing +the following: + +| File/Directory | Description | +|----------------|-------------| +| `output.log` | Log of all messages emitted during the run | +| `inputs.json` | Serialized inputs provided for the run | +| `outputs.json` | Serialized outputs produced by the run | +| `tmp/` | Temporary files used during input localization | +| `attempts/` | Directory containing attempt subdirectories (task runs) | +| `calls/` | Directory containing per-task-call subdirectories (workflow runs) | +| `attempts//command` | The shell script that was executed | +| `attempts//stdout` | Standard output from the task | +| `attempts//stderr` | Standard error from the task | +| `attempts//work/` | Task working directory containing output files | + +### Retries + +When a task fails and is retried, each attempt gets its own numbered +subdirectory under `attempts/`. This preserves the complete history of all +execution attempts, which is valuable for debugging intermittent failures. + +## Output indexing + +When the `--index-on` flag is provided, Sprocket indexes run outputs by the +specified output name. For each run, a symlink is created under +`index//` pointing to the run's `outputs.json` file. This enables +efficient lookup of runs by output values without scanning the entire `runs/` +directory. + +```shell +# Run a workflow with output indexing on the `greeting` output +sprocket run hello.wdl -t hello --index-on greeting +``` + +The resulting index entry is a relative symlink: + +``` +index/greeting/outputs.json -> ../../runs/hello//outputs.json +``` + +## Portability + +The entire output directory is designed to be portable: + +- All paths stored in the database are relative to the database file. +- Symlinks (including index entries) use relative paths. +- Moving the `out/` directory with `mv` or `rsync` preserves all relationships. 
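The portability claim for symlinks can be illustrated with a small, self-contained sketch. It builds a throwaway tree shaped like the layout above, creates a relative index symlink, moves the whole directory, and reads the file back through the link. The run directory name `example-run` and the function name are placeholders for illustration; nothing here calls Sprocket itself.

```python
import os
import shutil
import tempfile
from pathlib import Path


def demo_relative_symlink_survives_move() -> str:
    """Build a tiny out/-like tree, move it, and read through the index symlink."""
    root = Path(tempfile.mkdtemp())
    out = root / "out"
    run_dir = out / "runs" / "hello" / "example-run"  # placeholder run directory
    index_dir = out / "index" / "greeting"
    run_dir.mkdir(parents=True)
    index_dir.mkdir(parents=True)
    (run_dir / "outputs.json").write_text('{"greeting": "hi"}')

    # Relative link target, mirroring the `../../runs/...` form of index entries.
    target = os.path.relpath(run_dir / "outputs.json", start=index_dir)
    (index_dir / "outputs.json").symlink_to(target)

    # Move the whole tree; the relative link still resolves inside it.
    moved = root / "moved"
    shutil.move(str(out), str(moved))
    return (moved / "index" / "greeting" / "outputs.json").read_text()
```

An absolute link target would have pointed at the old location and broken after the move, which is why relative targets matter for this property.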
+ +## Concurrent access + +Both `sprocket run` and `sprocket dev server` share the same execution engine and +can operate on the same output directory simultaneously: + +- The SQLite WAL mode enables concurrent access. +- Database locks are held briefly (milliseconds per transaction). +- A workflow submitted via CLI is immediately visible to the server. +- All workflows share the same database regardless of submission method. + +## Best practices + +### Organizing output directories + +Use a dedicated output directory for each project or analysis domain. This keeps +provenance data isolated, makes backups straightforward, and avoids confusion +when multiple unrelated workflows share the same `runs/` hierarchy. + +```shell +sprocket run pipeline_a.wdl -o ./pipeline-a-out ... +sprocket run pipeline_b.wdl -o ./pipeline-b-out ... +``` + +### Querying execution history + +The REST API (available via `sprocket dev server`) is the recommended way to query +execution history. The API provides endpoints for listing sessions, runs, and +tasks with filtering capabilities. See the +[server documentation](/subcommands/server) for endpoint details and the +interactive Swagger UI at `/api/v1/swagger-ui` for exploration. + +Avoid parsing the `runs/` directory structure directly for programmatic access. +The layout within `runs/` is an implementation detail that may evolve, whereas +the API provides a stable interface. The `index/` directory, on the other hand, +is user-assembled via `--index-on` and is designed to be consumed directly. + +### Backing up provenance data + +The output directory is self-contained: backing up the entire `out/` directory +(including `sprocket.db` and the `runs/` and `index/` hierarchies) captures the +full provenance record. Because all paths in the database and all symlinks are +relative, a backup can be restored to any location without reconfiguration. 
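For the database file itself, a plain file copy is only safe when nothing is writing to it. As a hedged sketch of a safer alternative, Python's standard `sqlite3.Connection.backup` method wraps SQLite's online backup API and can copy a database that is in use. The function name `backup_database` and the paths in the usage note are assumptions for illustration.

```python
import sqlite3
from pathlib import Path


def backup_database(src: Path, dest: Path) -> None:
    """Copy a SQLite database to `dest` using SQLite's online backup API."""
    source = sqlite3.connect(str(src))
    target = sqlite3.connect(str(dest))
    try:
        # Copies all pages from `source` into `target` in a consistent state.
        source.backup(target)
    finally:
        target.close()
        source.close()
```

For example, `backup_database(Path("out/sprocket.db"), Path("backup/sprocket.db"))` would write a standalone copy, assuming the `backup/` directory already exists. The `runs/` and `index/` hierarchies still need to be copied separately.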
+ +When backing up a live system, be aware that SQLite WAL mode uses auxiliary +files (`sprocket.db-wal` and `sprocket.db-shm`). For a consistent backup, +either stop active workflows first, or use SQLite's +[backup API](https://www.sqlite.org/backup.html) to safely copy the database +while it is in use. + +### Preserving the `runs/` directory + +The `runs/` hierarchy is the immutable record of truth for all workflow +executions. Do not modify, rename, or delete files within it, as doing so may +invalidate provenance records and break index symlinks that reference those +paths. If disk space becomes a concern, consider archiving older runs rather +than deleting them. diff --git a/guided-tour.md b/guided-tour.md index 95d0cf7..cde71da 100644 --- a/guided-tour.md +++ b/guided-tour.md @@ -136,14 +136,14 @@ This will error right away, as we haven't told Sprocket which task or workflow to run. ```txt -error: the `--entrypoint` option is required if no inputs are provided +error: the `--target` option is required if no inputs are provided ``` We want to run the "main" workflow defined in `example.wdl`, so we can try again -but specify the entrypoint to use this time using the `--entrypoint` flag. +but specify the target to use this time using the `--target` flag. ```shell -sprocket run example.wdl --entrypoint main +sprocket run example.wdl --target main ``` After a few seconds, you'll see `sprocket` return an error. @@ -178,19 +178,19 @@ than create many individual input files. sprocket run example.wdl hello_defaults.json main.name="Ari" ``` -Note that the above command does not specify an entrypoint with the `--entrypoint` +Note that the above command does not specify a target with the `--target` flag. This is because every input is using fully qualified dot notation; each -input is prefixed with the name of the entrypoint and a period, `main.`. +input is prefixed with the name of the target and a period, `main.`. 
This fully qualified dot notation is required for inputs provided within a file. The dot notation can get repetitive if supplying many key value pairs on the command line, -so specifying `--entrypoint` allows you to omit the repeated part of the keys. +so specifying `--target` allows you to omit the repeated part of the keys on the command line. ::: Here, we can specify the `name` parameter as a key-value pair on the command line. ```shell -sprocket run example.wdl --entrypoint main name="World" +sprocket run example.wdl --target main name="World" ``` After a few seconds, this job runs successfully with the following outputs. diff --git a/subcommands/check-lint.md b/subcommands/check-lint.md index d35062d..116c35a 100644 --- a/subcommands/check-lint.md +++ b/subcommands/check-lint.md @@ -22,4 +22,38 @@ With respect to emitting warnings, there are two levels of warnings in Sprocket: (which enables the lint warnings). `sprocket lint` emits both validation warnings and lint warnings — it is -essentially an alias for `sprocket check -l`. \ No newline at end of file +essentially an alias for `sprocket check -l`. + +## Rule configuration + +Individual lint rules can be configured via the `[check.lint]` section in +`sprocket.toml`. 
Currently, the following options are supported: + +| Option | Type | Description | +|--------|------|-------------| +| `allowed_runtime_keys` | List of strings | Additional runtime keys to allow beyond the WDL specification defaults (used by the `ExpectedRuntimeKeys` rule) | + +```toml +[check.lint] +allowed_runtime_keys = ["gpu", "queue"] +``` + +## Filtering lint rules + +The set of active lint rules can be controlled via the `[check]` section in +`sprocket.toml`: + +| Option | Type | Default | Description | +|--------|------|---------|-------------| +| `except` | List | `[]` | Rule IDs to exclude from running | +| `all_lint_rules` | Boolean | `false` | Enable all lint rules, including those outside the default set | +| `only_lint_tags` | List | `[]` | Restrict linting to rules with these tags | +| `filter_lint_tags` | List | `[]` | Exclude rules with these tags | + +For example, to enable all rules except `ContainerUri`: + +```toml +[check] +all_lint_rules = true +except = ["ContainerUri"] +``` \ No newline at end of file diff --git a/subcommands/doc.md b/subcommands/doc.md index eba8f21..3c3a823 100644 --- a/subcommands/doc.md +++ b/subcommands/doc.md @@ -59,6 +59,20 @@ version 1.2 workflow foo {} ``` +## Documentation comments + +The `--with-doc-comments` flag enables experimental support for documentation +comments. When enabled, Sprocket will recognize and render documentation +comments (i.e., `///` comments placed directly above declarations) in the +generated HTML output. + +> [!CAUTION] +> +> This feature is experimental and tracks the upstream WDL proposal at +> [openwdl/wdl#757](https://github.com/openwdl/wdl/issues/757). The +> `--with-doc-comments` flag will be removed in a future major version once the +> proposal is resolved. + ## Structs Structs are treated different for WDL v1.0/v1.1 and v1.2. 
diff --git a/subcommands/run.md b/subcommands/run.md index 2dfd109..6d87a52 100644 --- a/subcommands/run.md +++ b/subcommands/run.md @@ -10,22 +10,22 @@ See the section on [execution backends](/configuration/backends/overview.md) to learn more about configuring Sprocket to execute tasks in different environments. -## Entrypoints +## Targets The task or workflow to run can be provided explicitly with -the `--entrypoint` argument. +the `--target` argument. ```shell -sprocket run --entrypoint main example.wdl +sprocket run --target main example.wdl ``` Whether or not this argument is _required_ is based on whether inputs are -provided to Sprocket from which the entrypoint can be inferred (e.g., providing -an input of `main.is_pirate` implies an entrypoint of `main`). Conversely, if -you supply an `--entrypoint`, you don't have to prefix your inputs with the -entrypoint fully qualified name. +provided to Sprocket from which the target can be inferred (e.g., providing +an input of `main.is_pirate` implies a target of `main`). Conversely, if +you supply a `--target`, you don't have to prefix your command line inputs with the +target's fully qualified name. -Sprocket will indicate when it cannot infer the entrypoint. +Sprocket will indicate when it cannot infer the target. ## Inputs @@ -54,7 +54,7 @@ from the guided tour, we can specify the `name` parameter as a key-value pair on the command line. ```shell -sprocket run example.wdl --entrypoint main name="World" +sprocket run example.wdl --target main name="World" ``` After a few seconds, this job runs successfully with the following outputs. @@ -100,4 +100,30 @@ This produces the following output. "Ahoy, Sprocket!" ] } -``` \ No newline at end of file +``` + +## Output directory + +By default, `sprocket run` writes all execution artifacts to `./out`. This can +be changed with the `-o, --output-dir` flag. 
+ +```shell +sprocket run example.wdl --target main name="World" -o /path/to/output +``` + +Individual runs are stored at `/runs///`, and a +`_latest` symlink is maintained for each target pointing to its most recent run. +The output directory also contains a SQLite provenance database (`sprocket.db`) +that tracks all executions. + +The `--index-on` flag enables output indexing, creating symlinks under +`/index//` for efficient lookup of runs by output +values. + +```shell +sprocket run hello.wdl -t hello --index-on greeting +``` + +For full details on the output directory structure, provenance database, and +output indexing, see the +[Provenance Tracking](/configuration/provenance) documentation. \ No newline at end of file diff --git a/subcommands/server.md b/subcommands/server.md new file mode 100644 index 0000000..fb82277 --- /dev/null +++ b/subcommands/server.md @@ -0,0 +1,174 @@ +# `sprocket dev server` + +> [!CAUTION] +> +> This document describes the beta release of the `server` command. This +> functionality is considered experimental and may change in future releases. + +The `dev server` command starts Sprocket as an HTTP server, enabling remote workflow +submission and monitoring through a REST API. This is useful for scenarios where +you want to submit workflows from a separate machine or integrate Sprocket into +larger systems. + +## Overview + +Server mode provides: + +- **Remote workflow submission** via REST API. +- **Real-time monitoring** of running workflows. +- **Provenance tracking** with a SQLite database. +- **Concurrent execution** of multiple workflows. + +The server shares the same execution engine as `sprocket run`, ensuring +consistent behavior between CLI and server-submitted workflows. + +## Starting the server + +```shell +sprocket dev server --allowed-file-paths /path/to/workflows +``` + +At least one of `--allowed-file-paths` or `--allowed-urls` must be specified to +indicate where workflow sources can be loaded from. 
+ +## Command-line options + +| Option | Description | +|--------|-------------| +| `--host ` | Host to bind to (default: `127.0.0.1`) | +| `--port ` | Port to bind to (default: `8080`) | +| `--database-url ` | Database path. When omitted, defaults to `sprocket.db` within the output directory. When provided, relative paths resolve from the current working directory. | +| `-o, --output-directory ` | Output directory for workflow results (default: `./out`) | +| `--allowed-file-paths ` | Allowed file paths for file-based workflows (can be repeated) | +| `--allowed-urls ` | Allowed URL prefixes for URL-based workflows (can be repeated) | +| `--allowed-origins ` | Allowed CORS origins (can be repeated) | + +## Configuration + +Server settings can also be configured in `sprocket.toml`: + +```toml +[server] +host = "127.0.0.1" +port = 8080 +output_directory = "./out" +allowed_file_paths = ["/path/to/workflows"] +allowed_urls = ["https://raw.githubusercontent.com/"] +allowed_origins = ["http://localhost:3000"] +max_concurrent_runs = 500 + +[server.database] +url = "sqlite://sprocket.db" + +[server.engine] +# Engine configuration (same options as [run] section) +``` + +### Configuration options + +| Option | Type | Default | Description | +|--------|------|---------|-------------| +| `host` | String | `"127.0.0.1"` | Host address to bind | +| `port` | Integer | `8080` | Port to bind | +| `output_directory` | Path | `"./out"` | Directory for workflow outputs | +| `allowed_file_paths` | List | `[]` | Allowed local paths for workflow sources | +| `allowed_urls` | List | `[]` | Allowed URL prefixes for workflow sources | +| `allowed_origins` | List | `[]` | CORS allowed origins | +| `max_concurrent_runs` | Integer | None | Maximum concurrent workflow executions | +| `database.url` | String | None | Database path. When omitted, defaults to `sprocket.db` within the output directory. 
When provided, relative paths resolve from the current working directory (not the output directory). | +| `engine` | Object | `{}` | Engine configuration (see execution backends) | + +## REST API + +The server exposes a REST API for managing workflow executions. Interactive +documentation is available at `/api/v1/swagger-ui` when the server is running, +and the OpenAPI specification can be retrieved from `/api/v1/openapi.json`. + +### Runs + +Runs represent individual workflow executions. + +- `POST /api/v1/runs` - Submit a new workflow. +- `GET /api/v1/runs` - List all runs. Supports optional `?status=` filter + (e.g., `?status=running`). +- `GET /api/v1/runs/{uuid}` - Get run details. +- `POST /api/v1/runs/{uuid}/cancel` - Cancel a running workflow. +- `GET /api/v1/runs/{uuid}/outputs` - Get run outputs. + +### Sessions + +Sessions group related workflow submissions. Each `sprocket run` invocation +creates its own session, while a running `sprocket dev server` instance creates a +single session at startup that is shared by all workflows submitted to it. + +- `GET /api/v1/sessions` - List sessions. +- `GET /api/v1/sessions/{uuid}` - Get session details. + +### Tasks + +Tasks represent individual task executions within a workflow run. + +- `GET /api/v1/tasks` - List tasks. +- `GET /api/v1/tasks/{name}` - Get task details. +- `GET /api/v1/tasks/{name}/logs` - Get task logs. 
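The run-listing endpoint above can also be called from a script. The following is a minimal sketch using only the Python standard library; it assumes the server returns a JSON body (the response schema is not described here, so the result is returned unparsed beyond JSON decoding). The helper names `runs_url` and `list_runs` are hypothetical.

```python
import json
from typing import Any, Optional
from urllib.parse import urlencode
from urllib.request import urlopen


def runs_url(base_url: str, status: Optional[str] = None) -> str:
    """Build the URL for `GET /api/v1/runs`, with the optional status filter."""
    url = f"{base_url.rstrip('/')}/api/v1/runs"
    if status is not None:
        url += "?" + urlencode({"status": status})
    return url


def list_runs(base_url: str, status: Optional[str] = None) -> Any:
    """Fetch the run list and decode it as JSON (assumed response format)."""
    with urlopen(runs_url(base_url, status)) as response:
        return json.load(response)
```

With a server running on the default host and port, `list_runs("http://127.0.0.1:8080", status="running")` would request only the running workflows.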
+ +## Example usage + +### Starting the server + +```shell +# Start server allowing workflows from a local directory +sprocket dev server \ + --allowed-file-paths /home/user/workflows \ + --port 8080 + +# Start server allowing workflows from GitHub +sprocket dev server \ + --allowed-urls "https://raw.githubusercontent.com/" \ + --port 8080 +``` + +### Submitting a workflow + +```shell +# Submit a workflow via the API +curl -X POST http://localhost:8080/api/v1/runs \ + -H "Content-Type: application/json" \ + -d '{ + "source": "/home/user/workflows/hello.wdl", + "inputs": { + "name": "World" + } + }' +``` + +### Checking run status + +```shell +# Get run details +curl http://localhost:8080/api/v1/runs/{run_uuid} + +# List all running workflows (URL quoted so the shell does not expand `?`) +curl "http://localhost:8080/api/v1/runs?status=running" +``` + +## Output directory + +The server uses the same output directory structure as `sprocket run`. For +details on directory layout, provenance database, and output indexing, see the +[Provenance Tracking](/configuration/provenance) documentation. + +## Security considerations + +> [!WARNING] +> +> The Sprocket server does not perform any authentication or authorization. If you +> need to secure access to the server, you must run it behind a reverse proxy +> (e.g., nginx, Caddy, or Traefik) that handles authentication. + +- Always specify `--allowed-file-paths` or `--allowed-urls` to restrict which + workflow sources can be executed. +- Use `--allowed-origins` to configure CORS for web-based clients. +- Run behind a reverse proxy with authentication for production deployments. +- The server binds to `127.0.0.1` by default; change to `0.0.0.0` to accept + remote connections (not recommended without a reverse proxy). 
diff --git a/subcommands/validate.md b/subcommands/validate.md index 80a42cd..a8660fe 100644 --- a/subcommands/validate.md +++ b/subcommands/validate.md @@ -8,7 +8,10 @@ such, you can check out the [`run` subcommand documentation](/subcommands/run.md) to learn more. The subcommand will give a non-zero exit code if the inputs are not valid for -the specified task or workflow. This is useful for continuous integration -purposes. The [Sprocket GitHub +the specified task or workflow. In addition to type checking, `sprocket validate` +verifies that `File` and `Directory` inputs reference paths that exist on the +filesystem, catching missing input files before a run is attempted. + +This is useful for continuous integration purposes. The [Sprocket GitHub action](https://github.com/stjude-rust-labs/sprocket-action) provides an easy way to do that on GitHub. \ No newline at end of file