diff --git a/platform-enterprise_docs/data/data-lineage.md b/platform-enterprise_docs/data/data-lineage.md new file mode 100644 index 000000000..c211e9b1c --- /dev/null +++ b/platform-enterprise_docs/data/data-lineage.md @@ -0,0 +1,166 @@ +--- +title: "Data Lineage" +description: "Using data lineage in Seqera Platform." +date created: "2026-05-11" +last updated: "2026-05-11" +tags: [data lineage, provenance, governance, reproducibility, lineage id, lid, label] +--- + +:::info +Data lineage in Platform is in public preview. It is currently supported in AWS compute environments. It requires Nextflow v25.04 or later, and AWS S3 object storage. +::: + +:::warning +The feature is experimental and subject to change. See this guide for the latest configuration recommendations and limitations. +::: + +Data lineage tracks the full provenance of every pipeline run at both the task and workflow level, including what executed, what data it consumed, and what outputs it produced. Use it to audit results, verify reproducibility, and trace file provenance. + +## Overview + +Production pipelines generate results that teams need to trust, audit, and reproduce. Data lineage provides a precise, immutable record of how each result was produced. + +- **Reproducibility**: Every run, task, and output file receives a unique lineage ID (LID), a traversable URI that points to a structured record of what ran. Verify that two runs produced identical results, or identify where they diverged. +- **Auditing and compliance**: For teams in regulated industries such as pharma, clinical genomics, and CROs, lineage provides the audit trail needed for regulatory compliance. Each record captures inputs, outputs, parameters, compute environment, and the user who launched the run. +- **Debugging**: When a cached task unexpectedly re-executes, or a pipeline produces an unexpected result, lineage traces backward from any output to all contributing tasks and parameters. 
Compare two task runs to isolate what changed. +- **Broader team access**: Exploring Nextflow lineage previously required CLI access and comfort reading raw JSON. Platform now surfaces lineage data in pipeline run detail pages and Data Explorer. Users can inspect provenance directly. +- **Cross-workflow discoverability**: [Workflow output labels][workflow-labels] make output files discoverable across runs. Navigate lineage records by label to find all matching outputs workspace-wide, without knowing which specific run produced a file. + +## How data lineage works + +Nextflow creates a structured JSON record for each entity in your pipeline when lineage is enabled: + +| Record type | Description | +|---|---| +| **WorkflowRun** | Full pipeline execution: repository, commit ID, parameters, compute environment, session ID, and Platform context (user, workspace, pipeline) | +| **TaskRun** | Individual task execution: script, code checksum, inputs, outputs, container, and dependencies | +| **FileOutput** | Output file: path, checksum, size, timestamp, and links back to the task and workflow that produced it | + +Each record gets a lineage ID (LID), a `lid://` URI that uniquely identifies the entity. + +## Enable data lineage + +To start collecting data lineage for all pipeline runs in your workspace, go to **Settings > Workspace Settings**. Select **Lineage** and define the credentials, region, and (optionally) storage bucket and path where lineage data is stored and indexed. Toggle **Enable lineage by default** on to collect data lineage for all pipeline runs in the workspace, or toggle it off to require lineage to be enabled for each pipeline launch. + +:::tip +If the storage bucket field is empty, a default bucket is generated for storing lineage data. +::: + +With **Enable lineage by default** on, all pipeline runs in the workspace generate data lineage unless it is turned off at launch. See [Lineage][workspace-lineage] for more information about the settings. 
+ +:::danger +Changing the lineage storage bucket path after lineage data is generated will result in historic data loss. The lineage index is tied to the lineage storage bucket. Changing it makes existing records inaccessible. To move the storage location, first copy all existing lineage data to the new bucket and path (for example, `aws s3 cp --recursive s3://old-bucket/path s3://new-bucket/path`), then update the workspace setting. +::: + +When launching a pipeline in a data-lineage enabled workspace, the **Enable lineage** toggle in the pipeline **Run setup** reflects the **Enable lineage by default** workspace setting. This can be turned off to _explicitly exclude_ data lineage creation for the pipeline run. + +### Additional IAM permissions required + +If using existing AWS Batch or AWS Cloud compute environments with custom IAM roles, the following service role policies are required: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "ListObjectsInBucket", + "Effect": "Allow", + "Action": [ + "s3:ListBucket" + ], + "Resource": "arn:aws:s3:::seqera-lineage-" + }, + { + "Sid": "AllObjectActions", + "Effect": "Allow", + "Action": "s3:*Object", + "Resource": "arn:aws:s3:::seqera-lineage-/*" + }, + { + "Sid": "AllowObjectTagging", + "Effect": "Allow", + "Action": [ + "s3:PutObjectTagging", + "s3:GetObjectTagging" + ], + "Resource": "arn:aws:s3:::seqera-lineage-/*" + } + ] +} +``` + +Platform integration credentials require the following additional permissions: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "sqs:CreateQueue", + "sqs:GetQueueAttributes", + "sqs:SetQueueAttributes", + "sqs:GetQueueUrl", + "sqs:ReceiveMessage", + "sqs:DeleteMessage" + ], + "Resource": "arn:aws:sqs:*:*:seqera-lineage-*" + }, + { + "Effect": "Allow", + "Action": [ + "s3:CreateBucket", + "s3:GetBucketNotificationConfiguration", + "s3:PutBucketNotificationConfiguration", + "s3:GetBucketLocation" + ], + "Resource": 
"arn:aws:s3:::seqera-lineage-*" + } + ] +} +``` + +### Advanced: Experimenting with data lineage + +To test or troubleshoot data lineage for a _specific pipeline_, add the following to your **Nextflow config file** under **Advanced options** when _adding_ a pipeline to the launchpad. + +```groovy +lineage.enabled = true +lineage.store.location = '' +``` + +To test for a _single pipeline run_, add the same code to your **Nextflow config file** under **Advanced options** when _launching_ the pipeline run. + +:::warning +If data lineage is defined for a workspace, only that data is displayed in Platform. Any unique _specific pipeline_ or _single pipeline run_ lineage data is only accessible via the AWS S3 console and other related services (such as Amazon Athena). +::: + +## Data lineage displayed in Platform + +### Workflow run details + +When a run was executed with lineage enabled, the [run details page][run-details] displays lineage data across the following tabs: + +- **Run Info**: Shows the lineage ID, lineage labels, and the full Platform context captured at execution time: user, workspace, compute environment, pipeline name, revision, and commit ID. +- **Tasks**: Displays the lineage ID and lineage labels for each `TaskRun` alongside existing task data, so you can trace any task back to its lineage record. All task file inputs and outputs, and upstream and downstream tasks linked by lineage records, are displayed. +- **Inputs**: Lists all input datasets and parameters with file paths, types, and lineage IDs and lineage labels where available. +- **Outputs**: Lists all `FileOutput` records linked to the workflow run: output name, file path, type, lineage ID, and lineage labels. Files link directly to [Data Explorer][data-explorer]. + +### Data Explorer + +Output objects from a lineage-enabled run display their LID and any lineage labels when you preview the object in Data Explorer. You can trace any file back to the pipeline run that produced it. 
+ +## Lineage labels + +Assign lineage labels to output files using the `label` directive in your Nextflow process definitions. Labels appear in lineage records. Both Seqera Platform labels and Nextflow lineage labels propagate to lineage records. Seqera Platform excludes resource labels as they relate to underlying compute resources, not the data itself. + +:::info +Nextflow lineage labels are immutable. They are set at execution time and cannot be changed. Seqera Platform labels are mutable. Updating Platform labels after a run completes can produce a mismatch between Platform run labels and lineage labels. This is expected behavior. +::: + +{/* links */} +[workflow-labels]: https://docs.seqera.io/nextflow/workflow#labels +[workspace-lineage]: ../orgs-and-teams/workspace-management#lineage +[run-details]: ../monitoring/run-details +[data-explorer]: data-explorer diff --git a/platform-enterprise_docs/enterprise/configuration/configtables/data_features_env.yml b/platform-enterprise_docs/enterprise/configuration/configtables/data_features_env.yml index 895c03dcf..dcee09419 100644 --- a/platform-enterprise_docs/enterprise/configuration/configtables/data_features_env.yml +++ b/platform-enterprise_docs/enterprise/configuration/configtables/data_features_env.yml @@ -159,3 +159,9 @@ Description: > Number of days to retain Studio startup metrics in the database before automatic deletion. Metrics older than this threshold are deleted by a daily scheduled job. Value: 'Default: `90`' +- + Environment variable: '`TOWER_LINEAGE_ALLOWED_WORKSPACES`' + Description: > + Enable data lineage. Set to `null` (undefined) to disable for all workspaces, set to an empty string to enable for all workspaces, or provide a comma-separated list of workspace IDs to enable per workspace. 
+ Value: 'Default: `null`' +- diff --git a/platform-enterprise_docs/monitoring/run-details.mdx b/platform-enterprise_docs/monitoring/run-details.mdx index 55e666543..978e4c549 100644 --- a/platform-enterprise_docs/monitoring/run-details.mdx +++ b/platform-enterprise_docs/monitoring/run-details.mdx @@ -2,8 +2,8 @@ title: "Run details" description: "Monitoring a Nextflow pipeline executed through Seqera Platform." date created: "2023-04-21" -last updated: "2025-08-01" -tags: [logging, monitoring, runs, details, workflow, execution] +last updated: "2026-05-11" +tags: [logging, monitoring, runs, details, workflow, execution, lineage] --- import Tabs from '@theme/Tabs'; @@ -11,23 +11,32 @@ import TabItem from '@theme/TabItem'; Select a workflow run from the **Runs** list to open a run details page. -![Run details and progress overview](./_images/progress-bar.png) - The top of the page contains basic run details and a progress overview for an at-a-glance view of the run's status: - View and copy the run ID, pipeline name and repository, pipeline work directory, compute environment, and launch date. - Select the star icon to favorite the run and find it more easily via a filter view in the runs list later. -- Use the options menu to apply labels, relaunch, resume, or delete the run, save the run as a new pipeline, or publish a new pipeline [version](../pipelines/versioning) (if the run was launched from a draft version). +- Use the options menu to apply labels, relaunch, resume, or delete the run, save the run as a new pipeline, or publish a new pipeline [version](../pipelines/versioning) (if the run was launched from an unnamed draft). Select the tabs below the workflow run progress bar to view further run details: - **Tasks**: View the status and progress of pipeline tasks and [processes](#processes), including extensive [task details](#tasks). - **Logs**: View and download the pipeline run's execution logs. 
- **Metrics**: View resource [metrics](#wall-time) for the run. - **Configuration**: View Nextflow configuration files and the resolved [configuration](#configuration) used for the run. -- **Datasets**: View [datasets](../data/datasets) used as input for the run, if any. -- **Reports**: View [reports](../reports/overview) for the run, if any were configured. +- **Inputs**: View pipeline parameters and input files used by the run, including their lineage records. +- **Outputs**: View files produced by the run, including reports and lineage records for every published file. - **Containers**: View the details of containers used in the run, if any. - **Run Info**: View details about the [run](#run-details), [infrastructure](#infrastructure-details), and [executor](#executors-details). +:::tip +Data lineage is made available on request. Please contact your Seqera account manager. + +Lineage-aware fields and tabs only display data when the run was executed with data lineage enabled. You can enable lineage in the following ways: + +- [**Settings > Lineage**][workspace-lineage-settings]. A workspace maintainer configures the cloud credentials, region, and (optionally) bucket name where lineage records are stored. Select **Enable lineage by default** to make the launch form lineage toggle default to on for every run launched in the workspace. +- **Launch form toggle**. When launching a pipeline, toggle lineage on or off for the individual run. + +See [Getting started with data lineage][nextflow-lineage-tutorial] for the underlying lineage data model. +::: + @@ -77,6 +86,7 @@ The **Tasks** panel shows all the tasks that were executed in the run, including | **wchar** | Number of characters written to storage. | | **vol_ctxt** | Number of voluntary context switches. | | **inv_ctxt** | Number of involuntary context switches. | +| **lineage_id** | Lineage ID (LID) of the task's `TaskRun` record. Populated only when lineage tracking is active for the run. 
Select the LID to navigate to the lineage record. | Use the search bar to filter tasks with substrings in the table columns such as **process**, **tag**, **hash**, and **status**. For example, if you enter `succeeded` in the **Search task** field, the table displays only tasks that succeeded. @@ -84,42 +94,62 @@ Use the search bar to filter tasks with substrings in the table columns such as ![Task details](./_images/task-details.png) -Select a task in the task table to open the **Task details** dialog. The dialog has four tabs: **About**, **Execution log**, **Data Explorer**, and **Container**. +Select a task in the task table to open the **Task details** dialog. The dialog has the following tabs: + +- **About** +- **Metrics** +- **Execution log** +- **Data Explorer** +- **Container** + +:::note +If lineage is enabled for the run, the **About** tab content includes **Inputs** and **Outputs** tabs. + +The **Inputs** and **Outputs** tabs show every input consumed and every output produced by the task, including the name, its lineage type (`Collection` or `Path`), the source path, the lineage labels assigned to it, and the lineage ID of the corresponding lineage record. Select a name to open the file in [Data Explorer](../data/data-explorer). Select a lineage ID or label to navigate to that lineage record. +::: #### About - **Name**: Process name and tag. -- **Command**: Task script, defined in the pipeline process. - **Status**: Exit code, task status, attempts. -- **Work directory**: Directory where the task was executed. +- **Native ID**: Unique identifier assigned to the job by the underlying executor. +- **Command**: Task script, defined in the pipeline process. - **Environment**: Environment variables supplied to the task. +- **Work directory**: Directory where the task was executed. +- **Inputs**: File inputs to the task and associated lineage data. +- **Outputs**: File outputs from the task and associated lineage data. 
+- **Upstream**: Links to related upstream tasks. +- **Downstream**: Links to related downstream tasks. + +#### Metrics + - **Execution time**: Metrics for task submission, start, and completion time: | Label | Description | |-------|-------------| - | **submit** | Task submission timestamp. | - | **start** | Task execution timestamp. | - | **complete** | Task completion timestamp. | - | **duration** | Time elapsed from task submission to completion, including scheduling time. | - | **realtime** | Task script execution time. | + | **submitted** | Task submission timestamp. | + | **started** | Task execution timestamp. | + | **completed** | Task completion timestamp. | + | **total duration** | Time elapsed from task submission to completion, including scheduling time. | + | **script execution time** | Task script execution time. | -- **Resources requested**: Metrics for the resources requested by the task: +- **Requested resources**: Metrics for the resources requested by the task: | Label | Description | |-------|-------------| - | **container** | Container image name used to execute the task. | + | **container image** | Container image name used to execute the task. | | **queue** | The queue that the executor used to run the process. | | **cpus** | Number of CPUs requested for task execution. | | **memory** | Memory requested for task execution. | - | **disk** | Disk space requested for task executiuon. | - | **time** | Time requested for task execution. | + | **disk space** | Disk space requested for task execution. | + | **time limit** | Time requested for task execution. | | **executor** | The Nextflow executor used for this task. | - | **machineType** | The virtual machine type used for this task. | | **cloudZone** | The cloud zone (region) where the task was executed. | + | **machineType** | The virtual machine type used for this task. | | **priceModel** | The price model used to calculate the task computation cost. 
| - | **cost** | The estimated cost to compute this task. | + | **estimated cost** | The estimated cost to compute this task. | -- **Resources used**: Metrics for the actual resources used by the task: +- **Used resources**: Metrics for the actual resources used by the task: | Label | Description | |-------|-------------| @@ -251,13 +281,8 @@ Hover the cursor over each box plot to show more details. -The **Configuration** tab contains information about the pipeline parameters, Nextflow configuration, and Nextflow command used for the run. +The **Configuration** tab contains information about the Nextflow configuration files and the Nextflow command used for the run. -#### Parameters - -![Parameters](./_images/parameters.png) - -The **Parameters** window displays the pipeline parameters configured for the run, with options to view, copy, or download the parameters in Groovy, JSON, or YAML format. #### Configuration @@ -272,18 +297,43 @@ The **Configuration** window displays the locations of the Nextflow configuratio The **Command** window displays the Nextflow command used for the run. - + -![Datasets](./_images/datasets.png) +The **Inputs** tab consolidates the pipeline parameters and the input files used by the run. -The **Datasets** tab displays information about the [datasets](../data/datasets) used as input for the run. This view is empty if no datasets were selected. +#### Parameters + +![Parameters](./_images/parameters.png) + +The **Parameters** window displays the pipeline parameters configured for the run, with options to view, copy, or download the parameters in Groovy, JSON, or YAML format. + +#### Input files + +The **Input files** table displays every dataset, file, and collection that was used as input for the run: + +| Column | Description | +|--------|-------------| +| **Input Name** | Display name of the input. Select the name to open the file in [Data Explorer](../data/data-explorer) or in the corresponding [Dataset](../data/datasets). 
| +| **Type** | Lineage type, such as `Collection` or `Path`. | +| **File Path** | Full path to the input. The path is truncated; hover for the complete path. Select the path to open it in [Data Explorer](../data/data-explorer). | +| **Lineage ID** | Lineage ID (LID) of the input's lineage record. Only populated when [lineage tracking is enabled][nextflow-lineage-tutorial]. | +| **Lineage Labels** | Lineage labels assigned to the input. Each label is a clickable link that navigates to the lineage record for that label. | + +If the run was not launched with any input files or datasets, the table is empty. - + + +The **Outputs** tab links to every file the run published to its output directory: + +- **Reports** — The named report files configured for the run, such as the MultiQC report or any reports declared in `tower.yml`. + +#### Reports ![Reports](./_images/reports.png) -The **Reports** tab contains a table with the names, details, and paths to all [reports](../reports/overview) generated by the run, if any were configured. Select a report to open a Data Explorer file preview of the report, with options to open the report in a new tab or download it. +The **Reports** sub-tab contains a table with the names, details, and paths to all [reports](../reports/overview) generated by the run, if any were configured. Select a report to open a Data Explorer file preview of the report, with options to open the report in a new tab or download it. + @@ -309,7 +359,7 @@ The **Containers** tab displays the details of containers used in the run, if an -The **Run Info** tab contains at-a-glance details about the run, infrastructure, and executor. Hover over the information icon next to a card's name to view a value description. Select the icons next to any run detail values to copy them. +The **Run Info** tab contains at-a-glance details about the run, infrastructure, and executor. When lineage tracking is enabled, it also displays lineage information. 
Hover over the information icon next to a card's name to view a value description. Select the icons next to any run detail values to copy them. #### Run details @@ -349,3 +399,7 @@ View run executor details: + +[nextflow-lineage-tutorial]: https://docs.seqera.io/nextflow/tutorials/data-lineage +[nextflow-label-directive]: https://docs.seqera.io/nextflow/reference/process#label +[workspace-lineage-settings]: ../orgs-and-teams/workspace-management#lineage diff --git a/platform-enterprise_docs/orgs-and-teams/workspace-management.md b/platform-enterprise_docs/orgs-and-teams/workspace-management.md index 2583aaa8a..2c77c8e40 100644 --- a/platform-enterprise_docs/orgs-and-teams/workspace-management.md +++ b/platform-enterprise_docs/orgs-and-teams/workspace-management.md @@ -52,6 +52,23 @@ Studios sessions created in shared workspaces are not shared across all the work Select **Edit labels** to manage the workspace [labels and resource labels](../labels/overview). +### Lineage + +The **Lineage** card lets workspace maintainers configure where Nextflow lineage records are stored and whether lineage tracking is on by default for every run launched in the workspace. + +:::note +Data lineage is not enabled by default. See [Configuration options](../enterprise/configuration/overview#data-features). +::: + +Once enabled, select **Settings > Lineage** to open the **Edit lineage settings** form: + +| Field | Required | Description | +|-------|----------|-------------| +| **Credentials** | Yes | The workspace credentials Platform uses to create and access the lineage storage bucket. The credentials must include permission to create buckets in the chosen region (or to access an existing bucket if **Bucket name** is specified), activate object notifications on the bucket, and manage auto-provisioned SQS queues. See [Credentials](#credentials). | +| **Region** | Yes | Cloud region where the lineage storage bucket is created (for example, `us-east-1`, `eu-west-1`). 
| +| **Bucket name** | No | Bucket where lineage records are stored. If left empty, Platform generates a default bucket name in the form `seqera-lineage-`. | +| **Enable lineage by default** | No (toggle) | When enabled, the launch form lineage toggle defaults to on for every run launched in this workspace. Users can still override per run. | + ### Edit or delete a workspace :::note