diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md new file mode 100644 index 000000000..7a0c48339 --- /dev/null +++ b/docs/side_quests/groovy_essentials.md @@ -0,0 +1,2207 @@ +# Groovy Essentials for Nextflow Developers + +Nextflow is built on Groovy, a powerful dynamic language that runs on the Java Virtual Machine. You can write a lot of Nextflow without ever feeling like you've learned Groovy - many workflows use only basic syntax for variables, maps, and lists. Most Nextflow tutorials focus on workflow orchestration (channels, processes, and data flow), and you can go surprisingly far with just that. + +However, when you need to manipulate data, parse complex filenames, implement conditional logic, or build robust production workflows, you're writing Groovy code - and knowing a few key Groovy concepts can dramatically improve your ability to solve real-world problems efficiently. Understanding where Nextflow ends and Groovy begins helps you write clearer, more maintainable workflows. + +This side quest takes you on a hands-on journey from basic concepts to production-ready patterns. We'll transform a simple CSV-reading workflow into a sophisticated bioinformatics pipeline, evolving it step-by-step through realistic challenges: + +- **Understanding boundaries:** Distinguish between Nextflow operators and Groovy methods, and master when to use each +- **Data manipulation:** Extract, transform, and subset maps and collections using Groovy's powerful operators +- **String processing:** Parse complex file naming schemes with regex patterns and master variable interpolation +- **Reusable functions:** Extract complex logic into named functions for cleaner, more maintainable workflows +- **Dynamic logic:** Build processes that adapt to different input types and use closures for dynamic resource allocation +- **Conditional routing:** Intelligently route samples through different processes based on their metadata characteristics +- **Safe operations:** Handle missing data gracefully with null-safe operators and validate inputs with clear error messages +- **Configuration-based handlers:** Use workflow event handlers for logging, notifications, and lifecycle management + +--- + +## 0. Warmup + +### 0.1. Prerequisites + +Before taking on this side quest you should: + +- Complete the [Hello Nextflow](../hello_nextflow/README.md) tutorial or have equivalent experience +- Understand basic Nextflow concepts (processes, channels, workflows) +- Have basic familiarity with common programming constructs used in Groovy syntax (variables, maps, lists) + +This tutorial will explain Groovy concepts as we encounter them, so you don't need extensive prior Groovy knowledge. We'll start with fundamental concepts and build up to advanced patterns. + +### 0.2. Starting Point + +Navigate to the project directory: + +```bash title="Navigate to project directory" +cd side-quests/groovy_essentials +``` + +The `data` directory contains sample files and a main workflow file we'll evolve throughout. + +```console title="Directory contents" +> tree +. 
+├── collect.nf +├── data +│ ├── samples.csv +│ └── sequences +│ ├── SAMPLE_001_S1_L001_R1_001.fastq +│ ├── SAMPLE_002_S2_L001_R1_001.fastq +│ └── SAMPLE_003_S3_L001_R1_001.fastq +├── main.nf +├── modules +│ ├── fastp.nf +│ ├── generate_report.nf +│ └── trimgalore.nf +└── nextflow.config + +4 directories, 10 files +``` + +Our sample CSV contains information about biological samples that need different processing based on their characteristics: + +```console title="samples.csv" +sample_id,organism,tissue_type,sequencing_depth,file_path,quality_score +SAMPLE_001,human,liver,30000000,data/sequences/SAMPLE_001_S1_L001_R1_001.fastq,38.5 +SAMPLE_002,mouse,brain,25000000,data/sequences/SAMPLE_002_S2_L001_R1_001.fastq,35.2 +SAMPLE_003,human,kidney,45000000,data/sequences/SAMPLE_003_S3_L001_R1_001.fastq,42.1 +``` + +We'll use this realistic dataset to explore practical Groovy techniques that you'll encounter in real bioinformatics workflows. + +--- + +## 1. Nextflow vs Groovy: Understanding the Boundaries + +### 1.1. Identifying What's What + +Nextflow developers often confuse Nextflow constructs with Groovy language features. Let's build a workflow demonstrating how they work together. + +#### Step 1: Basic Nextflow Workflow + +Start with a simple workflow that just reads the CSV file (we've already done this for you in `main.nf`): + +```groovy title="main.nf" linenums="1" +workflow { + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .view() +} +``` + +The `workflow` block defines our pipeline structure, while `Channel.fromPath()` creates a channel from a file path. The `.splitCsv()` operator processes the CSV file and converts each row into a map data structure. + +Run this workflow to see the raw CSV data: + +```bash title="Test basic workflow" +nextflow run main.nf +``` + +You should see output like: + +```console title="Raw CSV data" +Launching `main.nf` [marvelous_tuckerman] DSL2 - revision: 6113e05c17 + +[sample_id:SAMPLE_001, organism:human, tissue_type:liver, sequencing_depth:30000000, file_path:data/sequences/SAMPLE_001_S1_L001_R1_001.fastq, quality_score:38.5] +[sample_id:SAMPLE_002, organism:mouse, tissue_type:brain, sequencing_depth:25000000, file_path:data/sequences/SAMPLE_002_S2_L001_R1_001.fastq, quality_score:35.2] +[sample_id:SAMPLE_003, organism:human, tissue_type:kidney, sequencing_depth:45000000, file_path:data/sequences/SAMPLE_003_S3_L001_R1_001.fastq, quality_score:42.1] +``` + +#### Step 2: Adding the Map Operator + +Now we're going to use some Groovy code to transform the data, using the `.map()` operator you will probably already be familiar with. This operator takes a 'closure' where we can write Groovy code to transform each item. + +!!! note + + A **closure** is a block of code that can be passed around and executed later. Think of it as a function that you define inline. In Groovy, closures are written with curly braces `{ }` and can take parameters. They're fundamental to how Nextflow operators work and if you've been writing Nextflow for a while, you may already have been using them without realizing it! 
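Before we use a closure inside an operator, it can help to see one on its own. Here's a minimal, hypothetical sketch (the values are invented for illustration; you could paste it into a plain Groovy script or the `nextflow console` REPL):

```groovy
// A closure assigned to a variable, then called like a function
def double_depth = { depth -> depth * 2 }
println double_depth(15000000)   // prints 30000000

// Closures can also be passed straight to methods - this is exactly how
// operators like map() and view() receive the code you give them
['liver', 'brain', 'kidney'].each { tissue -> println tissue.toUpperCase() }
```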
+ +Here's what that map operation looks like: + +=== "After" + + ```groovy title="main.nf" linenums="2" hl_lines="3-6" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map { row -> + return row + } + .view() + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="2" hl_lines="3" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .view() + ``` + +This is our first **Groovy closure**—an anonymous function you can pass as an argument. Closures are a core Groovy concept (similar to lambdas in Python or arrow functions in JavaScript) and are essential for working with Nextflow operators. + +The closure `{ row -> return row }` takes a parameter `row` (could be any name: `item`, `sample`, etc.). You can also use the implicit variable `it` instead: `.map { return it }`, though naming parameters improves clarity. + +When Nextflow processes each channel item, it passes that item to your closure. Here, `row` holds one CSV row at a time. + +Apply this change and run the workflow: + +```bash title="Test map operator" +nextflow run main.nf +``` + +You'll see the same output as before, because we're simply returning the input unchanged. This confirms that the map operator is working correctly. Now let's start transforming the data. + +#### Step 3: Creating a Map Data Structure + +Now we're going to write **pure Groovy code** inside our closure. Everything from this point forward in this section is Groovy syntax and methods, not Nextflow operators. + +=== "After" + + ```groovy title="main.nf" linenums="2" hl_lines="4-12" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map { row -> + // This is all Groovy code now! + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + return sample_meta + } + .view() + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="2" hl_lines="4" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map { row -> + return row + } + .view() + ``` + +This is pure Groovy code. The `sample_meta` map is a key-value data structure (like dictionaries in Python, objects in JavaScript, or hashes in Ruby) storing related information: sample ID, organism, tissue type, sequencing depth, and quality score. + +We use Groovy's string manipulation methods like `.toLowerCase()` and `.replaceAll()` to clean up our data, and type conversion methods like `.toInteger()` and `.toDouble()` to convert string data from the CSV into the appropriate numeric types. + +Apply this change and run the workflow: + +```bash title="Test map data structure" +nextflow run main.nf +``` + +You should see the refined map output like: + +```console title="Transformed metadata" +[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5] +[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2] +[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1] +``` + +#### Step 4: Adding Conditional Logic + +Now let's add more Groovy logic - this time using a ternary operator to make decisions based on data values. 
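Before wiring this into the pipeline, it may help to see that a ternary is just a compact if/else. A small hypothetical comparison (the quality value is made up):

```groovy
def quality = 42.1

// Long form
def priority
if (quality > 40) {
    priority = 'high'
} else {
    priority = 'normal'
}

// Ternary form - same result in a single expression
def priority_short = quality > 40 ? 'high' : 'normal'

assert priority == priority_short
```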
+ +Make the following change: + +=== "After" + + ```groovy title="main.nf" linenums="2" hl_lines="11-12" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return sample_meta + [priority: priority] + } + .view() + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="2" hl_lines="11" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + return sample_meta + } + .view() + ``` + +The ternary operator is a shorthand for an if/else statement that follows the pattern `condition ? value_if_true : value_if_false`. This line means: "If the quality is greater than 40, use 'high', otherwise use 'normal'". Its cousin, the **Elvis operator** (`?:`), provides default values when something is null or empty - we'll explore that pattern later in this tutorial. + +The map addition operator `+` creates a **new map** rather than modifying the existing one. This line creates a new map that contains all the key-value pairs from `sample_meta` plus the new `priority` key. + +!!! Note + + Never modify maps passed into closures - always create new ones using `+` (for example). In Nextflow, the same data often flows through multiple operations simultaneously. Modifying a map in-place can cause unpredictable side effects when other operations reference that same object. Creating new maps ensures each operation has its own clean copy. + +Run the modified workflow: + +```bash title="Test conditional logic" +nextflow run main.nf +``` + +You should see output like: + +```console title="Metadata with priority" +[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal] +[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, priority:normal] +[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, priority:high] +``` + +We've successfully added conditional logic to enrich our metadata with a priority level based on quality scores. + +#### Step 4.5: Subsetting Maps with `.subMap()` + +While the `+` operator adds keys to a map, sometimes you need to do the opposite - extract only specific keys. Groovy's `.subMap()` method is perfect for this. + +Let's add a line to create a simplified version of our metadata that only contains identification fields: + +=== "After" + + ```groovy title="main.nf" linenums="2" hl_lines="12-15" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map { row -> + // This is all Groovy code now! + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def id_only = sample_meta.subMap(['id', 'organism', 'tissue']) + println "ID fields only: ${id_only}" + + def priority = sample_meta.quality > 40 ? 
'high' : 'normal' + return sample_meta + [priority: priority] + } + .view() + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="2" hl_lines="12" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map { row -> + // This is all Groovy code now! + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return sample_meta + [priority: priority] + } + .view() + ``` + +Run the modified workflow: + +```bash title="Test subMap" +nextflow run main.nf +``` + +You should see output showing both the full metadata displayed by the `view()` operation and the extracted subset we printed with `println`: + +```console title="SubMap results" + N E X T F L O W ~ version 25.04.6 + +Launching `main.nf` [peaceful_cori] DSL2 - revision: 4cc4a8340f + +ID fields only: [id:sample_001, organism:human, tissue:liver] +ID fields only: [id:sample_002, organism:mouse, tissue:brain] +ID fields only: [id:sample_003, organism:human, tissue:kidney] +[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal] +[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, priority:normal] +[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, priority:high] +``` + +The `.subMap()` method takes a list of keys and returns a new map containing only those keys. If a key doesn't exist in the original map, it's simply not included in the result. + +This is particularly useful when you need to create different metadata versions for different processes - some might need full metadata while others need only minimal identification fields. + +Now remove those println statements to restore your workflow to its previous state, as we don't need them going forward. + +!!! tip "Map Operations Summary" + + - **Add keys**: `map1 + [new_key: value]` - Creates new map with additional keys + - **Extract keys**: `map1.subMap(['key1', 'key2'])` - Creates new map with only specified keys + - **Both operations create new maps** - Original maps remain unchanged + +#### Step 5: Combining Maps and Returning Results + +So far, we've only been returning what Nextflow community calls the 'meta map', and we've been ignoring the files those metadata relate to. But if you're writing Nextflow workflows, you probably want to do something with those files. + +Let's output a channel structure comprising a tuple of 2 elements: the enriched metadata map and the corresponding file path. This is a common pattern in Nextflow for passing data to processes. + +=== "After" + + ```groovy title="main.nf" linenums="2" hl_lines="12" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def priority = sample_meta.quality > 40 ? 
'high' : 'normal' + return [sample_meta + [priority: priority], file(row.file_path) ] + } + .view() + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="2" hl_lines="12" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return sample_meta + [priority: priority] + } + .view() + ``` + +Apply this change and run the workflow: + +```bash title="Test complete workflow" +nextflow run main.nf +``` + +You should see output like: + +```console title="Complete workflow output" +[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq] +[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq] +[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, priority:high], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq] +``` + +This `[meta, file]` tuple structure is a common pattern in Nextflow for passing both metadata and associated files to processes. + +!!! note + + **Maps and Metadata**: Maps are fundamental to working with metadata in Nextflow. For a more detailed explanation of working with metadata maps, see the [Working with metadata](./metadata.md) side quest. + +Our workflow demonstrates the core pattern: **Nextflow constructs** (`workflow`, `Channel.fromPath()`, `.splitCsv()`, `.map()`, `.view()`) orchestrate data flow, while **basic Groovy constructs** (maps `[key: value]`, string methods, type conversions, ternary operators) handle the data processing logic inside the `.map()` closure. + +### 1.2. Distinguishing Nextflow operators from Groovy functions + +So far, so good, we can distinguish between Nextflow constructs and basic Groovy constructs. But what about when the syntax overlaps? + +A perfect example of this confusion is the `collect` operation, which exists in both contexts but does completely different things. Groovy's `collect` transforms each element, while Nextflow's `collect` gathers all channel elements into a single-item channel. + +Let's demonstrate this with some sample data, starting by refreshing ourselves on what the Nextflow `collect()` operator does. Check out `collect.nf`: + +```groovy title="collect.nf" linenums="1" +def sample_ids = ['sample_001', 'sample_002', 'sample_003'] + +// Nextflow collect() - groups multiple channel emissions into one +ch_input = Channel.fromList(sample_ids) +ch_input.view { "Individual channel item: ${it}" } +ch_collected = ch_input.collect() +ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } +``` + +Steps: + +- Define a Groovy list +- Create a channel with `fromList()` that emits each sample ID separately +- Print each item with `view()` as it flows through +- Gather all items into a single list with Nextflow's `collect()` operator +- Print the collected result (single item containing all sample IDs) with a second `view()` + +We've changed the structure of the channel, but we haven't changed the data itself. 
+ +Run the workflow to confirm this: + +```bash title="Test collect operations" +nextflow run collect.nf +``` + +```console title="Different collect behaviors" + N E X T F L O W ~ version 25.04.6 + +Launching `collect.nf` [loving_mendel] DSL2 - revision: e8d054a46e + +Individual channel item: sample_001 +Individual channel item: sample_002 +Individual channel item: sample_003 +Nextflow collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) +``` + +`view()` returns an output for every channel emission, so we know that this single output contains all 3 original items grouped into one list. + +Now let's see Groovy's `collect` method in action. Modify `collect.nf` to apply Groovy's `collect` method to the original list of sample IDs: + +=== "After" + + ```groovy title="main.nf" linenums="1" hl_lines="9-13" + def sample_ids = ['sample_001', 'sample_002', 'sample_003'] + + // Nextflow collect() - groups multiple channel emissions into one + ch_input = Channel.fromList(sample_ids) + ch_input.view { "Individual channel item: ${it}" } + ch_collected = ch_input.collect() + ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } + + // Groovy collect - transforms each element, preserves structure + def formatted_ids = sample_ids.collect { id -> + id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') + } + println "Groovy collect result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="1" + def sample_ids = ['sample_001', 'sample_002', 'sample_003'] + + // Nextflow collect() - groups multiple channel emissions into one + ch_input = Channel.fromList(sample_ids) + ch_input.view { "Individual channel item: ${it}" } + ch_collected = ch_input.collect() + ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } + ``` + +In this new snippet we: + +- Define a new variable `formatted_ids` that uses Groovy's `collect` method to transform each sample ID in the original list +- Print the result using `println` + +Run the modified workflow: + +```bash title="Test Groovy collect" +nextflow run collect.nf +``` + +```console title="Groovy collect results" hl_lines="5" + N E X T F L O W ~ version 25.04.6 + +Launching `collect.nf` [cheeky_stonebraker] DSL2 - revision: 2d5039fb47 + +Groovy collect result: [SPECIMEN_001, SPECIMEN_002, SPECIMEN_003] (3 items transformed into 3) +Individual channel item: sample_001 +Individual channel item: sample_002 +Individual channel item: sample_003 +Nextflow collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) +``` + +This time, we have NOT changed the structure of the data, we still have 3 items in the list, but we HAVE transformed each item using Groovy's `collect` method to produce a new list with modified values. This is sort of like using the `map` operator in Nextflow, but it's pure Groovy code operating on a standard Groovy list. + +`collect` is an extreme case we're using here to make a point. The key lesson is that when you're writing workflows always distinguish between **Groovy constructs** (data structures) and **Nextflow constructs** (channels/workflows). Operations can share names but behave completely differently. + +### 1.3. The Spread Operator (`*.`) - Shorthand for Property Extraction + +Related to Groovy's `collect` is the spread operator (`*.`), which provides a concise way to extract properties from collections. 
It's essentially syntactic sugar for a common `collect` pattern. + +Let's add a demonstration to our `collect.nf` file: + +=== "After" + + ```groovy title="collect.nf" linenums="1" hl_lines="15-18" + def sample_ids = ['sample_001', 'sample_002', 'sample_003'] + + // Nextflow collect() - groups multiple channel emissions into one + ch_input = Channel.fromList(sample_ids) + ch_input.view { "Individual channel item: ${it}" } + ch_collected = ch_input.collect() + ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } + + // Groovy collect - transforms each element, preserves structure + def formatted_ids = sample_ids.collect { id -> + id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') + } + println "Groovy collect result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" + + // Spread operator - concise property access + def sample_data = [[id: 's1', quality: 38.5], [id: 's2', quality: 42.1], [id: 's3', quality: 35.2]] + def all_ids = sample_data*.id + println "Spread operator result: ${all_ids}" + ``` + +=== "Before" + + ```groovy title="collect.nf" linenums="1" + def sample_ids = ['sample_001', 'sample_002', 'sample_003'] + + // Nextflow collect() - groups multiple channel emissions into one + ch_input = Channel.fromList(sample_ids) + ch_input.view { "Individual channel item: ${it}" } + ch_collected = ch_input.collect() + ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } + + // Groovy collect - transforms each element, preserves structure + def formatted_ids = sample_ids.collect { id -> + id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') + } + println "Groovy collect result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" + ``` + +Run the updated workflow: + +```bash title="Test spread operator" +nextflow run collect.nf +``` + +You should see output like: + +```console title="Spread operator output" hl_lines="6" + N E X T F L O W ~ version 25.04.6 + +Launching `collect.nf` [cranky_galileo] DSL2 - revision: 5f3c8b2a91 + +Groovy collect result: [SPECIMEN_001, SPECIMEN_002, SPECIMEN_003] (3 items transformed into 3) +Spread operator result: [s1, s2, s3] +Individual channel item: sample_001 +Individual channel item: sample_002 +Individual channel item: sample_003 +Nextflow collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) +``` + +The spread operator `*.` is a shorthand for a common collect pattern: + +```groovy +// These are equivalent: +def ids = samples*.id +def ids = samples.collect { it.id } + +// Also works with method calls: +def names = files*.getName() +def names = files.collect { it.getName() } +``` + +The spread operator is particularly useful when you need to extract a single property from a list of objects - it's more readable than writing out the full `collect` closure. + +!!! 
tip "When to Use Groovy's Spread vs Collect" + + - **Use spread (`*.`)** for simple property access: `samples*.id`, `files*.name` + - **Use collect** for transformations or complex logic: `samples.collect { it.id.toUpperCase() }`, `samples.collect { [it.id, it.quality > 40] }` + +### Takeaway + +In this section, you've learned: + +- **It takes both Nextflow and Groovy**: Nextflow provides the workflow structure and data flow, while Groovy provides the data manipulation and logic +- **Distinguishing Nextflow from Groovy**: How to identify which language construct you're using given the context +- **Context matters**: The same operation name can have completely different behaviors + +Understanding these boundaries is essential for debugging, documentation, and writing maintainable workflows. + +Next we'll dive deeper into Groovy's powerful string processing capabilities, which are essential for handling real-world data. + +--- + +## 2. String Processing and Dynamic Script Generation + +Mastering Groovy's string processing separates brittle workflows from robust pipelines. This section covers parsing complex file names, dynamic script generation, and variable interpolation. + +### 2.1. Pattern Matching and Regular Expressions + +Bioinformatics files often have complex naming conventions encoding metadata. Let's extract this automatically with Groovy's pattern matching. + +We're going to return to our `main.nf` workflow and add some pattern matching logic to extract additional sample information from file names. The FASTQ files in our dataset follow Illumina-style naming conventions with names like `SAMPLE_001_S1_L001_R1_001.fastq.gz`. These might look cryptic, but they actually encode useful metadata like sample ID, lane number, and read direction. We're going to use Groovy's regex capabilities to parse these names. + +Make the following change to your existing `main.nf` workflow: + +=== "After" + + ```groovy title="main.nf" linenums="4" hl_lines="10-21" + .map { row -> + // This is all Groovy code now! + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def fastq_path = file(row.file_path) + + def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) + def file_meta = m ? [ + sample_num: m[0][2].toInteger(), + lane: m[0][3], + read: m[0][4], + chunk: m[0][5] + ] : [:] + + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return [sample_meta + file_meta + [priority: priority], fastq_path] + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="4" hl_lines="11" + .map { row -> + // This is all Groovy code now! + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return [sample_meta + [priority: priority], file(row.file_path)] + } + ``` + +This demonstrates key **Groovy string processing concepts**: + +1. **Regular expression literals** using `~/pattern/` syntax - this creates a regex pattern without needing to escape backslashes +2. **Pattern matching** with the `=~` operator - this attempts to match a string against a regex pattern +3. **Matcher objects** that capture groups with `[0][1]`, `[0][2]`, etc. 
- the first `[0]` selects the match itself, while `[1]`, `[2]`, etc. within it refer to the captured groups in parentheses (`[0][0]` would be the full matched string)

Let's break down the regex pattern `^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$`:

| Pattern             | Matches                                | Captures                           |
| ------------------- | -------------------------------------- | ---------------------------------- |
| `^(.+)`             | Sample name from start                 | Group 1: sample name               |
| `_S(\d+)`           | Sample number `_S1`, `_S2`, etc.       | Group 2: sample number             |
| `_L(\d{3})`         | Lane number `_L001`                    | Group 3: lane (3 digits)           |
| `_(R[12])`          | Read direction `_R1` or `_R2`          | Group 4: read direction            |
| `_(\d{3})`          | Chunk number `_001`                    | Group 5: chunk (3 digits)          |
| `\.fastq(?:\.gz)?$` | File extension `.fastq` or `.fastq.gz` | Not captured (?: is non-capturing) |

This parses Illumina-style naming conventions to extract metadata automatically.

Run the modified workflow:

```bash title="Test pattern matching"
nextflow run main.nf
```

You should see output with metadata enriched from the file names, like:

```console title="Metadata with file parsing"
 N E X T F L O W ~ version 25.04.6

Launching `main.nf` [clever_pauling] DSL2 - revision: 605d2058b4

[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, sample_num:1, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq]
[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, sample_num:2, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq]
[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, sample_num:3, lane:001, read:R1, chunk:001, priority:high], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq]
```

### 2.2. Dynamic Script Generation in Processes

Process script blocks are essentially multi-line strings that get passed to the shell. You can use **Groovy conditional logic** (if/else, ternary operators) to dynamically generate different script strings based on input characteristics. This is essential for handling diverse input types, such as single-end vs paired-end sequencing reads, without duplicating process definitions.

Let's add a process to our workflow that demonstrates this pattern. Open `modules/fastp.nf` and take a look:

```groovy title="modules/fastp.nf" linenums="1"
process FASTP {
    container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690'

    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("*_trimmed*.fastq.gz"), emit: reads

    script:
    """
    fastp \\
        --in1 ${reads[0]} \\
        --in2 ${reads[1]} \\
        --out1 ${meta.id}_trimmed_R1.fastq.gz \\
        --out2 ${meta.id}_trimmed_R2.fastq.gz \\
        --json ${meta.id}.fastp.json \\
        --html ${meta.id}.fastp.html \\
        --thread $task.cpus
    """
}
```

The process takes FASTQ files as input and runs the `fastp` tool to trim adapters and filter low-quality reads. Unfortunately, the person who wrote this process didn't allow for the single-end reads we have in our example dataset.
Let's add it to our workflow and see what happens: + +First, include the module at the very first line of your `main.nf` workflow: + +```groovy title="main.nf" linenums="1" +include { FASTP } from './modules/fastp.nf' +``` + +Then modify the `workflow` block to connect the `ch_samples` channel to the `FASTP` process: + +=== "After" + + ```groovy title="main.nf" linenums="25" hl_lines="7" + workflow { + + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map{ row -> separateMetadata(row) } + + ch_fastp = FASTP(ch_samples) + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="25" hl_lines="6" + workflow { + + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map{ row -> separateMetadata(row) } + .view() + } + ``` + +Run this modified workflow: + +```bash title="Test fastp process" +nextflow run main.nf +``` + +You'll see a long error trace with some content like: + +```console title="Process error" +ERROR ~ Error executing process > 'FASTP (3)' + +Caused by: + Process `FASTP (3)` terminated with an error exit status (255) + + +Command executed: + + fastp \ + --in1 SAMPLE_003_S3_L001_R1_001.fastq \ + --in2 null \ + --out1 sample_003_trimmed_R1.fastq.gz \ + --out2 sample_003_trimmed_R2.fastq.gz \ + --json sample_003.fastp.json \ + --html sample_003.fastp.html \ + --thread 2 + +Command exit status: + 255 + +Command output: + (empty) +``` + +You can see that the process is trying to run `fastp` with a `null` value for the second input file, which is causing it to fail. This is because our dataset contains single-end reads, but the process is hardcoded to expect paired-end reads (two input files at a time). + +Fix this by adding Groovy logic to the `FASTP` process `script:` block. An if/else statement checks read file count and adjusts the command accordingly. + +=== "After" + + ```groovy title="main.nf" linenums="10" hl_lines="3 5 15" + script: + // Simple single-end vs paired-end detection + def is_single = reads instanceof List ? reads.size() == 1 : true + + if (is_single) { + def input_file = reads instanceof List ? reads[0] : reads + """ + fastp \\ + --in1 ${input_file} \\ + --out1 ${meta.id}_trimmed.fastq.gz \\ + --json ${meta.id}.fastp.json \\ + --html ${meta.id}.fastp.html \\ + --thread $task.cpus + """ + } else { + """ + fastp \\ + --in1 ${reads[0]} \\ + --in2 ${reads[1]} \\ + --out1 ${meta.id}_trimmed_R1.fastq.gz \\ + --out2 ${meta.id}_trimmed_R2.fastq.gz \\ + --json ${meta.id}.fastp.json \\ + --html ${meta.id}.fastp.html \\ + --thread $task.cpus + """ + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="10" + script: + """ + fastp \\ + --in1 ${reads[0]} \\ + --in2 ${reads[1]} \\ + --out1 ${meta.id}_trimmed_R1.fastq.gz \\ + --out2 ${meta.id}_trimmed_R2.fastq.gz \\ + --json ${meta.id}.fastp.json \\ + --html ${meta.id}.fastp.html \\ + --thread $task.cpus + """ + } + ``` + +Now the workflow can handle both single-end and paired-end reads gracefully. The Groovy logic checks the number of input files and constructs the appropriate command for `fastp`. Let's see if it works: + +```bash title="Test dynamic fastp" +nextflow run main.nf +``` + +```console title="Successful run" + N E X T F L O W ~ version 25.04.6 + +Launching `main.nf` [adoring_rosalind] DSL2 - revision: 04b1cd93e9 + +executor > local (3) +[31/a8ad4d] process > FASTP (3) [100%] 3 of 3 ✔ +``` + +Looks good! 
If we check the actual commands that were run (customise for your task hash): + +```console title="Check commands executed" +cat work/31/a8ad4d95749e685a6d842d3007957f/.command.sh +``` + +We can see that Nextflow correctly picked the right command for single-end reads: + +```bash title=".command.sh" +#!/bin/bash -ue +fastp \ + --in1 SAMPLE_003_S3_L001_R1_001.fastq \ + --out1 sample_003_trimmed.fastq.gz \ + --json sample_003.fastp.json \ + --html sample_003.fastp.html \ + --thread 2 +``` + +Another common usage of dynamic script logic can be seen in [the Nextflow for Science Genomics module](../../nf4science/genomics/02_joint_calling). In that module, the GATK process being called can take multiple input files, but each must be prefixed with `-V` to form a correct command line. The process uses Groovy logic to transform a collection of input files (`all_gvcfs`) into the correct command arguments: + +```groovy title="command line manipulation for GATK" linenums="1" hl_lines="2 5" + script: + def gvcfs_line = all_gvcfs.collect { gvcf -> "-V ${gvcf}" }.join(' ') + """ + gatk GenomicsDBImport \ + ${gvcfs_line} \ + -L ${interval_list} \ + --genomicsdb-workspace-path ${cohort_name}_gdb + """ +``` + +These patterns of using Groovy logic in process script blocks are extremely powerful and can be applied in many scenarios - from handling variable input types to building complex command-line arguments from file collections, making your processes truly adaptable to the diverse requirements of real-world data. + +### 2.3. Variable Interpolation: Groovy, Bash, and Shell Variables + +Process scripts mix Nextflow variables, shell variables, and command substitutions, each with different interpolation syntax. Using the wrong syntax causes errors. Let's explore these with a process that creates a processing report. + +Take a look a the module file `modules/generate_report.nf`: + +```groovy title="modules/generate_report.nf" linenums="1" +process GENERATE_REPORT { + + publishDir 'results/reports', mode: 'copy' + + input: + tuple val(meta), path(reads) + + output: + path "${meta.id}_report.txt" + + script: + """ + echo "Processing ${reads}" > ${meta.id}_report.txt + echo "Sample: ${meta.id}" >> ${meta.id}_report.txt + """ +} +``` + +This process writes a simple report with the sample ID and filename. Now let's run it to see what happens when we need to mix different types of variables. + +Include the process in your `main.nf` and add it to the workflow: + +=== "After" + + ```groovy title="main.nf" linenums="1" hl_lines="2 12" + include { FASTP } from './modules/fastp.nf' + include { GENERATE_REPORT } from './modules/generate_report.nf' + + // ... separateMetadata function ... + + workflow { + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map{ row -> separateMetadata(row) } + + ch_fastp = FASTP(ch_samples) + GENERATE_REPORT(ch_samples) + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="1" hl_lines="1 10" + include { FASTP } from './modules/fastp.nf' + + // ... separateMetadata function ... + + workflow { + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map{ row -> separateMetadata(row) } + + ch_fastp = FASTP(ch_samples) + } + ``` + +Now run the workflow and check the generated reports in `results/reports/`. They should contain basic information about each sample. + +But what if we want to add information about when and where the processing occurred? 
Let's modify the process to use **shell** variables and a bit of command substitution to include the current user, hostname, and date in the report: + +=== "After" + + ```groovy title="modules/generate_report.nf" linenums="10" hl_lines="5-7" + script: + """ + echo "Processing ${reads}" > ${meta.id}_report.txt + echo "Sample: ${meta.id}" >> ${meta.id}_report.txt + echo "Processed by: ${USER}" >> ${meta.id}_report.txt + echo "Hostname: $(hostname)" >> ${meta.id}_report.txt + echo "Date: $(date)" >> ${meta.id}_report.txt + """ + ``` + +=== "Before" + + ```groovy title="modules/generate_report.nf" linenums="10" + script: + """ + echo "Processing ${reads}" > ${meta.id}_report.txt + echo "Sample: ${meta.id}" >> ${meta.id}_report.txt + """ + ``` + +If you run this, you'll notice an error or unexpected behavior - Nextflow tries to interpret `$(hostname)` as a Groovy variable that doesn't exist: + +```console title="Error with shell variables" +unknown recognition error type: groovyjarjarantlr4.v4.runtime.LexerNoViableAltException +ERROR ~ Module compilation error +- file : /workspaces/training/side-quests/groovy_essentials/modules/generate_report.nf +- cause: token recognition error at: '(' @ line 16, column 22. + echo "Hostname: $(hostname)" >> ${meta.id}_report.txt + ^ + +1 error +``` + +We need to escape it so Bash can handle it instead. + +Fix this by escaping the shell variables and command substitutions with a backslash (`\`): + +=== "After - Fixed" + + ```groovy title="modules/generate_report.nf" linenums="10" hl_lines="5-7" + script: + """ + echo "Processing ${reads}" > ${meta.id}_report.txt + echo "Sample: ${meta.id}" >> ${meta.id}_report.txt + echo "Processed by: \${USER}" >> ${meta.id}_report.txt + echo "Hostname: \$(hostname)" >> ${meta.id}_report.txt + echo "Date: \$(date)" >> ${meta.id}_report.txt + """ + ``` + +=== "Before - Broken" + + ```groovy title="modules/generate_report.nf" linenums="10" + script: + """ + echo "Processing ${reads}" > ${meta.id}_report.txt + echo "Sample: ${meta.id}" >> ${meta.id}_report.txt + echo "Processed by: ${USER}" >> ${meta.id}_report.txt + echo "Hostname: $(hostname)" >> ${meta.id}_report.txt + echo "Date: $(date)" >> ${meta.id}_report.txt + """ + ``` + +Now it works! The backslash (`\`) tells Nextflow "don't interpret this, pass it through to Bash." + +### Takeaway + +In this section, you've learned **Groovy string processing** techniques: + +- **Regular expressions for file parsing**: Using Groovy's `=~` operator and regex patterns (`~/pattern/`) to extract metadata from complex file naming conventions +- **Dynamic script generation**: Using Groovy conditional logic (if/else, ternary operators) to generate different script strings based on input characteristics +- **Variable interpolation**: Understanding when Nextflow interprets strings vs when the shell does + - `${var}` - Groovy/Nextflow variables (interpolated by Nextflow at workflow compile time) + - `\${var}` - Shell environment variables (escaped, passed to bash at runtime) + - `\$(cmd)` - Shell command substitution (escaped, executed by bash at runtime) + +These string processing and generation patterns are essential for handling the diverse file formats and naming conventions you'll encounter in real-world bioinformatics workflows. + +--- + +## 3. Creating Reusable Functions + +Complex workflow logic inline in channel operators or process definitions reduces readability and maintainability. 
**Groovy functions** let you extract this logic into named, reusable components—this is core Groovy programming, not Nextflow-specific syntax. + +Our map operation has grown long and complex. Let's extract it into a reusable Groovy function using the `def` keyword. + +To illustrate what that looks like with our existing workflow, make the modification below, using `def` to define a reusable function called `separateMetadata`: + +=== "After" + + ```groovy title="main.nf" linenums="1" hl_lines="4-24 29" + include { FASTP } from './modules/fastp.nf' + include { GENERATE_REPORT } from './modules/generate_report.nf' + + def separateMetadata(row) { + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def fastq_path = file(row.file_path) + + def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) + def file_meta = m ? [ + sample_num: m[0][2].toInteger(), + lane: m[0][3], + read: m[0][4], + chunk: m[0][5] + ] : [:] + + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return [sample_meta + file_meta + [priority: priority], fastq_path] + } + + workflow { + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map{ row -> separateMetadata(row) } + + ch_fastp = FASTP(ch_samples) + GENERATE_REPORT(ch_samples) + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="1" hl_lines="7-27" + include { FASTP } from './modules/fastp.nf' + include { GENERATE_REPORT } from './modules/generate_report.nf' + + workflow { + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def fastq_path = file(row.file_path) + + def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) + def file_meta = m ? [ + sample_num: m[0][2].toInteger(), + lane: m[0][3], + read: m[0][4], + chunk: m[0][5] + ] : [:] + + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return [sample_meta + file_meta + [priority: priority], fastq_path] + } + + ch_fastp = FASTP(ch_samples) + GENERATE_REPORT(ch_samples) + } + ``` + +By extracting this logic into a function, we've reduced the actual workflow logic down to something much cleaner: + +```groovy title="minimal workflow" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map{ row -> separateMetadata(row) } + + ch_fastp = FASTP(ch_samples) + GENERATE_REPORT(ch_samples) +``` + +This makes the workflow logic much easier to read and understand at a glance. The function `separateMetadata` encapsulates all the complex logic for parsing and enriching metadata, making it reusable and testable. + +Run the workflow to make sure it still works: + +```bash title="Test reusable function" +nextflow run main.nf +``` + +```console title="Function results" + N E X T F L O W ~ version 25.04.6 + +Launching `main.nf` [admiring_panini] DSL2 - revision: 8cc832e32f + +executor > local (6) +[8c/2e3f91] process > FASTP (3) [100%] 3 of 3 ✔ +[7a/1b4c92] process > GENERATE_REPORT (3) [100%] 3 of 3 ✔ +``` + +The output should show both processes completing successfully. 
The workflow is now much cleaner and easier to maintain, with all the complex metadata processing logic encapsulated in the `separateMetadata` function. + +### Takeaway + +In this section, you've learned core **Groovy programming concepts**: + +- **Defining functions with `def`**: Groovy's keyword for creating named functions (like `def` in Python or `function` in JavaScript) +- **Function scope**: Functions defined at the script level are accessible throughout your Nextflow workflow +- **Return values**: Functions automatically return the last expression, or use explicit `return` +- **Cleaner code**: Extracting complex logic into functions is a fundamental software engineering practice in any language, including Groovy + +Next, we'll explore how to use Groovy closures in process directives for dynamic resource allocation. + +--- + +## 4. Dynamic Resource Directives with Closures + +So far we've used Groovy in the `script` block of processes. But **Groovy closures** (introduced in Section 1.1) are also incredibly useful in process directives, especially for dynamic resource allocation. Let's add resource directives to our FASTP process that adapt based on the sample characteristics. + +Currently, our FASTP process uses default resources. Let's make it smarter by allocating more CPUs for high-depth samples. Edit `modules/fastp.nf` to include a dynamic `cpus` directive and a static `memory` directive: + +=== "After" + + ```groovy title="modules/fastp.nf" linenums="1" hl_lines="4-5" + process FASTP { + container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' + + cpus { meta.depth > 40000000 ? 4 : 2 } + memory '2 GB' + + input: + tuple val(meta), path(reads) + ``` + +=== "Before" + + ```groovy title="modules/fastp.nf" linenums="1" + process FASTP { + container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' + + input: + tuple val(meta), path(reads) + ``` + +The closure `{ meta.depth > 40000000 ? 4 : 2 }` uses the **Groovy ternary operator** (covered in Section 1.1) and is evaluated for each task, allowing per-sample resource allocation. High-depth samples (>40M reads) get 4 CPUs, while others get 2 CPUs. + +!!! note "Accessing Input Variables in Directives" + + The closure can access any input variables (like `meta` here) because Nextflow evaluates these closures in the context of each task execution. + +Run the workflow again: + +```bash title="Test resource allocation" +nextflow run main.nf -no-ansi-log +``` + +We're using the `-no-ansi-log` option to make it easier to see the task hashes. 
+ +```console title="Resource allocation output" +N E X T F L O W ~ version 25.04.6 +Launching `main.nf` [fervent_albattani] DSL2 - revision: fa8f249759 +[bd/ff3d41] Submitted process > FASTP (2) +[a4/a3aab2] Submitted process > FASTP (1) +[48/6db0c9] Submitted process > FASTP (3) +[ec/83439d] Submitted process > GENERATE_REPORT (3) +[bd/15d7cc] Submitted process > GENERATE_REPORT (2) +[42/699357] Submitted process > GENERATE_REPORT (1) +``` + +You can check the exact `docker` command that was run to see the CPU allocation for any given task: + +```console title="Check docker command" +cat work/48/6db0c9e9d8aa65e4bb4936cd3bd59e/.command.run | grep "docker run" +``` + +You should see something like: + +```bash title="docker command" + docker run -i --cpu-shares 4096 --memory 2048m -e "NXF_TASK_WORKDIR" -v /workspaces/training/side-quests/groovy_essentials:/workspaces/training/side-quests/groovy_essentials -w "$NXF_TASK_WORKDIR" --name $NXF_BOXID community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690 /bin/bash -ue /workspaces/training/side-quests/groovy_essentials/work/48/6db0c9e9d8aa65e4bb4936cd3bd59e/.command.sh +``` + +In this example we've chosen an example that requested 4 CPUs (`--cpu-shares 4096`), because it was a high-depth sample, but you should see different CPU allocations depending on the sample depth. Try this for the other tasks as well. + +Another powerful pattern is using `task.attempt` for retry strategies. To show why this is useful, we're going to start by reducing the memory allocation to FASTP to less than it needs. Change the `memory` directive in `modules/fastp.nf` to `1.GB`: + +=== "After" + + ```groovy title="modules/fastp.nf" linenums="1" hl_lines="5" + process FASTP { + container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' + + cpus { meta.depth > 40000000 ? 4 : 2 } + memory '1 GB' + + input: + tuple val(meta), path(reads) + ``` + +=== "Before" + + ```groovy title="modules/fastp.nf" linenums="1" hl_lines="5" + process FASTP { + container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' + + cpus { meta.depth > 40000000 ? 4 : 2 } + memory '2 GB' + + input: + tuple val(meta), path(reads) + ``` + +... and run the workflow again: + +```bash title="Test insufficient memory" +nextflow run main.nf +``` + +You'll see an error indicating that the process was killed for exceeding memory limits: + +```console title="Memory error output" hl_lines="2 11" +Command exit status: + 137 + +Command output: + (empty) + +Command error: + Detecting adapter sequence for read1... + No adapter detected for read1 + + .command.sh: line 7: 101 Killed fastp --in1 SAMPLE_002_S2_L001_R1_001.fastq --out1 sample_002_trimmed.fastq.gz --json sample_002.fastp.json --html sample_002.fastp.html --thread 2 +``` + +This is a very common scenario in real-world workflows - sometimes you just don't know how much memory a task will need until you run it. To make our workflow more robust, we can implement a retry strategy that increases memory allocation on each attempt, once again using a Groovy closure. Modify the `memory` directive to multiply the base memory by `task.attempt`, and add `errorStrategy 'retry'` and `maxRetries 2` directives: + +=== "After" + + ```groovy title="modules/fastp.nf" linenums="1" hl_lines="5-7" + process FASTP { + container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' + + cpus { meta.depth > 40000000 ? 
4 : 2 }
        memory { 1.GB * task.attempt }
        errorStrategy 'retry'
        maxRetries 2

        input:
        tuple val(meta), path(reads)
    ```

=== "Before"

    ```groovy title="modules/fastp.nf" linenums="1" hl_lines="5"
    process FASTP {
        container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690'

        cpus { meta.depth > 40000000 ? 4 : 2 }
        memory '1 GB'

        input:
        tuple val(meta), path(reads)
    ```

Now if the process fails due to insufficient memory, Nextflow will retry with more memory:

- First attempt: 1 GB (task.attempt = 1)
- Second attempt: 2 GB (task.attempt = 2)

... and so on, up to the `maxRetries` limit.

### Takeaway

Dynamic directives with Groovy closures let you:

- Allocate resources based on input characteristics
- Implement automatic retry strategies with increasing resources
- Combine multiple factors (metadata, attempt number, priorities)
- Use Groovy logic for complex resource calculations

This makes your workflows both more efficient (not over-allocating) and more robust (automatic retry with more resources).

---

## 5. Conditional Logic and Process Control

Previously, we used `.map()` with Groovy to transform channel data. Now we'll use Groovy to control which processes execute based on data - essential for flexible workflows that adapt to different sample types.

Nextflow's [flow control operators](https://www.nextflow.io/docs/latest/reference/operator.html) take closures evaluated at runtime, enabling Groovy logic to drive workflow decisions based on channel content.

### 5.1. Routing with `.branch()`

For example, let's pretend that our sequencing samples need to be trimmed with FASTP only if they're human samples with a coverage above a certain threshold. Mouse samples or low-coverage samples should be run with Trimgalore instead (this is a contrived example, but it illustrates the point).

We've provided a simple Trimgalore process in `modules/trimgalore.nf`; take a look if you like, but the details aren't important for this exercise. The key point is that we want to route samples based on their metadata.

Include the new process from `modules/trimgalore.nf`:

=== "After"

    ```groovy title="main.nf" linenums="1" hl_lines="2"
    include { FASTP } from './modules/fastp.nf'
    include { TRIMGALORE } from './modules/trimgalore.nf'
    ```

=== "Before"

    ```groovy title="main.nf" linenums="1"
    include { FASTP } from './modules/fastp.nf'
    ```

...
and then modify your `main.nf` workflow to branch samples based on their metadata and route them through the appropriate trimming process, like this: + +=== "After" + + ```groovy title="main.nf" linenums="28" hl_lines="5-12" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map(separateMetadata) + + trim_branches = ch_samples + .branch { meta, reads -> + fastp: meta.organism == 'human' && meta.depth >= 30000000 + trimgalore: true + } + + ch_fastp = FASTP(trim_branches.fastp) + ch_trimgalore = TRIMGALORE(trim_branches.trimgalore) + GENERATE_REPORT(ch_samples) + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="28" hl_lines="5" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map(separateMetadata) + + ch_fastp = FASTP(ch_samples) + GENERATE_REPORT(ch_samples) + ``` + +Run this modified workflow: + +```bash title="Test conditional trimming" +nextflow run main.nf +``` + +```console title="Conditional trimming results" + N E X T F L O W ~ version 25.04.6 + +Launching `main.nf` [adoring_galileo] DSL2 - revision: c9e83aaef1 + +executor > local (6) +[1d/0747ac] process > FASTP (2) [100%] 2 of 2 ✔ +[cc/c44caf] process > TRIMGALORE (1) [100%] 1 of 1 ✔ +[34/bd5a9f] process > GENERATE_REPORT (1) [100%] 3 of 3 ✔ +``` + +Here, we've used small but mighty Groovy expressions inside the `.branch{}` operator to route samples based on their metadata. Human samples with high coverage go through `FASTP`, while all other samples go through `TRIMGALORE`. + +### 5.2. Using `.filter()` with Groovy Truth + +Another powerful pattern for controlling workflow execution is the `.filter()` operator, which uses a closure to determine which items should continue down the pipeline. Inside the filter closure, you'll write **Groovy boolean expressions** that decide which items pass through. + +Groovy has a concept called **"Groovy Truth"** that determines what values evaluate to `true` or `false` in boolean contexts: + +- **Truthy**: Non-null values, non-empty strings, non-zero numbers, non-empty collections +- **Falsy**: `null`, empty strings `""`, zero `0`, empty collections `[]` or `[:]`, `false` + +This means `meta.id` alone (without explicit `!= null`) checks if the ID exists and isn't empty. Let's use this to filter out samples that don't meet our quality requirements. 
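To see Groovy Truth on its own, here's a small hypothetical sketch (the map values are invented purely for illustration):

```groovy
def meta = [id: 'sample_001', organism: '', notes: null]

println meta.id       ? 'id present'       : 'id missing'         // id present (non-empty string is truthy)
println meta.organism ? 'organism present' : 'organism missing'   // organism missing (empty string is falsy)
println meta.notes    ? 'notes present'    : 'notes missing'      // notes missing (null is falsy)
```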
+
+Add the following before the branch operation:
+
+=== "After"
+
+    ```groovy title="main.nf" linenums="28" hl_lines="5-9 11"
+    ch_samples = Channel.fromPath("./data/samples.csv")
+        .splitCsv(header: true)
+        .map(separateMetadata)
+
+    // Filter out invalid or low-quality samples
+    ch_valid_samples = ch_samples
+        .filter { meta, reads ->
+            meta.id && meta.organism && meta.depth >= 30000000
+        }
+
+    trim_branches = ch_valid_samples
+        .branch { meta, reads ->
+            fastp: meta.organism == 'human' && meta.depth >= 30000000
+            trimgalore: true
+        }
+    ```
+
+=== "Before"
+
+    ```groovy title="main.nf" linenums="28" hl_lines="5"
+    ch_samples = Channel.fromPath("./data/samples.csv")
+        .splitCsv(header: true)
+        .map(separateMetadata)
+
+    trim_branches = ch_samples
+        .branch { meta, reads ->
+            fastp: meta.organism == 'human' && meta.depth >= 30000000
+            trimgalore: true
+        }
+    ```
+
+Run the workflow again:
+
+```bash title="Test filtering samples"
+nextflow run main.nf
+```
+
+Because the filter now excludes sample_002 (25 million reads, below the 30 million threshold), fewer tasks are executed. Sample_002 was also our only mouse sample, so the `trimgalore` branch receives nothing at all:
+
+```console title="Filtered samples results"
+ N E X T F L O W   ~  version 25.04.6
+
+Launching `main.nf` [deadly_woese] DSL2 - revision: 9a6044a969
+
+executor >  local (5)
+[01/7b1483] process > FASTP (2)           [100%] 2 of 2 ✔
+[-        ] process > TRIMGALORE          -
+[07/ef53af] process > GENERATE_REPORT (3) [100%] 3 of 3 ✔
+```
+
+The filter expression `meta.id && meta.organism && meta.depth >= 30000000` combines Groovy Truth with explicit comparisons:
+
+- `meta.id && meta.organism` checks that both fields exist and are non-empty (using Groovy Truth)
+- `meta.depth >= 30000000` ensures sufficient sequencing depth with an explicit comparison
+
+!!! note "Groovy Truth in Practice"
+
+    The expression `meta.id && meta.organism` is more concise than writing:
+
+    ```groovy
+    meta.id != null && meta.id != '' && meta.organism != null && meta.organism != ''
+    ```
+
+    This makes filtering logic much cleaner and easier to read.
+
+### Takeaway
+
+In this section, you've learned to use Groovy logic to control workflow execution using the closure interfaces of Nextflow operators like `.branch{}` and `.filter{}`, leveraging Groovy Truth to write concise conditional expressions.
+
+Our pipeline now intelligently routes samples through appropriate processes, but production workflows need to handle invalid data gracefully. Let's make our workflow robust against missing or null values.
+
+---
+
+## 6. Safe Navigation and Elvis Operators
+
+Our `separateMetadata` function currently assumes all CSV fields are present and valid. But what happens with incomplete data? Let's find out.
+
+### 6.1. The Problem: Accessing Properties That Don't Exist
+
+Let's say we want to add support for optional sequencing run information. In some labs, samples might have an additional field for the sequencing run ID or batch number, but our current CSV doesn't have this column. Let's try to access it anyway.
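+
+Before touching the workflow, it's worth seeing the failure mode in miniature. In plain Groovy, reading a key that doesn't exist on a map quietly returns `null`; the crash only happens when you then call a method on that `null`. The map below is just a made-up stand-in for one parsed CSV row:
+
+```groovy title="Why the crash happens"
+// A made-up stand-in for a parsed CSV row - note there is no run_id key
+def row = [sample_id: 'SAMPLE_001', organism: 'human']
+
+println row.run_id               // prints: null (missing keys return null, no error yet)
+println row.run_id.toUpperCase() // throws: Cannot invoke method toUpperCase() on null object
+```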
+ +Modify the `separateMetadata` function to include a run_id field: + +=== "After" + + ```groovy title="main.nf" linenums="5" hl_lines="9" + def separateMetadata(row) { + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def run_id = row.run_id.toUpperCase() + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="5" + def separateMetadata(row) { + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + ``` + +Now run the workflow: + +```bash +nextflow run main.nf +``` + +It crashes with a NullPointerException: + +```console title="Null pointer error" + N E X T F L O W ~ version 25.04.6 + +Launching `main.nf` [trusting_torvalds] DSL2 - revision: b56fbfbce2 + +ERROR ~ Cannot invoke method toUpperCase() on null object + + -- Check script 'main.nf' at line: 13 or see '.nextflow.log' file for more details +``` + +The problem is that `row.run_id` returns `null` because the `run_id` column doesn't exist in our CSV. When we try to call `.toUpperCase()` on `null`, it crashes. This is where Groovy's safe navigation operator saves the day. + +### 6.2. Safe Navigation Operator (`?.`) + +The safe navigation operator (`?.`) returns `null` instead of throwing an exception when called on a `null` value. If the object before `?.` is `null`, the entire expression evaluates to `null` without executing the method. + +Update the function to use safe navigation: + +=== "After" + + ```groovy title="main.nf" linenums="4" hl_lines="9" + def separateMetadata(row) { + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def run_id = row.run_id?.toUpperCase() + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="4" hl_lines="9" + def separateMetadata(row) { + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def run_id = row.run_id.toUpperCase() + ``` + +Run again: + +```bash +nextflow run main.nf +``` + +No crash! The workflow now handles the missing field gracefully. When `row.run_id` is `null`, the `?.` operator prevents the `.toUpperCase()` call, and `run_id` becomes `null` instead of causing an exception. + +### 6.3. Elvis Operator (`?:`) for Defaults + +The Elvis operator (`?:`) provides default values when the left side is `null` (or empty, in Groovy's "truth" evaluation). It's named after Elvis Presley because `?:` looks like his famous hair and eyes when viewed sideways! + +Now that we're using safe navigation, `run_id` will be `null` for samples without that field. 
Let's use the Elvis operator to provide a default value and add it to our `sample_meta` map: + +=== "After" + + ```groovy title="main.nf" linenums="5" hl_lines="9-10" + def separateMetadata(row) { + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def run_id = row.run_id?.toUpperCase() ?: 'UNSPECIFIED' + sample_meta.run = run_id + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="5" hl_lines="9" + def separateMetadata(row) { + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def run_id = row.run_id?.toUpperCase() + ``` + +Also add a `view()` operator in the workflow to see the results: + +=== "After" + + ```groovy title="main.nf" linenums="30" hl_lines="4" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map{ row -> separateMetadata(row) } + .view() + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="30" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map{ row -> separateMetadata(row) } + ``` + +and run the workflow: + +```bash +nextflow run main.nf +``` + +You'll see output like this: + +```console title="View output with run field" +[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, run:UNSPECIFIED, sample_num:1, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq] +[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, run:UNSPECIFIED, sample_num:2, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq] +[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, run:UNSPECIFIED, sample_num:3, lane:001, read:R1, chunk:001, priority:high], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq] +``` + +Perfect! Now all samples have a `run` field with either their actual run ID (in uppercase) or the default value 'UNSPECIFIED'. The combination of `?.` and `?:` provides both safety (no crashes) and sensible defaults. + +Take out the `.view()` operator now that we've confirmed it works. + +!!! tip "Combining Safe Navigation and Elvis" + + The pattern `value?.method() ?: 'default'` is common in production Nextflow: + + - `value?.method()` - Safely calls method, returns `null` if `value` is `null` + - `?: 'default'` - Provides fallback if result is `null` + + This pattern handles missing/incomplete data gracefully. + +Use these operators consistently in functions, operator closures (`.map{}`, `.filter{}`), process scripts, and config files. They prevent crashes when handling real-world data. + +### Takeaway + +- **Safe navigation (`?.`)**: Prevents crashes on null values - returns null instead of throwing exception +- **Elvis operator (`?:`)**: Provides defaults - `value ?: 'default'` +- **Combining**: `value?.method() ?: 'default'` is the common pattern + +These operators make workflows resilient to incomplete data - essential for real-world bioinformatics. + +--- + +## 7. 
Validation with `error()` and `log.warn` + +Sometimes you need to stop the workflow immediately if input parameters are invalid. While `error()` and `log.warn` are Nextflow-provided functions, the **validation logic itself is pure Groovy**—using conditionals (`if`, `!`), boolean logic, and methods like `.exists()`. Let's add validation to our workflow. + +Create a validation function before your workflow block, call it from the workflow, and change the channel creation to use a parameter for the CSV file path. If the parameter is missing or the file doesn't exist, call `error()` to stop execution with a clear message. + +=== "After" + + ```groovy title="main.nf" linenums="1" hl_lines="5-20 23-24" + include { FASTP } from './modules/fastp.nf' + include { TRIMGALORE } from './modules/trimgalore.nf' + include { GENERATE_REPORT } from './modules/generate_report.nf' + + def validateInputs() { + // Check input parameter is provided + if (!params.input) { + error("Input CSV file path not provided. Please specify --input ") + } + + // Check CSV file exists + if (!file(params.input).exists()) { + error("Input CSV file not found: ${params.input}") + } + } + ... + workflow { + validateInputs() + ch_samples = Channel.fromPath(params.input) + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="1" + include { FASTP } from './modules/fastp.nf' + include { TRIMGALORE } from './modules/trimgalore.nf' + include { GENERATE_REPORT } from './modules/generate_report.nf' + + ... + workflow { + ch_samples = Channel.fromPath("./data/samples.csv") + ``` + +Now try running without the CSV file: + +```bash +nextflow run main.nf +``` + +The workflow stops immediately with a clear error message instead of failing mysteriously later! + +```console title="Validation error output" + N E X T F L O W ~ version 25.04.6 + +Launching `main.nf` [confident_coulomb] DSL2 - revision: 07059399ed + +WARN: Access to undefined parameter `input` -- Initialise it to a default value eg. `params.input = some_value` +Input CSV file path not provided. Please specify --input +``` + +You can also add validation within the `separateMetadata` function. Let's use the non-fatal `log.warn` to issue warnings for samples with low sequencing depth, but still allow the workflow to continue: + +=== "After" + + ```groovy title="main.nf" linenums="1" hl_lines="3-6" + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + + // Validate data makes sense + if (sample_meta.depth < 30000000) { + log.warn "Low sequencing depth for ${sample_meta.id}: ${sample_meta.depth}" + } + + return [sample_meta + file_meta + [priority: priority], fastq_path] + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="1" + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + + return [sample_meta + file_meta + [priority: priority], fastq_path] + } + ``` + +Run the workflow again with the original CSV: + +```bash +nextflow run main.nf --input ./data/samples.csv +``` + +... 
and you'll see a warning about low sequencing depth for one of the samples: + +```console title="Warning output" + N E X T F L O W ~ version 25.04.6 + +Launching `main.nf` [awesome_goldwasser] DSL2 - revision: a31662a7c1 + +executor > local (5) +[ce/df5eeb] process > FASTP (2) [100%] 2 of 2 ✔ +[- ] process > TRIMGALORE - +[d1/7d2b4b] process > GENERATE_REPORT (3) [100%] 3 of 3 ✔ +WARN: Low sequencing depth for sample_002: 25000000 +``` + +### Takeaway + +- **`error()`**: Stops workflow immediately with clear message +- **`log.warn`**: Issues warnings without stopping workflow +- **Early validation**: Check inputs before processing to fail fast with helpful errors +- **Validation functions**: Create reusable validation logic that can be called at workflow start + +Proper validation makes workflows more robust and user-friendly by catching problems early with clear error messages. + +--- + +## 8. Groovy in Configuration: Workflow Event Handlers + +Up until now, we've been writing Groovy code in our workflow scripts and process definitions. But there's one more important place where Groovy is essential: workflow event handlers in your `nextflow.config` file (or other places you write configuration). + +Event handlers are Groovy closures that run at specific points in your workflow's lifecycle. They're perfect for adding logging, notifications, or cleanup operations without cluttering your main workflow code. + +### 8.1. The `onComplete` Handler + +The most commonly used event handler is `onComplete`, which runs when your workflow finishes (whether it succeeded or failed). Let's add one to summarize our pipeline results. + +Your `nextflow.config` file already has Docker enabled. Add an event handler after the existing configuration: + +=== "After" + + ```groovy title="nextflow.config" linenums="1" hl_lines="5-15" + // Nextflow configuration for Groovy Essentials side quest + + docker.enabled = true + + workflow.onComplete = { + println "" + println "Pipeline execution summary:" + println "==========================" + println "Completed at: ${workflow.complete}" + println "Duration : ${workflow.duration}" + println "Success : ${workflow.success}" + println "workDir : ${workflow.workDir}" + println "exit status : ${workflow.exitStatus}" + println "" + } + ``` + +=== "Before" + + ```groovy title="nextflow.config" linenums="1" + // Nextflow configuration for Groovy Essentials side quest + + docker.enabled = true + ``` + +This is a Groovy closure being assigned to `workflow.onComplete`. Inside, you have access to the `workflow` object which provides useful properties about the execution. + +Run your workflow and you'll see this summary appear at the end! 
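+
+Because the handler body is ordinary Groovy, you can prototype the formatting logic outside Nextflow first. Here is a minimal standalone sketch in plain Groovy, using an invented map to stand in for the real `workflow` object that Nextflow provides at run time:
+
+```groovy title="Prototyping the handler logic in plain Groovy"
+// Invented stand-in values; in a real run these come from Nextflow's workflow object
+def run_info = [complete: new Date(), duration: '2.9s', success: true, exitStatus: 0]
+
+def summarise = { info ->
+    println "Completed at: ${info.complete}"
+    println "Duration    : ${info.duration}"
+    println "Success     : ${info.success}"
+    println "exit status : ${info.exitStatus}"
+}
+
+summarise(run_info)
+```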
+ +```bash title="Run with onComplete handler" +nextflow run main.nf --input ./data/samples.csv -no-ansi-log +``` + +```console title="onComplete output" +N E X T F L O W ~ version 25.04.6 +Launching `main.nf` [marvelous_boltzmann] DSL2 - revision: a31662a7c1 +WARN: Low sequencing depth for sample_002: 25000000 +[9b/d48e40] Submitted process > FASTP (2) +[6a/73867a] Submitted process > GENERATE_REPORT (2) +[79/ad0ac5] Submitted process > GENERATE_REPORT (1) +[f3/bda6cb] Submitted process > FASTP (1) +[34/d5b52f] Submitted process > GENERATE_REPORT (3) + +Pipeline execution summary: +========================== +Completed at: 2025-10-10T12:14:24.885384+01:00 +Duration : 2.9s +Success : true +workDir : /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/work +exit status : 0 +``` + +Let's make it more useful by adding conditional logic: + +=== "After" + + ```groovy title="nextflow.config" linenums="5" hl_lines="11-18" + workflow.onComplete = { + println "" + println "Pipeline execution summary:" + println "==========================" + println "Completed at: ${workflow.complete}" + println "Duration : ${workflow.duration}" + println "Success : ${workflow.success}" + println "workDir : ${workflow.workDir}" + println "exit status : ${workflow.exitStatus}" + println "" + + if (workflow.success) { + println "✅ Pipeline completed successfully!" + println "Results are in: ${params.outdir ?: 'results'}" + } else { + println "❌ Pipeline failed!" + println "Error: ${workflow.errorMessage}" + } + } + ``` + +=== "Before" + + ```groovy title="nextflow.config" linenums="5" + workflow.onComplete = { + println "" + println "Pipeline execution summary:" + println "==========================" + println "Completed at: ${workflow.complete}" + println "Duration : ${workflow.duration}" + println "Success : ${workflow.success}" + println "workDir : ${workflow.workDir}" + println "exit status : ${workflow.exitStatus}" + println "" + } + ``` + +Now we get an even more informative summary, including a success/failure message and the output directory if specified: + +```console title="Enhanced onComplete output" +N E X T F L O W ~ version 25.04.6 +Launching `main.nf` [boring_linnaeus] DSL2 - revision: a31662a7c1 +WARN: Low sequencing depth for sample_002: 25000000 +[e5/242efc] Submitted process > FASTP (2) +[3b/74047c] Submitted process > GENERATE_REPORT (3) +[8a/7a57e6] Submitted process > GENERATE_REPORT (1) +[a8/b1a31f] Submitted process > GENERATE_REPORT (2) +[40/648429] Submitted process > FASTP (1) + +Pipeline execution summary: +========================== +Completed at: 2025-10-10T12:16:00.522569+01:00 +Duration : 3.6s +Success : true +workDir : /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/work +exit status : 0 + +✅ Pipeline completed successfully! +``` + +You can also write the summary to a file using Groovy file operations: + +```groovy title="nextflow.config - Writing summary to file" +workflow.onComplete = { + def summary = """ + Pipeline Execution Summary + =========================== + Completed: ${workflow.complete} + Duration : ${workflow.duration} + Success : ${workflow.success} + Command : ${workflow.commandLine} + """ + + println summary + + // Write to a log file + def log_file = file("${workflow.launchDir}/pipeline_summary.txt") + log_file.text = summary +} +``` + +### 8.2. 
Other Useful Event Handlers + +Besides `onComplete`, there are other event handlers you can use: + +**`onStart`** - Runs when the workflow begins: + +```groovy title="nextflow.config - onStart handler" +workflow.onStart = { + println "="* 50 + println "Starting pipeline: ${workflow.runName}" + println "Project directory: ${workflow.projectDir}" + println "Launch directory: ${workflow.launchDir}" + println "Work directory: ${workflow.workDir}" + println "="* 50 +} +``` + +**`onError`** - Runs only if the workflow fails: + +```groovy title="nextflow.config - onError handler" +workflow.onError = { + println "="* 50 + println "Pipeline execution failed!" + println "Error message: ${workflow.errorMessage}" + println "="* 50 + + // Write detailed error log + def error_file = file("${workflow.launchDir}/error.log") + error_file.text = """ + Workflow Error Report + ===================== + Time: ${new Date()} + Error: ${workflow.errorMessage} + Error report: ${workflow.errorReport ?: 'No detailed report available'} + """ + + println "Error details written to: ${error_file}" +} +``` + +You can use multiple handlers together: + +```groovy title="nextflow.config - Combined handlers" +workflow.onStart = { + println "Starting ${workflow.runName} at ${workflow.start}" +} + +workflow.onError = { + println "Workflow failed: ${workflow.errorMessage}" +} + +workflow.onComplete = { + def duration_mins = workflow.duration.toMinutes().round(2) + def status = workflow.success ? "SUCCESS ✅" : "FAILED ❌" + + println """ + Pipeline finished: ${status} + Duration: ${duration_mins} minutes + """ +} +``` + +### Takeaway + +In this section, you've learned: + +- **Event handler closures**: Groovy closures in `nextflow.config` that run at different lifecycle points +- **`onComplete` handler**: For execution summaries and result reporting +- **`onStart` handler**: For logging pipeline initialization +- **`onError` handler**: For error handling and logging failures +- **Workflow object properties**: Accessing `workflow.success`, `workflow.duration`, `workflow.errorMessage`, etc. + +Event handlers are pure Groovy code running in your config file, demonstrating that Nextflow configuration is actually a Groovy script with access to the full language. + +--- + +## Summary + +Throughout this side quest, you've built a comprehensive sample processing pipeline that evolved from basic metadata handling to a sophisticated, production-ready workflow. Each section built upon the previous, demonstrating how Groovy transforms simple Nextflow workflows into powerful data processing systems. + +Here's how we progressively enhanced our pipeline: + +1. **Nextflow vs Groovy Boundaries**: You learned to distinguish between workflow orchestration (Nextflow) and programming logic (Groovy), including the crucial differences between constructs like `collect`. + +2. **Advanced String Processing**: You mastered regular expressions for parsing file names, dynamic script generation in processes, and variable interpolation (Groovy vs Bash vs Shell). + +3. **Creating Reusable Functions**: You learned to extract complex logic into named functions that can be called from channel operators, making workflows more readable and maintainable. + +4. **Dynamic Resource Directives with Closures**: You explored using Groovy closures in process directives for adaptive resource allocation based on input characteristics. + +5. 
**Conditional Logic and Process Control**: You added intelligent routing using `.branch()` and `.filter()` operators, leveraging Groovy Truth for concise conditional expressions. + +6. **Safe Navigation and Elvis Operators**: You made the pipeline robust against missing data using `?.` for null-safe property access and `?:` for providing default values. + +7. **Validation with error() and log.warn**: You learned to validate inputs early and fail fast with clear error messages. + +8. **Groovy in Configuration**: You learned to use workflow event handlers (`onComplete`, `onStart`, `onError`) for logging, notifications, and lifecycle management. + +### Key Benefits + +- **Clearer code**: Understanding when to use Nextflow and Groovy helps you write more organized workflows +- **Robust handling**: Safe navigation and Elvis operators make workflows resilient to missing data +- **Flexible processing**: Conditional logic lets your workflows process different sample types appropriately +- **Adaptive resources**: Dynamic directives optimize resource usage based on input characteristics + +### From Simple to Sophisticated + +This pipeline evolved from basic data processing to production-ready workflows: + +1. **Simple**: CSV processing and metadata extraction (Nextflow vs Groovy boundaries) +2. **Intelligent**: Regex parsing, variable interpolation, dynamic script generation +3. **Maintainable**: Reusable functions for cleaner, testable code +4. **Efficient**: Dynamic resource allocation and retry strategies +5. **Adaptive**: Conditional routing based on sample characteristics +6. **Robust**: Safe navigation, Elvis operators, early validation +7. **Observable**: Event handlers for logging and lifecycle management + +This progression mirrors the real-world evolution of bioinformatics pipelines - from research prototypes handling a few samples to production systems processing thousands of samples across laboratories and institutions. Every challenge you solved and pattern you learned reflects actual problems developers face when scaling Nextflow workflows. + +### Next Steps + +With these Groovy fundamentals mastered, you're ready to: + +- Write cleaner workflows with proper separation between Nextflow and Groovy logic +- Master variable interpolation to avoid common pitfalls with Groovy, Bash, and shell variables +- Use dynamic resource directives for efficient, adaptive workflows +- Transform file collections into properly formatted command-line arguments +- Handle different file naming conventions and input formats gracefully using regex and string processing +- Build reusable, maintainable code using advanced closure patterns and functional programming +- Process and organize complex datasets using collection operations +- Add validation, error handling, and logging to make your workflows production-ready +- Implement workflow lifecycle management with event handlers + +Continue practicing these patterns in your own workflows, and refer to the [Groovy documentation](http://groovy-lang.org/documentation.html) when you need to explore more advanced features. 
+ +### Key Concepts Reference + +- **Language Boundaries** + + ```groovy title="Nextflow vs Groovy examples" + // Nextflow: workflow orchestration + Channel.fromPath('*.fastq').splitCsv(header: true) + + // Groovy: data processing + sample_data.collect { it.toUpperCase() } + ``` + +- **String Processing** + + ```groovy title="String processing examples" + // Pattern matching + filename =~ ~/^(\w+)_(\w+)_(\d+)\.fastq$/ + + // Function with conditional return + def parseSample(filename) { + def matcher = filename =~ pattern + return matcher ? [valid: true, data: matcher[0]] : [valid: false] + } + + // File collection to command arguments (in process script block) + script: + def file_args = input_files.collect { file -> "--input ${file}" }.join(' ') + """ + analysis_tool ${file_args} --output results.txt + """ + ``` + +- **Error Handling** + + ```groovy title="Error handling patterns" + try { + def errors = validateSample(sample) + if (errors) throw new RuntimeException("Invalid: ${errors.join(', ')}") + } catch (Exception e) { + println "Error: ${e.message}" + } + ``` + +- **Essential Groovy Operators** + + ```groovy title="Essential operators examples" + // Safe navigation and Elvis operators + def id = data?.sample?.id ?: 'unknown' + if (sample.files) println "Has files" // Groovy Truth + + // Slashy strings for regex + def pattern = /^\w+_R[12]\.fastq$/ + def script = """ + echo "Processing ${sample.id}" + analysis --depth ${depth ?: 1_000_000} + """ + ``` + +- **Advanced Closures** + + ```groovy title="Advanced closure patterns" + // Named closures and composition + def enrichData = normalizeId >> addQualityCategory >> addFlags + def processor = generalFunction.curry(fixedParam) + + // Closures with scope access + def collectStats = { data -> stats.count++; return data } + ``` + +- **Collection Operations** + ```groovy title="Collection operations examples" + // Filter, group, and organize data + def high_quality = samples.findAll { it.quality > 40 } + def by_organism = samples.groupBy { it.organism } + def file_names = files*.getName() // Spread operator + def all_files = nested_lists.flatten() + ``` + +## Resources + +- [Groovy Documentation](http://groovy-lang.org/documentation.html) +- [Nextflow Operators](https://www.nextflow.io/docs/latest/operator.html) +- [Regular Expressions in Groovy](https://groovy-lang.org/syntax.html#_regular_expression_operators) +- [JSON Processing](https://groovy-lang.org/json.html) +- [XML Processing](https://groovy-lang.org/processing-xml.html) diff --git a/docs/side_quests/index.md b/docs/side_quests/index.md index 2eb905de8..f51e28bd6 100644 --- a/docs/side_quests/index.md +++ b/docs/side_quests/index.md @@ -30,6 +30,7 @@ Otherwise, select a side quest from the table below. 
| Side Quest | Time Estimate for Teaching | | -------------------------------------------------------------------------- | -------------------------- | +| [Workflow Management Fundamentals](./workflow_management_fundamentals.md) | 1.5 hours | | [Nextflow development environment walkthrough](./ide_features.md) | 45 mins | | [Essential Nextflow Scripting Patterns](./essential_scripting_patterns.md) | 90 mins | | [Introduction to nf-core](./nf-core.md) | - | diff --git a/docs/side_quests/workflow_management_fundamentals.md b/docs/side_quests/workflow_management_fundamentals.md new file mode 100644 index 000000000..74224fd29 --- /dev/null +++ b/docs/side_quests/workflow_management_fundamentals.md @@ -0,0 +1,1329 @@ +# Workflow Management Fundamentals + +If you're coming from a background of writing shell scripts, Python scripts, or other sequential data processing scripts, Nextflow might seem like unnecessary complexity. Why learn a whole new framework when your bash script does the job? + +This side quest takes a different approach: **you'll build a bash script yourself**, experience its limitations firsthand, and then discover how Nextflow solves each problem as it emerges. By the end, you'll understand not just *what* workflow management does, but *why* it exists. + +This tutorial uses a single exemplar throughout: a quality control and assembly pipeline for bacterial genome sequencing data. + +**What you'll learn:** + +- **Why parallelization matters:** Experience sequential processing delays, then see automatic parallelization +- **Why software isolation matters:** Hit version conflicts, then see container-based solutions +- **Why resume capability matters:** Lose hours of work to a failure, then see instant recovery +- **Why portability matters:** See environment-dependent code, then see write-once-run-anywhere +- **Why resource management matters:** Experience resource bottlenecks, then see declarative solutions + +--- + +## 0. Warmup + +### 0.1. Prerequisites + +Before taking on this side quest you should: + +- Be comfortable with command-line basics (file navigation, running commands, writing simple bash scripts) +- Have basic familiarity with bioinformatics concepts (FASTQ files, quality control, assembly) +- Understand what containers are at a high level (isolated software environments) + +You do NOT need: + +- Prior Nextflow experience (we'll teach you as we go) +- Deep knowledge of Groovy or any programming language +- Experience with workflow management systems + +### 0.2. The Scenario + +You're analyzing bacterial genome sequences. Your workflow involves: + +1. Quality control of raw sequencing reads (FastQC) +2. Adapter trimming and filtering (fastp) +3. Genome assembly (SPAdes) +4. Assembly quality assessment (QUAST) + +You have 3 samples to process initially, but your PI just said there are 20 more coming next week... + +### 0.3. Starting Point + +Navigate to the project directory: + +```bash title="Navigate to project directory" +cd side-quests/workflow_management_fundamentals +``` + +The directory contains sample data: + +```console title="Directory contents" +> tree +. 
+├── data +│ ├── samples.csv +│ └── reads +│ ├── sample_01_R1.fastq.gz +│ ├── sample_01_R2.fastq.gz +│ ├── sample_02_R1.fastq.gz +│ ├── sample_02_R2.fastq.gz +│ ├── sample_03_R1.fastq.gz +│ └── sample_03_R2.fastq.gz +└── README.md + +3 directories, 8 files +``` + +Our sample CSV contains metadata: + +```csv title="data/samples.csv" +sample_id,organism,read1,read2 +sample_01,E.coli,data/reads/sample_01_R1.fastq.gz,data/reads/sample_01_R2.fastq.gz +sample_02,S.aureus,data/reads/sample_02_R1.fastq.gz,data/reads/sample_02_R2.fastq.gz +sample_03,P.aeruginosa,data/reads/sample_03_R1.fastq.gz,data/reads/sample_03_R2.fastq.gz +``` + +!!! note "About running the commands" + + This tutorial is conceptual - we won't actually execute these computationally intensive commands. Focus on understanding the patterns and problems. The code is realistic and representative of what you'd write in practice. + +--- + +## 1. Building the Bash Script + +### 1.1. Start Simple: Process One Sample + +Let's begin by processing just one sample. Create a file called `process_one.sh`: + +=== "Your Task" + + Write a bash script that: + + 1. Creates an output directory called `results/` + 2. Runs FastQC on `data/reads/sample_01_R1.fastq.gz` and `data/reads/sample_01_R2.fastq.gz` + 3. Outputs results to `results/fastqc/` + +=== "Solution" + + ```bash title="process_one.sh" linenums="1" + #!/bin/bash + set -e # Exit on error + + echo "Processing sample_01..." + + # Create output directory + mkdir -p results/fastqc + + # Run FastQC + fastqc -q -o results/fastqc \ + data/reads/sample_01_R1.fastq.gz \ + data/reads/sample_01_R2.fastq.gz + + echo "Done!" + ``` + +This works! For a single sample, this is perfectly fine. But you have 3 samples... + +### 1.2. Scale to Multiple Samples: Add a Loop + +Now modify your script to process all samples. Read from `data/samples.csv` and loop through them. + +=== "Your Task" + + Modify `process_one.sh` to: + + 1. Read sample information from `data/samples.csv` + 2. Loop through each sample + 3. Run FastQC on each sample's reads + + Hint: Use `tail -n +2` to skip the CSV header and `while IFS=',' read` to parse CSV lines + +=== "Solution" + + ```bash title="process_all.sh" linenums="1" + #!/bin/bash + set -e # Exit on error + + echo "Processing all samples..." + echo "========================" + + # Create output directory + mkdir -p results/fastqc + + # Read CSV and process each sample + tail -n +2 data/samples.csv | while IFS=',' read -r sample_id organism read1 read2; do + echo "" + echo "Processing $sample_id ($organism)..." + + fastqc -q -o results/fastqc $read1 $read2 + + echo "Completed $sample_id" + done + + echo "" + echo "All samples processed!" + ``` + +Run this script: + +```bash title="Run the script" +chmod +x process_all.sh +./process_all.sh +``` + +**Expected output:** + +```console title="Sequential processing" +Processing all samples... +======================== + +Processing sample_01 (E.coli)... +Completed sample_01 + +Processing sample_02 (S.aureus)... +Completed sample_02 + +Processing sample_03 (P.aeruginosa)... +Completed sample_03 + +All samples processed! +``` + +### 1.3. 🛑 Problem #1: It's Slow + +Time your script: + +```bash title="Time the script" +time ./process_all.sh +``` + +Notice something? The samples process **one at a time**. If each FastQC takes 5 minutes: + +- Sample 1: 5 minutes +- Sample 2: 5 minutes (waits for sample 1) +- Sample 3: 5 minutes (waits for sample 2) +- **Total: 15 minutes** + +But these samples are completely independent! 
They could run **simultaneously**. + +**Think about it:** With 20 samples, you're looking at 100 minutes of sequential waiting. On a machine with 16 CPUs, you're using only 2 CPUs at a time. The other 14 sit idle. + +You *could* parallelize this with `&` backgrounding or GNU parallel, but that adds complexity. Let's see this problem in action first, then we'll return to it. + +### 1.4. Add More Steps: The Full Pipeline + +Now let's add the remaining steps. Modify your script to: + +1. Run FastQC +2. Trim adapters with fastp +3. Assemble with SPAdes +4. Assess quality with QUAST + +=== "Your Task" + + Extend your script to include all four pipeline steps for each sample. + +=== "Solution" + + ```bash title="process_complete.sh" linenums="1" + #!/bin/bash + set -e # Exit on error + + echo "Starting bacterial genome analysis pipeline" + echo "===========================================" + + # Create output directories + mkdir -p results/fastqc + mkdir -p results/trimmed + mkdir -p results/assemblies + mkdir -p results/quast + + # Read CSV and process each sample + tail -n +2 data/samples.csv | while IFS=',' read -r sample_id organism read1 read2; do + + echo "" + echo "Processing $sample_id ($organism)..." + echo "-----------------------------------" + + # Step 1: Quality control + echo "Running FastQC..." + fastqc -q -o results/fastqc $read1 $read2 + + # Step 2: Trim adapters + echo "Running fastp..." + fastp \ + -i $read1 \ + -I $read2 \ + -o results/trimmed/${sample_id}_R1.fastq.gz \ + -O results/trimmed/${sample_id}_R2.fastq.gz \ + --json results/trimmed/${sample_id}.json \ + --html results/trimmed/${sample_id}.html \ + --thread 4 + + # Step 3: Genome assembly + echo "Running SPAdes assembly..." + spades.py \ + -1 results/trimmed/${sample_id}_R1.fastq.gz \ + -2 results/trimmed/${sample_id}_R2.fastq.gz \ + -o results/assemblies/${sample_id} \ + --threads 8 \ + --memory 16 + + # Step 4: Assembly quality assessment + echo "Running QUAST..." + quast.py \ + results/assemblies/${sample_id}/contigs.fasta \ + -o results/quast/${sample_id} \ + --threads 4 + + echo "Completed $sample_id" + done + + echo "" + echo "Pipeline complete!" + ``` + +### 1.5. 🛑 Problem #2: Resource Conflicts + +Look at the thread allocations: + +- `fastp --thread 4` +- `spades.py --threads 8` +- `quast.py --threads 4` + +These are **hardcoded**. What if: + +- You run this on your laptop with only 4 cores? SPAdes will fail or thrash +- You run this on an HPC node with 64 cores? You're wasting 56 cores +- You run this on two different machines simultaneously? They'll compete for resources + +**The problem:** Resource requirements are baked into your code, not adapted to the environment. + +### 1.6. 🛑 Problem #3: The Crash + +You start processing your samples. Everything is going well... + +```console +Processing sample_01 (E.coli)... +----------------------------------- +Running FastQC... +Running fastp... +Running SPAdes assembly... +Running QUAST... +Completed sample_01 + +Processing sample_02 (S.aureus)... +----------------------------------- +Running FastQC... +Running fastp... +Running SPAdes assembly... +``` + +Then disaster strikes at 2 AM: + +```console +ERROR: SPAdes terminated with an error +Out of memory: Killed process 12234 (spades.py) +``` + +**You've lost all the work for sample_02** and any samples that would have come after it. Your options: + +1. **Re-run everything** - Waste all completed work (sample_01 runs again unnecessarily) +2. 
**Manually edit the CSV** - Remove completed samples, re-run script (error-prone) +3. **Build checkpoint logic** - Track what completed, skip finished work (hours of additional coding) + +None of these are good options. + +### 1.7. 🛑 Problem #4: Environment Dependencies + +Your script assumes that `fastqc`, `fastp`, `spades.py`, and `quast.py` are: + +- Installed on the system +- In your PATH +- The correct versions +- Compatible with each other + +Try to run your script on a colleague's machine: + +```console +$ ./process_complete.sh +Processing sample_01 (E.coli)... +Running FastQC... +bash: fastqc: command not found +``` + +Or worse - it runs with different versions: + +```console +# Your machine: fastp version 0.23.4 +# Colleague's machine: fastp version 0.20.0 +# Different results, different outputs, not reproducible! +``` + +**The problem:** Your script is **environment-dependent**. It won't run the same way everywhere. + +### 1.8. 🛑 Problem #5: No Visibility + +Your script is running. Questions you can't easily answer: + +- How much memory is SPAdes actually using? +- Which step is the bottleneck? +- How long did sample_02 take compared to sample_01? +- What were the exact parameters used for sample_03? +- If I need to optimize, where should I focus? + +You'd need to manually add: + +- Logging at every step +- Timestamp tracking +- Resource monitoring +- Parameter recording + +That's a lot of boilerplate code that has nothing to do with your science. + +### 1.9. Taking Stock + +Let's summarize what we've experienced: + +| Problem | Impact | Your Options | +|---------|--------|--------------| +| **Sequential processing** | 3× slower than necessary | Add complex parallelization logic | +| **Hardcoded resources** | Fails on some machines, wastes resources on others | Add environment detection, make configurable | +| **No resume capability** | Re-run everything after failures | Build checkpoint tracking system | +| **Environment dependencies** | Not reproducible, won't run elsewhere | Create conda environments, document meticulously | +| **No provenance** | Can't optimize, hard to debug | Add extensive logging code | + +You *could* solve all of these. And after weeks of work, you'd have built... a workflow management system. + +**This is why Nextflow exists.** + +--- + +## 2. Enter Nextflow: Solving the Problems + +### 2.1. The Nextflow Equivalent + +Let's see how the same pipeline looks in Nextflow. 
We'll start with the main workflow file: + +```groovy title="main.nf" linenums="1" +#!/usr/bin/env nextflow + +// Include process definitions +include { FASTQC } from './modules/fastqc' +include { FASTP } from './modules/fastp' +include { SPADES } from './modules/spades' +include { QUAST } from './modules/quast' + +// Pipeline parameters +params.samples = 'data/samples.csv' +params.outdir = 'results' + +// Main workflow +workflow { + + // Read sample sheet and create channel + ch_samples = Channel + .fromPath(params.samples) + .splitCsv(header: true) + .map { row -> + def meta = [id: row.sample_id, organism: row.organism] + def reads = [file(row.read1), file(row.read2)] + return [meta, reads] + } + + // Run the pipeline steps + FASTQC(ch_samples) + FASTP(ch_samples) + SPADES(FASTP.out.reads) + QUAST(SPADES.out.assembly) +} +``` + +And one of the process definitions: + +```groovy title="modules/fastp.nf" linenums="1" +process FASTP { + tag "$meta.id" + container 'biocontainers/fastp:0.23.4' + publishDir "${params.outdir}/trimmed", mode: 'copy' + + cpus 4 + memory '8.GB' + + input: + tuple val(meta), path(reads) + + output: + tuple val(meta), path("*_R{1,2}.fastq.gz"), emit: reads + path "*.{json,html}", emit: reports + + script: + def prefix = meta.id + """ + fastp \\ + -i ${reads[0]} \\ + -I ${reads[1]} \\ + -o ${prefix}_R1.fastq.gz \\ + -O ${prefix}_R2.fastq.gz \\ + --json ${prefix}.json \\ + --html ${prefix}.html \\ + --thread $task.cpus + """ +} +``` + +Let's run it: + +```bash title="Run the Nextflow pipeline" +nextflow run main.nf +``` + +### 2.2. ✅ Solution #1: Automatic Parallelization + +Watch the execution: + +```console title="Nextflow automatically parallelizes" +N E X T F L O W ~ version 24.10.0 +Launching `main.nf` [friendly_darwin] DSL2 - revision: a1b2c3d4e5 + +executor > local (12) +[3a/4b5c6d] FASTQC (sample_01) [100%] 3 of 3 ✔ +[8e/9f0a1b] FASTP (sample_01) [100%] 3 of 3 ✔ +[2c/3d4e5f] SPADES (sample_01) [100%] 3 of 3 ✔ +[7g/8h9i0j] QUAST (sample_01) [100%] 3 of 3 ✔ + +Completed at: 11-Oct-2025 14:23:45 +Duration : 28m 15s +CPU hours : 2.5 +Succeeded : 12 +``` + +**Notice the difference:** + +- **Bash script**: 90 minutes (sequential: 30 min × 3 samples) +- **Nextflow**: 28 minutes (parallel where possible) + +**What happened?** Nextflow analyzed the dependencies: + +1. All 3 FASTQC tasks are independent → **Run simultaneously** +2. All 3 FASTP tasks depend only on input data → **Run simultaneously** +3. Each SPADES depends on its FASTP output → **Run when ready** +4. Each QUAST depends on its SPADES output → **Run when ready** + +You wrote **zero parallelization code**. Nextflow figured it out from the data dependencies. + +### 2.3. ✅ Solution #2: Declarative Resource Requirements + +Look at the process definition again: + +```groovy title="Process with resource declarations" +process FASTP { + cpus 4 + memory '8.GB' + + // ... +} +``` + +These aren't hardcoded values - they're **requirements**. 
Nextflow uses them differently based on where you run: + +**On your laptop (4 cores, 16 GB RAM):** + +- Schedules tasks so they don't exceed available resources +- Might run 1 FASTP task at a time (needs 4 CPUs) +- Queues remaining tasks until resources free up + +**On HPC cluster (64 cores, 256 GB RAM):** + +- Generates appropriate SLURM job submission script +- Requests 4 CPUs and 8 GB per task +- Can run 16 FASTP tasks in parallel + +**On cloud (elastic scaling):** + +- Provisions appropriately-sized instances +- Scales up when needed, down when done +- You pay only for what you use + +**Same code. Different environments. Automatic adaptation.** + +### 2.4. ✅ Solution #3: Built-in Resume Capability + +Remember the crash scenario? Let's see it with Nextflow: + +```console title="Pipeline fails on sample_02" +N E X T F L O W ~ version 24.10.0 + +executor > local (9) +[3a/4b5c6d] FASTQC (sample_01) [100%] 3 of 3 ✔ +[8e/9f0a1b] FASTP (sample_01) [100%] 3 of 3 ✔ +[2c/3d4e5f] SPADES (sample_01) [100%] 2 of 3 ✔ +[2c/a1b2c3] SPADES (sample_02) [ 0%] 0 of 1, failed: 1 ✘ +[7g/8h9i0j] QUAST (sample_01) [100%] 2 of 2 ✔ + +Error executing process > 'SPADES (sample_02)' +``` + +**Key observations:** + +1. Samples 01 and 03 completed successfully +2. Only sample 02 failed +3. Downstream work (QUAST) continued for successful samples + +**Fix the problem** (increase memory): + +```groovy title="modules/spades.nf" hl_lines="3" +process SPADES { + tag "$meta.id" + memory '32.GB' // Increased from 16.GB + // ... rest of process +} +``` + +**Resume from where it failed:** + +```bash title="Resume with one flag" +nextflow run main.nf -resume +``` + +```console title="Only failed work re-runs" +N E X T F L O W ~ version 24.10.0 + +executor > local (2) +[3a/4b5c6d] FASTQC (sample_01) [100%] 3 of 3, cached: 3 ✔ +[8e/9f0a1b] FASTP (sample_01) [100%] 3 of 3, cached: 3 ✔ +[2c/3d4e5f] SPADES (sample_01) [100%] 3 of 3, cached: 2 ✔ +[7g/8h9i0j] QUAST (sample_01) [100%] 3 of 3, cached: 2 ✔ + +Completed at: 11-Oct-2025 14:45:12 +Duration : 32m 5s +``` + +**Only SPADES(sample_02) and QUAST(sample_02) re-ran.** Everything else used cached results. + +**Zero checkpoint logic required.** Nextflow tracks this automatically using content-based hashing. + +### 2.5. ✅ Solution #4: Container-Based Software Isolation + +Look at the process definition: + +```groovy title="Process with container specification" +process FASTP { + container 'biocontainers/fastp:0.23.4' + + // ... rest of process +} +``` + +**What Nextflow does automatically:** + +1. Downloads the container image (first run only) +2. Executes the task inside the container +3. Ensures exact version 0.23.4 is used +4. Prevents conflicts with other tools + +**Running on a colleague's machine:** + +```bash title="Works identically" +nextflow run main.nf +``` + +```console +Pulling biocontainers/fastp:0.23.4 ... done +Pulling biocontainers/spades:3.15.5 ... done +Pulling biocontainers/quast:5.2.0 ... done + +[... pipeline runs identically ...] +``` + +**Zero installation steps.** No PATH configuration. No version conflicts. Perfect reproducibility. + +### 2.6. 
✅ Solution #5: Comprehensive Provenance + +Generate execution reports: + +```bash title="Run with reporting flags" +nextflow run main.nf -with-report -with-timeline -with-trace +``` + +**Execution Report (`report.html`):** + +- Task-level resource usage (CPU, memory, time) +- Success/failure status for each task +- Resource efficiency metrics +- Total cost estimates (for cloud) + +**Timeline (`timeline.html`):** + +- Visual timeline of when each task ran +- Shows parallelization patterns +- Identifies bottlenecks +- Reveals idle time + +**Trace file (`trace.txt`):** + +```csv +task_id,hash,process,status,exit,duration,realtime,%cpu,%mem,peak_rss,peak_vmem +1,3a/4b5c6d,FASTQC (sample_01),COMPLETED,0,1m 23s,1m 18s,98.5%,12.3%,1.8 GB,2.1 GB +2,8e/9f0a1b,FASTP (sample_01),COMPLETED,0,3m 45s,3m 40s,390.2%,8.7%,2.4 GB,3.1 GB +3,2c/3d4e5f,SPADES (sample_01),COMPLETED,0,28m 12s,27m 51s,785.3%,45.2%,12.1 GB,15.8 GB +``` + +Perfect for: + +- Identifying which samples need more resources +- Optimizing resource allocations +- Cost analysis +- Publication methods sections + +**Zero logging code written.** It's built in. + +--- + +## 3. Side-by-Side Comparison + +### 3.1. The Same Task, Two Approaches + +Let's directly compare what we built: + +**Bash Script Approach:** + +```bash +./process_complete.sh +``` + +- ❌ Sequential execution (90 minutes for 3 samples) +- ❌ Hardcoded resource requirements +- ❌ No resume capability +- ❌ Environment-dependent (requires manual software installation) +- ❌ No execution tracking +- ✅ Simple to understand initially +- ✅ No new tools to learn + +**Nextflow Approach:** + +```bash +nextflow run main.nf -resume +``` + +- ✅ Automatic parallelization (28 minutes for 3 samples) +- ✅ Adaptive resource management +- ✅ Built-in resume capability +- ✅ Container-based software isolation +- ✅ Comprehensive provenance tracking +- ⚠️ Initial learning curve +- ✅ Same code works everywhere + +### 3.2. Real-World Impact: The Numbers + +Let's quantify the differences: + +**Time Investment:** + +| Task | Bash Script | Nextflow | +|------|-------------|----------| +| Initial implementation | 30 minutes | 60 minutes | +| Add parallelization | +2 hours | Already included | +| Add resume logic | +3 hours | Already included | +| Add logging/tracking | +2 hours | Already included | +| Fix environment issues | +1 hour per machine | Works everywhere | +| **Total development** | **8.5 hours** | **1 hour** | + +**Execution Time (3 samples):** + +| Approach | Time | Speedup | +|----------|------|---------| +| Bash (sequential) | 90 minutes | 1× | +| Nextflow (parallel) | 28 minutes | 3.2× | + +**Execution Time (20 samples):** + +| Approach | Time | Speedup | +|----------|------|---------| +| Bash (sequential) | 10 hours | 1× | +| Nextflow (parallel) | 2 hours | 5× | + +**After a failure at sample 15:** + +| Approach | Time to Resume | +|----------|----------------| +| Bash | 7.5 hours (re-run samples 1-15) | +| Nextflow | 30 minutes (only sample 15) | + +### 3.3. The Breaking Point + +When should you use workflow management instead of bash scripts? 
+ +**Consider Nextflow when:** + +✅ You have more than 3-5 samples to process +✅ Your pipeline has multiple steps with dependencies +✅ You need to share your work with collaborators +✅ You run analyses on different computers/clusters +✅ Your analyses take more than 30 minutes +✅ You need to re-run analyses with different parameters +✅ You care about reproducibility + +**Bash scripts are fine when:** + +✅ One-off exploratory analysis you'll never repeat +✅ Single sample, single step +✅ Interactive work where you adjust based on output +✅ Extremely simple operations (< 5 minutes total) + +**The transition point:** When you find yourself thinking "I wish I could..." about your bash script - parallelization, resuming, tracking - that's when to switch to Nextflow. + +--- + +## 4. Hands-On: Converting Your Script + +Now it's your turn. Let's convert part of the bash script to Nextflow. + +### Exercise 1: Create the FastQC Process + +Take the FastQC step from the bash script and convert it to a Nextflow process. + +=== "Your Task" + + Create `modules/fastqc.nf` with a process that: + + 1. Takes sample metadata and read files as input + 2. Runs FastQC on the reads + 3. Outputs the HTML and ZIP reports + 4. Uses the `biocontainers/fastqc:0.12.1` container + 5. Uses 2 CPUs and 2 GB memory + + Use this template: + + ```groovy title="modules/fastqc.nf" + process FASTQC { + tag "$meta.id" + container '???' + publishDir "${params.outdir}/fastqc", mode: 'copy' + + cpus ??? + memory ??? + + input: + tuple val(meta), path(reads) + + output: + // What outputs does FastQC produce? + + script: + """ + fastqc -q -t $task.cpus ${reads} + """ + } + ``` + +=== "Solution" + + ```groovy title="modules/fastqc.nf" + process FASTQC { + tag "$meta.id" + container 'biocontainers/fastqc:0.12.1' + publishDir "${params.outdir}/fastqc", mode: 'copy' + + cpus 2 + memory '2.GB' + + input: + tuple val(meta), path(reads) + + output: + tuple val(meta), path("*.html"), emit: html + tuple val(meta), path("*.zip"), emit: zip + + script: + """ + fastqc -q -t $task.cpus ${reads} + """ + } + ``` + + **Key elements:** + + - `tag "$meta.id"`: Labels output with sample ID + - `container`: Specifies exact software version + - `publishDir`: Where to save final results + - `cpus`/`memory`: Resource requirements (not hardcoded thread counts!) + - `input`: Takes metadata + read files as a tuple + - `output`: Declares what files are produced, with named outputs (`emit`) + - `$task.cpus`: Uses the declared CPU count (adapts to environment) + +### Exercise 2: Create the Main Workflow + +Now create a workflow that reads the CSV and runs FastQC on all samples. + +=== "Your Task" + + Create `main.nf` that: + + 1. Includes the FASTQC process + 2. Reads `data/samples.csv` + 3. Creates a channel with [meta, reads] tuples + 4. 
Runs FASTQC on all samples + + Use this template: + + ```groovy title="main.nf" + #!/usr/bin/env nextflow + + include { FASTQC } from './modules/fastqc' + + params.samples = 'data/samples.csv' + params.outdir = 'results' + + workflow { + ch_samples = Channel + .fromPath(params.samples) + .splitCsv(header: true) + .map { row -> + // Create metadata map + // Create reads list + // Return tuple + } + + FASTQC(ch_samples) + } + ``` + +=== "Solution" + + ```groovy title="main.nf" + #!/usr/bin/env nextflow + + include { FASTQC } from './modules/fastqc' + + params.samples = 'data/samples.csv' + params.outdir = 'results' + + workflow { + ch_samples = Channel + .fromPath(params.samples) + .splitCsv(header: true) + .map { row -> + def meta = [id: row.sample_id, organism: row.organism] + def reads = [file(row.read1), file(row.read2)] + return [meta, reads] + } + + FASTQC(ch_samples) + } + ``` + + **How it works:** + + - `Channel.fromPath()`: Creates channel from file + - `.splitCsv(header: true)`: Parses CSV, skips header + - `.map { row -> ... }`: Transforms each row + - `meta`: Metadata map (sample ID, organism, etc.) + - `reads`: List of read file paths + - `[meta, reads]`: Tuple that matches process input + - `FASTQC(ch_samples)`: Runs process on all items in channel + +### Exercise 3: Add the FASTP Process + +Now add the trimming step. + +=== "Your Task" + + 1. Create `modules/fastp.nf` (similar to FASTQC but with different outputs) + 2. Add it to the workflow after FASTQC + 3. Make sure it receives the input reads (hint: use the original `ch_samples`) + +=== "Solution" + + ```groovy title="modules/fastp.nf" + process FASTP { + tag "$meta.id" + container 'biocontainers/fastp:0.23.4' + publishDir "${params.outdir}/trimmed", mode: 'copy' + + cpus 4 + memory '8.GB' + + input: + tuple val(meta), path(reads) + + output: + tuple val(meta), path("*_R{1,2}.fastq.gz"), emit: reads + path "*.{json,html}", emit: reports + + script: + def prefix = meta.id + """ + fastp \\ + -i ${reads[0]} \\ + -I ${reads[1]} \\ + -o ${prefix}_R1.fastq.gz \\ + -O ${prefix}_R2.fastq.gz \\ + --json ${prefix}.json \\ + --html ${prefix}.html \\ + --thread $task.cpus + """ + } + ``` + + ```groovy title="main.nf" hl_lines="3 17" + #!/usr/bin/env nextflow + + include { FASTQC } from './modules/fastqc' + include { FASTP } from './modules/fastp' + + params.samples = 'data/samples.csv' + params.outdir = 'results' + + workflow { + ch_samples = Channel + .fromPath(params.samples) + .splitCsv(header: true) + .map { row -> + def meta = [id: row.sample_id, organism: row.organism] + def reads = [file(row.read1), file(row.read2)] + return [meta, reads] + } + + FASTQC(ch_samples) + FASTP(ch_samples) + } + ``` + + **Note:** Both FASTQC and FASTP receive `ch_samples`. They'll run in parallel since neither depends on the other. + +### Exercise 4: Chain Processes with Dependencies + +Now add SPADES, which depends on FASTP output. + +=== "Your Task" + + Create the SPADES process and connect it to receive trimmed reads from FASTP. 
+
+    Hint: FASTP outputs `emit: reads` - you can reference this as `FASTP.out.reads`
+
+=== "Solution"
+
+    ```groovy title="modules/spades.nf"
+    process SPADES {
+        tag "$meta.id"
+        container 'biocontainers/spades:3.15.5'
+        publishDir "${params.outdir}/assemblies", mode: 'copy'
+
+        cpus 8
+        memory '16.GB'
+        time '6.h'
+
+        input:
+        tuple val(meta), path(reads)
+
+        output:
+        tuple val(meta), path("${meta.id}/contigs.fasta"), emit: assembly
+
+        script:
+        """
+        spades.py \\
+            -1 ${reads[0]} \\
+            -2 ${reads[1]} \\
+            -o ${meta.id} \\
+            --threads $task.cpus \\
+            --memory ${task.memory.toGiga()}
+        """
+    }
+    ```
+
+    ```groovy title="main.nf" hl_lines="5 22"
+    #!/usr/bin/env nextflow
+
+    include { FASTQC } from './modules/fastqc'
+    include { FASTP } from './modules/fastp'
+    include { SPADES } from './modules/spades'
+
+    params.samples = 'data/samples.csv'
+    params.outdir = 'results'
+
+    workflow {
+        ch_samples = Channel
+            .fromPath(params.samples)
+            .splitCsv(header: true)
+            .map { row ->
+                def meta = [id: row.sample_id, organism: row.organism]
+                def reads = [file(row.read1), file(row.read2)]
+                return [meta, reads]
+            }
+
+        FASTQC(ch_samples)
+        FASTP(ch_samples)
+        SPADES(FASTP.out.reads)
+    }
+    ```
+
+    **The dependency chain:**
+
+    - `ch_samples` flows into FASTQC and FASTP (parallel)
+    - `FASTP.out.reads` flows into SPADES (sequential - SPADES waits for FASTP)
+    - Nextflow automatically handles the scheduling
+
+---
+
+## 5. Configuration: Flexibility Without Code Changes
+
+### 5.1. Separating Configuration from Logic
+
+One of Nextflow's powerful features: **configuration lives in separate files**.
+
+Your workflow (`main.nf`) describes **what** to do.
+Config files describe **how** to do it.
+
+Create `nextflow.config`:
+
+```groovy title="nextflow.config" linenums="1"
+// Project defaults
+params {
+    samples = 'data/samples.csv'
+    outdir = 'results'
+}
+
+// Process defaults
+process {
+    cpus = 2
+    memory = '4.GB'
+    time = '2.h'
+
+    // Override for specific processes
+    withName: 'SPADES' {
+        cpus = 8
+        memory = '16.GB'
+        time = '6.h'
+    }
+}
+
+// Container settings
+docker.enabled = true
+
+// Generate execution reports
+timeline {
+    enabled = true
+    file = "${params.outdir}/timeline.html"
+}
+
+report {
+    enabled = true
+    file = "${params.outdir}/report.html"
+}
+
+trace {
+    enabled = true
+    file = "${params.outdir}/trace.txt"
+}
+```
+
+Now you can run with just:
+
+```bash
+nextflow run main.nf
+```
+
+All settings come from the config file.
+
+### 5.2. Environment-Specific Profiles
+
+Add execution profiles for different environments:
+
+```groovy title="nextflow.config (add to end)" linenums="40"
+profiles {
+
+    standard {
+        process.executor = 'local'
+        docker.enabled = true
+    }
+
+    cluster {
+        process.executor = 'slurm'
+        process.queue = 'general'
+        process.clusterOptions = '--account=myproject'
+        singularity.enabled = true
+        singularity.cacheDir = '/shared/containers'
+        docker.enabled = false
+    }
+
+    cloud {
+        process.executor = 'awsbatch'
+        process.queue = 'genomics-queue'
+        workDir = 's3://my-bucket/work'
+        aws.region = 'us-east-1'
+    }
+
+    test {
+        params.samples = 'test/mini_samples.csv'
+        process.cpus = 1
+        process.memory = '2.GB'
+    }
+}
+```
+
+Switch environments with one flag:
+
+```bash title="Different execution environments"
+# Local execution
+nextflow run main.nf -profile standard
+
+# HPC cluster
+nextflow run main.nf -profile cluster
+
+# AWS cloud
+nextflow run main.nf -profile cloud
+
+# Quick test
+nextflow run main.nf -profile test
+```
+
+**The workflow code never changes.**
+
+---
+
+## 6. Conclusion: The Paradigm Shift
+
+### 6.1. What We've Learned
+
+We started by building a bash script and experienced its limitations:
+
+1. ⏱️ **Sequential processing** - Samples waited unnecessarily
+2. 🖥️ **Hardcoded resources** - Failed on different machines
+3. 💥 **No resume capability** - Lost hours of work to failures
+4. 📦 **Environment dependencies** - "Works on my machine" problems
+5. 📊 **No provenance** - Couldn't optimize or debug effectively
+
+Then we saw how Nextflow solves each problem:
+
+1. ✅ **Automatic parallelization** - 3× faster with zero code changes
+2. ✅ **Adaptive resources** - Same code works everywhere
+3. ✅ **Built-in caching** - Resume from failures instantly
+4. ✅ **Container isolation** - Perfect reproducibility
+5. ✅ **Comprehensive tracking** - Complete provenance built-in
+
+### 6.2. When to Make the Switch
+
+**You should use Nextflow when:**
+
+- Processing 3+ samples in a multi-step pipeline
+- Sharing analyses with collaborators
+- Running on different computers/clusters
+- Analyses take > 30 minutes
+- You need reproducibility
+
+**Bash scripts are fine for:**
+
+- Quick one-off explorations
+- Single-sample, single-step operations
+- Interactive work with immediate feedback
+- Very simple tasks (< 5 minutes)
+
+### 6.3. The Investment
+
+**Learning Nextflow:**
+
+- 1-2 days to become productive
+- 1 week to feel comfortable
+- After 2-3 workflows, you'll prefer it by default
+
+**Return on investment:**
+
+- Hours saved per analysis (scaling with sample count)
+- Near-perfect reproducibility
+- Seamless collaboration
+- Code reuse across projects
+
+### 6.4. Next Steps
+
+**Continue learning:**
+
+1. [Groovy Essentials](./groovy_essentials.md) - Data manipulation and advanced patterns
+2. [nf-core](./nf-core.md) - Community best practices and ready-made pipelines
+3. [nf-test](./nf-test.md) - Testing your workflows
+4. [Nextflow Patterns](https://nextflow-io.github.io/patterns/) - Common workflow patterns
+
+**Join the community:**
+
+- [Nextflow Slack](https://www.nextflow.io/slack-invite.html) - Active community support
+- [nf-core Slack](https://nf-co.re/join) - Pipeline-specific help
+- [Nextflow Training](https://training.nextflow.io/) - Official workshops
+
+### 6.5. The Mindset Shift
+
+Moving from scripts to workflows isn't just about using a new tool - it's a new way of thinking:
+
+**Script thinking:**
+> "How do I run these commands in sequence on my data?"
+
+**Workflow thinking:**
+> "What are the data dependencies? How will this scale? How can others reproduce this?"
+
+Once you start thinking in terms of processes, channels, and data flow, you'll wonder how you ever managed with bash scripts.
+
+Welcome to workflow management. Your future self will thank you.
+
+---
+
+## Quick Reference
+
+### Nextflow vs Bash: Command Comparison
+
+| Task | Bash | Nextflow |
+|------|------|----------|
+| Process one sample | `fastqc file.fq` | `FASTQC(ch_sample)` |
+| Process multiple samples | `for f in *.fq; do fastqc $f; done` | `FASTQC(ch_samples)` (automatic) |
+| Chain commands | `cmd1 \| cmd2 \| cmd3` | `CMD1(input); CMD2(CMD1.out); CMD3(CMD2.out)` |
+| Parallel execution | `cmd &` or GNU parallel | Automatic based on dependencies |
+| Resume after failure | Manual checkpoints | `-resume` flag |
+| Specify resources | Hardcoded threads | `cpus = X, memory = 'Y.GB'` |
+| Software environment | `module load` or PATH | `container = 'image:version'` |
+
+### Common Nextflow Commands
+
+```bash
+# Run workflow
+nextflow run main.nf
+
+# Resume from cached results
+nextflow run main.nf -resume
+
+# Override parameters
+nextflow run main.nf --samples my_data.csv --outdir my_results
+
+# Use specific profile
+nextflow run main.nf -profile cluster
+
+# Generate reports
+nextflow run main.nf -with-report -with-timeline -with-trace
+
+# Clean up work directory
+nextflow clean -f
+
+# List previous runs
+nextflow log
+
+# View run details
+nextflow log -f script,workdir,hash,duration
+```
+
+### Process Template
+
+```groovy
+process PROCESS_NAME {
+    tag "$meta.id"                      // Label for logs
+    container 'image:version'           // Software environment
+    publishDir "${params.outdir}/dir"   // Where to save outputs
+
+    cpus 4                              // CPU requirement
+    memory '8.GB'                       // Memory requirement
+    time '2.h'                          // Time limit
+
+    input:
+    tuple val(meta), path(files)        // Input declaration
+
+    output:
+    tuple val(meta), path("*.out"), emit: results
+
+    script:
+    """
+    command --input $files --output output.out --threads $task.cpus
+    """
+}
+```
diff --git a/side-quests/groovy_essentials/collect.nf b/side-quests/groovy_essentials/collect.nf
new file mode 100644
index 000000000..aaa557393
--- /dev/null
+++ b/side-quests/groovy_essentials/collect.nf
@@ -0,0 +1,7 @@
+def sample_ids = ['sample_001', 'sample_002', 'sample_003']
+
+// Nextflow collect() - groups multiple channel emissions into one
+ch_input = Channel.fromList(sample_ids)
+ch_input.view { "Individual channel item: ${it}" }
+ch_collected = ch_input.collect()
+ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" }
diff --git a/side-quests/groovy_essentials/data/samples.csv b/side-quests/groovy_essentials/data/samples.csv
new file mode 100644
index 000000000..1d12e1384
--- /dev/null
+++ b/side-quests/groovy_essentials/data/samples.csv
@@ -0,0 +1,4 @@
+sample_id,organism,tissue_type,sequencing_depth,file_path,quality_score
+SAMPLE_001,human,liver,30000000,data/sequences/SAMPLE_001_S1_L001_R1_001.fastq,38.5
+SAMPLE_002,mouse,brain,25000000,data/sequences/SAMPLE_002_S2_L001_R1_001.fastq,35.2
+SAMPLE_003,human,kidney,45000000,data/sequences/SAMPLE_003_S3_L001_R1_001.fastq,42.1
diff --git a/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq b/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq
new file mode 100644
index 000000000..5dc7a08c8
--- /dev/null
+++ b/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq
@@ -0,0 +1,12 @@
+@sample_001_read_1 +ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC ++ +IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII +@sample_001_read_2 +GCATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGAT ++ +HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH +@sample_001_read_3 +TCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA ++ +JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJH diff --git a/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq b/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq new file mode 100644 index 000000000..f04cbb4d4 --- /dev/null +++ b/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq @@ -0,0 +1,12 @@ +@sample_002_read_1 +CGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGAT ++ +IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII +@sample_002_read_2 +ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG ++ +HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH +@sample_002_read_3 +GATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC ++ +JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ diff --git a/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq b/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq new file mode 100644 index 000000000..425e92aba --- /dev/null +++ b/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq @@ -0,0 +1,12 @@ +@sample_003_read_1 +GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGC ++ +IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII +@sample_003_read_2 +CGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCG ++ +HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH +@sample_003_read_3 +ATATATATATATATATATATATATATATATATATATATAT ++ +JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ diff --git a/side-quests/groovy_essentials/main.nf b/side-quests/groovy_essentials/main.nf new file mode 100644 index 000000000..31aa794ed --- /dev/null +++ b/side-quests/groovy_essentials/main.nf @@ -0,0 +1,5 @@ +workflow { + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .view() +} diff --git a/side-quests/groovy_essentials/modules/fastp.nf b/side-quests/groovy_essentials/modules/fastp.nf new file mode 100644 index 000000000..29e9ef8ca --- /dev/null +++ b/side-quests/groovy_essentials/modules/fastp.nf @@ -0,0 +1,21 @@ +process FASTP { + container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' + + input: + tuple val(meta), path(reads) + + output: + tuple val(meta), path("*_trimmed*.fastq.gz"), emit: reads + + script: + """ + fastp \\ + --in1 ${reads[0]} \\ + --in2 ${reads[1]} \\ + --out1 ${meta.id}_trimmed_R1.fastq.gz \\ + --out2 ${meta.id}_trimmed_R2.fastq.gz \\ + --json ${meta.id}.fastp.json \\ + --html ${meta.id}.fastp.html \\ + --thread $task.cpus + """ +} diff --git a/side-quests/groovy_essentials/modules/generate_report.nf b/side-quests/groovy_essentials/modules/generate_report.nf new file mode 100644 index 000000000..4bf7a1d78 --- /dev/null +++ b/side-quests/groovy_essentials/modules/generate_report.nf @@ -0,0 +1,16 @@ +process GENERATE_REPORT { + + publishDir 'results/reports', mode: 'copy' + + input: + tuple val(meta), path(reads) + + output: + path "${meta.id}_report.txt" + + script: + """ + echo "Processing ${reads}" > ${meta.id}_report.txt + echo "Sample: ${meta.id}" >> ${meta.id}_report.txt + """ +} diff --git a/side-quests/groovy_essentials/modules/trimgalore.nf b/side-quests/groovy_essentials/modules/trimgalore.nf new file mode 100644 index 000000000..945fef640 --- /dev/null +++ b/side-quests/groovy_essentials/modules/trimgalore.nf @@ -0,0 +1,37 @@ +process TRIMGALORE { + 
container 'quay.io/biocontainers/trim-galore:0.6.10--hdfd78af_0' + + input: + tuple val(meta), path(reads) + + output: + tuple val(meta), path("*_trimmed*.fq"), emit: reads + path "*_trimming_report.txt" , emit: reports + + script: + // Simple single-end vs paired-end detection + def is_single = reads instanceof List ? reads.size() == 1 : true + + if (is_single) { + def input_file = reads instanceof List ? reads[0] : reads + """ + trim_galore \\ + --cores $task.cpus \\ + ${input_file} + + # Rename output to match expected pattern + mv *_trimmed.fq ${meta.id}_trimmed.fq + """ + } else { + """ + trim_galore \\ + --paired \\ + --cores $task.cpus \\ + ${reads[0]} ${reads[1]} + + # Rename outputs to match expected pattern + mv *_val_1.fq ${meta.id}_trimmed_R1.fq + mv *_val_2.fq ${meta.id}_trimmed_R2.fq + """ + } +} diff --git a/side-quests/groovy_essentials/nextflow.config b/side-quests/groovy_essentials/nextflow.config new file mode 100644 index 000000000..b57f70c8c --- /dev/null +++ b/side-quests/groovy_essentials/nextflow.config @@ -0,0 +1,3 @@ +// Nextflow configuration for Groovy Essentials side quest + +docker.enabled = true diff --git a/side-quests/solutions/groovy_essentials/collect.nf b/side-quests/solutions/groovy_essentials/collect.nf new file mode 100644 index 000000000..dbfd5ac79 --- /dev/null +++ b/side-quests/solutions/groovy_essentials/collect.nf @@ -0,0 +1,18 @@ +def sample_ids = ['sample_001', 'sample_002', 'sample_003'] + +// Nextflow collect() - groups multiple channel emissions into one +ch_input = Channel.fromList(sample_ids) +ch_input.view { "Individual channel item: ${it}" } +ch_collected = ch_input.collect() +ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } + +// Groovy collect - transforms each element, preserves structure +def formatted_ids = sample_ids.collect { id -> + id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') +} +println "Groovy collect result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" + +// Spread operator - concise property access +def sample_data = [[id: 's1', quality: 38.5], [id: 's2', quality: 42.1], [id: 's3', quality: 35.2]] +def all_ids = sample_data*.id +println "Spread operator result: ${all_ids}" diff --git a/side-quests/solutions/groovy_essentials/main.nf b/side-quests/solutions/groovy_essentials/main.nf new file mode 100644 index 000000000..ab83315d7 --- /dev/null +++ b/side-quests/solutions/groovy_essentials/main.nf @@ -0,0 +1,69 @@ +include { FASTP } from './modules/fastp.nf' +include { TRIMGALORE } from './modules/trimgalore.nf' +include { GENERATE_REPORT } from './modules/generate_report.nf' + +def validateInputs() { + // Check input parameter is provided + if (!params.input) { + error("Input CSV file path not provided. Please specify --input ") + } + + // Check CSV file exists + if (!file(params.input).exists()) { + error("Input CSV file not found: ${params.input}") + } +} + +def separateMetadata(row) { + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score?.toDouble() + ] + def run_id = row.run_id?.toUpperCase() ?: 'UNSPECIFIED' + sample_meta.run = run_id + def fastq_path = file(row.file_path) + + def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) + def file_meta = m ? 
[ + sample_num: m[0][2].toInteger(), + lane: m[0][3], + read: m[0][4], + chunk: m[0][5] + ] : [:] + + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + + // Validate data makes sense + if (sample_meta.depth < 30000000) { + log.warn "Low sequencing depth for ${sample_meta.id}: ${sample_meta.depth}" + } + + return [sample_meta + file_meta + [priority: priority], fastq_path] +} + +workflow { + validateInputs() + + ch_samples = Channel.fromPath(params.input) + .splitCsv(header: true) + .map{ row -> separateMetadata(row) } + + // Filter out invalid or low-quality samples + ch_valid_samples = ch_samples + .filter { meta, reads -> + meta.id && meta.organism && meta.depth > 25000000 + } + + trim_branches = ch_valid_samples + .branch { meta, reads -> + fastp: meta.organism == 'human' && meta.depth >= 30000000 + trimgalore: true + } + + ch_fastp = FASTP(trim_branches.fastp) + ch_trimgalore = TRIMGALORE(trim_branches.trimgalore) + GENERATE_REPORT(ch_samples) +} diff --git a/side-quests/solutions/groovy_essentials/modules/fastp.nf b/side-quests/solutions/groovy_essentials/modules/fastp.nf new file mode 100644 index 000000000..2aa70e6c8 --- /dev/null +++ b/side-quests/solutions/groovy_essentials/modules/fastp.nf @@ -0,0 +1,37 @@ +process FASTP { + container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' + + input: + tuple val(meta), path(reads) + + output: + tuple val(meta), path("*_trimmed*.fastq.gz"), emit: reads + path "*.{json,html}" , emit: reports + + script: + // Simple single-end vs paired-end detection + def is_single = reads instanceof List ? reads.size() == 1 : true + + if (is_single) { + def input_file = reads instanceof List ? reads[0] : reads + """ + fastp \\ + --in1 ${input_file} \\ + --out1 ${meta.id}_trimmed.fastq.gz \\ + --json ${meta.id}.fastp.json \\ + --html ${meta.id}.fastp.html \\ + --thread $task.cpus + """ + } else { + """ + fastp \\ + --in1 ${reads[0]} \\ + --in2 ${reads[1]} \\ + --out1 ${meta.id}_trimmed_R1.fastq.gz \\ + --out2 ${meta.id}_trimmed_R2.fastq.gz \\ + --json ${meta.id}.fastp.json \\ + --html ${meta.id}.fastp.html \\ + --thread $task.cpus + """ + } +} diff --git a/side-quests/solutions/groovy_essentials/modules/generate_report.nf b/side-quests/solutions/groovy_essentials/modules/generate_report.nf new file mode 100644 index 000000000..fca06dad2 --- /dev/null +++ b/side-quests/solutions/groovy_essentials/modules/generate_report.nf @@ -0,0 +1,19 @@ +process GENERATE_REPORT { + + publishDir 'results/reports', mode: 'copy' + + input: + tuple val(meta), path(reads) + + output: + path "${meta.id}_report.txt" + + script: + """ + echo "Processing ${reads}" > ${meta.id}_report.txt + echo "Sample: ${meta.id}" >> ${meta.id}_report.txt + echo "Processed by: \${USER}" >> ${meta.id}_report.txt + echo "Hostname: \$(hostname)" >> ${meta.id}_report.txt + echo "Date: \$(date)" >> ${meta.id}_report.txt + """ +} diff --git a/side-quests/solutions/groovy_essentials/modules/trimgalore.nf b/side-quests/solutions/groovy_essentials/modules/trimgalore.nf new file mode 100644 index 000000000..945fef640 --- /dev/null +++ b/side-quests/solutions/groovy_essentials/modules/trimgalore.nf @@ -0,0 +1,37 @@ +process TRIMGALORE { + container 'quay.io/biocontainers/trim-galore:0.6.10--hdfd78af_0' + + input: + tuple val(meta), path(reads) + + output: + tuple val(meta), path("*_trimmed*.fq"), emit: reads + path "*_trimming_report.txt" , emit: reports + + script: + // Simple single-end vs paired-end detection + def is_single = reads instanceof List ? 
reads.size() == 1 : true + + if (is_single) { + def input_file = reads instanceof List ? reads[0] : reads + """ + trim_galore \\ + --cores $task.cpus \\ + ${input_file} + + # Rename output to match expected pattern + mv *_trimmed.fq ${meta.id}_trimmed.fq + """ + } else { + """ + trim_galore \\ + --paired \\ + --cores $task.cpus \\ + ${reads[0]} ${reads[1]} + + # Rename outputs to match expected pattern + mv *_val_1.fq ${meta.id}_trimmed_R1.fq + mv *_val_2.fq ${meta.id}_trimmed_R2.fq + """ + } +} diff --git a/side-quests/solutions/groovy_essentials/nextflow.config b/side-quests/solutions/groovy_essentials/nextflow.config new file mode 100644 index 000000000..80c74b85f --- /dev/null +++ b/side-quests/solutions/groovy_essentials/nextflow.config @@ -0,0 +1,23 @@ +// Nextflow configuration for Groovy Essentials side quest + +docker.enabled = true + +workflow.onComplete = { + println "" + println "Pipeline execution summary:" + println "==========================" + println "Completed at: ${workflow.complete}" + println "Duration : ${workflow.duration}" + println "Success : ${workflow.success}" + println "workDir : ${workflow.workDir}" + println "exit status : ${workflow.exitStatus}" + println "" + + if (workflow.success) { + println "✅ Pipeline completed successfully!" + println "Results are in: ${params.outdir ?: 'results'}" + } else { + println "❌ Pipeline failed!" + println "Error: ${workflow.errorMessage}" + } +} diff --git a/side-quests/workflow_management_fundamentals/README.md b/side-quests/workflow_management_fundamentals/README.md new file mode 100644 index 000000000..6af86b348 --- /dev/null +++ b/side-quests/workflow_management_fundamentals/README.md @@ -0,0 +1,30 @@ +# Workflow Management Fundamentals + +This side quest demonstrates the benefits of workflow management systems like Nextflow by comparing a traditional bash script approach with a Nextflow workflow. + +## Overview + +This tutorial uses a bacterial genome analysis pipeline as an exemplar to show: + +- Automatic parallelization +- Software environment isolation via containers +- Resume capability +- Portable execution across environments +- Resource management +- Data provenance tracking + +## Files + +- `process_samples.sh` - Traditional bash script approach +- `main.nf` - Nextflow workflow +- `modules/` - Modular process definitions +- `data/samples.csv` - Sample metadata +- `nextflow.config` - Configuration file + +## Note + +This is a pedagogical example. The actual sequence data files are not included as they would be too large. The focus is on understanding the workflow patterns and benefits, not executing the actual analysis. 
+ +## Getting Started + +See the full tutorial in the documentation: [docs/side_quests/workflow_management_fundamentals.md](../../docs/side_quests/workflow_management_fundamentals.md) diff --git a/side-quests/workflow_management_fundamentals/data/samples.csv b/side-quests/workflow_management_fundamentals/data/samples.csv new file mode 100644 index 000000000..4927a1c69 --- /dev/null +++ b/side-quests/workflow_management_fundamentals/data/samples.csv @@ -0,0 +1,4 @@ +sample_id,organism,read1,read2 +sample_01,E.coli,data/reads/sample_01_R1.fastq.gz,data/reads/sample_01_R2.fastq.gz +sample_02,S.aureus,data/reads/sample_02_R1.fastq.gz,data/reads/sample_02_R2.fastq.gz +sample_03,P.aeruginosa,data/reads/sample_03_R1.fastq.gz,data/reads/sample_03_R2.fastq.gz diff --git a/side-quests/workflow_management_fundamentals/main.nf b/side-quests/workflow_management_fundamentals/main.nf new file mode 100644 index 000000000..a16d6c390 --- /dev/null +++ b/side-quests/workflow_management_fundamentals/main.nf @@ -0,0 +1,37 @@ +#!/usr/bin/env nextflow + +// Include process definitions +include { FASTQC } from './modules/fastqc' +include { FASTP } from './modules/fastp' +include { SPADES } from './modules/spades' +include { QUAST } from './modules/quast' + +// Pipeline parameters +params.samples = 'data/samples.csv' +params.outdir = 'results' + +// Main workflow +workflow { + + // Read sample sheet and create channel + ch_samples = Channel + .fromPath(params.samples) + .splitCsv(header: true) + .map { row -> + def meta = [id: row.sample_id, organism: row.organism] + def reads = [file(row.read1), file(row.read2)] + return [meta, reads] + } + + // Quality control + FASTQC(ch_samples) + + // Trim and filter + FASTP(ch_samples) + + // Assemble genomes + SPADES(FASTP.out.reads) + + // Quality assessment + QUAST(SPADES.out.assembly) +} diff --git a/side-quests/workflow_management_fundamentals/modules/fastp.nf b/side-quests/workflow_management_fundamentals/modules/fastp.nf new file mode 100644 index 000000000..ad3b4dba9 --- /dev/null +++ b/side-quests/workflow_management_fundamentals/modules/fastp.nf @@ -0,0 +1,28 @@ +process FASTP { + tag "$meta.id" + container 'biocontainers/fastp:0.23.4' + publishDir "${params.outdir}/trimmed", mode: 'copy' + + cpus 4 + memory '8.GB' + + input: + tuple val(meta), path(reads) + + output: + tuple val(meta), path("*_R{1,2}.fastq.gz"), emit: reads + path "*.{json,html}", emit: reports + + script: + def prefix = meta.id + """ + fastp \\ + -i ${reads[0]} \\ + -I ${reads[1]} \\ + -o ${prefix}_R1.fastq.gz \\ + -O ${prefix}_R2.fastq.gz \\ + --json ${prefix}.json \\ + --html ${prefix}.html \\ + --thread $task.cpus + """ +} diff --git a/side-quests/workflow_management_fundamentals/modules/fastqc.nf b/side-quests/workflow_management_fundamentals/modules/fastqc.nf new file mode 100644 index 000000000..04d84bf77 --- /dev/null +++ b/side-quests/workflow_management_fundamentals/modules/fastqc.nf @@ -0,0 +1,20 @@ +process FASTQC { + tag "$meta.id" + container 'biocontainers/fastqc:0.12.1' + publishDir "${params.outdir}/fastqc", mode: 'copy' + + cpus 2 + memory '2.GB' + + input: + tuple val(meta), path(reads) + + output: + tuple val(meta), path("*.html"), emit: html + tuple val(meta), path("*.zip"), emit: zip + + script: + """ + fastqc -q -t $task.cpus ${reads} + """ +} diff --git a/side-quests/workflow_management_fundamentals/modules/quast.nf b/side-quests/workflow_management_fundamentals/modules/quast.nf new file mode 100644 index 000000000..5c84e9e6e --- /dev/null +++ 
b/side-quests/workflow_management_fundamentals/modules/quast.nf @@ -0,0 +1,22 @@ +process QUAST { + tag "$meta.id" + container 'biocontainers/quast:5.2.0' + publishDir "${params.outdir}/quast", mode: 'copy' + + cpus 2 + memory '4.GB' + + input: + tuple val(meta), path(assembly) + + output: + tuple val(meta), path("${meta.id}/*"), emit: report + + script: + """ + quast.py \\ + $assembly \\ + -o ${meta.id} \\ + --threads $task.cpus + """ +} diff --git a/side-quests/workflow_management_fundamentals/modules/spades.nf b/side-quests/workflow_management_fundamentals/modules/spades.nf new file mode 100644 index 000000000..21a24bd43 --- /dev/null +++ b/side-quests/workflow_management_fundamentals/modules/spades.nf @@ -0,0 +1,26 @@ +process SPADES { + tag "$meta.id" + container 'biocontainers/spades:3.15.5' + publishDir "${params.outdir}/assemblies", mode: 'copy' + + cpus 8 + memory '16.GB' + time '6.h' + + input: + tuple val(meta), path(reads) + + output: + tuple val(meta), path("${meta.id}/contigs.fasta"), emit: assembly + tuple val(meta), path("${meta.id}/assembly_graph.fastg"), emit: graph + + script: + """ + spades.py \\ + -1 ${reads[0]} \\ + -2 ${reads[1]} \\ + -o ${meta.id} \\ + --threads $task.cpus \\ + --memory ${task.memory.toGiga()} + """ +} diff --git a/side-quests/workflow_management_fundamentals/nextflow.config b/side-quests/workflow_management_fundamentals/nextflow.config new file mode 100644 index 000000000..a603b66e3 --- /dev/null +++ b/side-quests/workflow_management_fundamentals/nextflow.config @@ -0,0 +1,70 @@ +// Project defaults +params { + samples = 'data/samples.csv' + outdir = 'results' +} + +// Process defaults +process { + cpus = 2 + memory = '4.GB' + time = '2.h' + + // Override for specific processes + withName: 'SPADES' { + cpus = 8 + memory = '16.GB' + time = '6.h' + } +} + +// Container settings +docker.enabled = true +singularity.enabled = false + +// Generate execution reports +timeline { + enabled = true + file = "${params.outdir}/timeline.html" +} + +report { + enabled = true + file = "${params.outdir}/report.html" +} + +trace { + enabled = true + file = "${params.outdir}/trace.txt" +} + +// Profiles for different execution environments +profiles { + + standard { + process.executor = 'local' + docker.enabled = true + } + + cluster { + process.executor = 'slurm' + process.queue = 'general' + process.clusterOptions = '--account=myproject' + singularity.enabled = true + singularity.cacheDir = '/shared/containers' + docker.enabled = false + } + + cloud { + process.executor = 'awsbatch' + process.queue = 'genomics-queue' + workDir = 's3://my-bucket/work' + aws.region = 'us-east-1' + } + + test { + params.samples = 'test/mini_samples.csv' + process.cpus = 1 + process.memory = '2.GB' + } +} diff --git a/side-quests/workflow_management_fundamentals/process_samples.sh b/side-quests/workflow_management_fundamentals/process_samples.sh new file mode 100644 index 000000000..7db48803f --- /dev/null +++ b/side-quests/workflow_management_fundamentals/process_samples.sh @@ -0,0 +1,56 @@ +#!/bin/bash +set -e # Exit on error + +echo "Starting bacterial genome analysis pipeline" +echo "===========================================" + +# Create output directories +mkdir -p results/fastqc +mkdir -p results/trimmed +mkdir -p results/assemblies +mkdir -p results/quast + +# Read the sample CSV and process each sample +tail -n +2 data/samples.csv | while IFS=',' read -r sample_id organism read1 read2; do + + echo "" + echo "Processing $sample_id ($organism)..." 
+ echo "-----------------------------------" + + # Step 1: Quality control with FastQC + echo "Running FastQC..." + fastqc -q -o results/fastqc $read1 $read2 + + # Step 2: Trim adapters and filter with fastp + echo "Running fastp..." + fastp \ + -i $read1 \ + -I $read2 \ + -o results/trimmed/${sample_id}_R1.fastq.gz \ + -O results/trimmed/${sample_id}_R2.fastq.gz \ + --json results/trimmed/${sample_id}.json \ + --html results/trimmed/${sample_id}.html \ + --thread 4 + + # Step 3: Genome assembly with SPAdes + echo "Running SPAdes assembly..." + spades.py \ + -1 results/trimmed/${sample_id}_R1.fastq.gz \ + -2 results/trimmed/${sample_id}_R2.fastq.gz \ + -o results/assemblies/${sample_id} \ + --threads 8 \ + --memory 16 + + # Step 4: Assembly quality assessment with QUAST + echo "Running QUAST..." + quast.py \ + results/assemblies/${sample_id}/contigs.fasta \ + -o results/quast/${sample_id} \ + --threads 4 + + echo "Completed $sample_id" +done + +echo "" +echo "Pipeline complete!" +echo "Results available in results/"