Reformatting: Annotate subworkflow, OLGA module, Patient subworkflow#75
Conversation
- parallelize concatenation of files to avoid loading all of them in memory; might remove RESOLVE_SAMPLESHEET
- improve calculation of log10_pgen, joining of sample repertoires + olga pgen values
- simplify calculation of min/max olga log10_pgen values
- Add patient workflow to run on samples grouped by patient
- Move modules such as compare_calc, giana, gliph to patient to reduce scope of algorithms for large datasets
- Refactored compare/patient calc to vectorize operations

Further cleaning up of code and updating documentation will be required.
Unit Test Results: 10 tests, 10 ✅, 2m 50s ⏱️. Results for commit 2f30cbe.
Pull request overview
This PR adds a new patient-level analysis stage to the Nextflow pipeline and refactors the annotate → OLGA plumbing so downstream stages (sample/compare/patient) can reuse shared intermediate artifacts.
Changes:
- Introduce a `patient` workflow level (schema + Cirro form + new `PATIENT` subworkflow and patient modules/scripts).
- Refactor `ANNOTATE` to produce per-sample processed CDR3 files and a concatenated/sorted CDR3 table, and to emit OLGA-derived stats for reuse downstream.
- Rework OLGA sample merge + histogram inputs to use shared OLGA stats and simplify sample-level OLGA steps.
Reviewed changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| workflows/tcrtoolkit.nf | Adds patient stage wiring and guards; updates call graph (ANNOTATE/SAMPLE/PATIENT/COMPARE). |
| subworkflows/local/annotate.nf | Refactors annotate subworkflow to use ANNOTATE_PROCESS, concatenation via collectFile, and emits olga_stats. |
| subworkflows/local/sample.nf | Updates SAMPLE inputs and OLGA calls to use shared olga_stats; removes deprecated OLGA max-writing. |
| subworkflows/local/patient.nf | New patient-level subworkflow: group samples by meta.patient, concatenate, run patient metrics + clustering tools. |
| subworkflows/local/compare.nf | Simplifies compare stage to TCR sharing + OLGA merge only (removes GIANA/GLIPH2 here). |
| modules/local/annotate/main.nf | Adds ANNOTATE_PROCESS module to standardize per-sample CDR3 extraction. |
| modules/local/olga/main.nf | Adds log10 computation earlier, emits OLGA stats, refactors sample merge to a streaming join, and updates histogram calc inputs. |
| modules/local/patient/main.nf | New processes for patient concatenation and patient overlap matrix calculation. |
| bin/patient_calc.py | New vectorized patient overlap calculator writing per-patient matrices. |
| modules/local/compare/gliph2.nf | Makes GLIPH2 outputs patient-scoped (input tuple includes patient). |
| modules/local/compare/giana.nf | Makes GIANA outputs patient-scoped and changes logging/output set. |
| nextflow_schema.json | Extends workflow_level validation to allow patient. |
| nextflow.config | Changes default olga_chunk_length. |
| modules/local/sample/sample_calc.nf | Removes the stub: block for SAMPLE_CALC. |
| .cirro/process-input.json | Wires new patient_lvl form value into params. |
| .cirro/process-form.json | Adds patient toggle; updates olga_chunk_length type/default. |
| .cirro/preprocess.py | Adds patient flag into workflow_level construction. |
From OLGA_SAMPLE_MERGE, the index load and conversion:

```python
df = pd.read_csv(
    INDEX_FILE,
    sep="\t",
    usecols=["CDR3b", "pgen", "log10_pgen"],
    dtype={"CDR3b": "object"},
)

max_len = "${olga_stats.max_cdr3_length}"
cdr3_dtype = f"S{max_len}"
print(f"Max CDR3 length: {max_len}")

cdr3_index = df["CDR3b"].values.astype(cdr3_dtype)
pgen_index = df["pgen"].to_numpy(dtype=np.float64)
log10_index = pd.to_numeric(
    df["log10_pgen"], errors="coerce"
).to_numpy(dtype=np.float32)
```

An interleaved fragment of the same diff covers the sample-level min/max bound writing:

```python
left_bound = np.floor(np.min(log_probs))
right_bound = np.ceil(np.max(log_probs))

with open("olga_xmin_value.txt", "w") as f:
    f.write(str(left_bound))

with open("olga_xmax_value.txt", "w") as f:
    f.write(str(right_bound))
```
OLGA_SAMPLE_MERGE loads pgen with `df["pgen"].to_numpy(dtype=np.float64)`, but upstream OLGA_CALCULATE explicitly emits pgen values of `"NA"` for missing/invalid entries. Converting a column containing `"NA"` strings directly to float64 will raise a `ValueError`. Coerce pgen with `pd.to_numeric(..., errors="coerce")` (similar to `log10_pgen`) before converting to NumPy.
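As a sketch of the suggested fix on toy data (the column values below are illustrative, not taken from the pipeline), coercion maps the `"NA"` strings to NaN instead of raising:

```python
import numpy as np
import pandas as pd

# Toy index column containing the "NA" sentinel OLGA_CALCULATE can emit
df = pd.DataFrame({"pgen": ["1e-9", "NA", "3e-12"]})

# Direct conversion fails: the "NA" string cannot be parsed as float64
try:
    df["pgen"].to_numpy(dtype=np.float64)
    raised = False
except ValueError:
    raised = True

# Coercing first maps unparseable entries to NaN; conversion is then safe
pgen_index = pd.to_numeric(df["pgen"], errors="coerce").to_numpy(dtype=np.float64)
```

Downstream code then only has to handle NaN, which NumPy propagates silently instead of aborting the merge.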
From OLGA_HISTOGRAM_CALC:

```python
min_val = ${olga_stats.min_log10_pgen}
max_val = ${olga_stats.max_log10_pgen}
n_bins = 30
```
OLGA_HISTOGRAM_CALC injects `olga_stats.min_log10_pgen` / `max_log10_pgen` directly into Python without quoting. If the stats file ends up with an empty value (e.g. all log10_pgen are NA), the generated script becomes invalid (`min_val = `). Consider setting defaults in OLGA_CONCATENATE (or in the Groovy parsing) and/or emitting numeric sentinel values to guarantee valid Python literals.
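One way to guarantee valid literals, sketched with a hypothetical `safe_stat` helper (the name and sentinel values are illustrative, not from the pipeline): parse each stats field with a numeric fallback before it is interpolated into the generated script:

```python
def safe_stat(raw, default):
    """Parse a stats field, falling back to a numeric sentinel when the
    field is empty or unparseable (e.g. every log10_pgen was NA)."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return default

# An empty field yields the sentinel; a valid field parses normally
min_val = safe_stat("", -50.0)
max_val = safe_stat("-3.25", 0.0)
```

Emitting the sentinel from OLGA_CONCATENATE itself would work equally well; the point is that the interpolated value is always a valid Python float literal.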
```python
merged_df = pd.read_csv("${olga_pgen}", sep="\t")

# Drop rows where pgen is 0
merged_df = merged_df[merged_df['pgen'] != 0]
log_probs = np.log10(merged_df['pgen'])
cdr3_counts = merged_df['duplicate_count']
```
OLGA_HISTOGRAM_CALC filters out `pgen == 0` but not missing values. With the updated merge logic, unmatched sequences are written as NaN pgen values; these will survive the `!= 0` filter and propagate NaN into `np.log10`/binning. Consider dropping nulls and non-positive pgen values (e.g. `merged_df = merged_df.dropna(subset=['pgen']); merged_df = merged_df[merged_df['pgen'] > 0]`) before computing `log_probs`.
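A minimal sketch of the suggested filtering on a toy merged table (the values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy merged table: one zero-pgen row and one unmatched (NaN-pgen) row
merged_df = pd.DataFrame({
    "pgen": [1e-10, 0.0, np.nan, 5e-8],
    "duplicate_count": [3, 1, 2, 7],
})

# Drop missing values, then non-positive pgen, before taking log10
merged_df = merged_df.dropna(subset=["pgen"])
merged_df = merged_df[merged_df["pgen"] > 0]

log_probs = np.log10(merged_df["pgen"])
cdr3_counts = merged_df["duplicate_count"]
```

With only the `!= 0` filter, the NaN row would survive and poison the histogram bounds; after both filters, `log_probs` is finite for every remaining row.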
```groovy
// Running patient analysis
if (levels.contains('patient')) {
    PATIENT( ANNOTATE.out.processed_samples )
}
```
New patient workflow level is introduced (new subworkflow + new modules and significant OLGA plumbing changes), but the nf-test suite currently only asserts a minimal sample/compare run succeeds. Consider adding at least one nf-test case that runs with workflow_level=patient (and optionally sample,patient) to catch the ANNOTATE dependency and verify expected patient outputs are produced.
```groovy
if (levels.intersect(['sample','compare'])) {
    ANNOTATE( sample_map_final )
}

// Running sample level analysis
if (levels.contains('sample')) {
    SAMPLE(
        sample_map_final,
        ANNOTATE.out.cdr3_pgen,
        ANNOTATE.out.olga_stats
    )
}

// Running patient analysis
if (levels.contains('patient')) {
    PATIENT( ANNOTATE.out.processed_samples )
}
```
ANNOTATE is only executed when `workflow_level` includes sample or compare, but the patient level also consumes `ANNOTATE.out.processed_samples`. If a user selects only patient (allowed by the schema / Cirro form), `PATIENT( ANNOTATE.out.processed_samples )` will fail because ANNOTATE never ran. Consider including `'patient'` in the condition (or running ANNOTATE whenever any downstream level needs it).
```groovy
if (!has_patient) {
    println("\u001B[33m[WARN]\u001B[0m Patient workflow was specified but metadata was not found in samplesheet; please specify patient IDs for samples using the 'patient' column.")
    return
}
```
This early return will exit the entire workflow if patient was requested but the samplesheet header lacks a patient column, even when other levels (e.g. sample, compare) were also requested and could still run. If the intent is to skip only the patient stage, consider removing 'patient' from levels (or gating only the PATIENT(...) call) and only return when no runnable levels remain.
Suggested change:

```diff
-    return
+    // Skip patient-level analysis but continue with other requested levels
+    levels = levels.findAll { it != 'patient' }
```
```groovy
def samplesheeet_header = file(params.samplesheet).readLines().first().split(',')
def has_patient = samplesheeet_header.contains('patient')
```
Typo in variable name samplesheeet_header (triple 'e'). This makes the code harder to read/search and is easy to propagate if reused; consider renaming to samplesheet_header.
Suggested change:

```diff
-def samplesheeet_header = file(params.samplesheet).readLines().first().split(',')
-def has_patient = samplesheeet_header.contains('patient')
+def samplesheet_header = file(params.samplesheet).readLines().first().split(',')
+def has_patient = samplesheet_header.contains('patient')
```
```diff
@@ -17,12 +17,4 @@ process SAMPLE_CALC {
     """
     sample_calc.py -s '${sample_meta.sample}' -c ${count_table}
     """
```
This repo uses stub: blocks in several modules (e.g. modules/local/sample/sample_plot.nf, modules/local/compare/compare_plot.nf). Removing the stub: section from SAMPLE_CALC makes -stub-run / lightweight test runs inconsistent and may break workflows that rely on stubbing heavy steps. Consider restoring a minimal stub that creates the expected output files for this process.
Suggested change:

```groovy
    """

    stub:
    """
    touch sample_stats_${sample_meta.sample}.csv
    touch v_family_${sample_meta.sample}.csv
    touch d_family_${sample_meta.sample}.csv
    touch j_family_${sample_meta.sample}.csv
    """
```
```diff
 GIANA --file "${concat_cdr3}" \
     --output . \
     --outfile giana_RotationEncodingBL62.txt \
     --EncodingMatrix true \
     --threshold ${threshold} \
     --threshold_score ${threshold_score} \
     --threshold_vgene ${threshold_vgene} \
     --NumberOfThreads ${task.cpus} \
-    --Verbose \
-    > giana.log 2>&1
+    --Verbose
+    # > giana.log 2>&1
```
GIANA is still run with --Verbose, but stdout/stderr redirection to a log file was commented out and giana.log removed from outputs. This can bloat Nextflow task logs and makes it harder to diagnose failures after the fact. Consider restoring log redirection (and emitting it) or explicitly disabling verbose output when not capturing logs.
```diff
@@ -47,9 +52,18 @@ workflow ANNOTATE {
     }
 )
```
The `collectFile(... sort: { ... })` chunk ordering regex looks incorrect for `splitText` outputs. `splitText` typically produces filenames like `unique_cdr3.txt.1`, `unique_cdr3.txt.2`, etc. (numeric suffix after the last dot), which will not match `\.(\d+)\.txt$`, so all chunks may sort as 0 and concatenate in nondeterministic order. If downstream assumes sorted / stable chunk order (e.g., OLGA index join), adjust the sort key to parse the numeric suffix correctly (e.g., last token after `.`).
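The failure mode and fix can be illustrated in Python (the filenames follow the typical `splitText` pattern; the actual change would go in the Groovy `sort` closure):

```python
import re

# Chunk names in the shape splitText typically emits
chunks = ["unique_cdr3.txt.10", "unique_cdr3.txt.2", "unique_cdr3.txt.1"]

def old_key(name):
    # The reviewed pattern expects ".N.txt" at the end, which never matches
    # these names, so every chunk falls back to the same key
    m = re.search(r"\.(\d+)\.txt$", name)
    return int(m.group(1)) if m else 0

def new_key(name):
    # Parse the numeric token after the last dot instead
    return int(name.rsplit(".", 1)[-1])
```

Sorting with `old_key` leaves whatever order the filesystem produced (every key is 0), while `new_key` restores the numeric chunk order the downstream join expects.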