Reformatting: Annotate subworkflow, OLGA module, Patient subworkflow#75
Conversation
- parallelize concatenation of files to avoid loading all of them in memory; might remove RESOLVE_SAMPLESHEET
- improve calculation of log10_pgen, joining of sample repertoires + olga pgen values
- simplify calculation of min/max olga log10_pgen values
- Add patient workflow to run on samples grouped by patient
- Move modules such as compare_calc, giana, gliph to patient to reduce scope of algorithms for large datasets
- Refactored compare/patient calc to vectorize operations

Further cleaning up of code and updating documentation will be required.
Unit Test Results: 10 tests, 10 ✅, 2m 50s ⏱️. Results for commit 2f30cbe.
Pull request overview
This PR adds a new patient-level analysis stage to the Nextflow pipeline and refactors the annotate → OLGA plumbing so downstream stages (sample/compare/patient) can reuse shared intermediate artifacts.
Changes:
- Introduce a `patient` workflow level (schema + Cirro form + new `PATIENT` subworkflow and patient modules/scripts).
- Refactor `ANNOTATE` to produce per-sample processed CDR3 files and a concatenated/sorted CDR3 table, and to emit OLGA-derived stats for reuse downstream.
- Rework OLGA sample merge + histogram inputs to use shared OLGA stats and simplify sample-level OLGA steps.
Reviewed changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| workflows/tcrtoolkit.nf | Adds patient stage wiring and guards; updates call graph (ANNOTATE/SAMPLE/PATIENT/COMPARE). |
| subworkflows/local/annotate.nf | Refactors annotate subworkflow to use ANNOTATE_PROCESS, concatenation via collectFile, and emits olga_stats. |
| subworkflows/local/sample.nf | Updates SAMPLE inputs and OLGA calls to use shared olga_stats; removes deprecated OLGA max-writing. |
| subworkflows/local/patient.nf | New patient-level subworkflow: group samples by meta.patient, concatenate, run patient metrics + clustering tools. |
| subworkflows/local/compare.nf | Simplifies compare stage to TCR sharing + OLGA merge only (removes GIANA/GLIPH2 here). |
| modules/local/annotate/main.nf | Adds ANNOTATE_PROCESS module to standardize per-sample CDR3 extraction. |
| modules/local/olga/main.nf | Adds log10 computation earlier, emits OLGA stats, refactors sample merge to a streaming join, and updates histogram calc inputs. |
| modules/local/patient/main.nf | New processes for patient concatenation and patient overlap matrix calculation. |
| bin/patient_calc.py | New vectorized patient overlap calculator writing per-patient matrices. |
| modules/local/compare/gliph2.nf | Makes GLIPH2 outputs patient-scoped (input tuple includes patient). |
| modules/local/compare/giana.nf | Makes GIANA outputs patient-scoped and changes logging/output set. |
| nextflow_schema.json | Extends workflow_level validation to allow patient. |
| nextflow.config | Changes default olga_chunk_length. |
| modules/local/sample/sample_calc.nf | Removes the stub: block for SAMPLE_CALC. |
| .cirro/process-input.json | Wires new patient_lvl form value into params. |
| .cirro/process-form.json | Adds patient toggle; updates olga_chunk_length type/default. |
| .cirro/preprocess.py | Adds patient flag into workflow_level construction. |
From OLGA_SAMPLE_MERGE, the index load and conversion:

```python
df = pd.read_csv(
    INDEX_FILE,
    sep="\t",
    usecols=["CDR3b", "pgen", "log10_pgen"],
    dtype={"CDR3b": "object"},
)

max_len = "${olga_stats.max_cdr3_length}"
cdr3_dtype = f"S{max_len}"
print(f"Max CDR3 length: {max_len}")

cdr3_index = df["CDR3b"].values.astype(cdr3_dtype)
pgen_index = df["pgen"].to_numpy(dtype=np.float64)
log10_index = pd.to_numeric(
    df["log10_pgen"], errors="coerce"
).to_numpy(dtype=np.float32)
```

An interleaved fragment of the same diff covers the sample-level min/max bound writing:

```python
left_bound = np.floor(np.min(log_probs))
right_bound = np.ceil(np.max(log_probs))

with open("olga_xmin_value.txt", "w") as f:
    f.write(str(left_bound))

with open("olga_xmax_value.txt", "w") as f:
    f.write(str(right_bound))
```
OLGA_SAMPLE_MERGE loads pgen with `df["pgen"].to_numpy(dtype=np.float64)`, but upstream OLGA_CALCULATE explicitly emits pgen values of `"NA"` for missing/invalid entries. Converting a column containing `"NA"` strings directly to float64 will raise a `ValueError`. Coerce pgen with `pd.to_numeric(..., errors="coerce")` (similar to `log10_pgen`) before converting to NumPy.
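As a sketch of the suggested fix on toy data (the column values below are illustrative, not taken from the pipeline), coercion maps the `"NA"` strings to NaN instead of raising:

```python
import numpy as np
import pandas as pd

# Toy index column containing the "NA" sentinel OLGA_CALCULATE can emit
df = pd.DataFrame({"pgen": ["1e-9", "NA", "3e-12"]})

# Direct conversion fails: the "NA" string cannot be parsed as float64
try:
    df["pgen"].to_numpy(dtype=np.float64)
    raised = False
except ValueError:
    raised = True

# Coercing first maps unparseable entries to NaN; conversion is then safe
pgen_index = pd.to_numeric(df["pgen"], errors="coerce").to_numpy(dtype=np.float64)
```

Downstream code then only has to handle NaN, which NumPy propagates silently instead of aborting the merge.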
From OLGA_HISTOGRAM_CALC:

```python
min_val = ${olga_stats.min_log10_pgen}
max_val = ${olga_stats.max_log10_pgen}
n_bins = 30
```
OLGA_HISTOGRAM_CALC injects `olga_stats.min_log10_pgen` / `max_log10_pgen` directly into Python without quoting. If the stats file ends up with an empty value (e.g. all log10_pgen are NA), the generated script becomes invalid (`min_val = `). Consider setting defaults in OLGA_CONCATENATE (or in the Groovy parsing) and/or emitting numeric sentinel values to guarantee valid Python literals.
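One way to guarantee valid literals, sketched with a hypothetical `safe_stat` helper (the name and sentinel values are illustrative, not from the pipeline): parse each stats field with a numeric fallback before it is interpolated into the generated script:

```python
def safe_stat(raw, default):
    """Parse a stats field, falling back to a numeric sentinel when the
    field is empty or unparseable (e.g. every log10_pgen was NA)."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return default

# An empty field yields the sentinel; a valid field parses normally
min_val = safe_stat("", -50.0)
max_val = safe_stat("-3.25", 0.0)
```

Emitting the sentinel from OLGA_CONCATENATE itself would work equally well; the point is that the interpolated value is always a valid Python float literal.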
```python
merged_df = pd.read_csv("${olga_pgen}", sep="\t")

# Drop rows where pgen is 0
merged_df = merged_df[merged_df['pgen'] != 0]
log_probs = np.log10(merged_df['pgen'])
cdr3_counts = merged_df['duplicate_count']
```
OLGA_HISTOGRAM_CALC filters out `pgen == 0` but not missing values. With the updated merge logic, unmatched sequences are written as NaN pgen values; these will survive the `!= 0` filter and propagate NaN into `np.log10`/binning. Consider dropping nulls and non-positive pgen values (e.g. `merged_df = merged_df.dropna(subset=['pgen']); merged_df = merged_df[merged_df['pgen'] > 0]`) before computing `log_probs`.
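A minimal sketch of the suggested filtering on a toy merged table (the values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy merged table: one zero-pgen row and one unmatched (NaN-pgen) row
merged_df = pd.DataFrame({
    "pgen": [1e-10, 0.0, np.nan, 5e-8],
    "duplicate_count": [3, 1, 2, 7],
})

# Drop missing values, then non-positive pgen, before taking log10
merged_df = merged_df.dropna(subset=["pgen"])
merged_df = merged_df[merged_df["pgen"] > 0]

log_probs = np.log10(merged_df["pgen"])
cdr3_counts = merged_df["duplicate_count"]
```

With only the `!= 0` filter, the NaN row would survive and poison the histogram bounds; after both filters, `log_probs` is finite for every remaining row.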
```groovy
// Running patient analysis
if (levels.contains('patient')) {
    PATIENT( ANNOTATE.out.processed_samples )
}
```
New patient workflow level is introduced (new subworkflow + new modules and significant OLGA plumbing changes), but the nf-test suite currently only asserts a minimal sample/compare run succeeds. Consider adding at least one nf-test case that runs with workflow_level=patient (and optionally sample,patient) to catch the ANNOTATE dependency and verify expected patient outputs are produced.
```groovy
if (levels.intersect(['sample','compare'])) {
    ANNOTATE( sample_map_final )
}

// Running sample level analysis
if (levels.contains('sample')) {
    SAMPLE(
        sample_map_final,
        ANNOTATE.out.cdr3_pgen,
        ANNOTATE.out.olga_stats
    )
}

// Running patient analysis
if (levels.contains('patient')) {
    PATIENT( ANNOTATE.out.processed_samples )
}
```
ANNOTATE is only executed when `workflow_level` includes sample or compare, but the patient level also consumes `ANNOTATE.out.processed_samples`. If a user selects only patient (allowed by the schema / Cirro form), `PATIENT( ANNOTATE.out.processed_samples )` will fail because ANNOTATE never ran. Consider including `'patient'` in the condition (or running ANNOTATE whenever any downstream level needs it).
```groovy
if (!has_patient) {
    println("\u001B[33m[WARN]\u001B[0m Patient workflow was specified but metadata was not found in samplesheet; please specify patient IDs for samples using the 'patient' column.")
    return
}
```
This early return will exit the entire workflow if patient was requested but the samplesheet header lacks a patient column, even when other levels (e.g. sample, compare) were also requested and could still run. If the intent is to skip only the patient stage, consider removing 'patient' from levels (or gating only the PATIENT(...) call) and only return when no runnable levels remain.
Suggested change:

```diff
-    return
+    // Skip patient-level analysis but continue with other requested levels
+    levels = levels.findAll { it != 'patient' }
```
```groovy
def samplesheeet_header = file(params.samplesheet).readLines().first().split(',')
def has_patient = samplesheeet_header.contains('patient')
```
Typo in variable name samplesheeet_header (triple 'e'). This makes the code harder to read/search and is easy to propagate if reused; consider renaming to samplesheet_header.
Suggested change:

```diff
-def samplesheeet_header = file(params.samplesheet).readLines().first().split(',')
-def has_patient = samplesheeet_header.contains('patient')
+def samplesheet_header = file(params.samplesheet).readLines().first().split(',')
+def has_patient = samplesheet_header.contains('patient')
```
```diff
@@ -17,12 +17,4 @@ process SAMPLE_CALC {
     """
     sample_calc.py -s '${sample_meta.sample}' -c ${count_table}
     """
```
This repo uses stub: blocks in several modules (e.g. modules/local/sample/sample_plot.nf, modules/local/compare/compare_plot.nf). Removing the stub: section from SAMPLE_CALC makes -stub-run / lightweight test runs inconsistent and may break workflows that rely on stubbing heavy steps. Consider restoring a minimal stub that creates the expected output files for this process.
Suggested change:

```groovy
    """

    stub:
    """
    touch sample_stats_${sample_meta.sample}.csv
    touch v_family_${sample_meta.sample}.csv
    touch d_family_${sample_meta.sample}.csv
    touch j_family_${sample_meta.sample}.csv
    """
```
```diff
 GIANA --file "${concat_cdr3}" \
     --output . \
     --outfile giana_RotationEncodingBL62.txt \
     --EncodingMatrix true \
     --threshold ${threshold} \
     --threshold_score ${threshold_score} \
     --threshold_vgene ${threshold_vgene} \
     --NumberOfThreads ${task.cpus} \
-    --Verbose \
-    > giana.log 2>&1
+    --Verbose
+    # > giana.log 2>&1
```
GIANA is still run with --Verbose, but stdout/stderr redirection to a log file was commented out and giana.log removed from outputs. This can bloat Nextflow task logs and makes it harder to diagnose failures after the fact. Consider restoring log redirection (and emitting it) or explicitly disabling verbose output when not capturing logs.
```diff
@@ -47,9 +52,18 @@ workflow ANNOTATE {
     }
 )
```
The `collectFile(... sort: { ... })` chunk ordering regex looks incorrect for `splitText` outputs. `splitText` typically produces filenames like `unique_cdr3.txt.1`, `unique_cdr3.txt.2`, etc. (numeric suffix after the last dot), which will not match `\.(\d+)\.txt$`, so all chunks may sort as 0 and concatenate in nondeterministic order. If downstream assumes sorted / stable chunk order (e.g., OLGA index join), adjust the sort key to parse the numeric suffix correctly (e.g., last token after `.`).
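The failure mode and fix can be illustrated in Python (the filenames follow the typical `splitText` pattern; the actual change would go in the Groovy `sort` closure):

```python
import re

# Chunk names in the shape splitText typically emits
chunks = ["unique_cdr3.txt.10", "unique_cdr3.txt.2", "unique_cdr3.txt.1"]

def old_key(name):
    # The reviewed pattern expects ".N.txt" at the end, which never matches
    # these names, so every chunk falls back to the same key
    m = re.search(r"\.(\d+)\.txt$", name)
    return int(m.group(1)) if m else 0

def new_key(name):
    # Parse the numeric token after the last dot instead
    return int(name.rsplit(".", 1)[-1])
```

Sorting with `old_key` leaves whatever order the filesystem produced (every key is 0), while `new_key` restores the numeric chunk order the downstream join expects.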