
Reformatting: Annotate subworkflow, OLGA module, Patient subworkflow#75

Open
dltamayo wants to merge 8 commits into main from dltamayo-dev

Conversation

@dltamayo
Collaborator

No description provided.

- Parallelize concatenation of files to avoid loading all of them in memory; might remove RESOLVE_SAMPLESHEET
- Improve the calculation of log10_pgen and the joining of sample repertoires with OLGA pgen values
- Simplify the calculation of min/max OLGA log10_pgen values
- Add a patient workflow that runs on samples grouped by patient
- Move modules such as compare_calc, giana, and gliph to the patient level to reduce the scope of these algorithms for large datasets
- Refactor compare/patient calc to vectorize operations
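The streaming concatenation in the first bullet can be sketched with Nextflow's `collectFile` operator (the channel and file names here are assumptions, not the PR's actual code):

```groovy
// Sketch, not the PR's code: concatenate per-sample CDR3 tables
// file-by-file so the full set never has to fit in memory at once.
ch_processed_cdr3
    .collectFile(name: 'concat_cdr3.tsv', keepHeader: true, skip: 1, sort: { it.name })
    .set { ch_concat_cdr3 }
```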

Further cleaning up of code and updating documentation will be required.
@dimalvovs requested a review from Copilot on March 23, 2026 at 17:49
@github-actions

Unit Test Results

10 tests: 10 ✅, 0 💤, 0 ❌
2 suites, 1 file, 2m 50s ⏱️

Results for commit 2f30cbe.

Contributor

Copilot AI left a comment


Pull request overview

This PR adds a new patient-level analysis stage to the Nextflow pipeline and refactors the annotate → OLGA plumbing so downstream stages (sample/compare/patient) can reuse shared intermediate artifacts.

Changes:

  • Introduce a patient workflow level (schema + Cirro form + new PATIENT subworkflow and patient modules/scripts).
  • Refactor ANNOTATE to produce per-sample processed CDR3 files and a concatenated/sorted CDR3 table, and to emit OLGA-derived stats for reuse downstream.
  • Rework OLGA sample merge + histogram inputs to use shared OLGA stats and simplify sample-level OLGA steps.

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 10 comments.

Summary per file:

| File | Description |
| --- | --- |
| `workflows/tcrtoolkit.nf` | Adds patient stage wiring and guards; updates the call graph (ANNOTATE/SAMPLE/PATIENT/COMPARE). |
| `subworkflows/local/annotate.nf` | Refactors the annotate subworkflow to use ANNOTATE_PROCESS, concatenates via collectFile, and emits olga_stats. |
| `subworkflows/local/sample.nf` | Updates SAMPLE inputs and OLGA calls to use shared olga_stats; removes deprecated OLGA max-writing. |
| `subworkflows/local/patient.nf` | New patient-level subworkflow: groups samples by meta.patient, concatenates, runs patient metrics and clustering tools. |
| `subworkflows/local/compare.nf` | Simplifies the compare stage to TCR sharing + OLGA merge only (removes GIANA/GLIPH2 here). |
| `modules/local/annotate/main.nf` | Adds the ANNOTATE_PROCESS module to standardize per-sample CDR3 extraction. |
| `modules/local/olga/main.nf` | Adds log10 computation earlier, emits OLGA stats, refactors the sample merge to a streaming join, and updates histogram calc inputs. |
| `modules/local/patient/main.nf` | New processes for patient concatenation and patient overlap matrix calculation. |
| `bin/patient_calc.py` | New vectorized patient overlap calculator writing per-patient matrices. |
| `modules/local/compare/gliph2.nf` | Makes GLIPH2 outputs patient-scoped (input tuple includes patient). |
| `modules/local/compare/giana.nf` | Makes GIANA outputs patient-scoped and changes the logging/output set. |
| `nextflow_schema.json` | Extends workflow_level validation to allow patient. |
| `nextflow.config` | Changes the default olga_chunk_length. |
| `modules/local/sample/sample_calc.nf` | Removes the stub: block for SAMPLE_CALC. |
| `.cirro/process-input.json` | Wires the new patient_lvl form value into params. |
| `.cirro/process-form.json` | Adds a patient toggle; updates olga_chunk_length type/default. |
| `.cirro/preprocess.py` | Adds the patient flag into workflow_level construction. |


Comment on lines +132 to +150
```python
df = pd.read_csv(
    INDEX_FILE,
    sep="\t",
    usecols=["CDR3b", "pgen", "log10_pgen"],
    dtype={"CDR3b": "object"},
)

left_bound = np.floor(np.min(log_probs))
right_bound = np.ceil(np.max(log_probs))
max_len = "${olga_stats.max_cdr3_length}"
cdr3_dtype = f"S{max_len}"

with open(f"olga_xmin_value.txt", "w") as f:
    f.write(str(left_bound))
print(f"Max CDR3 length: {max_len}")

with open(f"olga_xmax_value.txt", "w") as f:
    f.write(str(right_bound))

cdr3_index = df["CDR3b"].values.astype(cdr3_dtype)

pgen_index = df["pgen"].to_numpy(dtype=np.float64)

log10_index = pd.to_numeric(
    df["log10_pgen"], errors="coerce"
).to_numpy(dtype=np.float32)
```

Copilot AI Mar 23, 2026


OLGA_SAMPLE_MERGE loads pgen with `df["pgen"].to_numpy(dtype=np.float64)`, but upstream OLGA_CALCULATE explicitly emits pgen values of `"NA"` for missing/invalid entries. Converting a column containing `"NA"` strings directly to float64 will raise a ValueError. Coerce pgen with `pd.to_numeric(..., errors="coerce")` (similar to log10_pgen) before converting to NumPy.
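A minimal, self-contained illustration of the failure and the suggested coercion (hypothetical data, not the pipeline's actual table):

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing numeric strings with OLGA's "NA" sentinel
df = pd.DataFrame({"pgen": ["1e-9", "NA", "3e-12"]})

# Direct conversion fails: "NA" cannot be parsed as a float
try:
    df["pgen"].to_numpy(dtype=np.float64)
    raised = False
except ValueError:
    raised = True

# Coercion maps unparseable entries to NaN instead of raising
pgen = pd.to_numeric(df["pgen"], errors="coerce").to_numpy(dtype=np.float64)
```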

Comment on lines +259 to 261
```python
min_val = ${olga_stats.min_log10_pgen}
max_val = ${olga_stats.max_log10_pgen}
n_bins = 30
```

Copilot AI Mar 23, 2026


OLGA_HISTOGRAM_CALC injects `olga_stats.min_log10_pgen` / `max_log10_pgen` directly into Python without quoting. If the stats file ends up with an empty value (e.g. all log10_pgen are NA), the generated script becomes invalid (`min_val = `). Consider setting defaults in OLGA_CONCATENATE (or in the Groovy parsing) and/or emitting numeric sentinel values to guarantee valid Python literals.
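One way to guarantee a valid literal, sketched in plain Python (the helper and sentinel value are hypothetical, not the pipeline's code):

```python
def safe_stat(raw, default):
    """Return a float for a parsed stats entry, with a sentinel default for empty/NA input."""
    return float(raw) if raw not in ("", "NA") else default

min_val = safe_stat("", -50.0)      # empty entry in the stats file
max_val = safe_stat("-3.2", -50.0)  # normally parsed value
```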

Comment on lines 252 to 257
```python
merged_df = pd.read_csv("${olga_pgen}", sep="\t")

# Drop rows where pgen is 0
merged_df = merged_df[merged_df['pgen'] != 0]
log_probs = np.log10(merged_df['pgen'])
cdr3_counts = merged_df['duplicate_count']
```

Copilot AI Mar 23, 2026


OLGA_HISTOGRAM_CALC filters out `pgen == 0` but not missing values. With the updated merge logic, unmatched sequences are written as NaN pgen values; these will survive the `!= 0` filter and propagate NaN into `np.log10`/binning. Consider dropping nulls and non-positive pgen values (e.g. `merged_df = merged_df.dropna(subset=['pgen'])`; `merged_df = merged_df[merged_df['pgen'] > 0]`) before computing log_probs.
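The suggested filter, applied to hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical merge output: NaN marks sequences with no OLGA match
merged_df = pd.DataFrame({
    "pgen": [1e-9, 0.0, np.nan, 2e-11],
    "duplicate_count": [5, 3, 2, 7],
})

# Drop missing and non-positive pgen before taking log10
merged_df = merged_df.dropna(subset=["pgen"])
merged_df = merged_df[merged_df["pgen"] > 0]
log_probs = np.log10(merged_df["pgen"])
```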

Comment on lines +110 to +113
```groovy
// Running patient analysis
if (levels.contains('patient')) {
    PATIENT( ANNOTATE.out.processed_samples )
}
```

Copilot AI Mar 23, 2026


A new patient workflow level is introduced (new subworkflow, new modules, and significant OLGA plumbing changes), but the nf-test suite currently only asserts that a minimal sample/compare run succeeds. Consider adding at least one nf-test case that runs with `workflow_level=patient` (and optionally `sample,patient`) to catch the ANNOTATE dependency and verify expected patient outputs are produced.
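A hypothetical nf-test case along these lines (the file names and params are assumptions, not taken from the repo):

```groovy
// Sketch of an nf-test pipeline test; names and params are assumptions
nextflow_pipeline {
    name "patient workflow level"
    script "main.nf"

    test("runs with workflow_level including patient") {
        when {
            params {
                workflow_level = "sample,patient"
            }
        }
        then {
            assert workflow.success
        }
    }
}
```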

Comment on lines +97 to +113
```groovy
if (levels.intersect(['sample','compare'])) {
    ANNOTATE( sample_map_final )
}

// Running sample level analysis
if (levels.contains('sample')) {
    SAMPLE(
        sample_map_final,
        ANNOTATE.out.cdr3_pgen,
        ANNOTATE.out.olga_stats
    )
}

// Running patient analysis
if (levels.contains('patient')) {
    PATIENT( ANNOTATE.out.processed_samples )
}
```

Copilot AI Mar 23, 2026


ANNOTATE is only executed when workflow_level includes sample or compare, but the patient level also consumes ANNOTATE.out.processed_samples. If a user selects only patient (allowed by the schema / Cirro form), PATIENT( ANNOTATE.out.processed_samples ) will fail because ANNOTATE never ran. Consider including 'patient' in the condition (or running ANNOTATE whenever any downstream level needs it).
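One possible gating fix, sketched with the variable names from the diff:

```groovy
// Run ANNOTATE whenever any level that consumes its outputs is requested
if (levels.intersect(['sample', 'compare', 'patient'])) {
    ANNOTATE( sample_map_final )
}
```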


```groovy
if (!has_patient) {
    println("\u001B[33m[WARN]\u001B[0m Patient workflow was specified but metadata was not found in samplesheet; please specify patient IDs for samples using the 'patient' column.")
    return
```

Copilot AI Mar 23, 2026


This early return will exit the entire workflow if patient was requested but the samplesheet header lacks a patient column, even when other levels (e.g. sample, compare) were also requested and could still run. If the intent is to skip only the patient stage, consider removing 'patient' from levels (or gating only the PATIENT(...) call) and only return when no runnable levels remain.

Suggested change:

```diff
-    return
+    // Skip patient-level analysis but continue with other requested levels
+    levels = levels.findAll { it != 'patient' }
```
Comment on lines +48 to +49
```groovy
def samplesheeet_header = file(params.samplesheet).readLines().first().split(',')
def has_patient = samplesheeet_header.contains('patient')
```

Copilot AI Mar 23, 2026


Typo in variable name samplesheeet_header (triple 'e'). This makes the code harder to read/search and is easy to propagate if reused; consider renaming to samplesheet_header.

Suggested change:

```diff
-def samplesheeet_header = file(params.samplesheet).readLines().first().split(',')
-def has_patient = samplesheeet_header.contains('patient')
+def samplesheet_header = file(params.samplesheet).readLines().first().split(',')
+def has_patient = samplesheet_header.contains('patient')
```
```diff
@@ -17,12 +17,4 @@ process SAMPLE_CALC {
     """
     sample_calc.py -s '${sample_meta.sample}' -c ${count_table}
     """
```

Copilot AI Mar 23, 2026


This repo uses stub: blocks in several modules (e.g. modules/local/sample/sample_plot.nf, modules/local/compare/compare_plot.nf). Removing the stub: section from SAMPLE_CALC makes -stub-run / lightweight test runs inconsistent and may break workflows that rely on stubbing heavy steps. Consider restoring a minimal stub that creates the expected output files for this process.

Suggested change:

```groovy
"""

stub:
"""
touch sample_stats_${sample_meta.sample}.csv
touch v_family_${sample_meta.sample}.csv
touch d_family_${sample_meta.sample}.csv
touch j_family_${sample_meta.sample}.csv
"""
```
Comment on lines 19 to +28
```diff
 GIANA --file "${concat_cdr3}" \
     --output . \
     --outfile giana_RotationEncodingBL62.txt \
     --EncodingMatrix true \
     --threshold ${threshold} \
     --threshold_score ${threshold_score} \
     --threshold_vgene ${threshold_vgene} \
     --NumberOfThreads ${task.cpus} \
-    --Verbose \
-    > giana.log 2>&1
+    --Verbose
+    # > giana.log 2>&1
```

Copilot AI Mar 23, 2026


GIANA is still run with `--Verbose`, but stdout/stderr redirection to a log file was commented out and `giana.log` removed from outputs. This can bloat Nextflow task logs and makes it harder to diagnose failures after the fact. Consider restoring log redirection (and emitting it) or explicitly disabling verbose output when not capturing logs.

Comment on lines 47 to 53
```diff
@@ -47,9 +52,18 @@ workflow ANNOTATE {
     }
 )
```

Copilot AI Mar 23, 2026


The `collectFile(... sort: { ... })` chunk ordering regex looks incorrect for `splitText` outputs. `splitText` typically produces filenames like `unique_cdr3.txt.1`, `unique_cdr3.txt.2`, etc. (numeric suffix after the last dot), which will not match `\.(\d+)\.txt$`, so all chunks may sort as 0 and concatenate in nondeterministic order. If downstream assumes sorted / stable chunk order (e.g., the OLGA index join), adjust the sort key to parse the numeric suffix correctly (e.g., the last token after `.`).
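A quick check of the pattern against `splitText`-style names, plus a sort key that parses the trailing suffix (the chunk names here are assumed):

```python
import re

# Hypothetical splitText chunk names: numeric suffix after the last dot
chunks = ["unique_cdr3.txt.10", "unique_cdr3.txt.2", "unique_cdr3.txt.1"]

# The reviewed pattern expects ".<digits>.txt" at the end, so it matches none of these
misses = [c for c in chunks if re.search(r"\.(\d+)\.txt$", c) is None]

# Sorting on the token after the last dot restores numeric chunk order
ordered = sorted(chunks, key=lambda c: int(c.rsplit(".", 1)[1]))
```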

