Use awk instead of python to avoid reading files into memory
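The actual change swaps a Python script for an awk one-liner in `giana.nf` (sketched in a comment below); for comparison, here is a minimal streaming version of the same idea in Python that likewise avoids reading the whole file into memory. File names and the header string are illustrative, not the pipeline's actual paths.

```python
# Prepend a header line while streaming, instead of slurping the file
# into memory first. Shell equivalent (roughly):
#   awk 'BEGIN{print "cdr3\tcount"} 1' input.tsv > output.tsv
def prepend_header(src_path: str, dst_path: str, header: str) -> None:
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.write(header + "\n")
        for line in src:          # iterates line by line; no full read()
            dst.write(line)

# Tiny demo input (hypothetical data).
with open("input.tsv", "w") as f:
    f.write("a\t1\nb\t2\n")

prepend_header("input.tsv", "output.tsv", "cdr3\tcount")
```

Both approaches keep memory usage constant regardless of file size, since only one line is held at a time.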
- `sample` is the unique identifier within the samplesheet; there is no need to include other metadata columns for the purposes of clustering.
- reformatted sample_calc.py to only add sample as a metadata column; the rest of the metadata columns are now merged in the quarto notebook based on sample
- added adjustable parameters allowing the user to specify which metadata columns to use for notebook charts
- cleaned up a lot of code within the notebook for aesthetics, reducing code redundancy
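The merge described above can be sketched with pandas, joining the remaining samplesheet metadata onto per-sample results by the `sample` key. Column names and data here are hypothetical, not the pipeline's actual schema:

```python
import pandas as pd

# Hypothetical per-sample stats keyed only by 'sample', and a samplesheet
# carrying the remaining metadata columns.
stats = pd.DataFrame({"sample": ["s1", "s2"], "clonality": [0.2, 0.5]})
meta = pd.DataFrame({"sample": ["s1", "s2"],
                     "timepoint": ["pre", "post"],
                     "origin": ["PBMC", "tumor"]})

# Left-merge so every stats row keeps its metadata; 'sample' is the
# unique key, so no other columns are needed for the join.
merged = stats.merge(meta, on="sample", how="left")
print(merged)
```

Keeping the join key minimal means the calculation script only needs to know about `sample`; every other column stays user-defined.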
Merge from main
Unit Test Results: 10 tests, 10 ✅, 3m 6s ⏱️. Results for commit 8760a31.
Pull request overview
This PR removes hardcoded metadata assumptions from the TCRtoolkit pipeline, making it more flexible and configurable. The changes transition from a rigid metadata structure (subject_id, timepoint, origin) to a user-configurable approach where metadata columns are read from the sample table and plotting parameters can be specified via configuration.
Key Changes:
- Replaced hardcoded metadata column names with configurable parameters throughout the pipeline
- Refactored the sample statistics notebook to use dynamic metadata and plotting functions
- Simplified data processing scripts to remove metadata handling and use only sample names
Reviewed changes
Copilot reviewed 7 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| notebooks/sample_stats_template.qmd | Converted from hardcoded metadata fields to configurable parameters; refactored plotting to use a reusable function; updated V gene family plotting to support dynamic grouping columns |
| nextflow.config | Added new configuration parameters for sample chart x/color columns and V gene grouping |
| modules/local/sample/sample_plot.nf | Updated to pass new configuration parameters to the Quarto rendering process |
| modules/local/sample/sample_calc.nf | Simplified to pass only sample name instead of full metadata JSON |
| bin/sample_calc.py | Removed metadata handling logic; now only uses sample name for output files |
| bin/compare_concatenate.py | Removed hardcoded metadata column handling and patient ID construction |
| modules/local/compare/tcrsharing.nf | Improved performance using vectorized pandas operations for sample mapping |
| modules/local/compare/gliph2.nf | Updated to use 'sample' column instead of constructed 'patient' identifier |
| modules/local/compare/giana.nf | Replaced Python script with more efficient awk-based header insertion |
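The `tcrsharing.nf` change listed above (vectorized pandas operations for sample mapping) can be sketched as replacing a per-row loop with `Series.map`. The data and lookup table here are hypothetical:

```python
import pandas as pd

# Hypothetical frame of records plus a file-to-sample lookup.
df = pd.DataFrame({"file": ["a.tsv", "b.tsv", "a.tsv"]})
file_to_sample = {"a.tsv": "s1", "b.tsv": "s2"}

# Vectorized mapping: one call over the whole column instead of
# iterating rows in Python.
df["sample"] = df["file"].map(file_to_sample)
print(df["sample"].tolist())  # → ['s1', 's2', 's1']
```

`Series.map` runs the lookup in a single pass at the pandas level, which is typically much faster than `iterrows` or `apply` on large sharing tables.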
Comments suppressed due to low confidence (1)
modules/local/compare/gliph2.nf:36
- The gliph2 R script is constructed with raw Nextflow parameters `params.local_min_pvalue`, `params.simulation_depth`, `params.kmer_min_depth`, and `params.local_min_OVE` inserted directly into the R code. If a caller can override these params, they can supply values like `0.001); system("malicious_command"); #` so that the generated R script executes arbitrary OS commands via `system()` or otherwise alters control flow. These parameters should be strictly validated as numeric/structured values, or safely quoted and parsed within R, so that user-controlled input cannot change the R code structure.
```r
result <- turboGliph::gliph2(
  cdr3_sequences = df,
  result_folder = "./",
  lcminp = ${params.local_min_pvalue},
  sim_depth = ${params.simulation_depth},
  kmer_mindepth = ${params.kmer_min_depth},
  lcminove = ${params.local_min_OVE},
```
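One way to address the injection concern above is to reject any parameter value that is not a plain numeric literal before it is spliced into generated script text. A minimal sketch of such a guard (the function name and regex are illustrative, not part of the pipeline):

```python
import re

# Accept only plain numeric literals (optionally signed, with an
# optional decimal part and exponent) before interpolating a value
# into generated code.
NUMERIC = re.compile(r"^-?\d+(\.\d+)?([eE]-?\d+)?$")

def require_numeric(name: str, value) -> float:
    """Raise if `value` is not a bare number; otherwise return it as float."""
    if not NUMERIC.match(str(value)):
        raise ValueError(f"parameter {name} must be numeric, got {value!r}")
    return float(value)

print(require_numeric("local_min_pvalue", "0.001"))  # → 0.001
# An injection attempt fails validation instead of reaching the script:
# require_numeric("local_min_pvalue", '0.001); system("x"); #')  # raises ValueError
```

Validating at the boundary means the generated R code only ever contains values that parse as numbers, so the code structure cannot be altered by caller input.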
notebooks/sample_stats_template.qmd
```python
# 4. Loading data
## reading sample metadata
meta = pd.read_csv(sample_table, sep=',')
meta_cols = meta.columns.tolist()[0:-1]
```
The `meta_cols` calculation uses `meta.columns.tolist()[0:-1]`, which excludes the last column. This assumes the last column is always the 'file' column that should be excluded. However, this assumption is fragile: if the column order in the sample_table CSV changes, this will break. Consider explicitly selecting metadata columns or excluding the 'file' column by name instead.
```diff
- meta_cols = meta.columns.tolist()[0:-1]
+ meta_cols = [c for c in meta.columns if c != 'file']
```
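A quick check shows why the name-based filter suggested above is robust: it returns the same metadata columns regardless of where 'file' sits in the samplesheet. The toy column sets here are hypothetical:

```python
import pandas as pd

# Toy samplesheets with the 'file' column in different positions.
a = pd.DataFrame(columns=["sample", "timepoint", "file"])
b = pd.DataFrame(columns=["file", "sample", "timepoint"])

results = []
for meta in (a, b):
    # Positional slicing ([0:-1]) breaks when 'file' moves;
    # filtering by name does not.
    meta_cols = [c for c in meta.columns if c != "file"]
    results.append(sorted(meta_cols))

print(results)  # → [['sample', 'timepoint'], ['sample', 'timepoint']]
```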
- add logic to handle a single grouping metadata column versus multiple columns for the V gene chart