Use awk instead of python to avoid reading files into memory
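The actual change swaps a Python script for an awk one-liner in `giana.nf` (sketched in a comment below); for comparison, here is a minimal streaming version of the same idea in Python that likewise avoids reading the whole file into memory. File names and the header string are illustrative, not the pipeline's actual paths.

```python
# Prepend a header line while streaming, instead of slurping the file
# into memory first. Shell equivalent (roughly):
#   awk 'BEGIN{print "cdr3\tcount"} 1' input.tsv > output.tsv
def prepend_header(src_path: str, dst_path: str, header: str) -> None:
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.write(header + "\n")
        for line in src:          # iterates line by line; no full read()
            dst.write(line)

# Tiny demo input (hypothetical data).
with open("input.tsv", "w") as f:
    f.write("a\t1\nb\t2\n")

prepend_header("input.tsv", "output.tsv", "cdr3\tcount")
```

Both approaches keep memory usage constant regardless of file size, since only one line is held at a time.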
- `sample` is the unique identifier within the samplesheet; there is no need to include other metadata columns for the purposes of clustering.
- reformatted sample_calc.py to only add sample as a metadata column; the rest of the metadata columns are now merged in the quarto notebook based on sample
- added adjustable parameters allowing the user to specify which metadata columns to use for notebook charts
- cleaned up a lot of code within the notebook for aesthetics, reducing code redundancy
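The merge described above can be sketched with pandas, joining the remaining samplesheet metadata onto per-sample results by the `sample` key. Column names and data here are hypothetical, not the pipeline's actual schema:

```python
import pandas as pd

# Hypothetical per-sample stats keyed only by 'sample', and a samplesheet
# carrying the remaining metadata columns.
stats = pd.DataFrame({"sample": ["s1", "s2"], "clonality": [0.2, 0.5]})
meta = pd.DataFrame({"sample": ["s1", "s2"],
                     "timepoint": ["pre", "post"],
                     "origin": ["PBMC", "tumor"]})

# Left-merge so every stats row keeps its metadata; 'sample' is the
# unique key, so no other columns are needed for the join.
merged = stats.merge(meta, on="sample", how="left")
print(merged)
```

Keeping the join key minimal means the calculation script only needs to know about `sample`; every other column stays user-defined.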
Merge from main
Unit Test Results: 10 tests, 10 ✅, 3m 6s ⏱️. Results for commit 8760a31.
Pull request overview
This PR removes hardcoded metadata assumptions from the TCRtoolkit pipeline, making it more flexible and configurable. The changes transition from a rigid metadata structure (subject_id, timepoint, origin) to a user-configurable approach where metadata columns are read from the sample table and plotting parameters can be specified via configuration.
Key Changes:
- Replaced hardcoded metadata column names with configurable parameters throughout the pipeline
- Refactored the sample statistics notebook to use dynamic metadata and plotting functions
- Simplified data processing scripts to remove metadata handling and use only sample names
Reviewed changes
Copilot reviewed 7 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| notebooks/sample_stats_template.qmd | Converted from hardcoded metadata fields to configurable parameters; refactored plotting to use a reusable function; updated V gene family plotting to support dynamic grouping columns |
| nextflow.config | Added new configuration parameters for sample chart x/color columns and V gene grouping |
| modules/local/sample/sample_plot.nf | Updated to pass new configuration parameters to the Quarto rendering process |
| modules/local/sample/sample_calc.nf | Simplified to pass only sample name instead of full metadata JSON |
| bin/sample_calc.py | Removed metadata handling logic; now only uses sample name for output files |
| bin/compare_concatenate.py | Removed hardcoded metadata column handling and patient ID construction |
| modules/local/compare/tcrsharing.nf | Improved performance using vectorized pandas operations for sample mapping |
| modules/local/compare/gliph2.nf | Updated to use 'sample' column instead of constructed 'patient' identifier |
| modules/local/compare/giana.nf | Replaced Python script with more efficient awk-based header insertion |
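The `tcrsharing.nf` change listed above (vectorized pandas operations for sample mapping) can be sketched as replacing a per-row loop with `Series.map`. The data and lookup table here are hypothetical:

```python
import pandas as pd

# Hypothetical frame of records plus a file-to-sample lookup.
df = pd.DataFrame({"file": ["a.tsv", "b.tsv", "a.tsv"]})
file_to_sample = {"a.tsv": "s1", "b.tsv": "s2"}

# Vectorized mapping: one call over the whole column instead of
# iterating rows in Python.
df["sample"] = df["file"].map(file_to_sample)
print(df["sample"].tolist())  # → ['s1', 's2', 's1']
```

`Series.map` runs the lookup in a single pass at the pandas level, which is typically much faster than `iterrows` or `apply` on large sharing tables.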
Comments suppressed due to low confidence (1)
modules/local/compare/gliph2.nf:36
- The gliph2 R script is constructed with raw Nextflow parameters `params.local_min_pvalue`, `params.simulation_depth`, `params.kmer_min_depth`, and `params.local_min_OVE` inserted directly into the R code. If a caller can override these params, they can supply values like `0.001); system("malicious_command"); #` so that the generated R script executes arbitrary OS commands via `system()` or otherwise alters control flow. These parameters should be strictly validated as numeric/structured values, or safely quoted and parsed within R, so that user-controlled input cannot change the R code structure.
```r
result <- turboGliph::gliph2(
  cdr3_sequences = df,
  result_folder = "./",
  lcminp = ${params.local_min_pvalue},
  sim_depth = ${params.simulation_depth},
  kmer_mindepth = ${params.kmer_min_depth},
  lcminove = ${params.local_min_OVE},
```
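One way to address the injection concern above is to reject any parameter value that is not a plain numeric literal before it is spliced into generated script text. A minimal sketch of such a guard (the function name and regex are illustrative, not part of the pipeline):

```python
import re

# Accept only plain numeric literals (optionally signed, with an
# optional decimal part and exponent) before interpolating a value
# into generated code.
NUMERIC = re.compile(r"^-?\d+(\.\d+)?([eE]-?\d+)?$")

def require_numeric(name: str, value) -> float:
    """Raise if `value` is not a bare number; otherwise return it as float."""
    if not NUMERIC.match(str(value)):
        raise ValueError(f"parameter {name} must be numeric, got {value!r}")
    return float(value)

print(require_numeric("local_min_pvalue", "0.001"))  # → 0.001
# An injection attempt fails validation instead of reaching the script:
# require_numeric("local_min_pvalue", '0.001); system("x"); #')  # raises ValueError
```

Validating at the boundary means the generated R code only ever contains values that parse as numbers, so the code structure cannot be altered by caller input.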
notebooks/sample_stats_template.qmd
```python
# 4. Loading data
## reading sample metadata
meta = pd.read_csv(sample_table, sep=',')
meta_cols = meta.columns.tolist()[0:-1]
```
The `meta_cols` calculation uses `meta.columns.tolist()[0:-1]`, which excludes the last column. This assumes the last column is always the 'file' column that should be excluded. However, this assumption is fragile: if the column order in the sample_table CSV changes, this will break. Consider explicitly selecting metadata columns or excluding the 'file' column by name instead.
```diff
- meta_cols = meta.columns.tolist()[0:-1]
+ meta_cols = [c for c in meta.columns if c != 'file']
```
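A quick check shows why the name-based filter suggested above is robust: it returns the same metadata columns regardless of where 'file' sits in the samplesheet. The toy column sets here are hypothetical:

```python
import pandas as pd

# Toy samplesheets with the 'file' column in different positions.
a = pd.DataFrame(columns=["sample", "timepoint", "file"])
b = pd.DataFrame(columns=["file", "sample", "timepoint"])

results = []
for meta in (a, b):
    # Positional slicing ([0:-1]) breaks when 'file' moves;
    # filtering by name does not.
    meta_cols = [c for c in meta.columns if c != "file"]
    results.append(sorted(meta_cols))

print(results)  # → [['sample', 'timepoint'], ['sample', 'timepoint']]
```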
- add logic to handle a single grouping metadata column versus multiple columns for the V gene chart