docs: update CHANGELOG, rules, and output docs for v0.1.1

jayhesselberth · claude · jayhesselberth · commit 7d651c9a9c5f · 2026-02-11T05:18:22.000-07:00
Add v0.1.1 CHANGELOG entry covering 34 commits since v0.1.0 including
Quarto QC report, per-tRNA odds ratios, reference similarity QC, and
classify_charging CPU migration. Update rules-reference, overview,
outputs, scripts-reference, and README to document new rules, updated
commands, and new scripts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,33 @@
 
 All notable changes to the aa-tRNA-seq pipeline are documented in this file.
 
+## [v0.1.1] - 2026-02-11
+
+### Added
+- Quarto QC report with per-sample tabs (#81)
+- Per-tRNA pairwise modification odds ratios (#85)
+- Reference sequence similarity QC (#84)
+- Squiggy session JSON export for Positron IDE
+- Utility to collapse redundant GtRNAdb FASTA sequences
+- Multiple 3' adapter support for PT tag detection
+- Skip mode for reference validation
+- Pre-download dorado mod base models rule (avoids race conditions)
+- nvitop GPU monitoring dependency
+
+### Changed
+- `classify_charging` switched from GPU to CPU with parallel workers (8 threads)
+- WarpDemuX workflow simplified: removed `merge_pods_for_demux`; raw POD5 directories are now passed directly
+- `bwa_align` filtering changed from `-F 4` to `-F 20` (also excludes reverse-strand reads)
+- Removed redundant awk position filter from `bwa_align`
+- Removed `protected()` directive from `rebasecall` output
+
+### Fixed
+- Race condition when parallel GPU jobs download dorado modification models simultaneously
+- Reverse-strand reads not filtered at alignment step
+- Redundant awk position filter in `bwa_align` superseded by adapter-based filtering
+- Graceful fallback for `get_pipeline_commit` when git unavailable
+- Various snakefmt formatting and test corrections
+
 ## [v0.1.0] - 2025-01-16
 
 ### Added
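Note on the `bwa_align` entry above: the change from `-F 4` to `-F 20` can be read as adding one bit to an exclusion mask. The sketch below assumes the `-F` value is a SAM flag exclusion mask of the kind `samtools view -F` accepts; the diff does not show the actual command, and `excluded_by` is a hypothetical helper used only for illustration.

```python
# SAM flag bits as defined in the SAM specification.
UNMAPPED = 0x4         # read is unmapped
REVERSE_STRAND = 0x10  # read aligned to the reverse strand


def excluded_by(flag: int, exclude_mask: int) -> bool:
    """Return True if a read with this flag would be dropped by `-F exclude_mask`.

    Hypothetical helper: a read is excluded when any bit of the mask is set.
    """
    return (flag & exclude_mask) != 0


# 20 is simply the two bits combined: 4 (unmapped) + 16 (reverse strand).
NEW_MASK = UNMAPPED | REVERSE_STRAND
OLD_MASK = UNMAPPED

# A mapped reverse-strand read (flag 16) passed the old filter but not the new one.
reverse_read_kept_before = not excluded_by(16, OLD_MASK)
reverse_read_dropped_now = excluded_by(16, NEW_MASK)
```

Under that assumption, a mapped forward-strand read (flag 0) passes both masks, which matches the changelog's description that the new value "also excludes reverse-strand reads".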
diff --git a/README.md b/README.md
@@ -123,6 +123,10 @@ flowchart TD
         G --> J[bcerror<br/>basecalling errors]
         G --> K[align_stats]
         G --> L[modkit pileups]
+        L -.-> M[odds_ratios<br/>pairwise mod ORs]
+        H -.-> M
+        K -.-> N[qc_report<br/>Quarto HTML]
+        H -.-> N
     end
 
     POD5 -.-> W
@@ -141,7 +145,7 @@ Given a directory of POD5 files, this pipeline:
 
 The classification generates ML tag values (0-255) indicating the likelihood of aminoacylation. By default, ML values of 200-255 are treated as charged, and values <200 as uncharged. This threshold can be adjusted via the `ml-threshold` parameter in the `get_cca_trna_cpm` rule.
 
-The final steps of the pipeline calculate a number of outputs that may be useful for analysis and visualization, including normalized counts for charged and uncharged tRNA (`get_cca_trna_cpm`), basecalling error values (`bcerror`), alignment statistics (`align_stats`) and information on raw nanopore signal from Remora (`remora_signal_stats`).
+The final steps of the pipeline calculate a number of outputs that may be useful for analysis and visualization, including normalized counts for charged and uncharged tRNA (`get_cca_trna_cpm`), basecalling error values (`bcerror`), alignment statistics (`align_stats`), information on raw nanopore signal from Remora (`remora_signal_stats`), per-tRNA pairwise modification odds ratios (`compute_odds_ratios`), reference sequence similarity QC (`compute_reference_similarity`), and a combined Quarto QC report (`render_combined_qc_report`).
 
 ### Remora classification
 
diff --git a/docs/user-guide/outputs.md b/docs/user-guide/outputs.md
@@ -11,7 +11,9 @@ This guide documents all output files produced by the pipeline.
 ├── fq/                      # Extracted FASTQ files
 ├── summary/                 # Analysis outputs
 │   ├── tables/             # Tabular summaries
-│   └── modkit/             # Modification calling
+│   ├── modkit/             # Modification calling
+│   └── qc/                 # Reference QC metrics
+├── reports/                 # Rendered QC reports
 ├── demux/                   # Demultiplexing outputs (if enabled)
 ├── logs/                    # Rule execution logs
 └── squiggy-session.json     # Squiggy session file for Positron
@@ -37,11 +39,18 @@ flowchart TB
     subgraph Outputs
         H[summary/tables/<br/>Charging & Stats]
         I[summary/modkit/<br/>Modifications]
+        J[summary/qc/<br/>Reference similarity]
+        K[summary/tables/<br/>Odds ratios]
+        L[reports/<br/>QC report]
     end
 
     A --> B --> C --> D --> E --> F --> G
     G --> H
     G --> I
+    G --> J
+    I --> K
+    H --> K
+    H --> L
 ```
 
 ## Core Outputs
@@ -209,6 +218,59 @@ Individual modification calls per read.
 
 Comprehensive modification information including all modkit fields.
 
+## Reference Similarity Matrix
+
+`summary/qc/reference_similarity.tsv`
+
+Pairwise sequence similarity matrix for the reference FASTA, useful for identifying potential cross-mapping issues.
+
+!!! info "Separate invocation"
+    This rule is not part of the default pipeline outputs. Run it explicitly:
+    ```bash
+    pixi run snakemake compute_reference_similarity --configfile=config/config.yml
+    ```
+
+**Format:** Square TSV matrix with sequence names as row and column headers; values are percent identity (0-100).
+
+## Modification Odds Ratios
+
+`summary/tables/{sample}/{sample}.odds_ratios.tsv.gz`
+
+Per-tRNA pairwise modification odds ratios testing whether modification at one position is correlated with modification at another position (or with charging status).
+
+!!! info "Separate invocation"
+    This rule is not part of the default pipeline outputs. Run it explicitly:
+    ```bash
+    pixi run snakemake compute_odds_ratios --configfile=config/config.yml
+    ```
+
+| Column | Description |
+|--------|-------------|
+| `tRNA` | Reference tRNA name |
+| `pos1` | First position |
+| `pos2` | Second position (999 = charging) |
+| `n00`, `n01`, `n10`, `n11` | 2x2 contingency table counts |
+| `total_obs` | Total observations |
+| `odds_ratio` | Odds ratio |
+| `log_odds_ratio` | Log odds ratio |
+| `se_log_or` | Standard error of log OR |
+| `ci_lower`, `ci_upper` | 95% confidence interval |
+| `fisher_or` | Fisher's exact test OR |
+| `p_value` | Fisher's exact test p-value |
+| `p_adjusted` | BH-adjusted p-value |
+
+## QC Report
+
+`reports/qc_report.html`
+
+A combined Quarto HTML report with per-sample QC tabs, including alignment statistics, charging distributions, and basecalling error metrics.
+
+!!! info "Separate invocation"
+    This report requires the `report` pixi environment:
+    ```bash
+    pixi run -e report snakemake render_combined_qc_report --configfile=config/config.yml
+    ```
+
 ## Squiggy Session File
 
 `squiggy-session.json`
@@ -258,7 +320,7 @@ Merged POD5 file containing all raw signal data for the sample.
 
 `bam/rebasecall/{sample}/{sample}.rbc.bam`
 
-Dorado output with basecalls and move tables. Protected output (not deleted).
+Dorado output with basecalls and move tables.
 
 ### Aligned BAM
 
@@ -325,6 +387,9 @@ Approximate file sizes for a typical sample:
 | Charging CPM | 10-50 KB |
 | Charging Prob | 1-10 MB |
 | Modkit pileup | 1-5 MB |
+| Odds ratios | 100 KB-1 MB |
+| Reference similarity | 10-500 KB |
+| QC report (HTML) | 1-5 MB |
 
 ## Cleanup
 
diff --git a/docs/workflow/overview.md b/docs/workflow/overview.md
@@ -13,6 +13,8 @@ flowchart TB
         B[aatrnaseq-charging.smk<br/>Charging analysis]
         C[aatrnaseq-qc.smk<br/>Quality control]
         D[aatrnaseq-modifications.smk<br/>Modification calling]
+        OR[aatrnaseq-odds-ratios.smk<br/>Odds ratio analysis]
+        R[aatrnaseq-report.smk<br/>QC report]
         E[warpdemux.smk<br/>Demultiplexing<br/><i>conditional</i>]
     end
 
@@ -64,6 +66,14 @@ flowchart TB
         P[modkit_extract_full<br/>Full export]
     end
 
+    subgraph OddsRatios[aatrnaseq-odds-ratios.smk]
+        Q[compute_odds_ratios<br/>Pairwise OR]
+    end
+
+    subgraph Report[aatrnaseq-report.smk]
+        R[render_combined_qc_report<br/>QC report]
+    end
+
     A --> B --> C --> D --> E --> F --> G --> G2
 
     G2 --> H --> I
@@ -74,6 +84,12 @@ flowchart TB
     G2 --> N
     G2 --> O
     G2 --> P
+    O --> Q
+    H --> Q
+    J --> R
+    H --> R
+    I --> R
+    K --> R
 ```
 
 ### With Demultiplexing (WarpDemuX)
@@ -111,7 +127,7 @@ Core data processing from raw signal to classified reads:
 | `ubam_to_fastq` | Extract reads for alignment | No |
 | `bwa_idx` | Build BWA index | No |
 | `bwa_align` | Align reads to reference | No |
-| `classify_charging` | ML charging classification | Yes |
+| `classify_charging` | ML charging classification | No |
 | `transfer_bam_tags` | Rename ML→CL tags | No |
 | `add_adapter_tags` | Add PT tags for adapter positions | No |
 
@@ -130,6 +146,7 @@ Generate QC metrics and statistics:
 
 | Rule | Purpose |
 |------|---------|
+| `compute_reference_similarity` | Pairwise reference sequence similarity matrix |
 | `base_calling_error` | Per-position error frequencies |
 | `align_stats` | Read counts through pipeline |
 | `remora_signal_stats` | Raw signal metrics |
@@ -145,6 +162,22 @@ RNA modification calling with Modkit:
 | `modkit_extract_calls` | Per-read modification calls |
 | `modkit_extract_full` | Comprehensive modification export |
 
+### Odds Ratio Rules
+
+Per-tRNA pairwise modification odds ratios:
+
+| Rule | Purpose |
+|------|---------|
+| `compute_odds_ratios` | Pairwise modification odds ratios per tRNA |
+
+### Report Rules
+
+QC report generation:
+
+| Rule | Purpose |
+|------|---------|
+| `render_combined_qc_report` | Combined Quarto QC report with per-sample tabs |
+
 ### Demultiplexing Rules
 
 Optional WarpDemuX barcode demultiplexing:
@@ -181,8 +214,7 @@ Dorado re-basecalls with:
 BWA MEM with RNA-optimized parameters:
 
 - `-x ont2d` preset for ONT reads
-- Position filtering (read start ≤ 25)
-- Unmapped read removal
+- `-F 20`: removes unmapped (flag 4) and reverse-strand (flag 16) reads
 
 ### 4. Charging Classification
 
@@ -224,14 +256,14 @@ These rules require GPU access:
 | Rule | Typical Runtime | Memory |
 |------|-----------------|--------|
 | `rebasecall` | 30-60 min/sample | 24 GB |
-| `classify_charging` | 10-30 min/sample | 24 GB |
 
 ### CPU-Intensive Rules
 
 | Rule | Threads | Memory |
 |------|---------|--------|
 | `merge_pods` | 12 | 16 GB |
 | `bwa_align` | 12 | 24 GB |
+| `classify_charging` | 8 | 24 GB |
 | `modkit_extract_full` | 12 | 48 GB |
 
 ### Memory-Intensive Rules
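Note on `classify_charging`, which this hunk moves to the CPU table: the README context earlier in this diff states that ML values of 200-255 are treated as charged and values below 200 as uncharged, adjustable through `ml-threshold`. A minimal sketch of that rule follows; the function name `is_charged` and the keyword `ml_threshold` are hypothetical, and the pipeline's own implementation is not shown in this diff.

```python
def is_charged(ml_value: int, ml_threshold: int = 200) -> bool:
    """Classify one read from its ML tag value (0-255).

    Hypothetical helper mirroring the README wording: values at or above
    the threshold count as charged, values below it as uncharged.
    """
    if not 0 <= ml_value <= 255:
        raise ValueError(f"ML value out of range 0-255: {ml_value}")
    return ml_value >= ml_threshold


# Boundary behaviour at the default threshold of 200.
labels = {value: is_charged(value) for value in (0, 199, 200, 255)}
```

Raising the threshold makes the charged call more conservative; for example, `is_charged(210, ml_threshold=230)` is `False`.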
diff --git a/docs/workflow/rules-reference.md b/docs/workflow/rules-reference.md
diff --git a/docs/workflow/scripts-reference.md b/docs/workflow/scripts-reference.md