| 1 | +# AGENTS.md - Developer Guide for scraps |
| 2 | + |
| 3 | +This guide provides conventions and commands for AI coding agents working in the scraps repository. |
| 4 | + |
| 5 | +**Project**: scraps - Single Cell RNA PolyA Site Discovery |
| 6 | +**Type**: Snakemake bioinformatics pipeline for analyzing mRNA polyadenylation sites from single-cell RNA-seq data |
| 7 | +**Primary Languages**: Python 3, R, Snakemake, Shell (zsh) |
| 8 | + |
| 9 | +--- |
| 10 | + |
| 11 | +## Quick Start Commands |
| 12 | + |
| 13 | +### Running the Pipeline |
| 14 | + |
| 15 | +```bash |
| 16 | +# Dry-run to validate pipeline (recommended before any changes) |
| 17 | +snakemake -npr --configfile config.yaml |
| 18 | + |
| 19 | +# Run pipeline with test data |
| 20 | +snakemake --snakefile Snakefile \ |
| 21 | +    --configfile config.yaml \ |
| 22 | +    --resources total_impact=5 \ |
| 23 | +    --keep-going |
| 24 | + |
| 25 | +# Run with specific number of cores |
| 26 | +snakemake -j 8 --configfile config.yaml |
| 27 | + |
| 28 | +# Generate DAG visualization |
| 29 | +snakemake --dag | dot -Tpdf > dag.pdf |
| 30 | +``` |
| 31 | + |
| 32 | +### Testing Changes |
| 33 | + |
| 34 | +```bash |
| 35 | +# Always dry-run first to validate Snakemake syntax |
| 36 | +snakemake -npr --configfile config.yaml |
| 37 | + |
| 38 | +# Dry-run a single target file (plans only the rules needed to build it) |
| 39 | +snakemake -npr --configfile config.yaml results/counts/chromiumv2_test_R2_counts.tsv.gz |
| 40 | + |
| 41 | +# List all rules |
| 42 | +snakemake --list |
| 43 | + |
| 44 | +# Show the reason each rule would run (the -r in -npr is already short for --reason) |
| 45 | +snakemake -np --reason --configfile config.yaml |
| 46 | +``` |
| 47 | + |
| 48 | +### Environment Setup |
| 49 | + |
| 50 | +```bash |
| 51 | +# Create conda environment |
| 52 | +conda env create -f scraps_conda.yml |
| 53 | + |
| 54 | +# Activate environment |
| 55 | +conda activate scraps_conda |
| 56 | + |
| 57 | +# Update environment after changes |
| 58 | +conda env update -n scraps_conda -f scraps_conda.yml |
| 59 | +``` |
| 60 | + |
| 61 | +--- |
| 62 | + |
| 63 | +## Project Structure |
| 64 | + |
| 65 | +``` |
| 66 | +scraps/ |
| 67 | +├── Snakefile # Main workflow entry point |
| 68 | +├── config.yaml # Sample and pipeline configuration |
| 69 | +├── chemistry.yaml # Platform-specific chemistry configs |
| 70 | +├── scraps_conda.yml # Conda environment specification |
| 71 | +├── rules/ # Snakemake rule modules |
| 72 | +│ ├── cutadapt_star.snake # Read trimming and alignment |
| 73 | +│ ├── count.snake # Feature counting and quantification |
| 74 | +│ ├── qc.snake # Quality control reports |
| 75 | +│ └── check_versions.snake # Dependency version checks |
| 76 | +├── inst/scripts/ # Helper scripts |
| 77 | +│ ├── *.py # Python utilities (BAM filtering, etc.) |
| 78 | +│ └── R/ # R analysis functions |
| 79 | +├── ref/ # Reference files (polyA_DB, etc.) |
| 80 | +├── sample_data/ # Test data location |
| 81 | +└── results/ # Pipeline outputs (generated) |
| 82 | +``` |
| 83 | + |
| 84 | +--- |
| 85 | + |
| 86 | +## Code Style Guidelines |
| 87 | + |
| 88 | +### Python Scripts |
| 89 | + |
| 90 | +**Imports**: Standard library → Third party → Local, grouped and sorted |
| 91 | +```python |
| 92 | +import argparse |
| 93 | +import os |
| 94 | +import re |
| 95 | + |
| 96 | +import numpy as np |
| 97 | +import pandas as pd |
| 98 | +import pysam |
| 99 | +``` |
| 100 | + |
| 101 | +**Docstrings**: Triple-quoted strings describing script/function purpose |
| 102 | +```python |
| 103 | +""" Filter BAM files to only reads with soft-clipped A tail, |
| 104 | +suitable for cellranger and starsolo output |
| 105 | +""" |
| 106 | +``` |
| 107 | + |
| 108 | +**Command-line arguments**: Use `argparse` with descriptive help text |
| 109 | +```python |
| 110 | +parser.add_argument('-i', '--inbam', |
| 110 | +                    help="Bam file to correct", |
| 111 | +                    required=True) |
| 113 | +``` |
| 114 | + |
| 115 | +**Naming conventions**: |
| 116 | +- Functions: `snake_case` (e.g., `filter_bam_by_A`, `correct_bam_read1`) |
| 117 | +- Variables: `snake_case` (e.g., `target_len`, `filter_cut`, `single_end`) |
| 118 | +- Constants: `UPPER_CASE` if truly constant |
| 119 | + |
| 120 | +**File handling**: Use context managers for file operations |
| 121 | +```python |
| 122 | +with open(file_in) as file, gzip.open(file_out, 'wt') as file2:  # needs: import gzip |
| 123 | +    file2.writelines(file)  # placeholder: copy lines through unchanged |
| 124 | +``` |
| 125 | + |
| 126 | +### R Scripts |
| 127 | + |
| 128 | +**Documentation**: Roxygen2-style comments for functions |
| 129 | +```r |
| 130 | +#' Read scraps output from umi_tools to sparseMatrix |
| 131 | +#' |
| 132 | +#' @param file scraps output table |
| 133 | +#' @param n_min minimum number of observations |
| 134 | +#' @return count matrix |
| 135 | +#' @export |
| 136 | +``` |
| 137 | + |
| 138 | +**Style**: Follow tidyverse conventions |
| 139 | +- Use `%>%` pipe operator |
| 140 | +- Prefer `dplyr`, `readr`, `stringr`, `tidyr` functions |
| 141 | +- Function names: `snake_case` |
| 142 | + |
| 143 | +**Dependencies**: Import packages explicitly |
| 144 | +```r |
| 145 | +#' @import readr dplyr stringr tidyr |
| 146 | +``` |
| 147 | + |
| 148 | +### Snakemake Rules |
| 149 | + |
| 150 | +**Shell executable**: Pipeline uses `zsh` (defined in Snakefile line 1) |
| 151 | +```python |
| 152 | +shell.executable("zsh") |
| 153 | +``` |
| 154 | + |
| 155 | +**Rule structure**: Include all standard sections |
| 156 | +```python |
| 157 | +rule rulename: |
| 158 | +    input: |
| 159 | +        "{results}/path/to/{sample}_input.bam" |
| 160 | +    output: |
| 161 | +        temp("{results}/path/to/{sample}_output.bam")  # Use temp() for intermediate files |
| 162 | +    params: |
| 163 | +        job_name = "rulename", |
| 164 | +        # Additional parameters |
| 165 | +    log: |
| 166 | +        "{results}/logs/{sample}_rulename.txt"  # must use the same wildcards as output |
| 167 | +    threads: |
| 168 | +        12 |
| 169 | +    resources: |
| 170 | +        mem_mb = 8000 |
| 171 | +    shell: |
| 172 | +        r""" |
| 173 | +        command --arg {input} > {output} 2> {log} |
| 174 | +        """ |
| 175 | +``` |
| 176 | + |
| 177 | +**Key conventions**: |
| 178 | +- Use raw strings `r"""..."""` for shell blocks |
| 179 | +- Redirect stderr to log files: `2> {log}` |
| 180 | +- Mark intermediate files with `temp()` |
| 181 | +- Use wildcards in paths: `{sample}`, `{results}`, `{read}` |
| 182 | +- Resource specifications: `threads`, `mem_mb` |
| 183 | +- Use `expand()` for generating multiple outputs |
| 184 | + |
| 185 | +**Accessing config**: Use helper functions like `_get_config(sample, item)` |
| 186 | +```python |
| 187 | +def _get_config(sample, item): |
| 188 | +    """Hierarchical lookup: sample -> chemistry[platform] -> chemistry -> defaults.""" |
| 189 | +``` |
| 190 | + |
| 191 | +--- |
| 192 | + |
| 193 | +## Configuration Files |
| 194 | + |
| 195 | +### config.yaml |
| 196 | +- `DATA`: Directory containing input FASTQs |
| 197 | +- `RESULTS`: Output directory path |
| 198 | +- `STAR_INDEX`: Path to STAR genome index |
| 199 | +- `POLYA_SITES`: PolyA database reference file (SAF format) |
| 200 | +- `DEFAULTS`: Default chemistry and platform settings |
| 201 | +- `SAMPLES`: Per-sample configuration (basename, chemistry, alignments) |
| 202 | + |
| 203 | +### chemistry.yaml |
| 204 | +Platform-specific configurations organized hierarchically: |
| 205 | +```yaml |
| 206 | +chemistry_name: |
| 207 | +  bc_whitelist: path/to/whitelist |
| 208 | +  platform_name: |
| 209 | +    cutadapt_R1: "trimming parameters" |
| 210 | +    STAR_R1: "alignment parameters" |
| 211 | +    STAR_R2: "alignment parameters" |
| 212 | +``` |
| 213 | + |
| 214 | +--- |
| 215 | + |
| 216 | +## Common Development Tasks |
| 217 | + |
| 218 | +### Adding a New Rule |
| 219 | + |
| 220 | +1. Create rule in appropriate file under `rules/` |
| 221 | +2. Follow naming convention: `verb_target` (e.g., `assign_sites_R1`) |
| 222 | +3. Add to workflow by including outputs in `SAMPLE_OUTS` (Snakefile) |
| 223 | +4. Test with dry-run: `snakemake -npr` |
| 224 | + |
| 225 | +### Modifying Chemistry Configuration |
| 226 | + |
| 227 | +1. Edit `chemistry.yaml` |
| 228 | +2. Ensure all required fields present: `cutadapt_*`, `STAR_*` |
| 229 | +3. Optional fields: `bc_whitelist`, `bc_cut`, `bc_length1` |
| 230 | +4. Test with dry-run to validate YAML syntax |
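
Step 2 above can also be checked programmatically before the dry-run. The sketch below is only an illustration: it assumes that every nested mapping inside a chemistry entry is a platform block and that each block needs at least one `cutadapt_*` and one `STAR_*` key. The names `REQUIRED_PREFIXES` and `missing_required_fields` are hypothetical and not part of scraps.

```python
# Prefixes listed as required in this guide.
REQUIRED_PREFIXES = ("cutadapt_", "STAR_")


def missing_required_fields(chemistry_entry):
    """Map each platform name to the required prefixes it lacks.

    `chemistry_entry` is the dict loaded for one chemistry in
    chemistry.yaml. Platforms with nothing missing are omitted.
    """
    problems = {}
    for platform, settings in chemistry_entry.items():
        if not isinstance(settings, dict):
            continue  # chemistry-level scalars such as bc_whitelist
        missing = [
            prefix
            for prefix in REQUIRED_PREFIXES
            if not any(key.startswith(prefix) for key in settings)
        ]
        if missing:
            problems[platform] = missing
    return problems
```

An empty result means every platform block has both kinds of field; anything else names the platform and the prefix to add.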
| 231 | + |
| 232 | +### Adding Python Helper Script |
| 233 | + |
| 234 | +1. Place in `inst/scripts/` |
| 235 | +2. Use argparse for CLI interface |
| 236 | +3. Include docstring explaining purpose |
| 237 | +4. Make executable: `chmod +x script.py` |
| 238 | +5. Call from Snakemake rule with `python3 inst/scripts/script.py` |
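
A skeleton that follows steps 2 and 3 might look like the sketch below. The task (counting non-empty lines) and the names `build_parser`, `count_lines`, `--infile`, and `--n_min` are placeholders chosen for illustration, not an existing scraps script.

```python
#!/usr/bin/env python3
""" Count non-empty lines in a text file (illustrative helper only) """
import argparse


def build_parser():
    """Create the command-line interface for this helper."""
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('-i', '--infile',
                        help="Text file to inspect",
                        required=True)
    parser.add_argument('-n', '--n_min',
                        help="Minimum number of non-empty lines expected",
                        type=int,
                        default=1)
    return parser


def count_lines(path):
    """Return the number of non-empty lines in `path`."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def main(argv=None):
    """Parse arguments, print the count, return a shell-style exit code."""
    args = build_parser().parse_args(argv)
    n_lines = count_lines(args.infile)
    print(n_lines)
    return 0 if n_lines >= args.n_min else 1
```

In a real script, the file would end with the usual `if __name__ == "__main__":` guard calling `sys.exit(main())`, so that a non-zero return value makes the calling Snakemake rule fail.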
| 239 | + |
| 240 | +--- |
| 241 | + |
| 242 | +## Error Handling and Debugging |
| 243 | + |
| 244 | +**Log files**: All rules write logs to `{results}/logs/` |
| 245 | +- Check logs for detailed error messages |
| 246 | +- Logs include stderr from all commands |
| 247 | + |
| 248 | +**Common issues**: |
| 249 | +- Missing conda dependencies → check `scraps_conda.yml` |
| 250 | +- YAML syntax errors → validate with `snakemake -npr` |
| 251 | +- Missing input files → check `DATA` path in config.yaml |
| 252 | +- Resource exhaustion → adjust `mem_mb` or `threads` in rules |
| 253 | + |
| 254 | +**Debugging Snakemake**: |
| 255 | +```bash |
| 256 | +# Show detailed execution plan |
| 257 | +snakemake -npr --verbose |
| 258 | + |
| 259 | +# Print shell commands without executing them |
| 260 | +snakemake -n --printshellcmds |
| 261 | + |
| 262 | +# Force re-run specific rule |
| 263 | +snakemake --forcerun rulename |
| 264 | +``` |
| 265 | + |
| 266 | +--- |
| 267 | + |
| 268 | +## Dependencies and Tools |
| 269 | + |
| 270 | +**Core requirements** (installed via conda): |
| 271 | +- Python >= 3.7 |
| 272 | +- Snakemake >= 5.3.0, < 8 |
| 273 | +- STAR >= 2.7.9a (RNA-seq aligner) |
| 274 | +- UMI-tools >= 1.1.2 (UMI handling) |
| 275 | +- cutadapt >= 3.4 (adapter trimming) |
| 276 | +- samtools >= 1.15 (BAM manipulation) |
| 277 | +- bedtools >= 2.30.0 (genomic intervals) |
| 278 | +- subread >= 2.0.1 (featureCounts) |
| 279 | +- MultiQC >= 1.6 (report generation) |
| 280 | +- pysam >= 0.16.0 (Python BAM interface) |
| 281 | + |
| 282 | +**Version checking**: Run `snakemake --configfile config.yaml` to trigger version checks |
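
For checking a tool by hand against the minimums listed above, a comparison along these lines may help. It is a rough sketch and not how `rules/check_versions.snake` works: `version_key` and `meets_minimum` are hypothetical names, and the assumption is that only the numeric parts of a version matter, so a letter suffix such as the `a` in `2.7.9a` is ignored.

```python
import re


def version_key(version):
    """Turn '2.7.9a' into (2, 7, 9); non-numeric characters are ignored."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def _pad(left, right):
    """Right-pad both tuples with zeros so '1.15' compares equal to '1.15.0'."""
    width = max(len(left), len(right))
    return (left + (0,) * (width - len(left)),
            right + (0,) * (width - len(right)))


def meets_minimum(found, minimum, below=None):
    """True if minimum <= found, and found < below when `below` is given."""
    have, need = _pad(version_key(found), version_key(minimum))
    if have < need:
        return False
    if below is not None:
        have, limit = _pad(version_key(found), version_key(below))
        if have >= limit:
            return False
    return True
```

For example, `meets_minimum("7.32.4", "5.3.0", below="8")` expresses the Snakemake constraint `>= 5.3.0, < 8`.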
| 283 | + |
| 284 | +--- |
| 285 | + |
| 286 | +## Notes for AI Agents |
| 287 | + |
| 288 | +- **Always dry-run first**: Use `snakemake -npr` before any pipeline changes |
| 289 | +- **Respect shell choice**: Pipeline explicitly uses `zsh`, not bash |
| 290 | +- **Leave temp files alone**: Do not delete intermediate files by hand; Snakemake removes `temp()` outputs once the rules that consume them have finished |
| 291 | +- **Follow hierarchical config**: Sample → Chemistry/Platform → Defaults |
| 292 | +- **Log everything**: Redirect stderr to log files for debugging |
| 293 | +- **Resource awareness**: Bioinformatics tools are memory/CPU intensive |
| 294 | +- **No traditional tests**: Validation is via successful Snakemake dry-run |