
Commit 2e9ee7a

Merge pull request #82 from rnabioco/config_update_zsh
Merge new zsh-dependent config revamp into master
2 parents 73b77d9 + 8107074 commit 2e9ee7a

File tree

15 files changed: +799 −356 lines changed

.github/workflows/snakemake_run.yml

Lines changed: 11 additions & 4 deletions

```diff
@@ -17,10 +17,17 @@ jobs:
           mamba-version: "*"
           channels: bioconda,conda-forge,defaults
           channel-priority: true
-          environment-file: scraps_conda.yml
+          auto-activate-base: false
+          use-mamba: true

-      - shell: bash -el {0}
+      - name: Create conda environment
+        shell: bash -el {0}
         run: |
-          conda activate scraps_conda
-          snakemake -npr --configfile config.yaml
+          mamba env create -f scraps_conda.yml

+      - name: Run Snakemake
+        shell: bash -el {0}
+        run: |
+          conda activate scraps_conda
+          conda info
+          snakemake -npr --configfile config.yaml
```

AGENTS.md

Lines changed: 294 additions & 0 deletions
# AGENTS.md - Developer Guide for scraps

This guide provides conventions and commands for AI coding agents working in the scraps repository.

**Project**: scraps - Single Cell RNA PolyA Site Discovery
**Type**: Snakemake bioinformatics pipeline for analyzing mRNA polyadenylation sites from single-cell RNA-seq data
**Primary Languages**: Python 3, R, Snakemake, Shell (zsh)

---

## Quick Start Commands

### Running the Pipeline
```bash
# Dry-run to validate pipeline (recommended before any changes)
snakemake -npr --configfile config.yaml

# Run pipeline with test data
snakemake --snakefile Snakefile \
    --configfile config.yaml \
    --resources total_impact=5 \
    --keep-going

# Run with specific number of cores
snakemake -j 8 --configfile config.yaml

# Generate DAG visualization
snakemake --dag | dot -Tpdf > dag.pdf
```
### Testing Changes

```bash
# Always dry-run first to validate Snakemake syntax
snakemake -npr --configfile config.yaml

# Test a specific target file
snakemake -npr --configfile config.yaml results/counts/chromiumv2_test_R2_counts.tsv.gz

# List all rules
snakemake --list

# Show reason for rule execution
snakemake -npr --reason --configfile config.yaml
```
### Environment Setup

```bash
# Create conda environment
conda env create -f scraps_conda.yml

# Activate environment
conda activate scraps_conda

# Update environment after changes
conda env update -n scraps_conda -f scraps_conda.yml
```

---
## Project Structure

```
scraps/
├── Snakefile                 # Main workflow entry point
├── config.yaml               # Sample and pipeline configuration
├── chemistry.yaml            # Platform-specific chemistry configs
├── scraps_conda.yml          # Conda environment specification
├── rules/                    # Snakemake rule modules
│   ├── cutadapt_star.snake   # Read trimming and alignment
│   ├── count.snake           # Feature counting and quantification
│   ├── qc.snake              # Quality control reports
│   └── check_versions.snake  # Dependency version checks
├── inst/scripts/             # Helper scripts
│   ├── *.py                  # Python utilities (BAM filtering, etc.)
│   └── R/                    # R analysis functions
├── ref/                      # Reference files (polyA_DB, etc.)
├── sample_data/              # Test data location
└── results/                  # Pipeline outputs (generated)
```

---
## Code Style Guidelines

### Python Scripts

**Imports**: Standard library → Third party → Local, grouped and sorted
```python
import os
import re
import argparse

import pysam
import pandas as pd
import numpy as np
```

**Docstrings**: Triple-quoted strings describing script/function purpose
```python
""" Filter BAM files to only reads with soft-clipped A tail,
suitable for cellranger and starsolo output
"""
```

**Command-line arguments**: Use `argparse` with descriptive help text
```python
parser.add_argument('-i', '--inbam',
                    help="Bam file to correct",
                    required=True)
```

**Naming conventions**:
- Functions: `snake_case` (e.g., `filter_bam_by_A`, `correct_bam_read1`)
- Variables: `snake_case` (e.g., `target_len`, `filter_cut`, `single_end`)
- Constants: `UPPER_CASE` if truly constant

**File handling**: Use context managers for file operations
```python
with open(file_in) as file, gzip.open(file_out, 'wt') as file2:
    # process files
```
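Filled in, the pattern might look like this (the function name and the line-length filter are hypothetical, for illustration only):

```python
import gzip

def filter_lines(file_in, file_out, min_len=10):
    """Copy lines with at least min_len characters from a plain-text
    file into a gzip-compressed output (hypothetical filter step)."""
    with open(file_in) as file, gzip.open(file_out, 'wt') as file2:
        for line in file:
            if len(line.rstrip('\n')) >= min_len:
                file2.write(line)
```

Both handles are closed automatically when the `with` block exits, even if the filter raises.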
### R Scripts

**Documentation**: Roxygen2-style comments for functions
```r
#' Read scraps output from umi_tools to sparseMatrix
#'
#' @param file scraps output table
#' @param n_min minimum number of observations
#' @return count matrix
#' @export
```

**Style**: Follow tidyverse conventions
- Use the `%>%` pipe operator
- Prefer `dplyr`, `readr`, `stringr`, `tidyr` functions
- Function names: `snake_case`

**Dependencies**: Import packages explicitly
```r
#' @import readr dplyr stringr tidyr
```
### Snakemake Rules

**Shell executable**: Pipeline uses `zsh` (defined in Snakefile line 1)
```python
shell.executable("zsh")
```

**Rule structure**: Include all standard sections
```python
rule rulename:
    input:
        "path/to/input.bam"
    output:
        temp("path/to/output.bam")  # Use temp() for intermediate files
    params:
        job_name = "rulename",
        # Additional parameters
    log:
        "{results}/logs/{sample}_rulename.txt"
    threads:
        12
    resources:
        mem_mb = 8000
    shell:
        r"""
        command --arg {input} > {output} 2> {log}
        """
```

**Key conventions**:
- Use raw strings `r"""..."""` for shell blocks
- Redirect stderr to log files: `2> {log}`
- Mark intermediate files with `temp()`
- Use wildcards in paths: `{sample}`, `{results}`, `{read}`
- Resource specifications: `threads`, `mem_mb`
- Use `expand()` for generating multiple outputs
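`expand()` fills a path template with every combination of the supplied wildcard values. Conceptually it behaves like the sketch below (the sample names are made up for illustration; in a rule you would call Snakemake's own `expand()`):

```python
from itertools import product

def expand_like(template, **wildcards):
    """Sketch of what Snakemake's expand() produces: the template
    formatted with every combination of the wildcard values."""
    keys = list(wildcards)
    return [template.format(**dict(zip(keys, combo)))
            for combo in product(*(wildcards[k] for k in keys))]

paths = expand_like("{results}/counts/{sample}_counts.tsv.gz",
                    results=["results"],
                    sample=["sampleA", "sampleB"])
# -> ['results/counts/sampleA_counts.tsv.gz',
#     'results/counts/sampleB_counts.tsv.gz']
```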
**Accessing config**: Use helper functions like `_get_config(sample, item)`
```python
def _get_config(sample, item):
    # Hierarchical lookup: sample -> chemistry[platform] -> chemistry -> defaults
```
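A minimal sketch of how such a hierarchical lookup could work (the function name, dict shapes, and sample values below are assumptions for illustration, not the pipeline's actual implementation):

```python
def get_config_value(samples, chemistry, defaults, sample, item):
    """Resolve `item` for `sample`: sample-level settings win, then the
    platform block inside the chemistry entry, then the chemistry entry
    itself, then pipeline defaults (all dict shapes are assumed)."""
    sample_cfg = samples[sample]
    chem = chemistry.get(sample_cfg.get("chemistry", ""), {})
    platform = sample_cfg.get("platform", defaults.get("platform", ""))
    for level in (sample_cfg, chem.get(platform, {}), chem, defaults):
        if isinstance(level, dict) and item in level:
            return level[item]
    return None

# Hypothetical configs: a platform-level value overrides the default
samples = {"s1": {"chemistry": "chromiumv3", "platform": "illumina"}}
chemistry = {"chromiumv3": {"bc_whitelist": "ref/whitelist.txt",
                            "illumina": {"STAR_R2": "platform params"}}}
defaults = {"platform": "illumina", "STAR_R2": "default params"}
get_config_value(samples, chemistry, defaults, "s1", "STAR_R2")
# -> 'platform params'
```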
---

## Configuration Files

### config.yaml
- `DATA`: Directory containing input FASTQs
- `RESULTS`: Output directory path
- `STAR_INDEX`: Path to STAR genome index
- `POLYA_SITES`: PolyA database reference file (SAF format)
- `DEFAULTS`: Default chemistry and platform settings
- `SAMPLES`: Per-sample configuration (basename, chemistry, alignments)
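A hypothetical fragment showing these keys together (the paths, chemistry name, and sample entry are invented for illustration, not taken from the repository):

```yaml
DATA: "sample_data"
RESULTS: "results"
STAR_INDEX: "ref/star_index"
POLYA_SITES: "ref/polyA_DB.saf"
DEFAULTS:
  chemistry: "chromiumv3"
SAMPLES:
  sampleA:
    basename: "sampleA_S1_L001"
    chemistry: "chromiumv3"
```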
### chemistry.yaml
Platform-specific configurations organized hierarchically:
```yaml
chemistry_name:
  bc_whitelist: path/to/whitelist
  platform_name:
    cutadapt_R1: "trimming parameters"
    STAR_R1: "alignment parameters"
    STAR_R2: "alignment parameters"
```

---
## Common Development Tasks

### Adding a New Rule

1. Create rule in appropriate file under `rules/`
2. Follow naming convention: `verb_target` (e.g., `assign_sites_R1`)
3. Add to workflow by including outputs in `SAMPLE_OUTS` (Snakefile)
4. Test with dry-run: `snakemake -npr`

### Modifying Chemistry Configuration

1. Edit `chemistry.yaml`
2. Ensure all required fields are present: `cutadapt_*`, `STAR_*`
3. Optional fields: `bc_whitelist`, `bc_cut`, `bc_length1`
4. Test with dry-run to validate YAML syntax

### Adding Python Helper Script

1. Place in `inst/scripts/`
2. Use argparse for CLI interface
3. Include docstring explaining purpose
4. Make executable: `chmod +x script.py`
5. Call from Snakemake rule with `python3 inst/scripts/script.py`
---

## Error Handling and Debugging

**Log files**: All rules write logs to `{results}/logs/`
- Check logs for detailed error messages
- Logs include stderr from all commands

**Common issues**:
- Missing conda dependencies → check `scraps_conda.yml`
- YAML syntax errors → validate with `snakemake -npr`
- Missing input files → check `DATA` path in config.yaml
- Resource exhaustion → adjust `mem_mb` or `threads` in rules

**Debugging Snakemake**:
```bash
# Show detailed execution plan
snakemake -npr --verbose

# Print shell commands without execution
snakemake -np --printshellcmds

# Force re-run specific rule
snakemake --forcerun rulename
```

---
## Dependencies and Tools

**Core requirements** (installed via conda):
- Python >= 3.7
- Snakemake >= 5.3.0, < 8
- STAR >= 2.7.9a (RNA-seq aligner)
- UMI-tools >= 1.1.2 (UMI handling)
- cutadapt >= 3.4 (adapter trimming)
- samtools >= 1.15 (BAM manipulation)
- bedtools >= 2.30.0 (genomic intervals)
- subread >= 2.0.1 (featureCounts)
- MultiQC >= 1.6 (report generation)
- pysam >= 0.16.0 (Python BAM interface)

**Version checking**: Run `snakemake --configfile config.yaml` to trigger version checks

---
## Notes for AI Agents

- **Always dry-run first**: Use `snakemake -npr` before any pipeline changes
- **Respect shell choice**: Pipeline explicitly uses `zsh`, not bash
- **Preserve temp files**: Snakemake manages cleanup via the `temp()` directive
- **Follow hierarchical config**: Sample → Chemistry/Platform → Defaults
- **Log everything**: Redirect stderr to log files for debugging
- **Resource awareness**: Bioinformatics tools are memory/CPU intensive
- **No traditional tests**: Validation is via successful Snakemake dry-run
