Commit 37dca67

Merge pull request #345 from jkmckenna/0.3.2
0.3.2
2 parents 3dacfed + 8f5aa05 commit 37dca67

64 files changed

+10924
-4242
lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -19,6 +19,9 @@ venv/
 venvs/
 /environment.yml

+# Development
+/dev/
+
 # Tests
 /tests/_test_inputs/dorado_models
 /tests/_test_outputs/

AGENTS.md

Lines changed: 110 additions & 10 deletions
@@ -1,32 +1,60 @@
 # AGENTS.md

-This file tells coding agents (including OpenAI Codex) how to work in this repo.
+This file tells coding agents (including OpenAI Codex and Claude Code) how to work in this repo.
+
+Coding agents can only read from AGENTS.md or Claude.md files.
+Agents cannot edit AGENTS.md or Claude.md files.

 ## Goals
 - Make minimal, correct changes.
 - Prefer small PRs / diffs.
 - Keep behavior stable unless the task explicitly requests changes.
+- Generate production grade, scalable code.
+
+## Prompt interface
+- When asked about a problem or task, first describe the plan to handle the task.
+- Keep taking prompts until the plan is validated.
+- Implement code after being told to proceed.

 ## Repo orientation
 - Read existing patterns before inventing new ones.
 - Don’t refactor broadly unless asked.
 - If you’re unsure about intended behavior, look for tests/docs first.
+- If behavior is not clear after tests/docs, look at the Click commands section in this file.
 - Ignore all files in any directory named "archived".
+- User defined parameters exist within src/smftools/config.
+- Parameters are inherited from default.yaml -> MODALITY.yaml -> user_defined_config.csv
+- Frequently used non user defined variables should exist within src/smftools/constants.py
+- Logging functionality is defined within src/smftools/logging_utils.py
+- Optional dependency handling is defined within src/smftools/optional_imports.py
+- Frequently used I/O functionality is defined within src/smftools/readwrite.py
+- CLI functionality is provided through click and is defined within:
+- src/smftools/cli_entry.py
+- Modules of the src/smftools/cli subpackage
+- RTD documentation organization through smftools/docs
+- Pytest testing within smftools/tests

 ## Project dependencies
 - A core set of dependencies is required for the project.
 - Various optional dependencies are provided for:
-- Optional functional modules of the package (ont, plotting, ml-base, ml-extended, scanpy, qc)
-- If a Python version of a CLI tool is preferred (Such as for Samtools, Bedtools, BedGraphToBigWig).
-- For potential performance boosts in computation (torch)
+- Optional functional modules of the package (ont, plotting, ml-base, ml-extended, umap, qc)
+- If available, a Python version of a CLI tool is preferred (Such as for Samtools, Bedtools, BedGraphToBigWig).
+- torch is listed as an extra dependency, but is currently required.
 - All dependencies can be installed with `pip install -e ".[all]"`
+- Certain command line tools are currently needed for certain functionalities within smftools load:
+- dorado: Used for nanopore basecalling from POD5/FAST5 files to BAM.
+- dorado/minimap2: Used for alignment of reads to reference.
+- dorado: Used for demultiplexing of nanopore derived BAMs.
+- modkit: Used for extracting modification probabilities from MM/ML BAM tags for native smf modality.

 ## Setup
-- Create env (pick one):
-- `python -m venv .venv && source .venv/bin/activate`
-- or `conda env create -f environment.yml && conda activate <env>`
-- Install:
-- `pip install -e ".[dev]"`
+- Use current environment if the core dependencies are installed.
+- If dependencies are not found, create a venv in smftools/venvs/ directory:
+- `python3 -m venv .temp-venv && source .temp-venv/bin/activate`
+- Install the core dependencies and development dependencies for testing/formatting/linting:
+- `pip install -e ".[dev,torch]"`
+- If code is raising dependency errors and they are in the optional dependencies:
+- `pip install -e ".[EXTRA_DEPENDENCY_NAME]"`

 ## How to run checks
 - Smoke tests: `pytest -m smoke -q`
@@ -41,17 +69,22 @@ This file tells coding agents (including OpenAI Codex) how to work in this repo.
 ## Coding conventions
 - Follow existing style and module layout.
 - Prefer clear, explicit code over cleverness.
+- Prefer modular functionality to facilitate testing and future development.
+- Do not over-parameterize functions when possible.
+- For function parameters that a user may want to tune, use the config management strategy.
+- Use constants.py when appropriate.
+- Annotate code blocks to describe functionality.
 - Add/adjust tests for bug fixes and new behavior.
 - Keep public APIs backward compatible unless explicitly changing them.
 - Python:
 - Use type hints for new/modified functions where reasonable.
 - Use Google style docstring format.
 - Avoid heavy dependencies unless necessary.
 - Use typing.TYPE_CHECKING and annotations.
+- In docstring of new functions, define the purpose of the function and what it does.

 ## Testing expectations
 - New functionality must include tests.
-- Bug fix PRs should include a regression test.
 - If tests are flaky or slow, note it and scope the change.

 ## Logging & secrets
@@ -67,3 +100,70 @@ This file tells coding agents (including OpenAI Codex) how to work in this repo.
 ## If something fails
 - If a command fails, paste the full error and summarize likely causes.
 - Don’t “fix” by deleting tests or weakening assertions unless explicitly instructed.
+
+## Click commands and their primary intent. Look in docs first, and underneath if the task is still not clear.
+- smftools load:
+- Take a variety of raw sequencing input options (FASTQs, POD5s, BAMs) from a single molecule footprinting experiment.
+- Determine the smf modality specified by the user (conversion, deaminase, native).
+- Handle FASTA inputs
+- Basecall the files using dorado if needed.
+- Align the reads using dorado or minimap2.
+- Sort/Index/Demultiplex BAMs.
+- BAM QC.
+- Extract Base modification probabilities for native smf modality
+- Load an AnnData object containing:
+- adata.X with a read X position matrix of SMF data.
+- adata.layers with:
+- integer encoded DNA sequences of each read.
+- mismatch encodings of DNA sequence vs reference for each read.
+- Base Q-scores for each read.
+- Read span masks indicating where the read aligned.
+- adata.var with per Reference_strand FASTA bases across positions.
+- adata.var_names being positional indexes within each read.
+- adata.obs_names being read names.
+- adata.obs with read level metadata
+- adata.uns with various unstructured data metrics.
+- Run multiqc on the BAM qc files.
+- Directory temp file cleanup.
+- Write out the adata, its backup accessory data, and csv files of obs, var, and keys.
+- smftools preprocess:
+- Requires the adata produced by smftools load.
+- Adds various QC metrics and performs data preprocessing and filtering.
+- Read length, quality, and mapping based QC.
+- Per reference position level QC.
+- Appending base context for each reference.
+- Binarization of SMF probabilities for the native smf modality
+- NaN filling strategies in adata.layers.
+- Read level modification QC and filtering.
+- Duplicate detection and complexity analysis for conversion/deaminase modalities.
+- Visualizing read spans and base quality clustermaps.
+- Optionally inverts the adata along the var-axis.
+- Optionally reindexes var.
+- smftools variant:
+- Requires at least a preprocessed adata object.
+- Calculates per position mismatch frequencies/types for each reference/sample.
+- Optional variant site labeling if comparing two references.
+- Visualizes sequence encodings and mismatch encodings with clustermaps.
+- smftools chimeric:
+- Requires at least a preprocessed adata object.
+- Meant to detect putative PCR chimeras.
+- smftools spatial:
+- Requires at least a preprocessed adata object.
+- Basic spatial signal analyses.
+- Clustermaps to visualize smf signal per reference/sample.
+- Spatial autocorrelation.
+- Position x position correlation matrices (Pearson, Binary covariance, chi2, relative risk)
+- smftools hmm:
+- Requires at least a preprocessed adata object.
+- Fits/saves/applies HMM to adata to label putative molecular features.
+- Creates adata.layers that hold binary masks of each feature class/subclass.
+- Creates adata.layers that hold HMM emission probabilities.
+- Visualizes HMM layers with clustermaps.
+- Performs peak calling on HMM layers and labels reads with the features in obs.
+- smftools latent:
+- Requires at least a preprocessed adata object.
+- Generates latent representations of the smf data.
+- PCA/KNN/UMAP/NMF/CP decomposition strategies.
+- Represents full sequences.
+- Represents modified sites only.
+- Represents non-modified sites only.
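
The layered parameter resolution named in the Repo orientation bullets (default.yaml -> MODALITY.yaml -> user_defined_config.csv) can be pictured with a small sketch. This is an illustration only, not the loader in src/smftools/config: the function name `resolve_config` and every key below are hypothetical, and plain dicts stand in for the parsed YAML and CSV files.

```python
from typing import Any, Mapping


def resolve_config(
    defaults: Mapping[str, Any],
    modality: Mapping[str, Any],
    user: Mapping[str, Any],
) -> dict[str, Any]:
    """Merge three config layers; later layers override earlier ones."""
    merged: dict[str, Any] = {}
    for layer in (defaults, modality, user):
        merged.update(layer)
    return merged


# Hypothetical keys and values, for illustration only.
defaults = {"smf_modality": "conversion", "threads": 1, "min_read_length": 100}
modality = {"smf_modality": "native", "min_read_length": 500}
user = {"threads": 8}

config = resolve_config(defaults, modality, user)
print(config)
```

The ordering of the loop is the whole design: a value set in the user CSV wins over the modality file, which in turn wins over the defaults.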

Claude.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# Claude Code Agent Instructions
+
+You are the implementation agent defined in smftools/AGENTS.md

docs/source/basic_usage.md

Lines changed: 38 additions & 4 deletions
@@ -17,34 +17,68 @@ This command takes a user passed config file handling:

 ## Preprocess Usage

-This command performs preprocessing on the anndata object. It automatically runs the load command under the hood if starting from raw data.
+This command performs preprocessing on the anndata object.

 ```shell
 smftools preprocess "/Path_to_experiment_config.csv"
 ```

 ![](_static/smftools_preprocessing_diagram.png)

+
+## Variant Usage
+
+This command performs DNA sequence variation based analyses on the anndata object.
+
+```shell
+smftools variant "/Path_to_experiment_config.csv"
+```
+
+## Chimeric Usage
+
+This command performs putative PCR chimera detection on the anndata object.
+
+```shell
+smftools chimeric "/Path_to_experiment_config.csv"
+```
+
 ## Spatial Usage

-This command performs spatial analysis on the anndata object. It automatically runs the load command and preprocessing under the hood if they have not been already run.
+This command performs spatial analysis on the anndata object.

 ```shell
 smftools spatial "/Path_to_experiment_config.csv"
 ```

-- Currently Includes: Position X Position correlation matrices, clustering, dimensionality reduction, spatial autocorrelation.
+- Currently Includes: Position X Position correlation matrices, read x position clustermaps, and spatial autocorrelation.

 ## HMM Usage

-This command performs hmm based feature annotation on the anndata object. It automatically runs the load command and preprocessing under the hood if they have not been already run.
+This command performs hmm based feature annotation on the anndata object.

 ```shell
 smftools hmm "/Path_to_experiment_config.csv"
 ```

 - Main outputs will be stored in adata.layers

+
+## Latent Usage
+
+This command constructs various latent representations of the anndata object.
+
+```shell
+smftools latent "/Path_to_experiment_config.csv"
+```
+
+## Full Usage
+
+This command is a wrapper that sequentially runs load, preprocess, variant, chimeric, spatial, hmm, latent workflows.
+
+```shell
+smftools full "/Path_to_experiment_config.csv"
+```
+
 ## Batch Usage

 This command performs batch processing of any of the above commands across multiple experiments. It takes in a tsv, txt, or csv of experiment specific config csvs.
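
The stage order that the new "Full Usage" text gives for `smftools full` can be spelled out as the equivalent sequence of individual commands. The sketch below only builds and prints the argument lists and runs nothing. The helper `build_stage_commands` is hypothetical, and only the subcommand names and the positional config path come from the documentation in this diff.

```python
import shlex

# Stage order as listed in the "Full Usage" section of the diff.
STAGES = ["load", "preprocess", "variant", "chimeric", "spatial", "hmm", "latent"]


def build_stage_commands(config_path: str) -> list[list[str]]:
    """Return one `smftools <stage> <config>` argument list per stage."""
    return [["smftools", stage, config_path] for stage in STAGES]


commands = build_stage_commands("/path/to/experiment_config.csv")
for argv in commands:
    print(shlex.join(argv))
```

Each argument list could be handed to `subprocess.run(argv, check=True)` to stop at the first failing stage. Whether `smftools full` itself behaves that way is not stated in the diff.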

docs/source/tutorials/cli_usage.md

Lines changed: 60 additions & 15 deletions
@@ -3,13 +3,13 @@
 ## Quick start

 Most CLI workflows start with an experiment configuration CSV that points to your data, FASTA, and
-output directory. Once the configuration is ready, you can run commands like:
+output directory. Once the configuration is ready, you can run commands such as:

 ```shell
 smftools load /path/to/experiment_config.csv
 smftools preprocess /path/to/experiment_config.csv
-smftools spatial /path/to/experiment_config.csv
-smftools hmm /path/to/experiment_config.csv
+smftools full /path/to/experiment_config.csv
+smftools batch full /path/to/config_paths.csv
 ```

 Each command will create (or reuse) stage-specific AnnData files in the output directory. Later
@@ -26,48 +26,93 @@ The load command builds the raw AnnData object from your raw sequencing data. It
 - Performs basecalling, alignment, demultiplexing, and BAM QC.
 - Optionally generates BED/bigWig outputs for alignment summaries.
 - Constructs the raw AnnData object (Single molecules x Positional coordinates).
-- Adds basic read-level QC annotations.
+- adata.X contains binarized modification data (conversion/deaminase), or modification probabilities (native).
+- Adds basic read-level QC annotations (Read start, end, length, mean quality).
+- Adds layers encoding read DNA sequences, base quality scores, base mismatches.
+- Maintains BAM Tags/Flags in adata.obs.
 - Writes the raw AnnData to the canonical output path and runs MultiQC.
 - Optionally deletes intermediate BAMs, H5ADs, and TSVs.

 ### `smftools preprocess`

 The preprocess command performs QC, binarization, filtering, and duplicate detection. It:

+- Requires an Anndata created by smftools load.
 - Loads sample sheet metadata (if provided).
 - Generates read length/quality QC plots and filters reads on these metrics.
 - Binarizes direct-modification calls based on thresholds (hard or fit thresholds).
-- Cleans NaNs in adata.layers.
-- Computes positional coverage and base-context annotations.
+- Cleans NaNs from adata.X and stores in adata.layers (nan0_0minus1, nan_half).
+- Computes positional coverage and base-context annotations (GpC, CpG, ambiguous, other C, any C).
 - Calculates read modification statistics and QC plots.
 - Filters reads based on modification thresholds.
-- Adds base-context binary layers.
-- Flags duplicate reads and performs complexity analyses (conversion/deamination workflows).
-- Writes preprocessed and deduplicated AnnData outputs.
+- Adds base-context binary modification layers.
+- Optionally inverts and reindexes the data along the var (positions) axis.
+- Flags duplicate reads based on nearest neighbor hamming distance of overlapping valid sites (Conversion/deamination).
+- Performs complexity analyses using duplicate read clusters and Lander/Waterman fits (conversion/deamination workflows).
+- Visualizes read span masks and base quality scores with clustermaps.
+- Writes preprocessed (duplicates flagged, but kept) and preprocessed/deduplicated AnnData outputs.
+
+### `smftools variant`
+
+The variant command focuses on DNA sequence variation analyses. It:
+
+- Requires at least a preprocessed AnnData object.
+- Calculates position level variation frequencies per reference/sample.
+- Generates z-scores for variant occurrence given read level Q-scores and assuming uniform Palt transitions.
+- Visualizes read DNA sequence encodings and mismatch encodings.
+
+### `smftools chimeric`
+
+The chimeric command is meant to find putative PCR chimeras. It:
+
+- Requires at least a preprocessed AnnData object.
+- Performs sliding window nearest neighbor hamming distance analysis per read.
+- Visualizes the windowed nearest neighbor hamming distances per read.
+- Assembles maximum spanning intervals of 0-hamming distance neighbors per read within the reference/sample.
+- In progress.

 ### `smftools spatial`

 The spatial command runs downstream spatial analyses on the preprocessed data. It:

+- Requires at least a preprocessed AnnData object.
 - Optionally loads sample sheet metadata.
-- Optionally inverts and reindexes the data along the reference axis.
+- Optionally inverts and reindexes the data along the positions axis.
 - Generates clustermaps for preprocessed (and deduplicated) AnnData.
-- Runs PCA/UMAP/Leiden clustering.
 - Computes spatial autocorrelation, rolling metrics, and grid summaries.
-- Generates positionwise correlation matrices (non-direct modalities).
+- Generates positionwise correlation matrices.
 - Writes the spatial AnnData output.

 ### `smftools hmm`

 The hmm command adds HMM-based feature annotation and summary plots. It:

-- Ensures preprocessing and spatial analyses are up to date.
+- Requires at least a preprocessed AnnData object.
 - Fits or reuses HMM models for configured feature sets.
-- Annotates AnnData with HMM-derived layers and merged intervals.
+- Annotates AnnData with HMM-derived feature layers (State layers and probability layers)
 - Calls HMM feature peaks and writes peak-calling outputs.
-- Generates clustermaps, rolling traces, and fragment size plots for HMM layers.
+- Generates clustermaps, bulk feature traces, and fragment size distribution plots for HMM layers.
 - Writes the HMM AnnData output.

+### `smftools latent`
+
+The latent command constructs latent representations of the data. It:
+
+- Requires at least a preprocessed AnnData object.
+- Runs various dimensionality reduction and graph construction modalities:
+- Principal component analysis (PCA)
+- K-nearest neighbor (KNN)
+- Uniform manifold approximation and projection (UMAP)
+- Non-negative matrix factorization (NMF)
+- Canonical polyadic decomposition (PARAFAC)
+
+### `smftools full`
+
+The full command is a workflow wrapper. It runs the following sequentially:
+
+- Load / preprocess / variant / chimeric / spatial / hmm / latent.
+
+
 ## Batch processing

 Use the batch command to run a single task across multiple experiments.
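
The preprocess bullets in this file's diff describe flagging duplicate reads by "nearest neighbor hamming distance of overlapping valid sites". A brute-force sketch of that idea follows. It is an assumption-laden illustration rather than the smftools implementation: the names `hamming_overlap` and `flag_duplicates`, the `max_distance` threshold, the use of `None` for invalid sites, and the choice to keep the first read of each duplicate group are all hypothetical.

```python
from typing import Optional, Sequence

Read = Sequence[Optional[int]]  # 0/1 calls; None marks a site with no valid call


def hamming_overlap(a: Read, b: Read) -> Optional[float]:
    """Fraction of mismatches over sites valid in both reads (None if no overlap)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return None
    return sum(x != y for x, y in pairs) / len(pairs)


def flag_duplicates(reads: Sequence[Read], max_distance: float = 0.0) -> list[bool]:
    """Flag a read when its nearest earlier neighbor is within max_distance."""
    flags = []
    for i, read in enumerate(reads):
        distances = [hamming_overlap(read, other) for other in reads[:i]]
        valid = [d for d in distances if d is not None]
        flags.append(bool(valid) and min(valid) <= max_distance)
    return flags


reads = [
    [1, 0, 1, None, 0],
    [1, 0, 1, 1, 0],  # identical to the first read on every shared valid site
    [0, 1, 0, 1, 1],  # differs from both earlier reads
]
print(flag_duplicates(reads))
```

Restricting the comparison to sites valid in both reads keeps missing calls from counting as mismatches. The pairwise loop is quadratic in the number of reads, so a real pipeline would likely need something faster.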

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -147,7 +147,7 @@ torch = [
 all = [
 # cluster
 "fastcluster",
-"igraph"
+"igraph",
 "leidenalg",

 # informatics
