11# AGENTS.md
22
3- This file tells coding agents (including OpenAI Codex) how to work in this repo.
3+ This file tells coding agents (including OpenAI Codex and Claude Code) how to work in this repo.
4+
5+ Coding agents have read-only access to AGENTS.md and Claude.md files.
6+ Agents must not edit AGENTS.md or Claude.md files.
47
58## Goals
69- Make minimal, correct changes.
710- Prefer small PRs / diffs.
811- Keep behavior stable unless the task explicitly requests changes.
12+ - Generate production-grade, scalable code.
13+
14+ ## Prompt interface
15+ - When asked about a problem or task, first describe the plan for handling it.
16+ - Keep accepting prompts and refining the plan until the user validates it.
17+ - Implement code only after being told to proceed.
918
1019## Repo orientation
1120- Read existing patterns before inventing new ones.
1221- Don’t refactor broadly unless asked.
1322- If you’re unsure about intended behavior, look for tests/docs first.
23+ - If behavior is not clear after tests/docs, look at the Click commands section in this file.
1424- Ignore all files in any directory named "archived".
25+ - User-defined parameters live in src/smftools/config.
26+ - Parameters are inherited in the order default.yaml -> MODALITY.yaml -> user_defined_config.csv, with later sources overriding earlier ones.
27+ - Frequently used non-user-defined variables should live in src/smftools/constants.py
28+ - Logging functionality is defined within src/smftools/logging_utils.py
29+ - Optional dependency handling is defined within src/smftools/optional_imports.py
30+ - Frequently used I/O functionality is defined within src/smftools/readwrite.py
31+ - CLI functionality is provided through click and is defined within:
32+ - src/smftools/cli_entry.py
33+ - Modules of the src/smftools/cli subpackage
34+ - Read the Docs (RTD) documentation is organized under smftools/docs
35+ - Pytest tests live in smftools/tests
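
The parameter inheritance chain described above (default.yaml -> MODALITY.yaml -> user_defined_config.csv) can be pictured as a layered merge in which later layers override earlier ones. The sketch below is illustrative only and is not the code in src/smftools/config: the parameter names, the two-column `key,value` CSV layout, and the helpers `read_user_config`, `coerce`, and `resolve_config` are all assumptions, and the YAML layers are shown as already-parsed dicts so that no YAML library is needed.

```python
import csv
import io

# Hypothetical stand-ins for default.yaml and one MODALITY.yaml file, shown as
# already-parsed dicts. The parameter names are invented for illustration.
DEFAULTS = {"min_read_length": 100, "threads": 4, "binarize": False}
MODALITY_OVERRIDES = {"binarize": True}

# Hypothetical user config; the two-column key,value layout is an assumption.
USER_CSV = "key,value\nthreads,8\n"


def read_user_config(text: str) -> dict:
    """Parse a two-column key,value CSV into a dict of raw strings."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["key"]: row["value"] for row in reader}


def coerce(raw: str, like: object) -> object:
    """Cast a raw CSV string to the type of the value it overrides."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(like, bool):
        return raw.strip().lower() in {"1", "true", "yes"}
    if isinstance(like, int):
        return int(raw)
    if isinstance(like, float):
        return float(raw)
    return raw


def resolve_config(defaults: dict, modality: dict, user_raw: dict) -> dict:
    """Merge layers so that later layers win: defaults < modality < user."""
    merged = {**defaults, **modality}
    for key, raw in user_raw.items():
        # Unknown keys are kept as raw strings; known keys adopt the existing type.
        merged[key] = coerce(raw, merged[key]) if key in merged else raw
    return merged


config = resolve_config(DEFAULTS, MODALITY_OVERRIDES, read_user_config(USER_CSV))
print(config)  # {'min_read_length': 100, 'threads': 8, 'binarize': True}
```

Resolving everything into one flat dict up front means downstream functions only ever see the final values and need no knowledge of which layer supplied them.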
1536
1637## Project dependencies
1738- A core set of dependencies is required for the project.
1839- Various optional dependencies are provided for:
19- - Optional functional modules of the package (ont, plotting, ml-base, ml-extended, scanpy , qc)
20- - If a Python version of a CLI tool is preferred (Such as for Samtools, Bedtools, BedGraphToBigWig).
21- - For potential performance boosts in computation (torch)
40+ - Optional functional modules of the package (ont, plotting, ml-base, ml-extended, umap, qc)
41+ - If available, a Python version of a CLI tool is preferred (such as for Samtools, Bedtools, BedGraphToBigWig).
42+ - torch is listed as an extra dependency, but is currently required.
2243 - All dependencies can be installed with ` pip install -e ".[all]" `
44+ - Some command-line tools are currently required for parts of smftools load:
45+ - dorado: Used for nanopore basecalling from POD5/FAST5 files to BAM.
46+ - dorado/minimap2: Used for alignment of reads to reference.
47+ - dorado: Used for demultiplexing of nanopore derived BAMs.
48+ - modkit: Used for extracting modification probabilities from MM/ML BAM tags for native smf modality.
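
One common way to handle optional dependencies like those listed above is to import them lazily and, on failure, raise an error that names the extra to install. The following is a minimal sketch of that pattern using only the standard library. The helper name `require` and its signature are hypothetical and are not claimed to match what src/smftools/optional_imports.py provides.

```python
import importlib
from types import ModuleType


def require(module_name: str, extra: str) -> ModuleType:
    """Import an optional module or raise an error naming the extra to install.

    Hypothetical helper for illustration; not the package's actual API.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is needed for this feature. "
            f'Install it with: pip install -e ".[{extra}]"'
        ) from exc


# A standard-library module always imports, so this call succeeds.
json_module = require("json", extra="all")

# A missing module produces an actionable message instead of a bare traceback.
try:
    require("not_a_real_module_xyz", extra="plotting")
except ImportError as err:
    message = str(err)
    print(message)
```

Deferring the import to the call site keeps the core package importable when an extra is absent, and the error message points directly at the install command.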
2349
2450## Setup
25- - Create env (pick one):
26- - ` python -m venv .venv && source .venv/bin/activate `
27- - or ` conda env create -f environment.yml && conda activate <env> `
28- - Install:
29- - ` pip install -e ".[dev]" `
51+ - Use current environment if the core dependencies are installed.
52+ - If dependencies are not found, create a venv in smftools/venvs/ directory:
53+ - ` python3 -m venv .temp-venv && source .temp-venv/bin/activate `
54+ - Install the core dependencies and development dependencies for testing/formatting/linting:
55+ - ` pip install -e ".[dev,torch]" `
56+ - If code raises dependency errors for packages listed in the optional dependencies:
57+ - ` pip install -e ".[EXTRA_DEPENDENCY_NAME]" `
3058
3159## How to run checks
3260- Smoke tests: ` pytest -m smoke -q `
@@ -41,17 +69,22 @@ This file tells coding agents (including OpenAI Codex) how to work in this repo.
4169## Coding conventions
4270- Follow existing style and module layout.
4371- Prefer clear, explicit code over cleverness.
72+ - Prefer modular functionality to facilitate testing and future development.
73+ - Avoid over-parameterizing functions.
74+ - For function parameters that a user may want to tune, use the config management strategy.
75+ - Use constants.py when appropriate.
76+ - Annotate code blocks to describe functionality.
4477- Add/adjust tests for bug fixes and new behavior.
4578- Keep public APIs backward compatible unless explicitly changing them.
4679- Python:
4780 - Use type hints for new/modified functions where reasonable.
4881 - Use Google style docstring format.
4982 - Avoid heavy dependencies unless necessary.
5083 - Use typing.TYPE_CHECKING and annotations.
84+ - In the docstring of each new function, state the function's purpose and what it does.
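
As one possible reading of the Python conventions above, the sketch below combines type hints, a Google style docstring that states purpose and behavior, and a `typing.TYPE_CHECKING` guard for an import needed only by type checkers. The function `fraction_modified` is a hypothetical example and does not exist in the package. The annotation is quoted so that it is never evaluated at runtime; adding `from __future__ import annotations` at the top of a module would make the quotes unnecessary.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported for type checkers only; this branch never runs at runtime.
    from collections.abc import Sequence


def fraction_modified(calls: "Sequence[int]") -> float:
    """Compute the fraction of modified positions in one read.

    Counts the positions called as modified (1) and divides by the total
    number of calls. Hypothetical helper, shown for illustration only.

    Args:
        calls: Binary modification calls, 1 for modified and 0 for unmodified.

    Returns:
        The fraction of calls equal to 1, or 0.0 if ``calls`` is empty.
    """
    if not calls:
        return 0.0
    return sum(calls) / len(calls)


print(fraction_modified([1, 0, 1, 1]))  # 0.75
```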
5185
5286## Testing expectations
5387- New functionality must include tests.
54- - Bug fix PRs should include a regression test.
5588- If tests are flaky or slow, note it and scope the change.
5689
5790## Logging & secrets
@@ -67,3 +100,70 @@ This file tells coding agents (including OpenAI Codex) how to work in this repo.
67100## If something fails
68101- If a command fails, paste the full error and summarize likely causes.
69102- Don’t “fix” by deleting tests or weakening assertions unless explicitly instructed.
103+
104+ ## Click commands and their primary intent. Look in the docs first, then in this section if the task is still not clear.
105+ - smftools load:
106+ - Take a variety of raw sequencing input options (FASTQs, POD5s, BAMs) from a single molecule footprinting experiment.
107+ - Determine the smf modality specified by the user (conversion, deaminase, native).
108+ - Handle FASTA inputs
109+ - Basecall the files using dorado if needed.
110+ - Align the reads using dorado or minimap2.
111+ - Sort/Index/Demultiplex BAMs.
112+ - BAM QC.
113+ - Extract base modification probabilities for the native smf modality.
114+ - Load an AnnData object containing:
115+ - adata.X with a read x position matrix of SMF data.
116+ - adata.layers with:
117+ - integer encoded DNA sequences of each read.
118+ - mismatch encodings of DNA sequence vs reference for each read.
119+ - Base Q-scores for each read.
120+ - Read span masks indicating where the read aligned.
121+ - adata.var with per Reference_strand FASTA bases across positions.
122+ - adata.var_names being positional indexes within each read.
123+ - adata.obs_names being read names.
124+ - adata.obs with read level metadata
125+ - adata.uns with various unstructured data metrics.
126+ - Run multiqc on the BAM qc files.
127+ - Directory temp file cleanup.
128+ - Write out the adata, its backup accessory data, and csv files of obs, var, and keys.
129+ - smftools preprocess:
130+ - Requires the adata produced by smftools load.
131+ - Adds various QC metrics and performs data preprocessing and filtering.
132+ - Read length, quality, and mapping based QC.
133+ - Per reference position level QC.
134+ - Appending base context for each reference.
135+ - Binarization of SMF probabilities for the native smf modality
136+ - NaN filling strategies in adata.layers.
137+ - Read level modification QC and filtering.
138+ - Duplicate detection and complexity analysis for conversion/deaminase modalities.
139+ - Visualizing read spans and base quality clustermaps.
140+ - Optionally inverts the adata along the var-axis.
141+ - Optionally reindexes var.
142+ - smftools variant:
143+ - Requires at least a preprocessed adata object.
144+ - Calculates per position mismatch frequencies/types for each reference/sample.
145+ - Optional variant site labeling if comparing two references.
146+ - Visualizes sequence encodings and mismatch encodings with clustermaps.
147+ - smftools chimeric:
148+ - Requires at least a preprocessed adata object.
149+ - Meant to detect putative PCR chimeras.
150+ - smftools spatial:
151+ - Requires at least a preprocessed adata object.
152+ - Basic spatial signal analyses.
153+ - Clustermaps to visualize smf signal per reference/sample.
154+ - Spatial autocorrelation.
155+ - Position x position correlation matrices (Pearson, Binary covariance, chi2, relative risk)
156+ - smftools hmm:
157+ - Requires at least a preprocessed adata object.
158+ - Fits/saves/applies HMM to adata to label putative molecular features.
159+ - Creates adata.layers that hold binary masks of each feature class/subclass.
160+ - Creates adata.layers that hold HMM emission probabilities.
161+ - Visualizes HMM layers with clustermaps.
162+ - Performs peak calling on HMM layers and labels reads with the features in obs.
163+ - smftools latent:
164+ - Requires at least a preprocessed adata object.
165+ - Generates latent representations of the smf data.
166+ - PCA/KNN/UMAP/NMF/CP decomposition strategies.
167+ - Represents full sequences.
168+ - Represents modified sites only.
169+ - Represents non-modified sites only.
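
To make the per-read layers listed under `smftools load` more concrete (integer encoded sequences, mismatch encodings against the reference, and read span masks), here is a small pure-Python sketch of how such rows could be built for a single read. It is an illustration under stated assumptions, not the package's implementation: the integer codes, the `PAD` fill value, and the helper names are hypothetical, and the read is assumed to align without gaps and to fit entirely within the reference.

```python
# Hypothetical integer codes; the package's actual mapping may differ.
BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}
PAD = -1  # assumed fill value for positions outside the read's aligned span


def encode_read(read: str, start: int, ref_len: int) -> list:
    """Place an ungapped read on reference coordinates as integer codes."""
    row = [PAD] * ref_len
    for offset, base in enumerate(read):
        # Unrecognized characters fall back to the code for N.
        row[start + offset] = BASE_TO_INT.get(base, BASE_TO_INT["N"])
    return row


def span_mask(encoded: list) -> list:
    """Mark positions covered by the read with 1 and all others with 0."""
    return [int(code != PAD) for code in encoded]


def mismatch_mask(encoded: list, reference: str) -> list:
    """Mark covered positions whose base differs from the reference."""
    ref_codes = [BASE_TO_INT.get(base, BASE_TO_INT["N"]) for base in reference]
    return [
        int(code != PAD and code != ref_code)
        for code, ref_code in zip(encoded, ref_codes)
    ]


reference = "ACGTACGT"
encoded = encode_read("GTTC", start=2, ref_len=len(reference))

print(encoded)                            # [-1, -1, 2, 3, 3, 1, -1, -1]
print(span_mask(encoded))                 # [0, 0, 1, 1, 1, 1, 0, 0]
print(mismatch_mask(encoded, reference))  # [0, 0, 0, 0, 1, 0, 0, 0]
```

Stacking one such row per read gives read x position matrices with the same shape as adata.X, which is what allows them to be stored side by side in adata.layers.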