jkmckenna
diff --git a/‎.gitignore‎
Lines changed: 3 additions & 0 deletions b/‎.gitignore‎
diff --git a/‎AGENTS.md‎
Lines changed: 110 additions & 10 deletions b/‎AGENTS.md‎
diff --git a/‎Claude.md‎
Lines changed: 3 additions & 0 deletions b/‎Claude.md‎
diff --git a/‎docs/source/basic_usage.md‎
Lines changed: 38 additions & 4 deletions b/‎docs/source/basic_usage.md‎
diff --git a/‎docs/source/tutorials/cli_usage.md‎
Lines changed: 60 additions & 15 deletions b/‎docs/source/tutorials/cli_usage.md‎
diff --git a/‎pyproject.toml‎
Lines changed: 1 addition & 1 deletion b/‎pyproject.toml‎
@@ -19,6 +19,9 @@ venv/
 venvs/
 /environment.yml
 
+# Development
+/dev/
+
 # Tests
 /tests/_test_inputs/dorado_models
 /tests/_test_outputs/
 
@@ -1,32 +1,60 @@
 # AGENTS.md
 
-This file tells coding agents (including OpenAI Codex) how to work in this repo.
+This file tells coding agents (including OpenAI Codex and Claude Code) how to work in this repo.
+
+Coding agents can only read from AGENTS.md or Claude.md files.
+Agents cannot edit AGENTS.md or Claude.md files.
 
 ## Goals
 - Make minimal, correct changes.
 - Prefer small PRs / diffs.
 - Keep behavior stable unless the task explicitly requests changes.
+- Generate production grade, scalable code.
+
+## Prompt interface
+- When asked about a problem or task, first describe the plan to handle the task.
+- Keep taking prompts until the plan is validated.
+- Implement code after being told to proceed.
 
 ## Repo orientation
 - Read existing patterns before inventing new ones.
 - Don’t refactor broadly unless asked.
 - If you’re unsure about intended behavior, look for tests/docs first.
+- If behavior is not clear after tests/docs, look at the Click commands section in this file.
 - Ignore all files in any directory named "archived".
+- User defined parameters exist within src/smftools/config.
+- Parameters are inherited from default.yaml -> MODALITY.yaml -> user_defined_config.csv
+- Frequently used non user defined variables should exist within src/smftools/constants.py
+- Logging functionality is defined within src/smftools/logging_utils.py
+- Optional dependency handling is defined within src/smftools/optional_imports.py
+- Frequently used I/O functionality is defined within src/smftools/readwrite.py
+- CLI functionality is provided through click and is defined within:
+  - src/smftools/cli_entry.py
+  - Modules of the src/smftools/cli subpackage
+- RTD documentation organization through smftools/docs
+- Pytest testing within smftools/tests
 
 ## Project dependencies
 - A core set of dependencies is required for the project.
 - Various optional dependencies are provided for:
-    - Optional functional modules of the package (ont, plotting, ml-base, ml-extended, scanpy, qc)
-    - If a Python version of a CLI tool is preferred (Such as for Samtools, Bedtools, BedGraphToBigWig).
-    - For potential performance boosts in computation (torch)
+    - Optional functional modules of the package (ont, plotting, ml-base, ml-extended, umap, qc)
+    - If available, a Python version of a CLI tool is preferred (Such as for Samtools, Bedtools, BedGraphToBigWig).
+    - torch is listed as an extra dependency, but is currently required.
     - All dependencies can be installed with `pip install -e ".[all]"`
+- Certain command line tools are currently needed for certain functionalities within smftools load:
+  - dorado: Used for nanopore basecalling from POD5/FAST5 files to BAM.
+  - dorado/minimap2: Used for alignment of reads to reference.
+  - dorado: Used for demultiplexing of nanopore derived BAMs.
+  - modkit: Used for extracting modification probabilities from MM/ML BAM tags for native smf modality.
 
 ## Setup
-- Create env (pick one):
-  - `python -m venv .venv && source .venv/bin/activate`
-  - or `conda env create -f environment.yml && conda activate <env>`
-- Install:
-  - `pip install -e ".[dev]"`
+- Use current environment if the core dependencies are installed.
+- If dependencies are not found, create a venv in smftools/venvs/ directory:
+  - `python3 -m venv .temp-venv && source .temp-venv/bin/activate`
+- Install the core dependencies and development dependencies for testing/formatting/linting:
+  - `pip install -e ".[dev,torch]"`
+- If code is raising dependencies errors and they are in the optional dependencies:
+  - `pip install -e ".[EXTRA_DEPENDENCY_NAME]"`
 
 ## How to run checks
 - Smoke tests: `pytest -m smoke -q`
@@ -41,17 +69,22 @@ This file tells coding agents (including OpenAI Codex) how to work in this repo.
 ## Coding conventions
 - Follow existing style and module layout.
 - Prefer clear, explicit code over cleverness.
+- Prefer modular functionality to facilitate testing and future development.
+- Avoid over-parameterizing functions where possible.
+- For function parameters that a user may want to tune, use the config management strategy.
+- Use constants.py when appropriate.
+- Annotate code blocks to describe functionality.
 - Add/adjust tests for bug fixes and new behavior.
 - Keep public APIs backward compatible unless explicitly changing them.
 - Python:
   - Use type hints for new/modified functions where reasonable.
   - Use Google style docstring format.
   - Avoid heavy dependencies unless necessary.
   - Use typing.TYPE_CHECKING and annotations.
+  - In docstring of new functions, define the purpose of the function and what it does.
 
 ## Testing expectations
 - New functionality must include tests.
-- Bug fix PRs should include a regression test.
 - If tests are flaky or slow, note it and scope the change.
 
 ## Logging & secrets
@@ -67,3 +100,70 @@ This file tells coding agents (including OpenAI Codex) how to work in this repo.
 ## If something fails
 - If a command fails, paste the full error and summarize likely causes.
 - Don’t “fix” by deleting tests or weakening assertions unless explicitly instructed.
+
+## Click commands and their primary intent. Look in docs first, and underneath if the task is still not clear.
+- smftools load:
+  - Take a variety of raw sequencing input options (FASTQs, POD5s, BAMs) from a single molecule footprinting experiment.
+  - Determine the smf modality specified by the user (conversion, deaminase, native).
+  - Handle FASTA inputs
+  - Basecall the files using dorado if needed.
+  - Align the reads using dorado or minimap2.
+  - Sort/Index/Demultiplex BAMs.
+  - BAM QC.
+  - Extract Base modification probabilities for native smf modality
+  - Load an AnnData object containing:
+    - adata.X with a read X position matrix of SMF data.
+    - adata.layers with:
+      - integer encoded DNA sequences of each read.
+      - mismatch encodings of DNA sequence vs reference for each read.
+      - Base Q-scores for each read.
+      - Read span masks indicating where the read aligned.
+    - adata.var with per Reference_strand FASTA bases across positions.
+    - adata.var_names being positional indexes within each read.
+    - adata.obs_names being read names.
+    - adata.obs with read level metadata
+    - adata.uns with various unstructured data metrics.
+  - Run multiqc on the BAM qc files.
+  - Directory temp file cleanup.
+  - Write out the adata, its backup accessory data, and csv files of obs, var, and keys.
+- smftools preprocess:
+  - Requires the adata produced by smftools load.
+  - Adds various QC metrics and performs data preprocessing and filtering.
+    - Read length, quality, and mapping based QC.
+    - Per reference position level QC.
+    - Appending base context for each reference.
+    - Binarization of SMF probabilities for the native smf modality
+    - NaN filling strategies in adata.layers.
+    - Read level modification QC and filtering.
+    - Duplicate detection and complexity analysis for conversion/deaminase modalities.
+    - Visualizing read spans and base quality clustermaps.
+  - Optionally inverts the adata along the var-axis.
+  - Optionally reindexes var.
+- smftools variant:
+  - Requires at least a preprocessed adata object.
+  - Calculates per position mismatch frequencies/types for each reference/sample.
+  - Optional variant site labeling if comparing two references.
+  - Visualizes sequence encodings and mismatch encodings with clustermaps.
+- smftools chimeric:
+  - Requires at least a preprocessed adata object.
+  - Meant to detect putative PCR chimeras.
+- smftools spatial:
+  - Requires at least a preprocessed adata object.
+  - Basic spatial signal analyses.
+  - Clustermaps to visualize smf signal per reference/sample.
+  - Spatial autocorrelation.
+  - Position x position correlation matrices (Pearson, Binary covariance, chi2, relative risk)
+- smftools hmm:
+  - Requires at least a preprocessed adata object.
+  - Fits/saves/applies HMM to adata to label putative molecular features.
+  - Creates adata.layers that hold binary masks of each feature class/subclass.
+  - Creates adata.layers that hold HMM emission probabilities.
+  - Visualizes HMM layers with clustermaps.
+  - Performs peak calling on HMM layers and labels reads with the features in obs.
+- smftools latent:
+  - Requires at least a preprocessed adata object.
+  - Generates latent representations of the smf data.
+  - PCA/KNN/UMAP/NMF/CP decomposition strategies.
+  - Represents full sequences.
+  - Represents modified sites only.
+  - Represents non-modified sites only.
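
The parameter inheritance named in the repo orientation list (default.yaml -> MODALITY.yaml -> user_defined_config.csv) can be pictured as a layered dictionary merge in which later layers override earlier ones. The sketch below is only an illustration of that idea: the function name `resolve_config` and every parameter name and value are hypothetical, and it does not claim to reflect how src/smftools/config actually resolves parameters.

```python
from typing import Any, Mapping


def resolve_config(*layers: Mapping[str, Any]) -> dict[str, Any]:
    """Merge config layers left to right; later layers override earlier ones."""
    resolved: dict[str, Any] = {}
    for layer in layers:
        resolved.update(layer)
    return resolved


# Hypothetical values standing in for the three layers named above.
defaults = {"threads": 4, "min_read_length": 100, "aligner": "minimap2"}
modality = {"min_read_length": 200}  # stand-in for a MODALITY.yaml override
user = {"threads": 16}  # stand-in for a value from the user CSV

config = resolve_config(defaults, modality, user)
print(config)
```

A flat merge like this assumes parameters are not nested; nested settings would need a recursive merge instead.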
@@ -0,0 +1,3 @@
+# Claude Code Agent Instructions
+
+You are the implementation agent defined in smftools/AGENTS.md
@@ -17,34 +17,68 @@ This command takes a user passed config file handling:
 
 ## Preprocess Usage
 
-This command performs preprocessing on the anndata object. It automatically runs the load command under the hood if starting from raw data.
+This command performs preprocessing on the anndata object.
 
 ```shell
 smftools preprocess "/Path_to_experiment_config.csv"
 ```
 
 ![](_static/smftools_preprocessing_diagram.png)
 
+
+## Variant Usage
+
+This command performs DNA sequence variation based analyses on the anndata object.
+
+```shell
+smftools variant "/Path_to_experiment_config.csv"
+```
+
+## Chimeric Usage
+
+This command performs putative PCR chimera detection on the anndata object.
+
+```shell
+smftools chimeric "/Path_to_experiment_config.csv"
+```
+
 ## Spatial Usage
 
-This command performs spatial analysis on the anndata object. It automatically runs the load command and preprocessing under the hood if they have not been already run.
+This command performs spatial analysis on the anndata object.
 
 ```shell
 smftools spatial "/Path_to_experiment_config.csv"
 ```
 
-- Currently Includes: Position X Position correlation matrices, clustering, dimensionality reduction, spatial autocorrelation. 
+- Currently Includes: Position X Position correlation matrices, read x position clustermaps, and spatial autocorrelation. 
 
 ## HMM Usage
 
-This command performs hmm based feature annotation on the anndata object. It automatically runs the load command and preprocessing under the hood if they have not been already run.
+This command performs hmm based feature annotation on the anndata object.
 
 ```shell
 smftools hmm "/Path_to_experiment_config.csv"
 ```
 
 - Main outputs will be stored in adata.layers
 
+
+## Latent Usage
+
+This command constructs various latent representations of the anndata object.
+
+```shell
+smftools latent "/Path_to_experiment_config.csv"
+```
+
+## Full Usage
+
+This command is a wrapper that sequentially runs load, preprocess, variant, chimeric, spatial, hmm, latent workflows.
+
+```shell
+smftools full "/Path_to_experiment_config.csv"
+```
+
 ## Batch Usage
 
 This command performs batch processing of any of the above commands across multiple experiments. It takes in a tsv, txt, or csv of experiment specific config csvs.
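
The Full Usage section above describes `smftools full` as a wrapper that runs the individual workflows in sequence. As a rough mental model, and assuming the wrapper behaves like invoking each stage in turn with the same config path, the equivalent per-stage commands can be listed as follows. The helper name `full_equivalent_commands` is hypothetical; only the subcommand names come from the docs in this diff.

```python
import shlex

# Stage order as listed for `smftools full`.
STAGES = ["load", "preprocess", "variant", "chimeric", "spatial", "hmm", "latent"]


def full_equivalent_commands(config_path: str) -> list[str]:
    """Return the per-stage shell commands that `smftools full` is described as wrapping."""
    quoted = shlex.quote(config_path)  # protect paths that contain spaces
    return [f"smftools {stage} {quoted}" for stage in STAGES]


commands = full_equivalent_commands("/path/to/experiment_config.csv")
for cmd in commands:
    print(cmd)
```

The sketch only builds the command strings; it does not run anything, and it says nothing about how the real wrapper reuses existing stage outputs.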
 
@@ -3,13 +3,13 @@
 ## Quick start
 
 Most CLI workflows start with an experiment configuration CSV that points to your data, FASTA, and
-output directory. Once the configuration is ready, you can run commands like:
+output directory. Once the configuration is ready, you can run commands such as:
 
 ```shell
 smftools load /path/to/experiment_config.csv
 smftools preprocess /path/to/experiment_config.csv
-smftools spatial /path/to/experiment_config.csv
-smftools hmm /path/to/experiment_config.csv
+smftools full /path/to/experiment_config.csv
+smftools batch full /path/to/config_paths.csv
 ```
 
 Each command will create (or reuse) stage-specific AnnData files in the output directory. Later
@@ -26,48 +26,93 @@ The load command builds the raw AnnData object from your raw sequencing data. It
 - Performs basecalling, alignment, demultiplexing, and BAM QC.
 - Optionally generates BED/bigWig outputs for alignment summaries.
 - Constructs the raw AnnData object (Single molecules x Positional coordinates).
-- Adds basic read-level QC annotations.
+- adata.X contains binarized modification data (conversion/deaminase), or modification probabilities (native).
+- Adds basic read-level QC annotations (Read start, end, length, mean quality).
+- Adds layers encoding read DNA sequences, base quality scores, base mismatches.
+- Maintains BAM Tags/Flags in adata.obs.
 - Writes the raw AnnData to the canonical output path and runs MultiQC.
 - Optionally deletes intermediate BAMs, H5ADs, and TSVs.
 
 ### `smftools preprocess`
 
 The preprocess command performs QC, binarization, filtering, and duplicate detection. It:
 
+- Requires an AnnData created by smftools load.
 - Loads sample sheet metadata (if provided).
 - Generates read length/quality QC plots and filters reads on these metrics.
 - Binarizes direct-modification calls based on thresholds (hard or fit thresholds).
-- Cleans NaNs in adata.layers.
-- Computes positional coverage and base-context annotations.
+- Cleans NaNs from adata.X and stores in adata.layers (nan0_0minus1, nan_half).
+- Computes positional coverage and base-context annotations (GpC, CpG, ambiguous, other C, any C).
 - Calculates read modification statistics and QC plots.
 - Filters reads based on modification thresholds.
-- Adds base-context binary layers.
-- Flags duplicate reads and performs complexity analyses (conversion/deamination workflows).
-- Writes preprocessed and deduplicated AnnData outputs.
+- Adds base-context binary modification layers.
+- Optionally inverts and reindexes the data along the var (positions) axis.
+- Flags duplicate reads based on nearest neighbor hamming distance of overlapping valid sites (Conversion/deamination).
+- Performs complexity analyses using duplicate read clusters and Lander/Waterman fits (conversion/deamination workflows).
+- Visualizes read span masks and base quality scores with clustermaps.
+- Writes preprocessed (duplicates flagged, but kept) and preprocessed/deduplicated AnnData outputs.
+
+### `smftools variant`
+
+The variant command focuses on DNA sequence variation analyses. It:
+
+- Requires at least a preprocessed AnnData object.
+- Calculates position level variation frequencies per reference/sample.
+- Generates z-scores for variant occurrence given read level Q-scores and assuming uniform Palt transitions.
+- Visualizes read DNA sequence encodings and mismatch encodings.
+
+### `smftools chimeric`
+
+The chimeric command is meant to find putative PCR chimeras. It:
+
+- Requires at least a preprocessed AnnData object.
+- Performs sliding window nearest neighbor hamming distance analysis per read.
+- Visualizes the windowed nearest neighbor hamming distances per read.
+- Assembles maximum spanning intervals of 0-hamming distance neighbors per read within the reference/sample.
+- In progress.
 
 ### `smftools spatial`
 
 The spatial command runs downstream spatial analyses on the preprocessed data. It:
 
+- Requires at least a preprocessed AnnData object.
 - Optionally loads sample sheet metadata.
-- Optionally inverts and reindexes the data along the reference axis.
+- Optionally inverts and reindexes the data along the positions axis.
 - Generates clustermaps for preprocessed (and deduplicated) AnnData.
-- Runs PCA/UMAP/Leiden clustering.
 - Computes spatial autocorrelation, rolling metrics, and grid summaries.
-- Generates positionwise correlation matrices (non-direct modalities).
+- Generates positionwise correlation matrices.
 - Writes the spatial AnnData output.
 
 ### `smftools hmm`
 
 The hmm command adds HMM-based feature annotation and summary plots. It:
 
-- Ensures preprocessing and spatial analyses are up to date.
+- Requires at least a preprocessed AnnData object.
 - Fits or reuses HMM models for configured feature sets.
-- Annotates AnnData with HMM-derived layers and merged intervals.
+- Annotates AnnData with HMM-derived feature layers (State layers and probability layers)
 - Calls HMM feature peaks and writes peak-calling outputs.
-- Generates clustermaps, rolling traces, and fragment size plots for HMM layers.
+- Generates clustermaps, bulk feature traces, and fragment size distribution plots for HMM layers.
 - Writes the HMM AnnData output.
 
+### `smftools latent`
+
+The latent command constructs latent representations of the data. It:
+
+- Requires at least a preprocessed AnnData object.
+- Runs various dimensionality reduction and graph construction modalities:
+    - Principal component analysis (PCA)
+    - K-nearest neighbor (KNN)
+    - Uniform manifold approximation and projection (UMAP)
+    - Non-negative matrix factorization (NMF)
+    - Canonical polyadic decomposition (PARAFAC)
+
+### `smftools full`
+
+The full command is a workflow wrapper. It runs the following sequentially:
+
+- Load / preprocess / variant / chimeric / spatial / hmm / latent.
+
+
 ## Batch processing
 
 Use the batch command to run a single task across multiple experiments.
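
The `smftools preprocess` notes above say duplicates are flagged by nearest neighbor Hamming distance over overlapping valid sites. A minimal pure-Python sketch of that idea follows, assuming that "valid" means non-NaN in both reads and that a read is flagged when its nearest neighbor falls at or below a threshold. The function names and the threshold default are hypothetical, and the real implementation may differ (for example in how it picks which member of a duplicate pair to keep).

```python
import math
from typing import Optional, Sequence


def masked_hamming(a: Sequence[float], b: Sequence[float]) -> Optional[float]:
    """Fraction of mismatches among sites that are valid (non-NaN) in both reads."""
    shared = [(x, y) for x, y in zip(a, b) if not (math.isnan(x) or math.isnan(y))]
    if not shared:
        return None  # no overlapping valid sites, so the distance is undefined
    return sum(x != y for x, y in shared) / len(shared)


def flag_duplicates(reads: Sequence[Sequence[float]], threshold: float = 0.0) -> list[bool]:
    """Flag each read whose nearest neighbor distance is <= threshold."""
    flags = []
    for i, read in enumerate(reads):
        distances = []
        for j, other in enumerate(reads):
            if i == j:
                continue
            dist = masked_hamming(read, other)
            if dist is not None:
                distances.append(dist)
        flags.append(bool(distances) and min(distances) <= threshold)
    return flags


nan = float("nan")
reads = [
    [1, 0, 1, nan],  # matches the second read at every shared valid site
    [1, 0, 1, 1],
    [0, 1, 0, 0],  # differs from both other reads everywhere
]
flags = flag_duplicates(reads)
print(flags)
```

This all-pairs loop is quadratic in the number of reads and is meant only to show the distance definition, not a scalable approach.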
 
@@ -147,7 +147,7 @@ torch = [
 all = [
     # cluster
     "fastcluster",
-    "igraph"
+    "igraph",
     "leidenalg",
 
     # informatics