
Commit e3cb9f4

Merge pull request #175 from jkmckenna/0.2.5
doc cli tutorials / begin adding logging functionality

2 parents 0f70dbc + 42b70c0

File tree: 11 files changed (+246, −23 lines)


.github/workflows/ci.yml

Lines changed: 3 additions & 3 deletions
@@ -60,7 +60,7 @@ jobs:
   strategy:
     fail-fast: false
     matrix:
-      python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
+      python-version: ["3.10"]

   steps:
     - name: Check out repository
@@ -85,7 +85,7 @@ jobs:
     - name: Install dependencies
       run: |
         python -m pip install --upgrade pip
-        python -m pip install .[tests]
+        python -m pip install .[dev]

    - name: Run smoke tests
      run: pytest -m smoke -q
@@ -95,7 +95,7 @@ jobs:
   strategy:
     fail-fast: false
     matrix:
-      python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
+      python-version: ["3.10", "3.11", "3.12"]

   steps:
     - name: Check out repository

docs/source/tutorials/cli_usage.md

Lines changed: 91 additions & 0 deletions
# Command line tutorials

## Quick start

Most CLI workflows start with an experiment configuration CSV that points to your data, FASTA, and
output directory. Once the configuration is ready, you can run commands like:

```shell
smftools load /path/to/experiment_config.csv
smftools preprocess /path/to/experiment_config.csv
smftools spatial /path/to/experiment_config.csv
smftools hmm /path/to/experiment_config.csv
```

Each command will create (or reuse) stage-specific AnnData files in the output directory. Later
commands reuse results from earlier stages unless you explicitly force a redo via configuration
flags.

## What each command does

### `smftools load`

The load command builds the raw AnnData object from your raw sequencing data. It:

- Handles input formats (fast5/pod5/fastq/bam).
- Performs basecalling, alignment, demultiplexing, and BAM QC.
- Optionally generates BED/bigWig outputs for alignment summaries.
- Constructs the raw AnnData object (single molecules x positional coordinates).
- Adds basic read-level QC annotations.
- Writes the raw AnnData to the canonical output path and runs MultiQC.
- Optionally deletes intermediate BAMs, H5ADs, and TSVs.

### `smftools preprocess`

The preprocess command performs QC, binarization, filtering, and duplicate detection. It:

- Loads sample sheet metadata (if provided).
- Generates read length/quality QC plots and filters reads on these metrics.
- Binarizes direct-modification calls based on thresholds (hard or fit thresholds).
- Cleans NaNs in adata.layers.
- Computes positional coverage and base-context annotations.
- Calculates read modification statistics and QC plots.
- Filters reads based on modification thresholds.
- Adds base-context binary layers.
- Flags duplicate reads and performs complexity analyses (conversion/deamination workflows).
- Writes preprocessed and deduplicated AnnData outputs.

### `smftools spatial`

The spatial command runs downstream spatial analyses on the preprocessed data. It:

- Optionally loads sample sheet metadata.
- Optionally inverts and reindexes the data along the reference axis.
- Generates clustermaps for preprocessed (and deduplicated) AnnData.
- Runs PCA/UMAP/Leiden clustering.
- Computes spatial autocorrelation, rolling metrics, and grid summaries.
- Generates positionwise correlation matrices (non-direct modalities).
- Writes the spatial AnnData output.

### `smftools hmm`

The hmm command adds HMM-based feature annotation and summary plots. It:

- Ensures preprocessing and spatial analyses are up to date.
- Fits or reuses HMM models for configured feature sets.
- Annotates AnnData with HMM-derived layers and merged intervals.
- Calls HMM feature peaks and writes peak-calling outputs.
- Generates clustermaps, rolling traces, and fragment size plots for HMM layers.
- Writes the HMM AnnData output.

## Batch processing

Use the batch command to run a single task across multiple experiments.

```shell
smftools batch preprocess /path/to/config_paths.csv
```

The batch command accepts:

- **CSV/TSV** tables with a column of config paths (default column name: `config_path`).
- **TXT** files with one config path per line.

You can override the column name or delimiter if needed:

```shell
smftools batch spatial /path/to/configs.tsv --column my_config --sep $'\t'
```

Each path is validated; missing configs are skipped with a message, while valid configs run the
requested task in sequence.
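The batch table handling described above can be sketched in Python. This is a minimal illustration under the documented behavior (column selection for CSV/TSV, one-path-per-line for TXT, skip missing configs with a message), not smftools' actual implementation — the helper names here are hypothetical.

```python
import csv
from pathlib import Path


def read_config_paths(table: str, column: str = "config_path", sep: str = ",") -> list[Path]:
    """Collect config paths from a CSV/TSV table (by column) or a TXT file (one per line)."""
    path = Path(table)
    if path.suffix.lower() == ".txt":
        lines = path.read_text().splitlines()
        return [Path(line.strip()) for line in lines if line.strip()]
    with path.open(newline="") as handle:
        reader = csv.DictReader(handle, delimiter=sep)
        return [Path(row[column]) for row in reader if row.get(column)]


def run_batch(table: str, task, column: str = "config_path", sep: str = ",") -> None:
    """Validate each config path; skip missing ones with a message, run the rest in sequence."""
    for config in read_config_paths(table, column=column, sep=sep):
        if not config.exists():
            print(f"Skipping missing config: {config}")
            continue
        task(config)
```

Running each valid config in sequence (rather than in parallel) keeps per-experiment logs and outputs easy to attribute.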
docs/source/tutorials/experiment_config.md

Lines changed: 51 additions & 0 deletions

# Experiment configuration CSV

smftools uses an experiment configuration CSV to define paths, modality settings, and workflow
options. You can start from the repository template (`experiment_config.csv`) and fill in your
experiment-specific values.

## CSV format

The configuration CSV is a table with the following columns:

| Column | Description |
| --- | --- |
| `variable` | Configuration key name (used by smftools). |
| `value` | Your value for this key. |
| `help` | Short description of the key. |
| `options` | Expected values (when applicable). |
| `type` | Expected value type (`str`, `int`, `float`, `list`). |

A shortened example looks like:

```csv
variable,value,help,options,type
smf_modality,conversion,Modality of SMF. Can either be conversion or direct.,"conversion, direct",str
input_data_path,/path_to_POD5_directory,Path to directory/file containing input sequencing data,,str
fasta,/path_to_fasta.fasta,Path to initial FASTA file,,str
output_directory,/outputs,Directory to act as root for all analysis outputs,,str
experiment_name,,An experiment name for the final h5ad file,,str
```

## Common fields

Below are some of the most commonly edited fields and how they affect the CLI workflows:

- `smf_modality`: Defines whether the data is `conversion`, `direct`, or `deaminase`, which determines
  preprocessing and HMM feature handling.
- `input_data_path`: Location of raw input data (fast5/pod5/fastq/bam).
- `fasta`: Reference FASTA for alignment and positional context.
- `fasta_regions_of_interest`: Optional BED file to subset the FASTA.
- `output_directory`: Root output folder for all generated AnnData files and plots.
- `experiment_name`: Base name used for output AnnData files.
- `model_dir` / `model`: Dorado basecalling model configuration (nanopore runs).
- `barcode_kit`: Demultiplexing configuration for barcoded nanopore experiments.
- `mapping_threshold`: Minimum mapping proportion per reference required for downstream steps.
- `mod_list`: Modification calls to use for direct-modality workflows.
- `conversion_types`: Target modification types for conversion workflows.

## Tips

- Keep paths absolute whenever possible to avoid ambiguity.
- Lists are written in bracketed form, e.g. `[5mC]` or `[5mC_5hmC]`.
- If you update the CSV, re-run the CLI command pointing at the updated file.
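The variable/value layout above can be parsed with a few lines of Python. This is a simplified sketch — the function name is hypothetical, and smftools' own loader also applies the declared `type` column and package defaults — but it shows how the bracketed list form maps onto Python lists.

```python
import csv


def read_experiment_config(path: str) -> dict:
    """Read the variable/value columns of an experiment configuration CSV.

    Bracketed values such as "[5mC_5hmC, 6mA]" are split into lists; everything
    else is kept as a string.
    """
    config = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            value = (row.get("value") or "").strip()
            if value.startswith("[") and value.endswith("]"):
                items = [item.strip() for item in value[1:-1].split(",")]
                config[row["variable"]] = [item for item in items if item]
            else:
                config[row["variable"]] = value
    return config
```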

docs/source/tutorials/index.md

Lines changed: 11 additions & 1 deletion
@@ -1,3 +1,13 @@
 # Tutorials

-## Basic workflows
+```{toctree}
+:maxdepth: 1
+
+cli_usage
+experiment_config
+```
+
+## Basic workflows
+
+- Command-line walkthroughs and batch processing examples.
+- Experiment configuration CSV layout and field descriptions.

pyproject.toml

Lines changed: 4 additions & 6 deletions
@@ -5,7 +5,7 @@ build-backend = "hatchling.build"
 [project]
 name = "smftools"
 description = "Single Molecule Footprinting Analysis in Python."
-requires-python = ">=3.10,<3.15"
+requires-python = ">=3.10"
 license = { file = "LICENSE" }
 authors = [
   {name = "Joseph McKenna"}
@@ -35,6 +35,8 @@ classifiers = [
   "Programming Language :: Python :: 3.10",
   "Programming Language :: Python :: 3.11",
   "Programming Language :: Python :: 3.12",
+  "Programming Language :: Python :: 3.13",
+  "Programming Language :: Python :: 3.14",
   "Topic :: Scientific/Engineering :: Bio-Informatics",
   "Topic :: Scientific/Engineering :: Visualization"
 ]
@@ -78,12 +80,8 @@ Documentation = "https://smftools.readthedocs.io/"
 smftools = "smftools.cli_entry:cli"

 [project.optional-dependencies]
-tests = [
-  "pytest",
-  "pytest-cov"
-]

-dev = ["ruff", "pre-commit"]
+dev = ["ruff", "pre-commit", "pytest", "pytest-cov"]

 docs = [
   "sphinx>=7",

src/smftools/cli_entry.py

Lines changed: 18 additions & 2 deletions
@@ -1,3 +1,4 @@
+import logging
 from pathlib import Path
 from typing import Sequence

@@ -8,13 +9,28 @@
 from .cli.load_adata import load_adata
 from .cli.preprocess_adata import preprocess_adata
 from .cli.spatial_adata import spatial_adata
+from .logging_utils import setup_logging
 from .readwrite import concatenate_h5ads


 @click.group()
-def cli():
+@click.option(
+    "--log-file",
+    type=click.Path(dir_okay=False, writable=True, path_type=Path),
+    default=None,
+    help="Optional file path to write smftools logs.",
+)
+@click.option(
+    "--log-level",
+    type=click.Choice(["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"], case_sensitive=False),
+    default="INFO",
+    show_default=True,
+    help="Logging level for smftools output.",
+)
+def cli(log_file: Path | None, log_level: str):
     """Command-line interface for smftools."""
-    pass
+    level = getattr(logging, log_level.upper(), logging.INFO)
+    setup_logging(level=level, log_file=log_file)


 ####### Load anndata from raw data ###########

src/smftools/config/default.yaml

Lines changed: 0 additions & 2 deletions
@@ -9,9 +9,7 @@ device: "auto"

 ######## smftools load params #########
 # Generic i/o
-bam_suffix: ".bam"
 recursive_input_search: True
-split_dir: "demultiplexed_BAMs"
 strands:
   - bottom
   - top

src/smftools/config/discover_input_files.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,10 +3,12 @@
33
from pathlib import Path
44
from typing import Any, Dict, List, Union
55

6+
from smftools.constants import BAM_SUFFIX
7+
68

79
def discover_input_files(
810
input_data_path: Union[str, Path],
9-
bam_suffix: str = ".bam",
11+
bam_suffix: str = BAM_SUFFIX,
1012
recursive: bool = False,
1113
follow_symlinks: bool = False,
1214
) -> Dict[str, Any]:

src/smftools/config/experiment_config.py

Lines changed: 11 additions & 8 deletions
@@ -8,6 +8,8 @@
 from pathlib import Path
 from typing import IO, Any, Dict, List, Optional, Sequence, Tuple, Union

+from smftools.constants import BAM_SUFFIX, MOD_LIST, MOD_MAP, SPLIT_DIR
+
 from .discover_input_files import discover_input_files

 # Optional dependency for YAML handling
@@ -652,11 +654,11 @@ class ExperimentConfig:
     input_data_path: Optional[str] = None
     output_directory: Optional[str] = None
     fasta: Optional[str] = None
-    bam_suffix: str = ".bam"
+    bam_suffix: str = BAM_SUFFIX
     recursive_input_search: bool = True
     input_type: Optional[str] = None
     input_files: Optional[List[Path]] = None
-    split_dir: str = "demultiplexed_BAMs"
+    split_dir: str = SPLIT_DIR
     split_path: Optional[str] = None
     strands: List[str] = field(default_factory=lambda: ["bottom", "top"])
     conversions: List[str] = field(default_factory=lambda: ["unconverted"])
@@ -708,10 +710,10 @@ class ExperimentConfig:
     hm5C_threshold: float = 0.7
     thresholds: List[float] = field(default_factory=list)
     mod_list: List[str] = field(
-        default_factory=lambda: ["5mC_5hmC", "6mA"]
+        default_factory=lambda: list(MOD_LIST)
     )  # Dorado modified basecalling codes
     mod_map: Dict[str, str] = field(
-        default_factory=lambda: {"6mA": "6mA", "5mC_5hmC": "5mC"}
+        default_factory=lambda: dict(MOD_MAP)
     )  # Map from dorado modified basecalling codes to codes used in modkit_extract_to_adata function

     # Alignment params
@@ -1058,7 +1060,7 @@ def from_var_dict(
     elif input_data_path.is_dir():
         found = discover_input_files(
             input_data_path,
-            bam_suffix=merged["bam_suffix"],
+            bam_suffix=merged.get("bam_suffix", BAM_SUFFIX),
             recursive=merged["recursive_input_search"],
         )

@@ -1093,7 +1095,7 @@ def from_var_dict(
     summary_file = output_dir / summary_file_basename

     # Demultiplexing output path
-    split_dir = merged.get("split_dir", "demultiplexed_BAMs")
+    split_dir = merged.get("split_dir", SPLIT_DIR)
     split_path = output_dir / split_dir

     # final normalization
@@ -1228,7 +1230,7 @@ def from_var_dict(
         barcode_kit=merged.get("barcode_kit"),
         fastq_barcode_map=merged.get("fastq_barcode_map"),
         fastq_auto_pairing=merged.get("fastq_auto_pairing"),
-        bam_suffix=merged.get("bam_suffix", ".bam"),
+        bam_suffix=merged.get("bam_suffix", BAM_SUFFIX),
         split_dir=split_dir,
         split_path=split_path,
         strands=merged.get("strands", ["bottom", "top"]),
@@ -1261,7 +1263,8 @@ def from_var_dict(
         m5C_threshold=merged.get("m5C_threshold", 0.7),
         hm5C_threshold=merged.get("hm5C_threshold", 0.7),
         thresholds=merged.get("thresholds", []),
-        mod_list=merged.get("mod_list", ["5mC_5hmC", "6mA"]),
+        mod_list=merged.get("mod_list", list(MOD_LIST)),
+        mod_map=merged.get("mod_map", list(MOD_MAP)),
         batch_size=merged.get("batch_size", 4),
         skip_unclassified=merged.get("skip_unclassified", True),
         delete_batch_hdfs=merged.get("delete_batch_hdfs", True),

src/smftools/constants.py

Lines changed: 3 additions & 0 deletions
@@ -20,5 +20,8 @@ def _deep_freeze(obj: Any) -> Any:
 BAM_SUFFIX: Final[str] = ".bam"
 SPLIT_DIR: Final[str] = "demultiplexed_BAMs"

+_private_mod_list = ("5mC_5hmC", "6mA")
+MOD_LIST: Final[tuple[str, ...]] = _deep_freeze(_private_mod_list)
+
 _private_mod_map: Dict[str, str] = {"6mA": "6mA", "5mC_5hmC": "5mC"}
 MOD_MAP: Final[Mapping[str, str]] = _deep_freeze(_private_mod_map)
