
Commit e3cb9f4

Merge pull request #175 from jkmckenna/0.2.5
doc cli tutorials / begin adding logging functionality

2 parents 0f70dbc + 42b70c0

File tree: 11 files changed (+246, −23 lines)


.github/workflows/ci.yml

Lines changed: 3 additions & 3 deletions
@@ -60,7 +60,7 @@ jobs:
   strategy:
     fail-fast: false
     matrix:
-      python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
+      python-version: ["3.10"]

   steps:
     - name: Check out repository
@@ -85,7 +85,7 @@ jobs:
     - name: Install dependencies
       run: |
         python -m pip install --upgrade pip
-        python -m pip install .[tests]
+        python -m pip install .[dev]

    - name: Run smoke tests
      run: pytest -m smoke -q
@@ -95,7 +95,7 @@ jobs:
   strategy:
     fail-fast: false
     matrix:
-      python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
+      python-version: ["3.10", "3.11", "3.12"]

   steps:
     - name: Check out repository

docs/source/tutorials/cli_usage.md

Lines changed: 91 additions & 0 deletions
# Command line tutorials

## Quick start

Most CLI workflows start with an experiment configuration CSV that points to your data, FASTA, and
output directory. Once the configuration is ready, you can run commands like:

```shell
smftools load /path/to/experiment_config.csv
smftools preprocess /path/to/experiment_config.csv
smftools spatial /path/to/experiment_config.csv
smftools hmm /path/to/experiment_config.csv
```

Each command will create (or reuse) stage-specific AnnData files in the output directory. Later
commands reuse results from earlier stages unless you explicitly force a redo via configuration
flags.

## What each command does

### `smftools load`

The load command builds the raw AnnData object from your raw sequencing data. It:

- Handles input formats (fast5/pod5/fastq/bam).
- Performs basecalling, alignment, demultiplexing, and BAM QC.
- Optionally generates BED/bigWig outputs for alignment summaries.
- Constructs the raw AnnData object (single molecules x positional coordinates).
- Adds basic read-level QC annotations.
- Writes the raw AnnData to the canonical output path and runs MultiQC.
- Optionally deletes intermediate BAMs, H5ADs, and TSVs.

### `smftools preprocess`

The preprocess command performs QC, binarization, filtering, and duplicate detection. It:

- Loads sample sheet metadata (if provided).
- Generates read length/quality QC plots and filters reads on these metrics.
- Binarizes direct-modification calls based on thresholds (hard or fit thresholds).
- Cleans NaNs in adata.layers.
- Computes positional coverage and base-context annotations.
- Calculates read modification statistics and QC plots.
- Filters reads based on modification thresholds.
- Adds base-context binary layers.
- Flags duplicate reads and performs complexity analyses (conversion/deamination workflows).
- Writes preprocessed and deduplicated AnnData outputs.

### `smftools spatial`

The spatial command runs downstream spatial analyses on the preprocessed data. It:

- Optionally loads sample sheet metadata.
- Optionally inverts and reindexes the data along the reference axis.
- Generates clustermaps for preprocessed (and deduplicated) AnnData.
- Runs PCA/UMAP/Leiden clustering.
- Computes spatial autocorrelation, rolling metrics, and grid summaries.
- Generates positionwise correlation matrices (non-direct modalities).
- Writes the spatial AnnData output.

### `smftools hmm`

The hmm command adds HMM-based feature annotation and summary plots. It:

- Ensures preprocessing and spatial analyses are up to date.
- Fits or reuses HMM models for configured feature sets.
- Annotates AnnData with HMM-derived layers and merged intervals.
- Calls HMM feature peaks and writes peak-calling outputs.
- Generates clustermaps, rolling traces, and fragment size plots for HMM layers.
- Writes the HMM AnnData output.

## Batch processing

Use the batch command to run a single task across multiple experiments.

```shell
smftools batch preprocess /path/to/config_paths.csv
```

The batch command accepts:

- **CSV/TSV** tables with a column of config paths (default column name: `config_path`).
- **TXT** files with one config path per line.

You can override the column name or delimiter if needed:

```shell
smftools batch spatial /path/to/configs.tsv --column my_config --sep $'\t'
```

Each path is validated; missing configs are skipped with a message, while valid configs run the
requested task in sequence.
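The batch table handling described above can be sketched in Python. This is a minimal illustration under the documented behavior (column selection for CSV/TSV, one-path-per-line for TXT, skip missing configs with a message), not smftools' actual implementation — the helper names here are hypothetical.

```python
import csv
from pathlib import Path


def read_config_paths(table: str, column: str = "config_path", sep: str = ",") -> list[Path]:
    """Collect config paths from a CSV/TSV table (by column) or a TXT file (one per line)."""
    path = Path(table)
    if path.suffix.lower() == ".txt":
        lines = path.read_text().splitlines()
        return [Path(line.strip()) for line in lines if line.strip()]
    with path.open(newline="") as handle:
        reader = csv.DictReader(handle, delimiter=sep)
        return [Path(row[column]) for row in reader if row.get(column)]


def run_batch(table: str, task, column: str = "config_path", sep: str = ",") -> None:
    """Validate each config path; skip missing ones with a message, run the rest in sequence."""
    for config in read_config_paths(table, column=column, sep=sep):
        if not config.exists():
            print(f"Skipping missing config: {config}")
            continue
        task(config)
```

Running each valid config in sequence (rather than in parallel) keeps per-experiment logs and outputs easy to attribute.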
docs/source/tutorials/experiment_config.md

Lines changed: 51 additions & 0 deletions

# Experiment configuration CSV

smftools uses an experiment configuration CSV to define paths, modality settings, and workflow
options. You can start from the repository template (`experiment_config.csv`) and fill in your
experiment-specific values.

## CSV format

The configuration CSV is a table with the following columns:

| Column | Description |
| --- | --- |
| `variable` | Configuration key name (used by smftools). |
| `value` | Your value for this key. |
| `help` | Short description of the key. |
| `options` | Expected values (when applicable). |
| `type` | Expected value type (`str`, `int`, `float`, `list`). |

A shortened example looks like:

```csv
variable,value,help,options,type
smf_modality,conversion,Modality of SMF. Can either be conversion or direct.,"conversion, direct",str
input_data_path,/path_to_POD5_directory,Path to directory/file containing input sequencing data,,str
fasta,/path_to_fasta.fasta,Path to initial FASTA file,,str
output_directory,/outputs,Directory to act as root for all analysis outputs,,str
experiment_name,,An experiment name for the final h5ad file,,str
```

## Common fields

Below are some of the most commonly edited fields and how they affect the CLI workflows:

- `smf_modality`: Defines whether the data is `conversion`, `direct`, or `deaminase`, which determines
  preprocessing and HMM feature handling.
- `input_data_path`: Location of raw input data (fast5/pod5/fastq/bam).
- `fasta`: Reference FASTA for alignment and positional context.
- `fasta_regions_of_interest`: Optional BED file to subset the FASTA.
- `output_directory`: Root output folder for all generated AnnData files and plots.
- `experiment_name`: Base name used for output AnnData files.
- `model_dir` / `model`: Dorado basecalling model configuration (nanopore runs).
- `barcode_kit`: Demultiplexing configuration for barcoded nanopore experiments.
- `mapping_threshold`: Minimum mapping proportion per reference required for downstream steps.
- `mod_list`: Modification calls to use for direct-modality workflows.
- `conversion_types`: Target modification types for conversion workflows.

## Tips

- Keep paths absolute whenever possible to avoid ambiguity.
- Lists are written in bracketed form, e.g. `[5mC]` or `[5mC_5hmC]`.
- If you update the CSV, re-run the CLI command pointing at the updated file.
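The variable/value layout above can be parsed with a few lines of Python. This is a simplified sketch — the function name is hypothetical, and smftools' own loader also applies the declared `type` column and package defaults — but it shows how the bracketed list form maps onto Python lists.

```python
import csv


def read_experiment_config(path: str) -> dict:
    """Read the variable/value columns of an experiment configuration CSV.

    Bracketed values such as "[5mC_5hmC, 6mA]" are split into lists; everything
    else is kept as a string.
    """
    config = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            value = (row.get("value") or "").strip()
            if value.startswith("[") and value.endswith("]"):
                items = [item.strip() for item in value[1:-1].split(",")]
                config[row["variable"]] = [item for item in items if item]
            else:
                config[row["variable"]] = value
    return config
```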

docs/source/tutorials/index.md

Lines changed: 11 additions & 1 deletion
@@ -1,3 +1,13 @@
 # Tutorials

-## Basic workflows
+```{toctree}
+:maxdepth: 1
+
+cli_usage
+experiment_config
+```
+
+## Basic workflows
+
+- Command-line walkthroughs and batch processing examples.
+- Experiment configuration CSV layout and field descriptions.

pyproject.toml

Lines changed: 4 additions & 6 deletions
@@ -5,7 +5,7 @@ build-backend = "hatchling.build"
 [project]
 name = "smftools"
 description = "Single Molecule Footprinting Analysis in Python."
-requires-python = ">=3.10,<3.15"
+requires-python = ">=3.10"
 license = { file = "LICENSE" }
 authors = [
   {name = "Joseph McKenna"}
@@ -35,6 +35,8 @@ classifiers = [
   "Programming Language :: Python :: 3.10",
   "Programming Language :: Python :: 3.11",
   "Programming Language :: Python :: 3.12",
+  "Programming Language :: Python :: 3.13",
+  "Programming Language :: Python :: 3.14",
   "Topic :: Scientific/Engineering :: Bio-Informatics",
   "Topic :: Scientific/Engineering :: Visualization"
 ]
@@ -78,12 +80,8 @@ Documentation = "https://smftools.readthedocs.io/"
 smftools = "smftools.cli_entry:cli"

 [project.optional-dependencies]
-tests = [
-  "pytest",
-  "pytest-cov"
-]

-dev = ["ruff", "pre-commit"]
+dev = ["ruff", "pre-commit", "pytest", "pytest-cov"]

 docs = [
   "sphinx>=7",

src/smftools/cli_entry.py

Lines changed: 18 additions & 2 deletions
@@ -1,3 +1,4 @@
+import logging
 from pathlib import Path
 from typing import Sequence

@@ -8,13 +9,28 @@
 from .cli.load_adata import load_adata
 from .cli.preprocess_adata import preprocess_adata
 from .cli.spatial_adata import spatial_adata
+from .logging_utils import setup_logging
 from .readwrite import concatenate_h5ads


 @click.group()
-def cli():
+@click.option(
+    "--log-file",
+    type=click.Path(dir_okay=False, writable=True, path_type=Path),
+    default=None,
+    help="Optional file path to write smftools logs.",
+)
+@click.option(
+    "--log-level",
+    type=click.Choice(["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"], case_sensitive=False),
+    default="INFO",
+    show_default=True,
+    help="Logging level for smftools output.",
+)
+def cli(log_file: Path | None, log_level: str):
     """Command-line interface for smftools."""
-    pass
+    level = getattr(logging, log_level.upper(), logging.INFO)
+    setup_logging(level=level, log_file=log_file)


 ####### Load anndata from raw data ###########

src/smftools/config/default.yaml

Lines changed: 0 additions & 2 deletions
@@ -9,9 +9,7 @@ device: "auto"

 ######## smftools load params #########
 # Generic i/o
-bam_suffix: ".bam"
 recursive_input_search: True
-split_dir: "demultiplexed_BAMs"
 strands:
   - bottom
   - top

src/smftools/config/discover_input_files.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,10 +3,12 @@
33
from pathlib import Path
44
from typing import Any, Dict, List, Union
55

6+
from smftools.constants import BAM_SUFFIX
7+
68

79
def discover_input_files(
810
input_data_path: Union[str, Path],
9-
bam_suffix: str = ".bam",
11+
bam_suffix: str = BAM_SUFFIX,
1012
recursive: bool = False,
1113
follow_symlinks: bool = False,
1214
) -> Dict[str, Any]:

src/smftools/config/experiment_config.py

Lines changed: 11 additions & 8 deletions
@@ -8,6 +8,8 @@
 from pathlib import Path
 from typing import IO, Any, Dict, List, Optional, Sequence, Tuple, Union

+from smftools.constants import BAM_SUFFIX, MOD_LIST, MOD_MAP, SPLIT_DIR
+
 from .discover_input_files import discover_input_files

 # Optional dependency for YAML handling
@@ -652,11 +654,11 @@ class ExperimentConfig:
     input_data_path: Optional[str] = None
     output_directory: Optional[str] = None
     fasta: Optional[str] = None
-    bam_suffix: str = ".bam"
+    bam_suffix: str = BAM_SUFFIX
     recursive_input_search: bool = True
     input_type: Optional[str] = None
     input_files: Optional[List[Path]] = None
-    split_dir: str = "demultiplexed_BAMs"
+    split_dir: str = SPLIT_DIR
     split_path: Optional[str] = None
     strands: List[str] = field(default_factory=lambda: ["bottom", "top"])
     conversions: List[str] = field(default_factory=lambda: ["unconverted"])
@@ -708,10 +710,10 @@ class ExperimentConfig:
     hm5C_threshold: float = 0.7
     thresholds: List[float] = field(default_factory=list)
     mod_list: List[str] = field(
-        default_factory=lambda: ["5mC_5hmC", "6mA"]
+        default_factory=lambda: list(MOD_LIST)
     )  # Dorado modified basecalling codes
     mod_map: Dict[str, str] = field(
-        default_factory=lambda: {"6mA": "6mA", "5mC_5hmC": "5mC"}
+        default_factory=lambda: dict(MOD_MAP)
     )  # Map from dorado modified basecalling codes to codes used in modkit_extract_to_adata function

     # Alignment params
@@ -1058,7 +1060,7 @@ def from_var_dict(
     elif input_data_path.is_dir():
         found = discover_input_files(
             input_data_path,
-            bam_suffix=merged["bam_suffix"],
+            bam_suffix=merged.get("bam_suffix", BAM_SUFFIX),
             recursive=merged["recursive_input_search"],
         )

@@ -1093,7 +1095,7 @@ def from_var_dict(
     summary_file = output_dir / summary_file_basename

     # Demultiplexing output path
-    split_dir = merged.get("split_dir", "demultiplexed_BAMs")
+    split_dir = merged.get("split_dir", SPLIT_DIR)
     split_path = output_dir / split_dir

     # final normalization
@@ -1228,7 +1230,7 @@ def from_var_dict(
         barcode_kit=merged.get("barcode_kit"),
         fastq_barcode_map=merged.get("fastq_barcode_map"),
         fastq_auto_pairing=merged.get("fastq_auto_pairing"),
-        bam_suffix=merged.get("bam_suffix", ".bam"),
+        bam_suffix=merged.get("bam_suffix", BAM_SUFFIX),
         split_dir=split_dir,
         split_path=split_path,
         strands=merged.get("strands", ["bottom", "top"]),
@@ -1261,7 +1263,8 @@ def from_var_dict(
         m5C_threshold=merged.get("m5C_threshold", 0.7),
         hm5C_threshold=merged.get("hm5C_threshold", 0.7),
         thresholds=merged.get("thresholds", []),
-        mod_list=merged.get("mod_list", ["5mC_5hmC", "6mA"]),
+        mod_list=merged.get("mod_list", list(MOD_LIST)),
+        mod_map=merged.get("mod_map", list(MOD_MAP)),
         batch_size=merged.get("batch_size", 4),
         skip_unclassified=merged.get("skip_unclassified", True),
         delete_batch_hdfs=merged.get("delete_batch_hdfs", True),

src/smftools/constants.py

Lines changed: 3 additions & 0 deletions
@@ -20,5 +20,8 @@ def _deep_freeze(obj: Any) -> Any:
 BAM_SUFFIX: Final[str] = ".bam"
 SPLIT_DIR: Final[str] = "demultiplexed_BAMs"

+_private_mod_list = ("5mC_5hmC", "6mA")
+MOD_LIST: Final[tuple[str, ...]] = _deep_freeze(_private_mod_list)
+
 _private_mod_map: Dict[str, str] = {"6mA": "6mA", "5mC_5hmC": "5mC"}
 MOD_MAP: Final[Mapping[str, str]] = _deep_freeze(_private_mod_map)
