Commit dca153e

Merge pull request #26 from poldrack/text/workflows-Dec19 ("Text/workflows dec19")
2 parents c0434ca + f7e942c

99 files changed: +14351 -217 lines

.gitignore (4 additions, 0 deletions)

```diff
@@ -5,6 +5,10 @@ __pycache__
 .hypothesis
 .env
 ._*
+.snakemake
+
+# Workflow output directories
+**/simple_workflow/*/output/
 
 data
 exports
```

CLAUDE.md (93 additions, 0 deletions; new file)

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is an open-source book on building better code for science using AI, authored by Russell Poldrack. The rendered book is published at https://poldrack.github.io/BetterCodeBetterScience/.

## Build Commands

```bash
# Install dependencies (uses the uv package manager)
uv pip install -r pyproject.toml
uv pip install -e .

# Build the book as HTML and serve it locally
myst build --html
npx serve _build/html

# Build the PDF (requires LaTeX)
jupyter-book build book/ --builder pdflatex

# Clean build artifacts
rm -rf book/_build
```

## Testing

```bash
# Run all tests
pytest

# Run tests with coverage
pytest --cov=src/BetterCodeBetterScience --cov-report term-missing

# Run specific test modules
pytest tests/textmining/
pytest tests/property_based_testing/
pytest tests/narps/

# Run tests with specific markers
pytest -m unit
pytest -m integration
```

Test markers defined in pyproject.toml: `unit` and `integration`.

## Linting and Code Quality

```bash
# Spell checking (configured in pyproject.toml)
codespell

# Python linting and formatting
ruff check .
ruff format .

# Pre-commit hooks (runs codespell)
pre-commit run --all-files
```

## Project Structure

- `book/` - MyST markdown chapters (configured in myst.yml)
- `src/BetterCodeBetterScience/` - Example Python code referenced in book chapters
- `tests/` - Test examples demonstrating testing concepts from the book
- `data/` - Data files for examples
- `scripts/` - Utility scripts
- `_build/` - Build output (gitignored)

## Key Configuration Files

- `myst.yml` - MyST book configuration (table of contents, exports, site settings)
- `pyproject.toml` - Python dependencies, pytest config, codespell settings
- `.pre-commit-config.yaml` - Pre-commit hooks (codespell)

## Contribution Guidelines

- New text should be authored by a human (AI may be used to check or improve text)
- Code examples should follow PEP 8
- Avoid introducing new dependencies when possible
- Custom words for codespell are listed in `project-words.txt`

## Notes for Development

- Think about the problem before generating code.
- Write code that is clean and modular; prefer shorter functions/methods over longer ones.
- Rely on widely used packages (such as numpy, pandas, and scikit-learn); avoid unknown packages from GitHub.
- Do not include *any* code in `__init__.py` files.
- Use pytest for testing.
- Use functions rather than classes for tests, and use pytest fixtures to share resources between tests.
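These last two guidelines translate into a simple pattern. A minimal sketch of function-based tests sharing a resource through a pytest fixture (the dataframe and assertions here are hypothetical examples, not code from the repository):

```python
import pandas as pd
import pytest


@pytest.fixture
def sample_frame() -> pd.DataFrame:
    """Shared resource: a small, deterministic dataframe reused across tests."""
    return pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0]})


def test_all_columns_numeric(sample_frame):
    # A function-based test (no class) that consumes the fixture by name.
    assert all(pd.api.types.is_numeric_dtype(dt) for dt in sample_frame.dtypes)


def test_perfect_correlation(sample_frame):
    # 'b' is an exact linear function of 'a', so the correlation is 1.
    assert sample_frame["a"].corr(sample_frame["b"]) == pytest.approx(1.0)
```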

book/images/simple-DAG.png (89.9 KB)

book/images/snakemake-DAG.png (37.4 KB)

book/workflows.md (666 additions, 96 deletions; large diff not rendered by default)

myst.yml (1 addition, 1 deletion)

```diff
@@ -23,7 +23,7 @@ project:
 - file: book/project_organization.md
 - file: book/data_management.md
 # - file: workflows
-# - file: validation_robustness.md
+# - file: validation.md
 # - file: performance
 # - file: HPC
 # - file: sharing
```

problems_to_solve.md (87 additions, 0 deletions; new file)

## Problems to be fixed

Open problems are marked with [ ]; fixed problems are marked with [x].

[x] I would like to generate a new example of a very simple pandas-based data analysis workflow for demonstrating the features of Prefect and Snakemake. Put the new code into src/BetterCodeBetterScience/simple_workflow. The example should include separate modules that implement each of the following functions (a standalone sketch of the analysis follows this item):
  - Load these two files (using the first column as the index for each):
    - https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/meaningful_variables_clean.csv
    - https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/demographics.csv
  - Filter out any non-numerical variables from each
  - Join the two data frames based on the index
  - Compute the correlation matrix across all measures using Spearman correlation
  - Generate a clustered heatmap from the correlation matrix using Seaborn

  Resolution:
  - Created `simple_workflow/` directory with modular functions:
    - `load_data.py`: functions to load CSV data from URLs, with optional caching
    - `filter_data.py`: functions to filter dataframes to numerical columns only
    - `join_data.py`: functions to join dataframes on their index
    - `correlation.py`: functions to compute Spearman correlation matrices
    - `visualization.py`: functions to generate clustered heatmaps with Seaborn
  - Created `prefect_workflow/` subdirectory:
    - `tasks.py`: Prefect task definitions wrapping each workflow function
    - `flows.py`: main flow orchestrating all steps
    - `run_workflow.py`: CLI entry point
    - Usage: `python run_workflow.py --output-dir ./output`
  - Created `snakemake_workflow/` subdirectory:
    - `Snakefile`: workflow rules with dependencies
    - `config/config.yaml`: configuration for URLs and heatmap settings
    - `scripts/*.py`: scripts for each workflow step
    - `report/`: RST files for Snakemake report generation
    - Usage: `snakemake --cores 1 --config output_dir=/path/to/output`
  - Created `make_workflow/` subdirectory:
    - `Makefile`: GNU Make-based workflow with proper dependencies
    - `scripts/*.py`: standalone CLI scripts for each step
    - Usage: `make OUTPUT_DIR=/path/to/output all`
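The analysis itself is small enough to sketch end to end. The following is a hedged, single-file approximation of the five steps above; the actual implementation splits them across the modules listed, and `run_pipeline` is an illustrative name:

```python
import pandas as pd
import seaborn as sns

BASE = (
    "https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/"
    "refs/heads/master/Data/Complete_02-16-2019/"
)


def run_pipeline(output_path: str = "heatmap.png") -> None:
    # 1. Load both files, using the first column as the index.
    measures = pd.read_csv(BASE + "meaningful_variables_clean.csv", index_col=0)
    demographics = pd.read_csv(BASE + "demographics.csv", index_col=0)
    # 2. Filter out non-numerical variables.
    measures = measures.select_dtypes(include="number")
    demographics = demographics.select_dtypes(include="number")
    # 3. Join the two frames on their shared index.
    joined = measures.join(demographics, how="inner")
    # 4. Spearman correlation across all measures.
    corr = joined.corr(method="spearman")
    # 5. Clustered heatmap via Seaborn.
    grid = sns.clustermap(corr, cmap="vlag", center=0)
    grid.savefig(output_path, dpi=150)


if __name__ == "__main__":
    run_pipeline()
```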
[x] For the Snakemake workflow, I would like to use the Snakemake report-generation functions to create a report showing the results from each of the analyses. (A sketch of the `report()` pattern follows this item.)
  - Added the global `report: "report/workflow.rst"` declaration to the Snakefile
  - Created a `report/` directory with RST caption files for each figure type
  - Updated the preprocessing.smk rules (filtering, qc, dimred, clustering) to declare figures as outputs with the `report()` wrapper
  - Updated the pseudobulk.smk checkpoint to include the pseudobulk figure with the `report()` wrapper
  - Updated the per_cell_type.smk rules (GSEA, Enrichr, prediction) to include figures with the `report()` wrapper and a cell_type subcategory
  - Updated `aggregate_per_cell_type_outputs()` in common.smk to include the figure files
  - Added `report` and `report-zip` targets to the Makefile
  - Updated WORKFLOW_OVERVIEW.md with report-generation documentation
  - Usage: `snakemake --report report.html --config datadir=/path/to/data` or `make report`
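For context, the `report()` wrapper marks an output file for inclusion in the HTML report that `snakemake --report` generates. A minimal hypothetical rule in the style described above (rule, file, and category names are illustrative):

```python
# Snakefile sketch
report: "report/workflow.rst"  # global description shown at the top of the report

rule qc_figure:
    input:
        "wf_snakemake/results/qc_metrics.csv",
    output:
        report(
            "wf_snakemake/figures/qc.png",
            caption="report/qc.rst",
            category="Quality control",
        ),
    script:
        "scripts/plot_qc.py"
```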
[x] For the Prefect workflow, the default parameters for each workflow module are embedded in the Python code for the workflow. I would rather they be defined in a configuration file. Please extract all of the parameters into a configuration file (using whatever format you think is most appropriate) and read them in during workflow execution rather than hard-coding them. (A sketch of the resulting config loader follows this item.)
  - Created `prefect_workflow/config/config.yaml` with all workflow parameters
  - Parameters are organized by step: filtering, qc, preprocessing, dimred, clustering, pseudobulk, differential_expression, pathway_analysis, overrepresentation, predictive_modeling
  - Added a `load_config()` function to flows.py that loads parameters from a YAML file
  - Updated `run_workflow()` and `analyze_single_cell_type()` to accept a `config_path` parameter
  - Added a `--config` CLI argument to run_workflow.py
  - The default config is bundled with the package; custom configs can be specified via the CLI
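A plausible shape for that `load_config()` helper, assuming the YAML layout described above (the actual code in flows.py may differ):

```python
from pathlib import Path

import yaml

# Assumed location of the bundled default config, relative to flows.py.
DEFAULT_CONFIG = Path(__file__).parent / "config" / "config.yaml"


def load_config(config_path: str | None = None) -> dict:
    """Load workflow parameters from YAML, falling back to the bundled default."""
    path = Path(config_path) if config_path else DEFAULT_CONFIG
    with open(path) as f:
        return yaml.safe_load(f)


# Usage inside the flow: parameters are looked up per step, e.g.
# config = load_config(args.config)
# filtering_params = config["filtering"]
```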
[x] For the Prefect workflow, please save the output to a folder called "wf_prefect" (rather than "workflow")
  - Updated all output directories in flows.py and run_workflow.py to use `wf_prefect/` instead of `workflow/`

[x] For the Snakemake workflow, please save the output to a folder called "wf_snakemake" (rather than "workflow")
  - Updated the Snakefile to use `wf_snakemake/` for CHECKPOINT_DIR, RESULTS_DIR, FIGURE_DIR, and LOG_DIR
  - Updated WORKFLOW_OVERVIEW.md to reflect the new output structure
[x] I would now like to add another workflow, with code saved to src/BetterCodeBetterScience/rnaseq/snakemake_workflow. This workflow will use the Snakemake workflow manager (https://snakemake.readthedocs.io/en/stable/index.html); otherwise it should be functionally equivalent to the other workflows already developed. (A sketch of the checkpoint pattern follows this item.)
  - Created `snakemake_workflow/` directory with:
    - `Snakefile`: main workflow entry point
    - `config/config.yaml`: all workflow parameters, with defaults
    - `rules/common.smk`: helper functions (sanitize_cell_type, aggregation functions)
    - `rules/preprocessing.smk`: rules for steps 1-6
    - `rules/pseudobulk.smk`: step 7 as a Snakemake checkpoint (enables dynamic rules)
    - `rules/per_cell_type.smk`: steps 8-11, parameterized by a {cell_type} wildcard
    - `scripts/*.py`: 12 Python scripts wrapping the modular workflow functions
  - Uses a Snakemake checkpoint for step 7 to discover cell types dynamically
  - Per-cell-type steps (8-11) are triggered automatically for all valid cell types
  - Reuses the existing modular workflow functions and checkpoint utilities
  - Added the `snakemake>=8.0` dependency to pyproject.toml
  - Usage: `snakemake --cores 8 --config datadir=/path/to/data`
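The checkpoint mechanism is what lets the DAG depend on data that does not exist until step 7 has run. A sketch of the canonical pattern, with illustrative rule and path names (`checkpoints`, `glob_wildcards`, `expand`, and `directory` are names Snakemake provides inside a Snakefile):

```python
# Snakefile sketch
checkpoint pseudobulk:
    input:
        "wf_snakemake/checkpoints/clustered.h5ad",
    output:
        directory("wf_snakemake/results/pseudobulk"),
    script:
        "scripts/pseudobulk.py"


def per_cell_type_outputs(wildcards):
    # Re-evaluated after the checkpoint completes, so the set of cell types
    # is discovered from the files it actually produced.
    ckpt_dir = checkpoints.pseudobulk.get(**wildcards).output[0]
    cell_types = glob_wildcards(f"{ckpt_dir}/{{cell_type}}.csv").cell_type
    return expand(
        "wf_snakemake/results/per_cell_type/{cell_type}/de_results.csv",
        cell_type=cell_types,
    )


rule all_per_cell_type:
    input:
        per_cell_type_outputs,
```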
[x] I would like to add a new workflow, with code saved to src/BetterCodeBetterScience/rnaseq/prefect_workflow. This workflow will use the Prefect workflow manager (https://github.com/PrefectHQ/prefect) to manage the workflow that was previously developed in src/BetterCodeBetterScience/rnaseq/stateless_workflow. The one new feature I would like to add is to perform steps 8-11 separately on each cell type that survives the initial filtering. (A sketch of the fan-out pattern follows this item.)
  - Created `prefect_workflow/` directory with:
    - `tasks.py`: Prefect task definitions wrapping the modular workflow functions
    - `flows.py`: main flow with parallel per-cell-type analysis
    - `run_workflow.py`: CLI entry point with argument parsing
  - Steps 1-7 run sequentially with checkpoint caching (reusing the existing system)
  - Steps 8-11 run in parallel for each cell type:
    - DE tasks are submitted in parallel across all cell types
    - GSEA, Enrichr, and predictive modeling run in parallel within each cell type
  - Added the `prefect>=3.0` dependency to pyproject.toml
  - Results are organized by cell type in `workflow/results/per_cell_type/`
  - The CLI supports `--force-from`, `--cell-type`, `--list-cell-types`, and `--min-samples`
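The fan-out follows Prefect's standard submit/result pattern, sketched below with hypothetical task bodies (passing a future into a downstream `.submit()` call is what creates the dependency edge):

```python
from prefect import flow, task


@task
def differential_expression(cell_type: str) -> str:
    # Placeholder for the real DE computation; returns a results path.
    return f"results/per_cell_type/{cell_type}/de_results.csv"


@task
def gsea(de_results_path: str) -> str:
    # Placeholder for pathway enrichment on the DE results.
    return de_results_path.replace("de_results", "gsea")


@flow
def per_cell_type_analysis(cell_types: list[str]) -> list[str]:
    # Submit DE for every cell type up front; Prefect runs them concurrently.
    de_futures = [differential_expression.submit(ct) for ct in cell_types]
    # Each GSEA task starts as soon as its own DE future resolves.
    gsea_futures = [gsea.submit(fut) for fut in de_futures]
    return [fut.result() for fut in gsea_futures]


if __name__ == "__main__":
    per_cell_type_analysis(["B_cells", "T_cells", "NK_cells"])
```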
Lines changed: 27 additions & 0 deletions

Prompt: please read CLAUDE.md for guidelines, and then read refactor_monolithic_to_modular.md for a description of your task.

# Goal

src/BetterCodeBetterScience/rnaseq/immune_scrnaseq_monolithic.py is currently a single monolithic script for a data analysis workflow. I would like to refactor it into a modular script based on the following decomposition of the workflow:

- Data (down)loading
- Data filtering (removing subjects or cell types with insufficient observations)
- Quality control
  - Identifying bad cells on the basis of mitochondrial, ribosomal, or hemoglobin genes
  - Identifying "doublets" (multiple cells mistakenly identified as one)
- Preprocessing
  - Count normalization
  - Log transformation
  - Identification of high-variance features
  - Filtering of nuisance genes
- Dimensionality reduction
  - UMAP generation
- Clustering
- Pseudobulking
- Differential expression analysis
- Pathway enrichment analysis (GSEA)
- Overrepresentation analysis (Enrichr)
- Predictive modeling

Please generate a new set of scripts within a new directory called `src/BetterCodeBetterScience/rnaseq/modular_workflow` that implements the same workflow in a modular way. (A sketch of one such module follows.)
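As an illustration of what one module in this decomposition might look like, here is a hedged sketch of the preprocessing step using scanpy (the function name and parameter defaults are assumptions, not code from the repository):

```python
import scanpy as sc
from anndata import AnnData


def preprocess(adata: AnnData, n_top_genes: int = 2000) -> AnnData:
    """One self-contained workflow step: count normalization, log
    transformation, and identification of high-variance features."""
    sc.pp.normalize_total(adata, target_sum=1e4)  # count normalization
    sc.pp.log1p(adata)  # log transformation
    sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes)
    return adata
```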

pyproject.toml (8 additions, 3 deletions)

```diff
@@ -29,13 +29,12 @@ dependencies = [
     "icecream>=2.1.4",
     "python-dotenv>=1.0.1",
     "pyyaml>=6.0.2",
-    "numba>=0.61.0",
+    "numba>=0.61,<0.63",
     "codespell>=2.4.1",
     "tomli>=2.2.1",
     "pre-commit>=4.2.0",
     "mdnewline>=0.1.3",
     "anthropic>=0.61.0",
-    "rpy2>=3.6.4",
     "nibabel>=5.3.2",
     "fastparquet>=2024.11.0",
     "templateflow>=25.1.1",
@@ -47,7 +46,6 @@ dependencies = [
     "datalad-osf>=0.3.0",
     "pymongo[srv]>=4.15.4",
     "mysql-connector-python>=9.5.0",
-    "mariadb>=1.1.14",
     "biopython>=1.86",
     "neo4j>=6.0.3",
     "tqdm>=4.66.5",
@@ -76,6 +74,13 @@ dependencies = [
     "fastcluster>=1.3.0",
     "scikit-misc>=0.5.2",
     "harmony-pytorch>=0.1.8",
+    "pydeseq2>=0.5.3",
+    "gseapy>=1.1.11",
+    "ipython>=9.8.0",
+    "harmonypy>=0.0.10",
+    "rpy2>=3.6.4",
+    "prefect>=3.0",
+    "snakemake>=8.0",
 ]
 
 [build-system]
```
