Commit dca153e

Merge pull request #26 from poldrack/text/workflows-Dec19 ("Text/workflows dec19")
2 parents c0434ca + f7e942c

99 files changed: +14351 -217 lines

.gitignore (4 additions, 0 deletions)

```diff
@@ -5,6 +5,10 @@ __pycache__
 .hypothesis
 .env
 ._*
+.snakemake
+
+# Workflow output directories
+**/simple_workflow/*/output/
 
 data
 exports
```

CLAUDE.md (93 additions, 0 deletions; new file)

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is an open-source book on building better code for science using AI, authored by Russell Poldrack. The rendered book is published at https://poldrack.github.io/BetterCodeBetterScience/.

## Build Commands

```bash
# Install dependencies (uses the uv package manager)
uv pip install -r pyproject.toml
uv pip install -e .

# Build the book as HTML and serve it locally
myst build --html
npx serve _build/html

# Build the PDF (requires LaTeX)
jupyter-book build book/ --builder pdflatex

# Clean build artifacts
rm -rf book/_build
```

## Testing

```bash
# Run all tests
pytest

# Run tests with coverage
pytest --cov=src/BetterCodeBetterScience --cov-report term-missing

# Run specific test modules
pytest tests/textmining/
pytest tests/property_based_testing/
pytest tests/narps/

# Run tests with specific markers
pytest -m unit
pytest -m integration
```

Test markers defined in pyproject.toml: `unit` and `integration`.

## Linting and Code Quality

```bash
# Spell checking (configured in pyproject.toml)
codespell

# Python linting and formatting
ruff check .
ruff format .

# Pre-commit hooks (runs codespell)
pre-commit run --all-files
```

## Project Structure

- `book/` - MyST markdown chapters (configured in myst.yml)
- `src/BetterCodeBetterScience/` - Example Python code referenced in book chapters
- `tests/` - Test examples demonstrating testing concepts from the book
- `data/` - Data files for examples
- `scripts/` - Utility scripts
- `_build/` - Build output (gitignored)

## Key Configuration Files

- `myst.yml` - MyST book configuration (table of contents, exports, site settings)
- `pyproject.toml` - Python dependencies, pytest config, codespell settings
- `.pre-commit-config.yaml` - Pre-commit hooks (codespell)

## Contribution Guidelines

- New text should be authored by a human (AI may be used to check or improve text)
- Code examples should follow PEP 8
- Avoid introducing new dependencies when possible
- Custom words for codespell are listed in `project-words.txt`

## Notes for Development

- Think about the problem before generating code.
- Write code that is clean and modular; prefer shorter functions/methods over longer ones.
- Rely on widely used packages (such as numpy, pandas, and scikit-learn); avoid unknown packages from GitHub.
- Do not include *any* code in `__init__.py` files.
- Use pytest for testing.
- Use functions rather than classes for tests, and use pytest fixtures to share resources between tests.
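These last two guidelines translate into a simple pattern. A minimal sketch of function-based tests sharing a resource through a pytest fixture (the dataframe and assertions here are hypothetical examples, not code from the repository):

```python
import pandas as pd
import pytest


@pytest.fixture
def sample_frame() -> pd.DataFrame:
    """Shared resource: a small, deterministic dataframe reused across tests."""
    return pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0]})


def test_all_columns_numeric(sample_frame):
    # A function-based test (no class) that consumes the fixture by name.
    assert all(pd.api.types.is_numeric_dtype(dt) for dt in sample_frame.dtypes)


def test_perfect_correlation(sample_frame):
    # 'b' is an exact linear function of 'a', so the correlation is 1.
    assert sample_frame["a"].corr(sample_frame["b"]) == pytest.approx(1.0)
```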

book/images/simple-DAG.png (89.9 KB)

book/images/snakemake-DAG.png (37.4 KB)

book/workflows.md (666 additions, 96 deletions; large diff not rendered by default)

myst.yml (1 addition, 1 deletion)

```diff
@@ -23,7 +23,7 @@ project:
 - file: book/project_organization.md
 - file: book/data_management.md
 # - file: workflows
-# - file: validation_robustness.md
+# - file: validation.md
 # - file: performance
 # - file: HPC
 # - file: sharing
```

problems_to_solve.md (87 additions, 0 deletions; new file)

## Problems to be fixed

Open problems are marked with [ ]; fixed problems are marked with [x].

[x] I would like to generate a new example of a very simple pandas-based data analysis workflow for demonstrating the features of Prefect and Snakemake. Put the new code into src/BetterCodeBetterScience/simple_workflow. The example should include separate modules that implement each of the following functions (a standalone sketch of the analysis follows this item):
  - Load these two files (using the first column as the index for each):
    - https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/meaningful_variables_clean.csv
    - https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/demographics.csv
  - Filter out any non-numerical variables from each
  - Join the two data frames based on the index
  - Compute the correlation matrix across all measures using Spearman correlation
  - Generate a clustered heatmap from the correlation matrix using Seaborn

  Resolution:
  - Created `simple_workflow/` directory with modular functions:
    - `load_data.py`: functions to load CSV data from URLs, with optional caching
    - `filter_data.py`: functions to filter dataframes to numerical columns only
    - `join_data.py`: functions to join dataframes on their index
    - `correlation.py`: functions to compute Spearman correlation matrices
    - `visualization.py`: functions to generate clustered heatmaps with Seaborn
  - Created `prefect_workflow/` subdirectory:
    - `tasks.py`: Prefect task definitions wrapping each workflow function
    - `flows.py`: main flow orchestrating all steps
    - `run_workflow.py`: CLI entry point
    - Usage: `python run_workflow.py --output-dir ./output`
  - Created `snakemake_workflow/` subdirectory:
    - `Snakefile`: workflow rules with dependencies
    - `config/config.yaml`: configuration for URLs and heatmap settings
    - `scripts/*.py`: scripts for each workflow step
    - `report/`: RST files for Snakemake report generation
    - Usage: `snakemake --cores 1 --config output_dir=/path/to/output`
  - Created `make_workflow/` subdirectory:
    - `Makefile`: GNU Make-based workflow with proper dependencies
    - `scripts/*.py`: standalone CLI scripts for each step
    - Usage: `make OUTPUT_DIR=/path/to/output all`
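The analysis itself is small enough to sketch end to end. The following is a hedged, single-file approximation of the five steps above; the actual implementation splits them across the modules listed, and `run_pipeline` is an illustrative name:

```python
import pandas as pd
import seaborn as sns

BASE = (
    "https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/"
    "refs/heads/master/Data/Complete_02-16-2019/"
)


def run_pipeline(output_path: str = "heatmap.png") -> None:
    # 1. Load both files, using the first column as the index.
    measures = pd.read_csv(BASE + "meaningful_variables_clean.csv", index_col=0)
    demographics = pd.read_csv(BASE + "demographics.csv", index_col=0)
    # 2. Filter out non-numerical variables.
    measures = measures.select_dtypes(include="number")
    demographics = demographics.select_dtypes(include="number")
    # 3. Join the two frames on their shared index.
    joined = measures.join(demographics, how="inner")
    # 4. Spearman correlation across all measures.
    corr = joined.corr(method="spearman")
    # 5. Clustered heatmap via Seaborn.
    grid = sns.clustermap(corr, cmap="vlag", center=0)
    grid.savefig(output_path, dpi=150)


if __name__ == "__main__":
    run_pipeline()
```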
[x] For the Snakemake workflow, I would like to use the Snakemake report-generation functions to create a report showing the results from each of the analyses. (A sketch of the `report()` pattern follows this item.)
  - Added the global `report: "report/workflow.rst"` declaration to the Snakefile
  - Created a `report/` directory with RST caption files for each figure type
  - Updated the preprocessing.smk rules (filtering, qc, dimred, clustering) to declare figures as outputs with the `report()` wrapper
  - Updated the pseudobulk.smk checkpoint to include the pseudobulk figure with the `report()` wrapper
  - Updated the per_cell_type.smk rules (GSEA, Enrichr, prediction) to include figures with the `report()` wrapper and a cell_type subcategory
  - Updated `aggregate_per_cell_type_outputs()` in common.smk to include the figure files
  - Added `report` and `report-zip` targets to the Makefile
  - Updated WORKFLOW_OVERVIEW.md with report-generation documentation
  - Usage: `snakemake --report report.html --config datadir=/path/to/data` or `make report`
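For context, the `report()` wrapper marks an output file for inclusion in the HTML report that `snakemake --report` generates. A minimal hypothetical rule in the style described above (rule, file, and category names are illustrative):

```python
# Snakefile sketch
report: "report/workflow.rst"  # global description shown at the top of the report

rule qc_figure:
    input:
        "wf_snakemake/results/qc_metrics.csv",
    output:
        report(
            "wf_snakemake/figures/qc.png",
            caption="report/qc.rst",
            category="Quality control",
        ),
    script:
        "scripts/plot_qc.py"
```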
[x] For the Prefect workflow, the default parameters for each workflow module are embedded in the Python code for the workflow. I would rather they be defined in a configuration file. Please extract all of the parameters into a configuration file (using whatever format you think is most appropriate) and read them in during workflow execution rather than hard-coding them. (A sketch of the resulting config loader follows this item.)
  - Created `prefect_workflow/config/config.yaml` with all workflow parameters
  - Parameters are organized by step: filtering, qc, preprocessing, dimred, clustering, pseudobulk, differential_expression, pathway_analysis, overrepresentation, predictive_modeling
  - Added a `load_config()` function to flows.py that loads parameters from a YAML file
  - Updated `run_workflow()` and `analyze_single_cell_type()` to accept a `config_path` parameter
  - Added a `--config` CLI argument to run_workflow.py
  - The default config is bundled with the package; custom configs can be specified via the CLI
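A plausible shape for that `load_config()` helper, assuming the YAML layout described above (the actual code in flows.py may differ):

```python
from pathlib import Path

import yaml

# Assumed location of the bundled default config, relative to flows.py.
DEFAULT_CONFIG = Path(__file__).parent / "config" / "config.yaml"


def load_config(config_path: str | None = None) -> dict:
    """Load workflow parameters from YAML, falling back to the bundled default."""
    path = Path(config_path) if config_path else DEFAULT_CONFIG
    with open(path) as f:
        return yaml.safe_load(f)


# Usage inside the flow: parameters are looked up per step, e.g.
# config = load_config(args.config)
# filtering_params = config["filtering"]
```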
[x] For the Prefect workflow, please save the output to a folder called "wf_prefect" (rather than "workflow")
  - Updated all output directories in flows.py and run_workflow.py to use `wf_prefect/` instead of `workflow/`

[x] For the Snakemake workflow, please save the output to a folder called "wf_snakemake" (rather than "workflow")
  - Updated the Snakefile to use `wf_snakemake/` for CHECKPOINT_DIR, RESULTS_DIR, FIGURE_DIR, and LOG_DIR
  - Updated WORKFLOW_OVERVIEW.md to reflect the new output structure
[x] I would now like to add another workflow, with code saved to src/BetterCodeBetterScience/rnaseq/snakemake_workflow. This workflow will use the Snakemake workflow manager (https://snakemake.readthedocs.io/en/stable/index.html); otherwise it should be functionally equivalent to the other workflows already developed. (A sketch of the checkpoint pattern follows this item.)
  - Created `snakemake_workflow/` directory with:
    - `Snakefile`: main workflow entry point
    - `config/config.yaml`: all workflow parameters, with defaults
    - `rules/common.smk`: helper functions (sanitize_cell_type, aggregation functions)
    - `rules/preprocessing.smk`: rules for steps 1-6
    - `rules/pseudobulk.smk`: step 7 as a Snakemake checkpoint (enables dynamic rules)
    - `rules/per_cell_type.smk`: steps 8-11, parameterized by a {cell_type} wildcard
    - `scripts/*.py`: 12 Python scripts wrapping the modular workflow functions
  - Uses a Snakemake checkpoint for step 7 to discover cell types dynamically
  - Per-cell-type steps (8-11) are triggered automatically for all valid cell types
  - Reuses the existing modular workflow functions and checkpoint utilities
  - Added the `snakemake>=8.0` dependency to pyproject.toml
  - Usage: `snakemake --cores 8 --config datadir=/path/to/data`
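The checkpoint mechanism is what lets the DAG depend on data that does not exist until step 7 has run. A sketch of the canonical pattern, with illustrative rule and path names (`checkpoints`, `glob_wildcards`, `expand`, and `directory` are names Snakemake provides inside a Snakefile):

```python
# Snakefile sketch
checkpoint pseudobulk:
    input:
        "wf_snakemake/checkpoints/clustered.h5ad",
    output:
        directory("wf_snakemake/results/pseudobulk"),
    script:
        "scripts/pseudobulk.py"


def per_cell_type_outputs(wildcards):
    # Re-evaluated after the checkpoint completes, so the set of cell types
    # is discovered from the files it actually produced.
    ckpt_dir = checkpoints.pseudobulk.get(**wildcards).output[0]
    cell_types = glob_wildcards(f"{ckpt_dir}/{{cell_type}}.csv").cell_type
    return expand(
        "wf_snakemake/results/per_cell_type/{cell_type}/de_results.csv",
        cell_type=cell_types,
    )


rule all_per_cell_type:
    input:
        per_cell_type_outputs,
```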
[x] I would like to add a new workflow, with code saved to src/BetterCodeBetterScience/rnaseq/prefect_workflow. This workflow will use the Prefect workflow manager (https://github.com/PrefectHQ/prefect) to manage the workflow that was previously developed in src/BetterCodeBetterScience/rnaseq/stateless_workflow. The one new feature I would like to add is to perform steps 8-11 separately on each cell type that survives the initial filtering. (A sketch of the fan-out pattern follows this item.)
  - Created `prefect_workflow/` directory with:
    - `tasks.py`: Prefect task definitions wrapping the modular workflow functions
    - `flows.py`: main flow with parallel per-cell-type analysis
    - `run_workflow.py`: CLI entry point with argument parsing
  - Steps 1-7 run sequentially with checkpoint caching (reusing the existing system)
  - Steps 8-11 run in parallel for each cell type:
    - DE tasks are submitted in parallel across all cell types
    - GSEA, Enrichr, and predictive modeling run in parallel within each cell type
  - Added the `prefect>=3.0` dependency to pyproject.toml
  - Results are organized by cell type in `workflow/results/per_cell_type/`
  - The CLI supports `--force-from`, `--cell-type`, `--list-cell-types`, and `--min-samples`
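The fan-out follows Prefect's standard submit/result pattern, sketched below with hypothetical task bodies (passing a future into a downstream `.submit()` call is what creates the dependency edge):

```python
from prefect import flow, task


@task
def differential_expression(cell_type: str) -> str:
    # Placeholder for the real DE computation; returns a results path.
    return f"results/per_cell_type/{cell_type}/de_results.csv"


@task
def gsea(de_results_path: str) -> str:
    # Placeholder for pathway enrichment on the DE results.
    return de_results_path.replace("de_results", "gsea")


@flow
def per_cell_type_analysis(cell_types: list[str]) -> list[str]:
    # Submit DE for every cell type up front; Prefect runs them concurrently.
    de_futures = [differential_expression.submit(ct) for ct in cell_types]
    # Each GSEA task starts as soon as its own DE future resolves.
    gsea_futures = [gsea.submit(fut) for fut in de_futures]
    return [fut.result() for fut in gsea_futures]


if __name__ == "__main__":
    per_cell_type_analysis(["B_cells", "T_cells", "NK_cells"])
```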
Lines changed: 27 additions & 0 deletions

Prompt: please read CLAUDE.md for guidelines, and then read refactor_monolithic_to_modular.md for a description of your task.

# Goal

src/BetterCodeBetterScience/rnaseq/immune_scrnaseq_monolithic.py is currently a single monolithic script for a data analysis workflow. I would like to refactor it into a modular script based on the following decomposition of the workflow:

- Data (down)loading
- Data filtering (removing subjects or cell types with insufficient observations)
- Quality control
  - Identifying bad cells on the basis of mitochondrial, ribosomal, or hemoglobin genes
  - Identifying "doublets" (multiple cells mistakenly identified as one)
- Preprocessing
  - Count normalization
  - Log transformation
  - Identification of high-variance features
  - Filtering of nuisance genes
- Dimensionality reduction
  - UMAP generation
- Clustering
- Pseudobulking
- Differential expression analysis
- Pathway enrichment analysis (GSEA)
- Overrepresentation analysis (Enrichr)
- Predictive modeling

Please generate a new set of scripts within a new directory called `src/BetterCodeBetterScience/rnaseq/modular_workflow` that implements the same workflow in a modular way. (A sketch of one such module follows.)
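As an illustration of what one module in this decomposition might look like, here is a hedged sketch of the preprocessing step using scanpy (the function name and parameter defaults are assumptions, not code from the repository):

```python
import scanpy as sc
from anndata import AnnData


def preprocess(adata: AnnData, n_top_genes: int = 2000) -> AnnData:
    """One self-contained workflow step: count normalization, log
    transformation, and identification of high-variance features."""
    sc.pp.normalize_total(adata, target_sum=1e4)  # count normalization
    sc.pp.log1p(adata)  # log transformation
    sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes)
    return adata
```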

pyproject.toml (8 additions, 3 deletions)

```diff
@@ -29,13 +29,12 @@ dependencies = [
     "icecream>=2.1.4",
     "python-dotenv>=1.0.1",
     "pyyaml>=6.0.2",
-    "numba>=0.61.0",
+    "numba>=0.61,<0.63",
     "codespell>=2.4.1",
     "tomli>=2.2.1",
     "pre-commit>=4.2.0",
     "mdnewline>=0.1.3",
     "anthropic>=0.61.0",
-    "rpy2>=3.6.4",
     "nibabel>=5.3.2",
     "fastparquet>=2024.11.0",
     "templateflow>=25.1.1",
@@ -47,7 +46,6 @@ dependencies = [
     "datalad-osf>=0.3.0",
     "pymongo[srv]>=4.15.4",
     "mysql-connector-python>=9.5.0",
-    "mariadb>=1.1.14",
     "biopython>=1.86",
     "neo4j>=6.0.3",
     "tqdm>=4.66.5",
@@ -76,6 +74,13 @@ dependencies = [
     "fastcluster>=1.3.0",
     "scikit-misc>=0.5.2",
     "harmony-pytorch>=0.1.8",
+    "pydeseq2>=0.5.3",
+    "gseapy>=1.1.11",
+    "ipython>=9.8.0",
+    "harmonypy>=0.0.10",
+    "rpy2>=3.6.4",
+    "prefect>=3.0",
+    "snakemake>=8.0",
 ]
 
 [build-system]
```
