| 1 | +## Problems to be fixed |
| 2 | + |
| 3 | +Open problems marked with [ ] |
| 4 | +Fixed problems marked with [x] |
| 5 | + |
| 6 | +[x] I would like to generate a new example of a very simple pandas-based data analysis workflow for demonstrating the features of Prefect and Snakemake. Put the new code into src/BetterCodeBetterScience/simple_workflow. The example should include separate modules that implement each of the following functions: |
| 7 | +- load these two files (using the first column as the index for each): |
| 8 | + - https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/meaningful_variables_clean.csv |
| 9 | + - https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/demographics.csv |
| 10 | +- filter out any non-numerical variables from each |
| 11 | +- join the two data frames based on the index |
| 12 | +- compute the correlation matrix across all measures using Spearman correlation |
| 13 | +- generate a clustered heatmap from the correlation matrix using Seaborn |
| 14 | + - Created `simple_workflow/` directory with modular functions: |
| 15 | + - `load_data.py`: Functions to load CSV data from URLs with optional caching |
| 16 | + - `filter_data.py`: Functions to filter dataframes to numerical columns only |
| 17 | + - `join_data.py`: Functions to join dataframes based on index |
| 18 | + - `correlation.py`: Functions to compute Spearman correlation matrices |
| 19 | + - `visualization.py`: Functions to generate clustered heatmaps with Seaborn |
| 20 | + - Created `prefect_workflow/` subdirectory: |
| 21 | + - `tasks.py`: Prefect task definitions wrapping each workflow function |
| 22 | + - `flows.py`: Main workflow flow orchestrating all steps |
| 23 | + - `run_workflow.py`: CLI entry point |
| 24 | + - Usage: `python run_workflow.py --output-dir ./output` |
| 25 | + - Created `snakemake_workflow/` subdirectory: |
| 26 | + - `Snakefile`: Workflow rules with dependencies |
| 27 | + - `config/config.yaml`: Configuration for URLs and heatmap settings |
| 28 | + - `scripts/*.py`: Scripts for each workflow step |
| 29 | + - `report/`: RST files for Snakemake report generation |
| 30 | + - Usage: `snakemake --cores 1 --config output_dir=/path/to/output` |
| 31 | + - Created `make_workflow/` subdirectory: |
| 32 | + - `Makefile`: GNU Make-based workflow with proper dependencies |
| 33 | + - `scripts/*.py`: Standalone CLI scripts for each step |
| 34 | + - Usage: `make OUTPUT_DIR=/path/to/output all` |
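The five steps requested above can be sketched in a few lines of pandas. This is a minimal illustration rather than the code in `simple_workflow/`: the function names, the inner join, the column suffixes, and the `fillna(0)` before clustering are all assumptions made for the example.

```python
import pandas as pd


def load_data(source: str) -> pd.DataFrame:
    """Load a CSV from a URL or local path, using the first column as the index."""
    return pd.read_csv(source, index_col=0)


def filter_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the numerical columns."""
    return df.select_dtypes(include="number")


def join_frames(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Join on the index. An inner join is assumed here, so only rows
    present in both frames survive; suffixes guard against name clashes."""
    return left.join(right, how="inner", lsuffix="_left", rsuffix="_right")


def compute_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Spearman correlation matrix across all columns."""
    return df.corr(method="spearman")


def plot_clustered_heatmap(corr: pd.DataFrame, output_path: str) -> None:
    """Save a clustered heatmap of the correlation matrix."""
    import seaborn as sns  # imported lazily so the other steps run without seaborn

    # Clustering cannot handle NaN; replacing NaN with 0 is an assumption
    # of this sketch (constant columns yield NaN correlations).
    grid = sns.clustermap(corr.fillna(0), cmap="vlag", center=0)
    grid.savefig(output_path)
```

Each function takes and returns plain data frames, which is what lets the same code be wrapped as Prefect tasks, Snakemake scripts, or Make targets.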
| 35 | + |
| 36 | + |
| 37 | +[x] For the Snakemake workflow I would like to use the Snakemake report generating functions to create a report showing the results from each of the analyses. |
| 38 | + - Added `report: "report/workflow.rst"` global declaration to Snakefile |
| 39 | + - Created `report/` directory with RST caption files for each figure type |
| 40 | + - Updated `preprocessing.smk` rules (filtering, qc, dimred, clustering) to declare figures as outputs with `report()` wrapper |
| 41 | + - Updated `pseudobulk.smk` checkpoint to include pseudobulk figure with `report()` wrapper |
| 42 | + - Updated `per_cell_type.smk` rules (GSEA, Enrichr, prediction) to include figures with `report()` wrapper and cell_type subcategory |
| 43 | + - Updated `common.smk` `aggregate_per_cell_type_outputs()` to include figure files |
| 44 | + - Added `report` and `report-zip` targets to `Makefile` |
| 45 | + - Updated `WORKFLOW_OVERVIEW.md` with report generation documentation |
| 46 | + - Usage: `snakemake --report report.html --config datadir=/path/to/data` or `make report` |
| 47 | + |
| 48 | +[x] For the Prefect workflow, the default parameters for each workflow module are embedded in the python code for the workflow. I would rather that they be defined using a configuration file. Please extract all of the parameters into a configuration file (using whatever format you think is most appropriate) and read those in during workflow execution rather than hard-coding. |
| 49 | + - Created `prefect_workflow/config/config.yaml` with all workflow parameters |
| 50 | + - Parameters organized by step: filtering, qc, preprocessing, dimred, clustering, pseudobulk, differential_expression, pathway_analysis, overrepresentation, predictive_modeling |
| 51 | + - Added `load_config()` function to `flows.py` that loads from YAML file |
| 52 | + - Updated `run_workflow()` and `analyze_single_cell_type()` to accept `config_path` parameter |
| 53 | + - Added `--config` CLI argument to `run_workflow.py` |
| 54 | + - Default config bundled with package; custom configs can be specified via CLI |
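A configuration loader of the kind described above might look like the following. This is a hedged sketch, not the `load_config()` in `flows.py`: it assumes PyYAML is installed, and `DEFAULT_CONFIG_PATH` is a hypothetical stand-in for wherever the bundled default lives.

```python
from pathlib import Path
from typing import Any, Dict, Optional, Union

import yaml  # PyYAML; assumed to be available in the environment

# Hypothetical location of the bundled default; in a package this would
# normally be resolved relative to the module rather than the working directory.
DEFAULT_CONFIG_PATH = Path("config") / "config.yaml"


def load_config(config_path: Optional[Union[str, Path]] = None) -> Dict[str, Any]:
    """Read workflow parameters from a YAML file.

    Falls back to the default path when no custom path is given.
    """
    path = Path(config_path) if config_path is not None else DEFAULT_CONFIG_PATH
    with path.open("r", encoding="utf-8") as handle:
        config = yaml.safe_load(handle)
    # An empty YAML file parses to None; return an empty dict instead.
    return config or {}
```

Using `yaml.safe_load` rather than `yaml.load` avoids constructing arbitrary Python objects from the file.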
| 55 | +[x] For the Prefect workflow, please save the output to a folder called "wf_prefect" (rather than "workflow") |
| 56 | + - Updated all output directories in `flows.py` and `run_workflow.py` to use `wf_prefect/` instead of `workflow/` |
| 57 | +[x] For the Snakemake workflow, please save the output to a folder called "wf_snakemake" (rather than "workflow") |
| 58 | + - Updated `Snakefile` to use `wf_snakemake/` for CHECKPOINT_DIR, RESULTS_DIR, FIGURE_DIR, LOG_DIR |
| 59 | + - Updated `WORKFLOW_OVERVIEW.md` to reflect new output structure |
| 60 | + |
| 61 | +[x] I would now like to add another workflow, with code saved to src/BetterCodeBetterScience/rnaseq/snakemake_workflow. This workflow will use the Snakemake workflow manager (https://snakemake.readthedocs.io/en/stable/index.html); otherwise it should be functionally equivalent to the other workflows already developed. |
| 62 | + - Created `snakemake_workflow/` directory with: |
| 63 | + - `Snakefile`: Main workflow entry point |
| 64 | + - `config/config.yaml`: All workflow parameters with defaults |
| 65 | + - `rules/common.smk`: Helper functions (sanitize_cell_type, aggregate functions) |
| 66 | + - `rules/preprocessing.smk`: Steps 1-6 rules |
| 67 | + - `rules/pseudobulk.smk`: Step 7 as Snakemake checkpoint (enables dynamic rules) |
| 68 | + - `rules/per_cell_type.smk`: Steps 8-11 with {cell_type} wildcard |
| 69 | + - `scripts/*.py`: 12 Python scripts wrapping modular workflow functions |
| 70 | + - Uses Snakemake checkpoint for step 7 to discover cell types dynamically |
| 71 | + - Per-cell-type steps (8-11) triggered automatically for all valid cell types |
| 72 | + - Reuses existing modular workflow functions and checkpoint utilities |
| 73 | + - Added `snakemake>=8.0` dependency to `pyproject.toml` |
| 74 | + - Usage: `snakemake --cores 8 --config datadir=/path/to/data` |
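Because cell type names end up in the `{cell_type}` wildcard and therefore in file paths, a helper such as `sanitize_cell_type` is needed to make them path-safe. The notes above only name the helper; the behaviour below (collapsing every run of non-alphanumeric characters into one underscore) is an assumption for illustration.

```python
import re


def sanitize_cell_type(name: str) -> str:
    """Turn a cell type label into a string that is safe in file names.

    Assumed behaviour: runs of characters other than letters and digits
    become a single underscore, and leading/trailing underscores are dropped.
    """
    return re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")


print(sanitize_cell_type("CD4+ T cells"))  # CD4_T_cells
```

Note that a mapping like this is not guaranteed to be one-to-one, so distinct labels that differ only in punctuation would collide.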
| 75 | + |
| 76 | +[x] I would like to add a new workflow, with code saved to src/BetterCodeBetterScience/rnaseq/prefect_workflow. This workflow will use the Prefect workflow manager (https://github.com/PrefectHQ/prefect) to manage the workflow that was previously developed in src/BetterCodeBetterScience/rnaseq/stateless_workflow. The one new feature that I would like to add here is to perform steps 8-11 separately on each different cell type that survives the initial filtering. |
| 77 | + - Created `prefect_workflow/` directory with: |
| 78 | + - `tasks.py`: Prefect task definitions wrapping modular workflow functions |
| 79 | + - `flows.py`: Main workflow flow with parallel per-cell-type analysis |
| 80 | + - `run_workflow.py`: CLI entry point with argument parsing |
| 81 | + - Steps 1-7 run sequentially with checkpoint caching (reuses existing system) |
| 82 | + - Steps 8-11 run in parallel for each cell type: |
| 83 | + - DE tasks submitted in parallel across all cell types |
| 84 | + - GSEA, Enrichr, and predictive modeling run in parallel within each cell type |
| 85 | + - Added `prefect>=3.0` dependency to `pyproject.toml` |
| 86 | + - Results organized by cell type in `workflow/results/per_cell_type/` |
| 87 | + - CLI supports: `--force-from`, `--cell-type`, `--list-cell-types`, `--min-samples` |
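The fan-out described above (differential expression submitted for every cell type at once, then three downstream analyses in parallel within each cell type) can be illustrated without Prefect. The sketch below deliberately swaps in the standard library's `ThreadPoolExecutor` as a stand-in for Prefect's task submission; every function in it is a hypothetical placeholder, not one of the tasks in `tasks.py`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


# Placeholder steps: each returns a label so the data flow is visible.
def run_de(cell_type: str) -> Dict[str, str]:
    return {"cell_type": cell_type, "step": "de"}


def run_gsea(de_result: Dict[str, str]) -> str:
    return f"gsea:{de_result['cell_type']}"


def run_enrichr(de_result: Dict[str, str]) -> str:
    return f"enrichr:{de_result['cell_type']}"


def run_prediction(de_result: Dict[str, str]) -> str:
    return f"prediction:{de_result['cell_type']}"


DOWNSTREAM_STEPS = {
    "gsea": run_gsea,
    "enrichr": run_enrichr,
    "prediction": run_prediction,
}


def analyze_cell_types(cell_types: List[str]) -> Dict[str, Dict[str, str]]:
    """Run DE for all cell types concurrently, then fan out downstream steps."""
    with ThreadPoolExecutor() as pool:
        # Stage 1: one DE job per cell type, all submitted up front.
        de_futures = {ct: pool.submit(run_de, ct) for ct in cell_types}

        # Stage 2: as each DE result is collected, submit its three
        # downstream analyses so they run concurrently.
        downstream = {}
        for ct, future in de_futures.items():
            de_result = future.result()
            downstream[ct] = {
                name: pool.submit(func, de_result)
                for name, func in DOWNSTREAM_STEPS.items()
            }

        # Collect everything, organized by cell type.
        return {
            ct: {name: fut.result() for name, fut in futures.items()}
            for ct, futures in downstream.items()
        }
```

The nested result dictionary mirrors the per-cell-type layout of the results directory; in the Prefect version the scheduling, retries, and logging would be handled by the flow rather than by hand.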