A minimal pipeline: one input CSV, several processing steps, and a plot.
data/input.csv is a small table with columns category and value.
- **clean** (`scripts/clean.py`): Read `data/input.csv`, drop any nulls, and ensure `value` entries are numeric. Produces `data/cleaned.csv` as the clean dataset.
- **transform** (`scripts/transform.py`): Add a column with the normalized value and another column with the rank. Produces `results/transformed.csv` as the result.
- **summarize** (`scripts/summarize.py`): Write summary statistics to `results/summary.txt`.
- **plot** (`scripts/plot.py`): Create a bar chart of category vs. value as `results/plot.png` and `results/plot_unsorted.png`.
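As a rough illustration of the first step, the cleaning logic might look like the sketch below. This is not the actual contents of `scripts/clean.py`; the function name and row representation are assumptions based on the description above.

```python
def clean_rows(rows):
    """Drop rows with null fields or non-numeric 'value' entries
    (a sketch of what the clean step is described as doing)."""
    cleaned = []
    for row in rows:
        if not row.get("category") or not row.get("value"):
            continue  # drop rows with missing/blank fields
        try:
            row["value"] = float(row["value"])  # ensure value is numeric
        except ValueError:
            continue  # drop rows whose value is not a number
        cleaned.append(row)
    return cleaned

rows = [
    {"category": "a", "value": "3.5"},
    {"category": "b", "value": ""},       # dropped: null value
    {"category": "c", "value": "oops"},   # dropped: non-numeric
]
print(clean_rows(rows))  # → [{'category': 'a', 'value': 3.5}]
```

In the real script the rows would come from `data/input.csv` (e.g. via `csv.DictReader`) and the cleaned rows would be written to `data/cleaned.csv`.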
Each step is implemented in a Python script under scripts/; the Snakefile invokes each script with its input and output paths passed as command-line arguments.
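As a hedged sketch (the actual rule names and Snakefile contents may differ), one of these steps could be wired up in the Snakefile like this:

```
# Hypothetical Snakemake rule; see the repo's Snakefile for the real definitions.
rule clean:
    input:
        "data/input.csv"
    output:
        "data/cleaned.csv"
    shell:
        "python scripts/clean.py {input} {output}"
```

Snakemake fills in `{input}` and `{output}` from the rule's declared paths, which is how the scripts receive their arguments.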
init/setup.sh contains the steps to set up a conda environment with Python, Snakemake, and the other dependencies for this pipeline. Please note that the setup is configured specifically for the Roar Collab cluster; to run on other clusters, some minor modifications may be needed.
From this directory (workflowtools_intro/):

Local:

```
snakemake -j 1
```

Local with a profile: the profiles/local/config.yaml file sets some defaults to limit the resources used by the snakemake jobs.

```
snakemake --profile profiles/local
```

Slurm: the profiles/slurm/config.yaml file sets Slurm-specific settings that enable snakemake to submit jobs via sbatch.
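Snakemake profiles are plain YAML files of command-line defaults. As a hedged sketch only (the keys are standard Snakemake options, but these particular values are assumptions, not the repo's actual settings), a local profile might look like:

```
# Hypothetical profiles/local/config.yaml; the real file may differ.
jobs: 1                # run at most one job at a time
default-resources:
  mem_mb: 2000         # cap memory per job
```

Any option you could pass on the command line (e.g. `--jobs`) can instead live in the profile's config.yaml.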
```
snakemake --profile profiles/slurm
```

init/reset.sh removes all outputs (data/cleaned.csv, results/transformed.csv, results/summary.txt, results/plot.png) and puts the repo back to a clean state after a snakemake run. Run it with:

```
./init/reset.sh
```
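For intuition, a reset script like this typically just deletes the generated files. The following is a sketch of what init/reset.sh might contain, based on the outputs listed above; the real script may do more.

```shell
# Hypothetical sketch of init/reset.sh: remove generated outputs so the
# pipeline reruns every step from scratch.
rm -f data/cleaned.csv \
      results/transformed.csv \
      results/summary.txt \
      results/plot.png results/plot_unsorted.png
rmdir results 2>/dev/null || true   # drop results/ only if it is now empty
echo "repo reset"
```

After running it, `snakemake` sees every output as missing and rebuilds the whole DAG.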