✨ Miscellaneous useful things ✨

🐍 Snakemake 🐍

📝 Log/progress parsing setup 📝

I am really happy with my log parsing setup for running snakemake jobs. You might have your own setup, but I wanted to share mine anyway: maybe it is useful for someone, and maybe I get feedback on how to make it even better. This works well if you

  • use conductor jobs (i.e., submit the snakemake process itself via sbatch instead of running it interactively - not how it is supposed to be used, but really useful imo)
  • always have the same structure, in my case, all conductor logs are named ~/projects/*/results/logs/snake_sbatch/*conductor*.out

Here are my useful aliases:

  • find and show the errors in an easily parseable manner (this is my favourite)

    alias finderr='python ~/projects/useful_scripts/src/snakemake/find_error_logs_in_conductor.py > ~/tmp/finderr.txt; less ~/tmp/finderr.txt'
    • optimized for Snakemake 9 (the log structures sometimes change between versions)
    • counts the types of errors & sorts the file paths by them in an easily copy-pasteable way → no more guessing if the 300 error files are all the same error or 20 different errors
    | Category                                                                   | Count |
    +----------------------------------------------------------------------------+-------+
    | Out Of Memory (OOM)                                                        |     0 |
    | Killed (not OOM)                                                           |     0 |
    | KeyError: 'cell_type'                                                      |     3 |
    | Error in .subset(x, j) : invalid subscript type 'list'                     |     5 |
    | AssertionError: This script only works for groups being cell types for now |     2 |
    +----------------------------------------------------------------------------+-------+
    
    Out Of Memory (OOM)
    -------------------
    
    Killed (not OOM)
    ----------------
    
    KeyError: 'cell_type'
    ---------------------
    fig1_marker_plot_selected_genes 2025-10-06 22:30:36 |||||
      /path/to/repo/.snakemake/slurm_logs/rule_fig1_marker_plot_selected_genes/ATAC_TSS_1000_500/10552392.log
    fig1_marker_plot_selected_genes 2025-10-06 22:30:36 |||||
      /path/to/repo/.snakemake/slurm_logs/rule_fig1_marker_plot_selected_genes/ATAC_TSS_500_100/10552394.log
    fig1_marker_plot_selected_genes 2025-10-06 22:30:36 |||||
      /path/to/repo/.snakemake/slurm_logs/rule_fig1_marker_plot_selected_genes/ATAC_TSS_100_100/10552396.log
    
    Error in .subset(x, j) : invalid subscript type 'list'
    ------------------------------------------------------
    confounding_factor_quantification_stat_test 2025-10-06 22:29:46 |||||
      /path/to/repo/.snakemake/slurm_logs/rule_confounding_factor_quantification_stat_test/all_L4_RNA/10552408.log
    confounding_factor_quantification_stat_test 2025-10-06 22:29:45 |||||
      /path/to/repo/.snakemake/slurm_logs/rule_confounding_factor_quantification_stat_test/sample_type__not_all_metadata__False_L4_RNA/10552409.log
  • print my queue including job names

    alias myq="squeue -u rbednarsky -o '%.12i %.4P %.5j %.80k %.8M %.4C %.9m %.6D %R'"
  • give me the most recent log across projects

    alias log_cond='LASTLOG=$(ls -Atd ~/projects/*/results/logs/snake_sbatch/*conductor*.{out,log} 2>/dev/null | head -n 1); echo "$LASTLOG"; tail -n 100 "$LASTLOG"; echo "$LASTLOG"; echo "----------------------------------------"'
  • continuously print how many errors there are in your pipeline (this is useful to notice early if something doesn't work)

    alias ccounterr='while true; do LASTLOG=$(ls -Atd ~/projects/*/results/logs/snake_sbatch/*conductor*.{out,log} | head -1); echo ........................................................; ls -l "$LASTLOG" | awk '\''{print $6, $7, $8, $9}'\''; grep "Error" "$LASTLOG" | sort | uniq -c; sleep 5; done'
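
To illustrate the idea behind finderr, here is a minimal sketch of grouping log files by error type. It is not the actual find_error_logs_in_conductor.py: the regex patterns, the category names and the function names (categorize, group_logs_by_error, format_report) are assumptions made up for this example, and real Snakemake/SLURM logs may need different patterns.

```python
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical patterns; the real script's categories and regexes may differ.
ERROR_PATTERNS = [
    ("Out Of Memory (OOM)", re.compile(r"oom[-_ ]kill|out of memory", re.IGNORECASE)),
    ("Killed (not OOM)", re.compile(r"\bKilled\b")),
]
# Fallback: a line that starts with a Python exception name or an R "Error in ..." message.
GENERIC_ERROR = re.compile(r"^\s*(\w*Error\b.*|Error in .*)$")


def categorize(log_text):
    """Return an error category for one log, or None if no error is found."""
    for name, pattern in ERROR_PATTERNS:
        if pattern.search(log_text):
            return name
    for line in log_text.splitlines():
        match = GENERIC_ERROR.match(line)
        if match:
            # Use the error message itself as the category name.
            return match.group(1).strip()
    return None


def group_logs_by_error(log_paths):
    """Map each error category to the list of log files that show it."""
    groups = defaultdict(list)
    for path in log_paths:
        category = categorize(Path(path).read_text(errors="replace"))
        if category is not None:
            groups[category].append(str(path))
    return dict(groups)


def format_report(groups):
    """Print the most frequent category first, with copy-pasteable paths below it."""
    lines = []
    for category, paths in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        lines.append(f"{category} ({len(paths)})")
        lines.append("-" * len(category))
        lines.extend(f"  {p}" for p in paths)
        lines.append("")
    return "\n".join(lines)
```

Using the error message itself as the fallback category means that 300 logs with the same KeyError collapse into one entry, while 20 different errors stay separate.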

🧑‍💻 🤝 🐍 Interactive coding with Snakemake 🧑‍💻 🤝 🐍

This relates to code in src/snakemake/interactive_snakemake_object.py

My aim here is to work interactively, while developing a workflow, in two ways:

  1. When writing a script for the first time, I want to already write it in a way that makes it easy to adapt the script to be run in the snakemake workflow.
  2. Once I think the script is working and it has already been run by the workflow, but I find out I want to change something, I want to be able to start a session that looks as if the script were being run by snakemake right now, i.e., there is an object in memory called snakemake that contains the input, output, wildcards, etc.

Here is how I do it:

  • During development, I use the SnakelikeObject class to work interactively with the snakemake object. It takes a nested dictionary, where the first-level keys are input, output, wildcards, etc., and the second-level keys are the names of the input/output files/directories, with the values being the paths to those files/directories.
  • Note: I actually write smk instead of snakemake in my scripts for code brevity, but some people don't like that.
snakemake = SnakelikeObject({
  "input": {
    "adata_superset": "/path/to/adata_superset.h5ad",
    "marker_genes": "/path/to/marker_genes.csv"
  },
  "output": {
    "fig1_marker_plot_selected_genes": "/path/to/fig1_marker_plot_selected_genes.png"
  },
  "wildcards": {
    "cell_type": "L4_RNA",
    "tss_distance": "1000_500"
  },
})
  • You can then use this object just as you would use the real snakemake object, accessing entries like snakemake.input['adata_superset'] etc.
  • Once you are ready to run via snakemake, this structure is easy to transfer into a rule.
  • To make it easy to switch between running via snakemake and interactive work, I use a function that does two things:
    • When run via snakemake: save the object that snakemake injects into your environment as a json file, so you can load it for interactive coding later.
    • After snakemake has run, if you want to code interactively: recognize that there is no snakemake object in memory, and load it from the json file instead.
smk = misc.get_smk(
    snakemake=globals().get("snakemake"),
    path_to_json=PROJECT_ROOT / 'results/smk_objects/{rule_name}/{wildcards}.json'
)
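
For readers without access to src/snakemake/interactive_snakemake_object.py, the save-or-load pattern described above can be sketched as follows. This is a hypothetical re-implementation, not the actual SnakelikeObject or misc.get_smk: it assumes that each section of the snakemake object (input, output, params, wildcards) offers an items() method yielding name/value pairs, it only captures named entries, and it takes an already resolved json path instead of filling in the {rule_name}/{wildcards} template.

```python
import json
from pathlib import Path

# Sections to snapshot; an assumption, the real function may store more or fewer.
SECTIONS = ("input", "output", "params", "wildcards")


class SnakelikeObject:
    """Hypothetical stand-in that exposes each top-level key as an attribute."""

    def __init__(self, sections):
        self._sections = {name: dict(entries) for name, entries in sections.items()}
        for name, entries in self._sections.items():
            setattr(self, name, entries)

    def to_dict(self):
        return self._sections


def get_smk(snakemake, path_to_json):
    """Save the snakemake object if one exists, otherwise load the saved copy."""
    path_to_json = Path(path_to_json)
    if snakemake is None:
        # Interactive session: rebuild the object from the last recorded run.
        with open(path_to_json) as handle:
            return SnakelikeObject(json.load(handle))
    # Running under snakemake: record the named entries of each section.
    snapshot = {}
    for name in SECTIONS:
        section = getattr(snakemake, name, None)
        if section is not None:
            snapshot[name] = dict(section.items())
    path_to_json.parent.mkdir(parents=True, exist_ok=True)
    with open(path_to_json, "w") as handle:
        json.dump(snapshot, handle, indent=2)
    return snakemake
```

Passing globals().get("snakemake") as the first argument yields None in an interactive session, which is what switches the function into load mode.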

📊 Figure making 📊

  • My plotting setup is in src/python/plotting_setup.py
  • Below I show how I import RC etc for consistent plots across scripts
  • The figure dimensions really depend on the plot, and at first I try to just make it as small as possible.
  • If I have to scale one of the axes to make space for more labels or so, I try to keep the vertical axis stable so things fit well into one row.
  • Sometimes I add a little space for the axis labels, because in the end what looks tidy is if the squares are the same size.
  • I never resize figures in Illustrator.
  • I try to go as close to the final figure as is reasonable, and that's quite far - particularly now that AI is extremely helpful in doing spacing adjustments etc. Tell the AI to write constants at the top of the script for wherever the spacing is off, then tune those constants manually until things are as you want them.
  • However, a lot of the work to make things look tidy, particularly for legends, happens manually in Illustrator. This literally takes hours for me, though I am quite fast in Illustrator.
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
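mpl.use('cairo')  # pycairo library to export pdfs for illustrator, default creates text as shapes
# CEMM_PALETTE, CUSTOM_RC_SMALL and PANEL_WIDTH_3_PANELS are constants from my plotting setup (src/python/plotting_setup.py)
sns.set_theme(style='whitegrid', palette=CEMM_PALETTE, rc=CUSTOM_RC_SMALL)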
fig, ax = plt.subplots(1, 1, figsize=(PANEL_WIDTH_3_PANELS, PANEL_WIDTH_3_PANELS))
# ...
fig.savefig(path_to_figure, dpi=300, bbox_inches="tight")
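
One way to keep the plotted squares the same size while leaving room for axis labels is to compute the figure size from spacing constants at the top of the script. The sketch below is only an illustration of that idea: the constant names, their values (in inches) and the helper square_axes_layout are assumptions, not part of plotting_setup.py.

```python
# Hypothetical spacing constants in inches; tune these by hand until the panel looks right.
AXES_SIZE = 1.5       # side length of the square plotting area
LEFT_MARGIN = 0.5     # room for y tick labels and the y axis label
BOTTOM_MARGIN = 0.4   # room for x tick labels and the x axis label
RIGHT_MARGIN = 0.1
TOP_MARGIN = 0.1


def square_axes_layout(axes_size=AXES_SIZE, left=LEFT_MARGIN, bottom=BOTTOM_MARGIN,
                       right=RIGHT_MARGIN, top=TOP_MARGIN):
    """Return (figsize, rect) so that the axes area is exactly axes_size inches square.

    figsize is (width, height) in inches; rect is (left, bottom, width, height)
    in figure fractions, the convention used by matplotlib's fig.add_axes.
    """
    width = left + axes_size + right
    height = bottom + axes_size + top
    rect = (left / width, bottom / height, axes_size / width, axes_size / height)
    return (width, height), rect
```

The first return value can be passed as figsize to plt.figure and the second to fig.add_axes. Growing a margin then enlarges the figure while the square stays the same size, as long as the figure is saved without bbox_inches="tight", which would crop it again.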