---
layout: post
title: "Hard-won Research Workflow Tips"
date: 2022-12-04
no_toc: true
---

## Number Seeds Starting at 1,000

This ensures that alphabetical ordering of seeds' string representations matches their numerical ordering.
Credit for this tip goes to [@voidptr](https://github.com/voidptr).

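A quick sketch of why this works: as long as every seed has the same number of digits, lexicographic and numeric orderings coincide.

```python
# Seeds 1..12 sort differently as strings than as numbers.
print(sorted(str(s) for s in range(1, 13)))
# ['1', '10', '11', '12', '2', '3', ...]

# Seeds 1000..1011 all have four digits, so the orderings coincide.
assert sorted(str(s) for s in range(1000, 1012)) == [
    str(s) for s in range(1000, 1012)
]
```
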
## Have low-numbered seeds run quick, pro-forma versions of your workflow

Good for debugging.
Your workflow won't work the first or second time.
The first time you run your workflow is NEVER to generate actual data.
It is to debug the workflow.

Spot check all data files manually.

## Divide seeds into completely-independent "endeavors"

No data collation between endeavors.
Allows completely independent instantiations of the same workflow.

## Do you have all the data columns you'll want to join on?

If you gzip, redundant or always-same columns shouldn't take up much extra space.

## How will you manually re-start or intervene in the middle of your workflow when things go wrong?

## How will you weed out bad or corrupted data from good data?

Ideally, different runs will be completely isolated so this is not necessary.

## How will you play detective when you find something weird?

## Use JSON, YAML, etc.

Don't invent your own serialization format.

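For instance, a minimal sketch of dumping run metadata with the standard library instead of a homegrown format (the filename and fields are hypothetical):

```python
import json

run_metadata = {  # hypothetical example fields
    "seed": 1000,
    "endeavor": 1,
    "slurm_job_id": "12345678",
}
with open("run_metadata.json", "w") as f:
    json.dump(run_metadata, f, indent=2)

with open("run_metadata.json") as f:
    assert json.load(f) == run_metadata  # round-trips for free
```
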
## Save SLURM job ID and source hard pin with *all* data

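A minimal sketch, assuming a pandas workflow: stamp every row with the job ID and source pin before writing data (the column names and pin value are illustrative).

```python
import os
import pandas as pd

df = pd.DataFrame({"seed": [1000, 1001], "fitness": [0.9, 0.7]})  # stand-in data

# SLURM exports SLURM_JOB_ID inside batch jobs; fall back for local debugging.
df["slurm_job_id"] = os.environ.get("SLURM_JOB_ID", "none")
df["source_hard_pin"] = "bdac574424570fb7dc85ef01ba97464c6a8737cc"  # e.g., injected via Jinja

df.to_csv("data.csv.gz", index=False)
```
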
## Save SLURM logs to one folder, named with the JOB ID

## Also save a copy of SLURM scripts to a separate folder, named with the JOB ID

This makes it easy to debug, rerun, or make small manual changes and rerun.

## A 1% or 0.1% failure rate == major headache at scale

Request *very* generous extra time on jobs or, better yet, use time-aware experimental conditions (e.g., run for 3 hours instead of *n* updates), as sketched below.

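A minimal sketch of a time-aware main loop, stopping on wall-clock budget rather than update count (the three-hour budget and the `do_update` stub are illustrative):

```python
import time

def do_update() -> None:
    """Stand-in for one simulation update."""
    time.sleep(0.01)

# Leave slack relative to the job's requested walltime.
TIME_BUDGET_SECONDS = 3 * 60 * 60

start = time.monotonic()
num_updates = 0
while time.monotonic() - start < TIME_BUDGET_SECONDS:
    do_update()
    num_updates += 1  # record this count alongside your data
```
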
## Use Jinja to instantiate batches of job scripts

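A minimal sketch using the `jinja2` library (the template text and variables are illustrative, not the linked scripts; the `source_commit` variable anticipates the next tip):

```python
import jinja2

template = jinja2.Template("""\
#!/bin/bash
#SBATCH --job-name=run-seed-{{ seed }}
#SBATCH --time=4:00:00
./simulation --seed {{ seed }} --source-commit {{ source_commit }}
""")

# Instantiate one job script per seed.
for seed in range(1000, 1010):
    script = template.render(seed=seed, source_commit="bdac574")
    with open(f"job_seed_{seed}.slurm.sh", "w") as f:
        f.write(script)
```
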
## Hard-pin your source as a Jinja variable

Subsequent jobs should inherit the parent's hard pin.

## Embed large numbers of jobs into a script as a text variable representation of `.tar.gz` data

<https://github.com/mmore500/dishtiny/blob/bdac574424570fb7dc85ef01ba97464c6a8737cc/script/slurm_stoker_kickoff.sh>

<https://github.com/mmore500/dishtiny/blob/bdac574424570fb7dc85ef01ba97464c6a8737cc/script/slurm_stoker.slurm.sh.jinja>

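The linked scripts implement this with shell tooling; as an illustration of the underlying idea only, here is a Python sketch that packs job scripts into a `.tar.gz` and round-trips it through a base64 text variable (paths are hypothetical):

```python
import base64
import io
import pathlib
import tarfile

# Hypothetical directory of generated job scripts.
pathlib.Path("jobs").mkdir(exist_ok=True)
(pathlib.Path("jobs") / "job_seed_1000.slurm.sh").write_text("#!/bin/bash\n")

# Pack the scripts into an in-memory .tar.gz ...
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    tar.add("jobs")

# ... then encode as text, suitable for embedding in a script variable.
payload = base64.b64encode(buffer.getvalue()).decode("ascii")

# The receiving job decodes and unpacks.
with tarfile.open(
    fileobj=io.BytesIO(base64.b64decode(payload)), mode="r:gz"
) as tar:
    tar.extractall("unpacked")
```
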
## Re-clone and re-compile source for every job

Use automatic caching instead of manual caching:

* `ccache`
* <https://github.com/mmore500/dishtiny/blob/bdac574424570fb7dc85ef01ba97464c6a8737cc/script/gitget.sh>

## Get push notifications of job failures

Automatically upload logs to `transfer.sh` to be able to view them as an embedded link.

Running experiments equals pager duty.

## Get push notifications of your running job count

## Have every job check file quota on startup and send a push warning if quota is nearing capacity

<https://github.com/mmore500/dishtiny/blob/bdac574424570fb7dc85ef01ba97464c6a8737cc/script/check_quota.sh>

## Have all failed jobs try rerunning themselves once

## If it happens once (i.e., not in a loop), log it

## Log full configuration and save an `asconfigured.cfg` file

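A minimal sketch, assuming a `configparser`-style setup (section and option names are made up):

```python
import configparser

config = configparser.ConfigParser()
config["experiment"] = {  # hypothetical options
    "seed": "1000",
    "time_budget_seconds": str(3 * 60 * 60),
}

# Log the full configuration ...
for section in config.sections():
    for key, value in config[section].items():
        print(f"config: {section}.{key} = {value}")

# ... and save exactly what this run used.
with open("asconfigured.cfg", "w") as f:
    config.write(f)
```
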
## Log ALL Bash variables

## Log RNG state

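For example, with Python's standard library RNG (a minimal sketch; other engines can be snapshotted similarly):

```python
import pickle
import random

random.seed(1000)

# Snapshot the RNG state before generating data ...
with open("rng_state.pkl", "wb") as f:
    pickle.dump(random.getstate(), f)

draws = [random.random() for _ in range(3)]

# ... so the exact stream can be restored when playing detective.
with open("rng_state.pkl", "rb") as f:
    random.setstate(pickle.load(f))
assert [random.random() for _ in range(3)] == draws
```
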
## Timestamp log entries

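A minimal sketch with the standard `logging` module:

```python
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO,
)
logging.info("simulation update complete")
# 2022-12-04 12:00:00 INFO simulation update complete
```
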
## Put everything that might fail (e.g., downloads) in an automatic retry loop

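A minimal sketch of a generic retry helper with exponential backoff (the wrapped download is illustrative):

```python
import time
import urllib.request

def with_retries(action, max_attempts=5, base_delay_seconds=4):
    """Retry action() with exponential backoff, re-raising on final failure."""
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay_seconds * 2**attempt)

# e.g., wrap a flaky download
data = with_retries(
    lambda: urllib.request.urlopen("https://example.com/").read()
)
```
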
## Log digests and row count of all data read in from file

Do this in notebooks and in simulation.

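A minimal pandas sketch (the file path is illustrative):

```python
import hashlib

import pandas as pd

path = "data.csv.gz"  # hypothetical input
with open(path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

df = pd.read_csv(path)
print(f"read {path}: sha256={digest} rows={len(df)}")
```
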
## gzip log files and save them, named by SLURM job ID

Your data has the SLURM job ID, so it is easy to find the relevant log when questions come up.

## You can open URLs directly inside `pandas.read_csv()`

e.g., OSF URLs.

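For example (the URL is a placeholder):

```python
import pandas as pd

# pandas fetches the URL for you; no manual download step needed.
df = pd.read_csv("https://osf.io/<key>/download")
```
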
## You can open `.csv.gz`, `.csv.xz` directly inside `pandas.read_csv()`

Consider using a library to save to a `gz` filehandle directly.

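For example, pandas infers compression from the file extension on both write and read:

```python
import pandas as pd

df = pd.DataFrame({"seed": [1000, 1001], "fitness": [0.9, 0.7]})

# Compression inferred from the filename on write ...
df.to_csv("data.csv.gz", index=False)

# ... and on read.
assert pd.read_csv("data.csv.gz").equals(df)
```
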
## Notebooks

* commit bare notebooks
  * consider adding a git commit hook or CI script to enforce this
* run notebooks inside GitHub Actions
  * consistent, known environment
  * enforces EVERY asset MUST be committed or downloaded from URL
  * keeps generated cruft off of main branch history
* use `teeplot` to generate and save copies of all plots with descriptive, systematic filenames
* GitHub Actions pushes new executed notebooks and generated assets to a `binder` or `notebooks` branch
  * be sure to push so that history on that branch is preserved
* include graphics in LaTeX source as a git submodule
  * get new versions of graphics via `git pull`
  * be sure to use `--recursive --depth=1 --jobs=n` while cloning

## Overleaf doesn't like git submodules

AWS Cloud9 is a passable alternative, but it's hard to get collaborator buy-in, and managing logins and permissions is annoying.

## Use symlinks to access your in-repository Python library

## Create supplementary materials as a LaTeX appendix to your paper

Makes moving content in/out of the supplement trivial.
You can use `pdftk` (e.g., `dump_data_utf8` to locate the appendix bookmark, then `cat` to extract those pages) to automatically split the supplement off into a separate document as the last step of the build.

## Everything MUST print a stack trace WHEN it crashes

<https://github.com/bombela/backward-cpp>

Consider adding similar features to your job script with a Bash error trap.
<https://unix.stackexchange.com/a/504829>

## Be sure to log node name on crash

Some nodes are bad.
You can exclude bad nodes so that your jobs don't run on them.

## Use `bump2version` to release software

<https://github.com/c4urself/bump2version>

## Use `pip-tools` to ensure a consistent Python environment

<https://pypi.org/project/pip-tools/>

## Don't upload to the OSF directly from HPCC jobs

Their service isn't robust enough and you'll cause them issues.
AWS S3 is an easy alternative that integrates well.

## Make it clear what can be deleted or just delete immediately

Otherwise, you will end up with a bunch of stuff and not remember what's important.

Put things in `/tmp` so they're deleted automatically.

## Open OnDemand

<https://ondemand.hpcc.msu.edu/>

Great for browsing and downloading files.
Also for viewing the queue and canceling individual jobs.
