---
layout: post
title: "Hard-won Research Workflow Tips"
date: 2022-12-04
no_toc: true
---

## Number Seeds Starting at 1,000

Ensures the alphabetical ordering of seeds' string representations matches their numerical ordering.
Tip credit: [@voidptr](https://github.com/voidptr).

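A minimal sketch of why this matters: with seeds numbered from 1, lexicographic sorting scrambles the numeric order, while four-digit seeds sort cleanly (at least until you pass 9,999).

```python
# Seeds starting at 1: lexicographic order diverges from numeric order.
print(sorted(str(s) for s in (1, 2, 10, 100)))
# -> ['1', '10', '100', '2']

# Seeds starting at 1,000: lexicographic order matches numeric order
# (as long as you stay below 10,000).
print(sorted(str(s) for s in (1000, 1001, 1010, 1100)))
# -> ['1000', '1001', '1010', '1100']
```
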
## Have low-numbered seeds run quick, pro-forma versions of your workflow

Good for debugging.
Your workflow won't work the first or second time.
The first time you run your workflow is NEVER to generate actual data.
It is to debug the workflow.

Spot check all data files manually.

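One way to wire this up, as a sketch; the config keys and thresholds here are hypothetical, not the original workflow's:

```python
# Hypothetical sketch: seeds below 1,000 run a fast, pro-forma version of the
# workflow for debugging; real data comes from seeds 1,000 and up.
def make_config(seed: int) -> dict:
    is_smoke_test = seed < 1000
    return {
        "seed": seed,
        "n_updates": 100 if is_smoke_test else 1_000_000,
        "population_size": 10 if is_smoke_test else 1_000,
    }

print(make_config(17))    # quick debugging run
print(make_config(1042))  # real data-generating run
```
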
## Divide seeds into completely-independent "endeavors"

No data collation between endeavors.
Allows completely independent instantiations of the same workflow.

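For example (an assumed convention, not necessarily the original scheme), you might reserve a contiguous block of seeds per endeavor so the mapping is mechanical:

```python
# Assumed convention: each endeavor owns a contiguous block of 1,000 seeds.
SEEDS_PER_ENDEAVOR = 1_000

def endeavor_of(seed: int) -> int:
    return seed // SEEDS_PER_ENDEAVOR

assert endeavor_of(1000) == endeavor_of(1999) == 1
assert endeavor_of(2000) == 2  # never collate data across this boundary
```
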
## Do you have all the data columns you'll want to join on?

If you gzip, redundant or always-same columns shouldn't take up much extra space.

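A toy illustration with hypothetical column names: downstream joins only work if every output file carries the key columns, even ones that feel redundant within a single file.

```python
import pandas as pd

# Hypothetical tables: both carry seed and endeavor so they can be joined later.
runs = pd.DataFrame(
    {"seed": [1000, 1001], "endeavor": [1, 1], "fitness": [0.8, 0.9]}
)
census = pd.DataFrame(
    {"seed": [1000, 1001], "endeavor": [1, 1], "population": [500, 512]}
)

merged = runs.merge(census, on=["seed", "endeavor"])
print(merged)
```
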
## How will you manually re-start or intervene in the middle of your workflow when things go wrong?

## How will you weed out bad or corrupted data from good data?

Ideally, different runs will be completely isolated so this is not necessary.

## How will you play detective when you find something weird?

## Use JSON, YAML, etc.

Don't invent your own serialization format.

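A minimal example of leaning on a standard format instead of rolling your own:

```python
import json

# A plain, self-describing format round-trips without a custom parser.
record = {"seed": 1000, "update": 100, "fitness": 0.87}

with open("record.json", "w") as f:
    json.dump(record, f)

with open("record.json") as f:
    assert json.load(f) == record
```
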
## Save SLURM job ID and hard-pin job ID with *all* data

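A sketch of stamping provenance onto every output row; the `source_hash` value is a hypothetical stand-in for your hard pin.

```python
import os

import pandas as pd

# SLURM exports the job id as an environment variable inside the job.
slurm_job_id = os.environ.get("SLURM_JOB_ID", "none")
source_hash = "deadbeef"  # hypothetical hard-pinned source revision

df = pd.DataFrame({"seed": [1000, 1001], "fitness": [0.8, 0.9]})
df["slurm_job_id"] = slurm_job_id
df["source_hash"] = source_hash
df.to_csv("data.csv.gz", index=False)  # gzip keeps redundant columns cheap
```
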
## Save SLURM logs to one folder, named with the JOB ID

## Also save a copy of SLURM scripts to a separate folder, named with the JOB ID

This makes it easy to debug, rerun, or make small manual changes and rerun.

## A 1% or 0.1% failure rate == major headache at scale

Request *very* generous extra time on jobs or, better yet, use time-aware experimental conditions (e.g., run for 3 hours instead of n updates).

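A minimal sketch of a time-aware condition, assuming your simulation exposes a per-update step function:

```python
import time

def do_one_update() -> None:
    """Stand-in for one step of your simulation."""

def run_for_budget(budget_seconds: float) -> int:
    # Stop on wall-clock time, not update count, so slow nodes still finish
    # inside their allocation; record how far the run actually got.
    start = time.monotonic()
    updates = 0
    while time.monotonic() - start < budget_seconds:
        do_one_update()
        updates += 1
    return updates

print(run_for_budget(budget_seconds=1.0))  # e.g., 3 * 60 * 60 for a 3-hour condition
```
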
## Use Jinja to instantiate batches of job scripts

## Hard-pin your source as a Jinja variable

Subsequent jobs should inherit the parent's hard pin.

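A sketch covering both tips, using the `jinja2` package; the template contents, `./simulation` binary, and `deadbeef` hash are hypothetical:

```python
from jinja2 import Template

# Render one SLURM script per seed from a single template, with the source
# revision hard-pinned as a template variable.
template = Template("""\
#!/bin/bash
#SBATCH --job-name=run-{{ seed }}
#SBATCH --time=4:00:00
./simulation --seed {{ seed }} --source-hash {{ source_hash }}
""")

for seed in range(1000, 1004):
    script = template.render(seed=seed, source_hash="deadbeef")
    with open(f"seed{seed}.slurm.sh", "w") as f:
        f.write(script)
```
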
## Embed large numbers of jobs into a script as a text variable representation of `.tar.gz` data

<https://github.com/mmore500/dishtiny/blob/bdac574424570fb7dc85ef01ba97464c6a8737cc/script/slurm_stoker_kickoff.sh>

<https://github.com/mmore500/dishtiny/blob/bdac574424570fb7dc85ef01ba97464c6a8737cc/script/slurm_stoker.slurm.sh.jinja>

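The linked scripts are the real implementation; here is a rough Python sketch of the underlying idea, packing generated job scripts into an in-memory `.tar.gz` and base64-encoding it so it can live inside a single kickoff script as plain text:

```python
import base64
import io
import tarfile

# Pack many small job scripts into one .tar.gz held in memory.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    for seed in range(1000, 1004):
        payload = f"./simulation --seed {seed}\n".encode()  # hypothetical job body
        info = tarfile.TarInfo(name=f"seed{seed}.sh")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Base64 text is safe to embed as a shell variable in the kickoff script.
embedded = base64.b64encode(buffer.getvalue()).decode()
print(f"{len(embedded)} characters to embed")
```
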
## Re-clone and re-compile source for every job

Use automatic caching instead of manual caching:
* `ccache`
* <https://github.com/mmore500/dishtiny/blob/bdac574424570fb7dc85ef01ba97464c6a8737cc/script/gitget.sh>

## Get push notifications of job failures

Automatically upload logs to `transfer.sh` to be able to view them via an embedded link.

Running experiments equals pager duty.

## Get push notifications of your running job count

## Have every job check file quota on startup and send a push warning if quota is nearing capacity

<https://github.com/mmore500/dishtiny/blob/bdac574424570fb7dc85ef01ba97464c6a8737cc/script/check_quota.sh>

## Have all failed jobs try rerunning themselves once

## If it happens once (i.e., not in a loop), log it

## Log full configuration and save an `asconfigured.cfg` file

## Log ALL Bash variables

## Log RNG state

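A sketch of what logging RNG state can look like in Python (assuming `numpy`; adapt to whatever RNG your simulation actually uses):

```python
import json
import random

import numpy as np

random.seed(1000)
rng = np.random.default_rng(1000)

# Enough state to reproduce or resume the run exactly.
print("python RNG state (truncated):", random.getstate()[1][:4])
print("numpy bit generator state:", json.dumps(rng.bit_generator.state))
```
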
## Timestamp log entries

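With Python's standard `logging` module, this is a one-line formatter change:

```python
import logging

# Every log entry gets a timestamp for free.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(message)s",
    level=logging.INFO,
)
logging.info("simulation started")
```
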
## Put everything that might fail (e.g., downloads) in an automatic retry loop

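A simple retry-with-backoff sketch for downloads, using only the standard library:

```python
import time
import urllib.request

def download_with_retries(url: str, dest: str, attempts: int = 5) -> None:
    # Retry transient failures automatically, waiting longer each time.
    for attempt in range(1, attempts + 1):
        try:
            urllib.request.urlretrieve(url, dest)
            return
        except OSError:
            if attempt == attempts:
                raise
            time.sleep(2 ** attempt)
```
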
## Log digests and row count of all data read in from file

In notebooks and in simulation.

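For example, in a notebook (the file path here is a hypothetical placeholder):

```python
import hashlib

import pandas as pd

path = "data.csv.gz"  # hypothetical input file
with open(path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

df = pd.read_csv(path)
# Knowing exactly which bytes and how many rows went in pays off later.
print(f"read {path}: sha256={digest} rows={len(df)}")
```
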
## gzip log files and save them, named by SLURM job ID

Your data has the SLURM job ID, so it is easy to find the relevant log when questions come up.

## You can open URLs directly inside `pandas.read_csv()`

For example, OSF URLs.

## You can open `.csv.gz`, `.csv.xz` directly inside `pandas.read_csv()`

Consider using a library to save to a `gz` filehandle directly.

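Both conveniences in one sketch; the OSF URL and file names are placeholders, not real datasets:

```python
import pandas as pd

# pandas fetches URLs and handles .gz/.xz decompression transparently.
df_remote = pd.read_csv("https://osf.io/xxxxx/download")  # placeholder URL
df_local = pd.read_csv("results.csv.gz")  # compression inferred from extension

# Writing straight to a compressed file also works.
df_local.to_csv("results.csv.xz", index=False)
```
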
## Notebooks

* commit bare notebooks
  * consider adding a git commit hook or CI script to enforce this
* run notebooks inside GitHub Actions
  * consistent, known environment
  * enforces that EVERY asset MUST be committed or downloaded from a URL
  * keeps generated cruft off of main branch history
  * use `teeplot` to generate and save copies of all plots with descriptive, systematic filenames
  * GitHub Actions pushes new executed notebooks and generated assets to a `binder` or `notebooks` branch
    * be sure to push so that history on that branch is preserved
* include graphics in LaTeX source as a git submodule
  * get new versions of graphics via `git pull`
  * be sure to use `--recursive --depth=1 --jobs=n` while cloning

## Overleaf doesn't like git submodules

AWS Cloud9 is a passable alternative, but it's hard to get collaborator buy-in and annoying to manage logins and permissions.

## Use symlinks to access your in-repository Python library

## Create supplementary materials as a LaTeX appendix to your paper

Makes moving content in/out of the supplement trivial.
You can use `pdftk dump_data_utf8` to automatically split the supplement off into a separate document as the last step of the build.

## Everything MUST print a stack trace WHEN it crashes

<https://github.com/bombela/backward-cpp>

Consider adding similar features to your job script with a bash error trap:
<https://unix.stackexchange.com/a/504829>

## Be sure to log node name on crash

Some nodes are bad.
You can exclude them so your jobs don't run on them.

## Use `bump2version` to release software

<https://github.com/c4urself/bump2version>

## Use `pip-tools` to ensure a consistent Python environment

<https://pypi.org/project/pip-tools/>

## Don't upload to the OSF directly from HPCC jobs

Their service isn't robust enough and you'll cause them issues.
AWS S3 is an easy alternative that integrates well.

## Make it clear what can be deleted, or just delete immediately

Otherwise, you will end up with a bunch of stuff and not remember what's important.

Put things in `/tmp` so they're deleted automatically.

## OnDemand

<https://ondemand.hpcc.msu.edu/>

Great for browsing and downloading files.
Also for viewing the queue and canceling individual jobs.