# Design: `datalad containers-run` support in BABS

Issue: https://github.com/PennLINC/babs/issues/328

## Goal

Use `datalad containers-run` instead of direct `singularity run` for BIDS app
execution. This provides:

- **Dynamic container path discovery** — `datalad containers-list` reads the
  path from the container dataset's `.datalad/config`, so BABS works with any
  container dataset layout (not just the hardcoded
  `containers/.datalad/environments/{name}/image`).
- **Provenance tracking** — datalad records exactly which container ran and how.
- **Compatibility with repronim/containers** — which uses
  `images/bids/{name}.sif` instead of the default datalad-containers layout.

## How it works

### During `babs init`

1. Clone the container dataset as a subdataset of the analysis dataset (unchanged)
2. `datalad containers-list` discovers the image path in the container dataset
3. `datalad containers-add` registers the container at the analysis level with:
   - The discovered image path (relative to the analysis dataset)
   - A `call_fmt` built from the user's `singularity_args` config
4. `datalad get` fetches the container image content so it's available for job
   symlinks

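Steps 2–4 can be sketched as follows (a dry run: `run` only prints each command; the container name, image path, and flags are illustrative assumptions, not BABS's exact invocation):

```shell
#!/usr/bin/env bash
set -euo pipefail
run() { printf '+ %s\n' "$*"; }   # swap the printf for "$@" to execute for real
name=toybidsapp                   # placeholder container name

# Step 2: discover the image path recorded in containers/.datalad/config
run datalad containers-list -d containers

# Step 3: register the container at the analysis level; call_fmt is built
# from the user's singularity_args (here: none)
call_fmt='singularity run -B $PWD --pwd $PWD {img} {cmd}'
run datalad containers-add "$name" \
    --call-fmt "$call_fmt" \
    --image "containers/.datalad/environments/${name}/image"

# Step 4: fetch the image content so job-time symlinks can resolve it
run datalad get "containers/.datalad/environments/${name}/image"
```
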
### During job execution (`participant_job.sh`)

A three-step flow replaces the single `datalad run` that called a combined
singularity+zip script:

1. **`datalad containers-run`** — runs the BIDS app. The container is accessed
   via a PROJECT_ROOT symlink to the analysis dataset's fetched image.
2. **`datalad run`** — calls a zip-only script (which no longer contains the
   singularity invocation or output removal).
3. **`git rm --sparse`** — removes raw outputs. Needed because `datalad run
   --explicit` doesn't track file deletions. The `--sparse` flag is required
   because the outputs are outside the sparse-checkout cone.

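The three steps above can be sketched as shell (again a dry run: `run` only prints; the container name, input/output paths, and zip script name are illustrative, not BABS's exact generated code):

```shell
#!/usr/bin/env bash
set -euo pipefail
log=""
run() { log="$log + $*"; printf '+ %s\n' "$*"; }  # print instead of execute
subid=sub-01

# Step 1: run the BIDS app through the registered container
run datalad containers-run -n toybidsapp --explicit \
    -i "inputs/data/BIDS/${subid}" -o outputs \
    "participant --participant-label ${subid}"

# Step 2: zip the results; the zip script no longer runs singularity or rm
run datalad run --explicit -o "${subid}_outputs.zip" \
    "bash code/zip_outputs.sh ${subid}"

# Step 3: drop raw outputs; --sparse because they are outside the cone,
# and datalad run --explicit would not record the deletion anyway
run git rm -r -q --sparse outputs
```
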
## Prior work

An initial proof of concept was developed before upstream's sparse-checkout PR
(#337) landed: https://github.com/asmacdo/babs/pull/6

That POC was validated end-to-end on a SLURM cluster (Discovery) with both
handmade container datasets and repronim/containers. Key findings from that
work:

- **Concurrent `datalad get` failures** — when multiple jobs run simultaneously,
  they all try to fetch the same container image. Git-annex uses transfer locks,
  causing all but one job to fail. This is a known git-annex bug:
  https://git-annex.branchable.com/bugs/concurrent_get_from_separate_clones_fails/
  The POC solved this with ephemeral clones for the containers subdataset.

- **Ephemeral clones work for containers but not for inputs** — container images
  are resolved on the host before singularity launches, so annex symlinks
  outside `$PWD` are fine. Input data, however, must be accessible inside the
  container under `--containall -B $PWD`, so symlinks pointing outside `$PWD`
  break.

- **`datalad run --explicit` doesn't track deletions** — discovered and
  documented as an upstream issue. The workaround (a separate `git rm` step) is
  carried forward into this implementation.

The current implementation was recreated from scratch on top of the
sparse-checkout changes, using the POC as a reference. The PROJECT_ROOT symlink
approach (from upstream), combined with `datalad get` of the image during init,
avoids the concurrent-get problem entirely: jobs never call `datalad get` for
the container image; they just follow the symlink to the pre-fetched content.
Ephemeral clones remain a potential future alternative.

## Problems encountered

### `git sparse-checkout` interaction

The upstream sparse-checkout PR (#337) changed job clones to use
`--no-checkout` + `git sparse-checkout`. This caused two issues:

- **`.datalad/config` not checked out** — cone mode doesn't include
  dot-directories by default. Fixed by adding `.datalad` to the sparse-checkout
  set. Without it, `containers-run` fails with "No known containers."

- **`git rm` blocked by sparse-checkout** — outputs created by `containers-run`
  are outside the cone, so `git rm` refuses to touch them. Fixed with
  `git rm --sparse`.

### Container image access in job clones

Job clones don't have the container image content — they clone from the input
RIA with `--no-checkout`. The existing PROJECT_ROOT symlink approach works:
symlink from the job clone's container path to the analysis dataset's copy.
But this requires the analysis dataset to hold the actual content (not just an
annex pointer), which is why the `datalad get` during init is necessary.

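The mechanism boils down to an ordinary filesystem symlink into the analysis dataset; a toy sketch with invented paths:

```shell
#!/usr/bin/env bash
set -euo pipefail
work=$(mktemp -d)
cd "$work"

# The analysis dataset holds the fetched image content (stand-in bytes here)
mkdir -p analysis/containers/.datalad/environments/toybidsapp
echo 'sif bytes' > analysis/containers/.datalad/environments/toybidsapp/image

# The job clone has no content of its own; it links into the analysis dataset
mkdir -p jobclone/containers/.datalad/environments/toybidsapp
PROJECT_ROOT="$PWD/analysis"
ln -s "$PROJECT_ROOT/containers/.datalad/environments/toybidsapp/image" \
      jobclone/containers/.datalad/environments/toybidsapp/image

# The link only resolves because the analysis copy has real content
test -s jobclone/containers/.datalad/environments/toybidsapp/image \
    && echo 'image reachable through symlink'
```
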
### `--containall` requires `--pwd $PWD` and `-B $PWD`

When users specify `--containall` in `singularity_args`, singularity disables
auto-binding of `$PWD` and sets the working directory to `$HOME` inside the
container. This breaks relative paths to inputs/outputs. The `call_fmt` built
during init therefore always includes `-B $PWD --pwd $PWD`.

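A minimal sketch of the composed string for a user who sets `--containall` (variable names are illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail
user_args='--containall'
# -B $PWD and --pwd $PWD are always included so --containall cannot
# unbind or change the working directory the relative paths depend on
call_fmt="singularity run -B \$PWD --pwd \$PWD ${user_args} {img} {cmd}"
echo "$call_fmt"
```
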
### datalad `{placeholder}` conflicts with shell `${variables}`

`datalad run` interprets `{name}` as a substitution placeholder, which
conflicts with shell variable expansion like `${subid}`. The POC initially
tried inlining the zip commands in `datalad run`, but this caused placeholder
conflicts. The solution is a separate zip script called by `datalad run`.

In the jinja2 templates, shell variables are wrapped in `{% raw %}` blocks
(e.g. `{% raw %}${subid}{% endraw %}`) so that jinja2 doesn't interpret them;
datalad then sees the literal `${subid}`, which it passes through to bash.

### `datalad run --explicit` doesn't track deletions

When outputs are deleted inside a `datalad run --explicit` command, datalad
doesn't record the deletion — only additions and modifications are tracked.
This is why raw-output removal is a separate `git rm` step outside of
`datalad run`.

This was fixed upstream in https://github.com/datalad/datalad/pull/7823, but
we keep the `git rm` workaround to avoid requiring the latest datalad version.

## Open questions

### Pipeline support

Pipeline currently uses a separate code path (`_bootstrap_pipeline_scripts`)
that hardcodes container paths and generates a combined script
(`pipeline_zip.sh`) with direct `singularity run` calls per step.

Questions:
- Should pipeline be "a pipeline of 1"? I.e., unify single-app and pipeline
  into one code path where single-app is just a pipeline with one step.
- If not unified, pipeline needs the same `containers-list` / `containers-add`
  treatment per container step.
- Pipeline currently puts all singularity invocations in one script called by
  one `datalad run`. With `containers-run`, each step would be a separate
  `datalad containers-run` call — better provenance, but a different structure.

### Reusing `call_fmt` / `cmdexec` from container datasets

Currently we build `call_fmt` from the user's `singularity_args` config:
```
singularity run -B $PWD --pwd $PWD {user_args} {img} {cmd}
```

Some container datasets (like repronim/containers) already define `cmdexec` in
their `.datalad/config`. Should we reuse that instead of building our own?

Related: datalad-container's freeze-script approach
(https://github.com/datalad/datalad-container/issues/287) could provide a
standardized way to handle this.

### Container image fetch during init

`datalad get` of the container image during `babs init` ensures the symlink
approach works at job time. But this makes init slow if the image isn't already
local (e.g., pulling a multi-GB .sif from a remote).

Options:
- Accept the cost during init (current approach)
- Advise users to `datalad get` the image in their source container dataset
  before running `babs init`, so the fetch is a fast local copy
- Add a separate `babs prepare` step for data orchestration that runs after
  init but before submit

### Container access mechanism

Currently using a PROJECT_ROOT symlink (inherited from upstream). This depends
on:
- The `PROJECT_ROOT` environment variable propagating through SLURM
- A shared filesystem visible from compute nodes

An alternative is ephemeral clones (`datalad get -n --reckless ephemeral
containers`), which removes these dependencies but requires changing the clone
source from the input RIA to the analysis dataset path. Deferred for now.

### Backwards-incompatible: features from `bidsapp_run.sh.jinja2` not yet ported

The old `bidsapp_run.sh.jinja2` template had several BABS-managed features that
are not yet in the `containers-run` path. These are **backwards-incompatible**
for users who depend on them and need to be restored or replaced.

- **TemplateFlow bind mount** — BABS read `TEMPLATEFLOW_HOME` from the
  environment and added `-B path:path_in_container` + `--env` to the singularity
  command. Could be restored in BABS, or users could handle this via
  `singularity_args` in their config.

- **FreeSurfer license bind mount** — BABS intercepted `--fs-license-file` from
  `bids_app_args` and converted it to a `-B` bind mount. Same options: restore
  in BABS or let users specify it via `singularity_args`.

- **BIDS filter file** — dynamically generated per job for session-level
  processing, restricting BIDS layout indexing to the current session. This is
  BIDS-app-specific (fmriprep gets `bold`, qsiprep gets `dwi`, etc.) and can't
  be handled by user config alone. It could potentially be supported via a hook
  or user-provided script mechanism (nothing like this exists in BABS today).

- **`${PWD}/` prefix on paths** — the old script passed absolute paths
  (`${PWD}/inputs/data/BIDS`). The `containers-run` path passes relative paths,
  which should work because `call_fmt` includes `--pwd $PWD`. This needs
  verification with BIDS apps that resolve paths differently.

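If the first two features are left to users, they could plausibly be expressed as plain singularity flags in `singularity_args` (hypothetical paths and values; exact config syntax per the BABS docs):

```
-B "$TEMPLATEFLOW_HOME":/templateflow --env TEMPLATEFLOW_HOME=/templateflow
-B /path/to/fs_license.txt:/opt/freesurfer/license.txt
```

These are ordinary singularity flags, so no BABS code change would be needed for those two items; the BIDS filter file has no equivalent user-side workaround.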
### `--explicit` flag

We use `--explicit` on both `containers-run` and the zip `datalad run`. The
POC required `--explicit` because the old subject pruning (`rm -rf` of other
subjects' directories) dirtied the dataset, and datalad refuses to run in a
dirty dataset without `--explicit`. Upstream's sparse-checkout PR (#337)
eliminated subject pruning, so the dataset should stay clean. `--explicit` may
no longer be strictly necessary — it is kept for safety, and removing it could
be tested.