ENH: Centralize fMRIPrep's and MRIQC's guidelines for Docker & DataLad #47


Merged: 6 commits merged on Jan 7, 2025
Changes from 2 commits
37 changes: 18 additions & 19 deletions docs/apps/datalad.md
@@ -1,10 +1,4 @@
Apps may be able to identify if the input dataset is handled with
*DataLad* or *Git-Annex*, and pull down linked data that has not
been fetched yet.
One example of one such application is *MRIQC*, and all the examples
on this documentation page will refer to it.

!!! important "Summary"
!!! abstract "Summary"

Executing *BIDS-Apps* leveraging *DataLad*-controlled datasets
within containers can be tricky.
@@ -18,6 +12,12 @@ on this documentation page will refer to it.

## *DataLad* and *Docker*

Apps may be able to identify if the input dataset is handled with
Contributor:
could/should I propose (separate PR on top?) in this section and/or https://github.com/nipreps/nipreps.github.io/blob/HEAD/docs/apps/singularity.md file to mention our https://github.com/ReproNim/containers which contains (automatically updates) pre-created singularity images for all bids-apps (thus including fmriprep, mriqc), and providing some helpers to streamline their use and more guaranteed reproducibility (isolated environment execution etc).

Note that the wrapper also tries to support non-Linux systems (OSX), where we could run singularity under docker. It could also be used on Linux if there is no singularity installation.

https://github.com/ReproNim/containers?tab=readme-ov-file#runnable-script provides a "typical" use example based on mriqc.

https://github.com/OpenNeuroDerivatives/ by @jbwexler (and @effigies ?) use that ReproNim/containers as a subdataset archive of the images, with reproman run for execution.

Contributor:
also that should include YODA aspects whenever talking about containers... with them it becomes possible to encapsulate all digital objects nicely and reproducibly (there is no guarantee that docker:// would later give you the images used etc)

Member Author:
Yes! Mentioning ReproNim on a separate PR would be fantastic.

[*DataLad*](https://www.datalad.org/) or [*git-annex*](https://git-annex.branchable.com), and pull down linked data that has not
been fetched yet.
One such application is *MRIQC*, and all the examples
on this documentation page will refer to it.

When executing *MRIQC* within *Docker* on a *DataLad* dataset
(for instance, installed from [*OpenNeuro*](https://openneuro.org)),
we will need to ensure the following settings are observed:
@@ -27,9 +27,12 @@ we will need to ensure the following settings are observed:
* the uid who is *executing MRIQC* within the container must
have sufficient permissions to write in the tree.

### Setting execution uid
### Setting a regular user's execution uid

If the uid is not correct, we will likely encounter the following error:
If the execution uid does not match the uid of the user who installed
the *DataLad* dataset, we will likely encounter the following error
with relatively recent
[*Git* versions (2.35.2 and newer)](https://github.blog/open-source/git/git-security-vulnerability-announced/#):

```
datalad.runner.exception.CommandError: CommandError: 'git -c diff.ignoreSubmodules=none -c core.quotepath=false -c annex.merge-annex-branches=false annex find --not --in . --json --json-error-messages -c annex.dotfiles=true -- sub-0001/func/sub-0001_task-restingstate_acq-mb3_bold.nii.gz sub-0002/func/sub-0002_task-emomatching_acq-seq_bold.nii.gz sub-0002/func/sub-0002_task-restingstate_acq-mb3_bold.nii.gz sub-0001/func/sub-0001_task-emomatching_acq-seq_bold.nii.gz sub-0001/func/sub-0001_task-faces_acq-mb3_bold.nii.gz sub-0001/dwi/sub-0001_dwi.nii.gz sub-0002/func/sub-0002_task-workingmemory_acq-seq_bold.nii.gz sub-0001/anat/sub-0001_T1w.nii.gz sub-0002/anat/sub-0002_T1w.nii.gz sub-0001/func/sub-0001_task-gstroop_acq-seq_bold.nii.gz sub-0002/func/sub-0002_task-faces_acq-mb3_bold.nii.gz sub-0002/func/sub-0002_task-anticipation_acq-seq_bold.nii.gz sub-0002/dwi/sub-0002_dwi.nii.gz sub-0001/func/sub-0001_task-anticipation_acq-seq_bold.nii.gz sub-0001/func/sub-0001_task-workingmemory_acq-seq_bold.nii.gz sub-0002/func/sub-0002_task-gstroop_acq-seq_bold.nii.gz' failed with exitcode 1 under /data [info keys: stdout_json] [err: 'git-annex: Git refuses to operate in this repository, probably because it is owned by someone else.
@@ -40,20 +43,16 @@ git config --global --add safe.directory /data
git-annex: automatic initialization failed due to above problems']
```

Confusingly, following the suggestion from *DataLad* directly on the host
(`git config --global --add safe.directory /data`) will not work in this
Confusingly, following the suggestion from *DataLad*
(just propagated from *Git*) of executing
`git config --global --add safe.directory /data` will not work in this
case, because this line must be executed within the container.
However, containers are *transient*, and this *Git* configuration will
not be propagated between executions unless advanced actions are taken
(such as mounting a *HOME* folder with the necessary settings).

Instead, we can override the default user executing within the container
(which is `root`, or uid = 0).
This can be achieved with
[*Docker*'s `-u`/`--user` option](https://docs.docker.com/engine/containers/run/#user):

```
--user=[ user | user:group | uid | uid:gid | user:gid | uid:group ]
```

We can combine this option with *Bash*'s `id` command to ensure the current user's uid and group id (gid) are being set.
Let's update the last example in the previous
[*Docker* execution section](docker.md#running-a-niprep-directly-interacting-with-the-docker-engine):
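
As a rough sketch (the *MRIQC* image tag and host paths are placeholders,
and the dataset mount is left writable so that annexed files can be
fetched), the updated command could look like:

``` {.shell hl_lines="2 4"}
$ docker run -ti --rm \
-v path/to/datalad-dataset:/data \
-v path/to/output:/out \
-u $(id -u):$(id -g) \
nipreps/mriqc:<latest-version> \
/data /out/mriqc-<latest-version> \
participant
```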

242 changes: 194 additions & 48 deletions docs/apps/docker.md
@@ -1,4 +1,4 @@
!!! important "Summary"
!!! abstract "Summary"

Here, we describe how to run *NiPreps* with Docker containers.
To illustrate the process, we will show the execution of *fMRIPrep*, but these guidelines extend to any other end-user *NiPrep*.
@@ -41,6 +41,15 @@ For more examples and ideas, visit:

After checking your Docker Engine is capable of running Docker images, you are ready to pull your first *NiPreps* container image.

!!! tip "Troubleshooting"

If you encounter issues while executing a containerized application,
it is critical to identify where the fault originates.
For issues emerging from the *Docker Engine*, please read the
[corresponding troubleshooting guidelines](https://docs.docker.com/desktop/troubleshoot-and-support/troubleshoot/#volumes).
Once you have verified that the problem is not related to the container system,
follow the specific application's debugging guidelines.

## Docker images

For every new version of the particular *NiPrep* app that is released, a corresponding Docker image is generated.
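
For instance, pulling a tagged image from *Docker Hub* before the first
run could look like the following sketch (replace the application name
and the `<latest-version>` placeholder with the image and tag you need):

``` Shell
$ docker pull nipreps/fmriprep:<latest-version>
```
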
@@ -82,73 +91,210 @@ This tutorial also provides valuable troubleshooting insights and advice on what

If you need a finer control over the container execution, or you feel comfortable with the Docker Engine, avoiding the extra software layer of the wrapper might be a good decision.

**Accessing filesystems in the host within the container**:
Containers are confined in a sandbox, so they can't access the host in any ways
unless you explicitly prescribe acceptable accesses to the host. The
Docker Engine provides mounting filesystems into the container with the
`-v` argument and the following syntax:
`-v some/path/in/host:/absolute/path/within/container:ro`, where the
trailing `:ro` specifies that the mount is read-only. The mount
permissions modifiers can be omitted, which means the mount will have
read-write permissions. In general, you'll want to at least provide two
mount-points: one set in read-only mode for the input data and one
read/write to store the outputs. Potentially, you'll want to provide
one or two more mount-points: one for the working directory, in case you
need to debug some issue or reuse pre-cached results; and a
[TemplateFlow](https://www.templateflow.org) folder to preempt the
download of your favorite templates in every run.

**Running containers as a user**:
By default, Docker will run the
container as **root**. Some share systems my limit this feature and only
allow running containers as a user. When the container is run as
**root**, files written out to filesystems mounted from the host will
have the user id `1000` by default. In other words, you'll need to be
able to run as root in the host to change permissions or manage these
files. Alternatively, running as a user allows preempting these
permissions issues. It is possible to run as a user with the `-u`
argument. In general, we will want to use the same user ID as the
running user in the host to ensure the ownership of files written during
the container execution. Therefore, you will generally run the container
with `-u $( id -u )`.

You may also invoke `docker` directly:
### Accessing filesystems in the host within the container

Containers are confined in a sandbox, so they cannot access the host
in any way unless you explicitly prescribe acceptable accesses
to the host.
The Docker Engine allows mounting filesystems into the container with the `-v` argument and the following syntax:
`-v some/path/in/host:/absolute/path/within/container:ro`,
where the trailing `:ro` specifies that the mount is read-only.
The mount permissions modifiers can be omitted, which means the mount
will have read-write permissions.
In general, you'll want to at least provide two mount-points:
one set in read-only mode for the input data and one read/write
to store the outputs:

``` {.shell hl_lines="2 3"}
$ docker run -ti --rm \
-v path/to/data:/data:ro \ # read-only, for data
-v path/to/output:/out \ # read-write, for outputs
nipreps/fmriprep:<latest-version> \
/data /out/out \
participant
```

``` Shell
When **debugging** or **reusing pre-cached intermediate results**,
you'll also need to mount some working directory that otherwise
is not exposed by the application.
In the case of *NiPreps*, we typically instruct the *BIDS App*
to override the work directory by setting the `-w`/`--work-dir`
argument (please note that this is not defined by the *BIDS Apps*
specifications and it may change across applications):

``` {.shell hl_lines="4 8"}
$ docker run -ti --rm \
-v path/to/data:/data:ro \
-v path/to/output:/out \
-v path/to/work:/work \ # mount from host
nipreps/fmriprep:<latest-version> \
/data /out/out \
Contributor:

btw -- isn't there freesurfer license needed to be passed somehow?

Contributor (@mckenziephagen, Jan 7, 2025):

Further up, on line 18, it says that all examples will refer to MRIQC (which doesn't require freesurfer license).
nvm, didn't realize this was the docker.md file, not the datalad.md file.

Member Author:

Yes, I had the freesurfer license issue in mind. At the moment it's defined in fMRIPrep's documentation but we probably want to bubble it up here.

participant
-w /work # override default directory
```

*BIDS Apps* relying on [TemplateFlow](https://www.templateflow.org)
for atlases and templates management may require
the *TemplateFlow Archive* be mounted from the host.
Mounting the *Archive* from the host is an effective way
to preempt the download of your favorite templates in every run:

``` {.shell hl_lines="5 6"}
$ docker run -ti --rm \
-v path/to/data:/data:ro \
-v path/to/output:/out \
-v path/to/work:/work \
-v path/to/tf-cache:/opt/templateflow \ # mount from host
-e TEMPLATEFLOW_HOME=/opt/templateflow \ # override TF home
nipreps/fmriprep:<latest-version> \
/data /out/out \
participant
-w /work
```

!!! warning "*Docker for Windows* requires enabling Shared Drives"

On *Windows* installations, the `-v` argument will not work
by default because it is necessary to enable shared drives.
Please check this [Stackoverflow post](https://stackoverflow.com/a/51822083) to learn how to enable them.

### Running containers as a user
By default, Docker will run the container with the
user id (uid) **0**, which is reserved for the default **root**
account in *Linux*.
In other words, by default *Docker* will use the superuser account
Contributor:
Unless the actual docker there is podman which by default would run rootless ;)

ha@hopa:~$ rm -f 123; docker run -it -v $PWD:$PWD -w $PWD --entrypoint bash nipreps/fmriprep:latest -c 'echo `whoami` am in; id; pwd; touch 123; ls -l 123' ; echo "`whoami` is out"; ls -ld 123
Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
root am in
uid=0(root) gid=0(root) groups=0(root)
/home/ha
-rw-r--r-- 1 root root 0 Jan  7 14:19 123
ha is out
-rw-r--r-- 1 ha ha 0 Jan  7 09:19 123

I am yet to try podman on hpc/for hpc, using primarily for services, but googled into https://github.com/NERSC/podman-hpc ... worth researching and at least pointing users that there is another OCI solution podman which might give them easier means to run compute on their infrastructure.

to execute the container and will write files with the corresponding
uid=0 unless configured otherwise.
Executing as superuser may result in permissions and security issues,
for example, [with *DataLad* (discussed later)](datalad.md#).
One paramount example of a permissions issue that beginners typically
run into is deleting files after a containerized execution.
If the uid is not overridden, the outputs of a containerized execution
will be owned by user **root** and group **root**.
Therefore, normal users will not be able to modify the outputs, and
superuser permissions will be required to delete data generated
by the containerized application.
Some shared systems only allow running containers as a normal user
because the user would not otherwise be able to act on the outputs.

Either way (whether the container runs with its default settings
or the system restricts execution to normal users),
running as a normal user preempts these permissions issues.
This can be achieved with
[*Docker*'s `-u`/`--user` option](https://docs.docker.com/engine/containers/run/#user):

```
--user=[ user | user:group | uid | uid:gid | user:gid | uid:group ]
```

For example: :
We can combine this option with *Bash*'s `id` command to ensure the current user's uid and group id (gid) are being set:

``` {.shell hl_lines="4"}
$ docker run -ti --rm \
-v path/to/data:/data:ro \
-v path/to/output:/out \
-u $(id -u):$(id -g) \ # set execution uid:gid
-v path/to/tf-cache:/opt/templateflow \ # mount from host
-e TEMPLATEFLOW_HOME=/opt/templateflow \ # override TF home
nipreps/fmriprep:<latest-version> \
/data /out/out \
participant
```

For example:

``` Shell
$ docker run -ti --rm \
-v $HOME/ds005:/data:ro \
-v $HOME/ds005/derivatives:/out \
-v $HOME/tmp/ds005-workdir:/work \
-u $(id -u):$(id -g) \
-v $HOME/.cache/templateflow:/opt/templateflow \
-e TEMPLATEFLOW_HOME=/opt/templateflow \
nipreps/fmriprep:<latest-version> \
/data /out/fmriprep-<latest-version> \
participant \
-w /work
```

### Application-specific options

Once the Docker Engine arguments are written, the remainder of the
command line follows the [usage](https://fmriprep.readthedocs.io/en/latest/usage.html).
In other words, the first section of the command line is all equivalent to the
`fmriprep` executable in a *bare-metal* installation: :
command line follows the interface defined by the specific
*BIDS App* (for instance,
[*fMRIPrep*](https://fmriprep.readthedocs.io/en/latest/usage.html)
or [*MRIQC*](https://mriqc.readthedocs.io/en/latest/running.html#command-line-interface)).

``` Shell
$ docker run -ti --rm \ # These lines
-v $HOME/ds005:/data:ro \ # are equivalent to
-v $HOME/ds005/derivatives:/out \ # a call to the App's
-v $HOME/tmp/ds005-workdir:/work \ # entry-point.
nipreps/fmriprep:<latest-version> \ #
\
/data /out/fmriprep-<latest-version> \ # These lines correspond
participant \ # to the particular BIDS
-w /work # App arguments.
```
The first section of a call comprises arguments specific to *Docker*,
which configure the execution of the container:

``` {.shell hl_lines="1-7"}
$ docker run -ti --rm \
-v $HOME/ds005:/data:ro \
-v $HOME/ds005/derivatives:/out \
-v $HOME/tmp/ds005-workdir:/work \
-u $(id -u):$(id -g) \
-v $HOME/.cache/templateflow:/opt/templateflow \
-e TEMPLATEFLOW_HOME=/opt/templateflow \
nipreps/fmriprep:<latest-version> \
/data /out/fmriprep-<latest-version> \
participant \
-w /work
```

Then, we specify the container image that we execute:

``` {.shell hl_lines="8"}
$ docker run -ti --rm \
-v $HOME/ds005:/data:ro \
-v $HOME/ds005/derivatives:/out \
-v $HOME/tmp/ds005-workdir:/work \
-u $(id -u):$(id -g) \
-v $HOME/.cache/templateflow:/opt/templateflow \
-e TEMPLATEFLOW_HOME=/opt/templateflow \
nipreps/fmriprep:<latest-version> \
/data /out/fmriprep-<latest-version> \
participant \
-w /work
```

Finally, the application-specific options can be added.
We already described the work directory setting above for
*NiPreps* such as *MRIQC* and *fMRIPrep*.
Some options are *BIDS Apps* standard, such as
the *analysis level* (`participant` or `group`)
and specific participant identifier(s) (`--participant-label`):

``` {.shell hl_lines="9-12"}
$ docker run -ti --rm \
-v $HOME/ds005:/data:ro \
-v $HOME/ds005/derivatives:/out \
-v $HOME/tmp/ds005-workdir:/work \
-u $(id -u):$(id -g) \
-v $HOME/.cache/templateflow:/opt/templateflow \
-e TEMPLATEFLOW_HOME=/opt/templateflow \
nipreps/fmriprep:<latest-version> \
/data /out/fmriprep-<latest-version> \
participant \
--participant-label 001 002 \
-w /work
```

### Resource constraints

*Docker* may be executed with limited resources.
Please [read the documentation](https://docs.docker.com/engine/containers/resource_constraints/)
to limit resources such as memory, memory policies, number of CPUs, etc.
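
As an illustrative sketch (the limits below are arbitrary and should be
adapted to your dataset and hardware), such constraints can be passed
directly to `docker run`:

``` {.shell hl_lines="2 3"}
$ docker run -ti --rm \
--memory=8g \
--cpus=4 \
-v $HOME/ds005:/data:ro \
-v $HOME/ds005/derivatives:/out \
nipreps/fmriprep:<latest-version> \
/data /out/fmriprep-<latest-version> \
participant
```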

**Memory will be a common culprit** when working with large datasets
(larger than 10GB).
Contributor:

another shameless plug which might be of interest/help.

Inspired by our reproman, and BrainLife's helper to monitor execution of compute, we created a simple helper https://github.com/con/duct which could be of help to monitor/identify memory and cpu requirements for e.g. future informed specification for job parameters or plotting resource consumption during compute. It also takes care about storing stdout/stderr outputs produced, thus making it possible (if used along with datalad *run) to capture those for possible future troubleshooting etc.

Member Author:

If you finally send an additional PR, I'm happy to see this documented there :)

However, the *Docker* engine is limited to 2GB of RAM by default
in some installations of *Docker* for *macOS* and *Windows*.
The general resource settings can also be modified through the *Docker Desktop*
graphical user interface.
On a shell, the memory limit can be overridden with:

```
$ service docker stop
$ dockerd --storage-opt dm.basesize=30G
```