Singularity support #18

Merged
t0mdavid-m merged 19 commits into main from singularity_support
May 15, 2026

Conversation

t0mdavid-m (Member) commented May 15, 2026

Summary by CodeRabbit

  • New Features

    • Added support for running the application on Apptainer/Singularity (HPC environments).
    • Enabled multi-instance Streamlit deployments with automatic load balancing.
    • Improved container startup with better Redis and worker process management.
  • Documentation

    • Added Apptainer/Singularity usage guide with bind mount configuration.
  • Chores

    • Enhanced CI/CD pipeline with Docker image caching and automated Apptainer image building and publishing.
    • Added automatic cleanup of old container images from the registry.

t0mdavid-m and others added 16 commits May 13, 2026 11:14
Apptainer mounts the container filesystem read-only by default, so the
existing entrypoint failed in two places:
  * cron could not write /var/run/crond.pid
  * redis-server could not access /var/lib/redis (and on Ubuntu it loads
    the system /etc/redis/redis.conf which points there)

Refactor the inline-heredoc entrypoint into entrypoint.sh and redirect
all runtime state (Redis data + pidfile, generated nginx config, nginx
pidfile + temp dirs) to RUNTIME_DIR (default /tmp/opendiakiosk), which
is writable as tmpfs under Apptainer and as overlay under Docker. cron
failure is now non-fatal so the rest of the app still starts when
scheduled cleanup is unavailable.

Docker behavior is unchanged - the queue is already configured with
appendonly=no so moving the Redis dir to a tmpfs path costs nothing.

Apptainer (the HPC container runtime) mounts the root filesystem read-only
by default and runs the container as the host user's UID. The existing
entrypoint failed under both conditions:

  cron: can't open or create /var/run/crond.pid: Read-only file system
  FATAL CONFIG FILE ERROR (Redis 7.0.15) ... 'dir "/var/lib/redis"'

Extract the inline heredoc from the Dockerfile into a standalone
docker/entrypoint.sh that auto-detects the read-only root (APPTAINER_NAME
env var or /var/run write probe) and falls back to /tmp/openms-runtime-$$
for Redis data dir, Redis/nginx PID files, and the generated nginx.conf.
Skip cron entirely when the root FS is read-only — workspace cleanup is a
nice-to-have, not a hard requirement.
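
The detection described above can be sketched in Python for illustration. The real logic lives in the shell entrypoint, so the function names here are hypothetical, and returning `None` for the writable case is an assumption that stands in for "keep the image defaults":

```python
import os
import tempfile
from typing import Optional


def root_is_read_only(probe_dir: str = "/var/run") -> bool:
    """Return True when the container root looks read-only.

    Uses the two signals named in the commit message: the APPTAINER_NAME
    environment variable, or a failed write probe in probe_dir.
    """
    if os.environ.get("APPTAINER_NAME"):
        return True
    try:
        # Creating (and auto-deleting) a temp file is the write probe.
        with tempfile.NamedTemporaryFile(dir=probe_dir):
            pass
    except OSError:
        return True
    return False


def pick_runtime_dir(read_only: bool, pid: int) -> Optional[str]:
    """Pick the fallback directory for Redis data, PID files and nginx.conf."""
    if read_only:
        # Same shape as /tmp/openms-runtime-$$ in the shell entrypoint.
        return f"/tmp/openms-runtime-{pid}"
    return None  # assumption: None means the image defaults stay in place
```

A caller would typically run `pick_runtime_dir(root_is_read_only(), os.getpid())` once at startup and create the directory if a path comes back.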

The same script powers both Dockerfile and Dockerfile_simple: the Redis/RQ
section is gated on `command -v redis-server` so the simple image (no
redis installed) is a no-op for that block. Drop `chown redis:redis` on
/var/lib/redis — under apptainer the in-image redis UID is unreachable.

Add a test-apptainer CI job that reuses the build artifact, installs
apptainer, converts to SIF, starts an instance, and waits for
/_stcore/health. Reproduces the bug on the pre-fix entrypoint and gates
future regressions for both image variants.

CI exposed three follow-up issues after the initial apptainer port:

1. /root is mode 0700 in the stock ubuntu base image. Docker runs the
   entrypoint as root so this is invisible, but apptainer maps the host
   user UID into the container — that user can't traverse /root, so
   `source /root/miniforge3/bin/activate ...` (the first executable line
   of the entrypoint) fails with EACCES, set -e exits, and the apptainer
   instance dies before streamlit binds 8501. Add `chmod o+x /root` in
   both Dockerfiles so the path is traversable by anyone, keeping the
   directory listing private.

2. Bound the `until redis-cli ping` loop (CodeRabbit OpenMS#387 review). If
   redis-server fails to bind 6379 (e.g. apptainer's shared host net
   namespace has the port taken), the loop spun forever and the health
   check timed out with no actionable error. Now retries
   REDIS_STARTUP_RETRIES times (default 30s) and exits 1 with a clear
   message on timeout.

3. Drop `chmod 0777 /var/lib/redis` (CodeRabbit OpenMS#387 review). Docker mode
   writes here as root regardless of mode bits, and apptainer mode never
   uses this path (the entrypoint relocates to /tmp/openms-runtime-*),
   so 0755 root-owned is correct and matches least-privilege.
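
The bounded wait from item 2 can be sketched as follows. This is a Python illustration of the pattern, not the entrypoint's shell code; `wait_for_redis` and `redis_cli_ping` are hypothetical names, and the one-second delay between attempts is an assumption:

```python
import subprocess
import time
from typing import Callable


def redis_cli_ping() -> bool:
    """Return True when `redis-cli ping` answers PONG."""
    try:
        result = subprocess.run(
            ["redis-cli", "ping"], capture_output=True, text=True, timeout=2
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # redis-cli missing or hung counts as "not ready"
    return result.stdout.strip() == "PONG"


def wait_for_redis(
    ping: Callable[[], bool], retries: int = 30, delay: float = 1.0
) -> bool:
    """Call ping() up to `retries` times; give up instead of spinning forever."""
    for attempt in range(1, retries + 1):
        if ping():
            return True
        if attempt < retries:
            time.sleep(delay)
    return False
```

A caller would exit non-zero with a clear message when `wait_for_redis(redis_cli_ping)` returns `False`, which is the behavior the commit describes.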
`apptainer instance start` does not consistently honor a Docker image's
WORKDIR — the container's CWD ends up being the host CWD at invocation
(e.g. the GH Actions checkout root), so `streamlit run app.py` resolves
against the wrong directory, exits with "Error: file not found", and the
apptainer instance dies before binding 8501. The health check then times
out at "Wait for streamlit /_stcore/health" with no obvious trace —
exactly the failure seen on test-apptainer (full) and (simple).

Anchor the entrypoint at /app explicitly. In docker mode WORKDIR /app is
already set so this is a no-op; in apptainer mode it's the actual fix.

Also stamp pwd+uid into the "Starting Streamlit app" log line so future
breakage shows the resolved CWD/user in the apptainer logs without
needing to re-instrument.

A user setting STREAMLIT_SERVER_COUNT > 1 on the simple image variant
(no nginx installed) currently gets a single Streamlit instance with no
log indication, making the misconfiguration silent and hard to diagnose.
Emit a clear WARN before the load-balancer branch falls through.
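
As a rough illustration of that guard, here is a Python sketch. The function name and the warning wording are hypothetical; the actual check is shell code in the entrypoint:

```python
import sys


def effective_server_count(requested: int, nginx_available: bool) -> int:
    """Return how many Streamlit instances will really start.

    Without nginx there is no load balancer, so a request for more than one
    instance falls back to a single instance, now with a visible warning.
    """
    if requested > 1 and not nginx_available:
        print(
            f"WARN: STREAMLIT_SERVER_COUNT={requested} but nginx is not "
            "installed; starting a single Streamlit instance",
            file=sys.stderr,
        )
        return 1
    return max(requested, 1)
```

In practice `nginx_available` could come from `shutil.which("nginx") is not None`.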

Addresses CodeRabbit review on OpenMS#387.

The test-apptainer job's "Dump entrypoint logs on failure" step is
post-mortem and easy to miss in the GH Actions UI; combined with the
auth-walled API on the logs endpoint, every prior failure left us
guessing at what the entrypoint actually printed.

Two changes, both no-op when the test passes:

1. The wait loop now tails the apptainer instance .out / .err every
   five attempts (and dumps them in full on timeout), so failures
   surface inline. The "Start apptainer instance" step exports the
   discovered log dir into $GITHUB_ENV so the next step can read it
   without re-deriving from hostname + whoami.

2. The entrypoint logs uid/gid/cwd, the relevant APPTAINER_* env vars,
   and whether `streamlit` resolves after conda activation. Two echo
   lines — harmless in docker mode where the logs aren't read, and the
   missing data on the apptainer side has been the whole bottleneck.
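
The wait loop from item 1 might look roughly like this in Python. The CI step itself is a shell loop in the workflow file; the names `wait_for_health` and `read_tail`, the 20-line tail, and the default delay are assumptions made for the sketch:

```python
import time
from pathlib import Path
from typing import Callable, List, Optional, Sequence


def read_tail(path: Path, lines: Optional[int] = 20) -> List[str]:
    """Return the last `lines` lines of a log, or all of it when lines is None."""
    try:
        content = path.read_text(errors="replace").splitlines()
    except OSError:
        return []  # the log may not exist yet
    return content if lines is None else content[-lines:]


def wait_for_health(
    check: Callable[[], bool],
    logs: Sequence[Path],
    attempts: int = 30,
    delay: float = 2.0,
    tail_every: int = 5,
    emit: Callable[[str], None] = print,
) -> bool:
    """Poll check(); tail the logs every few attempts, dump them on timeout."""
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt % tail_every == 0:
            for log in logs:
                emit(f"--- tail of {log} after attempt {attempt} ---")
                for line in read_tail(log):
                    emit(line)
        time.sleep(delay)
    for log in logs:
        emit(f"--- full {log} on timeout ---")
        for line in read_tail(log, lines=None):
            emit(line)
    return False
```

Here `check` would be a small function that requests `/_stcore/health` on port 8501, and `logs` would be the instance's `.out` and `.err` files. Nothing is printed when the first check succeeds, matching the "no-op when the test passes" intent.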
The test-apptainer job's instance came up cleanly (apptainer instance
list reported PID 2586) but the entrypoint's first echo never landed in
.out/.err — both files were empty when dumped on timeout. The Docker
ENTRYPOINT was translated into the SIF's %runscript only; %startscript
on a docker-archive build defaults to a no-op `exec "$@"`. So
`apptainer instance start` was launching an empty daemon and streamlit
never bound 8501.

`apptainer instance run` (added in apptainer 1.1) starts a persistent
named instance AND executes %runscript inside it — the verb actually
intended for OCI-derived SIFs. With this change the entrypoint runs,
the breadcrumbs added in 189a94b will appear in the instance log, and
the health endpoint should come up for both image variants.

…tainer-Sdrl8

Support Apptainer/Singularity with read-only root filesystem
…ds attach

A user binding host storage onto /workspaces-streamlit-template via
singularity hit "[Errno 30] Read-only file system" the moment the app
tried to mkdir a workspace, even though they passed `:rw` on the bind.

Root cause: neither Dockerfile creates /workspaces-streamlit-template
or /mounted-data. Docker auto-creates missing `-v` mount targets, but
singularity uses a read-only underlay when the destination isn't a real
directory in the SIF and silently degrades the bind — writes then go to
the read-only squashfs and fail with EROFS regardless of the `:rw` flag.

`mkdir -p` both paths in Dockerfile and Dockerfile_simple. Cost: one
inode each. Behavior in docker mode is unchanged (a `-v` mount, k8s
volumeMount, or compose volume shadows the empty dir). Behavior in
singularity-without-bind is unchanged — writes still fail with EROFS,
just one frame later in the path (parent in squashfs vs. parent missing
entirely); persistent storage still requires a bind.

CI guard: the test-apptainer job now starts the instance with explicit
`--bind /tmp/host-workspaces:/workspaces-streamlit-template:rw` and
`--bind /tmp/host-mounted-data:/mounted-data:ro`, then execs into the
running instance to write a probe file and asserts it appears on the
host with the expected contents (plus a read-side check on the :ro
mount). Without the mkdir, the probe write would fail with EROFS and
the test would fail closed.

…ence

The previous detection in StreamlitUI._mounted_data_root() returned the
path as soon as it resolved to an existing directory, treating
"directory exists" as proof that an operator bound something there.
That worked only because docker auto-creates `-v` targets and the image
never pre-created /mounted-data — so existence was a reliable proxy
for "mount happened."

The companion fix (pre-creating /workspaces-streamlit-template and
/mounted-data in the Dockerfile so singularity binds attach read-write)
breaks that assumption: the path now always exists. Without this change,
the upload widget would render an empty mounted-drive browser to every
docker user without a `-v` flag and every apptainer user without a
`--bind`.

Switch to os.path.ismount(): a true mount point (docker -v, k8s
volumeMount, singularity --bind) crosses filesystems and trips the
kernel's mount detection; an empty image-baked dir doesn't. The
detection now asks the question we actually meant to ask.
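
A minimal sketch of the mount-point check, assuming a standalone helper. The real method is `StreamlitUI._mounted_data_root()`, whose exact signature and return type are not shown in this PR text; `mounted_data_root` below is a hypothetical stand-in:

```python
import os
from typing import Optional


def mounted_data_root(path: str = "/mounted-data") -> Optional[str]:
    """Return the resolved path only when something is actually mounted there.

    An empty directory baked into the image exists but is not a mount point,
    so it yields None and the mounted-drive browser stays hidden.
    """
    resolved = os.path.realpath(path)
    if os.path.isdir(resolved) and os.path.ismount(resolved):
        return resolved
    return None
```

The `os.path.isdir` check is kept alongside `os.path.ismount` so that a missing path also yields `None` instead of relying on `ismount` alone.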

CI guard: the test-apptainer bind step now asserts os.path.ismount()
returns True for both /mounted-data and /workspaces-streamlit-template
under `apptainer instance run --bind`, so the gating logic stays
consistent with the kernel's view if either side drifts.

…ntpoints

fix(singularity): pre-create /workspaces and /mounted-data so :rw bin…
…tainer-dHZXJ

Extract entrypoint script to file for Apptainer compatibility

Diagnostic from a user reproducing the workflow EROFS revealed the real
chain of failure under singularity:

  Starting Redis server (data=/tmp/openms-runtime-452993/redis)...
  Redis is ready
  Starting 1 RQ worker(s)...
  Starting Streamlit app (cwd=/app, uid=1000)...
  ERROR:root:There exists an active worker named 'worker-1' already

Apptainer/singularity share the host's network namespace by default.
When the host has anything listening on 6379 — a system redis-server,
a docker container, a previous singularity instance that didn't clean
up — our `redis-server --daemonize yes` silently fails to bind with
EADDRINUSE, but because daemonize forks before the listen-error
surfaces, the entrypoint's parent shell returns 0 and the subsequent
`redis-cli ping` happily connects to the *host's* redis instead.
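
To make the failure mode concrete, the following Python sketch shows one way to tell whether a TCP port is already taken before starting a daemon. It is not part of this PR, which avoids TCP entirely in apptainer mode; `tcp_port_in_use` is a hypothetical helper:

```python
import errno
import socket


def tcp_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True when binding host:port fails with EADDRINUSE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # e.g. EACCES on a privileged port is a different problem
    return False
```

A probe like `tcp_port_in_use(6379)` surfaces the conflict in the foreground, whereas the daemonized `redis-server` reports it only after the parent shell has already returned 0.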

From there:
- RQ tries to register `worker-1` against the host's redis → conflicts
  with stale state from a previous run, the worker dies.
- Streamlit enqueues to the host's redis; the workflow job is consumed
  by whatever stale worker is still alive on the host, which runs the
  mkdir outside our mount namespace (no /workspaces-streamlit-template
  bind there) and hits EROFS at the squashfs root.

A Unix socket sidesteps the entire problem class: when the entrypoint
detects read-only-root (apptainer mode), it now starts redis with
`--unixsocket $RUNTIME_DIR/redis.sock --port 0` (no TCP listener at
all) and exports `REDIS_URL=unix://<socket>` so streamlit's
QueueManager and the RQ worker can only connect to *our* redis.
Docker mode is unchanged (TCP 6379 on localhost as before, no socket).

Also: write the resolved URL to /tmp/openms-redis-url so `apptainer
exec` can discover it for diagnostics (env doesn't propagate across
exec invocations). The test-apptainer CI step now reads that marker
and pings with `redis-cli -s <sock>` accordingly.
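
The socket-versus-TCP choice can be sketched as a small Python planner. The function names are hypothetical, the Docker-mode URL `redis://localhost:6379/0` is an assumption (the commit only says TCP 6379 on localhost), and the entrypoint itself does this in shell:

```python
import os
from typing import List, Tuple


def redis_launch_plan(
    runtime_dir: str, data_dir: str, read_only_root: bool
) -> Tuple[List[str], str]:
    """Build the redis-server argv and the matching REDIS_URL."""
    cmd = ["redis-server", "--daemonize", "yes", "--dir", data_dir]
    if read_only_root:
        sock = os.path.join(runtime_dir, "redis.sock")
        # --port 0 disables the TCP listener, so only the socket is reachable.
        cmd += ["--unixsocket", sock, "--port", "0"]
        url = f"unix://{sock}"
    else:
        cmd += ["--port", "6379"]
        url = "redis://localhost:6379/0"  # assumed Docker-mode URL
    return cmd, url


def write_url_marker(url: str, marker: str = "/tmp/openms-redis-url") -> None:
    """Persist the URL so a later `apptainer exec` can discover it."""
    with open(marker, "w", encoding="utf-8") as handle:
        handle.write(url + "\n")
```

With `runtime_dir="/tmp/openms-runtime-123"` and `read_only_root=True`, the resulting URL is `unix:///tmp/openms-runtime-123/redis.sock`, which is the form a client would receive through `REDIS_URL`.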
…ntpoints

fix(apptainer): use unix socket for Redis so host:6379 can't shadow us
coderabbitai Bot commented May 15, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dff8475a-142c-4366-a99e-db197783f2ba

📥 Commits

Reviewing files that changed from the base of the PR and between d58a94d and f01c16b.

📒 Files selected for processing (7)
  • .github/workflows/build-and-test.yml
  • .github/workflows/ghcr-cleanup.yml
  • Dockerfile
  • README.md
  • docker/entrypoint.sh
  • entrypoint.sh
  • src/workflow/StreamlitUI.py

Walkthrough

This PR introduces Apptainer/Singularity support for HPC deployments by adding a complete container build and test pipeline, modifying container startup to handle read-only filesystems with Unix sockets, publishing validated SIF images to GHCR with retention, and updating integration tests to consume image artifacts.

Changes

Apptainer Container Support

| Layer | File(s) | Summary |
| --- | --- | --- |
| Container Runtime & Entrypoint | Dockerfile, docker/entrypoint.sh, entrypoint.sh | Dockerfile enables non-root execution (chmod o+x /root), installs Redis/nginx, and copies entrypoint script; entrypoint detects read-only mode (Apptainer/Singularity), selects runtime directories, manages Redis startup with unix socket support, launches RQ workers, and orchestrates single or multi-instance Streamlit behind nginx with cookie-based routing. |
| Apptainer SIF Build & Publish Pipeline | .github/workflows/build-and-test.yml, .github/workflows/ghcr-cleanup.yml | Build job saves Docker image as tarball artifact; test-apptainer job loads tarball, builds SIF, verifies Streamlit health and Redis/mount-point behavior, and uploads validated artifact; publish-apptainer job pushes SIF to GHCR with tag variants using ORAS; cleanup job enforces retention policy for old commits and untagged manifests. |
| Integration Test Updates | .github/workflows/build-and-test.yml | NGINX and Traefik test jobs download and load image artifact from build job, dynamically compute overlay SLUG from production kustomization, verify Redis readiness with discovered selector, wait for deployments, and validate health endpoints with extended readiness retry loops. |
| Documentation & Mount Validation | README.md, src/workflow/StreamlitUI.py | README adds Apptainer/HPC section with pull/run commands, supported tags, and conversion fallback; StreamlitUI restricts data directory rendering to actual mount points using os.path.ismount() instead of directory existence alone. |

A rabbit hops through clouds of containers bright,
From Docker's warmth to HPC's read-only night,
With Apptainer's speed and SLUG-derived grace,
Mount points are blessed, a validated place! 🐰📦

claude and others added 3 commits May 15, 2026 21:12
Reuse the SIF that test-apptainer already builds and validates: upload it
as a workflow artifact when validation passes, then push it to
ghcr.io/<owner>/<repo>/sif:<tag> from a new publish-apptainer job. Tag
scheme mirrors the docker image (branch/sha/version-<variant> plus bare
`latest` for full+main). Sibling /sif package keeps tag lists clean and
cleanup policies independent. README now points HPC users at the prebuilt
ORAS path instead of the slow on-the-fly OCI->SIF conversion.

https://claude.ai/code/session_01NumLyfkQ3w3JF3TU8jM1iX
…osting-eH5Gh

Publish prebuilt Apptainer SIFs to GHCR via ORAS
t0mdavid-m merged commit 0e2df38 into main May 15, 2026
6 of 7 checks passed

2 participants