Singularity support #18
Apptainer mounts the container filesystem read-only by default, so the
existing entrypoint failed in two places:
* cron could not write /var/run/crond.pid
* redis-server could not access /var/lib/redis (and on Ubuntu it loads
the system /etc/redis/redis.conf which points there)
Refactor the inline-heredoc entrypoint into entrypoint.sh and redirect
all runtime state (Redis data + pidfile, generated nginx config, nginx
pidfile + temp dirs) to RUNTIME_DIR (default /tmp/opendiakiosk), which
is writable as tmpfs under Apptainer and as overlay under Docker. cron
failure is now non-fatal so the rest of the app still starts when
scheduled cleanup is unavailable.
Docker behavior is unchanged - the queue is already configured with
appendonly=no so moving the Redis dir to a tmpfs path costs nothing.
Apptainer (the HPC container runtime) mounts the root filesystem read-only by default and runs the container as the host user's UID. The existing entrypoint failed under both conditions:

    cron: can't open or create /var/run/crond.pid: Read-only file system
    FATAL CONFIG FILE ERROR (Redis 7.0.15) ... 'dir "/var/lib/redis"'

Extract the inline heredoc from the Dockerfile into a standalone docker/entrypoint.sh that auto-detects the read-only root (APPTAINER_NAME env var or /var/run write probe) and falls back to /tmp/openms-runtime-$$ for Redis data dir, Redis/nginx PID files, and the generated nginx.conf. Skip cron entirely when the root FS is read-only - workspace cleanup is a nice-to-have, not a hard requirement.

The same script powers both Dockerfile and Dockerfile_simple: the Redis/RQ section is gated on `command -v redis-server` so the simple image (no redis installed) is a no-op for that block.

Drop `chown redis:redis` on /var/lib/redis - under apptainer the in-image redis UID is unreachable.

Add a test-apptainer CI job that reuses the build artifact, installs apptainer, converts to SIF, starts an instance, and waits for /_stcore/health. Reproduces the bug on the pre-fix entrypoint and gates future regressions for both image variants.
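The detection step described above can be sketched in Python for illustration. The real logic lives in the shell entrypoint; the function names here and the `None` sentinel for "keep the image defaults" are hypothetical, and only the `APPTAINER_NAME` check, the /var/run write probe, and the /tmp/openms-runtime-<pid> fallback come from the commit message.

```python
import os
import tempfile


def root_is_read_only(env, probe_dir="/var/run"):
    """Return True when the container root looks read-only."""
    # Signal 1: the commit message keys on APPTAINER_NAME being set.
    if env.get("APPTAINER_NAME"):
        return True
    # Signal 2: write probe. Try to create (and auto-delete) a file
    # in probe_dir; any OSError counts as "not writable".
    try:
        with tempfile.TemporaryFile(dir=probe_dir):
            return False
    except OSError:
        return True


def pick_runtime_dir(env, probe_dir="/var/run", pid=None):
    """Choose where Redis data, PID files and the generated nginx.conf go."""
    if root_is_read_only(env, probe_dir):
        pid = os.getpid() if pid is None else pid
        # Mirrors the shell fallback /tmp/openms-runtime-$$.
        return f"/tmp/openms-runtime-{pid}"
    # Hypothetical sentinel: None means "keep the image's default paths".
    return None
```

Passing the environment and probe directory as arguments keeps the sketch testable outside a container; an entrypoint would call it with `os.environ`.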
CI exposed three follow-up issues after the initial apptainer port:

1. /root is mode 0700 in the stock ubuntu base image. Docker runs the entrypoint as root so this is invisible, but apptainer maps the host user UID into the container - that user can't traverse /root, so `source /root/miniforge3/bin/activate ...` (the first executable line of the entrypoint) fails with EACCES, set -e exits, and the apptainer instance dies before streamlit binds 8501. Add `chmod o+x /root` in both Dockerfiles so the path is traversable by anyone, keeping the directory listing private.

2. Bound the `until redis-cli ping` loop (CodeRabbit OpenMS#387 review). If redis-server fails to bind 6379 (e.g. apptainer's shared host net namespace has the port taken), the loop spun forever and the health check timed out with no actionable error. Now retries REDIS_STARTUP_RETRIES times (default 30s) and exits 1 with a clear message on timeout.

3. Drop `chmod 0777 /var/lib/redis` (CodeRabbit OpenMS#387 review). Docker mode writes here as root regardless of mode bits, and apptainer mode never uses this path (the entrypoint relocates to /tmp/openms-runtime-*), so 0755 root-owned is correct and matches least-privilege.
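The bounded wait from item 2 follows a common pattern: poll a readiness check a fixed number of times, then give up with an explicit failure. A minimal Python sketch is below. The entrypoint itself is shell, and `wait_for` with its parameters is a hypothetical name; the one-second delay is an assumption consistent with "default 30s".

```python
import time


def wait_for(check, retries=30, delay=1.0, sleep=time.sleep):
    """Call check() up to `retries` times; return True on first success.

    Returns False once the budget is exhausted, so the caller can print a
    clear error and exit non-zero instead of spinning forever.
    """
    for attempt in range(1, retries + 1):
        if check():
            return True
        # No point sleeping after the final failed attempt.
        if attempt < retries:
            sleep(delay)
    return False
```

A caller would wrap whatever corresponds to `redis-cli ping` in `check` and, on a `False` result, log the timeout and exit with status 1.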
`apptainer instance start` does not consistently honor a Docker image's WORKDIR - the container's CWD ends up being the host CWD at invocation (e.g. the GH Actions checkout root), so `streamlit run app.py` resolves against the wrong directory, exits with "Error: file not found", and the apptainer instance dies before binding 8501. The health check then times out at "Wait for streamlit /_stcore/health" with no obvious trace - exactly the failure seen on test-apptainer (full) and (simple).

Anchor the entrypoint at /app explicitly. In docker mode WORKDIR /app is already set so this is a no-op; in apptainer mode it's the actual fix.

Also stamp pwd+uid into the "Starting Streamlit app" log line so future breakage shows the resolved CWD/user in the apptainer logs without needing to re-instrument.
A user setting STREAMLIT_SERVER_COUNT > 1 on the simple image variant (no nginx installed) currently gets a single Streamlit instance with no log indication, making the misconfiguration silent and hard to diagnose. Emit a clear WARN before the load-balancer branch falls through. Addresses CodeRabbit review on OpenMS#387.
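As an illustration of the guard described here, the following Python sketch decides the effective instance count and produces the warning text. The function name, return shape, and warning wording are hypothetical, and the actual check is in the shell entrypoint.

```python
def resolve_server_count(requested, nginx_path):
    """Return (effective_count, warning_or_None).

    `nginx_path` is the resolved nginx binary, or None when nginx is not
    installed (as on the simple image variant).
    """
    if requested > 1 and nginx_path is None:
        # Hypothetical wording; the point is that the fallback is logged.
        warning = (
            f"WARN: STREAMLIT_SERVER_COUNT={requested} but nginx is not "
            "installed; starting a single Streamlit instance"
        )
        return 1, warning
    return max(requested, 1), None
```

In a Python caller, `shutil.which("nginx")` would supply `nginx_path`, and the warning, when present, would be printed before starting Streamlit.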
The test-apptainer job's "Dump entrypoint logs on failure" step is post-mortem and easy to miss in the GH Actions UI; combined with the auth-walled API on the logs endpoint, every prior failure left us guessing at what the entrypoint actually printed. Two changes, both no-op when the test passes:

1. The wait loop now tails the apptainer instance .out / .err every five attempts (and dumps them in full on timeout), so failures surface inline. The "Start apptainer instance" step exports the discovered log dir into $GITHUB_ENV so the next step can read it without re-deriving from hostname + whoami.

2. The entrypoint logs uid/gid/cwd, the relevant APPTAINER_* env vars, and whether `streamlit` resolves after conda activation. Two echo lines - harmless in docker mode where the logs aren't read, and the missing data on the apptainer side has been the whole bottleneck.
The test-apptainer job's instance came up cleanly (apptainer instance list reported PID 2586) but the entrypoint's first echo never landed in .out/.err - both files were empty when dumped on timeout.

The Docker ENTRYPOINT was translated into the SIF's %runscript only; %startscript on a docker-archive build defaults to a no-op `exec "$@"`. So `apptainer instance start` was launching an empty daemon and streamlit never bound 8501.

`apptainer instance run` (added in apptainer 1.1) starts a persistent named instance AND executes %runscript inside it - the verb actually intended for OCI-derived SIFs. With this change the entrypoint runs, the breadcrumbs added in 189a94b will appear in the instance log, and the health endpoint should come up for both image variants.
…tainer-Sdrl8 Support Apptainer/Singularity with read-only root filesystem
…ds attach
A user binding host storage onto /workspaces-streamlit-template via singularity hit "[Errno 30] Read-only file system" the moment the app tried to mkdir a workspace, even though they passed `:rw` on the bind.

Root cause: neither Dockerfile creates /workspaces-streamlit-template or /mounted-data. Docker auto-creates missing `-v` mount targets, but singularity uses a read-only underlay when the destination isn't a real directory in the SIF and silently degrades the bind - writes then go to the read-only squashfs and fail with EROFS regardless of the `:rw` flag.

`mkdir -p` both paths in Dockerfile and Dockerfile_simple. Cost: one inode each. Behavior in docker mode is unchanged (a `-v` mount, k8s volumeMount, or compose volume shadows the empty dir). Behavior in singularity-without-bind is unchanged - writes still fail with EROFS, just one frame later in the path (parent in squashfs vs. parent missing entirely); persistent storage still requires a bind.

CI guard: the test-apptainer job now starts the instance with explicit `--bind /tmp/host-workspaces:/workspaces-streamlit-template:rw` and `--bind /tmp/host-mounted-data:/mounted-data:ro`, then exec's into the running instance to write a probe file and asserts it appears on the host with the expected contents (plus a read-side check on the :ro mount). Without the mkdir, the probe write would fail with EROFS and the test would fail closed.
…ence
The previous detection in StreamlitUI._mounted_data_root() returned the path as soon as it resolved to an existing directory, treating "directory exists" as proof that an operator bound something there. That worked only because docker auto-creates `-v` targets and the image never pre-created /mounted-data - so existence was a reliable proxy for "mount happened."

The companion fix (pre-creating /workspaces-streamlit-template and /mounted-data in the Dockerfile so singularity binds attach read-write) breaks that assumption: the path now always exists. Without this change, the upload widget would render an empty mounted-drive browser to every docker user without a `-v` flag and every apptainer user without a `--bind`.

Switch to os.path.ismount(): a true mount point (docker -v, k8s volumeMount, singularity --bind) crosses filesystems and trips the kernel's mount detection; an empty image-baked dir doesn't. The detection now asks the question we actually meant to ask.

CI guard: the test-apptainer bind step now asserts os.path.ismount() returns True for both /mounted-data and /workspaces-streamlit-template under `apptainer instance run --bind`, so the gating logic stays consistent with the kernel's view if either side drifts.
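The switch from an existence check to a mount check can be sketched as follows. This is not the body of StreamlitUI._mounted_data_root(); the standalone function name and the `None` return for "nothing mounted" are assumptions made for illustration.

```python
import os


def mounted_data_root(path="/mounted-data"):
    """Return the resolved path only when something is mounted there.

    An image-baked empty directory exists but is not a mount point, so it
    yields None and the mounted-drive browser can stay hidden.
    """
    resolved = os.path.realpath(path)
    # os.path.ismount() returns False for missing paths and for ordinary
    # directories on the same filesystem as their parent.
    if os.path.ismount(resolved):
        return resolved
    return None
```

One caveat from the Python documentation: on POSIX, `os.path.ismount()` cannot reliably detect bind mounts that stay on the same filesystem. The cases named in the commit message cross a filesystem boundary, and the CI assertion described above checks that this holds under `apptainer instance run --bind`.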
…ntpoints fix(singularity): pre-create /workspaces and /mounted-data so :rw bin…
…tainer-dHZXJ Extract entrypoint script to file for Apptainer compatibility
Diagnostic from a user reproducing the workflow EROFS revealed the real chain of failure under singularity:

    Starting Redis server (data=/tmp/openms-runtime-452993/redis)...
    Redis is ready
    Starting 1 RQ worker(s)...
    Starting Streamlit app (cwd=/app, uid=1000)...
    ERROR:root:There exists an active worker named 'worker-1' already

Apptainer/singularity share the host's network namespace by default. When the host has anything listening on 6379 - a system redis-server, a docker container, a previous singularity instance that didn't clean up - our `redis-server --daemonize yes` silently fails to bind with EADDRINUSE, but because daemonize forks before the listen-error surfaces, the entrypoint's parent shell returns 0 and the subsequent `redis-cli ping` happily connects to the *host's* redis instead. From there:

- RQ tries to register `worker-1` against the host's redis → conflicts with stale state from a previous run, the worker dies.
- Streamlit enqueues to the host's redis; the workflow job is consumed by whatever stale worker is still alive on the host, which runs the mkdir outside our mount namespace (no /workspaces-streamlit-template bind there) and hits EROFS at the squashfs root.

Unix-socket sidesteps the entire problem class: when the entrypoint detects read-only-root (apptainer mode), it now starts redis with `--unixsocket $RUNTIME_DIR/redis.sock --port 0` (no TCP listener at all) and exports `REDIS_URL=unix://<socket>` so streamlit's QueueManager and the RQ worker can only connect to *our* redis. docker mode is unchanged (TCP 6379 on localhost as before, no socket).

Also: write the resolved URL to /tmp/openms-redis-url so `apptainer exec` can discover it for diagnostics (env doesn't propagate across exec invocations). The test-apptainer CI step now reads that marker and pings with `redis-cli -s <sock>` accordingly.
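The mode split described above comes down to choosing a Redis command line and a matching `REDIS_URL`. The Python sketch below builds both without starting anything. The function name is hypothetical and the exact Docker-mode URL string is an assumption; the `--unixsocket`, `--port 0`, and `--daemonize yes` options and the `unix://<socket>` form come from the commit message.

```python
import os


def redis_launch_plan(runtime_dir, read_only_root):
    """Return (redis-server argv, REDIS_URL) for the detected mode."""
    argv = ["redis-server", "--daemonize", "yes"]
    if read_only_root:
        sock = os.path.join(runtime_dir, "redis.sock")
        # --port 0 disables the TCP listener entirely, so a redis already
        # listening on host:6379 can neither block nor shadow this one.
        argv += ["--unixsocket", sock, "--port", "0"]
        return argv, f"unix://{sock}"
    # Docker mode: TCP on localhost as before (URL form is an assumption).
    return argv, "redis://localhost:6379"
```

Because the socket path sits under the per-instance runtime directory, two instances on the same host get distinct sockets, which a shared TCP port cannot offer.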
…ntpoints fix(apptainer): use unix socket for Redis so host:6379 can't shadow us
📝 Walkthrough
This PR introduces Apptainer/Singularity support for HPC deployments by adding a complete container build and test pipeline, modifying container startup to handle read-only filesystems with Unix sockets, publishing validated SIF images to GHCR with retention, and updating integration tests to consume image artifacts.
Changes: Apptainer Container Support
Reuse the SIF that test-apptainer already builds and validates: upload it as a workflow artifact when validation passes, then push it to ghcr.io/<owner>/<repo>/sif:<tag> from a new publish-apptainer job.

Tag scheme mirrors the docker image (branch/sha/version-<variant> plus bare `latest` for full+main). Sibling /sif package keeps tag lists clean and cleanup policies independent.

README now points HPC users at the prebuilt ORAS path instead of the slow on-the-fly OCI->SIF conversion.

https://claude.ai/code/session_01NumLyfkQ3w3JF3TU8jM1iX
…osting-eH5Gh Publish prebuilt Apptainer SIFs to GHCR via ORAS