@liavweiss
Contributor

Move model storage to /mnt directory to prevent disk space issues

Summary

This PR addresses disk space issues in CI/CD workflows by moving model downloads from the root filesystem (~14GB available) to the /mnt directory (~75GB available). This prevents "no space left on device" errors when downloading large models during CI runs.

Problem

GitHub Actions runners were experiencing disk space exhaustion when downloading large models. The root filesystem (/) has only ~14GB available, which is insufficient for model downloads that can reach several GB. The previous workaround of deleting toolchains (~25GB) was:

  • Not sufficient for large model sets
  • Not stable (depended on toolchain versions)
  • Not applicable to Kind clusters (models stored inside the cluster)
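A quick way to confirm the gap on a given runner is to compare the free space on both mount points. The sketch below is illustrative only: the helper name avail_gb is hypothetical and not part of this PR, and the actual numbers depend on the runner image.

```shell
# avail_gb PATH: print the space available on the filesystem holding PATH, in whole GiB.
# Hypothetical helper for illustration; not part of the PR.
avail_gb() {
  # -P forces the portable one-line-per-filesystem format and -k reports
  # 1024-byte blocks; column 4 of the data row is the available block count.
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1048576) }'
}

echo "root: $(avail_gb /) GiB available"
if [ -d /mnt ]; then
  echo "/mnt: $(avail_gb /mnt) GiB available"
fi
```

Running this as an early workflow step makes it easy to see in the logs which disk a later download will fill.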

Solution

This PR implements a comprehensive solution that works for both host-level workflows and Kind cluster-based tests:

  1. Host-level workflows (test-and-build.yml, integration-test-docker.yml):

    • Creates /mnt/models directory
    • Moves existing models/ directory to /mnt/models if present
    • Creates a symlink from models/ to /mnt/models/ for backward compatibility
    • All model downloads now use the larger /mnt disk (~75GB)
  2. Kind cluster workflows (integration-test-k8s.yml):

    • Mounts host /mnt into Kind nodes (control-plane and worker)
    • Patches local-path-provisioner ConfigMap to use /mnt/local-path-provisioner instead of /tmp
    • All PVCs (including model storage) now use the larger disk
    • Removed the temporary "Free up disk space" step (no longer needed)

Changes

Files Modified

  • .github/workflows/test-and-build.yml (+38 lines)

    • Added "Setup model storage on /mnt" step with symlink creation
  • .github/workflows/integration-test-docker.yml (+45 lines, -11 lines)

    • Replaced "Free up disk space" step with "Setup model storage on /mnt"
  • .github/workflows/integration-test-k8s.yml (-11 lines)

    • Removed "Free up disk space" step (no longer needed)
  • e2e/pkg/cluster/kind.go (+49 lines, -2 lines)

    • Added Kind config with /mnt mount for control-plane and worker nodes
    • Added logic to patch local-path-config ConfigMap to use /mnt/local-path-provisioner
    • Restarts local-path-provisioner deployment after patching

Technical Details

Host-Level Implementation

The symlink keeps existing code working without changes, because the original models/ path still resolves:

models/ -> /mnt/models/
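The host-level steps described above (create the target directory, move any existing models, then symlink) could be sketched roughly as follows. This is a minimal sketch and not the exact step from the workflows: the function name setup_model_storage and its two parameters are hypothetical, and the target path is passed in so the logic can be tried outside CI.

```shell
set -eu

# setup_model_storage TARGET LINK
# Hypothetical helper: moves the contents of an existing LINK directory into
# TARGET, then replaces LINK with a symlink pointing at TARGET.
setup_model_storage() {
  local target="$1" link="$2"

  mkdir -p "$target"

  # LINK is already a symlink (for example on a re-run): just re-point it.
  if [ -L "$link" ]; then
    ln -sfn "$target" "$link"
    return 0
  fi

  # LINK is a real directory: copy its contents (dotfiles included) and remove
  # the original. Copying contents avoids nesting LINK inside an existing TARGET.
  if [ -d "$link" ]; then
    cp -a "$link"/. "$target"/
    rm -rf "$link"
  fi

  ln -s "$target" "$link"
}
```

On a runner this might be invoked as `setup_model_storage /mnt/models models` from the repository root. It assumes /mnt/models is writable by the runner user, which may first require creating the directory with sudo and changing its owner; that permission handling is an assumption and is not shown here.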

Kind Cluster Implementation

  1. Mount Configuration: Kind nodes mount host /mnt at container path /mnt
  2. Storage Provisioner: local-path-provisioner is patched to use /mnt/local-path-provisioner as the base path
  3. PVC Storage: All PersistentVolumeClaims created in the cluster now use the larger disk
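The three points above could look roughly like the sketch below. The actual change lives in Go in e2e/pkg/cluster/kind.go; this shell version only illustrates the same idea, and the function names write_kind_config and local_path_patch are hypothetical. It assumes Kind's bundled provisioner uses the local-path-config ConfigMap with a config.json key in the local-path-storage namespace.

```shell
# write_kind_config FILE
# Hypothetical helper: writes a Kind cluster config that bind-mounts the host's
# /mnt into both the control-plane and the worker node at /mnt.
write_kind_config() {
  cat > "$1" <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: /mnt
        containerPath: /mnt
  - role: worker
    extraMounts:
      - hostPath: /mnt
        containerPath: /mnt
EOF
}

# local_path_patch BASE_PATH
# Hypothetical helper: prints a JSON merge patch that sets config.json in the
# local-path-config ConfigMap so volumes are provisioned under BASE_PATH.
local_path_patch() {
  local config escaped
  config=$(printf '{"nodePathMap":[{"node":"DEFAULT_PATH_FOR_NON_LISTED_NODES","paths":["%s"]}]}' "$1")
  # config.json is stored as a string value, so its double quotes are escaped.
  escaped=$(printf '%s' "$config" | sed 's/"/\\"/g')
  printf '{"data":{"config.json":"%s"}}' "$escaped"
}
```

With a cluster created from that config, the patch might be applied with `kubectl -n local-path-storage patch configmap local-path-config --type merge -p "$(local_path_patch /mnt/local-path-provisioner)"`, followed by `kubectl -n local-path-storage rollout restart deployment local-path-provisioner`. A merge patch only touches the config.json key, so the other keys in the ConfigMap are left as they are.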

Testing

Validation Performed

  • ✅ Go modules tidy check passed
  • ✅ Pre-commit hooks passed (except for a local OpenSSL issue, which CI will cover)
  • ✅ Code formatting verified (go fmt, YAML syntax)
  • ✅ DCO sign-off verified on all commits
  • ✅ Git status clean
  • ✅ integration-test-k8s.yml - passed on my fork
  • ✅ test-and-build.yml - passed on my fork
  • ✅ docker-publish.yml - passed on my fork

Benefits

  1. Roughly 5x the disk space: ~75GB available on /mnt vs ~14GB on root
  2. Stable solution: No longer depends on toolchain versions
  3. Works for all scenarios: Both host-level and Kubernetes-based tests
  4. Backward compatible: Existing code works without changes (symlink)
  5. Cleaner workflows: Removed temporary workarounds

Related Issues

Addresses disk space issues mentioned in PR #623 (follow-up improvement requested by @rootfs)

…d cluster to prevent disk space issues

Signed-off-by: Liav Weiss <[email protected]>
@netlify

netlify bot commented Dec 9, 2025

Deploy Preview for vllm-semantic-router ready!

Name Link
🔨 Latest commit 68238e1
🔍 Latest deploy log https://app.netlify.com/projects/vllm-semantic-router/deploys/69383019242cfe0007200989
😎 Deploy Preview https://deploy-preview-792--vllm-semantic-router.netlify.app

@github-actions

github-actions bot commented Dec 9, 2025

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 Root Directory

Owners: @rootfs, @Xunzhuo
Files changed:

  • .github/workflows/integration-test-docker.yml
  • .github/workflows/integration-test-k8s.yml
  • .github/workflows/test-and-build.yml

📁 e2e

Owners: @Xunzhuo
Files changed:

  • e2e/pkg/cluster/kind.go

🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.

Member

@Xunzhuo left a comment

cool

Collaborator

@yuluo-yx left a comment

/lgtm

@rootfs
Collaborator

rootfs commented Dec 9, 2025

this error looks related to the change:

=== RUN   TestAutoInitializeUnifiedClassifier
    classifier_test.go:2097: AutoInitializeUnifiedClassifier() failed: model validation failed: no valid models found (neither LoRA nor legacy) (models directory should exist at ../../../../models)
--- FAIL: TestAutoInitializeUnifiedClassifier (0.00s)
=== RUN   TestUnifiedClassifier_Initialize
=== RUN   TestUnifiedClassifier_Initialize/Already_initialized
=== RUN   TestUnifiedClassifier_Initialize/Initialization_attempt
Error: ModernBERT model path does not exist: ./test_models/modernbert
