@TimPietruskyRunPod TimPietruskyRunPod commented Jan 13, 2026

Summary

  • Updates Dockerfile base image from CUDA 12.1.0 to 12.4.1
  • Updates ldconfig path to cuda-12.4
  • Removes FlashInfer
  • Adds NVIDIA B200 (Blackwell) to supported gpuIds in hub.json

Problem

vLLM workers fail on Blackwell GPUs (RTX PRO 6000, B200) with:

imagePullAsync: failed to get self-hosted image registry auth

This was caused by a CUDA version mismatch:

  • Dockerfile used CUDA 12.1.0
  • hub.json allowedCudaVersions only allows 12.4-12.9 (12.1 was removed in commit f8bf824)
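A mismatch like this could be caught before deployment with a small check. The following is a minimal sketch, not the project's actual tooling: the function names, the example image tags, and the modeling of the allowed range as inclusive (major, minor) bounds are all assumptions, and the real layout of allowedCudaVersions in hub.json is not shown in this PR.

```python
import re

# Assumed inclusive bounds, modeled on the "12.4-12.9" range quoted above.
MIN_ALLOWED = (12, 4)
MAX_ALLOWED = (12, 9)

# Matches the major.minor part of a CUDA image tag on a FROM line.
CUDA_TAG = re.compile(r"^FROM\s+\S*cuda:(\d+)\.(\d+)", re.MULTILINE | re.IGNORECASE)


def dockerfile_cuda_version(dockerfile_text):
    """Return (major, minor) from the first CUDA base image tag, or None."""
    match = CUDA_TAG.search(dockerfile_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_cuda_allowed(dockerfile_text, lo=MIN_ALLOWED, hi=MAX_ALLOWED):
    """True if the Dockerfile's CUDA version falls inside the allowed range."""
    version = dockerfile_cuda_version(dockerfile_text)
    return version is not None and lo <= version <= hi


# Hypothetical base image lines; the exact tags in the real Dockerfile may differ.
OLD_DOCKERFILE = "FROM nvidia/cuda:12.1.0-base-ubuntu22.04\n"
NEW_DOCKERFILE = "FROM nvidia/cuda:12.4.1-base-ubuntu22.04\n"

print(is_cuda_allowed(OLD_DOCKERFILE))  # False
print(is_cuda_allowed(NEW_DOCKERFILE))  # True
```

Comparing (major, minor) tuples keeps the check independent of the patch version, so 12.4.1 counts as 12.4.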

Test plan

  • Wait for CI to build the dev image
  • Create a test endpoint on RunPod with a Blackwell GPU (B200)
  • Verify worker starts successfully without the imagePullAsync error
  • Run a simple inference test

Fixes: DR-1118

TimPietruskyRunPod and others added 2 commits January 13, 2026 18:51
- Update Dockerfile base image from CUDA 12.1.0 to 12.4.1
- Update ldconfig path to cuda-12.4
- Update FlashInfer installation to use flashinfer-python package
- Add NVIDIA B200 (Blackwell) to supported gpuIds in hub.json

This fixes the "imagePullAsync: failed to get self-hosted image registry auth"
error when deploying on Blackwell GPUs (RTX PRO 6000, B200) by aligning
the Docker image CUDA version with the allowedCudaVersions in hub.json.

Fixes: DR-1118

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The gpuIds field in hub.json controls default GPU selection for deployments,
not GPU compatibility. The CUDA 12.4 upgrade is sufficient to enable
Blackwell GPU support.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

FlashInfer requires nvcc to JIT-compile CUDA kernels at runtime for
new GPU architectures (like Blackwell SM 10.0). Since we use the CUDA
base image without the toolkit, nvcc is not available.

vLLM will use its built-in fallback sampling methods instead.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
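
The nvcc dependency described in the commit above can be checked from inside a running container. This is a rough sketch under stated assumptions: nvcc_available is a hypothetical helper, not part of vLLM or FlashInfer, and it only looks for nvcc on PATH, so it does not reproduce how either library decides which sampling path to use.

```python
import shutil


def nvcc_available(which=shutil.which):
    """Return True if an nvcc executable can be found on PATH.

    The lookup function is injectable so the check can be exercised
    without a CUDA toolkit installed.
    """
    return which("nvcc") is not None


if __name__ == "__main__":
    if nvcc_available():
        print("nvcc found: runtime JIT compilation of CUDA kernels is possible")
    else:
        print("nvcc not found: JIT-compiled kernels cannot be built in this image")
```

On an image without the CUDA toolkit, the second message is the expected result.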
@TimPietruskyRunPod TimPietruskyRunPod merged commit 90c16b4 into main Jan 13, 2026
4 checks passed
@TimPietruskyRunPod TimPietruskyRunPod deleted the timpietrusky/dr-1118-vllm-image-fails-on-blackwell-gpus-imagepullasync-error branch January 13, 2026 21:01