This is my host info with a Tesla T4 GPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 48
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; TSX disabled
==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.5.2
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.16.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.3.1
[pip3] nvidia-ml-py==13.580.82
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.3.20
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.0+cu128
[pip3] torchaudio==2.9.0+cu128
[pip3] torchvision==0.24.0+cu128
[pip3] transformers==4.57.3
[pip3] triton==3.5.0
[conda] flashinfer-python 0.5.2 pypi_0 pypi
[conda] numpy 2.2.6 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.8.4.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.8.90 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.8.93 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.8.90 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.10.2.21 pypi_0 pypi
[conda] nvidia-cudnn-frontend 1.16.0 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.3.3.83 pypi_0 pypi
[conda] nvidia-cufile-cu12 1.13.1.3 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.9.90 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.7.3.90 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.5.8.93 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.7.1 pypi_0 pypi
[conda] nvidia-cutlass-dsl 4.3.1 pypi_0 pypi
[conda] nvidia-ml-py 13.580.82 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.27.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.8.93 pypi_0 pypi
[conda] nvidia-nvshmem-cu12 3.3.20 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.8.90 pypi_0 pypi
[conda] pyzmq 27.1.0 pypi_0 pypi
[conda] torch 2.9.0+cu128 pypi_0 pypi
[conda] torchaudio 2.9.0+cu128 pypi_0 pypi
[conda] torchvision 0.24.0+cu128 pypi_0 pypi
[conda] transformers 4.57.3 pypi_0 pypi
[conda] triton 3.5.0 pypi_0 pypi
==============================
Environment Variables
==============================
LD_LIBRARY_PATH=/root/.local/lib::/usr/local/cuda-12.4/lib64
CUDA_HOME=:/usr/local/cuda-12.4
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
The host machine runs Ollama fine, with this GPU info in its log:
time=2025-11-30T07:24:49.679+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-11-30T07:24:50.126+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-cd68373a-b857-7671-860b-0d17f4c2d4cf library=cuda variant=v12 compute=7.5 driver=12.2 name="Tesla T4" total="14.6 GiB" available="13.9 GiB"
Installing with Docker:

docker run -d --gpus all -p 1312:1312 ghcr.io/psalias2006/gpu-hot:latest

gives me this error:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown.
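
As I understand it, "driver/library version mismatch" usually means the user-space NVIDIA library (libnvidia-ml) and the loaded kernel module on the host are at different versions, e.g. after a driver update without a reboot. To check that on my host I put together this small sketch; it is my own code, not from gpu-hot, and it assumes the nvidia-ml-py package is installed:

```python
# Compare the loaded NVIDIA kernel module version with the version reported
# by the user-space NVML library. A difference between the two is the usual
# cause of "driver/library version mismatch". (My own diagnostic sketch.)
import re

import pynvml  # provided by the nvidia-ml-py package


def kernel_module_version() -> str:
    """Version of the NVIDIA kernel module currently loaded on the host."""
    with open("/proc/driver/nvidia/version") as f:
        text = f.read()
    match = re.search(r"Kernel Module\s+([0-9.]+)", text)
    return match.group(1) if match else "unknown"


def nvml_library_version() -> str:
    """Driver version reported by the user-space NVML library."""
    pynvml.nvmlInit()
    try:
        version = pynvml.nvmlSystemGetDriverVersion()
        return version.decode("utf-8") if isinstance(version, bytes) else version
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    print("kernel module:", kernel_module_version())
    try:
        print("NVML library :", nvml_library_version())
    except pynvml.NVMLError as e:
        # On a mismatched host, nvmlInit() itself fails with the same
        # driver/library version mismatch error.
        print("NVML init failed:", e)
```

If the kernel module and library versions differ here, I assume a host reboot (or reloading the nvidia kernel modules) would bring them back in sync, but I'd like to confirm that this is really what the container hook is tripping over.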
I'm wondering how to deal with this error. Is it happening in this part of the code during NVML setup?
https://github.com/psalias2006/gpu-hot/blob/main/core/monitor.py
try:
    pynvml.nvmlInit()
    self.initialized = True
    version = pynvml.nvmlSystemGetDriverVersion()
    if isinstance(version, bytes):
        version = version.decode('utf-8')
    logger.info(f"NVML initialized - Driver: {version}")
    # Detect which GPUs need nvidia-smi (once at boot)
    self._detect_smi_gpus()
except Exception as e:
    logger.error(f"Failed to initialize NVML: {e}")
    self.initialized = False
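
For reference, here is a stripped-down standalone version of that same init path (my own sketch, not from the repo) that I could run directly on the host or inside a container that does start, to see whether pynvml.nvmlInit() alone reproduces the mismatch:

```python
# Standalone reproduction of the NVML init path from core/monitor.py,
# without the rest of the app. (My own sketch for debugging.)
import pynvml

try:
    pynvml.nvmlInit()
    version = pynvml.nvmlSystemGetDriverVersion()
    if isinstance(version, bytes):
        version = version.decode("utf-8")
    print(f"NVML initialized - Driver: {version}")
except pynvml.NVMLError as e:
    print(f"Failed to initialize NVML: {e}")
finally:
    try:
        pynvml.nvmlShutdown()
    except pynvml.NVMLError:
        # Shutdown fails if init never succeeded; that's fine here.
        pass
```

That said, in my case the container itself fails to start (the error comes from the nvidia-container-cli hook during container init), so I'm not sure this Python code is ever reached, which is part of my question.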