Description
Whenever I run MLPerf Inference for Llama2-70b in a Docker container, I get the error below. I deleted the container image and ran it again, but the same error occurs.
The host server runs RHEL 9.2 with 8 x H100 80GB GPUs and high-performance wekafs file storage mounted with NVIDIA GDS.
[TensorRT-LLM][ERROR] 1: [runner.cpp::executeMyelinGraph::682] Error Code 1: Myelin ([cask.cpp:exec:972] Platform (Cuda) error)
[TensorRT-LLM][ERROR] Encountered an error in forward function: Executing TRT engine failed!
[TensorRT-LLM][WARNING] Step function failed, continuing.
These RPMs are installed on the host server:
cm-nvidia-container-toolkit-1.14.2-100070_cm10.0_6ea8822f81.x86_64
nvidia-driver-cuda-libs-550.90.07-1.el9.x86_64
nvidia-libXNVCtrl-550.90.07-2.el9.x86_64
nvidia-driver-NVML-550.90.07-1.el9.x86_64
nvidia-driver-NvFBCOpenGL-550.90.07-1.el9.x86_64
nvidia-driver-libs-550.90.07-1.el9.x86_64
nvidia-persistenced-550.90.07-1.el9.x86_64
nvidia-driver-cuda-550.90.07-1.el9.x86_64
dnf-plugin-nvidia-2.2-1.el9.noarch
kmod-nvidia-open-dkms-550.90.07-1.el9.x86_64
nvidia-kmod-common-550.90.07-1.el9.noarch
nvidia-driver-550.90.07-1.el9.x86_64
nvidia-modprobe-550.90.07-2.el9.x86_64
nvidia-settings-550.90.07-2.el9.x86_64
nvidia-xconfig-550.90.07-2.el9.x86_64
nvidia-driver-devel-550.90.07-1.el9.x86_64
nvidia-libXNVCtrl-devel-550.90.07-2.el9.x86_64
nvidia-fabric-manager-550.90.07-1.x86_64
nvidia-gds-12-5-12.5.1-1.x86_64
nvidia-gds-12.5.1-1.x86_64
nvidia-fs-dkms-2.22.3-1.x86_64
nvidia-fs-2.22.3-1.x86_64
[root@hxxxx ~]# rpm -qa |grep -i cuda
cuda-dcgm-libs-3.3.6.1-100101_cm10.0_463140abaf.x86_64
nvidia-driver-cuda-libs-550.90.07-1.el9.x86_64
nvidia-driver-cuda-550.90.07-1.el9.x86_64
cuda-toolkit-config-common-12.5.82-1.noarch
cuda-toolkit-12-config-common-12.5.82-1.noarch
cuda-toolkit-12-5-config-common-12.5.82-1.noarch
RHEL9.2 kernel: 5.14.0-284.30.1.el9_2.x86_64
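One thing that may be worth ruling out on the host is a mix of driver versions across the NVIDIA packages (driver, kernel module, fabric manager). The sketch below is only an illustration, not a diagnosis of the Myelin error: it groups package names from `rpm -qa` output by driver version. The helper name `group_by_driver_version` and the regular expression are assumptions made for this example; the pattern assumes driver versions have a three-digit major number, as in 550.90.07, so that packages such as nvidia-fs-2.22.3 are skipped.

```python
import re
from collections import defaultdict

# Assumption: driver versions look like 550.90.07 (three-digit major number)
# and appear between hyphens in the RPM name.
DRIVER_VERSION = re.compile(r"-(\d{3}\.\d+\.\d+)-")

def group_by_driver_version(packages):
    """Group RPM names by the driver-style version embedded in them.

    Hypothetical helper for illustration; names without a matching
    version (for example nvidia-fs-2.22.3) are ignored.
    """
    groups = defaultdict(list)
    for name in packages:
        match = DRIVER_VERSION.search(name)
        if match:
            groups[match.group(1)].append(name)
    return dict(groups)

# A few package names copied from the list above.
packages = [
    "nvidia-driver-cuda-libs-550.90.07-1.el9.x86_64",
    "kmod-nvidia-open-dkms-550.90.07-1.el9.x86_64",
    "nvidia-fabric-manager-550.90.07-1.x86_64",
    "nvidia-fs-2.22.3-1.x86_64",
]

groups = group_by_driver_version(packages)
for version, names in groups.items():
    print(version, len(names))

if len(groups) > 1:
    print("More than one driver version found among the packages")
```

For the packages listed in this report, every driver package carries the same version, so a check like this would report a single group.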