Error Code 1: Myelin ([cask.cpp:exec:972] Platform (Cuda) error #4080

@jaiswackhv

Description

Whenever I run MLPerf Inference for Llama2-70b in a Docker container, I get the error below. I deleted the container image and ran it again, but the error persists.
The host server runs RHEL 9.2 with 8 x H100 80GB GPUs and high-performance WekaFS file storage mounted with NVIDIA GDS.

[TensorRT-LLM][ERROR] 1: [runner.cpp::executeMyelinGraph::682] Error Code 1: Myelin ([cask.cpp:exec:972] Platform (Cuda) error)
[TensorRT-LLM][ERROR] Encountered an error in forward function: Executing TRT engine failed!
[TensorRT-LLM][WARNING] Step function failed, continuing.

These RPMs are installed on the host server:
cm-nvidia-container-toolkit-1.14.2-100070_cm10.0_6ea8822f81.x86_64
nvidia-driver-cuda-libs-550.90.07-1.el9.x86_64
nvidia-libXNVCtrl-550.90.07-2.el9.x86_64
nvidia-driver-NVML-550.90.07-1.el9.x86_64
nvidia-driver-NvFBCOpenGL-550.90.07-1.el9.x86_64
nvidia-driver-libs-550.90.07-1.el9.x86_64
nvidia-persistenced-550.90.07-1.el9.x86_64
nvidia-driver-cuda-550.90.07-1.el9.x86_64
dnf-plugin-nvidia-2.2-1.el9.noarch
kmod-nvidia-open-dkms-550.90.07-1.el9.x86_64
nvidia-kmod-common-550.90.07-1.el9.noarch
nvidia-driver-550.90.07-1.el9.x86_64
nvidia-modprobe-550.90.07-2.el9.x86_64
nvidia-settings-550.90.07-2.el9.x86_64
nvidia-xconfig-550.90.07-2.el9.x86_64
nvidia-driver-devel-550.90.07-1.el9.x86_64
nvidia-libXNVCtrl-devel-550.90.07-2.el9.x86_64
nvidia-fabric-manager-550.90.07-1.x86_64
nvidia-gds-12-5-12.5.1-1.x86_64
nvidia-gds-12.5.1-1.x86_64
nvidia-fs-dkms-2.22.3-1.x86_64
nvidia-fs-2.22.3-1.x86_64
[root@hxxxx ~]# rpm -qa |grep -i cuda
cuda-dcgm-libs-3.3.6.1-100101_cm10.0_463140abaf.x86_64
nvidia-driver-cuda-libs-550.90.07-1.el9.x86_64
nvidia-driver-cuda-550.90.07-1.el9.x86_64
cuda-toolkit-config-common-12.5.82-1.noarch
cuda-toolkit-12-config-common-12.5.82-1.noarch
cuda-toolkit-12-5-config-common-12.5.82-1.noarch

RHEL9.2 kernel: 5.14.0-284.30.1.el9_2.x86_64
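
The package lists above can be checked mechanically to confirm that the driver-related RPMs all carry the same version. The following is a minimal sketch, not a diagnosis: `parse_rpm`, `versions_by_prefix`, and the `DRIVER_PREFIXES` selection are hypothetical names and choices made for illustration, and the sample list is a subset of the packages shown above. It assumes the usual `name-version-release.arch` package naming.

```python
from collections import defaultdict


def parse_rpm(nvra):
    """Split 'name-version-release.arch' into its four fields."""
    rest, arch = nvra.rsplit(".", 1)          # arch is after the last dot
    name, version, release = rest.rsplit("-", 2)  # last two dashes delimit version and release
    return name, version, release, arch


def versions_by_prefix(packages, prefixes):
    """Group package names by version for names starting with any prefix."""
    found = defaultdict(list)
    for pkg in packages:
        name, version, _, _ = parse_rpm(pkg)
        if name.startswith(prefixes):
            found[version].append(name)
    return dict(found)


# Subset of the `rpm -qa` output from this report.
PACKAGES = [
    "nvidia-driver-cuda-libs-550.90.07-1.el9.x86_64",
    "nvidia-driver-550.90.07-1.el9.x86_64",
    "kmod-nvidia-open-dkms-550.90.07-1.el9.x86_64",
    "nvidia-kmod-common-550.90.07-1.el9.noarch",
    "nvidia-fabric-manager-550.90.07-1.x86_64",
    "nvidia-gds-12-5-12.5.1-1.x86_64",
    "nvidia-fs-2.22.3-1.x86_64",
]

# Assumption: these name prefixes identify packages that should track the driver version.
DRIVER_PREFIXES = (
    "nvidia-driver",
    "kmod-nvidia",
    "nvidia-kmod-common",
    "nvidia-fabric-manager",
)

result = versions_by_prefix(PACKAGES, DRIVER_PREFIXES)
for version, names in result.items():
    print(version, sorted(names))
if len(result) > 1:
    print("driver-related packages report more than one version")
```

On a live host the list could be fed from `rpm -qa` output instead of the hard-coded sample. A single version in the result only rules out a package mismatch on the host; it says nothing about the driver or CUDA versions inside the container.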

    Labels

    triaged (Issue has been triaged by maintainers)