Replies: 1 comment
-
Hey! Just wondering if you figured this out? I'm experiencing something very similar and would be happy to compare notes if you're available.
-
Thank you very much for your open-source work. While using the GPU, I hit the following error: "An uncorrectable ECC error detected (possible firmware handling failure)". The error is reported from the gpuCheckEccCounts_TU102 function. Tracing the call path shows that whether this function is invoked depends on the return value of kgspBootstrap_HAL: the ECC error check is performed only when kgspBootstrap_HAL returns a failure. This seems to differ from the mechanism described at https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping. Could you please explain the relationship between ECC errors and GSP boot (gspBoot)?
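To make the flow described above concrete, here is a minimal Python model of it. This is only a sketch of the behavior as traced, not the driver's code: `init_adapter_sketch` and the two callables are hypothetical stand-ins for the path around kgspBootstrap_HAL and gpuCheckEccCounts_TU102, and `NV_OK = 0` is used here simply as a placeholder success code.

```python
NV_OK = 0  # placeholder success code for this sketch


def init_adapter_sketch(gsp_bootstrap, check_ecc_counts):
    """Model of the observed flow (hypothetical names, not driver code).

    gsp_bootstrap:    callable standing in for kgspBootstrap_HAL
    check_ecc_counts: callable standing in for gpuCheckEccCounts_TU102
    """
    status = gsp_bootstrap()
    if status != NV_OK:
        # As observed in the trace: the ECC check is reached only
        # on the bootstrap failure path.
        check_ecc_counts()
    return status
```

In this model a successful bootstrap never reaches the ECC check, which is the part that seems at odds with the linked documentation.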
Additionally, when an uncorrectable ECC (UCE) error occurs, dmesg continuously prints "RmInitAdapter failed!", and nvidia-smi subsequently fails to recognize the GPU. A reboot is required before the GPU is recognized again. According to https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping, NVIDIA has a comprehensive mechanism for handling UCE errors. What can I do to prevent this issue from occurring?
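For anyone comparing notes, a small script along these lines could count how often the two messages quoted above appear in the kernel log. It is a rough sketch: `count_messages` and `read_dmesg` are made-up helper names, the match is a plain substring search on the message text from this post, and reading `dmesg` may require root on some systems.

```python
import subprocess

# Message fragments quoted in this post; matching is a plain substring test.
PATTERNS = (
    "An uncorrectable ECC error detected",
    "RmInitAdapter failed!",
)


def count_messages(log_text, patterns=PATTERNS):
    """Return {pattern: number of log lines that contain it}."""
    counts = {p: 0 for p in patterns}
    for line in log_text.splitlines():
        for p in patterns:
            if p in line:
                counts[p] += 1
    return counts


def read_dmesg():
    """Return the kernel log as text (assumes a `dmesg` binary is on PATH)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=False
    )
    return result.stdout
```

Calling `count_messages(read_dmesg())` returns a dictionary keyed by message, which makes it easy to compare how many times each one was logged before the GPU disappeared from nvidia-smi.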