questions about ECC check #734

nju-zjx · 2024-11-12T06:43:10Z

Thank you very much for your open-source work. While using the GPU, I hit the following error: "An uncorrectable ECC error detected (possible firmware handling failure)". The error is reported from the gpuCheckEccCounts_TU102 function. Tracing the call path, I found that whether this function is invoked depends on the return value of kgspBootstrap_HAL: the ECC check is performed only when that function returns a failure. This seems to differ from the mechanism described at https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping. Could you please explain the relationship between ECC errors and GSP boot (kgspBootstrap_HAL)?
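
The call path described above can be sketched roughly as follows. This is only a minimal illustration of the control flow as understood from tracing, not the driver's actual code: every `sketch_*` name, the status enum, and the parameters are hypothetical stand-ins for kgspBootstrap_HAL and gpuCheckEccCounts_TU102, whose real signatures are not reproduced here.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical status codes for this sketch only. */
typedef enum { SKETCH_OK = 0, SKETCH_ERR_GENERIC, SKETCH_ERR_ECC } sketch_status;

/* Stand-in for kgspBootstrap_HAL: fails when boot_ok is false. */
static sketch_status sketch_gsp_bootstrap(bool boot_ok) {
    return boot_ok ? SKETCH_OK : SKETCH_ERR_GENERIC;
}

/* Stand-in for gpuCheckEccCounts_TU102: reports an error when the
 * (assumed) uncorrectable error count is nonzero. */
static sketch_status sketch_check_ecc_counts(unsigned uncorrectable_count) {
    if (uncorrectable_count > 0) {
        fprintf(stderr, "An uncorrectable ECC error detected "
                        "(possible firmware handling failure)\n");
        return SKETCH_ERR_ECC;
    }
    return SKETCH_OK;
}

/* Control flow in question: the ECC counts are inspected only on the
 * bootstrap failure path, never after a successful bootstrap. */
static sketch_status sketch_init(bool boot_ok, unsigned uncorrectable_count,
                                 bool *ecc_checked) {
    sketch_status status = sketch_gsp_bootstrap(boot_ok);
    *ecc_checked = false;
    if (status != SKETCH_OK) {
        *ecc_checked = true;
        sketch_status ecc = sketch_check_ecc_counts(uncorrectable_count);
        if (ecc != SKETCH_OK) {
            status = ecc; /* the ECC error replaces the generic boot failure */
        }
    }
    return status;
}
```

In this sketch, `sketch_init(true, 3, &checked)` never looks at the ECC counts even though errors are pending, which is the behavior that appears to differ from the linked documentation.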

Additionally, when an uncorrectable ECC (UCE) error occurs, dmesg repeatedly prints "RmInitAdapter failed!", and nvidia-smi subsequently fails to recognize the GPU. A reboot is required to recover. According to https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping, NVIDIA has a comprehensive mechanism for handling UCE errors. What can I do to prevent this from happening?

yezhiyong30 · 2025-08-06T11:42:42Z

Hey! Just wondering if you figured this out? I'm experiencing something very similar and would be happy to compare notes if you're available.
