NUC Box: eGPU is unstable with Thunderbolt 4 #1812

@Unb0rn

Component

Dasharo firmware

Device

NovaCustom NUC BOX 14th Gen

Dasharo version

v0.9.0

Dasharo Tools Suite version

No response

Test case ID

No response

Brief summary

Running an eGPU through an ADT-Link UT3G adapter on the Thunderbolt 4 port makes the system unstable.
Sometimes the system hangs with the same glyphs printed in the top part of the screen (see the photo under Additional context).
It also produces several errors in the system log, all related to the amdgpu driver:

  • amdgpu: unable to change power state from d3hot to d0
  • amdgpu: ras table recovery error

It may be related to #1811.
It looks like there is currently no reliable way to use an eGPU with the NUC Box.

How reproducible

Error messages appear during some operations, such as "lspci -nnk"; system hangs are sporadic.

How to reproduce

Connect the GPU (I am using a Radeon Pro W7900) to the Thunderbolt 4 port
Boot the system
Run lspci -nnk
(Once, it hung during apt update while updating some system drivers)
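
For anyone trying to confirm this on their own unit, the steps above could be scripted roughly as follows. This is only a sketch, not something I ran as part of this report: the helper names `amdgpu_error_lines` and `run_repro` are made up for illustration, the match strings are taken from the messages quoted in this issue, and reading the kernel log with `journalctl -k -b` may need root or membership in the journal group.

```python
import subprocess

# Substrings taken from the messages quoted in this report
# (matched case-insensitively).
PATTERNS = (
    "unable to change power state from d3hot to d0",
    "ras table recovery",
)


def amdgpu_error_lines(kernel_log: str) -> list[str]:
    """Return the amdgpu lines in kernel_log that match one of PATTERNS."""
    hits = []
    for line in kernel_log.splitlines():
        lowered = line.lower()
        if "amdgpu" in lowered and any(p in lowered for p in PATTERNS):
            hits.append(line)
    return hits


def run_repro() -> list[str]:
    """Run `lspci -nnk`, then scan this boot's kernel log for the errors."""
    # The lspci output itself is not needed; the call only triggers the problem.
    subprocess.run(["lspci", "-nnk"], check=False, capture_output=True)
    log = subprocess.run(
        ["journalctl", "-k", "-b", "--no-pager"],
        check=False,
        capture_output=True,
        text=True,
    ).stdout
    return amdgpu_error_lines(log)
```

Calling `run_repro()` with the eGPU attached and printing the returned lines should show whether a given boot hit the same errors. It cannot detect the sporadic hangs, since the script stops with the system.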

Expected behavior

The system is stable, and the eGPU works as reliably as the iGPU

Actual behavior

The system logs errors and hangs from time to time

Screenshots

No response

Additional context

Glyphs the GPU displays before a hang (tried on Ubuntu 25.10 and Arch):

[Image: photo of the glyphs shown on screen before the hang]

The logs I'm getting from journalctl:

Mar 27 13:35:43 unb0rn-test kernel: [drm:gmc_v11_0_flush_gpu_tlb [amdgpu]] *ERROR* Timeout waiting for sem acquire in VM flush!
Mar 27 13:35:43 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: Timeout waiting for VM flush ACK!
Mar 27 13:35:43 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: PSP is resuming...
Mar 27 13:35:43 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: reserve 0x1300000 from 0x8bfc000000 for PSP TMR
Mar 27 13:35:43 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: GECC is disabled, set amdgpu_ras_enable=1 to enable GECC in next boot cycle if needed
Mar 27 13:35:43 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: RAS Init Status: 0xFFFFFFFF
Mar 27 13:35:43 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: RAP: optional rap ta ucode is not available
Mar 27 13:35:43 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: SECUREDISPLAY: optional securedisplay ta ucode is not available
Mar 27 13:35:43 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: SMU is resuming...
Mar 27 13:35:43 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: smu driver if version = 0x0000003d, smu fw if version = 0x00000040, smu fw program = 0, smu fw version = 0x004e8200 (78.130.0)
Mar 27 13:35:43 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: SMU driver if version not matched
Mar 27 13:35:43 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: dpm has been disabled
Mar 27 13:35:43 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: Setting new power limit is not supported!
Mar 27 13:35:43 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: SMU is resumed successfully!
Mar 27 13:35:43 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: rlc autoload: gc ucode autoload timeout
Mar 27 13:35:43 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: (-110) failed to wait rlc autoload complete
Mar 27 13:35:43 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: resume of IP block <gfx_v11_0> failed -110
Mar 27 13:35:43 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_resume failed (-110).
Mar 27 13:35:43 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: RAS table incorrect checksum or error:236, try to recover
Mar 27 13:35:44 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: RAS table recovery failed
Mar 27 13:35:44 unb0rn-test kernel: amdgpu 0000:06:00.0: amdgpu: amdgpu: finishing device.
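
To make the failing steps in the log above easier to spot, here is a small sketch that pulls out the lines carrying a negative return code and names the code. The function name `failed_steps` is hypothetical, and the only errno mapped is the one that appears in this log (110, which is ETIMEDOUT on Linux); it is a reading aid, not part of any amdgpu tooling.

```python
import re

# Only the code seen in the log above; 110 is ETIMEDOUT on Linux.
ERRNO_NAMES = {110: "ETIMEDOUT"}

# Matches negative return codes written as "(-110)" or "failed -110".
CODE_RE = re.compile(r"(?:\(|failed )-(\d+)\)?")


def failed_steps(log: str) -> list[tuple[str, str]]:
    """Return (errno name, driver message) for each line with a return code."""
    steps = []
    for line in log.splitlines():
        match = CODE_RE.search(line)
        if not match:
            continue
        code = int(match.group(1))
        # Keep only the text after the driver's "amdgpu: " prefix.
        message = line.split("amdgpu: ", 1)[-1]
        steps.append((ERRNO_NAMES.get(code, f"errno {code}"), message))
    return steps
```

On the excerpt above this picks out three lines, all timeouts: the rlc autoload wait, the resume of the gfx_v11_0 IP block, and amdgpu_device_ip_resume. That suggests the resume path fails at the GFX firmware load before the RAS table error is reported.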

boltd also reports a problem, and in the same second the kernel fails to assign PCI resources for the devices behind the Thunderbolt bridges:

Mar 13 16:49:09 unb0rn-test boltd[1519]: [e2358780-60b5-domain0                    ] udev: failed to determine if uid is stable: unknown NHI PCI id '0x7ec2'
Mar 13 16:49:09 unb0rn-test kernel: pci 0000:02:00.0: bridge window [mem size 0x00300000]: failed to assign
Mar 13 16:49:09 unb0rn-test kernel: pci 0000:02:00.0: bridge window [mem size 0x00300000]: failed to assign
Mar 13 16:49:09 unb0rn-test kernel: pci 0000:03:00.0: bridge window [mem size 0x00300000]: failed to assign
Mar 13 16:49:09 unb0rn-test kernel: pci 0000:03:00.0: bridge window [mem size 0x00300000]: failed to assign
Mar 13 16:49:09 unb0rn-test kernel: pci 0000:04:00.0: bridge window [mem size 0x00200000]: failed to assign
Mar 13 16:49:09 unb0rn-test kernel: pci 0000:04:00.0: BAR 0 [mem size 0x00004000]: failed to assign
Mar 13 16:49:09 unb0rn-test kernel: pci 0000:04:00.0: bridge window [mem size 0x00200000]: failed to assign
Mar 13 16:49:09 unb0rn-test kernel: pci 0000:04:00.0: BAR 0 [mem size 0x00004000]: failed to assign
Mar 13 16:49:09 unb0rn-test kernel: pci 0000:05:00.0: bridge window [mem size 0x00200000]: failed to assign
Mar 13 16:49:09 unb0rn-test kernel: pci 0000:05:00.0: bridge window [mem size 0x00200000]: failed to assign
Mar 13 16:49:09 unb0rn-test kernel: pci 0000:06:00.0: BAR 5 [mem size 0x00100000]: failed to assign
Mar 13 16:49:09 unb0rn-test kernel: pci 0000:06:00.0: ROM [mem size 0x00020000 pref]: failed to assign
Mar 13 16:49:09 unb0rn-test kernel: pci 0000:06:00.1: BAR 0 [mem size 0x00004000]: failed to assign
Mar 13 16:49:09 unb0rn-test kernel: pci 0000:06:00.0: BAR 5 [mem size 0x00100000]: failed to assign
Mar 13 16:49:09 unb0rn-test kernel: pci 0000:06:00.1: BAR 0 [mem size 0x00004000]: failed to assign
Mar 13 16:49:09 unb0rn-test kernel: pci 0000:06:00.0: ROM [mem size 0x00020000 pref]: failed to assign
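
Since every failure is printed twice, the kernel lines above are easier to read when grouped per device. A minimal sketch of that grouping, assuming the message format shown in this excerpt (the regular expression and the name `unassigned_resources` are mine, not from any kernel or Dasharo tool):

```python
import re
from collections import defaultdict

# Matches lines such as:
#   pci 0000:06:00.0: BAR 5 [mem size 0x00100000]: failed to assign
FAIL_RE = re.compile(
    r"pci (?P<dev>[0-9a-f:.]+): (?P<res>.+?) "
    r"\[mem size (?P<size>0x[0-9a-f]+)(?: pref)?\]: failed to assign"
)


def unassigned_resources(log: str) -> dict[str, dict[str, int]]:
    """Map each PCI device to its unassigned resources and sizes in bytes."""
    result = defaultdict(dict)
    for line in log.splitlines():
        match = FAIL_RE.search(line)
        if match:
            # Repeated messages for the same resource collapse into one entry.
            result[match["dev"]][match["res"]] = int(match["size"], 16)
    return dict(result)
```

For the GPU itself (0000:06:00.0) this leaves BAR 5 at 0x00100000 (1 MiB) and the ROM at 0x00020000 (128 KiB) unassigned, plus BAR 0 of the 0000:06:00.1 function at 0x00004000 (16 KiB). The upstream bridges 02:00.0 through 05:00.0 each fail to get a 2 MiB or 3 MiB memory window.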

Solutions you've tried

No response
