
bus error on deployment of an action that passes pslse #882

@diamantopoulos

Dear team,

I'm facing the following problem: I'm developing an action (GEMM, not one of the example actions) that passes RTL simulation with PSLSE, but when I deploy it on the 9V3 card I get a bus error. The dmesg output is appended below.

Some images have been tested successfully on the card, but others fail with this "bus error" output. While I keep experimenting and debugging, I'm filing this issue in case there is a known way to debug this further. Since the PSLSE simulation passes, it's hard to debug on the card.

Please note that when I get a "bus error", the card usually switches to the factory image and the system is not affected, but sometimes it causes the system to reboot. (All images are within -200 ps WNS.)

On the terminal:

[did@zhcc067 /u/did]$ sudo /dataL/did/snap/actions/hls_gemm/sw/snap_gemm -n 512 -k 64 -m 512
INFO: AXI/Cache lines for a  : 2048/1024
INFO: AXI/Cache lines for b  : 2048/1024
INFO: AXI/Cache lines for IN : 4096/2048
INFO: AXI/Cache lines for OUT: 16384/8192
INFO: size_n=512, size_k=64, size_m=512
INFO: printArray 512x64=32768
INFO: printArray 64x512=32768
PARAMETERS:
  input:       none
  output:      none
  type_in:     0 HOST_DRAM
  addr_in:     00007fff83de0000
  type_out:    0 HOST_DRAM
  addr_out:    00007fff83cc0000
  size_in/out: 0000000c
  prepare gemm job of 48 bytes size
Bus error
[did@zhcc067 /u/did]$ 
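
As a sanity check on the INFO lines above, the line counts can be reproduced with a short calculation. This is only a sketch: the element size (4 bytes), the AXI line width (64 bytes) and the cache line size (128 bytes) are assumptions that are not stated in this issue, chosen because they reproduce the printed numbers; `line_counts` is a hypothetical helper, not part of the action's code.

```python
# Assumptions (not stated in the issue): 4-byte matrix elements,
# 64-byte AXI lines and 128-byte cache lines.
ELEM_BYTES = 4
AXI_LINE_BYTES = 64
CACHE_LINE_BYTES = 128


def line_counts(rows, cols):
    """Return (axi_lines, cache_lines) needed for a rows x cols matrix."""
    nbytes = rows * cols * ELEM_BYTES
    # Ceiling division, so a partially filled line still counts as one line.
    axi = -(-nbytes // AXI_LINE_BYTES)
    cache = -(-nbytes // CACHE_LINE_BYTES)
    return axi, cache


# Sizes taken from the command line: -n 512 -k 64 -m 512
n, k, m = 512, 64, 512
a = line_counts(n, k)        # matrix a is n x k
b = line_counts(k, m)        # matrix b is k x m
out = line_counts(n, m)      # result is n x m
total_in = (a[0] + b[0], a[1] + b[1])

for name, (axi, cache) in (("a", a), ("b", b), ("IN", total_in), ("OUT", out)):
    print(f"AXI/Cache lines for {name:<3}: {axi}/{cache}")
```

Under these assumptions the computed values match the INFO lines, which suggests the host-side buffer sizes are consistent with the requested dimensions.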

dmesg output:

[21171.613018] Harmless Hypervisor Maintenance interrupt [Recovered]
[21171.613482]  Error detail: CAPP recovery process is in progress
[21171.613700] 	HMER: 8040000000000000
[21172.032538] Harmless Hypervisor Maintenance interrupt [Recovered]
[21172.038746]  Error detail: CAPP recovery process is in progress
[21172.041086] EEH: Fenced PHB#0 detected, location: N/A
[21172.048256] EEH: This PCI device has failed 1 times in the last hour
[21172.048257] EEH: Notify device drivers to shutdown
[21172.048265] cxl-pci 0000:01:00.0: reflashing, so opting out of EEH!
[21172.048302] EEH: Collect temporary log
[21172.048304] PHB4 PHB#0 Diag-data (Version: 1)
[21172.048305] brdgCtl:    00000002
[21172.048306] RootSts:    00000040 00402000 e1010008 00100107 00000000
[21172.048308] nFir:       0000008000000000 0030001c00000000 0000008000000000
[21172.048309] PhbSts:     0000001800000000 0000001800000000
[21172.048310] Lem:        0000000100000100 0000000000000000 0000000000000100
[21172.048312] PhbErr:     0000048000000000 0000040000000000 2148000098000240 a008400000000000
[21172.048314] RxeMrgErr:  0000000000000001 0000000000000001 0000000000000000 0000000000000000
[21172.048315] RegbErr:    0050000000000000 0010000000000000 8800003c00000000 0000000000000000
[21172.048319] EEH: Reset with hotplug activity
[21172.048435] pci_bus 0008:00: busn_res: [bus 00] is released
[21172.048650] cxl afu0.0: Deactivating AFU directed mode
[21172.084226] cxl afu0.0: PSL Purge called with link down, ignoring
[21172.084648] iommu: Removing device 0000:01:00.0 from group 0
[21172.084874] pci_bus 0000:01: busn_res: [bus 01] is released
[21172.086292] 	HMER: 8040000000000000
[21175.588118] EEH: Sleep 5s ahead of complete hotplug
[21180.628196] pci 0000:00:00.0: [1014:04c1] type 01 class 0x060400
[21180.628270] pci 0000:00:00.0: PME# supported from D0 D3hot D3cold
[21180.628435] pci 0000:01:00.0: [1014:0477] type 00 class 0x1200ff
[21180.628463] pci 0000:01:00.0: reg 0x10: [mem 0x6000000000000-0x600000fffffff 64bit pref]
[21180.628474] pci 0000:01:00.0: reg 0x18: [mem 0x6000010000000-0x600001001ffff 64bit pref]
[21180.628486] pci 0000:01:00.0: reg 0x20: [mem 0x00000000-0x3fffffffff 64bit pref]
[21180.628634] pci 0000:00:00.0: PCI bridge to [bus 01]
[21180.628797] pci 0000:00:00.0:   bridge window [io  0x0000-0x0fff]
[21180.628822] pci 0000:01:00.0: disabling BAR 4: [mem size 0x4000000000 64bit pref] (bad alignment 0x4000000000)
[21180.629022] pci 0000:00:00.0: BAR 15: assigned [mem 0x6000000000000-0x600001fffffff 64bit pref]
[21180.629194] pci 0000:01:00.0: BAR 0: assigned [mem 0x6000000000000-0x600000fffffff 64bit pref]
[21180.629366] pci 0000:01:00.0: BAR 2: assigned [mem 0x6000010000000-0x600001001ffff 64bit pref]
[21180.629555] pci 0000:00     : [PE# 1fe] Secondary bus 0 associated with PE#1fe
[21180.629743] pci 0000:01     : [PE# 00] Secondary bus 1 associated with PE#0
[21180.629923] pci 0000:01     : [PE# 00] Setting up 32-bit TCE table at 0..80000000
[21180.632570] pci 0000:01     : [PE# 00] Setting up window#0 0..7fffffff pg=1000
[21180.632706] pci 0000:01     : [PE# 00] Enabling 64-bit DMA bypass
[21180.632830] iommu: Adding device 0000:01:00.0 to group 12, default domain type -1
[21180.632996] pci 0000:00:00.0: PCI bridge to [bus 01]
[21180.633094] pci 0000:00:00.0:   bridge window [mem 0x6000000000000-0x6003fbfffffff 64bit pref]
[21180.633770] pcieport 0000:00:00.0: enabling device (0105 -> 0107)
[21180.634050] cxl-pci 0000:01:00.0: Device uses a PSL9
[21180.634161] cxl-pci 0000:01:00.0: enabling device (0140 -> 0142)
[21180.635062] pci 0000:01     : [PE# 00] Switching PHB to CXL
[21180.635452] pci 0000:01     : [PE# 00] Switching PHB to CXL
[21180.635757] cxl-pci 0000:01:00.0: PCI host bridge to bus 0008:00
[21180.635875] pci_bus 0008:00: root bus resource [bus 00]
[21180.635979] pci_bus 0008:00: busn_res: [bus 00] end is updated to ff
[21180.635987] pci 0008:00:00.0: [1014:0632] type 00 class 0x120000
[21180.636058] pci_bus 0008:00: busn_res: [bus 00-ff] end is updated to 00
[21180.636072] cxl afu0.0: Activating AFU directed mode
[21180.636724] EEH: Notify device driver to resume
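
When comparing the logs of failing and passing images, it can help to reduce a dmesg dump like the one above to its recovery-related lines. The following is a minimal sketch; the marker list and the helper names `extract_events` and `first_fence` are hypothetical choices based on the strings visible in this log, not an established tool.

```python
import re

# Hypothetical marker list, picked from strings that appear in the log above.
MARKERS = ("EEH:", "cxl", "HMER", "Hypervisor Maintenance")

# Matches "[  123.456] message" as printed by dmesg.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def extract_events(log_text, markers=MARKERS):
    """Return (timestamp, message) pairs for lines containing any marker."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines without a dmesg timestamp
        timestamp, message = float(match.group(1)), match.group(2)
        if any(marker in message for marker in markers):
            events.append((timestamp, message))
    return events


def first_fence(events):
    """Return the timestamp of the first 'Fenced PHB' event, or None."""
    for timestamp, message in events:
        if "Fenced PHB" in message:
            return timestamp
    return None


# A few lines copied from the dmesg output in this issue.
SAMPLE = """\
[21171.613018] Harmless Hypervisor Maintenance interrupt [Recovered]
[21171.613700] \tHMER: 8040000000000000
[21172.041086] EEH: Fenced PHB#0 detected, location: N/A
[21172.048306] RootSts:    00000040 00402000 e1010008 00100107 00000000
[21172.048650] cxl afu0.0: Deactivating AFU directed mode
"""

events = extract_events(SAMPLE)
for timestamp, message in events:
    print(f"{timestamp:.6f}  {message}")
print("first fence at:", first_fence(events))
```

In practice the text would come from a saved dmesg dump rather than the inline sample, and the marker list would need adjusting to whatever strings turn out to be relevant.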
