Description
Describe the bug
Hi everyone,
I'm encountering an issue with the IRQ handler. When I simulate multiple periodic messages over the CAN bus, the interface eventually fails. I suspect that SPI communication stops handling the data exchange correctly with the two HAT modules (https://www.waveshare.com/wiki/2-CH_CAN_FD_HAT) I am using.
I updated the kernel to the latest version (see below), which seemed to improve the situation slightly. However, it only mitigated the buffer issue by delaying the occurrence of the fault rather than fully resolving it.
My setup consists of a Raspberry Pi 5 with two CAN HAT 2CH FD modules, each using the MCP251XFD chip.
To ensure the problem is related to the transmitter and not the receiver, I tested two different scenarios:
- Using two channels of the Raspberry Pi to transmit and the remaining two channels to receive (all configured with the same parameters: bitrate, data rate, and sampling time).
- Using two channels of the Raspberry Pi to transmit and two channels of a Vector VN1640 (with the 1057Gcap installed) to receive. Also in this case, I carefully verified the CAN configuration.
In both cases I verified the termination, which measures 60 ohms on each bus.
I am aware that I am stressing the module, as I am sending multiple periodic messages (128 periodic messages per CAN FD, generating a 40% bus load). The messages require some processing to calculate the internal CRC and MC in the payload, which is correctly handled by the Python code.
Based on the interface counters, the issue appears to be related to TX overruns on the transmitting channels. I suspect this is the cause of the failure.
ifconfig:
can0: flags=193<UP,RUNNING,NOARP> mtu 72
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 65536
(UNSPEC)
RX packets 6455202 bytes 68902927 (65.7 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 6455202 bytes 68902927 (65.7 MiB)
TX errors 0 dropped 0 overruns 8471 carrier 0 collisions 0
device interrupt 186
can1: flags=193<UP,RUNNING,NOARP> mtu 72
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 65536
(UNSPEC)
RX packets 13736158 bytes 146632299 (139.8 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 13736158 bytes 146632299 (139.8 MiB)
TX errors 0 dropped 0 overruns 11052 carrier 0 collisions 0
device interrupt 187
can2: flags=193<UP,RUNNING,NOARP> mtu 72
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 65536
(UNSPEC)
RX packets 13736162 bytes 146632347 (139.8 MiB)
RX errors 0 dropped 0 overruns 3 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 188
can3: flags=193<UP,RUNNING,NOARP> mtu 72
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 65536
(UNSPEC)
RX packets 6455206 bytes 68902975 (65.7 MiB)
RX errors 0 dropped 0 overruns 1 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 189
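For anyone who wants to track these counters over time rather than reading ifconfig by hand, here is a minimal sketch that parses the text of /proc/net/dev. It assumes the usual Linux column layout, in which the "fifo" column is what ifconfig prints as "overruns"; the function name and the returned keys are my own and not part of any library.

```python
def parse_overruns(proc_net_dev_text):
    """Return {iface: {"rx_overruns": int, "tx_overruns": int}}.

    Assumes the standard /proc/net/dev layout: two header lines, then one
    line per interface with 8 receive columns followed by 8 transmit columns.
    """
    counters = {}
    for line in proc_net_dev_text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, fields = line.split(":", 1)
        values = [int(v) for v in fields.split()]
        if len(values) < 16:
            continue  # unexpected layout, skip instead of guessing
        # Receive:  bytes packets errs drop fifo ...  -> fifo is index 4
        # Transmit: bytes packets errs drop fifo ...  -> fifo is index 8 + 4
        counters[name.strip()] = {
            "rx_overruns": values[4],
            "tx_overruns": values[12],
        }
    return counters
```

Calling `parse_overruns(open("/proc/net/dev").read())` at a fixed interval and logging the difference between successive readings would show whether the overruns keep growing right up to the IRQ handler failure or stop earlier.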
The counters above are from the first scenario. The results are essentially the same in the second scenario, where I use the Vector hardware as the receiver. One observation I have made is that after one bus fails, the overruns seem to stop or decrease. However, several hours later, the second bus also fails.
Thanks in advance
Steps to reproduce the behaviour
I can't find a piece of code that fully replicates the bug. I wrote a program that sends periodic messages, which generates some overruns, but not as many as my main code.
I use Bluetooth serial communication to share information about the message I want to simulate, including the ID, initial payload, and CAN bus settings. This communication is not constant.
I could try running the code without using Bluetooth, but I still need to work on it.
How I initialize the bus:
self.canDB[canName] = SocketcanBus(
    channel=canName,
    fd=(isFD == 'True'),
    receive_own_messages=False,
    local_loopback=False,
    can_filters=None,
    ignore_rx_error_frames=False
)
To run a periodic message, I create a thread using the library function:
send_periodic(self.messageInfo[_bus][_id]['MSG_CLASS'], self.messageInfo[_bus][_id]['PERIOD'], modifier_callback=self.messageInfo[_bus][_id]['CALLBACK'], on_error=self.messageInfo[_bus][_id]['ERROR'], autostart=False)
The message object is defined as follows:
can.Message(arbitration_id=_id, data=payload, dlc=_dlc, is_extended_id=_id_extended, is_fd=fdParam, bitrate_switch=fdParam)
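To give an idea of the per-message work done in the modifier callback, here is a simplified sketch of what such a callback could look like. The payload layout (CRC in byte 0, a 4-bit counter in byte 1) and the CRC-8 parameters (polynomial 0x1D, initial value 0xFF) are placeholders chosen for illustration and not the real ones. The only thing it relies on is that the message object exposes a mutable `data` bytearray, as `can.Message` does.

```python
def crc8(data, poly=0x1D, init=0xFF):
    """Bitwise CRC-8, MSB first. Polynomial and init value are placeholders."""
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def update_counter_and_crc(msg):
    """Modifier callback: bump the counter, then recompute the CRC in place.

    Hypothetical layout: byte 0 = CRC over bytes 1..n, byte 1 = 4-bit counter.
    """
    msg.data[1] = (msg.data[1] + 1) & 0x0F  # counter wraps from 15 back to 0
    msg.data[0] = crc8(msg.data[1:])        # CRC covers everything after byte 0
```

A function like this is what gets passed as `modifier_callback`. Since it runs in Python before every transmission, with 128 periodic messages per bus its cost adds up, which may be relevant to the timing of the overruns.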
Device(s)
Raspberry Pi 5
System
uname -a:
Linux raspberrypi 6.6.74+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.74-1+rpt1 (2025-01-27) aarch64 GNU/Linux
modinfo can:
filename: /lib/modules/6.6.74+rpt-rpi-2712/kernel/net/can/can.ko.xz
alias: net-pf-29
author: Urs Thuermann <urs.thuermann@volkswagen.de>, Oliver Hartkopp <oliver.hartkopp@volkswagen.de>
license: Dual BSD/GPL
description: Controller Area Network PF_CAN core
srcversion: 37BF79470A254916FD8C595
depends:
intree: Y
name: can
vermagic: 6.6.74+rpt-rpi-2712 SMP preempt mod_unload modversions aarch64
parm: stats_timer:enable timer for statistics (default:on) (int)
modinfo mcp251xfd:
filename: /lib/modules/6.6.74+rpt-rpi-2712/kernel/drivers/net/can/spi/mcp251xfd/mcp251xfd.ko.xz
license: GPL v2
description: Microchip MCP251xFD Family CAN controller driver
author: Marc Kleine-Budde <mkl@pengutronix.de>
srcversion: 672D5B0159EB5AA465D4A79
alias: spi:mcp251xfd
alias: spi:mcp251863
alias: spi:mcp2518fd
alias: spi:mcp2517fd
alias: of:N*T*Cmicrochip,mcp251xfdC*
alias: of:N*T*Cmicrochip,mcp251xfd
alias: of:N*T*Cmicrochip,mcp251863C*
alias: of:N*T*Cmicrochip,mcp251863
alias: of:N*T*Cmicrochip,mcp2518fdC*
alias: of:N*T*Cmicrochip,mcp2518fd
alias: of:N*T*Cmicrochip,mcp2517fdC*
alias: of:N*T*Cmicrochip,mcp2517fd
depends: can-dev
intree: Y
name: mcp251xfd
vermagic: 6.6.74+rpt-rpi-2712 SMP preempt mod_unload modversions aarch64
Logs
dmesg | grep can:
[ 3.165808] mcp251xfd spi0.1 can0: MCP2518FD rev0.0 (-RX_INT -PLL -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[ 3.170311] mcp251xfd spi0.0 can1: MCP2518FD rev0.0 (-RX_INT -PLL -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[ 3.179808] mcp251xfd spi1.1 can2: MCP2518FD rev0.0 (-RX_INT -PLL -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[ 3.186278] mcp251xfd spi1.0 can3: MCP2518FD rev0.0 (-RX_INT -PLL -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[ 112.788097] can: controller area network core
[ 112.795770] can: raw protocol
[ 2339.305706] mcp251xfd spi0.1 can0: IRQ handler mcp251xfd_handle_tefif() returned -22.
[ 2339.305714] mcp251xfd spi0.1 can0: IRQ handler returned -22 (intf=0x3f1a0014).
[ 4809.135009] mcp251xfd spi0.0 can1: IRQ handler mcp251xfd_handle_tefif() returned -22.
[ 4809.135017] mcp251xfd spi0.0 can1: IRQ handler returned -22 (intf=0x3f1a0014).
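The -22 in these lines is a negated Linux errno value (EINVAL, "Invalid argument"). As a rough sketch, the failing handler and the error name can be pulled out of a dmesg dump with Python's standard library. The regular expression is written against the exact message format shown above and is an assumption about how stable that format is across kernel versions.

```python
import errno
import os
import re

# Matches lines such as:
#   mcp251xfd spi0.1 can0: IRQ handler mcp251xfd_handle_tefif() returned -22.
IRQ_FAILURE_RE = re.compile(
    r"mcp251xfd (?P<spi>\S+) (?P<iface>\S+): "
    r"IRQ handler (?P<handler>\w+)\(\) returned (?P<code>-?\d+)\."
)


def decode_irq_failures(log_text):
    """Return one dict per failing IRQ sub-handler found in the log text."""
    failures = []
    for match in IRQ_FAILURE_RE.finditer(log_text):
        code = abs(int(match["code"]))  # the kernel logs errno values negated
        failures.append({
            "spi": match["spi"],
            "iface": match["iface"],
            "handler": match["handler"],
            "errno_name": errno.errorcode.get(code, "UNKNOWN"),
            "message": os.strerror(code),
        })
    return failures
```

Applied to the log above, this reports `mcp251xfd_handle_tefif` with `EINVAL` for both can0 and can1, i.e. both failures come from the handler that processes the TEF (transmit event FIFO) interrupt.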
Additional context
Thanks in advance