NXP RT1060-EVK faults with update of FlexSPI LUT #56967
-
Hello, I'm encountering a Usage fault on the RT1060-EVK board. This issue happens during the boot of the application, specifically in the mcux_igpio_configure() function. It's worth noting that this is a deterministic bug with a clear symptom. The problem occurs when the ethernet driver calls the eth_phy_reset() function, which in turn executes the mcux_igpio_configure() function on line 982:
Unfortunately, the gpio_cfg_reg pointer is assigned a corrupted value of 0x4294191a, as seen in the variables table on the right side during debugging. I've placed a conditional breakpoint on this line with the condition of cfg_idx == 10 (which is the numeric value of int_gpio). However, when I tried to read the memory value stored in config->pin_muxes[cfg_idx].config_register (0x6001f02c) with gdb at the same time I caught the bug with the debugger, the retrieved value was the valid address of the configuration register (0x401f82d4). When I toggled several breakpoints before the line of the bug with the same condition of cfg_idx == 10 to slow down the run, the error was not reproduced, and the value retrieved into gpio_cfg_reg was the valid one (which seems like a race condition).
If I run the same executable elf without a debugger, I get a bus fault at a different instruction, but still in the mcux_igpio_configure() function (this bug is also deterministic).
Disassembly of mcux_igpio_configure() function:
Project configuration file:
Build output:
I'm on the latest Zephyr commit at the moment, 129291f. I would appreciate any assistance. Best regards,
Replies: 10 comments 8 replies
-
Hi @ofirshe, The final address being read for this pointer at 0x6001f02c is in flash. And the config pointer is also in flash at 0x6001EE88. Since these are constant, I would not expect a race condition when reading those addresses. But this line of code Another thought, you mention adding delay with breakpoints avoids the issue. Are you able to add a delay in the source code to avoid the issue? This could be helpful to understand where the delay is required. If you can insert a delay and get this working, then moving that delay around in the source may expose which read and/or write is causing a race condition. Do you have steps to replicate this on the EVK? And specifically, what is the full part number of the RT1060 EVK you are using? There are multiple versions of this EVK. Good luck with the troubleshooting
-
Hey @DerekSnell, Let me explain the code section below:
The problem lies with the instruction "ldr r1, [r2, #8]" at address 6001b9a8. If I set a breakpoint after this instruction instead of before it, r1 inexplicably takes on the value 0x40, while all other registers remain valid. I don't know why r1 is being overwritten with this value instead of the address of config->pin_muxes (which is 0x6001ef60), as in a valid run. Subsequently, at instruction 6001b9ac (mla r1, r3, r0, r1), r1 has a value of 0x108. The next load reads memory location 0x10C (which corresponds to 0x108 + 4), and the value stored at that address, 0x4294191a, matches the corrupted value we saw earlier in the Usage Fault message; it is assigned to r3. This leads to the usage fault at 0x6001b9b2, as previously explained. I added a delay just before the problematic code section. This modification resolved the issue in both scenarios, with or without a debugger. Regarding the board version, the microcontroller on the board is SoC part MIMXRT1062DVL6B. To compile the code, I use the command "west build -b mimxrt1060_evk". Best regards,
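For anyone following the numbers, the address arithmetic described above can be checked with a small host-side C sketch. The 20-byte element stride is an inference from the values in this thread (0x108 - 0x40 = 200 = 10 * 20), not something read from the driver source, and the helper name field_addr is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring "mla r1, r3, r0, r1" followed by a load at
 * offset 4: address = base + index * stride + field_offset. */
uint32_t field_addr(uint32_t base, uint32_t index, uint32_t stride,
                    uint32_t field_offset)
{
    return base + index * stride + field_offset;
}

/* Values taken from the thread. STRIDE is inferred, see the note above. */
#define CFG_IDX      10u          /* cfg_idx at the time of the fault        */
#define STRIDE       20u          /* assumed size of one pin_muxes element   */
#define FIELD_OFFSET 4u           /* offset of the field loaded after the mla */
#define GOOD_BASE    0x6001ef60u  /* config->pin_muxes in a valid run        */
#define BAD_BASE     0x40u        /* value observed in r1 in the failing run */

/* Sanity check of the two cases discussed in the thread. */
void check_thread_numbers(void)
{
    /* Valid run: lands on the config_register entry in flash. */
    assert(field_addr(GOOD_BASE, CFG_IDX, STRIDE, FIELD_OFFSET) == 0x6001f02cu);
    /* Failing run: lands on 0x10C, where the bogus pointer value was read. */
    assert(field_addr(BAD_BASE, CFG_IDX, STRIDE, FIELD_OFFSET) == 0x10cu);
}
```

With the valid base this reproduces 0x6001f02c, the address reported for config->pin_muxes[cfg_idx].config_register, which suggests the only thing wrong in the failing run is the base pointer loaded into r1.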
-
Hey @DerekSnell, As I continue to debug the problem, I have gone through all the initialization functions at the INIT_LEVEL_POST_KERNEL level (where the ethernet driver is initialized), and I have observed that if I slow down the run from the initialization of __device_dts_ord_539 (flash_mcux_flexspi_nor) onward, the bug is not reproduced. Therefore, I set a conditional breakpoint on line 252 with the condition entry==0x6001e5d0 (__init___device_dts_ord_539). Please refer to the following image:
As mentioned in my previous message, in the mcux_igpio_configure() function at address 6001b9a8, register r1 was being assigned 0x40 instead of the address of the beginning of the config->pin_muxes array. From the moment the flash_mcux_flexspi_nor device was initialized and the conditional breakpoint was triggered, I set a watchpoint on register r1 to catch the times it was assigned 0x40. I caught two places in the debugger: The first one was in the _NVIC_SetPriority function, which was triggered by spi_mcux_config_func##n: The second place was in spi_context_unlock_unconditionally:
I have no idea why slowing down the initialization of the flash_mcux_flexspi_nor driver has an effect on the problem. Additionally, the problem is not reproduced every time, and adding or removing some variables and code lines in my code can trigger the error (apparently because the size of the elf and the addresses in the flash change; again, I don't know why slowing down the run helps). My guess is that an interrupt occurred during mcux_igpio_configure() and modified the r1 register, but I'm not entirely sure.
Could you kindly offer your insights and ideas on why this problem occurs, based on what I have shared and on your understanding of how the system is designed and how the drivers work? Your input would be much appreciated. Thanks in advance,
-
Hi @DerekSnell, I hope you had a good weekend. We have been struggling with this issue for quite some time. Would you happen to have any insights or ideas regarding this issue? Thank you,
-
Hi @ofirshe, Thank you for sharing all this detailed info from your debugging efforts. I suspect this is an issue related to reading from the flash. And since you found the issue is related to the flash_mcux_flexspi_nor driver initializing, I wonder if flash_flexspi_nor_init() is changing something in the FlexSPI configuration that causes this.
The FlexSPI is the interface to the XIP flash. When you boot from flash, the ROM bootloader initializes the FlexSPI and configures it in a safe mode to read the Flash Configuration Block (FCB) from the flash. Based on that FCB, the ROM reconfigures the FlexSPI for that external flash and boots the application. If flash_flexspi_nor_init() makes a change to the FlexSPI config while the app is XIPing from the flash, perhaps this leads to an issue reading the data from flash at that time.
I suspect the problem is not in mcux_igpio_configure() itself; that just happens to be where you notice the issue. The FlexSPI has a prefetch buffer, plus the cache is enabled. So the wrong data is likely read after flash_flexspi_nor_init() is called and before you find it in mcux_igpio_configure().
We recommend you try removing portions of the flash_flexspi_nor_init() function to see if some portion of the configuration is causing issues. The memc_flexspi_set_device_config() function should not be called if memc_flexspi_is_running_xip() returns true, but it would be helpful if you can verify this in your case. Best regards
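To make the last point concrete, here is a minimal host-side sketch of the guard being described: skip the controller reconfiguration when the code is executing in place from the flash that controller serves. The stub_ and sketch_ names are hypothetical stand-ins for the helpers named above; the real functions take controller and device arguments that are omitted here, and this is not the driver's actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Test knobs for the sketch (not part of any real driver). */
bool running_xip;        /* what the "is running XIP" helper would report */
int reconfigure_calls;   /* how many times the controller was reconfigured */

/* Hypothetical stand-in for memc_flexspi_is_running_xip(). */
bool stub_is_running_xip(void)
{
    return running_xip;
}

/* Hypothetical stand-in for memc_flexspi_set_device_config(). */
int stub_set_device_config(void)
{
    reconfigure_calls++;
    return 0;
}

/* Init path: leave the ROM bootloader's configuration alone when executing
 * from the flash, otherwise apply the device configuration. */
int sketch_flash_init(void)
{
    if (stub_is_running_xip()) {
        return 0;
    }
    return stub_set_device_config();
}
```

Stepping through the real init function with a debugger, or logging the equivalent of reconfigure_calls, would be one way to verify which branch is taken on the board.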
-
Hello, @DerekSnell, After using the debugger, I confirmed that memc_flexspi_set_device_config() is not being executed when CONFIG_FLASH_MCUX_FLEXSPI_XIP=y. However, simply modifying the source code by adding or removing lines can sometimes obscure the underlying problem (as I've experienced with similar development work), possibly due to causing some addresses in the Flash to be shifted. Therefore, I don't think that randomly disabling sections of flash_flexspi_nor_init() is an effective solution without first understanding the root cause of the issue. To support my argument, I disabled the following section, which is irrelevant for my situation since I'm executing from the Flash:
and the problem was "resolved". Unfortunately, I haven't managed to understand the root cause of the problem, even after quite some time debugging this issue. Would you like me to provide additional debugging information to help identify the root cause? Thanks,
-
Hi @ofirshe, If the code and steps to replicate this are not something you can share here in this public forum, can you message me on Discord at Derek.Snell#9539, and we can privately discuss how to help each other there. Thanks
-
Hi @ofirshe- I believe I have replicated your issue, and found a resolution. It appears that the root cause is likely due to the LUT array that is programmed into the FlexSPI being placed in the FLASH region. This means that while the LUT array (which is used to configure the commands the FlexSPI uses to interact with the flash) is being programmed, flash read operations will be ongoing. This can result in the LUT values that are being programmed being incorrect.
To verify this, I first reproduced the FlexSPI issue. Once I had the issue reproduced, I created a duplicate LUT array in RAM, and ensured that array was linked in at build time. With the duplicate array, the FlexSPI fault still occurred. At this point, I updated the call to I verified that this change did not affect the alignment of any code or data placed in the FLASH or RAM section of the image by checking the MAP files from builds using each call of the If you'd like to reproduce this on your end, I have attached a Edit: I have also created a PR to fix the issue: #58096
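As a rough illustration of the idea of programming the LUT from a RAM copy rather than from flash, here is a minimal host-side sketch. It is not the code from the PR: lut_in_flash, fake_lut_regs, stub_update_lut and sketch_program_lut are all hypothetical names, the LUT contents are placeholders, and the stub stands in for whatever SDK routine writes the LUT registers (FLEXSPI_UpdateLUT in the MCUX SDK, as far as I know).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LUT_WORDS 8u  /* illustrative size, not the real LUT length */

/* Stands in for a const LUT that the linker would place in flash (.rodata).
 * The values are placeholders, not real FlexSPI command sequences. */
const uint32_t lut_in_flash[LUT_WORDS] = {
    0x11111111u, 0x22222222u, 0x33333333u, 0x44444444u,
    0x55555555u, 0x66666666u, 0x77777777u, 0x88888888u,
};

/* Stands in for the controller's LUT registers. */
uint32_t fake_lut_regs[LUT_WORDS];

/* Hypothetical stand-in for the routine that writes the LUT registers. */
void stub_update_lut(const uint32_t *cmd, uint32_t count)
{
    for (uint32_t i = 0; i < count; i++) {
        fake_lut_regs[i] = cmd[i];
    }
}

/* Copy the LUT to a stack (RAM) buffer first, so that no flash reads of the
 * LUT data are needed while the LUT registers are being rewritten. */
int sketch_program_lut(const uint32_t *lut, uint32_t count)
{
    uint32_t lut_ram[LUT_WORDS];

    if (count > LUT_WORDS) {
        return -1;  /* would overflow the RAM buffer */
    }
    memcpy(lut_ram, lut, count * sizeof(lut_ram[0]));
    stub_update_lut(lut_ram, count);
    return 0;
}
```

On real hardware the memcpy still reads the flash, but it completes before the LUT update starts, so the update itself only reads its source data from RAM. Instruction fetches from flash are a separate concern that this sketch does not address.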
-
Hi, Sorry to revive this discussion, but I believe that I am hitting a similar issue despite using Zephyr v3.4.0 (356c8cb), which has the patch discussed in the previous message. I am using a board based off the Whenever I send data to the configured Similar to the original author, if I break through or slow down the code of I've gone through the FLEXSPI section of the datasheet but haven't found a clue yet. I was hoping someone may have a better idea. Best, EDIT: Some additional details: we're running in XIP mode and using a GD25Q64E flash.
-
Hi Derek,
Unfortunately I'm traveling at the moment. From memory, I want to say that
moving the breakpoint after the memcpy shows the issue, but I should be
able to confirm this coming Friday or Monday.
On Thu, Sep 28, 2023, 4:53 AM Derek Snell wrote:
Hi @mlaventure, that sounds troubling.
Can you confirm, if you move the breakpoint right after the FlexSPI LUT is
updated, do you still see the issue? Thanks