Skip to content

Conversation

@peterharperuk
Copy link
Contributor

We use a pio and dma to write to the cyw43 chip using spi. Normally you write an address and then read the data from that address, so the pio program does does a write then read.

If you just want to write data in the case of uploading firmware we use the fdebug_tx_stall flag to work out if the pio has stalled waiting to read data which will never arrive.

The theory is that this flag will also get set if the bus is busy. So we mistakenly think a write to cyw43 has completed.

Add a check for the dma irq as well.

Fixes #2206

@lurch
Copy link
Contributor

lurch commented Jan 26, 2025

Fixes #2206 #2152 #2123

@Wren6991
Copy link
Contributor

Wren6991 commented Jan 27, 2025

So if I understand the code correctly, these lines:

        dma_channel_configure(bus_data->dma_out, &out_config, &bus_data->pio->txf[bus_data->pio_sm], tx, tx_length / 4, true);

        uint32_t fdebug_tx_stall = 1u << (PIO_FDEBUG_TXSTALL_LSB + bus_data->pio_sm);
        bus_data->pio->fdebug = fdebug_tx_stall;
        pio_sm_set_enabled(bus_data->pio, bus_data->pio_sm, true);
        while (!(bus_data->pio->fdebug & fdebug_tx_stall)) {
            tight_loop_contents(); // todo timeout
        }

...are attempting to block until the state machine has consumed all of the TX data, before deconfiguring the state machine. The problem is that this code also detects the case where the TX FIFO has bottomed out for reasons other than all of the data having been transferred. In particular, it immediately falls through if no DMA write has yet taken place between the dma_channel_configure() call and the write to bus_data->pio->fdebug, which is really no time at all. We probably got away with this on RP2040 because the processors were slower relative to the DMA.

If we think that is the failure mode, then the most direct fix would be this one:

diff --git a/src/rp2_common/pico_cyw43_driver/cyw43_bus_pio_spi.c b/src/rp2_common/pico_cyw43_driver/cyw43_bus_pio_spi.c
index bcc7284..a9f5970 100644
--- a/src/rp2_common/pico_cyw43_driver/cyw43_bus_pio_spi.c
+++ b/src/rp2_common/pico_cyw43_driver/cyw43_bus_pio_spi.c
@@ -309,8 +309,10 @@ int cyw43_spi_transfer(cyw43_int_t *self, const uint8_t *tx, size_t tx_length, u
         dma_channel_configure(bus_data->dma_out, &out_config, &bus_data->pio->txf[bus_data->pio_sm], tx, tx_length / 4, true);
 
         uint32_t fdebug_tx_stall = 1u << (PIO_FDEBUG_TXSTALL_LSB + bus_data->pio_sm);
-        bus_data->pio->fdebug = fdebug_tx_stall;
         pio_sm_set_enabled(bus_data->pio, bus_data->pio_sm, true);
+        dma_channel_wait_for_finish_blocking(bus_data->dma_out);
+        bus_data->pio->fdebug = fdebug_tx_stall;
         while (!(bus_data->pio->fdebug & fdebug_tx_stall)) {
             tight_loop_contents(); // todo timeout
         }

So the steps are:

  • wait for all data to be enqueued in TX FIFO
  • then clear the TX stall flag, because we know it can no longer be caused by data not having been transferred
  • then poll the TX stall flag to ensure the SM has completed

For me this also fixes the reproducer from #2123.

This also raises the question of why we are messing about with the DMA so much given this code is actually 100% blocking.

@peterharperuk
Copy link
Contributor Author

Yes, I agree. dma_channel_wait_for_finish_blocking in the right place works. Still something funny going on with the bus data but I think that's a separate problem.

@lurch
Copy link
Contributor

lurch commented Jan 27, 2025

In particular, it immediately falls through if no DMA write has yet taken place between the dma_channel_configure() call and the write to bus_data->pio->fdebug, which is really no time at all. We probably got away with this on RP2040 because the processors were slower relative to the DMA.

I guess that also explains why the bug appears / disappears with different compiler optimisation settings 👍

@Wren6991
Copy link
Contributor

Still something funny going on with the bus data but I think that's a separate problem.

The fields you are accessing here are read-only after initialisation, so IMO this more likely to be a timing issue than anything to do with those particular accesses being volatile.

@peterharperuk peterharperuk force-pushed the cyw43_unreliable_writes branch 2 times, most recently from 214f7ec to 53bd61c Compare January 27, 2025 14:35
@peterharperuk
Copy link
Contributor Author

Still something funny going on with the bus data but I think that's a separate problem

False alarm I think. The code could do with a refactor. But I should probably avoid doing that here?

@peterharperuk peterharperuk marked this pull request as ready for review January 27, 2025 14:38
@peterharperuk peterharperuk force-pushed the cyw43_unreliable_writes branch from 53bd61c to 69453bb Compare January 27, 2025 14:46
@lurch
Copy link
Contributor

lurch commented Jan 27, 2025

Minor nitpick, but given Luke's diagnosis above, should the "The theory is that this flag will also get set if the bus is busy. So we mistakenly think a write to cyw43 has completed." part be removed from the commit-message?

@Wren6991
Copy link
Contributor

Wren6991 commented Jan 27, 2025

Minor nitpick, but given Luke's diagnosis above, should the "The theory is that this flag will also get set if the bus is busy. So we mistakenly think a write to cyw43 has completed." part be removed from the commit-message?

Nope, Peter's description there seems to match mine above. There's a race (or at least one), and it manifests as a bug when the bus timing is just right, e.g. if the DMA is held off because the processor is keeping FASTPERI busy.

@kilograham
Copy link
Contributor

it seems to me like this code could still suffer races if the PIO is divided significantly.

Why not just make the RX loop PIO label symbol public from the .pio and check for the state machine address >= base + label_address

@peterharperuk
Copy link
Contributor Author

if the PIO is divided significantly

You mean this flag could be set by another state machine?

@kilograham
Copy link
Contributor

i meant the clock, but either way, i think checking the address is better but i haven't dived deeply into this

@peterharperuk
Copy link
Contributor Author

peterharperuk commented Jan 27, 2025

It's not immediately obvious that this could be done by reading the instruction address. For a write the wrap point is set to lp1_end. So when the write is complete the pio is stalled at "lp" which it will pass through normally. I'll update the description to reflect the actual behaviour. I'll test if things work at different clock speeds, but I suggest the current fix has the least risk at this point.

lp: out pins, 1             side 0 // perform write
    jmp x-- lp              side 1
public lp1_end:
    set pindirs, 0          side 0 // switch pin direction to in
lp2:
    in pins, 1              side 1 // perform read
    jmp y-- lp2             side 0
public end:

This works with the current fix.

    set_sys_clock_khz(266000, true);
    cyw43_set_pio_clock_divisor(4, 0);

We use a pio and dma to write to the cyw43 chip using spi. Normally you
write an address and then read the data from that address, so the pio
program does does a write then read.

If you just want to write data in the case of uploading firmware we
use the fdebug_tx_stall flag to work out if the pio has stalled waiting
to write more data.

The theory is that this flag will also get set if the bus is busy. So
we mistakenly think a write to cyw43 has completed.

Wait for the dma write to complete before waiting for the pio to stall.

Fixes raspberrypi#2206
@peterharperuk peterharperuk force-pushed the cyw43_unreliable_writes branch from 69453bb to ea32e6d Compare January 28, 2025 10:27
@peterharperuk peterharperuk added this to the 2.1.1 milestone Jan 29, 2025
@kilograham kilograham merged commit 524716f into raspberrypi:develop Jan 29, 2025
4 checks passed
@matsobdev
Copy link

matsobdev commented Jan 31, 2025

I'm just a simple man, don't know much of that stuff, but will the last else if statement be reachable here:

if (rx != NULL) {
if (tx == NULL) {
tx = rx;
assert(tx_length && tx_length < rx_length);
}
DUMP_SPI_TRANSACTIONS(
printf("[%lu] bus TX/RX %u bytes rx %u:", counter++, tx_length, rx_length);
dump_bytes(tx, tx_length);
)
assert(!(tx_length & 3));
assert(!(((uintptr_t)tx) & 3));
assert(!(((uintptr_t)rx) & 3));
assert(!(rx_length & 3));
pio_sm_set_enabled(bus_data->pio, bus_data->pio_sm, false);
pio_sm_set_wrap(bus_data->pio, bus_data->pio_sm, bus_data->pio_offset, bus_data->pio_offset + SPI_OFFSET_END - 1);
pio_sm_clear_fifos(bus_data->pio, bus_data->pio_sm);
pio_sm_set_pindirs_with_mask64(bus_data->pio, bus_data->pio_sm, 1ull << CYW43_PIN_WL_DATA_OUT, 1ull << CYW43_PIN_WL_DATA_OUT);
pio_sm_restart(bus_data->pio, bus_data->pio_sm);
pio_sm_clkdiv_restart(bus_data->pio, bus_data->pio_sm);
pio_sm_put(bus_data->pio, bus_data->pio_sm, tx_length * 8 - 1);
pio_sm_exec(bus_data->pio, bus_data->pio_sm, pio_encode_out(pio_x, 32));
pio_sm_put(bus_data->pio, bus_data->pio_sm, (rx_length - tx_length) * 8 - 1);
pio_sm_exec(bus_data->pio, bus_data->pio_sm, pio_encode_out(pio_y, 32));
pio_sm_exec(bus_data->pio, bus_data->pio_sm, pio_encode_jmp(bus_data->pio_offset));
dma_channel_abort(bus_data->dma_out);
dma_channel_abort(bus_data->dma_in);
dma_channel_config out_config = dma_channel_get_default_config(bus_data->dma_out);
channel_config_set_bswap(&out_config, true);
channel_config_set_dreq(&out_config, pio_get_dreq(bus_data->pio, bus_data->pio_sm, true));
dma_channel_configure(bus_data->dma_out, &out_config, &bus_data->pio->txf[bus_data->pio_sm], tx, tx_length / 4, true);
dma_channel_config in_config = dma_channel_get_default_config(bus_data->dma_in);
channel_config_set_bswap(&in_config, true);
channel_config_set_dreq(&in_config, pio_get_dreq(bus_data->pio, bus_data->pio_sm, false));
channel_config_set_write_increment(&in_config, true);
channel_config_set_read_increment(&in_config, false);
dma_channel_configure(bus_data->dma_in, &in_config, rx + tx_length, &bus_data->pio->rxf[bus_data->pio_sm], rx_length / 4 - tx_length / 4, true);
pio_sm_set_enabled(bus_data->pio, bus_data->pio_sm, true);
__compiler_memory_barrier();
dma_channel_wait_for_finish_blocking(bus_data->dma_out);
dma_channel_wait_for_finish_blocking(bus_data->dma_in);
__compiler_memory_barrier();
memset(rx, 0, tx_length); // make sure we don't have garbage in what would have been returned data if using real SPI
} else if (tx != NULL) {
DUMP_SPI_TRANSACTIONS(
printf("[%lu] bus TX only %u bytes:", counter++, tx_length);
dump_bytes(tx, tx_length);
)
assert(!(((uintptr_t)tx) & 3));
assert(!(tx_length & 3));
pio_sm_set_enabled(bus_data->pio, bus_data->pio_sm, false);
pio_sm_set_wrap(bus_data->pio, bus_data->pio_sm, bus_data->pio_offset, bus_data->pio_offset + SPI_OFFSET_LP1_END - 1);
pio_sm_clear_fifos(bus_data->pio, bus_data->pio_sm);
pio_sm_set_pindirs_with_mask64(bus_data->pio, bus_data->pio_sm, 1ull << CYW43_PIN_WL_DATA_OUT, 1ull << CYW43_PIN_WL_DATA_OUT);
pio_sm_restart(bus_data->pio, bus_data->pio_sm);
pio_sm_clkdiv_restart(bus_data->pio, bus_data->pio_sm);
pio_sm_put(bus_data->pio, bus_data->pio_sm, tx_length * 8 - 1);
pio_sm_exec(bus_data->pio, bus_data->pio_sm, pio_encode_out(pio_x, 32));
pio_sm_put(bus_data->pio, bus_data->pio_sm, 0);
pio_sm_exec(bus_data->pio, bus_data->pio_sm, pio_encode_out(pio_y, 32));
pio_sm_exec(bus_data->pio, bus_data->pio_sm, pio_encode_jmp(bus_data->pio_offset));
dma_channel_abort(bus_data->dma_out);
dma_channel_config out_config = dma_channel_get_default_config(bus_data->dma_out);
channel_config_set_bswap(&out_config, true);
channel_config_set_dreq(&out_config, pio_get_dreq(bus_data->pio, bus_data->pio_sm, true));
dma_channel_configure(bus_data->dma_out, &out_config, &bus_data->pio->txf[bus_data->pio_sm], tx, tx_length / 4, true);
pio_sm_set_enabled(bus_data->pio, bus_data->pio_sm, true);
dma_channel_wait_for_finish_blocking(bus_data->dma_out);
uint32_t fdebug_tx_stall = 1u << (PIO_FDEBUG_TXSTALL_LSB + bus_data->pio_sm);
bus_data->pio->fdebug = fdebug_tx_stall;
while (!(bus_data->pio->fdebug & fdebug_tx_stall)) {
tight_loop_contents();
}
__compiler_memory_barrier();
pio_sm_set_enabled(bus_data->pio, bus_data->pio_sm, false);
pio_sm_set_consecutive_pindirs(bus_data->pio, bus_data->pio_sm, CYW43_PIN_WL_DATA_IN, 1, false);
} else if (rx != NULL) { /* currently do one at a time */
DUMP_SPI_TRANSACTIONS(
printf("[%lu] bus TX %u bytes:", counter++, rx_length);
dump_bytes(rx, rx_length);
)
panic_unsupported();
}

given the fact that:

and:
} else if (rx != NULL) { /* currently do one at a time */

Just asking, maybe it is suppose to be that way.

@peterharperuk
Copy link
Contributor Author

No it won't.

Gadgetoid added a commit to pimoroni/presto that referenced this pull request Mar 7, 2025
This seems to work nicely with WiFi as of the upstream Pico SDK fix:

raspberrypi/pico-sdk#2209
will-v-pi pushed a commit to will-v-pi/pico-sdk that referenced this pull request Mar 20, 2025
We use a pio and dma to write to the cyw43 chip using spi. Normally you
write an address and then read the data from that address, so the pio
program does does a write then read.

If you just want to write data in the case of uploading firmware we
use the fdebug_tx_stall flag to work out if the pio has stalled waiting
to write more data.

The theory is that this flag will also get set if the bus is busy. So
we mistakenly think a write to cyw43 has completed.

Wait for the dma write to complete before waiting for the pio to stall.

Fixes raspberrypi#2206
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants