Bluetooth: Mesh: Access model recv #74069

LingaoM · 2024-06-05T03:04:05Z

LingaoM
Jun 5, 2024
Collaborator

https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/bluetooth/mesh/transport.c#L1585
https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/bluetooth/mesh/transport.c#L1027

The Mesh protocol stack uses these static variables to cache messages, and the cached messages are then processed by the application layer. At first glance this does not seem to be a problem, because the messages are processed in the BT RX thread, and the cooperative threads used by Zephyr avoid contention.

But this overlooks two points:
Mesh loopback messages are processed in the syswork (system workqueue) context.
Processing a message in model->recv does not guarantee that the current thread stays in the running state the whole time.

Consider the following situation:
A message coming from BT RX is being processed at the application layer, but the handler calls a blocking API, perhaps taking a semaphore, perhaps k_sleep, or a flash operation (https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/bluetooth/controller/flash/soc_flash_nrf_ticker.c#L225), etc. This causes the BT RX thread to temporarily lose the opportunity to run.
While it is blocked, a loopback message is processed in syswork, so the static buffer is accessed by two different threads at the same time.
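The situation above can be sketched in plain C without any Zephyr dependency. This is only a single-threaded simulation under stated assumptions: `rx_buf`, `deliver()`, `simulated_blocking_call()` and the handler names are hypothetical stand-ins, not the actual code in transport.c, and the "blocking call" is modelled by directly running the loopback delivery that the system workqueue would perform while the BT RX thread is suspended.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a single static receive buffer in the stack. */
static uint8_t rx_buf[16];
static size_t rx_len;

typedef void (*model_recv_t)(const uint8_t *data, size_t len);

/* Copy a payload into the shared static buffer and pass that buffer to the
 * model handler, the way a receive path with one static buffer would.
 */
static void deliver(const uint8_t *payload, size_t len, model_recv_t recv)
{
	assert(len <= sizeof(rx_buf));
	memcpy(rx_buf, payload, len);
	rx_len = len;
	recv(rx_buf, rx_len);
}

/* Handler for the loopback message; it does nothing with the data. */
static void loopback_recv(const uint8_t *data, size_t len)
{
	(void)data;
	(void)len;
}

/* Models a blocking call made from a handler: while the caller is suspended,
 * the system workqueue runs and delivers a loopback message through the
 * same static buffer.
 */
static void simulated_blocking_call(void)
{
	static const uint8_t loopback_msg[] = { 0xBB, 0xBB, 0xBB };

	deliver(loopback_msg, sizeof(loopback_msg), loopback_recv);
}

static bool corrupted;

/* Handler for the BT RX message; it "blocks" halfway through. */
static void blocking_recv(const uint8_t *data, size_t len)
{
	uint8_t first_before = data[0];

	(void)len;
	simulated_blocking_call();

	/* The handler still points at rx_buf, which now holds the loopback data. */
	corrupted = (data[0] != first_before);
}

/* Returns true if the BT RX handler saw its payload change underneath it. */
bool static_buf_race_demo(void)
{
	static const uint8_t rx_msg[] = { 0xAA, 0xAA, 0xAA };

	corrupted = false;
	deliver(rx_msg, sizeof(rx_msg), blocking_recv);
	return corrupted;
}
```

Calling `static_buf_race_demo()` returns true in this simulation, because the second delivery overwrites the buffer that the first handler is still reading. Whether the same interleaving can occur on a real target is exactly what the rest of this thread discusses.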

alxelax · 2024-06-05T07:58:01Z

alxelax
Jun 5, 2024
Collaborator

Hi @LingaoM, using blocking APIs from interrupt handlers or kernel services (including syswork) is prohibited. A blocking API suspends the calling thread until the awaited event occurs or the timeout expires.

Considering that blocking APIs must not be called from kernel services, the BT RX thread cannot preempt the system work handler in the middle of its execution, since mesh runs under cooperative scheduling. I do not see the problem here.

Probably I do not understand the issue to its full extent. Could you provide a more detailed explanation if you still think this is an issue?

0 replies

alxelax · 2024-06-16T09:14:06Z

alxelax
Jun 16, 2024
Collaborator

FYI @PavelVPV

0 replies

PavelVPV · 2024-06-18T19:33:22Z

PavelVPV
Jun 18, 2024
Collaborator

I didn't spend much time on this, but it seems to me you are right @LingaoM. I can't recall that we have any statement regarding the use of blocking APIs in the mesh callbacks.

I don't see that using blocking APIs in the system workqueue is prohibited: https://docs.zephyrproject.org/latest/kernel/services/threads/workqueue.html#system-workqueue. And I can't find any such statement regarding the BT RX thread.

The flash operations caused by the mesh stack are executed on the system workqueue (https://docs.zephyrproject.org/latest/connectivity/bluetooth/api/mesh/core.html#work-item-execution-context) when a separate thread is not used. And they are executed from separate work, not from a message handler.

So probably the only case when this issue can happen is when a user does this in a model handler. Unless we yield somewhere in mesh code (but again, I didn't look much at the code).

I need to think more about this, but it seems that allowing the stack to be entered twice from different threads is not good, and perhaps the solution should be somewhere in this area rather than doing something with NET_BUF_SIMPLE_DEFINE_STATIC. But first it would be good to try to reproduce this issue, e.g. in babblesim.
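One way to picture "not entering the stack twice" is a re-entrancy guard that queues a second message instead of processing it on top of the first. The following is a minimal single-threaded sketch in plain C, not a proposal for the actual Zephyr code: `deliver_guarded()`, the `pending` queue and all other names are hypothetical, and a real implementation would also have to protect the flag and the queue themselves against concurrent access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUARDED_MSG_MAX     16
#define GUARDED_QUEUE_DEPTH 4

typedef void (*guarded_recv_t)(const uint8_t *data, size_t len);

/* A message that arrived while the stack was already busy. */
struct pending_msg {
	uint8_t data[GUARDED_MSG_MAX];
	size_t len;
	guarded_recv_t recv;
};

static struct pending_msg pending[GUARDED_QUEUE_DEPTH];
static size_t pending_count;
static bool in_stack;
static uint8_t guarded_rx_buf[GUARDED_MSG_MAX];

/* Deliver a message through the shared buffer, or queue it if another
 * delivery is still in progress. Returns false if the queue is full.
 */
bool deliver_guarded(const uint8_t *payload, size_t len, guarded_recv_t recv)
{
	assert(len <= GUARDED_MSG_MAX);

	if (in_stack) {
		if (pending_count == GUARDED_QUEUE_DEPTH) {
			return false;
		}
		memcpy(pending[pending_count].data, payload, len);
		pending[pending_count].len = len;
		pending[pending_count].recv = recv;
		pending_count++;
		return true;
	}

	in_stack = true;
	memcpy(guarded_rx_buf, payload, len);
	recv(guarded_rx_buf, len);

	/* Drain messages that arrived while the handler was running. */
	for (size_t i = 0; i < pending_count; i++) {
		memcpy(guarded_rx_buf, pending[i].data, pending[i].len);
		pending[i].recv(guarded_rx_buf, pending[i].len);
	}
	pending_count = 0;
	in_stack = false;
	return true;
}

static bool payload_intact;
static int loopback_calls;
static uint8_t loopback_first;

static void queued_loopback_recv(const uint8_t *data, size_t len)
{
	(void)len;
	loopback_calls++;
	loopback_first = data[0];
}

/* Handler that "blocks": a loopback message arrives in the middle of it. */
static void guarded_blocking_recv(const uint8_t *data, size_t len)
{
	static const uint8_t loopback_msg[] = { 0xBB, 0xBB, 0xBB };
	uint8_t first_before = data[0];

	(void)len;
	deliver_guarded(loopback_msg, sizeof(loopback_msg), queued_loopback_recv);
	payload_intact = (data[0] == first_before);
}

/* Returns true if the first payload stayed intact and the loopback message
 * was still delivered exactly once, after the first handler returned.
 */
bool guarded_delivery_demo(void)
{
	static const uint8_t rx_msg[] = { 0xAA, 0xAA, 0xAA };

	payload_intact = false;
	loopback_calls = 0;
	loopback_first = 0;
	deliver_guarded(rx_msg, sizeof(rx_msg), guarded_blocking_recv);
	return payload_intact && loopback_calls == 1 && loopback_first == 0xBB;
}
```

In this sketch the loopback message is copied into its own queue slot and only reaches the shared buffer after the first handler has returned, so the first handler never sees its data change. The cost is an extra copy and a bounded queue, which is a design trade-off rather than a recommendation.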

3 replies

alxelax Jun 19, 2024
Collaborator

> I don't see that using blocking APIs in the system workqueue is prohibited: https://docs.zephyrproject.org/latest/kernel/services/threads/workqueue.html#system-workqueue. And I can't find any such statement regarding the BT RX thread.

This obviously should be mentioned in the documentation, since it is a natural limitation that follows from the implementation. An application can block neither kernel system services nor the Bluetooth RX thread. By the way, we've already added it to the internal documentation: https://github.com/nrfconnect/sdk-nrf/blob/main/doc/nrf/libraries/bluetooth_services/mesh/model_types.rst?plain=1#L62-L67

alxelax Jun 19, 2024
Collaborator

Please also consider that, according to the specification, relaying happens after the frame is passed up to the upper layer. If you call a blocking API in the mesh RX path, which really is the Bluetooth RX thread (for instance in any model command handler), it will break the relay functionality as well as the subnetwork bridge (3.4.6.3 Receiving a Network PDU). The subnetwork bridge shall check and update the replay protection list, but the RPL might be updated during the blocking call. https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/bluetooth/mesh/net.c#L869-L896

PavelVPV Jun 20, 2024
Collaborator

> This obviously should be mentioned in the documentation, since it is a natural limitation that follows from the implementation. An application can block neither kernel system services nor the Bluetooth RX thread. By the way, we've already added it to the internal documentation: https://github.com/nrfconnect/sdk-nrf/blob/main/doc/nrf/libraries/bluetooth_services/mesh/model_types.rst?plain=1#L62-L67

This relates specifically to the mesh models' blocking API, and the reason is that the stack uses the system workqueue to function, so calling the models' blocking API from the system workqueue basically won't work, as the message won't even be sent. You can, however, try to access some peripheral that doesn't rely on the system workqueue and wait for the response in place.

> Please also consider that, according to the specification, relaying happens after the frame is passed up to the upper layer. If you call a blocking API in the mesh RX path, which really is the Bluetooth RX thread (for instance in any model command handler), it will break the relay functionality as well as the subnetwork bridge (3.4.6.3 Receiving a Network PDU). The subnetwork bridge shall check and update the replay protection list, but the RPL might be updated during the blocking call. https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/bluetooth/mesh/net.c#L869-L896

This is a specification requirement and is not related to the implementation (I mean the spec doesn't care whether the implementation uses threads, preemption, etc.). I don't agree that this will break the relay functionality, as the message will be relayed as soon as the blocking call returns. The spec doesn't mandate how fast the relay should happen (in milliseconds), and we don't know what exactly a user can do from the handler (e.g. something completely unrelated to Bluetooth). The thread can be blocked for 1 ms, which won't have a noticeable impact on relay performance but is enough for the OS to schedule another thread.
