Skip to content

Commit c1cbb78

Browse files
alwa-nordicKyraLengfeld
authored andcommitted
[nrf fromtree] Bluetooth: Host: Don't call user callback from TX thread
ATT is invoking user callbacks in its net_buf destroy function. It is common practice that these callbacks can block on bt_hci_cmd_alloc(). This is a deadlock when the net_buf_unref() happens inside the HCI driver, invoked from tx_processor. Blocking callbacks like this appear in our own samples. See further down about how this problem was detected. tx_processor not protect against blocking callbacks so it is de-facto forbidden. The Host should not equip net_bufs with dangerous destroy callbacks. This commit makes ATT defer its net_buf destruction and user callback invocation to the system workqueue, so that net_buf_unref is safe to call from non-blocking threads. In the case of the deadlock, the net_buf_unref() was below the tx_processor in the call stack, which (at the time of this commit) is on the system work queue, so defering it to the system work queue is preserving the existing behavior. Future improvement may be to allow the user to provide their own workqueue for ATT callbacks. This deadlock was detected because the following test was failing while moving tx_processor to the bt_taskq: tests/bsim/bluetooth/ll/throughput/tests_scripts/gatt_write.sh The above test has an ATT callback `write_cmd_cb` invokes `bt_conn_le_param_update` can block waiting for `tx_processor`. The reason it was not failing while tx_processor was on the system work queue is that the GATT API has a special non-blocking behavior when called from the system work queue. Signed-off-by: Aleksander Wasaznik <[email protected]> (cherry picked from commit 6889042) Signed-off-by: Kyra Lengfeld <[email protected]>
1 parent c09c6ab commit c1cbb78

File tree

1 file changed

+77
-4
lines changed
  • subsys/bluetooth/host

1 file changed

+77
-4
lines changed

subsys/bluetooth/host/att.c

Lines changed: 77 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,7 @@
2525
#include <zephyr/kernel/thread.h>
2626
#include <zephyr/logging/log.h>
2727
#include <zephyr/net_buf.h>
28+
#include <zephyr/spinlock.h>
2829
#include <zephyr/sys/__assert.h>
2930
#include <zephyr/sys/atomic.h>
3031
#include <zephyr/sys/byteorder.h>
@@ -248,10 +249,37 @@ const char *bt_att_err_to_str(uint8_t att_err)
248249
}
249250
#endif /* CONFIG_BT_ATT_ERR_TO_STR */
250251

251-
static void att_tx_destroy(struct net_buf *buf)
252+
static void att_tx_destroy_work_handler(struct k_work *work);
253+
static K_WORK_DEFINE(att_tx_destroy_work, att_tx_destroy_work_handler);
254+
static struct k_spinlock tx_destroy_queue_lock;
255+
static sys_slist_t tx_destroy_queue = SYS_SLIST_STATIC_INIT(&tx_destroy_queue);
256+
257+
static void att_tx_destroy_work_handler(struct k_work *work)
252258
{
253-
struct bt_att_tx_meta_data *p_meta = att_get_tx_meta_data(buf);
259+
struct net_buf *buf;
260+
struct bt_att_tx_meta_data *p_meta;
254261
struct bt_att_tx_meta_data meta;
262+
sys_snode_t *buf_node;
263+
k_spinlock_key_t key;
264+
bool resubmit;
265+
266+
key = k_spin_lock(&tx_destroy_queue_lock);
267+
buf_node = sys_slist_get(&tx_destroy_queue);
268+
/* If there are more items in the queue, those likely have
269+
* been there before this handler started running and
270+
* coalesced into a single work submission, so we need to
271+
* resubmit.
272+
*/
273+
resubmit = !sys_slist_is_empty(&tx_destroy_queue);
274+
k_spin_unlock(&tx_destroy_queue_lock, key);
275+
276+
/* Spurious wakeups can occur in with some thread interleavings. */
277+
if (buf_node == NULL) {
278+
return;
279+
}
280+
281+
buf = CONTAINER_OF(buf_node, struct net_buf, node);
282+
p_meta = att_get_tx_meta_data(buf);
255283

256284
LOG_DBG("%p", buf);
257285

@@ -266,8 +294,8 @@ static void att_tx_destroy(struct net_buf *buf)
266294
*/
267295
memset(p_meta, 0x00, sizeof(*p_meta));
268296

269-
/* After this point, p_meta doesn't belong to us.
270-
* The user data will be memset to 0 on allocation.
297+
/* After this point, p_meta doesn't belong to us. The user data will
298+
* be memset to 0 on allocation.
271299
*/
272300
net_buf_destroy(buf);
273301

@@ -278,6 +306,51 @@ static void att_tx_destroy(struct net_buf *buf)
278306
if (meta.opcode != 0) {
279307
att_on_sent_cb(&meta);
280308
}
309+
310+
/* We resubmit this work instead of looping to allow other work on
311+
* the work queue to run.
312+
*/
313+
if (resubmit) {
314+
int err = k_work_submit_to_queue(NULL, work);
315+
316+
if (err < 0) {
317+
LOG_ERR("Failed to re-submit %s: %d", __func__, err);
318+
k_oops();
319+
}
320+
}
321+
}
322+
323+
static void att_tx_destroy(struct net_buf *buf)
324+
{
325+
int err;
326+
k_spinlock_key_t key;
327+
328+
/* We need to invoke att_on_sent_cb, which may block. We
329+
* don't want to block in a net buf destroy callback, so we
330+
* defer to a sensible workqueue.
331+
*
332+
* bt_workq cannot be used because it currently forms a
333+
* deadlock with att_pool: bt_workq -> bt_att_recv ->
334+
* send_err_rsp waits for att pool.
335+
*
336+
* We use the system work queue to preserve earlier
337+
* behavior. The tx_processor used to run on the system work
338+
* queue, and it could end up here: tx_processor ->
339+
* bt_hci_send -> net_buf_unref.
340+
*
341+
* A possible alternative is tx_notify_workqueue_get() since
342+
* this workqueue is processing similar "completion" events.
343+
*/
344+
key = k_spin_lock(&tx_destroy_queue_lock);
345+
sys_slist_append(&tx_destroy_queue, &buf->node);
346+
k_spin_unlock(&tx_destroy_queue_lock, key);
347+
348+
err = k_work_submit(&att_tx_destroy_work);
349+
if (err < 0) {
350+
LOG_ERR("Failed to submit att_tx_destroy_work: %d", err);
351+
k_oops();
352+
}
353+
/* Continues in att_tx_destroy_work_handler() */
281354
}
282355

283356
NET_BUF_POOL_DEFINE(att_pool, CONFIG_BT_ATT_TX_COUNT,

0 commit comments

Comments
 (0)