Bluetooth: L2CAP: Queue packets when a segment could not be allocated #20661

Vudentz · 2019-11-13T16:44:54Z

When a segment could not be allocated it should be possible to resume
sending it later once previous segments complete, the only exception is
when there is no previous activity and we are unable to alocate even the
very first segment which should indicate to the caller that it would
block since that only happens on syswq the caller might need to defer to
another thread or resubmit the work.

Fixes #20640

Patch grow a little bit big but the overall impact in memory is quite small:

IPSP sample board=qemu_x86

Before:

make ram_report

l2                                                                                   392     0.35%
  bluetooth                                                                          392     0.35%
    bluetooth.c                                                                      392     0.35%
      __config_net_bt                                                                 12     0.01%
      __device_net_bt                                                                 12     0.01%
      __net_if_dev_net_bt_0                                                           28     0.02%
      __net_if_net_bt_0                                                               32     0.03%
      __net_l2_data_net_bt0                                                            4     0.00%
      bt_context_data                                                                228     0.20%
      bt_if_api                                                                        4     0.00%
      conn_callbacks                                                                  28     0.02%
      default_conn                                                                     4     0.00%
      ipsp_ops                                                                        28     0.02%
      server                                                                          12     0.01%

make rom_report

l2                                                                                  1419     0.47%
  bluetooth                                                                         1419     0.47%
    bluetooth.c                                                                     1419     0.47%
      __config_net_bt                                                                 12     0.00%
      __device_net_bt                                                                 12     0.00%
      __net_if_dev_net_bt_0                                                           28     0.01%
      __net_if_net_bt_0                                                               32     0.01%
      __net_l2_data_net_bt0                                                            4     0.00%
      ad                                                                              16     0.01%
      bt_advertise                                                                   106     0.04%
      bt_context_data                                                                228     0.08%
      bt_if_api                                                                        4     0.00%
      bt_iface_init                                                                   27     0.01%
      conn_callbacks                                                                  28     0.01%
      connected                                                                       46     0.02%
      disconnected                                                                    32     0.01%
      ipsp_accept                                                                     40     0.01%
      ipsp_alloc_buf                                                                  16     0.01%
      ipsp_connected                                                                 247     0.08%
      ipsp_disconnected                                                               47     0.02%
      ipsp_ops                                                                        28     0.01%
      ipsp_recv                                                                      115     0.04%
      net_bt_enable                                                                   35     0.01%
      net_bt_flags                                                                     3     0.00%
      net_bt_get_conn                                                                 17     0.01%
      net_bt_init                                                                     30     0.01%
      net_bt_recv                                                                     23     0.01%
      net_bt_send                                                                    117     0.04%
      net_mgmt_NET_REQUEST_BT_ADVERTISE                                              106     0.04%
      sd                                                                               8     0.00%
      server                                                                          12     0.00%

After:

make ram_report

l2                                                                                   404     0.36%
  bluetooth                                                                          404     0.36%
    bluetooth.c                                                                      404     0.36%
      __config_net_bt                                                                 12     0.01%
      __device_net_bt                                                                 12     0.01%
      __net_if_dev_net_bt_0                                                           28     0.02%
      __net_if_net_bt_0                                                               32     0.03%
      __net_l2_data_net_bt0                                                            4     0.00%
      bt_context_data                                                                240     0.21%
      bt_if_api                                                                        4     0.00%
      conn_callbacks                                                                  28     0.02%
      default_conn                                                                     4     0.00%
      ipsp_ops                                                                        28     0.02%
      server                                                                          12     0.01%

make rom_report

l2                                                                                  1446     0.48%
  bluetooth                                                                         1446     0.48%
    bluetooth.c                                                                     1446     0.48%
      __config_net_bt                                                                 12     0.00%
      __device_net_bt                                                                 12     0.00%
      __net_if_dev_net_bt_0                                                           28     0.01%
      __net_if_net_bt_0                                                               32     0.01%
      __net_l2_data_net_bt0                                                            4     0.00%
      ad                                                                              16     0.01%
      bt_advertise                                                                   106     0.04%
      bt_context_data                                                                240     0.08%
      bt_if_api                                                                        4     0.00%
      bt_iface_init                                                                   27     0.01%
      conn_callbacks                                                                  28     0.01%
      connected                                                                       46     0.02%
      disconnected                                                                    32     0.01%
      ipsp_accept                                                                     40     0.01%
      ipsp_alloc_buf                                                                  16     0.01%
      ipsp_connected                                                                 247     0.08%
      ipsp_disconnected                                                               47     0.02%
      ipsp_ops                                                                        28     0.01%
      ipsp_recv                                                                      115     0.04%
      net_bt_enable                                                                   35     0.01%
      net_bt_flags                                                                     3     0.00%
      net_bt_get_conn                                                                 17     0.01%
      net_bt_init                                                                     30     0.01%
      net_bt_recv                                                                     23     0.01%
      net_bt_send                                                                    132     0.04%
      net_mgmt_NET_REQUEST_BT_ADVERTISE                                              106     0.04%
      sd                                                                               8     0.00%
      server                                                                          12     0.00%

Signed-off-by: Luiz Augusto von Dentz [email protected]

Vudentz · 2019-11-13T16:45:34Z

This is still crashing so more work is needed.

joerchan · 2019-11-14T11:55:55Z

@Vudentz Is this a continuation #20544 ?

It is, but it took it quite a hard measure to transfer a payload that big without deadlocking, but the end result is that bt_l2cap_chan_send will no longer block anymore, it just queue the buffer if it could at least send 1 segment so that we can resume on TX complete.

Vudentz · 2019-11-14T14:30:29Z

@jfischer-phytec-iot With this set I was able to ping -s 1024, Ive also added some tunning so it doesn't take forever to reply at the cost of increasing our frag_pool when NET_L2_BT is set.

jfischer-no · 2019-11-15T17:44:06Z

@Vudentz thanks, looks better. It does not lock anymore. I will test it for a while.

jhedberg · 2019-11-15T17:58:54Z

subsys/bluetooth/host/conn.c

What's the difference here? NET_BUF_POOL_DEFINE() is the same as NET_BUF_POOL_FIXED_DEFINE() in fact, you shouldn't use NET_BUF_POOL_DEFINE() anymore as it's essentially deprecated (even though it might not be formally marked so).

It was just for setting user_data size, the fixed doesn't seem to take the user_data size.

Take a look at the actual definition. All it does with it is a build assert. Ensuring the BT_BUF_USER_DATA_MIN is already covered by range 8 65535 if BT in subsys/net/Kconfig

If that still wasn't clear: since quite a while ago all net_buf objects share the same user data size that gets set using CONFIG_NET_BUF_USER_DATA_SIZE. I.e. you can't set it per-pool

Ok now I realize that it doesn't actually assume a 0 size with fixed version, will fix that.

jfischer-no · 2019-11-16T17:12:50Z

@Vudentz 7ce7658c936e688f9fac3dd05de94d43c017940e does not survive flood like ping -i 0.02 -s 82 2001:db8::1, looks like deadlock. (Previous version 21746b4c8a0e66b3ea96d391576058d2c17e3e71 disconnected during flood after bt_l2cap_chan_disconnect(&chan->chan) )

The test which must be passed:

#!/bin/bash

IP=$1

for (( n=0; n<1; n++ ))
do
        for (( packetsize=0; packetsize<1500; packetsize=$packetsize+1 ))
        do
                echo "payload $packetsize"
                ping6 -w 1 -i 0.002 -c 100 -s $packetsize -q $IP
        done
done

Vudentz · 2019-11-18T13:08:40Z

@Vudentz 7ce7658 does not survive flood like ping -i 0.02 -s 82 2001:db8::1, looks like deadlock. (Previous version 21746b4 disconnected during flood after bt_l2cap_chan_disconnect(&chan->chan) )

The test which must be passed:
#!/bin/bash

IP=$1

for (( n=0; n<1; n++ ))
do
        for (( packetsize=0; packetsize<1500; packetsize=$packetsize+1 ))
        do
                echo "payload $packetsize"
                ping6 -w 1 -i 0.002 -c 100 -s $packetsize -q $IP
        done
done

So I managed to fix the lockup, though sometimes I experience crashes with qemu which I don't experience with native_posix, the backtrace points to buf.c:207 so I guess it is some sort of corruption.

Vudentz · 2019-11-19T13:44:51Z

So I managed to fix the lockup, though sometimes I experience crashes with qemu which I don't experience with native_posix, the backtrace points to buf.c:207 so I guess it is some sort of corruption.

Fixed these issues, the channel will now shutdown and eventually be disconnected if credits cannot be sent due to congestion caused by flooding.

Vudentz · 2019-11-23T13:34:34Z

Depends on #20951

Vudentz · 2019-12-04T12:31:58Z

@dleach02 Im not very sure why this was pushed back to 2.2 but there is a high chance that these changes will need to be backported to 2.1 even after the release because this also fixes the regression where TX errors, on bt_conn_send_cb, are not dealt with so IPSP will not work the as before.

Vudentz · 2019-12-12T16:24:13Z

@jhedberg Is there anything blocking us to merge this now that 2.1 has been release?

jhedberg · 2019-12-12T16:28:06Z

@Vudentz there still seems to be some outstanding stuff from @joerchan (at least GitHub is saying "Merging is blocked" because of that)

Vudentz · 2019-12-12T16:41:12Z

@joerchan Ive updated the comment regarding cleaning the outstanding tx_buf and a define for the RTX timeout.

joerchan · 2019-12-12T16:41:44Z

	if (IS_ENABLED(CONFIG_BT_DEBUG_CONN)) {
		struct bt_conn_tx *tx = k_fifo_get(&free_tx, K_NO_WAIT);

		if (tx) {
			return tx;
		}
	}

	return k_fifo_get(&free_tx, K_FOREVER);

I think this one could be a changed to just the last returned now that BT_WARN is gone.

Vudentz · 2019-12-12T16:43:14Z

	if (IS_ENABLED(CONFIG_BT_DEBUG_CONN)) {
		struct bt_conn_tx *tx = k_fifo_get(&free_tx, K_NO_WAIT);

		if (tx) {
			return tx;
		}
	}

	return k_fifo_get(&free_tx, K_FOREVER);
I think this one could be a changed to just the last returned now that BT_WARN is gone.

Will fix that, this change was not intended actually so Im keeping the warning.

joerchan · 2019-12-12T16:49:05Z

@Vudentz Great. Sorry for the questions. I'm not that familiar with this part of the code :)

When a segment could not be allocated it should be possible to resume sending it later once previous segments complete, the only exception is when there is no previous activity and we are unable to alocate even the very first segment which should indicate to the caller that it would block since that only happens on syswq the caller might need to defer to another thread or resubmit the work. Fixes zephyrproject-rtos#20640 Signed-off-by: Luiz Augusto von Dentz <[email protected]>

This enable chaning the function and line number making it easier to debug where a buffer allocation is blocking. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

When NET_L2_BT the memory pressure for fragments can be quite high since that would be transfering IP packets which are considerable big so this makes our frag_pool to be of the same size as NET_BUF_TX_COUNT. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

If NET_L2_BT is enabled we need enough acl_in_pool needs to be big enough to contain a full IP packet since that is no longer processed by RX thread buffer would be queued to syswq to reassemble the SDU. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

This prevents disconnect request packets to not being sent due to lack of buffers normally caused by flooding or congestion. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Drop packets received while disconnecting since they would most likely be flushed once peer respond there is no gain in keeping them on a queue. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

This offloads the processing of tx_queue to a work so the callbacks calling resume don't start sending packets directly which can cause stack overflow. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Packets shall never fail to be sent now that they are queued, so if an error occured there is no point in keep the channel connected. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

This can actually block system critical threads like the syswq. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

This introduces BT_L2CAP_STATUS_SHUTDOWN which is used to indicate when a channel has been shutdown. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Now that bt_l2cap_send_cb can fail the buffer state needs to be save and restored otherwise the data stored on it would be lost. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

This documents the special cases where -EAGAIN is returned which leads the buffer to be queued. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

This adds a define for the so called RTX timeout. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Vudentz added bug The issue is a bug, or the PR is fixing a bug area: Bluetooth labels Nov 13, 2019

Vudentz requested a review from jfischer-no November 13, 2019 16:44

Vudentz requested review from jhedberg and joerchan as code owners November 13, 2019 16:44

Vudentz added the DNM This PR should not be merged (Do Not Merge) label Nov 13, 2019

Vudentz force-pushed the fix_#20640 branch from a24cd90 to 22fe7d5 Compare November 14, 2019 14:24

Vudentz removed the DNM This PR should not be merged (Do Not Merge) label Nov 14, 2019

Vudentz added this to the v2.1.0 milestone Nov 15, 2019

jhedberg previously requested changes Nov 15, 2019

View reviewed changes

Vudentz force-pushed the fix_#20640 branch 2 times, most recently from 21746b4 to 7ce7658 Compare November 16, 2019 16:18

Vudentz force-pushed the fix_#20640 branch from 626f6b1 to 2faf7fc Compare November 18, 2019 13:04

Vudentz requested a review from jhedberg November 18, 2019 13:06

zephyrbot added the area: API Changes to public APIs label Nov 18, 2019

Vudentz force-pushed the fix_#20640 branch from 81d3a77 to 4880d21 Compare November 19, 2019 13:41

Vudentz requested review from jukkar and tbursztyka as code owners November 19, 2019 13:41

zephyrbot added the area: Networking label Nov 19, 2019

Vudentz force-pushed the fix_#20640 branch from 4880d21 to e28e2f1 Compare November 20, 2019 11:02

Vudentz added the priority: high High impact/importance bug label Nov 22, 2019

Vudentz force-pushed the fix_#20640 branch from e28e2f1 to b92086e Compare November 23, 2019 13:27

dleach02 modified the milestones: v2.1.0, v2.2.0 Dec 3, 2019

Vudentz force-pushed the fix_#20640 branch from c64d0dc to 18b8f5a Compare December 4, 2019 12:27

Vudentz requested review from jhedberg and joerchan December 8, 2019 22:44

jhedberg approved these changes Dec 12, 2019

View reviewed changes

Vudentz force-pushed the fix_#20640 branch from 18b8f5a to 48db1e9 Compare December 12, 2019 16:39

joerchan approved these changes Dec 12, 2019

View reviewed changes

Vudentz added 13 commits December 12, 2019 08:49

Bluetooth: conn: Add debug variant to buffer allocation

37d05f3

This enable chaning the function and line number making it easier to debug where a buffer allocation is blocking. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Bluetooth: L2CAP: Add dedicated pool for disconnect request

1248d64

This prevents disconnect request packets to not being sent due to lack of buffers normally caused by flooding or congestion. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Bluetooth: L2CAP: Ignore packets received while disconnecting

2cf0951

Drop packets received while disconnecting since they would most likely be flushed once peer respond there is no gain in keeping them on a queue. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Bluetooth: L2CAP: Offload processing of tx_queue to a work

7483ed2

This offloads the processing of tx_queue to a work so the callbacks calling resume don't start sending packets directly which can cause stack overflow. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

net: bt: Disconnect in case of errors sending a packet

a1a17ea

Packets shall never fail to be sent now that they are queued, so if an error occured there is no point in keep the channel connected. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

net: bt: Don't wait for buffer indefinatelly

4dabce3

This can actually block system critical threads like the syswq. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Bluetooth: L2CAP: Introduce BT_L2CAP_STATUS_SHUTDOWN flag

088526c

This introduces BT_L2CAP_STATUS_SHUTDOWN which is used to indicate when a channel has been shutdown. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Bluetooth: L2CAP: Fix data leaks when segment cannot be sent

2688899

Now that bt_l2cap_send_cb can fail the buffer state needs to be save and restored otherwise the data stored on it would be lost. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Bluetooth: L2CAP: Document cases where -EAGAIN is returned

1386bfc

This documents the special cases where -EAGAIN is returned which leads the buffer to be queued. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Bluetooth: L2CAP: Add define for signalling timeout

71bdf6c

This adds a define for the so called RTX timeout. Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Vudentz force-pushed the fix_#20640 branch from 48db1e9 to 71bdf6c Compare December 12, 2019 16:50

jhedberg merged commit 9a111df into zephyrproject-rtos:master Dec 13, 2019

Bluetooth: L2CAP: Queue packets when a segment could not be allocated #20661

Bluetooth: L2CAP: Queue packets when a segment could not be allocated #20661

Uh oh!

Conversation

Vudentz commented Nov 13, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Vudentz commented Nov 13, 2019

Uh oh!

joerchan commented Nov 14, 2019 • edited by Vudentz Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Vudentz commented Nov 14, 2019

Uh oh!

jfischer-no commented Nov 15, 2019

Uh oh!

jhedberg Nov 15, 2019

Choose a reason for hiding this comment

Uh oh!

Vudentz Nov 15, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jhedberg Nov 15, 2019

Choose a reason for hiding this comment

Uh oh!

jhedberg Nov 15, 2019

Choose a reason for hiding this comment

Uh oh!

Vudentz Nov 15, 2019

Choose a reason for hiding this comment

Uh oh!

Vudentz Nov 16, 2019

Choose a reason for hiding this comment

Uh oh!

jfischer-no commented Nov 16, 2019

Uh oh!

Vudentz commented Nov 18, 2019

Uh oh!

Vudentz commented Nov 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Vudentz commented Nov 23, 2019

Uh oh!

Vudentz commented Dec 4, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Vudentz commented Dec 12, 2019

Uh oh!

jhedberg commented Dec 12, 2019

Uh oh!

Vudentz commented Dec 12, 2019

Uh oh!

joerchan commented Dec 12, 2019

Uh oh!

Vudentz commented Dec 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

joerchan commented Dec 12, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Vudentz commented Nov 13, 2019 •

edited

Loading

joerchan commented Nov 14, 2019 •

edited by Vudentz

Loading

Vudentz Nov 15, 2019 •

edited

Loading

Vudentz commented Nov 19, 2019 •

edited

Loading

Vudentz commented Dec 4, 2019 •

edited

Loading

Vudentz commented Dec 12, 2019 •

edited

Loading