Skip to content

Conversation

@Vudentz
Copy link
Contributor

@Vudentz Vudentz commented Nov 13, 2019

When a segment could not be allocated it should be possible to resume
sending it later once previous segments complete, the only exception is
when there is no previous activity and we are unable to alocate even the
very first segment which should indicate to the caller that it would
block since that only happens on syswq the caller might need to defer to
another thread or resubmit the work.

Fixes #20640

Patch grow a little bit big but the overall impact in memory is quite small:

IPSP sample board=qemu_x86

Before:

make ram_report

l2                                                                                   392     0.35%
  bluetooth                                                                          392     0.35%
    bluetooth.c                                                                      392     0.35%
      __config_net_bt                                                                 12     0.01%
      __device_net_bt                                                                 12     0.01%
      __net_if_dev_net_bt_0                                                           28     0.02%
      __net_if_net_bt_0                                                               32     0.03%
      __net_l2_data_net_bt0                                                            4     0.00%
      bt_context_data                                                                228     0.20%
      bt_if_api                                                                        4     0.00%
      conn_callbacks                                                                  28     0.02%
      default_conn                                                                     4     0.00%
      ipsp_ops                                                                        28     0.02%
      server                                                                          12     0.01%

make rom_report

l2                                                                                  1419     0.47%
  bluetooth                                                                         1419     0.47%
    bluetooth.c                                                                     1419     0.47%
      __config_net_bt                                                                 12     0.00%
      __device_net_bt                                                                 12     0.00%
      __net_if_dev_net_bt_0                                                           28     0.01%
      __net_if_net_bt_0                                                               32     0.01%
      __net_l2_data_net_bt0                                                            4     0.00%
      ad                                                                              16     0.01%
      bt_advertise                                                                   106     0.04%
      bt_context_data                                                                228     0.08%
      bt_if_api                                                                        4     0.00%
      bt_iface_init                                                                   27     0.01%
      conn_callbacks                                                                  28     0.01%
      connected                                                                       46     0.02%
      disconnected                                                                    32     0.01%
      ipsp_accept                                                                     40     0.01%
      ipsp_alloc_buf                                                                  16     0.01%
      ipsp_connected                                                                 247     0.08%
      ipsp_disconnected                                                               47     0.02%
      ipsp_ops                                                                        28     0.01%
      ipsp_recv                                                                      115     0.04%
      net_bt_enable                                                                   35     0.01%
      net_bt_flags                                                                     3     0.00%
      net_bt_get_conn                                                                 17     0.01%
      net_bt_init                                                                     30     0.01%
      net_bt_recv                                                                     23     0.01%
      net_bt_send                                                                    117     0.04%
      net_mgmt_NET_REQUEST_BT_ADVERTISE                                              106     0.04%
      sd                                                                               8     0.00%
      server                                                                          12     0.00%

After:

make ram_report

l2                                                                                   404     0.36%
  bluetooth                                                                          404     0.36%
    bluetooth.c                                                                      404     0.36%
      __config_net_bt                                                                 12     0.01%
      __device_net_bt                                                                 12     0.01%
      __net_if_dev_net_bt_0                                                           28     0.02%
      __net_if_net_bt_0                                                               32     0.03%
      __net_l2_data_net_bt0                                                            4     0.00%
      bt_context_data                                                                240     0.21%
      bt_if_api                                                                        4     0.00%
      conn_callbacks                                                                  28     0.02%
      default_conn                                                                     4     0.00%
      ipsp_ops                                                                        28     0.02%
      server                                                                          12     0.01%

make rom_report

l2                                                                                  1446     0.48%
  bluetooth                                                                         1446     0.48%
    bluetooth.c                                                                     1446     0.48%
      __config_net_bt                                                                 12     0.00%
      __device_net_bt                                                                 12     0.00%
      __net_if_dev_net_bt_0                                                           28     0.01%
      __net_if_net_bt_0                                                               32     0.01%
      __net_l2_data_net_bt0                                                            4     0.00%
      ad                                                                              16     0.01%
      bt_advertise                                                                   106     0.04%
      bt_context_data                                                                240     0.08%
      bt_if_api                                                                        4     0.00%
      bt_iface_init                                                                   27     0.01%
      conn_callbacks                                                                  28     0.01%
      connected                                                                       46     0.02%
      disconnected                                                                    32     0.01%
      ipsp_accept                                                                     40     0.01%
      ipsp_alloc_buf                                                                  16     0.01%
      ipsp_connected                                                                 247     0.08%
      ipsp_disconnected                                                               47     0.02%
      ipsp_ops                                                                        28     0.01%
      ipsp_recv                                                                      115     0.04%
      net_bt_enable                                                                   35     0.01%
      net_bt_flags                                                                     3     0.00%
      net_bt_get_conn                                                                 17     0.01%
      net_bt_init                                                                     30     0.01%
      net_bt_recv                                                                     23     0.01%
      net_bt_send                                                                    132     0.04%
      net_mgmt_NET_REQUEST_BT_ADVERTISE                                              106     0.04%
      sd                                                                               8     0.00%
      server                                                                          12     0.00%

Signed-off-by: Luiz Augusto von Dentz [email protected]

@Vudentz Vudentz added bug The issue is a bug, or the PR is fixing a bug area: Bluetooth labels Nov 13, 2019
@Vudentz Vudentz requested a review from jfischer-no November 13, 2019 16:44
@Vudentz Vudentz added the DNM This PR should not be merged (Do Not Merge) label Nov 13, 2019
@Vudentz
Copy link
Contributor Author

Vudentz commented Nov 13, 2019

This is still crashing so more work is needed.

@joerchan
Copy link
Contributor

joerchan commented Nov 14, 2019

@Vudentz Is this a continuation #20544 ?

It is, but it took it quite a hard measure to transfer a payload that big without deadlocking, but the end result is that bt_l2cap_chan_send will no longer block anymore, it just queue the buffer if it could at least send 1 segment so that we can resume on TX complete.

@Vudentz Vudentz removed the DNM This PR should not be merged (Do Not Merge) label Nov 14, 2019
@Vudentz
Copy link
Contributor Author

Vudentz commented Nov 14, 2019

@jfischer-phytec-iot With this set I was able to ping -s 1024, Ive also added some tunning so it doesn't take forever to reply at the cost of increasing our frag_pool when NET_L2_BT is set.

@Vudentz Vudentz added this to the v2.1.0 milestone Nov 15, 2019
@jfischer-no
Copy link
Contributor

@Vudentz thanks, looks better. It does not lock anymore. I will test it for a while.

Comment on lines 56 to 62
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference here? NET_BUF_POOL_DEFINE() is the same as NET_BUF_POOL_FIXED_DEFINE() in fact, you shouldn't use NET_BUF_POOL_DEFINE() anymore as it's essentially deprecated (even though it might not be formally marked so).

Copy link
Contributor Author

@Vudentz Vudentz Nov 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was just for setting user_data size, the fixed doesn't seem to take the user_data size.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look at the actual definition. All it does with it is a build assert. Ensuring the BT_BUF_USER_DATA_MIN is already covered by range 8 65535 if BT in subsys/net/Kconfig

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that still wasn't clear: since quite a while ago all net_buf objects share the same user data size that gets set using CONFIG_NET_BUF_USER_DATA_SIZE. I.e. you can't set it per-pool

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok now I realize that it doesn't actually assume a 0 size with fixed version, will fix that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@Vudentz Vudentz force-pushed the fix_#20640 branch 2 times, most recently from 21746b4 to 7ce7658 Compare November 16, 2019 16:18
@jfischer-no
Copy link
Contributor

@Vudentz 7ce7658c936e688f9fac3dd05de94d43c017940e does not survive flood like ping -i 0.02 -s 82 2001:db8::1, looks like deadlock. (Previous version 21746b4c8a0e66b3ea96d391576058d2c17e3e71 disconnected during flood after bt_l2cap_chan_disconnect(&chan->chan) )

The test which must be passed:

#!/bin/bash

IP=$1

for (( n=0; n<1; n++ ))
do
        for (( packetsize=0; packetsize<1500; packetsize=$packetsize+1 ))
        do
                echo "payload $packetsize"
                ping6 -w 1 -i 0.002 -c 100 -s $packetsize -q $IP
        done
done

@Vudentz
Copy link
Contributor Author

Vudentz commented Nov 18, 2019

@Vudentz 7ce7658 does not survive flood like ping -i 0.02 -s 82 2001:db8::1, looks like deadlock. (Previous version 21746b4 disconnected during flood after bt_l2cap_chan_disconnect(&chan->chan) )

The test which must be passed:

#!/bin/bash

IP=$1

for (( n=0; n<1; n++ ))
do
        for (( packetsize=0; packetsize<1500; packetsize=$packetsize+1 ))
        do
                echo "payload $packetsize"
                ping6 -w 1 -i 0.002 -c 100 -s $packetsize -q $IP
        done
done

So I managed to fix the lockup, though sometimes I experience crashes with qemu which I don't experience with native_posix, the backtrace points to buf.c:207 so I guess it is some sort of corruption.

@Vudentz
Copy link
Contributor Author

Vudentz commented Nov 19, 2019

So I managed to fix the lockup, though sometimes I experience crashes with qemu which I don't experience with native_posix, the backtrace points to buf.c:207 so I guess it is some sort of corruption.

Fixed these issues, the channel will now shutdown and eventually be disconnected if credits cannot be sent due to congestion caused by flooding.

@Vudentz Vudentz added the priority: high High impact/importance bug label Nov 22, 2019
@Vudentz
Copy link
Contributor Author

Vudentz commented Nov 23, 2019

Depends on #20951

@dleach02 dleach02 modified the milestones: v2.1.0, v2.2.0 Dec 3, 2019
@Vudentz
Copy link
Contributor Author

Vudentz commented Dec 4, 2019

@dleach02 Im not very sure why this was pushed back to 2.2 but there is a high chance that these changes will need to be backported to 2.1 even after the release because this also fixes the regression where TX errors, on bt_conn_send_cb, are not dealt with so IPSP will not work the as before.

@Vudentz Vudentz requested review from jhedberg and joerchan December 8, 2019 22:44
@Vudentz
Copy link
Contributor Author

Vudentz commented Dec 12, 2019

@jhedberg Is there anything blocking us to merge this now that 2.1 has been release?

@jhedberg
Copy link
Member

@Vudentz there still seems to be some outstanding stuff from @joerchan (at least GitHub is saying "Merging is blocked" because of that)

@Vudentz
Copy link
Contributor Author

Vudentz commented Dec 12, 2019

@joerchan Ive updated the comment regarding cleaning the outstanding tx_buf and a define for the RTX timeout.

@joerchan
Copy link
Contributor

	if (IS_ENABLED(CONFIG_BT_DEBUG_CONN)) {
		struct bt_conn_tx *tx = k_fifo_get(&free_tx, K_NO_WAIT);

		if (tx) {
			return tx;
		}
	}

	return k_fifo_get(&free_tx, K_FOREVER);

I think this one could be a changed to just the last returned now that BT_WARN is gone.

@Vudentz
Copy link
Contributor Author

Vudentz commented Dec 12, 2019

	if (IS_ENABLED(CONFIG_BT_DEBUG_CONN)) {
		struct bt_conn_tx *tx = k_fifo_get(&free_tx, K_NO_WAIT);

		if (tx) {
			return tx;
		}
	}

	return k_fifo_get(&free_tx, K_FOREVER);

I think this one could be a changed to just the last returned now that BT_WARN is gone.

Will fix that, this change was not intended actually so Im keeping the warning.

@joerchan
Copy link
Contributor

@Vudentz Great. Sorry for the questions. I'm not that familiar with this part of the code :)

When a segment could not be allocated it should be possible to resume
sending it later once previous segments complete, the only exception is
when there is no previous activity and we are unable to alocate even the
very first segment which should indicate to the caller that it would
block since that only happens on syswq the caller might need to defer to
another thread or resubmit the work.

Fixes zephyrproject-rtos#20640

Signed-off-by: Luiz Augusto von Dentz <[email protected]>
This enable chaning the function and line number making it easier to
debug where a buffer allocation is blocking.

Signed-off-by: Luiz Augusto von Dentz <[email protected]>
When NET_L2_BT the memory pressure for fragments can be quite high
since that would be transfering IP packets which are considerable big
so this makes our frag_pool to be of the same size as NET_BUF_TX_COUNT.

Signed-off-by: Luiz Augusto von Dentz <[email protected]>
If NET_L2_BT is enabled we need enough acl_in_pool needs to be big
enough to contain a full IP packet since that is no longer processed by
RX thread buffer would be queued to syswq to reassemble the SDU.

Signed-off-by: Luiz Augusto von Dentz <[email protected]>
This prevents disconnect request packets to not being sent due to lack
of buffers normally caused by flooding or congestion.

Signed-off-by: Luiz Augusto von Dentz <[email protected]>
Drop packets received while disconnecting since they would most likely
be flushed once peer respond there is no gain in keeping them on a
queue.

Signed-off-by: Luiz Augusto von Dentz <[email protected]>
This offloads the processing of tx_queue to a work so the callbacks
calling resume don't start sending packets directly which can cause
stack overflow.

Signed-off-by: Luiz Augusto von Dentz <[email protected]>
Packets shall never fail to be sent now that they are queued, so if an
error occured there is no point in keep the channel connected.

Signed-off-by: Luiz Augusto von Dentz <[email protected]>
This can actually block system critical threads like the syswq.

Signed-off-by: Luiz Augusto von Dentz <[email protected]>
This introduces BT_L2CAP_STATUS_SHUTDOWN which is used to indicate when
a channel has been shutdown.

Signed-off-by: Luiz Augusto von Dentz <[email protected]>
Now that bt_l2cap_send_cb can fail the buffer state needs to be save
and restored otherwise the data stored on it would be lost.

Signed-off-by: Luiz Augusto von Dentz <[email protected]>
This documents the special cases where -EAGAIN is returned which leads
the buffer to be queued.

Signed-off-by: Luiz Augusto von Dentz <[email protected]>
This adds a define for the so called RTX timeout.

Signed-off-by: Luiz Augusto von Dentz <[email protected]>
@jhedberg jhedberg merged commit 9a111df into zephyrproject-rtos:master Dec 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: API Changes to public APIs area: Bluetooth area: Networking bug The issue is a bug, or the PR is fixing a bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bluetooth: l2cap do not recover when faced with long packets and run out of buffers

6 participants