Skip to content

Conversation

@joerchan
Copy link
Contributor

@joerchan joerchan commented Jan 8, 2021

Fix deadlock in RX thread when ATT all buffers are taken by ATT requests and the RX thread blocks waiting for an buffer.
Solve this by not having a reference to the ATT request buffer until the ATT response is received, instead get a new ATT buffer in the case where it has to be resent.

Fixes: #28685

Fix ATT transaction violation in indications if the application has not set a callback function. In this case the indication was not treated as a transaction, leading to transaction violation.

Fix security elevation not working properly when the elevation is done in a two-step, this can happen when MITM is not enforced and security level 3 is required on the attribute.

Fix SMP debug keys overwrite handling not allowing to overwrite debug keys with debug keys, and allowing debug keys to overwrite existing bond information.

@github-actions github-actions bot added area: Bluetooth area: Bluetooth Host Bluetooth Host (excluding BR/EDR) area: Tests Issues related to a particular existing or missing test labels Jan 8, 2021
@carlescufi carlescufi requested a review from fnde-ot January 13, 2021 10:29
@carlescufi
Copy link
Member

Thanks for the PR @joerchan. This is quite a big change, so I'd like to ask you, what kind of regression tests have you performed on it?

@joerchan
Copy link
Contributor Author

Thanks for the PR @joerchan. This is quite a big change, so I'd like to ask you, what kind of regression tests have you performed on it?

I have run the EDTT test suite locally, and reproduced and verified the issues I'm fixing.
The change might look big, but most of the changes in gatt is just moving of code.

The PR can be split into smaller pieces, if that is preferable.

@carlescufi
Copy link
Member

Thanks for the PR @joerchan. This is quite a big change, so I'd like to ask you, what kind of regression tests have you performed on it?

I have run the EDTT test suite locally, and reproduced and verified the issues I'm fixing.
The change might look big, but most of the changes in gatt is just moving of code.

Sounds good, would you mind also running the throughput sample of nRF Connect SDK, as well as any others that might be applicable? I know that some of the samples there put the stack under heavy stress.

The PR can be split into smaller pieces, if that is preferable.

I don't see any particular reason to do that in this case.

@carlescufi
Copy link
Member

@Vudentz we'd like to try and get this in for 2.5, please take a look.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this assumption that the response cannot be bigger than then a single buffer is incorrect for EATT, EATT can in fact have channels which MTU is bigger then MPS which is why net_buf_append_bytes was used so it can allocate multiple buffers if necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Vudentz As far as I can tell the buf allocated is sized according to the MTU, not the MPS, and when sending the ATT PDU then L2CAP will handle splitting the MTU up into MPS sized fragment if needed. To me this looks like something that is handled properly and does not include GATT/ATT.

The reason why I removed the net_buf_append_bytes was because allocate_cb argument was set to NULL, which means that if this call would have had to allocate more buffers it would have crashed, because the allocate_cb is not checked for non-NULL.

And by not having the return value of net_buf_append_bytes the write_buf callback could be simplified since it does not have to handle failures at the moment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joerchan no it is not allocated according to the remote MTU, in EATT MTUs are not required to be symmetric, that would mean that one cannot take advantage of having bigger MTU because the code can only use a single buffer.

That said if net_buf_append_bytes does require a callback we should provide one, @asbjornsabo this might explain our crashes with EATT with some platforms as they might have bigger MTU then our buffer size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should take care of the allocation callback being NULL: #31493

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it wouldn't be a better idea to push the resend logic to gatt.c then, since now we would need to callbacks back to it to encode the request again it might be better to have everything there and perhaps have something like bt_att_resend to prepend the request on the queue, anyway we can make this change later as well if we find it necessary.

Copy link
Contributor

@Vudentz Vudentz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets just drop the commit that removes the use of net_buf_append_bytes then Im good with this set.

@joerchan joerchan added the bug The issue is a bug, or the PR is fixing a bug label Jan 25, 2021
@joerchan joerchan force-pushed the fix-indication-deadlock branch 2 times, most recently from 3b4b9c6 to 329c2e9 Compare February 3, 2021 13:17
@joerchan
Copy link
Contributor Author

joerchan commented Feb 3, 2021

@Vudentz I have put back the use of net_buf_append_bytes.

I also found another issue where the buf for the ATT request was not always freed.
On master branch this can be seen when the buffer is successfully allocated, but the storage for the ATT request is not successful and returns error. In this case an error code is returned without the buffer being freed.

In order to fix this issue I put released of the ATT request buffer into the bt_att_req_free function, and eliminated the ATT internal att_req_destroy function.

The changes since last view can bee looked at here:
https://github.com/zephyrproject-rtos/zephyr/compare/3b4b9c62882a3eb72d4972e822b9b0c9a0a2c571..329c2e9358ae77c0ff301c8305364e1c1d6f7ede

Regarding the use of net_buf_append_bytes, I still don't understand why this should be there, and I think maybe you misunderstood what I said?
The size of the buf is determined by our L2CAP TX MTU, you said "no it is not allocated according to the remote MTU".
I'm sure that the condition write != len will never happen because we check this before we allocate the buffer, and either allocate a write-request, or a prepare-write-req with the L2CAP TX MTU.

@joerchan joerchan force-pushed the fix-indication-deadlock branch from 329c2e9 to f0ef59c Compare February 4, 2021 08:25
@joerchan joerchan requested a review from Vudentz February 4, 2021 08:26
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SMP_FLAG_KEYS_DISTR is set from bt_smp_encrypt_change() so this depends on order of registering channels

since channels are registered via BT_L2CAP_CHANNEL_DEFINE macro order seems to depend on linker action

Copy link
Contributor Author

@joerchan joerchan Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into the finished zephyr.elf and this is the order.

<att_fixed_chan>:
 <le_fixed_chan>:
<smp_fixed_chan>:

I'm pretty sure that the order is alphabetical, because the we seem to rely on this behavior for the static GATT service declararions, e.g:

<_1_gatt_svc>:
<_2_gap_svc>:

The request for a higher security level is not done immediately in the att_encrypt_change handler, instead the att_encrypt_change handler retransmits the ATT request that trigger security escalation. And then on a new ATT error PDU with error indication higher security level required do we request the next security level. But this is happening before the key distribution is finished.

That being said I'm fine with taking this particular fix out and fixing it separately, because I'm fairly confident we have other cases where this would fail, for example ATT and L2CAP both initiating security at the same time, or with EATT on multiple enhanced ATT bearers at the same time.

This bug was only reproducible if both devices have the BT_SMP_ENFORCE_MITM option disabled (or the peer is acting in a similar way). This option is default on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough :-)

@joerchan
Copy link
Contributor Author

joerchan commented Feb 4, 2021

Sounds good, would you mind also running the throughput sample of nRF Connect SDK, as well as any others that might be applicable? I know that some of the samples there put the stack under heavy stress.

@carlescufi Done some smoke-tests with the throughput sample and the nrf desktop application, and didn't find any issues.

@Vudentz
Copy link
Contributor

Vudentz commented Feb 5, 2021

@Vudentz I have put back the use of net_buf_append_bytes.

I also found another issue where the buf for the ATT request was not always freed.
On master branch this can be seen when the buffer is successfully allocated, but the storage for the ATT request is not successful and returns error. In this case an error code is returned without the buffer being freed.

In order to fix this issue I put released of the ATT request buffer into the bt_att_req_free function, and eliminated the ATT internal att_req_destroy function.

The changes since last view can bee looked at here:
https://github.com/zephyrproject-rtos/zephyr/compare/3b4b9c62882a3eb72d4972e822b9b0c9a0a2c571..329c2e9358ae77c0ff301c8305364e1c1d6f7ede

Regarding the use of net_buf_append_bytes, I still don't understand why this should be there, and I think maybe you misunderstood what I said?
The size of the buf is determined by our L2CAP TX MTU, you said "no it is not allocated according to the remote MTU".
I'm sure that the condition write != len will never happen because we check this before we allocate the buffer, and either allocate a write-request, or a prepare-write-req with the L2CAP TX MTU.

The chan.tx.mtu doesn't need to match by our L2CAP TX MTU, that is precisely why net_buf_append_bytes is used, though it looks like we could use net_buf_append_bytes in other places like:
https://github.com/zephyrproject-rtos/zephyr/blob/master/subsys/bluetooth/host/att.c#L1064

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't quite understand why are you skipping the offset? The original code used params->data directly, the offset is meant for the server where to start writing not for the client.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why I changed this code was because data pointer should not be modified, otherwise the encode function cannot properly re-encode it when resending.
But you are right, my change would not work properly in the case where the application has set an offset.

I have fixed this now, the data pointer is manipulated, as well as the length and offset like it was in the original code, except it is now done once the prepare_write_rsp has been received.

There was also an off-by one error, and the EDTT tests did not catch this, I have tested it manually with the shell now.
Also added a fix to the shell for when the hex convert failed.

Copy link
Contributor

@Vudentz Vudentz Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, we could in theory just encode when the request is actually about to be sent, meaning we don't have to allocate a buffer instead att.c would allocate one just as it does when resending and call write_buf, actually perhaps we should use the term encode instead of write_buf.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the name to 'encode'. However the change to just encode once sent I think is best to leave until after the release.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might be better off chaning bt_att_req_alloc to take all these parameters intead of having a wrapper on gatt.c that just initializes the structure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why that wasn't possible at the moment was the bt_gatt_indicate. In order to make indications fit we will have to add a stack internal variable in the indication params to hold the resolved handle, since finding the handle is a rather complicated piece of code and should be kept out of the encode function IMO.

joerchan and others added 9 commits February 9, 2021 20:24
Fix gatt write command returned "write in progress" when either
hex2bin or bt_gatt_write returned an error.
The write_params.func should not be set if the write command was not
successful.

Signed-off-by: Joakim Andersson <[email protected]>
Fix the update keys check allowing to overwrite the keys when using
debug keys. Instead the check disallowed overwriting keys made using
debug keys.

Signed-off-by: Joakim Andersson <[email protected]>
When ATT resends an ATT request it is sent as a "response" instead of
as a request. This causes the ATT request buffer to be released and
the ATT request cannot be resent one more time.

This causes a problem when the ATT request requires authentication
but the elevation of security is not enforcing MITM protection.
In this case the ATT will first require security level 2 and then resend
the request once this has been reached.
This will lead to a new ATT error response and ATT will require security
level L3.

Signed-off-by: Joakim Andersson <[email protected]>
Allow to request a higher security level during the key distribution
phase.

This is required by ATT and L2CAP since they only react to the encrypt
change event where they resend the current request.
The current request might require a higher security level still and
might have to request a higher security level before the pairing
procedure has been finished.

Signed-off-by: Joakim Andersson <[email protected]>
Refactor SMP to have a conn pointer where this pointer is used
multiple times.

Signed-off-by: Joakim Andersson <[email protected]>
ATT channels do support queueing buffer so it no longer need to block
waiting the tx_sem besides the buffer allocation already serves the
same purpose as the application will not be able to have more requests
than there are buffers available.

Signed-off-by: Luiz Augusto von Dentz <[email protected]>
Fix indicate without func not working properly, when sent as a
non-req by GATT this has two propblems:
 - The indicate would not be treated as a transaction, and back
   to back indicate would be sent without waiting for the confirm
 - The destroy callback would not be called on the indicate parameters
   since the indicate_rsp callback would not be called.

Signed-off-by: Joakim Andersson <[email protected]>
Remove the ATT request destroy callback which is never assigned
by any of the ATT requests.

Signed-off-by: Joakim Andersson <[email protected]>
The ATT request buffers are held until the ATT response has been
received. This means that the ATT request buffers are released by the
RX thread, instead of the from the RX priority context of
num_complete.
This can cause a deadlock in the RX thread when we allocate buffers
and all the available buffers are ATT requests, since the RX thread is
the only thread that can release buffers.

Release the ATT request buffers once they have been sent and instead
handle ATT request resending by reconstructing the buffer from the
GATT parameters.

Also re-order the order of resource allocation by allocating the
request context before the buffer. This ensures that we cannot
allocate more buffers for ATT requests than there are ATT requests.

Fixed a buf reference leak that could occur when the ATT request buffer
has been allocated, but GATT returns an error before handing the
responsebility of the buffer to ATT, for example when bt_att_req_alloc
fails.
This is fixed by moving the functionality of att_req_destroy to
bt_att_req_free.

Signed-off-by: Joakim Andersson <[email protected]>
@joerchan joerchan force-pushed the fix-indication-deadlock branch from f0ef59c to 21131e2 Compare February 9, 2021 19:31
@joerchan joerchan requested review from Vudentz and sjanc February 9, 2021 22:22
Copy link
Contributor

@Vudentz Vudentz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nashif nashif merged commit 10841b9 into zephyrproject-rtos:master Feb 11, 2021
@joerchan joerchan deleted the fix-indication-deadlock branch February 11, 2021 18:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: Bluetooth Host Bluetooth Host (excluding BR/EDR) area: Bluetooth area: Tests Issues related to a particular existing or missing test bug The issue is a bug, or the PR is fixing a bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bluetooth: Characteristic unsubscribe under indication load results in ATT timeout

6 participants