Skip to content

Conversation

@JordanYates
Copy link
Contributor

Add an error parameter to the discovery callback so that users can differentiate between an error occurring in the discovery process and the attribute simply not existing.

Currently both situations are provided to the user as attr == NULL.

@JordanYates JordanYates changed the title 241115 gatt disc err bluetooth: gatt: add err param to discover cb Nov 15, 2024
@JordanYates
Copy link
Contributor Author

I recognize this needs to go through the stable API change process, looking for an initial indication that the change is acceptable before sending through to the dev list.
Release notes and migration guide will be added after 4.0 release.

@Thalley
Copy link
Contributor

Thalley commented Nov 15, 2024

In theory the disconnect error could be detected, but I have not seen a single instance in tree that treats attr == NULL as anything other than "Discovery complete, move on to the next step". Checking the state of the Bluetooth connection does not appear to be an obvious step based on this.

But that's begs the question: What is the use case for this?

If no-one is checking for the connection state, they do care?
Those that may care can check the state.

In theory, every single goto done in functions touched in the first commit is an error that cannot be detected through the state of the Bluetooth connection. I can count at least 20 instances of those.

So there are a lot of different errors is what you are saying?
I think the API has worked so far because from the application perspective, the discovery procedure either found what it was looking for, or it didn't, and in the case that it didn't the application could not really do anything besides just move on. In the case that we receive invalid data, as @alwa-nordic suggested, then we probably should just disconnect.

So the way I see it there are 3 overall cases:

  1. Attribute found and attr != NULL and everyone is happy
  2. Attribute is not found and attr == NULL
  3. Invalid response from remote => We should just disconnect.

In cases 2) and 3) an application cannot do anything anyways. That's why I am arguing that we need to have a use case for the err value for this change. From a Zephyr contributor perspective I think it's a good idea to always provide information to the upper layers, but as a Zephyr GATT user I have yet to understand a use case to warrant a breaking API change

@JordanYates
Copy link
Contributor Author

But that's begs the question: What is the use case for this?

I have a wrapper around bt_conn_create that also performs MTU updates and characteristic discovery. Trying to continue discovery if the existing discovery has failed because of a disconnect or protocol error is pointless. I'm surprised that trying to determine "did my discovery succeed or fail" is controversial.

So there are a lot of different errors is what you are saying?

I'm saying there are a lot of code paths that can result in errors.

the discovery procedure either found what it was looking for, or it didn't, and in the case that it didn't the application could not really do anything besides just move on.

The correct action to take to "move on" depends on why it failed, which cannot currently be determined.

In cases 2) and 3) an application cannot do anything anyways

I disagree, in situation 2) there are many things the application can do, depending on whether or not the characteristic you were searching for is a requirement or just a nice to have.

That's why I am arguing that we need to have a use case for the err value for this change.

My discovery process has failed, do I as a user need to call bt_disconnect?

@JordanYates
Copy link
Contributor Author

To preserve backwards compatibility we cannot add a new parameter to bt_gatt_discover_func_t.

I am confused, what is the purpose of the API lifecycle then?

Update all instances of `bt_gatt_discover_func_t` in the tree.

Signed-off-by: Jordan Yates <[email protected]>
@Thalley
Copy link
Contributor

Thalley commented Nov 15, 2024

My discovery process has failed, do I as a user need to call bt_disconnect?

If we change it so that the stack will initiate a ATT disconnect in case of invalid GATT values, then we are left with 2 cases:

  1. Either the attribute is not found and attr == NULL
  2. The connection has failed

Both are currently possible to determine without any API changes.

However if we do not decide to disconnect in case of invalid response from the GATT server, then I agree that there's a use case for the err parameter, so the application can make that decision.

@alwa-nordic
Copy link
Contributor

alwa-nordic commented Nov 15, 2024

To preserve backwards compatibility we cannot add a new parameter to bt_gatt_discover_func_t.

I am confused, what is the purpose of the API lifecycle then?

You are right. I did not mean we can't do it, just that we shouldn't. I think it's very feasible to do this change with a smooth transition.

@alwa-nordic
Copy link
Contributor

alwa-nordic commented Nov 15, 2024

But that's begs the question: What is the use case for this?

It is not useful if the application just needs to know when "the current connection has this specific service" and react to that.

It is necessary when the application needs to know that "the remote definitively did not have this specific service". For example if this is cached.

memset(fixture, 0, sizeof(*fixture));

err = bt_bap_unicast_server_register(&param);
(void)bt_bap_unicast_server_register(&param);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't the correct solution. If you want to change this in this PR (IMO should be a separate PR), then it should be something like zassert_equal(bt_bap_unicast_server_register(&param), 0); or zassert_equal(err, 0);`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only made the change to pass CI, since this is currently failing on main I assumed it would be properly fixed for the release.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the CI error? Unused variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I'm having a look at how to properly fix this, but it seems non-trivial. Will let you know when I have a PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the non-trivial and proper fix: #81515

@JordanYates
Copy link
Contributor Author

JordanYates commented Nov 15, 2024

We could make bt_gatt_discover_params.func a union of bt_gatt_discover_func_t and a new type and add a bit somewhere to select between them. We deprecate bt_gatt_discover_func_t at the same time.

You are right. I did not mean we can't do it, just that we shouldn't. I think it's very feasible to do this change with a smooth transition.

To me this is just a breaking API change with extra steps. At some stage you are going to hit the definition of "A breaking API change is defined as one that forces users to modify their existing code in order to maintain the current behavior of their application". Forcing a change now vs later is the same thing from the project perspective, and it turns one PR into multiple PR's with additional implementation complexities. Unless there is some other advantage I am missing.

@alwa-nordic
Copy link
Contributor

alwa-nordic commented Nov 16, 2024

We could make bt_gatt_discover_params.func a union of bt_gatt_discover_func_t and a new type and add a bit somewhere to select between them. We deprecate bt_gatt_discover_func_t at the same time.

You are right. I did not mean we can't do it, just that we shouldn't. I think it's very feasible to do this change with a smooth transition.

To me this is just a breaking API change with extra steps. At some stage you are going to hit the definition of "A breaking API change is defined as one that forces users to modify their existing code in order to maintain the current behavior of their application". Forcing a change now vs later is the same thing from the project perspective, and it turns one PR into multiple PR's with additional implementation complexities. Unless there is some other advantage I am missing.

There are definitely benefits for a large project:

It turns some errors into warnings, which allows a downstream project chug along as normal and plan better. The downstream project can easily upgrade to a new Zephyr version, get the new features and bug-fixes and make a plan to upgrade each individual use site of the deprecated APIs.

It makes it easy for downstream users to run their CI with both their adopted version of Zephyr and the next to get advance warning about deprecation and find regressions. There is a smooth upgrade path by using only APIs that exists in both versions, so there is no ifdef-version-soup around the breakage. For this the deprecated "A" and new API "B" have to coexist for at least two releases, so that downstream can upgrade from API A to API B at a point when their CI is testing two Zephyr versions both with both A and B.

It makes it so that upgrading Zephyr is not a "stop the world" period, where merges may have to be stopped for some time to allow someone to upgrade and fix all breakage without several rounds of "Oh. A new use of the old API just got merged, you have to fix that too.", or worse, a broken main branch for some period. For large projects people stepping on each other's feet is not unlikely.

@Thalley
Copy link
Contributor

Thalley commented Dec 12, 2024

My discovery process has failed, do I as a user need to call bt_disconnect?

If we change it so that the stack will initiate a ATT disconnect in case of invalid GATT values, then we are left with 2 cases:

1. Either the attribute is not found and `attr == NULL`

2. The connection has failed

Both are currently possible to determine without any API changes.

However if we do not decide to disconnect in case of invalid response from the GATT server, then I agree that there's a use case for the err parameter, so the application can make that decision.

@JordanYates Do you have a response for this comment?

If we disconnect the ATT bearer on receiving invalid ATT PDUs, then we have ways to handle the other scenarios with the existing API

@JordanYates
Copy link
Contributor Author

My discovery process has failed, do I as a user need to call bt_disconnect?

If we change it so that the stack will initiate a ATT disconnect in case of invalid GATT values, then we are left with 2 cases:

1. Either the attribute is not found and `attr == NULL`

2. The connection has failed

Both are currently possible to determine without any API changes.
However if we do not decide to disconnect in case of invalid response from the GATT server, then I agree that there's a use case for the err parameter, so the application can make that decision.

@JordanYates Do you have a response for this comment?

If we disconnect the ATT bearer on receiving invalid ATT PDUs, then we have ways to handle the other scenarios with the existing API

Sounds reasonable, I am not able to make that change though.

@Thalley
Copy link
Contributor

Thalley commented Dec 13, 2024

My discovery process has failed, do I as a user need to call bt_disconnect?

If we change it so that the stack will initiate a ATT disconnect in case of invalid GATT values, then we are left with 2 cases:

1. Either the attribute is not found and `attr == NULL`

2. The connection has failed

Both are currently possible to determine without any API changes.
However if we do not decide to disconnect in case of invalid response from the GATT server, then I agree that there's a use case for the err parameter, so the application can make that decision.

@JordanYates Do you have a response for this comment?
If we disconnect the ATT bearer on receiving invalid ATT PDUs, then we have ways to handle the other scenarios with the existing API

Sounds reasonable, I am not able to make that change though.

Right, so we are missing value check on responses (e.g. if we are discovering from handle 100 to 200 and the response contains a handle 50, then we assume the remote device or the link is broken.

In that case we can either (as suggested) disconnect the ATT bearer and the ACL, or we can consider that similar to "attribute not found". Which makes the most sense? (@alwa-nordic please chime in here as well)

Once that is resolved, then we are down to either "Not found" or "disconnected", both of which is possible to handle without modifying the API.

@alwa-nordic
Copy link
Contributor

In that case we can either (as suggested) disconnect the ATT bearer and the ACL, or we can consider that similar to "attribute not found". Which makes the most sense? (@alwa-nordic please chime in here as well)

Disconnecting is fine. I don't think we can treat it as "not found", since the remote never said that. We can also be a little more robust. We can skip over handles that we did not request. For example, as long as we got at least one handle we were interested in, we can increment the handle range to discover and continue.

@Thalley
Copy link
Contributor

Thalley commented Dec 16, 2024

Disconnecting is fine. I don't think we can treat it as "not found", since the remote never said that. We can also be a little more robust. We can skip over handles that we did not request. For example, as long as we got at least one handle we were interested in, we can increment the handle range to discover and continue.

Aren't you disagreeing with yourself here? :D

Let's take a simple case where we send a discovery request from handle 100 to handle 200 for some specific type, and the server responds with handles [50, 150].

You seem to both suggest that we disconnect given the invalid value 50, both also that we just ignore that and only provide 150 to the upper layers.

In the case that the server only respond with 50, then we cannot provide anything to the upper layers.

So the way I see it, we have the following options when discovering for handle 100 to 200:

  1. Server responds with [50, 150], and we disconnect due to invalid value 50. Server responds with
    [50] and we also disconnect.
  2. Server responds with [50, 150], and we ignore 50 and provide 150 to the upper layer due to invalid value 50. Server responds with [50] and we disconnect as there is nothing to provide to the upper layer.
  3. Server responds with [50, 150], and we ignore 50 and provide 150 to the upper layer due to invalid value 50. Server responds with [50] and we provide attr = NULL (not found) to the upper layer.

So we have the options to 1) always disconnect due to invalid values, 2) only disconnect if we only get invalid values or 3) treat invalid values are no values

@alwa-nordic
Copy link
Contributor

I've changed my mind. All three options are fine by me.

@Thalley
Copy link
Contributor

Thalley commented Dec 16, 2024

I've changed my mind. All three options are fine by me.

I'll check the core spec if it mentions what the expected behavior is :) A server receiving a request with invalid values shall just be rejected, but unsure if it mentions what happens if a client receives an invalid response

@Thalley
Copy link
Contributor

Thalley commented Dec 19, 2024

I've changed my mind. All three options are fine by me.

I'll check the core spec if it mentions what the expected behavior is :) A server receiving a request with invalid values shall just be rejected, but unsure if it mentions what happens if a client receives an invalid response

The core spec does not specify any defined behavior for the GATT client it if receives invalid responses. The cleanest way to deal with this is, IMO, simply to disconnect. If only a part of the response is valid, we cannot assume that the other part is valid.

@Thalley
Copy link
Contributor

Thalley commented Dec 19, 2024

@jhedberg @alwa-nordic (cc @Vudentz )

I'm not 100% which direction to take here, so would like some input from you.

Current we have:

  1. If we receive an ATT response without having sent a request, we simply ignore it
  2. If we receive an ATT response with invalid format (e.g. incorrect length), we call the application callback with attr == NULL
  3. If we receive an ATT response with invalid handles (currently we only check for handle == 0), then we parse all handles up to the invalid handle (e.g. if we get [50, 100, 0], then the application callback is called for attr with handle 50 and 100, but for 0 we call it with attr == NULL and stop parsing

This seems to have been the behavior of ATT and GATT ever since it was added to Zephyr.

I suggest one of the 2 following solutions:
Solution 1: Strict

  1. If we receive an ATT response without having sent a request, we disconnect
  2. If we receive an ATT response with invalid format (e.g. incorrect length), we disconnect
  3. If we receive an ATT response with invalid handles (0, or handles not in the range of the request), we disconnect

Overall, we just disconnect with any invalid response.

Solution 2: Relaxed

  1. If we receive an ATT response without having sent a request, we simply ignore it
  2. If we receive an ATT response with invalid format (e.g. incorrect length), we call the application callback with attr == NULL
  3. If we receive an ATT response with invalid handles (currently we only check for handle == 0), then we parse all handles up to the invalid handles (0, or handles not in the range of the request), we call the application callback with attr == NULL without calling it with any valid handles in the response

Overall, we ignore what we can and always provide application a callback with attr == NULL on invalid responses.

@jhedberg
Copy link
Member

@Thalley my initial feeling is that not disconnecting may result to better interoperability (or rather, user expereience) in case of buggy devices, in which case I'd be in favor of solution 2.

@Thalley
Copy link
Contributor

Thalley commented Dec 21, 2024

@Thalley my initial feeling is that not disconnecting may result to better interoperability (or rather, user expereience) in case of buggy devices, in which case I'd be in favor of solution 2.

I was thinking the same, but then we cannot fix the issue raised by @JordanYates - We will just treat invalid responses similar to "Not found".

My concern is that if the server does provide invalid responses, and we treat them as "Not found" (during discovery), then how much can we trust the server at all? If we discover from handle 100 to 200 and we get back a characteristic at handle 50, can we trust that a read of handle e.g. 150 then the response is really the value of handle 150?

As mentioned the core spec does not specify any behavior here, so we have to either treat invalid responses as "Not found" or errors, and hope that future requests have valid responses, or we treat the entire remote GATT server (or the link) as invalid and disconnect.

@github-actions
Copy link

This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time.

@github-actions github-actions bot added the Stale label Feb 20, 2025
@github-actions github-actions bot closed this Mar 6, 2025
@github-project-automation github-project-automation bot moved this from In Review to Done in Bluetooth LE Audio Mar 6, 2025
@Thalley
Copy link
Contributor

Thalley commented Mar 6, 2025

I have not forgotten this btw, and will provide a PR for disconnecting ACL/GATT if we receive invalid responses (possibly guarded by a Kconfig)

@JordanYates
Copy link
Contributor Author

I have not forgotten this btw, and will provide a PR for disconnecting ACL/GATT if we receive invalid responses (possibly guarded by a Kconfig)

Having attempted to drop these commits in my upgrade to Zephyr v4.2, I'm not convinced that actually solves my problem. Handling invalid responses is not the problem, the problem is the callback doesn't know whether discovery failed or whether the attribute doesn't exist. Discovery can fail for reasons other than an invalid response (the remote disconnecting) and unless you are planning to remove the callback being run on a disconnected connection (breaking the current model), it doesn't help the callback determine whether to continue or not.

@Thalley
Copy link
Contributor

Thalley commented Sep 12, 2025

Discovery can fail for reasons other than an invalid response (the remote disconnecting) and unless you are planning to remove the callback being run on a disconnected connection (breaking the current model), it doesn't help the callback determine whether to continue or not.

In the case of a disconnect, then you can (and possibly should?) check the connection state of the provided bt_conn. As pointed out by my previous comment (#81432 (comment)), discover can only "fail" in 3 ways:

  1. Invalid values (seems we decided to drop the connection on this, so this will never reach the application callback)
  2. Attribute not found (attr == NULL)
  3. Disconnect (either triggered from 1), or something else) - Verifiable via the connection state

Do you think we are still missing a case here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants