fix(auth): debounce getToken() function #6282

nkomonen-amazon · 2024-12-20T17:15:08Z

Problem:

The Identity team noticed a large spike in token refreshes for specific users. One user would trigger refresh over 50 times within a few seconds.

Ticket: P180886632

Solution:

The telemetry showed that getChatAuthState() was being called many times in a short period. If the token was expired, these calls eventually triggered the token refresh logic many times.

The solution is to add a debounce to getToken(), which calls the refresh logic.

- debounce() only accepts functions without any args, and the refresh logic requires args, so the debounce is applied to getToken() instead.
- getToken() will also load from disk if the token is not expired, so debouncing here saves disk I/O as well.

The current debounce interval is 100 milliseconds, which, based on telemetry, should be enough to capture the barrage of calls. In manual testing, the UX did not feel impacted in any noticeable way.
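To make the intended behavior concrete, here is a minimal sketch of a no-argument debounce wrapped around a token getter. It is not the toolkit's actual `debounce()` implementation: the helper below, `loadOrRefreshToken`, and the token strings are hypothetical, and the sketch only illustrates how calls made within the interval could collapse into one underlying call that shares a single promise.

```typescript
// Hypothetical sketch: a debounce for functions that take no arguments.
// All calls made before the timer fires share one promise and one underlying call.
function debounce<T>(fn: () => T | Promise<T>, delayMs: number): () => Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    let pending: Promise<T> | undefined;
    let settle: { resolve: (value: T) => void; reject: (reason: unknown) => void } | undefined;

    return () => {
        // Each new call restarts the wait, so a burst is handled once it goes quiet.
        if (timer !== undefined) {
            clearTimeout(timer);
        }
        if (pending === undefined) {
            pending = new Promise<T>((resolve, reject) => {
                settle = { resolve, reject };
            });
        }
        const current = settle!;
        timer = setTimeout(async () => {
            // Reset state first so calls arriving while fn() runs start a fresh batch.
            timer = undefined;
            pending = undefined;
            settle = undefined;
            try {
                current.resolve(await fn());
            } catch (err) {
                current.reject(err);
            }
        }, delayMs);
        return pending;
    };
}

// Hypothetical stand-in for getToken(): counts how often the expensive path runs.
let underlyingCalls = 0;
async function loadOrRefreshToken(): Promise<string> {
    underlyingCalls += 1;
    return `token-${underlyingCalls}`;
}

const getToken = debounce(loadOrRefreshToken, 100);

// A burst of 50 calls within the interval results in a single underlying call.
const burst = Array.from({ length: 50 }, () => getToken());
void Promise.all(burst).then((tokens) => {
    const allSame = tokens.every((token) => token === tokens[0]);
    console.log(underlyingCalls, allSame); // logs: 1 true
});
```

Note that in this sketch every caller waits at least the debounce interval before receiving a token, which is the trade-off the interval choice has to balance.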

Treat all work as PUBLIC. Private feature/x branches will not be squash-merged at release time.
Your code changes must meet the guidelines in CONTRIBUTING.md.
License: I confirm that my contribution is made under the terms of the Apache 2.0 license.

github-actions · 2024-12-20T17:15:23Z

This pull request modifies code in src/* but no tests were added/updated.
- Confirm whether tests should be added or ensure the PR description explains why tests are not required.
This pull request implements a feat or fix, so it must include a changelog entry (unless the fix is for an unreleased feature). Review the changelog guidelines.
- Note: beta or "experiment" features that have active users should announce fixes in the changelog.
- If this is not a feature or fix, use an appropriate type from the title guidelines. For example, telemetry-only changes should use the telemetry type.

Problem: By default, a sinon fake clock was installed on all tests. Because the clock was never advanced, the new debounce on getToken() froze and multiple tests failed.

Solution: Only use the fake clock in tests that need it.

Signed-off-by: nkomonen-amazon <[email protected]>
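As an illustration of why a fake clock that is never advanced stalls a debounced call, consider the sketch below. `FakeClock` and `debounceOn` are hypothetical stand-ins written for this example, not sinon's or the toolkit's code. The only behavior assumed from sinon is that a fake clock fires timers only when it is explicitly advanced (sinon's clock does this via `tick()`).

```typescript
// Hypothetical minimal fake clock: timers fire only when tick() is called.
class FakeClock {
    private now = 0;
    private nextId = 1;
    private timers: { id: number; at: number; cb: () => void }[] = [];

    setTimeout(cb: () => void, ms: number): number {
        const id = this.nextId++;
        this.timers.push({ id, at: this.now + ms, cb });
        return id;
    }

    clearTimeout(id: number): void {
        this.timers = this.timers.filter((timer) => timer.id !== id);
    }

    // Advance virtual time and run every timer that has become due.
    tick(ms: number): void {
        this.now += ms;
        const due = this.timers.filter((timer) => timer.at <= this.now);
        this.timers = this.timers.filter((timer) => timer.at > this.now);
        due.forEach((timer) => timer.cb());
    }
}

// Hypothetical debounce that schedules its timer on the supplied clock.
function debounceOn(clock: FakeClock, fn: () => void, delayMs: number): () => void {
    let timer: number | undefined;
    return () => {
        if (timer !== undefined) {
            clock.clearTimeout(timer);
        }
        timer = clock.setTimeout(() => {
            timer = undefined;
            fn();
        }, delayMs);
    };
}

const clock = new FakeClock();
let refreshes = 0;
const getTokenDebounced = debounceOn(clock, () => {
    refreshes += 1;
}, 100);

getTokenDebounced();
getTokenDebounced();
console.log(refreshes); // 0: the timer is queued, but the clock never moved

clock.tick(100);
console.log(refreshes); // 1: advancing the clock releases the single debounced call
```

A test that awaits the debounced result without ever advancing such a clock would wait indefinitely, which matches the freeze described in the commit message.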

Signed-off-by: nkomonen-amazon <[email protected]>

- Removed unnecessary fake clock. I guess this existed for historical reasons but is not needed anymore
- Refactored one of the methods so that it could be stubbed

Signed-off-by: nkomonen-amazon <[email protected]>

nkomonen-amazon · 2025-01-08T18:15:35Z

/retryBuilds

We can now spy on the underlying getToken method to verify it is being debounced as expected. Also reduce the debounce interval to 50ms to reduce the delay but still catch a barrage of calls.

Signed-off-by: nkomonen-amazon <[email protected]>

nkomonen-amazon · 2025-01-09T00:40:26Z

/retryBuilds

## Problem:

getChatAuthState() is called in many places by the Q features simultaneously. This eventually triggers multiple calls to getToken() and, if needed, refreshToken(). As a result, refreshToken() was being spammed and the Identity team saw spikes in token refreshes from clients.

## Solution:

Throttle getChatAuthState(). Throttling with leading: true lets us instantly return a fresh result, or a cached result if we are throttled. Debounce, on the other hand, would cause callers to hang, since they have to wait for the debounce to time out.

Also, we put a debounce on getToken() before in #6282, but this did not work since a new SsoAccessToken instance is created each time the offending code flow is triggered. (We could look to cache the instance instead, which would enable the getToken() debounce to be useful.)

### Testing

To test the difference after adding the throttle:

- Add log statements to `getToken()`
- Set an expired date in the SSO cache for both token expiration + client registration expiration
- Use chat

Without the throttle, this would trigger getChatAuthState() many times, likely due to the connection becoming invalid and sending an event to all Q features, causing each of them to call getChatAuthState() at the same time. With the throttle added, the number of these calls dropped to at most 2.

Signed-off-by: nkomonen-amazon <[email protected]>
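The leading-edge throttle described above can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: `throttleLeading`, `computeChatAuthState`, the return shape, and the 500 ms interval are all hypothetical. Library throttles (lodash's, for example) can also schedule a trailing call, which is omitted here.

```typescript
// Hypothetical sketch of a leading-edge throttle with a cached result.
// `now` is injectable only to make the timing easy to follow; it defaults to Date.now.
function throttleLeading<T>(fn: () => T, intervalMs: number, now: () => number = Date.now): () => T {
    let lastRun: number | undefined;
    let cached: T | undefined;

    return () => {
        const current = now();
        if (lastRun === undefined || current - lastRun >= intervalMs) {
            // Leading edge: run immediately and remember the result.
            lastRun = current;
            cached = fn();
        }
        // Throttled callers get the most recent result without waiting.
        return cached as T;
    };
}

// Hypothetical stand-in for the real auth check; the return shape is made up.
let authChecks = 0;
async function computeChatAuthState(): Promise<{ connected: boolean }> {
    authChecks += 1;
    return { connected: true };
}

// The 500 ms interval is an arbitrary example value, not taken from the PR.
const getChatAuthState = throttleLeading(computeChatAuthState, 500);

// Several features asking at once share one promise from a single underlying call.
const states = [getChatAuthState(), getChatAuthState(), getChatAuthState()];
console.log(authChecks, states[0] === states[1]); // logs: 1 true
```

Unlike a debounce, the first caller in a burst is served immediately and later callers never wait on a timer. One caveat of this particular sketch is that a rejected promise would also be cached until the interval elapses.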

nkomonen-amazon requested a review from a team as a code owner December 20, 2024 17:15

hayemaxi approved these changes Dec 30, 2024

nkomonen-amazon added 3 commits January 7, 2025 15:24

add debounce sanity check test

6bcf269

Signed-off-by: nkomonen-amazon <[email protected]>

nkomonen-amazon force-pushed the refreshTokenSpam branch from ce410ea to 6bcf269 Compare January 7, 2025 23:55

fix another broken test

2efa3ec

refactor to spy on debounced functionality

5c56b23

nkomonen-amazon merged commit 02f6d0b into aws:master Jan 9, 2025
26 checks passed

nkomonen-amazon deleted the refreshTokenSpam branch January 9, 2025 01:14

nkomonen-amazon mentioned this pull request Jan 31, 2025

fix(auth): token refresh rapidly called unexpectedly #6479

Merged
