Skip to content

Conversation

@nkomonen-amazon
Copy link
Contributor

Problem:

The Identity team noticed a large spike in token refreshes for specific users. One user would trigger refresh over 50 times within a few seconds.

Ticket: P180886632

Solution:

The telemetry showed that getChatAuthState() was being called many times in a short period. This eventually triggered the token refresh logic many times, if the token was expired.

The solution is to add a debounce to getToken() which calls the refresh logic.

  • debounce() only accepts functions without any args, the refresh logic requires args
  • getToken() will also load from disk is the token is not expired, so debouncing here saves disk I/O as well.

The current debounce interval is 100 milliseconds, which based on telemetry should be enough to capture the barrage of calls. With some manual testing it does not feel like UX is impacted in any noticeable way.


  • Treat all work as PUBLIC. Private feature/x branches will not be squash-merged at release time.
  • Your code changes must meet the guidelines in CONTRIBUTING.md.
  • License: I confirm that my contribution is made under the terms of the Apache 2.0 license.

@nkomonen-amazon nkomonen-amazon requested a review from a team as a code owner December 20, 2024 17:15
@github-actions
Copy link

  • This pull request modifies code in src/* but no tests were added/updated.
    • Confirm whether tests should be added or ensure the PR description explains why tests are not required.
  • This pull request implements a feat or fix, so it must include a changelog entry (unless the fix is for an unreleased feature). Review the changelog guidelines.
    • Note: beta or "experiment" features that have active users should announce fixes in the changelog.
    • If this is not a feature or fix, use an appropriate type from the title guidelines. For example, telemetry-only changes should use the telemetry type.

Problem:

The Identity team noticed a large spike in token refreshes for specific users.
One user would trigger refresh over 50 times within a few seconds.

Solution:

The telemetry showed that `getChatAuthState()` was being called many times in a
short period. This eventually triggered the token refresh logic many times, if the token was
expired.

The solution is to add a debounce to `getToken()` which calls the refresh logic.
- `debounce()` only accepts functions without any args, the refresh logic requires args
- `getToken()` will also load from disk is the token is not expired, so debouncing here
  saves disk I/O as well.

The current debounce interval is 100 milliseconds, which based on telemetry should be enough to capture the barrage of calls.

Signed-off-by: nkomonen-amazon <[email protected]>
Problem:

By default a sinon fake clock was installed on all tests. This caused
the new debounce functionality on getToken to freeze since the clock was
not progressed, multiple tests were failing.

Solution:

Only use the fake clock in tests that need it

Signed-off-by: nkomonen-amazon <[email protected]>
Signed-off-by: nkomonen-amazon <[email protected]>
- Removed unnecessary fake clock. I guess this existed for historical reasons but is
  not needed anymore
- Refactored one of the methods so that it could be stubbed

Signed-off-by: nkomonen-amazon <[email protected]>
@nkomonen-amazon
Copy link
Contributor Author

nkomonen-amazon commented Jan 8, 2025

/retryBuilds

We can now spy on the underlying getToken method to verify
it is being debounced as expected.

Also reduce the debounce interval to 50ms to reduce the delay but still
catch a barrage of calls

Signed-off-by: nkomonen-amazon <[email protected]>
@nkomonen-amazon
Copy link
Contributor Author

/retryBuilds

@nkomonen-amazon nkomonen-amazon merged commit 02f6d0b into aws:master Jan 9, 2025
26 checks passed
@nkomonen-amazon nkomonen-amazon deleted the refreshTokenSpam branch January 9, 2025 01:14
karanA-aws pushed a commit to karanA-aws/aws-toolkit-vscode that referenced this pull request Jan 17, 2025
## Problem:

The Identity team noticed a large spike in token refreshes for specific
users. One user would trigger refresh over 50 times within a few
seconds.

Ticket: `P180886632`

## Solution:

The telemetry showed that `getChatAuthState()` was being called many
times in a short period. This eventually triggered the token refresh
logic many times, if the token was expired.

The solution is to add a debounce to `getToken()` which calls the
refresh logic.
- `debounce()` only accepts functions without any args, the refresh
logic requires args
- `getToken()` will also load from disk is the token is not expired, so
debouncing here saves disk I/O as well.

The current debounce interval is 100 milliseconds, which based on
telemetry should be enough to capture the barrage of calls. With some
manual testing it does not feel like UX is impacted in any noticeable
way.

---

- Treat all work as PUBLIC. Private `feature/x` branches will not be
squash-merged at release time.
- Your code changes must meet the guidelines in
[CONTRIBUTING.md](https://github.com/aws/aws-toolkit-vscode/blob/master/CONTRIBUTING.md#guidelines).
- License: I confirm that my contribution is made under the terms of the
Apache 2.0 license.

---------

Signed-off-by: nkomonen-amazon <[email protected]>
kevluu-aws pushed a commit to kevluu-aws/aws-toolkit-vscode that referenced this pull request Jan 23, 2025
## Problem:

The Identity team noticed a large spike in token refreshes for specific
users. One user would trigger refresh over 50 times within a few
seconds.

Ticket: `P180886632`

## Solution:

The telemetry showed that `getChatAuthState()` was being called many
times in a short period. This eventually triggered the token refresh
logic many times, if the token was expired.

The solution is to add a debounce to `getToken()` which calls the
refresh logic.
- `debounce()` only accepts functions without any args, the refresh
logic requires args
- `getToken()` will also load from disk is the token is not expired, so
debouncing here saves disk I/O as well.

The current debounce interval is 100 milliseconds, which based on
telemetry should be enough to capture the barrage of calls. With some
manual testing it does not feel like UX is impacted in any noticeable
way.

---

- Treat all work as PUBLIC. Private `feature/x` branches will not be
squash-merged at release time.
- Your code changes must meet the guidelines in
[CONTRIBUTING.md](https://github.com/aws/aws-toolkit-vscode/blob/master/CONTRIBUTING.md#guidelines).
- License: I confirm that my contribution is made under the terms of the
Apache 2.0 license.

---------

Signed-off-by: nkomonen-amazon <[email protected]>
nkomonen-amazon added a commit that referenced this pull request Feb 7, 2025
## Problem:

getChatAuthState() is called in many places by the Q features
simultaneously,
this eventually triggers multiple calls to getToken() and if needed
refreshToken().

This resulted in refreshToken being spammed and the Identity team seeing
spikes in token refreshes
from clients.

## Solution:

Throttle getChatAuthState().

Throttling w/ leading: true, allows us to instantly return
a fresh result OR a cached result in the case we are throttled. Debounce
on the
other hand would cause callers to hang since they have to wait for
debounce to timeout.

Also, we put a debounce on getToken() before in #6282 but this did not
work since a new
SsoAccessToken instance is created each time the offending code flow
triggered (we could
look to cache the instance instead which would enable the getToken()
debounce to be useful.

### Testing

To test the difference after adding the throttle:
- Add log statements to `getToken()`
- Set an expired date in the SSO cache for both token expiration +
client registration expiration
- Use chat

What would happen is that without throttle it would trigger
getChatAuthState() many times, likely due to the connection
becoming invalid and sending an event to all Q features, causing each of
them to call getChatAuthState() at the same time.

But when the throttle was added, the amount of these calls dropped to at
most 2.

Signed-off-by: nkomonen-amazon <[email protected]>


---

- Treat all work as PUBLIC. Private `feature/x` branches will not be
squash-merged at release time.
- Your code changes must meet the guidelines in
[CONTRIBUTING.md](https://github.com/aws/aws-toolkit-vscode/blob/master/CONTRIBUTING.md#guidelines).
- License: I confirm that my contribution is made under the terms of the
Apache 2.0 license.

---------

Signed-off-by: nkomonen-amazon <[email protected]>
s7ab059789 pushed a commit to s7ab059789/aws-toolkit-vscode that referenced this pull request Feb 19, 2025
## Problem:

The Identity team noticed a large spike in token refreshes for specific
users. One user would trigger refresh over 50 times within a few
seconds.

Ticket: `P180886632`

## Solution:

The telemetry showed that `getChatAuthState()` was being called many
times in a short period. This eventually triggered the token refresh
logic many times, if the token was expired.

The solution is to add a debounce to `getToken()` which calls the
refresh logic.
- `debounce()` only accepts functions without any args, the refresh
logic requires args
- `getToken()` will also load from disk is the token is not expired, so
debouncing here saves disk I/O as well.

The current debounce interval is 100 milliseconds, which based on
telemetry should be enough to capture the barrage of calls. With some
manual testing it does not feel like UX is impacted in any noticeable
way.

---

- Treat all work as PUBLIC. Private `feature/x` branches will not be
squash-merged at release time.
- Your code changes must meet the guidelines in
[CONTRIBUTING.md](https://github.com/aws/aws-toolkit-vscode/blob/master/CONTRIBUTING.md#guidelines).
- License: I confirm that my contribution is made under the terms of the
Apache 2.0 license.

---------

Signed-off-by: nkomonen-amazon <[email protected]>
s7ab059789 pushed a commit to s7ab059789/aws-toolkit-vscode that referenced this pull request Feb 19, 2025
## Problem:

getChatAuthState() is called in many places by the Q features
simultaneously,
this eventually triggers multiple calls to getToken() and if needed
refreshToken().

This resulted in refreshToken being spammed and the Identity team seeing
spikes in token refreshes
from clients.

## Solution:

Throttle getChatAuthState().

Throttling w/ leading: true, allows us to instantly return
a fresh result OR a cached result in the case we are throttled. Debounce
on the
other hand would cause callers to hang since they have to wait for
debounce to timeout.

Also, we put a debounce on getToken() before in aws#6282 but this did not
work since a new
SsoAccessToken instance is created each time the offending code flow
triggered (we could
look to cache the instance instead which would enable the getToken()
debounce to be useful.

### Testing

To test the difference after adding the throttle:
- Add log statements to `getToken()`
- Set an expired date in the SSO cache for both token expiration +
client registration expiration
- Use chat

What would happen is that without throttle it would trigger
getChatAuthState() many times, likely due to the connection
becoming invalid and sending an event to all Q features, causing each of
them to call getChatAuthState() at the same time.

But when the throttle was added, the amount of these calls dropped to at
most 2.

Signed-off-by: nkomonen-amazon <[email protected]>


---

- Treat all work as PUBLIC. Private `feature/x` branches will not be
squash-merged at release time.
- Your code changes must meet the guidelines in
[CONTRIBUTING.md](https://github.com/aws/aws-toolkit-vscode/blob/master/CONTRIBUTING.md#guidelines).
- License: I confirm that my contribution is made under the terms of the
Apache 2.0 license.

---------

Signed-off-by: nkomonen-amazon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants