@nkomonen-amazon (Contributor) commented Mar 14, 2025

Problem:

With our telemetry, we do not know when the frontend webview UI has actually loaded.

The current process looks like the following:

  • We create a webview and set the HTML to load, but after that we have no formal way to detect whether the webview actually loaded the HTML/JS successfully. We only know that the process started (toolkit_willOpenModule).

Solution:

Emit certain metrics during the webview loading process to determine whether the webview UI successfully completed its initial load.

  • toolkit_willOpenModule indicates intent to render a webview. It does not mean the user is seeing anything.
  • toolkit_didLoadModule indicates the final result of loading the webview.
    • result: Succeeded is emitted when the frontend sends a success message to the backend. The frontend verifies that no errors occurred and that a certain HTML element can be found; once the page finishes its initial load, it sends the success message.
    • result: Failed is emitted when a timer expires after 10 seconds. We assume that, since there was no response from the frontend, it failed to fully execute the HTML/JS.
    • State is shared between toolkit_willOpenModule and toolkit_didLoadModule so that we can connect them through telemetry. This includes the traceId and the duration, which is the time between the two metrics.
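
The flow described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `createLoadTracker`, `TrackerDeps`, the `LoadTimeout` reason string, and the metric field shapes are all hypothetical, and the clock and timer are injected so the 10 second timeout can be exercised without waiting.

```typescript
// Hypothetical metric shapes; the real definitions live in the telemetry package.
type LoadMetric =
    | { name: 'toolkit_willOpenModule'; module: string; traceId: string }
    | {
          name: 'toolkit_didLoadModule'
          module: string
          traceId: string
          result: 'Succeeded' | 'Failed'
          duration: number
          reason?: string
      }

interface TrackerDeps {
    emit: (metric: LoadMetric) => void
    now: () => number
    // Schedules `fn` to run after `ms` milliseconds and returns a cancel function.
    schedule: (fn: () => void, ms: number) => () => void
}

const defaultLoadTimeoutMs = 10_000

function createLoadTracker(module: string, traceId: string, deps: TrackerDeps, timeoutMs = defaultLoadTimeoutMs) {
    let startedAt = 0
    let settled = false
    let cancelTimeout: (() => void) | undefined

    // Emits toolkit_didLoadModule exactly once; the first outcome wins.
    function settle(result: 'Succeeded' | 'Failed', reason?: string) {
        if (settled) {
            return
        }
        settled = true
        cancelTimeout?.()
        deps.emit({
            name: 'toolkit_didLoadModule',
            module,
            traceId,
            result,
            duration: deps.now() - startedAt,
            reason,
        })
    }

    return {
        // Call right before the webview HTML is set: records intent and arms the timer.
        willOpen() {
            startedAt = deps.now()
            deps.emit({ name: 'toolkit_willOpenModule', module, traceId })
            cancelTimeout = deps.schedule(() => settle('Failed', 'LoadTimeout'), timeoutMs)
        },
        // Call when the frontend's success message arrives.
        didLoad() {
            settle('Succeeded')
        },
    }
}
```

In an extension, `schedule` could be a thin wrapper around `setTimeout`/`clearTimeout`, `now` could be `Date.now`, and `emit` would forward to the real telemetry client.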

This PR only applies to the Login and Reauth pages for now. Future Vue webviews will need to implement some things on their end to get this functionality.

TODO

  • Generalize this solution in a more robust way for other webviews to easily implement this functionality

  • Treat all work as PUBLIC. Private feature/x branches will not be squash-merged at release time.
  • Your code changes must meet the guidelines in CONTRIBUTING.md.
  • License: I confirm that my contribution is made under the terms of the Apache 2.0 license.

@nkomonen-amazon requested review from a team as code owners March 14, 2025 17:46
@github-actions

  • This pull request modifies code in src/* but no tests were added/updated.
    • Confirm whether tests should be added or ensure the PR description explains why tests are not required.

@nkomonen-amazon marked this pull request as draft March 14, 2025 18:05
* This would be equivalent of the duration between "user clicked open q" and "ui has become available"
* NOTE: Amazon Q UI is only loaded ONCE. The state is saved between each hide/show of the webview.
*/
telemetry.webview_load.emit({
Contributor

Once this is done for all the webviews, will webview_load be deprecated in favor of the more granular metrics?

Contributor Author

Yup, webview_x will be deprecated in favor of something like toolkit_moduleX. So we'll drop webview_error as well.

reasonDesc: msg.errorMessage,
})
if (msg.event === 'toolkit_didLoadModule') {
telemetry.toolkit_didLoadModule.emit({
Contributor

Why does the webview_error metric exist? Shouldn't it in theory mirror the failures of toolkit_didLoadModule or are there non-error reasons a webview fails to load?

If we have a separate metric for error, shouldn't we emit it in this case since we still received an error from the webview?

Contributor Author

I'm thinking of dropping webview_error and creating something like toolkit_moduleError. This would capture any errors that happen after loading has finished; anything before that would be captured in toolkit_didLoadModule.

I have it as a TODO to deprecate webview_error. This will also allow us to deal with the different field names (module vs webviewName).

Contributor

> toolkit_moduleError. This will capture any errors that happen after loading has happened, anything before would be captured in toolkit_didLoadModule.

? errors should be part of all metrics. There should not be a separate "foo_error" metric.

private setupTelemetry() {
this.instance.traceId = randomUUID()
// Notify intent to open a module, this does not mean it successfully opened
telemetry.toolkit_willOpenModule.emit({
Contributor

Is there a case where we intend to open a module (emit willOpenModule) then don't also emit a didLoadModule with either fail or succeed?

Contributor Author

For now, every webview will emit a willOpenModule and didLoadModule will need to be explicitly done by each webview.

This is due to how loading a webview works. We can only indicate our intent to open a webview, but have no formal way to know when it has opened (we create a vscode webview instance and set a string of HTML, then have no insight into what happens after that).

So there will be cases where we only have willOpenModule and no trailing didLoadModule. This is essentially how we have it right now, but under different metric names.
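
To illustrate how the shared traceId lets the two metrics be connected, and how a willOpenModule with no trailing didLoadModule would show up, here is a minimal sketch of a downstream join. `MetricRecord`, `Outcome`, and `summarizeLoads` are hypothetical names for illustration only; nothing like this is part of the PR.

```typescript
// Hypothetical flattened record, as it might appear in a telemetry export.
interface MetricRecord {
    name: string
    traceId: string
    result?: 'Succeeded' | 'Failed'
}

type Outcome = 'Succeeded' | 'Failed' | 'Unknown'

// Joins will/did metrics by traceId. A trace that only has
// toolkit_willOpenModule stays 'Unknown': we know the intent to open,
// but nothing about whether the UI actually loaded.
function summarizeLoads(records: MetricRecord[]): Map<string, Outcome> {
    const outcomes = new Map<string, Outcome>()
    for (const record of records) {
        if (record.name === 'toolkit_willOpenModule' && !outcomes.has(record.traceId)) {
            outcomes.set(record.traceId, 'Unknown')
        }
    }
    for (const record of records) {
        if (record.name === 'toolkit_didLoadModule' && record.result && outcomes.has(record.traceId)) {
            outcomes.set(record.traceId, record.result)
        }
    }
    return outcomes
}
```

A webview that has not yet been customized to report its load would land in the 'Unknown' bucket, which matches the current behavior under the old metric names.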

Contributor

In most cases, did/will pairs should not be needed. Only the "did" case is needed for most metrics, because the metric will wrap the impl logic and track a duration. Why do we need both here?

Problem:

With our telemetry, we do not know when the frontend webview UI has actually loaded

Solution:

Emit certain metrics during the webview loading process to determine whether the
webview UI successfully completed its initial load.

- toolkit_willOpenModule indicates intent to render a webview
- toolkit_didLoadModule indicates the final result of loading the webview
  - On Success it is just a success result. We know it succeeded when the frontend sends a success
    message to the backend. The frontend verifies that there were no errors and that a certain HTML element
    can be found, then once the page finishes its initial load it sends the success message to the backend.
  - On Failure, a timer times out after 10 seconds.

Signed-off-by: nkomonen-amazon <[email protected]>
Setting .html starts the loading of the UI, but setup() sets up the message listeners in the backend
for messages from the UI.

We had setup() come after, and it worked, but if I added a small sleep() before setup() ran,
it would fail because messages were not handled: the handlers had not been set up in time.

As a solution, this moves the handler setup before we set the new UI.

Signed-off-by: nkomonen-amazon <[email protected]>
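
The ordering issue described in the commit above can be illustrated with a small stand-in. This is a sketch under stated assumptions: `createFakeWebview` is a hypothetical fake, not the VS Code API, and it exaggerates the race by having the "page" post its message synchronously as soon as the HTML is set.

```typescript
type MessageListener = (message: { event: string }) => void

// Minimal stand-in for a webview panel (hypothetical, for illustration only).
function createFakeWebview() {
    let listener: MessageListener | undefined
    const dropped: string[] = []
    return {
        dropped,
        onMessage(l: MessageListener) {
            listener = l
        },
        // Setting the HTML "runs" the page, which immediately reports that it loaded.
        setHtml(_html: string) {
            const message = { event: 'toolkit_didLoadModule' }
            if (listener) {
                listener(message)
            } else {
                dropped.push(message.event) // nobody is listening yet
            }
        },
    }
}

type FakeWebview = ReturnType<typeof createFakeWebview>

// Old order: HTML first, handlers second. A slow setup loses the message.
function openHandlersLast(webview: FakeWebview, received: string[]) {
    webview.setHtml('<html>...</html>')
    webview.onMessage((m) => received.push(m.event))
}

// New order: handlers first, so the load message always has a listener.
function openHandlersFirst(webview: FakeWebview, received: string[]) {
    webview.onMessage((m) => received.push(m.event))
    webview.setHtml('<html>...</html>')
}
```

With the handlers registered first, the outcome no longer depends on how long the backend setup takes relative to the page load.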
@nkomonen-amazon marked this pull request as ready for review March 17, 2025 15:54

telemetry.webview_error.emit({
webviewName: qChatModuleName,
result: 'Failed',
reasonDesc: msg.errorMessage,
Contributor

Why is there a separate webview_error metric? The error should be part of the toolkit_didLoadModule metric.

Contributor Author

didLoadModule is mainly intended for the initial load, but if there is an error post-load then we will want a separate metric for that. An example is an error after clicking the "submit" button once the startUrl and Region have been entered for sign-in.

* A webview that supports this will call {@link setDidLoad}
* to confirm the UI has successfully loaded.
*/
public supportsLoadTelemetry: boolean = false
Contributor

why do we need this flag? can we just try-and-handle-failure instead?

Contributor Author

This flag tells us to set up a timeout that emits a failure after 10 seconds of no response (on the assumption that the webview did not postMessage to the backend because it failed). Each webview needs some customization to send the expected postMessage, so by default only the webviews we have done that custom work for will support it.

My TODO noted above is to update the webview framework so that it forces this setup (or at least makes it easy).
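
As a rough idea of what the per-webview customization might involve on the frontend, here is a minimal sketch. `PageDeps` and `reportLoadIfHealthy` are hypothetical names, and the DOM lookup and postMessage are injected rather than taken from a real page; in an actual webview `findElement` would likely wrap `document.getElementById`. If the check fails, nothing is sent and the backend timeout reports the failure.

```typescript
// Hypothetical dependencies of the page-side check.
interface PageDeps {
    // Errors captured while the page was doing its initial load.
    errors: readonly unknown[]
    // Looks up an element by id; returns null/undefined when missing.
    findElement: (id: string) => unknown
    // Sends a message to the extension backend.
    postMessage: (message: { event: string }) => void
}

// Reports a successful load only when no errors were seen and the
// expected root element exists. Returns whether a message was sent.
function reportLoadIfHealthy(rootElementId: string, deps: PageDeps): boolean {
    const healthy = deps.errors.length === 0 && deps.findElement(rootElementId) != null
    if (healthy) {
        deps.postMessage({ event: 'toolkit_didLoadModule' })
    }
    return healthy
}
```

A webview that never calls something like this would simply not opt in, which is the case the supportsLoadTelemetry flag distinguishes from a real load failure.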

@nkomonen-amazon (Contributor Author) commented Mar 17, 2025

@justinmk3

> In most cases, did/will pairs should not be needed. Only the "did" case is needed for most metrics, because the metric will wrap the impl logic and track a duration. Why do we need both here?

This is due to how webviews load asynchronously: we can't easily wrap a single function in telemetry.run(). We could maintain some state object and emit a single didLoadModule, but other IDEs found it easier to have separate start/end metrics, since it can be difficult to pass context around cleanly. I just wanted to keep it consistent between IDEs for now.

@nkomonen-amazon merged commit e7b7307 into aws:master Mar 21, 2025
16 of 17 checks passed
@nkomonen-amazon deleted the moduleLoadTelemetry branch March 21, 2025 16:38