# fix(telemetry): Crash monitoring fixes #5741
## Conversation
During development, the underlying implementation for getting the OS uptime was changed. The initial implementation returned uptime in minutes, but the new one returned it in seconds, and this was not accounted for. As a result, the crash monitoring folder was not being cleaned up when expected on computer restart.

Signed-off-by: nkomonen-amazon <[email protected]>
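A minimal sketch of the unit mismatch, assuming Node's `os.uptime()` (which reports seconds) and a hypothetical `restartedSince` helper; this is not the PR's actual code:

```ts
import * as os from 'os'

// Node's os.uptime() returns seconds. The earlier implementation returned
// minutes, so a comparison tuned for minutes stopped matching once the
// units silently changed to seconds.
function osUptimeMinutes(): number {
    return os.uptime() / 60 // normalize seconds -> minutes
}

// Hypothetical helper: has the machine restarted since a file of this age
// was written? If uptime is smaller than the file's age, the OS rebooted
// after the file was created, so the folder is safe to clean up.
function restartedSince(fileAgeMinutes: number): boolean {
    return osUptimeMinutes() < fileAgeMinutes
}
```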
This pull request implements a feature or fix, so it must include a changelog entry. See CONTRIBUTING.md#changelog for instructions.
```ts
        }
    })
} finally {
    await fs.delete(tmpFolder, { recursive: true, force: true })
```
did we ever see any problems with delete? Just wondering if we might create a file and then never be able to delete it 😄
Updated this function to write to the /tmp folder, so if we are unable to delete it, it will eventually be cleaned up.
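A hedged sketch of that approach, using Node's `fs.mkdtemp()` under `os.tmpdir()`; the `withTempFolder` helper is hypothetical, not the PR's actual code:

```ts
import * as os from 'os'
import * as path from 'path'
import { promises as fs } from 'fs'

// Create the scratch folder under the OS temp dir so the platform's own
// cleanup reclaims it even if our delete fails.
async function withTempFolder<T>(task: (dir: string) => Promise<T>): Promise<T> {
    const tmpFolder = await fs.mkdtemp(path.join(os.tmpdir(), 'aws-toolkit-'))
    try {
        return await task(tmpFolder)
    } finally {
        // Best effort: a failure here is tolerable since the temp dir is
        // eventually purged by the OS.
        await fs.rm(tmpFolder, { recursive: true, force: true }).catch(() => {})
    }
}
```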
```ts
// The common errors we were seeing were windows EPERM/EBUSY errors. There may be a relation
// to this https://github.com/aws/aws-toolkit-vscode/pull/5335
await withRetries(() => withFailCtx('deleteStaleRunningFile', () => fs.delete(path.join(dir, extId))), {
    maxRetries: 7,
```
Were these numbers determined by something?
This just equated to 12.7 seconds at most, which I think is enough grace time for the fs operation to succeed. But looking back at it, it may be worth doubling to 25 seconds.
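The 12.7-second figure is consistent with an exponential backoff that starts at 100 ms and doubles each attempt; that schedule is an assumption here, not confirmed from the PR's `withRetries` implementation:

```ts
// Sum of a doubling backoff: base * (2^retries - 1).
//   7 retries: 100 + 200 + ... + 6400 = 100 * 127 = 12,700 ms (~12.7 s)
//   8 retries: adds one 12,800 ms interval, for 25,500 ms (~25.5 s)
function maxTotalBackoffMs(maxRetries: number, baseMs = 100): number {
    return baseMs * (2 ** maxRetries - 1)
}

console.log(maxTotalBackoffMs(7)) // 12700
console.log(maxTotalBackoffMs(8)) // 25500
```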
```ts
 * Does the required initialization steps, this must always be run after
 * creation of the instance.
 *
 * @throws if the filesystem state cannot get in to a good state
```
can we define what a good state means?
updated
```ts
    private readonly devLogger: Logger | undefined
) {}

static #didTryInstance = false
```
is this basically did attempt creation?
yup
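A sketch of the try-once pattern this flag suggests; the class body and the `instanceIfNotTried` name are hypothetical, inferred from the `#didTryInstance` field rather than taken from the PR:

```ts
class CrashMonitoring {
    // Records that creation was *attempted*, so a failed attempt is not
    // silently retried on every later call.
    static #didTryInstance = false
    static #instance: CrashMonitoring | undefined

    static async instanceIfNotTried(): Promise<CrashMonitoring | undefined> {
        if (CrashMonitoring.#didTryInstance) {
            return CrashMonitoring.#instance // undefined if the attempt failed
        }
        CrashMonitoring.#didTryInstance = true
        try {
            const monitor = new CrashMonitoring()
            await monitor.start()
            CrashMonitoring.#instance = monitor
        } catch {
            // Creation failed; leave #instance undefined so callers see no monitor.
        }
        return CrashMonitoring.#instance
    }

    private constructor() {}

    private async start(): Promise<void> {
        // Initialize filesystem state, heartbeats, etc.
    }
}
```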
Looks like the withRetries test is being a bit sporadic on mac.
- removed some unnecessary code
- fixed some comments
- if a heartbeat fails, stop sending future heartbeats and remove the existing one from the state so it cannot be seen as a crash (see the sketch below)
- increased the retries from 12 seconds total to 25 seconds total (1 extra interval)
- switched a vscode fs mkdir call to the node one, since it looks like it may be flaky based on telemetry data
- retry when trying to load the extension state from disk, since it seems to fail sometimes due to temporary issues

Signed-off-by: nkomonen-amazon <[email protected]>
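A hedged sketch of the heartbeat failure handling described in the third bullet; the `Heartbeat` class and its `state` interface are hypothetical stand-ins for the PR's actual types:

```ts
class Heartbeat {
    #timer: NodeJS.Timeout | undefined

    constructor(
        private readonly extId: string,
        private readonly state: {
            set(id: string): Promise<void>
            delete(id: string): Promise<void>
        }
    ) {}

    start(intervalMs: number): void {
        this.#timer = setInterval(async () => {
            try {
                await this.state.set(this.extId)
            } catch {
                // A failed heartbeat means we can no longer keep our entry
                // fresh: stop future heartbeats and withdraw from the state.
                await this.stop()
            }
        }, intervalMs)
    }

    async stop(): Promise<void> {
        if (this.#timer !== undefined) {
            clearInterval(this.#timer)
            this.#timer = undefined
        }
        // Remove our entry so a stale heartbeat is not mistaken for a crash.
        await this.state.delete(this.extId).catch(() => {})
    }
}
```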
/runIntegrationTests
## Problem

Crash monitoring is reporting incorrect crash metrics. This seems to be due to various filesystem errors such as EPERM (even though we were doing an operation on a file we created), ENOSPC (the user ran out of space on their machine), and other errors. Because of this we ran into situations where our state did not reflect reality, and certain extension instances were seen as crashed.

## Solution

- Determine if the filesystem on a machine is reliable (try a bunch of different filesystem flows and ensure nothing throws); only THEN do we start the crash monitoring process. Otherwise we do not run it, since we cannot rely on it being accurate. (A sketch of such a check follows below.)
- Added a `function_call` metric to allow us to determine the ratio of successes to failures.
- Added retries to critical filesystem operations, such as the heartbeats and deleting a crashed extension instance from the state.
- Other various fixes.

License: I confirm that my contribution is made under the terms of the Apache 2.0 license.

Signed-off-by: nkomonen-amazon <[email protected]>
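A minimal sketch of what such a reliability check might look like, assuming plain Node `fs` APIs; this is an illustration, not the PR's actual implementation:

```ts
import * as os from 'os'
import * as path from 'path'
import { promises as fs } from 'fs'

// Exercise a few basic filesystem flows (mkdir, write, read, delete) and
// report whether any of them throw. Crash monitoring would only be enabled
// when this returns true.
async function isFileSystemReliable(): Promise<boolean> {
    const dir = path.join(os.tmpdir(), `fs-check-${Date.now()}`)
    const file = path.join(dir, 'probe.txt')
    try {
        await fs.mkdir(dir, { recursive: true })
        await fs.writeFile(file, 'probe')
        const contents = await fs.readFile(file, 'utf8')
        await fs.rm(dir, { recursive: true, force: true })
        return contents === 'probe'
    } catch {
        return false // any throw: do not trust crash-state bookkeeping here
    }
}
```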