
Conversation

@nkomonen-amazon
Contributor

Problem

Crash monitoring is reporting incorrect crash metrics.
This seems to be due to various filesystem errors, such as EPERM (even though we were operating on a file we created), ENOSPC (the user ran out of space on their machine), and others.

Because of this we ran into situations where our state did not reflect reality, and as a result certain extension
instances were seen as crashed.

Solution

  • Determine whether the filesystem on a machine is reliable (exercise a number of filesystem flows and ensure nothing throws); only if it is do we start the crash monitoring process. Otherwise we do not run it, since we cannot rely on it being accurate. A rough sketch of this gating idea follows after this list.
    • We added a `function_call` metric so we can determine the ratio of successes to failures
  • Add retries to critical filesystem operations, such as writing heartbeats and deleting a crashed extension instance from the state.
  • Other assorted fixes
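
As a rough illustration of the reliability gate described above, here is a minimal sketch, not the PR's actual implementation; the function names (`verifyFilesystemIsReliable`, `startCrashMonitoring`) and the exact probe flows are assumptions:

```ts
// Minimal sketch (hypothetical, not the toolkit's code): exercise the basic
// filesystem flows crash monitoring depends on, and only start monitoring if
// none of them throw.
import * as fs from 'fs/promises'
import * as os from 'os'
import * as path from 'path'

async function verifyFilesystemIsReliable(): Promise<boolean> {
    const dir = await fs.mkdtemp(path.join(os.tmpdir(), 'fs-reliability-'))
    const file = path.join(dir, 'probe.txt')
    try {
        // Write, read, then delete a probe file.
        await fs.writeFile(file, 'probe')
        const contents = await fs.readFile(file, 'utf-8')
        await fs.rm(dir, { recursive: true, force: true })
        return contents === 'probe'
    } catch {
        // Any EPERM/ENOSPC/etc. here means crash metrics from this machine
        // cannot be trusted.
        return false
    }
}

async function maybeStartCrashMonitoring(startCrashMonitoring: () => Promise<void>) {
    if (await verifyFilesystemIsReliable()) {
        await startCrashMonitoring() // hypothetical entry point
    }
}
```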

License: I confirm that my contribution is made under the terms of the Apache 2.0 license.

During development, the underlying implementation was changed to get the OS uptime. The initial implementation returned uptime in minutes, but the new one returned it in seconds, and this was not accounted for.

As a result, the crash monitoring folder was not being cleaned up when expected on computer restart.

Signed-off-by: nkomonen-amazon <[email protected]>
@nkomonen-amazon nkomonen-amazon requested a review from a team as a code owner October 8, 2024 06:08
@github-actions

github-actions bot commented Oct 8, 2024

This pull request implements a feature or fix, so it must include a changelog entry. See CONTRIBUTING.md#changelog for instructions.

}
})
} finally {
await fs.delete(tmpFolder, { recursive: true, force: true })
Contributor

did we ever see any problems with delete? Just wondering if we might create a file and then never be able to delete it 😄

Contributor Author

Updated this function to write to the /tmp folder, so if we are unable to delete it, it will eventually be cleaned up.

// The common errors we were seeing were windows EPERM/EBUSY errors. There may be a relation
// to this https://github.com/aws/aws-toolkit-vscode/pull/5335
await withRetries(() => withFailCtx('deleteStaleRunningFile', () => fs.delete(path.join(dir, extId))), {
maxRetries: 7,
Contributor

were these numbers determined by something?

Contributor Author

This just equated to 12.7 seconds at most, which I think is enough grace time for the fs operation to succeed. But looking back at it, it may be worth doubling to 25 seconds.
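
For context, here is the arithmetic behind that number, assuming `withRetries` uses exponential backoff starting at 100 ms and doubling each attempt (the actual defaults are not visible in this snippet):

```ts
// Assumed backoff: 100 ms initial delay, doubling each attempt, maxRetries: 7.
// Waits would be 100 + 200 + 400 + 800 + 1600 + 3200 + 6400 ms.
const delaysMs = Array.from({ length: 7 }, (_, i) => 100 * 2 ** i)
const totalMs = delaysMs.reduce((sum, d) => sum + d, 0)
console.log(totalMs) // 12700 ms, i.e. the ~12.7 seconds mentioned above; one more interval roughly doubles it to ~25 s
```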

* Performs the required initialization steps; this must always be run after
* creation of the instance.
*
* @throws if the filesystem state cannot get into a good state
Contributor

can we define what a good state means?

Contributor Author

updated

private readonly devLogger: Logger | undefined
) {}

static #didTryInstance = false
Contributor

is this basically did attempt creation?

Contributor Author

yup

@jpinkney-aws
Contributor

Looks like the withRetries test is being a bit sporadic on mac

Signed-off-by: nkomonen-amazon <[email protected]>
Signed-off-by: nkomonen-amazon <[email protected]>
Signed-off-by: nkomonen-amazon <[email protected]>
- removed some unnecessary code
- fixed some comments
- if a heartbeat fails, stop sending future heartbeats and remove the existing one from the state so it cannot be seen as a crash (a sketch of this fallback is below this list)
- increased the total retry window from 12 seconds to 25 seconds (1 extra interval)
- switched a VS Code fs mkdir call to the Node one, since it looks like it may be flaky based on telemetry data
- retry when loading the extension state from disk, since it sometimes fails due to temporary issues
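
A minimal sketch of that heartbeat fallback, with hypothetical names (`sendHeartbeat`, `removeOwnHeartbeatFromState`, `stopHeartbeatTimer` are placeholders, not the toolkit's actual API):

```ts
// Sketch: if writing a heartbeat keeps failing even after retries, stop
// heartbeating and remove our own entry from the shared state, so other
// instances do not later read the stale heartbeat as a crash.
type HeartbeatDeps = {
    sendHeartbeat: () => Promise<void>
    removeOwnHeartbeatFromState: () => Promise<void>
    stopHeartbeatTimer: () => void
}

async function onHeartbeatTick(deps: HeartbeatDeps, maxAttempts = 8): Promise<void> {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            await deps.sendHeartbeat()
            return
        } catch {
            if (attempt === maxAttempts) {
                // Give up: stop future heartbeats and clean up our state entry
                // so this instance is not mistaken for a crash.
                deps.stopHeartbeatTimer()
                await deps.removeOwnHeartbeatFromState().catch(() => undefined)
                return
            }
            // Assumed exponential backoff between attempts.
            await new Promise((resolve) => setTimeout(resolve, 100 * 2 ** (attempt - 1)))
        }
    }
}
```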

Signed-off-by: nkomonen-amazon <[email protected]>
@nkomonen-amazon
Contributor Author

nkomonen-amazon commented Oct 9, 2024

/runIntegrationTests

Signed-off-by: nkomonen-amazon <[email protected]>
Signed-off-by: nkomonen-amazon <[email protected]>
@nkomonen-amazon nkomonen-amazon merged commit 17e3b6b into aws:master Oct 10, 2024
27 of 30 checks passed
@nkomonen-amazon nkomonen-amazon deleted the crashMonitoringFixes branch October 10, 2024 19:54
tverney pushed a commit to tverney/aws-toolkit-vscode that referenced this pull request Oct 21, 2024