
Conversation

@nkomonen-amazon
Contributor

Problem

Crash monitoring is reporting incorrect crash metrics.
This seems to be due to various filesystem errors, such as EPERM (even though we were operating on a file we created), ENOSPC (the user ran out of space on their machine), and others.

Because of this we ran into situations where our state did not reflect reality, and as a result certain extension
instances were seen as crashed.

Solution

  • Determine whether the filesystem on a machine is reliable (exercise a number of filesystem flows and ensure nothing throws); only if it is do we start the crash monitoring process. Otherwise we do not run it, since we cannot rely on it being accurate. A rough sketch of this gating idea follows after this list.
    • We added a `function_call` metric so we can determine the ratio of successes to failures
  • Add retries to critical filesystem operations, such as writing heartbeats and deleting a crashed extension instance from the state.
  • Other assorted fixes
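
As a rough illustration of the reliability gate described above, here is a minimal sketch, not the PR's actual implementation; the function names (`verifyFilesystemIsReliable`, `startCrashMonitoring`) and the exact probe flows are assumptions:

```ts
// Minimal sketch (hypothetical, not the toolkit's code): exercise the basic
// filesystem flows crash monitoring depends on, and only start monitoring if
// none of them throw.
import * as fs from 'fs/promises'
import * as os from 'os'
import * as path from 'path'

async function verifyFilesystemIsReliable(): Promise<boolean> {
    const dir = await fs.mkdtemp(path.join(os.tmpdir(), 'fs-reliability-'))
    const file = path.join(dir, 'probe.txt')
    try {
        // Write, read, then delete a probe file.
        await fs.writeFile(file, 'probe')
        const contents = await fs.readFile(file, 'utf-8')
        await fs.rm(dir, { recursive: true, force: true })
        return contents === 'probe'
    } catch {
        // Any EPERM/ENOSPC/etc. here means crash metrics from this machine
        // cannot be trusted.
        return false
    }
}

async function maybeStartCrashMonitoring(startCrashMonitoring: () => Promise<void>) {
    if (await verifyFilesystemIsReliable()) {
        await startCrashMonitoring() // hypothetical entry point
    }
}
```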

License: I confirm that my contribution is made under the terms of the Apache 2.0 license.

During development, the underlying implementation was changed to get the OS uptime. The initial implementation returned uptime in minutes, but the new one returned it in seconds, and this was not accounted for.

As a result, the crash monitoring folder was not being cleaned up when expected on computer restart.

Signed-off-by: nkomonen-amazon <[email protected]>
@nkomonen-amazon nkomonen-amazon requested a review from a team as a code owner October 8, 2024 06:08
@github-actions

github-actions bot commented Oct 8, 2024

This pull request implements a feature or fix, so it must include a changelog entry. See CONTRIBUTING.md#changelog for instructions.

}
})
} finally {
await fs.delete(tmpFolder, { recursive: true, force: true })
Contributor

did we ever see any problems with delete? Just wondering if we might create a file and then never be able to delete it 😄

Contributor Author

Updated this function to write to the /tmp folder, so if we are unable to delete it, it will eventually be cleaned up.

// The common errors we were seeing were windows EPERM/EBUSY errors. There may be a relation
// to this https://github.com/aws/aws-toolkit-vscode/pull/5335
await withRetries(() => withFailCtx('deleteStaleRunningFile', () => fs.delete(path.join(dir, extId))), {
maxRetries: 7,
Contributor

were these numbers determined by something?

Contributor Author

This just equated to 12.7 seconds at most, which I think is enough grace time for the fs operation to succeed. But looking back at it, it may be worth doubling to 25 seconds.
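
For context, here is the arithmetic behind that number, assuming `withRetries` uses exponential backoff starting at 100 ms and doubling each attempt (the actual defaults are not visible in this snippet):

```ts
// Assumed backoff: 100 ms initial delay, doubling each attempt, maxRetries: 7.
// Waits would be 100 + 200 + 400 + 800 + 1600 + 3200 + 6400 ms.
const delaysMs = Array.from({ length: 7 }, (_, i) => 100 * 2 ** i)
const totalMs = delaysMs.reduce((sum, d) => sum + d, 0)
console.log(totalMs) // 12700 ms, i.e. the ~12.7 seconds mentioned above; one more interval roughly doubles it to ~25 s
```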

* Performs the required initialization steps; this must always be run after
* creation of the instance.
*
* @throws if the filesystem state cannot get into a good state
Contributor

can we define what a good state means?

Contributor Author

updated

private readonly devLogger: Logger | undefined
) {}

static #didTryInstance = false
Contributor

is this basically did attempt creation?

Contributor Author

yup

@jpinkney-aws
Contributor

Looks like the withRetries test is being a bit sporadic on mac

Signed-off-by: nkomonen-amazon <[email protected]>
Signed-off-by: nkomonen-amazon <[email protected]>
Signed-off-by: nkomonen-amazon <[email protected]>
- removed some unnecessary code
- fixed some comments
- if a heartbeat fails, stop sending future heartbeats and remove the existing one from the state so it cannot be seen as a crash (a sketch of this fallback is below this list)
- increased the total retry window from 12 seconds to 25 seconds (1 extra interval)
- switched a VS Code fs mkdir call to the Node one, since it looks like it may be flaky based on telemetry data
- retry when loading the extension state from disk, since it sometimes fails due to temporary issues
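
A minimal sketch of that heartbeat fallback, with hypothetical names (`sendHeartbeat`, `removeOwnHeartbeatFromState`, `stopHeartbeatTimer` are placeholders, not the toolkit's actual API):

```ts
// Sketch: if writing a heartbeat keeps failing even after retries, stop
// heartbeating and remove our own entry from the shared state, so other
// instances do not later read the stale heartbeat as a crash.
type HeartbeatDeps = {
    sendHeartbeat: () => Promise<void>
    removeOwnHeartbeatFromState: () => Promise<void>
    stopHeartbeatTimer: () => void
}

async function onHeartbeatTick(deps: HeartbeatDeps, maxAttempts = 8): Promise<void> {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            await deps.sendHeartbeat()
            return
        } catch {
            if (attempt === maxAttempts) {
                // Give up: stop future heartbeats and clean up our state entry
                // so this instance is not mistaken for a crash.
                deps.stopHeartbeatTimer()
                await deps.removeOwnHeartbeatFromState().catch(() => undefined)
                return
            }
            // Assumed exponential backoff between attempts.
            await new Promise((resolve) => setTimeout(resolve, 100 * 2 ** (attempt - 1)))
        }
    }
}
```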

Signed-off-by: nkomonen-amazon <[email protected]>
@nkomonen-amazon
Contributor Author

nkomonen-amazon commented Oct 9, 2024

/runIntegrationTests

Signed-off-by: nkomonen-amazon <[email protected]>
Signed-off-by: nkomonen-amazon <[email protected]>
@nkomonen-amazon nkomonen-amazon merged commit 17e3b6b into aws:master Oct 10, 2024
27 of 30 checks passed
@nkomonen-amazon nkomonen-amazon deleted the crashMonitoringFixes branch October 10, 2024 19:54
tverney pushed a commit to tverney/aws-toolkit-vscode that referenced this pull request Oct 21, 2024