## Problem
In our telemetry data we saw the following scenario:
- A session (`7ebb2966-7e94-4177-80d9-c5a485511c13`) was sending
heartbeats as normal, with the latest one being at: `Nov 4, 2024 @
03:35:45` (1730691345628). A heartbeat is simply a file with a timestamp
value that is constantly updated.
- ~2 minutes later, this session was reported as crashed at
`Nov 4, 2024 @ 03:37:32`, but the heartbeat timestamp that the crash
check read was `Nov 4, 2024 @ 03:05:45` (1730689545627).
- This does not make sense: the latest heartbeat was at `03:35`, yet the
crash check saw `03:05`. Telemetry shows that 2 more heartbeats were
sent after `03:05`, but the crash check apparently did not see them.
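The size of the gap can be checked from the two epoch-millisecond values quoted above. This is only an illustrative calculation; the variable names are made up for the example and do not come from the toolkit.

```typescript
// Epoch-millisecond values quoted in the scenario above.
const lastHeartbeatSent = 1730691345628 // Nov 4, 2024 @ 03:35:45
const heartbeatSeenByCrashCheck = 1730689545627 // Nov 4, 2024 @ 03:05:45

// How stale the value read by the crash check was.
const staleMs = lastHeartbeatSent - heartbeatSeenByCrashCheck
const staleMinutes = staleMs / 60_000

console.log(staleMs) // 1800001
console.log(staleMinutes.toFixed(2)) // "30.00"
```

The crash check therefore read a value roughly 30 minutes older than the latest heartbeat that telemetry recorded.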
We already handle this edge case in the following ways:
- After the heartbeat file is written, we immediately read to ensure it
returns the content we just wrote
- If a heartbeat write fails for any reason, we terminate all crash
monitoring for that session and clean it up so that there is not a
chance for it to be falsely reported as a crash.
- For any error that happens, we report a telemetry event and collect
them in a graph to see any significant ones
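The first two safeguards above could look roughly like the sketch below. It is not the toolkit's code: `writeHeartbeatVerified`, `sendHeartbeat`, and the `stopMonitoring` callback are hypothetical names, and the synchronous `fs` calls are used only to keep the example short.

```typescript
import * as fs from 'fs'

// Hypothetical helper: write the heartbeat timestamp, then read it back.
// Returns false if the write throws or the read-back does not match.
function writeHeartbeatVerified(filePath: string, timestamp: number): boolean {
    const content = String(timestamp)
    try {
        fs.writeFileSync(filePath, content)
        return fs.readFileSync(filePath, 'utf8') === content
    } catch {
        return false
    }
}

// Hypothetical caller: on any failure, stop crash monitoring for the
// session so it cannot later be reported as a false crash.
function sendHeartbeat(filePath: string, stopMonitoring: () => void): void {
    if (!writeHeartbeatVerified(filePath, Date.now())) {
        stopMonitoring()
    }
}
```

Note that a read-back in the same process can succeed even when other readers do not yet see the new content, which is the gap the Solution section addresses.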
## Solution
Even with the handling above, we were still seeing an unexpected number
of sessions reported as crashed, even though their heartbeats were
being sent as expected.
Our guess as to why the issue persists is that even when a heartbeat
file write succeeds, the new content is not necessarily visible to all
readers of the file. `fsync` is a known way to "finalize" a change to
disk, since some OSes may cache a write and only flush it to disk
later.
The fix is to call `fsync` after writing the heartbeat file.
We will then monitor our telemetry dashboard to see whether these issues
drop.
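A minimal sketch of a write followed by `fsync`, using Node's synchronous `fs` API for brevity. The helper name `writeHeartbeatWithFsync` is hypothetical, and the actual change may use the async equivalents (for example `FileHandle.sync()` from `fs/promises`).

```typescript
import * as fs from 'fs'

// Hypothetical helper: write the heartbeat timestamp, then fsync the
// file descriptor before closing it.
function writeHeartbeatWithFsync(filePath: string, timestamp: number): void {
    const fd = fs.openSync(filePath, 'w') // create or truncate the file
    try {
        fs.writeSync(fd, String(timestamp))
        // Ask the OS to flush its buffered data for this file to the
        // storage device instead of leaving it in a write cache.
        fs.fsyncSync(fd)
    } finally {
        fs.closeSync(fd)
    }
}
```

Whether this removes the false crash reports is still a hypothesis, which is why the dashboard needs to be watched afterwards.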
### Additional
- When deleting an HB file, we now clear its content first and then
delete it. This is based on an assumption that something can still read
the previous text from the file after it is deleted, for example
through a file handle that was opened before the delete.
- Even if we have an empty HB file on disk [we handle that case
gracefully](https://github.com/aws/aws-toolkit-vscode/blob/55e0b83aa13a09b49af5fe4db5b0d8879fd6f1dd/packages/core/src/shared/crashMonitoring.ts#L590)
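The clear-then-delete order, together with a reader that tolerates an empty file, might be sketched as follows. Both `deleteHeartbeat` and `readHeartbeat` are hypothetical names, and treating an empty or missing file as "no heartbeat" is an assumption about what handling it gracefully means here; the linked `crashMonitoring.ts` is the authoritative version.

```typescript
import * as fs from 'fs'

// Hypothetical helper: empty the heartbeat file first, then delete it,
// so a reader holding an older handle finds no stale timestamp.
function deleteHeartbeat(filePath: string): void {
    fs.truncateSync(filePath, 0)
    fs.unlinkSync(filePath)
}

// Hypothetical reader: a missing, empty, or non-numeric file yields
// undefined instead of a timestamp.
function readHeartbeat(filePath: string): number | undefined {
    try {
        const text = fs.readFileSync(filePath, 'utf8').trim()
        const value = Number(text)
        return text !== '' && Number.isFinite(value) ? value : undefined
    } catch {
        return undefined
    }
}
```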
---
<!--- REMINDER: Ensure that your PR meets the guidelines in
CONTRIBUTING.md -->
License: I confirm that my contribution is made under the terms of the
Apache 2.0 license.
---------
Signed-off-by: nkomonen-amazon <[email protected]>