
Conversation

@iQQBot (Contributor) commented Nov 20, 2024

Description

We encountered some workspaces where the workspace container was still reported as running for several hours after the timeout, until the node was deleted.
This led to data loss. With this PR, the backup is started five minutes after the pod is marked for deletion, regardless of whether the container is still reported as running.

[ws-daemon] start backup even if the pod still reports the container as running after 5 minutes
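
For illustration, a minimal sketch of the guard this PR describes. The helper name shouldStartBackup and the constant forceBackupAfter are made up for this example; only the PodStoppingTime field and the five-minute timeout come from the PR:

```go
package main

// Hypothetical sketch only; the real ws-daemon logic differs. It assumes
// PodStoppingTime is set when the pod enters the Stopping phase (see the
// review thread below).
import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const forceBackupAfter = 5 * time.Minute // the timeout named in this PR

// shouldStartBackup reports whether the backup may begin: immediately once
// the container is gone, or after forceBackupAfter even if the container
// is still (erroneously) reported as running.
func shouldStartBackup(containerRunning bool, podStoppingTime *metav1.Time) bool {
	if !containerRunning {
		return true
	}
	if podStoppingTime == nil {
		return false
	}
	return time.Since(podStoppingTime.Time) > forceBackupAfter
}

func main() {
	stopping := metav1.NewTime(time.Now().Add(-6 * time.Minute))
	fmt.Println(shouldStartBackup(true, &stopping)) // true: stopping for >5m, back up anyway
}
```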

Related Issue(s)

Fixes #

How to test

Documentation

Preview status

Gitpod was successfully deployed to your preview environment.

Build Options

Build
  • /werft with-werft
    Run the build with werft instead of GHA
  • leeway-no-cache
  • /werft no-test
    Run Leeway with --dont-test
Publish
  • /werft publish-to-npm
  • /werft publish-to-jb-marketplace
Installer
  • analytics=segment
  • with-dedicated-emulation
  • workspace-feature-flags
    Add desired feature flags to the end of the line above, space separated
Preview Environment / Integration Tests
  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-large-vm
  • /werft with-gce-vm
    If enabled this will create the environment on GCE infra
  • /werft preemptible
Saves cost. Untick this only if you're really sure you need a non-preemptible machine.
  • with-integration-tests=all
    Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh. If enabled, with-preview and with-large-vm will be enabled.
  • with-monitoring

/hold

@iQQBot (Contributor, Author) commented Nov 21, 2024

/hold
Need to drop the debug commit before merging.

@geropl (Member) commented Nov 21, 2024

@iQQBot How do I test this change? Could you add a description? What do I have to do to keep the workspace "stuck"?

url:
  type: string
required:
  - podRecreated
Member:
@iQQBot Should we really make this part of this PR?

I'd love to drop it, and ship it at a later point. Just trying to avoid this complicating operations until we are sure this is rolled out to all installations.

Member:
(e.g. gitpod.io)

Contributor Author:

It's generated code; if you want this to be optional, you need to add annotations.
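
(For reference, a hedged sketch of the kind of annotation meant here, assuming the CRD schema is generated with controller-gen/kubebuilder markers; the field's Go type and the surrounding struct name are assumptions:)

```go
// Hypothetical excerpt of the Go type the CRD schema is generated from.
// controller-gen treats a field as required unless it carries the
// +optional marker (or omitempty in its json tag); regenerating after
// adding the marker drops podRecreated from the required: list.
type WorkspaceStatus struct {
	// +optional
	PodRecreated int `json:"podRecreated,omitempty"`
}
```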

Contributor Author:

not "just delete this line"

Member:

I'm pretty sure we can just delete the line for now to achieve the desired effect; e.g. it has worked for the last couple of weeks. 😉

Would be great to add that annotation, though, if you know how to do that!

Contributor Author:

We haven't made any updates to the workspace cluster on gitpod.io in the past few weeks either.

Member:

Yes, but we could have some manual operations going on, considering that we need to renew soon.
And if somebody forgets to update any of the three (1. CRD, 2. ws-daemon, 3. ws-manager), they will accidentally block workspace starts.


if ws.Status.Phase == workspacev1.WorkspacePhaseStopping && old.Phase != workspacev1.WorkspacePhaseStopping {
	t := metav1.Now()
	ws.Status.PodStoppingTime = &t
}
Member:

I would expect this line to be placed in updateWorkspaceStatus in status.go, as this method's responsibility is something else.

Could we move it to the end of function updateWorkspaceStatus maybe...? 🤔
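
(A hedged sketch of the suggested move; the package name, import path, and function signature are assumptions, and only the three assigned lines come from the diff above:)

```go
package controllers // package name assumed

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	workspacev1 "github.com/gitpod-io/gitpod/ws-manager/api/crd/v1" // import path assumed
)

// updateWorkspaceStatus is assumed to derive ws.Status from the pod state;
// the timestamp assignment would move to the end of it.
func updateWorkspaceStatus(ws *workspacev1.Workspace, oldPhase workspacev1.WorkspacePhase) {
	// ... existing status derivation ...

	// Record when the workspace first entered the Stopping phase so that
	// ws-daemon can force a backup after the five-minute timeout.
	if ws.Status.Phase == workspacev1.WorkspacePhaseStopping && oldPhase != workspacev1.WorkspacePhaseStopping {
		t := metav1.Now()
		ws.Status.PodStoppingTime = &t
	}
}
```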

@geropl (Member) commented Nov 22, 2024

I tested the following:

  • running container got force-stopped with a successful backup ✔️
  • (replaced "ForceKill" with a log) confirmed seeing the "dumpWorkspaceInfo" output (incl. that the container is still running) + the successful backup + stop
    • workspace is gone, pod is gone ✔️

@iQQBot Is there more to test here? 🤔

Also, the approach itself looks safe to me:

  • this is effectively guarded exactly by the "ContainerIsRunning" check, which blocked us before, so no potential for harm as far as I can see ✔️
  • the only point of consideration I see is here

@geropl (Member) left a review:

Code LGTM, tested and works! ✔️

Thank you! 🙏

@iQQBot (Contributor, Author) commented Nov 22, 2024

/unhold

@roboquat merged commit b77f687 into main on Nov 22, 2024 (14 of 16 checks passed)
@roboquat deleted the pd/CLC-954 branch on November 22, 2024 at 10:14