
Conversation

@iQQBot (Contributor) commented Nov 20, 2024

Description

We encountered some workspaces where the workspace container was still reported as running for several hours after the timeout, until the node was deleted.
This led to data loss. With this PR, the backup is started five minutes after the pod is marked for deletion, regardless of whether the container is still reported as running.

[ws-daemon] start backup even if the pod still reports the container as running after 5 minutes
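
For illustration, a minimal sketch of the guard this PR describes. The helper name shouldStartBackup and the constant forceBackupAfter are made up for this example; only the PodStoppingTime field and the five-minute timeout come from the PR:

```go
package main

// Hypothetical sketch only; the real ws-daemon logic differs. It assumes
// PodStoppingTime is set when the pod enters the Stopping phase (see the
// review thread below).
import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const forceBackupAfter = 5 * time.Minute // the timeout named in this PR

// shouldStartBackup reports whether the backup may begin: immediately once
// the container is gone, or after forceBackupAfter even if the container
// is still (erroneously) reported as running.
func shouldStartBackup(containerRunning bool, podStoppingTime *metav1.Time) bool {
	if !containerRunning {
		return true
	}
	if podStoppingTime == nil {
		return false
	}
	return time.Since(podStoppingTime.Time) > forceBackupAfter
}

func main() {
	stopping := metav1.NewTime(time.Now().Add(-6 * time.Minute))
	fmt.Println(shouldStartBackup(true, &stopping)) // true: stopping for >5m, back up anyway
}
```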

Related Issue(s)

Fixes #

How to test

Documentation

Preview status

Gitpod was successfully deployed to your preview environment.

Build Options

Build
  • /werft with-werft
    Run the build with werft instead of GHA
  • leeway-no-cache
  • /werft no-test
    Run Leeway with --dont-test
Publish
  • /werft publish-to-npm
  • /werft publish-to-jb-marketplace
Installer
  • analytics=segment
  • with-dedicated-emulation
  • workspace-feature-flags
    Add desired feature flags to the end of the line above, space separated
Preview Environment / Integration Tests
  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-large-vm
  • /werft with-gce-vm
    If enabled this will create the environment on GCE infra
  • /werft preemptible
Saves cost. Untick this only if you're really sure you need a non-preemptible machine.
  • with-integration-tests=all
    Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh. If enabled, with-preview and with-large-vm will be enabled.
  • with-monitoring

/hold

@iQQBot (Contributor, Author) commented Nov 21, 2024

/hold
Need to drop the debug commit before merging.

@geropl (Member) commented Nov 21, 2024

@iQQBot How do I test this change? Could you add a description? What do I have to do to keep the workspace "stuck"?

url:
  type: string
required:
  - podRecreated
Member:
@iQQBot Should we really make this part of this PR?

I'd love to drop it, and ship it at a later point. Just trying to avoid this complicating operations until we are sure this is rolled out to all installations.

Member:
(e.g. gitpod.io)

Contributor Author:

It's generated code; if you want this to be optional, you need to add annotations.
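
(For reference, a hedged sketch of the kind of annotation meant here, assuming the CRD schema is generated with controller-gen/kubebuilder markers; the field's Go type and the surrounding struct name are assumptions:)

```go
// Hypothetical excerpt of the Go type the CRD schema is generated from.
// controller-gen treats a field as required unless it carries the
// +optional marker (or omitempty in its json tag); regenerating after
// adding the marker drops podRecreated from the required: list.
type WorkspaceStatus struct {
	// +optional
	PodRecreated int `json:"podRecreated,omitempty"`
}
```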

Contributor Author:

not "just delete this line"

Member:

I'm pretty sure we can just delete the line for now to achieve the desired effect; e.g. it has worked for the last couple of weeks. 😉

Would be great to add that annotation, though, if you know how to do that!

Contributor Author:

We haven't made any updates to the workspace cluster on gitpod.io in the past few weeks either.

Member:

Yes, but we could have some manual operations going on, considering that we need to renew soon.
And if somebody forgets to update any of the three (1. CRD, 2. ws-daemon, 3. ws-manager), they will accidentally block workspace starts.


if ws.Status.Phase == workspacev1.WorkspacePhaseStopping && old.Phase != workspacev1.WorkspacePhaseStopping {
	t := metav1.Now()
	ws.Status.PodStoppingTime = &t
}
Member:

I would expect this line to be placed in updateWorkspaceStatus in status.go, as this method's responsibility is something else.

Could we move it to the end of function updateWorkspaceStatus maybe...? 🤔
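
(A hedged sketch of the suggested move; the package name, import path, and function signature are assumptions, and only the three assigned lines come from the diff above:)

```go
package controllers // package name assumed

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	workspacev1 "github.com/gitpod-io/gitpod/ws-manager/api/crd/v1" // import path assumed
)

// updateWorkspaceStatus is assumed to derive ws.Status from the pod state;
// the timestamp assignment would move to the end of it.
func updateWorkspaceStatus(ws *workspacev1.Workspace, oldPhase workspacev1.WorkspacePhase) {
	// ... existing status derivation ...

	// Record when the workspace first entered the Stopping phase so that
	// ws-daemon can force a backup after the five-minute timeout.
	if ws.Status.Phase == workspacev1.WorkspacePhaseStopping && oldPhase != workspacev1.WorkspacePhaseStopping {
		t := metav1.Now()
		ws.Status.PodStoppingTime = &t
	}
}
```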

@geropl (Member) commented Nov 22, 2024

I tested the following:

  • running container got force-stopped with a successful backup ✔️
  • (replaced "ForceKill" with a log) confirmed seeing the "dumpWorkspaceInfo" output (incl. that the container is still running) + the successful backup + stop
    • workspace is gone, pod is gone ✔️

@iQQBot Is there more to test here? 🤔

Also, the approach itself looks safe to me:

  • this is effectively guarded exactly by the "ContainerIsRunning" check, which blocked us before, so no potential for harm as far as I can see ✔️
  • the only point of consideration I see is here

@geropl (Member) left a review:

Code LGTM, tested and works! ✔️

Thank you! 🙏

@iQQBot (Contributor, Author) commented Nov 22, 2024

/unhold

@roboquat merged commit b77f687 into main on Nov 22, 2024 (14 of 16 checks passed)
@roboquat deleted the pd/CLC-954 branch on November 22, 2024 at 10:14