
Conversation

iQQBot (Contributor) commented May 21, 2025

Description

[ws-manager-mk2] do cleanup of failed workspace with unknown status

Related Issue(s)

Fixes CLC-1372

How to test

  1. Start a workspace in a preview environment.
  2. SSH into the workspace node.
  3. Restart the node.
  4. Observe that the workspace is cleaned up after the reboot; you should also see the log workspace container xxxx terminated for an unknown reason in the ws-manager-mk2 pod.

Documentation

Preview status

Gitpod was successfully deployed to your preview environment.

Build Options

Build
  • /werft with-werft
    Run the build with werft instead of GHA
  • leeway-no-cache
  • /werft no-test
    Run Leeway with --dont-test
Publish
  • /werft publish-to-npm
  • /werft publish-to-jb-marketplace
Installer
  • analytics=segment
  • with-dedicated-emulation
  • workspace-feature-flags
    Add desired feature flags to the end of the line above, space separated
Preview Environment / Integration Tests
  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-large-vm
  • /werft with-gce-vm
    If enabled this will create the environment on GCE infra
  • /werft preemptible
    Saves cost. Untick this only if you're really sure you need a non-preemptible machine.
  • with-integration-tests=all
    Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh. If enabled, with-preview and with-large-vm will be enabled.
  • with-monitoring

/hold

return fmt.Sprintf("container %s completed; containers of a workspace pod are not supposed to do that", cs.Name), nil
}
} else if !isPodBeingDeleted(pod) && terminationState.ExitCode != containerUnknownExitCode {
} else if !isPodBeingDeleted(pod) && terminationState.ExitCode == containerUnknownExitCode {
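For context, a minimal, self-contained sketch of how such a branch might sit inside a failure-extraction helper. The helper name (extractContainerFailure), the surrounding branches, and the value of containerUnknownExitCode (255 here) are assumptions for illustration, not the repository's actual code:

```go
package controllers

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// containerUnknownExitCode is the exit code reported when a container terminated
// for a reason the runtime cannot determine (value assumed for illustration).
const containerUnknownExitCode = 255

// extractContainerFailure is a hypothetical stand-in for the status extraction logic:
// it returns a failure message when a workspace container terminated, including the
// "terminated for an unknown reason" case this PR starts treating as a failure.
func extractContainerFailure(pod *corev1.Pod, cs corev1.ContainerStatus) (string, error) {
	terminationState := cs.State.Terminated
	if terminationState == nil {
		return "", nil
	}

	if terminationState.ExitCode == 0 {
		return fmt.Sprintf("container %s completed; containers of a workspace pod are not supposed to do that", cs.Name), nil
	} else if !isPodBeingDeleted(pod) && terminationState.ExitCode == containerUnknownExitCode {
		// Previously this exit code was skipped in the hope that the status would
		// recover on its own; now it is surfaced so the workspace gets cleaned up.
		return fmt.Sprintf("workspace container %s terminated for an unknown reason", cs.Name), nil
	}
	return "", nil
}

// isPodBeingDeleted reports whether the pod has a deletion timestamp set.
func isPodBeingDeleted(pod *corev1.Pod) bool {
	return pod.DeletionTimestamp != nil
}
```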
iQQBot (Contributor, author) commented:
In very old versions of Kubernetes, if there was a network issue on the node, the kubelet could not report pod status for a while and the API server would mark the pod's container statuses as "unknown". Our previous approach therefore treated this situation as a temporary problem and expected it to recover automatically.

In recent versions, however, the API server no longer changes the pod's status; instead, it marks the node as NotReady. So when we see this "unknown" state now, it is actually reported by containerd.
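As an aside, the newer behaviour described above (the node being marked NotReady rather than the pod statuses being rewritten) can be observed with a small client-go check of the node's Ready condition. This is a generic sketch, not part of this PR; the in-cluster config and the node name "workspace-node" are placeholders:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// nodeIsReady reports whether the node's Ready condition is True.
func nodeIsReady(node *corev1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// "workspace-node" is a placeholder node name for illustration.
	node, err := client.CoreV1().Nodes().Get(context.Background(), "workspace-node", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("node %s ready: %v\n", node.Name, nodeIsReady(node))
}
```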

A reviewer (Contributor) commented:

> In very old versions of Kubernetes

How old? Are you able to tell where the cutoff is?

A reviewer (Contributor) commented:

Adding approval to unblock.

Please check that the K3s version for gitpod.io is not impacted before removing the hold.

iQQBot requested a review from Copilot on May 21, 2025 15:01
Copilot AI (Contributor) left a comment:

Pull Request Overview

This PR addresses the cleanup of failed workspaces with an unknown exit status by updating the controller’s status extraction logic and adding a corresponding test case.

  • Updated the logic in workspace status extraction to handle an unknown container exit code.
  • Added a new test case to simulate a workspace failure with an unknown exit code.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

  • components/ws-manager-mk2/controllers/workspace_controller_test.go: Added a test to simulate and verify cleanup behavior for workspaces failing with an unknown exit code.
  • components/ws-manager-mk2/controllers/status.go: Adjusted the exit code check and updated the error message for workspaces terminating with an unknown exit code.
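The added test itself isn't shown in this thread. As a rough illustration of the behaviour it covers, here is a self-contained unit test built on the hypothetical helper and constant from the earlier sketch; it is not the Ginkgo test added in workspace_controller_test.go:

```go
package controllers

import (
	"strings"
	"testing"

	corev1 "k8s.io/api/core/v1"
)

// TestUnknownExitCodeIsTreatedAsFailure checks that a container terminating with the
// unknown exit code now yields a failure message, so the workspace gets cleaned up.
func TestUnknownExitCodeIsTreatedAsFailure(t *testing.T) {
	pod := &corev1.Pod{} // no deletion timestamp: the pod is not being deleted
	cs := corev1.ContainerStatus{
		Name: "workspace",
		State: corev1.ContainerState{
			Terminated: &corev1.ContainerStateTerminated{ExitCode: containerUnknownExitCode},
		},
	}

	failure, err := extractContainerFailure(pod, cs)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	if !strings.Contains(failure, "terminated for an unknown reason") {
		t.Fatalf("expected an unknown-reason failure, got %q", failure)
	}
}
```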

kylos101 (Contributor) commented:

Testing now, will then review the code.

kylos101 (Contributor) commented:

Stop and restart are fine.

kylos101 (Contributor) commented:

With a running workspace, after forcing a reboot I see:

gitpod /workspace/gitpod (pd/CLC-1372) $ kubectl get pods
NAME                                      READY   STATUS    RESTARTS   AGE
agent-smith-7vlr7                         0/2     Unknown   0          152m
blobserve-87c7b754b-z4ksb                 0/2     Unknown   0          152m
content-service-5dcd854b64-lxq7r          0/2     Unknown   0          152m
dashboard-64978bbd7f-wfdqn                0/1     Unknown   0          152m
fluent-bit-6m9k8                          0/1     Unknown   0          152m
ide-metrics-7bbcbfc445-xvqrc              0/2     Unknown   0          152m
ide-proxy-56d48b75c7-wpgt6                0/1     Unknown   0          152m
ide-service-84f5fb86f5-pzshb              0/2     Unknown   0          152m
image-builder-mk3-5fc988db7d-p4rsb        0/2     Unknown   0          152m
minio-6f794b8c88-d5b9l                    0/1     Unknown   0          152m
mysql-0                                   0/1     Unknown   0          152m
node-labeler-5977846d48-hkd6f             0/2     Unknown   0          152m
node-labeler-5977846d48-hxjwl             0/2     Unknown   0          152m
openvsx-proxy-0                           0/3     Unknown   0          152m
proxy-7585cb4877-rthrj                    0/2     Unknown   0          152m
public-api-server-75b8b45bd4-grb2x        0/2     Unknown   0          152m
redis-59c7fb74f6-55jck                    0/3     Unknown   0          152m
registry-facade-tnlvr                     0/2     Unknown   0          152m
server-55d7dc4948-zr9tj                   0/2     Unknown   0          152m
spicedb-7dbb8667b5-wg4ln                  0/2     Unknown   1          152m
usage-7759967bcc-k89d5                    0/2     Unknown   0          152m
ws-3f302446-5342-4889-a15b-89fbb845f34f   0/1     Unknown   0          3m38s
ws-daemon-bwsfp                           0/2     Unknown   0          152m
ws-manager-bridge-6b6dddd86f-5rwff        0/2     Unknown   0          152m
ws-manager-mk2-f98847879-5jx7k            0/2     Unknown   0          152m
ws-manager-mk2-f98847879-cdszh            0/2     Unknown   0          152m
ws-proxy-867cf54978-bldr5                 0/2     Unknown   0          152m

After services start we see:
[screenshot]

And for the workspace I see:
[screenshot]

And if I try restarting the failed workspace:
[screenshot]

kylos101 (Contributor) commented:

Unit tests run (I had to bump the timeout to 5m; otherwise we panic at the 30s mark):
[screenshot]
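The thread doesn't show where the timeout was raised. One common knob in a Ginkgo/Gomega suite, assumed here purely for illustration and not necessarily what was changed, is Gomega's default Eventually timeout and polling interval:

```go
package controllers_test

import (
	"time"

	"github.com/onsi/gomega"
)

// init raises Gomega's default Eventually timeout and polling interval so that
// slow reconcile loops in an envtest-based suite have time to settle.
func init() {
	gomega.SetDefaultEventuallyTimeout(5 * time.Minute)
	gomega.SetDefaultEventuallyPollingInterval(time.Second)
}
```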

kylos101 (Contributor) left a review comment:

👋 @iQQBot prior to removing the hold, can you double-check that the K3s version for gitpod.io is not impacted? Ref: #20829 (comment)

kylos101 (Contributor) commented:

I was wrong, we know this version of K3s isn't impacted (as the preview env uses the same version).

However, it might be interesting to restart the CNI pod and measure/gauge the impact. If it is fast, I expect no change; if it is slow, I expect similar behavior. Please check beforehand?

iQQBot (Contributor, author) commented May 22, 2025:

> I was wrong, we know this version of K3s isn't impacted (as the preview env uses the same version).
>
> However, it might be interesting to restart the CNI pod and measure/gauge the impact. If it is fast, I expect no change; if it is slow, I expect similar behavior. Please check beforehand?

/unhold

CNI does not affect this logic.

roboquat merged commit e5a2c82 into main on May 22, 2025
50 checks passed
roboquat deleted the pd/CLC-1372 branch on May 22, 2025 11:29