Conversation

@Spground
Collaborator

Summary

  • What is changing and why?

Add features that allow OpenSandbox to be deployed in a Kubernetes cluster.

Testing

  • Not run (explain why)
  • Unit tests
  • Integration tests
  • e2e / manual verification

Breaking Changes

  • None
  • Yes (describe impact and migration path)

Checklist

  • Linked Issue or clearly described motivation
  • Added/updated docs (if needed)
  • Added/updated tests (if needed)
  • Security impact considered
  • Backward compatibility considered

@CLAassistant

CLAassistant commented Dec 18, 2025

CLA assistant check
All committers have signed the CLA.

@Spground Spground changed the title from "feat(k8s): add k8s controllers" to "feat(k8s): add k8s controllers WIP" Dec 18, 2025
@Pangjiping
Collaborator

please sign your CLA

@Spground Spground force-pushed the feature/sandbox-k8s-dev branch 2 times, most recently from bc32ffe to 1adab19 Compare December 18, 2025 03:54
@Spground Spground changed the title from "feat(k8s): add k8s controllers WIP" to "feat(k8s): add k8s controllers" Dec 18, 2025
@Spground Spground force-pushed the feature/sandbox-k8s-dev branch from 1adab19 to ff0b13f Compare December 18, 2025 04:09
@jwx0925
Collaborator

jwx0925 commented Dec 23, 2025

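Critical Issues

=== FILE: kubernetes/internal/task-executor/runtime/container.go ===

Issue 1: Container mode triggers a panic that crashes the service

  • File: kubernetes/internal/task-executor/runtime/container.go:44
  • Severity: Critical
  • Description: Start/Inspect/Stop call panic("container mode is not implemented yet") directly. When EnableContainerMode=true and a task uses TaskSpec.Container, the panic fires and the task-executor process crashes.
  • Impact: Enabling container mode, or misconfiguring it, crashes the service outright and interrupts task execution and scheduling.
  • Suggested fix: Explicitly block container mode at the configuration layer or return a handleable error; alternatively, implement the container execution logic so that no panic occurs.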

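=== FILE: kubernetes/internal/scheduler/default_scheduler.go ===

Issue 2: Task deletion is never detected, so resources are never released

  • File: kubernetes/internal/scheduler/default_scheduler.go:176
  • Severity: Critical
  • Description: collectTaskStatus updates tNode.Status only when the endpoint returns the task; when the task has been deleted (the endpoint returns nil), Status is not cleared. scheduleSingleTaskNode, however, moves from stateReleasing to stateReleased only when Status == nil, so the release flow can never complete.
  • Impact: Resources are not released after a task completes or is deleted; Pods in the pool stay occupied indefinitely, which can cause resource leaks and block scheduling.
  • Suggested fix: In collectTaskStatus, explicitly set tNode.Status=nil for allocated nodes whose task was not returned (or have the collector return results that include nil) to trigger the release state transition.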
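
The suggested fix for Issue 2 can be sketched roughly as follows. This is a simplified illustration, not the scheduler's real code: the `TaskStatus` and `taskNode` types, the `Assigned` field, and the map-based signature of `collectTaskStatus` are assumptions made for the example; only the idea of resetting `Status` to nil when the endpoint no longer returns the task comes from the review.

```go
package main

// TaskStatus is a hypothetical, reduced view of a task's reported status.
type TaskStatus struct {
	State string
}

// taskNode is a hypothetical stand-in for the scheduler's per-task node.
type taskNode struct {
	Name     string
	Assigned bool        // true once the task has been placed on a Pod
	Status   *TaskStatus // nil means the endpoint no longer reports the task
}

// collectTaskStatus copies the statuses reported by the endpoint onto the
// nodes. A node that is assigned but absent from reported (or reported as
// nil) has its Status cleared, which is the condition the release
// transition described above waits for.
func collectTaskStatus(nodes []*taskNode, reported map[string]*TaskStatus) {
	for _, n := range nodes {
		if !n.Assigned {
			continue // nothing was dispatched, so there is nothing to collect
		}
		if st, ok := reported[n.Name]; ok && st != nil {
			n.Status = st
			continue
		}
		// The task is gone from the endpoint: clear the stale status.
		n.Status = nil
	}
}
```

One design caveat under these assumptions: the caller should invoke this only after a successful poll of the endpoint, so that a transient request failure is not mistaken for a deleted task.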

@Spground Spground force-pushed the feature/sandbox-k8s-dev branch from c403cfa to 07333a9 Compare December 23, 2025 08:55
@Spground
Collaborator Author

fixed

@Spground Spground force-pushed the feature/sandbox-k8s-dev branch from 07333a9 to 5d287df Compare December 23, 2025 09:02
@jwx0925
Collaborator

jwx0925 commented Dec 25, 2025

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@Spground Spground force-pushed the feature/sandbox-k8s-dev branch from 5d287df to 11eba40 Compare December 25, 2025 08:10
@Spground Spground force-pushed the feature/sandbox-k8s-dev branch from 11eba40 to 84a96cf Compare December 26, 2025 08:13
@Pangjiping
Copy link
Collaborator

LGTM

@Pangjiping Pangjiping merged commit 73ab784 into alibaba:main Dec 26, 2025
2 checks passed

6 participants