-
Notifications
You must be signed in to change notification settings - Fork 28
Hyperjob controller #25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hyperjob controller #25
Conversation
|
Welcome @JesseStutler! |
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @JesseStutler, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant new feature: a HyperJob controller designed to orchestrate and manage distributed machine learning or batch workloads by splitting them into multiple Volcano Jobs and propagating them using Karmada's PropagationPolicies. It ensures that the lifecycle of these child jobs and policies is synchronized with the parent HyperJob. Accompanying this core functionality are essential updates to the project's build environment, Go module dependencies, and deployment configurations, including the addition of health checks and enhanced admission webhook capabilities, to support the new controller and maintain system stability.
Highlights
- New HyperJob Controller: Introduced a new HyperJob controller responsible for managing split HyperJobs, including the creation, update, and deletion of associated Volcano Jobs (VCJobs) and Karmada PropagationPolicies (PPs).
- Build System Updates: The Makefile has been updated to include targets for generating HyperJob Custom Resource Definitions (CRDs) and to manage the
controller-gentool. - Dependency Upgrades: The Go version has been upgraded to 1.24.0, and various Kubernetes and Volcano dependencies have been updated to newer versions, including
k8s.io/apiandvolcano.sh/apis. - Health Checks and Webhook Configuration: Liveness and readiness probes have been added to the controller and webhook managers, along with updates to webhook admission paths to include job mutation and validation.
- Documentation Enhancement: The deployment README has been revised to reflect the new HyperJob CRD and the updated order of deployment steps.
- Shared Controller Framework: A new shared controller manager has been added to integrate and manage
controller-runtimebased reconcilers within thevolcano-globalproject.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new HyperJob controller for splitting a HyperJob into multiple VCJobs. The changes are extensive, including the new controller logic, dependency updates, Makefile targets for CRD generation, and updates to deployment configurations and documentation. My review has identified a critical issue regarding a dependency on a personal fork, a high-severity misconfiguration in the webhook manager's health probes, and a medium-severity issue related to a risky programming pattern in the new controller. Addressing these points will improve the stability, security, and maintainability of the project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new HyperJob controller to split a HyperJob into multiple vcjobs and manage their lifecycle. The implementation uses controller-runtime and integrates it into the existing Volcano framework, which is a good approach. The controller logic for creating, updating, and deleting child vcjobs and PropagationPolicys is mostly sound, as is the status aggregation.
However, I've identified a few issues, including a critical bug in a loop in hyperjob_controller.go that could lead to incorrect behavior, a potential robustness issue in the reconciliation logic, an unhandled error, and some configuration issues in the Makefile. I've provided specific comments and suggestions for each of these points. Overall, this is a significant and well-architected feature addition, but the identified issues, especially the critical ones, should be addressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new HyperJob controller to manage and split a HyperJob into multiple VolcanoJobs across clusters. The overall implementation is a good start, but there are several critical issues that need to be addressed. A replace directive in go.mod points to a personal fork, which is a major concern. There's also a critical bug in the controller logic related to a common Go pitfall with for...range loops that could lead to incorrect behavior. Additionally, there are some potential issues with configuration files, documentation, and error handling. I've provided detailed comments and suggestions to address these points.
e23c47f to
b9f82d1
Compare
|
@JesseStutler: PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
b9f82d1 to
3e0b547
Compare
|
Will also add a design doc asap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces a HyperJob controller for managing split HyperJob resources in a multi-cluster environment. The controller creates and manages multiple VCJobs (Volcano Jobs) and PropagationPolicies for distributing training workloads across Kubernetes clusters via Karmada.
Key Changes:
- New HyperJob controller with reconciliation logic for VCJob and PropagationPolicy lifecycle management
- Status aggregation from child VCJobs to parent HyperJob
- Hash-based change detection for spec updates
- Comprehensive test coverage for core functionality
Reviewed changes
Copilot reviewed 16 out of 18 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/controllers/hyperjob/hyperjob_controller.go | Core controller implementing reconciliation logic for HyperJob splitting and status aggregation |
| pkg/controllers/hyperjob/hyperjob_controller_utils.go | Utility functions for hash computation, change detection, and status checking |
| pkg/controllers/hyperjob/hyperjob_controller_test.go | Comprehensive unit tests for controller operations |
| pkg/controllers/scheme/scheme.go | Registration infrastructure for reconciler initialization |
| pkg/controllers/controller.go | Shared controller manager initialization |
| go.mod/go.sum | Updated dependencies including future-dated API fork |
| installer/dockerfile/*/Dockerfile | Updated Go version to 1.24.0 |
| docs/deploy/*.yaml | Deployment manifests and CRD definitions |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
The corresponding issue needs to be associated. |
3e0b547 to
726f9af
Compare
Signed-off-by: JesseStutler <chenzicong4@huawei.com>
726f9af to
2d41bf5
Compare
@wangyang0616 Already associated, please check again, thanks |
Signed-off-by: JesseStutler <chenzicong4@huawei.com>
2d41bf5 to
79157eb
Compare
|
/lgtm |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Monokaix The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
And Hyperjob controller for automatic splitting hyperjob to support cross cluster training
Fixes: #31
HyperJob CRD: volcano-sh/volcano#4922