proposal: Add the role status to modelserving.Status #676
Signed-off-by: zhoujinyu <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Summary of Changes
Hello @FAUST-BENCHOU, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new documentation proposal that aims to significantly improve the observability of
Highlights
Code Review
This pull request introduces a design proposal to add role-level status to the ModelServing.Status object. The proposal is well-structured and clearly outlines the motivation and design. However, it is missing some critical sections. I've added comments to suggest populating the 'Non-Goals', 'Risks and Mitigations', and 'Alternatives' sections. In particular, the risk of the status object becoming too large with many roles and replicas needs to be carefully considered and mitigated.
#### Risks and Mitigations
The Risks and Mitigations section is currently empty. It's crucial to identify potential risks and outline mitigation strategies. Please consider adding risks such as:
- Increased size of the `ModelServing` status object: With a large number of `ServingGroups` and `Roles`, the `RoleStatuses` array could become very large, potentially exceeding etcd's object size limits and increasing load on the API server. A mitigation could be to set a limit on the number of statuses reported or to summarize them if the list grows too large.
- Increased controller load: Collecting and comparing role statuses for every reconciliation loop could increase the controller's CPU and memory usage, especially for large-scale deployments.
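The size mitigation suggested above (cap the number of reported statuses and summarize the rest) could look roughly like the following. This is only a sketch: the `RoleStatusInfo` fields, the `capRoleStatuses` helper, and the policy of keeping non-Running roles first are all assumptions for illustration, not the proposal's actual API.

```go
package main

import (
	"fmt"
	"sort"
)

// RoleStatusInfo is a hypothetical shape for one entry in status.roleStatuses.
// The field names are assumptions, not taken from the proposal.
type RoleStatusInfo struct {
	ServingGroup string
	Role         string
	Status       string // e.g. "Running", "Creating", "Deleting"
}

// capRoleStatuses returns at most limit entries, preferring roles that are
// not Running, and reports how many entries were left out.
func capRoleStatuses(all []RoleStatusInfo, limit int) ([]RoleStatusInfo, int) {
	if limit < 0 {
		limit = 0
	}
	sorted := make([]RoleStatusInfo, len(all))
	copy(sorted, all)
	// Stable sort: non-Running entries move to the front, and the original
	// order is preserved within each class.
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Status != "Running" && sorted[j].Status == "Running"
	})
	if len(sorted) <= limit {
		return sorted, 0
	}
	return sorted[:limit], len(sorted) - limit
}

func main() {
	statuses := []RoleStatusInfo{
		{"group-0", "prefill", "Running"},
		{"group-0", "decode", "Creating"},
		{"group-1", "prefill", "Running"},
		{"group-1", "decode", "Deleting"},
	}
	kept, dropped := capRoleStatuses(statuses, 2)
	fmt.Println(kept, dropped)
}
```

A dropped count like this could be surfaced as a single summary field, so users can still tell that the list was truncated.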
- Enable users to identify which roles are not yet activated (Creating/Deleting states)
- Maintain consistency with existing ServingGroup status display pattern

#### Non-Goals
The Non-Goals section is currently empty. It would be beneficial to explicitly state what is out of scope for this proposal to clarify its boundaries. For example:
- Exposing detailed pod-level information (e.g., pod IPs, node names).
- Modifying the role lifecycle management, as this proposal focuses only on status reporting.
- Providing metrics or events for role status changes (if that's handled separately).
- `TestUpdateModelServingStatusRoleStatuses`: Validates `status.roleStatuses` population across scenarios (Running, Creating, Deleting states; multiple ServingGroups; skipping Deleting groups; empty cases)
- `TestUpdateModelServingStatusRoleStatusesChangeDetection`: Validates roleStatuses update only when status changes
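
For readers unfamiliar with the change-detection behavior the second test targets, a minimal sketch follows. It assumes a hypothetical `RoleStatusInfo` with three string fields and a hypothetical `roleStatusesChanged` helper that ignores ordering; the controller's real comparison may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// RoleStatusInfo is a hypothetical entry type; the fields are assumptions.
type RoleStatusInfo struct {
	ServingGroup string
	Role         string
	Status       string
}

// sortedCopy returns a copy ordered by (ServingGroup, Role) so that two
// slices can be compared without depending on collection order.
func sortedCopy(in []RoleStatusInfo) []RoleStatusInfo {
	out := make([]RoleStatusInfo, len(in))
	copy(out, in)
	sort.Slice(out, func(i, j int) bool {
		if out[i].ServingGroup != out[j].ServingGroup {
			return out[i].ServingGroup < out[j].ServingGroup
		}
		return out[i].Role < out[j].Role
	})
	return out
}

// roleStatusesChanged reports whether the desired statuses differ from the
// current ones; a status update would be written only when this is true.
func roleStatusesChanged(current, desired []RoleStatusInfo) bool {
	if len(current) != len(desired) {
		return true
	}
	a, b := sortedCopy(current), sortedCopy(desired)
	for i := range a {
		if a[i] != b[i] {
			return true
		}
	}
	return false
}

func main() {
	current := []RoleStatusInfo{{"group-0", "decode", "Creating"}}
	desired := []RoleStatusInfo{{"group-0", "decode", "Running"}}
	fmt.Println(roleStatusesChanged(current, desired))
}
```

Sorting copies rather than the inputs keeps the helper free of side effects, at the cost of an extra allocation per reconcile.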

### Alternatives
The Alternatives section is empty. Discussing alternative designs and explaining why the proposed solution was chosen strengthens the proposal. For example, you could consider and discuss the trade-offs of:
- Exposing status via a separate CRD: A new `ModelServingRoleStatus` CRD could hold this information. This would avoid bloating the main `ModelServing` object but might be harder for users to discover and correlate.
- Using Kubernetes events: Role status changes could be published as events. This is less declarative and might be harder to inspect for the current state.
// RoleStatuses track the status of roles across all ServingGroups.
// This allows users to view which roles have not been activated.
// +optional
RoleStatuses []RoleStatusInfo `json:"roleStatuses,omitempty"`
Updating the slice this way looks very inefficient.
How about emitting k8s events instead?
Added a recorder to emit role status events.
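
As a rough illustration of what emitting role status events on transitions might involve (not the code in this PR), the sketch below uses a local `eventRecorder` interface as a simplified stand-in for a Kubernetes event recorder. The real client-go `record.EventRecorder` also takes the involved object as its first argument. The `emitRoleTransitions` helper, the `RoleStatusChanged` reason, and the `group/role` map keys are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// eventRecorder is a simplified, hypothetical stand-in for a Kubernetes
// event recorder; it omits the involved-object argument.
type eventRecorder interface {
	Eventf(eventType, reason, messageFmt string, args ...interface{})
}

// memoryRecorder collects events in memory so the sketch runs without a cluster.
type memoryRecorder struct{ events []string }

func (m *memoryRecorder) Eventf(eventType, reason, messageFmt string, args ...interface{}) {
	m.events = append(m.events, eventType+" "+reason+" "+fmt.Sprintf(messageFmt, args...))
}

// emitRoleTransitions records one event for every role whose status in curr
// differs from prev. Keys are assumed to look like "servingGroup/role".
func emitRoleTransitions(rec eventRecorder, prev, curr map[string]string) {
	keys := make([]string, 0, len(curr))
	for k := range curr {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic event order
	for _, k := range keys {
		if prev[k] != curr[k] {
			rec.Eventf("Normal", "RoleStatusChanged", "role %s: %q -> %q", k, prev[k], curr[k])
		}
	}
}

func main() {
	rec := &memoryRecorder{}
	prev := map[string]string{"group-0/decode": "Creating"}
	curr := map[string]string{"group-0/decode": "Running"}
	emitRoleTransitions(rec, prev, curr)
	fmt.Println(rec.events)
}
```

Emitting only on transitions keeps event volume proportional to actual state changes rather than to the number of reconcile loops.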
Signed-off-by: zhoujinyu <[email protected]>
What type of PR is this?
/kind documentation
What this PR does / why we need it:
Which issue(s) this PR fixes:
Part of #603