@aabhinavvvvvvv

Enable rolling updates for networkTopology and gangPolicy changes

  • Modified revision calculation to hash entire Spec.Template instead of just Roles
  • Added GetTemplateFromControllerRevision() for backward-compatible template extraction
  • NetworkTopology and GangPolicy changes now trigger rolling updates as expected
  • Added comprehensive unit tests for revision change detection
  • Created example YAML demonstrating networkTopology rolling updates
  • Updated documentation explaining rolling update triggers

Fixes #690

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR enables automatic rolling updates when networkTopology or gangPolicy fields are modified in a ModelServing resource, aligning with Kubernetes' declarative API philosophy.

Current Problem:

  • Changing spec.template.networkTopology updates the PodGroup but does not reschedule existing pods
  • Users must manually delete pods to apply new topology constraints
  • This violates the declarative API principle where spec changes should reconcile cluster state

Solution:

  • Changed revision hash calculation from Spec.Template.Roles to Spec.Template (entire template)
  • NetworkTopology and GangPolicy changes now trigger rolling updates via revision detection
  • Added backward-compatible GetTemplateFromControllerRevision() function
  • Comprehensive test coverage ensures correctness

Key Features:

  • NetworkTopology changes (groupPolicy, rolePolicy) trigger rolling updates
  • GangPolicy changes trigger rolling updates (primarily when gang scheduling is first added)
  • Role.Replicas changes are still treated as scaling operations (not rolling updates)
  • Fully backward compatible with existing ControllerRevisions
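
The revision behavior described above can be sketched roughly as follows. This is a minimal illustration, not the controller's actual code: the types are simplified, hypothetical stand-ins for the ModelServing API, `revisionHash` is an invented helper name, and clearing `Replicas` before hashing is only one assumed way to keep scaling from producing a new revision.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// Hypothetical, simplified stand-ins for the ModelServing API types.
type Role struct {
	Name     string `json:"name"`
	Replicas *int32 `json:"replicas,omitempty"`
	Image    string `json:"image"`
}

type NetworkTopology struct {
	GroupPolicy string `json:"groupPolicy,omitempty"`
	RolePolicy  string `json:"rolePolicy,omitempty"`
}

type GangPolicy struct {
	MinRoleReplicas map[string]int32 `json:"minRoleReplicas,omitempty"`
}

type ServingGroupTemplate struct {
	Roles           []Role           `json:"roles"`
	NetworkTopology *NetworkTopology `json:"networkTopology,omitempty"`
	GangPolicy      *GangPolicy      `json:"gangPolicy,omitempty"`
}

// revisionHash hashes the whole template rather than only its roles, so a
// change to networkTopology or gangPolicy yields a different revision.
// Replica counts are cleared on a copy first (an assumption of this sketch),
// so that scaling a role leaves the hash unchanged.
func revisionHash(t ServingGroupTemplate) (string, error) {
	normalized := t
	normalized.Roles = make([]Role, len(t.Roles))
	for i, r := range t.Roles {
		r.Replicas = nil // r is a copy; the caller's template is untouched
		normalized.Roles[i] = r
	}
	// json.Marshal emits struct fields in declaration order and sorts map
	// keys, so equal templates always serialize to the same bytes.
	data, err := json.Marshal(normalized)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]), nil
}
```

With this sketch, two templates that differ only in a role's replica count hash to the same value, while setting `NetworkTopology` on an otherwise identical template produces a different hash and would therefore be detected as a new revision.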

Which issue(s) this PR fixes:

Fixes #690

Special notes for your reviewer:

  1. Backward Compatibility: The new GetTemplateFromControllerRevision() function handles both old (roles-only) and new (full template) ControllerRevision formats. Existing deployments will continue to work without migration.

  2. GangPolicy Inclusion: While GangPolicy is mostly immutable after being set, including it in the revision hash is correct because:

    • Adding gang scheduling for the first time requires pods to operate under the new scheduling paradigm
    • The immutability constraint prevents accidental rolling updates after initial setup
  3. No Breaking Changes:

    • GetRolesFromControllerRevision() is deprecated but still functional (calls new function internally)
    • Revision calculation change is transparent to existing logic
    • All existing tests pass
  4. Test Coverage:

    • 7 new unit tests covering all networkTopology change scenarios
    • Tests verify that role.replicas changes do not create a new revision (maintains existing behavior)
    • Revision consistency test ensures deterministic hashing
  5. Files to Review:

    • controller_revision.go: New template extraction function (backward compatible)
    • revision_test.go: Comprehensive test coverage
    • model-serving-rolling-update.md: User-facing documentation
    • network-topology-rolling-update.yaml: Practical example with detailed comments
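
The backward-compatible extraction described in note 1 could look roughly like the sketch below. It is illustrative only: the types are simplified and hypothetical, `templateFromRevisionData` and `rolesFromRevisionData` are invented names standing in for the PR's `GetTemplateFromControllerRevision()` and `GetRolesFromControllerRevision()`, and it assumes that old revisions store a bare JSON array of roles while new revisions store a JSON object for the full template. The actual on-disk formats may differ.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Hypothetical, simplified stand-ins for the ModelServing API types.
type Role struct {
	Name string `json:"name"`
}

type ServingGroupTemplate struct {
	Roles           []Role          `json:"roles"`
	NetworkTopology json.RawMessage `json:"networkTopology,omitempty"`
	GangPolicy      json.RawMessage `json:"gangPolicy,omitempty"`
}

// templateFromRevisionData decodes raw revision data in either format.
// Assumption: legacy data is a JSON array of roles; new data is a JSON
// object holding the full template.
func templateFromRevisionData(raw []byte) (*ServingGroupTemplate, error) {
	trimmed := bytes.TrimSpace(raw)
	if len(trimmed) == 0 {
		return nil, fmt.Errorf("empty revision data")
	}
	tmpl := &ServingGroupTemplate{}
	if trimmed[0] == '[' {
		// Legacy roles-only revision: wrap the roles in an otherwise empty template.
		if err := json.Unmarshal(trimmed, &tmpl.Roles); err != nil {
			return nil, fmt.Errorf("decode legacy roles: %w", err)
		}
		return tmpl, nil
	}
	if err := json.Unmarshal(trimmed, tmpl); err != nil {
		return nil, fmt.Errorf("decode template: %w", err)
	}
	return tmpl, nil
}

// rolesFromRevisionData mirrors the deprecated roles-only accessor by
// delegating to the template decoder.
func rolesFromRevisionData(raw []byte) ([]Role, error) {
	tmpl, err := templateFromRevisionData(raw)
	if err != nil {
		return nil, err
	}
	return tmpl.Roles, nil
}
```

Keeping the roles-only accessor as a thin wrapper means existing callers see the same result for both formats, which matches the "deprecated but still functional" behavior listed under No Breaking Changes.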

Does this PR introduce a user-facing change?:

ModelServing now automatically triggers rolling updates when `spec.template.networkTopology` or `spec.template.gangPolicy` is modified, enabling declarative network topology management with partition-based canary deployments.

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.


@volcano-sh-bot
Contributor

Welcome @aabhinavvvvvvv! It looks like this is your first PR to volcano-sh/kthena 🎉

@gemini-code-assist

Summary of Changes

Hello @aabhinavvvvvvv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the ModelServing controller's ability to manage declarative updates. Previously, modifications to networkTopology or gangPolicy fields did not automatically trigger rolling updates, requiring manual intervention. By adjusting the revision calculation to consider the entire Spec.Template, the controller now correctly detects these changes and initiates rolling updates, ensuring that the cluster state aligns with the desired specification without manual pod rescheduling. This improves the user experience and adheres more closely to Kubernetes' declarative principles.

Highlights

  • Enhanced Revision Calculation: The revision calculation now hashes the entire Spec.Template instead of just Spec.Template.Roles, ensuring that a broader range of template changes are detected.
  • Automatic Rolling Updates for Network Topology and Gang Policy: Changes to spec.template.networkTopology and spec.template.gangPolicy fields now correctly trigger automatic rolling updates, aligning with declarative API principles.
  • Backward-Compatible Template Extraction: A new function, GetTemplateFromControllerRevision(), has been introduced to extract the full ServingGroup template from ControllerRevision objects, while maintaining backward compatibility with older revisions that only stored roles.
  • Comprehensive Unit Testing: Extensive unit tests have been added to validate the new revision change detection logic for various networkTopology and gangPolicy modifications, and to ensure revision consistency.
  • Updated Documentation and Example: The developer guide has been updated to explain what triggers rolling updates, and a new example YAML file demonstrates how network topology changes initiate rolling updates.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is a great improvement, enabling rolling updates for networkTopology and gangPolicy changes. The core logic change to hash the entire spec.template is correct and well-implemented. I appreciate the attention to backward compatibility in controller_revision.go, which ensures existing deployments are not affected. The addition of comprehensive unit tests in revision_test.go and a practical example YAML file significantly improves the quality and usability of this feature. The documentation updates are also clear and helpful.

I have a couple of suggestions to improve the code further. One is to fix a broken link in the documentation, and the other is to refactor a small piece of duplicated code for better maintainability.

3. Perform a rolling update respecting the `rolloutStrategy` configuration
4. Reschedule pods according to the new topology constraints

For a complete example, see [network-topology-rolling-update.yaml](../assets/examples/model-serving/network-topology-rolling-update.yaml).


medium

The relative link to the example YAML file appears to be incorrect. Given the file structure, the path should be adjusted to correctly point to network-topology-rolling-update.yaml.

Suggested change (replace the first line with the second):
For a complete example, see [network-topology-rolling-update.yaml](../assets/examples/model-serving/network-topology-rolling-update.yaml).
For a complete example, see [network-topology-rolling-update.yaml](../../../../examples/model-serving/network-topology-rolling-update.yaml).

…icy changes

- Modified revision calculation to hash entire Spec.Template instead of just Roles
- Added comprehensive unit tests for revision change detection
- Created example YAML demonstrating networkTopology rolling updates
- Updated documentation explaining rolling update triggers

Signed-off-by: aabhinavvvvvvv <[email protected]>
@hzxuzhonghu
Member

Enable rolling updates for networkTopology and gangPolicy changes

@aabhinavvvvvvv This needs a deeper discussion: whether to allow updating these fields at all, and what the behavior should be if we do.

@aabhinavvvvvvv
Author

OK. I'll wait for updates and further guidance.

@hzxuzhonghu
Member

@aabhinavvvvvvv Thanks for understanding. We can discuss this further in the linked issue.

@hzxuzhonghu
Member

/hold for now

Development

Successfully merging this pull request may close these issues.

Modifying network topology triggers rollingUpdate of ModelServing.
