Skip to content

Conversation

whosefriendA
Copy link

What type of PR is this?

/kind feature

What this PR does / why we need it:

Developed a cluster failure optimized eviction queue and its dynamic rate control and observability based on the design document.
design document: #6613

@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 21, 2025
@karmada-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign xishanyongye-chang for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Aug 21, 2025
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @whosefriendA, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to Karmada's graceful eviction mechanism by implementing a dynamic rate limiter. The core purpose is to optimize resource eviction during cluster failures, ensuring system stability by adjusting the eviction rate based on the overall health of the clusters. This feature includes the development of a new, failure-optimized eviction queue, along with robust dynamic rate control and comprehensive observability. The changes allow for a more resilient and adaptive system when dealing with unhealthy clusters, preventing aggressive evictions that could exacerbate issues.

Highlights

  • Dynamic Eviction Rate Limiting: A new DynamicRateLimiter has been implemented, which intelligently adjusts the rate of resource evictions based on the real-time health status of the clusters. This ensures that evictions are performed gracefully, preventing cascading failures during periods of cluster instability.
  • Specialized Eviction Worker: A dedicated EvictionWorker has been introduced to manage graceful eviction operations. This worker integrates the new dynamic rate limiter and is designed to enhance the robustness of the eviction process.
  • Configurable Eviction Queue Options: The behavior of the dynamic eviction queue is now highly configurable through new EvictionQueueOptions. These options, exposed via command-line flags, allow administrators to fine-tune parameters such as resource eviction rates, thresholds for unhealthy clusters, and considerations for large-scale deployments.
  • Enhanced Observability and Metrics: Comprehensive observability has been added for the new dynamic eviction queue. New Prometheus metrics track key indicators such as eviction queue depth, processing latency, counts by resource kind, and overall cluster health (failure rate and faulty cluster count), providing critical insights into system performance.
  • Integration with Taint Manager: The NoExecuteTaintManager has been updated to seamlessly integrate the new dynamic eviction worker, ensuring that the graceful eviction mechanism is applied consistently across resource and cluster resource bindings.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a dynamic rate limiter for graceful eviction, a significant enhancement for Karmada's stability. The rate limiter adjusts eviction speed based on the overall health of member clusters, preventing cascading failures. The implementation is well-structured, with clear separation of concerns for the rate limiter logic, a new instrumented eviction worker, and configuration options. The addition of comprehensive Prometheus metrics for the new components is also a major plus for observability. I've found one critical issue that needs to be addressed and a minor one to improve usability.

@whosefriendA whosefriendA force-pushed the feat/graceful_eviction_dynamicratelimit branch 2 times, most recently from d7e5278 to f7f4225 Compare August 21, 2025 11:52
@codecov-commenter
Copy link

codecov-commenter commented Aug 21, 2025

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 65.30612% with 85 lines in your changes missing coverage. Please review.
✅ Project coverage is 45.86%. Comparing base (a759915) to head (d4d9ad8).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
pkg/controllers/cluster/taint_manager.go 0.00% 36 Missing ⚠️
pkg/metrics/cluster.go 0.00% 26 Missing ⚠️
pkg/controllers/cluster/dynamic_rate_limiter.go 88.05% 6 Missing and 2 partials ⚠️
cmd/controller-manager/app/controllermanager.go 0.00% 7 Missing ⚠️
cmd/controller-manager/app/options/failover.go 0.00% 4 Missing ⚠️
pkg/controllers/cluster/eviction_worker.go 96.19% 3 Missing and 1 partial ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6675      +/-   ##
==========================================
+ Coverage   45.74%   45.86%   +0.11%     
==========================================
  Files         689      691       +2     
  Lines       57174    57409     +235     
==========================================
+ Hits        26157    26328     +171     
- Misses      29389    29452      +63     
- Partials     1628     1629       +1     
Flag Coverage Δ
unittests 45.86% <65.30%> (+0.11%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@whosefriendA whosefriendA force-pushed the feat/graceful_eviction_dynamicratelimit branch 2 times, most recently from 3e7793f to f7928e7 Compare August 24, 2025 16:07
@whosefriendA whosefriendA force-pushed the feat/graceful_eviction_dynamicratelimit branch 4 times, most recently from 3dcbfb8 to 3d141f5 Compare September 19, 2025 06:46
@XiShanYongYe-Chang
Copy link
Member

Hi @whosefriendA, all E2E tests have failed. Have you analyzed the root cause of the errors?

// - Normal rate when system is healthy
// - Secondary rate when system is unhealthy but large-scale
// - Zero (halt evictions) when system is unhealthy and small-scale
func (d *DynamicRateLimiter[T]) getCurrentRate() float32 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add test code for these new additions?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure ,I will upload the test code to this pr

"github.com/karmada-io/karmada/pkg/util/fedinformer/genericmanager"
)

// maxEvictionDelay is the maximum delay for eviction when the rate is 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable refers to the eviction of one resource every 1000 seconds, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The maxEvictionDelay := 1000 * time.Second specifies the maximum wait time the queue will impose on an element when the calculated currentRate == 0 (i.e., eviction should be paused). Rather than setting a rate of "one evict every 1000 seconds," it uses a long delay to pause processing, waking up after 1000 seconds to re-evaluate the rate. Normally, the processing interval is time.Duration(1/currentRate * time.Second). For example, if currentRate = 5/s, the queue will process one element every 200ms.

@whosefriendA whosefriendA force-pushed the feat/graceful_eviction_dynamicratelimit branch 2 times, most recently from 2c436ec to 005dc7d Compare September 23, 2025 05:09
@whosefriendA
Copy link
Author

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a dynamic rate limiter for graceful eviction of resources, which adjusts its speed based on the overall health of the cluster federation. This is a valuable feature for improving system stability during widespread cluster failures. The implementation is solid, introducing a new EvictionWorker and DynamicRateLimiter along with comprehensive tests and metrics. However, I've identified a critical bug where a necessary InformerManager is not initialized, which would cause a panic at runtime. I've also included several medium-severity suggestions to improve validation, code clarity, and robustness. Addressing these points will make the new feature more reliable and maintainable.

Comment on lines +73 to +77
clusterGVR := schema.GroupVersionResource{
Group: clusterv1alpha1.GroupName,
Version: "v1alpha1",
Resource: "clusters",
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For consistency and better maintainability, consider defining clusterGVR using clusterv1alpha1.SchemeGroupVersion.WithResource("clusters") as is done in the corresponding test file. This avoids hardcoding the group and version strings.

	clusterGVR := clusterv1alpha1.SchemeGroupVersion.WithResource("clusters")

Comment on lines +125 to +157
func (w *evictionWorker) Add(item interface{}) {
if item == nil {
klog.Warningf("Ignore nil item from queue")
return
}

w.queue.Add(item)
metrics.RecordEvictionQueueMetrics(w.name, float64(w.queue.Len()))

// Update resource kind metrics if possible
if w.resourceKindFunc != nil {
clusterName, resourceKind := w.resourceKindFunc(item)
metrics.RecordEvictionKindMetrics(clusterName, resourceKind, true)
}
}

// AddAfter adds an item to the queue after a delay and updates metrics.
func (w *evictionWorker) AddAfter(item interface{}, duration time.Duration) {
if item == nil {
klog.Warningf("Ignore nil item from queue")
return
}

w.queue.AddAfter(item, duration)
metrics.RecordEvictionQueueMetrics(w.name, float64(w.queue.Len()))

// Update resource kind metrics if possible
if w.resourceKindFunc != nil {
clusterName, resourceKind := w.resourceKindFunc(item)
metrics.RecordEvictionKindMetrics(clusterName, resourceKind, true)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The methods Add and AddAfter contain duplicated code for recording metrics. To improve maintainability and reduce redundancy, you could extract this logic into a private helper method.


// processNextWorkItem handles a single item from the queue with metrics tracking.
// Returns false when the queue is shutting down, true otherwise.
func (w *evictionWorker) processNextWorkItem(_ context.Context) bool {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The processNextWorkItem function receives a context.Context but it is ignored (named _). The reconcileFunc that is called does not accept a context, which leads to context.TODO() being used deeper in the call stack (e.g., in syncBindingEviction).

This is a missed opportunity for proper cancellation propagation. Consider refactoring ReconcileFunc and the subsequent call chain to accept and use the context, which would allow for more responsive shutdowns and better resource management.

@whosefriendA whosefriendA force-pushed the feat/graceful_eviction_dynamicratelimit branch from 005dc7d to 5931aef Compare September 23, 2025 12:24
@whosefriendA whosefriendA force-pushed the feat/graceful_eviction_dynamicratelimit branch 3 times, most recently from f433fbb to e9c803d Compare September 24, 2025 14:56
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
@whosefriendA whosefriendA force-pushed the feat/graceful_eviction_dynamicratelimit branch from e9c803d to 5db74da Compare September 26, 2025 11:19
@karmada-bot karmada-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 26, 2025
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
@XiShanYongYe-Chang
Copy link
Member

Hi @whosefriendA, can you help squash the commits into one?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants