-
Notifications
You must be signed in to change notification settings - Fork 1k
feat: dynamic ratelimiter for gracefuleviction #6675
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
feat: dynamic ratelimiter for gracefuleviction #6675
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @whosefriendA, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant enhancement to Karmada's graceful eviction mechanism by implementing a dynamic rate limiter. The core purpose is to optimize resource eviction during cluster failures, ensuring system stability by adjusting the eviction rate based on the overall health of the clusters. This feature includes the development of a new, failure-optimized eviction queue, along with robust dynamic rate control and comprehensive observability. The changes allow for a more resilient and adaptive system when dealing with unhealthy clusters, preventing aggressive evictions that could exacerbate issues.
Highlights
- Dynamic Eviction Rate Limiting: A new
DynamicRateLimiter
has been implemented, which intelligently adjusts the rate of resource evictions based on the real-time health status of the clusters. This ensures that evictions are performed gracefully, preventing cascading failures during periods of cluster instability. - Specialized Eviction Worker: A dedicated
EvictionWorker
has been introduced to manage graceful eviction operations. This worker integrates the new dynamic rate limiter and is designed to enhance the robustness of the eviction process. - Configurable Eviction Queue Options: The behavior of the dynamic eviction queue is now highly configurable through new
EvictionQueueOptions
. These options, exposed via command-line flags, allow administrators to fine-tune parameters such as resource eviction rates, thresholds for unhealthy clusters, and considerations for large-scale deployments. - Enhanced Observability and Metrics: Comprehensive observability has been added for the new dynamic eviction queue. New Prometheus metrics track key indicators such as eviction queue depth, processing latency, counts by resource kind, and overall cluster health (failure rate and faulty cluster count), providing critical insights into system performance.
- Integration with Taint Manager: The
NoExecuteTaintManager
has been updated to seamlessly integrate the new dynamic eviction worker, ensuring that the graceful eviction mechanism is applied consistently across resource and cluster resource bindings.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command>
or @gemini-code-assist <command>
. Below is a summary of the supported commands.
Feature | Command | Description |
---|---|---|
Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/
folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a dynamic rate limiter for graceful eviction, a significant enhancement for Karmada's stability. The rate limiter adjusts eviction speed based on the overall health of member clusters, preventing cascading failures. The implementation is well-structured, with clear separation of concerns for the rate limiter logic, a new instrumented eviction worker, and configuration options. The addition of comprehensive Prometheus metrics for the new components is also a major plus for observability. I've found one critical issue that needs to be addressed and a minor one to improve usability.
d7e5278
to
f7f4225
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6675 +/- ##
==========================================
+ Coverage 45.74% 45.86% +0.11%
==========================================
Files 689 691 +2
Lines 57174 57409 +235
==========================================
+ Hits 26157 26328 +171
- Misses 29389 29452 +63
- Partials 1628 1629 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
3e7793f
to
f7928e7
Compare
3dcbfb8
to
3d141f5
Compare
Hi @whosefriendA, all E2E tests have failed. Have you analyzed the root cause of the errors? |
// - Normal rate when system is healthy | ||
// - Secondary rate when system is unhealthy but large-scale | ||
// - Zero (halt evictions) when system is unhealthy and small-scale | ||
func (d *DynamicRateLimiter[T]) getCurrentRate() float32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add test code for these new additions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure ,I will upload the test code to this pr
"github.com/karmada-io/karmada/pkg/util/fedinformer/genericmanager" | ||
) | ||
|
||
// maxEvictionDelay is the maximum delay for eviction when the rate is 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable refers to the eviction of one resource every 1000 seconds, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The maxEvictionDelay := 1000 * time.Second specifies the maximum wait time the queue will impose on an element when the calculated currentRate == 0 (i.e., eviction should be paused). Rather than setting a rate of "one evict every 1000 seconds," it uses a long delay to pause processing, waking up after 1000 seconds to re-evaluate the rate. Normally, the processing interval is time.Duration(1/currentRate * time.Second). For example, if currentRate = 5/s, the queue will process one element every 200ms.
2c436ec
to
005dc7d
Compare
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a dynamic rate limiter for graceful eviction of resources, which adjusts its speed based on the overall health of the cluster federation. This is a valuable feature for improving system stability during widespread cluster failures. The implementation is solid, introducing a new EvictionWorker
and DynamicRateLimiter
along with comprehensive tests and metrics. However, I've identified a critical bug where a necessary InformerManager
is not initialized, which would cause a panic at runtime. I've also included several medium-severity suggestions to improve validation, code clarity, and robustness. Addressing these points will make the new feature more reliable and maintainable.
clusterGVR := schema.GroupVersionResource{ | ||
Group: clusterv1alpha1.GroupName, | ||
Version: "v1alpha1", | ||
Resource: "clusters", | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency and better maintainability, consider defining clusterGVR
using clusterv1alpha1.SchemeGroupVersion.WithResource("clusters")
as is done in the corresponding test file. This avoids hardcoding the group and version strings.
clusterGVR := clusterv1alpha1.SchemeGroupVersion.WithResource("clusters")
func (w *evictionWorker) Add(item interface{}) { | ||
if item == nil { | ||
klog.Warningf("Ignore nil item from queue") | ||
return | ||
} | ||
|
||
w.queue.Add(item) | ||
metrics.RecordEvictionQueueMetrics(w.name, float64(w.queue.Len())) | ||
|
||
// Update resource kind metrics if possible | ||
if w.resourceKindFunc != nil { | ||
clusterName, resourceKind := w.resourceKindFunc(item) | ||
metrics.RecordEvictionKindMetrics(clusterName, resourceKind, true) | ||
} | ||
} | ||
|
||
// AddAfter adds an item to the queue after a delay and updates metrics. | ||
func (w *evictionWorker) AddAfter(item interface{}, duration time.Duration) { | ||
if item == nil { | ||
klog.Warningf("Ignore nil item from queue") | ||
return | ||
} | ||
|
||
w.queue.AddAfter(item, duration) | ||
metrics.RecordEvictionQueueMetrics(w.name, float64(w.queue.Len())) | ||
|
||
// Update resource kind metrics if possible | ||
if w.resourceKindFunc != nil { | ||
clusterName, resourceKind := w.resourceKindFunc(item) | ||
metrics.RecordEvictionKindMetrics(clusterName, resourceKind, true) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
// processNextWorkItem handles a single item from the queue with metrics tracking. | ||
// Returns false when the queue is shutting down, true otherwise. | ||
func (w *evictionWorker) processNextWorkItem(_ context.Context) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The processNextWorkItem
function receives a context.Context
but it is ignored (named _
). The reconcileFunc
that is called does not accept a context, which leads to context.TODO()
being used deeper in the call stack (e.g., in syncBindingEviction
).
This is a missed opportunity for proper cancellation propagation. Consider refactoring ReconcileFunc
and the subsequent call chain to accept and use the context, which would allow for more responsive shutdowns and better resource management.
005dc7d
to
5931aef
Compare
f433fbb
to
e9c803d
Compare
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
e9c803d
to
5db74da
Compare
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Signed-off-by: whosefriendA <[email protected]>
Hi @whosefriendA, can you help squash the commits into one? |
What type of PR is this?
/kind feature
What this PR does / why we need it:
Developed a cluster failure optimized eviction queue and its dynamic rate control and observability based on the design document.
design document: #6613