feat: dynamic ratelimiter for gracefuleviction #6675
Open: whosefriendA wants to merge 18 commits into karmada-io:master from whosefriendA:feat/graceful_eviction_dynamicratelimit
Commits (all by whosefriendA):
4c681e8 feat: dynamic ratelimiter for gracefuleviction
a41b7e0 fix lint
cc23d45 fix spell
e133324 fix lint
ac95f1d refactor: change evictionoption to failover.go
db9e0fb remove unuse define
0658c78 fix lint
ed2ebda fix lint
b5a6c2b fix some comment
f60f68e Add test
00a6d3c fix lint
0648e81 fix lint
105ba4f fix a code hygiene
e29b12c remove evictionworker interface
5db74da fix: change the wrong enqueue rate
d94e6f4 fix lint
957fe71 fix lint
d4d9ad8 fix:add informermanager init
@@ -0,0 +1,148 @@
/*
Copyright 2025 The Karmada Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package cluster

import (
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/util/workqueue"
	"k8s.io/klog/v2"

	clusterv1alpha1 "github.com/karmada-io/karmada/pkg/apis/cluster/v1alpha1"
	"github.com/karmada-io/karmada/pkg/metrics"
	"github.com/karmada-io/karmada/pkg/sharedcli/ratelimiterflag"
	"github.com/karmada-io/karmada/pkg/util"
	"github.com/karmada-io/karmada/pkg/util/fedinformer/genericmanager"
)

// maxEvictionDelay is the maximum delay for eviction when the rate is 0
const maxEvictionDelay = 1000 * time.Second

// DynamicRateLimiter adjusts its rate based on the overall health of clusters.
// It implements the workqueue.TypedRateLimiter[T] interface with dynamic behavior.
type DynamicRateLimiter[T comparable] struct {
	resourceEvictionRate          float32
	secondaryResourceEvictionRate float32
	unhealthyClusterThreshold     float32
	largeClusterNumThreshold      int
	informerManager               genericmanager.SingleClusterInformerManager
}

// NewDynamicRateLimiter creates a new DynamicRateLimiter with the given options.
func NewDynamicRateLimiter[T comparable](informerManager genericmanager.SingleClusterInformerManager, opts EvictionQueueOptions) workqueue.TypedRateLimiter[T] {
	return &DynamicRateLimiter[T]{
		resourceEvictionRate:          opts.ResourceEvictionRate,
		secondaryResourceEvictionRate: opts.SecondaryResourceEvictionRate,
		unhealthyClusterThreshold:     opts.UnhealthyClusterThreshold,
		largeClusterNumThreshold:      opts.LargeClusterNumThreshold,
		informerManager:               informerManager,
	}
}

// When determines how long to wait before processing an item.
// Returns a longer delay when the system is unhealthy.
func (d *DynamicRateLimiter[T]) When(_ T) time.Duration {
	currentRate := d.getCurrentRate()
	if currentRate == 0 {
		return maxEvictionDelay
	}
	return time.Duration(1 / currentRate * float32(time.Second))
}

// getCurrentRate calculates the appropriate rate based on cluster health:
//   - Normal rate when system is healthy
//   - Secondary rate when system is unhealthy but large-scale
//   - Zero (halt evictions) when system is unhealthy and small-scale
func (d *DynamicRateLimiter[T]) getCurrentRate() float32 {
Review comment: Can we add test code for these new additions?
Reply: Sure, I will upload the test code to this PR.
	clusterGVR := schema.GroupVersionResource{
		Group:    clusterv1alpha1.GroupName,
		Version:  "v1alpha1",
		Resource: "clusters",
	}

	var lister = d.informerManager.Lister(clusterGVR)
	if lister == nil {
		klog.Errorf("Failed to get cluster lister, halting eviction for safety")
		return 0
	}

	clusters, err := lister.List(labels.Everything())
	if err != nil {
		klog.Errorf("Failed to list clusters from informer cache: %v, halting eviction for safety", err)
		return 0
	}

	totalClusters := len(clusters)
	if totalClusters == 0 {
		return d.resourceEvictionRate
	}

	unhealthyClusters := 0
	for _, clusterObj := range clusters {
		cluster, ok := clusterObj.(*clusterv1alpha1.Cluster)
		if !ok {
			continue
		}
		if !util.IsClusterReady(&cluster.Status) {
			unhealthyClusters++
		}
	}

	// Update metrics
	failureRate := float32(unhealthyClusters) / float32(totalClusters)
	metrics.RecordClusterHealthMetrics(unhealthyClusters, float64(failureRate))

	// Determine rate based on health status
	isUnhealthy := failureRate > d.unhealthyClusterThreshold
	if !isUnhealthy {
		return d.resourceEvictionRate
	}

	isLargeScale := totalClusters > d.largeClusterNumThreshold
	if isLargeScale {
		klog.V(2).Infof("System is unhealthy (failure rate: %.2f), downgrading eviction rate to secondary rate: %.2f/s",
			failureRate, d.secondaryResourceEvictionRate)
		return d.secondaryResourceEvictionRate
	}

	klog.V(2).Infof("System is unhealthy (failure rate: %.2f) and instance is small, halting eviction.", failureRate)
	return 0
}

// Forget is a no-op as this rate limiter doesn't track individual items.
func (d *DynamicRateLimiter[T]) Forget(_ T) {
	// No-op
}

// NumRequeues always returns 0 as this rate limiter doesn't track retries.
func (d *DynamicRateLimiter[T]) NumRequeues(_ T) int {
	return 0
}

// NewGracefulEvictionRateLimiter creates a combined rate limiter for eviction.
// It uses the maximum delay from both dynamic and default rate limiters to ensure
// both cluster health and retry backoff are considered.
func NewGracefulEvictionRateLimiter[T comparable](
	informerManager genericmanager.SingleClusterInformerManager,
	evictionOpts EvictionQueueOptions,
	rateLimiterOpts ratelimiterflag.Options) workqueue.TypedRateLimiter[T] {
	dynamicLimiter := NewDynamicRateLimiter[T](informerManager, evictionOpts)
	defaultLimiter := ratelimiterflag.DefaultControllerRateLimiter[T](rateLimiterOpts)
	return workqueue.NewTypedMaxOfRateLimiter[T](dynamicLimiter, defaultLimiter)
}
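The review on getCurrentRate asked for tests. One way to make the three-tier policy testable without an informer is to separate counting the clusters from choosing the rate. The following is a hypothetical sketch, not code from this PR: evictionOptions and chooseRate are illustrative names, the fields mirror the four EvictionQueueOptions values that NewDynamicRateLimiter copies, and the metrics and log calls are left out.

```go
package main

// evictionOptions is a hypothetical stand-in for the four EvictionQueueOptions
// fields that NewDynamicRateLimiter copies into DynamicRateLimiter.
type evictionOptions struct {
	ResourceEvictionRate          float32
	SecondaryResourceEvictionRate float32
	UnhealthyClusterThreshold     float32
	LargeClusterNumThreshold      int
}

// chooseRate is a hypothetical, informer-free copy of the decision made at the
// end of getCurrentRate. It returns the eviction rate in items per second.
func chooseRate(total, unhealthy int, opts evictionOptions) float32 {
	// No clusters known: use the normal rate, as getCurrentRate does.
	if total == 0 {
		return opts.ResourceEvictionRate
	}
	failureRate := float32(unhealthy) / float32(total)
	// Healthy: getCurrentRate only treats failureRate > threshold as unhealthy.
	if failureRate <= opts.UnhealthyClusterThreshold {
		return opts.ResourceEvictionRate
	}
	// Unhealthy but large-scale: downgrade to the secondary rate.
	if total > opts.LargeClusterNumThreshold {
		return opts.SecondaryResourceEvictionRate
	}
	// Unhealthy and small-scale: halt evictions.
	return 0
}
```

With illustrative values such as a threshold of 0.5 and a size threshold of 10, exactly half of the clusters being unhealthy still counts as healthy, because the comparison is a strict `>`; likewise a fleet of exactly LargeClusterNumThreshold clusters is treated as small-scale, so evictions halt.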
Review comment: Does this variable mean that one resource is evicted every 1000 seconds?
Reply: `const maxEvictionDelay = 1000 * time.Second` is the maximum wait the queue imposes on an element when the computed currentRate == 0, i.e. when eviction should be paused. Rather than setting a rate of "one eviction every 1000 seconds," it uses a long delay to pause processing, waking up after 1000 seconds to re-evaluate the rate. Normally, the delay is `time.Duration(1 / currentRate * float32(time.Second))`; for example, if currentRate = 5/s, When returns a delay of 200ms for each element.