Problem Scenario
When an etcd server receives a large number of write requests and its apply rate cannot keep up (for example, when the Kubernetes API server writes a large number of Event objects, creates a large number of Pods in bulk, or many Pods restart unexpectedly at once), the server may return ErrTooManyRequests because of the logic shown below. This is understood to be a protective mechanism. However, the current strategy can lead to catastrophic consequences.
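func (s *EtcdServer) processInternalRaftRequestOnce(ctx context.Context, r pb.InternalRaftRequest) (*apply2.Result, error) {
    // ai is the applied index and ci is the committed index (surrounding code omitted).
    if ci > ai+maxGapBetweenApplyAndCommitIndex {
        return nil, errors.ErrTooManyRequests
    }
    // rest of the function omitted
}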
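To make the behaviour concrete, here is a minimal, self-contained sketch of the same comparison. It is not etcd code: `admit`, `maxGap`, and `errTooManyRequests` are hypothetical stand-ins, and 5000 is only an assumed value for the limit. The point it illustrates is that the request kind plays no part in the decision, so a LeaseRevoke is rejected under exactly the same condition as a Txn.

```go
package main

import (
	"errors"
	"fmt"
)

// maxGap stands in for maxGapBetweenApplyAndCommitIndex.
// The value 5000 is an assumption made for this illustration.
const maxGap uint64 = 5000

// errTooManyRequests is a local stand-in for errors.ErrTooManyRequests.
var errTooManyRequests = errors.New("too many requests")

// admit applies the same comparison as the excerpt above. The request kind
// is accepted only to show that it has no influence on the outcome.
func admit(appliedIndex, committedIndex uint64, kind string) error {
	_ = kind // intentionally unused: the check is indiscriminate
	if committedIndex > appliedIndex+maxGap {
		return errTooManyRequests
	}
	return nil
}

func main() {
	// The apply loop is 5001 entries behind the commit index,
	// so every request kind gets the same answer.
	for _, kind := range []string{"Txn", "LeaseRevoke", "Compact"} {
		fmt.Printf("%s: %v\n", kind, admit(100, 5101, kind))
	}
}
```

With the gap one entry over the assumed limit, all three kinds are rejected; with a gap of exactly 5000 all three would be admitted.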
The Kubernetes API server binds the storage of Event objects to Leases, so that the associated Events are cleaned up automatically once the Lease expires. Lease expiration is handled by the etcd leader, which polls the Leases and issues LeaseRevoke requests. If the etcd cluster is in the protective state described above at that moment, the LeaseRevoke requests may never get a chance to execute (because Txn requests far outnumber LeaseRevoke requests). The number of keys then surges, which further degrades the apply rate and ultimately leaves etcd completely unavailable.
The core issue is that the current protection logic does not differentiate between user requests and internal system requests, so it rejects both indiscriminately. When internal system requests cannot be executed, the system state may deteriorate and eventually cause a system crash. LeaseRevoke is just one such internal system request. Compact requests have similar issues.
Proposal
- As discussed in "errors.ErrTooManyRequests and maxGapBetweenApplyAndCommitIndex" (#18175), a configurable maxGapBetweenApplyAndCommitIndex would be more flexible.
- Within the protection logic, reserve some queue space specifically for critical requests, so that essential system requests (LeaseRevoke, Compact) have a chance to be executed under system pressure, preventing system crashes. Demo code is shown below.
func (s *EtcdServer) processInternalRaftRequestOnce(ctx context.Context, r pb.InternalRaftRequest) (*apply2.Result, error) {
    if isTooLargeGap(ai, ci, &r) {
        return nil, errors.ErrTooManyRequests
    }
}
func isTooLargeGap(ai, ci uint64, r *pb.InternalRaftRequest) bool {
    isCriticalReq := r != nil && (r.Compact != nil || r.LeaseRevoke != nil)
    if isCriticalReq {
        // System-critical requests get a separate buffer of 500 entries
        // on top of the normal limit.
        return ci > ai+maxGapBetweenApplyAndCommitIndex+500
    }
    // Normal requests are rejected as soon as the gap exceeds the normal limit.
    return ci > ai+maxGapBetweenApplyAndCommitIndex
}
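The two proposals can also be combined. The following is a rough, self-contained sketch under stated assumptions, not a patch against etcd: `request` is a hypothetical stand-in for pb.InternalRaftRequest that models only the two fields the check reads, and `gapLimits` with its `maxGap` and `criticalReserve` fields is an invented configuration holder. The values 5000 and 500 are illustrative.

```go
package main

import "fmt"

// request is a hypothetical stand-in for pb.InternalRaftRequest.
// Only the fields inspected by the check are modelled.
type request struct {
	Compact     *struct{}
	LeaseRevoke *struct{}
}

// gapLimits is a hypothetical configuration holder, not an existing etcd type.
type gapLimits struct {
	maxGap          uint64 // limit applied to ordinary requests
	criticalReserve uint64 // extra headroom granted to critical requests
}

// tooLarge reports whether a request should be rejected given the applied
// index ai and the committed index ci.
func (l gapLimits) tooLarge(ai, ci uint64, r *request) bool {
	limit := l.maxGap
	if r != nil && (r.Compact != nil || r.LeaseRevoke != nil) {
		// Critical system requests may use the reserved headroom.
		limit += l.criticalReserve
	}
	return ci > ai+limit
}

func main() {
	limits := gapLimits{maxGap: 5000, criticalReserve: 500}
	txn := &request{}
	revoke := &request{LeaseRevoke: &struct{}{}}

	fmt.Println(limits.tooLarge(0, 5200, txn))    // true: over the normal limit
	fmt.Println(limits.tooLarge(0, 5200, revoke)) // false: within the reserve
	fmt.Println(limits.tooLarge(0, 5600, revoke)) // true: the reserve is exhausted too
}
```

Keeping both limits in one struct means the reserve scales naturally if the base limit is made configurable, and the critical path still has an upper bound so it cannot grow the backlog without limit.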