feat: Add top-level Flow Controller #1525
Changes from all commits: 109785c, 44efb45, 02c3f2e, 196e5eb, 16bc129
@@ -0,0 +1,110 @@
```go
/*
Copyright 2025 The Kubernetes Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package controller

import (
	"fmt"
	"time"
)

const (
	// defaultExpiryCleanupInterval is the default frequency for scanning for expired items.
	defaultExpiryCleanupInterval = 1 * time.Second
	// defaultProcessorReconciliationInterval is the default frequency for the supervisor loop.
	defaultProcessorReconciliationInterval = 5 * time.Second
	// defaultEnqueueChannelBufferSize is the default size of a worker's incoming request buffer.
	defaultEnqueueChannelBufferSize = 100
)

// Config holds the configuration for the `FlowController`.
type Config struct {
	// DefaultRequestTTL is the default Time-To-Live applied to requests that do not
	// specify their own TTL hint.
	// Optional: If zero, no TTL is applied by default and we rely solely on request context cancellation.
	DefaultRequestTTL time.Duration

	// ExpiryCleanupInterval is the interval at which each shard processor scans its queues for expired items.
	// Optional: Defaults to `defaultExpiryCleanupInterval` (1 second).
	ExpiryCleanupInterval time.Duration

	// ProcessorReconciliationInterval is the frequency at which the `FlowController`'s supervisor loop garbage
	// collects stale workers.
	// Optional: Defaults to `defaultProcessorReconciliationInterval` (5 seconds).
	ProcessorReconciliationInterval time.Duration

	// EnqueueChannelBufferSize is the size of the buffered channel that accepts incoming requests for each shard
	// processor. This buffer acts as a shock absorber, decoupling the high-frequency distributor from the
	// processor's serial execution loop and allowing the system to handle short bursts of traffic without blocking.
	// Optional: Defaults to `defaultEnqueueChannelBufferSize` (100).
	EnqueueChannelBufferSize int
}

// newConfig performs validation and initialization, returning a guaranteed-valid `Config` object.
// This is the required constructor for creating a new configuration.
// It does not mutate the input `cfg`.
```
OOC: Why not? Can we just log what the value was before it was defaulted?

We could do this, but I prefer this being a pure function: it is easier to test and less prone to unexpected side effects. I am also changing this in a follow-up PR and removing the requirement for this constructor to be invoked (it becomes a test-only convenience utility). In registry/config.go and controller/config.go I will be exposing a public ValidateAndApplyDefaults() method. Would you like me to absorb this refactoring (for at least the controller package) into this PR instead of a follow-up?

To keep this PR focused squarely on the controller implementation itself, I'd prefer to tackle the config changes in the immediate follow-up PR I have staged. In that PR, I implement the plan I mentioned: removing the newConfig constructor in favor of exposing a public ValidateAndApplyDefaults() method on the Config struct itself. This will make the EPP runner responsible for calling it, which is a cleaner, more explicit approach. Does that sound like a reasonable path forward? If so, I'll resolve this thread and proceed with the follow-up PR after this one merges.
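As an aside, here is a rough sketch of the shape that follow-up could take. The helper function below is hypothetical and not part of this PR; it simply illustrates the caller-driven validation/defaulting flow described in the thread, using the unexported method that does exist in this diff:

```go
package controller

import (
	"fmt"
	"time"
)

// buildConfig is a hypothetical caller-side helper illustrating the proposed
// follow-up: validation and defaulting are applied explicitly by the caller
// (e.g. the EPP runner) rather than hidden inside a constructor. In this PR
// the equivalent logic is the unexported validateAndApplyDefaults below; the
// follow-up proposes exporting it as ValidateAndApplyDefaults.
func buildConfig() (Config, error) {
	cfg := Config{DefaultRequestTTL: 30 * time.Second}
	if err := cfg.validateAndApplyDefaults(); err != nil {
		return Config{}, fmt.Errorf("invalid flow controller config: %w", err)
	}
	return cfg, nil
}
```

The constructor as it currently stands in this PR follows.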
```go
func newConfig(cfg Config) (*Config, error) {
	newCfg := cfg.deepCopy()
	if err := newCfg.validateAndApplyDefaults(); err != nil {
		return nil, err
	}
	return newCfg, nil
}

// validateAndApplyDefaults checks the global configuration for validity and then mutates the receiver to populate any
// empty fields with system defaults.
func (c *Config) validateAndApplyDefaults() error {
	// --- Validation ---
	if c.DefaultRequestTTL < 0 {
		return fmt.Errorf("DefaultRequestTTL cannot be negative, but got %v", c.DefaultRequestTTL)
	}
	if c.ExpiryCleanupInterval < 0 {
		return fmt.Errorf("ExpiryCleanupInterval cannot be negative, but got %v", c.ExpiryCleanupInterval)
	}
	if c.ProcessorReconciliationInterval < 0 {
		return fmt.Errorf("ProcessorReconciliationInterval cannot be negative, but got %v",
			c.ProcessorReconciliationInterval)
	}
	if c.EnqueueChannelBufferSize < 0 {
		return fmt.Errorf("EnqueueChannelBufferSize cannot be negative, but got %d", c.EnqueueChannelBufferSize)
	}

	// --- Defaulting ---
	if c.ExpiryCleanupInterval == 0 {
		c.ExpiryCleanupInterval = defaultExpiryCleanupInterval
	}
	if c.ProcessorReconciliationInterval == 0 {
		c.ProcessorReconciliationInterval = defaultProcessorReconciliationInterval
	}
	if c.EnqueueChannelBufferSize == 0 {
		c.EnqueueChannelBufferSize = defaultEnqueueChannelBufferSize
	}
	return nil
}

// deepCopy creates a deep copy of the `Config` object.
func (c *Config) deepCopy() *Config {
	if c == nil {
		return nil
	}
	newCfg := &Config{
		DefaultRequestTTL:               c.DefaultRequestTTL,
		ExpiryCleanupInterval:           c.ExpiryCleanupInterval,
		ProcessorReconciliationInterval: c.ProcessorReconciliationInterval,
		EnqueueChannelBufferSize:        c.EnqueueChannelBufferSize,
	}
	return newCfg
}
```
@@ -0,0 +1,135 @@
```go
/*
Copyright 2025 The Kubernetes Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package controller

import (
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestNewConfig(t *testing.T) {
	t.Parallel()

	testCases := []struct {
		name          string
		input         Config
		expectErr     bool
		expectedCfg   Config
		shouldDefault bool
	}{
		{
			name: "ValidConfig_NoChanges",
			input: Config{
				DefaultRequestTTL:               10 * time.Second,
				ExpiryCleanupInterval:           2 * time.Second,
				ProcessorReconciliationInterval: 10 * time.Second,
				EnqueueChannelBufferSize:        200,
			},
			expectErr: false,
			expectedCfg: Config{
				DefaultRequestTTL:               10 * time.Second,
				ExpiryCleanupInterval:           2 * time.Second,
				ProcessorReconciliationInterval: 10 * time.Second,
				EnqueueChannelBufferSize:        200,
			},
		},
		{
			name:      "EmptyConfig_ShouldApplyDefaults",
			input:     Config{},
			expectErr: false,
			expectedCfg: Config{
				DefaultRequestTTL:               0,
				ExpiryCleanupInterval:           defaultExpiryCleanupInterval,
				ProcessorReconciliationInterval: defaultProcessorReconciliationInterval,
				EnqueueChannelBufferSize:        defaultEnqueueChannelBufferSize,
			},
			shouldDefault: true,
		},
		{
			name:      "NegativeDefaultRequestTTL_Invalid",
			input:     Config{DefaultRequestTTL: -1},
			expectErr: true,
		},
		{
			name:      "NegativeExpiryCleanupInterval_Invalid",
			input:     Config{ExpiryCleanupInterval: -1},
			expectErr: true,
		},
		{
			name:      "NegativeProcessorReconciliationInterval_Invalid",
			input:     Config{ProcessorReconciliationInterval: -1},
			expectErr: true,
		},
		{
			name:      "NegativeEnqueueChannelBufferSize_Invalid",
			input:     Config{EnqueueChannelBufferSize: -1},
			expectErr: true,
		},
	}

	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			t.Parallel()
			originalInput := tc.input.deepCopy()
			validatedCfg, err := newConfig(tc.input)

			if tc.expectErr {
				require.Error(t, err, "expected an error but got nil")
				assert.Nil(t, validatedCfg, "validatedCfg should be nil on error")
			} else {
				require.NoError(t, err, "expected no error but got: %v", err)
				require.NotNil(t, validatedCfg, "validatedCfg should not be nil on success")
				assert.Equal(t, tc.expectedCfg, *validatedCfg, "validatedCfg should match expected config")

				// Ensure the original config is not mutated.
				assert.Equal(t, *originalInput, tc.input, "input config should not be mutated")
			}
		})
	}
}

func TestConfig_DeepCopy(t *testing.T) {
	t.Parallel()

	t.Run("ShouldReturnNil_ForNilReceiver", func(t *testing.T) {
		t.Parallel()
		var nilConfig *Config
		assert.Nil(t, nilConfig.deepCopy(), "Deep copy of a nil config should be nil")
	})

	t.Run("ShouldCreateIdenticalButSeparateObject", func(t *testing.T) {
		t.Parallel()
		original := &Config{
			DefaultRequestTTL:               1 * time.Second,
			ExpiryCleanupInterval:           2 * time.Second,
			ProcessorReconciliationInterval: 3 * time.Second,
			EnqueueChannelBufferSize:        4,
		}
		clone := original.deepCopy()

		require.NotSame(t, original, clone, "Clone should be a new object in memory")
		assert.Equal(t, *original, *clone, "Cloned object should have identical values")

		// Modify the clone and ensure the original is unchanged.
		clone.DefaultRequestTTL = 99 * time.Second
		assert.NotEqual(t, original.DefaultRequestTTL, clone.DefaultRequestTTL,
			"Original should not be mutated after clone is changed")
	})
}
```
What happens if this buffer overflows?
If running with data parallelism (more than one shard), the distribution algorithm selects the best candidate and attempts a non-blocking send. If that candidate's buffer is full, it immediately falls back to the next-best candidate, and so on. If every worker's buffer is full, we switch from a non-blocking to a blocking send on the best candidate.
As the type comment mentions, the buffer is simply a shock absorber for transient request bursts. The goal is to get the request enqueued (in the actual queue structure) as quickly as possible. If the buffer is full, that's fine; we just try to find a less congested worker, even if it is not the "best" candidate for our data-parallelism strategy.
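To make that concrete, here is a minimal sketch of the fallback strategy described above. Everything in it (package, type, and function names) is illustrative, not the PR's actual distributor code, and it assumes at least one candidate worker:

```go
// Package flowdistribution exists only for this sketch.
package flowdistribution

// item stands in for a queued request; the real types in the PR differ.
type item struct{}

// distribute sketches the strategy described above: attempt a non-blocking
// send into each candidate worker's buffer from best to worst; if every
// buffer is full, degrade to a blocking send on the best candidate.
func distribute(candidates []chan *item, req *item) {
	for _, ch := range candidates {
		select {
		case ch <- req: // buffer had room; request is handed to this worker
			return
		default: // buffer full; fall back to the next-best candidate
		}
	}
	// Every worker's buffer is full: block on the best candidate until it drains.
	candidates[0] <- req
}
```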
@kfswain Hopefully this clears it up. If you don't find the config documentation sufficient here, I can update it to better incorporate these details. Otherwise, if no change is needed, could you resolve this thread?