
[WIP] feat(flowcontrol): Implement the FlowRegistry #1319


Open · wants to merge 1 commit into base: main

Conversation

LukeAVanDrie (Contributor):

This PR introduces the complete implementation of the FlowRegistry, which acts as the stateful control plane for the entire flow control system. This is a foundational architectural component that manages the lifecycle of all flows, queues, and policies, providing a sharded, concurrent-safe view of its state to the FlowController workers.

This work is a major step towards a robust, production-grade flow control engine.

This tracks #674

(WIP Note: The top-level registry_test.go with integration tests for the FlowRegistry will be added subsequently.)

Suggested Review Path

  1. Start with the contracts/ directory to understand the high-level interfaces and responsibilities.
  2. Read the new architectural documentation in pkg/epp/flowcontrol/registry/doc.go to understand the design patterns (Actor Model, GC).
  3. Review the implementation files in a logical order (e.g., lifecycle.go, gc.go, flowstate.go, managedqueue.go, shard.go, and finally registry.go).

Architectural Overview

The design introduces a clear separation between the control plane and the data plane, employing several patterns to ensure correctness, performance, and stability:

  • Serialized Control Plane (Actor Model): The FlowRegistry uses an actor-like pattern. A single background goroutine processes all state change events (e.g., GC timers, queue emptiness signals) from a channel. This serializes all mutations to the registry's core state, eliminating a significant class of race conditions.
  • "Trust but Verify" Garbage Collection: To handle the inherent race condition between asynchronous data path events (e.g., a queue becoming non-empty) and destructive GC operations, the system uses a "Trust but Verify" pattern. The control plane first "trusts" its eventually consistent cached view to make a preliminary decision (e.g., "the flow appears idle") and then "verifies" the ground truth by synchronously checking the atomic counters on the live managedQueue instances before committing to the destructive action.
  • Atomic Lifecycle and Exactly-Once Signaling: Components (managedQueue, registryShard) now follow a formal Active -> Draining -> Drained lifecycle, managed by atomic state transitions. This ensures that signals for garbage collection (e.g., BecameDrained) are generated reliably and exactly once.
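
To make these patterns concrete, here is a minimal sketch of the actor loop and the exactly-once lifecycle transition. All names below (the sketch package, registry struct, markDrained, the event values) are simplified stand-ins for illustration, not the actual types in lifecycle.go or registry.go:

package sketch

import (
    "fmt"
    "sync/atomic"
)

// event is a simplified stand-in for the signals defined in lifecycle.go.
type event int

const (
    queueBecameEmpty event = iota
    queueBecameDrained
)

// registry is a toy control plane; the real FlowRegistry carries far more state.
type registry struct {
    events chan event
}

// run is the single actor goroutine: every control-plane mutation flows
// through this loop, so the core state needs no additional locking.
func (r *registry) run(stop <-chan struct{}) {
    for {
        select {
        case ev := <-r.events:
            r.handle(ev) // serialized: one event at a time
        case <-stop:
            return
        }
    }
}

func (r *registry) handle(ev event) { fmt.Println("handling event", ev) }

// Lifecycle statuses, held in an atomic so transitions are lock-free.
const (
    statusActive uint32 = iota
    statusDraining
    statusDrained
)

// markDrained performs the Draining -> Drained transition exactly once:
// only the caller that wins the CompareAndSwap emits the edge signal.
func markDrained(status *atomic.Uint32, signal func()) {
    if status.CompareAndSwap(statusDraining, statusDrained) {
        signal()
    }
}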

Key Components Introduced

  • pkg/epp/flowcontrol/registry/registry.go: The FlowRegistry itself. The central orchestrator.
  • pkg/epp/flowcontrol/registry/flowstate.go: A new flowState struct that acts as the eventually consistent cache for a flow's GC state.
  • pkg/epp/flowcontrol/registry/gc.go: A new gcTracker for decoupled, generation-based management of GC timers.
  • pkg/epp/flowcontrol/registry/lifecycle.go: Defines the new component statuses, signals, and events that drive the state machine.

Major Refinements

  • managedQueue: Has been significantly enhanced to implement the atomic lifecycle, manage its own state transitions, and emit edge-triggered signals to the control plane.
  • registryShard: Now acts as a pure data plane slice, with its lifecycle managed by the FlowRegistry. It propagates signals from its queues up to the control plane.
  • contracts: The FlowRegistry and RegistryShard interfaces have been significantly expanded and documented to reflect the new architecture, including detailed explanations of system invariants and dynamic update strategies.

Testing

This PR includes comprehensive unit tests for all new and modified components, including:

  • flowstate_test.go: Tests the logic of the GC state cache.
  • gc_test.go: Tests the timer manager with a FakeClock.
  • managedqueue_test.go: Includes extensive concurrency and race condition tests for the new lifecycle and signaling logic.
  • shard_test.go: Includes extensive concurrency and race condition tests for the shard's lifecycle management.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 6, 2025
netlify bot commented Aug 6, 2025:

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 3d6ed72
🔍 Latest deploy log: https://app.netlify.com/projects/gateway-api-inference-extension/deploys/6893f5a446aa930008529ed7
😎 Deploy Preview: https://deploy-preview-1319--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 6, 2025
@k8s-ci-robot (Contributor):

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 6, 2025
@LukeAVanDrie (Contributor, Author):

/assign @kfswain

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Aug 6, 2025
This commit introduces the complete, concrete implementation of the
`FlowRegistry`, the stateful control plane for the flow control system.
It is responsible for managing the lifecycle of all flows, queues, and
shards.

Key features of this implementation include:

- An actor-based, serialized event loop that processes all state changes
  to ensure correctness and eliminate race conditions in the control
  plane.
- A robust garbage collection system for idle flows and drained
  components, using a "Trust but Verify" pattern to safely handle races
  between the data path and control plane.
- A well-defined component lifecycle (Active, Draining, Drained) with
  atomic state transitions and exactly-once edge signaling.
- A sharded architecture where the `FlowRegistry` orchestrates the
  `registryShard` data plane slices.
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LukeAVanDrie
Once this PR has been reviewed and has the lgtm label, please ask for approval from kfswain. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ahg-g (Contributor) commented Aug 7, 2025:

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 7, 2025
@k8s-ci-robot (Contributor):

@LukeAVanDrie: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-gateway-api-inference-extension-verify-main
Commit: 3d6ed72
Required: true
Rerun command: /test pull-gateway-api-inference-extension-verify-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

s.generation++ // Invalidate any pending GC timers for the old generation.

// If priority did not change, there's nothing more to do.
if oldPriority == spec.Priority {
Contributor:

For the future: priority-change support could have been left out of this PR to simplify it.

@LukeAVanDrie (Contributor, Author) commented Aug 7, 2025:

Ack, this was to prepare for syncing registry state to our CRDs. Right now, priority is the only supported update story (at the time, only criticality can change for a given flow identifier). Once we have a complete config API for the flow control system, we need to carefully roll out support for other update scenarios.

This makes the current PR a lot more complex (as it also heavily influences the GC logic). Will keep this in mind in the future.

Contributor:

I would implement this as a delete then add flow rather than having the internal flow management support update.

However, the question in my mind is this: now that flow id is set by the requests and is decoupled from criticality definition, what happens if requests constantly use the same flow id across different criticalities?

LukeAVanDrie (Contributor, Author):

This is a great point, and it gets to the heart of a critical design trade-off between implementation simplicity and operational robustness. You're right that this PR's complexity is significantly driven by the need to support updates.

Let's break down the "update-in-place" vs. "delete-then-add" models.

Why Update-in-Place vs. Delete-then-Add:

A "delete-then-add" approach is indeed simpler from a state management perspective. However, it comes with a consequence that is unacceptable for this system: it would lose all in-flight requests.

When a logical flow is "deleted," all its managedQueue instances would be drained and evicted. If we immediately "add" it back at a new priority, any requests that were waiting in the old queue are gone. For a live system managing active traffic, losing work during a simple configuration change (like adjusting a priority) would be a critical failure.

The core principle driving the current design is work conservation. The "update-in-place" model, with its graceful draining mechanism, guarantees that no enqueued requests are ever lost during a priority update. They continue to be processed at their original priority until the old queue is empty. The complexity of the update logic, the drainingQueuesEmptyOnShards map, and the generation counters are all in service of this single, non-negotiable requirement.


Handling Priority Thrashing:

Your second question—what happens when a flow's priority changes rapidly—is the perfect stress test for this design. The current implementation is not only robust to this "thrashing" but is explicitly optimized for it.

Consider this scenario: flow F changes P1 -> P2, then immediately back P2 -> P1.

  1. P1 -> P2:
  • The managedQueue at priority P1 (Q1) is marked as Draining. It stops accepting new items.
  • A new managedQueue at priority P2 (Q2) is created and becomes Active.
  2. P2 -> P1 (before Q1 is empty):
  • The flowState.update function is called again. It checks if a queue for the new target priority (P1) already exists in a draining state.
  • It finds Q1 in the drainingQueuesEmptyOnShards map.
  • Instead of creating a new queue, it reactivates Q1, transitioning it back to Active. This is a highly efficient operation that avoids any new memory allocations.
  • Simultaneously, Q2 is marked as Draining.

Note that the controller worker (shardProcessor) that operates on this slice of the registry state only routes new requests to the currently Active flow queue instance throughout this process.

While I absolutely agree with the philosophy of keeping PRs simple, the update logic's complexity is essential. It's the price for guaranteeing zero request loss and providing efficient, robust handling of dynamic configuration changes, which are core requirements for this system.
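
To illustrate the reactivation path concretely, here is a minimal sketch of the reactivate-or-create decision described above. queueState and the draining map are simplified stand-ins for the real flowState bookkeeping, not the PR's actual types:

// queueState is a simplified stand-in for a managedQueue's lifecycle view.
type queueState struct {
    priority uint
    draining bool
}

// applyPriorityChange sketches the reactivate-or-create decision: the old
// Active queue starts draining, and if a queue at the target priority is
// already draining (the thrash case), it is reactivated in place.
func applyPriorityChange(active *queueState, draining map[uint]*queueState, newPriority uint) *queueState {
    if active.priority == newPriority {
        return active // no-op
    }
    active.draining = true
    draining[active.priority] = active

    if q, ok := draining[newPriority]; ok {
        delete(draining, newPriority)
        q.draining = false // reactivate: no new allocation needed
        return q
    }
    return &queueState{priority: newPriority} // fresh Active queue
}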


Clarification on the Flow Control Data Model:

This brings up an important clarification on the system's core data model, which underpins this whole design. In this system, a logical flow is uniquely identified by its ID and is defined by its entire types.FlowSpecification, which includes a single, authoritative priority level. A flow, as a managed entity, is at a specific priority relative to all other flows. This is why the system has the concept of a single 'active' queue per flow; it's the queue that corresponds to the flow's current, singular priority.

The alternative model, where a single flowID could map to multiple active priority queues simultaneously, is a valid but fundamentally different architecture. It would treat flowID as more of a grouping label and priority (criticality) as a request-level attribute. The current model was chosen because its primary job is to manage contention between different workloads (flows). Assigning a single priority to each workload is the mechanism for that, while the intra-flow and inter-flow policies handle fairness within a given priority band.

In short, whatever mechanism binds the flow ID set in the request header to its flow configuration must be a sticky binding. Priority (criticality) must be defined at the flow level, not the request level. While the criticality can change (e.g., if a config map updates) dynamically (as supported by the update story in the registry), it must still be a 1:1 mapping to a flow ID, not a request.

@LukeAVanDrie (Contributor, Author) commented Aug 7, 2025:

@kfswain Just to explicitly loop you in on this point, as it's foundational: the design relies on the core principle that a flow ID has a sticky, 1:1 mapping to a single priority (criticality) at the flow-level, not the request-level. It's critical that we're all aligned on this model. Please let me know if you have any different expectations here.

Collaborator:

Discussed offline, our current direction will work. Will reevaluate as we implement Phase 2 of API changes

// and the actual queue instances (`managedQueue`) for its assigned partition. It provides a read-optimized, operational
// view for a single `controller.FlowController` worker.
//
// The registryShard is deliberately kept simple regarding coordination logic; it relies on the parent `FlowRegistry`
Contributor:

nit: why are we calling it a registry shard instead of just shard? not asking to change anything now (please don't), I am just wondering.

@LukeAVanDrie (Contributor, Author) commented Aug 7, 2025:

That's a great question, and it's a trade-off I considered. You're absolutely right that the idiomatic Go convention would be to name it just shard within the registry package. My initial thinking for registryShard was to create a very explicit link to the public contracts.RegistryShard interface. It prioritizes cross-package clarity and discoverability (especially when coming from the controller package where this dependency is used) over the local idiom. That said, I see the value in both conventions and I'm not strongly opinionated on which is better here. I'm happy to align with whatever you feel is best for the project's long-term consistency. If you prefer the more idiomatic shard, it's a simple find-and-replace for me to make.

@@ -138,59 +150,37 @@ func (mq *managedQueue) Remove(handle types.QueueItemHandle) (types.QueueItemAcc
// items.
func (mq *managedQueue) Cleanup(predicate framework.PredicateFunc) (cleanedItems []types.QueueItemAccessor, err error) {
cleanedItems, err = mq.queue.Cleanup(predicate)
if err != nil || len(cleanedItems) == 0 {
return cleanedItems, err
if err != nil {
Contributor:

why do we have clean and drain? semantically they seem the same.

@LukeAVanDrie (Contributor, Author) commented Aug 7, 2025:

That's an excellent question, and it highlights a key architectural decision for ensuring the system is robust and extensible. On the surface, they seem similar because Cleanup could technically implement Drain's behavior with a predicate that always returns true. However, they serve two fundamentally different purposes, and separating them is crucial for performance and, more importantly, for enabling future features like infallible atomic migrations.

1. Functional and Performance Differences:

  • Cleanup(predicate) is for selective, conditional removal. Its primary use case today is garbage collecting expired items, which requires iterating over the queue's contents. It's inherently an O(N) operation (at best).
  • Drain() is for unconditional, total removal. Its primary use case is to atomically empty the entire queue. For many queue implementations (like a slice-based one), this can be a highly efficient O(1) operation (e.g., by swapping the internal slice with a new empty one).

2. Architectural Purpose:

The most critical reason for Drain's existence is to support infallible, atomic queue migrations. My long-term plan is to support dynamic updates to a flow's IntraFlowDispatchPolicy. If a policy update requires changing the underlying queue's comparator (e.g., switching from FCFS to a priority-based policy), we must atomically migrate all items from the old queue to a new one. The process would be:

items := oldQueue.Drain()
for _, item := range items {
    newQueue.Add(item)
}

To make this migration bulletproof and synchronous across all shards (avoiding any and all consistency issues), I intend to make the Add and Drain methods infallible by removing their error return values from the framework.SafeQueue contract.

By having a separate Drain method, we can make this specific, critical migration operation infallible without sacrificing the error return on Cleanup. The Cleanup contract still benefits from being fallible, as it performs a more complex operation that could conceivably encounter issues (I have justification for this trust/fallibility model, but that digresses from the registry implementation choices).

In summary: Cleanup is for fallible, conditional GC, while Drain is the primitive for infallible, atomic migration. Separating them keeps the contracts clean and unlocks a much more robust strategy for future dynamic updates.
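
To make the distinction concrete, here is a sketch of the two contracts side by side. The signatures are illustrative stand-ins, not the actual framework.SafeQueue contract:

// Illustrative only; not the actual framework.SafeQueue contract.
type safeQueue interface {
    // Cleanup is selective and fallible: it walks the queue (O(N)) and
    // removes only the items matching the predicate (e.g., expired ones).
    Cleanup(predicate func(item any) bool) (removed []any, err error)

    // Drain is total and intended to be infallible: it atomically empties
    // the queue, which a slice-backed implementation can do in O(1) by
    // swapping out its backing slice.
    Drain() []any
}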

Contributor:

Right, my concern is that at day 1 we are implementing support for extreme flexibility that may never be needed. I am not against having a future-proof design; we should be open for extension but closed for modification (OCP). But it seems we are not only designing but also executing for the future...

// CRITICAL: Check if the queue is *already* empty at the moment it's marked as draining (or if it was already
// draining and empty). If so, we must immediately attempt the transition to Drained to ensure timely GC.
// This handles the race where the queue becomes empty just before or during being marked draining.
if mq.Len() == 0 {
Contributor:

why do we need to do this here? the transition from draining to drained should already be handled when removing the last item.

@LukeAVanDrie (Contributor, Author) commented Aug 7, 2025:

You are absolutely right that propagateStatsDelta (triggered by removing the last item) handles the Draining -> Drained transition in the normal case. This check, however, is for the specific edge case where the queue becomes empty before it is marked as Draining.

Here is the exact sequence of events that would lead to a stranded queue instance without this check:

  1. T=0: State: A queue for flow F at priority pCritical is Active and contains 1 item.
  2. T=1: Last Item Removed: The last item is removed from the pCritical queue.
  • Inside propagateStatsDelta, the logic sees newLen == 0 while the status is still Active.
  • It correctly signals QueueBecameEmpty, which tells the FlowRegistry to start the slow, inactivity-based GC timer for the entire flow.
  • Crucially, it does not signal QueueBecameDrained, because the queue is not yet in the Draining state.
  3. T=2: Priority Change: A RegisterOrUpdateFlow call immediately changes flow F's priority to pStandard.
  • This triggers a call to markAsDraining() on the now-empty queue instance at pCritical.
  4. T=3: The Problem: At this point, the pCritical queue is now Draining but is already empty. Because no more requests will be enqueued (it's draining) and no more requests can be removed (it's empty), the propagateStatsDelta method will never be called on it again.

Without the check inside markAsDraining, this specific queue instance would be stranded. It has no way to signal QueueBecameDrained to trigger its own immediate garbage collection. While it would eventually be cleaned up by the slow flow-level inactivity timer, this check ensures the much more timely and correct GC path is taken.
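
A minimal sketch of the guard described above, using simplified stand-in types rather than the real managedQueue:

import "sync/atomic"

const (
    statusActive uint32 = iota
    statusDraining
    statusDrained
)

type queueSketch struct {
    status        atomic.Uint32
    length        atomic.Int64
    signalDrained func()
}

// markAsDraining transitions Active -> Draining and, critically, checks for
// the already-empty case: no future Remove will ever run on an empty
// draining queue, so Draining -> Drained must be attempted right here.
func (mq *queueSketch) markAsDraining() {
    mq.status.CompareAndSwap(statusActive, statusDraining)
    if mq.length.Load() == 0 {
        if mq.status.CompareAndSwap(statusDraining, statusDrained) {
            mq.signalDrained() // exactly-once, even in this edge case
        }
    }
}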

Contributor:

This highlights a concern I have in the current events based design, which is the potential of missing events in corner cases. Do we really need to do it this way vs having a GC thread kicking in every minute to check which flows have been empty for a while and delete them? This wouldn't require timers, just timestamps of last usage.

// start begins a new GC timer for a given flow and generation. If a timer already exists for the flow, it is implicitly
// stopped and replaced. This is the desired behavior, as a new call to start a timer for a flow (e.g., because it just
// became idle) should always supersede any previous timer.
func (gc *gcTracker) start(flowID string, generation uint64, timeout time.Duration) {
Contributor:

I expected this to be called to reset the timer when adding a new item to the flow's queue, but I can't find that. If the caller logic is not part of this PR, why do we have this file in this PR?

Contributor:

There is no confusion, I missed the registry.go file that was collapsed and I didn't see during the review

LukeAVanDrie (Contributor, Author):

Got it, thanks for clarifying!

You've hit on a really key part of the design though, so I'll add a more detailed explanation here for the benefit of anyone else reading the PR in the future.

The system's architecture relies on a strict decoupling of the high-performance data path from the stateful control plane. Instead of having the managedQueue (data path) directly manipulate the GC timer, it uses an asynchronous, event-driven approach. Here's how it works:

[Queue on Data Path] --sends signal--> [Event Channel] --processed by--> [Registry on Control Plane]

  1. The Queue's Role: When an item is added to an empty active queue, the managedQueue's only job is to do its work and send a single, fire-and-forget QueueBecameNonEmpty signal to the FlowRegistry's central event channel. It remains fast and lock-free, with no knowledge of the complex GC logic.
  2. The Registry's Role: The FlowRegistry's main event loop processes this signal. It then calls the centralized evaluateFlowGCStateLocked function. This function is the single source of truth for all GC decisions. It looks at the flow's state across all active shards and decides whether to stop the timer by calling gc.stop().

This design is intentional and critical for two reasons:

  • Performance: It keeps the hot data path (adding/removing from queues) free of locks and complex, potentially slow logic.
  • Correctness: It prevents race conditions. A logical flow exists across multiple shards, so only the central FlowRegistry can safely determine if the flow is truly idle or active on a global level. Centralizing the start/stop logic in evaluateFlowGCStateLocked ensures this decision is made atomically and correctly.
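
A minimal sketch of the data-path side of this diagram, with stand-in names:

import "sync/atomic"

type event int

const queueBecameNonEmpty event = iota

type queueSketch struct {
    length atomic.Int64
    events chan<- event // the registry's central event channel
}

// onAdd shows the queue's entire responsibility on the empty -> non-empty
// edge: detect it via the atomic counter and emit one signal. All GC
// decisions stay in the registry's serialized event loop.
func (mq *queueSketch) onAdd() {
    if mq.length.Add(1) == 1 { // was empty, now non-empty
        mq.events <- queueBecameNonEmpty
    }
}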

@ahg-g (Contributor) commented Aug 8, 2025:

The channel based approach is more scalable, the question is what is the rate of creating and delete flows we are designing for.

Contributor:

It seems we are designing the system to handle a high rate of create/update/delete of flows, and this is a major source of complexity in this PR; I am not sure this is necessary at this point, we don't have the requirements for it yet

LukeAVanDrie (Contributor, Author):

Thank you for raising this, it's a critical point. I'd like to clarify a misunderstanding that my design has created: this complexity is not to handle a high rate of flow changes. It is the minimum complexity required to handle even a single change correctly, safely, and without data loss in a live, concurrent system.

You're right that we don't have a requirement for frequent updates; in fact, the system is designed on the assumption that these changes are infrequent. The complexity you're seeing is driven entirely by three foundational, non-negotiable requirements for any production-grade system:

1. Work Conservation (Requires the update logic)

The alternative to a complex update is a simple delete-then-add. However, a delete-then-add operation would lose all in-flight requests for a flow during a simple priority change. The entire "graceful draining" mechanism exists solely to satisfy the core requirement: "Do not lose user work during a configuration change." This is necessary even if updates happen only once a day.

2. Resource Safety (Requires the GC logic)

Without garbage collection, every flow ever registered would exist in memory forever, leading to an unbounded memory leak. The GC logic—with its timers and eventual consistency—is the minimum required to safely reclaim resources without accidentally deleting an active flow that is only temporarily idle.

3. Concurrency Safety (Requires the Actor Model and Events)

All these changes happen while many FlowController workers (paired 1:1 with each registry shard) are concurrently reading from and modifying the state (via atomic stats). A simpler implementation without the Actor model, event channel, and generation counters would be riddled with subtle race conditions or high lock contention. This complexity is the cost of ensuring that a flow update is an atomic, consistent operation across all shards.


This PR introduces the necessary machinery to guarantee correctness and safety. While this machinery is also designed to be scalable, its primary purpose is to provide a strong foundation, even for infrequent control plane operations.

Contributor:

On update: the question is whether we need it in the first place given our discussion on the data model; if a flow is a grouping mechanism within a band, and it can exist across bands, then there is no need for migration.

On GC, we need garbage collection for sure (I proposed that, if you recall, so naturally I agree it is needed), but GC can be implemented differently. We don't need to be super accurate with our tracking, and we don't need timers and an event-driven system to handle it. Here is another approach: for each flow we maintain a timestamp of the last time it was active, and a GC goroutine periodically checks all last-active timestamps and GCs the ones that were not active for a period of time. This will require using locks, but we should still be able to do it efficiently.

I am mainly trying to explore ways to simplify the design, but I will leave the decision to you on how to move forward since this is already implemented ...
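
For illustration, a minimal sketch of this timestamp-based sweep, assuming a per-flow last-active timestamp (all names here are hypothetical):

import (
    "sync"
    "time"
)

type sweeper struct {
    mu         sync.Mutex
    lastActive map[string]time.Time
}

func newSweeper() *sweeper {
    return &sweeper{lastActive: make(map[string]time.Time)}
}

// touch records activity for a flow; called from the data path.
func (s *sweeper) touch(flowID string) {
    s.mu.Lock()
    s.lastActive[flowID] = time.Now()
    s.mu.Unlock()
}

// run periodically deletes flows idle longer than idleTimeout. No timers
// or events: just timestamps and a single GC goroutine.
func (s *sweeper) run(interval, idleTimeout time.Duration, gc func(flowID string)) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        s.mu.Lock()
        for id, t := range s.lastActive {
            if time.Since(t) > idleTimeout {
                delete(s.lastActive, id)
                gc(id)
            }
        }
        s.mu.Unlock()
    }
}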

}

// We use AfterFunc which works efficiently with both `RealClock` and `FakeClock` (for tests).
timer := gc.clock.AfterFunc(timeout, func() {
@LukeAVanDrie (Contributor, Author) commented Aug 7, 2025:

@ahg-g & @kfswain

We have a potential (ephemeral) goroutine leak here.

Cause: The gcTracker uses time.AfterFunc, which creates a new time.Timer and an associated goroutine for every timer started. Even though Stop() is called, the Go runtime doesn't guarantee the immediate cleanup of the underlying resources if the timer has already fired.

Impact: This is a very minor performance concern that only matters at O(hundreds of thousands) of flows. If the system creates and destroys tons of short-lived flows, this could lead to a build-up of dormant timer-related goroutines.

Solution: For extreme hardening, we can consider replacing the one-goroutine-per-timer model with a single "timer wheel" (circular buffer) or a min-heap-based scheduler that manages all pending expirations in one goroutine.

For most practical purposes, the current implementation is simpler and likely sufficient though.

I was wondering if this is worth documenting somewhere (GitHub issue, code comment, etc.) or if we just let this slide. It really depends on how we anticipate users using our request header for flows. This is just another source of goroutine bloat on top of the existing one-routine-per-request constraint we are already working under.

Collaborator:

If you'd feel most comfortable, you can document this as a potential high usage problem. But given that inference traffic is currently much lower volume, I think we can avoid optimizing here until we need to.

Contributor:

This is actually a function of request rate multiplied by the amount of time the EPP is active, which might be significant.

totalByteSize atomic.Uint64
totalLen atomic.Uint64
// perPriorityBandStats stores *bandStats, keyed by priority (uint).
perPriorityBandStats sync.Map
LukeAVanDrie (Contributor, Author):

This map is populated once at initialization and then only read from in the Stats() method.

We could consider replacing with a standard map protected by the registry's mu lock (or an RWMutex for read access) which is more conventional and potentially more performant as sync.Map is optimized for a different access pattern (concurrent reads and writes).

This would simplify the code a bit as well.

The only writes we are concerned with are key additions and deletions. The values are already atomic.Uint64.
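
For illustration, a sketch of the conventional alternative described above, with stand-in names:

import (
    "sync"
    "sync/atomic"
)

type bandStats struct {
    totalLen      atomic.Uint64
    totalByteSize atomic.Uint64
}

// statsSketch replaces sync.Map with a plain map guarded by an RWMutex:
// key additions/deletions take the write lock, Stats() takes the read
// lock, and the per-band counters themselves remain atomic.
type statsSketch struct {
    mu    sync.RWMutex
    bands map[uint]*bandStats // keyed by priority
}

func (s *statsSketch) band(priority uint) *bandStats {
    s.mu.RLock()
    defer s.mu.RUnlock()
    return s.bands[priority]
}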

Contributor:

This map is populated once at initialization and then only read from in the Stats() method.

The only writes we are concerned with are key additions and deletions. The values are already atomic.Uint64.

I don't understand: is it written to at runtime or not? Adding keys to a map while someone else is reading it is a way to get a panic.

// without blocking the data path, but consumes more memory.
//
// Optional: Defaults to 4096.
EventChannelBufferSize uint32
LukeAVanDrie (Contributor, Author):

The buffer size is defaulted to 4096. This value was not determined through load testing; I just guessed. In a "thundering herd" scenario where many thousands of queues change state simultaneously, this buffer could fill, causing latency on the request path. This is the correct behavior, but it's worth ensuring the default is sized appropriately for common workloads.

Any suggestions for configuration tuning? I could make the default a bit higher to be safe.

Contributor:

I don't expect that we will initially have cases for high flow churn.

}

// Perform compatibility check. (This check is also done during config validation, but repeated here defensively).
if err := validateBandCompatibility(*bandConfig); err != nil {
LukeAVanDrie (Contributor, Author):

We repeat a compatibility check (validateBandCompatibility) defensively that is also performed during initial config validation.

The system's invariants, protected by the control plane lock, should prevent an invalid configuration from being used after startup. If we trust these invariants, this redundant check could be removed to slightly streamline the flow registration path. Thoughts?

//
// (See package documentation in `doc.go` for the full overview of the Architecture and Concurrency Model)
type FlowRegistry struct {
config *Config
@LukeAVanDrie (Contributor, Author) commented Aug 7, 2025:

The Config is partitioned and distributed to shards. The FlowRegistry holds the master config, and shards hold pointers to their partitioned version.

If we ever want to support configuration hot-reloading, we could consider treating the Config object as deeply immutable after its initial validation. The FlowRegistry could hold its master config within a sync/atomic.Value. This would provide a clear, thread-safe mechanism for updating the entire configuration atomically; however, the current locking model is sufficient for the present design.

Is this a story we would ever want to support? There are some cascading impacts from changing the config. E.g., if you change the default inter/intra flow policies, how should we treat existing flow instances? This feels like an indirect way of calling RegisterOrUpdateFlow. I can see some cases where modifying the capacity limits, event buffer size, etc. could be useful and non-disruptive on the fly though.
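
For illustration, a minimal sketch of the hot-reload idea, where Config stands in for the real registry configuration type:

import "sync/atomic"

type Config struct {
    EventChannelBufferSize uint32
    // ... the rest of the (deeply immutable) configuration
}

// configHolder stores an immutable *Config snapshot. Readers always see a
// consistent config, and a hot reload swaps the whole object atomically.
type configHolder struct {
    v atomic.Value // holds *Config
}

func (h *configHolder) load() *Config { return h.v.Load().(*Config) }

// store publishes a new snapshot; cfg must never be mutated afterwards.
func (h *configHolder) store(cfg *Config) { h.v.Store(cfg) }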

// This is safe because scale-down always removes shards from the end of the list, and the `allShards` slice is
// ordered with active shards first.
queues = queues[:currentTotalShardCount]
}
Contributor:

what if currentTotalShardCount changes again by the time we get here?

@LukeAVanDrie (Contributor, Author) commented Aug 8, 2025:

Good catch. You're right to scrutinize this section for race conditions.

To answer your direct question: Yes, the queues = queues[:currentTotalShardCount] operation is guaranteed to be safe. The entire "Phase 2" ("Commit") block is under the fr.mu write lock, which serializes this operation against any concurrent calls to UpdateShardCount.


Architectural Trade-off: The Prepare-Commit Optimization

The reason the code is structured this way—with a complex revalidation step inside the lock—is due to a "prepare-commit" optimization. We prepare the expensive plugin and queue instances outside the critical section to minimize lock contention.

Since you're looking at this, I wanted to proactively get your thoughts on a complexity vs. performance trade-off I made here. The logic to handle a shard count change during the "prepare" phase is arguably the most complex part of this function.

I see three potential implementation strategies:

  1. Current Implementation (Optimized): Prepare outside the lock, then revalidate and patch the prepared data inside the lock.
  • Pro: Highest performance; avoids re-running preparation during a race.
  • Con: Most complex logic; hardest to read and maintain.
  2. Retry Loop (Simplified): Wrap the prepare-commit logic in a loop. If the shard count changes, the entire operation is retried from the beginning.
  • Pro: Dramatically simplifies the logic inside the critical section. No more slice patching.
  • Con: In the rare event of a race, we re-run the (potentially expensive) preparation phase.
  3. Full Lock (Simplest): Move the entire prepare-commit logic under the registry lock.
  • Pro: Simplest possible logic; removes the race condition entirely.
  • Con: Lowest performance; holds the registry's global lock for the longest duration, which could become a bottleneck.

My preference leans toward option 2. Shard scaling is likely to be an infrequent, operator-driven event compared to flow registration. The performance cost of re-running the preparation phase in the rare event of a race seems like an acceptable trade-off for the significant gain in code simplicity and long-term maintainability.

What are your thoughts on this trade-off? I'm happy to refactor to the simpler retry loop if you agree it's a better balance.
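
For reference, a minimal sketch of option 2, the retry loop, using simplified stand-in types:

import "sync"

type preparedQueues struct{} // stand-in for the prepared plugin/queue instances

type registrySketch struct {
    mu         sync.RWMutex
    shardCount int
}

// registerWithRetry prepares outside the lock, then commits only if the
// shard count is unchanged; on a race with UpdateShardCount it simply
// retries the whole prepare phase from scratch.
func (r *registrySketch) registerWithRetry(
    prepare func(shards int) preparedQueues,
    commit func(preparedQueues),
) {
    for {
        r.mu.RLock()
        n := r.shardCount
        r.mu.RUnlock()

        prepared := prepare(n) // expensive work, done without holding the lock

        r.mu.Lock()
        if r.shardCount == n { // revalidate under the write lock
            commit(prepared)
            r.mu.Unlock()
            return
        }
        r.mu.Unlock() // raced: retry
    }
}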

Contributor:

ok, it is fine, I missed the lock in the middle

What are your thoughts on this trade-off?

I would start simplest, evaluate and then optimize; but since you did this already, I don't recommend changing it.

}

for i, shard := range allShards {
shard.synchronizeFlow(spec, policy, queues[i])
Contributor:

name is not intuitive, addOrUpdateFlow?

LukeAVanDrie (Contributor, Author):

That's a great point. I agree that synchronizeFlow isn't as intuitive as it could be, and I appreciate the suggestion of addOrUpdateFlow.

My hesitation with addOrUpdateFlow is that it feels a bit like a simple CRUD operation, and doesn't fully capture the more complex state transitions the function handles, like reactivating a draining queue. My original idea was reconcileFlow, but I rejected it because "Reconcile" has a very specific, loaded meaning in the context of Kubernetes controllers.

What do you think of a third option: applyFlowSpec?

The verb "Apply" feels like a good fit here. It's well-understood to mean "take this specification and make the live state match," which is exactly what this infallible function does. It seems to strike a better balance between being intuitive and being architecturally precise than either synchronizeFlow or addOrUpdateFlow.

I'm happy to go with addOrUpdateFlow if you still feel it's the clearest.

Contributor:

IMO the function name should focus on capturing the what not the how; but I don't want us to spend time changing this now.

shard.synchronizeFlow(spec, policy, queues[i])
}

// If this was a priority change, attempt to GC the newly-draining queue immediately if it's already empty.
Contributor:

Do we need this if the default GC mechanism will clean this up anyway? this looks out of place here.

@LukeAVanDrie (Contributor, Author) commented Aug 8, 2025:

That's an excellent question; in short, no, we do not need this. You are correct that this is not a strict correctness issue: the default asynchronous GC mechanism will eventually clean up this queue. The justification for placing this check here is an efficiency optimization that also provides stronger transactional consistency.

The core argument is this: The final "verify" step of any destructive GC action requires acquiring the registry's lock (fr.mu), so we should leverage the lock we are already holding during this update.

Let's compare the two cleanup paths:

Path 1: Relying on Default Async GC (without this check)

  1. RegisterOrUpdateFlow completes the priority change and releases fr.mu.
  2. One or more QueueBecameDrained signals are processed asynchronously by the FlowRegistry's event loop.
  3. The event handler must re-acquire fr.mu to safely check the global state.
  4. It calls garbageCollectDrainedQueueLocked to perform the "Trust but Verify" check.
  5. The handler then releases fr.mu.

This path involves a full asynchronous cycle, including a potential event channel delay, a context switch, and a completely separate lock acquisition.

Path 2: Using the Explicit Check (the current code)

  1. Inside RegisterOrUpdateFlow, while still holding fr.mu, we call garbageCollectDrainedQueueLocked.
  2. It performs the "Trust but Verify" check using the lock we already have.
  3. RegisterOrUpdateFlow completes and releases fr.mu.

By performing the check here, we make the cleanup part of the same synchronous, atomic transaction as the update itself. We avoid the overhead of the async path entirely for this common case.

While it does mix some GC logic into this function (which has some code smell), I believe the performance gain and the benefit of completing the transaction more predictably justify the trade-off. If you feel that simplifying this function is the better long-term choice, I am happy to remove the explicit check and rely purely on the async path.

Contributor:

This highly depends on the flows rate of change, it seems we are assuming it to be high, why is that?


@kfswain (Collaborator) left a comment:

Still reviewing, I just have some comments that have been hanging since last night

//
// - Registered: A logical flow is Registered when it is known to the `FlowRegistry`. It has exactly one Active
// instance across all priority bands and zero or more Draining instances.
// - Active: A specific instance of a flow within a priority band is Active if it is the designated target for all
Collaborator:

How can only a single flow within a priority band accept new enqueues? This reads as if two 'Critical' requests coming in from different 'flow' identities would both end up in the same flow.


}

// --- Step 3: ACT (Destruction) ---
fr.logger.Info("All shards empty for draining queue, triggering garbage collection", "flowID", flowID,
Collaborator:

We already discussed offline, but this comment is just a reminder to set logs to default or verbose for now, so we don't forget. I'd eventually like to clean up our log volume per level to something more reasonable, but for now let's just move to higher verbosity levels.


// garbageCollect removes a queue instance from the shard.
// This must be called under the shard's write lock.
func (s *registryShard) garbageCollect(flowID string, priority uint) {
Collaborator:

It's a little interesting that the actual reference removal is done here rather than the file named GC. But that's a non-blocking comment.

An additional non-blocking comment; the use of the term 'Garbage collection' had me thinking we were actually going to do manual GC (I'm relieved we are not). I wonder if this is the right term. This is all internal implementation code, so I'm fine to run with this name as is. But just voicing this for now, if we hear additional voices share the same concern, we may want to think about it.

LukeAVanDrie (Contributor, Author):

gc.go is an independent utility. The actual garbage collect logic on the registry and shard requires acquiring their respective locks (fr.mu and s.mu) and being defined as methods on their structs. It felt worse to write a registryShard or FlowRegistry method outside the file the types are defined in.

currentTotalShardCount := len(fr.activeShards) + len(fr.drainingShards)

// 4. Revalidation: Check if the shard count changed while we were preparing.
if currentTotalShardCount != initialTotalShardCount {
Collaborator:

Suggest breaking this out into its own function, its only semi-related to RegisterOrUpdateFlow, so we can tuck it away

policy framework.IntraFlowDispatchPolicy,
queues []framework.SafeQueue,
) {
flowID := spec.ID
Collaborator:

If we just key on spec.ID + spec.Priority, all problems with priority per request should be solved, right? It's not clean, but it can be improved upon.

Collaborator:

It's not the cleanest solution, but it keeps us from a heavy rework for now.

CC: @ahg-g @LukeAVanDrie

Collaborator:

Well, we should update the ID to have priority appended, since ID is used everywhere.

Which is even easier, since the FlowSpec is only a flow control concept, we can make this name change in the director when we assemble the FlowSpecification, before flow control entry

Collaborator:

Granted, this breaks fairness only across priority bands. Which honestly, is an open question with how we want to handle fairness cross-band anyway.

LukeAVanDrie (Contributor, Author):

Granted, this breaks fairness only across priority bands. Which honestly, is an open question with how we want to handle fairness cross-band anyway.

This was not a well-solved problem anyways, even with the previous model I was operating under.

I think this is the most pragmatic short-term solution, and it offers some simplifications for the registry. I will implement this change.

@shmuelk (Contributor) commented Aug 10, 2025:

pkg/epp/flowcontrol/registry/registry.go: The FlowRegistry itself. The central orchestrator.

If the code is an orchestrator, why is it called a registry and not simply FlowOrchestrator?

Labels

  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.)
  • ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
  • size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.)
5 participants