Skip to content

Conversation

@LukeAVanDrie
Copy link
Contributor

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR introduces a major architectural refactoring of the flow registry. It replaces the complex, event-driven, actor-model-based garbage collection with a simpler and more performant lease-based lifecycle model.

The primary motivation is to enhance correctness, improve performance by removing global lock contention, and provide a safer, leak-proof client API.

Architectural Changes:

This refactoring touches every part of the registry. To help with the review, here is a high-level overview of the "before" and "after" states:

Previous Architecture (Event-Driven Actor Model)

  • Lifecycle: Flows were manually registered via RegisterOrUpdateFlow. Their lifecycle was tracked via an eventually consistent cache (flowState) that was updated by asynchronous signals (BecameEmpty, BecameNonEmpty) sent from each managedQueue over a buffered channel.
  • Concurrency: A central Run loop acted as a serialized actor, processing all events to prevent race conditions in the control plane.
  • Garbage Collection: A generational "Trust but Verify" GC algorithm ran periodically. The "Verify" step required a "stop-the-world" pause, acquiring a write lock on all shards simultaneously to get a consistent view of queue lengths. This introduced global lock contention that would impact P99 latency at scale.

New Architecture (Lease-Based Lifecycle)

  • Lifecycle: The new WithConnection API is the sole entry point to the data path. It handles Just-In-Time (JIT) registration of flows and manages their lifecycle via a lease. A flow is considered "Active" as long as it has one or more active leases.
  • Concurrency: The complex actor model is gone. Concurrency is now managed with a multi-layered strategy:
    1. A sync.Map for highly-concurrent lookups of flow states.
    2. An atomic.Int64 (leaseCount) on each flow for lock-free reference counting.
    3. A per-flow sync.RWMutex (gcLock) that provides surgical locking, arbitrating between active connections (read lock) and the garbage collector (write lock) for a single flow without impacting any other flows.
  • Garbage Collection: The GC is now a simple, periodic scanner. It identifies flows where leaseCount is zero and an idleness timer has expired. The "stop-the-world" pause is completely eliminated.

Key Benefits:

  1. Performance: Eliminates the global "stop-the-world" GC pause, drastically reducing lock contention and improving scalability and P99 latency.
  2. Correctness & Simplicity: Removes the complex, event-driven state machine, which was difficult to reason about and prone to subtle race conditions (e.g., dropped or out-of-order events). The new lease-counting model is simpler and more robust.
  3. API Safety: The WithConnection callback pattern makes resource leaks impossible from the client's perspective, as the lease is guaranteed to be released.

Suggested Review Plan:

This is a large change, but it has been structured into a series of logical commits to make the review easier. I recommend reviewing this PR commit-by-commit:

  1. refactor: Reorder methods for logical grouping

    • This is a no-op commit that just reorganizes the code in registry.go for better readability. You can scan and approve this one quickly.
  2. refactor: Simplify managedQueue to a decorator

    • Focuses on the lowest-level component. You'll see the signaling logic is completely removed, simplifying its responsibility.
  3. refactor: Adapt shard to to new managedQueue contract

    • Shows the impact on the registryShard, which no longer needs to handle signals from its queues.
  4. feat: Replace event lifecycle with leases

    • This is the heart of the PR. It introduces the WithConnection API, the new flowState with its gcLock, and replaces the entire GC and administrative logic.
  5. test: Overhaul tests for new lease-based design

    • This final commit updates all tests to validate the new reality. You'll notice the tests are now simpler and more focused, especially for registry_test.go, which now uses a fake clock for deterministic lifecycle testing.

Which issue(s) this PR fixes:

Tracks #674

Does this PR introduce a user-facing change?:

NONE

Reorganizes the methods in `registry.go` to improve readability and group
related functionality. Public API methods are placed first, followed by
administrative helpers, GC logic, and finally callbacks and statistics
propagation. This is to minimize the delta for subsequent changes.

No logical changes are included in this commit.
Removes the complex, stateful signaling mechanism from the
`managedQueue`.
Its sole responsibility is now to decorate a `SafeQueue` with atomic,
strictly consistent statistics tracking.

This change is the foundational first step in moving the registry from a
complex, event-driven lifecycle model to a simpler, lease-based one.
The now-unused event and lifecycle types are also removed.
Updates the `registryShard` to work with the new, signal-free
`managedQueue`.

The shard's responsibility is simplified; it no longer needs to handle
or propagate lifecycle signals from its queues. Its draining logic is
now based on directly checking its aggregate length, removing the need
for the `BecameDrained` signal.
Replaces the registry's core lifecycle management system. The complex,
event-driven, actor-based model is removed in favor of a simpler and
more performant lease-based lifecycle.

This commit introduces:
- A new `WithConnection` client API that provides a safe, leak-proof
  entry point for managing flow leases via reference counting.
- A per-flow `RWMutex` to provide fine-grained locking, allowing the
  garbage collector to operate on one flow without blocking others.
- A simplified GC that periodically scans for flows with a zero lease
  count.

Simultaneously, this commit removes the old machinery:
- The central event-processing loop (`Run` is now just for GC).
- The `RegisterOrUpdateFlow` administrative method.
- The entire signaling and event-handling subsystem.
Updates all tests in the registry package to align with the new
lease-based API and simplified component responsibilities.

- `managedqueue_test` is rewritten to focus on stats propagation and
   invariants, removing all signal testing.
- `shard_test` is updated to reflect its simpler role and new draining
   logic.
- `registry_test` is completely overhauled to test the `WithConnection`
   API, JIT registration, and the new lease-based GC logic, using a fake
  clock for deterministic lifecycle testing.
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 27, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 27, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Aug 27, 2025
@netlify
Copy link

netlify bot commented Aug 27, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit da9490d
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/68ae6194a4bd010008255a29
😎 Deploy Preview https://deploy-preview-1476--gateway-api-inference-extension.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

// ErrInvalidShardCount indicates that an invalid shard count was provided (e.g., zero or negative).
ErrInvalidShardCount = errors.New("invalid shard count")

// ErrShardDraining indicates that an operation could not be completed because the target shard is in the process of
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This becomes the responsibility of the controller.FlowController which distributes requests across workers.

// mu protects all mutating operations (writes) on the queue. It ensures that the underlying queue's state and the
// atomic counters are updated atomically. Read operations (like `Peek`) do not acquire this lock.
mu sync.Mutex
// --- Immutable Identity & Dependencies (set at construction) ---
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simply regrouped to better align with the concurrency model. No major changes here except flattening/renaming onStatsDelta from the now removed managedQueueCallbacks and adding isDraining.

// It is stored as an `int32` for atomic operations.
status atomic.Int32 // `componentStatus`
// onStatsDelta is the callback used to propagate statistics changes up to the parent registry.
onStatsDelta propagateStatsDeltaFunc
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flattened (and renamed) from the now-removed parentCallbacks.


// mu protects the shard's internal maps (`priorityBands`).
// mu protects the shard's internal topology (`priorityBands`) and `config`.
// TODO: This is a priority inversion issue. Administrative operations (e.g., GC) for a low-priority flow block all
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will tackle this in a fast-followup PR once the flow control system is wired up and benchmarked.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A per band lock makes sense, but it is not clear to me if this should be a "fast" follow; once we have the system working end-to-end, I recommend listing all followup items and then prioritize them.

//
// Errors returned by the callback `fn` are propagated up.
// Returns `ErrFlowIDEmpty` if the provided key has an empty ID.
WithConnection(key types.FlowKey, fn func(conn ActiveFlowConnection) error) error
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative here is:

FlowRegistryClient.Connect with a corresponding ActiveFlowConnection.Close. The functional approach is leak proof and more architecturally sound, but I am open to feedback here. Connect/Close also works just fine with the flow control synchronous blocking model.

e.g.,

FlowController.EnqueueAndWait(...) (...) { 
  conn := fc.registryClient.Connect(req.FlowKey())
  defer conn.Close()
 
  // ... rest of logic proceeds as normal
}

Also, this method should eventually accept and respect caller context. I am not doing this in this PR as this will need to be done comprehensively across the flow control module contracts. As of right now, none of our function have unbounded blocking, so this is more for contract hardening and tracing than correctness at the moment.

I will do this in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can simply be in registry.go, but I plan on expanding the set of functionality exposed from ActiveConnection, and I want to keep registry.go lean. I don't have a strong opinion here though. We should do whatever is best for readability / maintainability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dramatically shortened for doc maintainability (plus a lot of it became obsolete with the new lease-based model). Following principle of first disclosure by moving details closer to the relevant types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly modified and rolled into registry.go as it no longer has a role as an eventually consistent cache.

"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging"
)

// propagateStatsDeltaFunc defines the callback function used to propagate statistics changes (deltas) up the hierarchy
Copy link
Contributor Author

@LukeAVanDrie LukeAVanDrie Aug 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved here from the now deleted events.go. No changes to this signature (or stats propagation generally).

// It is the sole source of truth for determining if a flow is Idle.
leaseCount atomic.Int64

// becameIdleAt tracks the time at which the lease count last dropped to zero.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clearer to follow and test than the generation marker we used previously... but the same concept.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

}
}

// RegisterOrUpdateFlow handles the registration of a new flow instance or the update of an existing instance's
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in favor of the JIT registration in the WithConnection path. We don't allow flow specific config overrides yet, so aReconcile/Update method is not necessary.

return true
}

// verifyFlowIsTrulyIdleLocked performs the "stop-the-world" verification step of GC.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of this complex "stop the world" logic is needed anymore as our reference counter (leases) is already aggregated at the registry level and not scopes to individual queue instances.


// UpdateShardCount dynamically adjusts the number of internal state shards.
func (fr *FlowRegistry) UpdateShardCount(n int) error {
// updateShardCount dynamically adjusts the number of internal state shards.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Privated for now per feedback on the previous registry PR.

fr.mu.Unlock()

// Execute all deferred side effects outside the lock.
for _, action := range deferredActions {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for deferred actions anymore since everything is synchronous and can happen under lock (no side-effects).

func (fr *FlowRegistry) buildFlowComponents(key types.FlowKey, numInstances int) ([]flowComponents, error) {
bandConfig, err := fr.config.getBandConfig(key.Priority)
if err != nil {
return nil, fmt.Errorf("failed to get configuration for priority %d: %w", priority, err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error returned from getBandConfig is already well-formed. I was doing redundant error wrapping.

@ahg-g
Copy link
Contributor

ahg-g commented Aug 27, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 27, 2025
@LukeAVanDrie LukeAVanDrie changed the title [WIP] refactor(registry): Replace event-driven GC with a lease-based lifecycle refactor(registry): Replace event-driven GC with a lease-based lifecycle Aug 27, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 27, 2025
@ahg-g
Copy link
Contributor

ahg-g commented Aug 28, 2025

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 28, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, LukeAVanDrie

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 28, 2025
@k8s-ci-robot k8s-ci-robot merged commit d7b2a56 into kubernetes-sigs:main Aug 28, 2025
11 checks passed
kfswain pushed a commit to kfswain/llm-instance-gateway that referenced this pull request Sep 1, 2025
…cle (kubernetes-sigs#1476)

* refactor: Reorder methods for logical grouping

Reorganizes the methods in `registry.go` to improve readability and group
related functionality. Public API methods are placed first, followed by
administrative helpers, GC logic, and finally callbacks and statistics
propagation. This is to minimize the delta for subsequent changes.

No logical changes are included in this commit.

* refactor: Simplify managedQueue to a decorator

Removes the complex, stateful signaling mechanism from the
`managedQueue`.
Its sole responsibility is now to decorate a `SafeQueue` with atomic,
strictly consistent statistics tracking.

This change is the foundational first step in moving the registry from a
complex, event-driven lifecycle model to a simpler, lease-based one.
The now-unused event and lifecycle types are also removed.

* refactor: Adapt shard to new managedQueue contract

Updates the `registryShard` to work with the new, signal-free
`managedQueue`.

The shard's responsibility is simplified; it no longer needs to handle
or propagate lifecycle signals from its queues. Its draining logic is
now based on directly checking its aggregate length, removing the need
for the `BecameDrained` signal.

* feat: Replace events lifecycle with leases

Replaces the registry's core lifecycle management system. The complex,
event-driven, actor-based model is removed in favor of a simpler and
more performant lease-based lifecycle.

This commit introduces:
- A new `WithConnection` client API that provides a safe, leak-proof
  entry point for managing flow leases via reference counting.
- A per-flow `RWMutex` to provide fine-grained locking, allowing the
  garbage collector to operate on one flow without blocking others.
- A simplified GC that periodically scans for flows with a zero lease
  count.

Simultaneously, this commit removes the old machinery:
- The central event-processing loop (`Run` is now just for GC).
- The `RegisterOrUpdateFlow` administrative method.
- The entire signaling and event-handling subsystem.

* test: Overhaul tests for new lease-based design

Updates all tests in the registry package to align with the new
lease-based API and simplified component responsibilities.

- `managedqueue_test` is rewritten to focus on stats propagation and
   invariants, removing all signal testing.
- `shard_test` is updated to reflect its simpler role and new draining
   logic.
- `registry_test` is completely overhauled to test the `WithConnection`
   API, JIT registration, and the new lease-based GC logic, using a fake
  clock for deterministic lifecycle testing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants