
Conversation

@vMaroon
Member

@vMaroon vMaroon commented Dec 14, 2025

Summary

Implemented a pod reconciler controller, along with the required supporting logic, that manages per-pod ZMQ subscribers for KV-events processing. Also moved the kvevents package up to the same level as the kvcache library, where it should have been all along.

Components:

Added integration and unit tests for all new functionality, and updated documentation and examples.

Related Issues

Copilot AI review requested due to automatic review settings December 14, 2025 23:05
Contributor

Copilot AI left a comment


Pull request overview

This PR implements automatic vLLM pod discovery with dynamic ZMQ subscriber management for KV-cache events. The implementation adds a Kubernetes pod reconciler controller that watches vLLM pods and automatically creates/removes ZMQ subscribers based on pod lifecycle. The kvevents package has been moved from pkg/kvcache/kvevents to pkg/kvevents to reflect its independent status as a top-level component.

Key changes:

  • Pod reconciler controller for automatic per-pod ZMQ subscriber management
  • Thread-safe SubscriberManager for managing multiple concurrent ZMQ connections
  • Configuration support for both global socket mode and pod discovery mode
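
To make the reconciliation flow concrete, here is a minimal sketch of the pattern; the PodReconciler fields, the RemoveSubscriber hook, and the PodIP handling are assumptions, while the EnsureSubscriber argument order mirrors the scorer call site quoted later in this thread.

package controller

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// SubscriberManager is sketched as an interface here; RemoveSubscriber is an
// assumed cleanup hook, and EnsureSubscriber mirrors the call site quoted below.
type SubscriberManager interface {
	EnsureSubscriber(ctx context.Context, podKey, endpoint, topicFilter string, replace bool) error
	RemoveSubscriber(podKey string)
}

type PodReconciler struct {
	client.Client
	Subscribers SubscriberManager
	SocketPort  int
	TopicFilter string
}

func (r *PodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var pod corev1.Pod
	if err := r.Get(ctx, req.NamespacedName, &pod); err != nil {
		if apierrors.IsNotFound(err) {
			// Pod deleted: tear down its subscriber.
			r.Subscribers.RemoveSubscriber(req.NamespacedName.String())
			return ctrl.Result{}, nil
		}
		return ctrl.Result{}, err
	}
	if pod.Status.PodIP == "" {
		return ctrl.Result{}, nil // not ready yet; wait for the next event
	}
	endpoint := fmt.Sprintf("tcp://%s:%d", pod.Status.PodIP, r.SocketPort)
	// Which context the subscriber should actually run on is discussed in the thread below.
	return ctrl.Result{}, r.Subscribers.EnsureSubscriber(ctx, req.NamespacedName.String(), endpoint, r.TopicFilter, true)
}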

Reviewed changes

Copilot reviewed 11 out of 15 changed files in this pull request and generated 15 comments.

Summary per file:

  • pkg/kvevents/zmq_subscriber.go: New ZMQ subscriber implementation for connecting to individual vLLM pods
  • pkg/kvevents/subscriber_manager.go: Thread-safe manager for multiple ZMQ subscribers with lifecycle management
  • pkg/kvevents/controller/pod_reconciler.go: Kubernetes controller that watches pods and manages subscribers
  • pkg/kvevents/events.go: Event type definitions moved from the kvcache package
  • pkg/kvevents/pool.go: Updated pool configuration with pod discovery support; removed the global subscriber
  • pkg/kvevents/doc.go: Package documentation for the kvevents system
  • tests/integration/kv_events_test.go: Integration tests for Pool and SubscriberManager
  • pkg/kvevents/subscriber_manager_test.go: Unit tests for subscriber manager functionality
  • examples/kv_events/pod_reconciler/main.go: Example demonstrating pod reconciler usage
  • examples/kv_events/online/main.go: Updated import path for the kvevents package
  • examples/kv_cache_index_service/server/server.go: Updated import path for the kvevents package
  • examples/helper/events.go: Updated import path for the kvevents package
  • docs/configuration.md: Comprehensive documentation for pod discovery configuration
  • go.mod: Added the k8s.io/api dependency for the pod reconciler
  • go.sum: Updated dependency checksums
Comments suppressed due to low confidence (2)

pkg/kvevents/pool.go:293 and pkg/kvevents/pool.go:345

  • Inconsistent casing in the device tier default. The constant defaultEventSourceDeviceTier has the value "GPU" (uppercase), but when Medium is provided by vLLM it is converted to lowercase with strings.ToLower. The default tier will therefore be "GPU" while custom tiers will be lowercase like "cpu" or "disk", which is inconsistent. Either the default should be "gpu" (lowercase) or the custom values should not be converted to lowercase.
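
A minimal sketch of the lowercase-default option among the two fixes above; the constant name comes from the comment, while the helper function and its signature are assumptions:

package kvevents

import "strings"

// Lowercase default so it matches the normalized custom tiers.
const defaultEventSourceDeviceTier = "gpu"

// deviceTier is a hypothetical helper: it normalizes both the default and any
// vLLM-reported Medium ("CPU", "GPU", "DISK", ...) to lowercase.
func deviceTier(medium string) string {
	if medium == "" {
		return defaultEventSourceDeviceTier
	}
	return strings.ToLower(medium)
}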


Comment on lines +79 to +82
subCtx, cancel := context.WithCancel(ctx)
go subscriber.Start(subCtx)

Copilot AI Dec 14, 2025


Potential context leak: The subscriber is started with context.WithCancel(ctx) where ctx is the request context passed to EnsureSubscriber. If this is a reconciler request context, it may be canceled before the subscriber should stop. The subscriber goroutines should likely use a longer-lived context (like a manager context) rather than the request context, with the cancel func used only for explicit cleanup.
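
A hedged sketch of the suggested fix, deriving the subscriber's context from a long-lived manager context rather than the caller's ctx; the startSubscriber helper and the subscriber interface are assumptions:

package kvevents

import "context"

// Hedged sketch of the fix: derive the subscriber's context from a long-lived
// manager context (set once, e.g. in main) instead of the per-call ctx, and
// keep the cancel func only for explicit cleanup on pod removal or shutdown.
type subscriber interface{ Start(context.Context) }

func startSubscriber(managerCtx context.Context, sub subscriber) context.CancelFunc {
	subCtx, cancel := context.WithCancel(managerCtx) // not the request/caller ctx
	go sub.Start(subCtx)
	return cancel
}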

Collaborator

@sagearc sagearc Jan 19, 2026


Copilot's review here seems like an important catch. My understanding is that controller-runtime cancels the Reconcile context immediately after the reconcile function returns. If that's the case, won't the subscriber goroutine receive that cancellation and shut down almost immediately, so we'd lose the connection?

Should we be using a detached context (like context.Background()) here since the subscription needs to outlive the reconciliation request?

Member Author

@vMaroon vMaroon Jan 19, 2026


Copilot hallucinated this being the request context. The context passed here, as can be seen in the scorer reference, is indeed context.Background(). It's just up to the caller to set the context.

Comment on lines +70 to +73
entry.cancel()
delete(sm.subscribers, podIdentifier)

Copilot AI Dec 14, 2025


Potential resource leak: when a subscriber's context is canceled (either by endpoint change on line 70 or during shutdown on line 121), the goroutine running subscriber.Start will exit, but the ZMQ socket cleanup happens in runSubscriber's defer. However, there's no guarantee that the socket is properly closed before the subscriberEntry is removed from the map. Consider adding a sync mechanism or wait group to ensure the subscriber goroutine has fully cleaned up before removing the entry.
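
A minimal sketch of the synchronization Copilot suggests: each entry carries a done channel that the subscriber goroutine closes after its deferred socket cleanup, and removal waits on it. The entry fields and removeEntry helper are assumptions, not the PR's actual types:

package kvevents

import (
	"context"
	"sync"
)

type subscriberEntry struct {
	cancel context.CancelFunc
	done   chan struct{} // closed by the subscriber goroutine after ZMQ socket cleanup
}

type subscriberMap struct {
	mu      sync.Mutex
	entries map[string]*subscriberEntry
}

// removeEntry cancels the subscriber and waits for its goroutine to finish
// cleaning up before the entry is considered gone.
func (m *subscriberMap) removeEntry(podIdentifier string) {
	m.mu.Lock()
	entry, ok := m.entries[podIdentifier]
	delete(m.entries, podIdentifier)
	m.mu.Unlock()
	if !ok {
		return
	}
	entry.cancel()
	<-entry.done // blocks until runSubscriber's defer has closed the socket
}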

Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@vMaroon vMaroon force-pushed the active-active-ha branch 2 times, most recently from f0b6481 to ddf6d02 on December 14, 2025 23:42
@vMaroon vMaroon requested a review from hyeongyun0916 January 4, 2026 15:31
@vMaroon vMaroon requested a review from sagearc January 16, 2026 20:11
@vMaroon
Member Author

vMaroon commented Jan 16, 2026

Tested with image quay.io/vmaroon/llm-d-inference-scheduler:ha and the following configuration:

      - type: precise-prefix-cache-scorer
        parameters:
          tokenProcessorConfig:
            blockSize: 64
            hashSeed: "42"
          indexerConfig:
            tokenizersPoolConfig:
              modelName: "Qwen/Qwen3-32B"
              hf:
                tokenizersCacheDir: "/tmp/tokenizers"
          kvEventsConfig:
            topicFilter: "kv@"
            concurrency: 8
            discoverPods: true
            podDiscoveryConfig:
              socketPort: 5556

with vLLM (port 5556 exposed):

--kv-events-config "{\"enable_kv_cache_events\":true,\"publisher\":\"zmq\",\"endpoint\":\"tcp://*:5556\",\"topic\":\"kv@${POD_IP}@Qwen/Qwen3-32B\"}"

typedName plugins.TypedName
kvCacheIndexer *kvcache.Indexer

// until the IGW data-layer is ready to provide endpoint events,
Member Author


@elevran this will be PR'd into the scheduler. Note that it includes other updates required when syncing with the upcoming v0.5 release.

Collaborator

@sagearc sagearc left a comment


Great work @vMaroon! I’ve published part of the review and will follow up with the rest shortly.

### KV-Event Pool Configuration (`Config`)

Configures the ZMQ event processing pool for handling KV cache events.
Configures the ZMQ event processing pool for handling KV cache events. The pool supports two modes:
Collaborator


The current naming (GlobalSocket vs PodReconciler) reflects implementation details. I suggest renaming these to reflect the intent rather than the mechanism, for instance:

  • Global Socket Mode -> Static Endpoint Mode (Implies a fixed target)
  • Pod Reconciler Mode -> Auto-Discovery Mode (Implies dynamic finding of targets)
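
For illustration, a hedged sketch of how the two modes could read with intent-based names; the DiscoverPods/PodDiscoveryConfig/SocketPort fields mirror the configuration quoted earlier in this thread, while StaticEndpoint is a hypothetical stand-in for today's global-socket setting:

package kvevents

// Config sketch only; field names other than the pod-discovery ones are not
// taken from the PR.
type Config struct {
	TopicFilter string
	Concurrency int

	// Static Endpoint Mode (today: "global socket"): one fixed ZMQ endpoint.
	// StaticEndpoint is a hypothetical field name.
	StaticEndpoint string

	// Auto-Discovery Mode (today: "pod reconciler"): one subscriber per
	// discovered vLLM pod.
	DiscoverPods       bool
	PodDiscoveryConfig *PodDiscoveryConfig
}

type PodDiscoveryConfig struct {
	SocketPort int
}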

Comment on lines 3 to 17
/*
Copyright 2025 The llm-d Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
Collaborator


Is this intentional?

Member Author


Fixed

// initialize the subscribers cache only if pod discovery is enabled
if config.KVEventsConfig.DiscoverPods == true {
	// initialize the subscribers TTL cache
	subscriptionTimeout := 10 * time.Minute
Collaborator


Why not keep the connection alive (assuming a specific pod might actually not receive incoming requests for that period)?

Member Author

@vMaroon vMaroon Jan 19, 2026


As long as any request to any pod comes in, every live pod gets refreshed. If the cluster goes static (no requests at all) or a pod disappears from the serving list, this timer starts counting; 10 minutes in that state means the pod is likely gone.
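
A hedged sketch of how this refresh/expiry mechanism can be wired with the TTL cache; the import path and the remove callback are assumptions, while the 10-minute TTL and the refresh-on-Set behavior match the snippets quoted in this thread:

package kvevents

import (
	"context"
	"time"

	"github.com/jellydator/ttlcache/v3"
)

// newSubscribersCache builds the TTL cache: every scoring pass refreshes the
// entry for each served pod, and an entry that stays idle for the full TTL is
// evicted, at which point the pod's subscriber is torn down.
func newSubscribersCache(remove func(podKey string)) *ttlcache.Cache[string, struct{}] {
	cache := ttlcache.New[string, struct{}](
		ttlcache.WithTTL[string, struct{}](10 * time.Minute),
	)
	cache.OnEviction(func(_ context.Context, _ ttlcache.EvictionReason, item *ttlcache.Item[string, struct{}]) {
		remove(item.Key()) // no refresh for 10 minutes: assume the pod is gone
	})
	go cache.Start() // background expiration loop
	return cache
}

Callers refresh entries with cache.Set(podKey, struct{}{}, 0), i.e. the default TTL, as in the Score() snippet quoted below.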

Comment on lines +181 to +212
if s.kvEventsConfig.DiscoverPods == true {
	// update subscribers here temporarily
	for _, pod := range pods {
		podObj := pod.GetPod()
		if podObj == nil {
			continue
		}
		podKey := podObj.NamespacedName.String()
		s.subscribersCache.Set(podKey, struct{}{}, 0) // use default TTL

		if err := s.subscribersManager.EnsureSubscriber(context.Background(), podKey, // don't use request ctx
			fmt.Sprintf("tcp://%s:%d", podObj.Address, s.kvEventsConfig.PodDiscoveryConfig.SocketPort),
			s.kvEventsConfig.TopicFilter, true); err != nil {
			logger.Error(err, "Failed to ensure KV-events subscriber for pod", "pod", podKey,
				"endpoint", podObj.Address)
			continue
		}
	}
}
Collaborator


This implementation mixes network discovery into the critical Score() path.

While a full refactor to an asynchronous mechanism is out of scope for this release, I believe we should document this as technical debt. We can add a TODO noting that this discovery logic belongs in a separate async loop (similar to events.Pool), not in the synchronous scoring path.
What do you think?

Member Author


There is a TODO above the subscribersCache. This update mechanism is short-lived state; it does not belong here at all, but it is the minimum we can do right now to ship the feature for trial. The llm-d precise prefix-cache scheduling guide won't switch to discovery; it will only gain an active-active HA sub-guide for now.
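
For reference, a rough sketch of the asynchronous alternative discussed above, with discovery refreshed on a timer instead of inside Score(); the function and its refresh callback are placeholders, not part of this PR:

package kvevents

import (
	"context"
	"time"
)

// runDiscoveryLoop periodically invokes a refresh callback that would list the
// current serving pods and call EnsureSubscriber for each, keeping discovery
// out of the synchronous scoring path.
func runDiscoveryLoop(ctx context.Context, interval time.Duration, refresh func(context.Context)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			refresh(ctx)
		}
	}
}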

Comment on lines +242 to +313
func (s *PrecisePrefixCacheScorer) getScores(ctx context.Context, request *types.LLMRequest) (map[string]float64, error) {
	logger := log.FromContext(ctx).WithName(s.typedName.String())
	traceLogger := logger.V(logutil.TRACE)

	traceLogger.Info("Getting scores",
		"isChatCompletions", request.Body != nil && request.Body.ChatCompletions != nil,
		"isCompletions", request.Body != nil && request.Body.Completions != nil)

	// The upstream parser guarantees exactly one body is populated, but we defensively prioritize chat completions.
	// If an unexpected dual payload slips through (parser regression/new client), log it and use chat semantics.
	if request.Body != nil && request.Body.ChatCompletions != nil {
		if request.Body.Completions != nil {
			traceLogger.Info("Both chat/completions and completions present; defaulting to chat/completions")
		}

		renderReq := &preprocessing.ApplyChatTemplateRequest{
			Conversation:              make([][]preprocessing.Conversation, 0),
			Tools:                     request.Body.ChatCompletions.Tools,
			Documents:                 request.Body.ChatCompletions.Documents,
			ChatTemplate:              request.Body.ChatCompletions.ChatTemplate,
			ReturnAssistantTokensMask: request.Body.ChatCompletions.ReturnAssistantTokensMask,
			ContinueFinalMessage:      request.Body.ChatCompletions.ContinueFinalMessage,
			AddGenerationPrompt:       request.Body.ChatCompletions.AddGenerationPrompt,
			ChatTemplateKWArgs:        request.Body.ChatCompletions.ChatTemplateKWArgs,
		}

		// Convert messages to the format expected by the renderer
		for _, msg := range request.Body.ChatCompletions.Messages {
			renderReq.Conversation = append(renderReq.Conversation, []preprocessing.Conversation{{
				Role:    msg.Role,
				Content: msg.Content.Raw,
			}})
		}

		traceLogger.Info("Processing chat completion request",
			"conversationCount", len(renderReq.Conversation),
			"toolsCount", len(renderReq.Tools),
			"documentsCount", len(renderReq.Documents))

		scores, err := s.kvCacheIndexer.GetPodScores(ctx, renderReq, "", request.TargetModel, nil)
		if err != nil {
			return nil, fmt.Errorf("failed to get pod scores for chat/completions: %w", err)
		}
		return scores, nil
	}

	// For regular completions, use the prompt directly
	if request.Body != nil && request.Body.Completions != nil {
		prompt := request.Body.Completions.Prompt
		traceLogger.Info("Using completion prompt directly", "promptLength", len(prompt))

		scores, err := s.kvCacheIndexer.GetPodScores(ctx, nil, prompt, request.TargetModel, nil)
		if err != nil {
			return nil, fmt.Errorf("failed to get pod scores for completions: %w", err)
		}
		return scores, nil
	}

	return nil, errors.New("no valid input found in request")
Collaborator


Later we should decouple the chat template rendering and move it into the tokenizer.

@vMaroon vMaroon requested a review from sagearc January 19, 2026 14:40
- fix zmq connection mgmt

Signed-off-by: Maroon Ayoub <[email protected]>
@vMaroon
Member Author

vMaroon commented Jan 19, 2026

Thanks @sagearc, addressed.

Collaborator

@sagearc sagearc left a comment


The implementation looks good to me. My only concern is the context handling in the Reconcile loop, which will likely cause subscribers to drop immediately.

Comment on lines +79 to +82
subCtx, cancel := context.WithCancel(ctx)
go subscriber.Start(subCtx)
Collaborator

@sagearc sagearc Jan 19, 2026


Copilot's review here seems like an important catch. My understanding is that controller-runtime cancels the Reconcile context immediately after the reconcile function returns. If that's the case, won't the subscriber goroutine receive that cancellation and shut down almost immediately, so we'd lose the connection?

Should we be using a detached context (like context.Background()) here since the subscription needs to outlive the reconciliation request?

@sagearc
Collaborator

sagearc commented Jan 19, 2026

/lgtm

@vMaroon
Member Author

vMaroon commented Jan 19, 2026

/approve

@vMaroon vMaroon added the lgtm label Jan 19, 2026
@vMaroon vMaroon merged commit 8877f62 into main Jan 19, 2026
5 checks passed
@vMaroon vMaroon deleted the active-active-ha branch January 19, 2026 18:11
var subscribersCache *ttlcache.Cache[string, struct{}]

// initialize the subscribers cache only if pod discovery is enabled
if config.KVEventsConfig.DiscoverPods == true {
Collaborator


Comment on lines +36 to +45
ctx, cancel := context.WithCancel(context.Background())
defer cancel()

// Setup graceful shutdown
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)
go func() {
	<-sigChan
	cancel()
}()
Collaborator


We could simplify this signal handling logic by using ctrl.SetupSignalHandler(), which does the same thing internally.
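
For reference, a minimal sketch of that simplification; ctrl.SetupSignalHandler() returns a context that is canceled on SIGINT/SIGTERM, so the manual channel wiring quoted above goes away (the manager setup is elided):

package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Returns a context canceled on SIGINT/SIGTERM, replacing the manual
	// signal.Notify wiring quoted above.
	ctx := ctrl.SetupSignalHandler()

	// ... build the manager, then (see the next comment) simply:
	// if err := mgr.Start(ctx); err != nil { /* handle error */ }
	_ = ctx
}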

logger.Info("Topic filter", "filter", poolConfig.TopicFilter)

// Start the manager (this will start the reconciler)
mgrCtx, mgrCancel := context.WithCancel(ctx)
Collaborator


Is there a specific reason for creating a separate mgrCtx here instead of passing the ctx received?
If we apply the ctrl.SetupSignalHandler() suggestion above, since ctx would already handle signal cancellation, it seems like we could simplify this to just mgr.Start(ctx).

@hyeongyun0916
Collaborator

Once this PR is merged, will the scheduler need to add a reconciler similar to the one in this example to enable the pod discovery feature?

"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/plugins"
"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/plugins"
"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/framework"
"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/framework/plugins/multi/prefix"
Collaborator


It looks like the ttlcache package is being used below but hasn't been imported yet.


Labels: lgtm

Projects: None yet

Development: Successfully merging this pull request may close these issues: Active-Active HA

4 participants