Last Update: 2026-03-09
This document contains select component-level (i.e., low-level) architectural views, such as UML sequence diagrams (documenting certain interesting behaviors) and UML class diagrams, pertaining to the LLM-D Inference Scheduler, which extends the Gateway API Inference Extension (a.k.a. GIE or IGW). The Inference Scheduler implements optimized request-routing logic that leverages disaggregated Prefill-Decode (PD) inference cluster topologies.
For high-level architectural views, consult the repos that this document references, as well as this excellent video by Robert Shaw presenting a technical overview of LLM-D's features:
- Purpose: provide a concise, developer-focused reference tying configuration to runtime behavior so contributors can find where to implement new scheduling logic.
- Scope: explains extension points (`PrepareData`, `Filter`, `Scorer`, `ProfileHandler`, `PreRequest`), the scheduling lifecycle (Prepare → Filter → Score → Pick → PreRequest/Response), and common plugin implementations with direct links to source files.
- Deliverables: UML class/sequence diagrams, per-plugin examples and appendices, and configuration-to-runtime mapping to speed discovery and safe changes.
- Use: read this first to decide the correct extension point and then follow the linked factories and files to implement or register plugins.
- Not exhaustive: this document does not attempt to list every plugin, interface, or runtime behavior across the repository or upstream dependencies — omissions are expected.
- Work in progress: content, examples, and links may change; treat this as a living reference and open issues/PRs for corrections or additions.
- Focus: the document concentrates on the LLM-D Inference Scheduler and scheduler-related extension points; other components are referenced but not comprehensively documented here.
Upon starting, the EPP performs the following initialization sequence:
- cmd/epp/main.go — calls `plugins.RegisterAllPlugins()` and starts the `Runner`.
- pkg/plugins/register.go — repository plugin factory registrations.
- Runner / plugin configuration (dependency): github: gateway-api-inference-extension/cmd/epp/runner — `parsePluginsConfiguration()` and built-in plugin registration.
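This registration flow can be sketched as a minimal, self-contained registry. The type and function names below are illustrative stand-ins, not the actual framework API:

```go
package main

import "fmt"

// Plugin is a stand-in for the framework's plugin interface (illustrative only;
// the real interface lives in the gateway-api-inference-extension framework).
type Plugin interface {
	TypedName() string
}

// Factory mirrors the general shape of a plugin factory: an instance name plus
// raw configuration parameters.
type Factory func(name string, rawParams []byte) (Plugin, error)

// registry plays the role of the table populated by plugins.RegisterAllPlugins().
var registry = map[string]Factory{}

// Register makes a factory discoverable by plugin type.
func Register(pluginType string, f Factory) { registry[pluginType] = f }

// byLabelFilter is a toy plugin standing in for pkg/plugins/filter/by_label.go.
type byLabelFilter struct{ name string }

func (f *byLabelFilter) TypedName() string { return "by-label/" + f.name }

func main() {
	// At startup, each plugin type registers its factory once...
	Register("by-label", func(name string, _ []byte) (Plugin, error) {
		return &byLabelFilter{name: name}, nil
	})
	// ...and the Runner later instantiates named instances from the parsed config.
	p, err := registry["by-label"]("canary-filter", nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(p.TypedName()) // by-label/canary-filter
}
```

The real factories receive structured parameters and validate them at startup (see the per-plugin "Factory validation" notes later in this document).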
At the heart of the EPP (which extends the Gateway API Inference Extension) is the Director, which is responsible for managing the lifecycle of incoming inference requests. This section describes the key classifiers and shows their collaboration via a UML sequence diagram:
- `Director`: request orchestration and plugin invocation.
  - Orchestrates the entire per-request lifecycle: accepts `reqCtx`, fetches the `InferenceObjective`, and sets defaults (priority, timeouts) when none is present. The `InferenceObjective` is a lightweight policy object (target model/profile, priority, placement hints, and tags), usually supplied by the request originator, an API/adapter, or operator configuration, and persisted in the Datastore. Plugins consult the `InferenceObjective` to influence behavior.
  - Calls the admission controller (`Admit(...)`) to allow or deny requests early, preventing unnecessary work for denied requests.
  - Uses `contracts.PodLocator` to discover candidate pods and converts them into scheduler endpoints.
  - Runs `PrepareDataPlugin` implementations to enrich or decorate the `LLMRequest` before scheduling based on `InferenceObjective` fields (preferred model, profile, or request-level hints); failures/timeouts are logged and treated fail-open. These plugins receive the `LLMRequest` and candidate endpoints, may mutate or attach metadata, and should avoid blocking indefinitely. Notable upstream/framework `PrepareDataPlugin` implementations referenced or used by the scheduler:
    - `prefix` prepare plugin (hashes the prompt and attaches longest-prefix match info to endpoints; see Appendix G)
    - `predicted-latency` prepare hooks (prepares SLO context and predictions used by latency-aware scoring; see Appendix H)
  - Runs `AdmissionPlugin` implementations, which evaluate `InferenceObjective` constraints (priority, tenancy, policy) and can allow or deny a request (return `nil` to allow, a non-nil `error` to deny). Admission plugins evaluate the `LLMRequest` and endpoints and return allow/deny with optional reasons.
  - Runs the `Scheduler`, which runs the concrete `Filter` + `Scorer` pipeline to apply objective preferences when ranking or excluding endpoints (e.g., prefer specific profiles or locality). The `Filter` and `Scorer` interfaces are declared upstream; examples live in pkg/plugins/filter and pkg/plugins/scorer. See the `Scheduler` (interface) and its concrete implementation.
    - pkg/plugins/filter/by_label.go: a `Filter` that selects endpoints by a single label key and a whitelist of allowed values. See Appendix A.
    - pkg/plugins/filter/by_label_selector.go: a `Filter` that applies a Kubernetes-style label selector against endpoint `Metadata.Labels`. See Appendix B.
    - pkg/plugins/filter/pd_role.go: PD-role `Filter` for disaggregated Prefill-Decode (PD) inference clusters. See Appendix C.
    - pkg/plugins/scorer/load_aware.go: a `Scorer` that scores endpoints inversely proportional to current observed load. See Appendix D.
    - pkg/plugins/scorer/no_hit_lru.go: a `Scorer` implementing a no-hit LRU strategy to prefer warm endpoints and reduce cold starts. See Appendix E.
    - pkg/plugins/scorer/precise_prefix_cache.go: a prefix-cache-aware `Scorer` that rewards endpoints with matching cached prefixes. See Appendix F.
  - Runs the Scheduler to pick a `TargetPod`/endpoint and returns a `SchedulingResult` used to prepare the outgoing request.
  - Executes `PreRequest` plugins to finalize headers, routing, or authentication and then returns a populated `reqCtx` to the handler.
    - pkg/plugins/pre-request/pd_prerequest.go: prepares the outbound request to the model server (target selection, header injection, auth headers).
    - pkg/plugins/scorer/active_request.go: `ActiveRequest` plugin that annotates/records active-request state before dispatch.
    - pkg/plugins/scorer/no_hit_lru.go: `NoHitLRU` PreRequest that restores/updates cold-start state and related metrics before sending requests.
  - On response paths, invokes response plugins for received, streaming, and complete events (for logging, metrics, or post-processing). Response plugins handle response events (received, streaming chunks, completion) and can transform, log, or emit metrics.
  - Exposes helper utilities used across flows (model rewrites, weighted model selection, metric conversion, random endpoint selection).
- Scheduling types (dependency): https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/master/pkg/epp/framework/interface/scheduling/types.go — `LLMRequest` and `SchedulingResult`.
  - `LLMRequest`: canonical request model carrying payload, model id, metadata, priority, and timeouts passed to plugins and the scheduler.
  - `SchedulingResult`: encapsulates the chosen pod/endpoint, scoring details, rewrites, and routing hints the Director uses to prepare the outbound request.
- Pod locator / Scheduler: `contracts.PodLocator` (see Director) and scheduler implementations under pkg/scheduling/pd.
  - `contracts.PodLocator`: abstract discovery API used to list candidate pods matching request metadata (model, labels, locality).
  - Scheduler implementations convert discovered pods to endpoints, apply filters and scorers, and run selection algorithms.
  - The scheduling pipeline uses filter and scorer plugins (and may apply weighted-model selection or rewrite rules) to return the best endpoint(s).
- Prepare/PreRequest/Admission/Response plugin implementations in the `llm-d-inference-scheduler` repo: pkg/plugins/pre-request/pd_prerequest.go, scorer plugins under pkg/plugins/scorer, filter plugins under pkg/plugins/filter, and profile/PD deciders under pkg/plugins/profile.
  - Profile / PD deciders (`pkg/plugins/profile`): influence placement decisions (PD selection, fallback pools, priority-specific rules).
  - `Filter` (interface) plugins (pkg/plugins/filter): exclude or mutate candidate endpoints before scoring. Declaration: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/master/pkg/epp/framework/interface/scheduling/plugins.go — method: `Filter(ctx, cycleState, request, pods) []Endpoint`.
  - `Scorer` (interface) plugins (pkg/plugins/scorer): compute numeric scores used by the scheduler to rank endpoints. Declaration: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/master/pkg/epp/framework/interface/scheduling/plugins.go — method: `Score(ctx, cycleState, request, pods) map[Endpoint]float64` (scores normalized to [0,1]) and `Category() ScorerCategory`.
  - pkg/plugins/pre-request/pd_prerequest.go: prepares the outbound request to the model server (target selection, header injection, auth headers).
  - All repo plugins are registered via `pkg/plugins/register.go` and configured by the Runner; they must follow the defined interfaces and error-handling semantics.
- The `Scheduler` is responsible for converting candidate endpoints into a ranked/filtered selection suitable for routing. It is driven by two concepts:
  - SchedulerProfile (see Appendix I): a per-profile configuration that lists:
    - zero or more `Filter` plugins
    - zero or more weighted `Scorer` plugins
    - and a single `Picker` plugin.

    A profile encapsulates a routing strategy (for example: `decode` vs `prefill`, or `shadowing` vs `production`). See https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/master/pkg/epp/scheduling/scheduler_profile.go.
  - `ProfileHandler`: a single plugin instance per `Scheduler` that decides, for each scheduling cycle, which profiles should run (the `Pick` extension point) and how to consolidate their results into a final `SchedulingResult` (the `ProcessResults` extension point). Examples and implementations live under `pkg/plugins/profile` (notably the PD-aware handler `pd_profile_handler.go`).
Key behaviors (from the code):
- The `Scheduler.Schedule(...)` loop repeatedly calls the configured `ProfileHandler.Pick(...)` to obtain a set of `SchedulerProfile` objects to run for this cycle. The loop continues until `Pick` returns an empty map.
- For each selected profile the Scheduler calls `profile.Run(...)`. `Run` executes the profile's plugins in strict order: `Filters` -> `Scorers` -> `Picker`. If filters remove all endpoints the profile run returns an error.
  - `Filter` plugins prune or mutate candidate endpoints before scoring. If all endpoints are removed the profile run fails and its result is recorded as `nil` (the ProfileHandler sees this in `profileResults`).
  - `Scorer` plugins return per-endpoint scores (normalized to [0,1]). The profile accumulates weighted scores across scorers using `WeightedScorer` weight values.
  - The `Picker` plugin selects the final endpoint(s) from the scored candidates.
- After all selected profiles have run, the Scheduler calls `ProfileHandler.ProcessResults(...)` to aggregate profile outputs and pick the `PrimaryProfileName` (which determines the default target endpoint used by the Director).
- The scheduler records plugin latencies and an overall scheduler E2E latency metric.
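The Pick → Run → ProcessResults loop described above can be sketched as follows. The types and a trivial single-profile handler here are toys for illustration, not the real framework signatures:

```go
package main

import (
	"errors"
	"fmt"
)

type Endpoint string

type ProfileRunResult struct{ Picked Endpoint }

// Profile is a toy SchedulerProfile: Run stands in for Filters -> Scorers -> Picker.
type Profile struct {
	Name string
	Run  func(candidates []Endpoint) (*ProfileRunResult, error)
}

// ProfileHandler mirrors the two extension points: Pick and ProcessResults.
type ProfileHandler interface {
	Pick(results map[string]*ProfileRunResult) map[string]Profile
	ProcessResults(results map[string]*ProfileRunResult) (Endpoint, error)
}

// Schedule imitates Scheduler.Schedule: loop until Pick returns an empty map,
// then let the handler consolidate the per-profile results.
func Schedule(h ProfileHandler, candidates []Endpoint) (Endpoint, error) {
	results := map[string]*ProfileRunResult{}
	for {
		profiles := h.Pick(results)
		if len(profiles) == 0 {
			break
		}
		for name, p := range profiles {
			r, err := p.Run(candidates)
			if err != nil {
				results[name] = nil // the handler sees a failed run as nil
				continue
			}
			results[name] = r
		}
	}
	return h.ProcessResults(results)
}

// singleProfileHandler runs one "default" profile exactly once.
type singleProfileHandler struct{ ran bool }

func (h *singleProfileHandler) Pick(map[string]*ProfileRunResult) map[string]Profile {
	if h.ran {
		return nil
	}
	h.ran = true
	return map[string]Profile{"default": {
		Name: "default",
		Run: func(c []Endpoint) (*ProfileRunResult, error) {
			if len(c) == 0 {
				return nil, errors.New("no endpoints after filtering")
			}
			return &ProfileRunResult{Picked: c[0]}, nil
		},
	}}
}

func (h *singleProfileHandler) ProcessResults(r map[string]*ProfileRunResult) (Endpoint, error) {
	if r["default"] == nil {
		return "", errors.New("primary profile failed")
	}
	return r["default"].Picked, nil
}

func main() {
	ep, err := Schedule(&singleProfileHandler{}, []Endpoint{"pod-a:8000", "pod-b:8000"})
	fmt.Println(ep, err) // pod-a:8000 <nil>
}
```

A PD-aware handler would instead return the `decode` profile on the first `Pick`, inspect the decode result, and conditionally return `prefill` on the next cycle, as described in the PD notes below.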
PD (Disaggregated Prefill‑Decode) notes:
- The PD-aware `PdProfileHandler` drives decode-first logic: it instructs the Scheduler to always run a `decode` profile first (so decode endpoints are located/scored), then decides whether to run a `prefill` profile based on a `pdDecider` plugin and the decode run results (see `pkg/plugins/profile/pd_profile_handler.go`).
- `ProcessResults` in the PD handler will transform decode results into a Data-Parallel form when `primaryPort` is configured (it rewrites endpoint metadata ports and populates a header with the decode pod for subsequent PreRequest handling). When `prefill` also ran, `ProcessResults` includes both profile results in the returned `SchedulingResult`.
This sequence shows how the EPP creates and populates the Datastore at startup and incrementally via pod watches. This behavior is implemented in the IGW, which the Inference Scheduler inherits.
- Created at startup: `Runner` calls `datastore.NewDatastore(ctx, epFactory, ...)` to allocate internal maps and wire an `EndpointFactory`. See cmd/epp/runner/runner.go and pkg/epp/datastore/datastore.go.
- Pool bootstrap (full resync): `InferencePoolReconciler` calls `Datastore.PoolSet(...)`; `PoolSet` stores the pool and runs `podResyncAll(...)`, which lists pods via the controller `reader` and seeds the store via `PodUpdateOrAddIfNotExist(...)` for ready pods.
- Pod watch updates: `PodReconciler` receives pod create/update/delete events filtered by pool selector and calls `PodUpdateOrAddIfNotExist(...)` (for ready/matching pods) or `PodDelete(...)` (for removed/not-ready pods). See pkg/epp/controller/pod_reconciler.go.
- Endpoint creation & lifecycle: `PodUpdateOrAddIfNotExist` builds `EndpointMetadata` per target port and either creates endpoints via `EndpointFactory.NewEndpoint(...)` or updates an existing endpoint's metadata (`ep.UpdateMetadata(...)`). `EndpointFactory` starts per-endpoint collectors; `ReleaseEndpoint` tears them down on deletion. See pkg/epp/datastore/datastore.go and pkg/epp/datalayer/factory.go.
- Querying: runtime consumers (e.g., `requestcontrol.PodLocator`) call `Datastore.PodList(predicate)` to obtain candidate endpoints. `PodList` iterates the internal map and returns a slice of endpoints matching the predicate. See pkg/epp/requestcontrol/locator.go and `PodList` in pkg/epp/datastore/datastore.go.
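The `PodList(predicate)` query pattern amounts to a predicate-filtered scan over the endpoint map; a simplified sketch (illustrative types, not the real datastore):

```go
package main

import "fmt"

// EndpointMetadata is a simplified stand-in for the datastore's per-pod record.
type EndpointMetadata struct {
	Name   string
	Labels map[string]string
	Ready  bool
}

// datastore holds endpoints keyed by pod name, as seeded by pod watch events.
type datastore struct {
	endpoints map[string]EndpointMetadata
}

// PodList returns every endpoint matching the predicate, mirroring the
// iterate-and-filter behavior described for Datastore.PodList.
func (d *datastore) PodList(pred func(EndpointMetadata) bool) []EndpointMetadata {
	var out []EndpointMetadata
	for _, ep := range d.endpoints {
		if pred(ep) {
			out = append(out, ep)
		}
	}
	return out
}

func main() {
	d := &datastore{endpoints: map[string]EndpointMetadata{
		"pod-a": {Name: "pod-a", Labels: map[string]string{"llm-d.ai/role": "decode"}, Ready: true},
		"pod-b": {Name: "pod-b", Labels: map[string]string{"llm-d.ai/role": "prefill"}, Ready: false},
	}}
	ready := d.PodList(func(ep EndpointMetadata) bool { return ep.Ready })
	fmt.Println(len(ready), ready[0].Name) // 1 pod-a
}
```

In the real datastore the map is guarded for concurrent access by reconcilers and request-time readers; that locking is omitted here.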
- Configuration: `label` (the label key to check), `validValues` (array of acceptable label values), and `allowsNoLabel` (boolean; if true, endpoints missing the label are included).
- Factory validation: the plugin factory validates `name` and `label` at startup and requires either `validValues` be non-empty or `allowsNoLabel=true` to avoid accidental exclusion of all endpoints.
- Runtime behavior: the filter returns endpoints whose `Metadata.Labels[label]` is present in `validValues`. Endpoints missing the label are included only when `allowsNoLabel=true`. If the filter excludes all endpoints, the scheduler will receive an empty candidate set unless other filters or fallback logic add candidates.
- When to use: simple, single-dimension routing like model-version/profile routing, release tagging (`release=canary`), role-based placement (`role=pd`), or lightweight operational isolation. Use when you only need to check one label key and prefer a compact configuration over a full selector expression.
- Interaction: less expressive than `by_label_selector` (which supports multi-key and set-based selectors). Combine `by_label` with scorers or broader filters when you want to prefer or restrict by a single key while still allowing fallbacks.
- Caution: incorrect `validValues` or forgetting to set `allowsNoLabel` during migrations can unintentionally exclude endpoints; test configurations and prefer gradual rollouts.
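The runtime matching rule above (label value in `validValues`, unlabeled endpoints admitted only via `allowsNoLabel`) is small enough to sketch; the types here are simplified stand-ins for the repo's:

```go
package main

import "fmt"

// Endpoint is a simplified candidate endpoint with its pod labels.
type Endpoint struct {
	Name   string
	Labels map[string]string
}

// byLabelFilter mirrors the documented by_label configuration.
type byLabelFilter struct {
	label         string
	validValues   map[string]bool
	allowsNoLabel bool
}

// Filter keeps endpoints whose label value is whitelisted; unlabeled endpoints
// survive only when allowsNoLabel is true (the documented semantics).
func (f *byLabelFilter) Filter(eps []Endpoint) []Endpoint {
	var out []Endpoint
	for _, ep := range eps {
		v, ok := ep.Labels[f.label]
		if !ok {
			if f.allowsNoLabel {
				out = append(out, ep)
			}
			continue
		}
		if f.validValues[v] {
			out = append(out, ep)
		}
	}
	return out
}

func main() {
	f := &byLabelFilter{
		label:       "release",
		validValues: map[string]bool{"canary": true},
	}
	eps := []Endpoint{
		{Name: "a", Labels: map[string]string{"release": "canary"}},
		{Name: "b", Labels: map[string]string{"release": "stable"}},
		{Name: "c"}, // unlabeled: dropped because allowsNoLabel is false
	}
	for _, ep := range f.Filter(eps) {
		fmt.Println(ep.Name) // a
	}
}
```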
- Goal: route a small percentage of incoming requests to an EPP variant that has the `by_label` filter enabled (canary), while sending the remainder to a primary EPP without that filter.
- Two practical Envoy approaches:
  - Weighted-cluster split (simple, no custom sampling)

    ```yaml
    # route: split 90/10 between primary and canary EPP services
    match:
      prefix: "/"
    route:
      weighted_clusters:
        clusters:
        - name: epp-primary
          weight: 90
        - name: epp-canary
          weight: 10
    ```

    Placement (route / weighted-cluster): put the `weighted_clusters` route inside Envoy's `RouteConfiguration` for the `HttpConnectionManager` (path: `static_resources.listeners[].filter_chains[].filters[name: envoy.filters.network.http_connection_manager].typed_config.route_config.virtual_hosts[].routes[].route.weighted_clusters`). For Istio, configure an equivalent weighted split in `VirtualService.spec.http[].route[]`. Ensure clusters `epp-primary` and `epp-canary` exist in `static_resources.clusters` or are provided via CDS.
  - Header sampling + route-by-header (explicit sampling, sticky options)

    ```yaml
    http_filters:
    - name: envoy.filters.http.lua
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
        inline_code: |
          math.randomseed(os.time() + (ngx and ngx.worker and ngx.worker.pid() or 0))
          function envoy_on_request(request_handle)
            if math.random() <= 0.10 then
              request_handle:headers():add("x-epp-variant", "canary")
            end
          end
    # route-by-header rules
    - match:
        headers:
        - name: x-epp-variant
          exact_match: canary
      route: { cluster: epp-canary }
    - match: { prefix: "/" }
      route: { cluster: epp-primary }
    ```

    Placement (Lua / header sampling): add the Lua `http_filters` snippet to the `HttpConnectionManager` `typed_config.http_filters` list (place the Lua filter before `envoy.filters.http.router`). Use the route `virtual_hosts[].routes[]` to match on the `x-epp-variant` header and route to the `epp-canary` cluster; the default route should point to `epp-primary`. For Istio, insert Lua using an `EnvoyFilter` and use header-based `VirtualService` matches to split traffic.
- Operational notes:
  - Deploy two EPP stacks (or two deployments of the same binary with different Runner/ConfigMap), one configured with the `by_label` filter enabled and the other without. Expose each as a distinct Envoy cluster (`epp-primary`, `epp-canary`).
  - Use weighted-cluster splitting for simple percentage-based rollouts; use Lua sampling when you need programmatic control (per-user hashing, sticky keys, cookies, or advanced sampling logic).
  - Monitor health, latency, and error metrics for the canary; ramp weights gradually (e.g., 1% → 5% → 10%).
  - For session stickiness, sample deterministically (hashing on a cookie or header) rather than using pure random sampling.
  - If you prefer a single EPP service to make the routing decision internally, you can inject a header (via Envoy) and implement a `PrepareData`/`Admission` plugin or a `Filter` that reads the `x-epp-variant` header and modifies the `InferenceObjective` or filters endpoints accordingly.
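The deterministic-sampling recommendation above can be illustrated in a few lines of Go (an Envoy deployment would implement the equivalent in Lua against a cookie or header value):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// inCanary deterministically maps a sticky key (cookie, user id, session id)
// into a bucket in [0,100) and compares it against the rollout percentage,
// so the same key always lands on the same variant.
func inCanary(stickyKey string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(stickyKey))
	return h.Sum32()%100 < percent
}

func main() {
	// The same key always gets the same answer (sticky).
	fmt.Println(inCanary("user-42", 10) == inCanary("user-42", 10)) // true

	// At 100% everyone is in the canary; at 0% nobody is.
	fmt.Println(inCanary("user-42", 100), inCanary("user-42", 0)) // true false
}
```

Ramping the canary is then just raising `percent`: every key that was already in the canary at 5% remains in it at 10%, which keeps sessions from flapping between variants during a rollout.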
- Configuration: a `selector` (string or structured selector) supporting equality and set-based operators such as `=`, `!=`, `in`, `notin`, and `exists` (examples: `region=us-west`, `gpu in (a100,t4)`, `env` for existence).
- Factory validation: selector syntax is validated at plugin creation time; invalid selectors cause the factory to fail fast to prevent runtime surprises.
- Runtime behavior: the filter retains only endpoints whose labels satisfy the configured selector. It supports multi-attribute and set-based matching (multi-key AND semantics). If the selector matches no endpoints, the candidate set will be empty and the scheduler will have no endpoints to choose from unless other filters or fallback logic add candidates.
- When to use: complex capability routing (GPU family, accelerator type, memory tier), locality/compliance routing (`region`, `zone`, jurisdiction), tenant/environment isolation (`tenant=acme`, `env=staging`), canary/blue-green rollouts (`release=canary`), progressive migrations (shift traffic by changing selectors), and targeted debugging or testing.
- Interaction: more expressive than `by_label` (which checks a single key and a whitelist of values). Combine `by_label_selector` with scorers or fallback filters when you want to prefer matching endpoints but still allow broader coverage under fallback conditions.
- Caution: overly restrictive or misconfigured selectors can exclude all endpoints; test selectors carefully and prefer gradual rollouts or explicit fallbacks in production.
- Key constants: the filter checks label `llm-d.ai/role` with well-known values `prefill`, `decode`, and `both`.
- Plugin types: `prefill-filter` matches endpoints with role `prefill` (strict); `decode-filter` matches endpoints with role `decode` or `both` and allows unlabeled endpoints (fallback).
- Configuration: these are instantiated by factory functions (no per-instance `roles` array). The plugin factories create appropriately preconfigured `ByLabel` instances (`PrefillRoleFactory` and `DecodeRoleFactory`).
- Runtime behavior: `prefill-filter` retains only endpoints whose `llm-d.ai/role` equals `prefill`. `decode-filter` retains endpoints whose `llm-d.ai/role` is `decode` or `both`, and because it allows unlabeled endpoints, endpoints missing the label are also kept.
- When to use: route prefill workloads to P clusters optimized for throughput/memory and route decode or mixed workloads to D clusters suited for low latency or mixed-capability pools. Use these filters to enforce PD-level workload separation and cost/SLA tradeoffs in a disaggregated PD architecture.
- Interaction: use alongside `by_label`/`by_label_selector` when additional endpoint labels are present; combine with scorers to prefer certain PDs without hard exclusion. Note that `decode-filter`'s allowlist behavior intentionally provides a fallback when PD labeling is missing.
- Caution: because role metadata and labeling may be produced by external controllers, they can be absent or stale; prefer staged rollouts, monitor capacity, and avoid relying on strict exclusion unless you control labeling guarantees.
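The two role filters' matching rules condense to a pair of predicates. This is illustrative only; in the repo they are preconfigured `ByLabel` instances rather than standalone functions:

```go
package main

import "fmt"

const roleLabel = "llm-d.ai/role"

// prefillMatch is strict: only role=prefill passes.
func prefillMatch(labels map[string]string) bool {
	return labels[roleLabel] == "prefill"
}

// decodeMatch accepts decode, both, and unlabeled endpoints (the documented fallback).
func decodeMatch(labels map[string]string) bool {
	role, ok := labels[roleLabel]
	if !ok {
		return true // unlabeled endpoints are kept by the decode filter
	}
	return role == "decode" || role == "both"
}

func main() {
	unlabeled := map[string]string{}
	both := map[string]string{roleLabel: "both"}
	prefill := map[string]string{roleLabel: "prefill"}

	fmt.Println(prefillMatch(prefill), prefillMatch(unlabeled))                  // true false
	fmt.Println(decodeMatch(both), decodeMatch(unlabeled), decodeMatch(prefill)) // true true false
}
```

The asymmetry is the point: an unlabeled pod can still serve decode traffic, but never receives prefill traffic, so a labeling gap degrades toward the latency-sensitive path rather than silently emptying it.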
- Configuration: optional parameters for weight, smoothing/window size, and metric source (e.g., CPU, active requests); these tune how aggressively the scorer penalizes loaded endpoints.
- Runtime behavior: consumes endpoint load metrics and produces normalized scores in [0,1], preferring lower-load endpoints. Normalization and smoothing prevent extreme score swings.
- When to use: distribute traffic to avoid hotspots, reduce request latency by spreading load, and complement affinity-based routing.
- Interaction: often combined with cache/warmth scorers (like `NoHitLRU`) and label-based filters; ensure weighting balances load vs. affinity needs.
- Caution: relies on accurate, timely metrics; stale or missing metrics can misrank endpoints. Consider graceful degradation to neutral scores when metrics are unavailable.
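One plausible normalization matching this description (lower load scores higher, missing metrics degrade to a neutral score) is min-max scaling. The formula below is an assumption for illustration, not the repo's exact scoring function:

```go
package main

import "fmt"

// scoreByLoad maps observed load to scores in [0,1]: the least-loaded endpoint
// scores 1, the most-loaded scores 0, and unknown load (negative sentinel)
// gets a neutral 0.5. Illustrative scheme only.
func scoreByLoad(load map[string]float64) map[string]float64 {
	min, max := 0.0, 0.0
	first := true
	for _, l := range load {
		if l < 0 {
			continue // negative sentinel = missing metric
		}
		if first || l < min {
			min = l
		}
		if first || l > max {
			max = l
		}
		first = false
	}
	scores := map[string]float64{}
	for ep, l := range load {
		switch {
		case l < 0:
			scores[ep] = 0.5 // graceful degradation when metrics are unavailable
		case max == min:
			scores[ep] = 1.0 // all endpoints equally loaded
		default:
			scores[ep] = (max - l) / (max - min)
		}
	}
	return scores
}

func main() {
	s := scoreByLoad(map[string]float64{"a": 2, "b": 10, "c": -1})
	fmt.Println(s["a"], s["b"], s["c"]) // 1 0 0.5
}
```

The neutral 0.5 fallback is one way to realize the "graceful degradation" caution above: an endpoint with missing metrics is neither preferred nor excluded, so other scorers in the weighted pipeline still decide its fate.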
- Configuration: parameters for LRU window size, decay, and whether to promote recently-hit endpoints aggressively.
- Runtime behavior: maintains recency state for endpoints and boosts scores for endpoints with recent successful hits (warm). Endpoints without recent activity receive lower scores to de-prioritize cold instances.
- When to use: reduce cold-start latency by favoring endpoints that have recently served similar requests, helpful for models with significant startup cost or cache warmups.
- Interaction: pairs well with `PreRequest` plugins that update warm-state and with load-aware scorers to avoid overloading a few warm endpoints.
- Caution: can starve cold endpoints if used alone; combine with load-aware scoring or occasional randomized selection to ensure capacity utilization.
- Configuration: tuning options for prefix match thresholds, score bonus magnitude, and cache key strategies.
- Runtime behavior: examines request content (prefixes/keys) and endpoint cache metadata to boost scores for endpoints likely to have a matching cached result, improving hit rates and latency for prefix-heavy workloads.
- When to use: workloads where prefix or token-level caching yields large latency wins (e.g., autocomplete, code-completion, repetitive prompt patterns).
- Interaction: use with `NoHitLRU` and load-aware scorers to balance cache affinity against load and freshness; ensure cache metadata is maintained by pre-request or background processes.
- Caution: cache staleness or incorrect metadata can produce suboptimal routing; provide mechanisms to invalidate or soften cache bonuses when cache health is uncertain.
- Purpose: compute and attach longest-prefix match metadata to endpoints so downstream scorers and filters can prefer endpoints with cache hits or affinity for the request prefix.
- Files: `pkg/epp/framework/plugins/scheduling/scorer/prefix/plugin.go` (factory and prepare hooks).
- Configuration: optional parameters: `prefix_length` (number of tokens/bytes to extract), `hash_algo` (e.g., `fnv32`, `murmur`), and `include_metadata` (boolean; attach match details to endpoint `Metadata`).
- Factory validation: validates sensible `prefix_length` ranges and supported `hash_algo` values at startup; invalid configs cause factory failure to avoid silent misrouting.
- Runtime behavior: extracts a canonical prefix from the request (by tokens or bytes), computes a short fingerprint, and compares it to endpoint cache keys. Annotates each candidate endpoint with a `prefixMatch` metadata structure containing `matchedLength`, `score`, and `cacheKey`. Runs as a `PrepareData` hook and is fail-open on timeouts/errors.
- Inputs: request payload (prompt/text), candidate endpoint metadata (may include `prefix_cache_keys`), and configured prefix extraction rules.
- Outputs: per-endpoint `prefixMatch` metadata and aggregated prefix statistics in the request `cycleState` for use by scorers and selection logic.
- When to use: workloads with high prefix reuse (autocomplete, code completion, repetitive prompts) where cache affinity reduces TTFT/TPOT; helps prefer endpoints likely to have warm cached responses.
- Interaction: pairs with the `precise_prefix_cache` scorer and `NoHitLRU` to balance cache affinity against load. The `prefix` plugin must run before scorers so its annotations are available during scoring.
- Caution: depends on timely, accurate endpoint cache metadata; stale or inconsistent metadata can mislead the scheduler. Use a conservative `prefix_length` and monitor prefix-match rates when deploying.
- Use case: configure `prefix` to extract the first 32 tokens and enable `include_metadata`; combine with `precise_prefix_cache` to award score bonuses when `matchedLength` exceeds a threshold, reducing TTFT for autocomplete workloads.
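The extract-and-fingerprint step can be sketched with a stdlib FNV-32a hash (one of the documented `hash_algo` options). The real plugin's tokenization and key comparison may differ, and `prefixFingerprint` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// prefixFingerprint truncates the prompt to prefixLength bytes and returns an
// FNV-32a fingerprint: the kind of compact cache key a prefix-aware scorer can
// compare against endpoint cache metadata. Illustrative only; the real plugin
// may extract by tokens rather than bytes.
func prefixFingerprint(prompt string, prefixLength int) uint32 {
	if len(prompt) > prefixLength {
		prompt = prompt[:prefixLength]
	}
	h := fnv.New32a()
	h.Write([]byte(prompt))
	return h.Sum32()
}

func main() {
	a := prefixFingerprint("def fibonacci(n):\n    ", 16)
	b := prefixFingerprint("def fibonacci(n):\n    if n < 2:", 16)
	c := prefixFingerprint("class Foo:", 16)
	// Prompts sharing the first 16 bytes hash to the same cache key,
	// so both would be steered toward the same warm endpoint.
	fmt.Println(a == b, a == c) // true false
}
```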
- Purpose: predict per-endpoint Time-To-First-Token (TTFT) and Time-Per-Output-Token (TPOT) to select endpoints likely to meet request SLOs.
- Main files:
  - `pkg/epp/framework/plugins/scheduling/scorer/predictedlatency/preparedata_hooks.go` — collect inputs (prefix-cache scores, request SLOs).
  - `pkg/epp/framework/plugins/scheduling/scorer/predictedlatency/prediction.go` — call the predictor and validate predictions.
  - `pkg/epp/framework/plugins/scheduling/scorer/predictedlatency/selection.go` (+ helpers) — compute headroom and choose endpoints.
- Inputs: endpoint metrics, request prompt/token counts, prefix-cache match scores, SLO headers (TTFT/TPOT).
- Outputs: per-endpoint `endpointPredictionResult` (TTFT, TPOT, validity, headroom) stored in the request context for scoring/selection.
- Where the model runs: predictions come from the latency predictor sidecar client under sidecars/latencypredictorasync:
  - `bayesian_ridge`: the client caches coefficients and evaluates a local linear model (`predictBayesianRidge`). The Bayesian Ridge model provides predictions after only a few hundred examples and requires no tuning.
  - `xgboost`/`lightgbm`: predictions are normally performed via HTTP calls to configured prediction servers (`/predict`, `/predict/bulk`). The servers run Gradient Boosted Decision Tree (GBDT) models. The client can optionally fetch XGBoost trees for native use (`UseNativeXGBoost`).
- Key endpoints: training (`/add_training_data_bulk`), model info (`/model/download/info`), metrics/trees (`/model/.../xgb/json`), prediction (`/predict`, `/predict/bulk`).
- Notes: the plugin falls back to composite scoring if the predictor is unavailable; configuration and the runtime model type are managed by the latency predictor sidecar.
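Headroom-based selection reduces to "keep endpoints whose predictions fit under both SLOs, prefer the largest margin". A sketch with assumed field and function names (the real `endpointPredictionResult` and selection helpers differ):

```go
package main

import "fmt"

// prediction is a stand-in for endpointPredictionResult (field names assumed).
type prediction struct {
	Endpoint string
	TTFT     float64 // predicted time-to-first-token, ms
	TPOT     float64 // predicted time-per-output-token, ms
	Valid    bool    // whether the predictor produced a usable prediction
}

// pickBySLO keeps endpoints whose predictions fit under both SLOs and returns
// the one with the largest TTFT headroom; ok=false signals the caller to fall
// back to composite scoring, as the plugin does when the predictor is unavailable.
func pickBySLO(preds []prediction, sloTTFT, sloTPOT float64) (string, bool) {
	best, bestHeadroom, found := "", 0.0, false
	for _, p := range preds {
		if !p.Valid || p.TTFT > sloTTFT || p.TPOT > sloTPOT {
			continue // invalid prediction or SLO violation: not a candidate
		}
		if headroom := sloTTFT - p.TTFT; !found || headroom > bestHeadroom {
			best, bestHeadroom, found = p.Endpoint, headroom, true
		}
	}
	return best, found
}

func main() {
	preds := []prediction{
		{Endpoint: "pod-a", TTFT: 180, TPOT: 25, Valid: true},
		{Endpoint: "pod-b", TTFT: 90, TPOT: 40, Valid: true}, // violates TPOT SLO
		{Endpoint: "pod-c", TTFT: 120, TPOT: 20, Valid: true},
	}
	ep, ok := pickBySLO(preds, 200, 30) // SLO: TTFT <= 200ms, TPOT <= 30ms
	fmt.Println(ep, ok) // pod-c true
}
```

Preferring the largest headroom (rather than the smallest predicted latency among SLO-fitting endpoints) is one plausible policy; it leaves slack for prediction error at the cost of concentrating traffic on the fastest pods.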
ProfileHandlers and SchedulerProfiles collaborate to implement the scheduler's policy decisions.
- SchedulerProfile ("profile"): a small, per-profile pipeline that the `Scheduler` runs to pick endpoints for that profile. A profile is constructed from a list of `pluginRef`s and, at runtime, contains three logical pieces: a list of `Filter` plugins, a list of weighted `Scorer` plugins, and a single `Picker` plugin. The profile's implementation lives in the Scheduler package: pkg/epp/scheduling/scheduler_profile.go (gateway-api-inference-extension).
- ProfileHandler: a single plugin instance the `Scheduler` consults each scheduling cycle to decide which profiles should run (via `Pick(...)`) and how to consolidate their outputs into the final `SchedulingResult` (via `ProcessResults(...)`). `ProfileHandler` implementations live under `pkg/plugins/profile` (examples: `pd_profile_handler.go`, `dp_profile_handler.go`) and are registered by the repo at startup (`pkg/plugins/register.go`).
- How they relate to the Scheduler: the `Scheduler` is built with a configured `ProfileHandler` instance (injected via `NewSchedulerWithConfig`) and a map of named `SchedulerProfile` objects. During `Schedule(...)` the `Scheduler` repeatedly calls `profileHandler.Pick(...)` to get the next profiles to run; it then calls `profile.Run(...)` for each selected profile (the profile enforces `Filters` → `Scorers` → `Picker`) and finally calls `profileHandler.ProcessResults(...)` to aggregate results.
See the concrete scheduler implementation here: scheduler.go.
These examples show:
- top-level plugin instance declarations; and
- how `schedulingProfiles` reference those instances via `pluginRef`.
Example (excerpt from deploy/config/epp-config.yaml):

```yaml
# Sample EPP configuration for running without P/D
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins: # Declare plugin instances.
- type: prefix-cache-scorer
- type: decode-filter
- type: max-score-picker
- type: single-profile-handler
schedulingProfiles:
- name: default # Define a profile named "default" that references some of the declared plugins.
  plugins:
  - pluginRef: decode-filter
  - pluginRef: max-score-picker
  - pluginRef: prefix-cache-scorer
    weight: 2
```

Example (excerpt from deploy/config/dp-epp-config.yaml):
```yaml
# Sample EPP configuration for running with Data Parallel
#
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: precise-prefix-cache-scorer
  parameters:
    indexerConfig:
      tokenProcessorConfig:
        blockSize: 5
      kvBlockIndexConfig:
        maxPrefixBlocksToMatch: 256
- type: decode-filter
- type: max-score-picker
- type: data-parallel-profile-handler
  parameters:
    primaryPort: 8000
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: decode-filter
  - pluginRef: max-score-picker
  - pluginRef: precise-prefix-cache-scorer
    weight: 2
```
- Minimal structure: a `SchedulerProfile` has `filters` (0..N), `scorers` (0..N, weighted), and a single `picker`.
- Registration: when the Runner constructs a profile from `schedulingProfiles[].plugins` it looks up each named plugin instance and registers it with the profile according to the interfaces it implements:
  - `Filter` → appended to `p.filters` in the order encountered
  - `WeightedScorer` → appended to `p.scorers` (a plain `Scorer` without a weight causes a factory/registration error)
  - `Picker` → assigned to `p.picker` (only one picker allowed)
- Runtime execution order: `SchedulerProfile.Run(...)` invokes `runFilterPlugins(...)` → `runScorerPlugins(...)` → `runPickerPlugin(...)` unconditionally. This enforces the canonical Filters → Scorers → Picker pipeline regardless of declaration order in top-level `plugins:` or profile `pluginRef` lists.
- Failure behaviour: if filters remove all endpoints, the profile run returns an error and yields a `nil` `ProfileRunResult`; the `ProfileHandler` sees this in the `profileResults` map and can decide fallbacks.
- Best practice: list profile `pluginRef`s in the human-friendly order (filters, then scorers, then picker) so configs are readable and reviewers aren't confused. Note: some example YAML snippets in this document do not show the picker listed last. That does not break runtime behavior; the `SchedulerProfile` construction and `Run()` logic register plugins by interface and then enforce the pipeline order (Filters → Scorers → Picker), so a picker declared earlier is stored and invoked only during the picker phase.
- What a ProfileHandler does: for each scheduling cycle it implements two methods:
  - `Pick(ctx, cycleState, request, profiles, profileResults) map[string]SchedulerProfile` — which named profiles should run next
  - `ProcessResults(ctx, cycleState, request, profileResults) (*SchedulingResult, error)` — aggregate profile outputs into the final `SchedulingResult`
- Where implemented: repository handlers live under `pkg/plugins/profile` and are registered by `pkg/plugins/register.go`.
- Concrete handlers in the `llm-d-inference-scheduler` repo:
  - `pd-profile-handler` (`PdProfileHandler`) — pkg/plugins/profile/pd_profile_handler.go
    - Behavior: enforces a decode-first strategy. On the first cycle it returns the configured `decode` profile to run. If decode succeeds and a configured `pdDecider` indicates disaggregation is needed, it instructs the scheduler to run the `prefill` profile next; finally, `ProcessResults` rewrites endpoints into Data-Parallel form when `primaryPort` is configured and includes both decode and prefill results when available.
    - Config params (factory):
      - `decodeProfile`: name of the `SchedulerProfile` to run first (the "decode" phase). Controls which profile the handler will select on the first cycle. (default: "decode")
      - `prefillProfile`: name of the `SchedulerProfile` to run for the prefill phase when PD disaggregation is required. Only used if the configured decider indicates prefill should run. (default: "prefill")
      - `prefixPluginType`: the plugin type to use for prefix preparation (a `PrepareData` plugin). Determines which prepare hook will annotate endpoints with prefix/cache metadata consumed by prefix-aware scorers. (default: `prefix.PrefixCachePluginType`, i.e. `prefix-cache-scorer`) Note: `PdProfileHandler` stores this type/name so other plugins (scorers or LRU helpers) can locate prefix prepare-state in the scheduling `CycleState`. The handler itself does not call the prefix plugin at runtime; it only records the configured typed name for consumers that read prefix annotations.
      - `prefixPluginName`: the plugin instance name (from top-level `plugins:`) to run for prefix preparation. Allows selecting a specific configured instance when multiple prefix plugins exist. (default: same as `prefixPluginType`) Note: like `prefixPluginType`, the `PdProfileHandler` stores the configured instance name so other plugins can find the prefix prepare-state in the scheduling `CycleState`. The handler does not invoke the prefix plugin directly at runtime.
      - `primaryPort`: TCP port number used when rewriting endpoints into Data-Parallel form; `ProcessResults` uses this to set the primary service port on rewritten endpoints and to populate routing headers. (default: `0` — no primary port / no rewrite)
      - `deciderPluginName`: name of the PD-decider plugin (from `plugins:`) used to decide whether to run `prefill` after `decode`. Controls the disaggregation decision logic. (default: `AlwaysDisaggDeciderPluginType` — typically "always-disagg-pd-decider")
  - `data-parallel-profile-handler` (`DataParallelProfileHandler`) — pkg/plugins/profile/dp_profile_handler.go
    - Behavior: intended for single-profile Data-Parallel workflows; validates that exactly one profile is configured and converts its run result into Data-Parallel endpoints (rewrites ports and sets the DataParallel header) in `ProcessResults`.
    - Config params (factory): `primaryPort`.
- How handlers are wired/configured:
  - Declare a handler instance in the top-level `plugins:` block (type `pd-profile-handler` or `data-parallel-profile-handler`) with any parameters.
  - When the Runner builds the `Scheduler` it injects the chosen ProfileHandler instance into `SchedulerConfig` and constructs named `SchedulerProfile` objects from `schedulingProfiles`.
- Registration entry points: see pkg/plugins/register.go for the `plugin.Register(...)` calls that make these handler factories available at runtime.
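Putting the wiring together, a PD-style configuration might look like the sketch below. This is an illustrative assumption assembled from the parameter names and plugin types described in this section, not a copy of a shipped config; verify spellings against the deploy/config examples before use:

```yaml
plugins:
- type: pd-profile-handler
  parameters:
    decodeProfile: decode     # profile run on the first cycle
    prefillProfile: prefill   # profile run when the decider requests disaggregation
    primaryPort: 8000         # enables Data-Parallel endpoint rewriting
- type: prefill-filter
- type: decode-filter
- type: max-score-picker
schedulingProfiles:
- name: decode
  plugins:
  - pluginRef: decode-filter
  - pluginRef: max-score-picker
- name: prefill
  plugins:
  - pluginRef: prefill-filter
  - pluginRef: max-score-picker
```

With this shape, the handler's decode-first loop maps directly onto the two named profiles, and the role filters (Appendix C) partition the candidate pods between them.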
Related files:
- Runner config / loader: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/master/cmd/epp/runner
- Scheduler implementation: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/master/pkg/epp/scheduling/scheduler.go
- SchedulerProfile implementation: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/master/pkg/epp/scheduling/scheduler_profile.go
- Profile handlers: pkg/plugins/profile





