
feat: wire V2 saturation analyzer into engine, gated by analyzerName #695

Open
ev-shindin wants to merge 2 commits into llm-d:main from ev-shindin:saturation-v2-engine-integration-impl


@ev-shindin
Collaborator

Summary

Wire the V2 token-based saturation analyzer into the optimization engine, gated by analyzerName: "saturation" in the saturation scaling config. When active, it replaces the V1 percentage-based analyzer while keeping the rest of the pipeline (enforcer, limiter, decision converter) unchanged via an adapter pattern.

This PR also introduces the CostAwareOptimizer — the first ScalingOptimizer implementation for the V2 pipeline — which handles unlimited-mode multi-variant scaling with cost-based replica allocation.

Base branch: main (after PR #689 merge)
Depends on: PR #689 (saturation_v2 package)

Changes

Engine Integration (internal/engines/saturation/)

  • V1/V2 split in engine.go: Refactor optimize() into optimizeV1() and optimizeV2(), gated by analyzerName == "saturation" from global config
  • V2 fields on Engine struct: saturationV2Analyzer, capacityStore, optimizer — initialized once in NewEngine()
  • optimizeV2() three-stage pipeline:
    1. Collect ModelScalingRequests (run V2 analyzer per model, pre-populate capacity store from deployments)
    2. Call optimizer.Optimize() across all models
    3. Apply enforcer per-model via bridge functions
  • Enforcer bridge (engine_v2.go): extractTargetsFromDecisions, buildVariantAnalysesFromDecisions, applyEnforcedTargetsToDecisions — adapts V2 optimizer output to existing V1 enforcer interface
  • Config pattern: Follows upstream's e.Config DI pattern with namespace-aware config loading (SaturationConfigForNamespace, ScaleToZeroConfigForNamespace)
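
A minimal sketch of the gate and the three-stage flow described above, under stated assumptions: the struct fields, function-typed hooks (`analyze`, `enforce`), and `demoOptimizer` are hypothetical stand-ins, and the real `Engine`, `ModelScalingRequest`, and `ScalingOptimizer` in this PR may have different shapes and signatures.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the PR's types.
type ModelScalingRequest struct {
	ModelID          string
	RequiredCapacity float64 // > 0 is taken to mean demand exceeds supply
}

type VariantDecision struct {
	ModelID        string
	TargetReplicas int
}

type ScalingOptimizer interface {
	Optimize(reqs []ModelScalingRequest) []VariantDecision
}

// demoOptimizer is a toy: 2 replicas when capacity is required, else 1.
type demoOptimizer struct{}

func (demoOptimizer) Optimize(reqs []ModelScalingRequest) []VariantDecision {
	out := make([]VariantDecision, 0, len(reqs))
	for _, r := range reqs {
		target := 1
		if r.RequiredCapacity > 0 {
			target = 2
		}
		out = append(out, VariantDecision{ModelID: r.ModelID, TargetReplicas: target})
	}
	return out
}

type Engine struct {
	analyzerName string
	analyze      func(modelID string) ModelScalingRequest     // stand-in for the V2 analyzer
	optimizer    ScalingOptimizer
	enforce      func(ds []VariantDecision) []VariantDecision // stand-in for the enforcer bridge
}

// optimize dispatches on the configured analyzer name.
func (e *Engine) optimize(models []string) []VariantDecision {
	if e.analyzerName == "saturation" {
		return e.optimizeV2(models)
	}
	return e.optimizeV1(models)
}

// optimizeV1 is stubbed out; the V1 path is not part of this sketch.
func (e *Engine) optimizeV1(models []string) []VariantDecision { return nil }

func (e *Engine) optimizeV2(models []string) []VariantDecision {
	// Stage 1: collect one scaling request per model.
	reqs := make([]ModelScalingRequest, 0, len(models))
	for _, m := range models {
		reqs = append(reqs, e.analyze(m))
	}
	// Stage 2: a single optimizer call across all models.
	decisions := e.optimizer.Optimize(reqs)
	// Stage 3: apply the enforcer per model.
	byModel := make(map[string][]VariantDecision)
	for _, d := range decisions {
		byModel[d.ModelID] = append(byModel[d.ModelID], d)
	}
	var out []VariantDecision
	for _, m := range models {
		out = append(out, e.enforce(byModel[m])...)
	}
	return out
}

func newDemoEngine(analyzerName string) *Engine {
	return &Engine{
		analyzerName: analyzerName,
		analyze: func(id string) ModelScalingRequest {
			need := 0.0
			if id == "busy" {
				need = 100
			}
			return ModelScalingRequest{ModelID: id, RequiredCapacity: need}
		},
		optimizer: demoOptimizer{},
		enforce:   func(ds []VariantDecision) []VariantDecision { return ds },
	}
}

func main() {
	fmt.Println(newDemoEngine("saturation").optimize([]string{"busy", "idle"}))
	fmt.Println(newDemoEngine("").optimize([]string{"busy"}))
}
```

Grouping decisions by model before stage 3 keeps the enforcer per-model while the optimizer still sees every model in one call.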

CostAwareOptimizer (internal/engines/pipeline/)

  • optimizer_interfaces.go: ScalingOptimizer interface and ModelScalingRequest type
  • cost_aware_optimizer.go: Unlimited-mode optimizer that processes each model independently:
    • Scale-up: allocates replicas to most cost-efficient variant (lowest cost / perReplicaCapacity). Variants with pending replicas are not skipped — the analyzer already accounts for their capacity in the supply calculation, so RequiredCapacity > 0 means demand exceeds total supply including pending.
    • Scale-down: removes replicas from the most expensive variant (highest absolute cost). The cheapest variant is protected at a minimum of 1 replica only when it is the last variant with replicas. This prevents scale-down deadlocks where the expensive variant's per-replica capacity exceeds the spare capacity but cheaper replicas could still be removed.
    • Skips variants with zero capacity
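
The allocation rules above can be sketched as follows. This is an illustration, not the code in cost_aware_optimizer.go: the `Variant` struct, the `ScaleUp`/`ScaleDown` signatures, and the choice to put all new replicas on a single variant are assumptions made for the example.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Variant is a hypothetical, simplified view of one model variant.
type Variant struct {
	Name               string
	Cost               float64 // cost per replica
	PerReplicaCapacity float64 // capacity contributed by one replica
	Replicas           int
}

func currentTargets(vs []Variant) map[string]int {
	t := make(map[string]int, len(vs))
	for _, v := range vs {
		t[v.Name] = v.Replicas
	}
	return t
}

// ScaleUp adds enough replicas of the most cost-efficient variant
// (lowest Cost/PerReplicaCapacity) to cover the required capacity.
func ScaleUp(vs []Variant, required float64) map[string]int {
	t := currentTargets(vs)
	best := -1
	for i, v := range vs {
		if v.PerReplicaCapacity <= 0 {
			continue // skip variants with zero capacity
		}
		if best < 0 || v.Cost/v.PerReplicaCapacity < vs[best].Cost/vs[best].PerReplicaCapacity {
			best = i
		}
	}
	if best < 0 || required <= 0 {
		return t
	}
	t[vs[best].Name] += int(math.Ceil(required / vs[best].PerReplicaCapacity))
	return t
}

// ScaleDown removes replicas starting from the most expensive variant,
// never removing more capacity than the spare amount.
func ScaleDown(vs []Variant, spare float64) map[string]int {
	t := currentTargets(vs)
	order := make([]Variant, 0, len(vs))
	for _, v := range vs {
		if v.PerReplicaCapacity > 0 {
			order = append(order, v)
		}
	}
	sort.SliceStable(order, func(i, j int) bool { return order[i].Cost > order[j].Cost })
	for _, v := range order {
		removable := int(math.Floor(spare / v.PerReplicaCapacity))
		if removable > t[v.Name] {
			removable = t[v.Name]
		}
		// Keep one replica only if no other variant still has replicas.
		othersHaveReplicas := false
		for name, n := range t {
			if name != v.Name && n > 0 {
				othersHaveReplicas = true
			}
		}
		if !othersHaveReplicas && removable > t[v.Name]-1 {
			removable = t[v.Name] - 1
		}
		if removable <= 0 {
			continue // this variant cannot shrink; try the next cheaper one
		}
		t[v.Name] -= removable
		spare -= float64(removable) * v.PerReplicaCapacity
	}
	return t
}

func main() {
	vs := []Variant{
		{Name: "A", Cost: 10, PerReplicaCapacity: 100, Replicas: 2},
		{Name: "B", Cost: 4, PerReplicaCapacity: 20, Replicas: 3},
	}
	fmt.Println(ScaleUp(vs, 150))  // A is cheaper per unit of capacity
	fmt.Println(ScaleDown(vs, 50)) // A cannot shrink within 50, B can
}
```

In the scale-down call, variant A's per-replica capacity (100) exceeds the spare capacity (50), so the loop falls through to the cheaper variant B instead of stalling.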

Limiter Infrastructure (internal/engines/pipeline/)

  • limiter_interfaces.go: New ResourcePool, ResourceConstraints, and ConstraintProvider interface — enables V2 limited-mode path (future GreedyBySaturationOptimizer)
  • default_limiter.go: DefaultLimiter now implements ConstraintProvider via ComputeConstraints() (V2 path) alongside existing Limiter.Limit() (V1 path)
  • type_inventory.go: Added GetResourcePools() to Inventory interface and TypeInventory implementation
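
One way the constraint path could fit together, as a rough sketch. The type and method names come from the list above, but every field, signature, and the "available = limit - used" rule are assumptions for illustration; the PR's actual definitions are not shown in this description.

```go
package main

import "fmt"

// ResourcePool is a hypothetical per-accelerator-type pool.
type ResourcePool struct {
	Limit int // total units of this type
	Used  int // units already allocated
}

// ResourceConstraints is a hypothetical result type: free units per type.
type ResourceConstraints struct {
	Available map[string]int
}

// Inventory exposes pools; only the method relevant here is sketched.
type Inventory interface {
	GetResourcePools() map[string]ResourcePool
}

// ConstraintProvider is what a limited-mode optimizer would consume.
type ConstraintProvider interface {
	ComputeConstraints() ResourceConstraints
}

type TypeInventory struct {
	pools map[string]ResourcePool
}

// GetResourcePools returns a copy so callers cannot mutate the inventory.
func (t *TypeInventory) GetResourcePools() map[string]ResourcePool {
	out := make(map[string]ResourcePool, len(t.pools))
	for k, v := range t.pools {
		out[k] = v
	}
	return out
}

type DefaultLimiter struct {
	inv Inventory
}

// ComputeConstraints reports free units per type, clamped at zero.
func (l *DefaultLimiter) ComputeConstraints() ResourceConstraints {
	avail := make(map[string]int)
	for typ, p := range l.inv.GetResourcePools() {
		free := p.Limit - p.Used
		if free < 0 {
			free = 0
		}
		avail[typ] = free
	}
	return ResourceConstraints{Available: avail}
}

// Compile-time check that DefaultLimiter satisfies the interface.
var _ ConstraintProvider = (*DefaultLimiter)(nil)

func main() {
	inv := &TypeInventory{pools: map[string]ResourcePool{
		"H100": {Limit: 8, Used: 5},
		"A100": {Limit: 4, Used: 6},
	}}
	limiter := &DefaultLimiter{inv: inv}
	fmt.Println(limiter.ComputeConstraints().Available)
}
```

Exposing constraints as data, rather than only a `Limit()` call that rewrites decisions, lets an optimizer plan within the limits up front.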

What stays unchanged

| Component | Why |
|---|---|
| Enforcer.EnforcePolicy() | Consumes []VariantSaturationAnalysis, provided by the bridge adapter |
| convertSaturationTargetsToDecisions() | V1-only path, not touched |
| GPULimiter.Limit() | Applied globally after both V1 and V2 paths |
| applySaturationDecisions() | Consumes []VariantDecision, unchanged |
| saturation_v2 package | No changes to analyzer logic |
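
To illustrate how a bridge can leave the enforcer untouched, here is a hedged sketch of the three bridge functions named in this PR. Only the function names come from the description; the struct fields, signatures, and the toy `enforcePolicy` rule (keep one replica if every target is zero) are assumptions, not the real Enforcer.EnforcePolicy() behavior.

```go
package main

import "fmt"

// Hypothetical, trimmed-down versions of the pipeline types.
type VariantDecision struct {
	VariantName     string
	CurrentReplicas int
	TargetReplicas  int
}

type VariantSaturationAnalysis struct {
	VariantName     string
	CurrentReplicas int
}

// extractTargetsFromDecisions flattens optimizer output into name -> target.
func extractTargetsFromDecisions(ds []VariantDecision) map[string]int {
	targets := make(map[string]int, len(ds))
	for _, d := range ds {
		targets[d.VariantName] = d.TargetReplicas
	}
	return targets
}

// buildVariantAnalysesFromDecisions builds the V1-shaped input the enforcer expects.
func buildVariantAnalysesFromDecisions(ds []VariantDecision) []VariantSaturationAnalysis {
	out := make([]VariantSaturationAnalysis, 0, len(ds))
	for _, d := range ds {
		out = append(out, VariantSaturationAnalysis{
			VariantName:     d.VariantName,
			CurrentReplicas: d.CurrentReplicas,
		})
	}
	return out
}

// enforcePolicy is a toy stand-in for the existing enforcer: if every target
// is zero, it keeps the first analyzed variant at one replica.
func enforcePolicy(targets map[string]int, analyses []VariantSaturationAnalysis) map[string]int {
	out := make(map[string]int, len(targets))
	total := 0
	for name, n := range targets {
		out[name] = n
		total += n
	}
	if total == 0 && len(analyses) > 0 {
		out[analyses[0].VariantName] = 1
	}
	return out
}

// applyEnforcedTargetsToDecisions writes enforced targets back onto a copy.
func applyEnforcedTargetsToDecisions(ds []VariantDecision, enforced map[string]int) []VariantDecision {
	out := make([]VariantDecision, len(ds))
	copy(out, ds)
	for i := range out {
		if t, ok := enforced[out[i].VariantName]; ok {
			out[i].TargetReplicas = t
		}
	}
	return out
}

// enforceDecisions chains the three bridge steps around the enforcer call.
func enforceDecisions(ds []VariantDecision) []VariantDecision {
	targets := extractTargetsFromDecisions(ds)
	analyses := buildVariantAnalysesFromDecisions(ds)
	return applyEnforcedTargetsToDecisions(ds, enforcePolicy(targets, analyses))
}

func main() {
	ds := []VariantDecision{
		{VariantName: "a", CurrentReplicas: 2, TargetReplicas: 0},
		{VariantName: "b", CurrentReplicas: 1, TargetReplicas: 0},
	}
	fmt.Println(enforceDecisions(ds))
}
```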

@ev-shindin force-pushed the saturation-v2-engine-integration-impl branch from b2b903d to 567a44f on February 12, 2026 13:48
Integrate the V2 token-based saturation analyzer into the optimization
engine behind a config gate (analyzerName: "saturation"). When active,
it replaces the V1 percentage-based analyzer inside RunSaturationAnalysis
while keeping the rest of the pipeline (enforcer, limiter, decision
converter) unchanged via an adapter pattern.

Also introduces the CostAwareOptimizer — the first ScalingOptimizer
implementation for the V2 pipeline — which handles unlimited-mode
multi-variant scaling with cost-based replica allocation.

Engine integration:
- Add saturationV2Analyzer, capacityStore, and optimizer fields to
  Engine struct, initialized once in NewEngine()
- Gate V2 path in optimize() via analyzerName == "saturation" from
  global config
- optimizeV2() three-stage pipeline: collect ModelScalingRequests,
  call optimizer.Optimize(), apply enforcer per-model via bridge
- Enforcer bridge: extractTargetsFromDecisions,
  buildVariantAnalysesFromDecisions, applyEnforcedTargetsToDecisions

CostAwareOptimizer (unlimited mode):
- Scale-up: allocate to most cost-efficient variant (lowest
  cost/perReplicaCapacity). Variants with pending replicas are NOT
  skipped — the analyzer already accounts for their capacity in the
  supply calculation, so RequiredCapacity > 0 means demand exceeds
  total supply including pending.
- Scale-down: remove from most expensive variant (highest absolute
  cost). The cheapest variant is protected at min 1 replica only when
  it is the last variant with replicas — this prevents scale-down
  deadlocks where the expensive variant's per-replica capacity exceeds
  spare but cheaper replicas could be removed.
- Skips variants with zero capacity

Limiter infrastructure:
- ResourcePool, ResourceConstraints, ConstraintProvider interface for
  future V2 limited-mode path (GreedyBySaturationOptimizer)
- DefaultLimiter implements ConstraintProvider via ComputeConstraints()
- TypeInventory.GetResourcePools() for per-type resource availability
@ev-shindin force-pushed the saturation-v2-engine-integration-impl branch from 567a44f to 74afe4c on February 12, 2026 14:58
Replace V(1) calls with V(logging.DEBUG) in cost_aware_optimizer.go,
engine.go, and engine_v2.go for better readability per review feedback.

Development

Successfully merging this pull request may close these issues.

Enhance saturation detection in lieu of flow controller enablement in inference scheduler
