Skip to content

feat: add support for WVA#3243

Merged
openshift-merge-bot[bot] merged 4 commits intoopendatahub-io:mainfrom
zdtsw-forking:wva_1
Mar 13, 2026
Merged

feat: add support for WVA#3243
openshift-merge-bot[bot] merged 4 commits intoopendatahub-io:mainfrom
zdtsw-forking:wva_1

Conversation

@zdtsw
Copy link
Member

@zdtsw zdtsw commented Mar 10, 2026

DO NOT MERGE TILL opendatahub-io/workload-variant-autoscaler#36 IS IN, this will unblocked the new e2e test on configable configmap

Description

  • add WVA under DSC kserve, default as Removed
  • WVA will be deployment via modelcontroller
  • only when WVA is Managed in dsc.kserve, and CMA subs not exist, kserve get condtion WVA only work with HPA,
  • if WVA is Managed and CMA is install, WVA can both with both HPA and CMA(keda)
  • if WVA is Managed cna CAM is installed but kedacontroller is in degradated status, Kserver CR get condition in this case, up to user either remove CMA subs or fix kedacontroller -- this will be documented
  • checks on WVA e.g CMA (we dont take Keda as its operator is deprecated) only when WVA is Managed:
    1. subscritption: if not exist, we warn user to install it
    2. if Kedacontroller exist (means operator installed): if degradated, we error user, WVA wont work, esclate to Kserve CR and DSC status
  • no need watch or own VariantAutoscaling as it is user resource
  • no need signal kserve when WVA is enabled or disabled (As we did for nim) kserve only checks VA CRD for now.
  • add special watch on deletion of configmap, or the one we have annoataiton managed:false on wont get recreated if user delete it.

ref :https://issues.redhat.com/browse/RHOAIENG-52198

How Has This Been Tested?

Screenshot or short clip

Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
  • Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work
  • The developer has run the integration test pipeline and verified that it passed successfully

E2E test suite update requirement

When bringing new changes to the operator code, such changes are by default required to be accompanied by extending and/or updating the E2E test suite accordingly.

To opt-out of this requirement:

  1. Please inspect the opt-out guidelines, to determine if the nature of the PR changes allows for skipping this requirement
  2. If opt-out is applicable, provide justification in the dedicated E2E update requirement opt-out justification section below
  3. Check the checkbox below:
  • Skip requirement to update E2E test suite for this PR
  1. Submit/save these changes to the PR description. This will automatically trigger the check.

E2E update requirement opt-out justification

Summary by CodeRabbit

  • New Features

    • Added Workload Variant Autoscaler (WVA) component with configurable management state (default Removed); operator manages WVA when enabled.
    • Operator now observes Custom Metrics Autoscaler (CMA) to reflect CMA presence/status when WVA is enabled.
  • Documentation

    • Updated API docs, samples and design docs to include WVA configuration and examples.
  • Tests

    • Added end-to-end tests validating WVA deployment, configuration, and cleanup when enabled/disabled.
  • Chores

    • Added build/manifest entries and related mappings for WVA.

@openshift-ci
Copy link

openshift-ci bot commented Mar 10, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds workload-variant-autoscaler (WVA) support across the operator: introduces WVASpec and wva fields on multiple KServe- and ModelController-related specs; updates deepcopy generation; adds wva entries to sample configs, manifests, docs, and manifests lookup; registers two new GVKs (KEDA CustomMetricsAutoscaler and llmd.ai VariantAutoscaling); extends the KServe controller to watch/monitor the Custom Metrics Autoscaler (CMA) operator and incorporate CMA dependency checks into reconciliation; updates modelcontroller logic to conditionally include WVA manifests and parameters; adds unit tests and e2e validations; updates build/manifest mappings and relatedImages for WVA.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Security and Code Quality Issues

  • CWE-400 (Uncontrolled Resource Consumption): new watches (CMA, VariantAutoscaling) increase event surface. Action: add tight predicates/filters and rate-limiting/debounce on reconciles; consider workqueue rate limits.
  • CWE-20 (Missing Input Validation): WVASpec.ManagementState added without visible admission/CEL validation. Action: add CRD validation rules or admission webhook/CEL to enforce allowed enum values and defaults.
  • CWE-22 (Path Traversal / Unsafe Path Handling): wvaManifestsPath constructs manifest paths. Action: validate/sanitize Path/ContextDir/SourcePath inputs and fail-fast on unexpected values.
  • CWE-785 (Incorrect/Fragile Permission Assignment): new GVK watches expand RBAC surface. Action: regenerate and review RBAC rules, restrict to least privilege (watch/list on specific namespaces if applicable).
  • Condition semantic mismatch risk: cmaConditionFilter mirrors existing logic but may not match upstream CMA status names/semantics. Action: verify CMA condition names/true/false semantics against upstream CRD and add unit tests for all expected condition states.
  • Naming/clarity issue (code quality): exported constant LLMWorkVariantAutoscallingDepedencies contains typos. Action: correct identifier, update all references and tests, provide deprecated alias if preserving compatibility.
  • Observability gap: dependency failures and CMA absence update status but lack structured logs/metrics. Action: add structured logs (resource, namespace, reason) and metrics for dependency health and reconciliation errors.
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: add support for WVA' accurately describes the main change: adding support for the Workload Variant Autoscaler component across the codebase.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/controller/components/modelcontroller/modelcontroller_actions.go (1)

63-65: ⚠️ Potential issue | 🟡 Minor

WVA manifests appended at line 48 bypass parameter substitution.

ApplyParams processes only rr.Manifests[0] (the base modelcontroller manifest). WVA manifests appended at line 48 become rr.Manifests[1] and are never passed to ApplyParams. While WVA manifests currently don't include params.env, this creates an architectural gap: if future versions require parameter substitution, they'll silently skip processing since the extraParamsMap is only applied to index 0.

Apply ApplyParams to all manifest entries in the loop, not just the first one.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/components/modelcontroller/modelcontroller_actions.go`
around lines 63 - 65, The code only calls odhdeploy.ApplyParams on
rr.Manifests[0], so any additional manifests in rr.Manifests (e.g., WVA entries)
never get parameter substitution; change the logic to iterate over rr.Manifests
and call odhdeploy.ApplyParams for each manifest string, passing "params.env"
and extraParamsMap for every entry (use the same signature currently used for
rr.Manifests[0]); ensure errors from ApplyParams on any manifest are
handled/returned the same way as the current single-call case so failures on
later manifests surface correctly.
🧹 Nitpick comments (2)
internal/controller/components/kserve/kserve_controller_actions_test.go (1)

313-354: Add a degraded-operator test for the new WVA path.

These cases only cover CMA Subscription presence/absence. The new WVA behavior also depends on the operator/controller health path, so a present-but-degraded CMA/KEDA controller branch should be covered here as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/components/kserve/kserve_controller_actions_test.go`
around lines 313 - 354, Add a new subtest that covers the "present-but-degraded"
operator path for WVA: create kserveWithWVA and include cmaOperatorSubscription
plus a corresponding ClusterServiceVersion (or other operator health object)
whose status indicates degraded (e.g., CSV.Status.Phase = "Failed" or a
condition showing not available), then call checkPreConditions with that
ReconciliationRequest and assert
rr.Conditions.GetCondition(LLMWorkVariantAutoscallingDepedencies) yields the
degraded/false result; use the same setup style as the existing tests
(createKserveCR, fakeclient.New, conditions.NewManager) and reference
cmaOperatorSubscription, checkPreConditions, and
LLMWorkVariantAutoscallingDepedencies to locate where to add the test.
internal/controller/components/kserve/kserve_controller_actions.go (1)

195-195: Typos in constant name: "Autoscalling" → "Autoscaling", "Depedencies" → "Dependencies".

The constant LLMWorkVariantAutoscallingDepedencies (defined in kserve.go:42) contains two spelling errors. Consider correcting to LLMWorkVariantAutoscalingDependencies for consistency with LLMInferenceServiceDependencies.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/components/kserve/kserve_controller_actions.go` at line
195, The constant name LLMWorkVariantAutoscallingDepedencies contains typos;
rename its declaration to LLMWorkVariantAutoscalingDependencies and update every
usage (e.g., the call
rr.Conditions.MarkUnknown(LLMWorkVariantAutoscallingDepedencies) in
kserve_controller_actions.go and any other references) to the corrected
identifier so the code compiles and matches the existing naming pattern like
LLMInferenceServiceDependencies; ensure you update the original constant
definition (formerly in kserve.go) and any tests or comments that reference the
old misspelled name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/components/kserve/kserve_controller_actions.go`:
- Around line 267-289: Fix the typo in the comment: change "escalte" to
"escalate" on the line describing Severity. Also avoid allocating a new
dependency action on every reconciliation by extracting the instantiation of
dependency.NewAction(dependency.MonitorOperator(...)) into a package-level
variable (e.g., preconstructedAction) and have monitorCMAOperator call that
variable; reference the symbols monitorCMAOperator, dependency.NewAction,
dependency.MonitorOperator, gvk.CustomMetricsAutoscalerV1Alpha1,
common.ConditionSeverityError and cmaConditionFilter when locating the code to
change.

In `@internal/controller/components/kserve/kserve.go`:
- Around line 42-44: Update UpdateDSCStatus() to also propagate the new KServe
dependency condition LLMWorkVariantAutoscallingDepedencies into the DSC status
the same way LLMInferenceServiceDependencies and
LLMInferenceServiceWideEPDependencies are copied: when reading the KServe CR
conditions, look up the LLMWorkVariantAutoscallingDepedencies condition and set
or update the corresponding DSC condition (preserving status, reason, message,
and timestamps) so degraded CMA/WVA state becomes visible in the DSC conditions.

In `@README.md`:
- Around line 510-511: Add a prerequisite note to the example explaining that
setting "wva: Managed" requires the external CMA/KEDA controller path to be
installed and healthy; update the README example where "wva: Managed" appears to
include a short bullet or inline sentence stating the dependency, how to verify
the CMA/KEDA controllers are installed and running (e.g., check controller
pods/CRDs/health endpoints), and a brief remediation step if not present (switch
to Removed or install/enable the controller). Ensure the text references the
exact field name "wva: Managed" and the external components "CMA/KEDA
controller" so users see the dependency next to the example.

In `@tests/e2e/modelcontroller_test.go`:
- Around line 97-113: The test currently only flips
`.spec.components.kserve.wva.managementState` and then asserts the WVA
Deployment exists, which makes the outcome depend on prior suite state; change
the test to explicitly establish prerequisites by first ensuring the KServe
component is present and set to Managed (use the same DataScienceCluster object
mutated via tc.EventuallyResourcePatched with
WithMinimalObject(gvk.DataScienceCluster, tc.DataScienceClusterNamespacedName)
and WithMutateFunc using testf.TransformPipeline/testf.Transform to set
`.spec.components.kserve.managementState = "Managed"` or otherwise
create/configure the KServe component), then flip
`.spec.components.kserve.wva.managementState` and finally call
EnsureResourceExists for the Deployment as before; reference the existing
helpers EventuallyResourcePatched, WithMutateFunc, TransformPipeline, Transform,
and EnsureResourceExists to implement the setup so the test is self-contained.

---

Outside diff comments:
In `@internal/controller/components/modelcontroller/modelcontroller_actions.go`:
- Around line 63-65: The code only calls odhdeploy.ApplyParams on
rr.Manifests[0], so any additional manifests in rr.Manifests (e.g., WVA entries)
never get parameter substitution; change the logic to iterate over rr.Manifests
and call odhdeploy.ApplyParams for each manifest string, passing "params.env"
and extraParamsMap for every entry (use the same signature currently used for
rr.Manifests[0]); ensure errors from ApplyParams on any manifest are
handled/returned the same way as the current single-call case so failures on
later manifests surface correctly.

---

Nitpick comments:
In `@internal/controller/components/kserve/kserve_controller_actions_test.go`:
- Around line 313-354: Add a new subtest that covers the "present-but-degraded"
operator path for WVA: create kserveWithWVA and include cmaOperatorSubscription
plus a corresponding ClusterServiceVersion (or other operator health object)
whose status indicates degraded (e.g., CSV.Status.Phase = "Failed" or a
condition showing not available), then call checkPreConditions with that
ReconciliationRequest and assert
rr.Conditions.GetCondition(LLMWorkVariantAutoscallingDepedencies) yields the
degraded/false result; use the same setup style as the existing tests
(createKserveCR, fakeclient.New, conditions.NewManager) and reference
cmaOperatorSubscription, checkPreConditions, and
LLMWorkVariantAutoscallingDepedencies to locate where to add the test.

In `@internal/controller/components/kserve/kserve_controller_actions.go`:
- Line 195: The constant name LLMWorkVariantAutoscallingDepedencies contains
typos; rename its declaration to LLMWorkVariantAutoscalingDependencies and
update every usage (e.g., the call
rr.Conditions.MarkUnknown(LLMWorkVariantAutoscallingDepedencies) in
kserve_controller_actions.go and any other references) to the corrected
identifier so the code compiles and matches the existing naming pattern like
LLMInferenceServiceDependencies; ensure you update the original constant
definition (formerly in kserve.go) and any tests or comments that reference the
old misspelled name.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 774b3326-9d66-4213-9518-84280710b4c1

📥 Commits

Reviewing files that changed from the base of the PR and between 6d541e8 and a101383.

📒 Files selected for processing (18)
  • README.md
  • api/components/v1alpha1/kserve_types.go
  • api/components/v1alpha1/modelcontroller_types.go
  • api/components/v1alpha1/zz_generated.deepcopy.go
  • config/rhoai/samples/datasciencecluster_v2_datasciencecluster.yaml
  • config/samples/datasciencecluster_v2_datasciencecluster.yaml
  • docs/DESIGN.md
  • get_all_manifests.sh
  • internal/controller/components/kserve/kserve.go
  • internal/controller/components/kserve/kserve_controller.go
  • internal/controller/components/kserve/kserve_controller_actions.go
  • internal/controller/components/kserve/kserve_controller_actions_test.go
  • internal/controller/components/kserve/kserve_support.go
  • internal/controller/components/modelcontroller/modelcontroller.go
  • internal/controller/components/modelcontroller/modelcontroller_actions.go
  • internal/controller/components/modelcontroller/modelcontroller_support.go
  • pkg/cluster/gvk/gvk.go
  • tests/e2e/modelcontroller_test.go

Comment on lines +267 to +289

// monitorCMAOperator conditionally monitors CMA operator health only when WVA is enabled.
func monitorCMAOperator(ctx context.Context, rr *odhtypes.ReconciliationRequest) error {
k, ok := rr.Instance.(*componentApi.Kserve)
if !ok {
return fmt.Errorf("resource instance %v is not a componentApi.Kserve", rr.Instance)
}

if k.Spec.WVA.ManagementState != operatorv1.Managed {
return nil
}

// Reuse existing dependency monitoring infrastructure
monitorAction := dependency.NewAction(
dependency.MonitorOperator(dependency.OperatorConfig{
OperatorGVK: gvk.CustomMetricsAutoscalerV1Alpha1,
Severity: common.ConditionSeverityError, // cause Kserve Ready: False and escalte to DSC
Filter: cmaConditionFilter,
}),
)

return monitorAction(ctx, rr)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify cmaConditionFilter is defined
rg -n 'cmaConditionFilter' --type go

Repository: opendatahub-io/opendatahub-operator

Length of output: 460


Fix typo on line 283 and consider extracting action instantiation.

  1. Line 283: "escalte" → "escalate"

  2. Unlike LWS monitoring (configured statically in builder), this creates a new dependency.NewAction instance per reconciliation. While functionally correct, consider extracting to a package-level variable if reconciliation frequency makes allocation overhead measurable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/components/kserve/kserve_controller_actions.go` around
lines 267 - 289, Fix the typo in the comment: change "escalte" to "escalate" on
the line describing Severity. Also avoid allocating a new dependency action on
every reconciliation by extracting the instantiation of
dependency.NewAction(dependency.MonitorOperator(...)) into a package-level
variable (e.g., preconstructedAction) and have monitorCMAOperator call that
variable; reference the symbols monitorCMAOperator, dependency.NewAction,
dependency.MonitorOperator, gvk.CustomMetricsAutoscalerV1Alpha1,
common.ConditionSeverityError and cmaConditionFilter when locating the code to
change.

Comment on lines +510 to +511
wva:
managementState: Removed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Document the prerequisite for wva: Managed.

This is the first user-facing example of the new field, but it does not mention that enabling WVA depends on the external CMA/KEDA controller path being installed and healthy. As written, users can flip this to Managed and get a broken rollout with no hint why.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 510 - 511, Add a prerequisite note to the example
explaining that setting "wva: Managed" requires the external CMA/KEDA controller
path to be installed and healthy; update the README example where "wva: Managed"
appears to include a short bullet or inline sentence stating the dependency, how
to verify the CMA/KEDA controllers are installed and running (e.g., check
controller pods/CRDs/health endpoints), and a brief remediation step if not
present (switch to Removed or install/enable the controller). Ensure the text
references the exact field name "wva: Managed" and the external components
"CMA/KEDA controller" so users see the dependency next to the example.

@github-actions
Copy link
Contributor

This PR can't be merged just yet 😢

Please run make generate manifests api-docs and commit the changes.

For more info: https://github.com/opendatahub-io/opendatahub-operator/actions/runs/22965964127

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
internal/controller/components/modelcontroller/modelcontroller_support.go (1)

49-55: Optional: Extract "wva" to a constant for pattern consistency.

The string literal "wva" appears only once in the codebase, so extracting it to a constant won't prevent typos elsewhere. However, manifestsPath() uses the ComponentName constant for ContextDir, while this function uses a hardcoded string. Adopting a uniform pattern would improve readability.

♻️ Optional refactor
 const (
 	ComponentName = componentApi.ModelControllerComponentName
 
 	ReadyConditionType = componentApi.ModelControllerKind + status.ReadySuffix
 
 	// LegacyComponentName is the name of the component that is assigned to deployments
 	// via Kustomize. Since a deployment selector is immutable, we can't upgrade existing
 	// deployment to the new component name, so keep it around till we figure out a solution.
 	LegacyComponentName = "odh-model-controller"
+
+	// WVAContextDir is the context directory for WVA manifests.
+	WVAContextDir = "wva"
 )
 func wvaManifestsPath() types.ManifestInfo {
 	return types.ManifestInfo{
 		Path:       odhdeploy.DefaultManifestPath,
-		ContextDir: "wva",
+		ContextDir: WVAContextDir,
 		SourcePath: "default",
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/components/modelcontroller/modelcontroller_support.go`
around lines 49 - 55, Extract the literal "wva" used as the ContextDir inside
wvaManifestsPath() into a shared constant (e.g., WvaComponentName) to match the
pattern used by manifestsPath() which uses the ComponentName constant; update
wvaManifestsPath() to reference that constant instead of the hardcoded "wva" and
add the new constant near other component constants so the codebase is
consistent.
pkg/cluster/gvk/gvk.go (1)

450-455: Rename this GVK constant to match the resource kind.

The exported name says CustomMetricsAutoscaler..., but the actual resource here is keda.sh/v1alpha1, Kind: KedaController. That mixes the operator/product name with the CR kind and makes later call sites and logs easy to misread. Use a symbol name that matches the API object, e.g. KedaControllerV1Alpha1, and keep the comment for the operator/product alias if needed. (docs.okd.io)

Suggested diff
-	// CustomMetricsAutoscaler uses KEDA's CRD.
-	CustomMetricsAutoscalerV1Alpha1 = schema.GroupVersionKind{
+	// KedaController is the CR exposed by the Custom Metrics Autoscaler / KEDA operator.
+	KedaControllerV1Alpha1 = schema.GroupVersionKind{
 		Group:   "keda.sh",
 		Version: "v1alpha1",
 		Kind:    "KedaController",
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cluster/gvk/gvk.go` around lines 450 - 455, Rename the exported constant
CustomMetricsAutoscalerV1Alpha1 to match the API Kind by changing its identifier
to KedaControllerV1Alpha1 and update the preceding comment to mention KEDA as an
operator/product alias (keep the original comment text if useful). Ensure the
schema.GroupVersionKind value stays the same (Group "keda.sh", Version
"v1alpha1", Kind "KedaController") and update all references/usages of
CustomMetricsAutoscalerV1Alpha1 throughout the codebase to the new symbol name
KedaControllerV1Alpha1 to avoid mismatched logs and call sites.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/api-overview.md`:
- Line 193: Update the docs entry for `wva`/`WVASpec` to explicitly state
ownership semantics: clarify that the "Managed" status means the operator
deploys and monitors the WVA integration path (installation, health of the
integration components) but does NOT claim ownership or watch
`VariantAutoscaling` resources — those remain user-managed and must be
created/edited by users; update the descriptive text wherever `wva`, `WVASpec`,
or `VariantAutoscaling` are mentioned (including the other listed occurrences)
so it warns readers that troubleshooting autoscaling should be done in the
user's control plane/VariantAutoscaling resources, not assumed to be controlled
by the operator.
- Line 1266: Rewrite the ModelControllerKerveSpec description to be a clear,
grammatical sentence that explicitly states this spec is a minimal subset of
DSCKserve preserving only the public API fields: management, NIM, and WVA
specifications; update the text for ModelControllerKerveSpec to list those
preserved fields (e.g., "management", "nim", "wva") and clarify that other
DSCKserve fields are omitted while these retained fields keep their original
semantics and public API surface.

In `@internal/controller/components/kserve/kserve.go`:
- Around line 42-44: Rename the exported constant
LLMWorkVariantAutoscallingDepedencies to the correctly spelled
LLMWorkVariantAutoscalingDependencies and update every reference (definition
plus all usages in production and test code) to the new identifier; ensure the
corrected constant value is used where it forms condition/type strings so
user-visible status conditions reflect the fix, update imports/usages in kserve
controller functions and tests that reference the old name, and run tests/build
to validate no remaining references to the misspelled symbol.

---

Nitpick comments:
In `@internal/controller/components/modelcontroller/modelcontroller_support.go`:
- Around line 49-55: Extract the literal "wva" used as the ContextDir inside
wvaManifestsPath() into a shared constant (e.g., WvaComponentName) to match the
pattern used by manifestsPath() which uses the ComponentName constant; update
wvaManifestsPath() to reference that constant instead of the hardcoded "wva" and
add the new constant near other component constants so the codebase is
consistent.

In `@pkg/cluster/gvk/gvk.go`:
- Around line 450-455: Rename the exported constant
CustomMetricsAutoscalerV1Alpha1 to match the API Kind by changing its identifier
to KedaControllerV1Alpha1 and update the preceding comment to mention KEDA as an
operator/product alias (keep the original comment text if useful). Ensure the
schema.GroupVersionKind value stays the same (Group "keda.sh", Version
"v1alpha1", Kind "KedaController") and update all references/usages of
CustomMetricsAutoscalerV1Alpha1 throughout the codebase to the new symbol name
KedaControllerV1Alpha1 to avoid mismatched logs and call sites.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 5586afc7-8203-4d5d-bf86-f6cb4afa7617

📥 Commits

Reviewing files that changed from the base of the PR and between a101383 and 67de789.

📒 Files selected for processing (19)
  • README.md
  • api/components/v1alpha1/kserve_types.go
  • api/components/v1alpha1/modelcontroller_types.go
  • api/components/v1alpha1/zz_generated.deepcopy.go
  • config/rhoai/samples/datasciencecluster_v2_datasciencecluster.yaml
  • config/samples/datasciencecluster_v2_datasciencecluster.yaml
  • docs/DESIGN.md
  • docs/api-overview.md
  • get_all_manifests.sh
  • internal/controller/components/kserve/kserve.go
  • internal/controller/components/kserve/kserve_controller.go
  • internal/controller/components/kserve/kserve_controller_actions.go
  • internal/controller/components/kserve/kserve_controller_actions_test.go
  • internal/controller/components/kserve/kserve_support.go
  • internal/controller/components/modelcontroller/modelcontroller.go
  • internal/controller/components/modelcontroller/modelcontroller_actions.go
  • internal/controller/components/modelcontroller/modelcontroller_support.go
  • pkg/cluster/gvk/gvk.go
  • tests/e2e/modelcontroller_test.go
🚧 Files skipped from review as they are similar to previous changes (8)
  • internal/controller/components/kserve/kserve_controller_actions_test.go
  • internal/controller/components/kserve/kserve_controller.go
  • internal/controller/components/kserve/kserve_controller_actions.go
  • api/components/v1alpha1/modelcontroller_types.go
  • internal/controller/components/modelcontroller/modelcontroller.go
  • internal/controller/components/modelcontroller/modelcontroller_actions.go
  • config/rhoai/samples/datasciencecluster_v2_datasciencecluster.yaml
  • get_all_manifests.sh

| `rawDeploymentServiceConfig` _[RawServiceConfig](#rawserviceconfig)_ | Configures the type of service that is created for InferenceServices using RawDeployment.<br />The values for RawDeploymentServiceConfig can be "Headless" (default value) or "Headed".<br />Headless: to set "ServiceClusterIPNone = true" in the 'inferenceservice-config' configmap for Kserve.<br />Headed: to set "ServiceClusterIPNone = false" in the 'inferenceservice-config' configmap for Kserve. | Headless | Enum: [Headless Headed] <br /> |
| `nim` _[NimSpec](#nimspec)_ | Configures and enables NVIDIA NIM integration | | |
| `modelsAsService` _[DSCModelsAsServiceSpec](#dscmodelsasservicespec)_ | Configures and enables Models as a Service integration | | |
| `wva` _[WVASpec](#wvaspec)_ | Configures and enables workload-variant-autoscaler (WVA) integration | | |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Document WVA ownership semantics explicitly.

These descriptions currently read as if Managed means the operator fully manages WVA, but this PR’s behavior is narrower: the operator deploys/monitors the integration path while VariantAutoscaling resources remain user-managed and are not watched or owned. That distinction is important for supportability and will otherwise send users to the wrong control plane when autoscaling does not behave as expected.

Suggested doc fix
-| `wva` _[WVASpec](`#wvaspec`)_ | Configures and enables workload-variant-autoscaler (WVA) integration |  |  |
+| `wva` _[WVASpec](`#wvaspec`)_ | Configures workload-variant-autoscaler (WVA) integration. When managed, the operator deploys and monitors the integration prerequisites, but `VariantAutoscaling` resources remain user-managed and are not watched or owned by the operator. |  |  |
-WVASpec enables workload-variant-autoscaler integration
+WVASpec configures workload-variant-autoscaler integration.
+When `managementState` is `Managed`, the operator manages the WVA integration path and related health checks.
+`VariantAutoscaling` resources remain user-managed and are not watched or owned by the operator.

Also applies to: 888-888, 924-924, 1277-1277, 1983-2000

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/api-overview.md` at line 193, Update the docs entry for `wva`/`WVASpec`
to explicitly state ownership semantics: clarify that the "Managed" status means
the operator deploys and monitors the WVA integration path (installation, health
of the integration components) but does NOT claim ownership or watch
`VariantAutoscaling` resources — those remain user-managed and must be
created/edited by users; update the descriptive text wherever `wva`, `WVASpec`,
or `VariantAutoscaling` are mentioned (including the other listed occurrences)
so it warns readers that troubleshooting autoscaling should be done in the
user's control plane/VariantAutoscaling resources, not assumed to be controlled
by the operator.

Comment on lines +42 to +44
// For WVA.
LLMWorkVariantAutoscallingDepedencies = componentApi.KserveKind + "LLMWorkVariantAutoscallingDepedencies"
cmaOperatorSubscription = "openshift-custom-metrics-autoscaler-operator"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n "LLMWorkVariantAutoscall" --type go

Repository: opendatahub-io/opendatahub-operator

Length of output: 1255


Typo in exported constant propagates to 8 locations.

LLMWorkVariantAutoscallingDepedencies contains two spelling errors:

  • "Autoscalling" → "Autoscaling" (double 'l')
  • "Depedencies" → "Dependencies" (missing 'n')

The constant appears in 8 locations (1 definition + 7 usages in kserve_controller_actions.go, kserve_controller_actions_test.go). Since this is exported and used for condition types, the misspelling propagates to user-visible status conditions. All locations require synchronized fixes.

Proposed fix
-	LLMWorkVariantAutoscallingDepedencies = componentApi.KserveKind + "LLMWorkVariantAutoscallingDepedencies"
+	LLMWorkVariantAutoscalingDependencies = componentApi.KserveKind + "LLMWorkVariantAutoscalingDependencies"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/components/kserve/kserve.go` around lines 42 - 44, Rename
the exported constant LLMWorkVariantAutoscallingDepedencies to the correctly
spelled LLMWorkVariantAutoscalingDependencies and update every reference
(definition plus all usages in production and test code) to the new identifier;
ensure the corrected constant value is used where it forms condition/type
strings so user-visible status conditions reflect the fix, update imports/usages
in kserve controller functions and tests that reference the old name, and run
tests/build to validate no remaining references to the misspelled symbol.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
tests/e2e/modelcontroller_test.go (1)

95-111: ⚠️ Potential issue | 🟡 Minor

Make this test establish KServe as a prerequisite itself.

This still only flips .spec.components.kserve.wva.managementState, so the result depends on some earlier case having already left KServe Managed. If this case runs alone or after a cleanup path, it can fail without exercising WVA at all.

Suggested change
 	tc.EventuallyResourcePatched(
 		WithMinimalObject(gvk.DataScienceCluster, tc.DataScienceClusterNamespacedName),
 		WithMutateFunc(
 			testf.TransformPipeline(
+				testf.Transform(`.spec.components.%s.managementState = "%s"`, componentApi.KserveComponentName, operatorv1.Managed),
 				testf.Transform(`.spec.components.%s.wva.managementState = "%s"`, componentApi.KserveComponentName, operatorv1.Managed),
 			),
 		),
 	)
+
+	tc.verifyResourcesDeployed()
 
 	// Ensure WVA deployment exists.
 	tc.EnsureResourceExists(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/modelcontroller_test.go` around lines 95 - 111, The test currently
only flips `.spec.components.kserve.wva.managementState` (via
tc.EventuallyResourcePatched and componentApi.KserveComponentName) so it depends
on KServe already being Managed; change the test to explicitly establish KServe
as a prerequisite by also patching `.spec.components.kserve.managementState =
"Managed"` on the DataScienceCluster (add a second Transform in
testf.TransformPipeline or extend the existing Transform to set both
`.spec.components.%s.managementState` for the KServe component and
`.spec.components.%s.wva.managementState`), and ensure the KServe resource is
present before asserting the WVA deployment (use tc.EnsureResourceExists for the
KServe resource if necessary) so the WVA deployment check (wvaDeploymentName,
tc.AppsNamespace) does not depend on prior tests.
🧹 Nitpick comments (1)
tests/e2e/modelcontroller_test.go (1)

105-111: Assert rollout health, not just object creation.

EnsureResourceExists passes as soon as the Deployment is present, so this test will miss broken rollouts where WVA is created but never becomes available. Gate this on Deployment status to prove the feature is actually usable.

Suggested change
 	tc.EnsureResourceExists(
 		WithMinimalObject(gvk.Deployment, types.NamespacedName{
 			Name:      wvaDeploymentName,
 			Namespace: tc.AppsNamespace,
 		}),
+		WithCondition(
+			And(
+				jq.Match(`.status.availableReplicas >= 1`),
+				jq.Match(`(.status.unavailableReplicas // 0) == 0`),
+			),
+		),
 	)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/modelcontroller_test.go` around lines 105 - 111, The test currently
only ensures the Deployment object exists via EnsureResourceExists
(WithMinimalObject, gvk.Deployment, wvaDeploymentName, tc.AppsNamespace) but
does not verify the rollout completed; update the test to wait for the
Deployment to become healthy by asserting its status rather than presence: after
ensuring the object exists, poll the Deployment status (e.g., check
Status.AvailableReplicas equals Spec.Replicas and/or the Available condition is
True) until it reports available or timeout, or use an existing helper like
WaitForDeploymentAvailable/EnsureDeploymentRollout if present; fail the test if
the rollout never becomes healthy within the timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/e2e/modelcontroller_test.go`:
- Around line 95-111: The test currently only flips
`.spec.components.kserve.wva.managementState` (via tc.EventuallyResourcePatched
and componentApi.KserveComponentName) so it depends on KServe already being
Managed; change the test to explicitly establish KServe as a prerequisite by
also patching `.spec.components.kserve.managementState = "Managed"` on the
DataScienceCluster (add a second Transform in testf.TransformPipeline or extend
the existing Transform to set both `.spec.components.%s.managementState` for the
KServe component and `.spec.components.%s.wva.managementState`), and ensure the
KServe resource is present before asserting the WVA deployment (use
tc.EnsureResourceExists for the KServe resource if necessary) so the WVA
deployment check (wvaDeploymentName, tc.AppsNamespace) does not depend on prior
tests.

---

Nitpick comments:
In `@tests/e2e/modelcontroller_test.go`:
- Around line 105-111: The test currently only ensures the Deployment object
exists via EnsureResourceExists (WithMinimalObject, gvk.Deployment,
wvaDeploymentName, tc.AppsNamespace) but does not verify the rollout completed;
update the test to wait for the Deployment to become healthy by asserting its
status rather than presence: after ensuring the object exists, poll the
Deployment status (e.g., check Status.AvailableReplicas equals Spec.Replicas
and/or the Available condition is True) until it reports available or timeout,
or use an existing helper like
WaitForDeploymentAvailable/EnsureDeploymentRollout if present; fail the test if
the rollout never becomes healthy within the timeout.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 6ac5cb1e-f1a5-4427-947e-cbdb42c2582b

📥 Commits

Reviewing files that changed from the base of the PR and between 67de789 and 320fdaf.

📒 Files selected for processing (2)
  • internal/controller/components/modelcontroller/modelcontroller_support.go
  • tests/e2e/modelcontroller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/controller/components/modelcontroller/modelcontroller_support.go

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@build/operands-map.yaml`:
- Around line 236-237: The image reference for
RELATED_IMAGE_ODH_WORKLOAD_VARIANT_AUTOSCALER_CONTROLLER_IMAGE uses an invalid
`@v0.5.1` tag; replace this with the immutable published digest for
workload-variant-autoscaler v0.5.1 (e.g., change the value from
`quay.io/opendatahub/workload-variant-autoscaler@v0.5.1` to
`quay.io/opendatahub/workload-variant-autoscaler@sha256:<published-digest>`),
ensuring the entry uses the sha256 digest like the other entries so OCI syntax
is correct and supply-chain pinning is preserved.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 949edb48-1cd7-4c4d-9bac-aeeb6f206047

📥 Commits

Reviewing files that changed from the base of the PR and between 320fdaf and c387c3e.

📒 Files selected for processing (2)
  • build/manifests-config.yaml
  • build/operands-map.yaml

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
tests/e2e/modelcontroller_test.go (1)

96-104: ⚠️ Potential issue | 🟠 Major

Make each WVA test establish KServe/WVA state explicitly.

Both cases still depend on a previous test having already driven KServe to Managed. With WVA defaulting to Removed, these assertions are order-dependent and can fail or hang when run in isolation. Set both .spec.components.kserve.managementState and .spec.components.kserve.wva.managementState in each WVA-specific test before checking the Deployment or ConfigMap.

Also applies to: 134-143

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/modelcontroller_test.go` around lines 96 - 104, Tests that assert
WVA behavior are order-dependent because they only set
.spec.components.kserve.wva.managementState and assume kserve is already
Managed; update each WVA-specific test (the blocks using
tc.EventuallyResourcePatched with WithMinimalObject/WithMutateFunc and
testf.TransformPipeline/testf.Transform) to explicitly set both
.spec.components.kserve.managementState = "Managed" and
.spec.components.kserve.wva.managementState = "<desiredState>" before proceeding
to check Deployments/ConfigMaps (apply the same change to the other occurrence
noted at lines 134-143).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/e2e/modelcontroller_test.go`:
- Around line 128-153: The test currently hardcodes queueSpareTriggerOriginal =
"3" which will break if the cluster default changes; instead read the live
ConfigMap value and derive the original string before patching. Update the test
to GET the ConfigMap named wvaConfigMapName (use the test helper that retrieves
resources, e.g. tc.GetResource or tc.KubeClient.Get) and read cm.Data["default"]
to compute the current queueSpareTriggerOriginal value, then call
tc.EventuallyResourcePatched with testf.TransformPipeline/testf.Transform using
that derived original value and the desired queueSpareTriggerModified to perform
the substitution; keep the rest of the assertions (jq.Match, wvaConfigMapName,
queueSpareTriggerModified) the same.
- Around line 145-181: Register the ConfigMap rollback before performing the
first mutation: move or invoke the existing restore call (the
tc.EventuallyResourcePatched block that uses WithMinimalObject(gvk.ConfigMap,
namespaced name wvaConfigMapName) and the WithMutateFunc that transforms
queueSpareTriggerModified back to queueSpareTriggerOriginal) so it is installed
prior to the initial tc.EventuallyResourcePatched that sets
queueSpareTriggerOriginal→queueSpareTriggerModified (or register it as a cleanup
via tc.RegisterCleanup/defer), ensuring the restore will run on every failure
path.

---

Duplicate comments:
In `@tests/e2e/modelcontroller_test.go`:
- Around line 96-104: Tests that assert WVA behavior are order-dependent because
they only set .spec.components.kserve.wva.managementState and assume kserve is
already Managed; update each WVA-specific test (the blocks using
tc.EventuallyResourcePatched with WithMinimalObject/WithMutateFunc and
testf.TransformPipeline/testf.Transform) to explicitly set both
.spec.components.kserve.managementState = "Managed" and
.spec.components.kserve.wva.managementState = "<desiredState>" before proceeding
to check Deployments/ConfigMaps (apply the same change to the other occurrence
noted at lines 134-143).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 1de8b2e2-39b8-441c-bbeb-90c188d5450e

📥 Commits

Reviewing files that changed from the base of the PR and between c387c3e and 0c7af7e.

📒 Files selected for processing (2)
  • build/operands-map.yaml
  • tests/e2e/modelcontroller_test.go
✅ Files skipped from review due to trivial changes (1)
  • build/operands-map.yaml

zdtsw added 4 commits March 13, 2026 17:58
- add WVA under DSC kserve
- WVA will be deployment via modelcontroller
- checks on WVA e.g CMA (we dont take Keda as its operator is
  deprecated) only when WVA is Managed:
  1. subscritption: if not exist, we warn user to install it
- build: some new konflux build related config
- test: add more test for WVA also check CM can be user configable
- update: add missing owns() on secret

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
- wont be any degradate status on kedacontroller
- only show subscription miss warning when wva is Managed
- some updates on the e2e tests

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
- the check should be done on the resource recreated after deletion
not the one before gets deleted
- since no call is on this check but only if the resource gets recreated
  or not, it is not gonna break behvaior

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
@davidebianchi
Copy link
Member

/override "Red Hat Konflux / odh-operator-ci-e2e-its-manual / odh-operator-ci"
/override "Red Hat Konflux / odh-operator-ci-on-pull-request"

@openshift-ci
Copy link

openshift-ci bot commented Mar 13, 2026

@davidebianchi: Overrode contexts on behalf of davidebianchi: Red Hat Konflux / odh-operator-ci-e2e-its-manual / odh-operator-ci, Red Hat Konflux / odh-operator-ci-on-pull-request

Details

In response to this:

/override "Red Hat Konflux / odh-operator-ci-e2e-its-manual / odh-operator-ci"
/override "Red Hat Konflux / odh-operator-ci-on-pull-request"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@zdtsw
Copy link
Member Author

zdtsw commented Mar 13, 2026

/test opendatahub-operator-rhoai-e2e

@openshift-ci
Copy link

openshift-ci bot commented Mar 13, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidebianchi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zdtsw
Copy link
Member Author

zdtsw commented Mar 13, 2026

/override ci/prow/opendatahub-operator-rhoai-e2e

as this failed only in RHOAI and on another PR, manifest pointed by branch might caused issue from kserve

@openshift-ci
Copy link

openshift-ci bot commented Mar 13, 2026

@zdtsw: Overrode contexts on behalf of zdtsw: ci/prow/opendatahub-operator-rhoai-e2e

Details

In response to this:

/override ci/prow/opendatahub-operator-rhoai-e2e

as this failed only in RHOAI and on another PR, manifest pointed by branch might caused issue from kserve

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@codecov
Copy link

codecov bot commented Mar 13, 2026

Codecov Report

❌ Patch coverage is 1.75439% with 56 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.59%. Comparing base (f671c39) to head (a38f88e).

Files with missing lines Patch % Lines
...ents/modelcontroller/modelcontroller_controller.go 0.00% 27 Missing ⚠️
...ponents/modelcontroller/modelcontroller_actions.go 0.00% 22 Missing ⚠️
...ponents/modelcontroller/modelcontroller_support.go 0.00% 5 Missing ⚠️
...ller/components/modelcontroller/modelcontroller.go 33.33% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3243      +/-   ##
==========================================
- Coverage   50.78%   50.59%   -0.20%     
==========================================
  Files         194      194              
  Lines       14179    14235      +56     
==========================================
+ Hits         7201     7202       +1     
- Misses       6241     6295      +54     
- Partials      737      738       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@openshift-merge-bot openshift-merge-bot bot merged commit 4af9290 into opendatahub-io:main Mar 13, 2026
26 of 28 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in ODH Platform Planning Mar 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants