feat: add dynamoModel CRD #4166

julienmancuso · 2025-11-06T21:19:31Z

Overview:

Add the DynamoModel CRD.

Summary by CodeRabbit

Release Notes

New Features
- Introduced DynamoModel custom resource for managing model deployments
- Added ModelRef field to DynamoComponentDeployment and DynamoGraphDeployment for model linking
- Automatic headless service creation for model endpoint discovery
- Support for LoRA model loading and management across endpoints
Chores
- Updated RBAC permissions for model resource management and service discovery
- Enhanced controller infrastructure with model lifecycle capabilities

Example of a DGD and its associated new DynamoModel CR:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-disagg
spec:
  services:
    Frontend:
      dynamoNamespace: sglang-disagg
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.0
    decode:
      modelRef:
        name: Qwen/Qwen3-0.6B
      envFromSecret: hf-token-secret
      dynamoNamespace: sglang-disagg
      componentType: worker
      subComponentType: decode
      replicas: 1
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.0
          workingDir: /workspace/examples/backends/sglang
          command:
          - python3
          - -m
          - dynamo.sglang
          args:
            - --model-path
            - Qwen/Qwen3-0.6B
            - --served-model-name
            - Qwen/Qwen3-0.6B
            - --page-size
            - "16"
            - --tp
            - "1"
            - --trust-remote-code
            - --skip-tokenizer-init
            - --disaggregation-mode
            - decode
            - --disaggregation-transfer-backend
            - nixl
            - --disaggregation-bootstrap-port
            - "12345"
            - --host
            - "0.0.0.0"
    prefill:
      modelRef:
        name: Qwen/Qwen3-0.6B
      envFromSecret: hf-token-secret
      dynamoNamespace: sglang-disagg
      componentType: worker
      subComponentType: prefill
      replicas: 1
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.0
          workingDir: /workspace/examples/backends/sglang
          command:
          - python3
          - -m
          - dynamo.sglang
          args:
            - --model-path
            - Qwen/Qwen3-0.6B
            - --served-model-name
            - Qwen/Qwen3-0.6B
            - --page-size
            - "16"
            - --tp
            - "1"
            - --trust-remote-code
            - --skip-tokenizer-init
            - --disaggregation-mode
            - prefill
            - --disaggregation-transfer-backend
            - nixl
            - --disaggregation-bootstrap-port
            - "12345"
            - --host
            - "0.0.0.0"
---
# Example DynamoModel CR - LoRA model
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: sglang-3-0-6b-my-lora
spec:
  modelName: Qwen/Qwen3-0.6B-my-lora
  baseModelName: Qwen/Qwen3-0.6B
  modelType: lora
  source:
    uri: s3://my-bucket/Qwen/Qwen3-0.6B-my-lora

The new controller makes sure that the workers of the DGD (both decode and prefill) have the LoRA loaded by calling their POST /v1/loras API.

Internally, we use a headless service and its associated EndpointSlices to discover the worker endpoints and make sure the LoRA is loaded on each of them.
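The load step described above can be sketched roughly as follows. This is a minimal illustration, not the controller's code: the request payload (`lora_name`, `uri`) is an assumption, since the description only names the `POST /v1/loras` route, and `loadLoRA`, `loadOnAll`, `fakeWorker` and `demo` are hypothetical names. In the operator the addresses would come from EndpointSlices; here they come from local test servers.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// loadRequest is a hypothetical payload; only the route is given in the PR text.
type loadRequest struct {
	LoraName string `json:"lora_name"`
	URI      string `json:"uri"`
}

// loadLoRA posts one load request to a single worker base URL.
func loadLoRA(ctx context.Context, hc *http.Client, baseURL string, req loadRequest) error {
	body, err := json.Marshal(req)
	if err != nil {
		return err
	}
	httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/v1/loras", bytes.NewReader(body))
	if err != nil {
		return err
	}
	httpReq.Header.Set("Content-Type", "application/json")
	resp, err := hc.Do(httpReq)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("load %q on %s: unexpected status %d", req.LoraName, baseURL, resp.StatusCode)
	}
	return nil
}

// loadOnAll tries every endpoint and reports how many accepted the LoRA.
func loadOnAll(ctx context.Context, hc *http.Client, endpoints []string, req loadRequest) (int, []error) {
	ready := 0
	var errs []error
	for _, ep := range endpoints {
		if err := loadLoRA(ctx, hc, ep, req); err != nil {
			errs = append(errs, err)
			continue
		}
		ready++
	}
	return ready, errs
}

// fakeWorker stands in for a decode or prefill pod and answers with a fixed status.
func fakeWorker(status int) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/v1/loras" {
			http.NotFound(w, r)
			return
		}
		w.WriteHeader(status)
	}))
}

// demo starts one fake worker per status code and returns (loaded, failed) counts.
func demo(statuses ...int) (int, int) {
	var endpoints []string
	for _, s := range statuses {
		srv := fakeWorker(s)
		defer srv.Close()
		endpoints = append(endpoints, srv.URL)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	req := loadRequest{LoraName: "Qwen/Qwen3-0.6B-my-lora", URI: "s3://my-bucket/Qwen/Qwen3-0.6B-my-lora"}
	n, errs := loadOnAll(ctx, http.DefaultClient, endpoints, req)
	return n, len(errs)
}

func main() {
	ready, failed := demo(http.StatusOK, http.StatusOK)
	fmt.Printf("ready=%d failed=%d\n", ready, failed)
}
```

Counting successes and failures separately mirrors the ready/total counters that the DynamoModel status is described as exposing.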

Signed-off-by: Julien Mancuso <[email protected]>

coderabbitai · 2025-11-06T21:29:52Z

Walkthrough

Introduces a new DynamoModel Kubernetes custom resource definition with associated API types, controller, and infrastructure for managing model endpoint discovery and LoRA loading. Adds ModelRef fields to existing CRDs for model association. Includes endpoint discovery utilities, a bounded-concurrency LoRA client, headless service generation, and updates to existing controllers and RBAC rules.

Changes

Cohort / File(s)	Summary
New DynamoModel CRD Definitions `deploy/cloud/helm/crds/templates/nvidia.com_dynamomodels.yaml`, `deploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yaml`	Introduces CustomResourceDefinition for DynamoModel resource with spec fields (baseModelName, modelName, modelType enum, loraPath), status fields (conditions, endpoints, ready/total counters), and printer columns (BaseModel, Type, Ready, Total, Age).
ModelRef Field Additions to Existing CRDs `deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml`, `deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml`, `deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml`, `deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml`	Adds optional modelRef field (object with required name and optional revision) to component deployment specs for model association and headless service endpoint discovery.
API Type Definitions `deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go`, `deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go`, `deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go`	Introduces DynamoModel, DynamoModelSpec, DynamoModelStatus, EndpointInfo, ModelSource, ModelReference types with Kubebuilder markers; adds ModelReference to component spec; generates DeepCopy methods and updates imports for core/v1 aliasing.
DynamoModel Controller `deploy/cloud/operator/internal/controller/dynamo_model_controller.go`	Implements DynamoModelReconciler with reconciliation loop handling finalizers, EndpointSlice discovery, parallel LoRA loading, status/condition updates, and lifecycle management via EndpointClient integration.
Existing Controller Updates `deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go`, `deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go`, `deploy/cloud/operator/internal/dynamo/graph.go`	Integrates headless model service reconciliation and base model label generation into component and graph deployment reconciliation paths.
Endpoint Management Client `deploy/cloud/operator/internal/modelendpoint/client.go`, `deploy/cloud/operator/internal/modelendpoint/lora.go`, `deploy/cloud/operator/internal/modelendpoint/discovery.go`, `deploy/cloud/operator/internal/modelendpoint/types.go`	Introduces Client for bounded-concurrency LoRA loading/unloading with timeout controls; adds endpoint candidate extraction and model discovery query utilities; defines Candidate type.
Headless Service Generation `deploy/cloud/operator/internal/dynamo/headless_service.go`	Adds ReconcileModelServicesForComponents, GenerateHeadlessServiceForModel, and AddBaseModelLabel helpers for creating and syncing headless services indexed by base model name.
Worker Pool Utility `deploy/cloud/operator/internal/workerpool/pool.go`	Introduces generic Execute function with parameterized Task and Result types for bounded-concurrency task execution with timeout enforcement and error aggregation.
Configuration & Setup `deploy/cloud/operator/PROJECT`, `deploy/cloud/operator/cmd/main.go`, `deploy/cloud/operator/config/rbac/role.yaml`, `deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml`, `deploy/cloud/operator/config/crd/kustomization.yaml`, `deploy/cloud/operator/config/samples/kustomization.yaml`, `deploy/cloud/operator/internal/consts/consts.go`	Registers DynamoModel resource in PROJECT file; wires DynamoModelReconciler with EndpointClient in main; adds RBAC rules for dynamomodels and endpointslices; includes CRD and sample in kustomization manifests; adds KubeLabelDynamoBaseModel constant.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

dynamo_model_controller.go: Complex reconciliation logic with finalizers, EndpointSlice discovery, parallel LoRA operations, condition management, and event emission; requires careful review of lifecycle handling and error paths.
modelendpoint/client.go: Concurrent LoRA loading/unloading with bounded worker pools, timeout enforcement, and aggregated error handling; verify timeout semantics and edge cases.
modelendpoint/discovery.go: Endpoint extraction and model discovery via field indexing; ensure query logic and request mapping are correct.
zz_generated.deepcopy.go: Auto-generated DeepCopy methods; verify consistency with new API types and import aliasing changes.
Existing controller modifications: Verify integration points in DynamoComponentDeployment and DynamoGraphDeployment controllers align with new reconciliation paths and error handling.
workerpool/pool.go: Generic concurrency pattern; verify goroutine lifecycle, result ordering, and timeout propagation.
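For reviewers checking goroutine lifecycle, result ordering, and timeout propagation, a bounded-concurrency executor of the kind summarized above might look like the sketch below. It is an assumption-laden illustration, not the contents of workerpool/pool.go: the `Task` and `Result` shapes, the `Execute` signature, and the helpers `squareTask`, `squares` and `timesOut` are all hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Task is a hypothetical unit of work; the PR's workerpool.Task may differ.
type Task[T any] struct {
	Index int
	Work  func(ctx context.Context) (T, error)
}

// Result holds one task's value or error.
type Result[T any] struct {
	Index int
	Value T
	Err   error
}

// Execute runs tasks with at most maxWorkers in flight under one shared timeout.
// Results come back in task order, whatever the completion order was.
func Execute[T any](ctx context.Context, tasks []Task[T], maxWorkers int, timeout time.Duration) ([]Result[T], error) {
	if maxWorkers < 1 {
		maxWorkers = 1
	}
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	results := make([]Result[T], len(tasks))
	sem := make(chan struct{}, maxWorkers) // counting semaphore
	var wg sync.WaitGroup
	for i, task := range tasks {
		wg.Add(1)
		go func(i int, task Task[T]) {
			defer wg.Done()
			results[i].Index = task.Index
			select {
			case sem <- struct{}{}: // acquired a worker slot
				defer func() { <-sem }()
			case <-ctx.Done(): // timed out while waiting for a slot
				results[i].Err = ctx.Err()
				return
			}
			results[i].Value, results[i].Err = task.Work(ctx)
		}(i, task)
	}
	wg.Wait() // every goroutine has exited before results are read

	var errs []error
	for _, r := range results {
		if r.Err != nil {
			errs = append(errs, fmt.Errorf("task %d: %w", r.Index, r.Err))
		}
	}
	return results, errors.Join(errs...)
}

// squareTask sleeps longer for lower indexes, so later tasks finish first.
func squareTask(i, n int) Task[int] {
	return Task[int]{Index: i, Work: func(ctx context.Context) (int, error) {
		time.Sleep(time.Duration(n-i) * 2 * time.Millisecond)
		return i * i, nil
	}}
}

// squares runs n square tasks and returns the values in task order.
func squares(n, maxWorkers int) ([]int, error) {
	tasks := make([]Task[int], 0, n)
	for i := 0; i < n; i++ {
		tasks = append(tasks, squareTask(i, n))
	}
	results, err := Execute(context.Background(), tasks, maxWorkers, time.Second)
	values := make([]int, len(results))
	for i, r := range results {
		values[i] = r.Value
	}
	return values, err
}

// timesOut reports whether a task that waits on the context surfaces the deadline error.
func timesOut() bool {
	blocked := Task[int]{Work: func(ctx context.Context) (int, error) {
		<-ctx.Done()
		return 0, ctx.Err()
	}}
	_, err := Execute(context.Background(), []Task[int]{blocked}, 1, 20*time.Millisecond)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	values, err := squares(4, 2)
	fmt.Println(values, err, timesOut())
}
```

Writing each result into its own slice slot keeps ordering deterministic without a mutex, and waiting on the WaitGroup before returning means no goroutine outlives the call.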

Poem

🐰 A new model emerges, endpoints take flight,
LoRA loads swiftly through worker pool might,
Services headless guide discovery's way,
Dynamo models shall dance and play!
With conditions and status, the system's delight! 🎉

Pre-merge checks

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat: add dynamoModel CRD' clearly and concisely summarizes the main change, which is introducing a new DynamoModel CustomResourceDefinition across the codebase.
Docstring Coverage	✅ Passed	Docstring coverage is 81.82% which is sufficient. The required threshold is 80.00%.
Description check	✅ Passed	The PR description provides a clear overview of the feature (adding DynamoModel CRD) and includes practical examples showing how the new CR integrates with existing DynamoGraphDeployment configurations.


coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (5)

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1)
10005-10018: Tighten modelRef schema (non-empty name, disallow unknown keys).

Looks good overall. To prevent empty strings and catch typos, add minimal validations.

Apply within this block:
               modelRef:
                 description: |-
                   ModelRef references a model that this component serves
                   When specified, a headless service will be created for endpoint discovery
-                properties:
+                additionalProperties: false
+                properties:
                   name:
                     description: Name is the base model identifier (e.g., "llama-3-70b-instruct-v1")
                     type: string
+                    minLength: 1
                   revision:
                     description: Revision is the model revision/version (optional)
                     type: string
+                    minLength: 1
                 required:
                   - name
                 type: object
Optional (nice UX): add an additionalPrinterColumn to surface the model at kubectl get time:
@@
     - additionalPrinterColumns:
       - description: Dynamo component
         jsonPath: .spec.dynamoComponent
         name: DynamoComponent
         type: string
+      - description: Model
+        jsonPath: .spec.modelRef.name
+        name: Model
+        type: string
Please confirm:

types.go defines ModelRef with json:"modelRef,omitempty" and string fields, and controller tolerates missing/empty revision.

No model names require characters beyond basic DNS-1123 label charset; if they do, keep minLength but skip adding a strict pattern. Based on learnings.
deploy/cloud/operator/internal/modelendpoint/lora.go (1)
66-66: Standardize logging levels for success cases.

Success logging is inconsistent: loadLoRA uses Info level (line 66) while unloadLoRA uses V(1) level (line 96). For consistency and operational visibility, both should log at the same level.

Consider standardizing to Info level for both operations:
-	logs.V(1).Info("Successfully unloaded LoRA", "address", address, "modelName", modelName)
+	logs.Info("Successfully unloaded LoRA", "address", address, "modelName", modelName)
Also applies to: 96-96
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yaml (1)
85-87: Consider adding validation to enforce loraPath usage.

The loraPath field is described as "only applicable for lora model type" but there's no schema-level validation to enforce this constraint. Users could accidentally set loraPath on base or adapter models.

Add CEL validation to ensure loraPath is only set when modelType is lora:
                 loraPath:
                   description: LoraPath is the path to the LoRA adapter (only applicable for lora model type)
                   type: string
                 modelName:
                   description: ModelName is the full model identifier (e.g., "meta-llama/Llama-3.3-70B-Instruct-lora")
                   type: string
                 modelType:
                   default: base
                   description: ModelType specifies the type of model (e.g., "base", "lora", "adapter")
                   enum:
                     - base
                     - lora
                     - adapter
                   type: string
               required:
                 - baseModelName
                 - modelName
               type: object
+              x-kubernetes-validations:
+                - rule: "self.modelType != 'lora' || has(self.loraPath)"
+                  message: "loraPath is required when modelType is 'lora'"
+                - rule: "self.modelType == 'lora' || !has(self.loraPath)"
+                  message: "loraPath should only be set when modelType is 'lora'"
Also applies to: 91-98
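Since the CRD YAML is generated, the same two rules would more likely be declared on the Go type with controller-gen's XValidation marker. The sketch below assumes a trimmed, hypothetical copy of the spec with only the two relevant fields, and adds a plain Go function (`validateLoraPath`, a hypothetical name) that mirrors the rules so they can be exercised locally. It treats an empty string as "unset", which is close to, but not identical to, CEL's `has()`.

```go
package main

import (
	"errors"
	"fmt"
)

// DynamoModelSpec is a trimmed, hypothetical copy of the spec for illustration.
// +kubebuilder:validation:XValidation:rule="self.modelType != 'lora' || has(self.loraPath)",message="loraPath is required when modelType is 'lora'"
// +kubebuilder:validation:XValidation:rule="self.modelType == 'lora' || !has(self.loraPath)",message="loraPath should only be set when modelType is 'lora'"
type DynamoModelSpec struct {
	ModelType string `json:"modelType,omitempty"`
	LoraPath  string `json:"loraPath,omitempty"`
}

// validateLoraPath mirrors the two CEL rules, treating "" as an unset field.
func validateLoraPath(spec DynamoModelSpec) error {
	isLoRA := spec.ModelType == "lora"
	hasPath := spec.LoraPath != ""
	switch {
	case isLoRA && !hasPath:
		return errors.New("loraPath is required when modelType is 'lora'")
	case !isLoRA && hasPath:
		return errors.New("loraPath should only be set when modelType is 'lora'")
	}
	return nil
}

func main() {
	for _, spec := range []DynamoModelSpec{
		{ModelType: "lora", LoraPath: "s3://my-bucket/Qwen/Qwen3-0.6B-my-lora"},
		{ModelType: "lora"},
		{ModelType: "base", LoraPath: "s3://my-bucket/unexpected"},
	} {
		fmt.Printf("%+v -> %v\n", spec, validateLoraPath(spec))
	}
}
```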
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (2)

10005-10018: Validate and sanitize modelRef for Service/labels; document empty revision semantics

Good addition. Please confirm:

Reconcile sanitizes modelRef.name (and revision if used) into valid DNS-1123 Service names (lowercase, [a-z0-9-], <=63), with truncation+hash to avoid collisions when names exceed 63 or contain dots/uppercases.

If modelRef is used in labels/selector values, ensure label constraints (<=63; allowed charset) or apply normalization similarly.

Clarify behavior when revision is empty (e.g., treated as “latest”, or excluded from identity). Add this to the Go type docstring so controller-gen propagates it.

Optionally, enforce constraints at the API by adding kubebuilder validation on the Go types (e.g., Patterns and MaxLength for name/revision) instead of manual YAML edits.

Based on learnings.
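The sanitize, truncate, and hash approach requested above could be sketched like this. It is one possible scheme under stated assumptions, not what the PR implements: `serviceNameFor` and `shortHash` are hypothetical names, the 8-character SHA-256 suffix is an arbitrary choice, and the `m-` prefix for names starting with a digit assumes Service names must begin with a letter.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

const maxNameLen = 63 // upper bound for a DNS label

// shortHash returns the first 8 hex characters of the SHA-256 of s.
func shortHash(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])[:8]
}

// serviceNameFor lowercases the model name, replaces runs of other characters
// with a single dash, and appends a hash of the original name so that two
// models which sanitize to the same text still get distinct names.
func serviceNameFor(model string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(model) {
		switch {
		case (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9'):
			b.WriteRune(r)
		case b.Len() > 0 && !strings.HasSuffix(b.String(), "-"):
			b.WriteByte('-')
		}
	}
	base := strings.Trim(b.String(), "-")
	switch {
	case base == "":
		base = "model"
	case base[0] >= '0' && base[0] <= '9':
		base = "m-" + base // assumed requirement: start with a letter
	}
	suffix := "-" + shortHash(model)
	if limit := maxNameLen - len(suffix); len(base) > limit {
		base = strings.TrimRight(base[:limit], "-")
	}
	return base + suffix
}

func main() {
	fmt.Println(serviceNameFor("Qwen/Qwen3-0.6B"))
	fmt.Println(serviceNameFor("qwen/qwen3-0-6b"))
}
```

Hashing the original name rather than the sanitized one is what keeps `Qwen/Qwen3-0.6B` and `qwen/qwen3-0-6b` apart even though both sanitize to the same prefix.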

10005-10018: Improve kubectl UX with printer columns

Consider adding print columns on the Go type for:

Model (.spec.modelRef.name)

Revision (.spec.modelRef.revision)

Use +kubebuilder:printcolumn annotations so controller-gen emits them here (don’t hand-edit this YAML). This makes kubectl get dcd more informative.

Based on learnings.
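As a rough illustration of keeping these changes in the Go types, the markers could be placed as below so that controller-gen emits both the validation and the printer columns. The types are trimmed, hypothetical stand-ins for the real ones, the "empty revision means not pinned" wording is an assumed semantics that the PR does not define, and `render` is only a helper for the demo.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ModelReference mirrors the shape described in the review; markers are illustrative.
type ModelReference struct {
	// Name is the base model identifier, e.g. "Qwen/Qwen3-0.6B".
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:MinLength=1
	Name string `json:"name"`
	// Revision is the model revision. Empty means no revision is pinned
	// (assumed semantics for this sketch).
	// +optional
	Revision string `json:"revision,omitempty"`
}

// ComponentSpec is a hypothetical, trimmed spec carrying only modelRef.
type ComponentSpec struct {
	ModelRef *ModelReference `json:"modelRef,omitempty"`
}

// Component is a hypothetical stand-in for the root type.
// +kubebuilder:object:root=true
// +kubebuilder:printcolumn:name="Model",type="string",JSONPath=".spec.modelRef.name"
// +kubebuilder:printcolumn:name="Revision",type="string",JSONPath=".spec.modelRef.revision"
type Component struct {
	Spec ComponentSpec `json:"spec"`
}

// render marshals a component so the omitempty behaviour is visible.
func render(c Component) string {
	out, err := json.Marshal(c)
	if err != nil {
		return ""
	}
	return string(out)
}

func main() {
	c := Component{Spec: ComponentSpec{ModelRef: &ModelReference{Name: "Qwen/Qwen3-0.6B"}}}
	fmt.Println(render(c))
}
```

No strict `Pattern` marker is used on `Name` here because model names such as `Qwen/Qwen3-0.6B` contain slashes, dots, and uppercase letters.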

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3c0763f and 885d792.

📒 Files selected for processing (26)

deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (1 hunks)
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1 hunks)
deploy/cloud/helm/crds/templates/nvidia.com_dynamomodels.yaml (1 hunks)
deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (4 hunks)
deploy/cloud/operator/PROJECT (1 hunks)
deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go (1 hunks)
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (2 hunks)
deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (11 hunks)
deploy/cloud/operator/cmd/main.go (2 hunks)
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1 hunks)
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1 hunks)
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yaml (1 hunks)
deploy/cloud/operator/config/crd/kustomization.yaml (1 hunks)
deploy/cloud/operator/config/rbac/role.yaml (4 hunks)
deploy/cloud/operator/config/samples/kustomization.yaml (1 hunks)
deploy/cloud/operator/internal/consts/consts.go (1 hunks)
deploy/cloud/operator/internal/controller/dynamo_model_controller.go (1 hunks)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (2 hunks)
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1 hunks)
deploy/cloud/operator/internal/dynamo/graph.go (1 hunks)
deploy/cloud/operator/internal/dynamo/headless_service.go (1 hunks)
deploy/cloud/operator/internal/modelendpoint/client.go (1 hunks)
deploy/cloud/operator/internal/modelendpoint/discovery.go (1 hunks)
deploy/cloud/operator/internal/modelendpoint/lora.go (1 hunks)
deploy/cloud/operator/internal/modelendpoint/types.go (1 hunks)
deploy/cloud/operator/internal/workerpool/pool.go (1 hunks)

🧰 Additional context used

🧠 Learnings (6)

📓 Common learnings

Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.

📚 Learning: 2025-07-18T16:05:05.534Z

Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

Applied to files:

deploy/cloud/operator/PROJECT
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
deploy/cloud/operator/internal/consts/consts.go
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml

📚 Learning: 2025-07-18T16:04:31.771Z

Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.

Applied to files:

deploy/cloud/operator/PROJECT
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
deploy/cloud/helm/crds/templates/nvidia.com_dynamomodels.yaml
deploy/cloud/operator/config/crd/kustomization.yaml
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml

📚 Learning: 2025-09-04T19:03:06.643Z

Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.

Applied to files:

deploy/cloud/operator/PROJECT
deploy/cloud/helm/crds/templates/nvidia.com_dynamomodels.yaml
deploy/cloud/operator/internal/consts/consts.go
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yaml
deploy/cloud/operator/config/samples/kustomization.yaml
deploy/cloud/operator/config/crd/kustomization.yaml
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml

📚 Learning: 2025-07-18T16:04:47.465Z

Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.

Applied to files:

deploy/cloud/operator/PROJECT
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml

📚 Learning: 2025-10-24T04:21:08.751Z

Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 3858
File: recipes/deepseek-r1/model-cache/model-download.yaml:18-32
Timestamp: 2025-10-24T04:21:08.751Z
Learning: In the recipes directory structure, model-specific recipes (e.g., recipes/deepseek-r1/, recipes/llama-3-70b/) contain hardcoded model names and revisions in their Kubernetes manifests (like model-download.yaml). Each recipe directory is deployment-specific and self-contained, so hardcoding model-specific values is the intended design pattern.

Applied to files:

deploy/cloud/operator/config/crd/kustomization.yaml

🧬 Code graph analysis (12)

deploy/cloud/operator/internal/modelendpoint/discovery.go (2)

deploy/cloud/operator/internal/modelendpoint/types.go (1)

Candidate (21-24)

deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go (1)

DynamoModelList (110-114)

deploy/cloud/operator/cmd/main.go (2)

deploy/cloud/operator/internal/controller/dynamo_model_controller.go (1)

DynamoModelReconciler (63-67)

deploy/cloud/operator/internal/modelendpoint/client.go (2)

Client (43-45)

NewClient (48-54)

deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)

deploy/cloud/operator/internal/dynamo/headless_service.go (1)

ReconcileModelServicesForComponents (37-97)

deploy/cloud/operator/internal/workerpool/pool.go (1)

deploy/cloud/operator/api/dynamo/schemas/schemas.go (1)

Duration (38-38)

deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go (1)

deploy/cloud/operator/api/v1alpha1/groupversion_info.go (1)

SchemeBuilder (35-35)

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1)

deploy/cloud/operator/internal/dynamo/headless_service.go (2)

ReconcileModelServicesForComponents (37-97)

AddBaseModelLabel (143-147)

deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (2)

deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)

ModelReference (279-287)

deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go (6)

DynamoModel (99-105)

DynamoModelList (110-114)

DynamoModelSpec (25-44)

ModelSource (47-54)

DynamoModelStatus (72-86)

EndpointInfo (57-69)

deploy/cloud/operator/internal/controller/dynamo_model_controller.go (5)

deploy/cloud/operator/internal/modelendpoint/client.go (2)

Client (43-45)

NewClient (48-54)

deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go (2)

DynamoModel (99-105)

EndpointInfo (57-69)

deploy/cloud/operator/internal/modelendpoint/discovery.go (2)

FindModelsForBaseModel (77-112)

ExtractCandidates (35-73)

deploy/cloud/operator/internal/modelendpoint/types.go (1)

Candidate (21-24)

deploy/cloud/operator/internal/consts/consts.go (1)

DynamoSystemPort (22-22)

deploy/cloud/operator/internal/dynamo/headless_service.go (3)

deploy/cloud/operator/internal/controller_common/resource.go (2)

Reconciler (49-52)

SyncResource (60-195)

deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (2)

DynamoComponentDeploymentSharedSpec (48-111)

ModelReference (279-287)

deploy/cloud/operator/internal/consts/consts.go (3)

KubeLabelDynamoBaseModel (41-41)

DynamoSystemPortName (23-23)

DynamoSystemPort (22-22)

deploy/cloud/operator/internal/dynamo/graph.go (1)

deploy/cloud/operator/internal/dynamo/headless_service.go (1)

AddBaseModelLabel (143-147)

deploy/cloud/operator/internal/modelendpoint/client.go (3)

deploy/cloud/operator/internal/modelendpoint/types.go (1)

Candidate (21-24)

deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go (2)

DynamoModel (99-105)

EndpointInfo (57-69)

deploy/cloud/operator/internal/workerpool/pool.go (2)

Task (28-31)

Execute (43-102)

deploy/cloud/operator/internal/modelendpoint/lora.go (1)

deploy/cloud/operator/internal/modelendpoint/client.go (1)

Client (43-45)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: sglang (amd64)
GitHub Check: trtllm (arm64)
GitHub Check: trtllm (amd64)
GitHub Check: vllm (amd64)
GitHub Check: operator (amd64)
GitHub Check: Build and Test - dynamo

🔇 Additional comments (17)

deploy/cloud/operator/PROJECT (1)

27-34: LGTM!

The DynamoModel resource configuration follows the same structure and conventions as the existing DynamoComponentDeployment and DynamoGraphDeployment resources.

deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)

344-354: LGTM!

The model service reconciliation is appropriately placed after Grove scaling, ensuring that workload resources are created before setting up endpoint discovery services. Error handling follows the established pattern in this controller.

deploy/cloud/operator/internal/consts/consts.go (1)

41-41: LGTM!

The new constant follows the established naming convention and is appropriately positioned with other Kubernetes label constants.

deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (2)

86-89: LGTM!

The optional ModelRef field is well-documented and designed for backward compatibility. The documentation clearly explains its purpose for endpoint discovery via headless services.

278-287: LGTM!

The ModelReference type is well-designed with appropriate validation markers. The required Name field and optional Revision field provide flexibility while ensuring essential information is present.

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (2)

330-343: LGTM!

The model service reconciliation is correctly implemented for component-level reconciliation. The componentMap contains only the current component, which is appropriate for this controller's scope.

943-955: Improved label handling.

The function now properly initializes and populates labels instead of returning an empty map. This ensures that:

Existing component labels are preserved

Base model labels are added when a ModelRef is specified

This is a positive change that enables proper label propagation throughout the resource hierarchy.

deploy/cloud/operator/cmd/main.go (2)

63-63: LGTM!

The modelendpoint import is correctly added and used for creating the EndpointClient.

564-571: LGTM!

The DynamoModelReconciler setup follows the established pattern for controller initialization. The EndpointClient is appropriately created once and injected into the reconciler.

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1)

10139-10152: Go types are properly defined and aligned with the CRD schema.

The verification confirms that the ModelReference struct is correctly defined in deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (lines 278–287) with proper kubebuilder annotations (+kubebuilder:validation:Required for name, +optional for revision). The modelRef field in DynamoComponentDeploymentSharedSpec is correctly typed as *ModelReference with the +optional tag and proper JSON marshaling hints. The autogenerated CRD schema accurately reflects these Go types, and the structure aligns with how modelRef is used in the controller code (e.g., AddBaseModelLabel function). No issues found.

deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1)

65-72: LGTM: RBAC permissions properly scoped for DynamoModel CRD.

The added permissions for EndpointSlices discovery and DynamoModel lifecycle management (including finalizers and status updates) are appropriate and follow standard Kubernetes controller patterns.

Also applies to: 372-372, 387-387, 396-396

deploy/cloud/operator/config/rbac/role.yaml (1)

89-96: LGTM: RBAC permissions consistent with Helm template.

The RBAC additions mirror those in the Helm template and are properly scoped for the DynamoModel controller's operational needs.

Also applies to: 173-173, 188-188, 197-197
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yaml (2)
166-182: Verify whether podName should be required in EndpointInfo.

The podName field is not in the required list (lines 179-181), suggesting it may be optional. However, in a Kubernetes environment with endpoint discovery via EndpointSlices, the pod name should typically always be known and valuable for debugging and observability.

Please confirm whether podName can legitimately be absent in any scenario. If not, consider adding it to the required fields:
                     required:
                       - address
+                      - podName
                       - ready
174-178: Clarify the design intent for base model endpoint tracking.

The comment states "For base models: always false (no probing performed)," which suggests base model endpoints are tracked but never marked ready. This raises questions about the utility of endpoint tracking for base models and whether the status structure optimally serves both base and LoRA model use cases.

Please clarify the design rationale:

Why track endpoints for base models if ready is always false?

Is there a future plan to probe base model readiness?

Would separate status structures for base vs LoRA models improve clarity?
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (1)

10005-10018: RBAC check for headless Service creation

Since modelRef triggers headless service creation for endpoint discovery, verify the PR includes RBAC for Services and EndpointSlices (get/list/watch/create/update/patch) in the operator’s ClusterRole.

Based on learnings.

deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1)

10139-10152: Verify that base CRD file has been updated to generate this template change.

Per prior learnings on this codebase, CRD schemas in deploy/cloud/helm/crds/templates/*.yaml are auto-generated from base CRD files in deploy/cloud/operator/config/crd/bases/ and should not be manually edited, as manual changes would be overwritten during regeneration.

The modelRef field definition itself appears structurally sound as OpenAPI v3 schema. However, ensure that the corresponding base CRD file (deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml) has been updated with this field, and that this template change was auto-generated from it rather than manually added.

deploy/cloud/operator/internal/dynamo/headless_service.go (1)

91-105: Review comment is incorrect for Go 1.24.0

The repository declares go 1.24.0 in deploy/cloud/operator/go.mod, which is well after Go 1.22. Starting with Go 1.22, loop variables are scoped per iteration rather than reused across iterations, so closures over loop variables are safe. The problematic code pattern the review describes is not an issue in this codebase.

Additionally, the review references the wrong file. The actual loops capturing candidate are in deploy/cloud/operator/internal/modelendpoint/client.go (LoadLoRA at lines 91–102, UnloadLoRA at lines 152–163), not in headless_service.go.

The code requires no changes.

Likely an incorrect or invalid review comment.

deploy/cloud/helm/crds/templates/nvidia.com_dynamomodels.yaml

deploy/cloud/operator/internal/controller/dynamo_model_controller.go

deploy/cloud/operator/internal/dynamo/headless_service.go

deploy/cloud/operator/internal/modelendpoint/discovery.go

deploy/cloud/operator/internal/modelendpoint/lora.go

deploy/cloud/operator/internal/workerpool/pool.go

Signed-off-by: Julien Mancuso <[email protected]>

atchernych · 2025-11-06T23:01:10Z

deploy/cloud/operator/internal/controller/dynamo_model_controller.go

+	logs.Info("Finalizing DynamoModel", "modelType", model.Spec.ModelType)
+
+	// Only perform cleanup for LoRA models
+	if model.Spec.ModelType == "lora" {


Convert the left-hand side to lowercase before comparing.

fixed in 2b9d6c2
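
The contents of 2b9d6c2 are not shown in this thread, so the following is only a sketch of the kind of case-insensitive check the comment asks for. `isLoRA` and `modelTypeLoRA` are hypothetical names, not identifiers from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// modelTypeLoRA is the canonical lowercase spelling used for comparison.
// The name is an assumption made for this example.
const modelTypeLoRA = "lora"

// isLoRA reports whether modelType names a LoRA model, ignoring case.
// It lowercases the left-hand side, as the review comment suggests.
func isLoRA(modelType string) bool {
	return strings.ToLower(modelType) == modelTypeLoRA
}

func main() {
	for _, t := range []string{"lora", "LoRA", "LORA", "base"} {
		fmt.Printf("%q -> %v\n", t, isLoRA(t))
	}
}
```

`strings.EqualFold(modelType, modelTypeLoRA)` would be an equivalent check that avoids allocating a lowercased copy.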

atchernych · 2025-11-06T23:04:09Z

deploy/cloud/operator/internal/controller/dynamo_model_controller.go

+			logs.Info("Unloading LoRA from endpoints", "endpointCount", len(candidates))
+
+			// Initialize endpoint client if needed
+			if r.EndpointClient == nil {


This check happens in both Reconcile() and FinalizeResource(). Could a race condition occur?

fixed in 2b9d6c2
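
One common way to make this kind of lazy nil-check safe under concurrent calls is `sync.Once`. The sketch below is an assumption about how the concern could be addressed, not a description of what 2b9d6c2 does; `reconciler`, `endpointClient`, and `endpointClientOrInit` are illustrative names.

```go
package main

import (
	"fmt"
	"sync"
)

// endpointClient stands in for the operator's real endpoint client type.
type endpointClient struct{ id int }

// reconciler is a hypothetical, trimmed-down reconciler.
type reconciler struct {
	once   sync.Once
	client *endpointClient
	inits  int // counts constructor runs, for demonstration only
}

// endpointClientOrInit returns the shared client, constructing it at most
// once even when Reconcile and the finalizer path call it concurrently.
func (r *reconciler) endpointClientOrInit() *endpointClient {
	r.once.Do(func() {
		r.inits++
		r.client = &endpointClient{id: 1}
	})
	return r.client
}

func main() {
	r := &reconciler{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = r.endpointClientOrInit()
		}()
	}
	wg.Wait()
	fmt.Println("constructor calls:", r.inits) // prints "constructor calls: 1"
}
```

An alternative that removes the check entirely is to construct the client once when the reconciler is set up, before the manager starts any workers.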

atchernych · 2025-11-06T23:06:13Z

deploy/cloud/operator/internal/controller/dynamo_model_controller.go

+	candidates, serviceNames, err := r.getEndpointCandidates(ctx, model)
+	if err != nil {
+		// Error already logged and status updated in helper
+		return ctrl.Result{RequeueAfter: 30 * time.Second}, err


It looks like 30 is used more than once. Make it a constant?

fixed in 2b9d6c2
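
A minimal sketch of the suggested constant follows. `requeueAfterEndpointError`, `result`, and `retryLater` are hypothetical names; `result` only mirrors the `RequeueAfter` field of controller-runtime's `ctrl.Result` so that the example compiles without external dependencies.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// requeueAfterEndpointError is an assumed name for the shared retry delay
// that replaces the repeated literal `30 * time.Second`.
const requeueAfterEndpointError = 30 * time.Second

// result mirrors the one field of ctrl.Result used in this example.
type result struct{ RequeueAfter time.Duration }

// retryLater builds the return values for a reconcile attempt that should
// be retried after the shared delay.
func retryLater(err error) (result, error) {
	return result{RequeueAfter: requeueAfterEndpointError}, err
}

func main() {
	res, err := retryLater(errors.New("endpoint discovery failed"))
	fmt.Println(res.RequeueAfter, err)
}
```

As far as I know, recent controller-runtime versions ignore the returned Result when the error is non-nil and requeue with rate-limited backoff instead, so the fixed delay would only apply on the nil-error path. That is worth checking against the controller-runtime version pinned in go.mod.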

atchernych · 2025-11-06T23:11:13Z

deploy/cloud/operator/internal/dynamo/headless_service.go

+	ctx context.Context,
+	reconciler commonController.Reconciler,
+	owner client.Object,
+	services map[string]*v1alpha1.DynamoComponentDeploymentSharedSpec,


nit: these are components, not services. Rename?

fixed in 2b9d6c2

atchernych · 2025-11-06T23:12:59Z

deploy/cloud/operator/internal/dynamo/headless_service.go

+// Uses a hash of the model name to avoid label length/character restrictions
+func AddBaseModelLabel(labels map[string]string, modelRef *v1alpha1.ModelReference) {
+	if modelRef != nil && modelRef.Name != "" {
+		labels[commonconsts.KubeLabelDynamoBaseModelHash] = HashModelName(modelRef.Name)


Could labels be nil?

fixed in 2b9d6c2
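
For context on the question: assigning into a nil map panics in Go, and Kubernetes label values are limited to 63 characters drawn from alphanumerics, `-`, `_`, and `.`, so a model name such as `Qwen/Qwen3-0.6B` cannot be used directly. The sketch below shows one nil-safe shape for the helper. The label key, the hash scheme (first 8 bytes of SHA-256, hex-encoded), and the lowercase function names are all assumptions; the real `HashModelName` and `commonconsts.KubeLabelDynamoBaseModelHash` are not shown in this thread.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// labelBaseModelHash is a placeholder key used only in this example.
const labelBaseModelHash = "example.com/base-model-hash"

// hashModelName is an assumed implementation: the first 8 bytes of the
// SHA-256 digest, hex-encoded, giving a 16-character label-safe value.
func hashModelName(name string) string {
	sum := sha256.Sum256([]byte(name))
	return hex.EncodeToString(sum[:8])
}

// addBaseModelLabel returns the label map with the hash added. It allocates
// the map when the caller passes nil instead of panicking on assignment.
func addBaseModelLabel(labels map[string]string, modelName string) map[string]string {
	if modelName == "" {
		return labels
	}
	if labels == nil {
		labels = make(map[string]string, 1)
	}
	labels[labelBaseModelHash] = hashModelName(modelName)
	return labels
}

func main() {
	var labels map[string]string // nil: a direct assignment would panic
	labels = addBaseModelLabel(labels, "Qwen/Qwen3-0.6B")
	fmt.Println(len(labels[labelBaseModelHash])) // prints 16
}
```

Returning the map, rather than mutating in place, is what lets the helper handle a nil argument; a function that only takes the map by value cannot hand a newly allocated map back to the caller.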

atchernych · 2025-11-06T23:20:22Z

deploy/cloud/operator/internal/dynamo/headless_service.go

+	"sigs.k8s.io/controller-runtime/pkg/log"
+)
+
+// ReconcileModelServicesForComponents creates headless services for components with modelRef


I think the name "headless_service" exposes implementation detail.

fixed in 2b9d6c2

Signed-off-by: Julien Mancuso <[email protected]>

julienmancuso added 3 commits November 6, 2025 10:13

fix: add dynamoModel CRD

206ec64

Signed-off-by: Julien Mancuso <[email protected]>

fix: add dynamoModel CRD

c103ff4

Signed-off-by: Julien Mancuso <[email protected]>

fix: add dynamoModel CRD

885d792

Signed-off-by: Julien Mancuso <[email protected]>

julienmancuso requested a review from a team as a code owner November 6, 2025 21:19

pull-request-size bot added the size/XXL label Nov 6, 2025

github-actions bot added the feat label Nov 6, 2025

coderabbitai bot reviewed Nov 6, 2025

julienmancuso added 2 commits November 6, 2025 15:00

fix: add dynamoModel CRD

a9f6f00

Signed-off-by: Julien Mancuso <[email protected]>

fix: add dynamoModel CRD

1fc50d6

Signed-off-by: Julien Mancuso <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB November 6, 2025 22:10 Inactive

copy-pr-bot bot temporarily deployed to GITLAB November 6, 2025 22:13 Inactive

fix: add dynamoModel CRD

845867b

Signed-off-by: Julien Mancuso <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB November 6, 2025 22:15 Inactive

fix: add dynamoModel CRD

49991d6

Signed-off-by: Julien Mancuso <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB November 6, 2025 22:18 Inactive

fix: add dynamoModel CRD

cacf4d4

Signed-off-by: Julien Mancuso <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB November 6, 2025 22:34 Inactive

atchernych reviewed Nov 6, 2025

fix: taking Anna's comments into account

2b9d6c2

Signed-off-by: Julien Mancuso <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB November 6, 2025 23:43 Inactive

julienmancuso requested a review from atchernych November 7, 2025 00:14
