Setting option for node draining by external controllers #952
Conversation
|
Thanks for your PR,
To skip the vendors CIs, Maintainers can use one of:
|
30699af to
cc4a257
Compare
Pull Request Test Coverage Report for Build 20809145270Details
💛 - Coveralls |
SchSeba
left a comment
There was a problem hiding this comment.
In general I am fine with having this here. but can we please have a more generic name for the variable?
cc4a257 to
62632d8
Compare
62632d8 to
237e08b
Compare
16474e5 to
cd01dcf
Compare
a7fbd43 to
c3d7edf
Compare
daaf0f6 to
7fed536
Compare
|
/test-all |
7fed536 to
ade293d
Compare
…rnal-drainer=true' in case exteranl drainer is enabled The motivation is for external drainer verification, that SRIOV operator is set with external drainer Signed-off-by: Ido Heyvi <iheyvi@nvidia.com>
2b29581 to
cfd4970
Compare
|
Thanks for your PR,
To skip the vendors CIs, Maintainers can use one of:
|
|
@SchSeba , @adrianchiris PTAL upon last commit changes, handling Sebastian's proposal |
pkg/daemon/daemon.go
Outdated
|
|
||
| // manages addition/removal of external drainer annotation upon node state objects | ||
| func (dn *NodeReconciler) addRemoveExternalDrainerAnnotation(ctx context.Context, desiredNodeState *sriovnetworkv1.SriovNetworkNodeState) error { | ||
| funcLog := log.Log.WithName("addRemoveExternalDrainerAnnotation") |
There was a problem hiding this comment.
lets use the log from context:
reqLogger := log.FromContext(ctx).WithName("addRemoveExternalDrainerAnnotation")
| func (dn *NodeReconciler) addRemoveExternalDrainerAnnotation(ctx context.Context, desiredNodeState *sriovnetworkv1.SriovNetworkNodeState) error { | ||
| funcLog := log.Log.WithName("addRemoveExternalDrainerAnnotation") | ||
|
|
||
| // external drainer annotation will be added/removed only when both desired/current node state are in 'Idle' state |
There was a problem hiding this comment.
i believe we need to take into consideration also when the annotation does not exist.
e.g first time node state is created, the annotation is only set later in the reconcile process (and moved to idle after successfull first configuration)
so i would proceed with setting this annotation when either:
- both annotations are unset
- both annotations are idle
pkg/daemon/daemon.go
Outdated
| funcLog.Info("remove external drainer nodestate annotation", "annotation", consts.NodeStateExternalDrainerAnnotation) | ||
| original := desiredNodeState.DeepCopy() | ||
| delete(annotations, consts.NodeStateExternalDrainerAnnotation) | ||
| desiredNodeState.SetAnnotations(annotations) |
There was a problem hiding this comment.
is this needed ? i believe you get a reference for the map at L683
which you then update.
no copy takes place for this map
pkg/daemon/daemon.go
Outdated
| return nil | ||
| } | ||
|
|
||
| // check if external drainer environment variable is enabled, before adding external drainer annotation |
There was a problem hiding this comment.
i dont understand the comment. maybe just remove it ? it doesnt add much here IMO
ae5083f to
82aff1d
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
README.md (1)
467-468: PR reference in documentation should be removed.The documentation note includes a parenthetical reference "(PR #952)" which is unusual for user-facing documentation. This should be removed, as PR references are better tracked in version control/changelog rather than in the README. The rest of the text—clarifying that draining is "delegated" rather than "disabled"—effectively addresses the previous feedback about ambiguous wording.
🔎 Proposed fix
-> **NOTE**: Node draining can be delegated to an external drain-controller by setting `USE_EXTERNAL_DRAINER=true` (e.g. using [NVIDIA maintenance-operator](https://github.com/Mellanox/maintenance-operator)) (PR #952). This means that internal drain-controller continues to work on nodeState objects which were not annotated with `sriovnetwork.openshift.io/use-external-drainer`. In addition, `SriovNetworkPoolConfig` will not take any effect during drain procedure, since the maintenance operator will be in charge of [parallel node operations](https://github.com/Mellanox/maintenance-operator/blob/main/api/v1alpha1/maintenanceoperatorconfig_types.go#L38-L46). +> **NOTE**: Node draining can be delegated to an external drain-controller by setting `USE_EXTERNAL_DRAINER=true` (e.g., [NVIDIA maintenance-operator](https://github.com/Mellanox/maintenance-operator)). The internal drain-controller continues to work on nodeState objects not annotated with `sriovnetwork.openshift.io/use-external-drainer`. When external draining is active, `SriovNetworkPoolConfig.maxUnavailable` is not enforced; the external controller owns parallelism configuration (see [MaintenanceOperatorConfig](https://github.com/Mellanox/maintenance-operator/blob/main/api/v1alpha1/maintenanceoperatorconfig_types.go#L38-L46)).This revision:
- Removes the PR reference
- Clarifies the annotation label that triggers external draining
- Breaks the long sentence into shorter, more scannable statements
- Uses more concise language ("not enforced; owns parallelism") per the LanguageTool hint
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
README.mddoc/design/parallel-node-config.mdpkg/consts/constants.gopkg/daemon/daemon.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/consts/constants.go
- pkg/daemon/daemon.go
🧰 Additional context used
🪛 LanguageTool
README.md
[style] ~467-~467: Consider using more formal and concise wording here.
Context: ...re, since the maintenance operator will be in charge of [parallel node operations](https://gith...
(RESPONSIBLE_FOR)
doc/design/parallel-node-config.md
[style] ~63-~63: Consider using more formal and concise wording here.
Context: ...re, since the maintenance operator will be in charge of parallel node operations. Draining pro...
(RESPONSIBLE_FOR)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Analyze (go)
- GitHub Check: build
- GitHub Check: test-coverage
- GitHub Check: Golangci-lint
- GitHub Check: test
82aff1d to
0c29812
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI Agents
In @pkg/daemon/daemon.go:
- Around line 675-685: The early-return condition is inverted: currently it
returns when either `neitherExists` or `bothIdle` is true, preventing annotation
changes; change the guard so the function returns early only when NEITHER
condition holds (i.e., proceed only if `neitherExists` OR `bothIdle` is true).
Locate the block that computes `neitherExists` and `bothIdle` (using
`utils.ObjectHasAnnotationKey` and `utils.ObjectHasAnnotation` with
`consts.NodeStateDrainAnnotationCurrent`, `consts.NodeStateDrainAnnotation`, and
`consts.DrainIdle`) and replace the `if` check to return when not (neitherExists
OR bothIdle) so the code proceeds for the two valid scenarios.
🧹 Nitpick comments (2)
README.md (1)
467-467: Remove PR reference and formalize the documentation language.The NOTE contains a PR reference "(PR #952)" which is inappropriate in user-facing documentation. Additionally, the phrase "in charge of" is informal; use "manages" or "controls" instead. The sentence is also complex with multiple clauses that could be simplified for clarity.
🔎 Proposed documentation improvement
-> **NOTE**: Node draining can be delegated to an external drain-controller by setting `USE_EXTERNAL_DRAINER=true` (e.g. using [NVIDIA maintenance-operator](https://github.com/Mellanox/maintenance-operator)) (PR #952). This means that internal drain-controller continues to work on nodeState objects which were not annotated with `sriovnetwork.openshift.io/use-external-drainer`. In addition, `SriovNetworkPoolConfig` will not take any effect during drain procedure, since the maintenance operator will be in charge of [parallel node operations](https://github.com/Mellanox/maintenance-operator/blob/main/api/v1alpha1/maintenanceoperatorconfig_types.go#L38-L46). +> **NOTE**: Node draining can be delegated to an external drain-controller by setting `USE_EXTERNAL_DRAINER=true` (for example, using the [NVIDIA maintenance-operator](https://github.com/Mellanox/maintenance-operator)). When enabled, the SR-IOV operator's internal drain-controller skips draining nodes annotated with `sriovnetwork.openshift.io/use-external-drainer`, while the external controller manages parallelism via [MaintenanceOperatorConfig](https://github.com/Mellanox/maintenance-operator/blob/main/api/v1alpha1/maintenanceoperatorconfig_types.go#L38-L46). Note that `SriovNetworkPoolConfig.maxUnavailable` is not enforced during externally-delegated draining.doc/design/parallel-node-config.md (1)
63-64: Consider more concise wording.The phrase "will be in charge of" could be replaced with "will manage" or "will control" for more formal technical documentation.
🔎 Suggested rewording
-*NOTE:* Node draining can be delegated to an external drain-controller by setting `USE_EXTERNAL_DRAINER=true` (e.g. using [NVIDIA maintenance-operator](https://github.com/Mellanox/maintenance-operator)) (PR #952). In addition, `SriovNetworkPoolConfig` will not take any effect during drain procedure, since the maintenance operator will be in charge of parallel node operations. +*NOTE:* Node draining can be delegated to an external drain-controller by setting `USE_EXTERNAL_DRAINER=true` (e.g. using [NVIDIA maintenance-operator](https://github.com/Mellanox/maintenance-operator)) (PR #952). In addition, `SriovNetworkPoolConfig` will not take any effect during drain procedure, since the maintenance operator will manage parallel node operations.Based on static analysis hints.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
README.mddoc/design/parallel-node-config.mdpkg/consts/constants.gopkg/daemon/daemon.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/consts/constants.go
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/daemon/daemon.go (3)
pkg/utils/cluster.go (3)
ObjectHasAnnotationKey(16-19)ObjectHasAnnotation(22-27)AnnotateObject(30-60)pkg/consts/constants.go (4)
NodeStateDrainAnnotationCurrent(82-82)NodeStateDrainAnnotation(81-81)DrainIdle(85-85)NodeStateExternalDrainerAnnotation(83-83)pkg/vars/vars.go (1)
UseExternalDrainer(91-91)
🪛 LanguageTool
README.md
[style] ~467-~467: Consider using more formal and concise wording here.
Context: ...re, since the maintenance operator will be in charge of [parallel node operations](https://gith...
(RESPONSIBLE_FOR)
doc/design/parallel-node-config.md
[style] ~63-~63: Consider using more formal and concise wording here.
Context: ...re, since the maintenance operator will be in charge of parallel node operations. Draining pro...
(RESPONSIBLE_FOR)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: test
- GitHub Check: Golangci-lint
- GitHub Check: test-coverage
- GitHub Check: build
- GitHub Check: Analyze (go)
🔇 Additional comments (3)
doc/design/parallel-node-config.md (1)
9-9: LGTM!The metadata update reflects the current documentation revision date appropriately.
pkg/daemon/daemon.go (2)
152-157: LGTM!The function is invoked at the appropriate point in the reconcile loop—early enough to set up external drainer annotation state before drain logic executes.
687-711: Addition and removal logic is correct.The annotation removal logic (lines 687-701) properly checks for annotation existence and uses
client.MergeFromfor efficient patching. The addition logic (lines 703-709) correctly delegates toutils.AnnotateObject, which handles idempotency.However, this code is currently unreachable due to the critical bug at line 683.
0c29812 to
84a9721
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
doc/design/parallel-node-config.md (1)
63-64: Optional: Minor wording refinement for formality.The NOTE accurately describes the external drainer feature and has been reviewed by maintainers. For slightly more formal documentation style, consider rephrasing "will be in charge of" to "will manage" or "will handle".
Optional rewording
-*NOTE:* Node draining can be delegated to an external drain-controller by setting `USE_EXTERNAL_DRAINER=true` (e.g. using [NVIDIA maintenance-operator](https://github.com/Mellanox/maintenance-operator)) (PR #952). In addition, `SriovNetworkPoolConfig` will not take any effect during drain procedure, since the maintenance operator will be in charge of parallel node operations. +*NOTE:* Node draining can be delegated to an external drain-controller by setting `USE_EXTERNAL_DRAINER=true` (e.g. using [NVIDIA maintenance-operator](https://github.com/Mellanox/maintenance-operator)) (PR #952). In addition, `SriovNetworkPoolConfig` will not take any effect during drain procedure, since the maintenance operator will manage parallel node operations.Based on static analysis hints.
README.md (1)
467-467: LGTM with optional wording refinement.The NOTE accurately describes the external drainer feature with helpful clarification that the internal drain-controller continues handling non-annotated nodeState objects. For consistency with formal documentation style, consider the same minor refinement as in the design doc.
Optional rewording for consistency
-> **NOTE**: Node draining can be delegated to an external drain-controller by setting `USE_EXTERNAL_DRAINER=true` (e.g. using [NVIDIA maintenance-operator](https://github.com/Mellanox/maintenance-operator)) (PR #952). This means that internal drain-controller continues to work on nodeState objects which were not annotated with `sriovnetwork.openshift.io/use-external-drainer`. In addition, `SriovNetworkPoolConfig` will not take any effect during drain procedure, since the maintenance operator will be in charge of [parallel node operations](https://github.com/Mellanox/maintenance-operator/blob/main/api/v1alpha1/maintenanceoperatorconfig_types.go#L38-L46). +> **NOTE**: Node draining can be delegated to an external drain-controller by setting `USE_EXTERNAL_DRAINER=true` (e.g. using [NVIDIA maintenance-operator](https://github.com/Mellanox/maintenance-operator)) (PR #952). This means that internal drain-controller continues to work on nodeState objects which were not annotated with `sriovnetwork.openshift.io/use-external-drainer`. In addition, `SriovNetworkPoolConfig` will not take any effect during drain procedure, since the maintenance operator will manage [parallel node operations](https://github.com/Mellanox/maintenance-operator/blob/main/api/v1alpha1/maintenanceoperatorconfig_types.go#L38-L46).Based on static analysis hints.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
README.mddoc/design/parallel-node-config.mdpkg/consts/constants.gopkg/daemon/daemon.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/consts/constants.go
🧰 Additional context used
🪛 LanguageTool
README.md
[style] ~467-~467: Consider using more formal and concise wording here.
Context: ...re, since the maintenance operator will be in charge of [parallel node operations](https://gith...
(RESPONSIBLE_FOR)
doc/design/parallel-node-config.md
[style] ~63-~63: Consider using more formal and concise wording here.
Context: ...re, since the maintenance operator will be in charge of parallel node operations. Draining pro...
(RESPONSIBLE_FOR)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Golangci-lint
- GitHub Check: test-coverage
- GitHub Check: test
- GitHub Check: build
- GitHub Check: Analyze (go)
🔇 Additional comments (3)
doc/design/parallel-node-config.md (1)
9-9: LGTM: Documentation date updated appropriately.The last-updated date correctly reflects the timing of this enhancement.
pkg/daemon/daemon.go (2)
152-161: LGTM: External drainer annotation managed before drain processing.The early invocation of
addRemoveExternalDrainerAnnotationcorrectly ensures the external drainer annotation state is synchronized before the daemon processes drain requests. The immediate return when the annotation changes allows the drain controller to observe and react to the updated state.
675-722: LGTM: External drainer annotation logic is correct.The
addRemoveExternalDrainerAnnotationmethod correctly manages the external drainer annotation lifecycle:
Timing guard (lines 689-691): Annotation changes are permitted only when both drain annotations are Idle OR neither exists yet, preventing mid-drain state corruption.
Removal path (lines 694-708): When
UseExternalDraineris disabled, the annotation is cleanly removed usingclient.PatchwithMergeFrom.Addition path (lines 710-719): When
UseExternalDraineris enabled, the annotation is added via the existingAnnotateObjectutility.The conditional logic at lines 689-691 correctly evaluates to:
- Proceed when
neitherExists=true(first reconcile)- Proceed when
bothIdle=true(safe toggle point)- Return early when drain is requested or in progress (avoid mid-drain changes)
This addresses the state machine requirements discussed in past reviews.
|
Thanks for your PR,
To skip the vendors CIs, Maintainers can use one of:
|
pkg/daemon/daemon.go
Outdated
| return ctrl.Result{}, err | ||
| } | ||
| if changed { | ||
| reqLogger.Info("external drainer annotation changed, return to allow drain controller to react") |
There was a problem hiding this comment.
the last part of the log is missleading
we may have the idle annot already there. we will requeue since we patch the node state annotation which will trigger a reconcile.
86176f9 to
a10be09
Compare
|
Thanks for your PR,
To skip the vendors CIs, Maintainers can use one of:
|
1 similar comment
|
Thanks for your PR,
To skip the vendors CIs, Maintainers can use one of:
|
…n in daemon reconcile loop Signed-off-by: Ido Heyvi <iheyvi@nvidia.com>
SRIOV OP internal drain controller can be disabled, through
USE_EXTERNAL_DRAINER. For example draining can be performed by external NVIDIA maintenance operatorSummary by CodeRabbit
New Features
Behavior
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.