OCPEDGE-2084: Add pacemaker health check for ExternalEtcd clusters #1487
base: main
Conversation
@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough
Adds Pacemaker status collection and health subsystems: a collector CLI and CronJob to gather Pacemaker status.
Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 golangci-lint (2.5.0) Error: can't load config: unsupported version of the configuration: "". See https://golangci-lint.run/docs/product/migration-guide for migration instructions.
Actionable comments posted: 4
🧹 Nitpick comments (7)
pkg/tnf/pacemaker/healthcheck.go (3)
71-75: Avoid event spam: emit “Healthy” only on transitions. At 30s resync, always emitting a Healthy event will generate noisy duplicates. Track last status and only emit on transition to Healthy.
 type HealthCheck struct {
 	operatorClient v1helpers.StaticPodOperatorClient
 	kubeClient     kubernetes.Interface
 	eventRecorder  events.Recorder
+	lastOverallStatus string
 }
@@
-	// Record a normal event for healthy status
-	if status.OverallStatus == StatusHealthy {
-		c.eventRecorder.Eventf(EventReasonHealthy, "Pacemaker cluster is healthy - all nodes online and critical resources started")
-	}
+	// Record a normal event for healthy status only on transition
+	if status.OverallStatus == StatusHealthy && c.lastOverallStatus != StatusHealthy {
+		c.eventRecorder.Eventf(EventReasonHealthy, "Pacemaker cluster is healthy - all nodes online and critical resources started")
+	}
+	c.lastOverallStatus = status.OverallStatus

Also applies to: 446-450
347-356: Micro: precompile timestamp regex in isWithinTimeWindow. Avoid recompiling on each iteration.
 func (c *HealthCheck) isWithinTimeWindow(line string, cutoffTime time.Time) bool {
 	timeFormats := []string{
 		"Mon Jan 2 15:04:05 2006",
 		"2006-01-02 15:04:05.000000Z",
 		"2006-01-02 15:04:05Z",
 	}
-	for _, format := range timeFormats {
-		re := regexp.MustCompile(`'([^']+)'`)
-		matches := re.FindStringSubmatch(line)
+	tsRe := regexp.MustCompile(`'([^']+)'`)
+	for _, format := range timeFormats {
+		matches := tsRe.FindStringSubmatch(line)
 		if len(matches) > 1 {
 			if t, err := time.Parse(format, matches[1]); err == nil {
 				return t.After(cutoffTime)
 			}
 		}
 	}
71-75: Unused dependency: kubeClient is stored but never used. Drop it from the struct/constructor or put it to use (e.g., namespacing events, future lookups). Keeping it increases surface without value.
 type HealthCheck struct {
 	operatorClient v1helpers.StaticPodOperatorClient
-	kubeClient     kubernetes.Interface
 	eventRecorder  events.Recorder
 	lastOverallStatus string
 }
@@
 func NewHealthCheck(
 	livenessChecker *health.MultiAlivenessChecker,
 	operatorClient v1helpers.StaticPodOperatorClient,
-	kubeClient kubernetes.Interface,
 	eventRecorder events.Recorder,
 ) factory.Controller {
 	c := &HealthCheck{
 		operatorClient: operatorClient,
-		kubeClient:     kubeClient,
 		eventRecorder:  eventRecorder,
 	}
@@
-		operatorClient,
-		kubeClient,
+		operatorClient,
 		eventRecorder,
 	)

If this is intentionally reserved for future use, ignore and mark with a TODO comment to avoid churn.
Also applies to: 80-84
pkg/tnf/pacemaker/healthcheck_test.go (4)
106-137: Assert condition transitions, not just error absence. Validate that degraded/healthy actually update the operator condition in the fake client.
 	err := controller.updateOperatorAvailability(ctx, status)
 	require.NoError(t, err, "updateOperatorAvailability should not return an error")
+	// Verify degraded condition set
+	_, st, _, _ := operatorClient.GetStaticPodOperatorState()
+	require.Condition(t, func() (success bool) {
+		for _, c := range st.Conditions {
+			if c.Type == ConditionTypeDegraded && c.Status == operatorv1.ConditionTrue {
+				return true
+			}
+		}
+		return false
+	}, "expected degraded condition to be true")
@@
 	err = controller.updateOperatorAvailability(ctx, status)
 	require.NoError(t, err, "updateOperatorAvailability should not return an error")
+	// Verify degraded condition cleared
+	_, st, _, _ = operatorClient.GetStaticPodOperatorState()
+	require.Condition(t, func() (success bool) {
+		for _, c := range st.Conditions {
+			if c.Type == ConditionTypeDegraded && c.Status == operatorv1.ConditionFalse {
+				return true
+			}
+		}
+		return false
+	}, "expected degraded condition to be false")
163-184: Validate emitted events. Use the in-memory recorder to assert reasons were produced (failed action, fencing, generic warning, error, healthy transition).
 	controller.recordHealthCheckEvents(status)
-
+	// Collect and assert events for warnings and error
+	evts := eventRecorder.Events()
+	require.NotEmpty(t, evts)
+	// Sanity: at least one of each warning type and an error
+	require.Contains(t, reasons(evts), EventReasonFailedAction)
+	require.Contains(t, reasons(evts), EventReasonFencingEvent)
+	require.Contains(t, reasons(evts), EventReasonWarning)
+	require.Contains(t, reasons(evts), EventReasonError)
@@
 	status = &HealthStatus{
 		OverallStatus: StatusHealthy,
 		Warnings:      []string{},
 		Errors:        []string{},
 	}
 	controller.recordHealthCheckEvents(status)
+	// After transition to healthy, expect a healthy event
+	evts = eventRecorder.Events()
+	require.Contains(t, reasons(evts), EventReasonHealthy)
+}
+
+// helper
+func reasons(evts []events.Event) []string {
+	out := make([]string, 0, len(evts))
+	for _, e := range evts {
+		out = append(out, e.Reason)
+	}
+	return out
+}
261-291: Add a test for “standby” node to ensure it’s a Warning, not Degraded. Guards the updated parsing logic and aligns with the stated degradation policy.
 	// Test with offline node
@@
 	require.NotEmpty(t, status.Errors, "Should have errors for offline nodes")
+
+	// Test with standby node -> warning, but still considered online
+	lines = []string{
+		SectionNodeList,
+		" * Node master-0 (1): online, feature set 3.19.6",
+		" * Node master-1 (2): standby, feature set 3.19.6",
+		SectionFullListResources,
+	}
+	status = &HealthStatus{Warnings: []string{}, Errors: []string{}}
+	allOnline = controller.parseNodeStatus(lines, status)
+	require.True(t, allOnline, "Standby should not be treated as offline")
+	require.NotEmpty(t, status.Warnings, "Standby should surface as a warning")
296-313: Cover dynamic node-count logic. Add a case where only one node has resources started while two nodes are listed; expect an error message reflecting 1/2.
 	// Test with all resources started
@@
 	require.Empty(t, status.Errors, "Should have no errors for started resources")
+
+	// Test with partial placement (only one node started)
+	lines = []string{
+		SectionNodeList,
+		" * Node master-0 (1): online, feature set 3.19.6",
+		" * Node master-1 (2): online, feature set 3.19.6",
+		SectionFullListResources,
+		" * Clone Set: kubelet-clone [kubelet]:",
+		" * kubelet (systemd:kubelet): Started master-0",
+		" * Clone Set: etcd-clone [etcd]:",
+		" * etcd (ocf:heartbeat:podman-etcd): Started master-0",
+		SectionNodeAttributes,
+	}
+	status = &HealthStatus{Warnings: []string{}, Errors: []string{}}
+	allStarted = controller.parseResourceStatus(lines, status)
+	require.False(t, allStarted, "Resources not started on all nodes should be false")
+	require.NotEmpty(t, status.Errors)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
pkg/tnf/pacemaker/healthcheck.go (1 hunks)
pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
pkg/tnf/pacemaker/healthcheck.go
Outdated
if c.isNodeStatusLine(line) && !strings.Contains(line, NodeStatusOnline) {
	allNodesOnline = false
	nodeName := c.extractNodeName(line)
	status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", nodeName))
}
Don’t treat “not online” as “offline”; handle standby distinctly.
Current check flags any non-“online” node as offline (e.g., “standby”). That contradicts the PR objective (degrade only when nodes are offline).
- if c.isNodeStatusLine(line) && !strings.Contains(line, NodeStatusOnline) {
- allNodesOnline = false
- nodeName := c.extractNodeName(line)
- status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", nodeName))
- }
+ if c.isNodeStatusLine(line) {
+ lower := strings.ToLower(line)
+ if strings.Contains(lower, "offline") {
+ allNodesOnline = false
+ nodeName := c.extractNodeName(line)
+ status.Errors = append(status.Errors, fmt.Sprintf("Node %s is offline", nodeName))
+ } else if strings.Contains(lower, "standby") {
+ nodeName := c.extractNodeName(line)
+ status.Warnings = append(status.Warnings, fmt.Sprintf("Node %s is in standby", nodeName))
+ }
+ }
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if c.isNodeStatusLine(line) && !strings.Contains(line, NodeStatusOnline) {
	allNodesOnline = false
	nodeName := c.extractNodeName(line)
	status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", nodeName))
}

if c.isNodeStatusLine(line) {
	lower := strings.ToLower(line)
	if strings.Contains(lower, "offline") {
		allNodesOnline = false
		nodeName := c.extractNodeName(line)
		status.Errors = append(status.Errors, fmt.Sprintf("Node %s is offline", nodeName))
	} else if strings.Contains(lower, "standby") {
		nodeName := c.extractNodeName(line)
		status.Warnings = append(status.Warnings, fmt.Sprintf("Node %s is in standby", nodeName))
	}
}
🤖 Prompt for AI Agents
In pkg/tnf/pacemaker/healthcheck.go around lines 214-218, the code currently
treats any node state that is not NodeStatusOnline as an offline error; change
the logic to only mark nodes as offline when their status equals the explicit
offline constant (e.g., NodeStatusOffline) and ignore legitimate non-online
states like NodeStatusStandby. Concretely, replace the current negation check
with an explicit equality check for offline (or a compound condition that skips
Standby), set allNodesOnline = false and append an error only when the status is
offline, and avoid adding errors for standby nodes.
pkg/tnf/pacemaker/healthcheck.go
Outdated
func (c *HealthCheck) parseResourceStatus(lines []string, status *HealthStatus) bool {
	resourcesStarted := make(map[string]map[string]bool)
	resourcesStarted[ResourceKubelet] = make(map[string]bool)
	resourcesStarted[ResourceEtcd] = make(map[string]bool)

	inResources := false

	for _, line := range lines {
		line = strings.TrimSpace(line)

		if strings.Contains(line, SectionFullListResources) {
			inResources = true
			continue
		}

		if inResources {
			if strings.Contains(line, SectionNodeAttributes) {
				break
			}

			c.checkResourceLine(line, resourcesStarted)
		}
	}

	// Check if we have both resources started on all nodes
	kubeletCount := len(resourcesStarted[ResourceKubelet])
	etcdCount := len(resourcesStarted[ResourceEtcd])

	if kubeletCount < ExpectedNodeCount {
		status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceKubelet, kubeletCount, ExpectedNodeCount))
	}

	if etcdCount < ExpectedNodeCount {
		status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceEtcd, etcdCount, ExpectedNodeCount))
	}

	return kubeletCount >= ExpectedNodeCount && etcdCount >= ExpectedNodeCount
}
Avoid hard-coded node count (2); derive from the Node List.
Clusters may not always be 2 nodes. Parse the Node List to compute expected count and use it when validating resource placement.
-func (c *HealthCheck) parseResourceStatus(lines []string, status *HealthStatus) bool {
+func (c *HealthCheck) parseResourceStatus(lines []string, status *HealthStatus) bool {
resourcesStarted := make(map[string]map[string]bool)
resourcesStarted[ResourceKubelet] = make(map[string]bool)
resourcesStarted[ResourceEtcd] = make(map[string]bool)
@@
- // Check if we have both resources started on all nodes
- kubeletCount := len(resourcesStarted[ResourceKubelet])
- etcdCount := len(resourcesStarted[ResourceEtcd])
+ // Determine expected node count from the Node List; fall back to constant if not found
+ expectedCount := c.detectNodeCount(lines)
+ if expectedCount == 0 {
+ expectedCount = ExpectedNodeCount
+ }
+ kubeletCount := len(resourcesStarted[ResourceKubelet])
+ etcdCount := len(resourcesStarted[ResourceEtcd])
- if kubeletCount < ExpectedNodeCount {
- status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceKubelet, kubeletCount, ExpectedNodeCount))
+ if kubeletCount < expectedCount {
+ status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceKubelet, kubeletCount, expectedCount))
}
- if etcdCount < ExpectedNodeCount {
- status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceEtcd, etcdCount, ExpectedNodeCount))
+ if etcdCount < expectedCount {
+ status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceEtcd, etcdCount, expectedCount))
}
- return kubeletCount >= ExpectedNodeCount && etcdCount >= ExpectedNodeCount
+ return kubeletCount >= expectedCount && etcdCount >= expectedCount
}
+
+// detectNodeCount returns the number of nodes listed in the "Node List" section.
+func (c *HealthCheck) detectNodeCount(lines []string) int {
+ inNodeList := false
+ count := 0
+ for _, line := range lines {
+ line = strings.TrimSpace(line)
+ if strings.Contains(line, SectionNodeList) {
+ inNodeList = true
+ continue
+ }
+ if inNodeList {
+ if strings.Contains(line, SectionFullListResources) {
+ break
+ }
+ if c.isNodeStatusLine(line) {
+ count++
+ }
+ }
+ }
+ return count
+}
Committable suggestion skipped: line range outside the PR's diff.
c077a2b to 81984c7 (Compare)
81984c7 to 26d6df8 (Compare)
Actionable comments posted: 2
♻️ Duplicate comments (1)
pkg/tnf/pacemaker/healthcheck.go (1)
502-510: Degrade on StatusError as well. If we fail to collect status (OverallStatus == StatusError), the operator condition remains untouched, so outages never surface. Treat StatusError the same as StatusDegraded when updating conditions. Suggested adjustment:
-	if status.OverallStatus == StatusDegraded {
-		return c.setDegradedCondition(ctx, status)
-	} else if status.OverallStatus == StatusHealthy {
+	switch status.OverallStatus {
+	case StatusDegraded, StatusError:
+		return c.setDegradedCondition(ctx, status)
+	case StatusHealthy:
 		return c.clearDegradedCondition(ctx)
 	}
-
-	// For Warning or Error status, we don't update the degraded condition
-	return nil
+	// Warning / Unknown: leave condition unchanged
+	return nil
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (5)
pkg/tnf/pacemaker/healthcheck.go (1 hunks)
pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- pkg/tnf/pacemaker/testdata/recent_failures_template.xml
- pkg/tnf/pacemaker/testdata/healthy_cluster.xml
26d6df8 to e42e5dc (Compare)
Actionable comments posted: 2
♻️ Duplicate comments (3)
pkg/tnf/pacemaker/healthcheck.go (3)
3-9: Prevent hangs; add exec timeout and non-interactive sudo (-n). Also introduce StatusError for collection failures. pcs can block (sudo prompt/hung pcs). Use a short timeout context and sudo -n. While here, add StatusError and treat exec failures accordingly.
Apply:
--- a/pkg/tnf/pacemaker/healthcheck.go
+++ b/pkg/tnf/pacemaker/healthcheck.go
@@
 import (
+	"errors"
 	"context"
 	"encoding/xml"
 	"fmt"
 	"strings"
 	"time"
@@
 const (
 	// Time windows for detecting recent events
 	FailedActionTimeWindow = 5 * time.Minute
 	FencingEventTimeWindow = 24 * time.Hour
+
+	// Upper bound for executing pcs commands
+	ExecTimeout = 10 * time.Second
@@
 	// Status strings
 	StatusHealthy  = "Healthy"
 	StatusWarning  = "Warning"
 	StatusDegraded = "Degraded"
 	StatusUnknown  = "Unknown"
+	StatusError    = "Error"
@@
 	// Pacemaker commands
-	PcsStatusXMLCommand = "sudo pcs status xml"
+	PcsStatusXMLCommand = "sudo -n pcs status xml"
@@
-	// Execute the pcs status xml command
-	stdout, stderr, err := exec.Execute(ctx, PcsStatusXMLCommand)
+	// Execute the pcs status xml command with timeout and non-interactive sudo
+	ctxExec, cancel := context.WithTimeout(ctx, ExecTimeout)
+	defer cancel()
+	stdout, stderr, err := exec.Execute(ctxExec, PcsStatusXMLCommand)
 	if err != nil {
-		return newErrorHealthStatus(fmt.Sprintf("Failed to execute pcs status xml command: %v", err)), nil
+		if errors.Is(err, context.DeadlineExceeded) {
+			err = fmt.Errorf("pcs status xml command timed out: %w", err)
+		}
+		return newErrorHealthStatus(fmt.Sprintf("Failed to execute pcs status xml command: %v", err)), nil
 	}

Also applies to: 21-84, 336-345
379-383: Set OverallStatus when pacemaker isn't running (avoid returning Unknown). Without setting OverallStatus here, outages can be masked and the degraded condition won't be set.
 if result.Summary.Stack.PacemakerdState != PacemakerStateRunning {
 	status.Errors = append(status.Errors, "Pacemaker is not running")
-	return status, nil
+	status.OverallStatus = StatusDegraded
+	return status, nil
 }
568-576: Degrade on StatusError as well (not only Degraded). Collection failures should surface as degraded to operators/users.
 switch status.OverallStatus {
-case StatusDegraded:
+case StatusDegraded, StatusError:
 	return c.setDegradedCondition(ctx, status)
 case StatusHealthy:
 	return c.clearDegradedCondition(ctx)
 default:
 	// For Warning or Unknown status, we don't update the degraded condition
 	return nil
 }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (5)
pkg/tnf/pacemaker/healthcheck.go (1 hunks)
pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/tnf/pacemaker/testdata/offline_node.xml
- pkg/tnf/pacemaker/healthcheck_test.go
🔇 Additional comments (2)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)
1-88: Fixture looks consistent with parser expectations. IDs/roles/attributes align with parseResources/parseNodes and time formats for histories. LGTM.
pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1)
1-66: Recent-failures fixture is well-formed and matches parser formats. Timestamps match PacemakerTimeFormat/PacemakerFenceTimeFormat; placeholders are suitable for templating. LGTM.
e42e5dc to 3cab3d9 (Compare)
Actionable comments posted: 4
♻️ Duplicate comments (3)
pkg/tnf/pacemaker/healthcheck.go (3)
386-390
: SetOverallStatus
toStatusDegraded
before returning.When Pacemaker isn't running, the code appends an error but returns with
OverallStatus
still set toStatusUnknown
(from initialization at line 370). This prevents the degraded condition from being set downstream, masking the outage.Apply this diff:
// Check if pacemaker is running if result.Summary.Stack.PacemakerdState != PacemakerStateRunning { status.Errors = append(status.Errors, "Pacemaker is not running") + status.OverallStatus = StatusDegraded return status, nil }
567-575
: HandleStatusError
in the switch statement.If
StatusError
is added (per earlier comment), theupdateOperatorStatus
switch should treat it likeStatusDegraded
and set the degraded condition.Apply this diff:
// Update operator conditions based on pacemaker status switch status.OverallStatus { - case StatusDegraded: + case StatusDegraded, StatusError: return c.setDegradedCondition(ctx, status) case StatusHealthy: return c.clearDegradedCondition(ctx) default: - // For Warning or Unknown status, we don't update the degraded condition + // For Warning or Unknown status, we don't update the degraded condition return nil }
267-274
: Rename tonewErrorHealthStatus
and useStatusError
instead ofStatusUnknown
.Collection failures (exec timeout, XML parse errors, etc.) should be marked as
StatusError
rather thanStatusUnknown
so they're surfaced as actionable failures in the operator condition. This aligns with previous feedback.Apply this diff:
-// newUnknownHealthStatus creates a HealthStatus with Unknown status and an error message -func newUnknownHealthStatus(errMsg string) *HealthStatus { +// newErrorHealthStatus creates a HealthStatus with Error status and an error message +func newErrorHealthStatus(errMsg string) *HealthStatus { return &HealthStatus{ - OverallStatus: StatusUnknown, + OverallStatus: StatusError, Warnings: []string{}, Errors: []string{errMsg}, } }Then update all call sites (lines 343, 346, 355, 361) to use
newErrorHealthStatus
instead ofnewUnknownHealthStatus
.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (5)
pkg/tnf/pacemaker/healthcheck.go (1 hunks)
pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/tnf/pacemaker/testdata/offline_node.xml
- pkg/tnf/pacemaker/testdata/recent_failures_template.xml
🔇 Additional comments (7)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)
1-88
: LGTM! Test fixture is well-formed.The XML fixture represents a valid healthy Pacemaker cluster state with:
- Both nodes online (
online="true"
on lines 12-13)- All resources started and active
- No failed operations in history
- Empty fence history
This provides good test coverage for the happy path.
pkg/tnf/pacemaker/healthcheck_test.go (6)
102-113
: Update test expectations whenStatusError
replacesStatusUnknown
.This test expects
StatusUnknown
for exec failures (line 111), which is correct for the current implementation. However, oncenewUnknownHealthStatus
is renamed tonewErrorHealthStatus
and returnsStatusError
(per the comment on healthcheck.go lines 267-274), update this assertion accordingly.After applying the main implementation changes, update line 111:
require.NotNil(t, status, "HealthStatus should not be nil") - require.Equal(t, StatusUnknown, status.OverallStatus, "Status should be Unknown when exec fails") + require.Equal(t, StatusError, status.OverallStatus, "Status should be Error when exec fails") require.NotEmpty(t, status.Errors, "Should have error messages when exec fails")
301-329
: UpdateTestHealthCheck_determineOverallStatus
forStatusError
.Line 306 tests
newUnknownHealthStatus
and expectsStatusUnknown
(line 308). OnceStatusError
is introduced, add a dedicated test case and update the existing one.After implementing
StatusError
, update the test:func TestHealthCheck_determineOverallStatus(t *testing.T) { controller := &HealthCheck{} - // Test degraded status status - errMsg := "test error message" - status := newUnknownHealthStatus(errMsg) - result := controller.determineOverallStatus(status) - require.Equal(t, StatusUnknown, result) + // Test error status (collection failures) + errMsg := "test error message" + status := newErrorHealthStatus(errMsg) + result := controller.determineOverallStatus(status) + require.Equal(t, StatusError, result)Also add a test case for
StatusUnknown
if it's still used for other scenarios, or remove it if no longer needed.
377-392
: UpdateTestHealthCheck_validateResourceCount
signature after refactor.Once
validateResourceCount
is updated to acceptexpectedCount
as a parameter (per the comment on healthcheck.go lines 480-487), update the test calls on lines 382 and 388.Apply this diff after the main implementation is updated:
// Test with correct count status := &HealthStatus{Warnings: []string{}, Errors: []string{}} - result := controller.validateResourceCount(ResourceEtcd, 2, status) + result := controller.validateResourceCount(ResourceEtcd, 2, 2, status) require.True(t, result, "Should return true for correct count") require.Empty(t, status.Errors, "Should have no errors for correct count") // Test with insufficient count status = &HealthStatus{Warnings: []string{}, Errors: []string{}} - result = controller.validateResourceCount(ResourceKubelet, 1, status) + result = controller.validateResourceCount(ResourceKubelet, 1, 2, status) require.False(t, result, "Should return false for insufficient count")
248-266
: Consider removingTestHealthCheck_collectNodeStatus_MismatchedCount
after refactor.This test validates the warning for node count mismatch against
ExpectedNodeCount
. Once the hard-coded constant is removed (per the comment on healthcheck.go lines 423-427), this specific warning logic will be deleted, making this test obsolete.After the refactor, either remove this test or update it to test a different node-count-related scenario if applicable.
23-46
: LGTM! Test helpers are well-structured.The
createTestHealthCheck
andNewHealthCheck
test setup properly use fake clients and in-memory event recorders for isolated testing. Good practice for controller tests.Also applies to: 72-100
176-230
: LGTM! XML parsing tests provide good coverage.The tests for
parsePacemakerStatusXML
cover:
- Healthy cluster (lines 176-188)
- Offline node scenario (lines 190-202)
- Recent failures with dynamic timestamps (lines 204-230)
The use of
getRecentFailuresXML
with dynamic timestamps (lines 55-69) is particularly good for avoiding time-sensitive test flakes.
3cab3d9 to c3fb43e (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (8)
pkg/tnf/pacemaker/healthcheck.go (8)
267-274
: Duplicate: Mark collection failures as StatusError.Per past review, this function should return
StatusError
instead ofStatusUnknown
to properly distinguish collection failures from unknown states. Collection failures are actionable and should be surfaced as errors.Apply this diff:
func newUnknownHealthStatus(errMsg string) *HealthStatus { return &HealthStatus{ - OverallStatus: StatusUnknown, + OverallStatus: StatusError, Warnings: []string{}, Errors: []string{errMsg}, } }Note: This requires adding the
StatusError
constant to the status constants section (lines 34-38).
34-38
: Add StatusError constant for collection failures.To properly handle collection/communication failures distinct from operational issues, add a
StatusError
constant alongside the existing status constants.Apply this diff:
// Status strings StatusHealthy = "Healthy" StatusWarning = "Warning" StatusDegraded = "Degraded" + StatusError = "Error" StatusUnknown = "Unknown"
429-433
: Distinguish offline from standby nodes.The current check treats any non-online node as an error, including legitimate standby states. Per the PR objectives, only offline nodes should trigger degradation.
Apply this diff:
for _, node := range result.Nodes.Node { - if node.Online != NodeStatusOnline { - status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", node.Name)) + if node.Online == "false" { + status.Errors = append(status.Errors, fmt.Sprintf("Node %s is offline", node.Name)) + } else if node.Standby == BooleanValueTrue { + status.Warnings = append(status.Warnings, fmt.Sprintf("Node %s is in standby", node.Name)) } }
387-390
: Set OverallStatus when pacemaker isn't running.When pacemaker is down, the function returns with
OverallStatus
still set toUnknown
. This prevents the degraded condition from being set, masking the outage.Apply this diff:
if result.Summary.Stack.PacemakerdState != PacemakerStateRunning { status.Errors = append(status.Errors, "Pacemaker is not running") + status.OverallStatus = StatusDegraded return status, nil }
497-509
: Fix failure detection logic: use OR instead of AND.The condition requires both RC and RCText to indicate failure, which misses cases where only one field is non-success. Use OR logic to catch all failure cases.
Apply this diff:
for _, operation := range resourceHistory.OperationHistory { // Check for failed operations (rc != "0") - if operation.RC != OperationRCSuccess && operation.RCText != OperationRCTextSuccess { + if operation.RC != OperationRCSuccess || operation.RCText != OperationRCTextSuccess { // Parse the timestamp t, err := time.Parse(PacemakerTimeFormat, operation.LastRCChange)
567-575
: Handle StatusError in operator status update.When status collection fails (
StatusError
), the operator should surface a degraded condition to avoid masking outages. Currently onlyStatusDegraded
triggers the degraded condition.Apply this diff:
// Update operator conditions based on pacemaker status switch status.OverallStatus { - case StatusDegraded: + case StatusDegraded, StatusError: return c.setDegradedCondition(ctx, status) case StatusHealthy: return c.clearDegradedCondition(ctx) default: - // For Warning or Unknown status, we don't update the degraded condition + // For Warning or Unknown status, we don't update the degraded condition return nil }
469-487
: Derive expected node count dynamically from parsed XML.Replace the hard-coded
ExpectedNodeCount
constant with a dynamic count derived from the parsed node list. This makes the validation flexible for clusters of varying sizes.Apply this diff:
// Check if we have both resources started on all nodes + expectedCount := len(result.Nodes.Node) kubeletCount := len(resourcesStarted[ResourceKubelet]) etcdCount := len(resourcesStarted[ResourceEtcd]) - kubeletOK := c.validateResourceCount(ResourceKubelet, kubeletCount, status) - etcdOK := c.validateResourceCount(ResourceEtcd, etcdCount, status) + kubeletOK := c.validateResourceCount(ResourceKubelet, kubeletCount, expectedCount, status) + etcdOK := c.validateResourceCount(ResourceEtcd, etcdCount, expectedCount, status) return kubeletOK && etcdOK } // validateResourceCount validates that a resource is started on the expected number of nodes -func (c *HealthCheck) validateResourceCount(resourceName string, actualCount int, status *HealthStatus) bool { - if actualCount < ExpectedNodeCount { +func (c *HealthCheck) validateResourceCount(resourceName string, actualCount, expectedCount int, status *HealthStatus) bool { + if actualCount < expectedCount { status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", - resourceName, actualCount, ExpectedNodeCount)) + resourceName, actualCount, expectedCount)) return false } return true
423-427
: Remove hard-coded node count check.The hard-coded
ExpectedNodeCount = 2
check is inflexible for clusters that may scale or undergo node replacement. Since the subsequent validation already uses the actual node count from the XML, this strict equality check can be removed.Apply this diff:
- // Check if we have the expected number of nodes. - // This should almost always be 2, but 1 node is possible during a control-plane node replacement event. - if len(result.Nodes.Node) != ExpectedNodeCount { - status.Warnings = append(status.Warnings, fmt.Sprintf("Expected %d nodes, found %d", ExpectedNodeCount, len(result.Nodes.Node))) - } -
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (5)
pkg/tnf/pacemaker/healthcheck.go (1 hunks)
pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🔇 Additional comments (4)
pkg/tnf/pacemaker/testdata/offline_node.xml (1)
1-67
: LGTM!This XML test fixture correctly represents an offline node scenario with master-1 offline (line 13) and a corresponding fencing event (line 64). The structure is valid and appropriate for testing the health check parser's handling of degraded cluster states.
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)
1-88
: LGTM!This XML test fixture correctly represents a healthy cluster with both nodes online and all resources started. The structure is valid and provides appropriate test coverage for the healthy path scenario.
pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1)
1-66
: LGTM!This template correctly uses placeholders for dynamic timestamp generation (lines 56, 63), allowing tests to verify time-based filtering of recent failures and fencing events. The template approach is appropriate for testing scenarios that depend on recency.
pkg/tnf/pacemaker/healthcheck_test.go (1)
1-440
: LGTM!This comprehensive test suite provides excellent coverage for the Pacemaker health check component:
- Thorough testing of XML parsing with multiple scenarios (healthy, offline, recent failures)
- Good edge case coverage (no nodes, mismatched counts, large/invalid XML)
- Effective use of test helpers and fixtures
- Appropriate use of fake clients for isolation
The test structure is clean and maintainable.
c3fb43e to f87437c (Compare)
@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Actionable comments posted: 0
♻️ Duplicate comments (1)
pkg/tnf/pacemaker/healthcheck.go (1)
386-390
: Set OverallStatus before returning when pacemaker is not running.When pacemaker is not running, the function appends an error and returns immediately (line 389), but
OverallStatus
remainsUnknown
(initialized at line 370). This prevents the degraded condition from being set properly sinceupdateOperatorStatus
(line 567-575) only sets degraded forStatusError
, notStatusUnknown
.Apply this diff:
if result.Summary.Stack.PacemakerdState != PacemakerStateRunning { status.Errors = append(status.Errors, "Pacemaker is not running") + status.OverallStatus = c.determineOverallStatus(status) return status, nil }
🧹 Nitpick comments (1)
pkg/tnf/pacemaker/healthcheck_test.go (1)
102-113
: Clarify test expectation: exec failure should not return an error.The comment states "exec.Execute will fail due to permissions" but the test expects
NoError
. This appears correct (the function returnsUnknown
status without propagating the error), but the comment could be clearer about this design choice.Consider updating the comment to:
- // In test environment, exec.Execute will fail due to permissions - require.NoError(t, err, "getPacemakerStatus should not return an error") + // In test environment, exec.Execute will fail due to permissions. + // getPacemakerStatus handles this gracefully by returning Unknown status without propagating the error. + require.NoError(t, err, "getPacemakerStatus should not return an error even when exec fails")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (5)
pkg/tnf/pacemaker/healthcheck.go (1 hunks)
pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/tnf/pacemaker/testdata/healthy_cluster.xml
- pkg/tnf/pacemaker/testdata/recent_failures_template.xml
🔇 Additional comments (10)
pkg/tnf/pacemaker/testdata/offline_node.xml (1)
1-67
: LGTM! Well-structured test fixture for offline node scenario.The XML fixture accurately represents an offline node scenario with appropriate fencing history. The structure is valid and will effectively support testing of offline node detection and fencing event handling.
pkg/tnf/pacemaker/healthcheck_test.go (2)
23-69
: LGTM! Well-designed test helpers.The test helpers are well-structured:
createTestHealthCheck
: Properly initializes all dependencies with fakesloadTestXML
: Simple and effective for loading fixturesgetRecentFailuresXML
: Clever approach using dynamic timestamps to ensure failures are within the 5-minute window
176-440
: LGTM! Comprehensive test coverage.The test suite provides excellent coverage:
- XML parsing for healthy, offline, and recent-failures scenarios
- Node status collection with various edge cases (empty nodes, mismatched counts)
- Resource status validation
- Overall status determination logic
- Helper functions (newUnknownHealthStatus, recent failures, fencing events)
- Edge cases (large XML, invalid XML, missing state)
The tests are well-organized, use clear assertions, and validate both happy paths and error conditions.
pkg/tnf/pacemaker/healthcheck.go (7)
334-365
: LGTM! Robust command execution with proper safeguards.The implementation correctly addresses potential hanging issues:
- Uses context with timeout (10 seconds) to prevent indefinite blocking
- Uses
sudo -n
for non-interactive execution- Explicitly checks for timeout errors
- Validates XML size to prevent XML bombs
- Returns graceful Unknown status on errors
23-85
: LGTM! Well-organized constants with clear documentation.The constants are well-structured and appropriately documented:
- Time windows for event detection are reasonable (5 min for failures, 24 hours for fencing)
- 30-second resync interval balances responsiveness with resource usage
- Status constants, event reasons, and command strings are clearly defined
ExpectedNodeCount = 2
correctly reflects the Two-Node OpenShift design
416-434
: Node online check aligns with Two-Node OpenShift design.The check at line 430 treats any node where
Online != "true"
as an error, including standby nodes. Per the maintainer's clarification, this is the intended behavior for Two-Node OpenShift with Fencing, where standby nodes are considered offline for health purposes.Based on learnings
497-509
: LGTM! Correct failure detection using OR logic.The failure detection properly uses OR logic (
||
) to catch failures when eitherRC
orRCText
indicates non-success. This correctly identifies cases where only one field shows failure (e.g.,RC="1"
butRCText="ok"
).
560-576
: Consider whether StatusUnknown should also set degraded condition.Currently, only
StatusError
sets the degraded condition. When the controller cannot determine cluster health (e.g., exec timeout, parse failure), it returnsStatusUnknown
, but this doesn't update the operator condition. This could mask operational issues.Should
StatusUnknown
also set a degraded condition to surface collection failures? If the current behavior is intentional (avoiding false alarms during transient issues), consider adding a comment explaining this design choice.Would you like the controller to set degraded on Unknown status, or is the current "wait and see" approach preferred?
480-487
: Hard-coded ExpectedNodeCount correctly reflects architectural constraint.The constant
ExpectedNodeCount = 2
and its usage in validation correctly encode the Two-Node OpenShift with Fencing design. Per the maintainer's clarification, this is an architectural invariant, not a limitation to be made dynamic.Based on learnings
627-654
: LGTM! Well-categorized event recording.Event recording properly categorizes warnings by type:
- Failed resource actions →
EventReasonFailedAction
- Fencing events →
EventReasonFencingEvent
- General warnings →
EventReasonWarning
- Errors →
EventReasonError
- Healthy status →
EventReasonHealthy
This categorization enables effective event filtering and alerting.
f87437c to c07e5b1 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (1)
pkg/tnf/pacemaker/healthcheck.go (1)
267-274: Collection failures return Unknown; confirm policy. Exec/parse failures set StatusUnknown, so no degraded condition is set. If outages should surface, consider marking as StatusError to degrade. If Unknown is intentional, ignore.
Also applies to: 341-347
🧹 Nitpick comments (6)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)
79-82: Remove duplicated operation_history entry. Line 80 duplicates line 79 for the kubelet probe. Drop one to avoid confusion in tests and future parsing.
- <operation_history call="330" task="probe" rc="0" rc_text="ok" last-rc-change="Thu Sep 18 12:40:35 2025" exec-time="4ms" queue-time="0ms"/>
pkg/tnf/pacemaker/healthcheck_test.go (1)
115-149: Optional: rename test for consistency with impl name. The function under test is updateOperatorStatus; consider renaming the test to match for easier discovery.
-func TestHealthCheck_updateOperatorAvailability(t *testing.T) {
+func TestHealthCheck_updateOperatorStatus(t *testing.T) {

pkg/tnf/pacemaker/healthcheck.go (4)
338-341: Extract exec timeout into a constant. Avoid magic number; define a reusable ExecTimeout alongside other constants and use it here.
@@
-	// Execute the pcs status xml command with a timeout and non-interactive sudo
-	ctxExec, cancel := context.WithTimeout(ctx, 10*time.Second)
+	// Execute the pcs status xml command with a timeout and non-interactive sudo
+	ctxExec, cancel := context.WithTimeout(ctx, ExecTimeout)

Add near other constants:
+	// Upper bound for "pcs status" execution
+	ExecTimeout = 10 * time.Second
430-434: If standby is considered “offline”, handle it explicitly. Currently only Online != "true" is flagged. To align with the stated intent to treat standby as offline, also error on Standby == "true".
 for _, node := range result.Nodes.Node {
-	if node.Online != NodeStatusOnline {
-		status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", node.Name))
-	}
+	if node.Online != NodeStatusOnline {
+		status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", node.Name))
+	} else if node.Standby == BooleanValueTrue {
+		status.Errors = append(status.Errors, fmt.Sprintf("Node %s is in standby (treated as offline)", node.Name))
+	}
 }
449-453: Match resources by agent, not ID substring. ID-based substring matching is brittle. Use ResourceAgent to detect kubelet/etcd reliably.
-	// Check if this is a kubelet or etcd resource
-	if strings.Contains(resource.ID, ResourceKubelet) {
-		resourcesStarted[ResourceKubelet][nodeName] = true
-	} else if strings.Contains(resource.ID, ResourceEtcd) {
-		resourcesStarted[ResourceEtcd][nodeName] = true
-	}
+	// Identify kubelet/etcd resources by their agents
+	switch {
+	case strings.HasPrefix(resource.ResourceAgent, "systemd:kubelet"):
+		resourcesStarted[ResourceKubelet][nodeName] = true
+	case strings.HasPrefix(resource.ResourceAgent, "ocf:heartbeat:podman-etcd"):
+		resourcesStarted[ResourceEtcd][nodeName] = true
+	}
629-643: Optional: gate repeated warning events to reduce noise. At 30s resync, warnings within the 5m/24h windows will re-emit every cycle. Consider deduplication (e.g., a last-emitted cache, once per window) to avoid event spam.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (5)
pkg/tnf/pacemaker/healthcheck.go (1 hunks)
pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/tnf/pacemaker/testdata/recent_failures_template.xml
🔇 Additional comments (1)
pkg/tnf/pacemaker/healthcheck_test.go (1)
49-53: Use require.NoErrorf to format the failure message. The current call won't format %s; switch to the fmt-aware variant.
-	require.NoError(t, err, "Failed to read test XML file: %s", filename)
+	require.NoErrorf(t, err, "Failed to read test XML file: %s", filename)

Likely an incorrect or invalid review comment.
c07e5b1 to e2add0e (Compare)
Actionable comments posted: 1
🧹 Nitpick comments (4)
pkg/tnf/pacemaker/healthcheck.go (1)
388-401: Optional: size guard exists only in caller, not in parser. parsePacemakerStatusXML trusts input size; only getPacemakerStatus enforces maxXMLSize. If parse is used elsewhere, consider early-checking length or using an xml.Decoder with a bounded reader to keep invariants local to the parser. Low priority.
pkg/tnf/pacemaker/healthcheck_test.go (3)
50-55: Use require.NoErrorf for formatted failure message. Minor test ergonomics: this call passes a format string and arg; use the "f" variant.
-	data, err := os.ReadFile(path)
-	require.NoError(t, err, "Failed to read test XML file: %s", filename)
+	data, err := os.ReadFile(path)
+	require.NoErrorf(t, err, "Failed to read test XML file: %s", filename)

Based on learnings
642-659: Test name/comment mismatch with behavior. The test name and comment suggest exercising "large XML" behavior, but the parser test simply validates invalid XML. Either update the comment/name to reflect invalid XML, or drive the size check via getPacemakerStatus (which enforces maxXMLSize). Optional.
484-498: Add a test to assert standby is treated as offline. To lock in the intended behavior, add a case where node.Online="true" and Standby="true" and assert an error is recorded.
Example:
func TestHealthCheck_collectNodeStatus_StandbyTreatedAsOffline(t *testing.T) {
	controller := &HealthCheck{}
	result := &PacemakerResult{
		Nodes: Nodes{
			Node: []Node{
				{Name: "master-0", Online: booleanValueTrue, Standby: booleanValueTrue},
				{Name: "master-1", Online: booleanValueTrue, Standby: "false"},
			},
		},
	}
	status := &HealthStatus{Warnings: []string{}, Errors: []string{}}
	controller.collectNodeStatus(result, status)
	require.NotEmpty(t, status.Errors)
	require.Contains(t, strings.Join(status.Errors, " "), "standby")
}
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (5)
pkg/tnf/pacemaker/healthcheck.go (1 hunks)
pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/tnf/pacemaker/testdata/recent_failures_template.xml
- pkg/tnf/pacemaker/testdata/healthy_cluster.xml
e2add0e to f3d22d3 (Compare)
Overall, it seems good to me :)
I just left a few notes
// Validate XML size to prevent XML bombs
if len(stdout) > maxXMLSize {
	return newUnknownHealthStatus(fmt.Sprintf("XML output too large: %d bytes (max: %d bytes)", len(stdout), maxXMLSize)), nil
}
Note for the future: if the XML bomb is a genuine risk, we might want to consider modifying the exec package to use io.LimitReader somehow. The current check prevents us from parsing it, which is good, but we still have the whole XML in memory at that point.
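A minimal sketch of that direction, assuming a hypothetical helper wrapped around the command's stdout stream; the function name and its placement in the exec package are illustrative, not the PR's actual API:

package exec

import "io"

// boundedOutput reads at most maxSize bytes from r and reports whether the
// producer had more data than the limit, without buffering the excess.
func boundedOutput(r io.Reader, maxSize int64) ([]byte, bool, error) {
	// Read one extra byte so "exactly at the limit" and "over the limit"
	// can be told apart.
	data, err := io.ReadAll(io.LimitReader(r, maxSize+1))
	if err != nil {
		return nil, false, err
	}
	if int64(len(data)) > maxSize {
		return data[:maxSize], true, nil
	}
	return data, false, nil
}

The caller could then fail fast on the truncated flag instead of first materializing the full XML.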
// Record pacemaker health check events
c.recordHealthCheckEvents(healthStatus)
Not sure how long recordHealthCheckEvents takes, but maybe we should pass the context here.
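For illustration only, one hedged shape for that change; the signature and the use of the EventReasonWarning constant are assumptions, not the PR's current code:

// recordHealthCheckEvents emits events for the collected status; accepting a
// context lets the caller cancel or bound the work during controller shutdown.
func (c *HealthCheck) recordHealthCheckEvents(ctx context.Context, status *HealthStatus) {
	for _, warning := range status.Warnings {
		select {
		case <-ctx.Done():
			return // stop emitting once the sync context is cancelled
		default:
		}
		c.eventRecorder.Eventf(EventReasonWarning, "%s", warning)
	}
}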
if err != nil {
	klog.V(4).Infof("Failed to parse timestamp '%s' for operation %s on node %s: %v",
		operation.LastRCChange, operation.Task, node.Name, err)
	continue
}
Shouldn't we worry about letting these errors slip through?
// Clean up old entries that are outside the longest deduplication window
maxWindow := eventDeduplicationWindowFencing
for key, timestamp := range c.recordedEvents {
	if now.Sub(timestamp) > maxWindow {
		delete(c.recordedEvents, key)
	}
}
This cleanup logic is currently nested inside the block that decides whether to add an item. Should we move it out? It feels like that cleanup step could be performed just once per recordHealthCheckEvents execution, not for every new item to add; did I misunderstand it?
UPDATE: now that I read it a little more carefully, it seems that each loop iteration in recordHealthCheckEvents might delete events recorded in the previous loop.
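A small sketch of hoisting that cleanup, reusing the names visible in the quoted snippet; the method name pruneRecordedEvents and the mutex field are assumptions:

// pruneRecordedEvents drops dedup entries older than the longest window.
// Calling it once at the start of recordHealthCheckEvents avoids rescanning
// the map for every warning and cannot evict entries added in the same pass.
func (c *HealthCheck) pruneRecordedEvents(now time.Time) {
	c.recordedEventsMu.Lock()
	defer c.recordedEventsMu.Unlock()
	for key, timestamp := range c.recordedEvents {
		if now.Sub(timestamp) > eventDeduplicationWindowFencing {
			delete(c.recordedEvents, key)
		}
	}
}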
I had forgotten the tests :)
Some tests have similar boilerplate code (like the ones that parse XML). You might want to consider using test tables instead. Right now, though, I have mixed feelings, as I do like how you organized the tests. The only real benefit of merging them would be that if we need to fix an issue or add new test cases, it'd be much easier to do it in one place instead of several. So, I'll leave the decision up to you!
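As a rough illustration of the table-driven shape: the fixture names come from this PR's testdata, while the expected node counts and the direct xml.Unmarshal into PacemakerResult are assumptions for the sketch.

func TestPacemakerFixtures_Table(t *testing.T) {
	cases := []struct {
		name      string
		fixture   string
		wantNodes int
	}{
		{name: "healthy cluster", fixture: "healthy_cluster.xml", wantNodes: 2},
		{name: "offline node", fixture: "offline_node.xml", wantNodes: 2},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			var result PacemakerResult
			err := xml.Unmarshal([]byte(loadTestXML(t, tc.fixture)), &result)
			require.NoError(t, err)
			require.Len(t, result.Nodes.Node, tc.wantNodes)
		})
	}
}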
func getRecentFailuresXML(t *testing.T) string {
	// Load the template from testdata
	templateXML := loadTestXML(t, "recent_failures_template.xml")

	// Generate recent timestamps that are definitely within the 5-minute window
	now := time.Now()
	recentTime := now.Add(-1 * time.Minute).UTC().Format("Mon Jan 2 15:04:05 2006")
	recentTimeISO := now.Add(-1 * time.Minute).UTC().Format("2006-01-02 15:04:05.000000Z")

	// Replace the placeholders in the template
	xmlContent := strings.ReplaceAll(templateXML, "{{RECENT_TIMESTAMP}}", recentTime)
	xmlContent = strings.ReplaceAll(xmlContent, "{{RECENT_TIMESTAMP_ISO}}", recentTimeISO)

	return xmlContent
}
Cool! :)
pretty nice PR IMHO 👍
	return c.updateOperatorCondition(ctx, condition)
}

func (c *HealthCheck) clearDegradedCondition(ctx context.Context) error {
would it make sense to check the current degraded condition for its reason, to make sure we don't clear a degraded condition which wasn't set by this healthcheck? Not sure what best practice is in such situations...
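Checking ownership before clearing seems reasonable; a hedged sketch of one way to do it, where the condition type and reason constants and the lookup pattern are assumptions layered on the types already used in this file:

func (c *HealthCheck) clearDegradedCondition(ctx context.Context) error {
	_, operatorStatus, _, err := c.operatorClient.GetStaticPodOperatorState()
	if err != nil {
		return err
	}
	existing := v1helpers.FindOperatorCondition(operatorStatus.Conditions, conditionTypeDegraded)
	if existing == nil || existing.Reason != conditionReasonPacemakerHealthCheck {
		// Nothing to clear, or the condition was set by someone else; leave it alone.
		return nil
	}
	return c.updateOperatorCondition(ctx, operatorv1.OperatorCondition{
		Type:   conditionTypeDegraded,
		Status: operatorv1.ConditionFalse,
		Reason: conditionReasonPacemakerHealthCheck,
	})
}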
pkg/tnf/pacemaker/healthcheck.go
Outdated
pacemakerFenceTimeFormat = "2006-01-02 15:04:05.000000Z" // Format for fence event timestamps

// Pacemaker commands
pcsStatusXMLCommand = "sudo -n pcs status xml"
Huh wait... did you already try running this in cluster?
IIRC one of the reasons for using jobs for pcs commands was about permissions... 🤔
Ah interesting. I'll do some testing today and get back to you
So it doesn't work in an interesting way.
[ec2-user@aws-jpoulin-dev ~]$ oc logs pod/etcd-operator-554d7f6899-5wrp2 | grep health.go
E1015 15:16:40.477459 1 health.go:120] health check for member (master-0) failed: err(context deadline exceeded)
E1015 15:16:45.703453 1 health.go:120] health check for member (master-0) failed: err(context deadline exceeded)
E1015 15:19:04.918498 1 health.go:120] health check for member (master-1) failed: err(context deadline exceeded)
Why does it stop after 3 tries I wonder :P
I'll try to update via cron with privileges and see if I can get something working
f3d22d3 to 6176029 (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (3)
pkg/tnf/pacemaker/healthcheck.go (3)
524-528
: Timestamp parsing errors are logged but not surfaced.When
time.Parse
fails, the error is logged at V(4) level and the operation is skipped withcontinue
. If all timestamps consistently fail to parse (e.g., due to Pacemaker version change in timestamp format), recent failures would be silently ignored.Consider accumulating a count of parse errors and emitting a warning if a threshold is exceeded, to detect systematic parsing issues.
640-665
: Move cleanup logic outside the event-checking loop for efficiency.The cleanup of old deduplication entries (lines 647-653) currently runs on every call to
shouldRecordEvent
, which means it executes once per warning/error. This is inefficient when processing multiple events.Based on learnings
Apply this refactor to clean up once per sync:
 // recordHealthCheckEvents records events for pacemaker warnings and fencing history
 func (c *HealthCheck) recordHealthCheckEvents(status *HealthStatus) {
+	// Clean up old entries once at the start
+	c.cleanupOldRecordedEvents()
+
 	// Record events for warnings with appropriate deduplication window
 	for _, warning := range status.Warnings {
 		eventKey := fmt.Sprintf("warning:%s", warning)
@@ -696,6 +700,19 @@
 	}
 }

+// cleanupOldRecordedEvents removes entries older than the longest deduplication window
+func (c *HealthCheck) cleanupOldRecordedEvents() {
+	c.recordedEventsMu.Lock()
+	defer c.recordedEventsMu.Unlock()
+
+	now := time.Now()
+	maxWindow := eventDeduplicationWindowFencing
+	for key, timestamp := range c.recordedEvents {
+		if now.Sub(timestamp) > maxWindow {
+			delete(c.recordedEvents, key)
+		}
+	}
+}
+
 // shouldRecordEvent checks if an event should be recorded based on deduplication logic
 func (c *HealthCheck) shouldRecordEvent(eventKey string, deduplicationWindow time.Duration) bool {
 	c.recordedEventsMu.Lock()
 	defer c.recordedEventsMu.Unlock()

 	now := time.Now()
-
-	// Clean up old entries that are outside the longest deduplication window
-	maxWindow := eventDeduplicationWindowFencing
-	for key, timestamp := range c.recordedEvents {
-		if now.Sub(timestamp) > maxWindow {
-			delete(c.recordedEvents, key)
-		}
-	}

 	// Check if this event was recently recorded
 	if lastRecorded, exists := c.recordedEvents[eventKey]; exists {
619-626
: Verify degraded condition ownership before clearing.The
clearDegradedCondition
function clears any degraded condition of typePacemakerHealthCheckDegraded
without verifying it was set by this controller. If multiple components manage similar conditions, this could inadvertently clear a condition set by another component.Consider checking the condition's reason or message to confirm it was set by this health check before clearing it.
Based on learnings
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (6)
pkg/tnf/pacemaker/healthcheck.go
(1 hunks)pkg/tnf/pacemaker/healthcheck_test.go
(1 hunks)pkg/tnf/pacemaker/testdata/healthy_cluster.xml
(1 hunks)pkg/tnf/pacemaker/testdata/offline_node.xml
(1 hunks)pkg/tnf/pacemaker/testdata/recent_failures_template.xml
(1 hunks)pkg/tnf/pacemaker/testdata/standby_node.xml
(1 hunks)
🔇 Additional comments (13)
pkg/tnf/pacemaker/testdata/offline_node.xml (1)
1-67
: LGTM! Test fixture correctly represents offline node scenario.This XML fixture appropriately models a Pacemaker cluster with an offline node (master-1 with
online="false"
), which is used to test the health-check parsing logic for degraded cluster states.pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)
1-87
: LGTM! Test fixture correctly represents healthy cluster scenario.This XML fixture appropriately models a healthy Pacemaker cluster with both nodes online, all critical resources started, and no recent fencing events. It serves as a good baseline for testing the happy path.
pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1)
1-66
: LGTM! Template approach is appropriate for time-sensitive testing.This template with placeholder timestamps (
{{RECENT_TIMESTAMP}}
and{{RECENT_TIMESTAMP_ISO}}
) is correctly used by thegetRecentFailuresXML
test helper to dynamically generate XML with recent timestamps, enabling proper testing of the time-window-based failure detection logic.pkg/tnf/pacemaker/testdata/standby_node.xml (2)
57-58
: Verify duplicate operation_history entries are intentional.Lines 57-58 contain identical
operation_history
entries for the kubelet probe operation (same call, task, timestamps). If this duplication is intentional to test duplicate handling in the parser, consider adding a comment explaining this. Otherwise, one entry should be removed.
1-66
: Test fixture correctly represents standby node scenario.This XML fixture appropriately models a Pacemaker cluster with a standby node (master-1 with
standby="true"
), which is used to test the health-check behavior when nodes are in standby state.pkg/tnf/pacemaker/healthcheck_test.go (3)
23-71
: LGTM! Well-structured test helpers with good practices.The helper functions appropriately:
- Create test fixtures with fake clients and in-memory recorders
- Load static XML test data from files
- Generate dynamic timestamps for time-sensitive tests using template replacement
The
getRecentFailuresXML
helper's approach of using 1-minute-old timestamps ensures tests are consistently within the 5-minute failure window.
179-245
: Excellent event deduplication test coverage.This test comprehensively validates the deduplication logic:
- Verifies duplicate events are not re-recorded within windows
- Tests different deduplication windows (5 min for actions, 1 hour for fencing)
- Validates healthy event is only recorded on status transitions
- Confirms new errors are recorded even when duplicates exist
The test correctly verifies the controller only records healthy events when transitioning from unhealthy states, not when already healthy.
428-496
: Comprehensive XML parsing test coverage.The parsing tests appropriately cover:
- Healthy cluster baseline
- Offline nodes
- Standby nodes
- Recent failures with dynamic timestamps
- Invalid/large/missing-state XML edge cases
The tests correctly validate both the parsed status and the specific warnings/errors generated for each scenario.
pkg/tnf/pacemaker/healthcheck.go (5)
349-350
: Consider passing context to recordHealthCheckEvents for consistency.While
recordHealthCheckEvents
doesn't perform I/O or cancellable operations, passing the context parameter would maintain consistency with other methods and enable future additions that might need it (e.g., timeout-sensitive event recording).Based on learnings
452-460
: Node status validation correctly treats standby as offline.The implementation correctly identifies both offline nodes (
online != "true"
) and standby nodes (standby == "true"
) as errors, treating them as equivalent failure conditions. The error message for standby nodes explicitly documents this design decision with "(unexpected behavior; treated as offline)".
360-387
: LGTM! Robust command execution with timeout and error handling.The
getPacemakerStatus
function appropriately:
- Uses a timeout context (10 seconds) to prevent indefinite hangs
- Uses non-interactive sudo (
sudo -n
) to fail fast on missing credentials- Validates XML size to prevent XML bomb attacks
- Returns
Unknown
status with error messages rather than propagating errors, ensuring graceful degradationThe timeout handling explicitly checks for
context.DeadlineExceeded
to provide clear error messages.
408-413
: LGTM! Correctly sets error status when Pacemaker is down.The code properly detects when Pacemaker is not running and sets
OverallStatus
tostatusError
before returning, ensuring the degraded condition will be set appropriately in the controller flow.
505-511
: Hard-coded expectedNodeCount is intentional by design.The use of
expectedNodeCount = 2
is correct for Two-Node OpenShift with Fencing architecture. This is a fundamental constraint, not a limitation to be refactored.Based on learnings
Actionable comments posted: 5
🧹 Nitpick comments (4)
pkg/tnf/pkg/jobs/cronjob_controller.go (1)
167-175
: Prefer structured mutation over text templating (reduces brittleness).Parse the manifest once, then set fields on the CronJob object (name, namespace, schedule, image, command) instead of line‑based replacements. This avoids YAML formatting pitfalls and future template drift.
Example:
cronJob := ReadCronJobV1OrDie(c.manifest) cronJob.Name = c.name cronJob.Namespace = c.namespace cronJob.Spec.Schedule = c.schedule if len(cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers) > 0 { ct := &cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers[0] ct.Image = c.image ct.Command = append([]string(nil), c.command...) } return cronJobpkg/tnf/pacemaker/healthcheck.go (2)
306-309
: Remove unused lastProcessedStatus or use it meaningfully.State is written but never read; drop it or leverage it to suppress duplicate event processing based on CR changes.
Also applies to: 433-436
515-523
: Optional: clarify standby-as-offline in a code comment.Behavior is intentional; a brief inline comment above the loop will aid future readers.
pkg/cmd/pacemakerstatuscollector/collector.go (1)
165-177
: Remove unused Kubernetes client wiring to simplify.You construct a client but don’t use it; updatePacemakerStatusCR rebuilds REST config anyway. Simplify by dropping the client parameter and the extra creation.
Apply:
@@ - // Create Kubernetes client - client, err := createKubernetesClient() - if err != nil { - return fmt.Errorf("failed to create kubernetes client: %w", err) - } - // Update PacemakerStatus CR - if err := updatePacemakerStatusCR(ctx, client, rawXML, summary, collectionError); err != nil { + if err := updatePacemakerStatusCR(ctx, rawXML, summary, collectionError); err != nil { return fmt.Errorf("failed to update PacemakerStatus CR: %w", err) }And change the function signature:
-func updatePacemakerStatusCR(ctx context.Context, client *kubernetes.Clientset, rawXML string, summary v1alpha1.PacemakerSummary, collectionError string) error { +func updatePacemakerStatusCR(ctx context.Context, rawXML string, summary v1alpha1.PacemakerSummary, collectionError string) error {Optionally remove createKubernetesClient if unused elsewhere.
Also applies to: 321-321
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (15)
bindata/tnfdeployment/cronjob.yaml
(1 hunks)cmd/cluster-etcd-operator/main.go
(2 hunks)manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
(1 hunks)pkg/cmd/pacemakerstatuscollector/collector.go
(1 hunks)pkg/operator/starter.go
(1 hunks)pkg/tnf/operator/starter.go
(4 hunks)pkg/tnf/operator/starter_test.go
(4 hunks)pkg/tnf/pacemaker/healthcheck.go
(1 hunks)pkg/tnf/pacemaker/healthcheck_test.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/doc.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/register.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/types.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go
(1 hunks)pkg/tnf/pkg/jobs/batch.go
(1 hunks)pkg/tnf/pkg/jobs/cronjob_controller.go
(1 hunks)
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml
[medium] 1-47: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 1-47: Container should not be privileged
(CKV_K8S_16)
[medium] 1-47: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 1-47: Containers should not share the host process ID namespace
(CKV_K8S_17)
🔇 Additional comments (2)
pkg/tnf/pacemaker/v1alpha1/register.go (1)
24-31
: LGTM: scheme registration is correct.Types registered, GroupVersion wired, and metav1 added. No issues.
pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)
12-121
: Autogenerated deepcopy looks consistent.Methods cover all types; list/items deep-copied properly.
	err = restClient.Get().
		Resource("pacemakerstatuses").
		Name(pacemakerStatusName).
		Do(ctx).
		Into(&existing)

	now := metav1.Now()

	if err != nil {
		// Create new PacemakerStatus if it doesn't exist
		if strings.Contains(err.Error(), "not found") {
			newStatus := &v1alpha1.PacemakerStatus{
				TypeMeta: metav1.TypeMeta{
					APIVersion: v1alpha1.SchemeGroupVersion.String(),
					Kind:       "PacemakerStatus",
				},
				ObjectMeta: metav1.ObjectMeta{
					Name: pacemakerStatusName,
				},
				Status: v1alpha1.PacemakerStatusStatus{
					LastUpdated:     now,
					RawXML:          rawXML,
					CollectionError: collectionError,
					Summary:         summary,
				},
			}

			result := restClient.Post().
				Resource("pacemakerstatuses").
				Body(newStatus).
				Do(ctx)

			if result.Error() != nil {
				return fmt.Errorf("failed to create PacemakerStatus: %w", result.Error())
			}

			klog.Info("Created new PacemakerStatus CR")
			return nil
		}
		return fmt.Errorf("failed to get PacemakerStatus: %w", err)
	}
Use apierrors.IsNotFound instead of string matching.
String checks are brittle. Prefer proper error typing.
Apply:
@@
-import (
+import (
"context"
"encoding/xml"
"fmt"
"os"
"strings"
"time"
"github.com/spf13/cobra"
+ apierrors "k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
@@
- if err != nil {
+ if err != nil {
// Create new PacemakerStatus if it doesn't exist
- if strings.Contains(err.Error(), "not found") {
+ if apierrors.IsNotFound(err) {
newStatus := &v1alpha1.PacemakerStatus{
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In pkg/cmd/pacemakerstatuscollector/collector.go around lines 352 to 392,
replace the brittle string check strings.Contains(err.Error(), "not found") with
a proper API error type check using apierrors.IsNotFound(err); import
"k8s.io/apimachinery/pkg/api/errors" as apierrors if not already imported, and
keep the existing error handling logic path the same (create the new
PacemakerStatus on NotFound, otherwise return the wrapped error).
import (
	"context"
	"fmt"
	"strings"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"

	operatorv1 "github.com/openshift/api/operator/v1"
	"github.com/openshift/library-go/pkg/controller/factory"
	"github.com/openshift/library-go/pkg/operator/events"
	"github.com/openshift/library-go/pkg/operator/v1helpers"
)
🛠️ Refactor suggestion | 🟠 Major
Add missing import for strconv used in escaping.
@@
-import (
+import (
"context"
"fmt"
+ "strconv"
"strings"
"time"
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
import (
	"context"
	"fmt"
	"strconv"
	"strings"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"

	operatorv1 "github.com/openshift/api/operator/v1"
	"github.com/openshift/library-go/pkg/controller/factory"
	"github.com/openshift/library-go/pkg/operator/events"
	"github.com/openshift/library-go/pkg/operator/v1helpers"
)
🤖 Prompt for AI Agents
In pkg/tnf/pkg/jobs/cronjob_controller.go around lines 3 to 19, the code uses
strconv for escaping but the strconv package is not imported; add "strconv" to
the import block (alongside context, fmt, strings, time, etc.) so the file
compiles and the escaping call resolves; reorder/group imports if needed to
satisfy linters.
	Schedule string   // Cron schedule expression (e.g., "*/30 * * * * *" for every 30 seconds)
	Command  []string // Command to run in the container
	Image    string   // Container image to use
}
Cron schedule comment is misleading; Kubernetes CronJob uses 5 fields (no seconds).
Update the example to a valid 5‑field cron (e.g., "*/30 * * * *") to avoid confusion.
🤖 Prompt for AI Agents
In pkg/tnf/pkg/jobs/cronjob_controller.go around lines 48 to 51, the comment for
Schedule shows a 6-field cron expression including seconds which is incorrect
for Kubernetes CronJob (it expects a standard 5-field expression). Update the
example to a valid 5-field expression such as "*/30 * * * *" and adjust the
parenthetical text to note it's a 5-field cron (e.g., "*/30 * * * *" for every
30 minutes/seconds depending on scheduler) — specifically replace the existing
example and wording so the comment accurately reflects Kubernetes CronJob
syntax.
// injectValues replaces <injected> placeholders in the manifest with actual values
func (c *CronJobController) injectValues(manifest []byte) []byte {
	result := string(manifest)

	// Build command string for YAML array format
	var commandBuilder strings.Builder
	commandBuilder.WriteString("[")
	for i, cmd := range c.command {
		if i > 0 {
			commandBuilder.WriteString(", ")
		}
		commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd))
	}
	commandBuilder.WriteString("]")
	commandStr := commandBuilder.String()

	// Track which line we're on to do context-aware replacement
	lines := strings.Split(result, "\n")
	for i, line := range lines {
		// Replace name (appears in metadata.name)
		if strings.Contains(line, "name: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.name, 1)
		}
		// Replace app label
		if strings.Contains(line, "app.kubernetes.io/name: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.name, 1)
		}
		// Replace schedule
		if strings.Contains(line, "schedule: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.schedule, 1)
		}
		// Replace image
		if strings.Contains(line, "image: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.image, 1)
		}
		// Replace command (special handling for array)
		if strings.Contains(line, "command: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", commandStr, 1)
		}
	}

	return []byte(strings.Join(lines, "\n"))
}
🛠️ Refactor suggestion | 🟠 Major
Fix YAML injection/escaping in command and support namespace injection.
- Current command quoting uses "%s" without escaping; quotes/backslashes in args will break YAML.
- Namespace in manifest isn’t injected; if config.Namespace != manifest’s namespace, create/update can fail (path vs object namespace mismatch).
Apply:
@@
-func (c *CronJobController) injectValues(manifest []byte) []byte {
+func (c *CronJobController) injectValues(manifest []byte) []byte {
result := string(manifest)
// Build command string for YAML array format
- var commandBuilder strings.Builder
- commandBuilder.WriteString("[")
- for i, cmd := range c.command {
- if i > 0 {
- commandBuilder.WriteString(", ")
- }
- commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd))
- }
- commandBuilder.WriteString("]")
- commandStr := commandBuilder.String()
+ var commandBuilder strings.Builder
+ commandBuilder.WriteString("[")
+ for i, cmd := range c.command {
+ if i > 0 {
+ commandBuilder.WriteString(", ")
+ }
+ // Use proper escaping for YAML/JSON-compatible strings
+ commandBuilder.WriteString(strconv.Quote(cmd))
+ }
+ commandBuilder.WriteString("]")
+ commandStr := commandBuilder.String()
@@
for i, line := range lines {
+ // Replace namespace if present in template
+ if strings.Contains(line, "namespace: <injected>") {
+ lines[i] = strings.Replace(line, "<injected>", c.namespace, 1)
+ }
// Replace name (appears in metadata.name)
if strings.Contains(line, "name: <injected>") {
lines[i] = strings.Replace(line, "<injected>", c.name, 1)
}
Note: add strconv import (see separate comment).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// injectValues replaces <injected> placeholders in the manifest with actual values
func (c *CronJobController) injectValues(manifest []byte) []byte {
	result := string(manifest)

	// Build command string for YAML array format
	var commandBuilder strings.Builder
	commandBuilder.WriteString("[")
	for i, cmd := range c.command {
		if i > 0 {
			commandBuilder.WriteString(", ")
		}
		// Use proper escaping for YAML/JSON-compatible strings
		commandBuilder.WriteString(strconv.Quote(cmd))
	}
	commandBuilder.WriteString("]")
	commandStr := commandBuilder.String()

	// Track which line we're on to do context-aware replacement
	lines := strings.Split(result, "\n")
	for i, line := range lines {
		// Replace namespace if present in template
		if strings.Contains(line, "namespace: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.namespace, 1)
		}
		// Replace name (appears in metadata.name)
		if strings.Contains(line, "name: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.name, 1)
		}
		// Replace app label
		if strings.Contains(line, "app.kubernetes.io/name: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.name, 1)
		}
		// Replace schedule
		if strings.Contains(line, "schedule: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.schedule, 1)
		}
		// Replace image
		if strings.Contains(line, "image: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.image, 1)
		}
		// Replace command (special handling for array)
		if strings.Contains(line, "command: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", commandStr, 1)
		}
	}

	return []byte(strings.Join(lines, "\n"))
}
🤖 Prompt for AI Agents
In pkg/tnf/pkg/jobs/cronjob_controller.go around lines 177 to 219, the command
array is built with fmt.Sprintf("%s") which doesn't escape quotes/backslashes
and will break YAML; also the manifest never injects a namespace. Change the
command builder to call strconv.Quote on each c.command element when
constructing commandStr (so arguments are properly escaped and wrapped), add
strconv to the import list, and add a replacement branch to replace "namespace:
<injected>" with c.namespace (or the controller's namespace field) so the
manifest namespace is injected consistently.
Force-pushed from 4aefd7d to b553bf0.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: jaypoulz The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Actionable comments posted: 1
♻️ Duplicate comments (4)
pkg/tnf/pkg/jobs/cronjob_controller.go (3)
3-19
: Missing strconv import needed for proper YAML escaping.The injectValues method at line 191 constructs YAML array strings using
fmt.Sprintf("%s")
without proper escaping. Special characters in command arguments (quotes, backslashes) will break YAML parsing. Importstrconv
and usestrconv.Quote
for escaping.
48-48
: Misleading cron schedule example includes seconds field.Kubernetes CronJob uses standard 5-field cron expressions (minute hour day month weekday), not 6-field (with seconds). The example
"*/30 * * * * *"
is invalid. Use"*/30 * * * *"
for every 30 minutes.
180-222
: Fix YAML injection vulnerabilities and add namespace injection.Three critical issues remain from previous review:
Command escaping (line 191): Using
fmt.Sprintf("%s")
without escaping will break YAML when command args contain quotes or backslashes. Usestrconv.Quote(cmd)
instead.Namespace not injected: The manifest likely has a
namespace: <injected>
placeholder that needs replacement, but the loop has no case for it. Add namespace replacement or the CronJob will be created in wrong namespace.Schedule escaping (line 209): Wrapping in quotes helps but doesn't escape special chars in the schedule value itself.
Apply this diff:
import ( "context" "fmt" + "strconv" "strings" "time"
for i, line := range lines { + // Replace namespace if present in template + if strings.Contains(line, "namespace: <injected>") { + lines[i] = strings.Replace(line, "<injected>", c.namespace, 1) + } // Replace name (appears in metadata.name) if strings.Contains(line, "name: <injected>") { lines[i] = strings.Replace(line, "<injected>", c.name, 1)for i, cmd := range c.command { if i > 0 { commandBuilder.WriteString(", ") } - commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd)) + commandBuilder.WriteString(strconv.Quote(cmd)) }pkg/cmd/pacemakerstatuscollector/collector.go (1)
321-414
: Replace string matching with proper error type check.At line 362, the code uses brittle string matching
strings.Contains(err.Error(), "not found")
to detect NotFound errors. Useapierrors.IsNotFound(err)
instead.Apply this diff:
import ( "context" "encoding/xml" "fmt" "os" "strings" "time" "github.com/spf13/cobra" + apierrors "k8s.io/apimachinery/pkg/api/errors" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
if err != nil { // Create new PacemakerStatus if it doesn't exist - if strings.Contains(err.Error(), "not found") { + if apierrors.IsNotFound(err) { newStatus := &v1alpha1.PacemakerStatus{
🧹 Nitpick comments (1)
pkg/cmd/pacemakerstatuscollector/collector.go (1)
52-138
: Acceptable duplication to avoid circular dependencies.XML types are duplicated from healthcheck.go as noted in the comment. Consider extracting to a shared internal package in future refactoring to reduce duplication.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (15)
bindata/tnfdeployment/cronjob.yaml
(1 hunks)cmd/cluster-etcd-operator/main.go
(2 hunks)manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
(1 hunks)pkg/cmd/pacemakerstatuscollector/collector.go
(1 hunks)pkg/operator/starter.go
(1 hunks)pkg/tnf/operator/starter.go
(4 hunks)pkg/tnf/operator/starter_test.go
(4 hunks)pkg/tnf/pacemaker/healthcheck.go
(1 hunks)pkg/tnf/pacemaker/healthcheck_test.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/doc.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/register.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/types.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go
(1 hunks)pkg/tnf/pkg/jobs/batch.go
(1 hunks)pkg/tnf/pkg/jobs/cronjob_controller.go
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- pkg/tnf/pacemaker/v1alpha1/doc.go
- pkg/operator/starter.go
- pkg/tnf/pkg/jobs/batch.go
- cmd/cluster-etcd-operator/main.go
- manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
- pkg/tnf/operator/starter_test.go
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml
[medium] 1-47: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 1-47: Container should not be privileged
(CKV_K8S_16)
[medium] 1-47: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 1-47: Containers should not share the host process ID namespace
(CKV_K8S_17)
🔇 Additional comments (42)
pkg/tnf/pkg/jobs/cronjob_controller.go (7)
53-88
: LGTM!The constructor properly sets defaults, wires dependencies, and follows library-go factory patterns.
90-111
: LGTM!The sync logic correctly checks operator management state and handles errors appropriately.
113-132
: LGTM!The reconciliation logic follows standard Kubernetes controller patterns.
134-150
: LGTM!Creation logic properly handles AlreadyExists and records appropriate events.
152-165
: LGTM!Update logic correctly preserves ResourceVersion and records events.
167-178
: LGTM!Build logic is clear and includes helpful debug logging.
224-249
: LGTM!The update detection logic correctly compares schedule, command, and image with proper bounds checking.
pkg/tnf/pacemaker/healthcheck.go (22)
28-99
: LGTM!Constants are well-defined. The
StatusError
constant was appropriately added. TheexpectedNodeCount = 2
is intentional per the Two-Node OpenShift architecture design.Based on learnings.
101-273
: LGTM!XML parsing structures are comprehensive and follow standard Go XML unmarshaling conventions.
274-288
: LGTM!HealthStatus model and helper function are appropriate. Unknown status for collection failures is intentional per maintainer decision.
290-309
: LGTM!The HealthCheck structure is well-designed with proper mutex-protected state for concurrent access.
311-383
: LGTM!The constructor properly configures the REST client, informer, and controller factory with appropriate wiring.
385-405
: LGTM!The sync function follows a clear flow: retrieve status, update operator conditions, record events.
407-450
: LGTM!Status retrieval logic properly validates staleness and enforces XML size limits. The maxXMLSize check at line 439 prevents XML bomb parsing exploits; while the data is already in memory, this is acceptable for internal trusted sources.
452-499
: LGTM!XML parsing correctly validates structure and sets
StatusError
when pacemaker is not running, addressing prior review feedback.
501-524
: LGTM!Node status collection correctly flags offline and standby nodes as errors, consistent with Two-Node OpenShift design where standby is treated as offline.
Based on learnings.
526-566
: LGTM!Resource status collection correctly tracks started resources across nodes with clear helper function structure.
568-574
: LGTM!Resource count validation correctly uses the intentional
expectedNodeCount
constant.Based on learnings.
576-600
: LGTM!Recent failure detection correctly uses OR logic (line 584) per prior review. Timestamp parse errors are logged and skipped, which is appropriate defensive behavior.
602-619
: LGTM!Fencing event collection follows consistent patterns with appropriate time-based filtering.
621-634
: LGTM!Overall status determination follows clear precedence: errors > warnings > healthy.
636-653
: LGTM!Operator status update correctly sets degraded condition for
StatusError
, addressing prior review feedback.
655-664
: LGTM!Status message logging is clear and appropriate.
666-680
: LGTM!Degraded condition setting uses a scoped condition type (
PacemakerHealthCheckDegraded
at line 91), which prevents conflicts with other controllers' degraded conditions.
682-689
: LGTM!Condition clearing logic is correct.
691-701
: LGTM!Operator condition update uses standard library-go helpers correctly.
703-728
: LGTM!Event deduplication logic correctly cleans up stale entries and enforces time-based windows. The cleanup at lines 710-716 executes once per call, not per iteration of a caller's loop.
730-769
: LGTM!Event recording implements appropriate deduplication with context-aware windows (longer for fencing) and only records healthy events on transitions.
771-781
: LGTM!Warning event categorization correctly routes to appropriate event reasons.
pkg/tnf/pacemaker/v1alpha1/types.go (5)
1-5
: LGTM!Package structure and imports are standard for Kubernetes API types.
7-21
: LGTM!PacemakerStatus type follows Kubernetes CRD best practices with appropriate kubebuilder markers and subresource support.
23-50
: LGTM!Spec and Status types are well-designed with appropriate fields for status collection and error reporting. The RawXML field enables efficient parsing without re-executing commands.
52-77
: LGTM!PacemakerSummary provides a comprehensive high-level cluster state overview with well-documented fields.
79-86
: LGTM!PacemakerStatusList follows standard Kubernetes list type conventions.
pkg/cmd/pacemakerstatuscollector/collector.go (8)
1-50
: LGTM!Package structure, imports, and constants are well-organized. The pcs command uses
sudo -n
for non-interactive execution, which is good practice.
140-153
: LGTM!Command constructor follows standard cobra patterns with appropriate error handling.
155-177
: LGTM!The collector orchestration is clean with appropriate timeout and error handling.
179-217
: LGTM!Status collection properly handles timeouts, validates XML size, and gracefully handles parse errors while preserving raw XML.
219-265
: LGTM!Summary building logic correctly processes all node and resource states with clear counting logic.
267-286
: LGTM!Recent failure detection uses correct OR logic for status checks and handles timestamp parsing gracefully.
288-301
: LGTM!Recent fencing detection follows consistent patterns with appropriate time-based filtering.
303-319
: LGTM!Kubernetes client creation follows standard patterns with in-cluster and kubeconfig fallback.
Force-pushed from b553bf0 to d99fc51.
Actionable comments posted: 1
♻️ Duplicate comments (5)
pkg/tnf/operator/starter.go (1)
346-351
: Cron expression is invalid for Kubernetes CronJobs
"*/30 * * * * *"
has six fields (seconds). Kubernetesbatch/v1
CronJobs only accept the standard five-field spec, so this will be rejected and the collector never runs. Use a valid five-field schedule (e.g.,"* * * * *"
for once per minute and loop inside the job for 30‑second sampling) or another mechanism that supports sub-minute intervals.- Schedule: "*/30 * * * * *", // Every 30 seconds (Kubernetes CronJob format with seconds) + Schedule: "* * * * *", // Every minute; handle 30s cadence inside the job if requiredpkg/cmd/pacemakerstatuscollector/collector.go (1)
358-391
: Use API error helpers instead of string matching.
strings.Contains(err.Error(), "not found")
is fragile (locale/log variations) and can misclassify errors. Switch toapierrors.IsNotFound(err)
and importk8s.io/apimachinery/pkg/api/errors
.- if err != nil { - // Create new PacemakerStatus if it doesn't exist - if strings.Contains(err.Error(), "not found") { + if err != nil { + // Create new PacemakerStatus if it doesn't exist + if apierrors.IsNotFound(err) {pkg/tnf/pkg/jobs/cronjob_controller.go (3)
3-19
: Missing strconv import.The previous review comment about the missing
strconv
import still applies. This import is required for the proper escaping fix suggested in theinjectValues
method.
48-51
: Cron schedule comment is still misleading.The previous review comment about the incorrect 6-field cron expression example still applies. Kubernetes CronJob expects a 5-field expression.
180-222
: YAML injection and namespace issues remain.The previous review comments about improper escaping in command building and missing namespace injection still apply.
🧹 Nitpick comments (2)
pkg/tnf/operator/starter_test.go (1)
292-301
: Make the retry expectation observable
expectRetries
is populated but never asserted, so the test still passes even ifhandleNodesWithRetry
never performs additional attempts. Either surface the captured attempt counter (e.g., return it from the setup helper) and assert it againstexpectRetries
, or drop the field to avoid implying coverage that isn’t there.pkg/tnf/pkg/jobs/cronjob_controller.go (1)
90-111
: Sync method correctly checks operator state.The implementation appropriately verifies the operator is in Managed state before ensuring the CronJob. The early return for non-Managed states is a sound approach.
Optional: Consider whether existing CronJobs should be cleaned up when the operator transitions to Unmanaged or Removed states. The current behavior preserves existing resources, which may be intentional.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (19)
bindata/tnfdeployment/cronjob.yaml
(1 hunks)cmd/cluster-etcd-operator/main.go
(2 hunks)manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
(1 hunks)pkg/cmd/pacemakerstatuscollector/collector.go
(1 hunks)pkg/operator/starter.go
(1 hunks)pkg/tnf/operator/starter.go
(9 hunks)pkg/tnf/operator/starter_test.go
(6 hunks)pkg/tnf/pacemaker/healthcheck.go
(1 hunks)pkg/tnf/pacemaker/healthcheck_test.go
(1 hunks)pkg/tnf/pacemaker/testdata/healthy_cluster.xml
(1 hunks)pkg/tnf/pacemaker/testdata/offline_node.xml
(1 hunks)pkg/tnf/pacemaker/testdata/recent_failures_template.xml
(1 hunks)pkg/tnf/pacemaker/testdata/standby_node.xml
(1 hunks)pkg/tnf/pacemaker/v1alpha1/doc.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/register.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/types.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go
(1 hunks)pkg/tnf/pkg/jobs/batch.go
(1 hunks)pkg/tnf/pkg/jobs/cronjob_controller.go
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- pkg/tnf/pacemaker/testdata/recent_failures_template.xml
- pkg/operator/starter.go
- pkg/tnf/pacemaker/testdata/offline_node.xml
- pkg/tnf/pacemaker/v1alpha1/register.go
- pkg/tnf/pacemaker/v1alpha1/doc.go
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml
[medium] 1-47: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 1-47: Container should not be privileged
(CKV_K8S_16)
[medium] 1-47: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 1-47: Containers should not share the host process ID namespace
(CKV_K8S_17)
🔇 Additional comments (8)
pkg/tnf/pkg/jobs/cronjob_controller.go (8)
21-27
: Constants are well-defined.The default resync interval and namespace values are appropriate for CronJob lifecycle management.
29-42
: Controller structure is well-organized.The
CronJobController
struct appropriately encapsulates all necessary clients and configuration for managing CronJob resources.
53-88
: Constructor is well-implemented.The constructor properly handles defaults, follows the factory pattern, and correctly sets up the controller with informers and sync intervals.
113-132
: CronJob reconciliation logic is sound.The method correctly handles the create-or-update pattern with appropriate error handling for NotFound cases.
134-150
: CronJob creation is well-handled.The method correctly creates the CronJob, handles AlreadyExists errors gracefully, and records events appropriately.
152-165
: CronJob update correctly preserves ResourceVersion.The method appropriately preserves the existing ResourceVersion before updating, which is essential for conflict-free updates.
167-178
: Manifest processing approach is sound.The method correctly injects values and parses the manifest. The debug logging at V(4) is a helpful troubleshooting aid.
224-249
: Update detection logic is correct and defensive.The method appropriately compares schedule, command, and image fields, with proper defensive checks for container existence before accessing array elements.
	// Get pacemaker status
	healthStatus, err := c.getPacemakerStatus(ctx)
	if err != nil {
		return err
	}

	// Update operator status conditions based on pacemaker status
	if err := c.updateOperatorStatus(ctx, healthStatus); err != nil {
		return err
	}

	// Record pacemaker health check events
	c.recordHealthCheckEvents(healthStatus)
|
Stale/failed collections never set degraded.
When getPacemakerStatus
returns StatusUnknown
(collection error, stale XML, oversized payload, etc.), updateOperatorStatus
hits the default branch and leaves the degraded condition untouched. That masks genuine outages (e.g., pcs unreachable) and contradicts the controller’s goal of surfacing health issues. Please map collection failures to StatusError
(either in getPacemakerStatus
or when propagating errors) so we call setDegradedCondition
.
🤖 Prompt for AI Agents
In pkg/tnf/pacemaker/healthcheck.go around lines 390 to 403, when
getPacemakerStatus returns StatusUnknown collection failures are not mapped to
degraded and thus updateOperatorStatus leaves degraded untouched; change the
propagation so that any StatusUnknown (collection error, stale/oversized
payload, unreachable pcs) is converted to StatusError before calling
updateOperatorStatus (or update updateOperatorStatus to treat StatusUnknown the
same as StatusError) so setDegradedCondition is invoked; implement the mapping
immediately after getPacemakerStatus returns (if healthStatus == StatusUnknown {
healthStatus = StatusError }) or add a case in updateOperatorStatus handling
StatusUnknown -> setDegradedCondition.
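A sketch of that mapping in sync, using the HealthStatus fields and status constants quoted in this review (illustrative, not a verified patch):

 	// Get pacemaker status
 	healthStatus, err := c.getPacemakerStatus(ctx)
 	if err != nil {
 		return err
 	}

+	// Treat failed or stale collections as errors so the degraded condition is set.
+	if healthStatus.OverallStatus == statusUnknown {
+		healthStatus.OverallStatus = statusError
+		healthStatus.Errors = append(healthStatus.Errors, "pacemaker status could not be collected or is stale")
+	}
+
 	// Update operator status conditions based on pacemaker status
 	if err := c.updateOperatorStatus(ctx, healthStatus); err != nil {
 		return err
 	}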
Change List
- Added a PacemakerHealthCheck controller which monitors pacemaker cluster status every 30 seconds by executing 'pcs status --full' and parsing node/resource health
- Updated operator condition to degraded when nodes are not online or critical resources (kubelet/etcd) are not started on all nodes
- Added event recording for warnings of recent failed resource actions (5min window) and fencing events (24hr window) with appropriate event reasons
- Included happy path testing
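A compact sketch of the event deduplication idea from that change list: remember when each event key was last emitted and skip repeats inside its window. The 5-minute and 24-hour windows mirror the ones named above; the type and method names here are illustrative, not the PR's actual ones.

package pacemaker

import (
	"sync"
	"time"
)

type eventDeduper struct {
	mu   sync.Mutex
	seen map[string]time.Time
}

// shouldRecord returns true when the key has not been emitted within window,
// and records the emission time when it does.
func (d *eventDeduper) shouldRecord(key string, window time.Duration) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.seen == nil {
		d.seen = map[string]time.Time{}
	}
	now := time.Now()
	if last, ok := d.seen[key]; ok && now.Sub(last) < window {
		return false
	}
	d.seen[key] = now
	return true
}

// Usage, following the windows in the change list:
//   failed resource actions: d.shouldRecord("warning:"+msg, 5*time.Minute)
//   fencing events:          d.shouldRecord("fencing:"+msg, 24*time.Hour)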
Force-pushed from d99fc51 to 59eaec8.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/tnf/operator/starter_test.go (1)
18-20
: Remove duplicate import of the same packageThe same import path is aliased twice (config/v1). Use a single alias.
Apply this diff:
- configv1informers "github.com/openshift/client-go/config/informers/externalversions/config/v1" - v1 "github.com/openshift/client-go/config/informers/externalversions/config/v1" + configv1informers "github.com/openshift/client-go/config/informers/externalversions/config/v1" @@ - networkInformer v1.NetworkInformer + networkInformer configv1informers.NetworkInformerAlso applies to: 55-56
♻️ Duplicate comments (5)
pkg/tnf/pkg/jobs/cronjob_controller.go (3)
3-19
: Add strconv import for proper command escaping.The
injectValues
method (line 191) builds command strings without proper escaping. To safely handle command arguments containing quotes, backslashes, or other special characters, you'll need thestrconv
package.Add the import:
import ( "context" "fmt" + "strconv" "strings" "time"
Then use
strconv.Quote(cmd)
in the command builder (see separate comment on lines 180-222).
48-48
: Fix cron schedule comment: Kubernetes uses 5 fields, not 6.The example
"*/30 * * * * *"
shows 6 fields (including seconds), but Kubernetes CronJob uses the standard 5-field cron format (minute, hour, day, month, weekday). This will confuse users.Update the comment:
- Schedule string // Cron schedule expression (e.g., "*/30 * * * * *" for every 30 seconds) + Schedule string // Cron schedule expression (e.g., "*/30 * * * *" for every 30 minutes)
180-222
: Critical: Fix command escaping and add namespace injection.Two issues in
injectValues
:
Command escaping: Line 191 uses
fmt.Sprintf('"%s"', cmd)
which doesn't escape quotes or backslashes. If any command argument contains"
or\
, the generated YAML will be malformed.Missing namespace injection: The manifest has
namespace: openshift-etcd
(line 5 in cronjob.yaml), but this is never replaced withc.namespace
, causing a mismatch whenCronJobConfig.Namespace
differs from the hardcoded value.Apply this fix:
func (c *CronJobController) injectValues(manifest []byte) []byte { result := string(manifest) // Build command string for YAML array format var commandBuilder strings.Builder commandBuilder.WriteString("[") for i, cmd := range c.command { if i > 0 { commandBuilder.WriteString(", ") } - commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd)) + // Use strconv.Quote for proper YAML/JSON escaping + commandBuilder.WriteString(strconv.Quote(cmd)) } commandBuilder.WriteString("]") commandStr := commandBuilder.String() // Track which line we're on to do context-aware replacement lines := strings.Split(result, "\n") for i, line := range lines { + // Replace namespace + if strings.Contains(line, "namespace: <injected>") { + lines[i] = strings.Replace(line, "<injected>", c.namespace, 1) + } // Replace name (appears in metadata.name) if strings.Contains(line, "name: <injected>") {Note: This requires the
strconv
import (see separate comment) and changing line 5 in cronjob.yaml tonamespace: <injected>
.pkg/tnf/operator/starter.go (1)
334-364
: Cron schedule fix LGTMSchedule now uses valid 5-field spec ("* * * * *"). This resolves the prior invalid 6-field seconds issue.
pkg/cmd/pacemakerstatuscollector/collector.go (1)
360-392
: Use apierrors.IsNotFound instead of string matching.String checks are brittle; prefer typed API errors.
-import ( +import ( ... - "strings" ... + apierrors "k8s.io/apimachinery/pkg/api/errors" ) ... - if err != nil { + if err != nil { // Create new PacemakerStatus if it doesn't exist - if strings.Contains(err.Error(), "not found") { + if apierrors.IsNotFound(err) { newStatus := &v1alpha1.PacemakerStatus{ ... } ... } return fmt.Errorf("failed to get PacemakerStatus: %w", err) }
🧹 Nitpick comments (20)
manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (2)
32-39
: Make spec.nodeName requiredControllers likely rely on spec.nodeName; encode that contract in the CRD for early validation.
Apply this diff:
spec: description: PacemakerStatusSpec defines the desired state of PacemakerStatus type: object + required: + - nodeName properties: nodeName: description: NodeName identifies which node this status is for type: string
15-18
: Add printer columns for quick status at a glanceImprove kubectl get output for operators and SREs.
Apply this diff:
versions: - name: v1alpha1 served: true storage: true + additionalPrinterColumns: + - name: Node + type: string + jsonPath: .spec.nodeName + - name: Pacemakerd + type: string + jsonPath: .status.summary.pacemakerdState + - name: Quorum + type: string + jsonPath: .status.summary.hasQuorum + - name: Nodes + type: string + jsonPath: .status.summary.nodesOnline + description: Online/Total + priority: 1 + - name: Updated + type: date + jsonPath: .status.lastUpdatedAlso applies to: 58-85
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)
1-1
: Optional: align command attributionThe request attribute shows crm_mon XML while docs mention pcs status; consider noting pcs→crm_mon indirection or aligning text for consistency.
pkg/tnf/operator/starter_test.go (1)
260-271
: Avoid live API dependencies in unit testsrunPacemakerControllers is started and the health checker may construct clients from KubeConfig. A bare rest.Config{Host: "..."} can cause dial/SSL errors and flakiness.
Options:
- Inject a dynamic.Interface into the health checker and pass fakedynamic in tests.
- Or gate runPacemakerControllers behind a test knob (function var) so unit tests stub it out.
If you keep KubeConfig in tests, set it to a dummy and ensure health code never dials when running under fake clients.
Also applies to: 96-110
pkg/tnf/operator/starter.go (2)
338-352
: 30-second sampling claim vs 1-minute CronPR text says “every 30 seconds,” but Cron runs every minute. Ensure the collector loops internally (e.g., two samples with 30s sleep) or update docs.
If needed, document the intra-job loop or add an explicit flag to the collector for 30s cadence.
355-364
: Defensive start for health controllerIf controllerContext.KubeConfig is nil or misconfigured, NewHealthCheck/Run should no-op or degrade gracefully to avoid tight error loops.
Add a nil/validation check and early return with a Warning event.
pkg/tnf/pacemaker/healthcheck_test.go (5)
31-61
: Mark helpers as test helpers.Call t.Helper() in helper functions to improve failure locations and readability.
func createTestHealthCheck() *HealthCheck { + testingT, _ := any(nil).(interface{ Helper() }) + if testingT != nil { + testingT.Helper() + }Apply similarly in createTestHealthCheckWithMockStatus, loadTestXML, and getRecentFailuresXML.
119-140
: Stabilize time-dependent tests with an injected clock.Using time.Now() risks flakes around cutoff windows. Prefer a controllable clock or pass times into helpers.
-func getRecentFailuresXML(t *testing.T) string { +func getRecentFailuresXMLAt(t *testing.T, now time.Time) string { // Load the template... - now := time.Now() recentTime := now.Add(-1 * time.Minute)... ... return xmlContent } + +func getRecentFailuresXML(t *testing.T) string { + return getRecentFailuresXMLAt(t, time.Now()) +}
277-310
: Add coverage for Unknown → no condition change (locks intent).Given Unknown is intentional for collection/staleness, add a subtest asserting degraded condition is not updated.
// Test warning status (should not update operator condition) status = &HealthStatus{ OverallStatus: statusWarning, Warnings: []string{"Test warning"}, Errors: []string{}, } err = controller.updateOperatorStatus(ctx, status) require.NoError(t, err, "updateOperatorAvailability should not return an error") + + // Test unknown status (should not update operator condition) + status = &HealthStatus{ + OverallStatus: statusUnknown, + Warnings: []string{}, + Errors: []string{"collection failed"}, + } + err = controller.updateOperatorStatus(ctx, status) + require.NoError(t, err)
338-405
: Avoid brittle concrete type assertion for recorder.Casting to events.InMemoryRecorder couples tests to implementation. Prefer asserting via the public Recorder and inspecting emitted events through its API or provide an accessor on HealthCheck for tests.
862-873
: Test the max-size guard via getPacemakerStatus path too.parsePacemakerStatusXML doesn’t enforce size; the guard lives in getPacemakerStatus. Add a test invoking getPacemakerStatus with oversized RawXML to validate handling.
pkg/tnf/pacemaker/healthcheck.go (5)
428-447
: Explicit size check in parser for direct calls.parsePacemakerStatusXML is callable directly (tests do) but doesn’t enforce maxXMLSize; add a guard to fail fast on oversized input as in getPacemakerStatus.
func (c *HealthCheck) parsePacemakerStatusXML(xmlOutput string) (*HealthStatus, error) { status := &HealthStatus{ OverallStatus: statusUnknown, Warnings: []string{}, Errors: []string{} } + if len(xmlOutput) > maxXMLSize { + return nil, fmt.Errorf("XML output too large: %d bytes (max: %d bytes)", len(xmlOutput), maxXMLSize) + } // Parse XML var result PacemakerResult
509-523
: Document standby-as-offline behavior inline.Add a brief comment so future maintainers don’t “fix” this intentionally strict check.
for _, node := range result.Nodes.Node { - if node.Online != booleanValueTrue { + // Treat standby as offline by design (standby nodes must not run critical resources) + if node.Online != booleanValueTrue { status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", node.Name)) } if node.Standby == booleanValueTrue { status.Errors = append(status.Errors, fmt.Sprintf("Node %s is in standby (unexpected behavior; treated as offline)", node.Name)) } }
428-431
: Make staleness threshold a named constant tied to resync.2 minutes is embedded; derive from the controller resync or expose a constant for coherence and easier tuning.
const ( ... healthCheckResyncInterval = 30 * time.Second + statusStaleAfter = 2 * time.Minute // consider 4 * healthCheckResyncInterval ) ... - if time.Since(result.Status.LastUpdated.Time) > 2*time.Minute { + if time.Since(result.Status.LastUpdated.Time) > statusStaleAfter {
703-729
: Run dedupe cleanup once per call, not per event.Cleanup executes for every event, adding overhead. Do it once at the start of recordHealthCheckEvents.
-func (c *HealthCheck) shouldRecordEvent(eventKey string, deduplicationWindow time.Duration) bool { - c.recordedEventsMu.Lock() - defer c.recordedEventsMu.Unlock() - now := time.Now() - // Clean up old entries... - maxWindow := eventDeduplicationWindowFencing - for key, timestamp := range c.recordedEvents { - if now.Sub(timestamp) > maxWindow { - delete(c.recordedEvents, key) - } - } +func (c *HealthCheck) shouldRecordEvent(eventKey string, deduplicationWindow time.Duration) bool { + c.recordedEventsMu.Lock() + defer c.recordedEventsMu.Unlock() + now := time.Now() // Check if this event was recently recorded if lastRecorded, exists := c.recordedEvents[eventKey]; exists { if now.Sub(lastRecorded) < deduplicationWindow { return false } } c.recordedEvents[eventKey] = now return true } func (c *HealthCheck) recordHealthCheckEvents(status *HealthStatus) { + // One-time cleanup per invocation + c.recordedEventsMu.Lock() + now := time.Now() + for key, ts := range c.recordedEvents { + if now.Sub(ts) > eventDeduplicationWindowFencing { + delete(c.recordedEvents, key) + } + } + c.recordedEventsMu.Unlock() // Record events for warnings...Also applies to: 730-781
101-221
: Avoid struct duplication across packages.Pacemaker XML structs are duplicated here and in the collector; they will drift. Extract them into a shared internal/package and reuse.
Also applies to: 52-60
pkg/cmd/pacemakerstatuscollector/collector.go (4)
24-50
: De-duplicate shared constants/types to prevent drift.maxXMLSize, time formats, and XML structs exist here and in healthcheck.go. Move them into a common package (e.g., pkg/tnf/pacemaker/internal/common) and import from both callers.
Also applies to: 31-33, 75-82
179-217
: Stream-limit pcs output to reduce peak memory.If exec package supports it, prefer a bounded reader or line-capped read to avoid large buffers; otherwise consider enhancing exec.Execute to enforce an output cap.
267-286
: Be tolerant of rc_text casing. Some environments emit RCText “OK”. Compare case-insensitively to avoid false positives.
- if operation.RC != operationRCSuccess || operation.RCText != operationRCTextSuccess { + if operation.RC != operationRCSuccess || !strings.EqualFold(operation.RCText, operationRCTextSuccess) {
303-319
: HOME may be empty; add a fallback. When building the kubeconfig path, guard against an empty HOME for non-login shells.
- kubeconfig := os.Getenv("KUBECONFIG") - if kubeconfig == "" { - kubeconfig = os.Getenv("HOME") + "/.kube/config" - } + kubeconfig := os.Getenv("KUBECONFIG") + if kubeconfig == "" { + if home := os.Getenv("HOME"); home != "" { + kubeconfig = home + "/.kube/config" + } + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (19)
bindata/tnfdeployment/cronjob.yaml
(1 hunks)cmd/cluster-etcd-operator/main.go
(2 hunks)manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
(1 hunks)pkg/cmd/pacemakerstatuscollector/collector.go
(1 hunks)pkg/operator/starter.go
(1 hunks)pkg/tnf/operator/starter.go
(4 hunks)pkg/tnf/operator/starter_test.go
(4 hunks)pkg/tnf/pacemaker/healthcheck.go
(1 hunks)pkg/tnf/pacemaker/healthcheck_test.go
(1 hunks)pkg/tnf/pacemaker/testdata/healthy_cluster.xml
(1 hunks)pkg/tnf/pacemaker/testdata/offline_node.xml
(1 hunks)pkg/tnf/pacemaker/testdata/recent_failures_template.xml
(1 hunks)pkg/tnf/pacemaker/testdata/standby_node.xml
(1 hunks)pkg/tnf/pacemaker/v1alpha1/doc.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/register.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/types.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go
(1 hunks)pkg/tnf/pkg/jobs/batch.go
(1 hunks)pkg/tnf/pkg/jobs/cronjob_controller.go
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- pkg/tnf/pacemaker/testdata/recent_failures_template.xml
🚧 Files skipped from review as they are similar to previous changes (6)
- cmd/cluster-etcd-operator/main.go
- pkg/tnf/pacemaker/v1alpha1/doc.go
- pkg/operator/starter.go
- pkg/tnf/pacemaker/testdata/offline_node.xml
- pkg/tnf/pacemaker/v1alpha1/types.go
- pkg/tnf/pacemaker/testdata/standby_node.xml
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml
[medium] 1-47: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 1-47: Container should not be privileged
(CKV_K8S_16)
[medium] 1-47: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 1-47: Containers should not share the host process ID namespace
(CKV_K8S_17)
🔇 Additional comments (5)
pkg/tnf/pkg/jobs/batch.go (1)
42-48
: LGTM: CronJob decoder follows established pattern.The function correctly mirrors
ReadJobV1OrDie
and reuses the existing codec factory. The panic-on-error behavior is appropriate for the "OrDie" convention.bindata/tnfdeployment/cronjob.yaml (1)
35-40
: Verify privileged settings are necessary for pacemaker operations.The static analysis tools flag the privileged security context, privilege escalation, and
hostPID: true
as security concerns. For pacemaker health checks that executepcs status
, these elevated privileges are likely necessary. However, please confirm:
- Does
pcs status --full
require privileged access and hostPID?- Can the command run with a more restrictive security context?
- Are there audit/compliance requirements that need documentation for these privileges?
If these privileges are confirmed necessary, consider adding a comment in the manifest explaining why elevated privileges are required.
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)
1-87
: Fixture looks good. Well-formed healthy-state XML; suitable for happy-path tests.
pkg/tnf/pacemaker/v1alpha1/register.go (1)
14-31
: Scheme registration looks correct. GroupVersion, AddKnownTypes, and AddToScheme are wired properly.
pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)
1-121
: Generated deepcopy looks correct. Covers nested structs, slices, and runtime.Object; no issues spotted.
kind: CronJob
metadata:
  name: <injected>
  namespace: openshift-etcd
Hardcoded namespace creates inconsistency with controller.
The namespace is hardcoded to openshift-etcd
, but the CronJobController
accepts a Namespace
field in CronJobConfig
(line 47 in cronjob_controller.go). The injectValues
method (lines 180-222 in cronjob_controller.go) doesn't replace the namespace placeholder, so if a caller provides a different namespace, the manifest and controller will be out of sync.
Change line 5 to use a placeholder:
- namespace: openshift-etcd
+ namespace: <injected>
Then update injectValues
in cronjob_controller.go to inject the namespace (this is already flagged in past review comments for that file).
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In bindata/tnfdeployment/cronjob.yaml around line 5, the namespace is hardcoded
to "openshift-etcd" which will diverge from the CronJobController's configurable
Namespace; replace that line with a namespace placeholder (e.g. namespace: {{
.Namespace }}). Then update cronjob_controller.go (see CronJobConfig Namespace
field around line 47 and injectValues method around lines 180-222) to ensure
injectValues replaces the namespace placeholder with the configured Namespace
value when rendering the manifest; ensure nil/empty Namespace falls back to the
original default if needed and that tests and manifest rendering use the same
placeholder key.
rawXML:
  description: RawXML contains the raw XML output from pcs status xml command
  type: string
collectionError:
  description: CollectionError contains any error encountered while collecting status
  type: string
summary:
Guard against oversized rawXML (1 MiB object size limit)
Storing full pcs/crm_mon XML can exceed Kubernetes object size limits, causing updates to fail under stress. Either truncate in the collector or constrain via schema.
Suggested schema bound; pair with collector-side truncation:
rawXML:
description: RawXML contains the raw XML output from pcs status xml command
type: string
+ maxLength: 262144 # ~256 KiB safety cap; adjust as needed
Optionally add a boolean status.summary.rawXMLTruncated and trim in the collector when exceeding the cap.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
rawXML:
  description: RawXML contains the raw XML output from pcs status xml command
  type: string
  maxLength: 262144 # ~256 KiB safety cap; adjust as needed
collectionError:
  description: CollectionError contains any error encountered while collecting status
  type: string
summary:
🤖 Prompt for AI Agents
In manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml around lines
51-57, the rawXML status field is unbounded and can exceed Kubernetes 1 MiB
object size limits; add a schema constraint (e.g. maxLength: 1048576) on rawXML
and introduce a new boolean status.summary.rawXMLTruncated field to indicate
truncation, and update the collector to truncate the XML to that cap before
writing the CR status (ensuring truncation preserves valid encoding/closing tags
as needed).
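A collector-side companion to the schema cap might look like the following; rawXMLCap and the idea of a summary.rawXMLTruncated flag are assumptions (that field does not exist in v1alpha1 today):

```go
package collector

const rawXMLCap = 262144 // assumed; keep in sync with the CRD maxLength

// truncateRawXML trims the collected XML before it is written to the
// PacemakerStatus CR and reports whether anything was dropped, so the caller
// can surface that via a (hypothetical) rawXMLTruncated summary field.
// Note: this slices at a byte boundary, which is fine for the ASCII XML pcs emits.
func truncateRawXML(rawXML string) (string, bool) {
	if len(rawXML) <= rawXMLCap {
		return rawXML, false
	}
	return rawXML[:rawXMLCap], true
}
```

The parsed summary still carries the health signal, so losing the tail of the raw payload only affects debugging detail.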
59eaec8 to 85b3229 Compare
Actionable comments posted: 0
♻️ Duplicate comments (4)
bindata/tnfdeployment/cronjob.yaml (1)
5-5
: Namespace hardcoding issue already flagged.This issue was already identified in a previous review.
pkg/tnf/pkg/jobs/cronjob_controller.go (2)
48-48
: Fix misleading cron schedule example – Kubernetes CronJob uses 5 fields, not 6.The example
"*/30 * * * * *"
suggests a 6-field cron with seconds, but Kubernetes CronJob uses standard 5-field cron. Change to"*/30 * * * *"
(every 30 minutes) and update the comment.Apply this diff:
- Schedule string // Cron schedule expression (e.g., "*/30 * * * * *" for every 30 seconds) + Schedule string // Cron schedule expression (e.g., "*/30 * * * *" for every 30 minutes)
180-222
: Fix YAML injection risk and add namespace injection.Two issues:
- YAML injection: Line 191 uses
fmt.Sprintf(\
"%s"`, cmd)which doesn't escape quotes/backslashes. If
cmdcontains
"or
`, the YAML breaks. Usestrconv.Quote
for proper escaping.- Missing namespace injection: The manifest never injects namespace. If the template contains
namespace: <injected>
, it won't be replaced.Apply this diff:
package jobs import ( "context" "fmt" + "strconv" "strings" "time"
func (c *CronJobController) injectValues(manifest []byte) []byte { result := string(manifest) // Build command string for YAML array format var commandBuilder strings.Builder commandBuilder.WriteString("[") for i, cmd := range c.command { if i > 0 { commandBuilder.WriteString(", ") } - commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd)) + commandBuilder.WriteString(strconv.Quote(cmd)) } commandBuilder.WriteString("]") commandStr := commandBuilder.String() // Track which line we're on to do context-aware replacement lines := strings.Split(result, "\n") for i, line := range lines { + // Replace namespace if present in template + if strings.Contains(line, "namespace: <injected>") { + lines[i] = strings.Replace(line, "<injected>", c.namespace, 1) + } // Replace name (appears in metadata.name) if strings.Contains(line, "name: <injected>") { lines[i] = strings.Replace(line, "<injected>", c.name, 1)pkg/cmd/pacemakerstatuscollector/collector.go (1)
360-362
: Use proper error type checking instead of string matching.Line 362 uses brittle string matching to detect not-found errors. Use
apierrors.IsNotFound(err)
for robust error classification.Apply this diff:
import ( "context" "encoding/xml" "fmt" "os" "strings" "time" "github.com/spf13/cobra" + apierrors "k8s.io/apimachinery/pkg/api/errors" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
if err != nil { // Create new PacemakerStatus if it doesn't exist - if strings.Contains(err.Error(), "not found") { + if apierrors.IsNotFound(err) { newStatus := &v1alpha1.PacemakerStatus{
🧹 Nitpick comments (2)
pkg/tnf/pacemaker/healthcheck.go (2)
401-402
: Consider passing context to recordHealthCheckEvents for consistency.While
recordHealthCheckEvents
currently performs no blocking I/O, passingctx
would future-proof the function and align with standard controller patterns where context flows through the call chain.Based on learnings
438-441
: XML size validation is defensive; etcd limits already constrain CR size.The size check here occurs after the XML is in memory (from the CR). However, etcd/apiserver limits (typically ~1.5MB) already prevent excessively large CRs from being stored. This validation is still useful as a secondary defense-in-depth measure.
For stricter protection, consider future enhancement using
io.LimitReader
when the collector writes the CR.Based on learnings
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (17)
bindata/etcd/etcd-sa-clusterrole.yaml
(1 hunks)bindata/etcd/etcd-sa-clusterrolebinding.yaml
(1 hunks)bindata/tnfdeployment/cronjob.yaml
(1 hunks)cmd/cluster-etcd-operator/main.go
(2 hunks)manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
(1 hunks)pkg/cmd/pacemakerstatuscollector/collector.go
(1 hunks)pkg/operator/starter.go
(2 hunks)pkg/tnf/operator/starter.go
(4 hunks)pkg/tnf/operator/starter_test.go
(4 hunks)pkg/tnf/pacemaker/healthcheck.go
(1 hunks)pkg/tnf/pacemaker/healthcheck_test.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/doc.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/register.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/types.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go
(1 hunks)pkg/tnf/pkg/jobs/batch.go
(1 hunks)pkg/tnf/pkg/jobs/cronjob_controller.go
(1 hunks)
✅ Files skipped from review due to trivial changes (2)
- pkg/tnf/pacemaker/v1alpha1/doc.go
- bindata/etcd/etcd-sa-clusterrolebinding.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
- manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
- pkg/tnf/pkg/jobs/batch.go
- pkg/tnf/pacemaker/v1alpha1/types.go
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml
[medium] 1-47: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 1-47: Container should not be privileged
(CKV_K8S_16)
[medium] 1-47: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 1-47: Containers should not share the host process ID namespace
(CKV_K8S_17)
🔇 Additional comments (34)
bindata/tnfdeployment/cronjob.yaml (1)
35-40
: Verify privileged access is necessary for pacemaker status collection.The CronJob runs with
privileged: true
,allowPrivilegeEscalation: true
, andhostPID: true
. While static analysis flags these as security concerns, they may be required for thepcs status
command to access the pacemaker cluster state on the host.Please confirm:
- Is host PID namespace access (
hostPID: true
) required for the pacemaker status collector to runpcs status
?- Can the privilege requirements be reduced (e.g., specific capabilities instead of fully privileged)?
- Is there documentation about the minimum privileges needed for
pcs status
execution?If these privileges are necessary, consider adding a comment in the manifest explaining why privileged access is required for operational clarity.
cmd/cluster-etcd-operator/main.go (1)
17-17
: LGTM!The pacemaker status collector command is properly imported and registered with the root command.
Also applies to: 82-82
pkg/operator/starter.go (2)
247-248
: LGTM!The RBAC manifests for the etcd service account are correctly added to the static resource controller, enabling PacemakerStatus resource access.
632-633
: LGTM!The function call is properly updated to pass the additional
dynamicClient
andAlivenessChecker
parameters required for Pacemaker controller initialization.bindata/etcd/etcd-sa-clusterrole.yaml (1)
1-24
: LGTM!The ClusterRole defines appropriate permissions for the pacemaker status collector to manage PacemakerStatus resources and their status subresource.
pkg/tnf/operator/starter_test.go (1)
35-35
: LGTM!The test is properly updated to:
- Import the necessary health package
- Create an
alivenessChecker
instance- Pass the additional parameters to
HandleDualReplicaClusters
- Initialize a minimal
rest.Config
for testingAlso applies to: 41-41, 96-109, 260-270
pkg/tnf/operator/starter.go (2)
38-38
: LGTM!The function signature is properly extended to accept
dynamicClient
andalivenessChecker
, and the newrunPacemakerControllers
function is correctly invoked to start Pacemaker-related controllers.Also applies to: 40-40, 76-77, 98-98
334-364
: Verify CronJob schedule aligns with requirements.The CronJob schedule is set to
"* * * * *"
(every minute), but the PR description states the controller "monitors pacemaker cluster status every 30 seconds."Please confirm:
- Is running every minute (instead of every 30 seconds) acceptable for your monitoring requirements?
- If 30-second intervals are required, consider either:
- Running the job more frequently (though Kubernetes CronJobs don't support sub-minute schedules natively)
- Implementing a loop inside the collector to sample twice per minute
- Using a different deployment pattern (e.g., Deployment with continuous monitoring)
If every minute is acceptable, update the PR description to match the actual schedule.
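If 30-second granularity ever becomes a hard requirement, the "loop inside the collector" option listed above could look roughly like this; collectAndUpdate is a hypothetical stand-in for the existing collect-and-post logic, not an actual function in this PR:

```go
package collector

import (
	"context"
	"time"

	"k8s.io/klog/v2"
)

// runTwicePerMinute samples twice within a single CronJob invocation, so a
// "* * * * *" schedule still yields roughly 30-second granularity.
func runTwicePerMinute(ctx context.Context, collectAndUpdate func(context.Context) error) error {
	for i := 0; i < 2; i++ {
		if err := collectAndUpdate(ctx); err != nil {
			klog.Errorf("collection attempt %d failed: %v", i+1, err)
		}
		if i == 0 {
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(30 * time.Second):
			}
		}
	}
	return nil
}
```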
pkg/tnf/pacemaker/v1alpha1/register.go (1)
1-31
: LGTM!The API registration follows standard Kubernetes patterns:
- Properly defines
GroupName
andVersion
constants- Initializes
SchemeGroupVersion
andSchemeBuilder
- Registers
PacemakerStatus
andPacemakerStatusList
types- Provides helper functions for resource registration
pkg/tnf/pacemaker/healthcheck_test.go (1)
1-903
: LGTM!This is a comprehensive test suite with excellent coverage:
- HealthCheck construction and initialization
- PacemakerStatus retrieval (healthy, errors, stale)
- Operator status updates
- Event recording with proper deduplication across different time windows
- XML parsing for various cluster states (healthy, offline, standby, failures)
- Node and resource status collection
- Edge cases (large XML, invalid XML, missing state)
The use of testdata fixtures and dynamic timestamp generation for time-sensitive tests (like
getRecentFailuresXML
) is a good practice.pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)
1-121
: LGTM: Autogenerated deepcopy implementations are correct.The generated deepcopy methods follow standard Kubernetes patterns: proper nil checks, correct slice allocation for
PacemakerStatusList.Items
, nestedDeepCopyInto
calls for embedded types, and appropriateDeepCopyObject
implementations.pkg/tnf/pacemaker/healthcheck.go (12)
1-99
: LGTM: Imports and constants are well-organized.The constants are clearly defined and documented. The
expectedNodeCount = 2
reflects the Two-Node OpenShift with Fencing design constraint, as clarified in past discussions.
311-383
: LGTM: Controller constructor follows standard operator patterns.The REST client and informer setup are correct. Using
panic
for scheme registration and client creation errors is appropriate for initialization failures that should halt startup.
576-600
: LGTM: Continue on parse errors is the correct resilient behavior.Continuing when timestamp parsing fails (line 590) ensures the health check processes other valid operations. The error is logged at V(4) for debugging. Failing hard on malformed timestamps would be too strict and could mask legitimate failures in other operations.
Based on learnings
682-689
: Condition ownership is established by the unique type name.The condition type
PacemakerHealthCheckDegraded
(line 91) is unique to this controller, establishing clear ownership. Per Kubernetes conventions, condition types should be managed by a single controller, so clearing without checking the reason is safe.Based on learnings
703-728
: LGTM: Event cleanup logic is correct.The cleanup (lines 712-716) runs unconditionally on every call and only removes entries older than
maxWindow
(1 hour). This is intentional and correct – it prevents unbounded map growth while preserving recent events within their deduplication windows.Based on learnings
101-273
: LGTM: XML structures are well-defined for pacemaker status parsing.The types correctly mirror the pacemaker XML schema and use appropriate xml tags for unmarshaling.
452-499
: LGTM: Parsing orchestration is well-structured.The function properly validates critical structures, checks pacemaker state, and delegates to specialized collectors. Early return on non-running state with explicit error status is appropriate.
501-524
: LGTM: Node status collection correctly handles the Two-Node design.The checks for expected node count (with warning for transient 1-node state during replacement) and online/standby validation align with the Two-Node OpenShift architecture requirements.
526-574
: LGTM: Resource validation correctly identifies kubelet and etcd placement.The helper function pattern and resource agent prefix matching properly track resource distribution across nodes.
602-634
: LGTM: Fencing and status determination logic is sound.Fencing event detection with appropriate time window (24 hours) and simple status determination based on error/warning presence are appropriate for this health check.
636-701
: LGTM: Operator condition management follows library-go patterns.Condition updates use the standard
v1helpers.UpdateStaticPodStatus
API correctly. The separation of set/clear/update functions is clean.
730-781
: LGTM: Event recording with deduplication is well-implemented.The event categorization (fencing vs. failed actions) with distinct reasons and deduplication windows, plus tracking previous status to record healthy transitions, is thorough and production-ready.
pkg/tnf/pkg/jobs/cronjob_controller.go (3)
21-88
: LGTM: Controller setup follows standard factory patterns.The constants, types, and constructor are well-organized. Default namespace fallback and resync interval are appropriate.
90-178
: LGTM: CronJob lifecycle management is correct.The sync/ensure/create/update/build flow properly handles the CronJob lifecycle, including AlreadyExists handling and ResourceVersion preservation for updates.
224-249
: LGTM: Update detection correctly compares schedule, command, and image.The comparison logic properly checks all three configurable fields and handles array length mismatches.
pkg/cmd/pacemakerstatuscollector/collector.go (8)
1-50
: LGTM: Imports and constants are well-organized.The constants clearly define timeouts, size limits, and time formats. The command uses
sudo -n
for non-interactive execution, which is correct.
52-138
: LGTM: XML structure duplication is acceptable to avoid circular dependencies.The comment (line 53) explains why these types are duplicated from healthcheck.go. This is a reasonable trade-off for maintaining clean package dependencies.
140-177
: LGTM: Command structure and runCollector orchestration are sound.The 30-second context timeout (line 156) provides a reasonable upper bound for collection and CR update operations. Error handling propagates failures appropriately.
179-217
: LGTM: Collection logic is robust and defensive.The nested timeout (10s for exec, line 181), size validation (line 196), and returning raw XML even on parse failure (line 210) make this resilient to various failure modes.
219-265
: LGTM: Summary building comprehensively captures cluster state.The function correctly counts online nodes, started resources, and checks for recent failures/fencing events. The logic aligns with the health check expectations.
267-301
: LGTM: Time-window checks for failures and fencing are correct.The time window constants (5 minutes for failures, 24 hours for fencing) match the health check controller. Error handling continues on parse failure, which is appropriate.
303-319
: LGTM: Kubernetes client creation with in-cluster fallback is standard.The in-cluster config attempt followed by kubeconfig fallback correctly supports both in-cluster (Job/CronJob) and local execution contexts.
363-414
: LGTM: CR update logic correctly handles create vs. update paths.The create path (lines 363-389) properly sets TypeMeta and ObjectMeta. The update path (lines 394-413) correctly uses the status subresource and sets ObservedGeneration.
The goal of this change was to ensure that the CEO permissions for the pcs status cronjob were elevated enough to be able to gather pcs status output. One future improvement would be to move this out of the CEO entirely and have the status gathering be done by an alert agent that posts regular updates to the CEO. To alleviate the complexity of a future transition to this model, I've structured the code so that the healthcheck controller doesn't care where the posts to the PacemakerStatus CRD come from - it just monitors it for updates. In the future, we can write an alert agent that just posts new PacemakerStatus updates as they happen, and the CEO can retire the cronjob entirely.
85b3229 to 9f5f422 Compare
Actionable comments posted: 2
♻️ Duplicate comments (5)
bindata/tnfdeployment/cronjob.yaml (1)
5-5
: Hardcoded namespace creates inconsistency with controller.The namespace is hardcoded to
openshift-etcd
, but theCronJobController
accepts aNamespace
field inCronJobConfig
. The controller'sinjectValues
method doesn't replace the namespace placeholder, causing potential mismatches.As flagged in previous reviews: replace with a placeholder and update the controller to inject it.
pkg/tnf/pkg/jobs/cronjob_controller.go (3)
48-51
: Cron schedule comment shows invalid 6-field expression.Kubernetes CronJob expects a standard 5-field cron expression (minute, hour, day, month, weekday), not 6 fields. The example showing
"*/30 * * * * *"
with seconds is misleading.Update to a valid 5-field example like
"*/30 * * * *"
(every 30 minutes).
184-194
: Command array construction lacks proper YAML escaping.The command building uses
fmt.Sprintf("%s")
which doesn't escape quotes, backslashes, or other special characters. Arguments containing these characters will break the YAML syntax when injected.As flagged in previous reviews: use
strconv.Quote()
for each command element to ensure proper escaping, and add thestrconv
import.
196-222
: Namespace is not injected into the manifest.The
injectValues
method replaces placeholders for name, schedule, image, and command, but never injects the namespace. The manifest template has a hardcoded namespace, creating an inconsistency with the controller's configurableNamespace
field.As flagged in previous reviews: add a replacement branch to inject
c.namespace
when the line contains"namespace: <injected>"
.pkg/cmd/pacemakerstatuscollector/collector.go (1)
360-392
: Use apierrors.IsNotFound instead of string matching.The brittle string check
strings.Contains(err.Error(), "not found")
on line 362 should be replaced with proper API error type checking usingapierrors.IsNotFound(err)
.Apply this fix:
import ( "context" "encoding/xml" "fmt" "os" "strings" "time" "github.com/spf13/cobra" + apierrors "k8s.io/apimachinery/pkg/api/errors" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
if err != nil { // Create new PacemakerStatus if it doesn't exist - if strings.Contains(err.Error(), "not found") { + if apierrors.IsNotFound(err) { newStatus := &v1alpha1.PacemakerStatus{
🧹 Nitpick comments (6)
pkg/tnf/operator/starter.go (1)
313-346
: Defensive guard for KubeConfig to avoid panics.If controllerContext.KubeConfig is ever nil (tests or misconfig), NewHealthCheck will panic when creating the REST client. Add a fast check and skip starting the healthcheck in that case (log at Warning).
Example:
func runPacemakerControllers(..., etcdInformer operatorv1informers.EtcdInformer) { // Start the healthcheck controller to watch PacemakerStatus CRs // This can start early since it just subscribes to CR updates klog.Infof("starting Pacemaker healthcheck controller") - healthCheckController := pacemaker.NewHealthCheck( + if controllerContext.KubeConfig == nil { + klog.Warning("skipping Pacemaker healthcheck controller: KubeConfig is nil") + } else { + healthCheckController := pacemaker.NewHealthCheck( alivenessChecker, operatorClient, kubeClient, controllerContext.EventRecorder, controllerContext.KubeConfig, ) - go healthCheckController.Run(ctx, 1) + go healthCheckController.Run(ctx, 1) + }pkg/tnf/pacemaker/healthcheck.go (2)
320-335
: Avoid panics on REST client creation; prefer graceful failure.Panic will crash the operator if RESTClientFor fails (bad rest.Config, serializer, env). Return an error from constructor (or log and skip starting) and let the caller decide, to keep the operator up.
Minimal approach (preserve signature): log error and return a no-op controller:
- restClient, err := rest.RESTClientFor(pacemakerConfig) - if err != nil { - panic(fmt.Errorf("failed to create REST client: %w", err)) - } + restClient, err := rest.RESTClientFor(pacemakerConfig) + if err != nil { + klog.Errorf("PacemakerHealthCheck disabled: failed to create REST client: %v", err) + return factory.New().ToController("PacemakerHealthCheckDisabled", events.NewInMemoryRecorder("noop")) + }Alternatively, change NewHealthCheck to return (factory.Controller, error) and bubble up.
As per coding guidelines.
730-781
: Event deduplication is sensible; minor nit: avoid cleanup per-item.Current cleanup runs for each warning/error. Consider a periodic cleanup (once per sync) to reduce lock contention.
pkg/tnf/pacemaker/healthcheck_test.go (1)
32-61
: Consider removing or clarifying the TODO comment.The TODO comment on line 56 states "TODO: Mock this for actual tests", but the function is already used successfully in
TestNewHealthCheck
(line 168-174). If the nil REST client and informer are acceptable for basic constructor testing, consider either:
- Removing the TODO comment, or
- Clarifying what additional testing would require proper mocking
pkg/cmd/pacemakerstatuscollector/collector.go (2)
303-333
: Refactor duplicate REST config retrieval logic.The code for obtaining the Kubernetes REST config is duplicated between
createKubernetesClient
(lines 304-316) andupdatePacemakerStatusCR
(lines 323-333). Consider extracting this into a shared helper function to improve maintainability.Example refactor:
func getRESTConfig() (*rest.Config, error) { // Try in-cluster config first config, err := rest.InClusterConfig() if err != nil { // Fall back to kubeconfig kubeconfig := os.Getenv("KUBECONFIG") if kubeconfig == "" { kubeconfig = os.Getenv("HOME") + "/.kube/config" } config, err = clientcmd.BuildConfigFromFlags("", kubeconfig) if err != nil { return nil, err } } return config, nil } func createKubernetesClient() (*kubernetes.Clientset, error) { config, err := getRESTConfig() if err != nil { return nil, err } return kubernetes.NewForConfig(config) } func updatePacemakerStatusCR(ctx context.Context, client *kubernetes.Clientset, rawXML string, summary v1alpha1.PacemakerSummary, collectionError string) error { config, err := getRESTConfig() if err != nil { return err } // ... rest of function }
52-138
: XML structure duplication is documented.The comment on line 53 notes that these XML structures are copied from healthcheck.go to avoid circular dependencies. While code duplication is generally undesirable, this appears to be an intentional trade-off documented in the code.
Consider moving these shared structures to a separate package (e.g.,
pkg/tnf/pacemaker/types
orpkg/tnf/pacemaker/internal
) if the circular dependency can be resolved.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (18)
bindata/etcd/etcd-sa-clusterrole.yaml
(1 hunks)bindata/etcd/etcd-sa-clusterrolebinding.yaml
(1 hunks)bindata/tnfdeployment/cronjob.yaml
(1 hunks)cmd/cluster-etcd-operator/main.go
(2 hunks)manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
(1 hunks)pkg/cmd/pacemakerstatuscollector/collector.go
(1 hunks)pkg/operator/ceohelpers/external_etcd_status.go
(2 hunks)pkg/operator/starter.go
(2 hunks)pkg/tnf/operator/starter.go
(5 hunks)pkg/tnf/operator/starter_test.go
(4 hunks)pkg/tnf/pacemaker/healthcheck.go
(1 hunks)pkg/tnf/pacemaker/healthcheck_test.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/doc.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/register.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/types.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go
(1 hunks)pkg/tnf/pkg/jobs/batch.go
(1 hunks)pkg/tnf/pkg/jobs/cronjob_controller.go
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- bindata/etcd/etcd-sa-clusterrolebinding.yaml
- pkg/operator/starter.go
- manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
- cmd/cluster-etcd-operator/main.go
- bindata/etcd/etcd-sa-clusterrole.yaml
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml
[medium] 1-47: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 1-47: Container should not be privileged
(CKV_K8S_16)
[medium] 1-47: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 1-47: Containers should not share the host process ID namespace
(CKV_K8S_17)
🔇 Additional comments (18)
pkg/tnf/pkg/jobs/batch.go (1)
42-48
: LGTM!The function correctly mirrors the existing
ReadJobV1OrDie
pattern, using the same decoder infrastructure to parse CronJob manifests. The implementation is consistent and appropriate for this use case.bindata/tnfdeployment/cronjob.yaml (1)
35-40
: Privileged access is appropriate for pacemaker operations.The privileged security context and
hostPID: true
are necessary for the collector to executepcs status
commands and interact with the pacemaker cluster on the host. Static analysis warnings are expected but can be accepted for this system-level operational use case.pkg/tnf/pkg/jobs/cronjob_controller.go (3)
90-111
: LGTM!The sync method correctly checks operator state and ensures the CronJob only when in Managed state. Error handling and logging are appropriate.
113-132
: LGTM!The ensureCronJob method properly handles both creation and update paths, with correct error handling for NotFound vs other errors.
224-249
: LGTM!The needsUpdate method correctly compares schedule, command array, and image to determine if an update is required. Bounds checking is appropriate.
pkg/tnf/pacemaker/v1alpha1/doc.go (1)
1-3
: LGTM.Package doc and group annotation look good.
pkg/tnf/operator/starter_test.go (1)
96-110
: Aliveness checker wiring looks correct.Passing a concrete MultiAlivenessChecker into HandleDualReplicaClusters matches the new signature.
pkg/operator/ceohelpers/external_etcd_status.go (1)
167-201
: Nice generic waiter; sync + context-aware polling is appropriate.Gates on informer sync, retries with PollUntilContextTimeout, and treats transient errors as retryable. LGTM.
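For readers who haven't used the helper, the polling pattern being praised here is roughly the following; the interval, timeout, and condition are placeholders, not the values in external_etcd_status.go:

```go
package waitexample

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForCondition retries checkFn until it reports true, the timeout elapses,
// or the context is cancelled. Transient errors are swallowed so polling continues.
func waitForCondition(ctx context.Context, checkFn func(context.Context) (bool, error)) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			done, err := checkFn(ctx)
			if err != nil {
				// Treat errors as retryable; returning err here would abort the poll.
				return false, nil
			}
			return done, nil
		})
}
```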
pkg/tnf/pacemaker/v1alpha1/register.go (1)
1-31
: LGTM.SchemeGroupVersion, AddToScheme, and known types are correctly wired.
pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)
1-121
: Autogen deepcopy looks correct.Methods cover all types; safe copies for meta/timestamps.
pkg/tnf/operator/starter.go (2)
229-237
: Good: bootstrap wait refactored to shared helper.Using ceohelpers.WaitForEtcdCondition simplifies readiness gating and centralizes polling behavior.
348-367
: CronJob schedule fix looks good.Five-field spec "* * * * *" is valid for Kubernetes CronJobs; controller wiring is straightforward.
pkg/tnf/pacemaker/healthcheck.go (1)
509-523
: Node-state handling matches PR intent.You warn on unexpected node count and treat standby as offline. Clear messages; aligns with TNF design.
pkg/tnf/pacemaker/healthcheck_test.go (2)
126-140
: Excellent use of dynamic timestamps!Generating dynamic timestamps within the test window (line 131-133) is a best practice that prevents time-based test flakiness. This approach ensures tests remain reliable regardless of when they execute.
143-903
: Comprehensive test coverage!The test suite provides excellent coverage across:
- Happy path scenarios (healthy clusters, online nodes, started resources)
- Error conditions (offline nodes, standby nodes, collection errors, stale status)
- Edge cases (no nodes, mismatched counts, large XML, invalid XML, missing state)
- Event deduplication across different time windows and state transitions
- All major parsing and status determination functions
The tests are well-structured, clearly named, and follow consistent patterns.
pkg/tnf/pacemaker/v1alpha1/types.go (1)
1-86
: Well-structured API types following Kubernetes conventions.The type definitions are clean and follow standard Kubernetes API patterns with appropriate kubebuilder markers for CRD generation, deepcopy generation, and subresource configuration.
pkg/cmd/pacemakerstatuscollector/collector.go (2)
140-177
: Clean command structure and error handling.The cobra command setup and main orchestration logic follow good patterns with proper context timeout, error propagation, and logging.
179-301
: Well-structured collection and parsing logic.The collection, parsing, and summary-building functions are cleanly separated with good error handling, size validation (maxXMLSize check on line 196), and time-based filtering for recent events.
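The time-based filtering mentioned here reduces to parsing a pacemaker timestamp and comparing it to a cutoff. A simplified sketch, with the layout list assumed to mirror the collector's timeFormats constant:

```go
package collector

import "time"

// Pacemaker emits a few timestamp layouts depending on the field; these are
// assumed to mirror the collector's timeFormats constant.
var timeFormats = []string{
	"Mon Jan 2 15:04:05 2006",
	"2006-01-02 15:04:05.000000Z",
	"2006-01-02 15:04:05Z",
}

// isRecent reports whether ts falls inside the window ending now.
func isRecent(ts string, window time.Duration) bool {
	cutoff := time.Now().Add(-window)
	for _, layout := range timeFormats {
		if t, err := time.Parse(layout, ts); err == nil {
			return t.After(cutoff)
		}
	}
	return false // unparseable timestamps are treated as not recent
}
```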
// Create a minimal rest.Config for testing
restConfig := &rest.Config{
	Host: "https://localhost:6443",
}

return args{
	initErr: nil, // Default to no error
	ctx:     context.Background(),
	controllerContext: &controllercmd.ControllerContext{
		EventRecorder: eventRecorder,
		KubeConfig:    restConfig,
	},
🛠️ Refactor suggestion | 🟠 Major
Prevent test goroutine leaks: return the scoped ctx instead of context.Background.
Controllers started by HandleDualReplicaClusters use args.ctx; returning context.Background bypasses cancellation and can leak goroutines across tests. Return the ctx created in getArgs.
Apply:
@@
- return args{
- initErr: nil, // Default to no error
- ctx: context.Background(),
+ return args{
+ initErr: nil, // Default to no error
+ ctx: ctx,
controllerContext: &controllercmd.ControllerContext{
EventRecorder: eventRecorder,
KubeConfig: restConfig,
},
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Create a minimal rest.Config for testing
restConfig := &rest.Config{
	Host: "https://localhost:6443",
}

return args{
	initErr: nil, // Default to no error
	ctx:     ctx,
	controllerContext: &controllercmd.ControllerContext{
		EventRecorder: eventRecorder,
		KubeConfig:    restConfig,
	},
🤖 Prompt for AI Agents
In pkg/tnf/operator/starter_test.go around lines 260 to 271, the test currently
returns context.Background() for args.ctx which bypasses the scoped cancellable
ctx created by getArgs and can leak controller goroutines across tests; change
the returned args to set ctx: ctx (the local scoped context created earlier)
instead of context.Background(), so the test harness can cancel it and prevent
goroutine leaks.
// PacemakerStatusSpec defines the desired state of PacemakerStatus
type PacemakerStatusSpec struct {
	// NodeName identifies which node this status is for
	// +optional
	NodeName string `json:"nodeName,omitempty"`
} |
🧩 Analysis chain
Verify if NodeName field is needed.
The NodeName
field in PacemakerStatusSpec
is marked optional but doesn't appear to be set anywhere in the collector implementation (pkg/cmd/pacemakerstatuscollector/collector.go lines 363-377, 394-399). The collector uses a singleton CR named "cluster" without node-specific distinction.
If node-specific status tracking isn't part of the current design, consider removing this field to avoid confusion. If it's planned for future use, consider adding a comment explaining its intended purpose.
Run the following to verify NodeName usage across the codebase:
🏁 Script executed:
#!/bin/bash
# Search for any references to NodeName in PacemakerStatus context
rg -n -C3 'NodeName' --type=go
Length of output: 264911
---
🏁 Script executed:
```shell
#!/bin/bash
# Search specifically for PacemakerStatus usage and NodeName references
echo "=== Search 1: PacemakerStatus creation/instantiation ==="
rg -n 'PacemakerStatus' --type=go | grep -v vendor | head -20
echo ""
echo "=== Search 2: Collector code that creates PacemakerStatus ==="
rg -n 'pacemakerstatuscollector' --type=go | grep -v vendor | head -20
echo ""
echo "=== Search 3: Direct reference to PacemakerStatusSpec with NodeName ==="
rg -n 'PacemakerStatusSpec.*NodeName\|NodeName.*PacemakerStatusSpec' --type=go | grep -v vendor
echo ""
echo "=== Search 4: Look for any .NodeName assignments in collector context ==="
rg -n '\.NodeName\s*=' --type=go pkg/cmd/pacemakerstatuscollector/ -A2 -B2 2>/dev/null || echo "File not found, searching broadly:"
find pkg -name "*collector*" -type f
Length of output: 2866
🏁 Script executed:
#!/bin/bash
# View the collector.go to see how PacemakerStatus objects are created
echo "=== Collector.go content (showing PacemakerStatus creation) ==="
rg -n 'PacemakerStatus' pkg/cmd/pacemakerstatuscollector/collector.go -A5 -B5
echo ""
echo "=== Search for Spec assignment in collector ==="
rg -n 'Spec' pkg/cmd/pacemakerstatuscollector/collector.go -C3 | head -40
Length of output: 4469
Remove unused NodeName
field from PacemakerStatusSpec
or document its intended purpose.
The NodeName
field in PacemakerStatusSpec
is never set by the collector implementation. Lines 363–376 and 395–398 of collector.go
show that only the Status
fields are populated (LastUpdated, RawXML, CollectionError, Summary); the Spec
is never touched. Combined with the singleton CR design ("cluster"), this field appears unused. Either remove it or add a comment explaining future intent.
🤖 Prompt for AI Agents
In pkg/tnf/pacemaker/v1alpha1/types.go around lines 23–28, the
PacemakerStatusSpec.NodeName field is unused by the collector and the singleton
"cluster" CR design, so remove the NodeName field from the PacemakerStatusSpec
struct, delete its json tag, and update any comments; regenerate CRDs/deepcopy
(run controller-gen or go generate as appropriate), fix any compile-time
references, and run go fmt/go vet to ensure cleanliness. If you prefer to keep
it instead, replace the field with a clear comment explaining its future intent
and why it is currently unused, and still regenerate CRDs/deepcopy and run
formatting checks.
@jaypoulz: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
hostNetwork: false
hostPID: true
priorityClassName: system-node-critical
serviceAccountName: etcd-sa |
Would it make sense to use the tnf-setup-manager SA, like we do in the other TNF jobs? That one also has the privileged SecurityContextConstraints rule assigned, which I expect the pacemaker cronjob needs as well.
|
// Track which line we're on to do context-aware replacement
lines := strings.Split(result, "\n")
for i, line := range lines { |
tbh, I like using a JobHookFunc like in runJobController() much more than these string replacements... WDYT?
cmd.AddCommand(requestbackup.NewRequestBackupCommand(ctx))
cmd.AddCommand(rev.NewRevCommand(ctx))
cmd.AddCommand(backuprestore.NewBackupServer(ctx))
cmd.AddCommand(pacemakerstatuscollector.NewPacemakerStatusCollectorCommand()) |
what about adding this to cmd/tnf-setup-runner instead, where all the other TNF-related commands live? (yes, it's not about the setup part, but we can rename it?)
@@ -0,0 +1,414 @@
package pacemakerstatuscollector |
don't we want to have this in the tnf package?
Change List