
Conversation

jaypoulz
Contributor

@jaypoulz jaypoulz commented Sep 17, 2025

Change List

  • Added a PacemakerHealthCheck controller which monitors pacemaker cluster status every 30 seconds by executing 'pcs status --full' and parsing node/resource health
  • Updated operator condition to degraded when nodes are not online or critical resources (kubelet/etcd) are not started on all nodes
  • Added event recording for warnings of recent failed resource actions (5min window) and fencing events (24hr window) with appropriate event reasons
  • Included happy path testing
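
For orientation, a controller that resyncs on a fixed 30-second interval, as described in the change list above, is commonly wired through openshift/library-go's factory package. The sketch below only illustrates that wiring pattern; the controller name and the sync body are assumptions, not this PR's actual code.

package pacemaker

import (
	"context"
	"time"

	"github.com/openshift/library-go/pkg/controller/factory"
	"github.com/openshift/library-go/pkg/operator/events"
)

// NewPacemakerHealthCheckController is a hypothetical constructor showing the
// usual library-go shape: a sync function driven by a 30s resync ticker.
func NewPacemakerHealthCheckController(recorder events.Recorder) factory.Controller {
	return factory.New().
		WithSync(func(ctx context.Context, syncCtx factory.SyncContext) error {
			// Placeholder: collect pacemaker status, update operator
			// conditions, and emit events here.
			return nil
		}).
		ResyncEvery(30 * time.Second).
		ToController("PacemakerHealthCheck", recorder)
}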

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Sep 17, 2025
@openshift-ci-robot

openshift-ci-robot commented Sep 17, 2025

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Change List

  • Added a PacemakerHealthCheck controller which monitors pacemaker cluster status every 30 seconds by executing 'pcs status --full' and parsing node/resource health
  • Updated operator condition to degraded when nodes are offline or critical resources (kubelet/etcd) are not started on all nodes
  • Added event recording for warnings of recent failed resource actions (5min window) and fencing events (24hr window) with appropriate event reasons
  • Included happy path testing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai bot commented Sep 17, 2025

Walkthrough

Adds Pacemaker support: a new PacemakerStatus CRD and Go API types, a status-collector CLI plus CronJob to populate CRs from pcs XML, a Pacemaker healthcheck controller that reads CRs and updates operator conditions, operator wiring and helpers, RBAC and manifest updates, and a runtime/binary rename to tnf-runtime.

Changes

  • Pacemaker CRD & API — manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml, pkg/tnf/pkg/pacemaker/v1alpha1/... (doc.go, register.go, types.go, zz_generated.deepcopy.go)
    Adds a cluster-scoped PacemakerStatus CRD (etcd.openshift.io/v1alpha1) and the corresponding Go API registration, types, and autogenerated deepcopy implementations.
  • Pacemaker status collector — pkg/tnf/pkg/pacemaker/statuscollector.go, pkg/tnf/pkg/pacemaker/statuscollector_test.go, pkg/tnf/pkg/pacemaker/testdata/*
    Adds a CLI command that runs pcs status xml, parses the XML into structured results, builds summary/nodes/resources/histories, and creates/updates PacemakerStatus CRs; includes parsing, time-window filtering (a minimal sketch follows this list), and extensive tests/fixtures.
  • Pacemaker healthcheck controller — pkg/tnf/pkg/pacemaker/healthcheck.go, pkg/tnf/pkg/pacemaker/healthcheck_test.go, pkg/tnf/pkg/pacemaker/client_helpers.go
    Introduces the HealthCheck controller and HealthStatus type: reads PacemakerStatus CRs, computes warnings/errors/overall status, updates the operator PacemakerDegraded condition, records deduplicated events, and provides REST/client helpers and unit tests.
  • Controller wiring & operator helpers — pkg/tnf/operator/starter.go, pkg/tnf/operator/starter_test.go, pkg/operator/ceohelpers/external_etcd_status.go, pkg/operator/starter.go
    Extends the HandleDualReplicaClusters signature to accept an aliveness checker, wires the pacemaker controllers and CronJob into operator startup, replaces the explicit etcd bootstrap wait with the WaitForEtcdCondition helper, and updates tests to supply a rest.Config and an aliveness checker.
  • CronJob controller & decoders — pkg/tnf/pkg/jobs/cronjob_controller.go, pkg/tnf/pkg/jobs/batch.go
    Adds a generic CronJob controller (manifest templating, create/update/sync, optional hooks) and a ReadCronJobV1OrDie helper to decode batch/v1 CronJob manifests.
  • RBAC & manifests — bindata/tnfdeployment/cronjob.yaml, bindata/tnfdeployment/clusterrole.yaml, bindata/tnfdeployment/job.yaml
    Adds a CronJob manifest for status collection, grants RBAC for pacemakerstatuses and pacemakerstatuses/status, and updates the Job manifest to use the tnf-runtime command.
  • Runtime command & image/binary rename — cmd/tnf-runtime/main.go, Dockerfile.ocp, .gitignore, bindata/tnfdeployment/job.yaml
    Renames the entry command to NewTnfRuntimeCommand, adds NewMonitorCommand, switches the runtime binary to tnf-runtime in the Dockerfile and Job manifest, and updates .gitignore patterns.
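
The time-window filtering mentioned in the status-collector row reduces to parsing pcs-style timestamps and comparing them against a cutoff. Below is a minimal sketch of that idea; the helper names (withinWindow, pacemakerTimeFormat) are assumptions for illustration, not the PR's actual API.

package pacemaker

import "time"

// pcs renders timestamps like "Thu Sep 18 12:40:35 2025".
const pacemakerTimeFormat = "Mon Jan 2 15:04:05 2006"

// withinWindow reports whether raw parses as a timestamp that falls within
// the given window of now (e.g. 5m for failed actions, 24h for fencing).
func withinWindow(raw string, window time.Duration, now time.Time) bool {
	t, err := time.Parse(pacemakerTimeFormat, raw)
	if err != nil {
		return false // unparseable timestamps are treated as out of window
	}
	return now.Sub(t) <= window
}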

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 25.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)
  • Title Check — ✅ Passed: the title "OCPEDGE-2084: Add pacemaker health check for ExternalEtcd clusters" directly and clearly describes the primary change in the pull request. The changeset introduces a PacemakerHealthCheck controller that monitors cluster status for external etcd deployments, which aligns precisely with the title's description. The title is specific, avoids vague terminology, and accurately conveys the main objective to a reviewer scanning the commit history.
  • Description Check — ✅ Passed: the pull request description is clearly related to the changeset. It accurately describes the main additions: a PacemakerHealthCheck controller that monitors cluster status, operator condition updates for degraded state, event recording with appropriate time windows, and test coverage. Each point in the description corresponds to concrete changes in the raw summary, making this a substantive and on-topic description rather than vague or generic.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions



@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Sep 17, 2025
@openshift-ci openshift-ci bot requested review from mshitrit and slintes September 17, 2025 21:39
@openshift-ci openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Sep 17, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (7)
pkg/tnf/pacemaker/healthcheck.go (3)

71-75: Avoid event spam: emit “Healthy” only on transitions.

At 30s resync, always emitting a Healthy event will generate noisy duplicates. Track last status and only emit on transition to Healthy.

 type HealthCheck struct {
   operatorClient v1helpers.StaticPodOperatorClient
   kubeClient     kubernetes.Interface
   eventRecorder  events.Recorder
+  lastOverallStatus string
 }
@@
-  // Record a normal event for healthy status
-  if status.OverallStatus == StatusHealthy {
-    c.eventRecorder.Eventf(EventReasonHealthy, "Pacemaker cluster is healthy - all nodes online and critical resources started")
-  }
+  // Record a normal event for healthy status only on transition
+  if status.OverallStatus == StatusHealthy && c.lastOverallStatus != StatusHealthy {
+    c.eventRecorder.Eventf(EventReasonHealthy, "Pacemaker cluster is healthy - all nodes online and critical resources started")
+  }
+  c.lastOverallStatus = status.OverallStatus

Also applies to: 446-450


347-356: Micro: precompile timestamp regex in isWithinTimeWindow.

Avoid recompiling on each iteration.

 func (c *HealthCheck) isWithinTimeWindow(line string, cutoffTime time.Time) bool {
   timeFormats := []string{
     "Mon Jan 2 15:04:05 2006",
     "2006-01-02 15:04:05.000000Z",
     "2006-01-02 15:04:05Z",
   }
 
-  for _, format := range timeFormats {
-    re := regexp.MustCompile(`'([^']+)'`)
-    matches := re.FindStringSubmatch(line)
+  tsRe := regexp.MustCompile(`'([^']+)'`)
+  for _, format := range timeFormats {
+    matches := tsRe.FindStringSubmatch(line)
     if len(matches) > 1 {
       if t, err := time.Parse(format, matches[1]); err == nil {
         return t.After(cutoffTime)
       }
     }
   }

71-75: Unused dependency: kubeClient is stored but never used.

Drop it from the struct/constructor or put it to use (e.g., namespacing events, future lookups). Keeping it increases surface without value.

 type HealthCheck struct {
   operatorClient v1helpers.StaticPodOperatorClient
-  kubeClient     kubernetes.Interface
   eventRecorder  events.Recorder
   lastOverallStatus string
 }
@@
 func NewHealthCheck(
   livenessChecker *health.MultiAlivenessChecker,
   operatorClient v1helpers.StaticPodOperatorClient,
-  kubeClient kubernetes.Interface,
   eventRecorder events.Recorder,
 ) factory.Controller {
   c := &HealthCheck{
     operatorClient: operatorClient,
-    kubeClient:     kubeClient,
     eventRecorder:  eventRecorder,
   }
@@
-    operatorClient,
-    kubeClient,
+    operatorClient,
     eventRecorder,
 )

If this is intentionally reserved for future use, ignore and mark with a TODO comment to avoid churn.

Also applies to: 80-84

pkg/tnf/pacemaker/healthcheck_test.go (4)

106-137: Assert condition transitions, not just error absence.

Validate that degraded/healthy actually update the operator condition in the fake client.

   err := controller.updateOperatorAvailability(ctx, status)
   require.NoError(t, err, "updateOperatorAvailability should not return an error")
+  // Verify degraded condition set
+  _, st, _, _ := operatorClient.GetStaticPodOperatorState()
+  require.Condition(t, func() (success bool) {
+    for _, c := range st.Conditions {
+      if c.Type == ConditionTypeDegraded && c.Status == operatorv1.ConditionTrue {
+        return true
+      }
+    }
+    return false
+  }, "expected degraded condition to be true")
@@
   err = controller.updateOperatorAvailability(ctx, status)
   require.NoError(t, err, "updateOperatorAvailability should not return an error")
+  // Verify degraded condition cleared
+  _, st, _, _ = operatorClient.GetStaticPodOperatorState()
+  require.Condition(t, func() (success bool) {
+    for _, c := range st.Conditions {
+      if c.Type == ConditionTypeDegraded && c.Status == operatorv1.ConditionFalse {
+        return true
+      }
+    }
+    return false
+  }, "expected degraded condition to be false")

163-184: Validate emitted events.

Use the in-memory recorder to assert reasons were produced (failed action, fencing, generic warning, error, healthy transition).

   controller.recordHealthCheckEvents(status)
-  
+  // Collect and assert events for warnings and error
+  evts := eventRecorder.Events()
+  require.NotEmpty(t, evts)
+  // Sanity: at least one of each warning type and an error
+  require.Contains(t, reasons(evts), EventReasonFailedAction)
+  require.Contains(t, reasons(evts), EventReasonFencingEvent)
+  require.Contains(t, reasons(evts), EventReasonWarning)
+  require.Contains(t, reasons(evts), EventReasonError)
@@
   status = &HealthStatus{
     OverallStatus: StatusHealthy,
     Warnings:      []string{},
     Errors:        []string{},
   }
 
   controller.recordHealthCheckEvents(status)
+  // After transition to healthy, expect a healthy event
+  evts = eventRecorder.Events()
+  require.Contains(t, reasons(evts), EventReasonHealthy)
+
+}
+
+// helper (assumes corev1 = k8s.io/api/core/v1; the in-memory recorder returns *corev1.Event)
+func reasons(evts []*corev1.Event) []string {
+  out := make([]string, 0, len(evts))
+  for _, e := range evts {
+    out = append(out, e.Reason)
+  }
+  return out
 }

261-291: Add a test for “standby” node to ensure it’s a Warning, not Degraded.

Guards the updated parsing logic and aligns with the stated degradation policy.

   // Test with offline node
@@
   require.NotEmpty(t, status.Errors, "Should have errors for offline nodes")
+
+  // Test with standby node -> warning, but still considered online
+  lines = []string{
+    SectionNodeList,
+    "  * Node master-0 (1): online, feature set 3.19.6",
+    "  * Node master-1 (2): standby, feature set 3.19.6",
+    SectionFullListResources,
+  }
+  status = &HealthStatus{Warnings: []string{}, Errors: []string{}}
+  allOnline = controller.parseNodeStatus(lines, status)
+  require.True(t, allOnline, "Standby should not be treated as offline")
+  require.NotEmpty(t, status.Warnings, "Standby should surface as a warning")

296-313: Cover dynamic node-count logic.

Add a case where only one node has resources started while two nodes are listed; expect an error message reflecting 1/2.

   // Test with all resources started
@@
   require.Empty(t, status.Errors, "Should have no errors for started resources")
+
+  // Test with partial placement (only one node started)
+  lines = []string{
+    SectionNodeList,
+    "  * Node master-0 (1): online, feature set 3.19.6",
+    "  * Node master-1 (2): online, feature set 3.19.6",
+    SectionFullListResources,
+    "  * Clone Set: kubelet-clone [kubelet]:",
+    "    * kubelet   (systemd:kubelet):   Started master-0",
+    "  * Clone Set: etcd-clone [etcd]:",
+    "    * etcd (ocf:heartbeat:podman-etcd):     Started master-0",
+    SectionNodeAttributes,
+  }
+  status = &HealthStatus{Warnings: []string{}, Errors: []string{}}
+  allStarted = controller.parseResourceStatus(lines, status)
+  require.False(t, allStarted, "Resources not started on all nodes should be false")
+  require.NotEmpty(t, status.Errors)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 61b7fb5 and c077a2b.

📒 Files selected for processing (2)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)

Comment on lines 214 to 218
if c.isNodeStatusLine(line) && !strings.Contains(line, NodeStatusOnline) {
allNodesOnline = false
nodeName := c.extractNodeName(line)
status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", nodeName))
}

⚠️ Potential issue

Don’t treat “not online” as “offline”; handle standby distinctly.

Current check flags any non-“online” node as offline (e.g., “standby”). That contradicts the PR objective (degrade only when nodes are offline).

-      if c.isNodeStatusLine(line) && !strings.Contains(line, NodeStatusOnline) {
-        allNodesOnline = false
-        nodeName := c.extractNodeName(line)
-        status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", nodeName))
-      }
+      if c.isNodeStatusLine(line) {
+        lower := strings.ToLower(line)
+        if strings.Contains(lower, "offline") {
+          allNodesOnline = false
+          nodeName := c.extractNodeName(line)
+          status.Errors = append(status.Errors, fmt.Sprintf("Node %s is offline", nodeName))
+        } else if strings.Contains(lower, "standby") {
+          nodeName := c.extractNodeName(line)
+          status.Warnings = append(status.Warnings, fmt.Sprintf("Node %s is in standby", nodeName))
+        }
+      }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if c.isNodeStatusLine(line) && !strings.Contains(line, NodeStatusOnline) {
allNodesOnline = false
nodeName := c.extractNodeName(line)
status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", nodeName))
}
if c.isNodeStatusLine(line) {
lower := strings.ToLower(line)
if strings.Contains(lower, "offline") {
allNodesOnline = false
nodeName := c.extractNodeName(line)
status.Errors = append(status.Errors, fmt.Sprintf("Node %s is offline", nodeName))
} else if strings.Contains(lower, "standby") {
nodeName := c.extractNodeName(line)
status.Warnings = append(status.Warnings, fmt.Sprintf("Node %s is in standby", nodeName))
}
}
🤖 Prompt for AI Agents
In pkg/tnf/pacemaker/healthcheck.go around lines 214-218, the code currently
treats any node state that is not NodeStatusOnline as an offline error; change
the logic to only mark nodes as offline when their status equals the explicit
offline constant (e.g., NodeStatusOffline) and ignore legitimate non-online
states like NodeStatusStandby. Concretely, replace the current negation check
with an explicit equality check for offline (or a compound condition that skips
Standby), set allNodesOnline = false and append an error only when the status is
offline, and avoid adding errors for standby nodes.

Comment on lines 226 to 264
func (c *HealthCheck) parseResourceStatus(lines []string, status *HealthStatus) bool {
resourcesStarted := make(map[string]map[string]bool)
resourcesStarted[ResourceKubelet] = make(map[string]bool)
resourcesStarted[ResourceEtcd] = make(map[string]bool)

inResources := false

for _, line := range lines {
line = strings.TrimSpace(line)

if strings.Contains(line, SectionFullListResources) {
inResources = true
continue
}

if inResources {
if strings.Contains(line, SectionNodeAttributes) {
break
}

c.checkResourceLine(line, resourcesStarted)
}
}

// Check if we have both resources started on all nodes
kubeletCount := len(resourcesStarted[ResourceKubelet])
etcdCount := len(resourcesStarted[ResourceEtcd])

if kubeletCount < ExpectedNodeCount {
status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceKubelet, kubeletCount, ExpectedNodeCount))
}

if etcdCount < ExpectedNodeCount {
status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceEtcd, etcdCount, ExpectedNodeCount))
}

return kubeletCount >= ExpectedNodeCount && etcdCount >= ExpectedNodeCount
}


⚠️ Potential issue

Avoid hard-coded node count (2); derive from the Node List.

Clusters may not always be 2 nodes. Parse the Node List to compute expected count and use it when validating resource placement.

-func (c *HealthCheck) parseResourceStatus(lines []string, status *HealthStatus) bool {
+func (c *HealthCheck) parseResourceStatus(lines []string, status *HealthStatus) bool {
   resourcesStarted := make(map[string]map[string]bool)
   resourcesStarted[ResourceKubelet] = make(map[string]bool)
   resourcesStarted[ResourceEtcd] = make(map[string]bool)
@@
-  // Check if we have both resources started on all nodes
-  kubeletCount := len(resourcesStarted[ResourceKubelet])
-  etcdCount := len(resourcesStarted[ResourceEtcd])
+  // Determine expected node count from the Node List; fall back to constant if not found
+  expectedCount := c.detectNodeCount(lines)
+  if expectedCount == 0 {
+    expectedCount = ExpectedNodeCount
+  }
+  kubeletCount := len(resourcesStarted[ResourceKubelet])
+  etcdCount := len(resourcesStarted[ResourceEtcd])
 
-  if kubeletCount < ExpectedNodeCount {
-    status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceKubelet, kubeletCount, ExpectedNodeCount))
+  if kubeletCount < expectedCount {
+    status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceKubelet, kubeletCount, expectedCount))
   }
 
-  if etcdCount < ExpectedNodeCount {
-    status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceEtcd, etcdCount, ExpectedNodeCount))
+  if etcdCount < expectedCount {
+    status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceEtcd, etcdCount, expectedCount))
   }
 
-  return kubeletCount >= ExpectedNodeCount && etcdCount >= ExpectedNodeCount
+  return kubeletCount >= expectedCount && etcdCount >= expectedCount
 }
+
+// detectNodeCount returns the number of nodes listed in the "Node List" section.
+func (c *HealthCheck) detectNodeCount(lines []string) int {
+  inNodeList := false
+  count := 0
+  for _, line := range lines {
+    line = strings.TrimSpace(line)
+    if strings.Contains(line, SectionNodeList) {
+      inNodeList = true
+      continue
+    }
+    if inNodeList {
+      if strings.Contains(line, SectionFullListResources) {
+        break
+      }
+      if c.isNodeStatusLine(line) {
+        count++
+      }
+    }
+  }
+  return count
+}

Committable suggestion skipped: line range outside the PR's diff.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from c077a2b to 81984c7 on September 18, 2025 14:00
@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from 81984c7 to 26d6df8 on October 10, 2025 13:50

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
pkg/tnf/pacemaker/healthcheck.go (1)

502-510: Degrade on StatusError as well.

If we fail to collect status (OverallStatus == StatusError), the operator condition remains untouched, so outages never surface. Treat StatusError the same as StatusDegraded when updating conditions.

Suggested adjustment:

-	if status.OverallStatus == StatusDegraded {
-		return c.setDegradedCondition(ctx, status)
-	} else if status.OverallStatus == StatusHealthy {
+	switch status.OverallStatus {
+	case StatusDegraded, StatusError:
+		return c.setDegradedCondition(ctx, status)
+	case StatusHealthy:
 		return c.clearDegradedCondition(ctx)
 	}
-
-	// For Warning or Error status, we don't update the degraded condition
-	return nil
+	// Warning / Unknown: leave condition unchanged
+	return nil
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between c077a2b and 26d6df8.

📒 Files selected for processing (5)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from 26d6df8 to e42e5dc on October 10, 2025 15:26
@jaypoulz jaypoulz changed the title from "WIP: OCPEDGE-2084: Add pacemaker health check for ExternalEtcd clusters" to "OCPEDGE-2084: Add pacemaker health check for ExternalEtcd clusters" on Oct 10, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Oct 10, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (3)
pkg/tnf/pacemaker/healthcheck.go (3)

3-9: Prevent hangs; add exec timeout and non-interactive sudo (-n). Also introduce StatusError for collection failures.

pcs can block (sudo prompt/hung pcs). Use a short timeout context and sudo -n. While here, add StatusError and treat exec failures accordingly.

Apply:

--- a/pkg/tnf/pacemaker/healthcheck.go
+++ b/pkg/tnf/pacemaker/healthcheck.go
@@
 import (
 	"context"
 	"encoding/xml"
+	"errors"
 	"fmt"
 	"strings"
 	"time"
@@
 const (
   // Time windows for detecting recent events
   FailedActionTimeWindow = 5 * time.Minute
   FencingEventTimeWindow = 24 * time.Hour
+
+  // Upper bound for executing pcs commands
+  ExecTimeout = 10 * time.Second
@@
   // Status strings
   StatusHealthy  = "Healthy"
   StatusWarning  = "Warning"
   StatusDegraded = "Degraded"
   StatusUnknown  = "Unknown"
+  StatusError    = "Error"
@@
   // Pacemaker commands
-  PcsStatusXMLCommand = "sudo pcs status xml"
+  PcsStatusXMLCommand = "sudo -n pcs status xml"
@@
-  // Execute the pcs status xml command
-  stdout, stderr, err := exec.Execute(ctx, PcsStatusXMLCommand)
+  // Execute the pcs status xml command with timeout and non-interactive sudo
+  ctxExec, cancel := context.WithTimeout(ctx, ExecTimeout)
+  defer cancel()
+  stdout, stderr, err := exec.Execute(ctxExec, PcsStatusXMLCommand)
   if err != nil {
-    return newErrorHealthStatus(fmt.Sprintf("Failed to execute pcs status xml command: %v", err)), nil
+    if errors.Is(err, context.DeadlineExceeded) {
+      err = fmt.Errorf("pcs status xml command timed out: %w", err)
+    }
+    return newErrorHealthStatus(fmt.Sprintf("Failed to execute pcs status xml command: %v", err)), nil
   }

Also applies to: 21-84, 336-345


379-383: Set OverallStatus when pacemaker isn’t running (avoid returning Unknown).

Without setting OverallStatus here, outages can be masked and the degraded condition won’t be set.

   if result.Summary.Stack.PacemakerdState != PacemakerStateRunning {
     status.Errors = append(status.Errors, "Pacemaker is not running")
-    return status, nil
+    status.OverallStatus = StatusDegraded
+    return status, nil
   }

568-576: Degrade on StatusError as well (not only Degraded).

Collection failures should surface as degraded to operators/users.

 switch status.OverallStatus {
- case StatusDegraded:
+ case StatusDegraded, StatusError:
   return c.setDegradedCondition(ctx, status)
 case StatusHealthy:
   return c.clearDegradedCondition(ctx)
 default:
   // For Warning or Unknown status, we don't update the degraded condition
   return nil
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 26d6df8 and e42e5dc.

📒 Files selected for processing (5)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/tnf/pacemaker/testdata/offline_node.xml
  • pkg/tnf/pacemaker/healthcheck_test.go
🔇 Additional comments (2)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)

1-88: Fixture looks consistent with parser expectations.

IDs/roles/attributes align with parseResources/parseNodes and time formats for histories. LGTM.

pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1)

1-66: Recent-failures fixture is well-formed and matches parser formats.

Timestamps match PacemakerTimeFormat/PacemakerFenceTimeFormat; placeholders are suitable for templating. LGTM.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from e42e5dc to 3cab3d9 on October 10, 2025 16:06

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (3)
pkg/tnf/pacemaker/healthcheck.go (3)

386-390: Set OverallStatus to StatusDegraded before returning.

When Pacemaker isn't running, the code appends an error but returns with OverallStatus still set to StatusUnknown (from initialization at line 370). This prevents the degraded condition from being set downstream, masking the outage.

Apply this diff:

 	// Check if pacemaker is running
 	if result.Summary.Stack.PacemakerdState != PacemakerStateRunning {
 		status.Errors = append(status.Errors, "Pacemaker is not running")
+		status.OverallStatus = StatusDegraded
 		return status, nil
 	}

567-575: Handle StatusError in the switch statement.

If StatusError is added (per earlier comment), the updateOperatorStatus switch should treat it like StatusDegraded and set the degraded condition.

Apply this diff:

 	// Update operator conditions based on pacemaker status
 	switch status.OverallStatus {
-	case StatusDegraded:
+	case StatusDegraded, StatusError:
 		return c.setDegradedCondition(ctx, status)
 	case StatusHealthy:
 		return c.clearDegradedCondition(ctx)
 	default:
-		// For Warning or Unknown status, we don't update the degraded condition
+		// For Warning or Unknown status, we don't update the degraded condition
 		return nil
 	}

267-274: Rename to newErrorHealthStatus and use StatusError instead of StatusUnknown.

Collection failures (exec timeout, XML parse errors, etc.) should be marked as StatusError rather than StatusUnknown so they're surfaced as actionable failures in the operator condition. This aligns with previous feedback.

Apply this diff:

-// newUnknownHealthStatus creates a HealthStatus with Unknown status and an error message
-func newUnknownHealthStatus(errMsg string) *HealthStatus {
+// newErrorHealthStatus creates a HealthStatus with Error status and an error message
+func newErrorHealthStatus(errMsg string) *HealthStatus {
 	return &HealthStatus{
-		OverallStatus: StatusUnknown,
+		OverallStatus: StatusError,
 		Warnings:      []string{},
 		Errors:        []string{errMsg},
 	}
 }

Then update all call sites (lines 343, 346, 355, 361) to use newErrorHealthStatus instead of newUnknownHealthStatus.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between e42e5dc and 3cab3d9.

📒 Files selected for processing (5)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/tnf/pacemaker/testdata/offline_node.xml
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml
🔇 Additional comments (7)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)

1-88: LGTM! Test fixture is well-formed.

The XML fixture represents a valid healthy Pacemaker cluster state with:

  • Both nodes online (online="true" on lines 12-13)
  • All resources started and active
  • No failed operations in history
  • Empty fence history

This provides good test coverage for the happy path.

pkg/tnf/pacemaker/healthcheck_test.go (6)

102-113: Update test expectations when StatusError replaces StatusUnknown.

This test expects StatusUnknown for exec failures (line 111), which is correct for the current implementation. However, once newUnknownHealthStatus is renamed to newErrorHealthStatus and returns StatusError (per the comment on healthcheck.go lines 267-274), update this assertion accordingly.

After applying the main implementation changes, update line 111:

 	require.NotNil(t, status, "HealthStatus should not be nil")
-	require.Equal(t, StatusUnknown, status.OverallStatus, "Status should be Unknown when exec fails")
+	require.Equal(t, StatusError, status.OverallStatus, "Status should be Error when exec fails")
 	require.NotEmpty(t, status.Errors, "Should have error messages when exec fails")

301-329: Update TestHealthCheck_determineOverallStatus for StatusError.

Line 306 tests newUnknownHealthStatus and expects StatusUnknown (line 308). Once StatusError is introduced, add a dedicated test case and update the existing one.

After implementing StatusError, update the test:

 func TestHealthCheck_determineOverallStatus(t *testing.T) {
 	controller := &HealthCheck{}
 
-	// Test degraded status status
-	errMsg := "test error message"
-	status := newUnknownHealthStatus(errMsg)
-	result := controller.determineOverallStatus(status)
-	require.Equal(t, StatusUnknown, result)
+	// Test error status (collection failures)
+	errMsg := "test error message"
+	status := newErrorHealthStatus(errMsg)
+	result := controller.determineOverallStatus(status)
+	require.Equal(t, StatusError, result)

Also add a test case for StatusUnknown if it's still used for other scenarios, or remove it if no longer needed.


377-392: Update TestHealthCheck_validateResourceCount signature after refactor.

Once validateResourceCount is updated to accept expectedCount as a parameter (per the comment on healthcheck.go lines 480-487), update the test calls on lines 382 and 388.

Apply this diff after the main implementation is updated:

 	// Test with correct count
 	status := &HealthStatus{Warnings: []string{}, Errors: []string{}}
-	result := controller.validateResourceCount(ResourceEtcd, 2, status)
+	result := controller.validateResourceCount(ResourceEtcd, 2, 2, status)
 	require.True(t, result, "Should return true for correct count")
 	require.Empty(t, status.Errors, "Should have no errors for correct count")
 
 	// Test with insufficient count
 	status = &HealthStatus{Warnings: []string{}, Errors: []string{}}
-	result = controller.validateResourceCount(ResourceKubelet, 1, status)
+	result = controller.validateResourceCount(ResourceKubelet, 1, 2, status)
 	require.False(t, result, "Should return false for insufficient count")

248-266: Consider removing TestHealthCheck_collectNodeStatus_MismatchedCount after refactor.

This test validates the warning for node count mismatch against ExpectedNodeCount. Once the hard-coded constant is removed (per the comment on healthcheck.go lines 423-427), this specific warning logic will be deleted, making this test obsolete.

After the refactor, either remove this test or update it to test a different node-count-related scenario if applicable.


23-46: LGTM! Test helpers are well-structured.

The createTestHealthCheck and NewHealthCheck test setup properly use fake clients and in-memory event recorders for isolated testing. Good practice for controller tests.

Also applies to: 72-100


176-230: LGTM! XML parsing tests provide good coverage.

The tests for parsePacemakerStatusXML cover:

  • Healthy cluster (lines 176-188)
  • Offline node scenario (lines 190-202)
  • Recent failures with dynamic timestamps (lines 204-230)

The use of getRecentFailuresXML with dynamic timestamps (lines 55-69) is particularly good for avoiding time-sensitive test flakes.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from 3cab3d9 to c3fb43e on October 10, 2025 16:20

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (8)
pkg/tnf/pacemaker/healthcheck.go (8)

267-274: Duplicate: Mark collection failures as StatusError.

Per past review, this function should return StatusError instead of StatusUnknown to properly distinguish collection failures from unknown states. Collection failures are actionable and should be surfaced as errors.

Apply this diff:

 func newUnknownHealthStatus(errMsg string) *HealthStatus {
   return &HealthStatus{
-    OverallStatus: StatusUnknown,
+    OverallStatus: StatusError,
     Warnings:      []string{},
     Errors:        []string{errMsg},
   }
 }

Note: This requires adding the StatusError constant to the status constants section (lines 34-38).


34-38: Add StatusError constant for collection failures.

To properly handle collection/communication failures distinct from operational issues, add a StatusError constant alongside the existing status constants.

Apply this diff:

   // Status strings
   StatusHealthy  = "Healthy"
   StatusWarning  = "Warning"
   StatusDegraded = "Degraded"
+  StatusError    = "Error"
   StatusUnknown  = "Unknown"

429-433: Distinguish offline from standby nodes.

The current check treats any non-online node as an error, including legitimate standby states. Per the PR objectives, only offline nodes should trigger degradation.

Apply this diff:

   for _, node := range result.Nodes.Node {
-    if node.Online != NodeStatusOnline {
-      status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", node.Name))
+    if node.Online == "false" {
+      status.Errors = append(status.Errors, fmt.Sprintf("Node %s is offline", node.Name))
+    } else if node.Standby == BooleanValueTrue {
+      status.Warnings = append(status.Warnings, fmt.Sprintf("Node %s is in standby", node.Name))
     }
   }

387-390: Set OverallStatus when pacemaker isn't running.

When pacemaker is down, the function returns with OverallStatus still set to Unknown. This prevents the degraded condition from being set, masking the outage.

Apply this diff:

   if result.Summary.Stack.PacemakerdState != PacemakerStateRunning {
     status.Errors = append(status.Errors, "Pacemaker is not running")
+    status.OverallStatus = StatusDegraded
     return status, nil
   }

497-509: Fix failure detection logic: use OR instead of AND.

The condition requires both RC and RCText to indicate failure, which misses cases where only one field is non-success. Use OR logic to catch all failure cases.

Apply this diff:

         for _, operation := range resourceHistory.OperationHistory {
           // Check for failed operations (rc != "0")
-          if operation.RC != OperationRCSuccess && operation.RCText != OperationRCTextSuccess {
+          if operation.RC != OperationRCSuccess || operation.RCText != OperationRCTextSuccess {
             // Parse the timestamp
             t, err := time.Parse(PacemakerTimeFormat, operation.LastRCChange)

567-575: Handle StatusError in operator status update.

When status collection fails (StatusError), the operator should surface a degraded condition to avoid masking outages. Currently only StatusDegraded triggers the degraded condition.

Apply this diff:

   // Update operator conditions based on pacemaker status
   switch status.OverallStatus {
-  case StatusDegraded:
+  case StatusDegraded, StatusError:
     return c.setDegradedCondition(ctx, status)
   case StatusHealthy:
     return c.clearDegradedCondition(ctx)
   default:
-    // For Warning or Unknown status, we don't update the degraded condition
+    // For Warning or Unknown status, we don't update the degraded condition
     return nil
   }

469-487: Derive expected node count dynamically from parsed XML.

Replace the hard-coded ExpectedNodeCount constant with a dynamic count derived from the parsed node list. This makes the validation flexible for clusters of varying sizes.

Apply this diff:

   // Check if we have both resources started on all nodes
+  expectedCount := len(result.Nodes.Node)
   kubeletCount := len(resourcesStarted[ResourceKubelet])
   etcdCount := len(resourcesStarted[ResourceEtcd])
 
-  kubeletOK := c.validateResourceCount(ResourceKubelet, kubeletCount, status)
-  etcdOK := c.validateResourceCount(ResourceEtcd, etcdCount, status)
+  kubeletOK := c.validateResourceCount(ResourceKubelet, kubeletCount, expectedCount, status)
+  etcdOK := c.validateResourceCount(ResourceEtcd, etcdCount, expectedCount, status)
 
   return kubeletOK && etcdOK
 }
 
 // validateResourceCount validates that a resource is started on the expected number of nodes
-func (c *HealthCheck) validateResourceCount(resourceName string, actualCount int, status *HealthStatus) bool {
-  if actualCount < ExpectedNodeCount {
+func (c *HealthCheck) validateResourceCount(resourceName string, actualCount, expectedCount int, status *HealthStatus) bool {
+  if actualCount < expectedCount {
     status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)",
-      resourceName, actualCount, ExpectedNodeCount))
+      resourceName, actualCount, expectedCount))
     return false
   }
   return true

423-427: Remove hard-coded node count check.

The hard-coded ExpectedNodeCount = 2 check is inflexible for clusters that may scale or undergo node replacement. Since the subsequent validation already uses the actual node count from the XML, this strict equality check can be removed.

Apply this diff:

-  // Check if we have the expected number of nodes.
-  // This should almost always be 2, but 1 node is possible during a control-plane node replacement event.
-  if len(result.Nodes.Node) != ExpectedNodeCount {
-    status.Warnings = append(status.Warnings, fmt.Sprintf("Expected %d nodes, found %d", ExpectedNodeCount, len(result.Nodes.Node)))
-  }
-
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 3cab3d9 and c3fb43e.

📒 Files selected for processing (5)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🔇 Additional comments (4)
pkg/tnf/pacemaker/testdata/offline_node.xml (1)

1-67: LGTM!

This XML test fixture correctly represents an offline node scenario with master-1 offline (line 13) and a corresponding fencing event (line 64). The structure is valid and appropriate for testing the health check parser's handling of degraded cluster states.

pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)

1-88: LGTM!

This XML test fixture correctly represents a healthy cluster with both nodes online and all resources started. The structure is valid and provides appropriate test coverage for the healthy path scenario.

pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1)

1-66: LGTM!

This template correctly uses placeholders for dynamic timestamp generation (lines 56, 63), allowing tests to verify time-based filtering of recent failures and fencing events. The template approach is appropriate for testing scenarios that depend on recency.

pkg/tnf/pacemaker/healthcheck_test.go (1)

1-440: LGTM!

This comprehensive test suite provides excellent coverage for the Pacemaker health check component:

  • Thorough testing of XML parsing with multiple scenarios (healthy, offline, recent failures)
  • Good edge case coverage (no nodes, mismatched counts, large/invalid XML)
  • Effective use of test helpers and fixtures
  • Appropriate use of fake clients for isolation

The test structure is clean and maintainable.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from c3fb43e to f87437c on October 10, 2025 17:01
@openshift-ci-robot

openshift-ci-robot commented Oct 10, 2025

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Change List

  • Added a PacemakerHealthCheck controller which monitors pacemaker cluster status every 30 seconds by executing 'pcs status --full' and parsing node/resource health
  • Updated operator condition to degraded when nodes are not online or critical resources (kubelet/etcd) are not started on all nodes
  • Added event recording for warnings of recent failed resource actions (5min window) and fencing events (24hr window) with appropriate event reasons
  • Included happy path testing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
pkg/tnf/pacemaker/healthcheck.go (1)

386-390: Set OverallStatus before returning when pacemaker is not running.

When pacemaker is not running, the function appends an error and returns immediately (line 389), but OverallStatus remains Unknown (initialized at line 370). This prevents the degraded condition from being set properly since updateOperatorStatus (line 567-575) only sets degraded for StatusError, not StatusUnknown.

Apply this diff:

 	if result.Summary.Stack.PacemakerdState != PacemakerStateRunning {
 		status.Errors = append(status.Errors, "Pacemaker is not running")
+		status.OverallStatus = c.determineOverallStatus(status)
 		return status, nil
 	}
🧹 Nitpick comments (1)
pkg/tnf/pacemaker/healthcheck_test.go (1)

102-113: Clarify test expectation: exec failure should not return an error.

The comment states "exec.Execute will fail due to permissions" but the test expects NoError. This appears correct (the function returns Unknown status without propagating the error), but the comment could be clearer about this design choice.

Consider updating the comment to:

-  // In test environment, exec.Execute will fail due to permissions
-  require.NoError(t, err, "getPacemakerStatus should not return an error")
+  // In test environment, exec.Execute will fail due to permissions.
+  // getPacemakerStatus handles this gracefully by returning Unknown status without propagating the error.
+  require.NoError(t, err, "getPacemakerStatus should not return an error even when exec fails")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between c3fb43e and f87437c.

📒 Files selected for processing (5)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml
🔇 Additional comments (10)
pkg/tnf/pacemaker/testdata/offline_node.xml (1)

1-67: LGTM! Well-structured test fixture for offline node scenario.

The XML fixture accurately represents an offline node scenario with appropriate fencing history. The structure is valid and will effectively support testing of offline node detection and fencing event handling.

pkg/tnf/pacemaker/healthcheck_test.go (2)

23-69: LGTM! Well-designed test helpers.

The test helpers are well-structured:

  • createTestHealthCheck: Properly initializes all dependencies with fakes
  • loadTestXML: Simple and effective for loading fixtures
  • getRecentFailuresXML: Clever approach using dynamic timestamps to ensure failures are within the 5-minute window
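
A rough sketch of that dynamic-timestamp technique; the placeholder name (FAILURE_TIME) and helper shapes are assumed for illustration, not copied from the PR:

package pacemaker

import (
	"os"
	"path/filepath"
	"strings"
	"testing"
	"time"
)

// loadTestXML reads a fixture from testdata, mirroring the helper described above.
func loadTestXML(t *testing.T, filename string) string {
	t.Helper()
	data, err := os.ReadFile(filepath.Join("testdata", filename))
	if err != nil {
		t.Fatalf("failed to read %s: %v", filename, err)
	}
	return string(data)
}

// getRecentFailuresXML renders the template with a timestamp two minutes in
// the past, which always lands inside the 5-minute failed-action window, so
// the test never goes stale.
func getRecentFailuresXML(t *testing.T) string {
	t.Helper()
	tmpl := loadTestXML(t, "recent_failures_template.xml")
	ts := time.Now().Add(-2 * time.Minute).Format("Mon Jan 2 15:04:05 2006")
	return strings.ReplaceAll(tmpl, "FAILURE_TIME", ts)
}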

176-440: LGTM! Comprehensive test coverage.

The test suite provides excellent coverage:

  • XML parsing for healthy, offline, and recent-failures scenarios
  • Node status collection with various edge cases (empty nodes, mismatched counts)
  • Resource status validation
  • Overall status determination logic
  • Helper functions (newUnknownHealthStatus, recent failures, fencing events)
  • Edge cases (large XML, invalid XML, missing state)

The tests are well-organized, use clear assertions, and validate both happy paths and error conditions.

pkg/tnf/pacemaker/healthcheck.go (7)

334-365: LGTM! Robust command execution with proper safeguards.

The implementation correctly addresses potential hanging issues:

  • Uses context with timeout (10 seconds) to prevent indefinite blocking
  • Uses sudo -n for non-interactive execution
  • Explicitly checks for timeout errors
  • Validates XML size to prevent XML bombs
  • Returns graceful Unknown status on errors
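
A generic illustration of that execution pattern using only the standard library (independent of the PR's exec helper; the command and timeout mirror the review's description):

package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runWithTimeout executes a command with an upper bound on runtime;
// exec.CommandContext kills the process once the context deadline passes.
func runWithTimeout(parent context.Context, timeout time.Duration, name string, args ...string) (string, error) {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	out, err := exec.CommandContext(ctx, name, args...).CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return string(out), fmt.Errorf("%s timed out after %s", name, timeout)
	}
	return string(out), err
}

func main() {
	// sudo -n fails fast instead of prompting for a password.
	out, err := runWithTimeout(context.Background(), 10*time.Second, "sudo", "-n", "pcs", "status", "xml")
	fmt.Println(out, err)
}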

23-85: LGTM! Well-organized constants with clear documentation.

The constants are well-structured and appropriately documented:

  • Time windows for event detection are reasonable (5 min for failures, 24 hours for fencing)
  • 30-second resync interval balances responsiveness with resource usage
  • Status constants, event reasons, and command strings are clearly defined
  • ExpectedNodeCount = 2 correctly reflects the Two-Node OpenShift design

416-434: Node online check aligns with Two-Node OpenShift design.

The check at line 430 treats any node where Online != "true" as an error, including standby nodes. Per the maintainer's clarification, this is the intended behavior for Two-Node OpenShift with Fencing, where standby nodes are considered offline for health purposes.

Based on learnings


497-509: LGTM! Correct failure detection using OR logic.

The failure detection properly uses OR logic (||) to catch failures when either RC or RCText indicates non-success. This correctly identifies cases where only one field shows failure (e.g., RC="1" but RCText="ok").


560-576: Consider whether StatusUnknown should also set degraded condition.

Currently, only StatusError sets the degraded condition. When the controller cannot determine cluster health (e.g., exec timeout, parse failure), it returns StatusUnknown, but this doesn't update the operator condition. This could mask operational issues.

Should StatusUnknown also set a degraded condition to surface collection failures? If the current behavior is intentional (avoiding false alarms during transient issues), consider adding a comment explaining this design choice.

Would you like the controller to set degraded on Unknown status, or is the current "wait and see" approach preferred?


480-487: Hard-coded ExpectedNodeCount correctly reflects architectural constraint.

The constant ExpectedNodeCount = 2 and its usage in validation correctly encode the Two-Node OpenShift with Fencing design. Per the maintainer's clarification, this is an architectural invariant, not a limitation to be made dynamic.

Based on learnings


627-654: LGTM! Well-categorized event recording.

Event recording properly categorizes warnings by type:

  • Failed resource actions → EventReasonFailedAction
  • Fencing events → EventReasonFencingEvent
  • General warnings → EventReasonWarning
  • Errors → EventReasonError
  • Healthy status → EventReasonHealthy

This categorization enables effective event filtering and alerting.
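
As a sketch of such categorization, with stand-in reason values and a substring dispatch that are assumptions rather than the PR's exact logic:

package pacemaker

import (
	"strings"

	"github.com/openshift/library-go/pkg/operator/events"
)

// HealthStatus and the reason constants below are illustrative stand-ins
// for the PR's types; the real names and values may differ.
type HealthStatus struct {
	Warnings []string
	Errors   []string
}

const (
	EventReasonFailedAction = "PacemakerFailedResourceAction"
	EventReasonFencingEvent = "PacemakerFencingEvent"
	EventReasonWarning      = "PacemakerWarning"
	EventReasonError        = "PacemakerError"
)

// recordHealthCheckEvents routes each message to a typed event reason so
// consumers can filter and alert on specific categories.
func recordHealthCheckEvents(rec events.Recorder, status *HealthStatus) {
	for _, w := range status.Warnings {
		switch {
		case strings.Contains(w, "failed"):
			rec.Warningf(EventReasonFailedAction, "%s", w)
		case strings.Contains(w, "fencing"):
			rec.Warningf(EventReasonFencingEvent, "%s", w)
		default:
			rec.Warningf(EventReasonWarning, "%s", w)
		}
	}
	for _, e := range status.Errors {
		rec.Warningf(EventReasonError, "%s", e)
	}
}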

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from f87437c to c07e5b1 on October 10, 2025 17:07

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
pkg/tnf/pacemaker/healthcheck.go (1)

267-274: Collection failures return Unknown; confirm policy.

Exec/parse failures set StatusUnknown, so no degraded condition is set. If outages should surface, consider marking as StatusError to degrade. If Unknown is intentional, ignore.

Also applies to: 341-347

🧹 Nitpick comments (6)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)

79-82: Remove duplicated operation_history entry.

Line 80 duplicates line 79 for the kubelet probe. Drop one to avoid confusion in tests and future parsing.

-        <operation_history call="330" task="probe" rc="0" rc_text="ok" last-rc-change="Thu Sep 18 12:40:35 2025" exec-time="4ms" queue-time="0ms"/>
pkg/tnf/pacemaker/healthcheck_test.go (1)

115-149: Optional: rename test for consistency with impl name.

The function under test is updateOperatorStatus; consider renaming the test to match for easier discovery.

-func TestHealthCheck_updateOperatorAvailability(t *testing.T) {
+func TestHealthCheck_updateOperatorStatus(t *testing.T) {
pkg/tnf/pacemaker/healthcheck.go (4)

338-341: Extract exec timeout into a constant.

Avoid magic number; define a reusable ExecTimeout alongside other constants and use it here.

@@
- // Execute the pcs status xml command with a timeout and non-interactive sudo
- ctxExec, cancel := context.WithTimeout(ctx, 10*time.Second)
+ // Execute the pcs status xml command with a timeout and non-interactive sudo
+ ctxExec, cancel := context.WithTimeout(ctx, ExecTimeout)

Add near other constants:

+ // Upper bound for "pcs status" execution
+ ExecTimeout = 10 * time.Second

430-434: If standby is considered “offline”, handle it explicitly.

Currently only Online != "true" is flagged. To align with the stated intent to treat standby as offline, also error on Standby == "true".

 for _, node := range result.Nodes.Node {
-    if node.Online != NodeStatusOnline {
-        status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", node.Name))
-    }
+    if node.Online != NodeStatusOnline {
+        status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", node.Name))
+    } else if node.Standby == BooleanValueTrue {
+        status.Errors = append(status.Errors, fmt.Sprintf("Node %s is in standby (treated as offline)", node.Name))
+    }
 }

449-453: Match resources by agent, not ID substring.

ID-based substring matching is brittle. Use ResourceAgent to detect kubelet/etcd reliably.

-            // Check if this is a kubelet or etcd resource
-            if strings.Contains(resource.ID, ResourceKubelet) {
-                resourcesStarted[ResourceKubelet][nodeName] = true
-            } else if strings.Contains(resource.ID, ResourceEtcd) {
-                resourcesStarted[ResourceEtcd][nodeName] = true
-            }
+            // Identify kubelet/etcd resources by their agents
+            switch {
+            case strings.HasPrefix(resource.ResourceAgent, "systemd:kubelet"):
+                resourcesStarted[ResourceKubelet][nodeName] = true
+            case strings.HasPrefix(resource.ResourceAgent, "ocf:heartbeat:podman-etcd"):
+                resourcesStarted[ResourceEtcd][nodeName] = true
+            }

629-643: Optional: gate repeated warning events to reduce noise.

At 30s resync, warnings within the 5m/24h windows will re-emit every cycle. Consider deduplication (e.g., last-emitted cache, once-per-window) to avoid event spam.
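
One possible shape for that deduplication, assuming a simple last-emitted cache keyed by message (a sketch, not the PR's implementation):

package pacemaker

import (
	"sync"
	"time"
)

// eventGate suppresses repeats of the same message until its window elapses.
type eventGate struct {
	mu       sync.Mutex
	lastSent map[string]time.Time
	window   time.Duration
}

func newEventGate(window time.Duration) *eventGate {
	return &eventGate{lastSent: map[string]time.Time{}, window: window}
}

// allow reports whether msg may be emitted now, recording the emission time
// when it is; callers skip the event when allow returns false.
func (g *eventGate) allow(msg string, now time.Time) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if last, ok := g.lastSent[msg]; ok && now.Sub(last) < g.window {
		return false // duplicate within the suppression window
	}
	g.lastSent[msg] = now
	return true
}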

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between f87437c and c07e5b1.

📒 Files selected for processing (5)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml
🔇 Additional comments (1)
pkg/tnf/pacemaker/healthcheck_test.go (1)

49-53: Use require.NoErrorf to format the failure message.

Current call won’t format %s; switch to the fmt-aware variant.

-   require.NoError(t, err, "Failed to read test XML file: %s", filename)
+   require.NoErrorf(t, err, "Failed to read test XML file: %s", filename)

Likely an incorrect or invalid review comment.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from c07e5b1 to e2add0e on October 10, 2025 18:48
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
pkg/tnf/pacemaker/healthcheck.go (1)

388-401: Optional: size guard exists only in caller, not in parser.

parsePacemakerStatusXML trusts input size; only getPacemakerStatus enforces maxXMLSize. If parse is used elsewhere, consider early-checking length or using an xml.Decoder with a bounded reader to keep invariants local to the parser. Low priority.
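A minimal sketch of the bounded-reader variant (decodeBounded is an assumed name, not the PR's code; the size cap is passed in by the caller):

package pacemaker

import (
	"encoding/xml"
	"fmt"
	"io"
)

// decodeBounded keeps the size invariant inside the parser: it reads at
// most max+1 bytes and flags input that exceeded the cap.
func decodeBounded(r io.Reader, v any, max int64) error {
	lr := &io.LimitedReader{R: r, N: max + 1}
	if err := xml.NewDecoder(lr).Decode(v); err != nil {
		return err
	}
	if lr.N == 0 {
		// The limit was reached, so the document may be truncated.
		return fmt.Errorf("xml input exceeds %d bytes", max)
	}
	return nil
}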

pkg/tnf/pacemaker/healthcheck_test.go (3)

50-55: Use require.NoErrorf for formatted failure message.

Minor test ergonomics: this call passes a format string and arg; use the “f” variant.

-	data, err := os.ReadFile(path)
-	require.NoError(t, err, "Failed to read test XML file: %s", filename)
+	data, err := os.ReadFile(path)
+	require.NoErrorf(t, err, "Failed to read test XML file: %s", filename)

Based on learnings


642-659: Test name/comment mismatch with behavior.

The test name/comment suggest exercising “large XML” behavior, but the parser test simply validates invalid XML. Either update the comment/name to reflect invalid XML, or drive the size check via getPacemakerStatus (which enforces maxXMLSize). Optional.


484-498: Add a test to assert standby is treated as offline.

To lock in the intended behavior, add a case where node.Online="true" and Standby="true" and assert an error is recorded.

Example:

func TestHealthCheck_collectNodeStatus_StandbyTreatedAsOffline(t *testing.T) {
	controller := &HealthCheck{}
	result := &PacemakerResult{
		Nodes: Nodes{
			Node: []Node{
				{Name: "master-0", Online: booleanValueTrue, Standby: booleanValueTrue},
				{Name: "master-1", Online: booleanValueTrue, Standby: "false"},
			},
		},
	}
	status := &HealthStatus{Warnings: []string{}, Errors: []string{}}
	controller.collectNodeStatus(result, status)
	require.NotEmpty(t, status.Errors)
	require.Contains(t, strings.Join(status.Errors, " "), "standby")
}
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between c07e5b1 and e2add0e.

📒 Files selected for processing (5)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from e2add0e to f3d22d3 on October 10, 2025 19:23
@clobrano clobrano (Contributor) left a comment

Overall, it seems good to me :)
I just left a few notes

Comment on lines 375 to 441
	// Validate XML size to prevent XML bombs
	if len(stdout) > maxXMLSize {
		return newUnknownHealthStatus(fmt.Sprintf("XML output too large: %d bytes (max: %d bytes)", len(stdout), maxXMLSize)), nil
	}
Contributor

Note for the future. If the XML Bomb is a genuine risk, we might want to consider modifying the exec package to use io.LimitReader somehow. The current check prevents us to parse it, and it is good, but we still have the whole XML in memory at that point.

Comment on lines 349 to 402
	// Record pacemaker health check events
	c.recordHealthCheckEvents(healthStatus)
Contributor

Not sure how long recordHealthCheckEvents takes, but maybe we should pass the context here.

Comment on lines 524 to 591
			if err != nil {
				klog.V(4).Infof("Failed to parse timestamp '%s' for operation %s on node %s: %v",
					operation.LastRCChange, operation.Task, node.Name, err)
				continue
			}
Contributor

Shouldn't we worry about letting these errors slip through?

Contributor Author

I think the safest thing to do when something fails during this information collection is to log it but refuse to change the operator state. We should only report definitively known state, even if that is an incomplete truth.

Comment on lines 647 to 716
	// Clean up old entries that are outside the longest deduplication window
	maxWindow := eventDeduplicationWindowFencing
	for key, timestamp := range c.recordedEvents {
		if now.Sub(timestamp) > maxWindow {
			delete(c.recordedEvents, key)
		}
	}
Contributor

This cleanup logic is currently nested inside the block that decides whether to add an item. Should we move it out? It feels like that cleanup step could be performed just once per recordHealthCheckEvents execution rather than for every new item added, or did I misunderstand it?
UPDATE: now that I read it a little more carefully, it seems that each loop iteration in recordHealthCheckEvents might delete events recorded in the previous iteration.

Contributor Author

Yeah, this does feel a bit wonky. I'll tweak it so that the cleanup runs once per check instead of once per error recorded.

@clobrano clobrano (Contributor) left a comment

I had forgotten the tests :)

Some tests have similar boilerplate code (like the ones that parse XML). You might want to consider using test tables instead. Right now, though, I have mixed feelings, as I do like how you organized the tests. The only real benefit of merging them would be that if we need to fix an issue or add new test cases, it'd be much easier to do it in one place instead of several. So, I'll leave the decision up to you!

Comment on lines 57 to 140
func getRecentFailuresXML(t *testing.T) string {
	// Load the template from testdata
	templateXML := loadTestXML(t, "recent_failures_template.xml")

	// Generate recent timestamps that are definitely within the 5-minute window
	now := time.Now()
	recentTime := now.Add(-1 * time.Minute).UTC().Format("Mon Jan 2 15:04:05 2006")
	recentTimeISO := now.Add(-1 * time.Minute).UTC().Format("2006-01-02 15:04:05.000000Z")

	// Replace the placeholders in the template
	xmlContent := strings.ReplaceAll(templateXML, "{{RECENT_TIMESTAMP}}", recentTime)
	xmlContent = strings.ReplaceAll(xmlContent, "{{RECENT_TIMESTAMP_ISO}}", recentTimeISO)

	return xmlContent
}
Contributor

Cool! :)

@slintes slintes (Member) left a comment

pretty nice PR IMHO 👍

	return c.updateOperatorCondition(ctx, condition)
}

func (c *HealthCheck) clearDegradedCondition(ctx context.Context) error {
Member

would it make sense to check the current degraded condition for its reason, to make sure we don't clear a degraded condition which wasn't set by this healthcheck? Not sure what the best practice is in such situations...

Contributor Author

Turns out that's what we are doing, but the function names were very misleading. Updated this to be clearer.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
pkg/tnf/pacemaker/healthcheck.go (3)

524-528: Timestamp parsing errors are logged but not surfaced.

When time.Parse fails, the error is logged at V(4) level and the operation is skipped with continue. If all timestamps consistently fail to parse (e.g., due to a Pacemaker version changing the timestamp format), recent failures would be silently ignored.

Consider accumulating a count of parse errors and emitting a warning if a threshold is exceeded, to detect systematic parsing issues.
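A sketch of that threshold idea (checkTimestampParsing is a hypothetical helper, not the PR's code):

package pacemaker

import (
	"fmt"
	"time"
)

// checkTimestampParsing reports a warning when every timestamp in a batch
// fails to parse, hinting that the expected layout may be outdated.
func checkTimestampParsing(layout string, stamps []string) (string, bool) {
	failures := 0
	for _, s := range stamps {
		if _, err := time.Parse(layout, s); err != nil {
			failures++
		}
	}
	if len(stamps) > 0 && failures == len(stamps) {
		return fmt.Sprintf("all %d timestamps failed to parse; layout %q may be outdated", failures, layout), false
	}
	return "", true
}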


640-665: Move cleanup logic outside the event-checking loop for efficiency.

The cleanup of old deduplication entries (lines 647-653) currently runs on every call to shouldRecordEvent, which means it executes once per warning/error. This is inefficient when processing multiple events.

Based on learnings

Apply this refactor to clean up once per sync:

 // recordHealthCheckEvents records events for pacemaker warnings and fencing history
 func (c *HealthCheck) recordHealthCheckEvents(status *HealthStatus) {
+	// Clean up old entries once at the start
+	c.cleanupOldRecordedEvents()
+
 	// Record events for warnings with appropriate deduplication window
 	for _, warning := range status.Warnings {
 		eventKey := fmt.Sprintf("warning:%s", warning)
@@ -696,6 +700,19 @@
 	}
 }
 
+// cleanupOldRecordedEvents removes entries older than the longest deduplication window
+func (c *HealthCheck) cleanupOldRecordedEvents() {
+	c.recordedEventsMu.Lock()
+	defer c.recordedEventsMu.Unlock()
+	
+	now := time.Now()
+	maxWindow := eventDeduplicationWindowFencing
+	for key, timestamp := range c.recordedEvents {
+		if now.Sub(timestamp) > maxWindow {
+			delete(c.recordedEvents, key)
+		}
+	}
+}
+
 // shouldRecordEvent checks if an event should be recorded based on deduplication logic
 func (c *HealthCheck) shouldRecordEvent(eventKey string, deduplicationWindow time.Duration) bool {
 	c.recordedEventsMu.Lock()
 	defer c.recordedEventsMu.Unlock()
 
 	now := time.Now()
-
-	// Clean up old entries that are outside the longest deduplication window
-	maxWindow := eventDeduplicationWindowFencing
-	for key, timestamp := range c.recordedEvents {
-		if now.Sub(timestamp) > maxWindow {
-			delete(c.recordedEvents, key)
-		}
-	}
 
 	// Check if this event was recently recorded
 	if lastRecorded, exists := c.recordedEvents[eventKey]; exists {

619-626: Verify degraded condition ownership before clearing.

The clearDegradedCondition function clears any degraded condition of type PacemakerHealthCheckDegraded without verifying it was set by this controller. If multiple components manage similar conditions, this could inadvertently clear a condition set by another component.

Consider checking the condition's reason or message to confirm it was set by this health check before clearing it.

Based on learnings
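A minimal sketch of such an ownership check, assuming library-go's v1helpers.FindOperatorCondition and a reason string owned by this controller (function and parameter names are assumptions):

package pacemaker

import (
	operatorv1 "github.com/openshift/api/operator/v1"
	"github.com/openshift/library-go/pkg/operator/v1helpers"
)

// shouldClear returns true only when the degraded condition both exists
// and carries a reason this controller owns.
func shouldClear(status *operatorv1.OperatorStatus, conditionType, ownedReason string) bool {
	cond := v1helpers.FindOperatorCondition(status.Conditions, conditionType)
	if cond == nil || cond.Status != operatorv1.ConditionTrue {
		return false // nothing to clear
	}
	return cond.Reason == ownedReason
}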

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between f3d22d3 and 6176029.

📒 Files selected for processing (6)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/standby_node.xml (1 hunks)
🔇 Additional comments (13)
pkg/tnf/pacemaker/testdata/offline_node.xml (1)

1-67: LGTM! Test fixture correctly represents offline node scenario.

This XML fixture appropriately models a Pacemaker cluster with an offline node (master-1 with online="false"), which is used to test the health-check parsing logic for degraded cluster states.

pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)

1-87: LGTM! Test fixture correctly represents healthy cluster scenario.

This XML fixture appropriately models a healthy Pacemaker cluster with both nodes online, all critical resources started, and no recent fencing events. It serves as a good baseline for testing the happy path.

pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1)

1-66: LGTM! Template approach is appropriate for time-sensitive testing.

This template with placeholder timestamps ({{RECENT_TIMESTAMP}} and {{RECENT_TIMESTAMP_ISO}}) is correctly used by the getRecentFailuresXML test helper to dynamically generate XML with recent timestamps, enabling proper testing of the time-window-based failure detection logic.

pkg/tnf/pacemaker/testdata/standby_node.xml (2)

57-58: Verify duplicate operation_history entries are intentional.

Lines 57-58 contain identical operation_history entries for the kubelet probe operation (same call, task, timestamps). If this duplication is intentional to test duplicate handling in the parser, consider adding a comment explaining this. Otherwise, one entry should be removed.


1-66: Test fixture correctly represents standby node scenario.

This XML fixture appropriately models a Pacemaker cluster with a standby node (master-1 with standby="true"), which is used to test the health-check behavior when nodes are in standby state.

pkg/tnf/pacemaker/healthcheck_test.go (3)

23-71: LGTM! Well-structured test helpers with good practices.

The helper functions appropriately:

  • Create test fixtures with fake clients and in-memory recorders
  • Load static XML test data from files
  • Generate dynamic timestamps for time-sensitive tests using template replacement

The getRecentFailuresXML helper's approach of using 1-minute-old timestamps ensures tests are consistently within the 5-minute failure window.


179-245: Excellent event deduplication test coverage.

This test comprehensively validates the deduplication logic:

  • Verifies duplicate events are not re-recorded within windows
  • Tests different deduplication windows (5 min for actions, 1 hour for fencing)
  • Validates healthy event is only recorded on status transitions
  • Confirms new errors are recorded even when duplicates exist

The test correctly verifies the controller only records healthy events when transitioning from unhealthy states, not when already healthy.


428-496: Comprehensive XML parsing test coverage.

The parsing tests appropriately cover:

  • Healthy cluster baseline
  • Offline nodes
  • Standby nodes
  • Recent failures with dynamic timestamps
  • Invalid/large/missing-state XML edge cases

The tests correctly validate both the parsed status and the specific warnings/errors generated for each scenario.

pkg/tnf/pacemaker/healthcheck.go (5)

349-350: Consider passing context to recordHealthCheckEvents for consistency.

While recordHealthCheckEvents doesn't perform I/O or cancellable operations, passing the context parameter would maintain consistency with other methods and enable future additions that might need it (e.g., timeout-sensitive event recording).

Based on learnings


452-460: Node status validation correctly treats standby as offline.

The implementation correctly identifies both offline nodes (online != "true") and standby nodes (standby == "true") as errors, treating them as equivalent failure conditions. The error message for standby nodes explicitly documents this design decision with "(unexpected behavior; treated as offline)".


360-387: LGTM! Robust command execution with timeout and error handling.

The getPacemakerStatus function appropriately:

  • Uses a timeout context (10 seconds) to prevent indefinite hangs
  • Uses non-interactive sudo (sudo -n) to fail fast on missing credentials
  • Validates XML size to prevent XML bomb attacks
  • Returns Unknown status with error messages rather than propagating errors, ensuring graceful degradation

The timeout handling explicitly checks for context.DeadlineExceeded to provide clear error messages.


408-413: LGTM! Correctly sets error status when Pacemaker is down.

The code properly detects when Pacemaker is not running and sets OverallStatus to statusError before returning, ensuring the degraded condition will be set appropriately in the controller flow.


505-511: Hard-coded expectedNodeCount is intentional by design.

The use of expectedNodeCount = 2 is correct for Two-Node OpenShift with Fencing architecture. This is a fundamental constraint, not a limitation to be refactored.

Based on learnings

@openshift-ci openshift-ci bot removed the approved label on Oct 15, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (4)
pkg/tnf/pkg/jobs/cronjob_controller.go (1)

167-175: Prefer structured mutation over text templating (reduces brittleness).

Parse the manifest once, then set fields on the CronJob object (name, namespace, schedule, image, command) instead of line‑based replacements. This avoids YAML formatting pitfalls and future template drift.

Example:

cronJob := ReadCronJobV1OrDie(c.manifest)
cronJob.Name = c.name
cronJob.Namespace = c.namespace
cronJob.Spec.Schedule = c.schedule
if len(cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers) > 0 {
  ct := &cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers[0]
  ct.Image = c.image
  ct.Command = append([]string(nil), c.command...)
}
return cronJob
pkg/tnf/pacemaker/healthcheck.go (2)

306-309: Remove unused lastProcessedStatus or use it meaningfully.

State is written but never read; drop it or leverage it to suppress duplicate event processing based on CR changes.

Also applies to: 433-436


515-523: Optional: clarify standby-as-offline in a code comment.

Behavior is intentional; a brief inline comment above the loop will aid future readers.

pkg/cmd/pacemakerstatuscollector/collector.go (1)

165-177: Remove unused Kubernetes client wiring to simplify.

You construct a client but don’t use it; updatePacemakerStatusCR rebuilds REST config anyway. Simplify by dropping the client parameter and the extra creation.

Apply:

@@
-  // Create Kubernetes client
-  client, err := createKubernetesClient()
-  if err != nil {
-    return fmt.Errorf("failed to create kubernetes client: %w", err)
-  }
-
   // Update PacemakerStatus CR
-  if err := updatePacemakerStatusCR(ctx, client, rawXML, summary, collectionError); err != nil {
+  if err := updatePacemakerStatusCR(ctx, rawXML, summary, collectionError); err != nil {
     return fmt.Errorf("failed to update PacemakerStatus CR: %w", err)
   }

And change the function signature:

-func updatePacemakerStatusCR(ctx context.Context, client *kubernetes.Clientset, rawXML string, summary v1alpha1.PacemakerSummary, collectionError string) error {
+func updatePacemakerStatusCR(ctx context.Context, rawXML string, summary v1alpha1.PacemakerSummary, collectionError string) error {

Optionally remove createKubernetesClient if unused elsewhere.

Also applies to: 321-321

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 6176029 and 4aefd7d.

📒 Files selected for processing (15)
  • bindata/tnfdeployment/cronjob.yaml (1 hunks)
  • cmd/cluster-etcd-operator/main.go (2 hunks)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
  • pkg/cmd/pacemakerstatuscollector/collector.go (1 hunks)
  • pkg/operator/starter.go (1 hunks)
  • pkg/tnf/operator/starter.go (4 hunks)
  • pkg/tnf/operator/starter_test.go (4 hunks)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/doc.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/register.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/types.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • pkg/tnf/pkg/jobs/batch.go (1 hunks)
  • pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml

[medium] 1-47: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-47: Container should not be privileged

(CKV_K8S_16)


[medium] 1-47: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-47: Containers should not share the host process ID namespace

(CKV_K8S_17)

🔇 Additional comments (2)
pkg/tnf/pacemaker/v1alpha1/register.go (1)

24-31: LGTM: scheme registration is correct.

Types registered, GroupVersion wired, and metav1 added. No issues.

pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)

12-121: Autogenerated deepcopy looks consistent.

Methods cover all types; list/items deep-copied properly.

Comment on lines 352 to 531
	err = restClient.Get().
		Resource("pacemakerstatuses").
		Name(pacemakerStatusName).
		Do(ctx).
		Into(&existing)

	now := metav1.Now()

	if err != nil {
		// Create new PacemakerStatus if it doesn't exist
		if strings.Contains(err.Error(), "not found") {
			newStatus := &v1alpha1.PacemakerStatus{
				TypeMeta: metav1.TypeMeta{
					APIVersion: v1alpha1.SchemeGroupVersion.String(),
					Kind:       "PacemakerStatus",
				},
				ObjectMeta: metav1.ObjectMeta{
					Name: pacemakerStatusName,
				},
				Status: v1alpha1.PacemakerStatusStatus{
					LastUpdated:     now,
					RawXML:          rawXML,
					CollectionError: collectionError,
					Summary:         summary,
				},
			}

			result := restClient.Post().
				Resource("pacemakerstatuses").
				Body(newStatus).
				Do(ctx)

			if result.Error() != nil {
				return fmt.Errorf("failed to create PacemakerStatus: %w", result.Error())
			}

			klog.Info("Created new PacemakerStatus CR")
			return nil
		}
		return fmt.Errorf("failed to get PacemakerStatus: %w", err)
	}

⚠️ Potential issue | 🟠 Major

Use apierrors.IsNotFound instead of string matching.

String checks are brittle. Prefer proper error typing.

Apply:

@@
-import (
+import (
   "context"
   "encoding/xml"
   "fmt"
   "os"
   "strings"
   "time"
 
   "github.com/spf13/cobra"
+  apierrors "k8s.io/apimachinery/pkg/api/errors"
   metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
@@
-  if err != nil {
+  if err != nil {
     // Create new PacemakerStatus if it doesn't exist
-    if strings.Contains(err.Error(), "not found") {
+    if apierrors.IsNotFound(err) {
       newStatus := &v1alpha1.PacemakerStatus{

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In pkg/cmd/pacemakerstatuscollector/collector.go around lines 352 to 392,
replace the brittle string check strings.Contains(err.Error(), "not found") with
a proper API error type check using apierrors.IsNotFound(err); import
"k8s.io/apimachinery/pkg/api/errors" as apierrors if not already imported, and
keep the existing error handling logic path the same (create the new
PacemakerStatus on NotFound, otherwise return the wrapped error).

Comment on lines 3 to 20
import (
	"context"
	"fmt"
	"strings"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"

	operatorv1 "github.com/openshift/api/operator/v1"
	"github.com/openshift/library-go/pkg/controller/factory"
	"github.com/openshift/library-go/pkg/operator/events"
	"github.com/openshift/library-go/pkg/operator/v1helpers"
)

🛠️ Refactor suggestion | 🟠 Major

Add missing import for strconv used in escaping.

@@
-import (
+import (
   "context"
   "fmt"
+  "strconv"
   "strings"
   "time"
🤖 Prompt for AI Agents
In pkg/tnf/pkg/jobs/cronjob_controller.go around lines 3 to 19, the code uses
strconv for escaping but the strconv package is not imported; add "strconv" to
the import block (alongside context, fmt, strings, time, etc.) so the file
compiles and the escaping call resolves; reorder/group imports if needed to
satisfy linters.

Comment on lines 177 to 213
// injectValues replaces <injected> placeholders in the manifest with actual values
func (c *CronJobController) injectValues(manifest []byte) []byte {
	result := string(manifest)

	// Build command string for YAML array format
	var commandBuilder strings.Builder
	commandBuilder.WriteString("[")
	for i, cmd := range c.command {
		if i > 0 {
			commandBuilder.WriteString(", ")
		}
		commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd))
	}
	commandBuilder.WriteString("]")
	commandStr := commandBuilder.String()

	// Track which line we're on to do context-aware replacement
	lines := strings.Split(result, "\n")
	for i, line := range lines {
		// Replace name (appears in metadata.name)
		if strings.Contains(line, "name: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.name, 1)
		}
		// Replace app label
		if strings.Contains(line, "app.kubernetes.io/name: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.name, 1)
		}
		// Replace schedule
		if strings.Contains(line, "schedule: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.schedule, 1)
		}
		// Replace image
		if strings.Contains(line, "image: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.image, 1)
		}
		// Replace command (special handling for array)
		if strings.Contains(line, "command: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", commandStr, 1)
		}
	}

	return []byte(strings.Join(lines, "\n"))
}

🛠️ Refactor suggestion | 🟠 Major

Fix YAML injection/escaping in command and support namespace injection.

  • Current command quoting uses "%s" without escaping; quotes/backslashes in args will break YAML.
  • Namespace in manifest isn’t injected; if config.Namespace != manifest’s namespace, create/update can fail (path vs object namespace mismatch).

Apply:

@@
-func (c *CronJobController) injectValues(manifest []byte) []byte {
+func (c *CronJobController) injectValues(manifest []byte) []byte {
   result := string(manifest)
 
   // Build command string for YAML array format
-  var commandBuilder strings.Builder
-  commandBuilder.WriteString("[")
-  for i, cmd := range c.command {
-    if i > 0 {
-      commandBuilder.WriteString(", ")
-    }
-    commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd))
-  }
-  commandBuilder.WriteString("]")
-  commandStr := commandBuilder.String()
+  var commandBuilder strings.Builder
+  commandBuilder.WriteString("[")
+  for i, cmd := range c.command {
+    if i > 0 {
+      commandBuilder.WriteString(", ")
+    }
+    // Use proper escaping for YAML/JSON-compatible strings
+    commandBuilder.WriteString(strconv.Quote(cmd))
+  }
+  commandBuilder.WriteString("]")
+  commandStr := commandBuilder.String()
@@
   for i, line := range lines {
+    // Replace namespace if present in template
+    if strings.Contains(line, "namespace: <injected>") {
+      lines[i] = strings.Replace(line, "<injected>", c.namespace, 1)
+    }
     // Replace name (appears in metadata.name)
     if strings.Contains(line, "name: <injected>") {
       lines[i] = strings.Replace(line, "<injected>", c.name, 1)
     }

Note: add strconv import (see separate comment).

🤖 Prompt for AI Agents
In pkg/tnf/pkg/jobs/cronjob_controller.go around lines 177 to 219, the command
array is built with fmt.Sprintf("%s") which doesn't escape quotes/backslashes
and will break YAML; also the manifest never injects a namespace. Change the
command builder to call strconv.Quote on each c.command element when
constructing commandStr (so arguments are properly escaped and wrapped), add
strconv to the import list, and add a replacement branch to replace "namespace:
<injected>" with c.namespace (or the controller's namespace field) so the
manifest namespace is injected consistently.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from 4aefd7d to b553bf0 on October 16, 2025 20:16

openshift-ci bot commented Oct 16, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jaypoulz
Once this PR has been reviewed and has the lgtm label, please assign tjungblu for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
pkg/tnf/pkg/jobs/cronjob_controller.go (3)

3-19: Missing strconv import needed for proper YAML escaping.

The injectValues method at line 191 constructs YAML array strings using fmt.Sprintf("%s") without proper escaping. Special characters in command arguments (quotes, backslashes) will break YAML parsing. Import strconv and use strconv.Quote for escaping.


48-48: Misleading cron schedule example includes seconds field.

Kubernetes CronJob uses standard 5-field cron expressions (minute hour day month weekday), not 6-field (with seconds). The example "*/30 * * * * *" is invalid. Use "*/30 * * * *" for every 30 minutes.


180-222: Fix YAML injection vulnerabilities and add namespace injection.

Three critical issues remain from previous review:

  1. Command escaping (line 191): Using fmt.Sprintf("%s") without escaping will break YAML when command args contain quotes or backslashes. Use strconv.Quote(cmd) instead.

  2. Namespace not injected: The manifest likely has a namespace: <injected> placeholder that needs replacement, but the loop has no case for it. Add namespace replacement or the CronJob will be created in wrong namespace.

  3. Schedule escaping (line 209): Wrapping in quotes helps but doesn't escape special chars in the schedule value itself.

Apply this diff:

 import (
   "context"
   "fmt"
+  "strconv"
   "strings"
   "time"
   for i, line := range lines {
+    // Replace namespace if present in template
+    if strings.Contains(line, "namespace: <injected>") {
+      lines[i] = strings.Replace(line, "<injected>", c.namespace, 1)
+    }
     // Replace name (appears in metadata.name)
     if strings.Contains(line, "name: <injected>") {
       lines[i] = strings.Replace(line, "<injected>", c.name, 1)
   for i, cmd := range c.command {
     if i > 0 {
       commandBuilder.WriteString(", ")
     }
-    commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd))
+    commandBuilder.WriteString(strconv.Quote(cmd))
   }
pkg/cmd/pacemakerstatuscollector/collector.go (1)

321-414: Replace string matching with proper error type check.

At line 362, the code uses brittle string matching strings.Contains(err.Error(), "not found") to detect NotFound errors. Use apierrors.IsNotFound(err) instead.

Apply this diff:

 import (
   "context"
   "encoding/xml"
   "fmt"
   "os"
   "strings"
   "time"
 
   "github.com/spf13/cobra"
+  apierrors "k8s.io/apimachinery/pkg/api/errors"
   metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
   if err != nil {
     // Create new PacemakerStatus if it doesn't exist
-    if strings.Contains(err.Error(), "not found") {
+    if apierrors.IsNotFound(err) {
       newStatus := &v1alpha1.PacemakerStatus{
🧹 Nitpick comments (1)
pkg/cmd/pacemakerstatuscollector/collector.go (1)

52-138: Acceptable duplication to avoid circular dependencies.

XML types are duplicated from healthcheck.go as noted in the comment. Consider extracting to a shared internal package in future refactoring to reduce duplication.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 4aefd7d and b553bf0.

📒 Files selected for processing (15)
  • bindata/tnfdeployment/cronjob.yaml (1 hunks)
  • cmd/cluster-etcd-operator/main.go (2 hunks)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
  • pkg/cmd/pacemakerstatuscollector/collector.go (1 hunks)
  • pkg/operator/starter.go (1 hunks)
  • pkg/tnf/operator/starter.go (4 hunks)
  • pkg/tnf/operator/starter_test.go (4 hunks)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/doc.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/register.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/types.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • pkg/tnf/pkg/jobs/batch.go (1 hunks)
  • pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • pkg/tnf/pacemaker/v1alpha1/doc.go
  • pkg/operator/starter.go
  • pkg/tnf/pkg/jobs/batch.go
  • cmd/cluster-etcd-operator/main.go
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
  • pkg/tnf/operator/starter_test.go
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml

[medium] 1-47: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-47: Container should not be privileged

(CKV_K8S_16)


[medium] 1-47: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-47: Containers should not share the host process ID namespace

(CKV_K8S_17)

🔇 Additional comments (42)
pkg/tnf/pkg/jobs/cronjob_controller.go (7)

53-88: LGTM!

The constructor properly sets defaults, wires dependencies, and follows library-go factory patterns.


90-111: LGTM!

The sync logic correctly checks operator management state and handles errors appropriately.


113-132: LGTM!

The reconciliation logic follows standard Kubernetes controller patterns.


134-150: LGTM!

Creation logic properly handles AlreadyExists and records appropriate events.


152-165: LGTM!

Update logic correctly preserves ResourceVersion and records events.


167-178: LGTM!

Build logic is clear and includes helpful debug logging.


224-249: LGTM!

The update detection logic correctly compares schedule, command, and image with proper bounds checking.

pkg/tnf/pacemaker/healthcheck.go (22)

28-99: LGTM!

Constants are well-defined. The StatusError constant was appropriately added. The expectedNodeCount = 2 is intentional per the Two-Node OpenShift architecture design.

Based on learnings.


101-273: LGTM!

XML parsing structures are comprehensive and follow standard Go XML unmarshaling conventions.


274-288: LGTM!

HealthStatus model and helper function are appropriate. Unknown status for collection failures is intentional per maintainer decision.


290-309: LGTM!

The HealthCheck structure is well-designed with proper mutex-protected state for concurrent access.


311-383: LGTM!

The constructor properly configures the REST client, informer, and controller factory with appropriate wiring.


385-405: LGTM!

The sync function follows a clear flow: retrieve status, update operator conditions, record events.


407-450: LGTM!

Status retrieval logic properly validates staleness and enforces XML size limits. The maxXMLSize check at line 439 prevents XML bomb parsing exploits; while the data is already in memory, this is acceptable for internal trusted sources.


452-499: LGTM!

XML parsing correctly validates structure and sets StatusError when pacemaker is not running, addressing prior review feedback.


501-524: LGTM!

Node status collection correctly flags offline and standby nodes as errors, consistent with Two-Node OpenShift design where standby is treated as offline.

Based on learnings.


526-566: LGTM!

Resource status collection correctly tracks started resources across nodes with clear helper function structure.


568-574: LGTM!

Resource count validation correctly uses the intentional expectedNodeCount constant.

Based on learnings.


576-600: LGTM!

Recent failure detection correctly uses OR logic (line 584) per prior review. Timestamp parse errors are logged and skipped, which is appropriate defensive behavior.


602-619: LGTM!

Fencing event collection follows consistent patterns with appropriate time-based filtering.


621-634: LGTM!

Overall status determination follows clear precedence: errors > warnings > healthy.


636-653: LGTM!

Operator status update correctly sets degraded condition for StatusError, addressing prior review feedback.


655-664: LGTM!

Status message logging is clear and appropriate.


666-680: LGTM!

Degraded condition setting uses a scoped condition type (PacemakerHealthCheckDegraded at line 91), which prevents conflicts with other controllers' degraded conditions.


682-689: LGTM!

Condition clearing logic is correct.


691-701: LGTM!

Operator condition update uses standard library-go helpers correctly.


703-728: LGTM!

Event deduplication logic correctly cleans up stale entries and enforces time-based windows. The cleanup at lines 710-716 executes once per call, not per iteration of a caller's loop.


730-769: LGTM!

Event recording implements appropriate deduplication with context-aware windows (longer for fencing) and only records healthy events on transitions.


771-781: LGTM!

Warning event categorization correctly routes to appropriate event reasons.

pkg/tnf/pacemaker/v1alpha1/types.go (5)

1-5: LGTM!

Package structure and imports are standard for Kubernetes API types.


7-21: LGTM!

PacemakerStatus type follows Kubernetes CRD best practices with appropriate kubebuilder markers and subresource support.


23-50: LGTM!

Spec and Status types are well-designed with appropriate fields for status collection and error reporting. The RawXML field enables efficient parsing without re-executing commands.


52-77: LGTM!

PacemakerSummary provides a comprehensive high-level cluster state overview with well-documented fields.


79-86: LGTM!

PacemakerStatusList follows standard Kubernetes list type conventions.

pkg/cmd/pacemakerstatuscollector/collector.go (8)

1-50: LGTM!

Package structure, imports, and constants are well-organized. The pcs command uses sudo -n for non-interactive execution, which is good practice.


140-153: LGTM!

Command constructor follows standard cobra patterns with appropriate error handling.


155-177: LGTM!

The collector orchestration is clean with appropriate timeout and error handling.


179-217: LGTM!

Status collection properly handles timeouts, validates XML size, and gracefully handles parse errors while preserving raw XML.


219-265: LGTM!

Summary building logic correctly processes all node and resource states with clear counting logic.


267-286: LGTM!

Recent failure detection uses correct OR logic for status checks and handles timestamp parsing gracefully.


288-301: LGTM!

Recent fencing detection follows consistent patterns with appropriate time-based filtering.


303-319: LGTM!

Kubernetes client creation follows standard patterns with in-cluster and kubeconfig fallback.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from b553bf0 to d99fc51 on October 17, 2025 00:28
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (5)
pkg/tnf/operator/starter.go (1)

346-351: Cron expression is invalid for Kubernetes CronJobs

"*/30 * * * * *" has six fields (seconds). Kubernetes batch/v1 CronJobs only accept the standard five-field spec, so this will be rejected and the collector never runs. Use a valid five-field schedule (e.g., "* * * * *" for once per minute and loop inside the job for 30‑second sampling) or another mechanism that supports sub-minute intervals.

-			Schedule:  "*/30 * * * * *", // Every 30 seconds (Kubernetes CronJob format with seconds)
+			Schedule:  "* * * * *", // Every minute; handle 30s cadence inside the job if required
pkg/cmd/pacemakerstatuscollector/collector.go (1)

358-391: Use API error helpers instead of string matching.

strings.Contains(err.Error(), "not found") is fragile (locale/log variations) and can misclassify errors. Switch to apierrors.IsNotFound(err) and import k8s.io/apimachinery/pkg/api/errors.

-	if err != nil {
-		// Create new PacemakerStatus if it doesn't exist
-		if strings.Contains(err.Error(), "not found") {
+	if err != nil {
+		// Create new PacemakerStatus if it doesn't exist
+		if apierrors.IsNotFound(err) {
pkg/tnf/pkg/jobs/cronjob_controller.go (3)

3-19: Missing strconv import.

The previous review comment about the missing strconv import still applies. This import is required for the proper escaping fix suggested in the injectValues method.


48-51: Cron schedule comment is still misleading.

The previous review comment about the incorrect 6-field cron expression example still applies. Kubernetes CronJob expects a 5-field expression.


180-222: YAML injection and namespace issues remain.

The previous review comments about improper escaping in command building and missing namespace injection still apply.

🧹 Nitpick comments (2)
pkg/tnf/operator/starter_test.go (1)

292-301: Make the retry expectation observable

expectRetries is populated but never asserted, so the test still passes even if handleNodesWithRetry never performs additional attempts. Either surface the captured attempt counter (e.g., return it from the setup helper) and assert it against expectRetries, or drop the field to avoid implying coverage that isn’t there.
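A sketch of how the attempt count could be surfaced and asserted; the test body below is hypothetical and merely stands in for handleNodesWithRetry's actual wiring:

package operator

import (
	"errors"
	"sync/atomic"
	"testing"

	"github.com/stretchr/testify/require"
)

// TestRetriesAreAsserted counts attempts through the retried operation so
// the expected retry count is actually verified, not just recorded.
func TestRetriesAreAsserted(t *testing.T) {
	var attempts atomic.Int32
	op := func() error {
		if attempts.Add(1) < 3 {
			return errors.New("transient failure")
		}
		return nil
	}
	var err error
	for i := 0; i < 5; i++ { // stand-in for handleNodesWithRetry's retry loop
		if err = op(); err == nil {
			break
		}
	}
	require.NoError(t, err)
	require.EqualValues(t, 3, attempts.Load())
}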

pkg/tnf/pkg/jobs/cronjob_controller.go (1)

90-111: Sync method correctly checks operator state.

The implementation appropriately verifies the operator is in Managed state before ensuring the CronJob. The early return for non-Managed states is a sound approach.

Optional: Consider whether existing CronJobs should be cleaned up when the operator transitions to Unmanaged or Removed states. The current behavior preserves existing resources, which may be intentional.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between b553bf0 and d99fc51.

📒 Files selected for processing (19)
  • bindata/tnfdeployment/cronjob.yaml (1 hunks)
  • cmd/cluster-etcd-operator/main.go (2 hunks)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
  • pkg/cmd/pacemakerstatuscollector/collector.go (1 hunks)
  • pkg/operator/starter.go (1 hunks)
  • pkg/tnf/operator/starter.go (9 hunks)
  • pkg/tnf/operator/starter_test.go (6 hunks)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/standby_node.xml (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/doc.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/register.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/types.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • pkg/tnf/pkg/jobs/batch.go (1 hunks)
  • pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml
  • pkg/operator/starter.go
  • pkg/tnf/pacemaker/testdata/offline_node.xml
  • pkg/tnf/pacemaker/v1alpha1/register.go
  • pkg/tnf/pacemaker/v1alpha1/doc.go
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml

[medium] 1-47: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-47: Container should not be privileged

(CKV_K8S_16)


[medium] 1-47: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-47: Containers should not share the host process ID namespace

(CKV_K8S_17)

🔇 Additional comments (8)
pkg/tnf/pkg/jobs/cronjob_controller.go (8)

21-27: Constants are well-defined.

The default resync interval and namespace values are appropriate for CronJob lifecycle management.


29-42: Controller structure is well-organized.

The CronJobController struct appropriately encapsulates all necessary clients and configuration for managing CronJob resources.


53-88: Constructor is well-implemented.

The constructor properly handles defaults, follows the factory pattern, and correctly sets up the controller with informers and sync intervals.


113-132: CronJob reconciliation logic is sound.

The method correctly handles the create-or-update pattern with appropriate error handling for NotFound cases.


134-150: CronJob creation is well-handled.

The method correctly creates the CronJob, handles AlreadyExists errors gracefully, and records events appropriately.


152-165: CronJob update correctly preserves ResourceVersion.

The method appropriately preserves the existing ResourceVersion before updating, which is essential for conflict-free updates.


167-178: Manifest processing approach is sound.

The method correctly injects values and parses the manifest. The debug logging at V(4) is a helpful troubleshooting aid.


224-249: Update detection logic is correct and defensive.

The method appropriately compares schedule, command, and image fields, with proper defensive checks for container existence before accessing array elements.

Comment on lines 390 to 403
	// Get pacemaker status
	healthStatus, err := c.getPacemakerStatus(ctx)
	if err != nil {
		return err
	}

	// Update operator status conditions based on pacemaker status
	if err := c.updateOperatorStatus(ctx, healthStatus); err != nil {
		return err
	}

	// Record pacemaker health check events
	c.recordHealthCheckEvents(healthStatus)

@coderabbitai coderabbitai bot Oct 17, 2025

⚠️ Potential issue | 🟠 Major

Stale/failed collections never set degraded.

When getPacemakerStatus returns StatusUnknown (collection error, stale XML, oversized payload, etc.), updateOperatorStatus hits the default branch and leaves the degraded condition untouched. That masks genuine outages (e.g., pcs unreachable) and contradicts the controller’s goal of surfacing health issues. Please map collection failures to StatusError (either in getPacemakerStatus or when propagating errors) so we call setDegradedCondition.

🤖 Prompt for AI Agents
In pkg/tnf/pacemaker/healthcheck.go around lines 390 to 403, when
getPacemakerStatus returns StatusUnknown collection failures are not mapped to
degraded and thus updateOperatorStatus leaves degraded untouched; change the
propagation so that any StatusUnknown (collection error, stale/oversized
payload, unreachable pcs) is converted to StatusError before calling
updateOperatorStatus (or update updateOperatorStatus to treat StatusUnknown the
same as StatusError) so setDegradedCondition is invoked; implement the mapping
immediately after getPacemakerStatus returns (if healthStatus == StatusUnknown {
healthStatus = StatusError }) or add a case in updateOperatorStatus handling
StatusUnknown -> setDegradedCondition.

Contributor Author

This is intentional. We only want to update the operator status if we have a definitive answer regarding the pacemaker status. Failure to collect an up-to-date status means that we cannot definitively determine our current state, so the safest thing to do is to keep our last known status.


@jaypoulz Thank you for the clarification! That makes sense — holding the last known status when collection fails is a reasonable approach to avoid flapping the degraded condition during transient issues. I understand the design decision now.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from d99fc51 to 59eaec8 on October 17, 2025 13:26
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/tnf/operator/starter_test.go (1)

18-20: Remove duplicate import of the same package

The same import path is aliased twice (config/v1). Use a single alias.

Apply this diff:

-	configv1informers "github.com/openshift/client-go/config/informers/externalversions/config/v1"
-	v1 "github.com/openshift/client-go/config/informers/externalversions/config/v1"
+	configv1informers "github.com/openshift/client-go/config/informers/externalversions/config/v1"
@@
-	networkInformer            v1.NetworkInformer
+	networkInformer            configv1informers.NetworkInformer

Also applies to: 55-56

♻️ Duplicate comments (5)
pkg/tnf/pkg/jobs/cronjob_controller.go (3)

3-19: Add strconv import for proper command escaping.

The injectValues method (line 191) builds command strings without proper escaping. To safely handle command arguments containing quotes, backslashes, or other special characters, you'll need the strconv package.

Add the import:

 import (
   "context"
   "fmt"
+  "strconv"
   "strings"
   "time"

Then use strconv.Quote(cmd) in the command builder (see separate comment on lines 180-222).


48-48: Fix cron schedule comment: Kubernetes uses 5 fields, not 6.

The example "*/30 * * * * *" shows 6 fields (including seconds), but Kubernetes CronJob uses the standard 5-field cron format (minute, hour, day, month, weekday). This will confuse users.

Update the comment:

-  Schedule  string   // Cron schedule expression (e.g., "*/30 * * * * *" for every 30 seconds)
+  Schedule  string   // Cron schedule expression (e.g., "*/30 * * * *" for every 30 minutes)

180-222: Critical: Fix command escaping and add namespace injection.

Two issues in injectValues:

  1. Command escaping: Line 191 uses fmt.Sprintf(`"%s"`, cmd), which doesn't escape quotes or backslashes. If any command argument contains " or \, the generated YAML will be malformed.

  2. Missing namespace injection: The manifest has namespace: openshift-etcd (line 5 in cronjob.yaml), but this is never replaced with c.namespace, causing a mismatch when CronJobConfig.Namespace differs from the hardcoded value.

Apply this fix:

 func (c *CronJobController) injectValues(manifest []byte) []byte {
   result := string(manifest)
 
   // Build command string for YAML array format
   var commandBuilder strings.Builder
   commandBuilder.WriteString("[")
   for i, cmd := range c.command {
     if i > 0 {
       commandBuilder.WriteString(", ")
     }
-    commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd))
+    // Use strconv.Quote for proper YAML/JSON escaping
+    commandBuilder.WriteString(strconv.Quote(cmd))
   }
   commandBuilder.WriteString("]")
   commandStr := commandBuilder.String()
 
   // Track which line we're on to do context-aware replacement
   lines := strings.Split(result, "\n")
   for i, line := range lines {
+    // Replace namespace
+    if strings.Contains(line, "namespace: <injected>") {
+      lines[i] = strings.Replace(line, "<injected>", c.namespace, 1)
+    }
     // Replace name (appears in metadata.name)
     if strings.Contains(line, "name: <injected>") {

Note: This requires the strconv import (see separate comment) and changing line 5 in cronjob.yaml to namespace: <injected>.

pkg/tnf/operator/starter.go (1)

334-364: Cron schedule fix LGTM

Schedule now uses valid 5-field spec ("* * * * *"). This resolves the prior invalid 6-field seconds issue.

pkg/cmd/pacemakerstatuscollector/collector.go (1)

360-392: Use apierrors.IsNotFound instead of string matching.

String checks are brittle; prefer typed API errors.

-import (
+import (
   ...
-  "strings"
   ...
+  apierrors "k8s.io/apimachinery/pkg/api/errors"
 )
 ...
- if err != nil {
+ if err != nil {
   // Create new PacemakerStatus if it doesn't exist
-  if strings.Contains(err.Error(), "not found") {
+  if apierrors.IsNotFound(err) {
     newStatus := &v1alpha1.PacemakerStatus{ ... }
     ...
   }
   return fmt.Errorf("failed to get PacemakerStatus: %w", err)
 }
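For reference, the typed check matches on the error's Reason rather than its message text, so it keeps working if the wording changes; a tiny demonstration:

package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	gr := schema.GroupResource{Group: "etcd.openshift.io", Resource: "pacemakerstatuses"}
	err := apierrors.NewNotFound(gr, "cluster")

	// Typed check: inspects the StatusError's Reason, not its wording.
	fmt.Println(apierrors.IsNotFound(err)) // true

	// Brittle check: depends on the exact message text.
	fmt.Println(err.Error()) // pacemakerstatuses.etcd.openshift.io "cluster" not found
}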
🧹 Nitpick comments (20)
manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (2)

32-39: Make spec.nodeName required

Controllers likely rely on spec.nodeName; encode that contract in the CRD for early validation.

Apply this diff:

         spec:
           description: PacemakerStatusSpec defines the desired state of PacemakerStatus
           type: object
+          required:
+          - nodeName
           properties:
             nodeName:
               description: NodeName identifies which node this status is for
               type: string

15-18: Add printer columns for quick status at a glance

Improve kubectl get output for operators and SREs.

Apply this diff:

   versions:
   - name: v1alpha1
     served: true
     storage: true
+    additionalPrinterColumns:
+    - name: Node
+      type: string
+      jsonPath: .spec.nodeName
+    - name: Pacemakerd
+      type: string
+      jsonPath: .status.summary.pacemakerdState
+    - name: Quorum
+      type: string
+      jsonPath: .status.summary.hasQuorum
+    - name: Nodes
+      type: string
+      jsonPath: .status.summary.nodesOnline
+      description: Online/Total
+      priority: 1
+    - name: Updated
+      type: date
+      jsonPath: .status.lastUpdated

Also applies to: 58-85

pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)

1-1: Optional: align command attribution

The request attribute shows crm_mon XML while docs mention pcs status; consider noting pcs→crm_mon indirection or aligning text for consistency.

pkg/tnf/operator/starter_test.go (1)

260-271: Avoid live API dependencies in unit tests

runPacemakerControllers is started and the health checker may construct clients from KubeConfig. A bare rest.Config{Host: "..."} can cause dial/SSL errors and flakiness.

Options:

  • Inject a dynamic.Interface into the health checker and pass fakedynamic in tests.
  • Or gate runPacemakerControllers behind a test knob (function var) so unit tests stub it out; a sketch of that seam follows this comment.

If you keep KubeConfig in tests, set it to a dummy and ensure health code never dials when running under fake clients.

Also applies to: 96-110
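A minimal shape for the function-var seam from the second option above (the empty signature is illustrative; the real seam would take whatever runPacemakerControllers needs):

package main

import "fmt"

// startPacemakerControllers is a package-level seam that unit tests can
// replace with a no-op stub; production code leaves it at the default.
var startPacemakerControllers = func() {
	fmt.Println("starting pacemaker controllers (real wiring)")
}

func handleDualReplicaClusters() {
	// ... other setup ...
	startPacemakerControllers()
}

func main() {
	// In a test: stub the seam, run, then restore it (t.Cleanup in practice).
	orig := startPacemakerControllers
	startPacemakerControllers = func() { fmt.Println("stubbed in test") }
	handleDualReplicaClusters()
	startPacemakerControllers = orig
}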

pkg/tnf/operator/starter.go (2)

338-352: 30-second sampling claim vs 1-minute Cron

PR text says “every 30 seconds,” but Cron runs every minute. Ensure the collector loops internally (e.g., two samples with 30s sleep) or update docs.

If needed, document the intra-job loop or add an explicit flag to the collector for 30s cadence.
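If the intra-job loop route is chosen, it could be as small as the sketch below; collectOnce stands in for the collector's existing single-sample path and is an assumption:

package main

import (
	"context"
	"fmt"
	"time"
)

// runCollectorLoop takes several samples per CronJob invocation; with
// samples=2 and interval=30s, a one-minute schedule yields ~30s cadence.
func runCollectorLoop(ctx context.Context, samples int, interval time.Duration, collectOnce func(context.Context) error) error {
	for i := 0; i < samples; i++ {
		if err := collectOnce(ctx); err != nil {
			return err
		}
		if i < samples-1 {
			select {
			case <-time.After(interval):
			case <-ctx.Done():
				return ctx.Err()
			}
		}
	}
	return nil
}

func main() {
	// Production would use (2, 30*time.Second, ...); a short interval here
	// keeps the demo instant.
	_ = runCollectorLoop(context.Background(), 2, time.Millisecond, func(context.Context) error {
		fmt.Println("sample collected")
		return nil
	})
}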


355-364: Defensive start for health controller

If controllerContext.KubeConfig is nil or misconfigured, NewHealthCheck/Run should no-op or degrade gracefully to avoid tight error loops.

Add a nil/validation check and early return with a Warning event.

pkg/tnf/pacemaker/healthcheck_test.go (5)

31-61: Mark helpers as test helpers.

Call t.Helper() in helper functions to improve failure locations and readability.

-func createTestHealthCheck() *HealthCheck {
+func createTestHealthCheck(t *testing.T) *HealthCheck {
+	t.Helper()

Apply similarly by adding t.Helper() at the top of createTestHealthCheckWithMockStatus, loadTestXML, and getRecentFailuresXML, which already receive *testing.T; callers of createTestHealthCheck must pass t after this signature change.


119-140: Stabilize time-dependent tests with an injected clock.

Using time.Now() risks flakes around cutoff windows. Prefer a controllable clock or pass times into helpers.

-func getRecentFailuresXML(t *testing.T) string {
+func getRecentFailuresXMLAt(t *testing.T, now time.Time) string {
   // Load the template...
-  now := time.Now()
   recentTime := now.Add(-1 * time.Minute)...
   ...
   return xmlContent
}
+
+func getRecentFailuresXML(t *testing.T) string {
+  return getRecentFailuresXMLAt(t, time.Now())
+}

277-310: Add coverage for Unknown → no condition change (locks intent).

Given Unknown is intentional for collection/staleness, add a subtest asserting degraded condition is not updated.

   // Test warning status (should not update operator condition)
   status = &HealthStatus{
     OverallStatus: statusWarning,
     Warnings:      []string{"Test warning"},
     Errors:        []string{},
   }
   err = controller.updateOperatorStatus(ctx, status)
   require.NoError(t, err, "updateOperatorAvailability should not return an error")
+
+  // Test unknown status (should not update operator condition)
+  status = &HealthStatus{
+    OverallStatus: statusUnknown,
+    Warnings:      []string{},
+    Errors:        []string{"collection failed"},
+  }
+  err = controller.updateOperatorStatus(ctx, status)
+  require.NoError(t, err)

338-405: Avoid brittle concrete type assertion for recorder.

Casting to events.InMemoryRecorder couples tests to implementation. Prefer asserting via the public Recorder and inspecting emitted events through its API or provide an accessor on HealthCheck for tests.
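One decoupled alternative: assert against a minimal local interface instead of the concrete recorder type. That the in-memory recorder exposes an Events accessor is an assumption to verify against the library-go version in use:

import (
	"testing"

	"github.com/openshift/library-go/pkg/operator/events"
	corev1 "k8s.io/api/core/v1"
)

// eventSink is the minimal surface the assertions need; asserting against
// it keeps tests compiling if the recorder implementation type changes.
type eventSink interface {
	Events() []*corev1.Event
}

func capturedEvents(t *testing.T, r events.Recorder) []*corev1.Event {
	t.Helper()
	sink, ok := r.(eventSink)
	if !ok {
		t.Fatalf("recorder %T does not expose captured events", r)
	}
	return sink.Events()
}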


862-873: Test the max-size guard via getPacemakerStatus path too.

parsePacemakerStatusXML doesn’t enforce size; the guard lives in getPacemakerStatus. Add a test invoking getPacemakerStatus with oversized RawXML to validate handling.
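A possible shape for that test; maxXMLSize, the v1alpha1 field names, and the mock helper follow the names used elsewhere in this PR, so treat it as a sketch rather than a drop-in:

func TestGetPacemakerStatusRejectsOversizedXML(t *testing.T) {
	mock := &v1alpha1.PacemakerStatus{
		Status: v1alpha1.PacemakerStatusStatus{
			LastUpdated: metav1.Now(),
			RawXML:      strings.Repeat("a", maxXMLSize+1),
		},
	}
	hc := createTestHealthCheckWithMockStatus(t, mock)

	// The guard should trip before any XML parsing happens.
	_, err := hc.getPacemakerStatus(context.Background())
	require.Error(t, err, "oversized RawXML should be rejected before parsing")
	require.Contains(t, err.Error(), "too large")
}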

pkg/tnf/pacemaker/healthcheck.go (5)

428-447: Explicit size check in parser for direct calls.

parsePacemakerStatusXML is callable directly (tests do) but doesn’t enforce maxXMLSize; add a guard to fail fast on oversized input as in getPacemakerStatus.

 func (c *HealthCheck) parsePacemakerStatusXML(xmlOutput string) (*HealthStatus, error) {
   status := &HealthStatus{ OverallStatus: statusUnknown, Warnings: []string{}, Errors: []string{} }
+  if len(xmlOutput) > maxXMLSize {
+    return nil, fmt.Errorf("XML output too large: %d bytes (max: %d bytes)", len(xmlOutput), maxXMLSize)
+  }
   // Parse XML
   var result PacemakerResult

509-523: Document standby-as-offline behavior inline.

Add a brief comment so future maintainers don’t “fix” this intentionally strict check.

 for _, node := range result.Nodes.Node {
-  if node.Online != booleanValueTrue {
+  // Treat standby as offline by design (standby nodes must not run critical resources)
+  if node.Online != booleanValueTrue {
     status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", node.Name))
   }
   if node.Standby == booleanValueTrue {
     status.Errors = append(status.Errors, fmt.Sprintf("Node %s is in standby (unexpected behavior; treated as offline)", node.Name))
   }
 }

428-431: Make staleness threshold a named constant tied to resync.

2 minutes is embedded; derive from the controller resync or expose a constant for coherence and easier tuning.

 const (
   ...
   healthCheckResyncInterval = 30 * time.Second
+  statusStaleAfter = 2 * time.Minute // consider 4 * healthCheckResyncInterval
 )
 ...
- if time.Since(result.Status.LastUpdated.Time) > 2*time.Minute {
+ if time.Since(result.Status.LastUpdated.Time) > statusStaleAfter {

703-729: Run dedupe cleanup once per call, not per event.

Cleanup executes for every event, adding overhead. Do it once at the start of recordHealthCheckEvents.

-func (c *HealthCheck) shouldRecordEvent(eventKey string, deduplicationWindow time.Duration) bool {
-  c.recordedEventsMu.Lock()
-  defer c.recordedEventsMu.Unlock()
-  now := time.Now()
-  // Clean up old entries...
-  maxWindow := eventDeduplicationWindowFencing
-  for key, timestamp := range c.recordedEvents {
-    if now.Sub(timestamp) > maxWindow {
-      delete(c.recordedEvents, key)
-    }
-  }
+func (c *HealthCheck) shouldRecordEvent(eventKey string, deduplicationWindow time.Duration) bool {
+  c.recordedEventsMu.Lock()
+  defer c.recordedEventsMu.Unlock()
+  now := time.Now()
   // Check if this event was recently recorded
   if lastRecorded, exists := c.recordedEvents[eventKey]; exists {
     if now.Sub(lastRecorded) < deduplicationWindow {
       return false
     }
   }
   c.recordedEvents[eventKey] = now
   return true
 }
 
 func (c *HealthCheck) recordHealthCheckEvents(status *HealthStatus) {
+  // One-time cleanup per invocation
+  c.recordedEventsMu.Lock()
+  now := time.Now()
+  for key, ts := range c.recordedEvents {
+    if now.Sub(ts) > eventDeduplicationWindowFencing {
+      delete(c.recordedEvents, key)
+    }
+  }
+  c.recordedEventsMu.Unlock()
   // Record events for warnings...

Also applies to: 730-781


101-221: Avoid struct duplication across packages.

Pacemaker XML structs are duplicated here and in the collector; they will drift. Extract them into a shared internal/package and reuse.

Also applies to: 52-60

pkg/cmd/pacemakerstatuscollector/collector.go (4)

24-50: De-duplicate shared constants/types to prevent drift.

maxXMLSize, time formats, and XML structs exist here and in healthcheck.go. Move them into a common package (e.g., pkg/tnf/pacemaker/internal/common) and import from both callers.

Also applies to: 31-33, 75-82


179-217: Stream-limit pcs output to reduce peak memory.

If exec package supports it, prefer a bounded reader or line-capped read to avoid large buffers; otherwise consider enhancing exec.Execute to enforce an output cap.
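If the exec helper can expose stdout as a stream, the classic bounded-read pattern applies; a sketch under that assumption (the cap value is also assumed):

package main

import (
	"fmt"
	"io"
	"os/exec"
)

const maxXMLSize = 1 << 20 // assumed cap, mirroring the collector's limit

// boundedOutput reads at most maxXMLSize+1 bytes so over-cap output is
// detected without ever buffering the full stream in memory.
func boundedOutput(cmd *exec.Cmd) ([]byte, error) {
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return nil, err
	}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	data, readErr := io.ReadAll(io.LimitReader(stdout, maxXMLSize+1))
	// Drain the rest so the child is not blocked writing to a full pipe.
	_, _ = io.Copy(io.Discard, stdout)
	waitErr := cmd.Wait()
	if readErr != nil {
		return nil, readErr
	}
	if len(data) > maxXMLSize {
		return nil, fmt.Errorf("output exceeds %d bytes", maxXMLSize)
	}
	return data, waitErr
}

func main() {
	out, err := boundedOutput(exec.Command("echo", "hello"))
	fmt.Printf("%q %v\n", out, err)
}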


267-286: Be tolerant to rc_text casing.

Some environments emit RCText “OK”. Compare case-insensitively to avoid false positives.

- if operation.RC != operationRCSuccess || operation.RCText != operationRCTextSuccess {
+ if operation.RC != operationRCSuccess || !strings.EqualFold(operation.RCText, operationRCTextSuccess) {

303-319: HOME may be empty; add fallback.

When building kubeconfig path, guard against empty HOME for non-login shells.

- kubeconfig := os.Getenv("KUBECONFIG")
- if kubeconfig == "" {
-   kubeconfig = os.Getenv("HOME") + "/.kube/config"
- }
+ kubeconfig := os.Getenv("KUBECONFIG")
+ if kubeconfig == "" {
+   if home := os.Getenv("HOME"); home != "" {
+     kubeconfig = home + "/.kube/config"
+   }
+ }
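os.UserHomeDir is a slightly more robust fallback than raw $HOME, since it returns an error instead of an empty string and handles non-Unix platforms; a self-contained sketch:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// kubeconfigPath prefers KUBECONFIG, then the platform-aware home
// directory; an empty result means the caller should rely on
// in-cluster config only.
func kubeconfigPath() string {
	if kc := os.Getenv("KUBECONFIG"); kc != "" {
		return kc
	}
	if home, err := os.UserHomeDir(); err == nil {
		return filepath.Join(home, ".kube", "config")
	}
	return ""
}

func main() { fmt.Println(kubeconfigPath()) }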
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between d99fc51 and 59eaec8.

📒 Files selected for processing (19)
  • bindata/tnfdeployment/cronjob.yaml (1 hunks)
  • cmd/cluster-etcd-operator/main.go (2 hunks)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
  • pkg/cmd/pacemakerstatuscollector/collector.go (1 hunks)
  • pkg/operator/starter.go (1 hunks)
  • pkg/tnf/operator/starter.go (4 hunks)
  • pkg/tnf/operator/starter_test.go (4 hunks)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/standby_node.xml (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/doc.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/register.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/types.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • pkg/tnf/pkg/jobs/batch.go (1 hunks)
  • pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml
🚧 Files skipped from review as they are similar to previous changes (6)
  • cmd/cluster-etcd-operator/main.go
  • pkg/tnf/pacemaker/v1alpha1/doc.go
  • pkg/operator/starter.go
  • pkg/tnf/pacemaker/testdata/offline_node.xml
  • pkg/tnf/pacemaker/v1alpha1/types.go
  • pkg/tnf/pacemaker/testdata/standby_node.xml
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml

[medium] 1-47: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-47: Container should not be privileged

(CKV_K8S_16)


[medium] 1-47: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-47: Containers should not share the host process ID namespace

(CKV_K8S_17)

🔇 Additional comments (5)
pkg/tnf/pkg/jobs/batch.go (1)

42-48: LGTM: CronJob decoder follows established pattern.

The function correctly mirrors ReadJobV1OrDie and reuses the existing codec factory. The panic-on-error behavior is appropriate for the "OrDie" convention.

bindata/tnfdeployment/cronjob.yaml (1)

35-40: Verify privileged settings are necessary for pacemaker operations.

The static analysis tools flag the privileged security context, privilege escalation, and hostPID: true as security concerns. For pacemaker health checks that execute pcs status, these elevated privileges are likely necessary. However, please confirm:

  1. Does pcs status --full require privileged access and hostPID?
  2. Can the command run with a more restrictive security context?
  3. Are there audit/compliance requirements that need documentation for these privileges?

If these privileges are confirmed necessary, consider adding a comment in the manifest explaining why elevated privileges are required.

pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)

1-87: Fixture looks good

Well-formed healthy-state XML; suitable for happy-path tests.

pkg/tnf/pacemaker/v1alpha1/register.go (1)

14-31: Scheme registration looks correct

GroupVersion, AddKnownTypes, and AddToScheme are wired properly.

pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)

1-121: Generated deepcopy looks correct.

Covers nested structs, slices, and runtime.Object; no issues spotted.

kind: CronJob
metadata:
  name: <injected>
  namespace: openshift-etcd

⚠️ Potential issue | 🟠 Major

Hardcoded namespace creates inconsistency with controller.

The namespace is hardcoded to openshift-etcd, but the CronJobController accepts a Namespace field in CronJobConfig (line 47 in cronjob_controller.go). The injectValues method (lines 180-222 in cronjob_controller.go) doesn't replace the namespace placeholder, so if a caller provides a different namespace, the manifest and controller will be out of sync.

Change line 5 to use a placeholder:

-  namespace: openshift-etcd
+  namespace: <injected>

Then update injectValues in cronjob_controller.go to inject the namespace (this is already flagged in past review comments for that file).

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In bindata/tnfdeployment/cronjob.yaml around line 5, the namespace is hardcoded
to "openshift-etcd" which will diverge from the CronJobController's configurable
Namespace; replace that line with a namespace placeholder (e.g. namespace: {{
.Namespace }}). Then update cronjob_controller.go (see CronJobConfig Namespace
field around line 47 and injectValues method around lines 180-222) to ensure
injectValues replaces the namespace placeholder with the configured Namespace
value when rendering the manifest; ensure nil/empty Namespace falls back to the
original default if needed and that tests and manifest rendering use the same
placeholder key.

Comment on lines 51 to 54
rawXML:
  description: RawXML contains the raw XML output from pcs status xml command
  type: string
collectionError:
  description: CollectionError contains any error encountered while collecting status
  type: string
summary:

⚠️ Potential issue | 🟠 Major

Guard against oversized rawXML (1 MiB object size limit)

Storing full pcs/crm_mon XML can exceed Kubernetes object size limits, causing updates to fail under stress. Either truncate in the collector or constrain via schema.

Suggested schema bound; pair with collector-side truncation:

             rawXML:
               description: RawXML contains the raw XML output from pcs status xml command
               type: string
+              maxLength: 262144  # ~256 KiB safety cap; adjust as needed

Optionally add a boolean status.summary.rawXMLTruncated and trim in the collector when exceeding the cap.

📝 Committable suggestion


Suggested change
 rawXML:
   description: RawXML contains the raw XML output from pcs status xml command
   type: string
+  maxLength: 262144  # ~256 KiB safety cap; adjust as needed
 collectionError:
   description: CollectionError contains any error encountered while collecting status
   type: string
 summary:
🤖 Prompt for AI Agents
In manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml around lines
51-57, the rawXML status field is unbounded and can exceed Kubernetes 1 MiB
object size limits; add a schema constraint (e.g. maxLength: 1048576) on rawXML
and introduce a new boolean status.summary.rawXMLTruncated field to indicate
truncation, and update the collector to truncate the XML to that cap before
writing the CR status (ensuring truncation preserves valid encoding/closing tags
as needed).
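Collector-side, the trim mentioned above could be as small as the sketch below; the cap value and the truncated flag are suggestions, not existing fields, and note the result is no longer well-formed XML once cut:

package main

import "fmt"

const rawXMLCap = 256 * 1024 // ~256 KiB, matching the proposed schema bound

// truncateRawXML trims oversized XML before it is written to the CR and
// reports whether truncation happened; consumers must treat a truncated
// value as opaque text, not parseable XML.
func truncateRawXML(raw string) (string, bool) {
	if len(raw) <= rawXMLCap {
		return raw, false
	}
	return raw[:rawXMLCap], true
}

func main() {
	out, truncated := truncateRawXML(string(make([]byte, rawXMLCap+10)))
	fmt.Println(len(out), truncated) // 262144 true
}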

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from 59eaec8 to 85b3229 Compare October 17, 2025 14:52
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
bindata/tnfdeployment/cronjob.yaml (1)

5-5: Namespace hardcoding issue already flagged.

This issue was already identified in a previous review.

pkg/tnf/pkg/jobs/cronjob_controller.go (2)

48-48: Fix misleading cron schedule example – Kubernetes CronJob uses 5 fields, not 6.

The example "*/30 * * * * *" suggests a 6-field cron with seconds, but Kubernetes CronJob uses standard 5-field cron. Change to "*/30 * * * *" (every 30 minutes) and update the comment.

Apply this diff:

-	Schedule  string   // Cron schedule expression (e.g., "*/30 * * * * *" for every 30 seconds)
+	Schedule  string   // Cron schedule expression (e.g., "*/30 * * * *" for every 30 minutes)

180-222: Fix YAML injection risk and add namespace injection.

Two issues:

  1. YAML injection: Line 191 uses fmt.Sprintf(`"%s"`, cmd), which doesn't escape quotes or backslashes. If cmd contains " or \, the YAML breaks. Use strconv.Quote for proper escaping.
  2. Missing namespace injection: The manifest never injects namespace. If the template contains namespace: <injected>, it won't be replaced.

Apply this diff:

 package jobs
 
 import (
 	"context"
 	"fmt"
+	"strconv"
 	"strings"
 	"time"
 func (c *CronJobController) injectValues(manifest []byte) []byte {
 	result := string(manifest)
 
 	// Build command string for YAML array format
 	var commandBuilder strings.Builder
 	commandBuilder.WriteString("[")
 	for i, cmd := range c.command {
 		if i > 0 {
 			commandBuilder.WriteString(", ")
 		}
-		commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd))
+		commandBuilder.WriteString(strconv.Quote(cmd))
 	}
 	commandBuilder.WriteString("]")
 	commandStr := commandBuilder.String()
 
 	// Track which line we're on to do context-aware replacement
 	lines := strings.Split(result, "\n")
 	for i, line := range lines {
+		// Replace namespace if present in template
+		if strings.Contains(line, "namespace: <injected>") {
+			lines[i] = strings.Replace(line, "<injected>", c.namespace, 1)
+		}
 		// Replace name (appears in metadata.name)
 		if strings.Contains(line, "name: <injected>") {
 			lines[i] = strings.Replace(line, "<injected>", c.name, 1)
pkg/cmd/pacemakerstatuscollector/collector.go (1)

360-362: Use proper error type checking instead of string matching.

Line 362 uses brittle string matching to detect not-found errors. Use apierrors.IsNotFound(err) for robust error classification.

Apply this diff:

 import (
 	"context"
 	"encoding/xml"
 	"fmt"
 	"os"
 	"strings"
 	"time"
 
 	"github.com/spf13/cobra"
+	apierrors "k8s.io/apimachinery/pkg/api/errors"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	if err != nil {
 		// Create new PacemakerStatus if it doesn't exist
-		if strings.Contains(err.Error(), "not found") {
+		if apierrors.IsNotFound(err) {
 			newStatus := &v1alpha1.PacemakerStatus{
🧹 Nitpick comments (2)
pkg/tnf/pacemaker/healthcheck.go (2)

401-402: Consider passing context to recordHealthCheckEvents for consistency.

While recordHealthCheckEvents currently performs no blocking I/O, passing ctx would future-proof the function and align with standard controller patterns where context flows through the call chain.

Based on learnings


438-441: XML size validation is defensive; etcd limits already constrain CR size.

The size check here occurs after the XML is in memory (from the CR). However, etcd/apiserver limits (typically ~1.5MB) already prevent excessively large CRs from being stored. This validation is still useful as a secondary defense-in-depth measure.

For stricter protection, consider future enhancement using io.LimitReader when the collector writes the CR.

Based on learnings

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 59eaec8 and 85b3229.

📒 Files selected for processing (17)
  • bindata/etcd/etcd-sa-clusterrole.yaml (1 hunks)
  • bindata/etcd/etcd-sa-clusterrolebinding.yaml (1 hunks)
  • bindata/tnfdeployment/cronjob.yaml (1 hunks)
  • cmd/cluster-etcd-operator/main.go (2 hunks)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
  • pkg/cmd/pacemakerstatuscollector/collector.go (1 hunks)
  • pkg/operator/starter.go (2 hunks)
  • pkg/tnf/operator/starter.go (4 hunks)
  • pkg/tnf/operator/starter_test.go (4 hunks)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/doc.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/register.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/types.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • pkg/tnf/pkg/jobs/batch.go (1 hunks)
  • pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • pkg/tnf/pacemaker/v1alpha1/doc.go
  • bindata/etcd/etcd-sa-clusterrolebinding.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
  • pkg/tnf/pkg/jobs/batch.go
  • pkg/tnf/pacemaker/v1alpha1/types.go
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml

[medium] 1-47: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-47: Container should not be privileged

(CKV_K8S_16)


[medium] 1-47: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-47: Containers should not share the host process ID namespace

(CKV_K8S_17)

🔇 Additional comments (34)
bindata/tnfdeployment/cronjob.yaml (1)

35-40: Verify privileged access is necessary for pacemaker status collection.

The CronJob runs with privileged: true, allowPrivilegeEscalation: true, and hostPID: true. While static analysis flags these as security concerns, they may be required for the pcs status command to access the pacemaker cluster state on the host.

Please confirm:

  1. Is host PID namespace access (hostPID: true) required for the pacemaker status collector to run pcs status?
  2. Can the privilege requirements be reduced (e.g., specific capabilities instead of fully privileged)?
  3. Is there documentation about the minimum privileges needed for pcs status execution?

If these privileges are necessary, consider adding a comment in the manifest explaining why privileged access is required for operational clarity.

cmd/cluster-etcd-operator/main.go (1)

17-17: LGTM!

The pacemaker status collector command is properly imported and registered with the root command.

Also applies to: 82-82

pkg/operator/starter.go (2)

247-248: LGTM!

The RBAC manifests for the etcd service account are correctly added to the static resource controller, enabling PacemakerStatus resource access.


632-633: LGTM!

The function call is properly updated to pass the additional dynamicClient and AlivenessChecker parameters required for Pacemaker controller initialization.

bindata/etcd/etcd-sa-clusterrole.yaml (1)

1-24: LGTM!

The ClusterRole defines appropriate permissions for the pacemaker status collector to manage PacemakerStatus resources and their status subresource.

pkg/tnf/operator/starter_test.go (1)

35-35: LGTM!

The test is properly updated to:

  • Import the necessary health package
  • Create an alivenessChecker instance
  • Pass the additional parameters to HandleDualReplicaClusters
  • Initialize a minimal rest.Config for testing

Also applies to: 41-41, 96-109, 260-270

pkg/tnf/operator/starter.go (2)

38-38: LGTM!

The function signature is properly extended to accept dynamicClient and alivenessChecker, and the new runPacemakerControllers function is correctly invoked to start Pacemaker-related controllers.

Also applies to: 40-40, 76-77, 98-98


334-364: Verify CronJob schedule aligns with requirements.

The CronJob schedule is set to "* * * * *" (every minute), but the PR description states the controller "monitors pacemaker cluster status every 30 seconds."

Please confirm:

  1. Is running every minute (instead of every 30 seconds) acceptable for your monitoring requirements?
  2. If 30-second intervals are required, consider either:
    • Running the job more frequently (though Kubernetes CronJobs don't support sub-minute schedules natively)
    • Implementing a loop inside the collector to sample twice per minute
    • Using a different deployment pattern (e.g., Deployment with continuous monitoring)

If every minute is acceptable, update the PR description to match the actual schedule.

pkg/tnf/pacemaker/v1alpha1/register.go (1)

1-31: LGTM!

The API registration follows standard Kubernetes patterns:

  • Properly defines GroupName and Version constants
  • Initializes SchemeGroupVersion and SchemeBuilder
  • Registers PacemakerStatus and PacemakerStatusList types
  • Provides helper functions for resource registration
pkg/tnf/pacemaker/healthcheck_test.go (1)

1-903: LGTM!

This is a comprehensive test suite with excellent coverage:

  • HealthCheck construction and initialization
  • PacemakerStatus retrieval (healthy, errors, stale)
  • Operator status updates
  • Event recording with proper deduplication across different time windows
  • XML parsing for various cluster states (healthy, offline, standby, failures)
  • Node and resource status collection
  • Edge cases (large XML, invalid XML, missing state)

The use of testdata fixtures and dynamic timestamp generation for time-sensitive tests (like getRecentFailuresXML) is a good practice.

pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)

1-121: LGTM: Autogenerated deepcopy implementations are correct.

The generated deepcopy methods follow standard Kubernetes patterns: proper nil checks, correct slice allocation for PacemakerStatusList.Items, nested DeepCopyInto calls for embedded types, and appropriate DeepCopyObject implementations.

pkg/tnf/pacemaker/healthcheck.go (12)

1-99: LGTM: Imports and constants are well-organized.

The constants are clearly defined and documented. The expectedNodeCount = 2 reflects the Two-Node OpenShift with Fencing design constraint, as clarified in past discussions.


311-383: LGTM: Controller constructor follows standard operator patterns.

The REST client and informer setup are correct. Using panic for scheme registration and client creation errors is appropriate for initialization failures that should halt startup.


576-600: LGTM: Continue on parse errors is the correct resilient behavior.

Continuing when timestamp parsing fails (line 590) ensures the health check processes other valid operations. The error is logged at V(4) for debugging. Failing hard on malformed timestamps would be too strict and could mask legitimate failures in other operations.

Based on learnings


682-689: Condition ownership is established by the unique type name.

The condition type PacemakerHealthCheckDegraded (line 91) is unique to this controller, establishing clear ownership. Per Kubernetes conventions, condition types should be managed by a single controller, so clearing without checking the reason is safe.

Based on learnings


703-728: LGTM: Event cleanup logic is correct.

The cleanup (lines 712-716) runs unconditionally on every call and only removes entries older than maxWindow (1 hour). This is intentional and correct – it prevents unbounded map growth while preserving recent events within their deduplication windows.

Based on learnings


101-273: LGTM: XML structures are well-defined for pacemaker status parsing.

The types correctly mirror the pacemaker XML schema and use appropriate xml tags for unmarshaling.


452-499: LGTM: Parsing orchestration is well-structured.

The function properly validates critical structures, checks pacemaker state, and delegates to specialized collectors. Early return on non-running state with explicit error status is appropriate.


501-524: LGTM: Node status collection correctly handles the Two-Node design.

The checks for expected node count (with warning for transient 1-node state during replacement) and online/standby validation align with the Two-Node OpenShift architecture requirements.


526-574: LGTM: Resource validation correctly identifies kubelet and etcd placement.

The helper function pattern and resource agent prefix matching properly track resource distribution across nodes.


602-634: LGTM: Fencing and status determination logic is sound.

Fencing event detection with appropriate time window (24 hours) and simple status determination based on error/warning presence are appropriate for this health check.


636-701: LGTM: Operator condition management follows library-go patterns.

Condition updates use the standard v1helpers.UpdateStaticPodStatus API correctly. The separation of set/clear/update functions is clean.


730-781: LGTM: Event recording with deduplication is well-implemented.

The event categorization (fencing vs. failed actions) with distinct reasons and deduplication windows, plus tracking previous status to record healthy transitions, is thorough and production-ready.

pkg/tnf/pkg/jobs/cronjob_controller.go (3)

21-88: LGTM: Controller setup follows standard factory patterns.

The constants, types, and constructor are well-organized. Default namespace fallback and resync interval are appropriate.


90-178: LGTM: CronJob lifecycle management is correct.

The sync/ensure/create/update/build flow properly handles the CronJob lifecycle, including AlreadyExists handling and ResourceVersion preservation for updates.


224-249: LGTM: Update detection correctly compares schedule, command, and image.

The comparison logic properly checks all three configurable fields and handles array length mismatches.

pkg/cmd/pacemakerstatuscollector/collector.go (8)

1-50: LGTM: Imports and constants are well-organized.

The constants clearly define timeouts, size limits, and time formats. The command uses sudo -n for non-interactive execution, which is correct.


52-138: LGTM: XML structure duplication is acceptable to avoid circular dependencies.

The comment (line 53) explains why these types are duplicated from healthcheck.go. This is a reasonable trade-off for maintaining clean package dependencies.


140-177: LGTM: Command structure and runCollector orchestration are sound.

The 30-second context timeout (line 156) provides a reasonable upper bound for collection and CR update operations. Error handling propagates failures appropriately.


179-217: LGTM: Collection logic is robust and defensive.

The nested timeout (10s for exec, line 181), size validation (line 196), and returning raw XML even on parse failure (line 210) make this resilient to various failure modes.


219-265: LGTM: Summary building comprehensively captures cluster state.

The function correctly counts online nodes, started resources, and checks for recent failures/fencing events. The logic aligns with the health check expectations.


267-301: LGTM: Time-window checks for failures and fencing are correct.

The time window constants (5 minutes for failures, 24 hours for fencing) match the health check controller. Error handling continues on parse failure, which is appropriate.


303-319: LGTM: Kubernetes client creation with in-cluster fallback is standard.

The in-cluster config attempt followed by kubeconfig fallback correctly supports both in-cluster (Job/CronJob) and local execution contexts.


363-414: LGTM: CR update logic correctly handles create vs. update paths.

The create path (lines 363-389) properly sets TypeMeta and ObjectMeta. The update path (lines 394-413) correctly uses the status subresource and sets ObservedGeneration.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from 85b3229 to 9f5f422 Compare October 17, 2025 16:34
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (5)
bindata/tnfdeployment/cronjob.yaml (1)

5-5: Hardcoded namespace creates inconsistency with controller.

The namespace is hardcoded to openshift-etcd, but the CronJobController accepts a Namespace field in CronJobConfig. The controller's injectValues method doesn't replace the namespace placeholder, causing potential mismatches.

As flagged in previous reviews: replace with a placeholder and update the controller to inject it.

pkg/tnf/pkg/jobs/cronjob_controller.go (3)

48-51: Cron schedule comment shows invalid 6-field expression.

Kubernetes CronJob expects a standard 5-field cron expression (minute, hour, day, month, weekday), not 6 fields. The example showing "*/30 * * * * *" with seconds is misleading.

Update to a valid 5-field example like "*/30 * * * *" (every 30 minutes).


184-194: Command array construction lacks proper YAML escaping.

The command building uses fmt.Sprintf(`"%s"`, cmd), which doesn't escape quotes, backslashes, or other special characters. Arguments containing these characters will break the YAML syntax when injected.

As flagged in previous reviews: use strconv.Quote() for each command element to ensure proper escaping, and add the strconv import.


196-222: Namespace is not injected into the manifest.

The injectValues method replaces placeholders for name, schedule, image, and command, but never injects the namespace. The manifest template has a hardcoded namespace, creating an inconsistency with the controller's configurable Namespace field.

As flagged in previous reviews: add a replacement branch to inject c.namespace when the line contains "namespace: <injected>".

pkg/cmd/pacemakerstatuscollector/collector.go (1)

360-392: Use apierrors.IsNotFound instead of string matching.

The brittle string check strings.Contains(err.Error(), "not found") on line 362 should be replaced with proper API error type checking using apierrors.IsNotFound(err).

Apply this fix:

 import (
   "context"
   "encoding/xml"
   "fmt"
   "os"
   "strings"
   "time"
 
   "github.com/spf13/cobra"
+  apierrors "k8s.io/apimachinery/pkg/api/errors"
   metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
   if err != nil {
     // Create new PacemakerStatus if it doesn't exist
-    if strings.Contains(err.Error(), "not found") {
+    if apierrors.IsNotFound(err) {
       newStatus := &v1alpha1.PacemakerStatus{
🧹 Nitpick comments (6)
pkg/tnf/operator/starter.go (1)

313-346: Defensive guard for KubeConfig to avoid panics.

If controllerContext.KubeConfig is ever nil (tests or misconfig), NewHealthCheck will panic when creating the REST client. Add a fast check and skip starting the healthcheck in that case (log at Warning).

Example:

 func runPacemakerControllers(..., etcdInformer operatorv1informers.EtcdInformer) {
   // Start the healthcheck controller to watch PacemakerStatus CRs
   // This can start early since it just subscribes to CR updates
   klog.Infof("starting Pacemaker healthcheck controller")
-  healthCheckController := pacemaker.NewHealthCheck(
+  if controllerContext.KubeConfig == nil {
+    klog.Warning("skipping Pacemaker healthcheck controller: KubeConfig is nil")
+  } else {
+    healthCheckController := pacemaker.NewHealthCheck(
       alivenessChecker,
       operatorClient,
       kubeClient,
       controllerContext.EventRecorder,
       controllerContext.KubeConfig,
     )
-  go healthCheckController.Run(ctx, 1)
+    go healthCheckController.Run(ctx, 1)
+  }
pkg/tnf/pacemaker/healthcheck.go (2)

320-335: Avoid panics on REST client creation; prefer graceful failure.

Panic will crash the operator if RESTClientFor fails (bad rest.Config, serializer, env). Return an error from constructor (or log and skip starting) and let the caller decide, to keep the operator up.

Minimal approach (preserve signature): log error and return a no-op controller:

- restClient, err := rest.RESTClientFor(pacemakerConfig)
- if err != nil {
-   panic(fmt.Errorf("failed to create REST client: %w", err))
- }
+ restClient, err := rest.RESTClientFor(pacemakerConfig)
+ if err != nil {
+   klog.Errorf("PacemakerHealthCheck disabled: failed to create REST client: %v", err)
+   return factory.New().ToController("PacemakerHealthCheckDisabled", events.NewInMemoryRecorder("noop"))
+ }

Alternatively, change NewHealthCheck to return (factory.Controller, error) and bubble up.
As per coding guidelines.


730-781: Event deduplication is sensible; minor nit: avoid cleanup per-item.

Current cleanup runs for each warning/error. Consider a periodic cleanup (once per sync) to reduce lock contention.

pkg/tnf/pacemaker/healthcheck_test.go (1)

32-61: Consider removing or clarifying the TODO comment.

The TODO comment on line 56 states "TODO: Mock this for actual tests", but the function is already used successfully in TestNewHealthCheck (lines 168-174). If the nil REST client and informer are acceptable for basic constructor testing, consider either:

  • Removing the TODO comment, or
  • Clarifying what additional testing would require proper mocking
pkg/cmd/pacemakerstatuscollector/collector.go (2)

303-333: Refactor duplicate REST config retrieval logic.

The code for obtaining the Kubernetes REST config is duplicated between createKubernetesClient (lines 304-316) and updatePacemakerStatusCR (lines 323-333). Consider extracting this into a shared helper function to improve maintainability.

Example refactor:

func getRESTConfig() (*rest.Config, error) {
  // Try in-cluster config first
  config, err := rest.InClusterConfig()
  if err != nil {
    // Fall back to kubeconfig
    kubeconfig := os.Getenv("KUBECONFIG")
    if kubeconfig == "" {
      kubeconfig = os.Getenv("HOME") + "/.kube/config"
    }
    config, err = clientcmd.BuildConfigFromFlags("", kubeconfig)
    if err != nil {
      return nil, err
    }
  }
  return config, nil
}

func createKubernetesClient() (*kubernetes.Clientset, error) {
  config, err := getRESTConfig()
  if err != nil {
    return nil, err
  }
  return kubernetes.NewForConfig(config)
}

func updatePacemakerStatusCR(ctx context.Context, client *kubernetes.Clientset, rawXML string, summary v1alpha1.PacemakerSummary, collectionError string) error {
  config, err := getRESTConfig()
  if err != nil {
    return err
  }
  // ... rest of function
}

52-138: XML structure duplication is documented.

The comment on line 53 notes that these XML structures are copied from healthcheck.go to avoid circular dependencies. While code duplication is generally undesirable, this appears to be an intentional trade-off documented in the code.

Consider moving these shared structures to a separate package (e.g., pkg/tnf/pacemaker/types or pkg/tnf/pacemaker/internal) if the circular dependency can be resolved.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 85b3229 and 9f5f422.

📒 Files selected for processing (18)
  • bindata/etcd/etcd-sa-clusterrole.yaml (1 hunks)
  • bindata/etcd/etcd-sa-clusterrolebinding.yaml (1 hunks)
  • bindata/tnfdeployment/cronjob.yaml (1 hunks)
  • cmd/cluster-etcd-operator/main.go (2 hunks)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
  • pkg/cmd/pacemakerstatuscollector/collector.go (1 hunks)
  • pkg/operator/ceohelpers/external_etcd_status.go (2 hunks)
  • pkg/operator/starter.go (2 hunks)
  • pkg/tnf/operator/starter.go (5 hunks)
  • pkg/tnf/operator/starter_test.go (4 hunks)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/doc.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/register.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/types.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • pkg/tnf/pkg/jobs/batch.go (1 hunks)
  • pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • bindata/etcd/etcd-sa-clusterrolebinding.yaml
  • pkg/operator/starter.go
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
  • cmd/cluster-etcd-operator/main.go
  • bindata/etcd/etcd-sa-clusterrole.yaml
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml

[medium] 1-47: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-47: Container should not be privileged

(CKV_K8S_16)


[medium] 1-47: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-47: Containers should not share the host process ID namespace

(CKV_K8S_17)

🔇 Additional comments (18)
pkg/tnf/pkg/jobs/batch.go (1)

42-48: LGTM!

The function correctly mirrors the existing ReadJobV1OrDie pattern, using the same decoder infrastructure to parse CronJob manifests. The implementation is consistent and appropriate for this use case.

bindata/tnfdeployment/cronjob.yaml (1)

35-40: Privileged access is appropriate for pacemaker operations.

The privileged security context and hostPID: true are necessary for the collector to execute pcs status commands and interact with the pacemaker cluster on the host. Static analysis warnings are expected but can be accepted for this system-level operational use case.

pkg/tnf/pkg/jobs/cronjob_controller.go (3)

90-111: LGTM!

The sync method correctly checks operator state and ensures the CronJob only when in Managed state. Error handling and logging are appropriate.


113-132: LGTM!

The ensureCronJob method properly handles both creation and update paths, with correct error handling for NotFound vs other errors.


224-249: LGTM!

The needsUpdate method correctly compares schedule, command array, and image to determine if an update is required. Bounds checking is appropriate.

pkg/tnf/pacemaker/v1alpha1/doc.go (1)

1-3: LGTM.

Package doc and group annotation look good.

pkg/tnf/operator/starter_test.go (1)

96-110: Aliveness checker wiring looks correct.

Passing a concrete MultiAlivenessChecker into HandleDualReplicaClusters matches the new signature.

pkg/operator/ceohelpers/external_etcd_status.go (1)

167-201: Nice generic waiter; sync + context-aware polling is appropriate.

Gates on informer sync, retries with PollUntilContextTimeout, and treats transient errors as retryable. LGTM.

pkg/tnf/pacemaker/v1alpha1/register.go (1)

1-31: LGTM.

SchemeGroupVersion, AddToScheme, and known types are correctly wired.

pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)

1-121: Autogen deepcopy looks correct.

Methods cover all types; safe copies for meta/timestamps.

pkg/tnf/operator/starter.go (2)

229-237: Good: bootstrap wait refactored to shared helper.

Using ceohelpers.WaitForEtcdCondition simplifies readiness gating and centralizes polling behavior.


348-367: CronJob schedule fix looks good.

Five-field spec "* * * * *" is valid for Kubernetes CronJobs; controller wiring is straightforward.

pkg/tnf/pacemaker/healthcheck.go (1)

509-523: Node-state handling matches PR intent.

You warn on unexpected node count and treat standby as offline. Clear messages; aligns with TNF design.

pkg/tnf/pacemaker/healthcheck_test.go (2)

126-140: Excellent use of dynamic timestamps!

Generating dynamic timestamps within the test window (line 131-133) is a best practice that prevents time-based test flakiness. This approach ensures tests remain reliable regardless of when they execute.


143-903: Comprehensive test coverage!

The test suite provides excellent coverage across:

  • Happy path scenarios (healthy clusters, online nodes, started resources)
  • Error conditions (offline nodes, standby nodes, collection errors, stale status)
  • Edge cases (no nodes, mismatched counts, large XML, invalid XML, missing state)
  • Event deduplication across different time windows and state transitions
  • All major parsing and status determination functions

The tests are well-structured, clearly named, and follow consistent patterns.

pkg/tnf/pacemaker/v1alpha1/types.go (1)

1-86: Well-structured API types following Kubernetes conventions.

The type definitions are clean and follow standard Kubernetes API patterns with appropriate kubebuilder markers for CRD generation, deepcopy generation, and subresource configuration.

pkg/cmd/pacemakerstatuscollector/collector.go (2)

140-177: Clean command structure and error handling.

The cobra command setup and main orchestration logic follow good patterns with proper context timeout, error propagation, and logging.


179-301: Well-structured collection and parsing logic.

The collection, parsing, and summary-building functions are cleanly separated with good error handling, size validation (maxXMLSize check on line 196), and time-based filtering for recent events.

Comment on lines 23 to 28
// PacemakerStatusSpec defines the desired state of PacemakerStatus
type PacemakerStatusSpec struct {
	// NodeName identifies which node this status is for
	// +optional
	NodeName string `json:"nodeName,omitempty"`
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify if NodeName field is needed.

The NodeName field in PacemakerStatusSpec is marked optional but doesn't appear to be set anywhere in the collector implementation (pkg/cmd/pacemakerstatuscollector/collector.go lines 363-377, 394-399). The collector uses a singleton CR named "cluster" without node-specific distinction.

If node-specific status tracking isn't part of the current design, consider removing this field to avoid confusion. If it's planned for future use, consider adding a comment explaining its intended purpose.

Run the following to verify NodeName usage across the codebase:


🏁 Script executed:

#!/bin/bash
# Search for any references to NodeName in PacemakerStatus context
rg -n -C3 'NodeName' --type=go

Length of output: 264911



---

🏁 Script executed:

```shell
#!/bin/bash
# Search specifically for PacemakerStatus usage and NodeName references
echo "=== Search 1: PacemakerStatus creation/instantiation ==="
rg -n 'PacemakerStatus' --type=go | grep -v vendor | head -20

echo ""
echo "=== Search 2: Collector code that creates PacemakerStatus ==="
rg -n 'pacemakerstatuscollector' --type=go | grep -v vendor | head -20

echo ""
echo "=== Search 3: Direct reference to PacemakerStatusSpec with NodeName ==="
rg -n 'PacemakerStatusSpec.*NodeName\|NodeName.*PacemakerStatusSpec' --type=go | grep -v vendor

echo ""
echo "=== Search 4: Look for any .NodeName assignments in collector context ==="
rg -n '\.NodeName\s*=' --type=go pkg/cmd/pacemakerstatuscollector/ -A2 -B2 2>/dev/null || echo "File not found, searching broadly:"
find pkg -name "*collector*" -type f

Length of output: 2866


🏁 Script executed:

#!/bin/bash
# View the collector.go to see how PacemakerStatus objects are created
echo "=== Collector.go content (showing PacemakerStatus creation) ==="
rg -n 'PacemakerStatus' pkg/cmd/pacemakerstatuscollector/collector.go -A5 -B5

echo ""
echo "=== Search for Spec assignment in collector ==="
rg -n 'Spec' pkg/cmd/pacemakerstatuscollector/collector.go -C3 | head -40

Length of output: 4469


Remove unused NodeName field from PacemakerStatusSpec or document its intended purpose.

The NodeName field in PacemakerStatusSpec is never set by the collector implementation. Lines 363–376 and 395–398 of collector.go show that only the Status fields are populated (LastUpdated, RawXML, CollectionError, Summary); the Spec is never touched. Combined with the singleton CR design ("cluster"), this field appears unused. Either remove it or add a comment explaining future intent.

🤖 Prompt for AI Agents
In pkg/tnf/pacemaker/v1alpha1/types.go around lines 23–28, the
PacemakerStatusSpec.NodeName field is unused by the collector and the singleton
"cluster" CR design, so remove the NodeName field from the PacemakerStatusSpec
struct, delete its json tag, and update any comments; regenerate CRDs/deepcopy
(run controller-gen or go generate as appropriate), fix any compile-time
references, and run go fmt/go vet to ensure cleanliness. If you prefer to keep
it instead, replace the field with a clear comment explaining its future intent
and why it is currently unused, and still regenerate CRDs/deepcopy and run
formatting checks.

hostNetwork: false
hostPID: true
priorityClassName: system-node-critical
serviceAccountName: etcd-sa
Member

Would it make sense to use the tnf-setup-manager SA, like we do in the other TNF jobs? That one also has the privileged SecurityContextConstraints rule assigned, which I expect the pacemaker cronjob needs as well.


// Track which line we're on to do context-aware replacement
lines := strings.Split(result, "\n")
for i, line := range lines {
Member

tbh, I like using a JobHookFunc like in runJobController() much more than these string replacements... WDYT?

Contributor Author

I was confused as to how the JobHookFunc stuff worked :D
I was like - what is this sorcery

Contributor Author

Now that I am looking at it directly, it makes way more sense. Stealing this approach.
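For the record, the hook approach amounts to mutating the decoded object instead of templating text; a rough sketch (CronJobHookFunc is a made-up name modeled on the JobHookFunc pattern, and the command is illustrative):

package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// CronJobHookFunc mutates the decoded CronJob before it is applied.
type CronJobHookFunc func(*batchv1.CronJob) error

func withNamespace(ns string) CronJobHookFunc {
	return func(cj *batchv1.CronJob) error { cj.Namespace = ns; return nil }
}

func withCommand(command []string) CronJobHookFunc {
	return func(cj *batchv1.CronJob) error {
		// Typed mutation: nothing to quote or escape, unlike string templating.
		cj.Spec.JobTemplate.Spec.Template.Spec.Containers[0].Command = command
		return nil
	}
}

func main() {
	cj := &batchv1.CronJob{}
	cj.Spec.JobTemplate.Spec.Template.Spec.Containers = []corev1.Container{{Name: "collector"}}
	for _, hook := range []CronJobHookFunc{
		withNamespace("openshift-etcd"),
		withCommand([]string{"cluster-etcd-operator", "pacemaker-status-collector"}),
	} {
		if err := hook(cj); err != nil {
			panic(err)
		}
	}
	fmt.Println(cj.Namespace, cj.Spec.JobTemplate.Spec.Template.Spec.Containers[0].Command)
}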

cmd.AddCommand(requestbackup.NewRequestBackupCommand(ctx))
cmd.AddCommand(rev.NewRevCommand(ctx))
cmd.AddCommand(backuprestore.NewBackupServer(ctx))
cmd.AddCommand(pacemakerstatuscollector.NewPacemakerStatusCollectorCommand())
Member

what about adding this in cmd/tnf-setup-runner instead, where all the other TNF-related commands live? (yes, it's not about the setup part, but we can rename it?)

@@ -0,0 +1,414 @@
package pacemakerstatuscollector
Member

don't we want to have this in the tnf package?

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from 9f5f422 to 4c76701 Compare October 20, 2025 15:51
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (4)
pkg/tnf/pkg/jobs/cronjob_controller.go (2)

48-51: Cron schedule example is incorrect for Kubernetes CronJob.

The comment shows a 6-field expression including seconds, but Kubernetes CronJob expects a standard 5-field cron expression (minute, hour, day of month, month, day of week).

Update the example:

-	Schedule  string   // Cron schedule expression (e.g., "*/30 * * * * *" for every 30 seconds)
+	Schedule  string   // Cron schedule expression (e.g., "*/30 * * * *" for every 30 minutes)

180-222: Command array construction is vulnerable to YAML injection.

The command builder uses fmt.Sprintf("%s") without escaping quotes, backslashes, or other special characters. Arguments containing these characters will break the YAML structure. Additionally, the namespace placeholder is not being injected.

Apply this diff to fix both issues:

+import (
+	"strconv"
+	// ... other imports
+)

 func (c *CronJobController) injectValues(manifest []byte) []byte {
 	result := string(manifest)
 
 	// Build command string for YAML array format
 	var commandBuilder strings.Builder
 	commandBuilder.WriteString("[")
 	for i, cmd := range c.command {
 		if i > 0 {
 			commandBuilder.WriteString(", ")
 		}
-		commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd))
+		commandBuilder.WriteString(strconv.Quote(cmd))
 	}
 	commandBuilder.WriteString("]")
 	commandStr := commandBuilder.String()
 
 	// Track which line we're on to do context-aware replacement
 	lines := strings.Split(result, "\n")
 	for i, line := range lines {
+		// Replace namespace if present in template
+		if strings.Contains(line, "namespace: <injected>") {
+			lines[i] = strings.Replace(line, "<injected>", c.namespace, 1)
+		}
 		// Replace name (appears in metadata.name)
 		if strings.Contains(line, "name: <injected>") {
pkg/tnf/operator/starter_test.go (1)

260-271: Test goroutines may leak due to context.Background usage.

The test creates a scoped cancellable context at line 239 but returns context.Background() at line 267. Controllers started by HandleDualReplicaClusters use args.ctx, so they won't be cancelled when the test completes, potentially leaking goroutines.

Apply this diff:

 	return args{
 		initErr: nil, // Default to no error
-		ctx:     context.Background(),
+		ctx:     ctx,
 		controllerContext: &controllercmd.ControllerContext{
pkg/cmd/pacemakerstatuscollector/collector.go (1)

487-531: CRITICAL: Replace brittle string matching with apierrors.IsNotFound (duplicate of previous review).

Line 497 still uses strings.Contains(err.Error(), "not found") which was flagged in a previous review. This is fragile and incorrect for error type checking.

Apply this diff:

 import (
 	"context"
 	"encoding/xml"
 	"fmt"
 	"os"
 	"strings"
 	"time"
 
 	"github.com/spf13/cobra"
+	apierrors "k8s.io/apimachinery/pkg/api/errors"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	if err != nil {
 		// Create new PacemakerStatus if it doesn't exist
-		if strings.Contains(err.Error(), "not found") {
+		if apierrors.IsNotFound(err) {
 			newStatus := &v1alpha1.PacemakerStatus{

Based on learnings.

🧹 Nitpick comments (11)
pkg/tnf/pacemaker/healthcheck_test.go (2)

31-61: Mark helpers with t.Helper() for clearer failures.

Add t.Helper() to test helpers (createTestHealthCheck, createTestHealthCheckWithMockStatus, loadTestXML, getRecentFailuresXML); createTestHealthCheck currently takes no *testing.T, so it needs that parameter added first.

-func createTestHealthCheck() *HealthCheck {
+func createTestHealthCheck(t *testing.T) *HealthCheck {
+	t.Helper()
@@
 func createTestHealthCheckWithMockStatus(t *testing.T, mockStatus *v1alpha1.PacemakerStatus) *HealthCheck {
+	t.Helper()
@@
 func loadTestXML(t *testing.T, filename string) string {
+	t.Helper()
@@
 func getRecentFailuresXML(t *testing.T) string {
+	t.Helper()

Also applies to: 63-117, 119-124, 126-140


338-444: Optional: make dedup timing fully deterministic.

Consider injecting a clock into HealthCheck and using it in shouldRecordEvent to avoid reliance on time.Now in tests.
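
A rough sketch of the injection, assuming a hypothetical nowFunc field on HealthCheck that shouldRecordEvent consults instead of calling time.Now directly:

package pacemaker

import "time"

// Sketch only: the nowFunc field and now() helper are hypothetical additions.
// Production code leaves nowFunc nil; tests pin it to a fixed instant.
type HealthCheck struct {
	nowFunc func() time.Time
}

func (c *HealthCheck) now() time.Time {
	if c.nowFunc != nil {
		return c.nowFunc()
	}
	return time.Now()
}

A test can then set hc.nowFunc to return a fixed time and "advance" the clock by swapping the closure, making the dedup-window assertions deterministic.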

pkg/tnf/operator/starter.go (2)

348-367: 30s vs 1m sampling mismatch.

Schedule is "* * * * *" (every minute). If 30-second sampling is required, either run a short loop inside the collector or switch to another mechanism.
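
One possible shape for the short-loop workaround, assuming the existing collect-and-update logic is factored into a function (collectAndPublish is a hypothetical stand-in):

package main

import (
	"context"
	"time"
)

// Sketch: a Kubernetes CronJob fires at most once per minute, so take two
// samples 30 seconds apart per invocation to approximate 30s granularity.
func runTwicePerMinute(ctx context.Context, collectAndPublish func(context.Context) error) error {
	if err := collectAndPublish(ctx); err != nil {
		return err
	}
	select {
	case <-time.After(30 * time.Second):
	case <-ctx.Done():
		return ctx.Err()
	}
	return collectAndPublish(ctx)
}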


399-500: Operational note: pcs/sudo requirements for CronJob.

Verify the CronJob’s container has needed privileges and NOPASSWD sudo for “pcs status xml” (host mounts, SELinux, SCC). Otherwise collection will fail at runtime.

pkg/tnf/pacemaker/healthcheck.go (2)

320-335: Avoid panic in constructor paths.

Prefer utilruntime.Must (or returning an error) over panic when building scheme/REST client.

+	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
@@
-	if err := v1alpha1.AddToScheme(scheme); err != nil {
-		panic(fmt.Errorf("failed to add scheme: %w", err))
-	}
+	utilruntime.Must(v1alpha1.AddToScheme(scheme))
@@
-	restClient, err := rest.RESTClientFor(pacemakerConfig)
-	if err != nil {
-		panic(fmt.Errorf("failed to create REST client: %w", err))
-	}
+	restClient, err := rest.RESTClientFor(pacemakerConfig)
+	utilruntime.Must(err)

621-634: Comment mismatch with implementation.

determineOverallStatus doesn’t add messages; update the comment to reflect current behavior.

-// Note: This function may add additional error messages to the status for certain conditions
+// Computes overall status from existing warnings/errors; does not mutate messages.
pkg/tnf/pacemaker/v1alpha1/types.go (1)

23-28: Spec.NodeName: keep or document future intent.

Field is currently unused by the collector (singleton “cluster” CR). Add a brief comment about its intended future use to avoid confusion.

pkg/cmd/pacemakerstatuscollector/collector.go (4)

372-375: Consider more robust RC parsing.

The fmt.Sscanf call on line 374 doesn't check for errors. If operation.RC is not a valid integer, rc will remain 0, which might incorrectly match the success code.

Consider:

-// Parse RC
-rc := 0
-if operation.RC != "" {
-	fmt.Sscanf(operation.RC, "%d", &rc)
-}
+// Parse RC
+rc := 0
+if operation.RC != "" {
+	if _, err := fmt.Sscanf(operation.RC, "%d", &rc); err != nil {
+		klog.Warningf("Failed to parse RC value '%s': %v", operation.RC, err)
+	}
+}

276-426: Consider splitting this large function into focused helpers.

This ~150-line function handles multiple distinct responsibilities: building summaries, processing nodes, processing resources, analyzing history, and tracking fencing events. While the logic is straightforward, splitting into smaller functions (e.g., buildNodeStatus, buildResourceStatus, buildNodeHistory, buildFencingHistory) would improve readability and testability.


458-468: Eliminate config creation duplication.

Lines 458-468 duplicate the config creation logic from createKubernetesClient (lines 429-441). Consider extracting a shared getKubeConfig() (*rest.Config, error) helper or reusing the existing client creation function.


446-556: Consider splitting this large function.

This ~110-line function handles multiple responsibilities: config creation, scheme setup, REST client creation, CR retrieval, and create-or-update logic. Consider extracting helpers like createRESTClient, createNewStatus, and updateExistingStatus for better readability and testability.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 9f5f422 and 4c76701.

📒 Files selected for processing (17)
  • bindata/tnfdeployment/clusterrole.yaml (1 hunks)
  • bindata/tnfdeployment/cronjob.yaml (1 hunks)
  • cmd/cluster-etcd-operator/main.go (2 hunks)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
  • pkg/cmd/pacemakerstatuscollector/collector.go (1 hunks)
  • pkg/operator/ceohelpers/external_etcd_status.go (2 hunks)
  • pkg/operator/starter.go (1 hunks)
  • pkg/tnf/operator/starter.go (5 hunks)
  • pkg/tnf/operator/starter_test.go (4 hunks)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/doc.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/register.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/types.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • pkg/tnf/pkg/jobs/batch.go (1 hunks)
  • pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/tnf/pacemaker/v1alpha1/doc.go
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml

[medium] 1-47: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-47: Container should not be privileged

(CKV_K8S_16)


[medium] 1-47: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-47: Containers should not share the host process ID namespace

(CKV_K8S_17)

🔇 Additional comments (20)
pkg/tnf/pkg/jobs/batch.go (1)

42-48: LGTM!

The new ReadCronJobV1OrDie function correctly mirrors the existing ReadJobV1OrDie pattern, reuses the same codec infrastructure, and follows consistent error handling conventions.

bindata/tnfdeployment/cronjob.yaml (1)

35-47: Security posture is appropriate for Pacemaker monitoring.

The privileged security context, hostPID access, and system-node-critical priority are necessary for collecting Pacemaker cluster status from the host. These elevated permissions align with the operational requirements of monitoring external etcd clusters.

Note: Static analysis security warnings are expected and acceptable for this privileged monitoring workload.

pkg/operator/ceohelpers/external_etcd_status.go (1)

167-201: LGTM!

The WaitForEtcdCondition helper correctly gates on informer sync, uses context-aware polling with proper timeout handling, and provides good observability through structured logging. Error handling appropriately retries transient condition-check failures while respecting context cancellation.

cmd/cluster-etcd-operator/main.go (1)

17-17: LGTM!

The Pacemaker status collector command is correctly imported and registered, following the established pattern for other subcommands in the operator.

Also applies to: 82-82

pkg/operator/starter.go (1)

619-634: LGTM!

The signature extension to HandleDualReplicaClusters correctly propagates the dynamicClient and AlivenessChecker parameters needed for Pacemaker health monitoring and status collection. The changes integrate cleanly with the existing operator startup flow.

bindata/tnfdeployment/clusterrole.yaml (1)

32-50: LGTM!

The RBAC rules for pacemakerstatuses are correctly structured, covering both the main resource and the status subresource with appropriate verbs for the Pacemaker status collector workflow.

pkg/tnf/operator/starter_test.go (1)

96-109: LGTM!

The test correctly instantiates alivenessChecker and propagates both dynamicClient and alivenessChecker to HandleDualReplicaClusters, maintaining test coverage for the new signature.

pkg/tnf/pacemaker/v1alpha1/register.go (1)

24-31: LGTM: scheme registration is correct.

Types and list registered; group/version helpers look good.

pkg/tnf/operator/starter.go (2)

313-346: Ensure the healthcheck’s custom informer is started.

The Pacemaker informer created in NewHealthCheck must be running for WithInformers cache sync and event-driven resyncs. Confirm library-go factory starts it; if not, start it (e.g., go informer.Run(ctx.Done())) where the informer is created.
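
If it turns out the factory does not run it, a guarded start at the point of construction could look like this sketch (function name is illustrative):

package pacemaker

import (
	"context"

	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// Sketch: run the raw SharedIndexInformer ourselves and block until its
// cache syncs, so WithInformers never waits on an informer nobody started.
func startPacemakerInformer(ctx context.Context, informer cache.SharedIndexInformer) {
	go informer.Run(ctx.Done())
	if !cache.WaitForCacheSync(ctx.Done(), informer.HasSynced) {
		klog.Error("pacemaker status informer cache failed to sync")
	}
}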


61-76: All call sites correctly updated with new parameters.

Verification confirms that both invocations of HandleDualReplicaClusters have been properly updated:

  1. pkg/operator/starter.go:619 — passes dynamicClient (line 630) and AlivenessChecker (line 631)
  2. pkg/tnf/operator/starter_test.go:97 — passes tt.args.dynamicClient (line 108) and alivenessChecker (line 109)

No additional call sites exist in the codebase.

pkg/tnf/pacemaker/healthcheck.go (3)

336-358: Confirm informer lifecycle.

SharedIndexInformer is constructed here and passed to WithInformers. Verify it’s actually started by the factory, or start it to avoid WaitForCacheSync deadlock.


703-728: Event dedup cleanup is efficient and safe.

Cleanup before checking saves memory and avoids repeated growth. Logic looks good.


532-544: Resource agent prefixes are validated and correctly match actual pacemaker output.

The constants resourceAgentKubelet = "systemd:kubelet" and resourceAgentEtcd = "ocf:heartbeat:podman-etcd" are verified against test fixtures representing real pcs cluster output. The strings.HasPrefix() matching logic is sound, and resourcesStarted counts are computed correctly.

pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)

1-211: LGTM! Autogenerated deepcopy implementations are correct.

This file is autogenerated by deepcopy-gen and follows the standard patterns for deep copying Kubernetes API types. The implementations correctly handle nil checks, slice allocations, and nested struct copying.

pkg/cmd/pacemakerstatuscollector/collector.go (6)

1-1: Address the package location question from previous review.

A previous reviewer (slintes) questioned whether this should be in the tnf package. The current location (pkg/cmd/pacemakerstatuscollector) suggests this is a standalone command, but if it's intended as part of the tnf testing framework, consider moving to pkg/tnf/cmd/pacemakerstatuscollector or similar for better organization.

Based on learnings.


24-50: LGTM! Well-defined constants with appropriate values.

The constants are well-chosen:

  • XML bomb protection with 10MB limit is good security practice
  • Time windows (5min for failures, 24h for fencing) are reasonable for event detection
  • Pacemaker time formats are correctly specified for parsing

189-202: LGTM! Standard Cobra command implementation.

The command constructor follows best practices with proper error handling and logging.


204-226: LGTM! Clean orchestration with appropriate timeout.

The 30-second timeout provides adequate margin for the 10-second pcs command execution plus Kubernetes operations.


228-274: LGTM! Robust collection with good error handling.

The function properly handles errors while still preserving raw XML for debugging (line 266-267), and includes XML bomb protection.


428-444: LGTM! Standard Kubernetes client creation.

The fallback logic from in-cluster config to kubeconfig is correctly implemented.

Comment on lines 52 to 187
// XML structures for parsing pacemaker status
// These are copied from healthcheck.go to avoid circular dependencies
type PacemakerResult struct {
	XMLName        xml.Name       `xml:"pacemaker-result"`
	Summary        Summary        `xml:"summary"`
	Nodes          Nodes          `xml:"nodes"`
	Resources      Resources      `xml:"resources"`
	NodeAttributes NodeAttributes `xml:"node_attributes"`
	NodeHistory    NodeHistory    `xml:"node_history"`
	FenceHistory   FenceHistory   `xml:"fence_history"`
}

type Summary struct {
	Stack               Stack               `xml:"stack"`
	CurrentDC           CurrentDC           `xml:"current_dc"`
	NodesConfigured     NodesConfigured     `xml:"nodes_configured"`
	ResourcesConfigured ResourcesConfigured `xml:"resources_configured"`
}

type Stack struct {
	PacemakerdState string `xml:"pacemakerd-state,attr"`
}

type CurrentDC struct {
	WithQuorum string `xml:"with_quorum,attr"`
}

type NodesConfigured struct {
	Number string `xml:"number,attr"`
}

type ResourcesConfigured struct {
	Number string `xml:"number,attr"`
}

type Nodes struct {
	Node []Node `xml:"node"`
}

type Node struct {
	Name             string `xml:"name,attr"`
	ID               string `xml:"id,attr"`
	Online           string `xml:"online,attr"`
	Standby          string `xml:"standby,attr"`
	StandbyOnFail    string `xml:"standby_onfail,attr"`
	Maintenance      string `xml:"maintenance,attr"`
	Pending          string `xml:"pending,attr"`
	Unclean          string `xml:"unclean,attr"`
	Shutdown         string `xml:"shutdown,attr"`
	ExpectedUp       string `xml:"expected_up,attr"`
	IsDC             string `xml:"is_dc,attr"`
	ResourcesRunning string `xml:"resources_running,attr"`
	Type             string `xml:"type,attr"`
}

type NodeAttributes struct {
	Node []NodeAttributeSet `xml:"node"`
}

type NodeAttributeSet struct {
	Name      string          `xml:"name,attr"`
	Attribute []NodeAttribute `xml:"attribute"`
}

type NodeAttribute struct {
	Name  string `xml:"name,attr"`
	Value string `xml:"value,attr"`
}

type Resources struct {
	Clone    []Clone    `xml:"clone"`
	Resource []Resource `xml:"resource"`
}

type Clone struct {
	Resource []Resource `xml:"resource"`
}

type Resource struct {
	ID             string  `xml:"id,attr"`
	ResourceAgent  string  `xml:"resource_agent,attr"`
	Role           string  `xml:"role,attr"`
	TargetRole     string  `xml:"target_role,attr"`
	Active         string  `xml:"active,attr"`
	Orphaned       string  `xml:"orphaned,attr"`
	Blocked        string  `xml:"blocked,attr"`
	Managed        string  `xml:"managed,attr"`
	Failed         string  `xml:"failed,attr"`
	FailureIgnored string  `xml:"failure_ignored,attr"`
	NodesRunningOn string  `xml:"nodes_running_on,attr"`
	Node           NodeRef `xml:"node"`
}

type NodeRef struct {
	Name string `xml:"name,attr"`
}

type NodeHistory struct {
	Node []NodeHistoryNode `xml:"node"`
}

type NodeHistoryNode struct {
	Name            string            `xml:"name,attr"`
	ResourceHistory []ResourceHistory `xml:"resource_history"`
}

type ResourceHistory struct {
	ID               string             `xml:"id,attr"`
	OperationHistory []OperationHistory `xml:"operation_history"`
}

type OperationHistory struct {
	Call         string `xml:"call,attr"`
	Task         string `xml:"task,attr"`
	RC           string `xml:"rc,attr"`
	RCText       string `xml:"rc_text,attr"`
	ExitReason   string `xml:"exit-reason,attr"`
	LastRCChange string `xml:"last-rc-change,attr"`
	ExecTime     string `xml:"exec-time,attr"`
	QueueTime    string `xml:"queue-time,attr"`
}

type FenceHistory struct {
	FenceEvent []FenceEvent `xml:"fence_event"`
}

type FenceEvent struct {
	Target     string `xml:"target,attr"`
	Action     string `xml:"action,attr"`
	Delegate   string `xml:"delegate,attr"`
	Client     string `xml:"client,attr"`
	Origin     string `xml:"origin,attr"`
	Status     string `xml:"status,attr"`
	ExitReason string `xml:"exit-reason,attr"`
	Completed  string `xml:"completed,attr"`
}
Copy link


🛠️ Refactor suggestion | 🟠 Major

Consider extracting duplicated XML types to a shared package.

The comment on line 53 indicates these types are copied from healthcheck.go to avoid circular dependencies. Code duplication increases maintenance burden and risk of inconsistencies.

Consider:

  1. Extract these types to a shared package (e.g., pkg/tnf/pacemaker/types)
  2. Have both collector.go and healthcheck.go import from the shared location
  3. If a circular dependency truly exists, break it by restructuring the package hierarchy
🤖 Prompt for AI Agents
In pkg/cmd/pacemakerstatuscollector/collector.go lines 52-187: these XML struct
definitions are duplicated from healthcheck.go; extract them into a new shared
package (e.g., pkg/tnf/pacemaker/types) and update both collector.go and
healthcheck.go to import that package; create the package with the exact type
declarations and appropriate package name, run goimports/gofmt, and if you hit a
circular import, refactor by moving only the plain type definitions into the
lowest-level package so that both callers depend on it (rather than each other)
and adjust import paths accordingly.
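
A minimal sketch of the shared package (the path is a suggestion; the structs above would move there verbatim, only two are repeated here for brevity):

// Hypothetical file: pkg/tnf/pkg/pacemaker/types/types.go
package types

import "encoding/xml"

// PacemakerResult and the other structs shown above live here once, so
// collector.go and healthcheck.go both import this package instead of
// keeping private copies.
type PacemakerResult struct {
	XMLName xml.Name `xml:"pacemaker-result"`
	Summary Summary  `xml:"summary"`
}

type Summary struct {
	Stack Stack `xml:"stack"`
}

type Stack struct {
	PacemakerdState string `xml:"pacemakerd-state,attr"`
}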

Comment on lines +35 to +42
// RawXML contains the raw XML output from pcs status xml command
// Kept for debugging purposes only; healthcheck should not need to parse this
// +optional
RawXML string `json:"rawXML,omitempty"`

Copy link


⚠️ Potential issue | 🟡 Minor

Update RawXML comment to match usage.

HealthCheck parses RawXML today; the comment says it “should not need to parse this.”

-// Kept for debugging purposes only; healthcheck should not need to parse this
+// Kept for debugging and parsed by the healthcheck to derive status
🤖 Prompt for AI Agents
In pkg/tnf/pacemaker/v1alpha1/types.go around lines 35 to 39, the RawXML field
comment incorrectly states healthcheck should not parse this; update the comment
to reflect that RawXML contains the raw XML output from the pcs status xml
command and is parsed by HealthCheck (kept for debugging and runtime health
validation). Modify the comment to mention both uses (debugging and that
HealthCheck may parse it), keep the +optional marker and json tag unchanged.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from 4c76701 to b915543 Compare October 20, 2025 19:01
Copy link

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/tnf/operator/starter.go (1)

384-392: Guard against slice index panic when setting container command.

Accessing Command[1] assumes at least two elements. Fallback to setting the full command.

- job.Spec.Template.Spec.Containers[0].Command[1] = jobType.GetSubCommand()
+ cmd := &job.Spec.Template.Spec.Containers[0].Command
+ if len(*cmd) >= 2 {
+   (*cmd)[1] = jobType.GetSubCommand()
+ } else {
+   *cmd = []string{"tnf-runtime", jobType.GetSubCommand()}
+ }
♻️ Duplicate comments (8)
pkg/cmd/pacemakerstatuscollector/collector.go (3)

3-22: Missing apierrors import for proper error handling.

The past review comment suggested using apierrors.IsNotFound() instead of string matching (line 497), but the required import "k8s.io/apimachinery/pkg/api/errors" is missing.

Add the import:

 import (
 	"context"
 	"encoding/xml"
 	"fmt"
 	"os"
 	"strings"
 	"time"
 
 	"github.com/spf13/cobra"
+	apierrors "k8s.io/apimachinery/pkg/api/errors"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

52-187: Extract duplicated XML types to a shared package.

These types are still duplicated from healthcheck.go. Code duplication increases maintenance burden and creates risk of inconsistencies between the two copies.

Based on learnings


487-531: Use apierrors.IsNotFound instead of string matching.

String checks on error messages are brittle. This was flagged in a previous review but has not been addressed.

Apply this diff after adding the apierrors import:

 	if err != nil {
 		// Create new PacemakerStatus if it doesn't exist
-		if strings.Contains(err.Error(), "not found") {
+		if apierrors.IsNotFound(err) {
 			newStatus := &v1alpha1.PacemakerStatus{
pkg/tnf/operator/starter_test.go (1)

267-267: Return the scoped context to prevent goroutine leaks.

The test creates a cancellable context at line 239 but returns context.Background() instead. Controllers started by HandleDualReplicaClusters will use this uncancellable context, potentially leaking goroutines when the test completes.

Apply this diff:

   return args{
     initErr: nil, // Default to no error
-    ctx:     context.Background(),
+    ctx:     ctx,
     controllerContext: &controllercmd.ControllerContext{
pkg/tnf/pkg/pacemaker/v1alpha1/types.go (2)

25-28: Spec.NodeName appears unused; deprecate or remove.

Collector uses a singleton cluster-level CR; NodeName isn’t populated. Either remove it or mark as deprecated to avoid confusion.

 type PacemakerStatusSpec struct {
-	// NodeName identifies which node this status is for
-	// +optional
-	NodeName string `json:"nodeName,omitempty"`
+	// Deprecated: currently unused; a single cluster-level CR is used.
+	// +optional
+	NodeName string `json:"nodeName,omitempty"`
 }

35-38: RawXML comment contradicts usage; update to reflect parsing by healthcheck.

Tests and healthcheck parse RawXML today.

 	// RawXML contains the raw XML output from pcs status xml command
-	// Kept for debugging purposes only; healthcheck should not need to parse this
+	// Kept for debugging and parsed by the healthcheck to derive status
 	// +optional
bindata/tnfdeployment/cronjob.yaml (1)

5-5: Hardcoded namespace diverges from configurable controller Namespace.

Replace with injected value to keep manifest and controller in sync.

-  namespace: openshift-etcd
+  namespace: <injected>
pkg/tnf/pkg/pacemaker/healthcheck_test.go (1)

280-302: Fix assertion messages to reference updateOperatorStatus (not updateOperatorAvailability).

Same issue was flagged earlier; please update all three occurrences.

- require.NoError(t, err, "updateOperatorAvailability should not return an error")
+ require.NoError(t, err, "updateOperatorStatus should not return an error")

Apply at Lines 281, 291, and 301.

🧹 Nitpick comments (23)
pkg/cmd/pacemakerstatuscollector/collector.go (3)

372-375: Consider explicitly handling fmt.Sscanf error.

While the default value of 0 is a reasonable fallback, making this explicit improves code clarity.

 				// Parse RC
 				rc := 0
 				if operation.RC != "" {
-					fmt.Sscanf(operation.RC, "%d", &rc)
+					if _, err := fmt.Sscanf(operation.RC, "%d", &rc); err != nil {
+						klog.Warningf("Failed to parse RC value '%s', defaulting to 0: %v", operation.RC, err)
+					}
 				}

276-426: Consider splitting this long function.

This 150-line function handles multiple distinct concerns (summary, nodes, resources, history, fencing). Extracting helper functions would improve maintainability and testability.

Example structure:

func buildStatusComponents(result *PacemakerResult) (...) {
	summary := buildSummary(result)
	nodes := buildNodeStatuses(result)
	resources := buildResourceStatuses(result)
	nodeHistory := buildNodeHistory(result)
	fencingHistory := buildFencingHistory(result)
	return summary, nodes, resources, nodeHistory, fencingHistory
}

func buildSummary(result *PacemakerResult) v1alpha1.PacemakerSummary { ... }
func buildNodeStatuses(result *PacemakerResult) []v1alpha1.NodeStatus { ... }
// etc.

458-468: Duplicate config creation logic.

This config creation logic duplicates the implementation in createKubernetesClient (lines 429-441). Consider reusing the existing function or extracting the common logic.

 func updatePacemakerStatusCR(
 	ctx context.Context,
 	client *kubernetes.Clientset,
 	rawXML string,
 	summary v1alpha1.PacemakerSummary,
 	nodes []v1alpha1.NodeStatus,
 	resources []v1alpha1.ResourceStatus,
 	nodeHistory []v1alpha1.NodeHistoryEntry,
 	fencingHistory []v1alpha1.FencingEvent,
 	collectionError string,
 ) error {
-	// Create a REST client for the custom resource
-	config, err := rest.InClusterConfig()
-	if err != nil {
-		kubeconfig := os.Getenv("KUBECONFIG")
-		if kubeconfig == "" {
-			kubeconfig = os.Getenv("HOME") + "/.kube/config"
-		}
-		config, err = clientcmd.BuildConfigFromFlags("", kubeconfig)
-		if err != nil {
-			return err
-		}
-	}
+	config, err := getRestConfig()
+	if err != nil {
+		return fmt.Errorf("failed to get rest config: %w", err)
+	}

Then extract a shared helper:

func getRestConfig() (*rest.Config, error) {
	config, err := rest.InClusterConfig()
	if err != nil {
		kubeconfig := os.Getenv("KUBECONFIG")
		if kubeconfig == "" {
			kubeconfig = os.Getenv("HOME") + "/.kube/config"
		}
		config, err = clientcmd.BuildConfigFromFlags("", kubeconfig)
		if err != nil {
			return nil, err
		}
	}
	return config, nil
}
cmd/tnf-runtime/main.go (1)

30-30: Use io.Discard instead of deprecated ioutil.Discard.

ioutil is deprecated since Go 1.16. Switch to io.Discard.

-import (
-	goflag "flag"
-	"fmt"
-	"io/ioutil"
+import (
+	goflag "flag"
+	"fmt"
+	"io"
@@
-	grpclog.SetLoggerV2(grpclog.NewLoggerV2(ioutil.Discard, os.Stderr, os.Stderr))
+	grpclog.SetLoggerV2(grpclog.NewLoggerV2(io.Discard, os.Stderr, os.Stderr))
bindata/tnfdeployment/cronjob.yaml (2)

31-35: Add resource limits to avoid unbounded usage.

Set modest limits alongside requests.

               resources:
                 requests:
                   cpu: 10m
                   memory: 32Mi
+                limits:
+                  cpu: 200m
+                  memory: 128Mi

35-41: Privilege/hostPID: verify necessity and minimize if possible.

Privileged, allowPrivilegeEscalation, and hostPID=true trigger multiple policy findings. If not strictly required by pcs, drop or scope down. Consider adding seccompProfile: RuntimeDefault when feasible.

pkg/tnf/pkg/pacemaker/v1alpha1/types.go (1)

7-12: Cluster-scoped CRD needs a non-namespaced generated client marker.

Add +genclient:nonNamespaced so any generated typed clients target cluster scope.

 // +genclient
+// +genclient:nonNamespaced
 // +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
 // +kubebuilder:object:root=true
 // +kubebuilder:resource:path=pacemakerstatuses,scope=Cluster
 // +kubebuilder:subresource:status
pkg/tnf/pkg/pacemaker/statuscollector_test.go (1)

14-20: Mark helpers with t.Helper() for clearer failure lines.

Improves test diagnostics without behavior change.

 func loadTestXML(t *testing.T, filename string) string {
+	t.Helper()
@@
 func getRecentFailuresXML(t *testing.T) string {
+	t.Helper()

Also applies to: 22-37

pkg/tnf/operator/starter.go (2)

313-325: Defensive nil-check for alivenessChecker.

Avoid a nil deref if a caller forgets to wire it in tests/future code.

 func runPacemakerControllers(ctx context.Context, controllerContext *controllercmd.ControllerContext, operatorClient v1helpers.StaticPodOperatorClient, kubeClient kubernetes.Interface, kubeInformersForNamespaces v1helpers.KubeInformersForNamespaces, alivenessChecker *health.MultiAlivenessChecker, etcdInformer operatorv1informers.EtcdInformer) {
+  if alivenessChecker == nil {
+    klog.V(2).Info("alivenessChecker is nil; creating default MultiAlivenessChecker")
+    alivenessChecker = health.NewMultiAlivenessChecker()
+  }
   // Start the healthcheck controller to watch PacemakerStatus CRs

488-491: Log level for “job not found” should not be error.

“Not found” is a benign case here; use Info or V(2).

- klog.Errorf("fencing job not found, nothing to do")
+ klog.V(2).Info("fencing job not found; nothing to delete")
pkg/tnf/pkg/pacemaker/healthcheck_test.go (1)

47-57: Fix misleading comment; no mock REST client is wired here.

The helper returns a HealthCheck with pacemakerRESTClient=nil. Update the comment to avoid confusion.

- // Note: This creates a HealthCheck with mock REST client
- // In a real scenario, we'd need to mock the PacemakerStatus CR
+ // Note: This creates a minimal HealthCheck without a pacemaker REST client/informer.
+ // Use createTestHealthCheckWithMockStatus when you need a mocked PacemakerStatus CR.
pkg/tnf/pkg/pacemaker/statuscollector.go (3)

382-387: Prefer strconv.Atoi over fmt.Sscanf for RC parsing.

Simpler, faster, fewer failure modes.

- rc := 0
- if operation.RC != "" {
-   if _, err := fmt.Sscanf(operation.RC, "%d", &rc); err != nil {
-     klog.Warningf("Failed to parse RC value '%s': %v", operation.RC, err)
-   }
- }
+ rc := 0
+ if operation.RC != "" {
+   if v, err := strconv.Atoi(operation.RC); err == nil {
+     rc = v
+   } else {
+     klog.Warningf("Failed to parse RC value '%s': %v", operation.RC, err)
+   }
+ }

Remember to add: import "strconv".


299-309: Remove unused nodeIPMap (dead code).

You populate nodeIPMap but never read it.


223-236: Drop unused Kubernetes client creation or use it.

createKubernetesClient()’s result isn’t used. Either remove it and the client parameter in updatePacemakerStatusCR, or use it for something tangible (events, RBAC sanity).

- // Create Kubernetes client
- client, err := createKubernetesClient()
- if err != nil {
-   return fmt.Errorf("failed to create kubernetes client: %w", err)
- }
-
- // Update PacemakerStatus CR
- if err := updatePacemakerStatusCR(ctx, client, rawXML, summary, nodes, resources, nodeHistory, fencingHistory, collectionError); err != nil {
+ // Update PacemakerStatus CR
+ if err := updatePacemakerStatusCR(ctx, rawXML, summary, nodes, resources, nodeHistory, fencingHistory, collectionError); err != nil {
    return fmt.Errorf("failed to update PacemakerStatus CR: %w", err)
 }

And adjust the function signature accordingly (see next comment).

pkg/tnf/pkg/jobs/cronjob_controller.go (1)

3-8: Import strconv for safe string quoting.

Used below to properly escape command/schedule strings.

 import (
   "context"
   "fmt"
+  "strconv"
   "strings"
   "time"
 )
pkg/tnf/pkg/pacemaker/healthcheck.go (8)

31-34: Clarify/guard the “hourly reminders for 24h” fencing behavior.

You dedupe fencing warnings for 1h and rely on the collector limiting FencingHistory to 24h (comment on Line 390). That’s fine, but brittle if the collector window ever changes. Consider enforcing a hard 24h cap here (track firstSeen timestamp per eventKey and stop reminders after 24h). Otherwise, this logic will remind indefinitely.

Also applies to: 388-396
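
A sketch of the hard cap, assuming a small helper owned by HealthCheck (all names hypothetical):

package pacemaker

import (
	"sync"
	"time"
)

// Sketch: remember when each fencing event key was first seen and stop
// reminding once it is older than 24h, regardless of the collector window.
type fencingReminder struct {
	mu        sync.Mutex
	firstSeen map[string]time.Time
}

func newFencingReminder() *fencingReminder {
	return &fencingReminder{firstSeen: map[string]time.Time{}}
}

func (r *fencingReminder) shouldRemind(eventKey string, now time.Time) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	first, ok := r.firstSeen[eventKey]
	if !ok {
		r.firstSeen[eventKey] = now
		return true
	}
	return now.Sub(first) < 24*time.Hour
}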


35-37: Use dynamic expected count (eligible nodes) for resource checks to avoid cascading errors.

Hardcoding expectedNodeCount=2 makes resource checks error during planned one‑node windows (e.g., node replacement), even though you already error on the offline node. Use the count of eligible nodes (online and not standby) as the expected value.

Apply this diff:

@@
-// Expected number of nodes in ExternalEtcd cluster
-const (
-    expectedNodeCount = 2
-)
+// Nominal expected node count is 2, but resource checks should use the number of eligible nodes
+const (
+    expectedNodeCount = 2 // used for informational warnings only
+)
@@
 func (c *HealthCheck) checkResourceStatus(pacemakerStatus *v1alpha1.PacemakerStatus, status *HealthStatus) {
@@
-    // Check if we have both resources started on all nodes
-    kubeletCount := len(resourcesStarted[resourceKubelet])
-    etcdCount := len(resourcesStarted[resourceEtcd])
-
-    c.validateResourceCount(resourceKubelet, kubeletCount, status)
-    c.validateResourceCount(resourceEtcd, etcdCount, status)
+    // Compute eligible nodes (online and not standby)
+    eligible := 0
+    for _, n := range pacemakerStatus.Status.Nodes {
+        if n.Online && !n.Standby {
+            eligible++
+        }
+    }
+    // Validate both resources are started on all eligible nodes
+    kubeletCount := len(resourcesStarted[resourceKubelet])
+    etcdCount := len(resourcesStarted[resourceEtcd])
+    c.validateResourceCount(resourceKubelet, kubeletCount, eligible, status)
+    c.validateResourceCount(resourceEtcd, etcdCount, eligible, status)
@@
-func (c *HealthCheck) validateResourceCount(resourceName string, actualCount int, status *HealthStatus) {
-    if actualCount < expectedNodeCount {
-        status.Errors = append(status.Errors, fmt.Sprintf(msgResourceNotStarted,
-            resourceName, actualCount, expectedNodeCount))
+func (c *HealthCheck) validateResourceCount(resourceName string, actualCount, expectedCount int, status *HealthStatus) {
+    if actualCount < expectedCount {
+        status.Errors = append(status.Errors, fmt.Sprintf(msgResourceNotStarted,
+            resourceName, actualCount, expectedCount))
     }
 }
@@
-    // Check if we have the expected number of nodes.
-    // This should almost always be 2, but 1 node is possible during a control-plane node replacement event.
-    if len(nodes) != expectedNodeCount {
+    // Check nominal expected number of nodes (informational).
+    // This should almost always be 2; 1 is possible during control-plane node replacement.
+    if len(nodes) != expectedNodeCount {
         status.Warnings = append(status.Warnings, fmt.Sprintf("Expected %d nodes, found %d", expectedNodeCount, len(nodes)))
     }

Also applies to: 358-364, 366-372, 320-324


165-187: Avoid double resync; remove namespace indexer for a cluster‑scoped CR.

Informer resync is set to 30s and controller also ResyncEvery(30s). That’s redundant. Also, PacemakerStatus is cluster‑scoped, so NamespaceIndex is unnecessary.

Apply this diff:

@@
-    informer := cache.NewSharedIndexInformer(
+    informer := cache.NewSharedIndexInformer(
         &cache.ListWatch{
@@
-        &v1alpha1.PacemakerStatus{},
-        healthCheckResyncInterval,
-        cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc},
+        &v1alpha1.PacemakerStatus{},
+        0, // rely on controller ResyncEvery
+        cache.Indexers{}, // cluster-scoped; no namespace index
     )

Ensure factory starts this raw SharedIndexInformer (it typically does for informers passed to WithInformers).

Also applies to: 203-211


79-81: Remove unused constant.

kubernetesAPIPath is defined but never used. Drop it.

Apply this diff:

@@
-    // Kubernetes API constants
-    kubernetesAPIPath     = "/apis"
-    pacemakerResourceName = "pacemakerstatuses"
+    // Kubernetes API constants
+    pacemakerResourceName = "pacemakerstatuses"

122-141: Trim unused struct fields/state.

  • HealthCheck.kubeClient isn’t used; remove the field and constructor arg.
  • HealthCheck.pacemakerInformer is stored but never used; rely on factory lifecycle instead.
  • lastProcessedStatus is written but not read; remove until needed to avoid confusion.

Do you want me to produce a follow‑up diff removing these safely across the file and call sites?

Also applies to: 188-196, 261-265


153-157: Prefer returning an error over panicking during controller construction.

Panics on REST client/scheme setup will crash the operator. Consider changing NewHealthCheck to return (factory.Controller, error) and bubble errors to the caller to fail fast without stack traces from panic.

Also applies to: 158-163
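
The error-returning shape could look roughly like this (parameter list abbreviated; the v1alpha1 import path is assumed from this repo's layout):

package pacemaker

import (
	"fmt"

	"github.com/openshift/library-go/pkg/controller/factory"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/rest"

	"github.com/openshift/cluster-etcd-operator/pkg/tnf/pkg/pacemaker/v1alpha1"
)

// Sketch: bubble construction failures up instead of panicking, so the
// operator can fail fast with a clean error.
func NewHealthCheck(config *rest.Config) (factory.Controller, error) {
	scheme := runtime.NewScheme()
	if err := v1alpha1.AddToScheme(scheme); err != nil {
		return nil, fmt.Errorf("failed to add scheme: %w", err)
	}
	if _, err := rest.RESTClientFor(config); err != nil {
		return nil, fmt.Errorf("failed to create REST client: %w", err)
	}
	// ... build the informer and controller as today, returning errors up ...
	return nil, nil
}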


113-121: Unknown status carries Errors but does not set degraded — confirm behavior.

newUnknownHealthStatus populates Errors, but updateOperatorStatus ignores Unknown for degradation. Is this intentional? If the CR fetch/staleness indicates we may be blind to pacemaker health, product may want Degraded=True with an actionable message.

Also applies to: 411-428


595-604: Use HasPrefix to classify warning types.

Contains can misclassify if messages embed the tokens mid‑string. Prefer HasPrefix since you control the message format.

Apply this diff:

@@
-    switch {
-    case strings.Contains(warning, warningPrefixFailedAction):
+    switch {
+    case strings.HasPrefix(warning, warningPrefixFailedAction):
         c.eventRecorder.Warningf(eventReasonFailedAction, msgDetectedFailedAction, warning)
-    case strings.Contains(warning, warningPrefixFencingEvent):
+    case strings.HasPrefix(warning, warningPrefixFencingEvent):
         c.eventRecorder.Warningf(eventReasonFencingEvent, msgDetectedFencing, warning)
     default:
         c.eventRecorder.Warningf(eventReasonWarning, msgPacemakerWarning, warning)
     }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 4c76701 and b915543.

📒 Files selected for processing (26)
  • Dockerfile.ocp (1 hunks)
  • bindata/tnfdeployment/clusterrole.yaml (1 hunks)
  • bindata/tnfdeployment/cronjob.yaml (1 hunks)
  • bindata/tnfdeployment/job.yaml (1 hunks)
  • cmd/tnf-runtime/main.go (4 hunks)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
  • pkg/cmd/pacemakerstatuscollector/collector.go (1 hunks)
  • pkg/operator/ceohelpers/external_etcd_status.go (2 hunks)
  • pkg/operator/starter.go (1 hunks)
  • pkg/tnf/operator/starter.go (5 hunks)
  • pkg/tnf/operator/starter_test.go (4 hunks)
  • pkg/tnf/pkg/jobs/batch.go (1 hunks)
  • pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/client_helpers.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/statuscollector.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/statuscollector_test.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/standby_node.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/doc.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/register.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/types.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
✅ Files skipped from review due to trivial changes (4)
  • pkg/tnf/pkg/pacemaker/testdata/offline_node.xml
  • pkg/tnf/pkg/pacemaker/testdata/healthy_cluster.xml
  • pkg/tnf/pkg/pacemaker/testdata/standby_node.xml
  • pkg/tnf/pkg/pacemaker/v1alpha1/doc.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml

[medium] 1-47: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-47: Container should not be privileged

(CKV_K8S_16)


[medium] 1-47: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-47: Containers should not share the host process ID namespace

(CKV_K8S_17)

🔇 Additional comments (22)
pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml (1)

1-66: Well-formed test data template for Pacemaker health check scenarios.

The XML template is properly structured and comprehensively models the test scenarios described in the PR: recent kubelet operation failure (line 56), fencing events (line 63), node health status (lines 12–13), and resource deployment across nodes (lines 16–37). The placeholder convention using {{PLACEHOLDER_NAME}} is appropriate for dynamic test data substitution.
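
For reference, substituting those placeholders in a test usually amounts to a few ReplaceAll calls; a sketch, with the token names and RFC3339 standing in for the template's actual tokens and pacemaker's actual time formats:

package pacemaker

import (
	"os"
	"path/filepath"
	"strings"
	"testing"
	"time"
)

// Sketch: render the fixture with timestamps inside the detection windows
// (a failure 2m ago, a fencing event 1h ago) so the "recent" logic fires.
func renderRecentFailuresXML(t *testing.T) string {
	t.Helper()
	raw, err := os.ReadFile(filepath.Join("testdata", "recent_failures_template.xml"))
	if err != nil {
		t.Fatalf("failed to read template: %v", err)
	}
	now := time.Now()
	s := string(raw)
	s = strings.ReplaceAll(s, "{{RECENT_FAILURE_TIME}}", now.Add(-2*time.Minute).Format(time.RFC3339))
	s = strings.ReplaceAll(s, "{{RECENT_FENCING_TIME}}", now.Add(-1*time.Hour).Format(time.RFC3339))
	return s
}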

pkg/cmd/pacemakerstatuscollector/collector.go (4)

24-50: LGTM!

Constants are well-defined with appropriate values and clear documentation.


189-202: LGTM!

Standard Cobra command implementation with appropriate error handling.


204-274: LGTM!

Good design decisions:

  • Proper timeout handling with contexts
  • XML size validation to prevent XML bombs (10MB limit)
  • Returns raw XML even when parsing fails, allowing debugging
  • Clear separation of concerns between collection and update

428-444: LGTM!

Standard Kubernetes client creation with appropriate fallback from in-cluster config to kubeconfig.

bindata/tnfdeployment/job.yaml (1)

18-18: LGTM: Binary rename aligns with runtime refactoring.

The command change from tnf-setup-runner to tnf-runtime is consistent with the broader refactoring visible in Dockerfile.ocp and the introduction of the runtime CLI flow.

Dockerfile.ocp (1)

13-13: LGTM: Binary rename matches deployment manifest.

The COPY directive now references tnf-runtime, consistent with the updated container command in bindata/tnfdeployment/job.yaml.

bindata/tnfdeployment/clusterrole.yaml (1)

32-50: LGTM: RBAC rules follow standard CRD pattern.

The two rules correctly grant permissions for the PacemakerStatus CRD and its status subresource. The verb sets are appropriate for a status collector that needs to create, read, and update PacemakerStatus resources.

pkg/operator/ceohelpers/external_etcd_status.go (1)

167-201: LGTM: Well-designed generic condition waiter.

The WaitForEtcdCondition helper follows best practices:

  • Ensures informer cache is synced before checking conditions
  • Uses standard polling with configurable intervals and timeout
  • Properly handles transient errors by retrying
  • Clear logging at appropriate levels
pkg/operator/starter.go (1)

630-631: LGTM: Function signature extended to support new functionality.

The additional dynamicClient and AlivenessChecker parameters align with the Pacemaker health check integration. Tests in starter_test.go have been updated accordingly.

pkg/tnf/pkg/jobs/batch.go (1)

42-48: LGTM: Consistent with existing decoder pattern.

ReadCronJobV1OrDie mirrors the implementation of ReadJobV1OrDie, using the same codec and panic-on-error approach. This consistency aids maintainability.

pkg/tnf/operator/starter_test.go (2)

96-109: LGTM: Test updated to pass new parameters.

The test correctly instantiates alivenessChecker via health.NewMultiAlivenessChecker() and passes both dynamicClient and alivenessChecker to HandleDualReplicaClusters, matching the updated function signature.


260-263: LGTM: Minimal REST config appropriate for testing.

The minimal rest.Config with a localhost host provides the necessary structure for controllers that require KubeConfig without introducing unnecessary test dependencies.

pkg/tnf/pkg/pacemaker/client_helpers.go (2)

17-32: ****

The constants envKubeconfig, envHome, and defaultKubeconfigPath are properly defined as package-level constants in pkg/tnf/pkg/pacemaker/statuscollector.go (lines 57-59). In Go, package-level constants are accessible throughout the entire package without explicit imports. Since both statuscollector.go and client_helpers.go are in the same pacemaker package, the code will compile without errors. No changes are required.

Likely an incorrect or invalid review comment.
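
For anyone double-checking the rule the bot is applying, a two-file sketch (file names illustrative):

// file: statuscollector.go
package pacemaker

const defaultKubeconfigPath = "/.kube/config"

and in another file of the same package:

// file: client_helpers.go
package pacemaker

func kubeconfigPath(home string) string {
	// Resolves with no import: Go scopes package-level identifiers to the
	// package, across all files in it.
	return home + defaultKubeconfigPath
}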


37-56: ****

The kubernetesAPIPath constant is properly defined in the same package at pkg/tnf/pkg/pacemaker/healthcheck.go:79 as "/apis". In Go, package-level constants are accessible across all files within the same package, regardless of which file defines them. Since both healthcheck.go and client_helpers.go are in the pkg/tnf/pkg/pacemaker/ package, the reference at line 47 of client_helpers.go is valid and will compile correctly.

Likely an incorrect or invalid review comment.

cmd/tnf-runtime/main.go (1)

124-125: Align docs with actual pcs invocation.

Help text says "pcs status xml" but PR summary mentions "pcs status --full". Please confirm and update to match the collector’s actual call.

bindata/tnfdeployment/cronjob.yaml (1)

42-42: Good: using tnf-setup-manager SA matches existing TNF jobs.

pkg/tnf/pkg/pacemaker/v1alpha1/register.go (1)

24-31: API registration looks correct.

Group/version, known types, and AddToGroupVersion are wired properly.

pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)

1-211: Generated deepcopy looks correct.

Covers all v1alpha1 types; deep-copies metav1.Time fields and slices properly.

pkg/tnf/pkg/jobs/cronjob_controller.go (1)

175-213: Rewrite quoting logic for command/schedule; discard namespace injection suggestion.

The cronjob template (bindata/tnfdeployment/cronjob.yaml) uses a hardcoded namespace: openshift-etcd, not a placeholder. Only the name, schedule, image, and command fields use <injected> markers. Implement only the quoting improvements:

  • Use strconv.Quote() for command array elements (safe YAML escaping).
  • Use strconv.Quote() for schedule value (handles special chars like *).
   var commandBuilder strings.Builder
   commandBuilder.WriteString("[")
   for i, cmd := range c.command {
     if i > 0 {
       commandBuilder.WriteString(", ")
     }
-    commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd))
+    commandBuilder.WriteString(strconv.Quote(cmd))
   }
   commandBuilder.WriteString("]")
   commandStr := commandBuilder.String()
   
   // Track which line we're on to do context-aware replacement
   lines := strings.Split(result, "\n")
   for i, line := range lines {
     // Replace name (appears in metadata.name)
     if strings.Contains(line, "name: <injected>") {
       lines[i] = strings.Replace(line, "<injected>", c.name, 1)
     }
     // Replace app label
     if strings.Contains(line, "app.kubernetes.io/name: <injected>") {
       lines[i] = strings.Replace(line, "<injected>", c.name, 1)
     }
     // Replace schedule (must be quoted to handle special characters like *)
     if strings.Contains(line, "schedule: <injected>") {
-      lines[i] = strings.Replace(line, "<injected>", fmt.Sprintf(`"%s"`, c.schedule), 1)
+      lines[i] = strings.Replace(line, "<injected>", strconv.Quote(c.schedule), 1)
     }
     // Replace image
     if strings.Contains(line, "image: <injected>") {
       lines[i] = strings.Replace(line, "<injected>", c.image, 1)
     }
     // Replace command (special handling for array)
     if strings.Contains(line, "command: <injected>") {
       lines[i] = strings.Replace(line, "<injected>", commandStr, 1)
     }
   }

Likely an incorrect or invalid review comment.

pkg/tnf/pkg/pacemaker/healthcheck.go (2)

441-470: Condition update churn minimized — nice.

Checking current condition status/message before updates and logging at V(2) avoids noise and unnecessary API writes. Good hygiene.

Also applies to: 492-519


548-592: Event dedup design is simple and effective.

Mutex + time‑based keys keep spam down while allowing healthy transition events only once per recovery. Looks good.
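
The pattern being praised, reduced to a standalone sketch (names illustrative):

package pacemaker

import (
	"sync"
	"time"
)

// Sketch: time-keyed dedup. Cleanup runs before the lookup so the map
// cannot grow without bound, mirroring the comment above.
type eventDeduper struct {
	mu       sync.Mutex
	lastSent map[string]time.Time
	window   time.Duration
}

func newEventDeduper(window time.Duration) *eventDeduper {
	return &eventDeduper{lastSent: map[string]time.Time{}, window: window}
}

func (d *eventDeduper) shouldRecord(key string, now time.Time) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	for k, ts := range d.lastSent {
		if now.Sub(ts) > d.window {
			delete(d.lastSent, k)
		}
	}
	if last, ok := d.lastSent[key]; ok && now.Sub(last) < d.window {
		return false
	}
	d.lastSent[key] = now
	return true
}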

Comment on lines 289 to 298
// Build node IP map from node attributes
nodeIPMap := make(map[string]string)
for _, nodeAttrSet := range result.NodeAttributes.Node {
	for _, attr := range nodeAttrSet.Attribute {
		if attr.Name == "node_ip" {
			nodeIPMap[nodeAttrSet.Name] = attr.Value
			break
		}
	}
}
Copy link


⚠️ Potential issue | 🟡 Minor

Dead code: nodeIPMap is built but never used.

The nodeIPMap is populated from node attributes but is never referenced when building the nodes array (lines 308-313) or anywhere else in the function.

If node IP information is needed, apply this diff to include it:

 		nodes = append(nodes, v1alpha1.NodeStatus{
 			Name:    node.Name,
 			Online:  online,
 			Standby: node.Standby == booleanValueTrue,
+			IP:      nodeIPMap[node.Name],
 		})

Otherwise, remove the nodeIPMap construction entirely:

-	// Build node IP map from node attributes
-	nodeIPMap := make(map[string]string)
-	for _, nodeAttrSet := range result.NodeAttributes.Node {
-		for _, attr := range nodeAttrSet.Attribute {
-			if attr.Name == "node_ip" {
-				nodeIPMap[nodeAttrSet.Name] = attr.Value
-				break
-			}
-		}
-	}
-
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Build node IP map from node attributes
nodeIPMap := make(map[string]string)
for _, nodeAttrSet := range result.NodeAttributes.Node {
	for _, attr := range nodeAttrSet.Attribute {
		if attr.Name == "node_ip" {
			nodeIPMap[nodeAttrSet.Name] = attr.Value
			break
		}
	}
}
// Build node status entries
var nodes []v1alpha1.NodeStatus
for _, node := range result.Nodes.Node {
🤖 Prompt for AI Agents
pkg/cmd/pacemakerstatuscollector/collector.go around lines 289-298: the
nodeIPMap is populated but never used; either remove the entire nodeIPMap
construction if node IPs are not required, or integrate it when creating the
nodes slice (around lines 308-313) by looking up nodeIPMap[nodeName] and
assigning that value to the node struct's IP field (or appropriate field) during
node creation; make sure to handle missing keys (empty string) and remove the
unused variable if you choose deletion.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from b915543 to 4513271 Compare October 20, 2025 19:30
Copy link

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (7)
bindata/tnfdeployment/cronjob.yaml (1)

5-5: Namespace configurability (duplicate of earlier feedback).

Hardcoded openshift-etcd is fine for this operator, but if you intend to reuse the controller, keep this as a placeholder and have the hook set it to operatorclient.TargetNamespace.

-  namespace: openshift-etcd
+  namespace: <injected>
manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1)

47-55: Size cap added — also add a truncation flag

maxLength on rawXML is great. Add status.summary.rawXMLTruncated (boolean) so collectors can signal when they trimmed content.

Example:

               summary:
                 description: Summary provides high-level counts and flags for the cluster state
                 type: object
                 properties:
+                  rawXMLTruncated:
+                    description: Indicates rawXML was truncated to fit CR size limits
+                    type: boolean
pkg/tnf/pkg/pacemaker/v1alpha1/types.go (1)

38-41: Update RawXML comment to match actual usage

Health check parses RawXML today; comment says it shouldn’t.

-	// RawXML contains the raw XML output from pcs status xml command
-	// Kept for debugging purposes only; healthcheck should not need to parse this
+	// RawXML contains the raw XML output from pcs/crm_mon. Kept for debugging and
+	// parsed by the health check to derive status.
 	// +optional
 	RawXML string `json:"rawXML,omitempty"`
pkg/tnf/operator/starter.go (1)

329-346: Don’t give up after 30m; retry until ctx.Done() before starting the collector.

If etcd transitions after the first 30-minute wait, the Pacemaker status collector never starts. Loop with backoff until success or context cancellation.

Apply:

 go func() {
   klog.Infof("waiting for etcd to transition to external before starting Pacemaker status collector")
-  if err := ceohelpers.WaitForEtcdCondition(
-    ctx,
-    etcdInformer,
-    operatorClient,
-    ceohelpers.HasExternalEtcdCompletedTransition,
-    10*time.Second,
-    30*time.Minute,
-    "external etcd transition",
-  ); err != nil {
-    klog.Errorf("failed to wait for external etcd transition: %v", err)
-    return
-  }
-  klog.Infof("etcd has transitioned to external; starting Pacemaker status collector CronJob")
-  runPacemakerStatusCollectorCronJob(ctx, controllerContext, operatorClient, kubeClient)
+  for {
+    if err := ceohelpers.WaitForEtcdCondition(
+      ctx, etcdInformer, operatorClient,
+      ceohelpers.HasExternalEtcdCompletedTransition,
+      10*time.Second, 30*time.Minute, "external etcd transition",
+    ); err != nil {
+      if ctx.Err() != nil {
+        klog.Infof("context done while waiting for external etcd transition: %v", err)
+        return
+      }
+      klog.Warningf("external etcd transition not complete yet, retrying in 1m: %v", err)
+      select {
+      case <-time.After(time.Minute):
+        continue
+      case <-ctx.Done():
+        return
+      }
+    }
+    klog.Infof("etcd has transitioned to external; starting Pacemaker status collector CronJob")
+    runPacemakerStatusCollectorCronJob(ctx, controllerContext, operatorClient, kubeClient)
+    return
+  }
 }()
pkg/tnf/pkg/pacemaker/healthcheck_test.go (1)

280-302: Fix assertion messages to match the method name.

Replace “updateOperatorAvailability” with “updateOperatorStatus” in the three require.NoError messages.

- require.NoError(t, err, "updateOperatorAvailability should not return an error")
+ require.NoError(t, err, "updateOperatorStatus should not return an error")

Apply at Lines 281, 291, and 301.

pkg/tnf/pkg/pacemaker/statuscollector.go (2)

227-236: Remove unused client creation and pass only what updatePacemakerStatusCR needs.

The kubernetes.Clientset isn’t used; simplify runCollector.

-  // Create Kubernetes client
-  client, err := createKubernetesClient()
-  if err != nil {
-    return fmt.Errorf("failed to create kubernetes client: %w", err)
-  }
-
   // Update PacemakerStatus CR
-  if err := updatePacemakerStatusCR(ctx, client, rawXML, summary, nodes, resources, nodeHistory, fencingHistory, collectionError); err != nil {
+  if err := updatePacemakerStatusCR(ctx, rawXML, summary, nodes, resources, nodeHistory, fencingHistory, collectionError); err != nil {
     return fmt.Errorf("failed to update PacemakerStatus CR: %w", err)
   }

3-18: Use apierrors for NotFound and remove unused imports.

Replace string matching with apierrors.IsNotFound, add the import, and (after removing the unused client parameter/function below) drop k8s clientset import.

 import (
   "context"
   "encoding/xml"
   "fmt"
   "os"
-  "strings"
   "time"

   "github.com/spf13/cobra"
   metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
-  "k8s.io/client-go/kubernetes"
+  apierrors "k8s.io/apimachinery/pkg/api/errors"
   "k8s.io/klog/v2"
🧹 Nitpick comments (21)
pkg/tnf/pkg/jobs/cronjob_controller.go (3)

141-154: Add retry-on-conflict for updates.

Update can race with other reconcilers/GC; wrap in a retry loop to handle 409s.

 func (c *CronJobController) updateCronJob(ctx context.Context, existing, desired *batchv1.CronJob) error {
-  // Preserve the resource version
-  desired.ResourceVersion = existing.ResourceVersion
-
-  _, err := c.kubeClient.BatchV1().CronJobs(operatorclient.TargetNamespace).Update(ctx, desired, metav1.UpdateOptions{})
-  if err != nil {
-    return fmt.Errorf("failed to update CronJob: %w", err)
-  }
+  // Preserve the resource version and retry on conflict
+  return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
+    desired.ResourceVersion = existing.ResourceVersion
+    _, err := c.kubeClient.BatchV1().CronJobs(operatorclient.TargetNamespace).Update(ctx, desired, metav1.UpdateOptions{})
+    if apierrors.IsConflict(err) {
+      // Refresh existing and try again
+      var getErr error
+      existing, getErr = c.kubeClient.BatchV1().CronJobs(operatorclient.TargetNamespace).Get(ctx, c.name, metav1.GetOptions{})
+      if getErr != nil {
+        return getErr
+      }
+      return err
+    }
+    return err
+  })

Note: add k8s.io/client-go/util/retry to imports.


176-209: Broaden needsUpdate checks or use semantic deep-equality.

Only schedule, command length/content, and image are compared. Fields like args, env, resources, securityContext, serviceAccount, concurrencyPolicy, history limits, job labels/annotations, and pod template changes won’t trigger updates.

Option A (quick): compare existing.Spec vs desired.Spec with semantic equality after normalizing known server-defaults.
Option B: compute a hash (annotation) of .Spec.JobTemplate.Spec.Template.Spec and .Spec.Schedule and compare.
If you choose A, import k8s.io/apimachinery/pkg/api/equality and use apiequality.Semantic.DeepEqual.
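
A rough sketch of Option B (annotation key and helper names are suggestions):

package jobs

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"

	batchv1 "k8s.io/api/batch/v1"
)

// Sketch: hash only the fields the controller owns; any template change
// flips the hash and forces an update, with no field-by-field comparison.
const specHashAnnotation = "etcd.openshift.io/cronjob-spec-hash"

func cronJobSpecHash(cj *batchv1.CronJob) (string, error) {
	payload, err := json.Marshal(struct {
		Schedule string
		PodSpec  interface{}
	}{
		Schedule: cj.Spec.Schedule,
		PodSpec:  cj.Spec.JobTemplate.Spec.Template.Spec,
	})
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(payload)
	return hex.EncodeToString(sum[:]), nil
}

func cronJobNeedsUpdate(existing, desired *batchv1.CronJob) (bool, error) {
	want, err := cronJobSpecHash(desired)
	if err != nil {
		return false, err
	}
	return existing.Annotations[specHashAnnotation] != want, nil
}

The desired object would get the annotation set to its own hash at create/update time.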


3-20: Imports will need retry/equality if you adopt the above.

Add:

  • k8s.io/client-go/util/retry (for update retry)
  • k8s.io/apimachinery/pkg/api/equality (if using semantic deep-equal)
bindata/tnfdeployment/cronjob.yaml (2)

36-43: Privileged + hostPID justified? Document rationale.

Security scanners will flag these. If required to run pcs/interact with host daemons, add a comment in the manifest explaining why, and keep using the tnf-setup-manager SA with the SCC binding. Otherwise, drop them.

Also applies to: 39-41, 41-41, 37-39


45-47: Node selector label forward‑compatibility.

If newer clusters use node-role.kubernetes.io/control-plane, consider tolerating both (via affinity with matchExpressions) in a follow‑up.

pkg/operator/ceohelpers/external_etcd_status.go (2)

179-186: Syncing the wrong/insufficient informer

Waiting only on etcdInformer doesn’t ensure operatorClient caches are warm. Either (a) accept the exact informer(s) whose listers back operatorClient, or (b) take varargs of HasSynced fns and wait for all.

Proposed shape:

-func WaitForEtcdCondition(ctx context.Context, etcdInformer operatorv1informers.EtcdInformer, operatorClient v1helpers.StaticPodOperatorClient, conditionCheck func(context.Context, v1helpers.StaticPodOperatorClient) (bool, error), pollInterval time.Duration, timeout time.Duration, conditionName string) error {
+func WaitForEtcdCondition(ctx context.Context, operatorClient v1helpers.StaticPodOperatorClient, conditionCheck func(context.Context, v1helpers.StaticPodOperatorClient) (bool, error), pollInterval time.Duration, timeout time.Duration, conditionName string, syncedFns ...cache.InformerSynced) error {
-  if !cache.WaitForCacheSync(ctx.Done(), etcdInformer.Informer().HasSynced) {
+  if len(syncedFns) > 0 && !cache.WaitForCacheSync(ctx.Done(), syncedFns...) {
     return fmt.Errorf("failed to sync informers")
   }

187-201: Tighten validation and reduce log noise

  • Validate pollInterval > 0 and timeout > 0 to avoid surprising behavior.
  • Use klog.V(2/4) for “waiting/synced/retrying” messages to keep default logs clean.

Apply:

+ if pollInterval <= 0 || timeout <= 0 {
+   return fmt.Errorf("invalid timings: pollInterval=%v timeout=%v", pollInterval, timeout)
+ }
- klog.Infof("waiting for etcd informer to sync before checking %s...", conditionName)
+ klog.V(2).Infof("waiting for informers to sync before checking %s...", conditionName)
- klog.Infof("etcd informer synced, checking for %s", conditionName)
+ klog.V(2).Infof("informers synced, checking for %s", conditionName)
- klog.Warningf("error checking %s, will retry: %v", conditionName, err)
+ klog.V(2).Infof("error checking %s, will retry: %v", conditionName, err)
- klog.V(4).Infof("%s condition not yet met, waiting...", conditionName)
+ klog.V(4).Infof("%s not yet met; retrying...", conditionName)
manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1)

16-18: Add printer columns for quick triage

Expose high-signal fields in kubectl get output (age/nodes/resources/quorum). Improves operability.

Example:

   - name: v1alpha1
     served: true
     storage: true
+    additionalPrinterColumns:
+    - name: Quorum
+      type: string
+      jsonPath: .status.summary.hasQuorum
+    - name: Nodes
+      type: string
+      jsonPath: .status.summary.nodesOnline
+    - name: Resources
+      type: string
+      jsonPath: .status.summary.resourcesStarted
pkg/tnf/pkg/pacemaker/v1alpha1/types.go (3)

38-42: Mirror CRD size cap via kubebuilder validation

Keep code and CRD in sync; enforce max length at type level too.

-	// +optional
-	RawXML string `json:"rawXML,omitempty"`
+	// +kubebuilder:validation:MaxLength=262144
+	// +optional
+	RawXML string `json:"rawXML,omitempty"`

7-12: Add printer columns for operability

Expose key summary fields in kubectl get output.

 // +kubebuilder:object:root=true
 // +kubebuilder:resource:path=pacemakerstatuses,scope=Cluster
 // +kubebuilder:subresource:status
+// +kubebuilder:printcolumn:name="Quorum",type=string,JSONPath=`.status.summary.hasQuorum`
+// +kubebuilder:printcolumn:name="Nodes",type=string,JSONPath=`.status.summary.nodesOnline`
+// +kubebuilder:printcolumn:name="Resources",type=string,JSONPath=`.status.summary.resourcesStarted`
+// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`

Also applies to: 68-93


23-31: Spec.NodeName likely unused — keep or drop

Spec is documented as unused; NodeName appears vestigial. Either remove it or add a TODO with intended use and ensure generators/CRD stay aligned.

pkg/tnf/pkg/pacemaker/statuscollector_test.go (1)

398-404: Add a targeted negative-path test for collectPacemakerStatus rather than xml.Unmarshal.

This test validates encoding/xml behavior, not our collector path. Consider mocking exec.Execute to return malformed XML and assert collectPacemakerStatus returns collectionError and empty structs. I can provide a table-driven test using a stubbed exec.Execute.
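
For reference, a rough table-driven shape. It assumes a hypothetical package-level seam runPCS over exec.Execute, and that collectPacemakerStatus(ctx) returns a result carrying a CollectionError plus empty node/resource data on failure; adjust to the real signatures:

func TestCollectPacemakerStatusBadOutput(t *testing.T) {
	// runPCS is a hypothetical seam over exec.Execute so tests can stub
	// the 'pcs status xml' invocation.
	orig := runPCS
	defer func() { runPCS = orig }()

	cases := []struct {
		name   string
		stdout string
		err    error
	}{
		{name: "malformed xml", stdout: "<pacemaker-result><nodes>", err: nil},
		{name: "command failure", stdout: "", err: fmt.Errorf("pcs: command not found")},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			runPCS = func(ctx context.Context) (string, error) { return tc.stdout, tc.err }
			result := collectPacemakerStatus(context.Background())
			require.NotEmpty(t, result.CollectionError, "expected a collection error")
			require.Empty(t, result.Nodes, "nodes should be empty on failure")
		})
	}
}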

pkg/tnf/pkg/pacemaker/statuscollector.go (2)

451-459: Remove dead helper createKubernetesClient if no longer used.

After dropping the client parameter, this function becomes unused and should be removed with its import.

-// createKubernetesClient creates a Kubernetes clientset using in-cluster config
-// or falling back to kubeconfig file.
-func createKubernetesClient() (*kubernetes.Clientset, error) {
-  config, err := getKubeConfig()
-  if err != nil {
-    return nil, err
-  }
-  return kubernetes.NewForConfig(config)
-}

21-60: Make the pcs command configurable and avoid sudo coupling.

Hard-coding “sudo -n pcs status xml” can fail in minimal images. Consider reading the command from env/config (default to “pcs status xml”) and document required RBAC/permissions.
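
A small sketch of that; PCS_STATUS_CMD is a hypothetical variable name:

// resolvePCSCommand returns the status command to execute, preferring an
// operator-provided override over the built-in default.
func resolvePCSCommand() string {
	if cmd := os.Getenv("PCS_STATUS_CMD"); cmd != "" {
		return cmd
	}
	return "pcs status xml" // default: no sudo coupling
}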

pkg/tnf/operator/starter.go (1)

360-372: Add a container bounds check before accessing Containers[0] for defensive programming.

The YAML template at bindata/tnfdeployment/cronjob.yaml currently defines exactly one container, so there is no immediate runtime risk. However, the hook function directly accesses cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers[0] without validation. For robustness against future template changes or unexpected states, add a guard similar to the pattern already used in CronJobController.needsUpdate():

if len(cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers) == 0 {
    return fmt.Errorf("CronJob template has no containers")
}
cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers[0].Image = os.Getenv("OPERATOR_IMAGE")
cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers[0].Command = []string{"tnf-runtime", "monitor"}
pkg/tnf/pkg/pacemaker/healthcheck.go (6)

613-623: Avoid double-prefixed event messages.

Events read like “Pacemaker detected a recent failed resource action: Recent failed resource action: …”.

Apply:

 func (c *HealthCheck) recordWarningEvent(warning string) {
 	switch {
 	case strings.Contains(warning, warningPrefixFailedAction):
-		c.eventRecorder.Warningf(eventReasonFailedAction, msgDetectedFailedAction, warning)
+		details := strings.TrimPrefix(warning, warningPrefixFailedAction+" ")
+		c.eventRecorder.Warningf(eventReasonFailedAction, msgDetectedFailedAction, details)
 	case strings.Contains(warning, warningPrefixFencingEvent):
-		c.eventRecorder.Warningf(eventReasonFencingEvent, msgDetectedFencing, warning)
+		details := strings.TrimPrefix(warning, warningPrefixFencingEvent+" ")
+		c.eventRecorder.Warningf(eventReasonFencingEvent, msgDetectedFencing, details)
 	default:
 		c.eventRecorder.Warningf(eventReasonWarning, msgPacemakerWarning, warning)
 	}
 }

492-507: Return a copy of the condition, not a pointer to a slice element.

Prevents subtle aliasing on a local backing array.

Apply:

 func (c *HealthCheck) getCurrentPacemakerCondition() (*operatorv1.OperatorCondition, error) {
@@
 	for i := range currentStatus.Conditions {
 		if currentStatus.Conditions[i].Type == conditionTypePacemakerDegraded {
-			return &currentStatus.Conditions[i], nil
+			cond := currentStatus.Conditions[i]
+			return &cond, nil
 		}
 	}

153-163: Constructor panics: consider fail-fast via utilruntime.Must or returning an error (if possible).

Panics restart the operator; acceptable at init but harder to test. If signature can’t change, wrap with utilruntime.Must for clarity; otherwise prefer returning an error to the caller.
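
For the Must variant, a one-line sketch (utilruntime is k8s.io/apimachinery/pkg/util/runtime; assumes scheme registration is the fallible step):

import utilruntime "k8s.io/apimachinery/pkg/util/runtime"

// Fails fast with a clear stack trace instead of an ad-hoc panic.
utilruntime.Must(v1alpha1.AddToScheme(scheme))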


124-141: Remove/justify unused fields and constants.

  • HealthCheck.kubeClient, lastProcessedStatus/lock are unused here.
  • kubernetesAPIPath is unused.

Either remove to reduce noise or add TODO with intended use.

Also applies to: 78-81


521-543: Stabilize dedup keys.

Using full messages as keys risks churn if wording changes. Prefer typed keys, e.g., kind|resource|op|node|rc|timestampTrunc.

I can propose a small helper makeEventKey(kind string, parts ...string) and use it in both loops.

Also applies to: 564-581, 582-588
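
A minimal sketch of that helper; the part ordering is an assumption:

// makeEventKey builds a stable dedup key from typed parts rather than the
// human-readable message, so wording changes don't reset dedup state.
func makeEventKey(kind string, parts ...string) string {
	return kind + "|" + strings.Join(parts, "|")
}

// e.g. makeEventKey("failedAction", resource, op, node, rc,
//      ts.Truncate(time.Minute).Format(time.RFC3339))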


471-490: Condition clear policy: consider clearing degraded when errors disappear (even if warnings remain).

Currently, degraded clears only on Healthy, leaving degraded stuck on Warning-only states.

If desired, clear when len(status.Errors)==0.
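
Sketch of that policy inside updateOperatorStatus, assuming warnings alone should not hold the condition:

// Clear Degraded whenever no errors remain, even if warnings persist;
// warnings are still surfaced via logs and events.
if len(status.Errors) == 0 {
	return c.clearPacemakerDegradedCondition(ctx)
}
return c.setPacemakerDegradedCondition(ctx, status)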

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between b915543 and 4513271.

📒 Files selected for processing (26)
  • .gitignore (2 hunks)
  • Dockerfile.ocp (1 hunks)
  • bindata/tnfdeployment/clusterrole.yaml (1 hunks)
  • bindata/tnfdeployment/cronjob.yaml (1 hunks)
  • bindata/tnfdeployment/job.yaml (1 hunks)
  • cmd/tnf-runtime/main.go (4 hunks)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
  • pkg/operator/ceohelpers/external_etcd_status.go (2 hunks)
  • pkg/operator/starter.go (1 hunks)
  • pkg/tnf/operator/starter.go (5 hunks)
  • pkg/tnf/operator/starter_test.go (4 hunks)
  • pkg/tnf/pkg/jobs/batch.go (1 hunks)
  • pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/client_helpers.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/statuscollector.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/statuscollector_test.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/standby_node.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/doc.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/register.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/types.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
  • pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml
  • pkg/tnf/pkg/pacemaker/testdata/healthy_cluster.xml
  • cmd/tnf-runtime/main.go
  • bindata/tnfdeployment/job.yaml
  • bindata/tnfdeployment/clusterrole.yaml
  • pkg/tnf/pkg/jobs/batch.go
  • pkg/tnf/pkg/pacemaker/v1alpha1/doc.go
  • pkg/tnf/pkg/pacemaker/client_helpers.go
  • pkg/tnf/operator/starter_test.go
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml

[medium] 1-48: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-48: Container should not be privileged

(CKV_K8S_16)


[medium] 1-48: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-48: Containers should not share the host process ID namespace

(CKV_K8S_17)

🔇 Additional comments (9)
.gitignore (1)

14-14: Update ignore for new binary looks good.

tnf-runtime addition aligns with the rename across the repo.

pkg/tnf/pkg/pacemaker/testdata/offline_node.xml (1)

1-67: Test snapshot LGTM.

Covers an offline node plus a recent fence event; good for negative-path tests. Minor: duplicate kubelet probe entry (Line 58) seems intentional; confirm parser tolerates duplicates.

bindata/tnfdeployment/cronjob.yaml (1)

9-9: Confirm schedule injection occurs (and is valid 5‑field cron).

schedule: <injected> must be populated by the controller hook with a valid 5‑field cron (e.g., */30 * * * *). If the hook is optional, consider a sane default here to keep the asset valid.

pkg/operator/starter.go (1)

630-632: Signature change verified — no issues found.

The function signature in pkg/tnf/operator/starter.go:63-75 correctly includes dynamicClient and alivenessChecker as the final two parameters. Both call sites—production (pkg/operator/starter.go:619) and test (pkg/tnf/operator/starter_test.go:97)—pass all 12 arguments in the correct order. Wiring is correct and no build issues are expected.

Dockerfile.ocp (1)

13-13: Verified: tnf-runtime is correctly built by the Makefile.

The build process is properly configured:

  • cmd/tnf-runtime is included in GO_BUILD_PACKAGES (defaults to ./cmd/...)
  • make build compiles all cmd packages via golang.mk/build.mk
  • Binaries output to WORKDIR root (/go/src/github.com/openshift/cluster-etcd-operator/)
  • Output location matches Dockerfile.ocp COPY expectation

The COPY command will succeed.

pkg/tnf/pkg/pacemaker/testdata/standby_node.xml (1)

1-1: Confirm source of XML (crm_mon vs pcs status)

request attribute shows crm_mon; PR mentions pcs status --full. If pcs shells out to crm_mon this is fine—please confirm to avoid format drift surprises in tests.

pkg/tnf/pkg/pacemaker/v1alpha1/register.go (1)

9-18: Scheme registration looks good

Group/version and type registration are correct; AddToScheme wiring is standard.

Also applies to: 24-31

pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)

1-211: Generated deepcopy implementations look correct and consistent with v1alpha1 types. No issues.

pkg/tnf/pkg/pacemaker/healthcheck.go (1)

350-354: Code is internally consistent; cross-release compatibility verification required.

The repository uses the prefixes correctly and consistently: resourceAgentKubelet = "systemd:kubelet" and resourceAgentEtcd = "ocf:heartbeat:podman-etcd" are defined as constants, validated by test data across multiple test files, and match the resource creation commands in pkg/tnf/pkg/pcs/cluster.go and pkg/tnf/pkg/pcs/etcd.go. The createPacemakerRESTClient function is implemented. No variant agent strings were found in the codebase.

However, the concern about compatibility across all supported pcs releases cannot be verified from code inspection alone. Confirm with your pcs release documentation or compatibility matrix that these agent strings remain stable across the releases your deployment supports.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from 4513271 to 8e58a60 on October 20, 2025 19:53
Copy link

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (4)
pkg/tnf/pkg/jobs/cronjob_controller.go (1)

156-174: Force CronJob name/namespace to controller-owned values.

Avoid mismatches between GET path (TargetNamespace/c.name) and the object parsed from the manifest. Set both explicitly after parsing to prevent 422s and orphaned resources. This was raised earlier and still applies.

 func (c *CronJobController) buildCronJob(ctx context.Context) (*batchv1.CronJob, error) {
   // Parse the manifest
   cronJob := ReadCronJobV1OrDie(c.manifest)

+  // Enforce controller-managed identity and scope
+  cronJob.Namespace = operatorclient.TargetNamespace
+  cronJob.Name = c.name
+
   // Get operator spec for hooks
   opSpec, _, _, err := c.operatorClient.GetOperatorState()
   if err != nil {
     return nil, fmt.Errorf("failed to get operator state: %w", err)
   }
pkg/tnf/pkg/pacemaker/v1alpha1/types.go (1)

38-41: Update RawXML comment to match usage.

Health check/collector parse it today; adjust the comment to avoid confusion.

-	// RawXML contains the raw XML output from pcs status xml command
-	// Kept for debugging purposes only; healthcheck should not need to parse this
+	// RawXML contains the raw XML output from 'pcs status --full --output-as xml'.
+	// Kept for debugging and may be parsed by health/monitoring components.
bindata/tnfdeployment/cronjob.yaml (1)

31-31: command must be []string, not string.

Type mismatch may fail decoding. Wrap in array or leave empty for hook fill.

-              command: <injected>
+              command: ["<injected>"]
pkg/tnf/pkg/pacemaker/healthcheck_test.go (1)

280-302: Prior nit fixed — good.

Assertion messages now reference updateOperatorStatus (was updateOperatorAvailability). Thanks for addressing.

🧹 Nitpick comments (24)
pkg/tnf/pkg/jobs/cronjob_controller.go (2)

176-209: Broaden update detection beyond first container; compare full PodSpec and key CronJob fields.

Current check misses args/env/resources/SA/annotations and assumes a single container. Use semantic deep-equality on the JobTemplate’s PodSpec and also compare other CronJob knobs (ConcurrencyPolicy, HistoryLimits, StartingDeadlineSeconds).

+import (
+  // ...
+  apiequality "k8s.io/apimachinery/pkg/api/equality"
+)
@@
 func (c *CronJobController) needsUpdate(existing, desired *batchv1.CronJob) bool {
   // Check if schedule has changed
   if existing.Spec.Schedule != desired.Spec.Schedule {
     klog.V(4).Infof("Schedule changed from %q to %q", existing.Spec.Schedule, desired.Spec.Schedule)
     return true
   }
 
- // Check if the job template has changed (containers, command, image, etc.)
- if len(existing.Spec.JobTemplate.Spec.Template.Spec.Containers) > 0 &&
-   len(desired.Spec.JobTemplate.Spec.Template.Spec.Containers) > 0 {
-   existingContainer := existing.Spec.JobTemplate.Spec.Template.Spec.Containers[0]
-   desiredContainer := desired.Spec.JobTemplate.Spec.Template.Spec.Containers[0]
-   // Check if command has changed
-   if len(existingContainer.Command) != len(desiredContainer.Command) { ... }
-   for i := range desiredContainer.Command { ... }
-   // Check if image has changed
-   if existingContainer.Image != desiredContainer.Image { ... }
- }
+ // Compare CronJob-level knobs
+ if existing.Spec.ConcurrencyPolicy != desired.Spec.ConcurrencyPolicy ||
+   ptrVal(existing.Spec.StartingDeadlineSeconds) != ptrVal(desired.Spec.StartingDeadlineSeconds) ||
+   ptrVal(existing.Spec.SuccessfulJobsHistoryLimit) != ptrVal(desired.Spec.SuccessfulJobsHistoryLimit) ||
+   ptrVal(existing.Spec.FailedJobsHistoryLimit) != ptrVal(desired.Spec.FailedJobsHistoryLimit) {
+   return true
+ }
+
+ // Compare full PodSpec (containers, args, env, resources, SA, securityContext, etc.)
+ if !apiequality.Semantic.DeepEqual(
+   existing.Spec.JobTemplate.Spec.Template.Spec,
+   desired.Spec.JobTemplate.Spec.Template.Spec,
+ ) {
+   return true
+ }
 
   return false
 }
+
+// ptrVal safely dereferences optional int32 pointers for comparison
+func ptrVal(p *int32) int32 {
+  if p == nil { return -1 }
+  return *p
+}

66-73: Consider adding a CronJob informer to react to external edits promptly.

Relying solely on a 1m resync can delay reconciliation after manual changes. Watching batch/v1 CronJobs in TargetNamespace would make updates event-driven.
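
A sketch of the event-driven wiring with a namespaced shared informer factory (handler body elided; queue wiring is left to the controller):

import (
	"time"

	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
)

factory := informers.NewSharedInformerFactoryWithOptions(
	kubeClient, time.Minute,
	informers.WithNamespace(operatorclient.TargetNamespace))
cronJobInformer := factory.Batch().V1().CronJobs().Informer()
cronJobInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	UpdateFunc: func(oldObj, newObj interface{}) {
		if cj, ok := newObj.(*batchv1.CronJob); ok && cj.Name == c.name {
			// trigger a resync of this controller
		}
	},
})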

pkg/tnf/pkg/pacemaker/client_helpers.go (2)

15-32: Use standard kubeconfig loading rules; HOME concat can be wrong.

If InCluster fails and HOME is unset, you’ll build "/.kube/config". Prefer clientcmd’s default loading rules (KUBECONFIG, XDG, HOME) and context overrides.

-func getKubeConfig() (*rest.Config, error) {
-  // Try in-cluster config first
-  config, err := rest.InClusterConfig()
-  if err != nil {
-    // Fall back to kubeconfig
-    kubeconfig := os.Getenv(envKubeconfig)
-    if kubeconfig == "" {
-      kubeconfig = os.Getenv(envHome) + defaultKubeconfigPath
-    }
-    config, err = clientcmd.BuildConfigFromFlags("", kubeconfig)
-    if err != nil {
-      return nil, fmt.Errorf("failed to create kubernetes config: %w", err)
-    }
-  }
-  return config, nil
-}
+func getKubeConfig() (*rest.Config, error) {
+  if cfg, err := rest.InClusterConfig(); err == nil {
+    return cfg, nil
+  }
+  // Standard default chain: KUBECONFIG, XDG, HOME, etc.
+  rules := clientcmd.NewDefaultClientConfigLoadingRules()
+  overrides := &clientcmd.ConfigOverrides{}
+  cfg, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(rules, overrides).ClientConfig()
+  if err != nil {
+    return nil, fmt.Errorf("failed to create kubernetes config: %w", err)
+  }
+  return cfg, nil
+}

37-56: Tighten REST client config for CRDs (APIPath, serializer, ParameterCodec).

Ensure APIPath is “/apis” and use WithoutConversion codec; add a ParameterCodec if you pass versioned options.

 func createPacemakerRESTClient(baseConfig *rest.Config) (rest.Interface, error) {
   // Set up the scheme for PacemakerStatus CRD
   scheme := runtime.NewScheme()
   if err := v1alpha1.AddToScheme(scheme); err != nil {
     return nil, fmt.Errorf("failed to add PacemakerStatus scheme: %w", err)
   }
 
   // Configure the REST client for the PacemakerStatus CRD
   pacemakerConfig := rest.CopyConfig(baseConfig)
   pacemakerConfig.GroupVersion = &v1alpha1.SchemeGroupVersion
-  pacemakerConfig.APIPath = kubernetesAPIPath
-  pacemakerConfig.NegotiatedSerializer = serializer.NewCodecFactory(scheme)
+  pacemakerConfig.APIPath = "/apis"
+  pacemakerConfig.NegotiatedSerializer = serializer.WithoutConversionCodecFactory{CodecFactory: serializer.NewCodecFactory(scheme)}
+  // Optional: if using VersionedParams with list/get options:
+  // pacemakerConfig.ParameterCodec = runtime.NewParameterCodec(scheme)
 
   restClient, err := rest.RESTClientFor(pacemakerConfig)
   if err != nil {
     return nil, fmt.Errorf("failed to create REST client for PacemakerStatus: %w", err)
   }
 
   return restClient, nil
 }
cmd/tnf-runtime/main.go (2)

3-14: Prefer io.Discard; io/ioutil is deprecated.

Use io.Discard and drop io/ioutil import.

-import (
-	goflag "flag"
-	"fmt"
-	"io/ioutil"
-	"math/rand"
-	"os"
-	"time"
+import (
+	goflag "flag"
+	"fmt"
+	"io"
+	"math/rand"
+	"os"
+	"time"
 )
@@
-	grpclog.SetLoggerV2(grpclog.NewLoggerV2(ioutil.Discard, os.Stderr, os.Stderr))
+	grpclog.SetLoggerV2(grpclog.NewLoggerV2(io.Discard, os.Stderr, os.Stderr))

Also applies to: 30-30


53-56: Exit 0 on help-only root invocation.

Printing help is not an error; return nil instead of os.Exit(1).

-		Run: func(cmd *cobra.Command, args []string) {
-			cmd.Help()
-			os.Exit(1)
-		},
+		RunE: func(cmd *cobra.Command, args []string) error {
+			_ = cmd.Help()
+			return nil
+		},
bindata/tnfdeployment/cronjob.yaml (4)

5-5: Make namespace injectable to match controller config.

Hardcoding causes drift; use a placeholder and inject in controller.

-  namespace: openshift-etcd
+  namespace: <injected>

9-12: Consider startingDeadlineSeconds to avoid missing runs.

Helps catch up after controller downtime.

 spec:
   schedule: <injected>
+  startingDeadlineSeconds: 300
   concurrencyPolicy: "Forbid"

36-41: Privileged + hostPID need rationale and docs.

If required for pcs/systemd introspection, add a brief comment in the manifest or controller explaining why and link to SOP. Also add seccompProfile: runtime/default.

           securityContext:
             privileged: true
             allowPrivilegeEscalation: true
+            seccompProfile:
+              type: RuntimeDefault

44-47: Future-proof control-plane selection.

OpenShift uses master label, but many clusters also use control-plane. Consider nodeAffinity with matchExpressions for either label.

pkg/operator/ceohelpers/external_etcd_status.go (1)

167-201: Solid helper; improve sync error context and log verbosity.

Return context error when sync fails and use klog.V levels to reduce noise. Note precondition that the informer must be started.

-	klog.Infof("waiting for etcd informer to sync before checking %s...", conditionName)
+	klog.V(2).Infof("waiting for etcd informer to sync before checking %s...", conditionName)
 	if !cache.WaitForCacheSync(ctx.Done(), etcdInformer.Informer().HasSynced) {
-		return fmt.Errorf("failed to sync etcd informer")
+		if err := ctx.Err(); err != nil {
+			return fmt.Errorf("failed to sync etcd informer before %s: %w", conditionName, err)
+		}
+		return fmt.Errorf("failed to sync etcd informer before %s", conditionName)
 	}
-	klog.Infof("etcd informer synced, checking for %s", conditionName)
+	klog.V(2).Infof("etcd informer synced, checking for %s", conditionName)

Please confirm the caller starts the etcd informer factory prior to invoking this function.

manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (3)

15-19: Add printer columns for quick triage.

Improves kubectl get output.

   versions:
   - name: v1alpha1
     served: true
     storage: true
+    additionalPrinterColumns:
+    - name: Updated
+      type: date
+      jsonPath: .status.lastUpdated
+    - name: Nodes
+      type: string
+      jsonPath: .status.summary.nodesOnline
+      description: Online nodes
+    - name: Resources
+      type: string
+      jsonPath: .status.summary.resourcesStarted
+      description: Started resources
+    - name: Failures
+      type: boolean
+      jsonPath: .status.summary.recentFailures
+    - name: Fencing
+      type: boolean
+      jsonPath: .status.summary.recentFencing

54-82: Add basic schema invariants.

Keep counts sane using CEL validations.

         properties:
           summary:
             type: object
             properties:
@@
               resourcesTotal:
                 type: integer
+            x-kubernetes-validations:
+            - rule: "self.nodesOnline <= self.nodesTotal"
+              message: "nodesOnline cannot exceed nodesTotal"
+            - rule: "self.resourcesStarted <= self.resourcesTotal"
+              message: "resourcesStarted cannot exceed resourcesTotal"

Also applies to: 100-123, 146-153


33-39: Spec.nodeName semantics vs singleton design.

If the CR is a singleton “cluster”, nodeName misleads. Either drop it or mark as deprecated/ignored in both CRD and types, and update comments accordingly.

pkg/tnf/pkg/pacemaker/statuscollector_test.go (2)

15-20: Mark helper for clearer failures.

Use t.Helper() in loadTestXML.

 func loadTestXML(t *testing.T, filename string) string {
+	t.Helper()
 	path := filepath.Join("testdata", filename)
 	data, err := os.ReadFile(path)
 	require.NoError(t, err, "Failed to read test XML file: %s", filename)
 	return string(data)
 }

398-403: Avoid asserting a constant value. Test behavior instead.

Mock exec and assert large XML is rejected, or at least assert maxXMLSize >= threshold.

-	require.Equal(t, 10*1024*1024, maxXMLSize, "Max XML size should be 10MB")
+	require.GreaterOrEqual(t, maxXMLSize, 5*1024*1024, "Max XML size should be reasonably large")

Or better: inject a fake reader > maxXMLSize into the collector and expect an error.

pkg/tnf/pkg/pacemaker/v1alpha1/types.go (3)

27-31: Spec.NodeName is unused and conflicts with singleton design.

Either remove it or clearly mark as reserved/ignored; align CRD text too.

 type PacemakerStatusSpec struct {
-	// Reserved for future use
-	// +optional
-	NodeName string `json:"nodeName,omitempty"`
+	// Reserved for future use. This CR is status-only; spec fields are ignored.
 }

13-21: Add printer columns for UX.

Expose key fields in kubectl get.

 // +kubebuilder:object:root=true
 // +kubebuilder:resource:path=pacemakerstatuses,scope=Cluster
 // +kubebuilder:subresource:status
+// +kubebuilder:printcolumn:name="Updated",type="date",JSONPath=".status.lastUpdated"
+// +kubebuilder:printcolumn:name="Nodes",type="string",JSONPath=".status.summary.nodesOnline"
+// +kubebuilder:printcolumn:name="Res",type="string",JSONPath=".status.summary.resourcesStarted"
+// +kubebuilder:printcolumn:name="Fail",type="boolean",JSONPath=".status.summary.recentFailures"
+// +kubebuilder:printcolumn:name="Fence",type="boolean",JSONPath=".status.summary.recentFencing"

Also applies to: 68-93


68-93: Optional: indicate XML truncation.

Add a flag so UIs know if RawXML was trimmed.

 type PacemakerSummary struct {
@@
 	// RecentFencing indicates if there are recent fencing events
 	RecentFencing bool `json:"recentFencing,omitempty"`
+
+	// RawXMLTruncated indicates RawXML was truncated to fit CR size limits
+	RawXMLTruncated bool `json:"rawXMLTruncated,omitempty"`
 }

Pair with CRD schema and collector updates.

pkg/tnf/pkg/pacemaker/healthcheck_test.go (1)

29-58: Mark test helpers with t.Helper().

Add t.Helper() at the top of createTestHealthCheckWithMockStatus, and thread a *testing.T parameter through createTestHealthCheck so it can do the same (t.Helper() needs a *testing.T in scope); update its call sites accordingly.

-func createTestHealthCheck() *HealthCheck {
+func createTestHealthCheck(t *testing.T) *HealthCheck {
+  t.Helper()
   kubeClient := fake.NewSimpleClientset()
 func createTestHealthCheckWithMockStatus(t *testing.T, mockStatus *v1alpha1.PacemakerStatus) *HealthCheck {
+  t.Helper()
   kubeClient := fake.NewSimpleClientset()

Also applies to: 61-114

pkg/tnf/operator/starter.go (2)

313-353: Etcd transition wait: prefer a context-aware poll helper.

The 1m retry loop is fine. For cleaner cancellation, consider wait.PollUntilContextCancel (k8s.io/apimachinery/pkg/util/wait) instead of time.After/select; it retries while the condition returns false, stops once it returns true, and honors ctx.

- go func() {
+ go func() {
    klog.Infof("waiting for etcd to transition to external before starting Pacemaker status collector")
-    for {
-      if err := ceohelpers.WaitForEtcdCondition( /* ... */ ); err != nil {
-        if ctx.Err() != nil { klog.Infof("context done..."); return }
-        klog.Warningf("... will retry in 1m: %v", err)
-        select {
-        case <-time.After(time.Minute):
-          continue
-        case <-ctx.Done():
-          return
-        }
-      }
-      klog.Infof("etcd has transitioned to external; starting Pacemaker status collector CronJob")
-      runPacemakerStatusCollectorCronJob(ctx, controllerContext, operatorClient, kubeClient)
-      return
-    }
+    _ = wait.PollUntilContextCancel(ctx, time.Minute, true, func(ctx context.Context) (bool, error) {
+      if err := ceohelpers.WaitForEtcdCondition(ctx, etcdInformer, operatorClient, ceohelpers.HasExternalEtcdCompletedTransition, 10*time.Second, 30*time.Minute, "external etcd transition"); err != nil {
+        klog.Warningf("external etcd not ready yet: %v", err)
+        return false, nil // retry after the interval
+      }
+      klog.Infof("etcd transitioned; starting Pacemaker status collector CronJob")
+      runPacemakerStatusCollectorCronJob(ctx, controllerContext, operatorClient, kubeClient)
+      return true, nil // condition met; stop polling
+    })
  }()

355-382: CronJob safety: nil maps and container index.

Defensively initialize labels and verify a container exists before indexing to avoid panics if the asset changes.

   func(_ *operatorv1.OperatorSpec, cronJob *batchv1.CronJob) error {
     // Set the schedule - run every minute
     cronJob.Spec.Schedule = "* * * * *"

     // Set the name and labels
     cronJob.SetName("pacemaker-status-collector")
-    cronJob.Labels["app.kubernetes.io/name"] = "pacemaker-status-collector"
+    if cronJob.Labels == nil {
+      cronJob.Labels = map[string]string{}
+    }
+    cronJob.Labels["app.kubernetes.io/name"] = "pacemaker-status-collector"

     // Configure the container
-    cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers[0].Image = os.Getenv("OPERATOR_IMAGE")
-    cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers[0].Command = []string{"tnf-runtime", "monitor"}
+    containers := cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers
+    if len(containers) == 0 {
+      return fmt.Errorf("cronjob template has no containers")
+    }
+    containers[0].Image = os.Getenv("OPERATOR_IMAGE")
+    containers[0].Command = []string{"tnf-runtime", "monitor"}
     return nil
   },
pkg/tnf/pkg/pacemaker/statuscollector.go (2)

30-32: Cap stored RawXML to stay under K8s object size limits.

10MiB XML is far above typical 1–1.5MiB object limits. Reduce the accepted input and truncate what’s stored in Status to avoid update failures.

- // maxXMLSize prevents XML bombs (10MB)
- maxXMLSize = 10 * 1024 * 1024
+ // maxXMLSize bounds input parsing; keep comfortably under object size limits (1MiB).
+ maxXMLSize = 1 * 1024 * 1024
+ // maxStoredXMLSize bounds RawXML persisted into the Status subresource.
+ maxStoredXMLSize = 256 * 1024
- rawXML = stdout
+ rawXML = stdout
- existing.Status.RawXML = rawXML
+ if len(rawXML) > maxStoredXMLSize {
+   rawXML = rawXML[:maxStoredXMLSize]
+ }
+ existing.Status.RawXML = rawXML
- Status: v1alpha1.PacemakerStatusStatus{
+ Status: v1alpha1.PacemakerStatusStatus{
     LastUpdated:     now,
-    RawXML:          rawXML,
+    RawXML:          func(s string) string {
+      if len(s) > maxStoredXMLSize { return s[:maxStoredXMLSize] }
+      return s
+    }(rawXML),

Also applies to: 271-286, 500-547


53-55: Remove unused notFoundError constant.

Now that apierrors.IsNotFound is used, this constant is dead.

- // Error check strings
- notFoundError = "not found"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 4513271 and 8e58a60.

📒 Files selected for processing (26)
  • .gitignore (2 hunks)
  • Dockerfile.ocp (1 hunks)
  • bindata/tnfdeployment/clusterrole.yaml (1 hunks)
  • bindata/tnfdeployment/cronjob.yaml (1 hunks)
  • bindata/tnfdeployment/job.yaml (1 hunks)
  • cmd/tnf-runtime/main.go (4 hunks)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
  • pkg/operator/ceohelpers/external_etcd_status.go (2 hunks)
  • pkg/operator/starter.go (1 hunks)
  • pkg/tnf/operator/starter.go (5 hunks)
  • pkg/tnf/operator/starter_test.go (5 hunks)
  • pkg/tnf/pkg/jobs/batch.go (1 hunks)
  • pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/client_helpers.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/statuscollector.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/statuscollector_test.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/standby_node.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/doc.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/register.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/types.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • pkg/tnf/pkg/pacemaker/testdata/standby_node.xml
🚧 Files skipped from review as they are similar to previous changes (9)
  • pkg/tnf/pkg/pacemaker/testdata/offline_node.xml
  • pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml
  • pkg/tnf/pkg/jobs/batch.go
  • Dockerfile.ocp
  • bindata/tnfdeployment/clusterrole.yaml
  • pkg/tnf/operator/starter_test.go
  • pkg/tnf/pkg/pacemaker/v1alpha1/register.go
  • pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go
  • pkg/tnf/pkg/pacemaker/healthcheck.go
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml

[medium] 1-48: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-48: Container should not be privileged

(CKV_K8S_16)


[medium] 1-48: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-48: Containers should not share the host process ID namespace

(CKV_K8S_17)

🔇 Additional comments (11)
pkg/operator/starter.go (1)

630-631: LGTM! Parameter addition follows existing patterns.

The addition of AlivenessChecker as a parameter to HandleDualReplicaClusters is consistent with how health checking is wired throughout this file. The package-level AlivenessChecker is properly initialized (line 89) and used consistently across multiple controller initializations.

pkg/tnf/pkg/pacemaker/v1alpha1/doc.go (1)

1-3: LGTM.

Package doc and group annotation look correct for etcd.openshift.io/v1alpha1.

.gitignore (1)

14-14: LGTM.

Ignore pattern updated to tnf-runtime; payload pattern retained.

Also applies to: 30-30

pkg/tnf/pkg/pacemaker/client_helpers.go (1)

22-25: All constants are properly defined with expected values.

Verification confirms:

  • envKubeconfig = "KUBECONFIG"
  • envHome = "HOME"
  • defaultKubeconfigPath = "/.kube/config"
  • kubernetesAPIPath = "/apis"

All four constants are declared in the pacemaker package and have correct, standard values.

bindata/tnfdeployment/job.yaml (1)

18-18: Binary rename verified; code is ready.

The release image installs tnf-runtime to /usr/bin/ as confirmed by Dockerfile.ocp line 13. No references to the old name remain in the codebase, and the manifest correctly references the renamed binary. The change is consistent across all components (Dockerfile, Go code, and YAML manifests).

cmd/tnf-runtime/main.go (1)

120-129: Good fix: reuse collector command for monitor.

Avoids double flag parsing and accidental exits; matches prior guidance.

manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1)

47-50: Nice: cap rawXML length.

This guards against status object bloat.

pkg/tnf/pkg/pacemaker/statuscollector_test.go (1)

405-412: Good: invalid XML now actually errors.

pkg/tnf/operator/starter.go (2)

64-76: No issues found—signature change fully propagated.

All call sites for HandleDualReplicaClusters have been correctly updated. The production call at pkg/operator/starter.go:619 and the test call at pkg/tnf/operator/starter_test.go:96 both pass the new dynamicClient and alivenessChecker parameters.


355-382: Reconcile 30s cadence vs 1m Cron

PR claims "every 30 seconds" but the CronJob in pkg/tnf/operator/starter.go sets schedule "* * * * *" (every minute). The monitor subcommand is pacemaker.NewPacemakerStatusCollectorCommand (pkg/tnf/pkg/pacemaker/statuscollector.go, wired via cmd/tnf-runtime/main.go) and runs runCollector() once — there is no interval/loop flag (collectorTimeout = 30s is a per-run timeout, not a cadence).

  • Confirm intended cadence: 30s or 1m.
  • If 30s required: either (A) keep CronJob at 1m and run monitor twice inside the container with a 30s sleep (quick), or (B) add an interval/duration flag and a loop inside monitor so it can run continuously (preferred; see the sketch after this list).
  • If 1m acceptable: update PR text/docs to "every minute".
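
A sketch of option (B); runCollector(ctx) stands in for the existing one-shot collection, and the flag name is an assumption:

var interval time.Duration
cmd.Flags().DurationVar(&interval, "interval", 30*time.Second, "time between collections; 0 runs once")

// inside RunE:
if interval == 0 {
	return runCollector(ctx) // preserve one-shot behavior for the CronJob
}
ticker := time.NewTicker(interval)
defer ticker.Stop()
for {
	if err := runCollector(ctx); err != nil {
		klog.Warningf("collection failed, will retry: %v", err)
	}
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-ticker.C:
	}
}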
pkg/tnf/pkg/pacemaker/healthcheck_test.go (1)

84-103: Content-Type and codec are already in sync—no action needed.

LegacyCodec always encodes to JSON, and the header is correctly set to runtime.ContentTypeJSON. Since this is the only LegacyCodec usage in the test file and the behavior is deterministic, no drift risk exists. The optional refactor to derive the header dynamically would be a nice-to-have for consistency, but it's not required.

Comment on lines 445 to 453
// createKubernetesClient creates a Kubernetes clientset using in-cluster config
// or falling back to kubeconfig file.
func createKubernetesClient() (*kubernetes.Clientset, error) {
	config, err := getKubeConfig()
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(config)
}
Copy link


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Is createKubernetesClient used?

It’s not referenced in this file. If unused across the package, please drop it along with the kubernetes import.


🏁 Script executed:

#!/usr/bin/env bash
# Find all references to createKubernetesClient
rg -n '\bcreateKubernetesClient\b' -g 'pkg/**.go'

Length of output: 148



---

🏁 Script executed:

#!/usr/bin/env bash
# Search for createKubernetesClient across all Go files without restrictive glob
rg -n '\bcreateKubernetesClient\b' --type=go

Length of output: 233


🏁 Script executed:

#!/usr/bin/env bash
# Also check the directory structure to understand repo layout
fd -t f '\.go$' | head -20

Length of output: 947


🏁 Script executed:

#!/usr/bin/env bash
# Check if kubernetes import is used elsewhere in the statuscollector.go file
rg -n 'kubernetes' pkg/tnf/pkg/pacemaker/statuscollector.go

Length of output: 249


🏁 Script executed:

#!/usr/bin/env bash
# Search for any other potential usages within the pacemaker package
rg -n 'kubernetes\.' pkg/tnf/pkg/pacemaker/ --type=go

Length of output: 350


Remove unused createKubernetesClient function and the kubernetes import

The function is not referenced anywhere in the codebase and is dead code. The kubernetes import on line 13 is used only within this function, so it can also be safely removed.

🤖 Prompt for AI Agents
In pkg/tnf/pkg/pacemaker/statuscollector.go around lines 445 to 453, the
createKubernetesClient function is unused dead code; remove the entire function
and also delete the now-unused "kubernetes" import on line 13. After removing
both, run go build / go vet to ensure no remaining references and tidy imports
(goimports/gofmt or go mod tidy) to clean up any leftover unused imports.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from 8e58a60 to 29b5b15 on October 20, 2025 20:08
Copy link

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/tnf/operator/starter.go (1)

413-418: Handle tombstones on Secret delete events

Delete callbacks may provide cache.DeletedFinalStateUnknown; current cast to *corev1.Secret can fail and skip updates.

 func handleFencingSecretChange(ctx context.Context, client kubernetes.Interface, oldObj, obj interface{}) {
-  secret, ok := obj.(*corev1.Secret)
-  if !ok {
-    klog.Warningf("failed to convert added / modified / deleted object to Secret %+v", obj)
-    return
-  }
+  var secret *corev1.Secret
+  switch t := obj.(type) {
+  case *corev1.Secret:
+    secret = t
+  case cache.DeletedFinalStateUnknown:
+    if s, ok := t.Obj.(*corev1.Secret); ok {
+      secret = s
+    } else {
+      klog.Warningf("tombstone contained object that is not a Secret: %+v", t.Obj)
+      return
+    }
+  default:
+    klog.Warningf("unexpected object type for Secret event: %T", obj)
+    return
+  }
♻️ Duplicate comments (5)
manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1)

54-82: Add a truncation indicator for rawXML

If the collector trims rawXML to fit the cap, surface it in status.summary to avoid confusing consumers.

               summary:
                 description: Summary provides high-level counts and flags for the cluster state
                 type: object
                 properties:
+                  rawXMLTruncated:
+                    description: RawXMLTruncated is true when rawXML was truncated to fit size limits
+                    type: boolean
                   pacemakerdState:
pkg/tnf/pkg/pacemaker/statuscollector.go (2)

54-54: Remove unused notFoundError constant.

This constant was replaced by apierrors.IsNotFound(err) at line 491 and is no longer referenced.

Apply this diff:

-	// Error check strings
-	notFoundError = "not found"
-

445-453: Remove unused createKubernetesClient function and the kubernetes import.

The function is not referenced anywhere in the codebase and is dead code. The kubernetes import on line 13 is used only within this function, so it can also be safely removed.

Apply this diff:

-import (
-	"context"
-	"encoding/xml"
-	"fmt"
-	"os"
-	"time"
-
-	"github.com/spf13/cobra"
-	apierrors "k8s.io/apimachinery/pkg/api/errors"
-	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
-	"k8s.io/client-go/kubernetes"
-	"k8s.io/klog/v2"
-
-	"github.com/openshift/cluster-etcd-operator/pkg/tnf/pkg/exec"
-	"github.com/openshift/cluster-etcd-operator/pkg/tnf/pkg/pacemaker/v1alpha1"
-)
+import (
+	"context"
+	"encoding/xml"
+	"fmt"
+	"os"
+	"time"
+
+	"github.com/spf13/cobra"
+	apierrors "k8s.io/apimachinery/pkg/api/errors"
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/klog/v2"
+
+	"github.com/openshift/cluster-etcd-operator/pkg/tnf/pkg/exec"
+	"github.com/openshift/cluster-etcd-operator/pkg/tnf/pkg/pacemaker/v1alpha1"
+)

And remove the function:

-// createKubernetesClient creates a Kubernetes clientset using in-cluster config
-// or falling back to kubeconfig file.
-func createKubernetesClient() (*kubernetes.Clientset, error) {
-	config, err := getKubeConfig()
-	if err != nil {
-		return nil, err
-	}
-	return kubernetes.NewForConfig(config)
-}
-
pkg/tnf/pkg/pacemaker/healthcheck.go (2)

247-259: Unknown health status with errors should trigger degradation.

Lines 248, 253, and 258 return Unknown status with errors (CR fetch failure, collection error, stale status), but updateOperatorStatus (lines 419-427) does not mark degraded for Unknown status. This hides real outages when the CR is unavailable or stale.

The issue persists from the previous review. Apply this fix in updateOperatorStatus:

 func (c *HealthCheck) updateOperatorStatus(ctx context.Context, status *HealthStatus) error {
 	klog.V(4).Infof("Updating operator availability based on pacemaker status: %s", status.OverallStatus)
 
 	// Log warnings and errors
 	c.logStatusMessages(status)
 
 	// Update operator conditions based on pacemaker status
 	switch status.OverallStatus {
 	case statusError:
 		return c.setPacemakerDegradedCondition(ctx, status)
 	case statusHealthy:
 		return c.clearPacemakerDegradedCondition(ctx)
+	case statusUnknown:
+		// Unknown with errors (stale/unavailable CR) should degrade
+		if len(status.Errors) > 0 {
+			return c.setPacemakerDegradedCondition(ctx, status)
+		}
+		return nil
 	default:
-		// For Warning or Unknown status, we don't update the degraded condition
+		// For Warning status, we don't update the degraded condition
 		return nil
 	}
 }

337-372: Fix false-positive resource errors by comparing against online node count.

When a node is offline (already flagged as an error), resource validation still demands resources on expectedNodeCount (2) nodes and reports an additional error. This amplifies the same issue.

Compare started-resource counts to the number of online nodes instead:

 func (c *HealthCheck) checkResourceStatus(pacemakerStatus *v1alpha1.PacemakerStatus, status *HealthStatus) {
 	resources := pacemakerStatus.Status.Resources
+
+	// Count online nodes to avoid duplicate errors when nodes are offline
+	onlineCount := 0
+	for _, n := range pacemakerStatus.Status.Nodes {
+		if n.Online {
+			onlineCount++
+		}
+	}
 
 	resourcesStarted := make(map[string]map[string]bool)
 	resourcesStarted[resourceKubelet] = make(map[string]bool)
 	resourcesStarted[resourceEtcd] = make(map[string]bool)
 
 	for _, resource := range resources {
 		// A resource is considered "started" if it has Role="Started" and Active=true
 		if resource.Role == roleStarted && resource.Active && resource.Node != "" {
 			// Check if this is a kubelet or etcd resource
 			switch {
 			case strings.HasPrefix(resource.ResourceAgent, resourceAgentKubelet):
 				resourcesStarted[resourceKubelet][resource.Node] = true
 			case strings.HasPrefix(resource.ResourceAgent, resourceAgentEtcd):
 				resourcesStarted[resourceEtcd][resource.Node] = true
 			}
 		}
 	}
 
 	// Check if we have both resources started on all nodes
 	kubeletCount := len(resourcesStarted[resourceKubelet])
 	etcdCount := len(resourcesStarted[resourceEtcd])
 
-	c.validateResourceCount(resourceKubelet, kubeletCount, status)
-	c.validateResourceCount(resourceEtcd, etcdCount, status)
+	c.validateResourceCount(resourceKubelet, kubeletCount, onlineCount, status)
+	c.validateResourceCount(resourceEtcd, etcdCount, onlineCount, status)
 }
 
 // validateResourceCount validates that a resource is started on the expected number of nodes
-func (c *HealthCheck) validateResourceCount(resourceName string, actualCount int, status *HealthStatus) {
-	if actualCount < expectedNodeCount {
+func (c *HealthCheck) validateResourceCount(resourceName string, actualCount, expectedCount int, status *HealthStatus) {
+	if expectedCount == 0 {
+		return
+	}
+	if actualCount < expectedCount {
 		status.Errors = append(status.Errors, fmt.Sprintf(msgResourceNotStarted,
-			resourceName, actualCount, expectedNodeCount))
+			resourceName, actualCount, expectedCount))
 	}
 }
🧹 Nitpick comments (5)
pkg/operator/ceohelpers/external_etcd_status.go (1)

166-201: Also sync the operator client’s informer; improve error and verbosity

Only syncing the Etcd informer doesn’t ensure operatorClient.GetStaticPodOperatorState() has data. Sync both caches (and return ctx error) to avoid flapping early reads; reduce log chattiness.

 func WaitForEtcdCondition(
   ctx context.Context,
   etcdInformer operatorv1informers.EtcdInformer,
   operatorClient v1helpers.StaticPodOperatorClient,
@@
 ) error {
-  // Wait for the etcd informer to sync before checking condition
-  // This ensures operatorClient.GetStaticPodOperatorState() has data to work with
-  klog.Infof("waiting for etcd informer to sync before checking %s...", conditionName)
-  if !cache.WaitForCacheSync(ctx.Done(), etcdInformer.Informer().HasSynced) {
-    return fmt.Errorf("failed to sync etcd informer")
-  }
-  klog.Infof("etcd informer synced, checking for %s", conditionName)
+  // Wait for informers to sync before checking condition
+  klog.V(2).Infof("waiting for informers to sync before checking %s...", conditionName)
+  if !cache.WaitForCacheSync(ctx.Done(),
+    etcdInformer.Informer().HasSynced,
+    operatorClient.Informer().HasSynced,
+  ) {
+    return fmt.Errorf("failed to sync informers: %w", ctx.Err())
+  }
+  klog.V(2).Infof("informers synced, checking for %s", conditionName)
 
   // Poll until the condition is met
   return wait.PollUntilContextTimeout(ctx, pollInterval, timeout, true, func(ctx context.Context) (bool, error) {
@@
-    if conditionMet {
-      klog.V(2).Infof("%s condition met", conditionName)
+    if conditionMet {
+      klog.V(2).Infof("%s condition met", conditionName)
       return true, nil
     }
     klog.V(4).Infof("%s condition not yet met, waiting...", conditionName)
     return false, nil
   })
 }
manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1)

124-176: Constrain history list lengths to stay under the 1 MiB object budget

Unbounded nodeHistory/fencingHistory can overflow object size. Cap items; keep newest N entries in the collector.

               nodeHistory:
                 description: NodeHistory provides recent operation history for troubleshooting
                 type: array
+                maxItems: 512
               fencingHistory:
                 description: FencingHistory provides recent fencing events
                 type: array
+                maxItems: 128
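
On the collector side, a generic tail helper (Go 1.18+) could enforce the same caps before writing status; the constants mirror the maxItems above and the field names follow the CRD:

// tail keeps the newest n entries, assuming slices are ordered oldest to newest.
func tail[T any](s []T, n int) []T {
	if len(s) <= n {
		return s
	}
	return s[len(s)-n:]
}

// e.g. before updating the CR:
// status.NodeHistory = tail(nodeHistory, 512)
// status.FencingHistory = tail(fencingHistory, 128)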
pkg/tnf/operator/starter.go (2)

366-377: Prevent overlapping runs and limit history for the CronJob

Default CronJob concurrency is Allow; if a run exceeds 1m, jobs overlap. For a lightweight monitor, forbid overlap and keep small histories.

       func(_ *operatorv1.OperatorSpec, cronJob *batchv1.CronJob) error {
         // Set the schedule - run every minute
         cronJob.Spec.Schedule = "* * * * *"
+        // Avoid overlapping runs and control history size
+        cronJob.Spec.ConcurrencyPolicy = batchv1.ForbidConcurrent
+        one := int32(1)
+        cronJob.Spec.SuccessfulJobsHistoryLimit = &one
+        cronJob.Spec.FailedJobsHistoryLimit = &one
+        // Optional: ensure the job doesn't linger if the node is slow
+        // deadline := int64(55)
+        // cronJob.Spec.StartingDeadlineSeconds = &deadline

229-237: Bootstrap wait: consider indefinite retry (like external transition path)

You cap bootstrap wait at 30m, but external transition waits indefinitely with backoff. If bootstrap finishes after 30m, TNF job controllers never start. Consider the same retry-until-ctx-done loop here.

Can you confirm bootstrap always completes within 30 minutes in supported environments?

pkg/tnf/pkg/jobs/cronjob_controller.go (1)

180-213: Reconcile more fields to avoid drift (policy/history)

Extend needsUpdate to cover ConcurrencyPolicy and history limits so hook changes take effect.

 func (c *CronJobController) needsUpdate(existing, desired *batchv1.CronJob) bool {
   // Check if schedule has changed
   if existing.Spec.Schedule != desired.Spec.Schedule {
@@
   }
 
+  // Concurrency policy and history limits
+  if existing.Spec.ConcurrencyPolicy != desired.Spec.ConcurrencyPolicy {
+    klog.V(4).Infof("ConcurrencyPolicy changed from %q to %q", existing.Spec.ConcurrencyPolicy, desired.Spec.ConcurrencyPolicy)
+    return true
+  }
+  if (existing.Spec.SuccessfulJobsHistoryLimit == nil) != (desired.Spec.SuccessfulJobsHistoryLimit == nil) ||
+     (existing.Spec.SuccessfulJobsHistoryLimit != nil && desired.Spec.SuccessfulJobsHistoryLimit != nil &&
+       *existing.Spec.SuccessfulJobsHistoryLimit != *desired.Spec.SuccessfulJobsHistoryLimit) {
+    klog.V(4).Info("SuccessfulJobsHistoryLimit changed")
+    return true
+  }
+  if (existing.Spec.FailedJobsHistoryLimit == nil) != (desired.Spec.FailedJobsHistoryLimit == nil) ||
+     (existing.Spec.FailedJobsHistoryLimit != nil && desired.Spec.FailedJobsHistoryLimit != nil &&
+       *existing.Spec.FailedJobsHistoryLimit != *desired.Spec.FailedJobsHistoryLimit) {
+    klog.V(4).Info("FailedJobsHistoryLimit changed")
+    return true
+  }
+
   // Check if the job template has changed (containers, command, image, etc.)
   if len(existing.Spec.JobTemplate.Spec.Template.Spec.Containers) > 0 &&
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 8e58a60 and 29b5b15.

📒 Files selected for processing (26)
  • .gitignore (2 hunks)
  • Dockerfile.ocp (1 hunks)
  • bindata/tnfdeployment/clusterrole.yaml (1 hunks)
  • bindata/tnfdeployment/cronjob.yaml (1 hunks)
  • bindata/tnfdeployment/job.yaml (1 hunks)
  • cmd/tnf-runtime/main.go (4 hunks)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
  • pkg/operator/ceohelpers/external_etcd_status.go (2 hunks)
  • pkg/operator/starter.go (1 hunks)
  • pkg/tnf/operator/starter.go (5 hunks)
  • pkg/tnf/operator/starter_test.go (5 hunks)
  • pkg/tnf/pkg/jobs/batch.go (1 hunks)
  • pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/client_helpers.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/statuscollector.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/statuscollector_test.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/standby_node.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/doc.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/register.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/types.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
  • pkg/tnf/pkg/jobs/batch.go
  • pkg/tnf/pkg/pacemaker/testdata/standby_node.xml
  • .gitignore
  • pkg/tnf/pkg/pacemaker/testdata/healthy_cluster.xml
  • pkg/tnf/pkg/pacemaker/testdata/offline_node.xml
  • pkg/tnf/pkg/pacemaker/v1alpha1/types.go
  • pkg/tnf/pkg/pacemaker/v1alpha1/doc.go
  • bindata/tnfdeployment/clusterrole.yaml
  • pkg/tnf/pkg/pacemaker/healthcheck_test.go
  • Dockerfile.ocp
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml

[medium] 1-48: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-48: Container should not be privileged

(CKV_K8S_16)


[medium] 1-48: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-48: Containers should not share the host process ID namespace

(CKV_K8S_17)

🔇 Additional comments (20)
bindata/tnfdeployment/job.yaml (1)

18-18: LGTM! Binary rename is consistent.

The command array correctly references the renamed tnf-runtime binary, aligning with changes in Dockerfile.ocp and cmd/tnf-runtime/main.go.

pkg/operator/starter.go (1)

630-631: LGTM! Parameters correctly passed.

The dynamicClient and AlivenessChecker parameters are correctly added to the HandleDualReplicaClusters call, matching the updated function signature for Pacemaker health check integration.

pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml (1)

1-66: LGTM! Well-structured test data template.

The template correctly uses placeholders ({{RECENT_TIMESTAMP}} and {{RECENT_TIMESTAMP_ISO}}) for dynamic timestamp injection, enabling effective testing of time-window filtering for recent failures and fencing events.

cmd/tnf-runtime/main.go (3)

21-21: LGTM! Runtime command properly renamed.

The function rename from NewTnfSetupRunnerCommand to NewTnfRuntimeCommand and updated command metadata are consistent with the broader binary rename in the PR.

Also applies to: 41-41, 48-52


63-63: LGTM! Monitor subcommand properly registered.

The monitor subcommand is correctly added to the root command structure.


120-129: LGTM! Monitor command correctly implemented.

The implementation properly addresses past review feedback by reusing pacemaker.NewPacemakerStatusCollectorCommand() directly instead of calling Execute() inside Run. The metadata overrides and silence flags are appropriate.
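
For reference, a minimal sketch of the reuse pattern described above, assuming this PR's pacemaker package path and command constructor; the metadata overrides are illustrative:

import (
	"github.com/openshift/cluster-etcd-operator/pkg/tnf/pkg/pacemaker"
	"github.com/spf13/cobra"
)

// addMonitorCommand wires the collector in as a subcommand. The command is
// reused directly, so cobra invokes its Run itself rather than going through
// a nested Execute() call.
func addMonitorCommand(root *cobra.Command) {
	monitor := pacemaker.NewPacemakerStatusCollectorCommand()
	monitor.Use = "monitor"
	monitor.SilenceUsage = true
	monitor.SilenceErrors = true
	root.AddCommand(monitor)
}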

bindata/tnfdeployment/cronjob.yaml (2)

1-12: LGTM! CronJob configuration is appropriate.

The CronJob spec with concurrencyPolicy: Forbid is correct for status collection, preventing overlapping runs and ensuring consistent state snapshots.


26-48: Security context is appropriate for Pacemaker operations.

The privileged security context and hostPID: true are necessary for the collector to execute pcs status commands and access Pacemaker cluster state. Static analysis warnings are expected false positives for this infrastructure component. Ensure the tnf-setup-manager service account has appropriate RBAC permissions for PacemakerStatus CRs (as indicated in bindata/tnfdeployment/clusterrole.yaml).

pkg/tnf/operator/starter_test.go (2)

34-34: LGTM! Test dependencies properly added.

The new imports and struct field additions correctly support the updated HandleDualReplicaClusters signature with aliveness checker and dynamic client parameters.

Also applies to: 40-40, 54-54


95-95: LGTM! Test properly wired and past feedback addressed.

The test correctly instantiates the aliveness checker and passes all required parameters to HandleDualReplicaClusters. The use of the scoped ctx (Line 266) instead of context.Background() properly addresses past review feedback, preventing goroutine leaks across tests.

Also applies to: 107-108, 259-269

pkg/tnf/pkg/pacemaker/client_helpers.go (2)

37-56: LGTM! REST client creation follows best practices.

The createPacemakerRESTClient function correctly configures a REST client for PacemakerStatus CRDs, including scheme registration, GroupVersion setup, and codec factory initialization. Error handling is comprehensive.


17-32: Constants are properly defined and accessible.

Verification confirms all constants used in the getKubeConfig function are defined in the same package:

  • envKubeconfig, envHome, defaultKubeconfigPath are defined in statuscollector.go (lines 57-59)
  • kubernetesAPIPath is defined in healthcheck.go (line 79)

Since all files are in the same package (pkg/tnf/pkg/pacemaker/), the constants are properly accessible. The code is correct.

pkg/tnf/pkg/pacemaker/statuscollector_test.go (5)

15-37: LGTM! Excellent test helper design.

The loadTestXML and getRecentFailuresXML helpers are well-designed. Using dynamic timestamps with template substitution enables robust testing of time-window filtering without test flakiness.
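
As a rough sketch of that substitution (placeholder names from recent_failures_template.xml; the two timestamp layouts are assumptions matching pacemaker's textual and ISO formats):

func getRecentFailuresXML(t *testing.T) string {
	t.Helper()
	templateXML := loadTestXML(t, "recent_failures_template.xml")
	now := time.Now()
	// Stamp the events with "just now" so they always land inside the
	// 5-minute and 24-hour detection windows, regardless of when tests run.
	out := strings.ReplaceAll(templateXML, "{{RECENT_TIMESTAMP}}", now.Format("Mon Jan 2 15:04:05 2006"))
	out = strings.ReplaceAll(out, "{{RECENT_TIMESTAMP_ISO}}", now.Format(time.RFC3339))
	return out
}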


39-82: LGTM! Comprehensive healthy cluster test.

The test thoroughly validates all components of a healthy cluster including pacemaker state, quorum, node counts, resource presence, and absence of failures/fencing events.


84-165: LGTM! Thorough coverage of degraded states.

The tests effectively validate offline node detection, standby node handling, and recent failures/fencing detection with appropriate assertions on all relevant status fields.


167-355: LGTM! Excellent edge case and variation coverage.

The time-window filtering test precisely validates the 5-minute and 24-hour windows with explicit time calculations. Table-driven tests for quorum and node status variations provide comprehensive coverage of state combinations.


398-487: LGTM! Validation and resource agent tests are thorough.

The invalid XML test correctly uses malformed XML to trigger xml.Unmarshal errors (addressing past feedback). The resource agent type test ensures proper handling of different agent formats (systemd, ocf:heartbeat).

manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1)

47-51: Good: bound rawXML size

Adding maxLength on status.rawXML guards object bloat and update failures. LGTM.

pkg/tnf/pkg/pacemaker/v1alpha1/register.go (1)

1-31: Scheme registration looks correct

Group/version constants, SchemeGroupVersion, AddToScheme, and known types are wired as expected.

pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)

1-211: LGTM! Autogenerated deepcopy implementations are correct.

The deepcopy methods properly handle nil checks, nested struct copies, slice allocations, and metav1.Time deep copies. This is standard code-generator output.


openshift-ci bot commented Oct 20, 2025

@jaypoulz: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-disruptive-ovn 81984c7 link false /test e2e-gcp-disruptive-ovn
ci/prow/e2e-aws-etcd-recovery 81984c7 link false /test e2e-aws-etcd-recovery
ci/prow/e2e-vsphere-ovn-etcd-scaling 81984c7 link false /test e2e-vsphere-ovn-etcd-scaling
ci/prow/e2e-gcp-ovn-etcd-scaling 81984c7 link false /test e2e-gcp-ovn-etcd-scaling
ci/prow/e2e-gcp-disruptive 81984c7 link false /test e2e-gcp-disruptive
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown 81984c7 link false /test e2e-metal-ovn-ha-cert-rotation-shutdown
ci/prow/e2e-aws-disruptive-ovn 81984c7 link false /test e2e-aws-disruptive-ovn
ci/prow/e2e-aws-ovn-serial 81984c7 link true /test e2e-aws-ovn-serial
ci/prow/e2e-azure-ovn-etcd-scaling 81984c7 link false /test e2e-azure-ovn-etcd-scaling
ci/prow/e2e-aws-ovn-etcd-scaling 81984c7 link true /test e2e-aws-ovn-etcd-scaling
ci/prow/e2e-metal-assisted 81984c7 link true /test e2e-metal-assisted
ci/prow/e2e-aws-cpms 81984c7 link true /test e2e-aws-cpms
ci/prow/e2e-metal-ovn-two-node-fencing 81984c7 link false /test e2e-metal-ovn-two-node-fencing
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown 81984c7 link false /test e2e-metal-ovn-sno-cert-rotation-shutdown
ci/prow/e2e-aws-disruptive 81984c7 link false /test e2e-aws-disruptive
ci/prow/e2e-gcp 81984c7 link false /test e2e-gcp
ci/prow/e2e-metal-ipi-ovn-ipv6 29b5b15 link true /test e2e-metal-ipi-ovn-ipv6
ci/prow/configmap-scale 29b5b15 link false /test configmap-scale
ci/prow/e2e-agnostic-ovn-upgrade 29b5b15 link true /test e2e-agnostic-ovn-upgrade
ci/prow/images 29b5b15 link true /test images
ci/prow/e2e-aws-ovn-serial-1of2 29b5b15 link true /test e2e-aws-ovn-serial-1of2
ci/prow/e2e-operator 29b5b15 link true /test e2e-operator
ci/prow/e2e-aws-ovn-single-node 29b5b15 link true /test e2e-aws-ovn-single-node
ci/prow/okd-scos-e2e-aws-ovn 29b5b15 link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-agnostic-ovn 29b5b15 link true /test e2e-agnostic-ovn
ci/prow/e2e-aws-ovn-serial-2of2 29b5b15 link true /test e2e-aws-ovn-serial-2of2

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

The goal of this change was to ensure that the CEO permissions for the pcs status cronjob
are elevated enough to gather pcs status output. One future improvement would be to move
this out of the CEO entirely and have the status gathering done by an alert agent that
posts regular updates to the CEO.

To ease a future transition to that model, I've structured the code so that the
healthcheck controller doesn't care where the PacemakerStatus CRD updates come from - it
just watches the CR for changes. In the future, we can write an alert agent that posts
new PacemakerStatus updates as they happen, and the CEO can retire the cronjob entirely.
@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from 29b5b15 to 6114770 on October 21, 2025 15:00

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
pkg/tnf/pkg/pacemaker/statuscollector.go (1)

54-54: Remove unused constant.

The notFoundError constant is unused after refactoring to use apierrors.IsNotFound(err) at line 481. This constant should have been removed when addressing the past review comment about error handling.

Apply this diff:

-	// Error check strings
-	notFoundError = "not found"
-
🧹 Nitpick comments (8)
pkg/tnf/pkg/pacemaker/client_helpers.go (1)

22-25: Consider safer path construction.

The kubeconfig path construction on line 24 uses direct string concatenation. If HOME is empty, this results in "/.kube/config". While BuildConfigFromFlags will handle the error, using filepath.Join would be more idiomatic and safer.

Apply this diff for improved clarity:

+	"path/filepath"
+
 	kubeconfig := os.Getenv(envKubeconfig)
 	if kubeconfig == "" {
-		kubeconfig = os.Getenv(envHome) + defaultKubeconfigPath
+		home := os.Getenv(envHome)
+		if home != "" {
+			kubeconfig = filepath.Join(home, defaultKubeconfigPath)
+		} else {
+			kubeconfig = defaultKubeconfigPath
+		}
 	}
bindata/tnfdeployment/cronjob.yaml (1)

32-35: Consider adding resource limits.

While resource requests are defined (10m CPU, 32Mi memory), no limits are set. For a CronJob that runs periodically on master nodes, adding limits helps prevent resource exhaustion.

Apply this diff to add resource limits:

               resources:
                 requests:
                   cpu: 10m
                   memory: 32Mi
+                limits:
+                  cpu: 50m
+                  memory: 64Mi
pkg/tnf/pkg/pacemaker/statuscollector_test.go (2)

15-20: Mark helpers with t.Helper() for clearer failure locations.

This improves test diagnostics by attributing assertion failures to the caller.

 func loadTestXML(t *testing.T, filename string) string {
+	t.Helper()
 	path := filepath.Join("testdata", filename)
 	data, err := os.ReadFile(path)
 	require.NoError(t, err, "Failed to read test XML file: %s", filename)
 	return string(data)
 }

 // Helper function to generate recent failures XML with dynamic timestamps
 func getRecentFailuresXML(t *testing.T) string {
+	t.Helper()
 	// Load the template from testdata
 	templateXML := loadTestXML(t, "recent_failures_template.xml")

Also applies to: 23-37


167-174: Stabilize time-based tests to avoid flake near window edges.

Consider injecting a clock (now func) into buildStatusComponents or using fixed, test-controlled “now” to make window filtering tests deterministic.
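
A minimal sketch of that injection, with a hypothetical failedAction type standing in for the parsed history entries; production code would default the clock to time.Now:

type failedAction struct {
	LastRCChange time.Time
}

// recentFailures keeps only entries newer than now() minus the window.
// Accepting now as a function makes the cutoff fully test-controlled.
func recentFailures(entries []failedAction, now func() time.Time, window time.Duration) []failedAction {
	cutoff := now().Add(-window)
	var recent []failedAction
	for _, e := range entries {
		if e.LastRCChange.After(cutoff) {
			recent = append(recent, e)
		}
	}
	return recent
}

A test can then pin now to a fixed time.Date value and place entries exactly on either side of the 5-minute boundary, with no sleeps and no flake risk.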

pkg/tnf/operator/starter.go (1)

371-388: Harden CronJob: forbid overlap, bound histories, add starting deadline.

Prevents piling jobs on slow clusters and curbs history growth.

 			// Set the schedule - run every minute
 			cronJob.Spec.Schedule = "* * * * *"

+			// Avoid overlapping runs and bound job histories
+			forbid := batchv1.ForbidConcurrent
+			cronJob.Spec.ConcurrencyPolicy = forbid
+			successful := int32(1)
+			failed := int32(3)
+			cronJob.Spec.SuccessfulJobsHistoryLimit = &successful
+			cronJob.Spec.FailedJobsHistoryLimit = &failed
+			// If a start is missed (e.g., controller restart), try within 2m
+			sds := int64(120)
+			cronJob.Spec.StartingDeadlineSeconds = &sds
+
 			// Set labels at all levels
 			cronJob.Labels["app.kubernetes.io/name"] = pacemakerStatusCollectorName
 			cronJob.Spec.JobTemplate.Labels["app.kubernetes.io/name"] = pacemakerStatusCollectorName
 			cronJob.Spec.JobTemplate.Spec.Template.Labels["app.kubernetes.io/name"] = pacemakerStatusCollectorName
manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1)

54-82: Expose a truncation flag for rawXML to pair with the 256 KiB cap.

Helps operators understand when raw XML was clipped to satisfy CRD limits.

               summary:
                 description: Summary provides high-level counts and flags for the cluster state
                 type: object
                 properties:
+                  rawXMLTruncated:
+                    description: RawXMLTruncated indicates RawXML was truncated to fit CRD limits
+                    type: boolean
                   pacemakerdState:
                     description: PacemakerdState indicates if pacemaker is running
                     type: string
pkg/tnf/pkg/jobs/cronjob_controller.go (1)

180-213: Broaden needsUpdate to cover concurrency policy and history limits.

Prevents configuration drift when hooks adjust these fields.

 func (c *CronJobController) needsUpdate(existing, desired *batchv1.CronJob) bool {
   // Check if schedule has changed
   if existing.Spec.Schedule != desired.Spec.Schedule {
     klog.V(4).Infof("Schedule changed from %q to %q", existing.Spec.Schedule, desired.Spec.Schedule)
     return true
   }
+
+  // Check concurrency policy
+  if existing.Spec.ConcurrencyPolicy != desired.Spec.ConcurrencyPolicy {
+    klog.V(4).Infof("ConcurrencyPolicy changed from %q to %q", existing.Spec.ConcurrencyPolicy, desired.Spec.ConcurrencyPolicy)
+    return true
+  }
+  // Check history limits (nil-safe)
+  get := func(p *int32) int32 { if p == nil { return -1 }; return *p }
+  if get(existing.Spec.SuccessfulJobsHistoryLimit) != get(desired.Spec.SuccessfulJobsHistoryLimit) {
+    klog.V(4).Infof("SuccessfulJobsHistoryLimit changed")
+    return true
+  }
+  if get(existing.Spec.FailedJobsHistoryLimit) != get(desired.Spec.FailedJobsHistoryLimit) {
+    klog.V(4).Infof("FailedJobsHistoryLimit changed")
+    return true
+  }

   // Check if the job template has changed (containers, command, image, etc.)
pkg/tnf/pkg/pacemaker/healthcheck.go (1)

154-163: Consider returning errors instead of panicking.

While panicking on initialization failures (lines 156, 162) is acceptable for unrecoverable errors during operator startup, returning errors would provide better error propagation and testing flexibility.

This is a minor improvement and the current approach is acceptable for operator initialization code.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 29b5b15 and 6114770.

📒 Files selected for processing (26)
  • .gitignore (2 hunks)
  • Dockerfile.ocp (1 hunks)
  • bindata/tnfdeployment/clusterrole.yaml (1 hunks)
  • bindata/tnfdeployment/cronjob.yaml (1 hunks)
  • bindata/tnfdeployment/job.yaml (1 hunks)
  • cmd/tnf-runtime/main.go (4 hunks)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
  • pkg/operator/ceohelpers/external_etcd_status.go (2 hunks)
  • pkg/operator/starter.go (1 hunks)
  • pkg/tnf/operator/starter.go (5 hunks)
  • pkg/tnf/operator/starter_test.go (5 hunks)
  • pkg/tnf/pkg/jobs/batch.go (1 hunks)
  • pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/client_helpers.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/statuscollector.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/statuscollector_test.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/testdata/standby_node.xml (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/doc.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/register.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/types.go (1 hunks)
  • pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • pkg/tnf/pkg/pacemaker/testdata/offline_node.xml
🚧 Files skipped from review as they are similar to previous changes (9)
  • pkg/tnf/pkg/pacemaker/testdata/standby_node.xml
  • pkg/tnf/pkg/pacemaker/testdata/healthy_cluster.xml
  • pkg/operator/starter.go
  • bindata/tnfdeployment/clusterrole.yaml
  • pkg/tnf/pkg/pacemaker/healthcheck_test.go
  • cmd/tnf-runtime/main.go
  • pkg/tnf/pkg/pacemaker/v1alpha1/types.go
  • pkg/tnf/operator/starter_test.go
  • Dockerfile.ocp
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml

[medium] 1-48: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-48: Container should not be privileged

(CKV_K8S_16)


[medium] 1-48: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-48: Containers should not share the host process ID namespace

(CKV_K8S_17)

🔇 Additional comments (23)
pkg/tnf/pkg/pacemaker/client_helpers.go (1)

37-56: LGTM!

The REST client creation follows standard Kubernetes client-go patterns with proper scheme registration, config copying, and error handling.

pkg/operator/ceohelpers/external_etcd_status.go (1)

167-201: LGTM!

The WaitForEtcdCondition helper follows best practices for waiting on Kubernetes resources: it syncs the informer cache first, then polls with proper context cancellation, timeout handling, and clear logging. The error retry logic correctly returns (false, nil) to continue polling on transient errors.
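
A condensed sketch of that polling shape using the apimachinery wait helpers; the function name and 10-second interval here are illustrative, not the helper's actual signature:

func waitForCondition(ctx context.Context, timeout time.Duration, check func() (bool, error)) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, timeout, true,
		func(ctx context.Context) (bool, error) {
			done, err := check()
			if err != nil {
				// Transient error: log and keep polling rather than aborting.
				klog.V(2).Infof("condition not yet verifiable, retrying: %v", err)
				return false, nil
			}
			return done, nil
		})
}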

bindata/tnfdeployment/job.yaml (1)

18-18: LGTM!

Binary name update from tnf-setup-runner to tnf-runtime aligns with the broader runtime refactoring in this PR.

.gitignore (1)

14-14: LGTM!

Gitignore update correctly reflects the binary rename from tnf-setup-runner to tnf-runtime.

pkg/tnf/pkg/pacemaker/v1alpha1/doc.go (1)

1-3: LGTM!

Standard Kubernetes API group documentation with correct +groupName annotation for code generation.

pkg/tnf/pkg/jobs/batch.go (1)

42-48: LGTM!

The ReadCronJobV1OrDie function correctly mirrors the existing ReadJobV1OrDie pattern, maintaining consistency with the codebase's "OrDie" convention for manifest decoding.

pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml (1)

1-66: LGTM!

Test data correctly models Pacemaker status XML with template placeholders for timestamp injection, supporting recent-failures test scenarios.

bindata/tnfdeployment/cronjob.yaml (4)

5-5: Past issue resolved.

The namespace placeholder issue raised in past review comments has been addressed. The manifest now uses <injected> for the namespace.


31-31: Past issue resolved.

The command placeholder now correctly uses array syntax ["<injected>"], addressing the type safety concern from past reviews.


36-41: Security context is appropriate for Pacemaker operations.

Static analysis tools flag the privileged security context and hostPID: true as security issues. However, these settings are required for this CronJob to execute pcs status commands and interact with the Pacemaker cluster on the host. The priorityClassName: system-node-critical and master node selector further indicate this is a system-level component requiring elevated privileges.


43-43: RBAC permissions for tnf-setup-manager are properly configured.

Verification confirms the ServiceAccount has all necessary permissions:

  • PacemakerStatus CRs: ClusterRole grants create, update, patch on both pacemakerstatuses and pacemakerstatuses/status resources
  • Privileged operations on master nodes: Pod runs with allowPrivilegeEscalation: true, hostPID: true, and explicitly targets master nodes via nodeSelector; ClusterRole includes use permission on the privileged SecurityContextConstraint
  • Etcd resources: ClusterRole permits get, list, watch, patch, update on both etcds and etcds/status resources

No additional RBAC configuration is required.

pkg/tnf/pkg/pacemaker/v1alpha1/register.go (1)

1-31: LGTM — standard scheme wiring.

Group/Version, SchemeBuilder, AddToScheme, and known types are registered correctly.

manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (2)

123-152: No issues found — RFC3339 serialization is correctly implemented.

The code properly handles the conversion from pacemaker's XML timestamp format ("Mon Jan 2 15:04:05 2006") to RFC3339:

  1. Pacemaker XML timestamps are parsed using the correct format constant (line 46 of statuscollector.go)
  2. Parsed times are wrapped with metav1.NewTime() to create metav1.Time instances (line 405)
  3. Go structs use metav1.Time for all timestamp fields (types.go lines 36, 150, 166)
  4. The CRD YAML correctly specifies type: string with format: date-time (RFC3339-compatible)
  5. metav1.Time automatically serializes to RFC3339 format in JSON output
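
Condensed, the conversion chain looks roughly like this (the layout constant corresponds to item 1 above, metav1.NewTime to item 2):

const pacemakerTimeLayout = "Mon Jan 2 15:04:05 2006"

func parsePacemakerTime(raw string) (metav1.Time, error) {
	t, err := time.Parse(pacemakerTimeLayout, raw)
	if err != nil {
		return metav1.Time{}, err
	}
	// metav1.Time marshals to RFC3339 in JSON, which satisfies the CRD's
	// format: date-time without any further conversion.
	return metav1.NewTime(t), nil
}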

100-123: Ignore this review comment—it conflates different architectural layers.

The XML parser (statuscollector.go) and Kubernetes API type (v1alpha1/types.go) are separate concerns:

  • XML parsing layer uses ID field from Pacemaker's XML output (xml:"id,attr")
  • Kubernetes API layer exposes Name field in the CRD schema (json:"name")

The ResourceStatus struct correctly maps to the CRD schema (both use name as the required field). The XML ID is an internal parsing detail, not exposed to the API. No validation errors will occur.

Likely an incorrect or invalid review comment.
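
Reduced to the tags in question, the two layers look roughly like this (both structs are illustrative reductions of the PR's types):

// XML parsing layer: consumes pacemaker's id attribute.
type xmlResource struct {
	ID    string `xml:"id,attr"`
	Agent string `xml:"resource_agent,attr"`
}

// Kubernetes API layer: exposes name in the CRD schema.
type ResourceStatus struct {
	Name string `json:"name"`
	Role string `json:"role,omitempty"`
}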

pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)

1-211: LGTM! Autogenerated deepcopy implementations.

This file is autogenerated by deepcopy-gen and follows standard Kubernetes patterns for deep-copy implementations. No manual review is required.

pkg/tnf/pkg/pacemaker/statuscollector.go (4)

64-197: XML structures look well-designed.

The XML unmarshaling structures comprehensively cover the pacemaker status output and follow Go XML conventions.


236-286: Excellent error handling and security practices.

The XML size validation (lines 265-269) protects against XML bombs, and preserving raw XML on parse failure (line 279) aids debugging. The execution timeout prevents hangs.
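
The guard reduces to a sketch like this, where maxXMLSize matches the PR's constant and pacemakerResult stands in for the actual parse target:

const maxXMLSize = 10 * 1024 * 1024 // 10 MiB

func parseStatusXML(raw []byte) (*pacemakerResult, error) {
	// Reject oversized input before it reaches the decoder, so a huge
	// payload cannot exhaust memory during unmarshaling.
	if len(raw) > maxXMLSize {
		return nil, fmt.Errorf("pcs status output too large: %d bytes (max %d)", len(raw), maxXMLSize)
	}
	var result pacemakerResult
	if err := xml.Unmarshal(raw, &result); err != nil {
		return nil, fmt.Errorf("parsing pcs status xml: %w", err)
	}
	return &result, nil
}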


288-443: Time-windowed history filtering looks correct.

The 5-minute window for failed actions (lines 368-409) and 24-hour window for fencing events (lines 413-440) provide appropriate recent-event detection with proper timestamp parsing and error handling.


448-551: CR creation and update logic is correct.

The use of apierrors.IsNotFound (line 481) and the POST+PUT sequence to initialize the status subresource (lines 502-520) properly address the past review feedback. The update path (lines 528-550) correctly uses the status subresource.

pkg/tnf/pkg/pacemaker/healthcheck.go (4)

412-429: Operator status update logic is appropriate.

The switch correctly sets degraded only for Error status (line 422) and clears it for Healthy status (line 424). Warning and Unknown statuses don't modify the degraded condition, which allows the operator to distinguish between Pacemaker cluster issues (Error) and monitoring/collection issues (Unknown).


338-373: Resource validation logic is clear.

The resource count validation (lines 363-364, 369) checks that critical resources (kubelet, etcd) are started on the expected number of nodes. While this may report resource errors when nodes are offline (already flagged by checkNodeStatus), the explicit reporting provides clarity about which specific resources are affected.


522-624: Excellent event deduplication and recording logic.

The event recording system properly implements:

  • Time-based deduplication with appropriate windows (1 hour for fencing, 5 minutes for others)
  • Transition-based healthy event recording (lines 592-603) to avoid spam
  • Thread-safe access to the event map with mutex protection
  • Cleanup of expired events (lines 522-543)
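
A minimal sketch of that deduplication shape; the type and field names are illustrative, and the windows mirror the review's one-hour and five-minute values:

type eventDeduper struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
}

// shouldRecord reports whether an event keyed by reason/message should be
// emitted, suppressing repeats that fall inside the given window.
func (d *eventDeduper) shouldRecord(key string, window time.Duration) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	now := time.Now()
	// Evict entries older than the longest window (1h) to bound the map.
	for k, ts := range d.lastSeen {
		if now.Sub(ts) > time.Hour {
			delete(d.lastSeen, k)
		}
	}
	if ts, ok := d.lastSeen[key]; ok && now.Sub(ts) <= window {
		return false // duplicate within the window
	}
	d.lastSeen[key] = now
	return true
}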

442-520: Condition management is well-implemented.

The idempotency checks in setPacemakerDegradedCondition (lines 448-460) and clearPacemakerDegradedCondition (lines 474-483) prevent unnecessary operator status updates, reducing API churn. The use of library-go helpers (line 512) ensures consistency with operator patterns.
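
The idempotent update reduces to roughly this pattern with the library-go v1helpers package; the PacemakerDegraded type comes from this PR, while the reason and surrounding names are illustrative:

func setPacemakerDegraded(ctx context.Context, client v1helpers.OperatorClient, reason, message string) error {
	cond := operatorv1.OperatorCondition{
		Type:    "PacemakerDegraded",
		Status:  operatorv1.ConditionTrue,
		Reason:  reason,
		Message: message,
	}
	// Skip the write when the condition already matches, avoiding API churn.
	if _, status, _, err := client.GetOperatorState(); err == nil {
		if existing := v1helpers.FindOperatorCondition(status.Conditions, cond.Type); existing != nil &&
			existing.Status == cond.Status && existing.Reason == cond.Reason && existing.Message == cond.Message {
			return nil
		}
	}
	_, _, err := v1helpers.UpdateStatus(ctx, client, v1helpers.UpdateConditionFn(cond))
	return err
}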

Comment on lines +47 to +51

              rawXML:
                description: RawXML contains the raw XML output from pcs status xml command
                type: string
                maxLength: 262144
              collectionError:

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Collector must clamp rawXML to <= 262144 bytes before status update.

Ensure the status writer truncates pcs XML to this cap, otherwise status updates will be rejected.

Please confirm the collector enforces this cap and (optionally) sets a truncation indicator:


🏁 Script executed:

#!/bin/bash
# Check for clamp logic in collector/status writer
rg -nC2 -e 'rawXML' -e 'maxXMLSize|262144|truncate|Truncated' --type=go

Length of output: 223293


Status update will be rejected unless rawXML is clamped to 262144 bytes (the CRD maxLength).

The codebase enforces a 10MB XML limit (maxXMLSize = 10 * 1024 * 1024 in pkg/tnf/pkg/pacemaker/statuscollector.go:30) but does not truncate to the CRD's declared maxLength: 262144. Any XML between 256KB and 10MB will cause the API server to reject the status update as a validation error.

The collector must truncate rawXML to ≤262144 bytes before calling updatePacemakerStatusCR() at line 528 to prevent rejection. Currently, the code only validates against the 10MB threshold (line 265) and sets an error for larger values, but does nothing to clamp compliant-but-oversized payloads.

🤖 Prompt for AI Agents
In pkg/tnf/pkg/pacemaker/statuscollector.go around lines near 265 and 528, the
code validates XML only against maxXMLSize (10MB) but does not clamp rawXML to
the CRD-declared maxLength (262144 bytes) before calling
updatePacemakerStatusCR(), causing API server validation failures for payloads
between 256KB and 10MB; modify the collector to truncate rawXML to at most
262144 bytes (CRD maxLength) — perform the truncation on the byte slice (not
rune count) after reading/validating against maxXMLSize and before calling
updatePacemakerStatusCR(), and ensure the truncated content is used in the
status update (optionally set/append a flag or error field indicating
truncation) so the API server will accept the status update.
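
A sketch of the requested clamp, truncating on the byte slice before the status update; maxRawXMLLength mirrors the CRD's maxLength, and the boolean is the optional truncation indicator mentioned above:

const maxRawXMLLength = 262144 // keep in sync with the CRD's maxLength

// clampRawXML truncates by bytes, not runes, since the API server enforces
// maxLength against the serialized string.
func clampRawXML(raw string) (clamped string, truncated bool) {
	if len(raw) <= maxRawXMLLength {
		return raw, false
	}
	return raw[:maxRawXMLLength], true
}

The collector would call this just before updatePacemakerStatusCR() and surface the truncated flag through a status field or a collectionError note.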
