
Conversation


@jaypoulz jaypoulz commented Sep 17, 2025

Change List

  • Added a PacemakerHealthCheck controller which monitors pacemaker cluster status every 30 seconds by executing 'pcs status --full' and parsing node/resource health
  • Updated operator condition to degraded when nodes are not online or critical resources (kubelet/etcd) are not started on all nodes
  • Added event recording for warnings of recent failed resource actions (5min window) and fencing events (24hr window) with appropriate event reasons
  • Included happy path testing

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 17, 2025

openshift-ci-robot commented Sep 17, 2025

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Change List

  • Added a PacemakerHealthCheck controller which monitors pacemaker cluster status every 30 seconds by executing 'pcs status --full' and parsing node/resource health
  • Updated operator condition to degraded when nodes are offline or critical resources (kubelet/etcd) are not started on all nodes
  • Added event recording for warnings of recent failed resource actions (5min window) and fencing events (24hr window) with appropriate event reasons
  • Included happy path testing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


coderabbitai bot commented Sep 17, 2025

Walkthrough

Adds Pacemaker status collection and health subsystems: a collector CLI and CronJob that gather pcs status xml output into a new PacemakerStatus CR, a HealthCheck controller that parses those CRs into HealthStatus and operator conditions, CRD/types, a CronJob controller, RBAC, tests, and XML fixtures.
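As a rough, hedged illustration of the decoding step described above, the sketch below parses a pcs status xml snippet with Go's encoding/xml. The struct layout, element names, and attribute names here are assumptions made for illustration; the PR's actual PacemakerResult types in pkg/tnf/pacemaker/healthcheck.go may differ.

// Minimal, self-contained sketch; element and attribute names are illustrative.
package main

import (
	"encoding/xml"
	"fmt"
)

type pacemakerResult struct {
	XMLName xml.Name `xml:"pacemaker-result"`
	Nodes   struct {
		Node []struct {
			Name    string `xml:"name,attr"`
			Online  string `xml:"online,attr"`
			Standby string `xml:"standby,attr"`
		} `xml:"node"`
	} `xml:"nodes"`
}

func main() {
	raw := []byte(`<pacemaker-result><nodes>
  <node name="master-0" online="true" standby="false"/>
  <node name="master-1" online="false" standby="false"/>
</nodes></pacemaker-result>`)

	var result pacemakerResult
	if err := xml.Unmarshal(raw, &result); err != nil {
		panic(err)
	}
	for _, n := range result.Nodes.Node {
		fmt.Printf("node %s online=%s standby=%s\n", n.Name, n.Online, n.Standby)
	}
}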

Changes

Cohort / File(s) and summary of changes:

  • Pacemaker health-check controller (pkg/tnf/pacemaker/healthcheck.go): New HealthCheck controller, exported HealthStatus type, exported XML parsing types, informer/REST client wiring, and a sync loop that parses PacemakerStatus CR raw XML, derives node/resource/failure/fencing info, computes overall health, updates the operator degraded condition, and records deduplicated events.
  • Pacemaker health-check tests (pkg/tnf/pacemaker/healthcheck_test.go): New comprehensive unit tests and helpers covering controller construction, PacemakerStatus retrieval, XML parsing, node/resource/failure/fencing collection, health determination, event recording/deduplication, and operator condition updates, using fake clients and fixtures.
  • Pacemaker XML fixtures (pkg/tnf/pacemaker/testdata/healthy_cluster.xml, .../offline_node.xml, .../recent_failures_template.xml, .../standby_node.xml): Static Pacemaker XML test fixtures representing healthy, offline-with-fencing, recent-failures, and standby-node cluster snapshots.
  • PacemakerStatus CRD & API types (manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml, pkg/tnf/pacemaker/v1alpha1/doc.go, .../register.go, .../types.go, .../zz_generated.deepcopy.go): New CRD manifest and v1alpha1 API types (PacemakerStatus, spec/status/summary), scheme registration, package doc, and autogenerated deepcopy implementations.
  • Pacemaker status collector CLI (pkg/cmd/pacemakerstatuscollector/collector.go, cmd/cluster-etcd-operator/main.go): New Cobra command NewPacemakerStatusCollectorCommand() and collector implementation that runs pcs status xml, validates size/timeouts, parses XML into PacemakerResult, builds PacemakerSummary, detects recent failures/fencing, and creates/updates the PacemakerStatus CR; the command is registered in main (a generic sketch of such command wiring follows this list).
  • CronJob controller & manifest support (pkg/tnf/pkg/jobs/cronjob_controller.go, pkg/tnf/pkg/jobs/batch.go, bindata/tnfdeployment/cronjob.yaml): New CronJobController and CronJobConfig to manage the CronJob lifecycle (create/update) for the collector, a ReadCronJobV1OrDie helper for decoding CronJob manifests, and a bundled CronJob YAML template.
  • Operator startup wiring (pkg/tnf/operator/starter.go, pkg/tnf/operator/starter_test.go, pkg/operator/starter.go): Propagates alivenessChecker into HandleDualReplicaClusters, adds runPacemakerControllers to start the collector CronJob and HealthCheck controller, updates call sites and tests to pass the aliveness checker, and adds etcd SA resources to the deployed manifests.
  • RBAC manifests for collector (bindata/etcd/etcd-sa-clusterrole.yaml, bindata/etcd/etcd-sa-clusterrolebinding.yaml): New ClusterRole granting CRUD/patch on pacemakerstatuses and a ClusterRoleBinding binding the etcd-sa ServiceAccount to that role.
  • Etcd helper & job support (pkg/operator/ceohelpers/external_etcd_status.go, pkg/tnf/pkg/jobs/*): New WaitForEtcdCondition helper for polling etcd-related conditions, plus CronJob/Job decoding support and CronJob controller wiring.
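The collector command registration called out above follows the standard Cobra subcommand pattern. The sketch below is generic: the command name and body are placeholders, and it assumes nothing about the real NewPacemakerStatusCollectorCommand() beyond the idea of registering a subcommand on the operator's root command.

// Illustrative-only wiring of a collector subcommand into a cobra root command.
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func newPacemakerStatusCollectorCommand() *cobra.Command {
	return &cobra.Command{
		Use:   "pacemaker-status-collector",
		Short: "Collect pcs status xml and publish it as a PacemakerStatus CR",
		RunE: func(cmd *cobra.Command, args []string) error {
			// Real implementation: run pcs status xml, parse it, build the summary,
			// and create/update the PacemakerStatus CR. Placeholder here.
			fmt.Println("collecting pacemaker status...")
			return nil
		},
	}
}

func main() {
	rootCmd := &cobra.Command{Use: "cluster-etcd-operator"}
	rootCmd.AddCommand(newPacemakerStatusCollectorCommand())
	if err := rootCmd.Execute(); err != nil {
		os.Exit(1)
	}
}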

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 16.67%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title Check (✅ Passed): The pull request title "OCPEDGE-2084: Add pacemaker health check for ExternalEtcd clusters" directly summarizes the primary change in the changeset. The raw summary shows that the main addition is a new PacemakerHealthCheck controller that monitors pacemaker cluster status, along with supporting infrastructure for status collection, event recording, and CRD definitions. The title is concise, clear, and specific enough that a developer reviewing git history would immediately understand that this PR introduces pacemaker health check functionality for external etcd clusters. The title avoids vague terms and file lists while accurately conveying the core objective.
  • Description Check (✅ Passed): The pull request description is clearly related to the changeset and provides meaningful detail about each component. The description accurately references the PacemakerHealthCheck controller (pkg/tnf/pacemaker/healthcheck.go), the pacemaker status collection via 'pcs status --full' (pkg/cmd/pacemakerstatuscollector/collector.go), the operator degraded-condition logic, event recording with specific time windows for warnings and fencing events, and test coverage. Each point in the description corresponds to actual implementations visible in the raw summary, demonstrating that the description is substantive and on-topic rather than generic or vague.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 17, 2025
@openshift-ci openshift-ci bot requested review from mshitrit and slintes September 17, 2025 21:39
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 17, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (7)
pkg/tnf/pacemaker/healthcheck.go (3)

71-75: Avoid event spam: emit “Healthy” only on transitions.

At 30s resync, always emitting a Healthy event will generate noisy duplicates. Track last status and only emit on transition to Healthy.

 type HealthCheck struct {
   operatorClient v1helpers.StaticPodOperatorClient
   kubeClient     kubernetes.Interface
   eventRecorder  events.Recorder
+  lastOverallStatus string
 }
@@
-  // Record a normal event for healthy status
-  if status.OverallStatus == StatusHealthy {
-    c.eventRecorder.Eventf(EventReasonHealthy, "Pacemaker cluster is healthy - all nodes online and critical resources started")
-  }
+  // Record a normal event for healthy status only on transition
+  if status.OverallStatus == StatusHealthy && c.lastOverallStatus != StatusHealthy {
+    c.eventRecorder.Eventf(EventReasonHealthy, "Pacemaker cluster is healthy - all nodes online and critical resources started")
+  }
+  c.lastOverallStatus = status.OverallStatus

Also applies to: 446-450


347-356: Micro: precompile timestamp regex in isWithinTimeWindow.

Avoid recompiling on each iteration.

 func (c *HealthCheck) isWithinTimeWindow(line string, cutoffTime time.Time) bool {
   timeFormats := []string{
     "Mon Jan 2 15:04:05 2006",
     "2006-01-02 15:04:05.000000Z",
     "2006-01-02 15:04:05Z",
   }
 
-  for _, format := range timeFormats {
-    re := regexp.MustCompile(`'([^']+)'`)
-    matches := re.FindStringSubmatch(line)
+  tsRe := regexp.MustCompile(`'([^']+)'`)
+  for _, format := range timeFormats {
+    matches := tsRe.FindStringSubmatch(line)
     if len(matches) > 1 {
       if t, err := time.Parse(format, matches[1]); err == nil {
         return t.After(cutoffTime)
       }
     }
   }

71-75: Unused dependency: kubeClient is stored but never used.

Drop it from the struct/constructor or put it to use (e.g., namespacing events, future lookups). Keeping it increases surface without value.

 type HealthCheck struct {
   operatorClient v1helpers.StaticPodOperatorClient
-  kubeClient     kubernetes.Interface
   eventRecorder  events.Recorder
   lastOverallStatus string
 }
@@
 func NewHealthCheck(
   livenessChecker *health.MultiAlivenessChecker,
   operatorClient v1helpers.StaticPodOperatorClient,
-  kubeClient kubernetes.Interface,
   eventRecorder events.Recorder,
 ) factory.Controller {
   c := &HealthCheck{
     operatorClient: operatorClient,
-    kubeClient:     kubeClient,
     eventRecorder:  eventRecorder,
   }
@@
-    operatorClient,
-    kubeClient,
+    operatorClient,
     eventRecorder,
 )

If this is intentionally reserved for future use, ignore and mark with a TODO comment to avoid churn.

Also applies to: 80-84

pkg/tnf/pacemaker/healthcheck_test.go (4)

106-137: Assert condition transitions, not just error absence.

Validate that degraded/healthy actually update the operator condition in the fake client.

   err := controller.updateOperatorAvailability(ctx, status)
   require.NoError(t, err, "updateOperatorAvailability should not return an error")
+  // Verify degraded condition set
+  _, st, _, _ := operatorClient.GetStaticPodOperatorState()
+  require.Condition(t, func() (success bool) {
+    for _, c := range st.Conditions {
+      if c.Type == ConditionTypeDegraded && c.Status == operatorv1.ConditionTrue {
+        return true
+      }
+    }
+    return false
+  }, "expected degraded condition to be true")
@@
   err = controller.updateOperatorAvailability(ctx, status)
   require.NoError(t, err, "updateOperatorAvailability should not return an error")
+  // Verify degraded condition cleared
+  _, st, _, _ = operatorClient.GetStaticPodOperatorState()
+  require.Condition(t, func() (success bool) {
+    for _, c := range st.Conditions {
+      if c.Type == ConditionTypeDegraded && c.Status == operatorv1.ConditionFalse {
+        return true
+      }
+    }
+    return false
+  }, "expected degraded condition to be false")

163-184: Validate emitted events.

Use the in-memory recorder to assert reasons were produced (failed action, fencing, generic warning, error, healthy transition).

   controller.recordHealthCheckEvents(status)
-  
+  // Collect and assert events for warnings and error
+  evts := eventRecorder.Events()
+  require.NotEmpty(t, evts)
+  // Sanity: at least one of each warning type and an error
+  require.Contains(t, reasons(evts), EventReasonFailedAction)
+  require.Contains(t, reasons(evts), EventReasonFencingEvent)
+  require.Contains(t, reasons(evts), EventReasonWarning)
+  require.Contains(t, reasons(evts), EventReasonError)
@@
   status = &HealthStatus{
     OverallStatus: StatusHealthy,
     Warnings:      []string{},
     Errors:        []string{},
   }
 
   controller.recordHealthCheckEvents(status)
+  // After transition to healthy, expect a healthy event
+  evts = eventRecorder.Events()
+  require.Contains(t, reasons(evts), EventReasonHealthy)
+
+}
+
+// helper
+func reasons(evts []events.Event) []string {
+  out := make([]string, 0, len(evts))
+  for _, e := range evts {
+    out = append(out, e.Reason)
+  }
+  return out
 }

261-291: Add a test for “standby” node to ensure it’s a Warning, not Degraded.

Guards the updated parsing logic and aligns with the stated degradation policy.

   // Test with offline node
@@
   require.NotEmpty(t, status.Errors, "Should have errors for offline nodes")
+
+  // Test with standby node -> warning, but still considered online
+  lines = []string{
+    SectionNodeList,
+    "  * Node master-0 (1): online, feature set 3.19.6",
+    "  * Node master-1 (2): standby, feature set 3.19.6",
+    SectionFullListResources,
+  }
+  status = &HealthStatus{Warnings: []string{}, Errors: []string{}}
+  allOnline = controller.parseNodeStatus(lines, status)
+  require.True(t, allOnline, "Standby should not be treated as offline")
+  require.NotEmpty(t, status.Warnings, "Standby should surface as a warning")

296-313: Cover dynamic node-count logic.

Add a case where only one node has resources started while two nodes are listed; expect an error message reflecting 1/2.

   // Test with all resources started
@@
   require.Empty(t, status.Errors, "Should have no errors for started resources")
+
+  // Test with partial placement (only one node started)
+  lines = []string{
+    SectionNodeList,
+    "  * Node master-0 (1): online, feature set 3.19.6",
+    "  * Node master-1 (2): online, feature set 3.19.6",
+    SectionFullListResources,
+    "  * Clone Set: kubelet-clone [kubelet]:",
+    "    * kubelet   (systemd:kubelet):   Started master-0",
+    "  * Clone Set: etcd-clone [etcd]:",
+    "    * etcd (ocf:heartbeat:podman-etcd):     Started master-0",
+    SectionNodeAttributes,
+  }
+  status = &HealthStatus{Warnings: []string{}, Errors: []string{}}
+  allStarted = controller.parseResourceStatus(lines, status)
+  require.False(t, allStarted, "Resources not started on all nodes should be false")
+  require.NotEmpty(t, status.Errors)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 61b7fb5 and c077a2b.

📒 Files selected for processing (2)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)

Comment on lines 214 to 218
if c.isNodeStatusLine(line) && !strings.Contains(line, NodeStatusOnline) {
allNodesOnline = false
nodeName := c.extractNodeName(line)
status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", nodeName))
}

⚠️ Potential issue

Don’t treat “not online” as “offline”; handle standby distinctly.

Current check flags any non-“online” node as offline (e.g., “standby”). That contradicts the PR objective (degrade only when nodes are offline).

-      if c.isNodeStatusLine(line) && !strings.Contains(line, NodeStatusOnline) {
-        allNodesOnline = false
-        nodeName := c.extractNodeName(line)
-        status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", nodeName))
-      }
+      if c.isNodeStatusLine(line) {
+        lower := strings.ToLower(line)
+        if strings.Contains(lower, "offline") {
+          allNodesOnline = false
+          nodeName := c.extractNodeName(line)
+          status.Errors = append(status.Errors, fmt.Sprintf("Node %s is offline", nodeName))
+        } else if strings.Contains(lower, "standby") {
+          nodeName := c.extractNodeName(line)
+          status.Warnings = append(status.Warnings, fmt.Sprintf("Node %s is in standby", nodeName))
+        }
+      }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if c.isNodeStatusLine(line) && !strings.Contains(line, NodeStatusOnline) {
allNodesOnline = false
nodeName := c.extractNodeName(line)
status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", nodeName))
}
if c.isNodeStatusLine(line) {
lower := strings.ToLower(line)
if strings.Contains(lower, "offline") {
allNodesOnline = false
nodeName := c.extractNodeName(line)
status.Errors = append(status.Errors, fmt.Sprintf("Node %s is offline", nodeName))
} else if strings.Contains(lower, "standby") {
nodeName := c.extractNodeName(line)
status.Warnings = append(status.Warnings, fmt.Sprintf("Node %s is in standby", nodeName))
}
}
🤖 Prompt for AI Agents
In pkg/tnf/pacemaker/healthcheck.go around lines 214-218, the code currently
treats any node state that is not NodeStatusOnline as an offline error; change
the logic to only mark nodes as offline when their status equals the explicit
offline constant (e.g., NodeStatusOffline) and ignore legitimate non-online
states like NodeStatusStandby. Concretely, replace the current negation check
with an explicit equality check for offline (or a compound condition that skips
Standby), set allNodesOnline = false and append an error only when the status is
offline, and avoid adding errors for standby nodes.

Comment on lines 226 to 264
func (c *HealthCheck) parseResourceStatus(lines []string, status *HealthStatus) bool {
resourcesStarted := make(map[string]map[string]bool)
resourcesStarted[ResourceKubelet] = make(map[string]bool)
resourcesStarted[ResourceEtcd] = make(map[string]bool)

inResources := false

for _, line := range lines {
line = strings.TrimSpace(line)

if strings.Contains(line, SectionFullListResources) {
inResources = true
continue
}

if inResources {
if strings.Contains(line, SectionNodeAttributes) {
break
}

c.checkResourceLine(line, resourcesStarted)
}
}

// Check if we have both resources started on all nodes
kubeletCount := len(resourcesStarted[ResourceKubelet])
etcdCount := len(resourcesStarted[ResourceEtcd])

if kubeletCount < ExpectedNodeCount {
status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceKubelet, kubeletCount, ExpectedNodeCount))
}

if etcdCount < ExpectedNodeCount {
status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceEtcd, etcdCount, ExpectedNodeCount))
}

return kubeletCount >= ExpectedNodeCount && etcdCount >= ExpectedNodeCount
}


⚠️ Potential issue

Avoid hard-coded node count (2); derive from the Node List.

Clusters may not always be 2 nodes. Parse the Node List to compute expected count and use it when validating resource placement.

-func (c *HealthCheck) parseResourceStatus(lines []string, status *HealthStatus) bool {
+func (c *HealthCheck) parseResourceStatus(lines []string, status *HealthStatus) bool {
   resourcesStarted := make(map[string]map[string]bool)
   resourcesStarted[ResourceKubelet] = make(map[string]bool)
   resourcesStarted[ResourceEtcd] = make(map[string]bool)
@@
-  // Check if we have both resources started on all nodes
-  kubeletCount := len(resourcesStarted[ResourceKubelet])
-  etcdCount := len(resourcesStarted[ResourceEtcd])
+  // Determine expected node count from the Node List; fall back to constant if not found
+  expectedCount := c.detectNodeCount(lines)
+  if expectedCount == 0 {
+    expectedCount = ExpectedNodeCount
+  }
+  kubeletCount := len(resourcesStarted[ResourceKubelet])
+  etcdCount := len(resourcesStarted[ResourceEtcd])
 
-  if kubeletCount < ExpectedNodeCount {
-    status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceKubelet, kubeletCount, ExpectedNodeCount))
+  if kubeletCount < expectedCount {
+    status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceKubelet, kubeletCount, expectedCount))
   }
 
-  if etcdCount < ExpectedNodeCount {
-    status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceEtcd, etcdCount, ExpectedNodeCount))
+  if etcdCount < expectedCount {
+    status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceEtcd, etcdCount, expectedCount))
   }
 
-  return kubeletCount >= ExpectedNodeCount && etcdCount >= ExpectedNodeCount
+  return kubeletCount >= expectedCount && etcdCount >= expectedCount
 }
+
+// detectNodeCount returns the number of nodes listed in the "Node List" section.
+func (c *HealthCheck) detectNodeCount(lines []string) int {
+  inNodeList := false
+  count := 0
+  for _, line := range lines {
+    line = strings.TrimSpace(line)
+    if strings.Contains(line, SectionNodeList) {
+      inNodeList = true
+      continue
+    }
+    if inNodeList {
+      if strings.Contains(line, SectionFullListResources) {
+        break
+      }
+      if c.isNodeStatusLine(line) {
+        count++
+      }
+    }
+  }
+  return count
+}

Committable suggestion skipped: line range outside the PR's diff.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from c077a2b to 81984c7 Compare September 18, 2025 14:00
@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from 81984c7 to 26d6df8 Compare October 10, 2025 13:50

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
pkg/tnf/pacemaker/healthcheck.go (1)

502-510: Degrade on StatusError as well.

If we fail to collect status (OverallStatus == StatusError), the operator condition remains untouched, so outages never surface. Treat StatusError the same as StatusDegraded when updating conditions.

Suggested adjustment:

-	if status.OverallStatus == StatusDegraded {
-		return c.setDegradedCondition(ctx, status)
-	} else if status.OverallStatus == StatusHealthy {
+	switch status.OverallStatus {
+	case StatusDegraded, StatusError:
+		return c.setDegradedCondition(ctx, status)
+	case StatusHealthy:
 		return c.clearDegradedCondition(ctx)
 	}
-
-	// For Warning or Error status, we don't update the degraded condition
-	return nil
+	// Warning / Unknown: leave condition unchanged
+	return nil
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between c077a2b and 26d6df8.

📒 Files selected for processing (5)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from 26d6df8 to e42e5dc Compare October 10, 2025 15:26
@jaypoulz jaypoulz changed the title WIP: OCPEDGE-2084: Add pacemaker health check for ExternalEtcd clusters OCPEDGE-2084: Add pacemaker health check for ExternalEtcd clusters Oct 10, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 10, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (3)
pkg/tnf/pacemaker/healthcheck.go (3)

3-9: Prevent hangs; add exec timeout and non-interactive sudo (-n). Also introduce StatusError for collection failures.

pcs can block (sudo prompt/hung pcs). Use a short timeout context and sudo -n. While here, add StatusError and treat exec failures accordingly.

Apply:

--- a/pkg/tnf/pacemaker/healthcheck.go
+++ b/pkg/tnf/pacemaker/healthcheck.go
@@
 import (
+	"errors"
 	"context"
 	"encoding/xml"
 	"fmt"
 	"strings"
 	"time"
@@
 const (
   // Time windows for detecting recent events
   FailedActionTimeWindow = 5 * time.Minute
   FencingEventTimeWindow = 24 * time.Hour
+
+  // Upper bound for executing pcs commands
+  ExecTimeout = 10 * time.Second
@@
   // Status strings
   StatusHealthy  = "Healthy"
   StatusWarning  = "Warning"
   StatusDegraded = "Degraded"
   StatusUnknown  = "Unknown"
+  StatusError    = "Error"
@@
   // Pacemaker commands
-  PcsStatusXMLCommand = "sudo pcs status xml"
+  PcsStatusXMLCommand = "sudo -n pcs status xml"
@@
-  // Execute the pcs status xml command
-  stdout, stderr, err := exec.Execute(ctx, PcsStatusXMLCommand)
+  // Execute the pcs status xml command with timeout and non-interactive sudo
+  ctxExec, cancel := context.WithTimeout(ctx, ExecTimeout)
+  defer cancel()
+  stdout, stderr, err := exec.Execute(ctxExec, PcsStatusXMLCommand)
   if err != nil {
-    return newErrorHealthStatus(fmt.Sprintf("Failed to execute pcs status xml command: %v", err)), nil
+    if errors.Is(err, context.DeadlineExceeded) {
+      err = fmt.Errorf("pcs status xml command timed out: %w", err)
+    }
+    return newErrorHealthStatus(fmt.Sprintf("Failed to execute pcs status xml command: %v", err)), nil
   }

Also applies to: 21-84, 336-345


379-383: Set OverallStatus when pacemaker isn’t running (avoid returning Unknown).

Without setting OverallStatus here, outages can be masked and the degraded condition won’t be set.

   if result.Summary.Stack.PacemakerdState != PacemakerStateRunning {
     status.Errors = append(status.Errors, "Pacemaker is not running")
-    return status, nil
+    status.OverallStatus = StatusDegraded
+    return status, nil
   }

568-576: Degrade on StatusError as well (not only Degraded).

Collection failures should surface as degraded to operators/users.

 switch status.OverallStatus {
- case StatusDegraded:
+ case StatusDegraded, StatusError:
   return c.setDegradedCondition(ctx, status)
 case StatusHealthy:
   return c.clearDegradedCondition(ctx)
 default:
   // For Warning or Unknown status, we don't update the degraded condition
   return nil
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 26d6df8 and e42e5dc.

📒 Files selected for processing (5)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/tnf/pacemaker/testdata/offline_node.xml
  • pkg/tnf/pacemaker/healthcheck_test.go
🔇 Additional comments (2)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)

1-88: Fixture looks consistent with parser expectations.

IDs/roles/attributes align with parseResources/parseNodes and time formats for histories. LGTM.

pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1)

1-66: Recent-failures fixture is well-formed and matches parser formats.

Timestamps match PacemakerTimeFormat/PacemakerFenceTimeFormat; placeholders are suitable for templating. LGTM.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from e42e5dc to 3cab3d9 Compare October 10, 2025 16:06

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (3)
pkg/tnf/pacemaker/healthcheck.go (3)

386-390: Set OverallStatus to StatusDegraded before returning.

When Pacemaker isn't running, the code appends an error but returns with OverallStatus still set to StatusUnknown (from initialization at line 370). This prevents the degraded condition from being set downstream, masking the outage.

Apply this diff:

 	// Check if pacemaker is running
 	if result.Summary.Stack.PacemakerdState != PacemakerStateRunning {
 		status.Errors = append(status.Errors, "Pacemaker is not running")
+		status.OverallStatus = StatusDegraded
 		return status, nil
 	}

567-575: Handle StatusError in the switch statement.

If StatusError is added (per earlier comment), the updateOperatorStatus switch should treat it like StatusDegraded and set the degraded condition.

Apply this diff:

 	// Update operator conditions based on pacemaker status
 	switch status.OverallStatus {
-	case StatusDegraded:
+	case StatusDegraded, StatusError:
 		return c.setDegradedCondition(ctx, status)
 	case StatusHealthy:
 		return c.clearDegradedCondition(ctx)
 	default:
-		// For Warning or Unknown status, we don't update the degraded condition
+		// For Warning or Unknown status, we don't update the degraded condition
 		return nil
 	}

267-274: Rename to newErrorHealthStatus and use StatusError instead of StatusUnknown.

Collection failures (exec timeout, XML parse errors, etc.) should be marked as StatusError rather than StatusUnknown so they're surfaced as actionable failures in the operator condition. This aligns with previous feedback.

Apply this diff:

-// newUnknownHealthStatus creates a HealthStatus with Unknown status and an error message
-func newUnknownHealthStatus(errMsg string) *HealthStatus {
+// newErrorHealthStatus creates a HealthStatus with Error status and an error message
+func newErrorHealthStatus(errMsg string) *HealthStatus {
 	return &HealthStatus{
-		OverallStatus: StatusUnknown,
+		OverallStatus: StatusError,
 		Warnings:      []string{},
 		Errors:        []string{errMsg},
 	}
 }

Then update all call sites (lines 343, 346, 355, 361) to use newErrorHealthStatus instead of newUnknownHealthStatus.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between e42e5dc and 3cab3d9.

📒 Files selected for processing (5)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/tnf/pacemaker/testdata/offline_node.xml
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml
🔇 Additional comments (7)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)

1-88: LGTM! Test fixture is well-formed.

The XML fixture represents a valid healthy Pacemaker cluster state with:

  • Both nodes online (online="true" on lines 12-13)
  • All resources started and active
  • No failed operations in history
  • Empty fence history

This provides good test coverage for the happy path.

pkg/tnf/pacemaker/healthcheck_test.go (6)

102-113: Update test expectations when StatusError replaces StatusUnknown.

This test expects StatusUnknown for exec failures (line 111), which is correct for the current implementation. However, once newUnknownHealthStatus is renamed to newErrorHealthStatus and returns StatusError (per the comment on healthcheck.go lines 267-274), update this assertion accordingly.

After applying the main implementation changes, update line 111:

 	require.NotNil(t, status, "HealthStatus should not be nil")
-	require.Equal(t, StatusUnknown, status.OverallStatus, "Status should be Unknown when exec fails")
+	require.Equal(t, StatusError, status.OverallStatus, "Status should be Error when exec fails")
 	require.NotEmpty(t, status.Errors, "Should have error messages when exec fails")

301-329: Update TestHealthCheck_determineOverallStatus for StatusError.

Line 306 tests newUnknownHealthStatus and expects StatusUnknown (line 308). Once StatusError is introduced, add a dedicated test case and update the existing one.

After implementing StatusError, update the test:

 func TestHealthCheck_determineOverallStatus(t *testing.T) {
 	controller := &HealthCheck{}
 
-	// Test degraded status status
-	errMsg := "test error message"
-	status := newUnknownHealthStatus(errMsg)
-	result := controller.determineOverallStatus(status)
-	require.Equal(t, StatusUnknown, result)
+	// Test error status (collection failures)
+	errMsg := "test error message"
+	status := newErrorHealthStatus(errMsg)
+	result := controller.determineOverallStatus(status)
+	require.Equal(t, StatusError, result)

Also add a test case for StatusUnknown if it's still used for other scenarios, or remove it if no longer needed.


377-392: Update TestHealthCheck_validateResourceCount signature after refactor.

Once validateResourceCount is updated to accept expectedCount as a parameter (per the comment on healthcheck.go lines 480-487), update the test calls on lines 382 and 388.

Apply this diff after the main implementation is updated:

 	// Test with correct count
 	status := &HealthStatus{Warnings: []string{}, Errors: []string{}}
-	result := controller.validateResourceCount(ResourceEtcd, 2, status)
+	result := controller.validateResourceCount(ResourceEtcd, 2, 2, status)
 	require.True(t, result, "Should return true for correct count")
 	require.Empty(t, status.Errors, "Should have no errors for correct count")
 
 	// Test with insufficient count
 	status = &HealthStatus{Warnings: []string{}, Errors: []string{}}
-	result = controller.validateResourceCount(ResourceKubelet, 1, status)
+	result = controller.validateResourceCount(ResourceKubelet, 1, 2, status)
 	require.False(t, result, "Should return false for insufficient count")

248-266: Consider removing TestHealthCheck_collectNodeStatus_MismatchedCount after refactor.

This test validates the warning for node count mismatch against ExpectedNodeCount. Once the hard-coded constant is removed (per the comment on healthcheck.go lines 423-427), this specific warning logic will be deleted, making this test obsolete.

After the refactor, either remove this test or update it to test a different node-count-related scenario if applicable.


23-46: LGTM! Test helpers are well-structured.

The createTestHealthCheck and NewHealthCheck test setup properly use fake clients and in-memory event recorders for isolated testing. Good practice for controller tests.

Also applies to: 72-100


176-230: LGTM! XML parsing tests provide good coverage.

The tests for parsePacemakerStatusXML cover:

  • Healthy cluster (lines 176-188)
  • Offline node scenario (lines 190-202)
  • Recent failures with dynamic timestamps (lines 204-230)

The use of getRecentFailuresXML with dynamic timestamps (lines 55-69) is particularly good for avoiding time-sensitive test flakes.
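For readers unfamiliar with the technique, a dynamic-timestamp fixture helper can be as simple as string substitution over a template. The placeholder token, format string, and helper name below are hypothetical; the real getRecentFailuresXML helper and PacemakerTimeFormat constant may differ.

// Hypothetical sketch: stamp a fixture template with a timestamp inside the
// 5-minute failure window so recency checks in tests always pass.
package main

import (
	"fmt"
	"strings"
	"time"
)

const pacemakerTimeFormat = "Mon Jan 2 15:04:05 2006" // illustrative format

func recentFailuresXML(template string) string {
	recent := time.Now().Add(-1 * time.Minute).Format(pacemakerTimeFormat)
	return strings.ReplaceAll(template, "{{FAILURE_TIME}}", recent)
}

func main() {
	tpl := `<operation_history call="330" task="monitor" rc="1" rc_text="error" last-rc-change="{{FAILURE_TIME}}"/>`
	fmt.Println(recentFailuresXML(tpl))
}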

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from 3cab3d9 to c3fb43e Compare October 10, 2025 16:20

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (8)
pkg/tnf/pacemaker/healthcheck.go (8)

267-274: Duplicate: Mark collection failures as StatusError.

Per past review, this function should return StatusError instead of StatusUnknown to properly distinguish collection failures from unknown states. Collection failures are actionable and should be surfaced as errors.

Apply this diff:

 func newUnknownHealthStatus(errMsg string) *HealthStatus {
   return &HealthStatus{
-    OverallStatus: StatusUnknown,
+    OverallStatus: StatusError,
     Warnings:      []string{},
     Errors:        []string{errMsg},
   }
 }

Note: This requires adding the StatusError constant to the status constants section (lines 34-38).


34-38: Add StatusError constant for collection failures.

To properly handle collection/communication failures distinct from operational issues, add a StatusError constant alongside the existing status constants.

Apply this diff:

   // Status strings
   StatusHealthy  = "Healthy"
   StatusWarning  = "Warning"
   StatusDegraded = "Degraded"
+  StatusError    = "Error"
   StatusUnknown  = "Unknown"

429-433: Distinguish offline from standby nodes.

The current check treats any non-online node as an error, including legitimate standby states. Per the PR objectives, only offline nodes should trigger degradation.

Apply this diff:

   for _, node := range result.Nodes.Node {
-    if node.Online != NodeStatusOnline {
-      status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", node.Name))
+    if node.Online == "false" {
+      status.Errors = append(status.Errors, fmt.Sprintf("Node %s is offline", node.Name))
+    } else if node.Standby == BooleanValueTrue {
+      status.Warnings = append(status.Warnings, fmt.Sprintf("Node %s is in standby", node.Name))
     }
   }

387-390: Set OverallStatus when pacemaker isn't running.

When pacemaker is down, the function returns with OverallStatus still set to Unknown. This prevents the degraded condition from being set, masking the outage.

Apply this diff:

   if result.Summary.Stack.PacemakerdState != PacemakerStateRunning {
     status.Errors = append(status.Errors, "Pacemaker is not running")
+    status.OverallStatus = StatusDegraded
     return status, nil
   }

497-509: Fix failure detection logic: use OR instead of AND.

The condition requires both RC and RCText to indicate failure, which misses cases where only one field is non-success. Use OR logic to catch all failure cases.

Apply this diff:

         for _, operation := range resourceHistory.OperationHistory {
           // Check for failed operations (rc != "0")
-          if operation.RC != OperationRCSuccess && operation.RCText != OperationRCTextSuccess {
+          if operation.RC != OperationRCSuccess || operation.RCText != OperationRCTextSuccess {
             // Parse the timestamp
             t, err := time.Parse(PacemakerTimeFormat, operation.LastRCChange)

567-575: Handle StatusError in operator status update.

When status collection fails (StatusError), the operator should surface a degraded condition to avoid masking outages. Currently only StatusDegraded triggers the degraded condition.

Apply this diff:

   // Update operator conditions based on pacemaker status
   switch status.OverallStatus {
-  case StatusDegraded:
+  case StatusDegraded, StatusError:
     return c.setDegradedCondition(ctx, status)
   case StatusHealthy:
     return c.clearDegradedCondition(ctx)
   default:
-    // For Warning or Unknown status, we don't update the degraded condition
+    // For Warning or Unknown status, we don't update the degraded condition
     return nil
   }

469-487: Derive expected node count dynamically from parsed XML.

Replace the hard-coded ExpectedNodeCount constant with a dynamic count derived from the parsed node list. This makes the validation flexible for clusters of varying sizes.

Apply this diff:

   // Check if we have both resources started on all nodes
+  expectedCount := len(result.Nodes.Node)
   kubeletCount := len(resourcesStarted[ResourceKubelet])
   etcdCount := len(resourcesStarted[ResourceEtcd])
 
-  kubeletOK := c.validateResourceCount(ResourceKubelet, kubeletCount, status)
-  etcdOK := c.validateResourceCount(ResourceEtcd, etcdCount, status)
+  kubeletOK := c.validateResourceCount(ResourceKubelet, kubeletCount, expectedCount, status)
+  etcdOK := c.validateResourceCount(ResourceEtcd, etcdCount, expectedCount, status)
 
   return kubeletOK && etcdOK
 }
 
 // validateResourceCount validates that a resource is started on the expected number of nodes
-func (c *HealthCheck) validateResourceCount(resourceName string, actualCount int, status *HealthStatus) bool {
-  if actualCount < ExpectedNodeCount {
+func (c *HealthCheck) validateResourceCount(resourceName string, actualCount, expectedCount int, status *HealthStatus) bool {
+  if actualCount < expectedCount {
     status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)",
-      resourceName, actualCount, ExpectedNodeCount))
+      resourceName, actualCount, expectedCount))
     return false
   }
   return true

423-427: Remove hard-coded node count check.

The hard-coded ExpectedNodeCount = 2 check is inflexible for clusters that may scale or undergo node replacement. Since the subsequent validation already uses the actual node count from the XML, this strict equality check can be removed.

Apply this diff:

-  // Check if we have the expected number of nodes.
-  // This should almost always be 2, but 1 node is possible during a control-plane node replacement event.
-  if len(result.Nodes.Node) != ExpectedNodeCount {
-    status.Warnings = append(status.Warnings, fmt.Sprintf("Expected %d nodes, found %d", ExpectedNodeCount, len(result.Nodes.Node)))
-  }
-
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 3cab3d9 and c3fb43e.

📒 Files selected for processing (5)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🔇 Additional comments (4)
pkg/tnf/pacemaker/testdata/offline_node.xml (1)

1-67: LGTM!

This XML test fixture correctly represents an offline node scenario with master-1 offline (line 13) and a corresponding fencing event (line 64). The structure is valid and appropriate for testing the health check parser's handling of degraded cluster states.

pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)

1-88: LGTM!

This XML test fixture correctly represents a healthy cluster with both nodes online and all resources started. The structure is valid and provides appropriate test coverage for the healthy path scenario.

pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1)

1-66: LGTM!

This template correctly uses placeholders for dynamic timestamp generation (lines 56, 63), allowing tests to verify time-based filtering of recent failures and fencing events. The template approach is appropriate for testing scenarios that depend on recency.

pkg/tnf/pacemaker/healthcheck_test.go (1)

1-440: LGTM!

This comprehensive test suite provides excellent coverage for the Pacemaker health check component:

  • Thorough testing of XML parsing with multiple scenarios (healthy, offline, recent failures)
  • Good edge case coverage (no nodes, mismatched counts, large/invalid XML)
  • Effective use of test helpers and fixtures
  • Appropriate use of fake clients for isolation

The test structure is clean and maintainable.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from c3fb43e to f87437c Compare October 10, 2025 17:01

openshift-ci-robot commented Oct 10, 2025

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Change List

  • Added a PacemakerHealthCheck controller which monitors pacemaker cluster status every 30 seconds by executing 'pcs status --full' and parsing node/resource health
  • Updated operator condition to degraded when nodes are not online or critical resources (kubelet/etcd) are not started on all nodes
  • Added event recording for warnings of recent failed resource actions (5min window) and fencing events (24hr window) with appropriate event reasons
  • Included happy path testing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
pkg/tnf/pacemaker/healthcheck.go (1)

386-390: Set OverallStatus before returning when pacemaker is not running.

When pacemaker is not running, the function appends an error and returns immediately (line 389), but OverallStatus remains Unknown (initialized at line 370). This prevents the degraded condition from being set properly since updateOperatorStatus (line 567-575) only sets degraded for StatusError, not StatusUnknown.

Apply this diff:

 	if result.Summary.Stack.PacemakerdState != PacemakerStateRunning {
 		status.Errors = append(status.Errors, "Pacemaker is not running")
+		status.OverallStatus = c.determineOverallStatus(status)
 		return status, nil
 	}
🧹 Nitpick comments (1)
pkg/tnf/pacemaker/healthcheck_test.go (1)

102-113: Clarify test expectation: exec failure should not return an error.

The comment states "exec.Execute will fail due to permissions" but the test expects NoError. This appears correct (the function returns Unknown status without propagating the error), but the comment could be clearer about this design choice.

Consider updating the comment to:

-  // In test environment, exec.Execute will fail due to permissions
-  require.NoError(t, err, "getPacemakerStatus should not return an error")
+  // In test environment, exec.Execute will fail due to permissions.
+  // getPacemakerStatus handles this gracefully by returning Unknown status without propagating the error.
+  require.NoError(t, err, "getPacemakerStatus should not return an error even when exec fails")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between c3fb43e and f87437c.

📒 Files selected for processing (5)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml
🔇 Additional comments (10)
pkg/tnf/pacemaker/testdata/offline_node.xml (1)

1-67: LGTM! Well-structured test fixture for offline node scenario.

The XML fixture accurately represents an offline node scenario with appropriate fencing history. The structure is valid and will effectively support testing of offline node detection and fencing event handling.

pkg/tnf/pacemaker/healthcheck_test.go (2)

23-69: LGTM! Well-designed test helpers.

The test helpers are well-structured:

  • createTestHealthCheck: Properly initializes all dependencies with fakes
  • loadTestXML: Simple and effective for loading fixtures
  • getRecentFailuresXML: Clever approach using dynamic timestamps to ensure failures are within the 5-minute window

176-440: LGTM! Comprehensive test coverage.

The test suite provides excellent coverage:

  • XML parsing for healthy, offline, and recent-failures scenarios
  • Node status collection with various edge cases (empty nodes, mismatched counts)
  • Resource status validation
  • Overall status determination logic
  • Helper functions (newUnknownHealthStatus, recent failures, fencing events)
  • Edge cases (large XML, invalid XML, missing state)

The tests are well-organized, use clear assertions, and validate both happy paths and error conditions.

pkg/tnf/pacemaker/healthcheck.go (7)

334-365: LGTM! Robust command execution with proper safeguards.

The implementation correctly addresses potential hanging issues (a generic sketch of the pattern follows this list):

  • Uses context with timeout (10 seconds) to prevent indefinite blocking
  • Uses sudo -n for non-interactive execution
  • Explicitly checks for timeout errors
  • Validates XML size to prevent XML bombs
  • Returns graceful Unknown status on errors
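A minimal sketch of those safeguards using only the standard library, assuming a plain os/exec wrapper rather than the PR's exec.Execute helper; the size-limit value is illustrative, while the 10-second timeout mirrors the one described above.

// Sketch: bounded, non-interactive execution of "pcs status xml".
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

const (
	execTimeout = 10 * time.Second
	maxXMLSize  = 1 << 20 // 1 MiB guard against oversized output (illustrative)
)

func runPcsStatusXML(ctx context.Context) (string, error) {
	ctxExec, cancel := context.WithTimeout(ctx, execTimeout)
	defer cancel()

	// sudo -n fails instead of prompting, so the caller never blocks on input.
	cmd := exec.CommandContext(ctxExec, "sudo", "-n", "pcs", "status", "xml")
	var stdout, stderr bytes.Buffer
	cmd.Stdout, cmd.Stderr = &stdout, &stderr

	if err := cmd.Run(); err != nil {
		if errors.Is(ctxExec.Err(), context.DeadlineExceeded) {
			return "", fmt.Errorf("pcs status xml timed out after %s: %w", execTimeout, err)
		}
		return "", fmt.Errorf("pcs status xml failed: %v (stderr: %s)", err, stderr.String())
	}
	if stdout.Len() > maxXMLSize {
		return "", fmt.Errorf("pcs status xml output too large: %d bytes", stdout.Len())
	}
	return stdout.String(), nil
}

func main() {
	out, err := runPcsStatusXML(context.Background())
	fmt.Println(len(out), err)
}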

23-85: LGTM! Well-organized constants with clear documentation.

The constants are well-structured and appropriately documented:

  • Time windows for event detection are reasonable (5 min for failures, 24 hours for fencing)
  • 30-second resync interval balances responsiveness with resource usage
  • Status constants, event reasons, and command strings are clearly defined
  • ExpectedNodeCount = 2 correctly reflects the Two-Node OpenShift design

416-434: Node online check aligns with Two-Node OpenShift design.

The check at line 430 treats any node where Online != "true" as an error, including standby nodes. Per the maintainer's clarification, this is the intended behavior for Two-Node OpenShift with Fencing, where standby nodes are considered offline for health purposes.

Based on learnings


497-509: LGTM! Correct failure detection using OR logic.

The failure detection properly uses OR logic (||) to catch failures when either RC or RCText indicates non-success. This correctly identifies cases where only one field shows failure (e.g., RC="1" but RCText="ok").


560-576: Consider whether StatusUnknown should also set degraded condition.

Currently, only StatusError sets the degraded condition. When the controller cannot determine cluster health (e.g., exec timeout, parse failure), it returns StatusUnknown, but this doesn't update the operator condition. This could mask operational issues.

Should StatusUnknown also set a degraded condition to surface collection failures? If the current behavior is intentional (avoiding false alarms during transient issues), consider adding a comment explaining this design choice.

Would you like the controller to set degraded on Unknown status, or is the current "wait and see" approach preferred?


480-487: Hard-coded ExpectedNodeCount correctly reflects architectural constraint.

The constant ExpectedNodeCount = 2 and its usage in validation correctly encode the Two-Node OpenShift with Fencing design. Per the maintainer's clarification, this is an architectural invariant, not a limitation to be made dynamic.

Based on learnings


627-654: LGTM! Well-categorized event recording.

Event recording properly categorizes warnings by type:

  • Failed resource actions → EventReasonFailedAction
  • Fencing events → EventReasonFencingEvent
  • General warnings → EventReasonWarning
  • Errors → EventReasonError
  • Healthy status → EventReasonHealthy

This categorization enables effective event filtering and alerting.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from f87437c to c07e5b1 Compare October 10, 2025 17:07

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
pkg/tnf/pacemaker/healthcheck.go (1)

267-274: Collection failures return Unknown; confirm policy.

Exec/parse failures set StatusUnknown, so no degraded condition is set. If outages should surface, consider marking as StatusError to degrade. If Unknown is intentional, ignore.

Also applies to: 341-347

🧹 Nitpick comments (6)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)

79-82: Remove duplicated operation_history entry.

Line 80 duplicates line 79 for the kubelet probe. Drop one to avoid confusion in tests and future parsing.

-        <operation_history call="330" task="probe" rc="0" rc_text="ok" last-rc-change="Thu Sep 18 12:40:35 2025" exec-time="4ms" queue-time="0ms"/>
pkg/tnf/pacemaker/healthcheck_test.go (1)

115-149: Optional: rename test for consistency with impl name.

The function under test is updateOperatorStatus; consider renaming the test to match for easier discovery.

-func TestHealthCheck_updateOperatorAvailability(t *testing.T) {
+func TestHealthCheck_updateOperatorStatus(t *testing.T) {
pkg/tnf/pacemaker/healthcheck.go (4)

338-341: Extract exec timeout into a constant.

Avoid magic number; define a reusable ExecTimeout alongside other constants and use it here.

@@
- // Execute the pcs status xml command with a timeout and non-interactive sudo
- ctxExec, cancel := context.WithTimeout(ctx, 10*time.Second)
+ // Execute the pcs status xml command with a timeout and non-interactive sudo
+ ctxExec, cancel := context.WithTimeout(ctx, ExecTimeout)

Add near other constants:

+ // Upper bound for "pcs status" execution
+ ExecTimeout = 10 * time.Second

430-434: If standby is considered “offline”, handle it explicitly.

Currently only Online != "true" is flagged. To align with the stated intent to treat standby as offline, also error on Standby == "true".

 for _, node := range result.Nodes.Node {
-    if node.Online != NodeStatusOnline {
-        status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", node.Name))
-    }
+    if node.Online != NodeStatusOnline {
+        status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", node.Name))
+    } else if node.Standby == BooleanValueTrue {
+        status.Errors = append(status.Errors, fmt.Sprintf("Node %s is in standby (treated as offline)", node.Name))
+    }
 }

449-453: Match resources by agent, not ID substring.

ID-based substring matching is brittle. Use ResourceAgent to detect kubelet/etcd reliably.

-            // Check if this is a kubelet or etcd resource
-            if strings.Contains(resource.ID, ResourceKubelet) {
-                resourcesStarted[ResourceKubelet][nodeName] = true
-            } else if strings.Contains(resource.ID, ResourceEtcd) {
-                resourcesStarted[ResourceEtcd][nodeName] = true
-            }
+            // Identify kubelet/etcd resources by their agents
+            switch {
+            case strings.HasPrefix(resource.ResourceAgent, "systemd:kubelet"):
+                resourcesStarted[ResourceKubelet][nodeName] = true
+            case strings.HasPrefix(resource.ResourceAgent, "ocf:heartbeat:podman-etcd"):
+                resourcesStarted[ResourceEtcd][nodeName] = true
+            }

629-643: Optional: gate repeated warning events to reduce noise.

At 30s resync, warnings within the 5m/24h windows will re-emit every cycle. Consider deduplication (e.g., last-emitted cache, once-per-window) to avoid event spam.
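
For illustration, a minimal sketch of such a last-emitted cache, using only the standard sync and time packages; the type and method names are made up for this example, not taken from the PR:

type eventGate struct {
	mu       sync.Mutex
	lastSent map[string]time.Time
}

// allow reports whether an event keyed by key may be emitted again, suppressing
// repeats that fall inside the given window.
func (g *eventGate) allow(key string, window time.Duration) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	now := time.Now()
	if last, ok := g.lastSent[key]; ok && now.Sub(last) < window {
		return false // emitted recently; skip to avoid event spam at the 30s resync
	}
	if g.lastSent == nil {
		g.lastSent = map[string]time.Time{}
	}
	g.lastSent[key] = now
	return true
}

Later revisions in this thread add a shouldRecordEvent helper with essentially this shape.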

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between f87437c and c07e5b1.

📒 Files selected for processing (5)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml
🔇 Additional comments (1)
pkg/tnf/pacemaker/healthcheck_test.go (1)

49-53: Use require.NoErrorf to format the failure message.

Current call won’t format %s; switch to the fmt-aware variant.

-   require.NoError(t, err, "Failed to read test XML file: %s", filename)
+   require.NoErrorf(t, err, "Failed to read test XML file: %s", filename)

Likely an incorrect or invalid review comment.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from c07e5b1 to e2add0e on October 10, 2025 at 18:48
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
pkg/tnf/pacemaker/healthcheck.go (1)

388-401: Optional: size guard exists only in caller, not in parser.

parsePacemakerStatusXML trusts input size; only getPacemakerStatus enforces maxXMLSize. If parse is used elsewhere, consider early-checking length or using an xml.Decoder with a bounded reader to keep invariants local to the parser. Low priority.
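
As one option, a sketch that keeps the invariant local to the parser; it assumes the package's existing maxXMLSize constant and PacemakerResult type and is not the PR's actual code:

// parseBoundedXML refuses oversized input before unmarshalling, so callers
// cannot bypass the limit enforced in getPacemakerStatus.
func parseBoundedXML(raw string) (*PacemakerResult, error) {
	if len(raw) > maxXMLSize {
		return nil, fmt.Errorf("refusing to parse %d bytes of XML (max %d)", len(raw), maxXMLSize)
	}
	var result PacemakerResult
	if err := xml.Unmarshal([]byte(raw), &result); err != nil {
		return nil, fmt.Errorf("failed to parse pacemaker status XML: %w", err)
	}
	return &result, nil
}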

pkg/tnf/pacemaker/healthcheck_test.go (3)

50-55: Use require.NoErrorf for formatted failure message.

Minor test ergonomics: this call passes a format string and arg; use the “f” variant.

-	data, err := os.ReadFile(path)
-	require.NoError(t, err, "Failed to read test XML file: %s", filename)
+	data, err := os.ReadFile(path)
+	require.NoErrorf(t, err, "Failed to read test XML file: %s", filename)

Based on learnings


642-659: Test name/comment mismatch with behavior.

The test name/comment suggest exercising “large XML” behavior, but the parser test simply validates invalid XML. Either update the comment/name to reflect invalid XML, or drive the size check via getPacemakerStatus (which enforces maxXMLSize). Optional.


484-498: Add a test to assert standby is treated as offline.

To lock in the intended behavior, add a case where node.Online="true" and Standby="true" and assert an error is recorded.

Example:

func TestHealthCheck_collectNodeStatus_StandbyTreatedAsOffline(t *testing.T) {
	controller := &HealthCheck{}
	result := &PacemakerResult{
		Nodes: Nodes{
			Node: []Node{
				{Name: "master-0", Online: booleanValueTrue, Standby: booleanValueTrue},
				{Name: "master-1", Online: booleanValueTrue, Standby: "false"},
			},
		},
	}
	status := &HealthStatus{Warnings: []string{}, Errors: []string{}}
	controller.collectNodeStatus(result, status)
	require.NotEmpty(t, status.Errors)
	require.Contains(t, strings.Join(status.Errors, " "), "standby")
}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between c07e5b1 and e2add0e.

📒 Files selected for processing (5)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from e2add0e to f3d22d3 on October 10, 2025 at 19:23
@clobrano clobrano (Contributor) left a comment

Overall, it seems good to me :)
I just left a few notes

Comment on lines 375 to 441
// Validate XML size to prevent XML bombs
if len(stdout) > maxXMLSize {
return newUnknownHealthStatus(fmt.Sprintf("XML output too large: %d bytes (max: %d bytes)", len(stdout), maxXMLSize)), nil
}
Contributor

Note for the future: if the XML bomb is a genuine risk, we might want to consider modifying the exec package to use io.LimitReader somehow. The current check prevents us from parsing it, which is good, but we still have the whole XML in memory at that point.
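
A rough sketch of bounding the output at the exec boundary, independent of this PR's exec helpers; the command, limit, and error text are illustrative only (imports: bytes, context, fmt, io, os/exec, time):

// runPcsBounded reads at most limit+1 bytes of command output; the extra byte
// tells us the output was truncated without buffering the rest in memory.
func runPcsBounded(ctx context.Context, limit int64) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	cmd := exec.CommandContext(ctx, "sudo", "-n", "pcs", "status", "xml")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return "", err
	}
	if err := cmd.Start(); err != nil {
		return "", err
	}

	var buf bytes.Buffer
	_, copyErr := io.Copy(&buf, io.LimitReader(stdout, limit+1))
	truncated := int64(buf.Len()) > limit
	if truncated {
		_, _ = io.Copy(io.Discard, stdout) // drain so the child can exit without blocking
	}
	waitErr := cmd.Wait()

	if copyErr != nil {
		return "", copyErr
	}
	if truncated {
		return "", fmt.Errorf("pcs output exceeded %d bytes", limit)
	}
	if waitErr != nil {
		return "", waitErr
	}
	return buf.String(), nil
}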

Comment on lines +349 to +402
// Record pacemaker health check events
c.recordHealthCheckEvents(healthStatus)
Contributor

Not sure how long recordHealthCheckEvents takes, but maybe we should pass the context here.

Comment on lines +524 to +591
if err != nil {
klog.V(4).Infof("Failed to parse timestamp '%s' for operation %s on node %s: %v",
operation.LastRCChange, operation.Task, node.Name, err)
continue
}
Contributor

Shouldn't we worry about letting these errors slip through?

Comment on lines +647 to +716
// Clean up old entries that are outside the longest deduplication window
maxWindow := eventDeduplicationWindowFencing
for key, timestamp := range c.recordedEvents {
if now.Sub(timestamp) > maxWindow {
delete(c.recordedEvents, key)
}
}
Contributor

This cleanup logic is currently nested inside the block that decides on item addition. Should we move it out? It feels like that cleanup step could be performed just once per recordHealthCheckEvents execution, not for every new item to add. Did I misunderstand it?
UPDATE: now that I read it a little more carefully, it seems to me that on each loop iteration in recordHealthCheckEvents we might delete events recorded in the previous iteration.

@clobrano clobrano (Contributor) left a comment

I had forgotten the tests :)

Some tests have similar boilerplate code (like the ones that parse XML). You might want to consider using test tables instead. Right now, though, I have mixed feelings, as I do like how you organized the tests. The only real benefit of merging them would be that if we need to fix an issue or add new test cases, it'd be much easier to do it in one place instead of several. So, I'll leave the decision up to you!
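
If it helps, a sketch of what a table-driven variant could look like for the node-status cases; it reuses the types and call shape already shown in this review, the literal "true"/"false" strings stand in for the package's boolean constants, and the exact error counts are an assumption:

func TestHealthCheck_collectNodeStatus_Table(t *testing.T) {
	controller := &HealthCheck{}
	cases := []struct {
		name       string
		secondNode Node
		wantErrors int
	}{
		{"both online", Node{Name: "master-1", Online: "true", Standby: "false"}, 0},
		{"second node offline", Node{Name: "master-1", Online: "false", Standby: "false"}, 1},
		{"second node standby", Node{Name: "master-1", Online: "true", Standby: "true"}, 1},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			result := &PacemakerResult{
				Nodes: Nodes{
					Node: []Node{
						{Name: "master-0", Online: "true", Standby: "false"},
						tc.secondNode,
					},
				},
			}
			status := &HealthStatus{Warnings: []string{}, Errors: []string{}}
			controller.collectNodeStatus(result, status)
			require.Len(t, status.Errors, tc.wantErrors)
		})
	}
}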

Comment on lines +57 to +140
func getRecentFailuresXML(t *testing.T) string {
// Load the template from testdata
templateXML := loadTestXML(t, "recent_failures_template.xml")

// Generate recent timestamps that are definitely within the 5-minute window
now := time.Now()
recentTime := now.Add(-1 * time.Minute).UTC().Format("Mon Jan 2 15:04:05 2006")
recentTimeISO := now.Add(-1 * time.Minute).UTC().Format("2006-01-02 15:04:05.000000Z")

// Replace the placeholders in the template
xmlContent := strings.ReplaceAll(templateXML, "{{RECENT_TIMESTAMP}}", recentTime)
xmlContent = strings.ReplaceAll(xmlContent, "{{RECENT_TIMESTAMP_ISO}}", recentTimeISO)

return xmlContent
}
Contributor

Cool! :)

@slintes slintes (Member) left a comment

pretty nice PR IMHO 👍

return c.updateOperatorCondition(ctx, condition)
}

func (c *HealthCheck) clearDegradedCondition(ctx context.Context) error {
Member

would it make sense to check the current degraded condition for its reason, to make sure we don't clear a degraded condition which wasn't set by this healthcheck? Not sure what best practice is in such situations...
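
One possible shape, sketched with the library-go condition helper; the reason string is an assumption rather than the PR's actual value, and whether this is needed depends on whether the PacemakerHealthCheckDegraded type stays private to this controller:

// clearDegradedConditionIfOwned only clears the condition when it carries the
// reason this controller sets, leaving conditions set by others untouched.
func (c *HealthCheck) clearDegradedConditionIfOwned(ctx context.Context, status *operatorv1.OperatorStatus) error {
	cond := v1helpers.FindOperatorCondition(status.Conditions, "PacemakerHealthCheckDegraded")
	if cond == nil || cond.Status != operatorv1.ConditionTrue {
		return nil // nothing to clear
	}
	if cond.Reason != "PacemakerClusterUnhealthy" { // assumed reason name
		return nil // set by someone else; leave it alone
	}
	return c.clearDegradedCondition(ctx)
}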

pacemakerFenceTimeFormat = "2006-01-02 15:04:05.000000Z" // Format for fence event timestamps

// Pacemaker commands
pcsStatusXMLCommand = "sudo -n pcs status xml"
Member

Huh wait... did you already try running this in cluster?
IIRC one of the reasons for using jobs for pcs commands was about permissions... 🤔

Contributor Author

Ah interesting. I'll do some testing today and get back to you

Contributor Author

So it doesn't work, in an interesting way.

[ec2-user@aws-jpoulin-dev ~]$ oc logs pod/etcd-operator-554d7f6899-5wrp2 | grep health.go
E1015 15:16:40.477459       1 health.go:120] health check for member (master-0) failed: err(context deadline exceeded)
E1015 15:16:45.703453       1 health.go:120] health check for member (master-0) failed: err(context deadline exceeded)
E1015 15:19:04.918498       1 health.go:120] health check for member (master-1) failed: err(context deadline exceeded)

Why does it stop after 3 tries, I wonder :P
I'll try to update via cron with privileges and see if I can get something working

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from f3d22d3 to 6176029 on October 15, 2025 at 18:13
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
pkg/tnf/pacemaker/healthcheck.go (3)

524-528: Timestamp parsing errors are logged but not surfaced.

When time.Parse fails, the error is logged at V(4) level and the operation is skipped with continue. If all timestamps consistently fail to parse (e.g., due to Pacemaker version change in timestamp format), recent failures would be silently ignored.

Consider accumulating a count of parse errors and emitting a warning if a threshold is exceeded, to detect systematic parsing issues.
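
For instance, a small helper along these lines could surface systematic failures as one aggregate warning per sync; the layout handling and wording are illustrative:

// warnOnParseFailures counts timestamps that fail to parse and adds a single
// warning instead of only logging each failure at V(4).
func warnOnParseFailures(timestamps []string, layout string, status *HealthStatus) {
	failed := 0
	for _, ts := range timestamps {
		if _, err := time.Parse(layout, ts); err != nil {
			failed++
		}
	}
	if failed > 0 {
		status.Warnings = append(status.Warnings,
			fmt.Sprintf("failed to parse %d operation timestamps; recent-failure detection may be incomplete", failed))
	}
}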


640-665: Move cleanup logic outside the event-checking loop for efficiency.

The cleanup of old deduplication entries (lines 647-653) currently runs on every call to shouldRecordEvent, which means it executes once per warning/error. This is inefficient when processing multiple events.

Based on learnings

Apply this refactor to clean up once per sync:

 // recordHealthCheckEvents records events for pacemaker warnings and fencing history
 func (c *HealthCheck) recordHealthCheckEvents(status *HealthStatus) {
+	// Clean up old entries once at the start
+	c.cleanupOldRecordedEvents()
+
 	// Record events for warnings with appropriate deduplication window
 	for _, warning := range status.Warnings {
 		eventKey := fmt.Sprintf("warning:%s", warning)
@@ -696,6 +700,19 @@
 	}
 }
 
+// cleanupOldRecordedEvents removes entries older than the longest deduplication window
+func (c *HealthCheck) cleanupOldRecordedEvents() {
+	c.recordedEventsMu.Lock()
+	defer c.recordedEventsMu.Unlock()
+	
+	now := time.Now()
+	maxWindow := eventDeduplicationWindowFencing
+	for key, timestamp := range c.recordedEvents {
+		if now.Sub(timestamp) > maxWindow {
+			delete(c.recordedEvents, key)
+		}
+	}
+}
+
 // shouldRecordEvent checks if an event should be recorded based on deduplication logic
 func (c *HealthCheck) shouldRecordEvent(eventKey string, deduplicationWindow time.Duration) bool {
 	c.recordedEventsMu.Lock()
 	defer c.recordedEventsMu.Unlock()
 
 	now := time.Now()
-
-	// Clean up old entries that are outside the longest deduplication window
-	maxWindow := eventDeduplicationWindowFencing
-	for key, timestamp := range c.recordedEvents {
-		if now.Sub(timestamp) > maxWindow {
-			delete(c.recordedEvents, key)
-		}
-	}
 
 	// Check if this event was recently recorded
 	if lastRecorded, exists := c.recordedEvents[eventKey]; exists {

619-626: Verify degraded condition ownership before clearing.

The clearDegradedCondition function clears any degraded condition of type PacemakerHealthCheckDegraded without verifying it was set by this controller. If multiple components manage similar conditions, this could inadvertently clear a condition set by another component.

Consider checking the condition's reason or message to confirm it was set by this health check before clearing it.

Based on learnings

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between f3d22d3 and 6176029.

📒 Files selected for processing (6)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/standby_node.xml (1 hunks)
🔇 Additional comments (13)
pkg/tnf/pacemaker/testdata/offline_node.xml (1)

1-67: LGTM! Test fixture correctly represents offline node scenario.

This XML fixture appropriately models a Pacemaker cluster with an offline node (master-1 with online="false"), which is used to test the health-check parsing logic for degraded cluster states.

pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)

1-87: LGTM! Test fixture correctly represents healthy cluster scenario.

This XML fixture appropriately models a healthy Pacemaker cluster with both nodes online, all critical resources started, and no recent fencing events. It serves as a good baseline for testing the happy path.

pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1)

1-66: LGTM! Template approach is appropriate for time-sensitive testing.

This template with placeholder timestamps ({{RECENT_TIMESTAMP}} and {{RECENT_TIMESTAMP_ISO}}) is correctly used by the getRecentFailuresXML test helper to dynamically generate XML with recent timestamps, enabling proper testing of the time-window-based failure detection logic.

pkg/tnf/pacemaker/testdata/standby_node.xml (2)

57-58: Verify duplicate operation_history entries are intentional.

Lines 57-58 contain identical operation_history entries for the kubelet probe operation (same call, task, timestamps). If this duplication is intentional to test duplicate handling in the parser, consider adding a comment explaining this. Otherwise, one entry should be removed.


1-66: Test fixture correctly represents standby node scenario.

This XML fixture appropriately models a Pacemaker cluster with a standby node (master-1 with standby="true"), which is used to test the health-check behavior when nodes are in standby state.

pkg/tnf/pacemaker/healthcheck_test.go (3)

23-71: LGTM! Well-structured test helpers with good practices.

The helper functions appropriately:

  • Create test fixtures with fake clients and in-memory recorders
  • Load static XML test data from files
  • Generate dynamic timestamps for time-sensitive tests using template replacement

The getRecentFailuresXML helper's approach of using 1-minute-old timestamps ensures tests are consistently within the 5-minute failure window.


179-245: Excellent event deduplication test coverage.

This test comprehensively validates the deduplication logic:

  • Verifies duplicate events are not re-recorded within windows
  • Tests different deduplication windows (5 min for actions, 1 hour for fencing)
  • Validates healthy event is only recorded on status transitions
  • Confirms new errors are recorded even when duplicates exist

The test correctly verifies the controller only records healthy events when transitioning from unhealthy states, not when already healthy.


428-496: Comprehensive XML parsing test coverage.

The parsing tests appropriately cover:

  • Healthy cluster baseline
  • Offline nodes
  • Standby nodes
  • Recent failures with dynamic timestamps
  • Invalid/large/missing-state XML edge cases

The tests correctly validate both the parsed status and the specific warnings/errors generated for each scenario.

pkg/tnf/pacemaker/healthcheck.go (5)

349-350: Consider passing context to recordHealthCheckEvents for consistency.

While recordHealthCheckEvents doesn't perform I/O or cancellable operations, passing the context parameter would maintain consistency with other methods and enable future additions that might need it (e.g., timeout-sensitive event recording).

Based on learnings


452-460: Node status validation correctly treats standby as offline.

The implementation correctly identifies both offline nodes (online != "true") and standby nodes (standby == "true") as errors, treating them as equivalent failure conditions. The error message for standby nodes explicitly documents this design decision with "(unexpected behavior; treated as offline)".


360-387: LGTM! Robust command execution with timeout and error handling.

The getPacemakerStatus function appropriately:

  • Uses a timeout context (10 seconds) to prevent indefinite hangs
  • Uses non-interactive sudo (sudo -n) to fail fast on missing credentials
  • Validates XML size to prevent XML bomb attacks
  • Returns Unknown status with error messages rather than propagating errors, ensuring graceful degradation

The timeout handling explicitly checks for context.DeadlineExceeded to provide clear error messages.


408-413: LGTM! Correctly sets error status when Pacemaker is down.

The code properly detects when Pacemaker is not running and sets OverallStatus to statusError before returning, ensuring the degraded condition will be set appropriately in the controller flow.


505-511: Hard-coded expectedNodeCount is intentional by design.

The use of expectedNodeCount = 2 is correct for Two-Node OpenShift with Fencing architecture. This is a fundamental constraint, not a limitation to be refactored.

Based on learnings

@openshift-ci openshift-ci bot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 15, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (4)
pkg/tnf/pkg/jobs/cronjob_controller.go (1)

167-175: Prefer structured mutation over text templating (reduces brittleness).

Parse the manifest once, then set fields on the CronJob object (name, namespace, schedule, image, command) instead of line‑based replacements. This avoids YAML formatting pitfalls and future template drift.

Example:

cronJob := ReadCronJobV1OrDie(c.manifest)
cronJob.Name = c.name
cronJob.Namespace = c.namespace
cronJob.Spec.Schedule = c.schedule
if len(cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers) > 0 {
  ct := &cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers[0]
  ct.Image = c.image
  ct.Command = append([]string(nil), c.command...)
}
return cronJob
pkg/tnf/pacemaker/healthcheck.go (2)

306-309: Remove unused lastProcessedStatus or use it meaningfully.

State is written but never read; drop it or leverage it to suppress duplicate event processing based on CR changes.

Also applies to: 433-436


515-523: Optional: clarify standby-as-offline in a code comment.

Behavior is intentional; a brief inline comment above the loop will aid future readers.

pkg/cmd/pacemakerstatuscollector/collector.go (1)

165-177: Remove unused Kubernetes client wiring to simplify.

You construct a client but don’t use it; updatePacemakerStatusCR rebuilds REST config anyway. Simplify by dropping the client parameter and the extra creation.

Apply:

@@
-  // Create Kubernetes client
-  client, err := createKubernetesClient()
-  if err != nil {
-    return fmt.Errorf("failed to create kubernetes client: %w", err)
-  }
-
   // Update PacemakerStatus CR
-  if err := updatePacemakerStatusCR(ctx, client, rawXML, summary, collectionError); err != nil {
+  if err := updatePacemakerStatusCR(ctx, rawXML, summary, collectionError); err != nil {
     return fmt.Errorf("failed to update PacemakerStatus CR: %w", err)
   }

And change the function signature:

-func updatePacemakerStatusCR(ctx context.Context, client *kubernetes.Clientset, rawXML string, summary v1alpha1.PacemakerSummary, collectionError string) error {
+func updatePacemakerStatusCR(ctx context.Context, rawXML string, summary v1alpha1.PacemakerSummary, collectionError string) error {

Optionally remove createKubernetesClient if unused elsewhere.

Also applies to: 321-321

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 6176029 and 4aefd7d.

📒 Files selected for processing (15)
  • bindata/tnfdeployment/cronjob.yaml (1 hunks)
  • cmd/cluster-etcd-operator/main.go (2 hunks)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
  • pkg/cmd/pacemakerstatuscollector/collector.go (1 hunks)
  • pkg/operator/starter.go (1 hunks)
  • pkg/tnf/operator/starter.go (4 hunks)
  • pkg/tnf/operator/starter_test.go (4 hunks)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/doc.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/register.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/types.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • pkg/tnf/pkg/jobs/batch.go (1 hunks)
  • pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml

[medium] 1-47: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-47: Container should not be privileged

(CKV_K8S_16)


[medium] 1-47: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-47: Containers should not share the host process ID namespace

(CKV_K8S_17)

🔇 Additional comments (2)
pkg/tnf/pacemaker/v1alpha1/register.go (1)

24-31: LGTM: scheme registration is correct.

Types registered, GroupVersion wired, and metav1 added. No issues.

pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)

12-121: Autogenerated deepcopy looks consistent.

Methods cover all types; list/items deep-copied properly.

Comment on lines +352 to +392
err = restClient.Get().
Resource("pacemakerstatuses").
Name(pacemakerStatusName).
Do(ctx).
Into(&existing)

now := metav1.Now()

if err != nil {
// Create new PacemakerStatus if it doesn't exist
if strings.Contains(err.Error(), "not found") {
newStatus := &v1alpha1.PacemakerStatus{
TypeMeta: metav1.TypeMeta{
APIVersion: v1alpha1.SchemeGroupVersion.String(),
Kind: "PacemakerStatus",
},
ObjectMeta: metav1.ObjectMeta{
Name: pacemakerStatusName,
},
Status: v1alpha1.PacemakerStatusStatus{
LastUpdated: now,
RawXML: rawXML,
CollectionError: collectionError,
Summary: summary,
},
}

result := restClient.Post().
Resource("pacemakerstatuses").
Body(newStatus).
Do(ctx)

if result.Error() != nil {
return fmt.Errorf("failed to create PacemakerStatus: %w", result.Error())
}

klog.Info("Created new PacemakerStatus CR")
return nil
}
return fmt.Errorf("failed to get PacemakerStatus: %w", err)
}

⚠️ Potential issue | 🟠 Major

Use apierrors.IsNotFound instead of string matching.

String checks are brittle. Prefer proper error typing.

Apply:

@@
-import (
+import (
   "context"
   "encoding/xml"
   "fmt"
   "os"
   "strings"
   "time"
 
   "github.com/spf13/cobra"
+  apierrors "k8s.io/apimachinery/pkg/api/errors"
   metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
@@
-  if err != nil {
+  if err != nil {
     // Create new PacemakerStatus if it doesn't exist
-    if strings.Contains(err.Error(), "not found") {
+    if apierrors.IsNotFound(err) {
       newStatus := &v1alpha1.PacemakerStatus{

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In pkg/cmd/pacemakerstatuscollector/collector.go around lines 352 to 392,
replace the brittle string check strings.Contains(err.Error(), "not found") with
a proper API error type check using apierrors.IsNotFound(err); import
"k8s.io/apimachinery/pkg/api/errors" as apierrors if not already imported, and
keep the existing error handling logic path the same (create the new
PacemakerStatus on NotFound, otherwise return the wrapped error).

Comment on lines +3 to +19
import (
"context"
"fmt"
"strings"
"time"

batchv1 "k8s.io/api/batch/v1"
apierrors "k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/klog/v2"

operatorv1 "github.com/openshift/api/operator/v1"
"github.com/openshift/library-go/pkg/controller/factory"
"github.com/openshift/library-go/pkg/operator/events"
"github.com/openshift/library-go/pkg/operator/v1helpers"
)

🛠️ Refactor suggestion | 🟠 Major

Add missing import for strconv used in escaping.

@@
-import (
+import (
   "context"
   "fmt"
+  "strconv"
   "strings"
   "time"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import (
"context"
"fmt"
"strings"
"time"
batchv1 "k8s.io/api/batch/v1"
apierrors "k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/klog/v2"
operatorv1 "github.com/openshift/api/operator/v1"
"github.com/openshift/library-go/pkg/controller/factory"
"github.com/openshift/library-go/pkg/operator/events"
"github.com/openshift/library-go/pkg/operator/v1helpers"
)
import (
"context"
"fmt"
"strconv"
"strings"
"time"
batchv1 "k8s.io/api/batch/v1"
apierrors "k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/klog/v2"
operatorv1 "github.com/openshift/api/operator/v1"
"github.com/openshift/library-go/pkg/controller/factory"
"github.com/openshift/library-go/pkg/operator/events"
"github.com/openshift/library-go/pkg/operator/v1helpers"
)
🤖 Prompt for AI Agents
In pkg/tnf/pkg/jobs/cronjob_controller.go around lines 3 to 19, the suggested escaping fix uses strconv.Quote, but the strconv package is not imported; add "strconv" to the import block (alongside context, fmt, strings, time, etc.) so the file compiles and the escaping call resolves; reorder or group imports if needed to satisfy linters.

Comment on lines +48 to +51
Schedule string // Cron schedule expression (e.g., "*/30 * * * * *" for every 30 seconds)
Command []string // Command to run in the container
Image string // Container image to use
}

⚠️ Potential issue | 🟡 Minor

Cron schedule comment is misleading; Kubernetes CronJob uses 5 fields (no seconds).

Update the example to a valid 5‑field cron (e.g., "*/30 * * * *") to avoid confusion.

🤖 Prompt for AI Agents
In pkg/tnf/pkg/jobs/cronjob_controller.go around lines 48 to 51, the comment for
Schedule shows a 6-field cron expression including seconds, which is incorrect
for Kubernetes CronJob (it expects a standard 5-field expression). Update the
example to a valid 5-field expression such as "*/30 * * * *" (every 30 minutes)
and adjust the parenthetical text so the comment accurately reflects Kubernetes
CronJob syntax.

Comment on lines 177 to 222
// injectValues replaces <injected> placeholders in the manifest with actual values
func (c *CronJobController) injectValues(manifest []byte) []byte {
result := string(manifest)

// Build command string for YAML array format
var commandBuilder strings.Builder
commandBuilder.WriteString("[")
for i, cmd := range c.command {
if i > 0 {
commandBuilder.WriteString(", ")
}
commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd))
}
commandBuilder.WriteString("]")
commandStr := commandBuilder.String()

// Track which line we're on to do context-aware replacement
lines := strings.Split(result, "\n")
for i, line := range lines {
// Replace name (appears in metadata.name)
if strings.Contains(line, "name: <injected>") {
lines[i] = strings.Replace(line, "<injected>", c.name, 1)
}
// Replace app label
if strings.Contains(line, "app.kubernetes.io/name: <injected>") {
lines[i] = strings.Replace(line, "<injected>", c.name, 1)
}
// Replace schedule
if strings.Contains(line, "schedule: <injected>") {
lines[i] = strings.Replace(line, "<injected>", c.schedule, 1)
}
// Replace image
if strings.Contains(line, "image: <injected>") {
lines[i] = strings.Replace(line, "<injected>", c.image, 1)
}
// Replace command (special handling for array)
if strings.Contains(line, "command: <injected>") {
lines[i] = strings.Replace(line, "<injected>", commandStr, 1)
}
}

return []byte(strings.Join(lines, "\n"))
}

🛠️ Refactor suggestion | 🟠 Major

Fix YAML injection/escaping in command and support namespace injection.

  • Current command quoting uses "%s" without escaping; quotes/backslashes in args will break YAML.
  • Namespace in manifest isn’t injected; if config.Namespace != manifest’s namespace, create/update can fail (path vs object namespace mismatch).

Apply:

@@
-func (c *CronJobController) injectValues(manifest []byte) []byte {
+func (c *CronJobController) injectValues(manifest []byte) []byte {
   result := string(manifest)
 
   // Build command string for YAML array format
-  var commandBuilder strings.Builder
-  commandBuilder.WriteString("[")
-  for i, cmd := range c.command {
-    if i > 0 {
-      commandBuilder.WriteString(", ")
-    }
-    commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd))
-  }
-  commandBuilder.WriteString("]")
-  commandStr := commandBuilder.String()
+  var commandBuilder strings.Builder
+  commandBuilder.WriteString("[")
+  for i, cmd := range c.command {
+    if i > 0 {
+      commandBuilder.WriteString(", ")
+    }
+    // Use proper escaping for YAML/JSON-compatible strings
+    commandBuilder.WriteString(strconv.Quote(cmd))
+  }
+  commandBuilder.WriteString("]")
+  commandStr := commandBuilder.String()
@@
   for i, line := range lines {
+    // Replace namespace if present in template
+    if strings.Contains(line, "namespace: <injected>") {
+      lines[i] = strings.Replace(line, "<injected>", c.namespace, 1)
+    }
     // Replace name (appears in metadata.name)
     if strings.Contains(line, "name: <injected>") {
       lines[i] = strings.Replace(line, "<injected>", c.name, 1)
     }

Note: add strconv import (see separate comment).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// injectValues replaces <injected> placeholders in the manifest with actual values
func (c *CronJobController) injectValues(manifest []byte) []byte {
result := string(manifest)
// Build command string for YAML array format
var commandBuilder strings.Builder
commandBuilder.WriteString("[")
for i, cmd := range c.command {
if i > 0 {
commandBuilder.WriteString(", ")
}
commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd))
}
commandBuilder.WriteString("]")
commandStr := commandBuilder.String()
// Track which line we're on to do context-aware replacement
lines := strings.Split(result, "\n")
for i, line := range lines {
// Replace name (appears in metadata.name)
if strings.Contains(line, "name: <injected>") {
lines[i] = strings.Replace(line, "<injected>", c.name, 1)
}
// Replace app label
if strings.Contains(line, "app.kubernetes.io/name: <injected>") {
lines[i] = strings.Replace(line, "<injected>", c.name, 1)
}
// Replace schedule
if strings.Contains(line, "schedule: <injected>") {
lines[i] = strings.Replace(line, "<injected>", c.schedule, 1)
}
// Replace image
if strings.Contains(line, "image: <injected>") {
lines[i] = strings.Replace(line, "<injected>", c.image, 1)
}
// Replace command (special handling for array)
if strings.Contains(line, "command: <injected>") {
lines[i] = strings.Replace(line, "<injected>", commandStr, 1)
}
}
return []byte(strings.Join(lines, "\n"))
}
// injectValues replaces <injected> placeholders in the manifest with actual values
func (c *CronJobController) injectValues(manifest []byte) []byte {
result := string(manifest)
// Build command string for YAML array format
var commandBuilder strings.Builder
commandBuilder.WriteString("[")
for i, cmd := range c.command {
if i > 0 {
commandBuilder.WriteString(", ")
}
// Use proper escaping for YAML/JSON-compatible strings
commandBuilder.WriteString(strconv.Quote(cmd))
}
commandBuilder.WriteString("]")
commandStr := commandBuilder.String()
// Track which line we're on to do context-aware replacement
lines := strings.Split(result, "\n")
for i, line := range lines {
// Replace namespace if present in template
if strings.Contains(line, "namespace: <injected>") {
lines[i] = strings.Replace(line, "<injected>", c.namespace, 1)
}
// Replace name (appears in metadata.name)
if strings.Contains(line, "name: <injected>") {
lines[i] = strings.Replace(line, "<injected>", c.name, 1)
}
// Replace app label
if strings.Contains(line, "app.kubernetes.io/name: <injected>") {
lines[i] = strings.Replace(line, "<injected>", c.name, 1)
}
// Replace schedule
if strings.Contains(line, "schedule: <injected>") {
lines[i] = strings.Replace(line, "<injected>", c.schedule, 1)
}
// Replace image
if strings.Contains(line, "image: <injected>") {
lines[i] = strings.Replace(line, "<injected>", c.image, 1)
}
// Replace command (special handling for array)
if strings.Contains(line, "command: <injected>") {
lines[i] = strings.Replace(line, "<injected>", commandStr, 1)
}
}
return []byte(strings.Join(lines, "\n"))
}
🤖 Prompt for AI Agents
In pkg/tnf/pkg/jobs/cronjob_controller.go around lines 177 to 219, the command
array is built with fmt.Sprintf("%s") which doesn't escape quotes/backslashes
and will break YAML; also the manifest never injects a namespace. Change the
command builder to call strconv.Quote on each c.command element when
constructing commandStr (so arguments are properly escaped and wrapped), add
strconv to the import list, and add a replacement branch to replace "namespace:
<injected>" with c.namespace (or the controller's namespace field) so the
manifest namespace is injected consistently.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from 4aefd7d to b553bf0 on October 16, 2025 at 20:16

openshift-ci bot commented Oct 16, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jaypoulz
Once this PR has been reviewed and has the lgtm label, please assign tjungblu for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
pkg/tnf/pkg/jobs/cronjob_controller.go (3)

3-19: Missing strconv import needed for proper YAML escaping.

The injectValues method at line 191 constructs YAML array strings using fmt.Sprintf("%s") without proper escaping. Special characters in command arguments (quotes, backslashes) will break YAML parsing. Import strconv and use strconv.Quote for escaping.


48-48: Misleading cron schedule example includes seconds field.

Kubernetes CronJob uses standard 5-field cron expressions (minute hour day month weekday), not 6-field (with seconds). The example "*/30 * * * * *" is invalid. Use "*/30 * * * *" for every 30 minutes.


180-222: Fix YAML injection vulnerabilities and add namespace injection.

Three critical issues remain from previous review:

  1. Command escaping (line 191): Using fmt.Sprintf("%s") without escaping will break YAML when command args contain quotes or backslashes. Use strconv.Quote(cmd) instead.

  2. Namespace not injected: The manifest likely has a namespace: <injected> placeholder that needs replacement, but the loop has no case for it. Add namespace replacement or the CronJob will be created in wrong namespace.

  3. Schedule escaping (line 209): Wrapping in quotes helps but doesn't escape special chars in the schedule value itself.

Apply this diff:

 import (
   "context"
   "fmt"
+  "strconv"
   "strings"
   "time"
   for i, line := range lines {
+    // Replace namespace if present in template
+    if strings.Contains(line, "namespace: <injected>") {
+      lines[i] = strings.Replace(line, "<injected>", c.namespace, 1)
+    }
     // Replace name (appears in metadata.name)
     if strings.Contains(line, "name: <injected>") {
       lines[i] = strings.Replace(line, "<injected>", c.name, 1)
   for i, cmd := range c.command {
     if i > 0 {
       commandBuilder.WriteString(", ")
     }
-    commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd))
+    commandBuilder.WriteString(strconv.Quote(cmd))
   }
pkg/cmd/pacemakerstatuscollector/collector.go (1)

321-414: Replace string matching with proper error type check.

At line 362, the code uses brittle string matching strings.Contains(err.Error(), "not found") to detect NotFound errors. Use apierrors.IsNotFound(err) instead.

Apply this diff:

 import (
   "context"
   "encoding/xml"
   "fmt"
   "os"
   "strings"
   "time"
 
   "github.com/spf13/cobra"
+  apierrors "k8s.io/apimachinery/pkg/api/errors"
   metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
   if err != nil {
     // Create new PacemakerStatus if it doesn't exist
-    if strings.Contains(err.Error(), "not found") {
+    if apierrors.IsNotFound(err) {
       newStatus := &v1alpha1.PacemakerStatus{
🧹 Nitpick comments (1)
pkg/cmd/pacemakerstatuscollector/collector.go (1)

52-138: Acceptable duplication to avoid circular dependencies.

XML types are duplicated from healthcheck.go as noted in the comment. Consider extracting to a shared internal package in future refactoring to reduce duplication.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 4aefd7d and b553bf0.

📒 Files selected for processing (15)
  • bindata/tnfdeployment/cronjob.yaml (1 hunks)
  • cmd/cluster-etcd-operator/main.go (2 hunks)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
  • pkg/cmd/pacemakerstatuscollector/collector.go (1 hunks)
  • pkg/operator/starter.go (1 hunks)
  • pkg/tnf/operator/starter.go (4 hunks)
  • pkg/tnf/operator/starter_test.go (4 hunks)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/doc.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/register.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/types.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • pkg/tnf/pkg/jobs/batch.go (1 hunks)
  • pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • pkg/tnf/pacemaker/v1alpha1/doc.go
  • pkg/operator/starter.go
  • pkg/tnf/pkg/jobs/batch.go
  • cmd/cluster-etcd-operator/main.go
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
  • pkg/tnf/operator/starter_test.go
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml

[medium] 1-47: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-47: Container should not be privileged

(CKV_K8S_16)


[medium] 1-47: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-47: Containers should not share the host process ID namespace

(CKV_K8S_17)

🔇 Additional comments (42)
pkg/tnf/pkg/jobs/cronjob_controller.go (7)

53-88: LGTM!

The constructor properly sets defaults, wires dependencies, and follows library-go factory patterns.


90-111: LGTM!

The sync logic correctly checks operator management state and handles errors appropriately.


113-132: LGTM!

The reconciliation logic follows standard Kubernetes controller patterns.


134-150: LGTM!

Creation logic properly handles AlreadyExists and records appropriate events.


152-165: LGTM!

Update logic correctly preserves ResourceVersion and records events.


167-178: LGTM!

Build logic is clear and includes helpful debug logging.


224-249: LGTM!

The update detection logic correctly compares schedule, command, and image with proper bounds checking.

pkg/tnf/pacemaker/healthcheck.go (22)

28-99: LGTM!

Constants are well-defined. The StatusError constant was appropriately added. The expectedNodeCount = 2 is intentional per the Two-Node OpenShift architecture design.

Based on learnings.


101-273: LGTM!

XML parsing structures are comprehensive and follow standard Go XML unmarshaling conventions.


274-288: LGTM!

HealthStatus model and helper function are appropriate. Unknown status for collection failures is intentional per maintainer decision.


290-309: LGTM!

The HealthCheck structure is well-designed with proper mutex-protected state for concurrent access.


311-383: LGTM!

The constructor properly configures the REST client, informer, and controller factory with appropriate wiring.


385-405: LGTM!

The sync function follows a clear flow: retrieve status, update operator conditions, record events.


407-450: LGTM!

Status retrieval logic properly validates staleness and enforces XML size limits. The maxXMLSize check at line 439 prevents XML bomb parsing exploits; while the data is already in memory, this is acceptable for internal trusted sources.


452-499: LGTM!

XML parsing correctly validates structure and sets StatusError when pacemaker is not running, addressing prior review feedback.


501-524: LGTM!

Node status collection correctly flags offline and standby nodes as errors, consistent with Two-Node OpenShift design where standby is treated as offline.

Based on learnings.


526-566: LGTM!

Resource status collection correctly tracks started resources across nodes with clear helper function structure.


568-574: LGTM!

Resource count validation correctly uses the intentional expectedNodeCount constant.

Based on learnings.


576-600: LGTM!

Recent failure detection correctly uses OR logic (line 584) per prior review. Timestamp parse errors are logged and skipped, which is appropriate defensive behavior.


602-619: LGTM!

Fencing event collection follows consistent patterns with appropriate time-based filtering.


621-634: LGTM!

Overall status determination follows clear precedence: errors > warnings > healthy.


636-653: LGTM!

Operator status update correctly sets degraded condition for StatusError, addressing prior review feedback.


655-664: LGTM!

Status message logging is clear and appropriate.


666-680: LGTM!

Degraded condition setting uses a scoped condition type (PacemakerHealthCheckDegraded at line 91), which prevents conflicts with other controllers' degraded conditions.


682-689: LGTM!

Condition clearing logic is correct.


691-701: LGTM!

Operator condition update uses standard library-go helpers correctly.


703-728: LGTM!

Event deduplication logic correctly cleans up stale entries and enforces time-based windows. The cleanup at lines 710-716 executes once per call, not per iteration of a caller's loop.


730-769: LGTM!

Event recording implements appropriate deduplication with context-aware windows (longer for fencing) and only records healthy events on transitions.


771-781: LGTM!

Warning event categorization correctly routes to appropriate event reasons.

pkg/tnf/pacemaker/v1alpha1/types.go (5)

1-5: LGTM!

Package structure and imports are standard for Kubernetes API types.


7-21: LGTM!

PacemakerStatus type follows Kubernetes CRD best practices with appropriate kubebuilder markers and subresource support.


23-50: LGTM!

Spec and Status types are well-designed with appropriate fields for status collection and error reporting. The RawXML field enables efficient parsing without re-executing commands.


52-77: LGTM!

PacemakerSummary provides a comprehensive high-level cluster state overview with well-documented fields.


79-86: LGTM!

PacemakerStatusList follows standard Kubernetes list type conventions.

pkg/cmd/pacemakerstatuscollector/collector.go (8)

1-50: LGTM!

Package structure, imports, and constants are well-organized. The pcs command uses sudo -n for non-interactive execution, which is good practice.


140-153: LGTM!

Command constructor follows standard cobra patterns with appropriate error handling.


155-177: LGTM!

The collector orchestration is clean with appropriate timeout and error handling.


179-217: LGTM!

Status collection properly handles timeouts, validates XML size, and gracefully handles parse errors while preserving raw XML.


219-265: LGTM!

Summary building logic correctly processes all node and resource states with clear counting logic.


267-286: LGTM!

Recent failure detection uses correct OR logic for status checks and handles timestamp parsing gracefully.


288-301: LGTM!

Recent fencing detection follows consistent patterns with appropriate time-based filtering.


303-319: LGTM!

Kubernetes client creation follows standard patterns with in-cluster and kubeconfig fallback.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from b553bf0 to d99fc51 on October 17, 2025 at 00:28
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (5)
pkg/tnf/operator/starter.go (1)

346-351: Cron expression is invalid for Kubernetes CronJobs

"*/30 * * * * *" has six fields (seconds). Kubernetes batch/v1 CronJobs only accept the standard five-field spec, so this will be rejected and the collector never runs. Use a valid five-field schedule (e.g., "* * * * *" for once per minute and loop inside the job for 30‑second sampling) or another mechanism that supports sub-minute intervals.

-			Schedule:  "*/30 * * * * *", // Every 30 seconds (Kubernetes CronJob format with seconds)
+			Schedule:  "* * * * *", // Every minute; handle 30s cadence inside the job if required
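
If a 30-second cadence is still wanted, one sketch is to keep a once-per-minute schedule and sample twice per job run; collectOnce stands in for the collector's existing single-shot logic and is an assumed name:

// runTwicePerMinute performs two collections 30 seconds apart, so a
// "* * * * *" CronJob still yields roughly 30-second sampling.
func runTwicePerMinute(ctx context.Context, collectOnce func(context.Context) error) error {
	if err := collectOnce(ctx); err != nil {
		return err
	}
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(30 * time.Second):
	}
	return collectOnce(ctx)
}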
pkg/cmd/pacemakerstatuscollector/collector.go (1)

358-391: Use API error helpers instead of string matching.

strings.Contains(err.Error(), "not found") is fragile (locale/log variations) and can misclassify errors. Switch to apierrors.IsNotFound(err) and import k8s.io/apimachinery/pkg/api/errors.

-	if err != nil {
-		// Create new PacemakerStatus if it doesn't exist
-		if strings.Contains(err.Error(), "not found") {
+	if err != nil {
+		// Create new PacemakerStatus if it doesn't exist
+		if apierrors.IsNotFound(err) {
pkg/tnf/pkg/jobs/cronjob_controller.go (3)

3-19: Missing strconv import.

The previous review comment about the missing strconv import still applies. This import is required for the proper escaping fix suggested in the injectValues method.


48-51: Cron schedule comment is still misleading.

The previous review comment about the incorrect 6-field cron expression example still applies. Kubernetes CronJob expects a 5-field expression.


180-222: YAML injection and namespace issues remain.

The previous review comments about improper escaping in command building and missing namespace injection still apply.

🧹 Nitpick comments (2)
pkg/tnf/operator/starter_test.go (1)

292-301: Make the retry expectation observable

expectRetries is populated but never asserted, so the test still passes even if handleNodesWithRetry never performs additional attempts. Either surface the captured attempt counter (e.g., return it from the setup helper) and assert it against expectRetries, or drop the field to avoid implying coverage that isn’t there.
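
A self-contained sketch of the pattern; retryUntilSuccess stands in for the production retry helper, and its name and signature are assumptions:

func retryUntilSuccess(op func() error) error {
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		if err = op(); err == nil {
			return nil
		}
	}
	return err
}

func TestRetry_AttemptCountIsAsserted(t *testing.T) {
	const expectRetries = 2
	attempts := 0
	err := retryUntilSuccess(func() error {
		attempts++
		if attempts <= expectRetries {
			return fmt.Errorf("transient failure %d", attempts)
		}
		return nil
	})
	require.NoError(t, err)
	// The assertion is what makes the expected retry count meaningful.
	require.Equal(t, expectRetries+1, attempts)
}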

pkg/tnf/pkg/jobs/cronjob_controller.go (1)

90-111: Sync method correctly checks operator state.

The implementation appropriately verifies the operator is in Managed state before ensuring the CronJob. The early return for non-Managed states is a sound approach.

Optional: Consider whether existing CronJobs should be cleaned up when the operator transitions to Unmanaged or Removed states. The current behavior preserves existing resources, which may be intentional.
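
If cleanup is ever wanted, a sketch of what the non-Managed branch might do; c.kubeClient, c.namespace, and c.name are assumed to match the controller's fields:

// Sketch: remove the managed CronJob when the operator is no longer Managed.
if managementState != operatorv1.Managed {
	err := c.kubeClient.BatchV1().CronJobs(c.namespace).Delete(ctx, c.name, metav1.DeleteOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	return nil
}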

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between b553bf0 and d99fc51.

📒 Files selected for processing (19)
  • bindata/tnfdeployment/cronjob.yaml (1 hunks)
  • cmd/cluster-etcd-operator/main.go (2 hunks)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
  • pkg/cmd/pacemakerstatuscollector/collector.go (1 hunks)
  • pkg/operator/starter.go (1 hunks)
  • pkg/tnf/operator/starter.go (9 hunks)
  • pkg/tnf/operator/starter_test.go (6 hunks)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/standby_node.xml (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/doc.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/register.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/types.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • pkg/tnf/pkg/jobs/batch.go (1 hunks)
  • pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml
  • pkg/operator/starter.go
  • pkg/tnf/pacemaker/testdata/offline_node.xml
  • pkg/tnf/pacemaker/v1alpha1/register.go
  • pkg/tnf/pacemaker/v1alpha1/doc.go
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml

[medium] 1-47: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-47: Container should not be privileged

(CKV_K8S_16)


[medium] 1-47: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-47: Containers should not share the host process ID namespace

(CKV_K8S_17)

🔇 Additional comments (8)
pkg/tnf/pkg/jobs/cronjob_controller.go (8)

21-27: Constants are well-defined.

The default resync interval and namespace values are appropriate for CronJob lifecycle management.


29-42: Controller structure is well-organized.

The CronJobController struct appropriately encapsulates all necessary clients and configuration for managing CronJob resources.


53-88: Constructor is well-implemented.

The constructor properly handles defaults, follows the factory pattern, and correctly sets up the controller with informers and sync intervals.


113-132: CronJob reconciliation logic is sound.

The method correctly handles the create-or-update pattern with appropriate error handling for NotFound cases.


134-150: CronJob creation is well-handled.

The method correctly creates the CronJob, handles AlreadyExists errors gracefully, and records events appropriately.


152-165: CronJob update correctly preserves ResourceVersion.

The method appropriately preserves the existing ResourceVersion before updating, which is essential for conflict-free updates.


167-178: Manifest processing approach is sound.

The method correctly injects values and parses the manifest. The debug logging at V(4) is a helpful troubleshooting aid.


224-249: Update detection logic is correct and defensive.

The method appropriately compares schedule, command, and image fields, with proper defensive checks for container existence before accessing array elements.

Comment on lines +390 to +403
// Get pacemaker status
healthStatus, err := c.getPacemakerStatus(ctx)
if err != nil {
return err
}

// Update operator status conditions based on pacemaker status
if err := c.updateOperatorStatus(ctx, healthStatus); err != nil {
return err
}

// Record pacemaker health check events
c.recordHealthCheckEvents(healthStatus)


⚠️ Potential issue | 🟠 Major

Stale/failed collections never set degraded.

When getPacemakerStatus returns StatusUnknown (collection error, stale XML, oversized payload, etc.), updateOperatorStatus hits the default branch and leaves the degraded condition untouched. That masks genuine outages (e.g., pcs unreachable) and contradicts the controller’s goal of surfacing health issues. Please map collection failures to StatusError (either in getPacemakerStatus or when propagating errors) so we call setDegradedCondition.

🤖 Prompt for AI Agents
In pkg/tnf/pacemaker/healthcheck.go around lines 390 to 403, collection failures that come back as StatusUnknown are never mapped to a degraded condition, so updateOperatorStatus leaves the degraded condition untouched. Either convert any StatusUnknown result (collection error, stale or oversized payload, unreachable pcs) to StatusError before calling updateOperatorStatus, or update updateOperatorStatus to treat StatusUnknown the same as StatusError so that setDegradedCondition is invoked. The simplest placement is immediately after getPacemakerStatus returns (if healthStatus.OverallStatus == StatusUnknown { healthStatus.OverallStatus = StatusError }), or add a case in updateOperatorStatus that routes StatusUnknown to setDegradedCondition.
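
In code, the smallest version of that mapping drops straight into the quoted sync flow; the constant names follow the ones used in this thread and may differ in case in the actual file:

// Get pacemaker status
healthStatus, err := c.getPacemakerStatus(ctx)
if err != nil {
	return err
}

// Treat "could not collect" the same as "unhealthy" so outages surface as Degraded.
if healthStatus.OverallStatus == StatusUnknown {
	healthStatus.OverallStatus = StatusError
}

// Update operator status conditions based on pacemaker status
if err := c.updateOperatorStatus(ctx, healthStatus); err != nil {
	return err
}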

Change List
- Added a PacemakerHealthCheck controller which monitors pacemaker cluster status every 30 seconds by executing 'pcs status --full' and parsing node/resource health
- Updated operator condition to degraded when nodes are not online or critical resources (kubelet/etcd) are not started on all nodes
- Added event recording for warnings of recent failed resource actions (5min window) and fencing events (24hr window) with appropriate event reasons
- Included happy path testing
@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from d99fc51 to 59eaec8 on October 17, 2025 at 13:26
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/tnf/operator/starter_test.go (1)

18-20: Remove duplicate import of the same package

The same import path is aliased twice (config/v1). Use a single alias.

Apply this diff:

-	configv1informers "github.com/openshift/client-go/config/informers/externalversions/config/v1"
-	v1 "github.com/openshift/client-go/config/informers/externalversions/config/v1"
+	configv1informers "github.com/openshift/client-go/config/informers/externalversions/config/v1"
@@
-	networkInformer            v1.NetworkInformer
+	networkInformer            configv1informers.NetworkInformer

Also applies to: 55-56

♻️ Duplicate comments (5)
pkg/tnf/pkg/jobs/cronjob_controller.go (3)

3-19: Add strconv import for proper command escaping.

The injectValues method (line 191) builds command strings without proper escaping. To safely handle command arguments containing quotes, backslashes, or other special characters, you'll need the strconv package.

Add the import:

 import (
   "context"
   "fmt"
+  "strconv"
   "strings"
   "time"

Then use strconv.Quote(cmd) in the command builder (see separate comment on lines 180-222).


48-48: Fix cron schedule comment: Kubernetes uses 5 fields, not 6.

The example "*/30 * * * * *" shows 6 fields (including seconds), but Kubernetes CronJob uses the standard 5-field cron format (minute, hour, day, month, weekday). This will confuse users.

Update the comment:

-  Schedule  string   // Cron schedule expression (e.g., "*/30 * * * * *" for every 30 seconds)
+  Schedule  string   // Cron schedule expression (e.g., "*/30 * * * *" for every 30 minutes)

180-222: Critical: Fix command escaping and add namespace injection.

Two issues in injectValues:

  1. Command escaping: Line 191 uses fmt.Sprintf('"%s"', cmd) which doesn't escape quotes or backslashes. If any command argument contains " or \, the generated YAML will be malformed.

  2. Missing namespace injection: The manifest has namespace: openshift-etcd (line 5 in cronjob.yaml), but this is never replaced with c.namespace, causing a mismatch when CronJobConfig.Namespace differs from the hardcoded value.

Apply this fix:

 func (c *CronJobController) injectValues(manifest []byte) []byte {
   result := string(manifest)
 
   // Build command string for YAML array format
   var commandBuilder strings.Builder
   commandBuilder.WriteString("[")
   for i, cmd := range c.command {
     if i > 0 {
       commandBuilder.WriteString(", ")
     }
-    commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd))
+    // Use strconv.Quote for proper YAML/JSON escaping
+    commandBuilder.WriteString(strconv.Quote(cmd))
   }
   commandBuilder.WriteString("]")
   commandStr := commandBuilder.String()
 
   // Track which line we're on to do context-aware replacement
   lines := strings.Split(result, "\n")
   for i, line := range lines {
+    // Replace namespace
+    if strings.Contains(line, "namespace: <injected>") {
+      lines[i] = strings.Replace(line, "<injected>", c.namespace, 1)
+    }
     // Replace name (appears in metadata.name)
     if strings.Contains(line, "name: <injected>") {

Note: This requires the strconv import (see separate comment) and changing line 5 in cronjob.yaml to namespace: <injected>.

pkg/tnf/operator/starter.go (1)

334-364: Cron schedule fix LGTM

Schedule now uses valid 5-field spec ("* * * * *"). This resolves the prior invalid 6-field seconds issue.

pkg/cmd/pacemakerstatuscollector/collector.go (1)

360-392: Use apierrors.IsNotFound instead of string matching.

String checks are brittle; prefer typed API errors.

-import (
+import (
   ...
-  "strings"
   ...
+  apierrors "k8s.io/apimachinery/pkg/api/errors"
 )
 ...
- if err != nil {
+ if err != nil {
   // Create new PacemakerStatus if it doesn't exist
-  if strings.Contains(err.Error(), "not found") {
+  if apierrors.IsNotFound(err) {
     newStatus := &v1alpha1.PacemakerStatus{ ... }
     ...
   }
   return fmt.Errorf("failed to get PacemakerStatus: %w", err)
 }
🧹 Nitpick comments (20)
manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (2)

32-39: Make spec.nodeName required

Controllers likely rely on spec.nodeName; encode that contract in the CRD for early validation.

Apply this diff:

         spec:
           description: PacemakerStatusSpec defines the desired state of PacemakerStatus
           type: object
+          required:
+          - nodeName
           properties:
             nodeName:
               description: NodeName identifies which node this status is for
               type: string

15-18: Add printer columns for quick status at a glance

Improve kubectl get output for operators and SREs.

Apply this diff:

   versions:
   - name: v1alpha1
     served: true
     storage: true
+    additionalPrinterColumns:
+    - name: Node
+      type: string
+      jsonPath: .spec.nodeName
+    - name: Pacemakerd
+      type: string
+      jsonPath: .status.summary.pacemakerdState
+    - name: Quorum
+      type: string
+      jsonPath: .status.summary.hasQuorum
+    - name: Nodes
+      type: string
+      jsonPath: .status.summary.nodesOnline
+      description: Online/Total
+      priority: 1
+    - name: Updated
+      type: date
+      jsonPath: .status.lastUpdated

Also applies to: 58-85

pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)

1-1: Optional: align command attribution

The request attribute shows crm_mon XML while docs mention pcs status; consider noting pcs→crm_mon indirection or aligning text for consistency.

pkg/tnf/operator/starter_test.go (1)

260-271: Avoid live API dependencies in unit tests

runPacemakerControllers is started and the health checker may construct clients from KubeConfig. A bare rest.Config{Host: "..."} can cause dial/SSL errors and flakiness.

Options:

  • Inject a dynamic.Interface into the health checker and pass fakedynamic in tests.
  • Or gate runPacemakerControllers behind a test knob (function var) so unit tests stub it out.

If you keep KubeConfig in tests, set it to a dummy and ensure health code never dials when running under fake clients.

Also applies to: 96-110

pkg/tnf/operator/starter.go (2)

338-352: 30-second sampling claim vs 1-minute Cron

PR text says “every 30 seconds,” but Cron runs every minute. Ensure the collector loops internally (e.g., two samples with 30s sleep) or update docs.

If needed, document the intra-job loop or add an explicit flag to the collector for 30s cadence.
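If the intra-job loop route is chosen, a rough sketch of the cadence (collectAndPublish and both constants are placeholders, not existing collector code):

const (
	samplesPerRun  = 2                // two samples per one-minute CronJob run
	sampleInterval = 30 * time.Second // approximates the documented 30s cadence
)

func runSamples(ctx context.Context) error {
	for i := 0; i < samplesPerRun; i++ {
		if err := collectAndPublish(ctx); err != nil { // hypothetical collection helper
			klog.Errorf("collection sample %d failed: %v", i+1, err)
		}
		if i == samplesPerRun-1 {
			break
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(sampleInterval):
		}
	}
	return nil
}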


355-364: Defensive start for health controller

If controllerContext.KubeConfig is nil or misconfigured, NewHealthCheck/Run should no-op or degrade gracefully to avoid tight error loops.

Add a nil/validation check and early return with a Warning event.

pkg/tnf/pacemaker/healthcheck_test.go (5)

31-61: Mark helpers as test helpers.

Call t.Helper() in helper functions to improve failure locations and readability.

-func createTestHealthCheck() *HealthCheck {
+func createTestHealthCheck(t *testing.T) *HealthCheck {
+	t.Helper()

Apply similarly in createTestHealthCheckWithMockStatus, loadTestXML, and getRecentFailuresXML (passing the test's *testing.T where a helper does not already take one).


119-140: Stabilize time-dependent tests with an injected clock.

Using time.Now() risks flakes around cutoff windows. Prefer a controllable clock or pass times into helpers.

-func getRecentFailuresXML(t *testing.T) string {
+func getRecentFailuresXMLAt(t *testing.T, now time.Time) string {
   // Load the template...
-  now := time.Now()
   recentTime := now.Add(-1 * time.Minute)...
   ...
   return xmlContent
}
+
+func getRecentFailuresXML(t *testing.T) string {
+  return getRecentFailuresXMLAt(t, time.Now())
+}

277-310: Add coverage for Unknown → no condition change (locks intent).

Given Unknown is intentional for collection/staleness, add a subtest asserting degraded condition is not updated.

   // Test warning status (should not update operator condition)
   status = &HealthStatus{
     OverallStatus: statusWarning,
     Warnings:      []string{"Test warning"},
     Errors:        []string{},
   }
   err = controller.updateOperatorStatus(ctx, status)
   require.NoError(t, err, "updateOperatorAvailability should not return an error")
+
+  // Test unknown status (should not update operator condition)
+  status = &HealthStatus{
+    OverallStatus: statusUnknown,
+    Warnings:      []string{},
+    Errors:        []string{"collection failed"},
+  }
+  err = controller.updateOperatorStatus(ctx, status)
+  require.NoError(t, err)

338-405: Avoid brittle concrete type assertion for recorder.

Casting to events.InMemoryRecorder couples tests to implementation. Prefer asserting via the public Recorder and inspecting emitted events through its API or provide an accessor on HealthCheck for tests.


862-873: Test the max-size guard via getPacemakerStatus path too.

parsePacemakerStatusXML doesn’t enforce size; the guard lives in getPacemakerStatus. Add a test invoking getPacemakerStatus with oversized RawXML to validate handling.
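A sketch of such a test, assuming the createTestHealthCheckWithMockStatus helper takes the raw XML payload and that an oversized payload is surfaced through the returned HealthStatus as statusUnknown rather than as an error (per the behavior described in the earlier degraded-condition comment):

func TestGetPacemakerStatusOversizedXML(t *testing.T) {
	// Exceed the controller's existing maxXMLSize cap by one byte.
	oversized := strings.Repeat("x", maxXMLSize+1)
	controller := createTestHealthCheckWithMockStatus(oversized) // assumed signature

	status, err := controller.getPacemakerStatus(context.Background())
	require.NoError(t, err, "oversized payloads should be reported via the health status")
	require.Equal(t, statusUnknown, status.OverallStatus)
	require.NotEmpty(t, status.Errors, "expect an error entry describing the oversized XML")
}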

pkg/tnf/pacemaker/healthcheck.go (5)

428-447: Explicit size check in parser for direct calls.

parsePacemakerStatusXML is callable directly (tests do) but doesn’t enforce maxXMLSize; add a guard to fail fast on oversized input as in getPacemakerStatus.

 func (c *HealthCheck) parsePacemakerStatusXML(xmlOutput string) (*HealthStatus, error) {
   status := &HealthStatus{ OverallStatus: statusUnknown, Warnings: []string{}, Errors: []string{} }
+  if len(xmlOutput) > maxXMLSize {
+    return nil, fmt.Errorf("XML output too large: %d bytes (max: %d bytes)", len(xmlOutput), maxXMLSize)
+  }
   // Parse XML
   var result PacemakerResult

509-523: Document standby-as-offline behavior inline.

Add a brief comment so future maintainers don’t “fix” this intentionally strict check.

 for _, node := range result.Nodes.Node {
-  if node.Online != booleanValueTrue {
+  // Treat standby as offline by design (standby nodes must not run critical resources)
+  if node.Online != booleanValueTrue {
     status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", node.Name))
   }
   if node.Standby == booleanValueTrue {
     status.Errors = append(status.Errors, fmt.Sprintf("Node %s is in standby (unexpected behavior; treated as offline)", node.Name))
   }
 }

428-431: Make staleness threshold a named constant tied to resync.

2 minutes is embedded; derive from the controller resync or expose a constant for coherence and easier tuning.

 const (
   ...
   healthCheckResyncInterval = 30 * time.Second
+  statusStaleAfter = 2 * time.Minute // consider 4 * healthCheckResyncInterval
 )
 ...
- if time.Since(result.Status.LastUpdated.Time) > 2*time.Minute {
+ if time.Since(result.Status.LastUpdated.Time) > statusStaleAfter {

703-729: Run dedupe cleanup once per call, not per event.

Cleanup executes for every event, adding overhead. Do it once at the start of recordHealthCheckEvents.

-func (c *HealthCheck) shouldRecordEvent(eventKey string, deduplicationWindow time.Duration) bool {
-  c.recordedEventsMu.Lock()
-  defer c.recordedEventsMu.Unlock()
-  now := time.Now()
-  // Clean up old entries...
-  maxWindow := eventDeduplicationWindowFencing
-  for key, timestamp := range c.recordedEvents {
-    if now.Sub(timestamp) > maxWindow {
-      delete(c.recordedEvents, key)
-    }
-  }
+func (c *HealthCheck) shouldRecordEvent(eventKey string, deduplicationWindow time.Duration) bool {
+  c.recordedEventsMu.Lock()
+  defer c.recordedEventsMu.Unlock()
+  now := time.Now()
   // Check if this event was recently recorded
   if lastRecorded, exists := c.recordedEvents[eventKey]; exists {
     if now.Sub(lastRecorded) < deduplicationWindow {
       return false
     }
   }
   c.recordedEvents[eventKey] = now
   return true
 }
 
 func (c *HealthCheck) recordHealthCheckEvents(status *HealthStatus) {
+  // One-time cleanup per invocation
+  c.recordedEventsMu.Lock()
+  now := time.Now()
+  for key, ts := range c.recordedEvents {
+    if now.Sub(ts) > eventDeduplicationWindowFencing {
+      delete(c.recordedEvents, key)
+    }
+  }
+  c.recordedEventsMu.Unlock()
   // Record events for warnings...

Also applies to: 730-781


101-221: Avoid struct duplication across packages.

Pacemaker XML structs are duplicated here and in the collector; they will drift. Extract them into a shared internal/package and reuse.

Also applies to: 52-60
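One possible shape for the shared package; the path and the field set are illustrative only (the real structs carry many more attributes):

// pkg/tnf/pacemaker/pcsxml/types.go (hypothetical shared location)
package pcsxml

// Node mirrors a <node> element from the `pcs status xml` / crm_mon output.
type Node struct {
	Name    string `xml:"name,attr"`
	Online  string `xml:"online,attr"`
	Standby string `xml:"standby,attr"`
}

// Nodes wraps the <nodes> container element.
type Nodes struct {
	Node []Node `xml:"node"`
}

// PacemakerResult is the XML document root both the collector and the
// health check would import instead of keeping local copies.
type PacemakerResult struct {
	Nodes Nodes `xml:"nodes"`
}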

pkg/cmd/pacemakerstatuscollector/collector.go (4)

24-50: De-duplicate shared constants/types to prevent drift.

maxXMLSize, time formats, and XML structs exist here and in healthcheck.go. Move them into a common package (e.g., pkg/tnf/pacemaker/internal/common) and import from both callers.

Also applies to: 31-33, 75-82


179-217: Stream-limit pcs output to reduce peak memory.

If exec package supports it, prefer a bounded reader or line-capped read to avoid large buffers; otherwise consider enhancing exec.Execute to enforce an output cap.
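For illustration, a bounded read with io.LimitReader, assuming the pcs output were exposed as an io.Reader rather than the string the current exec helper returns:

// readBoundedXML caps how much pcs output is buffered in memory and rejects
// anything larger than maxBytes instead of silently truncating it.
func readBoundedXML(stdout io.Reader, maxBytes int64) (string, error) {
	limited := io.LimitReader(stdout, maxBytes+1)
	data, err := io.ReadAll(limited)
	if err != nil {
		return "", fmt.Errorf("reading pcs output: %w", err)
	}
	if int64(len(data)) > maxBytes {
		return "", fmt.Errorf("pcs output exceeds %d bytes", maxBytes)
	}
	return string(data), nil
}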


267-286: Be tolerant to rc_text casing.

Some environments emit RCText “OK”. Compare case-insensitively to avoid false positives.

- if operation.RC != operationRCSuccess || operation.RCText != operationRCTextSuccess {
+ if operation.RC != operationRCSuccess || !strings.EqualFold(operation.RCText, operationRCTextSuccess) {

303-319: HOME may be empty; add fallback.

When building kubeconfig path, guard against empty HOME for non-login shells.

- kubeconfig := os.Getenv("KUBECONFIG")
- if kubeconfig == "" {
-   kubeconfig = os.Getenv("HOME") + "/.kube/config"
- }
+ kubeconfig := os.Getenv("KUBECONFIG")
+ if kubeconfig == "" {
+   if home := os.Getenv("HOME"); home != "" {
+     kubeconfig = home + "/.kube/config"
+   }
+ }
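Alternatively, os.UserHomeDir from the standard library handles the missing-HOME case as well; a minimal sketch:

// defaultKubeconfigPath resolves the kubeconfig path without relying on HOME.
func defaultKubeconfigPath() string {
	if kc := os.Getenv("KUBECONFIG"); kc != "" {
		return kc
	}
	if home, err := os.UserHomeDir(); err == nil && home != "" {
		return filepath.Join(home, ".kube", "config")
	}
	return ""
}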
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between d99fc51 and 59eaec8.

📒 Files selected for processing (19)
  • bindata/tnfdeployment/cronjob.yaml (1 hunks)
  • cmd/cluster-etcd-operator/main.go (2 hunks)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
  • pkg/cmd/pacemakerstatuscollector/collector.go (1 hunks)
  • pkg/operator/starter.go (1 hunks)
  • pkg/tnf/operator/starter.go (4 hunks)
  • pkg/tnf/operator/starter_test.go (4 hunks)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
  • pkg/tnf/pacemaker/testdata/standby_node.xml (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/doc.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/register.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/types.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • pkg/tnf/pkg/jobs/batch.go (1 hunks)
  • pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • pkg/tnf/pacemaker/testdata/recent_failures_template.xml
🚧 Files skipped from review as they are similar to previous changes (6)
  • cmd/cluster-etcd-operator/main.go
  • pkg/tnf/pacemaker/v1alpha1/doc.go
  • pkg/operator/starter.go
  • pkg/tnf/pacemaker/testdata/offline_node.xml
  • pkg/tnf/pacemaker/v1alpha1/types.go
  • pkg/tnf/pacemaker/testdata/standby_node.xml
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml

[medium] 1-47: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-47: Container should not be privileged

(CKV_K8S_16)


[medium] 1-47: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-47: Containers should not share the host process ID namespace

(CKV_K8S_17)

🔇 Additional comments (5)
pkg/tnf/pkg/jobs/batch.go (1)

42-48: LGTM: CronJob decoder follows established pattern.

The function correctly mirrors ReadJobV1OrDie and reuses the existing codec factory. The panic-on-error behavior is appropriate for the "OrDie" convention.

bindata/tnfdeployment/cronjob.yaml (1)

35-40: Verify privileged settings are necessary for pacemaker operations.

The static analysis tools flag the privileged security context, privilege escalation, and hostPID: true as security concerns. For pacemaker health checks that execute pcs status, these elevated privileges are likely necessary. However, please confirm:

  1. Does pcs status --full require privileged access and hostPID?
  2. Can the command run with a more restrictive security context?
  3. Are there audit/compliance requirements that need documentation for these privileges?

If these privileges are confirmed necessary, consider adding a comment in the manifest explaining why elevated privileges are required.

pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)

1-87: Fixture looks good

Well-formed healthy-state XML; suitable for happy-path tests.

pkg/tnf/pacemaker/v1alpha1/register.go (1)

14-31: Scheme registration looks correct

GroupVersion, AddKnownTypes, and AddToScheme are wired properly.

pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)

1-121: Generated deepcopy looks correct.

Covers nested structs, slices, and runtime.Object; no issues spotted.

kind: CronJob
metadata:
name: <injected>
namespace: openshift-etcd

⚠️ Potential issue | 🟠 Major

Hardcoded namespace creates inconsistency with controller.

The namespace is hardcoded to openshift-etcd, but the CronJobController accepts a Namespace field in CronJobConfig (line 47 in cronjob_controller.go). The injectValues method (lines 180-222 in cronjob_controller.go) doesn't replace the namespace placeholder, so if a caller provides a different namespace, the manifest and controller will be out of sync.

Change line 5 to use a placeholder:

-  namespace: openshift-etcd
+  namespace: <injected>

Then update injectValues in cronjob_controller.go to inject the namespace (this is already flagged in past review comments for that file).

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In bindata/tnfdeployment/cronjob.yaml around line 5, the namespace is hardcoded
to "openshift-etcd" which will diverge from the CronJobController's configurable
Namespace; replace that line with a namespace placeholder (e.g. namespace: {{
.Namespace }}). Then update cronjob_controller.go (see CronJobConfig Namespace
field around line 47 and injectValues method around lines 180-222) to ensure
injectValues replaces the namespace placeholder with the configured Namespace
value when rendering the manifest; ensure nil/empty Namespace falls back to the
original default if needed and that tests and manifest rendering use the same
placeholder key.

Comment on lines +51 to +57
rawXML:
description: RawXML contains the raw XML output from pcs status xml command
type: string
collectionError:
description: CollectionError contains any error encountered while collecting status
type: string
summary:

⚠️ Potential issue | 🟠 Major

Guard against oversized rawXML (1 MiB object size limit)

Storing full pcs/crm_mon XML can exceed Kubernetes object size limits, causing updates to fail under stress. Either truncate in the collector or constrain via schema.

Suggested schema bound; pair with collector-side truncation:

             rawXML:
               description: RawXML contains the raw XML output from pcs status xml command
               type: string
+              maxLength: 262144  # ~256 KiB safety cap; adjust as needed

Optionally add a boolean status.summary.rawXMLTruncated and trim in the collector when exceeding the cap.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
rawXML:
description: RawXML contains the raw XML output from pcs status xml command
type: string
collectionError:
description: CollectionError contains any error encountered while collecting status
type: string
summary:
rawXML:
description: RawXML contains the raw XML output from pcs status xml command
type: string
maxLength: 262144 # ~256 KiB safety cap; adjust as needed
collectionError:
description: CollectionError contains any error encountered while collecting status
type: string
summary:
🤖 Prompt for AI Agents
In manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml around lines
51-57, the rawXML status field is unbounded and can exceed Kubernetes 1 MiB
object size limits; add a schema constraint (e.g. maxLength: 1048576) on rawXML
and introduce a new boolean status.summary.rawXMLTruncated field to indicate
truncation, and update the collector to truncate the XML to that cap before
writing the CR status (ensuring truncation preserves valid encoding/closing tags
as needed).
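A collector-side truncation sketch to pair with the schema cap; the byte cap constant and the rawXMLTruncated flag are assumptions, not existing fields:

// maxRawXMLBytes mirrors the CRD maxLength cap; keep the two values in sync.
const maxRawXMLBytes = 262144

// truncateRawXML trims the payload to the cap and reports whether it did so,
// allowing the collector to set a hypothetical summary.rawXMLTruncated flag.
// A production version should also avoid cutting mid-rune or mid-tag.
func truncateRawXML(raw string) (string, bool) {
	if len(raw) <= maxRawXMLBytes {
		return raw, false
	}
	return raw[:maxRawXMLBytes], true
}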

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from 59eaec8 to 85b3229 Compare October 17, 2025 14:52
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
bindata/tnfdeployment/cronjob.yaml (1)

5-5: Namespace hardcoding issue already flagged.

This issue was already identified in a previous review.

pkg/tnf/pkg/jobs/cronjob_controller.go (2)

48-48: Fix misleading cron schedule example – Kubernetes CronJob uses 5 fields, not 6.

The example "*/30 * * * * *" suggests a 6-field cron with seconds, but Kubernetes CronJob uses standard 5-field cron. Change to "*/30 * * * *" (every 30 minutes) and update the comment.

Apply this diff:

-	Schedule  string   // Cron schedule expression (e.g., "*/30 * * * * *" for every 30 seconds)
+	Schedule  string   // Cron schedule expression (e.g., "*/30 * * * *" for every 30 minutes)

180-222: Fix YAML injection risk and add namespace injection.

Two issues:

  1. YAML injection: Line 191 uses fmt.Sprintf(`"%s"`, cmd), which doesn't escape quotes or backslashes. If cmd contains `"` or `\`, the YAML breaks. Use strconv.Quote for proper escaping.
  2. Missing namespace injection: The manifest never injects namespace. If the template contains namespace: <injected>, it won't be replaced.

Apply this diff:

 package jobs
 
 import (
 	"context"
 	"fmt"
+	"strconv"
 	"strings"
 	"time"
 func (c *CronJobController) injectValues(manifest []byte) []byte {
 	result := string(manifest)
 
 	// Build command string for YAML array format
 	var commandBuilder strings.Builder
 	commandBuilder.WriteString("[")
 	for i, cmd := range c.command {
 		if i > 0 {
 			commandBuilder.WriteString(", ")
 		}
-		commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd))
+		commandBuilder.WriteString(strconv.Quote(cmd))
 	}
 	commandBuilder.WriteString("]")
 	commandStr := commandBuilder.String()
 
 	// Track which line we're on to do context-aware replacement
 	lines := strings.Split(result, "\n")
 	for i, line := range lines {
+		// Replace namespace if present in template
+		if strings.Contains(line, "namespace: <injected>") {
+			lines[i] = strings.Replace(line, "<injected>", c.namespace, 1)
+		}
 		// Replace name (appears in metadata.name)
 		if strings.Contains(line, "name: <injected>") {
 			lines[i] = strings.Replace(line, "<injected>", c.name, 1)
pkg/cmd/pacemakerstatuscollector/collector.go (1)

360-362: Use proper error type checking instead of string matching.

Line 362 uses brittle string matching to detect not-found errors. Use apierrors.IsNotFound(err) for robust error classification.

Apply this diff:

 import (
 	"context"
 	"encoding/xml"
 	"fmt"
 	"os"
 	"strings"
 	"time"
 
 	"github.com/spf13/cobra"
+	apierrors "k8s.io/apimachinery/pkg/api/errors"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	if err != nil {
 		// Create new PacemakerStatus if it doesn't exist
-		if strings.Contains(err.Error(), "not found") {
+		if apierrors.IsNotFound(err) {
 			newStatus := &v1alpha1.PacemakerStatus{
🧹 Nitpick comments (2)
pkg/tnf/pacemaker/healthcheck.go (2)

401-402: Consider passing context to recordHealthCheckEvents for consistency.

While recordHealthCheckEvents currently performs no blocking I/O, passing ctx would future-proof the function and align with standard controller patterns where context flows through the call chain.

Based on learnings


438-441: XML size validation is defensive; etcd limits already constrain CR size.

The size check here occurs after the XML is in memory (from the CR). However, etcd/apiserver limits (typically ~1.5MB) already prevent excessively large CRs from being stored. This validation is still useful as a secondary defense-in-depth measure.

For stricter protection, consider future enhancement using io.LimitReader when the collector writes the CR.

Based on learnings

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 59eaec8 and 85b3229.

📒 Files selected for processing (17)
  • bindata/etcd/etcd-sa-clusterrole.yaml (1 hunks)
  • bindata/etcd/etcd-sa-clusterrolebinding.yaml (1 hunks)
  • bindata/tnfdeployment/cronjob.yaml (1 hunks)
  • cmd/cluster-etcd-operator/main.go (2 hunks)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
  • pkg/cmd/pacemakerstatuscollector/collector.go (1 hunks)
  • pkg/operator/starter.go (2 hunks)
  • pkg/tnf/operator/starter.go (4 hunks)
  • pkg/tnf/operator/starter_test.go (4 hunks)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/doc.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/register.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/types.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • pkg/tnf/pkg/jobs/batch.go (1 hunks)
  • pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • pkg/tnf/pacemaker/v1alpha1/doc.go
  • bindata/etcd/etcd-sa-clusterrolebinding.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
  • pkg/tnf/pkg/jobs/batch.go
  • pkg/tnf/pacemaker/v1alpha1/types.go
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml

[medium] 1-47: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-47: Container should not be privileged

(CKV_K8S_16)


[medium] 1-47: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-47: Containers should not share the host process ID namespace

(CKV_K8S_17)

🔇 Additional comments (34)
bindata/tnfdeployment/cronjob.yaml (1)

35-40: Verify privileged access is necessary for pacemaker status collection.

The CronJob runs with privileged: true, allowPrivilegeEscalation: true, and hostPID: true. While static analysis flags these as security concerns, they may be required for the pcs status command to access the pacemaker cluster state on the host.

Please confirm:

  1. Is host PID namespace access (hostPID: true) required for the pacemaker status collector to run pcs status?
  2. Can the privilege requirements be reduced (e.g., specific capabilities instead of fully privileged)?
  3. Is there documentation about the minimum privileges needed for pcs status execution?

If these privileges are necessary, consider adding a comment in the manifest explaining why privileged access is required for operational clarity.

cmd/cluster-etcd-operator/main.go (1)

17-17: LGTM!

The pacemaker status collector command is properly imported and registered with the root command.

Also applies to: 82-82

pkg/operator/starter.go (2)

247-248: LGTM!

The RBAC manifests for the etcd service account are correctly added to the static resource controller, enabling PacemakerStatus resource access.


632-633: LGTM!

The function call is properly updated to pass the additional dynamicClient and AlivenessChecker parameters required for Pacemaker controller initialization.

bindata/etcd/etcd-sa-clusterrole.yaml (1)

1-24: LGTM!

The ClusterRole defines appropriate permissions for the pacemaker status collector to manage PacemakerStatus resources and their status subresource.

pkg/tnf/operator/starter_test.go (1)

35-35: LGTM!

The test is properly updated to:

  • Import the necessary health package
  • Create an alivenessChecker instance
  • Pass the additional parameters to HandleDualReplicaClusters
  • Initialize a minimal rest.Config for testing

Also applies to: 41-41, 96-109, 260-270

pkg/tnf/operator/starter.go (2)

38-38: LGTM!

The function signature is properly extended to accept dynamicClient and alivenessChecker, and the new runPacemakerControllers function is correctly invoked to start Pacemaker-related controllers.

Also applies to: 40-40, 76-77, 98-98


334-364: Verify CronJob schedule aligns with requirements.

The CronJob schedule is set to "* * * * *" (every minute), but the PR description states the controller "monitors pacemaker cluster status every 30 seconds."

Please confirm:

  1. Is running every minute (instead of every 30 seconds) acceptable for your monitoring requirements?
  2. If 30-second intervals are required, consider either:
    • Running the job more frequently (though Kubernetes CronJobs don't support sub-minute schedules natively)
    • Implementing a loop inside the collector to sample twice per minute
    • Using a different deployment pattern (e.g., Deployment with continuous monitoring)

If every minute is acceptable, update the PR description to match the actual schedule.

pkg/tnf/pacemaker/v1alpha1/register.go (1)

1-31: LGTM!

The API registration follows standard Kubernetes patterns:

  • Properly defines GroupName and Version constants
  • Initializes SchemeGroupVersion and SchemeBuilder
  • Registers PacemakerStatus and PacemakerStatusList types
  • Provides helper functions for resource registration
pkg/tnf/pacemaker/healthcheck_test.go (1)

1-903: LGTM!

This is a comprehensive test suite with excellent coverage:

  • HealthCheck construction and initialization
  • PacemakerStatus retrieval (healthy, errors, stale)
  • Operator status updates
  • Event recording with proper deduplication across different time windows
  • XML parsing for various cluster states (healthy, offline, standby, failures)
  • Node and resource status collection
  • Edge cases (large XML, invalid XML, missing state)

The use of testdata fixtures and dynamic timestamp generation for time-sensitive tests (like getRecentFailuresXML) is a good practice.

pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)

1-121: LGTM: Autogenerated deepcopy implementations are correct.

The generated deepcopy methods follow standard Kubernetes patterns: proper nil checks, correct slice allocation for PacemakerStatusList.Items, nested DeepCopyInto calls for embedded types, and appropriate DeepCopyObject implementations.

pkg/tnf/pacemaker/healthcheck.go (12)

1-99: LGTM: Imports and constants are well-organized.

The constants are clearly defined and documented. The expectedNodeCount = 2 reflects the Two-Node OpenShift with Fencing design constraint, as clarified in past discussions.


311-383: LGTM: Controller constructor follows standard operator patterns.

The REST client and informer setup are correct. Using panic for scheme registration and client creation errors is appropriate for initialization failures that should halt startup.


576-600: LGTM: Continue on parse errors is the correct resilient behavior.

Continuing when timestamp parsing fails (line 590) ensures the health check processes other valid operations. The error is logged at V(4) for debugging. Failing hard on malformed timestamps would be too strict and could mask legitimate failures in other operations.

Based on learnings


682-689: Condition ownership is established by the unique type name.

The condition type PacemakerHealthCheckDegraded (line 91) is unique to this controller, establishing clear ownership. Per Kubernetes conventions, condition types should be managed by a single controller, so clearing without checking the reason is safe.

Based on learnings


703-728: LGTM: Event cleanup logic is correct.

The cleanup (lines 712-716) runs unconditionally on every call and only removes entries older than maxWindow (1 hour). This is intentional and correct – it prevents unbounded map growth while preserving recent events within their deduplication windows.

Based on learnings


101-273: LGTM: XML structures are well-defined for pacemaker status parsing.

The types correctly mirror the pacemaker XML schema and use appropriate xml tags for unmarshaling.


452-499: LGTM: Parsing orchestration is well-structured.

The function properly validates critical structures, checks pacemaker state, and delegates to specialized collectors. Early return on non-running state with explicit error status is appropriate.


501-524: LGTM: Node status collection correctly handles the Two-Node design.

The checks for expected node count (with warning for transient 1-node state during replacement) and online/standby validation align with the Two-Node OpenShift architecture requirements.


526-574: LGTM: Resource validation correctly identifies kubelet and etcd placement.

The helper function pattern and resource agent prefix matching properly track resource distribution across nodes.


602-634: LGTM: Fencing and status determination logic is sound.

Fencing event detection with appropriate time window (24 hours) and simple status determination based on error/warning presence are appropriate for this health check.


636-701: LGTM: Operator condition management follows library-go patterns.

Condition updates use the standard v1helpers.UpdateStaticPodStatus API correctly. The separation of set/clear/update functions is clean.


730-781: LGTM: Event recording with deduplication is well-implemented.

The event categorization (fencing vs. failed actions) with distinct reasons and deduplication windows, plus tracking previous status to record healthy transitions, is thorough and production-ready.

pkg/tnf/pkg/jobs/cronjob_controller.go (3)

21-88: LGTM: Controller setup follows standard factory patterns.

The constants, types, and constructor are well-organized. Default namespace fallback and resync interval are appropriate.


90-178: LGTM: CronJob lifecycle management is correct.

The sync/ensure/create/update/build flow properly handles the CronJob lifecycle, including AlreadyExists handling and ResourceVersion preservation for updates.


224-249: LGTM: Update detection correctly compares schedule, command, and image.

The comparison logic properly checks all three configurable fields and handles array length mismatches.

pkg/cmd/pacemakerstatuscollector/collector.go (8)

1-50: LGTM: Imports and constants are well-organized.

The constants clearly define timeouts, size limits, and time formats. The command uses sudo -n for non-interactive execution, which is correct.


52-138: LGTM: XML structure duplication is acceptable to avoid circular dependencies.

The comment (line 53) explains why these types are duplicated from healthcheck.go. This is a reasonable trade-off for maintaining clean package dependencies.


140-177: LGTM: Command structure and runCollector orchestration are sound.

The 30-second context timeout (line 156) provides a reasonable upper bound for collection and CR update operations. Error handling propagates failures appropriately.


179-217: LGTM: Collection logic is robust and defensive.

The nested timeout (10s for exec, line 181), size validation (line 196), and returning raw XML even on parse failure (line 210) make this resilient to various failure modes.


219-265: LGTM: Summary building comprehensively captures cluster state.

The function correctly counts online nodes, started resources, and checks for recent failures/fencing events. The logic aligns with the health check expectations.


267-301: LGTM: Time-window checks for failures and fencing are correct.

The time window constants (5 minutes for failures, 24 hours for fencing) match the health check controller. Error handling continues on parse failure, which is appropriate.


303-319: LGTM: Kubernetes client creation with in-cluster fallback is standard.

The in-cluster config attempt followed by kubeconfig fallback correctly supports both in-cluster (Job/CronJob) and local execution contexts.


363-414: LGTM: CR update logic correctly handles create vs. update paths.

The create path (lines 363-389) properly sets TypeMeta and ObjectMeta. The update path (lines 394-413) correctly uses the status subresource and sets ObservedGeneration.

The goal of this change was to ensure that the CEO permissions for the pcs status cronjob
were elevated enough to be able to gather pcs status output. One future improvement would
be to move this out of the CEO entirely and have the status gathering be done by an
alert agent that posts regular updates to the CEO.

To alleviate the complexity of a future transition to this model, I've structured the code
so that the healthcheck controller doesn't care where the updates to the PacemakerStatus CRD
come from - it just monitors the CR for changes. In the future, we can write an alert agent
that posts new PacemakerStatus updates as they happen, and the CEO can retire the
cronjob entirely.
@jaypoulz jaypoulz force-pushed the OCPEDGE-2084-pacemaker-health-check branch from 85b3229 to 9f5f422 Compare October 17, 2025 16:34
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (5)
bindata/tnfdeployment/cronjob.yaml (1)

5-5: Hardcoded namespace creates inconsistency with controller.

The namespace is hardcoded to openshift-etcd, but the CronJobController accepts a Namespace field in CronJobConfig. The controller's injectValues method doesn't replace the namespace placeholder, causing potential mismatches.

As flagged in previous reviews: replace with a placeholder and update the controller to inject it.

pkg/tnf/pkg/jobs/cronjob_controller.go (3)

48-51: Cron schedule comment shows invalid 6-field expression.

Kubernetes CronJob expects a standard 5-field cron expression (minute, hour, day, month, weekday), not 6 fields. The example showing "*/30 * * * * *" with seconds is misleading.

Update to a valid 5-field example like "*/30 * * * *" (every 30 minutes).


184-194: Command array construction lacks proper YAML escaping.

The command building uses fmt.Sprintf(`"%s"`, cmd), which doesn't escape quotes, backslashes, or other special characters. Arguments containing these characters will break the YAML syntax when injected.

As flagged in previous reviews: use strconv.Quote() for each command element to ensure proper escaping, and add the strconv import.


196-222: Namespace is not injected into the manifest.

The injectValues method replaces placeholders for name, schedule, image, and command, but never injects the namespace. The manifest template has a hardcoded namespace, creating an inconsistency with the controller's configurable Namespace field.

As flagged in previous reviews: add a replacement branch to inject c.namespace when the line contains "namespace: <injected>".

pkg/cmd/pacemakerstatuscollector/collector.go (1)

360-392: Use apierrors.IsNotFound instead of string matching.

The brittle string check strings.Contains(err.Error(), "not found") on line 362 should be replaced with proper API error type checking using apierrors.IsNotFound(err).

Apply this fix:

 import (
   "context"
   "encoding/xml"
   "fmt"
   "os"
   "strings"
   "time"
 
   "github.com/spf13/cobra"
+  apierrors "k8s.io/apimachinery/pkg/api/errors"
   metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
   if err != nil {
     // Create new PacemakerStatus if it doesn't exist
-    if strings.Contains(err.Error(), "not found") {
+    if apierrors.IsNotFound(err) {
       newStatus := &v1alpha1.PacemakerStatus{
🧹 Nitpick comments (6)
pkg/tnf/operator/starter.go (1)

313-346: Defensive guard for KubeConfig to avoid panics.

If controllerContext.KubeConfig is ever nil (tests or misconfig), NewHealthCheck will panic when creating the REST client. Add a fast check and skip starting the healthcheck in that case (log at Warning).

Example:

 func runPacemakerControllers(..., etcdInformer operatorv1informers.EtcdInformer) {
   // Start the healthcheck controller to watch PacemakerStatus CRs
   // This can start early since it just subscribes to CR updates
   klog.Infof("starting Pacemaker healthcheck controller")
-  healthCheckController := pacemaker.NewHealthCheck(
+  if controllerContext.KubeConfig == nil {
+    klog.Warning("skipping Pacemaker healthcheck controller: KubeConfig is nil")
+  } else {
+    healthCheckController := pacemaker.NewHealthCheck(
       alivenessChecker,
       operatorClient,
       kubeClient,
       controllerContext.EventRecorder,
       controllerContext.KubeConfig,
     )
-  go healthCheckController.Run(ctx, 1)
+    go healthCheckController.Run(ctx, 1)
+  }
pkg/tnf/pacemaker/healthcheck.go (2)

320-335: Avoid panics on REST client creation; prefer graceful failure.

Panic will crash the operator if RESTClientFor fails (bad rest.Config, serializer, env). Return an error from constructor (or log and skip starting) and let the caller decide, to keep the operator up.

Minimal approach (preserve signature): log error and return a no-op controller:

- restClient, err := rest.RESTClientFor(pacemakerConfig)
- if err != nil {
-   panic(fmt.Errorf("failed to create REST client: %w", err))
- }
+ restClient, err := rest.RESTClientFor(pacemakerConfig)
+ if err != nil {
+   klog.Errorf("PacemakerHealthCheck disabled: failed to create REST client: %v", err)
+   return factory.New().ToController("PacemakerHealthCheckDisabled", events.NewInMemoryRecorder("noop"))
+ }

Alternatively, change NewHealthCheck to return (factory.Controller, error) and bubble up.
As per coding guidelines.


730-781: Event deduplication is sensible; minor nit: avoid cleanup per-item.

Current cleanup runs for each warning/error. Consider a periodic cleanup (once per sync) to reduce lock contention.

pkg/tnf/pacemaker/healthcheck_test.go (1)

32-61: Consider removing or clarifying the TODO comment.

The TODO comment on line 56 states "TODO: Mock this for actual tests", but the function is already used successfully in TestNewHealthCheck (line 168-174). If the nil REST client and informer are acceptable for basic constructor testing, consider either:

  • Removing the TODO comment, or
  • Clarifying what additional testing would require proper mocking
pkg/cmd/pacemakerstatuscollector/collector.go (2)

303-333: Refactor duplicate REST config retrieval logic.

The code for obtaining the Kubernetes REST config is duplicated between createKubernetesClient (lines 304-316) and updatePacemakerStatusCR (lines 323-333). Consider extracting this into a shared helper function to improve maintainability.

Example refactor:

func getRESTConfig() (*rest.Config, error) {
  // Try in-cluster config first
  config, err := rest.InClusterConfig()
  if err != nil {
    // Fall back to kubeconfig
    kubeconfig := os.Getenv("KUBECONFIG")
    if kubeconfig == "" {
      kubeconfig = os.Getenv("HOME") + "/.kube/config"
    }
    config, err = clientcmd.BuildConfigFromFlags("", kubeconfig)
    if err != nil {
      return nil, err
    }
  }
  return config, nil
}

func createKubernetesClient() (*kubernetes.Clientset, error) {
  config, err := getRESTConfig()
  if err != nil {
    return nil, err
  }
  return kubernetes.NewForConfig(config)
}

func updatePacemakerStatusCR(ctx context.Context, client *kubernetes.Clientset, rawXML string, summary v1alpha1.PacemakerSummary, collectionError string) error {
  config, err := getRESTConfig()
  if err != nil {
    return err
  }
  // ... rest of function
}

52-138: XML structure duplication is documented.

The comment on line 53 notes that these XML structures are copied from healthcheck.go to avoid circular dependencies. While code duplication is generally undesirable, this appears to be an intentional trade-off documented in the code.

Consider moving these shared structures to a separate package (e.g., pkg/tnf/pacemaker/types or pkg/tnf/pacemaker/internal) if the circular dependency can be resolved.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 85b3229 and 9f5f422.

📒 Files selected for processing (18)
  • bindata/etcd/etcd-sa-clusterrole.yaml (1 hunks)
  • bindata/etcd/etcd-sa-clusterrolebinding.yaml (1 hunks)
  • bindata/tnfdeployment/cronjob.yaml (1 hunks)
  • cmd/cluster-etcd-operator/main.go (2 hunks)
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
  • pkg/cmd/pacemakerstatuscollector/collector.go (1 hunks)
  • pkg/operator/ceohelpers/external_etcd_status.go (2 hunks)
  • pkg/operator/starter.go (2 hunks)
  • pkg/tnf/operator/starter.go (5 hunks)
  • pkg/tnf/operator/starter_test.go (4 hunks)
  • pkg/tnf/pacemaker/healthcheck.go (1 hunks)
  • pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/doc.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/register.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/types.go (1 hunks)
  • pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • pkg/tnf/pkg/jobs/batch.go (1 hunks)
  • pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • bindata/etcd/etcd-sa-clusterrolebinding.yaml
  • pkg/operator/starter.go
  • manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
  • cmd/cluster-etcd-operator/main.go
  • bindata/etcd/etcd-sa-clusterrole.yaml
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml

[medium] 1-47: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-47: Container should not be privileged

(CKV_K8S_16)


[medium] 1-47: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-47: Containers should not share the host process ID namespace

(CKV_K8S_17)

🔇 Additional comments (18)
pkg/tnf/pkg/jobs/batch.go (1)

42-48: LGTM!

The function correctly mirrors the existing ReadJobV1OrDie pattern, using the same decoder infrastructure to parse CronJob manifests. The implementation is consistent and appropriate for this use case.

bindata/tnfdeployment/cronjob.yaml (1)

35-40: Privileged access is appropriate for pacemaker operations.

The privileged security context and hostPID: true are necessary for the collector to execute pcs status commands and interact with the pacemaker cluster on the host. Static analysis warnings are expected but can be accepted for this system-level operational use case.

pkg/tnf/pkg/jobs/cronjob_controller.go (3)

90-111: LGTM!

The sync method correctly checks operator state and ensures the CronJob only when in Managed state. Error handling and logging are appropriate.


113-132: LGTM!

The ensureCronJob method properly handles both creation and update paths, with correct error handling for NotFound vs other errors.


224-249: LGTM!

The needsUpdate method correctly compares schedule, command array, and image to determine if an update is required. Bounds checking is appropriate.

pkg/tnf/pacemaker/v1alpha1/doc.go (1)

1-3: LGTM.

Package doc and group annotation look good.

pkg/tnf/operator/starter_test.go (1)

96-110: Aliveness checker wiring looks correct.

Passing a concrete MultiAlivenessChecker into HandleDualReplicaClusters matches the new signature.

pkg/operator/ceohelpers/external_etcd_status.go (1)

167-201: Nice generic waiter; sync + context-aware polling is appropriate.

Gates on informer sync, retries with PollUntilContextTimeout, and treats transient errors as retryable. LGTM.

pkg/tnf/pacemaker/v1alpha1/register.go (1)

1-31: LGTM.

SchemeGroupVersion, AddToScheme, and known types are correctly wired.

pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)

1-121: Autogen deepcopy looks correct.

Methods cover all types; safe copies for meta/timestamps.

pkg/tnf/operator/starter.go (2)

229-237: Good: bootstrap wait refactored to shared helper.

Using ceohelpers.WaitForEtcdCondition simplifies readiness gating and centralizes polling behavior.


348-367: CronJob schedule fix looks good.

Five-field spec "* * * * *" is valid for Kubernetes CronJobs; controller wiring is straightforward.

pkg/tnf/pacemaker/healthcheck.go (1)

509-523: Node-state handling matches PR intent.

You warn on unexpected node count and treat standby as offline. Clear messages; aligns with TNF design.

pkg/tnf/pacemaker/healthcheck_test.go (2)

126-140: Excellent use of dynamic timestamps!

Generating dynamic timestamps within the test window (line 131-133) is a best practice that prevents time-based test flakiness. This approach ensures tests remain reliable regardless of when they execute.


143-903: Comprehensive test coverage!

The test suite provides excellent coverage across:

  • Happy path scenarios (healthy clusters, online nodes, started resources)
  • Error conditions (offline nodes, standby nodes, collection errors, stale status)
  • Edge cases (no nodes, mismatched counts, large XML, invalid XML, missing state)
  • Event deduplication across different time windows and state transitions
  • All major parsing and status determination functions

The tests are well-structured, clearly named, and follow consistent patterns.

pkg/tnf/pacemaker/v1alpha1/types.go (1)

1-86: Well-structured API types following Kubernetes conventions.

The type definitions are clean and follow standard Kubernetes API patterns with appropriate kubebuilder markers for CRD generation, deepcopy generation, and subresource configuration.

pkg/cmd/pacemakerstatuscollector/collector.go (2)

140-177: Clean command structure and error handling.

The cobra command setup and main orchestration logic follow good patterns with proper context timeout, error propagation, and logging.


179-301: Well-structured collection and parsing logic.

The collection, parsing, and summary-building functions are cleanly separated with good error handling, size validation (maxXMLSize check on line 196), and time-based filtering for recent events.

Comment on lines +260 to 271
// Create a minimal rest.Config for testing
restConfig := &rest.Config{
Host: "https://localhost:6443",
}

return args{
initErr: nil, // Default to no error
ctx: context.Background(),
controllerContext: &controllercmd.ControllerContext{
EventRecorder: eventRecorder,
KubeConfig: restConfig,
},

🛠️ Refactor suggestion | 🟠 Major

Prevent test goroutine leaks: return the scoped ctx instead of context.Background.

Controllers started by HandleDualReplicaClusters use args.ctx; returning context.Background bypasses cancellation and can leak goroutines across tests. Return the ctx created in getArgs.

Apply:

@@
-  return args{
-    initErr: nil, // Default to no error
-    ctx:     context.Background(),
+  return args{
+    initErr: nil, // Default to no error
+    ctx:     ctx,
     controllerContext: &controllercmd.ControllerContext{
       EventRecorder: eventRecorder,
       KubeConfig:    restConfig,
     },
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Create a minimal rest.Config for testing
restConfig := &rest.Config{
Host: "https://localhost:6443",
}
return args{
initErr: nil, // Default to no error
ctx: context.Background(),
controllerContext: &controllercmd.ControllerContext{
EventRecorder: eventRecorder,
KubeConfig: restConfig,
},
// Create a minimal rest.Config for testing
restConfig := &rest.Config{
Host: "https://localhost:6443",
}
return args{
initErr: nil, // Default to no error
ctx: ctx,
controllerContext: &controllercmd.ControllerContext{
EventRecorder: eventRecorder,
KubeConfig: restConfig,
},
🤖 Prompt for AI Agents
In pkg/tnf/operator/starter_test.go around lines 260 to 271, the test currently
returns context.Background() for args.ctx which bypasses the scoped cancellable
ctx created by getArgs and can leak controller goroutines across tests; change
the returned args to set ctx: ctx (the local scoped context created earlier)
instead of context.Background(), so the test harness can cancel it and prevent
goroutine leaks.

Comment on lines +23 to +28
// PacemakerStatusSpec defines the desired state of PacemakerStatus
type PacemakerStatusSpec struct {
// NodeName identifies which node this status is for
// +optional
NodeName string `json:"nodeName,omitempty"`
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify if NodeName field is needed.

The NodeName field in PacemakerStatusSpec is marked optional but doesn't appear to be set anywhere in the collector implementation (pkg/cmd/pacemakerstatuscollector/collector.go lines 363-377, 394-399). The collector uses a singleton CR named "cluster" without node-specific distinction.

If node-specific status tracking isn't part of the current design, consider removing this field to avoid confusion. If it's planned for future use, consider adding a comment explaining its intended purpose.

Run the following to verify NodeName usage across the codebase:


🏁 Script executed:

#!/bin/bash
# Search for any references to NodeName in PacemakerStatus context
rg -n -C3 'NodeName' --type=go

Length of output: 264911



---

🏁 Script executed:

```shell
#!/bin/bash
# Search specifically for PacemakerStatus usage and NodeName references
echo "=== Search 1: PacemakerStatus creation/instantiation ==="
rg -n 'PacemakerStatus' --type=go | grep -v vendor | head -20

echo ""
echo "=== Search 2: Collector code that creates PacemakerStatus ==="
rg -n 'pacemakerstatuscollector' --type=go | grep -v vendor | head -20

echo ""
echo "=== Search 3: Direct reference to PacemakerStatusSpec with NodeName ==="
rg -n 'PacemakerStatusSpec.*NodeName\|NodeName.*PacemakerStatusSpec' --type=go | grep -v vendor

echo ""
echo "=== Search 4: Look for any .NodeName assignments in collector context ==="
rg -n '\.NodeName\s*=' --type=go pkg/cmd/pacemakerstatuscollector/ -A2 -B2 2>/dev/null || echo "File not found, searching broadly:"
find pkg -name "*collector*" -type f

Length of output: 2866


🏁 Script executed:

#!/bin/bash
# View the collector.go to see how PacemakerStatus objects are created
echo "=== Collector.go content (showing PacemakerStatus creation) ==="
rg -n 'PacemakerStatus' pkg/cmd/pacemakerstatuscollector/collector.go -A5 -B5

echo ""
echo "=== Search for Spec assignment in collector ==="
rg -n 'Spec' pkg/cmd/pacemakerstatuscollector/collector.go -C3 | head -40

Length of output: 4469


Remove unused NodeName field from PacemakerStatusSpec or document its intended purpose.

The NodeName field in PacemakerStatusSpec is never set by the collector implementation. Lines 363–376 and 395–398 of collector.go show that only the Status fields are populated (LastUpdated, RawXML, CollectionError, Summary); the Spec is never touched. Combined with the singleton CR design ("cluster"), this field appears unused. Either remove it or add a comment explaining future intent.

🤖 Prompt for AI Agents
In pkg/tnf/pacemaker/v1alpha1/types.go around lines 23–28, the
PacemakerStatusSpec.NodeName field is never set by the collector and does not fit
the singleton "cluster" CR design. Remove the NodeName field from the
PacemakerStatusSpec struct, delete its json tag, and update any comments;
regenerate the CRDs and deepcopy code (run controller-gen or go generate as
appropriate), fix any compile-time references, and run go fmt/go vet to ensure
cleanliness. If you prefer to keep the field instead, replace it with a clear
comment explaining its future intent and why it is currently unused, and still
regenerate CRDs/deepcopy and run the formatting checks.
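
For illustration, a sketch of the two options, assuming the field lives in pkg/tnf/pacemaker/v1alpha1/types.go (the second type name is invented only to show both variants side by side; in the real file it would remain PacemakerStatusSpec):

```go
package v1alpha1

// Option 1: drop the unused field entirely; the singleton "cluster" CR needs
// no per-node spec today.
type PacemakerStatusSpec struct{}

// Option 2: keep the field but state why it exists and that nothing sets it yet.
type PacemakerStatusSpecWithNode struct {
	// NodeName is reserved for a possible future per-node status mode.
	// The current collector writes only the singleton "cluster" CR and
	// never populates this field.
	// +optional
	NodeName string `json:"nodeName,omitempty"`
}
```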

Contributor

openshift-ci bot commented Oct 17, 2025

@jaypoulz: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-disruptive-ovn | 81984c7 | link | false | /test e2e-gcp-disruptive-ovn |
| ci/prow/e2e-aws-etcd-recovery | 81984c7 | link | false | /test e2e-aws-etcd-recovery |
| ci/prow/e2e-vsphere-ovn-etcd-scaling | 81984c7 | link | false | /test e2e-vsphere-ovn-etcd-scaling |
| ci/prow/e2e-gcp-ovn-etcd-scaling | 81984c7 | link | false | /test e2e-gcp-ovn-etcd-scaling |
| ci/prow/e2e-gcp-disruptive | 81984c7 | link | false | /test e2e-gcp-disruptive |
| ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown | 81984c7 | link | false | /test e2e-metal-ovn-ha-cert-rotation-shutdown |
| ci/prow/e2e-aws-disruptive-ovn | 81984c7 | link | false | /test e2e-aws-disruptive-ovn |
| ci/prow/e2e-aws-ovn-serial | 81984c7 | link | true | /test e2e-aws-ovn-serial |
| ci/prow/e2e-azure-ovn-etcd-scaling | 81984c7 | link | false | /test e2e-azure-ovn-etcd-scaling |
| ci/prow/e2e-aws-ovn-etcd-scaling | 81984c7 | link | true | /test e2e-aws-ovn-etcd-scaling |
| ci/prow/e2e-metal-assisted | 81984c7 | link | true | /test e2e-metal-assisted |
| ci/prow/e2e-aws-cpms | 81984c7 | link | true | /test e2e-aws-cpms |
| ci/prow/e2e-metal-ovn-two-node-fencing | 81984c7 | link | false | /test e2e-metal-ovn-two-node-fencing |
| ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown | 81984c7 | link | false | /test e2e-metal-ovn-sno-cert-rotation-shutdown |
| ci/prow/e2e-aws-disruptive | 81984c7 | link | false | /test e2e-aws-disruptive |
| ci/prow/e2e-gcp | 81984c7 | link | false | /test e2e-gcp |
| ci/prow/e2e-aws-ovn-serial-2of2 | 9f5f422 | link | true | /test e2e-aws-ovn-serial-2of2 |
| ci/prow/e2e-agnostic-ovn | 9f5f422 | link | true | /test e2e-agnostic-ovn |
| ci/prow/e2e-aws-ovn-single-node | 9f5f422 | link | true | /test e2e-aws-ovn-single-node |
| ci/prow/unit | 9f5f422 | link | true | /test unit |
| ci/prow/okd-scos-e2e-aws-ovn | 9f5f422 | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-aws-ovn-serial-1of2 | 9f5f422 | link | true | /test e2e-aws-ovn-serial-1of2 |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 9f5f422 | link | true | /test e2e-metal-ipi-ovn-ipv6 |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

hostNetwork: false
hostPID: true
priorityClassName: system-node-critical
serviceAccountName: etcd-sa
Member

Would it make sense to use the tnf-setup-manager SA, like we do in the other TNF jobs? That one also has the privileged securitycontextconstraints rule assigned, which I expect the pacemaker cronjob needs as well.


// Track which line we're on to do context-aware replacement
lines := strings.Split(result, "\n")
for i, line := range lines {
Member

tbh, I like using a JobHookFunc like in runJobController() much more than these string replacements... WDYT?
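
For comparison, a rough sketch of the hook-style approach, assuming the hook receives the decoded Job and mutates it in place (the hook type and the image-setting helper below are assumptions for illustration, not the actual runJobController API):

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobHookFunc mirrors the pattern the reviewer refers to: the controller hands
// the decoded Job to a function that mutates it, instead of rewriting the
// manifest text line by line.
type jobHookFunc func(job *batchv1.Job) error

// withCollectorImage returns a hook that swaps the container image, which is
// roughly what the current string replacement does.
func withCollectorImage(image string) jobHookFunc {
	return func(job *batchv1.Job) error {
		for i := range job.Spec.Template.Spec.Containers {
			job.Spec.Template.Spec.Containers[i].Image = image
		}
		return nil
	}
}

func main() {
	job := &batchv1.Job{
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "collector", Image: "placeholder"}},
				},
			},
		},
	}

	if err := withCollectorImage("registry.example/collector:latest")(job); err != nil {
		fmt.Println("hook failed:", err)
		return
	}
	fmt.Println(job.Spec.Template.Spec.Containers[0].Image)
}
```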

cmd.AddCommand(requestbackup.NewRequestBackupCommand(ctx))
cmd.AddCommand(rev.NewRevCommand(ctx))
cmd.AddCommand(backuprestore.NewBackupServer(ctx))
cmd.AddCommand(pacemakerstatuscollector.NewPacemakerStatusCollectorCommand())
Member

What about adding this in cmd/tnf-setup-runner instead, where all the other TNF-related commands live? (Yes, it's not about the setup part, but we could rename it.)
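
A sketch of what that registration might look like under cmd/tnf-setup-runner, assuming the existing root command is a plain cobra.Command (the import path and the command name are assumptions based on the repo layout, not the actual binary):

```go
package main

import (
	"os"

	"github.com/spf13/cobra"

	"github.com/openshift/cluster-etcd-operator/pkg/cmd/pacemakerstatuscollector"
)

func main() {
	// Register the collector alongside the other TNF commands instead of on
	// the operator CLI; whether to rename the binary is a separate question.
	root := &cobra.Command{Use: "tnf-setup-runner"}
	root.AddCommand(pacemakerstatuscollector.NewPacemakerStatusCollectorCommand())

	if err := root.Execute(); err != nil {
		os.Exit(1)
	}
}
```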

@@ -0,0 +1,414 @@
package pacemakerstatuscollector
Member

don't we want to have this in the tnf package?
