OCPEDGE-2084: Add pacemaker health check for ExternalEtcd clusters #1487
Conversation
@jaypoulz: This pull request references OCPEDGE-2084, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the branch this PR targets: the story was expected to target the "4.21.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough
Adds Pacemaker support: a new PacemakerStatus CRD and Go API types, a status-collector CLI plus CronJob to populate CRs from pcs XML, a Pacemaker healthcheck controller that reads CRs and updates operator conditions, operator wiring and helpers, RBAC and manifest updates, and a runtime/binary rename to tnf-runtime.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks: ❌ 1 warning, ✅ 2 passed
Warning: there were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it is a critical failure.
🔧 golangci-lint (2.5.0): can't load config: unsupported version of the configuration: "". See https://golangci-lint.run/docs/product/migration-guide for migration instructions.
Actionable comments posted: 4
🧹 Nitpick comments (7)
pkg/tnf/pacemaker/healthcheck.go (3)
71-75: Avoid event spam: emit "Healthy" only on transitions.
At 30s resync, always emitting a Healthy event will generate noisy duplicates. Track the last status and only emit on a transition to Healthy.
```diff
 type HealthCheck struct {
 	operatorClient v1helpers.StaticPodOperatorClient
 	kubeClient     kubernetes.Interface
 	eventRecorder  events.Recorder
+	lastOverallStatus string
 }
@@
-	// Record a normal event for healthy status
-	if status.OverallStatus == StatusHealthy {
-		c.eventRecorder.Eventf(EventReasonHealthy, "Pacemaker cluster is healthy - all nodes online and critical resources started")
-	}
+	// Record a normal event for healthy status only on transition
+	if status.OverallStatus == StatusHealthy && c.lastOverallStatus != StatusHealthy {
+		c.eventRecorder.Eventf(EventReasonHealthy, "Pacemaker cluster is healthy - all nodes online and critical resources started")
+	}
+	c.lastOverallStatus = status.OverallStatus
```
Also applies to: 446-450
347-356: Micro: precompile the timestamp regex in isWithinTimeWindow.
Avoid recompiling it on each iteration.
```diff
 func (c *HealthCheck) isWithinTimeWindow(line string, cutoffTime time.Time) bool {
 	timeFormats := []string{
 		"Mon Jan 2 15:04:05 2006",
 		"2006-01-02 15:04:05.000000Z",
 		"2006-01-02 15:04:05Z",
 	}
-	for _, format := range timeFormats {
-		re := regexp.MustCompile(`'([^']+)'`)
-		matches := re.FindStringSubmatch(line)
+	tsRe := regexp.MustCompile(`'([^']+)'`)
+	for _, format := range timeFormats {
+		matches := tsRe.FindStringSubmatch(line)
 		if len(matches) > 1 {
 			if t, err := time.Parse(format, matches[1]); err == nil {
 				return t.After(cutoffTime)
 			}
 		}
 	}
```
71-75: Unused dependency: kubeClient is stored but never used.
Drop it from the struct/constructor or put it to use (e.g., namespacing events, future lookups). Keeping it increases surface without value.
```diff
 type HealthCheck struct {
 	operatorClient v1helpers.StaticPodOperatorClient
-	kubeClient     kubernetes.Interface
 	eventRecorder  events.Recorder
 	lastOverallStatus string
 }
@@
 func NewHealthCheck(
 	livenessChecker *health.MultiAlivenessChecker,
 	operatorClient v1helpers.StaticPodOperatorClient,
-	kubeClient kubernetes.Interface,
 	eventRecorder events.Recorder,
 ) factory.Controller {
 	c := &HealthCheck{
 		operatorClient: operatorClient,
-		kubeClient:     kubeClient,
 		eventRecorder:  eventRecorder,
 	}
@@
-	operatorClient,
-	kubeClient,
+	operatorClient,
 	eventRecorder,
 )
```
If this is intentionally reserved for future use, ignore and mark it with a TODO comment to avoid churn.
Also applies to: 80-84
pkg/tnf/pacemaker/healthcheck_test.go (4)
106-137: Assert condition transitions, not just error absence.
Validate that degraded/healthy actually update the operator condition in the fake client.
```diff
 err := controller.updateOperatorAvailability(ctx, status)
 require.NoError(t, err, "updateOperatorAvailability should not return an error")
+// Verify degraded condition set
+_, st, _, _ := operatorClient.GetStaticPodOperatorState()
+require.Condition(t, func() (success bool) {
+	for _, c := range st.Conditions {
+		if c.Type == ConditionTypeDegraded && c.Status == operatorv1.ConditionTrue {
+			return true
+		}
+	}
+	return false
+}, "expected degraded condition to be true")
@@
 err = controller.updateOperatorAvailability(ctx, status)
 require.NoError(t, err, "updateOperatorAvailability should not return an error")
+// Verify degraded condition cleared
+_, st, _, _ = operatorClient.GetStaticPodOperatorState()
+require.Condition(t, func() (success bool) {
+	for _, c := range st.Conditions {
+		if c.Type == ConditionTypeDegraded && c.Status == operatorv1.ConditionFalse {
+			return true
+		}
+	}
+	return false
+}, "expected degraded condition to be false")
```
163-184: Validate emitted events.
Use the in-memory recorder to assert the expected reasons were produced (failed action, fencing, generic warning, error, healthy transition).
```diff
 controller.recordHealthCheckEvents(status)
-
+// Collect and assert events for warnings and error
+evts := eventRecorder.Events()
+require.NotEmpty(t, evts)
+// Sanity: at least one of each warning type and an error
+require.Contains(t, reasons(evts), EventReasonFailedAction)
+require.Contains(t, reasons(evts), EventReasonFencingEvent)
+require.Contains(t, reasons(evts), EventReasonWarning)
+require.Contains(t, reasons(evts), EventReasonError)
@@
 status = &HealthStatus{
 	OverallStatus: StatusHealthy,
 	Warnings:      []string{},
 	Errors:        []string{},
 }
 controller.recordHealthCheckEvents(status)
+// After transition to healthy, expect a healthy event
+evts = eventRecorder.Events()
+require.Contains(t, reasons(evts), EventReasonHealthy)
+
+// helper
+func reasons(evts []events.Event) []string {
+	out := make([]string, 0, len(evts))
+	for _, e := range evts {
+		out = append(out, e.Reason)
+	}
+	return out
+}
```
261-291: Add a test for a "standby" node to ensure it is a Warning, not Degraded.
Guards the updated parsing logic and aligns with the stated degradation policy.
```diff
 // Test with offline node
@@
 require.NotEmpty(t, status.Errors, "Should have errors for offline nodes")
+
+// Test with standby node -> warning, but still considered online
+lines = []string{
+	SectionNodeList,
+	"  * Node master-0 (1): online, feature set 3.19.6",
+	"  * Node master-1 (2): standby, feature set 3.19.6",
+	SectionFullListResources,
+}
+status = &HealthStatus{Warnings: []string{}, Errors: []string{}}
+allOnline = controller.parseNodeStatus(lines, status)
+require.True(t, allOnline, "Standby should not be treated as offline")
+require.NotEmpty(t, status.Warnings, "Standby should surface as a warning")
```
296-313: Cover the dynamic node-count logic.
Add a case where only one node has resources started while two nodes are listed; expect an error message reflecting 1/2.
```diff
 // Test with all resources started
@@
 require.Empty(t, status.Errors, "Should have no errors for started resources")
+
+// Test with partial placement (only one node started)
+lines = []string{
+	SectionNodeList,
+	"  * Node master-0 (1): online, feature set 3.19.6",
+	"  * Node master-1 (2): online, feature set 3.19.6",
+	SectionFullListResources,
+	"  * Clone Set: kubelet-clone [kubelet]:",
+	"    * kubelet (systemd:kubelet): Started master-0",
+	"  * Clone Set: etcd-clone [etcd]:",
+	"    * etcd (ocf:heartbeat:podman-etcd): Started master-0",
+	SectionNodeAttributes,
+}
+status = &HealthStatus{Warnings: []string{}, Errors: []string{}}
+allStarted = controller.parseResourceStatus(lines, status)
+require.False(t, allStarted, "Resources not started on all nodes should be false")
+require.NotEmpty(t, status.Errors)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
- pkg/tnf/pacemaker/healthcheck.go (1 hunks)
- pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
pkg/tnf/pacemaker/healthcheck.go (Outdated)
```go
if c.isNodeStatusLine(line) && !strings.Contains(line, NodeStatusOnline) {
	allNodesOnline = false
	nodeName := c.extractNodeName(line)
	status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", nodeName))
}
```
Don’t treat “not online” as “offline”; handle standby distinctly.
Current check flags any non-“online” node as offline (e.g., “standby”). That contradicts the PR objective (degrade only when nodes are offline).
- if c.isNodeStatusLine(line) && !strings.Contains(line, NodeStatusOnline) {
- allNodesOnline = false
- nodeName := c.extractNodeName(line)
- status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", nodeName))
- }
+ if c.isNodeStatusLine(line) {
+ lower := strings.ToLower(line)
+ if strings.Contains(lower, "offline") {
+ allNodesOnline = false
+ nodeName := c.extractNodeName(line)
+ status.Errors = append(status.Errors, fmt.Sprintf("Node %s is offline", nodeName))
+ } else if strings.Contains(lower, "standby") {
+ nodeName := c.extractNodeName(line)
+ status.Warnings = append(status.Warnings, fmt.Sprintf("Node %s is in standby", nodeName))
+ }
+ }
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-if c.isNodeStatusLine(line) && !strings.Contains(line, NodeStatusOnline) {
-	allNodesOnline = false
-	nodeName := c.extractNodeName(line)
-	status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", nodeName))
-}
+if c.isNodeStatusLine(line) {
+	lower := strings.ToLower(line)
+	if strings.Contains(lower, "offline") {
+		allNodesOnline = false
+		nodeName := c.extractNodeName(line)
+		status.Errors = append(status.Errors, fmt.Sprintf("Node %s is offline", nodeName))
+	} else if strings.Contains(lower, "standby") {
+		nodeName := c.extractNodeName(line)
+		status.Warnings = append(status.Warnings, fmt.Sprintf("Node %s is in standby", nodeName))
+	}
+}
```
🤖 Prompt for AI Agents
In pkg/tnf/pacemaker/healthcheck.go around lines 214-218, the code currently
treats any node state that is not NodeStatusOnline as an offline error; change
the logic to only mark nodes as offline when their status equals the explicit
offline constant (e.g., NodeStatusOffline) and ignore legitimate non-online
states like NodeStatusStandby. Concretely, replace the current negation check
with an explicit equality check for offline (or a compound condition that skips
Standby), set allNodesOnline = false and append an error only when the status is
offline, and avoid adding errors for standby nodes.
pkg/tnf/pacemaker/healthcheck.go (Outdated)
```go
func (c *HealthCheck) parseResourceStatus(lines []string, status *HealthStatus) bool {
	resourcesStarted := make(map[string]map[string]bool)
	resourcesStarted[ResourceKubelet] = make(map[string]bool)
	resourcesStarted[ResourceEtcd] = make(map[string]bool)

	inResources := false

	for _, line := range lines {
		line = strings.TrimSpace(line)

		if strings.Contains(line, SectionFullListResources) {
			inResources = true
			continue
		}

		if inResources {
			if strings.Contains(line, SectionNodeAttributes) {
				break
			}

			c.checkResourceLine(line, resourcesStarted)
		}
	}

	// Check if we have both resources started on all nodes
	kubeletCount := len(resourcesStarted[ResourceKubelet])
	etcdCount := len(resourcesStarted[ResourceEtcd])

	if kubeletCount < ExpectedNodeCount {
		status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceKubelet, kubeletCount, ExpectedNodeCount))
	}

	if etcdCount < ExpectedNodeCount {
		status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceEtcd, etcdCount, ExpectedNodeCount))
	}

	return kubeletCount >= ExpectedNodeCount && etcdCount >= ExpectedNodeCount
}
```
Avoid hard-coded node count (2); derive from the Node List.
Clusters may not always be 2 nodes. Parse the Node List to compute expected count and use it when validating resource placement.
-func (c *HealthCheck) parseResourceStatus(lines []string, status *HealthStatus) bool {
+func (c *HealthCheck) parseResourceStatus(lines []string, status *HealthStatus) bool {
resourcesStarted := make(map[string]map[string]bool)
resourcesStarted[ResourceKubelet] = make(map[string]bool)
resourcesStarted[ResourceEtcd] = make(map[string]bool)
@@
- // Check if we have both resources started on all nodes
- kubeletCount := len(resourcesStarted[ResourceKubelet])
- etcdCount := len(resourcesStarted[ResourceEtcd])
+ // Determine expected node count from the Node List; fall back to constant if not found
+ expectedCount := c.detectNodeCount(lines)
+ if expectedCount == 0 {
+ expectedCount = ExpectedNodeCount
+ }
+ kubeletCount := len(resourcesStarted[ResourceKubelet])
+ etcdCount := len(resourcesStarted[ResourceEtcd])
- if kubeletCount < ExpectedNodeCount {
- status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceKubelet, kubeletCount, ExpectedNodeCount))
+ if kubeletCount < expectedCount {
+ status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceKubelet, kubeletCount, expectedCount))
}
- if etcdCount < ExpectedNodeCount {
- status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceEtcd, etcdCount, ExpectedNodeCount))
+ if etcdCount < expectedCount {
+ status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", ResourceEtcd, etcdCount, expectedCount))
}
- return kubeletCount >= ExpectedNodeCount && etcdCount >= ExpectedNodeCount
+ return kubeletCount >= expectedCount && etcdCount >= expectedCount
}
+
+// detectNodeCount returns the number of nodes listed in the "Node List" section.
+func (c *HealthCheck) detectNodeCount(lines []string) int {
+ inNodeList := false
+ count := 0
+ for _, line := range lines {
+ line = strings.TrimSpace(line)
+ if strings.Contains(line, SectionNodeList) {
+ inNodeList = true
+ continue
+ }
+ if inNodeList {
+ if strings.Contains(line, SectionFullListResources) {
+ break
+ }
+ if c.isNodeStatusLine(line) {
+ count++
+ }
+ }
+ }
+ return count
+}
Committable suggestion skipped: line range outside the PR's diff.
Force-pushed from c077a2b to 81984c7.
Force-pushed from 81984c7 to 26d6df8.
Actionable comments posted: 2
♻️ Duplicate comments (1)
pkg/tnf/pacemaker/healthcheck.go (1)
502-510: Degrade on StatusError as well.
If we fail to collect status (OverallStatus == StatusError), the operator condition remains untouched, so outages never surface. Treat StatusError the same as StatusDegraded when updating conditions. Suggested adjustment:
```diff
-	if status.OverallStatus == StatusDegraded {
-		return c.setDegradedCondition(ctx, status)
-	} else if status.OverallStatus == StatusHealthy {
+	switch status.OverallStatus {
+	case StatusDegraded, StatusError:
+		return c.setDegradedCondition(ctx, status)
+	case StatusHealthy:
 		return c.clearDegradedCondition(ctx)
 	}
-
-	// For Warning or Error status, we don't update the degraded condition
-	return nil
+	// Warning / Unknown: leave condition unchanged
+	return nil
```
📒 Files selected for processing (5)
- pkg/tnf/pacemaker/healthcheck.go (1 hunks)
- pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
- pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
- pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
- pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- pkg/tnf/pacemaker/testdata/recent_failures_template.xml
- pkg/tnf/pacemaker/testdata/healthy_cluster.xml
Force-pushed from 26d6df8 to e42e5dc.
Actionable comments posted: 2
♻️ Duplicate comments (3)
pkg/tnf/pacemaker/healthcheck.go (3)
3-9: Prevent hangs; add an exec timeout and non-interactive sudo (-n). Also introduce StatusError for collection failures.
pcs can block (sudo prompt or a hung pcs). Use a short timeout context and sudo -n. While here, add StatusError and treat exec failures accordingly.
Apply:
```diff
--- a/pkg/tnf/pacemaker/healthcheck.go
+++ b/pkg/tnf/pacemaker/healthcheck.go
@@
 import (
+	"errors"
 	"context"
 	"encoding/xml"
 	"fmt"
 	"strings"
 	"time"
@@
 const (
 	// Time windows for detecting recent events
 	FailedActionTimeWindow = 5 * time.Minute
 	FencingEventTimeWindow = 24 * time.Hour
+
+	// Upper bound for executing pcs commands
+	ExecTimeout = 10 * time.Second
@@
 	// Status strings
 	StatusHealthy  = "Healthy"
 	StatusWarning  = "Warning"
 	StatusDegraded = "Degraded"
 	StatusUnknown  = "Unknown"
+	StatusError    = "Error"
@@
 	// Pacemaker commands
-	PcsStatusXMLCommand = "sudo pcs status xml"
+	PcsStatusXMLCommand = "sudo -n pcs status xml"
@@
-	// Execute the pcs status xml command
-	stdout, stderr, err := exec.Execute(ctx, PcsStatusXMLCommand)
+	// Execute the pcs status xml command with timeout and non-interactive sudo
+	ctxExec, cancel := context.WithTimeout(ctx, ExecTimeout)
+	defer cancel()
+	stdout, stderr, err := exec.Execute(ctxExec, PcsStatusXMLCommand)
 	if err != nil {
-		return newErrorHealthStatus(fmt.Sprintf("Failed to execute pcs status xml command: %v", err)), nil
+		if errors.Is(err, context.DeadlineExceeded) {
+			err = fmt.Errorf("pcs status xml command timed out: %w", err)
+		}
+		return newErrorHealthStatus(fmt.Sprintf("Failed to execute pcs status xml command: %v", err)), nil
 	}
```
Also applies to: 21-84, 336-345
379-383: Set OverallStatus when pacemaker isn't running (avoid returning Unknown).
Without setting OverallStatus here, outages can be masked and the degraded condition won't be set.
```diff
 if result.Summary.Stack.PacemakerdState != PacemakerStateRunning {
 	status.Errors = append(status.Errors, "Pacemaker is not running")
-	return status, nil
+	status.OverallStatus = StatusDegraded
+	return status, nil
 }
```
568-576: Degrade on StatusError as well (not only Degraded).
Collection failures should surface as degraded to operators/users.
```diff
 switch status.OverallStatus {
-case StatusDegraded:
+case StatusDegraded, StatusError:
 	return c.setDegradedCondition(ctx, status)
 case StatusHealthy:
 	return c.clearDegradedCondition(ctx)
 default:
 	// For Warning or Unknown status, we don't update the degraded condition
 	return nil
 }
```
📒 Files selected for processing (5)
- pkg/tnf/pacemaker/healthcheck.go (1 hunks)
- pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
- pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
- pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
- pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/tnf/pacemaker/testdata/offline_node.xml
- pkg/tnf/pacemaker/healthcheck_test.go
🔇 Additional comments (2)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)
1-88: Fixture looks consistent with parser expectations.
IDs/roles/attributes align with parseResources/parseNodes and the time formats for histories. LGTM.
pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1)
1-66: Recent-failures fixture is well-formed and matches parser formats.
Timestamps match PacemakerTimeFormat/PacemakerFenceTimeFormat; placeholders are suitable for templating. LGTM.
Force-pushed from e42e5dc to 3cab3d9.
Actionable comments posted: 4
♻️ Duplicate comments (3)
pkg/tnf/pacemaker/healthcheck.go (3)
386-390: Set OverallStatus to StatusDegraded before returning.
When Pacemaker isn't running, the code appends an error but returns with OverallStatus still set to StatusUnknown (from initialization at line 370). This prevents the degraded condition from being set downstream, masking the outage. Apply this diff:
```diff
 // Check if pacemaker is running
 if result.Summary.Stack.PacemakerdState != PacemakerStateRunning {
 	status.Errors = append(status.Errors, "Pacemaker is not running")
+	status.OverallStatus = StatusDegraded
 	return status, nil
 }
```
567-575: Handle StatusError in the switch statement.
If StatusError is added (per the earlier comment), the updateOperatorStatus switch should treat it like StatusDegraded and set the degraded condition. Apply this diff:
```diff
 // Update operator conditions based on pacemaker status
 switch status.OverallStatus {
-case StatusDegraded:
+case StatusDegraded, StatusError:
 	return c.setDegradedCondition(ctx, status)
 case StatusHealthy:
 	return c.clearDegradedCondition(ctx)
 default:
 	// For Warning or Unknown status, we don't update the degraded condition
 	return nil
 }
```
267-274: Rename to newErrorHealthStatus and use StatusError instead of StatusUnknown.
Collection failures (exec timeout, XML parse errors, etc.) should be marked as StatusError rather than StatusUnknown so they are surfaced as actionable failures in the operator condition. This aligns with previous feedback. Apply this diff:
```diff
-// newUnknownHealthStatus creates a HealthStatus with Unknown status and an error message
-func newUnknownHealthStatus(errMsg string) *HealthStatus {
+// newErrorHealthStatus creates a HealthStatus with Error status and an error message
+func newErrorHealthStatus(errMsg string) *HealthStatus {
 	return &HealthStatus{
-		OverallStatus: StatusUnknown,
+		OverallStatus: StatusError,
 		Warnings:      []string{},
 		Errors:        []string{errMsg},
 	}
 }
```
newErrorHealthStatus
instead ofnewUnknownHealthStatus
.
📒 Files selected for processing (5)
- pkg/tnf/pacemaker/healthcheck.go (1 hunks)
- pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
- pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
- pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
- pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/tnf/pacemaker/testdata/offline_node.xml
- pkg/tnf/pacemaker/testdata/recent_failures_template.xml
🔇 Additional comments (7)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)
1-88
: LGTM! Test fixture is well-formed. The XML fixture represents a valid healthy Pacemaker cluster state with:
- Both nodes online (
online="true"
on lines 12-13)
- All resources started and active
- No failed operations in history
- Empty fence history
This provides good test coverage for the happy path.
pkg/tnf/pacemaker/healthcheck_test.go (6)
102-113
: Update test expectations when StatusError
replacesStatusUnknown
.This test expects
StatusUnknown
for exec failures (line 111), which is correct for the current implementation. However, oncenewUnknownHealthStatus
is renamed tonewErrorHealthStatus
and returnsStatusError
(per the comment on healthcheck.go lines 267-274), update this assertion accordingly.After applying the main implementation changes, update line 111:
require.NotNil(t, status, "HealthStatus should not be nil") - require.Equal(t, StatusUnknown, status.OverallStatus, "Status should be Unknown when exec fails") + require.Equal(t, StatusError, status.OverallStatus, "Status should be Error when exec fails") require.NotEmpty(t, status.Errors, "Should have error messages when exec fails")
301-329
: Update TestHealthCheck_determineOverallStatus
forStatusError
.Line 306 tests
newUnknownHealthStatus
and expectsStatusUnknown
(line 308). OnceStatusError
is introduced, add a dedicated test case and update the existing one.After implementing
StatusError
, update the test:func TestHealthCheck_determineOverallStatus(t *testing.T) { controller := &HealthCheck{} - // Test degraded status status - errMsg := "test error message" - status := newUnknownHealthStatus(errMsg) - result := controller.determineOverallStatus(status) - require.Equal(t, StatusUnknown, result) + // Test error status (collection failures) + errMsg := "test error message" + status := newErrorHealthStatus(errMsg) + result := controller.determineOverallStatus(status) + require.Equal(t, StatusError, result)Also add a test case for
StatusUnknown
if it's still used for other scenarios, or remove it if no longer needed.
377-392
: UpdateTestHealthCheck_validateResourceCount
signature after refactor.Once
validateResourceCount
is updated to acceptexpectedCount
as a parameter (per the comment on healthcheck.go lines 480-487), update the test calls on lines 382 and 388.Apply this diff after the main implementation is updated:
// Test with correct count status := &HealthStatus{Warnings: []string{}, Errors: []string{}} - result := controller.validateResourceCount(ResourceEtcd, 2, status) + result := controller.validateResourceCount(ResourceEtcd, 2, 2, status) require.True(t, result, "Should return true for correct count") require.Empty(t, status.Errors, "Should have no errors for correct count") // Test with insufficient count status = &HealthStatus{Warnings: []string{}, Errors: []string{}} - result = controller.validateResourceCount(ResourceKubelet, 1, status) + result = controller.validateResourceCount(ResourceKubelet, 1, 2, status) require.False(t, result, "Should return false for insufficient count")
248-266
: Consider removing TestHealthCheck_collectNodeStatus_MismatchedCount
after refactor.This test validates the warning for node count mismatch against
ExpectedNodeCount
. Once the hard-coded constant is removed (per the comment on healthcheck.go lines 423-427), this specific warning logic will be deleted, making this test obsolete.After the refactor, either remove this test or update it to test a different node-count-related scenario if applicable.
23-46
: LGTM! Test helpers are well-structured.The
createTestHealthCheck
andNewHealthCheck
test setup properly use fake clients and in-memory event recorders for isolated testing. Good practice for controller tests.Also applies to: 72-100
176-230
: LGTM! XML parsing tests provide good coverage.The tests for
parsePacemakerStatusXML
cover:
- Healthy cluster (lines 176-188)
- Offline node scenario (lines 190-202)
- Recent failures with dynamic timestamps (lines 204-230)
The use of
getRecentFailuresXML
with dynamic timestamps (lines 55-69) is particularly good for avoiding time-sensitive test flakes.
Force-pushed from 3cab3d9 to c3fb43e.
Actionable comments posted: 1
♻️ Duplicate comments (8)
pkg/tnf/pacemaker/healthcheck.go (8)
267-274
: Duplicate: Mark collection failures as StatusError. Per past review, this function should return
StatusError
instead ofStatusUnknown
to properly distinguish collection failures from unknown states. Collection failures are actionable and should be surfaced as errors.Apply this diff:
func newUnknownHealthStatus(errMsg string) *HealthStatus { return &HealthStatus{ - OverallStatus: StatusUnknown, + OverallStatus: StatusError, Warnings: []string{}, Errors: []string{errMsg}, } }Note: This requires adding the
StatusError
constant to the status constants section (lines 34-38).
34-38
: Add StatusError constant for collection failures.To properly handle collection/communication failures distinct from operational issues, add a
StatusError
constant alongside the existing status constants.Apply this diff:
// Status strings StatusHealthy = "Healthy" StatusWarning = "Warning" StatusDegraded = "Degraded" + StatusError = "Error" StatusUnknown = "Unknown"
429-433
: Distinguish offline from standby nodes. The current check treats any non-online node as an error, including legitimate standby states. Per the PR objectives, only offline nodes should trigger degradation.
Apply this diff:
for _, node := range result.Nodes.Node { - if node.Online != NodeStatusOnline { - status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", node.Name)) + if node.Online == "false" { + status.Errors = append(status.Errors, fmt.Sprintf("Node %s is offline", node.Name)) + } else if node.Standby == BooleanValueTrue { + status.Warnings = append(status.Warnings, fmt.Sprintf("Node %s is in standby", node.Name)) } }
387-390
: Set OverallStatus when pacemaker isn't running.When pacemaker is down, the function returns with
OverallStatus
still set toUnknown
. This prevents the degraded condition from being set, masking the outage.Apply this diff:
if result.Summary.Stack.PacemakerdState != PacemakerStateRunning { status.Errors = append(status.Errors, "Pacemaker is not running") + status.OverallStatus = StatusDegraded return status, nil }
497-509
: Fix failure detection logic: use OR instead of AND.The condition requires both RC and RCText to indicate failure, which misses cases where only one field is non-success. Use OR logic to catch all failure cases.
Apply this diff:
for _, operation := range resourceHistory.OperationHistory { // Check for failed operations (rc != "0") - if operation.RC != OperationRCSuccess && operation.RCText != OperationRCTextSuccess { + if operation.RC != OperationRCSuccess || operation.RCText != OperationRCTextSuccess { // Parse the timestamp t, err := time.Parse(PacemakerTimeFormat, operation.LastRCChange)
567-575
: Handle StatusError in operator status update.When status collection fails (
StatusError
), the operator should surface a degraded condition to avoid masking outages. Currently onlyStatusDegraded
triggers the degraded condition.Apply this diff:
// Update operator conditions based on pacemaker status switch status.OverallStatus { - case StatusDegraded: + case StatusDegraded, StatusError: return c.setDegradedCondition(ctx, status) case StatusHealthy: return c.clearDegradedCondition(ctx) default: - // For Warning or Unknown status, we don't update the degraded condition + // For Warning or Unknown status, we don't update the degraded condition return nil }
469-487
: Derive expected node count dynamically from parsed XML.Replace the hard-coded
ExpectedNodeCount
constant with a dynamic count derived from the parsed node list. This makes the validation flexible for clusters of varying sizes.Apply this diff:
// Check if we have both resources started on all nodes + expectedCount := len(result.Nodes.Node) kubeletCount := len(resourcesStarted[ResourceKubelet]) etcdCount := len(resourcesStarted[ResourceEtcd]) - kubeletOK := c.validateResourceCount(ResourceKubelet, kubeletCount, status) - etcdOK := c.validateResourceCount(ResourceEtcd, etcdCount, status) + kubeletOK := c.validateResourceCount(ResourceKubelet, kubeletCount, expectedCount, status) + etcdOK := c.validateResourceCount(ResourceEtcd, etcdCount, expectedCount, status) return kubeletOK && etcdOK } // validateResourceCount validates that a resource is started on the expected number of nodes -func (c *HealthCheck) validateResourceCount(resourceName string, actualCount int, status *HealthStatus) bool { - if actualCount < ExpectedNodeCount { +func (c *HealthCheck) validateResourceCount(resourceName string, actualCount, expectedCount int, status *HealthStatus) bool { + if actualCount < expectedCount { status.Errors = append(status.Errors, fmt.Sprintf("%s resource not started on all nodes (started on %d/%d nodes)", - resourceName, actualCount, ExpectedNodeCount)) + resourceName, actualCount, expectedCount)) return false } return true
423-427
: Remove hard-coded node count check.The hard-coded
ExpectedNodeCount = 2
check is inflexible for clusters that may scale or undergo node replacement. Since the subsequent validation already uses the actual node count from the XML, this strict equality check can be removed.Apply this diff:
- // Check if we have the expected number of nodes. - // This should almost always be 2, but 1 node is possible during a control-plane node replacement event. - if len(result.Nodes.Node) != ExpectedNodeCount { - status.Warnings = append(status.Warnings, fmt.Sprintf("Expected %d nodes, found %d", ExpectedNodeCount, len(result.Nodes.Node))) - } -
📒 Files selected for processing (5)
pkg/tnf/pacemaker/healthcheck.go
(1 hunks)pkg/tnf/pacemaker/healthcheck_test.go
(1 hunks)pkg/tnf/pacemaker/testdata/healthy_cluster.xml
(1 hunks)pkg/tnf/pacemaker/testdata/offline_node.xml
(1 hunks)pkg/tnf/pacemaker/testdata/recent_failures_template.xml
(1 hunks)
🔇 Additional comments (4)
pkg/tnf/pacemaker/testdata/offline_node.xml (1)
1-67
: LGTM!This XML test fixture correctly represents an offline node scenario with master-1 offline (line 13) and a corresponding fencing event (line 64). The structure is valid and appropriate for testing the health check parser's handling of degraded cluster states.
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)
1-88
: LGTM!This XML test fixture correctly represents a healthy cluster with both nodes online and all resources started. The structure is valid and provides appropriate test coverage for the healthy path scenario.
pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1)
1-66
: LGTM!This template correctly uses placeholders for dynamic timestamp generation (lines 56, 63), allowing tests to verify time-based filtering of recent failures and fencing events. The template approach is appropriate for testing scenarios that depend on recency.
pkg/tnf/pacemaker/healthcheck_test.go (1)
1-440
: LGTM!This comprehensive test suite provides excellent coverage for the Pacemaker health check component:
- Thorough testing of XML parsing with multiple scenarios (healthy, offline, recent failures)
- Good edge case coverage (no nodes, mismatched counts, large/invalid XML)
- Effective use of test helpers and fixtures
- Appropriate use of fake clients for isolation
The test structure is clean and maintainable.
Force-pushed from c3fb43e to f87437c.
Actionable comments posted: 0
♻️ Duplicate comments (1)
pkg/tnf/pacemaker/healthcheck.go (1)
386-390
: Set OverallStatus before returning when pacemaker is not running. When pacemaker is not running, the function appends an error and returns immediately (line 389), but
OverallStatus
remainsUnknown
(initialized at line 370). This prevents the degraded condition from being set properly sinceupdateOperatorStatus
(line 567-575) only sets degraded forStatusError
, notStatusUnknown
.Apply this diff:
if result.Summary.Stack.PacemakerdState != PacemakerStateRunning { status.Errors = append(status.Errors, "Pacemaker is not running") + status.OverallStatus = c.determineOverallStatus(status) return status, nil }
🧹 Nitpick comments (1)
pkg/tnf/pacemaker/healthcheck_test.go (1)
102-113
: Clarify test expectation: exec failure should not return an error. The comment states "exec.Execute will fail due to permissions" but the test expects
NoError
. This appears correct (the function returnsUnknown
status without propagating the error), but the comment could be clearer about this design choice.Consider updating the comment to:
- // In test environment, exec.Execute will fail due to permissions - require.NoError(t, err, "getPacemakerStatus should not return an error") + // In test environment, exec.Execute will fail due to permissions. + // getPacemakerStatus handles this gracefully by returning Unknown status without propagating the error. + require.NoError(t, err, "getPacemakerStatus should not return an error even when exec fails")
📒 Files selected for processing (5)
pkg/tnf/pacemaker/healthcheck.go
(1 hunks)pkg/tnf/pacemaker/healthcheck_test.go
(1 hunks)pkg/tnf/pacemaker/testdata/healthy_cluster.xml
(1 hunks)pkg/tnf/pacemaker/testdata/offline_node.xml
(1 hunks)pkg/tnf/pacemaker/testdata/recent_failures_template.xml
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/tnf/pacemaker/testdata/healthy_cluster.xml
- pkg/tnf/pacemaker/testdata/recent_failures_template.xml
🔇 Additional comments (10)
pkg/tnf/pacemaker/testdata/offline_node.xml (1)
1-67
: LGTM! Well-structured test fixture for offline node scenario.The XML fixture accurately represents an offline node scenario with appropriate fencing history. The structure is valid and will effectively support testing of offline node detection and fencing event handling.
pkg/tnf/pacemaker/healthcheck_test.go (2)
23-69
: LGTM! Well-designed test helpers.The test helpers are well-structured:
createTestHealthCheck
: Properly initializes all dependencies with fakesloadTestXML
: Simple and effective for loading fixturesgetRecentFailuresXML
: Clever approach using dynamic timestamps to ensure failures are within the 5-minute window
176-440
: LGTM! Comprehensive test coverage.The test suite provides excellent coverage:
- XML parsing for healthy, offline, and recent-failures scenarios
- Node status collection with various edge cases (empty nodes, mismatched counts)
- Resource status validation
- Overall status determination logic
- Helper functions (newUnknownHealthStatus, recent failures, fencing events)
- Edge cases (large XML, invalid XML, missing state)
The tests are well-organized, use clear assertions, and validate both happy paths and error conditions.
pkg/tnf/pacemaker/healthcheck.go (7)
334-365
: LGTM! Robust command execution with proper safeguards.The implementation correctly addresses potential hanging issues:
- Uses context with timeout (10 seconds) to prevent indefinite blocking
- Uses
sudo -n
for non-interactive execution- Explicitly checks for timeout errors
- Validates XML size to prevent XML bombs
- Returns graceful Unknown status on errors
23-85
: LGTM! Well-organized constants with clear documentation.The constants are well-structured and appropriately documented:
- Time windows for event detection are reasonable (5 min for failures, 24 hours for fencing)
- 30-second resync interval balances responsiveness with resource usage
- Status constants, event reasons, and command strings are clearly defined
ExpectedNodeCount = 2
correctly reflects the Two-Node OpenShift design
416-434
: Node online check aligns with Two-Node OpenShift design.The check at line 430 treats any node where
Online != "true"
as an error, including standby nodes. Per the maintainer's clarification, this is the intended behavior for Two-Node OpenShift with Fencing, where standby nodes are considered offline for health purposes.Based on learnings
497-509
: LGTM! Correct failure detection using OR logic.The failure detection properly uses OR logic (
||
) to catch failures when eitherRC
orRCText
indicates non-success. This correctly identifies cases where only one field shows failure (e.g.,RC="1"
butRCText="ok"
).
560-576
: Consider whether StatusUnknown should also set degraded condition.Currently, only
StatusError
sets the degraded condition. When the controller cannot determine cluster health (e.g., exec timeout, parse failure), it returnsStatusUnknown
, but this doesn't update the operator condition. This could mask operational issues.Should
StatusUnknown
also set a degraded condition to surface collection failures? If the current behavior is intentional (avoiding false alarms during transient issues), consider adding a comment explaining this design choice.Would you like the controller to set degraded on Unknown status, or is the current "wait and see" approach preferred?
480-487
: Hard-coded ExpectedNodeCount correctly reflects architectural constraint.The constant
ExpectedNodeCount = 2
and its usage in validation correctly encode the Two-Node OpenShift with Fencing design. Per the maintainer's clarification, this is an architectural invariant, not a limitation to be made dynamic.Based on learnings
627-654
: LGTM! Well-categorized event recording.Event recording properly categorizes warnings by type:
- Failed resource actions →
EventReasonFailedAction
- Fencing events →
EventReasonFencingEvent
- General warnings →
EventReasonWarning
- Errors →
EventReasonError
- Healthy status →
EventReasonHealthy
This categorization enables effective event filtering and alerting.
Force-pushed from f87437c to c07e5b1.
Actionable comments posted: 0
♻️ Duplicate comments (1)
pkg/tnf/pacemaker/healthcheck.go (1)
267-274
: Collection failures return Unknown; confirm policy.Exec/parse failures set StatusUnknown, so no degraded condition is set. If outages should surface, consider marking as StatusError to degrade. If Unknown is intentional, ignore.
Also applies to: 341-347
🧹 Nitpick comments (6)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)
79-82
: Remove duplicated operation_history entry.Line 80 duplicates line 79 for the kubelet probe. Drop one to avoid confusion in tests and future parsing.
- <operation_history call="330" task="probe" rc="0" rc_text="ok" last-rc-change="Thu Sep 18 12:40:35 2025" exec-time="4ms" queue-time="0ms"/>
pkg/tnf/pacemaker/healthcheck_test.go (1)
115-149
: Optional: rename test for consistency with impl name.The function under test is updateOperatorStatus; consider renaming the test to match for easier discovery.
-func TestHealthCheck_updateOperatorAvailability(t *testing.T) { +func TestHealthCheck_updateOperatorStatus(t *testing.T) {pkg/tnf/pacemaker/healthcheck.go (4)
338-341
: Extract exec timeout into a constant.Avoid magic number; define a reusable ExecTimeout alongside other constants and use it here.
@@ - // Execute the pcs status xml command with a timeout and non-interactive sudo - ctxExec, cancel := context.WithTimeout(ctx, 10*time.Second) + // Execute the pcs status xml command with a timeout and non-interactive sudo + ctxExec, cancel := context.WithTimeout(ctx, ExecTimeout)Add near other constants:
+ // Upper bound for "pcs status" execution + ExecTimeout = 10 * time.Second
430-434
: If standby is considered “offline”, handle it explicitly.Currently only Online != "true" is flagged. To align with the stated intent to treat standby as offline, also error on Standby == "true".
for _, node := range result.Nodes.Node { - if node.Online != NodeStatusOnline { - status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", node.Name)) - } + if node.Online != NodeStatusOnline { + status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", node.Name)) + } else if node.Standby == BooleanValueTrue { + status.Errors = append(status.Errors, fmt.Sprintf("Node %s is in standby (treated as offline)", node.Name)) + } }
449-453
: Match resources by agent, not ID substring.ID-based substring matching is brittle. Use ResourceAgent to detect kubelet/etcd reliably.
- // Check if this is a kubelet or etcd resource - if strings.Contains(resource.ID, ResourceKubelet) { - resourcesStarted[ResourceKubelet][nodeName] = true - } else if strings.Contains(resource.ID, ResourceEtcd) { - resourcesStarted[ResourceEtcd][nodeName] = true - } + // Identify kubelet/etcd resources by their agents + switch { + case strings.HasPrefix(resource.ResourceAgent, "systemd:kubelet"): + resourcesStarted[ResourceKubelet][nodeName] = true + case strings.HasPrefix(resource.ResourceAgent, "ocf:heartbeat:podman-etcd"): + resourcesStarted[ResourceEtcd][nodeName] = true + }
629-643
: Optional: gate repeated warning events to reduce noise.At 30s resync, warnings within the 5m/24h windows will re-emit every cycle. Consider deduplication (e.g., last-emitted cache, once-per-window) to avoid event spam.
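A minimal sketch of the kind of last-emitted cache this suggests; the names (eventGate, shouldEmit) and the wiring are illustrative assumptions, not the PR's actual implementation:

```go
package main

import (
	"fmt"
	"time"
)

// eventGate remembers when each event key was last emitted so repeats inside
// the deduplication window can be suppressed.
type eventGate struct {
	recorded map[string]time.Time
	window   time.Duration
}

func newEventGate(window time.Duration) *eventGate {
	return &eventGate{recorded: map[string]time.Time{}, window: window}
}

// shouldEmit returns true only if key has not been emitted within the window,
// and records the emission time when it does.
func (g *eventGate) shouldEmit(key string, now time.Time) bool {
	if last, ok := g.recorded[key]; ok && now.Sub(last) < g.window {
		return false
	}
	g.recorded[key] = now
	return true
}

func main() {
	g := newEventGate(5 * time.Minute)
	now := time.Now()
	fmt.Println(g.shouldEmit("warning:fencing event", now))                     // true: first emission
	fmt.Println(g.shouldEmit("warning:fencing event", now.Add(30*time.Second))) // false: inside window
	fmt.Println(g.shouldEmit("warning:fencing event", now.Add(6*time.Minute)))  // true: window elapsed
}
```

Keyed on the warning text, repeats inside the window are suppressed while new or stale warnings still get through.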
📒 Files selected for processing (5)
pkg/tnf/pacemaker/healthcheck.go
(1 hunks)pkg/tnf/pacemaker/healthcheck_test.go
(1 hunks)pkg/tnf/pacemaker/testdata/healthy_cluster.xml
(1 hunks)pkg/tnf/pacemaker/testdata/offline_node.xml
(1 hunks)pkg/tnf/pacemaker/testdata/recent_failures_template.xml
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/tnf/pacemaker/testdata/recent_failures_template.xml
🔇 Additional comments (1)
pkg/tnf/pacemaker/healthcheck_test.go (1)
49-53
: Use require.NoErrorf to format the failure message.Current call won’t format %s; switch to the fmt-aware variant.
- require.NoError(t, err, "Failed to read test XML file: %s", filename) + require.NoErrorf(t, err, "Failed to read test XML file: %s", filename)Likely an incorrect or invalid review comment.
Force-pushed from c07e5b1 to e2add0e.
Actionable comments posted: 1
🧹 Nitpick comments (4)
pkg/tnf/pacemaker/healthcheck.go (1)
388-401
: Optional: size guard exists only in caller, not in parser.parsePacemakerStatusXML trusts input size; only getPacemakerStatus enforces maxXMLSize. If parse is used elsewhere, consider early-checking length or using an xml.Decoder with a bounded reader to keep invariants local to the parser. Low priority.
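A rough sketch of what a bounded parse could look like, assuming a hypothetical maxXMLSize constant and a simplified result type; the PR's real types and limit may differ:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// maxXMLSize is an assumed cap, not the PR's actual constant.
const maxXMLSize = 1 << 20

// pacemakerResult is a stand-in for the parser's real result type.
type pacemakerResult struct {
	XMLName xml.Name `xml:"pacemaker-result"`
}

// parseBounded decodes at most maxXMLSize bytes; if the root element is larger,
// the truncated input fails to decode instead of being buffered whole.
func parseBounded(r io.Reader) (*pacemakerResult, error) {
	dec := xml.NewDecoder(io.LimitReader(r, maxXMLSize+1))
	var res pacemakerResult
	if err := dec.Decode(&res); err != nil {
		return nil, fmt.Errorf("decoding pcs status xml: %w", err)
	}
	return &res, nil
}

func main() {
	res, err := parseBounded(strings.NewReader(`<pacemaker-result/>`))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("parsed root element:", res.XMLName.Local)
}
```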
pkg/tnf/pacemaker/healthcheck_test.go (3)
50-55
: Use require.NoErrorf for formatted failure message.Minor test ergonomics: this call passes a format string and arg; use the “f” variant.
- data, err := os.ReadFile(path) - require.NoError(t, err, "Failed to read test XML file: %s", filename) + data, err := os.ReadFile(path) + require.NoErrorf(t, err, "Failed to read test XML file: %s", filename)Based on learnings
642-659
: Test name/comment mismatch with behavior.The test name/comment suggest exercising “large XML” behavior, but the parser test simply validates invalid XML. Either update the comment/name to reflect invalid XML, or drive the size check via getPacemakerStatus (which enforces maxXMLSize). Optional.
484-498
: Add a test to assert standby is treated as offline.To lock in the intended behavior, add a case where node.Online="true" and Standby="true" and assert an error is recorded.
Example:
func TestHealthCheck_collectNodeStatus_StandbyTreatedAsOffline(t *testing.T) { controller := &HealthCheck{} result := &PacemakerResult{ Nodes: Nodes{ Node: []Node{ {Name: "master-0", Online: booleanValueTrue, Standby: booleanValueTrue}, {Name: "master-1", Online: booleanValueTrue, Standby: "false"}, }, }, } status := &HealthStatus{Warnings: []string{}, Errors: []string{}} controller.collectNodeStatus(result, status) require.NotEmpty(t, status.Errors) require.Contains(t, strings.Join(status.Errors, " "), "standby") }
📒 Files selected for processing (5)
pkg/tnf/pacemaker/healthcheck.go
(1 hunks)pkg/tnf/pacemaker/healthcheck_test.go
(1 hunks)pkg/tnf/pacemaker/testdata/healthy_cluster.xml
(1 hunks)pkg/tnf/pacemaker/testdata/offline_node.xml
(1 hunks)pkg/tnf/pacemaker/testdata/recent_failures_template.xml
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/tnf/pacemaker/testdata/recent_failures_template.xml
- pkg/tnf/pacemaker/testdata/healthy_cluster.xml
Force-pushed from e2add0e to f3d22d3.
Overall, it seems good to me :)
I just left a few notes
pkg/tnf/pacemaker/healthcheck.go (Outdated)
```go
// Validate XML size to prevent XML bombs
if len(stdout) > maxXMLSize {
	return newUnknownHealthStatus(fmt.Sprintf("XML output too large: %d bytes (max: %d bytes)", len(stdout), maxXMLSize)), nil
}
```
Note for the future: if the XML bomb is a genuine risk, we might want to consider modifying the exec package to use io.LimitReader somehow. The current check prevents us from parsing it, which is good, but we still have the whole XML in memory at that point.
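For illustration only, a sketch of how an exec helper could cap captured stdout with io.LimitReader while streaming; this is not the repository's exec package, and the helper name and cap are assumptions:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"os/exec"
	"time"
)

// maxOutput is an assumed cap on captured stdout, not a value from the repo.
const maxOutput = 1 << 20

// runCapped runs a command and keeps at most maxOutput bytes of its stdout,
// streaming through io.LimitReader instead of buffering everything.
func runCapped(ctx context.Context, name string, args ...string) (string, error) {
	cmd := exec.CommandContext(ctx, name, args...)
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return "", err
	}
	if err := cmd.Start(); err != nil {
		return "", err
	}
	out, readErr := io.ReadAll(io.LimitReader(stdout, maxOutput))
	// Drain any remaining output so the child is not blocked on a full pipe.
	_, _ = io.Copy(io.Discard, stdout)
	waitErr := cmd.Wait()
	if readErr != nil {
		return "", readErr
	}
	return string(out), waitErr
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	out, err := runCapped(ctx, "echo", "hello")
	fmt.Printf("%q %v\n", out, err)
}
```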
pkg/tnf/pacemaker/healthcheck.go (Outdated)
```go
// Record pacemaker health check events
c.recordHealthCheckEvents(healthStatus)
```
Not sure how long recordHealthCheckEvents takes, but maybe we should pass the context here.
pkg/tnf/pacemaker/healthcheck.go (Outdated)
```go
if err != nil {
	klog.V(4).Infof("Failed to parse timestamp '%s' for operation %s on node %s: %v",
		operation.LastRCChange, operation.Task, node.Name, err)
	continue
}
```
Shouldn't we worry about letting these errors slip through?
I think the safest thing to do when something fails as part of this information collection is to log it, but refuse to change the operator state. We should provide updates for definitively known state, even if it may be an incomplete truth.
pkg/tnf/pacemaker/healthcheck.go (Outdated)
```go
// Clean up old entries that are outside the longest deduplication window
maxWindow := eventDeduplicationWindowFencing
for key, timestamp := range c.recordedEvents {
	if now.Sub(timestamp) > maxWindow {
		delete(c.recordedEvents, key)
	}
}
```
This cleanup logic is currently nested inside the block that decides whether to add an item. Should we move it out? It feels like that cleanup step could be performed just once per recordHealthCheckEvents execution, not for every new item to add. Did I misunderstand it?
UPDATE: now that I read it a little more closely, it seems that on each loop iteration in recordHealthCheckEvents we might delete events recorded in the previous loop.
Yeah, this does feel a bit wonky. I'll tweak it so that the window runs once-per-check instead of once per error recording
I had forgotten the tests :)
Some tests have similar boilerplate code (like the ones that parse XML). You might want to consider using test tables instead. Right now, though, I have mixed feelings, as I do like how you organized the tests. The only real benefit of merging them would be that if we need to fix an issue or add new test cases, it'd be much easier to do it in one place instead of several. So, I'll leave the decision up to you!
```go
func getRecentFailuresXML(t *testing.T) string {
	// Load the template from testdata
	templateXML := loadTestXML(t, "recent_failures_template.xml")

	// Generate recent timestamps that are definitely within the 5-minute window
	now := time.Now()
	recentTime := now.Add(-1 * time.Minute).UTC().Format("Mon Jan 2 15:04:05 2006")
	recentTimeISO := now.Add(-1 * time.Minute).UTC().Format("2006-01-02 15:04:05.000000Z")

	// Replace the placeholders in the template
	xmlContent := strings.ReplaceAll(templateXML, "{{RECENT_TIMESTAMP}}", recentTime)
	xmlContent = strings.ReplaceAll(xmlContent, "{{RECENT_TIMESTAMP_ISO}}", recentTimeISO)

	return xmlContent
}
```
Cool! :)
pretty nice PR IMHO 👍
pkg/tnf/pacemaker/healthcheck.go (Outdated)
```go
	return c.updateOperatorCondition(ctx, condition)
}

func (c *HealthCheck) clearDegradedCondition(ctx context.Context) error {
```
would it make sense to check the current degraded condition for its reason, to make sure we don't clear a degraded condition which wasn't set by this healthcheck? Not sure what best practice is in such situations...
Turns out that's what we are doing, but the function names are very misleading. Updated this to be more clear.
Actionable comments posted: 0
🧹 Nitpick comments (3)
pkg/tnf/pacemaker/healthcheck.go (3)
524-528: Timestamp parsing errors are logged but not surfaced. When time.Parse fails, the error is logged at V(4) level and the operation is skipped with continue. If all timestamps consistently fail to parse (e.g., due to a Pacemaker version change in timestamp format), recent failures would be silently ignored. Consider accumulating a count of parse errors and emitting a warning if a threshold is exceeded, to detect systematic parsing issues.
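One possible shape for that, sketched with made-up names (the helper and the threshold idea are illustrative, not part of this PR):

// Sketch: parse a pacemaker timestamp and count failures so the caller can
// warn when parse errors look systematic rather than one-off.
func parseWithFailureCount(value, layout string, failures *int) (time.Time, bool) {
	t, err := time.Parse(layout, value)
	if err != nil {
		*failures++
		klog.V(4).Infof("failed to parse pacemaker timestamp %q: %v", value, err)
		return time.Time{}, false
	}
	return t, true
}

The caller would keep the counter across the loop and append a warning to status.Warnings once it passes a small threshold, so a format change in a future Pacemaker release shows up instead of silently disabling recent-failure detection.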
640-665: Move cleanup logic outside the event-checking loop for efficiency. The cleanup of old deduplication entries (lines 647-653) currently runs on every call to shouldRecordEvent, which means it executes once per warning/error. This is inefficient when processing multiple events. Based on learnings.
Apply this refactor to clean up once per sync:
 // recordHealthCheckEvents records events for pacemaker warnings and fencing history
 func (c *HealthCheck) recordHealthCheckEvents(status *HealthStatus) {
+	// Clean up old entries once at the start
+	c.cleanupOldRecordedEvents()
+
 	// Record events for warnings with appropriate deduplication window
 	for _, warning := range status.Warnings {
 		eventKey := fmt.Sprintf("warning:%s", warning)
@@ -696,6 +700,19 @@
 	}
 }

+// cleanupOldRecordedEvents removes entries older than the longest deduplication window
+func (c *HealthCheck) cleanupOldRecordedEvents() {
+	c.recordedEventsMu.Lock()
+	defer c.recordedEventsMu.Unlock()
+
+	now := time.Now()
+	maxWindow := eventDeduplicationWindowFencing
+	for key, timestamp := range c.recordedEvents {
+		if now.Sub(timestamp) > maxWindow {
+			delete(c.recordedEvents, key)
+		}
+	}
+}
+
 // shouldRecordEvent checks if an event should be recorded based on deduplication logic
 func (c *HealthCheck) shouldRecordEvent(eventKey string, deduplicationWindow time.Duration) bool {
 	c.recordedEventsMu.Lock()
 	defer c.recordedEventsMu.Unlock()
 	now := time.Now()
-
-	// Clean up old entries that are outside the longest deduplication window
-	maxWindow := eventDeduplicationWindowFencing
-	for key, timestamp := range c.recordedEvents {
-		if now.Sub(timestamp) > maxWindow {
-			delete(c.recordedEvents, key)
-		}
-	}
 	// Check if this event was recently recorded
 	if lastRecorded, exists := c.recordedEvents[eventKey]; exists {
619-626: Verify degraded condition ownership before clearing. The clearDegradedCondition function clears any degraded condition of type PacemakerHealthCheckDegraded without verifying it was set by this controller. If multiple components manage similar conditions, this could inadvertently clear a condition set by another component. Consider checking the condition's reason or message to confirm it was set by this health check before clearing it.
Based on learnings.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (6)
pkg/tnf/pacemaker/healthcheck.go (1 hunks)
pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
pkg/tnf/pacemaker/testdata/standby_node.xml (1 hunks)
🔇 Additional comments (13)
pkg/tnf/pacemaker/testdata/offline_node.xml (1)
1-67
: LGTM! Test fixture correctly represents offline node scenario.This XML fixture appropriately models a Pacemaker cluster with an offline node (master-1 with
online="false"
), which is used to test the health-check parsing logic for degraded cluster states.pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)
1-87
: LGTM! Test fixture correctly represents healthy cluster scenario.This XML fixture appropriately models a healthy Pacemaker cluster with both nodes online, all critical resources started, and no recent fencing events. It serves as a good baseline for testing the happy path.
pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1)
1-66
: LGTM! Template approach is appropriate for time-sensitive testing.This template with placeholder timestamps (
{{RECENT_TIMESTAMP}}
and{{RECENT_TIMESTAMP_ISO}}
) is correctly used by thegetRecentFailuresXML
test helper to dynamically generate XML with recent timestamps, enabling proper testing of the time-window-based failure detection logic.pkg/tnf/pacemaker/testdata/standby_node.xml (2)
57-58
: Verify duplicate operation_history entries are intentional.Lines 57-58 contain identical
operation_history
entries for the kubelet probe operation (same call, task, timestamps). If this duplication is intentional to test duplicate handling in the parser, consider adding a comment explaining this. Otherwise, one entry should be removed.
1-66
: Test fixture correctly represents standby node scenario.This XML fixture appropriately models a Pacemaker cluster with a standby node (master-1 with
standby="true"
), which is used to test the health-check behavior when nodes are in standby state.pkg/tnf/pacemaker/healthcheck_test.go (3)
23-71
: LGTM! Well-structured test helpers with good practices.The helper functions appropriately:
- Create test fixtures with fake clients and in-memory recorders
- Load static XML test data from files
- Generate dynamic timestamps for time-sensitive tests using template replacement
The
getRecentFailuresXML
helper's approach of using 1-minute-old timestamps ensures tests are consistently within the 5-minute failure window.
179-245
: Excellent event deduplication test coverage.This test comprehensively validates the deduplication logic:
- Verifies duplicate events are not re-recorded within windows
- Tests different deduplication windows (5 min for actions, 1 hour for fencing)
- Validates healthy event is only recorded on status transitions
- Confirms new errors are recorded even when duplicates exist
The test correctly verifies the controller only records healthy events when transitioning from unhealthy states, not when already healthy.
428-496
: Comprehensive XML parsing test coverage.The parsing tests appropriately cover:
- Healthy cluster baseline
- Offline nodes
- Standby nodes
- Recent failures with dynamic timestamps
- Invalid/large/missing-state XML edge cases
The tests correctly validate both the parsed status and the specific warnings/errors generated for each scenario.
pkg/tnf/pacemaker/healthcheck.go (5)
349-350
: Consider passing context to recordHealthCheckEvents for consistency.While
recordHealthCheckEvents
doesn't perform I/O or cancellable operations, passing the context parameter would maintain consistency with other methods and enable future additions that might need it (e.g., timeout-sensitive event recording).Based on learnings
452-460
: Node status validation correctly treats standby as offline.The implementation correctly identifies both offline nodes (
online != "true"
) and standby nodes (standby == "true"
) as errors, treating them as equivalent failure conditions. The error message for standby nodes explicitly documents this design decision with "(unexpected behavior; treated as offline)".
360-387
: LGTM! Robust command execution with timeout and error handling.The
getPacemakerStatus
function appropriately:
- Uses a timeout context (10 seconds) to prevent indefinite hangs
- Uses non-interactive sudo (
sudo -n
) to fail fast on missing credentials- Validates XML size to prevent XML bomb attacks
- Returns
Unknown
status with error messages rather than propagating errors, ensuring graceful degradationThe timeout handling explicitly checks for
context.DeadlineExceeded
to provide clear error messages.
408-413
: LGTM! Correctly sets error status when Pacemaker is down.The code properly detects when Pacemaker is not running and sets
OverallStatus
tostatusError
before returning, ensuring the degraded condition will be set appropriately in the controller flow.
505-511
: Hard-coded expectedNodeCount is intentional by design.The use of
expectedNodeCount = 2
is correct for Two-Node OpenShift with Fencing architecture. This is a fundamental constraint, not a limitation to be refactored.Based on learnings
Actionable comments posted: 5
🧹 Nitpick comments (4)
pkg/tnf/pkg/jobs/cronjob_controller.go (1)
167-175
: Prefer structured mutation over text templating (reduces brittleness).Parse the manifest once, then set fields on the CronJob object (name, namespace, schedule, image, command) instead of line‑based replacements. This avoids YAML formatting pitfalls and future template drift.
Example:
cronJob := ReadCronJobV1OrDie(c.manifest) cronJob.Name = c.name cronJob.Namespace = c.namespace cronJob.Spec.Schedule = c.schedule if len(cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers) > 0 { ct := &cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers[0] ct.Image = c.image ct.Command = append([]string(nil), c.command...) } return cronJobpkg/tnf/pacemaker/healthcheck.go (2)
306-309
: Remove unused lastProcessedStatus or use it meaningfully.State is written but never read; drop it or leverage it to suppress duplicate event processing based on CR changes.
Also applies to: 433-436
515-523
: Optional: clarify standby-as-offline in a code comment.Behavior is intentional; a brief inline comment above the loop will aid future readers.
pkg/cmd/pacemakerstatuscollector/collector.go (1)
165-177
: Remove unused Kubernetes client wiring to simplify.You construct a client but don’t use it; updatePacemakerStatusCR rebuilds REST config anyway. Simplify by dropping the client parameter and the extra creation.
Apply:
@@ - // Create Kubernetes client - client, err := createKubernetesClient() - if err != nil { - return fmt.Errorf("failed to create kubernetes client: %w", err) - } - // Update PacemakerStatus CR - if err := updatePacemakerStatusCR(ctx, client, rawXML, summary, collectionError); err != nil { + if err := updatePacemakerStatusCR(ctx, rawXML, summary, collectionError); err != nil { return fmt.Errorf("failed to update PacemakerStatus CR: %w", err) }And change the function signature:
-func updatePacemakerStatusCR(ctx context.Context, client *kubernetes.Clientset, rawXML string, summary v1alpha1.PacemakerSummary, collectionError string) error { +func updatePacemakerStatusCR(ctx context.Context, rawXML string, summary v1alpha1.PacemakerSummary, collectionError string) error {Optionally remove createKubernetesClient if unused elsewhere.
Also applies to: 321-321
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (15)
bindata/tnfdeployment/cronjob.yaml
(1 hunks)cmd/cluster-etcd-operator/main.go
(2 hunks)manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
(1 hunks)pkg/cmd/pacemakerstatuscollector/collector.go
(1 hunks)pkg/operator/starter.go
(1 hunks)pkg/tnf/operator/starter.go
(4 hunks)pkg/tnf/operator/starter_test.go
(4 hunks)pkg/tnf/pacemaker/healthcheck.go
(1 hunks)pkg/tnf/pacemaker/healthcheck_test.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/doc.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/register.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/types.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go
(1 hunks)pkg/tnf/pkg/jobs/batch.go
(1 hunks)pkg/tnf/pkg/jobs/cronjob_controller.go
(1 hunks)
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml
[medium] 1-47: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 1-47: Container should not be privileged
(CKV_K8S_16)
[medium] 1-47: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 1-47: Containers should not share the host process ID namespace
(CKV_K8S_17)
🔇 Additional comments (2)
pkg/tnf/pacemaker/v1alpha1/register.go (1)
24-31
: LGTM: scheme registration is correct.Types registered, GroupVersion wired, and metav1 added. No issues.
pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)
12-121
: Autogenerated deepcopy looks consistent.Methods cover all types; list/items deep-copied properly.
	err = restClient.Get().
		Resource("pacemakerstatuses").
		Name(pacemakerStatusName).
		Do(ctx).
		Into(&existing)

	now := metav1.Now()

	if err != nil {
		// Create new PacemakerStatus if it doesn't exist
		if strings.Contains(err.Error(), "not found") {
			newStatus := &v1alpha1.PacemakerStatus{
				TypeMeta: metav1.TypeMeta{
					APIVersion: v1alpha1.SchemeGroupVersion.String(),
					Kind:       "PacemakerStatus",
				},
				ObjectMeta: metav1.ObjectMeta{
					Name: pacemakerStatusName,
				},
				Status: v1alpha1.PacemakerStatusStatus{
					LastUpdated:     now,
					RawXML:          rawXML,
					CollectionError: collectionError,
					Summary:         summary,
				},
			}

			result := restClient.Post().
				Resource("pacemakerstatuses").
				Body(newStatus).
				Do(ctx)

			if result.Error() != nil {
				return fmt.Errorf("failed to create PacemakerStatus: %w", result.Error())
			}

			klog.Info("Created new PacemakerStatus CR")
			return nil
		}
		return fmt.Errorf("failed to get PacemakerStatus: %w", err)
	}
Use apierrors.IsNotFound instead of string matching.
String checks are brittle. Prefer proper error typing.
Apply:
@@
-import (
+import (
"context"
"encoding/xml"
"fmt"
"os"
"strings"
"time"
"github.com/spf13/cobra"
+ apierrors "k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
@@
- if err != nil {
+ if err != nil {
// Create new PacemakerStatus if it doesn't exist
- if strings.Contains(err.Error(), "not found") {
+ if apierrors.IsNotFound(err) {
newStatus := &v1alpha1.PacemakerStatus{
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In pkg/cmd/pacemakerstatuscollector/collector.go around lines 352 to 392,
replace the brittle string check strings.Contains(err.Error(), "not found") with
a proper API error type check using apierrors.IsNotFound(err); import
"k8s.io/apimachinery/pkg/api/errors" as apierrors if not already imported, and
keep the existing error handling logic path the same (create the new
PacemakerStatus on NotFound, otherwise return the wrapped error).
import (
	"context"
	"fmt"
	"strings"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"

	operatorv1 "github.com/openshift/api/operator/v1"
	"github.com/openshift/library-go/pkg/controller/factory"
	"github.com/openshift/library-go/pkg/operator/events"
	"github.com/openshift/library-go/pkg/operator/v1helpers"
)
🛠️ Refactor suggestion | 🟠 Major
Add missing import for strconv used in escaping.
@@
-import (
+import (
"context"
"fmt"
+ "strconv"
"strings"
"time"
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
import (
	"context"
	"fmt"
	"strconv"
	"strings"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"

	operatorv1 "github.com/openshift/api/operator/v1"
	"github.com/openshift/library-go/pkg/controller/factory"
	"github.com/openshift/library-go/pkg/operator/events"
	"github.com/openshift/library-go/pkg/operator/v1helpers"
)
🤖 Prompt for AI Agents
In pkg/tnf/pkg/jobs/cronjob_controller.go around lines 3 to 19, the code uses
strconv for escaping but the strconv package is not imported; add "strconv" to
the import block (alongside context, fmt, strings, time, etc.) so the file
compiles and the escaping call resolves; reorder/group imports if needed to
satisfy linters.
// injectValues replaces <injected> placeholders in the manifest with actual values
func (c *CronJobController) injectValues(manifest []byte) []byte {
	result := string(manifest)

	// Build command string for YAML array format
	var commandBuilder strings.Builder
	commandBuilder.WriteString("[")
	for i, cmd := range c.command {
		if i > 0 {
			commandBuilder.WriteString(", ")
		}
		commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd))
	}
	commandBuilder.WriteString("]")
	commandStr := commandBuilder.String()

	// Track which line we're on to do context-aware replacement
	lines := strings.Split(result, "\n")
	for i, line := range lines {
		// Replace name (appears in metadata.name)
		if strings.Contains(line, "name: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.name, 1)
		}
		// Replace app label
		if strings.Contains(line, "app.kubernetes.io/name: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.name, 1)
		}
		// Replace schedule
		if strings.Contains(line, "schedule: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.schedule, 1)
		}
		// Replace image
		if strings.Contains(line, "image: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.image, 1)
		}
		// Replace command (special handling for array)
		if strings.Contains(line, "command: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", commandStr, 1)
		}
	}

	return []byte(strings.Join(lines, "\n"))
}
🛠️ Refactor suggestion | 🟠 Major
Fix YAML injection/escaping in command and support namespace injection.
- Current command quoting uses "%s" without escaping; quotes/backslashes in args will break YAML.
- Namespace in manifest isn’t injected; if config.Namespace != manifest’s namespace, create/update can fail (path vs object namespace mismatch).
Apply:
@@
-func (c *CronJobController) injectValues(manifest []byte) []byte {
+func (c *CronJobController) injectValues(manifest []byte) []byte {
result := string(manifest)
// Build command string for YAML array format
- var commandBuilder strings.Builder
- commandBuilder.WriteString("[")
- for i, cmd := range c.command {
- if i > 0 {
- commandBuilder.WriteString(", ")
- }
- commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd))
- }
- commandBuilder.WriteString("]")
- commandStr := commandBuilder.String()
+ var commandBuilder strings.Builder
+ commandBuilder.WriteString("[")
+ for i, cmd := range c.command {
+ if i > 0 {
+ commandBuilder.WriteString(", ")
+ }
+ // Use proper escaping for YAML/JSON-compatible strings
+ commandBuilder.WriteString(strconv.Quote(cmd))
+ }
+ commandBuilder.WriteString("]")
+ commandStr := commandBuilder.String()
@@
for i, line := range lines {
+ // Replace namespace if present in template
+ if strings.Contains(line, "namespace: <injected>") {
+ lines[i] = strings.Replace(line, "<injected>", c.namespace, 1)
+ }
// Replace name (appears in metadata.name)
if strings.Contains(line, "name: <injected>") {
lines[i] = strings.Replace(line, "<injected>", c.name, 1)
}
Note: add strconv import (see separate comment).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// injectValues replaces <injected> placeholders in the manifest with actual values
func (c *CronJobController) injectValues(manifest []byte) []byte {
	result := string(manifest)
	// Build command string for YAML array format
	var commandBuilder strings.Builder
	commandBuilder.WriteString("[")
	for i, cmd := range c.command {
		if i > 0 {
			commandBuilder.WriteString(", ")
		}
		// Use proper escaping for YAML/JSON-compatible strings
		commandBuilder.WriteString(strconv.Quote(cmd))
	}
	commandBuilder.WriteString("]")
	commandStr := commandBuilder.String()
	// Track which line we're on to do context-aware replacement
	lines := strings.Split(result, "\n")
	for i, line := range lines {
		// Replace namespace if present in template
		if strings.Contains(line, "namespace: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.namespace, 1)
		}
		// Replace name (appears in metadata.name)
		if strings.Contains(line, "name: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.name, 1)
		}
		// Replace app label
		if strings.Contains(line, "app.kubernetes.io/name: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.name, 1)
		}
		// Replace schedule
		if strings.Contains(line, "schedule: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.schedule, 1)
		}
		// Replace image
		if strings.Contains(line, "image: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", c.image, 1)
		}
		// Replace command (special handling for array)
		if strings.Contains(line, "command: <injected>") {
			lines[i] = strings.Replace(line, "<injected>", commandStr, 1)
		}
	}
	return []byte(strings.Join(lines, "\n"))
}
🤖 Prompt for AI Agents
In pkg/tnf/pkg/jobs/cronjob_controller.go around lines 177 to 219, the command
array is built with fmt.Sprintf("%s") which doesn't escape quotes/backslashes
and will break YAML; also the manifest never injects a namespace. Change the
command builder to call strconv.Quote on each c.command element when
constructing commandStr (so arguments are properly escaped and wrapped), add
strconv to the import list, and add a replacement branch to replace "namespace:
<injected>" with c.namespace (or the controller's namespace field) so the
manifest namespace is injected consistently.
4aefd7d to b553bf0 (Compare)
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: jaypoulz. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Actionable comments posted: 1
♻️ Duplicate comments (4)
pkg/tnf/pkg/jobs/cronjob_controller.go (3)
3-19
: Missing strconv import needed for proper YAML escaping.The injectValues method at line 191 constructs YAML array strings using
fmt.Sprintf("%s")
without proper escaping. Special characters in command arguments (quotes, backslashes) will break YAML parsing. Importstrconv
and usestrconv.Quote
for escaping.
48-48
: Misleading cron schedule example includes seconds field.Kubernetes CronJob uses standard 5-field cron expressions (minute hour day month weekday), not 6-field (with seconds). The example
"*/30 * * * * *"
is invalid. Use"*/30 * * * *"
for every 30 minutes.
180-222
: Fix YAML injection vulnerabilities and add namespace injection.Three critical issues remain from previous review:
Command escaping (line 191): Using
fmt.Sprintf("%s")
without escaping will break YAML when command args contain quotes or backslashes. Usestrconv.Quote(cmd)
instead.Namespace not injected: The manifest likely has a
namespace: <injected>
placeholder that needs replacement, but the loop has no case for it. Add namespace replacement or the CronJob will be created in wrong namespace.Schedule escaping (line 209): Wrapping in quotes helps but doesn't escape special chars in the schedule value itself.
Apply this diff:
import ( "context" "fmt" + "strconv" "strings" "time"
for i, line := range lines { + // Replace namespace if present in template + if strings.Contains(line, "namespace: <injected>") { + lines[i] = strings.Replace(line, "<injected>", c.namespace, 1) + } // Replace name (appears in metadata.name) if strings.Contains(line, "name: <injected>") { lines[i] = strings.Replace(line, "<injected>", c.name, 1)for i, cmd := range c.command { if i > 0 { commandBuilder.WriteString(", ") } - commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd)) + commandBuilder.WriteString(strconv.Quote(cmd)) }pkg/cmd/pacemakerstatuscollector/collector.go (1)
321-414
: Replace string matching with proper error type check.At line 362, the code uses brittle string matching
strings.Contains(err.Error(), "not found")
to detect NotFound errors. Useapierrors.IsNotFound(err)
instead.Apply this diff:
import ( "context" "encoding/xml" "fmt" "os" "strings" "time" "github.com/spf13/cobra" + apierrors "k8s.io/apimachinery/pkg/api/errors" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
if err != nil { // Create new PacemakerStatus if it doesn't exist - if strings.Contains(err.Error(), "not found") { + if apierrors.IsNotFound(err) { newStatus := &v1alpha1.PacemakerStatus{
🧹 Nitpick comments (1)
pkg/cmd/pacemakerstatuscollector/collector.go (1)
52-138
: Acceptable duplication to avoid circular dependencies.XML types are duplicated from healthcheck.go as noted in the comment. Consider extracting to a shared internal package in future refactoring to reduce duplication.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (15)
bindata/tnfdeployment/cronjob.yaml
(1 hunks)cmd/cluster-etcd-operator/main.go
(2 hunks)manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
(1 hunks)pkg/cmd/pacemakerstatuscollector/collector.go
(1 hunks)pkg/operator/starter.go
(1 hunks)pkg/tnf/operator/starter.go
(4 hunks)pkg/tnf/operator/starter_test.go
(4 hunks)pkg/tnf/pacemaker/healthcheck.go
(1 hunks)pkg/tnf/pacemaker/healthcheck_test.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/doc.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/register.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/types.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go
(1 hunks)pkg/tnf/pkg/jobs/batch.go
(1 hunks)pkg/tnf/pkg/jobs/cronjob_controller.go
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- pkg/tnf/pacemaker/v1alpha1/doc.go
- pkg/operator/starter.go
- pkg/tnf/pkg/jobs/batch.go
- cmd/cluster-etcd-operator/main.go
- manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
- pkg/tnf/operator/starter_test.go
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml
[medium] 1-47: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 1-47: Container should not be privileged
(CKV_K8S_16)
[medium] 1-47: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 1-47: Containers should not share the host process ID namespace
(CKV_K8S_17)
🔇 Additional comments (42)
pkg/tnf/pkg/jobs/cronjob_controller.go (7)
53-88
: LGTM!The constructor properly sets defaults, wires dependencies, and follows library-go factory patterns.
90-111
: LGTM!The sync logic correctly checks operator management state and handles errors appropriately.
113-132
: LGTM!The reconciliation logic follows standard Kubernetes controller patterns.
134-150
: LGTM!Creation logic properly handles AlreadyExists and records appropriate events.
152-165
: LGTM!Update logic correctly preserves ResourceVersion and records events.
167-178
: LGTM!Build logic is clear and includes helpful debug logging.
224-249
: LGTM!The update detection logic correctly compares schedule, command, and image with proper bounds checking.
pkg/tnf/pacemaker/healthcheck.go (22)
28-99
: LGTM!Constants are well-defined. The
StatusError
constant was appropriately added. TheexpectedNodeCount = 2
is intentional per the Two-Node OpenShift architecture design.Based on learnings.
101-273
: LGTM!XML parsing structures are comprehensive and follow standard Go XML unmarshaling conventions.
274-288
: LGTM!HealthStatus model and helper function are appropriate. Unknown status for collection failures is intentional per maintainer decision.
290-309
: LGTM!The HealthCheck structure is well-designed with proper mutex-protected state for concurrent access.
311-383
: LGTM!The constructor properly configures the REST client, informer, and controller factory with appropriate wiring.
385-405
: LGTM!The sync function follows a clear flow: retrieve status, update operator conditions, record events.
407-450
: LGTM!Status retrieval logic properly validates staleness and enforces XML size limits. The maxXMLSize check at line 439 prevents XML bomb parsing exploits; while the data is already in memory, this is acceptable for internal trusted sources.
452-499
: LGTM!XML parsing correctly validates structure and sets
StatusError
when pacemaker is not running, addressing prior review feedback.
501-524
: LGTM!Node status collection correctly flags offline and standby nodes as errors, consistent with Two-Node OpenShift design where standby is treated as offline.
Based on learnings.
526-566
: LGTM!Resource status collection correctly tracks started resources across nodes with clear helper function structure.
568-574
: LGTM!Resource count validation correctly uses the intentional
expectedNodeCount
constant.Based on learnings.
576-600
: LGTM!Recent failure detection correctly uses OR logic (line 584) per prior review. Timestamp parse errors are logged and skipped, which is appropriate defensive behavior.
602-619
: LGTM!Fencing event collection follows consistent patterns with appropriate time-based filtering.
621-634
: LGTM!Overall status determination follows clear precedence: errors > warnings > healthy.
636-653
: LGTM!Operator status update correctly sets degraded condition for
StatusError
, addressing prior review feedback.
655-664
: LGTM!Status message logging is clear and appropriate.
666-680
: LGTM!Degraded condition setting uses a scoped condition type (
PacemakerHealthCheckDegraded
at line 91), which prevents conflicts with other controllers' degraded conditions.
682-689
: LGTM!Condition clearing logic is correct.
691-701
: LGTM!Operator condition update uses standard library-go helpers correctly.
703-728
: LGTM!Event deduplication logic correctly cleans up stale entries and enforces time-based windows. The cleanup at lines 710-716 executes once per call, not per iteration of a caller's loop.
730-769
: LGTM!Event recording implements appropriate deduplication with context-aware windows (longer for fencing) and only records healthy events on transitions.
771-781
: LGTM!Warning event categorization correctly routes to appropriate event reasons.
pkg/tnf/pacemaker/v1alpha1/types.go (5)
1-5
: LGTM!Package structure and imports are standard for Kubernetes API types.
7-21
: LGTM!PacemakerStatus type follows Kubernetes CRD best practices with appropriate kubebuilder markers and subresource support.
23-50
: LGTM!Spec and Status types are well-designed with appropriate fields for status collection and error reporting. The RawXML field enables efficient parsing without re-executing commands.
52-77
: LGTM!PacemakerSummary provides a comprehensive high-level cluster state overview with well-documented fields.
79-86
: LGTM!PacemakerStatusList follows standard Kubernetes list type conventions.
pkg/cmd/pacemakerstatuscollector/collector.go (8)
1-50
: LGTM!Package structure, imports, and constants are well-organized. The pcs command uses
sudo -n
for non-interactive execution, which is good practice.
140-153
: LGTM!Command constructor follows standard cobra patterns with appropriate error handling.
155-177
: LGTM!The collector orchestration is clean with appropriate timeout and error handling.
179-217
: LGTM!Status collection properly handles timeouts, validates XML size, and gracefully handles parse errors while preserving raw XML.
219-265
: LGTM!Summary building logic correctly processes all node and resource states with clear counting logic.
267-286
: LGTM!Recent failure detection uses correct OR logic for status checks and handles timestamp parsing gracefully.
288-301
: LGTM!Recent fencing detection follows consistent patterns with appropriate time-based filtering.
303-319
: LGTM!Kubernetes client creation follows standard patterns with in-cluster and kubeconfig fallback.
b553bf0 to d99fc51 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (5)
pkg/tnf/operator/starter.go (1)
346-351
: Cron expression is invalid for Kubernetes CronJobs
"*/30 * * * * *"
has six fields (seconds). Kubernetesbatch/v1
CronJobs only accept the standard five-field spec, so this will be rejected and the collector never runs. Use a valid five-field schedule (e.g.,"* * * * *"
for once per minute and loop inside the job for 30‑second sampling) or another mechanism that supports sub-minute intervals.- Schedule: "*/30 * * * * *", // Every 30 seconds (Kubernetes CronJob format with seconds) + Schedule: "* * * * *", // Every minute; handle 30s cadence inside the job if requiredpkg/cmd/pacemakerstatuscollector/collector.go (1)
358-391
: Use API error helpers instead of string matching.
strings.Contains(err.Error(), "not found")
is fragile (locale/log variations) and can misclassify errors. Switch toapierrors.IsNotFound(err)
and importk8s.io/apimachinery/pkg/api/errors
.- if err != nil { - // Create new PacemakerStatus if it doesn't exist - if strings.Contains(err.Error(), "not found") { + if err != nil { + // Create new PacemakerStatus if it doesn't exist + if apierrors.IsNotFound(err) {pkg/tnf/pkg/jobs/cronjob_controller.go (3)
3-19
: Missing strconv import.The previous review comment about the missing
strconv
import still applies. This import is required for the proper escaping fix suggested in theinjectValues
method.
48-51
: Cron schedule comment is still misleading.The previous review comment about the incorrect 6-field cron expression example still applies. Kubernetes CronJob expects a 5-field expression.
180-222
: YAML injection and namespace issues remain.The previous review comments about improper escaping in command building and missing namespace injection still apply.
🧹 Nitpick comments (2)
pkg/tnf/operator/starter_test.go (1)
292-301
: Make the retry expectation observable
expectRetries
is populated but never asserted, so the test still passes even ifhandleNodesWithRetry
never performs additional attempts. Either surface the captured attempt counter (e.g., return it from the setup helper) and assert it againstexpectRetries
, or drop the field to avoid implying coverage that isn’t there.pkg/tnf/pkg/jobs/cronjob_controller.go (1)
90-111
: Sync method correctly checks operator state.The implementation appropriately verifies the operator is in Managed state before ensuring the CronJob. The early return for non-Managed states is a sound approach.
Optional: Consider whether existing CronJobs should be cleaned up when the operator transitions to Unmanaged or Removed states. The current behavior preserves existing resources, which may be intentional.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (19)
bindata/tnfdeployment/cronjob.yaml
(1 hunks)cmd/cluster-etcd-operator/main.go
(2 hunks)manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
(1 hunks)pkg/cmd/pacemakerstatuscollector/collector.go
(1 hunks)pkg/operator/starter.go
(1 hunks)pkg/tnf/operator/starter.go
(9 hunks)pkg/tnf/operator/starter_test.go
(6 hunks)pkg/tnf/pacemaker/healthcheck.go
(1 hunks)pkg/tnf/pacemaker/healthcheck_test.go
(1 hunks)pkg/tnf/pacemaker/testdata/healthy_cluster.xml
(1 hunks)pkg/tnf/pacemaker/testdata/offline_node.xml
(1 hunks)pkg/tnf/pacemaker/testdata/recent_failures_template.xml
(1 hunks)pkg/tnf/pacemaker/testdata/standby_node.xml
(1 hunks)pkg/tnf/pacemaker/v1alpha1/doc.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/register.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/types.go
(1 hunks)pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go
(1 hunks)pkg/tnf/pkg/jobs/batch.go
(1 hunks)pkg/tnf/pkg/jobs/cronjob_controller.go
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- pkg/tnf/pacemaker/testdata/recent_failures_template.xml
- pkg/operator/starter.go
- pkg/tnf/pacemaker/testdata/offline_node.xml
- pkg/tnf/pacemaker/v1alpha1/register.go
- pkg/tnf/pacemaker/v1alpha1/doc.go
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml
[medium] 1-47: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 1-47: Container should not be privileged
(CKV_K8S_16)
[medium] 1-47: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 1-47: Containers should not share the host process ID namespace
(CKV_K8S_17)
🔇 Additional comments (8)
pkg/tnf/pkg/jobs/cronjob_controller.go (8)
21-27
: Constants are well-defined.The default resync interval and namespace values are appropriate for CronJob lifecycle management.
29-42
: Controller structure is well-organized.The
CronJobController
struct appropriately encapsulates all necessary clients and configuration for managing CronJob resources.
53-88
: Constructor is well-implemented.The constructor properly handles defaults, follows the factory pattern, and correctly sets up the controller with informers and sync intervals.
113-132
: CronJob reconciliation logic is sound.The method correctly handles the create-or-update pattern with appropriate error handling for NotFound cases.
134-150
: CronJob creation is well-handled.The method correctly creates the CronJob, handles AlreadyExists errors gracefully, and records events appropriately.
152-165
: CronJob update correctly preserves ResourceVersion.The method appropriately preserves the existing ResourceVersion before updating, which is essential for conflict-free updates.
167-178
: Manifest processing approach is sound.The method correctly injects values and parses the manifest. The debug logging at V(4) is a helpful troubleshooting aid.
224-249
: Update detection logic is correct and defensive.The method appropriately compares schedule, command, and image fields, with proper defensive checks for container existence before accessing array elements.
pkg/tnf/pacemaker/healthcheck.go
Outdated
	// Get pacemaker status
	healthStatus, err := c.getPacemakerStatus(ctx)
	if err != nil {
		return err
	}

	// Update operator status conditions based on pacemaker status
	if err := c.updateOperatorStatus(ctx, healthStatus); err != nil {
		return err
	}

	// Record pacemaker health check events
	c.recordHealthCheckEvents(healthStatus)
Stale/failed collections never set degraded.
When getPacemakerStatus returns StatusUnknown (collection error, stale XML, oversized payload, etc.), updateOperatorStatus hits the default branch and leaves the degraded condition untouched. That masks genuine outages (e.g., pcs unreachable) and contradicts the controller's goal of surfacing health issues. Please map collection failures to StatusError (either in getPacemakerStatus or when propagating errors) so we call setDegradedCondition.
🤖 Prompt for AI Agents
In pkg/tnf/pacemaker/healthcheck.go around lines 390 to 403, when
getPacemakerStatus returns StatusUnknown collection failures are not mapped to
degraded and thus updateOperatorStatus leaves degraded untouched; change the
propagation so that any StatusUnknown (collection error, stale/oversized
payload, unreachable pcs) is converted to StatusError before calling
updateOperatorStatus (or update updateOperatorStatus to treat StatusUnknown the
same as StatusError) so setDegradedCondition is invoked; implement the mapping
immediately after getPacemakerStatus returns (if healthStatus == StatusUnknown {
healthStatus = StatusError }) or add a case in updateOperatorStatus handling
StatusUnknown -> setDegradedCondition.
This is intentional. We only want to update the operator status if we have a definitive answer regarding the pacemaker status. Failure to collect an up-to-date status means that we cannot definitively determine our current state, so the safest thing to do is to leave our last known status.
@jaypoulz Thank you for the clarification! That makes sense — holding the last known status when collection fails is a reasonable approach to avoid flapping the degraded condition during transient issues. I understand the design decision now.
Change List
- Added a PacemakerHealthCheck controller which monitors pacemaker cluster status every 30 seconds by executing 'pcs status --full' and parsing node/resource health
- Updated operator condition to degraded when nodes are not online or critical resources (kubelet/etcd) are not started on all nodes
- Added event recording for warnings of recent failed resource actions (5min window) and fencing events (24hr window) with appropriate event reasons
- Included happy path testing
d99fc51 to 59eaec8 (Compare)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/tnf/operator/starter_test.go (1)
18-20
: Remove duplicate import of the same packageThe same import path is aliased twice (config/v1). Use a single alias.
Apply this diff:
- configv1informers "github.com/openshift/client-go/config/informers/externalversions/config/v1" - v1 "github.com/openshift/client-go/config/informers/externalversions/config/v1" + configv1informers "github.com/openshift/client-go/config/informers/externalversions/config/v1" @@ - networkInformer v1.NetworkInformer + networkInformer configv1informers.NetworkInformerAlso applies to: 55-56
♻️ Duplicate comments (5)
pkg/tnf/pkg/jobs/cronjob_controller.go (3)
3-19
: Add strconv import for proper command escaping.The
injectValues
method (line 191) builds command strings without proper escaping. To safely handle command arguments containing quotes, backslashes, or other special characters, you'll need thestrconv
package.Add the import:
import ( "context" "fmt" + "strconv" "strings" "time"
Then use
strconv.Quote(cmd)
in the command builder (see separate comment on lines 180-222).
48-48
: Fix cron schedule comment: Kubernetes uses 5 fields, not 6.The example
"*/30 * * * * *"
shows 6 fields (including seconds), but Kubernetes CronJob uses the standard 5-field cron format (minute, hour, day, month, weekday). This will confuse users.Update the comment:
- Schedule string // Cron schedule expression (e.g., "*/30 * * * * *" for every 30 seconds) + Schedule string // Cron schedule expression (e.g., "*/30 * * * *" for every 30 minutes)
180-222
: Critical: Fix command escaping and add namespace injection.Two issues in
injectValues
:
Command escaping: Line 191 uses
fmt.Sprintf('"%s"', cmd)
which doesn't escape quotes or backslashes. If any command argument contains"
or\
, the generated YAML will be malformed.Missing namespace injection: The manifest has
namespace: openshift-etcd
(line 5 in cronjob.yaml), but this is never replaced withc.namespace
, causing a mismatch whenCronJobConfig.Namespace
differs from the hardcoded value.Apply this fix:
func (c *CronJobController) injectValues(manifest []byte) []byte { result := string(manifest) // Build command string for YAML array format var commandBuilder strings.Builder commandBuilder.WriteString("[") for i, cmd := range c.command { if i > 0 { commandBuilder.WriteString(", ") } - commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd)) + // Use strconv.Quote for proper YAML/JSON escaping + commandBuilder.WriteString(strconv.Quote(cmd)) } commandBuilder.WriteString("]") commandStr := commandBuilder.String() // Track which line we're on to do context-aware replacement lines := strings.Split(result, "\n") for i, line := range lines { + // Replace namespace + if strings.Contains(line, "namespace: <injected>") { + lines[i] = strings.Replace(line, "<injected>", c.namespace, 1) + } // Replace name (appears in metadata.name) if strings.Contains(line, "name: <injected>") {Note: This requires the
strconv
import (see separate comment) and changing line 5 in cronjob.yaml tonamespace: <injected>
.pkg/tnf/operator/starter.go (1)
334-364
: Cron schedule fix LGTMSchedule now uses valid 5-field spec ("* * * * *"). This resolves the prior invalid 6-field seconds issue.
pkg/cmd/pacemakerstatuscollector/collector.go (1)
360-392
: Use apierrors.IsNotFound instead of string matching.String checks are brittle; prefer typed API errors.
-import ( +import ( ... - "strings" ... + apierrors "k8s.io/apimachinery/pkg/api/errors" ) ... - if err != nil { + if err != nil { // Create new PacemakerStatus if it doesn't exist - if strings.Contains(err.Error(), "not found") { + if apierrors.IsNotFound(err) { newStatus := &v1alpha1.PacemakerStatus{ ... } ... } return fmt.Errorf("failed to get PacemakerStatus: %w", err) }
🧹 Nitpick comments (20)
manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (2)
32-39
: Make spec.nodeName requiredControllers likely rely on spec.nodeName; encode that contract in the CRD for early validation.
Apply this diff:
spec: description: PacemakerStatusSpec defines the desired state of PacemakerStatus type: object + required: + - nodeName properties: nodeName: description: NodeName identifies which node this status is for type: string
15-18
: Add printer columns for quick status at a glanceImprove kubectl get output for operators and SREs.
Apply this diff:
versions: - name: v1alpha1 served: true storage: true + additionalPrinterColumns: + - name: Node + type: string + jsonPath: .spec.nodeName + - name: Pacemakerd + type: string + jsonPath: .status.summary.pacemakerdState + - name: Quorum + type: string + jsonPath: .status.summary.hasQuorum + - name: Nodes + type: string + jsonPath: .status.summary.nodesOnline + description: Online/Total + priority: 1 + - name: Updated + type: date + jsonPath: .status.lastUpdatedAlso applies to: 58-85
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)
1-1
: Optional: align command attributionThe request attribute shows crm_mon XML while docs mention pcs status; consider noting pcs→crm_mon indirection or aligning text for consistency.
pkg/tnf/operator/starter_test.go (1)
260-271
: Avoid live API dependencies in unit tests. runPacemakerControllers is started and the health checker may construct clients from KubeConfig. A bare rest.Config{Host: "..."} can cause dial/SSL errors and flakiness.
Options:
- Inject a dynamic.Interface into the health checker and pass fakedynamic in tests.
- Or gate runPacemakerControllers behind a test knob (function var) so unit tests stub it out.
If you keep KubeConfig in tests, set it to a dummy and ensure health code never dials when running under fake clients.
Also applies to: 96-110
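As a sketch of the second option, this is the general function-variable pattern rather than the actual starter.go code; every name below is illustrative:

// Gate the startup call behind a package-level function variable so unit
// tests can swap in a stub instead of dialing a real API server.
var startPacemakerControllers = func(ctx context.Context) error {
	// production wiring goes here
	return nil
}

func handleClusters(ctx context.Context) error {
	// ... other startup work ...
	return startPacemakerControllers(ctx)
}

// In the test file:
func TestHandleClustersWithoutLiveAPI(t *testing.T) {
	orig := startPacemakerControllers
	startPacemakerControllers = func(ctx context.Context) error { return nil } // stub, no API calls
	defer func() { startPacemakerControllers = orig }()

	if err := handleClusters(context.Background()); err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
}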
pkg/tnf/operator/starter.go (2)
338-352
: 30-second sampling claim vs 1-minute Cron. PR text says "every 30 seconds," but Cron runs every minute. Ensure the collector loops internally (e.g., two samples with 30s sleep) or update docs.
If needed, document the intra-job loop or add an explicit flag to the collector for 30s cadence.
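If the intra-job loop route is taken, a minimal sketch of that loop could be the following; the sample count, interval, and collect callback are assumptions, not flags the collector currently has:

// Sketch: take two samples per one-minute CronJob run to approximate a 30s cadence.
func runCollectionLoop(ctx context.Context, samples int, interval time.Duration, collect func(context.Context) error) error {
	for i := 0; i < samples; i++ {
		if err := collect(ctx); err != nil {
			return err
		}
		if i < samples-1 {
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(interval):
			}
		}
	}
	return nil
}

// e.g. runCollectionLoop(ctx, 2, 30*time.Second, collectAndUpdateStatus)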
355-364
: Defensive start for health controller. If controllerContext.KubeConfig is nil or misconfigured, NewHealthCheck/Run should no-op or degrade gracefully to avoid tight error loops.
Add a nil/validation check and early return with a Warning event.
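A minimal guard along those lines, sketched with assumed names (the recorder call follows library-go's events.Recorder interface, but the function itself is not from this PR):

// Sketch: skip starting the health check instead of crash-looping on bad config.
func startPacemakerHealthCheck(ctx context.Context, cfg *rest.Config, recorder events.Recorder) {
	if cfg == nil {
		recorder.Warningf("PacemakerHealthCheckSkipped", "no KubeConfig available; pacemaker health check not started")
		return
	}
	// ... construct the health check controller and call Run(ctx) here ...
}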
pkg/tnf/pacemaker/healthcheck_test.go (5)
31-61
: Mark helpers as test helpers.Call t.Helper() in helper functions to improve failure locations and readability.
func createTestHealthCheck() *HealthCheck { + testingT, _ := any(nil).(interface{ Helper() }) + if testingT != nil { + testingT.Helper() + }Apply similarly in createTestHealthCheckWithMockStatus, loadTestXML, and getRecentFailuresXML.
119-140
: Stabilize time-dependent tests with an injected clock.Using time.Now() risks flakes around cutoff windows. Prefer a controllable clock or pass times into helpers.
-func getRecentFailuresXML(t *testing.T) string { +func getRecentFailuresXMLAt(t *testing.T, now time.Time) string { // Load the template... - now := time.Now() recentTime := now.Add(-1 * time.Minute)... ... return xmlContent } + +func getRecentFailuresXML(t *testing.T) string { + return getRecentFailuresXMLAt(t, time.Now()) +}
277-310
: Add coverage for Unknown → no condition change (locks intent).Given Unknown is intentional for collection/staleness, add a subtest asserting degraded condition is not updated.
// Test warning status (should not update operator condition) status = &HealthStatus{ OverallStatus: statusWarning, Warnings: []string{"Test warning"}, Errors: []string{}, } err = controller.updateOperatorStatus(ctx, status) require.NoError(t, err, "updateOperatorAvailability should not return an error") + + // Test unknown status (should not update operator condition) + status = &HealthStatus{ + OverallStatus: statusUnknown, + Warnings: []string{}, + Errors: []string{"collection failed"}, + } + err = controller.updateOperatorStatus(ctx, status) + require.NoError(t, err)
338-405
: Avoid brittle concrete type assertion for recorder.Casting to events.InMemoryRecorder couples tests to implementation. Prefer asserting via the public Recorder and inspecting emitted events through its API or provide an accessor on HealthCheck for tests.
862-873
: Test the max-size guard via getPacemakerStatus path too.parsePacemakerStatusXML doesn’t enforce size; the guard lives in getPacemakerStatus. Add a test invoking getPacemakerStatus with oversized RawXML to validate handling.
pkg/tnf/pacemaker/healthcheck.go (5)
428-447
: Explicit size check in parser for direct calls.parsePacemakerStatusXML is callable directly (tests do) but doesn’t enforce maxXMLSize; add a guard to fail fast on oversized input as in getPacemakerStatus.
func (c *HealthCheck) parsePacemakerStatusXML(xmlOutput string) (*HealthStatus, error) { status := &HealthStatus{ OverallStatus: statusUnknown, Warnings: []string{}, Errors: []string{} } + if len(xmlOutput) > maxXMLSize { + return nil, fmt.Errorf("XML output too large: %d bytes (max: %d bytes)", len(xmlOutput), maxXMLSize) + } // Parse XML var result PacemakerResult
509-523
: Document standby-as-offline behavior inline.Add a brief comment so future maintainers don’t “fix” this intentionally strict check.
for _, node := range result.Nodes.Node { - if node.Online != booleanValueTrue { + // Treat standby as offline by design (standby nodes must not run critical resources) + if node.Online != booleanValueTrue { status.Errors = append(status.Errors, fmt.Sprintf("Node %s is not online", node.Name)) } if node.Standby == booleanValueTrue { status.Errors = append(status.Errors, fmt.Sprintf("Node %s is in standby (unexpected behavior; treated as offline)", node.Name)) } }
428-431
: Make staleness threshold a named constant tied to resync.2 minutes is embedded; derive from the controller resync or expose a constant for coherence and easier tuning.
const ( ... healthCheckResyncInterval = 30 * time.Second + statusStaleAfter = 2 * time.Minute // consider 4 * healthCheckResyncInterval ) ... - if time.Since(result.Status.LastUpdated.Time) > 2*time.Minute { + if time.Since(result.Status.LastUpdated.Time) > statusStaleAfter {
703-729
: Run dedupe cleanup once per call, not per event.Cleanup executes for every event, adding overhead. Do it once at the start of recordHealthCheckEvents.
-func (c *HealthCheck) shouldRecordEvent(eventKey string, deduplicationWindow time.Duration) bool { - c.recordedEventsMu.Lock() - defer c.recordedEventsMu.Unlock() - now := time.Now() - // Clean up old entries... - maxWindow := eventDeduplicationWindowFencing - for key, timestamp := range c.recordedEvents { - if now.Sub(timestamp) > maxWindow { - delete(c.recordedEvents, key) - } - } +func (c *HealthCheck) shouldRecordEvent(eventKey string, deduplicationWindow time.Duration) bool { + c.recordedEventsMu.Lock() + defer c.recordedEventsMu.Unlock() + now := time.Now() // Check if this event was recently recorded if lastRecorded, exists := c.recordedEvents[eventKey]; exists { if now.Sub(lastRecorded) < deduplicationWindow { return false } } c.recordedEvents[eventKey] = now return true } func (c *HealthCheck) recordHealthCheckEvents(status *HealthStatus) { + // One-time cleanup per invocation + c.recordedEventsMu.Lock() + now := time.Now() + for key, ts := range c.recordedEvents { + if now.Sub(ts) > eventDeduplicationWindowFencing { + delete(c.recordedEvents, key) + } + } + c.recordedEventsMu.Unlock() // Record events for warnings...Also applies to: 730-781
101-221
: Avoid struct duplication across packages.Pacemaker XML structs are duplicated here and in the collector; they will drift. Extract them into a shared internal/package and reuse.
Also applies to: 52-60
pkg/cmd/pacemakerstatuscollector/collector.go (4)
24-50
: De-duplicate shared constants/types to prevent drift.maxXMLSize, time formats, and XML structs exist here and in healthcheck.go. Move them into a common package (e.g., pkg/tnf/pacemaker/internal/common) and import from both callers.
Also applies to: 31-33, 75-82
179-217
: Stream-limit pcs output to reduce peak memory. If the exec package supports it, prefer a bounded reader or line-capped read to avoid large buffers; otherwise consider enhancing exec.Execute to enforce an output cap.
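If the exec helper can hand back the command's stdout as a stream, one generic way to cap memory is io.LimitReader; this sketch is not based on the repo's actual exec package:

// Sketch: read at most maxBytes+1 bytes so oversized output is detected
// without ever buffering an unbounded amount of data.
func readCapped(r io.Reader, maxBytes int64) (string, error) {
	limited := io.LimitReader(r, maxBytes+1)
	data, err := io.ReadAll(limited)
	if err != nil {
		return "", err
	}
	if int64(len(data)) > maxBytes {
		return "", fmt.Errorf("pcs output exceeds %d bytes", maxBytes)
	}
	return string(data), nil
}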
267-286
: Be tolerant to rc_text casing.Some environments emit RCText “OK”. Compare case-insensitively to avoid false positives.
- if operation.RC != operationRCSuccess || operation.RCText != operationRCTextSuccess { + if operation.RC != operationRCSuccess || !strings.EqualFold(operation.RCText, operationRCTextSuccess) {
303-319
: HOME may be empty; add fallback. When building the kubeconfig path, guard against an empty HOME for non-login shells.
-	kubeconfig := os.Getenv("KUBECONFIG")
-	if kubeconfig == "" {
-		kubeconfig = os.Getenv("HOME") + "/.kube/config"
-	}
+	kubeconfig := os.Getenv("KUBECONFIG")
+	if kubeconfig == "" {
+		if home := os.Getenv("HOME"); home != "" {
+			kubeconfig = home + "/.kube/config"
+		}
+	}
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (19)
- bindata/tnfdeployment/cronjob.yaml (1 hunks)
- cmd/cluster-etcd-operator/main.go (2 hunks)
- manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
- pkg/cmd/pacemakerstatuscollector/collector.go (1 hunks)
- pkg/operator/starter.go (1 hunks)
- pkg/tnf/operator/starter.go (4 hunks)
- pkg/tnf/operator/starter_test.go (4 hunks)
- pkg/tnf/pacemaker/healthcheck.go (1 hunks)
- pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
- pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1 hunks)
- pkg/tnf/pacemaker/testdata/offline_node.xml (1 hunks)
- pkg/tnf/pacemaker/testdata/recent_failures_template.xml (1 hunks)
- pkg/tnf/pacemaker/testdata/standby_node.xml (1 hunks)
- pkg/tnf/pacemaker/v1alpha1/doc.go (1 hunks)
- pkg/tnf/pacemaker/v1alpha1/register.go (1 hunks)
- pkg/tnf/pacemaker/v1alpha1/types.go (1 hunks)
- pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
- pkg/tnf/pkg/jobs/batch.go (1 hunks)
- pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- pkg/tnf/pacemaker/testdata/recent_failures_template.xml
🚧 Files skipped from review as they are similar to previous changes (6)
- cmd/cluster-etcd-operator/main.go
- pkg/tnf/pacemaker/v1alpha1/doc.go
- pkg/operator/starter.go
- pkg/tnf/pacemaker/testdata/offline_node.xml
- pkg/tnf/pacemaker/v1alpha1/types.go
- pkg/tnf/pacemaker/testdata/standby_node.xml
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml
[medium] 1-47: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 1-47: Container should not be privileged
(CKV_K8S_16)
[medium] 1-47: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 1-47: Containers should not share the host process ID namespace
(CKV_K8S_17)
🔇 Additional comments (5)
pkg/tnf/pkg/jobs/batch.go (1)
42-48
: LGTM: CronJob decoder follows established pattern.The function correctly mirrors
ReadJobV1OrDie
and reuses the existing codec factory. The panic-on-error behavior is appropriate for the "OrDie" convention.
bindata/tnfdeployment/cronjob.yaml (1)
35-40
: Verify privileged settings are necessary for pacemaker operations.The static analysis tools flag the privileged security context, privilege escalation, and
hostPID: true
as security concerns. For pacemaker health checks that executepcs status
, these elevated privileges are likely necessary. However, please confirm:
- Does
pcs status --full
require privileged access and hostPID?- Can the command run with a more restrictive security context?
- Are there audit/compliance requirements that need documentation for these privileges?
If these privileges are confirmed necessary, consider adding a comment in the manifest explaining why elevated privileges are required.
pkg/tnf/pacemaker/testdata/healthy_cluster.xml (1)
1-87
: Fixture looks good. Well-formed healthy-state XML; suitable for happy-path tests.
pkg/tnf/pacemaker/v1alpha1/register.go (1)
14-31
: Scheme registration looks correct. GroupVersion, AddKnownTypes, and AddToScheme are wired properly.
pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)
1-121
: Generated deepcopy looks correct. Covers nested structs, slices, and runtime.Object; no issues spotted.
bindata/tnfdeployment/cronjob.yaml
Outdated
kind: CronJob
metadata:
  name: <injected>
  namespace: openshift-etcd
Hardcoded namespace creates inconsistency with controller.
The namespace is hardcoded to openshift-etcd
, but the CronJobController
accepts a Namespace
field in CronJobConfig
(line 47 in cronjob_controller.go). The injectValues
method (lines 180-222 in cronjob_controller.go) doesn't replace the namespace placeholder, so if a caller provides a different namespace, the manifest and controller will be out of sync.
Change line 5 to use a placeholder:
- namespace: openshift-etcd
+ namespace: <injected>
Then update injectValues
in cronjob_controller.go to inject the namespace (this is already flagged in past review comments for that file).
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In bindata/tnfdeployment/cronjob.yaml around line 5, the namespace is hardcoded
to "openshift-etcd" which will diverge from the CronJobController's configurable
Namespace; replace that line with a namespace placeholder (e.g. namespace: {{
.Namespace }}). Then update cronjob_controller.go (see CronJobConfig Namespace
field around line 47 and injectValues method around lines 180-222) to ensure
injectValues replaces the namespace placeholder with the configured Namespace
value when rendering the manifest; ensure nil/empty Namespace falls back to the
original default if needed and that tests and manifest rendering use the same
placeholder key.
rawXML:
  description: RawXML contains the raw XML output from pcs status xml command
  type: string
collectionError:
  description: CollectionError contains any error encountered while collecting status
  type: string
summary:
Guard against oversized rawXML (1 MiB object size limit)
Storing full pcs/crm_mon XML can exceed Kubernetes object size limits, causing updates to fail under stress. Either truncate in the collector or constrain via schema.
Suggested schema bound; pair with collector-side truncation:
rawXML:
description: RawXML contains the raw XML output from pcs status xml command
type: string
+ maxLength: 262144 # ~256 KiB safety cap; adjust as needed
Optionally add a boolean status.summary.rawXMLTruncated and trim in the collector when exceeding the cap.
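A rough sketch of that collector-side trim, to pair with the schema cap above; the 256 KiB constant, the helper name, and the rawXMLTruncated flag are assumptions layered on this suggestion, not existing code:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// maxRawXMLSize is an illustrative cap, kept well under the ~1 MiB etcd object size limit.
const maxRawXMLSize = 256 * 1024

// truncateRawXML trims oversized pcs XML on a rune boundary and reports whether it was cut,
// so the collector could set status.summary.rawXMLTruncated alongside the shortened payload.
func truncateRawXML(raw string) (string, bool) {
	if len(raw) <= maxRawXMLSize {
		return raw, false
	}
	cut := maxRawXMLSize
	for cut > 0 && !utf8.RuneStart(raw[cut]) {
		cut--
	}
	return raw[:cut], true
}

func main() {
	xml, truncated := truncateRawXML(string(make([]byte, 300*1024)))
	fmt.Println(len(xml), truncated) // 262144 true
}
```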
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
rawXML:
  description: RawXML contains the raw XML output from pcs status xml command
  type: string
  maxLength: 262144 # ~256 KiB safety cap; adjust as needed
collectionError:
  description: CollectionError contains any error encountered while collecting status
  type: string
summary:
🤖 Prompt for AI Agents
In manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml around lines
51-57, the rawXML status field is unbounded and can exceed Kubernetes 1 MiB
object size limits; add a schema constraint (e.g. maxLength: 1048576) on rawXML
and introduce a new boolean status.summary.rawXMLTruncated field to indicate
truncation, and update the collector to truncate the XML to that cap before
writing the CR status (ensuring truncation preserves valid encoding/closing tags
as needed).
59eaec8 to 85b3229
Actionable comments posted: 0
♻️ Duplicate comments (4)
bindata/tnfdeployment/cronjob.yaml (1)
5-5
: Namespace hardcoding issue already flagged.This issue was already identified in a previous review.
pkg/tnf/pkg/jobs/cronjob_controller.go (2)
48-48
: Fix misleading cron schedule example – Kubernetes CronJob uses 5 fields, not 6.The example
"*/30 * * * * *"
suggests a 6-field cron with seconds, but Kubernetes CronJob uses standard 5-field cron. Change to"*/30 * * * *"
(every 30 minutes) and update the comment.Apply this diff:
- Schedule string // Cron schedule expression (e.g., "*/30 * * * * *" for every 30 seconds) + Schedule string // Cron schedule expression (e.g., "*/30 * * * *" for every 30 minutes)
180-222
: Fix YAML injection risk and add namespace injection.Two issues:
- YAML injection: Line 191 uses
fmt.Sprintf(\
"%s"`, cmd)which doesn't escape quotes/backslashes. If
cmdcontains
"or
`, the YAML breaks. Usestrconv.Quote
for proper escaping.- Missing namespace injection: The manifest never injects namespace. If the template contains
namespace: <injected>
, it won't be replaced.Apply this diff:
 package jobs

 import (
 	"context"
 	"fmt"
+	"strconv"
 	"strings"
 	"time"
 func (c *CronJobController) injectValues(manifest []byte) []byte {
 	result := string(manifest)
 	// Build command string for YAML array format
 	var commandBuilder strings.Builder
 	commandBuilder.WriteString("[")
 	for i, cmd := range c.command {
 		if i > 0 {
 			commandBuilder.WriteString(", ")
 		}
-		commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd))
+		commandBuilder.WriteString(strconv.Quote(cmd))
 	}
 	commandBuilder.WriteString("]")
 	commandStr := commandBuilder.String()

 	// Track which line we're on to do context-aware replacement
 	lines := strings.Split(result, "\n")
 	for i, line := range lines {
+		// Replace namespace if present in template
+		if strings.Contains(line, "namespace: <injected>") {
+			lines[i] = strings.Replace(line, "<injected>", c.namespace, 1)
+		}
 		// Replace name (appears in metadata.name)
 		if strings.Contains(line, "name: <injected>") {
 			lines[i] = strings.Replace(line, "<injected>", c.name, 1)
360-362
: Use proper error type checking instead of string matching.Line 362 uses brittle string matching to detect not-found errors. Use
apierrors.IsNotFound(err)
for robust error classification.Apply this diff:
import ( "context" "encoding/xml" "fmt" "os" "strings" "time" "github.com/spf13/cobra" + apierrors "k8s.io/apimachinery/pkg/api/errors" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
if err != nil { // Create new PacemakerStatus if it doesn't exist - if strings.Contains(err.Error(), "not found") { + if apierrors.IsNotFound(err) { newStatus := &v1alpha1.PacemakerStatus{
🧹 Nitpick comments (2)
pkg/tnf/pacemaker/healthcheck.go (2)
401-402
: Consider passing context to recordHealthCheckEvents for consistency.While
recordHealthCheckEvents
currently performs no blocking I/O, passingctx
would future-proof the function and align with standard controller patterns where context flows through the call chain.Based on learnings
438-441
: XML size validation is defensive; etcd limits already constrain CR size.The size check here occurs after the XML is in memory (from the CR). However, etcd/apiserver limits (typically ~1.5MB) already prevent excessively large CRs from being stored. This validation is still useful as a secondary defense-in-depth measure.
For stricter protection, consider future enhancement using
io.LimitReader
when the collector writes the CR.Based on learnings
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (17)
- bindata/etcd/etcd-sa-clusterrole.yaml (1 hunks)
- bindata/etcd/etcd-sa-clusterrolebinding.yaml (1 hunks)
- bindata/tnfdeployment/cronjob.yaml (1 hunks)
- cmd/cluster-etcd-operator/main.go (2 hunks)
- manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
- pkg/cmd/pacemakerstatuscollector/collector.go (1 hunks)
- pkg/operator/starter.go (2 hunks)
- pkg/tnf/operator/starter.go (4 hunks)
- pkg/tnf/operator/starter_test.go (4 hunks)
- pkg/tnf/pacemaker/healthcheck.go (1 hunks)
- pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
- pkg/tnf/pacemaker/v1alpha1/doc.go (1 hunks)
- pkg/tnf/pacemaker/v1alpha1/register.go (1 hunks)
- pkg/tnf/pacemaker/v1alpha1/types.go (1 hunks)
- pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
- pkg/tnf/pkg/jobs/batch.go (1 hunks)
- pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- pkg/tnf/pacemaker/v1alpha1/doc.go
- bindata/etcd/etcd-sa-clusterrolebinding.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
- manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
- pkg/tnf/pkg/jobs/batch.go
- pkg/tnf/pacemaker/v1alpha1/types.go
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml
[medium] 1-47: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 1-47: Container should not be privileged
(CKV_K8S_16)
[medium] 1-47: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 1-47: Containers should not share the host process ID namespace
(CKV_K8S_17)
🔇 Additional comments (34)
bindata/tnfdeployment/cronjob.yaml (1)
35-40
: Verify privileged access is necessary for pacemaker status collection.The CronJob runs with
privileged: true
,allowPrivilegeEscalation: true
, andhostPID: true
. While static analysis flags these as security concerns, they may be required for thepcs status
command to access the pacemaker cluster state on the host.Please confirm:
- Is host PID namespace access (
hostPID: true
) required for the pacemaker status collector to runpcs status
?- Can the privilege requirements be reduced (e.g., specific capabilities instead of fully privileged)?
- Is there documentation about the minimum privileges needed for
pcs status
execution?If these privileges are necessary, consider adding a comment in the manifest explaining why privileged access is required for operational clarity.
cmd/cluster-etcd-operator/main.go (1)
17-17
: LGTM!The pacemaker status collector command is properly imported and registered with the root command.
Also applies to: 82-82
pkg/operator/starter.go (2)
247-248
: LGTM!The RBAC manifests for the etcd service account are correctly added to the static resource controller, enabling PacemakerStatus resource access.
632-633
: LGTM!The function call is properly updated to pass the additional
dynamicClient
andAlivenessChecker
parameters required for Pacemaker controller initialization.bindata/etcd/etcd-sa-clusterrole.yaml (1)
1-24
: LGTM!The ClusterRole defines appropriate permissions for the pacemaker status collector to manage PacemakerStatus resources and their status subresource.
pkg/tnf/operator/starter_test.go (1)
35-35
: LGTM!The test is properly updated to:
- Import the necessary health package
- Create an
alivenessChecker
instance- Pass the additional parameters to
HandleDualReplicaClusters
- Initialize a minimal
rest.Config
for testingAlso applies to: 41-41, 96-109, 260-270
pkg/tnf/operator/starter.go (2)
38-38
: LGTM!The function signature is properly extended to accept
dynamicClient
andalivenessChecker
, and the newrunPacemakerControllers
function is correctly invoked to start Pacemaker-related controllers.Also applies to: 40-40, 76-77, 98-98
334-364
: Verify CronJob schedule aligns with requirements.The CronJob schedule is set to
"* * * * *"
(every minute), but the PR description states the controller "monitors pacemaker cluster status every 30 seconds."Please confirm:
- Is running every minute (instead of every 30 seconds) acceptable for your monitoring requirements?
- If 30-second intervals are required, consider either:
- Running the job more frequently (though Kubernetes CronJobs don't support sub-minute schedules natively)
- Implementing a loop inside the collector to sample twice per minute
- Using a different deployment pattern (e.g., Deployment with continuous monitoring)
If every minute is acceptable, update the PR description to match the actual schedule.
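If 30-second sampling does turn out to be a hard requirement, the "loop inside the collector" option listed above could look roughly like the sketch below; collectAndPublish is a stand-in for the real pcs collection and CR update, not an existing function:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// collectAndPublish is a placeholder for the real "run pcs, update the CR" step.
func collectAndPublish(ctx context.Context) error {
	fmt.Println("collected at", time.Now().Format(time.RFC3339))
	return nil
}

// runTwicePerMinute lets a standard 1-minute CronJob approximate 30s sampling:
// sample immediately, wait ~30s, sample again, then exit before the next Job starts.
func runTwicePerMinute(ctx context.Context) error {
	if err := collectAndPublish(ctx); err != nil {
		return err
	}
	select {
	case <-time.After(30 * time.Second):
	case <-ctx.Done():
		return ctx.Err()
	}
	return collectAndPublish(ctx)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 55*time.Second)
	defer cancel()
	if err := runTwicePerMinute(ctx); err != nil {
		fmt.Println("collection failed:", err)
	}
}
```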
pkg/tnf/pacemaker/v1alpha1/register.go (1)
1-31
: LGTM!The API registration follows standard Kubernetes patterns:
- Properly defines
GroupName
andVersion
constants- Initializes
SchemeGroupVersion
andSchemeBuilder
- Registers
PacemakerStatus
andPacemakerStatusList
types- Provides helper functions for resource registration
pkg/tnf/pacemaker/healthcheck_test.go (1)
1-903
: LGTM!This is a comprehensive test suite with excellent coverage:
- HealthCheck construction and initialization
- PacemakerStatus retrieval (healthy, errors, stale)
- Operator status updates
- Event recording with proper deduplication across different time windows
- XML parsing for various cluster states (healthy, offline, standby, failures)
- Node and resource status collection
- Edge cases (large XML, invalid XML, missing state)
The use of testdata fixtures and dynamic timestamp generation for time-sensitive tests (like
getRecentFailuresXML
) is a good practice.pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)
1-121
: LGTM: Autogenerated deepcopy implementations are correct.The generated deepcopy methods follow standard Kubernetes patterns: proper nil checks, correct slice allocation for
PacemakerStatusList.Items
, nestedDeepCopyInto
calls for embedded types, and appropriateDeepCopyObject
implementations.pkg/tnf/pacemaker/healthcheck.go (12)
1-99
: LGTM: Imports and constants are well-organized.The constants are clearly defined and documented. The
expectedNodeCount = 2
reflects the Two-Node OpenShift with Fencing design constraint, as clarified in past discussions.
311-383
: LGTM: Controller constructor follows standard operator patterns.The REST client and informer setup are correct. Using
panic
for scheme registration and client creation errors is appropriate for initialization failures that should halt startup.
576-600
: LGTM: Continue on parse errors is the correct resilient behavior.Continuing when timestamp parsing fails (line 590) ensures the health check processes other valid operations. The error is logged at V(4) for debugging. Failing hard on malformed timestamps would be too strict and could mask legitimate failures in other operations.
Based on learnings
682-689
: Condition ownership is established by the unique type name.The condition type
PacemakerHealthCheckDegraded
(line 91) is unique to this controller, establishing clear ownership. Per Kubernetes conventions, condition types should be managed by a single controller, so clearing without checking the reason is safe.Based on learnings
703-728
: LGTM: Event cleanup logic is correct.The cleanup (lines 712-716) runs unconditionally on every call and only removes entries older than
maxWindow
(1 hour). This is intentional and correct – it prevents unbounded map growth while preserving recent events within their deduplication windows.Based on learnings
101-273
: LGTM: XML structures are well-defined for pacemaker status parsing.The types correctly mirror the pacemaker XML schema and use appropriate xml tags for unmarshaling.
452-499
: LGTM: Parsing orchestration is well-structured.The function properly validates critical structures, checks pacemaker state, and delegates to specialized collectors. Early return on non-running state with explicit error status is appropriate.
501-524
: LGTM: Node status collection correctly handles the Two-Node design.The checks for expected node count (with warning for transient 1-node state during replacement) and online/standby validation align with the Two-Node OpenShift architecture requirements.
526-574
: LGTM: Resource validation correctly identifies kubelet and etcd placement.The helper function pattern and resource agent prefix matching properly track resource distribution across nodes.
602-634
: LGTM: Fencing and status determination logic is sound.Fencing event detection with appropriate time window (24 hours) and simple status determination based on error/warning presence are appropriate for this health check.
636-701
: LGTM: Operator condition management follows library-go patterns.Condition updates use the standard
v1helpers.UpdateStaticPodStatus
API correctly. The separation of set/clear/update functions is clean.
730-781
: LGTM: Event recording with deduplication is well-implemented.The event categorization (fencing vs. failed actions) with distinct reasons and deduplication windows, plus tracking previous status to record healthy transitions, is thorough and production-ready.
pkg/tnf/pkg/jobs/cronjob_controller.go (3)
21-88
: LGTM: Controller setup follows standard factory patterns.The constants, types, and constructor are well-organized. Default namespace fallback and resync interval are appropriate.
90-178
: LGTM: CronJob lifecycle management is correct.The sync/ensure/create/update/build flow properly handles the CronJob lifecycle, including AlreadyExists handling and ResourceVersion preservation for updates.
224-249
: LGTM: Update detection correctly compares schedule, command, and image.The comparison logic properly checks all three configurable fields and handles array length mismatches.
pkg/cmd/pacemakerstatuscollector/collector.go (8)
1-50
: LGTM: Imports and constants are well-organized.The constants clearly define timeouts, size limits, and time formats. The command uses
sudo -n
for non-interactive execution, which is correct.
52-138
: LGTM: XML structure duplication is acceptable to avoid circular dependencies.The comment (line 53) explains why these types are duplicated from healthcheck.go. This is a reasonable trade-off for maintaining clean package dependencies.
140-177
: LGTM: Command structure and runCollector orchestration are sound.The 30-second context timeout (line 156) provides a reasonable upper bound for collection and CR update operations. Error handling propagates failures appropriately.
179-217
: LGTM: Collection logic is robust and defensive.The nested timeout (10s for exec, line 181), size validation (line 196), and returning raw XML even on parse failure (line 210) make this resilient to various failure modes.
219-265
: LGTM: Summary building comprehensively captures cluster state.The function correctly counts online nodes, started resources, and checks for recent failures/fencing events. The logic aligns with the health check expectations.
267-301
: LGTM: Time-window checks for failures and fencing are correct.The time window constants (5 minutes for failures, 24 hours for fencing) match the health check controller. Error handling continues on parse failure, which is appropriate.
303-319
: LGTM: Kubernetes client creation with in-cluster fallback is standard.The in-cluster config attempt followed by kubeconfig fallback correctly supports both in-cluster (Job/CronJob) and local execution contexts.
363-414
: LGTM: CR update logic correctly handles create vs. update paths.The create path (lines 363-389) properly sets TypeMeta and ObjectMeta. The update path (lines 394-413) correctly uses the status subresource and sets ObservedGeneration.
85b3229 to 9f5f422
Actionable comments posted: 2
♻️ Duplicate comments (5)
bindata/tnfdeployment/cronjob.yaml (1)
5-5
: Hardcoded namespace creates inconsistency with controller.The namespace is hardcoded to
openshift-etcd
, but theCronJobController
accepts aNamespace
field inCronJobConfig
. The controller'sinjectValues
method doesn't replace the namespace placeholder, causing potential mismatches.As flagged in previous reviews: replace with a placeholder and update the controller to inject it.
pkg/tnf/pkg/jobs/cronjob_controller.go (3)
48-51
: Cron schedule comment shows invalid 6-field expression.Kubernetes CronJob expects a standard 5-field cron expression (minute, hour, day, month, weekday), not 6 fields. The example showing
"*/30 * * * * *"
with seconds is misleading.Update to a valid 5-field example like
"*/30 * * * *"
(every 30 minutes).
184-194
: Command array construction lacks proper YAML escaping.The command building uses
fmt.Sprintf("%s")
which doesn't escape quotes, backslashes, or other special characters. Arguments containing these characters will break the YAML syntax when injected.As flagged in previous reviews: use
strconv.Quote()
for each command element to ensure proper escaping, and add thestrconv
import.
196-222
: Namespace is not injected into the manifest.The
injectValues
method replaces placeholders for name, schedule, image, and command, but never injects the namespace. The manifest template has a hardcoded namespace, creating an inconsistency with the controller's configurableNamespace
field.As flagged in previous reviews: add a replacement branch to inject
c.namespace
when the line contains"namespace: <injected>"
.pkg/cmd/pacemakerstatuscollector/collector.go (1)
360-392
: Use apierrors.IsNotFound instead of string matching.The brittle string check
strings.Contains(err.Error(), "not found")
on line 362 should be replaced with proper API error type checking usingapierrors.IsNotFound(err)
.Apply this fix:
import ( "context" "encoding/xml" "fmt" "os" "strings" "time" "github.com/spf13/cobra" + apierrors "k8s.io/apimachinery/pkg/api/errors" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
if err != nil { // Create new PacemakerStatus if it doesn't exist - if strings.Contains(err.Error(), "not found") { + if apierrors.IsNotFound(err) { newStatus := &v1alpha1.PacemakerStatus{
🧹 Nitpick comments (6)
pkg/tnf/operator/starter.go (1)
313-346
: Defensive guard for KubeConfig to avoid panics.If controllerContext.KubeConfig is ever nil (tests or misconfig), NewHealthCheck will panic when creating the REST client. Add a fast check and skip starting the healthcheck in that case (log at Warning).
Example:
func runPacemakerControllers(..., etcdInformer operatorv1informers.EtcdInformer) { // Start the healthcheck controller to watch PacemakerStatus CRs // This can start early since it just subscribes to CR updates klog.Infof("starting Pacemaker healthcheck controller") - healthCheckController := pacemaker.NewHealthCheck( + if controllerContext.KubeConfig == nil { + klog.Warning("skipping Pacemaker healthcheck controller: KubeConfig is nil") + } else { + healthCheckController := pacemaker.NewHealthCheck( alivenessChecker, operatorClient, kubeClient, controllerContext.EventRecorder, controllerContext.KubeConfig, ) - go healthCheckController.Run(ctx, 1) + go healthCheckController.Run(ctx, 1) + }pkg/tnf/pacemaker/healthcheck.go (2)
320-335
: Avoid panics on REST client creation; prefer graceful failure.Panic will crash the operator if RESTClientFor fails (bad rest.Config, serializer, env). Return an error from constructor (or log and skip starting) and let the caller decide, to keep the operator up.
Minimal approach (preserve signature): log error and return a no-op controller:
- restClient, err := rest.RESTClientFor(pacemakerConfig) - if err != nil { - panic(fmt.Errorf("failed to create REST client: %w", err)) - } + restClient, err := rest.RESTClientFor(pacemakerConfig) + if err != nil { + klog.Errorf("PacemakerHealthCheck disabled: failed to create REST client: %v", err) + return factory.New().ToController("PacemakerHealthCheckDisabled", events.NewInMemoryRecorder("noop")) + }Alternatively, change NewHealthCheck to return (factory.Controller, error) and bubble up.
As per coding guidelines.
730-781
: Event deduplication is sensible; minor nit: avoid cleanup per-item.Current cleanup runs for each warning/error. Consider a periodic cleanup (once per sync) to reduce lock contention.
pkg/tnf/pacemaker/healthcheck_test.go (1)
32-61
: Consider removing or clarifying the TODO comment.The TODO comment on line 56 states "TODO: Mock this for actual tests", but the function is already used successfully in
TestNewHealthCheck
(line 168-174). If the nil REST client and informer are acceptable for basic constructor testing, consider either:
- Removing the TODO comment, or
- Clarifying what additional testing would require proper mocking
pkg/cmd/pacemakerstatuscollector/collector.go (2)
303-333
: Refactor duplicate REST config retrieval logic.The code for obtaining the Kubernetes REST config is duplicated between
createKubernetesClient
(lines 304-316) andupdatePacemakerStatusCR
(lines 323-333). Consider extracting this into a shared helper function to improve maintainability.Example refactor:
func getRESTConfig() (*rest.Config, error) { // Try in-cluster config first config, err := rest.InClusterConfig() if err != nil { // Fall back to kubeconfig kubeconfig := os.Getenv("KUBECONFIG") if kubeconfig == "" { kubeconfig = os.Getenv("HOME") + "/.kube/config" } config, err = clientcmd.BuildConfigFromFlags("", kubeconfig) if err != nil { return nil, err } } return config, nil } func createKubernetesClient() (*kubernetes.Clientset, error) { config, err := getRESTConfig() if err != nil { return nil, err } return kubernetes.NewForConfig(config) } func updatePacemakerStatusCR(ctx context.Context, client *kubernetes.Clientset, rawXML string, summary v1alpha1.PacemakerSummary, collectionError string) error { config, err := getRESTConfig() if err != nil { return err } // ... rest of function }
52-138
: XML structure duplication is documented.The comment on line 53 notes that these XML structures are copied from healthcheck.go to avoid circular dependencies. While code duplication is generally undesirable, this appears to be an intentional trade-off documented in the code.
Consider moving these shared structures to a separate package (e.g.,
pkg/tnf/pacemaker/types
orpkg/tnf/pacemaker/internal
) if the circular dependency can be resolved.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (18)
- bindata/etcd/etcd-sa-clusterrole.yaml (1 hunks)
- bindata/etcd/etcd-sa-clusterrolebinding.yaml (1 hunks)
- bindata/tnfdeployment/cronjob.yaml (1 hunks)
- cmd/cluster-etcd-operator/main.go (2 hunks)
- manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
- pkg/cmd/pacemakerstatuscollector/collector.go (1 hunks)
- pkg/operator/ceohelpers/external_etcd_status.go (2 hunks)
- pkg/operator/starter.go (2 hunks)
- pkg/tnf/operator/starter.go (5 hunks)
- pkg/tnf/operator/starter_test.go (4 hunks)
- pkg/tnf/pacemaker/healthcheck.go (1 hunks)
- pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
- pkg/tnf/pacemaker/v1alpha1/doc.go (1 hunks)
- pkg/tnf/pacemaker/v1alpha1/register.go (1 hunks)
- pkg/tnf/pacemaker/v1alpha1/types.go (1 hunks)
- pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
- pkg/tnf/pkg/jobs/batch.go (1 hunks)
- pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- bindata/etcd/etcd-sa-clusterrolebinding.yaml
- pkg/operator/starter.go
- manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
- cmd/cluster-etcd-operator/main.go
- bindata/etcd/etcd-sa-clusterrole.yaml
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml
[medium] 1-47: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 1-47: Container should not be privileged
(CKV_K8S_16)
[medium] 1-47: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 1-47: Containers should not share the host process ID namespace
(CKV_K8S_17)
🔇 Additional comments (18)
pkg/tnf/pkg/jobs/batch.go (1)
42-48
: LGTM!The function correctly mirrors the existing
ReadJobV1OrDie
pattern, using the same decoder infrastructure to parse CronJob manifests. The implementation is consistent and appropriate for this use case.bindata/tnfdeployment/cronjob.yaml (1)
35-40
: Privileged access is appropriate for pacemaker operations.The privileged security context and
hostPID: true
are necessary for the collector to executepcs status
commands and interact with the pacemaker cluster on the host. Static analysis warnings are expected but can be accepted for this system-level operational use case.pkg/tnf/pkg/jobs/cronjob_controller.go (3)
90-111
: LGTM!The sync method correctly checks operator state and ensures the CronJob only when in Managed state. Error handling and logging are appropriate.
113-132
: LGTM!The ensureCronJob method properly handles both creation and update paths, with correct error handling for NotFound vs other errors.
224-249
: LGTM!The needsUpdate method correctly compares schedule, command array, and image to determine if an update is required. Bounds checking is appropriate.
pkg/tnf/pacemaker/v1alpha1/doc.go (1)
1-3
: LGTM.Package doc and group annotation look good.
pkg/tnf/operator/starter_test.go (1)
96-110
: Aliveness checker wiring looks correct.Passing a concrete MultiAlivenessChecker into HandleDualReplicaClusters matches the new signature.
pkg/operator/ceohelpers/external_etcd_status.go (1)
167-201
: Nice generic waiter; sync + context-aware polling is appropriate.Gates on informer sync, retries with PollUntilContextTimeout, and treats transient errors as retryable. LGTM.
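A usage sketch of that pattern built on wait.PollUntilContextTimeout; the helper below is assumed from the description of WaitForEtcdCondition rather than copied from external_etcd_status.go:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForCondition polls until check() reports done, treating transient errors as retryable.
func waitForCondition(ctx context.Context, interval, timeout time.Duration, check func(ctx context.Context) (bool, error)) error {
	return wait.PollUntilContextTimeout(ctx, interval, timeout, true, func(ctx context.Context) (bool, error) {
		done, err := check(ctx)
		if err != nil {
			// Log and retry instead of aborting on a transient condition-check failure.
			fmt.Println("condition check failed, retrying:", err)
			return false, nil
		}
		return done, nil
	})
}

func main() {
	start := time.Now()
	err := waitForCondition(context.Background(), 500*time.Millisecond, 5*time.Second,
		func(context.Context) (bool, error) {
			// Stand-in for "the etcd operator condition we are waiting on is now true".
			return time.Since(start) > time.Second, nil
		})
	fmt.Println("done:", err)
}
```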
pkg/tnf/pacemaker/v1alpha1/register.go (1)
1-31
: LGTM.SchemeGroupVersion, AddToScheme, and known types are correctly wired.
pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)
1-121
: Autogen deepcopy looks correct.Methods cover all types; safe copies for meta/timestamps.
pkg/tnf/operator/starter.go (2)
229-237
: Good: bootstrap wait refactored to shared helper.Using ceohelpers.WaitForEtcdCondition simplifies readiness gating and centralizes polling behavior.
348-367
: CronJob schedule fix looks good.Five-field spec "* * * * *" is valid for Kubernetes CronJobs; controller wiring is straightforward.
pkg/tnf/pacemaker/healthcheck.go (1)
509-523
: Node-state handling matches PR intent.You warn on unexpected node count and treat standby as offline. Clear messages; aligns with TNF design.
pkg/tnf/pacemaker/healthcheck_test.go (2)
126-140
: Excellent use of dynamic timestamps!Generating dynamic timestamps within the test window (line 131-133) is a best practice that prevents time-based test flakiness. This approach ensures tests remain reliable regardless of when they execute.
143-903
: Comprehensive test coverage!The test suite provides excellent coverage across:
- Happy path scenarios (healthy clusters, online nodes, started resources)
- Error conditions (offline nodes, standby nodes, collection errors, stale status)
- Edge cases (no nodes, mismatched counts, large XML, invalid XML, missing state)
- Event deduplication across different time windows and state transitions
- All major parsing and status determination functions
The tests are well-structured, clearly named, and follow consistent patterns.
pkg/tnf/pacemaker/v1alpha1/types.go (1)
1-86
: Well-structured API types following Kubernetes conventions.The type definitions are clean and follow standard Kubernetes API patterns with appropriate kubebuilder markers for CRD generation, deepcopy generation, and subresource configuration.
pkg/cmd/pacemakerstatuscollector/collector.go (2)
140-177
: Clean command structure and error handling.The cobra command setup and main orchestration logic follow good patterns with proper context timeout, error propagation, and logging.
179-301
: Well-structured collection and parsing logic.The collection, parsing, and summary-building functions are cleanly separated with good error handling, size validation (maxXMLSize check on line 196), and time-based filtering for recent events.
// PacemakerStatusSpec defines the desired state of PacemakerStatus | ||
type PacemakerStatusSpec struct { | ||
// NodeName identifies which node this status is for | ||
// +optional | ||
NodeName string `json:"nodeName,omitempty"` | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Verify if NodeName field is needed.
The NodeName
field in PacemakerStatusSpec
is marked optional but doesn't appear to be set anywhere in the collector implementation (pkg/cmd/pacemakerstatuscollector/collector.go lines 363-377, 394-399). The collector uses a singleton CR named "cluster" without node-specific distinction.
If node-specific status tracking isn't part of the current design, consider removing this field to avoid confusion. If it's planned for future use, consider adding a comment explaining its intended purpose.
Run the following to verify NodeName usage across the codebase:
🏁 Script executed:
#!/bin/bash
# Search for any references to NodeName in PacemakerStatus context
rg -n -C3 'NodeName' --type=go
Length of output: 264911
---
🏁 Script executed:
```shell
#!/bin/bash
# Search specifically for PacemakerStatus usage and NodeName references
echo "=== Search 1: PacemakerStatus creation/instantiation ==="
rg -n 'PacemakerStatus' --type=go | grep -v vendor | head -20
echo ""
echo "=== Search 2: Collector code that creates PacemakerStatus ==="
rg -n 'pacemakerstatuscollector' --type=go | grep -v vendor | head -20
echo ""
echo "=== Search 3: Direct reference to PacemakerStatusSpec with NodeName ==="
rg -n 'PacemakerStatusSpec.*NodeName\|NodeName.*PacemakerStatusSpec' --type=go | grep -v vendor
echo ""
echo "=== Search 4: Look for any .NodeName assignments in collector context ==="
rg -n '\.NodeName\s*=' --type=go pkg/cmd/pacemakerstatuscollector/ -A2 -B2 2>/dev/null || echo "File not found, searching broadly:"
find pkg -name "*collector*" -type f
Length of output: 2866
🏁 Script executed:
#!/bin/bash
# View the collector.go to see how PacemakerStatus objects are created
echo "=== Collector.go content (showing PacemakerStatus creation) ==="
rg -n 'PacemakerStatus' pkg/cmd/pacemakerstatuscollector/collector.go -A5 -B5
echo ""
echo "=== Search for Spec assignment in collector ==="
rg -n 'Spec' pkg/cmd/pacemakerstatuscollector/collector.go -C3 | head -40
Length of output: 4469
Remove unused NodeName
field from PacemakerStatusSpec
or document its intended purpose.
The NodeName
field in PacemakerStatusSpec
is never set by the collector implementation. Lines 363–376 and 395–398 of collector.go
show that only the Status
fields are populated (LastUpdated, RawXML, CollectionError, Summary); the Spec
is never touched. Combined with the singleton CR design ("cluster"), this field appears unused. Either remove it or add a comment explaining future intent.
🤖 Prompt for AI Agents
In pkg/tnf/pacemaker/v1alpha1/types.go around lines 23–28, the
PacemakerStatusSpec.NodeName field is unused by the collector and the singleton
"cluster" CR design, so remove the NodeName field from the PacemakerStatusSpec
struct, delete its json tag, and update any comments; regenerate CRDs/deepcopy
(run controller-gen or go generate as appropriate), fix any compile-time
references, and run go fmt/go vet to ensure cleanliness. If you prefer to keep
it instead, replace the field with a clear comment explaining its future intent
and why it is currently unused, and still regenerate CRDs/deepcopy and run
formatting checks.
bindata/tnfdeployment/cronjob.yaml
Outdated
hostNetwork: false
hostPID: true
priorityClassName: system-node-critical
serviceAccountName: etcd-sa
Would it make sense to use the tnf-setup-manager SA, like we do in the other TNF jobs? That one also has the privileged securitycontextconstraints rule assigned, which I expect the pacemaker cronjob needs as well.
|
||
// Track which line we're on to do context-aware replacement
lines := strings.Split(result, "\n")
for i, line := range lines {
tbh, I like using a JobHookFunc
like in runJobController()
much more than these string replacements... WDYT?
I was confused as to how the JobHookFunc stuff worked :D
I was like - what is this sorcery
Now that I am looking at it directly, it makes way more sense. Stealing this approach.
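For reference, a hook-style alternative to the string replacement could look like the sketch below; the cronJobHookFunc type and the injected values are assumptions modeled on the JobHookFunc idea, not the repo's actual API:

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
)

// cronJobHookFunc mirrors the JobHookFunc idea: callers mutate the decoded object
// instead of patching placeholder strings in the YAML template.
type cronJobHookFunc func(cj *batchv1.CronJob) error

func applyHooks(cj *batchv1.CronJob, hooks ...cronJobHookFunc) error {
	for _, hook := range hooks {
		if err := hook(cj); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// In the controller this object would come from decoding the bindata manifest.
	cj := &batchv1.CronJob{}
	err := applyHooks(cj,
		// Names and values below are illustrative only.
		func(cj *batchv1.CronJob) error { cj.Name = "pacemaker-status-collector"; return nil },
		func(cj *batchv1.CronJob) error { cj.Namespace = "openshift-etcd"; return nil },
		func(cj *batchv1.CronJob) error { cj.Spec.Schedule = "* * * * *"; return nil },
	)
	fmt.Println(cj.Name, cj.Namespace, cj.Spec.Schedule, err)
}
```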
cmd/cluster-etcd-operator/main.go
Outdated
cmd.AddCommand(requestbackup.NewRequestBackupCommand(ctx))
cmd.AddCommand(rev.NewRevCommand(ctx))
cmd.AddCommand(backuprestore.NewBackupServer(ctx))
cmd.AddCommand(pacemakerstatuscollector.NewPacemakerStatusCollectorCommand())
what about adding this in cmd/tnf-setup-runner instead, where all the other TNF related commands live? (yes, it's not about the setup part, but we can rename it?)
@@ -0,0 +1,414 @@
package pacemakerstatuscollector |
don't we want to have this in the tnf package?
9f5f422 to 4c76701
Actionable comments posted: 3
♻️ Duplicate comments (4)
pkg/tnf/pkg/jobs/cronjob_controller.go (2)
48-51
: Cron schedule example is incorrect for Kubernetes CronJob.The comment shows a 6-field expression including seconds, but Kubernetes CronJob expects a standard 5-field cron expression (minute, hour, day of month, month, day of week).
Update the example:
- Schedule string // Cron schedule expression (e.g., "*/30 * * * * *" for every 30 seconds) + Schedule string // Cron schedule expression (e.g., "*/30 * * * *" for every 30 minutes)
180-222
: Command array construction is vulnerable to YAML injection.The command builder uses
fmt.Sprintf("%s")
without escaping quotes, backslashes, or other special characters. Arguments containing these characters will break the YAML structure. Additionally, the namespace placeholder is not being injected.Apply this diff to fix both issues:
+import ( + "strconv" + // ... other imports +) func (c *CronJobController) injectValues(manifest []byte) []byte { result := string(manifest) // Build command string for YAML array format var commandBuilder strings.Builder commandBuilder.WriteString("[") for i, cmd := range c.command { if i > 0 { commandBuilder.WriteString(", ") } - commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd)) + commandBuilder.WriteString(strconv.Quote(cmd)) } commandBuilder.WriteString("]") commandStr := commandBuilder.String() // Track which line we're on to do context-aware replacement lines := strings.Split(result, "\n") for i, line := range lines { + // Replace namespace if present in template + if strings.Contains(line, "namespace: <injected>") { + lines[i] = strings.Replace(line, "<injected>", c.namespace, 1) + } // Replace name (appears in metadata.name) if strings.Contains(line, "name: <injected>") {pkg/tnf/operator/starter_test.go (1)
260-271
: Test goroutines may leak due to context.Background usage.The test creates a scoped cancellable context at line 239 but returns
context.Background()
at line 267. Controllers started byHandleDualReplicaClusters
useargs.ctx
, so they won't be cancelled when the test completes, potentially leaking goroutines.Apply this diff:
return args{ initErr: nil, // Default to no error - ctx: context.Background(), + ctx: ctx, controllerContext: &controllercmd.ControllerContext{pkg/cmd/pacemakerstatuscollector/collector.go (1)
487-531
: CRITICAL: Replace brittle string matching with apierrors.IsNotFound (duplicate of previous review).Line 497 still uses
strings.Contains(err.Error(), "not found")
which was flagged in a previous review. This is fragile and incorrect for error type checking.Apply this diff:
import ( "context" "encoding/xml" "fmt" "os" "strings" "time" "github.com/spf13/cobra" + apierrors "k8s.io/apimachinery/pkg/api/errors" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
if err != nil { // Create new PacemakerStatus if it doesn't exist - if strings.Contains(err.Error(), "not found") { + if apierrors.IsNotFound(err) { newStatus := &v1alpha1.PacemakerStatus{Based on learnings.
🧹 Nitpick comments (11)
pkg/tnf/pacemaker/healthcheck_test.go (2)
31-61
: Mark helpers with t.Helper() for clearer failures.Add t.Helper() to test helpers (createTestHealthCheck, createTestHealthCheckWithMockStatus, loadTestXML, getRecentFailuresXML).
func createTestHealthCheck() *HealthCheck { + t.Helper() @@ func createTestHealthCheckWithMockStatus(t *testing.T, mockStatus *v1alpha1.PacemakerStatus) *HealthCheck { + t.Helper() @@ func loadTestXML(t *testing.T, filename string) string { + t.Helper() @@ func getRecentFailuresXML(t *testing.T) string { + t.Helper()Also applies to: 63-117, 119-124, 126-140
338-444
: Optional: make dedup timing fully deterministic.Consider injecting a clock into HealthCheck and using it in shouldRecordEvent to avoid reliance on time.Now in tests.
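A minimal sketch of that clock injection, assuming a small interface added to HealthCheck; all names and fields here are illustrative only:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// clock abstracts time.Now so tests can control event-deduplication timing.
type clock interface{ Now() time.Time }

type realClock struct{}

func (realClock) Now() time.Time { return time.Now() }

// fakeClock lets a test advance time deterministically.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time { return f.t }

type healthCheck struct {
	clock          clock
	mu             sync.Mutex
	recordedEvents map[string]time.Time
}

func (c *healthCheck) shouldRecordEvent(key string, window time.Duration) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := c.clock.Now()
	if last, ok := c.recordedEvents[key]; ok && now.Sub(last) < window {
		return false
	}
	c.recordedEvents[key] = now
	return true
}

func main() {
	fc := &fakeClock{t: time.Unix(0, 0)}
	hc := &healthCheck{clock: fc, recordedEvents: map[string]time.Time{}}
	fmt.Println(hc.shouldRecordEvent("fencing", time.Hour)) // true
	fmt.Println(hc.shouldRecordEvent("fencing", time.Hour)) // false: within the window
	fc.t = fc.t.Add(2 * time.Hour)                          // advance the fake clock past the window
	fmt.Println(hc.shouldRecordEvent("fencing", time.Hour)) // true again
}
```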
pkg/tnf/operator/starter.go (2)
348-367
: 30s vs 1m sampling mismatch.Schedule is "* * * * *" (every minute). If 30-second sampling is required, either run a short loop inside the collector or switch to another mechanism.
399-500
: Operational note: pcs/ sudo requirements for CronJob.Verify the CronJob’s container has needed privileges and NOPASSWD sudo for “pcs status xml” (host mounts, SELinux, SCC). Otherwise collection will fail at runtime.
pkg/tnf/pacemaker/healthcheck.go (2)
320-335
: Avoid panic in constructor paths.Prefer utilruntime.Must (or returning an error) over panic when building scheme/REST client.
+ "k8s.io/apimachinery/pkg/util/runtime" @@ - if err := v1alpha1.AddToScheme(scheme); err != nil { - panic(fmt.Errorf("failed to add scheme: %w", err)) - } + utilruntime.Must(v1alpha1.AddToScheme(scheme)) @@ - restClient, err := rest.RESTClientFor(pacemakerConfig) - if err != nil { - panic(fmt.Errorf("failed to create REST client: %w", err)) - } + restClient, err := rest.RESTClientFor(pacemakerConfig) + utilruntime.Must(err)
621-634
: Comment mismatch with implementation.determineOverallStatus doesn’t add messages; update the comment to reflect current behavior.
-// Note: This function may add additional error messages to the status for certain conditions +// Computes overall status from existing warnings/errors; does not mutate messages.pkg/tnf/pacemaker/v1alpha1/types.go (1)
23-28
: Spec.NodeName: keep or document future intent.Field is currently unused by the collector (singleton “cluster” CR). Add a brief comment about its intended future use to avoid confusion.
pkg/cmd/pacemakerstatuscollector/collector.go (4)
372-375
: Consider more robust RC parsing.The
fmt.Sscanf
call on line 374 doesn't check for errors. Ifoperation.RC
is not a valid integer,rc
will remain 0, which might incorrectly match the success code.Consider:
-// Parse RC -rc := 0 -if operation.RC != "" { - fmt.Sscanf(operation.RC, "%d", &rc) -} +// Parse RC +rc := 0 +if operation.RC != "" { + if _, err := fmt.Sscanf(operation.RC, "%d", &rc); err != nil { + klog.Warningf("Failed to parse RC value '%s': %v", operation.RC, err) + } +}
276-426
: Consider splitting this large function into focused helpers.This ~150-line function handles multiple distinct responsibilities: building summaries, processing nodes, processing resources, analyzing history, and tracking fencing events. While the logic is straightforward, splitting into smaller functions (e.g.,
buildNodeStatus
,buildResourceStatus
,buildNodeHistory
,buildFencingHistory
) would improve readability and testability.
458-468
: Eliminate config creation duplication.Lines 458-468 duplicate the config creation logic from
createKubernetesClient
(lines 429-441). Consider extracting a sharedgetKubeConfig() (*rest.Config, error)
helper or reusing the existing client creation function.
446-556
: Consider splitting this large function.This ~110-line function handles multiple responsibilities: config creation, scheme setup, REST client creation, CR retrieval, and create-or-update logic. Consider extracting helpers like
createRESTClient
,createNewStatus
, andupdateExistingStatus
for better readability and testability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (17)
- bindata/tnfdeployment/clusterrole.yaml (1 hunks)
- bindata/tnfdeployment/cronjob.yaml (1 hunks)
- cmd/cluster-etcd-operator/main.go (2 hunks)
- manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1 hunks)
- pkg/cmd/pacemakerstatuscollector/collector.go (1 hunks)
- pkg/operator/ceohelpers/external_etcd_status.go (2 hunks)
- pkg/operator/starter.go (1 hunks)
- pkg/tnf/operator/starter.go (5 hunks)
- pkg/tnf/operator/starter_test.go (4 hunks)
- pkg/tnf/pacemaker/healthcheck.go (1 hunks)
- pkg/tnf/pacemaker/healthcheck_test.go (1 hunks)
- pkg/tnf/pacemaker/v1alpha1/doc.go (1 hunks)
- pkg/tnf/pacemaker/v1alpha1/register.go (1 hunks)
- pkg/tnf/pacemaker/v1alpha1/types.go (1 hunks)
- pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1 hunks)
- pkg/tnf/pkg/jobs/batch.go (1 hunks)
- pkg/tnf/pkg/jobs/cronjob_controller.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/tnf/pacemaker/v1alpha1/doc.go
- manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml
[medium] 1-47: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 1-47: Container should not be privileged
(CKV_K8S_16)
[medium] 1-47: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 1-47: Containers should not share the host process ID namespace
(CKV_K8S_17)
🔇 Additional comments (20)
pkg/tnf/pkg/jobs/batch.go (1)
42-48
: LGTM!The new
ReadCronJobV1OrDie
function correctly mirrors the existingReadJobV1OrDie
pattern, reuses the same codec infrastructure, and follows consistent error handling conventions.bindata/tnfdeployment/cronjob.yaml (1)
35-47
: Security posture is appropriate for Pacemaker monitoring.The privileged security context, hostPID access, and system-node-critical priority are necessary for collecting Pacemaker cluster status from the host. These elevated permissions align with the operational requirements of monitoring external etcd clusters.
Note: Static analysis security warnings are expected and acceptable for this privileged monitoring workload.
pkg/operator/ceohelpers/external_etcd_status.go (1)
167-201
: LGTM!The
WaitForEtcdCondition
helper correctly gates on informer sync, uses context-aware polling with proper timeout handling, and provides good observability through structured logging. Error handling appropriately retries transient condition-check failures while respecting context cancellation.cmd/cluster-etcd-operator/main.go (1)
17-17
: LGTM!The Pacemaker status collector command is correctly imported and registered, following the established pattern for other subcommands in the operator.
Also applies to: 82-82
pkg/operator/starter.go (1)
619-634
: LGTM!The signature extension to
HandleDualReplicaClusters
correctly propagates thedynamicClient
andAlivenessChecker
parameters needed for Pacemaker health monitoring and status collection. The changes integrate cleanly with the existing operator startup flow.bindata/tnfdeployment/clusterrole.yaml (1)
32-50
: LGTM!The RBAC rules for
pacemakerstatuses
are correctly structured, covering both the main resource and the status subresource with appropriate verbs for the Pacemaker status collector workflow.pkg/tnf/operator/starter_test.go (1)
96-109
: LGTM!The test correctly instantiates
alivenessChecker
and propagates bothdynamicClient
andalivenessChecker
toHandleDualReplicaClusters
, maintaining test coverage for the new signature.pkg/tnf/pacemaker/v1alpha1/register.go (1)
24-31
: LGTM: scheme registration is correct.Types and list registered; group/version helpers look good.
pkg/tnf/operator/starter.go (2)
313-346
: Ensure the healthcheck’s custom informer is started.The Pacemaker informer created in NewHealthCheck must be running for WithInformers cache sync and event-driven resyncs. Confirm library-go factory starts it; if not, start it (e.g., go informer.Run(ctx.Done())) where the informer is created.
61-76
: All call sites correctly updated with new parameters.Verification confirms that both invocations of
HandleDualReplicaClusters
have been properly updated:
- pkg/operator/starter.go:619 — passes
dynamicClient
(line 630) andAlivenessChecker
(line 631)- pkg/tnf/operator/starter_test.go:97 — passes
tt.args.dynamicClient
(line 108) andalivenessChecker
(line 109)No additional call sites exist in the codebase.
pkg/tnf/pacemaker/healthcheck.go (3)
336-358
: Confirm informer lifecycle.SharedIndexInformer is constructed here and passed to WithInformers. Verify it’s actually started by the factory, or start it to avoid WaitForCacheSync deadlock.
703-728
: Event dedup cleanup is efficient and safe.Cleanup before checking saves memory and avoids repeated growth. Logic looks good.
532-544
: Resource agent prefixes are validated and correctly match actual pacemaker output.The constants
resourceAgentKubelet = "systemd:kubelet"
andresourceAgentEtcd = "ocf:heartbeat:podman-etcd"
are verified against test fixtures representing realpcs
cluster output. Thestrings.HasPrefix()
matching logic is sound, andresourcesStarted
counts are computed correctly.pkg/tnf/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)
1-211
: LGTM! Autogenerated deepcopy implementations are correct.This file is autogenerated by deepcopy-gen and follows the standard patterns for deep copying Kubernetes API types. The implementations correctly handle nil checks, slice allocations, and nested struct copying.
pkg/cmd/pacemakerstatuscollector/collector.go (6)
1-1
: Address the package location question from previous review.A previous reviewer (slintes) questioned whether this should be in the tnf package. The current location (
pkg/cmd/pacemakerstatuscollector
) suggests this is a standalone command, but if it's intended as part of the tnf testing framework, consider moving topkg/tnf/cmd/pacemakerstatuscollector
or similar for better organization.Based on learnings.
24-50
: LGTM! Well-defined constants with appropriate values.The constants are well-chosen:
- XML bomb protection with 10MB limit is good security practice
- Time windows (5min for failures, 24h for fencing) are reasonable for event detection
- Pacemaker time formats are correctly specified for parsing
189-202
: LGTM! Standard Cobra command implementation.The command constructor follows best practices with proper error handling and logging.
204-226
: LGTM! Clean orchestration with appropriate timeout.The 30-second timeout provides adequate margin for the 10-second pcs command execution plus Kubernetes operations.
228-274
: LGTM! Robust collection with good error handling.The function properly handles errors while still preserving raw XML for debugging (line 266-267), and includes XML bomb protection.
428-444
: LGTM! Standard Kubernetes client creation.The fallback logic from in-cluster config to kubeconfig is correctly implemented.
// XML structures for parsing pacemaker status
// These are copied from healthcheck.go to avoid circular dependencies
type PacemakerResult struct {
	XMLName        xml.Name       `xml:"pacemaker-result"`
	Summary        Summary        `xml:"summary"`
	Nodes          Nodes          `xml:"nodes"`
	Resources      Resources      `xml:"resources"`
	NodeAttributes NodeAttributes `xml:"node_attributes"`
	NodeHistory    NodeHistory    `xml:"node_history"`
	FenceHistory   FenceHistory   `xml:"fence_history"`
}

type Summary struct {
	Stack               Stack               `xml:"stack"`
	CurrentDC           CurrentDC           `xml:"current_dc"`
	NodesConfigured     NodesConfigured     `xml:"nodes_configured"`
	ResourcesConfigured ResourcesConfigured `xml:"resources_configured"`
}

type Stack struct {
	PacemakerdState string `xml:"pacemakerd-state,attr"`
}

type CurrentDC struct {
	WithQuorum string `xml:"with_quorum,attr"`
}

type NodesConfigured struct {
	Number string `xml:"number,attr"`
}

type ResourcesConfigured struct {
	Number string `xml:"number,attr"`
}

type Nodes struct {
	Node []Node `xml:"node"`
}

type Node struct {
	Name             string `xml:"name,attr"`
	ID               string `xml:"id,attr"`
	Online           string `xml:"online,attr"`
	Standby          string `xml:"standby,attr"`
	StandbyOnFail    string `xml:"standby_onfail,attr"`
	Maintenance      string `xml:"maintenance,attr"`
	Pending          string `xml:"pending,attr"`
	Unclean          string `xml:"unclean,attr"`
	Shutdown         string `xml:"shutdown,attr"`
	ExpectedUp       string `xml:"expected_up,attr"`
	IsDC             string `xml:"is_dc,attr"`
	ResourcesRunning string `xml:"resources_running,attr"`
	Type             string `xml:"type,attr"`
}

type NodeAttributes struct {
	Node []NodeAttributeSet `xml:"node"`
}

type NodeAttributeSet struct {
	Name      string          `xml:"name,attr"`
	Attribute []NodeAttribute `xml:"attribute"`
}

type NodeAttribute struct {
	Name  string `xml:"name,attr"`
	Value string `xml:"value,attr"`
}

type Resources struct {
	Clone    []Clone    `xml:"clone"`
	Resource []Resource `xml:"resource"`
}

type Clone struct {
	Resource []Resource `xml:"resource"`
}

type Resource struct {
	ID             string  `xml:"id,attr"`
	ResourceAgent  string  `xml:"resource_agent,attr"`
	Role           string  `xml:"role,attr"`
	TargetRole     string  `xml:"target_role,attr"`
	Active         string  `xml:"active,attr"`
	Orphaned       string  `xml:"orphaned,attr"`
	Blocked        string  `xml:"blocked,attr"`
	Managed        string  `xml:"managed,attr"`
	Failed         string  `xml:"failed,attr"`
	FailureIgnored string  `xml:"failure_ignored,attr"`
	NodesRunningOn string  `xml:"nodes_running_on,attr"`
	Node           NodeRef `xml:"node"`
}

type NodeRef struct {
	Name string `xml:"name,attr"`
}

type NodeHistory struct {
	Node []NodeHistoryNode `xml:"node"`
}

type NodeHistoryNode struct {
	Name            string            `xml:"name,attr"`
	ResourceHistory []ResourceHistory `xml:"resource_history"`
}

type ResourceHistory struct {
	ID               string             `xml:"id,attr"`
	OperationHistory []OperationHistory `xml:"operation_history"`
}

type OperationHistory struct {
	Call         string `xml:"call,attr"`
	Task         string `xml:"task,attr"`
	RC           string `xml:"rc,attr"`
	RCText       string `xml:"rc_text,attr"`
	ExitReason   string `xml:"exit-reason,attr"`
	LastRCChange string `xml:"last-rc-change,attr"`
	ExecTime     string `xml:"exec-time,attr"`
	QueueTime    string `xml:"queue-time,attr"`
}

type FenceHistory struct {
	FenceEvent []FenceEvent `xml:"fence_event"`
}

type FenceEvent struct {
	Target     string `xml:"target,attr"`
	Action     string `xml:"action,attr"`
	Delegate   string `xml:"delegate,attr"`
	Client     string `xml:"client,attr"`
	Origin     string `xml:"origin,attr"`
	Status     string `xml:"status,attr"`
	ExitReason string `xml:"exit-reason,attr"`
	Completed  string `xml:"completed,attr"`
}
🛠️ Refactor suggestion | 🟠 Major
Consider extracting duplicated XML types to a shared package.
The comment on line 53 indicates these types are copied from healthcheck.go
to avoid circular dependencies. Code duplication increases maintenance burden and risk of inconsistencies.
Consider:
- Extract these types to a shared package (e.g., pkg/tnf/pacemaker/types); a package sketch follows this list
- Have both collector.go and healthcheck.go import from the shared location
- If a circular dependency truly exists, break it by restructuring the package hierarchy
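A rough sketch of the shared package, assuming the pkg/tnf/pacemaker/types path suggested above; only a subset of the structs is shown, and the package path is illustrative:

// Package types holds the pcs status XML structs so that both collector.go and
// healthcheck.go can import them instead of keeping private copies.
package types

import "encoding/xml"

// PacemakerResult mirrors the root element of `pcs status xml` output.
type PacemakerResult struct {
    XMLName xml.Name `xml:"pacemaker-result"`
    Summary Summary  `xml:"summary"`
    Nodes   Nodes    `xml:"nodes"`
}

type Summary struct {
    CurrentDC CurrentDC `xml:"current_dc"`
}

type CurrentDC struct {
    WithQuorum string `xml:"with_quorum,attr"`
}

type Nodes struct {
    Node []Node `xml:"node"`
}

type Node struct {
    Name   string `xml:"name,attr"`
    Online string `xml:"online,attr"`
}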
🤖 Prompt for AI Agents
In pkg/cmd/pacemakerstatuscollector/collector.go lines 52-187: these XML struct
definitions are duplicated from healthcheck.go; extract them into a new shared
package (e.g., pkg/tnf/pacemaker/types) and update both collector.go and
healthcheck.go to import that package; create the package with the exact type
declarations and appropriate package name, run goimports/gofmt, and if you hit a
circular import, refactor by moving only the plain type definitions into the
lowest-level package so that both callers depend on it (rather than each other)
and adjust import paths accordingly.
// RawXML contains the raw XML output from pcs status xml command
// Kept for debugging purposes only; healthcheck should not need to parse this
// +optional
RawXML string `json:"rawXML,omitempty"`
Update RawXML comment to match usage.
HealthCheck parses RawXML today; the comment says it “should not need to parse this.”
-// Kept for debugging purposes only; healthcheck should not need to parse this
+// Kept for debugging and parsed by the healthcheck to derive status
🤖 Prompt for AI Agents
In pkg/tnf/pacemaker/v1alpha1/types.go around lines 35 to 39, the RawXML field
comment incorrectly states healthcheck should not parse this; update the comment
to reflect that RawXML contains the raw XML output from the pcs status xml
command and is parsed by HealthCheck (kept for debugging and runtime health
validation). Modify the comment to mention both uses (debugging and that
HealthCheck may parse it), keep the +optional marker and json tag unchanged.
4c76701 to b915543 (Compare)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/tnf/operator/starter.go (1)
384-392
: Guard against slice index panic when setting container command. Accessing Command[1] assumes at least two elements. Fallback to setting the full command.
- job.Spec.Template.Spec.Containers[0].Command[1] = jobType.GetSubCommand()
+ cmd := &job.Spec.Template.Spec.Containers[0].Command
+ if len(*cmd) >= 2 {
+     (*cmd)[1] = jobType.GetSubCommand()
+ } else {
+     *cmd = []string{"tnf-runtime", jobType.GetSubCommand()}
+ }
♻️ Duplicate comments (8)
pkg/cmd/pacemakerstatuscollector/collector.go (3)
3-22
: Missing apierrors import for proper error handling. The past review comment suggested using
apierrors.IsNotFound()
instead of string matching (line 497), but the required import"k8s.io/apimachinery/pkg/api/errors"
is missing.Add the import:
import ( "context" "encoding/xml" "fmt" "os" "strings" "time" "github.com/spf13/cobra" + apierrors "k8s.io/apimachinery/pkg/api/errors" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
52-187
: Extract duplicated XML types to a shared package.These types are still duplicated from
healthcheck.go
. Code duplication increases maintenance burden and creates risk of inconsistencies between the two copies.Based on learnings
487-531
: Use apierrors.IsNotFound instead of string matching. String checks on error messages are brittle. This was flagged in a previous review but has not been addressed.
Apply this diff after adding the apierrors import:
if err != nil { // Create new PacemakerStatus if it doesn't exist - if strings.Contains(err.Error(), "not found") { + if apierrors.IsNotFound(err) { newStatus := &v1alpha1.PacemakerStatus{pkg/tnf/operator/starter_test.go (1)
267-267
: Return the scoped context to prevent goroutine leaks. The test creates a cancellable context at line 239 but returns
context.Background()
instead. Controllers started byHandleDualReplicaClusters
will use this uncancellable context, potentially leaking goroutines when the test completes.Apply this diff:
return args{ initErr: nil, // Default to no error - ctx: context.Background(), + ctx: ctx, controllerContext: &controllercmd.ControllerContext{pkg/tnf/pkg/pacemaker/v1alpha1/types.go (2)
25-28
: Spec.NodeName appears unused; deprecate or remove.Collector uses a singleton cluster-level CR; NodeName isn’t populated. Either remove it or mark as deprecated to avoid confusion.
type PacemakerStatusSpec struct { - // NodeName identifies which node this status is for - // +optional - NodeName string `json:"nodeName,omitempty"` + // Deprecated: currently unused; a single cluster-level CR is used. + // +optional + NodeName string `json:"nodeName,omitempty"` }
35-38
: RawXML comment contradicts usage; update to reflect parsing by healthcheck.Tests and healthcheck parse RawXML today.
// RawXML contains the raw XML output from pcs status xml command - // Kept for debugging purposes only; healthcheck should not need to parse this + // Kept for debugging and parsed by the healthcheck to derive status // +optionalbindata/tnfdeployment/cronjob.yaml (1)
5-5
: Hardcoded namespace diverges from configurable controller Namespace.Replace with injected value to keep manifest and controller in sync.
- namespace: openshift-etcd + namespace: <injected>pkg/tnf/pkg/pacemaker/healthcheck_test.go (1)
280-302
: Fix assertion messages to reference updateOperatorStatus (not updateOperatorAvailability).Same issue was flagged earlier; please update all three occurrences.
- require.NoError(t, err, "updateOperatorAvailability should not return an error") + require.NoError(t, err, "updateOperatorStatus should not return an error")Apply at Lines 281, 291, and 301.
🧹 Nitpick comments (23)
pkg/cmd/pacemakerstatuscollector/collector.go (3)
372-375
: Consider explicitly handling fmt.Sscanf error.While the default value of 0 is a reasonable fallback, making this explicit improves code clarity.
// Parse RC rc := 0 if operation.RC != "" { - fmt.Sscanf(operation.RC, "%d", &rc) + if _, err := fmt.Sscanf(operation.RC, "%d", &rc); err != nil { + klog.Warningf("Failed to parse RC value '%s', defaulting to 0: %v", operation.RC, err) + } }
276-426
: Consider splitting this long function.This 150-line function handles multiple distinct concerns (summary, nodes, resources, history, fencing). Extracting helper functions would improve maintainability and testability.
Example structure:
func buildStatusComponents(result *PacemakerResult) (...) { summary := buildSummary(result) nodes := buildNodeStatuses(result) resources := buildResourceStatuses(result) nodeHistory := buildNodeHistory(result) fencingHistory := buildFencingHistory(result) return summary, nodes, resources, nodeHistory, fencingHistory } func buildSummary(result *PacemakerResult) v1alpha1.PacemakerSummary { ... } func buildNodeStatuses(result *PacemakerResult) []v1alpha1.NodeStatus { ... } // etc.
458-468
: Duplicate config creation logic.This config creation logic duplicates the implementation in
createKubernetesClient
(lines 429-441). Consider reusing the existing function or extracting the common logic.func updatePacemakerStatusCR( ctx context.Context, client *kubernetes.Clientset, rawXML string, summary v1alpha1.PacemakerSummary, nodes []v1alpha1.NodeStatus, resources []v1alpha1.ResourceStatus, nodeHistory []v1alpha1.NodeHistoryEntry, fencingHistory []v1alpha1.FencingEvent, collectionError string, ) error { - // Create a REST client for the custom resource - config, err := rest.InClusterConfig() - if err != nil { - kubeconfig := os.Getenv("KUBECONFIG") - if kubeconfig == "" { - kubeconfig = os.Getenv("HOME") + "/.kube/config" - } - config, err = clientcmd.BuildConfigFromFlags("", kubeconfig) - if err != nil { - return err - } - } + config, err := getRestConfig() + if err != nil { + return fmt.Errorf("failed to get rest config: %w", err) + }Then extract a shared helper:
func getRestConfig() (*rest.Config, error) { config, err := rest.InClusterConfig() if err != nil { kubeconfig := os.Getenv("KUBECONFIG") if kubeconfig == "" { kubeconfig = os.Getenv("HOME") + "/.kube/config" } config, err = clientcmd.BuildConfigFromFlags("", kubeconfig) if err != nil { return nil, err } } return config, nil }cmd/tnf-runtime/main.go (1)
30-30
: Use io.Discard instead of deprecated ioutil.Discard.ioutil is deprecated since Go 1.16. Switch to io.Discard.
-import ( - goflag "flag" - "fmt" - "io/ioutil" +import ( + goflag "flag" + "fmt" + "io" @@ - grpclog.SetLoggerV2(grpclog.NewLoggerV2(ioutil.Discard, os.Stderr, os.Stderr)) + grpclog.SetLoggerV2(grpclog.NewLoggerV2(io.Discard, os.Stderr, os.Stderr))bindata/tnfdeployment/cronjob.yaml (2)
31-35
: Add resource limits to avoid unbounded usage.Set modest limits alongside requests.
resources: requests: cpu: 10m memory: 32Mi + limits: + cpu: 200m + memory: 128Mi
35-41
: Privilege/hostPID: verify necessity and minimize if possible.Privileged, allowPrivilegeEscalation, and hostPID=true trigger multiple policy findings. If not strictly required by pcs, drop or scope down. Consider adding seccompProfile: RuntimeDefault when feasible.
pkg/tnf/pkg/pacemaker/v1alpha1/types.go (1)
7-12
: Cluster-scoped CRD needs a non-namespaced generated client marker.Add +genclient:nonNamespaced so any generated typed clients target cluster scope.
// +genclient +// +genclient:nonNamespaced // +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object // +kubebuilder:object:root=true // +kubebuilder:resource:path=pacemakerstatuses,scope=Cluster // +kubebuilder:subresource:status
pkg/tnf/pkg/pacemaker/statuscollector_test.go (1)
14-20
: Mark helpers with t.Helper() for clearer failure lines.Improves test diagnostics without behavior change.
func loadTestXML(t *testing.T, filename string) string { + t.Helper() @@ func getRecentFailuresXML(t *testing.T) string { + t.Helper()Also applies to: 22-37
pkg/tnf/operator/starter.go (2)
313-325
: Defensive nil-check for alivenessChecker.Avoid a nil deref if a caller forgets to wire it in tests/future code.
func runPacemakerControllers(ctx context.Context, controllerContext *controllercmd.ControllerContext, operatorClient v1helpers.StaticPodOperatorClient, kubeClient kubernetes.Interface, kubeInformersForNamespaces v1helpers.KubeInformersForNamespaces, alivenessChecker *health.MultiAlivenessChecker, etcdInformer operatorv1informers.EtcdInformer) { + if alivenessChecker == nil { + klog.V(2).Info("alivenessChecker is nil; creating default MultiAlivenessChecker") + alivenessChecker = health.NewMultiAlivenessChecker() + } // Start the healthcheck controller to watch PacemakerStatus CRs
488-491
: Log level for “job not found” should not be error.“Not found” is a benign case here; use Info or V(2).
- klog.Errorf("fencing job not found, nothing to do") + klog.V(2).Info("fencing job not found; nothing to delete")pkg/tnf/pkg/pacemaker/healthcheck_test.go (1)
47-57
: Fix misleading comment; no mock REST client is wired here.The helper returns a HealthCheck with pacemakerRESTClient=nil. Update the comment to avoid confusion.
- // Note: This creates a HealthCheck with mock REST client - // In a real scenario, we'd need to mock the PacemakerStatus CR + // Note: This creates a minimal HealthCheck without a pacemaker REST client/informer. + // Use createTestHealthCheckWithMockStatus when you need a mocked PacemakerStatus CR.pkg/tnf/pkg/pacemaker/statuscollector.go (3)
382-387
: Prefer strconv.Atoi over fmt.Sscanf for RC parsing.Simpler, faster, fewer failure modes.
- rc := 0 - if operation.RC != "" { - if _, err := fmt.Sscanf(operation.RC, "%d", &rc); err != nil { - klog.Warningf("Failed to parse RC value '%s': %v", operation.RC, err) - } - } + rc := 0 + if operation.RC != "" { + if v, err := strconv.Atoi(operation.RC); err == nil { + rc = v + } else { + klog.Warningf("Failed to parse RC value '%s': %v", operation.RC, err) + } + }Remember to add:
import "strconv"
.
299-309
: Remove unused nodeIPMap (dead code).You populate nodeIPMap but never read it.
223-236
: Drop unused Kubernetes client creation or use it.
createKubernetesClient()
’s result isn’t used. Either remove it and theclient
parameter in updatePacemakerStatusCR, or use it for something tangible (events, RBAC sanity).- // Create Kubernetes client - client, err := createKubernetesClient() - if err != nil { - return fmt.Errorf("failed to create kubernetes client: %w", err) - } - - // Update PacemakerStatus CR - if err := updatePacemakerStatusCR(ctx, client, rawXML, summary, nodes, resources, nodeHistory, fencingHistory, collectionError); err != nil { + // Update PacemakerStatus CR + if err := updatePacemakerStatusCR(ctx, rawXML, summary, nodes, resources, nodeHistory, fencingHistory, collectionError); err != nil { return fmt.Errorf("failed to update PacemakerStatus CR: %w", err) }And adjust the function signature accordingly (see next comment).
pkg/tnf/pkg/jobs/cronjob_controller.go (1)
3-8
: Import strconv for safe string quoting.Used below to properly escape command/schedule strings.
import ( "context" "fmt" + "strconv" "strings" "time" )
pkg/tnf/pkg/pacemaker/healthcheck.go (8)
31-34
: Clarify/guard the “hourly reminders for 24h” fencing behavior. You dedupe fencing warnings for 1h and rely on the collector limiting FencingHistory to 24h (comment on Line 390). That’s fine, but brittle if the collector window ever changes. Consider enforcing a hard 24h cap here (track firstSeen timestamp per eventKey and stop reminders after 24h). Otherwise, this logic will remind indefinitely.
Also applies to: 388-396
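A hedged sketch of that hard cap, assuming per-event keys already exist; type and field names are illustrative:

import (
    "sync"
    "time"
)

// fencingReminderTracker remembers when each fencing event key was first seen
// and stops reminders once a fixed window has elapsed, regardless of how far
// back the collector's FencingHistory reaches.
type fencingReminderTracker struct {
    mu        sync.Mutex
    firstSeen map[string]time.Time
}

func (t *fencingReminderTracker) shouldRemind(eventKey string, now time.Time, window time.Duration) bool {
    t.mu.Lock()
    defer t.mu.Unlock()
    if t.firstSeen == nil {
        t.firstSeen = map[string]time.Time{}
    }
    first, ok := t.firstSeen[eventKey]
    if !ok {
        t.firstSeen[eventKey] = now
        return true
    }
    return now.Sub(first) < window // e.g. window = 24 * time.Hour
}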
35-37
: Use dynamic expected count (eligible nodes) for resource checks to avoid cascading errors. Hardcoding expectedNodeCount=2 makes resource checks error during planned one‑node windows (e.g., node replacement), even though you already error on the offline node. Use the count of eligible nodes (online and not standby) as the expected value.
Apply this diff:
@@ -// Expected number of nodes in ExternalEtcd cluster -const ( - expectedNodeCount = 2 -) +// Nominal expected node count is 2, but resource checks should use the number of eligible nodes +const ( + expectedNodeCount = 2 // used for informational warnings only +) @@ func (c *HealthCheck) checkResourceStatus(pacemakerStatus *v1alpha1.PacemakerStatus, status *HealthStatus) { @@ - // Check if we have both resources started on all nodes - kubeletCount := len(resourcesStarted[resourceKubelet]) - etcdCount := len(resourcesStarted[resourceEtcd]) - - c.validateResourceCount(resourceKubelet, kubeletCount, status) - c.validateResourceCount(resourceEtcd, etcdCount, status) + // Compute eligible nodes (online and not standby) + eligible := 0 + for _, n := range pacemakerStatus.Status.Nodes { + if n.Online && !n.Standby { + eligible++ + } + } + // Validate both resources are started on all eligible nodes + kubeletCount := len(resourcesStarted[resourceKubelet]) + etcdCount := len(resourcesStarted[resourceEtcd]) + c.validateResourceCount(resourceKubelet, kubeletCount, eligible, status) + c.validateResourceCount(resourceEtcd, etcdCount, eligible, status) @@ -func (c *HealthCheck) validateResourceCount(resourceName string, actualCount int, status *HealthStatus) { - if actualCount < expectedNodeCount { - status.Errors = append(status.Errors, fmt.Sprintf(msgResourceNotStarted, - resourceName, actualCount, expectedNodeCount)) +func (c *HealthCheck) validateResourceCount(resourceName string, actualCount, expectedCount int, status *HealthStatus) { + if actualCount < expectedCount { + status.Errors = append(status.Errors, fmt.Sprintf(msgResourceNotStarted, + resourceName, actualCount, expectedCount)) } } @@ - // Check if we have the expected number of nodes. - // This should almost always be 2, but 1 node is possible during a control-plane node replacement event. - if len(nodes) != expectedNodeCount { + // Check nominal expected number of nodes (informational). + // This should almost always be 2; 1 is possible during control-plane node replacement. + if len(nodes) != expectedNodeCount { status.Warnings = append(status.Warnings, fmt.Sprintf("Expected %d nodes, found %d", expectedNodeCount, len(nodes))) }Also applies to: 358-364, 366-372, 320-324
165-187
: Avoid double resync; remove namespace indexer for a cluster‑scoped CR.Informer resync is set to 30s and controller also ResyncEvery(30s). That’s redundant. Also, PacemakerStatus is cluster‑scoped, so NamespaceIndex is unnecessary.
Apply this diff:
@@ - informer := cache.NewSharedIndexInformer( + informer := cache.NewSharedIndexInformer( &cache.ListWatch{ @@ - &v1alpha1.PacemakerStatus{}, - healthCheckResyncInterval, - cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc}, + &v1alpha1.PacemakerStatus{}, + 0, // rely on controller ResyncEvery + cache.Indexers{}, // cluster-scoped; no namespace index )Ensure factory starts this raw SharedIndexInformer (it typically does for informers passed to WithInformers).
Also applies to: 203-211
79-81
: Remove unused constant.kubernetesAPIPath is defined but never used. Drop it.
Apply this diff:
@@ - // Kubernetes API constants - kubernetesAPIPath = "/apis" - pacemakerResourceName = "pacemakerstatuses" + // Kubernetes API constants + pacemakerResourceName = "pacemakerstatuses"
122-141
: Trim unused struct fields/state.
- HealthCheck.kubeClient isn’t used; remove the field and constructor arg.
- HealthCheck.pacemakerInformer is stored but never used; rely on factory lifecycle instead.
- lastProcessedStatus is written but not read; remove until needed to avoid confusion.
Do you want me to produce a follow‑up diff removing these safely across the file and call sites?
Also applies to: 188-196, 261-265
153-157
: Prefer returning an error over panicking during controller construction. Panics on REST client/scheme setup will crash the operator. Consider changing NewHealthCheck to return (factory.Controller, error) and bubble errors to the caller to fail fast without stack traces from panic.
Also applies to: 158-163
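A hedged sketch of the suggested signature, assuming the existing HealthCheck struct, its sync method, and the createPacemakerRESTClient helper from the PR; the factory chain follows the usual library-go pattern:

func NewHealthCheck(
    operatorClient v1helpers.StaticPodOperatorClient,
    kubeClient kubernetes.Interface,
    eventRecorder events.Recorder,
) (factory.Controller, error) {
    restClient, err := createPacemakerRESTClient()
    if err != nil {
        return nil, fmt.Errorf("failed to build PacemakerStatus REST client: %w", err)
    }
    hc := &HealthCheck{
        operatorClient:      operatorClient,
        kubeClient:          kubeClient,
        eventRecorder:       eventRecorder,
        pacemakerRESTClient: restClient,
    }
    return factory.New().
        WithSync(hc.sync).
        ResyncEvery(30*time.Second).
        ToController("PacemakerHealthCheck", eventRecorder), nil
}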
113-121
: Unknown status carries Errors but does not set degraded — confirm behavior.newUnknownHealthStatus populates Errors, but updateOperatorStatus ignores Unknown for degradation. Is this intentional? If the CR fetch/staleness indicates we may be blind to pacemaker health, product may want Degraded=True with an actionable message.
Also applies to: 411-428
595-604
: Use HasPrefix to classify warning types. Contains can misclassify if messages embed the tokens mid‑string. Prefer HasPrefix since you control the message format.
Apply this diff:
@@ - switch { - case strings.Contains(warning, warningPrefixFailedAction): + switch { + case strings.HasPrefix(warning, warningPrefixFailedAction): c.eventRecorder.Warningf(eventReasonFailedAction, msgDetectedFailedAction, warning) - case strings.Contains(warning, warningPrefixFencingEvent): + case strings.HasPrefix(warning, warningPrefixFencingEvent): c.eventRecorder.Warningf(eventReasonFencingEvent, msgDetectedFencing, warning) default: c.eventRecorder.Warningf(eventReasonWarning, msgPacemakerWarning, warning) }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (26)
Dockerfile.ocp
(1 hunks)bindata/tnfdeployment/clusterrole.yaml
(1 hunks)bindata/tnfdeployment/cronjob.yaml
(1 hunks)bindata/tnfdeployment/job.yaml
(1 hunks)cmd/tnf-runtime/main.go
(4 hunks)manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
(1 hunks)pkg/cmd/pacemakerstatuscollector/collector.go
(1 hunks)pkg/operator/ceohelpers/external_etcd_status.go
(2 hunks)pkg/operator/starter.go
(1 hunks)pkg/tnf/operator/starter.go
(5 hunks)pkg/tnf/operator/starter_test.go
(4 hunks)pkg/tnf/pkg/jobs/batch.go
(1 hunks)pkg/tnf/pkg/jobs/cronjob_controller.go
(1 hunks)pkg/tnf/pkg/pacemaker/client_helpers.go
(1 hunks)pkg/tnf/pkg/pacemaker/healthcheck.go
(1 hunks)pkg/tnf/pkg/pacemaker/healthcheck_test.go
(1 hunks)pkg/tnf/pkg/pacemaker/statuscollector.go
(1 hunks)pkg/tnf/pkg/pacemaker/statuscollector_test.go
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/healthy_cluster.xml
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/offline_node.xml
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/standby_node.xml
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/doc.go
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/register.go
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/types.go
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go
(1 hunks)
✅ Files skipped from review due to trivial changes (4)
- pkg/tnf/pkg/pacemaker/testdata/offline_node.xml
- pkg/tnf/pkg/pacemaker/testdata/healthy_cluster.xml
- pkg/tnf/pkg/pacemaker/testdata/standby_node.xml
- pkg/tnf/pkg/pacemaker/v1alpha1/doc.go
🚧 Files skipped from review as they are similar to previous changes (1)
- manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml
[medium] 1-47: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 1-47: Container should not be privileged
(CKV_K8S_16)
[medium] 1-47: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 1-47: Containers should not share the host process ID namespace
(CKV_K8S_17)
🔇 Additional comments (22)
pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml (1)
1-66
: Well-formed test data template for Pacemaker health check scenarios.The XML template is properly structured and comprehensively models the test scenarios described in the PR: recent kubelet operation failure (line 56), fencing events (line 63), node health status (lines 12–13), and resource deployment across nodes (lines 16–37). The placeholder convention using
{{PLACEHOLDER_NAME}}
is appropriate for dynamic test data substitution.pkg/cmd/pacemakerstatuscollector/collector.go (4)
24-50
: LGTM!Constants are well-defined with appropriate values and clear documentation.
189-202
: LGTM!Standard Cobra command implementation with appropriate error handling.
204-274
: LGTM!Good design decisions:
- Proper timeout handling with contexts
- XML size validation to prevent XML bombs (10MB limit)
- Returns raw XML even when parsing fails, allowing debugging
- Clear separation of concerns between collection and update
428-444
: LGTM!Standard Kubernetes client creation with appropriate fallback from in-cluster config to kubeconfig.
bindata/tnfdeployment/job.yaml (1)
18-18
: LGTM: Binary rename aligns with runtime refactoring.The command change from
tnf-setup-runner
totnf-runtime
is consistent with the broader refactoring visible in Dockerfile.ocp and the introduction of the runtime CLI flow.Dockerfile.ocp (1)
13-13
: LGTM: Binary rename matches deployment manifest.The COPY directive now references
tnf-runtime
, consistent with the updated container command in bindata/tnfdeployment/job.yaml.bindata/tnfdeployment/clusterrole.yaml (1)
32-50
: LGTM: RBAC rules follow standard CRD pattern.The two rules correctly grant permissions for the PacemakerStatus CRD and its status subresource. The verb sets are appropriate for a status collector that needs to create, read, and update PacemakerStatus resources.
pkg/operator/ceohelpers/external_etcd_status.go (1)
167-201
: LGTM: Well-designed generic condition waiter.The
WaitForEtcdCondition
helper follows best practices:
- Ensures informer cache is synced before checking conditions
- Uses standard polling with configurable intervals and timeout
- Properly handles transient errors by retrying
- Clear logging at appropriate levels
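For reference, the call shape used by the starter (as quoted elsewhere in this review) looks roughly like this:

if err := ceohelpers.WaitForEtcdCondition(
    ctx,
    etcdInformer,
    operatorClient,
    ceohelpers.HasExternalEtcdCompletedTransition,
    10*time.Second, // poll interval
    30*time.Minute, // overall timeout
    "external etcd transition",
); err != nil {
    klog.Errorf("failed to wait for external etcd transition: %v", err)
    return
}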
pkg/operator/starter.go (1)
630-631
: LGTM: Function signature extended to support new functionality.The additional
dynamicClient
andAlivenessChecker
parameters align with the Pacemaker health check integration. Tests in starter_test.go have been updated accordingly.pkg/tnf/pkg/jobs/batch.go (1)
42-48
: LGTM: Consistent with existing decoder pattern.
ReadCronJobV1OrDie
mirrors the implementation ofReadJobV1OrDie
, using the same codec and panic-on-error approach. This consistency aids maintainability.pkg/tnf/operator/starter_test.go (2)
96-109
: LGTM: Test updated to pass new parameters.The test correctly instantiates
alivenessChecker
viahealth.NewMultiAlivenessChecker()
and passes bothdynamicClient
andalivenessChecker
toHandleDualReplicaClusters
, matching the updated function signature.
260-263
: LGTM: Minimal REST config appropriate for testing.The minimal
rest.Config
with a localhost host provides the necessary structure for controllers that require KubeConfig without introducing unnecessary test dependencies.pkg/tnf/pkg/pacemaker/client_helpers.go (2)
17-32
: ****The constants
envKubeconfig
,envHome
, anddefaultKubeconfigPath
are properly defined as package-level constants inpkg/tnf/pkg/pacemaker/statuscollector.go
(lines 57-59). In Go, package-level constants are accessible throughout the entire package without explicit imports. Since bothstatuscollector.go
andclient_helpers.go
are in the samepacemaker
package, the code will compile without errors. No changes are required.Likely an incorrect or invalid review comment.
37-56
: ****The
kubernetesAPIPath
constant is properly defined in the same package atpkg/tnf/pkg/pacemaker/healthcheck.go:79
as"/apis"
. In Go, package-level constants are accessible across all files within the same package, regardless of which file defines them. Since bothhealthcheck.go
andclient_helpers.go
are in thepkg/tnf/pkg/pacemaker/
package, the reference at line 47 ofclient_helpers.go
is valid and will compile correctly.Likely an incorrect or invalid review comment.
cmd/tnf-runtime/main.go (1)
124-125
: Align docs with actual pcs invocation.Help text says "pcs status xml" but PR summary mentions "pcs status --full". Please confirm and update to match the collector’s actual call.
bindata/tnfdeployment/cronjob.yaml (1)
42-42
: Good: using tnf-setup-manager SA matches existing TNF jobs.pkg/tnf/pkg/pacemaker/v1alpha1/register.go (1)
24-31
: API registration looks correct.Group/version, known types, and AddToGroupVersion are wired properly.
pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)
1-211
: Generated deepcopy looks correct.Covers all v1alpha1 types; deep-copies metav1.Time fields and slices properly.
pkg/tnf/pkg/jobs/cronjob_controller.go (1)
175-213
: Rewrite quoting logic for command/schedule; discard namespace injection suggestion.The cronjob template (
bindata/tnfdeployment/cronjob.yaml
) uses a hardcodednamespace: openshift-etcd
, not a placeholder. Only the name, schedule, image, and command fields use<injected>
markers. Implement only the quoting improvements:
- Use
strconv.Quote()
for command array elements (safe YAML escaping).- Use
strconv.Quote()
for schedule value (handles special chars like*
).var commandBuilder strings.Builder commandBuilder.WriteString("[") for i, cmd := range c.command { if i > 0 { commandBuilder.WriteString(", ") } - commandBuilder.WriteString(fmt.Sprintf(`"%s"`, cmd)) + commandBuilder.WriteString(strconv.Quote(cmd)) } commandBuilder.WriteString("]") commandStr := commandBuilder.String() // Track which line we're on to do context-aware replacement lines := strings.Split(result, "\n") for i, line := range lines { // Replace name (appears in metadata.name) if strings.Contains(line, "name: <injected>") { lines[i] = strings.Replace(line, "<injected>", c.name, 1) } // Replace app label if strings.Contains(line, "app.kubernetes.io/name: <injected>") { lines[i] = strings.Replace(line, "<injected>", c.name, 1) } // Replace schedule (must be quoted to handle special characters like *) if strings.Contains(line, "schedule: <injected>") { - lines[i] = strings.Replace(line, "<injected>", fmt.Sprintf(`"%s"`, c.schedule), 1) + lines[i] = strings.Replace(line, "<injected>", strconv.Quote(c.schedule), 1) } // Replace image if strings.Contains(line, "image: <injected>") { lines[i] = strings.Replace(line, "<injected>", c.image, 1) } // Replace command (special handling for array) if strings.Contains(line, "command: <injected>") { lines[i] = strings.Replace(line, "<injected>", commandStr, 1) } }Likely an incorrect or invalid review comment.
pkg/tnf/pkg/pacemaker/healthcheck.go (2)
441-470
: Condition update churn minimized — nice. Checking current condition status/message before updates and logging at V(2) avoids noise and unnecessary API writes. Good hygiene.
Also applies to: 492-519
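A minimal sketch of the check-before-write pattern being praised here; the helper name is illustrative:

import operatorv1 "github.com/openshift/api/operator/v1"

// conditionNeedsUpdate reports whether writing the desired condition would
// actually change anything; callers skip the API write (and the log noise)
// when it returns false.
func conditionNeedsUpdate(current *operatorv1.OperatorCondition, desired operatorv1.OperatorCondition) bool {
    if current == nil {
        return true
    }
    return current.Status != desired.Status ||
        current.Reason != desired.Reason ||
        current.Message != desired.Message
}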
548-592
: Event dedup design is simple and effective. Mutex + time‑based keys keep spam down while allowing healthy transition events only once per recovery. Looks good.
// Build node IP map from node attributes
nodeIPMap := make(map[string]string)
for _, nodeAttrSet := range result.NodeAttributes.Node {
    for _, attr := range nodeAttrSet.Attribute {
        if attr.Name == "node_ip" {
            nodeIPMap[nodeAttrSet.Name] = attr.Value
            break
        }
    }
}
Dead code: nodeIPMap is built but never used.
The nodeIPMap
is populated from node attributes but is never referenced when building the nodes
array (lines 308-313) or anywhere else in the function.
If node IP information is needed, apply this diff to include it:
nodes = append(nodes, v1alpha1.NodeStatus{
Name: node.Name,
Online: online,
Standby: node.Standby == booleanValueTrue,
+ IP: nodeIPMap[node.Name],
})
Otherwise, remove the nodeIPMap construction entirely:
- // Build node IP map from node attributes
- nodeIPMap := make(map[string]string)
- for _, nodeAttrSet := range result.NodeAttributes.Node {
- for _, attr := range nodeAttrSet.Attribute {
- if attr.Name == "node_ip" {
- nodeIPMap[nodeAttrSet.Name] = attr.Value
- break
- }
- }
- }
-
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Build node IP map from node attributes
nodeIPMap := make(map[string]string)
for _, nodeAttrSet := range result.NodeAttributes.Node {
    for _, attr := range nodeAttrSet.Attribute {
        if attr.Name == "node_ip" {
            nodeIPMap[nodeAttrSet.Name] = attr.Value
            break
        }
    }
}
// Build node status entries
var nodes []v1alpha1.NodeStatus
for _, node := range result.Nodes.Node {
🤖 Prompt for AI Agents
pkg/cmd/pacemakerstatuscollector/collector.go around lines 289-298: the
nodeIPMap is populated but never used; either remove the entire nodeIPMap
construction if node IPs are not required, or integrate it when creating the
nodes slice (around lines 308-313) by looking up nodeIPMap[nodeName] and
assigning that value to the node struct's IP field (or appropriate field) during
node creation; make sure to handle missing keys (empty string) and remove the
unused variable if you choose deletion.
b915543 to 4513271 (Compare)
Actionable comments posted: 6
♻️ Duplicate comments (7)
bindata/tnfdeployment/cronjob.yaml (1)
5-5
: Namespace configurability (duplicate of earlier feedback). Hardcoded
openshift-etcd
is fine for this operator, but if you intend to reuse the controller, keep this as a placeholder and have the hook set it tooperatorclient.TargetNamespace
.- namespace: openshift-etcd + namespace: <injected>manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1)
47-55
: Size cap added — also add a truncation flag. maxLength on rawXML is great. Add status.summary.rawXMLTruncated (boolean) so collectors can signal when they trimmed content.
Example:
summary: description: Summary provides high-level counts and flags for the cluster state type: object properties: + rawXMLTruncated: + description: Indicates rawXML was truncated to fit CR size limits + type: booleanpkg/tnf/pkg/pacemaker/v1alpha1/types.go (1)
38-41
: Update RawXML comment to match actual usage. Health check parses RawXML today; comment says it shouldn’t.
- // RawXML contains the raw XML output from pcs status xml command - // Kept for debugging purposes only; healthcheck should not need to parse this + // RawXML contains the raw XML output from pcs/crm_mon. Kept for debugging and + // parsed by the health check to derive status. // +optional RawXML string `json:"rawXML,omitempty"`pkg/tnf/operator/starter.go (1)
329-346
: Don’t give up after 30m; retry until ctx.Done() before starting the collector. If etcd transitions after the first 30-minute wait, the Pacemaker status collector never starts. Loop with backoff until success or context cancellation.
Apply:
go func() { klog.Infof("waiting for etcd to transition to external before starting Pacemaker status collector") - if err := ceohelpers.WaitForEtcdCondition( - ctx, - etcdInformer, - operatorClient, - ceohelpers.HasExternalEtcdCompletedTransition, - 10*time.Second, - 30*time.Minute, - "external etcd transition", - ); err != nil { - klog.Errorf("failed to wait for external etcd transition: %v", err) - return - } - klog.Infof("etcd has transitioned to external; starting Pacemaker status collector CronJob") - runPacemakerStatusCollectorCronJob(ctx, controllerContext, operatorClient, kubeClient) + for { + if err := ceohelpers.WaitForEtcdCondition( + ctx, etcdInformer, operatorClient, + ceohelpers.HasExternalEtcdCompletedTransition, + 10*time.Second, 30*time.Minute, "external etcd transition", + ); err != nil { + if ctx.Err() != nil { + klog.Infof("context done while waiting for external etcd transition: %v", err) + return + } + klog.Warningf("external etcd transition not complete yet, retrying in 1m: %v", err) + select { + case <-time.After(time.Minute): + continue + case <-ctx.Done(): + return + } + } + klog.Infof("etcd has transitioned to external; starting Pacemaker status collector CronJob") + runPacemakerStatusCollectorCronJob(ctx, controllerContext, operatorClient, kubeClient) + return + } }()pkg/tnf/pkg/pacemaker/healthcheck_test.go (1)
280-302
: Fix assertion messages to match the method name.Replace “updateOperatorAvailability” with “updateOperatorStatus” in the three require.NoError messages.
- require.NoError(t, err, "updateOperatorAvailability should not return an error") + require.NoError(t, err, "updateOperatorStatus should not return an error")Apply at Lines 281, 291, and 301.
pkg/tnf/pkg/pacemaker/statuscollector.go (2)
227-236
: Remove unused client creation and pass only what updatePacemakerStatusCR needs.The kubernetes.Clientset isn’t used; simplify runCollector.
- // Create Kubernetes client - client, err := createKubernetesClient() - if err != nil { - return fmt.Errorf("failed to create kubernetes client: %w", err) - } - // Update PacemakerStatus CR - if err := updatePacemakerStatusCR(ctx, client, rawXML, summary, nodes, resources, nodeHistory, fencingHistory, collectionError); err != nil { + if err := updatePacemakerStatusCR(ctx, rawXML, summary, nodes, resources, nodeHistory, fencingHistory, collectionError); err != nil { return fmt.Errorf("failed to update PacemakerStatus CR: %w", err) }
3-18
: Use apierrors for NotFound and remove unused imports.Replace string matching with apierrors.IsNotFound, add the import, and (after removing the unused client parameter/function below) drop k8s clientset import.
import ( "context" "encoding/xml" "fmt" "os" - "strings" "time" "github.com/spf13/cobra" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" - "k8s.io/client-go/kubernetes" + apierrors "k8s.io/apimachinery/pkg/api/errors" "k8s.io/klog/v2"
🧹 Nitpick comments (21)
pkg/tnf/pkg/jobs/cronjob_controller.go (3)
141-154
: Add retry-on-conflict for updates.Update can race with other reconcilers/GC; wrap in a retry loop to handle 409s.
func (c *CronJobController) updateCronJob(ctx context.Context, existing, desired *batchv1.CronJob) error { - // Preserve the resource version - desired.ResourceVersion = existing.ResourceVersion - - _, err := c.kubeClient.BatchV1().CronJobs(operatorclient.TargetNamespace).Update(ctx, desired, metav1.UpdateOptions{}) - if err != nil { - return fmt.Errorf("failed to update CronJob: %w", err) - } + // Preserve the resource version and retry on conflict + return retry.RetryOnConflict(retry.DefaultBackoff, func() error { + desired.ResourceVersion = existing.ResourceVersion + _, err := c.kubeClient.BatchV1().CronJobs(operatorclient.TargetNamespace).Update(ctx, desired, metav1.UpdateOptions{}) + if apierrors.IsConflict(err) { + // Refresh existing and try again + var getErr error + existing, getErr = c.kubeClient.BatchV1().CronJobs(operatorclient.TargetNamespace).Get(ctx, c.name, metav1.GetOptions{}) + if getErr != nil { + return getErr + } + return err + } + return err + })Note: add
k8s.io/client-go/util/retry
to imports.
176-209
: Broaden needsUpdate checks or use semantic deep-equality.Only schedule, command length/content, and image are compared. Fields like args, env, resources, securityContext, serviceAccount, concurrencyPolicy, history limits, job labels/annotations, and pod template changes won’t trigger updates.
Option A (quick): compare
existing.Spec
vsdesired.Spec
with semantic equality after normalizing known server-defaults.
Option B: compute a hash (annotation) of.Spec.JobTemplate.Spec.Template.Spec
and.Spec.Schedule
and compare.
If you choose A, importk8s.io/apimachinery/pkg/api/equality
and useapiequality.Semantic.DeepEqual
.
3-20
: Imports will need retry/equality if you adopt the above.Add:
k8s.io/client-go/util/retry
(for update retry)k8s.io/apimachinery/pkg/api/equality
(if using semantic deep-equal)bindata/tnfdeployment/cronjob.yaml (2)
36-43
: Privileged + hostPID justified? Document rationale.Security scanners will flag these. If required to run
pcs
/interact with host daemons, add a comment in the manifest explaining why, and keep using thetnf-setup-manager
SA with the SCC binding. Otherwise, drop them.Also applies to: 39-41, 41-41, 37-39
45-47
: Node selector label forward‑compatibility.If newer clusters use
node-role.kubernetes.io/control-plane
, consider tolerating both (via affinity with matchExpressions) in a follow‑up.pkg/operator/ceohelpers/external_etcd_status.go (2)
179-186
: Syncing the wrong/insufficient informerWaiting only on etcdInformer doesn’t ensure operatorClient caches are warm. Either (a) accept the exact informer(s) whose listers back operatorClient, or (b) take varargs of HasSynced fns and wait for all.
Proposed shape:
-func WaitForEtcdCondition(ctx context.Context, etcdInformer operatorv1informers.EtcdInformer, operatorClient v1helpers.StaticPodOperatorClient, conditionCheck func(context.Context, v1helpers.StaticPodOperatorClient) (bool, error), pollInterval time.Duration, timeout time.Duration, conditionName string) error { +func WaitForEtcdCondition(ctx context.Context, operatorClient v1helpers.StaticPodOperatorClient, conditionCheck func(context.Context, v1helpers.StaticPodOperatorClient) (bool, error), pollInterval time.Duration, timeout time.Duration, conditionName string, syncedFns ...cache.InformerSynced) error { - if !cache.WaitForCacheSync(ctx.Done(), etcdInformer.Informer().HasSynced) { + if len(syncedFns) > 0 && !cache.WaitForCacheSync(ctx.Done(), syncedFns...) { return fmt.Errorf("failed to sync informers") }
187-201
: Tighten validation and reduce log noise
- Validate pollInterval > 0 and timeout > 0 to avoid surprising behavior.
- Use klog.V(2/4) for “waiting/synced/retrying” messages to keep default logs clean.
Apply:
+ if pollInterval <= 0 || timeout <= 0 { + return fmt.Errorf("invalid timings: pollInterval=%v timeout=%v", pollInterval, timeout) + } - klog.Infof("waiting for etcd informer to sync before checking %s...", conditionName) + klog.V(2).Infof("waiting for informers to sync before checking %s...", conditionName) - klog.Infof("etcd informer synced, checking for %s", conditionName) + klog.V(2).Infof("informers synced, checking for %s", conditionName) - klog.Warningf("error checking %s, will retry: %v", conditionName, err) + klog.V(2).Infof("error checking %s, will retry: %v", conditionName, err) - klog.V(4).Infof("%s condition not yet met, waiting...", conditionName) + klog.V(4).Infof("%s not yet met; retrying...", conditionName)manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1)
16-18
: Add printer columns for quick triageExpose high-signal fields in kubectl get output (age/nodes/resources/quorum). Improves operability.
Example:
- name: v1alpha1 served: true storage: true + additionalPrinterColumns: + - name: Quorum + type: string + jsonPath: .status.summary.hasQuorum + - name: Nodes + type: string + jsonPath: .status.summary.nodesOnline + - name: Resources + type: string + jsonPath: .status.summary.resourcesStartedpkg/tnf/pkg/pacemaker/v1alpha1/types.go (3)
38-42
: Mirror CRD size cap via kubebuilder validationKeep code and CRD in sync; enforce max length at type level too.
- // +optional - RawXML string `json:"rawXML,omitempty"` + // +kubebuilder:validation:MaxLength=262144 + // +optional + RawXML string `json:"rawXML,omitempty"`
7-12
: Add printer columns for operabilityExpose key summary fields in kubectl get output.
// +kubebuilder:object:root=true // +kubebuilder:resource:path=pacemakerstatuses,scope=Cluster // +kubebuilder:subresource:status +// +kubebuilder:printcolumn:name="Quorum",type=string,JSONPath=`.status.summary.hasQuorum` +// +kubebuilder:printcolumn:name="Nodes",type=string,JSONPath=`.status.summary.nodesOnline` +// +kubebuilder:printcolumn:name="Resources",type=string,JSONPath=`.status.summary.resourcesStarted` +// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`Also applies to: 68-93
23-31
: Spec.NodeName likely unused — keep or dropSpec is documented as unused; NodeName appears vestigial. Either remove it or add a TODO with intended use and ensure generators/CRD stay aligned.
pkg/tnf/pkg/pacemaker/statuscollector_test.go (1)
398-404
: Add a targeted negative-path test for collectPacemakerStatus rather than xml.Unmarshal. This test validates encoding/xml behavior, not our collector path. Consider mocking exec.Execute to return malformed XML and assert collectPacemakerStatus returns collectionError and empty structs. I can provide a table-driven test using a stubbed exec.Execute.
pkg/tnf/pkg/pacemaker/statuscollector.go (2)
451-459
: Remove dead helper createKubernetesClient if no longer used.After dropping the client parameter, this function becomes unused and should be removed with its import.
-// createKubernetesClient creates a Kubernetes clientset using in-cluster config -// or falling back to kubeconfig file. -func createKubernetesClient() (*kubernetes.Clientset, error) { - config, err := getKubeConfig() - if err != nil { - return nil, err - } - return kubernetes.NewForConfig(config) -}
21-60
: Make the pcs command configurable and avoid sudo coupling. Hard-coding “sudo -n pcs status xml” can fail in minimal images. Consider reading the command from env/config (default to “pcs status xml”) and document required RBAC/permissions.
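A hedged sketch of the env-driven command, with an illustrative variable name and default:

import (
    "os"
    "strings"
)

// pcsStatusCommand returns the pcs invocation to run. PCS_STATUS_COMMAND is an
// assumed variable name; the default drops sudo so minimal images still work.
func pcsStatusCommand() []string {
    if raw := os.Getenv("PCS_STATUS_COMMAND"); raw != "" {
        return strings.Fields(raw)
    }
    return []string{"pcs", "status", "xml"}
}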
pkg/tnf/operator/starter.go (1)
360-372
: Add a container bounds check before accessing Containers[0] for defensive programming.The YAML template at
bindata/tnfdeployment/cronjob.yaml
currently defines exactly one container, so there is no immediate runtime risk. However, the hook function directly accessescronJob.Spec.JobTemplate.Spec.Template.Spec.Containers[0]
without validation. For robustness against future template changes or unexpected states, add a guard similar to the pattern already used inCronJobController.needsUpdate()
:if len(cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers) == 0 { return fmt.Errorf("CronJob template has no containers") } cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers[0].Image = os.Getenv("OPERATOR_IMAGE") cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers[0].Command = []string{"tnf-runtime", "monitor"}pkg/tnf/pkg/pacemaker/healthcheck.go (6)
613-623
: Avoid double-prefixed event messages.Events read like “Pacemaker detected a recent failed resource action: Recent failed resource action: …”.
Apply:
func (c *HealthCheck) recordWarningEvent(warning string) { switch { case strings.Contains(warning, warningPrefixFailedAction): - c.eventRecorder.Warningf(eventReasonFailedAction, msgDetectedFailedAction, warning) + details := strings.TrimPrefix(warning, warningPrefixFailedAction+" ") + c.eventRecorder.Warningf(eventReasonFailedAction, msgDetectedFailedAction, details) case strings.Contains(warning, warningPrefixFencingEvent): - c.eventRecorder.Warningf(eventReasonFencingEvent, msgDetectedFencing, warning) + details := strings.TrimPrefix(warning, warningPrefixFencingEvent+" ") + c.eventRecorder.Warningf(eventReasonFencingEvent, msgDetectedFencing, details) default: c.eventRecorder.Warningf(eventReasonWarning, msgPacemakerWarning, warning) } }
492-507
: Return a copy of the condition, not a pointer to a slice element.Prevents subtle aliasing on a local backing array.
Apply:
func (c *HealthCheck) getCurrentPacemakerCondition() (*operatorv1.OperatorCondition, error) { @@ for i := range currentStatus.Conditions { if currentStatus.Conditions[i].Type == conditionTypePacemakerDegraded { - return ¤tStatus.Conditions[i], nil + cond := currentStatus.Conditions[i] + return &cond, nil } }
153-163
: Constructor panics: consider fail-fast via utilruntime.Must or returning an error (if possible).Panics restart the operator; acceptable at init but harder to test. If signature can’t change, wrap with utilruntime.Must for clarity; otherwise prefer returning an error to the caller.
124-141
: Remove/justify unused fields and constants.
- HealthCheck.kubeClient, lastProcessedStatus/lock are unused here.
- kubernetesAPIPath is unused.
Either remove to reduce noise or add TODO with intended use.
Also applies to: 78-81
521-543
: Stabilize dedup keys. Using full messages as keys risks churn if wording changes. Prefer typed keys, e.g., kind|resource|op|node|rc|timestampTrunc.
I can propose a small helper makeEventKey(kind string, parts ...string) and use it in both loops.
Also applies to: 564-581, 582-588
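A possible shape for that helper (the name and key layout are illustrative):

import "strings"

// makeEventKey builds a dedup key from stable, typed fields instead of the full
// human-readable message, so wording changes do not reset deduplication.
func makeEventKey(kind string, parts ...string) string {
    return kind + "|" + strings.Join(parts, "|")
}

// Example: makeEventKey("failed-action", "kubelet", "monitor", "master-0", "rc=1", "2025-01-01T10:00")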
471-490
: Condition clear policy: consider clearing degraded when errors disappear (even if warnings remain).Currently, degraded clears only on Healthy, leaving degraded stuck on Warning-only states.
If desired, clear when len(status.Errors)==0.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (26)
.gitignore
(2 hunks)Dockerfile.ocp
(1 hunks)bindata/tnfdeployment/clusterrole.yaml
(1 hunks)bindata/tnfdeployment/cronjob.yaml
(1 hunks)bindata/tnfdeployment/job.yaml
(1 hunks)cmd/tnf-runtime/main.go
(4 hunks)manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
(1 hunks)pkg/operator/ceohelpers/external_etcd_status.go
(2 hunks)pkg/operator/starter.go
(1 hunks)pkg/tnf/operator/starter.go
(5 hunks)pkg/tnf/operator/starter_test.go
(4 hunks)pkg/tnf/pkg/jobs/batch.go
(1 hunks)pkg/tnf/pkg/jobs/cronjob_controller.go
(1 hunks)pkg/tnf/pkg/pacemaker/client_helpers.go
(1 hunks)pkg/tnf/pkg/pacemaker/healthcheck.go
(1 hunks)pkg/tnf/pkg/pacemaker/healthcheck_test.go
(1 hunks)pkg/tnf/pkg/pacemaker/statuscollector.go
(1 hunks)pkg/tnf/pkg/pacemaker/statuscollector_test.go
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/healthy_cluster.xml
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/offline_node.xml
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/standby_node.xml
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/doc.go
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/register.go
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/types.go
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml
- pkg/tnf/pkg/pacemaker/testdata/healthy_cluster.xml
- cmd/tnf-runtime/main.go
- bindata/tnfdeployment/job.yaml
- bindata/tnfdeployment/clusterrole.yaml
- pkg/tnf/pkg/jobs/batch.go
- pkg/tnf/pkg/pacemaker/v1alpha1/doc.go
- pkg/tnf/pkg/pacemaker/client_helpers.go
- pkg/tnf/operator/starter_test.go
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml
[medium] 1-48: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 1-48: Container should not be privileged
(CKV_K8S_16)
[medium] 1-48: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 1-48: Containers should not share the host process ID namespace
(CKV_K8S_17)
🔇 Additional comments (9)
.gitignore (1)
14-14
: Update ignore for new binary looks good.tnf-runtime addition aligns with the rename across the repo.
pkg/tnf/pkg/pacemaker/testdata/offline_node.xml (1)
1-67
: Test snapshot LGTM.Covers an offline node plus a recent fence event; good for negative-path tests. Minor: duplicate kubelet probe entry (Line 58) seems intentional; confirm parser tolerates duplicates.
bindata/tnfdeployment/cronjob.yaml (1)
9-9
: Confirm schedule injection occurs (and is valid 5‑field cron).
schedule: <injected>
must be populated by the controller hook with a valid 5‑field cron (e.g.,*/30 * * * *
). If the hook is optional, consider a sane default here to keep the asset valid.pkg/operator/starter.go (1)
630-632
: Signature change verified — no issues found.The function signature in
pkg/tnf/operator/starter.go:63-75
correctly includesdynamicClient
andalivenessChecker
as the final two parameters. Both call sites—production (pkg/operator/starter.go:619
) and test (pkg/tnf/operator/starter_test.go:97
)—pass all 12 arguments in the correct order. Wiring is correct and no build issues are expected.Dockerfile.ocp (1)
13-13
: Verified: tnf-runtime is correctly built by the Makefile.The build process is properly configured:
cmd/tnf-runtime
is included inGO_BUILD_PACKAGES
(defaults to./cmd/...
)make build
compiles all cmd packages via golang.mk/build.mk- Binaries output to WORKDIR root (
/go/src/github.com/openshift/cluster-etcd-operator/
)- Output location matches Dockerfile.ocp COPY expectation
The COPY command will succeed.
pkg/tnf/pkg/pacemaker/testdata/standby_node.xml (1)
1-1
: Confirm source of XML (crm_mon vs pcs status)request attribute shows crm_mon; PR mentions pcs status --full. If pcs shells out to crm_mon this is fine—please confirm to avoid format drift surprises in tests.
pkg/tnf/pkg/pacemaker/v1alpha1/register.go (1)
9-18
: Scheme registration looks goodGroup/version and type registration are correct; AddToScheme wiring is standard.
Also applies to: 24-31
pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)
1-211
: Generated deepcopy implementations look correct and consistent with v1alpha1 types. No issues.pkg/tnf/pkg/pacemaker/healthcheck.go (1)
350-354
: Code is internally consistent; cross-release compatibility verification required.The repository uses the prefixes correctly and consistently:
resourceAgentKubelet = "systemd:kubelet"
andresourceAgentEtcd = "ocf:heartbeat:podman-etcd"
are defined as constants, validated by test data across multiple test files, and match the resource creation commands inpkg/tnf/pkg/pcs/cluster.go
andpkg/tnf/pkg/pcs/etcd.go
. ThecreatePacemakerRESTClient
function is implemented. No variant agent strings were found in the codebase.However, the concern about compatibility across all supported pcs releases cannot be verified from code inspection alone. Confirm with your pcs release documentation or compatibility matrix that these agent strings remain stable across the releases your deployment supports.
4513271 to 8e58a60 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (4)
pkg/tnf/pkg/jobs/cronjob_controller.go (1)
156-174
: Force CronJob name/namespace to controller-owned values.Avoid mismatches between GET path (TargetNamespace/c.name) and the object parsed from the manifest. Set both explicitly after parsing to prevent 422s and orphaned resources. This was raised earlier and still applies.
func (c *CronJobController) buildCronJob(ctx context.Context) (*batchv1.CronJob, error) { // Parse the manifest cronJob := ReadCronJobV1OrDie(c.manifest) + // Enforce controller-managed identity and scope + cronJob.Namespace = operatorclient.TargetNamespace + cronJob.Name = c.name + // Get operator spec for hooks opSpec, _, _, err := c.operatorClient.GetOperatorState() if err != nil { return nil, fmt.Errorf("failed to get operator state: %w", err) }pkg/tnf/pkg/pacemaker/v1alpha1/types.go (1)
38-41
: Update RawXML comment to match usage.Health check/collector parse it today; adjust the comment to avoid confusion.
- // RawXML contains the raw XML output from pcs status xml command - // Kept for debugging purposes only; healthcheck should not need to parse this + // RawXML contains the raw XML output from 'pcs status --full --output-as xml'. + // Kept for debugging and may be parsed by health/monitoring components.bindata/tnfdeployment/cronjob.yaml (1)
31-31
: command must be []string, not string.Type mismatch may fail decoding. Wrap in array or leave empty for hook fill.
- command: <injected> + command: ["<injected>"]pkg/tnf/pkg/pacemaker/healthcheck_test.go (1)
280-302
: Prior nit fixed — good.Assertion messages now reference updateOperatorStatus (was updateOperatorAvailability). Thanks for addressing.
🧹 Nitpick comments (24)
pkg/tnf/pkg/jobs/cronjob_controller.go (2)
176-209
: Broaden update detection beyond first container; compare full PodSpec and key CronJob fields.Current check misses args/env/resources/SA/annotations and assumes a single container. Use semantic deep-equality on the JobTemplate’s PodSpec and also compare other CronJob knobs (ConcurrencyPolicy, HistoryLimits, StartingDeadlineSeconds).
+import ( + // ... + apiequality "k8s.io/apimachinery/pkg/api/equality" +) @@ func (c *CronJobController) needsUpdate(existing, desired *batchv1.CronJob) bool { // Check if schedule has changed if existing.Spec.Schedule != desired.Spec.Schedule { klog.V(4).Infof("Schedule changed from %q to %q", existing.Spec.Schedule, desired.Spec.Schedule) return true } - // Check if the job template has changed (containers, command, image, etc.) - if len(existing.Spec.JobTemplate.Spec.Template.Spec.Containers) > 0 && - len(desired.Spec.JobTemplate.Spec.Template.Spec.Containers) > 0 { - existingContainer := existing.Spec.JobTemplate.Spec.Template.Spec.Containers[0] - desiredContainer := desired.Spec.JobTemplate.Spec.Template.Spec.Containers[0] - // Check if command has changed - if len(existingContainer.Command) != len(desiredContainer.Command) { ... } - for i := range desiredContainer.Command { ... } - // Check if image has changed - if existingContainer.Image != desiredContainer.Image { ... } - } + // Compare CronJob-level knobs + if existing.Spec.ConcurrencyPolicy != desired.Spec.ConcurrencyPolicy || + ptrVal(existing.Spec.StartingDeadlineSeconds) != ptrVal(desired.Spec.StartingDeadlineSeconds) || + ptrVal(existing.Spec.SuccessfulJobsHistoryLimit) != ptrVal(desired.Spec.SuccessfulJobsHistoryLimit) || + ptrVal(existing.Spec.FailedJobsHistoryLimit) != ptrVal(desired.Spec.FailedJobsHistoryLimit) { + return true + } + + // Compare full PodSpec (containers, args, env, resources, SA, securityContext, etc.) + if !apiequality.Semantic.DeepEqual( + existing.Spec.JobTemplate.Spec.Template.Spec, + desired.Spec.JobTemplate.Spec.Template.Spec, + ) { + return true + } return false } + +// ptrVal safely dereferences optional int32 pointers for comparison +func ptrVal(p *int32) int32 { + if p == nil { return -1 } + return *p +}
66-73
: Consider adding a CronJob informer to react to external edits promptly.Relying solely on a 1m resync can delay reconciliation after manual changes. Watching batch/v1 CronJobs in TargetNamespace would make updates event-driven.
pkg/tnf/pkg/pacemaker/client_helpers.go (2)
15-32
: Use standard kubeconfig loading rules; HOME concat can be wrong.If InCluster fails and HOME is unset, you’ll build "/.kube/config". Prefer clientcmd’s default loading rules (KUBECONFIG, XDG, HOME) and context overrides.
-func getKubeConfig() (*rest.Config, error) { - // Try in-cluster config first - config, err := rest.InClusterConfig() - if err != nil { - // Fall back to kubeconfig - kubeconfig := os.Getenv(envKubeconfig) - if kubeconfig == "" { - kubeconfig = os.Getenv(envHome) + defaultKubeconfigPath - } - config, err = clientcmd.BuildConfigFromFlags("", kubeconfig) - if err != nil { - return nil, fmt.Errorf("failed to create kubernetes config: %w", err) - } - } - return config, nil -} +func getKubeConfig() (*rest.Config, error) { + if cfg, err := rest.InClusterConfig(); err == nil { + return cfg, nil + } + // Standard default chain: KUBECONFIG, XDG, HOME, etc. + rules := clientcmd.NewDefaultClientConfigLoadingRules() + overrides := &clientcmd.ConfigOverrides{} + cfg, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(rules, overrides).ClientConfig() + if err != nil { + return nil, fmt.Errorf("failed to create kubernetes config: %w", err) + } + return cfg, nil +}
37-56
: Tighten REST client config for CRDs (APIPath, serializer, ParameterCodec).Ensure APIPath is “/apis” and use WithoutConversion codec; add a ParameterCodec if you pass versioned options.
func createPacemakerRESTClient(baseConfig *rest.Config) (rest.Interface, error) { // Set up the scheme for PacemakerStatus CRD scheme := runtime.NewScheme() if err := v1alpha1.AddToScheme(scheme); err != nil { return nil, fmt.Errorf("failed to add PacemakerStatus scheme: %w", err) } // Configure the REST client for the PacemakerStatus CRD pacemakerConfig := rest.CopyConfig(baseConfig) pacemakerConfig.GroupVersion = &v1alpha1.SchemeGroupVersion - pacemakerConfig.APIPath = kubernetesAPIPath - pacemakerConfig.NegotiatedSerializer = serializer.NewCodecFactory(scheme) + pacemakerConfig.APIPath = "/apis" + pacemakerConfig.NegotiatedSerializer = serializer.WithoutConversionCodecFactory{CodecFactory: serializer.NewCodecFactory(scheme)} + // Optional: if using VersionedParams with list/get options: + // pacemakerConfig.ParameterCodec = runtime.NewParameterCodec(scheme) restClient, err := rest.RESTClientFor(pacemakerConfig) if err != nil { return nil, fmt.Errorf("failed to create REST client for PacemakerStatus: %w", err) } return restClient, nil }cmd/tnf-runtime/main.go (2)
3-14
: Prefer io.Discard; io/ioutil is deprecated.Use io.Discard and drop io/ioutil import.
-import ( - goflag "flag" - "fmt" - "io/ioutil" - "math/rand" - "os" - "time" +import ( + goflag "flag" + "fmt" + "io" + "math/rand" + "os" + "time" ) @@ - grpclog.SetLoggerV2(grpclog.NewLoggerV2(ioutil.Discard, os.Stderr, os.Stderr)) + grpclog.SetLoggerV2(grpclog.NewLoggerV2(io.Discard, os.Stderr, os.Stderr))Also applies to: 30-30
53-56
: Exit 0 on help-only root invocation.Printing help is not an error; return nil instead of os.Exit(1).
- Run: func(cmd *cobra.Command, args []string) { - cmd.Help() - os.Exit(1) - }, + RunE: func(cmd *cobra.Command, args []string) error { + _ = cmd.Help() + return nil + },bindata/tnfdeployment/cronjob.yaml (4)
5-5
: Make namespace injectable to match controller config.Hardcoding causes drift; use a placeholder and inject in controller.
- namespace: openshift-etcd + namespace: <injected>
9-12
: Consider startingDeadlineSeconds to avoid missing runs.Helps catch up after controller downtime.
spec: schedule: <injected> + startingDeadlineSeconds: 300 concurrencyPolicy: "Forbid"
36-41
: Privileged + hostPID need rationale and docs.If required for pcs/systemd introspection, add a brief comment in the manifest or controller explaining why and link to SOP. Also add seccompProfile: runtime/default.
securityContext: privileged: true allowPrivilegeEscalation: true + seccompProfile: + type: RuntimeDefault
44-47
: Future-proof control-plane selection.OpenShift uses master label, but many clusters also use control-plane. Consider nodeAffinity with matchExpressions for either label.
pkg/operator/ceohelpers/external_etcd_status.go (1)
167-201
: Solid helper; improve sync error context and log verbosity.Return context error when sync fails and use klog.V levels to reduce noise. Note precondition that the informer must be started.
- klog.Infof("waiting for etcd informer to sync before checking %s...", conditionName) + klog.V(2).Infof("waiting for etcd informer to sync before checking %s...", conditionName) if !cache.WaitForCacheSync(ctx.Done(), etcdInformer.Informer().HasSynced) { - return fmt.Errorf("failed to sync etcd informer") + if err := ctx.Err(); err != nil { + return fmt.Errorf("failed to sync etcd informer before %s: %w", conditionName, err) + } + return fmt.Errorf("failed to sync etcd informer before %s", conditionName) } - klog.Infof("etcd informer synced, checking for %s", conditionName) + klog.V(2).Infof("etcd informer synced, checking for %s", conditionName)Please confirm the caller starts the etcd informer factory prior to invoking this function.
manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (3)
15-19
: Add printer columns for quick triage.Improves kubectl get output.
versions: - name: v1alpha1 served: true storage: true + additionalPrinterColumns: + - name: Updated + type: date + jsonPath: .status.lastUpdated + - name: Nodes + type: string + jsonPath: .status.summary.nodesOnline + description: Online nodes + - name: Resources + type: string + jsonPath: .status.summary.resourcesStarted + description: Started resources + - name: Failures + type: boolean + jsonPath: .status.summary.recentFailures + - name: Fencing + type: boolean + jsonPath: .status.summary.recentFencing
54-82
: Add basic schema invariants.Keep counts sane using CEL validations.
properties: summary: type: object properties: @@ resourcesTotal: type: integer + x-kubernetes-validations: + - rule: "self.nodesOnline <= self.nodesTotal" + message: "nodesOnline cannot exceed nodesTotal" + - rule: "self.resourcesStarted <= self.resourcesTotal" + message: "resourcesStarted cannot exceed resourcesTotal"Also applies to: 100-123, 146-153
33-39
: Spec.nodeName semantics vs singleton design.If the CR is a singleton “cluster”, nodeName misleads. Either drop it or mark as deprecated/ignored in both CRD and types, and update comments accordingly.
pkg/tnf/pkg/pacemaker/statuscollector_test.go (2)
15-20
: Mark helper for clearer failures.Use t.Helper() in loadTestXML.
func loadTestXML(t *testing.T, filename string) string { + t.Helper() path := filepath.Join("testdata", filename) data, err := os.ReadFile(path) require.NoError(t, err, "Failed to read test XML file: %s", filename) return string(data) }
398-403
: Avoid asserting a constant value. Test behavior instead.Mock exec and assert large XML is rejected, or at least assert maxXMLSize >= threshold.
- require.Equal(t, 10*1024*1024, maxXMLSize, "Max XML size should be 10MB") + require.GreaterOrEqual(t, maxXMLSize, 5*1024*1024, "Max XML size should be reasonably large")Or better: inject a fake reader > maxXMLSize into the collector and expect an error.
pkg/tnf/pkg/pacemaker/v1alpha1/types.go (3)
27-31
: Spec.NodeName is unused and conflicts with singleton design.Either remove it or clearly mark as reserved/ignored; align CRD text too.
type PacemakerStatusSpec struct { - // Reserved for future use - // +optional - NodeName string `json:"nodeName,omitempty"` + // Reserved for future use. This CR is status-only; spec fields are ignored. }
13-21
: Add printer columns for UX.Expose key fields in kubectl get.
// +kubebuilder:object:root=true // +kubebuilder:resource:path=pacemakerstatuses,scope=Cluster // +kubebuilder:subresource:status +// +kubebuilder:printcolumn:name="Updated",type="date",JSONPath=".status.lastUpdated" +// +kubebuilder:printcolumn:name="Nodes",type="string",JSONPath=".status.summary.nodesOnline" +// +kubebuilder:printcolumn:name="Res",type="string",JSONPath=".status.summary.resourcesStarted" +// +kubebuilder:printcolumn:name="Fail",type="boolean",JSONPath=".status.summary.recentFailures" +// +kubebuilder:printcolumn:name="Fence",type="boolean",JSONPath=".status.summary.recentFencing"Also applies to: 68-93
68-93
: Optional: indicate XML truncation.Add a flag so UIs know if RawXML was trimmed.
type PacemakerSummary struct { @@ // RecentFencing indicates if there are recent fencing events RecentFencing bool `json:"recentFencing,omitempty"` + + // RawXMLTruncated indicates RawXML was truncated to fit CR size limits + RawXMLTruncated bool `json:"rawXMLTruncated,omitempty"` }Pair with CRD schema and collector updates.
pkg/tnf/pkg/pacemaker/healthcheck_test.go (1)
29-58
: Mark test helpers with t.Helper().Add t.Helper() at the top of createTestHealthCheck and createTestHealthCheckWithMockStatus to improve failure reporting.
func createTestHealthCheck() *HealthCheck { + t.Helper() kubeClient := fake.NewSimpleClientset()
func createTestHealthCheckWithMockStatus(t *testing.T, mockStatus *v1alpha1.PacemakerStatus) *HealthCheck { + t.Helper() kubeClient := fake.NewSimpleClientset()
Also applies to: 61-114
pkg/tnf/operator/starter.go (2)
313-353
: Etcd transition wait: prefer context-aware ticker.The 1m retry loop is fine. For cleaner cancellation and metrics, consider wait.JitterUntilWithContext (k8s util/wait) instead of time.After/select.
- go func() { + go func() { klog.Infof("waiting for etcd to transition to external before starting Pacemaker status collector") - for { - if err := ceohelpers.WaitForEtcdCondition( /* ... */ ); err != nil { - if ctx.Err() != nil { klog.Infof("context done..."); return } - klog.Warningf("... will retry in 1m: %v", err) - select { - case <-time.After(time.Minute): - continue - case <-ctx.Done(): - return - } - } - klog.Infof("etcd has transitioned to external; starting Pacemaker status collector CronJob") - runPacemakerStatusCollectorCronJob(ctx, controllerContext, operatorClient, kubeClient) - return - } + wait.JitterUntilWithContext(ctx, func(ctx context.Context) { + if err := ceohelpers.WaitForEtcdCondition(ctx, etcdInformer, operatorClient, ceohelpers.HasExternalEtcdCompletedTransition, 10*time.Second, 30*time.Minute, "external etcd transition"); err != nil { + klog.Warningf("external etcd not ready yet: %v", err) + return + } + klog.Infof("etcd transitioned; starting Pacemaker status collector CronJob") + runPacemakerStatusCollectorCronJob(ctx, controllerContext, operatorClient, kubeClient) + // stop further retries + panic("done") + }, time.Minute, 0.0, true) }()
355-382
: CronJob safety: nil maps and container index.Defensively initialize labels and verify a container exists before indexing to avoid panics if the asset changes.
func(_ *operatorv1.OperatorSpec, cronJob *batchv1.CronJob) error { // Set the schedule - run every minute cronJob.Spec.Schedule = "* * * * *" // Set the name and labels cronJob.SetName("pacemaker-status-collector") - cronJob.Labels["app.kubernetes.io/name"] = "pacemaker-status-collector" + if cronJob.Labels == nil { + cronJob.Labels = map[string]string{} + } + cronJob.Labels["app.kubernetes.io/name"] = "pacemaker-status-collector" // Configure the container - cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers[0].Image = os.Getenv("OPERATOR_IMAGE") - cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers[0].Command = []string{"tnf-runtime", "monitor"} + c := &cronJob.Spec.JobTemplate.Spec.Template.Spec.Containers + if len(*c) == 0 { + return fmt.Errorf("cronjob template has no containers") + } + (*c)[0].Image = os.Getenv("OPERATOR_IMAGE") + (*c)[0].Command = []string{"tnf-runtime", "monitor"} return nil },pkg/tnf/pkg/pacemaker/statuscollector.go (2)
30-32
: Cap stored RawXML to stay under K8s object size limits.10MiB XML is far above typical 1–1.5MiB object limits. Reduce the accepted input and truncate what’s stored in Status to avoid update failures.
- // maxXMLSize prevents XML bombs (10MB) - maxXMLSize = 10 * 1024 * 1024 + // maxXMLSize bounds input parsing; keep comfortably under object size limits (1MiB). + maxXMLSize = 1 * 1024 * 1024 + // maxStoredXMLSize bounds RawXML persisted into the Status subresource. + maxStoredXMLSize = 256 * 1024- rawXML = stdout + rawXML = stdout- existing.Status.RawXML = rawXML + if len(rawXML) > maxStoredXMLSize { + rawXML = rawXML[:maxStoredXMLSize] + } + existing.Status.RawXML = rawXML- Status: v1alpha1.PacemakerStatusStatus{ + Status: v1alpha1.PacemakerStatusStatus{ LastUpdated: now, - RawXML: rawXML, + RawXML: func(s string) string { + if len(s) > maxStoredXMLSize { return s[:maxStoredXMLSize] } + return s + }(rawXML),Also applies to: 271-286, 500-547
53-55
: Remove unused notFoundError constant.Now that apierrors.IsNotFound is used, this constant is dead.
- // Error check strings - notFoundError = "not found"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (26)
.gitignore
(2 hunks)Dockerfile.ocp
(1 hunks)bindata/tnfdeployment/clusterrole.yaml
(1 hunks)bindata/tnfdeployment/cronjob.yaml
(1 hunks)bindata/tnfdeployment/job.yaml
(1 hunks)cmd/tnf-runtime/main.go
(4 hunks)manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
(1 hunks)pkg/operator/ceohelpers/external_etcd_status.go
(2 hunks)pkg/operator/starter.go
(1 hunks)pkg/tnf/operator/starter.go
(5 hunks)pkg/tnf/operator/starter_test.go
(5 hunks)pkg/tnf/pkg/jobs/batch.go
(1 hunks)pkg/tnf/pkg/jobs/cronjob_controller.go
(1 hunks)pkg/tnf/pkg/pacemaker/client_helpers.go
(1 hunks)pkg/tnf/pkg/pacemaker/healthcheck.go
(1 hunks)pkg/tnf/pkg/pacemaker/healthcheck_test.go
(1 hunks)pkg/tnf/pkg/pacemaker/statuscollector.go
(1 hunks)pkg/tnf/pkg/pacemaker/statuscollector_test.go
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/healthy_cluster.xml
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/offline_node.xml
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/standby_node.xml
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/doc.go
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/register.go
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/types.go
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- pkg/tnf/pkg/pacemaker/testdata/standby_node.xml
🚧 Files skipped from review as they are similar to previous changes (9)
- pkg/tnf/pkg/pacemaker/testdata/offline_node.xml
- pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml
- pkg/tnf/pkg/jobs/batch.go
- Dockerfile.ocp
- bindata/tnfdeployment/clusterrole.yaml
- pkg/tnf/operator/starter_test.go
- pkg/tnf/pkg/pacemaker/v1alpha1/register.go
- pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go
- pkg/tnf/pkg/pacemaker/healthcheck.go
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml
[medium] 1-48: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 1-48: Container should not be privileged
(CKV_K8S_16)
[medium] 1-48: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 1-48: Containers should not share the host process ID namespace
(CKV_K8S_17)
🔇 Additional comments (11)
pkg/operator/starter.go (1)
630-631
: LGTM! Parameter addition follows existing patterns.The addition of
AlivenessChecker
as a parameter toHandleDualReplicaClusters
is consistent with how health checking is wired throughout this file. The package-levelAlivenessChecker
is properly initialized (line 89) and used consistently across multiple controller initializations.pkg/tnf/pkg/pacemaker/v1alpha1/doc.go (1)
1-3
: LGTM.Package doc and group annotation look correct for etcd.openshift.io/v1alpha1.
.gitignore (1)
14-14
: LGTM.Ignore pattern updated to tnf-runtime; payload pattern retained.
Also applies to: 30-30
pkg/tnf/pkg/pacemaker/client_helpers.go (1)
22-25
: All constants are properly defined with expected values.Verification confirms:
envKubeconfig
="KUBECONFIG"
✓envHome
="HOME"
✓defaultKubeconfigPath
="/.kube/config"
✓kubernetesAPIPath
="/apis"
✓All four constants are declared in the pacemaker package and have correct, standard values.
bindata/tnfdeployment/job.yaml (1)
18-18
: Binary rename verified; code is ready.The release image installs
tnf-runtime
to/usr/bin/
as confirmed by Dockerfile.ocp line 13. No references to the old name remain in the codebase, and the manifest correctly references the renamed binary. The change is consistent across all components (Dockerfile, Go code, and YAML manifests).cmd/tnf-runtime/main.go (1)
120-129
: Good fix: reuse collector command for monitor.Avoids double flag parsing and accidental exits; matches prior guidance.
manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1)
47-50
: Nice: cap rawXML length.This guards against status object bloat.
pkg/tnf/pkg/pacemaker/statuscollector_test.go (1)
405-412
: Good: invalid XML now actually errors.pkg/tnf/operator/starter.go (2)
64-76
: No issues found—signature change fully propagated.All call sites for
HandleDualReplicaClusters
have been correctly updated. The production call atpkg/operator/starter.go:619
and the test call atpkg/tnf/operator/starter_test.go:96
both pass the newdynamicClient
andalivenessChecker
parameters.
355-382
: Reconcile 30s cadence vs 1m Cron
The PR claims "every 30 seconds", but the CronJob in pkg/tnf/operator/starter.go sets schedule "* * * * *" (every minute). The monitor subcommand is pacemaker.NewPacemakerStatusCollectorCommand (pkg/tnf/pkg/pacemaker/statuscollector.go, wired via cmd/tnf-runtime/main.go) and runs runCollector() once; there is no interval/loop flag (collectorTimeout = 30s is a per-run timeout, not a cadence).
- Confirm intended cadence: 30s or 1m.
- If 30s required: either (A) keep CronJob at 1m and run monitor twice inside the container with a 30s sleep (quick), or (B) add an interval/duration flag and a loop inside monitor so it can run continuously (preferred); see the sketch after this list.
- If 1m acceptable: update PR text/docs to "every minute".
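A minimal sketch of option (B), assuming a hypothetical --interval flag; the subcommand wiring and runCollector signature are illustrative, not the code in this PR.

package main

import (
	"context"
	"time"

	"github.com/spf13/cobra"
)

// newMonitorCommand wires a hypothetical --interval flag and loops until the
// context is cancelled, running one collection per tick.
func newMonitorCommand(runCollector func(context.Context) error) *cobra.Command {
	var interval time.Duration
	cmd := &cobra.Command{
		Use:   "monitor",
		Short: "Collect pacemaker status on a fixed interval",
		RunE: func(cmd *cobra.Command, args []string) error {
			ctx := cmd.Context()
			ticker := time.NewTicker(interval)
			defer ticker.Stop()
			for {
				// Run one collection per iteration; a log-and-continue policy could also be used.
				if err := runCollector(ctx); err != nil {
					return err
				}
				select {
				case <-ctx.Done():
					return ctx.Err()
				case <-ticker.C:
				}
			}
		},
	}
	cmd.Flags().DurationVar(&interval, "interval", 30*time.Second, "time between collections")
	return cmd
}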
pkg/tnf/pkg/pacemaker/healthcheck_test.go (1)
84-103
: Content-Type and codec are already in sync—no action needed.LegacyCodec always encodes to JSON, and the header is correctly set to
runtime.ContentTypeJSON
. Since this is the only LegacyCodec usage in the test file and the behavior is deterministic, no drift risk exists. The optional refactor to derive the header dynamically would be a nice-to-have for consistency, but it's not required.
// createKubernetesClient creates a Kubernetes clientset using in-cluster config
// or falling back to kubeconfig file.
func createKubernetesClient() (*kubernetes.Clientset, error) {
	config, err := getKubeConfig()
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(config)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Is createKubernetesClient used?
It’s not referenced in this file. If unused across the package, please drop it along with the kubernetes import.
🏁 Script executed:
#!/usr/bin/env bash
# Find all references to createKubernetesClient
rg -n '\bcreateKubernetesClient\b' -g 'pkg/**.go'
Length of output: 148
---
🏁 Script executed:
```shell
#!/usr/bin/env bash
# Search for createKubernetesClient across all Go files without restrictive glob
rg -n '\bcreateKubernetesClient\b' --type=go
Length of output: 233
🏁 Script executed:
#!/usr/bin/env bash
# Also check the directory structure to understand repo layout
fd -t f '\.go$' | head -20
Length of output: 947
🏁 Script executed:
#!/usr/bin/env bash
# Check if kubernetes import is used elsewhere in the statuscollector.go file
rg -n 'kubernetes' pkg/tnf/pkg/pacemaker/statuscollector.go
Length of output: 249
🏁 Script executed:
#!/usr/bin/env bash
# Search for any other potential usages within the pacemaker package
rg -n 'kubernetes\.' pkg/tnf/pkg/pacemaker/ --type=go
Length of output: 350
Remove unused createKubernetesClient function and the kubernetes import
The function is not referenced anywhere in the codebase and is dead code. The kubernetes import on line 13 is used only within this function, so it can also be safely removed.
🤖 Prompt for AI Agents
In pkg/tnf/pkg/pacemaker/statuscollector.go around lines 445 to 453, the
createKubernetesClient function is unused dead code; remove the entire function
and also delete the now-unused "kubernetes" import on line 13. After removing
both, run go build / go vet to ensure no remaining references and tidy imports
(goimports/gofmt or go mod tidy) to clean up any leftover unused imports.
8e58a60 to 29b5b15 (Compare)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/tnf/operator/starter.go (1)
413-418
: Handle tombstones on Secret delete eventsDelete callbacks may provide cache.DeletedFinalStateUnknown; current cast to *corev1.Secret can fail and skip updates.
func handleFencingSecretChange(ctx context.Context, client kubernetes.Interface, oldObj, obj interface{}) { - secret, ok := obj.(*corev1.Secret) - if !ok { - klog.Warningf("failed to convert added / modified / deleted object to Secret %+v", obj) - return - } + var secret *corev1.Secret + switch t := obj.(type) { + case *corev1.Secret: + secret = t + case cache.DeletedFinalStateUnknown: + if s, ok := t.Obj.(*corev1.Secret); ok { + secret = s + } else { + klog.Warningf("tombstone contained object that is not a Secret: %+v", t.Obj) + return + } + default: + klog.Warningf("unexpected object type for Secret event: %T", obj) + return + }
♻️ Duplicate comments (5)
manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1)
54-82
: Add a truncation indicator for rawXMLIf the collector trims rawXML to fit the cap, surface it in status.summary to avoid confusing consumers.
summary: description: Summary provides high-level counts and flags for the cluster state type: object properties: + rawXMLTruncated: + description: RawXMLTruncated is true when rawXML was truncated to fit size limits + type: boolean pacemakerdState:pkg/tnf/pkg/pacemaker/statuscollector.go (2)
54-54
: Remove unusednotFoundError
constant.This constant was replaced by
apierrors.IsNotFound(err)
at line 491 and is no longer referenced.Apply this diff:
- // Error check strings - notFoundError = "not found" -
445-453
: Remove unusedcreateKubernetesClient
function and thekubernetes
import.The function is not referenced anywhere in the codebase and is dead code. The
kubernetes
import on line 13 is used only within this function, so it can also be safely removed.Apply this diff:
-import ( - "context" - "encoding/xml" - "fmt" - "os" - "time" - - "github.com/spf13/cobra" - apierrors "k8s.io/apimachinery/pkg/api/errors" - metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" - "k8s.io/client-go/kubernetes" - "k8s.io/klog/v2" - - "github.com/openshift/cluster-etcd-operator/pkg/tnf/pkg/exec" - "github.com/openshift/cluster-etcd-operator/pkg/tnf/pkg/pacemaker/v1alpha1" -) +import ( + "context" + "encoding/xml" + "fmt" + "os" + "time" + + "github.com/spf13/cobra" + apierrors "k8s.io/apimachinery/pkg/api/errors" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/klog/v2" + + "github.com/openshift/cluster-etcd-operator/pkg/tnf/pkg/exec" + "github.com/openshift/cluster-etcd-operator/pkg/tnf/pkg/pacemaker/v1alpha1" +)And remove the function:
-// createKubernetesClient creates a Kubernetes clientset using in-cluster config -// or falling back to kubeconfig file. -func createKubernetesClient() (*kubernetes.Clientset, error) { - config, err := getKubeConfig() - if err != nil { - return nil, err - } - return kubernetes.NewForConfig(config) -} -pkg/tnf/pkg/pacemaker/healthcheck.go (2)
247-259
: Unknown health status with errors should trigger degradation.Lines 248, 253, and 258 return Unknown status with errors (CR fetch failure, collection error, stale status), but
updateOperatorStatus
(lines 419-427) does not mark degraded for Unknown status. This hides real outages when the CR is unavailable or stale.The issue persists from the previous review. Apply this fix in
updateOperatorStatus
:func (c *HealthCheck) updateOperatorStatus(ctx context.Context, status *HealthStatus) error { klog.V(4).Infof("Updating operator availability based on pacemaker status: %s", status.OverallStatus) // Log warnings and errors c.logStatusMessages(status) // Update operator conditions based on pacemaker status switch status.OverallStatus { case statusError: return c.setPacemakerDegradedCondition(ctx, status) case statusHealthy: return c.clearPacemakerDegradedCondition(ctx) + case statusUnknown: + // Unknown with errors (stale/unavailable CR) should degrade + if len(status.Errors) > 0 { + return c.setPacemakerDegradedCondition(ctx, status) + } + return nil default: - // For Warning or Unknown status, we don't update the degraded condition + // For Warning status, we don't update the degraded condition return nil } }
337-372
: Fix false-positive resource errors by comparing against online node count.When a node is offline (already flagged as an error), resource validation still demands resources on
expectedNodeCount
(2) nodes and reports an additional error. This amplifies the same issue.Compare started-resource counts to the number of online nodes instead:
func (c *HealthCheck) checkResourceStatus(pacemakerStatus *v1alpha1.PacemakerStatus, status *HealthStatus) { resources := pacemakerStatus.Status.Resources + + // Count online nodes to avoid duplicate errors when nodes are offline + onlineCount := 0 + for _, n := range pacemakerStatus.Status.Nodes { + if n.Online { + onlineCount++ + } + } resourcesStarted := make(map[string]map[string]bool) resourcesStarted[resourceKubelet] = make(map[string]bool) resourcesStarted[resourceEtcd] = make(map[string]bool) for _, resource := range resources { // A resource is considered "started" if it has Role="Started" and Active=true if resource.Role == roleStarted && resource.Active && resource.Node != "" { // Check if this is a kubelet or etcd resource switch { case strings.HasPrefix(resource.ResourceAgent, resourceAgentKubelet): resourcesStarted[resourceKubelet][resource.Node] = true case strings.HasPrefix(resource.ResourceAgent, resourceAgentEtcd): resourcesStarted[resourceEtcd][resource.Node] = true } } } // Check if we have both resources started on all nodes kubeletCount := len(resourcesStarted[resourceKubelet]) etcdCount := len(resourcesStarted[resourceEtcd]) - c.validateResourceCount(resourceKubelet, kubeletCount, status) - c.validateResourceCount(resourceEtcd, etcdCount, status) + c.validateResourceCount(resourceKubelet, kubeletCount, onlineCount, status) + c.validateResourceCount(resourceEtcd, etcdCount, onlineCount, status) } // validateResourceCount validates that a resource is started on the expected number of nodes -func (c *HealthCheck) validateResourceCount(resourceName string, actualCount int, status *HealthStatus) { - if actualCount < expectedNodeCount { +func (c *HealthCheck) validateResourceCount(resourceName string, actualCount, expectedCount int, status *HealthStatus) { + if expectedCount == 0 { + return + } + if actualCount < expectedCount { status.Errors = append(status.Errors, fmt.Sprintf(msgResourceNotStarted, - resourceName, actualCount, expectedNodeCount)) + resourceName, actualCount, expectedCount)) } }
🧹 Nitpick comments (5)
pkg/operator/ceohelpers/external_etcd_status.go (1)
166-201
: Also sync the operator client’s informer; improve error and verbosityOnly syncing the Etcd informer doesn’t ensure operatorClient.GetStaticPodOperatorState() has data. Sync both caches (and return ctx error) to avoid flapping early reads; reduce log chattiness.
func WaitForEtcdCondition( ctx context.Context, etcdInformer operatorv1informers.EtcdInformer, operatorClient v1helpers.StaticPodOperatorClient, @@ ) error { - // Wait for the etcd informer to sync before checking condition - // This ensures operatorClient.GetStaticPodOperatorState() has data to work with - klog.Infof("waiting for etcd informer to sync before checking %s...", conditionName) - if !cache.WaitForCacheSync(ctx.Done(), etcdInformer.Informer().HasSynced) { - return fmt.Errorf("failed to sync etcd informer") - } - klog.Infof("etcd informer synced, checking for %s", conditionName) + // Wait for informers to sync before checking condition + klog.V(2).Infof("waiting for informers to sync before checking %s...", conditionName) + if !cache.WaitForCacheSync(ctx.Done(), + etcdInformer.Informer().HasSynced, + operatorClient.Informer().HasSynced, + ) { + return fmt.Errorf("failed to sync informers: %w", ctx.Err()) + } + klog.V(2).Infof("informers synced, checking for %s", conditionName) // Poll until the condition is met return wait.PollUntilContextTimeout(ctx, pollInterval, timeout, true, func(ctx context.Context) (bool, error) { @@ - if conditionMet { - klog.V(2).Infof("%s condition met", conditionName) + if conditionMet { + klog.V(2).Infof("%s condition met", conditionName) return true, nil } klog.V(4).Infof("%s condition not yet met, waiting...", conditionName) return false, nil }) }manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1)
124-176
: Constrain history list lengths to stay under the 1 MiB object budgetUnbounded nodeHistory/fencingHistory can overflow object size. Cap items; keep newest N entries in the collector.
nodeHistory: description: NodeHistory provides recent operation history for troubleshooting type: array + maxItems: 512 fencingHistory: description: FencingHistory provides recent fencing events type: array + maxItems: 128pkg/tnf/operator/starter.go (2)
366-377
: Prevent overlapping runs and limit history for the CronJobDefault CronJob concurrency is Allow; if a run exceeds 1m, jobs overlap. For a lightweight monitor, forbid overlap and keep small histories.
func(_ *operatorv1.OperatorSpec, cronJob *batchv1.CronJob) error { // Set the schedule - run every minute cronJob.Spec.Schedule = "* * * * *" + // Avoid overlapping runs and control history size + policy := batchv1.ForbidConcurrent + cronJob.Spec.ConcurrencyPolicy = policy + one := int32(1) + zero := int32(0) + cronJob.Spec.SuccessfulJobsHistoryLimit = &one + cronJob.Spec.FailedJobsHistoryLimit = &one + // Optional: ensure the job doesn't linger if the node is slow + // deadline := int64(55) + // cronJob.Spec.StartingDeadlineSeconds = &deadline
229-237
: Bootstrap wait: consider indefinite retry (like external transition path)You cap bootstrap wait at 30m, but external transition waits indefinitely with backoff. If bootstrap finishes after 30m, TNF job controllers never start. Consider the same retry-until-ctx-done loop here.
Can you confirm bootstrap always completes within 30 minutes in supported environments?
pkg/tnf/pkg/jobs/cronjob_controller.go (1)
180-213
: Reconcile more fields to avoid drift (policy/history)Extend needsUpdate to cover ConcurrencyPolicy and history limits so hook changes take effect.
func (c *CronJobController) needsUpdate(existing, desired *batchv1.CronJob) bool { // Check if schedule has changed if existing.Spec.Schedule != desired.Spec.Schedule { @@ } + // Concurrency policy and history limits + if existing.Spec.ConcurrencyPolicy != desired.Spec.ConcurrencyPolicy { + klog.V(4).Infof("ConcurrencyPolicy changed from %q to %q", existing.Spec.ConcurrencyPolicy, desired.Spec.ConcurrencyPolicy) + return true + } + if (existing.Spec.SuccessfulJobsHistoryLimit == nil) != (desired.Spec.SuccessfulJobsHistoryLimit == nil) || + (existing.Spec.SuccessfulJobsHistoryLimit != nil && desired.Spec.SuccessfulJobsHistoryLimit != nil && + *existing.Spec.SuccessfulJobsHistoryLimit != *desired.Spec.SuccessfulJobsHistoryLimit) { + klog.V(4).Info("SuccessfulJobsHistoryLimit changed") + return true + } + if (existing.Spec.FailedJobsHistoryLimit == nil) != (desired.Spec.FailedJobsHistoryLimit == nil) || + (existing.Spec.FailedJobsHistoryLimit != nil && desired.Spec.FailedJobsHistoryLimit != nil && + *existing.Spec.FailedJobsHistoryLimit != *desired.Spec.FailedJobsHistoryLimit) { + klog.V(4).Info("FailedJobsHistoryLimit changed") + return true + } + // Check if the job template has changed (containers, command, image, etc.) if len(existing.Spec.JobTemplate.Spec.Template.Spec.Containers) > 0 &&
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (26)
.gitignore
(2 hunks)Dockerfile.ocp
(1 hunks)bindata/tnfdeployment/clusterrole.yaml
(1 hunks)bindata/tnfdeployment/cronjob.yaml
(1 hunks)bindata/tnfdeployment/job.yaml
(1 hunks)cmd/tnf-runtime/main.go
(4 hunks)manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
(1 hunks)pkg/operator/ceohelpers/external_etcd_status.go
(2 hunks)pkg/operator/starter.go
(1 hunks)pkg/tnf/operator/starter.go
(5 hunks)pkg/tnf/operator/starter_test.go
(5 hunks)pkg/tnf/pkg/jobs/batch.go
(1 hunks)pkg/tnf/pkg/jobs/cronjob_controller.go
(1 hunks)pkg/tnf/pkg/pacemaker/client_helpers.go
(1 hunks)pkg/tnf/pkg/pacemaker/healthcheck.go
(1 hunks)pkg/tnf/pkg/pacemaker/healthcheck_test.go
(1 hunks)pkg/tnf/pkg/pacemaker/statuscollector.go
(1 hunks)pkg/tnf/pkg/pacemaker/statuscollector_test.go
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/healthy_cluster.xml
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/offline_node.xml
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/standby_node.xml
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/doc.go
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/register.go
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/types.go
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
- pkg/tnf/pkg/jobs/batch.go
- pkg/tnf/pkg/pacemaker/testdata/standby_node.xml
- .gitignore
- pkg/tnf/pkg/pacemaker/testdata/healthy_cluster.xml
- pkg/tnf/pkg/pacemaker/testdata/offline_node.xml
- pkg/tnf/pkg/pacemaker/v1alpha1/types.go
- pkg/tnf/pkg/pacemaker/v1alpha1/doc.go
- bindata/tnfdeployment/clusterrole.yaml
- pkg/tnf/pkg/pacemaker/healthcheck_test.go
- Dockerfile.ocp
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml
[medium] 1-48: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 1-48: Container should not be privileged
(CKV_K8S_16)
[medium] 1-48: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 1-48: Containers should not share the host process ID namespace
(CKV_K8S_17)
🔇 Additional comments (20)
bindata/tnfdeployment/job.yaml (1)
18-18
: LGTM! Binary rename is consistent.The command array correctly references the renamed
tnf-runtime
binary, aligning with changes inDockerfile.ocp
andcmd/tnf-runtime/main.go
.pkg/operator/starter.go (1)
630-631
: LGTM! Parameters correctly passed.The
dynamicClient
andAlivenessChecker
parameters are correctly added to theHandleDualReplicaClusters
call, matching the updated function signature for Pacemaker health check integration.pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml (1)
1-66
: LGTM! Well-structured test data template.The template correctly uses placeholders (
{{RECENT_TIMESTAMP}}
and{{RECENT_TIMESTAMP_ISO}}
) for dynamic timestamp injection, enabling effective testing of time-window filtering for recent failures and fencing events.cmd/tnf-runtime/main.go (3)
21-21
: LGTM! Runtime command properly renamed.The function rename from
NewTnfSetupRunnerCommand
toNewTnfRuntimeCommand
and updated command metadata are consistent with the broader binary rename in the PR.Also applies to: 41-41, 48-52
63-63
: LGTM! Monitor subcommand properly registered.The monitor subcommand is correctly added to the root command structure.
120-129
: LGTM! Monitor command correctly implemented.The implementation properly addresses past review feedback by reusing
pacemaker.NewPacemakerStatusCollectorCommand()
directly instead of callingExecute()
insideRun
. The metadata overrides and silence flags are appropriate.bindata/tnfdeployment/cronjob.yaml (2)
1-12
: LGTM! CronJob configuration is appropriate.The CronJob spec with
concurrencyPolicy: Forbid
is correct for status collection, preventing overlapping runs and ensuring consistent state snapshots.
26-48
: Security context is appropriate for Pacemaker operations.The privileged security context and
hostPID: true
are necessary for the collector to executepcs status
commands and access Pacemaker cluster state. Static analysis warnings are expected false positives for this infrastructure component. Ensure thetnf-setup-manager
service account has appropriate RBAC permissions for PacemakerStatus CRs (as indicated inbindata/tnfdeployment/clusterrole.yaml
).pkg/tnf/operator/starter_test.go (2)
34-34
: LGTM! Test dependencies properly added.The new imports and struct field additions correctly support the updated
HandleDualReplicaClusters
signature with aliveness checker and dynamic client parameters.Also applies to: 40-40, 54-54
95-95
: LGTM! Test properly wired and past feedback addressed.The test correctly instantiates the aliveness checker and passes all required parameters to
HandleDualReplicaClusters
. The use of the scopedctx
(Line 266) instead ofcontext.Background()
properly addresses past review feedback, preventing goroutine leaks across tests.Also applies to: 107-108, 259-269
pkg/tnf/pkg/pacemaker/client_helpers.go (2)
37-56
: LGTM! REST client creation follows best practices.The
createPacemakerRESTClient
function correctly configures a REST client for PacemakerStatus CRDs, including scheme registration, GroupVersion setup, and codec factory initialization. Error handling is comprehensive.
17-32
: Constants are properly defined and accessible.Verification confirms all constants used in the
getKubeConfig
function are defined in the same package:
envKubeconfig
,envHome
,defaultKubeconfigPath
are defined instatuscollector.go
(lines 57-59)kubernetesAPIPath
is defined inhealthcheck.go
(line 79)Since all files are in the same package (
pkg/tnf/pkg/pacemaker/
), the constants are properly accessible. The code is correct.pkg/tnf/pkg/pacemaker/statuscollector_test.go (5)
15-37
: LGTM! Excellent test helper design.The
loadTestXML
andgetRecentFailuresXML
helpers are well-designed. Using dynamic timestamps with template substitution enables robust testing of time-window filtering without test flakiness.
39-82
: LGTM! Comprehensive healthy cluster test.The test thoroughly validates all components of a healthy cluster including pacemaker state, quorum, node counts, resource presence, and absence of failures/fencing events.
84-165
: LGTM! Thorough coverage of degraded states.The tests effectively validate offline node detection, standby node handling, and recent failures/fencing detection with appropriate assertions on all relevant status fields.
167-355
: LGTM! Excellent edge case and variation coverage.The time-window filtering test precisely validates the 5-minute and 24-hour windows with explicit time calculations. Table-driven tests for quorum and node status variations provide comprehensive coverage of state combinations.
398-487
: LGTM! Validation and resource agent tests are thorough.The invalid XML test correctly uses malformed XML to trigger
xml.Unmarshal
errors (addressing past feedback). The resource agent type test ensures proper handling of different agent formats (systemd, ocf:heartbeat).manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1)
47-51
: Good: bound rawXML sizeAdding maxLength on status.rawXML guards object bloat and update failures. LGTM.
pkg/tnf/pkg/pacemaker/v1alpha1/register.go (1)
1-31
: Scheme registration looks correctGroup/version constants, SchemeGroupVersion, AddToScheme, and known types are wired as expected.
pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)
1-211
: LGTM! Autogenerated deepcopy implementations are correct.The deepcopy methods properly handle nil checks, nested struct copies, slice allocations, and metav1.Time deep copies. This is standard code-generator output.
@jaypoulz: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
The goal of this change was to ensure that the CEO permissions for the pcs status cronjob were elevated enough to be able to gather pcs status output. One future improvement would be to move this out of the CEO entirely and have the status gathering be done by an alert agent that posts regular updates to the CEO. To alleviate the complexity of a future transition to this model, I've structured the code so that the healthcheck controller doesn't care where the posts to the PacemakerStatus CRD come from - it just monitors it for updates. In the future, we can write an alert agent that just posts new PacemakerStatus updates as they happen, and the CEO can retire the cronjob entirely.
29b5b15 to 6114770 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (1)
pkg/tnf/pkg/pacemaker/statuscollector.go (1)
54-54
: Remove unused constant.The
notFoundError
constant is unused after refactoring to useapierrors.IsNotFound(err)
at line 481. This constant should have been removed when addressing the past review comment about error handling.Apply this diff:
- // Error check strings - notFoundError = "not found" -
🧹 Nitpick comments (8)
pkg/tnf/pkg/pacemaker/client_helpers.go (1)
22-25
: Consider safer path construction.The kubeconfig path construction on line 24 uses direct string concatenation. If
HOME
is empty, this results in"/.kube/config"
. WhileBuildConfigFromFlags
will handle the error, usingfilepath.Join
would be more idiomatic and safer.Apply this diff for improved clarity:
+ "path/filepath" + kubeconfig := os.Getenv(envKubeconfig) if kubeconfig == "" { - kubeconfig = os.Getenv(envHome) + defaultKubeconfigPath + home := os.Getenv(envHome) + if home != "" { + kubeconfig = filepath.Join(home, defaultKubeconfigPath) + } else { + kubeconfig = defaultKubeconfigPath + } }bindata/tnfdeployment/cronjob.yaml (1)
32-35
: Consider adding resource limits.While resource requests are defined (10m CPU, 32Mi memory), no limits are set. For a CronJob that runs periodically on master nodes, adding limits helps prevent resource exhaustion.
Apply this diff to add resource limits:
resources: requests: cpu: 10m memory: 32Mi + limits: + cpu: 50m + memory: 64Mipkg/tnf/pkg/pacemaker/statuscollector_test.go (2)
15-20
: Mark helpers with t.Helper() for clearer failure locations.This improves test diagnostics by attributing assertion failures to the caller.
func loadTestXML(t *testing.T, filename string) string { + t.Helper() path := filepath.Join("testdata", filename) data, err := os.ReadFile(path) require.NoError(t, err, "Failed to read test XML file: %s", filename) return string(data) } // Helper function to generate recent failures XML with dynamic timestamps func getRecentFailuresXML(t *testing.T) string { + t.Helper() // Load the template from testdata templateXML := loadTestXML(t, "recent_failures_template.xml")Also applies to: 23-37
167-174
: Stabilize time-based tests to avoid flake near window edges.Consider injecting a clock (now func) into buildStatusComponents or using fixed, test-controlled “now” to make window filtering tests deterministic.
pkg/tnf/operator/starter.go (1)
371-388
: Harden CronJob: forbid overlap, bound histories, add starting deadline.Prevents piling jobs on slow clusters and curbs history growth.
// Set the schedule - run every minute cronJob.Spec.Schedule = "* * * * *" + // Avoid overlapping runs and bound job histories + forbid := batchv1.ForbidConcurrent + cronJob.Spec.ConcurrencyPolicy = forbid + successful := int32(1) + failed := int32(3) + cronJob.Spec.SuccessfulJobsHistoryLimit = &successful + cronJob.Spec.FailedJobsHistoryLimit = &failed + // If a start is missed (e.g., controller restart), try within 2m + sds := int64(120) + cronJob.Spec.StartingDeadlineSeconds = &sds + // Set labels at all levels cronJob.Labels["app.kubernetes.io/name"] = pacemakerStatusCollectorName cronJob.Spec.JobTemplate.Labels["app.kubernetes.io/name"] = pacemakerStatusCollectorName cronJob.Spec.JobTemplate.Spec.Template.Labels["app.kubernetes.io/name"] = pacemakerStatusCollectorNamemanifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (1)
54-82
: Expose a truncation flag for rawXML to pair with the 256 KiB cap.Helps operators understand when raw XML was clipped to satisfy CRD limits.
summary: description: Summary provides high-level counts and flags for the cluster state type: object properties: + rawXMLTruncated: + description: RawXMLTruncated indicates RawXML was truncated to fit CRD limits + type: boolean pacemakerdState: description: PacemakerdState indicates if pacemaker is running type: stringpkg/tnf/pkg/jobs/cronjob_controller.go (1)
180-213
: Broaden needsUpdate to cover concurrency policy and history limits.Prevents configuration drift when hooks adjust these fields.
func (c *CronJobController) needsUpdate(existing, desired *batchv1.CronJob) bool { // Check if schedule has changed if existing.Spec.Schedule != desired.Spec.Schedule { klog.V(4).Infof("Schedule changed from %q to %q", existing.Spec.Schedule, desired.Spec.Schedule) return true } + + // Check concurrency policy + if existing.Spec.ConcurrencyPolicy != desired.Spec.ConcurrencyPolicy { + klog.V(4).Infof("ConcurrencyPolicy changed from %q to %q", existing.Spec.ConcurrencyPolicy, desired.Spec.ConcurrencyPolicy) + return true + } + // Check history limits (nil-safe) + get := func(p *int32) int32 { if p == nil { return -1 }; return *p } + if get(existing.Spec.SuccessfulJobsHistoryLimit) != get(desired.Spec.SuccessfulJobsHistoryLimit) { + klog.V(4).Infof("SuccessfulJobsHistoryLimit changed") + return true + } + if get(existing.Spec.FailedJobsHistoryLimit) != get(desired.Spec.FailedJobsHistoryLimit) { + klog.V(4).Infof("FailedJobsHistoryLimit changed") + return true + } // Check if the job template has changed (containers, command, image, etc.)pkg/tnf/pkg/pacemaker/healthcheck.go (1)
154-163
: Consider returning errors instead of panicking.
While panicking on initialization failures (lines 156, 162) is acceptable for unrecoverable errors during operator startup, returning errors would provide better error propagation and testing flexibility.
This is a minor improvement and the current approach is acceptable for operator initialization code.
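For illustration, the error-returning shape could look like the sketch below; the type and field names are placeholders, since the real HealthCheck constructor takes different arguments.

package pacemaker

import (
	"fmt"

	"k8s.io/client-go/rest"
)

// healthCheckDeps stands in for whatever the real HealthCheck struct holds.
type healthCheckDeps struct {
	pacemakerClient rest.Interface
}

// newHealthCheckDeps returns an error instead of panicking, so callers and
// tests can decide how to handle initialization failures.
func newHealthCheckDeps(cfg *rest.Config) (*healthCheckDeps, error) {
	restClient, err := createPacemakerRESTClient(cfg) // helper added in this PR
	if err != nil {
		return nil, fmt.Errorf("failed to create pacemaker REST client: %w", err)
	}
	return &healthCheckDeps{pacemakerClient: restClient}, nil
}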
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (26)
.gitignore
(2 hunks)Dockerfile.ocp
(1 hunks)bindata/tnfdeployment/clusterrole.yaml
(1 hunks)bindata/tnfdeployment/cronjob.yaml
(1 hunks)bindata/tnfdeployment/job.yaml
(1 hunks)cmd/tnf-runtime/main.go
(4 hunks)manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml
(1 hunks)pkg/operator/ceohelpers/external_etcd_status.go
(2 hunks)pkg/operator/starter.go
(1 hunks)pkg/tnf/operator/starter.go
(5 hunks)pkg/tnf/operator/starter_test.go
(5 hunks)pkg/tnf/pkg/jobs/batch.go
(1 hunks)pkg/tnf/pkg/jobs/cronjob_controller.go
(1 hunks)pkg/tnf/pkg/pacemaker/client_helpers.go
(1 hunks)pkg/tnf/pkg/pacemaker/healthcheck.go
(1 hunks)pkg/tnf/pkg/pacemaker/healthcheck_test.go
(1 hunks)pkg/tnf/pkg/pacemaker/statuscollector.go
(1 hunks)pkg/tnf/pkg/pacemaker/statuscollector_test.go
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/healthy_cluster.xml
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/offline_node.xml
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml
(1 hunks)pkg/tnf/pkg/pacemaker/testdata/standby_node.xml
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/doc.go
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/register.go
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/types.go
(1 hunks)pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- pkg/tnf/pkg/pacemaker/testdata/offline_node.xml
🚧 Files skipped from review as they are similar to previous changes (9)
- pkg/tnf/pkg/pacemaker/testdata/standby_node.xml
- pkg/tnf/pkg/pacemaker/testdata/healthy_cluster.xml
- pkg/operator/starter.go
- bindata/tnfdeployment/clusterrole.yaml
- pkg/tnf/pkg/pacemaker/healthcheck_test.go
- cmd/tnf-runtime/main.go
- pkg/tnf/pkg/pacemaker/v1alpha1/types.go
- pkg/tnf/operator/starter_test.go
- Dockerfile.ocp
🧰 Additional context used
🪛 Checkov (3.2.334)
bindata/tnfdeployment/cronjob.yaml
[medium] 1-48: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 1-48: Container should not be privileged
(CKV_K8S_16)
[medium] 1-48: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 1-48: Containers should not share the host process ID namespace
(CKV_K8S_17)
🔇 Additional comments (23)
pkg/tnf/pkg/pacemaker/client_helpers.go (1)
37-56
: LGTM!The REST client creation follows standard Kubernetes client-go patterns with proper scheme registration, config copying, and error handling.
pkg/operator/ceohelpers/external_etcd_status.go (1)
167-201
: LGTM!The
WaitForEtcdCondition
helper follows best practices for waiting on Kubernetes resources: it syncs the informer cache first, then polls with proper context cancellation, timeout handling, and clear logging. The error retry logic correctly returns(false, nil)
to continue polling on transient errors.bindata/tnfdeployment/job.yaml (1)
18-18
: LGTM!Binary name update from
tnf-setup-runner
totnf-runtime
aligns with the broader runtime refactoring in this PR..gitignore (1)
14-14
: LGTM!Gitignore update correctly reflects the binary rename from
tnf-setup-runner
totnf-runtime
.pkg/tnf/pkg/pacemaker/v1alpha1/doc.go (1)
1-3
: LGTM!Standard Kubernetes API group documentation with correct
+groupName
annotation for code generation.pkg/tnf/pkg/jobs/batch.go (1)
42-48
: LGTM!The
ReadCronJobV1OrDie
function correctly mirrors the existingReadJobV1OrDie
pattern, maintaining consistency with the codebase's "OrDie" convention for manifest decoding.pkg/tnf/pkg/pacemaker/testdata/recent_failures_template.xml (1)
1-66
: LGTM!Test data correctly models Pacemaker status XML with template placeholders for timestamp injection, supporting recent-failures test scenarios.
bindata/tnfdeployment/cronjob.yaml (4)
5-5
: Past issue resolved.The namespace placeholder issue raised in past review comments has been addressed. The manifest now uses
<injected>
for the namespace.
31-31
: Past issue resolved.The command placeholder now correctly uses array syntax
["<injected>"]
, addressing the type safety concern from past reviews.
36-41
: Security context is appropriate for Pacemaker operations.Static analysis tools flag the privileged security context and
hostPID: true
as security issues. However, these settings are required for this CronJob to executepcs status
commands and interact with the Pacemaker cluster on the host. ThepriorityClassName: system-node-critical
and master node selector further indicate this is a system-level component requiring elevated privileges.
43-43
: RBAC permissions for tnf-setup-manager are properly configured.Verification confirms the ServiceAccount has all necessary permissions:
- PacemakerStatus CRs: ClusterRole grants
create
,update
,patch
on bothpacemakerstatuses
andpacemakerstatuses/status
resources- Privileged operations on master nodes: Pod runs with
allowPrivilegeEscalation: true
,hostPID: true
, and explicitly targets master nodes via nodeSelector; ClusterRole includesuse
permission on theprivileged
SecurityContextConstraint- Etcd resources: ClusterRole permits
get
,list
,watch
,patch
,update
on bothetcds
andetcds/status
resourcesNo additional RBAC configuration is required.
pkg/tnf/pkg/pacemaker/v1alpha1/register.go (1)
1-31
: LGTM — standard scheme wiring.Group/Version, SchemeBuilder, AddToScheme, and known types are registered correctly.
manifests/0000_25_etcd-operator_01_pacemakerstatus.crd.yaml (2)
123-152
: No issues found — RFC3339 serialization is correctly implemented.The code properly handles the conversion from pacemaker's XML timestamp format ("Mon Jan 2 15:04:05 2006") to RFC3339:
- Pacemaker XML timestamps are parsed using the correct format constant (line 46 of statuscollector.go)
- Parsed times are wrapped with
metav1.NewTime()
to createmetav1.Time
instances (line 405)- Go structs use
metav1.Time
for all timestamp fields (types.go lines 36, 150, 166)- The CRD YAML correctly specifies
type: string
withformat: date-time
(RFC3339-compatible)metav1.Time
automatically serializes to RFC3339 format in JSON output
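A tiny, self-contained illustration of that serialization path (the timestamp literal is arbitrary):

package main

import (
	"encoding/json"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Parse pacemaker's textual timestamp format, then wrap it in metav1.Time.
	const pacemakerLayout = "Mon Jan 2 15:04:05 2006"
	t, err := time.Parse(pacemakerLayout, "Mon Jan 2 15:04:05 2006")
	if err != nil {
		panic(err)
	}
	wrapped := metav1.NewTime(t)
	// metav1.Time marshals to RFC3339, which is what the CRD's date-time format expects.
	out, _ := json.Marshal(wrapped)
	fmt.Println(string(out)) // "2006-01-02T15:04:05Z"
}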
100-123
: Ignore this review comment—it conflates different architectural layers.The XML parser (
statuscollector.go
) and Kubernetes API type (v1alpha1/types.go
) are separate concerns:
- XML parsing layer uses
ID
field from Pacemaker's XML output (xml:"id,attr"
)- Kubernetes API layer exposes
Name
field in the CRD schema (json:"name"
)The
ResourceStatus
struct correctly maps to the CRD schema (both usename
as the required field). The XMLID
is an internal parsing detail, not exposed to the API. No validation errors will occur.Likely an incorrect or invalid review comment.
pkg/tnf/pkg/pacemaker/v1alpha1/zz_generated.deepcopy.go (1)

1-211: LGTM! Autogenerated deepcopy implementations.

This file is autogenerated by `deepcopy-gen` and follows standard Kubernetes patterns for deep-copy implementations. No manual review is required.

pkg/tnf/pkg/pacemaker/statuscollector.go (4)

64-197: XML structures look well-designed.

The XML unmarshaling structures comprehensively cover the pacemaker status output and follow Go XML conventions.

236-286: Excellent error handling and security practices.

The XML size validation (lines 265-269) protects against XML bombs, and preserving raw XML on parse failure (line 279) aids debugging. The execution timeout prevents hangs.

288-443: Time-windowed history filtering looks correct.

The 5-minute window for failed actions (lines 368-409) and the 24-hour window for fencing events (lines 413-440) provide appropriate recent-event detection with proper timestamp parsing and error handling.
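To make the windowing concrete, here is a condensed sketch of the check. The constants, function name, and sample timestamps are assumptions for illustration; the real collector parses timestamps out of the pcs status XML it already collects rather than raw strings.

```go
package main

import (
	"fmt"
	"time"
)

const (
	failureWindow = 5 * time.Minute
	fencingWindow = 24 * time.Hour
	// pcsTimeLayout is the textual timestamp form pcs uses in history output.
	pcsTimeLayout = "Mon Jan 2 15:04:05 2006"
)

// isRecent reports whether a pcs history timestamp falls within the window before now.
// Unparseable entries are treated as not recent rather than failing the whole check.
func isRecent(raw string, window time.Duration, now time.Time) bool {
	ts, err := time.Parse(pcsTimeLayout, raw)
	if err != nil {
		return false
	}
	return ts.After(now.Add(-window))
}

func main() {
	now := time.Date(2025, 10, 13, 14, 10, 0, 0, time.UTC)
	fmt.Println(isRecent("Mon Oct 13 14:08:11 2025", failureWindow, now)) // true: failed action two minutes ago
	fmt.Println(isRecent("Mon Oct 13 02:00:00 2025", fencingWindow, now)) // true: fencing event ~12 hours ago
}
```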
448-551: CR creation and update logic is correct.

The use of `apierrors.IsNotFound` (line 481) and the POST+PUT sequence to initialize the status subresource (lines 502-520) properly address the past review feedback. The update path (lines 528-550) correctly uses the status subresource.
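For reference, the create-then-update-status pattern described here typically looks like the sketch below. The dynamic client, the group/version/resource values, and the cluster-scoped access are assumptions for illustration, not the PR's actual wiring.

```go
package pacemaker

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// ensureStatus creates the CR when it does not exist yet and then writes the
// status via the status subresource, mirroring the POST+PUT sequence noted above.
func ensureStatus(ctx context.Context, client dynamic.Interface, desired *unstructured.Unstructured) error {
	gvr := schema.GroupVersionResource{Group: "etcd.openshift.io", Version: "v1alpha1", Resource: "pacemakerstatuses"} // assumed GVR

	existing, err := client.Resource(gvr).Get(ctx, desired.GetName(), metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// POST creates the object (only metadata/spec are persisted) ...
		created, createErr := client.Resource(gvr).Create(ctx, desired, metav1.CreateOptions{})
		if createErr != nil {
			return createErr
		}
		// ... then PUT /status so the new CR immediately carries the collected data.
		created.Object["status"] = desired.Object["status"]
		_, updateErr := client.Resource(gvr).UpdateStatus(ctx, created, metav1.UpdateOptions{})
		return updateErr
	}
	if err != nil {
		return err
	}

	// Update path: copy the freshly collected status onto the live object and PUT /status.
	existing.Object["status"] = desired.Object["status"]
	_, err = client.Resource(gvr).UpdateStatus(ctx, existing, metav1.UpdateOptions{})
	return err
}
```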
pkg/tnf/pkg/pacemaker/healthcheck.go (4)

412-429: Operator status update logic is appropriate.

The switch correctly sets degraded only for Error status (line 422) and clears it for Healthy status (line 424). Warning and Unknown statuses don't modify the degraded condition, which allows the operator to distinguish between Pacemaker cluster issues (Error) and monitoring/collection issues (Unknown).

338-373: Resource validation logic is clear.

The resource count validation (lines 363-364, 369) checks that critical resources (kubelet, etcd) are started on the expected number of nodes. While this may report resource errors when nodes are offline (already flagged by `checkNodeStatus`), the explicit reporting provides clarity about which specific resources are affected.

522-624: Excellent event deduplication and recording logic.

The event recording system properly implements the following (a sketch of the pattern follows the list):

- Time-based deduplication with appropriate windows (1 hour for fencing, 5 minutes for others)
- Transition-based healthy event recording (lines 592-603) to avoid spam
- Thread-safe access to the event map with mutex protection
- Cleanup of expired events (lines 522-543)
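A minimal sketch of that deduplication pattern, assuming a per-reason map guarded by a mutex; the type and method names are illustrative, and the windows come from the description above.

```go
package pacemaker

import (
	"sync"
	"time"
)

// eventDeduper suppresses repeat events that fall inside a per-reason window.
type eventDeduper struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
}

func newEventDeduper() *eventDeduper {
	return &eventDeduper{lastSeen: map[string]time.Time{}}
}

// shouldEmit records the event and reports whether it is outside the deduplication
// window for its reason (e.g. 1 hour for fencing events, 5 minutes for the rest).
func (d *eventDeduper) shouldEmit(reason string, window time.Duration, now time.Time) bool {
	d.mu.Lock()
	defer d.mu.Unlock()

	if last, ok := d.lastSeen[reason]; ok && now.Sub(last) < window {
		return false // still inside the window: suppress the duplicate
	}
	d.lastSeen[reason] = now
	return true
}

// cleanup removes entries older than the longest window so the map cannot grow without bound.
func (d *eventDeduper) cleanup(maxAge time.Duration, now time.Time) {
	d.mu.Lock()
	defer d.mu.Unlock()
	for reason, last := range d.lastSeen {
		if now.Sub(last) > maxAge {
			delete(d.lastSeen, reason)
		}
	}
}
```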
442-520: Condition management is well-implemented.

The idempotency checks in `setPacemakerDegradedCondition` (lines 448-460) and `clearPacemakerDegradedCondition` (lines 474-483) prevent unnecessary operator status updates, reducing API churn. The use of library-go helpers (line 512) ensures consistency with operator patterns.
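For context, the idempotent set-degraded path typically looks like the sketch below, using the library-go `v1helpers` helpers the comment refers to. The condition type, reason, and function name are illustrative, not copied from the PR.

```go
package pacemaker

import (
	"context"

	operatorv1 "github.com/openshift/api/operator/v1"
	"github.com/openshift/library-go/pkg/operator/v1helpers"
)

// setPacemakerDegraded skips the write when the condition already matches, otherwise
// applies it through v1helpers so the operator status is updated consistently.
func setPacemakerDegraded(ctx context.Context, client v1helpers.StaticPodOperatorClient, message string) error {
	_, status, _, err := client.GetStaticPodOperatorState()
	if err != nil {
		return err
	}

	desired := operatorv1.OperatorCondition{
		Type:    "PacemakerHealthCheckDegraded", // illustrative condition type
		Status:  operatorv1.ConditionTrue,
		Reason:  "PacemakerClusterUnhealthy",
		Message: message,
	}

	// Idempotency check: nothing to do if the condition already carries the same state and message.
	if existing := v1helpers.FindOperatorCondition(status.Conditions, desired.Type); existing != nil &&
		existing.Status == desired.Status && existing.Message == desired.Message {
		return nil
	}

	_, _, err = v1helpers.UpdateStatus(ctx, client, v1helpers.UpdateConditionFn(desired))
	return err
}
```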
```yaml
rawXML:
  description: RawXML contains the raw XML output from pcs status xml command
  type: string
  maxLength: 262144
collectionError:
```
🧩 Analysis chain

Collector must clamp rawXML to <= 262144 bytes before status update.

Ensure the status writer truncates the pcs XML to this cap, otherwise status updates will be rejected. Please confirm the collector enforces this cap and (optionally) sets a truncation indicator:

🏁 Script executed:

```bash
#!/bin/bash
# Check for clamp logic in collector/status writer
rg -nC2 -e 'rawXML' -e 'maxXMLSize|262144|truncate|Truncated' --type=go
```

Length of output: 223293
Status update will be rejected unless rawXML is clamped to 262144 bytes (the CRD maxLength).

The codebase enforces a 10MB XML limit (`maxXMLSize = 10 * 1024 * 1024` in pkg/tnf/pkg/pacemaker/statuscollector.go:30) but does not truncate to the CRD's declared `maxLength: 262144`. Any XML between 256KB and 10MB will cause the API server to reject the status update as a validation error.

The collector must truncate `rawXML` to ≤262144 bytes before calling `updatePacemakerStatusCR()` at line 528 to prevent rejection. Currently, the code only validates against the 10MB threshold (line 265) and sets an error for larger values, but does nothing to clamp compliant-but-oversized payloads.
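One way to implement the clamp, sketched under the assumption that the constant and helper names below are new (they are not in the PR as reviewed):

```go
// crdRawXMLMaxLength must match the maxLength declared for rawXML in the PacemakerStatus CRD.
const crdRawXMLMaxLength = 262144

// clampRawXML truncates the collected XML so the status update passes server-side
// schema validation. Truncation is done on the byte slice, as suggested above;
// the boolean lets the caller record a truncation note (e.g. in collectionError).
func clampRawXML(rawXML string) (clamped string, truncated bool) {
	if len(rawXML) <= crdRawXMLMaxLength {
		return rawXML, false
	}
	return rawXML[:crdRawXMLMaxLength], true
}
```

The collector would call this immediately before building the status payload and, when truncated is true, append a short note so consumers know the stored XML is incomplete.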
🤖 Prompt for AI Agents
In pkg/tnf/pkg/pacemaker/statuscollector.go around lines near 265 and 528, the
code validates XML only against maxXMLSize (10MB) but does not clamp rawXML to
the CRD-declared maxLength (262144 bytes) before calling
updatePacemakerStatusCR(), causing API server validation failures for payloads
between 256KB and 10MB; modify the collector to truncate rawXML to at most
262144 bytes (CRD maxLength) — perform the truncation on the byte slice (not
rune count) after reading/validating against maxXMLSize and before calling
updatePacemakerStatusCR(), and ensure the truncated content is used in the
status update (optionally set/append a flag or error field indicating
truncation) so the API server will accept the status update.
Change List