
Commit 3ef946b

In previous versions of the operator, when some of our synchronization code was ported to the kube library, a bug was introduced that prevented setting the field manager (see redpanda-data/common-go#126 for the relevant fix in the kube package as it exists today). Additionally, we have been inconsistent in how we set the field manager across our kube.Ctl usage.

This resulted in some very odd behavior, with the Kubernetes API server mangling resources due to conflicting field manager names. For example, Service ports are merged using their (protocol, port) tuple as their identity. An old field manager claimed ownership of the Service port (TCP, 9092), named "kafka". Applying, with the new manager, a version of our CRD where that port was overwritten to 19092 caused the API server, because of the conflicting field manager names, to see both (TCP, 9092) and (TCP, 19092) named "kafka", which failed validation.

The knock-on effects are even harder to resolve when the merged resources don't fail validation immediately. StatefulSets, for example, will gladly accept duplicated port names in their pod template container definitions, but then fail when they actually go to provision the Pods. What this means is that we have to:

1. Clear all of the field managers that are mis-named.
2. Assume ownership over all fields as they currently exist in the resources we created via server-side apply, so that
3. when re-reconciliation kicks in, not only will resources that would otherwise fail validation succeed, but resources mangled by merges, such as pod template container ports, will get cleaned up, because our proper field owner owns all of the relevant spec fields.

This is resolved through a post-upgrade migration job that removes any unwanted field managers from all relevant resources related to the Redpanda and Console CRDs and forcibly assumes ownership over their fields with the proper field manager. Our reconcilers subsequently pick up and fix any malformed resources.
1 parent b344b38 commit 3ef946b

20 files changed: +62438 −34660 lines
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
```yaml
project: operator
kind: Fixed
body: In previous versions of the operator a field manager was unknowingly changed for resources that were synchronized via server-side apply. This could cause problems when modifying fields such as Service and StatefulSet port definitions. A post-upgrade migration job was added to remove any unwanted field managers.
time: 2026-01-29T12:28:15.717527-05:00
```
Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
```gherkin
@operator:none @vcluster
# Note: use the same version of RP across upgrades to minimize
# issues not related to operator upgrade regressions.
Feature: Operator upgrade regressions
  @skip:gke @skip:aks @skip:eks
  Scenario: Regression - field managers
    Given I helm install "redpanda-operator" "redpanda/operator" --version v25.1.3 with values:
      """
      crds:
        enabled: true
      """
    And I apply Kubernetes manifest:
      """
      ---
      apiVersion: cluster.redpanda.com/v1alpha2
      kind: Redpanda
      metadata:
        name: operator-upgrade
      spec:
        clusterSpec:
          image:
            repository: redpandadata/redpanda
            tag: v25.2.11
          console:
            enabled: false
          statefulset:
            replicas: 1
            sideCars:
              image:
                tag: dev
                repository: localhost/redpanda-operator
      """
    And cluster "operator-upgrade" is available
    And service "operator-upgrade" should have field managers:
      """
      cluster.redpanda.com/operator
      """
    And service "operator-upgrade" should not have field managers:
      """
      *kube.Ctl
      """
    Then I helm upgrade "redpanda-operator" "redpanda/operator" --version v25.2.1 with values:
      """
      crds:
        enabled: true
      """
    And cluster "operator-upgrade" should be stable with 1 nodes
    And service "operator-upgrade" should have field managers:
      """
      cluster.redpanda.com/operator
      *kube.Ctl
      """
    And I apply Kubernetes manifest:
      """
      ---
      apiVersion: cluster.redpanda.com/v1alpha2
      kind: Redpanda
      metadata:
        name: operator-upgrade
      spec:
        clusterSpec:
          image:
            repository: redpandadata/redpanda
            tag: v25.2.11
          listeners:
            kafka:
              port: 19093
          console:
            enabled: false
          statefulset:
            replicas: 1
            sideCars:
              image:
                tag: dev
                repository: localhost/redpanda-operator
      """
    And cluster "operator-upgrade" should have sync error:
      """
      Service "operator-upgrade" is invalid: spec.ports[3].name: Duplicate value: "kafka"
      """
    Then I helm upgrade "redpanda-operator" "../operator/chart" with values:
      """
      image:
        tag: dev
        repository: localhost/redpanda-operator
      crds:
        enabled: true
      """
    And service "operator-upgrade" should have field managers:
      """
      cluster.redpanda.com/operator
      """
    And service "operator-upgrade" should not have field managers:
      """
      *kube.Ctl
      """
    And cluster "operator-upgrade" should be stable with 1 nodes
```

acceptance/steps/register.go

Lines changed: 5 additions & 0 deletions
```diff
@@ -117,6 +117,11 @@ func init() {
 	framework.RegisterStep(`^Console "([^"]+)" will be healthy`, consoleIsHealthy)
 	framework.RegisterStep(`^the migrated console cluster "([^"]+)" should have (\d+) warning(s)?$`, consoleHasWarnings)
 
+	// Regression steps
+	framework.RegisterStep(`^service "([^"]*)" should have field managers:$`, checkResourceFieldManagers)
+	framework.RegisterStep(`^service "([^"]*)" should not have field managers:$`, checkResourceNoFieldManagers)
+	framework.RegisterStep(`^cluster "([^"]*)" should have sync error:$`, checkClusterHasSyncError)
+
 	// Debug steps
 	framework.RegisterStep(`^I become debuggable$`, sleepALongTime)
 }
```

acceptance/steps/regression.go

Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
```go
// Copyright 2026 Redpanda Data, Inc.
//
// Use of this software is governed by the Business Source License
// included in the file licenses/BSL.md
//
// As of the Change Date specified in that file, in accordance with
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0

package steps

import (
	"context"
	"fmt"
	"slices"
	"strings"
	"time"

	"github.com/cucumber/godog"
	"github.com/stretchr/testify/require"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	framework "github.com/redpanda-data/redpanda-operator/harpoon"
	redpandav1alpha2 "github.com/redpanda-data/redpanda-operator/operator/api/redpanda/v1alpha2"
)

func checkClusterHasSyncError(ctx context.Context, t framework.TestingT, clusterName string, errdoc *godog.DocString) {
	errorString := strings.TrimSpace(errdoc.Content)

	var cluster redpandav1alpha2.Redpanda

	key := t.ResourceKey(clusterName)

	t.Logf("Checking cluster %q has sync error", clusterName)
	require.Eventually(t, func() bool {
		require.NoError(t, t.Get(ctx, key, &cluster))
		hasCondition := t.HasCondition(metav1.Condition{
			Type:   "ResourcesSynced",
			Status: metav1.ConditionFalse,
			Reason: "Error",
		}, cluster.Status.Conditions)

		t.Logf(`Checking cluster resource conditions contains an errored "ResourcesSynced"? %v`, hasCondition)
		if !hasCondition {
			return false
		}

		for _, condition := range cluster.Status.Conditions {
			if condition.Type == "ResourcesSynced" && condition.Status == metav1.ConditionFalse && condition.Reason == "Error" {
				t.Logf("Found error message: %q", condition.Message)
				return strings.Contains(condition.Message, errorString)
			}
		}

		return false
	}, 5*time.Minute, 5*time.Second, "%s", delayLog(func() string {
		return fmt.Sprintf(`Cluster %q never contained an error on the condition reason "ResourcesSynced" with a matching error string: %q, final Conditions: %+v`, key.String(), errorString, cluster.Status.Conditions)
	}))
}

func checkResourceNoFieldManagers(ctx context.Context, t framework.TestingT, clusterName string, list *godog.DocString) {
	fieldManagerCheck(ctx, t, false, clusterName, list.Content)
}

func checkResourceFieldManagers(ctx context.Context, t framework.TestingT, clusterName string, list *godog.DocString) {
	fieldManagerCheck(ctx, t, true, clusterName, list.Content)
}

func fieldManagerCheck(ctx context.Context, t framework.TestingT, presence bool, clusterName, managerList string) {
	managers := []string{}
	for _, line := range strings.Split(strings.TrimSpace(managerList), "\n") {
		if manager := strings.TrimSpace(line); manager != "" {
			managers = append(managers, manager)
		}
	}

	var service corev1.Service

	key := t.ResourceKey(clusterName)

	t.Logf("Checking resource %q field manager", clusterName)
	require.Eventually(t, func() bool {
		require.NoError(t, t.Get(ctx, key, &service))

		fieldManagers := service.GetManagedFields()
		for _, manager := range managers {
			if !slices.ContainsFunc(fieldManagers, func(entry metav1.ManagedFieldsEntry) bool {
				return entry.Manager == manager
			}) {
				t.Logf(`Resource %q does not contain the field manager %q`, key.String(), manager)
				// negation because if we are checking for presence, then we want
				// to return that the condition is not met
				return !presence
			}
			t.Logf(`Found field manager %q in resource %q`, manager, key.String())
		}
		return presence
	}, 5*time.Minute, 5*time.Second, "%s", delayLog(func() string {
		finalManagers := []string{}
		for _, entry := range service.GetManagedFields() {
			finalManagers = append(finalManagers, entry.Manager)
		}
		if presence {
			return fmt.Sprintf(`Resource %q never contained all of the field managers: %+v, was: %+v`, key.String(), managers, finalManagers)
		}
		return fmt.Sprintf(`Resource %q never lacked all of the field managers: %+v, was: %+v`, key.String(), managers, finalManagers)
	}))
}
```

charts/redpanda/render_state_nogotohelm.go

Lines changed: 6 additions & 1 deletion
```diff
@@ -27,6 +27,7 @@ import (
 	k8sapierrors "k8s.io/apimachinery/pkg/api/errors"
 	"k8s.io/apimachinery/pkg/types"
 	"k8s.io/utils/ptr"
+	"sigs.k8s.io/controller-runtime/pkg/client"
 
 	"github.com/redpanda-data/redpanda-operator/gotohelm/helmette"
 	"github.com/redpanda-data/redpanda-operator/pkg/kube"
@@ -49,6 +50,8 @@ var (
 	}
 )
 
+const DefaultFieldOwner = client.FieldOwner("cluster.redpanda.com/operator")
+
 // FetchSASLUsers attempts to locate an existing SASL users secret in the cluster.
 // If found, it is used to populate the first user in the secret for use.
 func (r *RenderState) FetchSASLUsers() (username, password, mechanism string, err error) {
@@ -253,7 +256,9 @@ func certificatesFor(state *RenderState, name string) (certSecret, certKey, clie
 
 // KubeCTL constructs a kube.Ctl from the RenderState's kubeconfig.
 func (r *RenderState) KubeCTL() (*kube.Ctl, error) {
-	return kube.FromRESTConfig(r.Dot.KubeConfig)
+	return kube.FromRESTConfig(r.Dot.KubeConfig, kube.Options{
+		FieldManager: string(DefaultFieldOwner),
+	})
 }
 
 // RenderNodePools can be used to render node pools programmatically from Go.
```

operator/chart/chart.go

Lines changed: 2 additions & 0 deletions
```diff
@@ -61,6 +61,8 @@ func render(dot *helmette.Dot) []kube.Object {
 		Deployment(dot),
 		PreInstallCRDJob(dot),
 		CRDJobServiceAccount(dot),
+		PostUpgradeMigrationJob(dot),
+		MigrationJobServiceAccount(dot),
 	}
 
 	for _, cr := range ClusterRoles(dot) {
```
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
```go
// Copyright 2026 Redpanda Data, Inc.
//
// Use of this software is governed by the Business Source License
// included in the file licenses/BSL.md
//
// As of the Change Date specified in that file, in accordance with
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0

// +gotohelm:filename=_post-upgrade-migration-job.go.tpl
package operator

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"

	"github.com/redpanda-data/redpanda-operator/gotohelm/helmette"
)

// This is a post-upgrade job to make sure it just runs once.
func PostUpgradeMigrationJob(dot *helmette.Dot) *batchv1.Job {
	values := helmette.Unwrap[Values](dot.Values)

	return &batchv1.Job{
		TypeMeta: metav1.TypeMeta{
			APIVersion: "batch/v1",
			Kind:       "Job",
		},
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("%s-migration", Fullname(dot)),
			Namespace: dot.Release.Namespace,
			Labels: helmette.Merge(
				Labels(dot),
			),
			Annotations: map[string]string{
				"helm.sh/hook":               "post-upgrade",
				"helm.sh/hook-delete-policy": "before-hook-creation,hook-succeeded,hook-failed",
				// run this after the CRD job
				"helm.sh/hook-weight": "-4",
			},
		},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{
					Annotations: values.PodAnnotations,
					Labels:      helmette.Merge(SelectorLabels(dot), values.PodLabels),
				},
				Spec: corev1.PodSpec{
					RestartPolicy:                 corev1.RestartPolicyOnFailure,
					AutomountServiceAccountToken:  ptr.To(false),
					TerminationGracePeriodSeconds: ptr.To(int64(10)),
					ImagePullSecrets:              values.ImagePullSecrets,
					ServiceAccountName:            MigrationJobServiceAccountName(dot),
					NodeSelector:                  values.NodeSelector,
					Tolerations:                   values.Tolerations,
					Volumes:                       []corev1.Volume{serviceAccountTokenVolume()},
					Containers:                    migrationJobContainers(dot),
				},
			},
		},
	}
}

func migrationJobContainers(dot *helmette.Dot) []corev1.Container {
	values := helmette.Unwrap[Values](dot.Values)

	args := []string{"migration"}

	return []corev1.Container{
		{
			Name:            "migration",
			Image:           containerImage(dot),
			ImagePullPolicy: values.Image.PullPolicy,
			Command:         []string{"/redpanda-operator"},
			Args:            args,
			SecurityContext: &corev1.SecurityContext{AllowPrivilegeEscalation: ptr.To(false)},
			VolumeMounts:    []corev1.VolumeMount{serviceAccountTokenVolumeMount()},
			Resources:       values.Resources,
		},
	}
}
```

operator/chart/rbac.go

Lines changed: 16 additions & 1 deletion
```diff
@@ -28,7 +28,7 @@ type RBACBundle struct {
 func rbacBundles(dot *helmette.Dot) []RBACBundle {
 	values := helmette.Unwrap[Values](dot.Values)
 
-	return []RBACBundle{
+	bundles := []RBACBundle{
 		{
 			Name:    Fullname(dot),
 			Enabled: true,
@@ -77,6 +77,21 @@ func rbacBundles(dot *helmette.Dot) []RBACBundle {
 			},
 		},
 	}
+
+	// the migration job needs the same general RBAC policy as the operator itself
+	bundles = append(bundles, RBACBundle{
+		Name:    MigrationJobServiceAccountName(dot),
+		Enabled: true,
+		Subject: MigrationJobServiceAccountName(dot),
+		Annotations: map[string]string{
+			"helm.sh/hook":               "post-upgrade",
+			"helm.sh/hook-delete-policy": "before-hook-creation,hook-succeeded,hook-failed",
+			"helm.sh/hook-weight":        "-10",
+		},
+		RuleFiles: bundles[0].RuleFiles,
+	})
+
+	return bundles
 }
 
 func ClusterRoles(dot *helmette.Dot) []rbacv1.ClusterRole {
```
