Implement alternative sidecar broker and pvc decommissioner #342
Conversation
Also, just FYI, this is what the events look like while the test is running:
Also, with the new ghost broker test, here's an elided stream of events from watching the test run, with some rough annotations as to what's happening when:

```text
# Initial 5 broker cluster deployed
Controller ID: 0 All nodes: [0 1 2 3 4] Nodes down: []
# Helm upgrade to 4 broker cluster
Controller ID: 0 All nodes: [0 1 2 3 4] Nodes down: [4]
# Rolling Restart
delete Pod basic-3 in StatefulSet basic successful
# decommissioner picks up old broker and PVC
brokers needing decommissioning: [4], decommissioning: 4
unbound persistent volume claims: [testenv-n99nf/datadir-basic-4], decommissioning: testenv-n99nf/datadir-basic-4
Controller ID: 0 All nodes: [0 1 2 3] Nodes down: [3]
create Pod basic-3 in StatefulSet basic successful
delete Pod basic-2 in StatefulSet basic successful
Controller ID: 0 All nodes: [0 1 2 3] Nodes down: [2]
create Pod basic-2 in StatefulSet basic successful
delete Pod basic-1 in StatefulSet basic successful
Controller ID: 0 All nodes: [0 1 2 3] Nodes down: [1]
create Pod basic-1 in StatefulSet basic successful
delete Pod basic-0 in StatefulSet basic successful
create Pod basic-0 in StatefulSet basic successful
Controller ID: 2 All nodes: [0 1 2 3] Nodes down: [0]
Controller ID: 2 All nodes: [0 1 2 3] Nodes down: []
# rolling restart completes, node is tainted, pod evicted and unschedulable
create Pod basic-0 in StatefulSet basic successful
0/5 nodes are available: 1 node(s) had untolerated taint {decommission-test: }. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.
Controller ID: 2 All nodes: [0 1 2 3] Nodes down: [0]
# PVC and pod are manually deleted like PVC unbinder
create Claim datadir-basic-0 Pod basic-0 in StatefulSet basic success
create Pod basic-0 in StatefulSet basic successful
Successfully provisioned volume pvc-70d93e27-c1fc-4913-a7c8-627e83e17f29
Successfully assigned testenv-n99nf/basic-0 to k3d-decommissioning-agent-2
# New broker comes up, but not yet healthy
Controller ID: 2 All nodes: [0 1 2 3 5] Nodes down: [0 5]
# New broker healthy, old broker has ordinal collision
Controller ID: 2 All nodes: [0 1 2 3 5] Nodes down: [0]
# decommissioner detects broker needing decommissioning
brokers needing decommissioning: [0], decommissioning: 0
Controller ID: 2 All nodes: [1 2 3 5] Nodes down: []
```
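To make the annotated membership check concrete, here's a rough, self-contained sketch of the decision the decommissioner is effectively making. The struct and helpers are hypothetical, named to mirror the fields in the log lines rather than the actual types in this PR:

```go
package main

import "fmt"

// healthOverview mirrors the fields printed in the event stream above
// (Controller ID, All nodes, Nodes down); it is a stand-in, not the
// real Admin API response type.
type healthOverview struct {
	ControllerID int
	AllNodes     []int
	NodesDown    []int
}

// brokersToDecommission returns the brokers the cluster still reports as
// members but that no longer map to a live StatefulSet replica, either
// because the StatefulSet was scaled down or because the broker is a
// "ghost" left behind by an ordinal collision.
func brokersToDecommission(health healthOverview, desiredReplicas int, ordinalOf func(nodeID int) (int, bool)) []int {
	var candidates []int
	for _, id := range health.AllNodes {
		ordinal, ok := ordinalOf(id)
		if !ok || ordinal >= desiredReplicas {
			candidates = append(candidates, id)
		}
	}
	return candidates
}

func main() {
	// Mirrors the "Helm upgrade to 4 broker cluster" step: node 4 is
	// still a member but the StatefulSet now has only 4 replicas.
	health := healthOverview{ControllerID: 0, AllNodes: []int{0, 1, 2, 3, 4}, NodesDown: []int{4}}
	identity := func(id int) (int, bool) { return id, true } // assume node ID == pod ordinal here
	fmt.Println(brokersToDecommission(health, 4, identity))  // [4]
}
```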
chrisseto
left a comment
Not a full review; just flushing comments for now.
chrisseto
left a comment
A couple of nits on documentation. Blocking comment on MergeMaps.
```go
cmd.Flags().StringVar(&metricsAddr, "metrics-bind-address", ":8080", "The address the metric endpoint binds to.")
cmd.Flags().StringVar(&probeAddr, "health-probe-bind-address", ":8081", "The address the probe endpoint binds to.")
cmd.Flags().StringVar(&pprofAddr, "pprof-bind-address", ":8082", "The address the pprof endpoint binds to.")
```
Nit: port 8082 in the default Redpanda Helm chart deployment is used by the HTTP listener. Could you add a test that deploys the Redpanda Helm chart with this sidecar?
I'll add the test once I get the rest of the sidecar command finished off with the configwatcher port and pvcunbinder. Will leave this comment open so I don't forget!
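For what it's worth, a minimal sketch of moving the default off :8082; :8083 is an arbitrary assumed choice, not something this PR settles on:

```go
package main

import "github.com/spf13/cobra"

func newSidecarCommand() *cobra.Command {
	var metricsAddr, probeAddr, pprofAddr string
	cmd := &cobra.Command{Use: "sidecar"}
	cmd.Flags().StringVar(&metricsAddr, "metrics-bind-address", ":8080", "The address the metric endpoint binds to.")
	cmd.Flags().StringVar(&probeAddr, "health-probe-bind-address", ":8081", "The address the probe endpoint binds to.")
	// :8083 is an assumed default chosen only to avoid the chart's HTTP
	// listener on :8082; pick whatever port is actually free.
	cmd.Flags().StringVar(&pprofAddr, "pprof-bind-address", ":8083", "The address the pprof endpoint binds to.")
	return cmd
}

func main() {
	_ = newSidecarCommand().Execute()
}
```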
```go
return nil
}

if !strings.HasPrefix(claim.Name, datadirVolume+"-") {
```
What about Pods that have two local-path persistent volumes (datadir and shadow-index)? The second one, shadow-index, is used in cloud (operator v1) to separate the data directory from the tiered storage cache.
I think I misunderstood which controller (decommission) would be replaced, but cloud still uses tiered storage with the PersistentVolume mount type in operator v2 mode:
https://github.com/redpanda-data/helm-charts/blob/b0a8f611127d405d0811eae052fe75d56a70799d/charts/redpanda/statefulset.go#L854-L865
https://github.com/redpanda-data/helm-charts/blob/b0a8f611127d405d0811eae052fe75d56a70799d/charts/redpanda/statefulset.go#L1068-L1076
https://github.com/redpanda-data/helm-charts/blob/b0a8f611127d405d0811eae052fe75d56a70799d/charts/redpanda/statefulset.go#L1216-L1255
Ah, I think that's a good call-out: we should likely handle cleaning up the PVCs that we're optionally creating as well. The original code was based purely on what the decommission controller was doing previously, which was just datadir, but I think it'd make sense to check for any of our optionally provisioned PVCs too.
I think I'm just going to leave this as-is for now, since we don't try to decommission non-datadir PVCs right now anyway. But it'd make sense to do so in a future iteration.
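As a sketch of that future iteration, the single-prefix check could become a loop over every volume prefix we provision. The prefix list below is illustrative; beyond datadir, the names are assumptions rather than the chart's actual volume names:

```go
package main

import (
	"fmt"
	"strings"
)

// decommissionablePVCPrefixes lists the volume-name prefixes whose PVCs
// the decommissioner would be allowed to clean up. Only datadir is
// handled today; the other entry is a hypothetical placeholder for an
// optionally provisioned volume like the tiered storage cache.
var decommissionablePVCPrefixes = []string{
	"datadir-",
	"shadow-index-cache-", // hypothetical; not necessarily the real volume name
}

// isDecommissionableClaim reports whether a PVC name matches any of the
// prefixes we provision and is therefore safe to consider for cleanup.
func isDecommissionableClaim(claimName string) bool {
	for _, prefix := range decommissionablePVCPrefixes {
		if strings.HasPrefix(claimName, prefix) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isDecommissionableClaim("datadir-basic-4")) // true
	fmt.Println(isDecommissionableClaim("config-basic-4"))  // false
}
```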
(Resolved review thread on operator/internal/decommissioning/statefulset_decommissioner_test.go.)
```go
values = functional.MergeMaps(values, overrides)
}

release, err := s.helm.Install(s.ctx, "redpandadata/redpanda", helm.InstallOptions{
```
Not a problem: a better option might be to reference the chart path within this repo, since we already have the charts in the redpanda-operator repository.
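Circling back to the blocking comment on MergeMaps above: the usual question is whether the merge is deep or shallow. Here's a sketch of a recursive merge in which nested maps are merged and overrides win on scalar conflicts; this is illustrative, not necessarily how the repo's functional.MergeMaps behaves:

```go
// Package functional is a hypothetical placement mirroring the
// functional.MergeMaps call quoted above.
package functional

// MergeMaps deep-merges override into base without mutating either map.
// Nested map[string]any values are merged recursively; for any other
// conflicting key, the override value wins.
func MergeMaps(base, override map[string]any) map[string]any {
	out := make(map[string]any, len(base))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range override {
		if ov, ok := v.(map[string]any); ok {
			if bv, ok := out[k].(map[string]any); ok {
				out[k] = MergeMaps(bv, ov)
				continue
			}
		}
		out[k] = v
	}
	return out
}
```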
RafalKorepta
left a comment
LGTM, besides the potential panic.
Force-pushed from 541a4d6 to 90ffa2b.
This is the first work towards unifying any sort of long-lived containers into a unified sidecar entrypoint. It implements the entrypoint via a new alternative Decommissioner, though I may have missed some corner cases. Let me know if there are any additional tests you'd want to add beyond the basic scale down/node failure decommission case that's there.
EDIT:
Added support for an additional fetching mechanism so that we support something other than just grabbing Helm values and converting them to a Redpanda CR to get a client. In the process, added support for getting an rpk Profile and handing that to our factory to initialize all of our connections -- so now the factory can be used to initialize low-level Schema Registry/Kafka/Admin API clients via a redpanda.yaml and environment variables too, which should be perfect for sidecar usages that don't really use Helm (i.e. ArgoCD).
Also, if we see that the FS read operations are too heavy for initializing TLS, etc., the factory and fetchers can have a filesystem specified and use something like an afero.CacheOnReadFs, though I figure that's too premature to introduce for now.
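A minimal sketch of the afero.CacheOnReadFs idea, assuming the factory accepted an afero.Fs: repeated reads of TLS material would be served from an in-memory layer instead of hitting disk each time. The path and cache duration below are arbitrary:

```go
package main

import (
	"time"

	"github.com/spf13/afero"
)

// cachedFS wraps the real OS filesystem with an in-memory read cache so
// repeated reads (e.g. CA bundles during client initialization) are
// served from memory once loaded.
func cachedFS() afero.Fs {
	base := afero.NewOsFs()
	cache := afero.NewMemMapFs()
	return afero.NewCacheOnReadFs(base, cache, 30*time.Second) // 30s is an arbitrary TTL
}

func main() {
	fs := cachedFS()
	// First read hits disk; subsequent reads within the TTL hit the cache.
	// The path is illustrative only.
	_, _ = afero.ReadFile(fs, "/etc/tls/ca.crt")
}
```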