-
I'm not sure I follow the problem. You talk about some cluster named
-
Hi. Thank you for replying. It's still creating replicasets like crazy for the entity-operator:
^ Those are all elsec-feeds-entity-operator-...
On a side note: why is it in the 'kafka' NS (where I have zk and kf) and not in the 'strimzi' NS where the cluster operator lives? (I don't know if that is another symptom or not.) After over a day the entity-operator is still not created/running, and I have no idea why.
As for the rolling restarts of zk and kf, I will restart the cluster operator with debug logging. Note, however, that nothing is interacting with this environment yet; I only just created it. So what is the change that is causing the infinite restarts?
-
So I decided to delete it all and start again (following: https://strimzi.io/docs/operators/latest/quickstart.html). For reference:
Then
Then the cluster:
=>
So notes on what's happening in between:
Topic:
=>
~10:50 GMT I was going to say: it seems this time the entity-operator came up OK, and it's not creating loads of empty/dead replicasets:
However, before I finished posting this, I looked again and:
So back to square 1 on my first issue... For the brief time it was running it did create my test topic:
Also, after all that, it is still deleting/restarting the kf and zk pods, so my second issue is also here to stay.
I hope that with all this you or anyone else has enough information to help; it would be greatly appreciated. I was capturing the debug cluster operator logs (stopped log capture at 11:04 GMT):
In any case, thank you for taking the time to look at this.
-
Thank you for looking at this and for the response. If I understand correctly, Strimzi and Autopilot are not getting along? This is how I created the cluster:
Following: https://cloud.google.com/kubernetes-engine/docs/quickstart#create_cluster
After those two commands I can use kubectl. Then I moved on to the Strimzi getting started guide; I did nothing more at the cluster create part (yet).
So the first option, "Disable whatever is doing this from doing it", says to me: you can't use GKE Autopilot.
The second option is (TBH) starting to take me out of my depth, so I will have to take some time to digest it. I used to run ZK and KF managed via my own Ansible playbook. Having discovered Strimzi, I thought I'd give it a go.
Should/could I open an issue for out-of-the-box support of Strimzi on GKE Autopilot? Will the template route really work if Autopilot is making annotations? Can I really know these in advance and add them to the Strimzi definitions? Would it not be better to be able to list things that Strimzi should ignore, by adopting the new value automatically?
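To make the "list of things to ignore" idea above concrete, here is a minimal sketch in Python. It is not how Strimzi reconciles resources; the function names, the ignore list and the annotation key `example.com/injected` are all hypothetical and only illustrate comparing a desired object with the live one while skipping fields that the platform is expected to change.

```python
def changed_paths(desired, actual, prefix=()):
    """Return key paths (as tuples) where the live object differs from the desired one."""
    paths = []
    for key in sorted(set(desired) | set(actual)):
        path = prefix + (key,)
        d, a = desired.get(key), actual.get(key)
        if isinstance(d, dict) and isinstance(a, dict):
            # Recurse into nested maps such as metadata.annotations.
            paths.extend(changed_paths(d, a, path))
        elif d != a:
            paths.append(path)
    return paths


def relevant_changes(desired, actual, ignored_prefixes):
    """Drop every difference that sits under one of the ignored path prefixes."""
    return [
        p for p in changed_paths(desired, actual)
        if not any(p[:len(i)] == i for i in ignored_prefixes)
    ]


# Hypothetical objects: the platform added an annotation after creation.
desired = {"metadata": {"annotations": {}}, "spec": {"replicas": 3}}
actual = {
    "metadata": {"annotations": {"example.com/injected": "x"}},
    "spec": {"replicas": 3},
}

# With annotations ignored, nothing is left that would justify a restart.
print(relevant_changes(desired, actual, [("metadata", "annotations")]))
```

Paths are kept as tuples rather than dotted strings because annotation keys themselves usually contain dots and slashes.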
-
Thank you for your continuing support in this discussion thread. Autopilot is a new mode or type of GKE (Google's K8s platform) that they released in Feb. 2021. The blog/announcement is here: https://cloud.google.com/blog/products/containers-kubernetes/introducing-gke-autopilot. From the overview documentation: https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview
Here's a comparison between Autopilot and Standard GKE: https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview#comparison
Looking at that table, I guess it's something that Autopilot pre-configures and manages, such as the things listed under Security. Perhaps I have to deploy a standard GKE cluster, but I was really hoping to spare time and effort by having Google and their automation manage the GKE infrastructure, such as nodes, and in turn have Strimzi manage the application platform (Kafka), leaving me to do other things. Routes I now see, and will investigate:
At the very least I hope this thread proves useful to others who try Strimzi with GKE Autopilot, and who share the dream of deploying a fresh GKE, Strimzi and Kafka cluster in something like a dozen simple commands. Happy new year!
-
Hi. So I now have this:
This appears to have stopped the delete/restart loop for Zookeeper and Kafka; however, the entity operator still won't come up. I tried to grab some log history via the GCP web UI, attached here:
-
Hi. Is this better? I am finding it difficult to get logs that look useful for something that is so short lived; any advice welcome.
and
and
-
Could it be this annotation?
-
The health checks on the two operators have:
However, in GKE these are being interpreted as:
I doubt you wanted port 0, but rather localhost:80, no? There does not seem to be a config option to change these to localhost:80/ready (or localhost:80/healthy).
-
So I forwarded this discussion thread to Google Cloud support to get their assistance with the EntityOperator failing to deploy. This is their response:
Autopilot Container Isolation[https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview#container_isolation]
I think we already see this with the template changes I had to make for the zk and kf sets to stop the delete loop. Is the tls-sidecar container falling foul of any of this?
I tried to search for 'stunnel NET_PCAP_RAW' and could not find anything conclusive, other than some patches to allow certain modes of operation without requiring root.
Side question: why is the tls-sidecar needed when (AFAIK) both Zookeeper and Kafka now support TLS natively? (I know that wasn't true some years ago.)
Linux Capability Limitations[https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview#linux_workload_limitations]
Does tls-sidecar require other capabilities?
Ignoring GKE Autopilot, it seems there are many security issues and CVEs (e.g. CVE-2020-12401) associated with CAP_NET_RAW in the container workload space. If Strimzi (tls-sidecar) does in fact require it, that might be worth reviewing purely in the interest of security.
I am aware that in a complex system such as the components that make up Strimzi and the workloads it manages, these two questions may not be immediately answerable. However, I wanted to make this post for posterity. I do plan to find time (sometime) to try Strimzi on GKE Standard (maybe even toggle CAP_NET_RAW), not as a long-term solution but just as a validation test.
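For that kind of validation test, one way to see whether a process actually holds CAP_NET_RAW is to decode the `CapEff` line of `/proc/<pid>/status`. The sketch below is a generic Linux check in Python, not something taken from Strimzi or GKE; the helper names are made up for illustration. CAP_NET_RAW is bit 13 in the Linux capability numbering.

```python
CAP_NET_RAW = 13  # bit index of CAP_NET_RAW in the Linux capability sets


def effective_caps(status_text):
    """Extract the hex CapEff mask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split(":", 1)[1].strip()
    return None


def has_capability(mask_hex, cap_bit):
    """True if the given capability bit is set in a hex capability mask."""
    return bool((int(mask_hex, 16) >> cap_bit) & 1)


# Example status excerpt; the mask value is only an illustration.
sample_status = "Name:\tstunnel\nCapEff:\t00000000a80425fb\n"
mask = effective_caps(sample_status)
print(mask, has_capability(mask, CAP_NET_RAW))
```

Inside a running container the same functions can be fed the contents of `/proc/self/status` (or `/proc/1/status` for the main process) to compare a Standard cluster with an Autopilot one.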
-
FYI: https://issuetracker.google.com/issues/214356345 was created by Google in response to this case.
-
This seems to have been solved. I documented my experience in #6922.
-
Hi.
I followed the getting started guide to deploy Strimzi on a GKE Autopilot cluster (1.21.5-gke.1302); for which I also followed the getting started guide.
I first deployed 0.26.1, then tried to upgrade to 0.27.0, and then deleted everything and started again deploying 0.27.0 - and now I am here asking for help. Note: I deployed strimzi into namespace 'strimzi' and Kafka into 'kafka'.
I see two big issues:
1) The abc-entity-operator deployment is failing and is creating an ever longer list of ReplicaSets:
I assume this is not intentional. When I was deleting everything to start again, I had nearly 8000 of these to delete...
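When cleaning up that many objects, it helps to list only the ReplicaSets that belong to the failing deployment and are scaled to zero before deleting anything. A minimal sketch in Python follows, assuming the JSON produced by `kubectl get replicasets -n kafka -o json` has been parsed into a dict; the function name and the sample data are illustrative only.

```python
def stale_replicasets(rs_list, name_prefix):
    """Return the sorted names of ReplicaSets with the prefix that want 0 replicas."""
    names = []
    for item in rs_list.get("items", []):
        name = item["metadata"]["name"]
        # spec.replicas defaults to 1 in Kubernetes when it is not set.
        desired = item.get("spec", {}).get("replicas", 1)
        if name.startswith(name_prefix) and desired == 0:
            names.append(name)
    return sorted(names)


# Made-up sample in the shape of a Kubernetes ReplicaSet list.
sample = {
    "items": [
        {"metadata": {"name": "abc-entity-operator-aaa"}, "spec": {"replicas": 0}},
        {"metadata": {"name": "abc-entity-operator-bbb"}, "spec": {"replicas": 1}},
        {"metadata": {"name": "unrelated-ccc"}, "spec": {"replicas": 0}},
    ]
}

for rs_name in stale_replicasets(sample, "abc-entity-operator-"):
    print(rs_name)
```

With real data, the dict would come from `json.load` on the kubectl output, and the printed names can then be reviewed before being passed to `kubectl delete replicaset`.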
In the GCP web UI console the reason I most often see for the error state is "Container tls-sidecar is waiting", and I managed to grab this:
2) It keeps deleting (and thus restarting) both Zookeeper and Kafka nodes for no apparent reason.
Looking at Zookeeper and Kafka they seem happy, I can even do basic operations like interact with topics:
Here are the operator logs:
strimzi_operator.log
What am I doing wrong? Please advise on what I can do.