Description
Describe the bug
Hi,
Here's another issue about kustomize-controller slowness; I'm opening a new one because the existing ones seemed unrelated. We have a K8s cluster with about 128 Kustomizations in the flux-system namespace (deployed according to https://github.com/fluxcd/flux2-kustomize-helm-example), 11 of which are failing:
service-1 30d False health check failed after 44.855714ms: failed early due to stalled resources: [Deployment/namespace/service-1 status: 'Failed']
service-2 30d False health check failed after 78.726975ms: failed early due to stalled resources: [Deployment/namespace/service-2 status: 'Failed']
service-3 7d1h False health check failed after 9m30.015478679s: timeout waiting for: [Deployment/namespace/service-3: 'NotFound']
service-4 4d8h False health check failed after 9m30.014903975s: timeout waiting for: [Deployment/namespace/service-4 status: 'NotFound']
service-5 6d9h False health check failed after 9m30.015001798s: timeout waiting for: [Deployment/namespace/service-5 status: 'NotFound']
service-6 52d Unknown Reconciliation in progress
service-7 7d1h False health check failed after 47.145335ms: failed early due to stalled resources: [Deployment/namespace/service-7 status: 'Failed']
service-8 7d1h Unknown Reconciliation in progress
service-9 4d8h Unknown Reconciliation in progress
service-10 52d False health check failed after 9m30.017330668s: timeout waiting for: [Deployment/namespace/service-10 status: 'NotFound']
service-11 73d False health check failed after 9m30.01314632s: timeout waiting for: [Deployment/namespace/service-11 status: 'NotFound']
All of them have health checks enabled, and all fail at the HelmRelease component (although I'm not sure why the Kustomization reports 3 services as 'Reconciliation in progress', since the HelmRelease fault there is a missing Helm chart, as for most of the others).
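To make the breakdown of the 11 failing Kustomizations easier to see, here is a minimal Python sketch that tallies the status messages from the listing above by failure type and sums the reported health check durations. The messages are shortened copies of the ones above, and the helper names (`go_duration_seconds`, `classify`, `summarize`) are made up for this illustration; they are not part of Flux. It only summarizes what the listing shows and does not prove that the timeouts cause the lag.

```python
import re
from collections import Counter

# (name, status message) pairs copied from the listing above, messages shortened.
STATUSES = [
    ("service-1", "health check failed after 44.855714ms: failed early due to stalled resources"),
    ("service-2", "health check failed after 78.726975ms: failed early due to stalled resources"),
    ("service-3", "health check failed after 9m30.015478679s: timeout waiting for"),
    ("service-4", "health check failed after 9m30.014903975s: timeout waiting for"),
    ("service-5", "health check failed after 9m30.015001798s: timeout waiting for"),
    ("service-6", "Reconciliation in progress"),
    ("service-7", "health check failed after 47.145335ms: failed early due to stalled resources"),
    ("service-8", "Reconciliation in progress"),
    ("service-9", "Reconciliation in progress"),
    ("service-10", "health check failed after 9m30.017330668s: timeout waiting for"),
    ("service-11", "health check failed after 9m30.01314632s: timeout waiting for"),
]

# Only the units that occur in the listing (plus hours) are handled here.
_UNIT_SECONDS = {"ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")


def go_duration_seconds(text):
    """Convert a Go-style duration such as '9m30.015s' or '44.8ms' to seconds."""
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in _PART.findall(text))


def classify(message):
    """Bucket a status message by the kind of failure it reports."""
    if "stalled resources" in message:
        return "stalled"
    if "timeout waiting" in message:
        return "timeout"
    if "in progress" in message:
        return "in progress"
    return "other"


def summarize(statuses):
    """Return (count per kind, total health check seconds per kind)."""
    counts, seconds = Counter(), Counter()
    for _name, message in statuses:
        kind = classify(message)
        counts[kind] += 1
        match = re.search(r"failed after (\S+?):", message)
        if match:
            seconds[kind] += go_duration_seconds(match.group(1))
    return counts, seconds


counts, seconds = summarize(STATUSES)
for kind in counts:
    print(f"{kind}: {counts[kind]} Kustomizations, {seconds[kind] / 60:.1f} min in health checks")
```

For the listing above this gives 3 stalled, 5 timed out and 3 in progress. The stalled ones fail within milliseconds, while the five timeouts add up to roughly 47.5 minutes of waiting in health checks.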
When I then created a new Kustomization to deploy cert-manager, I noticed that although source-controller had picked up the Git repository change, the Kustomization resource for cert-manager was not created. Without my changing anything, it finally appeared x minutes later, but even then it had no status:
kubectl -n flux-system describe kustomizations.kustomize.toolkit.fluxcd.io cert-manager
...
...
Status:
Observed Generation: -1
Events: <none>
Again x minutes later, it finally did the deployment and reported ReconciliationSucceeded. Here's the Kustomization yaml:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cert-manager
  namespace: flux-system
spec:
  interval: 10m
  targetNamespace: cert-manager
  sourceRef:
    kind: GitRepository
    name: infra-repo
  path: "./cert-manager"
  prune: true
  suspend: false
  patches:
    - patch: |-
        apiVersion: helm.toolkit.fluxcd.io/v2
        kind: HelmRelease
        metadata:
          name: cert-manager
        spec:
          releaseName: cert-manager
          chart:
            spec:
              version: "v1.19.1"
Note that I use spec.interval: 10m for all my Kustomizations and I don't use wait: anywhere. When I added more resources to the cert-manager Kustomization, it again didn't reconcile immediately after source-controller picked up the changes (perhaps it's not supposed to, because I made the changes in the infra-repo GitRepository, from which it reads the HelmRelease, and not in the flux-system cluster directory, where the Kustomization itself lives), and it certainly went well beyond the 10-minute interval:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ReconciliationSucceeded 35m kustomize-controller Reconciliation finished in 132.178339ms, next run in 10m0s
Normal Progressing 2m5s kustomize-controller ClusterIssuer/letsencrypt-production created
ClusterIssuer/letsencrypt-staging created
Normal ReconciliationSucceeded 2m5s kustomize-controller Reconciliation finished in 117.488053ms, next run in 10m0s
Next run in 10m, but it actually took 33m. This lag was persistent, not temporary:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Progressing 56m kustomize-controller ClusterIssuer/letsencrypt-production-dns created
ClusterIssuer/letsencrypt-staging-dns created
Normal ReconciliationSucceeded 56m kustomize-controller Reconciliation finished in 117.488053ms, next run in 10m0s
Normal ReconciliationSucceeded 37m kustomize-controller Reconciliation finished in 114.742958ms, next run in 10m0s
Normal ReconciliationSucceeded 24m kustomize-controller Reconciliation finished in 105.526995ms, next run in 10m0s
Normal ReconciliationSucceeded 5m31s kustomize-controller Reconciliation finished in 157.43214ms, next run in 10m0s
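The actual spacing between runs can be worked out from the Age column. Below is a minimal Python sketch that converts the ages of the ReconciliationSucceeded events in the listing above into gaps between consecutive runs. The function names are hypothetical, and since kubectl rounds ages, the gaps are only approximate.

```python
import re

_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def age_seconds(age):
    """Convert a kubectl-style age such as '5m31s' or '56m' to seconds."""
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([dhms])", age))


def gaps_between(ages):
    """Given ages listed oldest first, return the seconds between consecutive events."""
    secs = [age_seconds(a) for a in ages]
    return [older - newer for older, newer in zip(secs, secs[1:])]


# Ages of the ReconciliationSucceeded events in the listing above, oldest first.
ages = ["56m", "37m", "24m", "5m31s"]
for gap in gaps_between(ages):
    print(f"{gap // 60}m{gap % 60}s between runs")
```

With these ages the gaps come out at about 19m, 13m and 18m29s, all well above the configured 10m interval.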
I checked other, older Kustomizations and saw the same thing there: they were all lagging.
The next day I ran a test. First I timed manual reconciliations of the cert-manager Kustomization:
time flux -n flux-system reconcile kustomization cert-manager
► annotating Kustomization cert-manager in flux-system namespace
✔ Kustomization annotated
◎ waiting for Kustomization reconciliation
✔ applied revision main@sha1:d10978a476b2326b655574247aa938d18b884ab2
real 0m42.324s
user 0m0.068s
sys 0m0.043s
time flux -n flux-system reconcile kustomization cert-manager
► annotating Kustomization cert-manager in flux-system namespace
✔ Kustomization annotated
◎ waiting for Kustomization reconciliation
✔ applied revision main@sha1:d10978a476b2326b655574247aa938d18b884ab2
real 0m10.311s
user 0m0.044s
sys 0m0.033s
time flux -n flux-system reconcile kustomization cert-manager
► annotating Kustomization cert-manager in flux-system namespace
✔ Kustomization annotated
◎ waiting for Kustomization reconciliation
✔ applied revision main@sha1:d10978a476b2326b655574247aa938d18b884ab2
real 4m56.326s
user 0m0.238s
sys 0m0.079s
time flux -n flux-system reconcile kustomization cert-manager
► annotating Kustomization cert-manager in flux-system namespace
✔ Kustomization annotated
◎ waiting for Kustomization reconciliation
✗ context deadline exceeded
real 5m0.029s
user 0m0.213s
sys 0m0.105s
The lag varied from run to run; the last one took more than 5 minutes, so the CLI timed out.
Next I removed all 11 failing Kustomizations, then timed manual reconciliation again:
time flux -n flux-system reconcile kustomization cert-manager
► annotating Kustomization cert-manager in flux-system namespace
✔ Kustomization annotated
◎ waiting for Kustomization reconciliation
✔ applied revision main@sha1:d10978a476b2326b655574247aa938d18b884ab2
real 0m2.351s
user 0m0.046s
sys 0m0.018s
real 0m2.359s
real 0m2.525s
real 0m2.360s
real 0m2.313s
It consistently took a little over 2 seconds, and the automatic reconcile interval stuck to 10 minutes without lag:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ReconciliationSucceeded 51m kustomize-controller Reconciliation finished in 107.841779ms, next run in 10m0s
Normal ReconciliationSucceeded 42m kustomize-controller Reconciliation finished in 119.475433ms, next run in 10m0s
Normal ReconciliationSucceeded 32m kustomize-controller Reconciliation finished in 107.592005ms, next run in 10m0s
Normal ReconciliationSucceeded 22m kustomize-controller Reconciliation finished in 112.545199ms, next run in 10m0s
Normal ReconciliationSucceeded 12m kustomize-controller Reconciliation finished in 164.414134ms, next run in 10m0s
Normal ReconciliationSucceeded 2m27s kustomize-controller Reconciliation finished in 101.173028ms, next run in 10m0s
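To put the manual reconcile timings side by side, here is a small Python sketch that parses the `real` values reported by `time` in the runs above and summarizes them before and after removing the failing Kustomizations. The helper names are made up for the illustration. Note that the 5m0.029s run hit the CLI timeout, so the true duration of that run is unknown and the "with failing" numbers are a lower bound.

```python
import re
import statistics


def real_seconds(value):
    """Convert a `time` 'real' value such as '4m56.326s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unexpected time format: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def summarize(values):
    """Return (min, median, max) in seconds for a list of 'real' values."""
    secs = [real_seconds(v) for v in values]
    return min(secs), statistics.median(secs), max(secs)


# 'real' values copied from the runs above.
with_failing = ["0m42.324s", "0m10.311s", "4m56.326s", "5m0.029s"]
without_failing = ["0m2.351s", "0m2.359s", "0m2.525s", "0m2.360s", "0m2.313s"]

for label, values in (("with failing", with_failing), ("without failing", without_failing)):
    low, mid, high = summarize(values)
    print(f"{label}: min={low:.3f}s median={mid:.3f}s max={high:.3f}s")
```

The runs with the failing Kustomizations present range from about 10 seconds to the 5 minute timeout, while the runs after removing them all fall between 2.3 and 2.6 seconds.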
In the K8s cluster, Flux is bootstrapped at version 2.5.1. Unfortunately I cannot upgrade it (not for a long time), because the K8s version itself is 1.30.11 and upgrading that is completely out of my hands.
I also checked the resource usage of the pods while the issue was occurring, and it was well within the limits: CPU was basically idle and memory was nowhere near the limit.
Steps to reproduce
Have around 10 failing kustomization healthchecks.
Try manual reconcile of a working kustomization and see how long it takes.
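The timing step can be repeated with a small script. This is only a sketch: `time_command` is a hypothetical helper, and the command in `FLUX_RECONCILE` is the same `flux reconcile` invocation used earlier in this report, so the Kustomization name and namespace need adjusting for another cluster. Nothing is executed until `time_command` is called.

```python
import subprocess
import time


def time_command(cmd, runs=5, timeout=None):
    """Run cmd `runs` times; return a list of (exit_code, wall_seconds) per run.

    exit_code is None when a run exceeds `timeout` seconds.
    """
    results = []
    for _ in range(runs):
        start = time.monotonic()
        try:
            code = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            ).returncode
        except subprocess.TimeoutExpired:
            code = None
        results.append((code, time.monotonic() - start))
    return results


# The manual reconcile command timed earlier in this report.
FLUX_RECONCILE = ["flux", "-n", "flux-system", "reconcile", "kustomization", "cert-manager"]

# Example use (requires the flux CLI and cluster access):
#   for code, secs in time_command(FLUX_RECONCILE):
#       print(f"exit={code} real={secs:.3f}s")
```

Running it once with the failing Kustomizations present and once after removing them should show whether the same difference in wall time appears.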
Expected behavior
Kustomization reconciliations shouldn't lag even if there are failing kustomizations.
Screenshots and recordings
No response
OS / Distro
N/A
Flux version
v2.5.1
Flux check
► checking prerequisites
✗ flux 2.5.1 <2.7.3 (new CLI version is available, please upgrade)
✔ Kubernetes 1.30.11 >=1.30.0-0
► checking version in cluster
✔ distribution: flux-v2.5.1
✔ bootstrapped: true
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v1.2.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.5.1
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v1.5.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v1.5.0
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta3
✔ buckets.source.toolkit.fluxcd.io/v1
✔ gitrepositories.source.toolkit.fluxcd.io/v1
✔ helmcharts.source.toolkit.fluxcd.io/v1
✔ helmreleases.helm.toolkit.fluxcd.io/v2
✔ helmrepositories.source.toolkit.fluxcd.io/v1
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ providers.notification.toolkit.fluxcd.io/v1beta3
✔ receivers.notification.toolkit.fluxcd.io/v1
✔ all checks passed
Git provider
No response
Container Registry provider
No response
Additional context
No response
Code of Conduct
- I agree to follow this project's Code of Conduct