
Commit 492a8d6

[CONTP-730] fix(kubelet_listener): Retrieve updated pod entity before updating services. (#45118)
### What does this PR do?

In a small subset of cases (~3% of the time), the Agent would log a `WARN` saying it could not create a file tailer for a container because its parent pod is missing:

```
2024-10-25 10:25:35 UTC | CORE | WARN | (pkg/logs/launchers/container/tailerfactory/factory.go:95 in makeTailer) | Could not make file tailer for source container_collect_all (falling back to socket): cannot find pod for container "6f585560fac9d45127f20509c7c84d017126776573772f42e1bd45af59090e54": "6f585560fac9d45127f20509c7c84d017126776573772f42e1bd45af59090e54" not found
```

The warning stems from the Kubelet AD listener subscribing to `SourceAll` entity events in workloadmeta: short-lived pods that have already been terminated are still delivered as 'Set' events, so the Kubelet AD adds the pod and its since-deleted containers as services, which triggers a file tailer attempt. The Kubelet listener now fetches the latest `workloadmeta.KubernetesPod` entity instead of using the entity provided with the event, so it no longer adds container services for pod containers that have been deleted.

### Describe how you validated your changes

#### Reproduce the warning logs

1. Deploy Agent version <= 7.75 with container collect all:

   ```yaml
   datadog:
     logLevel: INFO
     autoscaling:
       workload:
         enabled: false
     operator:
       enabled: false
     clusterName: mathewe-log-tail
     secretBackend:
       command: "/readsecret_multiple_providers.sh"
     kubelet:
       tlsVerify: false
     logs:
       enabled: true
       containerCollectAll: true
     dogstatsd:
       nonLocalTraffic: true
       originDetection: true
       useSocketVolume: true
       tagCardinality: "high"
     envDict:
       DD_CHECKS_TAG_CARDINALITY: "high"
   ```

2. Install istio: https://github.com/DataDog/sandbox/blob/c21fca7035e60951372fbd10cef921af810509a7/apm/kubernetes/Istio/python-flask/install.sh

3. Deploy the test job workload:

   <details><summary>istio-sidecar-cronjob-test-repro.yaml</summary>
   <p>

   ```yaml
   # Scheduled once per minute
   apiVersion: batch/v1
   kind: CronJob
   metadata:
     name: test-cronjob
     namespace: test-istio
   spec:
     schedule: "*/1 * * * *"
     jobTemplate:
       spec:
         template:
           metadata:
             annotations:
               sidecar.istio.io/inject: "true"
           spec:
             containers:
               - name: my-container
                 image: curlimages/curl
                 imagePullPolicy: Always
                 command: [ "/bin/sh", "-c", "--" ]
                 args: [ "for i in `seq 1 10` ; do sleep 1.0; echo `date` example stdout log $i; done; curl http://localhost:15000/quitquitquit -X POST" ]
             restartPolicy: OnFailure
   ---
   # Scheduled once per minute (offset 15s)
   apiVersion: batch/v1
   kind: CronJob
   metadata:
     name: test-cronjob-2
     namespace: test-istio
   spec:
     schedule: "*/1 * * * *"
     jobTemplate:
       spec:
         template:
           metadata:
             annotations:
               sidecar.istio.io/inject: "true"
           spec:
             containers:
               - name: my-container
                 image: curlimages/curl
                 imagePullPolicy: Always
                 command: [ "/bin/sh", "-c", "--" ]
                 args: [ "sleep 15; for i in `seq 1 10` ; do sleep 1.0; echo `date` example stdout log $i; done; curl http://localhost:15000/quitquitquit -X POST" ]
             restartPolicy: OnFailure
   ---
   # Scheduled once per minute (offset 30s)
   apiVersion: batch/v1
   kind: CronJob
   metadata:
     name: test-cronjob-3
     namespace: test-istio
   spec:
     schedule: "*/1 * * * *"
     jobTemplate:
       spec:
         template:
           metadata:
             annotations:
               sidecar.istio.io/inject: "true"
           spec:
             containers:
               - name: my-container
                 image: curlimages/curl
                 imagePullPolicy: Always
                 command: [ "/bin/sh", "-c", "--" ]
                 args: [ "sleep 30; for i in `seq 1 10` ; do sleep 1.0; echo `date` example stdout log $i; done; curl http://localhost:15000/quitquitquit -X POST" ]
             restartPolicy: OnFailure
   ```

   </p>
   </details>

4. See warning logs after running for several minutes (it may take an hour or so).

   <img width="1856" height="870" alt="image" src="https://github.com/user-attachments/assets/269148eb-e9e2-4e3a-bfe8-7e8f40e1dd6f" />

#### Deploy the fixed Agent

1. Build & deploy the fixed Agent:

   ```yaml
   agents:
     image:
       repository: "agent"
       tag: "fix-3"
       doNotCheckTag: true
   ```

2. See the warning logs stop:

   <img width="1685" height="397" alt="image" src="https://github.com/user-attachments/assets/05756d29-f46a-4625-86c7-f5c1d09b6988" />

### Additional Notes

Co-authored-by: mathew.estafanous <mathew.estafanous@datadoghq.com>
1 parent a436c68 commit 492a8d6

File tree

2 files changed: +16 −1 lines


comp/core/autodiscovery/listeners/kubelet.go

Lines changed: 10 additions & 1 deletion

```diff
@@ -64,7 +64,16 @@ func NewKubeletListener(options ServiceListernerDeps) (ServiceListener, error) {
 }
 
 func (l *KubeletListener) processPod(entity workloadmeta.Entity) {
-	pod := entity.(*workloadmeta.KubernetesPod)
+	// Fetch the pod from the workloadmeta store to get the most up-to-date state.
+	// Handling cases where a pod deletion is reported as a 'Set' event due to
+	// delayed updates from multiple workloadmeta sources. If the pod has been deleted,
+	// its containers will be missing from the store, preventing stale container services
+	// from being created.
+	pod, err := l.Store().GetKubernetesPod(entity.GetID().ID)
+	if err != nil || pod == nil {
+		log.Debugf("Failed to get kubernetes pod from workloadmeta store, using pod from event")
+		pod = entity.(*workloadmeta.KubernetesPod)
+	}
 
 	wlmContainers := pod.GetAllContainers()
 	containers := make([]*workloadmeta.Container, 0, len(wlmContainers))
```
Lines changed: 6 additions & 0 deletions

```diff
@@ -0,0 +1,6 @@
+fixes:
+  - |
+    Fixes repetitive 'Could not make file tailer' warning logs when short-lived
+    pods are terminated and the Agent attempts to create a file tailer for the
+    deleted containers in a pod. Now the Agent will not create container services
+    for pods that have been deleted and no longer have containers to tail.
```
