
Commit 492a8d6

[CONTP-730] fix(kubelet_listener): Retrieve updated pod entity before updating services. (#45118)
### What does this PR do?

In a small subset of cases (~3% of the time), the Agent would log a `WARN` saying it could not create a file tailer for a container because its parent pod is missing:

```
2024-10-25 10:25:35 UTC | CORE | WARN | (pkg/logs/launchers/container/tailerfactory/factory.go:95 in makeTailer) | Could not make file tailer for source container_collect_all (falling back to socket): cannot find pod for container "6f585560fac9d45127f20509c7c84d017126776573772f42e1bd45af59090e54": "6f585560fac9d45127f20509c7c84d017126776573772f42e1bd45af59090e54" not found
```

The warning stems from the Kubelet AD listener subscribing to `SourceAll` entity events in workloadmeta: short-lived pods that have already been terminated are still delivered as 'Set' events, so the Kubelet AD adds the pod and its since-deleted containers as services, which triggers a file tailer attempt. The Kubelet listener now fetches the latest `workloadmeta.KubernetesPod` entity instead of using the entity provided with the event, so it no longer adds container services for pod containers that have been deleted.

### Describe how you validated your changes

#### Reproduce the warning logs

1. Deploy Agent version <= 7.75 with container collect all:

   ```yaml
   datadog:
     logLevel: INFO
     autoscaling:
       workload:
         enabled: false
     operator:
       enabled: false
     clusterName: mathewe-log-tail
     secretBackend:
       command: "/readsecret_multiple_providers.sh"
     kubelet:
       tlsVerify: false
     logs:
       enabled: true
       containerCollectAll: true
     dogstatsd:
       nonLocalTraffic: true
       originDetection: true
       useSocketVolume: true
       tagCardinality: "high"
     envDict:
       DD_CHECKS_TAG_CARDINALITY: "high"
   ```

2. Install istio: https://github.com/DataDog/sandbox/blob/c21fca7035e60951372fbd10cef921af810509a7/apm/kubernetes/Istio/python-flask/install.sh

3. Deploy the test job workload:

   <details><summary>istio-sidecar-cronjob-test-repro.yaml</summary>
   <p>

   ```yaml
   # Scheduled once per minute
   apiVersion: batch/v1
   kind: CronJob
   metadata:
     name: test-cronjob
     namespace: test-istio
   spec:
     schedule: "*/1 * * * *"
     jobTemplate:
       spec:
         template:
           metadata:
             annotations:
               sidecar.istio.io/inject: "true"
           spec:
             containers:
               - name: my-container
                 image: curlimages/curl
                 imagePullPolicy: Always
                 command: [ "/bin/sh", "-c", "--" ]
                 args: [ "for i in `seq 1 10` ; do sleep 1.0; echo `date` example stdout log $i; done; curl http://localhost:15000/quitquitquit -X POST" ]
             restartPolicy: OnFailure
   ---
   # Scheduled once per minute (offset 15s)
   apiVersion: batch/v1
   kind: CronJob
   metadata:
     name: test-cronjob-2
     namespace: test-istio
   spec:
     schedule: "*/1 * * * *"
     jobTemplate:
       spec:
         template:
           metadata:
             annotations:
               sidecar.istio.io/inject: "true"
           spec:
             containers:
               - name: my-container
                 image: curlimages/curl
                 imagePullPolicy: Always
                 command: [ "/bin/sh", "-c", "--" ]
                 args: [ "sleep 15; for i in `seq 1 10` ; do sleep 1.0; echo `date` example stdout log $i; done; curl http://localhost:15000/quitquitquit -X POST" ]
             restartPolicy: OnFailure
   ---
   # Scheduled once per minute (offset 30s)
   apiVersion: batch/v1
   kind: CronJob
   metadata:
     name: test-cronjob-3
     namespace: test-istio
   spec:
     schedule: "*/1 * * * *"
     jobTemplate:
       spec:
         template:
           metadata:
             annotations:
               sidecar.istio.io/inject: "true"
           spec:
             containers:
               - name: my-container
                 image: curlimages/curl
                 imagePullPolicy: Always
                 command: [ "/bin/sh", "-c", "--" ]
                 args: [ "sleep 30; for i in `seq 1 10` ; do sleep 1.0; echo `date` example stdout log $i; done; curl http://localhost:15000/quitquitquit -X POST" ]
             restartPolicy: OnFailure
   ```

   </p>
   </details>

4. See warning logs after running for several minutes (it may take an hour or so).

   <img width="1856" height="870" alt="image" src="https://github.com/user-attachments/assets/269148eb-e9e2-4e3a-bfe8-7e8f40e1dd6f" />

#### Deploy the fixed Agent

1. Build & deploy the fixed Agent:

   ```yaml
   agents:
     image:
       repository: "agent"
       tag: "fix-3"
       doNotCheckTag: true
   ```

2. See the warning logs stop:

   <img width="1685" height="397" alt="image" src="https://github.com/user-attachments/assets/05756d29-f46a-4625-86c7-f5c1d09b6988" />

### Additional Notes

Co-authored-by: mathew.estafanous <mathew.estafanous@datadoghq.com>
1 parent a436c68 commit 492a8d6

File tree

2 files changed: +16 −1 lines


comp/core/autodiscovery/listeners/kubelet.go

Lines changed: 10 additions & 1 deletion

```diff
@@ -64,7 +64,16 @@ func NewKubeletListener(options ServiceListernerDeps) (ServiceListener, error) {
 }
 
 func (l *KubeletListener) processPod(entity workloadmeta.Entity) {
-	pod := entity.(*workloadmeta.KubernetesPod)
+	// Fetch the pod from the workloadmeta store to get the most up-to-date state.
+	// Handling cases where a pod deletion is reported as a 'Set' event due to
+	// delayed updates from multiple workloadmeta sources. If the pod has been deleted,
+	// its containers will be missing from the store, preventing stale container services
+	// from being created.
+	pod, err := l.Store().GetKubernetesPod(entity.GetID().ID)
+	if err != nil || pod == nil {
+		log.Debugf("Failed to get kubernetes pod from workloadmeta store, using pod from event")
+		pod = entity.(*workloadmeta.KubernetesPod)
+	}
 
 	wlmContainers := pod.GetAllContainers()
 	containers := make([]*workloadmeta.Container, 0, len(wlmContainers))
```
Lines changed: 6 additions & 0 deletions

```diff
@@ -0,0 +1,6 @@
+fixes:
+  - |
+    Fixes repetitive 'Could not make file tailer' warning logs when short-lived
+    pods are terminated and the Agent attempts to create a file tailer for the
+    deleted containers in a pod. Now the Agent will not create container services
+    for pods that have been deleted and no longer have containers to tail.
```
