Description:
In cases where a large number of pods are launched under the same ComputeDomain, we are observing slow pod startup times due to conflicts and delays in updating the CD status. This issue occurs because the ComputeDomain daemonset pods need to calculate their DNS index and update their Ready state in the CD status, leading to frequent update conflicts (HTTP 409 responses from the API server) and slow convergence.
Even after implementing the solution from NVIDIA/k8s-dra-driver-gpu#810, which moves the status processing from the ComputeDomain daemonset pod to the CD controller, the pod startup time is still slow.
Scenario 1:
- Create a ComputeDomain resource.
- Create a Job using the resourceClaimTemplate from the ComputeDomain, with 256 pods.
- The job pods take about 8 minutes to start.
The lifecycle of the slowest pod is as follows:
- 09:17:32 - Job created (Ready time: 2026-01-14T09:25:15)
- 09:17:33 - Slowest pod in the Job is created (enters Running at 09:25:09)
- 09:17:38 - ComputeDomain daemonset pod created on the same node as the slowest pod
- 09:20:42 - CD controller allocates the DNS index for the daemonset pod and updates the CD status
- 09:22:15 - CD daemonset pod receives its DNS index and updates the /etc/hosts
- 09:22:17 - CD daemonset pod is Ready
- 09:23:17 - CD controller processes the daemonset pod as Ready and updates the node status in the CD status
- 09:23:44 - Job pod detects the CD daemonset pod as Ready via DRA plugin and begins pulling images
- 09:25:09 - Job pod completes pulling the image
- 09:25:09 - Job pod enters Running state
- 09:25:15 - Job pod reaches Ready state
Key Time Delays:
- p1: 3 → 4: CD controller allocates the DNS index for the daemonset pod: this takes 3 minutes.
Cause: The CD controller starts processing the first pod event for this trial only at 09:18:53, after working through unrelated tasks in the workQueue. Pod events are added to the workQueue without a key like (podNamespace/podName), and since tasks are processed serially without deduplication by pod name, the queue builds up and causes delays.
- p2: 4 → 6: CD daemonset pod receives its DNS index and updates the hosts file: this takes 90 seconds.
Cause: The CD daemonset pod likewise adds every event to updatedNodesChan and processes cd.Status.Nodes serially, causing task buildup. There may also be a bug in the maps.Equal logic: the IP set appears unchanged, yet the update logic is still executed.
- p3: 7 → 8: CD daemonset pod's Ready state is processed by the CD controller: this takes 60 seconds.
Cause: workQueue buildup delays processing of the Ready-state update.
- p4: 9 → 10: Image pull by the business pod: this takes 90 seconds.
Scenario 2:
Another Job and ComputeDomain with 256 pods are created at 09:22:13, but the related ComputeDomain daemonset is not created until 09:23:16.
Cause: ComputeDomain events and pod events are added to the same work queue, and the queue is dominated by pod-event work, so the ComputeDomain events have to wait their turn to be processed.
Proposed Fix for s1 (p1, p3) and s2:
The CD controller should work like this:
- Prepare a workQueue for ComputeDomain objects only, and enqueue each ComputeDomain with a key like (cdNamespace/cdName).
- Watch Pod events:
  AddEventHandler: get the ComputeDomain of the current pod and add the ComputeDomain key to the workQueue.
- Watch ComputeDomain events:
  AddEventHandler: add the ComputeDomain key to the workQueue.
- Process the workQueue:
  1. List all pods associated with the current CD, compute the DNS indices and node status for that CD, and update the CD status if needed.
  2. There is no conflict between processing different CD resources, so the workQueue can handle them concurrently (e.g., with 10 workers), which also solves the problem in Scenario 2.
Proposed Fix for s1 (p2):
- We've observed that for a CD DaemonSet pod, the IMEX readiness check only passes after the pod has obtained its own DNS index and written it to the hosts file; until then the check command keeps failing because the pod cannot reach its own IP. Therefore, the condition for a CD DaemonSet pod to become Ready is that it has obtained its own DNS index from the CD status.
- In the if !maps.Equal(newIPs, previousIPs) check, there may be a bug: the logs show the IPs have not changed, yet the "IP set changed" path was still executed.
TODO: Is updatedNodesChan necessary at all? A queue recording CD change events would suffice, and the hosts file could be updated directly during processing.