Description
Describe the bug
For the service discovery K8sPodIPServiceDiscovery, the self.available_engine variable sometimes does not get updated: it still holds entries corresponding to pods that do not exist anymore. I call them ZombieIPs.
The consequence: some requests are routed to these deleted pods. In production, we have had router pods that, for days, were routing requests to model pods that no longer existed.
To Reproduce
How to reproduce the issue
For this, I needed to deploy custom code.
Modifications to make to the vllm router code:
```diff
--- a/src/vllm_router/service_discovery.py
+++ b/src/vllm_router/service_discovery.py
@@ -386,6 +386,7 @@ class K8sPodIPServiceDiscovery(ServiceDiscovery):
         self.watcher_thread.start()
         self.prefill_model_labels = prefill_model_labels
         self.decode_model_labels = decode_model_labels
+        self.failing_counter = 0

     @staticmethod
     def _check_pod_ready(container_statuses):
@@ -502,7 +503,12 @@ class K8sPodIPServiceDiscovery(ServiceDiscovery):
         url = f"http://{pod_ip}:{self.port}/v1/models"
         try:
             headers = None
+            self.failing_counter += 1
+            logger.info(f"{self.failing_counter=}")
             if VLLM_API_KEY := os.getenv("VLLM_API_KEY"):
+                if self.failing_counter > 3:
+                    time.sleep(60)
+                    VLLM_API_KEY = "wrong_key_jklkjlkj"
                 logger.info("Using vllm server authentication")
                 headers = {"Authorization": f"Bearer {VLLM_API_KEY}"}
             response = requests.get(url, headers=headers)
@@ -570,17 +576,23 @@ class K8sPodIPServiceDiscovery(ServiceDiscovery):
         while self.running:
             try:
+                logger.info(f"K8s watcher started{self.get_endpoint_info()}")
+                logger.info("time out is 30")
+                logger.info("Jon latest version v2")
                 for event in self.k8s_watcher.stream(
                     self.k8s_api.list_namespaced_pod,
                     namespace=self.namespace,
                     label_selector=self.label_selector,
                     timeout_seconds=30,
                 ):
+
                     pod = event["object"]
                     event_type = event["type"]
                     pod_name = pod.metadata.name
                     pod_ip = pod.status.pod_ip
+                    logger.info(f"pod_name: {pod_name} pod_ip: {pod_ip} event_type: {event_type}")
+
                     # Check if pod is terminating
                     is_pod_terminating = self._is_pod_terminating(pod)
                     is_container_ready = self._check_pod_ready(
```
Modification explanation:
After a few successful calls to /v1/models, I introduce a "blocking" call that takes more than 30 seconds (I set it to 60) and finally fails (due to a wrong vLLM API key).
The YAML values file to use:
```yaml
servingEngineSpec:
  vllmApiKey: "fake_key"
  runtimeClassName: ""
  modelSpec:
    - name: "opt125m"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "facebook/opt-125m"
      replicaCount: 2
      vllmConfig:
        extraArgs:
          - "--chat-template=./examples/template_chatml.jinja"
          - "--disable-log-requests"

routerSpec:
  repository: "git-act-router" # Use the image we just built
  imagePullPolicy: "IfNotPresent"
  strategy:
    type: Recreate
  enableRouter: true
  serviceDiscovery: "k8s" # This enables K8sPodIPServiceDiscovery
  routingLogic: "roundrobin"
  extraArgs:
    - "--log-level"
    - "debug"
```
Note the vllmApiKey: "fake_key"
Step-by-step issue reproduction
- Deploy the helm chart
- Monitor the logs of the router
At first, everything will look normal. After 3 iterations of pod discovery, you will start seeing this error log:
[2025-08-20 10:30:44,354] ERROR: Failed to get model names from http://10.244.1.80:8000/v1/models: 401 Client Error: Unauthorized for url: http://10.244.1.80:8000/v1/models (service_discovery.py:527:vllm_router.service_discovery)
This is because the VLLM_API_KEY became invalid.
- Force delete a model pod: run kubectl delete pod POD_NAME --grace-period=0 --force
- Monitor the logs of the router
You will see metric-scraping errors like this:
[2025-08-20 10:32:38,681] ERROR: Failed to scrape metrics from http://10.244.1.81:8000: HTTPConnectionPool(host='10.244.1.81', port=8000): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x735cb0cea6c0>: Failed to establish a new connection: [Errno 113] No route to host')) (engine_stats.py:131:vllm_router.stats.engine_stats)
And these errors keep repeating.
If you log self.get_endpoint_info(), you will see that the entry corresponding to the deleted pod still appears! This is the ZombieIP.
Why the errors are raised
Main cause: there is a blocking call within the Kubernetes watcher that hits the /v1/models endpoint of the model pod. This blocking call can pause consumption of the Kubernetes watch stream.
If an event occurs in Kubernetes during this pause, like the deletion of a pod -> it will NEVER be registered.
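To make this concrete, here is a minimal sketch of the problematic pattern, assuming the standard kubernetes Python client; the namespace, label selector and port are placeholders, and this is not the actual router code:

```python
# Minimal sketch of the problematic pattern: a synchronous HTTP call performed
# while iterating the watch stream. Not the actual router code; namespace,
# label selector and port are placeholders.
from kubernetes import client, config, watch
import requests

config.load_incluster_config()
v1 = client.CoreV1Api()
w = watch.Watch()

while True:
    # No resource_version: every stream() call starts a brand-new watch that
    # only reports the pods existing *right now*, always as ADDED events.
    for event in w.stream(
        v1.list_namespaced_pod,
        namespace="default",
        label_selector="app=vllm",
        timeout_seconds=30,
    ):
        pod = event["object"]
        # Blocking call: while this request hangs, no further events are read
        # from the stream. If the 30 s watch expires during the hang, a DELETED
        # event that happened in the meantime can be lost, and the next fresh
        # watch has no way to report it (the pod simply is not there anymore).
        requests.get(f"http://{pod.status.pod_ip}:8000/v1/models")
```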
Why the ERROR: Failed to get model names is raised and what the consequences are
The Kubernetes watcher re-watches the Kubernetes objects every 30 seconds, for 30 seconds at a time. There is no resource_version set in this watcher (see https://github.com/kubernetes-client/python/blob/master/kubernetes/base/watch/watch.py#L136), so every watch starts "fresh": the watcher treats all the objects it detects as new, and the events emitted by the watch are therefore always ADDED, even though the pod was already registered in the available_engine object.
The logs that prove it are available here
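For contrast, here is a minimal sketch (same assumptions and placeholders as above, not the actual router code) of what seeding the watch with a resource_version would change:

```python
# Minimal sketch: seeding the watch with a resource_version so that consecutive
# watches form one continuous stream instead of starting "fresh" every 30 s.
from kubernetes import client, config, watch

config.load_incluster_config()
v1 = client.CoreV1Api()

# An initial list gives us a resourceVersion to start watching from.
pods = v1.list_namespaced_pod(namespace="default", label_selector="app=vllm")
resource_version = pods.metadata.resource_version

w = watch.Watch()
while True:
    for event in w.stream(
        v1.list_namespaced_pod,
        namespace="default",
        label_selector="app=vllm",
        resource_version=resource_version,
        timeout_seconds=30,
    ):
        # Events that occurred between watches (including DELETED) are replayed
        # here, instead of every existing pod showing up again as ADDED.
        resource_version = event["object"].metadata.resource_version
        print(event["type"], event["object"].metadata.name)
```

Note that a stored resource_version can expire (the API server answers 410 Gone), in which case the client has to re-list and resume from the fresh resourceVersion, so a real fix would need to handle that case too.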
When the /v1/models call starts being slow and raises an error:
- the function _get_model_names returns an empty list -> see here
- as the event is ADDED, no update is done on the available_engine object -> see here

Therefore, while the pod is up, we will see a loop of ERROR: Failed to get model names logs, AND some traffic will still be routed to this pod, as it stays registered in the available_engine object -> leading to failures 🤯
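Put differently, here is a heavily simplified paraphrase of the behaviour described in the two bullets above; the function and variable names are illustrative, not the router's actual implementation:

```python
# Heavily simplified paraphrase of the behaviour described above; names and
# structure are illustrative, not the router's actual implementation.
def get_model_names(pod_ip: str) -> list:
    # Stand-in for the real /v1/models call: returns [] when the call fails.
    return []

def on_watch_event(event_type: str, pod_ip: str, available_engines: dict) -> None:
    model_names = get_model_names(pod_ip)
    if event_type == "ADDED":
        if not model_names:
            # Nothing to add, so we return early. Crucially, an existing
            # (now stale) entry for this pod is left untouched.
            return
        available_engines[pod_ip] = model_names
    elif event_type == "DELETED":
        # This branch would remove the entry, but it is never reached when the
        # DELETED event was missed while the stream was blocked.
        available_engines.pop(pod_ip, None)
```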
Why ERROR: Failed to scrape metrics is raised even when the pod has been deleted
Following the error above, one might think: "I just need to delete the pod and traffic will no longer be routed to it".
That's the worst thing to do :skull_crossbones: if there are more than 2 faulty pods.
The blocking call stops the consumption of the Kubernetes watch stream. Put simply: if a pod is deleted and completely disappears WHILE the watch stream is blocked, its deletion never gets "watched" by the Kubernetes watch stream. The next watch loop will not see that this pod ever existed, nor that it no longer exists.
The problem: the self.available_engine object is only updated within the watches. If a pod event is not detected by the watch, the corresponding entry in self.available_engine never gets updated.
That is why, if the pod gets deleted while the watch stream is blocked, its deletion is never registered, so self.available_engine is never updated and will keep holding the entry for the deleted pod.
The consequence:
- The stats scraper will keep trying to scrape metrics from the pod: we will see the ERROR: Failed to scrape metrics log forever
- Some requests will still be routed to the deleted pod, as the router code uses self.available_engine as the source of truth for routing requests -> see here, where the router routes the request to a pod IP that doesn't exist anymore
Expected behavior
The self.available_engine object should stay up to date with the pods that actually exist. Some ideas to fix this bug:
- Remove the blocking call from the k8s watch loop -> make the calls asynchronous, as suggested in bug: Router Health Check has Synchronously blocking calls (will fail liveness on engine scaling) #431
- Pass a resource_version to the k8s watch loop
- Create another mechanism that checks the healthiness of the self.available_engine object outside of the watch loop: check that the pod names present in self.available_engine correspond to pods that actually exist (see the sketch after this list)
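To illustrate the third idea, here is a rough sketch of such a reconciliation loop; available_engines, the lock, the namespace and the label selector are illustrative placeholders rather than the router's actual attributes:

```python
# Rough sketch of the third idea: a background thread that periodically
# re-lists the pods and prunes available_engines entries whose pod is gone.
# All names here are illustrative placeholders.
import threading
import time
from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()

def reconcile_loop(available_engines: dict, lock: threading.Lock,
                   namespace: str, label_selector: str, interval: int = 30) -> None:
    while True:
        time.sleep(interval)
        try:
            pods = v1.list_namespaced_pod(namespace=namespace,
                                          label_selector=label_selector)
            live_ips = {p.status.pod_ip for p in pods.items if p.status.pod_ip}
            with lock:
                for ip in list(available_engines):
                    if ip not in live_ips:
                        # ZombieIP: the pod behind this entry does not exist anymore.
                        del available_engines[ip]
        except Exception:
            # Never let a failed reconciliation kill the loop; retry next tick.
            continue

# Example wiring (placeholders):
# threading.Thread(target=reconcile_loop,
#                  args=(engines, engines_lock, "default", "app=vllm"),
#                  daemon=True).start()
```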
Additional context
This is related to #431
Please feel free to ask for more examples and suggestions for fixing the issue 🔥 🚒 🧑‍🚒