[Bug][Router] ZombieIP: router routes to model pod that doesn't exist anymore #656

@jonoillar

Description

Describe the bug

For the K8sPodIPServiceDiscovery service discovery, the self.available_engine variable sometimes does not get updated: it still holds entries corresponding to pods that do not exist anymore. I call these ZombieIPs.

The consequence: some requests are routed to these deleted pods. In production, we've had a router pod that was, for days, routing requests to model pods that no longer existed.

To Reproduce

How to reproduce the issue

To reproduce this, I needed to deploy custom code.
Modification to the vllm router code:

--- a/src/vllm_router/service_discovery.py
+++ b/src/vllm_router/service_discovery.py
@@ -386,6 +386,7 @@ class K8sPodIPServiceDiscovery(ServiceDiscovery):
         self.watcher_thread.start()
         self.prefill_model_labels = prefill_model_labels
         self.decode_model_labels = decode_model_labels
+        self.failing_counter = 0

     @staticmethod
     def _check_pod_ready(container_statuses):
@@ -502,7 +503,12 @@ class K8sPodIPServiceDiscovery(ServiceDiscovery):
         url = f"http://{pod_ip}:{self.port}/v1/models"
         try:
             headers = None
+            self.failing_counter +=1
+            logger.info(f"{self.failing_counter=}")
             if VLLM_API_KEY := os.getenv("VLLM_API_KEY"):
+                if self.failing_counter > 3:
+                    time.sleep(60)
+                    VLLM_API_KEY = "wrong_key_jklkjlkj"
                 logger.info("Using vllm server authentication")
                 headers = {"Authorization": f"Bearer {VLLM_API_KEY}"}
             response = requests.get(url, headers=headers)
@@ -570,17 +576,23 @@ class K8sPodIPServiceDiscovery(ServiceDiscovery):

         while self.running:
             try:
+                logger.info(f"K8s watcher started{self.get_endpoint_info()}")
+                logger.info("time out is 30")
+                logger.info("Jon latest version v2")
                 for event in self.k8s_watcher.stream(
                     self.k8s_api.list_namespaced_pod,
                     namespace=self.namespace,
                     label_selector=self.label_selector,
                     timeout_seconds=30,
                 ):
+
                     pod = event["object"]
                     event_type = event["type"]
                     pod_name = pod.metadata.name
                     pod_ip = pod.status.pod_ip

+                    logger.info(f"pod_name: {pod_name} pod_ip: {pod_ip} event_type: {event_type}")
+
                     # Check if pod is terminating
                     is_pod_terminating = self._is_pod_terminating(pod)
                     is_container_ready = self._check_pod_ready(

Modification explanation:
After a few successful calls to /v1/models, I introduce a "blocking" call that takes more than 30 seconds (I set it to 60) and finally fails (due to a wrong vLLM API key). Note that the time module must be imported in service_discovery.py for the time.sleep(60) call to work.

The yaml file to use for values:

servingEngineSpec:
  vllmApiKey: "fake_key"
  runtimeClassName: ""
  modelSpec:
  - name: "opt125m"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-125m"

    replicaCount: 2

    vllmConfig:
      extraArgs:
        - "--chat-template=./examples/template_chatml.jinja"
        - "--disable-log-requests"

routerSpec:
  repository: "git-act-router"  # Use the image we just built
  imagePullPolicy: "IfNotPresent"
  strategy:
    type: Recreate
  enableRouter: true
  serviceDiscovery: "k8s"  # This enables K8sPodIPServiceDiscovery
  routingLogic: "roundrobin"
  extraArgs:
    - "--log-level"
    - "debug"

Note the vllmApiKey: "fake_key"

Step by step issue reproduction

  1. Deploy the helm chart
  2. Monitor the logs of the router

At first, everything looks normal. After 3 iterations of pod discovery, you will start seeing this error log:

[2025-08-20 10:30:44,354] ERROR: Failed to get model names from http://10.244.1.80:8000/v1/models: 401 Client Error: Unauthorized for url: http://10.244.1.80:8000/v1/models (service_discovery.py:527:vllm_router.service_discovery)

This is because the VLLM_API_KEY has become invalid.

  3. Force-delete a model pod: run kubectl delete pod POD_NAME --grace-period=0 --force
  4. Monitor the logs of the router

You will see errors in scraping metrics like this:

[2025-08-20 10:32:38,681] ERROR: Failed to scrape metrics from http://10.244.1.81:8000: HTTPConnectionPool(host='10.244.1.81', port=8000): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x735cb0cea6c0>: Failed to establish a new connection: [Errno 113] No route to host')) (engine_stats.py:131:vllm_router.stats.engine_stats)

And these errors keep repeating.

If you log self.get_endpoint_info(), you will see that the entry corresponding to the deleted pod still appears! This is the ZombieIP.

Why the errors are raised

Main cause: there is a blocking call inside the Kubernetes watcher that hits the /v1/models endpoint of the model pod. This blocking call can pause consumption of the Kubernetes watch stream.

If an event occurs in Kubernetes during this pause, such as the deletion of a pod, it will NEVER be registered.
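
Below is a minimal sketch of the watcher loop structure (simplified, with illustrative names such as handle_event; not the literal router code) showing why a slow call inside the event loop blocks stream consumption:

from kubernetes import client, config, watch

def handle_event(event_type, pod):
    # Placeholder for the real handler, which in the router synchronously
    # calls the pod's /v1/models endpoint before updating the engine registry.
    ...

def watch_pods(namespace, label_selector):
    config.load_incluster_config()
    api = client.CoreV1Api()
    w = watch.Watch()
    while True:
        # Events are pulled from the API server one at a time; the next event
        # is only consumed once the body of this loop returns.
        for event in w.stream(
            api.list_namespaced_pod,
            namespace=namespace,
            label_selector=label_selector,
            timeout_seconds=30,
        ):
            # Anything slow here (e.g. the 60-second sleep from the repro patch)
            # blocks the whole stream: a DELETED event for another pod is never
            # consumed, and once timeout_seconds expires the stream is torn down
            # and restarted without that event ever being processed.
            handle_event(event["type"], event["object"])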

Why the ERROR: Failed to get model names is raised and what the consequences are

The Kubernetes watcher re-watches the Kubernetes objects every 30 seconds, for 30 seconds at a time. No resource_version is set in this watcher (see https://github.com/kubernetes-client/python/blob/master/kubernetes/base/watch/watch.py#L136), so every watch is "new": the watcher treats all detected objects as new, and the events delivered by the watch are always ADDED, even though the pod was already registered in the available_engine object.

The logs that prove this are available here.
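
For comparison, here is a hedged sketch of how resource_version changes what the watch reports (it assumes the standard kubernetes Python client, uses the default namespace, and omits the 410 Gone handling a real implementation would need). Without resource_version every 30-second stream starts fresh and replays existing pods as ADDED; passing the resource_version of an initial list makes the stream resume and deliver only real deltas:

from kubernetes import client, config, watch

config.load_incluster_config()
api = client.CoreV1Api()
w = watch.Watch()

# Current behaviour: no resource_version, so every 30-second stream replays
# all existing pods as ADDED events before reporting anything new.
for event in w.stream(api.list_namespaced_pod, namespace="default",
                      timeout_seconds=30):
    print(event["type"], event["object"].metadata.name)

# Resuming instead of replaying: list once, remember the resourceVersion, and
# hand it to the stream so only ADDED/MODIFIED/DELETED deltas are delivered.
pods = api.list_namespaced_pod(namespace="default")
for event in w.stream(api.list_namespaced_pod, namespace="default",
                      resource_version=pods.metadata.resource_version,
                      timeout_seconds=30):
    print(event["type"], event["object"].metadata.name)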

When the /v1/models call starts being slow and eventually raises an error:

  • the function _get_model_names returns an empty list -> see here
  • as the event is ADDED, no update is made to the available_engine object -> see here

Therefore, while the pod is up, we will see a loop of ERROR: Failed to get model names logs, AND some traffic will still be routed to this pod, as it stays registered in the available_engine object -> leading to failures 🤯
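
A simplified illustration of that failure mode (toy code with made-up names like on_event; not the actual router implementation): on an ADDED event the registry entry is only refreshed when /v1/models succeeds, so a pod whose call keeps failing remains registered with its stale entry.

import requests

available_engine = {}  # pod_name -> endpoint info used for routing

def get_model_names(pod_ip, port=8000):
    try:
        resp = requests.get(f"http://{pod_ip}:{port}/v1/models", timeout=5)
        resp.raise_for_status()
        return [m["id"] for m in resp.json()["data"]]
    except requests.RequestException:
        return []  # mirrors _get_model_names returning an empty list on error

def on_event(event_type, pod_name, pod_ip):
    if event_type == "ADDED":
        model_names = get_model_names(pod_ip)
        if not model_names:
            return  # no update: a stale entry, if present, survives untouched
        available_engine[pod_name] = {"ip": pod_ip, "models": model_names}
    elif event_type == "DELETED":
        # Never reached for a pod whose DELETED event was missed while the
        # stream was blocked, which is exactly how a ZombieIP is born.
        available_engine.pop(pod_name, None)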

Why ERROR: Failed to scrape metrics appears even when the pod has been deleted

Following the error above, one might think: "I just need to delete the pod and traffic will no longer be routed to it."

That's the worst thing to do ☠️ if there are more than 2 faulty pods.

The blocking call stops the consumption of the Kubernetes watch stream. Put simply: if a pod is deleted and completely disappears WHILE the watch stream is blocked, its deletion never gets "watched" by the Kubernetes watch stream. The next watch loop has no way of knowing that this pod ever existed, let alone that it no longer exists.

The problem: the self.available_engine object is only updated within the watches. If a pod is not detected by the watch, its corresponding entry in self.available_engine never gets updated.

That is why, if the pod gets deleted while the watch stream is blocked, its deletion is never registered; self.available_engine is never updated and will hold the entry for the deleted pod forever.

The consequence:

  • The stats scraper will keep trying to scrape metrics from the pod: we will see the ERROR: Failed to scrape metrics log forever
  • Some requests will still be routed to the deleted pod, as the router code uses self.available_engine as the source of truth for routing requests -> see here

This is where the router routes requests to a pod IP that doesn't exist anymore.
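
As a hedged sketch of the downstream effect (toy round-robin with illustrative pod names and IPs, not the router's actual routing code): any logic that treats the registry as the source of truth will periodically pick the zombie IP and fail.

import itertools
import requests

available_engine = {
    "opt125m-0": {"ip": "10.244.1.80"},  # healthy pod
    "opt125m-1": {"ip": "10.244.1.81"},  # force-deleted pod whose entry was never removed
}
round_robin = itertools.cycle(available_engine.values())

def route_request(payload):
    target = next(round_robin)
    # Roughly every second request goes to the dead IP and fails with
    # "No route to host" / ConnectionError, matching the logs above.
    return requests.post(f"http://{target['ip']}:8000/v1/completions",
                         json=payload, timeout=10)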

Expected behavior

self.available_engine should stay up to date with the pods that are actually available. Some ideas to fix this bug:

Additional context

This is related to #431.

Please feel free to ask for more examples and suggestions for fixing the issue 🔥 🚒 🧑‍🚒
