[Bug][Router] ZombieIP: router routes to model pod that doesn't exist anymore #656

@jonoillar

Description

Describe the bug

For the K8sPodIPServiceDiscovery service discovery, the self.available_engine variable sometimes does not get updated: it still holds entries corresponding to pods that do not exist anymore. I call these ZombieIPs.

The consequence: some requests are routed to these deleted pods. In production, we've had a router pod that was, for days, routing requests to model pods that no longer existed.

To Reproduce

How to reproduce the issue

To reproduce this, I needed to deploy custom code.
Modification to the vllm router code:

--- a/src/vllm_router/service_discovery.py
+++ b/src/vllm_router/service_discovery.py
@@ -386,6 +386,7 @@ class K8sPodIPServiceDiscovery(ServiceDiscovery):
         self.watcher_thread.start()
         self.prefill_model_labels = prefill_model_labels
         self.decode_model_labels = decode_model_labels
+        self.failing_counter = 0

     @staticmethod
     def _check_pod_ready(container_statuses):
@@ -502,7 +503,12 @@ class K8sPodIPServiceDiscovery(ServiceDiscovery):
         url = f"http://{pod_ip}:{self.port}/v1/models"
         try:
             headers = None
+            self.failing_counter +=1
+            logger.info(f"{self.failing_counter=}")
             if VLLM_API_KEY := os.getenv("VLLM_API_KEY"):
+                if self.failing_counter > 3:
+                    time.sleep(60)
+                    VLLM_API_KEY = "wrong_key_jklkjlkj"
                 logger.info("Using vllm server authentication")
                 headers = {"Authorization": f"Bearer {VLLM_API_KEY}"}
             response = requests.get(url, headers=headers)
@@ -570,17 +576,23 @@ class K8sPodIPServiceDiscovery(ServiceDiscovery):

         while self.running:
             try:
+                logger.info(f"K8s watcher started{self.get_endpoint_info()}")
+                logger.info("time out is 30")
+                logger.info("Jon latest version v2")
                 for event in self.k8s_watcher.stream(
                     self.k8s_api.list_namespaced_pod,
                     namespace=self.namespace,
                     label_selector=self.label_selector,
                     timeout_seconds=30,
                 ):
+
                     pod = event["object"]
                     event_type = event["type"]
                     pod_name = pod.metadata.name
                     pod_ip = pod.status.pod_ip

+                    logger.info(f"pod_name: {pod_name} pod_ip: {pod_ip} event_type: {event_type}")
+
                     # Check if pod is terminating
                     is_pod_terminating = self._is_pod_terminating(pod)
                     is_container_ready = self._check_pod_ready(

Modification explanation:
After a few successful calls to /v1/models, I introduce a "blocking" call that takes more than 30 seconds (I set it to 60) and finally fails (due to a wrong vLLM API key). Note that the time module must be imported in service_discovery.py for the time.sleep(60) call to work.

The yaml file to use for values:

servingEngineSpec:
  vllmApiKey: "fake_key"
  runtimeClassName: ""
  modelSpec:
  - name: "opt125m"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-125m"

    replicaCount: 2

    vllmConfig:
      extraArgs:
        - "--chat-template=./examples/template_chatml.jinja"
        - "--disable-log-requests"

routerSpec:
  repository: "git-act-router"  # Use the image we just built
  imagePullPolicy: "IfNotPresent"
  strategy:
    type: Recreate
  enableRouter: true
  serviceDiscovery: "k8s"  # This enables K8sPodIPServiceDiscovery
  routingLogic: "roundrobin"
  extraArgs:
    - "--log-level"
    - "debug"

Note the vllmApiKey: "fake_key"

Step by step issue reproduction

  1. Deploy the helm chart
  2. Monitor the logs of the router

At first, everything looks normal. After 3 iterations of pod discovery, you will start seeing this error log:

[2025-08-20 10:30:44,354] ERROR: Failed to get model names from http://10.244.1.80:8000/v1/models: 401 Client Error: Unauthorized for url: http://10.244.1.80:8000/v1/models (service_discovery.py:527:vllm_router.service_discovery)

This is because the VLLM_API_KEY has become invalid.

  3. Force-delete a model pod: run kubectl delete pod POD_NAME --grace-period=0 --force
  4. Monitor the logs of the router

You will see errors in scraping metrics like this:

[2025-08-20 10:32:38,681] ERROR: Failed to scrape metrics from http://10.244.1.81:8000: HTTPConnectionPool(host='10.244.1.81', port=8000): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x735cb0cea6c0>: Failed to establish a new connection: [Errno 113] No route to host')) (engine_stats.py:131:vllm_router.stats.engine_stats)

And these errors keep repeating.

If you log self.get_endpoint_info(), you will see that the entry corresponding to the deleted pod still appears! This is the ZombieIP.

Why the errors are raised

Main cause: there is a blocking call inside the Kubernetes watcher that hits the /v1/models endpoint of the model pod. This blocking call can pause consumption of the Kubernetes watch stream.

If an event occurs in Kubernetes during this pause, such as the deletion of a pod, it will NEVER be registered.
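
Below is a minimal sketch of the watcher loop structure (simplified, with illustrative names such as handle_event; not the literal router code) showing why a slow call inside the event loop blocks stream consumption:

from kubernetes import client, config, watch

def handle_event(event_type, pod):
    # Placeholder for the real handler, which in the router synchronously
    # calls the pod's /v1/models endpoint before updating the engine registry.
    ...

def watch_pods(namespace, label_selector):
    config.load_incluster_config()
    api = client.CoreV1Api()
    w = watch.Watch()
    while True:
        # Events are pulled from the API server one at a time; the next event
        # is only consumed once the body of this loop returns.
        for event in w.stream(
            api.list_namespaced_pod,
            namespace=namespace,
            label_selector=label_selector,
            timeout_seconds=30,
        ):
            # Anything slow here (e.g. the 60-second sleep from the repro patch)
            # blocks the whole stream: a DELETED event for another pod is never
            # consumed, and once timeout_seconds expires the stream is torn down
            # and restarted without that event ever being processed.
            handle_event(event["type"], event["object"])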

Why the ERROR: Failed to get model names is raised and what the consequences are

The Kubernetes watcher re-watches the Kubernetes objects every 30 seconds, for 30 seconds at a time. No resource_version is set in this watcher (see https://github.com/kubernetes-client/python/blob/master/kubernetes/base/watch/watch.py#L136), so every watch is "new": the watcher treats all detected objects as new, and the events delivered by the watch are always ADDED, even though the pod was already registered in the available_engine object.

The logs that prove this are available here.
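
For comparison, here is a hedged sketch of how resource_version changes what the watch reports (it assumes the standard kubernetes Python client, uses the default namespace, and omits the 410 Gone handling a real implementation would need). Without resource_version every 30-second stream starts fresh and replays existing pods as ADDED; passing the resource_version of an initial list makes the stream resume and deliver only real deltas:

from kubernetes import client, config, watch

config.load_incluster_config()
api = client.CoreV1Api()
w = watch.Watch()

# Current behaviour: no resource_version, so every 30-second stream replays
# all existing pods as ADDED events before reporting anything new.
for event in w.stream(api.list_namespaced_pod, namespace="default",
                      timeout_seconds=30):
    print(event["type"], event["object"].metadata.name)

# Resuming instead of replaying: list once, remember the resourceVersion, and
# hand it to the stream so only ADDED/MODIFIED/DELETED deltas are delivered.
pods = api.list_namespaced_pod(namespace="default")
for event in w.stream(api.list_namespaced_pod, namespace="default",
                      resource_version=pods.metadata.resource_version,
                      timeout_seconds=30):
    print(event["type"], event["object"].metadata.name)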

When the /v1/models call starts being slow and eventually raises an error:

  • the function _get_model_names returns an empty list -> see here
  • as the event is ADDED, no update is made to the available_engine object -> see here

Therefore, while the pod is up, we will see a loop of ERROR: Failed to get model names logs, AND some traffic will still be routed to this pod, as it stays registered in the available_engine object -> leading to failures 🤯
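
A simplified illustration of that failure mode (toy code with made-up names like on_event; not the actual router implementation): on an ADDED event the registry entry is only refreshed when /v1/models succeeds, so a pod whose call keeps failing remains registered with its stale entry.

import requests

available_engine = {}  # pod_name -> endpoint info used for routing

def get_model_names(pod_ip, port=8000):
    try:
        resp = requests.get(f"http://{pod_ip}:{port}/v1/models", timeout=5)
        resp.raise_for_status()
        return [m["id"] for m in resp.json()["data"]]
    except requests.RequestException:
        return []  # mirrors _get_model_names returning an empty list on error

def on_event(event_type, pod_name, pod_ip):
    if event_type == "ADDED":
        model_names = get_model_names(pod_ip)
        if not model_names:
            return  # no update: a stale entry, if present, survives untouched
        available_engine[pod_name] = {"ip": pod_ip, "models": model_names}
    elif event_type == "DELETED":
        # Never reached for a pod whose DELETED event was missed while the
        # stream was blocked, which is exactly how a ZombieIP is born.
        available_engine.pop(pod_name, None)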

Why ERROR: Failed to scrape metrics appears even when the pod has been deleted

Following the error above, one might think: "I just need to delete the pod and traffic will no longer be routed to it."

That's the worst thing to do ☠️ if there are more than 2 faulty pods.

The blocking call stops the consumption of the Kubernetes watch stream. Put simply: if a pod is deleted and completely disappears WHILE the watch stream is blocked, its deletion never gets "watched" by the Kubernetes watch stream. The next watch loop has no way of knowing that this pod ever existed, let alone that it no longer exists.

The problem: the self.available_engine object is only updated within the watches. If a pod is not detected by the watch, its corresponding entry in self.available_engine never gets updated.

That is why, if the pod gets deleted while the watch stream is blocked, its deletion is never registered; self.available_engine is never updated and will hold the entry for the deleted pod forever.

The consequence:

  • The stats scraper will keep trying to scrape metrics from the pod: we will see the ERROR: Failed to scrape metrics log forever
  • Some requests will still be routed to the deleted pod, as the router code uses self.available_engine as the source of truth for routing requests -> see here

This is where the router routes requests to a pod IP that doesn't exist anymore.
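
As a hedged sketch of the downstream effect (toy round-robin with illustrative pod names and IPs, not the router's actual routing code): any logic that treats the registry as the source of truth will periodically pick the zombie IP and fail.

import itertools
import requests

available_engine = {
    "opt125m-0": {"ip": "10.244.1.80"},  # healthy pod
    "opt125m-1": {"ip": "10.244.1.81"},  # force-deleted pod whose entry was never removed
}
round_robin = itertools.cycle(available_engine.values())

def route_request(payload):
    target = next(round_robin)
    # Roughly every second request goes to the dead IP and fails with
    # "No route to host" / ConnectionError, matching the logs above.
    return requests.post(f"http://{target['ip']}:8000/v1/completions",
                         json=payload, timeout=10)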

Expected behavior

self.available_engine should stay up to date with the pods that are actually available. Some ideas to fix this bug:

Additional context

This is related to #431.

Please feel free to ask for more examples and suggestions for fixing the issue 🔥 🚒 🧑‍🚒
