feature: Check sleep mode only interacting with the /is_sleeping endpoint

### Describe the feature

**Scenario**:

I was testing the router with a fake-server, very similar to this one: https://github.com/vllm-project/production-stack/blob/main/src/tests/perftest/fake-openai-server.py

I packaged the app in a Docker image:

```dockerfile
FROM python:3.10-slim AS build

# Install dependencies
COPY requirements.txt .
RUN apt-get update && apt-get install -y gcc
RUN pip install --no-cache-dir --user -r requirements.txt

# Set working directory
WORKDIR /opt/project


# Copy application code
COPY . .
# Set PYTHONPATH
ENV PYTHONPATH=/opt/project

CMD ["python3", "./fake-openai-server.py", "--host", "0.0.0.0", "--port", "8000"]
```

I then deployed it on Kubernetes. Here is the pod spec:

```yaml
apiVersion: v1
kind: Pod
metadata:
  generateName: vllm-k8s-test-fake-server-fake-model-deployment-vllm-5bc7967b99-
  generation: 1
  labels:
    environment: test
    helm-release-name: vllm-k8s-test-fake-server
    model: fake-model
    pod-template-hash: 5bc7967b99
    release: test
  name: vllm-k8s-test-fake-server-fake-model-deployment-vllm-5bc79kndb9
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: vllm-k8s-test-fake-server-fake-model-deployment-vllm-5bc7967b99
spec:
  containers:
  - image: fake-server:latest
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8000
        scheme: HTTP
      initialDelaySeconds: 15
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: vllm
    ports:
    - containerPort: 8000
      name: container-port
      protocol: TCP
    - containerPort: 55555
      name: zmq-port
      protocol: TCP
    - containerPort: 9999
      name: ucx-port
      protocol: TCP
    startupProbe:
      failureThreshold: 60
      httpGet:
        path: /health
        port: 8000
        scheme: HTTP
      initialDelaySeconds: 15
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
```

Important: there is no `command` section in the pod definition, because I want the container to run the command specified in the Docker image.

The vLLM router's service discovery then logged this error:

```
K8s watcher error: 'NoneType' object is not iterable (service_discovery.py:617:vllm_router.service_discovery)
```

I dug in to understand when this error can occur.

In [`_check_engine_sleep_mode`](https://github.com/vllm-project/production-stack/blob/main/src/vllm_router/service_discovery.py#L436) I found this piece of code:

```python
            pod = self.k8s_api.read_namespaced_pod(
                name=pod_name, namespace=self.namespace
            )
            for container in pod.spec.containers:
                if container.name == "vllm":
                    for arg in container.command:
                        if arg == "--enable-sleep-mode":
                            enable_sleep_mode = True
                            break
            return enable_sleep_mode
```

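This code breaks at the line `for arg in container.command:` when no `command` has been provided in the pod spec. In that case the Kubernetes client returns `None` for the field, and iterating over `None` raises the `'NoneType' object is not iterable` error shown above.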
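To make the failure concrete, here is a minimal sketch of a guard that tolerates a missing `command`. `has_sleep_mode_flag` is a hypothetical helper, not code from the router, and `SimpleNamespace` stands in for the Kubernetes client's container object. Checking `args` as well is my own assumption about where the flag might be set. This only avoids the crash; it does not remove the dependency on the pod's command line.

```python
from types import SimpleNamespace


def has_sleep_mode_flag(container) -> bool:
    """Return True if --enable-sleep-mode appears in the container's command or args.

    Hypothetical helper: both fields are None when the pod spec omits them,
    so each is replaced by an empty list before searching.
    """
    tokens = list(container.command or []) + list(container.args or [])
    return "--enable-sleep-mode" in tokens


# Stand-ins for the container objects returned by the Kubernetes client.
no_command = SimpleNamespace(name="vllm", command=None, args=None)
with_flag = SimpleNamespace(
    name="vllm",
    command=["vllm", "serve", "fake-model", "--enable-sleep-mode"],
    args=None,
)

# The unguarded loop from the router fails on the first container:
try:
    for arg in no_command.command:
        pass
    unguarded_failed = False
except TypeError:
    unguarded_failed = True
```

With the guard, a pod without `command` is simply reported as not having the flag instead of crashing the watcher.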

### Modification suggestion

I think the vLLM router should not make any assumption about *which command is set in the pod definition of the model pod*. The API endpoints exposed by the model pods should be the only "contract" between the router and the model pods.

If someone builds a custom vLLM image that is not run with exactly the same command as vLLM, or that already bakes the command into the Docker image, the router should still work with it.

That's why I suggest querying only the `/is_sleeping` endpoint to test whether an engine is sleeping:
- if the endpoint returns 404 (not defined) -> we consider the engine not sleeping
- if the endpoint returns false -> we consider the engine not sleeping
- if the endpoint returns true -> we consider the engine sleeping
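The three rules above could be sketched roughly as follows. This is an illustration using only the standard library, not the router's actual code: `interpret_is_sleeping` and `check_engine_sleeping` are hypothetical names, the `{"is_sleeping": <bool>}` response shape is an assumption about the endpoint, and raising on any other status code is my own choice since the proposal does not cover that case.

```python
import json
import urllib.error
import urllib.request


def interpret_is_sleeping(status_code: int, body: bytes) -> bool:
    """Map an /is_sleeping response to a sleeping / not-sleeping decision."""
    if status_code == 404:
        # Endpoint not defined: treat the engine as not sleeping.
        return False
    if status_code != 200:
        # Not covered by the proposal; surfaced as an error in this sketch.
        raise RuntimeError(f"unexpected status from /is_sleeping: {status_code}")
    # Assumed response shape: {"is_sleeping": true} or {"is_sleeping": false}.
    payload = json.loads(body)
    return bool(payload.get("is_sleeping", False))


def check_engine_sleeping(base_url: str, timeout: float = 2.0) -> bool:
    """Query <base_url>/is_sleeping without inspecting the pod spec."""
    url = base_url.rstrip("/") + "/is_sleeping"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return interpret_is_sleeping(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, so the 404 case arrives here.
        return interpret_is_sleeping(err.code, err.read())
```

With this approach a fake server that exposes no `/is_sleeping` route, like the one in the scenario, would be treated as awake, and the router would never need to read `container.command`.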

Please let me know what you think of this suggestion.

### Why do you need this feature?

_No response_

### Additional context

_No response_
