feature: Check sleep mode only interacting with the /is_sleeping endpoint #655

@jonoillar

Describe the feature

Scenario:

I was testing the router with a fake-server, very similar to this one: https://github.com/vllm-project/production-stack/blob/main/src/tests/perftest/fake-openai-server.py

I packaged the app within a docker image:

FROM python:3.10-slim AS build

# Install dependencies
COPY requirements.txt .
RUN apt-get update && apt-get install -y gcc
RUN pip install --no-cache-dir --user -r requirements.txt

# Set working directory
WORKDIR /opt/project


# Copy application code
COPY . .
# Set PYTHONPATH
ENV PYTHONPATH=/opt/project

CMD ["python3", "./fake-openai-server.py", "--host", "0.0.0.0", "--port", "8000"]

I then deployed it in Kubernetes. Here is the pod spec:

apiVersion: v1
kind: Pod
metadata:
  generateName: vllm-k8s-test-fake-server-fake-model-deployment-vllm-5bc7967b99-
  generation: 1
  labels:
    environment: test
    helm-release-name: vllm-k8s-test-fake-server
    model: fake-model
    pod-template-hash: 5bc7967b99
    release: test
  name: vllm-k8s-test-fake-server-fake-model-deployment-vllm-5bc79kndb9
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: vllm-k8s-test-fake-server-fake-model-deployment-vllm-5bc7967b99
spec:
  containers:
  - image: fake-server:latest
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8000
        scheme: HTTP
      initialDelaySeconds: 15
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: vllm
    ports:
    - containerPort: 8000
      name: container-port
      protocol: TCP
    - containerPort: 55555
      name: zmq-port
      protocol: TCP
    - containerPort: 9999
      name: ucx-port
      protocol: TCP
    startupProbe:
      failureThreshold: 60
      httpGet:
        path: /health
        port: 8000
        scheme: HTTP
      initialDelaySeconds: 15
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1

Important: there is no command section in the pod definition, because I just want to run the command specified in the docker image.

Then, in the vllm router's service discovery, I got this error:

K8s watcher error: 'NoneType' object is not iterable (service_discovery.py:617:vllm_router.service_discovery)

I dug into the code to understand when this error could occur.

In _check_engine_sleep_mode I found this piece of code:

            pod = self.k8s_api.read_namespaced_pod(
                name=pod_name, namespace=self.namespace
            )
            for container in pod.spec.containers:
                if container.name == "vllm":
                    for arg in container.command:
                        if arg == "--enable-sleep-mode":
                            enable_sleep_mode = True
                            break
            return enable_sleep_mode

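This code breaks at the line for arg in container.command: when no command is provided in the pod spec, because Kubernetes then sets container.command to None, and iterating over None raises the 'NoneType' object is not iterable error shown above.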
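Independently of the suggestion that follows, the crash itself could be avoided by treating a missing command as an empty list. The snippet below is only a minimal sketch, not the router's actual code: the helper name sleep_mode_enabled is hypothetical, and also looking at container.args is an assumption added here, since the flag may be passed through args rather than command.

```python
from types import SimpleNamespace


def sleep_mode_enabled(containers, container_name="vllm", flag="--enable-sleep-mode"):
    """Return True if the named container passes the flag in command or args.

    Hypothetical helper: `containers` is any iterable of objects exposing
    `name`, `command` and `args`, like pod.spec.containers.
    """
    for container in containers:
        if container.name != container_name:
            continue
        # command and args are None when the pod spec omits them,
        # so fall back to empty lists before searching.
        argv = (container.command or []) + (getattr(container, "args", None) or [])
        if flag in argv:
            return True
    return False


# Stand-ins for the container objects found in pod.spec.containers.
no_command = SimpleNamespace(name="vllm", command=None, args=None)
with_flag = SimpleNamespace(
    name="vllm", command=["vllm", "serve"], args=["--enable-sleep-mode"]
)

print(sleep_mode_enabled([no_command]))  # no exception, even though command is None
print(sleep_mode_enabled([with_flag]))
```

This only removes the exception. It still depends on how the pod was launched, which is the assumption the suggestion in the next section aims to drop.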

Modification suggestion

I think the vllm-router should not make any assumptions about the command set in the pod definition of the model pod. The API endpoints exposed by the model pods should be the only "contract" between the vllm router and the vllm model pods.
If someone wants to use a custom vllm image that is not launched with the exact same command as vllm, or that already contains the command within the docker image, the vllm router should still work with it.

That's why I suggest querying only the /is_sleeping endpoint to test whether an engine is sleeping:

  • if the endpoint returns 404 (not defined) -> we consider the engine is not sleeping
  • if the endpoint returns false -> we consider the engine is not sleeping
  • if the endpoint returns true -> we consider the engine is sleeping
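The three rules above could be sketched as follows. This is an illustration under stated assumptions, not the router's implementation: the function names are hypothetical, the response body is assumed to be either a bare JSON boolean or an object of the form {"is_sleeping": <bool>}, and any unexpected status or unreachable engine is treated as "not sleeping". It uses only the standard library so that it stays self-contained.

```python
import json
import urllib.error
import urllib.request


def interpret_is_sleeping(status_code: int, body: bytes) -> bool:
    """Map an /is_sleeping response to a boolean, following the rules above."""
    if status_code == 404:
        # Endpoint not defined: treat the engine as not sleeping.
        return False
    if status_code != 200:
        # Assumption: any other unexpected status also means "not sleeping".
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # Body is not valid JSON: fall back to "not sleeping".
        return False
    # Assumption: the body is a bare boolean or {"is_sleeping": <bool>}.
    if isinstance(payload, dict):
        payload = payload.get("is_sleeping", False)
    return payload is True


def check_engine_sleeping(base_url: str, timeout: float = 2.0) -> bool:
    """Query <base_url>/is_sleeping and interpret the result."""
    url = base_url.rstrip("/") + "/is_sleeping"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return interpret_is_sleeping(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; reuse the same mapping.
        return interpret_is_sleeping(err.code, err.read())
    except (urllib.error.URLError, OSError):
        # Assumption: an unreachable engine is reported as not sleeping.
        return False
```

For example, check_engine_sleeping("http://10.0.0.12:8000") (a made-up pod address) would return False for the fake server if it does not define /is_sleeping, without ever reading the pod's command.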

Please tell me if you like the suggestion.

Why do you need this feature?

No response

Additional context

No response
