### Describe the feature
Scenario:
I was testing the router with a fake server, very similar to this one: https://github.com/vllm-project/production-stack/blob/main/src/tests/perftest/fake-openai-server.py
I packaged the app in a Docker image:
```dockerfile
FROM python:3.10-slim AS build

# Install dependencies
COPY requirements.txt .
RUN apt-get update && apt-get install -y gcc
RUN pip install --no-cache-dir --user -r requirements.txt

# Set working directory
WORKDIR /opt/project

# Copy application code
COPY . .

# Set PYTHONPATH
ENV PYTHONPATH=/opt/project

CMD ["python3", "./fake-openai-server.py", "--host", "0.0.0.0", "--port", "8000"]
```
And deployed it in Kubernetes; here is the pod spec:
```yaml
apiVersion: v1
kind: Pod
metadata:
  generateName: vllm-k8s-test-fake-server-fake-model-deployment-vllm-5bc7967b99-
  generation: 1
  labels:
    environment: test
    helm-release-name: vllm-k8s-test-fake-server
    model: fake-model
    pod-template-hash: 5bc7967b99
    release: test
  name: vllm-k8s-test-fake-server-fake-model-deployment-vllm-5bc79kndb9
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: vllm-k8s-test-fake-server-fake-model-deployment-vllm-5bc7967b99
spec:
  containers:
  - image: fake-server:latest
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8000
        scheme: HTTP
      initialDelaySeconds: 15
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: vllm
    ports:
    - containerPort: 8000
      name: container-port
      protocol: TCP
    - containerPort: 55555
      name: zmq-port
      protocol: TCP
    - containerPort: 9999
      name: ucx-port
      protocol: TCP
    startupProbe:
      failureThreshold: 60
      httpGet:
        path: /health
        port: 8000
        scheme: HTTP
      initialDelaySeconds: 15
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
```
Important thing to note: there is no `command` section in the pod definition. This is because I just want to run the command specified in the Docker image.
Then, in the vllm router's service discovery, I was getting this error:

```
K8s watcher error: 'NoneType' object is not iterable (service_discovery.py:617:vllm_router.service_discovery)
```
I dug in, trying to understand when this error could occur. In `_check_engine_sleep_mode` I found this piece of code:
```python
pod = self.k8s_api.read_namespaced_pod(
    name=pod_name, namespace=self.namespace
)
for container in pod.spec.containers:
    if container.name == "vllm":
        for arg in container.command:
            if arg == "--enable-sleep-mode":
                enable_sleep_mode = True
                break
return enable_sleep_mode
```

This code breaks on the line `for arg in container.command:` when no command has been provided to the pod: in that case, Kubernetes sets `command` to `None` by default.
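
For illustration, here is a minimal reproduction using the official `kubernetes` Python client (the pod name and namespace are taken from the spec above); running it against that pod fails with the same TypeError:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(
    name="vllm-k8s-test-fake-server-fake-model-deployment-vllm-5bc79kndb9",
    namespace="default",
)
container = pod.spec.containers[0]
print(container.command)  # -> None, because the pod spec has no `command` field

# The same loop as in _check_engine_sleep_mode:
for arg in container.command:  # TypeError: 'NoneType' object is not iterable
    pass
```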
### Modification suggestion
I think the vllm-router should not make any assumptions about the command set in the pod definition of the model pod. The API endpoints exposed by the model pods are the only "contract" between the vllm router and the vllm model pods.
If someone wants to create a custom vllm image that doesn't need to be run with the exact same command as vllm, or that already contains the command in the Docker image, then the vllm router should still work with it.
That's why I suggest, for testing whether an engine is sleeping or not, to only query the `is_sleeping` endpoint (see the sketch after this list):

- if the endpoint returns 404 (not implemented) -> we consider the engine is not sleeping
- if the endpoint returns false -> we consider the engine is not sleeping
- if the endpoint returns true -> we consider the engine is sleeping
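
A minimal sketch of what that check could look like on the router side; the function name is mine, and I'm assuming the engine exposes GET `/is_sleeping` returning a JSON body like `{"is_sleeping": true}`, as recent vLLM servers do:

```python
import requests

def check_engine_sleep_state(engine_base_url: str) -> bool:
    """Return True only when the engine reports that it is sleeping."""
    try:
        resp = requests.get(f"{engine_base_url}/is_sleeping", timeout=5)
    except requests.RequestException:
        # Engine unreachable: do not treat it as sleeping; let the
        # router's normal health checks deal with it.
        return False
    if resp.status_code == 404:
        # Endpoint not implemented (e.g. a custom or older image):
        # assume the engine does not support sleep mode.
        return False
    resp.raise_for_status()
    # Assumed response shape: {"is_sleeping": true}
    return bool(resp.json().get("is_sleeping", False))
```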
Please tell me if you like this suggestion.
### Why do you need this feature?

_No response_

### Additional context

_No response_