### Describe the bug
Request cancellation doesn't propagate to the engines
If I port-forward both the router service and the engine service:

```shell
kubectl port-forward svc/inference-stack-router-service 8000:80
kubectl port-forward svc/inference-stack-generate-engine-service 8001:80
```

and then send curl requests and Ctrl-C them while watching the engine logs, cancelled requests sent directly to the engine service on port 8001 are aborted:

```
INFO 08-06 00:27:06 [engine.py:337] Aborted request chatcmpl-2a18fe454fdb4e48a1a744b831d6
```

whereas requests sent through the router service on port 8000 always run to completion, even when the client cancels them:

```
INFO: 10.64.8.3:38070 - "POST /v1/chat/completions HTTP/1.1" 200 OK
```
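For reference, this is the behavior I'd expect the router to have. A minimal asyncio sketch (hypothetical, not the router's actual code): when the proxy streams the upstream response in the same task that serves the client, cancelling the client-facing task propagates into the upstream generator, which is what lets the engine abort the request.

```python
import asyncio


async def engine_stream(state):
    """Stand-in for the engine's token stream; records whether it was aborted."""
    try:
        for _ in range(100):
            await asyncio.sleep(0.01)
            yield "token"
    except asyncio.CancelledError:
        state["aborted"] = True  # analogous to vLLM logging "Aborted request ..."
        raise


async def router_proxy(upstream):
    """Pass-through proxy running in the same task as the client response."""
    async for chunk in upstream:
        yield chunk


async def main():
    state = {"aborted": False}

    async def consume():
        async for _ in router_proxy(engine_stream(state)):
            pass

    task = asyncio.create_task(consume())
    await asyncio.sleep(0.05)  # let a few tokens flow
    task.cancel()              # the client hits Ctrl-C / disconnects
    try:
        await task
    except asyncio.CancelledError:
        pass
    return state


state = asyncio.run(main())
print(state)  # → {'aborted': True}
```

Because the whole `async for` chain lives in one task, the cancellation is raised at the innermost `await` (inside the engine's stream), so the abort reaches the engine.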
### To Reproduce
Sending requests with:

```shell
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-12b-it",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'
```
And then hitting Ctrl-C in the terminal quickly. The effect is more obvious if you increase max_tokens, since the requests then last longer.
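The observed behavior is consistent with a proxy that drains the upstream response in a task detached from the client-facing one, so the client's disconnect cancels only the latter. A minimal sketch of that failure mode (hypothetical, not the actual router implementation):

```python
import asyncio


async def engine_stream(state):
    """Stand-in for the engine's token stream."""
    try:
        for _ in range(5):
            await asyncio.sleep(0.01)
            state["chunks"] += 1
            yield "token"
    except asyncio.CancelledError:
        state["aborted"] = True
        raise


async def detached_proxy(upstream, state):
    """Failure mode: the upstream is drained by a separate task, so cancelling
    the client-facing task never reaches the engine."""
    queue = asyncio.Queue()

    async def pump():
        async for chunk in upstream:
            await queue.put(chunk)
        await queue.put(None)  # end-of-stream marker

    state["pump"] = asyncio.create_task(pump())  # detached from the client task
    while (chunk := await queue.get()) is not None:
        yield chunk


async def main():
    state = {"chunks": 0, "aborted": False}

    async def consume():
        async for _ in detached_proxy(engine_stream(state), state):
            pass

    task = asyncio.create_task(consume())
    await asyncio.sleep(0.025)  # a couple of tokens flow
    task.cancel()               # the client hits Ctrl-C / disconnects
    try:
        await task
    except asyncio.CancelledError:
        pass
    await state["pump"]  # the detached pump keeps draining the engine
    return state


state = asyncio.run(main())
print(state["chunks"], state["aborted"])  # → 5 False
```

Here the engine runs to completion (all 5 chunks, never aborted) even though the client cancelled mid-stream, matching the `200 OK` seen in the engine logs for cancelled requests through the router.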
Installed production stack version 0.1.5 with the following values (though the exact values don't seem to matter):

```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "embed"
    repository: "vllm/vllm-openai"
    tag: "v0.8.4"
    modelURL: "Qwen/Qwen3-Embedding-8B"
    replicaCount: 1
    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1
    vllmConfig:
      maxModelLen: 16384
  - name: "generate"
    repository: "vllm/vllm-openai"
    tag: "v0.9.1"
    modelURL: "google/gemma-3-12b-it"
    replicaCount: 1
    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 2
    shmSize: "20Gi"
    vllmConfig:
      tensorParallelSize: 2
      maxModelLen: 16384
      maxNumSeqs: 8
      dtype: "bfloat16"
      gpuMemoryUtilization: 0.85
      enableChunkedPrefill: false
      extraArgs:
      - "--limit-mm-per-prompt"
      - "{\"image\": 5}"
    hf_token:
      secretName: "hf-secret"
      secretKey: "HUGGING_FACE_HUB_TOKEN"
    startupProbe:
      initialDelaySeconds: 60
      periodSeconds: 60
      failureThreshold: 60
      httpGet:
        path: /health
        port: 8000
    livenessProbe:
      initialDelaySeconds: 60
      failureThreshold: 3
      periodSeconds: 60
      httpGet:
        path: /health
        port: 8000
```
### Expected behavior

No response

### Additional context

No response