
bug: Cancellations don't propagate through the vLLM router to the engines #634

@fergusfinn

Description


Describe the bug

Request cancellation doesn't propagate to the engines

If I port-forward both the router service and the engine service:

kubectl port-forward svc/inference-stack-router-service 8000:80
kubectl port-forward svc/inference-stack-generate-engine-service 8001:80

then send curl requests and ctrl-C them while watching the engine logs. Requests sent directly to the engine service on 8001 are aborted when cancelled:

│ INFO 08-06 00:27:06 [engine.py:337] Aborted request chatcmpl-2a18fe454fdb4e48a1a744b831d6 │

whereas requests sent through the router service on 8000 run to completion (the engine logs a 200), even when they're cancelled:

│ INFO:     10.64.8.3:38070 - "POST /v1/chat/completions HTTP/1.1" 200 OK                   │

To Reproduce

Sending requests with:

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-12b-it",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how are you?"
      }
    ]
  }'

and then quickly hitting ctrl-C in the terminal. The effect is more obvious if you increase max_tokens, since the requests last longer.

Installed production stack version 0.1.5 with the following values (though the values don't seem to matter):

servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
    - name: "embed"
      repository: "vllm/vllm-openai"
      tag: "v0.8.4"
      modelURL: "Qwen/Qwen3-Embedding-8B"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "16Gi"
      requestGPU: 1
      vllmConfig:
        maxModelLen: 16384
    - name: "generate"
      repository: "vllm/vllm-openai"
      tag: "v0.9.1"
      modelURL: "google/gemma-3-12b-it"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "16Gi"
      requestGPU: 2
      shmSize: "20Gi"
      vllmConfig:
        tensorParallelSize: 2
        maxModelLen: 16384
        maxNumSeqs: 8
        dtype: "bfloat16"
        gpuMemoryUtilization: 0.85
        enableChunkedPrefill: false
        extraArgs:
          - --limit-mm-per-prompt
          - "{\"image\": 5}"
      hf_token:
        secretName: "hf-secret"
        secretKey: "HUGGING_FACE_HUB_TOKEN"
      startupProbe:
        initialDelaySeconds: 60
        periodSeconds: 60
        failureThreshold: 60
        httpGet:
          path: /health
          port: 8000
      livenessProbe:
        initialDelaySeconds: 60
        failureThreshold: 3
        periodSeconds: 60
        httpGet:
          path: /health
          port: 8000

Expected behavior

No response

Additional context

No response

Labels: bug