### Describe the bug
Request cancellation doesn't propagate to the engines
If I port-forward both the router service and the engine service:

```shell
kubectl port-forward svc/inference-stack-router-service 8000:80
kubectl port-forward svc/inference-stack-generate-engine-service 8001:80
```

and then send curl requests and Ctrl-C them while watching the engine logs, cancelled requests sent directly to the engine service on port 8001 are aborted:

```
INFO 08-06 00:27:06 [engine.py:337] Aborted request chatcmpl-2a18fe454fdb4e48a1a744b831d6
```

whereas requests sent through the router service on port 8000 always run to completion, even when the client cancels them:

```
INFO: 10.64.8.3:38070 - "POST /v1/chat/completions HTTP/1.1" 200 OK
```
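For reference, this is the behavior I'd expect the router to have. A minimal asyncio sketch (hypothetical, not the router's actual code): when the proxy streams the upstream response in the same task that serves the client, cancelling the client-facing task propagates into the upstream generator, which is what lets the engine abort the request.

```python
import asyncio


async def engine_stream(state):
    """Stand-in for the engine's token stream; records whether it was aborted."""
    try:
        for _ in range(100):
            await asyncio.sleep(0.01)
            yield "token"
    except asyncio.CancelledError:
        state["aborted"] = True  # analogous to vLLM logging "Aborted request ..."
        raise


async def router_proxy(upstream):
    """Pass-through proxy running in the same task as the client response."""
    async for chunk in upstream:
        yield chunk


async def main():
    state = {"aborted": False}

    async def consume():
        async for _ in router_proxy(engine_stream(state)):
            pass

    task = asyncio.create_task(consume())
    await asyncio.sleep(0.05)  # let a few tokens flow
    task.cancel()              # the client hits Ctrl-C / disconnects
    try:
        await task
    except asyncio.CancelledError:
        pass
    return state


state = asyncio.run(main())
print(state)  # → {'aborted': True}
```

Because the whole `async for` chain lives in one task, the cancellation is raised at the innermost `await` (inside the engine's stream), so the abort reaches the engine.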
### To Reproduce
Sending requests with:

```shell
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-12b-it",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'
```
And then hitting Ctrl-C in the terminal quickly. The effect is more obvious if you increase max_tokens, since the requests then last longer.
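The observed behavior is consistent with a proxy that drains the upstream response in a task detached from the client-facing one, so the client's disconnect cancels only the latter. A minimal sketch of that failure mode (hypothetical, not the actual router implementation):

```python
import asyncio


async def engine_stream(state):
    """Stand-in for the engine's token stream."""
    try:
        for _ in range(5):
            await asyncio.sleep(0.01)
            state["chunks"] += 1
            yield "token"
    except asyncio.CancelledError:
        state["aborted"] = True
        raise


async def detached_proxy(upstream, state):
    """Failure mode: the upstream is drained by a separate task, so cancelling
    the client-facing task never reaches the engine."""
    queue = asyncio.Queue()

    async def pump():
        async for chunk in upstream:
            await queue.put(chunk)
        await queue.put(None)  # end-of-stream marker

    state["pump"] = asyncio.create_task(pump())  # detached from the client task
    while (chunk := await queue.get()) is not None:
        yield chunk


async def main():
    state = {"chunks": 0, "aborted": False}

    async def consume():
        async for _ in detached_proxy(engine_stream(state), state):
            pass

    task = asyncio.create_task(consume())
    await asyncio.sleep(0.025)  # a couple of tokens flow
    task.cancel()               # the client hits Ctrl-C / disconnects
    try:
        await task
    except asyncio.CancelledError:
        pass
    await state["pump"]  # the detached pump keeps draining the engine
    return state


state = asyncio.run(main())
print(state["chunks"], state["aborted"])  # → 5 False
```

Here the engine runs to completion (all 5 chunks, never aborted) even though the client cancelled mid-stream, matching the `200 OK` seen in the engine logs for cancelled requests through the router.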
Installed production stack version 0.1.5 with the following values (though the exact values don't seem to matter):

```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "embed"
    repository: "vllm/vllm-openai"
    tag: "v0.8.4"
    modelURL: "Qwen/Qwen3-Embedding-8B"
    replicaCount: 1
    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1
    vllmConfig:
      maxModelLen: 16384
  - name: "generate"
    repository: "vllm/vllm-openai"
    tag: "v0.9.1"
    modelURL: "google/gemma-3-12b-it"
    replicaCount: 1
    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 2
    shmSize: "20Gi"
    vllmConfig:
      tensorParallelSize: 2
      maxModelLen: 16384
      maxNumSeqs: 8
      dtype: "bfloat16"
      gpuMemoryUtilization: 0.85
      enableChunkedPrefill: false
      extraArgs:
      - "--limit-mm-per-prompt"
      - "{\"image\": 5}"
    hf_token:
      secretName: "hf-secret"
      secretKey: "HUGGING_FACE_HUB_TOKEN"
    startupProbe:
      initialDelaySeconds: 60
      periodSeconds: 60
      failureThreshold: 60
      httpGet:
        path: /health
        port: 8000
    livenessProbe:
      initialDelaySeconds: 60
      failureThreshold: 3
      periodSeconds: 60
      httpGet:
        path: /health
        port: 8000
```
### Expected behavior

No response

### Additional context

No response