Conversation

Frapschen (Contributor)

What type of PR is this?
/kind documentation

What this PR does / why we need it:

This PR bumps two image versions:

  • vllm-cpu-release-repo: v0.8.5 -> v0.10.2 (v0.8.5 is six months old)
  • vllm-openai: v0.8.5 -> v0.11.0 (includes many bug fixes)
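The change itself is only the tag portion of each image reference; a minimal sketch of the substitution (CPU registry path taken from the pod manifest below, the `vllm/vllm-openai` Docker Hub path is an assumption):

```shell
# Old tags as currently pinned in the quickstart manifests.
cpu_old="public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.8.5"
gpu_old="vllm/vllm-openai:v0.8.5"

# Re-tag by replacing everything after the last ':'.
cpu_new="${cpu_old%:*}:v0.10.2"
gpu_new="${gpu_old%:*}:v0.11.0"

echo "${cpu_old} -> ${cpu_new}"
echo "${gpu_old} -> ${gpu_new}"
```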

Which issue(s) this PR fixes:
Fixes #1722

Does this PR introduce a user-facing change?:


@k8s-ci-robot k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Oct 16, 2025

netlify bot commented Oct 16, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit d69f91b
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/68f0a26b4db49300084892ee
😎 Deploy Preview https://deploy-preview-1733--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 16, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Frapschen
Once this PR has been reviewed and has the lgtm label, please assign kfswain for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Oct 16, 2025
@nirrozenbaum
Contributor

@Frapschen did you run the quickstart guide with these GPU and CPU versions to make sure it works?
Last time we tested, it failed on startup, which was the original reason we pinned the version to v0.8.5.

@Frapschen
Contributor Author

@nirrozenbaum I can confirm that the CPU one works fine for me.

root@controller-01:~# kubectl get pod -owide
NAME                                           READY   STATUS    RESTARTS      AGE   IP             NODE            NOMINATED NODE   READINESS GATES
inference-gateway-f5c894468-4vxc5              1/1     Running   0             17d   10.233.98.85   controller-01   <none>           <none>
kgateway-7f4455889-zfrtz                       1/1     Running   0             17d   10.233.98.84   controller-01   <none>           <none>
vllm-llama3-8b-instruct-6c9757687-cvzll        1/1     Running   1 (27d ago)   42d   10.233.98.36   controller-01   <none>           <none>
vllm-llama3-8b-instruct-6c9757687-gdpqf        1/1     Running   1 (27d ago)   42d   10.233.98.34   controller-01   <none>           <none>
vllm-llama3-8b-instruct-6c9757687-rjqq2        1/1     Running   1 (27d ago)   42d   10.233.98.35   controller-01   <none>           <none>
vllm-llama3-8b-instruct-cpu-7555494db4-bvpd4   2/2     Running   1 (15h ago)   15h   10.233.98.89   controller-01   <none>           <none>
vllm-llama3-8b-instruct-epp-86c8cdcf64-8dtrp   1/1     Running   0             21d   10.233.98.71   controller-01   <none>           <none>

vllm-llama3-8b-instruct-cpu-7555494db4-bvpd4 pod manifest:

root@controller-01:~# kubectl get pod vllm-llama3-8b-instruct-cpu-7555494db4-bvpd4 -oyaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 5a70f09e83d15c3e709291de67baafa16b5bb33869b5c7ad5e7ce35d514e1dd7
    cni.projectcalico.org/podIP: 10.233.98.89/32
    cni.projectcalico.org/podIPs: 10.233.98.89/32
  creationTimestamp: "2025-10-16T10:16:35Z"
  generateName: vllm-llama3-8b-instruct-cpu-7555494db4-
  labels:
    app: vllm-llama3-8b-instruct-cpu
    pod-template-hash: 7555494db4
  name: vllm-llama3-8b-instruct-cpu-7555494db4-bvpd4
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: vllm-llama3-8b-instruct-cpu-7555494db4
    uid: 2fac45a0-a030-4a23-bdc5-40ece7717e3a
  resourceVersion: "5878803"
  uid: 7c697912-88aa-42d8-84bd-45fbc30f21b4
spec:
  containers:
  - args:
    - --model
    - Qwen/Qwen2.5-1.5B-Instruct
    - --port
    - "8000"
    - --enable-lora
    - --max-loras
    - "4"
    - --lora-modules
    - '{"name": "food-review-0", "path": "SriSanth2345/Qwen-1.5B-Tweet-Generations",
      "base_model_name": "Qwen/Qwen2.5-1.5B"}'
    - '{"name": "food-review-1", "path": "SriSanth2345/Qwen-1.5B-Tweet-Generations",
      "base_model_name": "Qwen/Qwen2.5-1.5B"}'
    command:
    - python3
    - -m
    - vllm.entrypoints.openai.api_server
    env:
    - name: PORT
      value: "8000"
    - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
      value: "true"
    - name: VLLM_CPU_KVCACHE_SPACE
      value: "4"
    - name: HF_ENDPOINT
      value: https://hf-mirror.com
    image: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.10.2
    imagePullPolicy: Always
    livenessProbe:
      failureThreshold: 240
      httpGet:
        path: /health
        port: http
        scheme: HTTP
      initialDelaySeconds: 30
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 1
    name: lora
    ports:
    - containerPort: 8000
      name: http
      protocol: TCP
    readinessProbe:
      failureThreshold: 600
      httpGet:
        path: /health
        port: http
        scheme: HTTP
      initialDelaySeconds: 30
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: "12"
        memory: 9000Mi
      requests:
        cpu: "12"
        memory: 9000Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /data
      name: data
    - mountPath: /dev/shm
      name: shm
    - mountPath: /adapters
      name: adapters
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-2w9kw
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  initContainers:
  - env:
    - name: DYNAMIC_LORA_ROLLOUT_CONFIG
      value: /config/configmap.yaml
    image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main
    imagePullPolicy: Always
    name: lora-adapter-syncer
    resources: {}
    restartPolicy: Always
    stdin: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    tty: true
    volumeMounts:
    - mountPath: /config
      name: config-volume
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-2w9kw
      readOnly: true
  nodeName: controller-01
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir: {}
    name: data
  - emptyDir:
      medium: Memory
    name: shm
  - emptyDir: {}
    name: adapters
  - configMap:
      defaultMode: 420
      name: vllm-qwen-adapters
    name: config-volume
  - name: kube-api-access-2w9kw
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-10-16T10:16:38Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2025-10-16T10:16:38Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-10-16T10:36:06Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-10-16T10:36:06Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-10-16T10:16:35Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://756cf1020ba929555afea36c1c85ab16b1cddc72967689cbc749e2304f941e18
    image: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.10.2
    imageID: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo@sha256:22d44a924b90e309423ae19718a8cf4a6f377fdc8bb699ecfdcbb52f9625523c
    lastState:
      terminated:
        containerID: containerd://db8fd241e66c65277ff12eed5879380a7e063a2b02094394b16dd30d6c9d7da4
        exitCode: 1
        finishedAt: "2025-10-16T10:26:44Z"
        reason: Error
        startedAt: "2025-10-16T10:16:40Z"
    name: lora
    ready: true
    restartCount: 1
    started: true
    state:
      running:
        startedAt: "2025-10-16T10:26:46Z"
    volumeMounts:
    - mountPath: /data
      name: data
    - mountPath: /dev/shm
      name: shm
    - mountPath: /adapters
      name: adapters
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-2w9kw
      readOnly: true
      recursiveReadOnly: Disabled
  hostIP: 172.16.112.10
  hostIPs:
  - ip: 172.16.112.10
  initContainerStatuses:
  - containerID: containerd://d7c0b168752d0b8a4a3aa4d2c3ff784973c44d443c4439b2594102f50d00cd7e
    image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main
    imageID: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer@sha256:a2927ee562c1d1e9cfc076fb54defda356575fbf2bb515bba6e61bdd99fbab7c
    lastState: {}
    name: lora-adapter-syncer
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-10-16T10:16:37Z"
    volumeMounts:
    - mountPath: /config
      name: config-volume
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-2w9kw
      readOnly: true
      recursiveReadOnly: Disabled
  phase: Running
  podIP: 10.233.98.89
  podIPs:
  - ip: 10.233.98.89
  qosClass: Burstable
  startTime: "2025-10-16T10:16:35Z"

curl test:

root@controller-01:~# IP=10.233.98.89
root@controller-01:~# PORT=8000
root@controller-01:~# curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "food-review-1",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
HTTP/1.1 200 OK
date: Fri, 17 Oct 2025 02:15:28 GMT
server: uvicorn
content-length: 918
content-type: application/json

{"id":"cmpl-8123f689bd7f40c4965cee037470fad0","object":"text_completion","created":1760667328,"model":"food-review-1","choices":[{"index":0,"text":" Giants - 2019\n\nThe San Francisco Giants have been one of the most successful teams in Major League Baseball over the past few years, and they continue to be a force to be reckoned with. The team has won three World Series championships in the last five years, and they are currently sitting at the top of their division.\n\nIn this year's season, the Giants have shown that they can still compete with any team in the league. They have had some tough games, but they","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":10,"total_tokens":110,"completion_tokens":100,"prompt_tokens_details":null},"kv_transfer_params":null}
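For scripted verification, the same checks the curl test does by eye can be asserted programmatically; a minimal sketch in Python, run against an abridged copy of the response body above (fields trimmed, endpoint call omitted):

```python
import json

# Abridged completion response from the curl test above.
body = """{
  "object": "text_completion",
  "model": "food-review-1",
  "choices": [{"index": 0, "text": " Giants - 2019...", "finish_reason": "length"}],
  "usage": {"prompt_tokens": 10, "total_tokens": 110, "completion_tokens": 100}
}"""

resp = json.loads(body)
# Sanity checks mirroring the manual test: the LoRA adapter served the
# request, and the completion ran to the max_tokens=100 limit.
assert resp["model"] == "food-review-1"
assert resp["choices"][0]["finish_reason"] == "length"
assert resp["usage"]["completion_tokens"] == 100
print("completion OK:", resp["usage"]["total_tokens"], "tokens total")
```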
