Commit 6ee6236

Levi080513 and claude committed
feat: upgrade ray to v2.53.0 and vllm to v0.11.2
Notes:
1. Currently only NVIDIA GPU static clusters are upgraded. AMD GPU clusters are pending full testing when resources become available.
2. Upgrading from v1.0.0 to v1.0.1 involves breaking changes: Endpoints need to be updated to work with v1.0.1 clusters, as v1.0.1 no longer supports vLLM v0.8.5.

Changes:
- Filter deprecated --dashboard-grpc-port and --dashboard-agent-grpc-port flags based on cluster version (> v1.0.0) in Go reconciler, with safety net filtering in start.py
- Normalize GPU resource names by stripping underscores when deploying to Ray 2.53.0+ clusters, fixing scheduling mismatch for old endpoints
- Update vmagent relabel regex to handle both ray_vllm: (old) and ray_vllm_ (new) metric prefixes for OpenTelemetry compatibility
- Switch from RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper to RAY_process_group_cleanup_enabled for clusters > v1.0.0, which doesn't cause parent processes to lose child exit codes
- Remove VLLM_SKIP_P2P_CHECK for new clusters since RAY_process_group_cleanup_enabled doesn't break vLLM's P2P check
- Use > v1.0.0 threshold for all version checks to correctly handle pre-release versions (e.g., v1.0.1-alpha.1)
- Sync test_chwbl_cache_key.py with actual chwbl_scheduler.py implementation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
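The "> v1.0.0" threshold and the flag filtering described in the commit message can be illustrated with a small Python sketch. This is not the Go reconciler or start.py code, which this page does not show; the helper names (`version_key`, `is_newer_than_v1_0_0`, `filter_start_args`) are hypothetical, and the sketch assumes the flags are passed in `--flag=value` form.

```python
import re

# Flags named in the commit message as deprecated for clusters > v1.0.0.
DEPRECATED_FLAGS = ("--dashboard-grpc-port", "--dashboard-agent-grpc-port")

_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?$")


def version_key(version):
    """Return a sortable key; a pre-release sorts below its final release."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    major, minor, patch, prerelease = match.groups()
    # Simplification: only "pre-release vs. release" is ranked here;
    # pre-release identifiers are not compared with each other.
    return (int(major), int(minor), int(patch), 0 if prerelease else 1)


def is_newer_than_v1_0_0(cluster_version):
    """True for v1.0.1 and also for pre-releases such as v1.0.1-alpha.1."""
    return version_key(cluster_version) > version_key("v1.0.0")


def filter_start_args(args, cluster_version):
    """Drop the deprecated gRPC port flags for clusters newer than v1.0.0."""
    if not is_newer_than_v1_0_0(cluster_version):
        return list(args)
    # Assumption: each flag arrives as a single "--flag=value" token.
    return [a for a in args if a.split("=", 1)[0] not in DEPRECATED_FLAGS]
```

Comparing against v1.0.0 rather than v1.0.1 matters because a pre-release such as v1.0.1-alpha.1 sorts below v1.0.1, so a ">= v1.0.1" check would treat it as an old cluster.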
1 parent b1b62d8 commit 6ee6236

21 files changed: +771 -480 lines changed

.github/workflows/release-serve.yaml

Lines changed: 12 additions & 2 deletions
@@ -11,6 +11,16 @@ on:
         description: "the release version (e.g : 0.1.0)"
         required: true
         type: string
+      ray_version:
+        description: "the ray version branch/tag to build from"
+        required: false
+        type: string
+        default: "ray-2.53.0-neutree"
+      accelerators:
+        description: "accelerator types to build (e.g : gpu, amd-gpu)"
+        required: false
+        type: string
+        default: "gpu"
 
 jobs:
   build-amd64-image:
@@ -28,7 +38,7 @@ jobs:
           username: ${{ secrets.SERVE_IMAGE_PUSH_USERNAME }}
           password: ${{ secrets.SERVE_IMAGE_PUSH_TOKEN }}
       - name: build amd64 image
-        run: cd cluster-image-builder; export ARCH=amd64 VERSION=${{ github.event.inputs.version }}; make docker-build && make docker-push
+        run: cd cluster-image-builder; export ARCH=amd64 VERSION=${{ github.event.inputs.version }} RAY_VERSION=${{ github.event.inputs.ray_version }} ACCELERATORS="${{ github.event.inputs.accelerators }}"; make docker-build && make docker-push
         env:
           IMAGE_PROJECT: ${{ secrets.RELEASE_SERVE_IMAGE_PROJECT }}
           IMAGE_REPO: ${{ secrets.SERVE_IMAGE_REPO }}
@@ -51,7 +61,7 @@ jobs:
       - name: push manifests
         run: |
           cd cluster-image-builder
-          export VERSION=${{ github.event.inputs.version }} ALL_ARCH=amd64; make docker-push-manifest
+          export VERSION=${{ github.event.inputs.version }} RAY_VERSION=${{ github.event.inputs.ray_version }} ALL_ARCH=amd64 ACCELERATORS="${{ github.event.inputs.accelerators }}"; make docker-push-manifest
 
         env:
           IMAGE_PROJECT: ${{ secrets.RELEASE_SERVE_IMAGE_PROJECT }}

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -35,3 +35,6 @@ scripts/dashboard/output
 
 cluster-image-builder/downloader
 scripts/builder/dist
+
+claude.md
+.claude

cluster-image-builder/Dockerfile

Lines changed: 6 additions & 6 deletions
@@ -1,22 +1,22 @@
 ARG RAY_BASE_IMAGE
-ARG RAY_BRANCH="ray-2.43.0-neutree"
+ARG RAY_COMMIT
 ARG COMMON_WORKDIR=/app
-ARG RAY_REPO="https://github.com/neutree-ai/ray.git"
+ARG RAY_REPO
 ARG RAY_BUILD_BASE_IMAGE="quay.io/pypa/manylinux2014_x86_64:2024-07-02-9ac04ee"
 
 FROM alpine/git:v2.47.2 as ray_fetch
 ARG RAY_REPO
-ARG RAY_BRANCH
+ARG RAY_COMMIT
 ARG COMMON_WORKDIR
 WORKDIR ${COMMON_WORKDIR}
 RUN git clone ${RAY_REPO} \
     && cd ray \
-    && git checkout ${RAY_BRANCH}
+    && git checkout ${RAY_COMMIT}
 
 FROM ${RAY_BUILD_BASE_IMAGE} AS build_ray
 ARG COMMON_WORKDIR
-ARG RAY_BRANCH
-ENV BUILDKITE_COMMIT=${RAY_BRANCH}
+ARG RAY_COMMIT
+ENV BUILDKITE_COMMIT=${RAY_COMMIT}
 ENV TRAVIS_COMMIT=${BUILDKITE_COMMIT}
 ENV BUILD_ONE_PYTHON_ONLY=py311
 ENV RAY_DISABLE_EXTRA_CPP=1

cluster-image-builder/Dockerfile.rocm

Lines changed: 6 additions & 6 deletions
@@ -2,8 +2,8 @@
 # default base image
 ARG VLLM_BRANCH="v0.8.5-neutree"
 ARG VLLM_REPO="https://github.com/neutree-ai/vllm.git"
-ARG RAY_BRANCH="ray-2.43.0-neutree"
-ARG RAY_REPO="https://github.com/neutree-ai/ray.git"
+ARG RAY_COMMIT
+ARG RAY_REPO
 ARG USE_CYTHON="0"
 ARG BUILD_RPD="1"
 ARG COMMON_WORKDIR=/app
@@ -54,16 +54,16 @@ COPY --from=build_vllm ${COMMON_WORKDIR}/vllm/.buildkite /.buildkite
 # Ray build stages
 FROM base AS ray_fetch
 ARG RAY_REPO
-ARG RAY_BRANCH
+ARG RAY_COMMIT
 RUN git clone ${RAY_REPO} \
     && cd ray \
-    && git checkout ${RAY_BRANCH}
+    && git checkout ${RAY_COMMIT}
 
 
 FROM ${RAY_BUILD_BASE_IMAGE} AS build_ray
 ARG COMMON_WORKDIR
-ARG RAY_BRANCH
-ENV BUILDKITE_COMMIT=${RAY_BRANCH}
+ARG RAY_COMMIT
+ENV BUILDKITE_COMMIT=${RAY_COMMIT}
 ENV TRAVIS_COMMIT=${BUILDKITE_COMMIT}
 ENV BUILD_ONE_PYTHON_ONLY=py311
 ENV RAY_DISABLE_EXTRA_CPP=1

cluster-image-builder/Makefile

Lines changed: 7 additions & 3 deletions
@@ -10,11 +10,15 @@ ARCH ?= amd64
 ACCELERATORS ?= gpu amd-gpu
 ALL_ARCH ?= amd64 arm64
 
-RAY_BASE_IMAGE ?= rayproject/ray:2.43.0-py311-cu121
+RAY_BASE_IMAGE ?= rayproject/ray:2.53.0-py311-cu121
 ifeq ($(ARCH), arm64)
 RAY_BASE_IMAGE := $(RAY_BASE_IMAGE)-aarch64
 endif
 
+RAY_REPO ?= https://github.com/neutree-ai/ray.git
+RAY_VERSION ?= ray-2.53.0-neutree
+RAY_COMMIT ?= $(shell git ls-remote $(RAY_REPO) $(RAY_VERSION) | cut -f1 | head -c 7)
+
 ROCM_BASE_IMAGE ?= $(NEUTREE_SERVE_IMAGE):rocm-base
 
 .PHONY: docker-build
@@ -23,11 +27,11 @@ docker-build: ## Run docker-build-* targets for all the images
 
 .PHONY: docker-build-gpu
 docker-build-gpu: prepare ## Build the GPU image
-	docker build --build-arg RAY_BASE_IMAGE=$(RAY_BASE_IMAGE) -f Dockerfile -t $(NEUTREE_SERVE_IMAGE)-$(ARCH):$(IMAGE_TAG) .
+	docker build --build-arg RAY_BASE_IMAGE=$(RAY_BASE_IMAGE) --build-arg RAY_COMMIT=$(RAY_COMMIT) --build-arg RAY_REPO=$(RAY_REPO) -f Dockerfile -t $(NEUTREE_SERVE_IMAGE)-$(ARCH):$(IMAGE_TAG) .
 
 .PHONY: docker-build-amd-gpu
 docker-build-amd-gpu: prepare ## Build the AMD GPU image
-	docker build --build-arg BASE_IMAGE=$(ROCM_BASE_IMAGE) -f Dockerfile.rocm -t $(NEUTREE_SERVE_IMAGE)-$(ARCH):$(IMAGE_TAG)-rocm .
+	docker build --build-arg BASE_IMAGE=$(ROCM_BASE_IMAGE) --build-arg RAY_COMMIT=$(RAY_COMMIT) --build-arg RAY_REPO=$(RAY_REPO) -f Dockerfile.rocm -t $(NEUTREE_SERVE_IMAGE)-$(ARCH):$(IMAGE_TAG)-rocm .
 
 .PHONY: docker-push
 docker-push: ## Run docker-push-* targets for all the images
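The `RAY_COMMIT` variable in the Makefile diff above pipes `git ls-remote` through `cut -f1 | head -c 7`. A rough Python equivalent of that text processing is sketched below to show what value reaches `docker build`. The function name and the sample hash in the usage note are made up for illustration; the sketch only parses text and does not call git.

```python
def short_commit_from_ls_remote(ls_remote_output, length=7):
    """Mimic `cut -f1 | head -c 7` applied to `git ls-remote` output."""
    # `cut -f1` keeps the first tab-separated field of every line.
    first_fields = "\n".join(
        line.split("\t", 1)[0] for line in ls_remote_output.splitlines()
    )
    # `head -c N` keeps the first N bytes of the whole stream, so only the
    # first matching ref contributes to the result.
    return first_fields[:length]
```

For a line such as `0123456789abcdef0123456789abcdef01234567<TAB>refs/heads/ray-2.53.0-neutree`, the function returns `0123456`. When `git ls-remote` finds no matching ref it prints nothing, so the result is an empty string and `RAY_COMMIT` would be passed to the build empty.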

cluster-image-builder/accelerator/amd_gpu.py

Lines changed: 3 additions & 3 deletions
@@ -37,9 +37,9 @@ def get_accelerator_counts():
         market_name = device.get("market_name")
         if market_name is None:
             continue
-        accelerator_type = market_name.replace(" ","_")
-        if not accelerator_type.startswith("AMD_"):
-            accelerator_type = "AMD_" + accelerator_type
+        accelerator_type = market_name.replace(" ","")
+        if not accelerator_type.startswith("AMD"):
+            accelerator_type = "AMD" + accelerator_type
         if accelerator_counts.get(accelerator_type) is None:
             accelerator_counts[accelerator_type] = 1
         else:

cluster-image-builder/accelerator/gpu.py

Lines changed: 2 additions & 2 deletions
@@ -31,9 +31,9 @@ def count_nvidia_accelerators(gpu_names):
     """
     accelerator_counts = {}
     for gpu in gpu_names:
-        accelerator_type = gpu.replace(" ","_")
+        accelerator_type = gpu.replace(" ","")
         if not accelerator_type.startswith("NVIDIA_"):
-            accelerator_type = "NVIDIA_" + accelerator_type
+            accelerator_type = "NVIDIA" + accelerator_type
         if accelerator_counts.get(accelerator_type) is None:
             accelerator_counts[accelerator_type] = 1
         else:
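The name normalization in the two accelerator diffs can be sketched as a single Python helper. In the gpu.py hunk the unchanged context line still tests `startswith("NVIDIA_")`; once spaces are removed rather than replaced by underscores, a name such as "NVIDIA A100" would presumably no longer match that check and would receive the prefix a second time. The sketch below assumes the intended check is against the bare vendor name, as the amd_gpu.py hunk does for "AMD". The function names and sample device names are hypothetical.

```python
def normalize_accelerator_name(raw_name, vendor):
    """Remove spaces and ensure the name starts with the vendor prefix."""
    name = raw_name.replace(" ", "")
    # Assumption: compare against the bare vendor string (no underscore),
    # so names that already carry the vendor are not prefixed again.
    if not name.startswith(vendor):
        name = vendor + name
    return name


def strip_underscores(resource_name):
    """Map an old-style resource name onto the new underscore-free form."""
    return resource_name.replace("_", "")
```

Under these assumptions, an old-style name such as `NVIDIA_Tesla_T4` and a freshly detected "Tesla T4" both end up as `NVIDIATeslaT4`, which is the kind of match the commit message describes for old endpoints on Ray 2.53.0+ clusters.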
Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 llama_cpp_python==0.3.7
-vllm==0.8.5
-ray[serve]==2.43.0
+vllm==0.11.2
+ray[serve]==2.53.0
 numpy==1.26.4
 opencv-python-headless==4.11.0.86
Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 llama_cpp_python==0.3.7
-ray[serve]==2.43.0
+ray[serve]==2.53.0
 numpy==1.26.4
 opencv-python-headless==4.11.0.86
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
+import logging
+
+from ray import serve
+from vllm.v1.metrics.ray_wrappers import (
+    RayPrometheusStatLogger,
+    RaySpecDecodingProm,
+    RayKVConnectorPrometheus,
+    RayGaugeWrapper,
+    RayCounterWrapper,
+    RayHistogramWrapper,
+)
+
+logger = logging.getLogger("ray.serve")
+
+
+def _make_extended_metric_cls(base_cls, extra_labels):
+    """Create a metric wrapper that transparently extends labelnames."""
+
+    class Extended(base_cls):
+        def __init__(self, name, documentation=None, labelnames=None, **kwargs):
+            extended_names = list(labelnames or []) + list(extra_labels.keys())
+            super().__init__(name=name, documentation=documentation,
+                             labelnames=extended_names, **kwargs)
+
+        def labels(self, *args, **kwargs):
+            if args:
+                args = args + tuple(extra_labels.values())
+            if kwargs:
+                kwargs.update(extra_labels)
+            return super().labels(*args, **kwargs)
+
+    return Extended
+
+
+def _make_extended_spec_decoding_cls(base_cls, extra_labels):
+    """Extend SpecDecodingProm with custom labels via its _counter_cls."""
+
+    class Extended(base_cls):
+        _counter_cls = _make_extended_metric_cls(RayCounterWrapper, extra_labels)
+
+    return Extended
+
+
+def _make_extended_kv_connector_cls(base_cls, extra_labels):
+    """Extend KVConnectorPrometheus with custom labels via its _cls vars."""
+
+    class Extended(base_cls):
+        _gauge_cls = _make_extended_metric_cls(RayGaugeWrapper, extra_labels)
+        _counter_cls = _make_extended_metric_cls(RayCounterWrapper, extra_labels)
+        _histogram_cls = _make_extended_metric_cls(RayHistogramWrapper, extra_labels)
+
+    return Extended
+
+
+class NeutreeRayStatLogger(RayPrometheusStatLogger):
+    """RayPrometheusStatLogger with Ray Serve context labels injected.
+
+    Transparently extends all vLLM metrics with deployment, replica,
+    and application labels from the Ray Serve replica context.
+    """
+
+    def __init__(self, vllm_config, engine_indexes=None):
+        extra_labels = {}
+        try:
+            ctx = serve.get_replica_context()
+            extra_labels = {
+                "deployment": ctx.deployment,
+                "replica": ctx.replica_tag,
+            }
+            if hasattr(ctx, "app_name"):
+                extra_labels["application"] = ctx.app_name
+        except RuntimeError:
+            logger.warning(
+                "NeutreeRayStatLogger: not running in Ray Serve context, "
+                "skipping custom labels"
+            )
+
+        if extra_labels:
+            self._gauge_cls = _make_extended_metric_cls(
+                RayGaugeWrapper, extra_labels)
+            self._counter_cls = _make_extended_metric_cls(
+                RayCounterWrapper, extra_labels)
+            self._histogram_cls = _make_extended_metric_cls(
+                RayHistogramWrapper, extra_labels)
+            self._spec_decoding_cls = _make_extended_spec_decoding_cls(
+                RaySpecDecodingProm, extra_labels)
+            self._kv_connector_cls = _make_extended_kv_connector_cls(
+                RayKVConnectorPrometheus, extra_labels)
+
+        super().__init__(vllm_config, engine_indexes)
+
+        logger.info(
+            f"NeutreeRayStatLogger initialized with extra labels: "
+            f"{list(extra_labels.keys())}"
+        )
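The label-extension pattern in the new stat logger file can be exercised without Ray or vLLM installed. The sketch below reuses the same class-factory logic against a stand-in metric class: `FakeMetric` is hypothetical and exists only to make the behaviour observable; it is not part of vLLM's `ray_wrappers` module, and the label values in the usage note are invented.

```python
class FakeMetric:
    """Stand-in for a metric wrapper (hypothetical, for illustration only)."""

    def __init__(self, name, documentation=None, labelnames=None):
        self.name = name
        self.labelnames = list(labelnames or [])
        self.bound = None

    def labels(self, *args, **kwargs):
        # Pair positional values with labelnames, then apply keyword values.
        bound = dict(zip(self.labelnames, args))
        bound.update(kwargs)
        self.bound = bound
        return self


def make_extended_metric_cls(base_cls, extra_labels):
    """Build a subclass that appends extra_labels to every metric."""

    class Extended(base_cls):
        def __init__(self, name, documentation=None, labelnames=None, **kwargs):
            names = list(labelnames or []) + list(extra_labels)
            super().__init__(name=name, documentation=documentation,
                             labelnames=names, **kwargs)

        def labels(self, *args, **kwargs):
            # Positional callers get the extra values appended in order;
            # keyword callers get them merged by name.
            if args:
                args = args + tuple(extra_labels.values())
            if kwargs:
                kwargs.update(extra_labels)
            return super().labels(*args, **kwargs)

    return Extended
```

With `extra_labels = {"deployment": "llm", "replica": "r1"}` and a metric declared with `labelnames=["model_name"]`, both `labels("qwen")` and `labels(model_name="qwen")` bind `model_name`, `deployment`, and `replica`. Because the factory returns a subclass, callers that construct metrics through `_gauge_cls`, `_counter_cls`, or `_histogram_cls` need no changes.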
