
feat: upgrade ray to v2.53.0 and vllm to v0.11.2 for static node clusters#274

Open
Levi080513 wants to merge 5 commits into main from hw/bump-ray-vllm

Conversation

Collaborator

@Levi080513 Levi080513 commented Feb 12, 2026

Issues

Upgrade Ray from v2.44.1 to v2.53.0 and vLLM from v0.8.5 to v0.11.2 for static node clusters (serving version > v1.0.0).

Notes:

  1. Currently only NVIDIA GPU static node clusters are upgraded. AMD GPU cluster image has not been adapted or tested yet, pending resources for full validation.
  2. Upgrading from v1.0.0 to v1.0.1 involves breaking changes: Endpoints need to be updated to work with v1.0.1 clusters, as v1.0.1 may no longer support vLLM v0.8.5.

Changes

  • Upgrade Ray base image to 2.53.0 and vLLM to v0.11.2, adapt app.py for new Ray RequestRouter/RequestRouterConfig API and vLLM V1 AsyncLLM engine
  • Add NeutreeRayStatLogger to export vLLM metrics via Ray gauge, replacing the removed RayPrometheusStatLogger
  • Adapt custom schedulers (chwbl_scheduler.py, static_hash_scheduler.py) for Ray 2.53.0 RequestRouter API
  • Filter deprecated --dashboard-grpc-port and --dashboard-agent-grpc-port flags based on cluster version (> v1.0.0) in Go reconciler, with safety net filtering in start.py
  • Update vmagent relabel regex to handle both ray_vllm: (old) and ray_vllm_ (new) metric prefixes for OpenTelemetry compatibility
  • Switch from RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper to RAY_process_group_cleanup_enabled for clusters > v1.0.0, removing VLLM_SKIP_P2P_CHECK workaround for new clusters
  • Use > v1.0.0 threshold for all version checks to correctly handle pre-release versions (e.g., v1.0.1-alpha.1)
  • Add ray_version and accelerators inputs to release-serve workflow
  • Sync test_chwbl_cache_key.py with actual chwbl_scheduler.py implementation
  • Reduce Ray object store memory from default 30% to 10% to free memory for inference engines
  • Disable OTEL metrics backend (RAY_enable_open_telemetry=false) due to metrics loss issue, fall back to OpenCensus. Set in both Dockerfile (for cross-version/cross-mode compatibility) and SSH Docker run options
  • Remove redundant enable_reasoning flag, rely solely on reasoning_parser to control reasoning mode (aligned with vLLM v0.9.0+ deprecation)
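The version-threshold point above can be sketched in Python (the real check lives in the Go reconciler; this simplified comparator is illustrative only, and compares pre-release identifiers lexically rather than with full semver precedence):

```python
def parse(version: str):
    """Simplified semver parse: ((major, minor, patch), pre-release marker)."""
    core, _, pre = version.lstrip("v").partition("-")
    release = tuple(int(part) for part in core.split("."))
    # A pre-release sorts below its corresponding final release, so encode
    # "no pre-release" as the higher marker.
    return (release, (1,) if not pre else (0, pre))

def is_upgraded_cluster(version: str) -> bool:
    # "> v1.0.0" deliberately includes pre-releases of later versions,
    # e.g. v1.0.1-alpha.1, which a ">= v1.0.1" check would miss.
    return parse(version) > parse("v1.0.0")

print(is_upgraded_cluster("v1.0.1-alpha.1"))  # True
print(is_upgraded_cluster("v1.0.0"))          # False
```

This is why the PR uses `> v1.0.0` rather than `>= v1.0.1`: a pre-release such as `v1.0.1-alpha.1` sorts below `v1.0.1` but above `v1.0.0`, so the stricter threshold would misclassify it as an old cluster.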

Test

  • Manual E2E testing on NVIDIA GPU static node cluster ✅
  • Old version (v1.0.0) static node cluster backward compatibility testing ✅

@Levi080513 Levi080513 force-pushed the hw/bump-ray-vllm branch 2 times, most recently from 6ee6236 to 8b3088f Compare February 12, 2026 13:59

codecov bot commented Feb 12, 2026

Codecov Report

❌ Patch coverage is 79.41176% with 7 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
internal/cluster/ray_ssh_operation.go 75.00% 3 Missing and 4 partials ⚠️


Notes:
1. Currently only NVIDIA GPU static clusters are upgraded. AMD GPU
   clusters are pending full testing when resources become available.
2. Upgrading from v1.0.0 to v1.0.1 involves breaking changes: Endpoints
   need to be updated to work with v1.0.1 clusters, as v1.0.1 no longer
   supports vLLM v0.8.5.

Changes:
- Filter deprecated --dashboard-grpc-port and --dashboard-agent-grpc-port
  flags based on cluster version (> v1.0.0) in Go reconciler, with
  safety net filtering in start.py
- Update vmagent relabel regex to handle both ray_vllm: (old) and
  ray_vllm_ (new) metric prefixes for OpenTelemetry compatibility
- Switch from RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper
  to RAY_process_group_cleanup_enabled for clusters > v1.0.0, which
  doesn't cause parent processes to lose child exit codes
- Remove VLLM_SKIP_P2P_CHECK for new clusters since
  RAY_process_group_cleanup_enabled doesn't break vLLM's P2P check
- Use > v1.0.0 threshold for all version checks to correctly handle
  pre-release versions (e.g., v1.0.1-alpha.1)
- Sync test_chwbl_cache_key.py with actual chwbl_scheduler.py implementation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
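The dual-prefix relabeling mentioned above can be sanity-checked with a short sketch (the pattern below is illustrative, not the exact vmagent relabel config):

```python
import re

# Old OpenCensus-style metric names use the "ray_vllm:" prefix; the new
# OpenTelemetry-compatible names use "ray_vllm_". One pattern must capture both.
VLLM_METRIC = re.compile(r"^ray_vllm[:_](?P<name>.+)$")

for metric in ("ray_vllm:num_requests_running", "ray_vllm_num_requests_running"):
    m = VLLM_METRIC.match(metric)
    print(m.group("name"))  # num_requests_running in both cases
```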
@Levi080513 Levi080513 marked this pull request as ready for review February 14, 2026 05:51
@Levi080513 Levi080513 changed the title feat: upgrade ray to v2.53.0 and vllm to v0.11.2 feat: upgrade ray to v2.53.0 and vllm to v0.11.2 for static node clusters Feb 14, 2026
@Levi080513 Levi080513 requested a review from Yuyz0112 February 14, 2026 05:56
… API

- Move CHWBL custom params to initialize_state() to match Ray 2.53.0
  RequestRouter API (request_router_kwargs passes to initialize_state,
  not __init__)
- Remove redundant curr_replicas property from both schedulers (already
  provided by base RequestRouter class)
- Remove unnecessary threading.Lock (Ray Serve runs in single-threaded
  asyncio event loop)
- Fix CHWBL to use own load balancing when initial replica is not a
  candidate instead of falling back to Ray's default scheduling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
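The CHWBL behavior in the last point can be sketched without Ray: a minimal consistent-hashing-with-bounded-load pick that keeps walking the ring when the hashed replica is overloaded, instead of handing the request back to a default scheduler. All names here are hypothetical; the real router subclasses Ray's `RequestRouter`:

```python
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def choose_replica(session_key, candidates, loads, bound=1.25):
    """Pick a replica via consistent hashing with bounded load.

    candidates: replica ids placed on the hash ring
    loads: current in-flight request count per replica id
    """
    ring = sorted(candidates, key=_hash)
    avg = (sum(loads.get(r, 0) for r in ring) / len(ring)) or 1.0
    start = _hash(session_key) % len(ring)
    # Walk the ring from the hashed position; always settle on some
    # candidate rather than deferring to an external default scheduler.
    for i in range(len(ring)):
        replica = ring[(start + i) % len(ring)]
        if loads.get(replica, 0) <= bound * avg:
            return replica
    return ring[start]  # defensive: keep hash affinity if all are overloaded
```

The same session key always hashes to the same starting position, so cache affinity is preserved until a replica's load exceeds `bound * avg`, at which point the walk spills over to the next candidate on the ring.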
…assertion error

Ray 2.53.0 defaults RAY_enable_open_telemetry to true, but its reporter_agent
has an assertion that fails when different vLLM endpoints register the same
histogram metric with different bucket boundaries (due to different max_model_len).
This crashes the entire OTLP Export RPC, dropping all metrics in that batch.
Fall back to OpenCensus to avoid this issue.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
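The failure mode can be reproduced in miniature: a registry that, like the reporter_agent, asserts one set of bucket boundaries per metric name trips as soon as two endpoints derive buckets from different `max_model_len` values (the helper names below are hypothetical):

```python
def buckets_for(max_model_len: int):
    # Each endpoint derives its histogram buckets from its own max_model_len.
    return [max_model_len // 8, max_model_len // 4, max_model_len // 2, max_model_len]

registry = {}

def register_histogram(name: str, buckets):
    # Mimics the reporter_agent assertion: identical boundaries per metric name.
    if name in registry:
        assert registry[name] == buckets, f"bucket mismatch for {name}"
    registry[name] = buckets

register_histogram("vllm:request_prompt_tokens", buckets_for(4096))
try:
    register_histogram("vllm:request_prompt_tokens", buckets_for(32768))
except AssertionError as e:
    print("export batch dropped:", e)
```

In Ray the assertion aborts the whole OTLP Export RPC, so every metric in that batch is lost, not just the conflicting histogram; falling back to OpenCensus sidesteps the shared registration path.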
@Yuyz0112
Contributor

Upgrading from v1.0.0 to v1.0.1 involves breaking changes: Endpoints need to be updated to work with v1.0.1 clusters, as v1.0.1 may no longer support vLLM v0.8.5.

What will happen if users:

  1. Upgrade to v1.0.1
  2. Suspend a v0.8.5 vllm endpoint
  3. Resume the endpoint again

Can the new ray serve actor spin up?

@Levi080513
Collaborator Author

@Yuyz0112

The new Ray Serve actor will keep failing, because the static node cluster does not currently support multiple versions of the inference engine.

This issue will be resolved once support for multiple inference engine versions is available.

Levi080513 and others added 2 commits February 25, 2026 11:46
Set the env var in the Dockerfile so the fix applies regardless of
control plane version or deployment mode (K8s/SSH).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Since vLLM v0.9.0, --enable-reasoning is deprecated. The reasoning_parser
parameter alone controls whether reasoning is enabled - passing it directly
to both engine and serving layers is sufficient.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>