daily merge: master → main 2025-12-26 #721
Conversation
The data resource cleanup tests experienced flaky behavior because they relied on `SplitCoordinator.start_epoch.remote()` to indirectly request resources from the `AutoscalingRequester`; there is no guarantee that a scaling request has been made by the time of the assertion. This PR: - Implements a 3-second timeout to wait for resource requests to be made - Makes explicit autoscaling requests via `try_trigger_scaling` calls instead of relying on `SplitCoordinator.start_epoch` - Basic cleanup --------- Signed-off-by: JasonLi1909 <[email protected]>
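The timeout-based wait described above is, in spirit, a poll-until-true loop. The sketch below is illustrative only; the `get_pending_requests` accessor is hypothetical and not part of Ray's test utilities:

```python
import time


def wait_for_condition(predicate, timeout_s: float = 3.0, interval_s: float = 0.1) -> bool:
    """Poll `predicate` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_s)
    return False


# Hypothetical usage: wait until the autoscaling requester has recorded a request
# before asserting on it, instead of asserting immediately after start_epoch().
# assert wait_for_condition(lambda: len(requester.get_pending_requests()) > 0)
```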
not release blocking Signed-off-by: Lonnie Liu <[email protected]>
## Description Implements dark mode for the Ray Dashboard. - Toggle button in nav bar switches between light/dark themes - Preference saved to localStorage and persists across sessions - Respects system theme preference on first load - 57 files updated to replace hardcoded colors with theme palette values - All components work properly in both modes (icons, charts, embedded Grafana) ## Related issues Closes ray-project#49075 ## Additional information ### Note about PR ray-project#54783 Started this before discovering ray-project#54783. This PR covers more components (57 vs 21 files) and integrates into existing GlobalContext. Happy to collaborate with @ddishi if needed. ### Implementation - Added `themeMode` and `toggleTheme` to GlobalContext in `App.tsx` - Theme definitions in `theme.ts` with full palettes for both modes - Dark theme uses deep blacks (#0F0F0F, #1A1A1A) - Replaced hardcoded colors with theme values (`theme.palette.*`) - ~~Renamed GRAM to VRAM for standard terminology~~ ### Screenshots  <details> <summary><b>Overview Page</b></summary>   </details> <details> <summary><b>Cluster/Nodes Page</b></summary>  </details> <details> <summary><b>Jobs Page</b></summary>   </details> <details> <summary><b>Actors Page</b></summary>   </details> <details> <summary><b>Serve Deployments Page</b></summary>   </details> <details> <summary><b>Metrics Page</b></summary>   </details> <details> <summary><b>Logs Page</b></summary>   </details> ### Testing Manually tested all pages in both themes. Toggle works without refresh, preference persists, system detection works (at least on my machine :D) . Given the scope, I likely missed some edge cases or specific components. Happy to fix anything that comes up in review. --------- Signed-off-by: Julius Arkenberg <[email protected]>
Signed-off-by: joshlee <[email protected]>
anyscale#573 Signed-off-by: Lonnie Liu <[email protected]>
…#59196) ## Description The GPU [test](https://github.com/ray-project/ray/blob/8b4a1ee98fd92a972d07e9c08af19f02829dd40f/python/ray/tests/test_tensor_zero_copy_serialization.py#L171) was failing on Ray postmerge CI. We should re-add it with a CPU version that runs in premerge and a GPU version that runs on that build: - Added one entry to [this file](https://github.com/ray-project/ray/blob/5328b3250f8a15dc35c391869c24f0db0f821a32/python/ray/tests/BUILD.bazel#L503), in the "small" tests - Added back the GPU test entry ## Related issues Fixes ray-project#59195 --------- Signed-off-by: Haichuan Hu <[email protected]>
…-project#59084) ## Description The [PR](ray-project#57639) supporting zero-copy serialization for read-only tensors has been merged. We need to add documentation to inform users about this new feature. The docs should mention that torch tensors are not zero-copy by default, show an example of how to enable zero-copy, and, most importantly, explain the limitations and possible issues. @tianyi-ge ## Related issues Closes ray-project#59081 --------- Signed-off-by: Haichuan Hu <[email protected]>
… = token or k8s (ray-project#59168) ## Description Today, the authentication dialog box opens every time any Ray dashboard API call fails with 401/403, regardless of the authentication mode being used. This can lead to confusing or undesirable behaviour when token auth is not enabled. In this PR I have added an additional check to verify the auth mode before opening the dialog box. --------- Signed-off-by: sampan <[email protected]> Co-authored-by: sampan <[email protected]>
…roject#59180) After fully starting up, the runtime-env agent does not handle SIGTERM and only responds to SIGKILL. The current test that kills the runtime-env agent with SIGTERM only passes because it sends the signal before the agent is fully initialized. Once the agent finishes startup, it no longer responds to SIGTERM, causing the test to become timing-dependent. This PR temporarily removes the test because it is potentially highly flaky; any change that makes the agent start faster will cause the test to fail. **The raylet graceful-exit behavior is still covered by the test where the agent is terminated with SIGKILL.** See the related issue for details: ray-project#59162 ## Related issues Closes ray-project#59162 ## Additional information Test: ```shell ray start --head --port=6379 ps aux | grep "runtime_env/agent/main.py" | grep -v grep #ubuntu 3946849 0.2 0.1 2146984 115396 pts/2 Sl 18:49 0:00 /home/ubuntu/ray/venv/bin/python -u /home/ubuntu/ray/python/ray/_private/runtime_env/agent/main.py # Now, wait for a while kill -15 3946849 ps -p 3946849 -o pid,state,comm 2>/dev/null && echo "Process still alive!" || echo "Process is dead" # PID S COMMAND #3946849 S python #Process still alive! ``` Signed-off-by: yicheng <[email protected]> Co-authored-by: yicheng <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
…tests (ray-project#59215) ## Description The expression `bool(os.environ.get("RAY_PYTEST_USE_GPU", 0))` incorrectly evaluates to `True` when the environment variable is explicitly set to "0". This happens because `os.environ.get()` returns a string when the variable is set, and `bool("0")` is `True` (non-empty string). The common pattern used elsewhere in this codebase is `os.environ.get("VAR") == "1"`, which correctly handles all cases. If someone sets `RAY_PYTEST_USE_GPU=0` to disable GPU tests, the tests would still run unexpectedly. @tianyi-ge ## Related issues Closes ray-project#59213 Signed-off-by: Haichuan Hu <[email protected]>
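A minimal, self-contained demonstration of the truthiness pitfall and the explicit-comparison pattern the PR adopts:

```python
import os

os.environ["RAY_PYTEST_USE_GPU"] = "0"

# Buggy check: os.environ.get returns the string "0", and any non-empty string is truthy.
assert bool(os.environ.get("RAY_PYTEST_USE_GPU", 0)) is True

# Fixed pattern: compare against "1" explicitly, as done elsewhere in the codebase.
use_gpu = os.environ.get("RAY_PYTEST_USE_GPU") == "1"
assert use_gpu is False
```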
## Description The Footsies environment can be rather spammy. This PR lets us suppress the outputs or redirect them to a file. --------- Signed-off-by: Artur Niederfahrenhorst <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Kamil Kaczmarek <[email protected]>
upgrade h5py in preparation for python 3.13 image building h5py==3.10.0 -> h5py==3.12.1 h5py 3.12.1 has python 3.13 wheels --------- Signed-off-by: elliot-barn <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
…y-project#58992) upgrading test requirements in preparation for python 3.13 image building pillow==10.3.0 -> pillow 10.4.0 scikit-learn==1.3.2 -> scikit-learn>=1.5.2 for py313 images --------- Signed-off-by: elliot-barn <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
Signed-off-by: yaommen <[email protected]>
… lint (ray-project#59247) Our CI uses Bazel to run Python tests (not the pytest CLI). If we don't call `pytest.main` in a test module, the tests aren't actually run. ray-project#58816 fixed this issue for `test_hash_shuffle`. As a follow-up, this PR removes `test_hash_shuffle` from the list of files excluded from the `pytest.main` lint. Signed-off-by: Balaji Veeramani <[email protected]>
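For context, the pattern the lint enforces looks roughly like the following; the exact pytest flags vary per module and are illustrative here:

```python
import sys

import pytest


def test_example():
    assert 1 + 1 == 2


if __name__ == "__main__":
    # Bazel executes the module as a plain script; without this block,
    # no tests in the file would actually run under CI.
    sys.exit(pytest.main(["-v", __file__]))
```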
Makes the CancelTask RPC fault tolerant. Created an intermediary RPC similar to what was done in ray-project#57648, in that when the force_exit flag is enabled for cancel, the executor worker is shut down gracefully. However, we have no way of determining on the owner core worker whether the shutdown was successful, hence we send the cancel request to the raylet via a new RPC, CancelLocalTask, that guarantees the worker is killed. Added a Python test to verify retry behavior, leaving out the C++ test after talking to @dayshah since it is a bit complicated: we would need to take into account all orderings of the owner/executor states in the cancellation process. In the current task submission path, we don't keep track of the raylet address of the worker when we receive the PushTask RPC. It's a bit complicated to do this since the GCS keeps track of actor lifecycles including requesting leases, hence instead of touching the hot path (task submission) we decided to only complicate the cancellation path. Upon receiving a CancelTask RPC, we query the GCS node cache to get the node info; only if it's not in the cache do we then query the GCS. Unfortunately, the GCS node cache is currently not thread safe and should only be accessed on the main io service, hence we refactored Normal/ActorTaskSubmitter so that it posts the portion of the code that accesses the cache onto the main io service. There was also a race condition in CancelLocalTask/KillLocalActor where send_reply_callback could be triggered twice if we receive the response from CancelTask/KillActor but the worker is not evicted from the raylet worker pool immediately, so the callback in execute_after could also trigger. Added a replied boolean flag to guard against this. CancelTask behavior with force_kill=true has been modified to trigger a SIGKILL after a set amount of time if the graceful shutdown from the worker hangs. Lastly, the actor task retry did not use the config used in the normal task retry timer; updated this. --------- Signed-off-by: joshlee <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Co-authored-by: Jiajun Yao <[email protected]>
…ect#59172) ## Description Currently, when transforming a pandas block to an arrow block, we convert a column into `pa.null` if all the values in the TensorArray are NaN or empty strings. However, an all-NaN column might actually represent something and can legitimately occur in a transformation. https://github.com/ray-project/ray/blob/8b4a1ee98fd92a972d07e9c08af19f02829dd40f/python/ray/air/util/tensor_extensions/pandas.py#L926-L958 This could trigger an error when we try to convert that arrow block back to pandas (try running the repro script from the issue). This PR skips coercing to `pa.null` if the `col.dtype` is `TensorDtype`, which tolerates all-NaN or all-empty-string tables. ## Related issues Closes ray-project#59087 --------- Signed-off-by: You-Cheng Lin <[email protected]> Signed-off-by: You-Cheng Lin <[email protected]> Co-authored-by: Alexey Kudinkin <[email protected]>
Related PRs that we should review when upgrading fully: - ray-project#58820 - Note from Rui: when we bump to a new vLLM version, we should go with 0.11.2 instead of 0.11.1; it fixes a Ray multi-node PP regression that was introduced when adding torch-based PP https://github.com/vllm-project/vllm/releases/tag/v0.11.2 Issues: - closes ray-project#58937 - closes ray-project#58973 - closes ray-project#58702 --------- Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: Seiji Eicher <[email protected]> Signed-off-by: Nikhil Ghosh <[email protected]> Signed-off-by: Nikhil G <[email protected]> Signed-off-by: elliot-barn <[email protected]> Co-authored-by: Seiji Eicher <[email protected]> Co-authored-by: Nikhil Ghosh <[email protected]> Co-authored-by: Nikhil G <[email protected]> Co-authored-by: elliot-barn <[email protected]>
…t#59254) also add missing `click` dependencies for scripts Signed-off-by: Lonnie Liu <[email protected]>
for ray image releases Signed-off-by: Lonnie Liu <[email protected]>
…9252) they are not captured by bazel and do not work anymore Signed-off-by: Lonnie Liu <[email protected]>
stop using python 3.9. Signed-off-by: Lonnie Liu <[email protected]>
…ect#59256) ## Description `test_preserve_hash_shuffle_blocks` has been flaking consistently. To mitigate the flakiness, this PR bumps the test size from "small" to "medium". ``` [2025-12-06T07:06:32Z] //python/ray/data:test_preserve_hash_shuffle_blocks TIMEOUT in 3 out of 3 in 63.4s -- [2025-12-06T07:06:32Z] Stats over 3 runs: max = 63.4s, min = 60.1s, avg = 62.3s, dev = 1.6s [2025-12-06T07:06:32Z] /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray/bazel-out/k8-opt/testlogs/python/ray/data/test_preserve_hash_shuffle_blocks/test.log [2025-12-06T07:06:32Z] /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray/bazel-out/k8-opt/testlogs/python/ray/data/test_preserve_hash_shuffle_blocks/test_attempts/attempt_1.log [2025-12-06T07:06:32Z] /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray/bazel-out/k8-opt/testlogs/python/ray/data/test_preserve_hash_shuffle_blocks/test_attempts/attempt_2.log [2025-12-06T07:06:32Z] ``` See also ray-project#58988 Signed-off-by: Balaji Veeramani <[email protected]>
`test_download_expression` has been occasionally timing out recently. To mitigate this flakiness, this PR bumps the test size from "small" to "medium". Signed-off-by: Balaji Veeramani <[email protected]>
…oject#59258) `test_map_batches_e2e` occasionally flakes because it assumes that the Dataset outputs data in a specific order, but that order can vary if tasks finish out of order. ``` [2025-12-04T21:34:13Z] def test_map_batches_e2e(ray_start_regular_shared_2_cpus): -- [2025-12-04T21:34:13Z] ds = ray.data.range(5) [2025-12-04T21:34:13Z] ds = ds.map_batches(column_udf("id", lambda x: x)) [2025-12-04T21:34:13Z] > assert extract_values("id", ds.take_all()) == list(range(5)), ds [2025-12-04T21:34:13Z] E AssertionError: MapBatches(<lambda>) [2025-12-04T21:34:13Z] E +- Dataset(num_rows=5, schema={id: int64}) [2025-12-04T21:34:13Z] E assert [0, 1, 2, 4, 3] == [0, 1, 2, 3, 4] [2025-12-04T21:34:13Z] E At index 3 diff: 4 != 3 [2025-12-04T21:34:13Z] E Full diff: [2025-12-04T21:34:13Z] E - [0, 1, 2, 3, 4] [2025-12-04T21:34:13Z] E ? --- [2025-12-04T21:34:13Z] E + [0, 1, 2, 4, 3] [2025-12-04T21:34:13Z] E ? +++ [2025-12-04T21:34:13Z] ``` To mitigate this issue, this PR sorts the output values. Signed-off-by: Balaji Veeramani <[email protected]>
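An order-agnostic version of that assertion can be written by sorting the output before comparing. This sketch assumes a recent Ray where `ray.data.range` produces an `id` column; the original test's `column_udf`/`extract_values` helpers are not reproduced here:

```python
import ray

ds = ray.data.range(5).map_batches(lambda batch: batch)
# Sorting makes the assertion independent of task completion order.
values = sorted(row["id"] for row in ds.take_all())
assert values == list(range(5))
```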
## Why are these changes needed? In this PR, we refactor MetricsLogger and Stats into a design that is more sustainable (more tests + ability for users to modify behaviour through configuration) and also improves on a number of key issues. **The main issues with the old design are:** **Too many useless combinations of input arguments to methods like MetricsLogger.log_value()** These increase the test surface area a lot; this is solved by refactoring the legacy Stats class into multiple stats classes such as StatsBase, MaxStats or LifetimeSumStats. These classes represent the few valid combinations of input arguments such as (reduce="mean", window=10). **Many metrics are easily skewed because windows are applied in parallel components AND in the root metrics logger** This is opaque and not easy to work around. This PR solves this by treating stats at the root metrics logger differently (stats have variables "is_root" and "is_leaf" to align their behaviour when pushing, reducing, peeking or merging values with where they are). For example, stats in the root metrics logger are mostly used to aggregate metrics and therefore we don't apply a window, even if the values at leafs are logged with a window. This also has the nice side effect that we don't need to specify how to prioritize values when merging anymore (see the deprecated "reduce_per_index_on_aggregate" argument). **Different modes of aggregation (sometimes we use MetricsLogger.log_dict and sometimes MetricsLogger.aggregate)** This leads to different aggregation patterns. For example, we use MetricsLogger.log_dict() to log results from Learner updates. This can happen n times per training iteration, and stats with a window of 1 are overridden on each call such that we report only the last value logged instead of actual averages. **Skewed throughputs** Before this PR, throughputs are calculated as the exponentially weighted average over increments divided by the time passed since the last update for the given metric. This can lead to a noticeable bias; anecdotally, we found that some throughput metrics were off by a few % in the cases we looked at. This PR improves on this by calculating all throughputs by dividing the sum of all increments per iteration by the total iteration time. Other notable changes: - No need for "tensor mode". Keep all tensors as they are, detach and move to CPU as soon as we peek, reduce or create a state. Therefore, when instantiating, all values are on CPU and implicitly converted to Tensors and moved to GPU as soon as values are logged. - Discriminate between Stats being leaf/non-leaf and root/non-root. Leaf stats can be pushed to. Non-leaf stats are used to aggregate. Root stats can have a special function (like with LifetimeSum). - Discriminate between MetricsLoggers being root and non-root to inform Stats objects whether they are instantiated inside a root MetricsLogger. This PR includes a mini migration guide that points out what needs to happen from the user's perspective. --------- Signed-off-by: Artur Niederfahrenhorst <[email protected]> Co-authored-by: Daniel Sperber <[email protected]>
also removes corebuild, and will only use `corebuild-multipy` Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: Lonnie Liu <[email protected]>
it got broken after python version upgrade Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: dayshah <[email protected]>
…um (ray-project#59468) Signed-off-by: dayshah <[email protected]>
…tive solution (ray-project#56082) ## Why are these changes needed? When we retry EnvRunner._sample, we call EnvRunner._sample again. This makes for a recursive solution, leaving already-done episodes in done_episodes_to_return. This PR makes it so that ... - We stay true to the promise of the docstring of the EnvRunner.sample() methods that we return AT LEAST n timesteps, even if envs are restarted. Before, we would not reset how many timesteps had been sampled when resetting the environment (and thus starting a new episode to collect chunks from). - ... `done_episodes_to_return` does not start from zero if we retry in the new recursive call - ... we can make an arbitrary number of calls to env_runner.sample(), sampling episodes and timesteps as we like. Before, we'd break if the number of episodes or timesteps was reached, thus leaving other episodes in a dangling state even if they were finished. (Run the included test on the old version of the code to confirm.) - ... we don't recurse, which removes a source of risk for future memory leaks --------- Co-authored-by: Kamil Kaczmarek <[email protected]>
## Description Fix a bug where the ingress deployment name could be modified if a child deployment has the same name. ## Related issues Fixes ray-project#53295 ## Additional information Because there is a child app with the same name as the ingress deployment app, the ingress deployment app name was modified during the _build_app_recursive function. Therefore we should use the modified name instead. Another solution is changing the child_app name instead of the ingress deployment app name. --------- Signed-off-by: Le Duc Manh <[email protected]>
…er initialization (ray-project#59611) ## Description Fixes a race condition in `MetricsAgentClientImpl::WaitForServerReadyWithRetry` where concurrent HealthCheck callbacks could both attempt to initialize the exporter, causing GCS to crash with: ``` Check failed: !exporting_started_ RayEventRecorder::StartExportingEvents() should be called only once. ``` The `exporter_initialized_` flag was a non-atomic bool. When multiple HealthCheck RPCs completed simultaneously, their callbacks could both read false before either set it to true, leading to `init_exporter_fn` being called twice. Changed the flag to `std::atomic<bool>` to ensure only one callback wins the race. Signed-off-by: Sagar Sumit <[email protected]>
…STOP_REQUESTED in autoscaler v2 (ray-project#59550) ## Description When the autoscaler attempts to terminate QUEUED instances to enforce the `max_num_nodes_per_type` limit, the reconciler crashes with an assertion error. This happens because QUEUED instances are selected for termination, but the state machine doesn't allow transitioning them to a terminated state. The reconciler assumes all non-ALLOCATED instances have Ray running and attempts to transition QUEUED → RAY_STOP_REQUESTED, which is invalid. https://github.com/ray-project/ray/blob/ba727da47a1a4af1f58c1642839deb0defd82d7a/python/ray/autoscaler/v2/instance_manager/reconciler.py#L1178-L1197 This occurs when the `max_workers` configuration is dynamically reduced or when instances exceed the limit. ``` 2025-12-04 06:21:55,298 INFO event_logger.py:77 -- Removing 167 nodes of type elser-v2-ingest (max number of worker nodes per type reached). 2025-12-04 06:21:55,307 - INFO - Update instance QUEUED->RAY_STOP_REQUESTED (id=e13a1528-ffd9-403b-9fd1-b54e9c2698a0, type=elser-v2-ingest, cloud_instance_id=, ray_id=): draining ray: Terminating node due to MAX_NUM_NODE_PER_TYPE: max_num_nodes=None, max_num_nodes_per_type=220 2025-12-04 06:21:55,307 INFO instance_manager.py:263 -- Update instance QUEUED->RAY_STOP_REQUESTED (id=e13a1528-ffd9-403b-9fd1-b54e9c2698a0, type=elser-v2-ingest, cloud_instance_id=, ray_id=): draining ray: Terminating node due to MAX_NUM_NODE_PER_TYPE: max_num_nodes=None, max_num_nodes_per_type=220 2025-12-04 06:21:55,307 - ERROR - Invalid status transition from QUEUED to RAY_STOP_REQUESTED ``` This PR adds a valid transition `QUEUED -> TERMINATED` to allow canceling queued instances. ## Related issues Closes ray-project#59219 Signed-off-by: win5923 <[email protected]>
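The fix is essentially one more entry in an allowed-transitions table. The sketch below is a hypothetical, simplified illustration of that idea; the names and transition sets are not the actual reconciler code:

```python
from enum import Enum, auto


class InstanceStatus(Enum):
    QUEUED = auto()
    ALLOCATED = auto()
    RAY_STOP_REQUESTED = auto()
    TERMINATED = auto()


# Hypothetical transition table: the fix amounts to whitelisting
# QUEUED -> TERMINATED so queued instances can be cancelled directly,
# instead of being routed through RAY_STOP_REQUESTED.
ALLOWED_TRANSITIONS = {
    InstanceStatus.QUEUED: {InstanceStatus.ALLOCATED, InstanceStatus.TERMINATED},
    InstanceStatus.ALLOCATED: {InstanceStatus.RAY_STOP_REQUESTED, InstanceStatus.TERMINATED},
    InstanceStatus.RAY_STOP_REQUESTED: {InstanceStatus.TERMINATED},
}


def set_status(current: InstanceStatus, new: InstanceStatus) -> InstanceStatus:
    # Reject transitions that are not explicitly allowed.
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"Invalid status transition from {current.name} to {new.name}")
    return new
```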
## Description When running the code below: ``` from ray import ActorID ActorID.nil().job_id ``` or ``` from ray import TaskID TaskID.nil().job_id() ``` the following error is shown: <img width="1912" height="331" alt="Screenshot 2025-12-18 18:49" src="https://github.com/user-attachments/assets/b4200ef8-10df-4c91-83ff-f96f7874b0ce" /> The program should throw an error instead of crashing, and this PR fixes it by adding a helper function to do the nil check. ## Related issues Closes [ray-project#53872](ray-project#53872) ## Additional information After the fix, it now throws a `ValueError`: <img width="334" height="52" alt="Screenshot 2025-12-20 20:47" src="https://github.com/user-attachments/assets/00228923-2d26-4cb4-bf53-615945d2ce6c" /> <img width="668" height="103" alt="Screenshot 2025-12-20 20:47" src="https://github.com/user-attachments/assets/ee68213a-681a-4499-bef2-2e13533e3ffd" /> --------- Signed-off-by: Alex Wu <[email protected]>
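The real fix lives in Ray's Cython ID classes; the toy sketch below only illustrates the nil-guard idea with hypothetical internals:

```python
class ActorID:
    """Toy stand-in for Ray's ActorID, illustrating only the nil-check guard."""

    NIL_BYTES = b"\xff" * 16  # placeholder nil representation

    def __init__(self, binary: bytes) -> None:
        self._binary = binary

    @classmethod
    def nil(cls) -> "ActorID":
        return cls(cls.NIL_BYTES)

    def is_nil(self) -> bool:
        return self._binary == self.NIL_BYTES

    @property
    def job_id(self) -> bytes:
        # The idea of the fix: raise a clear error instead of crashing when the ID is nil.
        if self.is_nil():
            raise ValueError("Cannot get job_id from a nil ActorID.")
        return self._binary[-4:]  # placeholder extraction, for illustration only


assert ActorID.nil().is_nil()
```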
…0 GPUs on CPU-only cluster (ray-project#59514) If you request zero GPUs from the autoscaling coordinator but GPUs don't exist on the cluster, the autoscaling coordinator crashes. This PR fixes that bug. Signed-off-by: Balaji Veeramani <[email protected]>
…object construction (ray-project#59500) ## Description Reduce overhead added by token authentication: - Return shared_ptr from AuthenticationTokenLoader::GetToken() instead of constructing a new AuthenticationToken object copy every time (which would also add object destruction overhead) - Cache the token in the client interceptor at construction (previously GetToken() was called for every RPC) - Use CompareWithMetadata() to validate tokens directly from string_view without constructing new AuthenticationToken objects - Pass shared_ptr through ServerCallFactory to avoid per-call copies Release tests: without this change, the microbenchmark `multi_client_put_gigabytes` was in the 25-30 range, e.g. run: https://buildkite.com/ray-project/release/builds/70658 With this change it is in the 40-45 range: https://buildkite.com/ray-project/release/builds/72070 --------- Signed-off-by: sampan <[email protected]> Co-authored-by: sampan <[email protected]>
## Description This is a series of PRs to refactor/consolidate progress reporting and to decouple it from the executor, opstates, etc. ## Related issues Split from ray-project#58173 ## Additional information N/A --------- Signed-off-by: kyuds <[email protected]> Signed-off-by: Daniel Shin <[email protected]>
## Description Move extension types to ray.data: ray.air.util.object_extensions -> ray.data._internal.object_extensions ray.air.util.tensor_extensions -> ray.data._internal.tensor_extensions ## Related issues Closes ray-project#59418 Signed-off-by: will <[email protected]>
…ovements. (ray-project#59544) ## Description APPO targets high-throughput RL and relies on asynchrony - via actor parallelism and multi-threading. In our tests, the local-learner setup (num_learners=0) underperformed, prompting a deeper investigation into root causes and improvements as follows. ### Causes - __Producer-driven + weak backpressure:__ The data flow was producer-driven. There was no bounded, blocking handoff to pace producers. The in-queue for the `LearnerThread` was a `deque`, which is non-blocking. As a result, the `Learner` wasn't setting the pace (sawtooth throughput and stalls at the 20-update barrier, i.e. learner starved; furthermore: CPU waste and a hot GIL when polling from empty queues). - __Thread contention at multiple places:__ `Learner` and `_LearnerThread` shared the `_num_updates_lock` and `metrics._threading_lock` (`RLock`). On every update both threads contended on the same lock. Every 20 updates both threads occasionally contended on the other shared lock. - __Global timesteps, race condition:__ `update()` wrote a global `_CURRENT_GLOBAL_TIMESTEPS` and the `_LearnerThread` had to read it later. Rapid calls to `update()` could overwrite this before the thread consumed it, i.e. timesteps could mismatch the batch actually trained on. - __Aggregation race + spurious reduces:__ The "every 20 updates" path checked a copy of `_num_updates` without synchronization, then reset it at `>=20` - but without any threading event/condition to align the producer with the consumer. This often returned `{}` or reduced at odd times. - __No pinned memory or stream handoff:__ GPU copies used `pin_memory=False` and there was no explicit stream handoff; any implicit sync may land inside the learner update timing. - __Reference resolving on the producer's hot path:__ `TrainingData.solve_refs()` was called synchronously inside `LearnerGroup.update()` before queuing, which sometimes cost around ~25% of time in some calls. This extends the window where producer/learner can drift. - __Mixed queue semantics + missing task semantics:__ The code mixed `queue.Queue` and `deque` (and optionally `CircularBuffer`). `task_done`/`join()` semantics don't exist for `deque`; correctness relied on polling and manual drops. There was no bounded, blocking handoff to pace producers. This was brittle under load. - __No clean stop path:__ The thread used a `stop` flag but no sentinel was enqueued; if it was blocked/polling, the shutdown could hang or increment counters after a stop. - __Complete multi-learner stalling:__ In multi-learner setups with __multi-agent__ policies, asynchronous batches (i.e. batches with different combinations of policies) led to stalls in Torch's `DistributedDataParallel` asynchronous gradient synchronization. One rank computed gradients for a policy not present on the other rank(s) and waited indefinitely for synced gradients. ### Improvements by this PR - __Consumer-driven:__ `Learner` dictates the pace through blocking queues (`Learner` does `get()`, producer does `put()`). That avoids busy polling (no CPU burn). Faster reloading through quick returns from `Learner.update()` when no results are ready. Avoids learner starving - bigger queues allow for frequent producer burstiness. - __Edge-triggered aggregation:__ Only `_LearnerThread` increments a private counter and on __exactly__ the `broadcast_interval`th update fires an __event__ (`_agg_event`). The producer simply `wait`s for the event and `clear`s it (no lock fight).
Furthermore, the `_LearnerThread` now reduces metrics and returns them through an out-queue from which the `Learner` picks them up and returns them to the main process. All of these measures reduce thread contention to an absolute minimum. - __Pass meta-data with batch:__ The `Learner` enqueues a tuple `(batch, timesteps)` so the `_LearnerThread` consumes the correct timesteps atomically. This also reduces communication and boilerplate. - __(Optional) Deferral of reference resolving:__ Post-solve references in `Learner.update()` to return faster in asynchronous calls. - __Clean stop + consistent semantics:__ Use a `_STOP_SENTINEL` through the learner queue; don't rely anymore on a boolean alone. And call `task_done()` on a real `queue.Queue` (if not using `CircularBuffer`). Furthermore, unification of the buffer/queue API inside `Learner` (see the sketch after this description). - __Safe-guarding multi-agent multi-learner training:__ Manual synchronization of gradients replaces Torch's `DistributedDataParallel` hooks-based synchronization for multi-learner multi-agent setups. Gradients on each rank are zero-padded and synced after all gradients have been computed. ## Related issues ## Additional information Because this PR reshapes the data flow, a few tuning tips are useful: - Circular buffer vs. simple queue. The old CircularBuffer prevented learner starvation but its push/pop are slower. The new consumer-driven pipeline is generally more efficient - assuming producers are reasonably fast and the learner queue isn't tiny. Use `use_circular_buffer=True` only when producing is expensive/irregular (it lets the learner keep iterating over buffered data, similar to `num_epochs > 1` but in cycles). Otherwise, prefer the simple queue. Recommended defaults: `simple_queue_size=32` for `APPO`; `IMPALA` keeps a smaller `learner_queue_size=3`. - Unified interval: broadcast & metrics reduction. Previously, weights were synced by `broadcast_interval` while metrics were reduced every fixed 20 updates. The new design unifies these: `broadcast_interval` now controls both weight sync and metrics reduction. In practice, ~10 balances steady flow with acceptable off-policy lag. - Scale producers to match a fast learner. The `_LearnerThread` applies updates quickly, so overall throughput is often producer-bound. To feed it well, increase `num_env_runners` and/or `num_envs_per_env_runner`. ### Next steps This PR improves dataflow focused on the learner(s). The next steps are: - To increase throughput in `AggregatorActors` - To improve dataflow in IMPALA's main thread. - To boost performance in loss calculation. - To check asynchronous calls to `EnvRunner`s and `Learner`s. - To test resolving references in either `_GPULoaderThreads` or `_LearnerThread` instead of the `Learner`'s main thread. ### Tests APPO in this PR was tested on the following (multi-agent) environments: - `CartPole-v1` - `ALE:Pong-v5` - `Footsies` (see https://github.com/chasemcd/FootsiesGym) #### `CartPole-v1` This PR improves performance significantly for high-producer scenarios like `CartPole-v1`.
All tests used: - `broadcast_interval=10` - `use_circular_buffer=False` - `num_aggregator_actors_per_learner=3` - `num_env_runners=num_learners x 32` - `episodes_to_numpy=False` - `num_gpus_per_learner=1` <img width="757" height="404" alt="image (3)" src="https://github.com/user-attachments/assets/3beee428-d4c0-42f4-811d-61d81de484c2" /> #### `ALE:Pong-v5` All tests used: - `broadcast_interval=10` - `use_circular_buffer=False` - `num_aggregator_actors_per_learner=6` - `num_env_runners=num_learners x 32` - `episodes_to_numpy=True` (`FrameStack` connector with 4 frames) - `num_gpus_per_learner=1` <img width="676" height="366" alt="image" src="https://github.com/user-attachments/assets/43d08a87-0cc1-4902-8150-adc1c3203be6" /> --------- Signed-off-by: simonsays1980 <[email protected]>
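As referenced above, the bounded, consumer-driven handoff with a stop sentinel can be pictured with Python's standard `queue` module. This is a minimal sketch of the pattern, not the RLlib implementation; the queue size and fake batch contents are placeholders:

```python
import queue
import threading

_STOP_SENTINEL = object()


def learner_thread(in_queue: queue.Queue) -> None:
    while True:
        item = in_queue.get()
        if item is _STOP_SENTINEL:
            in_queue.task_done()
            break
        batch, timesteps = item  # metadata travels with the batch, so no shared globals
        # ... apply the update with `batch` and `timesteps` here ...
        in_queue.task_done()


# Bounded queue: producers block in put() when the learner falls behind,
# so the consumer (learner) sets the pace instead of the producers.
q: queue.Queue = queue.Queue(maxsize=32)
t = threading.Thread(target=learner_thread, args=(q,), daemon=True)
t.start()

for step in range(4):
    q.put(({"obs": step}, step))
q.put(_STOP_SENTINEL)  # clean shutdown without relying on a boolean flag
q.join()
```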
## Description This PR improves typehinting in `ray._common.retry`. It contained a lot of `Any` or unspecified generics before and now should be fully specific. --------- Signed-off-by: Jonas Dedden <[email protected]>
## Description When using `runtime_env.working_dir` with a remote zip archive URL (for example, `https://gitee.com/whaozi/kuberay/repository/archive/master.zip`), Ray downloads an HTML page instead of the actual zip file. This causes the Ray job to fail when accessing files from the working directory. Downloading the same URL with standard tools such as `wget` works as expected and returns the correct zip archive. This PR addresses the inconsistency in how `runtime_env.working_dir` handles remote archive downloads. #### For example ``` import ray ray.init(include_dashboard=False, ignore_reinit_error=True) @ray.remote( runtime_env={"working_dir": "https://gitee.com/whaozi/kuberay/repository/archive/master.zip"} ) def list_repo_files(): import pathlib return sorted(p.name for p in pathlib.Path(".").iterdir()) print(ray.get(list_repo_files.remote())) ray.shutdown() ``` https_gitee_com_whaozi_kuberay_repository_archive_master is empty, and https_gitee_com_whaozi_kuberay_repository_archive_master.zip is an HTML file <img width="1438" height="550" alt="image" src="https://github.com/user-attachments/assets/ec330c99-3bf7-431a-8f3e-6c1789e257ab" /> #### We tested ``` wget https://gitee.com/whaozi/kuberay/repository/archive/master.zip --2025-08-05 14:28:52-- https://gitee.com/whaozi/kuberay/repository/archive/master.zip Resolving gitee.com (gitee.com)... 180.76.198.77, 180.76.199.13, 180.76.198.225 Connecting to gitee.com (gitee.com)|180.76.198.77|:443... connected. HTTP request sent, awaiting response... 302 Found Location: https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D [following] --2025-08-05 14:28:54-- https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D Reusing existing connection to gitee.com:443. HTTP request sent, awaiting response... 200 OK Length: unspecified [application/zip] Saving to: 'master.zip' master.zip [ <=> ] 10.37M 1.23MB/s in 13s ``` I think we are not handling HTTP redirection here. If I directly use the redirected URL, it works: ``` from smart_open import open as open_file with open_file("https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D", "rb") as fin: with open_file("/tmp/jjyao_test.zip", "wb") as fout: fout.write(fin.read()) ``` #### Problem When using runtime_env.working_dir with a remote zip URL (e.g. gitee archives), Ray's HTTPS downloader uses the default Python-urllib user-agent, and some hosts respond with HTML rather than the archive. The working directory then contains HTML and the Ray job fails, while wget succeeds because it presents a curl-like user-agent. #### Solution _download_https_uri() now sets curl-like headers (ray-runtime-env-curl/1.0 UA + Accept: */*, configurable via RAY_RUNTIME_ENV_HTTP_USER_AGENT). This keeps Ray's behavior consistent with curl/wget, allowing gitee and similar hosts to return the proper zip file. A regression test verifies the headers are set. ## Related issues Fixes ray-project#52233 ## Additional information --------- Signed-off-by: yaommen <[email protected]>
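The header change can be pictured with plain urllib; the actual Ray code path goes through smart_open, the URL below is a placeholder, and the environment variable name is taken from the description above:

```python
import os
import urllib.request

# Illustrative only: fetch a remote archive while presenting curl-like headers,
# mirroring the idea of the fix rather than Ray's internal _download_https_uri().
user_agent = os.environ.get("RAY_RUNTIME_ENV_HTTP_USER_AGENT", "ray-runtime-env-curl/1.0")
request = urllib.request.Request(
    "https://example.com/archive/master.zip",  # placeholder URL
    headers={"User-Agent": user_agent, "Accept": "*/*"},
)
with urllib.request.urlopen(request) as response:
    data = response.read()
```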
Previously, if the user did not specify them, Ray preassigned the GCS port, dashboard agent port, runtime environment port, etc., and passed them to each component at startup. This created a race condition: Ray might believe a port is free, but by the time the port information is propagated to each component, another process may have already bound to that port. This can cause user-facing issues, for example when Raylet heartbeat messages are missed frequently enough that the GCS considers the node unhealthy and removes it. We originally did this because there was no standard local service discovery, so components had no way to know each other's serving ports unless they were preassigned. The final port discovery design is here: <img width="2106" height="1492" alt="image" src="https://github.com/user-attachments/assets/eaac8190-99d8-404b-8a8d-283a4f2f0f33" /> This PR addresses port discovery for: - GCS reporting back to the startup script (driver) ✅ - The runtime env agent reporting back to the raylet ✅ - The dashboard agent reporting back to the raylet ✅ - The raylet blocking registration with the GCS until it has collected port information from all agents ✅ - GCS adding InitMetricsExporter to node_added_listeners_ so it starts the MetricsExporter as soon as the raylet registers with the GCS with complete port information ✅ - The Ray client server obtaining the runtime env agent port from GCS ✅ - Ensuring that both a connected-only driver (e.g., `ray.init()`) and a startup driver still receive all port information from the GCS ✅ - Ensure GCS FT works: using the same GCS port as before ✅ - Ensure no metric loss ✅ - Clean up the old cache port code ✅ (Note that this PR is a clean-up version of ray-project#59065) ## Consideration **GCS Fault tolerance:** GCS fault tolerance requires GCS to restart using exactly the same port, even if it initially starts with a dynamically assigned port (0). Before this PR, GCS cached the port in a file, and this PR preserves the same behavior (although ideally, the port should only be read from the file by the Raylet and its agent). This can be further improved by storing the GCS port in Redis, but that should be addressed in a separate PR. **GCS start sequence related:** OpenCensus Exporter and the Event Aggregator Client are now constructed without connecting to the agent port; instead, they defer the actual connection until the head Raylet registers via a callback. At that point, the actual metrics_agent_port is known from the node information. The OpenTelemetry Exporter is now also initialized at head Raylet registration time. **Ray nodes that share the same file system:** There are cases where people run multiple Ray nodes from the same or different Ray clusters, so the port file name is based on a fixed prefix plus the node ID. ## Related issues Closes ray-project#54321 ## Test For GCS-related work, here is a detailed test I wrote that covers seven starting/connecting cases: - https://github.com/Yicheng-Lu-llll/ray/blob/port-self-discovery-test-file/python/ray/tests/test_gcs_port_reporting.py - ray.init starts a head node and exposes a dynamic GCS port.
- Connect a driver via address="auto" using the address file - Connect a driver via an explicit address - CLI starts head with dynamic GCS port - CLI starts worker connecting to the head via GCS address - CLI starts head with an explicit GCS port - CLI starts head with default GCS port For runtime env agent: - https://github.com/Yicheng-Lu-llll/ray/blob/port-self-discovery-test-file/test_agent_port.py - ray start --head (auto port discovery) - ray start --head with fixed runtime-env-agent-port - ray.init() local cluster (auto port discovery) - (we don't have ray.init() with fixed _runtime_env_agent_port) Test that ray_client_server works correctly with dynamic runtime env agent port: - https://github.com/Yicheng-Lu-llll/ray/blob/port-self-discovery-test-file/test_ray_client_with_runtime_env.py For dashboard agent ports, the existing tests already cover this quite well. ## Follow up - The dashboard agent reporting back to the raylet - The dashboard agent now also writes to GCS, but we should allow only the raylet to write to GCS ## performance before this PR: ```shell [0.000s] Starting ray.init()... [0.000s] Session dir created [0.070s] Process: gcs_server [6.885s] Process: runtime_env_agent [6.955s] Process: raylet [6.955s] Process: dashboard_agent 2025-12-12 04:47:34,391 INFO worker.py:2014 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 /home/ubuntu/ray/python/ray/_private/worker.py:2062: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0 warnings.warn( [9.061s] ray.init() completed ``` After This PR: ```shell [0.000s] Starting ray.init()... [0.075s] Process: gcs_server [0.075s] Session dir created [0.075s] File: gcs_server_port.json = 39451 [6.976s] Process: raylet [6.976s] Process: dashboard_agent [6.976s] Process: runtime_env_agent [7.576s] File: runtime_env_agent_port.json = 38747 [7.640s] File: metrics_agent_port.json = 40005 [8.083s] File: metrics_export_port.json = 44515 [8.083s] File: dashboard_agent_listen_port.json = 52365 2025-12-12 02:02:54,925 INFO worker.py:1998 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 /home/ubuntu/ray/python/ray/_private/worker.py:2046: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0 warnings.warn( [10.035s] ray.init() completed ``` We can see that the dominant time is actually at the start of GCS. We wait for GCS to be ready and write the cluster info. The port reporting speed is quite fast (file appearance time β raylet start time). https://github.com/ray-project/ray/blob/863ae9fd573b13a05dcae63b483e9b1eb0175571/python/ray/_private/node.py#L1365-L367 --------- Signed-off-by: yicheng <[email protected]> Co-authored-by: yicheng <[email protected]>
Clean Up Deprecated Env Var and Document Undocumented Env Vars ### Summary Remove deprecated `RAY_SERVE_ENABLE_JSON_LOGGING` and add documentation for undocumented Ray Serve environment variables. ### Changes | File | Description | |------|-------------| | `python/ray/serve/_private/constants.py` | Removed `RAY_SERVE_ENABLE_JSON_LOGGING`, renamed `SERVE_ROOT_URL_ENV_KEY` → `RAY_SERVE_ROOT_URL`, removed deprecated `CONTROLLER_MAX_CONCURRENCY` fallback | | `python/ray/serve/_private/logging_utils.py` | Removed deprecated JSON logging logic and warning | | `python/ray/serve/_private/controller.py` | Updated to use `RAY_SERVE_ROOT_URL` constant | | `doc/source/serve/monitoring.md` | Removed deprecation note, added `RAY_SERVE_CONTROLLER_CALLBACK_IMPORT_PATH` docs | | `doc/source/serve/advanced-guides/performance.md` | Added `RAY_SERVE_CONTROLLER_MAX_CONCURRENCY` docs | | `doc/source/serve/production-guide/config.md` | Added `RAY_SERVE_ROOT_URL` docs | ### New Documentation | Environment Variable | Description | |---------------------|-------------| | `RAY_SERVE_CONTROLLER_MAX_CONCURRENCY` | Max concurrent requests for Controller (default: 15000) | | `RAY_SERVE_CONTROLLER_CALLBACK_IMPORT_PATH` | Callback for custom Controller initialization | | `RAY_SERVE_ROOT_URL` | Override root URL (useful behind load balancers) | ### Migration Users using `RAY_SERVE_ENABLE_JSON_LOGGING=1` should migrate to `LoggingConfig` with `encoding="JSON"`. --------- Signed-off-by: harshit <[email protected]>
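For reference, the suggested migration looks roughly like the sketch below; the `ray.serve.schema.LoggingConfig` import path and the `logging_config` parameter are recalled from the Ray Serve docs, so double-check against the current API before relying on it:

```python
from ray import serve
from ray.serve.schema import LoggingConfig


# Instead of setting RAY_SERVE_ENABLE_JSON_LOGGING=1, configure JSON logging explicitly.
@serve.deployment(logging_config=LoggingConfig(encoding="JSON"))
class Echo:
    def __call__(self, request) -> str:
        return "hello"


app = Echo.bind()
# serve.run(app)
```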
Code Review
This pull request is an automated daily merge from master to main. The changes are extensive and primarily focus on updating the CI and build infrastructure. Key changes include dropping support for Python 3.9 and making Python 3.10 the new default, which is reflected across numerous Buildkite pipeline configurations, Dockerfiles, and Bazel build files. There's a significant refactoring of the CI setup, such as moving documentation-related jobs to a dedicated doc.rayci.yml file and simplifying dependency set compilation. The Docker image build process is modernized by using uv for Python environment setup. Additionally, there are many documentation updates, including new guides for contributing to Ray Data, using new features like label-based scheduling and incremental upgrades for RayService, and improved explanations for existing concepts. I've added a couple of comments regarding a potentially broken test and a documentation clarification. Overall, the changes appear consistent with a major version bump and infrastructure cleanup.
```python
    def test_include_setuptools(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            copy_data_to_tmpdir(tmpdir)
            manager = _create_test_manager(tmpdir)
            manager.compile(
                constraints=[],
                requirements=["requirements_test.txt"],
                name="general_depset",
                output="requirements_compiled_general.txt",
                include_setuptools=True,
            )
            output_file = Path(tmpdir) / "requirements_compiled_general.txt"
            output_text = output_file.read_text()
            assert "--python-version 3.10" in output_text
            assert "--unsafe-package setuptools" not in output_text
```
This test incorrectly checks the content of the generated lock file for a command-line argument (--unsafe-package setuptools). Command-line arguments are not written to the lock file, especially with --no-header now being a default flag.
To correctly verify the behavior, this test should be updated to check stdout to ensure the flag is not passed to the uv command, similar to how test_ignore_setuptools is implemented. This will make the test more robust and correctly reflect the intended logic.
Suggested change:

```diff
-    def test_include_setuptools(self):
-        with tempfile.TemporaryDirectory() as tmpdir:
-            copy_data_to_tmpdir(tmpdir)
-            manager = _create_test_manager(tmpdir)
-            manager.compile(
-                constraints=[],
-                requirements=["requirements_test.txt"],
-                name="general_depset",
-                output="requirements_compiled_general.txt",
-                include_setuptools=True,
-            )
-            output_file = Path(tmpdir) / "requirements_compiled_general.txt"
-            output_text = output_file.read_text()
-            assert "--python-version 3.10" in output_text
-            assert "--unsafe-package setuptools" not in output_text
+    @patch("sys.stdout", new_callable=io.StringIO)
+    def test_include_setuptools(self, mock_stdout):
+        with tempfile.TemporaryDirectory() as tmpdir:
+            copy_data_to_tmpdir(tmpdir)
+            manager = _create_test_manager(tmpdir)
+            manager.compile(
+                constraints=[],
+                requirements=["requirements_test.txt"],
+                name="general_depset",
+                output="requirements_compiled_general.txt",
+                include_setuptools=True,
+            )
+            stdout = mock_stdout.getvalue()
+            assert "--unsafe-package setuptools" not in stdout
```
```diff
         .map_batches(drop_nas, batch_format="pandas")
     )

 .. tab-item:: pyarrow

-The user defined function you pass to :meth:`~ray.data.Dataset.map_batches` is more flexible. Because you can represent batches
-in multiple ways (see :ref:`Configuring batch format <configure_batch_format>`), the function should be of type
-``Callable[DataBatch, DataBatch]``, where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``. In
-other words, your function should take as input and output a batch of data which you can represent as a
-pandas DataFrame or a dictionary with string keys and NumPy ndarrays values. For example, your function might look like:
-
-.. testcode::
+    .. testcode::

-    import pandas as pd
+        import pyarrow as pa
+        import pyarrow.compute as pc
         import ray

-    def fn(batch: pd.DataFrame) -> pd.DataFrame:
-        # modify batch
-        batch = ...
-
-        # return batch
-        return batch
+        def drop_nas(batch: pa.Table) -> pa.Table:
+            return pc.drop_null(batch)
+
+        ds = (
+            ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")
+            .map_batches(drop_nas, batch_format="pyarrow")
+        )

 The user defined function can also be a Python generator that yields batches, so the function can also
-be of type ``Callable[DataBatch, Iterator[[DataBatch]]``, where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``.
+be of type ``Callable[DataBatch, Iterator[[DataBatch]]``, where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray], pyarrow.Table]``.
 In this case, your function would look like:
```
The explanation for the non-generator UDF signature for map_batches was removed. This makes the documentation less clear for this common use case. It would be beneficial to restore the explanation for Callable[DataBatch, DataBatch] and update the DataBatch type hint to include pyarrow.Table for completeness.
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This pull request has been automatically closed because there has been no further activity in the last 14 days. Please feel free to reopen or open a new pull request if you'd still like this to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for your contribution!
This Pull Request was created automatically to merge the latest changes from `master` into the `main` branch.
Created: 2025-12-26
Merge direction: `master` → `main`
Triggered by: Scheduled
Please review and merge if everything looks good.