
Conversation


@antfin-oss antfin-oss commented Dec 24, 2025

This Pull Request was created automatically to merge the latest changes from master into main branch.

📅 Created: 2025-12-26
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

JasonLi1909 and others added 30 commits December 5, 2025 14:32
The data resource cleanup tests were flaky because they relied on
`SplitCoordinator.start_epoch.remote()` to indirectly request resources
from the `AutoscalingRequester`. There is no guarantee that a scaling
request has been made by the time of the assertion.

This PR:
- Implements a 3-second timeout to wait for resource requests to be made
(see the sketch after this list)
- Makes explicit autoscaling requests via `try_trigger_scaling` calls
instead of relying on `SplitCoordinator.start_epoch`
- Performs basic cleanup
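
A minimal sketch of the poll-with-timeout pattern the test now uses; `requester.has_pending_request()` is a hypothetical accessor for illustration, not the actual Ray API:

```python
import time

def wait_for_condition(predicate, timeout_s: float = 3.0, interval_s: float = 0.1) -> None:
    """Poll `predicate` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return
        time.sleep(interval_s)
    raise TimeoutError("condition not met within timeout")

# Assert only after the autoscaling request has actually been observed:
# wait_for_condition(lambda: requester.has_pending_request())
```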

---------

Signed-off-by: JasonLi1909 <[email protected]>
## Description

Implements dark mode for the Ray Dashboard.

- Toggle button in nav bar switches between light/dark themes
- Preference saved to localStorage and persists across sessions
- Respects system theme preference on first load
- 57 files updated to replace hardcoded colors with theme palette values
- All components work properly in both modes (icons, charts, embedded
Grafana)

## Related issues

Closes ray-project#49075

## Additional information

### Note about PR ray-project#54783

Started this before discovering ray-project#54783. This PR covers more components
(57 vs 21 files) and integrates into the existing GlobalContext. Happy to
collaborate with @ddishi if needed.

### Implementation

- Added `themeMode` and `toggleTheme` to GlobalContext in `App.tsx`
- Theme definitions in `theme.ts` with full palettes for both modes
- Dark theme uses deep blacks (#0F0F0F, #1A1A1A)
- Replaced hardcoded colors with theme values (`theme.palette.*`)
- ~~Renamed GRAM to VRAM for standard terminology~~

### Screenshots

![Cluster
Dark](https://github.com/user-attachments/assets/115b73d6-bca5-4a91-9e95-bb66f85012b3)

<details>
<summary><b>Overview Page</b></summary>

![Overview
Light](https://github.com/user-attachments/assets/f1e79b17-edb7-4290-939e-62ef491662ea)
![Overview
Dark](https://github.com/user-attachments/assets/31dce618-50d7-49a5-8ef2-cf1db86eb01c)

</details>

<details>
<summary><b>Cluster/Nodes Page</b></summary>

![Cluster
Light](https://github.com/user-attachments/assets/a7a9a743-cd46-485a-aa72-09862d62a3f1)![Cluster
Dark](https://github.com/user-attachments/assets/115b73d6-bca5-4a91-9e95-bb66f85012b3)

</details>

<details>
<summary><b>Jobs Page</b></summary>

![Jobs
Light](https://github.com/user-attachments/assets/421f7990-e213-4eac-9906-e448539133a4)
![Jobs
Dark](https://github.com/user-attachments/assets/bb1acb95-423a-455b-bedd-ae2e6829a0e4)

</details>

<details>
<summary><b>Actors Page</b></summary>

![Actors
Light](https://github.com/user-attachments/assets/4bfd0905-dce6-4150-9f4a-4da50227ed7a)
![Actors
Dark](https://github.com/user-attachments/assets/736f97f4-a8a0-4e1e-a993-4a5c07f6c0af)

</details>

<details>
<summary><b>Serve Deployments Page</b></summary>

![Serve
Light](https://github.com/user-attachments/assets/c0e2866f-1bac-4705-bc5a-78120d8e4aaa)
![Serve
Dark](https://github.com/user-attachments/assets/42b26cd4-7ac8-48e9-825f-0120247a2874)

</details>

<details>
<summary><b>Metrics Page</b></summary>

![Metrics
Light](https://github.com/user-attachments/assets/369a2f05-a488-4c72-9368-3b5e7d2bb3ae)
![Metrics
Dark](https://github.com/user-attachments/assets/6d2b81d0-53a3-4728-9583-1cc10330825a)

</details>

<details>
<summary><b>Logs Page</b></summary>


![Logs
Light](https://github.com/user-attachments/assets/74ded09c-e8e1-44c3-8453-2484e93c0d1e)
![Logs
Dark](https://github.com/user-attachments/assets/892dfcec-7308-4f4d-99d0-7eb6e61f4c39)

</details>

### Testing

Manually tested all pages in both themes. The toggle works without a refresh,
the preference persists, and system theme detection works (at least on my
machine :D).

Given the scope, I likely missed some edge cases or specific components.
Happy to fix anything that comes up in review.

---------

Signed-off-by: Julius Arkenberg <[email protected]>
…#59196)

## Description
The GPU
[test](https://github.com/ray-project/ray/blob/8b4a1ee98fd92a972d07e9c08af19f02829dd40f/python/ray/tests/test_tensor_zero_copy_serialization.py#L171)
was failing in Ray postmerge CI.

We re-add it with a CPU version that runs in premerge and a GPU
version running on that build:
- Added one entry to [this
file](https://github.com/ray-project/ray/blob/5328b3250f8a15dc35c391869c24f0db0f821a32/python/ray/tests/BUILD.bazel#L503),
in the "small" tests
- Added back the GPU test entry

## Related issues
Fixes ray-project#59195

---------

Signed-off-by: Haichuan Hu <[email protected]>
…-project#59084)

## Description
The [PR](ray-project#57639) supporting
zero-copy serialization for read-only tensors has been merged. We need
to add documentation to inform users about this new feature.

The docs should mention that torch tensors are not zero-copy by default,
show an example of how to enable zero-copy, and, most importantly, explain
the limitations and possible issues.

@tianyi-ge 

## Related issues
Closes ray-project#59081

---------

Signed-off-by: Haichuan Hu <[email protected]>
… = token or k8s (ray-project#59168)

## Description
Today, the authentication dialog box opens every time any Ray dashboard
API call fails with 401/403, regardless of the authentication mode being
used. This can lead to confusing and undesirable behavior when token auth
is not enabled. This PR adds an additional check that verifies the
auth mode before opening the dialog box.

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
…roject#59180)

After fully starting up, the runtime-env agent does not handle SIGTERM
and only responds to SIGKILL.

The current test that kills the runtime-env agent with SIGTERM only
passes because it sends the signal before the agent is fully
initialized. Once the agent finishes startup, it no longer responds to
SIGTERM, causing the test to become timing-dependent.

This PR temporarily removes the test because it is potentially highly
flaky: any change that makes the agent start faster will cause the test
to fail.

**The raylet graceful-exit behavior is still covered by the test where
the agent is terminated with SIGKILL.**

See the related issue for details: ray-project#59162

## Related issues
Closes ray-project#59162

## Additional information

Test:
```shell
ray start --head --port=6379
ps aux | grep "runtime_env/agent/main.py" | grep -v grep
#ubuntu   3946849  0.2  0.1 2146984 115396 pts/2  Sl   18:49   0:00 /home/ubuntu/ray/venv/bin/python -u /home/ubuntu/ray/python/ray/_private/runtime_env/agent/main.py


# Now, wait for a while

kill -15 3946849

ps -p 3946849 -o pid,state,comm 2>/dev/null && echo "Process still alive!" || echo "Process is dead"
#    PID S COMMAND
#3946849 S python
#Process still alive!

```

Signed-off-by: yicheng <[email protected]>
Co-authored-by: yicheng <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
…tests (ray-project#59215)

## Description
The expression `bool(os.environ.get("RAY_PYTEST_USE_GPU", 0))`
incorrectly evaluates to `True` when the environment variable is
explicitly set to "0". This happens because `os.environ.get()` returns a
string when the variable is set, and `bool("0")` is `True` because "0" is
a non-empty string. The common pattern used elsewhere in this codebase is
`os.environ.get("VAR") == "1"`, which correctly handles all cases. If
someone sets `RAY_PYTEST_USE_GPU=0` to disable GPU tests, the tests
would still run unexpectedly.
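
A short, self-contained demonstration of the pitfall and the fix (the environment variable assignment is only for illustration):

```python
import os

# Simulate a user disabling GPU tests.
os.environ["RAY_PYTEST_USE_GPU"] = "0"

# Buggy: get() returns the string "0", and bool("0") is True.
assert bool(os.environ.get("RAY_PYTEST_USE_GPU", 0)) is True

# Correct: compare against the literal "1".
assert (os.environ.get("RAY_PYTEST_USE_GPU") == "1") is False
```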

@tianyi-ge 

## Related issues
Closes ray-project#59213

Signed-off-by: Haichuan Hu <[email protected]>
## Description

The Footsies environment can be rather spammy. This PR lets us suppress
the outputs or redirect them to a file.

---------

Signed-off-by: Artur Niederfahrenhorst <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Kamil Kaczmarek <[email protected]>
upgrade h5py in preparation for python 3.13 image building
h5py==3.10.0 -> h5py==3.12.1

h5py 3.12.1 has python 3.13 wheels

---------

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…y-project#58992)

upgrading test requirements in preparation for python 3.13 image
building
pillow==10.3.0 -> pillow==10.4.0
scikit-learn==1.3.2 -> scikit-learn>=1.5.2

for py313 images

---------

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
… lint (ray-project#59247)

Our CI uses Bazel to run Python tests (not pytest). If we don't call
`pytest.main` in a test module, the tests aren't actually run.

ray-project#58816 fixed this issue for
`test_hash_shuffle`. As a follow-up, this PR removes
`test_hash_shuffle` from the list of files excluded from the
`pytest.main` lint.

Signed-off-by: Balaji Veeramani <[email protected]>

Makes the CancelTask RPC fault tolerant. Created an intermediary RPC, similar
to what was done in ray-project#57648, in that when the force_exit flag is
enabled for cancel, the executor worker is shut down gracefully. However, we
have no way of determining on the owner core worker whether the shutdown was
successful, so we send the cancel request to the raylet via a new RPC,
CancelLocalTask, that guarantees the worker is killed. Added a Python test to
verify retry behavior; left out the C++ test after talking to @dayshah because
it would be complicated to account for all orderings of the owner/executor
states in the cancellation process.

In the current task submission path, we don't keep track of the raylet
address of the worker when we receive the PushTask RPC. It's a bit
complicated to do this since the GCS keeps track of actor lifecycles,
including requesting leases, so instead of touching the hot path
(task submission) we decided to complicate the cancellation path instead.
Upon receiving a CancelTask RPC, we query the GCS node cache to get
the node info; only if it's not in the cache do we then query the GCS.
Unfortunately, the GCS node cache is currently not thread safe and should
only be accessed on the main io service, so we refactored
Normal/ActorTaskSubmitter to post the portion of the code that
accesses the cache onto the main io service.

There was also a race condition in CancelLocalTask/KillLocalActor where
send_reply_callback could be triggered twice: if we receive the response
from CancelTask/KillActor but the worker is not evicted from the raylet
worker pool immediately, the callback in execute_after could also fire.
A replied boolean flag was added to guard against this.

The behavior of CancelTask with force_kill=true has been modified to trigger
a SIGKILL after a set amount of time if the graceful shutdown of the
worker hangs.

Lastly, the actor task retry did not use the config used by the normal
task retry timer. Updated this.

---------

Signed-off-by: joshlee <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
…ect#59172)

## Description
Currently, when transforming a pandas block to an arrow block, we convert a
column into `pa.null` if all of its values in the TensorArray are NaN or empty
strings. However, all-NaN values might actually represent something and can
legitimately occur in transformations.


https://github.com/ray-project/ray/blob/8b4a1ee98fd92a972d07e9c08af19f02829dd40f/python/ray/air/util/tensor_extensions/pandas.py#L926-L958

This could trigger an error when we try to convert that arrow block back
to pandas (try running the repro script from the issue).

This PR skips the coercion to `pa.null` if the `col.dtype` is `TensorDtype`,
which tolerates all-NaN or all-empty-string tables.
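
A generic pyarrow illustration of why the coercion is lossy; it doesn't use Ray's TensorArray itself, just plain Arrow types:

```python
import pyarrow as pa

# A column of only missing values infers Arrow's `null` type, discarding
# any information about what type the values would have had.
arr = pa.array([None, None])
print(arr.type)  # null

# Round-tripping back to pandas yields plain object/None data, so an
# all-NaN tensor column can no longer be recognized as a TensorDtype.
print(arr.to_pandas())
```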

## Related issues
Closes ray-project#59087

---------

Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Co-authored-by: Alexey Kudinkin <[email protected]>
Related PRs that we should review when fully upgrading:
- ray-project#58820
- Note from Rui: when we bump to a new vllm version, we should go with 0.11.2
instead of 0.11.1, which fixes a Ray multi-node PP regression that was
introduced when adding torch-based PP:
https://github.com/vllm-project/vllm/releases/tag/v0.11.2

Issues:
- closes ray-project#58937
- closes ray-project#58973
- closes ray-project#58702

---------

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Seiji Eicher <[email protected]>
Signed-off-by: Nikhil Ghosh <[email protected]>
Signed-off-by: Nikhil G <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Seiji Eicher <[email protected]>
Co-authored-by: Nikhil Ghosh <[email protected]>
Co-authored-by: Nikhil G <[email protected]>
Co-authored-by: elliot-barn <[email protected]>
…t#59254)

also add missing `click` dependencies for scripts

Signed-off-by: Lonnie Liu <[email protected]>
for ray image releases

Signed-off-by: Lonnie Liu <[email protected]>
…9252)

they are not captured by bazel and do not work anymore

Signed-off-by: Lonnie Liu <[email protected]>
…ect#59256)

## Description

`test_preserve_hash_shuffle_blocks` has been flaking consistently. To
mitigate the flakiness, this PR bumps the test size from "small" to
"medium".

```
[2025-12-06T07:06:32Z] //python/ray/data:test_preserve_hash_shuffle_blocks                     TIMEOUT in 3 out of 3 in 63.4s
--
[2025-12-06T07:06:32Z]   Stats over 3 runs: max = 63.4s, min = 60.1s, avg = 62.3s, dev = 1.6s
[2025-12-06T07:06:32Z]   /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray/bazel-out/k8-opt/testlogs/python/ray/data/test_preserve_hash_shuffle_blocks/test.log
[2025-12-06T07:06:32Z]   /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray/bazel-out/k8-opt/testlogs/python/ray/data/test_preserve_hash_shuffle_blocks/test_attempts/attempt_1.log
[2025-12-06T07:06:32Z]   /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray/bazel-out/k8-opt/testlogs/python/ray/data/test_preserve_hash_shuffle_blocks/test_attempts/attempt_2.log
[2025-12-06T07:06:32Z]
```

See also ray-project#58988

Signed-off-by: Balaji Veeramani <[email protected]>
`test_download_expression` has been occasionally timing out recently. To
mitigate this flakiness, this PR bumps the test size from "small" to
"medium".

Signed-off-by: Balaji Veeramani <[email protected]>
…oject#59258)

`test_map_batches_e2e` occasionally flakes because it assumes that the
Dataset outputs data in a specific order, but that order can vary if
tasks finish out of order.

```
[2025-12-04T21:34:13Z]     def test_map_batches_e2e(ray_start_regular_shared_2_cpus):
--
[2025-12-04T21:34:13Z]         ds = ray.data.range(5)
[2025-12-04T21:34:13Z]         ds = ds.map_batches(column_udf("id", lambda x: x))
[2025-12-04T21:34:13Z] >       assert extract_values("id", ds.take_all()) == list(range(5)), ds
[2025-12-04T21:34:13Z] E       AssertionError: MapBatches(<lambda>)
[2025-12-04T21:34:13Z] E         +- Dataset(num_rows=5, schema={id: int64})
[2025-12-04T21:34:13Z] E       assert [0, 1, 2, 4, 3] == [0, 1, 2, 3, 4]
[2025-12-04T21:34:13Z] E         At index 3 diff: 4 != 3
[2025-12-04T21:34:13Z] E         Full diff:
[2025-12-04T21:34:13Z] E         - [0, 1, 2, 3, 4]
[2025-12-04T21:34:13Z] E         ?            ---
[2025-12-04T21:34:13Z] E         + [0, 1, 2, 4, 3]
[2025-12-04T21:34:13Z] E         ?           +++
[2025-12-04T21:34:13Z]
```

To mitigate this issue, this PR sorts the output values (see the sketch below).
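
A minimal sketch of the order-insensitive assertion, written with plain Ray Data; the original test's `column_udf`/`extract_values` helpers are not used here:

```python
import ray

ds = ray.data.range(5).map_batches(lambda batch: batch)
values = [row["id"] for row in ds.take_all()]
# Sort before comparing so the assertion doesn't depend on task completion order.
assert sorted(values) == list(range(5))
```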

Signed-off-by: Balaji Veeramani <[email protected]>
## Why are these changes needed?

In this PR, we refactor MetricsLogger and Stats into a design that is
more sustainable (more tests plus the ability for users to modify behaviour
through configuration) and that also improves on a number of key issues.

**The main issues with the old design are:**

**Too many useless combinations of input arguments to methods like
MetricsLogger.log_value()**
These increase the test surface area a lot; this is solved by refactoring
the legacy Stats class into multiple stats classes such as StatsBase,
MaxStats, or LifetimeSumStats. These classes represent the few valid
combinations of input arguments, such as (reduce="mean", window=10).

**Many metrics are easily skewed because windows are applied in parallel
components AND in the root metrics logger**
This is opaque and not easy to work around. This PR solves it by
treating stats at the root metrics logger differently (stats have
"is_root" and "is_leaf" variables to align their pushing, reducing,
peeking, and merging behaviour with where they are). For example,
stats in the root metrics logger are mostly used to aggregate metrics,
and therefore we don't apply a window there, even if the values at the
leafs are logged with a window.
This also has the nice side effect that we no longer need to specify how
to prioritize values when merging (see the deprecated
"reduce_per_index_on_aggregate" argument).

**Different modes of aggregation (sometimes we use MetricsLogger.log_dict
and sometimes MetricsLogger.aggregate)**
This leads to different aggregation patterns. For example, we use
MetricsLogger.log_dict() to log results from Learner updates. This can
happen n times per training iteration, and stats with a window of 1 are
overridden on each call, so we report only the last value logged
instead of actual averages.

**Skewed throughputs**
Before this PR, throughputs were calculated as the exponentially
weighted average over increments divided by the time passed since the
last update for the given metric. This can lead to a noticeable bias;
anecdotally, we found that some throughput metrics were off by a few
percent in the cases we looked at.
This PR improves on this by calculating all throughputs as the sum of
all increments per iteration divided by the total iteration time (see
the sketch below).
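
A toy numeric sketch contrasting the two estimators; the numbers are illustrative, not taken from RLlib:

```python
# One training iteration with two metric updates of very different sizes.
increments = [10, 1000]        # items logged per update
deltas_s = [1.0, 0.1]          # time since the previous update
iteration_time_s = sum(deltas_s)

# Old: exponentially weighted average of per-update rates.
alpha = 0.1
ewa = increments[0] / deltas_s[0]
for inc, dt in zip(increments[1:], deltas_s[1:]):
    ewa = (1 - alpha) * ewa + alpha * (inc / dt)

# New: total increments divided by total iteration time.
throughput = sum(increments) / iteration_time_s

print(round(ewa, 1))         # 1009.0  (biased by the burst's instantaneous rate)
print(round(throughput, 1))  # 918.2   (true aggregate rate for the iteration)
```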

Other notable changes:
- No need for "tensor mode". Keep all tensors as they are, detach and
move to CPU as soon as we peek, reduce or create a state. Therefore,
when instantiating, all values are on CPU and implicitly converted to
Tensors and moved to GPU as soon as values are logged.
- Discriminate between Stats being leaf/non-leaf and root/non-root. Leaf
stats can be pushed to. Non-leaf stats are used to aggregate. Root stats
can have a special function (like with LifetimeSum)
- Discriminate between MetricsLoggers being root and non-root to inform
Stats objects whether they are instantiated inside a root MetricsLogger.

This PR includes a mini migration guide that points out what needs to
happen from the user's perspective.

---------

Signed-off-by: Artur Niederfahrenhorst <[email protected]>
Co-authored-by: Daniel Sperber <[email protected]>
also removes corebuild, and will only use `corebuild-multipy`

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: Lonnie Liu <[email protected]>
it got broken after the Python version upgrade

Signed-off-by: Lonnie Liu <[email protected]>
dayshah and others added 18 commits December 20, 2025 12:09
…tive solution (ray-project#56082)

## Why are these changes needed?

When we retry EnvRunner._sample, we call EnvRunner._sample again. This
makes for a recursive solution, leaving already-done episodes in
done_episodes_to_return.

This PR makes it so that ...

- We stay true to the promise of the docstring of the EnvRunner.sample()
methods that we return at least n timesteps, even if envs are restarted.
Before, we would not reset how many timesteps had been sampled when
resetting the environment (and thus starting a new episode to collect
chunks from).
- ... `done_episodes_to_return` does not start from zero if we retry in
the new recursive call
- ... we can make an arbitrary number of calls to env_runner.sample(),
sampling episodes and timesteps as we like. Before, we'd break if the
number of episodes or timesteps was reached, thus leaving other episodes
in a dangling state even if they were finished. (Ran the included test
on the old version of the code to confirm.)
- ... we don't recurse, which avoids the risk of future memory leaks

---------

Co-authored-by: Kamil Kaczmarek <[email protected]>
## Description
Fix: the ingress deployment name could be modified if a child deployment
has the same name.

## Related issues
Fixes ray-project#53295

## Additional information
Because there is a child app with the same name as the ingress
deployment app, the ingress deployment app name was modified during the
_build_app_recursive function. Therefore, we should use the modified name
instead.

Another solution would be to change the child_app name instead of the
ingress deployment app name.

---------

Signed-off-by: Le Duc Manh <[email protected]>
…er initialization (ray-project#59611)

## Description
Fixes a race condition in
`MetricsAgentClientImpl::WaitForServerReadyWithRetry` where concurrent
HealthCheck callbacks could both attempt to initialize the exporter,
causing GCS to crash with:
```
Check failed: !exporting_started_ RayEventRecorder::StartExportingEvents() should be called only once.
```
The `exporter_initialized_` flag was a non-atomic bool. When multiple
HealthCheck RPCs completed simultaneously, their callbacks could both
read false before either set it to true, leading to `init_exporter_fn`
being called twice. Changed the flag to `std::atomic<bool>` to ensure
only one callback wins the race (a Python analog of the once-only guard
is sketched below).
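
A minimal Python analog of the once-only initialization guard, using a lock-protected test-and-set in place of the C++ atomic; this is not the actual GCS code:

```python
import threading

class ExporterInitializer:
    """Ensure init runs at most once even if health-check callbacks race."""

    def __init__(self, init_exporter_fn):
        self._init_exporter_fn = init_exporter_fn
        self._initialized = False
        self._lock = threading.Lock()

    def on_health_check_ok(self) -> None:
        # Test-and-set under a lock plays the role of the atomic exchange:
        # only the first caller observes False and runs the initializer.
        with self._lock:
            if self._initialized:
                return
            self._initialized = True
        self._init_exporter_fn()
```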

Signed-off-by: Sagar Sumit <[email protected]>
…STOP_REQUESTED in autoscaler v2 (ray-project#59550)

## Description

When the autoscaler attempts to terminate QUEUED instances to enforce
the `max_num_nodes_per_type` limit, the reconciler crashes with an
assertion error. This happens because QUEUED instances are selected for
termination, but the state machine doesn't allow transitioning them to a
terminated state.

The reconciler assumes all non-ALLOCATED instances have Ray running and
attempts to transition QUEUED → RAY_STOP_REQUESTED, which is invalid.


https://github.com/ray-project/ray/blob/ba727da47a1a4af1f58c1642839deb0defd82d7a/python/ray/autoscaler/v2/instance_manager/reconciler.py#L1178-L1197

This occurs when `max_workers` configuration is dynamically reduced or
when instances exceed the limit.

```
2025-12-04 06:21:55,298	INFO event_logger.py:77 -- Removing 167 nodes of type elser-v2-ingest (max number of worker nodes per type reached).
2025-12-04 06:21:55,307 - INFO - Update instance QUEUED->RAY_STOP_REQUESTED (id=e13a1528-ffd9-403b-9fd1-b54e9c2698a0, type=elser-v2-ingest, cloud_instance_id=, ray_id=): draining ray: Terminating node due to MAX_NUM_NODE_PER_TYPE: max_num_nodes=None, max_num_nodes_per_type=220
2025-12-04 06:21:55,307	INFO instance_manager.py:263 -- Update instance QUEUED->RAY_STOP_REQUESTED (id=e13a1528-ffd9-403b-9fd1-b54e9c2698a0, type=elser-v2-ingest, cloud_instance_id=, ray_id=): draining ray: Terminating node due to MAX_NUM_NODE_PER_TYPE: max_num_nodes=None, max_num_nodes_per_type=220
2025-12-04 06:21:55,307 - ERROR - Invalid status transition from QUEUED to RAY_STOP_REQUESTED
```

This PR adds a valid transition `QUEUED -> TERMINATED` to allow canceling
queued instances (see the sketch below).
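
A minimal sketch of the state-machine change; the status names mirror autoscaler v2, but the transition table here is illustrative, not the real instance-manager code:

```python
# Illustrative subset of valid transitions, with the edge added by this PR.
VALID_TRANSITIONS = {
    "QUEUED": {"REQUESTED", "TERMINATED"},  # QUEUED -> TERMINATED is new
    "ALLOCATED": {"RAY_INSTALLING", "RAY_STOP_REQUESTED", "TERMINATED"},
}

def transition(current: str, new: str) -> str:
    if new not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"Invalid status transition from {current} to {new}")
    return new

transition("QUEUED", "TERMINATED")  # now allowed instead of crashing
```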

## Related issues
Closes ray-project#59219


Signed-off-by: win5923 <[email protected]>
## Description
When running the code below:
```
from ray import ActorID
ActorID.nil().job_id
```
or
```
from ray import TaskID
TaskID.nil().job_id()
```

The error below appears:
<img width="1912" height="331" alt="Screenshot 2025-12-18 18:49:18"
src="https://github.com/user-attachments/assets/b4200ef8-10df-4c91-83ff-f96f7874b0ce"
/>

The program should throw an error instead of crashing, and this PR fixes it
by adding a helper function that does a nil check (see the sketch below).
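
A hypothetical sketch of the guard; the real check lives in Ray's Cython ID bindings, and the helper name here is made up:

```python
def _check_id_not_nil(ray_id, method_name: str) -> None:
    """Raise instead of crashing when an accessor is called on a nil ID."""
    if ray_id.is_nil():
        raise ValueError(f"Cannot call {method_name}() on a nil ID.")

# Usage sketch: run the check before computing the job ID, e.g.
# _check_id_not_nil(actor_id, "job_id")
```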

## Related issues
Closes [ray-project#53872](ray-project#53872)

## Additional information
After the fix, it now throws a `ValueError`:
<img width="334" height="52" alt="Screenshot 2025-12-20 08:47:30"
src="https://github.com/user-attachments/assets/00228923-2d26-4cb4-bf53-615945d2ce6c"
/>

<img width="668" height="103" alt="ζˆͺεœ– 2025-12-20 上午8 47 49"
src="https://github.com/user-attachments/assets/ee68213a-681a-4499-bef2-2e13533e3ffd"
/>

---------

Signed-off-by: Alex Wu <[email protected]>
…0 GPUs on CPU-only cluster (ray-project#59514)

If you request zero GPUs from the autoscaling coordinator but no GPUs
exist on the cluster, the autoscaling coordinator crashes.

This PR fixes that bug.

Signed-off-by: Balaji Veeramani <[email protected]>
…object construction (ray-project#59500)

## Description
Reduce overhead added by token authentication:

- Return shared_ptr from AuthenticationTokenLoader::GetToken() instead
of constructing a new AuthenticationToken object copy every time (which
would also add object destruction overhead)
- Cache the token in the client interceptor at construction (previously
GetToken() was called for every RPC; see the sketch after this list)
- Use CompareWithMetadata() to validate tokens directly from string_view
without constructing new AuthenticationToken objects
- Pass shared_ptr through ServerCallFactory to avoid per-call copies
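
A Python sketch of the caching idea behind the interceptor change; the names are illustrative, and the real code is C++ using shared_ptr rather than Python objects:

```python
class AuthClientInterceptor:
    """Load the token once at construction instead of on every RPC."""

    def __init__(self, token_loader):
        self._token = token_loader()  # cached once, no per-call load/copy

    def intercept(self, metadata: dict) -> dict:
        metadata["authorization"] = self._token
        return metadata
```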

Release tests:

Without this change, the microbenchmark `multi_client_put_gigabytes` was
in the 25-30 range, e.g. this run:
https://buildkite.com/ray-project/release/builds/70658

With this change, it is in the 40-45 range:
https://buildkite.com/ray-project/release/builds/72070

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
## Description
This is a series of PRs to refactor/consolidate progress reporting and
to decouple it from the executor, opstates, etc.

## Related issues
Split from ray-project#58173 

## Additional information
N/A

---------

Signed-off-by: kyuds <[email protected]>
Signed-off-by: Daniel Shin <[email protected]>
## Description

Move extension types to ray.data

ray.air.util.object_extensions -> ray.data._internal.object_extensions
ray.air.util.tensor_extensions -> ray.data._internal.tensor_extensions

## Related issues
Closes ray-project#59418

Signed-off-by: will <[email protected]>
…ovements. (ray-project#59544)

## Description
APPO targets high-throughput RL and relies on asynchrony via actor
parallelism and multi-threading. In our tests, the local-learner setup
(num_learners=0) underperformed, prompting a deeper investigation into
root causes and the improvements that follow.

### Causes
- __Producer-driven + weak backpressure:__ The data flow was
producer-driven. There was no bounded, blocking handoff to pace
producers. The in-queue for the `LearnerThread` was a `deque`, which is
non-blocking. As a result, the `Learner` wasn't setting the pace (sawtooth
throughput and stalls at the 20-update barrier, i.e. a starved learner;
furthermore, CPU waste and a hot GIL when polling empty queues).
- __Thread contention at multiple places:__ `Learner` and
`_LearnerThread` shared the `_num_updates_lock` and
`metrics._threading_lock` (`RLock`). On every update, both threads
contended on the same lock; every 20 updates, occasionally both threads
contended on the other shared lock.
- __Global timesteps, race condition:__ `update()` wrote a global
`_CURRENT_GLOBAL_TIMESTEPS` that the `_LearnerThread` had to read
later. Rapid calls to `update()` could overwrite it before the thread
consumed it, i.e. the timesteps could mismatch the batch actually trained.
- __Aggregation race + spurious reduces:__ The "every 20 updates" path
checked a copy of `_num_updates` without synchronization, then reset it
at `>=20`, but without any threading event/condition to align the
producer with the consumer. This often returned `{}` or reduced at odd
times.
- __No pinned memory or stream handoff:__ GPU copies used
`pin_memory=False` and there was no explicit stream handoff; any
implicit sync may land inside the learner update timing.
- __Reference resolving on the producer's hot path:__
`TrainingData.solve_refs()` was called synchronously inside
`LearnerGroup.update()` before queuing, which sometimes cost around ~25%
of the time in some calls. This extends the window in which producer and
learner can drift.
- __Mixed queue semantics + missing task semantics:__ The code mixed
`queue.Queue` and `deque` (and optionally `CircularBuffer`).
`task_done()`/`join()` semantics don't exist for `deque`; correctness
relied on polling and manual drops. There was no bounded, blocking
handoff to pace producers. This was brittle under load.
- __No clean stop path:__ The thread used a `stop` flag but no sentinel
was enqueued; if the thread was blocked or polling, shutdown could hang
or counters could be incremented after a stop.
- __Complete multi-learner stalling:__ In multi-learner setups with
__multi-agent__ policies, asynchronous batches (i.e. batches with
different combinations of policies) led to stalls in Torch's
`DistributedDataParallel` asynchronous gradient synchronization. One
rank computed gradients for a policy not present on the other rank(s)
and waited indefinitely for synced gradients.

### Improvements by this PR
- __Consumer-driven:__ `Learner` dictates the pace through blocking queues
(`Learner` does `get()`, the producer does `put()`). That avoids busy
polling (no CPU burn). Faster reloading through quick returns from
`Learner.update()` when no results are ready. Avoids starving the learner:
bigger queues accommodate frequent producer burstiness.
- __Edge-triggered aggregation:__ Only `_LearnerThread` increments a
private counter and on __exactly__ the `broadcast_interval`th update
fires an __event__ (`_agg_event`). The producer simply `wait`s for the
event and `clear`s it (no lock fight). Furthermore, the `_LearnerThread`
now reduces metrics and returns them through an out-queue from which the
`Learner` picks them up and returns them to the main process. All of
these measures reduce thread contention to an absolute minimum.
- __Pass meta-data with batch:__ The `Learner` enqueues a tuple `(batch,
timesteps)` so the `_LearnerThread` consumes the correct timesteps
atomically. This also reduces communication and boilerplate.
- __(Optional) Deferral of reference resolving:__ Post-solve references
in `Learner.update()` to return faster in asynchronous calls.
- __Clean stop + consistent semantics:__ Use a `_STOP_SENTINEL` through
the learner queue instead of relying on a boolean flag alone, and call
`task_done()` on a real `queue.Queue` (if not using `CircularBuffer`).
Furthermore, the buffer/queue API inside `Learner` is unified (see the
sketch after this list).
- __Safe-guarding multi-agent multi-learner training:__ Manual
synchronization of gradients replaces Torch's `DistributedDataParallel`
hooks-based synchronization for multi-learner multi-agent setups.
Gradients on each rank are zero-padded and synced after all gradients
have been computed.
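
A runnable sketch of the consumer-driven handoff with a stop sentinel, using simplified names; it illustrates the pattern, not RLlib's actual classes:

```python
import queue
import threading

_STOP_SENTINEL = object()

def learner_thread(in_queue: queue.Queue, out_queue: queue.Queue) -> None:
    """Consumer-driven loop: a blocking get() paces the producers."""
    while True:
        item = in_queue.get()  # blocks, so there is no busy polling
        if item is _STOP_SENTINEL:
            in_queue.task_done()
            break
        batch, timesteps = item  # metadata travels atomically with the batch
        out_queue.put({"trained_on": timesteps})
        in_queue.task_done()

in_q: queue.Queue = queue.Queue(maxsize=32)  # bounded: blocks fast producers
out_q: queue.Queue = queue.Queue()
t = threading.Thread(target=learner_thread, args=(in_q, out_q))
t.start()
in_q.put(("batch-0", 128))
in_q.put(_STOP_SENTINEL)  # clean shutdown, no dangling boolean flag
t.join()
print(out_q.get())  # {'trained_on': 128}
```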

## Related issues


## Additional information
Because this PR reshapes the data flow, a few tuning tips are useful:

- Circular buffer vs. simple queue. The old CircularBuffer prevented
learner starvation but its push/pop are slower. The new consumer-driven
pipeline is generally more efficient - assuming producers are reasonably
fast and the learner queue isn't tiny. Use `use_circular_buffer=True`
only when producing is expensive/irregular (it lets the learner keep
iterating over buffered data, similar to `num_epochs > 1` but in
cycles). Otherwise, prefer the simple queue. Recommended defaults:
`simple_queue_size=32` for `APPO`; `IMPALA` keeps a smaller
`learner_queue_size=3`.
- Unified interval: broadcast & metrics reduction. Previously, weights
were synced by `broadcast_interval` while metrics were reduced every
fixed 20 updates. The new design unifies these: `broadcast_interval` now
controls both weight sync and metrics reduction. In practice, ~10
balances steady flow with acceptable off-policy lag.
- Scale producers to match a fast learner. The `_LearnerThread` applies
updates quickly, so overall throughput is often producer-bound. To feed
it well, increase `num_env_runners` and/or `num_envs_per_env_runner`.


### Next steps
This PR improves dataflow focused on the learner(s). The next steps are:
- To increase throughput in `AggregatorActors`
- To improve dataflow in IMPALA's main thread.
- To boost performance in loss calculation.
- To check asynchronous calls to `EnvRunner`s and `Learner`s.
- To test resolving references in either `_GPULoaderThreads` or
`_LearnerThread` instead of the `Learner`'s main thread.

### Tests
APPO in this PR was tested on the following (multi-agent) environments:
- `CartPole-v1`
- `ALE:Pong-v5`
- `Footsies` (see https://github.com/chasemcd/FootsiesGym)

#### `CartPole-v1`
This PR improves performance significantly for high-producer scenarios
like `CartPole-v1`. All tests used:
- `broadcast_interval=10`
- `use_circular_buffer=False`
- `num_aggregator_actors_per_learner=3`
- `num_env_runners=num_learners x 32`
- `episodes_to_numpy=False`
- `num_gpus_per_learner=1` 

<img width="757" height="404" alt="image (3)"
src="https://github.com/user-attachments/assets/3beee428-d4c0-42f4-811d-61d81de484c2"
/>

#### `ALE:Pong-v5`
All tests used:
- `broadcast_interval=10`
- `use_circular_buffer=False`
- `num_aggregator_actors_per_learner=6`
- `num_env_runners=num_learners x 32`
- `episodes_to_numpy=True` (`FrameStack` connector with 4 frames)
- `num_gpus_per_learner=1`

<img width="676" height="366" alt="image"
src="https://github.com/user-attachments/assets/43d08a87-0cc1-4902-8150-adc1c3203be6"
/>

---------

Signed-off-by: simonsays1980 <[email protected]>
## Description
This PR improves type hinting in `ray._common.retry`. It previously
contained a lot of `Any` and unspecified generics, and should now be
fully specific.

---------

Signed-off-by: Jonas Dedden <[email protected]>
## Description
When using `runtime_env.working_dir` with a remote zip archive URL (for
example, `https://gitee.com/whaozi/kuberay/repository/archive/master.zip`),
Ray downloads an HTML page instead of the actual zip file. This causes
the Ray job to fail when accessing files from the working directory.

Downloading the same URL with standard tools such as `wget` works as
expected and returns the correct zip archive. This PR addresses the
inconsistency in how `runtime_env.working_dir` handles remote archive
downloads.

#### For example
```
import ray

ray.init(include_dashboard=False, ignore_reinit_error=True)
@ray.remote(
    runtime_env={"working_dir": "https://gitee.com/whaozi/kuberay/repository/archive/master.zip"}
)
def list_repo_files():
    import pathlib
    return sorted(p.name for p in pathlib.Path(".").iterdir())

print(ray.get(list_repo_files.remote()))
ray.shutdown()
```

`https_gitee_com_whaozi_kuberay_repository_archive_master` is empty, and
`https_gitee_com_whaozi_kuberay_repository_archive_master.zip` is an HTML
file:
<img width="1438" height="550" alt="image"
src="https://github.com/user-attachments/assets/ec330c99-3bf7-431a-8f3e-6c1789e257ab"
/>


#### We test
```
wget https://gitee.com/whaozi/kuberay/repository/archive/master.zip
--2025-08-05 14:28:52--  https://gitee.com/whaozi/kuberay/repository/archive/master.zip
Resolving gitee.com (gitee.com)... 180.76.198.77, 180.76.199.13, 180.76.198.225
Connecting to gitee.com (gitee.com)|180.76.198.77|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D [following]
--2025-08-05 14:28:54--  https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D
Reusing existing connection to gitee.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: 'master.zip'

master.zip                                 [                                                  <=>                        ]  10.37M  1.23MB/s    in 13s
```
I think we are not handling HTTP redirection here. If I directly use the
redirected URL, it works:
```
from smart_open import open as open_file

with open_file("https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D", "rb") as fin:
    with open_file("/tmp/jjyao_test.zip", "wb") as fout:
        fout.write(fin.read())
```

#### The problem
When using `runtime_env.working_dir` with a remote zip URL (e.g. gitee
archives), Ray's HTTPS downloader uses the default Python-urllib
user-agent, and some hosts respond with HTML rather than the archive.
The working directory then contains HTML and the Ray job fails, while
wget succeeds because it presents a curl-like user-agent.

#### Solution
`_download_https_uri()` now sets curl-like headers
(a `ray-runtime-env-curl/1.0` UA plus `Accept: */*`, configurable via
`RAY_RUNTIME_ENV_HTTP_USER_AGENT`). This keeps Ray's behavior consistent
with curl/wget, allowing gitee and similar hosts to return the proper
zip file. A regression test verifies the headers are set (see the sketch
below).
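
A minimal sketch of the header change using only the standard library; the UA string and env var name come from this PR's description, and the URL is the example above:

```python
import urllib.request

headers = {
    "User-Agent": "ray-runtime-env-curl/1.0",  # curl-like UA instead of Python-urllib
    "Accept": "*/*",
}
req = urllib.request.Request(
    "https://gitee.com/whaozi/kuberay/repository/archive/master.zip",
    headers=headers,
)
# urlopen follows the 302 redirect and returns the actual zip bytes.
with urllib.request.urlopen(req) as resp:
    data = resp.read()
```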

## Related issues
Fixes ray-project#52233

---------

Signed-off-by: yaommen <[email protected]>
Previously, if the user did not specify them, Ray preassigned the GCS
port, dashboard agent port, runtime environment port, etc., and passed
them to each component at startup. This created a race condition: Ray
might believe a port is free, but by the time the port information is
propagated to each component, another process may have already bound to
that port.

This can cause user-facing issues, for example when Raylet heartbeat
messages are missed frequently enough that the GCS considers the node
unhealthy and removes it.

We originally did this because there was no standard local service
discovery, so components had no way to know each other's serving ports
unless they were preassigned.

The final port discovery design is here:
<img width="2106" height="1492" alt="image"
src="https://github.com/user-attachments/assets/eaac8190-99d8-404b-8a8d-283a4f2f0f33"
/>
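
The standard technique behind this design, sketched with a plain socket: bind to port 0 so the OS assigns a free port atomically, then report the actual port instead of preassigning one:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("127.0.0.1", 0))  # port 0: the OS picks a free port atomically
sock.listen()
actual_port = sock.getsockname()[1]
# The component would now report `actual_port` back (e.g. via a
# <component>_port.json file or a GCS/raylet registration message).
print(f"serving on port {actual_port}")
```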



This PR addresses port discovery for:
- GCS reporting back to the startup script (driver) ✅
- The runtime env agent reporting back to the raylet ✅
- The dashboard agent reporting back to the raylet ✅
- The raylet blocking registration with the GCS until it has collected
port information from all agents ✅
- GCS adding InitMetricsExporter to node_added_listeners_ so it starts
the MetricsExporter as soon as the raylet registers with the GCS with
complete port information ✅
- The Ray client server obtaining the runtime env agent port from GCS ✅
- Ensuring that both a connected-only driver (e.g., `ray.init()`) and a
startup driver still receive all port information from the GCS ✅
- Ensure GCS FT works: using the same GCS port as before ✅
- Ensure no metric loss ✅
- Clean up the old cached-port code ✅

(Note that this PR is a clean-up version of
ray-project#59065)

## Consideration
**GCS Fault tolerance:**
GCS fault tolerance requires GCS to restart using exactly the same port,
even if it initially starts with a dynamically assigned port (0). Before
this PR, GCS cached the port in a file, and this PR preserves the same
behavior (although ideally, the port should only be read from the file
by the Raylet and its agent).

This can be further improved by storing the GCS port in Redis, but that
should be addressed in a separate PR.

**GCS start sequence related:**
OpenCensus Exporter and the Event Aggregator Client are now constructed
without connecting to the agent port; instead, they defer the actual
connection until the head Raylet registers via a callback. At that
point, the actual metrics_agent_port is known from the node information.

The OpenTelemetry Exporter is now also initialized at head Raylet
registration time.

**Ray nodes that share the same file system:**
There are cases where people run multiple Ray nodes from the same or
different Ray clusters, so the port file name is based on a fixed prefix
plus the node ID.

## Related issues
Closes ray-project#54321

## Test

For GCS-related work, here is a detailed test I wrote that covers seven
starting/connecting cases:
-
https://github.com/Yicheng-Lu-llll/ray/blob/port-self-discovery-test-file/python/ray/tests/test_gcs_port_reporting.py
  - ray.init starts a head node and exposes a dynamic GCS port.
  - Connect a driver via address="auto" using the address file
  - Connect a driver via an explicit address
  - CLI starts head with dynamic GCS port
  - CLI starts worker connecting to the head via GCS address
  - CLI starts head with an explicit GCS port
  - CLI starts head with default GCS port

For runtime env agent:
-
https://github.com/Yicheng-Lu-llll/ray/blob/port-self-discovery-test-file/test_agent_port.py
  - ray start --head (auto port discovery)
  - ray start --head with fixed runtime-env-agent-port
  - ray.init() local cluster (auto port discovery)
  - (we don't have ray.init() with fixed _runtime_env_agent_port)

Test that ray_client_server works correctly with dynamic runtime env
agent port:
-
https://github.com/Yicheng-Lu-llll/ray/blob/port-self-discovery-test-file/test_ray_client_with_runtime_env.py

For dashboard agent ports, the existing tests already cover this quite
well.


## Follow up
- The dashboard agent reporting back to the raylet
- The dashboard agent now also writes to GCS, but we should allow only
the raylet to write to GCS

## Performance

Before this PR:

```shell
[0.000s] Starting ray.init()...
[0.000s] Session dir created
[0.070s] Process: gcs_server
[6.885s] Process: runtime_env_agent
[6.955s] Process: raylet
[6.955s] Process: dashboard_agent
2025-12-12 04:47:34,391 INFO worker.py:2014 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
/home/ubuntu/ray/python/ray/_private/worker.py:2062: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
[9.061s] ray.init() completed
```

After this PR:
```shell
[0.000s] Starting ray.init()...
[0.075s] Process: gcs_server
[0.075s] Session dir created
[0.075s] File: gcs_server_port.json = 39451
[6.976s] Process: raylet
[6.976s] Process: dashboard_agent
[6.976s] Process: runtime_env_agent
[7.576s] File: runtime_env_agent_port.json = 38747
[7.640s] File: metrics_agent_port.json = 40005
[8.083s] File: metrics_export_port.json = 44515
[8.083s] File: dashboard_agent_listen_port.json = 52365
2025-12-12 02:02:54,925 INFO worker.py:1998 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
/home/ubuntu/ray/python/ray/_private/worker.py:2046: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
[10.035s] ray.init() completed
```
We can see that the dominant cost is actually starting the GCS: we
wait for the GCS to be ready and write the cluster info.
The port reporting itself is quite fast (file appearance time minus
raylet start time).

https://github.com/ray-project/ray/blob/863ae9fd573b13a05dcae63b483e9b1eb0175571/python/ray/_private/node.py#L1365-L367

---------

Signed-off-by: yicheng <[email protected]>
Co-authored-by: yicheng <[email protected]>
Clean Up Deprecated Env Var and Document Undocumented Env Vars

### Summary

Remove deprecated `RAY_SERVE_ENABLE_JSON_LOGGING` and add documentation
for undocumented Ray Serve environment variables.

### Changes

| File | Description |
|------|-------------|
| `python/ray/serve/_private/constants.py` | Removed `RAY_SERVE_ENABLE_JSON_LOGGING`, renamed `SERVE_ROOT_URL_ENV_KEY` → `RAY_SERVE_ROOT_URL`, removed deprecated `CONTROLLER_MAX_CONCURRENCY` fallback |
| `python/ray/serve/_private/logging_utils.py` | Removed deprecated JSON
logging logic and warning |
| `python/ray/serve/_private/controller.py` | Updated to use
`RAY_SERVE_ROOT_URL` constant |
| `doc/source/serve/monitoring.md` | Removed deprecation note, added
`RAY_SERVE_CONTROLLER_CALLBACK_IMPORT_PATH` docs |
| `doc/source/serve/advanced-guides/performance.md` | Added
`RAY_SERVE_CONTROLLER_MAX_CONCURRENCY` docs |
| `doc/source/serve/production-guide/config.md` | Added
`RAY_SERVE_ROOT_URL` docs |

### New Documentation

| Environment Variable | Description |
|---------------------|-------------|
| `RAY_SERVE_CONTROLLER_MAX_CONCURRENCY` | Max concurrent requests for
Controller (default: 15000) |
| `RAY_SERVE_CONTROLLER_CALLBACK_IMPORT_PATH` | Callback for custom
Controller initialization |
| `RAY_SERVE_ROOT_URL` | Override root URL (useful behind load
balancers) |

### Migration

Users using `RAY_SERVE_ENABLE_JSON_LOGGING=1` should migrate to
`LoggingConfig` with `encoding="JSON"`, as sketched below.
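
A migration sketch, assuming the `LoggingConfig` schema and the `logging_config` deployment parameter from Ray Serve; treat the exact import path as an assumption:

```python
from ray import serve
from ray.serve.schema import LoggingConfig  # assumed import path

# Replaces RAY_SERVE_ENABLE_JSON_LOGGING=1 with per-deployment config.
@serve.deployment(logging_config=LoggingConfig(encoding="JSON"))
class MyDeployment:
    def __call__(self, request) -> str:
        return "ok"

# serve.run(MyDeployment.bind())
```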

---------

Signed-off-by: harshit <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is an automated daily merge from master to main. The changes are extensive and primarily focus on updating the CI and build infrastructure. Key changes include dropping support for Python 3.9 and making Python 3.10 the new default, which is reflected across numerous Buildkite pipeline configurations, Dockerfiles, and Bazel build files. There's a significant refactoring of the CI setup, such as moving documentation-related jobs to a dedicated doc.rayci.yml file and simplifying dependency set compilation. The Docker image build process is modernized by using uv for Python environment setup. Additionally, there are many documentation updates, including new guides for contributing to Ray Data, using new features like label-based scheduling and incremental upgrades for RayService, and improved explanations for existing concepts. I've added a couple of comments regarding a potentially broken test and a documentation clarification. Overall, the changes appear consistent with a major version bump and infrastructure cleanup.

Comment on lines +357 to +371
```python
    def test_include_setuptools(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            copy_data_to_tmpdir(tmpdir)
            manager = _create_test_manager(tmpdir)
            manager.compile(
                constraints=[],
                requirements=["requirements_test.txt"],
                name="general_depset",
                output="requirements_compiled_general.txt",
                include_setuptools=True,
            )
            output_file = Path(tmpdir) / "requirements_compiled_general.txt"
            output_text = output_file.read_text()
            assert "--python-version 3.10" in output_text
            assert "--unsafe-package setuptools" not in output_text
```


medium

This test incorrectly checks the content of the generated lock file for a command-line argument (--unsafe-package setuptools). Command-line arguments are not written to the lock file, especially with --no-header now being a default flag.

To correctly verify the behavior, this test should be updated to check stdout to ensure the flag is not passed to the uv command, similar to how test_ignore_setuptools is implemented. This will make the test more robust and correctly reflect the intended logic.

Suggested change

```diff
-    def test_include_setuptools(self):
+    @patch("sys.stdout", new_callable=io.StringIO)
+    def test_include_setuptools(self, mock_stdout):
         with tempfile.TemporaryDirectory() as tmpdir:
             copy_data_to_tmpdir(tmpdir)
             manager = _create_test_manager(tmpdir)
             manager.compile(
                 constraints=[],
                 requirements=["requirements_test.txt"],
                 name="general_depset",
                 output="requirements_compiled_general.txt",
                 include_setuptools=True,
             )
-            output_file = Path(tmpdir) / "requirements_compiled_general.txt"
-            output_text = output_file.read_text()
-            assert "--python-version 3.10" in output_text
-            assert "--unsafe-package setuptools" not in output_text
+            stdout = mock_stdout.getvalue()
+            assert "--unsafe-package setuptools" not in stdout
```

Comment on lines 184 to 204
```diff
             .map_batches(drop_nas, batch_format="pandas")
         )
 
+    .. tab-item:: pyarrow
+
-The user defined function you pass to :meth:`~ray.data.Dataset.map_batches` is more flexible. Because you can represent batches
-in multiple ways (see :ref:`Configuring batch format <configure_batch_format>`), the function should be of type
-``Callable[DataBatch, DataBatch]``, where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``. In
-other words, your function should take as input and output a batch of data which you can represent as a
-pandas DataFrame or a dictionary with string keys and NumPy ndarrays values. For example, your function might look like:
-
-.. testcode::
+        .. testcode::
 
-    import pandas as pd
+            import pyarrow as pa
+            import pyarrow.compute as pc
+            import ray
 
-    def fn(batch: pd.DataFrame) -> pd.DataFrame:
-        # modify batch
-        batch = ...
-
-        # return batch
-        return batch
+            def drop_nas(batch: pa.Table) -> pa.Table:
+                return pc.drop_null(batch)
+
+            ds = (
+                ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")
+                .map_batches(drop_nas, batch_format="pyarrow")
+            )
 
 The user defined function can also be a Python generator that yields batches, so the function can also
-be of type ``Callable[DataBatch, Iterator[[DataBatch]]``, where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``.
+be of type ``Callable[DataBatch, Iterator[[DataBatch]]``, where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray], pyarrow.Table]``.
 In this case, your function would look like:
```

medium

The explanation for the non-generator UDF signature for map_batches was removed. This makes the documentation less clear for this common use case. It would be beneficial to restore the explanation for Callable[DataBatch, DataBatch] and update the DataBatch type hint to include pyarrow.Table for completeness.

@antfin-oss antfin-oss changed the title 🔄 daily merge: master → main 2025-12-24 🔄 daily merge: master → main 2025-12-25 Dec 25, 2025
@antfin-oss antfin-oss changed the title 🔄 daily merge: master → main 2025-12-25 🔄 daily merge: master → main 2025-12-26 Dec 26, 2025
github-actions bot commented Jan 9, 2026

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Jan 9, 2026

This pull request has been automatically closed because there has been no more activity in the 14 days
since being marked stale.

Please feel free to reopen or open a new pull request if you'd still like this to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for your contribution!

@github-actions github-actions bot closed this Jan 23, 2026
