stress test working #55539
Conversation
Signed-off-by: avigyabb <[email protected]>
Code Review

This pull request addresses a race condition with duplicate GPU object references by changing `_gpu_object_store` from a simple dictionary to a dictionary of deques. This allows multiple copies of an object to be queued and processed correctly. The addition of `test_duplicate_objectref_transfer` and `test_duplicate_objectref_race_stress` is great for verifying this fix. However, the changes to `pop_object` have introduced a critical issue: a memory leak where primary-copy metadata is never cleaned up. I've also pointed out a couple of minor style issues.
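For readers skimming the diff, here is a minimal illustrative sketch of the dictionary-of-deques layout the review describes. The class and method names are simplified stand-ins, not Ray's actual internals:

```python
from collections import defaultdict, deque

class SketchGPUObjectStore:
    """Toy stand-in for the per-actor GPU object store described above."""

    def __init__(self):
        # obj_id -> deque of tensor lists, one entry per queued copy.
        self._gpu_object_store = defaultdict(deque)

    def add_object(self, obj_id, tensors):
        self._gpu_object_store[obj_id].append(tensors)

    def pop_object(self, obj_id):
        queue = self._gpu_object_store.get(obj_id)
        assert queue, f"obj_id={obj_id} not found in GPU object store"
        return queue.popleft()

store = SketchGPUObjectStore()
# Two transfers of the same ObjectRef queue two copies instead of clobbering one.
store.add_object("obj-1", ["tensor-a"])
store.add_object("obj-1", ["tensor-a"])
assert store.pop_object("obj-1") == ["tensor-a"]
assert store.pop_object("obj-1") == ["tensor-a"]
```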
```python
queue = self._gpu_object_store.get(obj_id)
assert queue and len(queue) > 0, f"obj_id={obj_id} not found in GPU object store"
return queue.popleft()
```
The new implementation of `pop_object` introduces a memory leak. The previous version of this method was responsible for cleaning up an object's entry from `_primary_gpu_object_ids` when it was popped. This cleanup logic has been removed, meaning that `_primary_gpu_object_ids` will grow indefinitely as primary copies are created but never removed.

Additionally, when the queue for an `obj_id` becomes empty, the key will persist in `_gpu_object_store` because it's a `defaultdict`. This should also be cleaned up.

I suggest restoring the cleanup logic. The updated implementation below will remove the object from `_primary_gpu_object_ids` and `_gpu_object_store` once its queue is empty.
```diff
-queue = self._gpu_object_store.get(obj_id)
-assert queue and len(queue) > 0, f"obj_id={obj_id} not found in GPU object store"
-return queue.popleft()
+queue = self._gpu_object_store.get(obj_id)
+assert queue and len(queue) > 0, f"obj_id={obj_id} not found in GPU object store"
+tensors = queue.popleft()
+if not queue:
+    # The queue is empty, clean up the entry.
+    del self._gpu_object_store[obj_id]
+    self._primary_gpu_object_ids.discard(obj_id)
+return tensors
```
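As a sanity check, here is a self-contained sketch (a simplified stand-in, not the actual Ray class) of the behavior the suggestion aims for: once the last queued copy is popped, both bookkeeping structures drop the key. Note that `discard` rather than `remove` keeps the pop from raising when the object was never a primary copy.

```python
from collections import defaultdict, deque

class SketchStore:
    def __init__(self):
        self._gpu_object_store = defaultdict(deque)
        self._primary_gpu_object_ids = set()

    def pop_object(self, obj_id):
        queue = self._gpu_object_store.get(obj_id)
        assert queue and len(queue) > 0, f"obj_id={obj_id} not found in GPU object store"
        tensors = queue.popleft()
        if not queue:
            # The queue is empty, clean up the entry.
            del self._gpu_object_store[obj_id]
            self._primary_gpu_object_ids.discard(obj_id)
        return tensors

store = SketchStore()
store._gpu_object_store["obj-1"].append(["tensor-a"])
store._primary_gpu_object_ids.add("obj-1")
store.pop_object("obj-1")
# After the last copy is popped, neither structure retains the key.
assert "obj-1" not in store._gpu_object_store
assert "obj-1" not in store._primary_gpu_object_ids
```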
```diff
@@ -68,6 +69,7 @@ def __ray_recv__(
 ):
     """Helper function that runs on the dst actor to receive tensors from the src actor."""
     from ray._private.worker import global_worker
+    # ~ signal that an object is arriving
```
Signed-off-by: avigyabb <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Screenshot: https://github.com/user-attachments/assets/8f3d6f4b-cc82-4a4b-8a01-6a15e9e628b7
- Adds pXX filters to reduce the number of panels
- Adds Task output backpressure time
- Fixes metrics for task completion time - wall time for task completion without backpressure

Reference PRs:
- iamjustinhsu#1
- ray-project#55025

Signed-off-by: iamjustinhsu <[email protected]>
Demo video: https://github.com/user-attachments/assets/402efd79-cda2-467e-b84a-774ccd69efa5

Signed-off-by: iamjustinhsu <[email protected]>
cpu bases above cuda bases. simpler case should be listed first. Signed-off-by: Lonnie Liu <[email protected]>
Filtering by Operator will be useful to see only a single operator at a time. Screenshot: https://github.com/user-attachments/assets/90b083ed-cf92-4071-9504-e1b214f20724

Signed-off-by: Alan Guo <[email protected]>
When working with search and recommendation systems, datasets often contain numerous columns, resulting in large metadata overhead in Parquet files (sometimes a few MBs or more per file). Currently, the driver first fetches all metadata, then simplifies and merges it to reduce memory usage. However, this process can cause memory peaks proportional to the number of fragments multiplied by their metadata size, potentially leading to OOM issues.

This PR addresses the problem by **simplifying** and **merging** the dataset metadata within each `_fetch_metadata` task before sending it back to the driver. This change helps lower memory consumption and reduces the risk of OOM errors.

Test script:
```
from pyarrow._fs import FileSystem
from ray.data.datasource.file_meta_provider import _get_file_infos
import os
import ray
import psutil

hdfs_path = "hdfs://path/to/dataset"

def list_files(remote_path):
    filesystem, remote_path = FileSystem.from_uri(remote_path)
    files = _get_file_infos(remote_path, filesystem)
    return filesystem, files

filesystem, files = list_files(hdfs_path)
files = [f"{fs[0]}" for fs in files if fs[0].endswith(".parquet")]
process = psutil.Process()
print(f"total file_num: {len(files)}")

start_mem = process.memory_info().rss / 1024**2
dataset = ray.data.read_parquet(paths=files[:500], filesystem=filesystem)
end_mem = process.memory_info().rss / 1024**2
print(f"datasize: {dataset.count()}, col number: {len(dataset.columns())}")
print(f"mem diff {end_mem - start_mem:.3f}MiB [start: {start_mem:.3f}MiB, end: {end_mem:.3f}MiB]")
```

Output before this PR:
```
total file_num: 2358
Metadata Fetch Progress 0: 100%|███████████████| 500/500 [02:13<00:00, 3.75 task/s]
Parquet Files Sample 0: 100%|███████████████| 5.00/5.00 [00:13<00:00, 2.62s/ file]
datasize: 22630452, col number: 1200
mem diff 13727.605MiB [start: 570.617MiB, end: 14298.223MiB]
```

Output after this PR:
```
total file_num: 2358
Metadata Fetch Progress 0: 100%|███████████████| 500/500 [01:55<00:00, 4.35 task/s]
Parquet Files Sample 0: 100%|███████████████| 5.00/5.00 [00:03<00:00, 1.56 file/s]
datasize: 22630452, col number: 1200
mem diff 69.113MiB [start: 575.820MiB, end: 644.934MiB]
```

The memory usage drops from roughly 13.7 GiB to about 69 MiB.

Note: this approach is most effective for large-scale datasets. If `len(fragments) < PARALLELIZE_META_FETCH_THRESHOLD`, there will be no performance improvements.

Signed-off-by: haotian <[email protected]>
Signed-off-by: Howie Tien <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <[email protected]>
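For intuition, here is a toy, illustrative-only sketch of the "merge metadata inside the fetch task" pattern that the commit above describes; the task body and field names are assumptions, not Ray Data's actual `_fetch_metadata` implementation:

```python
import ray

@ray.remote
def fetch_and_merge_metadata(fragment_paths):
    # Pretend each path yields a bulky per-file metadata blob (in Ray Data this
    # would come from Parquet file footers).
    per_file = [{"path": p, "num_rows": 1_000, "schema_cols": 1_200} for p in fragment_paths]
    # Reduce to one small summary *inside the task*, so the driver never holds
    # every per-file metadata object at the same time.
    return {
        "num_files": len(per_file),
        "num_rows": sum(m["num_rows"] for m in per_file),
        "schema_cols": per_file[0]["schema_cols"] if per_file else 0,
    }

if __name__ == "__main__":
    ray.init(ignore_reinit_error=True)
    print(ray.get(fetch_and_merge_metadata.remote(["a.parquet", "b.parquet"])))
```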
Thanks! Could you also update the PR title and description?
…t#55522) Perf can differ at various concurrencies. This makes it easy to configure / run at various concurrencies + max_ongoing_request combinations.

Signed-off-by: akyang-anyscale <[email protected]>
Signed-off-by: avigyabb <[email protected]>
Signed-off-by: avigyabb <[email protected]>
…tor object (ray-project#55538) Signed-off-by: Seiji Eicher <[email protected]>
…ject#55503) Currently, we use sum(shuffle_task.cpu_usage for all shuffling tasks). This can be very slow for a large number of shuffling tasks. We can mitigate this by calculating the usage on every task submission.

Signed-off-by: iamjustinhsu <[email protected]>
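A toy sketch of that optimization: maintain a running total that is updated on task submission and completion instead of re-summing every in-flight task's `cpu_usage` on each query. Class and method names here are illustrative, not Ray Data's real internals:

```python
class ShuffleUsageTracker:
    def __init__(self):
        self._total_cpu_usage = 0.0

    def on_task_submitted(self, task_cpu_usage: float) -> None:
        self._total_cpu_usage += task_cpu_usage

    def on_task_finished(self, task_cpu_usage: float) -> None:
        self._total_cpu_usage -= task_cpu_usage

    def current_usage(self) -> float:
        # O(1) instead of O(number of in-flight shuffle tasks).
        return self._total_cpu_usage

tracker = ShuffleUsageTracker()
for _ in range(3):
    tracker.on_task_submitted(1.0)
tracker.on_task_finished(1.0)
assert tracker.current_usage() == 2.0
```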
…project#55525) - Shortening bazel workspace name: `s/com_github_ray_project_ray/io_ray` - Set `--output_base=c:/bzl` instead of `--output_user_root=c:/raytmp`, which saves ~20 characters in the path. --------- Signed-off-by: Edward Oakes <[email protected]>
Currently, the test is allocated only 100 MB, while it's trying to store 2 GB worth of data in the Object Store.

Signed-off-by: Alexey Kudinkin <[email protected]>
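For context, the object store capacity of a local test cluster can be set explicitly via `ray.init`; a minimal sketch (the 2 GB figure simply mirrors the data size mentioned above, the exact value used by the test lives in the PR):

```python
import ray

# object_store_memory is specified in bytes; reserve ~2 GB so the test's data fits.
ray.init(object_store_memory=2 * 1024**3)
```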
Removing raylet_ip_address as it was never used (so it defaulted to node_ip_address) and is only used in private APIs. Originally introduced here: https://github.com/ray-project/ray/pull/7985/files

Signed-off-by: joshlee <[email protected]>
Currently the following pattern throws many lint errors, as `ActorDemoRay.options(name="demo_ray")` returns an instance of `ActorOptionWrapper`, which messes with the IDE's static type checker:

```python
import ray
from ray import ObjectRef
from ray.actor import ActorProxy, ActorClass


class DemoRay:
    def __init__(self, init: int):
        self.init = init

    @ray.method
    def calculate(self, v1: int, v2: int) -> int:
        return self.init + v1 + v2


ActorDemoRay: ActorClass[DemoRay] = ray.remote(DemoRay)


def main():
    p: ActorProxy[DemoRay] = ActorDemoRay.options(name="demo_ray").remote(1)
    actor: ActorProxy[DemoRay] = ray.get_actor("demo_ray")
    a = actor.calculate.remote(1, 2)
    print(ray.get(a))
    return


if __name__ == "__main__":
    main()
```

This PR changes `ActorClass[T].options(...)` to return a new instance of `ActorClass[T]` instead, allowing IDEs to correctly infer the type of subsequent `.remote(...)` calls. ray-project#54149

Signed-off-by: will.lin <[email protected]>
…oject#55564) These should always be built from the release test pipeline directly. We used to run release tests on postmerge, but we are no longer doing that. Also adds the oss tag for those steps.

Signed-off-by: Lonnie Liu <[email protected]>
…t#55013) Signed-off-by: will.lin <[email protected]> Signed-off-by: Richard Liaw <[email protected]> Co-authored-by: Richard Liaw <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
…#55207) This PR builds off previous efforts to add a `JaxTrainer` and the [ray-tpu package](https://github.com/AI-Hypercomputer/ray-tpu/tree/main) to implement support for a `JaxTrainer` in RayTrain that supports SPMD workloads with TPUs. Support for more types of workloads (i.e. better support for CPU and GPU) can be added incrementally. In order to support SPMD locality-aware scheduling at the TPU slice level, we alter the `WorkerGroup` construction in V2 Ray Train to optionally accept multiple placement groups specs to apply to a range of workers. This enables us to reserve the "TPU head" using a placement group with label selectors, retrieve its unique `ray.io/tpu-slice-name`, and then schedule the remaining workers on that slice in a separate placement group. --------- Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Andrew Sy Kim <[email protected]> Co-authored-by: Andrew Sy Kim <[email protected]>
…etion of the execution (ray-project#55565) A change in 2.48 introduced debouncing that disallows downscaling of the Actor Pool for 30s after the latest upscaling, to give the Actor Pool operator enough time to start utilizing the upscaled actor. However, that affected the ability of the Actor Pool to downscale upon completion of the execution: when the operator completes execution, it should start downscaling immediately. This change addresses that.

Signed-off-by: Alexey Kudinkin <[email protected]>
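A toy sketch of the debounce rule described above, with the completion bypass; the constant and method names are assumptions, not Ray Data's actual autoscaler code:

```python
import time

UPSCALING_DEBOUNCE_S = 30.0

class ActorPoolSketch:
    def __init__(self):
        self._last_upscaled_at = None
        self.execution_completed = False

    def record_upscale(self):
        self._last_upscaled_at = time.monotonic()

    def can_downscale(self) -> bool:
        # Once execution has completed, downscale immediately, ignoring the debounce.
        if self.execution_completed:
            return True
        if self._last_upscaled_at is None:
            return True
        return time.monotonic() - self._last_upscaled_at >= UPSCALING_DEBOUNCE_S

pool = ActorPoolSketch()
pool.record_upscale()
assert not pool.can_downscale()   # still inside the 30s debounce window
pool.execution_completed = True
assert pool.can_downscale()       # completion bypasses the debounce
```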
…egator Agent (ray-project#55529) This PR improves the target HTTP endpoint handling in aggregator_agent.py:
- Merge the address and port into one env var that specifies the target HTTP endpoint
- Set the default value of the endpoint to empty; only when the endpoint is specified do we send the events out to it
- Update corresponding tests

Signed-off-by: Mengjin Yan <[email protected]>
Signed-off-by: myan <[email protected]>
…rt.md` (ray-project#55570) Signed-off-by: win5923 <[email protected]> Signed-off-by: Kai-Hsun Chen <[email protected]> Co-authored-by: Kai-Hsun Chen <[email protected]>
added incorrectly in a past change Signed-off-by: Lonnie Liu <[email protected]>
for easier use on ray cluster hosters like anyscale. Signed-off-by: Lonnie Liu <[email protected]>
…eed (ray-project#55076) This adds a call `ray.experimental.wait_tensor_freed` that allows user code to check when a tensor that it put into Ray's GPU object store has been freed. Unlike the normal Ray object store, the GPU object store is just a Python data structure on the actor, which allows us to avoid copying. This means that the actor can keep a reference to an object in its store. The API call allows the actor to check when the object has been freed from the store, so that it can safely write to the tensor again. Closes ray-project#52341. --------- Signed-off-by: Stephanie wang <[email protected]> Signed-off-by: Stephanie Wang <[email protected]> Co-authored-by: Kai-Hsun Chen <[email protected]>
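Since the real `ray.experimental.wait_tensor_freed` signature isn't shown here, the following is a conceptual stand-in only: it mimics the semantics described above (the store holds the tensor by reference, and the producer waits for a "freed" signal before reusing the buffer) without claiming to be the actual Ray API:

```python
import threading

class ToyGpuObjectStore:
    """Stand-in: stores tensors by reference and signals when they are freed."""

    def __init__(self):
        self._freed_events = {}

    def add(self, obj_id, tensor):
        # Store by reference (no copy) and create an event to signal when freed.
        self._freed_events[obj_id] = threading.Event()
        return tensor

    def free(self, obj_id):
        self._freed_events[obj_id].set()

    def wait_tensor_freed(self, obj_id, timeout=None):
        return self._freed_events[obj_id].wait(timeout)

store = ToyGpuObjectStore()
buf = bytearray(16)               # stand-in for a GPU tensor
store.add("obj-1", buf)
store.free("obj-1")               # consumer is done with the tensor
assert store.wait_tensor_freed("obj-1", timeout=1.0)
buf[0] = 1                        # now safe to overwrite the buffer
```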
in addition to current tags. first step to migrate to use rayci build id tags to stop release test jobs from cross-talking to each other Signed-off-by: Lonnie Liu <[email protected]>
…t#55599) Signed-off-by: Mengjin Yan <[email protected]>
…es (ray-project#55355) This PR is a revert of ray-project#55333 and resolves conflicts with ray-project#55163.

Original description: Some frequently used metadata fields are missing in the export API schema. For both dataset and operator: state, execution start and end time. These fields are important for observing the lifecycle of the datasets and operators, and can be used to improve the accuracy of reported metrics, such as throughput, which relies on the duration.

Summary of change:
- Add state, execution start and end time to the export API schema
- Add a new state enum `PENDING` for dataset and operator, to represent the state when they are not running yet
- Refresh the metadata whenever the state of the dataset/operator gets updated; the event will always contain the latest snapshot of all the metadata

Signed-off-by: cong.qian <[email protected]>
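An illustrative-only sketch of a lifecycle state enum with an explicit `PENDING` state plus execution start/end timestamps; the names are assumptions, not the actual Ray Data export schema:

```python
import enum
import time
from dataclasses import dataclass
from typing import Optional

class OperatorState(enum.Enum):
    PENDING = "PENDING"     # created but not running yet
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"

@dataclass
class OperatorMetadata:
    name: str
    state: OperatorState = OperatorState.PENDING
    execution_start_time: Optional[float] = None
    execution_end_time: Optional[float] = None

    def start(self):
        self.state = OperatorState.RUNNING
        self.execution_start_time = time.time()

    def finish(self):
        self.state = OperatorState.FINISHED
        self.execution_end_time = time.time()

op = OperatorMetadata(name="ReadParquet")
op.start()
op.finish()
duration = op.execution_end_time - op.execution_start_time  # basis for throughput metrics
```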
…oject#55218) Signed-off-by: Seiji Eicher <[email protected]> Signed-off-by: Seiji Eicher <[email protected]> Co-authored-by: angelinalg <[email protected]>
- skipping test task processor for windows to unblock Signed-off-by: harshit <[email protected]>
…at (ray-project#55588) Signed-off-by: Seiji Eicher <[email protected]>
…ale images (ray-project#55580) rather than using the hard-coded filename Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: avigyabb <[email protected]>
Signed-off-by: avigyabb <[email protected]>
…gyabb/ray into avigyabb/object-ref-send-fix
Why are these changes needed?

Related issue number

Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.